summaryrefslogtreecommitdiff
path: root/arch/x86
AgeCommit message (Collapse)Author
2020-11-10x86/kexec: Use up-to-dated screen_info copy to fill boot paramsKairui Song
[ Upstream commit afc18069a2cb7ead5f86623a5f3d4ad6e21f940d ] kexec_file_load() currently reuses the old boot_params.screen_info, but if drivers have change the hardware state, boot_param.screen_info could contain invalid info. For example, the video type might be no longer VGA, or the frame buffer address might be changed. If the kexec kernel keeps using the old screen_info, kexec'ed kernel may attempt to write to an invalid framebuffer memory region. There are two screen_info instances globally available, boot_params.screen_info and screen_info. Later one is a copy, and is updated by drivers. So let kexec_file_load use the updated copy. [ mingo: Tidied up the changelog. ] Signed-off-by: Kairui Song <kasong@redhat.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20201014092429.1415040-2-kasong@redhat.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-11-10linkage: Introduce new macros for assembler symbolsJiri Slaby
commit ffedeeb780dc554eff3d3b16e6a462a26a41d7ec upstream. Introduce new C macros for annotations of functions and data in assembly. There is a long-standing mess in macros like ENTRY, END, ENDPROC and similar. They are used in different manners and sometimes incorrectly. So introduce macros with clear use to annotate assembly as follows: a) Support macros for the ones below SYM_T_FUNC -- type used by assembler to mark functions SYM_T_OBJECT -- type used by assembler to mark data SYM_T_NONE -- type used by assembler to mark entries of unknown type They are defined as STT_FUNC, STT_OBJECT, and STT_NOTYPE respectively. According to the gas manual, this is the most portable way. I am not sure about other assemblers, so this can be switched back to %function and %object if this turns into a problem. Architectures can also override them by something like ", @function" if they need. SYM_A_ALIGN, SYM_A_NONE -- align the symbol? SYM_L_GLOBAL, SYM_L_WEAK, SYM_L_LOCAL -- linkage of symbols b) Mostly internal annotations, used by the ones below SYM_ENTRY -- use only if you have to (for non-paired symbols) SYM_START -- use only if you have to (for paired symbols) SYM_END -- use only if you have to (for paired symbols) c) Annotations for code SYM_INNER_LABEL_ALIGN -- only for labels in the middle of code SYM_INNER_LABEL -- only for labels in the middle of code SYM_FUNC_START_LOCAL_ALIAS -- use where there are two local names for one function SYM_FUNC_START_ALIAS -- use where there are two global names for one function SYM_FUNC_END_ALIAS -- the end of LOCAL_ALIASed or ALIASed function SYM_FUNC_START -- use for global functions SYM_FUNC_START_NOALIGN -- use for global functions, w/o alignment SYM_FUNC_START_LOCAL -- use for local functions SYM_FUNC_START_LOCAL_NOALIGN -- use for local functions, w/o alignment SYM_FUNC_START_WEAK -- use for weak functions SYM_FUNC_START_WEAK_NOALIGN -- use for weak functions, w/o alignment SYM_FUNC_END -- the end of SYM_FUNC_START_LOCAL, SYM_FUNC_START, SYM_FUNC_START_WEAK, ... For functions with special (non-C) calling conventions: SYM_CODE_START -- use for non-C (special) functions SYM_CODE_START_NOALIGN -- use for non-C (special) functions, w/o alignment SYM_CODE_START_LOCAL -- use for local non-C (special) functions SYM_CODE_START_LOCAL_NOALIGN -- use for local non-C (special) functions, w/o alignment SYM_CODE_END -- the end of SYM_CODE_START_LOCAL or SYM_CODE_START d) For data SYM_DATA_START -- global data symbol SYM_DATA_START_LOCAL -- local data symbol SYM_DATA_END -- the end of the SYM_DATA_START symbol SYM_DATA_END_LABEL -- the labeled end of SYM_DATA_START symbol SYM_DATA -- start+end wrapper around simple global data SYM_DATA_LOCAL -- start+end wrapper around simple local data ========== The macros allow to pair starts and ends of functions and mark functions correctly in the output ELF objects. All users of the old macros in x86 are converted to use these in further patches. Signed-off-by: Jiri Slaby <jslaby@suse.cz> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Juergen Gross <jgross@suse.com> Cc: Len Brown <len.brown@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: linux-arch@vger.kernel.org Cc: linux-doc@vger.kernel.org Cc: linux-kernel@vger.kernel.org Cc: linux-pm@vger.kernel.org Cc: Mark Rutland <mark.rutland@arm.com> Cc: Pavel Machek <pavel@ucw.cz> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will@kernel.org> Cc: x86-ml <x86@kernel.org> Cc: xen-devel@lists.xenproject.org Link: https://lkml.kernel.org/r/20191011115108.12392-2-jslaby@suse.cz Cc: Jian Cai <jiancai@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-11-05perf/x86/amd/ibs: Fix raw sample data accumulationKim Phillips
commit 36e1be8ada994d509538b3b1d0af8b63c351e729 upstream. Neither IbsBrTarget nor OPDATA4 are populated in IBS Fetch mode. Don't accumulate them into raw sample user data in that case. Also, in Fetch mode, add saving the IBS Fetch Control Extended MSR. Technically, there is an ABI change here with respect to the IBS raw sample data format, but I don't see any perf driver version information being included in perf.data file headers, but, existing users can detect whether the size of the sample record has reduced by 8 bytes to determine whether the IBS driver has this fix. Fixes: 904cb3677f3a ("perf/x86/amd/ibs: Update IBS MSRs and feature definitions") Reported-by: Stephane Eranian <stephane.eranian@google.com> Signed-off-by: Kim Phillips <kim.phillips@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20200908214740.18097-6-kim.phillips@amd.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-11-05perf/x86/amd/ibs: Don't include randomized bits in get_ibs_op_count()Kim Phillips
commit 680d69635005ba0e58fe3f4c52fc162b8fc743b0 upstream. get_ibs_op_count() adds hardware's current count (IbsOpCurCnt) bits to its count regardless of hardware's valid status. According to the PPR for AMD Family 17h Model 31h B0 55803 Rev 0.54, if the counter rolls over, valid status is set, and the lower 7 bits of IbsOpCurCnt are randomized by hardware. Don't include those bits in the driver's event count. Fixes: 8b1e13638d46 ("perf/x86-ibs: Fix usage of IBS op current count") Signed-off-by: Kim Phillips <kim.phillips@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@vger.kernel.org Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-11-05perf/x86/intel: Fix Ice Lake event constraint tableKan Liang
commit 010cb00265f150bf82b23c02ad1fb87ce5c781e1 upstream. An error occues when sampling non-PEBS INST_RETIRED.PREC_DIST(0x01c0) event. perf record -e cpu/event=0xc0,umask=0x01/ -- sleep 1 Error: The sys_perf_event_open() syscall returned with 22 (Invalid argument) for event (cpu/event=0xc0,umask=0x01/). /bin/dmesg | grep -i perf may provide additional information. The idxmsk64 of the event is set to 0. The event never be successfully scheduled. The event should be limit to the fixed counter 0. Fixes: 6017608936c1 ("perf/x86/intel: Add Icelake support") Reported-by: Yi, Ammy <ammy.yi@intel.com> Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20200928134726.13090-1-kan.liang@linux.intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-11-05x86/unwind/orc: Fix inactive tasks with stack pointer in %sp on GCC 10 ↵Jiri Slaby
compiled kernels [ Upstream commit f2ac57a4c49d40409c21c82d23b5706df9b438af ] GCC 10 optimizes the scheduler code differently than its predecessors. When CONFIG_DEBUG_SECTION_MISMATCH=y, the Makefile forces GCC not to inline some functions (-fno-inline-functions-called-once). Before GCC 10, "no-inlined" __schedule() starts with the usual prologue: push %bp mov %sp, %bp So the ORC unwinder simply picks stack pointer from %bp and unwinds from __schedule() just perfectly: $ cat /proc/1/stack [<0>] ep_poll+0x3e9/0x450 [<0>] do_epoll_wait+0xaa/0xc0 [<0>] __x64_sys_epoll_wait+0x1a/0x20 [<0>] do_syscall_64+0x33/0x40 [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9 But now, with GCC 10, there is no %bp prologue in __schedule(): $ cat /proc/1/stack <nothing> The ORC entry of the point in __schedule() is: sp:sp+88 bp:last_sp-48 type:call end:0 In this case, nobody subtracts sizeof "struct inactive_task_frame" in __unwind_start(). The struct is put on the stack by __switch_to_asm() and only then __switch_to_asm() stores %sp to task->thread.sp. But we start unwinding from a point in __schedule() (stored in frame->ret_addr by 'call') and not in __switch_to_asm(). So for these example values in __unwind_start(): sp=ffff94b50001fdc8 bp=ffff8e1f41d29340 ip=__schedule+0x1f0 The stack is: ffff94b50001fdc8: ffff8e1f41578000 # struct inactive_task_frame ffff94b50001fdd0: 0000000000000000 ffff94b50001fdd8: ffff8e1f41d29340 ffff94b50001fde0: ffff8e1f41611d40 # ... ffff94b50001fde8: ffffffff93c41920 # bx ffff94b50001fdf0: ffff8e1f41d29340 # bp ffff94b50001fdf8: ffffffff9376cad0 # ret_addr (and end of the struct) 0xffffffff9376cad0 is __schedule+0x1f0 (after the call to __switch_to_asm). Now follow those 88 bytes from the ORC entry (sp+88). The entry is correct, __schedule() really pushes 48 bytes (8*7) + 32 bytes via subq to store some local values (like 4U below). So to unwind, look at the offset 88-sizeof(long) = 0x50 from here: ffff94b50001fe00: ffff8e1f41578618 ffff94b50001fe08: 00000cc000000255 ffff94b50001fe10: 0000000500000004 ffff94b50001fe18: 7793fab6956b2d00 # NOTE (see below) ffff94b50001fe20: ffff8e1f41578000 ffff94b50001fe28: ffff8e1f41578000 ffff94b50001fe30: ffff8e1f41578000 ffff94b50001fe38: ffff8e1f41578000 ffff94b50001fe40: ffff94b50001fed8 ffff94b50001fe48: ffff8e1f41577ff0 ffff94b50001fe50: ffffffff9376cf12 Here ^^^^^^^^^^^^^^^^ is the correct ret addr from __schedule(). It translates to schedule+0x42 (insn after a call to __schedule()). BUT, unwind_next_frame() tries to take the address starting from 0xffff94b50001fdc8. That is exactly from thread.sp+88-sizeof(long) = 0xffff94b50001fdc8+88-8 = 0xffff94b50001fe18, which is garbage marked as NOTE above. So this quits the unwinding as 7793fab6956b2d00 is obviously not a kernel address. There was a fix to skip 'struct inactive_task_frame' in unwind_get_return_address_ptr in the following commit: 187b96db5ca7 ("x86/unwind/orc: Fix unwind_get_return_address_ptr() for inactive tasks") But we need to skip the struct already in the unwinder proper. So subtract the size (increase the stack pointer) of the structure in __unwind_start() directly. This allows for removal of the code added by commit 187b96db5ca7 completely, as the address is now at '(unsigned long *)state->sp - 1', the same as in the generic case. [ mingo: Cleaned up the changelog a bit, for better readability. ] Fixes: ee9f8fce9964 ("x86/unwind: Add the ORC unwinder") Bug: https://bugzilla.suse.com/show_bug.cgi?id=1176907 Signed-off-by: Jiri Slaby <jslaby@suse.cz> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20201014053051.24199-1-jslaby@suse.cz Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-11-01crypto: x86/crc32c - fix building with clang iasArnd Bergmann
commit 44623b2818f4a442726639572f44fd9b6d0ef68c upstream. The clang integrated assembler complains about movzxw: arch/x86/crypto/crc32c-pcl-intel-asm_64.S:173:2: error: invalid instruction mnemonic 'movzxw' It seems that movzwq is the mnemonic that it expects instead, and this is what objdump prints when disassembling the file. Fixes: 6a8ce1ef3940 ("crypto: crc32c - Optimize CRC32C calculation with PCLMULQDQ instruction") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Nathan Chancellor <natechancellor@gmail.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> [jc: Fixed conflicts due to lack of 34fdce6981b9 ("x86: Change {JMP,CALL}_NOSPEC argument")] Signed-off-by: Jian Cai <jiancai@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-11-01x86/xen: disable Firmware First mode for correctable memory errorsJuergen Gross
commit d759af38572f97321112a0852353613d18126038 upstream. When running as Xen dom0 the kernel isn't responsible for selecting the error handling mode, this should be handled by the hypervisor. So disable setting FF mode when running as Xen pv guest. Not doing so might result in boot splats like: [ 7.509696] HEST: Enabling Firmware First mode for corrected errors. [ 7.510382] mce: [Firmware Bug]: Ignoring request to disable invalid MCA bank 2. [ 7.510383] mce: [Firmware Bug]: Ignoring request to disable invalid MCA bank 3. [ 7.510384] mce: [Firmware Bug]: Ignoring request to disable invalid MCA bank 4. [ 7.510384] mce: [Firmware Bug]: Ignoring request to disable invalid MCA bank 5. [ 7.510385] mce: [Firmware Bug]: Ignoring request to disable invalid MCA bank 6. [ 7.510386] mce: [Firmware Bug]: Ignoring request to disable invalid MCA bank 7. [ 7.510386] mce: [Firmware Bug]: Ignoring request to disable invalid MCA bank 8. Reason is that the HEST ACPI table contains the real number of MCA banks, while the hypervisor is emulating only 2 banks for guests. Cc: stable@vger.kernel.org Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Link: https://lore.kernel.org/r/20200925140751.31381-1-jgross@suse.com Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-11-01arch/x86/amd/ibs: Fix re-arming IBS FetchKim Phillips
commit 221bfce5ebbdf72ff08b3bf2510ae81058ee568b upstream. Stephane Eranian found a bug in that IBS' current Fetch counter was not being reset when the driver would write the new value to clear it along with the enable bit set, and found that adding an MSR write that would first disable IBS Fetch would make IBS Fetch reset its current count. Indeed, the PPR for AMD Family 17h Model 31h B0 55803 Rev 0.54 - Sep 12, 2019 states "The periodic fetch counter is set to IbsFetchCnt [...] when IbsFetchEn is changed from 0 to 1." Explicitly set IbsFetchEn to 0 and then to 1 when re-enabling IBS Fetch, so the driver properly resets the internal counter to 0 and IBS Fetch starts counting again. A family 15h machine tested does not have this problem, and the extra wrmsr is also not needed on Family 19h, so only do the extra wrmsr on families 16h through 18h. Reported-by: Stephane Eranian <stephane.eranian@google.com> Signed-off-by: Kim Phillips <kim.phillips@amd.com> [peterz: optimized] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@vger.kernel.org Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-11-01x86/PCI: Fix intel_mid_pci.c build error when ACPI is not enabledRandy Dunlap
commit 035fff1f7aab43e420e0098f0854470a5286fb83 upstream. Fix build error when CONFIG_ACPI is not set/enabled by adding the header file <asm/acpi.h> which contains a stub for the function in the build error. ../arch/x86/pci/intel_mid_pci.c: In function ‘intel_mid_pci_init’: ../arch/x86/pci/intel_mid_pci.c:303:2: error: implicit declaration of function ‘acpi_noirq_set’; did you mean ‘acpi_irq_get’? [-Werror=implicit-function-declaration] acpi_noirq_set(); Fixes: a912a7584ec3 ("x86/platform/intel-mid: Move PCI initialization to arch_init()") Link: https://lore.kernel.org/r/ea903917-e51b-4cc9-2680-bc1e36efa026@infradead.org Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com> Reviewed-by: Jesse Barnes <jsbarnes@google.com> Acked-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org # v4.16+ Cc: Jacob Pan <jacob.jun.pan@linux.intel.com> Cc: Len Brown <lenb@kernel.org> Cc: Jesse Barnes <jsbarnes@google.com> Cc: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-10-29x86/mce: Make mce_rdmsrl() panic on an inaccessible MSRBorislav Petkov
[ Upstream commit e2def7d49d0812ea40a224161b2001b2e815dce2 ] If an exception needs to be handled while reading an MSR - which is in most of the cases caused by a #GP on a non-existent MSR - then this is most likely the incarnation of a BIOS or a hardware bug. Such bug violates the architectural guarantee that MCA banks are present with all MSRs belonging to them. The proper fix belongs in the hardware/firmware - not in the kernel. Handling an #MC exception which is raised while an NMI is being handled would cause the nasty NMI nesting issue because of the shortcoming of IRET of reenabling NMIs when executed. And the machine is in an #MC context already so <Deity> be at its side. Tracing MSR accesses while in #MC is another no-no due to tracing being inherently a bad idea in atomic context: vmlinux.o: warning: objtool: do_machine_check()+0x4a: call to mce_rdmsrl() leaves .noinstr.text section so remove all that "additional" functionality from mce_rdmsrl() and provide it with a special exception handler which panics the machine when that MSR is not accessible. The exception handler prints a human-readable message explaining what the panic reason is but, what is more, it panics while in the #GP handler and latter won't have executed an IRET, thus opening the NMI nesting issue in the case when the #MC has happened while handling an NMI. (#MC itself won't be reenabled until MCG_STATUS hasn't been cleared). Suggested-by: Andy Lutomirski <luto@kernel.org> Suggested-by: Peter Zijlstra <peterz@infradead.org> [ Add missing prototypes for ex_handler_* ] Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Tony Luck <tony.luck@intel.com> Link: https://lkml.kernel.org/r/20200906212130.GA28456@zn.tnic Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-29x86/mce: Add Skylake quirk for patrol scrub reported errorsBorislav Petkov
[ Upstream commit fd258dc4442c5c1c069c6b5b42bfe7d10cddda95 ] The patrol scrubber in Skylake and Cascade Lake systems can be configured to report uncorrected errors using a special signature in the machine check bank and to signal using CMCI instead of machine check. Update the severity calculation mechanism to allow specifying the model, minimum stepping and range of machine check bank numbers. Add a new rule to detect the special signature (on model 0x55, stepping >=4 in any of the memory controller banks). [ bp: Rewrite it. aegl: Productize it. ] Suggested-by: Youquan Song <youquan.song@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Co-developed-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20200930021313.31810-2-tony.luck@intel.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-29x86/asm: Replace __force_order with a memory clobberArvind Sankar
[ Upstream commit aa5cacdc29d76a005cbbee018a47faa6e724dd2d ] The CRn accessor functions use __force_order as a dummy operand to prevent the compiler from reordering CRn reads/writes with respect to each other. The fact that the asm is volatile should be enough to prevent this: volatile asm statements should be executed in program order. However GCC 4.9.x and 5.x have a bug that might result in reordering. This was fixed in 8.1, 7.3 and 6.5. Versions prior to these, including 5.x and 4.9.x, may reorder volatile asm statements with respect to each other. There are some issues with __force_order as implemented: - It is used only as an input operand for the write functions, and hence doesn't do anything additional to prevent reordering writes. - It allows memory accesses to be cached/reordered across write functions, but CRn writes affect the semantics of memory accesses, so this could be dangerous. - __force_order is not actually defined in the kernel proper, but the LLVM toolchain can in some cases require a definition: LLVM (as well as GCC 4.9) requires it for PIE code, which is why the compressed kernel has a definition, but also the clang integrated assembler may consider the address of __force_order to be significant, resulting in a reference that requires a definition. Fix this by: - Using a memory clobber for the write functions to additionally prevent caching/reordering memory accesses across CRn writes. - Using a dummy input operand with an arbitrary constant address for the read functions, instead of a global variable. This will prevent reads from being reordered across writes, while allowing memory loads to be cached/reordered across CRn reads, which should be safe. Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com> Tested-by: Nathan Chancellor <natechancellor@gmail.com> Tested-by: Sedat Dilek <sedat.dilek@gmail.com> Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82602 Link: https://lore.kernel.org/lkml/20200527135329.1172644-1-arnd@arndb.de/ Link: https://lkml.kernel.org/r/20200902232152.3709896-1-nivedita@alum.mit.edu Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-29KVM: x86: emulating RDPID failure shall return #UD rather than #GPRobert Hoo
[ Upstream commit a9e2e0ae686094571378c72d8146b5a1a92d0652 ] Per Intel's SDM, RDPID takes a #UD if it is unsupported, which is more or less what KVM is emulating when MSR_TSC_AUX is not available. In fact, there are no scenarios in which RDPID is supposed to #GP. Fixes: fb6d4d340e ("KVM: x86: emulate RDPID") Signed-off-by: Robert Hoo <robert.hu@linux.intel.com> Message-Id: <1598581422-76264-1-git-send-email-robert.hu@linux.intel.com> Reviewed-by: Jim Mattson <jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-29x86/events/amd/iommu: Fix sizeof mismatchColin Ian King
[ Upstream commit 59d5396a4666195f89a67e118e9e627ddd6f53a1 ] An incorrect sizeof is being used, struct attribute ** is not correct, it should be struct attribute *. Note that since ** is the same size as * this is not causing any issues. Improve this fix by using sizeof(*attrs) as this allows us to not even reference the type of the pointer. Addresses-Coverity: ("Sizeof not portable (SIZEOF_MISMATCH)") Fixes: 51686546304f ("x86/events/amd/iommu: Fix sysfs perf attribute groups") Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20201001113900.58889-1-colin.king@canonical.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-29x86/nmi: Fix nmi_handle() duration miscalculationLibing Zhou
[ Upstream commit f94c91f7ba3ba7de2bc8aa31be28e1abb22f849e ] When nmi_check_duration() is checking the time an NMI handler took to execute, the whole_msecs value used should be read from the @duration argument, not from the ->max_duration, the latter being used to store the current maximal duration. [ bp: Rewrite commit message. ] Fixes: 248ed51048c4 ("x86/nmi: Remove irq_work from the long duration NMI handler") Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Libing Zhou <libing.zhou@nokia-sbell.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Changbin Du <changbin.du@gmail.com> Link: https://lkml.kernel.org/r/20200820025641.44075-1-libing.zhou@nokia-sbell.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-29perf/x86/intel/uncore: Reduce the number of CBOX countersKan Liang
[ Upstream commit ee139385432e919f4d1f59b80edbc073cdae1391 ] An oops is triggered by the fuzzy test. [ 327.853081] unchecked MSR access error: RDMSR from 0x70c at rIP: 0xffffffffc082c820 (uncore_msr_read_counter+0x10/0x50 [intel_uncore]) [ 327.853083] Call Trace: [ 327.853085] <IRQ> [ 327.853089] uncore_pmu_event_start+0x85/0x170 [intel_uncore] [ 327.853093] uncore_pmu_event_add+0x1a4/0x410 [intel_uncore] [ 327.853097] ? event_sched_in.isra.118+0xca/0x240 There are 2 GP counters for each CBOX, but the current code claims 4 counters. Accessing the invalid registers triggers the oops. Fixes: 6e394376ee89 ("perf/x86/intel/uncore: Add Intel Icelake uncore support") Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200925134905.8839-3-kan.liang@linux.intel.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-29perf/x86/intel/uncore: Update Ice Lake uncore unitsKan Liang
[ Upstream commit 8f5d41f3a0f495435c88ebba8fc150c931c10fef ] There are some updates for the Icelake model specific uncore performance monitors. (The update can be found at 10th generation intel core processors families specification update Revision 004, ICL068) 1) Counter 0 of ARB uncore unit is not available for software use 2) The global 'enable bit' (bit 29) and 'freeze bit' (bit 31) of MSR_UNC_PERF_GLOBAL_CTRL cannot be used to control counter behavior. Needs to use local enable in event select MSR. Accessing the modified bit/registers will be ignored by HW. Users may observe inaccurate results with the current code. The changes of the MSR_UNC_PERF_GLOBAL_CTRL imply that groups cannot be read atomically anymore. Although the error of the result for a group becomes a bit bigger, it still far lower than not using a group. The group support is still kept. Only Remove the *_box() related implementation. Since the counter 0 of ARB uncore unit is not available, update the MSR address for the ARB uncore unit. There is no change for IMC uncore unit, which only include free-running counters. Fixes: 6e394376ee89 ("perf/x86/intel/uncore: Add Intel Icelake uncore support") Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200925134905.8839-2-kan.liang@linux.intel.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-29x86/fpu: Allow multiple bits in clearcpuid= parameterArvind Sankar
[ Upstream commit 0a4bb5e5507a585532cc413125b921c8546fc39f ] Commit 0c2a3913d6f5 ("x86/fpu: Parse clearcpuid= as early XSAVE argument") changed clearcpuid parsing from __setup() to cmdline_find_option(). While the __setup() function would have been called for each clearcpuid= parameter on the command line, cmdline_find_option() will only return the last one, so the change effectively made it impossible to disable more than one bit. Allow a comma-separated list of bit numbers as the argument for clearcpuid to allow multiple bits to be disabled again. Log the bits being disabled for informational purposes. Also fix the check on the return value of cmdline_find_option(). It returns -1 when the option is not found, so testing as a boolean is incorrect. Fixes: 0c2a3913d6f5 ("x86/fpu: Parse clearcpuid= as early XSAVE argument") Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20200907213919.2423441-1-nivedita@alum.mit.edu Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-29perf/x86/intel/ds: Fix x86_pmu_stop warning for large PEBSKan Liang
[ Upstream commit 35d1ce6bec133679ff16325d335217f108b84871 ] A warning as below may be triggered when sampling with large PEBS. [ 410.411250] perf: interrupt took too long (72145 > 71975), lowering kernel.perf_event_max_sample_rate to 2000 [ 410.724923] ------------[ cut here ]------------ [ 410.729822] WARNING: CPU: 0 PID: 16397 at arch/x86/events/core.c:1422 x86_pmu_stop+0x95/0xa0 [ 410.933811] x86_pmu_del+0x50/0x150 [ 410.937304] event_sched_out.isra.0+0xbc/0x210 [ 410.941751] group_sched_out.part.0+0x53/0xd0 [ 410.946111] ctx_sched_out+0x193/0x270 [ 410.949862] __perf_event_task_sched_out+0x32c/0x890 [ 410.954827] ? set_next_entity+0x98/0x2d0 [ 410.958841] __schedule+0x592/0x9c0 [ 410.962332] schedule+0x5f/0xd0 [ 410.965477] exit_to_usermode_loop+0x73/0x120 [ 410.969837] prepare_exit_to_usermode+0xcd/0xf0 [ 410.974369] ret_from_intr+0x2a/0x3a [ 410.977946] RIP: 0033:0x40123c [ 411.079661] ---[ end trace bc83adaea7bb664a ]--- In the non-overflow context, e.g., context switch, with large PEBS, perf may stop an event twice. An example is below. //max_samples_per_tick is adjusted to 2 //NMI is triggered intel_pmu_handle_irq() handle_pmi_common() drain_pebs() __intel_pmu_pebs_event() perf_event_overflow() __perf_event_account_interrupt() hwc->interrupts = 1 return 0 //A context switch happens right after the NMI. //In the same tick, the perf_throttled_seq is not changed. perf_event_task_sched_out() perf_pmu_sched_task() intel_pmu_drain_pebs_buffer() __intel_pmu_pebs_event() perf_event_overflow() __perf_event_account_interrupt() ++hwc->interrupts >= max_samples_per_tick return 1 x86_pmu_stop(); # First stop perf_event_context_sched_out() task_ctx_sched_out() ctx_sched_out() event_sched_out() x86_pmu_del() x86_pmu_stop(); # Second stop and trigger the warning Perf should only invoke the perf_event_overflow() in the overflow context. Current drain_pebs() is called from: - handle_pmi_common() -- overflow context - intel_pmu_pebs_sched_task() -- non-overflow context - intel_pmu_pebs_disable() -- non-overflow context - intel_pmu_auto_reload_read() -- possible overflow context With PERF_SAMPLE_READ + PERF_FORMAT_GROUP, the function may be invoked in the NMI handler. But, before calling the function, the PEBS buffer has already been drained. The __intel_pmu_pebs_event() will not be called in the possible overflow context. To fix the issue, an indicator is required to distinguish between the overflow context aka handle_pmi_common() and other cases. The dummy regs pointer can be used as the indicator. In the non-overflow context, perf should treat the last record the same as other PEBS records, and doesn't invoke the generic overflow handler. Fixes: 21509084f999 ("perf/x86/intel: Handle multiple records in the PEBS buffer") Reported-by: Like Xu <like.xu@linux.intel.com> Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Like Xu <like.xu@linux.intel.com> Link: https://lkml.kernel.org/r/20200902210649.2743-1-kan.liang@linux.intel.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-29KVM: SVM: Initialize prev_ga_tag before useSuravee Suthikulpanit
commit f6426ab9c957e97418ac5b0466538792767b1738 upstream. The function amd_ir_set_vcpu_affinity makes use of the parameter struct amd_iommu_pi_data.prev_ga_tag to determine if it should delete struct amd_iommu_pi_data from a list when not running in AVIC mode. However, prev_ga_tag is initialized only when AVIC is enabled. The non-zero uninitialized value can cause unintended code path, which ends up making use of the struct vcpu_svm.ir_list and ir_list_lock without being initialized (since they are intended only for the AVIC case). This triggers NULL pointer dereference bug in the function vm_ir_list_del with the following call trace: svm_update_pi_irte+0x3c2/0x550 [kvm_amd] ? proc_create_single_data+0x41/0x50 kvm_arch_irq_bypass_add_producer+0x40/0x60 [kvm] __connect+0x5f/0xb0 [irqbypass] irq_bypass_register_producer+0xf8/0x120 [irqbypass] vfio_msi_set_vector_signal+0x1de/0x2d0 [vfio_pci] vfio_msi_set_block+0x77/0xe0 [vfio_pci] vfio_pci_set_msi_trigger+0x25c/0x2f0 [vfio_pci] vfio_pci_set_irqs_ioctl+0x88/0xb0 [vfio_pci] vfio_pci_ioctl+0x2ea/0xed0 [vfio_pci] ? alloc_file_pseudo+0xa5/0x100 vfio_device_fops_unl_ioctl+0x26/0x30 [vfio] ? vfio_device_fops_unl_ioctl+0x26/0x30 [vfio] __x64_sys_ioctl+0x96/0xd0 do_syscall_64+0x37/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Therefore, initialize prev_ga_tag to zero before use. This should be safe because ga_tag value 0 is invalid (see function avic_vm_init). Fixes: dfa20099e26e ("KVM: SVM: Refactor AVIC vcpu initialization into avic_init_vcpu()") Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com> Message-Id: <20201003232707.4662-1-suravee.suthikulpanit@amd.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-10-29KVM: x86/mmu: Commit zap of remaining invalid pages when recovering lpagesSean Christopherson
commit e89505698c9f70125651060547da4ff5046124fc upstream. Call kvm_mmu_commit_zap_page() after exiting the "prepare zap" loop in kvm_recover_nx_lpages() to finish zapping pages in the unlikely event that the loop exited due to lpage_disallowed_mmu_pages being empty. Because the recovery thread drops mmu_lock() when rescheduling, it's possible that lpage_disallowed_mmu_pages could be emptied by a different thread without to_zap reaching zero despite to_zap being derived from the number of disallowed lpages. Fixes: 1aa9b9572b105 ("kvm: x86: mmu: Recovery of shattered NX large pages") Cc: Junaid Shahid <junaids@google.com> Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200923183735.584-2-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-10-29KVM: nVMX: Reload vmcs01 if getting vmcs12's pages failsSean Christopherson
commit b89d5ad00e789967a5e2c5335f75c48755bebd88 upstream. Reload vmcs01 when bailing from nested_vmx_enter_non_root_mode() as KVM expects vmcs01 to be loaded when is_guest_mode() is false. Fixes: 671ddc700fd08 ("KVM: nVMX: Don't leak L1 MMIO regions to L2") Cc: stable@vger.kernel.org Cc: Dan Cross <dcross@google.com> Cc: Jim Mattson <jmattson@google.com> Cc: Peter Shier <pshier@google.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200923184452.980-3-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-10-29KVM: nVMX: Reset the segment cache when stuffing guest segsSean Christopherson
commit fc387d8daf3960c5e1bc18fa353768056f4fd394 upstream. Explicitly reset the segment cache after stuffing guest segment regs in prepare_vmcs02_rare(). Although the cache is reset when switching to vmcs02, there is nothing that prevents KVM from re-populating the cache prior to writing vmcs02 with vmcs12's values. E.g. if the vCPU is preempted after switching to vmcs02 but before prepare_vmcs02_rare(), kvm_arch_vcpu_put() will dereference GUEST_SS_AR_BYTES via .get_cpl() and cache the stale vmcs02 value. While the current code base only caches stale data in the preemption case, it's theoretically possible future code could read a segment register during the nested flow itself, i.e. this isn't technically illegal behavior in kvm_arch_vcpu_put(), although it did introduce the bug. This manifests as an unexpected nested VM-Enter failure when running with unrestricted guest disabled if the above preemption case coincides with L1 switching L2's CPL, e.g. when switching from a L2 vCPU at CPL3 to to a L2 vCPU at CPL0. stack_segment_valid() will see the new SS_SEL but the old SS_AR_BYTES and incorrectly mark the guest state as invalid due to SS.dpl != SS.rpl. Don't bother updating the cache even though prepare_vmcs02_rare() writes every segment. With unrestricted guest, guest segments are almost never read, let alone L2 guest segments. On the other hand, populating the cache requires a large number of memory writes, i.e. it's unlikely to be a net win. Updating the cache would be a win when unrestricted guest is not supported, as guest_state_valid() will immediately cache all segment registers. But, nested virtualization without unrestricted guest is dirt slow, saving some VMREADs won't change that, and every CPU manufactured in the last decade supports unrestricted guest. In other words, the extra (minor) complexity isn't worth the trouble. Note, kvm_arch_vcpu_put() may see stale data when querying guest CPL depending on when preemption occurs. This is "ok" in that the usage is imperfect by nature, i.e. it's used heuristically to improve performance but doesn't affect functionality. kvm_arch_vcpu_put() could be "fixed" by also disabling preemption while loading segments, but that's pointless and misleading as reading state from kvm_sched_{in,out}() is guaranteed to see stale data in one form or another. E.g. even if all the usage of regs_avail is fixed to call kvm_register_mark_available() after the associated state is set, the individual state might still be stale with respect to the overall vCPU state. I.e. making functional decisions in an asynchronous hook is doomed from the get go. Thankfully KVM doesn't do that. Fixes: de63ad4cf4973 ("KVM: X86: implement the logic for spinlock optimization") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200923184452.980-2-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-10-01x86/ioapic: Unbreak check_timer()Thomas Gleixner
commit 86a82ae0b5095ea24c55898a3f025791e7958b21 upstream. Several people reported in the kernel bugzilla that between v4.12 and v4.13 the magic which works around broken hardware and BIOSes to find the proper timer interrupt delivery mode stopped working for some older affected platforms which need to fall back to ExtINT delivery mode. The reason is that the core code changed to keep track of the masked and disabled state of an interrupt line more accurately to avoid the expensive hardware operations. That broke an assumption in i8259_make_irq() which invokes disable_irq_nosync(); irq_set_chip_and_handler(); enable_irq(); Up to v4.12 this worked because enable_irq() unconditionally unmasked the interrupt line, but after the state tracking improvements this is not longer the case because the IO/APIC uses lazy disabling. So the line state is unmasked which means that enable_irq() does not call into the new irq chip to unmask it. In principle this is a shortcoming of the core code, but it's more than unclear whether the core code should try to reset state. At least this cannot be done unconditionally as that would break other existing use cases where the chip type is changed, e.g. when changing the trigger type, but the callers expect the state to be preserved. As the way how check_timer() is switching the delivery modes is truly unique, the obvious fix is to simply unmask the i8259 manually after changing the mode to ExtINT delivery and switching the irq chip to the legacy PIC. Note, that the fixes tag is not really precise, but identifies the commit which broke the assumptions in the IO/APIC and i8259 code and that's the kernel version to which this needs to be backported. Fixes: bf22ff45bed6 ("genirq: Avoid unnecessary low level irq function calls") Reported-by: p_c_chan@hotmail.com Reported-by: ecm4@mail.com Reported-by: perdigao1@yahoo.com Reported-by: matzes@users.sourceforge.net Reported-by: rvelascog@gmail.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: p_c_chan@hotmail.com Tested-by: matzes@users.sourceforge.net Cc: stable@vger.kernel.org Link: https://bugzilla.kernel.org/show_bug.cgi?id=197769 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-10-01arch/x86/lib/usercopy_64.c: fix __copy_user_flushcache() cache writebackMikulas Patocka
commit a1cd6c2ae47ee10ff21e62475685d5b399e2ed4a upstream. If we copy less than 8 bytes and if the destination crosses a cache line, __copy_user_flushcache would invalidate only the first cache line. This patch makes it invalidate the second cache line as well. Fixes: 0aed55af88345b ("x86, uaccess: introduce copy_from_iter_flushcache for pmem / cache-bypass operations") Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Dan Williams <dan.j.wiilliams@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Toshi Kani <toshi.kani@hpe.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Matthew Wilcox <mawilcox@microsoft.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: <stable@vger.kernel.org> Link: https://lkml.kernel.org/r/alpine.LRH.2.02.2009161451140.21915@file01.intranet.prod.int.rdu2.redhat.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-10-01KVM: SVM: Add a dedicated INVD intercept routineTom Lendacky
[ Upstream commit 4bb05f30483fd21ea5413eaf1182768f251cf625 ] The INVD instruction intercept performs emulation. Emulation can't be done on an SEV guest because the guest memory is encrypted. Provide a dedicated intercept routine for the INVD intercept. And since the instruction is emulated as a NOP, just skip it instead. Fixes: 1654efcbc431 ("KVM: SVM: Add KVM_SEV_INIT command") Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Message-Id: <a0b9a19ffa7fef86a3cc700c7ea01cb2731e04e5.1600972918.git.thomas.lendacky@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-01KVM: x86: Reset MMU context if guest toggles CR4.SMAP or CR4.PKESean Christopherson
[ Upstream commit 8d214c481611b29458a57913bd786f0ac06f0605 ] Reset the MMU context during kvm_set_cr4() if SMAP or PKE is toggled. Recent commits to (correctly) not reload PDPTRs when SMAP/PKE are toggled inadvertantly skipped the MMU context reset due to the mask of bits that triggers PDPTR loads also being used to trigger MMU context resets. Fixes: 427890aff855 ("kvm: x86: Toggling CR4.SMAP does not load PDPTEs in PAE mode") Fixes: cb957adb4ea4 ("kvm: x86: Toggling CR4.PKE does not load PDPTEs in PAE mode") Cc: Jim Mattson <jmattson@google.com> Cc: Peter Shier <pshier@google.com> Cc: Oliver Upton <oupton@google.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200923215352.17756-1-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-01x86/speculation/mds: Mark mds_user_clear_cpu_buffers() __always_inlineThomas Gleixner
[ Upstream commit a7ef9ba986b5fae9d80f8a7b31db0423687efe4e ] Prevent the compiler from uninlining and creating traceable/probable functions as this is invoked _after_ context tracking switched to CONTEXT_USER and rcu idle. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200505134340.902709267@linutronix.de Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-01KVM: x86: handle wrap around 32-bit address spacePaolo Bonzini
[ Upstream commit fede8076aab4c2280c673492f8f7a2e87712e8b4 ] KVM is not handling the case where EIP wraps around the 32-bit address space (that is, outside long mode). This is needed both in vmx.c and in emulate.c. SVM with NRIPS is okay, but it can still print an error to dmesg due to integer overflow. Reported-by: Nick Peterson <everdox@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-01KVM: Remove CREATE_IRQCHIP/SET_PIT2 raceSteve Rutherford
[ Upstream commit 7289fdb5dcdbc5155b5531529c44105868a762f2 ] Fixes a NULL pointer dereference, caused by the PIT firing an interrupt before the interrupt table has been initialized. SET_PIT2 can race with the creation of the IRQchip. In particular, if SET_PIT2 is called with a low PIT timer period (after the creation of the IOAPIC, but before the instantiation of the irq routes), the PIT can fire an interrupt at an uninitialized table. Signed-off-by: Steve Rutherford <srutherford@google.com> Signed-off-by: Jon Cargille <jcargill@google.com> Reviewed-by: Jim Mattson <jmattson@google.com> Message-Id: <20200416191152.259434-1-jcargill@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-01KVM: LAPIC: Mark hrtimer for period or oneshot mode to expire in hard ↵He Zhe
interrupt context [ Upstream commit edec6e015a02003c2af0ce82c54ea016b5a9e3f0 ] apic->lapic_timer.timer was initialized with HRTIMER_MODE_ABS_HARD but started later with HRTIMER_MODE_ABS, which may cause the following warning in PREEMPT_RT kernel. WARNING: CPU: 1 PID: 2957 at kernel/time/hrtimer.c:1129 hrtimer_start_range_ns+0x348/0x3f0 CPU: 1 PID: 2957 Comm: qemu-system-x86 Not tainted 5.4.23-rt11 #1 Hardware name: Supermicro SYS-E300-9A-8C/A2SDi-8C-HLN4F, BIOS 1.1a 09/18/2018 RIP: 0010:hrtimer_start_range_ns+0x348/0x3f0 Code: 4d b8 0f 94 c1 0f b6 c9 e8 35 f1 ff ff 4c 8b 45 b0 e9 3b fd ff ff e8 d7 3f fa ff 48 98 4c 03 34 c5 a0 26 bf 93 e9 a1 fd ff ff <0f> 0b e9 fd fc ff ff 65 8b 05 fa b7 90 6d 89 c0 48 0f a3 05 60 91 RSP: 0018:ffffbc60026ffaf8 EFLAGS: 00010202 RAX: 0000000000000001 RBX: ffff9d81657d4110 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 0000006cc7987bcf RDI: ffff9d81657d4110 RBP: ffffbc60026ffb58 R08: 0000000000000001 R09: 0000000000000010 R10: 0000000000000000 R11: 0000000000000000 R12: 0000006cc7987bcf R13: 0000000000000000 R14: 0000006cc7987bcf R15: ffffbc60026d6a00 FS: 00007f401daed700(0000) GS:ffff9d81ffa40000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000ffffffff CR3: 0000000fa7574000 CR4: 00000000003426e0 Call Trace: ? kvm_release_pfn_clean+0x22/0x60 [kvm] start_sw_timer+0x85/0x230 [kvm] ? vmx_vmexit+0x1b/0x30 [kvm_intel] kvm_lapic_switch_to_sw_timer+0x72/0x80 [kvm] vmx_pre_block+0x1cb/0x260 [kvm_intel] ? vmx_vmexit+0xf/0x30 [kvm_intel] ? vmx_vmexit+0x1b/0x30 [kvm_intel] ? vmx_vmexit+0xf/0x30 [kvm_intel] ? vmx_vmexit+0x1b/0x30 [kvm_intel] ? vmx_vmexit+0xf/0x30 [kvm_intel] ? vmx_vmexit+0x1b/0x30 [kvm_intel] ? vmx_vmexit+0xf/0x30 [kvm_intel] ? vmx_vmexit+0xf/0x30 [kvm_intel] ? vmx_vmexit+0x1b/0x30 [kvm_intel] ? vmx_vmexit+0xf/0x30 [kvm_intel] ? vmx_vmexit+0x1b/0x30 [kvm_intel] ? vmx_vmexit+0xf/0x30 [kvm_intel] ? vmx_vmexit+0x1b/0x30 [kvm_intel] ? vmx_vmexit+0xf/0x30 [kvm_intel] ? vmx_vmexit+0x1b/0x30 [kvm_intel] ? vmx_vmexit+0xf/0x30 [kvm_intel] ? vmx_sync_pir_to_irr+0x9e/0x100 [kvm_intel] ? kvm_apic_has_interrupt+0x46/0x80 [kvm] kvm_arch_vcpu_ioctl_run+0x85b/0x1fa0 [kvm] ? _raw_spin_unlock_irqrestore+0x18/0x50 ? _copy_to_user+0x2c/0x30 kvm_vcpu_ioctl+0x235/0x660 [kvm] ? rt_spin_unlock+0x2c/0x50 do_vfs_ioctl+0x3e4/0x650 ? __fget+0x7a/0xa0 ksys_ioctl+0x67/0x90 __x64_sys_ioctl+0x1a/0x20 do_syscall_64+0x4d/0x120 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7f4027cc54a7 Code: 00 00 90 48 8b 05 e9 59 0c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d b9 59 0c 00 f7 d8 64 89 01 48 RSP: 002b:00007f401dae9858 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 RAX: ffffffffffffffda RBX: 00005558bd029690 RCX: 00007f4027cc54a7 RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 000000000000000d RBP: 00007f4028b72000 R08: 00005558bc829ad0 R09: 00000000ffffffff R10: 00005558bcf90ca0 R11: 0000000000000246 R12: 0000000000000000 R13: 0000000000000000 R14: 0000000000000000 R15: 00005558bce1c840 --[ end trace 0000000000000002 ]-- Signed-off-by: He Zhe <zhe.he@windriver.com> Message-Id: <1584687967-332859-1-git-send-email-zhe.he@windriver.com> Reviewed-by: Wanpeng Li <wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-01x86/pkeys: Add check for pkey "overflow"Dave Hansen
[ Upstream commit 16171bffc829272d5e6014bad48f680cb50943d9 ] Alex Shi reported the pkey macros above arch_set_user_pkey_access() to be unused. They are unused, and even refer to a nonexistent CONFIG option. But, they might have served a good use, which was to ensure that the code does not try to set values that would not fit in the PKRU register. As it stands, a too-large 'pkey' value would be likely to silently overflow the u32 new_pkru_bits. Add a check to look for overflows. Also add a comment to remind any future developer to closely examine the types used to store pkey values if arch_max_pkey() ever changes. This boots and passes the x86 pkey selftests. Reported-by: Alex Shi <alex.shi@linux.alibaba.com> Signed-off-by: Dave Hansen <dave.hansen@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20200122165346.AD4DA150@viggo.jf.intel.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-01KVM: nVMX: Hold KVM's srcu lock when syncing vmcs12->shadowwanpeng li
[ Upstream commit c9dfd3fb08352d439f0399b6fabe697681d2638c ] For the duration of mapping eVMCS, it derefences ->memslots without holding ->srcu or ->slots_lock when accessing hv assist page. This patch fixes it by moving nested_sync_vmcs12_to_shadow to prepare_guest_switch, where the SRCU is already taken. It can be reproduced by running kvm's evmcs_test selftest. ============================= warning: suspicious rcu usage 5.6.0-rc1+ #53 tainted: g w ioe ----------------------------- ./include/linux/kvm_host.h:623 suspicious rcu_dereference_check() usage! other info that might help us debug this: rcu_scheduler_active = 2, debug_locks = 1 1 lock held by evmcs_test/8507: #0: ffff9ddd156d00d0 (&vcpu->mutex){+.+.}, at: kvm_vcpu_ioctl+0x85/0x680 [kvm] stack backtrace: cpu: 6 pid: 8507 comm: evmcs_test tainted: g w ioe 5.6.0-rc1+ #53 hardware name: dell inc. optiplex 7040/0jctf8, bios 1.4.9 09/12/2016 call trace: dump_stack+0x68/0x9b kvm_read_guest_cached+0x11d/0x150 [kvm] kvm_hv_get_assist_page+0x33/0x40 [kvm] nested_enlightened_vmentry+0x2c/0x60 [kvm_intel] nested_vmx_handle_enlightened_vmptrld.part.52+0x32/0x1c0 [kvm_intel] nested_sync_vmcs12_to_shadow+0x439/0x680 [kvm_intel] vmx_vcpu_run+0x67a/0xe60 [kvm_intel] vcpu_enter_guest+0x35e/0x1bc0 [kvm] kvm_arch_vcpu_ioctl_run+0x40b/0x670 [kvm] kvm_vcpu_ioctl+0x370/0x680 [kvm] ksys_ioctl+0x235/0x850 __x64_sys_ioctl+0x16/0x20 do_syscall_64+0x77/0x780 entry_syscall_64_after_hwframe+0x49/0xbe Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-01KVM: x86: fix incorrect comparison in trace eventPaolo Bonzini
[ Upstream commit 147f1a1fe5d7e6b01b8df4d0cbd6f9eaf6b6c73b ] The "u" field in the event has three states, -1/0/1. Using u8 however means that comparison with -1 will always fail, so change to signed char. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-01x86/kdump: Always reserve the low 1M when the crashkernel option is specifiedLianbo Jiang
[ Upstream commit 6f599d84231fd27e42f4ca2a786a6641e8cddf00 ] On x86, purgatory() copies the first 640K of memory to a backup region because the kernel needs those first 640K for the real mode trampoline during boot, among others. However, when SME is enabled, the kernel cannot properly copy the old memory to the backup area but reads only its encrypted contents. The result is that the crash tool gets invalid pointers when parsing vmcore: crash> kmem -s|grep -i invalid kmem: dma-kmalloc-512: slab:ffffd77680001c00 invalid freepointer:a6086ac099f0c5a4 kmem: dma-kmalloc-512: slab:ffffd77680001c00 invalid freepointer:a6086ac099f0c5a4 crash> So reserve the remaining low 1M memory when the crashkernel option is specified (after reserving real mode memory) so that allocated memory does not fall into the low 1M area and thus the copying of the contents of the first 640k to a backup region in purgatory() can be avoided altogether. This way, it does not need to be included in crash dumps or used for anything except the trampolines that must live in the low 1M. [ bp: Heavily rewrite commit message, flip check logic in crash_reserve_low_1M().] Signed-off-by: Lianbo Jiang <lijiang@redhat.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: bhe@redhat.com Cc: Dave Young <dyoung@redhat.com> Cc: d.hatayama@fujitsu.com Cc: dhowells@redhat.com Cc: ebiederm@xmission.com Cc: horms@verge.net.au Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jürgen Gross <jgross@suse.com> Cc: kexec@lists.infradead.org Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tom Lendacky <thomas.lendacky@amd.com> Cc: vgoyal@redhat.com Cc: x86-ml <x86@kernel.org> Link: https://lkml.kernel.org/r/20191108090027.11082-2-lijiang@redhat.com Link: https://bugzilla.kernel.org/show_bug.cgi?id=204793 Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-09-23x86/boot/compressed: Disable relocation relaxationArvind Sankar
commit 09e43968db40c33a73e9ddbfd937f46d5c334924 upstream. The x86-64 psABI [0] specifies special relocation types (R_X86_64_[REX_]GOTPCRELX) for indirection through the Global Offset Table, semantically equivalent to R_X86_64_GOTPCREL, which the linker can take advantage of for optimization (relaxation) at link time. This is supported by LLD and binutils versions 2.26 onwards. The compressed kernel is position-independent code, however, when using LLD or binutils versions before 2.27, it must be linked without the -pie option. In this case, the linker may optimize certain instructions into a non-position-independent form, by converting foo@GOTPCREL(%rip) to $foo. This potential issue has been present with LLD and binutils-2.26 for a long time, but it has never manifested itself before now: - LLD and binutils-2.26 only relax movq foo@GOTPCREL(%rip), %reg to leaq foo(%rip), %reg which is still position-independent, rather than mov $foo, %reg which is permitted by the psABI when -pie is not enabled. - GCC happens to only generate GOTPCREL relocations on mov instructions. - CLang does generate GOTPCREL relocations on non-mov instructions, but when building the compressed kernel, it uses its integrated assembler (due to the redefinition of KBUILD_CFLAGS dropping -no-integrated-as), which has so far defaulted to not generating the GOTPCRELX relocations. Nick Desaulniers reports [1,2]: "A recent change [3] to a default value of configuration variable (ENABLE_X86_RELAX_RELOCATIONS OFF -> ON) in LLVM now causes Clang's integrated assembler to emit R_X86_64_GOTPCRELX/R_X86_64_REX_GOTPCRELX relocations. LLD will relax instructions with these relocations based on whether the image is being linked as position independent or not. When not, then LLD will relax these instructions to use absolute addressing mode (R_RELAX_GOT_PC_NOPIC). This causes kernels built with Clang and linked with LLD to fail to boot." Patch series [4] is a solution to allow the compressed kernel to be linked with -pie unconditionally, but even if merged is unlikely to be backported. As a simple solution that can be applied to stable as well, prevent the assembler from generating the relaxed relocation types using the -mrelax-relocations=no option. For ease of backporting, do this unconditionally. [0] https://gitlab.com/x86-psABIs/x86-64-ABI/-/blob/master/x86-64-ABI/linker-optimization.tex#L65 [1] https://lore.kernel.org/lkml/20200807194100.3570838-1-ndesaulniers@google.com/ [2] https://github.com/ClangBuiltLinux/linux/issues/1121 [3] https://reviews.llvm.org/rGc41a18cf61790fc898dcda1055c3efbf442c14c0 [4] https://lore.kernel.org/lkml/20200731202738.2577854-1-nivedita@alum.mit.edu/ Reported-by: Nick Desaulniers <ndesaulniers@google.com> Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu> Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: Nick Desaulniers <ndesaulniers@google.com> Tested-by: Sedat Dilek <sedat.dilek@gmail.com> Acked-by: Ard Biesheuvel <ardb@kernel.org> Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20200812004308.1448603-1-nivedita@alum.mit.edu Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-09-17KVM: VMX: Don't freeze guest when event delivery causes an APIC-access exitWanpeng Li
commit 99b82a1437cb31340dbb2c437a2923b9814a7b15 upstream. According to SDM 27.2.4, Event delivery causes an APIC-access VM exit. Don't report internal error and freeze guest when event delivery causes an APIC-access exit, it is handleable and the event will be re-injected during the next vmentry. Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Message-Id: <1597827327-25055-2-git-send-email-wanpengli@tencent.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-09-17vgacon: remove software scrollback supportLinus Torvalds
commit 973c096f6a85e5b5f2a295126ba6928d9a6afd45 upstream. Yunhai Zhang recently fixed a VGA software scrollback bug in commit ebfdfeeae8c0 ("vgacon: Fix for missing check in scrollback handling"), but that then made people look more closely at some of this code, and there were more problems on the vgacon side, but also the fbcon software scrollback. We don't really have anybody who maintains this code - probably because nobody actually _uses_ it any more. Sure, people still use both VGA and the framebuffer consoles, but they are no longer the main user interfaces to the kernel, and haven't been for decades, so these kinds of extra features end up bitrotting and not really being used. So rather than try to maintain a likely unused set of code, I'll just aggressively remove it, and see if anybody even notices. Maybe there are people who haven't jumped on the whole GUI badnwagon yet, and think it's just a fad. And maybe those people use the scrollback code. If that turns out to be the case, we can resurrect this again, once we've found the sucker^Wmaintainer for it who actually uses it. Reported-by: NopNop Nop <nopitydays@gmail.com> Tested-by: Willy Tarreau <w@1wt.eu> Cc: 张云海 <zhangyunhai@nsfocus.com> Acked-by: Andy Lutomirski <luto@amacapital.net> Acked-by: Willy Tarreau <w@1wt.eu> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-09-09tracing/kprobes, x86/ptrace: Fix regs argument order for i386Vamshi K Sthambamkadi
commit 2356bb4b8221d7dc8c7beb810418122ed90254c9 upstream. On i386, the order of parameters passed on regs is eax,edx,and ecx (as per regparm(3) calling conventions). Change the mapping in regs_get_kernel_argument(), so that arg1=ax arg2=dx, and arg3=cx. Running the selftests testcase kprobes_args_use.tc shows the result as passed. Fixes: 3c88ee194c28 ("x86: ptrace: Add function argument access API") Signed-off-by: Vamshi K Sthambamkadi <vamshi.k.sthambamkadi@gmail.com> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Masami Hiramatsu <mhiramat@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: <stable@vger.kernel.org> Link: https://lkml.kernel.org/r/20200828113242.GA1424@cosmos Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-09-09x86, fakenuma: Fix invalid starting node IDHuang Ying
[ Upstream commit ccae0f36d500aef727f98acd8d0601e6b262a513 ] Commit: cc9aec03e58f ("x86/numa_emulation: Introduce uniform split capability") uses "-1" as the starting node ID, which causes the strange kernel log as follows, when "numa=fake=32G" is added to the kernel command line: Faking node -1 at [mem 0x0000000000000000-0x0000000893ffffff] (35136MB) Faking node 0 at [mem 0x0000001840000000-0x000000203fffffff] (32768MB) Faking node 1 at [mem 0x0000000894000000-0x000000183fffffff] (64192MB) Faking node 2 at [mem 0x0000002040000000-0x000000283fffffff] (32768MB) Faking node 3 at [mem 0x0000002840000000-0x000000303fffffff] (32768MB) And finally the kernel crashes: BUG: Bad page state in process swapper pfn:00011 page:(____ptrval____) refcount:0 mapcount:1 mapping:(____ptrval____) index:0x55cd7e44b270 pfn:0x11 failed to read mapping contents, not a valid kernel address? flags: 0x5(locked|uptodate) raw: 0000000000000005 000055cd7e44af30 000055cd7e44af50 0000000100000006 raw: 000055cd7e44b270 000055cd7e44b290 0000000000000000 000055cd7e44b510 page dumped because: page still charged to cgroup page->mem_cgroup:000055cd7e44b510 Modules linked in: CPU: 0 PID: 0 Comm: swapper Not tainted 5.9.0-rc2 #1 Hardware name: Intel Corporation S2600WFT/S2600WFT, BIOS SE5C620.86B.02.01.0008.031920191559 03/19/2019 Call Trace: dump_stack+0x57/0x80 bad_page.cold+0x63/0x94 __free_pages_ok+0x33f/0x360 memblock_free_all+0x127/0x195 mem_init+0x23/0x1f5 start_kernel+0x219/0x4f5 secondary_startup_64+0xb6/0xc0 Fix this bug via using 0 as the starting node ID. This restores the original behavior before cc9aec03e58f. [ mingo: Massaged the changelog. ] Fixes: cc9aec03e58f ("x86/numa_emulation: Introduce uniform split capability") Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20200904061047.612950-1-ying.huang@intel.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-09-03x86/hotplug: Silence APIC only after all interrupts are migratedAshok Raj
commit 52d6b926aabc47643cd910c85edb262b7f44c168 upstream. There is a race when taking a CPU offline. Current code looks like this: native_cpu_disable() { ... apic_soft_disable(); /* * Any existing set bits for pending interrupt to * this CPU are preserved and will be sent via IPI * to another CPU by fixup_irqs(). */ cpu_disable_common(); { .... /* * Race window happens here. Once local APIC has been * disabled any new interrupts from the device to * the old CPU are lost */ fixup_irqs(); // Too late to capture anything in IRR. ... } } The fix is to disable the APIC *after* cpu_disable_common(). Testing was done with a USB NIC that provided a source of frequent interrupts. A script migrated interrupts to a specific CPU and then took that CPU offline. Fixes: 60dcaad5736f ("x86/hotplug: Silence APIC and NMI when CPU is dead") Reported-by: Evan Green <evgreen@chromium.org> Signed-off-by: Ashok Raj <ashok.raj@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Mathias Nyman <mathias.nyman@linux.intel.com> Tested-by: Evan Green <evgreen@chromium.org> Reviewed-by: Evan Green <evgreen@chromium.org> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/lkml/875zdarr4h.fsf@nanos.tec.linutronix.de/ Link: https://lore.kernel.org/r/1598501530-45821-1-git-send-email-ashok.raj@intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-08-26KVM: Pass MMU notifier range flags to kvm_unmap_hva_range()Will Deacon
commit fdfe7cbd58806522e799e2a50a15aee7f2cbb7b6 upstream. The 'flags' field of 'struct mmu_notifier_range' is used to indicate whether invalidate_range_{start,end}() are permitted to block. In the case of kvm_mmu_notifier_invalidate_range_start(), this field is not forwarded on to the architecture-specific implementation of kvm_unmap_hva_range() and therefore the backend cannot sensibly decide whether or not to block. Add an extra 'flags' parameter to kvm_unmap_hva_range() so that architectures are aware as to whether or not they are permitted to block. Cc: <stable@vger.kernel.org> Cc: Marc Zyngier <maz@kernel.org> Cc: Suzuki K Poulose <suzuki.poulose@arm.com> Cc: James Morse <james.morse@arm.com> Signed-off-by: Will Deacon <will@kernel.org> Message-Id: <20200811102725.7121-2-will@kernel.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Will Deacon <will@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-08-26Fix build error when CONFIG_ACPI is not set/enabled:Randy Dunlap
[ Upstream commit ee87e1557c42dc9c2da11c38e11b87c311569853 ] ../arch/x86/pci/xen.c: In function ‘pci_xen_init’: ../arch/x86/pci/xen.c:410:2: error: implicit declaration of function ‘acpi_noirq_set’; did you mean ‘acpi_irq_get’? [-Werror=implicit-function-declaration] acpi_noirq_set(); Fixes: 88e9ca161c13 ("xen/pci: Use acpi_noirq_set() helper to avoid #ifdef") Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Reviewed-by: Juergen Gross <jgross@suse.com> Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: xen-devel@lists.xenproject.org Cc: linux-pci@vger.kernel.org Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-08-26kvm: x86: Toggling CR4.PKE does not load PDPTEs in PAE modeJim Mattson
[ Upstream commit cb957adb4ea422bd758568df5b2478ea3bb34f35 ] See the SDM, volume 3, section 4.4.1: If PAE paging would be in use following an execution of MOV to CR0 or MOV to CR4 (see Section 4.1.1) and the instruction is modifying any of CR0.CD, CR0.NW, CR0.PG, CR4.PAE, CR4.PGE, CR4.PSE, or CR4.SMEP; then the PDPTEs are loaded from the address in CR3. Fixes: b9baba8614890 ("KVM, pkeys: expose CPUID/CR4 to guest") Cc: Huaitong Han <huaitong.han@intel.com> Signed-off-by: Jim Mattson <jmattson@google.com> Reviewed-by: Peter Shier <pshier@google.com> Reviewed-by: Oliver Upton <oupton@google.com> Message-Id: <20200817181655.3716509-1-jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-08-26kvm: x86: Toggling CR4.SMAP does not load PDPTEs in PAE modeJim Mattson
[ Upstream commit 427890aff8558eb4326e723835e0eae0e6fe3102 ] See the SDM, volume 3, section 4.4.1: If PAE paging would be in use following an execution of MOV to CR0 or MOV to CR4 (see Section 4.1.1) and the instruction is modifying any of CR0.CD, CR0.NW, CR0.PG, CR4.PAE, CR4.PGE, CR4.PSE, or CR4.SMEP; then the PDPTEs are loaded from the address in CR3. Fixes: 0be0226f07d14 ("KVM: MMU: fix SMAP virtualization") Cc: Xiao Guangrong <guangrong.xiao@linux.intel.com> Signed-off-by: Jim Mattson <jmattson@google.com> Reviewed-by: Peter Shier <pshier@google.com> Reviewed-by: Oliver Upton <oupton@google.com> Message-Id: <20200817181655.3716509-2-jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-08-26x86/boot: kbuild: allow readelf executable to be specifiedDmitry Golovin
commit eefb8c124fd969e9a174ff2bedff86aa305a7438 upstream. Introduce a new READELF variable to top-level Makefile, so the name of readelf binary can be specified. Before this change the name of the binary was hardcoded to "$(CROSS_COMPILE)readelf" which might not be present for every toolchain. This allows to build with LLVM Object Reader by using make parameter READELF=llvm-readelf. Link: https://github.com/ClangBuiltLinux/linux/issues/771 Signed-off-by: Dmitry Golovin <dima@golovin.in> Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> Signed-off-by: Masahiro Yamada <masahiroy@kernel.org> Signed-off-by: Nick Desaulniers <ndesaulniers@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-08-21perf/x86/rapl: Fix missing psys sysfs attributesZhang Rui
[ Upstream commit 4bb5fcb97a5df0bbc0a27e0252b1e7ce140a8431 ] This fixes a problem introduced by commit: 5fb5273a905c ("perf/x86/rapl: Use new MSR detection interface") that perf event sysfs attributes for psys RAPL domain are missing. Fixes: 5fb5273a905c ("perf/x86/rapl: Use new MSR detection interface") Signed-off-by: Zhang Rui <rui.zhang@intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Kan Liang <kan.liang@linux.intel.com> Reviewed-by: Len Brown <len.brown@intel.com> Acked-by: Jiri Olsa <jolsa@redhat.com> Link: https://lore.kernel.org/r/20200811153149.12242-2-rui.zhang@intel.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-08-21x86/tsr: Fix tsc frequency enumeration bug on Lightning Mountain SoCDilip Kota
[ Upstream commit 7d98585860d845e36ee612832a5ff021f201dbaf ] Frequency descriptor of Lightning Mountain SoC doesn't have all the frequency entries so resulting in the below failure causing a kernel hang: Error MSR_FSB_FREQ index 15 is unknown tsc: Fast TSC calibration failed So, add all the frequency entries in the Lightning Mountain SoC frequency descriptor. Fixes: 0cc5359d8fd45 ("x86/cpu: Update init data for new Airmont CPU model") Fixes: 812c2d7506fd ("x86/tsc_msr: Use named struct initializers") Signed-off-by: Dilip Kota <eswara.kota@linux.intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com> Link: https://lore.kernel.org/r/211c643ae217604b46cbec43a2c0423946dc7d2d.1596440057.git.eswara.kota@linux.intel.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-08-21genirq/affinity: Make affinity setting if activated opt-inThomas Gleixner
commit f0c7baca180046824e07fc5f1326e83a8fd150c7 upstream. John reported that on a RK3288 system the perf per CPU interrupts are all affine to CPU0 and provided the analysis: "It looks like what happens is that because the interrupts are not per-CPU in the hardware, armpmu_request_irq() calls irq_force_affinity() while the interrupt is deactivated and then request_irq() with IRQF_PERCPU | IRQF_NOBALANCING. Now when irq_startup() runs with IRQ_STARTUP_NORMAL, it calls irq_setup_affinity() which returns early because IRQF_PERCPU and IRQF_NOBALANCING are set, leaving the interrupt on its original CPU." This was broken by the recent commit which blocked interrupt affinity setting in hardware before activation of the interrupt. While this works in general, it does not work for this particular case. As contrary to the initial analysis not all interrupt chip drivers implement an activate callback, the safe cure is to make the deferred interrupt affinity setting at activation time opt-in. Implement the necessary core logic and make the two irqchip implementations for which this is required opt-in. In hindsight this would have been the right thing to do, but ... Fixes: baedb87d1b53 ("genirq/affinity: Handle affinity setting on inactive interrupts correctly") Reported-by: John Keeping <john@metanate.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Marc Zyngier <maz@kernel.org> Acked-by: Marc Zyngier <maz@kernel.org> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/87blk4tzgm.fsf@nanos.tec.linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>