summaryrefslogtreecommitdiff
path: root/arch/x86
AgeCommit message (Collapse)Author
2018-04-24um: Use POSIX ucontext_t instead of struct ucontextKrzysztof Mazur
commit 4d1a535b8ec5e74b42dfd9dc809142653b2597f6 upstream. glibc 2.26 removed the 'struct ucontext' to "improve" POSIX compliance and break programs, including User Mode Linux. Fix User Mode Linux by using POSIX ucontext_t. This fixes: arch/um/os-Linux/signal.c: In function 'hard_handler': arch/um/os-Linux/signal.c:163:22: error: dereferencing pointer to incomplete type 'struct ucontext' mcontext_t *mc = &uc->uc_mcontext; arch/x86/um/stub_segv.c: In function 'stub_segv_handler': arch/x86/um/stub_segv.c:16:13: error: dereferencing pointer to incomplete type 'struct ucontext' &uc->uc_mcontext); Cc: stable@vger.kernel.org Signed-off-by: Krzysztof Mazur <krzysiek@podlesie.net> Signed-off-by: Richard Weinberger <richard@nod.at> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-04-24um: Compile with modern headersJason A. Donenfeld
commit 530ba6c7cb3c22435a4d26de47037bb6f86a5329 upstream. Recent libcs have gotten a bit more strict, so we actually need to include the right headers and use the right types. This enables UML to compile again. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Cc: stable@vger.kernel.org Signed-off-by: Richard Weinberger <richard@nod.at> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-04-13KVM: nVMX: Update vmcs12->guest_linear_address on nested VM-exitJim Mattson
[ Upstream commit d281e13b0bfe745a21061a194e386a949784393f ] The guest-linear address field is set for VM exits due to attempts to execute LMSW with a memory operand and VM exits due to attempts to execute INS or OUTS for which the relevant segment is usable, regardless of whether or not EPT is in use. Fixes: 119a9c01a5922 ("KVM: nVMX: pass valid guest linear-address to the L1") Signed-off-by: Jim Mattson <jmattson@google.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-04-13KVM: SVM: do not zero out segment attributes if segment is unusable or not ↵Roman Pen
present [ Upstream commit d9c1b5431d5f0e07575db785a022bce91051ac1d ] This is a fix for the problem [1], where VMCB.CPL was set to 0 and interrupt was taken on userspace stack. The root cause lies in the specific AMD CPU behaviour which manifests itself as unusable segment attributes on SYSRET. The corresponding work around for the kernel is the following: 61f01dd941ba ("x86_64, asm: Work around AMD SYSRET SS descriptor attribute issue") In other turn virtualization side treated unusable segment incorrectly and restored CPL from SS attributes, which were zeroed out few lines above. In current patch it is assured only that P bit is cleared in VMCB.save state and segment attributes are not zeroed out if segment is not presented or is unusable, therefore CPL can be safely restored from DPL field. This is only one part of the fix, since QEMU side should be fixed accordingly not to zero out attributes on its side. Corresponding patch will follow. [1] Message id: CAJrWOzD6Xq==b-zYCDdFLgSRMPM-NkNuTSDFEtX=7MreT45i7Q@mail.gmail.com Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com> Signed-off-by: Mikhail Sennikovskii <mikhail.sennikovskii@profitbricks.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim KrÄmář <rkrcmar@redhat.com> Cc: kvm@vger.kernel.org Cc: linux-kernel@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-04-13x86/efi: Disable runtime services on kexec kernel if booted with efi=old_mapSai Praneeth
[ Upstream commit 4e52797d2efefac3271abdc54439a3435abd77b9 ] Booting kexec kernel with "efi=old_map" in kernel command line hits kernel panic as shown below. BUG: unable to handle kernel paging request at ffff88007fe78070 IP: virt_efi_set_variable.part.7+0x63/0x1b0 PGD 7ea28067 PUD 7ea2b067 PMD 7ea2d067 PTE 0 [...] Call Trace: virt_efi_set_variable() efi_delete_dummy_variable() efi_enter_virtual_mode() start_kernel() x86_64_start_reservations() x86_64_start_kernel() start_cpu() [ efi=old_map was never intended to work with kexec. The problem with using efi=old_map is that the virtual addresses are assigned from the memory region used by other kernel mappings; vmalloc() space. Potentially there could be collisions when booting kexec if something else is mapped at the virtual address we allocated for runtime service regions in the initial boot - Matt Fleming ] Since kexec was never intended to work with efi=old_map, disable runtime services in kexec if booted with efi=old_map, so that we don't panic. Tested-by: Lee Chun-Yi <jlee@suse.com> Signed-off-by: Sai Praneeth Prakhya <sai.praneeth.prakhya@intel.com> Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk> Acked-by: Dave Young <dyoung@redhat.com> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ravi Shankar <ravi.v.shankar@intel.com> Cc: Ricardo Neri <ricardo.neri@intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-efi@vger.kernel.org Link: http://lkml.kernel.org/r/20170526113652.21339-4-matt@codeblueprint.co.uk Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-04-13KVM: nVMX: Fix handling of lmsw instructionJan H. Schönherr
[ Upstream commit e1d39b17e044e8ae819827810d87d809ba5f58c0 ] The decision whether or not to exit from L2 to L1 on an lmsw instruction is based on bogus values: instead of using the information encoded within the exit qualification, it uses the data also used for the mov-to-cr instruction, which boils down to using whatever is in %eax at that point. Use the correct values instead. Without this fix, an L1 may not get notified when a 32-bit Linux L2 switches its secondary CPUs to protected mode; the L1 is only notified on the next modification of CR0. This short time window poses a problem, when there is some other reason to exit to L1 in between. Then, L2 will be resumed in real mode and chaos ensues. Signed-off-by: Jan H. Schönherr <jschoenh@amazon.de> Reviewed-by: Wanpeng Li <wanpeng.li@hotmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-04-13KVM: X86: Fix preempt the preemption timer cancelWanpeng Li
[ Upstream commit 5acc1ca4fb15f00bfa3d4046e35ca381bc25d580 ] Preemption can occur during cancel preemption timer, and there will be inconsistent status in lapic, vmx and vmcs field. CPU0 CPU1 preemption timer vmexit handle_preemption_timer(vCPU0) kvm_lapic_expired_hv_timer vmx_cancel_hv_timer vmx->hv_deadline_tsc = -1 vmcs_clear_bits /* hv_timer_in_use still true */ sched_out sched_in kvm_arch_vcpu_load vmx_set_hv_timer write vmx->hv_deadline_tsc vmcs_set_bits /* back in kvm_lapic_expired_hv_timer */ hv_timer_in_use = false ... vmx_vcpu_run vmx_arm_hv_run write preemption timer deadline spurious preemption timer vmexit handle_preemption_timer(vCPU0) kvm_lapic_expired_hv_timer WARN_ON(!apic->lapic_timer.hv_timer_in_use); This can be reproduced sporadically during boot of L2 on a preemptible L1, causing a splat on L1. WARNING: CPU: 3 PID: 1952 at arch/x86/kvm/lapic.c:1529 kvm_lapic_expired_hv_timer+0xb5/0xd0 [kvm] CPU: 3 PID: 1952 Comm: qemu-system-x86 Not tainted 4.12.0-rc1+ #24 RIP: 0010:kvm_lapic_expired_hv_timer+0xb5/0xd0 [kvm] Call Trace: handle_preemption_timer+0xe/0x20 [kvm_intel] vmx_handle_exit+0xc9/0x15f0 [kvm_intel] ? lock_acquire+0xdb/0x250 ? lock_acquire+0xdb/0x250 ? kvm_arch_vcpu_ioctl_run+0xdf3/0x1ce0 [kvm] kvm_arch_vcpu_ioctl_run+0xe55/0x1ce0 [kvm] kvm_vcpu_ioctl+0x384/0x7b0 [kvm] ? kvm_vcpu_ioctl+0x384/0x7b0 [kvm] ? __fget+0xf3/0x210 do_vfs_ioctl+0xa4/0x700 ? __fget+0x114/0x210 SyS_ioctl+0x79/0x90 do_syscall_64+0x8f/0x750 ? trace_hardirqs_on_thunk+0x1a/0x1c entry_SYSCALL64_slow_path+0x25/0x25 This patch fixes it by disabling preemption while cancelling preemption timer. This way cancel_hv_timer is atomic with respect to kvm_arch_vcpu_load. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-04-13x86/tsc: Provide 'tsc=unstable' boot parameterPeter Zijlstra
[ Upstream commit 8309f86cd41e8714526867177facf7a316d9be53 ] Since the clocksource watchdog will only detect broken TSC after the fact, all TSC based clocks will likely have observed non-continuous values before/when switching away from TSC. Therefore only thing to fully avoid random clock movement when your BIOS randomly mucks with TSC values from SMI handlers is reporting the TSC as unstable at boot. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-04-13x86/boot: Declare error() as noreturnKees Cook
[ Upstream commit 60854a12d281e2fa25662fa32ac8022bbff17432 ] The compressed boot function error() is used to halt execution, but it wasn't marked with "noreturn". This fixes that in preparation for supporting kernel FORTIFY_SOURCE, which uses the noreturn annotation on panic, and calls error(). GCC would warn about a noreturn function calling a non-noreturn function: arch/x86/boot/compressed/misc.c: In function ‘fortify_panic’: arch/x86/boot/compressed/misc.c:416:1: warning: ‘noreturn’ function does return } ^ Signed-off-by: Kees Cook <keescook@chromium.org> Cc: Daniel Micay <danielmicay@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: H. Peter Anvin <hpa@zytor.com> Link: http://lkml.kernel.org/r/20170506045116.GA2879@beast Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-04-13x86/mm/kaslr: Use the _ASM_MUL macro for multiplication to work around Clang ↵Matthias Kaehlcke
incompatibility [ Upstream commit 121843eb02a6e2fa30aefab64bfe183c97230c75 ] The constraint "rm" allows the compiler to put mix_const into memory. When the input operand is a memory location then MUL needs an operand size suffix, since Clang can't infer the multiplication width from the operand. Add and use the _ASM_MUL macro which determines the operand size and resolves to the NUL instruction with the corresponding suffix. This fixes the following error when building with clang: CC arch/x86/lib/kaslr.o /tmp/kaslr-dfe1ad.s: Assembler messages: /tmp/kaslr-dfe1ad.s:182: Error: no instruction mnemonic suffix given and no register operands; can't size instruction Signed-off-by: Matthias Kaehlcke <mka@chromium.org> Cc: Grant Grundler <grundler@chromium.org> Cc: Greg Hackmann <ghackmann@google.com> Cc: Kees Cook <keescook@chromium.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Michael Davidson <md@google.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20170501224741.133938-1-mka@chromium.org Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-04-13x86/asm: Don't use RBP as a temporary register in csum_partial_copy_generic()Josh Poimboeuf
[ Upstream commit 42fc6c6cb1662ba2fa727dd01c9473c63be4e3b6 ] Andrey Konovalov reported the following warning while fuzzing the kernel with syzkaller: WARNING: kernel stack regs at ffff8800686869f8 in a.out:4933 has bad 'bp' value c3fc855a10167ec0 The unwinder dump revealed that RBP had a bad value when an interrupt occurred in csum_partial_copy_generic(). That function saves RBP on the stack and then overwrites it, using it as a scratch register. That's problematic because it breaks stack traces if an interrupt occurs in the middle of the function. Replace the usage of RBP with another callee-saved register (R15) so stack traces are no longer affected. Reported-by: Andrey Konovalov <andreyknvl@google.com> Tested-by: Andrey Konovalov <andreyknvl@google.com> Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: David S . Miller <davem@davemloft.net> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Kostya Serebryany <kcc@google.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Cc: Neil Horman <nhorman@tuxdriver.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vlad Yasevich <vyasevich@gmail.com> Cc: linux-sctp@vger.kernel.org Cc: netdev <netdev@vger.kernel.org> Cc: syzkaller <syzkaller@googlegroups.com> Link: http://lkml.kernel.org/r/4b03a961efda5ec9bfe46b7b9c9ad72d1efad343.1493909486.git.jpoimboe@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-04-08crypto: x86/cast5-avx - fix ECB encryption when long sg follows short oneEric Biggers
commit 8f461b1e02ed546fbd0f11611138da67fd85a30f upstream. With ecb-cast5-avx, if a 128+ byte scatterlist element followed a shorter one, then the algorithm accidentally encrypted/decrypted only 8 bytes instead of the expected 128 bytes. Fix it by setting the encryption/decryption 'fn' correctly. Fixes: c12ab20b162c ("crypto: cast5/avx - avoid using temporary stack buffers") Cc: <stable@vger.kernel.org> # v3.8+ Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-04-08kprobes/x86: Fix to set RWX bits correctly before releasing trampolineMasami Hiramatsu
commit c93f5cf571e7795f97d49ef51b766cf25e328545 upstream. Fix kprobes to set(recover) RWX bits correctly on trampoline buffer before releasing it. Releasing readonly page to module_memfree() crash the kernel. Without this fix, if kprobes user register a bunch of kprobes in function body (since kprobes on function entry usually use ftrace) and unregister it, kernel hits a BUG and crash. Link: http://lkml.kernel.org/r/149570868652.3518.14120169373590420503.stgit@devbox Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Fixes: d0381c81c2f7 ("kprobes/x86: Set kprobes pages read-only") Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Cc: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-28bpf, x64: increase number of passesDaniel Borkmann
commit 6007b080d2e2adb7af22bf29165f0594ea12b34c upstream. In Cilium some of the main programs we run today are hitting 9 passes on x64's JIT compiler, and we've had cases already where we surpassed the limit where the JIT then punts the program to the interpreter instead, leading to insertion failures due to CONFIG_BPF_JIT_ALWAYS_ON or insertion failures due to the prog array owner being JITed but the program to insert not (both must have the same JITed/non-JITed property). One concrete case the program image shrunk from 12,767 bytes down to 10,288 bytes where the image converged after 16 steps. I've measured that this took 340us in the JIT until it converges on my i7-6600U. Thus, increase the original limit we had from day one where the JIT covered cBPF only back then before we run into the case (as similar with the complexity limit) where we trip over this and hit program rejections. Also add a cond_resched() into the compilation loop, the JIT process runs without any locks and may sleep anyway. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-28perf/x86/intel/uncore: Fix multi-domain PCI CHA enumeration bug on Skylake ↵Kan Liang
servers commit 320b0651f32b830add6497fcdcfdcb6ae8c7b8a0 upstream. The number of CHAs is miscalculated on multi-domain PCI Skylake server systems, resulting in an uncore driver initialization error. Gary Kroening explains: "For systems with a single PCI segment, it is sufficient to look for the bus number to change in order to determine that all of the CHa's have been counted for a single socket. However, for multi PCI segment systems, each socket is given a new segment and the bus number does NOT change. So looking only for the bus number to change ends up counting all of the CHa's on all sockets in the system. This leads to writing CPU MSRs beyond a valid range and causes an error in ivbep_uncore_msr_init_box()." To fix this bug, query the number of CHAs from the CAPID6 register: it should read bits 27:0 in the CAPID6 register located at Device 30, Function 3, Offset 0x9C. These 28 bits form a bit vector of available LLC slices and the CHAs that manage those slices. Reported-by: Kroening, Gary <gary.kroening@hpe.com> Tested-by: Kroening, Gary <gary.kroening@hpe.com> Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: abanman@hpe.com Cc: dimitri.sivanich@hpe.com Cc: hpa@zytor.com Cc: mike.travis@hpe.com Cc: russ.anderson@hpe.com Fixes: cd34cd97b7b4 ("perf/x86/intel/uncore: Add Skylake server uncore support") Link: http://lkml.kernel.org/r/1520967094-13219-1-git-send-email-kan.liang@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-28perf/x86/intel: Don't accidentally clear high bits in bdw_limit_period()Dan Carpenter
commit e5ea9b54a055619160bbfe527ebb7d7191823d66 upstream. We intended to clear the lowest 6 bits but because of a type bug we clear the high 32 bits as well. Andi says that periods are rarely more than U32_MAX so this bug probably doesn't have a huge runtime impact. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Kan Liang <kan.liang@linux.intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Fixes: 294fe0f52a44 ("perf/x86/intel: Add INST_RETIRED.ALL workarounds") Link: http://lkml.kernel.org/r/20180317115216.GB4035@mwanda Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-28perf/x86/intel/uncore: Fix Skylake UPI event formatKan Liang
commit 317660940fd9dddd3201c2f92e25c27902c753fa upstream. There is no event extension (bit 21) for SKX UPI, so use 'event' instead of 'event_ext'. Reported-by: Stephane Eranian <eranian@google.com> Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Fixes: cd34cd97b7b4 ("perf/x86/intel/uncore: Add Skylake server uncore support") Link: http://lkml.kernel.org/r/1520004150-4855-1-git-send-email-kan.liang@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-28x86/entry/64: Don't use IST entry for #BP stackAndy Lutomirski
commit d8ba61ba58c88d5207c1ba2f7d9a2280e7d03be9 upstream. There's nothing IST-worthy about #BP/int3. We don't allow kprobes in the small handful of places in the kernel that run at CPL0 with an invalid stack, and 32-bit kernels have used normal interrupt gates for #BP forever. Furthermore, we don't allow kprobes in places that have usergs while in kernel mode, so "paranoid" is also unnecessary. Signed-off-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-28x86/boot/64: Verify alignment of the LOAD segmentH.J. Lu
commit c55b8550fa57ba4f5e507be406ff9fc2845713e8 upstream. Since the x86-64 kernel must be aligned to 2MB, refuse to boot the kernel if the alignment of the LOAD segment isn't a multiple of 2MB. Signed-off-by: H.J. Lu <hjl.tools@gmail.com> Cc: Andy Shevchenko <andy.shevchenko@gmail.com> Cc: Eric Biederman <ebiederm@xmission.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Juergen Gross <jgross@suse.com> Cc: Kees Cook <keescook@chromium.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/CAMe9rOrR7xSJgUfiCoZLuqWUwymRxXPoGBW38%2BpN%3D9g%2ByKNhZw@mail.gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-28x86/build/64: Force the linker to use 2MB page sizeH.J. Lu
commit e3d03598e8ae7d195af5d3d049596dec336f569f upstream. Binutils 2.31 will enable -z separate-code by default for x86 to avoid mixing code pages with data to improve cache performance as well as security. To reduce x86-64 executable and shared object sizes, the maximum page size is reduced from 2MB to 4KB. But x86-64 kernel must be aligned to 2MB. Pass -z max-page-size=0x200000 to linker to force 2MB page size regardless of the default page size used by linker. Tested with Linux kernel 4.15.6 on x86-64. Signed-off-by: H.J. Lu <hjl.tools@gmail.com> Cc: Andy Shevchenko <andy.shevchenko@gmail.com> Cc: Eric Biederman <ebiederm@xmission.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Juergen Gross <jgross@suse.com> Cc: Kees Cook <keescook@chromium.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/CAMe9rOp4_%3D_8twdpTyAP2DhONOCeaTOsniJLoppzhoNptL8xzA@mail.gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-28kvm/x86: fix icebp instruction handlingLinus Torvalds
commit 32d43cd391bacb5f0814c2624399a5dad3501d09 upstream. The undocumented 'icebp' instruction (aka 'int1') works pretty much like 'int3' in the absense of in-circuit probing equipment (except, obviously, that it raises #DB instead of raising #BP), and is used by some validation test-suites as such. But Andy Lutomirski noticed that his test suite acted differently in kvm than on bare hardware. The reason is that kvm used an inexact test for the icebp instruction: it just assumed that an all-zero VM exit qualification value meant that the VM exit was due to icebp. That is not unlike the guess that do_debug() does for the actual exception handling case, but it's purely a heuristic, not an absolute rule. do_debug() does it because it wants to ascribe _some_ reasons to the #DB that happened, and an empty %dr6 value means that 'icebp' is the most likely casue and we have no better information. But kvm can just do it right, because unlike the do_debug() case, kvm actually sees the real reason for the #DB in the VM-exit interruption information field. So instead of relying on an inexact heuristic, just use the actual VM exit information that says "it was 'icebp'". Right now the 'icebp' instruction isn't technically documented by Intel, but that will hopefully change. The special "privileged software exception" information _is_ actually mentioned in the Intel SDM, even though the cause of it isn't enumerated. Reported-by: Andy Lutomirski <luto@kernel.org> Tested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-28x86/mm: implement free pmd/pte page interfacesToshi Kani
commit 28ee90fe6048fa7b7ceaeb8831c0e4e454a4cf89 upstream. Implement pud_free_pmd_page() and pmd_free_pte_page() on x86, which clear a given pud/pmd entry and free up lower level page table(s). The address range associated with the pud/pmd entry must have been purged by INVLPG. Link: http://lkml.kernel.org/r/20180314180155.19492-3-toshi.kani@hpe.com Fixes: e61ce6ade404e ("mm: change ioremap to set up huge I/O mappings") Signed-off-by: Toshi Kani <toshi.kani@hpe.com> Reported-by: Lei Li <lious.lilei@hisilicon.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Borislav Petkov <bp@suse.de> Cc: Matthew Wilcox <willy@infradead.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-28mm/vmalloc: add interfaces to free unmapped page tableToshi Kani
commit b6bdb7517c3d3f41f20e5c2948d6bc3f8897394e upstream. On architectures with CONFIG_HAVE_ARCH_HUGE_VMAP set, ioremap() may create pud/pmd mappings. A kernel panic was observed on arm64 systems with Cortex-A75 in the following steps as described by Hanjun Guo. 1. ioremap a 4K size, valid page table will build, 2. iounmap it, pte0 will set to 0; 3. ioremap the same address with 2M size, pgd/pmd is unchanged, then set the a new value for pmd; 4. pte0 is leaked; 5. CPU may meet exception because the old pmd is still in TLB, which will lead to kernel panic. This panic is not reproducible on x86. INVLPG, called from iounmap, purges all levels of entries associated with purged address on x86. x86 still has memory leak. The patch changes the ioremap path to free unmapped page table(s) since doing so in the unmap path has the following issues: - The iounmap() path is shared with vunmap(). Since vmap() only supports pte mappings, making vunmap() to free a pte page is an overhead for regular vmap users as they do not need a pte page freed up. - Checking if all entries in a pte page are cleared in the unmap path is racy, and serializing this check is expensive. - The unmap path calls free_vmap_area_noflush() to do lazy TLB purges. Clearing a pud/pmd entry before the lazy TLB purges needs extra TLB purge. Add two interfaces, pud_free_pmd_page() and pmd_free_pte_page(), which clear a given pud/pmd entry and free up a page for the lower level entries. This patch implements their stub functions on x86 and arm64, which work as workaround. [akpm@linux-foundation.org: fix typo in pmd_free_pte_page() stub] Link: http://lkml.kernel.org/r/20180314180155.19492-2-toshi.kani@hpe.com Fixes: e61ce6ade404e ("mm: change ioremap to set up huge I/O mappings") Reported-by: Lei Li <lious.lilei@hisilicon.com> Signed-off-by: Toshi Kani <toshi.kani@hpe.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Wang Xuefeng <wxf.wang@hisilicon.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Hanjun Guo <guohanjun@huawei.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Borislav Petkov <bp@suse.de> Cc: Matthew Wilcox <willy@infradead.org> Cc: Chintan Pandya <cpandya@codeaurora.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-24x86/xen: split xen_smp_prepare_boot_cpu()Vitaly Kuznetsov
[ Upstream commit a2d1078a35f9a38ae888aa6147e4ca32666154a1 ] Split xen_smp_prepare_boot_cpu() into xen_pv_smp_prepare_boot_cpu() and xen_hvm_smp_prepare_boot_cpu() to support further splitting of smp.c. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Reviewed-by: Juergen Gross <jgross@suse.com> Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-24x86/KASLR: Fix kexec kernel boot crash when KASLR randomization failsBaoquan He
[ Upstream commit da63b6b20077469bd6bd96e07991ce145fc4fbc4 ] Dave found that a kdump kernel with KASLR enabled will reset to the BIOS immediately if physical randomization failed to find a new position for the kernel. A kernel with the 'nokaslr' option works in this case. The reason is that KASLR will install a new page table for the identity mapping, while it missed building it for the original kernel location if KASLR physical randomization fails. This only happens in the kexec/kdump kernel, because the identity mapping has been built for kexec/kdump in the 1st kernel for the whole memory by calling init_pgtable(). Here if physical randomizaiton fails, it won't build the identity mapping for the original area of the kernel but change to a new page table '_pgtable'. Then the kernel will triple fault immediately caused by no identity mappings. The normal kernel won't see this bug, because it comes here via startup_32() and CR3 will be set to _pgtable already. In startup_32() the identity mapping is built for the 0~4G area. In KASLR we just append to the existing area instead of entirely overwriting it for on-demand identity mapping building. So the identity mapping for the original area of kernel is still there. To fix it we just switch to the new identity mapping page table when physical KASLR succeeds. Otherwise we keep the old page table unchanged just like "nokaslr" does. Signed-off-by: Baoquan He <bhe@redhat.com> Signed-off-by: Dave Young <dyoung@redhat.com> Acked-by: Kees Cook <keescook@chromium.org> Cc: Borislav Petkov <bp@suse.de> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Garnier <thgarnie@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Yinghai Lu <yinghai@kernel.org> Link: http://lkml.kernel.org/r/1493278940-5885-1-git-send-email-bhe@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-24x86/reboot: Turn off KVM when halting a CPUTiantian Feng
[ Upstream commit fba4f472b33aa81ca1836f57d005455261e9126f ] A CPU in VMX root mode will ignore INIT signals and will fail to bring up the APs after reboot. Therefore, on a panic we disable VMX on all CPUs before rebooting or triggering kdump. Do this when halting the machine as well, in case a firmware-level reboot does not perform a cold reset for all processors. Without doing this, rebooting the host may hang. Signed-off-by: Tiantian Feng <fengtiantian@huawei.com> Signed-off-by: Xishi Qiu <qiuxishi@huawei.com> [ Rewritten commit message. ] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: kvm@vger.kernel.org Link: http://lkml.kernel.org/r/20170419161839.30550-1-pbonzini@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-24x86: i8259: export legacy_pic symbolHans de Goede
[ Upstream commit 7ee06cb2f840a96be46233181ed4557901a74385 ] The classic PC rtc-coms driver has a workaround for broken ACPI device nodes for it which lack an irq resource. This workaround used to unconditionally hardcode the irq to 8 in these cases. This was causing irq conflict problems on systems without a legacy-pic so a recent patch added an if (nr_legacy_irqs()) guard to the workaround to avoid this irq conflict. nr_legacy_irqs() uses the legacy_pic symbol under the hood causing an undefined symbol error if the rtc-cmos code is build as a module. This commit exports the legacy_pic symbol to fix this. Cc: rtc-linux@googlegroups.com Cc: alexandre.belloni@free-electrons.com Signed-off-by: Hans de Goede <hdegoede@redhat.com> Signed-off-by: Alexandre Belloni <alexandre.belloni@free-electrons.com> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-22x86/mm: Fix vmalloc_fault to use pXd_largeToshi Kani
commit 18a955219bf7d9008ce480d4451b6b8bf4483a22 upstream. Gratian Crisan reported that vmalloc_fault() crashes when CONFIG_HUGETLBFS is not set since the function inadvertently uses pXn_huge(), which always return 0 in this case. ioremap() does not depend on CONFIG_HUGETLBFS. Fix vmalloc_fault() to call pXd_large() instead. Fixes: f4eafd8bcd52 ("x86/mm: Fix vmalloc_fault() to handle large pages properly") Reported-by: Gratian Crisan <gratian.crisan@ni.com> Signed-off-by: Toshi Kani <toshi.kani@hpe.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Cc: linux-mm@kvack.org Cc: Borislav Petkov <bp@alien8.de> Cc: Andy Lutomirski <luto@kernel.org> Link: https://lkml.kernel.org/r/20180313170347.3829-2-toshi.kani@hpe.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-22x86/speculation: Remove Skylake C2 from Speculation Control microcode blacklistAlexander Sergeyev
commit e3b3121fa8da94cb20f9e0c64ab7981ae47fd085 upstream. In accordance with Intel's microcode revision guidance from March 6 MCU rev 0xc2 is cleared on both Skylake H/S and Skylake Xeon E3 processors that share CPUID 506E3. Signed-off-by: Alexander Sergeyev <sergeev917@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Jia Zhang <qianyue.zj@alibaba-inc.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Kyle Huey <me@kylehuey.com> Cc: David Woodhouse <dwmw@amazon.co.uk> Link: https://lkml.kernel.org/r/20180313193856.GA8580@localhost.localdomain Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-22x86/speculation, objtool: Annotate indirect calls/jumps for objtool on ↵Andy Whitcroft
32-bit kernels commit a14bff131108faf50cc0cf864589fd71ee216c96 upstream. In the following commit: 9e0e3c5130e9 ("x86/speculation, objtool: Annotate indirect calls/jumps for objtool") ... we added annotations for CALL_NOSPEC/JMP_NOSPEC on 64-bit x86 kernels, but we did not annotate the 32-bit path. Annotate it similarly. Signed-off-by: Andy Whitcroft <apw@canonical.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Woodhouse <dwmw2@infradead.org> Cc: David Woodhouse <dwmw@amazon.co.uk> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20180314112427.22351-1-apw@canonical.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-22x86/vm86/32: Fix POPF emulationAndy Lutomirski
commit b5069782453459f6ec1fdeb495d9901a4545fcb5 upstream. POPF would trap if VIP was set regardless of whether IF was set. Fix it. Suggested-by: Stas Sergeev <stsp@list.ru> Reported-by: Bart Oldeman <bartoldeman@gmail.com> Signed-off-by: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Fixes: 5ed92a8ab71f ("x86/vm86: Use the normal pt_regs area for vm86") Link: http://lkml.kernel.org/r/ce95f40556e7b2178b6bc06ee9557827ff94bd28.1521003603.git.luto@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-22x86/cpufeatures: Add Intel PCONFIG cpufeatureKirill A. Shutemov
commit 7958b2246fadf54b7ff820a2a5a2c5ca1554716f upstream. CPUID.0x7.0x0:EDX[18] indicates whether Intel CPU support PCONFIG instruction. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Kai Huang <kai.huang@linux.intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tom Lendacky <thomas.lendacky@amd.com> Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/20180305162610.37510-4-kirill.shutemov@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-22x86/boot/32: Fix UP boot on Quark and possibly other platformsAndy Lutomirski
commit d2b6dc61a8dd3c429609b993778cb54e75a5c5f0 upstream. This partially reverts commit: 23b2a4ddebdd17f ("x86/boot/32: Defer resyncing initial_page_table until per-cpu is set up") That commit had one definite bug and one potential bug. The definite bug is that setup_per_cpu_areas() uses a differnet generic implementation on UP kernels, so initial_page_table never got resynced. This was fine for access to percpu data (it's in the identity map on UP), but it breaks other users of initial_page_table. The potential bug is that helpers like efi_init() would be called before the tables were synced. Avoid both problems by just syncing the page tables in setup_arch() *and* setup_per_cpu_areas(). Reported-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Andy Lutomirski <luto@kernel.org> Cc: Andy Shevchenko <andy.shevchenko@gmail.com> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Juergen Gross <jgross@suse.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matt Fleming <matt@codeblueprint.co.uk> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Garnier <thgarnie@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-efi@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-22kprobes/x86: Set kprobes pages read-onlyMasami Hiramatsu
[ Upstream commit d0381c81c2f782fa2131178d11e0cfb23d50d631 ] Set the pages which is used for kprobes' singlestep buffer and optprobe's trampoline instruction buffer to readonly. This can prevent unexpected (or unintended) instruction modification. This also passes rodata_test as below. Without this patch, rodata_test shows a warning: WARNING: CPU: 0 PID: 1 at arch/x86/mm/dump_pagetables.c:235 note_page+0x7a9/0xa20 x86/mm: Found insecure W+X mapping at address ffffffffa0000000/0xffffffffa0000000 With this fix, no W+X pages are found: x86/mm: Checked W+X mappings: passed, no W+X pages found. rodata_test: all tests were successful Reported-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Cc: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com> Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: David S . Miller <davem@davemloft.net> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ye Xiaolong <xiaolong.ye@intel.com> Link: http://lkml.kernel.org/r/149076375592.22469.14174394514338612247.stgit@devbox Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-22kprobes/x86: Fix kprobe-booster not to boost far call instructionsMasami Hiramatsu
[ Upstream commit bd0b90676c30fe640e7ead919b3e38846ac88ab7 ] Fix the kprobe-booster not to boost far call instruction, because a call may store the address in the single-step execution buffer to the stack, which should be modified after single stepping. Currently, this instruction will be filtered as not boostable in resume_execution(), so this is not a critical issue. Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Cc: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: David S . Miller <davem@davemloft.net> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ye Xiaolong <xiaolong.ye@intel.com> Link: http://lkml.kernel.org/r/149076340615.22469.14066273186134229909.stgit@devbox Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-22kvm: nVMX: Disallow userspace-injected exceptions in guest modeJim Mattson
[ Upstream commit 28d06353881939703c34d82a1465136af176c620 ] The userspace exception injection API and code path are entirely unprepared for exceptions that might cause a VM-exit from L2 to L1, so the best course of action may be to simply disallow this for now. 1. The API provides no mechanism for userspace to specify the new DR6 bits for a #DB exception or the new CR2 value for a #PF exception. Presumably, userspace is expected to modify these registers directly with KVM_SET_SREGS before the next KVM_RUN ioctl. However, in the event that L1 intercepts the exception, these registers should not be changed. Instead, the new values should be provided in the exit_qualification field of vmcs12 (Intel SDM vol 3, section 27.1). 2. In the case of a userspace-injected #DB, inject_pending_event() clears DR7.GD before calling vmx_queue_exception(). However, in the event that L1 intercepts the exception, this is too early, because DR7.GD should not be modified by a #DB that causes a VM-exit directly (Intel SDM vol 3, section 27.1). 3. If the injected exception is a #PF, nested_vmx_check_exception() doesn't properly check whether or not L1 is interested in the associated error code (using the #PF error code mask and match fields from vmcs12). It may either return 0 when it should call nested_vmx_vmexit() or vice versa. 4. nested_vmx_check_exception() assumes that it is dealing with a hardware-generated exception intercept from L2, with some of the relevant details (the VM-exit interruption-information and the exit qualification) live in vmcs02. For userspace-injected exceptions, this is not the case. 5. prepare_vmcs12() assumes that when its exit_intr_info argument specifies valid information with a valid error code that it can VMREAD the VM-exit interruption error code from vmcs02. For userspace-injected exceptions, this is not the case. Signed-off-by: Jim Mattson <jmattson@google.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-22kvm/svm: Setup MCG_CAP on AMD properlyBorislav Petkov
[ Upstream commit 74f169090b6f36b867c9df0454366dd9af6f62d1 ] MCG_CAP[63:9] bits are reserved on AMD. However, on an AMD guest, this MSR returns 0x100010a. More specifically, bit 24 is set, which is simply wrong. That bit is MCG_SER_P and is present only on Intel. Thus, clean up the reserved bits in order not to confuse guests. Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Joerg Roedel <joro@8bytes.org> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-22x86/boot/32: Defer resyncing initial_page_table until per-cpu is set upAndy Lutomirski
[ Upstream commit 23b2a4ddebdd17fad265b4bb77256c2e4ec37dee ] The x86 smpboot trampoline expects initial_page_table to have the GDT mapped. If the GDT ends up in a virtually mapped per-cpu page, then it won't be in the page tables at all until perc-pu areas are set up. The result will be a triple fault the first time that the CPU attempts to access the GDT after LGDT loads the perc-pu GDT. This appears to be an old bug, but somehow the GDT fixmap rework is triggering it. This seems to have something to do with the memory layout. Signed-off-by: Andy Lutomirski <luto@kernel.org> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Juergen Gross <jgross@suse.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matt Fleming <matt@codeblueprint.co.uk> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Garnier <thgarnie@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-efi@vger.kernel.org Link: http://lkml.kernel.org/r/a553264a5972c6a86f9b5caac237470a0c74a720.1490218061.git.luto@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-22x86/mce: Init some CPU features earlyYazen Ghannam
[ Upstream commit 5204bf17031b69fa5faa4dc80a9dc1e2446d74f9 ] When the MCA banks in __mcheck_cpu_init_generic() are polled for leftover errors logged during boot or from the previous boot, its required to have CPU features detected sufficiently so that the reading out and handling of those early errors is done correctly. If those features are not available, the decoding may miss some information and get incomplete errors logged. For example, on SMCA systems the MCA_IPID and MCA_SYND registers are not logged and MCA_ADDR is not masked appropriately. To cure that, do a subset of the basic feature detection early while the rest happens in its usual place in __mcheck_cpu_init_vendor(). Signed-off-by: Yazen Ghannam <Yazen.Ghannam@amd.com> Cc: Tony Luck <tony.luck@intel.com> Cc: linux-edac <linux-edac@vger.kernel.org> Cc: x86-ml <x86@kernel.org> Link: http://lkml.kernel.org/r/1489599055-20756-1-git-send-email-Yazen.Ghannam@amd.com [ Massage commit message and simplify. ] Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-22x86/mce: Handle broadcasted MCE gracefully with kexecXunlei Pang
[ Upstream commit 5bc329503e8191c91c4c40836f062ef771d8ba83 ] When we are about to kexec a crash kernel and right then and there a broadcasted MCE fires while we're still in the first kernel and while the other CPUs remain in a holding pattern, the #MC handler of the first kernel will timeout and then panic due to never completing MCE synchronization. Handle this in a similar way as to when the CPUs are offlined when that broadcasted MCE happens. [ Boris: rewrote commit message and comments. ] Suggested-by: Borislav Petkov <bp@alien8.de> Signed-off-by: Xunlei Pang <xlpang@redhat.com> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Tony Luck <tony.luck@intel.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: kexec@lists.infradead.org Cc: linux-edac <linux-edac@vger.kernel.org> Link: http://lkml.kernel.org/r/1487857012-9059-1-git-send-email-xlpang@redhat.com Link: http://lkml.kernel.org/r/20170313095019.19351-1-bp@alien8.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-22x86/mm: Make mmap(MAP_32BIT) work correctlyDmitry Safonov
[ Upstream commit 3e6ef9c80946f781fc25e8490c9875b1d2b61158 ] mmap(MAP_32BIT) is broken due to the dependency on the TIF_ADDR32 thread flag. For 64bit applications MAP_32BIT will force legacy bottom-up allocations and the 1GB address space restriction even if the application issued a compat syscall, which should not be subject of these restrictions. For 32bit applications, which issue 64bit syscalls the newly introduced mmap base separation into 64-bit and compat bases changed the behaviour because now a 64-bit mapping is returned, but due to the TIF_ADDR32 dependency MAP_32BIT is ignored. Before the separation a 32-bit mapping was returned, so the MAP_32BIT handling was irrelevant. Replace the check for TIF_ADDR32 with a check for the compat syscall. That solves both the 64-bit issuing a compat syscall and the 32-bit issuing a 64-bit syscall problems. [ tglx: Massaged changelog ] Signed-off-by: Dmitry Safonov <dsafonov@virtuozzo.com> Cc: 0x7f454c46@gmail.com Cc: linux-mm@kvack.org Cc: Andy Lutomirski <luto@kernel.org> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Borislav Petkov <bp@suse.de> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Link: http://lkml.kernel.org/r/20170306141721.9188-5-dsafonov@virtuozzo.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-18x86: Treat R_X86_64_PLT32 as R_X86_64_PC32H.J. Lu
commit b21ebf2fb4cde1618915a97cc773e287ff49173e upstream. On i386, there are 2 types of PLTs, PIC and non-PIC. PIE and shared objects must use PIC PLT. To use PIC PLT, you need to load _GLOBAL_OFFSET_TABLE_ into EBX first. There is no need for that on x86-64 since x86-64 uses PC-relative PLT. On x86-64, for 32-bit PC-relative branches, we can generate PLT32 relocation, instead of PC32 relocation, which can also be used as a marker for 32-bit PC-relative branches. Linker can always reduce PLT32 relocation to PC32 if function is defined locally. Local functions should use PC32 relocation. As far as Linux kernel is concerned, R_X86_64_PLT32 can be treated the same as R_X86_64_PC32 since Linux kernel doesn't use PLT. R_X86_64_PLT32 for 32-bit PC-relative branches has been enabled in binutils master branch which will become binutils 2.31. [ hjl is working on having better documentation on this all, but a few more notes from him: "PLT32 relocation is used as marker for PC-relative branches. Because of EBX, it looks odd to generate PLT32 relocation on i386 when EBX doesn't have GOT. As for symbol resolution, PLT32 and PC32 relocations are almost interchangeable. But when linker sees PLT32 relocation against a protected symbol, it can resolved locally at link-time since it is used on a branch instruction. Linker can't do that for PC32 relocation" but for the kernel use, the two are basically the same, and this commit gets things building and working with the current binutils master - Linus ] Signed-off-by: H.J. Lu <hjl.tools@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matthias Kaehlcke <mka@chromium.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-18x86/module: Detect and skip invalid relocationsJosh Poimboeuf
commit eda9cec4c9a12208a6f69fbe68f72a6311d50032 upstream. There have been some cases where external tooling (e.g., kpatch-build) creates a corrupt relocation which targets the wrong address. This is a silent failure which can corrupt memory in unexpected places. On x86, the bytes of data being overwritten by relocations are always initialized to zero beforehand. Use that knowledge to add sanity checks to detect such cases before they corrupt memory. Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: jeyu@kernel.org Cc: live-patching@vger.kernel.org Link: http://lkml.kernel.org/r/37450d6c6225e54db107fba447ce9e56e5f758e9.1509713553.git.jpoimboe@redhat.com [ Restructured the messages, as it's unclear whether the relocation or the target is corrupted. ] Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Matthias Kaehlcke <mka@chromium.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-18x86/paravirt, objtool: Annotate indirect callsPeter Zijlstra
commit 3010a0663fd949d122eca0561b06b0a9453f7866 upstream. Paravirt emits indirect calls which get flagged by objtool retpoline checks, annotate it away because all these indirect calls will be patched out before we start userspace. This patching happens through alternative_instructions() -> apply_paravirt() -> pv_init_ops.patch() which will eventually end up in paravirt_patch_default(). This function _will_ write direct alternatives. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk> Acked-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-18x86/speculation: Move firmware_restrict_branch_speculation_*() from C to CPPIngo Molnar
commit d72f4e29e6d84b7ec02ae93088aa459ac70e733b upstream. firmware_restrict_branch_speculation_*() recently started using preempt_enable()/disable(), but those are relatively high level primitives and cause build failures on some 32-bit builds. Since we want to keep <asm/nospec-branch.h> low level, convert them to macros to avoid header hell... Cc: David Woodhouse <dwmw@amazon.co.uk> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: arjan.van.de.ven@intel.com Cc: bp@alien8.de Cc: dave.hansen@intel.com Cc: jmattson@google.com Cc: karahmed@amazon.de Cc: kvm@vger.kernel.org Cc: pbonzini@redhat.com Cc: rkrcmar@redhat.com Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-18x86/boot, objtool: Annotate indirect jump in secondary_startup_64()Peter Zijlstra
commit bd89004f6305cbf7352238f61da093207ee518d6 upstream. The objtool retpoline validation found this indirect jump. Seeing how it's on CPU bringup before we run userspace it should be safe, annotate it. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk> Acked-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-18x86/speculation, objtool: Annotate indirect calls/jumps for objtoolPeter Zijlstra
commit 9e0e3c5130e949c389caabc8033e9799b129e429 upstream. Annotate the indirect calls/jumps in the CALL_NOSPEC/JUMP_NOSPEC alternatives. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk> Acked-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-18x86/retpoline: Support retpoline builds with ClangDavid Woodhouse
commit 87358710c1fb4f1bf96bbe2349975ff9953fc9b2 upstream. Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: arjan.van.de.ven@intel.com Cc: bp@alien8.de Cc: dave.hansen@intel.com Cc: jmattson@google.com Cc: karahmed@amazon.de Cc: kvm@vger.kernel.org Cc: pbonzini@redhat.com Cc: rkrcmar@redhat.com Link: http://lkml.kernel.org/r/1519037457-7643-5-git-send-email-dwmw@amazon.co.uk Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-18x86/speculation: Use IBRS if available before calling into firmwareDavid Woodhouse
commit dd84441a797150dcc49298ec95c459a8891d8bb1 upstream. Retpoline means the kernel is safe because it has no indirect branches. But firmware isn't, so use IBRS for firmware calls if it's available. Block preemption while IBRS is set, although in practice the call sites already had to be doing that. Ignore hpwdt.c for now. It's taking spinlocks and calling into firmware code, from an NMI handler. I don't want to touch that with a bargepole. Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: arjan.van.de.ven@intel.com Cc: bp@alien8.de Cc: dave.hansen@intel.com Cc: jmattson@google.com Cc: karahmed@amazon.de Cc: kvm@vger.kernel.org Cc: pbonzini@redhat.com Cc: rkrcmar@redhat.com Link: http://lkml.kernel.org/r/1519037457-7643-2-git-send-email-dwmw@amazon.co.uk Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-18Revert "x86/retpoline: Simplify vmexit_fill_RSB()"David Woodhouse
commit d1c99108af3c5992640aa2afa7d2e88c3775c06e upstream. This reverts commit 1dde7415e99933bb7293d6b2843752cbdb43ec11. By putting the RSB filling out of line and calling it, we waste one RSB slot for returning from the function itself, which means one fewer actual function call we can make if we're doing the Skylake abomination of call-depth counting. It also changed the number of RSB stuffings we do on vmexit from 32, which was correct, to 16. Let's just stop with the bikeshedding; it didn't actually *fix* anything anyway. Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Acked-by: Thomas Gleixner <tglx@linutronix.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: arjan.van.de.ven@intel.com Cc: bp@alien8.de Cc: dave.hansen@intel.com Cc: jmattson@google.com Cc: karahmed@amazon.de Cc: kvm@vger.kernel.org Cc: pbonzini@redhat.com Cc: rkrcmar@redhat.com Link: http://lkml.kernel.org/r/1519037457-7643-4-git-send-email-dwmw@amazon.co.uk Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>