summaryrefslogtreecommitdiff
path: root/arch/x86
AgeCommit message (Collapse)Author
2016-07-10KVM: x86: mask CPUID(0xD,0x1).EAX against host valuePaolo Bonzini
[ Upstream commit 316314cae15fb0e3869b76b468f59a0c83ac3d4e ] This ensures that the guest doesn't see XSAVE extensions (e.g. xgetbv1 or xsavec) that the host lacks. Cc: stable@vger.kernel.org Reviewed-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
2016-07-10ACPI / processor: Request native thermal interrupt handling via _OSCSrinivas Pandruvada
[ Upstream commit a21211672c9a1d730a39aa65d4a5b3414700adfb ] There are several reports of freeze on enabling HWP (Hardware PStates) feature on Skylake-based systems by the Intel P-states driver. The root cause is identified as the HWP interrupts causing BIOS code to freeze. HWP interrupts use the thermal LVT which can be handled by Linux natively, but on the affected Skylake-based systems SMM will respond to it by default. This is a problem for several reasons: - On the affected systems the SMM thermal LVT handler is broken (it will crash when invoked) and a BIOS update is necessary to fix it. - With thermal interrupt handled in SMM we lose all of the reporting features of the arch/x86/kernel/cpu/mcheck/therm_throt driver. - Some thermal drivers like x86-package-temp depend on the thermal threshold interrupts signaled via the thermal LVT. - The HWP interrupts are useful for debugging and tuning performance (if the kernel can handle them). The native handling of thermal interrupts needs to be enabled because of that. This requires some way to tell SMM that the OS can handle thermal interrupts. That can be done by using _OSC/_PDC in processor scope very early during ACPI initialization. The meaning of _OSC/_PDC bit 12 in processor scope is whether or not the OS supports native handling of interrupts for Collaborative Processor Performance Control (CPPC) notifications. Since on HWP-capable systems CPPC is a firmware interface to HWP, setting this bit effectively tells the firmware that the OS will handle thermal interrupts natively going forward. For details on _OSC/_PDC refer to: http://www.intel.com/content/www/us/en/standards/processor-vendor-specific-acpi-specification.html To implement the _OSC/_PDC handshake as described, introduce a new function, acpi_early_processor_osc(), that walks the ACPI namespace looking for ACPI processor objects and invokes _OSC for them with bit 12 in the capabilities buffer set and terminates the namespace walk on the first success. Also modify intel_thermal_interrupt() to clear HWP status bits in the HWP_STATUS MSR to acknowledge HWP interrupts (which prevents them from firing continuously). Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> [ rjw: Subject & changelog, function rename ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
2016-07-10x86/mm/kmmio: Fix mmiotrace for hugepagesKarol Herbst
[ Upstream commit cfa52c0cfa4d727aa3e457bf29aeff296c528a08 ] Because Linux might use bigger pages than the 4K pages to handle those mmio ioremaps, the kmmio code shouldn't rely on the pade id as it currently does. Using the memory address instead of the page id lets us look up how big the page is and what its base address is, so that we won't get a page fault within the same page twice anymore. Tested-by: Pierre Moreau <pierre.morrow@free.fr> Signed-off-by: Karol Herbst <nouveau@karolherbst.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Luis R. Rodriguez <mcgrof@suse.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Toshi Kani <toshi.kani@hp.com> Cc: linux-mm@kvack.org Cc: linux-x86_64@vger.kernel.org Cc: nouveau@lists.freedesktop.org Cc: pq@iki.fi Cc: rostedt@goodmis.org Link: http://lkml.kernel.org/r/1456966991-6861-1-git-send-email-nouveau@karolherbst.de Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
2016-07-10kprobes/x86: Clear TF bit in fault on single-steppingMasami Hiramatsu
[ Upstream commit dcfc47248d3f7d28df6f531e6426b933de94370d ] Fix kprobe_fault_handler() to clear the TF (trap flag) bit of the flags register in the case of a fault fixup on single-stepping. If we put a kprobe on the instruction which caused a page fault (e.g. actual mov instructions in copy_user_*), that fault happens on the single-stepping buffer. In this case, kprobes resets running instance so that the CPU can retry execution on the original ip address. However, current code forgets to reset the TF bit. Since this fault happens with TF bit set for enabling single-stepping, when it retries, it causes a debug exception and kprobes can not handle it because it already reset itself. On the most of x86-64 platform, it can be easily reproduced by using kprobe tracer. E.g. # cd /sys/kernel/debug/tracing # echo p copy_user_enhanced_fast_string+5 > kprobe_events # echo 1 > events/kprobes/enable And you'll see a kernel panic on do_debug(), since the debug trap is not handled by kprobes. To fix this problem, we just need to clear the TF bit when resetting running kprobe. Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Reviewed-by: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com> Acked-by: Steven Rostedt <rostedt@goodmis.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: systemtap@sourceware.org Cc: stable@vger.kernel.org # All the way back to ancient kernels Link: http://lkml.kernel.org/r/20160611140648.25885.37482.stgit@devbox [ Updated the comments. ] Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
2016-06-18x86, build: copy ldlinux.c32 to image.isoH. Peter Anvin
[ Upstream commit 9c77679cadb118c0aa99e6f88533d91765a131ba ] For newer versions of Syslinux, we need ldlinux.c32 in addition to isolinux.bin to reside on the boot disk, so if the latter is found, copy it, too, to the isoimage tree. Signed-off-by: H. Peter Anvin <hpa@zytor.com> Cc: Linux Stable Tree <stable@vger.kernel.org> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
2016-06-17KVM: x86: fix OOPS after invalid KVM_SET_DEBUGREGSPaolo Bonzini
[ Upstream commit d14bdb553f9196169f003058ae1cdabe514470e6 ] MOV to DR6 or DR7 causes a #GP if an attempt is made to write a 1 to any of bits 63:32. However, this is not detected at KVM_SET_DEBUGREGS time, and the next KVM_RUN oopses: general protection fault: 0000 [#1] SMP CPU: 2 PID: 14987 Comm: a.out Not tainted 4.4.9-300.fc23.x86_64 #1 Hardware name: LENOVO 2325F51/2325F51, BIOS G2ET32WW (1.12 ) 05/30/2012 [...] Call Trace: [<ffffffffa072c93d>] kvm_arch_vcpu_ioctl_run+0x141d/0x14e0 [kvm] [<ffffffffa071405d>] kvm_vcpu_ioctl+0x33d/0x620 [kvm] [<ffffffff81241648>] do_vfs_ioctl+0x298/0x480 [<ffffffff812418a9>] SyS_ioctl+0x79/0x90 [<ffffffff817a0f2e>] entry_SYSCALL_64_fastpath+0x12/0x71 Code: 55 83 ff 07 48 89 e5 77 27 89 ff ff 24 fd 90 87 80 81 0f 23 fe 5d c3 0f 23 c6 5d c3 0f 23 ce 5d c3 0f 23 d6 5d c3 0f 23 de 5d c3 <0f> 23 f6 5d c3 0f 0b 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 RIP [<ffffffff810639eb>] native_set_debugreg+0x2b/0x40 RSP <ffff88005836bd50> Testcase (beautified/reduced from syzkaller output): #include <unistd.h> #include <sys/syscall.h> #include <string.h> #include <stdint.h> #include <linux/kvm.h> #include <fcntl.h> #include <sys/ioctl.h> long r[8]; int main() { struct kvm_debugregs dr = { 0 }; r[2] = open("/dev/kvm", O_RDONLY); r[3] = ioctl(r[2], KVM_CREATE_VM, 0); r[4] = ioctl(r[3], KVM_CREATE_VCPU, 7); memcpy(&dr, "\x5d\x6a\x6b\xe8\x57\x3b\x4b\x7e\xcf\x0d\xa1\x72" "\xa3\x4a\x29\x0c\xfc\x6d\x44\x00\xa7\x52\xc7\xd8" "\x00\xdb\x89\x9d\x78\xb5\x54\x6b\x6b\x13\x1c\xe9" "\x5e\xd3\x0e\x40\x6f\xb4\x66\xf7\x5b\xe3\x36\xcb", 48); r[7] = ioctl(r[4], KVM_SET_DEBUGREGS, &dr); r[6] = ioctl(r[4], KVM_RUN, 0); } Reported-by: Dmitry Vyukov <dvyukov@google.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
2016-06-06xen/x86: actually allocate legacy interrupts on PV guestsStefano Stabellini
[ Upstream commit 702f926067d2a4b28c10a3c41a1172dd62d9e735 ] b4ff8389ed14 is incomplete: relies on nr_legacy_irqs() to get the number of legacy interrupts when actually nr_legacy_irqs() returns 0 after probe_8259A(). Use NR_IRQS_LEGACY instead. Signed-off-by: Stefano Stabellini <sstabellini@kernel.org> CC: stable@vger.kernel.org Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
2016-05-30perf/x86/intel/pt: Generate PMI in the STOP region as wellSasha Levin
[ Upstream commit ab92b232ae05c382c3df0e3d6a5c6d16b639ac8c ] Currently, the PT driver always sets the PMI bit one region (page) before the STOP region so that we can wake up the consumer before we run out of room in the buffer and have to disable the event. However, we also need an interrupt in the last output region, so that we actually get to disable the event (if no more room from new data is available at that point), otherwise hardware just quietly refuses to start, but the event is scheduled in and we end up losing trace data till the event gets removed. For a cpu-wide event it is even worse since there may not be any re-scheduling at all and no chance for the ring buffer code to notice that its buffer is filled up and the event needs to be disabled (so that the consumer can re-enable it when it finishes reading the data out). In other words, all the trace data will be lost after the buffer gets filled up. This patch makes PT also generate a PMI when the last output region is full. Reported-by: Markus Metzger <markus.t.metzger@intel.com> Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: <stable@vger.kernel.org> Cc: Arnaldo Carvalho de Melo <acme@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: vince@deater.net Link: http://lkml.kernel.org/r/1462886313-13660-2-git-send-email-alexander.shishkin@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
2016-05-17x86/tsc: Read all ratio bits from MSR_PLATFORM_INFOChen Yu
[ Upstream commit 886123fb3a8656699dff40afa0573df359abeb18 ] Currently we read the tsc radio: ratio = (MSR_PLATFORM_INFO >> 8) & 0x1f; Thus we get bit 8-12 of MSR_PLATFORM_INFO, however according to the SDM (35.5), the ratio bits are bit 8-15. Ignoring the upper bits can result in an incorrect tsc ratio, which causes the TSC calibration and the Local APIC timer frequency to be incorrect. Fix this problem by masking 0xff instead. [ tglx: Massaged changelog ] Fixes: 7da7c1561366 "x86, tsc: Add static (MSR) TSC calibration on Intel Atom SoCs" Signed-off-by: Chen Yu <yu.c.chen@intel.com> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: stable@vger.kernel.org Cc: Bin Gao <bin.gao@intel.com> Cc: Len Brown <lenb@kernel.org> Link: http://lkml.kernel.org/r/1462505619-5516-1-git-send-email-yu.c.chen@intel.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
2016-05-17x86/sysfb_efi: Fix valid BAR address range checkWang YanQing
[ Upstream commit c10fcb14c7afd6688c7b197a814358fecf244222 ] The code for checking whether a BAR address range is valid will break out of the loop when a start address of 0x0 is encountered. This behaviour is wrong since by breaking out of the loop we may miss the BAR that describes the EFI frame buffer in a later iteration. Because of this bug I can't use video=efifb: boot parameter to get efifb on my new ThinkPad E550 for my old linux system hard disk with 3.10 kernel. In 3.10, efifb is the only choice due to DRM/I915 not supporting the GPU. This patch also add a trivial optimization to break out after we find the frame buffer address range without testing later BARs. Signed-off-by: Wang YanQing <udknight@gmail.com> [ Rewrote changelog. ] Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk> Reviewed-by: Peter Jones <pjones@redhat.com> Cc: <stable@vger.kernel.org> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Cc: David Herrmann <dh.herrmann@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tomi Valkeinen <tomi.valkeinen@ti.com> Cc: linux-efi@vger.kernel.org Link: http://lkml.kernel.org/r/1462454061-21561-2-git-send-email-matt@codeblueprint.co.uk Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
2016-05-10x86/mm/xen: Suppress hugetlbfs in PV guestsJan Beulich
[ Upstream commit 103f6112f253017d7062cd74d17f4a514ed4485c ] Huge pages are not normally available to PV guests. Not suppressing hugetlbfs use results in an endless loop of page faults when user mode code tries to access a hugetlbfs mapped area (since the hypervisor denies such PTEs to be created, but error indications can't be propagated out of xen_set_pte_at(), just like for various of its siblings), and - once killed in an oops like this: kernel BUG at .../fs/hugetlbfs/inode.c:428! invalid opcode: 0000 [#1] SMP ... RIP: e030:[<ffffffff811c333b>] [<ffffffff811c333b>] remove_inode_hugepages+0x25b/0x320 ... Call Trace: [<ffffffff811c3415>] hugetlbfs_evict_inode+0x15/0x40 [<ffffffff81167b3d>] evict+0xbd/0x1b0 [<ffffffff8116514a>] __dentry_kill+0x19a/0x1f0 [<ffffffff81165b0e>] dput+0x1fe/0x220 [<ffffffff81150535>] __fput+0x155/0x200 [<ffffffff81079fc0>] task_work_run+0x60/0xa0 [<ffffffff81063510>] do_exit+0x160/0x400 [<ffffffff810637eb>] do_group_exit+0x3b/0xa0 [<ffffffff8106e8bd>] get_signal+0x1ed/0x470 [<ffffffff8100f854>] do_signal+0x14/0x110 [<ffffffff810030e9>] prepare_exit_to_usermode+0xe9/0xf0 [<ffffffff814178a5>] retint_user+0x8/0x13 This is CVE-2016-3961 / XSA-174. Reported-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: David Vrabel <david.vrabel@citrix.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Juergen Gross <JGross@suse.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Luis R. Rodriguez <mcgrof@suse.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Toshi Kani <toshi.kani@hp.com> Cc: stable@vger.kernel.org Cc: xen-devel <xen-devel@lists.xenproject.org> Link: http://lkml.kernel.org/r/57188ED802000078000E431C@prv-mh.provo.novell.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
2016-05-08crypto: sha1-mb - use corrcet pointer while completing jobsXiaodong Liu
[ Upstream commit 0851561d9c965df086ef8a53f981f5f95a57c2c8 ] In sha_complete_job, incorrect mcryptd_hash_request_ctx pointer is used when check and complete other jobs. If the memory of first completed req is freed, while still completing other jobs in the func, kernel will crash since NULL pointer is assigned to RIP. Cc: <stable@vger.kernel.org> Signed-off-by: Xiaodong Liu <xiaodong.liu@intel.com> Acked-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
2016-04-20kvm: x86: do not leak guest xcr0 into host interrupt handlersDavid Matlack
[ Upstream commit fc5b7f3bf1e1414bd4e91db6918c85ace0c873a5 ] An interrupt handler that uses the fpu can kill a KVM VM, if it runs under the following conditions: - the guest's xcr0 register is loaded on the cpu - the guest's fpu context is not loaded - the host is using eagerfpu Note that the guest's xcr0 register and fpu context are not loaded as part of the atomic world switch into "guest mode". They are loaded by KVM while the cpu is still in "host mode". Usage of the fpu in interrupt context is gated by irq_fpu_usable(). The interrupt handler will look something like this: if (irq_fpu_usable()) { kernel_fpu_begin(); [... code that uses the fpu ...] kernel_fpu_end(); } As long as the guest's fpu is not loaded and the host is using eager fpu, irq_fpu_usable() returns true (interrupted_kernel_fpu_idle() returns true). The interrupt handler proceeds to use the fpu with the guest's xcr0 live. kernel_fpu_begin() saves the current fpu context. If this uses XSAVE[OPT], it may leave the xsave area in an undesirable state. According to the SDM, during XSAVE bit i of XSTATE_BV is not modified if bit i is 0 in xcr0. So it's possible that XSTATE_BV[i] == 1 and xcr0[i] == 0 following an XSAVE. kernel_fpu_end() restores the fpu context. Now if any bit i in XSTATE_BV == 1 while xcr0[i] == 0, XRSTOR generates a #GP. The fault is trapped and SIGSEGV is delivered to the current process. Only pre-4.2 kernels appear to be vulnerable to this sequence of events. Commit 653f52c ("kvm,x86: load guest FPU context more eagerly") from 4.2 forces the guest's fpu to always be loaded on eagerfpu hosts. This patch fixes the bug by keeping the host's xcr0 loaded outside of the interrupts-disabled region where KVM switches into guest mode. Cc: stable@vger.kernel.org Suggested-by: Andy Lutomirski <luto@amacapital.net> Signed-off-by: David Matlack <dmatlack@google.com> [Move load after goto cancel_injection. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
2016-04-20KVM: x86: Inject pending interrupt even if pending nmi existYuki Shibuya
[ Upstream commit 321c5658c5e9192dea0d58ab67cf1791e45b2b26 ] Non maskable interrupts (NMI) are preferred to interrupts in current implementation. If a NMI is pending and NMI is blocked by the result of nmi_allowed(), pending interrupt is not injected and enable_irq_window() is not executed, even if interrupts injection is allowed. In old kernel (e.g. 2.6.32), schedule() is often called in NMI context. In this case, interrupts are needed to execute iret that intends end of NMI. The flag of blocking new NMI is not cleared until the guest execute the iret, and interrupts are blocked by pending NMI. Due to this, iret can't be invoked in the guest, and the guest is starved until block is cleared by some events (e.g. canceling injection). This patch injects pending interrupts, when it's allowed, even if NMI is blocked. And, If an interrupts is pending after executing inject_pending_event(), enable_irq_window() is executed regardless of NMI pending counter. Cc: stable@vger.kernel.org Signed-off-by: Yuki Shibuya <shibuya.yk@ncos.nec.co.jp> Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
2016-04-18KVM: VMX: avoid guest hang on invalid invept instructionPaolo Bonzini
[ Upstream commit 2849eb4f99d54925c543db12917127f88b3c38ff ] A guest executing an invalid invept instruction would hang because the instruction pointer was not updated. Cc: stable@vger.kernel.org Fixes: bfd0a56b90005f8c8a004baf407ad90045c2b11e Reviewed-by: David Matlack <dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
2016-04-18x86/apic: Fix suspicious RCU usage in smp_trace_call_function_interrupt()Dave Jones
[ Upstream commit 7834c10313fb823e538f2772be78edcdeed2e6e3 ] Since 4.4, I've been able to trigger this occasionally: =============================== [ INFO: suspicious RCU usage. ] 4.5.0-rc7-think+ #3 Not tainted Cc: Andi Kleen <ak@linux.intel.com> Link: http://lkml.kernel.org/r/20160315012054.GA17765@codemonkey.org.uk Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ------------------------------- ./arch/x86/include/asm/msr-trace.h:47 suspicious rcu_dereference_check() usage! other info that might help us debug this: RCU used illegally from idle CPU! rcu_scheduler_active = 1, debug_locks = 1 RCU used illegally from extended quiescent state! no locks held by swapper/3/0. stack backtrace: CPU: 3 PID: 0 Comm: swapper/3 Not tainted 4.5.0-rc7-think+ #3 ffffffff92f821e0 1f3e5c340597d7fc ffff880468e07f10 ffffffff92560c2a ffff880462145280 0000000000000001 ffff880468e07f40 ffffffff921376a6 ffffffff93665ea0 0000cc7c876d28da 0000000000000005 ffffffff9383dd60 Call Trace: <IRQ> [<ffffffff92560c2a>] dump_stack+0x67/0x9d [<ffffffff921376a6>] lockdep_rcu_suspicious+0xe6/0x100 [<ffffffff925ae7a7>] do_trace_write_msr+0x127/0x1a0 [<ffffffff92061c83>] native_apic_msr_eoi_write+0x23/0x30 [<ffffffff92054408>] smp_trace_call_function_interrupt+0x38/0x360 [<ffffffff92d1ca60>] trace_call_function_interrupt+0x90/0xa0 <EOI> [<ffffffff92ac5124>] ? cpuidle_enter_state+0x1b4/0x520 Move the entering_irq() call before ack_APIC_irq(), because entering_irq() tells the RCU susbstems to end the extended quiescent state, so that the following trace call in ack_APIC_irq() works correctly. Suggested-by: Andi Kleen <ak@linux.intel.com> Fixes: 4787c368a9bc "x86/tracing: Add irq_enter/exit() in smp_trace_reschedule_interrupt()" Signed-off-by: Dave Jones <davej@codemonkey.org.uk> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
2016-04-18x86/iopl: Fix iopl capability check on Xen PVAndy Lutomirski
[ Upstream commit c29016cf41fe9fa994a5ecca607cf5f1cd98801e ] iopl(3) is supposed to work if iopl is already 3, even if unprivileged. This didn't work right on Xen PV. Fix it. Reviewewd-by: Jan Beulich <JBeulich@suse.com> Signed-off-by: Andy Lutomirski <luto@kernel.org> Cc: Andrew Cooper <andrew.cooper3@citrix.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: David Vrabel <david.vrabel@citrix.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Jan Beulich <JBeulich@suse.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Link: http://lkml.kernel.org/r/8ce12013e6e4c0a44a97e316be4a6faff31bd5ea.1458162709.git.luto@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
2016-04-18x86/PCI: Mark Broadwell-EP Home Agent & PCU as having non-compliant BARsBjorn Helgaas
[ Upstream commit b894157145e4ac7598d7062bc93320898a5e059e ] The Home Agent and PCU PCI devices in Broadwell-EP have a non-BAR register where a BAR should be. We don't know what the side effects of sizing the "BAR" would be, and we don't know what address space the "BAR" might appear to describe. Mark these devices as having non-compliant BARs so the PCI core doesn't touch them. Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Tested-by: Andi Kleen <ak@linux.intel.com> CC: stable@vger.kernel.org Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
2016-04-18perf/x86/intel: Add definition for PT PMI bitStephane Eranian
[ Upstream commit 5690ae28e472d25e330ad0c637a5cea3fc39fb32 ] This patch adds a definition for GLOBAL_OVFL_STATUS bit 55 which is used with the Processor Trace (PT) feature. Signed-off-by: Stephane Eranian <eranian@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: <stable@vger.kernel.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: adrian.hunter@intel.com Cc: kan.liang@intel.com Cc: namhyung@kernel.org Link: http://lkml.kernel.org/r/1457034642-21837-2-git-send-email-eranian@google.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
2016-04-18x86: Add new MSRs and MSR bits used for Intel Skylake PMU supportAndi Kleen
[ Upstream commit b83ff1c8617aac03a1cf807aafa848fe0f0908f2 ] Add new MSRs (LBR_INFO) and some new MSR bits used by the Intel Skylake PMU driver. Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: eranian@google.com Link: http://lkml.kernel.org/r/1431285767-27027-4-git-send-email-andi@firstfloor.org Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
2016-04-18KVM: i8254: change PIT discard tick policyRadim Krčmář
[ Upstream commit 7dd0fdff145c5be7146d0ac06732ae3613412ac1 ] Discard policy uses ack_notifiers to prevent injection of PIT interrupts before EOI from the last one. This patch changes the policy to always try to deliver the interrupt, which makes a difference when its vector is in ISR. Old implementation would drop the interrupt, but proposed one injects to IRR, like real hardware would. The old policy breaks legacy NMI watchdogs, where PIT is used through virtual wire (LVT0): PIT never sends an interrupt before receiving EOI, thus a guest deadlock with disabled interrupts will stop NMIs. Note that NMI doesn't do EOI, so PIT also had to send a normal interrupt through IOAPIC. (KVM's PIT is deeply rotten and luckily not used much in modern systems.) Even though there is a chance of regressions, I think we can fix the LVT0 NMI bug without introducing a new tick policy. Cc: <stable@vger.kernel.org> Reported-by: Yuki Shibuya <shibuya.yk@ncos.nec.co.jp> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
2016-04-18x86/iopl/64: Properly context-switch IOPL on Xen PVAndy Lutomirski
commit b7a584598aea7ca73140cb87b40319944dd3393f upstream. On Xen PV, regs->flags doesn't reliably reflect IOPL and the exit-to-userspace code doesn't change IOPL. We need to context switch it manually. I'm doing this without going through paravirt because this is specific to Xen PV. After the dust settles, we can merge this with the 32-bit code, tidy up the iopl syscall implementation, and remove the set_iopl pvop entirely. Fixes XSA-171. Reviewewd-by: Jan Beulich <JBeulich@suse.com> Signed-off-by: Andy Lutomirski <luto@kernel.org> Cc: Andrew Cooper <andrew.cooper3@citrix.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: David Vrabel <david.vrabel@citrix.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Jan Beulich <JBeulich@suse.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/693c3bd7aeb4d3c27c92c622b7d0f554a458173c.1458162709.git.luto@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> [ kamal: backport to 4.2-stable: no X86_FEATURE_XENPV so just call xen_pv_domain() directly ] Acked-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Kamal Mostafa <kamal@canonical.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
2016-04-13KVM: x86: Reload pit counters for all channels when restoring stateAndrew Honig
[ Upstream commit 0185604c2d82c560dab2f2933a18f797e74ab5a8 ] Currently if userspace restores the pit counters with a count of 0 on channels 1 or 2 and the guest attempts to read the count on those channels, then KVM will perform a mod of 0 and crash. This will ensure that 0 values are converted to 65536 as per the spec. This is CVE-2015-7513. Signed-off-by: Andy Honig <ahonig@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
2016-04-13x86/entry/compat: Add missing CLAC to entry_INT80_32Andy Lutomirski
[ Upstream commit 3d44d51bd339766f0178f0cf2e8d048b4a4872aa ] This doesn't seem to fix a regression -- I don't think the CLAC was ever there. I double-checked in a debugger: entries through the int80 gate do not automatically clear AC. Stable maintainers: I can provide a backport to 4.3 and earlier if needed. This needs to be backported all the way to 3.10. Reported-by: Brian Gerst <brgerst@gmail.com> Signed-off-by: Andy Lutomirski <luto@kernel.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: <stable@vger.kernel.org> # v3.10 and later Fixes: 63bcff2a307b ("x86, smap: Add STAC and CLAC instructions to control user space access") Link: http://lkml.kernel.org/r/b02b7e71ae54074be01fc171cbd4b72517055c0e.1456345086.git.luto@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
2016-03-22KVM: MMU: fix ept=0/pte.u=1/pte.w=0/CR0.WP=0/CR4.SMEP=1/EFER.NX=0 comboPaolo Bonzini
[ Upstream commit 844a5fe219cf472060315971e15cbf97674a3324 ] Yes, all of these are needed. :) This is admittedly a bit odd, but kvm-unit-tests access.flat tests this if you run it with "-cpu host" and of course ept=0. KVM runs the guest with CR0.WP=1, so it must handle supervisor writes specially when pte.u=1/pte.w=0/CR0.WP=0. Such writes cause a fault when U=1 and W=0 in the SPTE, but they must succeed because CR0.WP=0. When KVM gets the fault, it sets U=0 and W=1 in the shadow PTE and restarts execution. This will still cause a user write to fault, while supervisor writes will succeed. User reads will fault spuriously now, and KVM will then flip U and W again in the SPTE (U=1, W=0). User reads will be enabled and supervisor writes disabled, going back to the originary situation where supervisor writes fault spuriously. When SMEP is in effect, however, U=0 will enable kernel execution of this page. To avoid this, KVM also sets NX=1 in the shadow PTE together with U=0. If the guest has not enabled NX, the result is a continuous stream of page faults due to the NX bit being reserved. The fix is to force EFER.NX=1 even if the CPU is taking care of the EFER switch. (All machines with SMEP have the CPU_LOAD_IA32_EFER vm-entry control, so they do not use user-return notifiers for EFER---if they did, EFER.NX would be forced to the same value as the host). There is another bug in the reserved bit check, which I've split to a separate patch for easier application to stable kernels. Cc: stable@vger.kernel.org Cc: Andy Lutomirski <luto@amacapital.net> Reviewed-by: Xiao Guangrong <guangrong.xiao@linux.intel.com> Fixes: f6577a5fa15d82217ca73c74cd2dcbc0f6c781dd Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
2016-03-22KVM: VMX: disable PEBS before a guest entryRadim Krčmář
[ Upstream commit 7099e2e1f4d9051f31bbfa5803adf954bb5d76ef ] Linux guests on Haswell (and also SandyBridge and Broadwell, at least) would crash if you decided to run a host command that uses PEBS, like perf record -e 'cpu/mem-stores/pp' -a This happens because KVM is using VMX MSR switching to disable PEBS, but SDM [2015-12] 18.4.4.4 Re-configuring PEBS Facilities explains why it isn't safe: When software needs to reconfigure PEBS facilities, it should allow a quiescent period between stopping the prior event counting and setting up a new PEBS event. The quiescent period is to allow any latent residual PEBS records to complete its capture at their previously specified buffer address (provided by IA32_DS_AREA). There might not be a quiescent period after the MSR switch, so a CPU ends up using host's MSR_IA32_DS_AREA to access an area in guest's memory. (Or MSR switching is just buggy on some models.) The guest can learn something about the host this way: If the guest doesn't map address pointed by MSR_IA32_DS_AREA, it results in #PF where we leak host's MSR_IA32_DS_AREA through CR2. After that, a malicious guest can map and configure memory where MSR_IA32_DS_AREA is pointing and can therefore get an output from host's tracing. This is not a critical leak as the host must initiate with PEBS tracing and I have not been able to get a record from more than one instruction before vmentry in vmx_vcpu_run() (that place has most registers already overwritten with guest's). We could disable PEBS just few instructions before vmentry, but disabling it earlier shouldn't affect host tracing too much. We also don't need to switch MSR_IA32_PEBS_ENABLE on VMENTRY, but that optimization isn't worth its code, IMO. (If you are implementing PEBS for guests, be sure to handle the case where both host and guest enable PEBS, because this patch doesn't.) Fixes: 26a4f3c08de4 ("perf/x86: disable PEBS on a guest entry.") Cc: <stable@vger.kernel.org> Reported-by: Jiří Olša <jolsa@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
2016-03-14KVM: x86: move steal time initialization to vcpu entry timeMarcelo Tosatti
[ Upstream commit 7cae2bedcbd4680b155999655e49c27b9cf020fa ] As reported at https://bugs.launchpad.net/qemu/+bug/1494350, it is possible to have vcpu->arch.st.last_steal initialized from a thread other than vcpu thread, say the iothread, via KVM_SET_MSRS. Which can cause an overflow later (when subtracting from vcpu threads sched_info.run_delay). To avoid that, move steal time accumulation to vcpu entry time, before copying steal time data to guest. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Reviewed-by: David Matlack <dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
2016-03-09PM / sleep / x86: Fix crash on graph trace through x86 suspendTodd E Brandt
[ Upstream commit 92f9e179a702a6adbc11e2fedc76ecd6ffc9e3f7 ] Pause/unpause graph tracing around do_suspend_lowlevel as it has inconsistent call/return info after it jumps to the wakeup vector. The graph trace buffer will otherwise become misaligned and may eventually crash and hang on suspend. To reproduce the issue and test the fix: Run a function_graph trace over suspend/resume and set the graph function to suspend_devices_and_enter. This consistently hangs the system without this fix. Signed-off-by: Todd Brandt <todd.e.brandt@linux.intel.com> Cc: All applicable <stable@vger.kernel.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
2016-03-09x86/mpx: Fix off-by-one comparison with nr_registersColin Ian King
[ Upstream commit 9bf148cb0812595bfdf5100bd2c07e9bec9c6ef5 ] In the unlikely event that regno == nr_registers then we get an array overrun on regoff because the invalid register check is currently off-by-one. Fix this with a check that regno is >= nr_registers instead. Detected with static analysis using CoverityScan. Fixes: fcc7ffd67991 "x86, mpx: Decode MPX instruction to get bound violation information" Signed-off-by: Colin Ian King <colin.king@canonical.com> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Borislav Petkov <bp@alien8.de> Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com> Cc: stable@vger.kernel.org Link: http://lkml.kernel.org/r/1456512931-3388-1-git-send-email-colin.king@canonical.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
2016-03-09KVM: x86: fix root cause for missed hardware breakpointsPaolo Bonzini
[ Upstream commit 70e4da7a8ff62f2775337b705f45c804bb450454 ] Commit 172b2386ed16 ("KVM: x86: fix missed hardware breakpoints", 2016-02-10) worked around a case where the debug registers are not loaded correctly on preemption and on the first entry to KVM_RUN. However, Xiao Guangrong pointed out that the root cause must be that KVM_DEBUGREG_BP_ENABLED is not being set correctly. This can indeed happen due to the lazy debug exit mechanism, which does not call kvm_update_dr7. Fix it by replacing the existing loop (more or less equivalent to kvm_update_dr0123) with calls to all the kvm_update_dr* functions. Cc: stable@vger.kernel.org # 4.1+ Fixes: 172b2386ed16a9143d9a456aae5ec87275c61489 Reviewed-by: Xiao Guangrong <guangrong.xiao@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
2016-03-09KVM: x86: MMU: fix ubsan index-out-of-range warningMike Krinkin
[ Upstream commit 17e4bce0ae63c7e03f3c7fa8d80890e7af3d4971 ] Ubsan reports the following warning due to a typo in update_accessed_dirty_bits template, the patch fixes the typo: [ 168.791851] ================================================================================ [ 168.791862] UBSAN: Undefined behaviour in arch/x86/kvm/paging_tmpl.h:252:15 [ 168.791866] index 4 is out of range for type 'u64 [4]' [ 168.791871] CPU: 0 PID: 2950 Comm: qemu-system-x86 Tainted: G O L 4.5.0-rc5-next-20160222 #7 [ 168.791873] Hardware name: LENOVO 23205NG/23205NG, BIOS G2ET95WW (2.55 ) 07/09/2013 [ 168.791876] 0000000000000000 ffff8801cfcaf208 ffffffff81c9f780 0000000041b58ab3 [ 168.791882] ffffffff82eb2cc1 ffffffff81c9f6b4 ffff8801cfcaf230 ffff8801cfcaf1e0 [ 168.791886] 0000000000000004 0000000000000001 0000000000000000 ffffffffa1981600 [ 168.791891] Call Trace: [ 168.791899] [<ffffffff81c9f780>] dump_stack+0xcc/0x12c [ 168.791904] [<ffffffff81c9f6b4>] ? _atomic_dec_and_lock+0xc4/0xc4 [ 168.791910] [<ffffffff81da9e81>] ubsan_epilogue+0xd/0x8a [ 168.791914] [<ffffffff81daafa2>] __ubsan_handle_out_of_bounds+0x15c/0x1a3 [ 168.791918] [<ffffffff81daae46>] ? __ubsan_handle_shift_out_of_bounds+0x2bd/0x2bd [ 168.791922] [<ffffffff811287ef>] ? get_user_pages_fast+0x2bf/0x360 [ 168.791954] [<ffffffffa1794050>] ? kvm_largepages_enabled+0x30/0x30 [kvm] [ 168.791958] [<ffffffff81128530>] ? __get_user_pages_fast+0x360/0x360 [ 168.791987] [<ffffffffa181b818>] paging64_walk_addr_generic+0x1b28/0x2600 [kvm] [ 168.792014] [<ffffffffa1819cf0>] ? init_kvm_mmu+0x1100/0x1100 [kvm] [ 168.792019] [<ffffffff8129e350>] ? debug_check_no_locks_freed+0x350/0x350 [ 168.792044] [<ffffffffa1819cf0>] ? init_kvm_mmu+0x1100/0x1100 [kvm] [ 168.792076] [<ffffffffa181c36d>] paging64_gva_to_gpa+0x7d/0x110 [kvm] [ 168.792121] [<ffffffffa181c2f0>] ? paging64_walk_addr_generic+0x2600/0x2600 [kvm] [ 168.792130] [<ffffffff812e848b>] ? debug_lockdep_rcu_enabled+0x7b/0x90 [ 168.792178] [<ffffffffa17d9a4a>] emulator_read_write_onepage+0x27a/0x1150 [kvm] [ 168.792208] [<ffffffffa1794d44>] ? __kvm_read_guest_page+0x54/0x70 [kvm] [ 168.792234] [<ffffffffa17d97d0>] ? kvm_task_switch+0x160/0x160 [kvm] [ 168.792238] [<ffffffff812e848b>] ? debug_lockdep_rcu_enabled+0x7b/0x90 [ 168.792263] [<ffffffffa17daa07>] emulator_read_write+0xe7/0x6d0 [kvm] [ 168.792290] [<ffffffffa183b620>] ? em_cr_write+0x230/0x230 [kvm] [ 168.792314] [<ffffffffa17db005>] emulator_write_emulated+0x15/0x20 [kvm] [ 168.792340] [<ffffffffa18465f8>] segmented_write+0xf8/0x130 [kvm] [ 168.792367] [<ffffffffa1846500>] ? em_lgdt+0x20/0x20 [kvm] [ 168.792374] [<ffffffffa14db512>] ? vmx_read_guest_seg_ar+0x42/0x1e0 [kvm_intel] [ 168.792400] [<ffffffffa1846d82>] writeback+0x3f2/0x700 [kvm] [ 168.792424] [<ffffffffa1846990>] ? em_sidt+0xa0/0xa0 [kvm] [ 168.792449] [<ffffffffa185554d>] ? x86_decode_insn+0x1b3d/0x4f70 [kvm] [ 168.792474] [<ffffffffa1859032>] x86_emulate_insn+0x572/0x3010 [kvm] [ 168.792499] [<ffffffffa17e71dd>] x86_emulate_instruction+0x3bd/0x2110 [kvm] [ 168.792524] [<ffffffffa17e6e20>] ? reexecute_instruction.part.110+0x2e0/0x2e0 [kvm] [ 168.792532] [<ffffffffa14e9a81>] handle_ept_misconfig+0x61/0x460 [kvm_intel] [ 168.792539] [<ffffffffa14e9a20>] ? handle_pause+0x450/0x450 [kvm_intel] [ 168.792546] [<ffffffffa15130ea>] vmx_handle_exit+0xd6a/0x1ad0 [kvm_intel] [ 168.792572] [<ffffffffa17f6a6c>] ? kvm_arch_vcpu_ioctl_run+0xbdc/0x6090 [kvm] [ 168.792597] [<ffffffffa17f6bcd>] kvm_arch_vcpu_ioctl_run+0xd3d/0x6090 [kvm] [ 168.792621] [<ffffffffa17f6a6c>] ? kvm_arch_vcpu_ioctl_run+0xbdc/0x6090 [kvm] [ 168.792627] [<ffffffff8293b530>] ? __ww_mutex_lock_interruptible+0x1630/0x1630 [ 168.792651] [<ffffffffa17f5e90>] ? kvm_arch_vcpu_runnable+0x4f0/0x4f0 [kvm] [ 168.792656] [<ffffffff811eeb30>] ? preempt_notifier_unregister+0x190/0x190 [ 168.792681] [<ffffffffa17e0447>] ? kvm_arch_vcpu_load+0x127/0x650 [kvm] [ 168.792704] [<ffffffffa178e9a3>] kvm_vcpu_ioctl+0x553/0xda0 [kvm] [ 168.792727] [<ffffffffa178e450>] ? vcpu_put+0x40/0x40 [kvm] [ 168.792732] [<ffffffff8129e350>] ? debug_check_no_locks_freed+0x350/0x350 [ 168.792735] [<ffffffff82946087>] ? _raw_spin_unlock+0x27/0x40 [ 168.792740] [<ffffffff8163a943>] ? handle_mm_fault+0x1673/0x2e40 [ 168.792744] [<ffffffff8129daa8>] ? trace_hardirqs_on_caller+0x478/0x6c0 [ 168.792747] [<ffffffff8129dcfd>] ? trace_hardirqs_on+0xd/0x10 [ 168.792751] [<ffffffff812e848b>] ? debug_lockdep_rcu_enabled+0x7b/0x90 [ 168.792756] [<ffffffff81725a80>] do_vfs_ioctl+0x1b0/0x12b0 [ 168.792759] [<ffffffff817258d0>] ? ioctl_preallocate+0x210/0x210 [ 168.792763] [<ffffffff8174aef3>] ? __fget+0x273/0x4a0 [ 168.792766] [<ffffffff8174acd0>] ? __fget+0x50/0x4a0 [ 168.792770] [<ffffffff8174b1f6>] ? __fget_light+0x96/0x2b0 [ 168.792773] [<ffffffff81726bf9>] SyS_ioctl+0x79/0x90 [ 168.792777] [<ffffffff82946880>] entry_SYSCALL_64_fastpath+0x23/0xc1 [ 168.792780] ================================================================================ Signed-off-by: Mike Krinkin <krinkin.m.u@gmail.com> Reviewed-by: Xiao Guangrong <guangrong.xiao@linux.intel.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
2016-03-07KVM: x86: fix missed hardware breakpointsPaolo Bonzini
[ Upstream commit 172b2386ed16a9143d9a456aae5ec87275c61489 ] Sometimes when setting a breakpoint a process doesn't stop on it. This is because the debug registers are not loaded correctly on VCPU load. The following simple reproducer from Oleg Nesterov tries using debug registers in two threads. To see the bug, run a 2-VCPU guest with "taskset -c 0" and run "./bp 0 1" inside the guest. #include <unistd.h> #include <signal.h> #include <stdlib.h> #include <stdio.h> #include <sys/wait.h> #include <sys/ptrace.h> #include <sys/user.h> #include <asm/debugreg.h> #include <assert.h> #define offsetof(TYPE, MEMBER) ((size_t) &((TYPE *)0)->MEMBER) unsigned long encode_dr7(int drnum, int enable, unsigned int type, unsigned int len) { unsigned long dr7; dr7 = ((len | type) & 0xf) << (DR_CONTROL_SHIFT + drnum * DR_CONTROL_SIZE); if (enable) dr7 |= (DR_GLOBAL_ENABLE << (drnum * DR_ENABLE_SIZE)); return dr7; } int write_dr(int pid, int dr, unsigned long val) { return ptrace(PTRACE_POKEUSER, pid, offsetof (struct user, u_debugreg[dr]), val); } void set_bp(pid_t pid, void *addr) { unsigned long dr7; assert(write_dr(pid, 0, (long)addr) == 0); dr7 = encode_dr7(0, 1, DR_RW_EXECUTE, DR_LEN_1); assert(write_dr(pid, 7, dr7) == 0); } void *get_rip(int pid) { return (void*)ptrace(PTRACE_PEEKUSER, pid, offsetof(struct user, regs.rip), 0); } void test(int nr) { void *bp_addr = &&label + nr, *bp_hit; int pid; printf("test bp %d\n", nr); assert(nr < 16); // see 16 asm nops below pid = fork(); if (!pid) { assert(ptrace(PTRACE_TRACEME, 0,0,0) == 0); kill(getpid(), SIGSTOP); for (;;) { label: asm ( "nop; nop; nop; nop;" "nop; nop; nop; nop;" "nop; nop; nop; nop;" "nop; nop; nop; nop;" ); } } assert(pid == wait(NULL)); set_bp(pid, bp_addr); for (;;) { assert(ptrace(PTRACE_CONT, pid, 0, 0) == 0); assert(pid == wait(NULL)); bp_hit = get_rip(pid); if (bp_hit != bp_addr) fprintf(stderr, "ERR!! hit wrong bp %ld != %d\n", bp_hit - &&label, nr); } } int main(int argc, const char *argv[]) { while (--argc) { int nr = atoi(*++argv); if (!fork()) test(nr); } while (wait(NULL) > 0) ; return 0; } Cc: stable@vger.kernel.org Suggested-by: Nadav Amit <namit@cs.technion.ac.il> Reported-by: Andrey Wagin <avagin@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
2016-03-04x86/mm: Fix vmalloc_fault() to handle large pages properlyToshi Kani
[ Upstream commit f4eafd8bcd5229e998aa252627703b8462c3b90f ] A kernel page fault oops with the callstack below was observed when a read syscall was made to a pmem device after a huge amount (>512GB) of vmalloc ranges was allocated by ioremap() on a x86_64 system: BUG: unable to handle kernel paging request at ffff880840000ff8 IP: vmalloc_fault+0x1be/0x300 PGD c7f03a067 PUD 0 Oops: 0000 [#1] SM Call Trace: __do_page_fault+0x285/0x3e0 do_page_fault+0x2f/0x80 ? put_prev_entity+0x35/0x7a0 page_fault+0x28/0x30 ? memcpy_erms+0x6/0x10 ? schedule+0x35/0x80 ? pmem_rw_bytes+0x6a/0x190 [nd_pmem] ? schedule_timeout+0x183/0x240 btt_log_read+0x63/0x140 [nd_btt] : ? __symbol_put+0x60/0x60 ? kernel_read+0x50/0x80 SyS_finit_module+0xb9/0xf0 entry_SYSCALL_64_fastpath+0x1a/0xa4 Since v4.1, ioremap() supports large page (pud/pmd) mappings in x86_64 and PAE. vmalloc_fault() however assumes that the vmalloc range is limited to pte mappings. vmalloc faults do not normally happen in ioremap'd ranges since ioremap() sets up the kernel page tables, which are shared by user processes. pgd_ctor() sets the kernel's PGD entries to user's during fork(). When allocation of the vmalloc ranges crosses a 512GB boundary, ioremap() allocates a new pud table and updates the kernel PGD entry to point it. If user process's PGD entry does not have this update yet, a read/write syscall to the range will cause a vmalloc fault, which hits the Oops above as it does not handle a large page properly. Following changes are made to vmalloc_fault(). 64-bit: - No change for the PGD sync operation as it handles large pages already. - Add pud_huge() and pmd_huge() to the validation code to handle large pages. - Change pud_page_vaddr() to pud_pfn() since an ioremap range is not directly mapped (while the if-statement still works with a bogus addr). - Change pmd_page() to pmd_pfn() since an ioremap range is not backed by struct page (while the if-statement still works with a bogus addr). 32-bit: - No change for the sync operation since the index3 PGD entry covers the entire vmalloc range, which is always valid. (A separate change to sync PGD entry is necessary if this memory layout is changed regardless of the page size.) - Add pmd_huge() to the validation code to handle large pages. This is for completeness since vmalloc_fault() won't happen in ioremap'd ranges as its PGD entry is always valid. Reported-by: Henning Schild <henning.schild@siemens.com> Signed-off-by: Toshi Kani <toshi.kani@hpe.com> Acked-by: Borislav Petkov <bp@alien8.de> Cc: <stable@vger.kernel.org> # 4.1+ Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Luis R. Rodriguez <mcgrof@suse.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Toshi Kani <toshi.kani@hp.com> Cc: linux-mm@kvack.org Cc: linux-nvdimm@lists.01.org Link: http://lkml.kernel.org/r/1455758214-24623-1-git-send-email-toshi.kani@hpe.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
2016-02-15x86/mm/pat: Avoid truncation when converting cpa->numpages to addressMatt Fleming
[ Upstream commit 742563777e8da62197d6cb4b99f4027f59454735 ] There are a couple of nasty truncation bugs lurking in the pageattr code that can be triggered when mapping EFI regions, e.g. when we pass a cpa->pgd pointer. Because cpa->numpages is a 32-bit value, shifting left by PAGE_SHIFT will truncate the resultant address to 32-bits. Viorel-Cătălin managed to trigger this bug on his Dell machine that provides a ~5GB EFI region which requires 1236992 pages to be mapped. When calling populate_pud() the end of the region gets calculated incorrectly in the following buggy expression, end = start + (cpa->numpages << PAGE_SHIFT); And only 188416 pages are mapped. Next, populate_pud() gets invoked for a second time because of the loop in __change_page_attr_set_clr(), only this time no pages get mapped because shifting the remaining number of pages (1048576) by PAGE_SHIFT is zero. At which point the loop in __change_page_attr_set_clr() spins forever because we fail to map progress. Hitting this bug depends very much on the virtual address we pick to map the large region at and how many pages we map on the initial run through the loop. This explains why this issue was only recently hit with the introduction of commit a5caa209ba9c ("x86/efi: Fix boot crash by mapping EFI memmap entries bottom-up at runtime, instead of top-down") It's interesting to note that safe uses of cpa->numpages do exist in the pageattr code. If instead of shifting ->numpages we multiply by PAGE_SIZE, no truncation occurs because PAGE_SIZE is a UL value, and so the result is unsigned long. To avoid surprises when users try to convert very large cpa->numpages values to addresses, change the data type from 'int' to 'unsigned long', thereby making it suitable for shifting by PAGE_SHIFT without any type casting. The alternative would be to make liberal use of casting, but that is far more likely to cause problems in the future when someone adds more code and fails to cast properly; this bug was difficult enough to track down in the first place. Reported-and-tested-by: Viorel-Cătălin Răpițeanu <rapiteanu.catalin@gmail.com> Acked-by: Borislav Petkov <bp@alien8.de> Cc: Sai Praneeth Prakhya <sai.praneeth.prakhya@intel.com> Cc: <stable@vger.kernel.org> Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk> Link: https://bugzilla.kernel.org/show_bug.cgi?id=110131 Link: http://lkml.kernel.org/r/1454067370-10374-1-git-send-email-matt@codeblueprint.co.uk Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
2016-02-15x86/mm: Fix types used in pgprot cacheability flags translationsJan Beulich
[ Upstream commit 3625c2c234ef66acf21a72d47a5ffa94f6c5ebf2 ] For PAE kernels "unsigned long" is not suitable to hold page protection flags, since _PAGE_NX doesn't fit there. This is the reason for quite a few W+X pages getting reported as insecure during boot (observed namely for the entire initrd range). Fixes: 281d4078be ("x86: Make page cache mode a real type") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Juergen Gross <JGross@suse.com> Cc: stable@vger.kernel.org Link: http://lkml.kernel.org/r/56A7635602000078000CAFF1@prv-mh.provo.novell.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
2016-01-31x86/mm: Improve switch_mm() barrier commentsAndy Lutomirski
commit 4eaffdd5a5fe6ff9f95e1ab4de1ac904d5e0fa8b upstream. My previous comments were still a bit confusing and there was a typo. Fix it up. Reported-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andy Lutomirski <luto@kernel.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Rik van Riel <riel@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: 71b3c126e611 ("x86/mm: Add barriers and document switch_mm()-vs-flush synchronization") Link: http://lkml.kernel.org/r/0a0b43cdcdd241c5faaaecfbcc91a155ddedc9a1.1452631609.git.luto@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-01-31x86/mm: Add barriers and document switch_mm()-vs-flush synchronizationAndy Lutomirski
commit 71b3c126e61177eb693423f2e18a1914205b165e upstream. When switch_mm() activates a new PGD, it also sets a bit that tells other CPUs that the PGD is in use so that TLB flush IPIs will be sent. In order for that to work correctly, the bit needs to be visible prior to loading the PGD and therefore starting to fill the local TLB. Document all the barriers that make this work correctly and add a couple that were missing. Signed-off-by: Andy Lutomirski <luto@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-mm@kvack.org Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-01-31x86/boot: Double BOOT_HEAP_SIZE to 64KBH.J. Lu
commit 8c31902cffc4d716450be549c66a67a8a3dd479c upstream. When decompressing kernel image during x86 bootup, malloc memory for ELF program headers may run out of heap space, which leads to system halt. This patch doubles BOOT_HEAP_SIZE to 64KB. Tested with 32-bit kernel which failed to boot without this patch. Signed-off-by: H.J. Lu <hjl.tools@gmail.com> Acked-by: H. Peter Anvin <hpa@zytor.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-01-31x86/reboot/quirks: Add iMac10,1 to pci_reboot_dmi_table[]Mario Kleiner
commit 2f0c0b2d96b1205efb14347009748d786c2d9ba5 upstream. Without the reboot=pci method, the iMac 10,1 simply hangs after printing "Restarting system" at the point when it should reboot. This fixes it. Signed-off-by: Mario Kleiner <mario.kleiner.de@gmail.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Jones <davej@codemonkey.org.uk> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1450466646-26663-1-git-send-email-mario.kleiner.de@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-01-31KVM: x86: correctly print #AC in tracesPaolo Bonzini
commit aba2f06c070f604e388cf77b1dcc7f4cf4577eb0 upstream. Poor #AC was so unimportant until a few days ago that we were not even tracing its name correctly. But now it's all over the place. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-01-31KVM: x86: expose MSR_TSC_AUX to userspacePaolo Bonzini
commit 9dbe6cf941a6fe82933aef565e4095fb10f65023 upstream. If we do not do this, it is not properly saved and restored across migration. Windows notices due to its self-protection mechanisms, and is very upset about it (blue screen of death). Cc: Radim Krcmar <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-01-31KVM: svm: unconditionally intercept #DBPaolo Bonzini
commit cbdb967af3d54993f5814f1cee0ed311a055377d upstream. This is needed to avoid the possibility that the guest triggers an infinite stream of #DB exceptions (CVE-2015-8104). VMX is not affected: because it does not save DR6 in the VMCS, it already intercepts #DB unconditionally. Reported-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-01-31KVM: VMX: fix SMEP and SMAP without EPTRadim Krčmář
commit 656ec4a4928a3db7d16e5cb9bce351a478cfd3d5 upstream. The comment in code had it mostly right, but we enable paging for emulated real mode regardless of EPT. Without EPT (which implies emulated real mode), secondary VCPUs won't start unless we disable SM[AE]P when the guest doesn't use paging. Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-01-31x86/xen: don't reset vcpu_info on a cancelled suspendOuyang Zhaowei (Charles)
commit 6a1f513776b78c994045287073e55bae44ed9f8c upstream. On a cancelled suspend the vcpu_info location does not change (it's still in the per-cpu area registered by xen_vcpu_setup()). So do not call xen_hvm_init_shared_info() which would make the kernel think its back in the shared info. With the wrong vcpu_info, events cannot be received and the domain will hang after a cancelled suspend. Signed-off-by: Charles Ouyang <ouyangzhaowei@huawei.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: David Vrabel <david.vrabel@citrix.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-01-31x86/signal: Fix restart_syscall number for x32 tasksDmitry V. Levin
commit 22eab1108781eff09961ae7001704f7bd8fb1dce upstream. When restarting a syscall with regs->ax == -ERESTART_RESTARTBLOCK, regs->ax is assigned to a restart_syscall number. For x32 tasks, this syscall number must have __X32_SYSCALL_BIT set, otherwise it will be an x86_64 syscall number instead of a valid x32 syscall number. This issue has been there since the introduction of x32. Reported-by: strace/tests/restart_syscall.test Reported-and-tested-by: Elvira Khabirova <lineprinter0@gmail.com> Signed-off-by: Dmitry V. Levin <ldv@altlinux.org> Cc: Elvira Khabirova <lineprinter0@gmail.com> Link: http://lkml.kernel.org/r/20151130215436.GA25996@altlinux.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-01-31x86/mpx: Fix instruction decoder conditionDave Hansen
commit 8e8efe0379bd93e8219ca0fc6fa80b5dd85b09cb upstream. MPX decodes instructions in order to tell which bounds register was violated. Part of this decoding involves looking at the "REX prefix" which is a special instrucion prefix used to retrofit support for new registers in to old instructions. The X86_REX_*() macros are defined to return actual bit values: #define X86_REX_R(rex) ((rex) & 4) *not* boolean values. However, the MPX code was checking for them like they were booleans. This might have led to us mis-decoding the "REX prefix" and giving false information out to userspace about bounds violations. X86_REX_B() actually is bit 1, so this is really only broken for the X86_REX_X() case. Fix the conditionals up to tolerate the non-boolean values. Fixes: fcc7ffd67991 "x86, mpx: Decode MPX instruction to get bound violation information" Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: x86@kernel.org Cc: Dave Hansen <dave@sr71.net> Link: http://lkml.kernel.org/r/20151201003113.D800C1E0@viggo.jf.intel.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-12-09x86/cpu: Fix SMAP check in PVOPS environmentsAndrew Cooper
commit 581b7f158fe0383b492acd1ce3fb4e99d4e57808 upstream. There appears to be no formal statement of what pv_irq_ops.save_fl() is supposed to return precisely. Native returns the full flags, while lguest and Xen only return the Interrupt Flag, and both have comments by the implementations stating that only the Interrupt Flag is looked at. This may have been true when initially implemented, but no longer is. To make matters worse, the Xen PVOP leaves the upper bits undefined, making the BUG_ON() undefined behaviour. Experimentally, this now trips for 32bit PV guests on Broadwell hardware. The BUG_ON() is consistent for an individual build, but not consistent for all builds. It has also been a sitting timebomb since SMAP support was introduced. Use native_save_fl() instead, which will obtain an accurate view of the AC flag. Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: David Vrabel <david.vrabel@citrix.com> Tested-by: Rusty Russell <rusty@rustcorp.com.au> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: <lguest@lists.ozlabs.org> Cc: Xen-devel <xen-devel@lists.xen.org> Link: http://lkml.kernel.org/r/1433323874-6927-1-git-send-email-andrew.cooper3@citrix.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-12-09x86/cpu: Call verify_cpu() after having entered long mode tooBorislav Petkov
commit 04633df0c43d710e5f696b06539c100898678235 upstream. When we get loaded by a 64-bit bootloader, kernel entry point is startup_64 in head_64.S. We don't trust any and all bootloaders because some will fiddle with CPU configuration so we go ahead and massage each CPU into sanity again. For example, some dell BIOSes have this XD disable feature which set IA32_MISC_ENABLE[34] and disable NX. This might be some dumb workaround for other OSes but Linux sure doesn't need it. A similar thing is present in the Surface 3 firmware - see https://bugzilla.kernel.org/show_bug.cgi?id=106051 - which sets this bit only on the BSP: # rdmsr -a 0x1a0 400850089 850089 850089 850089 I know, right?! There's not even an off switch in there. So fix all those cases by sanitizing the 64-bit entry point too. For that, make verify_cpu() callable in 64-bit mode also. Requested-and-debugged-by: "H. Peter Anvin" <hpa@zytor.com> Reported-and-tested-by: Bastien Nocera <bugzilla@hadess.net> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Matt Fleming <matt@codeblueprint.co.uk> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1446739076-21303-1-git-send-email-bp@alien8.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-12-09x86/setup: Fix low identity map for >= 2GB kernel rangeKrzysztof Mazur
commit 68accac392d859d24adcf1be3a90e41f978bd54c upstream. The commit f5f3497cad8c extended the low identity mapping. However, if the kernel uses more than 2 GB (VMSPLIT_2G_OPT or VMSPLIT_1G memory split), the normal memory mapping is overwritten by the low identity mapping causing a crash. To avoid overwritting, limit the low identity map to cover only memory before kernel range (PAGE_OFFSET). Fixes: f5f3497cad8c "x86/setup: Extend low identity map to cover whole kernel range Signed-off-by: Krzysztof Mazur <krzysiek@podlesie.net> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Laszlo Ersek <lersek@redhat.com> Cc: Matt Fleming <matt.fleming@intel.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Link: http://lkml.kernel.org/r/1446815916-22105-1-git-send-email-krzysiek@podlesie.net Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-12-09x86/setup: Extend low identity map to cover whole kernel rangePaolo Bonzini
commit f5f3497cad8c8416a74b9aaceb127908755d020a upstream. On 32-bit systems, the initial_page_table is reused by efi_call_phys_prolog as an identity map to call SetVirtualAddressMap. efi_call_phys_prolog takes care of converting the current CPU's GDT to a physical address too. For PAE kernels the identity mapping is achieved by aliasing the first PDPE for the kernel memory mapping into the first PDPE of initial_page_table. This makes the EFI stub's trick "just work". However, for non-PAE kernels there is no guarantee that the identity mapping in the initial_page_table extends as far as the GDT; in this case, accesses to the GDT will cause a page fault (which quickly becomes a triple fault). Fix this by copying the kernel mappings from swapper_pg_dir to initial_page_table twice, both at PAGE_OFFSET and at identity mapping. For some reason, this is only reproducible with QEMU's dynamic translation mode, and not for example with KVM. However, even under KVM one can clearly see that the page table is bogus: $ qemu-system-i386 -pflash OVMF.fd -M q35 vmlinuz0 -s -S -daemonize $ gdb (gdb) target remote localhost:1234 (gdb) hb *0x02858f6f Hardware assisted breakpoint 1 at 0x2858f6f (gdb) c Continuing. Breakpoint 1, 0x02858f6f in ?? () (gdb) monitor info registers ... GDT= 0724e000 000000ff IDT= fffbb000 000007ff CR0=0005003b CR2=ff896000 CR3=032b7000 CR4=00000690 ... The page directory is sane: (gdb) x/4wx 0x32b7000 0x32b7000: 0x03398063 0x03399063 0x0339a063 0x0339b063 (gdb) x/4wx 0x3398000 0x3398000: 0x00000163 0x00001163 0x00002163 0x00003163 (gdb) x/4wx 0x3399000 0x3399000: 0x00400003 0x00401003 0x00402003 0x00403003 but our particular page directory entry is empty: (gdb) x/1wx 0x32b7000 + (0x724e000 >> 22) * 4 0x32b7070: 0x00000000 [ It appears that you can skate past this issue if you don't receive any interrupts while the bogus GDT pointer is loaded, or if you avoid reloading the segment registers in general. Andy Lutomirski provides some additional insight: "AFAICT it's entirely permissible for the GDTR and/or LDT descriptor to point to unmapped memory. Any attempt to use them (segment loads, interrupts, IRET, etc) will try to access that memory as if the access came from CPL 0 and, if the access fails, will generate a valid page fault with CR2 pointing into the GDT or LDT." Up until commit 23a0d4e8fa6d ("efi: Disable interrupts around EFI calls, not in the epilog/prolog calls") interrupts were disabled around the prolog and epilog calls, and the functional GDT was re-installed before interrupts were re-enabled. Which explains why no one has hit this issue until now. ] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Reported-by: Laszlo Ersek <lersek@redhat.com> Cc: <stable@vger.kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Cc: Andy Lutomirski <luto@amacapital.net> Signed-off-by: Matt Fleming <matt.fleming@intel.com> [ Updated changelog. ] Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>