summaryrefslogtreecommitdiff
path: root/arch/x86
AgeCommit message (Collapse)Author
2014-12-06x86: kvm: use alternatives for VMCALL vs. VMMCALL if kernel text is read-onlyPaolo Bonzini
commit c1118b3602c2329671ad5ec8bdf8e374323d6343 upstream. On x86_64, kernel text mappings are mapped read-only with CONFIG_DEBUG_RODATA. In that case, KVM will fail to patch VMCALL instructions to VMMCALL as required on AMD processors. The failure mode is currently a divide-by-zero exception, which obviously is a KVM bug that has to be fixed. However, picking the right instruction between VMCALL and VMMCALL will be faster and will help if you cannot upgrade the hypervisor. Reported-by: Chris Webb <chris@arachsys.com> Tested-by: Chris Webb <chris@arachsys.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: x86@kernel.org Acked-by: Borislav Petkov <bp@suse.de> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Cc: Chris J Arges <chris.j.arges@canonical.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-12-06uprobes, x86: Fix _TIF_UPROBE vs _TIF_NOTIFY_RESUMEAndy Lutomirski
commit 82975bc6a6df743b9a01810fb32cb65d0ec5d60b upstream. x86 call do_notify_resume on paranoid returns if TIF_UPROBE is set but not on non-paranoid returns. I suspect that this is a mistake and that the code only works because int3 is paranoid. Setting _TIF_NOTIFY_RESUME in the uprobe code was probably a workaround for the x86 bug. With that bug fixed, we can remove _TIF_NOTIFY_RESUME from the uprobes code. Reported-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Acked-by: Borislav Petkov <bp@suse.de> Signed-off-by: Andy Lutomirski <luto@amacapital.net> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-12-06x86, kaslr: Handle Gold linker for finding bss/brkKees Cook
commit 70b61e362187b5fccac206506d402f3424e3e749 upstream. When building with the Gold linker, the .bss and .brk areas of vmlinux are shown as consecutive instead of having the same file offset. Allow for either state, as long as things add up correctly. Fixes: e6023367d779 ("x86, kaslr: Prevent .bss from overlaping initrd") Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de> Signed-off-by: Kees Cook <keescook@chromium.org> Cc: Junjie Mao <eternal.n08@gmail.com> Link: http://lkml.kernel.org/r/20141118001604.GA25045@www.outflux.net Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-12-06x86, mm: Set NX across entire PMD at bootKees Cook
commit 45e2a9d4701d8c624d4a4bcdd1084eae31e92f58 upstream. When setting up permissions on kernel memory at boot, the end of the PMD that was split from bss remained executable. It should be NX like the rest. This performs a PMD alignment instead of a PAGE alignment to get the correct span of memory. Before: ---[ High Kernel Mapping ]--- ... 0xffffffff8202d000-0xffffffff82200000 1868K RW GLB NX pte 0xffffffff82200000-0xffffffff82c00000 10M RW PSE GLB NX pmd 0xffffffff82c00000-0xffffffff82df5000 2004K RW GLB NX pte 0xffffffff82df5000-0xffffffff82e00000 44K RW GLB x pte 0xffffffff82e00000-0xffffffffc0000000 978M pmd After: ---[ High Kernel Mapping ]--- ... 0xffffffff8202d000-0xffffffff82200000 1868K RW GLB NX pte 0xffffffff82200000-0xffffffff82e00000 12M RW PSE GLB NX pmd 0xffffffff82e00000-0xffffffffc0000000 978M pmd [ tglx: Changed it to roundup(_brk_end, PMD_SIZE) and added a comment. We really should unmap the reminder along with the holes caused by init,initdata etc. but thats a different issue ] Signed-off-by: Kees Cook <keescook@chromium.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Toshi Kani <toshi.kani@hp.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: David Vrabel <david.vrabel@citrix.com> Cc: Wang Nan <wangnan0@huawei.com> Cc: Yinghai Lu <yinghai@kernel.org> Link: http://lkml.kernel.org/r/20141114194737.GA3091@www.outflux.net Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-12-06x86: Require exact match for 'noxsave' command line optionDave Hansen
commit 2cd3949f702692cf4c5d05b463f19cd706a92dd3 upstream. We have some very similarly named command-line options: arch/x86/kernel/cpu/common.c:__setup("noxsave", x86_xsave_setup); arch/x86/kernel/cpu/common.c:__setup("noxsaveopt", x86_xsaveopt_setup); arch/x86/kernel/cpu/common.c:__setup("noxsaves", x86_xsaves_setup); __setup() is designed to match options that take arguments, like "foo=bar" where you would have: __setup("foo", x86_foo_func...); The problem is that "noxsave" actually _matches_ "noxsaves" in the same way that "foo" matches "foo=bar". If you boot an old kernel that does not know about "noxsaves" with "noxsaves" on the command line, it will interpret the argument as "noxsave", which is not what you want at all. This makes the "noxsave" handler only return success when it finds an *exact* match. [ tglx: We really need to make __setup() more robust. ] Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dave Hansen <dave@sr71.net> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: x86@kernel.org Link: http://lkml.kernel.org/r/20141111220133.FE053984@viggo.jf.intel.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-12-06x86_64, traps: Rework bad_iretAndy Lutomirski
commit b645af2d5905c4e32399005b867987919cbfc3ae upstream. It's possible for iretq to userspace to fail. This can happen because of a bad CS, SS, or RIP. Historically, we've handled it by fixing up an exception from iretq to land at bad_iret, which pretends that the failed iret frame was really the hardware part of #GP(0) from userspace. To make this work, there's an extra fixup to fudge the gs base into a usable state. This is suboptimal because it loses the original exception. It's also buggy because there's no guarantee that we were on the kernel stack to begin with. For example, if the failing iret happened on return from an NMI, then we'll end up executing general_protection on the NMI stack. This is bad for several reasons, the most immediate of which is that general_protection, as a non-paranoid idtentry, will try to deliver signals and/or schedule from the wrong stack. This patch throws out bad_iret entirely. As a replacement, it augments the existing swapgs fudge into a full-blown iret fixup, mostly written in C. It's should be clearer and more correct. Signed-off-by: Andy Lutomirski <luto@amacapital.net> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-12-06x86_64, traps: Stop using IST for #SSAndy Lutomirski
commit 6f442be2fb22be02cafa606f1769fa1e6f894441 upstream. On a 32-bit kernel, this has no effect, since there are no IST stacks. On a 64-bit kernel, #SS can only happen in user code, on a failed iret to user space, a canonical violation on access via RSP or RBP, or a genuine stack segment violation in 32-bit kernel code. The first two cases don't need IST, and the latter two cases are unlikely fatal bugs, and promoting them to double faults would be fine. This fixes a bug in which the espfix64 code mishandles a stack segment violation. This saves 4k of memory per CPU and a tiny bit of code. Signed-off-by: Andy Lutomirski <luto@amacapital.net> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-12-06x86_64, traps: Fix the espfix64 #DF fixup and rewrite it in CAndy Lutomirski
commit af726f21ed8af2cdaa4e93098dc211521218ae65 upstream. There's nothing special enough about the espfix64 double fault fixup to justify writing it in assembly. Move it to C. This also fixes a bug: if the double fault came from an IST stack, the old asm code would return to a partially uninitialized stack frame. Fixes: 3891a04aafd668686239349ea58f3314ea2af86b Signed-off-by: Andy Lutomirski <luto@amacapital.net> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-21KVM: x86: Don't report guest userspace emulation error to userspaceNadav Amit
commit a2b9e6c1a35afcc0973acb72e591c714e78885ff upstream. Commit fc3a9157d314 ("KVM: X86: Don't report L2 emulation failures to user-space") disabled the reporting of L2 (nested guest) emulation failures to userspace due to race-condition between a vmexit and the instruction emulator. The same rational applies also to userspace applications that are permitted by the guest OS to access MMIO area or perform PIO. This patch extends the current behavior - of injecting a #UD instead of reporting it to userspace - also for guest userspace code. Signed-off-by: Nadav Amit <namit@cs.technion.ac.il> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-21x86, kaslr: Prevent .bss from overlaping initrdJunjie Mao
commit e6023367d779060fddc9a52d1f474085b2b36298 upstream. When choosing a random address, the current implementation does not take into account the reversed space for .bss and .brk sections. Thus the relocated kernel may overlap other components in memory. Here is an example of the overlap from a x86_64 kernel in qemu (the ranges of physical addresses are presented): Physical Address 0x0fe00000 --+--------------------+ <-- randomized base / | relocated kernel | vmlinux.bin | (from vmlinux.bin) | 0x1336d000 (an ELF file) +--------------------+-- \ | | \ 0x1376d870 --+--------------------+ | | relocs table | | 0x13c1c2a8 +--------------------+ .bss and .brk | | | 0x13ce6000 +--------------------+ | | | / 0x13f77000 | initrd |-- | | 0x13fef374 +--------------------+ The initrd image will then be overwritten by the memset during early initialization: [ 1.655204] Unpacking initramfs... [ 1.662831] Initramfs unpacking failed: junk in compressed archive This patch prevents the above situation by requiring a larger space when looking for a random kernel base, so that existing logic can effectively avoids the overlap. [kees: switched to perl to avoid hex translation pain in mawk vs gawk] [kees: calculated overlap without relocs table] Fixes: 82fa9637a2 ("x86, kaslr: Select random position from e820 maps") Reported-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Junjie Mao <eternal.n08@gmail.com> Signed-off-by: Kees Cook <keescook@chromium.org> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Matt Fleming <matt.fleming@intel.com> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Andi Kleen <ak@linux.intel.com> Link: http://lkml.kernel.org/r/1414762838-13067-1-git-send-email-eternal.n08@gmail.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-21x86, microcode, AMD: Fix ucode patch stashing on 32-bitBorislav Petkov
commit c0a717f23dccdb6e3b03471bc846fdc636f2b353 upstream. Save the patch while we're running on the BSP instead of later, before the initrd has been jettisoned. More importantly, on 32-bit we need to access the physical address instead of the virtual. This way we actually do find it on the APs instead of having to go through the initrd each time. Tested-by: Richard Hendershot <rshendershot@mchsi.com> Fixes: 5335ba5cf475 ("x86, microcode, AMD: Fix early ucode loading") Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-21x86, microcode: Fix accessing dis_ucode_ldr on 32-bitBorislav Petkov
commit 85be07c32496dc264661308e4d9d4e9ccaff8072 upstream. We should be accessing it through a pointer, like on the BSP. Tested-by: Richard Hendershot <rshendershot@mchsi.com> Fixes: 65cef1311d5d ("x86, microcode: Add a disable chicken bit") Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-21x86, microcode, AMD: Fix early ucode loading on 32-bitBorislav Petkov
commit 4750a0d112cbfcc744929f1530ffe3193436766c upstream. Konrad triggered the following splat below in a 32-bit guest on an AMD box. As it turns out, in save_microcode_in_initrd_amd() we're using the *physical* address of the container *after* we have enabled paging and thus we #PF in load_microcode_amd() when trying to access the microcode container in the ramdisk range. Because the ramdisk is exactly there: [ 0.000000] RAMDISK: [mem 0x35e04000-0x36ef9fff] and we fault at 0x35e04304. And since this guest doesn't relocate the ramdisk, we don't do the computation which will give us the correct virtual address and we end up with the PA. So, we should actually be using virtual addresses on 32-bit too by the time we're freeing the initrd. Do that then! Unpacking initramfs... BUG: unable to handle kernel paging request at 35d4e304 IP: [<c042e905>] load_microcode_amd+0x25/0x4a0 *pde = 00000000 Oops: 0000 [#1] SMP Modules linked in: CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.17.1-302.fc21.i686 #1 Hardware name: Xen HVM domU, BIOS 4.4.1 10/01/2014 task: f5098000 ti: f50d0000 task.ti: f50d0000 EIP: 0060:[<c042e905>] EFLAGS: 00010246 CPU: 0 EIP is at load_microcode_amd+0x25/0x4a0 EAX: 00000000 EBX: f6e9ec4c ECX: 00001ec4 EDX: 00000000 ESI: f5d4e000 EDI: 35d4e2fc EBP: f50d1ed0 ESP: f50d1e94 DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 CR0: 8005003b CR2: 35d4e304 CR3: 00e33000 CR4: 000406d0 Stack: 00000000 00000000 f50d1ebc f50d1ec4 f5d4e000 c0d7735a f50d1ed0 15a3d17f f50d1ec4 00600f20 00001ec4 bfb83203 f6e9ec4c f5d4e000 c0d7735a f50d1ed8 c0d80861 f50d1ee0 c0d80429 f50d1ef0 c0d889a9 f5d4e000 c0000000 f50d1f04 Call Trace: ? unpack_to_rootfs ? unpack_to_rootfs save_microcode_in_initrd_amd save_microcode_in_initrd free_initrd_mem populate_rootfs ? unpack_to_rootfs do_one_initcall ? unpack_to_rootfs ? repair_env_string ? proc_mkdir kernel_init_freeable kernel_init ret_from_kernel_thread ? rest_init Reported-and-tested-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> References: https://bugzilla.redhat.com/show_bug.cgi?id=1158204 Fixes: 75a1ba5b2c52 ("x86, microcode, AMD: Unify valid container checks") Signed-off-by: Borislav Petkov <bp@suse.de> Link: http://lkml.kernel.org/r/20141101100100.GA4462@pd.tnic Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-21x86, x32, audit: Fix x32's AUDIT_ARCH wrt auditAndy Lutomirski
commit 81f49a8fd7088cfcb588d182eeede862c0e3303e upstream. is_compat_task() is the wrong check for audit arch; the check should be is_ia32_task(): x32 syscalls should be AUDIT_ARCH_X86_64, not AUDIT_ARCH_I386. CONFIG_AUDITSYSCALL is currently incompatible with x32, so this has no visible effect. Signed-off-by: Andy Lutomirski <luto@amacapital.net> Link: http://lkml.kernel.org/r/a0138ed8c709882aec06e4acc30bfa9b623b8717.1409954077.git.luto@amacapital.net Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-21KVM: x86: Fix uninitialized op->type for some immediate valuesNadav Amit
commit d29b9d7ed76c0b961603ca692b8a562556a20212 upstream. The emulator could reuse an op->type from a previous instruction for some immediate values. If it mistakenly considers the operands as memory operands, it will performs a memory read and overwrite op->val. Consider for instance the ROR instruction - src2 (the number of times) would be read from memory instead of being used as immediate. Mark every immediate operand as such to avoid this problem. Fixes: c44b4c6ab80eef3a9c52c7b3f0c632942e6489aa Signed-off-by: Nadav Amit <namit@cs.technion.ac.il> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-21x86/build: Add arch/x86/purgatory/ make generated files to gitignoreShuah Khan
commit 4ea48a01bb1a99f4185b77cd90cf962730336cc4 upstream. The following generated files are missing from gitignore and show up in git status after x86_64 build. Add them to gitignore. arch/x86/purgatory/kexec-purgatory.c arch/x86/purgatory/purgatory.ro Signed-off-by: Shuah Khan <shuahkh@osg.samsung.com> Link: http://lkml.kernel.org/r/1412016116-7213-1-git-send-email-shuahkh@osg.samsung.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-14KVM: x86: Fix far-jump to non-canonical checkNadav Amit
commit 7e46dddd6f6cd5dbf3c7bd04a7e75d19475ac9f2 upstream. Commit d1442d85cc30 ("KVM: x86: Handle errors when RIP is set during far jumps") introduced a bug that caused the fix to be incomplete. Due to incorrect evaluation, far jump to segment with L bit cleared (i.e., 32-bit segment) and RIP with any of the high bits set (i.e, RIP[63:32] != 0) set may not trigger #GP. As we know, this imposes a security problem. In addition, the condition for two warnings was incorrect. Fixes: d1442d85cc30ea75f7d399474ca738e0bc96f715 Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Nadav Amit <namit@cs.technion.ac.il> [Add #ifdef CONFIG_X86_64 to avoid complaints of undefined behavior. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-14x86, intel-mid: Create IRQs for APB timers and RTC timersJiang Liu
commit f18298595aefa2c836a128ec6e0f75f39965dd81 upstream. Intel MID platforms has no legacy interrupts, so no IRQ descriptors preallocated. We need to call mp_map_gsi_to_irq() to create IRQ descriptors for APB timers and RTC timers, otherwise it may cause invalid memory access as: [ 0.116839] BUG: unable to handle kernel NULL pointer dereference at 0000003a [ 0.123803] IP: [<c1071c0e>] setup_irq+0xf/0x4d Tested-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Joerg Roedel <joro@8bytes.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: H. Peter Anvin <hpa@linux.intel.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Rafael J. Wysocki <rjw@rjwysocki.net> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: David Cohen <david.a.cohen@linux.intel.com> Link: http://lkml.kernel.org/r/1414387308-27148-3-git-send-email-jiang.liu@linux.intel.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-14x86, apic: Handle a bad TSC more gracefullyAndy Lutomirski
commit b47dcbdc5161d3d5756f430191e2840d9b855492 upstream. If the TSC is unusable or disabled, then this patch fixes: - Confusion while trying to clear old APIC interrupts. - Division by zero and incorrect programming of the TSC deadline timer. This fixes boot if the CPU has a TSC deadline timer but a missing or broken TSC. The failure to boot can be observed with qemu using -cpu qemu64,-tsc,+tsc-deadline This also happens to me in nested KVM for unknown reasons. With this patch, I can boot cleanly (although without a TSC). Signed-off-by: Andy Lutomirski <luto@amacapital.net> Cc: Bandan Das <bsd@redhat.com> Link: http://lkml.kernel.org/r/e2fa274e498c33988efac0ba8b7e3120f7f92d78.1413393027.git.luto@amacapital.net Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-14ACPI, irq, x86: Return IRQ instead of GSI in mp_register_gsi()Jiang Liu
commit b77e8f435337baa1cd15852fb9db3f6d26cd8eb7 upstream. Function mp_register_gsi() returns blindly the GSI number for the ACPI SCI interrupt. That causes a regression when the GSI for ACPI SCI is shared with other devices. The regression was caused by commit 84245af7297ced9e8fe "x86, irq, ACPI: Change __acpi_register_gsi to return IRQ number instead of GSI" and exposed on a SuperMicro system, which shares one GSI between ACPI SCI and PCI device, with following failure: http://sourceforge.net/p/linux1394/mailman/linux1394-user/?viewmonth=201410 [ 0.000000] ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 20 low level) [ 2.699224] firewire_ohci 0000:06:00.0: failed to allocate interrupt 20 Return mp_map_gsi_to_irq(gsi, 0) instead of the GSI number. Reported-and-Tested-by: Daniel Robbins <drobbins@funtoo.org> Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Joerg Roedel <joro@8bytes.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Rafael J. Wysocki <rjw@rjwysocki.net> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Len Brown <len.brown@intel.com> Cc: Pavel Machek <pavel@ucw.cz> Link: http://lkml.kernel.org/r/1414387308-27148-4-git-send-email-jiang.liu@linux.intel.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-14x86: ACPI: Do not translate GSI number if IOAPIC is disabledJiang Liu
commit 961b6a7003acec4f9d70dabc1a253b783cb74272 upstream. When IOAPIC is disabled, acpi_gsi_to_irq() should return gsi directly instead of calling mp_map_gsi_to_irq() to translate gsi to IRQ by IOAPIC. It fixes https://bugzilla.kernel.org/show_bug.cgi?id=84381. This regression was introduced with commit 6b9fb7082409 "x86, ACPI, irq: Consolidate algorithm of mapping (ioapic, pin) to IRQ number" Reported-and-Tested-by: Thomas Richter <thor@math.tu-berlin.de> Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Thomas Richter <thor@math.tu-berlin.de> Cc: rui.zhang@intel.com Cc: Rafael J. Wysocki <rjw@rjwysocki.net> Cc: Bjorn Helgaas <bhelgaas@google.com> Link: http://lkml.kernel.org/r/1413816327-12850-1-git-send-email-jiang.liu@linux.intel.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-14x86: Add cpu_detect_cache_sizes to init_intel() add Quark legacy_cache()Bryan O'Donoghue
commit aece118e487a744eafcdd0c77fe32b55ee2092a1 upstream. Intel processors which don't report cache information via cpuid(2) or cpuid(4) need quirk code in the legacy_cache_size callback to report this data. For Intel that callback is is intel_size_cache(). This patch enables calling of cpu_detect_cache_sizes() inside of init_intel() and hence the calling of the legacy_cache callback in intel_size_cache(). Adding this call will ensure that PIII Tualatin currently in intel_size_cache() and Quark SoC X1000 being added to intel_size_cache() in this patch will report their respective cache sizes. This model of calling cpu_detect_cache_sizes() is consistent with AMD/Via/Cirix/Transmeta and Centaur. Also added is a string to idenitfy the Quark as Quark SoC X1000 giving better and more descriptive output via /proc/cpuinfo Adding cpu_detect_cache_sizes to init_intel() will enable calling of intel_size_cache() on Intel processors which currently no code can reach. Therefore this patch will also re-enable reporting of PIII Tualatin cache size information as well as add Quark SoC X1000 support. Comment text and cache flow logic suggested by Thomas Gleixner Signed-off-by: Bryan O'Donoghue <pure.logic@nexus-software.ie> Cc: davej@redhat.com Cc: hmh@hmh.eng.br Link: http://lkml.kernel.org/r/1412641189-12415-3-git-send-email-pure.logic@nexus-software.ie Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Chang Rebecca Swee Fun <rebecca.swee.fun.chang@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-14x86/platform/intel/iosf: Add Braswell PCI IDDavid E. Box
commit 849f5d894383d25c49132437aa289c9a9c98d5df upstream. Add Braswell PCI ID to list of supported ID's for the IOSF driver. Signed-off-by: David E. Box <david.e.box@linux.intel.com> Link: http://lkml.kernel.org/r/1411017231-20807-2-git-send-email-david.e.box@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Chang Rebecca Swee Fun <rebecca.swee.fun.chang@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-14kvm: vmx: handle invvpid vm exit gracefullyPetr Matousek
commit a642fc305053cc1c6e47e4f4df327895747ab485 upstream. On systems with invvpid instruction support (corresponding bit in IA32_VMX_EPT_VPID_CAP MSR is set) guest invocation of invvpid causes vm exit, which is currently not handled and results in propagation of unknown exit to userspace. Fix this by installing an invvpid vm exit handler. This is CVE-2014-3646. Signed-off-by: Petr Matousek <pmatouse@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-14KVM: x86: Handle errors when RIP is set during far jumpsNadav Amit
commit d1442d85cc30ea75f7d399474ca738e0bc96f715 upstream. Far jmp/call/ret may fault while loading a new RIP. Currently KVM does not handle this case, and may result in failed vm-entry once the assignment is done. The tricky part of doing so is that loading the new CS affects the VMCS/VMCB state, so if we fail during loading the new RIP, we are left in unconsistent state. Therefore, this patch saves on 64-bit the old CS descriptor and restores it if loading RIP failed. This fixes CVE-2014-3647. Signed-off-by: Nadav Amit <namit@cs.technion.ac.il> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-14KVM: x86: Emulator fixes for eip canonical checks on near branchesNadav Amit
commit 234f3ce485d54017f15cf5e0699cff4100121601 upstream. Before changing rip (during jmp, call, ret, etc.) the target should be asserted to be canonical one, as real CPUs do. During sysret, both target rsp and rip should be canonical. If any of these values is noncanonical, a #GP exception should occur. The exception to this rule are syscall and sysenter instructions in which the assigned rip is checked during the assignment to the relevant MSRs. This patch fixes the emulator to behave as real CPUs do for near branches. Far branches are handled by the next patch. This fixes CVE-2014-3647. Signed-off-by: Nadav Amit <namit@cs.technion.ac.il> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-14KVM: x86: Fix wrong masking on relative jump/callNadav Amit
commit 05c83ec9b73c8124555b706f6af777b10adf0862 upstream. Relative jumps and calls do the masking according to the operand size, and not according to the address size as the KVM emulator does today. This patch fixes KVM behavior. Signed-off-by: Nadav Amit <namit@cs.technion.ac.il> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-14kvm: x86: don't kill guest on unknown exit reasonMichael S. Tsirkin
commit 2bc19dc3754fc066c43799659f0d848631c44cfe upstream. KVM_EXIT_UNKNOWN is a kvm bug, we don't really know whether it was triggered by a priveledged application. Let's not kill the guest: WARN and inject #UD instead. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-14KVM: x86: Check non-canonical addresses upon WRMSRNadav Amit
commit 854e8bb1aa06c578c2c9145fa6bfe3680ef63b23 upstream. Upon WRMSR, the CPU should inject #GP if a non-canonical value (address) is written to certain MSRs. The behavior is "almost" identical for AMD and Intel (ignoring MSRs that are not implemented in either architecture since they would anyhow #GP). However, IA32_SYSENTER_ESP and IA32_SYSENTER_EIP cause #GP if non-canonical address is written on Intel but not on AMD (which ignores the top 32-bits). Accordingly, this patch injects a #GP on the MSRs which behave identically on Intel and AMD. To eliminate the differences between the architecutres, the value which is written to IA32_SYSENTER_ESP and IA32_SYSENTER_EIP is turned to canonical value before writing instead of injecting a #GP. Some references from Intel and AMD manuals: According to Intel SDM description of WRMSR instruction #GP is expected on WRMSR "If the source register contains a non-canonical address and ECX specifies one of the following MSRs: IA32_DS_AREA, IA32_FS_BASE, IA32_GS_BASE, IA32_KERNEL_GS_BASE, IA32_LSTAR, IA32_SYSENTER_EIP, IA32_SYSENTER_ESP." According to AMD manual instruction manual: LSTAR/CSTAR (SYSCALL): "The WRMSR instruction loads the target RIP into the LSTAR and CSTAR registers. If an RIP written by WRMSR is not in canonical form, a general-protection exception (#GP) occurs." IA32_GS_BASE and IA32_FS_BASE (WRFSBASE/WRGSBASE): "The address written to the base field must be in canonical form or a #GP fault will occur." IA32_KERNEL_GS_BASE (SWAPGS): "The address stored in the KernelGSbase MSR must be in canonical form." This patch fixes CVE-2014-3610. Signed-off-by: Nadav Amit <namit@cs.technion.ac.il> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-14KVM: x86: Improve thread safety in pitAndy Honig
commit 2febc839133280d5a5e8e1179c94ea674489dae2 upstream. There's a race condition in the PIT emulation code in KVM. In __kvm_migrate_pit_timer the pit_timer object is accessed without synchronization. If the race condition occurs at the wrong time this can crash the host kernel. This fixes CVE-2014-3611. Signed-off-by: Andrew Honig <ahonig@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-14KVM: x86: Prevent host from panicking on shared MSR writes.Andy Honig
commit 8b3c3104c3f4f706e99365c3e0d2aa61b95f969f upstream. The previous patch blocked invalid writes directly when the MSR is written. As a precaution, prevent future similar mistakes by gracefulling handle GPs caused by writes to shared MSRs. Signed-off-by: Andrew Honig <ahonig@google.com> [Remove parts obsoleted by Nadav's patch. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-14KVM: x86: PREFETCH and HINT_NOP should have SrcMem flagNadav Amit
commit 3f6f1480d86bf9fc16c160d803ab1d006e3058d5 upstream. The decode phase of the x86 emulator assumes that every instruction with the ModRM flag, and which can be used with RIP-relative addressing, has either SrcMem or DstMem. This is not the case for several instructions - prefetch, hint-nop and clflush. Adding SrcMem|NoAccess for prefetch and hint-nop and SrcMem for clflush. This fixes CVE-2014-8480. Fixes: 41061cdb98a0bec464278b4db8e894a3121671f5 Signed-off-by: Nadav Amit <namit@cs.technion.ac.il> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-14KVM: x86: Emulator does not decode clflush wellNadav Amit
commit 13e457e0eebf0a0c82c38ceb890d93eb826d62a6 upstream. Currently, all group15 instructions are decoded as clflush (e.g., mfence, xsave). In addition, the clflush instruction requires no prefix (66/f2/f3) would exist. If prefix exists it may encode a different instruction (e.g., clflushopt). Creating a group for clflush, and different group for each prefix. This has been the case forever, but the next patch needs the cflush group in order to fix a bug introduced in 3.17. Fixes: 41061cdb98a0bec464278b4db8e894a3121671f5 Signed-off-by: Nadav Amit <namit@cs.technion.ac.il> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-14KVM: x86: Decoding guest instructions which cross page boundary may failNadav Amit
commit 08da44aedba0f493e10695fa334348a7a4f72eb3 upstream. Once an instruction crosses a page boundary, the size read from the second page disregards the common case that part of the operand resides on the first page. As a result, fetch of long insturctions may fail, and thereby cause the decoding to fail as well. Fixes: 5cfc7e0f5e5e1adf998df94f8e36edaf5d30d38e Signed-off-by: Nadav Amit <namit@cs.technion.ac.il> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-14KVM: emulate: avoid accessing NULL ctxt->memoppPaolo Bonzini
commit a430c9166312e1aa3d80bce32374233bdbfeba32 upstream. A failure to decode the instruction can cause a NULL pointer access. This is fixed simply by moving the "done" label as close as possible to the return. This fixes CVE-2014-8481. Reported-by: Andy Lutomirski <luto@amacapital.net> Fixes: 41061cdb98a0bec464278b4db8e894a3121671f5 Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-14x86, pageattr: Prevent overflow in slow_virt_to_phys() for X86_PAEDexuan Cui
commit d1cd1210834649ce1ca6bafe5ac25d2f40331343 upstream. pte_pfn() returns a PFN of long (32 bits in 32-PAE), so "long << PAGE_SHIFT" will overflow for PFNs above 4GB. Due to this issue, some Linux 32-PAE distros, running as guests on Hyper-V, with 5GB memory assigned, can't load the netvsc driver successfully and hence the synthetic network device can't work (we can use the kernel parameter mem=3000M to work around the issue). Cast pte_pfn() to phys_addr_t before shifting. Fixes: "commit d76565344512: x86, mm: Create slow_virt_to_phys()" Signed-off-by: Dexuan Cui <decui@microsoft.com> Cc: K. Y. Srinivasan <kys@microsoft.com> Cc: Haiyang Zhang <haiyangz@microsoft.com> Cc: gregkh@linuxfoundation.org Cc: linux-mm@kvack.org Cc: olaf@aepfle.de Cc: apw@canonical.com Cc: jasowang@redhat.com Cc: dave.hansen@intel.com Cc: riel@redhat.com Cc: stable@vger.kernel.org Link: http://lkml.kernel.org/r/1414580017-27444-1-git-send-email-decui@microsoft.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-14x86_64, entry: Fix out of bounds read on sysenterAndy Lutomirski
commit 653bc77af60911ead1f423e588f54fc2547c4957 upstream. Rusty noticed a Really Bad Bug (tm) in my NT fix. The entry code reads out of bounds, causing the NT fix to be unreliable. But, and this is much, much worse, if your stack is somehow just below the top of the direct map (or a hole), you read out of bounds and crash. Excerpt from the crash: [ 1.129513] RSP: 0018:ffff88001da4bf88 EFLAGS: 00010296 2b:* f7 84 24 90 00 00 00 testl $0x4000,0x90(%rsp) That read is deterministically above the top of the stack. I thought I even single-stepped through this code when I wrote it to check the offset, but I clearly screwed it up. Fixes: 8c7aa698baca ("x86_64, entry: Filter RFLAGS.NT on entry from userspace") Reported-by: Rusty Russell <rusty@ozlabs.org> Signed-off-by: Andy Lutomirski <luto@amacapital.net> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-14x86_64, entry: Filter RFLAGS.NT on entry from userspaceAndy Lutomirski
commit 8c7aa698baca5e8f1ba9edb68081f1e7a1abf455 upstream. The NT flag doesn't do anything in long mode other than causing IRET to #GP. Oddly, CPL3 code can still set NT using popf. Entry via hardware or software interrupt clears NT automatically, so the only relevant entries are fast syscalls. If user code causes kernel code to run with NT set, then there's at least some (small) chance that it could cause trouble. For example, user code could cause a call to EFI code with NT set, and who knows what would happen? Apparently some games on Wine sometimes do this (!), and, if an IRET return happens, they will segfault. That segfault cannot be handled, because signal delivery fails, too. This patch programs the CPU to clear NT on entry via SYSCALL (both 32-bit and 64-bit, by my reading of the AMD APM), and it clears NT in software on entry via SYSENTER. To save a few cycles, this borrows a trick from Jan Beulich in Xen: it checks whether NT is set before trying to clear it. As a result, it seems to have very little effect on SYSENTER performance on my machine. There's another minor bug fix in here: it looks like the CFI annotations were wrong if CONFIG_AUDITSYSCALL=n. Testers beware: on Xen, SYSENTER with NT set turns into a GPF. I haven't touched anything on 32-bit kernels. The syscall mask change comes from a variant of this patch by Anish Bhatt. Note to stable maintainers: there is no known security issue here. A misguided program can set NT and cause the kernel to try and fail to deliver SIGSEGV, crashing the program. This patch fixes Far Cry on Wine: https://bugs.winehq.org/show_bug.cgi?id=33275 Reported-by: Anish Bhatt <anish@chelsio.com> Signed-off-by: Andy Lutomirski <luto@amacapital.net> Link: http://lkml.kernel.org/r/395749a5d39a29bd3e4b35899cf3a3c1340e5595.1412189265.git.luto@amacapital.net Signed-off-by: H. Peter Anvin <hpa@zytor.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-14x86, fpu: shift drop_init_fpu() from save_xstate_sig() to handle_signal()Oleg Nesterov
commit 66463db4fc5605d51c7bb81d009d5bf30a783a2c upstream. save_xstate_sig()->drop_init_fpu() doesn't look right. setup_rt_frame() can fail after that, in this case the next setup_rt_frame() triggered by SIGSEGV won't save fpu simply because the old state was lost. This obviously mean that fpu won't be restored after sys_rt_sigreturn() from SIGSEGV handler. Shift drop_init_fpu() into !failed branch in handle_signal(). Test-case (needs -O2): #include <stdio.h> #include <signal.h> #include <unistd.h> #include <sys/syscall.h> #include <sys/mman.h> #include <pthread.h> #include <assert.h> volatile double D; void test(double d) { int pid = getpid(); for (D = d; D == d; ) { /* sys_tkill(pid, SIGHUP); asm to avoid save/reload * fp regs around "C" call */ asm ("" : : "a"(200), "D"(pid), "S"(1)); asm ("syscall" : : : "ax"); } printf("ERR!!\n"); } void sigh(int sig) { } char altstack[4096 * 10] __attribute__((aligned(4096))); void *tfunc(void *arg) { for (;;) { mprotect(altstack, sizeof(altstack), PROT_READ); mprotect(altstack, sizeof(altstack), PROT_READ|PROT_WRITE); } } int main(void) { stack_t st = { .ss_sp = altstack, .ss_size = sizeof(altstack), .ss_flags = SS_ONSTACK, }; struct sigaction sa = { .sa_handler = sigh, }; pthread_t pt; sigaction(SIGSEGV, &sa, NULL); sigaltstack(&st, NULL); sa.sa_flags = SA_ONSTACK; sigaction(SIGHUP, &sa, NULL); pthread_create(&pt, NULL, tfunc, NULL); test(123.456); return 0; } Reported-by: Bean Anderson <bean@azulsystems.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Link: http://lkml.kernel.org/r/20140902175713.GA21646@redhat.com Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-14x86, fpu: __restore_xstate_sig()->math_state_restore() needs preempt_disable()Oleg Nesterov
commit df24fb859a4e200d9324e2974229fbb7adf00aef upstream. Add preempt_disable() + preempt_enable() around math_state_restore() in __restore_xstate_sig(). Otherwise __switch_to() after __thread_fpu_begin() can overwrite fpu->state we are going to restore. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Link: http://lkml.kernel.org/r/20140902175717.GA21649@redhat.com Reviewed-by: Suresh Siddha <sbsiddha@gmail.com> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-14x86: Reject x32 executables if x32 ABI not supportedBen Hutchings
commit 0e6d3112a4e95d55cf6dca88f298d5f4b8f29bd1 upstream. It is currently possible to execve() an x32 executable on an x86_64 kernel that has only ia32 compat enabled. However all its syscalls will fail, even _exit(). This usually causes it to segfault. Change the ELF compat architecture check so that x32 executables are rejected if we don't support the x32 ABI. Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Link: http://lkml.kernel.org/r/1410120305.6822.9.camel@decadent.org.uk Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-14x86: bpf_jit: fix two bugs in eBPF JIT compilerAlexei Starovoitov
[ Upstream commit e0ee9c12157dc74e49e4731e0d07512e7d1ceb95 ] 1. JIT compiler using multi-pass approach to converge to final image size, since x86 instructions are variable length. It starts with large gaps between instructions (so some jumps may use imm32 instead of imm8) and iterates until total program size is the same as in previous pass. This algorithm works only if program size is strictly decreasing. Programs that use LD_ABS insn need additional code in prologue, but it was not emitted during 1st pass, so there was a chance that 2nd pass would adjust imm32->imm8 jump offsets to the same number of bytes as increase in prologue, which may cause algorithm to erroneously decide that size converged. Fix it by always emitting largest prologue in the first pass which is detected by oldproglen==0 check. Also change error check condition 'proglen != oldproglen' to fail gracefully. 2. while staring at the code realized that 64-byte buffer may not be enough when 1st insn is large, so increase it to 128 to avoid buffer overflow (theoretical maximum size of prologue+div is 109) and add runtime check. Fixes: 622582786c9e ("net: filter: x86: internal BPF JIT") Reported-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Tested-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-14KVM: emulator: fix execution close to the segment limitPaolo Bonzini
commit fd56e1546a5f734290cbedd2b81c518850736511 upstream. Emulation of code that is 14 bytes to the segment limit or closer (e.g. RIP = 0xFFFFFFF2 after reset) is broken because we try to read as many as 15 bytes from the beginning of the instruction, and __linearize fails when the passed (address, size) pair reaches out of the segment. To fix this, let __linearize return the maximum accessible size (clamped to 2^32-1) for usage in __do_insn_fetch_bytes, and avoid the limit check by passing zero for the desired size. For expand-down segments, __linearize is performing a redundant check. (u32)(addr.ea + size - 1) <= lim can only happen if addr.ea is close to 4GB; in this case, addr.ea + size - 1 will also fail the check against the upper bound of the segment (which is provided by the D/B bit). After eliminating the redundant check, it is simple to compute the *max_size for expand-down segments too. Now that the limit check is done in __do_insn_fetch_bytes, we want to inject a general protection fault there if size < op_size (like __linearize would have done), instead of just aborting. This fixes booting Tiano Core from emulated flash with EPT disabled. Fixes: 719d5a9b2487e0562f178f61e323c3dc18a8b200 Reported-by: Borislav Petkov <bp@suse.de> Tested-by: Borislav Petkov <bp@suse.de> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-10-30x86/intel/quark: Switch off CR4.PGE so TLB flush uses CR3 insteadBryan O'Donoghue
commit ee1b5b165c0a2f04d2107e634e51f05d0eb107de upstream. Quark x1000 advertises PGE via the standard CPUID method PGE bits exist in Quark X1000's PTEs. In order to flush an individual PTE it is necessary to reload CR3 irrespective of the PTE.PGE bit. See Quark Core_DevMan_001.pdf section 6.4.11 This bug was fixed in Galileo kernels, unfixed vanilla kernels are expected to crash and burn on this platform. Signed-off-by: Bryan O'Donoghue <pure.logic@nexus-software.ie> Cc: Borislav Petkov <bp@alien8.de> Link: http://lkml.kernel.org/r/1411514784-14885-1-git-send-email-pure.logic@nexus-software.ie Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-10-30x86,kvm,vmx: Preserve CR4 across VM entryAndy Lutomirski
commit d974baa398f34393db76be45f7d4d04fbdbb4a0a upstream. CR4 isn't constant; at least the TSD and PCE bits can vary. TBH, treating CR0 and CR3 as constant scares me a bit, too, but it looks like it's correct. This adds a branch and a read from cr4 to each vm entry. Because it is extremely likely that consecutive entries into the same vcpu will have the same host cr4 value, this fixes up the vmcs instead of restoring cr4 after the fact. A subsequent patch will add a kernel-wide cr4 shadow, reducing the overhead in the common case to just two memory reads and a branch. Signed-off-by: Andy Lutomirski <luto@amacapital.net> Acked-by: Paolo Bonzini <pbonzini@redhat.com> Cc: Petr Matousek <pmatouse@redhat.com> Cc: Gleb Natapov <gleb@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-10-30KVM: do not bias the generation number in kvm_current_mmio_generationPaolo Bonzini
commit 00f034a12fdd81210d58116326d92780aac5c238 upstream. The next patch will give a meaning (a la seqcount) to the low bit of the generation number. Ensure that it matches between kvm->memslots->generation and kvm_current_mmio_generation(). Reviewed-by: David Matlack <dmatlack@google.com> Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-10-30kvm: fix potentially corrupt mmio cacheDavid Matlack
commit ee3d1570b58677885b4552bce8217fda7b226a68 upstream. vcpu exits and memslot mutations can run concurrently as long as the vcpu does not aquire the slots mutex. Thus it is theoretically possible for memslots to change underneath a vcpu that is handling an exit. If we increment the memslot generation number again after synchronize_srcu_expedited(), vcpus can safely cache memslot generation without maintaining a single rcu_dereference through an entire vm exit. And much of the x86/kvm code does not maintain a single rcu_dereference of the current memslots during each exit. We can prevent the following case: vcpu (CPU 0) | thread (CPU 1) --------------------------------------------+-------------------------- 1 vm exit | 2 srcu_read_unlock(&kvm->srcu) | 3 decide to cache something based on | old memslots | 4 | change memslots | (increments generation) 5 | synchronize_srcu(&kvm->srcu); 6 retrieve generation # from new memslots | 7 tag cache with new memslot generation | 8 srcu_read_unlock(&kvm->srcu) | ... | <action based on cache occurs even | though the caching decision was based | on the old memslots> | ... | <action *continues* to occur until next | memslot generation change, which may | be never> | | By incrementing the generation after synchronizing with kvm->srcu readers, we ensure that the generation retrieved in (6) will become invalid soon after (8). Keeping the existing increment is not strictly necessary, but we do keep it and just move it for consistency from update_memslots to install_new_memslots. It invalidates old cached MMIOs immediately, instead of having to wait for the end of synchronize_srcu_expedited, which makes the code more clearly correct in case CPU 1 is preempted right after synchronize_srcu() returns. To avoid halving the generation space in SPTEs, always presume that the low bit of the generation is zero when reconstructing a generation number out of an SPTE. This effectively disables MMIO caching in SPTEs during the call to synchronize_srcu_expedited. Using the low bit this way is somewhat like a seqcount---where the protected thing is a cache, and instead of retrying we can simply punt if we observe the low bit to be 1. Signed-off-by: David Matlack <dmatlack@google.com> Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Reviewed-by: David Matlack <dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-10-30kvm: x86: fix stale mmio cache bugDavid Matlack
commit 56f17dd3fbc44adcdbc3340fe3988ddb833a47a7 upstream. The following events can lead to an incorrect KVM_EXIT_MMIO bubbling up to userspace: (1) Guest accesses gpa X without a memory slot. The gfn is cached in struct kvm_vcpu_arch (mmio_gfn). On Intel EPT-enabled hosts, KVM sets the SPTE write-execute-noread so that future accesses cause EPT_MISCONFIGs. (2) Host userspace creates a memory slot via KVM_SET_USER_MEMORY_REGION covering the page just accessed. (3) Guest attempts to read or write to gpa X again. On Intel, this generates an EPT_MISCONFIG. The memory slot generation number that was incremented in (2) would normally take care of this but we fast path mmio faults through quickly_check_mmio_pf(), which only checks the per-vcpu mmio cache. Since we hit the cache, KVM passes a KVM_EXIT_MMIO up to userspace. This patch fixes the issue by using the memslot generation number to validate the mmio cache. Signed-off-by: David Matlack <dmatlack@google.com> [xiaoguangrong: adjust the code to make it simpler for stable-tree fix.] Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Reviewed-by: David Matlack <dmatlack@google.com> Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Tested-by: David Matlack <dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-10-15x86: Tell irq work about self IPI supportFrederic Weisbecker
commit 3010279f0fc36f0388872203e63ca49912f648fd upstream. x86 supports irq work self-IPIs when local apic is available. This is partly known on runtime so lets implement arch_irq_work_has_interrupt() accordingly. This should be safely called after setup_arch(). Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-10-15irq_work: Introduce arch_irq_work_has_interrupt()Peter Zijlstra
commit c5c38ef3d70377dc504a6a3f611a3ec814bc757b upstream. The nohz full code needs irq work to trigger its own interrupt so that the subsystem can work even when the tick is stopped. Lets introduce arch_irq_work_has_interrupt() that archs can override to tell about their support for this ability. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>