summaryrefslogtreecommitdiff
path: root/arch/x86
AgeCommit message (Collapse)Author
2015-11-09xen: fix backport of previous kexec patchGreg Kroah-Hartman
Fixes the backport of 0b34a166f291d255755be46e43ed5497cdd194f2 upstream Commit 0b34a166f291d255755be46e43ed5497cdd194f2 "x86/xen: Support kexec/kdump in HVM guests by doing a soft reset" has been added to the 4.2-stable tree" needed to correct the CONFIG variable, as CONFIG_KEXEC_CORE only showed up in 4.3. Reported-by: David Vrabel <david.vrabel@citrix.com> Reported-by: Luis Henriques <luis.henriques@canonical.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-11-09x86/efi: Fix multiple GOP device supportKővágó, Zoltán
commit 8a53554e12e98d1759205afd7b8e9e2ea0936f48 upstream. When multiple GOP devices exists, but none of them implements ConOut, the code should just choose the first GOP (according to the comments). But currently 'fb_base' will refer to the last GOP, while other parameters to the first GOP, which will likely result in a garbled display. I can reliably reproduce this bug using my ASRock Z87M Extreme4 motherboard with CSM and integrated GPU disabled, and two PCIe video cards (NVidia GT640 and GTX980), booting from efi-stub (booting from grub works fine). On the primary display the ASRock logo remains and on the secondary screen it is garbled up completely. Signed-off-by: Kővágó, Zoltán <DirtY.iCE.hu@gmail.com> Signed-off-by: Matt Fleming <matt.fleming@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matthew Garrett <mjg59@srcf.ucam.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1444659236-24837-2-git-send-email-matt@codeblueprint.co.uk Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-10-27sched/preempt: Fix cond_resched_lock() and cond_resched_softirq()Konstantin Khlebnikov
commit fe32d3cd5e8eb0f82e459763374aa80797023403 upstream. These functions check should_resched() before unlocking spinlock/bh-enable: preempt_count always non-zero => should_resched() always returns false. cond_resched_lock() worked iff spin_needbreak is set. This patch adds argument "preempt_offset" to should_resched(). preempt_count offset constants for that: PREEMPT_DISABLE_OFFSET - offset after preempt_disable() PREEMPT_LOCK_OFFSET - offset after spin_lock() SOFTIRQ_DISABLE_OFFSET - offset after local_bh_distable() SOFTIRQ_LOCK_OFFSET - offset after spin_lock_bh() Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Alexander Graf <agraf@suse.de> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: David Vrabel <david.vrabel@citrix.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: bdb438065890 ("sched: Extract the basic add/sub preempt_count modifiers") Link: http://lkml.kernel.org/r/20150715095204.12246.98268.stgit@buzz Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-10-22x86/xen: Support kexec/kdump in HVM guests by doing a soft resetVitaly Kuznetsov
commit 0b34a166f291d255755be46e43ed5497cdd194f2 upstream. Currently there is a number of issues preventing PVHVM Xen guests from doing successful kexec/kdump: - Bound event channels. - Registered vcpu_info. - PIRQ/emuirq mappings. - shared_info frame after XENMAPSPACE_shared_info operation. - Active grant mappings. Basically, newly booted kernel stumbles upon already set up Xen interfaces and there is no way to reestablish them. In Xen-4.7 a new feature called 'soft reset' is coming. A guest performing kexec/kdump operation is supposed to call SCHEDOP_shutdown hypercall with SHUTDOWN_soft_reset reason before jumping to new kernel. Hypervisor (with some help from toolstack) will do full domain cleanup (but keeping its memory and vCPU contexts intact) returning the guest to the state it had when it was first booted and thus allowing it to start over. Doing SHUTDOWN_soft_reset on Xen hypervisors which don't support it is probably OK as by default all unknown shutdown reasons cause domain destroy with a message in toolstack log: 'Unknown shutdown reason code 5. Destroying domain.' which gives a clue to what the problem is and eliminates false expectations. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: David Vrabel <david.vrabel@citrix.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-10-22x86/mm: Set NX on gap between __ex_table and rodataStephen Smalley
commit ab76f7b4ab2397ffdd2f1eb07c55697d19991d10 upstream. Unused space between the end of __ex_table and the start of rodata can be left W+x in the kernel page tables. Extend the setting of the NX bit to cover this gap by starting from text_end rather than rodata_start. Before: ---[ High Kernel Mapping ]--- 0xffffffff80000000-0xffffffff81000000 16M pmd 0xffffffff81000000-0xffffffff81600000 6M ro PSE GLB x pmd 0xffffffff81600000-0xffffffff81754000 1360K ro GLB x pte 0xffffffff81754000-0xffffffff81800000 688K RW GLB x pte 0xffffffff81800000-0xffffffff81a00000 2M ro PSE GLB NX pmd 0xffffffff81a00000-0xffffffff81b3b000 1260K ro GLB NX pte 0xffffffff81b3b000-0xffffffff82000000 4884K RW GLB NX pte 0xffffffff82000000-0xffffffff82200000 2M RW PSE GLB NX pmd 0xffffffff82200000-0xffffffffa0000000 478M pmd After: ---[ High Kernel Mapping ]--- 0xffffffff80000000-0xffffffff81000000 16M pmd 0xffffffff81000000-0xffffffff81600000 6M ro PSE GLB x pmd 0xffffffff81600000-0xffffffff81754000 1360K ro GLB x pte 0xffffffff81754000-0xffffffff81800000 688K RW GLB NX pte 0xffffffff81800000-0xffffffff81a00000 2M ro PSE GLB NX pmd 0xffffffff81a00000-0xffffffff81b3b000 1260K ro GLB NX pte 0xffffffff81b3b000-0xffffffff82000000 4884K RW GLB NX pte 0xffffffff82000000-0xffffffff82200000 2M RW PSE GLB NX pmd 0xffffffff82200000-0xffffffffa0000000 478M pmd Signed-off-by: Stephen Smalley <sds@tycho.nsa.gov> Acked-by: Kees Cook <keescook@chromium.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Link: http://lkml.kernel.org/r/1443704662-3138-1-git-send-email-sds@tycho.nsa.gov Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-10-22x86/process: Add proper bound checks in 64bit get_wchan()Thomas Gleixner
commit eddd3826a1a0190e5235703d1e666affa4d13b96 upstream. Dmitry Vyukov reported the following using trinity and the memory error detector AddressSanitizer (https://code.google.com/p/address-sanitizer/wiki/AddressSanitizerForKernel). [ 124.575597] ERROR: AddressSanitizer: heap-buffer-overflow on address ffff88002e280000 [ 124.576801] ffff88002e280000 is located 131938492886538 bytes to the left of 28857600-byte region [ffffffff81282e0a, ffffffff82e0830a) [ 124.578633] Accessed by thread T10915: [ 124.579295] inlined in describe_heap_address ./arch/x86/mm/asan/report.c:164 [ 124.579295] #0 ffffffff810dd277 in asan_report_error ./arch/x86/mm/asan/report.c:278 [ 124.580137] #1 ffffffff810dc6a0 in asan_check_region ./arch/x86/mm/asan/asan.c:37 [ 124.581050] #2 ffffffff810dd423 in __tsan_read8 ??:0 [ 124.581893] #3 ffffffff8107c093 in get_wchan ./arch/x86/kernel/process_64.c:444 The address checks in the 64bit implementation of get_wchan() are wrong in several ways: - The lower bound of the stack is not the start of the stack page. It's the start of the stack page plus sizeof (struct thread_info) - The upper bound must be: top_of_stack - TOP_OF_KERNEL_STACK_PADDING - 2 * sizeof(unsigned long). The 2 * sizeof(unsigned long) is required because the stack pointer points at the frame pointer. The layout on the stack is: ... IP FP ... IP FP. So we need to make sure that both IP and FP are in the bounds. Fix the bound checks and get rid of the mix of numeric constants, u64 and unsigned long. Making all unsigned long allows us to use the same function for 32bit as well. Use READ_ONCE() when accessing the stack. This does not prevent a concurrent wakeup of the task and the stack changing, but at least it avoids TOCTOU. Also check task state at the end of the loop. Again that does not prevent concurrent changes, but it avoids walking for nothing. Add proper comments while at it. Reported-by: Dmitry Vyukov <dvyukov@google.com> Reported-by: Sasha Levin <sasha.levin@oracle.com> Based-on-patch-from: Wolfram Gloger <wmglo@dent.med.uni-muenchen.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Borislav Petkov <bp@alien8.de> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Andrey Konovalov <andreyknvl@google.com> Cc: Kostya Serebryany <kcc@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: kasan-dev <kasan-dev@googlegroups.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Wolfram Gloger <wmglo@dent.med.uni-muenchen.de> Link: http://lkml.kernel.org/r/20150930083302.694788319@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-10-22x86/kexec: Fix kexec crash in syscall kexec_file_load()Lee, Chun-Yi
commit e3c41e37b0f4b18cbd4dac76cbeece5a7558b909 upstream. The original bug is a page fault crash that sometimes happens on big machines when preparing ELF headers: BUG: unable to handle kernel paging request at ffffc90613fc9000 IP: [<ffffffff8103d645>] prepare_elf64_ram_headers_callback+0x165/0x260 The bug is caused by us under-counting the number of memory ranges and subsequently not allocating enough ELF header space for them. The bug is typically masked on smaller systems, because the ELF header allocation is rounded up to the next page. This patch modifies the code in fill_up_crash_elf_data() by using walk_system_ram_res() instead of walk_system_ram_range() to correctly count the max number of crash memory ranges. That's because the walk_system_ram_range() filters out small memory regions that reside in the same page, but walk_system_ram_res() does not. Here's how I found the bug: After tracing prepare_elf64_headers() and prepare_elf64_ram_headers_callback(), the code uses walk_system_ram_res() to fill-in crash memory regions information to the program header, so it counts those small memory regions that reside in a page area. But, when the kernel was using walk_system_ram_range() in fill_up_crash_elf_data() to count the number of crash memory regions, it filters out small regions. I printed those small memory regions, for example: kexec: Get nr_ram ranges. vaddr=0xffff880077592258 paddr=0x77592258, sz=0xdc0 Based on the code in walk_system_ram_range(), this memory region will be filtered out: pfn = (0x77592258 + 0x1000 - 1) >> 12 = 0x77593 end_pfn = (0x77592258 + 0xfc0 -1 + 1) >> 12 = 0x77593 end_pfn - pfn = 0x77593 - 0x77593 = 0 <=== if (end_pfn > pfn) is FALSE So, the max_nr_ranges that's counted by the kernel doesn't include small memory regions - causing us to under-allocate the required space. That causes the page fault crash that happens in a later code path when preparing ELF headers. This bug is not easy to reproduce on small machines that have few CPUs, because the allocated page aligned ELF buffer has more free space to cover those small memory regions' PT_LOAD headers. Signed-off-by: Lee, Chun-Yi <jlee@suse.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Baoquan He <bhe@redhat.com> Cc: Jiang Liu <jiang.liu@linux.intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Takashi Iwai <tiwai@suse.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Viresh Kumar <viresh.kumar@linaro.org> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: kexec@lists.infradead.org Cc: linux-kernel@vger.kernel.org Link: http://lkml.kernel.org/r/1443531537-29436-1-git-send-email-jlee@suse.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-10-22x86/efi: Fix boot crash by mapping EFI memmap entries bottom-up at runtime, ↵Matt Fleming
instead of top-down commit a5caa209ba9c29c6421292e7879d2387a2ef39c9 upstream. Beginning with UEFI v2.5 EFI_PROPERTIES_TABLE was introduced that signals that the firmware PE/COFF loader supports splitting code and data sections of PE/COFF images into separate EFI memory map entries. This allows the kernel to map those regions with strict memory protections, e.g. EFI_MEMORY_RO for code, EFI_MEMORY_XP for data, etc. Unfortunately, an unwritten requirement of this new feature is that the regions need to be mapped with the same offsets relative to each other as observed in the EFI memory map. If this is not done crashes like this may occur, BUG: unable to handle kernel paging request at fffffffefe6086dd IP: [<fffffffefe6086dd>] 0xfffffffefe6086dd Call Trace: [<ffffffff8104c90e>] efi_call+0x7e/0x100 [<ffffffff81602091>] ? virt_efi_set_variable+0x61/0x90 [<ffffffff8104c583>] efi_delete_dummy_variable+0x63/0x70 [<ffffffff81f4e4aa>] efi_enter_virtual_mode+0x383/0x392 [<ffffffff81f37e1b>] start_kernel+0x38a/0x417 [<ffffffff81f37495>] x86_64_start_reservations+0x2a/0x2c [<ffffffff81f37582>] x86_64_start_kernel+0xeb/0xef Here 0xfffffffefe6086dd refers to an address the firmware expects to be mapped but which the OS never claimed was mapped. The issue is that included in these regions are relative addresses to other regions which were emitted by the firmware toolchain before the "splitting" of sections occurred at runtime. Needless to say, we don't satisfy this unwritten requirement on x86_64 and instead map the EFI memory map entries in reverse order. The above crash is almost certainly triggerable with any kernel newer than v3.13 because that's when we rewrote the EFI runtime region mapping code, in commit d2f7cbe7b26a ("x86/efi: Runtime services virtual mapping"). For kernel versions before v3.13 things may work by pure luck depending on the fragmentation of the kernel virtual address space at the time we map the EFI regions. Instead of mapping the EFI memory map entries in reverse order, where entry N has a higher virtual address than entry N+1, map them in the same order as they appear in the EFI memory map to preserve this relative offset between regions. This patch has been kept as small as possible with the intention that it should be applied aggressively to stable and distribution kernels. It is very much a bugfix rather than support for a new feature, since when EFI_PROPERTIES_TABLE is enabled we must map things as outlined above to even boot - we have no way of asking the firmware not to split the code/data regions. In fact, this patch doesn't even make use of the more strict memory protections available in UEFI v2.5. That will come later. Suggested-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Reported-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Signed-off-by: Matt Fleming <matt.fleming@intel.com> Cc: Borislav Petkov <bp@suse.de> Cc: Chun-Yi <jlee@suse.com> Cc: Dave Young <dyoung@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: James Bottomley <JBottomley@Odin.com> Cc: Lee, Chun-Yi <jlee@suse.com> Cc: Leif Lindholm <leif.lindholm@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matthew Garrett <mjg59@srcf.ucam.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Jones <pjones@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Link: http://lkml.kernel.org/r/1443218539-7610-2-git-send-email-matt@codeblueprint.co.uk Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-10-22Use WARN_ON_ONCE for missing X86_FEATURE_NRIPSDirk Müller
commit d2922422c48df93f3edff7d872ee4f3191fefb08 upstream. The cpu feature flags are not ever going to change, so warning everytime can cause a lot of kernel log spam (in our case more than 10GB/hour). The warning seems to only occur when nested virtualization is enabled, so it's probably triggered by a KVM bug. This is a sensible and safe change anyway, and the KVM bug fix might not be suitable for stable releases anyway. Signed-off-by: Dirk Mueller <dmueller@suse.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-10-22x86/nmi/64: Fix a paravirt stack-clobbering bug in the NMI codeAndy Lutomirski
commit 83c133cf11fb0e68a51681447e372489f052d40e upstream. The NMI entry code that switches to the normal kernel stack needs to be very careful not to clobber any extra stack slots on the NMI stack. The code is fine under the assumption that SWAPGS is just a normal instruction, but that assumption isn't really true. Use SWAPGS_UNSAFE_STACK instead. This is part of a fix for some random crashes that Sasha saw. Fixes: 9b6e6a8334d5 ("x86/nmi/64: Switch stacks on userspace NMI entry") Reported-and-tested-by: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Andy Lutomirski <luto@kernel.org> Link: http://lkml.kernel.org/r/974bc40edffdb5c2950a5c4977f821a446b76178.1442791737.git.luto@kernel.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-10-22x86/paravirt: Replace the paravirt nop with a bona fide empty functionAndy Lutomirski
commit fc57a7c68020dcf954428869eafd934c0ab1536f upstream. PARAVIRT_ADJUST_EXCEPTION_FRAME generates this code (using nmi as an example, trimmed for readability): ff 15 00 00 00 00 callq *0x0(%rip) # 2796 <nmi+0x6> 2792: R_X86_64_PC32 pv_irq_ops+0x2c That's a call through a function pointer to regular C function that does nothing on native boots, but that function isn't protected against kprobes, isn't marked notrace, and is certainly not guaranteed to preserve any registers if the compiler is feeling perverse. This is bad news for a CLBR_NONE operation. Of course, if everything works correctly, once paravirt ops are patched, it gets nopped out, but what if we hit this code before paravirt ops are patched in? This can potentially cause breakage that is very difficult to debug. A more subtle failure is possible here, too: if _paravirt_nop uses the stack at all (even just to push RBP), it will overwrite the "NMI executing" variable if it's called in the NMI prologue. The Xen case, perhaps surprisingly, is fine, because it's already written in asm. Fix all of the cases that default to paravirt_nop (including adjust_exception_frame) with a big hammer: replace paravirt_nop with an asm function that is just a ret instruction. The Xen case may have other problems, so document them. This is part of a fix for some random crashes that Sasha saw. Reported-and-tested-by: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Andy Lutomirski <luto@kernel.org> Link: http://lkml.kernel.org/r/8f5d2ba295f9d73751c33d97fda03e0495d9ade0.1442791737.git.luto@kernel.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-10-22x86/platform: Fix Geode LX timekeeping in the generic x86 buildDavid Woodhouse
commit 03da3ff1cfcd7774c8780d2547ba0d995f7dc03d upstream. In 2007, commit 07190a08eef36 ("Mark TSC on GeodeLX reliable") bypassed verification of the TSC on Geode LX. However, this code (now in the check_system_tsc_reliable() function in arch/x86/kernel/tsc.c) was only present if CONFIG_MGEODE_LX was set. OpenWRT has recently started building its generic Geode target for Geode GX, not LX, to include support for additional platforms. This broke the timekeeping on LX-based devices, because the TSC wasn't marked as reliable: https://dev.openwrt.org/ticket/20531 By adding a runtime check on is_geode_lx(), we can also include the fix if CONFIG_MGEODEGX1 or CONFIG_X86_GENERIC are set, thus fixing the problem. Signed-off-by: David Woodhouse <David.Woodhouse@intel.com> Cc: Andres Salomon <dilinger@queued.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Marcelo Tosatti <marcelo@kvack.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1442409003.131189.87.camel@infradead.org Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-10-22x86/alternatives: Make optimize_nops() interrupt safe and syncedThomas Gleixner
commit 66c117d7fa2ae429911e60d84bf31a90b2b96189 upstream. Richard reported the following crash: [ 0.036000] BUG: unable to handle kernel paging request at 55501e06 [ 0.036000] IP: [<c0aae48b>] common_interrupt+0xb/0x38 [ 0.036000] Call Trace: [ 0.036000] [<c0409c80>] ? add_nops+0x90/0xa0 [ 0.036000] [<c040a054>] apply_alternatives+0x274/0x630 Chuck decoded: " 0: 8d 90 90 83 04 24 lea 0x24048390(%eax),%edx 6: 80 fc 0f cmp $0xf,%ah 9: a8 0f test $0xf,%al >> b: a0 06 1e 50 55 mov 0x55501e06,%al 10: 57 push %edi 11: 56 push %esi Interrupt 0x30 occurred while the alternatives code was replacing the initial 0x90,0x90,0x90 NOPs (from the ASM_CLAC macro) with the optimized version, 0x8d,0x76,0x00. Only the first byte has been replaced so far, and it makes a mess out of the insn decoding." optimize_nops() is buggy in two aspects: - It's not disabling interrupts across the modification - It's lacking a sync_core() call Add both. Fixes: 4fd4b6e5537c 'x86/alternatives: Use optimized NOPs for padding' Reported-and-tested-by: "Richard W.M. Jones" <rjones@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Richard W.M. Jones <rjones@redhat.com> Cc: Chuck Ebbert <cebbert.lkml@gmail.com> Cc: Borislav Petkov <bp@alien8.de> Link: http://lkml.kernel.org/r/alpine.DEB.2.11.1509031232340.15006@nanos Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-10-22x86/apic: Serialize LVTT and TSC_DEADLINE writesShaohua Li
commit 5d7c631d926b59aa16f3c56eaeb83f1036c81dc7 upstream. The APIC LVTT register is MMIO mapped but the TSC_DEADLINE register is an MSR. The write to the TSC_DEADLINE MSR is not serializing, so it's not guaranteed that the write to LVTT has reached the APIC before the TSC_DEADLINE MSR is written. In such a case the write to the MSR is ignored and as a consequence the local timer interrupt never fires. The SDM decribes this issue for xAPIC and x2APIC modes. The serialization methods recommended by the SDM differ. xAPIC: "1. Memory-mapped write to LVT Timer Register, setting bits 18:17 to 10b. 2. WRMSR to the IA32_TSC_DEADLINE MSR a value much larger than current time-stamp counter. 3. If RDMSR of the IA32_TSC_DEADLINE MSR returns zero, go to step 2. 4. WRMSR to the IA32_TSC_DEADLINE MSR the desired deadline." x2APIC: "To allow for efficient access to the APIC registers in x2APIC mode, the serializing semantics of WRMSR are relaxed when writing to the APIC registers. Thus, system software should not use 'WRMSR to APIC registers in x2APIC mode' as a serializing instruction. Read and write accesses to the APIC registers will occur in program order. A WRMSR to an APIC register may complete before all preceding stores are globally visible; software can prevent this by inserting a serializing instruction, an SFENCE, or an MFENCE before the WRMSR." The xAPIC method is to just wait for the memory mapped write to hit the LVTT by checking whether the MSR write has reached the hardware. There is no reason why a proper MFENCE after the memory mapped write would not do the same. Andi Kleen confirmed that MFENCE is sufficient for the xAPIC case as well. Issue MFENCE before writing to the TSC_DEADLINE MSR. This can be done unconditionally as all CPUs which have TSC_DEADLINE also have MFENCE support. [ tglx: Massaged the changelog ] Signed-off-by: Shaohua Li <shli@fb.com> Reviewed-by: Ingo Molnar <mingo@kernel.org> Cc: <Kernel-team@fb.com> Cc: <lenb@kernel.org> Cc: <fenghua.yu@intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Link: http://lkml.kernel.org/r/20150909041352.GA2059853@devbig257.prn2.facebook.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-10-22perf/x86/intel: Fix constraint accessPeter Zijlstra
commit ebfb4988f0378e2ac3b4a0aa1ea20d724293f392 upstream. Sasha reported that we can get here with .idx==-1, and cpuc->event_constraints unallocated. Suggested-by: Stephane Eranian <eranian@google.com> Reported-by: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: b371b5943178 ("perf/x86: Fix event/group validation") Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-10-22KVM: vmx: fix VPID is 0000H in non-root operationWanpeng Li
commit 04bb92e4b4cf06a66889d37b892b78f926faa9d4 upstream. Reference SDM 28.1: The current VPID is 0000H in the following situations: - Outside VMX operation. (This includes operation in system-management mode under the default treatment of SMIs and SMM with VMX operation; see Section 34.14.) - In VMX root operation. - In VMX non-root operation when the “enable VPID” VM-execution control is 0. The VPID should never be 0000H in non-root operation when "enable VPID" VM-execution control is 1. However, commit 34a1cd60 ("kvm: x86: vmx: move some vmx setting from vmx_init() to hardware_setup()") remove the codes which reserve 0000H for VMX root operation. This patch fix it by again reserving 0000H for VMX root operation. Fixes: 34a1cd60d17f62c1f077c1478a6c2ca8c3d17af4 Reported-by: Wincy Van <fanwenyi0529@gmail.com> Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-09-29lib/decompressors: use real out buf size for gunzip with kernelYinghai Lu
commit 2d3862d26e67a59340ba1cf1748196c76c5787de upstream. When loading x86 64bit kernel above 4GiB with patched grub2, got kernel gunzip error. | early console in decompress_kernel | decompress_kernel: | input: [0x807f2143b4-0x807ff61aee] | output: [0x807cc00000-0x807f3ea29b] 0x027ea29c: output_len | boot via startup_64 | KASLR using RDTSC... | new output: [0x46fe000000-0x470138cfff] 0x0338d000: output_run_size | decompress: [0x46fe000000-0x47007ea29b] <=== [0x807f2143b4-0x807ff61aee] | | Decompressing Linux... gz... | | uncompression error | | -- System halted the new buffer is at 0x46fe000000ULL, decompressor_gzip is using 0xffffffb901ffffff as out_len. gunzip in lib/zlib_inflate/inflate.c cap that len to 0x01ffffff and decompress fails later. We could hit this problem with crashkernel booting that uses kexec loading kernel above 4GiB. We have decompress_* support: 1. inbuf[]/outbuf[] for kernel preboot. 2. inbuf[]/flush() for initramfs 3. fill()/flush() for initrd. This bug only affect kernel preboot path that use outbuf[]. Add __decompress and take real out_buf_len for gunzip instead of guessing wrong buf size. Fixes: 1431574a1c4 (lib/decompressors: fix "no limit" output buffer length) Signed-off-by: Yinghai Lu <yinghai@kernel.org> Cc: Alexandre Courbot <acourbot@nvidia.com> Cc: Jon Medhurst <tixy@linaro.org> Cc: Stephen Warren <swarren@wwwdotorg.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-09-29x86/mm: Initialize pmd_idx in page_table_range_init_count()Minfei Huang
commit 9962eea9e55f797f05f20ba6448929cab2a9f018 upstream. The variable pmd_idx is not initialized for the first iteration of the for loop. Assign the proper value which indexes the start address. Fixes: 719272c45b82 'x86, mm: only call early_ioremap_page_table_range_init() once' Signed-off-by: Minfei Huang <mnfhuang@gmail.com> Cc: tony.luck@intel.com Cc: wangnan0@huawei.com Cc: david.vrabel@citrix.com Reviewed-by: yinghai@kernel.org Link: http://lkml.kernel.org/r/1436703522-29552-1-git-send-email-mhuang@redhat.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-09-21ACPI, PCI: Penalize legacy IRQ used by ACPI SCIJiang Liu
commit 5d0ddfebb93069061880fc57ee4ba7246bd1e1ee upstream. Nick Meier reported a regression with HyperV that " After rebooting the VM, the following messages are logged in syslog when trying to load the tulip driver: tulip: Linux Tulip drivers version 1.1.15 (Feb 27, 2007) tulip: 0000:00:0a.0: PCI INT A: failed to register GSI tulip: Cannot enable tulip board #0, aborting tulip: probe of 0000:00:0a.0 failed with error -16 Errors occur in 3.19.0 kernel Works in 3.17 kernel. " According to the ACPI dump file posted by Nick at https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1440072 The ACPI MADT table includes an interrupt source overridden entry for ACPI SCI: [236h 0566 1] Subtable Type : 02 <Interrupt Source Override> [237h 0567 1] Length : 0A [238h 0568 1] Bus : 00 [239h 0569 1] Source : 09 [23Ah 0570 4] Interrupt : 00000009 [23Eh 0574 2] Flags (decoded below) : 000D Polarity : 1 Trigger Mode : 3 And in DSDT table, we have _PRT method to define PCI interrupts, which eventually goes to: Name (PRSA, ResourceTemplate () { IRQ (Level, ActiveLow, Shared, ) {3,4,5,7,9,10,11,12,14,15} }) Name (PRSB, ResourceTemplate () { IRQ (Level, ActiveLow, Shared, ) {3,4,5,7,9,10,11,12,14,15} }) Name (PRSC, ResourceTemplate () { IRQ (Level, ActiveLow, Shared, ) {3,4,5,7,9,10,11,12,14,15} }) Name (PRSD, ResourceTemplate () { IRQ (Level, ActiveLow, Shared, ) {3,4,5,7,9,10,11,12,14,15} }) According to the MADT and DSDT tables, IRQ 9 may be used for: 1) ACPI SCI in level, high mode 2) PCI legacy IRQ in level, low mode So there's a conflict in polarity setting for IRQ 9. Prior to commit cd68f6bd53cf ("x86, irq, acpi: Get rid of special handling of GSI for ACPI SCI"), ACPI SCI is handled specially and there's no check for conflicts between ACPI SCI and PCI legagy IRQ. And it seems that the HyperV hypervisor doesn't make use of the polarity configuration in IOAPIC entry, so it just works. Commit cd68f6bd53cf gets rid of the specially handling of ACPI SCI, and then the pin attribute checking code discloses the conflicts between ACPI SCI and PCI legacy IRQ on HyperV virtual machine, and rejects the request to assign IRQ9 to PCI devices. So penalize legacy IRQ used by ACPI SCI and mark it unusable if ACPI SCI attributes conflict with PCI IRQ attributes. Please refer to following links for more information: https://bugzilla.kernel.org/show_bug.cgi?id=101301 https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1440072 Fixes: cd68f6bd53cf ("x86, irq, acpi: Get rid of special handling of GSI for ACPI SCI") Reported-and-tested-by: Nick Meier <nmeier@microsoft.com> Acked-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-09-21x86/mce: Reenable CMCI banks when swiching back to interrupt modeXie XiuQi
commit 1b48465500611a2dc5e75800c61ac352e22d41c3 upstream. Zhang Liguang reported the following issue: 1) System detects a CMCI storm on the current CPU. 2) Kernel disables the CMCI interrupt on banks owned by the current CPU and switches to poll mode 3) After the CMCI storm subsides, kernel switches back to interrupt mode 4) We expect the system to reenable the CMCI interrupt on banks owned by the current CPU mce_intel_adjust_timer |-> cmci_reenable |-> cmci_discover # owned banks are ignored here static void cmci_discover(int banks) ... for (i = 0; i < banks; i++) { ... if (test_bit(i, owned)) # ownd banks is ignore here continue; So convert cmci_storm_disable_banks() to cmci_toggle_interrupt_mode() which controls whether to enable or disable CMCI interrupts with its argument. NB: We cannot clear the owned bit because the banks won't be polled, otherwise. See: 27f6c573e0f7 ("x86, CMCI: Add proper detection of end of CMCI storms") for more info. Reported-by: Zhang Liguang <zhangliguang@huawei.com> Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: huawei.libin@huawei.com Cc: linux-edac <linux-edac@vger.kernel.org> Cc: rui.xiang@huawei.com Link: http://lkml.kernel.org/r/1439396985-12812-10-git-send-email-bp@alien8.de Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-09-21KVM: x86: Use adjustment in guest cycles when handling MSR_IA32_TSC_ADJUSTHaozhong Zhang
commit d7add05458084a5e3d65925764a02ca9c8202c1e upstream. When kvm_set_msr_common() handles a guest's write to MSR_IA32_TSC_ADJUST, it will calcuate an adjustment based on the data written by guest and then use it to adjust TSC offset by calling a call-back adjust_tsc_offset(). The 3rd parameter of adjust_tsc_offset() indicates whether the adjustment is in host TSC cycles or in guest TSC cycles. If SVM TSC scaling is enabled, adjust_tsc_offset() [i.e. svm_adjust_tsc_offset()] will first scale the adjustment; otherwise, it will just use the unscaled one. As the MSR write here comes from the guest, the adjustment is in guest TSC cycles. However, the current kvm_set_msr_common() uses it as a value in host TSC cycles (by using true as the 3rd parameter of adjust_tsc_offset()), which can result in an incorrect adjustment of TSC offset if SVM TSC scaling is enabled. This patch fixes this problem. Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-09-21KVM: MMU: fix validation of mmio page faultXiao Guangrong
commit 6f691251c0350ac52a007c54bf3ef62e9d8cdc5e upstream. We got the bug that qemu complained with "KVM: unknown exit, hardware reason 31" and KVM shown these info: [84245.284948] EPT: Misconfiguration. [84245.285056] EPT: GPA: 0xfeda848 [84245.285154] ept_misconfig_inspect_spte: spte 0x5eaef50107 level 4 [84245.285344] ept_misconfig_inspect_spte: spte 0x5f5fadc107 level 3 [84245.285532] ept_misconfig_inspect_spte: spte 0x5141d18107 level 2 [84245.285723] ept_misconfig_inspect_spte: spte 0x52e40dad77 level 1 This is because we got a mmio #PF and the handler see the mmio spte becomes normal (points to the ram page) However, this is valid after introducing fast mmio spte invalidation which increases the generation-number instead of zapping mmio sptes, a example is as follows: 1. QEMU drops mmio region by adding a new memslot 2. invalidate all mmio sptes 3. VCPU 0 VCPU 1 access the invalid mmio spte access the region originally was MMIO before set the spte to the normal ram map mmio #PF check the spte and see it becomes normal ram mapping !!! This patch fixes the bug just by dropping the check in mmio handler, it's good for backport. Full check will be introduced in later patches Reported-by: Pavel Shirshov <ru.pchel@gmail.com> Tested-by: Pavel Shirshov <ru.pchel@gmail.com> Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-09-21crypto: ghash-clmulni: specify context size for ghash async algorithmAndrey Ryabinin
commit 71c6da846be478a61556717ef1ee1cea91f5d6a8 upstream. Currently context size (cra_ctxsize) doesn't specified for ghash_async_alg. Which means it's zero. Thus crypto_create_tfm() doesn't allocate needed space for ghash_async_ctx, so any read/write to ctx (e.g. in ghash_async_init_tfm()) is not valid. Signed-off-by: Andrey Ryabinin <aryabinin@odin.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-09-21x86/ldt: Further fix FPU emulationAndy Lutomirski
commit 12e244f4b550498bbaf654a52f93633f7dde2dc7 upstream. The previous fix confused a selector with a segment prefix. Fix it. Compile-tested only. Cc: Juergen Gross <jgross@suse.com> Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Fixes: 4809146b86c3 ("x86/ldt: Correct FPU emulation access to LDT") Signed-off-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-09-21x86/ldt: Correct FPU emulation access to LDTJuergen Gross
commit 4809146b86c3d41ce588fdb767d021e2a80600dd upstream. Commit 37868fe113ff ("x86/ldt: Make modify_ldt synchronous") introduced a new struct ldt_struct anchored at mm->context.ldt. Adapt the x86 fpu emulation code to use that new structure. Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Andy Lutomirski <luto@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: billm@melbpc.org.au Link: http://lkml.kernel.org/r/1438883674-1240-1-git-send-email-jgross@suse.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-09-21x86/ldt: Correct LDT access in single stepping logicJuergen Gross
commit 136d9d83c07c5e30ac49fc83b27e8c4842f108fc upstream. Commit 37868fe113ff ("x86/ldt: Make modify_ldt synchronous") introduced a new struct ldt_struct anchored at mm->context.ldt. convert_ip_to_linear() was changed to reflect this, but indexing into the ldt has to be changed as the pointer is no longer void *. Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Andy Lutomirski <luto@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: bp@suse.de Link: http://lkml.kernel.org/r/1438848278-12906-1-git-send-email-jgross@suse.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-09-21x86/ldt: Make modify_ldt synchronousAndy Lutomirski
commit 37868fe113ff2ba814b3b4eb12df214df555f8dc upstream. modify_ldt() has questionable locking and does not synchronize threads. Improve it: redesign the locking and synchronize all threads' LDTs using an IPI on all modifications. This will dramatically slow down modify_ldt in multithreaded programs, but there shouldn't be any multithreaded programs that care about modify_ldt's performance in the first place. This fixes some fallout from the CVE-2015-5157 fixes. Signed-off-by: Andy Lutomirski <luto@kernel.org> Reviewed-by: Borislav Petkov <bp@suse.de> Cc: Andrew Cooper <andrew.cooper3@citrix.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Jan Beulich <jbeulich@suse.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: security@kernel.org <security@kernel.org> Cc: xen-devel <xen-devel@lists.xen.org> Link: http://lkml.kernel.org/r/4c6978476782160600471bd865b318db34c7b628.1438291540.git.luto@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-09-13x86/idle: Restore trace_cpu_idle to mwait_idle() callsJisheng Zhang
commit e43d0189ac02415fe4487f79fc35e8f147e9ea0d upstream. Commit b253149b843f ("sched/idle/x86: Restore mwait_idle() to fix boot hangs, to improve power savings and to improve performance") restores mwait_idle(), but the trace_cpu_idle related calls are missing. This causes powertop on my old desktop powered by Intel Core2 E6550 to report zero wakeups and zero events. Add them back to restore the proper behaviour. Fixes: b253149b843f ("sched/idle/x86: Restore mwait_idle() to ...") Signed-off-by: Jisheng Zhang <jszhang@marvell.com> Cc: <len.brown@intel.com> Link: http://lkml.kernel.org/r/1440046479-4262-1-git-send-email-jszhang@marvell.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-09-13x86/apic: Fix fallout from x2apic cleanupThomas Gleixner
commit a57e456a7b28431b55e407e5ab78ebd5b378d19e upstream. In the recent x2apic cleanup I got two things really wrong: 1) The safety check in __disable_x2apic which allows the function to be called unconditionally is backwards. The check is there to prevent access to the apic MSR in case that the machine has no apic. Though right now it returns if the machine has an apic and therefor the disabling of x2apic is never invoked. 2) x2apic_disable() sets x2apic_mode to 0 after registering the local apic. That's wrong, because register_lapic_address() checks x2apic mode and therefor takes the wrong code path. This results in boot failures on machines with x2apic preenabled by BIOS and can also lead to an fatal MSR access on machines without apic. The solutions are simple: 1) Correct the sanity check for apic availability 2) Clear x2apic_mode _before_ calling register_lapic_address() Fixes: 659006bf3ae3 'x86/x2apic: Split enable and setup function' Reported-and-tested-by: Javier Monteagudo <javiermon@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://bugzilla.redhat.com/show_bug.cgi?id=1224764 Cc: Laura Abbott <labbott@redhat.com> Cc: Jiang Liu <jiang.liu@linux.intel.com> Cc: Joerg Roedel <joro@8bytes.org> Cc: Tony Luck <tony.luck@intel.com> Cc: Borislav Petkov <bp@alien8.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-09-13x86/xen: make CONFIG_XEN depend on CONFIG_X86_LOCAL_APICDavid Vrabel
commit 87ffd2b9bb74061c120f450e4d0f3409bb603ae0 upstream. Since commit feb44f1f7a4ac299d1ab1c3606860e70b9b89d69 (x86/xen: Provide a "Xen PV" APIC driver to support >255 VCPUs) Xen guests need a full APIC driver and thus should depend on X86_LOCAL_APIC. This fixes an i386 build failure with !SMP && !CONFIG_X86_UP_APIC by disabling Xen support in this configuration. Users needing Xen support in a non-SMP i386 kernel will need to enable CONFIG_X86_UP_APIC. Signed-off-by: David Vrabel <david.vrabel@citrix.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-09-13Revert x86 sigcontext cleanupsLinus Torvalds
commit ed596cde9425509ec6ce88e19f03e9b13b6f518b upstream. This reverts commits 9a036b93a344 ("x86/signal/64: Remove 'fs' and 'gs' from sigcontext") and c6f2062935c8 ("x86/signal/64: Fix SS handling for signals delivered to 64-bit programs"). They were cleanups, but they break dosemu by changing the signal return behavior (and removing 'fs' and 'gs' from the sigcontext struct - while not actually changing any behavior - causes build problems). Reported-and-tested-by: Stas Sergeev <stsp@list.ru> Acked-by: Andy Lutomirski <luto@amacapital.net> Cc: Ingo Molnar <mingo@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-09-13x86/xen: build "Xen PV" APIC driver for domU as wellJason A. Donenfeld
commit fc5fee86bdd3d720e2d1d324e4fae0c35845fa63 upstream. It turns out that a PV domU also requires the "Xen PV" APIC driver. Otherwise, the flat driver is used and we get stuck in busy loops that never exit, such as in this stack trace: (gdb) target remote localhost:9999 Remote debugging using localhost:9999 __xapic_wait_icr_idle () at ./arch/x86/include/asm/ipi.h:56 56 while (native_apic_mem_read(APIC_ICR) & APIC_ICR_BUSY) (gdb) bt #0 __xapic_wait_icr_idle () at ./arch/x86/include/asm/ipi.h:56 #1 __default_send_IPI_shortcut (shortcut=<optimized out>, dest=<optimized out>, vector=<optimized out>) at ./arch/x86/include/asm/ipi.h:75 #2 apic_send_IPI_self (vector=246) at arch/x86/kernel/apic/probe_64.c:54 #3 0xffffffff81011336 in arch_irq_work_raise () at arch/x86/kernel/irq_work.c:47 #4 0xffffffff8114990c in irq_work_queue (work=0xffff88000fc0e400) at kernel/irq_work.c:100 #5 0xffffffff8110c29d in wake_up_klogd () at kernel/printk/printk.c:2633 #6 0xffffffff8110ca60 in vprintk_emit (facility=0, level=<optimized out>, dict=0x0 <irq_stack_union>, dictlen=<optimized out>, fmt=<optimized out>, args=<optimized out>) at kernel/printk/printk.c:1778 #7 0xffffffff816010c8 in printk (fmt=<optimized out>) at kernel/printk/printk.c:1868 #8 0xffffffffc00013ea in ?? () #9 0x0000000000000000 in ?? () Mailing-list-thread: https://lkml.org/lkml/2015/8/4/755 Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: David Vrabel <david.vrabel@citrix.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-08-16kvm: x86: fix kvm_apic_has_events to check for NULL pointerPaolo Bonzini
commit ce40cd3fc7fa40a6119e5fe6c0f2bc0eb4541009 upstream. Malicious (or egregiously buggy) userspace can trigger it, but it should never happen in normal operation. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Wang Kai <morgan.wang@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-08-16x86/xen: Probe target addresses in set_aliased_prot() before the hypercallAndy Lutomirski
commit aa1acff356bbedfd03b544051f5b371746735d89 upstream. The update_va_mapping hypercall can fail if the VA isn't present in the guest's page tables. Under certain loads, this can result in an OOPS when the target address is in unpopulated vmap space. While we're at it, add comments to help explain what's going on. This isn't a great long-term fix. This code should probably be changed to use something like set_memory_ro. Signed-off-by: Andy Lutomirski <luto@kernel.org> Cc: Andrew Cooper <andrew.cooper3@citrix.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: David Vrabel <dvrabel@cantab.net> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Jan Beulich <jbeulich@suse.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: security@kernel.org <security@kernel.org> Cc: xen-devel <xen-devel@lists.xen.org> Link: http://lkml.kernel.org/r/0b0e55b995cda11e7829f140b833ef932fcabe3a.1438291540.git.luto@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-08-16x86/nmi/64: Use DF to avoid userspace RSP confusing nested NMI detectionAndy Lutomirski
commit 810bc075f78ff2c221536eb3008eac6a492dba2d upstream. We have a tricky bug in the nested NMI code: if we see RSP pointing to the NMI stack on NMI entry from kernel mode, we assume that we are executing a nested NMI. This isn't quite true. A malicious userspace program can point RSP at the NMI stack, issue SYSCALL, and arrange for an NMI to happen while RSP is still pointing at the NMI stack. Fix it with a sneaky trick. Set DF in the region of code that the RSP check is intended to detect. IRET will clear DF atomically. ( Note: other than paravirt, there's little need for all this complexity. We could check RIP instead of RSP. ) Signed-off-by: Andy Lutomirski <luto@kernel.org> Reviewed-by: Steven Rostedt <rostedt@goodmis.org> Cc: Borislav Petkov <bp@suse.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-08-16x86/nmi/64: Reorder nested NMI checksAndy Lutomirski
commit a27507ca2d796cfa8d907de31ad730359c8a6d06 upstream. Check the repeat_nmi .. end_repeat_nmi special case first. The next patch will rework the RSP check and, as a side effect, the RSP check will no longer detect repeat_nmi .. end_repeat_nmi, so we'll need this ordering of the checks. Note: this is more subtle than it appears. The check for repeat_nmi .. end_repeat_nmi jumps straight out of the NMI code instead of adjusting the "iret" frame to force a repeat. This is necessary, because the code between repeat_nmi and end_repeat_nmi sets "NMI executing" and then writes to the "iret" frame itself. If a nested NMI comes in and modifies the "iret" frame while repeat_nmi is also modifying it, we'll end up with garbage. The old code got this right, as does the new code, but the new code is a bit more explicit. If we were to move the check right after the "NMI executing" check, then we'd get it wrong and have random crashes. ( Because the "NMI executing" check would jump to the code that would modify the "iret" frame without checking if the interrupted NMI was currently modifying it. ) Signed-off-by: Andy Lutomirski <luto@kernel.org> Reviewed-by: Steven Rostedt <rostedt@goodmis.org> Cc: Borislav Petkov <bp@suse.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-08-16x86/nmi/64: Improve nested NMI commentsAndy Lutomirski
commit 0b22930ebad563ae97ff3f8d7b9f12060b4c6e6b upstream. I found the nested NMI documentation to be difficult to follow. Improve the comments. Signed-off-by: Andy Lutomirski <luto@kernel.org> Reviewed-by: Steven Rostedt <rostedt@goodmis.org> Cc: Borislav Petkov <bp@suse.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-08-16x86/nmi/64: Switch stacks on userspace NMI entryAndy Lutomirski
commit 9b6e6a8334d56354853f9c255d1395c2ba570e0a upstream. Returning to userspace is tricky: IRET can fail, and ESPFIX can rearrange the stack prior to IRET. The NMI nesting fixup relies on a precise stack layout and atomic IRET. Rather than trying to teach the NMI nesting fixup to handle ESPFIX and failed IRET, punt: run NMIs that came from user mode on the normal kernel stack. This will make some nested NMIs visible to C code, but the C code is okay with that. As a side effect, this should speed up perf: it eliminates an RDMSR when NMIs come from user mode. Signed-off-by: Andy Lutomirski <luto@kernel.org> Reviewed-by: Steven Rostedt <rostedt@goodmis.org> Reviewed-by: Borislav Petkov <bp@suse.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-08-16x86/nmi/64: Remove asm code that saves CR2Andy Lutomirski
commit 0e181bb58143cb4a2e8f01c281b0816cd0e4798e upstream. Now that do_nmi saves CR2, we don't need to save it in asm. Signed-off-by: Andy Lutomirski <luto@kernel.org> Reviewed-by: Steven Rostedt <rostedt@goodmis.org> Acked-by: Borislav Petkov <bp@suse.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-08-16x86/nmi: Enable nested do_nmi() handling for 64-bit kernelsAndy Lutomirski
commit 9d05041679904b12c12421cbcf9cb5f4860a8d7b upstream. 32-bit kernels handle nested NMIs in C. Enable the exact same handling on 64-bit kernels as well. This isn't currently necessary, but it will become necessary once the asm code starts allowing limited nesting. Signed-off-by: Andy Lutomirski <luto@kernel.org> Reviewed-by: Steven Rostedt <rostedt@goodmis.org> Cc: Borislav Petkov <bp@suse.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-08-16x86/asm/entry/64: Remove pointless jump to irq_returnAndy Lutomirski
commit 5ca6f70f387b4f82903037cc3c5488e2c97dcdbc upstream. INTERRUPT_RETURN turns into a jmp instruction. There's no need for extra indirection. Signed-off-by: Andy Lutomirski <luto@kernel.org> Cc: <linux-kernel@vger.kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/2f2318653dbad284a59311f13f08cea71298fd7c.1433449436.git.luto@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-08-10perf/x86/intel/cqm: Return cached counter value from IRQ contextMatt Fleming
commit 2c534c0da0a68418693e10ce1c4146e085f39518 upstream. Peter reported the following potential crash which I was able to reproduce with his test program, [ 148.765788] ------------[ cut here ]------------ [ 148.765796] WARNING: CPU: 34 PID: 2840 at kernel/smp.c:417 smp_call_function_many+0xb6/0x260() [ 148.765797] Modules linked in: [ 148.765800] CPU: 34 PID: 2840 Comm: perf Not tainted 4.2.0-rc1+ #4 [ 148.765803] ffffffff81cdc398 ffff88085f105950 ffffffff818bdfd5 0000000000000007 [ 148.765805] 0000000000000000 ffff88085f105990 ffffffff810e413a 0000000000000000 [ 148.765807] ffffffff82301080 0000000000000022 ffffffff8107f640 ffffffff8107f640 [ 148.765809] Call Trace: [ 148.765810] <NMI> [<ffffffff818bdfd5>] dump_stack+0x45/0x57 [ 148.765818] [<ffffffff810e413a>] warn_slowpath_common+0x8a/0xc0 [ 148.765822] [<ffffffff8107f640>] ? intel_cqm_stable+0x60/0x60 [ 148.765824] [<ffffffff8107f640>] ? intel_cqm_stable+0x60/0x60 [ 148.765825] [<ffffffff810e422a>] warn_slowpath_null+0x1a/0x20 [ 148.765827] [<ffffffff811613f6>] smp_call_function_many+0xb6/0x260 [ 148.765829] [<ffffffff8107f640>] ? intel_cqm_stable+0x60/0x60 [ 148.765831] [<ffffffff81161748>] on_each_cpu_mask+0x28/0x60 [ 148.765832] [<ffffffff8107f6ef>] intel_cqm_event_count+0x7f/0xe0 [ 148.765836] [<ffffffff811cdd35>] perf_output_read+0x2a5/0x400 [ 148.765839] [<ffffffff811d2e5a>] perf_output_sample+0x31a/0x590 [ 148.765840] [<ffffffff811d333d>] ? perf_prepare_sample+0x26d/0x380 [ 148.765841] [<ffffffff811d3497>] perf_event_output+0x47/0x60 [ 148.765843] [<ffffffff811d36c5>] __perf_event_overflow+0x215/0x240 [ 148.765844] [<ffffffff811d4124>] perf_event_overflow+0x14/0x20 [ 148.765847] [<ffffffff8107e7f4>] intel_pmu_handle_irq+0x1d4/0x440 [ 148.765849] [<ffffffff811d07a6>] ? __perf_event_task_sched_in+0x36/0xa0 [ 148.765853] [<ffffffff81219bad>] ? vunmap_page_range+0x19d/0x2f0 [ 148.765854] [<ffffffff81219d11>] ? unmap_kernel_range_noflush+0x11/0x20 [ 148.765859] [<ffffffff814ce6fe>] ? ghes_copy_tofrom_phys+0x11e/0x2a0 [ 148.765863] [<ffffffff8109e5db>] ? native_apic_msr_write+0x2b/0x30 [ 148.765865] [<ffffffff8109e44d>] ? x2apic_send_IPI_self+0x1d/0x20 [ 148.765869] [<ffffffff81065135>] ? arch_irq_work_raise+0x35/0x40 [ 148.765872] [<ffffffff811c8d86>] ? irq_work_queue+0x66/0x80 [ 148.765875] [<ffffffff81075306>] perf_event_nmi_handler+0x26/0x40 [ 148.765877] [<ffffffff81063ed9>] nmi_handle+0x79/0x100 [ 148.765879] [<ffffffff81064422>] default_do_nmi+0x42/0x100 [ 148.765880] [<ffffffff81064563>] do_nmi+0x83/0xb0 [ 148.765884] [<ffffffff818c7c0f>] end_repeat_nmi+0x1e/0x2e [ 148.765886] [<ffffffff811d07a6>] ? __perf_event_task_sched_in+0x36/0xa0 [ 148.765888] [<ffffffff811d07a6>] ? __perf_event_task_sched_in+0x36/0xa0 [ 148.765890] [<ffffffff811d07a6>] ? __perf_event_task_sched_in+0x36/0xa0 [ 148.765891] <<EOE>> [<ffffffff8110ab66>] finish_task_switch+0x156/0x210 [ 148.765898] [<ffffffff818c1671>] __schedule+0x341/0x920 [ 148.765899] [<ffffffff818c1c87>] schedule+0x37/0x80 [ 148.765903] [<ffffffff810ae1af>] ? do_page_fault+0x2f/0x80 [ 148.765905] [<ffffffff818c1f4a>] schedule_user+0x1a/0x50 [ 148.765907] [<ffffffff818c666c>] retint_careful+0x14/0x32 [ 148.765908] ---[ end trace e33ff2be78e14901 ]--- The CQM task events are not safe to be called from within interrupt context because they require performing an IPI to read the counter value on all sockets. And performing IPIs from within IRQ context is a "no-no". Make do with the last read counter value currently event in event->count when we're invoked in this context. Reported-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Matt Fleming <matt.fleming@intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vikas Shivappa <vikas.shivappa@intel.com> Cc: Kanaka Juvva <kanaka.d.juvva@intel.com> Cc: Will Auld <will.auld@intel.com> Link: http://lkml.kernel.org/r/1437490509-15373-1-git-send-email-matt@codeblueprint.co.uk Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-08-10x86/efi: Use all 64 bit of efi_memmap in setup_e820()Dmitry Skorodumov
commit 7cc03e48965453b5df1cce5062c826189b04b960 upstream. The efi_info structure stores low 32 bits of memory map in efi_memmap and high 32 bits in efi_memmap_hi. While constructing pointer in the setup_e820(), need to take into account all 64 bit of the pointer. It is because on 64bit machine the function efi_get_memory_map() may return full 64bit pointer and before the patch that pointer was truncated. The issue is triggered on Parallles virtual machine and fixed with this patch. Signed-off-by: Dmitry Skorodumov <sdmitry@parallels.com> Cc: Denis V. Lunev <den@openvz.org> Signed-off-by: Matt Fleming <matt.fleming@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-08-10efi: Check for NULL efi kernel parametersRicardo Neri
commit 9115c7589b11349a1c3099758b4bded579ff69e0 upstream. Even though it is documented how to specifiy efi parameters, it is possible to cause a kernel panic due to a dereference of a NULL pointer when parsing such parameters if "efi" alone is given: PANIC: early exception 0e rip 10:ffffffff812fb361 error 0 cr2 0 [ 0.000000] CPU: 0 PID: 0 Comm: swapper Not tainted 4.2.0-rc1+ #450 [ 0.000000] ffffffff81fe20a9 ffffffff81e03d50 ffffffff8184bb0f 00000000000003f8 [ 0.000000] 0000000000000000 ffffffff81e03e08 ffffffff81f371a1 64656c62616e6520 [ 0.000000] 0000000000000069 000000000000005f 0000000000000000 0000000000000000 [ 0.000000] Call Trace: [ 0.000000] [<ffffffff8184bb0f>] dump_stack+0x45/0x57 [ 0.000000] [<ffffffff81f371a1>] early_idt_handler_common+0x81/0xae [ 0.000000] [<ffffffff812fb361>] ? parse_option_str+0x11/0x90 [ 0.000000] [<ffffffff81f4dd69>] arch_parse_efi_cmdline+0x15/0x42 [ 0.000000] [<ffffffff81f376e1>] do_early_param+0x50/0x8a [ 0.000000] [<ffffffff8106b1b3>] parse_args+0x1e3/0x400 [ 0.000000] [<ffffffff81f37a43>] parse_early_options+0x24/0x28 [ 0.000000] [<ffffffff81f37691>] ? loglevel+0x31/0x31 [ 0.000000] [<ffffffff81f37a78>] parse_early_param+0x31/0x3d [ 0.000000] [<ffffffff81f3ae98>] setup_arch+0x2de/0xc08 [ 0.000000] [<ffffffff8109629a>] ? vprintk_default+0x1a/0x20 [ 0.000000] [<ffffffff81f37b20>] start_kernel+0x90/0x423 [ 0.000000] [<ffffffff81f37495>] x86_64_start_reservations+0x2a/0x2c [ 0.000000] [<ffffffff81f37582>] x86_64_start_kernel+0xeb/0xef [ 0.000000] RIP 0xffffffff81ba2efc This panic is not reproducible with "efi=" as this will result in a non-NULL zero-length string. Thus, verify that the pointer to the parameter string is not NULL. This is consistent with other parameter-parsing functions which check for NULL pointers. Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com> Cc: Dave Young <dyoung@redhat.com> Signed-off-by: Matt Fleming <matt.fleming@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-08-10x86/mm: Add parenthesis for TLB tracepoint size calculationDave Hansen
commit bbc03778b9954a2ec93baed63718e4df0192f130 upstream. flush_tlb_info->flush_start/end are both normal virtual addresses. When calculating 'nr_pages' (only used for the tracepoint), I neglected to put parenthesis in. Thanks to David Koufaty for pointing this out. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: dave@sr71.net Link: http://lkml.kernel.org/r/20150720230153.9E834081@viggo.jf.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-08-10x86, perf: Fix static_key bug in load_mm_cr4()Peter Zijlstra
commit a833581e372a4adae2319d8dc379493edbc444e9 upstream. Mikulas reported his K6-3 not booting. This is because the static_key API confusion struck and bit Andy, this wants to be static_key_false(). Reported-by: Mikulas Patocka <mpatocka@redhat.com> Tested-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Kees Cook <keescook@chromium.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu> Cc: Vince Weaver <vince@deater.net> Cc: hillf.zj <hillf.zj@alibaba-inc.com> Fixes: a66734297f78 ("perf/x86: Add /sys/devices/cpu/rdpmc=2 to allow rdpmc for all tasks") Link: http://lkml.kernel.org/r/20150709172338.GC19282@twins.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-08-10x86/kasan: Fix boot crash on AMD processorsAndrey Ryabinin
commit d4f86beacc21d538dc41e1fc75a22e084f547edf upstream. While populating zero shadow wrong bits in upper level page tables used. __PAGE_KERNEL_RO that was used for pgd/pud/pmd has _PAGE_BIT_GLOBAL set. Global bit is present only in the lowest level of the page translation hierarchy (ptes), and it should be zero in upper levels. This bug seems doesn't cause any troubles on Intel cpus, while on AMDs it cause kernel crash on boot. Use _KERNPG_TABLE bits for pgds/puds/pmds to fix this. Reported-by: Borislav Petkov <bp@alien8.de> Signed-off-by: Andrey Ryabinin <a.ryabinin@samsung.com> Cc: Alexander Popov <alpopov@ptsecurity.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <adech.fo@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1435828178-10975-5-git-send-email-a.ryabinin@samsung.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-08-10x86/kasan: Flush TLBs after switching CR3Andrey Ryabinin
commit 241d2c54c62fa0939fc9a9512b48ac3434e90a89 upstream. load_cr3() doesn't cause tlb_flush if PGE enabled. This may cause tons of false positive reports spamming the kernel to death. To fix this __flush_tlb_all() should be called explicitly after CR3 changed. Signed-off-by: Andrey Ryabinin <a.ryabinin@samsung.com> Cc: Alexander Popov <alpopov@ptsecurity.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <adech.fo@gmail.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1435828178-10975-4-git-send-email-a.ryabinin@samsung.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-08-10x86/kasan: Fix KASAN shadow region page tablesAlexander Popov
commit 5d5aa3cfca5cf74cd928daf3674642e6004328d1 upstream. Currently KASAN shadow region page tables created without respect of physical offset (phys_base). This causes kernel halt when phys_base is not zero. So let's initialize KASAN shadow region page tables in kasan_early_init() using __pa_nodebug() which considers phys_base. This patch also separates x86_64_start_kernel() from KASAN low level details by moving kasan_map_early_shadow(init_level4_pgt) into kasan_early_init(). Remove the comment before clear_bss() which stopped bringing much profit to the code readability. Otherwise describing all the new order dependencies would be too verbose. Signed-off-by: Alexander Popov <alpopov@ptsecurity.com> Signed-off-by: Andrey Ryabinin <a.ryabinin@samsung.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <adech.fo@gmail.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1435828178-10975-3-git-send-email-a.ryabinin@samsung.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-08-10x86/init: Clear 'init_level4_pgt' earlierAndrey Ryabinin
commit d0f77d4d04b222a817925d33ba3589b190bfa863 upstream. Currently x86_64_start_kernel() has two KASAN related function calls. The first call maps shadow to early_level4_pgt, the second maps shadow to init_level4_pgt. If we move clear_page(init_level4_pgt) earlier, we could hide KASAN low level detail from generic x86_64 initialization code. The next patch will do it. Signed-off-by: Andrey Ryabinin <a.ryabinin@samsung.com> Cc: Alexander Popov <alpopov@ptsecurity.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <adech.fo@gmail.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1435828178-10975-2-git-send-email-a.ryabinin@samsung.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>