summaryrefslogtreecommitdiff
path: root/arch/x86/include
AgeCommit message (Collapse)Author
2016-10-31Revert "fix minor infoleak in get_user_ex()"Greg Kroah-Hartman
This reverts commit 9d25c78ec01c402dc56272693c44ef9d72ecdd2e which is 1c109fabbd51863475cd12ac206bdd249aee35af upstream Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-10-31Revert "x86/mm: Expand the exception table logic to allow new handling options"Greg Kroah-Hartman
This reverts commit fcf5e5198b447969ed2a56ec335dae3c695a6b46 which is 548acf19234dbda5a52d5a8e7e205af46e9da840 upstream. Cc: Tony Luck <tony.luck@intel.com> Cc: Borislav Petkov <bp@suse.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-10-31x86/mm: Expand the exception table logic to allow new handling optionsTony Luck
commit 548acf19234dbda5a52d5a8e7e205af46e9da840 upstream. Huge amounts of help from Andy Lutomirski and Borislav Petkov to produce this. Andy provided the inspiration to add classes to the exception table with a clever bit-squeezing trick, Boris pointed out how much cleaner it would all be if we just had a new field. Linus Torvalds blessed the expansion with: ' I'd rather not be clever in order to save just a tiny amount of space in the exception table, which isn't really criticial for anybody. ' The third field is another relative function pointer, this one to a handler that executes the actions. We start out with three handlers: 1: Legacy - just jumps the to fixup IP 2: Fault - provide the trap number in %ax to the fixup code 3: Cleaned up legacy for the uaccess error hack Signed-off-by: Tony Luck <tony.luck@intel.com> Reviewed-by: Borislav Petkov <bp@suse.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/f6af78fcbd348cf4939875cfda9c19689b5e50b8.1455732970.git.tony.luck@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-10-07x86/init: Fix cr4_init_shadow() on CR4-less machinesAndy Lutomirski
commit e1bfc11c5a6f40222a698a818dc269113245820e upstream. cr4_init_shadow() will panic on 486-like machines without CR4. Fix it using __read_cr4_safe(). Reported-by: david@saggiorato.net Signed-off-by: Andy Lutomirski <luto@kernel.org> Reviewed-by: Borislav Petkov <bp@suse.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: 1e02ce4cccdc ("x86: Store a per-cpu shadow copy of CR4") Link: http://lkml.kernel.org/r/43a20f81fb504013bf613913dc25574b45336a61.1475091074.git.luto@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-09-24fix minor infoleak in get_user_ex()Al Viro
commit 1c109fabbd51863475cd12ac206bdd249aee35af upstream. get_user_ex(x, ptr) should zero x on failure. It's not a lot of a leak (at most we are leaking uninitialized 64bit value off the kernel stack, and in a fairly constrained situation, at that), but the fix is trivial, so... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> [ This sat in different branch from the uaccess fixes since mid-August ] Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-09-07x86/mm: Disable preemption during CR3 read+writeSebastian Andrzej Siewior
commit 5cf0791da5c162ebc14b01eb01631cfa7ed4fa6e upstream. There's a subtle preemption race on UP kernels: Usually current->mm (and therefore mm->pgd) stays the same during the lifetime of a task so it does not matter if a task gets preempted during the read and write of the CR3. But then, there is this scenario on x86-UP: TaskA is in do_exit() and exit_mm() sets current->mm = NULL followed by: -> mmput() -> exit_mmap() -> tlb_finish_mmu() -> tlb_flush_mmu() -> tlb_flush_mmu_tlbonly() -> tlb_flush() -> flush_tlb_mm_range() -> __flush_tlb_up() -> __flush_tlb() -> __native_flush_tlb() At this point current->mm is NULL but current->active_mm still points to the "old" mm. Let's preempt taskA _after_ native_read_cr3() by taskB. TaskB has its own mm so CR3 has changed. Now preempt back to taskA. TaskA has no ->mm set so it borrows taskB's mm and so CR3 remains unchanged. Once taskA gets active it continues where it was interrupted and that means it writes its old CR3 value back. Everything is fine because userland won't need its memory anymore. Now the fun part: Let's preempt taskA one more time and get back to taskB. This time switch_mm() won't do a thing because oldmm (->active_mm) is the same as mm (as per context_switch()). So we remain with a bad CR3 / PGD and return to userland. The next thing that happens is handle_mm_fault() with an address for the execution of its code in userland. handle_mm_fault() realizes that it has a PTE with proper rights so it returns doing nothing. But the CPU looks at the wrong PGD and insists that something is wrong and faults again. And again. And one more time… This pagefault circle continues until the scheduler gets tired of it and puts another task on the CPU. It gets little difficult if the task is a RT task with a high priority. The system will either freeze or it gets fixed by the software watchdog thread which usually runs at RT-max prio. But waiting for the watchdog will increase the latency of the RT task which is no good. Fix this by disabling preemption across the critical code section. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Borislav Petkov <bp@suse.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/1470404259-26290-1-git-send-email-bigeasy@linutronix.de [ Prettified the changelog. ] Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-08-16x86/xen, pat: Remove PAT table init code from XenToshi Kani
commit 88ba281108ed0c25c9d292b48bd3f272fcb90dd0 upstream. Xen supports PAT without MTRRs for its guests. In order to enable WC attribute, it was necessary for xen_start_kernel() to call pat_init_cache_modes() to update PAT table before starting guest kernel. Now that the kernel initializes PAT table to the BIOS handoff state when MTRR is disabled, this Xen-specific PAT init code is no longer necessary. Delete it from xen_start_kernel(). Also change __init_cache_modes() to a static function since PAT table should not be tweaked by other modules. Signed-off-by: Toshi Kani <toshi.kani@hpe.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Juergen Gross <jgross@suse.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Borislav Petkov <bp@suse.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Luis R. Rodriguez <mcgrof@suse.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Toshi Kani <toshi.kani@hp.com> Cc: elliott@hpe.com Cc: paul.gortmaker@windriver.com Cc: xen-devel@lists.xenproject.org Link: http://lkml.kernel.org/r/1458769323-24491-7-git-send-email-toshi.kani@hpe.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-08-16x86/mtrr: Fix Xorg crashes in Qemu sessionsToshi Kani
commit edfe63ec97ed8d4496225f7ba54c9ce4207c5431 upstream. A Xorg failure on qemu32 was reported as a regression [1] caused by commit 9cd25aac1f44 ("x86/mm/pat: Emulate PAT when it is disabled"). This patch fixes the Xorg crash. Negative effects of this regression were the following two failures [2] in Xorg on QEMU with QEMU CPU model "qemu32" (-cpu qemu32), which were triggered by the fact that its virtual CPU does not support MTRRs. #1. copy_process() failed in the check in reserve_pfn_range() copy_process copy_mm dup_mm dup_mmap copy_page_range track_pfn_copy reserve_pfn_range A WC map request was tracked as WC in memtype, which set a PTE as UC (pgprot) per __cachemode2pte_tbl[]. This led to this error in reserve_pfn_range() called from track_pfn_copy(), which obtained a pgprot from a PTE. It converts pgprot to page_cache_mode, which does not necessarily result in the original page_cache_mode since __cachemode2pte_tbl[] redirects multiple types to UC. #2. error path in copy_process() then hit WARN_ON_ONCE in untrack_pfn(). x86/PAT: Xorg:509 map pfn expected mapping type uncached- minus for [mem 0xfd000000-0xfdffffff], got write-combining Call Trace: dump_stack warn_slowpath_common ? untrack_pfn ? untrack_pfn warn_slowpath_null untrack_pfn ? __kunmap_atomic unmap_single_vma ? pagevec_move_tail_fn unmap_vmas exit_mmap mmput copy_process.part.47 _do_fork SyS_clone do_syscall_32_irqs_on entry_INT80_32 These negative effects are caused by two separate bugs, but they can be addressed in separate patches. Fixing the pat_init() issue described below addresses the root cause, and avoids Xorg to hit these cases. When the CPU does not support MTRRs, MTRR does not call pat_init(), which leaves PAT enabled without initializing PAT. This pat_init() issue is a long-standing issue, but manifested as issue #1 (and then hit issue #2) with the above-mentioned commit because the memtype now tracks cache attribute with 'page_cache_mode'. This pat_init() issue existed before the commit, but we used pgprot in memtype. Hence, we did not have issue #1 before. But WC request resulted in WT in effect because WC pgrot is actually WT when PAT is not initialized. This is not how it was designed to work. When PAT is set to disable properly, WC is converted to UC. The use of WT can result in a system crash if the target range does not support WT. Fortunately, nobody ran into such issue before. To fix this pat_init() issue, PAT code has been enhanced to provide pat_disable() interface. Call this interface when MTRRs are disabled. By setting PAT to disable properly, PAT bypasses the memtype check, and avoids issue #1. [1]: https://lkml.org/lkml/2016/3/3/828 [2]: https://lkml.org/lkml/2016/3/4/775 Signed-off-by: Toshi Kani <toshi.kani@hpe.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Borislav Petkov <bp@suse.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Juergen Gross <jgross@suse.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Luis R. Rodriguez <mcgrof@suse.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Toshi Kani <toshi.kani@hp.com> Cc: elliott@hpe.com Cc: konrad.wilk@oracle.com Cc: paul.gortmaker@windriver.com Cc: xen-devel@lists.xenproject.org Link: http://lkml.kernel.org/r/1458769323-24491-5-git-send-email-toshi.kani@hpe.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-08-16x86/mm/pat: Add pat_disable() interfaceToshi Kani
commit 224bb1e5d67ba0f2872c98002d6a6f991ac6fd4a upstream. In preparation for fixing a regression caused by: 9cd25aac1f44 ("x86/mm/pat: Emulate PAT when it is disabled") ... PAT needs to provide an interface that prevents the OS from initializing the PAT MSR. PAT MSR initialization must be done on all CPUs using the specific sequence of operations defined in the Intel SDM. This requires MTRRs to be enabled since pat_init() is called as part of MTRR init from mtrr_rendezvous_handler(). Make pat_disable() as the interface that prevents the OS from initializing the PAT MSR. MTRR will call this interface when it cannot provide the SDM-defined sequence to initialize PAT. This also assures that pat_disable() called from pat_bsp_init() will set the PAT table properly when CPU does not support PAT. Signed-off-by: Toshi Kani <toshi.kani@hpe.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Borislav Petkov <bp@suse.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Juergen Gross <jgross@suse.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Luis R. Rodriguez <mcgrof@suse.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Robert Elliott <elliott@hpe.com> Cc: Toshi Kani <toshi.kani@hp.com> Cc: konrad.wilk@oracle.com Cc: paul.gortmaker@windriver.com Cc: xen-devel@lists.xenproject.org Link: http://lkml.kernel.org/r/1458769323-24491-3-git-send-email-toshi.kani@hpe.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-08-16x86/mm/pat: Add support of non-default PAT MSR settingToshi Kani
commit 02f037d641dc6672be5cfe7875a48ab99b95b154 upstream. In preparation for fixing a regression caused by: 9cd25aac1f44 ("x86/mm/pat: Emulate PAT when it is disabled")' ... PAT needs to support a case that PAT MSR is initialized with a non-default value. When pat_init() is called and PAT is disabled, it initializes the PAT table with the BIOS default value. Xen, however, sets PAT MSR with a non-default value to enable WC. This causes inconsistency between the PAT table and PAT MSR when PAT is set to disable on Xen. Change pat_init() to handle the PAT disable cases properly. Add init_cache_modes() to handle two cases when PAT is set to disable. 1. CPU supports PAT: Set PAT table to be consistent with PAT MSR. 2. CPU does not support PAT: Set PAT table to be consistent with PWT and PCD bits in a PTE. Note, __init_cache_modes(), renamed from pat_init_cache_modes(), will be changed to a static function in a later patch. Signed-off-by: Toshi Kani <toshi.kani@hpe.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Borislav Petkov <bp@suse.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Juergen Gross <jgross@suse.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Luis R. Rodriguez <mcgrof@suse.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Toshi Kani <toshi.kani@hp.com> Cc: elliott@hpe.com Cc: konrad.wilk@oracle.com Cc: paul.gortmaker@windriver.com Cc: xen-devel@lists.xenproject.org Link: http://lkml.kernel.org/r/1458769323-24491-2-git-send-email-toshi.kani@hpe.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-08-10pvclock: Add CPU barriers to get correct version valueMinfei Huang
commit 749d088b8e7f4b9826ede02b9a043e417fa84aa1 upstream. Protocol for the "version" fields is: hypervisor raises it (making it uneven) before it starts updating the fields and raises it again (making it even) when it is done. Thus the guest can make sure the time values it got are consistent by checking the version before and after reading them. Add CPU barries after getting version value just like what function vread_pvclock does, because all of callees in this function is inline. Fixes: 502dfeff239e8313bfbe906ca0a1a6827ac8481b Signed-off-by: Minfei Huang <mnghuan@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-05-04x86/mm/xen: Suppress hugetlbfs in PV guestsJan Beulich
commit 103f6112f253017d7062cd74d17f4a514ed4485c upstream. Huge pages are not normally available to PV guests. Not suppressing hugetlbfs use results in an endless loop of page faults when user mode code tries to access a hugetlbfs mapped area (since the hypervisor denies such PTEs to be created, but error indications can't be propagated out of xen_set_pte_at(), just like for various of its siblings), and - once killed in an oops like this: kernel BUG at .../fs/hugetlbfs/inode.c:428! invalid opcode: 0000 [#1] SMP ... RIP: e030:[<ffffffff811c333b>] [<ffffffff811c333b>] remove_inode_hugepages+0x25b/0x320 ... Call Trace: [<ffffffff811c3415>] hugetlbfs_evict_inode+0x15/0x40 [<ffffffff81167b3d>] evict+0xbd/0x1b0 [<ffffffff8116514a>] __dentry_kill+0x19a/0x1f0 [<ffffffff81165b0e>] dput+0x1fe/0x220 [<ffffffff81150535>] __fput+0x155/0x200 [<ffffffff81079fc0>] task_work_run+0x60/0xa0 [<ffffffff81063510>] do_exit+0x160/0x400 [<ffffffff810637eb>] do_group_exit+0x3b/0xa0 [<ffffffff8106e8bd>] get_signal+0x1ed/0x470 [<ffffffff8100f854>] do_signal+0x14/0x110 [<ffffffff810030e9>] prepare_exit_to_usermode+0xe9/0xf0 [<ffffffff814178a5>] retint_user+0x8/0x13 This is CVE-2016-3961 / XSA-174. Reported-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: David Vrabel <david.vrabel@citrix.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Juergen Gross <JGross@suse.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Luis R. Rodriguez <mcgrof@suse.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Toshi Kani <toshi.kani@hp.com> Cc: xen-devel <xen-devel@lists.xenproject.org> Link: http://lkml.kernel.org/r/57188ED802000078000E431C@prv-mh.provo.novell.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-04-20Revert "PCI, x86: Implement pcibios_alloc_irq() and pcibios_free_irq()"Bjorn Helgaas
commit 6c777e8799a93e3bdb67bec622429e1b48dc90fb upstream. 991de2e59090 ("PCI, x86: Implement pcibios_alloc_irq() and pcibios_free_irq()") appeared in v4.3 and helps support IOAPIC hotplug. Олег reported that the Elcus-1553 TA1-PCI driver worked in v4.2 but not v4.3 and bisected it to 991de2e59090. Sunjin reported that the RocketRAID 272x driver worked in v4.2 but not v4.3. In both cases booting with "pci=routirq" is a workaround. I think the problem is that after 991de2e59090, we no longer call pcibios_enable_irq() for upstream bridges. Prior to 991de2e59090, when a driver called pci_enable_device(), we recursively called pcibios_enable_irq() for upstream bridges via pci_enable_bridge(). After 991de2e59090, we call pcibios_enable_irq() from pci_device_probe() instead of the pci_enable_device() path, which does *not* call pcibios_enable_irq() for upstream bridges. Revert 991de2e59090 to fix these driver regressions. Link: https://bugzilla.kernel.org/show_bug.cgi?id=111211 Fixes: 991de2e59090 ("PCI, x86: Implement pcibios_alloc_irq() and pcibios_free_irq()") Reported-and-tested-by: Олег Мороз <oleg.moroz@mcc.vniiem.ru> Reported-by: Sunjin Yang <fan4326@gmail.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Acked-by: Rafael J. Wysocki <rafael@kernel.org> CC: Jiang Liu <jiang.liu@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-04-20KVM: x86: reduce default value of halt_poll_ns parameterPaolo Bonzini
commit 14ebda3394fd3e5388747e742e510b0802a65d24 upstream. Windows lets applications choose the frequency of the timer tick, and in Windows 10 the maximum rate was changed from 1024 Hz to 2048 Hz. Unfortunately, because of the way the Windows API works, most applications who need a higher rate than the default 64 Hz will just do timeGetDevCaps(&tc, sizeof(tc)); timeBeginPeriod(tc.wPeriodMin); and pick the maximum rate. This causes very high CPU usage when playing media or games on Windows 10, even if the guest does not actually use the CPU very much, because the frequent timer tick causes halt_poll_ns to kick in. There is no really good solution, especially because Microsoft could sooner or later bump the limit to 4096 Hz, but for now the best we can do is lower a bit the upper limit for halt_poll_ns. :-( Reported-by: Jon Panozzo <jonp@lime-technology.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-04-12x86/iopl/64: Properly context-switch IOPL on Xen PVAndy Lutomirski
commit b7a584598aea7ca73140cb87b40319944dd3393f upstream. On Xen PV, regs->flags doesn't reliably reflect IOPL and the exit-to-userspace code doesn't change IOPL. We need to context switch it manually. I'm doing this without going through paravirt because this is specific to Xen PV. After the dust settles, we can merge this with the 32-bit code, tidy up the iopl syscall implementation, and remove the set_iopl pvop entirely. Fixes XSA-171. Reviewewd-by: Jan Beulich <JBeulich@suse.com> Signed-off-by: Andy Lutomirski <luto@kernel.org> Cc: Andrew Cooper <andrew.cooper3@citrix.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: David Vrabel <david.vrabel@citrix.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Jan Beulich <JBeulich@suse.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/693c3bd7aeb4d3c27c92c622b7d0f554a458173c.1458162709.git.luto@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-04-12x86/apic: Fix suspicious RCU usage in smp_trace_call_function_interrupt()Dave Jones
commit 7834c10313fb823e538f2772be78edcdeed2e6e3 upstream. Since 4.4, I've been able to trigger this occasionally: =============================== [ INFO: suspicious RCU usage. ] 4.5.0-rc7-think+ #3 Not tainted Cc: Andi Kleen <ak@linux.intel.com> Link: http://lkml.kernel.org/r/20160315012054.GA17765@codemonkey.org.uk Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> ------------------------------- ./arch/x86/include/asm/msr-trace.h:47 suspicious rcu_dereference_check() usage! other info that might help us debug this: RCU used illegally from idle CPU! rcu_scheduler_active = 1, debug_locks = 1 RCU used illegally from extended quiescent state! no locks held by swapper/3/0. stack backtrace: CPU: 3 PID: 0 Comm: swapper/3 Not tainted 4.5.0-rc7-think+ #3 ffffffff92f821e0 1f3e5c340597d7fc ffff880468e07f10 ffffffff92560c2a ffff880462145280 0000000000000001 ffff880468e07f40 ffffffff921376a6 ffffffff93665ea0 0000cc7c876d28da 0000000000000005 ffffffff9383dd60 Call Trace: <IRQ> [<ffffffff92560c2a>] dump_stack+0x67/0x9d [<ffffffff921376a6>] lockdep_rcu_suspicious+0xe6/0x100 [<ffffffff925ae7a7>] do_trace_write_msr+0x127/0x1a0 [<ffffffff92061c83>] native_apic_msr_eoi_write+0x23/0x30 [<ffffffff92054408>] smp_trace_call_function_interrupt+0x38/0x360 [<ffffffff92d1ca60>] trace_call_function_interrupt+0x90/0xa0 <EOI> [<ffffffff92ac5124>] ? cpuidle_enter_state+0x1b4/0x520 Move the entering_irq() call before ack_APIC_irq(), because entering_irq() tells the RCU susbstems to end the extended quiescent state, so that the following trace call in ack_APIC_irq() works correctly. Suggested-by: Andi Kleen <ak@linux.intel.com> Fixes: 4787c368a9bc "x86/tracing: Add irq_enter/exit() in smp_trace_reschedule_interrupt()" Signed-off-by: Dave Jones <davej@codemonkey.org.uk> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-04-12x86/irq: Cure live lock in fixup_irqs()Thomas Gleixner
commit 551adc60573cb68e3d55cacca9ba1b7437313df7 upstream. Harry reported, that he's able to trigger a system freeze with cpu hot unplug. The freeze turned out to be a live lock caused by recent changes in irq_force_complete_move(). When fixup_irqs() and from there irq_force_complete_move() is called on the dying cpu, then all other cpus are in stop machine an wait for the dying cpu to complete the teardown. If there is a move of an interrupt pending then irq_force_complete_move() sends the cleanup IPI to the cpus in the old_domain mask and waits for them to clear the mask. That's obviously impossible as those cpus are firmly stuck in stop machine with interrupts disabled. I should have known that, but I completely overlooked it being concentrated on the locking issues around the vectors. And the existance of the call to __irq_complete_move() in the code, which actually sends the cleanup IPI made it reasonable to wait for that cleanup to complete. That call was bogus even before the recent changes as it was just a pointless distraction. We have to look at two cases: 1) The move_in_progress flag of the interrupt is set This means the ioapic has been updated with the new vector, but it has not fired yet. In theory there is a race: set_ioapic(new_vector) <-- Interrupt is raised before update is effective, i.e. it's raised on the old vector. So if the target cpu cannot handle that interrupt before the old vector is cleaned up, we get a spurious interrupt and in the worst case the ioapic irq line becomes stale, but my experiments so far have only resulted in spurious interrupts. But in case of cpu hotplug this should be a non issue because if the affinity update happens right before all cpus rendevouz in stop machine, there is no way that the interrupt can be blocked on the target cpu because all cpus loops first with interrupts enabled in stop machine, so the old vector is not yet cleaned up when the interrupt fires. So the only way to run into this issue is if the delivery of the interrupt on the apic/system bus would be delayed beyond the point where the target cpu disables interrupts in stop machine. I doubt that it can happen, but at least there is a theroretical chance. Virtualization might be able to expose this, but AFAICT the IOAPIC emulation is not as stupid as the real hardware. I've spent quite some time over the weekend to enforce that situation, though I was not able to trigger the delayed case. 2) The move_in_progress flag is not set and the old_domain cpu mask is not empty. That means, that an interrupt was delivered after the change and the cleanup IPI has been sent to the cpus in old_domain, but not all CPUs have responded to it yet. In both cases we can assume that the next interrupt will arrive on the new vector, so we can cleanup the old vectors on the cpus in the old_domain cpu mask. Fixes: 98229aa36caa "x86/irq: Plug vector cleanup race" Reported-by: Harry Junior <harryjr@outlook.fr> Tested-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Joe Lawrence <joe.lawrence@stratus.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Ben Hutchings <ben@decadent.org.uk> Link: http://lkml.kernel.org/r/alpine.DEB.2.11.1603140931430.3657@nanos Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-04-12perf/x86/intel: Add definition for PT PMI bitStephane Eranian
commit 5690ae28e472d25e330ad0c637a5cea3fc39fb32 upstream. This patch adds a definition for GLOBAL_OVFL_STATUS bit 55 which is used with the Processor Trace (PT) feature. Signed-off-by: Stephane Eranian <eranian@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: adrian.hunter@intel.com Cc: kan.liang@intel.com Cc: namhyung@kernel.org Link: http://lkml.kernel.org/r/1457034642-21837-2-git-send-email-eranian@google.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-04-12x86/microcode: Untangle from BLK_DEV_INITRDBorislav Petkov
commit 5f9c01aa7c49a2d74474d6d879a797b8badf29e6 upstream. Thomas Voegtle reported that doing oldconfig with a .config which has CONFIG_MICROCODE enabled but BLK_DEV_INITRD disabled prevents the microcode loading mechanism from being built. So untangle it from the BLK_DEV_INITRD dependency so that oldconfig doesn't turn it off and add an explanatory text to its Kconfig help what the supported methods for supplying microcode are. Reported-by: Thomas Voegtle <tv@lio96.de> Tested-by: Thomas Voegtle <tv@lio96.de> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1454499225-21544-2-git-send-email-bp@alien8.de Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-03-03x86/irq: Call irq_force_move_complete with irq descriptorThomas Gleixner
commit 90a2282e23f0522e4b3f797ad447c5e91bf7fe32 upstream. First of all there is no point in looking up the irq descriptor again, but we also need the descriptor for the final cleanup race fix in the next patch. Make that change seperate. No functional difference. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Borislav Petkov <bp@alien8.de> Tested-by: Joe Lawrence <joe.lawrence@stratus.com> Cc: Jiang Liu <jiang.liu@linux.intel.com> Cc: Jeremiah Mahler <jmmahler@gmail.com> Cc: andy.shevchenko@gmail.com Cc: Guenter Roeck <linux@roeck-us.net> Link: http://lkml.kernel.org/r/20151231160107.125211743@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-02-25x86/mm: Fix types used in pgprot cacheability flags translationsJan Beulich
commit 3625c2c234ef66acf21a72d47a5ffa94f6c5ebf2 upstream. For PAE kernels "unsigned long" is not suitable to hold page protection flags, since _PAGE_NX doesn't fit there. This is the reason for quite a few W+X pages getting reported as insecure during boot (observed namely for the entire initrd range). Fixes: 281d4078be ("x86: Make page cache mode a real type") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Juergen Gross <JGross@suse.com> Link: http://lkml.kernel.org/r/56A7635602000078000CAFF1@prv-mh.provo.novell.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-01-31x86/mm: Improve switch_mm() barrier commentsAndy Lutomirski
commit 4eaffdd5a5fe6ff9f95e1ab4de1ac904d5e0fa8b upstream. My previous comments were still a bit confusing and there was a typo. Fix it up. Reported-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andy Lutomirski <luto@kernel.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Rik van Riel <riel@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: 71b3c126e611 ("x86/mm: Add barriers and document switch_mm()-vs-flush synchronization") Link: http://lkml.kernel.org/r/0a0b43cdcdd241c5faaaecfbcc91a155ddedc9a1.1452631609.git.luto@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-01-31x86/mm: Add barriers and document switch_mm()-vs-flush synchronizationAndy Lutomirski
commit 71b3c126e61177eb693423f2e18a1914205b165e upstream. When switch_mm() activates a new PGD, it also sets a bit that tells other CPUs that the PGD is in use so that TLB flush IPIs will be sent. In order for that to work correctly, the bit needs to be visible prior to loading the PGD and therefore starting to fill the local TLB. Document all the barriers that make this work correctly and add a couple that were missing. Signed-off-by: Andy Lutomirski <luto@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-mm@kvack.org Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-01-31x86/boot: Double BOOT_HEAP_SIZE to 64KBH.J. Lu
commit 8c31902cffc4d716450be549c66a67a8a3dd479c upstream. When decompressing kernel image during x86 bootup, malloc memory for ELF program headers may run out of heap space, which leads to system halt. This patch doubles BOOT_HEAP_SIZE to 64KB. Tested with 32-bit kernel which failed to boot without this patch. Signed-off-by: H.J. Lu <hjl.tools@gmail.com> Acked-by: H. Peter Anvin <hpa@zytor.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-12-19x86/paravirt: Prevent rtc_cmos platform device init on PV guestsDavid Vrabel
Adding the rtc platform device in non-privileged Xen PV guests causes an IRQ conflict because these guests do not have legacy PIC and may allocate irqs in the legacy range. In a single VCPU Xen PV guest we should have: /proc/interrupts: CPU0 0: 4934 xen-percpu-virq timer0 1: 0 xen-percpu-ipi spinlock0 2: 0 xen-percpu-ipi resched0 3: 0 xen-percpu-ipi callfunc0 4: 0 xen-percpu-virq debug0 5: 0 xen-percpu-ipi callfuncsingle0 6: 0 xen-percpu-ipi irqwork0 7: 321 xen-dyn-event xenbus 8: 90 xen-dyn-event hvc_console ... But hvc_console cannot get its interrupt because it is already in use by rtc0 and the console does not work. genirq: Flags mismatch irq 8. 00000000 (hvc_console) vs. 00000000 (rtc0) We can avoid this problem by realizing that unprivileged PV guests (both Xen and lguests) are not supposed to have rtc_cmos device and so adding it is not necessary. Privileged guests (i.e. Xen's dom0) do use it but they should not have irq conflicts since they allocate irqs above legacy range (above gsi_top, in fact). Instead of explicitly testing whether the guest is privileged we can extend pv_info structure to include information about guest's RTC support. Reported-and-tested-by: Sander Eikelenboom <linux@eikelenboom.it> Signed-off-by: David Vrabel <david.vrabel@citrix.com> Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: vkuznets@redhat.com Cc: xen-devel@lists.xenproject.org Cc: konrad.wilk@oracle.com Cc: stable@vger.kernel.org # 4.2+ Link: http://lkml.kernel.org/r/1449842873-2613-1-git-send-email-boris.ostrovsky@oracle.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-12-19x86/xen: Avoid fast syscall path for Xen PV guestsBoris Ostrovsky
After 32-bit syscall rewrite, and specifically after commit: 5f310f739b4c ("x86/entry/32: Re-implement SYSENTER using the new C path") ... the stack frame that is passed to xen_sysexit is no longer a "standard" one (i.e. it's not pt_regs). Since we end up calling xen_iret from xen_sysexit we don't need to fix up the stack and instead follow entry_SYSENTER_32's IRET path directly to xen_iret. We can do the same thing for compat mode even though stack does not need to be fixed. This will allow us to drop usergs_sysret32 paravirt op (in the subsequent patch) Suggested-by: Andy Lutomirski <luto@amacapital.net> Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Reviewed-by: Borislav Petkov <bp@suse.de> Acked-by: Andy Lutomirski <luto@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: david.vrabel@citrix.com Cc: konrad.wilk@oracle.com Cc: virtualization@lists.linux-foundation.org Cc: xen-devel@lists.xenproject.org Link: http://lkml.kernel.org/r/1447970147-1733-2-git-send-email-boris.ostrovsky@oracle.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-12-04x86/mm: Fix regression with huge pages on PAEKirill A. Shutemov
Recent PAT patchset has caused issue on 32-bit PAE machines: page:eea45000 count:0 mapcount:-128 mapping: (null) index:0x0 flags: 0x40000000() page dumped because: VM_BUG_ON_PAGE(page_mapcount(page) < 0) ------------[ cut here ]------------ kernel BUG at /home/build/linux-boris/mm/huge_memory.c:1485! invalid opcode: 0000 [#1] SMP [...] Call Trace: unmap_single_vma ? __wake_up unmap_vmas unmap_region do_munmap vm_munmap SyS_munmap do_fast_syscall_32 ? __do_page_fault sysenter_past_esp Code: ... EIP: [<c11bde80>] zap_huge_pmd+0x240/0x260 SS:ESP 0068:f6459d98 The problem is in pmd_pfn_mask() and pmd_flags_mask(). These helpers use PMD_PAGE_MASK to calculate resulting mask. PMD_PAGE_MASK is 'unsigned long', not 'unsigned long long' as phys_addr_t is on 32-bit PAE (ARCH_PHYS_ADDR_T_64BIT). As a result, the upper bits of resulting mask get truncated. pud_pfn_mask() and pud_flags_mask() aren't problematic since we don't have PUD page table level on 32-bit systems, but it's reasonable to keep them consistent with PMD counterpart. Introduce PHYSICAL_PMD_PAGE_MASK and PHYSICAL_PUD_PAGE_MASK in addition to existing PHYSICAL_PAGE_MASK and reworks helpers to use them. Reported-and-Tested-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> [ Fix -Woverflow warnings from the realmode code. ] Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Toshi Kani <toshi.kani@hpe.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Jürgen Gross <jgross@suse.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: elliott@hpe.com Cc: konrad.wilk@oracle.com Cc: linux-mm <linux-mm@kvack.org> Fixes: f70abb0fc3da ("x86/asm: Fix pud/pmd interfaces to handle large PAT bit") Link: http://lkml.kernel.org/r/1448878233-11390-2-git-send-email-bp@alien8.de Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-11-16Merge branch 'pm-tools'Rafael J. Wysocki
* pm-tools: x86: remove unused definition of MSR_NHM_PLATFORM_INFO tools/power turbostat: use new name for MSR_PLATFORM_INFO
2015-11-15Merge branch 'x86-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Thomas Gleixner: "A couple of fixes and updates related to x86: - Fix the W+X check regression on XEN - The real fix for the low identity map trainwreck - Probe legacy PIC early instead of unconditionally allocating legacy irqs - Add cpu verification to long mode entry - Adjust the cache topology to AMD Fam17H systems - Let Merrifield use the TSC across S3" * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/cpu: Call verify_cpu() after having entered long mode too x86/setup: Fix low identity map for >= 2GB kernel range x86/mm: Skip the hypervisor range when walking PGD x86/AMD: Fix last level cache topology for AMD Fam17h systems x86/irq: Probe for PIC presence before allocating descs for legacy IRQs x86/cpu/intel: Enable X86_FEATURE_NONSTOP_TSC_S3 for Merrifield
2015-11-13x86: remove unused definition of MSR_NHM_PLATFORM_INFOLen Brown
MSR_NHM_PLATFORM_INFO has been replaced by... MSR_PLATFORM_INFO Signed-off-by: Len Brown <len.brown@intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2015-11-12Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
Pull second batch of kvm updates from Paolo Bonzini: "Four changes: - x86: work around two nasty cases where a benign exception occurs while another is being delivered. The endless stream of exceptions causes an infinite loop in the processor, which not even NMIs or SMIs can interrupt; in the virt case, there is no possibility to exit to the host either. - x86: support for Skylake per-guest TSC rate. Long supported by AMD, the patches mostly move things from there to common arch/x86/kvm/ code. - generic: remove local_irq_save/restore from the guest entry and exit paths when context tracking is enabled. The patches are a few months old, but we discussed them again at kernel summit. Andy will pick up from here and, in 4.5, try to remove it from the user entry/exit paths. - PPC: Two bug fixes, see merge commit 370289756becc for details" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (21 commits) KVM: x86: rename update_db_bp_intercept to update_bp_intercept KVM: svm: unconditionally intercept #DB KVM: x86: work around infinite loop in microcode when #AC is delivered context_tracking: avoid irq_save/irq_restore on guest entry and exit context_tracking: remove duplicate enabled check KVM: VMX: Dump TSC multiplier in dump_vmcs() KVM: VMX: Use a scaled host TSC for guest readings of MSR_IA32_TSC KVM: VMX: Setup TSC scaling ratio when a vcpu is loaded KVM: VMX: Enable and initialize VMX TSC scaling KVM: x86: Use the correct vcpu's TSC rate to compute time scale KVM: x86: Move TSC scaling logic out of call-back read_l1_tsc() KVM: x86: Move TSC scaling logic out of call-back adjust_tsc_offset() KVM: x86: Replace call-back compute_tsc_offset() with a common function KVM: x86: Replace call-back set_tsc_khz() with a common function KVM: x86: Add a common TSC scaling function KVM: x86: Add a common TSC scaling ratio field in kvm_vcpu_arch KVM: x86: Collect information for setting TSC scaling ratio KVM: x86: declare a few variables as __read_mostly KVM: x86: merge handle_mmio_page_fault and handle_mmio_page_fault_common KVM: PPC: Book3S HV: Don't dynamically split core when already split ...
2015-11-10KVM: x86: rename update_db_bp_intercept to update_bp_interceptPaolo Bonzini
Because #DB is now intercepted unconditionally, this callback only operates on #BP for both VMX and SVM. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-11-10KVM: x86: work around infinite loop in microcode when #AC is deliveredEric Northup
It was found that a guest can DoS a host by triggering an infinite stream of "alignment check" (#AC) exceptions. This causes the microcode to enter an infinite loop where the core never receives another interrupt. The host kernel panics pretty quickly due to the effects (CVE-2015-5307). Signed-off-by: Eric Northup <digitaleric@google.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-11-10KVM: VMX: Enable and initialize VMX TSC scalingHaozhong Zhang
This patch exhances kvm-intel module to enable VMX TSC scaling and collects information of TSC scaling ratio during initialization. Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-11-10KVM: x86: Move TSC scaling logic out of call-back read_l1_tsc()Haozhong Zhang
Both VMX and SVM scales the host TSC in the same way in call-back read_l1_tsc(), so this patch moves the scaling logic from call-back read_l1_tsc() to a common function kvm_read_l1_tsc(). Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-11-10KVM: x86: Move TSC scaling logic out of call-back adjust_tsc_offset()Haozhong Zhang
For both VMX and SVM, if the 2nd argument of call-back adjust_tsc_offset() is the host TSC, then adjust_tsc_offset() will scale it first. This patch moves this common TSC scaling logic to its caller adjust_tsc_offset_host() and rename the call-back adjust_tsc_offset() to adjust_tsc_offset_guest(). Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-11-10KVM: x86: Replace call-back compute_tsc_offset() with a common functionHaozhong Zhang
Both VMX and SVM calculate the tsc-offset in the same way, so this patch removes the call-back compute_tsc_offset() and replaces it with a common function kvm_compute_tsc_offset(). Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-11-10KVM: x86: Replace call-back set_tsc_khz() with a common functionHaozhong Zhang
Both VMX and SVM propagate virtual_tsc_khz in the same way, so this patch removes the call-back set_tsc_khz() and replaces it with a common function. Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-11-10KVM: x86: Add a common TSC scaling functionHaozhong Zhang
VMX and SVM calculate the TSC scaling ratio in a similar logic, so this patch generalizes it to a common TSC scaling function. Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com> [Inline the multiplication and shift steps into mul_u64_u64_shr. Remove BUG_ON. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-11-10KVM: x86: Add a common TSC scaling ratio field in kvm_vcpu_archHaozhong Zhang
This patch moves the field of TSC scaling ratio from the architecture struct vcpu_svm to the common struct kvm_vcpu_arch. Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-11-10KVM: x86: Collect information for setting TSC scaling ratioHaozhong Zhang
The number of bits of the fractional part of the 64-bit TSC scaling ratio in VMX and SVM is different. This patch makes the architecture code to collect the number of fractional bits and other related information into variables that can be accessed in the common code. Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-11-10KVM: x86: declare a few variables as __read_mostlyPaolo Bonzini
These include module parameters and variables that are set by kvm_x86_ops->hardware_setup. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-11-09kmap_atomic_to_page() has no users, remove itNicolas Pitre
Removal started in commit 5bbeed12bdc3 ("sparc32: drop unused kmap_atomic_to_page"). Let's do it across the whole tree. Signed-off-by: Nicolas Pitre <nico@linaro.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-07x86/irq: Probe for PIC presence before allocating descs for legacy IRQsVitaly Kuznetsov
Commit d32932d02e18 ("x86/irq: Convert IOAPIC to use hierarchical irqdomain interfaces") brought a regression for Hyper-V Gen2 instances. These instances don't have i8259 legacy PIC but they use legacy IRQs for serial port, rtc, and acpi. With this commit included we end up with these IRQs not initialized. Earlier, there was a special workaround for legacy IRQs in mp_map_pin_to_irq() doing mp_irqdomain_map() without looking at nr_legacy_irqs() and now we fail in __irq_domain_alloc_irqs() when irq_domain_alloc_descs() returns -EEXIST. The essence of the issue seems to be that early_irq_init() calls arch_probe_nr_irqs() to figure out the number of legacy IRQs before we probe for i8259 and gets 16. Later when init_8259A() is called we switch to NULL legacy PIC and nr_legacy_irqs() starts to return 0 but we already have 16 descs allocated. Solve the issue by separating i8259 probe from init and calling it in arch_probe_nr_irqs() before we actually use nr_legacy_irqs() information. Fixes: d32932d02e18 ("x86/irq: Convert IOAPIC to use hierarchical irqdomain interfaces") Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Jiang Liu <jiang.liu@linux.intel.com> Cc: K. Y. Srinivasan <kys@microsoft.com> Cc: stable@vger.kernel.org Link: http://lkml.kernel.org/r/1446543614-3621-1-git-send-email-vkuznets@redhat.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-11-05Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
Pull KVM updates from Paolo Bonzini: "First batch of KVM changes for 4.4. s390: A bunch of fixes and optimizations for interrupt and time handling. PPC: Mostly bug fixes. ARM: No big features, but many small fixes and prerequisites including: - a number of fixes for the arch-timer - introducing proper level-triggered semantics for the arch-timers - a series of patches to synchronously halt a guest (prerequisite for IRQ forwarding) - some tracepoint improvements - a tweak for the EL2 panic handlers - some more VGIC cleanups getting rid of redundant state x86: Quite a few changes: - support for VT-d posted interrupts (i.e. PCI devices can inject interrupts directly into vCPUs). This introduces a new component (in virt/lib/) that connects VFIO and KVM together. The same infrastructure will be used for ARM interrupt forwarding as well. - more Hyper-V features, though the main one Hyper-V synthetic interrupt controller will have to wait for 4.5. These will let KVM expose Hyper-V devices. - nested virtualization now supports VPID (same as PCID but for vCPUs) which makes it quite a bit faster - for future hardware that supports NVDIMM, there is support for clflushopt, clwb, pcommit - support for "split irqchip", i.e. LAPIC in kernel + IOAPIC/PIC/PIT in userspace, which reduces the attack surface of the hypervisor - obligatory smattering of SMM fixes - on the guest side, stable scheduler clock support was rewritten to not require help from the hypervisor" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (123 commits) KVM: VMX: Fix commit which broke PML KVM: x86: obey KVM_X86_QUIRK_CD_NW_CLEARED in kvm_set_cr0() KVM: x86: allow RSM from 64-bit mode KVM: VMX: fix SMEP and SMAP without EPT KVM: x86: move kvm_set_irq_inatomic to legacy device assignment KVM: device assignment: remove pointless #ifdefs KVM: x86: merge kvm_arch_set_irq with kvm_set_msi_inatomic KVM: x86: zero apic_arb_prio on reset drivers/hv: share Hyper-V SynIC constants with userspace KVM: x86: handle SMBASE as physical address in RSM KVM: x86: add read_phys to x86_emulate_ops KVM: x86: removing unused variable KVM: don't pointlessly leave KVM_COMPAT=y in non-KVM configs KVM: arm/arm64: Merge vgic_set_lr() and vgic_sync_lr_elrsr() KVM: arm/arm64: Clean up vgic_retire_lr() and surroundings KVM: arm/arm64: Optimize away redundant LR tracking KVM: s390: use simple switch statement as multiplexer KVM: s390: drop useless newline in debugging data KVM: s390: SCA must not cross page boundaries KVM: arm: Do not indent the arguments of DECLARE_BITMAP ...
2015-11-04Merge tag 'pm+acpi-4.4-rc1-1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull power management and ACPI updates from Rafael Wysocki: "Quite a new features are included this time. First off, the Collaborative Processor Performance Control interface (version 2) defined by ACPI will now be supported on ARM64 along with a cpufreq frontend for CPU performance scaling. Second, ACPI gets a new infrastructure for the early probing of IRQ chips and clock sources (along the lines of the existing similar mechanism for DT). Next, the ACPI core and the generic device properties API will now support a recently introduced hierarchical properties extension of the _DSD (Device Specific Data) ACPI device configuration object. If the ACPI platform firmware uses that extension to organize device properties in a hierarchical way, the kernel will automatically handle it and make those properties available to device drivers via the generic device properties API. It also will be possible to build the ACPICA's AML interpreter debugger into the kernel now and use that to diagnose AML-related problems more efficiently. In the future, this should make it possible to single-step AML execution and do similar things. Interesting stuff, although somewhat experimental at this point. Finally, the PM core gets a new mechanism that can be used by device drivers to distinguish between suspend-to-RAM (based on platform firmware support) and suspend-to-idle (or other variants of system suspend the platform firmware is not involved in) and possibly optimize their device suspend/resume handling accordingly. In addition to that, some existing features are re-organized quite substantially. First, the ACPI-based handling of PCI host bridges on x86 and ia64 is unified and the common code goes into the ACPI core (so as to reduce code duplication and eliminate non-essential differences between the two architectures in that area). Second, the Operating Performance Points (OPP) framework is reorganized to make the code easier to find and follow. Next, the cpufreq core's sysfs interface is reorganized to get rid of the "primary CPU" concept for configurations in which the same performance scaling settings are shared between multiple CPUs. Finally, some interfaces that aren't necessary any more are dropped from the generic power domains framework. On top of the above we have some minor extensions, cleanups and bug fixes in multiple places, as usual. Specifics: - ACPICA update to upstream revision 20150930 (Bob Moore, Lv Zheng). The most significant change is to allow the AML debugger to be built into the kernel. On top of that there is an update related to the NFIT table (the ACPI persistent memory interface) and a few fixes and cleanups. - ACPI CPPC2 (Collaborative Processor Performance Control v2) support along with a cpufreq frontend (Ashwin Chaugule). This can only be enabled on ARM64 at this point. - New ACPI infrastructure for the early probing of IRQ chips and clock sources (Marc Zyngier). - Support for a new hierarchical properties extension of the ACPI _DSD (Device Specific Data) device configuration object allowing the kernel to handle hierarchical properties (provided by the platform firmware this way) automatically and make them available to device drivers via the generic device properties interface (Rafael Wysocki). - Generic device properties API extension to obtain an index of certain string value in an array of strings, along the lines of of_property_match_string(), but working for all of the supported firmware node types, and support for the "dma-names" device property based on it (Mika Westerberg). - ACPI core fix to parse the MADT (Multiple APIC Description Table) entries in the order expected by platform firmware (and mandated by the specification) to avoid confusion on systems with more than 255 logical CPUs (Lukasz Anaczkowski). - Consolidation of the ACPI-based handling of PCI host bridges on x86 and ia64 (Jiang Liu). - ACPI core fixes to ensure that the correct IRQ number is used to represent the SCI (System Control Interrupt) in the cases when it has been re-mapped (Chen Yu). - New ACPI backlight quirk for Lenovo IdeaPad S405 (Hans de Goede). - ACPI EC driver fixes (Lv Zheng). - Assorted ACPI fixes and cleanups (Dan Carpenter, Insu Yun, Jiri Kosina, Rami Rosen, Rasmus Villemoes). - New mechanism in the PM core allowing drivers to check if the platform firmware is going to be involved in the upcoming system suspend or if it has been involved in the suspend the system is resuming from at the moment (Rafael Wysocki). This should allow drivers to optimize their suspend/resume handling in some cases and the changes include a couple of users of it (the i8042 input driver, PCI PM). - PCI PM fix to prevent runtime-suspended devices with PME enabled from being resumed during system suspend even if they aren't configured to wake up the system from sleep (Rafael Wysocki). - New mechanism to report the number of a wakeup IRQ that woke up the system from sleep last time (Alexandra Yates). - Removal of unused interfaces from the generic power domains framework and fixes related to latency measurements in that code (Ulf Hansson, Daniel Lezcano). - cpufreq core sysfs interface rework to make it handle CPUs that share performance scaling settings (represented by a common cpufreq policy object) more symmetrically (Viresh Kumar). This should help to simplify the CPU offline/online handling among other things. - cpufreq core fixes and cleanups (Viresh Kumar). - intel_pstate fixes related to the Turbo Activation Ratio (TAR) mechanism on client platforms which causes the turbo P-states range to vary depending on platform firmware settings (Srinivas Pandruvada). - intel_pstate sysfs interface fix (Prarit Bhargava). - Assorted cpufreq driver (imx, tegra20, powernv, integrator) fixes and cleanups (Bai Ping, Bartlomiej Zolnierkiewicz, Shilpasri G Bhat, Luis de Bethencourt). - cpuidle mvebu driver cleanups (Russell King). - OPP (Operating Performance Points) framework code reorganization to make it more maintainable (Viresh Kumar). - Intel Broxton support for the RAPL (Running Average Power Limits) power capping driver (Amy Wiles). - Assorted power management code fixes and cleanups (Dan Carpenter, Geert Uytterhoeven, Geliang Tang, Luis de Bethencourt, Rasmus Villemoes)" * tag 'pm+acpi-4.4-rc1-1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (108 commits) cpufreq: postfix policy directory with the first CPU in related_cpus cpufreq: create cpu/cpufreq/policyX directories cpufreq: remove cpufreq_sysfs_{create|remove}_file() cpufreq: create cpu/cpufreq at boot time cpufreq: Use cpumask_copy instead of cpumask_or to copy a mask cpufreq: ondemand: Drop unnecessary locks from update_sampling_rate() PM / Domains: Merge measurements for PM QoS device latencies PM / Domains: Don't measure ->start|stop() latency in system PM callbacks PM / clk: Fix broken build due to non-matching code and header #ifdefs ACPI / Documentation: add copy_dsdt to ACPI format options ACPI / sysfs: correctly check failing memory allocation ACPI / video: Add a quirk to force native backlight on Lenovo IdeaPad S405 ACPI / CPPC: Fix potential memory leak ACPI / CPPC: signedness bug in register_pcc_channel() ACPI / PAD: power_saving_thread() is not freezable ACPI / PM: Fix incorrect wakeup IRQ setting during suspend-to-idle ACPI: Using correct irq when waiting for events ACPI: Use correct IRQ when uninstalling ACPI interrupt handler cpuidle: mvebu: disable the bind/unbind attributes and use builtin_platform_driver cpuidle: mvebu: clean up multiple platform drivers ...
2015-11-04Merge tag 'for-linus-4.4-rc0-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip Pull xen updates from David Vrabel: - Improve balloon driver memory hotplug placement. - Use unpopulated hotplugged memory for foreign pages (if supported/enabled). - Support 64 KiB guest pages on arm64. - CPU hotplug support on arm/arm64. * tag 'for-linus-4.4-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip: (44 commits) xen: fix the check of e_pfn in xen_find_pfn_range x86/xen: add reschedule point when mapping foreign GFNs xen/arm: don't try to re-register vcpu_info on cpu_hotplug. xen, cpu_hotplug: call device_offline instead of cpu_down xen/arm: Enable cpu_hotplug.c xenbus: Support multiple grants ring with 64KB xen/grant-table: Add an helper to iterate over a specific number of grants xen/xenbus: Rename *RING_PAGE* to *RING_GRANT* xen/arm: correct comment in enlighten.c xen/gntdev: use types from linux/types.h in userspace headers xen/gntalloc: use types from linux/types.h in userspace headers xen/balloon: Use the correct sizeof when declaring frame_list xen/swiotlb: Add support for 64KB page granularity xen/swiotlb: Pass addresses rather than frame numbers to xen_arch_need_swiotlb arm/xen: Add support for 64KB page granularity xen/privcmd: Add support for Linux 64KB page granularity net/xen-netback: Make it running on 64KB page granularity net/xen-netfront: Make it running on 64KB page granularity block/xen-blkback: Make it running on 64KB page granularity block/xen-blkfront: Make it running on 64KB page granularity ...
2015-11-04Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux Pull s390 updates from Martin Schwidefsky: "There is only one new feature in this pull for the 4.4 merge window, most of it is small enhancements, cleanup and bug fixes: - Add the s390 backend for the software dirty bit tracking. This adds two new pgtable functions pte_clear_soft_dirty and pmd_clear_soft_dirty which is why there is a hit to arch/x86/include/asm/pgtable.h in this pull request. - A series of cleanup patches for the AP bus, this includes the removal of the support for two outdated crypto cards (PCICC and PCICA). - The irq handling / signaling on buffer full in the runtime instrumentation code is dropped. - Some micro optimizations: remove unnecessary memory barriers for a couple of functions: [smb_]rmb, [smb_]wmb, atomics, bitops, and for spin_unlock. Use the builtin bswap if available and make test_and_set_bit_lock more cache friendly. - Statistics and a tracepoint for the diagnose calls to the hypervisor. - The CPU measurement facility support to sample KVM guests is improved. - The vector instructions are now always enabled for user space processes if the hardware has the vector facility. This simplifies the FPU handling code. The fpu-internal.h header is split into fpu internals, api and types just like x86. - Cleanup and improvements for the common I/O layer. - Rework udelay to solve a problem with kprobe. udelay has busy loop semantics but still uses an idle processor state for the wait" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (66 commits) s390: remove runtime instrumentation interrupts s390/cio: de-duplicate subchannel validation s390/css: unneeded initialization in for_each_subchannel s390/Kconfig: use builtin bswap s390/dasd: fix disconnected device with valid path mask s390/dasd: fix invalid PAV assignment after suspend/resume s390/dasd: fix double free in dasd_eckd_read_conf s390/kernel: fix ptrace peek/poke for floating point registers s390/cio: move ccw_device_stlck functions s390/cio: move ccw_device_call_handler s390/topology: reduce per_cpu() invocations s390/nmi: reduce size of percpu variable s390/nmi: fix terminology s390/nmi: remove casts s390/nmi: remove pointless error strings s390: don't store registers on disabled wait anymore s390: get rid of __set_psw_mask() s390/fpu: split fpu-internal.h into fpu internals, api, and type headers s390/dasd: fix list_del corruption after lcu changes s390/spinlock: remove unneeded serializations at unlock ...
2015-11-04drivers/hv: share Hyper-V SynIC constants with userspaceAndrey Smetanin
Moved Hyper-V synic contants from guest Hyper-V drivers private header into x86 arch uapi Hyper-V header. Added Hyper-V synic msr's flags into x86 arch uapi Hyper-V header. Signed-off-by: Andrey Smetanin <asmetanin@virtuozzo.com> Reviewed-by: Roman Kagan <rkagan@virtuozzo.com> Signed-off-by: Denis V. Lunev <den@openvz.org> CC: Vitaly Kuznetsov <vkuznets@redhat.com> CC: "K. Y. Srinivasan" <kys@microsoft.com> CC: Gleb Natapov <gleb@kernel.org> CC: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-11-04KVM: x86: add read_phys to x86_emulate_opsRadim Krčmář
We want to read the physical memory when emulating RSM. X86EMUL_IO_NEEDED is returned on all errors for consistency with other helpers. Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Tested-by: Laszlo Ersek <lersek@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>