path: root/arch/x86
2014-08-07  x86/espfix/xen: Fix allocation of pages for paravirt page tables  (Boris Ostrovsky)

commit 8762e5092828c4dc0f49da5a47a644c670df77f3 upstream.

init_espfix_ap() is currently off by one level when informing the hypervisor that allocated pages will be used for ministacks' page tables.

The most immediate effect of this on a PV guest is that if 'stack_page = __get_free_page()' returns a non-zeroed-out page, the hypervisor will refuse to use it for a page table (which it shouldn't be anyway). This will result in warnings by both Xen and Linux.

More importantly, a subsequent write to that page (again, by a PV guest) is likely to result in a fatal page fault.

Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Link: http://lkml.kernel.org/r/1404926298-5565-1-git-send-email-boris.ostrovsky@oracle.com
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-08-07  x86/xen: no need to explicitly register an NMI callback  (David Vrabel)

commit ea9f9274bf4337ba7cbab241c780487651642d63 upstream.

Remove xen_enable_nmi() to fix a 64-bit guest crash when registering the NMI callback on Xen 3.1 and earlier.

It's not needed since the NMI callback is set by a set_trap_table hypercall (in xen_load_idt() or xen_write_idt_entry()).

It's also broken since it only sets the current VCPU's callback.

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Reported-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Tested-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Steven Noonan <steven@uplinklabs.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-08-07  x86_64/entry/xen: Do not invoke espfix64 on Xen  (Andy Lutomirski)

commit 7209a75d2009dbf7745e2fd354abf25c3deb3ca3 upstream.

This moves the espfix64 logic into native_iret. To make this work, it gets rid of the native patch for INTERRUPT_RETURN: INTERRUPT_RETURN on native kernels is now 'jmp native_iret'.

This changes the 16-bit SS behavior on Xen from OOPSing to leaking some bits of the Xen hypervisor's RSP (I think).

[ hpa: this is a nonzero cost on native, but probably not enough to measure. Xen needs to fix this in their own code, probably doing something equivalent to espfix64. ]

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Link: http://lkml.kernel.org/r/7b8f1d8ef6597cb16ae004a43c56980a7de3cf94.1406129132.git.luto@amacapital.net
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-08-07  x86, espfix: Make it possible to disable 16-bit support  (H. Peter Anvin)

commit 34273f41d57ee8d854dcd2a1d754cbb546cb548f upstream.

Embedded systems, which may be very memory-size-sensitive, are extremely unlikely to ever encounter any 16-bit software, so make it a CONFIG_EXPERT option to turn off support for any 16-bit software whatsoever.

Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Link: http://lkml.kernel.org/r/1398816946-3351-1-git-send-email-hpa@linux.intel.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-08-07  x86, espfix: Make espfix64 a Kconfig option, fix UML  (H. Peter Anvin)

commit 197725de65477bc8509b41388157c1a2283542bb upstream.

Make espfix64 a hidden Kconfig option. This fixes the x86-64 UML build which had broken due to the non-existence of init_espfix_bsp() in UML: since UML uses its own Kconfig, this option does not appear in the UML build.

This also makes it possible to make support for 16-bit segments a configuration option, for the people who want to minimize the size of the kernel.

Reported-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Cc: Richard Weinberger <richard@nod.at>
Link: http://lkml.kernel.org/r/1398816946-3351-1-git-send-email-hpa@linux.intel.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-08-07  x86, espfix: Fix broken header guard  (H. Peter Anvin)

commit 20b68535cd27183ebd3651ff313afb2b97dac941 upstream.

Header guard is #ifndef, not #ifdef...

Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
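For reference, the correct guard shape (the guard macro name follows the usual asm header convention and is illustrative):

    #ifndef _ASM_X86_ESPFIX_H          /* must be #ifndef... */
    #define _ASM_X86_ESPFIX_H          /* ...so the body compiles once
                                          and re-inclusion is a no-op */

    /* declarations go here */

    #endif /* _ASM_X86_ESPFIX_H */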
2014-08-07  x86, espfix: Move espfix definitions into a separate header file  (H. Peter Anvin)

commit e1fe9ed8d2a4937510d0d60e20705035c2609aea upstream.

Sparse warns that the percpu variables aren't declared before they are defined. Rather than hacking around it, move espfix definitions into a proper header file.

Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-08-07  x86-64, espfix: Don't leak bits 31:16 of %esp returning to 16-bit stack  (H. Peter Anvin)

commit 3891a04aafd668686239349ea58f3314ea2af86b upstream.

The IRET instruction, when returning to a 16-bit segment, only restores the bottom 16 bits of the user space stack pointer. This causes some 16-bit software to break, but it also leaks kernel state to user space. We have a software workaround for that ("espfix") for the 32-bit kernel, but it relies on a nonzero stack segment base which is not available in 64-bit mode.

In checkin:

    b3b42ac2cbae x86-64, modify_ldt: Ban 16-bit segments on 64-bit kernels

we "solved" this by forbidding 16-bit segments on 64-bit kernels, with the logic that 16-bit support is crippled on 64-bit kernels anyway (no V86 support), but it turns out that people are doing stuff like running old Win16 binaries under Wine and expect it to work.

This works around the problem by creating percpu "ministacks", each of which is mapped 2^16 times 64K apart. When we detect that the return SS is on the LDT, we copy the IRET frame to the ministack and use the relevant alias to return to userspace. The ministacks are mapped readonly, so if IRET faults we promote #GP to #DF which is an IST vector and thus has its own stack; we then do the fixup in the #DF handler. (Making #GP an IST exception would make the msr_safe functions unsafe in NMI/MC context, and quite possibly have other effects.)

Special thanks to:

- Andy Lutomirski, for the suggestion of using very small stack slots and copying (as opposed to mapping) the IRET frame there, and for the suggestion to mark them readonly and let the fault promote to #DF.
- Konrad Wilk for paravirt fixup and testing.
- Borislav Petkov for testing help and useful comments.

Reported-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Link: http://lkml.kernel.org/r/1398816946-3351-1-git-send-email-hpa@linux.intel.com
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Andrew Lutomirski <amluto@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Dirk Hohndel <dirk@hohndel.org>
Cc: Arjan van de Ven <arjan.van.de.ven@intel.com>
Cc: comex <comexk@gmail.com>
Cc: Alexander van Heukelum <heukelum@fastmail.fm>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: <stable@vger.kernel.org> # consider after upstream merge
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
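A rough sketch of the aliasing arithmetic described above (the base-address symbol is assumed, not taken from the patch): because each ministack page appears 2^16 times at a 64K stride, alias k of a slot is a fixed multiple of 64K from the region base:

    /* Illustrative only: pick the k-th 64K-apart alias of a ministack
     * slot; espfix_base names the alias region base (assumed). */
    static inline unsigned long espfix_slot_alias(unsigned long espfix_base,
                                                  unsigned int k)
    {
            return espfix_base + (unsigned long)k * 0x10000; /* 64K stride */
    }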
2014-08-07Revert "x86-64, modify_ldt: Make support for 16-bit segments a runtime option"H. Peter Anvin
commit 7ed6fb9b5a5510e4ef78ab27419184741169978a upstream. This reverts commit fa81511bb0bbb2b1aace3695ce869da9762624ff in preparation of merging in the proper fix (espfix64). Signed-off-by: H. Peter Anvin <hpa@zytor.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-07-31  x86/efi: Include a .bss section within the PE/COFF headers  (Michael Brown)

commit c7fb93ec51d462ec3540a729ba446663c26a0505 upstream.

The PE/COFF headers currently describe only the initialised-data portions of the image, and result in no space being allocated for the uninitialised-data portions. Consequently, the EFI boot stub will end up overwriting unexpected areas of memory, with unpredictable results.

Fix by including a .bss section in the PE/COFF headers (functionally equivalent to the init_size field in the bzImage header).

Signed-off-by: Michael Brown <mbrown@fensystems.co.uk>
Cc: Thomas Bächler <thomas@archlinux.org>
Cc: Josh Boyer <jwboyer@fedoraproject.org>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-07-31  x86_32, entry: Store badsys error code in %eax  (Sven Wegener)

commit 8142b215501f8b291a108a202b3a053a265b03dd upstream.

Commit 554086d ("x86_32, entry: Do syscall exit work on badsys (CVE-2014-4508)") introduced a regression in the x86_32 syscall entry code, resulting in syscall() not returning proper errors for undefined syscalls on CPUs supporting the sysenter feature.

The following code:

> int result = syscall(666);
> printf("result=%d errno=%d error=%s\n", result, errno, strerror(errno));

results in:

> result=666 errno=0 error=Success

Obviously, the syscall return value is the called syscall number, but it should have been an ENOSYS error. When run under ptrace it behaves correctly, which makes it hard to debug in the wild:

> result=-1 errno=38 error=Function not implemented

The %eax register is the return value register. For debugging via ptrace the syscall entry code stores the complete register context on the stack. The badsys handlers only store the ENOSYS error code in the ptrace register set and do not set %eax like a regular syscall handler would. The old resume_userspace call chain contains code that clobbers %eax and it restores %eax from the ptrace registers afterwards. The same goes for the ptrace-enabled call chain. When ptrace is not used, the syscall return value is the passed-in syscall number from the untouched %eax register.

Use %eax as the return value register in syscall_badsys and sysenter_badsys, like a real syscall handler does, and have the caller push the value onto the stack for ptrace access.

Signed-off-by: Sven Wegener <sven.wegener@stealer.net>
Link: http://lkml.kernel.org/r/alpine.LNX.2.11.1407221022380.31021@titan.int.lan.stealer.net
Reviewed-and-tested-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
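For convenience, the quoted reproducer as a complete program:

    #include <errno.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/syscall.h>

    int main(void)
    {
            int result = syscall(666);   /* intentionally undefined nr */

            /* Broken kernels print "result=666 errno=0 error=Success";
             * fixed ones print errno=38 (ENOSYS). */
            printf("result=%d errno=%d error=%s\n",
                   result, errno, strerror(errno));
            return 0;
    }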
2014-07-28  locking/mutex: Disable optimistic spinning on some architectures  (Peter Zijlstra)

commit 4badad352a6bb202ec68afa7a574c0bb961e5ebc upstream.

The optimistic spin code assumes regular stores and cmpxchg() play nice; this is found to not be true for at least: parisc, sparc32, tile32, metag-lock1, arc-!llsc and hexagon.

There is further wreckage, but this in particular seemed easy to trigger, so blacklist this. Opt in for known good archs.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Reported-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: David Miller <davem@davemloft.net>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: James Bottomley <James.Bottomley@hansenpartnership.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Jason Low <jason.low2@hp.com>
Cc: Waiman Long <waiman.long@hp.com>
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: John David Anglin <dave.anglin@bell.net>
Cc: James Hogan <james.hogan@imgtec.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Davidlohr Bueso <davidlohr@hp.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Will Deacon <will.deacon@arm.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: sparclinux@vger.kernel.org
Link: http://lkml.kernel.org/r/20140606175316.GV13930@laptop.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-07-28  x86, tsc: Fix cpufreq lockup  (Peter Zijlstra)

commit 3896c329df8092661dac80f55a8c3f60136fd61a upstream.

Mauro reported that his AMD X2 using the powernow-k8 cpufreq driver locked up when doing cpu hotplug.

Because we called set_cyc2ns_scale() from the time_cpufreq_notifier() unconditionally, it gets called multiple times for each freq change, instead of only the once, when the tsc_khz value actually changes.

Because it gets called more than once, we run out of cyc2ns data slots and stall, waiting for a free one, but because we're half way offline, there's no consumers to free slots.

By placing the call inside the condition that actually changes tsc_khz we avoid superfluous calls and avoid the problem.

Reported-by: Mauro <registosites@hotmail.com>
Tested-by: Mauro <registosites@hotmail.com>
Fixes: 20d1c86a5776 ("sched/clock, x86: Rewrite cyc2ns() to avoid the need to disable IRQs")
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Bin Gao <bin.gao@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mika Westerberg <mika.westerberg@linux.intel.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Stefani Seibold <stefani@seibold.net>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
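The shape of the fix, sketched with assumed variable names (the message itself only names set_cyc2ns_scale() and tsc_khz):

    /* Refresh the cyc2ns data only when the frequency really changed;
     * the cpufreq notifier may fire many times per actual change. */
    if (new_tsc_khz != tsc_khz) {
            tsc_khz = new_tsc_khz;
            set_cyc2ns_scale(tsc_khz, cpu);
    }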
2014-07-28  perf/x86/intel: ignore CondChgd bit to avoid false NMI handling  (HATAYAMA Daisuke)

commit b292d7a10487aee6e74b1c18b8d95b92f40d4a4f upstream.

Currently, any NMI is falsely handled by the NMI handler of the NMI watchdog if the CondChgd bit in the MSR_CORE_PERF_GLOBAL_STATUS MSR is set.

For example, we use an external NMI to make the system panic to get a crash dump, but in this case the external NMI is falsely handled due to the issue.

This commit deals with the issue simply by ignoring the CondChgd bit.

Here is the explanation in detail.

On x86 the NMI watchdog uses the performance monitoring feature to periodically signal an NMI each time a performance counter gets overflowed.

intel_pmu_handle_irq() is called as an NMI_LOCAL handler from the NMI handler of the NMI watchdog, perf_event_nmi_handler(). It identifies the owner of a given NMI by looking at the overflow status bits in the MSR_CORE_PERF_GLOBAL_STATUS MSR. If some of the bits are set, then it handles the given NMI as its own NMI.

The problem is that intel_pmu_handle_irq() doesn't distinguish the CondChgd bit from the other bits. Unlike the other status bits, the CondChgd bit doesn't represent overflow status for performance counters. Thus, the CondChgd bit cannot be thought of as a mark indicating that a given NMI is NMI watchdog's.

As a result, if the CondChgd bit is set, any NMI is falsely handled by the NMI handler of the NMI watchdog. Also, if the type of the falsely handled NMI is either NMI_UNKNOWN, NMI_SERR or NMI_IO_CHECK, the corresponding action is never performed until the CondChgd bit is cleared.

I noticed this behavior on systems with Ivy Bridge processors: Intel Xeon CPU E5-2630 v2 and Intel Xeon CPU E7-8890 v2. On both systems, the CondChgd bit in the MSR_CORE_PERF_GLOBAL_STATUS MSR had already been set in the beginning at boot. Then the CondChgd bit is immediately cleared by the next wrmsr to the MSR_CORE_PERF_GLOBAL_CTRL MSR and appears to remain 0.

On the other hand, on older processors such as Nehalem, Xeon E7540, the CondChgd bit is not set in the beginning at boot.

I'm not sure about the exact behavior of the CondChgd bit, in particular when this bit is set. Although I read the Intel System Programmer's Manual to figure that out, the descriptions I found are:

  In 18.9.1: "The MSR_PERF_GLOBAL_STATUS MSR also provides a 'sticky bit' to indicate changes to the state of performance monitoring hardware"

  In Table 35-2 IA-32 Architectural MSRs: 63 CondChg: status bits of this register have changed.

These are different from the behaviour I see on the actual system as I explained above.

At least, I think ignoring the CondChgd bit should be enough from the NMI watchdog perspective.

Signed-off-by: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
Acked-by: Don Zickus <dzickus@redhat.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/20140625.103503.409316067.d.hatayama@jp.fujitsu.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
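The essence of the fix, sketched (the helper name is hypothetical; the bit position is from the Table 35-2 quote above):

    #include <stdint.h>

    /* CondChgd (bit 63 of MSR_CORE_PERF_GLOBAL_STATUS) is not an
     * overflow bit, so mask it off before asking "is this NMI ours?". */
    static inline uint64_t pmu_overflow_bits(uint64_t global_status)
    {
            return global_status & ~(1ULL << 63);
    }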
2014-07-17  x86, ioremap: Speed up check for RAM pages  (Roland Dreier)

commit c81c8a1eeede61e92a15103748c23d100880cc8a upstream.

In __ioremap_caller() (the guts of ioremap), we loop over the range of pfns being remapped and check each one individually with page_is_ram(). For large ioremaps, this can be very slow. For example, we have a device with a 256 GiB PCI BAR, and ioremapping this BAR can take 20+ seconds -- sometimes long enough to trigger the soft lockup detector!

Internally, page_is_ram() calls walk_system_ram_range() on a single page. Instead, we can make a single call to walk_system_ram_range() from __ioremap_caller(), and do our further checks only for any RAM pages that we find. For the common case of MMIO, this saves an enormous amount of work, since the range being ioremapped doesn't intersect system RAM at all.

With this change, ioremap on our 256 GiB BAR takes less than 1 second.

Signed-off-by: Roland Dreier <roland@purestorage.com>
Link: http://lkml.kernel.org/r/1399054721-1331-1-git-send-email-roland@kernel.org
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
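A sketch of the restructured check (kernel-style; the callback name and the exact per-page RAM test are illustrative, not quoted from the patch):

    /* Called by walk_system_ram_range() only for System RAM pieces,
     * so for pure MMIO ranges it never runs at all. */
    static int __ioremap_check_ram(unsigned long start_pfn,
                                   unsigned long nr_pages, void *arg)
    {
            unsigned long i;

            for (i = 0; i < nr_pages; i++)
                    if (pfn_valid(start_pfn + i) &&
                        !PageReserved(pfn_to_page(start_pfn + i)))
                            return 1;   /* real RAM: refuse the ioremap */
            return 0;
    }

    /* in __ioremap_caller(): one range walk instead of a per-pfn loop */
    if (walk_system_ram_range(pfn, last_pfn - pfn + 1,
                              NULL, __ioremap_check_ram))
            return NULL;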
2014-07-17  crypto: sha512_ssse3 - fix byte count to bit count conversion  (Jussi Kivilinna)

commit cfe82d4f45c7cc39332a2be7c4c1d3bf279bbd3d upstream.

The byte-to-bit-count computation is only partly converted to big-endian and is mixing in CPU-endian values. The problem was noticed by sparse with the warnings:

  CHECK arch/x86/crypto/sha512_ssse3_glue.c
  arch/x86/crypto/sha512_ssse3_glue.c:144:19: warning: restricted __be64 degrades to integer
  arch/x86/crypto/sha512_ssse3_glue.c:144:17: warning: incorrect type in assignment (different base types)
  arch/x86/crypto/sha512_ssse3_glue.c:144:17:    expected restricted __be64 <noident>
  arch/x86/crypto/sha512_ssse3_glue.c:144:17:    got unsigned long long

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
Acked-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
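The correct shape of the conversion at the flagged line, sketched with the glue code's apparent field names (treat as illustrative): do the shift/carry arithmetic entirely in CPU byte order and apply cpu_to_be64() exactly once per word:

    /* sctx->count[] is the 128-bit byte count in CPU order; the SHA-512
     * length block wants a big-endian bit count.  The bug class sparse
     * flags here is OR-ing a CPU-order value into an already-converted
     * __be64 instead of converting the combined value. */
    bits[1] = cpu_to_be64(sctx->count[0] << 3);
    bits[0] = cpu_to_be64(sctx->count[1] << 3 | sctx->count[0] >> 61);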
2014-07-09  kvm: fix wrong address when writing Hyper-V tsc page  (Xiaoming Gao)

commit e1fa108d24697b78348fd4e5a531029a50d0d36d upstream.

When kvm_write_guest writes the tsc_ref structure to the guest, the low HV_X64_MSR_TSC_REFERENCE_ADDRESS_SHIFT bits of the TSC page address must be cleared, or the guest can see a non-zero sequence number.

Otherwise Windows guests would not be able to get a correct clocksource (QueryPerformanceCounter will always return 0), which causes serious chaos.

Signed-off-by: Xiaoming Gao <newtongao@tencnet.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
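A sketch of the address handling implied by the message (variable names assumed):

    /* The guest-written MSR value is (gfn << SHIFT) | enable bits; the
     * write must target the page-aligned address, low bits cleared. */
    u64 gfn  = msr_value >> HV_X64_MSR_TSC_REFERENCE_ADDRESS_SHIFT;
    u64 addr = gfn << HV_X64_MSR_TSC_REFERENCE_ADDRESS_SHIFT;

    kvm_write_guest(kvm, addr, &tsc_ref, sizeof(tsc_ref));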
2014-07-09  KVM: x86: preserve the high 32-bits of the PAT register  (Paolo Bonzini)

commit 7cb060a91c0efc5ff94f83c6df3ed705e143cdb9 upstream.

KVM does not really do much with the PAT, so this went unnoticed for a long time. It is exposed however if you try to do rdmsr on the PAT register.

Reported-by: Valentine Sinitsyn <valentine.sinitsyn@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-07-09  KVM: x86: Increase the number of fixed MTRR regs to 10  (Nadav Amit)

commit 682367c494869008eb89ef733f196e99415ae862 upstream.

Recent Intel CPUs have 10 variable range MTRRs. Since operating systems sometimes make assumptions on CPUs while they ignore capability MSRs, it is better for KVM to be consistent with recent CPUs. Reporting more MTRRs than actually supported has no functional implications.

Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-07-06  ptrace,x86: force IRET path after a ptrace_stop()  (Tejun Heo)

commit b9cd18de4db3c9ffa7e17b0dc0ca99ed5aa4d43a upstream.

The 'sysret' fastpath does not correctly restore even all regular registers, much less any segment registers or eflags values. That is very much part of why it's faster than 'iret'.

Normally that isn't a problem, because the normal ptrace() interface catches the process using the signal handler infrastructure, which always returns with an iret.

However, some paths can get caught using ptrace_event() instead of the signal path, and for those we need to make sure that we aren't going to return to user space using 'sysret'. Otherwise the modifications that may have been done to the register set by the tracer wouldn't necessarily take effect.

Fix it by forcing the IRET path by setting TIF_NOTIFY_RESUME from arch_ptrace_stop_needed(), which is invoked from ptrace_stop().

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Andy Lutomirski <luto@amacapital.net>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-06-30  x86_32, entry: Do syscall exit work on badsys (CVE-2014-4508)  (Andy Lutomirski)

commit 554086d85e71f30abe46fc014fea31929a7c6a8a upstream.

The bad syscall nr paths are their own incomprehensible route through the entry control flow. Rearrange them to work just like syscalls that return -ENOSYS.

This fixes an OOPS in the audit code when fast-path auditing is enabled and sysenter gets a bad syscall nr (CVE-2014-4508).

This has probably been broken since Linux 2.6.27:

    af0575bba0 ("i386 syscall audit fast-path")

Cc: Roland McGrath <roland@redhat.com>
Reported-by: Toralf Förster <toralf.foerster@gmx.de>
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Link: http://lkml.kernel.org/r/e09c499eade6fc321266dd6b54da7beb28d6991c.1403558229.git.luto@amacapital.net
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-06-30  x86, x32: Use compat shims for io_{setup,submit}  (Mike Frysinger)

commit 7fd44dacdd803c0bbf38bf478d51d280902bb0f1 upstream.

io_setup takes a pointer to a context id of type aio_context_t, which in turn is typed to a __kernel_ulong_t. We could tweak the exported headers to define this as a 64-bit quantity for specific ABIs, but since we already have a 32-bit compat shim for the x86 ABI, let's just re-use that logic. The libaio package is also written to expect this as a pointer type, so a compat shim would simplify that.

io_submit operates on an array of pointers to iocb structs. Padding out the array to be 64-bit aligned is a huge pain, so convert it over to the existing compat shim too.

We don't convert io_getevents to the compat func, as its only purpose is to handle the timespec struct, and the x32 ABI uses 64-bit times.

With this change, the libaio package can now pass its testsuite when built for the x32 ABI.

Signed-off-by: Mike Frysinger <vapier@gentoo.org>
Link: http://lkml.kernel.org/r/1399250595-5005-1-git-send-email-vapier@gentoo.org
Cc: H.J. Lu <hjl.tools@gmail.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-06-30  x86-32, espfix: Remove filter for espfix32 due to race  (H. Peter Anvin)

commit 246f2d2ee1d715e1077fc47d61c394569c8ee692 upstream.

It is not safe to use LAR to filter when to go down the espfix path, because the LDT is per-process (rather than per-thread) and another thread might change the descriptors behind our back. Fortunately it is always *safe* (if a bit slow) to go down the espfix path, and a 32-bit LDT stack segment is extremely rare.

Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Link: http://lkml.kernel.org/r/1398816946-3351-1-git-send-email-hpa@linux.intel.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-06-30  hugetlb: restrict hugepage_migration_support() to x86_64  (Naoya Horiguchi)

commit c177c81e09e517bbf75b67762cdab1b83aba6976 upstream.

Currently hugepage migration is available for all archs which support pmd-level hugepage, but testing is done only for x86_64 and there are bugs for other archs. So to avoid breaking such archs, this patch limits the availability strictly to x86_64 until developers of other archs get interested in enabling this feature.

Simply disabling hugepage migration on non-x86_64 archs is not enough to fix the reported problem where sys_move_pages() hits the BUG_ON() in follow_page(FOLL_GET), so let's fix this by checking if hugepage migration is supported in vma_migratable().

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reported-by: Michael Ellerman <mpe@ellerman.id.au>
Tested-by: Michael Ellerman <mpe@ellerman.id.au>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: James Hogan <james.hogan@imgtec.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-06-26  KVM: lapic: sync highest ISR to hardware apic on EOI  (Paolo Bonzini)

commit fc57ac2c9ca8109ea97fcc594f4be436944230cc upstream.

When Hyper-V enlightenments are in effect, Windows prefers to issue a Hyper-V MSR write to issue an EOI rather than an x2apic MSR write. The Hyper-V MSR write is not handled by the processor, and besides being slower, this also causes bugs with APIC virtualization. The reason is that on EOI the processor will modify the highest in-service interrupt (SVI) field of the VMCS, as explained in section 29.1.4 of the SDM; every other step in EOI virtualization is already done by apic_send_eoi or on VM entry, but this one is missing.

We need to do the same, and be careful not to muck with the isr_count and highest_isr_cache fields that are unused when virtual interrupt delivery is enabled.

Reviewed-by: Yang Zhang <yang.z.zhang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-06-07  x86-64, modify_ldt: Make support for 16-bit segments a runtime option  (Linus Torvalds)

commit fa81511bb0bbb2b1aace3695ce869da9762624ff upstream.

Checkin:

    b3b42ac2cbae x86-64, modify_ldt: Ban 16-bit segments on 64-bit kernels

disabled 16-bit segments on 64-bit kernels due to an information leak. However, it does seem that people are genuinely using Wine to run old 16-bit Windows programs on Linux.

A proper fix for this ("espfix64") is coming in the upcoming merge window, but as a temporary fix, create a sysctl to allow the administrator to re-enable support for 16-bit segments.

It adds a "/proc/sys/abi/ldt16" sysctl that defaults to zero (off). If you hit this issue and care about your old Windows program more than you care about a kernel stack address information leak, you can do

    echo 1 > /proc/sys/abi/ldt16

as root (add it to your startup scripts), and you should be ok.

The sysctl table is only added if you have COMPAT support enabled on x86-64, but I assume anybody who runs old windows binaries very much does that ;)

Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Link: http://lkml.kernel.org/r/CA%2B55aFw9BPoD10U1LfHbOMpHWZkvJTkMcfCs9s3urPr1YyWBxw@mail.gmail.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-06-07  x86, mm, hugetlb: Add missing TLB page invalidation for hugetlb_cow()  (Anthony Iliopoulos)

commit 9844f5462392b53824e8b86726e7c33b5ecbb676 upstream.

The invalidation is required in order to maintain proper semantics under CoW conditions. In scenarios where a process clones several threads, a thread operating on a core whose DTLB entry for a particular hugepage has not been invalidated will be reading from the hugepage that belongs to the forked child process, even after hugetlb_cow().

The thread will not see the updated page as long as the stale DTLB entry remains cached, the thread attempts to write into the page, the child process exits, or the thread gets migrated to a different processor.

Signed-off-by: Anthony Iliopoulos <anthony.iliopoulos@huawei.com>
Link: http://lkml.kernel.org/r/20140514092948.GA17391@server-36.huawei.corp
Suggested-by: Shay Goikhman <shay.goikhman@huawei.com>
Acked-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-05-31  net: filter: x86: fix JIT address randomization  (Alexei Starovoitov)

[ Upstream commit 773cd38f40b8834be991dbfed36683acc1dd41ee ]

bpf_alloc_binary() adds 128 bytes of room to the JITed program image and rounds it up to the nearest page size. If the image size is close to the page size (like 4000), it is rounded up to two pages:

    round_up(4000 + 4 + 128) == 8192

then 'hole' is computed as

    8192 - (4000 + 4) = 4188

If prandom_u32() % hole selects a number >= PAGE_SIZE - sizeof(*header), the kernel will crash during bpf_jit_free():

    kernel BUG at arch/x86/mm/pageattr.c:887!
    Call Trace:
     [<ffffffff81037285>] change_page_attr_set_clr+0x135/0x460
     [<ffffffff81694cc0>] ? _raw_spin_unlock_irq+0x30/0x50
     [<ffffffff810378ff>] set_memory_rw+0x2f/0x40
     [<ffffffffa01a0d8d>] bpf_jit_free_deferred+0x2d/0x60
     [<ffffffff8106bf98>] process_one_work+0x1d8/0x6a0
     [<ffffffff8106bf38>] ? process_one_work+0x178/0x6a0
     [<ffffffff8106c90c>] worker_thread+0x11c/0x370

since bpf_jit_free() does:

    unsigned long addr = (unsigned long)fp->bpf_func & PAGE_MASK;
    struct bpf_binary_header *header = (void *)addr;

to compute the start address of 'bpf_binary_header', and header->pages will pass junk to:

    set_memory_rw(addr, header->pages);

Fix it by making sure that &header->image[prandom_u32() % hole] and &header are in the same page.

Fixes: 314beb9bcabfd ("x86: bpf_jit_comp: secure bpf jit against spraying attacks")
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
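The constraint stated in the last sentence, as a sketch (names are from the message; the min() form of the cap is an assumption about how the fix is expressed):

    /* Cap the random offset so header and the chosen image start can
     * never land in different pages. */
    hole  = min(size - (proglen + sizeof(*header)),
                PAGE_SIZE - sizeof(*header));
    image = &header->image[prandom_u32() % hole];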
2014-05-31  xen/spinlock: Don't enable them unconditionally.  (Konrad Rzeszutek Wilk)

commit e0fc17a936334c08b2729fff87168c03fdecf5b6 upstream.

The git commit a945928ea2709bc0e8e8165d33aed855a0110279 ('xen: Do not enable spinlocks before jump_label_init() has executed') was added to deal with the jump machinery. Earlier, the code that turned on the jump label was only called by Xen-specific functions. But now that it has been moved to the initcall machinery, it gets called on Xen, KVM, and baremetal - ouch! And the detection machinery to only call it on Xen wasn't remembered in the heat of merge-window excitement.

This means that the slowpath is enabled on baremetal while it should not be.

Reported-by: Waiman Long <waiman.long@hp.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
CC: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-05-31  x86,preempt: Fix preemption for i386  (Peter Zijlstra)

Many people reported preemption/reschedule problems with i386 kernels for .13 and .14. After Michele bisected this to a combination of

    3e8e42c69bb ("sched: Revert need_resched() to look at TIF_NEED_RESCHED")
    ded79754754 ("irq: Force hardirq exit's softirq processing on its own stack")

it finally dawned on me that i386's current_thread_info() was to blame.

When we are on interrupt/exception stacks, we fail to observe the right TIF_NEED_RESCHED bit and therefore the PREEMPT_NEED_RESCHED folding malfunctions.

Current upstream fixes this by making i386 behave the same as x86_64 already did:

    2432e1364bbe ("x86: Nuke the supervisor_stack field in i386 thread_info")
    b807902a88c4 ("x86: Nuke GET_THREAD_INFO_WITH_ESP() macro for i386")
    0788aa6a23cb ("x86: Prepare removal of previous_esp from i386 thread_info structure")
    198d208df437 ("x86: Keep thread_info on thread stack in x86_32")

However, that is far too much to stuff into -stable. Therefore I propose we merge the below patch, which uses task_thread_info(current) for tif_need_resched() instead of the ESP-based current_thread_info().

This makes sure we always observe the one true TIF_NEED_RESCHED bit and things will work as expected again.

Cc: bp@alien8.de
Cc: fweisbec@gmail.com
Cc: david.a.cohen@linux.intel.com
Cc: mingo@kernel.org
Cc: greg@kroah.com
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: gregkh@linuxfoundation.org
Cc: pbonzini@redhat.com
Cc: rostedt@goodmis.org
Cc: stefan.bader@canonical.com
Cc: toralf.foerster@gmx.de
Cc: David Cohen <david.a.cohen@linux.intel.com>
Cc: torvalds@linux-foundation.org
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: <stable@vger.kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: <stable-commits@vger.kernel.org>
Cc: peterz@infradead.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: barra_cuda@katamail.com
Tested-by: Stefan Bader <stefan.bader@canonical.com>
Tested-by: Toralf Förster <toralf.foerster@gmx.de>
Tested-by: Michele Ballabio <barra_cuda@katamail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20140409142447.GD13658@twins.programming.kicks-ass.net
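The proposed -stable fix, sketched from the description above (the macro body is assumed, not quoted from the patch):

    /* Use the task's own thread_info rather than the %esp-derived one,
     * so code running on interrupt/exception stacks still sees the
     * true TIF_NEED_RESCHED bit. */
    #define tif_need_resched() \
            test_ti_thread_flag(task_thread_info(current), TIF_NEED_RESCHED)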
2014-05-31  KVM: x86: remove WARN_ON from get_kernel_ns()  (Marcelo Tosatti)

commit b351c39cc9e0151cee9b8d52a1e714928faabb38 upstream.

Function and callers can be preempted.

https://bugzilla.kernel.org/show_bug.cgi?id=73721

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-05-13  x86-64, build: Fix stack protector Makefile breakage with 32-bit userland  (George Spelvin)

commit 14262d67fe348018af368a07430fbc06eadeabb1 upstream.

If you are using a 64-bit kernel with 32-bit userland, then scripts/gcc-x86_64-has-stack-protector.sh invokes 32-bit gcc with -mcmodel=kernel, which produces:

    <stdin>:1:0: error: code model 'kernel' not supported in the 32 bit mode

and trips the "broken compiler" test at arch/x86/Makefile:120.

There are several places a fix is possible, but the following seems cleanest. (But it's minimal; it would also be possible to factor out a bunch of stuff from the two branches of the if.)

Signed-off-by: George Spelvin <linux@horizon.com>
Link: http://lkml.kernel.org/r/20140507210552.7581.qmail@ns.horizon.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-05-06  x86/efi: Correct EFI boot stub use of code32_start  (Matt Fleming)

commit 7e8213c1f3acc064aef37813a39f13cbfe7c3ce7 upstream.

code32_start should point at the start of the protected mode code, and *not* at the beginning of the bzImage. This is much easier to do in assembly, so document that callers of make_boot_params() need to fill out code32_start.

The fallout from this bug is that we would end up relocating the image but copying the image at some offset, resulting in what appeared to be memory corruption.

Reported-by: Thomas Bächler <thomas@archlinux.org>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-05-06  x86-64, modify_ldt: Ban 16-bit segments on 64-bit kernels  (H. Peter Anvin)

commit b3b42ac2cbae1f3cecbb6229964a4d48af31d382 upstream.

The IRET instruction, when returning to a 16-bit segment, only restores the bottom 16 bits of the user space stack pointer. We have a software workaround for that ("espfix") for the 32-bit kernel, but it relies on a nonzero stack segment base which is not available in 64-bit mode.

Since 16-bit support is somewhat crippled anyway on a 64-bit kernel (no V86 mode), and most (if not quite all) 64-bit processors support virtualization for the users who really need it, simply reject attempts at creating a 16-bit segment when running on top of a 64-bit kernel.

Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Link: http://lkml.kernel.org/n/tip-kicdm89kzw9lldryb1br9od0@git.kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-05-06  ftrace/x86: One more missing sync after fixup of function modification failure  (Petr Mladek)

commit 12729f14d8357fb845d75155228b21e76360272d upstream.

If a failure occurs while modifying an ftrace function, it bails out and will remove the tracepoints to get back to what the code originally was. The final sync run across the CPUs, after the fixup is done and before the ftrace int3 handler flag is reset, is missing.

Here's the description of the problem:

    CPU0                              CPU1
    ----                              ----
    remove_breakpoint();
    modifying_ftrace_code = 0;
                                      [still sees breakpoint]
                                      <takes trap>
                                      [sees modifying_ftrace_code as zero]
                                      [no breakpoint handler]
                                      [goto failed case]
                                      [trap exception - kernel breakpoint,
                                       no handler]
                                      BUG()

Link: http://lkml.kernel.org/r/1393258342-29978-2-git-send-email-pmladek@suse.cz

Fixes: 8a4d0a687a5 "ftrace: Use breakpoint method to update ftrace caller"
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: H. Peter Anvin <hpa@linux.intel.com>
Signed-off-by: Petr Mladek <pmladek@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
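The shape of the fix, sketched (run_sync() here stands for whatever IPI-based sync the modification code already uses between steps; the names are illustrative):

    /* failure path: revert the breakpoints, then sync once more
     * across CPUs *before* dropping the int3-handler flag, so no CPU
     * can still hit a breakpoint after the handler is disarmed. */
    remove_breakpoints();
    run_sync();                     /* the previously missing final sync */
    modifying_ftrace_code = 0;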
2014-05-06  x86, AVX-512: Enable AVX-512 States Context Switch  (Fenghua Yu)

commit c2bc11f10a39527cd1bb252097b5525664560956 upstream.

This patch enables Opmask, ZMM_Hi256, and Hi16_ZMM AVX-512 states for xstate context switch.

Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Link: http://lkml.kernel.org/r/1392931491-33237-2-git-send-email-fenghua.yu@intel.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-05-06  x86, AVX-512: AVX-512 Feature Detection  (Fenghua Yu)

commit 8e5780fdeef7dc490b3f0b3a62704593721fa4f3 upstream.

AVX-512 is an extension of AVX2. Its spec can be found at:

    http://download-software.intel.com/sites/default/files/managed/71/2e/319433-017.pdf

This patch detects AVX-512 features by CPUID.

Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Link: http://lkml.kernel.org/r/1392931491-33237-1-git-send-email-fenghua.yu@intel.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
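Userspace can probe the same way; a minimal sketch (the leaf-7/EBX-bit-16 position for AVX512F comes from Intel's spec linked above, not from this patch, and a full probe would also check the max basic CPUID leaf first):

    #include <cpuid.h>
    #include <stdio.h>

    int main(void)
    {
            unsigned int eax, ebx, ecx, edx;

            /* CPUID leaf 7, subleaf 0: EBX bit 16 = AVX512F */
            __cpuid_count(7, 0, eax, ebx, ecx, edx);
            printf("AVX512F: %s\n", (ebx & (1u << 16)) ? "yes" : "no");
            return 0;
    }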
2014-05-06  x86, hash: Fix build failure with older binutils  (Jan Beulich)

commit 06325190bd577e11429444d54f454b9d13f560c9 upstream.

Just like for other ISA extension instruction uses, we should check whether the assembler actually supports them. The fallback here simply is to encode an instruction with fixed operands (%eax and %ecx).

[ hpa: tagging for -stable as a build fix ]

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Link: http://lkml.kernel.org/r/530F0996020000780011FBE7@nat28.tlf.novell.com
Cc: Francesco Fusco <ffusco@redhat.com>
Cc: Thomas Graf <tgraf@redhat.com>
Cc: David S. Miller <davem@davemloft.net>
Acked-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-04-26  x86: Adjust irq remapping quirk for older revisions of 5500/5520 chipsets  (Neil Horman)

commit 6f8a1b335fde143b7407036e2368d3cd6eb55674 upstream.

Commit 03bbcb2e7e2 ("iommu/vt-d: add quirk for broken interrupt remapping on 55XX chipsets") properly disables irq remapping on the 5500/5520 chipsets that don't correctly perform that feature.

However, when I wrote it, I followed the errata sheet linked in that commit too closely, and explicitly tied the activation of the quirk to revision 0x13 of the chip, under the assumption that earlier revisions were not in the field. Recently a system was reported to be suffering from this remap bug and the quirk hadn't triggered, because the revision id register read a lower value than 0x13, so the quirk test failed improperly.

Given this, it seems only prudent to adjust this quirk so that any revision less than 0x13 has the quirk asserted.

[ tglx: Removed the 0x12 comparison of pci id 3405 as this is covered by the <= 0x13 check already ]

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: x86@kernel.org
Link: http://lkml.kernel.org/r/1394649873-14913-1-git-send-email-nhorman@tuxdriver.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
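The adjusted test, sketched (the helper name is hypothetical; the revision constant is from the message):

    /* Assert the quirk for every revision at or below 0x13, not only
     * for revision 0x13 exactly. */
    if (revision <= 0x13)
            set_irq_remapping_broken();     /* hypothetical helper name */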
2014-04-26  x86, hyperv: Bypass the timer_irq_works() check  (Jason Wang)

commit ca3ba2a2f4a49a308e7d78c784d51b2332064f15 upstream.

This patch bypasses the timer_irq_works() check for Hyper-V guests since:

- It is guaranteed to work.
- timer_irq_works() may sometimes fail because the lpj calibration is inaccurate in a Hyper-V guest or on a buggy host.

In the future, we should get the tsc frequency from the hypervisor and use a preset lpj instead.

[ hpa: I would prefer to not defer things to "the future" in the future... ]

Cc: K. Y. Srinivasan <kys@microsoft.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Acked-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Link: http://lkml.kernel.org/r/1393558229-14755-1-git-send-email-jasowang@redhat.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-04-14  crypto: ghash-clmulni-intel - use C implementation for setkey()  (Ard Biesheuvel)

commit 8ceee72808d1ae3fb191284afc2257a2be964725 upstream.

The GHASH setkey() function uses SSE registers but fails to call kernel_fpu_begin()/kernel_fpu_end(). Instead of adding these calls, and then having to deal with the restriction that they cannot be called from interrupt context, move the setkey() implementation to the C domain.

Note that setkey() does not use any particular SSE features and is not expected to become a performance bottleneck.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Acked-by: H. Peter Anvin <hpa@linux.intel.com>
Fixes: 0e1227d356e9b (crypto: ghash - Add PCLMULQDQ accelerated implementation)
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-04-14  x86/efi: Make efi virtual runtime map passing more robust  (Borislav Petkov)

commit b7b898ae0c0a82489511a1ce1b35f26215e6beb5 upstream.

Currently, running SetVirtualAddressMap() and passing the physical address of the virtual map array was working only by a lucky coincidence, because the memory was present in the EFI page table too. Until Toshi went and booted this on a big HP box - the krealloc() manner of resizing the memmap we're doing happened to allocate from physical addresses that were not mapped anymore, and boom:

    http://lkml.kernel.org/r/1386806463.1791.295.camel@misato.fc.hp.com

One way to take care of that issue is to reimplement the krealloc thing but with pages. We start with contiguous pages of order 1, i.e. 2 pages, and when we deplete that memory (which shouldn't happen all that often, but you know firmware) we realloc the next power-of-two pages.

Having the pages, it is much more handy and easy to map them into the EFI page table with the already existing mapping code which we're using for building the virtual mappings.

Thanks to Toshi Kani and Matt for the great debugging help.

Reported-by: Toshi Kani <toshi.kani@hp.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Toshi Kani <toshi.kani@hp.com>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-04-14  x86, pageattr: Export page unmapping interface  (Borislav Petkov)

commit 42a5477251f0e0f33ad5f6a95c48d685ec03191e upstream.

We will use it in efi, so expose it.

Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Toshi Kani <toshi.kani@hp.com>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-03-28  x86: fix boot on uniprocessor systems  (Artem Fetishev)

On x86 uniprocessor systems topology_physical_package_id() returns -1, which causes rapl_cpu_prepare() to leave the rapl_pmu variable uninitialized, which leads to a GPF in rapl_pmu_init(). See arch/x86/kernel/cpu/perf_event_intel_rapl.c.

It turns out that physical_package_id and core_id can actually be retrieved for uniprocessor systems too. Enabling them also fixes the rapl_pmu code.

Signed-off-by: Artem Fetishev <artem_fetishev@epam.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-03-25Revert "xen: properly account for _PAGE_NUMA during xen pte translations"David Vrabel
This reverts commit a9c8e4beeeb64c22b84c803747487857fe424b68. PTEs in Xen PV guests must contain machine addresses if _PAGE_PRESENT is set and pseudo-physical addresses is _PAGE_PRESENT is clear. This is because during a domain save/restore (migration) the page table entries are "canonicalised" and uncanonicalised". i.e., MFNs are converted to PFNs during domain save so that on a restore the page table entries may be rewritten with the new MFNs on the destination. This canonicalisation is only done for PTEs that are present. This change resulted in writing PTEs with MFNs if _PAGE_PROTNONE (or _PAGE_NUMA) was set but _PAGE_PRESENT was clear. These PTEs would be migrated as-is which would result in unexpected behaviour in the destination domain. Either a) the MFN would be translated to the wrong PFN/page; b) setting the _PAGE_PRESENT bit would clear the PTE because the MFN is no longer owned by the domain; or c) the present bit would not get set. Symptoms include "Bad page" reports when munmapping after migrating a domain. Signed-off-by: David Vrabel <david.vrabel@citrix.com> Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: <stable@vger.kernel.org> [3.12+]
2014-03-19  Merge tag 'pci-v3.14-fixes-3' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci  (Linus Torvalds)

Pull PCI resource management fix from Bjorn Helgaas:

"This is a fix for an AGP regression exposed by e501b3d87f00 ("agp: Support 64-bit APBASE"), which we merged in v3.14-rc1.

We've warned about the conflict between the GART and PCI resources and cleared out the PCI resource for a long time, but after e501b3d87f00, we still *use* that cleared-out PCI resource. I think the GART resource is incorrect, so this patch removes it"

* tag 'pci-v3.14-fixes-3' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci:
  Revert "[PATCH] Insert GART region into resource map"
2014-03-18Revert "[PATCH] Insert GART region into resource map"Bjorn Helgaas
This reverts commit 56dd669a138c, which makes the GART visible in /proc/iomem. This fixes a regression: e501b3d87f00 ("agp: Support 64-bit APBASE") exposed an existing problem with a conflict between the GART region and a PCI BAR region. The GART addresses are bus addresses, not CPU addresses, and therefore should not be inserted in iomem_resource. On many machines, the GART region is addressable by the CPU as well as by an AGP master, but CPU addressability is not required by the spec. On some of these machines, the GART is mapped by a PCI BAR, and in that case, the PCI core automatically inserts it into iomem_resource, just as it does for all BARs. Inserting it here means we'll have a conflict if the PCI core later tries to claim the GART region, so let's drop the insertion here. The conflict indirectly causes X failures, as reported by Jouni in the bugzilla below. We detected the conflict even before e501b3d87f00, but after it the AGP code (fix_northbridge()) uses the PCI resource (which is zeroed because of the conflict) instead of reading the BAR again. Conflicts: arch/x86_64/kernel/aperture.c Fixes: e501b3d87f00 agp: Support 64-bit APBASE Link: https://bugzilla.kernel.org/show_bug.cgi?id=72201 Reported-and-tested-by: Jouni Mettälä <jtmettala@gmail.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2014-03-16  Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds)

Pull perf fixes from Ingo Molnar: "Misc smaller fixes"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/x86: Fix leak in uncore_type_init failure paths
  perf machine: Use map as success in ip__resolve_ams
  perf symbols: Fix crash in elf_section_by_name
  perf trace: Decode architecture-specific signal numbers
2014-03-14  Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds)

Pull x86 fixes from Peter Anvin:

"Two x86 fixes: Suresh's eager FPU fix, and a fix to the NUMA quirk for AMD northbridges. This only includes Suresh's fix patch, not the "mostly a cleanup" patch which had __init issues"

* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/amd/numa: Fix northbridge quirk to assign correct NUMA node
  x86, fpu: Check tsk_used_math() in kernel_fpu_end() for eager FPU
2014-03-14  x86/amd/numa: Fix northbridge quirk to assign correct NUMA node  (Daniel J Blueman)

For systems with multiple servers and routed fabric, all northbridges get assigned to the first server. Fix this by also using the node reported from the PCI bus.

For single-fabric systems, the northbridges are on PCI bus 0 by definition, which are on NUMA node 0 by definition, so this is invariant on most systems.

Tested on fam10h and fam15h single and multi-fabric systems and candidate for stable.

Signed-off-by: Daniel J Blueman <daniel@numascale.com>
Acked-by: Steffen Persvold <sp@numascale.com>
Acked-by: Borislav Petkov <bp@suse.de>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/1394710981-3596-1-git-send-email-daniel@numascale.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>