summaryrefslogtreecommitdiff
path: root/arch/x86
AgeCommit message (Collapse)Author
2013-03-04x86: Make sure we can boot in the case the BDA contains pure garbageH. Peter Anvin
commit 7c10093692ed2e6f318387d96b829320aa0ca64c upstream. On non-BIOS platforms it is possible that the BIOS data area contains garbage instead of being zeroed or something equivalent (firmware people: we are talking of 1.5K here, so please do the sane thing.) We need on the order of 20-30K of low memory in order to boot, which may grow up to < 64K in the future. We probably want to avoid the lowest of the low memory. At the same time, it seems extremely unlikely that a legitimate EBDA would ever reach down to the 128K (which would require it to be over half a megabyte in size.) Thus, pick 128K as the cutoff for "this is insane, ignore." We may still end up reserving a bunch of extra memory on the low megabyte, but that is not really a major issue these days. In the worst case we lose 512K of RAM. This code really should be merged with trim_bios_range() in arch/x86/kernel/setup.c, but that is a bigger patch for a later merge window. Reported-by: Darren Hart <dvhart@linux.intel.com> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Cc: Matt Fleming <matt.fleming@intel.com> Link: http://lkml.kernel.org/n/tip-oebml055yyfm8yxmria09rja@git.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-03-04x86, efi: Make "noefi" really disable EFI runtime serivcesMatt Fleming
commit fb834c7acc5e140cf4f9e86da93a66de8c0514da upstream. commit 1de63d60cd5b ("efi: Clear EFI_RUNTIME_SERVICES rather than EFI_BOOT by "noefi" boot parameter") attempted to make "noefi" true to its documentation and disable EFI runtime services to prevent the bricking bug described in commit e0094244e41c ("samsung-laptop: Disable on EFI hardware"). However, it's not possible to clear EFI_RUNTIME_SERVICES from an early param function because EFI_RUNTIME_SERVICES is set in efi_init() *after* parse_early_param(). This resulted in "noefi" effectively becoming a no-op and no longer providing users with a way to disable EFI, which is bad for those users that have buggy machines. Reported-by: Walt Nelson Jr <walt0924@gmail.com> Cc: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com> Signed-off-by: Matt Fleming <matt.fleming@intel.com> Link: http://lkml.kernel.org/r/1361392572-25657-1-git-send-email-matt@console-pimps.org Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-02-28xen: Send spinlock IPI to all waitersStefan Bader
commit 76eaca031f0af2bb303e405986f637811956a422 upstream. There is a loophole between Xen's current implementation of pv-spinlocks and the scheduler. This was triggerable through a testcase until v3.6 changed the TLB flushing code. The problem potentially is still there just not observable in the same way. What could happen was (is): 1. CPU n tries to schedule task x away and goes into a slow wait for the runq lock of CPU n-# (must be one with a lower number). 2. CPU n-#, while processing softirqs, tries to balance domains and goes into a slow wait for its own runq lock (for updating some records). Since this is a spin_lock_irqsave in softirq context, interrupts will be re-enabled for the duration of the poll_irq hypercall used by Xen. 3. Before the runq lock of CPU n-# is unlocked, CPU n-1 receives an interrupt (e.g. endio) and when processing the interrupt, tries to wake up task x. But that is in schedule and still on_cpu, so try_to_wake_up goes into a tight loop. 4. The runq lock of CPU n-# gets unlocked, but the message only gets sent to the first waiter, which is CPU n-# and that is busily stuck. 5. CPU n-# never returns from the nested interruption to take and release the lock because the scheduler uses a busy wait. And CPU n never finishes the task migration because the unlock notification only went to CPU n-#. To avoid this and since the unlocking code has no real sense of which waiter is best suited to grab the lock, just send the IPI to all of them. This causes the waiters to return from the hyper- call (those not interrupted at least) and do active spinlocking. BugLink: http://bugs.launchpad.net/bugs/1011792 Acked-by: Jan Beulich <JBeulich@suse.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-02-28x86: Hyper-V: register clocksource only if its advertisedOlaf Hering
commit 32068f6527b8f1822a30671dedaf59c567325026 upstream. Enable hyperv_clocksource only if its advertised as a feature. XenServer 6 returns the signature which is checked in ms_hyperv_platform(), but it does not offer all features. Currently the clocksource is enabled unconditionally in ms_hyperv_init_platform(), and the result is a hanging guest. Hyper-V spec Bit 1 indicates the availability of Partition Reference Counter. Register the clocksource only if this bit is set. The guest in question prints this in dmesg: [ 0.000000] Hypervisor detected: Microsoft HyperV [ 0.000000] HyperV: features 0x70, hints 0x0 This bug can be reproduced easily be setting 'viridian=1' in a HVM domU .cfg file. A workaround without this patch is to boot the HVM guest with 'clocksource=jiffies'. Signed-off-by: Olaf Hering <olaf@aepfle.de> Link: http://lkml.kernel.org/r/1359940959-32168-1-git-send-email-kys@microsoft.com Signed-off-by: K. Y. Srinivasan <kys@microsoft.com> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-02-28x86-32, mm: Remove reference to alloc_remap()H. Peter Anvin
commit 07f4207a305c834f528d08428df4531744e25678 upstream. We have removed the remap allocator for x86-32, and x86-64 never had it (and doesn't need it). Remove residual reference to it. Reported-by: Yinghai Lu <yinghai@kernel.org> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Link: http://lkml.kernel.org/r/CAE9FiQVn6_QZi3fNQ-JHYiR-7jeDJ5hT0SyT_%2BzVvfOj=PzF3w@mail.gmail.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-02-28x86-32, mm: Remove reference to resume_map_numa_kva()H. Peter Anvin
commit bb112aec5ee41427e9b9726e3d57b896709598ed upstream. Remove reference to removed function resume_map_numa_kva(). Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Link: http://lkml.kernel.org/r/20130131005616.1C79F411@kernel.stglabs.ibm.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-02-28x86-32, mm: Rip out x86_32 NUMA remapping codeDave Hansen
commit f03574f2d5b2d6229dcdf2d322848065f72953c7 upstream. This code was an optimization for 32-bit NUMA systems. It has probably been the cause of a number of subtle bugs over the years, although the conditions to excite them would have been hard to trigger. Essentially, we remap part of the kernel linear mapping area, and then sometimes part of that area gets freed back in to the bootmem allocator. If those pages get used by kernel data structures (say mem_map[] or a dentry), there's no big deal. But, if anyone ever tried to use the linear mapping for these pages _and_ cared about their physical address, bad things happen. For instance, say you passed __GFP_ZERO to the page allocator and then happened to get handed one of these pages, it zero the remapped page, but it would make a pte to the _old_ page. There are probably a hundred other ways that it could screw with things. We don't need to hang on to performance optimizations for these old boxes any more. All my 32-bit NUMA systems are long dead and buried, and I probably had access to more than most people. This code is causing real things to break today: https://lkml.org/lkml/2013/1/9/376 I looked in to actually fixing this, but it requires surgery to way too much brittle code, as well as stuff like per_cpu_ptr_to_phys(). [ hpa: Cc: this for -stable, since it is a memory corruption issue. However, an alternative is to simply mark NUMA as depends BROKEN rather than EXPERIMENTAL in the X86_32 subclause... ] Link: http://lkml.kernel.org/r/20130131005616.1C79F411@kernel.stglabs.ibm.com Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-02-17efi: Clear EFI_RUNTIME_SERVICES rather than EFI_BOOT by "noefi" boot parameterSatoru Takeuchi
commit 1de63d60cd5b0d33a812efa455d5933bf1564a51 upstream. There was a serious problem in samsung-laptop that its platform driver is designed to run under BIOS and running under EFI can cause the machine to become bricked or can cause Machine Check Exceptions. Discussion about this problem: https://bugs.launchpad.net/ubuntu-cdimage/+bug/1040557 https://bugzilla.kernel.org/show_bug.cgi?id=47121 The patches to fix this problem: efi: Make 'efi_enabled' a function to query EFI facilities 83e68189745ad931c2afd45d8ee3303929233e7f samsung-laptop: Disable on EFI hardware e0094244e41c4d0c7ad69920681972fc45d8ce34 Unfortunately this problem comes back again if users specify "noefi" option. This parameter clears EFI_BOOT and that driver continues to run even if running under EFI. Refer to the document, this parameter should clear EFI_RUNTIME_SERVICES instead. Documentation/kernel-parameters.txt: =============================================================================== ... noefi [X86] Disable EFI runtime services support. ... =============================================================================== Documentation/x86/x86_64/uefi.txt: =============================================================================== ... - If some or all EFI runtime services don't work, you can try following kernel command line parameters to turn off some or all EFI runtime services. noefi turn off all EFI runtime services ... =============================================================================== Signed-off-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com> Link: http://lkml.kernel.org/r/511C2C04.2070108@jp.fujitsu.com Cc: Matt Fleming <matt.fleming@intel.com> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-02-17x86/xen: don't assume %ds is usable in xen_iret for 32-bit PVOPS.Jan Beulich
commit 13d2b4d11d69a92574a55bfd985cfb0ca77aebdc upstream. This fixes CVE-2013-0228 / XSA-42 Drew Jones while working on CVE-2013-0190 found that that unprivileged guest user in 32bit PV guest can use to crash the > guest with the panic like this: ------------- general protection fault: 0000 [#1] SMP last sysfs file: /sys/devices/vbd-51712/block/xvda/dev Modules linked in: sunrpc ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 xen_netfront ext4 mbcache jbd2 xen_blkfront dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan] Pid: 1250, comm: r Not tainted 2.6.32-356.el6.i686 #1 EIP: 0061:[<c0407462>] EFLAGS: 00010086 CPU: 0 EIP is at xen_iret+0x12/0x2b EAX: eb8d0000 EBX: 00000001 ECX: 08049860 EDX: 00000010 ESI: 00000000 EDI: 003d0f00 EBP: b77f8388 ESP: eb8d1fe0 DS: 0000 ES: 007b FS: 0000 GS: 00e0 SS: 0069 Process r (pid: 1250, ti=eb8d0000 task=c2953550 task.ti=eb8d0000) Stack: 00000000 0027f416 00000073 00000206 b77f8364 0000007b 00000000 00000000 Call Trace: Code: c3 8b 44 24 18 81 4c 24 38 00 02 00 00 8d 64 24 30 e9 03 00 00 00 8d 76 00 f7 44 24 08 00 00 02 80 75 33 50 b8 00 e0 ff ff 21 e0 <8b> 40 10 8b 04 85 a0 f6 ab c0 8b 80 0c b0 b3 c0 f6 44 24 0d 02 EIP: [<c0407462>] xen_iret+0x12/0x2b SS:ESP 0069:eb8d1fe0 general protection fault: 0000 [#2] ---[ end trace ab0d29a492dcd330 ]--- Kernel panic - not syncing: Fatal exception Pid: 1250, comm: r Tainted: G D --------------- 2.6.32-356.el6.i686 #1 Call Trace: [<c08476df>] ? panic+0x6e/0x122 [<c084b63c>] ? oops_end+0xbc/0xd0 [<c084b260>] ? do_general_protection+0x0/0x210 [<c084a9b7>] ? error_code+0x73/ ------------- Petr says: " I've analysed the bug and I think that xen_iret() cannot cope with mangled DS, in this case zeroed out (null selector/descriptor) by either xen_failsafe_callback() or RESTORE_REGS because the corresponding LDT entry was invalidated by the reproducer. " Jan took a look at the preliminary patch and came up a fix that solves this problem: "This code gets called after all registers other than those handled by IRET got already restored, hence a null selector in %ds or a non-null one that got loaded from a code or read-only data descriptor would cause a kernel mode fault (with the potential of crashing the kernel as a whole, if panic_on_oops is set)." The way to fix this is to realize that the we can only relay on the registers that IRET restores. The two that are guaranteed are the %cs and %ss as they are always fixed GDT selectors. Also they are inaccessible from user mode - so they cannot be altered. This is the approach taken in this patch. Another alternative option suggested by Jan would be to relay on the subtle realization that using the %ebp or %esp relative references uses the %ss segment. In which case we could switch from using %eax to %ebp and would not need the %ss over-rides. That would also require one extra instruction to compensate for the one place where the register is used as scaled index. However Andrew pointed out that is too subtle and if further work was to be done in this code-path it could escape folks attention and lead to accidents. Reviewed-by: Petr Matousek <pmatouse@redhat.com> Reported-by: Petr Matousek <pmatouse@redhat.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-02-17x86/mm: Check if PUD is large when validating a kernel addressMel Gorman
commit 0ee364eb316348ddf3e0dfcd986f5f13f528f821 upstream. A user reported the following oops when a backup process reads /proc/kcore: BUG: unable to handle kernel paging request at ffffbb00ff33b000 IP: [<ffffffff8103157e>] kern_addr_valid+0xbe/0x110 [...] Call Trace: [<ffffffff811b8aaa>] read_kcore+0x17a/0x370 [<ffffffff811ad847>] proc_reg_read+0x77/0xc0 [<ffffffff81151687>] vfs_read+0xc7/0x130 [<ffffffff811517f3>] sys_read+0x53/0xa0 [<ffffffff81449692>] system_call_fastpath+0x16/0x1b Investigation determined that the bug triggered when reading system RAM at the 4G mark. On this system, that was the first address using 1G pages for the virt->phys direct mapping so the PUD is pointing to a physical address, not a PMD page. The problem is that the page table walker in kern_addr_valid() is not checking pud_large() and treats the physical address as if it was a PMD. If it happens to look like pmd_none then it'll silently fail, probably returning zeros instead of real data. If the data happens to look like a present PMD though, it will be walked resulting in the oops above. This patch adds the necessary pud_large() check. Unfortunately the problem was not readily reproducible and now they are running the backup program without accessing /proc/kcore so the patch has not been validated but I think it makes sense. Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.coM> Reviewed-by: Michal Hocko <mhocko@suse.cz> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/20130211145236.GX21389@suse.de Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-02-17x86/apic: Work around boot failure on HP ProLiant DL980 G7 Server systemsStoney Wang
commit cb214ede7657db458fd0b2a25ea0b28dbf900ebc upstream. When a HP ProLiant DL980 G7 Server boots a regular kernel, there will be intermittent lost interrupts which could result in a hang or (in extreme cases) data loss. The reason is that this system only supports x2apic physical mode, while the kernel boots with a logical-cluster default setting. This bug can be worked around by specifying the "x2apic_phys" or "nox2apic" boot option, but we want to handle this system without requiring manual workarounds. The BIOS sets ACPI_FADT_APIC_PHYSICAL in FADT table. As all apicids are smaller than 255, BIOS need to pass the control to the OS with xapic mode, according to x2apic-spec, chapter 2.9. Current code handle x2apic when BIOS pass with xapic mode enabled: When user specifies x2apic_phys, or FADT indicates PHYSICAL: 1. During madt oem check, apic driver is set with xapic logical or xapic phys driver at first. 2. enable_IR_x2apic() will enable x2apic_mode. 3. if user specifies x2apic_phys on the boot line, x2apic_phys_probe() will install the correct x2apic phys driver and use x2apic phys mode. Otherwise it will skip the driver will let x2apic_cluster_probe to take over to install x2apic cluster driver (wrong one) even though FADT indicates PHYSICAL, because x2apic_phys_probe does not check FADT PHYSICAL. Add checking x2apic_fadt_phys in x2apic_phys_probe() to fix the problem. Signed-off-by: Stoney Wang <song-bo.wang@hp.com> [ updated the changelog and simplified the code ] Signed-off-by: Yinghai Lu <yinghai@kernel.org> Link: http://lkml.kernel.org/r/1360263182-16226-1-git-send-email-yinghai@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-02-17x86: Do not leak kernel page mapping locationsKees Cook
commit e575a86fdc50d013bf3ad3aa81d9100e8e6cc60d upstream. Without this patch, it is trivial to determine kernel page mappings by examining the error code reported to dmesg[1]. Instead, declare the entire kernel memory space as a violation of a present page. Additionally, since show_unhandled_signals is enabled by default, switch branch hinting to the more realistic expectation, and unobfuscate the setting of the PF_PROT bit to improve readability. [1] http://vulnfactory.org/blog/2013/02/06/a-linux-memory-trick/ Reported-by: Dan Rosenberg <dan.j.rosenberg@gmail.com> Suggested-by: Brad Spengler <spender@grsecurity.net> Signed-off-by: Kees Cook <keescook@chromium.org> Acked-by: H. Peter Anvin <hpa@zytor.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20130207174413.GA12485@www.outflux.net Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-02-14efi: Make 'efi_enabled' a function to query EFI facilitiesMatt Fleming
commit 83e68189745ad931c2afd45d8ee3303929233e7f upstream. Originally 'efi_enabled' indicated whether a kernel was booted from EFI firmware. Over time its semantics have changed, and it now indicates whether or not we are booted on an EFI machine with bit-native firmware, e.g. 64-bit kernel with 64-bit firmware. The immediate motivation for this patch is the bug report at, https://bugs.launchpad.net/ubuntu-cdimage/+bug/1040557 which details how running a platform driver on an EFI machine that is designed to run under BIOS can cause the machine to become bricked. Also, the following report, https://bugzilla.kernel.org/show_bug.cgi?id=47121 details how running said driver can also cause Machine Check Exceptions. Drivers need a new means of detecting whether they're running on an EFI machine, as sadly the expression, if (!efi_enabled) hasn't been a sufficient condition for quite some time. Users actually want to query 'efi_enabled' for different reasons - what they really want access to is the list of available EFI facilities. For instance, the x86 reboot code needs to know whether it can invoke the ResetSystem() function provided by the EFI runtime services, while the ACPI OSL code wants to know whether the EFI config tables were mapped successfully. There are also checks in some of the platform driver code to simply see if they're running on an EFI machine (which would make it a bad idea to do BIOS-y things). This patch is a prereq for the samsung-laptop fix patch. Signed-off-by: Matt Fleming <matt.fleming@intel.com> Cc: David Airlie <airlied@linux.ie> Cc: Corentin Chary <corentincj@iksaif.net> Cc: Matthew Garrett <mjg59@srcf.ucam.org> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Olof Johansson <olof@lixom.net> Cc: Peter Jones <pjones@redhat.com> Cc: Colin Ian King <colin.king@canonical.com> Cc: Steve Langasek <steve.langasek@canonical.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Konrad Rzeszutek Wilk <konrad@kernel.org> Cc: Rafael J. Wysocki <rjw@sisk.pl> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-02-11x86-64: Replace left over sti/cli in ia32 audit exit codeJan Beulich
commit 40a1ef95da85843696fc3ebe5fce39b0db32669f upstream. For some reason they didn't get replaced so far by their paravirt equivalents, resulting in code to be run with interrupts disabled that doesn't expect so (causing, in the observed case, a BUG_ON() to trigger) when syscall auditing is enabled. David (Cc-ed) came up with an identical fix, so likely this can be taken to count as an ack from him. Reported-by: Peter Moody <pmoody@google.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Cc: David Vrabel <david.vrabel@citrix.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Link: http://lkml.kernel.org/r/5108E01902000078000BA9C5@nat28.tlf.novell.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: David Vrabel <david.vrabel@citrix.com> Tested-by: Peter Moody <pmoody@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-02-03x86/Sandy Bridge: Sandy Bridge workaround depends on CONFIG_PCIH. Peter Anvin
commit e43b3cec711a61edf047adf6204d542f3a659ef8 upstream. early_pci_allowed() and read_pci_config_16() are only available if CONFIG_PCI is defined. Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Cc: Jesse Barnes <jbarnes@virtuousgeek.org> Signed-off-by: Abdallah Chatila <abdallah.chatila@ericsson.com>
2013-02-03x86, efi: Set runtime_version to the EFI spec revisionMatt Fleming
commit 712ba9e9afc4b3d3d6fa81565ca36fe518915c01 upstream. efi.runtime_version is erroneously being set to the value of the vendor's firmware revision instead of that of the implemented EFI specification. We can't deduce which EFI functions are available based on the revision of the vendor's firmware since the version scheme is likely to be unique to each vendor. What we really need to know is the revision of the implemented EFI specification, which is available in the EFI System Table header. Signed-off-by: Matt Fleming <matt.fleming@intel.com> Cc: Seiji Aguchi <seiji.aguchi@hds.com> Cc: Matthew Garrett <mjg59@srcf.ucam.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-02-03efi, x86: Pass a proper identity mapping in efi_call_phys_prelogNathan Zimmer
commit b8f2c21db390273c3eaf0e5308faeaeb1e233840 upstream. Update efi_call_phys_prelog to install an identity mapping of all available memory. This corrects a bug on very large systems with more then 512 GB in which bios would not be able to access addresses above not in the mapping. The result is a crash that looks much like this. BUG: unable to handle kernel paging request at 000000effd870020 IP: [<0000000078bce331>] 0x78bce330 PGD 0 Oops: 0000 [#1] SMP Modules linked in: CPU 0 Pid: 0, comm: swapper/0 Tainted: G W 3.8.0-rc1-next-20121224-medusa_ntz+ #2 Intel Corp. Stoutland Platform RIP: 0010:[<0000000078bce331>] [<0000000078bce331>] 0x78bce330 RSP: 0000:ffffffff81601d28 EFLAGS: 00010006 RAX: 0000000078b80e18 RBX: 0000000000000004 RCX: 0000000000000004 RDX: 0000000078bcf958 RSI: 0000000000002400 RDI: 8000000000000000 RBP: 0000000078bcf760 R08: 000000effd870000 R09: 0000000000000000 R10: 0000000000000000 R11: 00000000000000c3 R12: 0000000000000030 R13: 000000effd870000 R14: 0000000000000000 R15: ffff88effd870000 FS: 0000000000000000(0000) GS:ffff88effe400000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000000effd870020 CR3: 000000000160c000 CR4: 00000000000006b0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process swapper/0 (pid: 0, threadinfo ffffffff81600000, task ffffffff81614400) Stack: 0000000078b80d18 0000000000000004 0000000078bced7b ffff880078b81fff 0000000000000000 0000000000000082 0000000078bce3a8 0000000000002400 0000000060000202 0000000078b80da0 0000000078bce45d ffffffff8107cb5a Call Trace: [<ffffffff8107cb5a>] ? on_each_cpu+0x77/0x83 [<ffffffff8102f4eb>] ? change_page_attr_set_clr+0x32f/0x3ed [<ffffffff81035946>] ? efi_call4+0x46/0x80 [<ffffffff816c5abb>] ? efi_enter_virtual_mode+0x1f5/0x305 [<ffffffff816aeb24>] ? start_kernel+0x34a/0x3d2 [<ffffffff816ae5ed>] ? repair_env_string+0x60/0x60 [<ffffffff816ae2be>] ? x86_64_start_reservations+0xba/0xc1 [<ffffffff816ae120>] ? early_idt_handlers+0x120/0x120 [<ffffffff816ae419>] ? x86_64_start_kernel+0x154/0x163 Code: Bad RIP value. RIP [<0000000078bce331>] 0x78bce330 RSP <ffffffff81601d28> CR2: 000000effd870020 ---[ end trace ead828934fef5eab ]--- Signed-off-by: Nathan Zimmer <nzimmer@sgi.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Robin Holt <holt@sgi.com> Signed-off-by: Matt Fleming <matt.fleming@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-02-03x86/msr: Add capabilities checkAlan Cox
commit c903f0456bc69176912dee6dd25c6a66ee1aed00 upstream. At the moment the MSR driver only relies upon file system checks. This means that anything as root with any capability set can write to MSRs. Historically that wasn't very interesting but on modern processors the MSRs are such that writing to them provides several ways to execute arbitary code in kernel space. Sample code and documentation on doing this is circulating and MSR attacks are used on Windows 64bit rootkits already. In the Linux case you still need to be able to open the device file so the impact is fairly limited and reduces the security of some capability and security model based systems down towards that of a generic "root owns the box" setup. Therefore they should require CAP_SYS_RAWIO to prevent an elevation of capabilities. The impact of this is fairly minimal on most setups because they don't have heavy use of capabilities. Those using SELinux, SMACK or AppArmor rules might want to consider if their rulesets on the MSR driver could be tighter. Signed-off-by: Alan Cox <alan@linux.intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-01-21xen: Fix stack corruption in xen_failsafe_callback for 32bit PVOPS guests.Frediano Ziglio
commit 9174adbee4a9a49d0139f5d71969852b36720809 upstream. This fixes CVE-2013-0190 / XSA-40 There has been an error on the xen_failsafe_callback path for failed iret, which causes the stack pointer to be wrong when entering the iret_exc error path. This can result in the kernel crashing. In the classic kernel case, the relevant code looked a little like: popl %eax # Error code from hypervisor jz 5f addl $16,%esp jmp iret_exc # Hypervisor said iret fault 5: addl $16,%esp # Hypervisor said segment selector fault Here, there are two identical addls on either option of a branch which appears to have been optimised by hoisting it above the jz, and converting it to an lea, which leaves the flags register unaffected. In the PVOPS case, the code looks like: popl_cfi %eax # Error from the hypervisor lea 16(%esp),%esp # Add $16 before choosing fault path CFI_ADJUST_CFA_OFFSET -16 jz 5f addl $16,%esp # Incorrectly adjust %esp again jmp iret_exc It is possible unprivileged userspace applications to cause this behaviour, for example by loading an LDT code selector, then changing the code selector to be not-present. At this point, there is a race condition where it is possible for the hypervisor to return back to userspace from an interrupt, fault on its own iret, and inject a failsafe_callback into the kernel. This bug has been present since the introduction of Xen PVOPS support in commit 5ead97c84 (xen: Core Xen implementation), in 2.6.23. Signed-off-by: Frediano Ziglio <frediano.ziglio@citrix.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-01-21x86/Sandy Bridge: reserve pages when integrated graphics is presentJesse Barnes
commit a9acc5365dbda29f7be2884efb63771dc24bd815 upstream. SNB graphics devices have a bug that prevent them from accessing certain memory ranges, namely anything below 1M and in the pages listed in the table. So reserve those at boot if set detect a SNB gfx device on the CPU to avoid GPU hangs. Stephane Marchesin had a similar patch to the page allocator awhile back, but rather than reserving pages up front, it leaked them at allocation time. [ hpa: made a number of stylistic changes, marked arrays as static const, and made less verbose; use "memblock=debug" for full verbosity. ] Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Cc: CAI Qian <caiqian@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-01-17x86, amd: Disable way access filter on Piledriver CPUsAndre Przywara
commit 2bbf0a1427c377350f001fbc6260995334739ad7 upstream. The Way Access Filter in recent AMD CPUs may hurt the performance of some workloads, caused by aliasing issues in the L1 cache. This patch disables it on the affected CPUs. The issue is similar to that one of last year: http://lkml.indiana.edu/hypermail/linux/kernel/1107.3/00041.html This new patch does not replace the old one, we just need another quirk for newer CPUs. The performance penalty without the patch depends on the circumstances, but is a bit less than the last year's 3%. The workloads affected would be those that access code from the same physical page under different virtual addresses, so different processes using the same libraries with ASLR or multiple instances of PIE-binaries. The code needs to be accessed simultaneously from both cores of the same compute unit. More details can be found here: http://developer.amd.com/Assets/SharedL1InstructionCacheonAMD15hCPU.pdf CPUs affected are anything with the core known as Piledriver. That includes the new parts of the AMD A-Series (aka Trinity) and the just released new CPUs of the FX-Series (aka Vishera). The model numbering is a bit odd here: FX CPUs have model 2, A-Series has model 10h, with possible extensions to 1Fh. Hence the range of model ids. Signed-off-by: Andre Przywara <osp@andrep.de> Link: http://lkml.kernel.org/r/1351700450-9277-1-git-send-email-osp@andrep.de Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Signed-off-by: CAI Qian <caiqian@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-01-11x86, amd: Disable way access filter on Piledriver CPUsAndre Przywara
commit 2bbf0a1427c377350f001fbc6260995334739ad7 upstream. The Way Access Filter in recent AMD CPUs may hurt the performance of some workloads, caused by aliasing issues in the L1 cache. This patch disables it on the affected CPUs. The issue is similar to that one of last year: http://lkml.indiana.edu/hypermail/linux/kernel/1107.3/00041.html This new patch does not replace the old one, we just need another quirk for newer CPUs. The performance penalty without the patch depends on the circumstances, but is a bit less than the last year's 3%. The workloads affected would be those that access code from the same physical page under different virtual addresses, so different processes using the same libraries with ASLR or multiple instances of PIE-binaries. The code needs to be accessed simultaneously from both cores of the same compute unit. More details can be found here: http://developer.amd.com/Assets/SharedL1InstructionCacheonAMD15hCPU.pdf CPUs affected are anything with the core known as Piledriver. That includes the new parts of the AMD A-Series (aka Trinity) and the just released new CPUs of the FX-Series (aka Vishera). The model numbering is a bit odd here: FX CPUs have model 2, A-Series has model 10h, with possible extensions to 1Fh. Hence the range of model ids. Signed-off-by: Andre Przywara <osp@andrep.de> Link: http://lkml.kernel.org/r/1351700450-9277-1-git-send-email-osp@andrep.de Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Signed-off-by: CAI Qian <caiqian@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-12-17x86: hpet: Fix masking of MSI interruptsJan Beulich
commit 6acf5a8c931da9d26c8dd77d784daaf07fa2bff0 upstream. HPET_TN_FSB is not a proper mask bit; it merely toggles between MSI and legacy interrupt delivery. The proper mask bit is HPET_TN_ENABLE, so use both bits when (un)masking the interrupt. Signed-off-by: Jan Beulich <jbeulich@suse.com> Link: http://lkml.kernel.org/r/5093E09002000078000A60E6@nat28.tlf.novell.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-12-10x86, fpu: Avoid FPU lazy restore after suspendVincent Palatin
commit 644c154186386bb1fa6446bc5e037b9ed098db46 upstream. When a cpu enters S3 state, the FPU state is lost. After resuming for S3, if we try to lazy restore the FPU for a process running on the same CPU, this will result in a corrupted FPU context. Ensure that "fpu_owner_task" is properly invalided when (re-)initializing a CPU, so nobody will try to lazy restore a state which doesn't exist in the hardware. Tested with a 64-bit kernel on a 4-core Ivybridge CPU with eagerfpu=off, by doing thousands of suspend/resume cycles with 4 processes doing FPU operations running. Without the patch, a process is killed after a few hundreds cycles by a SIGFPE. Signed-off-by: Vincent Palatin <vpalatin@chromium.org> Cc: Duncan Laurie <dlaurie@chromium.org> Cc: Olof Johansson <olofj@chromium.org> Link: http://lkml.kernel.org/r/1354306532-1014-1-git-send-email-vpalatin@chromium.org Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-12-05x86-32: Export kernel_stack_pointer() for modulesH. Peter Anvin
commit cb57a2b4cff7edf2a4e32c0163200e9434807e0a upstream. Modules, in particular oprofile (and possibly other similar tools) need kernel_stack_pointer(), so export it using EXPORT_SYMBOL_GPL(). Cc: Yang Wei <wei.yang@windriver.com> Cc: Robert Richter <robert.richter@amd.com> Cc: Jun Zhang <jun.zhang@intel.com> Link: http://lkml.kernel.org/r/20120912135059.GZ8285@erda.amd.com Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Cc: Robert Richter <rric@kernel.org> Cc: Herton Ronaldo Krzesinski <herton.krzesinski@canonical.com> Cc: Philip Müller <philm@manjaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-12-03KVM: x86: invalid opcode oops on SET_SREGS with OSXSAVE bit set (CVE-2012-4461)Petr Matousek
commit 6d1068b3a98519247d8ba4ec85cd40ac136dbdf9 upstream. On hosts without the XSAVE support unprivileged local user can trigger oops similar to the one below by setting X86_CR4_OSXSAVE bit in guest cr4 register using KVM_SET_SREGS ioctl and later issuing KVM_RUN ioctl. invalid opcode: 0000 [#2] SMP Modules linked in: tun ip6table_filter ip6_tables ebtable_nat ebtables ... Pid: 24935, comm: zoog_kvm_monito Tainted: G D 3.2.0-3-686-pae EIP: 0060:[<f8b9550c>] EFLAGS: 00210246 CPU: 0 EIP is at kvm_arch_vcpu_ioctl_run+0x92a/0xd13 [kvm] EAX: 00000001 EBX: 000f387e ECX: 00000000 EDX: 00000000 ESI: 00000000 EDI: 00000000 EBP: ef5a0060 ESP: d7c63e70 DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 Process zoog_kvm_monito (pid: 24935, ti=d7c62000 task=ed84a0c0 task.ti=d7c62000) Stack: 00000001 f70a1200 f8b940a9 ef5a0060 00000000 00200202 f8769009 00000000 ef5a0060 000f387e eda5c020 8722f9c8 00015bae 00000000 ed84a0c0 ed84a0c0 c12bf02d 0000ae80 ef7f8740 fffffffb f359b740 ef5a0060 f8b85dc1 0000ae80 Call Trace: [<f8b940a9>] ? kvm_arch_vcpu_ioctl_set_sregs+0x2fe/0x308 [kvm] ... [<c12bfb44>] ? syscall_call+0x7/0xb Code: 89 e8 e8 14 ee ff ff ba 00 00 04 00 89 e8 e8 98 48 ff ff 85 c0 74 1e 83 7d 48 00 75 18 8b 85 08 07 00 00 31 c9 8b 95 0c 07 00 00 <0f> 01 d1 c7 45 48 01 00 00 00 c7 45 1c 01 00 00 00 0f ae f0 89 EIP: [<f8b9550c>] kvm_arch_vcpu_ioctl_run+0x92a/0xd13 [kvm] SS:ESP 0068:d7c63e70 QEMU first retrieves the supported features via KVM_GET_SUPPORTED_CPUID and then sets them later. So guest's X86_FEATURE_XSAVE should be masked out on hosts without X86_FEATURE_XSAVE, making kvm_set_cr4 with X86_CR4_OSXSAVE fail. Userspaces that allow specifying guest cpuid with X86_FEATURE_XSAVE even on hosts that do not support it, might be susceptible to this attack from inside the guest as well. Allow setting X86_CR4_OSXSAVE bit only if host has XSAVE support. Signed-off-by: Petr Matousek <pmatouse@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-12-03x86, microcode, AMD: Add support for family 16h processorsBoris Ostrovsky
commit 36c46ca4f322a7bf89aad5462a3a1f61713edce7 upstream. Add valid patch size for family 16h processors. [ hpa: promoting to urgent/stable since it is hw enabling and trivial ] Signed-off-by: Boris Ostrovsky <boris.ostrovsky@amd.com> Acked-by: Andreas Herrmann <herrmann.der.user@googlemail.com> Link: http://lkml.kernel.org/r/1353004910-2204-1-git-send-email-boris.ostrovsky@amd.com Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-12-03x86, efi: Fix processor-specific memcpy() build errorMatt Fleming
commit 0f905a43ce955b638139bd84486194770a6a2c08 upstream. Building for Athlon/Duron/K7 results in the following build error, arch/x86/boot/compressed/eboot.o: In function `__constant_memcpy3d': eboot.c:(.text+0x385): undefined reference to `_mmx_memcpy' arch/x86/boot/compressed/eboot.o: In function `efi_main': eboot.c:(.text+0x1a22): undefined reference to `_mmx_memcpy' because the boot stub code doesn't link with the kernel proper, and therefore doesn't have access to the 3DNow version of memcpy. So, follow the example of misc.c and #undef memcpy so that we use the version provided by misc.c. See https://bugzilla.kernel.org/show_bug.cgi?id=50391 Reported-by: Al Viro <viro@zeniv.linux.org.uk> Reported-by: Ryan Underwood <nemesis@icequake.net> Cc: H. Peter Anvin <hpa@zytor.com> Signed-off-by: Matt Fleming <matt.fleming@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-12-03x86-32: Fix invalid stack address while in softirqRobert Richter
commit 1022623842cb72ee4d0dbf02f6937f38c92c3f41 upstream. In 32 bit the stack address provided by kernel_stack_pointer() may point to an invalid range causing NULL pointer access or page faults while in NMI (see trace below). This happens if called in softirq context and if the stack is empty. The address at &regs->sp is then out of range. Fixing this by checking if regs and &regs->sp are in the same stack context. Otherwise return the previous stack pointer stored in struct thread_info. If that address is invalid too, return address of regs. BUG: unable to handle kernel NULL pointer dereference at 0000000a IP: [<c1004237>] print_context_stack+0x6e/0x8d *pde = 00000000 Oops: 0000 [#1] SMP Modules linked in: Pid: 4434, comm: perl Not tainted 3.6.0-rc3-oprofile-i386-standard-g4411a05 #4 Hewlett-Packard HP xw9400 Workstation/0A1Ch EIP: 0060:[<c1004237>] EFLAGS: 00010093 CPU: 0 EIP is at print_context_stack+0x6e/0x8d EAX: ffffe000 EBX: 0000000a ECX: f4435f94 EDX: 0000000a ESI: f4435f94 EDI: f4435f94 EBP: f5409ec0 ESP: f5409ea0 DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068 CR0: 8005003b CR2: 0000000a CR3: 34ac9000 CR4: 000007d0 DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000 DR6: ffff0ff0 DR7: 00000400 Process perl (pid: 4434, ti=f5408000 task=f5637850 task.ti=f4434000) Stack: 000003e8 ffffe000 00001ffc f4e39b00 00000000 0000000a f4435f94 c155198c f5409ef0 c1003723 c155198c f5409f04 00000000 f5409edc 00000000 00000000 f5409ee8 f4435f94 f5409fc4 00000001 f5409f1c c12dce1c 00000000 c155198c Call Trace: [<c1003723>] dump_trace+0x7b/0xa1 [<c12dce1c>] x86_backtrace+0x40/0x88 [<c12db712>] ? oprofile_add_sample+0x56/0x84 [<c12db731>] oprofile_add_sample+0x75/0x84 [<c12ddb5b>] op_amd_check_ctrs+0x46/0x260 [<c12dd40d>] profile_exceptions_notify+0x23/0x4c [<c1395034>] nmi_handle+0x31/0x4a [<c1029dc5>] ? ftrace_define_fields_irq_handler_entry+0x45/0x45 [<c13950ed>] do_nmi+0xa0/0x2ff [<c1029dc5>] ? ftrace_define_fields_irq_handler_entry+0x45/0x45 [<c13949e5>] nmi_stack_correct+0x28/0x2d [<c1029dc5>] ? ftrace_define_fields_irq_handler_entry+0x45/0x45 [<c1003603>] ? do_softirq+0x4b/0x7f <IRQ> [<c102a06f>] irq_exit+0x35/0x5b [<c1018f56>] smp_apic_timer_interrupt+0x6c/0x7a [<c1394746>] apic_timer_interrupt+0x2a/0x30 Code: 89 fe eb 08 31 c9 8b 45 0c ff 55 ec 83 c3 04 83 7d 10 00 74 0c 3b 5d 10 73 26 3b 5d e4 73 0c eb 1f 3b 5d f0 76 1a 3b 5d e8 73 15 <8b> 13 89 d0 89 55 e0 e8 ad 42 03 00 85 c0 8b 55 e0 75 a6 eb cc EIP: [<c1004237>] print_context_stack+0x6e/0x8d SS:ESP 0068:f5409ea0 CR2: 000000000000000a ---[ end trace 62afee3481b00012 ]--- Kernel panic - not syncing: Fatal exception in interrupt V2: * add comments to kernel_stack_pointer() * always return a valid stack address by falling back to the address of regs Reported-by: Yang Wei <wei.yang@windriver.com> Signed-off-by: Robert Richter <robert.richter@amd.com> Link: http://lkml.kernel.org/r/20120912135059.GZ8285@erda.amd.com Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Cc: Jun Zhang <jun.zhang@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-11-17xen/mmu: Use Xen specific TLB flush instead of the generic one.Konrad Rzeszutek Wilk
commit 95a7d76897c1e7243d4137037c66d15cbf2cce76 upstream. As Mukesh explained it, the MMUEXT_TLB_FLUSH_ALL allows the hypervisor to do a TLB flush on all active vCPUs. If instead we were using the generic one (which ends up being xen_flush_tlb) we end up making the MMUEXT_TLB_FLUSH_LOCAL hypercall. But before we make that hypercall the kernel will IPI all of the vCPUs (even those that were asleep from the hypervisor perspective). The end result is that we needlessly wake them up and do a TLB flush when we can just let the hypervisor do it correctly. This patch gives around 50% speed improvement when migrating idle guest's from one host to another. Oracle-bug: 14630170 Tested-by: Jingjie Jiang <jingjie.jiang@oracle.com> Suggested-by: Mukesh Rathor <mukesh.rathor@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-10-31x86, mm: Use memblock memory loop instead of e820_RAMYinghai Lu
commit 1f2ff682ac951ed82cc043cf140d2851084512df upstream. We need to handle E820_RAM and E820_RESERVED_KERNEL at the same time. Also memblock has page aligned range for ram, so we could avoid mapping partial pages. Signed-off-by: Yinghai Lu <yinghai@kernel.org> Link: http://lkml.kernel.org/r/CAE9FiQVZirvaBMFYRfXMmWEcHbKSicQEHz4VAwUv0xFCk51ZNw@mail.gmail.com Acked-by: Jacob Shin <jacob.shin@amd.com> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-10-31x86: efi: Turn off efi_enabled after setup on mixed fw/kernelOlof Johansson
commit 5189c2a7c7769ee9d037d76c1a7b8550ccf3481c upstream. When 32-bit EFI is used with 64-bit kernel (or vice versa), turn off efi_enabled once setup is done. Beyond setup, it is normally used to determine if runtime services are available and we will have none. This will resolve issues stemming from efivars modprobe panicking on a 32/64-bit setup, as well as some reboot issues on similar setups. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=45991 Reported-by: Marko Kohtala <marko.kohtala@gmail.com> Reported-by: Maxim Kammerer <mk@dee.su> Signed-off-by: Olof Johansson <olof@lixom.net> Acked-by: Maarten Lankhorst <maarten.lankhorst@canonical.com> Cc: Matthew Garrett <mjg@redhat.com> Signed-off-by: Matt Fleming <matt.fleming@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-10-31efi: Defer freeing boot services memory until after ACPI initJosh Triplett
commit 785107923a83d8456bbd8564e288a24d84109a46 upstream. Some new ACPI 5.0 tables reference resources stored in boot services memory, so keep that memory around until we have ACPI and can extract data from it. Signed-off-by: Josh Triplett <josh@joshtriplett.org> Link: http://lkml.kernel.org/r/baaa6d44bdc4eb0c58e5d1b4ccd2c729f854ac55.1348876882.git.josh@joshtriplett.org Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Cc: Matt Fleming <matt@console-pimps.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-10-31x86, mm: Undo incorrect revert in arch/x86/mm/init.cYinghai Lu
commit f82f64dd9f485e13f29f369772d4a0e868e5633a upstream. Commit 844ab6f9 x86, mm: Find_early_table_space based on ranges that are actually being mapped added back some lines back wrongly that has been removed in commit 7b16bbf97 Revert "x86/mm: Fix the size calculation of mapping tables" remove them again. Signed-off-by: Yinghai Lu <yinghai@kernel.org> Link: http://lkml.kernel.org/r/CAE9FiQW_vuaYQbmagVnxT2DGsYc=9tNeAbdBq53sYkitPOwxSQ@mail.gmail.com Acked-by: Jacob Shin <jacob.shin@amd.com> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-10-31x86, mm: Find_early_table_space based on ranges that are actually being mappedJacob Shin
commit 844ab6f993b1d32eb40512503d35ff6ad0c57030 upstream. Current logic finds enough space for direct mapping page tables from 0 to end. Instead, we only need to find enough space to cover mr[0].start to mr[nr_range].end -- the range that is actually being mapped by init_memory_mapping() This is needed after 1bbbbe779aabe1f0768c2bf8f8c0a5583679b54a, to address the panic reported here: https://lkml.org/lkml/2012/10/20/160 https://lkml.org/lkml/2012/10/21/157 Signed-off-by: Jacob Shin <jacob.shin@amd.com> Link: http://lkml.kernel.org/r/20121024195311.GB11779@jshin-Toonie Tested-by: Tom Rini <trini@ti.com> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-10-31x86, mm: Trim memory in memblock to be page alignedYinghai Lu
commit 6ede1fd3cb404c0016de6ac529df46d561bd558b upstream. We will not map partial pages, so need to make sure memblock allocation will not allocate those bytes out. Also we will use for_each_mem_pfn_range() to loop to map memory range to keep them consistent. Signed-off-by: Yinghai Lu <yinghai@kernel.org> Link: http://lkml.kernel.org/r/CAE9FiQVZirvaBMFYRfXMmWEcHbKSicQEHz4VAwUv0xFCk51ZNw@mail.gmail.com Acked-by: Jacob Shin <jacob.shin@amd.com> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-10-28xen/x86: don't corrupt %eip when returning from a signal handlerDavid Vrabel
commit a349e23d1cf746f8bdc603dcc61fae9ee4a695f6 upstream. In 32 bit guests, if a userspace process has %eax == -ERESTARTSYS (-512) or -ERESTARTNOINTR (-513) when it is interrupted by an event /and/ the process has a pending signal then %eip (and %eax) are corrupted when returning to the main process after handling the signal. The application may then crash with SIGSEGV or a SIGILL or it may have subtly incorrect behaviour (depending on what instruction it returned to). The occurs because handle_signal() is incorrectly thinking that there is a system call that needs to restarted so it adjusts %eip and %eax to re-execute the system call instruction (even though user space had not done a system call). If %eax == -514 (-ERESTARTNOHAND (-514) or -ERESTART_RESTARTBLOCK (-516) then handle_signal() only corrupted %eax (by setting it to -EINTR). This may cause the application to crash or have incorrect behaviour. handle_signal() assumes that regs->orig_ax >= 0 means a system call so any kernel entry point that is not for a system call must push a negative value for orig_ax. For example, for physical interrupts on bare metal the inverse of the vector is pushed and page_fault() sets regs->orig_ax to -1, overwriting the hardware provided error code. xen_hypervisor_callback() was incorrectly pushing 0 for orig_ax instead of -1. Classic Xen kernels pushed %eax which works as %eax cannot be both non-negative and -RESTARTSYS (etc.), but using -1 is consistent with other non-system call entry points and avoids some of the tests in handle_signal(). There were similar bugs in xen_failsafe_callback() of both 32 and 64-bit guests. If the fault was corrected and the normal return path was used then 0 was incorrectly pushed as the value for orig_ax. Signed-off-by: David Vrabel <david.vrabel@citrix.com> Acked-by: Jan Beulich <JBeulich@suse.com> Acked-by: Ian Campbell <ian.campbell@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-10-28x86: Exclude E820_RESERVED regions and memory holes above 4 GB from direct ↵Jacob Shin
mapping. commit 1bbbbe779aabe1f0768c2bf8f8c0a5583679b54a upstream. On systems with very large memory (1 TB in our case), BIOS may report a reserved region or a hole in the E820 map, even above the 4 GB range. Exclude these from the direct mapping. [ hpa: this should be done not just for > 4 GB but for everything above the legacy region (1 MB), at the very least. That, however, turns out to require significant restructuring. That work is well underway, but is not suitable for rc/stable. ] Signed-off-by: Jacob Shin <jacob.shin@amd.com> Link: http://lkml.kernel.org/r/1319145326-13902-1-git-send-email-jacob.shin@amd.com Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-10-28oprofile, x86: Fix wrapping bug in op_x86_get_ctrl()Dan Carpenter
commit 44009105081b51417f311f4c3be0061870b6b8ed upstream. The "event" variable is a u16 so the shift will always wrap to zero making the line a no-op. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Robert Richter <robert.richter@amd.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-10-21xen/bootup: allow read_tscp call for Xen PV guests.Konrad Rzeszutek Wilk
commit cd0608e71e9757f4dae35bcfb4e88f4d1a03a8ab upstream. The hypervisor will trap it. However without this patch, we would crash as the .read_tscp is set to NULL. This patch fixes it and sets it to the native_read_tscp call. Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-10-21xen/bootup: allow {read|write}_cr8 pvops call.Konrad Rzeszutek Wilk
commit 1a7bbda5b1ab0e02622761305a32dc38735b90b2 upstream. We actually do not do anything about it. Just return a default value of zero and if the kernel tries to write anything but 0 we BUG_ON. This fixes the case when an user tries to suspend the machine and it blows up in save_processor_state b/c 'read_cr8' is set to NULL and we get: kernel BUG at /home/konrad/ssd/linux/arch/x86/include/asm/paravirt.h:100! invalid opcode: 0000 [#1] SMP Pid: 2687, comm: init.late Tainted: G O 3.6.0upstream-00002-gac264ac-dirty #4 Bochs Bochs RIP: e030:[<ffffffff814d5f42>] [<ffffffff814d5f42>] save_processor_state+0x212/0x270 .. snip.. Call Trace: [<ffffffff810733bf>] do_suspend_lowlevel+0xf/0xac [<ffffffff8107330c>] ? x86_acpi_suspend_lowlevel+0x10c/0x150 [<ffffffff81342ee2>] acpi_suspend_enter+0x57/0xd5 Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-10-13efi: initialize efi.runtime_version to make ↵Seiji Aguchi
query_variable_info/update_capsule workable commit d6cf86d8f23253225fe2a763d627ecf7dfee9dae upstream. A value of efi.runtime_version is checked before calling update_capsule()/query_variable_info() as follows. But it isn't initialized anywhere. <snip> static efi_status_t virt_efi_query_variable_info(u32 attr, u64 *storage_space, u64 *remaining_space, u64 *max_variable_size) { if (efi.runtime_version < EFI_2_00_SYSTEM_TABLE_REVISION) return EFI_UNSUPPORTED; <snip> This patch initializes a value of efi.runtime_version at boot time. Signed-off-by: Seiji Aguchi <seiji.aguchi@hds.com> Acked-by: Matthew Garrett <mjg@redhat.com> Signed-off-by: Matt Fleming <matt.fleming@intel.com> Signed-off-by: Ivan Hu <ivan.hu@canonical.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-10-13efi: Build EFI stub with EFI-appropriate optionsMatthew Garrett
commit 9dead5bbb825d7c25c0400e61de83075046322d0 upstream. We can't assume the presence of the red zone while we're still in a boot services environment, so we should build with -fno-red-zone to avoid problems. Change the size of wchar at the same time to make string handling simpler. Signed-off-by: Matthew Garrett <mjg@redhat.com> Signed-off-by: Matt Fleming <matt.fleming@intel.com> Acked-by: Josh Boyer <jwboyer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-10-13mm: thp: fix pmd_present for split_huge_page and PROT_NONE with THPAndrea Arcangeli
commit 027ef6c87853b0a9df53175063028edb4950d476 upstream. In many places !pmd_present has been converted to pmd_none. For pmds that's equivalent and pmd_none is quicker so using pmd_none is better. However (unless we delete pmd_present) we should provide an accurate pmd_present too. This will avoid the risk of code thinking the pmd is non present because it's under __split_huge_page_map, see the pmd_mknotpresent there and the comment above it. If the page has been mprotected as PROT_NONE, it would also lead to a pmd_present false negative in the same way as the race with split_huge_page. Because the PSE bit stays on at all times (both during split_huge_page and when the _PAGE_PROTNONE bit get set), we could only check for the PSE bit, but checking the PROTNONE bit too is still good to remember pmd_present must always keep PROT_NONE into account. This explains a not reproducible BUG_ON that was seldom reported on the lists. The same issue is in pmd_large, it would go wrong with both PROT_NONE and if it races with split_huge_page. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <jweiner@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-10-13kbuild: Fix gcc -x syntaxJean Delvare
commit b1e0d8b70fa31821ebca3965f2ef8619d7c5e316 upstream. The correct syntax for gcc -x is "gcc -x assembler", not "gcc -xassembler". Even though the latter happens to work, the former is what is documented in the manual page and thus what gcc wrappers such as icecream do expect. This isn't a cosmetic change. The missing space prevents icecream from recognizing compilation tasks it can't handle, leading to silent kernel miscompilations. Besides me, credits go to Michael Matz and Dirk Mueller for investigating the miscompilation issue and tracking it down to this incorrect -x parameter syntax. Signed-off-by: Jean Delvare <jdelvare@suse.de> Acked-by: Ingo Molnar <mingo@kernel.org> Cc: Bernhard Walle <bernhard@bwalle.de> Cc: Michal Marek <mmarek@suse.cz> Cc: Ralf Baechle <ralf@linux-mips.org> Signed-off-by: Michal Marek <mmarek@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-10-07x86/alternatives: Fix p6 nops on non-modular kernelsAvi Kivity
commit cb09cad44f07044d9810f18f6f9a6a6f3771f979 upstream. Probably a leftover from the early days of self-patching, p6nops are marked __initconst_or_module, which causes them to be discarded in a non-modular kernel. If something later triggers patching, it will overwrite kernel code with garbage. Reported-by: Tomas Racek <tracek@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com> Cc: Michael Tokarev <mjt@tls.msk.ru> Cc: Borislav Petkov <borislav.petkov@amd.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: qemu-devel@nongnu.org Cc: Anthony Liguori <anthony@codemonkey.ws> Cc: H. Peter Anvin <hpa@linux.intel.com> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Cc: Alan Cox <alan@linux.intel.com> Link: http://lkml.kernel.org/r/5034AE84.90708@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Ben Jencks <ben@bjencks.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-10-02x86: Fix boot on Twinhead H12YAlan Cox
commit 80b3e557371205566a71e569fbfcce5b11f92dbe upstream. Despite lots of investigation into why this is needed we don't know or have an elegant cure. The only answer found on this laptop is to mark a problem region as used so that Linux doesn't put anything there. Currently all the users add reserve= command lines and anyone not knowing this needs to find the magic page that documents it. Automate it instead. Signed-off-by: Alan Cox <alan@linux.intel.com> Tested-and-bugfixed-by: Arne Fitzenreiter <arne@fitzenreiter.de> Resolves-bug: https://bugzilla.kernel.org/show_bug.cgi?id=10231 Link: http://lkml.kernel.org/r/20120515174347.5109.94551.stgit@bluebook Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-10-02xen/boot: Disable NUMA for PV guests.Konrad Rzeszutek Wilk
commit 8d54db795dfb1049d45dc34f0dddbc5347ec5642 upstream. The hypervisor is in charge of allocating the proper "NUMA" memory and dealing with the CPU scheduler to keep them bound to the proper NUMA node. The PV guests (and PVHVM) have no inkling of where they run and do not need to know that right now. In the future we will need to inject NUMA configuration data (if a guest spans two or more NUMA nodes) so that the kernel can make the right choices. But those patches are not yet present. In the meantime, disable the NUMA capability in the PV guest, which also fixes a bootup issue. Andre says: "we see Dom0 crashes due to the kernel detecting the NUMA topology not by ACPI, but directly from the northbridge (CONFIG_AMD_NUMA). This will detect the actual NUMA config of the physical machine, but will crash about the mismatch with Dom0's virtual memory. Variation of the theme: Dom0 sees what it's not supposed to see. This happens with the said config option enabled and on a machine where this scanning is still enabled (K8 and Fam10h, not Bulldozer class) We have this dump then: NUMA: Warning: node ids are out of bound, from=-1 to=-1 distance=10 Scanning NUMA topology in Northbridge 24 Number of physical nodes 4 Node 0 MemBase 0000000000000000 Limit 0000000040000000 Node 1 MemBase 0000000040000000 Limit 0000000138000000 Node 2 MemBase 0000000138000000 Limit 00000001f8000000 Node 3 MemBase 00000001f8000000 Limit 0000000238000000 Initmem setup node 0 0000000000000000-0000000040000000 NODE_DATA [000000003ffd9000 - 000000003fffffff] Initmem setup node 1 0000000040000000-0000000138000000 NODE_DATA [0000000137fd9000 - 0000000137ffffff] Initmem setup node 2 0000000138000000-00000001f8000000 NODE_DATA [00000001f095e000 - 00000001f0984fff] Initmem setup node 3 00000001f8000000-0000000238000000 Cannot find 159744 bytes in node 3 BUG: unable to handle kernel NULL pointer dereference at (null) IP: [<ffffffff81d220e6>] __alloc_bootmem_node+0x43/0x96 Pid: 0, comm: swapper Not tainted 3.3.6 #1 AMD Dinar/Dinar RIP: e030:[<ffffffff81d220e6>] [<ffffffff81d220e6>] __alloc_bootmem_node+0x43/0x96 .. snip.. [<ffffffff81d23024>] sparse_early_usemaps_alloc_node+0x64/0x178 [<ffffffff81d23348>] sparse_init+0xe4/0x25a [<ffffffff81d16840>] paging_init+0x13/0x22 [<ffffffff81d07fbb>] setup_arch+0x9c6/0xa9b [<ffffffff81683954>] ? printk+0x3c/0x3e [<ffffffff81d01a38>] start_kernel+0xe5/0x468 [<ffffffff81d012cf>] x86_64_start_reservations+0xba/0xc1 [<ffffffff81007153>] ? xen_setup_runstate_info+0x2c/0x36 [<ffffffff81d050ee>] xen_start_kernel+0x565/0x56c " so we just disable NUMA scanning by setting numa_off=1. Reported-and-Tested-by: Andre Przywara <andre.przywara@amd.com> Acked-by: Andre Przywara <andre.przywara@amd.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-10-02xen/m2p: do not reuse kmap_op->dev_bus_addrStefano Stabellini
commit 2fc136eecd0c647a6b13fcd00d0c41a1a28f35a5 upstream. If the caller passes a valid kmap_op to m2p_add_override, we use kmap_op->dev_bus_addr to store the original mfn, but dev_bus_addr is part of the interface with Xen and if we are batching the hypercalls it might not have been written by the hypervisor yet. That means that later on Xen will write to it and we'll think that the original mfn is actually what Xen has written to it. Rather than "stealing" struct members from kmap_op, keep using page->index to store the original mfn and add another parameter to m2p_remove_override to get the corresponding kmap_op instead. It is now responsibility of the caller to keep track of which kmap_op corresponds to a particular page in the m2p_override (gntdev, the only user of this interface that passes a valid kmap_op, is already doing that). Reported-and-Tested-By: Sander Eikelenboom <linux@eikelenboom.it> Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-09-14x86, microcode, AMD: Fix broken ucode patch size checkAndreas Herrmann
commit 36bf50d7697be18c6bfd0401e037df10bff1e573 upstream. This issue was recently observed on an AMD C-50 CPU where a patch of maximum size was applied. Commit be62adb49294 ("x86, microcode, AMD: Simplify ucode verification") added current_size in get_matching_microcode(). This is calculated as size of the ucode patch + 8 (ie. size of the header). Later this is compared against the maximum possible ucode patch size for a CPU family. And of course this fails if the patch has already maximum size. Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com> Signed-off-by: Borislav Petkov <borislav.petkov@amd.com> Link: http://lkml.kernel.org/r/1344361461-10076-1-git-send-email-bp@amd64.org Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>