From a706d965dcfdff73bf2bad1c300f8119900714c7 Mon Sep 17 00:00:00 2001 From: David Ahern Date: Fri, 28 Dec 2012 19:56:07 -0700 Subject: perf x86: revert 20b279 - require exclude_guest to use PEBS - kernel side This patch is brought to you by the letter 'H'. Commit 20b279 breaks compatiblity with older perf binaries when run with precise modifier (:p or :pp) by requiring the exclude_guest attribute to be set. Older binaries default exclude_guest to 0 (ie., wanting guest-based samples) unless host only profiling is requested (:H modifier). The workaround for older binaries is to add H to the modifier list (e.g., -e cycles:ppH - toggles exclude_guest to 1). This was deemed unacceptable by Linus: https://lkml.org/lkml/2012/12/12/570 Between family in town and the fresh snow in Breckenridge there is no time left to be working on the proper fix for this over the holidays. In the New Year I have more pressing problems to resolve -- like some memory leaks in perf which are proving to be elusive -- although the aforementioned snow is probably why they are proving to be elusive. Either way I do not have any spare time to work on this and from the time I have managed to spend on it the solution is more difficult than just moving to a new exclude_guest flag (does not work) or flipping the logic to include_guest (which is not as trivial as one would think). So, two options: silently force exclude_guest on as suggested by Gleb which means no impact to older perf binaries or revert the original patch which caused the breakage. This patch does the latter -- reverts the original patch that introduced the regression. The problem can be revisited in the future as time allows. Signed-off-by: David Ahern Cc: Avi Kivity Cc: Gleb Natapov Cc: Ingo Molnar Cc: Jiri Olsa Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Robert Richter Link: http://lkml.kernel.org/r/1356749767-17322-1-git-send-email-dsahern@gmail.com Signed-off-by: Arnaldo Carvalho de Melo --- arch/x86/kernel/cpu/perf_event.c | 6 ------ 1 file changed, 6 deletions(-) (limited to 'arch/x86/kernel') diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c index 4428fd178bce..6774c17a5576 100644 --- a/arch/x86/kernel/cpu/perf_event.c +++ b/arch/x86/kernel/cpu/perf_event.c @@ -340,9 +340,6 @@ int x86_setup_perfctr(struct perf_event *event) /* BTS is currently only allowed for user-mode. */ if (!attr->exclude_kernel) return -EOPNOTSUPP; - - if (!attr->exclude_guest) - return -EOPNOTSUPP; } hwc->config |= config; @@ -385,9 +382,6 @@ int x86_pmu_hw_config(struct perf_event *event) if (event->attr.precise_ip) { int precise = 0; - if (!event->attr.exclude_guest) - return -EOPNOTSUPP; - /* Support for constant skid */ if (x86_pmu.pebs_active && !x86_pmu.pebs_broken) { precise++; -- cgit v1.2.3 From 9174adbee4a9a49d0139f5d71969852b36720809 Mon Sep 17 00:00:00 2001 From: Andrew Cooper Date: Wed, 16 Jan 2013 12:00:55 +0000 Subject: xen: Fix stack corruption in xen_failsafe_callback for 32bit PVOPS guests. This fixes CVE-2013-0190 / XSA-40 There has been an error on the xen_failsafe_callback path for failed iret, which causes the stack pointer to be wrong when entering the iret_exc error path. This can result in the kernel crashing. In the classic kernel case, the relevant code looked a little like: popl %eax # Error code from hypervisor jz 5f addl $16,%esp jmp iret_exc # Hypervisor said iret fault 5: addl $16,%esp # Hypervisor said segment selector fault Here, there are two identical addls on either option of a branch which appears to have been optimised by hoisting it above the jz, and converting it to an lea, which leaves the flags register unaffected. In the PVOPS case, the code looks like: popl_cfi %eax # Error from the hypervisor lea 16(%esp),%esp # Add $16 before choosing fault path CFI_ADJUST_CFA_OFFSET -16 jz 5f addl $16,%esp # Incorrectly adjust %esp again jmp iret_exc It is possible unprivileged userspace applications to cause this behaviour, for example by loading an LDT code selector, then changing the code selector to be not-present. At this point, there is a race condition where it is possible for the hypervisor to return back to userspace from an interrupt, fault on its own iret, and inject a failsafe_callback into the kernel. This bug has been present since the introduction of Xen PVOPS support in commit 5ead97c84 (xen: Core Xen implementation), in 2.6.23. Signed-off-by: Frediano Ziglio Signed-off-by: Andrew Cooper Cc: stable@vger.kernel.org Signed-off-by: Konrad Rzeszutek Wilk --- arch/x86/kernel/entry_32.S | 1 - 1 file changed, 1 deletion(-) (limited to 'arch/x86/kernel') diff --git a/arch/x86/kernel/entry_32.S b/arch/x86/kernel/entry_32.S index 88b725aa1d52..cf8639b4dcf3 100644 --- a/arch/x86/kernel/entry_32.S +++ b/arch/x86/kernel/entry_32.S @@ -1084,7 +1084,6 @@ ENTRY(xen_failsafe_callback) lea 16(%esp),%esp CFI_ADJUST_CFA_OFFSET -16 jz 5f - addl $16,%esp jmp iret_exc 5: pushl_cfi $-1 /* orig_ax = -1 => not a system call */ SAVE_ALL -- cgit v1.2.3 From 021ef050fc092d5638e69868d126c18006ea7296 Mon Sep 17 00:00:00 2001 From: "H. Peter Anvin" Date: Sat, 19 Jan 2013 10:29:37 -0800 Subject: x86-32: Start out cr0 clean, disable paging before modifying cr3/4 Patch 5a5a51db78e x86-32: Start out eflags and cr4 clean ... made x86-32 match x86-64 in that we initialize %eflags and %cr4 from scratch. This broke OLPC XO-1.5, because the XO enters the kernel with paging enabled, which the kernel doesn't expect. Since we no longer support 386 (the source of most of the variability in %cr0 configuration), we can simply match further x86-64 and initialize %cr0 to a fixed value -- the one variable part remaining in %cr0 is for FPU control, but all that is handled later on in initialization; in particular, configuring %cr0 as if the FPU is present until proven otherwise is correct and necessary for the probe to work. To deal with the XO case sanely, explicitly disable paging in %cr0 before we muck with %cr3, %cr4 or EFER -- those operations are inherently unsafe with paging enabled. NOTE: There is still a lot of 386-related junk in head_32.S which we can and should get rid of, however, this is intended as a minimal fix whereas the cleanup can be deferred to the next merge window. Reported-by: Andres Salomon Tested-by: Daniel Drake Link: http://lkml.kernel.org/r/50FA0661.2060400@linux.intel.com Signed-off-by: H. Peter Anvin --- arch/x86/kernel/head_32.S | 9 +++++++-- 1 file changed, 7 insertions(+), 2 deletions(-) (limited to 'arch/x86/kernel') diff --git a/arch/x86/kernel/head_32.S b/arch/x86/kernel/head_32.S index 8e7f6556028f..c8932c79e78b 100644 --- a/arch/x86/kernel/head_32.S +++ b/arch/x86/kernel/head_32.S @@ -300,6 +300,12 @@ ENTRY(startup_32_smp) leal -__PAGE_OFFSET(%ecx),%esp default_entry: +#define CR0_STATE (X86_CR0_PE | X86_CR0_MP | X86_CR0_ET | \ + X86_CR0_NE | X86_CR0_WP | X86_CR0_AM | \ + X86_CR0_PG) + movl $(CR0_STATE & ~X86_CR0_PG),%eax + movl %eax,%cr0 + /* * New page tables may be in 4Mbyte page mode and may * be using the global pages. @@ -364,8 +370,7 @@ default_entry: */ movl $pa(initial_page_table), %eax movl %eax,%cr3 /* set the page table pointer.. */ - movl %cr0,%eax - orl $X86_CR0_PG,%eax + movl $CR0_STATE,%eax movl %eax,%cr0 /* ..and set paging (PG) bit */ ljmp $__BOOT_CS,$1f /* Clear prefetch and normalize %eip */ 1: -- cgit v1.2.3 From 9899d11f654474d2d54ea52ceaa2a1f4db3abd68 Mon Sep 17 00:00:00 2001 From: Oleg Nesterov Date: Mon, 21 Jan 2013 20:48:00 +0100 Subject: ptrace: ensure arch_ptrace/ptrace_request can never race with SIGKILL putreg() assumes that the tracee is not running and pt_regs_access() can safely play with its stack. However a killed tracee can return from ptrace_stop() to the low-level asm code and do RESTORE_REST, this means that debugger can actually read/modify the kernel stack until the tracee does SAVE_REST again. set_task_blockstep() can race with SIGKILL too and in some sense this race is even worse, the very fact the tracee can be woken up breaks the logic. As Linus suggested we can clear TASK_WAKEKILL around the arch_ptrace() call, this ensures that nobody can ever wakeup the tracee while the debugger looks at it. Not only this fixes the mentioned problems, we can do some cleanups/simplifications in arch_ptrace() paths. Probably ptrace_unfreeze_traced() needs more callers, for example it makes sense to make the tracee killable for oom-killer before access_process_vm(). While at it, add the comment into may_ptrace_stop() to explain why ptrace_stop() still can't rely on SIGKILL and signal_pending_state(). Reported-by: Salman Qazi Reported-by: Suleiman Souhlal Suggested-by: Linus Torvalds Signed-off-by: Oleg Nesterov Signed-off-by: Linus Torvalds --- arch/x86/kernel/step.c | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-) (limited to 'arch/x86/kernel') diff --git a/arch/x86/kernel/step.c b/arch/x86/kernel/step.c index cd3b2438a980..9b4d51d0c0d0 100644 --- a/arch/x86/kernel/step.c +++ b/arch/x86/kernel/step.c @@ -165,10 +165,11 @@ void set_task_blockstep(struct task_struct *task, bool on) * Ensure irq/preemption can't change debugctl in between. * Note also that both TIF_BLOCKSTEP and debugctl should * be changed atomically wrt preemption. - * FIXME: this means that set/clear TIF_BLOCKSTEP is simply - * wrong if task != current, SIGKILL can wakeup the stopped - * tracee and set/clear can play with the running task, this - * can confuse the next __switch_to_xtra(). + * + * NOTE: this means that set/clear TIF_BLOCKSTEP is only safe if + * task is current or it can't be running, otherwise we can race + * with __switch_to_xtra(). We rely on ptrace_freeze_traced() but + * PTRACE_KILL is not safe. */ local_irq_disable(); debugctl = get_debugctlmsr(); -- cgit v1.2.3 From 444723dccc3c855fe88ea138cdec46f30e707b74 Mon Sep 17 00:00:00 2001 From: Jan Beulich Date: Thu, 24 Jan 2013 09:27:31 +0000 Subject: x86-64: Fix unwind annotations in recent NMI changes While in one case a plain annotation is necessary, in the other case the stack adjustment can simply be folded into the immediately preceding RESTORE_ALL, thus getting the correct annotation for free. Signed-off-by: Jan Beulich Cc: Steven Rostedt Cc: Linus Torvalds Cc: Alexander van Heukelum Link: http://lkml.kernel.org/r/51010C9302000078000B9045@nat28.tlf.novell.com Signed-off-by: Ingo Molnar --- arch/x86/kernel/entry_64.S | 7 +++---- 1 file changed, 3 insertions(+), 4 deletions(-) (limited to 'arch/x86/kernel') diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S index 07a7a04529bc..cb3c591339aa 100644 --- a/arch/x86/kernel/entry_64.S +++ b/arch/x86/kernel/entry_64.S @@ -1781,6 +1781,7 @@ first_nmi: * Leave room for the "copied" frame */ subq $(5*8), %rsp + CFI_ADJUST_CFA_OFFSET 5*8 /* Copy the stack frame to the Saved frame */ .rept 5 @@ -1863,10 +1864,8 @@ end_repeat_nmi: nmi_swapgs: SWAPGS_UNSAFE_STACK nmi_restore: - RESTORE_ALL 8 - - /* Pop the extra iret frame */ - addq $(5*8), %rsp + /* Pop the extra iret frame at once */ + RESTORE_ALL 6*8 /* Clear the NMI executing stack variable */ movq $0, 5*8(%rsp) -- cgit v1.2.3 From 73b664ceb5f815c38def1c68912b83f83455e9eb Mon Sep 17 00:00:00 2001 From: Maarten Lankhorst Date: Fri, 16 Nov 2012 11:17:14 +0100 Subject: x86/dma-debug: Bump PREALLOC_DMA_DEBUG_ENTRIES I ran out of free entries when I had CONFIG_DMA_API_DEBUG enabled. Some other archs seem to default to 65536, so increase this limit for x86 too. Signed-off-by: Maarten Lankhorst Cc: Bjorn Helgaas Link: http://lkml.kernel.org/r/50A612AA.7040206@canonical.com Signed-off-by: Ingo Molnar ---- --- arch/x86/kernel/pci-dma.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'arch/x86/kernel') diff --git a/arch/x86/kernel/pci-dma.c b/arch/x86/kernel/pci-dma.c index 0f5dec5c80e0..872079a67e4d 100644 --- a/arch/x86/kernel/pci-dma.c +++ b/arch/x86/kernel/pci-dma.c @@ -56,7 +56,7 @@ struct device x86_dma_fallback_dev = { EXPORT_SYMBOL(x86_dma_fallback_dev); /* Number of entries preallocated for DMA-API debugging */ -#define PREALLOC_DMA_DEBUG_ENTRIES 32768 +#define PREALLOC_DMA_DEBUG_ENTRIES 65536 int dma_set_mask(struct device *dev, u64 mask) { -- cgit v1.2.3 From c903f0456bc69176912dee6dd25c6a66ee1aed00 Mon Sep 17 00:00:00 2001 From: Alan Cox Date: Thu, 15 Nov 2012 13:06:22 +0000 Subject: x86/msr: Add capabilities check At the moment the MSR driver only relies upon file system checks. This means that anything as root with any capability set can write to MSRs. Historically that wasn't very interesting but on modern processors the MSRs are such that writing to them provides several ways to execute arbitary code in kernel space. Sample code and documentation on doing this is circulating and MSR attacks are used on Windows 64bit rootkits already. In the Linux case you still need to be able to open the device file so the impact is fairly limited and reduces the security of some capability and security model based systems down towards that of a generic "root owns the box" setup. Therefore they should require CAP_SYS_RAWIO to prevent an elevation of capabilities. The impact of this is fairly minimal on most setups because they don't have heavy use of capabilities. Those using SELinux, SMACK or AppArmor rules might want to consider if their rulesets on the MSR driver could be tighter. Signed-off-by: Alan Cox Cc: Linus Torvalds Cc: Andrew Morton Cc: Peter Zijlstra Cc: Horses Signed-off-by: Ingo Molnar --- arch/x86/kernel/msr.c | 3 +++ 1 file changed, 3 insertions(+) (limited to 'arch/x86/kernel') diff --git a/arch/x86/kernel/msr.c b/arch/x86/kernel/msr.c index a7c5661f8496..4929502c1372 100644 --- a/arch/x86/kernel/msr.c +++ b/arch/x86/kernel/msr.c @@ -174,6 +174,9 @@ static int msr_open(struct inode *inode, struct file *file) unsigned int cpu; struct cpuinfo_x86 *c; + if (!capable(CAP_SYS_RAWIO)) + return -EPERM; + cpu = iminor(file->f_path.dentry->d_inode); if (cpu >= nr_cpu_ids || !cpu_online(cpu)) return -ENXIO; /* No such CPU */ -- cgit v1.2.3 From 83e68189745ad931c2afd45d8ee3303929233e7f Mon Sep 17 00:00:00 2001 From: Matt Fleming Date: Wed, 14 Nov 2012 09:42:35 +0000 Subject: efi: Make 'efi_enabled' a function to query EFI facilities Originally 'efi_enabled' indicated whether a kernel was booted from EFI firmware. Over time its semantics have changed, and it now indicates whether or not we are booted on an EFI machine with bit-native firmware, e.g. 64-bit kernel with 64-bit firmware. The immediate motivation for this patch is the bug report at, https://bugs.launchpad.net/ubuntu-cdimage/+bug/1040557 which details how running a platform driver on an EFI machine that is designed to run under BIOS can cause the machine to become bricked. Also, the following report, https://bugzilla.kernel.org/show_bug.cgi?id=47121 details how running said driver can also cause Machine Check Exceptions. Drivers need a new means of detecting whether they're running on an EFI machine, as sadly the expression, if (!efi_enabled) hasn't been a sufficient condition for quite some time. Users actually want to query 'efi_enabled' for different reasons - what they really want access to is the list of available EFI facilities. For instance, the x86 reboot code needs to know whether it can invoke the ResetSystem() function provided by the EFI runtime services, while the ACPI OSL code wants to know whether the EFI config tables were mapped successfully. There are also checks in some of the platform driver code to simply see if they're running on an EFI machine (which would make it a bad idea to do BIOS-y things). This patch is a prereq for the samsung-laptop fix patch. Cc: David Airlie Cc: Corentin Chary Cc: Matthew Garrett Cc: Dave Jiang Cc: Olof Johansson Cc: Peter Jones Cc: Colin Ian King Cc: Steve Langasek Cc: Tony Luck Cc: Konrad Rzeszutek Wilk Cc: Rafael J. Wysocki Cc: Signed-off-by: Matt Fleming Signed-off-by: H. Peter Anvin --- arch/x86/kernel/reboot.c | 2 +- arch/x86/kernel/setup.c | 28 ++++++++++++++-------------- 2 files changed, 15 insertions(+), 15 deletions(-) (limited to 'arch/x86/kernel') diff --git a/arch/x86/kernel/reboot.c b/arch/x86/kernel/reboot.c index 4e8ba39eaf0f..76fa1e9a2b39 100644 --- a/arch/x86/kernel/reboot.c +++ b/arch/x86/kernel/reboot.c @@ -584,7 +584,7 @@ static void native_machine_emergency_restart(void) break; case BOOT_EFI: - if (efi_enabled) + if (efi_enabled(EFI_RUNTIME_SERVICES)) efi.reset_system(reboot_mode ? EFI_RESET_WARM : EFI_RESET_COLD, diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c index 00f6c1472b85..8b24289cc10c 100644 --- a/arch/x86/kernel/setup.c +++ b/arch/x86/kernel/setup.c @@ -807,15 +807,15 @@ void __init setup_arch(char **cmdline_p) #ifdef CONFIG_EFI if (!strncmp((char *)&boot_params.efi_info.efi_loader_signature, "EL32", 4)) { - efi_enabled = 1; - efi_64bit = false; + set_bit(EFI_BOOT, &x86_efi_facility); } else if (!strncmp((char *)&boot_params.efi_info.efi_loader_signature, "EL64", 4)) { - efi_enabled = 1; - efi_64bit = true; + set_bit(EFI_BOOT, &x86_efi_facility); + set_bit(EFI_64BIT, &x86_efi_facility); } - if (efi_enabled && efi_memblock_x86_reserve_range()) - efi_enabled = 0; + + if (efi_enabled(EFI_BOOT)) + efi_memblock_x86_reserve_range(); #endif x86_init.oem.arch_setup(); @@ -888,7 +888,7 @@ void __init setup_arch(char **cmdline_p) finish_e820_parsing(); - if (efi_enabled) + if (efi_enabled(EFI_BOOT)) efi_init(); dmi_scan_machine(); @@ -971,7 +971,7 @@ void __init setup_arch(char **cmdline_p) * The EFI specification says that boot service code won't be called * after ExitBootServices(). This is, in fact, a lie. */ - if (efi_enabled) + if (efi_enabled(EFI_MEMMAP)) efi_reserve_boot_services(); /* preallocate 4k for mptable mpc */ @@ -1114,7 +1114,7 @@ void __init setup_arch(char **cmdline_p) #ifdef CONFIG_VT #if defined(CONFIG_VGA_CONSOLE) - if (!efi_enabled || (efi_mem_type(0xa0000) != EFI_CONVENTIONAL_MEMORY)) + if (!efi_enabled(EFI_BOOT) || (efi_mem_type(0xa0000) != EFI_CONVENTIONAL_MEMORY)) conswitchp = &vga_con; #elif defined(CONFIG_DUMMY_CONSOLE) conswitchp = &dummy_con; @@ -1131,14 +1131,14 @@ void __init setup_arch(char **cmdline_p) register_refined_jiffies(CLOCK_TICK_RATE); #ifdef CONFIG_EFI - /* Once setup is done above, disable efi_enabled on mismatched - * firmware/kernel archtectures since there is no support for - * runtime services. + /* Once setup is done above, unmap the EFI memory map on + * mismatched firmware/kernel archtectures since there is no + * support for runtime services. */ - if (efi_enabled && IS_ENABLED(CONFIG_X86_64) != efi_64bit) { + if (efi_enabled(EFI_BOOT) && + IS_ENABLED(CONFIG_X86_64) != efi_enabled(EFI_64BIT)) { pr_info("efi: Setup done, disabling due to 32/64-bit mismatch\n"); efi_unmap_memmap(); - efi_enabled = 0; } #endif } -- cgit v1.2.3