summaryrefslogtreecommitdiff
path: root/arch/x86/kernel/cpu
AgeCommit message (Collapse)Author
2015-01-29x86, hyperv: Mark the Hyper-V clocksource as being continuousK. Y. Srinivasan
commit 32c6590d126836a062b3140ed52d898507987017 upstream. The Hyper-V clocksource is continuous; mark it accordingly. Signed-off-by: K. Y. Srinivasan <kys@microsoft.com> Acked-by: jasowang@redhat.com Cc: gregkh@linuxfoundation.org Cc: devel@linuxdriverproject.org Cc: olaf@aepfle.de Cc: apw@canonical.com Link: http://lkml.kernel.org/r/1421108762-3331-1-git-send-email-kys@microsoft.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-01-16perf/x86/intel/uncore: Make sure only uncore events are collectedJiri Olsa
commit af91568e762d04931dcbdd6bef4655433d8b9418 upstream. The uncore_collect_events functions assumes that event group might contain only uncore events which is wrong, because it might contain any type of events. This bug leads to uncore framework touching 'not' uncore events, which could end up all sorts of bugs. One was triggered by Vince's perf fuzzer, when the uncore code touched breakpoint event private event space as if it was uncore event and caused BUG: BUG: unable to handle kernel paging request at ffffffff82822068 IP: [<ffffffff81020338>] uncore_assign_events+0x188/0x250 ... The code in uncore_assign_events() function was looking for event->hw.idx data while the event was initialized as a breakpoint with different members in event->hw union. This patch forces uncore_collect_events() to collect only uncore events. Reported-by: Vince Weaver <vince@deater.net> Signed-off-by: Jiri Olsa <jolsa@redhat.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Yan, Zheng <zheng.z.yan@intel.com> Link: http://lkml.kernel.org/r/1418243031-20367-2-git-send-email-jolsa@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-12-16perf/x86/intel: Protect LBR and extra_regs against KVM lyingKan Liang
commit 338b522ca43cfd32d11a370f4203bcd089c6c877 upstream. With -cpu host, KVM reports LBR and extra_regs support, if the host has support. When the guest perf driver tries to access LBR or extra_regs MSR, it #GPs all MSR accesses,since KVM doesn't handle LBR and extra_regs support. So check the related MSRs access right once at initialization time to avoid the error access at runtime. For reproducing the issue, please build the kernel with CONFIG_KVM_INTEL = y (for host kernel). And CONFIG_PARAVIRT = n and CONFIG_KVM_GUEST = n (for guest kernel). Start the guest with -cpu host. Run perf record with --branch-any or --branch-filter in guest to trigger LBR Run perf stat offcore events (E.g. LLC-loads/LLC-load-misses ...) in guest to trigger offcore_rsp #GP Signed-off-by: Kan Liang <kan.liang@intel.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Andi Kleen <ak@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Maria Dimakopoulou <maria.n.dimakopoulou@gmail.com> Cc: Mark Davies <junk@eslaf.co.uk> Cc: Paul Mackerras <paulus@samba.org> Cc: Stephane Eranian <eranian@google.com> Cc: Yan, Zheng <zheng.z.yan@intel.com> Link: http://lkml.kernel.org/r/1405365957-20202-1-git-send-email-kan.liang@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Dongsu Park <dongsu.park@profitbricks.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-12-06x86: kvm: use alternatives for VMCALL vs. VMMCALL if kernel text is read-onlyPaolo Bonzini
commit c1118b3602c2329671ad5ec8bdf8e374323d6343 upstream. On x86_64, kernel text mappings are mapped read-only with CONFIG_DEBUG_RODATA. In that case, KVM will fail to patch VMCALL instructions to VMMCALL as required on AMD processors. The failure mode is currently a divide-by-zero exception, which obviously is a KVM bug that has to be fixed. However, picking the right instruction between VMCALL and VMMCALL will be faster and will help if you cannot upgrade the hypervisor. Reported-by: Chris Webb <chris@arachsys.com> Tested-by: Chris Webb <chris@arachsys.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: x86@kernel.org Acked-by: Borislav Petkov <bp@suse.de> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Chris J Arges <chris.j.arges@canonical.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-12-06x86: Require exact match for 'noxsave' command line optionDave Hansen
commit 2cd3949f702692cf4c5d05b463f19cd706a92dd3 upstream. We have some very similarly named command-line options: arch/x86/kernel/cpu/common.c:__setup("noxsave", x86_xsave_setup); arch/x86/kernel/cpu/common.c:__setup("noxsaveopt", x86_xsaveopt_setup); arch/x86/kernel/cpu/common.c:__setup("noxsaves", x86_xsaves_setup); __setup() is designed to match options that take arguments, like "foo=bar" where you would have: __setup("foo", x86_foo_func...); The problem is that "noxsave" actually _matches_ "noxsaves" in the same way that "foo" matches "foo=bar". If you boot an old kernel that does not know about "noxsaves" with "noxsaves" on the command line, it will interpret the argument as "noxsave", which is not what you want at all. This makes the "noxsave" handler only return success when it finds an *exact* match. [ tglx: We really need to make __setup() more robust. ] Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dave Hansen <dave@sr71.net> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: x86@kernel.org Link: http://lkml.kernel.org/r/20141111220133.FE053984@viggo.jf.intel.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-21perf/x86/intel: Use proper dTLB-load-misses event on IvyBridgeVince Weaver
commit 1996388e9f4e3444db8273bc08d25164d2967c21 upstream. This was discussed back in February: https://lkml.org/lkml/2014/2/18/956 But I never saw a patch come out of it. On IvyBridge we share the SandyBridge cache event tables, but the dTLB-load-miss event is not compatible. Patch it up after the fact to the proper DTLB_LOAD_MISSES.DEMAND_LD_MISS_CAUSES_A_WALK Signed-off-by: Vince Weaver <vincent.weaver@maine.edu> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/alpine.DEB.2.11.1407141528200.17214@vincent-weaver-1.umelst.maine.edu Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Hou Pengyang <houpengyang@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-21x86, microcode, AMD: Fix ucode patch stashing on 32-bitBorislav Petkov
commit c0a717f23dccdb6e3b03471bc846fdc636f2b353 upstream. Save the patch while we're running on the BSP instead of later, before the initrd has been jettisoned. More importantly, on 32-bit we need to access the physical address instead of the virtual. This way we actually do find it on the APs instead of having to go through the initrd each time. Tested-by: Richard Hendershot <rshendershot@mchsi.com> Fixes: 5335ba5cf475 ("x86, microcode, AMD: Fix early ucode loading") Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-21x86, microcode, AMD: Fix early ucode loading on 32-bitBorislav Petkov
commit 4750a0d112cbfcc744929f1530ffe3193436766c upstream. Konrad triggered the following splat below in a 32-bit guest on an AMD box. As it turns out, in save_microcode_in_initrd_amd() we're using the *physical* address of the container *after* we have enabled paging and thus we #PF in load_microcode_amd() when trying to access the microcode container in the ramdisk range. Because the ramdisk is exactly there: [ 0.000000] RAMDISK: [mem 0x35e04000-0x36ef9fff] and we fault at 0x35e04304. And since this guest doesn't relocate the ramdisk, we don't do the computation which will give us the correct virtual address and we end up with the PA. So, we should actually be using virtual addresses on 32-bit too by the time we're freeing the initrd. Do that then! Unpacking initramfs... BUG: unable to handle kernel paging request at 35d4e304 IP: [<c042e905>] load_microcode_amd+0x25/0x4a0 *pde = 00000000 Oops: 0000 [#1] SMP Modules linked in: CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.17.1-302.fc21.i686 #1 Hardware name: Xen HVM domU, BIOS 4.4.1 10/01/2014 task: f5098000 ti: f50d0000 task.ti: f50d0000 EIP: 0060:[<c042e905>] EFLAGS: 00010246 CPU: 0 EIP is at load_microcode_amd+0x25/0x4a0 EAX: 00000000 EBX: f6e9ec4c ECX: 00001ec4 EDX: 00000000 ESI: f5d4e000 EDI: 35d4e2fc EBP: f50d1ed0 ESP: f50d1e94 DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 CR0: 8005003b CR2: 35d4e304 CR3: 00e33000 CR4: 000406d0 Stack: 00000000 00000000 f50d1ebc f50d1ec4 f5d4e000 c0d7735a f50d1ed0 15a3d17f f50d1ec4 00600f20 00001ec4 bfb83203 f6e9ec4c f5d4e000 c0d7735a f50d1ed8 c0d80861 f50d1ee0 c0d80429 f50d1ef0 c0d889a9 f5d4e000 c0000000 f50d1f04 Call Trace: ? unpack_to_rootfs ? unpack_to_rootfs save_microcode_in_initrd_amd save_microcode_in_initrd free_initrd_mem populate_rootfs ? unpack_to_rootfs do_one_initcall ? unpack_to_rootfs ? repair_env_string ? proc_mkdir kernel_init_freeable kernel_init ret_from_kernel_thread ? rest_init Reported-and-tested-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> References: https://bugzilla.redhat.com/show_bug.cgi?id=1158204 Fixes: 75a1ba5b2c52 ("x86, microcode, AMD: Unify valid container checks") Signed-off-by: Borislav Petkov <bp@suse.de> Link: http://lkml.kernel.org/r/20141101100100.GA4462@pd.tnic Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-14x86: Add cpu_detect_cache_sizes to init_intel() add Quark legacy_cache()Bryan O'Donoghue
commit aece118e487a744eafcdd0c77fe32b55ee2092a1 upstream. Intel processors which don't report cache information via cpuid(2) or cpuid(4) need quirk code in the legacy_cache_size callback to report this data. For Intel that callback is is intel_size_cache(). This patch enables calling of cpu_detect_cache_sizes() inside of init_intel() and hence the calling of the legacy_cache callback in intel_size_cache(). Adding this call will ensure that PIII Tualatin currently in intel_size_cache() and Quark SoC X1000 being added to intel_size_cache() in this patch will report their respective cache sizes. This model of calling cpu_detect_cache_sizes() is consistent with AMD/Via/Cirix/Transmeta and Centaur. Also added is a string to idenitfy the Quark as Quark SoC X1000 giving better and more descriptive output via /proc/cpuinfo Adding cpu_detect_cache_sizes to init_intel() will enable calling of intel_size_cache() on Intel processors which currently no code can reach. Therefore this patch will also re-enable reporting of PIII Tualatin cache size information as well as add Quark SoC X1000 support. Comment text and cache flow logic suggested by Thomas Gleixner Signed-off-by: Bryan O'Donoghue <pure.logic@nexus-software.ie> Cc: davej@redhat.com Cc: hmh@hmh.eng.br Link: http://lkml.kernel.org/r/1412641189-12415-3-git-send-email-pure.logic@nexus-software.ie Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Chang Rebecca Swee Fun <rebecca.swee.fun.chang@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-14x86_64, entry: Filter RFLAGS.NT on entry from userspaceAndy Lutomirski
commit 8c7aa698baca5e8f1ba9edb68081f1e7a1abf455 upstream. The NT flag doesn't do anything in long mode other than causing IRET to #GP. Oddly, CPL3 code can still set NT using popf. Entry via hardware or software interrupt clears NT automatically, so the only relevant entries are fast syscalls. If user code causes kernel code to run with NT set, then there's at least some (small) chance that it could cause trouble. For example, user code could cause a call to EFI code with NT set, and who knows what would happen? Apparently some games on Wine sometimes do this (!), and, if an IRET return happens, they will segfault. That segfault cannot be handled, because signal delivery fails, too. This patch programs the CPU to clear NT on entry via SYSCALL (both 32-bit and 64-bit, by my reading of the AMD APM), and it clears NT in software on entry via SYSENTER. To save a few cycles, this borrows a trick from Jan Beulich in Xen: it checks whether NT is set before trying to clear it. As a result, it seems to have very little effect on SYSENTER performance on my machine. There's another minor bug fix in here: it looks like the CFI annotations were wrong if CONFIG_AUDITSYSCALL=n. Testers beware: on Xen, SYSENTER with NT set turns into a GPF. I haven't touched anything on 32-bit kernels. The syscall mask change comes from a variant of this patch by Anish Bhatt. Note to stable maintainers: there is no known security issue here. A misguided program can set NT and cause the kernel to try and fail to deliver SIGSEGV, crashing the program. This patch fixes Far Cry on Wine: https://bugs.winehq.org/show_bug.cgi?id=33275 Reported-by: Anish Bhatt <anish@chelsio.com> Signed-off-by: Andy Lutomirski <luto@amacapital.net> Link: http://lkml.kernel.org/r/395749a5d39a29bd3e4b35899cf3a3c1340e5595.1412189265.git.luto@amacapital.net Signed-off-by: H. Peter Anvin <hpa@zytor.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-10-30x86/intel/quark: Switch off CR4.PGE so TLB flush uses CR3 insteadBryan O'Donoghue
commit ee1b5b165c0a2f04d2107e634e51f05d0eb107de upstream. Quark x1000 advertises PGE via the standard CPUID method PGE bits exist in Quark X1000's PTEs. In order to flush an individual PTE it is necessary to reload CR3 irrespective of the PTE.PGE bit. See Quark Core_DevMan_001.pdf section 6.4.11 This bug was fixed in Galileo kernels, unfixed vanilla kernels are expected to crash and burn on this platform. Signed-off-by: Bryan O'Donoghue <pure.logic@nexus-software.ie> Cc: Borislav Petkov <bp@alien8.de> Link: http://lkml.kernel.org/r/1411514784-14885-1-git-send-email-pure.logic@nexus-software.ie Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-10-05perf/x86/intel: Use rdmsrl_safe() when initializing RAPL PMUVenkatesh Srinivas
commit 24223657806a0ebd0ae5c9caaf7b021091889cf2 upstream. CPUs which should support the RAPL counters according to Family/Model/Stepping may still issue #GP when attempting to access the RAPL MSRs. This may happen when Linux is running under KVM and we are passing-through host F/M/S data, for example. Use rdmsrl_safe to first access the RAPL_POWER_UNIT MSR; if this fails, do not attempt to use this PMU. Signed-off-by: Venkatesh Srinivas <venkateshs@google.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1394739386-22260-1-git-send-email-venkateshs@google.com Cc: zheng.z.yan@intel.com Cc: eranian@google.com Cc: ak@linux.intel.com Cc: linux-kernel@vger.kernel.org [ The patch also silently fixes another bug: rapl_pmu_init() didn't handle the memory alloc failure case previously. ] Signed-off-by: Ingo Molnar <mingo@kernel.org> [backport by whissi] Cc: Thomas D <whissi@whissi.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-07-28perf/x86/intel: ignore CondChgd bit to avoid false NMI handlingHATAYAMA Daisuke
commit b292d7a10487aee6e74b1c18b8d95b92f40d4a4f upstream. Currently, any NMI is falsely handled by a NMI handler of NMI watchdog if CondChgd bit in MSR_CORE_PERF_GLOBAL_STATUS MSR is set. For example, we use external NMI to make system panic to get crash dump, but in this case, the external NMI is falsely handled do to the issue. This commit deals with the issue simply by ignoring CondChgd bit. Here is explanation in detail. On x86 NMI watchdog uses performance monitoring feature to periodically signal NMI each time performance counter gets overflowed. intel_pmu_handle_irq() is called as a NMI_LOCAL handler from a NMI handler of NMI watchdog, perf_event_nmi_handler(). It identifies an owner of a given NMI by looking at overflow status bits in MSR_CORE_PERF_GLOBAL_STATUS MSR. If some of the bits are set, then it handles the given NMI as its own NMI. The problem is that the intel_pmu_handle_irq() doesn't distinguish CondChgd bit from other bits. Unlike the other status bits, CondChgd bit doesn't represent overflow status for performance counters. Thus, CondChgd bit cannot be thought of as a mark indicating a given NMI is NMI watchdog's. As a result, if CondChgd bit is set, any NMI is falsely handled by the NMI handler of NMI watchdog. Also, if type of the falsely handled NMI is either NMI_UNKNOWN, NMI_SERR or NMI_IO_CHECK, the corresponding action is never performed until CondChgd bit is cleared. I noticed this behavior on systems with Ivy Bridge processors: Intel Xeon CPU E5-2630 v2 and Intel Xeon CPU E7-8890 v2. On both systems, CondChgd bit in MSR_CORE_PERF_GLOBAL_STATUS MSR has already been set in the beginning at boot. Then the CondChgd bit is immediately cleared by next wrmsr to MSR_CORE_PERF_GLOBAL_CTRL MSR and appears to remain 0. On the other hand, on older processors such as Nehalem, Xeon E7540, CondChgd bit is not set in the beginning at boot. I'm not sure about exact behavior of CondChgd bit, in particular when this bit is set. Although I read Intel System Programmer's Manual to figure out that, the descriptions I found are: In 18.9.1: "The MSR_PERF_GLOBAL_STATUS MSR also provides a ¡sticky bit¢ to indicate changes to the state of performancmonitoring hardware" In Table 35-2 IA-32 Architectural MSRs 63 CondChg: status bits of this register has changed. These are different from the bahviour I see on the actual system as I explained above. At least, I think ignoring CondChgd bit should be enough for NMI watchdog perspective. Signed-off-by: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com> Acked-by: Don Zickus <dzickus@redhat.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: linux-kernel@vger.kernel.org Link: http://lkml.kernel.org/r/20140625.103503.409316067.d.hatayama@jp.fujitsu.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-04-26x86, hyperv: Bypass the timer_irq_works() checkJason Wang
commit ca3ba2a2f4a49a308e7d78c784d51b2332064f15 upstream. This patch bypass the timer_irq_works() check for hyperv guest since: - It was guaranteed to work. - timer_irq_works() may fail sometime due to the lpj calibration were inaccurate in a hyperv guest or a buggy host. In the future, we should get the tsc frequency from hypervisor and use preset lpj instead. [ hpa: I would prefer to not defer things to "the future" in the future... ] Cc: K. Y. Srinivasan <kys@microsoft.com> Cc: Haiyang Zhang <haiyangz@microsoft.com> Acked-by: K. Y. Srinivasan <kys@microsoft.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Link: http://lkml.kernel.org/r/1393558229-14755-1-git-send-email-jasowang@redhat.com Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-03-16Merge branch 'perf-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fixes from Ingo Molnar: "Misc smaller fixes" * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/x86: Fix leak in uncore_type_init failure paths perf machine: Use map as success in ip__resolve_ams perf symbols: Fix crash in elf_section_by_name perf trace: Decode architecture-specific signal numbers
2014-03-11x86: Remove CONFIG_X86_OOSTOREDave Jones
This was an optimization that made memcpy type benchmarks a little faster on ancient (Circa 1998) IDT Winchip CPUs. In real-life workloads, it wasn't even noticable, and I doubt anyone is running benchmarks on 16 year old silicon any more. Given this code has likely seen very little use over the last decade, let's just remove it. Signed-off-by: Dave Jones <davej@fedoraproject.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-03-11perf/x86: Fix leak in uncore_type_init failure pathsDave Jones
The error path of uncore_type_init() frees up any allocations that were made along the way, but it relies upon type->pmus being set, which only happens if the function succeeds. As type->pmus remains null in this case, the call to uncore_type_exit will do nothing. Moving the assignment earlier will allow us to actually free those allocations should something go awry. Signed-off-by: Dave Jones <davej@fedoraproject.org> Acked-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20140306172028.GA552@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-02-27perf/x86: Fix event schedulingPeter Zijlstra
Vince "Super Tester" Weaver reported a new round of syscall fuzzing (Trinity) failures, with perf WARN_ON()s triggering. He also provided traces of the failures. This is I think the relevant bit: > pec_1076_warn-2804 [000] d... 147.926153: x86_pmu_disable: x86_pmu_disable > pec_1076_warn-2804 [000] d... 147.926153: x86_pmu_state: Events: { > pec_1076_warn-2804 [000] d... 147.926156: x86_pmu_state: 0: state: .R config: ffffffffffffffff ( (null)) > pec_1076_warn-2804 [000] d... 147.926158: x86_pmu_state: 33: state: AR config: 0 (ffff88011ac99800) > pec_1076_warn-2804 [000] d... 147.926159: x86_pmu_state: } > pec_1076_warn-2804 [000] d... 147.926160: x86_pmu_state: n_events: 1, n_added: 0, n_txn: 1 > pec_1076_warn-2804 [000] d... 147.926161: x86_pmu_state: Assignment: { > pec_1076_warn-2804 [000] d... 147.926162: x86_pmu_state: 0->33 tag: 1 config: 0 (ffff88011ac99800) > pec_1076_warn-2804 [000] d... 147.926163: x86_pmu_state: } > pec_1076_warn-2804 [000] d... 147.926166: collect_events: Adding event: 1 (ffff880119ec8800) So we add the insn:p event (fd[23]). At this point we should have: n_events = 2, n_added = 1, n_txn = 1 > pec_1076_warn-2804 [000] d... 147.926170: collect_events: Adding event: 0 (ffff8800c9e01800) > pec_1076_warn-2804 [000] d... 147.926172: collect_events: Adding event: 4 (ffff8800cbab2c00) We try and add the {BP,cycles,br_insn} group (fd[3], fd[4], fd[15]). These events are 0:cycles and 4:br_insn, the BP event isn't x86_pmu so that's not visible. group_sched_in() pmu->start_txn() /* nop - BP pmu */ event_sched_in() event->pmu->add() So here we should end up with: 0: n_events = 3, n_added = 2, n_txn = 2 4: n_events = 4, n_added = 3, n_txn = 3 But seeing the below state on x86_pmu_enable(), the must have failed, because the 0 and 4 events aren't there anymore. Looking at group_sched_in(), since the BP is the leader, its event_sched_in() must have succeeded, for otherwise we would not have seen the sibling adds. But since neither 0 or 4 are in the below state; their event_sched_in() must have failed; but I don't see why, the complete state: 0,0,1:p,4 fits perfectly fine on a core2. However, since we try and schedule 4 it means the 0 event must have succeeded! Therefore the 4 event must have failed, its failure will have put group_sched_in() into the fail path, which will call: event_sched_out() event->pmu->del() on 0 and the BP event. Now x86_pmu_del() will reduce n_events; but it will not reduce n_added; giving what we see below: n_event = 2, n_added = 2, n_txn = 2 > pec_1076_warn-2804 [000] d... 147.926177: x86_pmu_enable: x86_pmu_enable > pec_1076_warn-2804 [000] d... 147.926177: x86_pmu_state: Events: { > pec_1076_warn-2804 [000] d... 147.926179: x86_pmu_state: 0: state: .R config: ffffffffffffffff ( (null)) > pec_1076_warn-2804 [000] d... 147.926181: x86_pmu_state: 33: state: AR config: 0 (ffff88011ac99800) > pec_1076_warn-2804 [000] d... 147.926182: x86_pmu_state: } > pec_1076_warn-2804 [000] d... 147.926184: x86_pmu_state: n_events: 2, n_added: 2, n_txn: 2 > pec_1076_warn-2804 [000] d... 147.926184: x86_pmu_state: Assignment: { > pec_1076_warn-2804 [000] d... 147.926186: x86_pmu_state: 0->33 tag: 1 config: 0 (ffff88011ac99800) > pec_1076_warn-2804 [000] d... 147.926188: x86_pmu_state: 1->0 tag: 1 config: 1 (ffff880119ec8800) > pec_1076_warn-2804 [000] d... 147.926188: x86_pmu_state: } > pec_1076_warn-2804 [000] d... 147.926190: x86_pmu_enable: S0: hwc->idx: 33, hwc->last_cpu: 0, hwc->last_tag: 1 hwc->state: 0 So the problem is that x86_pmu_del(), when called from a group_sched_in() that fails (for whatever reason), and without x86_pmu TXN support (because the leader is !x86_pmu), will corrupt the n_added state. Reported-and-Tested-by: Vince Weaver <vincent.weaver@maine.edu> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Stephane Eranian <eranian@google.com> Cc: Dave Jones <davej@redhat.com> Cc: <stable@vger.kernel.org> Link: http://lkml.kernel.org/r/20140221150312.GF3104@twins.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-02-22Merge branch 'perf-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fixes from Ingo Molnar: "Misc fixlets from all around the place" * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/x86/uncore: Fix IVT/SNB-EP uncore CBOX NID filter table perf/x86: Correctly use FEATURE_PDCM perf, nmi: Fix unknown NMI warning perf trace: Fix ioctl 'request' beautifier build problems on !(i386 || x86_64) arches perf trace: Add fallback definition of EFD_SEMAPHORE perf list: Fix checking for supported events on older kernels perf tools: Handle PERF_RECORD_HEADER_EVENT_TYPE properly perf probe: Do not add offset twice to uprobe address perf/x86: Fix Userspace RDPMC switch perf/x86/intel/p6: Add userspace RDPMC quirk for PPro
2014-02-21perf/x86/uncore: Fix IVT/SNB-EP uncore CBOX NID filter tableStephane Eranian
This patch updates the CBOX PMU filters mapping tables for SNB-EP and IVT (model 45 and 62 respectively). The NID umask always comes in addition to another umask. When set, the NID filter is applied. The current mapping tables were missing some code/umask combinations to account for the NID umask. This patch fixes that. Cc: mingo@elte.hu Cc: ak@linux.intel.com Reviewed-by: Yan, Zheng <zheng.z.yan@intel.com> Signed-off-by: Stephane Eranian <eranian@google.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20140219131018.GA24475@quad Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-02-21perf/x86: Correctly use FEATURE_PDCMPeter Zijlstra
The current code simply assumes Intel Arch PerfMon v2+ to have the IA32_PERF_CAPABILITIES MSR; the SDM specifies that we should check CPUID[1].ECX[15] (aka, FEATURE_PDCM) instead. This was found by KVM which implements v2+ but didn't provide the capabilities MSR. Change the code to DTRT; KVM will also implement the MSR and return 0. Cc: pbonzini@redhat.com Reported-by: "Michael S. Tsirkin" <mst@redhat.com> Suggested-by: Eduardo Habkost <ehabkost@redhat.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20140203132903.GI8874@twins.programming.kicks-ass.net Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-02-21perf, nmi: Fix unknown NMI warningMarkus Metzger
When using BTS on Core i7-4*, I get the below kernel warning. $ perf record -c 1 -e branches:u ls Message from syslogd@labpc1501 at Nov 11 15:49:25 ... kernel:[ 438.317893] Uhhuh. NMI received for unknown reason 31 on CPU 2. Message from syslogd@labpc1501 at Nov 11 15:49:25 ... kernel:[ 438.317920] Do you have a strange power saving mode enabled? Message from syslogd@labpc1501 at Nov 11 15:49:25 ... kernel:[ 438.317945] Dazed and confused, but trying to continue Make intel_pmu_handle_irq() take the full exit path when returning early. Cc: eranian@google.com Cc: peterz@infradead.org Cc: mingo@kernel.org Signed-off-by: Markus Metzger <markus.t.metzger@intel.com> Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1392425048-5309-1-git-send-email-andi@firstfloor.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-02-13x86, smap: Don't enable SMAP if CONFIG_X86_SMAP is disabledH. Peter Anvin
If SMAP support is not compiled into the kernel, don't enable SMAP in CR4 -- in fact, we should clear it, because the kernel doesn't contain the proper STAC/CLAC instructions for SMAP support. Found by Fengguang Wu's test system. Reported-by: Fengguang Wu <fengguang.wu@intel.com> Link: http://lkml.kernel.org/r/20140213124550.GA30497@localhost Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Cc: <stable@vger.kernel.org> # v3.7+
2014-02-09perf/x86: Fix Userspace RDPMC switchPeter Zijlstra
The current code forgets to change the CR4 state on the current CPU. Use on_each_cpu() instead of smp_call_function(). Reported-by: Mark Davies <junk@eslaf.co.uk> Suggested-by: Mark Davies <junk@eslaf.co.uk> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: fweisbec@gmail.com Link: http://lkml.kernel.org/n/tip-69efsat90ibhnd577zy3z9gh@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-02-09perf/x86/intel/p6: Add userspace RDPMC quirk for PProPeter Zijlstra
PPro machines can die hard when PCE gets enabled due to a CPU erratum. The safe way it so disable it by default and keep it disabled. See erratum 26 in: http://download.intel.com/design/archives/processors/pro/docs/24268935.pdf Reported-and-Tested-by: Mark Davies <junk@eslaf.co.uk> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Cc: Stephane Eranian <eranian@google.com> Cc: Vince Weaver <vince@deater.net> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20140206170815.GW2936@laptop.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-02-07Merge tag 'efi-urgent' into x86/urgentH. Peter Anvin
* Avoid WARN_ON() when mapping BGRT on Baytrail (EFI 32-bit). Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2014-02-06x86, microcode, AMD: Unify valid container checksBorislav Petkov
For additional coverage, BorisO and friends unknowlingly did swap AMD microcode with Intel microcode blobs in order to see what happens. What did happen on 32-bit was [ 5.722656] BUG: unable to handle kernel paging request at be3a6008 [ 5.722693] IP: [<c106d6b4>] load_microcode_amd+0x24/0x3f0 [ 5.722716] *pdpt = 0000000000000000 *pde = 0000000000000000 because there was a valid initrd there but without valid microcode in it and the container check happened *after* the relocated ramdisk handling on 32-bit, which was clearly wrong. While at it, take care of the ramdisk relocation on both 32- and 64-bit as it is done on both. Also, comment what we're doing because this code is a bit tricky. Reported-and-tested-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: http://lkml.kernel.org/r/1391460104-7261-1-git-send-email-bp@alien8.de Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2014-01-25Merge branch 'linus' into x86/urgentIngo Molnar
Merge in the x86 changes to apply a fix. Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-01-25mm, x86: Revisit tlb_flushall_shift tuning for page flushes except on IvyBridgeMel Gorman
There was a large ebizzy performance regression that was bisected to commit 611ae8e3 (x86/tlb: enable tlb flush range support for x86). The problem was related to the tlb_flushall_shift tuning for IvyBridge which was altered. The problem is that it is not clear if the tuning values for each CPU family is correct as the methodology used to tune the values is unclear. This patch uses a conservative tlb_flushall_shift value for all CPU families except IvyBridge so the decision can be revisited if any regression is found as a result of this change. IvyBridge is an exception as testing with one methodology determined that the value of 2 is acceptable. Details are in the changelog for the patch "x86: mm: Change tlb_flushall_shift for IvyBridge". One important aspect of this to watch out for is Xen. The original commit log mentioned large performance gains on Xen. It's possible Xen is more sensitive to this value if it flushes small ranges of pages more frequently than workloads on bare metal typically do. Signed-off-by: Mel Gorman <mgorman@suse.de> Tested-by: Davidlohr Bueso <davidlohr@hp.com> Reviewed-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Alex Shi <alex.shi@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/n/tip-dyzMww3fqugnhbhgo6Gxmtkw@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-01-25x86: mm: change tlb_flushall_shift for IvyBridgeMel Gorman
There was a large performance regression that was bisected to commit 611ae8e3 ("x86/tlb: enable tlb flush range support for x86"). This patch simply changes the default balance point between a local and global flush for IvyBridge. In the interest of allowing the tests to be reproduced, this patch was tested using mmtests 0.15 with the following configurations configs/config-global-dhp__tlbflush-performance configs/config-global-dhp__scheduler-performance configs/config-global-dhp__network-performance Results are from two machines Ivybridge 4 threads: Intel(R) Core(TM) i3-3240 CPU @ 3.40GHz Ivybridge 8 threads: Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz Page fault microbenchmark showed nothing interesting. Ebizzy was configured to run multiple iterations and threads. Thread counts ranged from 1 to NR_CPUS*2. For each thread count, it ran 100 iterations and each iteration lasted 10 seconds. Ivybridge 4 threads 3.13.0-rc7 3.13.0-rc7 vanilla altshift-v3 Mean 1 6395.44 ( 0.00%) 6789.09 ( 6.16%) Mean 2 7012.85 ( 0.00%) 8052.16 ( 14.82%) Mean 3 6403.04 ( 0.00%) 6973.74 ( 8.91%) Mean 4 6135.32 ( 0.00%) 6582.33 ( 7.29%) Mean 5 6095.69 ( 0.00%) 6526.68 ( 7.07%) Mean 6 6114.33 ( 0.00%) 6416.64 ( 4.94%) Mean 7 6085.10 ( 0.00%) 6448.51 ( 5.97%) Mean 8 6120.62 ( 0.00%) 6462.97 ( 5.59%) Ivybridge 8 threads 3.13.0-rc7 3.13.0-rc7 vanilla altshift-v3 Mean 1 7336.65 ( 0.00%) 7787.02 ( 6.14%) Mean 2 8218.41 ( 0.00%) 9484.13 ( 15.40%) Mean 3 7973.62 ( 0.00%) 8922.01 ( 11.89%) Mean 4 7798.33 ( 0.00%) 8567.03 ( 9.86%) Mean 5 7158.72 ( 0.00%) 8214.23 ( 14.74%) Mean 6 6852.27 ( 0.00%) 7952.45 ( 16.06%) Mean 7 6774.65 ( 0.00%) 7536.35 ( 11.24%) Mean 8 6510.50 ( 0.00%) 6894.05 ( 5.89%) Mean 12 6182.90 ( 0.00%) 6661.29 ( 7.74%) Mean 16 6100.09 ( 0.00%) 6608.69 ( 8.34%) Ebizzy hits the worst case scenario for TLB range flushing every time and it shows for these Ivybridge CPUs at least that the default choice is a poor on. The patch addresses the problem. Next was a tlbflush microbenchmark written by Alex Shi at http://marc.info/?l=linux-kernel&m=133727348217113 . It measures access costs while the TLB is being flushed. The expectation is that if there are always full TLB flushes that the benchmark would suffer and it benefits from range flushing There are 320 iterations of the test per thread count. The number of entries is randomly selected with a min of 1 and max of 512. To ensure a reasonably even spread of entries, the full range is broken up into 8 sections and a random number selected within that section. iteration 1, random number between 0-64 iteration 2, random number between 64-128 etc This is still a very weak methodology. When you do not know what are typical ranges, random is a reasonable choice but it can be easily argued that the opimisation was for smaller ranges and an even spread is not representative of any workload that matters. To improve this, we'd need to know the probability distribution of TLB flush range sizes for a set of workloads that are considered "common", build a synthetic trace and feed that into this benchmark. Even that is not perfect because it would not account for the time between flushes but there are limits of what can be reasonably done and still be doing something useful. If a representative synthetic trace is provided then this benchmark could be revisited and the shift values retuned. Ivybridge 4 threads 3.13.0-rc7 3.13.0-rc7 vanilla altshift-v3 Mean 1 10.50 ( 0.00%) 10.50 ( 0.03%) Mean 2 17.59 ( 0.00%) 17.18 ( 2.34%) Mean 3 22.98 ( 0.00%) 21.74 ( 5.41%) Mean 5 47.13 ( 0.00%) 46.23 ( 1.92%) Mean 8 43.30 ( 0.00%) 42.56 ( 1.72%) Ivybridge 8 threads 3.13.0-rc7 3.13.0-rc7 vanilla altshift-v3 Mean 1 9.45 ( 0.00%) 9.36 ( 0.93%) Mean 2 9.37 ( 0.00%) 9.70 ( -3.54%) Mean 3 9.36 ( 0.00%) 9.29 ( 0.70%) Mean 5 14.49 ( 0.00%) 15.04 ( -3.75%) Mean 8 41.08 ( 0.00%) 38.73 ( 5.71%) Mean 13 32.04 ( 0.00%) 31.24 ( 2.49%) Mean 16 40.05 ( 0.00%) 39.04 ( 2.51%) For both CPUs, average access time is reduced which is good as this is the benchmark that was used to tune the shift values in the first place albeit it is now known *how* the benchmark was used. The scheduler benchmarks were somewhat inconclusive. They showed gains and losses and makes me reconsider how stable those benchmarks really are or if something else might be interfering with the test results recently. Network benchmarks were inconclusive. Almost all results were flat except for netperf-udp tests on the 4 thread machine. These results were unstable and showed large variations between reboots. It is unknown if this is a recent problems but I've noticed before that netperf-udp results tend to vary. Based on these results, changing the default for Ivybridge seems like a logical choice. Signed-off-by: Mel Gorman <mgorman@suse.de> Tested-by: Davidlohr Bueso <davidlohr@hp.com> Reviewed-by: Alex Shi <alex.shi@linaro.org> Reviewed-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/n/tip-cqnadffh1tiqrshthRj3Esge@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-01-25mm, x86: Account for TLB flushes only when debuggingMel Gorman
Bisection between 3.11 and 3.12 fingered commit 9824cf97 ("mm: vmstats: tlb flush counters") to cause overhead problems. The counters are undeniably useful but how often do we really need to debug TLB flush related issues? It does not justify taking the penalty everywhere so make it a debugging option. Signed-off-by: Mel Gorman <mgorman@suse.de> Tested-by: Davidlohr Bueso <davidlohr@hp.com> Reviewed-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Hugh Dickins <hughd@google.com> Cc: Alex Shi <alex.shi@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/n/tip-XzxjntugxuwpxXhcrxqqh53b@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-01-20Merge tag 'driver-core-3.14-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core / sysfs patches from Greg KH: "Here's the big driver core and sysfs patch set for 3.14-rc1. There's a lot of work here moving sysfs logic out into a "kernfs" to allow other subsystems to also have a virtual filesystem with the same attributes of sysfs (handle device disconnect, dynamic creation / removal as needed / unneeded, etc) This is primarily being done for the cgroups filesystem, but the goal is to also move debugfs to it when it is ready, solving all of the known issues in that filesystem as well. The code isn't completed yet, but all should be stable now (there is a big section that was reverted due to problems found when testing) There's also some other smaller fixes, and a driver core addition that allows for a "collection" of objects, that the DRM people will be using soon (it's in this tree to make merges after -rc1 easier) All of this has been in linux-next with no reported issues" * tag 'driver-core-3.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (113 commits) kernfs: associate a new kernfs_node with its parent on creation kernfs: add struct dentry declaration in kernfs.h kernfs: fix get_active failure handling in kernfs_seq_*() Revert "kernfs: fix get_active failure handling in kernfs_seq_*()" Revert "kernfs: replace kernfs_node->u.completion with kernfs_root->deactivate_waitq" Revert "kernfs: remove KERNFS_ACTIVE_REF and add kernfs_lockdep()" Revert "kernfs: remove KERNFS_REMOVED" Revert "kernfs: restructure removal path to fix possible premature return" Revert "kernfs: invoke kernfs_unmap_bin_file() directly from __kernfs_remove()" Revert "kernfs: remove kernfs_addrm_cxt" Revert "kernfs: make kernfs_get_active() block if the node is deactivated but not removed" Revert "kernfs: implement kernfs_{de|re}activate[_self]()" Revert "kernfs, sysfs, driver-core: implement kernfs_remove_self() and its wrappers" Revert "pci: use device_remove_file_self() instead of device_schedule_callback()" Revert "scsi: use device_remove_file_self() instead of device_schedule_callback()" Revert "s390: use device_remove_file_self() instead of device_schedule_callback()" Revert "sysfs, driver-core: remove unused {sysfs|device}_schedule_callback_owner()" Revert "kernfs: remove unnecessary NULL check in __kernfs_remove()" kernfs: remove unnecessary NULL check in __kernfs_remove() drivers/base: provide an infrastructure for componentised subsystems ...
2014-01-20Merge branch 'x86-kaslr-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 kernel address space randomization support from Peter Anvin: "This enables kernel address space randomization for x86" * 'x86-kaslr-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86, kaslr: Clarify RANDOMIZE_BASE_MAX_OFFSET x86, kaslr: Remove unused including <linux/version.h> x86, kaslr: Use char array to gain sizeof sanity x86, kaslr: Add a circular multiply for better bit diffusion x86, kaslr: Mix entropy sources together as needed x86/relocs: Add percpu fixup for GNU ld 2.23 x86, boot: Rename get_flags() and check_flags() to *_cpuflags() x86, kaslr: Raise the maximum virtual address to -1 GiB on x86_64 x86, kaslr: Report kernel offset on panic x86, kaslr: Select random position from e820 maps x86, kaslr: Provide randomness functions x86, kaslr: Return location from decompress_kernel x86, boot: Move CPU flags out of cpucheck x86, relocs: Add more per-cpu gold special cases
2014-01-20Merge branch 'x86-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull leftover x86 fixes from Ingo Molnar: "Two leftover fixes that did not make it into v3.13" * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86: Add check for number of available vectors before CPU down x86, cpu, amd: Add workaround for family 16h, erratum 793
2014-01-20Merge branch 'x86-ras-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 RAS changes from Ingo Molnar: - SCI reporting for other error types not only correctable ones - GHES cleanups - Add the functionality to override error reporting agents as some machines are sporting a new extended error logging capability which, if done properly in the BIOS, makes a corresponding EDAC module redundant - PCIe AER tracepoint severity levels fix - Error path correction for the mce device init - MCE timer fix - Add more flexibility to the error injection (EINJ) debugfs interface * 'x86-ras-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86, mce: Fix mce_start_timer semantics ACPI, APEI, GHES: Cleanup ghes memory error handling ACPI, APEI: Cleanup alignment-aware accesses ACPI, APEI, GHES: Do not report only correctable errors with SCI ACPI, APEI, EINJ: Changes to the ACPI/APEI/EINJ debugfs interface ACPI, eMCA: Combine eMCA/EDAC event reporting priority EDAC, sb_edac: Modify H/W event reporting policy EDAC: Add an edac_report parameter to EDAC PCI, AER: Fix severity usage in aer trace event x86, mce: Call put_device on device_register failure
2014-01-20Merge branch 'x86-microcode-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 microcode loader updates from Ingo Molnar: "There are two main changes in this tree: - AMD microcode early loading fixes - some microcode loader source files reorganization" * 'x86-microcode-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86, microcode: Move to a proper location x86, microcode, AMD: Fix early ucode loading x86, microcode: Share native MSR accessing variants x86, ramdisk: Export relocated ramdisk VA
2014-01-20Merge branch 'x86-cpu-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 TLB detection update from Ingo Molnar: "A single change that extends our TLB cache size detection+reporting code" * 'x86-cpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86, cpu: Detect more TLB configuration
2014-01-20Merge branch 'x86-cleanups-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 cleanups from Ingo Molnar: "Misc cleanups" * 'x86-cleanups-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86, cpu, amd: Fix a shadowed variable situation um, x86: Fix vDSO build x86: Delete non-required instances of include <linux/init.h> x86, realmode: Pointer walk cleanups, pull out invariant use of __pa() x86/traps: Clean up error exception handler definitions
2014-01-20Merge branch 'sched-core-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler changes from Ingo Molnar: - Add the initial implementation of SCHED_DEADLINE support: a real-time scheduling policy where tasks that meet their deadlines and periodically execute their instances in less than their runtime quota see real-time scheduling and won't miss any of their deadlines. Tasks that go over their quota get delayed (Available to privileged users for now) - Clean up and fix preempt_enable_no_resched() abuse all around the tree - Do sched_clock() performance optimizations on x86 and elsewhere - Fix and improve auto-NUMA balancing - Fix and clean up the idle loop - Apply various cleanups and fixes * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (60 commits) sched: Fix __sched_setscheduler() nice test sched: Move SCHED_RESET_ON_FORK into attr::sched_flags sched: Fix up attr::sched_priority warning sched: Fix up scheduler syscall LTP fails sched: Preserve the nice level over sched_setscheduler() and sched_setparam() calls sched/core: Fix htmldocs warnings sched/deadline: No need to check p if dl_se is valid sched/deadline: Remove unused variables sched/deadline: Fix sparse static warnings m68k: Fix build warning in mac_via.h sched, thermal: Clean up preempt_enable_no_resched() abuse sched, net: Fixup busy_loop_us_clock() sched, net: Clean up preempt_enable_no_resched() abuse sched/preempt: Fix up missed PREEMPT_NEED_RESCHED folding sched/preempt, locking: Rework local_bh_{dis,en}able() sched/clock, x86: Avoid a runtime condition in native_sched_clock() sched/clock: Fix up clear_sched_clock_stable() sched/clock, x86: Use a static_key for sched_clock_stable sched/clock: Remove local_irq_disable() from the clocks sched/clock, x86: Rewrite cyc2ns() to avoid the need to disable IRQs ...
2014-01-16Merge branch 'perf/urgent' into perf/coreIngo Molnar
Pick up the latest fixes, refresh the development tree. Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-01-16perf/x86/amd/ibs: Fix waking up from S3 for AMD family 10hRobert Richter
On AMD family 10h we see following error messages while waking up from S3 for all non-boot CPUs leading to a failed IBS initialization: Enabling non-boot CPUs ... smpboot: Booting Node 0 Processor 1 APIC 0x1 [Firmware Bug]: cpu 1, try to use APIC500 (LVT offset 0) for vector 0x400, but the register is already in use for vector 0xf9 on another cpu perf: IBS APIC setup failed on cpu #1 process: Switch to broadcast mode on CPU1 CPU1 is up ... ACPI: Waking up from system sleep state S3 Reason for this is that during suspend the LVT offset for the IBS vector gets lost and needs to be reinialized while resuming. The offset is read from the IBSCTL msr. On family 10h the offset needs to be 1 as offset 0 is used for the MCE threshold interrupt, but firmware assings it for IBS to 0 too. The kernel needs to reprogram the vector. The msr is a readonly node msr, but a new value can be written via pci config space access. The reinitialization is implemented for family 10h in setup_ibs_ctl() which is forced during IBS setup. This patch fixes IBS setup after waking up from S3 by adding resume/supend hooks for the boot cpu which does the offset reinitialization. Marking it as stable to let distros pick up this fix. Signed-off-by: Robert Richter <rric@kernel.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: <stable@vger.kernel.org> v3.2.. Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1389797849-5565-1-git-send-email-rric.net@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-01-15x86, cpu, amd: Fix a shadowed variable situationBorislav Petkov
Having u32 and struct cpuinfo_x86 * by the same name is not very smart, although it was ok in this case due to the limited scope of u32 c and it being used only once in there. Fix this. Signed-off-by: Borislav Petkov <bp@suse.de> Link: http://lkml.kernel.org/r/1389786735-16751-1-git-send-email-bp@alien8.de Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2014-01-14x86, cpu, amd: Add workaround for family 16h, erratum 793Borislav Petkov
This adds the workaround for erratum 793 as a precaution in case not every BIOS implements it. This addresses CVE-2013-6885. Erratum text: [Revision Guide for AMD Family 16h Models 00h-0Fh Processors, document 51810 Rev. 3.04 November 2013] 793 Specific Combination of Writes to Write Combined Memory Types and Locked Instructions May Cause Core Hang Description Under a highly specific and detailed set of internal timing conditions, a locked instruction may trigger a timing sequence whereby the write to a write combined memory type is not flushed, causing the locked instruction to stall indefinitely. Potential Effect on System Processor core hang. Suggested Workaround BIOS should set MSR C001_1020[15] = 1b. Fix Planned No fix planned [ hpa: updated description, fixed typo in MSR name ] Signed-off-by: Borislav Petkov <bp@suse.de> Link: http://lkml.kernel.org/r/20140114230711.GS29865@pd.tnic Tested-by: Aravind Gopalakrishnan <aravind.gopalakrishnan@amd.com> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2014-01-13x86, microcode: Move to a proper locationBorislav Petkov
We've grown a bunch of microcode loader files all prefixed with "microcode_". They should be under cpu/ because this is strictly CPU-related functionality so do that and drop the prefix since they're in their own directory now which gives that prefix. :) While at it, drop MICROCODE_INTEL_LIB config item and stash the functionality under CONFIG_MICROCODE_INTEL as it was its only user. Signed-off-by: Borislav Petkov <bp@suse.de> Tested-by: Aravind Gopalakrishnan <Aravind.Gopalakrishnan@amd.com>
2014-01-13sched/clock, x86: Use a static_key for sched_clock_stablePeter Zijlstra
In order to avoid the runtime condition and variable load turn sched_clock_stable into a static_key. Also provide a shorter implementation of local_clock() and cpu_clock(int) when sched_clock_stable==1. MAINLINE PRE POST sched_clock_stable: 1 1 1 (cold) sched_clock: 329841 221876 215295 (cold) local_clock: 301773 234692 220773 (warm) sched_clock: 38375 25602 25659 (warm) local_clock: 100371 33265 27242 (warm) rdtsc: 27340 24214 24208 sched_clock_stable: 0 0 0 (cold) sched_clock: 382634 235941 237019 (cold) local_clock: 396890 297017 294819 (warm) sched_clock: 38194 25233 25609 (warm) local_clock: 143452 71234 71232 (warm) rdtsc: 27345 24245 24243 Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/n/tip-eummbdechzz37mwmpags1gjr@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-01-13sched/clock, x86: Rewrite cyc2ns() to avoid the need to disable IRQsPeter Zijlstra
Use a ring-buffer like multi-version object structure which allows always having a coherent object; we use this to avoid having to disable IRQs while reading sched_clock() and avoids a problem when getting an NMI while changing the cyc2ns data. MAINLINE PRE POST sched_clock_stable: 1 1 1 (cold) sched_clock: 329841 331312 257223 (cold) local_clock: 301773 310296 309889 (warm) sched_clock: 38375 38247 25280 (warm) local_clock: 100371 102713 85268 (warm) rdtsc: 27340 27289 24247 sched_clock_stable: 0 0 0 (cold) sched_clock: 382634 372706 301224 (cold) local_clock: 396890 399275 399870 (warm) sched_clock: 38194 38124 25630 (warm) local_clock: 143452 148698 129629 (warm) rdtsc: 27345 27365 24307 Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/n/tip-s567in1e5ekq2nlyhn8f987r@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-01-12Merge tag 'ras_for_3.14_p2' of ↵Ingo Molnar
git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras into x86/ras Pull RAS updates from Borislav Petkov: " SCI reporting for other error types not only correctable ones + APEI GHES cleanups + mce timer fix " Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-01-12Merge tag 'v3.13-rc8' into x86/ras, to pick up fixes.Ingo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-01-12x86, mce: Fix mce_start_timer semanticsBorislav Petkov
So mce_start_timer() has a 'cpu' argument which is supposed to mean to start a timer on that cpu. However, the code currently starts a timer on the *current* cpu the function runs on and causes the sanity-check in mce_timer_fn to fire: WARNING: CPU: 0 PID: 0 at arch/x86/kernel/cpu/mcheck/mce.c:1286 mce_timer_fn because it is running on the wrong cpu. This was triggered by Prarit Bhargava <prarit@redhat.com> by offlining all the cpus in succession. Then, we were fiddling with the CMCI storm settings when starting the timer whereas there's no need for that - if there's storm happening on this newly restarted cpu, we're going to be in normal CMCI mode initially and then when the CMCI interrupt starts firing, we're going to go to the polling mode with the timer real soon. Signed-off-by: Borislav Petkov <bp@suse.de> Tested-by: Prarit Bhargava <prarit@redhat.com> Cc: Tony Luck <tony.luck@intel.com> Reviewed-by: Chen, Gong <gong.chen@linux.intel.com> Link: http://lkml.kernel.org/r/1387722156-5511-1-git-send-email-prarit@redhat.com
2014-01-12perf/x86/intel: Add Intel RAPL PP1 energy counter supportStephane Eranian
This patch adds support for the Intel RAPL energy counter PP1 (Power Plane 1). On client processors, it usually corresponds to the energy consumption of the builtin graphic card. That is why the sysfs event is called energy-gpu. New event: - name: power/energy-gpu/ - code: event=0x4 - unit: 2^-32 Joules On processors without graphics, this should count 0. The patch only enables this event on client processors. Reviewed-by: Maria Dimakopoulou <maria.n.dimakopoulou@gmail.com> Signed-off-by: Stephane Eranian <eranian@google.com> Cc: ak@linux.intel.com Cc: acme@redhat.com Cc: jolsa@redhat.com Cc: zheng.z.yan@intel.com Cc: bp@alien8.de Cc: vincent.weaver@maine.edu Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1389176153-3128-3-git-send-email-eranian@google.com Signed-off-by: Ingo Molnar <mingo@kernel.org>