From 237d28db036e411f22c03cfd5b0f6dc2aa9bf3bc Mon Sep 17 00:00:00 2001 From: "Steven Rostedt (Red Hat)" Date: Mon, 12 Jan 2015 12:12:03 -0500 Subject: ftrace/jprobes/x86: Fix conflict between jprobes and function graph tracing If the function graph tracer traces a jprobe callback, the system will crash. This can easily be demonstrated by compiling the jprobe sample module that is in the kernel tree, loading it and running the function graph tracer. # modprobe jprobe_example.ko # echo function_graph > /sys/kernel/debug/tracing/current_tracer # ls The first two commands end up in a nice crash after the first fork. (do_fork has a jprobe attached to it, so "ls" just triggers that fork) The problem is caused by the jprobe_return() that all jprobe callbacks must end with. The way jprobes works is that the function a jprobe is attached to has a breakpoint placed at the start of it (or it uses ftrace if fentry is supported). The breakpoint handler (or ftrace callback) will copy the stack frame and change the ip address to return to the jprobe handler instead of the function. The jprobe handler must end with jprobe_return() which swaps the stack and does an int3 (breakpoint). This breakpoint handler will then put back the saved stack frame, simulate the instruction at the beginning of the function it added a breakpoint to, and then continue on. For function tracing to work, it hijacks the return address from the stack frame, and replaces it with a hook function that will trace the end of the call. This hook function will restore the return address of the function call. If the function tracer traces the jprobe handler, the hook function for that handler will not be called, and its saved return address will be used for the next function. This will result in a kernel crash. To solve this, pause function tracing before the jprobe handler is called and unpause it before it returns to the function it probed. Some other updates: Used a variable "saved_sp" to hold kcb->jprobe_saved_sp. This makes the code look a bit cleaner and easier to understand (various tries to fix this bug required this change). Note, if fentry is being used, jprobes will change the ip address before the function graph tracer runs and it will not be able to trace the function that the jprobe is probing. Link: http://lkml.kernel.org/r/20150114154329.552437962@goodmis.org Cc: stable@vger.kernel.org # 2.6.30+ Acked-by: Masami Hiramatsu Signed-off-by: Steven Rostedt --- arch/x86/kernel/kprobes/core.c | 20 +++++++++++++++----- 1 file changed, 15 insertions(+), 5 deletions(-) (limited to 'arch/x86/kernel') diff --git a/arch/x86/kernel/kprobes/core.c b/arch/x86/kernel/kprobes/core.c index f7e3cd50ece0..98f654d466e5 100644 --- a/arch/x86/kernel/kprobes/core.c +++ b/arch/x86/kernel/kprobes/core.c @@ -1020,6 +1020,15 @@ int setjmp_pre_handler(struct kprobe *p, struct pt_regs *regs) regs->flags &= ~X86_EFLAGS_IF; trace_hardirqs_off(); regs->ip = (unsigned long)(jp->entry); + + /* + * jprobes use jprobe_return() which skips the normal return + * path of the function, and this messes up the accounting of the + * function graph tracer. + * + * Pause function graph tracing while performing the jprobe function. 
+ */ + pause_graph_tracing(); return 1; } NOKPROBE_SYMBOL(setjmp_pre_handler); @@ -1048,24 +1057,25 @@ int longjmp_break_handler(struct kprobe *p, struct pt_regs *regs) struct kprobe_ctlblk *kcb = get_kprobe_ctlblk(); u8 *addr = (u8 *) (regs->ip - 1); struct jprobe *jp = container_of(p, struct jprobe, kp); + void *saved_sp = kcb->jprobe_saved_sp; if ((addr > (u8 *) jprobe_return) && (addr < (u8 *) jprobe_return_end)) { - if (stack_addr(regs) != kcb->jprobe_saved_sp) { + if (stack_addr(regs) != saved_sp) { struct pt_regs *saved_regs = &kcb->jprobe_saved_regs; printk(KERN_ERR "current sp %p does not match saved sp %p\n", - stack_addr(regs), kcb->jprobe_saved_sp); + stack_addr(regs), saved_sp); printk(KERN_ERR "Saved registers for jprobe %p\n", jp); show_regs(saved_regs); printk(KERN_ERR "Current registers\n"); show_regs(regs); BUG(); } + /* It's OK to start function graph tracing again */ + unpause_graph_tracing(); *regs = kcb->jprobe_saved_regs; - memcpy((kprobe_opcode_t *)(kcb->jprobe_saved_sp), - kcb->jprobes_stack, - MIN_STACK_SIZE(kcb->jprobe_saved_sp)); + memcpy(saved_sp, kcb->jprobes_stack, MIN_STACK_SIZE(saved_sp)); preempt_enable_no_resched(); return 1; } -- cgit v1.2.3 From 433678bdc6ed39f053c55da96b51de5bf0aeebb1 Mon Sep 17 00:00:00 2001 From: Stephane Eranian Date: Tue, 13 Jan 2015 23:59:53 +0100 Subject: perf/rapl: Fix sysfs_show() initialization for RAPL PMU This patch fixes a problem with the initialization of the sysfs_show() routine for the RAPL PMU. The current code was wrongly relying on the EVENT_ATTR_STR() macro which uses the events_sysfs_show() function in the x86 PMU code. That function itself was relying on the x86_pmu data structure. Yet RAPL and the core PMU (x86_pmu) have nothing to do with each other. They should therefore not interact with each other. The x86_pmu structure is initialized at boot time based on the host CPU model. When the host CPU is not supported, the x86_pmu remains uninitialized and some of the callbacks it contains are NULL. The false dependency with x86_pmu could potentially cause crashes in case the x86_pmu is not initialized while the RAPL PMU is. This may, for instance, be the case in virtualized environments. This patch fixes the problem by using a private sysfs_show() routine for exporting the RAPL PMU events. 
Signed-off-by: Stephane Eranian Signed-off-by: Peter Zijlstra (Intel) Link: http://lkml.kernel.org/r/20150113225953.GA21525@thinkpad Cc: vincent.weaver@maine.edu Cc: jolsa@redhat.com Cc: Arnaldo Carvalho de Melo Cc: Linus Torvalds Signed-off-by: Ingo Molnar --- arch/x86/kernel/cpu/perf_event_intel_rapl.c | 44 +++++++++++++++++++++-------- 1 file changed, 32 insertions(+), 12 deletions(-) (limited to 'arch/x86/kernel') diff --git a/arch/x86/kernel/cpu/perf_event_intel_rapl.c b/arch/x86/kernel/cpu/perf_event_intel_rapl.c index 673f930c700f..6e434f8e5fc8 100644 --- a/arch/x86/kernel/cpu/perf_event_intel_rapl.c +++ b/arch/x86/kernel/cpu/perf_event_intel_rapl.c @@ -103,6 +103,13 @@ static struct kobj_attribute format_attr_##_var = \ #define RAPL_CNTR_WIDTH 32 /* 32-bit rapl counters */ +#define RAPL_EVENT_ATTR_STR(_name, v, str) \ +static struct perf_pmu_events_attr event_attr_##v = { \ + .attr = __ATTR(_name, 0444, rapl_sysfs_show, NULL), \ + .id = 0, \ + .event_str = str, \ +}; + struct rapl_pmu { spinlock_t lock; int hw_unit; /* 1/2^hw_unit Joule */ @@ -379,23 +386,36 @@ static struct attribute_group rapl_pmu_attr_group = { .attrs = rapl_pmu_attrs, }; -EVENT_ATTR_STR(energy-cores, rapl_cores, "event=0x01"); -EVENT_ATTR_STR(energy-pkg , rapl_pkg, "event=0x02"); -EVENT_ATTR_STR(energy-ram , rapl_ram, "event=0x03"); -EVENT_ATTR_STR(energy-gpu , rapl_gpu, "event=0x04"); +static ssize_t rapl_sysfs_show(struct device *dev, + struct device_attribute *attr, + char *page) +{ + struct perf_pmu_events_attr *pmu_attr = \ + container_of(attr, struct perf_pmu_events_attr, attr); + + if (pmu_attr->event_str) + return sprintf(page, "%s", pmu_attr->event_str); + + return 0; +} + +RAPL_EVENT_ATTR_STR(energy-cores, rapl_cores, "event=0x01"); +RAPL_EVENT_ATTR_STR(energy-pkg , rapl_pkg, "event=0x02"); +RAPL_EVENT_ATTR_STR(energy-ram , rapl_ram, "event=0x03"); +RAPL_EVENT_ATTR_STR(energy-gpu , rapl_gpu, "event=0x04"); -EVENT_ATTR_STR(energy-cores.unit, rapl_cores_unit, "Joules"); -EVENT_ATTR_STR(energy-pkg.unit , rapl_pkg_unit, "Joules"); -EVENT_ATTR_STR(energy-ram.unit , rapl_ram_unit, "Joules"); -EVENT_ATTR_STR(energy-gpu.unit , rapl_gpu_unit, "Joules"); +RAPL_EVENT_ATTR_STR(energy-cores.unit, rapl_cores_unit, "Joules"); +RAPL_EVENT_ATTR_STR(energy-pkg.unit , rapl_pkg_unit, "Joules"); +RAPL_EVENT_ATTR_STR(energy-ram.unit , rapl_ram_unit, "Joules"); +RAPL_EVENT_ATTR_STR(energy-gpu.unit , rapl_gpu_unit, "Joules"); /* * we compute in 0.23 nJ increments regardless of MSR */ -EVENT_ATTR_STR(energy-cores.scale, rapl_cores_scale, "2.3283064365386962890625e-10"); -EVENT_ATTR_STR(energy-pkg.scale, rapl_pkg_scale, "2.3283064365386962890625e-10"); -EVENT_ATTR_STR(energy-ram.scale, rapl_ram_scale, "2.3283064365386962890625e-10"); -EVENT_ATTR_STR(energy-gpu.scale, rapl_gpu_scale, "2.3283064365386962890625e-10"); +RAPL_EVENT_ATTR_STR(energy-cores.scale, rapl_cores_scale, "2.3283064365386962890625e-10"); +RAPL_EVENT_ATTR_STR(energy-pkg.scale, rapl_pkg_scale, "2.3283064365386962890625e-10"); +RAPL_EVENT_ATTR_STR(energy-ram.scale, rapl_ram_scale, "2.3283064365386962890625e-10"); +RAPL_EVENT_ATTR_STR(energy-gpu.scale, rapl_gpu_scale, "2.3283064365386962890625e-10"); static struct attribute *rapl_events_srv_attr[] = { EVENT_PTR(rapl_cores), -- cgit v1.2.3 From 33636732dcd7cc738a5913bb730d663c6b03c8fb Mon Sep 17 00:00:00 2001 From: Kan Liang Date: Mon, 12 Jan 2015 17:42:21 +0000 Subject: perf/x86/intel: Fix bug for "cycles:p" and "cycles:pp" on SLM cycles:p and cycles:pp do not work on SLM since commit: 86a04461a99f ("perf/x86: 
Revamp PEBS event selection") UOPS_RETIRED.ALL is not a PEBS capable event, so it should not be used to count cycles. Actually, SLM calls intel_pebs_aliases_core2(), which uses INST_RETIRED.ANY_P to count the number of cycles. It's a PEBS capable event. But inv and cmask must be set to count cycles. Considering SLM allows all events as PEBS with no flags, only INST_RETIRED.ANY_P, inv=1, cmask=16 needs to be handled specially. Signed-off-by: Kan Liang Signed-off-by: Peter Zijlstra (Intel) Link: http://lkml.kernel.org/r/1421084541-31639-1-git-send-email-kan.liang@intel.com Cc: Arnaldo Carvalho de Melo Cc: Linus Torvalds Signed-off-by: Ingo Molnar --- arch/x86/kernel/cpu/perf_event_intel_ds.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'arch/x86/kernel') diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c index 3c895d480cd7..073983398364 100644 --- a/arch/x86/kernel/cpu/perf_event_intel_ds.c +++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c @@ -568,8 +568,8 @@ struct event_constraint intel_atom_pebs_event_constraints[] = { }; struct event_constraint intel_slm_pebs_event_constraints[] = { - /* UOPS_RETIRED.ALL, inv=1, cmask=16 (cycles:p). */ - INTEL_FLAGS_EVENT_CONSTRAINT(0x108001c2, 0xf), + /* INST_RETIRED.ANY_P, inv=1, cmask=16 (cycles:p). */ + INTEL_FLAGS_EVENT_CONSTRAINT(0x108000c0, 0x1), /* Allow all events as PEBS with no flags */ INTEL_ALL_EVENT_CONSTRAINT(0, 0x1), EVENT_CONSTRAINT_END -- cgit v1.2.3 From b568b8601f05a591a7ff09d8ee1cedb5b2e815fe Mon Sep 17 00:00:00 2001 From: Jiang Liu Date: Tue, 20 Jan 2015 10:21:05 +0800 Subject: x86/xen: Treat SCI interrupt as normal GSI interrupt Currently Xen Domain0 has special treatment for the ACPI SCI interrupt; that is, it initializes the irq for the ACPI SCI at an early stage in a special way: xen_init_IRQ() ->pci_xen_initial_domain() ->xen_setup_acpi_sci() Allocate and initialize irq for ACPI SCI Function xen_setup_acpi_sci() calls acpi_gsi_to_irq() to get an irq number for the ACPI SCI. But unfortunately acpi_gsi_to_irq() depends on IOAPIC irqdomains through the following path: acpi_gsi_to_irq() ->mp_map_gsi_to_irq() ->mp_map_pin_to_irq() ->check IOAPIC irqdomain For PV domains, Xen event based interrupt management is used and the native IOAPIC is not used, so no irqdomains are created for the IOAPIC. This causes Xen domain0 to fail to install an interrupt handler for the ACPI SCI, and all ACPI events are lost. Please refer to: https://lkml.org/lkml/2014/12/19/178 So the fix is to get rid of the special treatment for the ACPI SCI and just treat it as a normal GSI interrupt: acpi_gsi_to_irq() ->acpi_register_gsi() ->acpi_register_gsi_xen() ->xen_register_gsi() With the above change, there's no need for xen_setup_acpi_sci() anymore. The change also works with bare metal kernels. Signed-off-by: Jiang Liu Tested-by: Sander Eikelenboom Cc: Tony Luck Cc: xen-devel@lists.xenproject.org Cc: Konrad Rzeszutek Wilk Cc: David Vrabel Cc: Rafael J. 
Wysocki Cc: Len Brown Cc: Pavel Machek Cc: Bjorn Helgaas Link: http://lkml.kernel.org/r/1421720467-7709-2-git-send-email-jiang.liu@linux.intel.com Signed-off-by: Thomas Gleixner --- arch/x86/kernel/acpi/boot.c | 26 +++++++++++++------------- 1 file changed, 13 insertions(+), 13 deletions(-) (limited to 'arch/x86/kernel') diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c index d1626364a28a..b9e30daa0881 100644 --- a/arch/x86/kernel/acpi/boot.c +++ b/arch/x86/kernel/acpi/boot.c @@ -611,20 +611,20 @@ void __init acpi_pic_sci_set_trigger(unsigned int irq, u16 trigger) int acpi_gsi_to_irq(u32 gsi, unsigned int *irqp) { - int irq; - - if (acpi_irq_model == ACPI_IRQ_MODEL_PIC) { - *irqp = gsi; - } else { - mutex_lock(&acpi_ioapic_lock); - irq = mp_map_gsi_to_irq(gsi, - IOAPIC_MAP_ALLOC | IOAPIC_MAP_CHECK); - mutex_unlock(&acpi_ioapic_lock); - if (irq < 0) - return -1; - *irqp = irq; + int rc, irq, trigger, polarity; + + rc = acpi_get_override_irq(gsi, &trigger, &polarity); + if (rc == 0) { + trigger = trigger ? ACPI_LEVEL_SENSITIVE : ACPI_EDGE_SENSITIVE; + polarity = polarity ? ACPI_ACTIVE_LOW : ACPI_ACTIVE_HIGH; + irq = acpi_register_gsi(NULL, gsi, trigger, polarity); + if (irq >= 0) { + *irqp = irq; + return 0; + } } - return 0; + + return -1; } EXPORT_SYMBOL_GPL(acpi_gsi_to_irq); -- cgit v1.2.3 From 4a0d3107d6b19125f21172c2b7d95f9c30ecaf6f Mon Sep 17 00:00:00 2001 From: Jan Beulich Date: Fri, 16 Jan 2015 15:47:07 +0000 Subject: x86, irq: Properly tag virtualization entry in /proc/interrupts The mis-naming likely was a copy-and-paste effect. Signed-off-by: Jan Beulich Cc: stable@vger.kernel.org Link: http://lkml.kernel.org/r/54B9408B0200007800055E8B@mail.emea.novell.com Signed-off-by: Thomas Gleixner --- arch/x86/kernel/irq.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'arch/x86/kernel') diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c index 6307a0f0cf17..705ef8d48e2d 100644 --- a/arch/x86/kernel/irq.c +++ b/arch/x86/kernel/irq.c @@ -127,7 +127,7 @@ int arch_show_interrupts(struct seq_file *p, int prec) seq_puts(p, " Machine check polls\n"); #endif #if IS_ENABLED(CONFIG_HYPERV) || defined(CONFIG_XEN) - seq_printf(p, "%*s: ", prec, "THR"); + seq_printf(p, "%*s: ", prec, "HYP"); for_each_online_cpu(j) seq_printf(p, "%10u ", irq_stats(j)->irq_hv_callback_count); seq_puts(p, " Hypervisor callback interrupts\n"); -- cgit v1.2.3 From 32c6590d126836a062b3140ed52d898507987017 Mon Sep 17 00:00:00 2001 From: "K. Y. Srinivasan" Date: Mon, 12 Jan 2015 16:26:02 -0800 Subject: x86, hyperv: Mark the Hyper-V clocksource as being continuous The Hyper-V clocksource is continuous; mark it accordingly. Signed-off-by: K. Y. 
Srinivasan Acked-by: jasowang@redhat.com Cc: gregkh@linuxfoundation.org Cc: devel@linuxdriverproject.org Cc: olaf@aepfle.de Cc: apw@canonical.com Cc: stable@vger.kernel.org Link: http://lkml.kernel.org/r/1421108762-3331-1-git-send-email-kys@microsoft.com Signed-off-by: Thomas Gleixner --- arch/x86/kernel/cpu/mshyperv.c | 1 + 1 file changed, 1 insertion(+) (limited to 'arch/x86/kernel') diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c index a450373e8e91..939155ffdece 100644 --- a/arch/x86/kernel/cpu/mshyperv.c +++ b/arch/x86/kernel/cpu/mshyperv.c @@ -107,6 +107,7 @@ static struct clocksource hyperv_cs = { .rating = 400, /* use this when running on Hyperv*/ .read = read_hv_clock, .mask = CLOCKSOURCE_MASK(64), + .flags = CLOCK_SOURCE_IS_CONTINUOUS, }; static void __init ms_hyperv_init_platform(void) -- cgit v1.2.3 From 3669ef9fa7d35f573ec9c0e0341b29251c2734a7 Mon Sep 17 00:00:00 2001 From: Andy Lutomirski Date: Thu, 22 Jan 2015 11:27:59 -0800 Subject: x86, tls: Interpret an all-zero struct user_desc as "no segment" The Witcher 2 did something like this to allocate a TLS segment index: struct user_desc u_info; bzero(&u_info, sizeof(u_info)); u_info.entry_number = (uint32_t)-1; syscall(SYS_set_thread_area, &u_info); Strictly speaking, this code was never correct. It should have set read_exec_only and seg_not_present to 1 to indicate that it wanted to find a free slot without putting anything there, or it should have put something sensible in the TLS slot if it wanted to allocate a TLS entry for real. The actual effect of this code was to allocate a bogus segment that could be used to exploit espfix. The set_thread_area hardening patches changed the behavior, causing set_thread_area to return -EINVAL and crashing the game. This changes set_thread_area to interpret this as a request to find a free slot and to leave it empty, which isn't *quite* what the game expects but should be close enough to keep it working. In particular, using the code above to allocate two segments will allocate the same segment both times. According to FrostbittenKing on Github, this fixes The Witcher 2. If this somehow still causes problems, we could instead allocate a limit==0 32-bit data segment, but that seems rather ugly to me. Fixes: 41bdc78544b8 x86/tls: Validate TLS entries to protect espfix Signed-off-by: Andy Lutomirski Cc: stable@vger.kernel.org Cc: torvalds@linux-foundation.org Link: http://lkml.kernel.org/r/0cb251abe1ff0958b8e468a9a9a905b80ae3a746.1421954363.git.luto@amacapital.net Signed-off-by: Thomas Gleixner --- arch/x86/kernel/tls.c | 25 +++++++++++++++++++++++-- 1 file changed, 23 insertions(+), 2 deletions(-) (limited to 'arch/x86/kernel') diff --git a/arch/x86/kernel/tls.c b/arch/x86/kernel/tls.c index 4e942f31b1a7..7fc5e843f247 100644 --- a/arch/x86/kernel/tls.c +++ b/arch/x86/kernel/tls.c @@ -29,7 +29,28 @@ static int get_free_idx(void) static bool tls_desc_okay(const struct user_desc *info) { - if (LDT_empty(info)) + /* + * For historical reasons (i.e. no one ever documented how any + * of the segmentation APIs work), user programs can and do + * assume that a struct user_desc that's all zeros except for + * entry_number means "no segment at all". This never actually + * worked. In fact, up to Linux 3.19, a struct user_desc like + * this would create a 16-bit read-write segment with base and + * limit both equal to zero. + * + * That was close enough to "no segment at all" until we + * hardened this function to disallow 16-bit TLS segments. 
Fix + * it up by interpreting these zeroed segments the way that they + * were almost certainly intended to be interpreted. + * + * The correct way to ask for "no segment at all" is to specify + * a user_desc that satisfies LDT_empty. To keep everything + * working, we accept both. + * + * Note that there's a similar kludge in modify_ldt -- look at + * the distinction between modes 1 and 0x11. + */ + if (LDT_empty(info) || LDT_zero(info)) return true; /* @@ -71,7 +92,7 @@ static void set_tls_desc(struct task_struct *p, int idx, cpu = get_cpu(); while (n-- > 0) { - if (LDT_empty(info)) + if (LDT_empty(info) || LDT_zero(info)) desc->a = desc->b = 0; else fill_ldt(desc, info); -- cgit v1.2.3
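For reference, the set_thread_area hardening changelog above describes the strictly correct way for user space to request a free TLS slot without installing a segment: pass a user_desc with entry_number = -1 and with read_exec_only and seg_not_present set to 1, so that it satisfies LDT_empty. A minimal x86 user-space sketch of that request follows; the helper name find_free_tls_slot() is illustrative only and not part of any kernel or libc API.

 #include <string.h>
 #include <stdint.h>
 #include <unistd.h>
 #include <sys/syscall.h>
 #include <asm/ldt.h>            /* struct user_desc */

 /* Ask the kernel for a free TLS slot without installing a segment. */
 static int find_free_tls_slot(void)
 {
         struct user_desc u_info;

         memset(&u_info, 0, sizeof(u_info));
         u_info.entry_number = (uint32_t)-1;   /* "pick a slot for me" */
         u_info.read_exec_only = 1;            /* together with seg_not_present, */
         u_info.seg_not_present = 1;           /* this satisfies LDT_empty() */

         if (syscall(SYS_set_thread_area, &u_info) != 0)
                 return -1;

         /* The kernel writes the chosen slot back into entry_number. */
         return u_info.entry_number;
 }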