linux-toradex.git/arch/x86/kernel/cpu/perf_event.c, branch v3.0.56

x86, perf: Check that current->mm is alive before getting user callchain

2011-10-03T18:40:09+00:00

commit 20afc60f892d285fde179ead4b24e6a7938c2f1b upstream.

An event may occur when an mm is already released.

I added an event in dequeue_entity() and caught a panic with
the following backtrace:

[  434.421110] BUG: unable to handle kernel NULL pointer dereference at 0000000000000050
[  434.421258] IP: [] __get_user_pages_fast+0x9c/0x120
...
[  434.421258] Call Trace:
[  434.421258]  [] copy_from_user_nmi+0x51/0xf0
[  434.421258]  [] ? sched_clock_local+0x25/0x90
[  434.421258]  [] perf_callchain_user+0x128/0x170
[  434.421258]  [] ? __perf_event_header__init_id+0xed/0x100
[  434.421258]  [] perf_prepare_sample+0x200/0x280
[  434.421258]  [] __perf_event_overflow+0x1b8/0x290
[  434.421258]  [] ? tg_shares_up+0x0/0x670
[  434.421258]  [] ? walk_tg_tree+0x6a/0xb0
[  434.421258]  [] perf_swevent_overflow+0xc4/0xf0
[  434.421258]  [] do_perf_sw_event+0x1e0/0x250
[  434.421258]  [] perf_tp_event+0x44/0x70
[  434.421258]  [] ftrace_profile_sched_block+0xdf/0x110
[  434.421258]  [] dequeue_entity+0x2ad/0x2d0
[  434.421258]  [] dequeue_task_fair+0x1c/0x60
[  434.421258]  [] dequeue_task+0x9a/0xb0
[  434.421258]  [] deactivate_task+0x42/0xe0
[  434.421258]  [] thread_return+0x191/0x808
[  434.421258]  [] ? switch_task_namespaces+0x24/0x60
[  434.421258]  [] do_exit+0x464/0x910
[  434.421258]  [] do_group_exit+0x58/0xd0
[  434.421258]  [] sys_exit_group+0x17/0x20
[  434.421258]  [] system_call_fastpath+0x16/0x1b

Signed-off-by: Andrey Vagin 
Signed-off-by: Peter Zijlstra 
Link: http://lkml.kernel.org/r/1314693156-24131-1-git-send-email-avagin@openvz.org
Signed-off-by: Ingo Molnar 
Signed-off-by: Greg Kroah-Hartman
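
As a rough illustration of the check named in the commit title above, here is a minimal user-space sketch. The types `mm_sketch` and `task_sketch` and the function `callchain_user_sketch` are hypothetical stand-ins, not kernel code; the only idea taken from the commit is that the user callchain walk is skipped when the task's mm is already gone.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's mm_struct and task_struct. */
struct mm_sketch { int users; };
struct task_sketch { struct mm_sketch *mm; };

/*
 * Returns the number of user frames recorded.
 * A task whose mm has been released (mm == NULL, e.g. late in do_exit())
 * yields zero frames instead of attempting to read user memory.
 */
static int callchain_user_sketch(const struct task_sketch *task,
                                 unsigned long *frames, int max_frames)
{
    assert(task != NULL && frames != NULL && max_frames >= 0);

    if (task->mm == NULL)
        return 0;               /* nothing to walk: mm is not alive */

    if (max_frames == 0)
        return 0;

    /* Placeholder for the real frame walk: record one dummy frame. */
    frames[0] = 0x1000UL;
    return 1;
}
```

In the backtrace above, the event fires from dequeue_entity() while the task is inside do_exit(), which is the situation the early return is meant to cover.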

x86: Remove warning and warning_symbol from struct stacktrace_ops

2011-05-12T13:31:28+00:00

Neither warning nor warning_symbol is used anywhere.
Let's get rid of them.

Signed-off-by: Richard Weinberger 
Cc: Oleg Nesterov 
Cc: Andrew Morton 
Cc: Huang Ying 
Cc: Soeren Sandmann Pedersen 
Cc: Namhyung Kim 
Cc: x86 
Cc: H. Peter Anvin 
Cc: Thomas Gleixner 
Cc: Robert Richter 
Cc: Paul Mundt 
Link: http://lkml.kernel.org/r/1305205872-10321-2-git-send-email-richard@nod.at
Signed-off-by: Frederic Weisbecker

Merge branch 'tip/perf/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace into perf/core

2011-05-01T17:09:39+00:00

perf, x86, nmi: Move LVT un-masking into irq handlers

2011-04-27T15:59:11+00:00

It was noticed that P4 machines were generating double NMIs for
each perf event.  These extra NMIs lead to 'Dazed and confused'
messages on the screen.

I tracked this down to a P4 quirk that said the overflow bit had
to be cleared before re-enabling the apic LVT mask.  My first
attempt was to move the un-masking inside the perf nmi handler
from before the chipset NMI handler to after.

This broke Nehalem boxes that seem to like the unmasking before
the counters themselves are re-enabled.

In order to keep this change simple for 2.6.39, I decided to
simply move the apic LVT un-masking to the beginning of all
the chipset NMI handlers, with the exception of Pentium4's to
fix the double NMI issue.

Later on we can move the un-masking to later in the handlers to
save a number of 'extra' NMIs on those particular chipsets.

I tested this change on a P4 machine, an AMD machine, a Nehalem
box, and a core2quad box.  'perf top' worked correctly along
with various other small 'perf record' runs.  Anything high
stress breaks all the machines but that is a different problem.

Thanks to various people for testing different versions of this
patch.

Reported-and-tested-by: Shaun Ruffell 
Signed-off-by: Don Zickus 
Cc: Cyrill Gorcunov 
Link: http://lkml.kernel.org/r/1303900353-10242-1-git-send-email-dzickus@redhat.com
Signed-off-by: Ingo Molnar 
CC: Cyrill Gorcunov

perf, x86: Fix BTS condition

2011-04-26T11:34:34+00:00

Currently the x86 backend incorrectly assumes that any BRANCH_INSN
with sample_period==1 is a BTS request. This is not true when we do
frequency driven profiling such as 'perf record -e branches'.

Solves this error:

  $ perf record -e branches ./array
  Error: sys_perf_event_open() syscall returned with 95 (Operation not supported).

Signed-off-by: Peter Zijlstra 
Reported-by: Ingo Molnar 
Cc: "Metzger, Markus T" 
Cc: Peter Zijlstra 
Cc: Arnaldo Carvalho de Melo 
Cc: Frederic Weisbecker 
Link: http://lkml.kernel.org/n/tip-rd2y4ct71hjawzz6fpvsy9hg@git.kernel.org
Signed-off-by: Ingo Molnar
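
The condition described in the commit above can be sketched as follows. This is a simplified user-space model, not the kernel code: `attr_sketch` and both predicates are hypothetical, and the sketch folds the requested period into one struct. It assumes, as the commit text implies, that a frequency-driven event can also carry a sample period of 1, so the period alone cannot identify a BTS request.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, reduced model of an event's attributes. */
enum hw_event_sketch { HW_CPU_CYCLES, HW_BRANCH_INSTRUCTIONS };

struct attr_sketch {
    enum hw_event_sketch config;
    bool freq;               /* true: frequency-driven sampling */
    uint64_t sample_period;  /* period currently in effect */
};

/* Old behaviour: any branch event with period 1 is treated as BTS. */
static bool is_bts_request_old(const struct attr_sketch *a)
{
    assert(a != NULL);
    return a->config == HW_BRANCH_INSTRUCTIONS && a->sample_period == 1;
}

/* Fixed behaviour: frequency-driven events are never treated as BTS. */
static bool is_bts_request_fixed(const struct attr_sketch *a)
{
    assert(a != NULL);
    return a->config == HW_BRANCH_INSTRUCTIONS &&
           !a->freq && a->sample_period == 1;
}
```

With the old predicate, a frequency-driven 'perf record -e branches' run is misclassified as a BTS request, which matches the "Operation not supported" failure shown in the commit message.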

x86, perf event: Turn off unstructured raw event access to offcore registers

2011-04-22T08:02:53+00:00

Andi Kleen pointed out that the Intel offcore support patches were merged
without user-space tool support for the functionality:

 |
 | The offcore_msr perf kernel code was merged into 2.6.39-rc*, but the
 | user space bits were not. This made it impossible to set the extra mask
 | and actually do the OFFCORE profiling
 |

Andi submitted a preliminary patch for user-space support, as an
extension to perf's raw event syntax:

 |
 | Some raw events -- like the Intel OFFCORE events -- support additional
 | parameters. These can be appended after a ':'.
 |
 | For example on a multi socket Intel Nehalem:
 |
 |    perf stat -e r1b7:20ff -a sleep 1
 |
 | Profile the OFFCORE_RESPONSE.ANY_REQUEST with event mask REMOTE_DRAM_0
 | that measures any access to DRAM on another socket.
 |

But this kind of usability is absolutely unacceptable - users should not
be expected to type in magic, CPU and model specific incantations to get
access to useful hardware functionality.

The proper solution is to expose useful offcore functionality via
generalized events - that way users do not have to care which specific
CPU model they are using, they can use the conceptual event and not some
model specific quirky hex number.

We already have such generalization in place for CPU cache events,
and it's all very extensible.

"Offcore" events measure general DRAM access patterns along various
parameters. They are particularly useful in NUMA systems.

We want to support them via generalized DRAM events: either as the
fourth level of cache (after the last-level cache), or as a separate
generalization category.

That way user-space support would be very obvious, memory access
profiling could be done via self-explanatory commands like:

  perf record -e dram ./myapp
  perf record -e dram-remote ./myapp

... to measure DRAM accesses or more expensive cross-node NUMA DRAM
accesses.

These generalized events would work on all CPUs and architectures that
have comparable PMU features.

( Note, these are just examples: actual implementation could have more
  sophistication and more parameters - as long as they center around
  similarly simple usecases. )

Now we do not want to revert *all* of the current offcore bits, as they
are still somewhat useful for generic last-level-cache events, implemented
in this commit:

  e994d7d23a0b: perf: Fix LLC-* events on Intel Nehalem/Westmere

But we definitely do not yet want to expose the unstructured raw events
to user-space, until better generalization and usability is implemented
for these hardware event features.

( Note: after generalization has been implemented raw offcore events can be
  supported as well: there can always be an odd event that is marginally
  useful but not useful enough to generalize. DRAM profiling is definitely
  *not* such a category so generalization must be done first. )

Furthermore, PERF_TYPE_RAW access to these registers was not intended
to go upstream without proper support - it was a side-effect of the above
e994d7d23a0b commit, not mentioned in the changelog.

As v2.6.39 is nearing release we go for the simplest approach: disable
the PERF_TYPE_RAW offcore hack for now, before it escapes into a released
kernel and becomes an ABI.

Once proper structure is implemented for these hardware events and users
are offered usable solutions we can revisit this issue.

Reported-by: Andi Kleen 
Acked-by: Peter Zijlstra 
Cc: Arnaldo Carvalho de Melo 
Cc: Frederic Weisbecker 
Cc: Thomas Gleixner 
Cc: Linus Torvalds 
Link: http://lkml.kernel.org/r/1302658203-4239-1-git-send-email-andi@firstfloor.org
Signed-off-by: Ingo Molnar

perf, x86: Use ALTERNATIVE() to check for X86_FEATURE_PERFCTR_CORE

2011-04-19T08:08:12+00:00

Using ALTERNATIVE() when checking for X86_FEATURE_PERFCTR_CORE avoids
an extra pointer chase and data cache hit.

Signed-off-by: Robert Richter 
Signed-off-by: Peter Zijlstra 
Link: http://lkml.kernel.org/r/1302913676-14352-4-git-send-email-robert.richter@amd.com
Signed-off-by: Ingo Molnar

Merge branch 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip

2011-03-26T00:53:09+00:00

* 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  perf, x86: Complain louder about BIOSen corrupting CPU/PMU state and continue
  perf, x86: P4 PMU - Read proper MSR register to catch unflagged overflows
  perf symbols: Look at .dynsym again if .symtab not found
  perf build-id: Add quirk to deal with perf.data file format breakage
  perf session: Pass evsel in event_ops->sample()
  perf: Better fit max unprivileged mlock pages for tools needs
  perf_events: Fix stale ->cgrp pointer in update_cgrp_time_from_cpuctx()
  perf top: Fix uninitialized 'counter' variable
  tracing: Fix set_ftrace_filter probe function display
  perf, x86: Fix Intel fixed counters base initialization

perf, x86: Complain louder about BIOSen corrupting CPU/PMU state and continue

2011-03-25T10:23:41+00:00

Eric Dumazet reported that hardware PMU events do not work on his
system, due to the BIOS corrupting PMU state:

    Performance Events: PEBS fmt0+, Core2 events, Broken BIOS detected, using software events only.
    [Firmware Bug]: the BIOS has corrupted hw-PMU resources (MSR 186 is 43003c)

Linus suggested that we continue in the face of such BIOS-induced CPU
state corruption:

   http://lkml.org/lkml/2011/3/24/608

Such BIOSes will have to be fixed - Linux developers rely on a working and
fully capable PMU and the BIOS interfering with the CPU's PMU state is simply
not acceptable.

So this patch changes perf to continue when it detects such BIOS
interaction. Some hardware events may be unreliable due to the BIOS
writing and re-writing them - there's not much the kernel can do
about that but to detect the corruption and report it.

Reported-and-tested-by: Eric Dumazet 
Suggested-by: Linus Torvalds 
Acked-by: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: Arnaldo Carvalho de Melo 
Cc: Frederic Weisbecker 
Cc: Mike Galbraith 
Cc: Steven Rostedt 
LKML-Reference: 
Signed-off-by: Ingo Molnar
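
To make the "MSR 186 is 43003c" message above concrete, here is a small sketch of how such a value can be recognised as a counter that something else has already enabled. The helper names are hypothetical; the enable bit position (bit 22 of an architectural event-select register) is taken from Intel's architectural performance monitoring layout. Whether the kernel's detection does exactly this is an assumption of the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Enable bit of an architectural event-select MSR (bit 22). */
#define EVENTSEL_ENABLE_SKETCH (1ULL << 22)

/* True if the event-select value describes a counter that is running. */
static bool eventsel_in_use(uint64_t val)
{
    return (val & EVENTSEL_ENABLE_SKETCH) != 0;
}

/*
 * Count how many event-select values were found enabled at boot.
 * Per the commit above, a non-zero result should lead to a loud
 * warning, after which perf continues to use the hardware PMU.
 */
static int count_bios_claimed(const uint64_t *eventsel, int n)
{
    int claimed = 0;

    assert(eventsel != NULL || n == 0);
    for (int i = 0; i < n; i++)
        if (eventsel_in_use(eventsel[i]))
            claimed++;
    return claimed;
}
```

The value 0x43003c from the log has bit 22 set, so `eventsel_in_use(0x43003c)` is true; the low byte 0x3c is the event code and bits 16 and 17 select user and kernel mode counting.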

perf, x86: Fix Intel fixed counters base initialization

2011-03-19T18:00:49+00:00

The following patch solves the problems introduced by Robert's
commit 41bf498 and reported by Arun Sharma. This commit gets rid
of the base + index notation for reading and writing PMU msrs.

The problem is that for fixed counters, the new calculation for
the base did not take into account the fixed counter indexes,
thus all fixed counters were read/written from fixed counter 0.
Although all fixed counters share the same config MSR, they each
have their own counter register.

Without:

 $ task -e unhalted_core_cycles -e instructions_retired -e baclears noploop 1
 noploop for 1 seconds

  242202299 unhalted_core_cycles (0.00% scaling, ena=1000790892, run=1000790892)
 2389685946 instructions_retired (0.00% scaling, ena=1000790892, run=1000790892)
      49473 baclears             (0.00% scaling, ena=1000790892, run=1000790892)

With:

 $ task -e unhalted_core_cycles -e instructions_retired -e baclears noploop 1
 noploop for 1 seconds

 2392703238 unhalted_core_cycles (0.00% scaling, ena=1000840809, run=1000840809)
 2389793744 instructions_retired (0.00% scaling, ena=1000840809, run=1000840809)
      47863 baclears             (0.00% scaling, ena=1000840809, run=1000840809)

Signed-off-by: Stephane Eranian 
Cc: peterz@infradead.org
Cc: ming.m.lin@intel.com
Cc: robert.richter@amd.com
Cc: asharma@fb.com
Cc: perfmon2-devel@lists.sf.net
LKML-Reference: <20110319172005.GB4978@quad>
Signed-off-by: Ingo Molnar
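
The base calculation described in the commit above can be sketched as follows. The function names are hypothetical and this is not the kernel's code; the MSR numbers (0x309 for the first fixed counter, 0x38d for the shared fixed-counter control register) and the fixed-counter index offset of 32 are the values used by Intel's architectural perfmon and, to my knowledge, by the x86 perf code of that era.

```c
#include <assert.h>
#include <stdint.h>

#define FIXED_CTR0_MSR_SKETCH     0x309u  /* first fixed counter register */
#define FIXED_CTR_CTRL_MSR_SKETCH 0x38du  /* config MSR shared by all fixed counters */
#define IDX_FIXED_SKETCH          32      /* index of the first fixed counter */

/* Buggy variant: every fixed counter maps to fixed counter 0. */
static uint32_t fixed_event_base_buggy(int idx)
{
    (void)idx;
    return FIXED_CTR0_MSR_SKETCH;
}

/* Fixed variant: each fixed counter gets its own counter register. */
static uint32_t fixed_event_base(int idx)
{
    assert(idx >= IDX_FIXED_SKETCH);
    return FIXED_CTR0_MSR_SKETCH + (uint32_t)(idx - IDX_FIXED_SKETCH);
}

/* The config register is the same for every fixed counter. */
static uint32_t fixed_config_base(int idx)
{
    (void)idx;
    return FIXED_CTR_CTRL_MSR_SKETCH;
}
```

With the buggy variant, indexes 32, 33 and 34 all resolve to 0x309; with the fixed variant they resolve to 0x309, 0x30a and 0x30b while still sharing 0x38d for configuration.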