<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux-toradex.git/include/uapi/linux/perf_event.h, branch v4.12</title>
<subtitle>Linux kernel for Apalis and Colibri modules</subtitle>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/'/>
<entry>
<title>Merge tag 'powerpc-4.12-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux</title>
<updated>2017-05-05T18:36:44+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2017-05-05T18:36:44+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=7246f60068840847bdcf595be5f0b5ca632736e0'/>
<id>7246f60068840847bdcf595be5f0b5ca632736e0</id>
<content type='text'>
Pull powerpc updates from Michael Ellerman:
 "Highlights include:

   - Larger virtual address space on 64-bit server CPUs. By default we
     use a 128TB virtual address space, but a process can request access
     to the full 512TB by passing a hint to mmap().

   - Support for the new Power9 "XIVE" interrupt controller.

   - TLB flushing optimisations for the radix MMU on Power9.

   - Support for CAPI cards on Power9, using the "Coherent Accelerator
     Interface Architecture 2.0".

   - The ability to configure the mmap randomisation limits at build and
     runtime.

   - Several small fixes and cleanups to the kprobes code, as well as
     support for KPROBES_ON_FTRACE.

   - Major improvements to handling of system reset interrupts,
     correctly treating them as NMIs, giving them a dedicated stack and
     using a new hypervisor call to trigger them, all of which should
     aid debugging and robustness.

   - Many fixes and other minor enhancements.

  Thanks to: Alastair D'Silva, Alexey Kardashevskiy, Alistair Popple,
  Andrew Donnellan, Aneesh Kumar K.V, Anshuman Khandual, Anton
  Blanchard, Balbir Singh, Ben Hutchings, Benjamin Herrenschmidt,
  Bhupesh Sharma, Chris Packham, Christian Zigotzky, Christophe Leroy,
  Christophe Lombard, Daniel Axtens, David Gibson, Gautham R. Shenoy,
  Gavin Shan, Geert Uytterhoeven, Guilherme G. Piccoli, Hamish Martin,
  Hari Bathini, Kees Cook, Laurent Dufour, Madhavan Srinivasan, Mahesh J
  Salgaonkar, Mahesh Salgaonkar, Masami Hiramatsu, Matt Brown, Matthew
  R. Ochs, Michael Neuling, Naveen N. Rao, Nicholas Piggin, Oliver
  O'Halloran, Pan Xinhui, Paul Mackerras, Rashmica Gupta, Russell
  Currey, Sukadev Bhattiprolu, Thadeu Lima de Souza Cascardo, Tobin C.
  Harding, Tyrel Datwyler, Uma Krishnan, Vaibhav Jain, Vipin K Parashar,
  Yang Shi"

* tag 'powerpc-4.12-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (214 commits)
  powerpc/64s: Power9 has no LPCR[VRMASD] field so don't set it
  powerpc/powernv: Fix TCE kill on NVLink2
  powerpc/mm/radix: Drop support for CPUs without lockless tlbie
  powerpc/book3s/mce: Move add_taint() later in virtual mode
  powerpc/sysfs: Move #ifdef CONFIG_HOTPLUG_CPU out of the function body
  powerpc/smp: Document irq enable/disable after migrating IRQs
  powerpc/mpc52xx: Don't select user-visible RTAS_PROC
  powerpc/powernv: Document cxl dependency on special case in pnv_eeh_reset()
  powerpc/eeh: Clean up and document event handling functions
  powerpc/eeh: Avoid use after free in eeh_handle_special_event()
  cxl: Mask slice error interrupts after first occurrence
  cxl: Route eeh events to all drivers in cxl_pci_error_detected()
  cxl: Force context lock during EEH flow
  powerpc/64: Allow CONFIG_RELOCATABLE if COMPILE_TEST
  powerpc/xmon: Teach xmon oops about radix vectors
  powerpc/mm/hash: Fix off-by-one in comment about kernel contexts ids
  powerpc/pseries: Enable VFIO
  powerpc/powernv: Fix iommu table size calculation hook for small tables
  powerpc/powernv: Check kzalloc() return value in pnv_pci_table_alloc
  powerpc: Add arch/powerpc/tools directory
  ...
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Pull powerpc updates from Michael Ellerman:
 "Highlights include:

   - Larger virtual address space on 64-bit server CPUs. By default we
     use a 128TB virtual address space, but a process can request access
     to the full 512TB by passing a hint to mmap().

   - Support for the new Power9 "XIVE" interrupt controller.

   - TLB flushing optimisations for the radix MMU on Power9.

   - Support for CAPI cards on Power9, using the "Coherent Accelerator
     Interface Architecture 2.0".

   - The ability to configure the mmap randomisation limits at build and
     runtime.

   - Several small fixes and cleanups to the kprobes code, as well as
     support for KPROBES_ON_FTRACE.

   - Major improvements to handling of system reset interrupts,
     correctly treating them as NMIs, giving them a dedicated stack and
     using a new hypervisor call to trigger them, all of which should
     aid debugging and robustness.

   - Many fixes and other minor enhancements.

  Thanks to: Alastair D'Silva, Alexey Kardashevskiy, Alistair Popple,
  Andrew Donnellan, Aneesh Kumar K.V, Anshuman Khandual, Anton
  Blanchard, Balbir Singh, Ben Hutchings, Benjamin Herrenschmidt,
  Bhupesh Sharma, Chris Packham, Christian Zigotzky, Christophe Leroy,
  Christophe Lombard, Daniel Axtens, David Gibson, Gautham R. Shenoy,
  Gavin Shan, Geert Uytterhoeven, Guilherme G. Piccoli, Hamish Martin,
  Hari Bathini, Kees Cook, Laurent Dufour, Madhavan Srinivasan, Mahesh J
  Salgaonkar, Mahesh Salgaonkar, Masami Hiramatsu, Matt Brown, Matthew
  R. Ochs, Michael Neuling, Naveen N. Rao, Nicholas Piggin, Oliver
  O'Halloran, Pan Xinhui, Paul Mackerras, Rashmica Gupta, Russell
  Currey, Sukadev Bhattiprolu, Thadeu Lima de Souza Cascardo, Tobin C.
  Harding, Tyrel Datwyler, Uma Krishnan, Vaibhav Jain, Vipin K Parashar,
  Yang Shi"

* tag 'powerpc-4.12-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (214 commits)
  powerpc/64s: Power9 has no LPCR[VRMASD] field so don't set it
  powerpc/powernv: Fix TCE kill on NVLink2
  powerpc/mm/radix: Drop support for CPUs without lockless tlbie
  powerpc/book3s/mce: Move add_taint() later in virtual mode
  powerpc/sysfs: Move #ifdef CONFIG_HOTPLUG_CPU out of the function body
  powerpc/smp: Document irq enable/disable after migrating IRQs
  powerpc/mpc52xx: Don't select user-visible RTAS_PROC
  powerpc/powernv: Document cxl dependency on special case in pnv_eeh_reset()
  powerpc/eeh: Clean up and document event handling functions
  powerpc/eeh: Avoid use after free in eeh_handle_special_event()
  cxl: Mask slice error interrupts after first occurrence
  cxl: Route eeh events to all drivers in cxl_pci_error_detected()
  cxl: Force context lock during EEH flow
  powerpc/64: Allow CONFIG_RELOCATABLE if COMPILE_TEST
  powerpc/xmon: Teach xmon oops about radix vectors
  powerpc/mm/hash: Fix off-by-one in comment about kernel contexts ids
  powerpc/pseries: Enable VFIO
  powerpc/powernv: Fix iommu table size calculation hook for small tables
  powerpc/powernv: Check kzalloc() return value in pnv_pci_table_alloc
  powerpc: Add arch/powerpc/tools directory
  ...
</pre>
</div>
</content>
</entry>
<entry>
<title>powerpc/perf: Define big-endian version of perf_mem_data_src</title>
<updated>2017-04-19T10:00:21+00:00</updated>
<author>
<name>Sukadev Bhattiprolu</name>
<email>sukadev@linux.vnet.ibm.com</email>
</author>
<published>2017-04-11T01:51:05+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=8c5073db0ee680c7e70e123918c9b260e49f757d'/>
<id>8c5073db0ee680c7e70e123918c9b260e49f757d</id>
<content type='text'>
perf_mem_data_src is a union that is initialized in the kernel via the -&gt;val
field and accessed by userspace via the mem_xxx bitfields. For this to work
correctly on big endian platforms, we need a big-endian definition for the
bitfields.

Currently on a big endian system, if a user requests PERF_SAMPLE_DATA_SRC (perf
report -d), they will get the default value from perf_sample_data_init(), which
is PERF_MEM_NA. The value for PERF_MEM_NA is constructed using shifts:

  /* TLB access */
  #define PERF_MEM_TLB_NA		0x01 /* not available */
  ...
  #define PERF_MEM_TLB_SHIFT	26

  #define PERF_MEM_S(a, s) \
	(((__u64)PERF_MEM_##a##_##s) &lt;&lt; PERF_MEM_##a##_SHIFT)

  #define PERF_MEM_NA (PERF_MEM_S(OP, NA)   |\
		    PERF_MEM_S(LVL, NA)   |\
		    PERF_MEM_S(SNOOP, NA) |\
		    PERF_MEM_S(LOCK, NA)  |\
		    PERF_MEM_S(TLB, NA))

Which works out as:

  ((0x01 &lt;&lt; 0) | (0x01 &lt;&lt; 5) | (0x01 &lt;&lt; 19) | (0x01 &lt;&lt; 24) | (0x01 &lt;&lt; 26))

Which means the PERF_MEM_NA value comes out of the kernel as 0x5080021
in CPU endian.

But then in the perf tool, the code uses the bitfields to inspect the value, and
currently the bitfields are defined using little endian ordering.

So eg. in perf_mem__tlb_scnprintf() we see:
  data_src-&gt;val = 0x5080021
             op = 0x0
            lvl = 0x0
          snoop = 0x0
           lock = 0x0
           dtlb = 0x0
           rsvd = 0x5080021

Because of the way the perf tool code is written this is still displayed to the
user as "N/A", so there is no bug visible at the UI level.

Currently there are no big endian architectures which export a meaningful
value (ie. other than PERF_MEM_NA), so the extent of the bug on big endian
platforms is that the PERF_MEM_NA value is exported incorrectly as described
above. Subsequent patches will add support on big endian powerpc for populating
the data source value.

This patch does a minimal fix of adding big endian definition of the bitfields
to match the values that are already exported by the kernel on big endian. And
it makes no change on little endian.

Signed-off-by: Sukadev Bhattiprolu &lt;sukadev@linux.vnet.ibm.com&gt;
Signed-off-by: Madhavan Srinivasan &lt;maddy@linux.vnet.ibm.com&gt;
Acked-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Signed-off-by: Michael Ellerman &lt;mpe@ellerman.id.au&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
perf_mem_data_src is a union that is initialized in the kernel via the -&gt;val
field and accessed by userspace via the mem_xxx bitfields. For this to work
correctly on big endian platforms, we need a big-endian definition for the
bitfields.

Currently on a big endian system, if a user requests PERF_SAMPLE_DATA_SRC (perf
report -d), they will get the default value from perf_sample_data_init(), which
is PERF_MEM_NA. The value for PERF_MEM_NA is constructed using shifts:

  /* TLB access */
  #define PERF_MEM_TLB_NA		0x01 /* not available */
  ...
  #define PERF_MEM_TLB_SHIFT	26

  #define PERF_MEM_S(a, s) \
	(((__u64)PERF_MEM_##a##_##s) &lt;&lt; PERF_MEM_##a##_SHIFT)

  #define PERF_MEM_NA (PERF_MEM_S(OP, NA)   |\
		    PERF_MEM_S(LVL, NA)   |\
		    PERF_MEM_S(SNOOP, NA) |\
		    PERF_MEM_S(LOCK, NA)  |\
		    PERF_MEM_S(TLB, NA))

Which works out as:

  ((0x01 &lt;&lt; 0) | (0x01 &lt;&lt; 5) | (0x01 &lt;&lt; 19) | (0x01 &lt;&lt; 24) | (0x01 &lt;&lt; 26))

Which means the PERF_MEM_NA value comes out of the kernel as 0x5080021
in CPU endian.

But then in the perf tool, the code uses the bitfields to inspect the value, and
currently the bitfields are defined using little endian ordering.

So eg. in perf_mem__tlb_scnprintf() we see:
  data_src-&gt;val = 0x5080021
             op = 0x0
            lvl = 0x0
          snoop = 0x0
           lock = 0x0
           dtlb = 0x0
           rsvd = 0x5080021

Because of the way the perf tool code is written this is still displayed to the
user as "N/A", so there is no bug visible at the UI level.

Currently there are no big endian architectures which export a meaningful
value (ie. other than PERF_MEM_NA), so the extent of the bug on big endian
platforms is that the PERF_MEM_NA value is exported incorrectly as described
above. Subsequent patches will add support on big endian powerpc for populating
the data source value.

This patch does a minimal fix of adding big endian definition of the bitfields
to match the values that are already exported by the kernel on big endian. And
it makes no change on little endian.

Signed-off-by: Sukadev Bhattiprolu &lt;sukadev@linux.vnet.ibm.com&gt;
Signed-off-by: Madhavan Srinivasan &lt;maddy@linux.vnet.ibm.com&gt;
Acked-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Signed-off-by: Michael Ellerman &lt;mpe@ellerman.id.au&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>perf/core: Add a flag for partial AUX records</title>
<updated>2017-03-16T08:51:11+00:00</updated>
<author>
<name>Alexander Shishkin</name>
<email>alexander.shishkin@linux.intel.com</email>
</author>
<published>2017-02-20T13:33:51+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=ae0c2d995d648d5165545d5e05e2869642009b38'/>
<id>ae0c2d995d648d5165545d5e05e2869642009b38</id>
<content type='text'>
The Intel PT driver needs to be able to communicate partial AUX transactions,
that is, transactions with gaps in data for reasons other than no room
left in the buffer (i.e. truncated transactions). Therefore, this condition
does not imply a wakeup for the consumer.

To this end, add a new "partial" AUX flag.

Signed-off-by: Alexander Shishkin &lt;alexander.shishkin@linux.intel.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Cc: Arnaldo Carvalho de Melo &lt;acme@infradead.org&gt;
Cc: Arnaldo Carvalho de Melo &lt;acme@redhat.com&gt;
Cc: Jiri Olsa &lt;jolsa@redhat.com&gt;
Cc: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Cc: Mathieu Poirier &lt;mathieu.poirier@linaro.org&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Stephane Eranian &lt;eranian@google.com&gt;
Cc: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Cc: Vince Weaver &lt;vincent.weaver@maine.edu&gt;
Cc: vince@deater.net
Link: http://lkml.kernel.org/r/20170220133352.17995-4-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
The Intel PT driver needs to be able to communicate partial AUX transactions,
that is, transactions with gaps in data for reasons other than no room
left in the buffer (i.e. truncated transactions). Therefore, this condition
does not imply a wakeup for the consumer.

To this end, add a new "partial" AUX flag.

Signed-off-by: Alexander Shishkin &lt;alexander.shishkin@linux.intel.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Cc: Arnaldo Carvalho de Melo &lt;acme@infradead.org&gt;
Cc: Arnaldo Carvalho de Melo &lt;acme@redhat.com&gt;
Cc: Jiri Olsa &lt;jolsa@redhat.com&gt;
Cc: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Cc: Mathieu Poirier &lt;mathieu.poirier@linaro.org&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Stephane Eranian &lt;eranian@google.com&gt;
Cc: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Cc: Vince Weaver &lt;vincent.weaver@maine.edu&gt;
Cc: vince@deater.net
Link: http://lkml.kernel.org/r/20170220133352.17995-4-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>perf: Add PERF_RECORD_NAMESPACES to include namespaces related info</title>
<updated>2017-03-13T18:57:41+00:00</updated>
<author>
<name>Hari Bathini</name>
<email>hbathini@linux.vnet.ibm.com</email>
</author>
<published>2017-03-07T20:41:36+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=e422267322cd319e2695a535e47c5b1feeac45eb'/>
<id>e422267322cd319e2695a535e47c5b1feeac45eb</id>
<content type='text'>
With the advert of container technologies like docker, that depend on
namespaces for isolation, there is a need for tracing support for
namespaces. This patch introduces new PERF_RECORD_NAMESPACES event for
recording namespaces related info. By recording info for every
namespace, it is left to userspace to take a call on the definition of a
container and trace containers by updating perf tool accordingly.

Each namespace has a combination of device and inode numbers. Though
every namespace has the same device number currently, that may change in
future to avoid the need for a namespace of namespaces. Considering such
possibility, record both device and inode numbers separately for each
namespace.

Signed-off-by: Hari Bathini &lt;hbathini@linux.vnet.ibm.com&gt;
Acked-by: Jiri Olsa &lt;jolsa@kernel.org&gt;
Acked-by: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Alexander Shishkin &lt;alexander.shishkin@linux.intel.com&gt;
Cc: Alexei Starovoitov &lt;ast@fb.com&gt;
Cc: Ananth N Mavinakayanahalli &lt;ananth@linux.vnet.ibm.com&gt;
Cc: Aravinda Prasad &lt;aravinda@linux.vnet.ibm.com&gt;
Cc: Brendan Gregg &lt;brendan.d.gregg@gmail.com&gt;
Cc: Daniel Borkmann &lt;daniel@iogearbox.net&gt;
Cc: Eric Biederman &lt;ebiederm@xmission.com&gt;
Cc: Sargun Dhillon &lt;sargun@sargun.me&gt;
Cc: Steven Rostedt &lt;rostedt@goodmis.org&gt;
Link: http://lkml.kernel.org/r/148891929686.25309.2827618988917007768.stgit@hbathini.in.ibm.com
Signed-off-by: Arnaldo Carvalho de Melo &lt;acme@redhat.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
With the advert of container technologies like docker, that depend on
namespaces for isolation, there is a need for tracing support for
namespaces. This patch introduces new PERF_RECORD_NAMESPACES event for
recording namespaces related info. By recording info for every
namespace, it is left to userspace to take a call on the definition of a
container and trace containers by updating perf tool accordingly.

Each namespace has a combination of device and inode numbers. Though
every namespace has the same device number currently, that may change in
future to avoid the need for a namespace of namespaces. Considering such
possibility, record both device and inode numbers separately for each
namespace.

Signed-off-by: Hari Bathini &lt;hbathini@linux.vnet.ibm.com&gt;
Acked-by: Jiri Olsa &lt;jolsa@kernel.org&gt;
Acked-by: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Alexander Shishkin &lt;alexander.shishkin@linux.intel.com&gt;
Cc: Alexei Starovoitov &lt;ast@fb.com&gt;
Cc: Ananth N Mavinakayanahalli &lt;ananth@linux.vnet.ibm.com&gt;
Cc: Aravinda Prasad &lt;aravinda@linux.vnet.ibm.com&gt;
Cc: Brendan Gregg &lt;brendan.d.gregg@gmail.com&gt;
Cc: Daniel Borkmann &lt;daniel@iogearbox.net&gt;
Cc: Eric Biederman &lt;ebiederm@xmission.com&gt;
Cc: Sargun Dhillon &lt;sargun@sargun.me&gt;
Cc: Steven Rostedt &lt;rostedt@goodmis.org&gt;
Link: http://lkml.kernel.org/r/148891929686.25309.2827618988917007768.stgit@hbathini.in.ibm.com
Signed-off-by: Arnaldo Carvalho de Melo &lt;acme@redhat.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>perf core: Per event callchain limit</title>
<updated>2016-05-30T15:41:44+00:00</updated>
<author>
<name>Arnaldo Carvalho de Melo</name>
<email>acme@redhat.com</email>
</author>
<published>2016-04-28T16:16:33+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=97c79a38cd454602645f0470ffb444b3b75ce574'/>
<id>97c79a38cd454602645f0470ffb444b3b75ce574</id>
<content type='text'>
Additionally to being able to control the system wide maximum depth via
/proc/sys/kernel/perf_event_max_stack, now we are able to ask for
different depths per event, using perf_event_attr.sample_max_stack for
that.

This uses an u16 hole at the end of perf_event_attr, that, when
perf_event_attr.sample_type has the PERF_SAMPLE_CALLCHAIN, if
sample_max_stack is zero, means use perf_event_max_stack, otherwise
it'll be bounds checked under callchain_mutex.

Cc: Adrian Hunter &lt;adrian.hunter@intel.com&gt;
Cc: Alexander Shishkin &lt;alexander.shishkin@linux.intel.com&gt;
Cc: Alexei Starovoitov &lt;ast@kernel.org&gt;
Cc: Brendan Gregg &lt;brendan.d.gregg@gmail.com&gt;
Cc: David Ahern &lt;dsahern@gmail.com&gt;
Cc: Frederic Weisbecker &lt;fweisbec@gmail.com&gt;
Cc: He Kuang &lt;hekuang@huawei.com&gt;
Cc: Jiri Olsa &lt;jolsa@redhat.com&gt;
Cc: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Cc: Masami Hiramatsu &lt;mhiramat@kernel.org&gt;
Cc: Milian Wolff &lt;milian.wolff@kdab.com&gt;
Cc: Namhyung Kim &lt;namhyung@kernel.org&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Stephane Eranian &lt;eranian@google.com&gt;
Cc: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Cc: Vince Weaver &lt;vincent.weaver@maine.edu&gt;
Cc: Wang Nan &lt;wangnan0@huawei.com&gt;
Cc: Zefan Li &lt;lizefan@huawei.com&gt;
Link: http://lkml.kernel.org/n/tip-kolmn1yo40p7jhswxwrc7rrd@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo &lt;acme@redhat.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Additionally to being able to control the system wide maximum depth via
/proc/sys/kernel/perf_event_max_stack, now we are able to ask for
different depths per event, using perf_event_attr.sample_max_stack for
that.

This uses an u16 hole at the end of perf_event_attr, that, when
perf_event_attr.sample_type has the PERF_SAMPLE_CALLCHAIN, if
sample_max_stack is zero, means use perf_event_max_stack, otherwise
it'll be bounds checked under callchain_mutex.

Cc: Adrian Hunter &lt;adrian.hunter@intel.com&gt;
Cc: Alexander Shishkin &lt;alexander.shishkin@linux.intel.com&gt;
Cc: Alexei Starovoitov &lt;ast@kernel.org&gt;
Cc: Brendan Gregg &lt;brendan.d.gregg@gmail.com&gt;
Cc: David Ahern &lt;dsahern@gmail.com&gt;
Cc: Frederic Weisbecker &lt;fweisbec@gmail.com&gt;
Cc: He Kuang &lt;hekuang@huawei.com&gt;
Cc: Jiri Olsa &lt;jolsa@redhat.com&gt;
Cc: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Cc: Masami Hiramatsu &lt;mhiramat@kernel.org&gt;
Cc: Milian Wolff &lt;milian.wolff@kdab.com&gt;
Cc: Namhyung Kim &lt;namhyung@kernel.org&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Stephane Eranian &lt;eranian@google.com&gt;
Cc: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Cc: Vince Weaver &lt;vincent.weaver@maine.edu&gt;
Cc: Wang Nan &lt;wangnan0@huawei.com&gt;
Cc: Zefan Li &lt;lizefan@huawei.com&gt;
Link: http://lkml.kernel.org/n/tip-kolmn1yo40p7jhswxwrc7rrd@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo &lt;acme@redhat.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>perf core: Separate accounting of contexts and real addresses in a stack trace</title>
<updated>2016-05-17T02:11:53+00:00</updated>
<author>
<name>Arnaldo Carvalho de Melo</name>
<email>acme@redhat.com</email>
</author>
<published>2016-05-12T16:06:21+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=c85b03349640b34f3545503c8429fc43005e9a92'/>
<id>c85b03349640b34f3545503c8429fc43005e9a92</id>
<content type='text'>
The perf_sample-&gt;ip_callchain-&gt;nr value includes all the entries in the
ip_callchain-&gt;ip[] array, real addresses and PERF_CONTEXT_{KERNEL,USER,etc},
while what the user expects is that what is in the kernel.perf_event_max_stack
sysctl or in the upcoming per event perf_event_attr.sample_max_stack knob be
honoured in terms of IP addresses in the stack trace.

So allocate a bunch of extra entries for contexts, and do the accounting
via perf_callchain_entry_ctx struct members.

A new sysctl, kernel.perf_event_max_contexts_per_stack is also
introduced for investigating possible bugs in the callchain
implementation by some arch.

Cc: Adrian Hunter &lt;adrian.hunter@intel.com&gt;
Cc: Alexander Shishkin &lt;alexander.shishkin@linux.intel.com&gt;
Cc: Alexei Starovoitov &lt;ast@kernel.org&gt;
Cc: Brendan Gregg &lt;brendan.d.gregg@gmail.com&gt;
Cc: David Ahern &lt;dsahern@gmail.com&gt;
Cc: Frederic Weisbecker &lt;fweisbec@gmail.com&gt;
Cc: He Kuang &lt;hekuang@huawei.com&gt;
Cc: Jiri Olsa &lt;jolsa@redhat.com&gt;
Cc: Masami Hiramatsu &lt;mhiramat@kernel.org&gt;
Cc: Milian Wolff &lt;milian.wolff@kdab.com&gt;
Cc: Namhyung Kim &lt;namhyung@kernel.org&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Stephane Eranian &lt;eranian@google.com&gt;
Cc: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Cc: Vince Weaver &lt;vincent.weaver@maine.edu&gt;
Cc: Wang Nan &lt;wangnan0@huawei.com&gt;
Cc: Zefan Li &lt;lizefan@huawei.com&gt;
Link: http://lkml.kernel.org/n/tip-3b4wnqk340c4sg4gwkfdi9yk@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo &lt;acme@redhat.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
The perf_sample-&gt;ip_callchain-&gt;nr value includes all the entries in the
ip_callchain-&gt;ip[] array, real addresses and PERF_CONTEXT_{KERNEL,USER,etc},
while what the user expects is that what is in the kernel.perf_event_max_stack
sysctl or in the upcoming per event perf_event_attr.sample_max_stack knob be
honoured in terms of IP addresses in the stack trace.

So allocate a bunch of extra entries for contexts, and do the accounting
via perf_callchain_entry_ctx struct members.

A new sysctl, kernel.perf_event_max_contexts_per_stack is also
introduced for investigating possible bugs in the callchain
implementation by some arch.

Cc: Adrian Hunter &lt;adrian.hunter@intel.com&gt;
Cc: Alexander Shishkin &lt;alexander.shishkin@linux.intel.com&gt;
Cc: Alexei Starovoitov &lt;ast@kernel.org&gt;
Cc: Brendan Gregg &lt;brendan.d.gregg@gmail.com&gt;
Cc: David Ahern &lt;dsahern@gmail.com&gt;
Cc: Frederic Weisbecker &lt;fweisbec@gmail.com&gt;
Cc: He Kuang &lt;hekuang@huawei.com&gt;
Cc: Jiri Olsa &lt;jolsa@redhat.com&gt;
Cc: Masami Hiramatsu &lt;mhiramat@kernel.org&gt;
Cc: Milian Wolff &lt;milian.wolff@kdab.com&gt;
Cc: Namhyung Kim &lt;namhyung@kernel.org&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Stephane Eranian &lt;eranian@google.com&gt;
Cc: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Cc: Vince Weaver &lt;vincent.weaver@maine.edu&gt;
Cc: Wang Nan &lt;wangnan0@huawei.com&gt;
Cc: Zefan Li &lt;lizefan@huawei.com&gt;
Link: http://lkml.kernel.org/n/tip-3b4wnqk340c4sg4gwkfdi9yk@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo &lt;acme@redhat.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>perf/core: Add ::write_backward attribute to perf event</title>
<updated>2016-04-23T12:12:39+00:00</updated>
<author>
<name>Wang Nan</name>
<email>wangnan0@huawei.com</email>
</author>
<published>2016-04-05T14:11:18+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=9ecda41acb971ebd07c8fb35faf24005c0baea12'/>
<id>9ecda41acb971ebd07c8fb35faf24005c0baea12</id>
<content type='text'>
This patch introduces 'write_backward' bit to perf_event_attr, which
controls the direction of a ring buffer. After set, the corresponding
ring buffer is written from end to beginning. This feature is design to
support reading from overwritable ring buffer.

Ring buffer can be created by mapping a perf event fd. Kernel puts event
records into ring buffer, user tooling like perf fetch them from
address returned by mmap(). To prevent racing between kernel and tooling,
they communicate to each other through 'head' and 'tail' pointers.
Kernel maintains 'head' pointer, points it to the next free area (tail
of the last record). Tooling maintains 'tail' pointer, points it to the
tail of last consumed record (record has already been fetched). Kernel
determines the available space in a ring buffer using these two
pointers to avoid overwrite unfetched records.

By mapping without 'PROT_WRITE', an overwritable ring buffer is created.
Different from normal ring buffer, tooling is unable to maintain 'tail'
pointer because writing is forbidden. Therefore, for this type of ring
buffers, kernel overwrite old records unconditionally, works like flight
recorder. This feature would be useful if reading from overwritable ring
buffer were as easy as reading from normal ring buffer. However,
there's an obscure problem.

The following figure demonstrates a full overwritable ring buffer. In
this figure, the 'head' pointer points to the end of last record, and a
long record 'E' is pending. For a normal ring buffer, a 'tail' pointer
would have pointed to position (X), so kernel knows there's no more
space in the ring buffer. However, for an overwritable ring buffer,
kernel ignore the 'tail' pointer.

   (X)                              head
    .                                |
    .                                V
    +------+-------+----------+------+---+
    |A....A|B.....B|C........C|D....D|   |
    +------+-------+----------+------+---+

Record 'A' is overwritten by event 'E':

      head
       |
       V
    +--+---+-------+----------+------+---+
    |.E|..A|B.....B|C........C|D....D|E..|
    +--+---+-------+----------+------+---+

Now tooling decides to read from this ring buffer. However, none of these
two natural positions, 'head' and the start of this ring buffer, are
pointing to the head of a record. Even the full ring buffer can be
accessed by tooling, it is unable to find a position to start decoding.

The first attempt tries to solve this problem AFAIK can be found from
[1]. It makes kernel to maintain 'tail' pointer: updates it when ring
buffer is half full. However, this approach introduces overhead to
fast path. Test result shows a 1% overhead [2]. In addition, this method
utilizes no more tham 50% records.

Another attempt can be found from [3], which allows putting the size of
an event at the end of each record. This approach allows tooling to find
records in a backward manner from 'head' pointer by reading size of a
record from its tail. However, because of alignment requirement, it
needs 8 bytes to record the size of a record, which is a huge waste. Its
performance is also not good, because more data need to be written.
This approach also introduces some extra branch instructions to fast
path.

'write_backward' is a better solution to this problem.

Following figure demonstrates the state of the overwritable ring buffer
when 'write_backward' is set before overwriting:

       head
        |
        V
    +---+------+----------+-------+------+
    |   |D....D|C........C|B.....B|A....A|
    +---+------+----------+-------+------+

and after overwriting:
                                     head
                                      |
                                      V
    +---+------+----------+-------+---+--+
    |..E|D....D|C........C|B.....B|A..|E.|
    +---+------+----------+-------+---+--+

In each situation, 'head' points to the beginning of the newest record.
From this record, tooling can iterate over the full ring buffer and fetch
records one by one.

The only limitation that needs to be considered is back-to-back reading.
Due to the non-deterministic of user programs, it is impossible to ensure
the ring buffer keeps stable during reading. Consider an extreme situation:
tooling is scheduled out after reading record 'D', then a burst of events
come, eat up the whole ring buffer (one or multiple rounds). When the
tooling process comes back, reading after 'D' is incorrect now.

To prevent this problem, we need to find a way to ensure the ring buffer
is stable during reading. ioctl(PERF_EVENT_IOC_PAUSE_OUTPUT) is
suggested because its overhead is lower than
ioctl(PERF_EVENT_IOC_ENABLE).

By carefully verifying 'header' pointer, reader can avoid pausing the
ring-buffer. For example:

    /* A union of all possible events */
    union perf_event event;

    p = head = perf_mmap__read_head();
    while (true) {
        /* copy header of next event */
        fetch(&amp;event.header, p, sizeof(event.header));

        /* read 'head' pointer */
        head = perf_mmap__read_head();

        /* check overwritten: is the header good? */
        if (!verify(sizeof(event.header), p, head))
            break;

        /* copy the whole event */
        fetch(&amp;event, p, event.header.size);

        /* read 'head' pointer again */
        head = perf_mmap__read_head();

        /* is the whole event good? */
        if (!verify(event.header.size, p, head))
            break;
        p += event.header.size;
    }

However, the overhead is high because:

 a) In-place decoding is not safe.
    Copying-verifying-decoding is required.
 b) Fetching 'head' pointer requires additional synchronization.

(From Alexei Starovoitov:

Even when this trick works, pause is needed for more than stability of
reading. When we collect the events into overwrite buffer we're waiting
for some other trigger (like all cpu utilization spike or just one cpu
running and all others are idle) and when it happens the buffer has
valuable info from the past. At this point new events are no longer
interesting and buffer should be paused, events read and unpaused until
next trigger comes.)

This patch utilizes event's default overflow_handler introduced
previously. perf_event_output_backward() is created as the default
overflow handler for backward ring buffers. To avoid extra overhead to
fast path, original perf_event_output() becomes __perf_event_output()
and marked '__always_inline'. In theory, there's no extra overhead
introduced to fast path.

Performance testing:

Calling 3000000 times of 'close(-1)', use gettimeofday() to check
duration.  Use 'perf record -o /dev/null -e raw_syscalls:*' to capture
system calls. In ns.

Testing environment:

  CPU    : Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz
  Kernel : v4.5.0
                    MEAN         STDVAR
 BASE            800214.950    2853.083
 PRE1           2253846.700    9997.014
 PRE2           2257495.540    8516.293
 POST           2250896.100    8933.921

Where 'BASE' is pure performance without capturing. 'PRE1' is test
result of pure 'v4.5.0' kernel. 'PRE2' is test result before this
patch. 'POST' is test result after this patch. See [4] for the detailed
experimental setup.

Considering the stdvar, this patch doesn't introduce performance
overhead to the fast path.

 [1] http://lkml.iu.edu/hypermail/linux/kernel/1304.1/04584.html
 [2] http://lkml.iu.edu/hypermail/linux/kernel/1307.1/00535.html
 [3] http://lkml.iu.edu/hypermail/linux/kernel/1512.0/01265.html
 [4] http://lkml.kernel.org/g/56F89DCD.1040202@huawei.com

Signed-off-by: Wang Nan &lt;wangnan0@huawei.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Acked-by: Alexei Starovoitov &lt;ast@kernel.org&gt;
Cc: &lt;acme@kernel.org&gt;
Cc: &lt;pi3orama@163.com&gt;
Cc: Alexander Shishkin &lt;alexander.shishkin@linux.intel.com&gt;
Cc: Arnaldo Carvalho de Melo &lt;acme@redhat.com&gt;
Cc: Brendan Gregg &lt;brendan.d.gregg@gmail.com&gt;
Cc: He Kuang &lt;hekuang@huawei.com&gt;
Cc: Jiri Olsa &lt;jolsa@kernel.org&gt;
Cc: Jiri Olsa &lt;jolsa@redhat.com&gt;
Cc: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Cc: Masami Hiramatsu &lt;masami.hiramatsu.pt@hitachi.com&gt;
Cc: Namhyung Kim &lt;namhyung@kernel.org&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Stephane Eranian &lt;eranian@google.com&gt;
Cc: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Cc: Vince Weaver &lt;vincent.weaver@maine.edu&gt;
Cc: Zefan Li &lt;lizefan@huawei.com&gt;
Link: http://lkml.kernel.org/r/1459865478-53413-1-git-send-email-wangnan0@huawei.com
[ Fixed the changelog some more. ]
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;

Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
This patch introduces 'write_backward' bit to perf_event_attr, which
controls the direction of a ring buffer. After set, the corresponding
ring buffer is written from end to beginning. This feature is design to
support reading from overwritable ring buffer.

Ring buffer can be created by mapping a perf event fd. Kernel puts event
records into ring buffer, user tooling like perf fetch them from
address returned by mmap(). To prevent racing between kernel and tooling,
they communicate to each other through 'head' and 'tail' pointers.
Kernel maintains 'head' pointer, points it to the next free area (tail
of the last record). Tooling maintains 'tail' pointer, points it to the
tail of last consumed record (record has already been fetched). Kernel
determines the available space in a ring buffer using these two
pointers to avoid overwrite unfetched records.

By mapping without 'PROT_WRITE', an overwritable ring buffer is created.
Different from normal ring buffer, tooling is unable to maintain 'tail'
pointer because writing is forbidden. Therefore, for this type of ring
buffers, kernel overwrite old records unconditionally, works like flight
recorder. This feature would be useful if reading from overwritable ring
buffer were as easy as reading from normal ring buffer. However,
there's an obscure problem.

The following figure demonstrates a full overwritable ring buffer. In
this figure, the 'head' pointer points to the end of last record, and a
long record 'E' is pending. For a normal ring buffer, a 'tail' pointer
would have pointed to position (X), so kernel knows there's no more
space in the ring buffer. However, for an overwritable ring buffer,
kernel ignore the 'tail' pointer.

   (X)                              head
    .                                |
    .                                V
    +------+-------+----------+------+---+
    |A....A|B.....B|C........C|D....D|   |
    +------+-------+----------+------+---+

Record 'A' is overwritten by event 'E':

      head
       |
       V
    +--+---+-------+----------+------+---+
    |.E|..A|B.....B|C........C|D....D|E..|
    +--+---+-------+----------+------+---+

Now tooling decides to read from this ring buffer. However, none of these
two natural positions, 'head' and the start of this ring buffer, are
pointing to the head of a record. Even the full ring buffer can be
accessed by tooling, it is unable to find a position to start decoding.

The first attempt tries to solve this problem AFAIK can be found from
[1]. It makes kernel to maintain 'tail' pointer: updates it when ring
buffer is half full. However, this approach introduces overhead to
fast path. Test result shows a 1% overhead [2]. In addition, this method
utilizes no more tham 50% records.

Another attempt can be found from [3], which allows putting the size of
an event at the end of each record. This approach allows tooling to find
records in a backward manner from 'head' pointer by reading size of a
record from its tail. However, because of alignment requirement, it
needs 8 bytes to record the size of a record, which is a huge waste. Its
performance is also not good, because more data need to be written.
This approach also introduces some extra branch instructions to fast
path.

'write_backward' is a better solution to this problem.

Following figure demonstrates the state of the overwritable ring buffer
when 'write_backward' is set before overwriting:

       head
        |
        V
    +---+------+----------+-------+------+
    |   |D....D|C........C|B.....B|A....A|
    +---+------+----------+-------+------+

and after overwriting:
                                     head
                                      |
                                      V
    +---+------+----------+-------+---+--+
    |..E|D....D|C........C|B.....B|A..|E.|
    +---+------+----------+-------+---+--+

In each situation, 'head' points to the beginning of the newest record.
From this record, tooling can iterate over the full ring buffer and fetch
records one by one.

The only limitation that needs to be considered is back-to-back reading.
Due to the non-deterministic of user programs, it is impossible to ensure
the ring buffer keeps stable during reading. Consider an extreme situation:
tooling is scheduled out after reading record 'D', then a burst of events
come, eat up the whole ring buffer (one or multiple rounds). When the
tooling process comes back, reading after 'D' is incorrect now.

To prevent this problem, we need to find a way to ensure the ring buffer
is stable during reading. ioctl(PERF_EVENT_IOC_PAUSE_OUTPUT) is
suggested because its overhead is lower than
ioctl(PERF_EVENT_IOC_ENABLE).

By carefully verifying 'header' pointer, reader can avoid pausing the
ring-buffer. For example:

    /* A union of all possible events */
    union perf_event event;

    p = head = perf_mmap__read_head();
    while (true) {
        /* copy header of next event */
        fetch(&amp;event.header, p, sizeof(event.header));

        /* read 'head' pointer */
        head = perf_mmap__read_head();

        /* check overwritten: is the header good? */
        if (!verify(sizeof(event.header), p, head))
            break;

        /* copy the whole event */
        fetch(&amp;event, p, event.header.size);

        /* read 'head' pointer again */
        head = perf_mmap__read_head();

        /* is the whole event good? */
        if (!verify(event.header.size, p, head))
            break;
        p += event.header.size;
    }

However, the overhead is high because:

 a) In-place decoding is not safe.
    Copying-verifying-decoding is required.
 b) Fetching 'head' pointer requires additional synchronization.

(From Alexei Starovoitov:

Even when this trick works, pause is needed for more than stability of
reading. When we collect the events into overwrite buffer we're waiting
for some other trigger (like all cpu utilization spike or just one cpu
running and all others are idle) and when it happens the buffer has
valuable info from the past. At this point new events are no longer
interesting and buffer should be paused, events read and unpaused until
next trigger comes.)

This patch utilizes event's default overflow_handler introduced
previously. perf_event_output_backward() is created as the default
overflow handler for backward ring buffers. To avoid extra overhead to
fast path, original perf_event_output() becomes __perf_event_output()
and marked '__always_inline'. In theory, there's no extra overhead
introduced to fast path.

Performance testing:

Calling 3000000 times of 'close(-1)', use gettimeofday() to check
duration.  Use 'perf record -o /dev/null -e raw_syscalls:*' to capture
system calls. In ns.

Testing environment:

  CPU    : Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz
  Kernel : v4.5.0
                    MEAN         STDVAR
 BASE            800214.950    2853.083
 PRE1           2253846.700    9997.014
 PRE2           2257495.540    8516.293
 POST           2250896.100    8933.921

Where 'BASE' is pure performance without capturing. 'PRE1' is test
result of pure 'v4.5.0' kernel. 'PRE2' is test result before this
patch. 'POST' is test result after this patch. See [4] for the detailed
experimental setup.

Considering the stdvar, this patch doesn't introduce performance
overhead to the fast path.

 [1] http://lkml.iu.edu/hypermail/linux/kernel/1304.1/04584.html
 [2] http://lkml.iu.edu/hypermail/linux/kernel/1307.1/00535.html
 [3] http://lkml.iu.edu/hypermail/linux/kernel/1512.0/01265.html
 [4] http://lkml.kernel.org/g/56F89DCD.1040202@huawei.com

Signed-off-by: Wang Nan &lt;wangnan0@huawei.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Acked-by: Alexei Starovoitov &lt;ast@kernel.org&gt;
Cc: &lt;acme@kernel.org&gt;
Cc: &lt;pi3orama@163.com&gt;
Cc: Alexander Shishkin &lt;alexander.shishkin@linux.intel.com&gt;
Cc: Arnaldo Carvalho de Melo &lt;acme@redhat.com&gt;
Cc: Brendan Gregg &lt;brendan.d.gregg@gmail.com&gt;
Cc: He Kuang &lt;hekuang@huawei.com&gt;
Cc: Jiri Olsa &lt;jolsa@kernel.org&gt;
Cc: Jiri Olsa &lt;jolsa@redhat.com&gt;
Cc: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Cc: Masami Hiramatsu &lt;masami.hiramatsu.pt@hitachi.com&gt;
Cc: Namhyung Kim &lt;namhyung@kernel.org&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Stephane Eranian &lt;eranian@google.com&gt;
Cc: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Cc: Vince Weaver &lt;vincent.weaver@maine.edu&gt;
Cc: Zefan Li &lt;lizefan@huawei.com&gt;
Link: http://lkml.kernel.org/r/1459865478-53413-1-git-send-email-wangnan0@huawei.com
[ Fixed the changelog some more. ]
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;

Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>perf/ring_buffer: Introduce new ioctl options to pause and resume the ring-buffer</title>
<updated>2016-03-31T08:30:45+00:00</updated>
<author>
<name>Wang Nan</name>
<email>wangnan0@huawei.com</email>
</author>
<published>2016-03-28T06:41:29+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=86e7972f690c1017fd086cdfe53d8524e68c661c'/>
<id>86e7972f690c1017fd086cdfe53d8524e68c661c</id>
<content type='text'>
Add new ioctl() to pause/resume ring-buffer output.

In some situations we want to read from the ring-buffer only when we
ensure nothing can write to the ring-buffer during reading. Without
this patch we have to turn off all events attached to this ring-buffer
to achieve this.

This patch is a prerequisite to enable overwrite support for the
perf ring-buffer support. Following commits will introduce new methods
support reading from overwrite ring buffer. Before reading, caller
must ensure the ring buffer is frozen, or the reading is unreliable.

Signed-off-by: Wang Nan &lt;wangnan0@huawei.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Cc: &lt;pi3orama@163.com&gt;
Cc: Alexander Shishkin &lt;alexander.shishkin@linux.intel.com&gt;
Cc: Alexei Starovoitov &lt;ast@kernel.org&gt;
Cc: Arnaldo Carvalho de Melo &lt;acme@redhat.com&gt;
Cc: Brendan Gregg &lt;brendan.d.gregg@gmail.com&gt;
Cc: He Kuang &lt;hekuang@huawei.com&gt;
Cc: Jiri Olsa &lt;jolsa@kernel.org&gt;
Cc: Jiri Olsa &lt;jolsa@redhat.com&gt;
Cc: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Cc: Masami Hiramatsu &lt;masami.hiramatsu.pt@hitachi.com&gt;
Cc: Namhyung Kim &lt;namhyung@kernel.org&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Stephane Eranian &lt;eranian@google.com&gt;
Cc: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Cc: Vince Weaver &lt;vincent.weaver@maine.edu&gt;
Cc: Zefan Li &lt;lizefan@huawei.com&gt;
Link: http://lkml.kernel.org/r/1459147292-239310-2-git-send-email-wangnan0@huawei.com
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Add new ioctl() to pause/resume ring-buffer output.

In some situations we want to read from the ring-buffer only when we
ensure nothing can write to the ring-buffer during reading. Without
this patch we have to turn off all events attached to this ring-buffer
to achieve this.

This patch is a prerequisite to enable overwrite support for the
perf ring-buffer support. Following commits will introduce new methods
support reading from overwrite ring buffer. Before reading, caller
must ensure the ring buffer is frozen, or the reading is unreliable.

Signed-off-by: Wang Nan &lt;wangnan0@huawei.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Cc: &lt;pi3orama@163.com&gt;
Cc: Alexander Shishkin &lt;alexander.shishkin@linux.intel.com&gt;
Cc: Alexei Starovoitov &lt;ast@kernel.org&gt;
Cc: Arnaldo Carvalho de Melo &lt;acme@redhat.com&gt;
Cc: Brendan Gregg &lt;brendan.d.gregg@gmail.com&gt;
Cc: He Kuang &lt;hekuang@huawei.com&gt;
Cc: Jiri Olsa &lt;jolsa@kernel.org&gt;
Cc: Jiri Olsa &lt;jolsa@redhat.com&gt;
Cc: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Cc: Masami Hiramatsu &lt;masami.hiramatsu.pt@hitachi.com&gt;
Cc: Namhyung Kim &lt;namhyung@kernel.org&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Stephane Eranian &lt;eranian@google.com&gt;
Cc: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Cc: Vince Weaver &lt;vincent.weaver@maine.edu&gt;
Cc: Zefan Li &lt;lizefan@huawei.com&gt;
Link: http://lkml.kernel.org/r/1459147292-239310-2-git-send-email-wangnan0@huawei.com
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>perf/x86: Add option to disable reading branch flags/cycles</title>
<updated>2015-11-23T08:58:25+00:00</updated>
<author>
<name>Andi Kleen</name>
<email>ak@linux.intel.com</email>
</author>
<published>2015-10-20T18:46:34+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=b16a5b52eb90d92b597257778e51e1fdc6423e64'/>
<id>b16a5b52eb90d92b597257778e51e1fdc6423e64</id>
<content type='text'>
With LBRv5 reading the extra LBR flags like mispredict, TSX, cycles is
not free anymore, as it has moved to a separate MSR.

For callstack mode we don't need any of this information; so we can
avoid the unnecessary MSR read. Add flags to the perf interface where
perf record can request not collecting this information.

Add branch_sample_type flags for CYCLES and FLAGS. It's a bit unusual
for branch_sample_types to be negative (disable), not positive (enable),
but since the legacy ABI reported the flags we need some form of
explicit disabling to avoid breaking the ABI.

After we have the flags the x86 perf code can keep track if any users
need the flags. If noone needs it the information is not collected.

This cuts down the cost of LBR callstack on Skylake significantly.
Profiling a kernel build with LBR call stack the average run time of
the PMI handler drops by 43%.

Signed-off-by: Andi Kleen &lt;ak@linux.intel.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Cc: Arnaldo Carvalho de Melo &lt;acme@redhat.com&gt;
Cc: Jiri Olsa &lt;jolsa@redhat.com&gt;
Cc: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Cc: Mike Galbraith &lt;efault@gmx.de&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Stephane Eranian &lt;eranian@google.com&gt;
Cc: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Cc: Vince Weaver &lt;vincent.weaver@maine.edu&gt;
Cc: acme@kernel.org
Cc: jolsa@kernel.org
Link: http://lkml.kernel.org/r/1445366797-30894-2-git-send-email-andi@firstfloor.org
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
With LBRv5 reading the extra LBR flags like mispredict, TSX, cycles is
not free anymore, as it has moved to a separate MSR.

For callstack mode we don't need any of this information; so we can
avoid the unnecessary MSR read. Add flags to the perf interface where
perf record can request not collecting this information.

Add branch_sample_type flags for CYCLES and FLAGS. It's a bit unusual
for branch_sample_types to be negative (disable), not positive (enable),
but since the legacy ABI reported the flags we need some form of
explicit disabling to avoid breaking the ABI.

After we have the flags the x86 perf code can keep track if any users
need the flags. If noone needs it the information is not collected.

This cuts down the cost of LBR callstack on Skylake significantly.
Profiling a kernel build with LBR call stack the average run time of
the PMI handler drops by 43%.

Signed-off-by: Andi Kleen &lt;ak@linux.intel.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Cc: Arnaldo Carvalho de Melo &lt;acme@redhat.com&gt;
Cc: Jiri Olsa &lt;jolsa@redhat.com&gt;
Cc: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Cc: Mike Galbraith &lt;efault@gmx.de&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Stephane Eranian &lt;eranian@google.com&gt;
Cc: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Cc: Vince Weaver &lt;vincent.weaver@maine.edu&gt;
Cc: acme@kernel.org
Cc: jolsa@kernel.org
Link: http://lkml.kernel.org/r/1445366797-30894-2-git-send-email-andi@firstfloor.org
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next</title>
<updated>2015-11-04T17:41:05+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2015-11-04T17:41:05+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=b0f85fa11aefc4f3e03306b4cd47f113bd57dcba'/>
<id>b0f85fa11aefc4f3e03306b4cd47f113bd57dcba</id>
<content type='text'>
Pull networking updates from David Miller:

Changes of note:

 1) Allow to schedule ICMP packets in IPVS, from Alex Gartrell.

 2) Provide FIB table ID in ipv4 route dumps just as ipv6 does, from
    David Ahern.

 3) Allow the user to ask for the statistics to be filtered out of
    ipv4/ipv6 address netlink dumps.  From Sowmini Varadhan.

 4) More work to pass the network namespace context around deep into
    various packet path APIs, starting with the netfilter hooks.  From
    Eric W Biederman.

 5) Add layer 2 TX/RX checksum offloading to qeth driver, from Thomas
    Richter.

 6) Use usec resolution for SYN/ACK RTTs in TCP, from Yuchung Cheng.

 7) Support Very High Throughput in wireless MESH code, from Bob
    Copeland.

 8) Allow setting the ageing_time in switchdev/rocker.  From Scott
    Feldman.

 9) Properly autoload L2TP type modules, from Stephen Hemminger.

10) Fix and enable offload features by default in 8139cp driver, from
    David Woodhouse.

11) Support both ipv4 and ipv6 sockets in a single vxlan device, from
    Jiri Benc.

12) Fix CWND limiting of thin streams in TCP, from Bendik Rønning
    Opstad.

13) Fix IPSEC flowcache overflows on large systems, from Steffen
    Klassert.

14) Convert bridging to track VLANs using rhashtable entries rather than
    a bitmap.  From Nikolay Aleksandrov.

15) Make TCP listener handling completely lockless, this is a major
    accomplishment.  Incoming request sockets now live in the
    established hash table just like any other socket too.

    From Eric Dumazet.

15) Provide more bridging attributes to netlink, from Nikolay
    Aleksandrov.

16) Use hash based algorithm for ipv4 multipath routing, this was very
    long overdue.  From Peter Nørlund.

17) Several y2038 cures, mostly avoiding timespec.  From Arnd Bergmann.

18) Allow non-root execution of EBPF programs, from Alexei Starovoitov.

19) Support SO_INCOMING_CPU as setsockopt, from Eric Dumazet.  This
    influences the port binding selection logic used by SO_REUSEPORT.

20) Add ipv6 support to VRF, from David Ahern.

21) Add support for Mellanox Spectrum switch ASIC, from Jiri Pirko.

22) Add rtl8xxxu Realtek wireless driver, from Jes Sorensen.

23) Implement RACK loss recovery in TCP, from Yuchung Cheng.

24) Support multipath routes in MPLS, from Roopa Prabhu.

25) Fix POLLOUT notification for listening sockets in AF_UNIX, from Eric
    Dumazet.

26) Add new QED Qlogic river, from Yuval Mintz, Manish Chopra, and
    Sudarsana Kalluru.

27) Don't fetch timestamps on AF_UNIX sockets, from Hannes Frederic
    Sowa.

28) Support ipv6 geneve tunnels, from John W Linville.

29) Add flood control support to switchdev layer, from Ido Schimmel.

30) Fix CHECKSUM_PARTIAL handling of potentially fragmented frames, from
    Hannes Frederic Sowa.

31) Support persistent maps and progs in bpf, from Daniel Borkmann.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1790 commits)
  sh_eth: use DMA barriers
  switchdev: respect SKIP_EOPNOTSUPP flag in case there is no recursion
  net: sched: kill dead code in sch_choke.c
  irda: Delete an unnecessary check before the function call "irlmp_unregister_service"
  net: dsa: mv88e6xxx: include DSA ports in VLANs
  net: dsa: mv88e6xxx: disable SA learning for DSA and CPU ports
  net/core: fix for_each_netdev_feature
  vlan: Invoke driver vlan hooks only if device is present
  arcnet/com20020: add LEDS_CLASS dependency
  bpf, verifier: annotate verbose printer with __printf
  dp83640: Only wait for timestamps for packets with timestamping enabled.
  ptp: Change ptp_class to a proper bitmask
  dp83640: Prune rx timestamp list before reading from it
  dp83640: Delay scheduled work.
  dp83640: Include hash in timestamp/packet matching
  ipv6: fix tunnel error handling
  net/mlx5e: Fix LSO vlan insertion
  net/mlx5e: Re-eanble client vlan TX acceleration
  net/mlx5e: Return error in case mlx5e_set_features() fails
  net/mlx5e: Don't allow more than max supported channels
  ...
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Pull networking updates from David Miller:

Changes of note:

 1) Allow to schedule ICMP packets in IPVS, from Alex Gartrell.

 2) Provide FIB table ID in ipv4 route dumps just as ipv6 does, from
    David Ahern.

 3) Allow the user to ask for the statistics to be filtered out of
    ipv4/ipv6 address netlink dumps.  From Sowmini Varadhan.

 4) More work to pass the network namespace context around deep into
    various packet path APIs, starting with the netfilter hooks.  From
    Eric W Biederman.

 5) Add layer 2 TX/RX checksum offloading to qeth driver, from Thomas
    Richter.

 6) Use usec resolution for SYN/ACK RTTs in TCP, from Yuchung Cheng.

 7) Support Very High Throughput in wireless MESH code, from Bob
    Copeland.

 8) Allow setting the ageing_time in switchdev/rocker.  From Scott
    Feldman.

 9) Properly autoload L2TP type modules, from Stephen Hemminger.

10) Fix and enable offload features by default in 8139cp driver, from
    David Woodhouse.

11) Support both ipv4 and ipv6 sockets in a single vxlan device, from
    Jiri Benc.

12) Fix CWND limiting of thin streams in TCP, from Bendik Rønning
    Opstad.

13) Fix IPSEC flowcache overflows on large systems, from Steffen
    Klassert.

14) Convert bridging to track VLANs using rhashtable entries rather than
    a bitmap.  From Nikolay Aleksandrov.

15) Make TCP listener handling completely lockless, this is a major
    accomplishment.  Incoming request sockets now live in the
    established hash table just like any other socket too.

    From Eric Dumazet.

15) Provide more bridging attributes to netlink, from Nikolay
    Aleksandrov.

16) Use hash based algorithm for ipv4 multipath routing, this was very
    long overdue.  From Peter Nørlund.

17) Several y2038 cures, mostly avoiding timespec.  From Arnd Bergmann.

18) Allow non-root execution of EBPF programs, from Alexei Starovoitov.

19) Support SO_INCOMING_CPU as setsockopt, from Eric Dumazet.  This
    influences the port binding selection logic used by SO_REUSEPORT.

20) Add ipv6 support to VRF, from David Ahern.

21) Add support for Mellanox Spectrum switch ASIC, from Jiri Pirko.

22) Add rtl8xxxu Realtek wireless driver, from Jes Sorensen.

23) Implement RACK loss recovery in TCP, from Yuchung Cheng.

24) Support multipath routes in MPLS, from Roopa Prabhu.

25) Fix POLLOUT notification for listening sockets in AF_UNIX, from Eric
    Dumazet.

26) Add new QED Qlogic river, from Yuval Mintz, Manish Chopra, and
    Sudarsana Kalluru.

27) Don't fetch timestamps on AF_UNIX sockets, from Hannes Frederic
    Sowa.

28) Support ipv6 geneve tunnels, from John W Linville.

29) Add flood control support to switchdev layer, from Ido Schimmel.

30) Fix CHECKSUM_PARTIAL handling of potentially fragmented frames, from
    Hannes Frederic Sowa.

31) Support persistent maps and progs in bpf, from Daniel Borkmann.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1790 commits)
  sh_eth: use DMA barriers
  switchdev: respect SKIP_EOPNOTSUPP flag in case there is no recursion
  net: sched: kill dead code in sch_choke.c
  irda: Delete an unnecessary check before the function call "irlmp_unregister_service"
  net: dsa: mv88e6xxx: include DSA ports in VLANs
  net: dsa: mv88e6xxx: disable SA learning for DSA and CPU ports
  net/core: fix for_each_netdev_feature
  vlan: Invoke driver vlan hooks only if device is present
  arcnet/com20020: add LEDS_CLASS dependency
  bpf, verifier: annotate verbose printer with __printf
  dp83640: Only wait for timestamps for packets with timestamping enabled.
  ptp: Change ptp_class to a proper bitmask
  dp83640: Prune rx timestamp list before reading from it
  dp83640: Delay scheduled work.
  dp83640: Include hash in timestamp/packet matching
  ipv6: fix tunnel error handling
  net/mlx5e: Fix LSO vlan insertion
  net/mlx5e: Re-eanble client vlan TX acceleration
  net/mlx5e: Return error in case mlx5e_set_features() fails
  net/mlx5e: Don't allow more than max supported channels
  ...
</pre>
</div>
</content>
</entry>
</feed>
