<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux-toradex.git/kernel, branch toradex_5.3.y</title>
<subtitle>Linux kernel for Apalis and Colibri modules</subtitle>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/'/>
<entry>
<title>tracing: Fix "gfp_t" format for synthetic events</title>
<updated>2019-11-10T10:34:15+00:00</updated>
<author>
<name>Zhengjun Xing</name>
<email>zhengjun.xing@linux.intel.com</email>
</author>
<published>2019-10-18T01:20:34+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=e3d2a341b2de3c4a43e39dcbb01b7af20f0e13f4'/>
<id>e3d2a341b2de3c4a43e39dcbb01b7af20f0e13f4</id>
<content type='text'>
[ Upstream commit 9fa8c9c647be624e91b09ecffa7cd97ee0600b40 ]

In the format of synthetic events, the "gfp_t" is shown as "signed:1",
but in fact the "gfp_t" is "unsigned", should be shown as "signed:0".

The issue can be reproduced by the following commands:

echo 'memlatency u64 lat; unsigned int order; gfp_t gfp_flags; int migratetype' &gt; /sys/kernel/debug/tracing/synthetic_events
cat  /sys/kernel/debug/tracing/events/synthetic/memlatency/format

name: memlatency
ID: 2233
format:
        field:unsigned short common_type;       offset:0;       size:2; signed:0;
        field:unsigned char common_flags;       offset:2;       size:1; signed:0;
        field:unsigned char common_preempt_count;       offset:3;       size:1; signed:0;
        field:int common_pid;   offset:4;       size:4; signed:1;

        field:u64 lat;  offset:8;       size:8; signed:0;
        field:unsigned int order;       offset:16;      size:4; signed:0;
        field:gfp_t gfp_flags;  offset:24;      size:4; signed:1;
        field:int migratetype;  offset:32;      size:4; signed:1;

print fmt: "lat=%llu, order=%u, gfp_flags=%x, migratetype=%d", REC-&gt;lat, REC-&gt;order, REC-&gt;gfp_flags, REC-&gt;migratetype

Link: http://lkml.kernel.org/r/20191018012034.6404-1-zhengjun.xing@linux.intel.com

Reviewed-by: Tom Zanussi &lt;tom.zanussi@linux.intel.com&gt;
Signed-off-by: Zhengjun Xing &lt;zhengjun.xing@linux.intel.com&gt;
Signed-off-by: Steven Rostedt (VMware) &lt;rostedt@goodmis.org&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
[ Upstream commit 9fa8c9c647be624e91b09ecffa7cd97ee0600b40 ]

In the format of synthetic events, the "gfp_t" is shown as "signed:1",
but in fact the "gfp_t" is "unsigned", should be shown as "signed:0".

The issue can be reproduced by the following commands:

echo 'memlatency u64 lat; unsigned int order; gfp_t gfp_flags; int migratetype' &gt; /sys/kernel/debug/tracing/synthetic_events
cat  /sys/kernel/debug/tracing/events/synthetic/memlatency/format

name: memlatency
ID: 2233
format:
        field:unsigned short common_type;       offset:0;       size:2; signed:0;
        field:unsigned char common_flags;       offset:2;       size:1; signed:0;
        field:unsigned char common_preempt_count;       offset:3;       size:1; signed:0;
        field:int common_pid;   offset:4;       size:4; signed:1;

        field:u64 lat;  offset:8;       size:8; signed:0;
        field:unsigned int order;       offset:16;      size:4; signed:0;
        field:gfp_t gfp_flags;  offset:24;      size:4; signed:1;
        field:int migratetype;  offset:32;      size:4; signed:1;

print fmt: "lat=%llu, order=%u, gfp_flags=%x, migratetype=%d", REC-&gt;lat, REC-&gt;order, REC-&gt;gfp_flags, REC-&gt;migratetype

Link: http://lkml.kernel.org/r/20191018012034.6404-1-zhengjun.xing@linux.intel.com

Reviewed-by: Tom Zanussi &lt;tom.zanussi@linux.intel.com&gt;
Signed-off-by: Zhengjun Xing &lt;zhengjun.xing@linux.intel.com&gt;
Signed-off-by: Steven Rostedt (VMware) &lt;rostedt@goodmis.org&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>sched/fair: Fix -Wunused-but-set-variable warnings</title>
<updated>2019-11-06T12:09:34+00:00</updated>
<author>
<name>Qian Cai</name>
<email>cai@lca.pw</email>
</author>
<published>2019-08-20T18:40:55+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=f909e9a21ac4389cbe93f1ce737c187a2f320515'/>
<id>f909e9a21ac4389cbe93f1ce737c187a2f320515</id>
<content type='text'>
[ Upstream commit 763a9ec06c409dcde2a761aac4bb83ff3938e0b3 ]

Commit:

   de53fd7aedb1 ("sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices")

introduced a few compilation warnings:

  kernel/sched/fair.c: In function '__refill_cfs_bandwidth_runtime':
  kernel/sched/fair.c:4365:6: warning: variable 'now' set but not used [-Wunused-but-set-variable]
  kernel/sched/fair.c: In function 'start_cfs_bandwidth':
  kernel/sched/fair.c:4992:6: warning: variable 'overrun' set but not used [-Wunused-but-set-variable]

Also, __refill_cfs_bandwidth_runtime() does no longer update the
expiration time, so fix the comments accordingly.

Signed-off-by: Qian Cai &lt;cai@lca.pw&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Reviewed-by: Ben Segall &lt;bsegall@google.com&gt;
Reviewed-by: Dave Chiluk &lt;chiluk+linux@indeed.com&gt;
Cc: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Cc: pauld@redhat.com
Fixes: de53fd7aedb1 ("sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices")
Link: https://lkml.kernel.org/r/1566326455-8038-1-git-send-email-cai@lca.pw
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
[ Upstream commit 763a9ec06c409dcde2a761aac4bb83ff3938e0b3 ]

Commit:

   de53fd7aedb1 ("sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices")

introduced a few compilation warnings:

  kernel/sched/fair.c: In function '__refill_cfs_bandwidth_runtime':
  kernel/sched/fair.c:4365:6: warning: variable 'now' set but not used [-Wunused-but-set-variable]
  kernel/sched/fair.c: In function 'start_cfs_bandwidth':
  kernel/sched/fair.c:4992:6: warning: variable 'overrun' set but not used [-Wunused-but-set-variable]

Also, __refill_cfs_bandwidth_runtime() does no longer update the
expiration time, so fix the comments accordingly.

Signed-off-by: Qian Cai &lt;cai@lca.pw&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Reviewed-by: Ben Segall &lt;bsegall@google.com&gt;
Reviewed-by: Dave Chiluk &lt;chiluk+linux@indeed.com&gt;
Cc: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Cc: pauld@redhat.com
Fixes: de53fd7aedb1 ("sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices")
Link: https://lkml.kernel.org/r/1566326455-8038-1-git-send-email-cai@lca.pw
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices</title>
<updated>2019-11-06T12:09:31+00:00</updated>
<author>
<name>Dave Chiluk</name>
<email>chiluk+linux@indeed.com</email>
</author>
<published>2019-07-23T16:44:26+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=2bd11026299bdcb6d4b751d584913433b6d9b0b6'/>
<id>2bd11026299bdcb6d4b751d584913433b6d9b0b6</id>
<content type='text'>
commit de53fd7aedb100f03e5d2231cfce0e4993282425 upstream.

It has been observed, that highly-threaded, non-cpu-bound applications
running under cpu.cfs_quota_us constraints can hit a high percentage of
periods throttled while simultaneously not consuming the allocated
amount of quota. This use case is typical of user-interactive non-cpu
bound applications, such as those running in kubernetes or mesos when
run on multiple cpu cores.

This has been root caused to cpu-local run queue being allocated per cpu
bandwidth slices, and then not fully using that slice within the period.
At which point the slice and quota expires. This expiration of unused
slice results in applications not being able to utilize the quota for
which they are allocated.

The non-expiration of per-cpu slices was recently fixed by
'commit 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift
condition")'. Prior to that it appears that this had been broken since
at least 'commit 51f2176d74ac ("sched/fair: Fix unlocked reads of some
cfs_b-&gt;quota/period")' which was introduced in v3.16-rc1 in 2014. That
added the following conditional which resulted in slices never being
expired.

if (cfs_rq-&gt;runtime_expires != cfs_b-&gt;runtime_expires) {
	/* extend local deadline, drift is bounded above by 2 ticks */
	cfs_rq-&gt;runtime_expires += TICK_NSEC;

Because this was broken for nearly 5 years, and has recently been fixed
and is now being noticed by many users running kubernetes
(https://github.com/kubernetes/kubernetes/issues/67577) it is my opinion
that the mechanisms around expiring runtime should be removed
altogether.

This allows quota already allocated to per-cpu run-queues to live longer
than the period boundary. This allows threads on runqueues that do not
use much CPU to continue to use their remaining slice over a longer
period of time than cpu.cfs_period_us. However, this helps prevent the
above condition of hitting throttling while also not fully utilizing
your cpu quota.

This theoretically allows a machine to use slightly more than its
allotted quota in some periods. This overflow would be bounded by the
remaining quota left on each per-cpu runqueueu. This is typically no
more than min_cfs_rq_runtime=1ms per cpu. For CPU bound tasks this will
change nothing, as they should theoretically fully utilize all of their
quota in each period. For user-interactive tasks as described above this
provides a much better user/application experience as their cpu
utilization will more closely match the amount they requested when they
hit throttling. This means that cpu limits no longer strictly apply per
period for non-cpu bound applications, but that they are still accurate
over longer timeframes.

This greatly improves performance of high-thread-count, non-cpu bound
applications with low cfs_quota_us allocation on high-core-count
machines. In the case of an artificial testcase (10ms/100ms of quota on
80 CPU machine), this commit resulted in almost 30x performance
improvement, while still maintaining correct cpu quota restrictions.
That testcase is available at https://github.com/indeedeng/fibtest.

Fixes: 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift condition")
Signed-off-by: Dave Chiluk &lt;chiluk+linux@indeed.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Reviewed-by: Phil Auld &lt;pauld@redhat.com&gt;
Reviewed-by: Ben Segall &lt;bsegall@google.com&gt;
Cc: Ingo Molnar &lt;mingo@redhat.com&gt;
Cc: John Hammond &lt;jhammond@indeed.com&gt;
Cc: Jonathan Corbet &lt;corbet@lwn.net&gt;
Cc: Kyle Anderson &lt;kwa@yelp.com&gt;
Cc: Gabriel Munos &lt;gmunoz@netflix.com&gt;
Cc: Peter Oskolkov &lt;posk@posk.io&gt;
Cc: Cong Wang &lt;xiyou.wangcong@gmail.com&gt;
Cc: Brendan Gregg &lt;bgregg@netflix.com&gt;
Link: https://lkml.kernel.org/r/1563900266-19734-2-git-send-email-chiluk+linux@indeed.com
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
commit de53fd7aedb100f03e5d2231cfce0e4993282425 upstream.

It has been observed, that highly-threaded, non-cpu-bound applications
running under cpu.cfs_quota_us constraints can hit a high percentage of
periods throttled while simultaneously not consuming the allocated
amount of quota. This use case is typical of user-interactive non-cpu
bound applications, such as those running in kubernetes or mesos when
run on multiple cpu cores.

This has been root caused to cpu-local run queue being allocated per cpu
bandwidth slices, and then not fully using that slice within the period.
At which point the slice and quota expires. This expiration of unused
slice results in applications not being able to utilize the quota for
which they are allocated.

The non-expiration of per-cpu slices was recently fixed by
'commit 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift
condition")'. Prior to that it appears that this had been broken since
at least 'commit 51f2176d74ac ("sched/fair: Fix unlocked reads of some
cfs_b-&gt;quota/period")' which was introduced in v3.16-rc1 in 2014. That
added the following conditional which resulted in slices never being
expired.

if (cfs_rq-&gt;runtime_expires != cfs_b-&gt;runtime_expires) {
	/* extend local deadline, drift is bounded above by 2 ticks */
	cfs_rq-&gt;runtime_expires += TICK_NSEC;

Because this was broken for nearly 5 years, and has recently been fixed
and is now being noticed by many users running kubernetes
(https://github.com/kubernetes/kubernetes/issues/67577) it is my opinion
that the mechanisms around expiring runtime should be removed
altogether.

This allows quota already allocated to per-cpu run-queues to live longer
than the period boundary. This allows threads on runqueues that do not
use much CPU to continue to use their remaining slice over a longer
period of time than cpu.cfs_period_us. However, this helps prevent the
above condition of hitting throttling while also not fully utilizing
your cpu quota.

This theoretically allows a machine to use slightly more than its
allotted quota in some periods. This overflow would be bounded by the
remaining quota left on each per-cpu runqueueu. This is typically no
more than min_cfs_rq_runtime=1ms per cpu. For CPU bound tasks this will
change nothing, as they should theoretically fully utilize all of their
quota in each period. For user-interactive tasks as described above this
provides a much better user/application experience as their cpu
utilization will more closely match the amount they requested when they
hit throttling. This means that cpu limits no longer strictly apply per
period for non-cpu bound applications, but that they are still accurate
over longer timeframes.

This greatly improves performance of high-thread-count, non-cpu bound
applications with low cfs_quota_us allocation on high-core-count
machines. In the case of an artificial testcase (10ms/100ms of quota on
80 CPU machine), this commit resulted in almost 30x performance
improvement, while still maintaining correct cpu quota restrictions.
That testcase is available at https://github.com/indeedeng/fibtest.

Fixes: 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift condition")
Signed-off-by: Dave Chiluk &lt;chiluk+linux@indeed.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Reviewed-by: Phil Auld &lt;pauld@redhat.com&gt;
Reviewed-by: Ben Segall &lt;bsegall@google.com&gt;
Cc: Ingo Molnar &lt;mingo@redhat.com&gt;
Cc: John Hammond &lt;jhammond@indeed.com&gt;
Cc: Jonathan Corbet &lt;corbet@lwn.net&gt;
Cc: Kyle Anderson &lt;kwa@yelp.com&gt;
Cc: Gabriel Munos &lt;gmunoz@netflix.com&gt;
Cc: Peter Oskolkov &lt;posk@posk.io&gt;
Cc: Cong Wang &lt;xiyou.wangcong@gmail.com&gt;
Cc: Brendan Gregg &lt;bgregg@netflix.com&gt;
Link: https://lkml.kernel.org/r/1563900266-19734-2-git-send-email-chiluk+linux@indeed.com
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</pre>
</div>
</content>
</entry>
<entry>
<title>perf/aux: Fix tracking of auxiliary trace buffer allocation</title>
<updated>2019-11-06T12:08:49+00:00</updated>
<author>
<name>Thomas Richter</name>
<email>tmricht@linux.ibm.com</email>
</author>
<published>2019-10-21T08:33:54+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=724337746b1a1a5519e10b536e8d32f32312f6aa'/>
<id>724337746b1a1a5519e10b536e8d32f32312f6aa</id>
<content type='text'>
[ Upstream commit 5e6c3c7b1ec217c1c4c95d9148182302b9969b97 ]

The following commit from the v5.4 merge window:

  d44248a41337 ("perf/core: Rework memory accounting in perf_mmap()")

... breaks auxiliary trace buffer tracking.

If I run command 'perf record -e rbd000' to record samples and saving
them in the **auxiliary** trace buffer then the value of 'locked_vm' becomes
negative after all trace buffers have been allocated and released:

During allocation the values increase:

  [52.250027] perf_mmap user-&gt;locked_vm:0x87 pinned_vm:0x0 ret:0
  [52.250115] perf_mmap user-&gt;locked_vm:0x107 pinned_vm:0x0 ret:0
  [52.250251] perf_mmap user-&gt;locked_vm:0x188 pinned_vm:0x0 ret:0
  [52.250326] perf_mmap user-&gt;locked_vm:0x208 pinned_vm:0x0 ret:0
  [52.250441] perf_mmap user-&gt;locked_vm:0x289 pinned_vm:0x0 ret:0
  [52.250498] perf_mmap user-&gt;locked_vm:0x309 pinned_vm:0x0 ret:0
  [52.250613] perf_mmap user-&gt;locked_vm:0x38a pinned_vm:0x0 ret:0
  [52.250715] perf_mmap user-&gt;locked_vm:0x408 pinned_vm:0x2 ret:0
  [52.250834] perf_mmap user-&gt;locked_vm:0x408 pinned_vm:0x83 ret:0
  [52.250915] perf_mmap user-&gt;locked_vm:0x408 pinned_vm:0x103 ret:0
  [52.251061] perf_mmap user-&gt;locked_vm:0x408 pinned_vm:0x184 ret:0
  [52.251146] perf_mmap user-&gt;locked_vm:0x408 pinned_vm:0x204 ret:0
  [52.251299] perf_mmap user-&gt;locked_vm:0x408 pinned_vm:0x285 ret:0
  [52.251383] perf_mmap user-&gt;locked_vm:0x408 pinned_vm:0x305 ret:0
  [52.251544] perf_mmap user-&gt;locked_vm:0x408 pinned_vm:0x386 ret:0
  [52.251634] perf_mmap user-&gt;locked_vm:0x408 pinned_vm:0x406 ret:0
  [52.253018] perf_mmap user-&gt;locked_vm:0x408 pinned_vm:0x487 ret:0
  [52.253197] perf_mmap user-&gt;locked_vm:0x408 pinned_vm:0x508 ret:0
  [52.253374] perf_mmap user-&gt;locked_vm:0x408 pinned_vm:0x589 ret:0
  [52.253550] perf_mmap user-&gt;locked_vm:0x408 pinned_vm:0x60a ret:0
  [52.253726] perf_mmap user-&gt;locked_vm:0x408 pinned_vm:0x68b ret:0
  [52.253903] perf_mmap user-&gt;locked_vm:0x408 pinned_vm:0x70c ret:0
  [52.254084] perf_mmap user-&gt;locked_vm:0x408 pinned_vm:0x78d ret:0
  [52.254263] perf_mmap user-&gt;locked_vm:0x408 pinned_vm:0x80e ret:0

The value of user-&gt;locked_vm increases to a limit then the memory
is tracked by pinned_vm.

During deallocation the size is subtracted from pinned_vm until
it hits a limit. Then a larger value is subtracted from locked_vm
leading to a large number (because of type unsigned):

  [64.267797] perf_mmap_close mmap_user-&gt;locked_vm:0x408 pinned_vm:0x78d
  [64.267826] perf_mmap_close mmap_user-&gt;locked_vm:0x408 pinned_vm:0x70c
  [64.267848] perf_mmap_close mmap_user-&gt;locked_vm:0x408 pinned_vm:0x68b
  [64.267869] perf_mmap_close mmap_user-&gt;locked_vm:0x408 pinned_vm:0x60a
  [64.267891] perf_mmap_close mmap_user-&gt;locked_vm:0x408 pinned_vm:0x589
  [64.267911] perf_mmap_close mmap_user-&gt;locked_vm:0x408 pinned_vm:0x508
  [64.267933] perf_mmap_close mmap_user-&gt;locked_vm:0x408 pinned_vm:0x487
  [64.267952] perf_mmap_close mmap_user-&gt;locked_vm:0x408 pinned_vm:0x406
  [64.268883] perf_mmap_close mmap_user-&gt;locked_vm:0x307 pinned_vm:0x406
  [64.269117] perf_mmap_close mmap_user-&gt;locked_vm:0x206 pinned_vm:0x406
  [64.269433] perf_mmap_close mmap_user-&gt;locked_vm:0x105 pinned_vm:0x406
  [64.269536] perf_mmap_close mmap_user-&gt;locked_vm:0x4 pinned_vm:0x404
  [64.269797] perf_mmap_close mmap_user-&gt;locked_vm:0xffffffffffffff84 pinned_vm:0x303
  [64.270105] perf_mmap_close mmap_user-&gt;locked_vm:0xffffffffffffff04 pinned_vm:0x202
  [64.270374] perf_mmap_close mmap_user-&gt;locked_vm:0xfffffffffffffe84 pinned_vm:0x101
  [64.270628] perf_mmap_close mmap_user-&gt;locked_vm:0xfffffffffffffe04 pinned_vm:0x0

This value sticks for the user until system is rebooted, causing
follow-on system calls using locked_vm resource limit to fail.

Note: There is no issue using the normal trace buffer.

In fact the issue is in perf_mmap_close(). During allocation auxiliary
trace buffer memory is either traced as 'extra' and added to 'pinned_vm'
or trace as 'user_extra' and added to 'locked_vm'. This applies for
normal trace buffers and auxiliary trace buffer.

However in function perf_mmap_close() all auxiliary trace buffer is
subtraced from 'locked_vm' and never from 'pinned_vm'. This breaks the
ballance.

Signed-off-by: Thomas Richter &lt;tmricht@linux.ibm.com&gt;
Acked-by: Peter Zijlstra &lt;a.p.zijlstra@chello.nl&gt;
Cc: Alexander Shishkin &lt;alexander.shishkin@linux.intel.com&gt;
Cc: Arnaldo Carvalho de Melo &lt;acme@redhat.com&gt;
Cc: Jiri Olsa &lt;jolsa@redhat.com&gt;
Cc: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Cc: Mark Rutland &lt;mark.rutland@arm.com&gt;
Cc: Namhyung Kim &lt;namhyung@kernel.org&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Cc: acme@kernel.org
Cc: gor@linux.ibm.com
Cc: hechaol@fb.com
Cc: heiko.carstens@de.ibm.com
Cc: linux-perf-users@vger.kernel.org
Cc: songliubraving@fb.com
Fixes: d44248a41337 ("perf/core: Rework memory accounting in perf_mmap()")
Link: https://lkml.kernel.org/r/20191021083354.67868-1-tmricht@linux.ibm.com
[ Minor readability edits. ]
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
[ Upstream commit 5e6c3c7b1ec217c1c4c95d9148182302b9969b97 ]

The following commit from the v5.4 merge window:

  d44248a41337 ("perf/core: Rework memory accounting in perf_mmap()")

... breaks auxiliary trace buffer tracking.

If I run command 'perf record -e rbd000' to record samples and saving
them in the **auxiliary** trace buffer then the value of 'locked_vm' becomes
negative after all trace buffers have been allocated and released:

During allocation the values increase:

  [52.250027] perf_mmap user-&gt;locked_vm:0x87 pinned_vm:0x0 ret:0
  [52.250115] perf_mmap user-&gt;locked_vm:0x107 pinned_vm:0x0 ret:0
  [52.250251] perf_mmap user-&gt;locked_vm:0x188 pinned_vm:0x0 ret:0
  [52.250326] perf_mmap user-&gt;locked_vm:0x208 pinned_vm:0x0 ret:0
  [52.250441] perf_mmap user-&gt;locked_vm:0x289 pinned_vm:0x0 ret:0
  [52.250498] perf_mmap user-&gt;locked_vm:0x309 pinned_vm:0x0 ret:0
  [52.250613] perf_mmap user-&gt;locked_vm:0x38a pinned_vm:0x0 ret:0
  [52.250715] perf_mmap user-&gt;locked_vm:0x408 pinned_vm:0x2 ret:0
  [52.250834] perf_mmap user-&gt;locked_vm:0x408 pinned_vm:0x83 ret:0
  [52.250915] perf_mmap user-&gt;locked_vm:0x408 pinned_vm:0x103 ret:0
  [52.251061] perf_mmap user-&gt;locked_vm:0x408 pinned_vm:0x184 ret:0
  [52.251146] perf_mmap user-&gt;locked_vm:0x408 pinned_vm:0x204 ret:0
  [52.251299] perf_mmap user-&gt;locked_vm:0x408 pinned_vm:0x285 ret:0
  [52.251383] perf_mmap user-&gt;locked_vm:0x408 pinned_vm:0x305 ret:0
  [52.251544] perf_mmap user-&gt;locked_vm:0x408 pinned_vm:0x386 ret:0
  [52.251634] perf_mmap user-&gt;locked_vm:0x408 pinned_vm:0x406 ret:0
  [52.253018] perf_mmap user-&gt;locked_vm:0x408 pinned_vm:0x487 ret:0
  [52.253197] perf_mmap user-&gt;locked_vm:0x408 pinned_vm:0x508 ret:0
  [52.253374] perf_mmap user-&gt;locked_vm:0x408 pinned_vm:0x589 ret:0
  [52.253550] perf_mmap user-&gt;locked_vm:0x408 pinned_vm:0x60a ret:0
  [52.253726] perf_mmap user-&gt;locked_vm:0x408 pinned_vm:0x68b ret:0
  [52.253903] perf_mmap user-&gt;locked_vm:0x408 pinned_vm:0x70c ret:0
  [52.254084] perf_mmap user-&gt;locked_vm:0x408 pinned_vm:0x78d ret:0
  [52.254263] perf_mmap user-&gt;locked_vm:0x408 pinned_vm:0x80e ret:0

The value of user-&gt;locked_vm increases to a limit then the memory
is tracked by pinned_vm.

During deallocation the size is subtracted from pinned_vm until
it hits a limit. Then a larger value is subtracted from locked_vm
leading to a large number (because of type unsigned):

  [64.267797] perf_mmap_close mmap_user-&gt;locked_vm:0x408 pinned_vm:0x78d
  [64.267826] perf_mmap_close mmap_user-&gt;locked_vm:0x408 pinned_vm:0x70c
  [64.267848] perf_mmap_close mmap_user-&gt;locked_vm:0x408 pinned_vm:0x68b
  [64.267869] perf_mmap_close mmap_user-&gt;locked_vm:0x408 pinned_vm:0x60a
  [64.267891] perf_mmap_close mmap_user-&gt;locked_vm:0x408 pinned_vm:0x589
  [64.267911] perf_mmap_close mmap_user-&gt;locked_vm:0x408 pinned_vm:0x508
  [64.267933] perf_mmap_close mmap_user-&gt;locked_vm:0x408 pinned_vm:0x487
  [64.267952] perf_mmap_close mmap_user-&gt;locked_vm:0x408 pinned_vm:0x406
  [64.268883] perf_mmap_close mmap_user-&gt;locked_vm:0x307 pinned_vm:0x406
  [64.269117] perf_mmap_close mmap_user-&gt;locked_vm:0x206 pinned_vm:0x406
  [64.269433] perf_mmap_close mmap_user-&gt;locked_vm:0x105 pinned_vm:0x406
  [64.269536] perf_mmap_close mmap_user-&gt;locked_vm:0x4 pinned_vm:0x404
  [64.269797] perf_mmap_close mmap_user-&gt;locked_vm:0xffffffffffffff84 pinned_vm:0x303
  [64.270105] perf_mmap_close mmap_user-&gt;locked_vm:0xffffffffffffff04 pinned_vm:0x202
  [64.270374] perf_mmap_close mmap_user-&gt;locked_vm:0xfffffffffffffe84 pinned_vm:0x101
  [64.270628] perf_mmap_close mmap_user-&gt;locked_vm:0xfffffffffffffe04 pinned_vm:0x0

This value sticks for the user until system is rebooted, causing
follow-on system calls using locked_vm resource limit to fail.

Note: There is no issue using the normal trace buffer.

In fact the issue is in perf_mmap_close(). During allocation auxiliary
trace buffer memory is either traced as 'extra' and added to 'pinned_vm'
or trace as 'user_extra' and added to 'locked_vm'. This applies for
normal trace buffers and auxiliary trace buffer.

However in function perf_mmap_close() all auxiliary trace buffer is
subtraced from 'locked_vm' and never from 'pinned_vm'. This breaks the
ballance.

Signed-off-by: Thomas Richter &lt;tmricht@linux.ibm.com&gt;
Acked-by: Peter Zijlstra &lt;a.p.zijlstra@chello.nl&gt;
Cc: Alexander Shishkin &lt;alexander.shishkin@linux.intel.com&gt;
Cc: Arnaldo Carvalho de Melo &lt;acme@redhat.com&gt;
Cc: Jiri Olsa &lt;jolsa@redhat.com&gt;
Cc: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Cc: Mark Rutland &lt;mark.rutland@arm.com&gt;
Cc: Namhyung Kim &lt;namhyung@kernel.org&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Cc: acme@kernel.org
Cc: gor@linux.ibm.com
Cc: hechaol@fb.com
Cc: heiko.carstens@de.ibm.com
Cc: linux-perf-users@vger.kernel.org
Cc: songliubraving@fb.com
Fixes: d44248a41337 ("perf/core: Rework memory accounting in perf_mmap()")
Link: https://lkml.kernel.org/r/20191021083354.67868-1-tmricht@linux.ibm.com
[ Minor readability edits. ]
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>tracing: Initialize iter-&gt;seq after zeroing in tracing_read_pipe()</title>
<updated>2019-11-06T12:08:47+00:00</updated>
<author>
<name>Petr Mladek</name>
<email>pmladek@suse.com</email>
</author>
<published>2019-10-11T14:21:34+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=1eca0cbc9b613ac67f71e84f97c5380c563cd63a'/>
<id>1eca0cbc9b613ac67f71e84f97c5380c563cd63a</id>
<content type='text'>
[ Upstream commit d303de1fcf344ff7c15ed64c3f48a991c9958775 ]

A customer reported the following softlockup:

[899688.160002] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [test.sh:16464]
[899688.160002] CPU: 0 PID: 16464 Comm: test.sh Not tainted 4.12.14-6.23-azure #1 SLE12-SP4
[899688.160002] RIP: 0010:up_write+0x1a/0x30
[899688.160002] Kernel panic - not syncing: softlockup: hung tasks
[899688.160002] RIP: 0010:up_write+0x1a/0x30
[899688.160002] RSP: 0018:ffffa86784d4fde8 EFLAGS: 00000257 ORIG_RAX: ffffffffffffff12
[899688.160002] RAX: ffffffff970fea00 RBX: 0000000000000001 RCX: 0000000000000000
[899688.160002] RDX: ffffffff00000001 RSI: 0000000000000080 RDI: ffffffff970fea00
[899688.160002] RBP: ffffffffffffffff R08: ffffffffffffffff R09: 0000000000000000
[899688.160002] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8b59014720d8
[899688.160002] R13: ffff8b59014720c0 R14: ffff8b5901471090 R15: ffff8b5901470000
[899688.160002]  tracing_read_pipe+0x336/0x3c0
[899688.160002]  __vfs_read+0x26/0x140
[899688.160002]  vfs_read+0x87/0x130
[899688.160002]  SyS_read+0x42/0x90
[899688.160002]  do_syscall_64+0x74/0x160

It caught the process in the middle of trace_access_unlock(). There is
no loop. So, it must be looping in the caller tracing_read_pipe()
via the "waitagain" label.

Crashdump analyze uncovered that iter-&gt;seq was completely zeroed
at this point, including iter-&gt;seq.seq.size. It means that
print_trace_line() was never able to print anything and
there was no forward progress.

The culprit seems to be in the code:

	/* reset all but tr, trace, and overruns */
	memset(&amp;iter-&gt;seq, 0,
	       sizeof(struct trace_iterator) -
	       offsetof(struct trace_iterator, seq));

It was added by the commit 53d0aa773053ab182877 ("ftrace:
add logic to record overruns"). It was v2.6.27-rc1.
It was the time when iter-&gt;seq looked like:

     struct trace_seq {
	unsigned char		buffer[PAGE_SIZE];
	unsigned int		len;
     };

There was no "size" variable and zeroing was perfectly fine.

The solution is to reinitialize the structure after or without
zeroing.

Link: http://lkml.kernel.org/r/20191011142134.11997-1-pmladek@suse.com

Signed-off-by: Petr Mladek &lt;pmladek@suse.com&gt;
Signed-off-by: Steven Rostedt (VMware) &lt;rostedt@goodmis.org&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
[ Upstream commit d303de1fcf344ff7c15ed64c3f48a991c9958775 ]

A customer reported the following softlockup:

[899688.160002] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [test.sh:16464]
[899688.160002] CPU: 0 PID: 16464 Comm: test.sh Not tainted 4.12.14-6.23-azure #1 SLE12-SP4
[899688.160002] RIP: 0010:up_write+0x1a/0x30
[899688.160002] Kernel panic - not syncing: softlockup: hung tasks
[899688.160002] RIP: 0010:up_write+0x1a/0x30
[899688.160002] RSP: 0018:ffffa86784d4fde8 EFLAGS: 00000257 ORIG_RAX: ffffffffffffff12
[899688.160002] RAX: ffffffff970fea00 RBX: 0000000000000001 RCX: 0000000000000000
[899688.160002] RDX: ffffffff00000001 RSI: 0000000000000080 RDI: ffffffff970fea00
[899688.160002] RBP: ffffffffffffffff R08: ffffffffffffffff R09: 0000000000000000
[899688.160002] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8b59014720d8
[899688.160002] R13: ffff8b59014720c0 R14: ffff8b5901471090 R15: ffff8b5901470000
[899688.160002]  tracing_read_pipe+0x336/0x3c0
[899688.160002]  __vfs_read+0x26/0x140
[899688.160002]  vfs_read+0x87/0x130
[899688.160002]  SyS_read+0x42/0x90
[899688.160002]  do_syscall_64+0x74/0x160

It caught the process in the middle of trace_access_unlock(). There is
no loop. So, it must be looping in the caller tracing_read_pipe()
via the "waitagain" label.

Crashdump analyze uncovered that iter-&gt;seq was completely zeroed
at this point, including iter-&gt;seq.seq.size. It means that
print_trace_line() was never able to print anything and
there was no forward progress.

The culprit seems to be in the code:

	/* reset all but tr, trace, and overruns */
	memset(&amp;iter-&gt;seq, 0,
	       sizeof(struct trace_iterator) -
	       offsetof(struct trace_iterator, seq));

It was added by the commit 53d0aa773053ab182877 ("ftrace:
add logic to record overruns"). It was v2.6.27-rc1.
It was the time when iter-&gt;seq looked like:

     struct trace_seq {
	unsigned char		buffer[PAGE_SIZE];
	unsigned int		len;
     };

There was no "size" variable and zeroing was perfectly fine.

The solution is to reinitialize the structure after or without
zeroing.

Link: http://lkml.kernel.org/r/20191011142134.11997-1-pmladek@suse.com

Signed-off-by: Petr Mladek &lt;pmladek@suse.com&gt;
Signed-off-by: Steven Rostedt (VMware) &lt;rostedt@goodmis.org&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>perf/core: Fix corner case in perf_rotate_context()</title>
<updated>2019-11-06T12:08:35+00:00</updated>
<author>
<name>Song Liu</name>
<email>songliubraving@fb.com</email>
</author>
<published>2019-10-08T16:59:49+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=d4b04822616f03573a0df66736bbf4b767afb644'/>
<id>d4b04822616f03573a0df66736bbf4b767afb644</id>
<content type='text'>
[ Upstream commit 7fa343b7fdc4f351de4e3f28d5c285937dd1f42f ]

In perf_rotate_context(), when the first cpu flexible event fail to
schedule, cpu_rotate is 1, while cpu_event is NULL. Since cpu_event is
NULL, perf_rotate_context will _NOT_ call cpu_ctx_sched_out(), thus
cpuctx-&gt;ctx.is_active will have EVENT_FLEXIBLE set. Then, the next
perf_event_sched_in() will skip all cpu flexible events because of the
EVENT_FLEXIBLE bit.

In the next call of perf_rotate_context(), cpu_rotate stays 1, and
cpu_event stays NULL, so this process repeats. The end result is, flexible
events on this cpu will not be scheduled (until another event being added
to the cpuctx).

Here is an easy repro of this issue. On Intel CPUs, where ref-cycles
could only use one counter, run one pinned event for ref-cycles, one
flexible event for ref-cycles, and one flexible event for cycles. The
flexible ref-cycles is never scheduled, which is expected. However,
because of this issue, the cycles event is never scheduled either.

 $ perf stat -e ref-cycles:D,ref-cycles,cycles -C 5 -I 1000

           time             counts unit events
    1.000152973         15,412,480      ref-cycles:D
    1.000152973      &lt;not counted&gt;      ref-cycles     (0.00%)
    1.000152973      &lt;not counted&gt;      cycles         (0.00%)
    2.000486957         18,263,120      ref-cycles:D
    2.000486957      &lt;not counted&gt;      ref-cycles     (0.00%)
    2.000486957      &lt;not counted&gt;      cycles         (0.00%)

To fix this, when the flexible_active list is empty, try rotate the
first event in the flexible_groups. Also, rename ctx_first_active() to
ctx_event_to_rotate(), which is more accurate.

Signed-off-by: Song Liu &lt;songliubraving@fb.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Cc: &lt;kernel-team@fb.com&gt;
Cc: Arnaldo Carvalho de Melo &lt;acme@kernel.org&gt;
Cc: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Sasha Levin &lt;sashal@kernel.org&gt;
Cc: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Fixes: 8d5bce0c37fa ("perf/core: Optimize perf_rotate_context() event scheduling")
Link: https://lkml.kernel.org/r/20191008165949.920548-1-songliubraving@fb.com
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
[ Upstream commit 7fa343b7fdc4f351de4e3f28d5c285937dd1f42f ]

In perf_rotate_context(), when the first cpu flexible event fail to
schedule, cpu_rotate is 1, while cpu_event is NULL. Since cpu_event is
NULL, perf_rotate_context will _NOT_ call cpu_ctx_sched_out(), thus
cpuctx-&gt;ctx.is_active will have EVENT_FLEXIBLE set. Then, the next
perf_event_sched_in() will skip all cpu flexible events because of the
EVENT_FLEXIBLE bit.

In the next call of perf_rotate_context(), cpu_rotate stays 1, and
cpu_event stays NULL, so this process repeats. The end result is, flexible
events on this cpu will not be scheduled (until another event being added
to the cpuctx).

Here is an easy repro of this issue. On Intel CPUs, where ref-cycles
could only use one counter, run one pinned event for ref-cycles, one
flexible event for ref-cycles, and one flexible event for cycles. The
flexible ref-cycles is never scheduled, which is expected. However,
because of this issue, the cycles event is never scheduled either.

 $ perf stat -e ref-cycles:D,ref-cycles,cycles -C 5 -I 1000

           time             counts unit events
    1.000152973         15,412,480      ref-cycles:D
    1.000152973      &lt;not counted&gt;      ref-cycles     (0.00%)
    1.000152973      &lt;not counted&gt;      cycles         (0.00%)
    2.000486957         18,263,120      ref-cycles:D
    2.000486957      &lt;not counted&gt;      ref-cycles     (0.00%)
    2.000486957      &lt;not counted&gt;      cycles         (0.00%)

To fix this, when the flexible_active list is empty, try rotate the
first event in the flexible_groups. Also, rename ctx_first_active() to
ctx_event_to_rotate(), which is more accurate.

Signed-off-by: Song Liu &lt;songliubraving@fb.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Cc: &lt;kernel-team@fb.com&gt;
Cc: Arnaldo Carvalho de Melo &lt;acme@kernel.org&gt;
Cc: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Sasha Levin &lt;sashal@kernel.org&gt;
Cc: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Fixes: 8d5bce0c37fa ("perf/core: Optimize perf_rotate_context() event scheduling")
Link: https://lkml.kernel.org/r/20191008165949.920548-1-songliubraving@fb.com
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>perf/core: Rework memory accounting in perf_mmap()</title>
<updated>2019-11-06T12:08:34+00:00</updated>
<author>
<name>Song Liu</name>
<email>songliubraving@fb.com</email>
</author>
<published>2019-09-04T21:46:18+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=489dbad097a21d510950fac87982e6802181451f'/>
<id>489dbad097a21d510950fac87982e6802181451f</id>
<content type='text'>
[ Upstream commit d44248a41337731a111374822d7d4451b64e73e4 ]

perf_mmap() always increases user-&gt;locked_vm. As a result, "extra" could
grow bigger than "user_extra", which doesn't make sense. Here is an
example case:

(Note: Assume "user_lock_limit" is very small.)

  | # of perf_mmap calls |vma-&gt;vm_mm-&gt;pinned_vm|user-&gt;locked_vm|
  | 0                    | 0                   | 0             |
  | 1                    | user_extra          | user_extra    |
  | 2                    | 3 * user_extra      | 2 * user_extra|
  | 3                    | 6 * user_extra      | 3 * user_extra|
  | 4                    | 10 * user_extra     | 4 * user_extra|

Fix this by maintaining proper user_extra and extra.

Reviewed-By: Hechao Li &lt;hechaol@fb.com&gt;
Reported-by: Hechao Li &lt;hechaol@fb.com&gt;
Signed-off-by: Song Liu &lt;songliubraving@fb.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Cc: &lt;kernel-team@fb.com&gt;
Cc: Jie Meng &lt;jmeng@fb.com&gt;
Cc: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Link: https://lkml.kernel.org/r/20190904214618.3795672-1-songliubraving@fb.com
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
[ Upstream commit d44248a41337731a111374822d7d4451b64e73e4 ]

perf_mmap() always increases user-&gt;locked_vm. As a result, "extra" could
grow bigger than "user_extra", which doesn't make sense. Here is an
example case:

(Note: Assume "user_lock_limit" is very small.)

  | # of perf_mmap calls |vma-&gt;vm_mm-&gt;pinned_vm|user-&gt;locked_vm|
  | 0                    | 0                   | 0             |
  | 1                    | user_extra          | user_extra    |
  | 2                    | 3 * user_extra      | 2 * user_extra|
  | 3                    | 6 * user_extra      | 3 * user_extra|
  | 4                    | 10 * user_extra     | 4 * user_extra|

Fix this by maintaining proper user_extra and extra.

Reviewed-By: Hechao Li &lt;hechaol@fb.com&gt;
Reported-by: Hechao Li &lt;hechaol@fb.com&gt;
Signed-off-by: Song Liu &lt;songliubraving@fb.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Cc: &lt;kernel-team@fb.com&gt;
Cc: Jie Meng &lt;jmeng@fb.com&gt;
Cc: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Link: https://lkml.kernel.org/r/20190904214618.3795672-1-songliubraving@fb.com
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>sched/vtime: Fix guest/system mis-accounting on task switch</title>
<updated>2019-11-06T12:08:33+00:00</updated>
<author>
<name>Frederic Weisbecker</name>
<email>frederic@kernel.org</email>
</author>
<published>2019-09-25T21:42:42+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=5d14b7753c45b7da6c2a8c1129f1d094f64a9b50'/>
<id>5d14b7753c45b7da6c2a8c1129f1d094f64a9b50</id>
<content type='text'>
[ Upstream commit 68e7a4d66b0ce04bf18ff2ffded5596ab3618585 ]

vtime_account_system() assumes that the target task to account cputime
to is always the current task. This is most often true indeed except on
task switch where we call:

	vtime_common_task_switch(prev)
		vtime_account_system(prev)

Here prev is the scheduling-out task where we account the cputime to. It
doesn't match current that is already the scheduling-in task at this
stage of the context switch.

So we end up checking the wrong task flags to determine if we are
accounting guest or system time to the previous task.

As a result the wrong task is used to check if the target is running in
guest mode. We may then spuriously account or leak either system or
guest time on task switch.

Fix this assumption and also turn vtime_guest_enter/exit() to use the
task passed in parameter as well to avoid future similar issues.

Signed-off-by: Frederic Weisbecker &lt;frederic@kernel.org&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Cc: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Rik van Riel &lt;riel@redhat.com&gt;
Cc: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Cc: Wanpeng Li &lt;wanpengli@tencent.com&gt;
Fixes: 2a42eb9594a1 ("sched/cputime: Accumulate vtime on top of nsec clocksource")
Link: https://lkml.kernel.org/r/20190925214242.21873-1-frederic@kernel.org
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
[ Upstream commit 68e7a4d66b0ce04bf18ff2ffded5596ab3618585 ]

vtime_account_system() assumes that the target task to account cputime
to is always the current task. This is most often true indeed except on
task switch where we call:

	vtime_common_task_switch(prev)
		vtime_account_system(prev)

Here prev is the scheduling-out task where we account the cputime to. It
doesn't match current that is already the scheduling-in task at this
stage of the context switch.

So we end up checking the wrong task flags to determine if we are
accounting guest or system time to the previous task.

As a result the wrong task is used to check if the target is running in
guest mode. We may then spuriously account or leak either system or
guest time on task switch.

Fix this assumption and also turn vtime_guest_enter/exit() to use the
task passed in parameter as well to avoid future similar issues.

Signed-off-by: Frederic Weisbecker &lt;frederic@kernel.org&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Cc: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Rik van Riel &lt;riel@redhat.com&gt;
Cc: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Cc: Wanpeng Li &lt;wanpengli@tencent.com&gt;
Fixes: 2a42eb9594a1 ("sched/cputime: Accumulate vtime on top of nsec clocksource")
Link: https://lkml.kernel.org/r/20190925214242.21873-1-frederic@kernel.org
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>sched/fair: Scale bandwidth quota and period without losing quota/period ratio precision</title>
<updated>2019-11-06T12:08:32+00:00</updated>
<author>
<name>Xuewei Zhang</name>
<email>xueweiz@google.com</email>
</author>
<published>2019-10-04T00:12:43+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=b0771e86502351ef5ee7cc3aaf4387f7ca38e773'/>
<id>b0771e86502351ef5ee7cc3aaf4387f7ca38e773</id>
<content type='text'>
[ Upstream commit 4929a4e6faa0f13289a67cae98139e727f0d4a97 ]

The quota/period ratio is used to ensure a child task group won't get
more bandwidth than the parent task group, and is calculated as:

  normalized_cfs_quota() = [(quota_us &lt;&lt; 20) / period_us]

If the quota/period ratio was changed during this scaling due to
precision loss, it will cause inconsistency between parent and child
task groups.

See below example:

A userspace container manager (kubelet) does three operations:

 1) Create a parent cgroup, set quota to 1,000us and period to 10,000us.
 2) Create a few children cgroups.
 3) Set quota to 1,000us and period to 10,000us on a child cgroup.

These operations are expected to succeed. However, if the scaling of
147/128 happens before step 3, quota and period of the parent cgroup
will be changed:

  new_quota: 1148437ns,   1148us
 new_period: 11484375ns, 11484us

And when step 3 comes in, the ratio of the child cgroup will be
104857, which will be larger than the parent cgroup ratio (104821),
and will fail.

Scaling them by a factor of 2 will fix the problem.

Tested-by: Phil Auld &lt;pauld@redhat.com&gt;
Signed-off-by: Xuewei Zhang &lt;xueweiz@google.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Acked-by: Phil Auld &lt;pauld@redhat.com&gt;
Cc: Anton Blanchard &lt;anton@ozlabs.org&gt;
Cc: Ben Segall &lt;bsegall@google.com&gt;
Cc: Dietmar Eggemann &lt;dietmar.eggemann@arm.com&gt;
Cc: Juri Lelli &lt;juri.lelli@redhat.com&gt;
Cc: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Cc: Mel Gorman &lt;mgorman@suse.de&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Steven Rostedt &lt;rostedt@goodmis.org&gt;
Cc: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Cc: Vincent Guittot &lt;vincent.guittot@linaro.org&gt;
Fixes: 2e8e19226398 ("sched/fair: Limit sched_cfs_period_timer() loop to avoid hard lockup")
Link: https://lkml.kernel.org/r/20191004001243.140897-1-xueweiz@google.com
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
[ Upstream commit 4929a4e6faa0f13289a67cae98139e727f0d4a97 ]

The quota/period ratio is used to ensure a child task group won't get
more bandwidth than the parent task group, and is calculated as:

  normalized_cfs_quota() = [(quota_us &lt;&lt; 20) / period_us]

If the quota/period ratio was changed during this scaling due to
precision loss, it will cause inconsistency between parent and child
task groups.

See below example:

A userspace container manager (kubelet) does three operations:

 1) Create a parent cgroup, set quota to 1,000us and period to 10,000us.
 2) Create a few children cgroups.
 3) Set quota to 1,000us and period to 10,000us on a child cgroup.

These operations are expected to succeed. However, if the scaling of
147/128 happens before step 3, quota and period of the parent cgroup
will be changed:

  new_quota: 1148437ns,   1148us
 new_period: 11484375ns, 11484us

And when step 3 comes in, the ratio of the child cgroup will be
104857, which will be larger than the parent cgroup ratio (104821),
and will fail.

Scaling them by a factor of 2 will fix the problem.

Tested-by: Phil Auld &lt;pauld@redhat.com&gt;
Signed-off-by: Xuewei Zhang &lt;xueweiz@google.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Acked-by: Phil Auld &lt;pauld@redhat.com&gt;
Cc: Anton Blanchard &lt;anton@ozlabs.org&gt;
Cc: Ben Segall &lt;bsegall@google.com&gt;
Cc: Dietmar Eggemann &lt;dietmar.eggemann@arm.com&gt;
Cc: Juri Lelli &lt;juri.lelli@redhat.com&gt;
Cc: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Cc: Mel Gorman &lt;mgorman@suse.de&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Steven Rostedt &lt;rostedt@goodmis.org&gt;
Cc: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Cc: Vincent Guittot &lt;vincent.guittot@linaro.org&gt;
Fixes: 2e8e19226398 ("sched/fair: Limit sched_cfs_period_timer() loop to avoid hard lockup")
Link: https://lkml.kernel.org/r/20191004001243.140897-1-xueweiz@google.com
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>tracing: Fix race in perf_trace_buf initialization</title>
<updated>2019-10-29T08:22:39+00:00</updated>
<author>
<name>Prateek Sood</name>
<email>prsood@codeaurora.org</email>
</author>
<published>2019-10-15T06:17:25+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=2872b10f19556b3c568f3dcf44c595a028b07e16'/>
<id>2872b10f19556b3c568f3dcf44c595a028b07e16</id>
<content type='text'>
commit 6b1340cc00edeadd52ebd8a45171f38c8de2a387 upstream.

A race condition exists while initialiazing perf_trace_buf from
perf_trace_init() and perf_kprobe_init().

      CPU0                                        CPU1
perf_trace_init()
  mutex_lock(&amp;event_mutex)
    perf_trace_event_init()
      perf_trace_event_reg()
        total_ref_count == 0
	buf = alloc_percpu()
        perf_trace_buf[i] = buf
        tp_event-&gt;class-&gt;reg() //fails       perf_kprobe_init()
	goto fail                              perf_trace_event_init()
                                                 perf_trace_event_reg()
        fail:
	  total_ref_count == 0

                                                   total_ref_count == 0
                                                   buf = alloc_percpu()
                                                   perf_trace_buf[i] = buf
                                                   tp_event-&gt;class-&gt;reg()
                                                   total_ref_count++

          free_percpu(perf_trace_buf[i])
          perf_trace_buf[i] = NULL

Any subsequent call to perf_trace_event_reg() will observe total_ref_count &gt; 0,
causing the perf_trace_buf to be always NULL. This can result in perf_trace_buf
getting accessed from perf_trace_buf_alloc() without being initialized. Acquiring
event_mutex in perf_kprobe_init() before calling perf_trace_event_init() should
fix this race.

The race caused the following bug:

 Unable to handle kernel paging request at virtual address 0000003106f2003c
 Mem abort info:
   ESR = 0x96000045
   Exception class = DABT (current EL), IL = 32 bits
   SET = 0, FnV = 0
   EA = 0, S1PTW = 0
 Data abort info:
   ISV = 0, ISS = 0x00000045
   CM = 0, WnR = 1
 user pgtable: 4k pages, 39-bit VAs, pgdp = ffffffc034b9b000
 [0000003106f2003c] pgd=0000000000000000, pud=0000000000000000
 Internal error: Oops: 96000045 [#1] PREEMPT SMP
 Process syz-executor (pid: 18393, stack limit = 0xffffffc093190000)
 pstate: 80400005 (Nzcv daif +PAN -UAO)
 pc : __memset+0x20/0x1ac
 lr : memset+0x3c/0x50
 sp : ffffffc09319fc50

  __memset+0x20/0x1ac
  perf_trace_buf_alloc+0x140/0x1a0
  perf_trace_sys_enter+0x158/0x310
  syscall_trace_enter+0x348/0x7c0
  el0_svc_common+0x11c/0x368
  el0_svc_handler+0x12c/0x198
  el0_svc+0x8/0xc

Ramdumps showed the following:
  total_ref_count = 3
  perf_trace_buf = (
      0x0 -&gt; NULL,
      0x0 -&gt; NULL,
      0x0 -&gt; NULL,
      0x0 -&gt; NULL)

Link: http://lkml.kernel.org/r/1571120245-4186-1-git-send-email-prsood@codeaurora.org

Cc: stable@vger.kernel.org
Fixes: e12f03d7031a9 ("perf/core: Implement the 'perf_kprobe' PMU")
Acked-by: Song Liu &lt;songliubraving@fb.com&gt;
Signed-off-by: Prateek Sood &lt;prsood@codeaurora.org&gt;
Signed-off-by: Steven Rostedt (VMware) &lt;rostedt@goodmis.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
commit 6b1340cc00edeadd52ebd8a45171f38c8de2a387 upstream.

A race condition exists while initialiazing perf_trace_buf from
perf_trace_init() and perf_kprobe_init().

      CPU0                                        CPU1
perf_trace_init()
  mutex_lock(&amp;event_mutex)
    perf_trace_event_init()
      perf_trace_event_reg()
        total_ref_count == 0
	buf = alloc_percpu()
        perf_trace_buf[i] = buf
        tp_event-&gt;class-&gt;reg() //fails       perf_kprobe_init()
	goto fail                              perf_trace_event_init()
                                                 perf_trace_event_reg()
        fail:
	  total_ref_count == 0

                                                   total_ref_count == 0
                                                   buf = alloc_percpu()
                                                   perf_trace_buf[i] = buf
                                                   tp_event-&gt;class-&gt;reg()
                                                   total_ref_count++

          free_percpu(perf_trace_buf[i])
          perf_trace_buf[i] = NULL

Any subsequent call to perf_trace_event_reg() will observe total_ref_count &gt; 0,
causing the perf_trace_buf to be always NULL. This can result in perf_trace_buf
getting accessed from perf_trace_buf_alloc() without being initialized. Acquiring
event_mutex in perf_kprobe_init() before calling perf_trace_event_init() should
fix this race.

The race caused the following bug:

 Unable to handle kernel paging request at virtual address 0000003106f2003c
 Mem abort info:
   ESR = 0x96000045
   Exception class = DABT (current EL), IL = 32 bits
   SET = 0, FnV = 0
   EA = 0, S1PTW = 0
 Data abort info:
   ISV = 0, ISS = 0x00000045
   CM = 0, WnR = 1
 user pgtable: 4k pages, 39-bit VAs, pgdp = ffffffc034b9b000
 [0000003106f2003c] pgd=0000000000000000, pud=0000000000000000
 Internal error: Oops: 96000045 [#1] PREEMPT SMP
 Process syz-executor (pid: 18393, stack limit = 0xffffffc093190000)
 pstate: 80400005 (Nzcv daif +PAN -UAO)
 pc : __memset+0x20/0x1ac
 lr : memset+0x3c/0x50
 sp : ffffffc09319fc50

  __memset+0x20/0x1ac
  perf_trace_buf_alloc+0x140/0x1a0
  perf_trace_sys_enter+0x158/0x310
  syscall_trace_enter+0x348/0x7c0
  el0_svc_common+0x11c/0x368
  el0_svc_handler+0x12c/0x198
  el0_svc+0x8/0xc

Ramdumps showed the following:
  total_ref_count = 3
  perf_trace_buf = (
      0x0 -&gt; NULL,
      0x0 -&gt; NULL,
      0x0 -&gt; NULL,
      0x0 -&gt; NULL)

Link: http://lkml.kernel.org/r/1571120245-4186-1-git-send-email-prsood@codeaurora.org

Cc: stable@vger.kernel.org
Fixes: e12f03d7031a9 ("perf/core: Implement the 'perf_kprobe' PMU")
Acked-by: Song Liu &lt;songliubraving@fb.com&gt;
Signed-off-by: Prateek Sood &lt;prsood@codeaurora.org&gt;
Signed-off-by: Steven Rostedt (VMware) &lt;rostedt@goodmis.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</pre>
</div>
</content>
</entry>
</feed>
