linux-toradex.git/kernel, branch v3.2.63

ring-buffer: Always reset iterator to reader page

2014-09-13T22:41:44+00:00

commit 651e22f2701b4113989237c3048d17337dd2185c upstream.

When performing a consuming read, the ring buffer swaps out a
page from the ring buffer with a empty page and this page that
was swapped out becomes the new reader page. The reader page
is owned by the reader and since it was swapped out of the ring
buffer, writers do not have access to it (there's an exception
to that rule, but it's out of scope for this commit).

When reading the "trace" file, it is a non consuming read, which
means that the data in the ring buffer will not be modified.
When the trace file is opened, a ring buffer iterator is allocated
and writes to the ring buffer are disabled, such that the iterator
will not have issues iterating over the data.

Although the ring buffer disabled writes, it does not disable other
reads, or even consuming reads. If a consuming read happens, then
the iterator is reset and starts reading from the beginning again.

My tests would sometimes trigger this bug on my i386 box:

WARNING: CPU: 0 PID: 5175 at kernel/trace/trace.c:1527 __trace_find_cmdline+0x66/0xaa()
Modules linked in:
CPU: 0 PID: 5175 Comm: grep Not tainted 3.16.0-rc3-test+ #8
Hardware name:                  /DG965MQ, BIOS MQ96510J.86A.0372.2006.0605.1717 06/05/2006
 00000000 00000000 f09c9e1c c18796b3 c1b5d74c f09c9e4c c103a0e3 c1b5154b
 f09c9e78 00001437 c1b5d74c 000005f7 c10bd85a c10bd85a c1cac57c f09c9eb0
 ed0e0000 f09c9e64 c103a185 00000009 f09c9e5c c1b5154b f09c9e78 f09c9e80^M
Call Trace:
 [] dump_stack+0x4b/0x75
 [] warn_slowpath_common+0x7e/0x95
 [] ? __trace_find_cmdline+0x66/0xaa
 [] ? __trace_find_cmdline+0x66/0xaa
 [] warn_slowpath_fmt+0x33/0x35
 [] __trace_find_cmdline+0x66/0xaa^M
 [] trace_find_cmdline+0x40/0x64
 [] trace_print_context+0x27/0xec
 [] ? trace_seq_printf+0x37/0x5b
 [] print_trace_line+0x319/0x39b
 [] ? ring_buffer_read+0x47/0x50
 [] s_show+0x192/0x1ab
 [] ? s_next+0x5a/0x7c
 [] seq_read+0x267/0x34c
 [] vfs_read+0x8c/0xef
 [] ? seq_lseek+0x154/0x154
 [] SyS_read+0x54/0x7f
 [] syscall_call+0x7/0xb
---[ end trace 3f507febd6b4cc83 ]---
>>>> ##### CPU 1 buffer started ####

Which was the __trace_find_cmdline() function complaining about the pid
in the event record being negative.

After adding more test cases, this would trigger more often. Strangely
enough, it would never trigger on a single test, but instead would trigger
only when running all the tests. I believe that was the case because it
required one of the tests to be shutting down via delayed instances while
a new test started up.

After spending several days debugging this, I found that it was caused by
the iterator becoming corrupted. Debugging further, I found out why
the iterator became corrupted. It happened with the rb_iter_reset().

As consuming reads may not read the full reader page, and only part
of it, there's a "read" field to know where the last read took place.
The iterator, must also start at the read position. In the rb_iter_reset()
code, if the reader page was disconnected from the ring buffer, the iterator
would start at the head page within the ring buffer (where writes still
happen). But the mistake there was that it still used the "read" field
to start the iterator on the head page, where it should always start
at zero because readers never read from within the ring buffer where
writes occur.

I originally wrote a patch to have it set the iter->head to 0 instead
of iter->head_page->read, but then I questioned why it wasn't always
setting the iter to point to the reader page, as the reader page is
still valid.  The list_empty(reader_page->list) just means that it was
successful in swapping out. But the reader_page may still have data.

There was a bug report a long time ago that was not reproducible that
had something about trace_pipe (consuming read) not matching trace
(iterator read). This may explain why that happened.

Anyway, the correct answer to this bug is to always use the reader page
an not reset the iterator to inside the writable ring buffer.

Fixes: d769041f8653 "ring_buffer: implement new locking"
Signed-off-by: Steven Rostedt 
Signed-off-by: Ben Hutchings

ring-buffer: Up rb_iter_peek() loop count to 3

2014-09-13T22:41:43+00:00

commit 021de3d904b88b1771a3a2cfc5b75023c391e646 upstream.

After writting a test to try to trigger the bug that caused the
ring buffer iterator to become corrupted, I hit another bug:

 WARNING: CPU: 1 PID: 5281 at kernel/trace/ring_buffer.c:3766 rb_iter_peek+0x113/0x238()
 Modules linked in: ipt_MASQUERADE sunrpc [...]
 CPU: 1 PID: 5281 Comm: grep Tainted: G        W     3.16.0-rc3-test+ #143
 Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./To be filled by O.E.M., BIOS SDBLI944.86P 05/08/2007
  0000000000000000 ffffffff81809a80 ffffffff81503fb0 0000000000000000
  ffffffff81040ca1 ffff8800796d6010 ffffffff810c138d ffff8800796d6010
  ffff880077438c80 ffff8800796d6010 ffff88007abbe600 0000000000000003
 Call Trace:
  [] ? dump_stack+0x4a/0x75
  [] ? warn_slowpath_common+0x7e/0x97
  [] ? rb_iter_peek+0x113/0x238
  [] ? rb_iter_peek+0x113/0x238
  [] ? ring_buffer_iter_peek+0x2d/0x5c
  [] ? tracing_iter_reset+0x6e/0x96
  [] ? s_start+0xd7/0x17b
  [] ? kmem_cache_alloc_trace+0xda/0xea
  [] ? seq_read+0x148/0x361
  [] ? vfs_read+0x93/0xf1
  [] ? SyS_read+0x60/0x8e
  [] ? tracesys+0xdd/0xe2

Debugging this bug, which triggers when the rb_iter_peek() loops too
many times (more than 2 times), I discovered there's a case that can
cause that function to legitimately loop 3 times!

rb_iter_peek() is different than rb_buffer_peek() as the rb_buffer_peek()
only deals with the reader page (it's for consuming reads). The
rb_iter_peek() is for traversing the buffer without consuming it, and as
such, it can loop for one more reason. That is, if we hit the end of
the reader page or any page, it will go to the next page and try again.

That is, we have this:

 1. iter->head > iter->head_page->page->commit
    (rb_inc_iter() which moves the iter to the next page)
    try again

 2. event = rb_iter_head_event()
    event->type_len == RINGBUF_TYPE_TIME_EXTEND
    rb_advance_iter()
    try again

 3. read the event.

But we never get to 3, because the count is greater than 2 and we
cause the WARNING and return NULL.

Up the counter to 3.

Fixes: 69d1b839f7ee "ring-buffer: Bind time extend and data events together"
Signed-off-by: Steven Rostedt 
[bwh: Backported to 3.2: drop inapplicable spelling correction]
Signed-off-by: Ben Hutchings

nohz: Fix another inconsistency between CONFIG_NO_HZ=n and nohz=off

2014-08-06T17:07:39+00:00

commit 0e576acbc1d9600cf2d9b4a141a2554639959d50 upstream.

If CONFIG_NO_HZ=n tick_nohz_get_sleep_length() returns NSEC_PER_SEC/HZ.

If CONFIG_NO_HZ=y and the nohz functionality is disabled via the
command line option "nohz=off" or not enabled due to missing hardware
support, then tick_nohz_get_sleep_length() returns 0. That happens
because ts->sleep_length is never set in that case.

Set it to NSEC_PER_SEC/HZ when the NOHZ mode is inactive.

Reported-by: Michal Hocko 
Reported-by: Borislav Petkov 
Signed-off-by: Thomas Gleixner 
[bwh: Backported to 3.2: adjust context]
Signed-off-by: Ben Hutchings

locking/mutex: Disable optimistic spinning on some architectures

2014-08-06T17:07:38+00:00

commit 4badad352a6bb202ec68afa7a574c0bb961e5ebc upstream.

The optimistic spin code assumes regular stores and cmpxchg() play nice;
this is found to not be true for at least: parisc, sparc32, tile32,
metag-lock1, arc-!llsc and hexagon.

There is further wreckage, but this in particular seemed easy to
trigger, so blacklist this.

Opt in for known good archs.

Signed-off-by: Peter Zijlstra 
Reported-by: Mikulas Patocka 
Cc: David Miller 
Cc: Chris Metcalf 
Cc: James Bottomley 
Cc: Vineet Gupta 
Cc: Jason Low 
Cc: Waiman Long 
Cc: "James E.J. Bottomley" 
Cc: Paul McKenney 
Cc: John David Anglin 
Cc: James Hogan 
Cc: Linus Torvalds 
Cc: Davidlohr Bueso 
Cc: Benjamin Herrenschmidt 
Cc: Catalin Marinas 
Cc: Russell King 
Cc: Will Deacon 
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: sparclinux@vger.kernel.org
Link: http://lkml.kernel.org/r/20140606175316.GV13930@laptop.programming.kicks-ass.net
Signed-off-by: Ingo Molnar 
[bwh: Backported to 3.2:
 - Adjust context
 - Drop arm64 change]
Signed-off-by: Ben Hutchings

sched: Fix possible divide by zero in avg_atom() calculation

2014-08-06T17:07:38+00:00

commit b0ab99e7736af88b8ac1b7ae50ea287fffa2badc upstream.

proc_sched_show_task() does:

  if (nr_switches)
	do_div(avg_atom, nr_switches);

nr_switches is unsigned long and do_div truncates it to 32 bits, which
means it can test non-zero on e.g. x86-64 and be truncated to zero for
division.

Fix the problem by using div64_ul() instead.

As a side effect calculations of avg_atom for big nr_switches are now correct.

Signed-off-by: Mateusz Guzik 
Signed-off-by: Peter Zijlstra 
Cc: Linus Torvalds 
Link: http://lkml.kernel.org/r/1402750809-31991-1-git-send-email-mguzik@redhat.com
Signed-off-by: Ingo Molnar 
[bwh: Backported to 3.2: adjust filename]
Signed-off-by: Ben Hutchings

ring-buffer: Fix polling on trace_pipe

2014-08-06T17:07:37+00:00

commit 97b8ee845393701edc06e27ccec2876ff9596019 upstream.

ring_buffer_poll_wait() should always put the poll_table to its wait_queue
even there is immediate data available.  Otherwise, the following epoll and
read sequence will eventually hang forever:

1. Put some data to make the trace_pipe ring_buffer read ready first
2. epoll_ctl(efd, EPOLL_CTL_ADD, trace_pipe_fd, ee)
3. epoll_wait()
4. read(trace_pipe_fd) till EAGAIN
5. Add some more data to the trace_pipe ring_buffer
6. epoll_wait() -> this epoll_wait() will block forever

~ During the epoll_ctl(efd, EPOLL_CTL_ADD,...) call in step 2,
  ring_buffer_poll_wait() returns immediately without adding poll_table,
  which has poll_table->_qproc pointing to ep_poll_callback(), to its
  wait_queue.
~ During the epoll_wait() call in step 3 and step 6,
  ring_buffer_poll_wait() cannot add ep_poll_callback() to its wait_queue
  because the poll_table->_qproc is NULL and it is how epoll works.
~ When there is new data available in step 6, ring_buffer does not know
  it has to call ep_poll_callback() because it is not in its wait queue.
  Hence, block forever.

Other poll implementation seems to call poll_wait() unconditionally as the very
first thing to do.  For example, tcp_poll() in tcp.c.

Link: http://lkml.kernel.org/p/20140610060637.GA14045@devbig242.prn2.facebook.com

Fixes: 2a2cc8f7c4d0 "ftrace: allow the event pipe to be polled"
Reviewed-by: Chris Mason 
Signed-off-by: Martin Lau 
Signed-off-by: Steven Rostedt 
[bwh: Backported to 3.2: the poll implementation looks rather different
 but does have a conditional return before and after the poll_wait() call;
 delete the return before it.]
Signed-off-by: Ben Hutchings

alarmtimer: Fix bug where relative alarm timers were treated as absolute

2014-08-06T17:07:37+00:00

commit 16927776ae757d0d132bdbfabbfe2c498342bd59 upstream.

Sharvil noticed with the posix timer_settime interface, using the
CLOCK_REALTIME_ALARM or CLOCK_BOOTTIME_ALARM clockid, if the users
tried to specify a relative time timer, it would incorrectly be
treated as absolute regardless of the state of the flags argument.

This patch corrects this, properly checking the absolute/relative flag,
as well as adds further error checking that no invalid flag bits are set.

Reported-by: Sharvil Nanavati 
Signed-off-by: John Stultz 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Prarit Bhargava 
Cc: Sharvil Nanavati 
Link: http://lkml.kernel.org/r/1404767171-6902-1-git-send-email-john.stultz@linaro.org
Signed-off-by: Thomas Gleixner 
Signed-off-by: Ben Hutchings

cpuset,mempolicy: fix sleeping function called from invalid context

2014-08-06T17:07:33+00:00

commit 391acf970d21219a2a5446282d3b20eace0c0d7a upstream.

When runing with the kernel(3.15-rc7+), the follow bug occurs:
[ 9969.258987] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:586
[ 9969.359906] in_atomic(): 1, irqs_disabled(): 0, pid: 160655, name: python
[ 9969.441175] INFO: lockdep is turned off.
[ 9969.488184] CPU: 26 PID: 160655 Comm: python Tainted: G       A      3.15.0-rc7+ #85
[ 9969.581032] Hardware name: FUJITSU-SV PRIMEQUEST 1800E/SB, BIOS PRIMEQUEST 1000 Series BIOS Version 1.39 11/16/2012
[ 9969.706052]  ffffffff81a20e60 ffff8803e941fbd0 ffffffff8162f523 ffff8803e941fd18
[ 9969.795323]  ffff8803e941fbe0 ffffffff8109995a ffff8803e941fc58 ffffffff81633e6c
[ 9969.884710]  ffffffff811ba5dc ffff880405c6b480 ffff88041fdd90a0 0000000000002000
[ 9969.974071] Call Trace:
[ 9970.003403]  [] dump_stack+0x4d/0x66
[ 9970.065074]  [] __might_sleep+0xfa/0x130
[ 9970.130743]  [] mutex_lock_nested+0x3c/0x4f0
[ 9970.200638]  [] ? kmem_cache_alloc+0x1bc/0x210
[ 9970.272610]  [] cpuset_mems_allowed+0x27/0x140
[ 9970.344584]  [] ? __mpol_dup+0x63/0x150
[ 9970.409282]  [] __mpol_dup+0xe5/0x150
[ 9970.471897]  [] ? __mpol_dup+0x63/0x150
[ 9970.536585]  [] ? copy_process.part.23+0x606/0x1d40
[ 9970.613763]  [] ? trace_hardirqs_on+0xd/0x10
[ 9970.683660]  [] ? monotonic_to_bootbased+0x2f/0x50
[ 9970.759795]  [] copy_process.part.23+0x670/0x1d40
[ 9970.834885]  [] do_fork+0xd8/0x380
[ 9970.894375]  [] ? __audit_syscall_entry+0x9c/0xf0
[ 9970.969470]  [] SyS_clone+0x16/0x20
[ 9971.030011]  [] stub_clone+0x69/0x90
[ 9971.091573]  [] ? system_call_fastpath+0x16/0x1b

The cause is that cpuset_mems_allowed() try to take
mutex_lock(&callback_mutex) under the rcu_read_lock(which was hold in
__mpol_dup()). And in cpuset_mems_allowed(), the access to cpuset is
under rcu_read_lock, so in __mpol_dup, we can reduce the rcu_read_lock
protection region to protect the access to cpuset only in
current_cpuset_is_being_rebound(). So that we can avoid this bug.

This patch is a temporary solution that just addresses the bug
mentioned above, can not fix the long-standing issue about cpuset.mems
rebinding on fork():

"When the forker's task_struct is duplicated (which includes
 ->mems_allowed) and it races with an update to cpuset_being_rebound
 in update_tasks_nodemask() then the task's mems_allowed doesn't get
 updated. And the child task's mems_allowed can be wrong if the
 cpuset's nodemask changes before the child has been added to the
 cgroup's tasklist."

Signed-off-by: Gu Zheng 
Acked-by: Li Zefan 
Signed-off-by: Tejun Heo 
Signed-off-by: Ben Hutchings

perf: Fix race in removing an event

2014-07-11T12:33:56+00:00

commit 46ce0fe97a6be7532ce6126bb26ce89fed81528c upstream.

When removing a (sibling) event we do:

	raw_spin_lock_irq(&ctx->lock);
	perf_group_detach(event);
	raw_spin_unlock_irq(&ctx->lock);

	

	perf_remove_from_context(event);
		raw_spin_lock_irq(&ctx->lock);
		...
		raw_spin_unlock_irq(&ctx->lock);

Now, assuming the event is a sibling, it will be 'unreachable' for
things like ctx_sched_out() because that iterates the
groups->siblings, and we just unhooked the sibling.

So, if during  we get ctx_sched_out(), it will miss the event
and not call event_sched_out() on it, leaving it programmed on the
PMU.

The subsequent perf_remove_from_context() call will find the ctx is
inactive and only call list_del_event() to remove the event from all
other lists.

Hereafter we can proceed to free the event; while still programmed!

Close this hole by moving perf_group_detach() inside the same
ctx->lock region(s) perf_remove_from_context() has.

The condition on inherited events only in __perf_event_exit_task() is
likely complete crap because non-inherited events are part of groups
too and we're tearing down just the same. But leave that for another
patch.

Most-likely-Fixes: e03a9a55b4e ("perf: Change close() semantics for group events")
Reported-by: Vince Weaver 
Tested-by: Vince Weaver 
Much-staring-at-traces-by: Vince Weaver 
Much-staring-at-traces-by: Thomas Gleixner 
Cc: Arnaldo Carvalho de Melo 
Cc: Linus Torvalds 
Signed-off-by: Peter Zijlstra 
Link: http://lkml.kernel.org/r/20140505093124.GN17778@laptop.programming.kicks-ass.net
Signed-off-by: Ingo Molnar 
[bwh: Backported to 3.2: drop change in perf_pmu_migrate_context()]
Signed-off-by: Ben Hutchings

tracing: Fix syscall_*regfunc() vs copy_process() race

2014-07-11T12:33:53+00:00

commit 4af4206be2bd1933cae20c2b6fb2058dbc887f7c upstream.

syscall_regfunc() and syscall_unregfunc() should set/clear
TIF_SYSCALL_TRACEPOINT system-wide, but do_each_thread() can race
with copy_process() and miss the new child which was not added to
the process/thread lists yet.

Change copy_process() to update the child's TIF_SYSCALL_TRACEPOINT
under tasklist.

Link: http://lkml.kernel.org/p/20140413185854.GB20668@redhat.com

Fixes: a871bd33a6c0 "tracing: Add syscall tracepoints"
Acked-by: Frederic Weisbecker 
Acked-by: Paul E. McKenney 
Signed-off-by: Oleg Nesterov 
Signed-off-by: Steven Rostedt 
Signed-off-by: Ben Hutchings