<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux-toradex.git/kernel, branch v3.10.14</title>
<subtitle>Linux kernel for Apalis and Colibri modules</subtitle>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/'/>
<entry>
<title>audit: fix endless wait in audit_log_start()</title>
<updated>2013-10-01T16:17:48+00:00</updated>
<author>
<name>Konstantin Khlebnikov</name>
<email>khlebnikov@openvz.org</email>
</author>
<published>2013-09-24T22:27:42+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=3ed3690eac0c839c6216fd7f7ce0add6a1f593b2'/>
<id>3ed3690eac0c839c6216fd7f7ce0add6a1f593b2</id>
<content type='text'>
commit 8ac1c8d5deba65513b6a82c35e89e73996c8e0d6 upstream.

After commit 829199197a43 ("kernel/audit.c: avoid negative sleep
durations") audit emitters will block forever if userspace daemon cannot
handle backlog.

After the timeout the waiting loop turns into busy loop and runs until
daemon dies or returns back to work.  This is a minimal patch for that
bug.

Signed-off-by: Konstantin Khlebnikov &lt;khlebnikov@openvz.org&gt;
Cc: Luiz Capitulino &lt;lcapitulino@redhat.com&gt;
Cc: Richard Guy Briggs &lt;rgb@redhat.com&gt;
Cc: Eric Paris &lt;eparis@redhat.com&gt;
Cc: Chuck Anderson &lt;chuck.anderson@oracle.com&gt;
Cc: Dan Duval &lt;dan.duval@oracle.com&gt;
Cc: Dave Kleikamp &lt;dave.kleikamp@oracle.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Cc: Jonghwan Choi &lt;jhbird.choi@samsung.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
commit 8ac1c8d5deba65513b6a82c35e89e73996c8e0d6 upstream.

After commit 829199197a43 ("kernel/audit.c: avoid negative sleep
durations") audit emitters will block forever if userspace daemon cannot
handle backlog.

After the timeout the waiting loop turns into busy loop and runs until
daemon dies or returns back to work.  This is a minimal patch for that
bug.

Signed-off-by: Konstantin Khlebnikov &lt;khlebnikov@openvz.org&gt;
Cc: Luiz Capitulino &lt;lcapitulino@redhat.com&gt;
Cc: Richard Guy Briggs &lt;rgb@redhat.com&gt;
Cc: Eric Paris &lt;eparis@redhat.com&gt;
Cc: Chuck Anderson &lt;chuck.anderson@oracle.com&gt;
Cc: Dan Duval &lt;dan.duval@oracle.com&gt;
Cc: Dave Kleikamp &lt;dave.kleikamp@oracle.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Cc: Jonghwan Choi &lt;jhbird.choi@samsung.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</pre>
</div>
</content>
</entry>
<entry>
<title>sched/fair: Fix small race where child-&gt;se.parent,cfs_rq might point to invalid ones</title>
<updated>2013-10-01T16:17:45+00:00</updated>
<author>
<name>Daisuke Nishimura</name>
<email>nishimura@mxp.nes.nec.co.jp</email>
</author>
<published>2013-09-10T09:16:36+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=51f5294797f14185f9b25ad0972a4abbbeff70c6'/>
<id>51f5294797f14185f9b25ad0972a4abbbeff70c6</id>
<content type='text'>
commit 6c9a27f5da9609fca46cb2b183724531b48f71ad upstream.

There is a small race between copy_process() and cgroup_attach_task()
where child-&gt;se.parent,cfs_rq points to invalid (old) ones.

        parent doing fork()      | someone moving the parent to another cgroup
  -------------------------------+---------------------------------------------
    copy_process()
      + dup_task_struct()
        -&gt; parent-&gt;se is copied to child-&gt;se.
           se.parent,cfs_rq of them point to old ones.

                                     cgroup_attach_task()
                                       + cgroup_task_migrate()
                                         -&gt; parent-&gt;cgroup is updated.
                                       + cpu_cgroup_attach()
                                         + sched_move_task()
                                           + task_move_group_fair()
                                             +- set_task_rq()
                                                -&gt; se.parent,cfs_rq of parent
                                                   are updated.

      + cgroup_fork()
        -&gt; parent-&gt;cgroup is copied to child-&gt;cgroup. (*1)
      + sched_fork()
        + task_fork_fair()
          -&gt; se.parent,cfs_rq of child are accessed
             while they point to old ones. (*2)

In the worst case, this bug can lead to "use-after-free" and cause a panic,
because it's new cgroup's refcount that is incremented at (*1),
so the old cgroup(and related data) can be freed before (*2).

In fact, a panic caused by this bug was originally caught in RHEL6.4.

    BUG: unable to handle kernel NULL pointer dereference at (null)
    IP: [&lt;ffffffff81051e3e&gt;] sched_slice+0x6e/0xa0
    [...]
    Call Trace:
     [&lt;ffffffff81051f25&gt;] place_entity+0x75/0xa0
     [&lt;ffffffff81056a3a&gt;] task_fork_fair+0xaa/0x160
     [&lt;ffffffff81063c0b&gt;] sched_fork+0x6b/0x140
     [&lt;ffffffff8106c3c2&gt;] copy_process+0x5b2/0x1450
     [&lt;ffffffff81063b49&gt;] ? wake_up_new_task+0xd9/0x130
     [&lt;ffffffff8106d2f4&gt;] do_fork+0x94/0x460
     [&lt;ffffffff81072a9e&gt;] ? sys_wait4+0xae/0x100
     [&lt;ffffffff81009598&gt;] sys_clone+0x28/0x30
     [&lt;ffffffff8100b393&gt;] stub_clone+0x13/0x20
     [&lt;ffffffff8100b072&gt;] ? system_call_fastpath+0x16/0x1b

Signed-off-by: Daisuke Nishimura &lt;nishimura@mxp.nes.nec.co.jp&gt;
Signed-off-by: Peter Zijlstra &lt;peterz@infradead.org&gt;
Link: http://lkml.kernel.org/r/039601ceae06$733d3130$59b79390$@mxp.nes.nec.co.jp
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
commit 6c9a27f5da9609fca46cb2b183724531b48f71ad upstream.

There is a small race between copy_process() and cgroup_attach_task()
where child-&gt;se.parent,cfs_rq points to invalid (old) ones.

        parent doing fork()      | someone moving the parent to another cgroup
  -------------------------------+---------------------------------------------
    copy_process()
      + dup_task_struct()
        -&gt; parent-&gt;se is copied to child-&gt;se.
           se.parent,cfs_rq of them point to old ones.

                                     cgroup_attach_task()
                                       + cgroup_task_migrate()
                                         -&gt; parent-&gt;cgroup is updated.
                                       + cpu_cgroup_attach()
                                         + sched_move_task()
                                           + task_move_group_fair()
                                             +- set_task_rq()
                                                -&gt; se.parent,cfs_rq of parent
                                                   are updated.

      + cgroup_fork()
        -&gt; parent-&gt;cgroup is copied to child-&gt;cgroup. (*1)
      + sched_fork()
        + task_fork_fair()
          -&gt; se.parent,cfs_rq of child are accessed
             while they point to old ones. (*2)

In the worst case, this bug can lead to "use-after-free" and cause a panic,
because it's new cgroup's refcount that is incremented at (*1),
so the old cgroup(and related data) can be freed before (*2).

In fact, a panic caused by this bug was originally caught in RHEL6.4.

    BUG: unable to handle kernel NULL pointer dereference at (null)
    IP: [&lt;ffffffff81051e3e&gt;] sched_slice+0x6e/0xa0
    [...]
    Call Trace:
     [&lt;ffffffff81051f25&gt;] place_entity+0x75/0xa0
     [&lt;ffffffff81056a3a&gt;] task_fork_fair+0xaa/0x160
     [&lt;ffffffff81063c0b&gt;] sched_fork+0x6b/0x140
     [&lt;ffffffff8106c3c2&gt;] copy_process+0x5b2/0x1450
     [&lt;ffffffff81063b49&gt;] ? wake_up_new_task+0xd9/0x130
     [&lt;ffffffff8106d2f4&gt;] do_fork+0x94/0x460
     [&lt;ffffffff81072a9e&gt;] ? sys_wait4+0xae/0x100
     [&lt;ffffffff81009598&gt;] sys_clone+0x28/0x30
     [&lt;ffffffff8100b393&gt;] stub_clone+0x13/0x20
     [&lt;ffffffff8100b072&gt;] ? system_call_fastpath+0x16/0x1b

Signed-off-by: Daisuke Nishimura &lt;nishimura@mxp.nes.nec.co.jp&gt;
Signed-off-by: Peter Zijlstra &lt;peterz@infradead.org&gt;
Link: http://lkml.kernel.org/r/039601ceae06$733d3130$59b79390$@mxp.nes.nec.co.jp
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</pre>
</div>
</content>
</entry>
<entry>
<title>sched/cputime: Do not scale when utime == 0</title>
<updated>2013-10-01T16:17:45+00:00</updated>
<author>
<name>Stanislaw Gruszka</name>
<email>sgruszka@redhat.com</email>
</author>
<published>2013-09-04T13:16:03+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=013b14c3067f0d55ab115405647d9f035f67737e'/>
<id>013b14c3067f0d55ab115405647d9f035f67737e</id>
<content type='text'>
commit 5a8e01f8fa51f5cbce8f37acc050eb2319d12956 upstream.

scale_stime() silently assumes that stime &lt; rtime, otherwise
when stime == rtime and both values are big enough (operations
on them do not fit in 32 bits), the resulting scaling stime can
be bigger than rtime. In consequence utime = rtime - stime
results in negative value.

User space visible symptoms of the bug are overflowed TIME
values on ps/top, for example:

 $ ps aux | grep rcu
 root         8  0.0  0.0      0     0 ?        S    12:42   0:00 [rcuc/0]
 root         9  0.0  0.0      0     0 ?        S    12:42   0:00 [rcub/0]
 root        10 62422329  0.0  0     0 ?        R    12:42 21114581:37 [rcu_preempt]
 root        11  0.1  0.0      0     0 ?        S    12:42   0:02 [rcuop/0]
 root        12 62422329  0.0  0     0 ?        S    12:42 21114581:35 [rcuop/1]
 root        10 62422329  0.0  0     0 ?        R    12:42 21114581:37 [rcu_preempt]

or overflowed utime values read directly from /proc/$PID/stat

Reference:

  https://lkml.org/lkml/2013/8/20/259

Reported-and-tested-by: Sergey Senozhatsky &lt;sergey.senozhatsky@gmail.com&gt;
Signed-off-by: Stanislaw Gruszka &lt;sgruszka@redhat.com&gt;
Cc: stable@vger.kernel.org
Cc: Frederic Weisbecker &lt;fweisbec@gmail.com&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Paul E. McKenney &lt;paulmck@linux.vnet.ibm.com&gt;
Cc: Borislav Petkov &lt;bp@alien8.de&gt;
Link: http://lkml.kernel.org/r/20130904131602.GC2564@redhat.com
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
commit 5a8e01f8fa51f5cbce8f37acc050eb2319d12956 upstream.

scale_stime() silently assumes that stime &lt; rtime, otherwise
when stime == rtime and both values are big enough (operations
on them do not fit in 32 bits), the resulting scaling stime can
be bigger than rtime. In consequence utime = rtime - stime
results in negative value.

User space visible symptoms of the bug are overflowed TIME
values on ps/top, for example:

 $ ps aux | grep rcu
 root         8  0.0  0.0      0     0 ?        S    12:42   0:00 [rcuc/0]
 root         9  0.0  0.0      0     0 ?        S    12:42   0:00 [rcub/0]
 root        10 62422329  0.0  0     0 ?        R    12:42 21114581:37 [rcu_preempt]
 root        11  0.1  0.0      0     0 ?        S    12:42   0:02 [rcuop/0]
 root        12 62422329  0.0  0     0 ?        S    12:42 21114581:35 [rcuop/1]
 root        10 62422329  0.0  0     0 ?        R    12:42 21114581:37 [rcu_preempt]

or overflowed utime values read directly from /proc/$PID/stat

Reference:

  https://lkml.org/lkml/2013/8/20/259

Reported-and-tested-by: Sergey Senozhatsky &lt;sergey.senozhatsky@gmail.com&gt;
Signed-off-by: Stanislaw Gruszka &lt;sgruszka@redhat.com&gt;
Cc: stable@vger.kernel.org
Cc: Frederic Weisbecker &lt;fweisbec@gmail.com&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Paul E. McKenney &lt;paulmck@linux.vnet.ibm.com&gt;
Cc: Borislav Petkov &lt;bp@alien8.de&gt;
Link: http://lkml.kernel.org/r/20130904131602.GC2564@redhat.com
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</pre>
</div>
</content>
</entry>
<entry>
<title>timekeeping: Fix HRTICK related deadlock from ntp lock changes</title>
<updated>2013-10-01T16:17:45+00:00</updated>
<author>
<name>John Stultz</name>
<email>john.stultz@linaro.org</email>
</author>
<published>2013-09-11T23:50:56+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=63946e8616205dafe39b4d88f9fc3dc7c4fd79aa'/>
<id>63946e8616205dafe39b4d88f9fc3dc7c4fd79aa</id>
<content type='text'>
commit 7bd36014460f793c19e7d6c94dab67b0afcfcb7f upstream.

Gerlando Falauto reported that when HRTICK is enabled, it is
possible to trigger system deadlocks. These were hard to
reproduce, as HRTICK has been broken in the past, but seemed
to be connected to the timekeeping_seq lock.

Since seqlock/seqcount's aren't supported w/ lockdep, I added
some extra spinlock based locking and triggered the following
lockdep output:

[   15.849182] ntpd/4062 is trying to acquire lock:
[   15.849765]  (&amp;(&amp;pool-&gt;lock)-&gt;rlock){..-...}, at: [&lt;ffffffff810aa9b5&gt;] __queue_work+0x145/0x480
[   15.850051]
[   15.850051] but task is already holding lock:
[   15.850051]  (timekeeper_lock){-.-.-.}, at: [&lt;ffffffff810df6df&gt;] do_adjtimex+0x7f/0x100

&lt;snip&gt;

[   15.850051] Chain exists of: &amp;(&amp;pool-&gt;lock)-&gt;rlock --&gt; &amp;p-&gt;pi_lock --&gt; timekeeper_lock
[   15.850051]  Possible unsafe locking scenario:
[   15.850051]
[   15.850051]        CPU0                    CPU1
[   15.850051]        ----                    ----
[   15.850051]   lock(timekeeper_lock);
[   15.850051]                                lock(&amp;p-&gt;pi_lock);
[   15.850051] lock(timekeeper_lock);
[   15.850051] lock(&amp;(&amp;pool-&gt;lock)-&gt;rlock);
[   15.850051]
[   15.850051]  *** DEADLOCK ***

The deadlock was introduced by 06c017fdd4dc48451a ("timekeeping:
Hold timekeepering locks in do_adjtimex and hardpps") in 3.10

This patch avoids this deadlock, by moving the call to
schedule_delayed_work() outside of the timekeeper lock
critical section.

Reported-by: Gerlando Falauto &lt;gerlando.falauto@keymile.com&gt;
Tested-by: Lin Ming &lt;minggr@gmail.com&gt;
Signed-off-by: John Stultz &lt;john.stultz@linaro.org&gt;
Cc: Mathieu Desnoyers &lt;mathieu.desnoyers@efficios.com&gt;
Link: http://lkml.kernel.org/r/1378943457-27314-1-git-send-email-john.stultz@linaro.org
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
commit 7bd36014460f793c19e7d6c94dab67b0afcfcb7f upstream.

Gerlando Falauto reported that when HRTICK is enabled, it is
possible to trigger system deadlocks. These were hard to
reproduce, as HRTICK has been broken in the past, but seemed
to be connected to the timekeeping_seq lock.

Since seqlock/seqcount's aren't supported w/ lockdep, I added
some extra spinlock based locking and triggered the following
lockdep output:

[   15.849182] ntpd/4062 is trying to acquire lock:
[   15.849765]  (&amp;(&amp;pool-&gt;lock)-&gt;rlock){..-...}, at: [&lt;ffffffff810aa9b5&gt;] __queue_work+0x145/0x480
[   15.850051]
[   15.850051] but task is already holding lock:
[   15.850051]  (timekeeper_lock){-.-.-.}, at: [&lt;ffffffff810df6df&gt;] do_adjtimex+0x7f/0x100

&lt;snip&gt;

[   15.850051] Chain exists of: &amp;(&amp;pool-&gt;lock)-&gt;rlock --&gt; &amp;p-&gt;pi_lock --&gt; timekeeper_lock
[   15.850051]  Possible unsafe locking scenario:
[   15.850051]
[   15.850051]        CPU0                    CPU1
[   15.850051]        ----                    ----
[   15.850051]   lock(timekeeper_lock);
[   15.850051]                                lock(&amp;p-&gt;pi_lock);
[   15.850051] lock(timekeeper_lock);
[   15.850051] lock(&amp;(&amp;pool-&gt;lock)-&gt;rlock);
[   15.850051]
[   15.850051]  *** DEADLOCK ***

The deadlock was introduced by 06c017fdd4dc48451a ("timekeeping:
Hold timekeepering locks in do_adjtimex and hardpps") in 3.10

This patch avoids this deadlock, by moving the call to
schedule_delayed_work() outside of the timekeeper lock
critical section.

Reported-by: Gerlando Falauto &lt;gerlando.falauto@keymile.com&gt;
Tested-by: Lin Ming &lt;minggr@gmail.com&gt;
Signed-off-by: John Stultz &lt;john.stultz@linaro.org&gt;
Cc: Mathieu Desnoyers &lt;mathieu.desnoyers@efficios.com&gt;
Link: http://lkml.kernel.org/r/1378943457-27314-1-git-send-email-john.stultz@linaro.org
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</pre>
</div>
</content>
</entry>
<entry>
<title>pidns: fix vfork() after unshare(CLONE_NEWPID)</title>
<updated>2013-09-27T00:18:28+00:00</updated>
<author>
<name>Oleg Nesterov</name>
<email>oleg@redhat.com</email>
</author>
<published>2013-09-11T21:19:38+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=f608ebd760cb9b48a333700fdc10245e257d9fce'/>
<id>f608ebd760cb9b48a333700fdc10245e257d9fce</id>
<content type='text'>
commit e79f525e99b04390ca4d2366309545a836c03bf1 upstream.

Commit 8382fcac1b81 ("pidns: Outlaw thread creation after
unshare(CLONE_NEWPID)") nacks CLONE_VM if the forking process unshared
pid_ns, this obviously breaks vfork:

	int main(void)
	{
		assert(unshare(CLONE_NEWUSER | CLONE_NEWPID) == 0);
		assert(vfork() &gt;= 0);
		_exit(0);
		return 0;
	}

fails without this patch.

Change this check to use CLONE_SIGHAND instead.  This also forbids
CLONE_THREAD automatically, and this is what the comment implies.

We could probably even drop CLONE_SIGHAND and use CLONE_THREAD, but it
would be safer to not do this.  The current check denies CLONE_SIGHAND
implicitely and there is no reason to change this.

Eric said "CLONE_SIGHAND is fine.  CLONE_THREAD would be even better.
Having shared signal handling between two different pid namespaces is
the case that we are fundamentally guarding against."

Signed-off-by: Oleg Nesterov &lt;oleg@redhat.com&gt;
Reported-by: Colin Walters &lt;walters@redhat.com&gt;
Acked-by: Andy Lutomirski &lt;luto@amacapital.net&gt;
Reviewed-by: "Eric W. Biederman" &lt;ebiederm@xmission.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
commit e79f525e99b04390ca4d2366309545a836c03bf1 upstream.

Commit 8382fcac1b81 ("pidns: Outlaw thread creation after
unshare(CLONE_NEWPID)") nacks CLONE_VM if the forking process unshared
pid_ns, this obviously breaks vfork:

	int main(void)
	{
		assert(unshare(CLONE_NEWUSER | CLONE_NEWPID) == 0);
		assert(vfork() &gt;= 0);
		_exit(0);
		return 0;
	}

fails without this patch.

Change this check to use CLONE_SIGHAND instead.  This also forbids
CLONE_THREAD automatically, and this is what the comment implies.

We could probably even drop CLONE_SIGHAND and use CLONE_THREAD, but it
would be safer to not do this.  The current check denies CLONE_SIGHAND
implicitely and there is no reason to change this.

Eric said "CLONE_SIGHAND is fine.  CLONE_THREAD would be even better.
Having shared signal handling between two different pid namespaces is
the case that we are fundamentally guarding against."

Signed-off-by: Oleg Nesterov &lt;oleg@redhat.com&gt;
Reported-by: Colin Walters &lt;walters@redhat.com&gt;
Acked-by: Andy Lutomirski &lt;luto@amacapital.net&gt;
Reviewed-by: "Eric W. Biederman" &lt;ebiederm@xmission.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</pre>
</div>
</content>
</entry>
<entry>
<title>pidns: Fix hang in zap_pid_ns_processes by sending a potentially extra wakeup</title>
<updated>2013-09-27T00:18:27+00:00</updated>
<author>
<name>Eric W. Biederman</name>
<email>ebiederm@xmission.com</email>
</author>
<published>2013-08-29T20:56:50+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=5a48788ca4d6dc78d1856d9ddeea1b1160097cc9'/>
<id>5a48788ca4d6dc78d1856d9ddeea1b1160097cc9</id>
<content type='text'>
commit a606488513543312805fab2b93070cefe6a3016c upstream.

Serge Hallyn &lt;serge.hallyn@ubuntu.com&gt; writes:

&gt; Since commit af4b8a83add95ef40716401395b44a1b579965f4 it's been
&gt; possible to get into a situation where a pidns reaper is
&gt; &lt;defunct&gt;, reparented to host pid 1, but never reaped.  How to
&gt; reproduce this is documented at
&gt;
&gt; https://bugs.launchpad.net/ubuntu/+source/lxc/+bug/1168526
&gt; (and see
&gt; https://bugs.launchpad.net/ubuntu/+source/lxc/+bug/1168526/comments/13)
&gt; In short, run repeated starts of a container whose init is
&gt;
&gt; Process.exit(0);
&gt;
&gt; sysrq-t when such a task is playing zombie shows:
&gt;
&gt; [  131.132978] init            x ffff88011fc14580     0  2084   2039 0x00000000
&gt; [  131.132978]  ffff880116e89ea8 0000000000000002 ffff880116e89fd8 0000000000014580
&gt; [  131.132978]  ffff880116e89fd8 0000000000014580 ffff8801172a0000 ffff8801172a0000
&gt; [  131.132978]  ffff8801172a0630 ffff88011729fff0 ffff880116e14650 ffff88011729fff0
&gt; [  131.132978] Call Trace:
&gt; [  131.132978]  [&lt;ffffffff816f6159&gt;] schedule+0x29/0x70
&gt; [  131.132978]  [&lt;ffffffff81064591&gt;] do_exit+0x6e1/0xa40
&gt; [  131.132978]  [&lt;ffffffff81071eae&gt;] ? signal_wake_up_state+0x1e/0x30
&gt; [  131.132978]  [&lt;ffffffff8106496f&gt;] do_group_exit+0x3f/0xa0
&gt; [  131.132978]  [&lt;ffffffff810649e4&gt;] SyS_exit_group+0x14/0x20
&gt; [  131.132978]  [&lt;ffffffff8170102f&gt;] tracesys+0xe1/0xe6
&gt;
&gt; Further debugging showed that every time this happened, zap_pid_ns_processes()
&gt; started with nr_hashed being 3, while we were expecting it to drop to 2.
&gt; Any time it didn't happen, nr_hashed was 1 or 2.  So the reaper was
&gt; waiting for nr_hashed to become 2, but free_pid() only wakes the reaper
&gt; if nr_hashed hits 1.

The issue is that when the task group leader of an init process exits
before other tasks of the init process when the init process finally
exits it will be a secondary task sleeping in zap_pid_ns_processes and
waiting to wake up when the number of hashed pids drops to two.  This
case waits forever as free_pid only sends a wake up when the number of
hashed pids drops to 1.

To correct this the simple strategy of sending a possibly unncessary
wake up when the number of hashed pids drops to 2 is adopted.

Sending one extraneous wake up is relatively harmless, at worst we
waste a little cpu time in the rare case when a pid namespace
appropaches exiting.

We can detect the case when the pid namespace drops to just two pids
hashed race free in free_pid.

Dereferencing pid_ns-&gt;child_reaper with the pidmap_lock held is safe
without out the tasklist_lock because it is guaranteed that the
detach_pid will be called on the child_reaper before it is freed and
detach_pid calls __change_pid which calls free_pid which takes the
pidmap_lock.  __change_pid only calls free_pid if this is the
last use of the pid.  For a thread that is not the thread group leader
the threads pid will only ever have one user because a threads pid
is not allowed to be the pid of a process, of a process group or
a session.  For a thread that is a thread group leader all of
the other threads of that process will be reaped before it is allowed
for the thread group leader to be reaped ensuring there will only
be one user of the threads pid as a process pid.  Furthermore
because the thread is the init process of a pid namespace all of the
other processes in the pid namespace will have also been already freed
leading to the fact that the pid will not be used as a session pid or
a process group pid for any other running process.

Acked-by: Serge Hallyn &lt;serge.hallyn@canonical.com&gt;
Tested-by: Serge Hallyn &lt;serge.hallyn@canonical.com&gt;
Reported-by: Serge Hallyn &lt;serge.hallyn@ubuntu.com&gt;
Signed-off-by: "Eric W. Biederman" &lt;ebiederm@xmission.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
commit a606488513543312805fab2b93070cefe6a3016c upstream.

Serge Hallyn &lt;serge.hallyn@ubuntu.com&gt; writes:

&gt; Since commit af4b8a83add95ef40716401395b44a1b579965f4 it's been
&gt; possible to get into a situation where a pidns reaper is
&gt; &lt;defunct&gt;, reparented to host pid 1, but never reaped.  How to
&gt; reproduce this is documented at
&gt;
&gt; https://bugs.launchpad.net/ubuntu/+source/lxc/+bug/1168526
&gt; (and see
&gt; https://bugs.launchpad.net/ubuntu/+source/lxc/+bug/1168526/comments/13)
&gt; In short, run repeated starts of a container whose init is
&gt;
&gt; Process.exit(0);
&gt;
&gt; sysrq-t when such a task is playing zombie shows:
&gt;
&gt; [  131.132978] init            x ffff88011fc14580     0  2084   2039 0x00000000
&gt; [  131.132978]  ffff880116e89ea8 0000000000000002 ffff880116e89fd8 0000000000014580
&gt; [  131.132978]  ffff880116e89fd8 0000000000014580 ffff8801172a0000 ffff8801172a0000
&gt; [  131.132978]  ffff8801172a0630 ffff88011729fff0 ffff880116e14650 ffff88011729fff0
&gt; [  131.132978] Call Trace:
&gt; [  131.132978]  [&lt;ffffffff816f6159&gt;] schedule+0x29/0x70
&gt; [  131.132978]  [&lt;ffffffff81064591&gt;] do_exit+0x6e1/0xa40
&gt; [  131.132978]  [&lt;ffffffff81071eae&gt;] ? signal_wake_up_state+0x1e/0x30
&gt; [  131.132978]  [&lt;ffffffff8106496f&gt;] do_group_exit+0x3f/0xa0
&gt; [  131.132978]  [&lt;ffffffff810649e4&gt;] SyS_exit_group+0x14/0x20
&gt; [  131.132978]  [&lt;ffffffff8170102f&gt;] tracesys+0xe1/0xe6
&gt;
&gt; Further debugging showed that every time this happened, zap_pid_ns_processes()
&gt; started with nr_hashed being 3, while we were expecting it to drop to 2.
&gt; Any time it didn't happen, nr_hashed was 1 or 2.  So the reaper was
&gt; waiting for nr_hashed to become 2, but free_pid() only wakes the reaper
&gt; if nr_hashed hits 1.

The issue is that when the task group leader of an init process exits
before other tasks of the init process when the init process finally
exits it will be a secondary task sleeping in zap_pid_ns_processes and
waiting to wake up when the number of hashed pids drops to two.  This
case waits forever as free_pid only sends a wake up when the number of
hashed pids drops to 1.

To correct this the simple strategy of sending a possibly unncessary
wake up when the number of hashed pids drops to 2 is adopted.

Sending one extraneous wake up is relatively harmless, at worst we
waste a little cpu time in the rare case when a pid namespace
appropaches exiting.

We can detect the case when the pid namespace drops to just two pids
hashed race free in free_pid.

Dereferencing pid_ns-&gt;child_reaper with the pidmap_lock held is safe
without out the tasklist_lock because it is guaranteed that the
detach_pid will be called on the child_reaper before it is freed and
detach_pid calls __change_pid which calls free_pid which takes the
pidmap_lock.  __change_pid only calls free_pid if this is the
last use of the pid.  For a thread that is not the thread group leader
the threads pid will only ever have one user because a threads pid
is not allowed to be the pid of a process, of a process group or
a session.  For a thread that is a thread group leader all of
the other threads of that process will be reaped before it is allowed
for the thread group leader to be reaped ensuring there will only
be one user of the threads pid as a process pid.  Furthermore
because the thread is the init process of a pid namespace all of the
other processes in the pid namespace will have also been already freed
leading to the fact that the pid will not be used as a session pid or
a process group pid for any other running process.

Acked-by: Serge Hallyn &lt;serge.hallyn@canonical.com&gt;
Tested-by: Serge Hallyn &lt;serge.hallyn@canonical.com&gt;
Reported-by: Serge Hallyn &lt;serge.hallyn@ubuntu.com&gt;
Signed-off-by: "Eric W. Biederman" &lt;ebiederm@xmission.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</pre>
</div>
</content>
</entry>
<entry>
<title>uprobes: Fix utask-&gt;depth accounting in handle_trampoline()</title>
<updated>2013-09-27T00:18:27+00:00</updated>
<author>
<name>Oleg Nesterov</name>
<email>oleg@redhat.com</email>
</author>
<published>2013-09-11T15:47:26+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=73e2c2b7c105e46344eb409575a4508f24a82eee'/>
<id>73e2c2b7c105e46344eb409575a4508f24a82eee</id>
<content type='text'>
commit 878b5a6efd38030c7a90895dc8346e8fb1e09b4c upstream.

Currently utask-&gt;depth is simply the number of allocated/pending
return_instance's in uprobe_task-&gt;return_instances list.

handle_trampoline() should decrement this counter every time we
handle/free an instance, but due to typo it does this only if
-&gt;chained == T. This means that in the likely case this counter
is never decremented and the probed task can't report more than
MAX_URETPROBE_DEPTH events.

Reported-by: Mikhail Kulemin &lt;Mikhail.Kulemin@ru.ibm.com&gt;
Reported-by: Hemant Kumar Shaw &lt;hkshaw@linux.vnet.ibm.com&gt;
Signed-off-by: Oleg Nesterov &lt;oleg@redhat.com&gt;
Acked-by: Anton Arapov &lt;anton@redhat.com&gt;
Cc: masami.hiramatsu.pt@hitachi.com
Cc: srikar@linux.vnet.ibm.com
Cc: systemtap@sourceware.org
Link: http://lkml.kernel.org/r/20130911154726.GA8093@redhat.com
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
commit 878b5a6efd38030c7a90895dc8346e8fb1e09b4c upstream.

Currently utask-&gt;depth is simply the number of allocated/pending
return_instance's in uprobe_task-&gt;return_instances list.

handle_trampoline() should decrement this counter every time we
handle/free an instance, but due to typo it does this only if
-&gt;chained == T. This means that in the likely case this counter
is never decremented and the probed task can't report more than
MAX_URETPROBE_DEPTH events.

Reported-by: Mikhail Kulemin &lt;Mikhail.Kulemin@ru.ibm.com&gt;
Reported-by: Hemant Kumar Shaw &lt;hkshaw@linux.vnet.ibm.com&gt;
Signed-off-by: Oleg Nesterov &lt;oleg@redhat.com&gt;
Acked-by: Anton Arapov &lt;anton@redhat.com&gt;
Cc: masami.hiramatsu.pt@hitachi.com
Cc: srikar@linux.vnet.ibm.com
Cc: systemtap@sourceware.org
Link: http://lkml.kernel.org/r/20130911154726.GA8093@redhat.com
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</pre>
</div>
</content>
</entry>
<entry>
<title>workqueue: cond_resched() after processing each work item</title>
<updated>2013-09-08T05:09:58+00:00</updated>
<author>
<name>Tejun Heo</name>
<email>tj@kernel.org</email>
</author>
<published>2013-08-28T21:33:37+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=6ff96f7340e8a0b7b7e2c40a26bc47fb320e6475'/>
<id>6ff96f7340e8a0b7b7e2c40a26bc47fb320e6475</id>
<content type='text'>
commit b22ce2785d97423846206cceec4efee0c4afd980 upstream.

If !PREEMPT, a kworker running work items back to back can hog CPU.
This becomes dangerous when a self-requeueing work item which is
waiting for something to happen races against stop_machine.  Such
self-requeueing work item would requeue itself indefinitely hogging
the kworker and CPU it's running on while stop_machine would wait for
that CPU to enter stop_machine while preventing anything else from
happening on all other CPUs.  The two would deadlock.

Jamie Liu reports that this deadlock scenario exists around
scsi_requeue_run_queue() and libata port multiplier support, where one
port may exclude command processing from other ports.  With the right
timing, scsi_requeue_run_queue() can end up requeueing itself trying
to execute an IO which is asked to be retried while another device has
an exclusive access, which in turn can't make forward progress due to
stop_machine.

Fix it by invoking cond_resched() after executing each work item.

Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Reported-by: Jamie Liu &lt;jamieliu@google.com&gt;
References: http://thread.gmane.org/gmane.linux.kernel/1552567
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
commit b22ce2785d97423846206cceec4efee0c4afd980 upstream.

If !PREEMPT, a kworker running work items back to back can hog CPU.
This becomes dangerous when a self-requeueing work item which is
waiting for something to happen races against stop_machine.  Such
self-requeueing work item would requeue itself indefinitely hogging
the kworker and CPU it's running on while stop_machine would wait for
that CPU to enter stop_machine while preventing anything else from
happening on all other CPUs.  The two would deadlock.

Jamie Liu reports that this deadlock scenario exists around
scsi_requeue_run_queue() and libata port multiplier support, where one
port may exclude command processing from other ports.  With the right
timing, scsi_requeue_run_queue() can end up requeueing itself trying
to execute an IO which is asked to be retried while another device has
an exclusive access, which in turn can't make forward progress due to
stop_machine.

Fix it by invoking cond_resched() after executing each work item.

Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Reported-by: Jamie Liu &lt;jamieliu@google.com&gt;
References: http://thread.gmane.org/gmane.linux.kernel/1552567
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</pre>
</div>
</content>
</entry>
<entry>
<title>timer_list: correct the iterator for timer_list</title>
<updated>2013-09-08T05:09:58+00:00</updated>
<author>
<name>Nathan Zimmer</name>
<email>nzimmer@sgi.com</email>
</author>
<published>2013-08-28T23:35:14+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=b3772c81e3490f1ddc0547ee479598f3f4221699'/>
<id>b3772c81e3490f1ddc0547ee479598f3f4221699</id>
<content type='text'>
commit 84a78a6504f5c5394a8e558702e5b54131f01d14 upstream.

Correct an issue with /proc/timer_list reported by Holger.

When reading from the proc file with a sufficiently small buffer, 2k so
not really that small, there was one could get hung trying to read the
file a chunk at a time.

The timer_list_start function failed to account for the possibility that
the offset was adjusted outside the timer_list_next.

Signed-off-by: Nathan Zimmer &lt;nzimmer@sgi.com&gt;
Reported-by: Holger Hans Peter Freyther &lt;holger@freyther.de&gt;
Cc: John Stultz &lt;john.stultz@linaro.org&gt;
Cc: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Cc: Berke Durak &lt;berke.durak@xiphos.com&gt;
Cc: Jeff Layton &lt;jlayton@redhat.com&gt;
Tested-by: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
commit 84a78a6504f5c5394a8e558702e5b54131f01d14 upstream.

Correct an issue with /proc/timer_list reported by Holger.

When reading from the proc file with a sufficiently small buffer, 2k so
not really that small, there was one could get hung trying to read the
file a chunk at a time.

The timer_list_start function failed to account for the possibility that
the offset was adjusted outside the timer_list_next.

Signed-off-by: Nathan Zimmer &lt;nzimmer@sgi.com&gt;
Reported-by: Holger Hans Peter Freyther &lt;holger@freyther.de&gt;
Cc: John Stultz &lt;john.stultz@linaro.org&gt;
Cc: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Cc: Berke Durak &lt;berke.durak@xiphos.com&gt;
Cc: Jeff Layton &lt;jlayton@redhat.com&gt;
Tested-by: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</pre>
</div>
</content>
</entry>
<entry>
<title>ftrace: Check module functions being traced on reload</title>
<updated>2013-08-29T16:47:34+00:00</updated>
<author>
<name>Steven Rostedt (Red Hat)</name>
<email>rostedt@goodmis.org</email>
</author>
<published>2013-07-30T04:04:32+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=f05e999f16aed044c0b8b0837bdb084cbeb8cee6'/>
<id>f05e999f16aed044c0b8b0837bdb084cbeb8cee6</id>
<content type='text'>
commit 8c4f3c3fa9681dc549cd35419b259496082fef8b upstream.

There's been a nasty bug that would show up and not give much info.
The bug displayed the following warning:

 WARNING: at kernel/trace/ftrace.c:1529 __ftrace_hash_rec_update+0x1e3/0x230()
 Pid: 20903, comm: bash Tainted: G           O 3.6.11+ #38405.trunk
 Call Trace:
  [&lt;ffffffff8103e5ff&gt;] warn_slowpath_common+0x7f/0xc0
  [&lt;ffffffff8103e65a&gt;] warn_slowpath_null+0x1a/0x20
  [&lt;ffffffff810c2ee3&gt;] __ftrace_hash_rec_update+0x1e3/0x230
  [&lt;ffffffff810c4f28&gt;] ftrace_hash_move+0x28/0x1d0
  [&lt;ffffffff811401cc&gt;] ? kfree+0x2c/0x110
  [&lt;ffffffff810c68ee&gt;] ftrace_regex_release+0x8e/0x150
  [&lt;ffffffff81149f1e&gt;] __fput+0xae/0x220
  [&lt;ffffffff8114a09e&gt;] ____fput+0xe/0x10
  [&lt;ffffffff8105fa22&gt;] task_work_run+0x72/0x90
  [&lt;ffffffff810028ec&gt;] do_notify_resume+0x6c/0xc0
  [&lt;ffffffff8126596e&gt;] ? trace_hardirqs_on_thunk+0x3a/0x3c
  [&lt;ffffffff815c0f88&gt;] int_signal+0x12/0x17
 ---[ end trace 793179526ee09b2c ]---

It was finally narrowed down to unloading a module that was being traced.

It was actually more than that. When functions are being traced, there's
a table of all functions that have a ref count of the number of active
tracers attached to that function. When a function trace callback is
registered to a function, the function's record ref count is incremented.
When it is unregistered, the function's record ref count is decremented.
If an inconsistency is detected (ref count goes below zero) the above
warning is shown and the function tracing is permanently disabled until
reboot.

The ftrace callback ops holds a hash of functions that it filters on
(and/or filters off). If the hash is empty, the default means to filter
all functions (for the filter_hash) or to disable no functions (for the
notrace_hash).

When a module is unloaded, it frees the function records that represent
the module functions. These records exist on their own pages, that is
function records for one module will not exist on the same page as
function records for other modules or even the core kernel.

Now when a module unloads, the records that represents its functions are
freed. When the module is loaded again, the records are recreated with
a default ref count of zero (unless there's a callback that traces all
functions, then they will also be traced, and the ref count will be
incremented).

The problem is that if an ftrace callback hash includes functions of the
module being unloaded, those hash entries will not be removed. If the
module is reloaded in the same location, the hash entries still point
to the functions of the module but the module's ref counts do not reflect
that.

With the help of Steve and Joern, we found a reproducer:

 Using uinput module and uinput_release function.

 cd /sys/kernel/debug/tracing
 modprobe uinput
 echo uinput_release &gt; set_ftrace_filter
 echo function &gt; current_tracer
 rmmod uinput
 modprobe uinput
 # check /proc/modules to see if loaded in same addr, otherwise try again
 echo nop &gt; current_tracer

 [BOOM]

The above loads the uinput module, which creates a table of functions that
can be traced within the module.

We add uinput_release to the filter_hash to trace just that function.

Enable function tracincg, which increments the ref count of the record
associated to uinput_release.

Remove uinput, which frees the records including the one that represents
uinput_release.

Load the uinput module again (and make sure it's at the same address).
This recreates the function records all with a ref count of zero,
including uinput_release.

Disable function tracing, which will decrement the ref count for uinput_release
which is now zero because of the module removal and reload, and we have
a mismatch (below zero ref count).

The solution is to check all currently tracing ftrace callbacks to see if any
are tracing any of the module's functions when a module is loaded (it already does
that with callbacks that trace all functions). If a callback happens to have
a module function being traced, it increments that records ref count and starts
tracing that function.

There may be a strange side effect with this, where tracing module functions
on unload and then reloading a new module may have that new module's functions
being traced. This may be something that confuses the user, but it's not
a big deal. Another approach is to disable all callback hashes on module unload,
but this leaves some ftrace callbacks that may not be registered, but can
still have hashes tracing the module's function where ftrace doesn't know about
it. That situation can cause the same bug. This solution solves that case too.
Another benefit of this solution, is it is possible to trace a module's
function on unload and load.

Link: http://lkml.kernel.org/r/20130705142629.GA325@redhat.com

Reported-by: Jörn Engel &lt;joern@logfs.org&gt;
Reported-by: Dave Jones &lt;davej@redhat.com&gt;
Reported-by: Steve Hodgson &lt;steve@purestorage.com&gt;
Tested-by: Steve Hodgson &lt;steve@purestorage.com&gt;
Signed-off-by: Steven Rostedt &lt;rostedt@goodmis.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
commit 8c4f3c3fa9681dc549cd35419b259496082fef8b upstream.

There's been a nasty bug that would show up and not give much info.
The bug displayed the following warning:

 WARNING: at kernel/trace/ftrace.c:1529 __ftrace_hash_rec_update+0x1e3/0x230()
 Pid: 20903, comm: bash Tainted: G           O 3.6.11+ #38405.trunk
 Call Trace:
  [&lt;ffffffff8103e5ff&gt;] warn_slowpath_common+0x7f/0xc0
  [&lt;ffffffff8103e65a&gt;] warn_slowpath_null+0x1a/0x20
  [&lt;ffffffff810c2ee3&gt;] __ftrace_hash_rec_update+0x1e3/0x230
  [&lt;ffffffff810c4f28&gt;] ftrace_hash_move+0x28/0x1d0
  [&lt;ffffffff811401cc&gt;] ? kfree+0x2c/0x110
  [&lt;ffffffff810c68ee&gt;] ftrace_regex_release+0x8e/0x150
  [&lt;ffffffff81149f1e&gt;] __fput+0xae/0x220
  [&lt;ffffffff8114a09e&gt;] ____fput+0xe/0x10
  [&lt;ffffffff8105fa22&gt;] task_work_run+0x72/0x90
  [&lt;ffffffff810028ec&gt;] do_notify_resume+0x6c/0xc0
  [&lt;ffffffff8126596e&gt;] ? trace_hardirqs_on_thunk+0x3a/0x3c
  [&lt;ffffffff815c0f88&gt;] int_signal+0x12/0x17
 ---[ end trace 793179526ee09b2c ]---

It was finally narrowed down to unloading a module that was being traced.

It was actually more than that. When functions are being traced, there's
a table of all functions that have a ref count of the number of active
tracers attached to that function. When a function trace callback is
registered to a function, the function's record ref count is incremented.
When it is unregistered, the function's record ref count is decremented.
If an inconsistency is detected (ref count goes below zero) the above
warning is shown and the function tracing is permanently disabled until
reboot.

The ftrace callback ops holds a hash of functions that it filters on
(and/or filters off). If the hash is empty, the default means to filter
all functions (for the filter_hash) or to disable no functions (for the
notrace_hash).

When a module is unloaded, it frees the function records that represent
the module functions. These records exist on their own pages, that is
function records for one module will not exist on the same page as
function records for other modules or even the core kernel.

Now when a module unloads, the records that represents its functions are
freed. When the module is loaded again, the records are recreated with
a default ref count of zero (unless there's a callback that traces all
functions, then they will also be traced, and the ref count will be
incremented).

The problem is that if an ftrace callback hash includes functions of the
module being unloaded, those hash entries will not be removed. If the
module is reloaded in the same location, the hash entries still point
to the functions of the module but the module's ref counts do not reflect
that.

With the help of Steve and Joern, we found a reproducer:

 Using uinput module and uinput_release function.

 cd /sys/kernel/debug/tracing
 modprobe uinput
 echo uinput_release &gt; set_ftrace_filter
 echo function &gt; current_tracer
 rmmod uinput
 modprobe uinput
 # check /proc/modules to see if loaded in same addr, otherwise try again
 echo nop &gt; current_tracer

 [BOOM]

The above loads the uinput module, which creates a table of functions that
can be traced within the module.

We add uinput_release to the filter_hash to trace just that function.

Enable function tracincg, which increments the ref count of the record
associated to uinput_release.

Remove uinput, which frees the records including the one that represents
uinput_release.

Load the uinput module again (and make sure it's at the same address).
This recreates the function records all with a ref count of zero,
including uinput_release.

Disable function tracing, which will decrement the ref count for uinput_release
which is now zero because of the module removal and reload, and we have
a mismatch (below zero ref count).

The solution is to check all currently tracing ftrace callbacks to see if any
are tracing any of the module's functions when a module is loaded (it already does
that with callbacks that trace all functions). If a callback happens to have
a module function being traced, it increments that records ref count and starts
tracing that function.

There may be a strange side effect with this, where tracing module functions
on unload and then reloading a new module may have that new module's functions
being traced. This may be something that confuses the user, but it's not
a big deal. Another approach is to disable all callback hashes on module unload,
but this leaves some ftrace callbacks that may not be registered, but can
still have hashes tracing the module's function where ftrace doesn't know about
it. That situation can cause the same bug. This solution solves that case too.
Another benefit of this solution, is it is possible to trace a module's
function on unload and load.

Link: http://lkml.kernel.org/r/20130705142629.GA325@redhat.com

Reported-by: Jörn Engel &lt;joern@logfs.org&gt;
Reported-by: Dave Jones &lt;davej@redhat.com&gt;
Reported-by: Steve Hodgson &lt;steve@purestorage.com&gt;
Tested-by: Steve Hodgson &lt;steve@purestorage.com&gt;
Signed-off-by: Steven Rostedt &lt;rostedt@goodmis.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</pre>
</div>
</content>
</entry>
</feed>
