<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux-toradex.git/kernel, branch v3.16.3</title>
<subtitle>Linux kernel for Apalis and Colibri modules</subtitle>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/'/>
<entry>
<title>ring-buffer: Up rb_iter_peek() loop count to 3</title>
<updated>2014-09-17T16:22:12+00:00</updated>
<author>
<name>Steven Rostedt (Red Hat)</name>
<email>rostedt@goodmis.org</email>
</author>
<published>2014-08-06T19:36:31+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=39ed6dfc36293ce24734ec571569a0921d8b0745'/>
<id>39ed6dfc36293ce24734ec571569a0921d8b0745</id>
<content type='text'>
commit 021de3d904b88b1771a3a2cfc5b75023c391e646 upstream.

After writing a test to try to trigger the bug that caused the
ring buffer iterator to become corrupted, I hit another bug:

 WARNING: CPU: 1 PID: 5281 at kernel/trace/ring_buffer.c:3766 rb_iter_peek+0x113/0x238()
 Modules linked in: ipt_MASQUERADE sunrpc [...]
 CPU: 1 PID: 5281 Comm: grep Tainted: G        W     3.16.0-rc3-test+ #143
 Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./To be filled by O.E.M., BIOS SDBLI944.86P 05/08/2007
  0000000000000000 ffffffff81809a80 ffffffff81503fb0 0000000000000000
  ffffffff81040ca1 ffff8800796d6010 ffffffff810c138d ffff8800796d6010
  ffff880077438c80 ffff8800796d6010 ffff88007abbe600 0000000000000003
 Call Trace:
  [&lt;ffffffff81503fb0&gt;] ? dump_stack+0x4a/0x75
  [&lt;ffffffff81040ca1&gt;] ? warn_slowpath_common+0x7e/0x97
  [&lt;ffffffff810c138d&gt;] ? rb_iter_peek+0x113/0x238
  [&lt;ffffffff810c138d&gt;] ? rb_iter_peek+0x113/0x238
  [&lt;ffffffff810c14df&gt;] ? ring_buffer_iter_peek+0x2d/0x5c
  [&lt;ffffffff810c6f73&gt;] ? tracing_iter_reset+0x6e/0x96
  [&lt;ffffffff810c74a3&gt;] ? s_start+0xd7/0x17b
  [&lt;ffffffff8112b13e&gt;] ? kmem_cache_alloc_trace+0xda/0xea
  [&lt;ffffffff8114cf94&gt;] ? seq_read+0x148/0x361
  [&lt;ffffffff81132d98&gt;] ? vfs_read+0x93/0xf1
  [&lt;ffffffff81132f1b&gt;] ? SyS_read+0x60/0x8e
  [&lt;ffffffff8150bf9f&gt;] ? tracesys+0xdd/0xe2

Debugging this bug, which triggers when the rb_iter_peek() loops too
many times (more than 2 times), I discovered there's a case that can
cause that function to legitimately loop 3 times!

rb_iter_peek() differs from rb_buffer_peek(): rb_buffer_peek() only
deals with the reader page (it's for consuming reads), whereas
rb_iter_peek() traverses the buffer without consuming it, and as
such, it can loop for one more reason. That is, if we hit the end of
the reader page or any page, it will go to the next page and try again.

That is, we have this:

 1. iter-&gt;head &gt; iter-&gt;head_page-&gt;page-&gt;commit
    (rb_inc_iter() which moves the iter to the next page)
    try again

 2. event = rb_iter_head_event()
    event-&gt;type_len == RINGBUF_TYPE_TIME_EXTEND
    rb_advance_iter()
    try again

 3. read the event.

But we never get to 3, because the count is greater than 2 and we
cause the WARNING and return NULL.

Up the counter to 3.
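
The sequence above can be modelled outside the kernel. The following
is only a rough Python sketch, not the kernel code: the page layout,
the event tuples, and the names iter_peek() and max_loops are made up
for illustration. It shows why a limit of 2 passes gives up on a
buffer that a limit of 3 passes reads fine.

```python
TIME_EXTEND = "time_extend"  # stand-in for RINGBUF_TYPE_TIME_EXTEND
DATA = "data"


def iter_peek(pages, page_idx, head, max_loops):
    """Return the next data event, or None if max_loops passes were not enough.

    pages is a list of pages; each page is a list of (type, payload) events.
    This is a hypothetical model, not the kernel's data structures.
    """
    for _ in range(max_loops):
        page = pages[page_idx]
        if head == len(page):
            # Pass 1: end of this page, move the iterator to the next page.
            page_idx = (page_idx + 1) % len(pages)
            head = 0
            continue
        kind, payload = page[head]
        if kind == TIME_EXTEND:
            # Pass 2: skip the time extend and try again.
            head += 1
            continue
        # Pass 3: a real event, hand it back.
        return payload
    # Ran out of passes: this is where the WARNING fires in the kernel.
    return None


# The iterator sits at the end of page 0; page 1 starts with a time extend.
pages = [[(DATA, "old")], [(TIME_EXTEND, None), (DATA, "event")]]
print(iter_peek(pages, 0, 1, max_loops=2))
print(iter_peek(pages, 0, 1, max_loops=3))
```

With max_loops=2 the model returns None before reaching the event;
with max_loops=3 it returns the event.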

Fixes: 69d1b839f7ee "ring-buffer: Bind time extend and data events together"
Signed-off-by: Steven Rostedt &lt;rostedt@goodmis.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
commit 021de3d904b88b1771a3a2cfc5b75023c391e646 upstream.

After writing a test to try to trigger the bug that caused the
ring buffer iterator to become corrupted, I hit another bug:

 WARNING: CPU: 1 PID: 5281 at kernel/trace/ring_buffer.c:3766 rb_iter_peek+0x113/0x238()
 Modules linked in: ipt_MASQUERADE sunrpc [...]
 CPU: 1 PID: 5281 Comm: grep Tainted: G        W     3.16.0-rc3-test+ #143
 Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./To be filled by O.E.M., BIOS SDBLI944.86P 05/08/2007
  0000000000000000 ffffffff81809a80 ffffffff81503fb0 0000000000000000
  ffffffff81040ca1 ffff8800796d6010 ffffffff810c138d ffff8800796d6010
  ffff880077438c80 ffff8800796d6010 ffff88007abbe600 0000000000000003
 Call Trace:
  [&lt;ffffffff81503fb0&gt;] ? dump_stack+0x4a/0x75
  [&lt;ffffffff81040ca1&gt;] ? warn_slowpath_common+0x7e/0x97
  [&lt;ffffffff810c138d&gt;] ? rb_iter_peek+0x113/0x238
  [&lt;ffffffff810c138d&gt;] ? rb_iter_peek+0x113/0x238
  [&lt;ffffffff810c14df&gt;] ? ring_buffer_iter_peek+0x2d/0x5c
  [&lt;ffffffff810c6f73&gt;] ? tracing_iter_reset+0x6e/0x96
  [&lt;ffffffff810c74a3&gt;] ? s_start+0xd7/0x17b
  [&lt;ffffffff8112b13e&gt;] ? kmem_cache_alloc_trace+0xda/0xea
  [&lt;ffffffff8114cf94&gt;] ? seq_read+0x148/0x361
  [&lt;ffffffff81132d98&gt;] ? vfs_read+0x93/0xf1
  [&lt;ffffffff81132f1b&gt;] ? SyS_read+0x60/0x8e
  [&lt;ffffffff8150bf9f&gt;] ? tracesys+0xdd/0xe2

Debugging this bug, which triggers when the rb_iter_peek() loops too
many times (more than 2 times), I discovered there's a case that can
cause that function to legitimately loop 3 times!

rb_iter_peek() differs from rb_buffer_peek(): rb_buffer_peek() only
deals with the reader page (it's for consuming reads), whereas
rb_iter_peek() traverses the buffer without consuming it, and as
such, it can loop for one more reason. That is, if we hit the end of
the reader page or any page, it will go to the next page and try again.

That is, we have this:

 1. iter-&gt;head &gt; iter-&gt;head_page-&gt;page-&gt;commit
    (rb_inc_iter() which moves the iter to the next page)
    try again

 2. event = rb_iter_head_event()
    event-&gt;type_len == RINGBUF_TYPE_TIME_EXTEND
    rb_advance_iter()
    try again

 3. read the event.

But we never get to 3, because the count is greater than 2 and we
cause the WARNING and return NULL.

Up the counter to 3.

Fixes: 69d1b839f7ee "ring-buffer: Bind time extend and data events together"
Signed-off-by: Steven Rostedt &lt;rostedt@goodmis.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</pre>
</div>
</content>
</entry>
<entry>
<title>ring-buffer: Always reset iterator to reader page</title>
<updated>2014-09-17T16:22:11+00:00</updated>
<author>
<name>Steven Rostedt (Red Hat)</name>
<email>rostedt@goodmis.org</email>
</author>
<published>2014-08-06T18:11:33+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=1cfa896d6e52e3d9d2cb2133a0c495c816556ca9'/>
<id>1cfa896d6e52e3d9d2cb2133a0c495c816556ca9</id>
<content type='text'>
commit 651e22f2701b4113989237c3048d17337dd2185c upstream.

When performing a consuming read, the ring buffer swaps out a
page from the ring buffer with an empty page and this page that
was swapped out becomes the new reader page. The reader page
is owned by the reader and since it was swapped out of the ring
buffer, writers do not have access to it (there's an exception
to that rule, but it's out of scope for this commit).

When reading the "trace" file, it is a non-consuming read, which
means that the data in the ring buffer will not be modified.
When the trace file is opened, a ring buffer iterator is allocated
and writes to the ring buffer are disabled, such that the iterator
will not have issues iterating over the data.

Although the ring buffer disables writes, it does not disable other
reads, or even consuming reads. If a consuming read happens, then
the iterator is reset and starts reading from the beginning again.

My tests would sometimes trigger this bug on my i386 box:

WARNING: CPU: 0 PID: 5175 at kernel/trace/trace.c:1527 __trace_find_cmdline+0x66/0xaa()
Modules linked in:
CPU: 0 PID: 5175 Comm: grep Not tainted 3.16.0-rc3-test+ #8
Hardware name:                  /DG965MQ, BIOS MQ96510J.86A.0372.2006.0605.1717 06/05/2006
 00000000 00000000 f09c9e1c c18796b3 c1b5d74c f09c9e4c c103a0e3 c1b5154b
 f09c9e78 00001437 c1b5d74c 000005f7 c10bd85a c10bd85a c1cac57c f09c9eb0
 ed0e0000 f09c9e64 c103a185 00000009 f09c9e5c c1b5154b f09c9e78 f09c9e80
Call Trace:
 [&lt;c18796b3&gt;] dump_stack+0x4b/0x75
 [&lt;c103a0e3&gt;] warn_slowpath_common+0x7e/0x95
 [&lt;c10bd85a&gt;] ? __trace_find_cmdline+0x66/0xaa
 [&lt;c10bd85a&gt;] ? __trace_find_cmdline+0x66/0xaa
 [&lt;c103a185&gt;] warn_slowpath_fmt+0x33/0x35
 [&lt;c10bd85a&gt;] __trace_find_cmdline+0x66/0xaa
 [&lt;c10bed04&gt;] trace_find_cmdline+0x40/0x64
 [&lt;c10c3c16&gt;] trace_print_context+0x27/0xec
 [&lt;c10c4360&gt;] ? trace_seq_printf+0x37/0x5b
 [&lt;c10c0b15&gt;] print_trace_line+0x319/0x39b
 [&lt;c10ba3fb&gt;] ? ring_buffer_read+0x47/0x50
 [&lt;c10c13b1&gt;] s_show+0x192/0x1ab
 [&lt;c10bfd9a&gt;] ? s_next+0x5a/0x7c
 [&lt;c112e76e&gt;] seq_read+0x267/0x34c
 [&lt;c1115a25&gt;] vfs_read+0x8c/0xef
 [&lt;c112e507&gt;] ? seq_lseek+0x154/0x154
 [&lt;c1115ba2&gt;] SyS_read+0x54/0x7f
 [&lt;c188488e&gt;] syscall_call+0x7/0xb
---[ end trace 3f507febd6b4cc83 ]---
&gt;&gt;&gt;&gt; ##### CPU 1 buffer started ####

Which was the __trace_find_cmdline() function complaining about the pid
in the event record being negative.

After adding more test cases, this would trigger more often. Strangely
enough, it would never trigger on a single test, but instead would trigger
only when running all the tests. I believe that was the case because it
required one of the tests to be shutting down via delayed instances while
a new test started up.

After spending several days debugging this, I found that it was caused by
the iterator becoming corrupted. Debugging further, I found out why
the iterator became corrupted. It happened with the rb_iter_reset().

As consuming reads may not read the full reader page, and only part
of it, there's a "read" field to know where the last read took place.
The iterator must also start at the read position. In the rb_iter_reset()
code, if the reader page was disconnected from the ring buffer, the iterator
would start at the head page within the ring buffer (where writes still
happen). But the mistake there was that it still used the "read" field
to start the iterator on the head page, where it should always start
at zero because readers never read from within the ring buffer where
writes occur.

I originally wrote a patch to have it set the iter-&gt;head to 0 instead
of iter-&gt;head_page-&gt;read, but then I questioned why it wasn't always
setting the iter to point to the reader page, as the reader page is
still valid.  The list_empty(reader_page-&gt;list) just means that it was
successful in swapping out. But the reader_page may still have data.

There was a bug report a long time ago that was not reproducible that
had something about trace_pipe (consuming read) not matching trace
(iterator read). This may explain why that happened.

Anyway, the correct answer to this bug is to always use the reader page
and not reset the iterator to inside the writable ring buffer.
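
As a rough illustration of the difference, here is a small Python
model. It is a sketch under assumptions, not the kernel code: the Page
class, both reset functions, and the offsets are invented for this
example. Only the "read" offset and the reader page versus head page
distinction come from the description above.

```python
from dataclasses import dataclass


@dataclass
class Page:
    name: str
    read: int = 0              # offset where the last consuming read stopped
    swapped_out: bool = False  # True once the page left the ring buffer


def reset_old(reader_page, head_page):
    """Old behaviour as described above: if the reader page was swapped out,
    start on the head page, at whatever "read" offset that page carries."""
    if reader_page.swapped_out:
        return head_page, head_page.read
    return reader_page, reader_page.read


def reset_new(reader_page, head_page):
    """Fixed behaviour: always start on the reader page at its "read" offset."""
    return reader_page, reader_page.read


# Hypothetical state: the reader consumed 40 bytes of its page, and the head
# page still carries a stale, non-zero "read" value.
reader = Page("reader", read=40, swapped_out=True)
head = Page("head", read=16)

old_page, old_offset = reset_old(reader, head)
new_page, new_offset = reset_new(reader, head)
print(old_page.name, old_offset)
print(new_page.name, new_offset)
```

In this model the old reset lands in the middle of a page that writers
may still touch, while the new reset resumes on the reader page where
the last consuming read stopped.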

Fixes: d769041f8653 "ring_buffer: implement new locking"
Signed-off-by: Steven Rostedt &lt;rostedt@goodmis.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
commit 651e22f2701b4113989237c3048d17337dd2185c upstream.

When performing a consuming read, the ring buffer swaps out a
page from the ring buffer with an empty page and this page that
was swapped out becomes the new reader page. The reader page
is owned by the reader and since it was swapped out of the ring
buffer, writers do not have access to it (there's an exception
to that rule, but it's out of scope for this commit).

When reading the "trace" file, it is a non-consuming read, which
means that the data in the ring buffer will not be modified.
When the trace file is opened, a ring buffer iterator is allocated
and writes to the ring buffer are disabled, such that the iterator
will not have issues iterating over the data.

Although the ring buffer disables writes, it does not disable other
reads, or even consuming reads. If a consuming read happens, then
the iterator is reset and starts reading from the beginning again.

My tests would sometimes trigger this bug on my i386 box:

WARNING: CPU: 0 PID: 5175 at kernel/trace/trace.c:1527 __trace_find_cmdline+0x66/0xaa()
Modules linked in:
CPU: 0 PID: 5175 Comm: grep Not tainted 3.16.0-rc3-test+ #8
Hardware name:                  /DG965MQ, BIOS MQ96510J.86A.0372.2006.0605.1717 06/05/2006
 00000000 00000000 f09c9e1c c18796b3 c1b5d74c f09c9e4c c103a0e3 c1b5154b
 f09c9e78 00001437 c1b5d74c 000005f7 c10bd85a c10bd85a c1cac57c f09c9eb0
 ed0e0000 f09c9e64 c103a185 00000009 f09c9e5c c1b5154b f09c9e78 f09c9e80
Call Trace:
 [&lt;c18796b3&gt;] dump_stack+0x4b/0x75
 [&lt;c103a0e3&gt;] warn_slowpath_common+0x7e/0x95
 [&lt;c10bd85a&gt;] ? __trace_find_cmdline+0x66/0xaa
 [&lt;c10bd85a&gt;] ? __trace_find_cmdline+0x66/0xaa
 [&lt;c103a185&gt;] warn_slowpath_fmt+0x33/0x35
 [&lt;c10bd85a&gt;] __trace_find_cmdline+0x66/0xaa
 [&lt;c10bed04&gt;] trace_find_cmdline+0x40/0x64
 [&lt;c10c3c16&gt;] trace_print_context+0x27/0xec
 [&lt;c10c4360&gt;] ? trace_seq_printf+0x37/0x5b
 [&lt;c10c0b15&gt;] print_trace_line+0x319/0x39b
 [&lt;c10ba3fb&gt;] ? ring_buffer_read+0x47/0x50
 [&lt;c10c13b1&gt;] s_show+0x192/0x1ab
 [&lt;c10bfd9a&gt;] ? s_next+0x5a/0x7c
 [&lt;c112e76e&gt;] seq_read+0x267/0x34c
 [&lt;c1115a25&gt;] vfs_read+0x8c/0xef
 [&lt;c112e507&gt;] ? seq_lseek+0x154/0x154
 [&lt;c1115ba2&gt;] SyS_read+0x54/0x7f
 [&lt;c188488e&gt;] syscall_call+0x7/0xb
---[ end trace 3f507febd6b4cc83 ]---
&gt;&gt;&gt;&gt; ##### CPU 1 buffer started ####

Which was the __trace_find_cmdline() function complaining about the pid
in the event record being negative.

After adding more test cases, this would trigger more often. Strangely
enough, it would never trigger on a single test, but instead would trigger
only when running all the tests. I believe that was the case because it
required one of the tests to be shutting down via delayed instances while
a new test started up.

After spending several days debugging this, I found that it was caused by
the iterator becoming corrupted. Debugging further, I found out why
the iterator became corrupted. It happened with the rb_iter_reset().

As consuming reads may not read the full reader page, and only part
of it, there's a "read" field to know where the last read took place.
The iterator must also start at the read position. In the rb_iter_reset()
code, if the reader page was disconnected from the ring buffer, the iterator
would start at the head page within the ring buffer (where writes still
happen). But the mistake there was that it still used the "read" field
to start the iterator on the head page, where it should always start
at zero because readers never read from within the ring buffer where
writes occur.

I originally wrote a patch to have it set the iter-&gt;head to 0 instead
of iter-&gt;head_page-&gt;read, but then I questioned why it wasn't always
setting the iter to point to the reader page, as the reader page is
still valid.  The list_empty(reader_page-&gt;list) just means that it was
successful in swapping out. But the reader_page may still have data.

There was a bug report a long time ago that was not reproducible that
had something about trace_pipe (consuming read) not matching trace
(iterator read). This may explain why that happened.

Anyway, the correct answer to this bug is to always use the reader page
and not reset the iterator to inside the writable ring buffer.

Fixes: d769041f8653 "ring_buffer: implement new locking"
Signed-off-by: Steven Rostedt &lt;rostedt@goodmis.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</pre>
</div>
</content>
</entry>
<entry>
<title>kernel/smp.c:on_each_cpu_cond(): fix warning in fallback path</title>
<updated>2014-09-17T16:21:54+00:00</updated>
<author>
<name>Sasha Levin</name>
<email>sasha.levin@oracle.com</email>
</author>
<published>2014-08-06T23:08:14+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=0a7e596a04fa03d0ae99c71b362055f23c6417cc'/>
<id>0a7e596a04fa03d0ae99c71b362055f23c6417cc</id>
<content type='text'>
commit 618fde872163e782183ce574c77f1123e2be8887 upstream.

The rarely-executed memory-allocation-failed callback path generates a
WARN_ON_ONCE() when smp_call_function_single() succeeds.  Presumably
it's supposed to warn on failures.

Signed-off-by: Sasha Levin &lt;sasha.levin@oracle.com&gt;
Cc: Christoph Lameter &lt;cl@gentwo.org&gt;
Cc: Gilad Ben-Yossef &lt;gilad@benyossef.com&gt;
Cc: David Rientjes &lt;rientjes@google.com&gt;
Cc: Joonsoo Kim &lt;iamjoonsoo.kim@lge.com&gt;
Cc: Tejun Heo &lt;htejun@gmail.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
commit 618fde872163e782183ce574c77f1123e2be8887 upstream.

The rarely-executed memory-allocation-failed callback path generates a
WARN_ON_ONCE() when smp_call_function_single() succeeds.  Presumably
it's supposed to warn on failures.

Signed-off-by: Sasha Levin &lt;sasha.levin@oracle.com&gt;
Cc: Christoph Lameter &lt;cl@gentwo.org&gt;
Cc: Gilad Ben-Yossef &lt;gilad@benyossef.com&gt;
Cc: David Rientjes &lt;rientjes@google.com&gt;
Cc: Joonsoo Kim &lt;iamjoonsoo.kim@lge.com&gt;
Cc: Tejun Heo &lt;htejun@gmail.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</pre>
</div>
</content>
</entry>
<entry>
<title>CAPABILITIES: remove undefined caps from all processes</title>
<updated>2014-09-17T16:21:53+00:00</updated>
<author>
<name>Eric Paris</name>
<email>eparis@redhat.com</email>
</author>
<published>2014-07-23T19:36:26+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=769b2b894ee6cf55fd26149261b69579f2c3a9cd'/>
<id>769b2b894ee6cf55fd26149261b69579f2c3a9cd</id>
<content type='text'>
commit 7d8b6c63751cfbbe5eef81a48c22978b3407a3ad upstream.

This is effectively a revert of 7b9a7ec565505699f503b4fcf61500dceb36e744
plus fixing it a different way...

We found, when trying to run an application from an application which
had dropped privs that the kernel does security checks on undefined
capability bits.  This was ESPECIALLY difficult to debug as those
undefined bits are hidden from /proc/$PID/status.

Consider a root application which drops all capabilities from ALL 4
capability sets.  We assume, since the application is going to set
eff/perm/inh from an array that it will clear not only the defined caps
less than CAP_LAST_CAP, but also the higher 28ish bits which are
undefined future capabilities.

The BSET gets cleared differently.  Instead it is cleared one bit at a
time.  The problem here is that in security/commoncap.c::cap_task_prctl()
we actually check the validity of a capability being read.  So any task
which attempts to 'read all things set in bset' followed by 'unset all
things set in bset' will not even attempt to unset the undefined bits
higher than CAP_LAST_CAP.

So the 'parent' will look something like:
CapInh:	0000000000000000
CapPrm:	0000000000000000
CapEff:	0000000000000000
CapBnd:	ffffffc000000000

All of this 'should' be fine, given that these are undefined bits that
aren't supposed to have anything to do with permissions.  But they do...

So let's now consider a task which cleared the eff/perm/inh completely
and cleared all of the valid caps in the bset (but not the invalid caps
it couldn't read out of the kernel).  We know that this is exactly what
the libcap-ng library does and what the go capabilities library does.
They both leave you in that above situation if you try to clear all of
your capabilities from all 4 sets.  If that root task calls execve()
the child task will pick up all caps not blocked by the bset.  The bset
however does not block bits higher than CAP_LAST_CAP.  So now the child
task has bits in eff which are not in the parent.  These are
'meaningless' undefined bits, but still bits which the parent doesn't
have.

The problem is now in cred_cap_issubset() (or any operation which does a
subset test) as the child, while a subset for valid cap bits, is not a
subset for invalid cap bits!  So now we set during commit_creds() that
the child is not dumpable, given it is 'more priv' than its parent.  It
also means the parent cannot ptrace the child and other stupidity.

The solution here:
1) stop hiding capability bits in status
	This makes debugging easier!

2) stop giving any task undefined capability bits.  It's simple: if you
don't put those invalid bits in CAP_FULL_SET you won't get them in init
and you won't get them in any other task either.
	This fixes the cap_issubset() tests and resulting fallout (which
	made the init task in a docker container untraceable among other
	things)

3) mask out undefined bits when sys_capset() is called as it might use
~0, ~0 to denote 'all capabilities' for backward/forward compatibility.
	This lets 'capsh --caps="all=eip" -- -c /bin/bash' run.

4) mask out undefined bits when we read a file capability off of disk as
again likely all bits are set in the xattr for forward/backward
compatibility.
	This lets 'setcap all+pe /bin/bash; /bin/bash' run
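
The subset problem and fix 2) can be illustrated with a small Python
model. This is a simplified sketch, not kernel code: capability sets
are modelled as Python sets of bit numbers, exec_permitted() is a
made-up helper that reduces execve() for root to "everything in the
full set that the bset does not block", and CAP_LAST_CAP = 37 is an
assumption inferred from the CapBnd value shown above.

```python
CAP_LAST_CAP = 37  # assumption: matches CapBnd ffffffc000000000 above

ALL_BITS = frozenset(range(64))
DEFINED = frozenset(range(CAP_LAST_CAP + 1))
UNDEFINED = ALL_BITS - DEFINED


def as_hex(caps):
    """Format a set of bit numbers the way /proc/$PID/status prints it."""
    return format(sum(2 ** bit for bit in caps), "016x")


def exec_permitted(full_set, bset):
    """Hypothetical, simplified execve() for root: every cap in the full set
    that the bounding set does not block."""
    return full_set.intersection(bset)


# Parent: eff/perm/inh cleared, bset cleared only for the defined caps.
parent_eff = frozenset()
parent_bset = UNDEFINED
print(as_hex(parent_bset))

# Old CAP_FULL_SET contained every bit, so the child gains undefined bits
# and is no longer a subset of its parent.
child_old = exec_permitted(ALL_BITS, parent_bset)
print(child_old.issubset(parent_eff))

# With only defined bits in CAP_FULL_SET the child stays a subset.
child_new = exec_permitted(DEFINED, parent_bset)
print(child_new.issubset(parent_eff))
```

The first subset test fails and the second one passes, which mirrors
the cred_cap_issubset() behaviour before and after the change.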

Signed-off-by: Eric Paris &lt;eparis@redhat.com&gt;
Reviewed-by: Kees Cook &lt;keescook@chromium.org&gt;
Cc: Andrew Vagin &lt;avagin@openvz.org&gt;
Cc: Andrew G. Morgan &lt;morgan@kernel.org&gt;
Cc: Serge E. Hallyn &lt;serge.hallyn@canonical.com&gt;
Cc: Kees Cook &lt;keescook@chromium.org&gt;
Cc: Steve Grubb &lt;sgrubb@redhat.com&gt;
Cc: Dan Walsh &lt;dwalsh@redhat.com&gt;
Signed-off-by: James Morris &lt;james.l.morris@oracle.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
commit 7d8b6c63751cfbbe5eef81a48c22978b3407a3ad upstream.

This is effectively a revert of 7b9a7ec565505699f503b4fcf61500dceb36e744
plus fixing it a different way...

We found, when trying to run an application from an application which
had dropped privs that the kernel does security checks on undefined
capability bits.  This was ESPECIALLY difficult to debug as those
undefined bits are hidden from /proc/$PID/status.

Consider a root application which drops all capabilities from ALL 4
capability sets.  We assume, since the application is going to set
eff/perm/inh from an array that it will clear not only the defined caps
less than CAP_LAST_CAP, but also the higher 28ish bits which are
undefined future capabilities.

The BSET gets cleared differently.  Instead it is cleared one bit at a
time.  The problem here is that in security/commoncap.c::cap_task_prctl()
we actually check the validity of a capability being read.  So any task
which attempts to 'read all things set in bset' followed by 'unset all
things set in bset' will not even attempt to unset the undefined bits
higher than CAP_LAST_CAP.

So the 'parent' will look something like:
CapInh:	0000000000000000
CapPrm:	0000000000000000
CapEff:	0000000000000000
CapBnd:	ffffffc000000000

All of this 'should' be fine, given that these are undefined bits that
aren't supposed to have anything to do with permissions.  But they do...

So let's now consider a task which cleared the eff/perm/inh completely
and cleared all of the valid caps in the bset (but not the invalid caps
it couldn't read out of the kernel).  We know that this is exactly what
the libcap-ng library does and what the go capabilities library does.
They both leave you in that above situation if you try to clear all of
your capabilities from all 4 sets.  If that root task calls execve()
the child task will pick up all caps not blocked by the bset.  The bset
however does not block bits higher than CAP_LAST_CAP.  So now the child
task has bits in eff which are not in the parent.  These are
'meaningless' undefined bits, but still bits which the parent doesn't
have.

The problem is now in cred_cap_issubset() (or any operation which does a
subset test) as the child, while a subset for valid cap bits, is not a
subset for invalid cap bits!  So now we set during commit_creds() that
the child is not dumpable, given it is 'more priv' than its parent.  It
also means the parent cannot ptrace the child and other stupidity.

The solution here:
1) stop hiding capability bits in status
	This makes debugging easier!

2) stop giving any task undefined capability bits.  It's simple: if you
don't put those invalid bits in CAP_FULL_SET you won't get them in init
and you won't get them in any other task either.
	This fixes the cap_issubset() tests and resulting fallout (which
	made the init task in a docker container untraceable among other
	things)

3) mask out undefined bits when sys_capset() is called as it might use
~0, ~0 to denote 'all capabilities' for backward/forward compatibility.
	This lets 'capsh --caps="all=eip" -- -c /bin/bash' run.

4) mask out undefined bits when we read a file capability off of disk as
again likely all bits are set in the xattr for forward/backward
compatibility.
	This lets 'setcap all+pe /bin/bash; /bin/bash' run

Signed-off-by: Eric Paris &lt;eparis@redhat.com&gt;
Reviewed-by: Kees Cook &lt;keescook@chromium.org&gt;
Cc: Andrew Vagin &lt;avagin@openvz.org&gt;
Cc: Andrew G. Morgan &lt;morgan@kernel.org&gt;
Cc: Serge E. Hallyn &lt;serge.hallyn@canonical.com&gt;
Cc: Kees Cook &lt;keescook@chromium.org&gt;
Cc: Steve Grubb &lt;sgrubb@redhat.com&gt;
Cc: Dan Walsh &lt;dwalsh@redhat.com&gt;
Signed-off-by: James Morris &lt;james.l.morris@oracle.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</pre>
</div>
</content>
</entry>
<entry>
<title>sched: Fix sched_setparam() policy == -1 logic</title>
<updated>2014-09-05T23:36:27+00:00</updated>
<author>
<name>Daniel Bristot de Oliveira</name>
<email>bristot@redhat.com</email>
</author>
<published>2014-07-23T02:27:41+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=c1d97766689182f2d8e79eae5c58b0901ee200c3'/>
<id>c1d97766689182f2d8e79eae5c58b0901ee200c3</id>
<content type='text'>
commit d8d28c8f00e84a72e8bee39a85835635417bee49 upstream.

The scheduler uses policy == -1 to preserve the current policy state to
implement sched_setparam(). But, as (int) -1 is equal to 0xffffffff,
it's matching the if (policy &amp; SCHED_RESET_ON_FORK) in
_sched_setscheduler(). This match changes the policy value to an
invalid value, breaking the sched_setparam() syscall.

This patch checks policy == -1 before checking the SCHED_RESET_ON_FORK flag.

The following program shows the bug:

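#include &lt;sched.h&gt;
#include &lt;stdio.h&gt;

int main(void)
{
	struct sched_param param = {
		.sched_priority = 5,
	};

	/* Switch to SCHED_FIFO first (needs root or CAP_SYS_NICE). */
	sched_setscheduler(0, SCHED_FIFO, &amp;param);
	/* Change only the priority; the policy must stay SCHED_FIFO. */
	param.sched_priority = 1;
	sched_setparam(0, &amp;param);
	/* Read the priority back to see whether the change took effect. */
	param.sched_priority = 0;
	sched_getparam(0, &amp;param);
	if (param.sched_priority != 1)
		printf("failed priority setting (found %d instead of 1)\n",
			param.sched_priority);
	else
		printf("priority setting fine\n");
	return 0;
}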

Signed-off-by: Daniel Bristot de Oliveira &lt;bristot@redhat.com&gt;
Signed-off-by: Peter Zijlstra &lt;peterz@infradead.org&gt;
Reviewed-by: Steven Rostedt &lt;rostedt@goodmis.org&gt;
Cc: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Cc: linux-kernel@vger.kernel.org
Fixes: 7479f3c9cf67 "sched: Move SCHED_RESET_ON_FORK into attr::sched_flags"
Link: http://lkml.kernel.org/r/9ebe0566a08dbbb3999759d3f20d6004bb2dbcfa.1406079891.git.bristot@redhat.com
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
commit d8d28c8f00e84a72e8bee39a85835635417bee49 upstream.

The scheduler uses policy == -1 to preserve the current policy state to
implement sched_setparam(). But, as (int) -1 is equal to 0xffffffff,
it's matching the if (policy &amp; SCHED_RESET_ON_FORK) in
_sched_setscheduler(). This match changes the policy value to an
invalid value, breaking the sched_setparam() syscall.

This patch checks policy == -1 before checking the SCHED_RESET_ON_FORK flag.

The following program shows the bug:

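#include &lt;sched.h&gt;
#include &lt;stdio.h&gt;

int main(void)
{
	struct sched_param param = {
		.sched_priority = 5,
	};

	/* Switch to SCHED_FIFO first (needs root or CAP_SYS_NICE). */
	sched_setscheduler(0, SCHED_FIFO, &amp;param);
	/* Change only the priority; the policy must stay SCHED_FIFO. */
	param.sched_priority = 1;
	sched_setparam(0, &amp;param);
	/* Read the priority back to see whether the change took effect. */
	param.sched_priority = 0;
	sched_getparam(0, &amp;param);
	if (param.sched_priority != 1)
		printf("failed priority setting (found %d instead of 1)\n",
			param.sched_priority);
	else
		printf("priority setting fine\n");
	return 0;
}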

Signed-off-by: Daniel Bristot de Oliveira &lt;bristot@redhat.com&gt;
Signed-off-by: Peter Zijlstra &lt;peterz@infradead.org&gt;
Reviewed-by: Steven Rostedt &lt;rostedt@goodmis.org&gt;
Cc: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Cc: linux-kernel@vger.kernel.org
Fixes: 7479f3c9cf67 "sched: Move SCHED_RESET_ON_FORK into attr::sched_flags"
Link: http://lkml.kernel.org/r/9ebe0566a08dbbb3999759d3f20d6004bb2dbcfa.1406079891.git.bristot@redhat.com
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</pre>
</div>
</content>
</entry>
<entry>
<title>Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip</title>
<updated>2014-08-03T16:58:20+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2014-08-03T16:58:20+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=8d71844b5194fee6edd49e68c01445266f364572'/>
<id>8d71844b5194fee6edd49e68c01445266f364572</id>
<content type='text'>
Pull timer fixes from Thomas Gleixner:
 "Two fixes in the timer area:
   - a long-standing lock inversion due to a printk
   - suspend-related hrtimer corruption in sched_clock"

* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  timer: Fix lock inversion between hrtimer_bases.lock and scheduler locks
  sched_clock: Avoid corrupting hrtimer tree during suspend
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Pull timer fixes from Thomas Gleixner:
 "Two fixes in the timer area:
   - a long-standing lock inversion due to a printk
   - suspend-related hrtimer corruption in sched_clock"

* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  timer: Fix lock inversion between hrtimer_bases.lock and scheduler locks
  sched_clock: Avoid corrupting hrtimer tree during suspend
</pre>
</div>
</content>
</entry>
<entry>
<title>timer: Fix lock inversion between hrtimer_bases.lock and scheduler locks</title>
<updated>2014-08-01T10:54:41+00:00</updated>
<author>
<name>Jan Kara</name>
<email>jack@suse.cz</email>
</author>
<published>2014-08-01T10:20:02+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=504d58745c9ca28d33572e2d8a9990b43e06075d'/>
<id>504d58745c9ca28d33572e2d8a9990b43e06075d</id>
<content type='text'>
clockevents_increase_min_delta() calls printk() from under
hrtimer_bases.lock. That causes lock inversion on scheduler locks because
printk() can call into the scheduler. Lockdep puts it as:

======================================================
[ INFO: possible circular locking dependency detected ]
3.15.0-rc8-06195-g939f04b #2 Not tainted
-------------------------------------------------------
trinity-main/74 is trying to acquire lock:
 (&amp;port_lock_key){-.....}, at: [&lt;811c60be&gt;] serial8250_console_write+0x8c/0x10c

but task is already holding lock:
 (hrtimer_bases.lock){-.-...}, at: [&lt;8103caeb&gt;] hrtimer_try_to_cancel+0x13/0x66

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-&gt; #5 (hrtimer_bases.lock){-.-...}:
       [&lt;8104a942&gt;] lock_acquire+0x92/0x101
       [&lt;8142f11d&gt;] _raw_spin_lock_irqsave+0x2e/0x3e
       [&lt;8103c918&gt;] __hrtimer_start_range_ns+0x1c/0x197
       [&lt;8107ec20&gt;] perf_swevent_start_hrtimer.part.41+0x7a/0x85
       [&lt;81080792&gt;] task_clock_event_start+0x3a/0x3f
       [&lt;810807a4&gt;] task_clock_event_add+0xd/0x14
       [&lt;8108259a&gt;] event_sched_in+0xb6/0x17a
       [&lt;810826a2&gt;] group_sched_in+0x44/0x122
       [&lt;81082885&gt;] ctx_sched_in.isra.67+0x105/0x11f
       [&lt;810828e6&gt;] perf_event_sched_in.isra.70+0x47/0x4b
       [&lt;81082bf6&gt;] __perf_install_in_context+0x8b/0xa3
       [&lt;8107eb8e&gt;] remote_function+0x12/0x2a
       [&lt;8105f5af&gt;] smp_call_function_single+0x2d/0x53
       [&lt;8107e17d&gt;] task_function_call+0x30/0x36
       [&lt;8107fb82&gt;] perf_install_in_context+0x87/0xbb
       [&lt;810852c9&gt;] SYSC_perf_event_open+0x5c6/0x701
       [&lt;810856f9&gt;] SyS_perf_event_open+0x17/0x19
       [&lt;8142f8ee&gt;] syscall_call+0x7/0xb

-&gt; #4 (&amp;ctx-&gt;lock){......}:
       [&lt;8104a942&gt;] lock_acquire+0x92/0x101
       [&lt;8142f04c&gt;] _raw_spin_lock+0x21/0x30
       [&lt;81081df3&gt;] __perf_event_task_sched_out+0x1dc/0x34f
       [&lt;8142cacc&gt;] __schedule+0x4c6/0x4cb
       [&lt;8142cae0&gt;] schedule+0xf/0x11
       [&lt;8142f9a6&gt;] work_resched+0x5/0x30

-&gt; #3 (&amp;rq-&gt;lock){-.-.-.}:
       [&lt;8104a942&gt;] lock_acquire+0x92/0x101
       [&lt;8142f04c&gt;] _raw_spin_lock+0x21/0x30
       [&lt;81040873&gt;] __task_rq_lock+0x33/0x3a
       [&lt;8104184c&gt;] wake_up_new_task+0x25/0xc2
       [&lt;8102474b&gt;] do_fork+0x15c/0x2a0
       [&lt;810248a9&gt;] kernel_thread+0x1a/0x1f
       [&lt;814232a2&gt;] rest_init+0x1a/0x10e
       [&lt;817af949&gt;] start_kernel+0x303/0x308
       [&lt;817af2ab&gt;] i386_start_kernel+0x79/0x7d

-&gt; #2 (&amp;p-&gt;pi_lock){-.-...}:
       [&lt;8104a942&gt;] lock_acquire+0x92/0x101
       [&lt;8142f11d&gt;] _raw_spin_lock_irqsave+0x2e/0x3e
       [&lt;810413dd&gt;] try_to_wake_up+0x1d/0xd6
       [&lt;810414cd&gt;] default_wake_function+0xb/0xd
       [&lt;810461f3&gt;] __wake_up_common+0x39/0x59
       [&lt;81046346&gt;] __wake_up+0x29/0x3b
       [&lt;811b8733&gt;] tty_wakeup+0x49/0x51
       [&lt;811c3568&gt;] uart_write_wakeup+0x17/0x19
       [&lt;811c5dc1&gt;] serial8250_tx_chars+0xbc/0xfb
       [&lt;811c5f28&gt;] serial8250_handle_irq+0x54/0x6a
       [&lt;811c5f57&gt;] serial8250_default_handle_irq+0x19/0x1c
       [&lt;811c56d8&gt;] serial8250_interrupt+0x38/0x9e
       [&lt;810510e7&gt;] handle_irq_event_percpu+0x5f/0x1e2
       [&lt;81051296&gt;] handle_irq_event+0x2c/0x43
       [&lt;81052cee&gt;] handle_level_irq+0x57/0x80
       [&lt;81002a72&gt;] handle_irq+0x46/0x5c
       [&lt;810027df&gt;] do_IRQ+0x32/0x89
       [&lt;8143036e&gt;] common_interrupt+0x2e/0x33
       [&lt;8142f23c&gt;] _raw_spin_unlock_irqrestore+0x3f/0x49
       [&lt;811c25a4&gt;] uart_start+0x2d/0x32
       [&lt;811c2c04&gt;] uart_write+0xc7/0xd6
       [&lt;811bc6f6&gt;] n_tty_write+0xb8/0x35e
       [&lt;811b9beb&gt;] tty_write+0x163/0x1e4
       [&lt;811b9cd9&gt;] redirected_tty_write+0x6d/0x75
       [&lt;810b6ed6&gt;] vfs_write+0x75/0xb0
       [&lt;810b7265&gt;] SyS_write+0x44/0x77
       [&lt;8142f8ee&gt;] syscall_call+0x7/0xb

-&gt; #1 (&amp;tty-&gt;write_wait){-.....}:
       [&lt;8104a942&gt;] lock_acquire+0x92/0x101
       [&lt;8142f11d&gt;] _raw_spin_lock_irqsave+0x2e/0x3e
       [&lt;81046332&gt;] __wake_up+0x15/0x3b
       [&lt;811b8733&gt;] tty_wakeup+0x49/0x51
       [&lt;811c3568&gt;] uart_write_wakeup+0x17/0x19
       [&lt;811c5dc1&gt;] serial8250_tx_chars+0xbc/0xfb
       [&lt;811c5f28&gt;] serial8250_handle_irq+0x54/0x6a
       [&lt;811c5f57&gt;] serial8250_default_handle_irq+0x19/0x1c
       [&lt;811c56d8&gt;] serial8250_interrupt+0x38/0x9e
       [&lt;810510e7&gt;] handle_irq_event_percpu+0x5f/0x1e2
       [&lt;81051296&gt;] handle_irq_event+0x2c/0x43
       [&lt;81052cee&gt;] handle_level_irq+0x57/0x80
       [&lt;81002a72&gt;] handle_irq+0x46/0x5c
       [&lt;810027df&gt;] do_IRQ+0x32/0x89
       [&lt;8143036e&gt;] common_interrupt+0x2e/0x33
       [&lt;8142f23c&gt;] _raw_spin_unlock_irqrestore+0x3f/0x49
       [&lt;811c25a4&gt;] uart_start+0x2d/0x32
       [&lt;811c2c04&gt;] uart_write+0xc7/0xd6
       [&lt;811bc6f6&gt;] n_tty_write+0xb8/0x35e
       [&lt;811b9beb&gt;] tty_write+0x163/0x1e4
       [&lt;811b9cd9&gt;] redirected_tty_write+0x6d/0x75
       [&lt;810b6ed6&gt;] vfs_write+0x75/0xb0
       [&lt;810b7265&gt;] SyS_write+0x44/0x77
       [&lt;8142f8ee&gt;] syscall_call+0x7/0xb

-&gt; #0 (&amp;port_lock_key){-.....}:
       [&lt;8104a62d&gt;] __lock_acquire+0x9ea/0xc6d
       [&lt;8104a942&gt;] lock_acquire+0x92/0x101
       [&lt;8142f11d&gt;] _raw_spin_lock_irqsave+0x2e/0x3e
       [&lt;811c60be&gt;] serial8250_console_write+0x8c/0x10c
       [&lt;8104e402&gt;] call_console_drivers.constprop.31+0x87/0x118
       [&lt;8104f5d5&gt;] console_unlock+0x1d7/0x398
       [&lt;8104fb70&gt;] vprintk_emit+0x3da/0x3e4
       [&lt;81425f76&gt;] printk+0x17/0x19
       [&lt;8105bfa0&gt;] clockevents_program_min_delta+0x104/0x116
       [&lt;8105c548&gt;] clockevents_program_event+0xe7/0xf3
       [&lt;8105cc1c&gt;] tick_program_event+0x1e/0x23
       [&lt;8103c43c&gt;] hrtimer_force_reprogram+0x88/0x8f
       [&lt;8103c49e&gt;] __remove_hrtimer+0x5b/0x79
       [&lt;8103cb21&gt;] hrtimer_try_to_cancel+0x49/0x66
       [&lt;8103cb4b&gt;] hrtimer_cancel+0xd/0x18
       [&lt;8107f102&gt;] perf_swevent_cancel_hrtimer.part.60+0x2b/0x30
       [&lt;81080705&gt;] task_clock_event_stop+0x20/0x64
       [&lt;81080756&gt;] task_clock_event_del+0xd/0xf
       [&lt;81081350&gt;] event_sched_out+0xab/0x11e
       [&lt;810813e0&gt;] group_sched_out+0x1d/0x66
       [&lt;81081682&gt;] ctx_sched_out+0xaf/0xbf
       [&lt;81081e04&gt;] __perf_event_task_sched_out+0x1ed/0x34f
       [&lt;8142cacc&gt;] __schedule+0x4c6/0x4cb
       [&lt;8142cae0&gt;] schedule+0xf/0x11
       [&lt;8142f9a6&gt;] work_resched+0x5/0x30

other info that might help us debug this:

Chain exists of:
  &amp;port_lock_key --&gt; &amp;ctx-&gt;lock --&gt; hrtimer_bases.lock

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(hrtimer_bases.lock);
                               lock(&amp;ctx-&gt;lock);
                               lock(hrtimer_bases.lock);
  lock(&amp;port_lock_key);

 *** DEADLOCK ***

4 locks held by trinity-main/74:
 #0:  (&amp;rq-&gt;lock){-.-.-.}, at: [&lt;8142c6f3&gt;] __schedule+0xed/0x4cb
 #1:  (&amp;ctx-&gt;lock){......}, at: [&lt;81081df3&gt;] __perf_event_task_sched_out+0x1dc/0x34f
 #2:  (hrtimer_bases.lock){-.-...}, at: [&lt;8103caeb&gt;] hrtimer_try_to_cancel+0x13/0x66
 #3:  (console_lock){+.+...}, at: [&lt;8104fb5d&gt;] vprintk_emit+0x3c7/0x3e4

stack backtrace:
CPU: 0 PID: 74 Comm: trinity-main Not tainted 3.15.0-rc8-06195-g939f04b #2
 00000000 81c3a310 8b995c14 81426f69 8b995c44 81425a99 8161f671 8161f570
 8161f538 8161f559 8161f538 8b995c78 8b142bb0 00000004 8b142fdc 8b142bb0
 8b995ca8 8104a62d 8b142fac 000016f2 81c3a310 00000001 00000001 00000003
Call Trace:
 [&lt;81426f69&gt;] dump_stack+0x16/0x18
 [&lt;81425a99&gt;] print_circular_bug+0x18f/0x19c
 [&lt;8104a62d&gt;] __lock_acquire+0x9ea/0xc6d
 [&lt;8104a942&gt;] lock_acquire+0x92/0x101
 [&lt;811c60be&gt;] ? serial8250_console_write+0x8c/0x10c
 [&lt;811c6032&gt;] ? wait_for_xmitr+0x76/0x76
 [&lt;8142f11d&gt;] _raw_spin_lock_irqsave+0x2e/0x3e
 [&lt;811c60be&gt;] ? serial8250_console_write+0x8c/0x10c
 [&lt;811c60be&gt;] serial8250_console_write+0x8c/0x10c
 [&lt;8104af87&gt;] ? lock_release+0x191/0x223
 [&lt;811c6032&gt;] ? wait_for_xmitr+0x76/0x76
 [&lt;8104e402&gt;] call_console_drivers.constprop.31+0x87/0x118
 [&lt;8104f5d5&gt;] console_unlock+0x1d7/0x398
 [&lt;8104fb70&gt;] vprintk_emit+0x3da/0x3e4
 [&lt;81425f76&gt;] printk+0x17/0x19
 [&lt;8105bfa0&gt;] clockevents_program_min_delta+0x104/0x116
 [&lt;8105cc1c&gt;] tick_program_event+0x1e/0x23
 [&lt;8103c43c&gt;] hrtimer_force_reprogram+0x88/0x8f
 [&lt;8103c49e&gt;] __remove_hrtimer+0x5b/0x79
 [&lt;8103cb21&gt;] hrtimer_try_to_cancel+0x49/0x66
 [&lt;8103cb4b&gt;] hrtimer_cancel+0xd/0x18
 [&lt;8107f102&gt;] perf_swevent_cancel_hrtimer.part.60+0x2b/0x30
 [&lt;81080705&gt;] task_clock_event_stop+0x20/0x64
 [&lt;81080756&gt;] task_clock_event_del+0xd/0xf
 [&lt;81081350&gt;] event_sched_out+0xab/0x11e
 [&lt;810813e0&gt;] group_sched_out+0x1d/0x66
 [&lt;81081682&gt;] ctx_sched_out+0xaf/0xbf
 [&lt;81081e04&gt;] __perf_event_task_sched_out+0x1ed/0x34f
 [&lt;8104416d&gt;] ? __dequeue_entity+0x23/0x27
 [&lt;81044505&gt;] ? pick_next_task_fair+0xb1/0x120
 [&lt;8142cacc&gt;] __schedule+0x4c6/0x4cb
 [&lt;81047574&gt;] ? trace_hardirqs_off_caller+0xd7/0x108
 [&lt;810475b0&gt;] ? trace_hardirqs_off+0xb/0xd
 [&lt;81056346&gt;] ? rcu_irq_exit+0x64/0x77

Fix the problem by using printk_deferred(), which does not call into the
scheduler.

Reported-by: Fengguang Wu &lt;fengguang.wu@intel.com&gt;
Signed-off-by: Jan Kara &lt;jack@suse.cz&gt;
Cc: stable@vger.kernel.org
Signed-off-by: Thomas Gleixner &lt;tglx@linutronix.de&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
clockevents_increase_min_delta() calls printk() from under
hrtimer_bases.lock. That causes lock inversion on scheduler locks because
printk() can call into the scheduler. Lockdep puts it as:

======================================================
[ INFO: possible circular locking dependency detected ]
3.15.0-rc8-06195-g939f04b #2 Not tainted
-------------------------------------------------------
trinity-main/74 is trying to acquire lock:
 (&amp;port_lock_key){-.....}, at: [&lt;811c60be&gt;] serial8250_console_write+0x8c/0x10c

but task is already holding lock:
 (hrtimer_bases.lock){-.-...}, at: [&lt;8103caeb&gt;] hrtimer_try_to_cancel+0x13/0x66

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-&gt; #5 (hrtimer_bases.lock){-.-...}:
       [&lt;8104a942&gt;] lock_acquire+0x92/0x101
       [&lt;8142f11d&gt;] _raw_spin_lock_irqsave+0x2e/0x3e
       [&lt;8103c918&gt;] __hrtimer_start_range_ns+0x1c/0x197
       [&lt;8107ec20&gt;] perf_swevent_start_hrtimer.part.41+0x7a/0x85
       [&lt;81080792&gt;] task_clock_event_start+0x3a/0x3f
       [&lt;810807a4&gt;] task_clock_event_add+0xd/0x14
       [&lt;8108259a&gt;] event_sched_in+0xb6/0x17a
       [&lt;810826a2&gt;] group_sched_in+0x44/0x122
       [&lt;81082885&gt;] ctx_sched_in.isra.67+0x105/0x11f
       [&lt;810828e6&gt;] perf_event_sched_in.isra.70+0x47/0x4b
       [&lt;81082bf6&gt;] __perf_install_in_context+0x8b/0xa3
       [&lt;8107eb8e&gt;] remote_function+0x12/0x2a
       [&lt;8105f5af&gt;] smp_call_function_single+0x2d/0x53
       [&lt;8107e17d&gt;] task_function_call+0x30/0x36
       [&lt;8107fb82&gt;] perf_install_in_context+0x87/0xbb
       [&lt;810852c9&gt;] SYSC_perf_event_open+0x5c6/0x701
       [&lt;810856f9&gt;] SyS_perf_event_open+0x17/0x19
       [&lt;8142f8ee&gt;] syscall_call+0x7/0xb

-&gt; #4 (&amp;ctx-&gt;lock){......}:
       [&lt;8104a942&gt;] lock_acquire+0x92/0x101
       [&lt;8142f04c&gt;] _raw_spin_lock+0x21/0x30
       [&lt;81081df3&gt;] __perf_event_task_sched_out+0x1dc/0x34f
       [&lt;8142cacc&gt;] __schedule+0x4c6/0x4cb
       [&lt;8142cae0&gt;] schedule+0xf/0x11
       [&lt;8142f9a6&gt;] work_resched+0x5/0x30

-&gt; #3 (&amp;rq-&gt;lock){-.-.-.}:
       [&lt;8104a942&gt;] lock_acquire+0x92/0x101
       [&lt;8142f04c&gt;] _raw_spin_lock+0x21/0x30
       [&lt;81040873&gt;] __task_rq_lock+0x33/0x3a
       [&lt;8104184c&gt;] wake_up_new_task+0x25/0xc2
       [&lt;8102474b&gt;] do_fork+0x15c/0x2a0
       [&lt;810248a9&gt;] kernel_thread+0x1a/0x1f
       [&lt;814232a2&gt;] rest_init+0x1a/0x10e
       [&lt;817af949&gt;] start_kernel+0x303/0x308
       [&lt;817af2ab&gt;] i386_start_kernel+0x79/0x7d

-&gt; #2 (&amp;p-&gt;pi_lock){-.-...}:
       [&lt;8104a942&gt;] lock_acquire+0x92/0x101
       [&lt;8142f11d&gt;] _raw_spin_lock_irqsave+0x2e/0x3e
       [&lt;810413dd&gt;] try_to_wake_up+0x1d/0xd6
       [&lt;810414cd&gt;] default_wake_function+0xb/0xd
       [&lt;810461f3&gt;] __wake_up_common+0x39/0x59
       [&lt;81046346&gt;] __wake_up+0x29/0x3b
       [&lt;811b8733&gt;] tty_wakeup+0x49/0x51
       [&lt;811c3568&gt;] uart_write_wakeup+0x17/0x19
       [&lt;811c5dc1&gt;] serial8250_tx_chars+0xbc/0xfb
       [&lt;811c5f28&gt;] serial8250_handle_irq+0x54/0x6a
       [&lt;811c5f57&gt;] serial8250_default_handle_irq+0x19/0x1c
       [&lt;811c56d8&gt;] serial8250_interrupt+0x38/0x9e
       [&lt;810510e7&gt;] handle_irq_event_percpu+0x5f/0x1e2
       [&lt;81051296&gt;] handle_irq_event+0x2c/0x43
       [&lt;81052cee&gt;] handle_level_irq+0x57/0x80
       [&lt;81002a72&gt;] handle_irq+0x46/0x5c
       [&lt;810027df&gt;] do_IRQ+0x32/0x89
       [&lt;8143036e&gt;] common_interrupt+0x2e/0x33
       [&lt;8142f23c&gt;] _raw_spin_unlock_irqrestore+0x3f/0x49
       [&lt;811c25a4&gt;] uart_start+0x2d/0x32
       [&lt;811c2c04&gt;] uart_write+0xc7/0xd6
       [&lt;811bc6f6&gt;] n_tty_write+0xb8/0x35e
       [&lt;811b9beb&gt;] tty_write+0x163/0x1e4
       [&lt;811b9cd9&gt;] redirected_tty_write+0x6d/0x75
       [&lt;810b6ed6&gt;] vfs_write+0x75/0xb0
       [&lt;810b7265&gt;] SyS_write+0x44/0x77
       [&lt;8142f8ee&gt;] syscall_call+0x7/0xb

-&gt; #1 (&amp;tty-&gt;write_wait){-.....}:
       [&lt;8104a942&gt;] lock_acquire+0x92/0x101
       [&lt;8142f11d&gt;] _raw_spin_lock_irqsave+0x2e/0x3e
       [&lt;81046332&gt;] __wake_up+0x15/0x3b
       [&lt;811b8733&gt;] tty_wakeup+0x49/0x51
       [&lt;811c3568&gt;] uart_write_wakeup+0x17/0x19
       [&lt;811c5dc1&gt;] serial8250_tx_chars+0xbc/0xfb
       [&lt;811c5f28&gt;] serial8250_handle_irq+0x54/0x6a
       [&lt;811c5f57&gt;] serial8250_default_handle_irq+0x19/0x1c
       [&lt;811c56d8&gt;] serial8250_interrupt+0x38/0x9e
       [&lt;810510e7&gt;] handle_irq_event_percpu+0x5f/0x1e2
       [&lt;81051296&gt;] handle_irq_event+0x2c/0x43
       [&lt;81052cee&gt;] handle_level_irq+0x57/0x80
       [&lt;81002a72&gt;] handle_irq+0x46/0x5c
       [&lt;810027df&gt;] do_IRQ+0x32/0x89
       [&lt;8143036e&gt;] common_interrupt+0x2e/0x33
       [&lt;8142f23c&gt;] _raw_spin_unlock_irqrestore+0x3f/0x49
       [&lt;811c25a4&gt;] uart_start+0x2d/0x32
       [&lt;811c2c04&gt;] uart_write+0xc7/0xd6
       [&lt;811bc6f6&gt;] n_tty_write+0xb8/0x35e
       [&lt;811b9beb&gt;] tty_write+0x163/0x1e4
       [&lt;811b9cd9&gt;] redirected_tty_write+0x6d/0x75
       [&lt;810b6ed6&gt;] vfs_write+0x75/0xb0
       [&lt;810b7265&gt;] SyS_write+0x44/0x77
       [&lt;8142f8ee&gt;] syscall_call+0x7/0xb

-&gt; #0 (&amp;port_lock_key){-.....}:
       [&lt;8104a62d&gt;] __lock_acquire+0x9ea/0xc6d
       [&lt;8104a942&gt;] lock_acquire+0x92/0x101
       [&lt;8142f11d&gt;] _raw_spin_lock_irqsave+0x2e/0x3e
       [&lt;811c60be&gt;] serial8250_console_write+0x8c/0x10c
       [&lt;8104e402&gt;] call_console_drivers.constprop.31+0x87/0x118
       [&lt;8104f5d5&gt;] console_unlock+0x1d7/0x398
       [&lt;8104fb70&gt;] vprintk_emit+0x3da/0x3e4
       [&lt;81425f76&gt;] printk+0x17/0x19
       [&lt;8105bfa0&gt;] clockevents_program_min_delta+0x104/0x116
       [&lt;8105c548&gt;] clockevents_program_event+0xe7/0xf3
       [&lt;8105cc1c&gt;] tick_program_event+0x1e/0x23
       [&lt;8103c43c&gt;] hrtimer_force_reprogram+0x88/0x8f
       [&lt;8103c49e&gt;] __remove_hrtimer+0x5b/0x79
       [&lt;8103cb21&gt;] hrtimer_try_to_cancel+0x49/0x66
       [&lt;8103cb4b&gt;] hrtimer_cancel+0xd/0x18
       [&lt;8107f102&gt;] perf_swevent_cancel_hrtimer.part.60+0x2b/0x30
       [&lt;81080705&gt;] task_clock_event_stop+0x20/0x64
       [&lt;81080756&gt;] task_clock_event_del+0xd/0xf
       [&lt;81081350&gt;] event_sched_out+0xab/0x11e
       [&lt;810813e0&gt;] group_sched_out+0x1d/0x66
       [&lt;81081682&gt;] ctx_sched_out+0xaf/0xbf
       [&lt;81081e04&gt;] __perf_event_task_sched_out+0x1ed/0x34f
       [&lt;8142cacc&gt;] __schedule+0x4c6/0x4cb
       [&lt;8142cae0&gt;] schedule+0xf/0x11
       [&lt;8142f9a6&gt;] work_resched+0x5/0x30

other info that might help us debug this:

Chain exists of:
  &amp;port_lock_key --&gt; &amp;ctx-&gt;lock --&gt; hrtimer_bases.lock

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(hrtimer_bases.lock);
                               lock(&amp;ctx-&gt;lock);
                               lock(hrtimer_bases.lock);
  lock(&amp;port_lock_key);

 *** DEADLOCK ***

4 locks held by trinity-main/74:
 #0:  (&amp;rq-&gt;lock){-.-.-.}, at: [&lt;8142c6f3&gt;] __schedule+0xed/0x4cb
 #1:  (&amp;ctx-&gt;lock){......}, at: [&lt;81081df3&gt;] __perf_event_task_sched_out+0x1dc/0x34f
 #2:  (hrtimer_bases.lock){-.-...}, at: [&lt;8103caeb&gt;] hrtimer_try_to_cancel+0x13/0x66
 #3:  (console_lock){+.+...}, at: [&lt;8104fb5d&gt;] vprintk_emit+0x3c7/0x3e4

stack backtrace:
CPU: 0 PID: 74 Comm: trinity-main Not tainted 3.15.0-rc8-06195-g939f04b #2
 00000000 81c3a310 8b995c14 81426f69 8b995c44 81425a99 8161f671 8161f570
 8161f538 8161f559 8161f538 8b995c78 8b142bb0 00000004 8b142fdc 8b142bb0
 8b995ca8 8104a62d 8b142fac 000016f2 81c3a310 00000001 00000001 00000003
Call Trace:
 [&lt;81426f69&gt;] dump_stack+0x16/0x18
 [&lt;81425a99&gt;] print_circular_bug+0x18f/0x19c
 [&lt;8104a62d&gt;] __lock_acquire+0x9ea/0xc6d
 [&lt;8104a942&gt;] lock_acquire+0x92/0x101
 [&lt;811c60be&gt;] ? serial8250_console_write+0x8c/0x10c
 [&lt;811c6032&gt;] ? wait_for_xmitr+0x76/0x76
 [&lt;8142f11d&gt;] _raw_spin_lock_irqsave+0x2e/0x3e
 [&lt;811c60be&gt;] ? serial8250_console_write+0x8c/0x10c
 [&lt;811c60be&gt;] serial8250_console_write+0x8c/0x10c
 [&lt;8104af87&gt;] ? lock_release+0x191/0x223
 [&lt;811c6032&gt;] ? wait_for_xmitr+0x76/0x76
 [&lt;8104e402&gt;] call_console_drivers.constprop.31+0x87/0x118
 [&lt;8104f5d5&gt;] console_unlock+0x1d7/0x398
 [&lt;8104fb70&gt;] vprintk_emit+0x3da/0x3e4
 [&lt;81425f76&gt;] printk+0x17/0x19
 [&lt;8105bfa0&gt;] clockevents_program_min_delta+0x104/0x116
 [&lt;8105cc1c&gt;] tick_program_event+0x1e/0x23
 [&lt;8103c43c&gt;] hrtimer_force_reprogram+0x88/0x8f
 [&lt;8103c49e&gt;] __remove_hrtimer+0x5b/0x79
 [&lt;8103cb21&gt;] hrtimer_try_to_cancel+0x49/0x66
 [&lt;8103cb4b&gt;] hrtimer_cancel+0xd/0x18
 [&lt;8107f102&gt;] perf_swevent_cancel_hrtimer.part.60+0x2b/0x30
 [&lt;81080705&gt;] task_clock_event_stop+0x20/0x64
 [&lt;81080756&gt;] task_clock_event_del+0xd/0xf
 [&lt;81081350&gt;] event_sched_out+0xab/0x11e
 [&lt;810813e0&gt;] group_sched_out+0x1d/0x66
 [&lt;81081682&gt;] ctx_sched_out+0xaf/0xbf
 [&lt;81081e04&gt;] __perf_event_task_sched_out+0x1ed/0x34f
 [&lt;8104416d&gt;] ? __dequeue_entity+0x23/0x27
 [&lt;81044505&gt;] ? pick_next_task_fair+0xb1/0x120
 [&lt;8142cacc&gt;] __schedule+0x4c6/0x4cb
 [&lt;81047574&gt;] ? trace_hardirqs_off_caller+0xd7/0x108
 [&lt;810475b0&gt;] ? trace_hardirqs_off+0xb/0xd
 [&lt;81056346&gt;] ? rcu_irq_exit+0x64/0x77

Fix the problem by using printk_deferred(), which does not call into the
scheduler.

Reported-by: Fengguang Wu &lt;fengguang.wu@intel.com&gt;
Signed-off-by: Jan Kara &lt;jack@suse.cz&gt;
Cc: stable@vger.kernel.org
Signed-off-by: Thomas Gleixner &lt;tglx@linutronix.de&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>kexec: fix build error when hugetlbfs is disabled</title>
<updated>2014-07-31T03:09:37+00:00</updated>
<author>
<name>David Rientjes</name>
<email>rientjes@google.com</email>
</author>
<published>2014-07-31T02:05:55+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=3a1122d26c62d4e8c61ef9a0eaba6e21c0862c77'/>
<id>3a1122d26c62d4e8c61ef9a0eaba6e21c0862c77</id>
<content type='text'>
free_huge_page() is undefined without CONFIG_HUGETLBFS and there's no
need to filter PageHuge() pages in such a configuration either, so avoid
exporting the symbol to fix a build error:

   In file included from kernel/kexec.c:14:0:
   kernel/kexec.c: In function 'crash_save_vmcoreinfo_init':
   kernel/kexec.c:1623:20: error: 'free_huge_page' undeclared (first use in this function)
     VMCOREINFO_SYMBOL(free_huge_page);
                       ^

Introduced by commit 8f1d26d0e59b ("kexec: export free_huge_page to
VMCOREINFO")

Reported-by: kbuild test robot &lt;fengguang.wu@intel.com&gt;
Acked-by: Olof Johansson &lt;olof@lixom.net&gt;
Cc: Atsushi Kumagai &lt;kumagai-atsushi@mxc.nes.nec.co.jp&gt;
Cc: Baoquan He &lt;bhe@redhat.com&gt;
Cc: Vivek Goyal &lt;vgoyal@redhat.com&gt;
Cc: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: David Rientjes &lt;rientjes@google.com&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
free_huge_page() is undefined without CONFIG_HUGETLBFS and there's no
need to filter PageHuge() pages in such a configuration either, so avoid
exporting the symbol to fix a build error:

   In file included from kernel/kexec.c:14:0:
   kernel/kexec.c: In function 'crash_save_vmcoreinfo_init':
   kernel/kexec.c:1623:20: error: 'free_huge_page' undeclared (first use in this function)
     VMCOREINFO_SYMBOL(free_huge_page);
                       ^

Introduced by commit 8f1d26d0e59b ("kexec: export free_huge_page to
VMCOREINFO")

Reported-by: kbuild test robot &lt;fengguang.wu@intel.com&gt;
Acked-by: Olof Johansson &lt;olof@lixom.net&gt;
Cc: Atsushi Kumagai &lt;kumagai-atsushi@mxc.nes.nec.co.jp&gt;
Cc: Baoquan He &lt;bhe@redhat.com&gt;
Cc: Vivek Goyal &lt;vgoyal@redhat.com&gt;
Cc: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: David Rientjes &lt;rientjes@google.com&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>Josh has moved</title>
<updated>2014-07-31T00:16:13+00:00</updated>
<author>
<name>Josh Triplett</name>
<email>josh@joshtriplett.org</email>
</author>
<published>2014-07-30T23:08:42+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=e0198b290dcd8313bdf313a0d083033d5c01d761'/>
<id>e0198b290dcd8313bdf313a0d083033d5c01d761</id>
<content type='text'>
My IBM email addresses haven't worked for years; also map some
old-but-functional forwarding addresses to my canonical address.

Update my GPG key fingerprint; I moved to 4096R a long time ago.

Update description.

Signed-off-by: Josh Triplett &lt;josh@joshtriplett.org&gt;
Cc: "Paul E. McKenney" &lt;paulmck@linux.vnet.ibm.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
My IBM email addresses haven't worked for years; also map some
old-but-functional forwarding addresses to my canonical address.

Update my GPG key fingerprint; I moved to 4096R a long time ago.

Update description.

Signed-off-by: Josh Triplett &lt;josh@joshtriplett.org&gt;
Cc: "Paul E. McKenney" &lt;paulmck@linux.vnet.ibm.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>kexec: export free_huge_page to VMCOREINFO</title>
<updated>2014-07-31T00:16:13+00:00</updated>
<author>
<name>Atsushi Kumagai</name>
<email>kumagai-atsushi@mxc.nes.nec.co.jp</email>
</author>
<published>2014-07-30T23:08:39+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=8f1d26d0e59b9676587c54578f976709b625d6e9'/>
<id>8f1d26d0e59b9676587c54578f976709b625d6e9</id>
<content type='text'>
PG_head_mask was added to VMCOREINFO to filter huge pages in b3acc56bfe1
("kexec: save PG_head_mask in VMCOREINFO"), but makedumpfile still needs
another symbol to filter *hugetlbfs* pages.

If a user wants to filter user pages, makedumpfile tries to exclude them by
checking whether the page is anonymous, but hugetlbfs pages are not
anonymous even though they are also user pages.

We know it's possible to detect them in the same way as PageHuge(),
so we need the start address of free_huge_page():

    int PageHuge(struct page *page)
    {
            if (!PageCompound(page))
                    return 0;

            page = compound_head(page);
            return get_compound_page_dtor(page) == free_huge_page;
    }

For that reason, this patch makes free_huge_page() public so that it
can be exported to VMCOREINFO.

Signed-off-by: Atsushi Kumagai &lt;kumagai-atsushi@mxc.nes.nec.co.jp&gt;
Acked-by: Baoquan He &lt;bhe@redhat.com&gt;
Cc: Vivek Goyal &lt;vgoyal@redhat.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
PG_head_mask was added to VMCOREINFO to filter huge pages in b3acc56bfe1
("kexec: save PG_head_mask in VMCOREINFO"), but makedumpfile still needs
another symbol to filter *hugetlbfs* pages.

If a user wants to filter user pages, makedumpfile tries to exclude them by
checking whether the page is anonymous, but hugetlbfs pages are not
anonymous even though they are also user pages.

We know it's possible to detect them in the same way as PageHuge(),
so we need the start address of free_huge_page():

    int PageHuge(struct page *page)
    {
            if (!PageCompound(page))
                    return 0;

            page = compound_head(page);
            return get_compound_page_dtor(page) == free_huge_page;
    }

For that reason, this patch makes free_huge_page() public so that it
can be exported to VMCOREINFO.

Signed-off-by: Atsushi Kumagai &lt;kumagai-atsushi@mxc.nes.nec.co.jp&gt;
Acked-by: Baoquan He &lt;bhe@redhat.com&gt;
Cc: Vivek Goyal &lt;vgoyal@redhat.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
</feed>
