linux-toradex.git/kernel, branch v3.2.73

sched/core: Fix TASK_DEAD race in finish_task_switch()

2015-11-17T15:54:43+00:00

commit 95913d97914f44db2b81271c2e2ebd4d2ac2df83 upstream.

So the problem this patch is trying to address is as follows:

        CPU0                            CPU1

        context_switch(A, B)
                                        ttwu(A)
                                          LOCK A->pi_lock
                                          A->on_cpu == 0
        finish_task_switch(A)
          prev_state = A->state  <-.
          WMB                      |
          A->on_cpu = 0;           |
          UNLOCK rq0->lock         |
                                   |    context_switch(C, A)
                                   `--  A->state = TASK_DEAD
          prev_state == TASK_DEAD
            put_task_struct(A)
                                        context_switch(A, C)
                                        finish_task_switch(A)
                                          A->state == TASK_DEAD
                                            put_task_struct(A)

The argument being that the WMB will allow the load of A->state on CPU0
to cross over and observe CPU1's store of A->state, which will then
result in a double-drop and use-after-free.

Now the comment states (and this was true once upon a long time ago)
that we need to observe A->state while holding rq->lock because that
will order us against the wakeup; however the wakeup will not in fact
acquire (that) rq->lock; it takes A->pi_lock these days.

We can obviously fix this by upgrading the WMB to an MB, but that is
expensive, so we'd rather avoid that.

The alternative this patch takes is: smp_store_release(&A->on_cpu, 0),
which avoids the MB on some archs, but not important ones like ARM.

Reported-by: Oleg Nesterov 
Signed-off-by: Peter Zijlstra (Intel) 
Acked-by: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: linux-kernel@vger.kernel.org
Cc: manfred@colorfullife.com
Cc: will.deacon@arm.com
Fixes: e4a52bcb9a18 ("sched: Remove rq->lock from the first half of ttwu()")
Link: http://lkml.kernel.org/r/20150929124509.GG3816@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar 
[bwh: Backported to 3.2:
 - Adjust filename
 - As smp_store_release() is not defined, use smp_mb()]
Signed-off-by: Ben Hutchings

clocksource: Fix abs() usage w/ 64bit values

2015-11-17T15:54:41+00:00

commit 67dfae0cd72fec5cd158b6e5fb1647b7dbe0834c upstream.

This patch fixes one cases where abs() was being used with 64-bit
nanosecond values, where the result may be capped at 32-bits.

This potentially could cause watchdog false negatives on 32-bit
systems, so this patch addresses the issue by using abs64().

Signed-off-by: John Stultz 
Cc: Prarit Bhargava 
Cc: Richard Cochran 
Cc: Ingo Molnar 
Link: http://lkml.kernel.org/r/1442279124-7309-2-git-send-email-john.stultz@linaro.org
Signed-off-by: Thomas Gleixner 
[bwh: Backported to 3.2: adjust context]
Signed-off-by: Ben Hutchings

genirq: Fix race in register_irq_proc()

2015-11-17T15:54:41+00:00

commit 95c2b17534654829db428f11bcf4297c059a2a7e upstream.

Per-IRQ directories in procfs are created only when a handler is first
added to the irqdesc, not when the irqdesc is created.  In the case of
a shared IRQ, multiple tasks can race to create a directory.  This
race condition seems to have been present forever, but is easier to
hit with async probing.

Signed-off-by: Ben Hutchings 
Link: http://lkml.kernel.org/r/1443266636.2004.2.camel@decadent.org.uk
Signed-off-by: Thomas Gleixner

module: Fix locking in symbol_put_addr()

2015-11-17T15:54:40+00:00

commit 275d7d44d802ef271a42dc87ac091a495ba72fc5 upstream.

Poma (on the way to another bug) reported an assertion triggering:

  [] module_assert_mutex_or_preempt+0x49/0x90
  [] __module_address+0x32/0x150
  [] __module_text_address+0x16/0x70
  [] symbol_put_addr+0x29/0x40
  [] dvb_frontend_detach+0x7d/0x90 [dvb_core]

Laura Abbott  produced a patch which lead us to
inspect symbol_put_addr(). This function has a comment claiming it
doesn't need to disable preemption around the module lookup
because it holds a reference to the module it wants to find, which
therefore cannot go away.

This is wrong (and a false optimization too, preempt_disable() is really
rather cheap, and I doubt any of this is on uber critical paths,
otherwise it would've retained a pointer to the actual module anyway and
avoided the second lookup).

While its true that the module cannot go away while we hold a reference
on it, the data structure we do the lookup in very much _CAN_ change
while we do the lookup. Therefore fix the comment and add the
required preempt_disable().

Reported-by: poma 
Signed-off-by: Peter Zijlstra (Intel) 
Signed-off-by: Rusty Russell 
Fixes: a6e6abd575fc ("module: remove module_text_address()")
Signed-off-by: Ben Hutchings

fs: create and use seq_show_option for escaping

2015-10-13T02:46:08+00:00

commit a068acf2ee77693e0bf39d6e07139ba704f461c3 upstream.

Many file systems that implement the show_options hook fail to correctly
escape their output which could lead to unescaped characters (e.g.  new
lines) leaking into /proc/mounts and /proc/[pid]/mountinfo files.  This
could lead to confusion, spoofed entries (resulting in things like
systemd issuing false d-bus "mount" notifications), and who knows what
else.  This looks like it would only be the root user stepping on
themselves, but it's possible weird things could happen in containers or
in other situations with delegated mount privileges.

Here's an example using overlay with setuid fusermount trusting the
contents of /proc/mounts (via the /etc/mtab symlink).  Imagine the use
of "sudo" is something more sneaky:

  $ BASE="ovl"
  $ MNT="$BASE/mnt"
  $ LOW="$BASE/lower"
  $ UP="$BASE/upper"
  $ WORK="$BASE/work/ 0 0
  none /proc fuse.pwn user_id=1000"
  $ mkdir -p "$LOW" "$UP" "$WORK"
  $ sudo mount -t overlay -o "lowerdir=$LOW,upperdir=$UP,workdir=$WORK" none /mnt
  $ cat /proc/mounts
  none /root/ovl/mnt overlay rw,relatime,lowerdir=ovl/lower,upperdir=ovl/upper,workdir=ovl/work/ 0 0
  none /proc fuse.pwn user_id=1000 0 0
  $ fusermount -u /proc
  $ cat /proc/mounts
  cat: /proc/mounts: No such file or directory

This fixes the problem by adding new seq_show_option and
seq_show_option_n helpers, and updating the vulnerable show_option
handlers to use them as needed.  Some, like SELinux, need to be open
coded due to unusual existing escape mechanisms.

[akpm@linux-foundation.org: add lost chunk, per Kees]
[keescook@chromium.org: seq_show_option should be using const parameters]
Signed-off-by: Kees Cook 
Acked-by: Serge Hallyn 
Acked-by: Jan Kara 
Acked-by: Paul Moore 
Cc: J. R. Okajima 
Signed-off-by: Kees Cook 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
[bwh: Backported to 3.2:
 - Drop changes to overlayfs, reiserfs
 - Drop vers option from cifs
 - ceph changes are all in one file
 - Adjust context]
Signed-off-by: Ben Hutchings

perf: Fix fasync handling on inherited events

2015-10-13T02:46:02+00:00

commit fed66e2cdd4f127a43fd11b8d92a99bdd429528c upstream.

Vince reported that the fasync signal stuff doesn't work proper for
inherited events. So fix that.

Installing fasync allocates memory and sets filp->f_flags |= FASYNC,
which upon the demise of the file descriptor ensures the allocation is
freed and state is updated.

Now for perf, we can have the events stick around for a while after the
original FD is dead because of references from child events. So we
cannot copy the fasync pointer around. We can however consistently use
the parent's fasync, as that will be updated.

Reported-and-Tested-by: Vince Weaver 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Arnaldo Carvalho deMelo 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: eranian@google.com
Link: http://lkml.kernel.org/r/1434011521.1495.71.camel@twins
Signed-off-by: Ingo Molnar 
Signed-off-by: Ben Hutchings

tracing/filter: Do not allow infix to exceed end of string

2015-08-12T14:33:17+00:00

commit 6b88f44e161b9ee2a803e5b2b1fbcf4e20e8b980 upstream.

While debugging a WARN_ON() for filtering, I found that it is possible
for the filter string to be referenced after its end. With the filter:

 # echo '>' > /sys/kernel/debug/events/ext4/ext4_truncate_exit/filter

The filter_parse() function can call infix_get_op() which calls
infix_advance() that updates the infix filter pointers for the cnt
and tail without checking if the filter is already at the end, which
will put the cnt to zero and the tail beyond the end. The loop then calls
infix_next() that has

	ps->infix.cnt--;
	return ps->infix.string[ps->infix.tail++];

The cnt will now be below zero, and the tail that is returned is
already passed the end of the filter string. So far the allocation
of the filter string usually has some buffer that is zeroed out, but
if the filter string is of the exact size of the allocated buffer
there's no guarantee that the charater after the nul terminating
character will be zero.

Luckily, only root can write to the filter.

Signed-off-by: Steven Rostedt 
Signed-off-by: Ben Hutchings

tracing/filter: Do not WARN on operand count going below zero

2015-08-12T14:33:17+00:00

commit b4875bbe7e68f139bd3383828ae8e994a0df6d28 upstream.

When testing the fix for the trace filter, I could not come up with
a scenario where the operand count goes below zero, so I added a
WARN_ON_ONCE(cnt < 0) to the logic. But there is legitimate case
that it can happen (although the filter would be wrong).

 # echo '>' > /sys/kernel/debug/events/ext4/ext4_truncate_exit/filter

That is, a single operation without any operands will hit the path
where the WARN_ON_ONCE() can trigger. Although this is harmless,
and the filter is reported as a error. But instead of spitting out
a warning to the kernel dmesg, just fail nicely and report it via
the proper channels.

Link: http://lkml.kernel.org/r/558C6082.90608@oracle.com

Reported-by: Vince Weaver 
Reported-by: Sasha Levin 
Signed-off-by: Steven Rostedt 
Signed-off-by: Ben Hutchings

rcu: Correctly handle non-empty Tiny RCU callback list with none ready

2015-08-12T14:33:12+00:00

commit 6e91f8cb138625be96070b778d9ba71ce520ea7e upstream.

If, at the time __rcu_process_callbacks() is invoked,  there are callbacks
in Tiny RCU's callback list, but none of them are ready to be invoked,
the current list-management code will knit the non-ready callbacks out
of the list.  This can result in hangs and possibly worse.  This commit
therefore inserts a check for there being no callbacks that can be
invoked immediately.

This bug is unlikely to occur -- you have to get a new callback between
the time rcu_sched_qs() or rcu_bh_qs() was called, but before we get to
__rcu_process_callbacks().  It was detected by the addition of RCU-bh
testing to rcutorture, which in turn was instigated by Iftekhar Ahmed's
mutation testing.  Although this bug was made much more likely by
915e8a4fe45e (rcu: Remove fastpath from __rcu_process_callbacks()), this
did not cause the bug, but rather made it much more probable.   That
said, it takes more than 40 hours of rcutorture testing, on average,
for this bug to appear, so this fix cannot be considered an emergency.

Signed-off-by: Paul E. McKenney 
Reviewed-by: Josh Triplett 
[bwh: Backported to 3.2: adjust filename, context]
Signed-off-by: Ben Hutchings

hrtimer: Allow concurrent hrtimer_start() for self restarting timers

2015-08-12T14:33:11+00:00

commit 5de2755c8c8b3a6b8414870e2c284914a2b42e4d upstream.

Because we drop cpu_base->lock around calling hrtimer::function, it is
possible for hrtimer_start() to come in between and enqueue the timer.

If hrtimer::function then returns HRTIMER_RESTART we'll hit the BUG_ON
because HRTIMER_STATE_ENQUEUED will be set.

Since the above is a perfectly valid scenario, remove the BUG_ON and
make the enqueue_hrtimer() call conditional on the timer not being
enqueued already.

NOTE: in that concurrent scenario its entirely common for both sites
to want to modify the hrtimer, since hrtimers don't provide
serialization themselves be sure to provide some such that the
hrtimer::function and the hrtimer_start() caller don't both try and
fudge the expiration state at the same time.

To that effect, add a WARN when someone tries to forward an already
enqueued timer, the most common way to change the expiry of self
restarting timers. Ideally we'd put the WARN in everything modifying
the expiry but most of that is inlines and we don't need the bloat.

Fixes: 2d44ae4d7135 ("hrtimer: clean up cpu->base locking tricks")
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Ben Segall 
Cc: Roman Gushchin 
Cc: Paul Turner 
Link: http://lkml.kernel.org/r/20150415113105.GT5029@twins.programming.kicks-ass.net
Signed-off-by: Thomas Gleixner 
[bwh: Backported to 3.2: adjust filename, context]
Signed-off-by: Ben Hutchings