linux-toradex.git/kernel, branch v3.18.13

bpf: fix verifier memory corruption

2015-04-27T20:48:31+00:00

[ Upstream commit c3de6317d748e23b9e46ba36e10483728d00d144 ]

Due to a missing bounds check, the DAG pass of the BPF verifier can corrupt
memory, which can cause random crashes during program loading:

[8.449451] BUG: unable to handle kernel paging request at ffffffffffffffff
[8.451293] IP: [] kmem_cache_alloc_trace+0x8d/0x2f0
[8.452329] Oops: 0000 [#1] SMP
[8.452329] Call Trace:
[8.452329]  [] bpf_check+0x852/0x2000
[8.452329]  [] bpf_prog_load+0x1e4/0x310
[8.452329]  [] ? might_fault+0x5f/0xb0
[8.452329]  [] SyS_bpf+0x806/0xa30
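
To make the failure mode concrete, here is a minimal userspace sketch of a
bounds-checked DAG walk step. It is not the kernel code: struct dag_walk,
mark_target() and the state values are hypothetical names chosen for
illustration. The point is only that a jump target must be validated
before it is used as an array index.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical, simplified per-instruction bookkeeping for a DFS walk.
 * These names are illustrative and are not the kernel's. */
enum { UNVISITED = 0, DISCOVERED = 1 };

struct dag_walk {
	int *state;   /* one entry per instruction */
	int insn_cnt; /* number of instructions in the program */
};

/* Mark jump target w as discovered while walking from instruction t.
 * Returns 0 on success, -EINVAL if w lies outside the program. */
static int mark_target(struct dag_walk *walk, int t, int w)
{
	(void)t; /* only kept to mirror the "from t to w" edge */

	/* Without this check, state[w] below writes outside the array. */
	if (w < 0 || w >= walk->insn_cnt)
		return -EINVAL;

	if (walk->state[w] == UNVISITED)
		walk->state[w] = DISCOVERED;
	return 0;
}
```

In this sketch an out-of-range target is rejected up front instead of
silently corrupting whatever memory follows the state array.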

Fixes: f1bca824dabb ("bpf: add search pruning optimization to verifier")
Signed-off-by: Alexei Starovoitov 
Acked-by: Hannes Frederic Sowa 
Acked-by: Daniel Borkmann 
Signed-off-by: David S. Miller 
Signed-off-by: Sasha Levin

timers/tick/broadcast-hrtimer: Fix suspicious RCU usage in idle loop

2015-04-24T21:14:12+00:00

[ Upstream commit a127d2bcf1fbc8c8e0b5cf0dab54f7d3ff50ce47 ]

The hrtimer mode of broadcast queues hrtimers in the idle entry
path so as to wake up cpus in deep idle states. The associated
call graph is:

	cpuidle_idle_call()
	|____ clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, ...)
	     |_____tick_broadcast_set_event()
		   |____clockevents_program_event()
			|____bc_set_next()

The hrtimer_{start/cancel} functions call into tracing which uses RCU.
But it is not legal to call into RCU in cpuidle because it is one of the
quiescent states. Hence protect this region with RCU_NONIDLE which informs
RCU that the cpu is momentarily non-idle.
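
As a rough illustration of the wrapping pattern, the following userspace
model is loosely based on the idea behind RCU_NONIDLE. All identifiers
prefixed with model_ or MODEL_ are hypothetical stand-ins, and the
rcu_watching flag is an assumption made purely for the example.

```c
#include <stdbool.h>

/* Hypothetical model: rcu_watching stands in for whether RCU currently
 * treats this CPU as non-idle. */
static bool rcu_watching;
static int traced_while_idle; /* counts illegal uses in the idle path */

static void model_rcu_enter(void) { rcu_watching = true; }
static void model_rcu_exit(void)  { rcu_watching = false; }

/* Run a statement with RCU temporarily told that the CPU is not idle. */
#define MODEL_RCU_NONIDLE(stmt) \
	do { \
		model_rcu_enter(); \
		do { stmt; } while (0); \
		model_rcu_exit(); \
	} while (0)

/* Stand-in for hrtimer_start()/hrtimer_cancel(), which call into tracing. */
static void model_hrtimer_start(void)
{
	if (!rcu_watching)
		traced_while_idle++; /* the "suspicious RCU usage" case */
}
```

Calling model_hrtimer_start() directly from the modeled idle path bumps
the counter, while MODEL_RCU_NONIDLE(model_hrtimer_start()) does not.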

As an aside it is helpful to point out that the clock event device that is
programmed here is not a per-cpu clock device; it is a
pseudo clock device, used by the broadcast framework alone.
The per-cpu clock device programming never goes through bc_set_next().

Signed-off-by: Preeti U Murthy 
Signed-off-by: Peter Zijlstra (Intel) 
Reviewed-by: Paul E. McKenney 
Cc: linuxppc-dev@ozlabs.org
Cc: mpe@ellerman.id.au
Cc: tglx@linutronix.de
Link: http://lkml.kernel.org/r/20150318104705.17763.56668.stgit@preeti.in.ibm.com
Signed-off-by: Ingo Molnar 
Signed-off-by: Sasha Levin

Revert "PM / hibernate: avoid unsafe pages in e820 reserved regions"

2015-04-24T21:14:04+00:00

[ Upstream commit f82daee49c09cf6a99c28303d93438a2566e5552 ]

Commit 84c91b7ae07c (PM / hibernate: avoid unsafe pages in e820 reserved
regions) is reported to make resume from hibernation on Lenovo x230
unreliable, so revert it.

We will revisit the issue the commit in question was supposed to fix
in the future.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=96111
Reported-by: rhn 
Cc: 3.17+  # 3.17+
Signed-off-by: Rafael J. Wysocki 
Signed-off-by: Sasha Levin

sched: Fix RLIMIT_RTTIME when PI-boosting to RT

2015-04-24T21:13:43+00:00

[ Upstream commit 746db9443ea57fd9c059f62c4bfbf41cf224fe13 ]

When non-realtime tasks get priority-inheritance boosted to a realtime
scheduling class, RLIMIT_RTTIME starts to apply to them. However, the
counter used for checking this (the same one used for SCHED_RR
timeslices) was not getting reset. This meant that tasks running with a
non-realtime scheduling class which are repeatedly boosted to a realtime
one, but never block while they are running realtime, eventually hit the
timeout without ever running for a time over the limit. This patch
resets the realtime timeslice counter when un-PI-boosting from an RT to
a non-RT scheduling class.
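
The accounting problem can be sketched with a small userspace model. This
is an illustration under simplifying assumptions, not scheduler code:
struct model_task, its timeout field and the model_* helpers are
hypothetical, and time is counted in abstract ticks.

```c
#include <stdbool.h>

struct model_task {
	bool rt;               /* currently in a realtime class? */
	unsigned long timeout; /* ticks accumulated while realtime */
};

static void model_tick(struct model_task *t)
{
	if (t->rt)
		t->timeout++;
}

static void model_boost(struct model_task *t) { t->rt = true; }

/* With fixed == true the counter is reset when leaving the RT class. */
static void model_deboost(struct model_task *t, bool fixed)
{
	t->rt = false;
	if (fixed)
		t->timeout = 0;
}

/* Boost, run for `ticks`, deboost; repeat `rounds` times.
 * Returns true if the accumulated counter ever exceeds `limit`. */
static bool model_hits_limit(bool fixed, int rounds, int ticks,
			     unsigned long limit)
{
	struct model_task t = { false, 0 };

	for (int i = 0; i < rounds; i++) {
		model_boost(&t);
		for (int j = 0; j < ticks; j++) {
			model_tick(&t);
			if (t.timeout > limit)
				return true;
		}
		model_deboost(&t, fixed);
	}
	return false;
}
```

With 3 ticks per boost and a limit of 10, the unfixed model trips the
limit after a few rounds even though no single boost ran longer than 3
ticks; the fixed model never does.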

I have some test code with two threads and a shared PTHREAD_PRIO_INHERIT
mutex which induces priority boosting and spins while boosted. It gets
killed by a SIGXCPU on non-fixed kernels but not with this patch
applied. This happens much faster with a CONFIG_PREEMPT_RT kernel, and
does happen eventually with PREEMPT_VOLUNTARY kernels.

Signed-off-by: Brian Silverman 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: austin@peloton-tech.com
Cc: 
Link: http://lkml.kernel.org/r/1424305436-6716-1-git-send-email-brian@peloton-tech.com
Signed-off-by: Ingo Molnar 
Signed-off-by: Sasha Levin

perf: Fix irq_work 'tail' recursion

2015-04-17T00:11:43+00:00

[ Upstream commit d525211f9d1be8b523ec7633f080f2116f5ea536 ]

Vince reported a watchdog lockup like:

	[] perf_tp_event+0xc4/0x210
	[] perf_trace_lock+0x12a/0x160
	[] lock_release+0x130/0x260
	[] _raw_spin_unlock_irqrestore+0x24/0x40
	[] do_send_sig_info+0x5d/0x80
	[] send_sigio_to_task+0x12f/0x1a0
	[] send_sigio+0xae/0x100
	[] kill_fasync+0x97/0xf0
	[] perf_event_wakeup+0xd4/0xf0
	[] perf_pending_event+0x33/0x60
	[] irq_work_run_list+0x4c/0x80
	[] irq_work_run+0x18/0x40
	[] smp_trace_irq_work_interrupt+0x3f/0xc0
	[] trace_irq_work_interrupt+0x6d/0x80

Which is caused by an irq_work generating new irq_work and therefore
not allowing forward progress.

This happens because processing the perf irq_work triggers another
perf event (tracepoint stuff) which in turn generates an irq_work ad
infinitum.

Avoid this by raising the recursion counter in the irq_work -- which
effectively disables all software events (including tracepoints) from
actually triggering again.
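
A toy model of the guard, assuming a single global recursion counter and
a single pending-work counter (both hypothetical simplifications, as are
all model_* names):

```c
static int recursion;    /* > 0 means software events are suppressed */
static int pending_work; /* number of queued irq_work items */

/* A software event (e.g. a tracepoint) queues irq_work unless suppressed. */
static void model_sw_event(void)
{
	if (recursion)
		return;
	pending_work++;
}

/* Processing one item itself triggers a software event. When guarded,
 * the recursion counter is raised around the processing. */
static void model_pending_event(int guarded)
{
	if (guarded)
		recursion++;
	model_sw_event();
	if (guarded)
		recursion--;
}

/* Run queued work, giving up after max_iters items.
 * Returns the number of items processed. */
static int model_irq_work_run(int guarded, int max_iters)
{
	int n = 0;

	while (pending_work > 0 && n < max_iters) {
		pending_work--;
		model_pending_event(guarded);
		n++;
	}
	return n;
}
```

Unguarded, one queued item keeps regenerating itself until the iteration
cap is hit; guarded, the queue drains after a single item.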

Reported-by: Vince Weaver 
Tested-by: Vince Weaver 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Arnaldo Carvalho de Melo 
Cc: Jiri Olsa 
Cc: Paul Mackerras 
Cc: Steven Rostedt 
Cc: 
Link: http://lkml.kernel.org/r/20150219170311.GH21418@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar 
Signed-off-by: Sasha Levin

sched/wait: Provide infrastructure to deal with nested blocking

2015-04-17T00:11:19+00:00

[ Upstream commit 61ada528dea028331e99e8ceaed87c683ad25de2 ]

There are a few places that call blocking primitives from wait loops.
Provide infrastructure to support this without the typical
task_struct::state collision.

We record the wakeup in wait_queue_t::flags which leaves
task_struct::state free to be used by others.
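
The idea can be sketched in a single-threaded userspace model. Everything
here is hypothetical (MODEL_FLAG_WOKEN and its bit value, struct
model_wait, the model_* functions); it only shows that a wakeup recorded
in a per-waiter flag survives a nested primitive overwriting the task
state.

```c
#include <stdbool.h>

#define MODEL_FLAG_WOKEN 0x02u /* hypothetical bit value */

struct model_wait {
	unsigned int flags;
};

enum model_state { MODEL_RUNNING, MODEL_SLEEPING };
static enum model_state task_state = MODEL_RUNNING;

/* Waker side: record the wakeup in the waiter's flags. */
static void model_wake(struct model_wait *w)
{
	w->flags |= MODEL_FLAG_WOKEN;
	task_state = MODEL_RUNNING;
}

/* A nested blocking primitive that uses, and thereby clobbers,
 * the task state for its own purposes. */
static void model_nested_block(void)
{
	task_state = MODEL_SLEEPING;
	task_state = MODEL_RUNNING;
}

/* Waiter side: returns true if the waiter would have to sleep, i.e. no
 * wakeup was recorded. Consumes the recorded wakeup either way. */
static bool model_wait_woken(struct model_wait *w)
{
	bool would_sleep = !(w->flags & MODEL_FLAG_WOKEN);

	w->flags &= ~MODEL_FLAG_WOKEN;
	return would_sleep;
}
```

Because the waiter consults only its own flag, what the nested primitive
did to the task state has no bearing on whether the wakeup is seen.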

Signed-off-by: Peter Zijlstra (Intel) 
Reviewed-by: Oleg Nesterov 
Cc: tglx@linutronix.de
Cc: ilya.dryomov@inktank.com
Cc: umgwanakikbuti@gmail.com
Cc: Linus Torvalds 
Link: http://lkml.kernel.org/r/20140924082242.051202318@infradead.org
Signed-off-by: Ingo Molnar 
Signed-off-by: Sasha Levin

cpuset: Fix cpuset sched_relax_domain_level

2015-03-28T13:42:58+00:00

[ Upstream commit 283cb41f426b723a0255702b761b0fc5d1b53a81 ]

The cpuset.sched_relax_domain_level can control how far we do
immediate load balancing on a system. However, it was found on recent
kernels that echo'ing a value into cpuset.sched_relax_domain_level
did not reduce any immediate load balancing.

This occurred because the update_domain_attr_tree() traversal
did not update for the "top_cpuset", so modifying the
sched_relax_domain_level parameter changed nothing.

This patch addresses the problem by having update_domain_attr_tree()
allow updates for the root in the cpuset traversal.
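
A small model of the traversal, assuming a first-child/next-sibling tree
and a "take the maximum level" rule with -1 as the default. The struct
layout and the model_* names are hypothetical; the skip_root argument
models the old behaviour.

```c
#include <stddef.h>

struct model_cpuset {
	int relax_domain_level;
	struct model_cpuset *child;   /* first child */
	struct model_cpuset *sibling; /* next sibling */
};

/* Pre-order walk that folds the maximum relax level into `level`.
 * With skip_root set, the root itself is ignored. */
static int model_walk(const struct model_cpuset *cs,
		      const struct model_cpuset *root,
		      int skip_root, int level)
{
	if (!(skip_root && cs == root) && cs->relax_domain_level > level)
		level = cs->relax_domain_level;

	for (const struct model_cpuset *c = cs->child; c; c = c->sibling)
		level = model_walk(c, root, skip_root, level);

	return level;
}

/* Returns the effective level for the tree, starting from -1. */
static int model_domain_level(const struct model_cpuset *root, int skip_root)
{
	return model_walk(root, root, skip_root, -1);
}
```

In this model, a level set only on the root is lost when the root is
skipped and is picked up once the root takes part in the traversal.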

Fixes: fc560a26acce ("cpuset: replace cpuset->stack_list with cpuset_for_each_descendant_pre()")
Cc:  # 3.9+
Signed-off-by: Jason Low 
Signed-off-by: Zefan Li 
Signed-off-by: Tejun Heo 
Tested-by: Serge Hallyn 
Signed-off-by: Sasha Levin

cpuset: fix a warning when clearing configured masks in old hierarchy

2015-03-28T13:42:51+00:00

[ Upstream commit 79063bffc81f82689bd90e16da1b49408f3bf095 ]

When we clear cpuset.cpus, cpuset.effective_cpus won't be cleared:

  # mount -t cgroup -o cpuset xxx /mnt
  # mkdir /mnt/tmp
  # echo 0 > /mnt/tmp/cpuset.cpus
  # echo > /mnt/tmp/cpuset.cpus
  # cat cpuset.cpus

  # cat cpuset.effective_cpus
  0-15

And a kernel warning in update_cpumasks_hier() is triggered:

 ------------[ cut here ]------------
 WARNING: CPU: 0 PID: 4028 at kernel/cpuset.c:894 update_cpumasks_hier+0x471/0x650()

Cc:  # 3.17+
Signed-off-by: Zefan Li 
Signed-off-by: Tejun Heo 
Tested-by: Serge Hallyn 
Signed-off-by: Sasha Levin

cpuset: initialize effective masks when clone_children is enabled

2015-03-28T13:42:38+00:00

[ Upstream commit 790317e1b266c776765a4bdcedefea706ff0fada ]

If clone_children is enabled, effective masks won't be initialized
due to the bug:

  # mount -t cgroup -o cpuset xxx /mnt
  # echo 1 > cgroup.clone_children
  # mkdir /mnt/tmp
  # cat /mnt/tmp/
  # cat cpuset.effective_cpus

  # cat cpuset.cpus
  0-15

And then this cpuset won't constrain the tasks in it.

Neither the bug nor the fix has any effect on the unified hierarchy, as
there's no clone_children flag there any more.

Reported-by: Christian Brauner 
Reported-by: Serge Hallyn 
Cc:  # 3.17+
Signed-off-by: Zefan Li 
Signed-off-by: Tejun Heo 
Tested-by: Serge Hallyn 
Signed-off-by: Sasha Levin

workqueue: fix hang involving racing cancel[_delayed]_work_sync()'s for PREEMPT_NONE

2015-03-28T13:37:48+00:00

[ Upstream commit 8603e1b30027f943cc9c1eef2b291d42c3347af1 ]

cancel[_delayed]_work_sync() are implemented using
__cancel_work_timer() which grabs the PENDING bit using
try_to_grab_pending() and then flushes the work item with PENDING set
to prevent the on-going execution of the work item from requeueing
itself.

try_to_grab_pending() can always grab PENDING bit without blocking
except when someone else is doing the above flushing during
cancelation.  In that case, try_to_grab_pending() returns -ENOENT.  In
this case, __cancel_work_timer() currently invokes flush_work().  The
assumption is that the completion of the work item is what the other
canceling task would be waiting for too and thus waiting for the same
condition and retrying should allow forward progress without excessive
busy looping.

Unfortunately, this doesn't work if preemption is disabled or the
latter task has real time priority.  Let's say task A just got woken
up from flush_work() by the completion of the target work item.  If,
before task A starts executing, task B gets scheduled and invokes
__cancel_work_timer() on the same work item, its try_to_grab_pending()
will return -ENOENT as the work item is still being canceled by task A
and flush_work() will also immediately return false as the work item
is no longer executing.  This puts task B in a busy loop possibly
preventing task A from executing and clearing the canceling state on
the work item leading to a hang.

task A			task B			worker

						executing work
__cancel_work_timer()
  try_to_grab_pending()
  set work CANCELING
  flush_work()
    block for work completion
						completion, wakes up A
			__cancel_work_timer()
			while (forever) {
			  try_to_grab_pending()
			    -ENOENT as work is being canceled
			  flush_work()
			    false as work is no longer executing
			}

This patch removes the possible hang by updating __cancel_work_timer()
to explicitly wait for clearing of CANCELING rather than invoking
flush_work() after try_to_grab_pending() fails with -ENOENT.
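
The difference between the two strategies can be sketched with a
deterministic, single-threaded model of a non-preemptible CPU. This is
not the workqueue code: struct model_work and the model_* functions are
hypothetical, and "task A gets the CPU" is modeled as a plain function
call made only when task B actually sleeps.

```c
#include <stdbool.h>

struct model_work {
	bool canceling; /* set while task A is mid-cancel */
};

/* Task A was woken by the work's completion and needs one scheduling
 * slot to clear the canceling state. */
static void model_task_a_runs(struct model_work *w)
{
	w->canceling = false;
}

/* Task B's cancel loop. Returns the iteration on which it could grab
 * the work item, or -1 if it was still spinning after max_spins. */
static int model_task_b_cancel(struct model_work *w,
			       bool wait_for_canceling, int max_spins)
{
	for (int i = 1; i <= max_spins; i++) {
		if (!w->canceling)
			return i; /* grabbing PENDING would succeed */

		/* -ENOENT case: someone else is canceling. */
		if (wait_for_canceling)
			model_task_a_runs(w); /* B sleeps, A gets the CPU */
		/* else: the flush returns false at once, B never yields */
	}
	return -1;
}
```

With wait_for_canceling false, task B spins until the cap because task A
never runs; with it true, task B succeeds on the second iteration.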

Link: http://lkml.kernel.org/g/20150206171156.GA8942@axis.com

v3: bit_waitqueue() can't be used for work items defined in vmalloc
    area.  Switched to custom wake function which matches the target
    work item and exclusive wait and wakeup.

v2: v1 used wake_up() on bit_waitqueue() which leads to NULL deref if
    the target bit waitqueue has wait_bit_queue's on it.  Use
    DEFINE_WAIT_BIT() and __wake_up_bit() instead.  Reported by Tomeu
    Vizoso.

Signed-off-by: Tejun Heo 
Reported-by: Rabin Vincent 
Cc: Tomeu Vizoso 
Cc: stable@vger.kernel.org
Tested-by: Jesper Nilsson 
Tested-by: Rabin Vincent 
Signed-off-by: Sasha Levin