linux-toradex.git/kernel, branch v3.7.9

sched/rt: Use root_domain of rt_rq not current processor

2013-02-11T17:04:43+00:00

commit aa7f67304d1a03180f463258aa6f15a8b434e77d upstream.

When the system has multiple domains do_sched_rt_period_timer()
can run on any CPU and may iterate over all rt_rq in
cpu_online_mask.  This means when balance_runtime() is run for a
given rt_rq that rt_rq may be in a different rd than the current
processor.  Thus if we use smp_processor_id() to get rd in
do_balance_runtime() we may borrow runtime from a rt_rq that is
not part of our rd.

This changes do_balance_runtime to get the rd from the passed in
rt_rq ensuring that we borrow runtime only from the correct rd
for the given rt_rq.

This fixes a BUG at kernel/sched/rt.c:687! in __disable_runtime
when we try reclaim runtime lent to other rt_rq but runtime has
been lent to a rt_rq in another rd.

Signed-off-by: Shawn Bohrer 
Acked-by: Steven Rostedt 
Acked-by: Mike Galbraith 
Cc: peterz@infradead.org
Link: http://lkml.kernel.org/r/1358186131-29494-1-git-send-email-sbohrer@rgmadvisors.com
Signed-off-by: Ingo Molnar 
Signed-off-by: Greg Kroah-Hartman

smp: Fix SMP function call empty cpu mask race

2013-02-04T00:27:05+00:00

commit f44310b98ddb7f0d06550d73ed67df5865e3eda5 upstream.

I get the following warning every day with v3.7, once or
twice a day:

  [ 2235.186027] WARNING: at /mnt/sda7/kernel/linux/arch/x86/kernel/apic/ipi.c:109 default_send_IPI_mask_logical+0x2f/0xb8()

As explained by Linus as well:

 |
 | Once we've done the "list_add_rcu()" to add it to the
 | queue, we can have (another) IPI to the target CPU that can
 | now see it and clear the mask.
 |
 | So by the time we get to actually send the IPI, the mask might
 | have been cleared by another IPI.
 |

This patch also fixes a system hang problem, if the data->cpumask
gets cleared after passing this point:

        if (WARN_ONCE(!mask, "empty IPI mask"))
                return;

then the problem in commit 83d349f35e1a ("x86: don't send an IPI to
the empty set of CPU's") will happen again.

Signed-off-by: Wang YanQing 
Acked-by: Linus Torvalds 
Acked-by: Jan Beulich 
Cc: Paul E. McKenney 
Cc: Andrew Morton 
Cc: peterz@infradead.org
Cc: mina86@mina86.org
Cc: srivatsa.bhat@linux.vnet.ibm.com
Link: http://lkml.kernel.org/r/20130126075357.GA3205@udknight
[ Tidied up the changelog and the comment in the code. ]
Signed-off-by: Ingo Molnar 
Signed-off-by: Greg Kroah-Hartman

module: fix missing module_mutex unlock

2013-01-28T04:49:02+00:00

commit ee61abb3223e28a1a14a8429c0319755d20d3e40 upstream.

Commit 1fb9341ac348 ("module: put modules in list much earlier") moved
some of the module initialization code around, and in the process
changed the exit paths too.  But for the duplicate export symbol error
case the change made the ddebug_cleanup path jump to after the module
mutex unlock, even though it happens with the mutex held.

Rusty has some patches to split this function up into some helper
functions, hopefully the mess of complex goto targets will go away
eventually.

Reported-by: Dan Carpenter 
Cc: Rusty Russell 
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman

module: put modules in list much earlier.

2013-01-28T04:49:02+00:00

commit 1fb9341ac34825aa40354e74d9a2c69df7d2c304 upstream.

Prarit's excellent bug report:
> In recent Fedora releases (F17 & F18) some users have reported seeing
> messages similar to
>
> [   15.478160] kvm: Could not allocate 304 bytes percpu data
> [   15.478174] PERCPU: allocation failed, size=304 align=32, alloc from
> reserved chunk failed
>
> during system boot.  In some cases, users have also reported seeing this
> message along with a failed load of other modules.
>
> What is happening is systemd is loading an instance of the kvm module for
> each cpu found (see commit e9bda3b).  When the module load occurs the kernel
> currently allocates the modules percpu data area prior to checking to see
> if the module is already loaded or is in the process of being loaded.  If
> the module is already loaded, or finishes load, the module loading code
> releases the current instance's module's percpu data.

Now we have a new state MODULE_STATE_UNFORMED, we can insert the
module into the list (and thus guarantee its uniqueness) before we
allocate the per-cpu region.

Reported-by: Prarit Bhargava 
Signed-off-by: Rusty Russell 
Tested-by: Prarit Bhargava 
Signed-off-by: Greg Kroah-Hartman

module: add new state MODULE_STATE_UNFORMED.

2013-01-28T04:49:02+00:00

commit 0d21b0e3477395e7ff2acc269f15df6e6a8d356d upstream.

You should never look at such a module, so it's excised from all paths
which traverse the modules list.

We add the state at the end, to avoid gratuitous ABI break (ksplice).

Signed-off-by: Rusty Russell 
Signed-off-by: Greg Kroah-Hartman

wake_up_process() should be never used to wakeup a TASK_STOPPED/TRACED task

2013-01-28T04:49:00+00:00

commit 9067ac85d533651b98c2ff903182a20cbb361fcb upstream.

wake_up_process() should never wakeup a TASK_STOPPED/TRACED task.
Change it to use TASK_NORMAL and add the WARN_ON().

TASK_ALL has no other users, probably can be killed.

Signed-off-by: Oleg Nesterov 
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman

ptrace: ensure arch_ptrace/ptrace_request can never race with SIGKILL

2013-01-28T04:49:00+00:00

commit 9899d11f654474d2d54ea52ceaa2a1f4db3abd68 upstream.

putreg() assumes that the tracee is not running and pt_regs_access() can
safely play with its stack.  However a killed tracee can return from
ptrace_stop() to the low-level asm code and do RESTORE_REST, this means
that debugger can actually read/modify the kernel stack until the tracee
does SAVE_REST again.

set_task_blockstep() can race with SIGKILL too and in some sense this
race is even worse, the very fact the tracee can be woken up breaks the
logic.

As Linus suggested we can clear TASK_WAKEKILL around the arch_ptrace()
call, this ensures that nobody can ever wakeup the tracee while the
debugger looks at it.  Not only this fixes the mentioned problems, we
can do some cleanups/simplifications in arch_ptrace() paths.

Probably ptrace_unfreeze_traced() needs more callers, for example it
makes sense to make the tracee killable for oom-killer before
access_process_vm().

While at it, add the comment into may_ptrace_stop() to explain why
ptrace_stop() still can't rely on SIGKILL and signal_pending_state().

Reported-by: Salman Qazi 
Reported-by: Suleiman Souhlal 
Suggested-by: Linus Torvalds 
Signed-off-by: Oleg Nesterov 
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman

ptrace: introduce signal_wake_up_state() and ptrace_signal_wake_up()

2013-01-28T04:49:00+00:00

commit 910ffdb18a6408e14febbb6e4b6840fd2c928c82 upstream.

Cleanup and preparation for the next change.

signal_wake_up(resume => true) is overused. None of ptrace/jctl callers
actually want to wakeup a TASK_WAKEKILL task, but they can't specify the
necessary mask.

Turn signal_wake_up() into signal_wake_up_state(state), reintroduce
signal_wake_up() as a trivial helper, and add ptrace_signal_wake_up()
which adds __TASK_TRACED.

This way ptrace_signal_wake_up() can work "inside" ptrace_request()
even if the tracee doesn't have the TASK_WAKEKILL bit set.

Signed-off-by: Oleg Nesterov 
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman

async: fix __lowest_in_progress()

2013-01-28T04:49:00+00:00

commit f56c3196f251012de9b3ebaff55732a9074fdaae upstream.

Commit 083b804c4d3e ("async: use workqueue for worker pool") made it
possible that async jobs are moved from pending to running out-of-order.
While pending async jobs will be queued and dispatched for execution in
the same order, nothing guarantees they'll enter "1) move self to the
running queue" of async_run_entry_fn() in the same order.

Before the conversion, async implemented its own worker pool.  An async
worker, upon being woken up, fetches the first item from the pending
list, which kept the executing lists sorted.  The conversion to
workqueue was done by adding work_struct to each async_entry and async
just schedules the work item.  The queueing and dispatching of such work
items are still in order but now each worker thread is associated with a
specific async_entry and moves that specific async_entry to the
executing list.  So, depending on which worker reaches that point
earlier, which is non-deterministic, we may end up moving an async_entry
with larger cookie before one with smaller one.

This broke __lowest_in_progress().  running->domain may not be properly
sorted and is not guaranteed to contain lower cookies than pending list
when not empty.  Fix it by ensuring sort-inserting to the running list
and always looking at both pending and running when trying to determine
the lowest cookie.

Over time, the async synchronization implementation became quite messy.
We better restructure it such that each async_entry is linked to two
lists - one global and one per domain - and not move it when execution
starts.  There's no reason to distinguish pending and running.  They
behave the same for synchronization purposes.

Signed-off-by: Tejun Heo 
Cc: Arjan van de Ven 
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman

ftrace: Be first to run code modification on modules

2013-01-28T04:49:00+00:00

commit c1bf08ac26e92122faab9f6c32ea8aba94612dae upstream.

If some other kernel subsystem has a module notifier, and adds a kprobe
to a ftrace mcount point (now that kprobes work on ftrace points),
when the ftrace notifier runs it will fail and disable ftrace, as well
as kprobes that are attached to ftrace points.

Here's the error:

 WARNING: at kernel/trace/ftrace.c:1618 ftrace_bug+0x239/0x280()
 Hardware name: Bochs
 Modules linked in: fat(+) stap_56d28a51b3fe546293ca0700b10bcb29__8059(F) nfsv4 auth_rpcgss nfs dns_resolver fscache xt_nat iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack lockd sunrpc ppdev parport_pc parport microcode virtio_net i2c_piix4 drm_kms_helper ttm drm i2c_core [last unloaded: bid_shared]
 Pid: 8068, comm: modprobe Tainted: GF            3.7.0-0.rc8.git0.1.fc19.x86_64 #1
 Call Trace:
  [] warn_slowpath_common+0x7f/0xc0
  [] ? __probe_kernel_read+0x46/0x70
  [] ? 0xffffffffa017ffff
  [] ? 0xffffffffa017ffff
  [] warn_slowpath_null+0x1a/0x20
  [] ftrace_bug+0x239/0x280
  [] ftrace_process_locs+0x376/0x520
  [] ftrace_module_notify+0x47/0x50
  [] notifier_call_chain+0x4d/0x70
  [] __blocking_notifier_call_chain+0x58/0x80
  [] blocking_notifier_call_chain+0x16/0x20
  [] sys_init_module+0x73/0x220
  [] system_call_fastpath+0x16/0x1b
 ---[ end trace 9ef46351e53bbf80 ]---
 ftrace failed to modify [] init_once+0x0/0x20 [fat]
  actual: cc:bb:d2:4b:e1

A kprobe was added to the init_once() function in the fat module on load.
But this happened before ftrace could have touched the code. As ftrace
didn't run yet, the kprobe system had no idea it was a ftrace point and
simply added a breakpoint to the code (0xcc in the cc:bb:d2:4b:e1).

Then when ftrace went to modify the location from a call to mcount/fentry
into a nop, it didn't see a call op, but instead it saw the breakpoint op
and not knowing what to do with it, ftrace shut itself down.

The solution is to simply give the ftrace module notifier the max priority.
This should have been done regardless, as the core code ftrace modification
also happens very early on in boot up. This makes the module modification
closer to core modification.

Link: http://lkml.kernel.org/r/20130107140333.593683061@goodmis.org

Acked-by: Masami Hiramatsu 
Reported-by: Frank Ch. Eigler 
Signed-off-by: Steven Rostedt