<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux-toradex.git/kernel/rcutree.h, branch v3.12.25</title>
<subtitle>Linux kernel for Apalis and Colibri modules</subtitle>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/'/>
<entry>
<title>rcu: Throttle rcu_try_advance_all_cbs() execution</title>
<updated>2014-03-12T12:25:37+00:00</updated>
<author>
<name>Paul E. McKenney</name>
<email>paulmck@linux.vnet.ibm.com</email>
</author>
<published>2013-08-26T04:20:47+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=66802dc66423b151f82088406a77131474341cb7'/>
<id>66802dc66423b151f82088406a77131474341cb7</id>
<content type='text'>
commit c229828ca6bc62d6c654f64b1d1b8a9ebd8a56f3 upstream.

The rcu_try_advance_all_cbs() function is invoked on each attempted
entry to and every exit from idle.  If this function determines that
there are callbacks ready to invoke, the caller will invoke the RCU
core, which in turn will result in a pair of context switches.  If a
CPU enters and exits idle extremely frequently, this can result in
an excessive number of context switches and high CPU overhead.

This commit therefore causes rcu_try_advance_all_cbs() to throttle
itself, refusing to do work more than once per jiffy.
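
A minimal userspace sketch of this once-per-jiffy throttle (the names
jiffies and last_advance_all mirror the idea, not the kernel's exact code):

```c
#include <assert.h>
#include <stdbool.h>

static unsigned long jiffies = 1;      /* simulated tick counter */
static unsigned long last_advance_all; /* jiffy of the last advance */

static bool rcu_try_advance_all_cbs_sketch(void)
{
    /* Refuse to repeat the work within the same jiffy. */
    if (jiffies == last_advance_all)
        return false;
    last_advance_all = jiffies;
    return true;  /* caller would now scan and advance callbacks */
}
```

A second call in the same jiffy returns false, so a CPU bouncing in and
out of idle no longer invokes the RCU core on every transition.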

Reported-by: Tibor Billes &lt;tbilles@gmx.com&gt;
Signed-off-by: Paul E. McKenney &lt;paulmck@linux.vnet.ibm.com&gt;
Tested-by: Tibor Billes &lt;tbilles@gmx.com&gt;
Reviewed-by: Josh Triplett &lt;josh@joshtriplett.org&gt;
Signed-off-by: Jiri Slaby &lt;jslaby@suse.cz&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
commit c229828ca6bc62d6c654f64b1d1b8a9ebd8a56f3 upstream.

The rcu_try_advance_all_cbs() function is invoked on each attempted
entry to and every exit from idle.  If this function determines that
there are callbacks ready to invoke, the caller will invoke the RCU
core, which in turn will result in a pair of context switches.  If a
CPU enters and exits idle extremely frequently, this can result in
an excessive number of context switches and high CPU overhead.

This commit therefore causes rcu_try_advance_all_cbs() to throttle
itself, refusing to do work more than once per jiffy.

Reported-by: Tibor Billes &lt;tbilles@gmx.com&gt;
Signed-off-by: Paul E. McKenney &lt;paulmck@linux.vnet.ibm.com&gt;
Tested-by: Tibor Billes &lt;tbilles@gmx.com&gt;
Reviewed-by: Josh Triplett &lt;josh@joshtriplett.org&gt;
Signed-off-by: Jiri Slaby &lt;jslaby@suse.cz&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>nohz_full: Force RCU's grace-period kthreads onto timekeeping CPU</title>
<updated>2013-08-31T21:44:02+00:00</updated>
<author>
<name>Paul E. McKenney</name>
<email>paulmck@linux.vnet.ibm.com</email>
</author>
<published>2013-06-22T00:10:40+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=eb75767be0e514f97bf1b5cec763696cfc7f7e2a'/>
<id>eb75767be0e514f97bf1b5cec763696cfc7f7e2a</id>
<content type='text'>
Because RCU's quiescent-state-forcing mechanism is used to drive the
full-system-idle state machine, and because this mechanism is executed
by RCU's grace-period kthreads, this commit forces these kthreads to
run on the timekeeping CPU (tick_do_timer_cpu).  To do otherwise would
mean that the RCU grace-period kthreads would force the system into
non-idle state every time they drove the state machine, which would
be just a bit on the futile side.
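
A userspace analogue of this binding (the kernel pins its kthreads to
tick_do_timer_cpu; here sched_setaffinity() pins the calling thread to a
chosen CPU instead, and bind_to_cpu() is an illustrative helper):

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>

static int bind_to_cpu(int cpu)
{
    cpu_set_t mask;

    CPU_ZERO(&mask);
    CPU_SET(cpu, &mask);
    /* pid 0 means "the calling thread" */
    return sched_setaffinity(0, sizeof(mask), &mask);
}
```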

Signed-off-by: Paul E. McKenney &lt;paulmck@linux.vnet.ibm.com&gt;
Cc: Frederic Weisbecker &lt;fweisbec@gmail.com&gt;
Cc: Steven Rostedt &lt;rostedt@goodmis.org&gt;
Cc: Lai Jiangshan &lt;laijs@cn.fujitsu.com&gt;
Reviewed-by: Josh Triplett &lt;josh@joshtriplett.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Because RCU's quiescent-state-forcing mechanism is used to drive the
full-system-idle state machine, and because this mechanism is executed
by RCU's grace-period kthreads, this commit forces these kthreads to
run on the timekeeping CPU (tick_do_timer_cpu).  To do otherwise would
mean that the RCU grace-period kthreads would force the system into
non-idle state every time they drove the state machine, which would
be just a bit on the futile side.

Signed-off-by: Paul E. McKenney &lt;paulmck@linux.vnet.ibm.com&gt;
Cc: Frederic Weisbecker &lt;fweisbec@gmail.com&gt;
Cc: Steven Rostedt &lt;rostedt@goodmis.org&gt;
Cc: Lai Jiangshan &lt;laijs@cn.fujitsu.com&gt;
Reviewed-by: Josh Triplett &lt;josh@joshtriplett.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>nohz_full: Add full-system-idle state machine</title>
<updated>2013-08-31T21:43:50+00:00</updated>
<author>
<name>Paul E. McKenney</name>
<email>paulmck@linux.vnet.ibm.com</email>
</author>
<published>2013-06-21T23:37:22+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=0edd1b1784cbdad55aca2c1293be018f53c0ab1d'/>
<id>0edd1b1784cbdad55aca2c1293be018f53c0ab1d</id>
<content type='text'>
This commit adds the state machine that takes the per-CPU idle data
as input and produces a full-system-idle indication as output.  This
state machine is driven out of RCU's quiescent-state-forcing
mechanism, which invokes rcu_sysidle_check_cpu() to collect per-CPU
idle state and then rcu_sysidle_report() to drive the state machine.

The full-system-idle state is sampled using rcu_sys_is_idle(), which
also drives the state machine if RCU is idle (and does so by forcing
RCU to become non-idle).  This function returns true if all but the
timekeeping CPU (tick_do_timer_cpu) are idle and have been idle long
enough to avoid memory contention on the full_sysidle_state state
variable.  The rcu_sysidle_force_exit() function may be called externally
to reset the state machine back into non-idle state.

For large systems the state machine is driven out of RCU's
force-quiescent-state logic, which provides good scalability at the price
of millisecond-scale latencies on the transition to full-system-idle
state.  This is not so good for battery-powered systems, which are usually
small enough that they don't need to care about scalability, but which
do care deeply about energy efficiency.  Small systems therefore drive
the state machine directly out of the idle-entry code.  The number of
CPUs in a "small" system is defined by a new NO_HZ_FULL_SYSIDLE_SMALL
Kconfig parameter, which defaults to 8.  Note that this is a build-time
definition.
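
A toy model of the state machine (the state names follow the idea above,
but the transition logic here is a deliberate simplification):

```c
#include <assert.h>
#include <stdbool.h>

enum sysidle_state {
    SYSIDLE_NOT,        /* some non-timekeeping CPU is busy */
    SYSIDLE_SHORT,      /* all idle, but not for long */
    SYSIDLE_LONG,       /* all idle long enough to matter */
    SYSIDLE_FULL,       /* full-system idle */
    SYSIDLE_FULL_NOTED, /* timekeeping told it can stop */
};

static enum sysidle_state state = SYSIDLE_NOT;

/* Driven from quiescent-state forcing: advance one step per scan
 * while every non-timekeeping CPU stays idle, else reset. */
static enum sysidle_state sysidle_report(bool all_cpus_idle)
{
    if (!all_cpus_idle)
        state = SYSIDLE_NOT;
    else if (state != SYSIDLE_FULL_NOTED)
        state++;
    return state;
}
```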

Signed-off-by: Paul E. McKenney &lt;paulmck@linux.vnet.ibm.com&gt;
Cc: Frederic Weisbecker &lt;fweisbec@gmail.com&gt;
Cc: Steven Rostedt &lt;rostedt@goodmis.org&gt;
Cc: Lai Jiangshan &lt;laijs@cn.fujitsu.com&gt;
[ paulmck: Use true and false for boolean constants per Lai Jiangshan. ]
Reviewed-by: Josh Triplett &lt;josh@joshtriplett.org&gt;
[ paulmck: Simplify logic and provide better comments for memory barriers,
  based on review comments and questions by Lai Jiangshan. ]
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
This commit adds the state machine that takes the per-CPU idle data
as input and produces a full-system-idle indication as output.  This
state machine is driven out of RCU's quiescent-state-forcing
mechanism, which invokes rcu_sysidle_check_cpu() to collect per-CPU
idle state and then rcu_sysidle_report() to drive the state machine.

The full-system-idle state is sampled using rcu_sys_is_idle(), which
also drives the state machine if RCU is idle (and does so by forcing
RCU to become non-idle).  This function returns true if all but the
timekeeping CPU (tick_do_timer_cpu) are idle and have been idle long
enough to avoid memory contention on the full_sysidle_state state
variable.  The rcu_sysidle_force_exit() function may be called externally
to reset the state machine back into non-idle state.

For large systems the state machine is driven out of RCU's
force-quiescent-state logic, which provides good scalability at the price
of millisecond-scale latencies on the transition to full-system-idle
state.  This is not so good for battery-powered systems, which are usually
small enough that they don't need to care about scalability, but which
do care deeply about energy efficiency.  Small systems therefore drive
the state machine directly out of the idle-entry code.  The number of
CPUs in a "small" system is defined by a new NO_HZ_FULL_SYSIDLE_SMALL
Kconfig parameter, which defaults to 8.  Note that this is a build-time
definition.

Signed-off-by: Paul E. McKenney &lt;paulmck@linux.vnet.ibm.com&gt;
Cc: Frederic Weisbecker &lt;fweisbec@gmail.com&gt;
Cc: Steven Rostedt &lt;rostedt@goodmis.org&gt;
Cc: Lai Jiangshan &lt;laijs@cn.fujitsu.com&gt;
[ paulmck: Use true and false for boolean constants per Lai Jiangshan. ]
Reviewed-by: Josh Triplett &lt;josh@joshtriplett.org&gt;
[ paulmck: Simplify logic and provide better comments for memory barriers,
  based on review comments and questions by Lai Jiangshan. ]
</pre>
</div>
</content>
</entry>
<entry>
<title>nohz_full: Add per-CPU idle-state tracking</title>
<updated>2013-08-19T01:58:43+00:00</updated>
<author>
<name>Paul E. McKenney</name>
<email>paulmck@linux.vnet.ibm.com</email>
</author>
<published>2013-06-21T20:00:57+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=eb348b898290da242e46df75ab0b9772003e08b8'/>
<id>eb348b898290da242e46df75ab0b9772003e08b8</id>
<content type='text'>
This commit adds the code that updates the rcu_dyntick structure's
new fields to track the per-CPU idle state based on interrupts and
transitions into and out of the idle loop (NMIs are ignored because NMI
handlers cannot cleanly read out the time anyway).  This code is similar
to the code that maintains RCU's idea of per-CPU idleness, but differs
in that RCU treats CPUs running in user mode as idle, whereas this new
code does not.
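
A simplified model of that tracking (the names are illustrative; the
actual commit updates new fields in the rcu_dyntick structure):

```c
#include <assert.h>
#include <stdbool.h>

static int dynticks_idle_nesting = 1;   /* CPUs start out non-idle */

static void sysidle_enter_idle(void) { dynticks_idle_nesting = 0; }
static void sysidle_exit_idle(void)  { dynticks_idle_nesting = 1; }
static void sysidle_irq_enter(void)  { dynticks_idle_nesting++; }
static void sysidle_irq_exit(void)   { dynticks_idle_nesting--; }

/* Unlike RCU's own idle tracking, usermode execution never reaches
 * sysidle_enter_idle(), so a CPU in user mode counts as busy. */
static bool cpu_is_sysidle(void)
{
    return dynticks_idle_nesting == 0;
}
```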

Signed-off-by: Paul E. McKenney &lt;paulmck@linux.vnet.ibm.com&gt;
Acked-by: Frederic Weisbecker &lt;fweisbec@gmail.com&gt;
Cc: Steven Rostedt &lt;rostedt@goodmis.org&gt;
Reviewed-by: Josh Triplett &lt;josh@joshtriplett.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
This commit adds the code that updates the rcu_dyntick structure's
new fields to track the per-CPU idle state based on interrupts and
transitions into and out of the idle loop (NMIs are ignored because NMI
handlers cannot cleanly read out the time anyway).  This code is similar
to the code that maintains RCU's idea of per-CPU idleness, but differs
in that RCU treats CPUs running in user mode as idle, whereas this new
code does not.

Signed-off-by: Paul E. McKenney &lt;paulmck@linux.vnet.ibm.com&gt;
Acked-by: Frederic Weisbecker &lt;fweisbec@gmail.com&gt;
Cc: Steven Rostedt &lt;rostedt@goodmis.org&gt;
Reviewed-by: Josh Triplett &lt;josh@joshtriplett.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>nohz_full: Add rcu_dyntick data for scalable detection of all-idle state</title>
<updated>2013-08-19T01:58:31+00:00</updated>
<author>
<name>Paul E. McKenney</name>
<email>paulmck@linux.vnet.ibm.com</email>
</author>
<published>2013-06-21T19:34:33+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=2333210b26cf7aaf48d71343029afb860103d9f9'/>
<id>2333210b26cf7aaf48d71343029afb860103d9f9</id>
<content type='text'>
This commit adds fields to the rcu_dyntick structure that are used to
detect idle CPUs.  These new fields differ from the existing ones in
that the existing ones consider a CPU executing in user mode to be idle,
where the new ones consider CPUs executing in user mode to be busy.
The handling of these new fields is otherwise quite similar to that for
the existing fields.  This commit also adds the initialization required
for these fields.

So, why is usermode execution treated differently, with RCU considering
it a quiescent state equivalent to idle, while in contrast the new
full-system idle state detection considers usermode execution to be
non-idle?

It turns out that although one of RCU's quiescent states is usermode
execution, it is not a full-system idle state.  This is because the
purpose of the full-system idle state is not RCU, but rather determining
when accurate timekeeping can safely be disabled.  Whenever accurate
timekeeping is required in a CONFIG_NO_HZ_FULL kernel, at least one
CPU must keep the scheduling-clock tick going.  If even one CPU is
executing in user mode, accurate timekeeping is required, particularly for
architectures where gettimeofday() and friends do not enter the kernel.
Only when all CPUs are really and truly idle can accurate timekeeping be
disabled, allowing all CPUs to turn off the scheduling clock interrupt,
thus greatly improving energy efficiency.

This naturally raises the question "Why is this code in RCU rather than in
timekeeping?", and the answer is that RCU has the data and infrastructure
to efficiently make this determination.

Signed-off-by: Paul E. McKenney &lt;paulmck@linux.vnet.ibm.com&gt;
Acked-by: Frederic Weisbecker &lt;fweisbec@gmail.com&gt;
Cc: Steven Rostedt &lt;rostedt@goodmis.org&gt;
Reviewed-by: Josh Triplett &lt;josh@joshtriplett.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
This commit adds fields to the rcu_dyntick structure that are used to
detect idle CPUs.  These new fields differ from the existing ones in
that the existing ones consider a CPU executing in user mode to be idle,
where the new ones consider CPUs executing in user mode to be busy.
The handling of these new fields is otherwise quite similar to that for
the existing fields.  This commit also adds the initialization required
for these fields.

So, why is usermode execution treated differently, with RCU considering
it a quiescent state equivalent to idle, while in contrast the new
full-system idle state detection considers usermode execution to be
non-idle?

It turns out that although one of RCU's quiescent states is usermode
execution, it is not a full-system idle state.  This is because the
purpose of the full-system idle state is not RCU, but rather determining
when accurate timekeeping can safely be disabled.  Whenever accurate
timekeeping is required in a CONFIG_NO_HZ_FULL kernel, at least one
CPU must keep the scheduling-clock tick going.  If even one CPU is
executing in user mode, accurate timekeeping is required, particularly for
architectures where gettimeofday() and friends do not enter the kernel.
Only when all CPUs are really and truly idle can accurate timekeeping be
disabled, allowing all CPUs to turn off the scheduling clock interrupt,
thus greatly improving energy efficiency.

This naturally raises the question "Why is this code in RCU rather than in
timekeeping?", and the answer is that RCU has the data and infrastructure
to efficiently make this determination.

Signed-off-by: Paul E. McKenney &lt;paulmck@linux.vnet.ibm.com&gt;
Acked-by: Frederic Weisbecker &lt;fweisbec@gmail.com&gt;
Cc: Steven Rostedt &lt;rostedt@goodmis.org&gt;
Reviewed-by: Josh Triplett &lt;josh@joshtriplett.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>rcu: Add const annotation to char * for RCU tracepoints and functions</title>
<updated>2013-07-29T21:07:49+00:00</updated>
<author>
<name>Steven Rostedt (Red Hat)</name>
<email>rostedt@goodmis.org</email>
</author>
<published>2013-07-12T20:50:28+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=e66c33d579ea566d10e8c8695a7168aae3e02992'/>
<id>e66c33d579ea566d10e8c8695a7168aae3e02992</id>
<content type='text'>
All the RCU tracepoints and functions that reference char pointers do
so with just 'char *' even though they do not modify the contents of
the string itself. This will cause warnings if a const char * is used
in one of these functions.

The RCU tracepoints store the pointer to the string to refer back to them
when the trace output is displayed. As this can be minutes, hours or
even days later, those strings had better be constant.

This change also opens the door to allow the RCU tracepoint strings and
their addresses to be exported so that userspace tracing tools can
translate the contents of the pointers of the RCU tracepoints.
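
A minimal sketch of why const is the right annotation here: the
tracepoint only stores the pointer for later display, never writing
through it (trace_rcu_sketch() is an illustrative stand-in, not a real
kernel tracepoint):

```c
#include <assert.h>
#include <string.h>

static const char *stored_name;  /* read back when output is rendered */

static void trace_rcu_sketch(const char *rcuname)
{
    /* Pointer saved, never modified; may be read hours later,
     * which is why the string itself had better be constant. */
    stored_name = rcuname;
}
```

With a plain 'char *' parameter, passing a string literal (which has
type 'const char *') would provoke a discarded-qualifier warning.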

Signed-off-by: Steven Rostedt &lt;rostedt@goodmis.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
All the RCU tracepoints and functions that reference char pointers do
so with just 'char *' even though they do not modify the contents of
the string itself. This will cause warnings if a const char * is used
in one of these functions.

The RCU tracepoints store the pointer to the string to refer back to them
when the trace output is displayed. As this can be minutes, hours or
even days later, those strings had better be constant.

This change also opens the door to allow the RCU tracepoint strings and
their addresses to be exported so that userspace tracing tools can
translate the contents of the pointers of the RCU tracepoints.

Signed-off-by: Steven Rostedt &lt;rostedt@goodmis.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>rcu: delete __cpuinit usage from all rcu files</title>
<updated>2013-07-14T23:36:58+00:00</updated>
<author>
<name>Paul Gortmaker</name>
<email>paul.gortmaker@windriver.com</email>
</author>
<published>2013-06-19T18:52:21+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=49fb4c6290c70c418a5c25eee996d6b55ea132d6'/>
<id>49fb4c6290c70c418a5c25eee996d6b55ea132d6</id>
<content type='text'>
The __cpuinit type of throwaway sections might have made sense
some time ago when RAM was more constrained, but now the savings
do not offset the cost and complications.  For example, the fix in
commit 5e427ec2d0 ("x86: Fix bit corruption at CPU resume time")
is a good example of the nasty type of bugs that can be created
with improper use of the various __init prefixes.

After a discussion on LKML[1] it was decided that cpuinit should go
the way of devinit and be phased out.  Once all the users are gone,
we can then finally remove the macros themselves from linux/init.h.

This removes all the drivers/rcu uses of the __cpuinit macros
from all C files.

[1] https://lkml.org/lkml/2013/5/20/589

Cc: "Paul E. McKenney" &lt;paulmck@linux.vnet.ibm.com&gt;
Cc: Josh Triplett &lt;josh@freedesktop.org&gt;
Cc: Dipankar Sarma &lt;dipankar@in.ibm.com&gt;
Reviewed-by: Josh Triplett &lt;josh@joshtriplett.org&gt;
Signed-off-by: Paul Gortmaker &lt;paul.gortmaker@windriver.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
The __cpuinit type of throwaway sections might have made sense
some time ago when RAM was more constrained, but now the savings
do not offset the cost and complications.  For example, the fix in
commit 5e427ec2d0 ("x86: Fix bit corruption at CPU resume time")
is a good example of the nasty type of bugs that can be created
with improper use of the various __init prefixes.

After a discussion on LKML[1] it was decided that cpuinit should go
the way of devinit and be phased out.  Once all the users are gone,
we can then finally remove the macros themselves from linux/init.h.

This removes all the drivers/rcu uses of the __cpuinit macros
from all C files.

[1] https://lkml.org/lkml/2013/5/20/589

Cc: "Paul E. McKenney" &lt;paulmck@linux.vnet.ibm.com&gt;
Cc: Josh Triplett &lt;josh@freedesktop.org&gt;
Cc: Dipankar Sarma &lt;dipankar@in.ibm.com&gt;
Reviewed-by: Josh Triplett &lt;josh@joshtriplett.org&gt;
Signed-off-by: Paul Gortmaker &lt;paul.gortmaker@windriver.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>rcu: Drive quiescent-state-forcing delay from HZ</title>
<updated>2013-06-10T20:44:56+00:00</updated>
<author>
<name>Paul E. McKenney</name>
<email>paulmck@linux.vnet.ibm.com</email>
</author>
<published>2013-04-04T05:14:11+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=026ad2835ce6202069e7aa0b11f5f1be4de34550'/>
<id>026ad2835ce6202069e7aa0b11f5f1be4de34550</id>
<content type='text'>
Systems with HZ=100 can have slow bootup times due to the default
three-jiffy delays between quiescent-state forcing attempts.  This
commit therefore auto-tunes the RCU_JIFFIES_TILL_FORCE_QS value based
on the value of HZ.  However, this would break very large systems that
require more time between quiescent-state forcing attempts.  This
commit therefore also ups the default delay by one jiffy for each
256 CPUs that might be on the system (based on nr_cpu_ids at
runtime, -not- NR_CPUS at build time).

Updated to collapse #ifdefs for RCU_JIFFIES_TILL_FORCE_QS into a
step-function definition as suggested by Josh Triplett.
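
A sketch of the resulting tuning (HZ here is a stand-in for the build
configuration, and jiffies_till_fqs() an illustrative helper; the kernel
takes nr_cpu_ids from the running system):

```c
#include <assert.h>

/* Step function of HZ: one jiffy at HZ=100, stepping up at 250 and 500. */
#define HZ 100
#define JIFFIES_TILL_FORCE_QS (1 + (HZ > 250) + (HZ > 500))

/* Plus one extra jiffy per 256 possible CPUs. */
static int jiffies_till_fqs(int nr_cpu_ids)
{
    return JIFFIES_TILL_FORCE_QS + nr_cpu_ids / 256;
}
```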

Reported-by: Paul Mackerras &lt;paulus@au1.ibm.com&gt;
Signed-off-by: Paul E. McKenney &lt;paulmck@linux.vnet.ibm.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Systems with HZ=100 can have slow bootup times due to the default
three-jiffy delays between quiescent-state forcing attempts.  This
commit therefore auto-tunes the RCU_JIFFIES_TILL_FORCE_QS value based
on the value of HZ.  However, this would break very large systems that
require more time between quiescent-state forcing attempts.  This
commit therefore also ups the default delay by one jiffy for each
256 CPUs that might be on the system (based on nr_cpu_ids at
runtime, -not- NR_CPUS at build time).

Updated to collapse #ifdefs for RCU_JIFFIES_TILL_FORCE_QS into a
step-function definition as suggested by Josh Triplett.

Reported-by: Paul Mackerras &lt;paulus@au1.ibm.com&gt;
Signed-off-by: Paul E. McKenney &lt;paulmck@linux.vnet.ibm.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>rcu: Don't call wakeup() with rcu_node structure -&gt;lock held</title>
<updated>2013-06-10T20:37:11+00:00</updated>
<author>
<name>Steven Rostedt</name>
<email>rostedt@goodmis.org</email>
</author>
<published>2013-05-28T21:32:53+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=016a8d5be6ddcc72ef0432d82d9f6fa34f61b907'/>
<id>016a8d5be6ddcc72ef0432d82d9f6fa34f61b907</id>
<content type='text'>
This commit fixes a lockdep-detected deadlock by moving a wake_up()
call out of an rnp-&gt;lock critical section.  Please see below for
the long version of this story.

On Tue, 2013-05-28 at 16:13 -0400, Dave Jones wrote:

&gt; [12572.705832] ======================================================
&gt; [12572.750317] [ INFO: possible circular locking dependency detected ]
&gt; [12572.796978] 3.10.0-rc3+ #39 Not tainted
&gt; [12572.833381] -------------------------------------------------------
&gt; [12572.862233] trinity-child17/31341 is trying to acquire lock:
&gt; [12572.870390]  (rcu_node_0){..-.-.}, at: [&lt;ffffffff811054ff&gt;] rcu_read_unlock_special+0x9f/0x4c0
&gt; [12572.878859]
&gt; but task is already holding lock:
&gt; [12572.894894]  (&amp;ctx-&gt;lock){-.-...}, at: [&lt;ffffffff811390ed&gt;] perf_lock_task_context+0x7d/0x2d0
&gt; [12572.903381]
&gt; which lock already depends on the new lock.
&gt;
&gt; [12572.927541]
&gt; the existing dependency chain (in reverse order) is:
&gt; [12572.943736]
&gt; -&gt; #4 (&amp;ctx-&gt;lock){-.-...}:
&gt; [12572.960032]        [&lt;ffffffff810b9851&gt;] lock_acquire+0x91/0x1f0
&gt; [12572.968337]        [&lt;ffffffff816ebc90&gt;] _raw_spin_lock+0x40/0x80
&gt; [12572.976633]        [&lt;ffffffff8113c987&gt;] __perf_event_task_sched_out+0x2e7/0x5e0
&gt; [12572.984969]        [&lt;ffffffff81088953&gt;] perf_event_task_sched_out+0x93/0xa0
&gt; [12572.993326]        [&lt;ffffffff816ea0bf&gt;] __schedule+0x2cf/0x9c0
&gt; [12573.001652]        [&lt;ffffffff816eacfe&gt;] schedule_user+0x2e/0x70
&gt; [12573.009998]        [&lt;ffffffff816ecd64&gt;] retint_careful+0x12/0x2e
&gt; [12573.018321]
&gt; -&gt; #3 (&amp;rq-&gt;lock){-.-.-.}:
&gt; [12573.034628]        [&lt;ffffffff810b9851&gt;] lock_acquire+0x91/0x1f0
&gt; [12573.042930]        [&lt;ffffffff816ebc90&gt;] _raw_spin_lock+0x40/0x80
&gt; [12573.051248]        [&lt;ffffffff8108e6a7&gt;] wake_up_new_task+0xb7/0x260
&gt; [12573.059579]        [&lt;ffffffff810492f5&gt;] do_fork+0x105/0x470
&gt; [12573.067880]        [&lt;ffffffff81049686&gt;] kernel_thread+0x26/0x30
&gt; [12573.076202]        [&lt;ffffffff816cee63&gt;] rest_init+0x23/0x140
&gt; [12573.084508]        [&lt;ffffffff81ed8e1f&gt;] start_kernel+0x3f1/0x3fe
&gt; [12573.092852]        [&lt;ffffffff81ed856f&gt;] x86_64_start_reservations+0x2a/0x2c
&gt; [12573.101233]        [&lt;ffffffff81ed863d&gt;] x86_64_start_kernel+0xcc/0xcf
&gt; [12573.109528]
&gt; -&gt; #2 (&amp;p-&gt;pi_lock){-.-.-.}:
&gt; [12573.125675]        [&lt;ffffffff810b9851&gt;] lock_acquire+0x91/0x1f0
&gt; [12573.133829]        [&lt;ffffffff816ebe9b&gt;] _raw_spin_lock_irqsave+0x4b/0x90
&gt; [12573.141964]        [&lt;ffffffff8108e881&gt;] try_to_wake_up+0x31/0x320
&gt; [12573.150065]        [&lt;ffffffff8108ebe2&gt;] default_wake_function+0x12/0x20
&gt; [12573.158151]        [&lt;ffffffff8107bbf8&gt;] autoremove_wake_function+0x18/0x40
&gt; [12573.166195]        [&lt;ffffffff81085398&gt;] __wake_up_common+0x58/0x90
&gt; [12573.174215]        [&lt;ffffffff81086909&gt;] __wake_up+0x39/0x50
&gt; [12573.182146]        [&lt;ffffffff810fc3da&gt;] rcu_start_gp_advanced.isra.11+0x4a/0x50
&gt; [12573.190119]        [&lt;ffffffff810fdb09&gt;] rcu_start_future_gp+0x1c9/0x1f0
&gt; [12573.198023]        [&lt;ffffffff810fe2c4&gt;] rcu_nocb_kthread+0x114/0x930
&gt; [12573.205860]        [&lt;ffffffff8107a91d&gt;] kthread+0xed/0x100
&gt; [12573.213656]        [&lt;ffffffff816f4b1c&gt;] ret_from_fork+0x7c/0xb0
&gt; [12573.221379]
&gt; -&gt; #1 (&amp;rsp-&gt;gp_wq){..-.-.}:
&gt; [12573.236329]        [&lt;ffffffff810b9851&gt;] lock_acquire+0x91/0x1f0
&gt; [12573.243783]        [&lt;ffffffff816ebe9b&gt;] _raw_spin_lock_irqsave+0x4b/0x90
&gt; [12573.251178]        [&lt;ffffffff810868f3&gt;] __wake_up+0x23/0x50
&gt; [12573.258505]        [&lt;ffffffff810fc3da&gt;] rcu_start_gp_advanced.isra.11+0x4a/0x50
&gt; [12573.265891]        [&lt;ffffffff810fdb09&gt;] rcu_start_future_gp+0x1c9/0x1f0
&gt; [12573.273248]        [&lt;ffffffff810fe2c4&gt;] rcu_nocb_kthread+0x114/0x930
&gt; [12573.280564]        [&lt;ffffffff8107a91d&gt;] kthread+0xed/0x100
&gt; [12573.287807]        [&lt;ffffffff816f4b1c&gt;] ret_from_fork+0x7c/0xb0

Notice the above call chain.

rcu_start_future_gp() is called with the rnp-&gt;lock held. Then it calls
rcu_start_gp_advanced(), which does a wakeup.

You can't do wakeups while holding the rnp-&gt;lock, as that would mean
that you could not do an rcu_read_unlock() while holding the rq lock, or
any lock that was taken while holding the rq lock. This is because...
(See below).

&gt; [12573.295067]
&gt; -&gt; #0 (rcu_node_0){..-.-.}:
&gt; [12573.309293]        [&lt;ffffffff810b8d36&gt;] __lock_acquire+0x1786/0x1af0
&gt; [12573.316568]        [&lt;ffffffff810b9851&gt;] lock_acquire+0x91/0x1f0
&gt; [12573.323825]        [&lt;ffffffff816ebc90&gt;] _raw_spin_lock+0x40/0x80
&gt; [12573.331081]        [&lt;ffffffff811054ff&gt;] rcu_read_unlock_special+0x9f/0x4c0
&gt; [12573.338377]        [&lt;ffffffff810760a6&gt;] __rcu_read_unlock+0x96/0xa0
&gt; [12573.345648]        [&lt;ffffffff811391b3&gt;] perf_lock_task_context+0x143/0x2d0
&gt; [12573.352942]        [&lt;ffffffff8113938e&gt;] find_get_context+0x4e/0x1f0
&gt; [12573.360211]        [&lt;ffffffff811403f4&gt;] SYSC_perf_event_open+0x514/0xbd0
&gt; [12573.367514]        [&lt;ffffffff81140e49&gt;] SyS_perf_event_open+0x9/0x10
&gt; [12573.374816]        [&lt;ffffffff816f4dd4&gt;] tracesys+0xdd/0xe2

Notice the above trace.

perf took its own ctx-&gt;lock, which can be taken while holding the rq
lock. While holding this lock, it did a rcu_read_unlock(). The
perf_lock_task_context() basically looks like:

rcu_read_lock();
raw_spin_lock(ctx-&gt;lock);
rcu_read_unlock();

Now, what looks to have happened, is that we scheduled after taking that
first rcu_read_lock() but before taking the spin lock. When we scheduled
back in and took the ctx-&gt;lock, the following rcu_read_unlock()
triggered the "special" code.

The rcu_read_unlock_special() takes the rnp-&gt;lock, which gives us a
possible deadlock scenario.

	CPU0		CPU1		CPU2
	----		----		----

				     rcu_nocb_kthread()
    lock(rq-&gt;lock);
		    lock(ctx-&gt;lock);
				     lock(rnp-&gt;lock);

				     wake_up();

				     lock(rq-&gt;lock);

		    rcu_read_unlock();

		    rcu_read_unlock_special();

		    lock(rnp-&gt;lock);
    lock(ctx-&gt;lock);

**** DEADLOCK ****

&gt; [12573.382068]
&gt; other info that might help us debug this:
&gt;
&gt; [12573.403229] Chain exists of:
&gt;   rcu_node_0 --&gt; &amp;rq-&gt;lock --&gt; &amp;ctx-&gt;lock
&gt;
&gt; [12573.424471]  Possible unsafe locking scenario:
&gt;
&gt; [12573.438499]        CPU0                    CPU1
&gt; [12573.445599]        ----                    ----
&gt; [12573.452691]   lock(&amp;ctx-&gt;lock);
&gt; [12573.459799]                                lock(&amp;rq-&gt;lock);
&gt; [12573.467010]                                lock(&amp;ctx-&gt;lock);
&gt; [12573.474192]   lock(rcu_node_0);
&gt; [12573.481262]
&gt;  *** DEADLOCK ***
&gt;
&gt; [12573.501931] 1 lock held by trinity-child17/31341:
&gt; [12573.508990]  #0:  (&amp;ctx-&gt;lock){-.-...}, at: [&lt;ffffffff811390ed&gt;] perf_lock_task_context+0x7d/0x2d0
&gt; [12573.516475]
&gt; stack backtrace:
&gt; [12573.530395] CPU: 1 PID: 31341 Comm: trinity-child17 Not tainted 3.10.0-rc3+ #39
&gt; [12573.545357]  ffffffff825b4f90 ffff880219f1dbc0 ffffffff816e375b ffff880219f1dc00
&gt; [12573.552868]  ffffffff816dfa5d ffff880219f1dc50 ffff88023ce4d1f8 ffff88023ce4ca40
&gt; [12573.560353]  0000000000000001 0000000000000001 ffff88023ce4d1f8 ffff880219f1dcc0
&gt; [12573.567856] Call Trace:
&gt; [12573.575011]  [&lt;ffffffff816e375b&gt;] dump_stack+0x19/0x1b
&gt; [12573.582284]  [&lt;ffffffff816dfa5d&gt;] print_circular_bug+0x200/0x20f
&gt; [12573.589637]  [&lt;ffffffff810b8d36&gt;] __lock_acquire+0x1786/0x1af0
&gt; [12573.596982]  [&lt;ffffffff810918f5&gt;] ? sched_clock_cpu+0xb5/0x100
&gt; [12573.604344]  [&lt;ffffffff810b9851&gt;] lock_acquire+0x91/0x1f0
&gt; [12573.611652]  [&lt;ffffffff811054ff&gt;] ? rcu_read_unlock_special+0x9f/0x4c0
&gt; [12573.619030]  [&lt;ffffffff816ebc90&gt;] _raw_spin_lock+0x40/0x80
&gt; [12573.626331]  [&lt;ffffffff811054ff&gt;] ? rcu_read_unlock_special+0x9f/0x4c0
&gt; [12573.633671]  [&lt;ffffffff811054ff&gt;] rcu_read_unlock_special+0x9f/0x4c0
&gt; [12573.640992]  [&lt;ffffffff811390ed&gt;] ? perf_lock_task_context+0x7d/0x2d0
&gt; [12573.648330]  [&lt;ffffffff810b429e&gt;] ? put_lock_stats.isra.29+0xe/0x40
&gt; [12573.655662]  [&lt;ffffffff813095a0&gt;] ? delay_tsc+0x90/0xe0
&gt; [12573.662964]  [&lt;ffffffff810760a6&gt;] __rcu_read_unlock+0x96/0xa0
&gt; [12573.670276]  [&lt;ffffffff811391b3&gt;] perf_lock_task_context+0x143/0x2d0
&gt; [12573.677622]  [&lt;ffffffff81139070&gt;] ? __perf_event_enable+0x370/0x370
&gt; [12573.684981]  [&lt;ffffffff8113938e&gt;] find_get_context+0x4e/0x1f0
&gt; [12573.692358]  [&lt;ffffffff811403f4&gt;] SYSC_perf_event_open+0x514/0xbd0
&gt; [12573.699753]  [&lt;ffffffff8108cd9d&gt;] ? get_parent_ip+0xd/0x50
&gt; [12573.707135]  [&lt;ffffffff810b71fd&gt;] ? trace_hardirqs_on_caller+0xfd/0x1c0
&gt; [12573.714599]  [&lt;ffffffff81140e49&gt;] SyS_perf_event_open+0x9/0x10
&gt; [12573.721996]  [&lt;ffffffff816f4dd4&gt;] tracesys+0xdd/0xe2

This commit delays the wakeup via irq_work(), which is what
perf and ftrace use to perform wakeups in critical sections.
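
A userspace model of that deferred-wakeup pattern (all names are
illustrative; the kernel queues an irq_work while the lock is held and
the wakeup runs when the irq_work fires, outside the critical section):

```c
#include <assert.h>
#include <stdbool.h>

static bool wakeup_queued;      /* stands in for the pending irq_work */
static int wakeups_delivered;

static void rcu_gp_kick_locked(void)
{
    /* rnp->lock would be held here, so no direct wake_up(). */
    wakeup_queued = true;       /* irq_work_queue() in the real fix */
}

static void irq_work_run_sketch(void)
{
    /* Runs later, outside the rnp->lock critical section. */
    if (wakeup_queued) {
        wakeup_queued = false;
        wakeups_delivered++;    /* the deferred wake_up() */
    }
}
```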

Reported-by: Dave Jones &lt;davej@redhat.com&gt;
Signed-off-by: Steven Rostedt &lt;rostedt@goodmis.org&gt;
Signed-off-by: Paul E. McKenney &lt;paulmck@linux.vnet.ibm.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
This commit fixes a lockdep-detected deadlock by moving a wake_up()
call out from a rnp-&gt;lock critical section.  Please see below for
the long version of this story.

On Tue, 2013-05-28 at 16:13 -0400, Dave Jones wrote:

&gt; [12572.705832] ======================================================
&gt; [12572.750317] [ INFO: possible circular locking dependency detected ]
&gt; [12572.796978] 3.10.0-rc3+ #39 Not tainted
&gt; [12572.833381] -------------------------------------------------------
&gt; [12572.862233] trinity-child17/31341 is trying to acquire lock:
&gt; [12572.870390]  (rcu_node_0){..-.-.}, at: [&lt;ffffffff811054ff&gt;] rcu_read_unlock_special+0x9f/0x4c0
&gt; [12572.878859]
&gt; but task is already holding lock:
&gt; [12572.894894]  (&amp;ctx-&gt;lock){-.-...}, at: [&lt;ffffffff811390ed&gt;] perf_lock_task_context+0x7d/0x2d0
&gt; [12572.903381]
&gt; which lock already depends on the new lock.
&gt;
&gt; [12572.927541]
&gt; the existing dependency chain (in reverse order) is:
&gt; [12572.943736]
&gt; -&gt; #4 (&amp;ctx-&gt;lock){-.-...}:
&gt; [12572.960032]        [&lt;ffffffff810b9851&gt;] lock_acquire+0x91/0x1f0
&gt; [12572.968337]        [&lt;ffffffff816ebc90&gt;] _raw_spin_lock+0x40/0x80
&gt; [12572.976633]        [&lt;ffffffff8113c987&gt;] __perf_event_task_sched_out+0x2e7/0x5e0
&gt; [12572.984969]        [&lt;ffffffff81088953&gt;] perf_event_task_sched_out+0x93/0xa0
&gt; [12572.993326]        [&lt;ffffffff816ea0bf&gt;] __schedule+0x2cf/0x9c0
&gt; [12573.001652]        [&lt;ffffffff816eacfe&gt;] schedule_user+0x2e/0x70
&gt; [12573.009998]        [&lt;ffffffff816ecd64&gt;] retint_careful+0x12/0x2e
&gt; [12573.018321]
&gt; -&gt; #3 (&amp;rq-&gt;lock){-.-.-.}:
&gt; [12573.034628]        [&lt;ffffffff810b9851&gt;] lock_acquire+0x91/0x1f0
&gt; [12573.042930]        [&lt;ffffffff816ebc90&gt;] _raw_spin_lock+0x40/0x80
&gt; [12573.051248]        [&lt;ffffffff8108e6a7&gt;] wake_up_new_task+0xb7/0x260
&gt; [12573.059579]        [&lt;ffffffff810492f5&gt;] do_fork+0x105/0x470
&gt; [12573.067880]        [&lt;ffffffff81049686&gt;] kernel_thread+0x26/0x30
&gt; [12573.076202]        [&lt;ffffffff816cee63&gt;] rest_init+0x23/0x140
&gt; [12573.084508]        [&lt;ffffffff81ed8e1f&gt;] start_kernel+0x3f1/0x3fe
&gt; [12573.092852]        [&lt;ffffffff81ed856f&gt;] x86_64_start_reservations+0x2a/0x2c
&gt; [12573.101233]        [&lt;ffffffff81ed863d&gt;] x86_64_start_kernel+0xcc/0xcf
&gt; [12573.109528]
&gt; -&gt; #2 (&amp;p-&gt;pi_lock){-.-.-.}:
&gt; [12573.125675]        [&lt;ffffffff810b9851&gt;] lock_acquire+0x91/0x1f0
&gt; [12573.133829]        [&lt;ffffffff816ebe9b&gt;] _raw_spin_lock_irqsave+0x4b/0x90
&gt; [12573.141964]        [&lt;ffffffff8108e881&gt;] try_to_wake_up+0x31/0x320
&gt; [12573.150065]        [&lt;ffffffff8108ebe2&gt;] default_wake_function+0x12/0x20
&gt; [12573.158151]        [&lt;ffffffff8107bbf8&gt;] autoremove_wake_function+0x18/0x40
&gt; [12573.166195]        [&lt;ffffffff81085398&gt;] __wake_up_common+0x58/0x90
&gt; [12573.174215]        [&lt;ffffffff81086909&gt;] __wake_up+0x39/0x50
&gt; [12573.182146]        [&lt;ffffffff810fc3da&gt;] rcu_start_gp_advanced.isra.11+0x4a/0x50
&gt; [12573.190119]        [&lt;ffffffff810fdb09&gt;] rcu_start_future_gp+0x1c9/0x1f0
&gt; [12573.198023]        [&lt;ffffffff810fe2c4&gt;] rcu_nocb_kthread+0x114/0x930
&gt; [12573.205860]        [&lt;ffffffff8107a91d&gt;] kthread+0xed/0x100
&gt; [12573.213656]        [&lt;ffffffff816f4b1c&gt;] ret_from_fork+0x7c/0xb0
&gt; [12573.221379]
&gt; -&gt; #1 (&amp;rsp-&gt;gp_wq){..-.-.}:
&gt; [12573.236329]        [&lt;ffffffff810b9851&gt;] lock_acquire+0x91/0x1f0
&gt; [12573.243783]        [&lt;ffffffff816ebe9b&gt;] _raw_spin_lock_irqsave+0x4b/0x90
&gt; [12573.251178]        [&lt;ffffffff810868f3&gt;] __wake_up+0x23/0x50
&gt; [12573.258505]        [&lt;ffffffff810fc3da&gt;] rcu_start_gp_advanced.isra.11+0x4a/0x50
&gt; [12573.265891]        [&lt;ffffffff810fdb09&gt;] rcu_start_future_gp+0x1c9/0x1f0
&gt; [12573.273248]        [&lt;ffffffff810fe2c4&gt;] rcu_nocb_kthread+0x114/0x930
&gt; [12573.280564]        [&lt;ffffffff8107a91d&gt;] kthread+0xed/0x100
&gt; [12573.287807]        [&lt;ffffffff816f4b1c&gt;] ret_from_fork+0x7c/0xb0

Notice the above call chain.

rcu_start_future_gp() is called with the rnp-&gt;lock held.  It then calls
rcu_start_gp_advanced(), which does a wakeup.

You can't do wakeups while holding the rnp-&gt;lock, as that would mean
that you could not do an rcu_read_unlock() while holding the rq lock, or
while holding any lock that was taken while holding the rq lock.  The
reason is shown below.

&gt; [12573.295067]
&gt; -&gt; #0 (rcu_node_0){..-.-.}:
&gt; [12573.309293]        [&lt;ffffffff810b8d36&gt;] __lock_acquire+0x1786/0x1af0
&gt; [12573.316568]        [&lt;ffffffff810b9851&gt;] lock_acquire+0x91/0x1f0
&gt; [12573.323825]        [&lt;ffffffff816ebc90&gt;] _raw_spin_lock+0x40/0x80
&gt; [12573.331081]        [&lt;ffffffff811054ff&gt;] rcu_read_unlock_special+0x9f/0x4c0
&gt; [12573.338377]        [&lt;ffffffff810760a6&gt;] __rcu_read_unlock+0x96/0xa0
&gt; [12573.345648]        [&lt;ffffffff811391b3&gt;] perf_lock_task_context+0x143/0x2d0
&gt; [12573.352942]        [&lt;ffffffff8113938e&gt;] find_get_context+0x4e/0x1f0
&gt; [12573.360211]        [&lt;ffffffff811403f4&gt;] SYSC_perf_event_open+0x514/0xbd0
&gt; [12573.367514]        [&lt;ffffffff81140e49&gt;] SyS_perf_event_open+0x9/0x10
&gt; [12573.374816]        [&lt;ffffffff816f4dd4&gt;] tracesys+0xdd/0xe2

Notice the above trace.

perf took its own ctx-&gt;lock, which can be taken while holding the rq
lock.  While holding this lock, it did an rcu_read_unlock().  In
essence, perf_lock_task_context() looks like:

rcu_read_lock();
raw_spin_lock(&amp;ctx-&gt;lock);
rcu_read_unlock();

What appears to have happened is that the task was preempted after
taking that first rcu_read_lock() but before taking the spin lock.  When
it was scheduled back in and took the ctx-&gt;lock, the subsequent
rcu_read_unlock() triggered the "special" unlock code.

rcu_read_unlock_special() takes the rnp-&gt;lock, which gives us the
following possible deadlock scenario.

	CPU0		CPU1		CPU2
	----		----		----

				     rcu_nocb_kthread()
    lock(rq-&gt;lock);
		    lock(ctx-&gt;lock);
				     lock(rnp-&gt;lock);

				     wake_up();

				     lock(rq-&gt;lock);

		    rcu_read_unlock();

		    rcu_read_unlock_special();

		    lock(rnp-&gt;lock);
    lock(ctx-&gt;lock);

**** DEADLOCK ****

&gt; [12573.382068]
&gt; other info that might help us debug this:
&gt;
&gt; [12573.403229] Chain exists of:
&gt;   rcu_node_0 --&gt; &amp;rq-&gt;lock --&gt; &amp;ctx-&gt;lock
&gt;
&gt; [12573.424471]  Possible unsafe locking scenario:
&gt;
&gt; [12573.438499]        CPU0                    CPU1
&gt; [12573.445599]        ----                    ----
&gt; [12573.452691]   lock(&amp;ctx-&gt;lock);
&gt; [12573.459799]                                lock(&amp;rq-&gt;lock);
&gt; [12573.467010]                                lock(&amp;ctx-&gt;lock);
&gt; [12573.474192]   lock(rcu_node_0);
&gt; [12573.481262]
&gt;  *** DEADLOCK ***
&gt;
&gt; [12573.501931] 1 lock held by trinity-child17/31341:
&gt; [12573.508990]  #0:  (&amp;ctx-&gt;lock){-.-...}, at: [&lt;ffffffff811390ed&gt;] perf_lock_task_context+0x7d/0x2d0
&gt; [12573.516475]
&gt; stack backtrace:
&gt; [12573.530395] CPU: 1 PID: 31341 Comm: trinity-child17 Not tainted 3.10.0-rc3+ #39
&gt; [12573.545357]  ffffffff825b4f90 ffff880219f1dbc0 ffffffff816e375b ffff880219f1dc00
&gt; [12573.552868]  ffffffff816dfa5d ffff880219f1dc50 ffff88023ce4d1f8 ffff88023ce4ca40
&gt; [12573.560353]  0000000000000001 0000000000000001 ffff88023ce4d1f8 ffff880219f1dcc0
&gt; [12573.567856] Call Trace:
&gt; [12573.575011]  [&lt;ffffffff816e375b&gt;] dump_stack+0x19/0x1b
&gt; [12573.582284]  [&lt;ffffffff816dfa5d&gt;] print_circular_bug+0x200/0x20f
&gt; [12573.589637]  [&lt;ffffffff810b8d36&gt;] __lock_acquire+0x1786/0x1af0
&gt; [12573.596982]  [&lt;ffffffff810918f5&gt;] ? sched_clock_cpu+0xb5/0x100
&gt; [12573.604344]  [&lt;ffffffff810b9851&gt;] lock_acquire+0x91/0x1f0
&gt; [12573.611652]  [&lt;ffffffff811054ff&gt;] ? rcu_read_unlock_special+0x9f/0x4c0
&gt; [12573.619030]  [&lt;ffffffff816ebc90&gt;] _raw_spin_lock+0x40/0x80
&gt; [12573.626331]  [&lt;ffffffff811054ff&gt;] ? rcu_read_unlock_special+0x9f/0x4c0
&gt; [12573.633671]  [&lt;ffffffff811054ff&gt;] rcu_read_unlock_special+0x9f/0x4c0
&gt; [12573.640992]  [&lt;ffffffff811390ed&gt;] ? perf_lock_task_context+0x7d/0x2d0
&gt; [12573.648330]  [&lt;ffffffff810b429e&gt;] ? put_lock_stats.isra.29+0xe/0x40
&gt; [12573.655662]  [&lt;ffffffff813095a0&gt;] ? delay_tsc+0x90/0xe0
&gt; [12573.662964]  [&lt;ffffffff810760a6&gt;] __rcu_read_unlock+0x96/0xa0
&gt; [12573.670276]  [&lt;ffffffff811391b3&gt;] perf_lock_task_context+0x143/0x2d0
&gt; [12573.677622]  [&lt;ffffffff81139070&gt;] ? __perf_event_enable+0x370/0x370
&gt; [12573.684981]  [&lt;ffffffff8113938e&gt;] find_get_context+0x4e/0x1f0
&gt; [12573.692358]  [&lt;ffffffff811403f4&gt;] SYSC_perf_event_open+0x514/0xbd0
&gt; [12573.699753]  [&lt;ffffffff8108cd9d&gt;] ? get_parent_ip+0xd/0x50
&gt; [12573.707135]  [&lt;ffffffff810b71fd&gt;] ? trace_hardirqs_on_caller+0xfd/0x1c0
&gt; [12573.714599]  [&lt;ffffffff81140e49&gt;] SyS_perf_event_open+0x9/0x10
&gt; [12573.721996]  [&lt;ffffffff816f4dd4&gt;] tracesys+0xdd/0xe2

This commit delays the wakeup via the irq_work mechanism, which is what
perf and ftrace use to perform wakeups from within critical sections.
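The deferred-wakeup pattern described above can be sketched as follows.
This is a minimal kernel-context illustration only, not the literal
patch: the rsp_wakeup() helper name and the wakeup_work field are
assumptions for illustration, and the fragment is not compilable
standalone outside a kernel tree.

```c
#include &lt;linux/irq_work.h&gt;
#include &lt;linux/wait.h&gt;

/* Runs later from irq_work context, outside any rnp-&gt;lock critical
 * section, so the wakeup can no longer deadlock against rq-&gt;lock. */
static void rsp_wakeup(struct irq_work *work)
{
	struct rcu_state *rsp = container_of(work, struct rcu_state,
					     wakeup_work);

	wake_up(&amp;rsp-&gt;gp_wq);	/* Safe: rnp-&gt;lock is not held here. */
}

/* Once, at initialization time: */
init_irq_work(&amp;rsp-&gt;wakeup_work, rsp_wakeup);

/* Then, instead of calling wake_up(&amp;rsp-&gt;gp_wq) while holding
 * rnp-&gt;lock, queue the deferred wakeup: */
irq_work_queue(&amp;rsp-&gt;wakeup_work);
```

This mirrors how perf and ftrace defer wakeups out of contexts where
taking rq-&gt;lock (directly or transitively) would be unsafe.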

Reported-by: Dave Jones &lt;davej@redhat.com&gt;
Signed-off-by: Steven Rostedt &lt;rostedt@goodmis.org&gt;
Signed-off-by: Paul E. McKenney &lt;paulmck@linux.vnet.ibm.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>Merge commit '8700c95adb03' into timers/nohz</title>
<updated>2013-05-02T15:54:19+00:00</updated>
<author>
<name>Frederic Weisbecker</name>
<email>fweisbec@gmail.com</email>
</author>
<published>2013-05-02T15:37:49+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=c032862fba51a3ca504752d3a25186b324c5ce83'/>
<id>c032862fba51a3ca504752d3a25186b324c5ce83</id>
<content type='text'>
The full dynticks tree needs the latest RCU and sched
upstream updates in order to fix some dependencies.

Merge a common upstream merge point that has these
updates.

Conflicts:
	include/linux/perf_event.h
	kernel/rcutree.h
	kernel/rcutree_plugin.h

Signed-off-by: Frederic Weisbecker &lt;fweisbec@gmail.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
The full dynticks tree needs the latest RCU and sched
upstream updates in order to fix some dependencies.

Merge a common upstream merge point that has these
updates.

Conflicts:
	include/linux/perf_event.h
	kernel/rcutree.h
	kernel/rcutree_plugin.h

Signed-off-by: Frederic Weisbecker &lt;fweisbec@gmail.com&gt;
</pre>
</div>
</content>
</entry>
</feed>
