summaryrefslogtreecommitdiff
path: root/kernel/rcutree.h
AgeCommit message (Collapse)Author
2012-10-08rcu: Grace-period initialization excludes only RCU notifierPaul E. McKenney
Kirill noted the following deadlock cycle on shutdown involving padata: > With commit 755609a9087fa983f567dc5452b2fa7b089b591f I've got deadlock on > poweroff. > > It guess it happens because of race for cpu_hotplug.lock: > > CPU A CPU B > disable_nonboot_cpus() > _cpu_down() > cpu_hotplug_begin() > mutex_lock(&cpu_hotplug.lock); > __cpu_notify() > padata_cpu_callback() > __padata_remove_cpu() > padata_replace() > synchronize_rcu() > rcu_gp_kthread() > get_online_cpus(); > mutex_lock(&cpu_hotplug.lock); It would of course be good to eliminate grace-period delays from CPU-hotplug notifiers, but that is a separate issue. Deadlock is not an appropriate diagnostic for excessive CPU-hotplug latency. Fortunately, grace-period initialization does not actually need to exclude all of the CPU-hotplug operation, but rather only RCU's own CPU_UP_PREPARE and CPU_DEAD CPU-hotplug notifiers. This commit therefore introduces a new per-rcu_state onoff_mutex that provides the required concurrency control in place of the get_online_cpus() that was previously in rcu_gp_init(). Reported-by: "Kirill A. Shutemov" <kirill@shutemov.name> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Tested-by: Kirill A. Shutemov <kirill@shutemov.name>
2012-09-26rcu: Ignore userspace extended quiescent state by defaultFrederic Weisbecker
By default we don't want to enter into RCU extended quiescent state while in userspace because doing this produces some overhead (eg: use of syscall slowpath). Set it off by default and ready to run when some feature like adaptive tickless need it. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Alessio Igor Bogani <abogani@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Avi Kivity <avi@redhat.com> Cc: Chris Metcalf <cmetcalf@tilera.com> Cc: Christoph Lameter <cl@linux.com> Cc: Geoff Levand <geoff@infradead.org> Cc: Gilad Ben Yossef <gilad@benyossef.com> Cc: Hakan Akkan <hakanakkan@gmail.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Kevin Hilman <khilman@ti.com> Cc: Max Krasnyansky <maxk@qualcomm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephen Hemminger <shemminger@vyatta.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2012-09-26rcu: Allow rcu_user_enter()/exit() to nestFrederic Weisbecker
Allow calls to rcu_user_enter() even if we are already in userspace (as seen by RCU) and allow calls to rcu_user_exit() even if we are already in the kernel. This makes the APIs more flexible to be called from architectures. Exception entries for example won't need to know if they come from userspace before calling rcu_user_exit(). Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Alessio Igor Bogani <abogani@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Avi Kivity <avi@redhat.com> Cc: Chris Metcalf <cmetcalf@tilera.com> Cc: Christoph Lameter <cl@linux.com> Cc: Geoff Levand <geoff@infradead.org> Cc: Gilad Ben Yossef <gilad@benyossef.com> Cc: Hakan Akkan <hakanakkan@gmail.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Kevin Hilman <khilman@ti.com> Cc: Max Krasnyansky <maxk@qualcomm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephen Hemminger <shemminger@vyatta.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2012-09-25Merge remote-tracking branch 'tip/smp/hotplug' into next.2012.09.25bPaul E. McKenney
The conflicts between kernel/rcutree.h and kernel/rcutree_plugin.h were due to adjacent insertions and deletions, which were resolved by simply accepting the changes on both branches.
2012-09-24Merge branches 'bigrt.2012.09.23a', 'doctorture.2012.09.23a', ↵Paul E. McKenney
'fixes.2012.09.23a', 'hotplug.2012.09.23a' and 'idlechop.2012.09.23a' into HEAD bigrt.2012.09.23a contains additional commits to reduce scheduling latency from RCU on huge systems (many hundrends or thousands of CPUs). doctorture.2012.09.23a contains documentation changes and rcutorture fixes. fixes.2012.09.23a contains miscellaneous fixes. hotplug.2012.09.23a contains CPU-hotplug-related changes. idle.2012.09.23a fixes architectures for which RCU no longer considered the idle loop to be a quiescent state due to earlier adaptive-dynticks changes. Affected architectures are alpha, cris, frv, h8300, m32r, m68k, mn10300, parisc, score, xtensa, and ia64.
2012-09-23rcu: Remove _rcu_barrier() dependency on __stop_machine()Paul E. McKenney
Currently, _rcu_barrier() relies on preempt_disable() to prevent any CPU from going offline, which in turn depends on CPU hotplug's use of __stop_machine(). This patch therefore makes _rcu_barrier() use get_online_cpus() to block CPU-hotplug operations. This has the added benefit of removing the need for _rcu_barrier() to adopt callbacks: Because CPU-hotplug operations are excluded, there can be no callbacks to adopt. This commit simplifies the code accordingly. Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2012-09-23rcu: Simplify quiescent-state detectionPaul E. McKenney
The current quiescent-state detection algorithm is needlessly complex. It records the grace-period number corresponding to the quiescent state at the time of the quiescent state, which works, but it seems better to simply erase any record of previous quiescent states at the time that the CPU notices the new grace period. This has the further advantage of removing another piece of RCU for which lockless reasoning is required. Therefore, this commit makes this change. Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2012-09-23rcu: Prevent force_quiescent_state() memory contentionPaul E. McKenney
Large systems running RCU_FAST_NO_HZ kernels see extreme memory contention on the rcu_state structure's ->fqslock field. This can be avoided by disabling RCU_FAST_NO_HZ, either at compile time or at boot time (via the nohz kernel boot parameter), but large systems will no doubt become sensitive to energy consumption. This commit therefore uses a combining-tree approach to spread the memory contention across new cache lines in the leaf rcu_node structures. This can be thought of as a tournament lock that has only a try-lock acquisition primitive. The effect on small systems is minimal, because such systems have an rcu_node "tree" consisting of a single node. In addition, this functionality is not used on fastpaths. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2012-09-23rcu: Adjust debugfs tracing for kthread-based quiescent-state forcingPaul E. McKenney
Moving quiescent-state forcing into a kthread dispenses with the need for the ->n_rp_need_fqs field, so this commit removes it. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2012-09-23rcu: Move quiescent-state forcing into kthreadPaul E. McKenney
As the first step towards allowing quiescent-state forcing to be preemptible, this commit moves RCU quiescent-state forcing into the same kthread that is now used to initialize and clean up after grace periods. This is yet another step towards keeping scheduling latency down to a dull roar. Updated to change from raw_spin_lock_irqsave() to raw_spin_lock_irq() and to remove the now-unused rcu_state structure fields as suggested by Peter Zijlstra. Reported-by: Mike Galbraith <mgalbraith@suse.de> Reported-by: Dimitri Sivanich <sivanich@sgi.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-09-23rcu: Segregate rcu_state fields to improve cache localityDimitri Sivanich
The fields in the rcu_state structure that are protected by the root rcu_node structure's ->lock can share a cache line with the fields protected by ->onofflock. This can result in excessive memory contention on large systems, so this commit applies ____cacheline_internodealigned_in_smp to the ->onofflock field in order to segregate them. Signed-off-by: Dimitri Sivanich <sivanich@sgi.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Tested-by: Dimitri Sivanich <sivanich@sgi.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2012-09-23rcu: Provide OOM handler to motivate lazy RCU callbacksPaul E. McKenney
In kernels built with CONFIG_RCU_FAST_NO_HZ=y, CPUs can accumulate a large number of lazy callbacks, which as the name implies will be slow to be invoked. This can be a problem on small-memory systems, where the default 6-second sleep for CPUs having only lazy RCU callbacks could well be fatal. This commit therefore installs an OOM hander that ensures that every CPU with lazy callbacks has at least one non-lazy callback, in turn ensuring timely advancement for these callbacks. Updated to fix bug that disabled OOM killing, noted by Lai Jiangshan. Updated to push the for_each_rcu_flavor() loop into rcu_oom_notify_cpu(), thus reducing the number of IPIs, as suggested by Steven Rostedt. Also to make the for_each_online_cpu() loop be preemptible. (Later, it might be good to use smp_call_function(), as suggested by Peter Zijlstra.) Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Tested-by: Sasha Levin <levinsasha928@gmail.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2012-09-23rcu: Move RCU grace-period initialization into a kthreadPaul E. McKenney
As the first step towards allowing grace-period initialization to be preemptible, this commit moves the RCU grace-period initialization into its own kthread. This is needed to keep large-system scheduling latency at reasonable levels. Also change raw_spin_lock_irqsave() to raw_spin_lock_irq() as suggested by Peter Zijlstra in review comments. Reported-by: Mike Galbraith <mgalbraith@suse.de> Reported-by: Dimitri Sivanich <sivanich@sgi.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2012-08-13rcu: Use smp_hotplug_thread facility for RCUs per-CPU kthreadPaul E. McKenney
Bring RCU into the new-age CPU-hotplug fold by modifying RCU's per-CPU kthread code to use the new smp_hotplug_thread facility. [ tglx: Adapted it to use callbacks and to the simplified rcu yield ] Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Namhyung Kim <namhyung@kernel.org> Link: http://lkml.kernel.org/r/20120716103948.673354828@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-08-13rcu: Yield simplerThomas Gleixner
The rcu_yield() code is amazing. It's there to avoid starvation of the system when lots of (boosting) work is to be done. Now looking at the code it's functionality is: Make the thread SCHED_OTHER and very nice, i.e. get it out of the way Arm a timer with 2 ticks schedule() Now if the system goes idle the rcu task returns, regains SCHED_FIFO and plugs on. If the systems stays busy the timer fires and wakes a per node kthread which in turn makes the per cpu thread SCHED_FIFO and brings it back on the cpu. For the boosting thread the "make it FIFO" bit is missing and it just runs some magic boost checks. Now this is a lot of code with extra threads and complexity. It's way simpler to let the tasks when they detect overload schedule away for 2 ticks and defer the normal wakeup as long as they are in yielded state and the cpu is not idle. That solves the same problem and the only difference is that when the cpu goes idle it's not guaranteed that the thread returns right away, but it won't be longer out than two ticks, so no harm is done. If that's an issue than it is way simpler just to wake the task from idle as RCU has callbacks there anyway. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Namhyung Kim <namhyung@kernel.org> Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Link: http://lkml.kernel.org/r/20120716103948.131256723@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-07-06Merge branches 'bigrtm.2012.07.04a', 'doctorture.2012.07.02a', ↵Paul E. McKenney
'fixes.2012.07.06a' and 'fnh.2012.07.02a' into HEAD bigrtm: First steps towards getting RCU out of the way of tens-of-microseconds real-time response on systems compiled with NR_CPUS=4096. Also cleanups for and increased concurrency of rcu_barrier() family of primitives. doctorture: rcutorture and documentation improvements. fixes: Miscellaneous fixes. fnh: RCU_FAST_NO_HZ fixes and improvements.
2012-07-02rcu: Make RCU_FAST_NO_HZ respect nohz= boot parameterPaul E. McKenney
If the nohz= boot parameter disables nohz, then RCU_FAST_NO_HZ needs to also disable itself. This commit therefore checks for tick_nohz_enabled being zero, disabling rcu_prepare_for_idle() if so. This commit assumes that tick_nohz_enabled can change at runtime: If this is not the case, then a simpler approach suffices. Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-07-02rcu: Introduce for_each_rcu_flavor() and use itPaul E. McKenney
The arrival of TREE_PREEMPT_RCU some years back included some ugly code involving either #ifdef or #ifdef'ed wrapper functions to iterate over all non-SRCU flavors of RCU. This commit therefore introduces a for_each_rcu_flavor() iterator over the rcu_state structures for each flavor of RCU to clean up a bit of the ugliness. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-07-02rcu: Increase rcu_barrier() concurrencyPaul E. McKenney
The traditional rcu_barrier() implementation has serialized all requests, regardless of RCU flavor, and also does not coalesce concurrent requests. In the past, this has been good and sufficient. However, systems are getting larger and use of rcu_barrier() has been increasing. This commit therefore introduces a counter-based scheme that allows _rcu_barrier() calls for the same flavor of RCU to take advantage of each others' work. Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-07-02rcu: Move rcu_barrier_mutex to rcu_state structurePaul E. McKenney
In order to allow each RCU flavor to concurrently execute its rcu_barrier() function, it is necessary to move the relevant state to the rcu_state structure. This commit therefore moves the rcu_barrier_mutex global variable to a new ->barrier_mutex field in the rcu_state structure. Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-07-02rcu: Move rcu_barrier_completion to rcu_state structurePaul E. McKenney
In order to allow each RCU flavor to concurrently execute its rcu_barrier() function, it is necessary to move the relevant state to the rcu_state structure. This commit therefore moves the rcu_barrier_completion global variable to a new ->barrier_completion field in the rcu_state structure. Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2012-07-02rcu: Move rcu_barrier_cpu_count to rcu_state structurePaul E. McKenney
In order to allow each RCU flavor to concurrently execute its rcu_barrier() function, it is necessary to move the relevant state to the rcu_state structure. This commit therefore moves the rcu_barrier_cpu_count global variable to a new ->barrier_cpu_count field in the rcu_state structure. Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2012-07-02rcu: Move _rcu_barrier()'s rcu_head structures to rcu_data structuresPaul E. McKenney
In order for multiple flavors of RCU to each concurrently run one rcu_barrier(), each flavor needs its own per-CPU set of rcu_head structures. This commit therefore moves _rcu_barrier()'s set of per-CPU rcu_head structures from per-CPU variables to the existing per-CPU and per-RCU-flavor rcu_data structures. Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2012-07-02rcu: Place pointer to call_rcu() in rcu_data structurePaul E. McKenney
This is a preparatory commit for increasing rcu_barrier()'s concurrency. It adds a pointer in the rcu_data structure to the corresponding call_rcu() function. This allows a pointer to the rcu_data structure to imply the function pointer, which allows _rcu_barrier() state to be placed in the rcu_state structure. Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2012-07-02rcu: Control RCU_FANOUT_LEAF from boot-time parameterPaul E. McKenney
Although making RCU_FANOUT_LEAF a kernel configuration parameter rather than a fixed constant makes it easier for people to decrease cache-miss overhead for large systems, it is of little help for people who must run a single pre-built kernel binary. This commit therefore allows the value of RCU_FANOUT_LEAF to be increased (but not decreased!) via a boot-time parameter named rcutree.rcu_fanout_leaf. Reported-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-07-02Revert "rcu: Move PREEMPT_RCU preemption to switch_to() invocation"Paul E. McKenney
This reverts commit 616c310e83b872024271c915c1b9ab505b9efad9. (Move PREEMPT_RCU preemption to switch_to() invocation). Testing by Sasha Levin <levinsasha928@gmail.com> showed that this can result in deadlock due to invoking the scheduler when one of the runqueue locks is held. Because this commit was simply a performance optimization, revert it. Reported-by: Sasha Levin <levinsasha928@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Tested-by: Sasha Levin <levinsasha928@gmail.com>
2012-06-06rcu: Move RCU_FAST_NO_HZ per-CPU variables to rcu_dynticks structurePaul E. McKenney
The RCU_FAST_NO_HZ code relies on a number of per-CPU variables. This works, but is hidden from someone scanning the data structures in rcutree.h. This commit therefore converts these per-CPU variables to fields in the per-CPU rcu_dynticks structures. Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Tested-by: Heiko Carstens <heiko.carstens@de.ibm.com> Tested-by: Pascal Chapperon <pascal.chapperon@wanadoo.fr>
2012-05-11Merge branches 'barrier.2012.05.09a', 'fixes.2012.04.26a', ↵Paul E. McKenney
'inline.2012.05.02b' and 'srcu.2012.05.07b' into HEAD barrier: Reduce the amount of disturbance by rcu_barrier() to the rest of the system. This branch also includes improvements to RCU_FAST_NO_HZ, which are included here due to conflicts. fixes: Miscellaneous fixes. inline: Remaining changes from an abortive attempt to inline preemptible RCU's __rcu_read_lock(). These are (1) making exit_rcu() avoid unnecessary work and (2) avoiding having preemptible RCU record a blocked thread when the scheduler declines to do a context switch. srcu: Lai Jiangshan's algorithmic implementation of SRCU, including call_srcu().
2012-05-09rcu: Make rcu_barrier() less disruptivePaul E. McKenney
The rcu_barrier() primitive interrupts each and every CPU, registering a callback on every CPU. Once all of these callbacks have been invoked, rcu_barrier() knows that every callback that was registered before the call to rcu_barrier() has also been invoked. However, there is no point in registering a callback on a CPU that currently has no callbacks, most especially if that CPU is in a deep idle state. This commit therefore makes rcu_barrier() avoid interrupting CPUs that have no callbacks. Doing this requires reworking the handling of orphaned callbacks, otherwise callbacks could slip through rcu_barrier()'s net by being orphaned from a CPU that rcu_barrier() had not yet interrupted to a CPU that rcu_barrier() had already interrupted. This reworking was needed anyway to take a first step towards weaning RCU from the CPU_DYING notifier's use of stop_cpu(). Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-05-02rcu: Move PREEMPT_RCU preemption to switch_to() invocationPaul E. McKenney
Currently, PREEMPT_RCU readers are enqueued upon entry to the scheduler. This is inefficient because enqueuing is required only if there is a context switch, and entry to the scheduler does not guarantee a context switch. The commit therefore moves the enqueuing to immediately precede the call to switch_to() from the scheduler. Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Tested-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-04-24rcu: Make RCU_FAST_NO_HZ account for pauses out of idlePaul E. McKenney
Both Steven Rostedt's new idle-capable trace macros and the RCU_NONIDLE() macro can cause RCU to momentarily pause out of idle without the rest of the system being involved. This can cause rcu_prepare_for_idle() to run through its state machine too quickly, which can in turn result in needless scheduling-clock interrupts. This commit therefore adds code to enable rcu_prepare_for_idle() to distinguish between an initial entry to idle on the one hand (which needs to advance the rcu_prepare_for_idle() state machine) and an idle reentry due to idle-capable trace macros and RCU_NONIDLE() on the other hand (which should avoid advancing the rcu_prepare_for_idle() state machine). Additional state is maintained to allow the timer to be correctly reposted when returning after a momentary pause out of idle, and even more state is maintained to detect when new non-lazy callbacks have been enqueued (which may require re-evaluation of the approach to idleness). Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-04-24rcu: Reduce cache-miss initialization latencies for large systemsPaul E. McKenney
Commit #0209f649 (rcu: limit rcu_node leaf-level fanout) set an upper limit of 16 on the leaf-level fanout for the rcu_node tree. This was needed to reduce lock contention that was induced by the synchronization of scheduling-clock interrupts, which was in turn needed to improve energy efficiency for moderate-sized lightly loaded servers. However, reducing the leaf-level fanout means that there are more leaf-level rcu_node structures in the tree, which in turn means that RCU's grace-period initialization incurs more cache misses. This is not a problem on moderate-sized servers with only a few tens of CPUs, but becomes a major source of real-time latency spikes on systems with many hundreds of CPUs. In addition, the workloads running on these large systems tend to be CPU-bound, which eliminates the energy-efficiency advantages of synchronizing scheduling-clock interrupts. Therefore, these systems need maximal values for the rcu_node leaf-level fanout. This commit addresses this problem by introducing a new kernel parameter named RCU_FANOUT_LEAF that directly controls the leaf-level fanout. This parameter defaults to 16 to handle the common case of a moderate sized lightly loaded servers, but may be set higher on larger systems. Reported-by: Mike Galbraith <efault@gmx.de> Reported-by: Dimitri Sivanich <sivanich@sgi.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-02-21rcu: Rework detection of use of RCU by offline CPUsPaul E. McKenney
Because newly offlined CPUs continue executing after completing the CPU_DYING notifiers, they legitimately enter the scheduler and use RCU while appearing to be offline. This calls for a more sophisticated approach as follows: 1. RCU marks the CPU online during the CPU_UP_PREPARE phase. 2. RCU marks the CPU offline during the CPU_DEAD phase. 3. Diagnostics regarding use of read-side RCU by offline CPUs use RCU's accounting rather than the cpu_online_map. (Note that __call_rcu() still uses cpu_online_map to detect illegal invocations within CPU_DYING notifiers.) 4. Offline CPUs are prevented from hanging the system by force_quiescent_state(), which pays attention to cpu_online_map. Some additional work (in a later commit) will be needed to guarantee that force_quiescent_state() waits a full jiffy before assuming that a CPU is offline, for example, when called from idle entry. (This commit also makes the one-jiffy wait explicit, since the old-style implicit wait can now be defeated by RCU_FAST_NO_HZ and by rcutorture.) This approach avoids the false positives encountered when attempting to use more exact classification of CPU online/offline state. Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-02-21rcu: Print scheduling-clock information on RCU CPU stall-warning messagesPaul E. McKenney
There have been situations where RCU CPU stall warnings were caused by issues in scheduling-clock timer initialization. To make it easier to track these down, this commit causes the RCU CPU stall-warning messages to print out the number of scheduling-clock interrupts taken in the current grace period for each stalled CPU. Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-02-21rcu: Set RCU CPU stall times via sysfsPaul E. McKenney
The default CONFIG_RCU_CPU_STALL_TIMEOUT value of 60 seconds has served Linux users well for production use for quite some time. However, for debugging, there will be more than three minutes between subsequent stall-warning messages. This can be an annoyingly long wait if you are trying to work out where the offending infinite loop is hiding. Therefore, this commit provides a rcu_cpu_stall_timeout sysfs parameter that may be adjusted at boot time and at runtime to speed up debugging. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-02-21rcu: Clean up straggling rcu_preempt_needs_cpu() namePaul E. McKenney
The recent updates to RCU_CPU_FAST_NO_HZ have an rcu_needs_cpu() that does more than just check for callbacks, so get the name for rcu_preempt_needs_cpu() consistent with that change, now calling it rcu_preempt_cpu_has_callbacks(). Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-02-21rcu: Simplify offline processingPaul E. McKenney
Move ->qsmaskinit and blkd_tasks[] manipulation to the CPU_DYING notifier. This simplifies the code by eliminating a potential deadlock and by reducing the responsibilities of force_quiescent_state(). Also rename functions to make their connection to the CPU-hotplug stages explicit. Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-02-21rcu: Avoid waking up CPUs having only kfree_rcu() callbacksPaul E. McKenney
When CONFIG_RCU_FAST_NO_HZ is enabled, RCU will allow a given CPU to enter dyntick-idle mode even if it still has RCU callbacks queued. RCU avoids system hangs in this case by scheduling a timer for several jiffies in the future. However, if all of the callbacks on that CPU are from kfree_rcu(), there is no reason to wake the CPU up, as it is not a problem to defer freeing of memory. This commit therefore tracks the number of callbacks on a given CPU that are from kfree_rcu(), and avoids scheduling the timer if all of a given CPU's callbacks are from kfree_rcu(). Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2011-12-11rcu: Keep invoking callbacks if CPU otherwise idlePaul E. McKenney
The rcu_do_batch() function that invokes callbacks for TREE_RCU and TREE_PREEMPT_RCU normally throttles callback invocation to avoid degrading scheduling latency. However, as long as the CPU would otherwise be idle, there is no downside to continuing to invoke any callbacks that have passed through their grace periods. In fact, processing such callbacks in a timely manner has the benefit of increasing the probability that the CPU can enter the power-saving dyntick-idle mode. Therefore, this commit allows callback invocation to continue beyond the preset limit as long as the scheduler does not have some other task to run and as long as context is that of the idle task or the relevant RCU kthread. Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2011-12-11rcu: Permit dyntick-idle with callbacks pendingPaul E. McKenney
The current implementation of RCU_FAST_NO_HZ prevents CPUs from entering dyntick-idle state if they have RCU callbacks pending. Unfortunately, this has the side-effect of often preventing them from entering this state, especially if at least one other CPU is not in dyntick-idle state. However, the resulting per-tick wakeup is wasteful in many cases: if the CPU has already fully responded to the current RCU grace period, there will be nothing for it to do until this grace period ends, which will frequently take several jiffies. This commit therefore permits a CPU that has done everything that the current grace period has asked of it (rcu_pending() == 0) even if it still as RCU callbacks pending. However, such a CPU posts a timer to wake it up several jiffies later (6 jiffies, based on experience with grace-period lengths). This wakeup is required to handle situations that can result in all CPUs being in dyntick-idle mode, thus failing to ever complete the current grace period. If a CPU wakes up before the timer goes off, then it cancels that timer, thus avoiding spurious wakeups. Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2011-12-11rcu: Eliminate RCU_FAST_NO_HZ grace-period hangPaul E. McKenney
With the new implementation of RCU_FAST_NO_HZ, it was possible to hang RCU grace periods as follows: o CPU 0 attempts to go idle, cycles several times through the rcu_prepare_for_idle() loop, then goes dyntick-idle when RCU needs nothing more from it, while still having at least on RCU callback pending. o CPU 1 goes idle with no callbacks. Both CPUs can then stay in dyntick-idle mode indefinitely, preventing the RCU grace period from ever completing, possibly hanging the system. This commit therefore prevents CPUs that have RCU callbacks from entering dyntick-idle mode. This approach also eliminates the need for the end-of-grace-period IPIs used previously. Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2011-12-11rcu: Allow dyntick-idle mode for CPUs with callbacksPaul E. McKenney
Currently, RCU does not permit a CPU to enter dyntick-idle mode if that CPU has any RCU callbacks queued. This means that workloads for which each CPU wakes up and does some RCU updates every few ticks will never enter dyntick-idle mode. This can result in significant unnecessary power consumption, so this patch permits a given to enter dyntick-idle mode if it has callbacks, but only if that same CPU has completed all current work for the RCU core. We determine use rcu_pending() to determine whether a given CPU has completed all current work for the RCU core. Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2011-12-11rcu: Omit self-awaken when setting up expedited grace periodThomas Gleixner
When setting up an expedited grace period, if there were no readers, the task will awaken itself. This commit removes this useless self-awakening. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2011-12-11rcu: Track idleness independent of idle tasksPaul E. McKenney
Earlier versions of RCU used the scheduling-clock tick to detect idleness by checking for the idle task, but handled idleness differently for CONFIG_NO_HZ=y. But there are now a number of uses of RCU read-side critical sections in the idle task, for example, for tracing. A more fine-grained detection of idleness is therefore required. This commit presses the old dyntick-idle code into full-time service, so that rcu_idle_enter(), previously known as rcu_enter_nohz(), is always invoked at the beginning of an idle loop iteration. Similarly, rcu_idle_exit(), previously known as rcu_exit_nohz(), is always invoked at the end of an idle-loop iteration. This allows the idle task to use RCU everywhere except between consecutive rcu_idle_enter() and rcu_idle_exit() calls, in turn allowing architecture maintainers to specify exactly where in the idle loop that RCU may be used. Because some of the userspace upcall uses can result in what looks to RCU like half of an interrupt, it is not possible to expect that the irq_enter() and irq_exit() hooks will give exact counts. This patch therefore expands the ->dynticks_nesting counter to 64 bits and uses two separate bitfields to count process/idle transitions and interrupt entry/exit transitions. It is presumed that userspace upcalls do not happen in the idle loop or from usermode execution (though usermode might do a system call that results in an upcall). The counter is hard-reset on each process/idle transition, which avoids the interrupt entry/exit error from accumulating. Overflow is avoided by the 64-bitness of the ->dyntick_nesting counter. This commit also adds warnings if a non-idle task asks RCU to enter idle state (and these checks will need some adjustment before applying Frederic's OS-jitter patches (http://lkml.org/lkml/2011/10/7/246). In addition, validation of ->dynticks and ->dynticks_nesting is added. Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2011-12-11rcu: ->signaled better named ->fqs_statePaul E. McKenney
The ->signaled field was named before complications in the form of dyntick-idle mode and offlined CPUs. These complications have required that force_quiescent_state() be implemented as a state machine, instead of simply unconditionally sending reschedule IPIs. Therefore, this commit renames ->signaled to ->fqs_state to catch up with the new force_quiescent_state() reality. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2011-09-28rcu: Remove rcu_needs_cpu_flush() to avoid false quiescent statesPaul E. McKenney
The purpose of rcu_needs_cpu_flush() was to iterate on pushing the current grace period in order to help the current CPU enter dyntick-idle mode. However, this can result in failures if the CPU starts entering dyntick-idle mode, but then backs out. In this case, the call to rcu_pending() from rcu_needs_cpu_flush() might end up announcing a non-existing quiescent state. This commit therefore removes rcu_needs_cpu_flush() in favor of letting the dyntick-idle machinery at the end of the softirq handler push the loop along via its call to rcu_pending(). Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2011-09-28rcu: Suppress NMI backtraces when stall ends before dumpPaul E. McKenney
It is possible for an RCU CPU stall to end just as it is detected, in which case the current code will uselessly dump all CPU's stacks. This commit therefore checks for this condition and refrains from sending needless NMIs. And yes, the stall might also end just after we checked all CPUs and tasks, but in that case we would at least have given some clue as to which CPU/task was at fault. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2011-09-28rcu: Simplify quiescent-state accountingPaul E. McKenney
There is often a delay between the time that a CPU passes through a quiescent state and the time that this quiescent state is reported to the RCU core. It is quite possible that the grace period ended before the quiescent state could be reported, for example, some other CPU might have deduced that this CPU passed through dyntick-idle mode. It is critically important that quiescent state be counted only against the grace period that was in effect at the time that the quiescent state was detected. Previously, this was handled by recording the number of the last grace period to complete when passing through a quiescent state. The RCU core then checks this number against the current value, and rejects the quiescent state if there is a mismatch. However, one additional possibility must be accounted for, namely that the quiescent state was recorded after the prior grace period completed but before the current grace period started. In this case, the RCU core must reject the quiescent state, but the recorded number will match. This is handled when the CPU becomes aware of a new grace period -- at that point, it invalidates any prior quiescent state. This works, but is a bit indirect. The new approach records the current grace period, and the RCU core checks to see (1) that this is still the current grace period and (2) that this grace period has not yet ended. This approach simplifies reasoning about correctness, and this commit changes over to this new approach. Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2011-09-28rcu: Add grace-period, quiescent-state, and call_rcu trace eventsPaul E. McKenney
Add trace events to record grace-period start and end, quiescent states, CPUs noticing grace-period start and end, grace-period initialization, call_rcu() invocation, tasks blocking in RCU read-side critical sections, tasks exiting those same critical sections, force_quiescent_state() detection of dyntick-idle and offline CPUs, CPUs entering and leaving dyntick-idle mode (except from NMIs), CPUs coming online and going offline, and CPUs being kicked for staying in dyntick-idle mode for too long (as in many weeks, even on 32-bit systems). Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> rcu: Add the rcu flavor to callback trace events The earlier trace events for registering RCU callbacks and for invoking them did not include the RCU flavor (rcu_bh, rcu_preempt, or rcu_sched). This commit adds the RCU flavor to those trace events. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2011-09-28rcu: Move RCU_BOOST declarations to allow compiler checkingPaul E. McKenney
Andi Kleen noticed that one of the RCU_BOOST data declarations was out of sync with the definition. Move the declarations so that the compiler can do the checking in the future. Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>