From 295fde89be1b7dce874c0f38d8bb78975a25d46e Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Mon, 29 Apr 2013 10:09:41 -0700 Subject: nohz_full: Update based on Sedat Dilek review Make it more clear that there are three options, and give hints as to which of the three is most likely to be useful in different situations. Reported-by: Sedat Dilek Signed-off-by: Paul E. McKenney Reviewed-by: Josh Triplett --- Documentation/timers/NO_HZ.txt | 58 +++++++++++++++++++++++++++++++++++------- 1 file changed, 49 insertions(+), 9 deletions(-) (limited to 'Documentation') diff --git a/Documentation/timers/NO_HZ.txt b/Documentation/timers/NO_HZ.txt index 5b5322024067..d5323e075550 100644 --- a/Documentation/timers/NO_HZ.txt +++ b/Documentation/timers/NO_HZ.txt @@ -7,21 +7,59 @@ efficiency and reducing OS jitter. Reducing OS jitter is important for some types of computationally intensive high-performance computing (HPC) applications and for real-time applications. -There are two main contexts in which the number of scheduling-clock -interrupts can be reduced compared to the old-school approach of sending -a scheduling-clock interrupt to all CPUs every jiffy whether they need -it or not (CONFIG_HZ_PERIODIC=y or CONFIG_NO_HZ=n for older kernels): +There are three main ways of managing scheduling-clock interrupts +(also known as "scheduling-clock ticks" or simply "ticks"): -1. Idle CPUs (CONFIG_NO_HZ_IDLE=y or CONFIG_NO_HZ=y for older kernels). +1. Never omit scheduling-clock ticks (CONFIG_HZ_PERIODIC=y or + CONFIG_NO_HZ=n for older kernels). You normally will -not- + want to choose this option. -2. CPUs having only one runnable task (CONFIG_NO_HZ_FULL=y). +2. Omit scheduling-clock ticks on idle CPUs (CONFIG_NO_HZ_IDLE=y or + CONFIG_NO_HZ=y for older kernels). This is the most common + approach, and should be the default. -These two cases are described in the following two sections, followed +3. Omit scheduling-clock ticks on CPUs that are either idle or that + have only one runnable task (CONFIG_NO_HZ_FULL=y). Unless you + are running realtime applications or certain types of HPC + workloads, you will normally -not- want this option. + +These three cases are described in the following three sections, followed by a third section on RCU-specific considerations and a fourth and final section listing known issues. -IDLE CPUs +NEVER OMIT SCHEDULING-CLOCK TICKS + +Very old versions of Linux from the 1990s and the very early 2000s +are incapable of omitting scheduling-clock ticks. It turns out that +there are some situations where this old-school approach is still the +right approach, for example, in heavy workloads with lots of tasks +that use short bursts of CPU, where there are very frequent idle +periods, but where these idle periods are also quite short (tens or +hundreds of microseconds). For these types of workloads, scheduling +clock interrupts will normally be delivered any way because there +will frequently be multiple runnable tasks per CPU. In these cases, +attempting to turn off the scheduling clock interrupt will have no effect +other than increasing the overhead of switching to and from idle and +transitioning between user and kernel execution. + +This mode of operation can be selected using CONFIG_HZ_PERIODIC=y (or +CONFIG_NO_HZ=n for older kernels). + +However, if you are instead running a light workload with long idle +periods, failing to omit scheduling-clock interrupts will result in +excessive power consumption. This is especially bad on battery-powered +devices, where it results in extremely short battery lifetimes. If you +are running light workloads, you should therefore read the following +section. + +In addition, if you are running either a real-time workload or an HPC +workload with short iterations, the scheduling-clock interrupts can +degrade your applications performance. If this describes your workload, +you should read the following two sections. + + +OMIT SCHEDULING-CLOCK TICKS FOR IDLE CPUs If a CPU is idle, there is little point in sending it a scheduling-clock interrupt. After all, the primary purpose of a scheduling-clock interrupt @@ -59,10 +97,12 @@ By default, CONFIG_NO_HZ_IDLE=y kernels boot with "nohz=on", enabling dyntick-idle mode. -CPUs WITH ONLY ONE RUNNABLE TASK +OMIT SCHEDULING-CLOCK TICKS FOR CPUs WITH ONLY ONE RUNNABLE TASK If a CPU has only one runnable task, there is little point in sending it a scheduling-clock interrupt because there is no other task to switch to. +Note that omitting scheduling-clock ticks for CPUs with only one runnable +task implies also omitting them for idle CPUs. The CONFIG_NO_HZ_FULL=y Kconfig option causes the kernel to avoid sending scheduling-clock interrupts to CPUs with a single runnable task, -- cgit v1.2.3 From ce5f4fc861e84739289c187f147040e6af6599b2 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Mon, 13 May 2013 10:32:10 -0700 Subject: nohz_full: Document additional restrictions This commit calls out the potential for slowing the tick even when there are multiple runnable processes per CPU, It also points out that current mainlined version keeps the tick going on at least one CPU even when all CPUs are otherwise idle. Finally, it notes the need for a 1-HZ tick in order to calculate CPU load, maintain sched average, compute CFS entity vruntime, compute avenrun, and carry out load balancing. Signed-off-by: Paul E. McKenney Reviewed-by: Josh Triplett --- Documentation/timers/NO_HZ.txt | 21 ++++++++++++++++++--- 1 file changed, 18 insertions(+), 3 deletions(-) (limited to 'Documentation') diff --git a/Documentation/timers/NO_HZ.txt b/Documentation/timers/NO_HZ.txt index d5323e075550..88697584242b 100644 --- a/Documentation/timers/NO_HZ.txt +++ b/Documentation/timers/NO_HZ.txt @@ -278,6 +278,11 @@ o Adaptive-ticks does not do anything unless there is only one single runnable SCHED_FIFO task and multiple runnable SCHED_OTHER tasks, even though these interrupts are unnecessary. + And even when there are multiple runnable tasks on a given CPU, + there is little point in interrupting that CPU until the current + running task's timeslice expires, which is almost always way + longer than the time of the next scheduling-clock interrupt. + Better handling of these sorts of situations is future work. o A reboot is required to reconfigure both adaptive idle and RCU @@ -308,6 +313,16 @@ o Unless all CPUs are idle, at least one CPU must keep the scheduling-clock interrupt going in order to support accurate timekeeping. -o If there are adaptive-ticks CPUs, there will be at least one - CPU keeping the scheduling-clock interrupt going, even if all - CPUs are otherwise idle. +o If there might potentially be some adaptive-ticks CPUs, there + will be at least one CPU keeping the scheduling-clock interrupt + going, even if all CPUs are otherwise idle. + + Better handling of this situation is ongoing work. + +o Some process-handling operations still require the occasional + scheduling-clock tick. These operations include calculating CPU + load, maintaining sched average, computing CFS entity vruntime, + computing avenrun, and carrying out load balancing. They are + currently accommodated by scheduling-clock tick every second + or so. On-going work will eliminate the need even for these + infrequent scheduling-clock ticks. -- cgit v1.2.3 From f7bac9b85a333ebdfa5f500cf6acf753edffc05f Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Tue, 30 Apr 2013 10:48:35 -0700 Subject: kthread: Add kworker kthreads to OS-jitter documentation The kworker workqueue kthreads can also contribute to OS jitter. The amount of jitter depends on their use, so this commit adds documentation on avoiding OS jitter due to workqueue use. Reported-by: Jonathan Clairembault Signed-off-by: Paul E. McKenney Reviewed-by: Josh Triplett --- Documentation/kernel-per-CPU-kthreads.txt | 47 +++++++++++++++++++++++++++++++ 1 file changed, 47 insertions(+) (limited to 'Documentation') diff --git a/Documentation/kernel-per-CPU-kthreads.txt b/Documentation/kernel-per-CPU-kthreads.txt index cbf7ae412da4..5f39ef55c6f6 100644 --- a/Documentation/kernel-per-CPU-kthreads.txt +++ b/Documentation/kernel-per-CPU-kthreads.txt @@ -157,6 +157,53 @@ RCU_SOFTIRQ: Do at least one of the following: calls and by forcing both kernel threads and interrupts to execute elsewhere. +Name: kworker/%u:%d%s (cpu, id, priority) +Purpose: Execute workqueue requests +To reduce its OS jitter, do any of the following: +1. Run your workload at a real-time priority, which will allow + preempting the kworker daemons. +2. Do any of the following needed to avoid jitter that your + application cannot tolerate: + a. Build your kernel with CONFIG_SLUB=y rather than + CONFIG_SLAB=y, thus avoiding the slab allocator's periodic + use of each CPU's workqueues to run its cache_reap() + function. + b. Avoid using oprofile, thus avoiding OS jitter from + wq_sync_buffer(). + c. Limit your CPU frequency so that a CPU-frequency + governor is not required, possibly enlisting the aid of + special heatsinks or other cooling technologies. If done + correctly, and if you CPU architecture permits, you should + be able to build your kernel with CONFIG_CPU_FREQ=n to + avoid the CPU-frequency governor periodically running + on each CPU, including cs_dbs_timer() and od_dbs_timer(). + WARNING: Please check your CPU specifications to + make sure that this is safe on your particular system. + d. It is not possible to entirely get rid of OS jitter + from vmstat_update() on CONFIG_SMP=y systems, but you + can decrease its frequency by writing a large value to + /proc/sys/vm/stat_interval. The default value is HZ, + for an interval of one second. Of course, larger values + will make your virtual-memory statistics update more + slowly. Of course, you can also run your workload at + a real-time priority, thus preempting vmstat_update(). + e. If running on high-end powerpc servers, build with + CONFIG_PPC_RTAS_DAEMON=n. This prevents the RTAS + daemon from running on each CPU every second or so. + (This will require editing Kconfig files and will defeat + this platform's RAS functionality.) This avoids jitter + due to the rtas_event_scan() function. + WARNING: Please check your CPU specifications to + make sure that this is safe on your particular system. + f. If running on Cell Processor, build your kernel with + CBE_CPUFREQ_SPU_GOVERNOR=n to avoid OS jitter from + spu_gov_work(). + WARNING: Please check your CPU specifications to + make sure that this is safe on your particular system. + g. If running on PowerMAC, build your kernel with + CONFIG_PMAC_RACKMETER=n to disable the CPU-meter, + avoiding OS jitter from rackmeter_do_timer(). + Name: rcuc/%u Purpose: Execute RCU callbacks in CONFIG_RCU_BOOST=y kernels. To reduce its OS jitter, do at least one of the following: -- cgit v1.2.3 From 99f88919f8fa8a8b01b5306c59c9977b94604df8 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Tue, 12 Mar 2013 16:54:14 -0700 Subject: rcu: Remove srcu_read_lock_raw() and srcu_read_unlock_raw(). These interfaces never did get used, so this commit removes them, their rcutorture tests, and documentation referencing them. Signed-off-by: Paul E. McKenney Reviewed-by: Lai Jiangshan Reviewed-by: Josh Triplett --- Documentation/RCU/checklist.txt | 6 ------ Documentation/RCU/torture.txt | 6 ------ Documentation/RCU/whatisRCU.txt | 22 +++++++--------------- 3 files changed, 7 insertions(+), 27 deletions(-) (limited to 'Documentation') diff --git a/Documentation/RCU/checklist.txt b/Documentation/RCU/checklist.txt index 79e789b8b8ea..7703ec73a9bb 100644 --- a/Documentation/RCU/checklist.txt +++ b/Documentation/RCU/checklist.txt @@ -354,12 +354,6 @@ over a rather long period of time, but improvements are always welcome! using RCU rather than SRCU, because RCU is almost always faster and easier to use than is SRCU. - If you need to enter your read-side critical section in a - hardirq or exception handler, and then exit that same read-side - critical section in the task that was interrupted, then you need - to srcu_read_lock_raw() and srcu_read_unlock_raw(), which avoid - the lockdep checking that would otherwise this practice illegal. - Also unlike other forms of RCU, explicit initialization and cleanup is required via init_srcu_struct() and cleanup_srcu_struct(). These are passed a "struct srcu_struct" diff --git a/Documentation/RCU/torture.txt b/Documentation/RCU/torture.txt index 7dce8a17eac2..d8a502387397 100644 --- a/Documentation/RCU/torture.txt +++ b/Documentation/RCU/torture.txt @@ -182,12 +182,6 @@ torture_type The type of RCU to test, with string values as follows: "srcu_expedited": srcu_read_lock(), srcu_read_unlock() and synchronize_srcu_expedited(). - "srcu_raw": srcu_read_lock_raw(), srcu_read_unlock_raw(), - and call_srcu(). - - "srcu_raw_sync": srcu_read_lock_raw(), srcu_read_unlock_raw(), - and synchronize_srcu(). - "sched": preempt_disable(), preempt_enable(), and call_rcu_sched(). diff --git a/Documentation/RCU/whatisRCU.txt b/Documentation/RCU/whatisRCU.txt index 10df0b82f459..0f0fb7c432c2 100644 --- a/Documentation/RCU/whatisRCU.txt +++ b/Documentation/RCU/whatisRCU.txt @@ -842,9 +842,7 @@ SRCU: Critical sections Grace period Barrier srcu_read_lock synchronize_srcu srcu_barrier srcu_read_unlock call_srcu - srcu_read_lock_raw synchronize_srcu_expedited - srcu_read_unlock_raw - srcu_dereference + srcu_dereference synchronize_srcu_expedited SRCU: Initialization/cleanup init_srcu_struct @@ -865,38 +863,32 @@ list can be helpful: a. Will readers need to block? If so, you need SRCU. -b. Is it necessary to start a read-side critical section in a - hardirq handler or exception handler, and then to complete - this read-side critical section in the task that was - interrupted? If so, you need SRCU's srcu_read_lock_raw() and - srcu_read_unlock_raw() primitives. - -c. What about the -rt patchset? If readers would need to block +b. What about the -rt patchset? If readers would need to block in an non-rt kernel, you need SRCU. If readers would block in a -rt kernel, but not in a non-rt kernel, SRCU is not necessary. -d. Do you need to treat NMI handlers, hardirq handlers, +c. Do you need to treat NMI handlers, hardirq handlers, and code segments with preemption disabled (whether via preempt_disable(), local_irq_save(), local_bh_disable(), or some other mechanism) as if they were explicit RCU readers? If so, RCU-sched is the only choice that will work for you. -e. Do you need RCU grace periods to complete even in the face +d. Do you need RCU grace periods to complete even in the face of softirq monopolization of one or more of the CPUs? For example, is your code subject to network-based denial-of-service attacks? If so, you need RCU-bh. -f. Is your workload too update-intensive for normal use of +e. Is your workload too update-intensive for normal use of RCU, but inappropriate for other synchronization mechanisms? If so, consider SLAB_DESTROY_BY_RCU. But please be careful! -g. Do you need read-side critical sections that are respected +f. Do you need read-side critical sections that are respected even though they are in the middle of the idle loop, during user-mode execution, or on an offlined CPU? If so, SRCU is the only choice that will work for you. -h. Otherwise, use RCU. +g. Otherwise, use RCU. Of course, this all assumes that you have determined that RCU is in fact the right tool for your job. -- cgit v1.2.3 From 7807acdb6b794b3af42dfa1912edd4fba8d0b622 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Wed, 27 Mar 2013 11:32:12 -0700 Subject: rcu: Remove TINY_PREEMPT_RCU tracing documentation Because TINY_PREEMPT_RCU is no more, this commit removes its tracing formats from the documentation. Signed-off-by: Paul E. McKenney Reviewed-by: Josh Triplett --- Documentation/RCU/trace.txt | 100 ++------------------------------------------ 1 file changed, 4 insertions(+), 96 deletions(-) (limited to 'Documentation') diff --git a/Documentation/RCU/trace.txt b/Documentation/RCU/trace.txt index c776968f4463..f3778f8952da 100644 --- a/Documentation/RCU/trace.txt +++ b/Documentation/RCU/trace.txt @@ -530,113 +530,21 @@ o "nos" counts the number of times we balked for other reasons, e.g., the grace period ended first. -CONFIG_TINY_RCU and CONFIG_TINY_PREEMPT_RCU debugfs Files and Formats +CONFIG_TINY_RCU debugfs Files and Formats These implementations of RCU provides a single debugfs file under the top-level directory RCU, namely rcu/rcudata, which displays fields in -rcu_bh_ctrlblk, rcu_sched_ctrlblk and, for CONFIG_TINY_PREEMPT_RCU, -rcu_preempt_ctrlblk. +rcu_bh_ctrlblk and rcu_sched_ctrlblk. The output of "cat rcu/rcudata" is as follows: -rcu_preempt: qlen=24 gp=1097669 g197/p197/c197 tasks=... - ttb=. btg=no ntb=184 neb=0 nnb=183 j=01f7 bt=0274 - normal balk: nt=1097669 gt=0 bt=371 b=0 ny=25073378 nos=0 - exp balk: bt=0 nos=0 rcu_sched: qlen: 0 rcu_bh: qlen: 0 -This is split into rcu_preempt, rcu_sched, and rcu_bh sections, with the -rcu_preempt section appearing only in CONFIG_TINY_PREEMPT_RCU builds. -The last three lines of the rcu_preempt section appear only in -CONFIG_RCU_BOOST kernel builds. The fields are as follows: +This is split into rcu_sched and rcu_bh sections. The field is as +follows: o "qlen" is the number of RCU callbacks currently waiting either for an RCU grace period or waiting to be invoked. This is the only field present for rcu_sched and rcu_bh, due to the short-circuiting of grace period in those two cases. - -o "gp" is the number of grace periods that have completed. - -o "g197/p197/c197" displays the grace-period state, with the - "g" number being the number of grace periods that have started - (mod 256), the "p" number being the number of grace periods - that the CPU has responded to (also mod 256), and the "c" - number being the number of grace periods that have completed - (once again mode 256). - - Why have both "gp" and "g"? Because the data flowing into - "gp" is only present in a CONFIG_RCU_TRACE kernel. - -o "tasks" is a set of bits. The first bit is "T" if there are - currently tasks that have recently blocked within an RCU - read-side critical section, the second bit is "N" if any of the - aforementioned tasks are blocking the current RCU grace period, - and the third bit is "E" if any of the aforementioned tasks are - blocking the current expedited grace period. Each bit is "." - if the corresponding condition does not hold. - -o "ttb" is a single bit. It is "B" if any of the blocked tasks - need to be priority boosted and "." otherwise. - -o "btg" indicates whether boosting has been carried out during - the current grace period, with "exp" indicating that boosting - is in progress for an expedited grace period, "no" indicating - that boosting has not yet started for a normal grace period, - "begun" indicating that boosting has bebug for a normal grace - period, and "done" indicating that boosting has completed for - a normal grace period. - -o "ntb" is the total number of tasks subjected to RCU priority boosting - periods since boot. - -o "neb" is the number of expedited grace periods that have had - to resort to RCU priority boosting since boot. - -o "nnb" is the number of normal grace periods that have had - to resort to RCU priority boosting since boot. - -o "j" is the low-order 16 bits of the jiffies counter in hexadecimal. - -o "bt" is the low-order 16 bits of the value that the jiffies counter - will have at the next time that boosting is scheduled to begin. - -o In the line beginning with "normal balk", the fields are as follows: - - o "nt" is the number of times that the system balked from - boosting because there were no blocked tasks to boost. - Note that the system will balk from boosting even if the - grace period is overdue when the currently running task - is looping within an RCU read-side critical section. - There is no point in boosting in this case, because - boosting a running task won't make it run any faster. - - o "gt" is the number of times that the system balked - from boosting because, although there were blocked tasks, - none of them were preventing the current grace period - from completing. - - o "bt" is the number of times that the system balked - from boosting because boosting was already in progress. - - o "b" is the number of times that the system balked from - boosting because boosting had already completed for - the grace period in question. - - o "ny" is the number of times that the system balked from - boosting because it was not yet time to start boosting - the grace period in question. - - o "nos" is the number of times that the system balked from - boosting for inexplicable ("not otherwise specified") - reasons. This can actually happen due to races involving - increments of the jiffies counter. - -o In the line beginning with "exp balk", the fields are as follows: - - o "bt" is the number of times that the system balked from - boosting because there were no blocked tasks to boost. - - o "nos" is the number of times that the system balked from - boosting for inexplicable ("not otherwise specified") - reasons. -- cgit v1.2.3