From c3123552aad3ffd7a35e16d4402231225165e343 Mon Sep 17 00:00:00 2001
From: Mauro Carvalho Chehab
Date: Wed, 17 Apr 2019 05:46:08 -0300
Subject: docs: accounting: convert to ReST

Rename the accounting documentation files to ReST, add an
index for them and adjust in order to produce a nice html
output via the Sphinx build system.

At its new index.rst, let's add a :orphan: while this is not linked to
the main index.rst file, in order to avoid build warnings.

Signed-off-by: Mauro Carvalho Chehab
---
 Documentation/accounting/cgroupstats.rst      |  31 ++++
 Documentation/accounting/cgroupstats.txt      |  27 ----
 Documentation/accounting/delay-accounting.rst | 126 ++++++++++++++++
 Documentation/accounting/delay-accounting.txt | 117 ---------------
 Documentation/accounting/index.rst            |  14 ++
 Documentation/accounting/psi.rst              | 182 +++++++++++++++++++++++
 Documentation/accounting/psi.txt              | 180 -----------------------
 Documentation/accounting/taskstats-struct.rst | 199 ++++++++++++++++++++++++++
 Documentation/accounting/taskstats-struct.txt | 180 -----------------------
 Documentation/accounting/taskstats.rst        | 180 +++++++++++++++++++++++
 Documentation/accounting/taskstats.txt        | 181 -----------------------
 11 files changed, 732 insertions(+), 685 deletions(-)
 create mode 100644 Documentation/accounting/cgroupstats.rst
 delete mode 100644 Documentation/accounting/cgroupstats.txt
 create mode 100644 Documentation/accounting/delay-accounting.rst
 delete mode 100644 Documentation/accounting/delay-accounting.txt
 create mode 100644 Documentation/accounting/index.rst
 create mode 100644 Documentation/accounting/psi.rst
 delete mode 100644 Documentation/accounting/psi.txt
 create mode 100644 Documentation/accounting/taskstats-struct.rst
 delete mode 100644 Documentation/accounting/taskstats-struct.txt
 create mode 100644 Documentation/accounting/taskstats.rst
 delete mode 100644 Documentation/accounting/taskstats.txt

diff --git
a/Documentation/accounting/cgroupstats.rst b/Documentation/accounting/cgroupstats.rst
new file mode 100644
index 000000000000..b9afc48f4ea2
--- /dev/null
+++ b/Documentation/accounting/cgroupstats.rst
@@ -0,0 +1,31 @@
+==================
+Control Groupstats
+==================
+
+Control Groupstats is inspired by the discussion at
+http://lkml.org/lkml/2007/4/11/187 and implements per cgroup statistics as
+suggested by Andrew Morton in http://lkml.org/lkml/2007/4/11/263.
+
+Per cgroup statistics infrastructure re-uses code from the taskstats
+interface. A new set of cgroup operations are registered with commands
+and attributes specific to cgroups. It should be very easy to
+extend per cgroup statistics, by adding members to the cgroupstats
+structure.
+
+The current model for cgroupstats is a pull, a push model (to post
+statistics on interesting events), should be very easy to add. Currently
+user space requests for statistics by passing the cgroup path.
+Statistics about the state of all the tasks in the cgroup is returned to
+user space.
+
+NOTE: We currently rely on delay accounting for extracting information
+about tasks blocked on I/O. If CONFIG_TASK_DELAY_ACCT is disabled, this
+information will not be available.
+
+To extract cgroup statistics a utility very similar to getdelays.c
+has been developed, the sample output of the utility is shown below::
+
+  ~/balbir/cgroupstats # ./getdelays -C "/sys/fs/cgroup/a"
+  sleeping 1, blocked 0, running 1, stopped 0, uninterruptible 0
+  ~/balbir/cgroupstats # ./getdelays -C "/sys/fs/cgroup"
+  sleeping 155, blocked 0, running 1, stopped 0, uninterruptible 2
diff --git a/Documentation/accounting/cgroupstats.txt b/Documentation/accounting/cgroupstats.txt
deleted file mode 100644
index d16a9849e60e..000000000000
--- a/Documentation/accounting/cgroupstats.txt
+++ /dev/null
@@ -1,27 +0,0 @@
-Control Groupstats is inspired by the discussion at
-http://lkml.org/lkml/2007/4/11/187 and implements per cgroup statistics as
-suggested by Andrew Morton in http://lkml.org/lkml/2007/4/11/263.
-
-Per cgroup statistics infrastructure re-uses code from the taskstats
-interface. A new set of cgroup operations are registered with commands
-and attributes specific to cgroups. It should be very easy to
-extend per cgroup statistics, by adding members to the cgroupstats
-structure.
-
-The current model for cgroupstats is a pull, a push model (to post
-statistics on interesting events), should be very easy to add. Currently
-user space requests for statistics by passing the cgroup path.
-Statistics about the state of all the tasks in the cgroup is returned to
-user space.
-
-NOTE: We currently rely on delay accounting for extracting information
-about tasks blocked on I/O. If CONFIG_TASK_DELAY_ACCT is disabled, this
-information will not be available.
-
-To extract cgroup statistics a utility very similar to getdelays.c
-has been developed, the sample output of the utility is shown below
-
-~/balbir/cgroupstats # ./getdelays -C "/sys/fs/cgroup/a"
-sleeping 1, blocked 0, running 1, stopped 0, uninterruptible 0
-~/balbir/cgroupstats # ./getdelays -C "/sys/fs/cgroup"
-sleeping 155, blocked 0, running 1, stopped 0, uninterruptible 2
diff --git a/Documentation/accounting/delay-accounting.rst b/Documentation/accounting/delay-accounting.rst
new file mode 100644
index 000000000000..7cc7f5852da0
--- /dev/null
+++ b/Documentation/accounting/delay-accounting.rst
@@ -0,0 +1,126 @@
+================
+Delay accounting
+================
+
+Tasks encounter delays in execution when they wait
+for some kernel resource to become available e.g. a
+runnable task may wait for a free CPU to run on.
+
+The per-task delay accounting functionality measures
+the delays experienced by a task while
+
+a) waiting for a CPU (while being runnable)
+b) completion of synchronous block I/O initiated by the task
+c) swapping in pages
+d) memory reclaim
+
+and makes these statistics available to userspace through
+the taskstats interface.
+
+Such delays provide feedback for setting a task's cpu priority,
+io priority and rss limit values appropriately. Long delays for
+important tasks could be a trigger for raising its corresponding priority.
+
+The functionality, through its use of the taskstats interface, also provides
+delay statistics aggregated for all tasks (or threads) belonging to a
+thread group (corresponding to a traditional Unix process). This is a commonly
+needed aggregation that is more efficiently done by the kernel.
+
+Userspace utilities, particularly resource management applications, can also
+aggregate delay statistics into arbitrary groups. To enable this, delay
+statistics of a task are available both during its lifetime as well as on its
+exit, ensuring continuous and complete monitoring can be done.
+
+
+Interface
+---------
+
+Delay accounting uses the taskstats interface which is described
+in detail in a separate document in this directory. Taskstats returns a
+generic data structure to userspace corresponding to per-pid and per-tgid
+statistics. The delay accounting functionality populates specific fields of
+this structure. See
+
+    include/linux/taskstats.h
+
+for a description of the fields pertaining to delay accounting.
+It will generally be in the form of counters returning the cumulative
+delay seen for cpu, sync block I/O, swapin, memory reclaim etc.
+
+Taking the difference of two successive readings of a given
+counter (say cpu_delay_total) for a task will give the delay
+experienced by the task waiting for the corresponding resource
+in that interval.
+
+When a task exits, records containing the per-task statistics
+are sent to userspace without requiring a command. If it is the last exiting
+task of a thread group, the per-tgid statistics are also sent. More details
+are given in the taskstats interface description.
+
+The getdelays.c userspace utility in tools/accounting directory allows simple
+commands to be run and the corresponding delay statistics to be displayed. It
+also serves as an example of using the taskstats interface.
+
+Usage
+-----
+
+Compile the kernel with::
+
+    CONFIG_TASK_DELAY_ACCT=y
+    CONFIG_TASKSTATS=y
+
+Delay accounting is enabled by default at boot up.
+To disable, add::
+
+    nodelayacct
+
+to the kernel boot options. The rest of the instructions
+below assume this has not been done.
+
+After the system has booted up, use a utility
+similar to getdelays.c to access the delays
+seen by a given task or a task group (tgid).
+The utility also allows a given command to be
+executed and the corresponding delays to be
+seen.
+
+General format of the getdelays command::
+
+    getdelays [-t tgid] [-p pid] [-c cmd...]
+
+
+Get delays, since system boot, for pid 10::
+
+    # ./getdelays -p 10
+    (output similar to next case)
+
+Get sum of delays, since system boot, for all pids with tgid 5::
+
+    # ./getdelays -t 5
+
+
+    CPU     count   real total      virtual total   delay total
+            7876    92005750        100000000       24001500
+    IO      count   delay total
+            0       0
+    SWAP    count   delay total
+            0       0
+    RECLAIM count   delay total
+            0       0
+
+Get delays seen in executing a given simple command::
+
+    # ./getdelays -c ls /
+
+    bin   data1  data3  data5  dev  home  media  opt   root  srv        sys  usr
+    boot  data2  data4  data6  etc  lib   mnt    proc  sbin  subdomain  tmp  var
+
+
+    CPU     count   real total      virtual total   delay total
+            6       4000250         4000000         0
+    IO      count   delay total
+            0       0
+    SWAP    count   delay total
+            0       0
+    RECLAIM count   delay total
+            0       0
diff --git a/Documentation/accounting/delay-accounting.txt b/Documentation/accounting/delay-accounting.txt
deleted file mode 100644
index 042ea59b5853..000000000000
--- a/Documentation/accounting/delay-accounting.txt
+++ /dev/null
@@ -1,117 +0,0 @@
-Delay accounting
-----------------
-
-Tasks encounter delays in execution when they wait
-for some kernel resource to become available e.g. a
-runnable task may wait for a free CPU to run on.
-
-The per-task delay accounting functionality measures
-the delays experienced by a task while
-
-a) waiting for a CPU (while being runnable)
-b) completion of synchronous block I/O initiated by the task
-c) swapping in pages
-d) memory reclaim
-
-and makes these statistics available to userspace through
-the taskstats interface.
-
-Such delays provide feedback for setting a task's cpu priority,
-io priority and rss limit values appropriately. Long delays for
-important tasks could be a trigger for raising its corresponding priority.
-
-The functionality, through its use of the taskstats interface, also provides
-delay statistics aggregated for all tasks (or threads) belonging to a
-thread group (corresponding to a traditional Unix process).
This is a commonly
-needed aggregation that is more efficiently done by the kernel.
-
-Userspace utilities, particularly resource management applications, can also
-aggregate delay statistics into arbitrary groups. To enable this, delay
-statistics of a task are available both during its lifetime as well as on its
-exit, ensuring continuous and complete monitoring can be done.
-
-
-Interface
----------
-
-Delay accounting uses the taskstats interface which is described
-in detail in a separate document in this directory. Taskstats returns a
-generic data structure to userspace corresponding to per-pid and per-tgid
-statistics. The delay accounting functionality populates specific fields of
-this structure. See
-    include/linux/taskstats.h
-for a description of the fields pertaining to delay accounting.
-It will generally be in the form of counters returning the cumulative
-delay seen for cpu, sync block I/O, swapin, memory reclaim etc.
-
-Taking the difference of two successive readings of a given
-counter (say cpu_delay_total) for a task will give the delay
-experienced by the task waiting for the corresponding resource
-in that interval.
-
-When a task exits, records containing the per-task statistics
-are sent to userspace without requiring a command. If it is the last exiting
-task of a thread group, the per-tgid statistics are also sent. More details
-are given in the taskstats interface description.
-
-The getdelays.c userspace utility in tools/accounting directory allows simple
-commands to be run and the corresponding delay statistics to be displayed. It
-also serves as an example of using the taskstats interface.
-
-Usage
------
-
-Compile the kernel with
-    CONFIG_TASK_DELAY_ACCT=y
-    CONFIG_TASKSTATS=y
-
-Delay accounting is enabled by default at boot up.
-To disable, add
-    nodelayacct
-to the kernel boot options. The rest of the instructions
-below assume this has not been done.
-
-After the system has booted up, use a utility
-similar to getdelays.c to access the delays
-seen by a given task or a task group (tgid).
-The utility also allows a given command to be
-executed and the corresponding delays to be
-seen.
-
-General format of the getdelays command
-
-getdelays [-t tgid] [-p pid] [-c cmd...]
-
-
-Get delays, since system boot, for pid 10
-# ./getdelays -p 10
-(output similar to next case)
-
-Get sum of delays, since system boot, for all pids with tgid 5
-# ./getdelays -t 5
-
-
-CPU     count   real total      virtual total   delay total
-        7876    92005750        100000000       24001500
-IO      count   delay total
-        0       0
-SWAP    count   delay total
-        0       0
-RECLAIM count   delay total
-        0       0
-
-Get delays seen in executing a given simple command
-# ./getdelays -c ls /
-
-bin   data1  data3  data5  dev  home  media  opt   root  srv        sys  usr
-boot  data2  data4  data6  etc  lib   mnt    proc  sbin  subdomain  tmp  var
-
-
-CPU     count   real total      virtual total   delay total
-        6       4000250         4000000         0
-IO      count   delay total
-        0       0
-SWAP    count   delay total
-        0       0
-RECLAIM count   delay total
-        0       0
diff --git a/Documentation/accounting/index.rst b/Documentation/accounting/index.rst
new file mode 100644
index 000000000000..e1f6284b5ff3
--- /dev/null
+++ b/Documentation/accounting/index.rst
@@ -0,0 +1,14 @@
+:orphan:
+
+==========
+Accounting
+==========
+
+.. toctree::
+   :maxdepth: 1
+
+   cgroupstats
+   delay-accounting
+   psi
+   taskstats
+   taskstats-struct
diff --git a/Documentation/accounting/psi.rst b/Documentation/accounting/psi.rst
new file mode 100644
index 000000000000..621111ce5740
--- /dev/null
+++ b/Documentation/accounting/psi.rst
@@ -0,0 +1,182 @@
+================================
+PSI - Pressure Stall Information
+================================
+
+:Date: April, 2018
+:Author: Johannes Weiner
+
+When CPU, memory or IO devices are contended, workloads experience
+latency spikes, throughput losses, and run the risk of OOM kills.
+
+Without an accurate measure of such contention, users are forced to
+either play it safe and under-utilize their hardware resources, or
+roll the dice and frequently suffer the disruptions resulting from
+excessive overcommit.
+
+The psi feature identifies and quantifies the disruptions caused by
+such resource crunches and the time impact it has on complex workloads
+or even entire systems.
+
+Having an accurate measure of productivity losses caused by resource
+scarcity aids users in sizing workloads to hardware--or provisioning
+hardware according to workload demand.
+
+As psi aggregates this information in realtime, systems can be managed
+dynamically using techniques such as load shedding, migrating jobs to
+other systems or data centers, or strategically pausing or killing low
+priority or restartable batch jobs.
+
+This allows maximizing hardware utilization without sacrificing
+workload health or risking major disruptions such as OOM kills.
+
+Pressure interface
+==================
+
+Pressure information for each resource is exported through the
+respective file in /proc/pressure/ -- cpu, memory, and io.
+
+The format for CPU is as such::
+
+    some avg10=0.00 avg60=0.00 avg300=0.00 total=0
+
+and for memory and IO::
+
+    some avg10=0.00 avg60=0.00 avg300=0.00 total=0
+    full avg10=0.00 avg60=0.00 avg300=0.00 total=0
+
+The "some" line indicates the share of time in which at least some
+tasks are stalled on a given resource.
+
+The "full" line indicates the share of time in which all non-idle
+tasks are stalled on a given resource simultaneously. In this state
+actual CPU cycles are going to waste, and a workload that spends
+extended time in this state is considered to be thrashing. This has
+severe impact on performance, and it's useful to distinguish this
+situation from a state where some tasks are stalled but the CPU is
+still doing productive work.
As such, time spent in this subset of the
+stall state is tracked separately and exported in the "full" averages.
+
+The ratios (in %) are tracked as recent trends over ten, sixty, and
+three hundred second windows, which gives insight into short term events
+as well as medium and long term trends. The total absolute stall time
+(in us) is tracked and exported as well, to allow detection of latency
+spikes which wouldn't necessarily make a dent in the time averages,
+or to average trends over custom time frames.
+
+Monitoring for pressure thresholds
+==================================
+
+Users can register triggers and use poll() to be woken up when resource
+pressure exceeds certain thresholds.
+
+A trigger describes the maximum cumulative stall time over a specific
+time window, e.g. 100ms of total stall time within any 500ms window to
+generate a wakeup event.
+
+To register a trigger user has to open psi interface file under
+/proc/pressure/ representing the resource to be monitored and write the
+desired threshold and time window. The open file descriptor should be
+used to wait for trigger events using select(), poll() or epoll().
+The following format is used::
+