From abffc0207f12563f17bbde96e4cc0d9f3d7e2a53 Mon Sep 17 00:00:00 2001 From: Andrea Righi Date: Wed, 27 Oct 2010 15:33:31 -0700 Subject: doc: clarify the behaviour of dirty_ratio/dirty_bytes When dirty_ratio or dirty_bytes is written the other parameter is disabled and set to 0 (in dirty_bytes_handler() / dirty_ratio_handler()). We do the same for dirty_background_ratio and dirty_background_bytes. However, in the sysctl documentation, we say that the counterpart becomes a function of the old value, that is not correct. Clarify the documentation reporting the actual behaviour. Reviewed-by: Greg Thelen Acked-by: David Rientjes Signed-off-by: Andrea Righi Cc: Randy Dunlap Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/sysctl/vm.txt | 12 ++++++++---- 1 file changed, 8 insertions(+), 4 deletions(-) (limited to 'Documentation') diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt index b606c2c4dd37..30289fab86eb 100644 --- a/Documentation/sysctl/vm.txt +++ b/Documentation/sysctl/vm.txt @@ -80,8 +80,10 @@ dirty_background_bytes Contains the amount of dirty memory at which the pdflush background writeback daemon will start writeback. -If dirty_background_bytes is written, dirty_background_ratio becomes a function -of its value (dirty_background_bytes / the amount of dirtyable system memory). +Note: dirty_background_bytes is the counterpart of dirty_background_ratio. Only +one of them may be specified at a time. When one sysctl is written it is +immediately taken into account to evaluate the dirty memory limits and the +other appears as 0 when read. ============================================================== @@ -97,8 +99,10 @@ dirty_bytes Contains the amount of dirty memory at which a process generating disk writes will itself start writeback. -If dirty_bytes is written, dirty_ratio becomes a function of its value -(dirty_bytes / the amount of dirtyable system memory). +Note: dirty_bytes is the counterpart of dirty_ratio. Only one of them may be +specified at a time. When one sysctl is written it is immediately taken into +account to evaluate the dirty memory limits and the other appears as 0 when +read. Note: the minimum value allowed for dirty_bytes is two pages (in bytes); any value lower than this limit will be ignored and the old configuration will be -- cgit v1.2.3 From 97978e6d1f2da0073416870410459694fbdbfd9b Mon Sep 17 00:00:00 2001 From: Daniel Lezcano Date: Wed, 27 Oct 2010 15:33:35 -0700 Subject: cgroup: add clone_children control file The ns_cgroup is a control group interacting with the namespaces. When a new namespace is created, a corresponding cgroup is automatically created too. The cgroup name is the pid of the process who did 'unshare' or the child of 'clone'. This cgroup is tied with the namespace because it prevents a process to escape the control group and use the post_clone callback, so the child cgroup inherits the values of the parent cgroup. Unfortunately, the more we use this cgroup and the more we are facing problems with it: (1) when a process unshares, the cgroup name may conflict with a previous cgroup with the same pid, so unshare or clone return -EEXIST (2) the cgroup creation is out of control because there may have an application creating several namespaces where the system will automatically create several cgroups in his back and let them on the cgroupfs (eg. a vrf based on the network namespace). (3) the mix of (1) and (2) force an administrator to regularly check and clean these cgroups. This patchset removes the ns_cgroup by adding a new flag to the cgroup and the cgroupfs mount option. It enables the copy of the parent cgroup when a child cgroup is created. We can then safely remove the ns_cgroup as this flag brings a compatibility. We have now to manually create and add the task to a cgroup, which is consistent with the cgroup framework. This patch: Sent as an answer to a previous thread around the ns_cgroup. https://lists.linux-foundation.org/pipermail/containers/2009-June/018627.html It adds a control file 'clone_children' for a cgroup. This control file is a boolean specifying if the child cgroup should be a clone of the parent cgroup or not. The default value is 'false'. This flag makes the child cgroup to call the post_clone callback of all the subsystem, if it is available. At present, the cpuset is the only one which had implemented the post_clone callback. The option can be set at mount time by specifying the 'clone_children' mount option. Signed-off-by: Daniel Lezcano Signed-off-by: Serge E. Hallyn Cc: Eric W. Biederman Acked-by: Paul Menage Reviewed-by: Li Zefan Cc: Jamal Hadi Salim Cc: Matt Helsley Acked-by: Balbir Singh Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/cgroups/cgroups.txt | 14 ++++++++++++-- 1 file changed, 12 insertions(+), 2 deletions(-) (limited to 'Documentation') diff --git a/Documentation/cgroups/cgroups.txt b/Documentation/cgroups/cgroups.txt index b34823ff1646..190018b0c649 100644 --- a/Documentation/cgroups/cgroups.txt +++ b/Documentation/cgroups/cgroups.txt @@ -18,7 +18,8 @@ CONTENTS: 1.2 Why are cgroups needed ? 1.3 How are cgroups implemented ? 1.4 What does notify_on_release do ? - 1.5 How do I use cgroups ? + 1.5 What does clone_children do ? + 1.6 How do I use cgroups ? 2. Usage Examples and Syntax 2.1 Basic Usage 2.2 Attaching processes @@ -293,7 +294,16 @@ notify_on_release in the root cgroup at system boot is disabled value of their parents notify_on_release setting. The default value of a cgroup hierarchy's release_agent path is empty. -1.5 How do I use cgroups ? +1.5 What does clone_children do ? +--------------------------------- + +If the clone_children flag is enabled (1) in a cgroup, then all +cgroups created beneath will call the post_clone callbacks for each +subsystem of the newly created cgroup. Usually when this callback is +implemented for a subsystem, it copies the values of the parent +subsystem, this is the case for the cpuset. + +1.6 How do I use cgroups ? -------------------------- To start a new job that is to be contained within a cgroup, using -- cgit v1.2.3 From 45531757b45cae0ce64c5aff08c2534d5a0fa3e7 Mon Sep 17 00:00:00 2001 From: Daniel Lezcano Date: Wed, 27 Oct 2010 15:33:38 -0700 Subject: cgroup: notify ns_cgroup deprecated The ns_cgroup will be removed very soon. Let's warn, for this version, ns_cgroup is deprecated. Make ns_cgroup and clone_children exclusive. If the clone_children is set and the ns_cgroup is mounted, let's fail with EINVAL when the ns_cgroup subsys is created (a printk will help the user to understand why the creation fails). Update the feature remove schedule file with the deprecated ns_cgroup. Signed-off-by: Daniel Lezcano Acked-by: Paul Menage Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/feature-removal-schedule.txt | 17 +++++++++++++++++ 1 file changed, 17 insertions(+) (limited to 'Documentation') diff --git a/Documentation/feature-removal-schedule.txt b/Documentation/feature-removal-schedule.txt index d2af87ba96e1..f3da8c0a3af2 100644 --- a/Documentation/feature-removal-schedule.txt +++ b/Documentation/feature-removal-schedule.txt @@ -526,6 +526,23 @@ Who: FUJITA Tomonori ---------------------------- +What: namespace cgroup (ns_cgroup) +When: 2.6.38 +Why: The ns_cgroup leads to some problems: + * cgroup creation is out-of-control + * cgroup name can conflict when pids are looping + * it is not possible to have a single process handling + a lot of namespaces without falling in a exponential creation time + * we may want to create a namespace without creating a cgroup + + The ns_cgroup is replaced by a compatibility flag 'clone_children', + where a newly created cgroup will copy the parent cgroup values. + The userspace has to manually create a cgroup and add a task to + the 'tasks' file. +Who: Daniel Lezcano + +---------------------------- + What: iwlwifi disable_hw_scan module parameters When: 2.6.40 Why: Hareware scan is the prefer method for iwlwifi devices for -- cgit v1.2.3 From b40d4f84becd69275451baee7f0801c85eb58437 Mon Sep 17 00:00:00 2001 From: Nikanth Karthikesan Date: Wed, 27 Oct 2010 15:34:10 -0700 Subject: /proc/pid/smaps: export amount of anonymous memory in a mapping Export the number of anonymous pages in a mapping via smaps. Even the private pages in a mapping backed by a file, would be marked as anonymous, when they are modified. Export this information to user-space via smaps. Exporting this count will help gdb to make a better decision on which areas need to be dumped in its coredump; and should be useful to others studying the memory usage of a process. Signed-off-by: Nikanth Karthikesan Acked-by: Hugh Dickins Reviewed-by: KOSAKI Motohiro Cc: Matt Mackall Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/filesystems/proc.txt | 13 ++++++++++--- 1 file changed, 10 insertions(+), 3 deletions(-) (limited to 'Documentation') diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt index a563b74c7aef..976de6e19dd8 100644 --- a/Documentation/filesystems/proc.txt +++ b/Documentation/filesystems/proc.txt @@ -370,6 +370,7 @@ Shared_Dirty: 0 kB Private_Clean: 0 kB Private_Dirty: 0 kB Referenced: 892 kB +Anonymous: 0 kB Swap: 0 kB KernelPageSize: 4 kB MMUPageSize: 4 kB @@ -378,9 +379,15 @@ The first of these lines shows the same information as is displayed for the mapping in /proc/PID/maps. The remaining lines show the size of the mapping (size), the amount of the mapping that is currently resident in RAM (RSS), the process' proportional share of this mapping (PSS), the number of clean and -dirty shared pages in the mapping, and the number of clean and dirty private -pages in the mapping. The "Referenced" indicates the amount of memory -currently marked as referenced or accessed. +dirty private pages in the mapping. Note that even a page which is part of a +MAP_SHARED mapping, but has only a single pte mapped, i.e. is currently used +by only one process, is accounted as private and not as shared. "Referenced" +indicates the amount of memory currently marked as referenced or accessed. +"Anonymous" shows the amount of memory that does not belong to any file. Even +a mapping associated with a file may contain anonymous pages: when MAP_PRIVATE +and a page is modified, the file page is replaced by a private anonymous copy. +"Swap" shows how much would-be-anonymous memory is also used, but out on +swap. This file is only present if the CONFIG_MMU kernel configuration option is enabled. -- cgit v1.2.3 From 03f890f8c2f5c9008d3d8f6d85267717ced4bd79 Mon Sep 17 00:00:00 2001 From: Nikanth Karthikesan Date: Wed, 27 Oct 2010 15:34:11 -0700 Subject: /proc/pid/pagemap: document in Documentation/filesystems/proc.txt Document /proc/pid/pagemap in Documentation/filesystems/proc.txt Signed-off-by: Nikanth Karthikesan Cc: Richard Guenther Cc: Balbir Singh Cc: KOSAKI Motohiro Acked-by: Matt Mackall Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/filesystems/proc.txt | 4 ++++ 1 file changed, 4 insertions(+) (limited to 'Documentation') diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt index 976de6e19dd8..e73df2722ff3 100644 --- a/Documentation/filesystems/proc.txt +++ b/Documentation/filesystems/proc.txt @@ -136,6 +136,7 @@ Table 1-1: Process specific entries in /proc statm Process memory status information status Process status in human readable form wchan If CONFIG_KALLSYMS is set, a pre-decoded wchan + pagemap Page table stack Report full stack trace, enable via CONFIG_STACKTRACE smaps a extension based on maps, showing the memory consumption of each mapping @@ -404,6 +405,9 @@ To clear the bits for the file mapped pages associated with the process > echo 3 > /proc/PID/clear_refs Any other value written to /proc/PID/clear_refs will have no effect. +The /proc/pid/pagemap gives the PFN, which can be used to find the pageflags +using /proc/kpageflags and number of times a page is mapped using +/proc/kpagecount. For detailed explanation, see Documentation/vm/pagemap.txt. 1.2 Kernel data --------------- -- cgit v1.2.3 From db9e5679d6aecb17253f41bd06d98194800f9c01 Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Wed, 27 Oct 2010 15:34:42 -0700 Subject: delay-accounting: reimplement -c for getdelays.c to report information on a target command Task delay-accounting was identified as one means of determining how long a process spends in congestion_wait() without adding new statistics. For example, if the workload should not be doing IO, delay-accounting could reveal how long it was spending in unexpected IO or delays. Unfortunately, on closer examination it was clear that getdelays does not act as documented. Commit a3baf649 ("per-task-delay-accounting: documentation") added Documentation/accounting/getdelays.c with a -c switch that was documented to fork/exec a child and report statistics on it but for reasons that are unclear to me, commit 9e06d3f9 deleted support for this switch but did not update the documentation. It might be an oversight or it might be because the control flow of the program meant that accounting information would be printed once early in the lifetime of the program making it of limited use. This patch reimplements -c for getdelays.c to act as documented. Unlike the original version, it waits until the command completes before printing any information on it. An example of it being used looks like $ ./getdelays -d -c find /home/mel -name mel print delayacct stats ON /home/mel /home/mel/.notes-wine/drive_c/windows/profiles/mel /home/mel/.wine/drive_c/windows/profiles/mel /home/mel/git-configs/dot.kde/share/apps/konqueror/home/mel PID 5923 CPU count real total virtual total delay total 42779 5051232096 5164722692 564207988 IO count delay total 41727 97804147758 SWAP count delay total 0 0 RECLAIM count delay total 0 0 [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Mel Gorman Acked-by: Balbir Singh Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/accounting/getdelays.c | 38 ++++++++++++++++++++++++++++++++++-- 1 file changed, 36 insertions(+), 2 deletions(-) (limited to 'Documentation') diff --git a/Documentation/accounting/getdelays.c b/Documentation/accounting/getdelays.c index 6e25c2659e0a..a2976a6de033 100644 --- a/Documentation/accounting/getdelays.c +++ b/Documentation/accounting/getdelays.c @@ -21,6 +21,7 @@ #include #include #include +#include #include #include @@ -266,11 +267,13 @@ int main(int argc, char *argv[]) int containerset = 0; char containerpath[1024]; int cfd = 0; + int forking = 0; + sigset_t sigset; struct msgtemplate msg; - while (1) { - c = getopt(argc, argv, "qdiw:r:m:t:p:vlC:"); + while (!forking) { + c = getopt(argc, argv, "qdiw:r:m:t:p:vlC:c:"); if (c < 0) break; @@ -319,6 +322,28 @@ int main(int argc, char *argv[]) err(1, "Invalid pid\n"); cmd_type = TASKSTATS_CMD_ATTR_PID; break; + case 'c': + + /* Block SIGCHLD for sigwait() later */ + if (sigemptyset(&sigset) == -1) + err(1, "Failed to empty sigset"); + if (sigaddset(&sigset, SIGCHLD)) + err(1, "Failed to set sigchld in sigset"); + sigprocmask(SIG_BLOCK, &sigset, NULL); + + /* fork/exec a child */ + tid = fork(); + if (tid < 0) + err(1, "Fork failed\n"); + if (tid == 0) + if (execvp(argv[optind - 1], + &argv[optind - 1]) < 0) + exit(-1); + + /* Set the command type and avoid further processing */ + cmd_type = TASKSTATS_CMD_ATTR_PID; + forking = 1; + break; case 'v': printf("debug on\n"); dbg = 1; @@ -370,6 +395,15 @@ int main(int argc, char *argv[]) goto err; } + /* + * If we forked a child, wait for it to exit. Cannot use waitpid() + * as all the delicious data would be reaped as part of the wait + */ + if (tid && forking) { + int sig_received; + sigwait(&sigset, &sig_received); + } + if (tid) { rc = send_cmd(nl_sd, id, mypid, TASKSTATS_CMD_GET, cmd_type, &tid, sizeof(__u32)); -- cgit v1.2.3