From c18258c6f0848f97e85287f6271c511a092bb784 Mon Sep 17 00:00:00 2001 From: "Eric W. Biederman" Date: Wed, 27 Sep 2006 01:51:06 -0700 Subject: [PATCH] pid: Implement transfer_pid and use it to simplify de_thread In de_thread we move pids from one process to another, a rather ugly case. The function transfer_pid makes it clear what we are doing, and makes the action atomic. This is useful we ever want to atomically traverse the process group and session lists, in a rcu safe manner. Even if the atomic properties this change should be a win as transfer_pid should be less code to execute than executing both attach_pid and detach_pid, and this should make de_thread slightly smaller as only a single function call needs to be emitted. The only downside is that the code might be slower to execute as the odds are against transfer_pid being in cache. Signed-off-by: Eric W. Biederman Cc: Oleg Nesterov Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/pid.h | 2 ++ 1 file changed, 2 insertions(+) (limited to 'include/linux/pid.h') diff --git a/include/linux/pid.h b/include/linux/pid.h index 29960b03bef7..93da7e2d9f30 100644 --- a/include/linux/pid.h +++ b/include/linux/pid.h @@ -76,6 +76,8 @@ extern int FASTCALL(attach_pid(struct task_struct *task, enum pid_type type, int nr)); extern void FASTCALL(detach_pid(struct task_struct *task, enum pid_type)); +extern void FASTCALL(transfer_pid(struct task_struct *old, + struct task_struct *new, enum pid_type)); /* * look up a PID in the hash table. Must be called with the tasklist_lock -- cgit v1.2.3 From 0804ef4b0de7121261f77c565b20a11ac694e877 Mon Sep 17 00:00:00 2001 From: "Eric W. Biederman" Date: Mon, 2 Oct 2006 02:17:04 -0700 Subject: [PATCH] proc: readdir race fix (take 3) The problem: An opendir, readdir, closedir sequence can fail to report process ids that are continually in use throughout the sequence of system calls. For this race to trigger the process that proc_pid_readdir stops at must exit before readdir is called again. This can cause ps to fail to report processes, and it is in violation of posix guarantees and normal application expectations with respect to readdir. Currently there is no way to work around this problem in user space short of providing a gargantuan buffer to user space so the directory read all happens in on system call. This patch implements the normal directory semantics for proc, that guarantee that a directory entry that is neither created nor destroyed while reading the directory entry will be returned. For directory that are either created or destroyed during the readdir you may or may not see them. Furthermore you may seek to a directory offset you have previously seen. These are the guarantee that ext[23] provides and that posix requires, and more importantly that user space expects. Plus it is a simple semantic to implement reliable service. It is just a matter of calling readdir a second time if you are wondering if something new has show up. These better semantics are implemented by scanning through the pids in numerical order and by making the file offset a pid plus a fixed offset. The pid scan happens on the pid bitmap, which when you look at it is remarkably efficient for a brute force algorithm. Given that a typical cache line is 64 bytes and thus covers space for 64*8 == 200 pids. There are only 40 cache lines for the entire 32K pid space. A typical system will have 100 pids or more so this is actually fewer cache lines we have to look at to scan a linked list, and the worst case of having to scan the entire pid bitmap is pretty reasonable. If we need something more efficient we can go to a more efficient data structure for indexing the pids, but for now what we have should be sufficient. In addition this takes no additional locks and is actually less code than what we are doing now. Also another very subtle bug in this area has been fixed. It is possible to catch a task in the middle of de_thread where a thread is assuming the thread of it's thread group leader. This patch carefully handles that case so if we hit it we don't fail to return the pid, that is undergoing the de_thread dance. Thanks to KAMEZAWA Hiroyuki for providing the first fix, pointing this out and working on it. [oleg@tv-sign.ru: fix it] Signed-off-by: Eric W. Biederman Acked-by: KAMEZAWA Hiroyuki Signed-off-by: Oleg Nesterov Cc: Jean Delvare Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/pid.h | 1 + 1 file changed, 1 insertion(+) (limited to 'include/linux/pid.h') diff --git a/include/linux/pid.h b/include/linux/pid.h index 93da7e2d9f30..359121086de1 100644 --- a/include/linux/pid.h +++ b/include/linux/pid.h @@ -89,6 +89,7 @@ extern struct pid *FASTCALL(find_pid(int nr)); * Lookup a PID in the hash table, and return with it's count elevated. */ extern struct pid *find_get_pid(int nr); +extern struct pid *find_ge_pid(int nr); extern struct pid *alloc_pid(void); extern void FASTCALL(free_pid(struct pid *pid)); -- cgit v1.2.3 From 558cb325485aaf655130f140e8ddd25392f6c972 Mon Sep 17 00:00:00 2001 From: "Eric W. Biederman" Date: Mon, 2 Oct 2006 02:17:09 -0700 Subject: [PATCH] pid: add do_each_pid_task To avoid pid rollover confusion the kernel needs to work with struct pid * instead of pid_t. Currently there is not an iterator that walks through all of the tasks of a given pid type starting with a struct pid. This prevents us replacing some pid_t instances with struct pid. So this patch adds do_each_pid_task which walks through the set of task for a given pid type starting with a struct pid. Signed-off-by: Eric W. Biederman Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/pid.h | 13 +++++++++++++ 1 file changed, 13 insertions(+) (limited to 'include/linux/pid.h') diff --git a/include/linux/pid.h b/include/linux/pid.h index 359121086de1..8cf9b11ed264 100644 --- a/include/linux/pid.h +++ b/include/linux/pid.h @@ -119,4 +119,17 @@ extern void FASTCALL(free_pid(struct pid *pid)); 1; }) ); \ } +#define do_each_pid_task(pid, type, task) \ + if ((task = pid_task(pid, type))) { \ + prefetch(pid_next(task, type)); \ + do { + +#define while_each_pid_task(pid, type, task) \ + } while (pid_next(task, type) && ({ \ + task = pid_next_task(task, type); \ + rcu_dereference(task); \ + prefetch(pid_next(task, type)); \ + 1; }) ); \ + } + #endif /* _LINUX_PID_H */ -- cgit v1.2.3 From 5feb8f5f8403d8874a04aac443692dfe83bd63d2 Mon Sep 17 00:00:00 2001 From: "Eric W. Biederman" Date: Mon, 2 Oct 2006 02:17:12 -0700 Subject: [PATCH] pid: implement pid_nr As we stop storing pid_t's and move to storing struct pid *. We need a way to get the pid_t from the struct pid to report to user space what we have stored. Having a clean well defined way to do this is especially important as we move to multiple pid spaces as may need to report a different value to the caller depending on which pid space the caller is in. Signed-off-by: Eric W. Biederman Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/pid.h | 8 ++++++++ 1 file changed, 8 insertions(+) (limited to 'include/linux/pid.h') diff --git a/include/linux/pid.h b/include/linux/pid.h index 8cf9b11ed264..dba1b2d677a3 100644 --- a/include/linux/pid.h +++ b/include/linux/pid.h @@ -94,6 +94,14 @@ extern struct pid *find_ge_pid(int nr); extern struct pid *alloc_pid(void); extern void FASTCALL(free_pid(struct pid *pid)); +static inline pid_t pid_nr(struct pid *pid) +{ + pid_t nr = 0; + if (pid) + nr = pid->nr; + return nr; +} + #define pid_next(task, type) \ ((task)->pids[(type)].node.next) -- cgit v1.2.3 From d387cae075b0aec479adbdfb71df39f7de8e9adb Mon Sep 17 00:00:00 2001 From: Oleg Nesterov Date: Mon, 2 Oct 2006 02:17:22 -0700 Subject: [PATCH] pid: simplify pid iterators I think it is hardly possible to read the current do_each_task_pid(). The new version is much simpler and makes the code smaller. Only the do_each_task_pid change is tested, the do_each_pid_task isn't. Signed-off-by: Oleg Nesterov Cc: "Eric W. Biederman" Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/pid.h | 59 +++++++++++++++++++++-------------------------------- 1 file changed, 23 insertions(+), 36 deletions(-) (limited to 'include/linux/pid.h') diff --git a/include/linux/pid.h b/include/linux/pid.h index dba1b2d677a3..7e39767b4c60 100644 --- a/include/linux/pid.h +++ b/include/linux/pid.h @@ -102,42 +102,29 @@ static inline pid_t pid_nr(struct pid *pid) return nr; } -#define pid_next(task, type) \ - ((task)->pids[(type)].node.next) -#define pid_next_task(task, type) \ - hlist_entry(pid_next(task, type), struct task_struct, \ - pids[(type)].node) - - -/* We could use hlist_for_each_entry_rcu here but it takes more arguments - * than the do_each_task_pid/while_each_task_pid. So we roll our own - * to preserve the existing interface. - */ -#define do_each_task_pid(who, type, task) \ - if ((task = find_task_by_pid_type(type, who))) { \ - prefetch(pid_next(task, type)); \ - do { - -#define while_each_task_pid(who, type, task) \ - } while (pid_next(task, type) && ({ \ - task = pid_next_task(task, type); \ - rcu_dereference(task); \ - prefetch(pid_next(task, type)); \ - 1; }) ); \ - } - -#define do_each_pid_task(pid, type, task) \ - if ((task = pid_task(pid, type))) { \ - prefetch(pid_next(task, type)); \ - do { - -#define while_each_pid_task(pid, type, task) \ - } while (pid_next(task, type) && ({ \ - task = pid_next_task(task, type); \ - rcu_dereference(task); \ - prefetch(pid_next(task, type)); \ - 1; }) ); \ - } +#define do_each_task_pid(who, type, task) \ + do { \ + struct hlist_node *pos___; \ + struct pid *pid___ = find_pid(who); \ + if (pid___ != NULL) \ + hlist_for_each_entry_rcu((task), pos___, \ + &pid___->tasks[type], pids[type].node) { + +#define while_each_task_pid(who, type, task) \ + } \ + } while (0) + + +#define do_each_pid_task(pid, type, task) \ + do { \ + struct hlist_node *pos___; \ + if (pid != NULL) \ + hlist_for_each_entry_rcu((task), pos___, \ + &pid->tasks[type], pids[type].node) { + +#define while_each_pid_task(pid, type, task) \ + } \ + } while (0) #endif /* _LINUX_PID_H */ -- cgit v1.2.3 From 1a657f78dcc8ea7c53eaa1f2a45ea2315738c15f Mon Sep 17 00:00:00 2001 From: Oleg Nesterov Date: Mon, 2 Oct 2006 02:18:59 -0700 Subject: [PATCH] introduce get_task_pid() to fix unsafe get_pid() proc_pid_make_inode: ei->pid = get_pid(task_pid(task)); I think this is not safe. get_pid() can be preempted after checking "pid != NULL". Then the task exits, does detach_pid(), and RCU frees the pid. Signed-off-by: Oleg Nesterov Cc: "Eric W. Biederman" Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/pid.h | 2 ++ 1 file changed, 2 insertions(+) (limited to 'include/linux/pid.h') diff --git a/include/linux/pid.h b/include/linux/pid.h index 7e39767b4c60..17b9e04d3586 100644 --- a/include/linux/pid.h +++ b/include/linux/pid.h @@ -68,6 +68,8 @@ extern struct task_struct *FASTCALL(pid_task(struct pid *pid, enum pid_type)); extern struct task_struct *FASTCALL(get_pid_task(struct pid *pid, enum pid_type)); +extern struct pid *get_task_pid(struct task_struct *task, enum pid_type type); + /* * attach_pid() and detach_pid() must be called with the tasklist_lock * write-held. -- cgit v1.2.3 From 1d32849b14bc8792e6f35ab27dd990d74b16126c Mon Sep 17 00:00:00 2001 From: Andrew Morton Date: Tue, 3 Oct 2006 01:13:45 -0700 Subject: [PATCH] pid.h cleanup Make the pid.h macros look less revolting in an 80-col window. Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/pid.h | 30 +++++++++++++++--------------- 1 file changed, 15 insertions(+), 15 deletions(-) (limited to 'include/linux/pid.h') diff --git a/include/linux/pid.h b/include/linux/pid.h index 17b9e04d3586..2c0007d17218 100644 --- a/include/linux/pid.h +++ b/include/linux/pid.h @@ -105,28 +105,28 @@ static inline pid_t pid_nr(struct pid *pid) } -#define do_each_task_pid(who, type, task) \ - do { \ - struct hlist_node *pos___; \ - struct pid *pid___ = find_pid(who); \ - if (pid___ != NULL) \ - hlist_for_each_entry_rcu((task), pos___, \ +#define do_each_task_pid(who, type, task) \ + do { \ + struct hlist_node *pos___; \ + struct pid *pid___ = find_pid(who); \ + if (pid___ != NULL) \ + hlist_for_each_entry_rcu((task), pos___, \ &pid___->tasks[type], pids[type].node) { -#define while_each_task_pid(who, type, task) \ - } \ +#define while_each_task_pid(who, type, task) \ + } \ } while (0) -#define do_each_pid_task(pid, type, task) \ - do { \ - struct hlist_node *pos___; \ - if (pid != NULL) \ - hlist_for_each_entry_rcu((task), pos___, \ +#define do_each_pid_task(pid, type, task) \ + do { \ + struct hlist_node *pos___; \ + if (pid != NULL) \ + hlist_for_each_entry_rcu((task), pos___, \ &pid->tasks[type], pids[type].node) { -#define while_each_pid_task(pid, type, task) \ - } \ +#define while_each_pid_task(pid, type, task) \ + } \ } while (0) #endif /* _LINUX_PID_H */ -- cgit v1.2.3 From 84d737866e2babdeab0c6b18ea155c6a649663b8 Mon Sep 17 00:00:00 2001 From: Sukadev Bhattiprolu Date: Fri, 8 Dec 2006 02:38:01 -0800 Subject: [PATCH] add child reaper to pid_namespace Add a per pid_namespace child-reaper. This is needed so processes are reaped within the same pid space and do not spill over to the parent pid space. Its also needed so containers preserve existing semantic that pid == 1 would reap orphaned children. This is based on Eric Biederman's patch: http://lkml.org/lkml/2006/2/6/285 Signed-off-by: Sukadev Bhattiprolu Signed-off-by: Cedric Le Goater Cc: Kirill Korotaev Cc: Eric W. Biederman Cc: Herbert Poetzl Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/pid.h | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) (limited to 'include/linux/pid.h') diff --git a/include/linux/pid.h b/include/linux/pid.h index 2c0007d17218..4dec047b1837 100644 --- a/include/linux/pid.h +++ b/include/linux/pid.h @@ -35,8 +35,9 @@ enum pid_type * * Holding a reference to struct pid solves both of these problems. * It is small so holding a reference does not consume a lot of - * resources, and since a new struct pid is allocated when the numeric - * pid value is reused we don't mistakenly refer to new processes. + * resources, and since a new struct pid is allocated when the numeric pid + * value is reused (when pids wrap around) we don't mistakenly refer to new + * processes. */ struct pid -- cgit v1.2.3