linux-toradex.git/kernel/fork.c, branch v2.6.36-rc7

rmap: fix walk during fork

2010-09-23T00:22:39+00:00

The below bug in fork led to the rmap walk finding the parent huge-pmd
twice instead of just once, because the anon_vma_chain objects of the
child vma still point to the vma->vm_mm of the parent.

The patch fixes it by making the rmap walk accurate during fork.  It's not
a big deal normally, but it is worth being accurate considering the cost is
the same.

Signed-off-by: Andrea Arcangeli 
Acked-by: Johannes Weiner 
Acked-by: Rik van Riel 
Acked-by: Hugh Dickins 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: make the vma list be doubly linked

2010-08-21T15:49:21+00:00

It's a really simple list, and several of the users want to go backwards
in it to find the previous vma.  So rather than have to look up the
previous entry with 'find_vma_prev()' or something similar, just make it
doubly linked instead.
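
As a rough illustration of the idea (not the kernel's actual code), the
sketch below uses a hypothetical struct toy_vma and helper toy_vma_link()
to show how a prev pointer is maintained on insertion, so that the
previous vma can later be read directly instead of being searched for.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-in for a vma; not the kernel struct. */
struct toy_vma {
    unsigned long start, end;
    struct toy_vma *next, *prev;
};

/*
 * Insert 'vma' right after 'prev', or at the head of the list when
 * 'prev' is NULL, keeping both the next and prev links consistent.
 */
void toy_vma_link(struct toy_vma **head, struct toy_vma *vma,
                  struct toy_vma *prev)
{
    struct toy_vma *next;

    vma->prev = prev;
    if (prev) {
        next = prev->next;
        prev->next = vma;
    } else {
        next = *head;
        *head = vma;
    }
    vma->next = next;
    if (next)
        next->prev = vma;
}
```

With the links in place, the predecessor of any entry is simply
vma->prev, with no list walk required.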

Tested-by: Ian Campbell 
Signed-off-by: Linus Torvalds

fs: fs_struct rwlock to spinlock

2010-08-18T12:35:46+00:00

struct fs_struct.lock is an rwlock with the read-side used to protect root and
pwd members while taking references to them. Taking a reference to a path
typically requires just 2 atomic ops, so the critical section is very small.
Parallel read-side operations would have cacheline contention on the lock, the
dentry, and the vfsmount cachelines, so the rwlock is unlikely to ever give a
real parallelism increase.

Replace it with a spinlock to avoid one or two atomic operations in the
typical path lookup fastpath.
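
A minimal userspace sketch of the pattern, assuming C11 atomics and the
hypothetical names toy_ref, toy_path, toy_fs and toy_get_root() (none of
these are kernel identifiers): the critical section only copies the path
and takes two references, which is why a plain spinlock is sufficient.

```c
#include <stdatomic.h>

/* Hypothetical refcounted objects standing in for vfsmount and dentry. */
struct toy_ref  { atomic_int count; };
struct toy_path { struct toy_ref *mnt; struct toy_ref *dentry; };

/* Hypothetical stand-in for fs_struct: one spinlock guarding root. */
struct toy_fs {
    atomic_flag lock;
    struct toy_path root;
};

/* Copy the root path and take a reference on both of its members. */
void toy_get_root(struct toy_fs *fs, struct toy_path *out)
{
    /* Spin until the flag is acquired. */
    while (atomic_flag_test_and_set_explicit(&fs->lock,
                                             memory_order_acquire))
        ;
    *out = fs->root;
    atomic_fetch_add(&out->mnt->count, 1);     /* first atomic op  */
    atomic_fetch_add(&out->dentry->count, 1);  /* second atomic op */
    atomic_flag_clear_explicit(&fs->lock, memory_order_release);
}
```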

Signed-off-by: Nick Piggin 
Signed-off-by: Al Viro

oom: badness heuristic rewrite

2010-08-10T03:45:02+00:00

This is a complete rewrite of the oom killer's badness() heuristic which is
used to determine which task to kill in oom conditions.  The goal is to
make it as simple and predictable as possible so the results are better
understood and we end up killing the task which will lead to the most
memory freeing while still respecting the fine-tuning from userspace.

Instead of basing the heuristic on mm->total_vm for each task, the task's
rss and swap space are used.  This is a better indication of the amount of
memory that will be freeable if the oom killed task is chosen and
subsequently exits.  This helps specifically in cases where KDE or GNOME
is chosen for oom kill on desktop systems instead of a memory hogging
task.

The baseline for the heuristic is a proportion of memory that each task is
currently using in memory plus swap compared to the amount of "allowable"
memory.  "Allowable," in this sense, means the system-wide resources for
unconstrained oom conditions, the set of mempolicy nodes, the mems
attached to current's cpuset, or a memory controller's limit.  The
proportion is given on a scale of 0 (never kill) to 1000 (always kill),
roughly meaning that if a task has a badness() score of 500 that the task
consumes approximately 50% of allowable memory resident in RAM or in swap
space.

The proportion is always relative to the amount of "allowable" memory and
not the total amount of RAM systemwide so that mempolicies and cpusets may
operate in isolation; they shall not need to know the true size of the
machine on which they are running if they are bound to a specific set of
nodes or mems, respectively.

Root tasks are given 3% extra memory just like __vm_enough_memory()
provides in LSMs.  In the event of two tasks consuming similar amounts of
memory, it is generally better to save root's task.

Because of the change in the badness() heuristic's baseline, it is also
necessary to introduce a new user interface to tune it.  It's not possible
to redefine the meaning of /proc/pid/oom_adj with a new scale since the
ABI cannot be changed for backward compatibility.  Instead, a new tunable,
/proc/pid/oom_score_adj, is added that ranges from -1000 to +1000.  It may
be used to polarize the heuristic such that certain tasks are never
considered for oom kill while others may always be considered.  The value
is added directly into the badness() score so a value of -500, for
example, means to discount 50% of its memory consumption in comparison to
other tasks either on the system, bound to the mempolicy, in the cpuset,
or sharing the same memory controller.
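
The arithmetic described above can be sketched as follows.  This is an
illustration under stated assumptions, not the code from the patch:
toy_badness() and its parameters are hypothetical names, the 3% root
allowance is interpreted as 30 points on the 1000-point scale, and the
result is clamped to the range 0..1000.

```c
/*
 * Hypothetical sketch of the proportional score: memory in use (rss
 * plus swap) as a fraction of "allowable" memory, on a 0..1000 scale.
 */
long toy_badness(unsigned long rss_pages, unsigned long swap_pages,
                 unsigned long allowable_pages, int is_root,
                 int oom_score_adj)
{
    long points;

    if (allowable_pages == 0)
        return 0;       /* avoid dividing by zero in this sketch */

    points = (long)((rss_pages + swap_pages) * 1000 / allowable_pages);

    if (is_root)
        points -= 30;   /* assumption: 3% of the 1000-point scale */

    points += oom_score_adj;    /* tunable is added directly */

    /* Assumption: keep the result within the documented 0..1000 scale. */
    if (points < 0)
        points = 0;
    if (points > 1000)
        points = 1000;
    return points;
}
```

For example, a task using 500 of 1000 allowable pages scores 500 in this
sketch, and an oom_score_adj of -500 brings that score down to 0.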

/proc/pid/oom_adj is changed so that its meaning is rescaled into the
units used by /proc/pid/oom_score_adj, and vice versa.  Changing one of
these per-task tunables will rescale the value of the other to an
equivalent meaning.  Although /proc/pid/oom_adj was originally defined as
a bitshift on the badness score, it now shares the same linear growth as
/proc/pid/oom_score_adj but with different granularity.  This is required
so the ABI is not broken with userspace applications and allows oom_adj to
be deprecated for future removal.
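
One way such a linear rescaling could look is sketched below.  The text
above does not give the formula, so everything here is an assumption:
the oom_adj range of -17 (disable) to +15, the helper names
toy_adj_to_score_adj() and toy_score_adj_to_adj(), and the special case
that maps the maximum oom_adj onto the maximum oom_score_adj.

```c
#define TOY_OOM_DISABLE       (-17)   /* assumed lowest oom_adj value  */
#define TOY_OOM_ADJUST_MAX    15      /* assumed highest oom_adj value */
#define TOY_OOM_SCORE_ADJ_MAX 1000

/* Rescale an oom_adj value into oom_score_adj units (linear). */
int toy_adj_to_score_adj(int oom_adj)
{
    /* Assumption: the top of one range maps to the top of the other. */
    if (oom_adj == TOY_OOM_ADJUST_MAX)
        return TOY_OOM_SCORE_ADJ_MAX;
    return oom_adj * TOY_OOM_SCORE_ADJ_MAX / -TOY_OOM_DISABLE;
}

/* Rescale an oom_score_adj value back into oom_adj units. */
int toy_score_adj_to_adj(int oom_score_adj)
{
    return oom_score_adj * -TOY_OOM_DISABLE / TOY_OOM_SCORE_ADJ_MAX;
}
```

Because the two scales differ in granularity, a round trip through both
helpers does not always return the starting value.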

Signed-off-by: David Rientjes 
Cc: Nick Piggin 
Cc: KAMEZAWA Hiroyuki 
Cc: KOSAKI Motohiro 
Cc: Oleg Nesterov 
Cc: Balbir Singh 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

sched: add hooks for workqueue

2010-06-08T19:40:37+00:00

Concurrency managed workqueue (cmwq) needs to know when workers go to
sleep and when they wake up.  Using these two hooks, cmwq keeps track of
the current concurrency level, throttles execution of new works if it is
too high, and wakes up another worker from the sleep hook if it becomes
too low.

This patch introduces PF_WQ_WORKER to identify workqueue workers and
adds the following two hooks.

* wq_worker_waking_up(): called when a worker is woken up.

* wq_worker_sleeping(): called when a worker is going to sleep and may
  return a pointer to a local task which should be woken up.  The
  returned task is woken up using try_to_wake_up_local(), a simplified
  ttwu which is called under the rq lock and can only wake up local
  tasks.
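
To make the intended use of the two hooks concrete, here is a toy
userspace model.  It is not the cmwq implementation (the hooks are noops
in this patch); struct toy_pool, struct toy_worker and both functions
are hypothetical, and the policy of waking an idle worker only when no
worker is running and work is pending is an assumption.

```c
#include <stddef.h>

/* Hypothetical worker and pool bookkeeping for this sketch only. */
struct toy_worker { int id; };

struct toy_pool {
    int nr_running;             /* current concurrency level    */
    int pending_work;           /* queued work items            */
    struct toy_worker *idle;    /* an idle worker, if available */
};

/* Wake-up hook: one more worker is now runnable. */
void toy_worker_waking_up(struct toy_pool *pool)
{
    pool->nr_running++;
}

/*
 * Sleep hook: one fewer worker is runnable.  Returns a worker that
 * should be woken to keep work flowing, or NULL if none is needed.
 */
struct toy_worker *toy_worker_sleeping(struct toy_pool *pool)
{
    if (--pool->nr_running == 0 && pool->pending_work > 0)
        return pool->idle;
    return NULL;
}
```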

Both hooks are currently defined as noops in kernel/workqueue_sched.h.
The later cmwq implementation will replace them with proper
implementations.

These hooks are hard coded as they'll always be enabled.

Signed-off-by: Tejun Heo 
Acked-by: Peter Zijlstra 
Cc: Mike Galbraith 
Cc: Ingo Molnar

Revert "cpusets: randomize node rotor used in cpuset_mem_spread_node()"

2010-05-30T16:00:03+00:00

This reverts commit 0ac0c0d0f837c499afd02a802f9cf52d3027fa3b, which
caused cross-architecture build problems for all the wrong reasons.
IA64 already added its own version of __node_random(), but the fact is,
there is nothing architectural about the function, and the original
commit was just badly done. Revert it, since no fix is forthcoming.

Requested-by: Stephen Rothwell 
Signed-off-by: Linus Torvalds

pids: fix fork_idle() to setup ->pids correctly

2010-05-27T16:12:52+00:00

copy_process(pid => &init_struct_pid) doesn't do attach_pid/etc.

It shouldn't, but this means that the idle threads run with the wrong
pids copied from the caller's task_struct. In the x86 case the caller is
either the kernel_init() thread or keventd.

In particular, this means that after a series of cpu_up/cpu_down an
idle thread (which never exits) can run with .pid pointing to nowhere.

Change fork_idle() to initialize idle->pids[] correctly. We only set
.pid = &init_struct_pid but do not add .node to the list; INIT_TASK() does
the same for the boot-cpu idle thread (swapper).

Signed-off-by: Oleg Nesterov 
Cc: Cedric Le Goater 
Cc: Dave Hansen 
Cc: Eric Biederman 
Cc: Herbert Poetzl 
Cc: Mathias Krause 
Acked-by: Roland McGrath 
Acked-by: Serge Hallyn 
Cc: Sukadev Bhattiprolu 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

proc: turn signal_struct->count into "int nr_threads"

2010-05-27T16:12:47+00:00

No functional changes, just s/atomic_t count/int nr_threads/.

With the recent changes this counter has a single user, get_nr_threads().
None of its callers needs a really accurate number of threads, not to
mention that each caller obviously races with fork/exit.  The value is
only reported to user space, except that first_tid() uses it to avoid the
unnecessary while_each_thread() loop in the unlikely case.

It is a bit sad that we need a word in struct signal_struct for this;
perhaps we can change get_nr_threads() to approximate the number of
threads using signal->live and kill ->nr_threads later.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Oleg Nesterov 
Cc: Alexey Dobriyan 
Cc: "Eric W. Biederman" 
Acked-by: Roland McGrath 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

check_unshare_flags: kill the bogus CLONE_SIGHAND/sig->count check

2010-05-27T16:12:47+00:00

check_unshare_flags(CLONE_SIGHAND) adds CLONE_THREAD to *flags_ptr if the
task is multithreaded to ensure unshare_thread() will fail.

Not only is this a rather strange way to return the error, it is also
meaningless.  If signal->count > 1 then sighand->count must also be > 1,
and unshare_sighand() will fail anyway.

In fact, all CLONE_THREAD/SIGHAND/VM checks inside sys_unshare() do not
look right.  Fortunately this code doesn't really work anyway.

Signed-off-by: Oleg Nesterov 
Cc: Balbir Singh 
Acked-by: Roland McGrath 
Cc: Veaceslav Falico 
Cc: Stanislaw Gruszka 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

exit: move taskstats_tgid_free() from __exit_signal() to free_signal_struct()

2010-05-27T16:12:46+00:00

Move taskstats_tgid_free() from __exit_signal() to free_signal_struct().

This way signal->stats never points to nowhere and we can read ->stats
locklessly.

Signed-off-by: Oleg Nesterov 
Cc: Balbir Singh 
Cc: Roland McGrath 
Cc: Veaceslav Falico 
Cc: Stanislaw Gruszka 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds