linux-toradex.git/mm/oom_kill.c, branch v3.17-rc2

mm, oom: remove unnecessary exit_state check

2014-08-07T01:01:21+00:00

The oom killer scans each process and determines whether it is eligible
for oom kill or whether the oom killer should abort because of
concurrent memory freeing.  It will abort when an eligible process is
found to have TIF_MEMDIE set, meaning it has already been oom killed and
we're waiting for it to exit.

Processes with task->mm == NULL should not be considered because they
are either kthreads or have already detached their memory and killing
them would not lead to memory freeing.  That memory is only freed after
exit_mm() has returned, however, and not when task->mm is first set to
NULL.

Clear TIF_MEMDIE after exit_mm()'s mmput() so that an oom killed process
stops being considered for oom kill only once exit_mm() has returned.
Previously this was fragile because it relied on exit_notify() being
reached before TIF_MEMDIE processes stopped being considered.

Signed-off-by: David Rientjes 
Cc: Oleg Nesterov 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds
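
The per-task decision described above can be sketched as a small
userspace model.  This is an illustration only, not the kernel code:
struct model_task, the MODEL_SCAN_* values and model_scan_task() are
hypothetical names, and the ordering of the two checks is an assumption
drawn from the changelog text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the task_struct state that matters here. */
struct model_task {
	bool has_mm;     /* models task->mm != NULL */
	bool tif_memdie; /* models TIF_MEMDIE being set on the task */
};

enum model_scan_result {
	MODEL_SCAN_OK,       /* eligible, compute its badness */
	MODEL_SCAN_CONTINUE, /* skip this task */
	MODEL_SCAN_ABORT,    /* stop scanning, a victim is still exiting */
};

/* Decide what the scan should do with one task. */
static enum model_scan_result model_scan_task(const struct model_task *t)
{
	assert(t != NULL);

	/* Already oom killed and not yet through exit_mm(): wait for it. */
	if (t->tif_memdie)
		return MODEL_SCAN_ABORT;

	/* No mm: a kthread, or memory already detached; killing frees nothing. */
	if (!t->has_mm)
		return MODEL_SCAN_CONTINUE;

	return MODEL_SCAN_OK;
}
```

In this model a task whose mm is already NULL but which still has the
flag set aborts the scan, which is the window between task->mm being
cleared and exit_mm() returning.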

mm, oom: rename zonelist locking functions

2014-08-07T01:01:21+00:00

The names try_set_zonelist_oom() and clear_zonelist_oom() do not convey
that the functions need locking semantics to keep out_of_memory() from
being reordered.

zone_scan_lock is required for both functions to ensure that there is
proper locking synchronization.

Rename try_set_zonelist_oom() to oom_zonelist_trylock() and rename
clear_zonelist_oom() to oom_zonelist_unlock() to make the locking
semantics explicit.

At the same time, convert oom_zonelist_trylock() to return bool instead
of int since only success and failure are tested.

Signed-off-by: David Rientjes 
Cc: "Kirill A. Shutemov" 
Cc: Johannes Weiner 
Cc: Rik van Riel 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds
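
A minimal userspace sketch of a trylock/unlock pair that returns bool
and serializes on a single lock.  Everything prefixed model_ is
hypothetical, the per-zone flag array is an assumption made for the
example, and a C11 atomic_flag stands in for zone_scan_lock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_MAX_ZONES 4

/* One "oom locked" flag per zone, all guarded by one spin lock. */
static atomic_flag model_zone_scan_lock = ATOMIC_FLAG_INIT;
static bool model_zone_oom_locked[MODEL_MAX_ZONES];

static void model_lock(void)
{
	while (atomic_flag_test_and_set_explicit(&model_zone_scan_lock,
						 memory_order_acquire))
		; /* spin until the holder clears the flag */
}

static void model_unlock(void)
{
	atomic_flag_clear_explicit(&model_zone_scan_lock,
				   memory_order_release);
}

/* Returns true and marks every zone only if none was already marked. */
static bool model_oom_zonelist_trylock(const int *zones, size_t n)
{
	bool ret = true;
	size_t i;

	model_lock();
	for (i = 0; i < n; i++) {
		assert(zones[i] >= 0 && zones[i] < MODEL_MAX_ZONES);
		if (model_zone_oom_locked[zones[i]]) {
			ret = false;
			goto out;
		}
	}
	for (i = 0; i < n; i++)
		model_zone_oom_locked[zones[i]] = true;
out:
	model_unlock();
	return ret;
}

/* Clears the mark on every zone in the list. */
static void model_oom_zonelist_unlock(const int *zones, size_t n)
{
	size_t i;

	model_lock();
	for (i = 0; i < n; i++) {
		assert(zones[i] >= 0 && zones[i] < MODEL_MAX_ZONES);
		model_zone_oom_locked[zones[i]] = false;
	}
	model_unlock();
}
```

Since callers only test success or failure, bool expresses the contract
directly: a failed trylock leaves every zone flag untouched.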

mm, oom: ensure memoryless node zonelist always includes zones

2014-08-07T01:01:21+00:00

With memoryless node support being worked on, it's possible that, as an
optimization, a node may not have a non-NULL zonelist.  When
CONFIG_NUMA is enabled and node 0 is memoryless, this means the zonelist
for first_online_node may become NULL.

The oom killer requires a zonelist that includes all memory zones for
the sysrq trigger and pagefault out of memory handler.

Ensure that a non-NULL zonelist is always passed to the oom killer.

[akpm@linux-foundation.org: fix non-numa build]
Signed-off-by: David Rientjes 
Cc: "Kirill A. Shutemov" 
Cc: Johannes Weiner 
Cc: Rik van Riel 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm, oom: base root bonus on current usage

2014-01-31T00:56:56+00:00

A bonus of 3% of system memory is sometimes excessive in comparison to
other processes.

With commit a63d83f427fb ("oom: badness heuristic rewrite"), the OOM
killer tries to avoid killing privileged tasks by subtracting 3% of
overall memory (system or cgroup) from their per-task consumption.  But
as a result, all root tasks that consume less than 3% of overall memory
are considered equal, and so it only takes 33+ privileged tasks pushing
the system out of memory for the OOM killer to do something stupid and
kill dhclient or other root-owned processes.  For example, on a 32G
machine it can't tell the difference between the 1M agetty and the 10G
fork bomb member.

The changelog describes this 3% boost as the equivalent to the global
overcommit limit being 3% higher for privileged tasks, but this is not
the same as discounting 3% of overall memory from _every privileged task
individually_ during OOM selection.

Replace the 3% of system memory bonus with a 3% of current memory usage
bonus.

By giving root tasks a bonus that is proportional to their actual size,
they remain comparable even when relatively small.  In the example
above, the OOM killer will discount the 1M agetty's 256 badness points
down to 179, and the 10G fork bomb's 262144 points down to 183500 points
and make the right choice, instead of discounting both to 0 and killing
agetty because it's first in the task list.

Signed-off-by: David Rientjes 
Reported-by: Johannes Weiner 
Acked-by: Johannes Weiner 
Cc: Michal Hocko 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds
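
The difference between the two bonus schemes can be sketched in plain
C.  This is a simplified model, not the kernel's oom_badness(): the
function names are hypothetical, points are counted directly in pages,
and integer rounding is an assumption of the sketch.

```c
#include <assert.h>

/*
 * Old scheme (model): subtract 3% of total memory from the task's
 * usage, clamping at zero.
 */
static unsigned long model_root_points_old(unsigned long usage_pages,
					   unsigned long total_pages)
{
	unsigned long bonus = total_pages * 3 / 100;

	return usage_pages > bonus ? usage_pages - bonus : 0;
}

/* New scheme (model): subtract 3% of the task's own usage. */
static unsigned long model_root_points_new(unsigned long usage_pages)
{
	assert(usage_pages <= (unsigned long)-1 / 3); /* avoid overflow */
	return usage_pages - usage_pages * 3 / 100;
}
```

With any two root tasks that each use less than 3% of total memory,
model_root_points_old() returns 0 for both, so they cannot be ranked,
while model_root_points_new() keeps the larger task ahead of the
smaller one.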

mm, oom: prefer thread group leaders for display purposes

2014-01-24T00:36:53+00:00

When two threads have the same badness score, it's preferable to kill
the thread group leader so that the actual process name is printed to
the kernel log rather than the thread group name which may be shared
amongst several processes.

This was the behavior when select_bad_process() used to do
for_each_process(), but it now iterates threads instead and leads to
ambiguity.

Signed-off-by: David Rientjes 
Cc: Johannes Weiner 
Cc: Michal Hocko 
Cc: KAMEZAWA Hiroyuki 
Cc: Greg Thelen 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds
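
A small sketch of tie-breaking in favour of a thread group leader while
iterating over threads.  The struct and function names are
hypothetical, and precomputed points stand in for the badness
calculation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-thread record with an already computed score. */
struct model_thread {
	const char *comm;
	unsigned long points;
	bool is_group_leader;
};

/* Pick the highest score; on a tie, keep an already chosen leader. */
static const struct model_thread *
model_select_bad_thread(const struct model_thread *threads, size_t n)
{
	const struct model_thread *chosen = NULL;
	unsigned long chosen_points = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		unsigned long points = threads[i].points;

		if (!points || points < chosen_points)
			continue;
		/* Equal score: do not replace a group leader. */
		if (points == chosen_points && chosen &&
		    chosen->is_group_leader)
			continue;
		chosen = &threads[i];
		chosen_points = points;
	}
	return chosen;
}
```

A non-leader with a strictly higher score still wins; the preference
only applies when scores are equal.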

oom_kill: add rcu_read_lock() into find_lock_task_mm()

2014-01-22T00:19:46+00:00

find_lock_task_mm() expects it is called under rcu or tasklist lock, but
it seems that at least oom_unkillable_task()->task_in_mem_cgroup() and
mem_cgroup_out_of_memory()->oom_badness() can call it lockless.

Perhaps we could fix the callers, but this patch simply adds the rcu
lock into find_lock_task_mm().  This also allows one of its callers,
oom_kill_process(), to be simplified a bit.

Signed-off-by: Oleg Nesterov 
Cc: Sergey Dyasly 
Cc: Sameer Nanda 
Cc: "Eric W. Biederman" 
Cc: Frederic Weisbecker 
Cc: Mandeep Singh Baines 
Cc: "Ma, Xindong" 
Reviewed-by: Michal Hocko 
Cc: "Tu, Xiaobing" 
Acked-by: David Rientjes 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

oom_kill: has_intersects_mems_allowed() needs rcu_read_lock()

2014-01-22T00:19:46+00:00

At least out_of_memory() calls has_intersects_mems_allowed() without
even rcu_read_lock(); this is obviously buggy.

Add the necessary rcu_read_lock().  This means that we cannot simply
return from the loop; we need "bool ret" and "break".

While at it, swap the names of task_struct's (the argument and the
local).  This cleans up the code a little bit and avoids the unnecessary
initialization.

Signed-off-by: Oleg Nesterov 
Reviewed-by: Sergey Dyasly 
Tested-by: Sergey Dyasly 
Reviewed-by: Sameer Nanda 
Cc: "Eric W. Biederman" 
Cc: Frederic Weisbecker 
Cc: Mandeep Singh Baines 
Cc: "Ma, Xindong" 
Reviewed-by: Michal Hocko 
Cc: "Tu, Xiaobing" 
Acked-by: David Rientjes 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds
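
The "bool ret" and "break" shape can be sketched in userspace as
follows.  The model_rcu_* helpers are hypothetical stand-ins that only
count nesting depth so that an unbalanced exit would be visible; they
provide no real RCU protection.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Depth counter standing in for the read-side critical section. */
static int model_rcu_depth;

static void model_rcu_read_lock(void)
{
	model_rcu_depth++;
}

static void model_rcu_read_unlock(void)
{
	assert(model_rcu_depth > 0);
	model_rcu_depth--;
}

/* Returns true if any thread in the array is allowed. */
static bool model_any_thread_allowed(const bool *allowed, size_t n)
{
	bool ret = false;
	size_t i;

	model_rcu_read_lock();
	for (i = 0; i < n; i++) {
		if (allowed[i]) {
			/* A bare "return true" here would skip the unlock. */
			ret = true;
			break;
		}
	}
	model_rcu_read_unlock();

	return ret;
}
```

Every path through the function leaves model_rcu_depth where it
started, which is the property the single exit point is there to
guarantee.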

oom_kill: change oom_kill.c to use for_each_thread()

2014-01-22T00:19:46+00:00

Change oom_kill.c to use for_each_thread() rather than the racy
while_each_thread() which can loop forever if we race with exit.

Note also that most users were buggy even if while_each_thread() were
fine: the task can exit even _before_ rcu_read_lock().

Fortunately the new for_each_thread() only requires the stable
task_struct, so this change fixes both problems.

Signed-off-by: Oleg Nesterov 
Reviewed-by: Sergey Dyasly 
Tested-by: Sergey Dyasly 
Reviewed-by: Sameer Nanda 
Cc: "Eric W. Biederman" 
Cc: Frederic Weisbecker 
Cc: Mandeep Singh Baines 
Cc: "Ma, Xindong" 
Reviewed-by: Michal Hocko 
Cc: "Tu, Xiaobing" 
Acked-by: David Rientjes 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: convert mm->nr_ptes to atomic_long_t

2013-11-15T00:32:14+00:00

With the split page table lock at PMD level, we can't hold
mm->page_table_lock while updating nr_ptes.

Let's convert it to atomic_long_t to avoid races.

Signed-off-by: Kirill A. Shutemov 
Tested-by: Alex Thorlton 
Cc: Ingo Molnar 
Cc: Naoya Horiguchi 
Cc: "Eric W. Biederman" 
Cc: "Paul E. McKenney" 
Cc: Al Viro 
Cc: Andi Kleen 
Cc: Andrea Arcangeli 
Cc: Dave Hansen 
Cc: Dave Jones 
Cc: David Howells 
Cc: Frederic Weisbecker 
Cc: Johannes Weiner 
Cc: Kees Cook 
Cc: Mel Gorman 
Cc: Michael Kerrisk 
Cc: Oleg Nesterov 
Cc: Peter Zijlstra 
Cc: Rik van Riel 
Cc: Robin Holt 
Cc: Sedat Dilek 
Cc: Srikar Dronamraju 
Cc: Thomas Gleixner 
Cc: Hugh Dickins 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds
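
A userspace sketch of a page table counter that is updated without any
lock.  C11 atomic_long stands in for the kernel's atomic_long_t, and
struct model_mm and the helper names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical mm with only the counter that matters here. */
struct model_mm {
	atomic_long nr_ptes;
};

static void model_mm_init(struct model_mm *mm)
{
	assert(mm != NULL);
	atomic_init(&mm->nr_ptes, 0);
}

/* Account one newly allocated page table; no lock is required. */
static void model_pte_table_alloc(struct model_mm *mm)
{
	atomic_fetch_add(&mm->nr_ptes, 1);
}

/* Account one freed page table. */
static void model_pte_table_free(struct model_mm *mm)
{
	atomic_fetch_sub(&mm->nr_ptes, 1);
}

/* Read the current count, e.g. for a badness calculation. */
static long model_nr_ptes(struct model_mm *mm)
{
	return atomic_load(&mm->nr_ptes);
}
```

Because each update is a single atomic read-modify-write, concurrent
updaters holding different split locks cannot lose increments.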

mm: memcg: handle non-error OOM situations more gracefully

2013-10-17T04:35:53+00:00

Commit 3812c8c8f395 ("mm: memcg: do not trap chargers with full
callstack on OOM") assumed that only a few places that can trigger a
memcg OOM situation do not return VM_FAULT_OOM, like optional page cache
readahead.  But there are many more and it's impractical to annotate
them all.

First of all, we don't want to invoke the OOM killer when the failed
allocation is gracefully handled, so defer the actual kill to the end of
the fault handling as well.  As an added bonus, this simplifies the
code quite a bit.

Second, since a failed allocation might not be the abrupt end of the
fault, the memcg OOM handler needs to be re-entrant until the fault
finishes for subsequent allocation attempts.  If an allocation is
attempted after the task already OOMed, allow it to bypass the limit so
that it can quickly finish the fault and invoke the OOM killer.

Reported-by: azurIt 
Signed-off-by: Johannes Weiner 
Cc: Michal Hocko 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds
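
The deferred handling described above can be modelled roughly as
follows.  This is a sketch under assumptions, not the memcg code: all
names are hypothetical, a single flag stands in for the per-task OOM
state, and the decision at the end of the fault is reduced to one
boolean.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-task and per-cgroup state. */
struct model_task_oom {
	bool in_memcg_oom; /* a charge already failed during this fault */
};

struct model_memcg {
	unsigned long usage;
	unsigned long limit;
};

/* Try to charge pages; returns true if the charge went through. */
static bool model_charge(struct model_memcg *cg, struct model_task_oom *t,
			 unsigned long pages)
{
	assert(cg != NULL && t != NULL);

	/* Already OOMed in this fault: bypass the limit to finish quickly. */
	if (t->in_memcg_oom) {
		cg->usage += pages;
		return true;
	}
	if (cg->usage + pages <= cg->limit) {
		cg->usage += pages;
		return true;
	}
	/* Only record the OOM; the kill decision is made at fault end. */
	t->in_memcg_oom = true;
	return false;
}

/*
 * End of fault handling.  Returns true if the OOM killer should run:
 * a charge failed and the fault itself ended with an OOM error.
 */
static bool model_fault_end(struct model_task_oom *t, bool fault_was_oom)
{
	bool oomed = t->in_memcg_oom;

	t->in_memcg_oom = false;
	return oomed && fault_was_oom;
}
```

If the caller handled the failed charge gracefully, model_fault_end()
is called with fault_was_oom set to false and only clears the recorded
state.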