linux-toradex.git/mm/oom_kill.c, branch v3.14.34

OOM, PM: OOM killed task shouldn't escape PM suspend

2014-11-14T17:00:01+00:00

commit 5695be142e203167e3cb515ef86a88424f3524eb upstream.

PM freezer relies on having all tasks frozen by the time devices are
getting frozen so that no task will touch them while they are getting
frozen. But OOM killer is allowed to kill an already frozen task in
order to handle OOM situtation. In order to protect from late wake ups
OOM killer is disabled after all tasks are frozen. This, however, still
keeps a window open when a killed task didn't manage to die by the time
freeze_processes finishes.

Reduce the race window by checking all tasks after OOM killer has been
disabled. This is still not race free completely unfortunately because
oom_killer_disable cannot stop an already ongoing OOM killer so a task
might still wake up from the fridge and get killed without
freeze_processes noticing. Full synchronization of OOM and freezer is,
however, too heavy weight for this highly unlikely case.

Introduce and check oom_kills counter which gets incremented early when
the allocator enters __alloc_pages_may_oom path and only check all the
tasks if the counter changes during the freezing attempt. The counter
is updated so early to reduce the race window since allocator checked
oom_killer_disabled which is set by PM-freezing code. A false positive
will push the PM-freezer into a slow path but that is not a big deal.

Changes since v1
- push the re-check loop out of freeze_processes into
  check_frozen_processes and invert the condition to make the code more
  readable as per Rafael

Fixes: f660daac474c6f (oom: thaw threads if oom killed thread is frozen before deferring)
Signed-off-by: Michal Hocko 
Signed-off-by: Rafael J. Wysocki 
Signed-off-by: Greg Kroah-Hartman

mm, oom: base root bonus on current usage

2014-01-31T00:56:56+00:00

A 3% of system memory bonus is sometimes too excessive in comparison to
other processes.

With commit a63d83f427fb ("oom: badness heuristic rewrite"), the OOM
killer tries to avoid killing privileged tasks by subtracting 3% of
overall memory (system or cgroup) from their per-task consumption.  But
as a result, all root tasks that consume less than 3% of overall memory
are considered equal, and so it only takes 33+ privileged tasks pushing
the system out of memory for the OOM killer to do something stupid and
kill dhclient or other root-owned processes.  For example, on a 32G
machine it can't tell the difference between the 1M agetty and the 10G
fork bomb member.

The changelog describes this 3% boost as the equivalent to the global
overcommit limit being 3% higher for privileged tasks, but this is not
the same as discounting 3% of overall memory from _every privileged task
individually_ during OOM selection.

Replace the 3% of system memory bonus with a 3% of current memory usage
bonus.

By giving root tasks a bonus that is proportional to their actual size,
they remain comparable even when relatively small.  In the example
above, the OOM killer will discount the 1M agetty's 256 badness points
down to 179, and the 10G fork bomb's 262144 points down to 183500 points
and make the right choice, instead of discounting both to 0 and killing
agetty because it's first in the task list.

Signed-off-by: David Rientjes 
Reported-by: Johannes Weiner 
Acked-by: Johannes Weiner 
Cc: Michal Hocko 
Cc: 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm, oom: prefer thread group leaders for display purposes

2014-01-24T00:36:53+00:00

When two threads have the same badness score, it's preferable to kill
the thread group leader so that the actual process name is printed to
the kernel log rather than the thread group name which may be shared
amongst several processes.

This was the behavior when select_bad_process() used to do
for_each_process(), but it now iterates threads instead and leads to
ambiguity.

Signed-off-by: David Rientjes 
Cc: Johannes Weiner 
Cc: Michal Hocko 
Cc: KAMEZAWA Hiroyuki 
Cc: Greg Thelen 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

oom_kill: add rcu_read_lock() into find_lock_task_mm()

2014-01-22T00:19:46+00:00

find_lock_task_mm() expects it is called under rcu or tasklist lock, but
it seems that at least oom_unkillable_task()->task_in_mem_cgroup() and
mem_cgroup_out_of_memory()->oom_badness() can call it lockless.

Perhaps we could fix the callers, but this patch simply adds rcu lock
into find_lock_task_mm().  This also allows to simplify a bit one of its
callers, oom_kill_process().

Signed-off-by: Oleg Nesterov 
Cc: Sergey Dyasly 
Cc: Sameer Nanda 
Cc: "Eric W. Biederman" 
Cc: Frederic Weisbecker 
Cc: Mandeep Singh Baines 
Cc: "Ma, Xindong" 
Reviewed-by: Michal Hocko 
Cc: "Tu, Xiaobing" 
Acked-by: David Rientjes 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

oom_kill: has_intersects_mems_allowed() needs rcu_read_lock()

2014-01-22T00:19:46+00:00

At least out_of_memory() calls has_intersects_mems_allowed() without
even rcu_read_lock(), this is obviously buggy.

Add the necessary rcu_read_lock().  This means that we can not simply
return from the loop, we need "bool ret" and "break".

While at it, swap the names of task_struct's (the argument and the
local).  This cleans up the code a little bit and avoids the unnecessary
initialization.

Signed-off-by: Oleg Nesterov 
Reviewed-by: Sergey Dyasly 
Tested-by: Sergey Dyasly 
Reviewed-by: Sameer Nanda 
Cc: "Eric W. Biederman" 
Cc: Frederic Weisbecker 
Cc: Mandeep Singh Baines 
Cc: "Ma, Xindong" 
Reviewed-by: Michal Hocko 
Cc: "Tu, Xiaobing" 
Acked-by: David Rientjes 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

oom_kill: change oom_kill.c to use for_each_thread()

2014-01-22T00:19:46+00:00

Change oom_kill.c to use for_each_thread() rather than the racy
while_each_thread() which can loop forever if we race with exit.

Note also that most users were buggy even if while_each_thread() was
fine, the task can exit even _before_ rcu_read_lock().

Fortunately the new for_each_thread() only requires the stable
task_struct, so this change fixes both problems.

Signed-off-by: Oleg Nesterov 
Reviewed-by: Sergey Dyasly 
Tested-by: Sergey Dyasly 
Reviewed-by: Sameer Nanda 
Cc: "Eric W. Biederman" 
Cc: Frederic Weisbecker 
Cc: Mandeep Singh Baines 
Cc: "Ma, Xindong" 
Reviewed-by: Michal Hocko 
Cc: "Tu, Xiaobing" 
Acked-by: David Rientjes 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: convert mm->nr_ptes to atomic_long_t

2013-11-15T00:32:14+00:00

With split page table lock for PMD level we can't hold mm->page_table_lock
while updating nr_ptes.

Let's convert it to atomic_long_t to avoid races.

Signed-off-by: Kirill A. Shutemov 
Tested-by: Alex Thorlton 
Cc: Ingo Molnar 
Cc: Naoya Horiguchi 
Cc: "Eric W . Biederman" 
Cc: "Paul E . McKenney" 
Cc: Al Viro 
Cc: Andi Kleen 
Cc: Andrea Arcangeli 
Cc: Dave Hansen 
Cc: Dave Jones 
Cc: David Howells 
Cc: Frederic Weisbecker 
Cc: Johannes Weiner 
Cc: Kees Cook 
Cc: Mel Gorman 
Cc: Michael Kerrisk 
Cc: Oleg Nesterov 
Cc: Peter Zijlstra 
Cc: Rik van Riel 
Cc: Robin Holt 
Cc: Sedat Dilek 
Cc: Srikar Dronamraju 
Cc: Thomas Gleixner 
Cc: Hugh Dickins 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: memcg: handle non-error OOM situations more gracefully

2013-10-17T04:35:53+00:00

Commit 3812c8c8f395 ("mm: memcg: do not trap chargers with full
callstack on OOM") assumed that only a few places that can trigger a
memcg OOM situation do not return VM_FAULT_OOM, like optional page cache
readahead.  But there are many more and it's impractical to annotate
them all.

First of all, we don't want to invoke the OOM killer when the failed
allocation is gracefully handled, so defer the actual kill to the end of
the fault handling as well.  This simplifies the code quite a bit for
added bonus.

Second, since a failed allocation might not be the abrupt end of the
fault, the memcg OOM handler needs to be re-entrant until the fault
finishes for subsequent allocation attempts.  If an allocation is
attempted after the task already OOMed, allow it to bypass the limit so
that it can quickly finish the fault and invoke the OOM killer.

Reported-by: azurIt 
Signed-off-by: Johannes Weiner 
Cc: Michal Hocko 
Cc: 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: memcg: do not trap chargers with full callstack on OOM

2013-09-12T22:38:02+00:00

The memcg OOM handling is incredibly fragile and can deadlock.  When a
task fails to charge memory, it invokes the OOM killer and loops right
there in the charge code until it succeeds.  Comparably, any other task
that enters the charge path at this point will go to a waitqueue right
then and there and sleep until the OOM situation is resolved.  The problem
is that these tasks may hold filesystem locks and the mmap_sem; locks that
the selected OOM victim may need to exit.

For example, in one reported case, the task invoking the OOM killer was
about to charge a page cache page during a write(), which holds the
i_mutex.  The OOM killer selected a task that was just entering truncate()
and trying to acquire the i_mutex:

OOM invoking task:
  mem_cgroup_handle_oom+0x241/0x3b0
  mem_cgroup_cache_charge+0xbe/0xe0
  add_to_page_cache_locked+0x4c/0x140
  add_to_page_cache_lru+0x22/0x50
  grab_cache_page_write_begin+0x8b/0xe0
  ext3_write_begin+0x88/0x270
  generic_file_buffered_write+0x116/0x290
  __generic_file_aio_write+0x27c/0x480
  generic_file_aio_write+0x76/0xf0           # takes ->i_mutex
  do_sync_write+0xea/0x130
  vfs_write+0xf3/0x1f0
  sys_write+0x51/0x90
  system_call_fastpath+0x18/0x1d

OOM kill victim:
  do_truncate+0x58/0xa0              # takes i_mutex
  do_last+0x250/0xa30
  path_openat+0xd7/0x440
  do_filp_open+0x49/0xa0
  do_sys_open+0x106/0x240
  sys_open+0x20/0x30
  system_call_fastpath+0x18/0x1d

The OOM handling task will retry the charge indefinitely while the OOM
killed task is not releasing any resources.

A similar scenario can happen when the kernel OOM killer for a memcg is
disabled and a userspace task is in charge of resolving OOM situations.
In this case, ALL tasks that enter the OOM path will be made to sleep on
the OOM waitqueue and wait for userspace to free resources or increase
the group's limit.  But a userspace OOM handler is prone to deadlock
itself on the locks held by the waiting tasks.  For example one of the
sleeping tasks may be stuck in a brk() call with the mmap_sem held for
writing but the userspace handler, in order to pick an optimal victim,
may need to read files from /proc/, which tries to acquire the same
mmap_sem for reading and deadlocks.

This patch changes the way tasks behave after detecting a memcg OOM and
makes sure nobody loops or sleeps with locks held:

1. When OOMing in a user fault, invoke the OOM killer and restart the
   fault instead of looping on the charge attempt.  This way, the OOM
   victim can not get stuck on locks the looping task may hold.

2. When OOMing in a user fault but somebody else is handling it
   (either the kernel OOM killer or a userspace handler), don't go to
   sleep in the charge context.  Instead, remember the OOMing memcg in
   the task struct and then fully unwind the page fault stack with
   -ENOMEM.  pagefault_out_of_memory() will then call back into the
   memcg code to check if the -ENOMEM came from the memcg, and then
   either put the task to sleep on the memcg's OOM waitqueue or just
   restart the fault.  The OOM victim can no longer get stuck on any
   lock a sleeping task may hold.

Debugged by Michal Hocko.

Signed-off-by: Johannes Weiner 
Reported-by: azurIt 
Acked-by: Michal Hocko 
Cc: David Rientjes 
Cc: KAMEZAWA Hiroyuki 
Cc: KOSAKI Motohiro 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm/oom_kill: remove weird use of ERR_PTR()/PTR_ERR().

2013-07-15T01:55:05+00:00

The normal expectation for ERR_PTR() is to put a negative errno into a
pointer.  oom_kill puts the magic -1 in the result (and has since
pre-git), which is probably clearer with an explicit cast.

Cc: Andrew Morton 
Signed-off-by: Rusty Russell