linux-toradex.git/mm/oom_kill.c, branch v3.0.80

oom: fix integer overflow of points in oom_badness

2012-01-06T22:13:51+00:00

commit ff05b6f7ae762b6eb464183eec994b28ea09f6dd upstream.

An integer overflow will happen on 64bit archs if task's sum of rss,
swapents and nr_ptes exceeds (2^31)/1000 value.  This was introduced by
commit

f755a04 oom: use pte pages in OOM score

where the oom score computation was divided into several steps and it's no
longer computed as one expression in unsigned long(rss, swapents, nr_pte
are unsigned long), where the result value assigned to points(int) is in
range(1..1000).  So there could be an int overflow while computing

176          points *= 1000;

and points may have negative value. Meaning the oom score for a mem hog task
will be one.

196          if (points <= 0)
197                  return 1;

For example:
[ 3366]     0  3366 35390480 24303939   5       0             0 oom01
Out of memory: Kill process 3366 (oom01) score 1 or sacrifice child

Here the oom1 process consumes more than 24303939(rss)*4096~=92GB physical
memory, but it's oom score is one.

In this situation the mem hog task is skipped and oom killer kills another and
most probably innocent task with oom score greater than one.

The points variable should be of type long instead of int to prevent the
int overflow.

Signed-off-by: Frantisek Hrbata 
Acked-by: KOSAKI Motohiro 
Acked-by: Oleg Nesterov 
Acked-by: David Rientjes 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman

oom: task->mm == NULL doesn't mean the memory was freed

2011-08-05T04:58:42+00:00

commit c027a474a68065391c8773f6e83ed5412657e369 upstream.

exit_mm() sets ->mm == NULL then it does mmput()->exit_mmap() which
frees the memory.

However select_bad_process() checks ->mm != NULL before TIF_MEMDIE,
so it continues to kill other tasks even if we have the oom-killed
task freeing its memory.

Change select_bad_process() to check ->mm after TIF_MEMDIE, but skip
the tasks which have already passed exit_notify() to ensure a zombie
with TIF_MEMDIE set can't block oom-killer. Alternatively we could
probably clear TIF_MEMDIE after exit_mmap().

Signed-off-by: Oleg Nesterov 
Reviewed-by: KOSAKI Motohiro 
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman

oom: replace PF_OOM_ORIGIN with toggling oom_score_adj

2011-05-25T15:39:10+00:00

There's a kernel-wide shortage of per-process flags, so it's always
helpful to trim one when possible without incurring a significant penalty.
 It's even more important when you're planning on adding a per- process
flag yourself, which I plan to do shortly for transparent hugepages.

PF_OOM_ORIGIN is used by ksm and swapoff to prefer current since it has a
tendency to allocate large amounts of memory and should be preferred for
killing over other tasks.  We'd rather immediately kill the task making
the errant syscall rather than penalizing an innocent task.

This patch removes PF_OOM_ORIGIN since its behavior is equivalent to
setting the process's oom_score_adj to OOM_SCORE_ADJ_MAX.

The process's old oom_score_adj is stored and then set to
OOM_SCORE_ADJ_MAX during the time it used to have PF_OOM_ORIGIN.  The old
value is then reinstated when the process should no longer be considered a
high priority for oom killing.

Signed-off-by: David Rientjes 
Reviewed-by: KOSAKI Motohiro 
Reviewed-by: Minchan Kim 
Cc: Hugh Dickins 
Cc: Izik Eidus 
Cc: KAMEZAWA Hiroyuki 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

oom: use pte pages in OOM score

2011-04-28T18:28:21+00:00

PTE pages eat up memory just like anything else, but we do not account for
them in any way in the OOM scores.  They are also _guaranteed_ to get
freed up when a process is OOM killed, while RSS is not.

Reported-by: Dave Hansen 
Signed-off-by: KOSAKI Motohiro 
Cc: Hugh Dickins 
Cc: KAMEZAWA Hiroyuki 
Cc: Oleg Nesterov 
Acked-by: David Rientjes 
Cc: 		[2.6.36+]
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

oom-kill: remove boost_dying_task_prio()

2011-04-14T23:06:56+00:00

This is an almost-revert of commit 93b43fa ("oom: give the dying task a
higher priority").

That commit dramatically improved oom killer logic when a fork-bomb
occurs.  But I've found that it has nasty corner case.  Now cpu cgroup has
strange default RT runtime.  It's 0!  That said, if a process under cpu
cgroup promote RT scheduling class, the process never run at all.

If an admin inserts a !RT process into a cpu cgroup by setting
rtruntime=0, usually it runs perfectly because a !RT task isn't affected
by the rtruntime knob.  But if it promotes an RT task via an explicit
setscheduler() syscall or an OOM, the task can't run at all.  In short,
the oom killer doesn't work at all if admins are using cpu cgroup and don't
touch the rtruntime knob.

Eventually, kernel may hang up when oom kill occur.  I and the original
author Luis agreed to disable this logic.

Signed-off-by: KOSAKI Motohiro 
Acked-by: Luis Claudio R. Goncalves 
Acked-by: KAMEZAWA Hiroyuki 
Reviewed-by: Minchan Kim 
Acked-by: David Rientjes 
Cc: 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

lib, arch: add filter argument to show_mem and fix private implementations

2011-03-25T00:49:37+00:00

Commit ddd588b5dd55 ("oom: suppress nodes that are not allowed from
meminfo on oom kill") moved lib/show_mem.o out of lib/lib.a, which
resulted in build warnings on all architectures that implement their own
versions of show_mem():

	lib/lib.a(show_mem.o): In function `show_mem':
	show_mem.c:(.text+0x1f4): multiple definition of `show_mem'
	arch/sparc/mm/built-in.o:(.text+0xd70): first defined here

The fix is to remove __show_mem() and add its argument to show_mem() in
all implementations to prevent this breakage.

Architectures that implement their own show_mem() actually don't do
anything with the argument yet, but they could be made to filter nodes
that aren't allowed in the current context in the future just like the
generic implementation.

Reported-by: Stephen Rothwell 
Reported-by: James Bottomley 
Suggested-by: Andrew Morton 
Signed-off-by: David Rientjes 
Signed-off-by: Linus Torvalds

memcg: give current access to memory reserves if it's trying to die

2011-03-24T02:46:33+00:00

When a memcg is oom and current has already received a SIGKILL, then give
it access to memory reserves with a higher scheduling priority so that it
may quickly exit and free its memory.

This is identical to the global oom killer and is done even before
checking for panic_on_oom: a pending SIGKILL here while panic_on_oom is
selected is guaranteed to have come from userspace; the thread only needs
access to memory reserves to exit and thus we don't unnecessarily panic
the machine until the kernel has no last resort to free memory.

Signed-off-by: David Rientjes 
Cc: Balbir Singh 
Cc: Daisuke Nishimura 
Acked-by: KAMEZAWA Hiroyuki 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

oom: suppress nodes that are not allowed from meminfo on oom kill

2011-03-23T00:44:01+00:00

The oom killer is extremely verbose for machines with a large number of
cpus and/or nodes.  This verbosity can often be harmful if it causes other
important messages to be scrolled from the kernel log and incurs a
signicant time delay, specifically for kernels with CONFIG_NODES_SHIFT >
8.

This patch causes only memory information to be displayed for nodes that
are allowed by current's cpuset when dumping the VM state.  Information
for all other nodes is irrelevant to the oom condition; we don't care if
there's an abundance of memory elsewhere if we can't access it.

This only affects the behavior of dumping memory information when an oom
is triggered.  Other dumps, such as for sysrq+m, still display the
unfiltered form when using the existing show_mem() interface.

Additionally, the per-cpu pageset statistics are extremely verbose in oom
killer output, so it is now suppressed.  This removes

	nodes_weight(current->mems_allowed) * (1 + nr_cpus)

lines from the oom killer output.

Callers may use __show_mem(SHOW_MEM_FILTER_NODES) to filter disallowed
nodes.

Signed-off-by: David Rientjes 
Cc: Mel Gorman 
Cc: KAMEZAWA Hiroyuki 
Cc: KOSAKI Motohiro 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

oom: avoid deferring oom killer if exiting task is being traced

2011-03-23T00:43:58+00:00

The oom killer naturally defers killing anything if it finds an eligible
task that is already exiting and has yet to detach its ->mm.  This avoids
unnecessarily killing tasks when one is already in the exit path and may
free enough memory that the oom killer is no longer needed.  This is
detected by PF_EXITING since threads that have already detached its ->mm
are no longer considered at all.

The problem with always deferring when a thread is PF_EXITING, however, is
that it may never actually exit when being traced, specifically if another
task is tracing it with PTRACE_O_TRACEEXIT.  The oom killer does not want
to defer in this case since there is no guarantee that thread will ever
exit without intervention.

This patch will now only defer the oom killer when a thread is PF_EXITING
and no ptracer has stopped its progress in the exit path.  It also ensures
that a child is sacrificed for the chosen parent only if it has a
different ->mm as the comment implies: this ensures that the thread group
leader is always targeted appropriately.

Signed-off-by: David Rientjes 
Reported-by: Oleg Nesterov 
Cc: KOSAKI Motohiro 
Cc: KAMEZAWA Hiroyuki 
Cc: Hugh Dickins 
Cc: Andrey Vagin 
Cc: 		[2.6.38.x]
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

oom: skip zombies when iterating tasklist

2011-03-23T00:43:58+00:00

We shouldn't defer oom killing if a thread has already detached its ->mm
and still has TIF_MEMDIE set.  Memory needs to be freed, so find kill
other threads that pin the same ->mm or find another task to kill.

Signed-off-by: Andrey Vagin 
Signed-off-by: David Rientjes 
Cc: KOSAKI Motohiro 
Cc: 		[2.6.38.x]
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds