linux-toradex.git/kernel/cgroup.c, branch v3.2.61

cgroup: update cgroup_enable_task_cg_lists() to grab siglock

2014-04-01T23:58:54+00:00

commit 532de3fc72adc2a6525c4d53c07bf81e1732083d upstream.

Currently, there's nothing preventing cgroup_enable_task_cg_lists()
from missing set PF_EXITING and race against cgroup_exit().  Depending
on the timing, cgroup_exit() may finish with the task still linked on
css_set leading to list corruption.  Fix it by grabbing siglock in
cgroup_enable_task_cg_lists() so that PF_EXITING is guaranteed to be
visible.

This whole on-demand cg_list optimization is extremely fragile and has
ample possibility to lead to bugs which can cause things like
once-a-year oops during boot.  I'm wondering whether the better
approach would be just adding "cgroup_disable=all" handling which
disables the whole cgroup rather than tempting fate with this
on-demand craziness.

Signed-off-by: Tejun Heo 
Acked-by: Li Zefan 
[bwh: Backported to 3.2: adjust context]
Signed-off-by: Ben Hutchings

cgroup: fail if monitored file and event_control are in different cgroup

2013-10-26T20:06:11+00:00

commit f169007b2773f285e098cb84c74aac0154d65ff7 upstream.

If we pass fd of memory.usage_in_bytes of cgroup A to cgroup.event_control
of cgroup B, then we won't get memory usage notification from A but B!

What's worse, if A and B are in different mount hierarchy, we'll end up
accessing NULL pointer!

Disallow this kind of invalid usage.

Signed-off-by: Li Zefan 
Acked-by: Kirill A. Shutemov 
Signed-off-by: Tejun Heo 
Signed-off-by: Ben Hutchings

cgroup: fix an off-by-one bug which may trigger BUG_ON()

2013-05-13T14:02:11+00:00

commit 3ac1707a13a3da9cfc8f242a15b2fae6df2c5f88 upstream.

The 3rd parameter of flex_array_prealloc() is the number of elements,
not the index of the last element.

The effect of the bug is, when opening cgroup.procs, a flex array will
be allocated and all elements of the array is allocated with
GFP_KERNEL flag, but the last one is GFP_ATOMIC, and if we fail to
allocate memory for it, it'll trigger a BUG_ON().

Signed-off-by: Li Zefan 
Signed-off-by: Tejun Heo 
Signed-off-by: Ben Hutchings

cgroup: fix exit() vs rmdir() race

2013-03-06T03:24:02+00:00

commit 71b5707e119653039e6e95213f00479668c79b75 upstream.

In cgroup_exit() put_css_set_taskexit() is called without any lock,
which might lead to accessing a freed cgroup:

thread1                           thread2
---------------------------------------------
exit()
  cgroup_exit()
    put_css_set_taskexit()
      atomic_dec(cgrp->count);
                                   rmdir();
      /* not safe !! */
      check_for_release(cgrp);

rcu_read_lock() can be used to make sure the cgroup is alive.

Signed-off-by: Li Zefan 
Signed-off-by: Tejun Heo 
Signed-off-by: Ben Hutchings

cgroup: remove incorrect dget/dput() pair in cgroup_create_dir()

2013-01-03T03:33:11+00:00

commit 175431635ec09b1d1bba04979b006b99e8305a83 upstream.

cgroup_create_dir() does weird dancing with dentry refcnt.  On
success, it gets and then puts it achieving nothing.  On failure, it
puts but there isn't no matching get anywhere leading to the following
oops if cgroup_create_file() fails for whatever reason.

  ------------[ cut here ]------------
  kernel BUG at /work/os/work/fs/dcache.c:552!
  invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
  Modules linked in:
  CPU 2
  Pid: 697, comm: mkdir Not tainted 3.7.0-rc4-work+ #3 Bochs Bochs
  RIP: 0010:[]  [] dput+0x1dc/0x1e0
  RSP: 0018:ffff88001a3ebef8  EFLAGS: 00010246
  RAX: 0000000000000000 RBX: ffff88000e5b1ef8 RCX: 0000000000000403
  RDX: 0000000000000303 RSI: 2000000000000000 RDI: ffff88000e5b1f58
  RBP: ffff88001a3ebf18 R08: ffffffff82c76960 R09: 0000000000000001
  R10: ffff880015022080 R11: ffd9bed70f48a041 R12: 00000000ffffffea
  R13: 0000000000000001 R14: ffff88000e5b1f58 R15: 00007fff57656d60
  FS:  00007ff05fcb3800(0000) GS:ffff88001fd00000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00000000004046f0 CR3: 000000001315f000 CR4: 00000000000006e0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
  Process mkdir (pid: 697, threadinfo ffff88001a3ea000, task ffff880015022080)
  Stack:
   ffff88001a3ebf48 00000000ffffffea 0000000000000001 0000000000000000
   ffff88001a3ebf38 ffffffff811cc889 0000000000000001 ffff88000e5b1ef8
   ffff88001a3ebf68 ffffffff811d1fc9 ffff8800198d7f18 ffff880019106ef8
  Call Trace:
   [] done_path_create+0x19/0x50
   [] sys_mkdirat+0x59/0x80
   [] sys_mkdir+0x19/0x20
   [] system_call_fastpath+0x16/0x1b
  Code: 00 48 8d 90 18 01 00 00 48 89 93 c0 00 00 00 4c 89 a0 18 01 00 00 48 8b 83 a0 00 00 00 83 80 28 01 00 00 01 e8 e6 6f a0 00 eb 92 <0f> 0b 66 90 0f 1f 44 00 00 55 48 89 e5 41 57 41 56 49 89 fe 41
  RIP  [] dput+0x1dc/0x1e0
   RSP 
  ---[ end trace 1277bcfd9561ddb0 ]---

Fix it by dropping the unnecessary dget/dput() pair.

Signed-off-by: Tejun Heo 
Acked-by: Li Zefan 
Signed-off-by: Ben Hutchings

cgroup: cgroup_subsys->fork() should be called after the task is added to css_set

2013-01-03T03:32:56+00:00

commit 5edee61edeaaebafe584f8fb7074c1ef4658596b upstream.

cgroup core has a bug which violates a basic rule about event
notifications - when a new entity needs to be added, you add that to
the notification list first and then make the new entity conform to
the current state.  If done in the reverse order, an event happening
inbetween will be lost.

cgroup_subsys->fork() is invoked way before the new task is added to
the css_set.  Currently, cgroup_freezer is the only user of ->fork()
and uses it to make new tasks conform to the current state of the
freezer.  If FROZEN state is requested while fork is in progress
between cgroup_fork_callbacks() and cgroup_post_fork(), the child
could escape freezing - the cgroup isn't frozen when ->fork() is
called and the freezer couldn't see the new task on the css_set.

This patch moves cgroup_subsys->fork() invocation to
cgroup_post_fork() after the new task is added to the css_set.
cgroup_fork_callbacks() is removed.

Because now a task may be migrated during cgroup_subsys->fork(),
freezer_fork() is updated so that it adheres to the usual RCU locking
and the rather pointless comment on why locking can be different there
is removed (if it doesn't make anything simpler, why even bother?).

Signed-off-by: Tejun Heo 
Cc: Oleg Nesterov 
Cc: Rafael J. Wysocki 
[bwh: Backported to 3.2:
 - Adjust context
 - Iterate over first CGROUP_BUILTIN_SUBSYS_COUNT elements of subsys
 - cgroup_subsys::fork takes cgroup_subsys pointer as first parameter]
Signed-off-by: Ben Hutchings

cgroup: notify_on_release may not be triggered in some cases

2012-10-30T23:26:48+00:00

commit 1f5320d5972aa50d3e8d2b227b636b370e608359 upstream.

notify_on_release must be triggered when the last process in a cgroup is
move to another. But if the first(and only) process in a cgroup is moved to
another, notify_on_release is not triggered.

	# mkdir /cgroup/cpu/SRC
	# mkdir /cgroup/cpu/DST
	#
	# echo 1 >/cgroup/cpu/SRC/notify_on_release
	# echo 1 >/cgroup/cpu/DST/notify_on_release
	#
	# sleep 300 &
	[1] 8629
	#
	# echo 8629 >/cgroup/cpu/SRC/tasks
	# echo 8629 >/cgroup/cpu/DST/tasks
	-> notify_on_release for /SRC must be triggered at this point,
	   but it isn't.

This is because put_css_set() is called before setting CGRP_RELEASABLE
in cgroup_task_migrate(), and is a regression introduce by the
commit:74a1166d(cgroups: make procs file writable), which was merged
into v3.0.

Cc: Ben Blum 
Acked-by: Li Zefan 
Signed-off-by: Daisuke Nishimura 
Signed-off-by: Tejun Heo 
[bwh: Backported to 3.2: adjust context]
Signed-off-by: Ben Hutchings

cgroup: fix to allow mounting a hierarchy by name

2012-01-12T19:29:35+00:00

commit 0d19ea866562e46989412a0676412fa0983c9ce7 upstream.

If we mount a hierarchy with a specified name, the name is unique,
and we can use it to mount the hierarchy without specifying its
set of subsystem names. This feature is documented is
Documentation/cgroups/cgroups.txt section 2.3

Here's an example:

	# mount -t cgroup -o cpuset,name=myhier xxx /cgroup1
	# mount -t cgroup -o name=myhier xxx /cgroup2

But it was broken by commit 32a8cf235e2f192eb002755076994525cdbaa35a
(cgroup: make the mount options parsing more accurate)

This fixes the regression.

Signed-off-by: Li Zefan 
Signed-off-by: Tejun Heo 
Signed-off-by: Greg Kroah-Hartman

cgroups: fix a css_set not found bug in cgroup_attach_proc

2011-12-19T17:09:09+00:00

There is a BUG when migrating a PF_EXITING proc. Since css_set_prefetch()
is not called for the PF_EXITING case, find_existing_css_set() will return
NULL inside cgroup_task_migrate() causing a BUG.

This bug is easy to reproduce. Create a zombie and echo its pid to
cgroup.procs.

$ cat zombie.c
\#include 

int main()
{
  if (fork())
      pause();
  return 0;
}
$

We are hitting this bug pretty regularly on ChromeOS.

This bug is already fixed by Tejun Heo's cgroup patchset which is
targetted for the next merge window:

https://lkml.org/lkml/2011/11/1/356

I've create a smaller patch here which just fixes this bug so that a
fix can be merged into the current release and stable.

Signed-off-by: Mandeep Singh Baines 
Downstream-Bug-Report: http://crosbug.com/23953
Reviewed-by: Li Zefan 
Signed-off-by: Tejun Heo 
Cc: containers@lists.linux-foundation.org
Cc: cgroups@vger.kernel.org
Cc: stable@kernel.org
Cc: KAMEZAWA Hiroyuki 
Cc: Frederic Weisbecker 
Cc: Oleg Nesterov 
Cc: Andrew Morton 
Cc: Paul Menage 
Cc: Olof Johansson

memcg: replace ss->id_lock with a rwlock

2011-11-02T23:07:03+00:00

While back-porting Johannes Weiner's patch "mm: memcg-aware global
reclaim" for an internal effort, we noticed a significant performance
regression during page-reclaim heavy workloads due to high contention of
the ss->id_lock.  This lock protects idr map, and serializes calls to
idr_get_next() in css_get_next() (which is used during the memcg hierarchy
walk).

Since idr_get_next() is just doing a look up, we need only serialize it
with respect to idr_remove()/idr_get_new().  By making the ss->id_lock a
rwlock, contention is greatly reduced and performance improves.

Tested: cat a 256m file from a ramdisk in a 128m container 50 times on
each core (one file + container per core) in parallel on a NUMA machine.
Result is the time for the test to complete in 1 of the containers.
Both kernels included Johannes' memcg-aware global reclaim patches.

Before rwlock patch: 1710.778s
After rwlock patch: 152.227s

Signed-off-by: Andrew Bresticker 
Cc: Paul Menage 
Cc: Li Zefan 
Acked-by: KAMEZAWA Hiroyuki 
Cc: Ying Han 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds