linux-toradex.git/kernel/cgroup.c, branch v4.4.106

cgroup, kthread: close race window where new kthreads can be migrated to non-root cgroups

2017-04-21T07:30:04+00:00

commit 77f88796cee819b9c4562b0b6b44691b3b7755b1 upstream.

Creation of a kthread goes through a couple interlocked stages between
the kthread itself and its creator.  Once the new kthread starts
running, it initializes itself and wakes up the creator.  The creator
then can further configure the kthread and then let it start doing its
job by waking it up.

In this configuration-by-creator stage, the creator is the only one
that can wake it up but the kthread is visible to userland.  When
altering the kthread's attributes from userland is allowed, this is
fine; however, for cases where CPU affinity is critical,
kthread_bind() is used to first disable affinity changes from userland
and then set the affinity.  This also prevents the kthread from being
migrated into non-root cgroups as that can affect the CPU affinity and
many other things.

Unfortunately, the cgroup side of protection is racy.  While the
PF_NO_SETAFFINITY flag prevents further migrations, userland can win
the race before the creator sets the flag with kthread_bind() and put
the kthread in a non-root cgroup, which can lead to all sorts of
problems including incorrect CPU affinity and starvation.

This bug got triggered by userland which periodically tries to migrate
all processes in the root cpuset cgroup to a non-root one.  Per-cpu
workqueue workers got caught while being created and ended up with
incorrected CPU affinity breaking concurrency management and sometimes
stalling workqueue execution.

This patch adds task->no_cgroup_migration which disallows the task to
be migrated by userland.  kthreadd starts with the flag set making
every child kthread start in the root cgroup with migration
disallowed.  The flag is cleared after the kthread finishes
initialization by which time PF_NO_SETAFFINITY is set if the kthread
should stay in the root cgroup.

It'd be better to wait for the initialization instead of failing but I
couldn't think of a way of implementing that without adding either a
new PF flag, or sleeping and retrying from waiting side.  Even if
userland depends on changing cgroup membership of a kthread, it either
has to be synchronized with kthread_create() or periodically repeat,
so it's unlikely that this would break anything.

v2: Switch to a simpler implementation using a new task_struct bit
    field suggested by Oleg.

Signed-off-by: Tejun Heo 
Suggested-by: Oleg Nesterov 
Cc: Linus Torvalds 
Cc: Andrew Morton 
Cc: Peter Zijlstra (Intel) 
Cc: Thomas Gleixner 
Reported-and-debugged-by: Chris Mason 
Signed-off-by: Tejun Heo 
Signed-off-by: Greg Kroah-Hartman

cgroup: avoid false positive gcc-6 warning

2016-11-10T15:36:36+00:00

commit cfe02a8a973e7e5f66926b8ae38dfce404b19e29 upstream.

When all subsystems are disabled, gcc notices that cgroup_subsys_enabled_key
is a zero-length array and that any access to it must be out of bounds:

In file included from ../include/linux/cgroup.h:19:0,
                 from ../kernel/cgroup.c:31:
../kernel/cgroup.c: In function 'cgroup_add_cftypes':
../kernel/cgroup.c:261:53: error: array subscript is above array bounds [-Werror=array-bounds]
  return static_key_enabled(cgroup_subsys_enabled_key[ssid]);
                            ~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~
../include/linux/jump_label.h:271:40: note: in definition of macro 'static_key_enabled'
  static_key_count((struct static_key *)x) > 0;    \
                                        ^

We should never call the function in this particular case, so this is
not a bug. In order to silence the warning, this adds an explicit check
for the CGROUP_SUBSYS_COUNT==0 case.

Signed-off-by: Arnd Bergmann 
Signed-off-by: Tejun Heo 
Signed-off-by: Greg Kroah-Hartman

cgroup: set css->id to -1 during init

2016-08-10T09:49:27+00:00

commit 8fa3b8d689a54d6d04ff7803c724fb7aca6ce98e upstream.

If percpu_ref initialization fails during css_create(), the free path
can end up trying to free css->id of zero.  As ID 0 is unused, it
doesn't cause a critical breakage but it does trigger a warning
message.  Fix it by setting css->id to -1 from init_and_link_css().

Signed-off-by: Tejun Heo 
Cc: Wenwei Tao 
Fixes: 01e586598b22 ("cgroup: release css->id after css_free")
Signed-off-by: Tejun Heo 
Signed-off-by: Greg Kroah-Hartman

cgroup: make sure a parent css isn't freed before its children

2016-05-04T21:48:49+00:00

commit 8bb5ef79bc0f4016ecf79e8dce6096a3c63603e4 upstream.

There are three subsystem callbacks in css shutdown path -
css_offline(), css_released() and css_free().  Except for
css_released(), cgroup core didn't guarantee the order of invocation.
css_offline() or css_free() could be called on a parent css before its
children.  This behavior is unexpected and led to bugs in cpu and
memory controller.

The previous patch updated ordering for css_offline() which fixes the
cpu controller issue.  While there currently isn't a known bug caused
by misordering of css_free() invocations, let's fix it too for
consistency.

css_free() ordering can be trivially fixed by moving putting of the
parent css below css_free() invocation.

Signed-off-by: Tejun Heo 
Cc: Peter Zijlstra 
Signed-off-by: Greg Kroah-Hartman

cgroup, cpuset: replace cpuset_post_attach_flush() with cgroup_subsys->post_attach callback

2016-05-04T21:48:49+00:00

commit 5cf1cacb49aee39c3e02ae87068fc3c6430659b0 upstream.

Since e93ad19d0564 ("cpuset: make mm migration asynchronous"), cpuset
kicks off asynchronous NUMA node migration if necessary during task
migration and flushes it from cpuset_post_attach_flush() which is
called at the end of __cgroup_procs_write().  This is to avoid
performing migration with cgroup_threadgroup_rwsem write-locked which
can lead to deadlock through dependency on kworker creation.

memcg has a similar issue with charge moving, so let's convert it to
an official callback rather than the current one-off cpuset specific
function.  This patch adds cgroup_subsys->post_attach callback and
makes cpuset register cpuset_post_attach_flush() as its ->post_attach.

The conversion is mostly one-to-one except that the new callback is
called under cgroup_mutex.  This is to guarantee that no other
migration operations are started before ->post_attach callbacks are
finished.  cgroup_mutex is one of the outermost mutex in the system
and has never been and shouldn't be a problem.  We can add specialized
synchronization around __cgroup_procs_write() but I don't think
there's any noticeable benefit.

Signed-off-by: Tejun Heo 
Cc: Li Zefan 
Cc: Johannes Weiner 
Cc: Michal Hocko 
Signed-off-by: Greg Kroah-Hartman

cgroup: ignore css_sets associated with dead cgroups during migration

2016-04-12T16:08:54+00:00

commit 2b021cbf3cb6208f0d40fd2f1869f237934340ed upstream.

Before 2e91fa7f6d45 ("cgroup: keep zombies associated with their
original cgroups"), all dead tasks were associated with init_css_set.
If a zombie task is requested for migration, while migration prep
operations would still be performed on init_css_set, the actual
migration would ignore zombie tasks.  As init_css_set is always valid,
this worked fine.

However, after 2e91fa7f6d45, zombie tasks stay with the css_set it was
associated with at the time of death.  Let's say a task T associated
with cgroup A on hierarchy H-1 and cgroup B on hiearchy H-2.  After T
becomes a zombie, it would still remain associated with A and B.  If A
only contains zombie tasks, it can be removed.  On removal, A gets
marked offline but stays pinned until all zombies are drained.  At
this point, if migration is initiated on T to a cgroup C on hierarchy
H-2, migration path would try to prepare T's css_set for migration and
trigger the following.

 WARNING: CPU: 0 PID: 1576 at kernel/cgroup.c:474 cgroup_get+0x121/0x160()
 CPU: 0 PID: 1576 Comm: bash Not tainted 4.4.0-work+ #289
 ...
 Call Trace:
  [] dump_stack+0x4e/0x82
  [] warn_slowpath_common+0x78/0xb0
  [] warn_slowpath_null+0x15/0x20
  [] cgroup_get+0x121/0x160
  [] link_css_set+0x7b/0x90
  [] find_css_set+0x3bc/0x5e0
  [] cgroup_migrate_prepare_dst+0x89/0x1f0
  [] cgroup_attach_task+0x157/0x230
  [] __cgroup_procs_write+0x2b7/0x470
  [] cgroup_tasks_write+0xc/0x10
  [] cgroup_file_write+0x30/0x1b0
  [] kernfs_fop_write+0x13c/0x180
  [] __vfs_write+0x23/0xe0
  [] vfs_write+0xa4/0x1a0
  [] SyS_write+0x44/0xa0
  [] entry_SYSCALL_64_fastpath+0x12/0x6f

It doesn't make sense to prepare migration for css_sets pointing to
dead cgroups as they are guaranteed to contain only zombies which are
ignored later during migration.  This patch makes cgroup destruction
path mark all affected css_sets as dead and updates the migration path
to ignore them during preparation.

Signed-off-by: Tejun Heo 
Fixes: 2e91fa7f6d45 ("cgroup: keep zombies associated with their original cgroups")
Signed-off-by: Greg Kroah-Hartman

cgroup: make sure a parent css isn't offlined before its children

2016-03-03T23:07:28+00:00

commit aa226ff4a1ce79f229c6b7a4c0a14e17fececd01 upstream.

There are three subsystem callbacks in css shutdown path -
css_offline(), css_released() and css_free().  Except for
css_released(), cgroup core didn't guarantee the order of invocation.
css_offline() or css_free() could be called on a parent css before its
children.  This behavior is unexpected and led to bugs in cpu and
memory controller.

This patch updates offline path so that a parent css is never offlined
before its children.  Each css keeps online_cnt which reaches zero iff
itself and all its children are offline and offline_css() is invoked
only after online_cnt reaches zero.

This fixes the memory controller bug and allows the fix for cpu
controller.

Signed-off-by: Tejun Heo 
Reported-and-tested-by: Christian Borntraeger 
Reported-by: Brian Christiansen 
Link: http://lkml.kernel.org/g/5698A023.9070703@de.ibm.com
Link: http://lkml.kernel.org/g/CAKB58ikDkzc8REt31WBkD99+hxNzjK4+FBmhkgS+NVrC9vjMSg@mail.gmail.com
Cc: Heiko Carstens 
Cc: Peter Zijlstra 
Signed-off-by: Greg Kroah-Hartman

cpuset: make mm migration asynchronous

2016-03-03T23:07:28+00:00

commit e93ad19d05648397ef3bcb838d26aec06c245dc0 upstream.

If "cpuset.memory_migrate" is set, when a process is moved from one
cpuset to another with a different memory node mask, pages in used by
the process are migrated to the new set of nodes.  This was performed
synchronously in the ->attach() callback, which is synchronized
against process management.  Recently, the synchronization was changed
from per-process rwsem to global percpu rwsem for simplicity and
optimization.

Combined with the synchronous mm migration, this led to deadlocks
because mm migration could schedule a work item which may in turn try
to create a new worker blocking on the process management lock held
from cgroup process migration path.

This heavy an operation shouldn't be performed synchronously from that
deep inside cgroup migration in the first place.  This patch punts the
actual migration to an ordered workqueue and updates cgroup process
migration and cpuset config update paths to flush the workqueue after
all locks are released.  This way, the operations still seem
synchronous to userland without entangling mm migration with process
management synchronization.  CPU hotplug can also invoke mm migration
but there's no reason for it to wait for mm migrations and thus
doesn't synchronize against their completions.

Signed-off-by: Tejun Heo 
Reported-and-tested-by: Christian Borntraeger 
Signed-off-by: Greg Kroah-Hartman

cgroup: fix handling of multi-destination migration from subtree_control enabling

2015-12-03T15:18:21+00:00

Consider the following v2 hierarchy.

  P0 (+memory) --- P1 (-memory) --- A
                                 \- B
       
P0 has memory enabled in its subtree_control while P1 doesn't.  If
both A and B contain processes, they would belong to the memory css of
P1.  Now if memory is enabled on P1's subtree_control, memory csses
should be created on both A and B and A's processes should be moved to
the former and B's processes the latter.  IOW, enabling controllers
can cause atomic migrations into different csses.

The core cgroup migration logic has been updated accordingly but the
controller migration methods haven't and still assume that all tasks
migrate to a single target css; furthermore, the methods were fed the
css in which subtree_control was updated which is the parent of the
target csses.  pids controller depends on the migration methods to
move charges and this made the controller attribute charges to the
wrong csses often triggering the following warning by driving a
counter negative.

 WARNING: CPU: 1 PID: 1 at kernel/cgroup_pids.c:97 pids_cancel.constprop.6+0x31/0x40()
 Modules linked in:
 CPU: 1 PID: 1 Comm: systemd Not tainted 4.4.0-rc1+ #29
 ...
  ffffffff81f65382 ffff88007c043b90 ffffffff81551ffc 0000000000000000
  ffff88007c043bc8 ffffffff810de202 ffff88007a752000 ffff88007a29ab00
  ffff88007c043c80 ffff88007a1d8400 0000000000000001 ffff88007c043bd8
 Call Trace:
  [] dump_stack+0x4e/0x82
  [] warn_slowpath_common+0x82/0xc0
  [] warn_slowpath_null+0x1a/0x20
  [] pids_cancel.constprop.6+0x31/0x40
  [] pids_can_attach+0x6d/0xf0
  [] cgroup_taskset_migrate+0x6c/0x330
  [] cgroup_migrate+0xf5/0x190
  [] cgroup_attach_task+0x176/0x200
  [] __cgroup_procs_write+0x2ad/0x460
  [] cgroup_procs_write+0x14/0x20
  [] cgroup_file_write+0x35/0x1c0
  [] kernfs_fop_write+0x141/0x190
  [] __vfs_write+0x28/0xe0
  [] vfs_write+0xac/0x1a0
  [] SyS_write+0x49/0xb0
  [] entry_SYSCALL_64_fastpath+0x12/0x76

This patch fixes the bug by removing @css parameter from the three
migration methods, ->can_attach, ->cancel_attach() and ->attach() and
updating cgroup_taskset iteration helpers also return the destination
css in addition to the task being migrated.  All controllers are
updated accordingly.

* Controllers which don't care whether there are one or multiple
  target csses can be converted trivially.  cpu, io, freezer, perf,
  netclassid and netprio fall in this category.

* cpuset's current implementation assumes that there's single source
  and destination and thus doesn't support v2 hierarchy already.  The
  only change made by this patchset is how that single destination css
  is obtained.

* memory migration path already doesn't do anything on v2.  How the
  single destination css is obtained is updated and the prep stage of
  mem_cgroup_can_attach() is reordered to accomodate the change.

* pids is the only controller which was affected by this bug.  It now
  correctly handles multi-destination migrations and no longer causes
  counter underflow from incorrect accounting.

Signed-off-by: Tejun Heo 
Reported-and-tested-by: Daniel Wagner 
Cc: Aleksa Sarai

cgroup: make css_set pin its css's to avoid use-afer-free

2015-11-30T14:46:21+00:00

A css_set represents the relationship between a set of tasks and
css's.  css_set never pinned the associated css's.  This was okay
because tasks used to always disassociate immediately (in RCU sense) -
either a task is moved to a different css_set or exits and never
accesses css_set again.

Unfortunately, afcf6c8b7544 ("cgroup: add cgroup_subsys->free() method
and use it to fix pids controller") and patches leading up to it made
a zombie hold onto its css_set and deref the associated css's on its
release.  Nothing pins the css's after exit and it might have already
been freed leading to use-after-free.

 general protection fault: 0000 [#1] PREEMPT SMP
 task: ffffffff81bf2500 ti: ffffffff81be4000 task.ti: ffffffff81be4000
 RIP: 0010:[]  [] pids_cancel.constprop.4+0x5/0x40
 ...
 Call Trace:
  
  [] ? pids_free+0x3d/0xa0
  [] cgroup_free+0x53/0xe0
  [] __put_task_struct+0x42/0x130
  [] delayed_put_task_struct+0x77/0x130
  [] rcu_process_callbacks+0x2f4/0x820
  [] ? rcu_process_callbacks+0x2b3/0x820
  [] __do_softirq+0xd4/0x460
  [] irq_exit+0x89/0xa0
  [] smp_apic_timer_interrupt+0x42/0x50
  [] apic_timer_interrupt+0x84/0x90
  
 ...
 Code: 5b 5d c3 48 89 df 48 c7 c2 c9 f9 ae 81 48 c7 c6 91 2c ae 81 e8 1d 94 0e 00 31 c0 5b 5d c3 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00  48 83 87 e0 00 00 00 ff 78 01 c3 80 3d 08 7a c1 00 00 74 02
 RIP  [] pids_cancel.constprop.4+0x5/0x40
  RSP 
 ---[ end trace 89a4a4b916b90c49 ]---
 Kernel panic - not syncing: Fatal exception in interrupt
 Kernel Offset: disabled
 ---[ end Kernel panic - not syncing: Fatal exception in interrupt

Fix it by making css_set pin the associate css's until its release.

Signed-off-by: Tejun Heo 
Reported-by: Dave Jones 
Reported-by: Daniel Wagner 
Link: http://lkml.kernel.org/g/20151120041836.GA18390@codemonkey.org.uk
Link: http://lkml.kernel.org/g/5652D448.3080002@bmw-carit.de
Fixes: afcf6c8b7544 ("cgroup: add cgroup_subsys->free() method and use it to fix pids controller")