linux-toradex.git/include/linux/cpuset.h, branch v4.4.5

cpuset: make mm migration asynchronous

2016-03-03T23:07:28+00:00

commit e93ad19d05648397ef3bcb838d26aec06c245dc0 upstream.

If "cpuset.memory_migrate" is set, when a process is moved from one
cpuset to another with a different memory node mask, pages in use by
the process are migrated to the new set of nodes.  This was performed
synchronously in the ->attach() callback, which is synchronized
against process management.  Recently, the synchronization was changed
from per-process rwsem to global percpu rwsem for simplicity and
optimization.

Combined with the synchronous mm migration, this led to deadlocks
because mm migration could schedule a work item which may in turn try
to create a new worker blocking on the process management lock held
from cgroup process migration path.

This heavy an operation shouldn't be performed synchronously from that
deep inside cgroup migration in the first place.  This patch punts the
actual migration to an ordered workqueue and updates cgroup process
migration and cpuset config update paths to flush the workqueue after
all locks are released.  This way, the operations still seem
synchronous to userland without entangling mm migration with process
management synchronization.  CPU hotplug can also invoke mm migration
but there's no reason for it to wait for mm migrations and thus
doesn't synchronize against their completions.
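
The ordering described above (queue the migration while the lock is held,
run it only after every lock is dropped) can be sketched in userspace C.
This is a single-threaded model under stated assumptions, not the kernel
code: attach_lock_held, queue_migration(), flush_migrations() and
attach_task() are hypothetical names, and in the kernel the queued items
run on a workqueue worker while the flush waits for them.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PENDING 8

/* Hypothetical stand-ins for illustration; none of these are kernel APIs. */
static int attach_lock_held;      /* models the process-management lock */
static int pending[MAX_PENDING];  /* FIFO of mm ids awaiting migration */
static size_t n_pending;
static size_t n_migrated;         /* how many migrations have run */
static int last_migrated = -1;    /* id of the most recent migration */

/* The heavy operation. It must never run with the attach lock held. */
static void migrate_mm(int mm_id)
{
    assert(!attach_lock_held);
    last_migrated = mm_id;
    n_migrated++;
}

/* Called from the attach path with the lock held: only record the request. */
static int queue_migration(int mm_id)
{
    if (n_pending == MAX_PENDING)
        return -1;
    pending[n_pending++] = mm_id;
    return 0;
}

/* Called after all locks are dropped: run queued items in FIFO order. */
static void flush_migrations(void)
{
    for (size_t i = 0; i < n_pending; i++)
        migrate_mm(pending[i]);
    n_pending = 0;
}

/* Models moving one task: queue under the lock, flush after releasing it. */
static int attach_task(int mm_id)
{
    int ret;

    attach_lock_held = 1;
    ret = queue_migration(mm_id);
    attach_lock_held = 0;

    /* By the time this returns, the caller sees the migration as done. */
    flush_migrations();
    return ret;
}
```

From the caller's point of view attach_task() still looks synchronous,
which mirrors the "still seem synchronous to userland" property.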

Signed-off-by: Tejun Heo 
Reported-and-tested-by: Christian Borntraeger 
Signed-off-by: Greg Kroah-Hartman

mm, page_alloc: remove unnecessary taking of a seqlock when cpusets are disabled

2015-11-07T01:50:42+00:00

There is a seqcounter that protects against spurious allocation failures
when a task is changing the allowed nodes in a cpuset.  There is no need
to check the seqcounter until a cpuset exists.
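
A minimal userspace sketch of the idea, assuming a boolean that becomes
true once a cpuset exists. The names model_cpusets_enabled,
model_read_begin() and model_read_retry() are hypothetical, and the
model omits the memory barriers and the wait for an in-progress writer
that a real seqcount read performs.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model state; not the kernel's data structures. */
static bool model_cpusets_enabled;      /* true once a cpuset exists */
static unsigned int model_seq;          /* bumped when allowed nodes change */
static unsigned int model_seq_reads;    /* counts touches of the counter */

/* Start a read section; skip the counter while no cpuset exists. */
static unsigned int model_read_begin(void)
{
    if (!model_cpusets_enabled)
        return 0;
    model_seq_reads++;
    return model_seq;
}

/* Return true if the read section raced with an update and must rerun. */
static bool model_read_retry(unsigned int cookie)
{
    if (!model_cpusets_enabled)
        return false;
    model_seq_reads++;
    return model_seq != cookie;
}
```

With cpusets disabled both helpers return constants and never touch the
counter, so the allocation path pays nothing for it.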

Signed-off-by: Mel Gorman 
Acked-by: Christoph Lameter 
Acked-by: David Rientjes 
Acked-by: Vlastimil Babka 
Acked-by: Michal Hocko 
Acked-by: Johannes Weiner 
Cc: Vitaly Wool 
Cc: Rik van Riel 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm, oom: remove task_lock protecting comm printing

2015-11-06T03:34:48+00:00

The oom killer takes task_lock() in a couple of places solely to protect
printing the task's comm.

A process's comm, including current's comm, may change due to
/proc/pid/comm or PR_SET_NAME.

The comm will always be NUL-terminated, so the worst race scenario would
only be during update.  We can tolerate a comm being printed that is in
the middle of an update to avoid taking the lock.

Other locations in the kernel have already dropped task_lock() when
printing comm, so this is consistent.

Signed-off-by: David Rientjes 
Suggested-by: Oleg Nesterov 
Cc: Michal Hocko 
Cc: Vladimir Davydov 
Cc: Sergey Senozhatsky 
Acked-by: Johannes Weiner 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

cpuset: simplify cpuset_node_allowed API

2014-10-27T15:15:27+00:00

Current cpuset API for checking if a zone/node is allowed to allocate
from looks rather awkward. We have hardwall and softwall versions of
cpuset_node_allowed with the softwall version doing literally the same
as the hardwall version if __GFP_HARDWALL is passed to it in gfp flags.
If it isn't, the softwall version may check the given node against the
enclosing hardwall cpuset, which it needs to take the callback lock to
do.

Such a distinction was introduced by commit 02a0e53d8227 ("cpuset:
rework cpuset_zone_allowed api"). Before that, there was a single version,
with the __GFP_HARDWALL flag determining its behavior. The purpose of the
commit was to avoid sleep-in-atomic bugs when someone would mistakenly
call the function without the __GFP_HARDWALL flag for an atomic
allocation. The suffixes introduced were intended to make the callers
think before using the function.

However, since the callback lock was converted from mutex to spinlock by
the previous patch, the softwall check function cannot sleep, and these
precautions are no longer necessary.

So let's simplify the API back to the single check.
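
As a rough illustration of what a single check looks like, here is a
userspace model in which one function serves both cases and the flag
alone selects the behavior. struct cpuset_model, MODEL_GFP_HARDWALL and
model_node_allowed() are hypothetical names for this sketch; it leaves
out the interrupt-context and exiting-task shortcuts as well as the
callback lock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_GFP_HARDWALL 0x1u   /* hypothetical stand-in for __GFP_HARDWALL */

/* Hypothetical, simplified cpuset: a node bitmask plus a hardwall marker. */
struct cpuset_model {
    unsigned long mems_allowed;   /* bit n set => node n is allowed */
    bool mem_hardwall;            /* this cpuset is a hardwall boundary */
    struct cpuset_model *parent;  /* NULL for the top cpuset */
};

/* One check for both cases: the gfp flag decides how strict it is. */
static bool model_node_allowed(const struct cpuset_model *cs, int node,
                               unsigned int gfp)
{
    unsigned long bit = 1UL << node;

    if (cs->mems_allowed & bit)
        return true;
    if (gfp & MODEL_GFP_HARDWALL)
        return false;             /* hardwall request: own mask only */

    /* Otherwise fall back to the nearest enclosing hardwall cpuset. */
    while (!cs->mem_hardwall && cs->parent)
        cs = cs->parent;
    return (cs->mems_allowed & bit) != 0;
}
```

For a child cpuset allowing only node 0 under a hardwall parent allowing
nodes 0 and 1, node 1 is refused with the hardwall flag and accepted
without it.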

Suggested-by: David Rientjes 
Signed-off-by: Vladimir Davydov 
Acked-by: Christoph Lameter 
Acked-by: Zefan Li 
Signed-off-by: Tejun Heo

Merge branch 'for-3.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup

2014-10-10T11:24:40+00:00

Pull cgroup updates from Tejun Heo:
 "Nothing too interesting.  Just a handful of cleanup patches"

* 'for-3.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  Revert "cgroup: remove redundant variable in cgroup_mount()"
  cgroup: remove redundant variable in cgroup_mount()
  cgroup: fix missing unlock in cgroup_release_agent()
  cgroup: remove CGRP_RELEASABLE flag
  perf/cgroup: Remove perf_put_cgroup()
  cgroup: remove redundant check in cgroup_ino()
  cpuset: simplify proc_cpuset_show()
  cgroup: simplify proc_cgroup_show()
  cgroup: use a per-cgroup work for release agent
  cgroup: remove bogus comments
  cgroup: remove redundant code in cgroup_rmdir()
  cgroup: remove some useless forward declarations
  cgroup: fix a typo in comment.

cpuset: PF_SPREAD_PAGE and PF_SPREAD_SLAB should be atomic flags

2014-09-25T02:16:06+00:00

When we change cpuset.memory_spread_{page,slab}, cpuset will flip
PF_SPREAD_{PAGE,SLAB} bit of tsk->flags for each task in that cpuset.
This should be done using atomic bitops, but currently we don't,
which is broken.

Tetsuo reported a hard-to-reproduce kernel crash on RHEL6, which happened
when one thread tried to clear PF_USED_MATH while at the same time another
thread tried to flip PF_SPREAD_PAGE/PF_SPREAD_SLAB. They both operate on
the same task.

Here's the full report:
https://lkml.org/lkml/2014/9/19/230

To fix this, we make PF_SPREAD_PAGE and PF_SPREAD_SLAB atomic flags.
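
The difference between a plain read-modify-write and an atomic bit
operation can be sketched with C11 atomics. This is a userspace analogy,
not the kernel change itself: struct task_model, the MODEL_* bit numbers
and the helper names are assumptions made for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical bit numbers for the two spread flags. */
enum { MODEL_SPREAD_PAGE = 0, MODEL_SPREAD_SLAB = 1 };

struct task_model {
    unsigned long flags;        /* only the owning task may modify these */
    atomic_ulong atomic_flags;  /* other threads may flip these bits */
};

/* Set one bit atomically; concurrent updates to other bits are kept. */
static void model_set_flag(struct task_model *t, int bit)
{
    atomic_fetch_or(&t->atomic_flags, 1UL << bit);
}

/* Clear one bit atomically. */
static void model_clear_flag(struct task_model *t, int bit)
{
    atomic_fetch_and(&t->atomic_flags, ~(1UL << bit));
}

/* Test one bit. */
static bool model_test_flag(struct task_model *t, int bit)
{
    return (atomic_load(&t->atomic_flags) >> bit) & 1UL;
}
```

A non-atomic "flags |= bit" is a load, an OR and a store, so two threads
updating different bits of the same word can lose one update. Moving the
remotely modified bits into their own word, changed only through atomic
operations, removes that window.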

v4:
- updated mm/slab.c. (Fengguang Wu)
- updated Documentation.

Cc: Peter Zijlstra 
Cc: Ingo Molnar 
Cc: Miao Xie 
Cc: Kees Cook 
Fixes: 950592f7b991 ("cpusets: update tasks' page/slab spread flags in time")
Cc:  # 2.6.31+
Reported-by: Tetsuo Handa 
Signed-off-by: Zefan Li 
Signed-off-by: Tejun Heo

cpuset: simplify proc_cpuset_show()

2014-09-18T17:27:23+00:00

Use the ONE macro instead of REG, and we can simplify proc_cpuset_show().

Signed-off-by: Zefan Li 
Signed-off-by: Tejun Heo

mm: page_alloc: use jump labels to avoid checking number_of_cpusets

2014-06-04T23:54:08+00:00

If cpusets are not in use then we still check a global variable on every
page allocation.  Use jump labels to avoid the overhead.

Signed-off-by: Mel Gorman 
Reviewed-by: Rik van Riel 
Cc: Johannes Weiner 
Cc: Vlastimil Babka 
Cc: Jan Kara 
Cc: Michal Hocko 
Cc: Hugh Dickins 
Cc: Dave Hansen 
Cc: Theodore Ts'o 
Cc: "Paul E. McKenney" 
Cc: Oleg Nesterov 
Cc: Peter Zijlstra 
Cc: Stephen Rothwell 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: optimize put_mems_allowed() usage

2014-04-03T23:20:58+00:00

Since put_mems_allowed() is strictly optional (it's a seqcount retry), we
don't need to evaluate the function if the allocation was in fact
successful, saving an smp_rmb, some loads and comparisons on some
relatively fast paths.

Since the naming of get/put_mems_allowed() suggests a mandatory
pairing, rename the interface, as suggested by Mel, to resemble the
seqcount interface.

This gives us: read_mems_allowed_begin() and read_mems_allowed_retry(),
where it is important to note that the return value of the latter call
is inverted from its previous incarnation.
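
The resulting call pattern can be sketched as a loop that consults the
retry helper only when the allocation failed. This is a self-contained
userspace model: sketch_begin(), sketch_retry(), sketch_try_alloc() and
sketch_alloc() are hypothetical names, and the fake allocator fails once
while simulating a concurrent update of the allowed nodes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

static unsigned int sketch_seq;          /* bumped on each nodemask update */
static unsigned int sketch_retry_calls;  /* how often the retry check ran */
static int sketch_attempts;              /* how often allocation was tried */

static unsigned int sketch_begin(void)
{
    return sketch_seq;
}

/* True means "the mask changed under us, try again" (the inverted sense). */
static bool sketch_retry(unsigned int cookie)
{
    sketch_retry_calls++;
    return sketch_seq != cookie;
}

/* Hypothetical allocator: the first attempt fails during a mask update. */
static void *sketch_try_alloc(void)
{
    static int slot;

    if (sketch_attempts++ == 0) {
        sketch_seq += 2;   /* simulate a completed concurrent update */
        return NULL;
    }
    return &slot;
}

static void *sketch_alloc(void)
{
    void *page;
    unsigned int cookie;

    do {
        cookie = sketch_begin();
        page = sketch_try_alloc();
        /* && short-circuits: no retry check after a successful attempt */
    } while (!page && sketch_retry(cookie));

    return page;
}
```

Because the success test comes first, a successful allocation leaves the
loop without ever evaluating the retry check, which is the saving the
commit message describes.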

Signed-off-by: Peter Zijlstra 
Signed-off-by: Mel Gorman 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

cpuset: Fix potential deadlock w/ set_mems_allowed

2013-11-06T11:40:27+00:00

After adding lockdep support to seqlock/seqcount structures,
I started seeing the following warning:

[    1.070907] ======================================================
[    1.072015] [ INFO: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected ]
[    1.073181] 3.11.0+ #67 Not tainted
[    1.073801] ------------------------------------------------------
[    1.074882] kworker/u4:2/708 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
[    1.076088]  (&p->mems_allowed_seq){+.+...}, at: [] new_slab+0x5f/0x280
[    1.077572]
[    1.077572] and this task is already holding:
[    1.078593]  (&(&q->__queue_lock)->rlock){..-...}, at: [] blk_execute_rq_nowait+0x53/0xf0
[    1.080042] which would create a new lock dependency:
[    1.080042]  (&(&q->__queue_lock)->rlock){..-...} -> (&p->mems_allowed_seq){+.+...}
[    1.080042]
[    1.080042] but this new dependency connects a SOFTIRQ-irq-safe lock:
[    1.080042]  (&(&q->__queue_lock)->rlock){..-...}
[    1.080042] ... which became SOFTIRQ-irq-safe at:
[    1.080042]   [] __lock_acquire+0x5b9/0x1db0
[    1.080042]   [] lock_acquire+0x95/0x130
[    1.080042]   [] _raw_spin_lock+0x41/0x80
[    1.080042]   [] scsi_device_unbusy+0x7e/0xd0
[    1.080042]   [] scsi_finish_command+0x32/0xf0
[    1.080042]   [] scsi_softirq_done+0xa1/0x130
[    1.080042]   [] blk_done_softirq+0x73/0x90
[    1.080042]   [] __do_softirq+0x110/0x2f0
[    1.080042]   [] run_ksoftirqd+0x2d/0x60
[    1.080042]   [] smpboot_thread_fn+0x156/0x1e0
[    1.080042]   [] kthread+0xd6/0xe0
[    1.080042]   [] ret_from_fork+0x7c/0xb0
[    1.080042]
[    1.080042] to a SOFTIRQ-irq-unsafe lock:
[    1.080042]  (&p->mems_allowed_seq){+.+...}
[    1.080042] ... which became SOFTIRQ-irq-unsafe at:
[    1.080042] ...  [] __lock_acquire+0x613/0x1db0
[    1.080042]   [] lock_acquire+0x95/0x130
[    1.080042]   [] kthreadd+0x82/0x180
[    1.080042]   [] ret_from_fork+0x7c/0xb0
[    1.080042]
[    1.080042] other info that might help us debug this:
[    1.080042]
[    1.080042]  Possible interrupt unsafe locking scenario:
[    1.080042]
[    1.080042]        CPU0                    CPU1
[    1.080042]        ----                    ----
[    1.080042]   lock(&p->mems_allowed_seq);
[    1.080042]                                local_irq_disable();
[    1.080042]                                lock(&(&q->__queue_lock)->rlock);
[    1.080042]                                lock(&p->mems_allowed_seq);
[    1.080042]   <Interrupt>
[    1.080042]     lock(&(&q->__queue_lock)->rlock);
[    1.080042]
[    1.080042]  *** DEADLOCK ***

The issue stems from the kthreadd() function calling set_mems_allowed
with irqs enabled. While it's possibly unlikely for the actual deadlock
to trigger, a fix is fairly simple: disable irqs before taking the
mems_allowed_seq lock.

Signed-off-by: John Stultz 
Signed-off-by: Peter Zijlstra 
Acked-by: Li Zefan 
Cc: Mathieu Desnoyers 
Cc: Steven Rostedt 
Cc: "David S. Miller" 
Cc: netdev@vger.kernel.org
Link: http://lkml.kernel.org/r/1381186321-4906-4-git-send-email-john.stultz@linaro.org
Signed-off-by: Ingo Molnar