linux-toradex.git/mm/memcontrol.c, branch v3.8.4

memcg: fix kmemcg registration for late caches

2013-02-12T22:34:00+00:00

The designed workflow for caches in kmemcg is: register the cache with
memcg_register_cache() if kmemcg is already available, or register it
later, when a new kmemcg appears, from memcg_update_cache_sizes(), which
handles all caches in the system.  Caches created at boot time are
handled by the latter; memcg caches, as well as any system caches
registered later on, are handled by the former.

There is a bug, however, in memcg_register_cache: we correctly set up
the array size, but do not mark the cache as a root cache.

This means that allocations for any cache appearing late in the game
will see cachep->memcg_params->is_root_cache == false, and in particular,
trigger VM_BUG_ON(!cachep->memcg_params->is_root_cache) in
__memcg_kmem_get_cache.

The obvious fix is to include the missing assignment.
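
A minimal user-space sketch of the missing assignment.  The structures
and register_cache_sketch() are simplified, hypothetical stand-ins, not
the kernel's definitions; a NULL parent is assumed to mean "root cache".

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical, reduced stand-ins for the kernel structures. */
struct memcg_cache_params {
	bool is_root_cache;
	size_t nr_slots;	/* size of the per-memcg array */
};

struct kmem_cache {
	struct memcg_cache_params *memcg_params;
};

/* Register a cache; parent == NULL means this is a root cache. */
static int register_cache_sketch(struct kmem_cache *s,
				 const struct kmem_cache *parent,
				 size_t nr_slots)
{
	assert(s != NULL);
	s->memcg_params = calloc(1, sizeof(*s->memcg_params));
	if (!s->memcg_params)
		return -1;
	s->memcg_params->nr_slots = nr_slots;
	/* The kind of assignment the fix adds: mark late root caches. */
	if (!parent)
		s->memcg_params->is_root_cache = true;
	return 0;
}
```

Without the final assignment, a root cache registered late would keep
is_root_cache == false from the zeroed allocation.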

Signed-off-by: Glauber Costa 
Cc: Michal Hocko 
Cc: Johannes Weiner 
Cc: KAMEZAWA Hiroyuki 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

memcg: don't register hotcpu notifier from ->css_alloc()

2012-12-21T01:40:20+00:00

Commit 648bb56d076b ("cgroup: lock cgroup_mutex in cgroup_init_subsys()")
made cgroup_init_subsys() grab cgroup_mutex before invoking
->css_alloc() for the root css.  Because memcg registers hotcpu notifier
from ->css_alloc() for the root css, this introduced circular locking
dependency between cgroup_mutex and cpu hotplug.

Fix it by moving hotcpu notifier registration to a subsys initcall.

  ======================================================
  [ INFO: possible circular locking dependency detected ]
  3.7.0-rc4-work+ #42 Not tainted
  -------------------------------------------------------
  bash/645 is trying to acquire lock:
   (cgroup_mutex){+.+.+.}, at: [] cgroup_lock+0x17/0x20

  but task is already holding lock:
   (cpu_hotplug.lock){+.+.+.}, at: [] cpu_hotplug_begin+0x2f/0x60

  which lock already depends on the new lock.

  the existing dependency chain (in reverse order) is:

 -> #1 (cpu_hotplug.lock){+.+.+.}:
         lock_acquire+0x97/0x1e0
         mutex_lock_nested+0x61/0x3b0
         get_online_cpus+0x3c/0x60
         rebuild_sched_domains_locked+0x1b/0x70
         cpuset_write_resmask+0x298/0x2c0
         cgroup_file_write+0x1ef/0x300
         vfs_write+0xa8/0x160
         sys_write+0x52/0xa0
         system_call_fastpath+0x16/0x1b

 -> #0 (cgroup_mutex){+.+.+.}:
         __lock_acquire+0x14ce/0x1d20
         lock_acquire+0x97/0x1e0
         mutex_lock_nested+0x61/0x3b0
         cgroup_lock+0x17/0x20
         cpuset_handle_hotplug+0x1b/0x560
         cpuset_update_active_cpus+0xe/0x10
         cpuset_cpu_inactive+0x47/0x50
         notifier_call_chain+0x66/0x150
         __raw_notifier_call_chain+0xe/0x10
         __cpu_notify+0x20/0x40
         _cpu_down+0x7e/0x2f0
         cpu_down+0x36/0x50
         store_online+0x5d/0xe0
         dev_attr_store+0x18/0x30
         sysfs_write_file+0xe0/0x150
         vfs_write+0xa8/0x160
         sys_write+0x52/0xa0
         system_call_fastpath+0x16/0x1b
  other info that might help us debug this:

   Possible unsafe locking scenario:

         CPU0                    CPU1
         ----                    ----
    lock(cpu_hotplug.lock);
                                 lock(cgroup_mutex);
                                 lock(cpu_hotplug.lock);
    lock(cgroup_mutex);

   *** DEADLOCK ***

  5 locks held by bash/645:
   #0:  (&buffer->mutex){+.+.+.}, at: [] sysfs_write_file+0x48/0x150
   #1:  (s_active#42){.+.+.+}, at: [] sysfs_write_file+0xc8/0x150
   #2:  (x86_cpu_hotplug_driver_mutex){+.+...}, at: [] cpu_hotplug_driver_lock+0x1
+7/0x20
   #3:  (cpu_add_remove_lock){+.+.+.}, at: [] cpu_maps_update_begin+0x17/0x20
   #4:  (cpu_hotplug.lock){+.+.+.}, at: [] cpu_hotplug_begin+0x2f/0x60

  stack backtrace:
  Pid: 645, comm: bash Not tainted 3.7.0-rc4-work+ #42
  Call Trace:
   print_circular_bug+0x28e/0x29f
   __lock_acquire+0x14ce/0x1d20
   lock_acquire+0x97/0x1e0
   mutex_lock_nested+0x61/0x3b0
   cgroup_lock+0x17/0x20
   cpuset_handle_hotplug+0x1b/0x560
   cpuset_update_active_cpus+0xe/0x10
   cpuset_cpu_inactive+0x47/0x50
   notifier_call_chain+0x66/0x150
   __raw_notifier_call_chain+0xe/0x10
   __cpu_notify+0x20/0x40
   _cpu_down+0x7e/0x2f0
   cpu_down+0x36/0x50
   store_online+0x5d/0xe0
   dev_attr_store+0x18/0x30
   sysfs_write_file+0xe0/0x150
   vfs_write+0xa8/0x160
   sys_write+0x52/0xa0
   system_call_fastpath+0x16/0x1b
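
The report above comes from lock-order tracking.  The following toy
sketch shows the idea only: it records "taken while holding" edges and
rejects an edge that would close a cycle.  It is not how lockdep is
implemented; the lock IDs and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_LOCKS 8

/* Hypothetical lock IDs mirroring the two locks in the report. */
enum { LOCK_CGROUP_MUTEX, LOCK_CPU_HOTPLUG };

/* dep[a][b] is true when lock b has been acquired while holding a. */
static bool dep[MAX_LOCKS][MAX_LOCKS];

/* Depth-first search: is `to` reachable from `from` via recorded edges? */
static bool reachable(int from, int to, bool *seen)
{
	if (from == to)
		return true;
	seen[from] = true;
	for (int next = 0; next < MAX_LOCKS; next++)
		if (dep[from][next] && !seen[next] &&
		    reachable(next, to, seen))
			return true;
	return false;
}

/*
 * Record "acquired `taken` while holding `held`".
 * Returns false if this ordering would close a cycle (possible deadlock).
 */
static bool record_dependency(int held, int taken)
{
	bool seen[MAX_LOCKS] = { false };

	assert(held >= 0 && held < MAX_LOCKS);
	assert(taken >= 0 && taken < MAX_LOCKS);
	if (reachable(taken, held, seen))
		return false;
	dep[held][taken] = true;
	return true;
}
```

Recording cgroup_mutex -> cpu_hotplug.lock (chain #1) succeeds; the
reverse order taken on the hotplug path (chain #0) is then rejected.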

Signed-off-by: Tejun Heo 
Reported-by: Fengguang Wu 
Acked-by: Michal Hocko 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

slab: propagate tunable values

2012-12-18T23:02:14+00:00

SLAB allows us to tune a particular cache behavior with tunables.  When
creating a new memcg cache copy, we'd like to preserve any tunables the
parent cache already had.

This could be done by an explicit call to do_tune_cpucache() after the
cache is created.  But this is not very convenient now that the caches are
created from common code, since this function is SLAB-specific.

Another method of doing that is taking advantage of the fact that
do_tune_cpucache() is always called from enable_cpucache(), which is
called at cache initialization.  We can just preset the values, and then
things work as expected.

It can also happen that a root cache has its tunables updated during
normal system operation.  In this case, we will propagate the change to
all caches that are already active.

This change requires us to move the assignment of root_cache in
memcg_params a bit earlier.  It needs to be already set (which
memcg_kmem_register_cache will do) when we reach __kmem_cache_create().
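
A rough sketch of the two paths described above: presetting a child's
tunables from its root, and pushing later root updates to active
children.  The structure, the fixed-size child array, and both function
names are assumptions made for illustration, not SLAB's real code.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CHILDREN 4

/* Hypothetical, reduced view of a SLAB cache's tunables. */
struct cache_sketch {
	unsigned int limit, batchcount, shared;
	struct cache_sketch *root;	/* NULL for a root cache */
	struct cache_sketch *children[MAX_CHILDREN];
};

/* Preset a new child's tunables from its root before initialization. */
static void preset_from_root(struct cache_sketch *child,
			     struct cache_sketch *root, int slot)
{
	assert(slot >= 0 && slot < MAX_CHILDREN);
	child->root = root;
	child->limit = root->limit;
	child->batchcount = root->batchcount;
	child->shared = root->shared;
	root->children[slot] = child;
}

/* Update a root cache and push the new values to every active child. */
static void tune_root(struct cache_sketch *root, unsigned int limit,
		      unsigned int batchcount, unsigned int shared)
{
	root->limit = limit;
	root->batchcount = batchcount;
	root->shared = shared;
	for (size_t i = 0; i < MAX_CHILDREN; i++) {
		struct cache_sketch *c = root->children[i];

		if (!c)
			continue;
		c->limit = limit;
		c->batchcount = batchcount;
		c->shared = shared;
	}
}
```

The preset only works if the child already knows its root when it is
initialized, which is why the root_cache assignment has to move earlier.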

Signed-off-by: Glauber Costa 
Cc: Christoph Lameter 
Cc: David Rientjes 
Cc: Frederic Weisbecker 
Cc: Greg Thelen 
Cc: Johannes Weiner 
Cc: JoonSoo Kim 
Cc: KAMEZAWA Hiroyuki 
Cc: Mel Gorman 
Cc: Michal Hocko 
Cc: Pekka Enberg 
Cc: Rik van Riel 
Cc: Suleiman Souhlal 
Cc: Tejun Heo 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

memcg: aggregate memcg cache values in slabinfo

2012-12-18T23:02:14+00:00

When we create caches in memcgs, we need to display their usage
information somewhere.  We'll adopt a scheme similar to /proc/meminfo,
with aggregate totals shown in the global file, and per-group information
stored in the group itself.

For the time being, only reads are allowed in the per-group cache.
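
As an illustration of the aggregation, the global line for a cache can
be thought of as the root's own counters plus those of every memcg
child.  The two-field structure and aggregate_slabinfo() below are
hypothetical simplifications of what slabinfo actually reports.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical subset of the per-cache counters shown in slabinfo. */
struct slabinfo_sketch {
	unsigned long active_objs;
	unsigned long num_objs;
};

/* Global view: the root cache's counters plus all memcg children. */
static struct slabinfo_sketch
aggregate_slabinfo(const struct slabinfo_sketch *root,
		   const struct slabinfo_sketch *children, size_t nr_children)
{
	struct slabinfo_sketch total = *root;

	assert(children != NULL || nr_children == 0);
	for (size_t i = 0; i < nr_children; i++) {
		total.active_objs += children[i].active_objs;
		total.num_objs += children[i].num_objs;
	}
	return total;
}
```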

Signed-off-by: Glauber Costa 
Cc: Christoph Lameter 
Cc: David Rientjes 
Cc: Frederic Weisbecker 
Cc: Greg Thelen 
Cc: Johannes Weiner 
Cc: JoonSoo Kim 
Cc: KAMEZAWA Hiroyuki 
Cc: Mel Gorman 
Cc: Michal Hocko 
Cc: Pekka Enberg 
Cc: Rik van Riel 
Cc: Suleiman Souhlal 
Cc: Tejun Heo 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

memcg/sl[au]b: shrink dead caches

2012-12-18T23:02:14+00:00

This means that when we destroy a memcg cache that happens to be empty,
it may take a long time to go away: removing the memcg reference won't
destroy it, because there are pending references, and the empty pages
will stay around until a shrinker is called for any reason.

In this patch, we call kmem_cache_shrink() for all dead caches that
cannot be destroyed because of remaining pages.  After shrinking, the
cache may become freeable.  If it does not, we schedule a lazy worker
to keep trying.
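
A sketch of the shrink-then-retry logic, under the assumption that a
dead cache can be modelled by a page count and an empty-page count.
All names are hypothetical; the real code calls kmem_cache_shrink() and
schedules a work item rather than setting a flag.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a dead memcg cache. */
struct dead_cache {
	unsigned int nr_pages;		/* pages still owned by the cache */
	unsigned int nr_empty_pages;	/* owned pages holding no objects */
	bool destroyed;
	bool retry_scheduled;
};

/* Stand-in for shrinking: release every page that holds no objects. */
static void shrink_sketch(struct dead_cache *c)
{
	assert(c->nr_empty_pages <= c->nr_pages);
	c->nr_pages -= c->nr_empty_pages;
	c->nr_empty_pages = 0;
}

/* Try to get rid of a dead cache; fall back to a lazy retry. */
static bool try_destroy_dead_cache(struct dead_cache *c)
{
	shrink_sketch(c);
	if (c->nr_pages == 0) {
		c->destroyed = true;
		c->retry_scheduled = false;
		return true;
	}
	c->retry_scheduled = true;	/* worker will call us again later */
	return false;
}
```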

Signed-off-by: Glauber Costa 
Cc: Christoph Lameter 
Cc: David Rientjes 
Cc: Frederic Weisbecker 
Cc: Greg Thelen 
Cc: Johannes Weiner 
Cc: JoonSoo Kim 
Cc: KAMEZAWA Hiroyuki 
Cc: Mel Gorman 
Cc: Michal Hocko 
Cc: Pekka Enberg 
Cc: Rik van Riel 
Cc: Suleiman Souhlal 
Cc: Tejun Heo 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

memcg/sl[au]b: track all the memcg children of a kmem_cache

2012-12-18T23:02:14+00:00

This enables us to remove all the children of a kmem_cache being
destroyed, if for example the kernel module it's being used in gets
unloaded.  Otherwise, the children will still point to the destroyed
parent.
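
One way to picture the tracking is a per-root list of children that is
walked when the root is destroyed.  This is only a sketch with
hypothetical names and a hand-rolled singly linked list; the kernel
uses its own list primitives and locking.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical cache object that knows its memcg children. */
struct tracked_cache {
	struct tracked_cache *children;		/* list head (root only) */
	struct tracked_cache *next_sibling;	/* link in the root's list */
	bool destroyed;
};

/* Remember a memcg child on its root cache. */
static void track_child(struct tracked_cache *root,
			struct tracked_cache *child)
{
	assert(root != NULL && child != NULL);
	child->next_sibling = root->children;
	root->children = child;
}

/* Destroy every child first, then the root; returns children destroyed. */
static int destroy_with_children(struct tracked_cache *root)
{
	int n = 0;

	while (root->children) {
		struct tracked_cache *child = root->children;

		root->children = child->next_sibling;
		child->next_sibling = NULL;
		child->destroyed = true;
		n++;
	}
	root->destroyed = true;
	return n;
}
```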

Signed-off-by: Suleiman Souhlal 
Signed-off-by: Glauber Costa 
Cc: Christoph Lameter 
Cc: David Rientjes 
Cc: Frederic Weisbecker 
Cc: Greg Thelen 
Cc: Johannes Weiner 
Cc: JoonSoo Kim 
Cc: KAMEZAWA Hiroyuki 
Cc: Mel Gorman 
Cc: Michal Hocko 
Cc: Pekka Enberg 
Cc: Rik van Riel 
Cc: Tejun Heo 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

memcg: destroy memcg caches

2012-12-18T23:02:14+00:00

Implement destruction of memcg caches.  Right now, only caches where our
reference is the last one remaining are deleted.  If any other
references are still held, we just leave the caches lying around until
those references go away.

When that happens, a destruction function is called from the cache code.
Caches are only destroyed in process context, so we queue them up for
later processing in the general case.
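
A sketch of the deferred scheme: dropping the last reference only queues
the cache, and a separate worker (standing in for process context) does
the actual destruction.  The names and the global pending list are
assumptions for illustration, with no locking shown.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical reference-counted memcg cache. */
struct refcounted_cache {
	int refcount;
	bool destroyed;
	struct refcounted_cache *next_pending;
};

/* Caches whose last reference is gone, awaiting process context. */
static struct refcounted_cache *pending_destroy;

/* Drop one reference; queue the cache once nobody holds it. */
static void put_cache(struct refcounted_cache *c)
{
	assert(c->refcount > 0);
	if (--c->refcount == 0) {
		c->next_pending = pending_destroy;
		pending_destroy = c;
	}
}

/* Worker body: destroy everything queued; returns how many. */
static int destroy_worker(void)
{
	int n = 0;

	while (pending_destroy) {
		struct refcounted_cache *c = pending_destroy;

		pending_destroy = c->next_pending;
		c->next_pending = NULL;
		c->destroyed = true;
		n++;
	}
	return n;
}
```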

Signed-off-by: Glauber Costa 
Cc: Christoph Lameter 
Cc: David Rientjes 
Cc: Frederic Weisbecker 
Cc: Greg Thelen 
Cc: Johannes Weiner 
Cc: JoonSoo Kim 
Cc: KAMEZAWA Hiroyuki 
Cc: Mel Gorman 
Cc: Michal Hocko 
Cc: Pekka Enberg 
Cc: Rik van Riel 
Cc: Suleiman Souhlal 
Cc: Tejun Heo 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

sl[au]b: allocate objects from memcg cache

2012-12-18T23:02:14+00:00

We are able to match a cache allocation to a particular memcg.  If the
task doesn't change groups during the allocation itself (a rare event),
this will give us a good picture of which group is the first to touch a
cache page.

This patch uses the now available infrastructure by calling
memcg_kmem_get_cache() before all the cache allocations.

Signed-off-by: Glauber Costa 
Cc: Christoph Lameter 
Cc: David Rientjes 
Cc: Frederic Weisbecker 
Cc: Greg Thelen 
Cc: Johannes Weiner 
Cc: JoonSoo Kim 
Cc: KAMEZAWA Hiroyuki 
Cc: Mel Gorman 
Cc: Michal Hocko 
Cc: Pekka Enberg 
Cc: Rik van Riel 
Cc: Suleiman Souhlal 
Cc: Tejun Heo 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

memcg: skip memcg kmem allocations in specified code regions

2012-12-18T23:02:14+00:00

Create a mechanism that skips memcg allocations during certain pieces of
our core code.  It basically works in the same way as
preempt_disable()/preempt_enable(): by marking a region under which all
allocations will be accounted to the root memcg.

We need this to prevent races in early cache creation, when we
allocate data using caches that are not necessarily created already.
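
The region marking can be sketched as a per-task nesting counter, in the
spirit of the preempt count.  The task structure, the helper names, and
the use of ID 0 for the root memcg are assumptions for this example.

```c
#include <assert.h>

#define ROOT_MEMCG_ID 0		/* assumed ID of the root memcg */

/* Hypothetical per-task state. */
struct task_sketch {
	int memcg_id;			/* memcg the task belongs to */
	unsigned int skip_depth;	/* > 0 inside a skip region */
};

/* Enter a region whose allocations go to the root memcg (nestable). */
static void stop_kmem_account(struct task_sketch *t)
{
	t->skip_depth++;
}

/* Leave the innermost skip region. */
static void resume_kmem_account(struct task_sketch *t)
{
	assert(t->skip_depth > 0);
	t->skip_depth--;
}

/* Which memcg should an allocation by this task be accounted to? */
static int memcg_to_charge(const struct task_sketch *t)
{
	return t->skip_depth ? ROOT_MEMCG_ID : t->memcg_id;
}
```

A counter rather than a flag lets regions nest, so an inner resume does
not re-enable accounting while an outer region is still open.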

Signed-off-by: Glauber Costa 
Cc: Christoph Lameter 
Cc: David Rientjes 
Cc: Frederic Weisbecker 
Cc: Greg Thelen 
Cc: Johannes Weiner 
Cc: JoonSoo Kim 
Cc: KAMEZAWA Hiroyuki 
Cc: Mel Gorman 
Cc: Michal Hocko 
Cc: Pekka Enberg 
Cc: Rik van Riel 
Cc: Suleiman Souhlal 
Cc: Tejun Heo 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

memcg: infrastructure to match an allocation to the right cache

2012-12-18T23:02:14+00:00

The page allocator is able to bind a page to a memcg when it is
allocated.  But for the caches, we'd like to have as many objects as
possible in a page belonging to the same cache.

This is done in this patch by calling memcg_kmem_get_cache at the
beginning of every allocation function.  This function is patched out by
static branches when the kernel memory controller is not being used.

It assumes that the allocating task, which determines the memcg in the
page allocator, belongs to the same cgroup throughout the whole process.
Misaccounting can happen if the task calls memcg_kmem_get_cache() while
belonging to one cgroup, and later moves to another.  This is considered
acceptable, and should only happen upon task migration.

Before the cache is created by the memcg core, there is also a possible
imbalance: the task belongs to a memcg, but the cache being allocated from
is the global cache, since the child cache is not yet guaranteed to be
ready.  This case is also fine, since in this case the GFP_KMEMCG will not
be passed and the page allocator will not attempt any cgroup accounting.
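
A sketch of the lookup with its fallback: pick the per-memcg copy if it
exists, otherwise keep using the root cache.  The fixed-size array, the
negative ID for "no kmem-limited memcg", and get_cache_sketch() are
hypothetical; the real function also triggers creation of the missing
child cache, which is not shown.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_MEMCGS 4

/* Hypothetical root cache with one slot per memcg. */
struct kcache_sketch {
	const char *name;
	struct kcache_sketch *memcg_caches[MAX_MEMCGS];
};

/*
 * Pick the cache to allocate from.  A memcg_id outside the array is
 * treated as "task is not in a kmem-limited memcg".  If the child cache
 * is not ready yet, fall back to the root cache (no memcg accounting).
 */
static struct kcache_sketch *get_cache_sketch(struct kcache_sketch *root,
					      int memcg_id)
{
	struct kcache_sketch *child;

	assert(root != NULL);
	if (memcg_id < 0 || memcg_id >= MAX_MEMCGS)
		return root;
	child = root->memcg_caches[memcg_id];
	return child ? child : root;
}
```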

Signed-off-by: Glauber Costa 
Cc: Christoph Lameter 
Cc: David Rientjes 
Cc: Frederic Weisbecker 
Cc: Greg Thelen 
Cc: Johannes Weiner 
Cc: JoonSoo Kim 
Cc: KAMEZAWA Hiroyuki 
Cc: Mel Gorman 
Cc: Michal Hocko 
Cc: Pekka Enberg 
Cc: Rik van Riel 
Cc: Suleiman Souhlal 
Cc: Tejun Heo 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds