linux-toradex.git/include/linux/memcontrol.h, branch v5.1-rc3

mm, memcg: create mem_cgroup_from_seq

2019-03-06T05:07:17+00:00

This is the start of a series of patches similar to my earlier
DEFINE_MEMCG_MAX_OR_VAL work, but with less Macro Magic(tm).

There are a bunch of places we go from seq_file to mem_cgroup, which
currently requires manually getting the css, then getting the mem_cgroup
from the css.  It's in enough places now that having mem_cgroup_from_seq
makes sense (and also makes the next patch a bit nicer).
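
As a rough illustration of the two-step lookup the helper wraps, here is a
userspace sketch.  The struct layouts and the way seq_css() finds the css
are simplified stand-ins chosen for this example, not the kernel
definitions; only the shape of the chain (seq_file, then css, then
mem_cgroup) is taken from the text above.

```c
#include <stddef.h>

/* Simplified stand-ins for the kernel types (assumptions of this sketch). */
struct cgroup_subsys_state {
	int id;
};

struct mem_cgroup {
	struct cgroup_subsys_state css;	/* css embedded in the memcg */
	unsigned long max;
};

struct seq_file {
	void *private;	/* in this sketch: points directly at the css */
};

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Step 1: seq_file -> css (mock; the kernel lookup is more involved). */
static struct cgroup_subsys_state *seq_css(struct seq_file *m)
{
	return m->private;
}

/* Step 2: css -> mem_cgroup that embeds it. */
static struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css)
{
	return css ? container_of(css, struct mem_cgroup, css) : NULL;
}

/* The combined helper: one call instead of two at every call site. */
static struct mem_cgroup *mem_cgroup_from_seq(struct seq_file *m)
{
	return mem_cgroup_from_css(seq_css(m));
}
```

A show handler would then start with mem_cgroup_from_seq(m) rather than
spelling out both steps.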

Link: http://lkml.kernel.org/r/20190124194050.GA31341@chrisdown.name
Signed-off-by: Chris Down 
Acked-by: Johannes Weiner 
Acked-by: Michal Hocko 
Cc: Tejun Heo 
Cc: Roman Gushchin 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

memcg: localize memcg_kmem_enabled() check

2019-03-06T05:07:15+00:00

Move the memcg_kmem_enabled() checks into the memcg kmem charge/uncharge
functions, so that users don't have to check that condition explicitly.

This is purely a code cleanup patch without any functional change.  Only
the order of checks in memcg_charge_slab() can potentially change, but
functionally it will be the same.  This should not matter as
memcg_charge_slab() is not in the hot path.
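
The pattern can be sketched in userspace as follows.  The signatures are
deliberately reduced (the kernel functions take more arguments) and the
global flag and page counter are hypothetical stand-ins; the point is only
that the enabled check lives in the callee.

```c
#include <stdbool.h>

static bool kmem_enabled;	/* stand-in for the real enabled state */
static long charged_pages;	/* hypothetical accounting counter */

static bool memcg_kmem_enabled(void)
{
	return kmem_enabled;
}

/* The check is done here, so callers can charge unconditionally. */
static int memcg_kmem_charge(unsigned int order)
{
	if (!memcg_kmem_enabled())
		return 0;	/* accounting off: succeed without charging */
	charged_pages += 1L << order;
	return 0;
}

static void memcg_kmem_uncharge(unsigned int order)
{
	if (!memcg_kmem_enabled())
		return;
	charged_pages -= 1L << order;
}
```

Call sites shrink from "if (memcg_kmem_enabled()) charge(...)" to a plain
call.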

Link: http://lkml.kernel.org/r/20190103161203.162375-1-shakeelb@google.com
Signed-off-by: Shakeel Butt 
Acked-by: Michal Hocko 
Cc: Johannes Weiner 
Cc: Vladimir Davydov 
Cc: Roman Gushchin 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm, oom: add oom victim's memcg to the oom context information

2018-12-28T20:11:48+00:00

The current oom report doesn't display the victim's memcg context during
a global OOM situation.  While this information is not strictly needed, it
can be really helpful in containerized environments for locating which
container has lost a process.  Now that we have a single line for the oom
context, we can trivially add both the oom memcg (this can be either
global_oom or a specific memcg which hits its hard limit) and task_memcg,
which is the victim's memcg.

Below is the single line output in the oom report after this patch.

- global oom context information:

oom-kill:constraint=,nodemask=,cpuset=,mems_allowed=,global_oom,task_memcg=,task=,pid=,uid=

- memcg oom context information:

oom-kill:constraint=,nodemask=,cpuset=,mems_allowed=,oom_memcg=,task_memcg=,task=,pid=,uid=
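
To make the two variants concrete, the following userspace sketch builds
such a line with snprintf().  The struct, the function name and every
field value used with it are hypothetical; only the key order and the
global_oom / oom_memcg= alternation follow the lines above.

```c
#include <stdio.h>
#include <stddef.h>

struct oom_ctx_sketch {
	const char *constraint;
	const char *nodemask;
	const char *cpuset;
	const char *mems_allowed;
	const char *oom_memcg;	/* NULL for a global OOM */
	const char *task_memcg;
	const char *task;
	int pid;
	unsigned int uid;
};

/* Returns the line length, or -1 if a buffer is too small. */
static int format_oom_context(char *buf, size_t len,
			      const struct oom_ctx_sketch *c)
{
	char scope[128];
	int n;

	/* Global OOM prints a bare token, memcg OOM prints key=value. */
	if (c->oom_memcg)
		n = snprintf(scope, sizeof(scope), "oom_memcg=%s", c->oom_memcg);
	else
		n = snprintf(scope, sizeof(scope), "global_oom");
	if (n < 0 || (size_t)n >= sizeof(scope))
		return -1;

	n = snprintf(buf, len,
		     "oom-kill:constraint=%s,nodemask=%s,cpuset=%s,"
		     "mems_allowed=%s,%s,task_memcg=%s,task=%s,pid=%d,uid=%u",
		     c->constraint, c->nodemask, c->cpuset, c->mems_allowed,
		     scope, c->task_memcg, c->task, c->pid, c->uid);
	if (n < 0 || (size_t)n >= len)
		return -1;
	return n;
}
```

Keeping everything on one comma-separated line is what makes the report
easy to match with a single pattern in log tooling.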

[penguin-kernel@I-love.SAKURA.ne.jp: use pr_cont() in mem_cgroup_print_oom_context()]
  Link: http://lkml.kernel.org/r/201812190723.wBJ7NdkN032628@www262.sakura.ne.jp
Link: http://lkml.kernel.org/r/1542799799-36184-2-git-send-email-ufo19890607@gmail.com
Signed-off-by: yuzhoujian 
Signed-off-by: Tetsuo Handa 
Acked-by: Michal Hocko 
Cc: David Rientjes 
Cc: "Kirill A . Shutemov" 
Cc: Andrea Arcangeli 
Cc: Tetsuo Handa 
Cc: Roman Gushchin 
Cc: Yang Shi 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm/memcontrol.c: convert mem_cgroup_id::ref to refcount_t type

2018-10-26T23:26:35+00:00

This allows the generic refcount_t interfaces to check for counter
overflow instead of the currently existing VM_BUG_ON().  The only
difference after the patch is that VM_BUG_ON() may cause BUG(), while
refcount_t fires with WARN().  But this does not seem significant here,
since such problems are usually caught by syzbot with panic-on-warn
enabled.
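
The behavioural difference can be modelled with a small userspace sketch.
This is a simplified, non-atomic model written for illustration, not the
kernel's refcount_t implementation; all names carry a _sketch suffix to
mark them as hypothetical.  A misuse is reported and the counter is left
in a safe state instead of stopping execution.

```c
#include <limits.h>
#include <stdbool.h>
#include <stdio.h>

struct refcount_sketch {
	unsigned int refs;
};

static unsigned int warn_count;	/* how many times the sketch warned */

static void warn_sketch(const char *msg)
{
	warn_count++;
	fprintf(stderr, "WARNING: %s\n", msg);	/* report, keep running */
}

static void refcount_inc_sketch(struct refcount_sketch *r)
{
	if (r->refs == 0) {
		warn_sketch("increment on 0");	/* object already released */
		return;
	}
	if (r->refs == UINT_MAX) {
		warn_sketch("saturated");	/* refuse to wrap around */
		return;
	}
	r->refs++;
}

/* Returns true when the last reference was dropped. */
static bool refcount_dec_and_test_sketch(struct refcount_sketch *r)
{
	if (r->refs == 0) {
		warn_sketch("underflow");
		return false;
	}
	if (r->refs == UINT_MAX)
		return false;	/* a saturated counter stays pinned */
	return --r->refs == 0;
}
```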

Link: http://lkml.kernel.org/r/153910718919.7006.13400779039257185427.stgit@localhost.localdomain
Signed-off-by: Kirill Tkhai 
Reviewed-by: Andrew Morton 
Acked-by: Michal Hocko 
Cc: Johannes Weiner 
Cc: Vladimir Davydov 
Cc: Andrea Parri 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: rework memcg kernel stack accounting

2018-10-26T23:25:19+00:00

If CONFIG_VMAP_STACK is set, kernel stacks are allocated using
__vmalloc_node_range() with __GFP_ACCOUNT.  So kernel stack pages are
charged against corresponding memory cgroups on allocation and uncharged
on releasing them.

The problem is that we do cache kernel stacks in small per-cpu caches and
do reuse them for new tasks, which can belong to different memory cgroups.

Each stack page still holds a reference to the original cgroup, so the
cgroup can't be released until the vmap area is released.

Releasing the vmap area requires more than two subsequent exits without
forks in between on the current cpu, which is very unlikely.  As a
result, I saw a significant number of dying cgroups (in theory, up to 2
* number_of_cpu + number_of_tasks), which can't be released even by
significant memory pressure.

As a cgroup structure can take a significant amount of memory (first of
all, per-cpu data like memcg statistics), it leads to a noticeable waste
of memory.

Link: http://lkml.kernel.org/r/20180827162621.30187-1-guro@fb.com
Fixes: ac496bf48d97 ("fork: Optimize task creation by caching two thread stacks per CPU if CONFIG_VMAP_STACK=y")
Signed-off-by: Roman Gushchin 
Reviewed-by: Shakeel Butt 
Acked-by: Michal Hocko 
Cc: Johannes Weiner 
Cc: Andy Lutomirski 
Cc: Konstantin Khlebnikov 
Cc: Tejun Heo 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm, oom: introduce memory.oom.group

2018-08-22T17:52:45+00:00

For some workloads an intervention from the OOM killer can be painful.
Killing a random task can bring the workload into an inconsistent state.

Historically, there are two common solutions for this
problem:
1) enabling panic_on_oom,
2) using a userspace daemon to monitor OOMs and kill
   all outstanding processes.

Both approaches have their downsides: rebooting on each OOM is an obvious
waste of capacity, and handling all in userspace is tricky and requires a
userspace agent, which will monitor all cgroups for OOMs.

In most cases an in-kernel after-OOM cleaning-up mechanism can eliminate
the necessity of enabling panic_on_oom.  Also, it can simplify the cgroup
management for userspace applications.

This commit introduces a new knob for cgroup v2 memory controller:
memory.oom.group.  The knob determines whether the cgroup should be
treated as an indivisible workload by the OOM killer.  If set, all tasks
belonging to the cgroup or to its descendants (if the memory cgroup is not
a leaf cgroup) are killed together or not at all.

To determine which cgroup has to be killed, we traverse the cgroup
hierarchy from the victim task's cgroup up to the OOMing cgroup (or root),
looking for the highest-level cgroup with memory.oom.group set.
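
The upward walk can be sketched in userspace like this.  The struct and
function names are hypothetical and the real code also has to deal with
locking and reference counting, which are left out here.

```c
#include <stdbool.h>
#include <stddef.h>

struct memcg_sketch {
	struct memcg_sketch *parent;	/* NULL for the root */
	bool oom_group;			/* memory.oom.group is set */
};

/*
 * Walk from the victim's cgroup towards oom_domain (or the root when
 * oom_domain is NULL) and return the highest-level cgroup on that path
 * with oom_group set, or NULL if there is none.
 */
static struct memcg_sketch *find_oom_group(struct memcg_sketch *victim,
					   struct memcg_sketch *oom_domain)
{
	struct memcg_sketch *found = NULL;
	struct memcg_sketch *m;

	for (m = victim; m; m = m->parent) {
		if (m->oom_group)
			found = m;	/* later hits are higher up */
		if (m == oom_domain)
			break;		/* never look above the OOMing cgroup */
	}
	return found;
}
```

When the function returns NULL, only the selected victim task is killed.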

Tasks with the OOM protection (oom_score_adj set to -1000) are treated as
an exception and are never killed.

This patch doesn't change the OOM victim selection algorithm.

Link: http://lkml.kernel.org/r/20180802003201.817-4-guro@fb.com
Signed-off-by: Roman Gushchin 
Acked-by: Michal Hocko 
Acked-by: Johannes Weiner 
Cc: David Rientjes 
Cc: Tetsuo Handa 
Cc: Tejun Heo 
Cc: Vladimir Davydov 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm/list_lru.c: set bit in memcg shrinker bitmap on first list_lru item appearance

2018-08-17T23:20:31+00:00

Introduce the set_shrinker_bit() function to set the shrinker-related bit
in the memcg shrinker bitmap, and set the bit after the first item is
added and when reparenting a destroyed memcg's items.

This will allow the next patch to call shrinkers only when they have
charged objects at the moment, improving shrink_slab() performance.
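
A userspace sketch of the two places where the bit gets set is shown
below.  The types, the fixed bitmap size and the _sketch names are
assumptions made for this example; per-node lists and locking are
omitted.

```c
#include <limits.h>

#define MAX_SHRINKERS_SKETCH 128
#define BITS_PER_LONG_SKETCH (sizeof(unsigned long) * CHAR_BIT)

struct memcg_sketch {
	unsigned long shrinker_map[(MAX_SHRINKERS_SKETCH +
				    BITS_PER_LONG_SKETCH - 1) /
				   BITS_PER_LONG_SKETCH];
};

struct lru_sketch {
	unsigned int shrinker_id;	/* must be < MAX_SHRINKERS_SKETCH */
	long nr_items;
};

static void set_shrinker_bit_sketch(struct memcg_sketch *memcg,
				    unsigned int id)
{
	memcg->shrinker_map[id / BITS_PER_LONG_SKETCH] |=
		1UL << (id % BITS_PER_LONG_SKETCH);
}

static int test_shrinker_bit_sketch(const struct memcg_sketch *memcg,
				    unsigned int id)
{
	return (memcg->shrinker_map[id / BITS_PER_LONG_SKETCH] >>
		(id % BITS_PER_LONG_SKETCH)) & 1UL;
}

/* Case 1: the bit is set when the list goes from empty to non-empty. */
static void lru_add_sketch(struct lru_sketch *lru, struct memcg_sketch *memcg)
{
	if (lru->nr_items++ == 0)
		set_shrinker_bit_sketch(memcg, lru->shrinker_id);
}

/* Case 2: items of a destroyed memcg move to the parent's list. */
static void lru_reparent_sketch(struct lru_sketch *src, struct lru_sketch *dst,
				struct memcg_sketch *dst_memcg)
{
	int first = (dst->nr_items == 0 && src->nr_items > 0);

	dst->nr_items += src->nr_items;
	src->nr_items = 0;
	if (first)
		set_shrinker_bit_sketch(dst_memcg, dst->shrinker_id);
}
```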

[ktkhai@virtuozzo.com: v9]
  Link: http://lkml.kernel.org/r/153112557572.4097.17315791419810749985.stgit@localhost.localdomain
Link: http://lkml.kernel.org/r/153063065671.1818.15914674956134687268.stgit@localhost.localdomain
Signed-off-by: Kirill Tkhai 
Acked-by: Vladimir Davydov 
Tested-by: Shakeel Butt 
Cc: Al Viro 
Cc: Andrey Ryabinin 
Cc: Chris Wilson 
Cc: Greg Kroah-Hartman 
Cc: Guenter Roeck 
Cc: "Huang, Ying" 
Cc: Johannes Weiner 
Cc: Josef Bacik 
Cc: Li RongQing 
Cc: Matthew Wilcox 
Cc: Matthias Kaehlcke 
Cc: Mel Gorman 
Cc: Michal Hocko 
Cc: Minchan Kim 
Cc: Philippe Ombredanne 
Cc: Roman Gushchin 
Cc: Sahitya Tummala 
Cc: Stephen Rothwell 
Cc: Tetsuo Handa 
Cc: Thomas Gleixner 
Cc: Waiman Long 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm/memcontrol.c: export mem_cgroup_is_root()

2018-08-17T23:20:31+00:00

This will be used in the next patch.

Link: http://lkml.kernel.org/r/153063064347.1818.1987011484100392706.stgit@localhost.localdomain
Signed-off-by: Kirill Tkhai 
Acked-by: Vladimir Davydov 
Tested-by: Shakeel Butt 
Cc: Al Viro 
Cc: Andrey Ryabinin 
Cc: Chris Wilson 
Cc: Greg Kroah-Hartman 
Cc: Guenter Roeck 
Cc: "Huang, Ying" 
Cc: Johannes Weiner 
Cc: Josef Bacik 
Cc: Li RongQing 
Cc: Matthew Wilcox 
Cc: Matthias Kaehlcke 
Cc: Mel Gorman 
Cc: Michal Hocko 
Cc: Minchan Kim 
Cc: Philippe Ombredanne 
Cc: Roman Gushchin 
Cc: Sahitya Tummala 
Cc: Stephen Rothwell 
Cc: Tetsuo Handa 
Cc: Thomas Gleixner 
Cc: Waiman Long 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm, memcg: assign memcg-aware shrinkers bitmap to memcg

2018-08-17T23:20:30+00:00

Imagine a big node with many cpus, memory cgroups and containers.  Say
we have 200 containers, and every container has 10 mounts and 10 cgroups.
Container tasks don't touch other containers' mounts.  If there are
intensive page writes and global reclaim happens, a writing task has to
iterate over all memcgs to shrink slab before it is able to go to
shrink_page_list().

Iteration over all the memcg slabs is very expensive: the task has to
visit 200 * 10 = 2000 shrinkers for every memcg, and since there are
2000 memcgs, the total calls are 2000 * 2000 = 4000000.
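
The arithmetic above can be written out as a small helper.  The function
and parameter names are made up for this sketch, and it assumes, as the
figures in the text imply, one memcg-aware shrinker per mount.

```c
/* Worst case: every shrinker is invoked once for every memcg. */
static unsigned long slab_shrink_calls(unsigned long containers,
				       unsigned long mounts_per_container,
				       unsigned long cgroups_per_container)
{
	unsigned long shrinkers = containers * mounts_per_container;
	unsigned long memcgs = containers * cgroups_per_container;

	return shrinkers * memcgs;
}
```

With 200 containers, 10 mounts and 10 cgroups each, this gives
2000 * 2000 = 4000000 calls per full pass.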

So, the shrinker makes 4 million do_shrink_slab() calls just to try to
isolate SWAP_CLUSTER_MAX pages in one of the actively writing memcgs via
shrink_page_list().  I've observed a node spending almost 100% of its
time in the kernel, uselessly iterating over already shrunk slab.

This patch adds a bitmap of memcg-aware shrinkers to memcg.  The size of
the bitmap depends on bitmap_nr_ids, and during the memcg's lifetime it
is kept large enough to fit bitmap_nr_ids shrinkers.  Every bit in the
map corresponds to a shrinker id.

The next patches will keep a bit set only for a memcg that really has
charged objects.  This will allow shrink_slab() to improve its
performance significantly.  See the last patch for the numbers.
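
How a per-memcg bitmap cuts down the work can be sketched as follows.
The helper names are hypothetical, and the loop only counts the calls
that would be made; it does not model what a shrinker actually does.

```c
#include <limits.h>
#include <stddef.h>

#define BITS_PER_LONG_SKETCH (sizeof(unsigned long) * CHAR_BIT)

/* Number of unsigned long words needed to hold nr_ids bits. */
static size_t map_words(size_t nr_ids)
{
	return (nr_ids + BITS_PER_LONG_SKETCH - 1) / BITS_PER_LONG_SKETCH;
}

/*
 * Count the shrinker invocations for one memcg when only ids whose bit
 * is set are visited.  map must hold at least map_words(nr_ids) words.
 */
static size_t shrink_calls_for_memcg(const unsigned long *map, size_t nr_ids)
{
	size_t calls = 0;
	size_t id;

	for (id = 0; id < nr_ids; id++) {
		if (map[id / BITS_PER_LONG_SKETCH] &
		    (1UL << (id % BITS_PER_LONG_SKETCH)))
			calls++;	/* stand-in for do_shrink_slab() */
	}
	return calls;
}
```

A memcg with objects on only two of 2000 shrinkers then costs two calls
instead of 2000.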

[ktkhai@virtuozzo.com: v9]
  Link: http://lkml.kernel.org/r/153112549031.4097.3576147070498769979.stgit@localhost.localdomain
[ktkhai@virtuozzo.com: add comment to mem_cgroup_css_online()]
  Link: http://lkml.kernel.org/r/521f9e5f-c436-b388-fe83-4dc870bfb489@virtuozzo.com
Link: http://lkml.kernel.org/r/153063056619.1818.12550500883688681076.stgit@localhost.localdomain
Signed-off-by: Kirill Tkhai 
Acked-by: Vladimir Davydov 
Tested-by: Shakeel Butt 
Cc: Al Viro 
Cc: Andrey Ryabinin 
Cc: Chris Wilson 
Cc: Greg Kroah-Hartman 
Cc: Guenter Roeck 
Cc: "Huang, Ying" 
Cc: Johannes Weiner 
Cc: Josef Bacik 
Cc: Li RongQing 
Cc: Matthew Wilcox 
Cc: Matthias Kaehlcke 
Cc: Mel Gorman 
Cc: Michal Hocko 
Cc: Minchan Kim 
Cc: Philippe Ombredanne 
Cc: Roman Gushchin 
Cc: Sahitya Tummala 
Cc: Stephen Rothwell 
Cc: Tetsuo Handa 
Cc: Thomas Gleixner 
Cc: Waiman Long 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: introduce CONFIG_MEMCG_KMEM as combination of CONFIG_MEMCG && !CONFIG_SLOB

2018-08-17T23:20:30+00:00

Introduce a new config option, which replaces the repeated
CONFIG_MEMCG && !CONFIG_SLOB pattern.  The next patches add a little more
memcg+kmem related code, so let's keep the defines clearer.
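
The effect on C code can be sketched as below.  In the kernel the symbol
is provided by the build configuration; here it is derived by hand from a
hypothetical configuration so that the example is self-contained, and the
function name is invented for this sketch.

```c
#include <stdbool.h>

/* Hypothetical build configuration for this sketch. */
#define CONFIG_MEMCG 1
/* CONFIG_SLOB is deliberately left undefined. */

/* Derived once, instead of repeating the pair at every use. */
#if defined(CONFIG_MEMCG) && !defined(CONFIG_SLOB)
#define CONFIG_MEMCG_KMEM 1
#endif

/* Call sites test a single symbol. */
static bool kmem_accounting_built_in(void)
{
#ifdef CONFIG_MEMCG_KMEM
	return true;
#else
	return false;
#endif
}
```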

Link: http://lkml.kernel.org/r/153063053670.1818.15013136946600481138.stgit@localhost.localdomain
Signed-off-by: Kirill Tkhai 
Acked-by: Vladimir Davydov 
Tested-by: Shakeel Butt 
Cc: Al Viro 
Cc: Andrey Ryabinin 
Cc: Chris Wilson 
Cc: Greg Kroah-Hartman 
Cc: Guenter Roeck 
Cc: "Huang, Ying" 
Cc: Johannes Weiner 
Cc: Josef Bacik 
Cc: Li RongQing 
Cc: Matthew Wilcox 
Cc: Matthias Kaehlcke 
Cc: Mel Gorman 
Cc: Michal Hocko 
Cc: Minchan Kim 
Cc: Philippe Ombredanne 
Cc: Roman Gushchin 
Cc: Sahitya Tummala 
Cc: Stephen Rothwell 
Cc: Tetsuo Handa 
Cc: Thomas Gleixner 
Cc: Waiman Long 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds