linux-toradex.git/include/linux/list_lru.h, branch v6.8-rc5

list_lru: allow explicit memcg and NUMA node selection

2023-12-12T18:57:01+00:00

Patch series "workload-specific and memory pressure-driven zswap
writeback", v8.

There are currently several issues with zswap writeback:

1. There is only a single global LRU for zswap, making it impossible to
   perform worload-specific shrinking - an memcg under memory pressure
   cannot determine which pages in the pool it owns, and often ends up
   writing pages from other memcgs. This issue has been previously
   observed in practice and mitigated by simply disabling
   memcg-initiated shrinking:

   https://lore.kernel.org/all/20230530232435.3097106-1-nphamcs@gmail.com/T/#u

   But this solution leaves a lot to be desired, as we still do not
   have an avenue for an memcg to free up its own memory locked up in
   the zswap pool.

2. We only shrink the zswap pool when the user-defined limit is hit.
   This means that if we set the limit too high, cold data that are
   unlikely to be used again will reside in the pool, wasting precious
   memory. It is hard to predict how much zswap space will be needed
   ahead of time, as this depends on the workload (specifically, on
   factors such as memory access patterns and compressibility of the
   memory pages).

This patch series solves these issues by separating the global zswap LRU
into per-memcg and per-NUMA LRUs, and performs workload-specific (i.e
memcg- and NUMA-aware) zswap writeback under memory pressure.  The new
shrinker does not have any parameter that must be tuned by the user, and
can be opted in or out on a per-memcg basis.

As a proof of concept, we ran the following synthetic benchmark: build the
linux kernel in a memory-limited cgroup, and allocate some cold data in
tmpfs to see if the shrinker could write them out and improved the overall
performance.  Depending on the amount of cold data generated, we observe
from 14% to 35% reduction in kernel CPU time used in the kernel builds.


This patch (of 6):

The interface of list_lru is based on the assumption that the list node
and the data it represents belong to the same allocated on the correct
node/memcg.  While this assumption is valid for existing slab objects LRU
such as dentries and inodes, it is undocumented, and rather inflexible for
certain potential list_lru users (such as the upcoming zswap shrinker and
the THP shrinker).  It has caused us a lot of issues during our
development.

This patch changes list_lru interface so that the caller must explicitly
specify numa node and memcg when adding and removing objects.  The old
list_lru_add() and list_lru_del() are renamed to list_lru_add_obj() and
list_lru_del_obj(), respectively.

It also extends the list_lru API with a new function, list_lru_putback,
which undoes a previous list_lru_isolate call.  Unlike list_lru_add, it
does not increment the LRU node count (as list_lru_isolate does not
decrement the node count).  list_lru_putback also allows for explicit
memcg and NUMA node selection.

Link: https://lkml.kernel.org/r/20231130194023.4102148-1-nphamcs@gmail.com
Link: https://lkml.kernel.org/r/20231130194023.4102148-2-nphamcs@gmail.com
Signed-off-by: Nhat Pham 
Suggested-by: Johannes Weiner 
Acked-by: Johannes Weiner 
Tested-by: Bagas Sanjaya 
Cc: Chris Li 
Cc: Dan Streetman 
Cc: Domenico Cerasuolo 
Cc: Michal Hocko 
Cc: Muchun Song 
Cc: Roman Gushchin 
Cc: Seth Jennings 
Cc: Shakeel Butt 
Cc: Shuah Khan 
Cc: Vitaly Wool 
Cc: Yosry Ahmed 
Signed-off-by: Andrew Morton

mm: list_lru: Update kernel documentation to follow the requirements

2023-12-11T00:51:52+00:00

kernel-doc is not happy about documentation in list_lru.h:

list_lru.h:90: warning: Function parameter or member 'lru' not described in 'list_lru_add'
list_lru.h:90: warning: Excess function parameter 'list_lru' description in 'list_lru_add'
list_lru.h:90: warning: No description found for return value of 'list_lru_add'
list_lru.h:103: warning: Function parameter or member 'lru' not described in 'list_lru_del'
list_lru.h:103: warning: Excess function parameter 'list_lru' description in 'list_lru_del'
list_lru.h:103: warning: No description found for return value of 'list_lru_del'
list_lru.h:116: warning: No description found for return value of 'list_lru_count_one'
list_lru.h:168: warning: No description found for return value of 'list_lru_walk_one'
list_lru.h:185: warning: No description found for return value of 'list_lru_walk_one_irq'

Fix the documentation accordingly.

While at it, fix the references to the parameters in functions
inside the long descriptions, on which the above script is not
complaining (yet?).

Link: https://lkml.kernel.org/r/20231123172320.2434780-1-andriy.shevchenko@linux.intel.com
Signed-off-by: Andy Shevchenko 
Cc: Rasmus Villemoes 
Signed-off-by: Andrew Morton

mm: list_lru: rename list_lru_per_memcg to list_lru_memcg

2022-03-22T22:57:03+00:00

The name of list_lru_memcg was occupied before and became free since
last commit.  Rename list_lru_per_memcg to list_lru_memcg since the name
is brief.

Link: https://lkml.kernel.org/r/20220228122126.37293-16-songmuchun@bytedance.com
Signed-off-by: Muchun Song 
Cc: Alex Shi 
Cc: Anna Schumaker 
Cc: Chao Yu 
Cc: Dave Chinner 
Cc: Fam Zheng 
Cc: Jaegeuk Kim 
Cc: Johannes Weiner 
Cc: Kari Argillander 
Cc: Matthew Wilcox (Oracle) 
Cc: Michal Hocko 
Cc: Qi Zheng 
Cc: Roman Gushchin 
Cc: Shakeel Butt 
Cc: Theodore Ts'o 
Cc: Trond Myklebust 
Cc: Vladimir Davydov 
Cc: Vlastimil Babka 
Cc: Wei Yang 
Cc: Xiongchun Duan 
Cc: Yang Shi 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: list_lru: replace linear array with xarray

2022-03-22T22:57:03+00:00

If we run 10k containers in the system, the size of the
list_lru_memcg->lrus can be ~96KB per list_lru.  When we decrease the
number containers, the size of the array will not be shrinked.  It is
not scalable.  The xarray is a good choice for this case.  We can save a
lot of memory when there are tens of thousands continers in the system.
If we use xarray, we also can remove the logic code of resizing array,
which can simplify the code.

[akpm@linux-foundation.org: remove unused local]

Link: https://lkml.kernel.org/r/20220228122126.37293-13-songmuchun@bytedance.com
Signed-off-by: Muchun Song 
Cc: Alex Shi 
Cc: Anna Schumaker 
Cc: Chao Yu 
Cc: Dave Chinner 
Cc: Fam Zheng 
Cc: Jaegeuk Kim 
Cc: Johannes Weiner 
Cc: Kari Argillander 
Cc: Matthew Wilcox (Oracle) 
Cc: Michal Hocko 
Cc: Qi Zheng 
Cc: Roman Gushchin 
Cc: Shakeel Butt 
Cc: Theodore Ts'o 
Cc: Trond Myklebust 
Cc: Vladimir Davydov 
Cc: Vlastimil Babka 
Cc: Wei Yang 
Cc: Xiongchun Duan 
Cc: Yang Shi 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: list_lru: rename memcg_drain_all_list_lrus to memcg_reparent_list_lrus

2022-03-22T22:57:03+00:00

The purpose of the memcg_drain_all_list_lrus() is list_lrus reparenting.
It is very similar to memcg_reparent_objcgs().  Rename it to
memcg_reparent_list_lrus() so that the name can more consistent with
memcg_reparent_objcgs().

Link: https://lkml.kernel.org/r/20220228122126.37293-12-songmuchun@bytedance.com
Signed-off-by: Muchun Song 
Cc: Alex Shi 
Cc: Anna Schumaker 
Cc: Chao Yu 
Cc: Dave Chinner 
Cc: Fam Zheng 
Cc: Jaegeuk Kim 
Cc: Johannes Weiner 
Cc: Kari Argillander 
Cc: Matthew Wilcox (Oracle) 
Cc: Michal Hocko 
Cc: Qi Zheng 
Cc: Roman Gushchin 
Cc: Shakeel Butt 
Cc: Theodore Ts'o 
Cc: Trond Myklebust 
Cc: Vladimir Davydov 
Cc: Vlastimil Babka 
Cc: Wei Yang 
Cc: Xiongchun Duan 
Cc: Yang Shi 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: list_lru: allocate list_lru_one only when needed

2022-03-22T22:57:03+00:00

In our server, we found a suspected memory leak problem.  The kmalloc-32
consumes more than 6GB of memory.  Other kmem_caches consume less than
2GB memory.

After our in-depth analysis, the memory consumption of kmalloc-32 slab
cache is the cause of list_lru_one allocation.

  crash> p memcg_nr_cache_ids
  memcg_nr_cache_ids = $2 = 24574

memcg_nr_cache_ids is very large and memory consumption of each list_lru
can be calculated with the following formula.

  num_numa_node * memcg_nr_cache_ids * 32 (kmalloc-32)

There are 4 numa nodes in our system, so each list_lru consumes ~3MB.

  crash> list super_blocks | wc -l
  952

Every mount will register 2 list lrus, one is for inode, another is for
dentry.  There are 952 super_blocks.  So the total memory is 952 * 2 * 3
MB (~5.6GB).  But the number of memory cgroup is less than 500.  So I
guess more than 12286 containers have been deployed on this machine (I do
not know why there are so many containers, it may be a user's bug or the
user really want to do that).  And memcg_nr_cache_ids has not been reduced
to a suitable value.  This can waste a lot of memory.

Now the infrastructure for dynamic list_lru_one allocation is ready, so
remove statically allocated memory code to save memory.

Link: https://lkml.kernel.org/r/20220228122126.37293-11-songmuchun@bytedance.com
Signed-off-by: Muchun Song 
Cc: Alex Shi 
Cc: Anna Schumaker 
Cc: Chao Yu 
Cc: Dave Chinner 
Cc: Fam Zheng 
Cc: Jaegeuk Kim 
Cc: Johannes Weiner 
Cc: Kari Argillander 
Cc: Matthew Wilcox (Oracle) 
Cc: Michal Hocko 
Cc: Qi Zheng 
Cc: Roman Gushchin 
Cc: Shakeel Butt 
Cc: Theodore Ts'o 
Cc: Trond Myklebust 
Cc: Vladimir Davydov 
Cc: Vlastimil Babka 
Cc: Wei Yang 
Cc: Xiongchun Duan 
Cc: Yang Shi 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: introduce kmem_cache_alloc_lru

2022-03-22T22:57:03+00:00

We currently allocate scope for every memcg to be able to tracked on
every superblock instantiated in the system, regardless of whether that
superblock is even accessible to that memcg.

These huge memcg counts come from container hosts where memcgs are
confined to just a small subset of the total number of superblocks that
instantiated at any given point in time.

For these systems with huge container counts, list_lru does not need the
capability of tracking every memcg on every superblock.  What it comes
down to is that adding the memcg to the list_lru at the first insert.
So introduce kmem_cache_alloc_lru to allocate objects and its list_lru.
In the later patch, we will convert all inode and dentry allocation from
kmem_cache_alloc to kmem_cache_alloc_lru.

Link: https://lkml.kernel.org/r/20220228122126.37293-3-songmuchun@bytedance.com
Signed-off-by: Muchun Song 
Cc: Alex Shi 
Cc: Anna Schumaker 
Cc: Chao Yu 
Cc: Dave Chinner 
Cc: Fam Zheng 
Cc: Jaegeuk Kim 
Cc: Johannes Weiner 
Cc: Kari Argillander 
Cc: Matthew Wilcox (Oracle) 
Cc: Michal Hocko 
Cc: Qi Zheng 
Cc: Roman Gushchin 
Cc: Shakeel Butt 
Cc: Theodore Ts'o 
Cc: Trond Myklebust 
Cc: Vladimir Davydov 
Cc: Vlastimil Babka 
Cc: Wei Yang 
Cc: Xiongchun Duan 
Cc: Yang Shi 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: list_lru: transpose the array of per-node per-memcg lru lists

2022-03-22T22:57:03+00:00

Patch series "Optimize list lru memory consumption", v6.

In our server, we found a suspected memory leak problem.  The kmalloc-32
consumes more than 6GB of memory.  Other kmem_caches consume less than
2GB memory.

After our in-depth analysis, the memory consumption of kmalloc-32 slab
cache is the cause of list_lru_one allocation.

  crash> p
  memcg_nr_cache_ids memcg_nr_cache_ids = $2 = 24574

memcg_nr_cache_ids is very large and memory consumption of each list_lru
can be calculated with the following formula.

  num_numa_node * memcg_nr_cache_ids * 32 (kmalloc-32)

There are 4 numa nodes in our system, so each list_lru consumes ~3MB.

  crash> list super_blocks | wc -l
  952

Every mount will register 2 list lrus, one is for inode, another is for
dentry.  There are 952 super_blocks.  So the total memory is 952 * 2 * 3
MB (~5.6GB).  But now the number of memory cgroups is less than 500.  So
I guess more than 12286 memory cgroups have been created on this machine
(I do not know why there are so many cgroups, it may be a user's bug or
the user really want to do that).  Because memcg_nr_cache_ids has not
been reduced to a suitable value.  It leads to waste a lot of memory.
If we want to reduce memcg_nr_cache_ids, we have to *reboot* the server.
This is not what we want.

In order to reduce memcg_nr_cache_ids, I had posted a patchset [1] to do
this.  But this did not fundamentally solve the problem.

We currently allocate scope for every memcg to be able to tracked on
every superblock instantiated in the system, regardless of whether that
superblock is even accessible to that memcg.

These huge memcg counts come from container hosts where memcgs are
confined to just a small subset of the total number of superblocks that
instantiated at any given point in time.

For these systems with huge container counts, list_lru does not need the
capability of tracking every memcg on every superblock.

What it comes down to is that the list_lru is only needed for a given
memcg if that memcg is instatiating and freeing objects on a given
list_lru.

As Dave said, "Which makes me think we should be moving more towards 'add
the memcg to the list_lru at the first insert' model rather than
'instantiate all at memcg init time just in case'."

This patchset aims to optimize the list lru memory consumption from
different aspects.

I had done a easy test to show the optimization.  I create 10k memory
cgroups and mount 10k filesystems in the systems.  We use free command to
show how many memory does the systems comsumes after this operation (There
are 2 numa nodes in the system).

        +-----------------------+------------------------+
        |      condition        |   memory consumption   |
        +-----------------------+------------------------+
        | without this patchset |        24464 MB        |
        +-----------------------+------------------------+
        |     after patch 1     |        21957 MB        | <--------+
        +-----------------------+------------------------+          |
        |     after patch 10    |         6895 MB        |          |
        +-----------------------+------------------------+          |
        |     after patch 12    |         4367 MB        |          |
        +-----------------------+------------------------+          |
                                                                    |
        The more the number of nodes, the more obvious the effect---+

BTW, there was a recent discussion [2] on the same issue.

[1] https://lore.kernel.org/all/20210428094949.43579-1-songmuchun@bytedance.com/
[2] https://lore.kernel.org/all/20210405054848.GA1077931@in.ibm.com/

This series not only optimizes the memory usage of list_lru but also
simplifies the code.

This patch (of 16):

The current scheme of maintaining per-node per-memcg lru lists looks like:
  struct list_lru {
    struct list_lru_node *node;           (for each node)
      struct list_lru_memcg *memcg_lrus;
        struct list_lru_one *lru[];       (for each memcg)
  }

By effectively transposing the two-dimension array of list_lru_one's structures
(per-node per-memcg => per-memcg per-node) it's possible to save some memory
and simplify alloc/dealloc paths. The new scheme looks like:
  struct list_lru {
    struct list_lru_memcg *mlrus;
      struct list_lru_per_memcg *mlru[];  (for each memcg)
        struct list_lru_one node[0];      (for each node)
  }

Memory savings are coming from not only 'struct rcu_head' but also some
pointer arrays used to store the pointer to 'struct list_lru_one'.  The
array is per node and its size is 8 (a pointer) * num_memcgs.  So the
total size of the arrays is 8 * num_nodes * memcg_nr_cache_ids.  After
this patch, the size becomes 8 * memcg_nr_cache_ids.

Link: https://lkml.kernel.org/r/20220228122126.37293-1-songmuchun@bytedance.com
Link: https://lkml.kernel.org/r/20220228122126.37293-2-songmuchun@bytedance.com
Signed-off-by: Muchun Song 
Acked-by: Johannes Weiner 
Cc: Matthew Wilcox (Oracle) 
Cc: Michal Hocko 
Cc: Vladimir Davydov 
Cc: Shakeel Butt 
Cc: Yang Shi 
Cc: Alex Shi 
Cc: Wei Yang 
Cc: Dave Chinner 
Cc: Trond Myklebust 
Cc: Anna Schumaker 
Cc: Jaegeuk Kim 
Cc: Chao Yu 
Cc: Kari Argillander 
Cc: Vlastimil Babka 
Cc: Qi Zheng 
Cc: Xiongchun Duan 
Cc: Fam Zheng 
Cc: Roman Gushchin 
Cc: Theodore Ts'o 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: fix spelling mistakes in header files

2021-07-08T18:48:21+00:00

Fix some spelling mistakes in comments:
successfull ==> successful
potentialy ==> potentially
alloced ==> allocated
indicies ==> indices
wont ==> won't
resposible ==> responsible
dirtyness ==> dirtiness
droppped ==> dropped
alread ==> already
occured ==> occurred
interupts ==> interrupts
extention ==> extension
slighly ==> slightly
Dont't ==> Don't

Link: https://lkml.kernel.org/r/20210531034849.9549-2-thunder.leizhen@huawei.com
Signed-off-by: Zhen Lei 
Cc: Jerome Glisse 
Cc: Mike Kravetz 
Cc: Dennis Zhou 
Cc: Tejun Heo 
Cc: Christoph Lameter 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

list_lru.h: Replace zero-length array with flexible-array member

2020-04-18T20:44:55+00:00

The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:

struct foo {
        int stuff;
        struct boo array[];
};

By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.

Also, notice that, dynamic memory allocations won't be affected by
this change:

"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]

This issue was found with the help of Coccinelle.

[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 76497732932f ("cxgb3/l2t: Fix undefined behaviour")

Signed-off-by: Gustavo A. R. Silva