linux-toradex.git/mm/vmalloc.c, branch v5.17-rc5

mm/vmalloc: be more explicit about supported gfp flags.

2022-01-15T14:30:28+00:00

Commit b7d90e7a5ea8 ("mm/vmalloc: be more explicit about supported gfp
flags") has been merged prematurely without the rest of the series and
without addressed review feedback from Neil.  Fix that up now.  Only
wording is changed slightly.

Link: https://lkml.kernel.org/r/20211122153233.9924-4-mhocko@kernel.org
Signed-off-by: Michal Hocko 
Reviewed-by: Uladzislau Rezki (Sony) 
Acked-by: Vlastimil Babka 
Cc: Christoph Hellwig 
Cc: Dave Chinner 
Cc: Ilya Dryomov 
Cc: Jeff Layton 
Cc: Neil Brown 
Cc: Sebastian Andrzej Siewior 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm/vmalloc: add support for __GFP_NOFAIL

2022-01-15T14:30:28+00:00

Dave Chinner has mentioned that some of the xfs code would benefit from
kvmalloc support for __GFP_NOFAIL because they have allocations that
cannot fail and they do not fit into a single page.

The large part of the vmalloc implementation already complies with the
given gfp flags so there is no work for those to be done.  The area and
page table allocations are an exception to that.  Implement a retry loop
for those.

Add a short sleep before retrying.  1 jiffy is a completely random
timeout.  Ideally the retry would wait for an explicit event - e.g.  a
change to the vmalloc space change if the failure was caused by the
space fragmentation or depletion.  But there are multiple different
reasons to retry and this could become much more complex.  Keep the
retry simple for now and just sleep to prevent from hogging CPUs.

Link: https://lkml.kernel.org/r/20211122153233.9924-3-mhocko@kernel.org
Signed-off-by: Michal Hocko 
Acked-by: Vlastimil Babka 
Cc: Christoph Hellwig 
Cc: Dave Chinner 
Cc: Ilya Dryomov 
Cc: Jeff Layton 
Cc: Neil Brown 
Cc: Sebastian Andrzej Siewior 
Cc: Uladzislau Rezki (Sony) 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm/vmalloc: alloc GFP_NO{FS,IO} for vmalloc

2022-01-15T14:30:28+00:00

Patch series "extend vmalloc support for constrained allocations", v2.

Based on a recent discussion with Dave and Neil [1] I have tried to
implement NOFS, NOIO, NOFAIL support for the vmalloc to make life of
kvmalloc users easier.

A requirement for NOFAIL support for kvmalloc was new to me but this
seems to be really needed by the xfs code.

NOFS/NOIO was a known and a long term problem which was hoped to be
handled by the scope API.  Those scope should have been used at the
reclaim recursion boundaries both to document them and also to remove
the necessity of NOFS/NOIO constrains for all allocations within that
scope.  Instead workarounds were developed to wrap a single allocation
instead (like ceph_kvmalloc).

First patch implements NOFS/NOIO support for vmalloc.  The second one
adds NOFAIL support and the third one bundles all together into kvmalloc
and drops ceph_kvmalloc which can use kvmalloc directly now.

[1] http://lkml.kernel.org/r/163184741778.29351.16920832234899124642.stgit@noble.brown

This patch (of 4):

vmalloc historically hasn't supported GFP_NO{FS,IO} requests because
page table allocations do not support externally provided gfp mask and
performed GFP_KERNEL like allocations.

Since few years we have scope (memalloc_no{fs,io}_{save,restore}) APIs
to enforce NOFS and NOIO constrains implicitly to all allocators within
the scope.  There was a hope that those scopes would be defined on a
higher level when the reclaim recursion boundary starts/stops (e.g.
when a lock required during the memory reclaim is required etc.).  It
seems that not all NOFS/NOIO users have adopted this approach and
instead they have taken a workaround approach to wrap a single
[k]vmalloc allocation by a scope API.

These workarounds do not serve the purpose of a better reclaim recursion
documentation and reduction of explicit GFP_NO{FS,IO} usege so let's
just provide them with the semantic they are asking for without a need
for workarounds.

Add support for GFP_NOFS and GFP_NOIO to vmalloc directly.  All internal
allocations already comply with the given gfp_mask.  The only current
exception is vmap_pages_range which maps kernel page tables.  Infer the
proper scope API based on the given gfp mask.

[sfr@canb.auug.org.au: mm/vmalloc.c needs linux/sched/mm.h]
 Link: https://lkml.kernel.org/r/20211217232641.0148710c@canb.auug.org.au

Link: https://lkml.kernel.org/r/20211122153233.9924-1-mhocko@kernel.org
Link: https://lkml.kernel.org/r/20211122153233.9924-2-mhocko@kernel.org
Signed-off-by: Michal Hocko 
Signed-off-by: Stephen Rothwell 
Reviewed-by: Uladzislau Rezki (Sony) 
Acked-by: Vlastimil Babka 
Cc: Neil Brown 
Cc: Christoph Hellwig 
Cc: Ilya Dryomov 
Cc: Jeff Layton 
Cc: Dave Chinner 
Cc: Sebastian Andrzej Siewior 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

memcg: add per-memcg vmalloc stat

2022-01-15T14:30:27+00:00

The kvmalloc* allocation functions can fallback to vmalloc allocations
and more often on long running machines.  In addition the kernel does
have __GFP_ACCOUNT kvmalloc* calls.  So, often on long running machines,
the memory.stat does not tell the complete picture which type of memory
is charged to the memcg.  So add a per-memcg vmalloc stat.

[shakeelb@google.com: page_memcg() within rcu lock, per Muchun]
  Link: https://lkml.kernel.org/r/20211222052457.1960701-1-shakeelb@google.com
[akpm@linux-foundation.org: remove cast, per Muchun]
[shakeelb@google.com: remove area->page[0] checks and move to page by page accounting per Michal]
  Link: https://lkml.kernel.org/r/20220104222341.3972772-1-shakeelb@google.com

Link: https://lkml.kernel.org/r/20211221215336.1922823-1-shakeelb@google.com
Signed-off-by: Shakeel Butt 
Acked-by: Roman Gushchin 
Reviewed-by: Muchun Song 
Acked-by: Michal Hocko 
Cc: Johannes Weiner 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: defer kmemleak object creation of module_alloc()

2022-01-15T14:30:25+00:00

Yongqiang reports a kmemleak panic when module insmod/rmmod with KASAN
enabled(without KASAN_VMALLOC) on x86[1].

When the module area allocates memory, it's kmemleak_object is created
successfully, but the KASAN shadow memory of module allocation is not
ready, so when kmemleak scan the module's pointer, it will panic due to
no shadow memory with KASAN check.

  module_alloc
    __vmalloc_node_range
      kmemleak_vmalloc
				kmemleak_scan
				  update_checksum
    kasan_module_alloc
      kmemleak_ignore

Note, there is no problem if KASAN_VMALLOC enabled, the modules area
entire shadow memory is preallocated.  Thus, the bug only exits on ARCH
which supports dynamic allocation of module area per module load, for
now, only x86/arm64/s390 are involved.

Add a VM_DEFER_KMEMLEAK flags, defer vmalloc'ed object register of
kmemleak in module_alloc() to fix this issue.

[1] https://lore.kernel.org/all/6d41e2b9-4692-5ec4-b1cd-cbe29ae89739@huawei.com/

[wangkefeng.wang@huawei.com: fix build]
  Link: https://lkml.kernel.org/r/20211125080307.27225-1-wangkefeng.wang@huawei.com
[akpm@linux-foundation.org: simplify ifdefs, per Andrey]
  Link: https://lkml.kernel.org/r/CA+fCnZcnwJHUQq34VuRxpdoY6_XbJCDJ-jopksS5Eia4PijPzw@mail.gmail.com

Link: https://lkml.kernel.org/r/20211124142034.192078-1-wangkefeng.wang@huawei.com
Fixes: 793213a82de4 ("s390/kasan: dynamic shadow mem allocation for modules")
Fixes: 39d114ddc682 ("arm64: add KASAN support")
Fixes: bebf56a1b176 ("kasan: enable instrumentation of global variables")
Signed-off-by: Kefeng Wang 
Reported-by: Yongqiang Liu 
Cc: Andrey Konovalov 
Cc: Andrey Ryabinin 
Cc: Dmitry Vyukov 
Cc: Catalin Marinas 
Cc: Will Deacon 
Cc: Heiko Carstens 
Cc: Vasily Gorbik 
Cc: Christian Borntraeger 
Cc: Alexander Gordeev 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Borislav Petkov 
Cc: Dave Hansen 
Cc: Alexander Potapenko 
Cc: Kefeng Wang 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm/vmalloc: introduce alloc_pages_bulk_array_mempolicy to accelerate memory allocation

2021-11-06T20:30:37+00:00

Commit ffb29b1c255a ("mm/vmalloc: fix numa spreading for large hash
tables") can cause significant performance regressions in some
situations as Andrew mentioned in [1].  The main situation is vmalloc,
vmalloc will allocate pages with NUMA_NO_NODE by default, that will
result in alloc page one by one;

In order to solve this, __alloc_pages_bulk and mempolicy should be
considered at the same time.

1) If node is specified in memory allocation request, it will alloc all
   pages by __alloc_pages_bulk.

2) If interleaving allocate memory, it will cauculate how many pages
   should be allocated in each node, and use __alloc_pages_bulk to alloc
   pages in each node.

[1]: https://lore.kernel.org/lkml/CALvZod4G3SzP3kWxQYn0fj+VgG-G3yWXz=gz17+3N57ru1iajw@mail.gmail.com/t/#m750c8e3231206134293b089feaa090590afa0f60

[akpm@linux-foundation.org: coding style fixes]
[akpm@linux-foundation.org: make two functions static]
[akpm@linux-foundation.org: fix CONFIG_NUMA=n build]

Link: https://lkml.kernel.org/r/20211021080744.874701-3-chenwandun@huawei.com
Signed-off-by: Chen Wandun 
Reviewed-by: Uladzislau Rezki (Sony) 
Cc: Eric Dumazet 
Cc: Shakeel Butt 
Cc: Nicholas Piggin 
Cc: Kefeng Wang 
Cc: Hanjun Guo 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm/vmalloc: be more explicit about supported gfp flags

2021-11-06T20:30:37+00:00

The core of the vmalloc allocator __vmalloc_area_node doesn't say
anything about gfp mask argument.  Not all gfp flags are supported
though.  Be more explicit about constraints.

Link: https://lkml.kernel.org/r/20211020082545.4830-1-mhocko@kernel.org
Signed-off-by: Michal Hocko 
Cc: Dave Chinner 
Cc: Neil Brown 
Cc: Christoph Hellwig 
Cc: Uladzislau Rezki 
Cc: Ilya Dryomov 
Cc: Jeff Layton 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

kasan: arm64: fix pcpu_page_first_chunk crash with KASAN_VMALLOC

2021-11-06T20:30:37+00:00

With KASAN_VMALLOC and NEED_PER_CPU_PAGE_FIRST_CHUNK the kernel crashes:

  Unable to handle kernel paging request at virtual address ffff7000028f2000
  ...
  swapper pgtable: 64k pages, 48-bit VAs, pgdp=0000000042440000
  [ffff7000028f2000] pgd=000000063e7c0003, p4d=000000063e7c0003, pud=000000063e7c0003, pmd=000000063e7b0003, pte=0000000000000000
  Internal error: Oops: 96000007 [#1] PREEMPT SMP
  Modules linked in:
  CPU: 0 PID: 0 Comm: swapper Not tainted 5.13.0-rc4-00003-gc6e6e28f3f30-dirty #62
  Hardware name: linux,dummy-virt (DT)
  pstate: 200000c5 (nzCv daIF -PAN -UAO -TCO BTYPE=--)
  pc : kasan_check_range+0x90/0x1a0
  lr : memcpy+0x88/0xf4
  sp : ffff80001378fe20
  ...
  Call trace:
   kasan_check_range+0x90/0x1a0
   pcpu_page_first_chunk+0x3f0/0x568
   setup_per_cpu_areas+0xb8/0x184
   start_kernel+0x8c/0x328

The vm area used in vm_area_register_early() has no kasan shadow memory,
Let's add a new kasan_populate_early_vm_area_shadow() function to
populate the vm area shadow memory to fix the issue.

[wangkefeng.wang@huawei.com: fix redefinition of 'kasan_populate_early_vm_area_shadow']
  Link: https://lkml.kernel.org/r/20211011123211.3936196-1-wangkefeng.wang@huawei.com

Link: https://lkml.kernel.org/r/20210910053354.26721-4-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang 
Acked-by: Marco Elver 		[KASAN]
Acked-by: Andrey Konovalov 	[KASAN]
Acked-by: Catalin Marinas 
Cc: Andrey Ryabinin 
Cc: Dmitry Vyukov 
Cc: Greg Kroah-Hartman 
Cc: Will Deacon 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

vmalloc: choose a better start address in vm_area_register_early()

2021-11-06T20:30:37+00:00

Percpu embedded first chunk allocator is the firstly option, but it
could fail on ARM64, eg,

  percpu: max_distance=0x5fcfdc640000 too large for vmalloc space 0x781fefff0000
  percpu: max_distance=0x600000540000 too large for vmalloc space 0x7dffb7ff0000
  percpu: max_distance=0x5fff9adb0000 too large for vmalloc space 0x5dffb7ff0000

then we could get to

  WARNING: CPU: 15 PID: 461 at vmalloc.c:3087 pcpu_get_vm_areas+0x488/0x838

and the system cannot boot successfully.

Let's implement page mapping percpu first chunk allocator as a fallback
to the embedding allocator to increase the robustness of the system.

Also fix a crash when both NEED_PER_CPU_PAGE_FIRST_CHUNK and
KASAN_VMALLOC enabled.

Tested on ARM64 qemu with cmdline "percpu_alloc=page".

This patch (of 3):

There are some fixed locations in the vmalloc area be reserved in
ARM(see iotable_init()) and ARM64(see map_kernel()), but for
pcpu_page_first_chunk(), it calls vm_area_register_early() and choose
VMALLOC_START as the start address of vmap area which could be
conflicted with above address, then could trigger a BUG_ON in
vm_area_add_early().

Let's choose a suit start address by traversing the vmlist.

Link: https://lkml.kernel.org/r/20210910053354.26721-1-wangkefeng.wang@huawei.com
Link: https://lkml.kernel.org/r/20210910053354.26721-2-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang 
Reviewed-by: Catalin Marinas 
Cc: Will Deacon 
Cc: Andrey Ryabinin 
Cc: Andrey Konovalov 
Cc: Dmitry Vyukov 
Cc: Marco Elver 
Cc: Greg Kroah-Hartman 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

vmalloc: back off when the current task is OOM-killed

2021-11-06T20:30:37+00:00

Huge vmalloc allocation on heavy loaded node can lead to a global memory
shortage.  Task called vmalloc can have worst badness and be selected by
OOM-killer, however taken fatal signal does not interrupt allocation
cycle.  Vmalloc repeat page allocaions again and again, exacerbating the
crisis and consuming the memory freed up by another killed tasks.

After a successful completion of the allocation procedure, a fatal
signal will be processed and task will be destroyed finally.  However it
may not release the consumed memory, since the allocated object may have
a lifetime unrelated to the completed task.  In the worst case, this can
lead to the host will panic due to "Out of memory and no killable
processes..."

This patch allows OOM-killer to break vmalloc cycle, makes OOM more
effective and avoid host panic.  It does not check oom condition
directly, however, and breaks page allocation cycle when fatal signal
was received.

This may trigger some hidden problems, when caller does not handle
vmalloc failures, or when rollaback after failed vmalloc calls own
vmallocs inside.  However all of these scenarios are incorrect: vmalloc
does not guarantee successful allocation, it has never been called with
__GFP_NOFAIL and threfore either should not be used for any rollbacks or
should handle such errors correctly and not lead to critical failures.

Link: https://lkml.kernel.org/r/83efc664-3a65-2adb-d7c4-2885784cf109@virtuozzo.com
Signed-off-by: Vasily Averin 
Acked-by: Michal Hocko 
Cc: Johannes Weiner 
Cc: Vladimir Davydov 
Cc: Tetsuo Handa 
Cc: Uladzislau Rezki (Sony) 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds