<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux-toradex.git/mm/vmalloc.c, branch v5.16</title>
<subtitle>Linux kernel for Apalis and Colibri modules</subtitle>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/'/>
<entry>
<title>mm/vmalloc: introduce alloc_pages_bulk_array_mempolicy to accelerate memory allocation</title>
<updated>2021-11-06T20:30:37+00:00</updated>
<author>
<name>Chen Wandun</name>
<email>chenwandun@huawei.com</email>
</author>
<published>2021-11-05T20:39:53+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=c00b6b9610991c042ff4c3153daaa3ea8522c210'/>
<id>c00b6b9610991c042ff4c3153daaa3ea8522c210</id>
<content type='text'>
Commit ffb29b1c255a ("mm/vmalloc: fix numa spreading for large hash
tables") can cause significant performance regressions in some
situations, as Andrew mentioned in [1].  The main case is vmalloc:
by default vmalloc allocates pages with NUMA_NO_NODE, which results
in the pages being allocated one by one.

In order to solve this, __alloc_pages_bulk and the mempolicy should
be considered at the same time:

1) If a node is specified in the memory allocation request, allocate
   all pages with __alloc_pages_bulk.

2) If memory is allocated with an interleave policy, calculate how
   many pages should be allocated on each node, and use
   __alloc_pages_bulk to allocate the pages on each node.

[1]: https://lore.kernel.org/lkml/CALvZod4G3SzP3kWxQYn0fj+VgG-G3yWXz=gz17+3N57ru1iajw@mail.gmail.com/t/#m750c8e3231206134293b089feaa090590afa0f60
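
The idea in code, as a minimal sketch (alloc_pages_bulk_array_mempolicy()
is the helper this patch introduces; the dispatch below is simplified
from the actual vmalloc changes):

  static unsigned int vm_area_alloc_pages(gfp_t gfp, int nid,
                                          unsigned int nr_pages,
                                          struct page **pages)
  {
          unsigned int nr_allocated;

          if (nid == NUMA_NO_NODE)
                  /* Task mempolicy applies: for an interleave policy,
                   * split nr_pages across the nodes and call
                   * __alloc_pages_bulk() once per node. */
                  nr_allocated = alloc_pages_bulk_array_mempolicy(gfp,
                                                  nr_pages, pages);
          else
                  /* An explicit node was requested: one bulk call. */
                  nr_allocated = alloc_pages_bulk_array_node(gfp, nid,
                                                  nr_pages, pages);

          return nr_allocated;
  }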

[akpm@linux-foundation.org: coding style fixes]
[akpm@linux-foundation.org: make two functions static]
[akpm@linux-foundation.org: fix CONFIG_NUMA=n build]

Link: https://lkml.kernel.org/r/20211021080744.874701-3-chenwandun@huawei.com
Signed-off-by: Chen Wandun &lt;chenwandun@huawei.com&gt;
Reviewed-by: Uladzislau Rezki (Sony) &lt;urezki@gmail.com&gt;
Cc: Eric Dumazet &lt;edumazet@google.com&gt;
Cc: Shakeel Butt &lt;shakeelb@google.com&gt;
Cc: Nicholas Piggin &lt;npiggin@gmail.com&gt;
Cc: Kefeng Wang &lt;wangkefeng.wang@huawei.com&gt;
Cc: Hanjun Guo &lt;guohanjun@huawei.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm/vmalloc: be more explicit about supported gfp flags</title>
<updated>2021-11-06T20:30:37+00:00</updated>
<author>
<name>Michal Hocko</name>
<email>mhocko@suse.com</email>
</author>
<published>2021-11-05T20:39:50+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=b7d90e7a5ea8d64e668d5685925900d33d3884d5'/>
<id>b7d90e7a5ea8d64e668d5685925900d33d3884d5</id>
<content type='text'>
The core of the vmalloc allocator, __vmalloc_area_node, doesn't say
anything about the gfp mask argument.  Not all gfp flags are
supported, though.  Be more explicit about the constraints.
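
The constraints, paraphrased as a kernel-doc style note (a sketch of
what the patch documents; the exact wording in the diff may differ):

  /*
   * Please note that the full set of gfp flags is not supported:
   * GFP_KERNEL is the preferred allocation mode, but GFP_NOFS and
   * GFP_NOIO are supported as well.  Zone modifiers are not supported.
   * __GFP_DIRECT_RECLAIM is required, i.e. GFP_NOWAIT is not supported,
   * and __GFP_NOWARN can be used to suppress failure messages.
   */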

Link: https://lkml.kernel.org/r/20211020082545.4830-1-mhocko@kernel.org
Signed-off-by: Michal Hocko &lt;mhocko@suse.com&gt;
Cc: Dave Chinner &lt;david@fromorbit.com&gt;
Cc: Neil Brown &lt;neilb@suse.de&gt;
Cc: Christoph Hellwig &lt;hch@infradead.org&gt;
Cc: Uladzislau Rezki &lt;urezki@gmail.com&gt;
Cc: Ilya Dryomov &lt;idryomov@gmail.com&gt;
Cc: Jeff Layton &lt;jlayton@kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>kasan: arm64: fix pcpu_page_first_chunk crash with KASAN_VMALLOC</title>
<updated>2021-11-06T20:30:37+00:00</updated>
<author>
<name>Kefeng Wang</name>
<email>wangkefeng.wang@huawei.com</email>
</author>
<published>2021-11-05T20:39:47+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=3252b1d8309ea42bc6329d9341072ecf1c9505c0'/>
<id>3252b1d8309ea42bc6329d9341072ecf1c9505c0</id>
<content type='text'>
With KASAN_VMALLOC and NEED_PER_CPU_PAGE_FIRST_CHUNK the kernel crashes:

  Unable to handle kernel paging request at virtual address ffff7000028f2000
  ...
  swapper pgtable: 64k pages, 48-bit VAs, pgdp=0000000042440000
  [ffff7000028f2000] pgd=000000063e7c0003, p4d=000000063e7c0003, pud=000000063e7c0003, pmd=000000063e7b0003, pte=0000000000000000
  Internal error: Oops: 96000007 [#1] PREEMPT SMP
  Modules linked in:
  CPU: 0 PID: 0 Comm: swapper Not tainted 5.13.0-rc4-00003-gc6e6e28f3f30-dirty #62
  Hardware name: linux,dummy-virt (DT)
  pstate: 200000c5 (nzCv daIF -PAN -UAO -TCO BTYPE=--)
  pc : kasan_check_range+0x90/0x1a0
  lr : memcpy+0x88/0xf4
  sp : ffff80001378fe20
  ...
  Call trace:
   kasan_check_range+0x90/0x1a0
   pcpu_page_first_chunk+0x3f0/0x568
   setup_per_cpu_areas+0xb8/0x184
   start_kernel+0x8c/0x328

The vm area used in vm_area_register_early() has no KASAN shadow
memory.  Let's add a new kasan_populate_early_vm_area_shadow()
function to populate the vm area shadow memory and fix the issue.
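
A sketch of the arm64 implementation, assuming the existing
kasan_mem_to_shadow() and kasan_map_populate() helpers (details may
differ from the final patch):

  void __init kasan_populate_early_vm_area_shadow(void *start,
                                                  unsigned long size)
  {
          unsigned long shadow_start, shadow_end;

          if (!is_vmalloc_or_module_addr(start))
                  return;

          /* Eagerly map shadow for [start, start + size), which
           * KASAN_VMALLOC would otherwise only populate lazily. */
          shadow_start = (unsigned long)kasan_mem_to_shadow(start);
          shadow_start = ALIGN_DOWN(shadow_start, PAGE_SIZE);
          shadow_end = (unsigned long)kasan_mem_to_shadow(start + size);
          shadow_end = ALIGN(shadow_end, PAGE_SIZE);

          kasan_map_populate(shadow_start, shadow_end, NUMA_NO_NODE);
  }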

[wangkefeng.wang@huawei.com: fix redefinition of 'kasan_populate_early_vm_area_shadow']
  Link: https://lkml.kernel.org/r/20211011123211.3936196-1-wangkefeng.wang@huawei.com

Link: https://lkml.kernel.org/r/20210910053354.26721-4-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang &lt;wangkefeng.wang@huawei.com&gt;
Acked-by: Marco Elver &lt;elver@google.com&gt;		[KASAN]
Acked-by: Andrey Konovalov &lt;andreyknvl@gmail.com&gt;	[KASAN]
Acked-by: Catalin Marinas &lt;catalin.marinas@arm.com&gt;
Cc: Andrey Ryabinin &lt;ryabinin.a.a@gmail.com&gt;
Cc: Dmitry Vyukov &lt;dvyukov@google.com&gt;
Cc: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
Cc: Will Deacon &lt;will@kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>vmalloc: choose a better start address in vm_area_register_early()</title>
<updated>2021-11-06T20:30:37+00:00</updated>
<author>
<name>Kefeng Wang</name>
<email>wangkefeng.wang@huawei.com</email>
</author>
<published>2021-11-05T20:39:41+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=0eb68437a7f9dfef9c218873310c66c714f2fa99'/>
<id>0eb68437a7f9dfef9c218873310c66c714f2fa99</id>
<content type='text'>
The percpu embedded first chunk allocator is the first option, but it
can fail on ARM64, e.g.,

  percpu: max_distance=0x5fcfdc640000 too large for vmalloc space 0x781fefff0000
  percpu: max_distance=0x600000540000 too large for vmalloc space 0x7dffb7ff0000
  percpu: max_distance=0x5fff9adb0000 too large for vmalloc space 0x5dffb7ff0000

then we could get to

  WARNING: CPU: 15 PID: 461 at vmalloc.c:3087 pcpu_get_vm_areas+0x488/0x838

and the system cannot boot successfully.

Let's implement the page mapping percpu first chunk allocator as a
fallback to the embedding allocator to increase the robustness of the
system.

Also fix a crash when both NEED_PER_CPU_PAGE_FIRST_CHUNK and
KASAN_VMALLOC are enabled.

Tested on ARM64 qemu with cmdline "percpu_alloc=page".

This patch (of 3):

Some fixed locations in the vmalloc area are reserved on ARM (see
iotable_init()) and ARM64 (see map_kernel()), but
pcpu_page_first_chunk() calls vm_area_register_early(), which chooses
VMALLOC_START as the start address of the vmap area.  That can
conflict with the reserved addresses above and then trigger a BUG_ON
in vm_area_add_early().

Let's choose a suitable start address by traversing the vmlist.
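
A sketch of the traversal in vm_area_register_early(), relying on the
early vmlist being sorted by address (as vm_area_add_early() keeps it):

  /* Pick the first aligned hole large enough for @vm, instead of
   * blindly starting at VMALLOC_START. */
  unsigned long addr = ALIGN(VMALLOC_START, align);
  struct vm_struct *cur, **p;

  for (p = &amp;vmlist; (cur = *p) != NULL; p = &amp;cur-&gt;next) {
          if ((unsigned long)cur-&gt;addr - addr &gt;= vm-&gt;size)
                  break;          /* the hole before 'cur' fits */
          addr = ALIGN((unsigned long)cur-&gt;addr + cur-&gt;size, align);
  }
  vm-&gt;addr = (void *)addr;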

Link: https://lkml.kernel.org/r/20210910053354.26721-1-wangkefeng.wang@huawei.com
Link: https://lkml.kernel.org/r/20210910053354.26721-2-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang &lt;wangkefeng.wang@huawei.com&gt;
Reviewed-by: Catalin Marinas &lt;catalin.marinas@arm.com&gt;
Cc: Will Deacon &lt;will@kernel.org&gt;
Cc: Andrey Ryabinin &lt;ryabinin.a.a@gmail.com&gt;
Cc: Andrey Konovalov &lt;andreyknvl@gmail.com&gt;
Cc: Dmitry Vyukov &lt;dvyukov@google.com&gt;
Cc: Marco Elver &lt;elver@google.com&gt;
Cc: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>vmalloc: back off when the current task is OOM-killed</title>
<updated>2021-11-06T20:30:37+00:00</updated>
<author>
<name>Vasily Averin</name>
<email>vvs@virtuozzo.com</email>
</author>
<published>2021-11-05T20:39:37+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=dd544141b9eb8f3c58dedc9c1dcc4803de0eed45'/>
<id>dd544141b9eb8f3c58dedc9c1dcc4803de0eed45</id>
<content type='text'>
A huge vmalloc allocation on a heavily loaded node can lead to a
global memory shortage.  The task that called vmalloc can have the
worst badness and be selected by the OOM-killer; however, the fatal
signal it receives does not interrupt the allocation cycle.  Vmalloc
repeats the page allocations again and again, exacerbating the crisis
and consuming the memory freed up by other killed tasks.

After the allocation procedure completes successfully, the fatal
signal will be processed and the task finally destroyed.  However,
that may not release the consumed memory, since the allocated object
may have a lifetime unrelated to the completed task.  In the worst
case, this can lead to a host panic due to "Out of memory and no
killable processes..."

This patch allows the OOM-killer to break the vmalloc cycle, making
the OOM-killer more effective and avoiding a host panic.  It does not
check the oom condition directly; rather, it breaks the page
allocation cycle when a fatal signal has been received.
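
A minimal sketch of the check inside the vmalloc page allocation loop
(the exact placement and loop shape in the patch may differ):

  while (nr_allocated &lt; nr_pages) {
          struct page *page;

          /* Give up instead of looping on page allocations after
           * the OOM-killer has chosen us. */
          if (fatal_signal_pending(current))
                  break;          /* fail the vmalloc; caller must cope */

          page = alloc_pages_node(nid, gfp_mask, 0);
          if (!page)
                  break;
          pages[nr_allocated++] = page;
  }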

This may expose some hidden problems, when a caller does not handle
vmalloc failures, or when a rollback after a failed vmalloc performs
its own vmallocs internally.  However, all of these scenarios are
incorrect: vmalloc does not guarantee a successful allocation, it has
never been callable with __GFP_NOFAIL, and therefore it either should
not be used for any rollbacks or should handle such errors correctly
and not lead to critical failures.

Link: https://lkml.kernel.org/r/83efc664-3a65-2adb-d7c4-2885784cf109@virtuozzo.com
Signed-off-by: Vasily Averin &lt;vvs@virtuozzo.com&gt;
Acked-by: Michal Hocko &lt;mhocko@suse.com&gt;
Cc: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Cc: Vladimir Davydov &lt;vdavydov.dev@gmail.com&gt;
Cc: Tetsuo Handa &lt;penguin-kernel@i-love.sakura.ne.jp&gt;
Cc: Uladzislau Rezki (Sony) &lt;urezki@gmail.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm/vmalloc: check various alignments when debugging</title>
<updated>2021-11-06T20:30:37+00:00</updated>
<author>
<name>Uladzislau Rezki (Sony)</name>
<email>urezki@gmail.com</email>
</author>
<published>2021-11-05T20:39:34+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=066fed59d8a1bab4f3213e8fe413c54e4a76b77a'/>
<id>066fed59d8a1bab4f3213e8fe413c54e4a76b77a</id>
<content type='text'>
Previously we did not guarantee a free block with the lowest start
address for allocations with an alignment &gt;= PAGE_SIZE, because the
alignment overhead was included in the search length, as below:

     length = size + align - 1;

Doing so ensured that a bigger block would still fit after applying
the alignment adjustment.  Now there is no such limitation, i.e. any
alignment the user wants to apply will result in the lowest start
address of the returned free area.
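
With the alignment now honoured by the search itself, the debug build
can cross-check the tree lookup against a brute-force scan using the
same alignment, roughly (a sketch; helper names follow the existing
DEBUG_AUGMENT_LOWEST_MATCH_CHECK code):

  static void find_vmap_lowest_match_check(unsigned long size,
                                           unsigned long align)
  {
          struct vmap_area *va_1, *va_2;
          unsigned long vstart = 1;

          va_1 = find_vmap_lowest_match(size, align, vstart);
          va_2 = find_vmap_lowest_linear_match(size, align, vstart);

          if (va_1 != va_2)
                  pr_emerg("not lowest: t: %p, l: %p, v: %lu\n",
                           va_1, va_2, vstart);
  }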

Link: https://lkml.kernel.org/r/20211004142829.22222-2-urezki@gmail.com
Signed-off-by: Uladzislau Rezki (Sony) &lt;urezki@gmail.com&gt;
Cc: Christoph Hellwig &lt;hch@infradead.org&gt;
Cc: David Hildenbrand &lt;david@redhat.com&gt;
Cc: Hillf Danton &lt;hdanton@sina.com&gt;
Cc: Matthew Wilcox &lt;willy@infradead.org&gt;
Cc: Mel Gorman &lt;mgorman@suse.de&gt;
Cc: Michal Hocko &lt;mhocko@suse.com&gt;
Cc: Nicholas Piggin &lt;npiggin@gmail.com&gt;
Cc: Oleksiy Avramchenko &lt;oleksiy.avramchenko@sonymobile.com&gt;
Cc: Ping Fang &lt;pifang@redhat.com&gt;
Cc: Steven Rostedt &lt;rostedt@goodmis.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm/vmalloc: do not adjust the search size for alignment overhead</title>
<updated>2021-11-06T20:30:37+00:00</updated>
<author>
<name>Uladzislau Rezki (Sony)</name>
<email>urezki@gmail.com</email>
</author>
<published>2021-11-05T20:39:31+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=9f531973dff39c671219352573ad5df6d0d9a58c'/>
<id>9f531973dff39c671219352573ad5df6d0d9a58c</id>
<content type='text'>
We used to include the alignment overhead in the search length; in
that case we guaranteed that a found area would definitely fit after
applying the specific alignment that the user requested.  On the
other hand, we did not guarantee that the area has the lowest address
if the alignment is &gt;= PAGE_SIZE.

This means that when a user requests a special alignment together
with a range that corresponds exactly to the requested size, the
allocation will fail.  This is what happens to KASAN: it wants a free
block that exactly matches a specified range when onlining memory
banks:

    [root@vm-0 fedora]# echo online &gt; /sys/devices/system/memory/memory82/state
    [root@vm-0 fedora]# echo online &gt; /sys/devices/system/memory/memory83/state
    [root@vm-0 fedora]# echo online &gt; /sys/devices/system/memory/memory85/state
    [root@vm-0 fedora]# echo online &gt; /sys/devices/system/memory/memory84/state
    vmap allocation for size 16777216 failed: use vmalloc=&lt;size&gt; to increase size
    bash: vmalloc: allocation failure: 16777216 bytes, mode:0x6000c0(GFP_KERNEL), nodemask=(null),cpuset=/,mems_allowed=0
    CPU: 4 PID: 1644 Comm: bash Kdump: loaded Not tainted 4.18.0-339.el8.x86_64+debug #1
    Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
    Call Trace:
     dump_stack+0x8e/0xd0
     warn_alloc.cold.90+0x8a/0x1b2
     ? zone_watermark_ok_safe+0x300/0x300
     ? slab_free_freelist_hook+0x85/0x1a0
     ? __get_vm_area_node+0x240/0x2c0
     ? kfree+0xdd/0x570
     ? kmem_cache_alloc_node_trace+0x157/0x230
     ? notifier_call_chain+0x90/0x160
     __vmalloc_node_range+0x465/0x840
     ? mark_held_locks+0xb7/0x120

Fix it by making sure that find_vmap_lowest_match() returns the
lowest start address for any given alignment value, i.e. for
alignments bigger than PAGE_SIZE the algorithm rolls back toward
parent nodes, checking right sub-trees, if the leftmost free block
did not fit due to the alignment overhead.
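
The fitting test at the heart of the walk, sketched (simplified from
is_within_this_va(); the patch additionally reworks the backtracking
in find_vmap_lowest_match() around it):

  static bool is_within_this_va(struct vmap_area *va, unsigned long size,
                                unsigned long align, unsigned long vstart)
  {
          unsigned long nva_start_addr;

          /* A candidate block must still hold 'size' bytes once its
           * start address is rounded up to 'align'. */
          nva_start_addr = ALIGN(max(vstart, va-&gt;va_start), align);

          /* Can overflow for a big size or alignment. */
          if (nva_start_addr + size &lt; nva_start_addr)
                  return false;

          return (nva_start_addr + size &lt;= va-&gt;va_end);
  }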

Link: https://lkml.kernel.org/r/20211004142829.22222-1-urezki@gmail.com
Fixes: 68ad4a330433 ("mm/vmalloc.c: keep track of free blocks for vmap allocation")
Signed-off-by: Uladzislau Rezki (Sony) &lt;urezki@gmail.com&gt;
Reported-by: Ping Fang &lt;pifang@redhat.com&gt;
Tested-by: David Hildenbrand &lt;david@redhat.com&gt;
Reviewed-by: David Hildenbrand &lt;david@redhat.com&gt;
Cc: Mel Gorman &lt;mgorman@suse.de&gt;
Cc: Christoph Hellwig &lt;hch@infradead.org&gt;
Cc: Matthew Wilcox &lt;willy@infradead.org&gt;
Cc: Nicholas Piggin &lt;npiggin@gmail.com&gt;
Cc: Hillf Danton &lt;hdanton@sina.com&gt;
Cc: Michal Hocko &lt;mhocko@suse.com&gt;
Cc: Oleksiy Avramchenko &lt;oleksiy.avramchenko@sonymobile.com&gt;
Cc: Steven Rostedt &lt;rostedt@goodmis.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm/vmalloc: make sure to dump unpurged areas in /proc/vmallocinfo</title>
<updated>2021-11-06T20:30:37+00:00</updated>
<author>
<name>Eric Dumazet</name>
<email>edumazet@google.com</email>
</author>
<published>2021-11-05T20:39:28+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=7cc7913e8e61ac436497d01a64963770d1600f5d'/>
<id>7cc7913e8e61ac436497d01a64963770d1600f5d</id>
<content type='text'>
If the last va found in vmap_area_list does not have a vm pointer,
vmallocinfo's s_show() returns 0 early, and show_purge_info() is not
called as it should be.
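
A sketch of the fix: jump to the tail instead of returning early when
the va has no vm pointer, so the purge list is still dumped after the
last entry (simplified; the label is illustrative):

  static int s_show(struct seq_file *m, void *p)
  {
          struct vmap_area *va = list_entry(p, struct vmap_area, list);

          if (!va-&gt;vm) {
                  /* Unpurged vm_map_ram areas have no -&gt;vm; report
                   * them briefly and still fall through to the end. */
                  seq_printf(m, "0x%pK-0x%pK %7ld vm_map_ram\n",
                             (void *)va-&gt;va_start, (void *)va-&gt;va_end,
                             va-&gt;va_end - va-&gt;va_start);
                  goto final;
          }

          /* ... regular per-area reporting elided ... */

  final:
          /* After the last entry, also dump areas pending purge. */
          if (list_is_last(&amp;va-&gt;list, &amp;vmap_area_list))
                  show_purge_info(m);

          return 0;
  }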

Link: https://lkml.kernel.org/r/20211001170815.73321-1-eric.dumazet@gmail.com
Fixes: dd3b8353bae7 ("mm/vmalloc: do not keep unpurged areas in the busy tree")
Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Cc: Uladzislau Rezki (Sony) &lt;urezki@gmail.com&gt;
Cc: Pengfei Li &lt;lpf.vector@gmail.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm/vmalloc: make show_numa_info() aware of hugepage mappings</title>
<updated>2021-11-06T20:30:36+00:00</updated>
<author>
<name>Eric Dumazet</name>
<email>edumazet@google.com</email>
</author>
<published>2021-11-05T20:39:25+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=51e50b3a22937ab7b350f05af7e3b79b7ff73dd3'/>
<id>51e50b3a22937ab7b350f05af7e3b79b7ff73dd3</id>
<content type='text'>
show_numa_info() can be made slightly faster by skipping over
hugepages directly.
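
The idea, sketched against the accounting loop in show_numa_info()
(assuming the existing vm_area_page_order() helper):

  /* Step one huge mapping at a time instead of one base page. */
  unsigned int nr, step = 1U &lt;&lt; vm_area_page_order(v);

  for (nr = 0; nr &lt; v-&gt;nr_pages; nr += step)
          counters[page_to_nid(v-&gt;pages[nr])] += step;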

Link: https://lkml.kernel.org/r/20211001172725.105824-1-eric.dumazet@gmail.com
Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Cc: Uladzislau Rezki (Sony) &lt;urezki@gmail.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm/vmalloc: don't allow VM_NO_GUARD on vmap()</title>
<updated>2021-11-06T20:30:36+00:00</updated>
<author>
<name>Peter Zijlstra</name>
<email>peterz@infradead.org</email>
</author>
<published>2021-11-05T20:39:22+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=bd1a8fb2d43f7c293383f76691d7a55f7f89d9da'/>
<id>bd1a8fb2d43f7c293383f76691d7a55f7f89d9da</id>
<content type='text'>
The vmalloc guard pages are added on top of each allocation, thereby
isolating any two allocations from one another.  The top guard of the
lower allocation is the bottom guard of the higher allocation, etc.

Therefore VM_NO_GUARD is dangerous; it breaks the basic premise of
isolating separate allocations.

There are only two in-tree users of this flag, neither of which uses
it through the exported interface.  Ensure it stays this way.
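
A sketch of the check at the exported entry point (the comment and
exact form in the patch may differ):

  void *vmap(struct page **pages, unsigned int count,
             unsigned long flags, pgprot_t prot)
  {
          /* Your top guard is someone else's bottom guard; refuse
           * any attempt to drop it via the exported interface. */
          if (WARN_ON_ONCE(flags &amp; VM_NO_GUARD))
                  return NULL;

          /* ... */
  }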

Link: https://lkml.kernel.org/r/YUMfdA36fuyZ+/xt@hirez.programming.kicks-ass.net
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Reviewed-by: Christoph Hellwig &lt;hch@lst.de&gt;
Reviewed-by: David Hildenbrand &lt;david@redhat.com&gt;
Acked-by: Will Deacon &lt;will@kernel.org&gt;
Acked-by: Kees Cook &lt;keescook@chromium.org&gt;
Cc: Andrey Konovalov &lt;andreyknvl@gmail.com&gt;
Cc: Mel Gorman &lt;mgorman@suse.de&gt;
Cc: Uladzislau Rezki &lt;urezki@gmail.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
</feed>
