<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux-toradex.git/mm/memory_hotplug.c, branch v6.8-rc2</title>
<subtitle>Linux kernel for Apalis and Colibri modules</subtitle>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/'/>
<entry>
<title>mm/memory_hotplug: fix memmap_on_memory sysfs value retrieval</title>
<updated>2024-01-12T23:20:48+00:00</updated>
<author>
<name>Sumanth Korikkar</name>
<email>sumanthk@linux.ibm.com</email>
</author>
<published>2024-01-10T14:01:27+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=11684134140bb708b6e6de969a060535630b1b53'/>
<id>11684134140bb708b6e6de969a060535630b1b53</id>
<content type='text'>
set_memmap_mode() stores the kernel parameter memmap mode as an integer.
However, get_memmap_mode() uses param_get_bool() to fetch the value as a
boolean, leading to a potential endianness issue.  On big-endian
architectures, memmap_on_memory is consistently displayed as 'N'
regardless of its actual status.

To address this endianness problem, obtain the mode as an integer.  This
ensures the memmap_on_memory parameter is displayed correctly, as one of
the following options: Force, Y, or N.
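
A minimal user-space sketch of why the byte-sized read goes wrong on
big-endian machines (illustrative only; param_get_bool() performs an
equivalent one-byte access on the parameter):

	#include &lt;stdbool.h&gt;
	#include &lt;stdio.h&gt;

	int main(void)
	{
		int mode = 1;			/* mode stored as a 4-byte int */
		bool b = *(bool *)&amp;mode;	/* one-byte read of that int */

		/*
		 * Little-endian reads the low byte (1, so 'Y');
		 * big-endian reads the high byte (0, so 'N'),
		 * which is why sysfs always shows 'N'.
		 */
		printf("%c\n", b ? 'Y' : 'N');
		return 0;
	}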

Link: https://lkml.kernel.org/r/20240110140127.241451-1-sumanthk@linux.ibm.com
Fixes: 2d1f649c7c08 ("mm/memory_hotplug: support memmap_on_memory when memmap is not aligned to pageblocks")
Signed-off-by: Sumanth Korikkar &lt;sumanthk@linux.ibm.com&gt;
Suggested-by: Gerald Schaefer &lt;gerald.schaefer@linux.ibm.com&gt;
Acked-by: David Hildenbrand &lt;david@redhat.com&gt;
Cc: Alexander Gordeev &lt;agordeev@linux.ibm.com&gt;
Cc: Aneesh Kumar K.V &lt;aneesh.kumar@linux.ibm.com&gt;
Cc: Heiko Carstens &lt;hca@linux.ibm.com&gt;
Cc: Michal Hocko &lt;mhocko@suse.com&gt;
Cc: Oscar Salvador &lt;osalvador@suse.de&gt;
Cc: Vasily Gorbik &lt;gor@linux.ibm.com&gt;
Cc: &lt;stable@vger.kernel.org&gt;	[6.6+]
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm, treewide: rename MAX_ORDER to MAX_PAGE_ORDER</title>
<updated>2024-01-08T23:27:15+00:00</updated>
<author>
<name>Kirill A. Shutemov</name>
<email>kirill.shutemov@linux.intel.com</email>
</author>
<published>2023-12-28T14:47:04+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=5e0a760b44417f7cadd79de2204d6247109558a0'/>
<id>5e0a760b44417f7cadd79de2204d6247109558a0</id>
<content type='text'>
commit 23baf831a32c ("mm, treewide: redefine MAX_ORDER sanely") has
changed the definition of MAX_ORDER to be inclusive.  This has caused
issues with code that was not yet upstream and depended on the previous
definition.

To draw attention to the altered meaning of the define, rename MAX_ORDER
to MAX_PAGE_ORDER.
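
A schematic of the semantic shift (hedged; the value 10 is only
illustrative, and NR_PAGE_ORDERS is the companion define introduced in
the same series):

	/* Old, exclusive meaning: valid orders were 0 .. MAX_ORDER - 1. */
	/* New, inclusive meaning: valid orders are 0 .. MAX_PAGE_ORDER. */
	#define MAX_PAGE_ORDER	10
	#define NR_PAGE_ORDERS	(MAX_PAGE_ORDER + 1)	/* for array sizing */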

Link: https://lkml.kernel.org/r/20231228144704.14033-2-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov &lt;kirill.shutemov@linux.intel.com&gt;
Cc: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm/memory_hotplug: split memmap_on_memory requests across memblocks</title>
<updated>2023-12-11T00:51:34+00:00</updated>
<author>
<name>Vishal Verma</name>
<email>vishal.l.verma@intel.com</email>
</author>
<published>2023-11-07T07:22:42+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=6b8f0798b85aa529011570369db985a788f3003f'/>
<id>6b8f0798b85aa529011570369db985a788f3003f</id>
<content type='text'>
The MHP_MEMMAP_ON_MEMORY flag for hotplugged memory is restricted to
'memblock_size' chunks of memory being added.  Adding a larger span of
memory precludes memmap_on_memory semantics.

For users of hotplug such as kmem, large amounts of memory might get added
from the CXL subsystem.  In some cases, this amount may exceed the
available 'main memory' to store the memmap for the memory being added. 
In this case, it is useful to have a way to place the memmap on the memory
being added, even if it means splitting the addition into memblock-sized
chunks.

Change add_memory_resource() to loop over memblock-sized chunks of memory
if the caller requested memmap_on_memory, and if the other conditions for
it are met.  Teach try_remove_memory() to also expect that a memory range
being removed might have been split up into memblock-sized chunks, and to
loop through those as needed.
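
A minimal, hypothetical sketch of the chunked add loop (illustrative;
add_single_memory_block() is a hypothetical stand-in for the per-chunk
work done inside add_memory_resource()):

	static int add_chunked(int nid, u64 start, u64 size, mhp_t mhp_flags)
	{
		u64 memblock_size = memory_block_size_bytes();
		u64 cur_start;
		int ret;

		for (cur_start = start; cur_start &lt; start + size;
		     cur_start += memblock_size) {
			ret = add_single_memory_block(nid, cur_start,
						      memblock_size,
						      mhp_flags);
			if (ret)
				return ret;
		}
		return 0;
	}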

This does preclude being able to use PUD mappings in the direct map; a
proposal for how this could be optimized in the future is laid out
here [1].

[1]: https://lore.kernel.org/linux-mm/b6753402-2de9-25b2-36e9-eacd49752b19@redhat.com/

Link: https://lkml.kernel.org/r/20231107-vv-kmem_memmap-v10-2-1253ec050ed0@intel.com
Signed-off-by: Vishal Verma &lt;vishal.l.verma@intel.com&gt;
Suggested-by: David Hildenbrand &lt;david@redhat.com&gt;
Reviewed-by: Dan Williams &lt;dan.j.williams@intel.com&gt;
Reviewed-by: "Huang, Ying" &lt;ying.huang@intel.com&gt;
Acked-by: David Hildenbrand &lt;david@redhat.com&gt;
Cc: Michal Hocko &lt;mhocko@suse.com&gt;
Cc: Oscar Salvador &lt;osalvador@suse.de&gt;
Cc: Dave Jiang &lt;dave.jiang@intel.com&gt;
Cc: Dave Hansen &lt;dave.hansen@linux.intel.com&gt;
Cc: Aneesh Kumar K.V &lt;aneesh.kumar@linux.ibm.com&gt;
Cc: Fan Ni &lt;fan.ni@samsung.com&gt;
Cc: Jeff Moyer &lt;jmoyer@redhat.com&gt;
Cc: Jonathan Cameron &lt;Jonathan.Cameron@huawei.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm/memory_hotplug: replace an open-coded kmemdup() in add_memory_resource()</title>
<updated>2023-12-11T00:51:34+00:00</updated>
<author>
<name>Vishal Verma</name>
<email>vishal.l.verma@intel.com</email>
</author>
<published>2023-11-07T07:22:41+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=82b8a3b49ebde4e7246319884deeb29d6dc1b0cf'/>
<id>82b8a3b49ebde4e7246319884deeb29d6dc1b0cf</id>
<content type='text'>
Patch series "mm: use memmap_on_memory semantics for dax/kmem", v10.

The dax/kmem driver can potentially hot-add large amounts of memory
originating from CXL memory expanders, or NVDIMMs, or other 'device
memories'.  There is a chance there isn't enough regular system memory
available to fit the memmap for this new memory.  It's therefore
desirable, if all other conditions are met, for the kmem managed memory to
place its memmap on the newly added memory itself.

The main hurdle for accomplishing this for kmem is that memmap_on_memory
can only be done if the memory being added is equal to the size of one
memblock.  To overcome this, allow the hotplug code to split an
add_memory() request into memblock-sized chunks, and try_remove_memory()
to also expect and handle such a scenario.

Patch 1 replaces an open-coded kmemdup().

Patch 2 teaches the memory_hotplug code to allow for splitting
add_memory() and remove_memory() requests over memblock sized chunks.

Patch 3 allows the dax region drivers to request memmap_on_memory
semantics.  CXL dax regions default this to 'on'; all others default to
'off' to keep existing behavior unchanged.


This patch (of 3):

A review of the memmap_on_memory modifications to add_memory_resource()
revealed an instance of an open-coded kmemdup().  Replace it with
kmemdup().
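
The cleanup pattern, shown schematically (hedged; the names follow
add_memory_resource(), but the hunk is illustrative):

	/* Before: open-coded duplication. */
	params.altmap = kmalloc(sizeof(*params.altmap), GFP_KERNEL);
	if (!params.altmap)
		goto error;
	memcpy(params.altmap, &amp;mhp_altmap, sizeof(mhp_altmap));

	/* After: the same thing via kmemdup(). */
	params.altmap = kmemdup(&amp;mhp_altmap, sizeof(mhp_altmap),
				GFP_KERNEL);
	if (!params.altmap)
		goto error;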

Link: https://lkml.kernel.org/r/20231107-vv-kmem_memmap-v10-0-1253ec050ed0@intel.com
Link: https://lkml.kernel.org/r/20231107-vv-kmem_memmap-v10-1-1253ec050ed0@intel.com
Signed-off-by: Vishal Verma &lt;vishal.l.verma@intel.com&gt;
Reviewed-by: David Hildenbrand &lt;david@redhat.com&gt;
Reviewed-by: Fan Ni &lt;fan.ni@samsung.com&gt;
Reported-by: Dan Williams &lt;dan.j.williams@intel.com&gt;
Cc: Michal Hocko &lt;mhocko@suse.com&gt;
Cc: Oscar Salvador &lt;osalvador@suse.de&gt;
Cc: Aneesh Kumar K.V &lt;aneesh.kumar@linux.ibm.com&gt;
Cc: Dave Hansen &lt;dave.hansen@linux.intel.com&gt;
Cc: Dave Jiang &lt;dave.jiang@intel.com&gt;
Cc: "Huang, Ying" &lt;ying.huang@intel.com&gt;
Cc: Jeff Moyer &lt;jmoyer@redhat.com&gt;
Cc: Jonathan Cameron &lt;Jonathan.Cameron@huawei.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm/memory_hotplug: fix error handling in add_memory_resource()</title>
<updated>2023-12-07T00:12:46+00:00</updated>
<author>
<name>Sumanth Korikkar</name>
<email>sumanthk@linux.ibm.com</email>
</author>
<published>2023-11-20T14:53:53+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=f42ce5f087eb69e47294ababd2e7e6f88a82d308'/>
<id>f42ce5f087eb69e47294ababd2e7e6f88a82d308</id>
<content type='text'>
In add_memory_resource(), creation of memory block devices occurs after a
successful call to arch_add_memory().  However, creation of memory block
devices could fail.  In that case, arch_remove_memory() is called to
perform the necessary cleanup.

Currently, with or without altmap support, arch_remove_memory() is always
called with altmap set to NULL during error handling.  This leads to
freeing of the struct pages using free_pages(), even though the allocation
might have been performed with altmap support via
altmap_alloc_block_buf().

Fix the error handling by passing the altmap to arch_remove_memory().
This ensures the following:
* When altmap is disabled, deallocation of the struct pages array occurs
  via free_pages().
* When altmap is enabled, deallocation occurs via vmem_altmap_free().
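
A sketch of the fixed error path (illustrative, simplified from
add_memory_resource(); params.altmap carries the altmap used for the
allocation):

	ret = create_memory_block_devices(start, size, params.altmap,
					  group);
	if (ret) {
		/* Pass the altmap so vmemmap pages allocated from the
		 * added range are freed via vmem_altmap_free() rather
		 * than free_pages(). */
		arch_remove_memory(start, size, params.altmap);
		goto error;
	}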

Link: https://lkml.kernel.org/r/20231120145354.308999-3-sumanthk@linux.ibm.com
Fixes: a08a2ae34613 ("mm,memory_hotplug: allocate memmap from the added memory range")
Signed-off-by: Sumanth Korikkar &lt;sumanthk@linux.ibm.com&gt;
Reviewed-by: Gerald Schaefer &lt;gerald.schaefer@linux.ibm.com&gt;
Acked-by: David Hildenbrand &lt;david@redhat.com&gt;
Cc: Alexander Gordeev &lt;agordeev@linux.ibm.com&gt;
Cc: Aneesh Kumar K.V &lt;aneesh.kumar@linux.ibm.com&gt;
Cc: Anshuman Khandual &lt;anshuman.khandual@arm.com&gt;
Cc: Heiko Carstens &lt;hca@linux.ibm.com&gt;
Cc: kernel test robot &lt;lkp@intel.com&gt;
Cc: Michal Hocko &lt;mhocko@suse.com&gt;
Cc: Oscar Salvador &lt;osalvador@suse.de&gt;
Cc: Vasily Gorbik &lt;gor@linux.ibm.com&gt;
Cc: &lt;stable@vger.kernel.org&gt;	[5.15+]
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm/memory_hotplug: add missing mem_hotplug_lock</title>
<updated>2023-12-07T00:12:46+00:00</updated>
<author>
<name>Sumanth Korikkar</name>
<email>sumanthk@linux.ibm.com</email>
</author>
<published>2023-11-20T14:53:52+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=001002e73712cdf6b8d9a103648cda3040ad7647'/>
<id>001002e73712cdf6b8d9a103648cda3040ad7647</id>
<content type='text'>
From Documentation/core-api/memory-hotplug.rst:
When adding/removing/onlining/offlining memory or adding/removing
heterogeneous/device memory, we should always hold the mem_hotplug_lock
in write mode to serialise memory hotplug (e.g. access to global/zone
variables).

The mhp_(de)init_memmap_on_memory() functions can change zone stats and
struct page content, but they are currently called without holding
mem_hotplug_lock.

When a memory block is being offlined while kmemleak goes through each
populated zone, the following theoretical race could occur:
CPU 0:					     | CPU 1:
memory_offline()			     |
-&gt; offline_pages()			     |
	-&gt; mem_hotplug_begin()		     |
	   ...				     |
	-&gt; mem_hotplug_done()		     |
					     | kmemleak_scan()
					     | -&gt; get_online_mems()
					     |    ...
-&gt; mhp_deinit_memmap_on_memory()	     |
  [not protected by mem_hotplug_begin/done()]|
  Marks memory section as offline,	     |   Retrieves zone_start_pfn
  poisons vmemmap struct pages and updates   |   and struct page members.
  the zone related data			     |
   					     |    ...
   					     | -&gt; put_online_mems()

Fix this by ensuring mem_hotplug_lock is taken before calling
mhp_init_memmap_on_memory().  Also ensure that
mhp_deinit_memmap_on_memory() holds the lock.

online/offline_pages() are currently only called from
memory_block_online/offline(), so it is safe to move the locking there.
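
A simplified sketch of the locking move (hedged; modeled on
memory_block_online() in drivers/base/memory.c, with locals
abbreviated):

	mem_hotplug_begin();	/* mem_hotplug_lock in write mode */
	if (nr_vmemmap_pages) {
		ret = mhp_init_memmap_on_memory(start_pfn,
						nr_vmemmap_pages, zone);
		if (ret)
			goto out;
	}
	ret = online_pages(start_pfn + nr_vmemmap_pages,
			   nr_pages - nr_vmemmap_pages, zone, mem-&gt;group);
out:
	mem_hotplug_done();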

Link: https://lkml.kernel.org/r/20231120145354.308999-2-sumanthk@linux.ibm.com
Fixes: a08a2ae34613 ("mm,memory_hotplug: allocate memmap from the added memory range")
Signed-off-by: Sumanth Korikkar &lt;sumanthk@linux.ibm.com&gt;
Reviewed-by: Gerald Schaefer &lt;gerald.schaefer@linux.ibm.com&gt;
Acked-by: David Hildenbrand &lt;david@redhat.com&gt;
Cc: Alexander Gordeev &lt;agordeev@linux.ibm.com&gt;
Cc: Aneesh Kumar K.V &lt;aneesh.kumar@linux.ibm.com&gt;
Cc: Anshuman Khandual &lt;anshuman.khandual@arm.com&gt;
Cc: Heiko Carstens &lt;hca@linux.ibm.com&gt;
Cc: Michal Hocko &lt;mhocko@suse.com&gt;
Cc: Oscar Salvador &lt;osalvador@suse.de&gt;
Cc: Vasily Gorbik &lt;gor@linux.ibm.com&gt;
Cc: kernel test robot &lt;lkp@intel.com&gt;
Cc: &lt;stable@vger.kernel.org&gt;	[5.15+]
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm: memory_hotplug: drop memoryless node from fallback lists</title>
<updated>2023-10-25T23:47:14+00:00</updated>
<author>
<name>Qi Zheng</name>
<email>zhengqi.arch@bytedance.com</email>
</author>
<published>2023-10-19T10:43:55+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=b7812c86c7403283f8b3775ee11c6c01d457016c'/>
<id>b7812c86c7403283f8b3775ee11c6c01d457016c</id>
<content type='text'>
In offline_pages(), if a node becomes memoryless, we clear its N_MEMORY
state by calling node_states_clear_node().  But we do this after
rebuilding the zonelists via build_all_zonelists(), which leaves the
memoryless node in the fallback lists (node_order[]) of other nodes.

To drop memoryless nodes from the fallback lists in this case, just call
node_states_clear_node() before calling build_all_zonelists().
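
The ordering, shown schematically (hedged; condensed from the tail of
offline_pages(), not the full cleanup path):

	/* Clear N_MEMORY first so the rebuilt zonelists no longer
	 * include this now-memoryless node in node_order[]. */
	node_states_clear_node(node, &amp;arg);
	if (!populated_zone(zone))
		build_all_zonelists(NULL);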

In this way, we no longer try to allocate pages from the memoryless
node 0, so the panic mentioned in [1] is also fixed.  Even though that
problem has been solved by dropping the NODE_MIN_SIZE constraint on x86
[2], it is better to fix it in the core MM as well.

https://lore.kernel.org/all/20230212110305.93670-1-zhengqi.arch@bytedance.com/ [1]
https://lore.kernel.org/all/20231017062215.171670-1-rppt@kernel.org/ [2]

Link: https://lkml.kernel.org/r/9f1dbe7ee1301c7163b2770e32954ff5e3ecf2c4.1697711415.git.zhengqi.arch@bytedance.com
Signed-off-by: Qi Zheng &lt;zhengqi.arch@bytedance.com&gt;
Acked-by: David Hildenbrand &lt;david@redhat.com&gt;
Acked-by: Ingo Molnar &lt;mingo@kernel.org&gt;
Cc: Aneesh Kumar K.V &lt;aneesh.kumar@linux.ibm.com&gt;
Cc: "Huang, Ying" &lt;ying.huang@intel.com&gt;
Cc: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Cc: Matthew Wilcox (Oracle) &lt;willy@infradead.org&gt;
Cc: Mel Gorman &lt;mgorman@techsingularity.net&gt;
Cc: Michal Hocko &lt;mhocko@suse.com&gt;
Cc: Mike Rapoport &lt;rppt@kernel.org&gt;
Cc: Oscar Salvador &lt;osalvador@suse.de&gt;
Cc: Vlastimil Babka &lt;vbabka@suse.cz&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm/memory_hotplug: use pfn math in place of direct struct page manipulation</title>
<updated>2023-10-04T17:32:29+00:00</updated>
<author>
<name>Zi Yan</name>
<email>ziy@nvidia.com</email>
</author>
<published>2023-09-13T20:12:46+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=1640a0ef80f6d572725f5b0330038c18e98ea168'/>
<id>1640a0ef80f6d572725f5b0330038c18e98ea168</id>
<content type='text'>
When dealing with hugetlb pages, manipulating struct page pointers
directly can land on the wrong struct page, since struct pages are not
guaranteed to be contiguous on SPARSEMEM without VMEMMAP.  Use pfn
calculation to handle it properly.

Without the fix, a wrong number of pages might be skipped.  Since the
skip cannot be negative, scan_movable_pages() will end early, might miss
a movable page, and return -ENOENT.  This might cause offline_pages() to
fail.  No bug has been reported; the fix comes from code inspection.
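
The pattern of the change, shown schematically (hedged; mirrors the
scan_movable_pages() hunk):

	struct page *head = compound_head(page);
	unsigned long skip;

	/*
	 * Before: pointer arithmetic, wrong across a non-contiguous
	 * memmap:
	 *	skip = compound_nr(head) - (page - head);
	 */
	/* After: pfn arithmetic is always well defined. */
	skip = compound_nr(head) - (pfn - page_to_pfn(head));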

Link: https://lkml.kernel.org/r/20230913201248.452081-4-zi.yan@sent.com
Fixes: eeb0efd071d8 ("mm,memory_hotplug: fix scan_movable_pages() for gigantic hugepages")
Signed-off-by: Zi Yan &lt;ziy@nvidia.com&gt;
Reviewed-by: Muchun Song &lt;songmuchun@bytedance.com&gt;
Acked-by: David Hildenbrand &lt;david@redhat.com&gt;
Cc: Matthew Wilcox (Oracle) &lt;willy@infradead.org&gt;
Cc: Mike Kravetz &lt;mike.kravetz@oracle.com&gt;
Cc: Mike Rapoport (IBM) &lt;rppt@kernel.org&gt;
Cc: Thomas Bogendoerfer &lt;tsbogend@alpha.franken.de&gt;
Cc: &lt;stable@vger.kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm/memory_hotplug: embed vmem_altmap details in memory block</title>
<updated>2023-08-21T20:37:49+00:00</updated>
<author>
<name>Aneesh Kumar K.V</name>
<email>aneesh.kumar@linux.ibm.com</email>
</author>
<published>2023-08-08T09:15:01+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=1a8c64e110435e44e71bcd50a75663174b575f22'/>
<id>1a8c64e110435e44e71bcd50a75663174b575f22</id>
<content type='text'>
With memmap on memory, some architectures need more details w.r.t. the
altmap, such as base_pfn, end_pfn, etc., to unmap vmemmap memory.
Instead of computing them again when we remove a memory block, embed the
vmem_altmap details in struct memory_block if we are using the
memmap-on-memory feature.
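
A sketch of the embedding (hedged; simplified from struct memory_block,
and the exact fields may differ):

	struct memory_block {
		unsigned long start_section_nr;
		unsigned long state;
		/* ... */
		struct vmem_altmap *altmap;	/* set when the memmap
						 * lives on the added
						 * memory */
	};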

[yangyingliang@huawei.com: fix error return code in add_memory_resource()]
  Link: https://lkml.kernel.org/r/20230809081552.1351184-1-yangyingliang@huawei.com
Link: https://lkml.kernel.org/r/20230808091501.287660-7-aneesh.kumar@linux.ibm.com
Signed-off-by: Aneesh Kumar K.V &lt;aneesh.kumar@linux.ibm.com&gt;
Signed-off-by: Yang Yingliang &lt;yangyingliang@huawei.com&gt;
Acked-by: Michal Hocko &lt;mhocko@suse.com&gt;
Acked-by: David Hildenbrand &lt;david@redhat.com&gt;
Cc: Christophe Leroy &lt;christophe.leroy@csgroup.eu&gt;
Cc: Michael Ellerman &lt;mpe@ellerman.id.au&gt;
Cc: Nicholas Piggin &lt;npiggin@gmail.com&gt;
Cc: Oscar Salvador &lt;osalvador@suse.de&gt;
Cc: Vishal Verma &lt;vishal.l.verma@intel.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm/memory_hotplug: support memmap_on_memory when memmap is not aligned to pageblocks</title>
<updated>2023-08-21T20:37:49+00:00</updated>
<author>
<name>Aneesh Kumar K.V</name>
<email>aneesh.kumar@linux.ibm.com</email>
</author>
<published>2023-08-08T09:14:59+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=2d1f649c7c0855751c7ff43f4e34784061bc72f7'/>
<id>2d1f649c7c0855751c7ff43f4e34784061bc72f7</id>
<content type='text'>
Currently, the memmap_on_memory feature is only supported with memory
block sizes that result in vmemmap pages covering full pageblocks.  This
is because the memory onlining/offlining code requires applicable ranges
to be pageblock-aligned, for example, to set the migratetypes properly.

This patch lifts that restriction by reserving more pages than required
for vmemmap space.  This keeps the start address pageblock-aligned across
different memory block sizes, which implies the kernel will reserve some
pages for every memory block.  This makes the memmap-on-memory feature
widely useful with different memory block size values.

For example, with a 64K page size and a 256MiB memory block size, we
require 4 pages to map the vmemmap pages.  To align things correctly, we
end up adding a reserve of 28 pages; i.e., for every 4096 pages, 28 pages
get reserved.
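
The arithmetic of that example, worked through (the 32-page alignment
target is inferred from the numbers above; struct page is assumed to be
64 bytes):

	pages per memory block:  256MiB / 64K          = 4096 pages
	vmemmap size:            4096 * 64 bytes       = 256KiB
	vmemmap pages:           256KiB / 64K          = 4 pages
	reserve:                 32 - 4                = 28 pages
	                         (rounding 4 up to a 32-page boundary)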

Link: https://lkml.kernel.org/r/20230808091501.287660-5-aneesh.kumar@linux.ibm.com
Signed-off-by: Aneesh Kumar K.V &lt;aneesh.kumar@linux.ibm.com&gt;
Acked-by: Michal Hocko &lt;mhocko@suse.com&gt;
Acked-by: David Hildenbrand &lt;david@redhat.com&gt;
Cc: Christophe Leroy &lt;christophe.leroy@csgroup.eu&gt;
Cc: Michael Ellerman &lt;mpe@ellerman.id.au&gt;
Cc: Nicholas Piggin &lt;npiggin@gmail.com&gt;
Cc: Oscar Salvador &lt;osalvador@suse.de&gt;
Cc: Vishal Verma &lt;vishal.l.verma@intel.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
</feed>
