linux-toradex.git/mm/memory_hotplug.c, branch v5.5-rc1

mm/memory_hotplug.c: remove __online_page_set_limits()

2019-12-01T20:59:10+00:00

__online_page_set_limits() is a dummy function - remove it and all
callers.

Link: http://lkml.kernel.org/r/8e1bc9d3b492f6bde16e95ebc1dee11d6aefabd7.1567889743.git.jrdr.linux@gmail.com
Link: http://lkml.kernel.org/r/854db2cf8145d9635249c95584d9a91fd774a229.1567889743.git.jrdr.linux@gmail.com
Link: http://lkml.kernel.org/r/9afe6c5a18158f3884a6b302ac2c772f3da49ccc.1567889743.git.jrdr.linux@gmail.com
Signed-off-by: Souptick Joarder 
Reviewed-by: David Hildenbrand 
Acked-by: Michal Hocko 
Cc: Juergen Gross 
Cc: "Kirill A. Shutemov" 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm/memory_hotplug.c: don't allow to online/offline memory blocks with holes

2019-12-01T20:59:05+00:00

Our onlining/offlining code is unnecessarily complicated.  Only memory
blocks added during boot can have holes (a range that is not
IORESOURCE_SYSTEM_RAM).  Hotplugged memory never has holes (e.g., see
add_memory_resource()).  All memory blocks that belong to boot memory
are already online.

Note that boot memory can have holes and the memmap of the holes is
marked PG_reserved.  However, also memory allocated early during boot is
PG_reserved - basically every page of boot memory that is not given to
the buddy is PG_reserved.

Therefore, when we stop allowing to offline memory blocks with holes, we
implicitly no longer have to deal with onlining memory blocks with
holes.  E.g., online_pages() will do a walk_system_ram_range(...,
online_pages_range), whereby online_pages_range() will effectively only
free the memory holes not falling into a hole to the buddy.  The other
pages (holes) are kept PG_reserved (via
move_pfn_range_to_zone()->memmap_init_zone()).

This allows to simplify the code.  For example, we no longer have to
worry about marking pages that fall into memory holes PG_reserved when
onlining memory.  We can stop setting pages PG_reserved completely in
memmap_init_zone().

Offlining memory blocks added during boot is usually not guaranteed to
work either way (unmovable data might have easily ended up on that
memory during boot).  So stopping to do that should not really hurt.
Also, people are not even aware of a setup where onlining/offlining of
memory blocks with holes used to work reliably (see [1] and [2]
especially regarding the hotplug path) - I doubt it worked reliably.

For the use case of offlining memory to unplug DIMMs, we should see no
change.  (holes on DIMMs would be weird).

Please note that hardware errors (PG_hwpoison) are not memory holes and
are not affected by this change when offlining.

[1] https://lkml.org/lkml/2019/10/22/135
[2] https://lkml.org/lkml/2019/8/14/1365

Link: http://lkml.kernel.org/r/20191119115237.6662-1-david@redhat.com
Reviewed-by: Dan Williams 
Signed-off-by: David Hildenbrand 
Acked-by: Michal Hocko 
Cc: Oscar Salvador 
Cc: Pavel Tatashin 
Cc: Dan Williams 
Cc: Anshuman Khandual 
Cc: Naoya Horiguchi 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm/page_isolation.c: convert SKIP_HWPOISON to MEMORY_OFFLINE

2019-12-01T20:59:04+00:00

We have two types of users of page isolation:

 1. Memory offlining:  Offline memory so it can be unplugged. Memory
                       won't be touched.

 2. Memory allocation: Allocate memory (e.g., alloc_contig_range()) to
                       become the owner of the memory and make use of
                       it.

For example, in case we want to offline memory, we can ignore (skip
over) PageHWPoison() pages, as the memory won't get used.  We can allow
to offline memory.  In contrast, we don't want to allow to allocate such
memory.

Let's generalize the approach so we can special case other types of
pages we want to skip over in case we offline memory.  While at it, also
pass the same flags to test_pages_isolated().

Link: http://lkml.kernel.org/r/20191021172353.3056-3-david@redhat.com
Signed-off-by: David Hildenbrand 
Suggested-by: Michal Hocko 
Acked-by: Michal Hocko 
Cc: Oscar Salvador 
Cc: Anshuman Khandual 
Cc: David Hildenbrand 
Cc: Pingfan Liu 
Cc: Qian Cai 
Cc: Pavel Tatashin 
Cc: Dan Williams 
Cc: Vlastimil Babka 
Cc: Mel Gorman 
Cc: Mike Rapoport 
Cc: Alexander Duyck 
Cc: Mike Rapoport 
Cc: Pavel Tatashin 
Cc: Wei Yang 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm/page_alloc.c: don't set pages PageReserved() when offlining

2019-12-01T20:59:04+00:00

Patch series "mm: Memory offlining + page isolation cleanups", v2.

This patch (of 2):

We call __offline_isolated_pages() from __offline_pages() after all
pages were isolated and are either free (PageBuddy()) or PageHWPoison.
Nothing can stop us from offlining memory at this point.

In __offline_isolated_pages() we first set all affected memory sections
offline (offline_mem_sections(pfn, end_pfn)), to mark the memmap as
invalid (pfn_to_online_page() will no longer succeed), and then walk
over all pages to pull the free pages from the free lists (to the
isolated free lists, to be precise).

Note that re-onlining a memory block will result in the whole memmap
getting reinitialized, overwriting any old state.  We already poision
the memmap when offlining is complete to find any access to
stale/uninitialized memmaps.

So, setting the pages PageReserved() is not helpful.  The memap is
marked offline and all pageblocks are isolated.  As soon as offline, the
memmap is stale either way.

This looks like a leftover from ancient times where we initialized the
memmap when adding memory and not when onlining it (the pages were set
PageReserved so re-onling would work as expected).

Link: http://lkml.kernel.org/r/20191021172353.3056-2-david@redhat.com
Signed-off-by: David Hildenbrand 
Acked-by: Michal Hocko 
Cc: Vlastimil Babka 
Cc: Oscar Salvador 
Cc: Mel Gorman 
Cc: Mike Rapoport 
Cc: Dan Williams 
Cc: Wei Yang 
Cc: Alexander Duyck 
Cc: Anshuman Khandual 
Cc: Pavel Tatashin 
Cc: Mike Rapoport 
Cc: Pavel Tatashin 
Cc: Pingfan Liu 
Cc: Qian Cai 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm/memory_hotplug: remove __online_page_free() and __online_page_increment_counters()

2019-12-01T20:59:04+00:00

Let's drop the now unused functions.

Link: http://lkml.kernel.org/r/20190909114830.662-4-david@redhat.com
Signed-off-by: David Hildenbrand 
Acked-by: Michal Hocko 
Cc: Oscar Salvador 
Cc: Pavel Tatashin 
Cc: Wei Yang 
Cc: Dan Williams 
Cc: Qian Cai 
Cc: Haiyang Zhang 
Cc: "K. Y. Srinivasan" 
Cc: Sasha Levin 
Cc: Stephen Hemminger 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm/memory_hotplug: export generic_online_page()

2019-12-01T20:59:04+00:00

Patch series "mm/memory_hotplug: Export generic_online_page()".

Let's replace the __online_page...() functions by generic_online_page().
Hyper-V only wants to delay the actual onlining of un-backed pages, so
we can simpy re-use the generic function.

This patch (of 3):

Let's expose generic_online_page() so online_page_callback users can
simply fall back to the generic implementation when actually deciding to
online the pages.

Link: http://lkml.kernel.org/r/20190909114830.662-2-david@redhat.com
Signed-off-by: David Hildenbrand 
Acked-by: Michal Hocko 
Cc: Oscar Salvador 
Cc: Pavel Tatashin 
Cc: Dan Williams 
Cc: Wei Yang 
Cc: Qian Cai 
Cc: Haiyang Zhang 
Cc: "K. Y. Srinivasan" 
Cc: Sasha Levin 
Cc: Stephen Hemminger 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm/memory_hotplug.c: add a bounds check to __add_pages()

2019-12-01T20:59:04+00:00

On PowerPC, the address ranges allocated to OpenCAPI LPC memory are
allocated from firmware.  These address ranges may be higher than what
older kernels permit, as we increased the maximum permissable address in
commit 4ffe713b7587 ("powerpc/mm: Increase the max addressable memory to
2PB").  It is possible that the addressable range may change again in
the future.

In this scenario, we end up with a bogus section returned from
__section_nr (see the discussion on the thread "mm: Trigger bug on if a
section is not found in __section_nr").

Adding a check here means that we fail early and have an opportunity to
handle the error gracefully, rather than rumbling on and potentially
accessing an incorrect section.

Further discussion is also on the thread ("powerpc: Perform a bounds
check in arch_add_memory")
  http://lkml.kernel.org/r/20190827052047.31547-1-alastair@au1.ibm.com

Link: http://lkml.kernel.org/r/20191001004617.7536-2-alastair@au1.ibm.com
Signed-off-by: Alastair D'Silva 
Reviewed-by: David Hildenbrand 
Acked-by: Michal Hocko 
Cc: Oscar Salvador 
Cc: Pavel Tatashin 
Cc: Dan Williams 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm/hotplug: reorder memblock_[free|remove]() calls in try_remove_memory()

2019-12-01T20:59:04+00:00

Currently during memory hot add procedure, memory gets into memblock
before calling arch_add_memory() which creates its linear mapping.

  add_memory_resource() {
	..................
	memblock_add_node()
	..................
	arch_add_memory()
	..................
  }

But during memory hot remove procedure, removal from memblock happens
first before its linear mapping gets teared down with
arch_remove_memory() which is not consistent.  Resource removal should
happen in reverse order as they were added.  However this does not pose
any problem for now, unless there is an assumption regarding linear
mapping.  One example was a subtle failure on arm64 platform [1].
Though this has now found a different solution.

  try_remove_memory() {
	..................
	memblock_free()
	memblock_remove()
	..................
	arch_remove_memory()
	..................
  }

This changes the sequence of resource removal including memblock and
linear mapping tear down during memory hot remove which will now be the
reverse order in which they were added during memory hot add.  The
changed removal order looks like the following.

  try_remove_memory() {
	..................
	arch_remove_memory()
	..................
	memblock_free()
	memblock_remove()
	..................
  }

[1] https://patchwork.kernel.org/patch/11127623/

Memory hot remove now works on arm64 without this because a recent
commit 60bb462fc7ad ("drivers/base/node.c: simplify
unregister_memory_block_under_nodes()").

This does not fix a serious problem.  It just removes an inconsistency
while freeing resources during memory hot remove which for now does not
pose a real problem.

David mentioned that re-ordering should still make sense for consistency
purpose (removing stuff in the reverse order they were added).  This
patch is now detached from arm64 hot-remove series.

Michal:

: I would just a note that the inconsistency doesn't pose any problem now
: but if somebody makes any assumptions about linear mappings then it could
: get subtly broken like your example for arm64 which has found a different
: solution in the meantime.

Link: http://lkml.kernel.org/r/1569380273-7708-1-git-send-email-anshuman.khandual@arm.com
Signed-off-by: Anshuman Khandual 
Acked-by: Michal Hocko 
Reviewed-by: David Hildenbrand 
Cc: Oscar Salvador 
Cc: Pavel Tatashin 
Cc: Dan Williams 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm/memory_hotplug: don't access uninitialized memmaps in shrink_zone_span()

2019-11-22T17:11:18+00:00

Let's limit shrinking to !ZONE_DEVICE so we can fix the current code.
We should never try to touch the memmap of offline sections where we
could have uninitialized memmaps and could trigger BUGs when calling
page_to_nid() on poisoned pages.

There is no reliable way to distinguish an uninitialized memmap from an
initialized memmap that belongs to ZONE_DEVICE, as we don't have
anything like SECTION_IS_ONLINE we can use similar to
pfn_to_online_section() for !ZONE_DEVICE memory.

E.g., set_zone_contiguous() similarly relies on pfn_to_online_section()
and will therefore never set a ZONE_DEVICE zone consecutive.  Stopping
to shrink the ZONE_DEVICE therefore results in no observable changes,
besides /proc/zoneinfo indicating different boundaries - something we
can totally live with.

Before commit d0dc12e86b31 ("mm/memory_hotplug: optimize memory
hotplug"), the memmap was initialized with 0 and the node with the right
value.  So the zone might be wrong but not garbage.  After that commit,
both the zone and the node will be garbage when touching uninitialized
memmaps.

Toshiki reported a BUG (race between delayed initialization of
ZONE_DEVICE memmaps without holding the memory hotplug lock and
concurrent zone shrinking).

  https://lkml.org/lkml/2019/11/14/1040

"Iteration of create and destroy namespace causes the panic as below:

      kernel BUG at mm/page_alloc.c:535!
      CPU: 7 PID: 2766 Comm: ndctl Not tainted 5.4.0-rc4 #6
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.0-0-g63451fca13-prebuilt.qemu-project.org 04/01/2014
      RIP: 0010:set_pfnblock_flags_mask+0x95/0xf0
      Call Trace:
       memmap_init_zone_device+0x165/0x17c
       memremap_pages+0x4c1/0x540
       devm_memremap_pages+0x1d/0x60
       pmem_attach_disk+0x16b/0x600 [nd_pmem]
       nvdimm_bus_probe+0x69/0x1c0
       really_probe+0x1c2/0x3e0
       driver_probe_device+0xb4/0x100
       device_driver_attach+0x4f/0x60
       bind_store+0xc9/0x110
       kernfs_fop_write+0x116/0x190
       vfs_write+0xa5/0x1a0
       ksys_write+0x59/0xd0
       do_syscall_64+0x5b/0x180
       entry_SYSCALL_64_after_hwframe+0x44/0xa9

  While creating a namespace and initializing memmap, if you destroy the
  namespace and shrink the zone, it will initialize the memmap outside
  the zone and trigger VM_BUG_ON_PAGE(!zone_spans_pfn(page_zone(page),
  pfn), page) in set_pfnblock_flags_mask()."

This BUG is also mitigated by this commit, where we for now stop to
shrink the ZONE_DEVICE zone until we can do it in a safe and clean way.

Link: http://lkml.kernel.org/r/20191006085646.5768-5-david@redhat.com
Fixes: f1dd2cd13c4b ("mm, memory_hotplug: do not associate hotadded memory to zones until online")	[visible after d0dc12e86b319]
Signed-off-by: David Hildenbrand 
Reported-by: Aneesh Kumar K.V 
Reported-by: Toshiki Fukasawa 
Cc: Oscar Salvador 
Cc: David Hildenbrand 
Cc: Michal Hocko 
Cc: Pavel Tatashin 
Cc: Dan Williams 
Cc: Alexander Duyck 
Cc: Alexander Potapenko 
Cc: Andy Lutomirski 
Cc: Anshuman Khandual 
Cc: Benjamin Herrenschmidt 
Cc: Borislav Petkov 
Cc: Catalin Marinas 
Cc: Christian Borntraeger 
Cc: Christophe Leroy 
Cc: Damian Tometzki 
Cc: Dave Hansen 
Cc: Fenghua Yu 
Cc: Gerald Schaefer 
Cc: Greg Kroah-Hartman 
Cc: Halil Pasic 
Cc: Heiko Carstens 
Cc: "H. Peter Anvin" 
Cc: Ingo Molnar 
Cc: Ira Weiny 
Cc: Jason Gunthorpe 
Cc: Jun Yao 
Cc: Logan Gunthorpe 
Cc: Mark Rutland 
Cc: Masahiro Yamada 
Cc: "Matthew Wilcox (Oracle)" 
Cc: Mel Gorman 
Cc: Michael Ellerman 
Cc: Mike Rapoport 
Cc: Pankaj Gupta 
Cc: Paul Mackerras 
Cc: Pavel Tatashin 
Cc: Peter Zijlstra 
Cc: Qian Cai 
Cc: Rich Felker 
Cc: Robin Murphy 
Cc: Steve Capper 
Cc: Thomas Gleixner 
Cc: Tom Lendacky 
Cc: Tony Luck 
Cc: Vasily Gorbik 
Cc: Vlastimil Babka 
Cc: Wei Yang 
Cc: Wei Yang 
Cc: Will Deacon 
Cc: Yoshinori Sato 
Cc: Yu Zhao 
Cc: 	[4.13+]
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm/memory_hotplug: fix try_offline_node()

2019-11-16T02:34:00+00:00

try_offline_node() is pretty much broken right now:

 - The node span is updated when onlining memory, not when adding it. We
   ignore memory that was mever onlined. Bad.

 - We touch possible garbage memmaps. The pfn_to_nid(pfn) can easily
   trigger a kernel panic. Bad for memory that is offline but also bad
   for subsection hotadd with ZONE_DEVICE, whereby the memmap of the
   first PFN of a section might contain garbage.

 - Sections belonging to mixed nodes are not properly considered.

As memory blocks might belong to multiple nodes, we would have to walk
all pageblocks (or at least subsections) within present sections.
However, we don't have a way to identify whether a memmap that is not
online was initialized (relevant for ZONE_DEVICE).  This makes things
more complicated.

Luckily, we can piggy pack on the node span and the nid stored in memory
blocks.  Currently, the node span is grown when calling
move_pfn_range_to_zone() - e.g., when onlining memory, and shrunk when
removing memory, before calling try_offline_node().  Sysfs links are
created via link_mem_sections(), e.g., during boot or when adding
memory.

If the node still spans memory or if any memory block belongs to the
nid, we don't set the node offline.  As memory blocks that span multiple
nodes cannot get offlined, the nid stored in memory blocks is reliable
enough (for such online memory blocks, the node still spans the memory).

Introduce for_each_memory_block() to efficiently walk all memory blocks.

Note: We will soon stop shrinking the ZONE_DEVICE zone and the node span
when removing ZONE_DEVICE memory to fix similar issues (access of
garbage memmaps) - until we have a reliable way to identify whether
these memmaps were properly initialized.  This implies later, that once
a node had ZONE_DEVICE memory, we won't be able to set a node offline -
which should be acceptable.

Since commit f1dd2cd13c4b ("mm, memory_hotplug: do not associate
hotadded memory to zones until online") memory that is added is not
assoziated with a zone/node (memmap not initialized).  The introducing
commit 60a5a19e7419 ("memory-hotplug: remove sysfs file of node")
already missed that we could have multiple nodes for a section and that
the zone/node span is updated when onlining pages, not when adding them.

I tested this by hotplugging two DIMMs to a memory-less and cpu-less
NUMA node.  The node is properly onlined when adding the DIMMs.  When
removing the DIMMs, the node is properly offlined.

Masayoshi Mizuma reported:

: Without this patch, memory hotplug fails as panic:
:
:  BUG: kernel NULL pointer dereference, address: 0000000000000000
:  ...
:  Call Trace:
:   remove_memory_block_devices+0x81/0xc0
:   try_remove_memory+0xb4/0x130
:   __remove_memory+0xa/0x20
:   acpi_memory_device_remove+0x84/0x100
:   acpi_bus_trim+0x57/0x90
:   acpi_bus_trim+0x2e/0x90
:   acpi_device_hotplug+0x2b2/0x4d0
:   acpi_hotplug_work_fn+0x1a/0x30
:   process_one_work+0x171/0x380
:   worker_thread+0x49/0x3f0
:   kthread+0xf8/0x130
:   ret_from_fork+0x35/0x40

[david@redhat.com: v3]
  Link: http://lkml.kernel.org/r/20191102120221.7553-1-david@redhat.com
Link: http://lkml.kernel.org/r/20191028105458.28320-1-david@redhat.com
Fixes: 60a5a19e7419 ("memory-hotplug: remove sysfs file of node")
Fixes: f1dd2cd13c4b ("mm, memory_hotplug: do not associate hotadded memory to zones until online") # visiable after d0dc12e86b319
Signed-off-by: David Hildenbrand 
Tested-by: Masayoshi Mizuma 
Cc: Tang Chen 
Cc: Greg Kroah-Hartman 
Cc: "Rafael J. Wysocki" 
Cc: Keith Busch 
Cc: Jiri Olsa 
Cc: "Peter Zijlstra (Intel)" 
Cc: Jani Nikula 
Cc: Nayna Jain 
Cc: Michal Hocko 
Cc: Oscar Salvador 
Cc: Stephen Rothwell 
Cc: Dan Williams 
Cc: Pavel Tatashin 
Cc: 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds