linux-toradex.git/include/linux/node.h, branch v6.0-rc1

drivers/base/node: rename link_mem_sections() to register_memory_block_under_node()

2022-03-22T22:57:10+00:00

Patch series "drivers/base/memory: determine and store zone for single-zone memory blocks", v2.

I remember talking to Michal in the past about removing
test_pages_in_a_zone(), which we use for:
* verifying that a memory block we intend to offline is really only managed
  by a single zone. We don't support offlining of memory blocks that are
  managed by multiple zones (e.g., multiple nodes, DMA and DMA32)
* exposing that zone to user space via
  /sys/devices/system/memory/memory*/valid_zones

Now that I identified some more cases where test_pages_in_a_zone() might
go wrong, and we received an UBSAN report (see patch #3), let's get rid of
this PFN walker.

So instead of detecting the zone at runtime with test_pages_in_a_zone() by
scanning the memmap, let's determine and remember for each memory block if
it's managed by a single zone.  The stored zone can then be used for the
above two cases, avoiding a manual lookup using test_pages_in_a_zone().

This avoids eventually stumbling over uninitialized memmaps in corner
cases, especially when ZONE_DEVICE ranges partly fall into memory blocks
(that are responsible for managing System RAM).

Handling memory onlining is easy, because we online to exactly one zone.
Handling boot memory is more tricky, because we want to avoid scanning all
zones of all nodes to detect possible zones that overlap with the physical
memory region of interest.  Fortunately, we already have code that
determines the applicable nodes for a memory block, to create sysfs links
-- we'll hook into that.
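
The idea of remembering a single zone per memory block can be sketched in
userspace C.  This is only a model under stated assumptions: the names
`struct memory_block_model`, `block_note_zone()`, `block_single_zone()` and
the `MODEL_ZONE_*` values are hypothetical and do not mirror the kernel's
actual data structures.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical zone identifiers, used by this model only. */
enum model_zone {
	MODEL_ZONE_NONE = -1,
	MODEL_ZONE_DMA,
	MODEL_ZONE_DMA32,
	MODEL_ZONE_NORMAL,
};

struct memory_block_model {
	unsigned long start_pfn;
	unsigned long nr_pages;
	bool seen_zone;       /* at least one zone overlapped this block */
	bool multi_zone;      /* more than one distinct zone overlapped it */
	enum model_zone zone; /* meaningful only if seen_zone && !multi_zone */
};

/* Record that the zone spanning [zone_start, zone_end) may overlap the block. */
static void block_note_zone(struct memory_block_model *mb, enum model_zone zone,
			    unsigned long zone_start, unsigned long zone_end)
{
	unsigned long block_end = mb->start_pfn + mb->nr_pages;

	assert(zone_start <= zone_end);
	if (zone_end <= mb->start_pfn || zone_start >= block_end)
		return;                 /* no overlap with this block */
	if (mb->seen_zone && mb->zone != zone)
		mb->multi_zone = true;  /* a second distinct zone was found */
	mb->seen_zone = true;
	mb->zone = zone;
}

/* Return the single managing zone, or MODEL_ZONE_NONE if none or several. */
static enum model_zone block_single_zone(const struct memory_block_model *mb)
{
	if (!mb->seen_zone || mb->multi_zone)
		return MODEL_ZONE_NONE;
	return mb->zone;
}
```

In this model, both the offlining check and the valid_zones output would
consult `block_single_zone()` instead of walking the memmap.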

Patch #1 is a simple cleanup I had lying around for a long time.
Patch #2 contains the main logic to remove test_pages_in_a_zone() and
further details.

[1] https://lkml.kernel.org/r/20220128144540.153902-1-david@redhat.com
[2] https://lkml.kernel.org/r/20220203105212.30385-1-david@redhat.com

This patch (of 2):

Let's adjust the stale terminology, making it match
unregister_memory_block_under_nodes() and
do_register_memory_block_under_node().  We're dealing with memory block
devices, which span 1..X memory sections.

Link: https://lkml.kernel.org/r/20220210184359.235565-1-david@redhat.com
Link: https://lkml.kernel.org/r/20220210184359.235565-2-david@redhat.com
Signed-off-by: David Hildenbrand 
Acked-by: Oscar Salvador 
Cc: Greg Kroah-Hartman 
Cc: Michal Hocko 
Cc: "Rafael J. Wysocki" 
Cc: Rafael Parra 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

drivers/base/node: consolidate node device subsystem initialization in node_dev_init()

2022-03-22T22:57:10+00:00

...  and call node_dev_init() after memory_dev_init() from driver_init(),
so before any of the existing arch/subsys calls.  All online nodes should
be known at that point: early during boot, arch code determines node and
zone ranges and sets the relevant nodes online; usually this happens in
setup_arch().

This is in line with memory_dev_init(), which initializes the memory
device subsystem and creates all memory block devices.

Similar to memory_dev_init(), panic() if anything goes wrong; we don't
want to continue with such basic initialization errors.

The important part is that node_dev_init() gets called after
memory_dev_init() and after cpu_dev_init(), but before any of the relevant
archs call register_cpu() to register the new cpu device under the node
device.  The latter should be the case for the current users of
topology_init().

Link: https://lkml.kernel.org/r/20220203105212.30385-1-david@redhat.com
Signed-off-by: David Hildenbrand 
Reviewed-by: Oscar Salvador 
Tested-by: Anatoly Pugachev  (sparc64)
Cc: Greg Kroah-Hartman 
Cc: Michal Hocko 
Cc: Oscar Salvador 
Cc: Mike Rapoport 
Cc: Catalin Marinas 
Cc: Will Deacon 
Cc: Thomas Bogendoerfer 
Cc: Michael Ellerman 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Paul Walmsley 
Cc: Palmer Dabbelt 
Cc: Albert Ou 
Cc: Heiko Carstens 
Cc: Vasily Gorbik 
Cc: Yoshinori Sato 
Cc: Rich Felker 
Cc: "David S. Miller" 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Borislav Petkov 
Cc: Dave Hansen 
Cc: "Rafael J. Wysocki" 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

NUMA Balancing: add page promotion counter

2022-03-22T22:57:09+00:00

Patch series "NUMA balancing: optimize memory placement for memory tiering system", v13

With the advent of various new memory types, some machines will have
multiple types of memory, e.g.  DRAM and PMEM (persistent memory).  The
memory subsystem of these machines can be called a memory tiering system,
because the performance of the different types of memory differs.

After commit c221c0b0308f ("device-dax: "Hotplug" persistent memory for
use like normal RAM"), the PMEM could be used as the cost-effective
volatile memory in separate NUMA nodes.  In a typical memory tiering
system, there are CPUs, DRAM and PMEM in each physical NUMA node.  The
CPUs and the DRAM will be put in one logical node, while the PMEM will
be put in another (faked) logical node.

To optimize overall system performance, the hot pages should be
placed in the DRAM node.  To do that, we need to identify the hot pages in
the PMEM node and migrate them to the DRAM node via NUMA migration.

In the original NUMA balancing, there is already a set of existing
mechanisms to identify the pages recently accessed by the CPUs in a node
and migrate the pages to the node.  So we can reuse these mechanisms to
build the mechanisms to optimize the page placement in the memory
tiering system.  This is implemented in this patchset.

On the other hand, the cold pages should be placed in the PMEM node.  So, we
also need to identify the cold pages in the DRAM node and migrate them
to the PMEM node.

In commit 26aa2d199d6f ("mm/migrate: demote pages during reclaim"), a
mechanism to demote the cold DRAM pages to PMEM node under memory
pressure is implemented.  Based on that, the cold DRAM pages can be
demoted to PMEM node proactively to free some memory space on DRAM node
to accommodate the promoted hot PMEM pages.  This is implemented in this
patchset too.

We have tested the solution with the pmbench memory accessing benchmark
with the 80:20 read/write ratio and the Gauss access address
distribution on a 2 socket Intel server with Optane DC Persistent Memory
Model.  The test results show that the pmbench score can improve by up to
95.9%.

This patch (of 3):

In a system with multiple memory types, e.g.  DRAM and PMEM, the CPU
and DRAM in one socket will be put in one NUMA node as before, while
the PMEM will be put in another NUMA node as described in the
description of the commit c221c0b0308f ("device-dax: "Hotplug"
persistent memory for use like normal RAM").  So, the NUMA balancing
mechanism will identify all PMEM accesses as remote access and try to
promote the PMEM pages to DRAM.

To distinguish the number of inter-type promoted pages from that of
inter-socket migrated pages, a new vmstat counter is added.  The
counter is per-node (count in the target node).  So this can be used to
identify promotion imbalance among the NUMA nodes.
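
The accounting rule described above can be sketched in userspace C.  This is
a model under stated assumptions, not the kernel code: `struct node_model`,
`nodes[]`, `account_numa_migration()` and `MODEL_MAX_NODES` are hypothetical
names, and "top tier" stands for a DRAM node.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_MAX_NODES 4

/* Hypothetical per-node state for this model. */
struct node_model {
	bool top_tier;                /* true for a DRAM node, false for PMEM */
	unsigned long promoted_pages; /* promotions, counted in the target node */
	unsigned long migrated_pages; /* all NUMA balancing migrations into node */
};

static struct node_model nodes[MODEL_MAX_NODES];

/* Account nr pages moved from node src to node dst by NUMA balancing. */
static void account_numa_migration(int src, int dst, unsigned long nr)
{
	assert(src >= 0 && src < MODEL_MAX_NODES);
	assert(dst >= 0 && dst < MODEL_MAX_NODES);

	nodes[dst].migrated_pages += nr;
	/* Only a move from a lower tier into the top tier is a promotion. */
	if (!nodes[src].top_tier && nodes[dst].top_tier)
		nodes[dst].promoted_pages += nr;
}
```

Comparing `promoted_pages` across the top-tier nodes of such a model is what
would reveal a promotion imbalance, while inter-socket DRAM to DRAM
migrations only show up in `migrated_pages`.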

Link: https://lkml.kernel.org/r/20220301085329.3210428-1-ying.huang@intel.com
Link: https://lkml.kernel.org/r/20220221084529.1052339-1-ying.huang@intel.com
Link: https://lkml.kernel.org/r/20220221084529.1052339-2-ying.huang@intel.com
Signed-off-by: "Huang, Ying" 
Reviewed-by: Yang Shi 
Tested-by: Baolin Wang 
Reviewed-by: Baolin Wang 
Acked-by: Johannes Weiner 
Reviewed-by: Oscar Salvador 
Cc: Michal Hocko 
Cc: Rik van Riel 
Cc: Mel Gorman 
Cc: Peter Zijlstra 
Cc: Dave Hansen 
Cc: Zi Yan 
Cc: Wei Xu 
Cc: Shakeel Butt 
Cc: zhongjiang-ali 
Cc: Feng Tang 
Cc: Randy Dunlap 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm/memory_hotplug: remove CONFIG_MEMORY_HOTPLUG_SPARSE

2021-11-06T20:30:42+00:00

CONFIG_MEMORY_HOTPLUG depends on CONFIG_SPARSEMEM, so there is no need for
CONFIG_MEMORY_HOTPLUG_SPARSE anymore; adjust all instances to use
CONFIG_MEMORY_HOTPLUG and remove CONFIG_MEMORY_HOTPLUG_SPARSE.

Link: https://lkml.kernel.org/r/20210929143600.49379-3-david@redhat.com
Signed-off-by: David Hildenbrand 
Acked-by: Shuah Khan 	[kselftest]
Acked-by: Greg Kroah-Hartman 
Acked-by: Oscar Salvador 
Cc: Alex Shi 
Cc: Andy Lutomirski 
Cc: Benjamin Herrenschmidt 
Cc: Borislav Petkov 
Cc: Dave Hansen 
Cc: "H. Peter Anvin" 
Cc: Ingo Molnar 
Cc: Jason Wang 
Cc: Jonathan Corbet 
Cc: Michael Ellerman 
Cc: "Michael S. Tsirkin" 
Cc: Michal Hocko 
Cc: Mike Rapoport 
Cc: Paul Mackerras 
Cc: Peter Zijlstra 
Cc: "Rafael J. Wysocki" 
Cc: Thomas Gleixner 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: don't panic when links can't be created in sysfs

2020-10-16T18:11:18+00:00

At boot time, or when doing memory hot-add operations, if the links in
sysfs can't be created, the system is still able to run, so just report
the error in the kernel log rather than triggering a BUG_ON() and
potentially making the system unusable, because the call path can be
called with locks held.

Since the number of memory blocks managed could be high, the messages are
rate limited.
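
A minimal userspace sketch of "report and continue, but rate limited"
follows.  It is an illustration only: `struct ratelimit_model`,
`ratelimit_ok()` and `warn_link_failed()` are hypothetical names, the time
source is supplied by the caller, and the message text is made up for the
example rather than taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical limiter: allow at most `burst` messages per `interval`. */
struct ratelimit_model {
	unsigned long interval;     /* window length, in caller-defined ticks */
	unsigned int burst;         /* messages allowed per window */
	unsigned long window_start; /* tick at which the current window began */
	unsigned int printed;       /* messages emitted in the current window */
	unsigned int missed;        /* messages suppressed so far */
};

static bool ratelimit_ok(struct ratelimit_model *rl, unsigned long now)
{
	assert(rl != NULL);
	if (now - rl->window_start >= rl->interval) {
		rl->window_start = now; /* start a fresh window */
		rl->printed = 0;
	}
	if (rl->printed < rl->burst) {
		rl->printed++;
		return true;
	}
	rl->missed++;
	return false;
}

/* Report a failed link without aborting; returns true if a line was printed. */
static bool warn_link_failed(struct ratelimit_model *rl, unsigned long now,
			     int block_id, int err)
{
	if (!ratelimit_ok(rl, now))
		return false;
	fprintf(stderr, "failed to link memory block %d: error %d\n",
		block_id, err);
	return true;
}
```

The caller keeps going regardless of the return value, which is the point of
the change: a missing sysfs link is logged, not fatal.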

As a consequence, link_mem_sections() has no status to report anymore.

Signed-off-by: Laurent Dufour 
Signed-off-by: Andrew Morton 
Reviewed-by: Oscar Salvador 
Acked-by: Michal Hocko 
Acked-by: David Hildenbrand 
Cc: Greg Kroah-Hartman 
Cc: Fenghua Yu 
Cc: Nathan Lynch 
Cc: "Rafael J . Wysocki" 
Cc: Scott Cheloha 
Cc: Tony Luck 
Link: https://lkml.kernel.org/r/20200915094143.79181-4-ldufour@linux.ibm.com
Signed-off-by: Linus Torvalds

mm: don't rely on system state to detect hot-plug operations

2020-09-26T17:33:57+00:00

In register_mem_sect_under_node() the system_state's value is checked to
detect whether the call is made during boot time or during a hot-plug
operation.  Unfortunately, that check against SYSTEM_BOOTING is wrong
because regular memory is registered at SYSTEM_SCHEDULING state.  In
addition, memory hot-plug operations can be triggered in this system
state by ACPI [1].  So checking against the system state is not
enough.

The consequence shows up on systems with interleaved node ranges like this:

 Early memory node ranges
   node   1: [mem 0x0000000000000000-0x000000011fffffff]
   node   2: [mem 0x0000000120000000-0x000000014fffffff]
   node   1: [mem 0x0000000150000000-0x00000001ffffffff]
   node   0: [mem 0x0000000200000000-0x000000048fffffff]
   node   2: [mem 0x0000000490000000-0x00000007ffffffff]

This can be seen on PowerPC LPAR after multiple memory hot-plug and
hot-unplug operations are done.  At the next reboot the node's memory
ranges can be interleaved and since the call to link_mem_sections() is
made in topology_init() while the system is in the SYSTEM_SCHEDULING
state, the node's id is not checked, and the sections are registered to
multiple nodes:

  $ ls -l /sys/devices/system/memory/memory21/node*
  total 0
  lrwxrwxrwx 1 root root     0 Aug 24 05:27 node1 -> ../../node/node1
  lrwxrwxrwx 1 root root     0 Aug 24 05:27 node2 -> ../../node/node2

In that case, the system is able to boot, but if one of these
memory blocks is later hot-unplugged and then hot-plugged, the sysfs
inconsistency is detected and triggers a BUG_ON():

  kernel BUG at /Users/laurent/src/linux-ppc/mm/memory_hotplug.c:1084!
  Oops: Exception in kernel mode, sig: 5 [#1]
  LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
  Modules linked in: rpadlpar_io rpaphp pseries_rng rng_core vmx_crypto gf128mul binfmt_misc ip_tables x_tables xfs libcrc32c crc32c_vpmsum autofs4
  CPU: 8 PID: 10256 Comm: drmgr Not tainted 5.9.0-rc1+ #25
  Call Trace:
    add_memory_resource+0x23c/0x340 (unreliable)
    __add_memory+0x5c/0xf0
    dlpar_add_lmb+0x1b4/0x500
    dlpar_memory+0x1f8/0xb80
    handle_dlpar_errorlog+0xc0/0x190
    dlpar_store+0x198/0x4a0
    kobj_attr_store+0x30/0x50
    sysfs_kf_write+0x64/0x90
    kernfs_fop_write+0x1b0/0x290
    vfs_write+0xe8/0x290
    ksys_write+0xdc/0x130
    system_call_exception+0x160/0x270
    system_call_common+0xf0/0x27c

This patch addresses the root cause by not relying on the system_state
value to detect whether the call is due to a hot-plug operation.  An
extra parameter is added to link_mem_sections() detailing whether the
operation is due to a hot-plug operation.
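
The difference between inferring the context from global state and passing
it explicitly can be sketched as follows.  This is a userspace model, not the
patch itself: `enum init_context_model`, `MODEL_CTX_EARLY`,
`MODEL_CTX_HOTPLUG` and `should_link_block()` are hypothetical names, and the
per-page node ids are passed in as a plain array.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag supplied by the caller instead of reading global state. */
enum init_context_model { MODEL_CTX_EARLY, MODEL_CTX_HOTPLUG };

/*
 * Decide whether a block whose pages sit on the nodes listed in page_nids[]
 * should be linked under node nid.  Returns 1 to link, 0 to skip.
 */
static int should_link_block(const int *page_nids, size_t nr_pages, int nid,
			     enum init_context_model ctx)
{
	size_t i;

	assert(page_nids != NULL || nr_pages == 0);

	/* Hot-added memory belongs to exactly one node: no per-page check. */
	if (ctx == MODEL_CTX_HOTPLUG)
		return 1;

	/* At boot, ranges may be interleaved: link only if a page is on nid. */
	for (i = 0; i < nr_pages; i++)
		if (page_nids[i] == nid)
			return 1;
	return 0;
}
```

Because the caller states its context, the decision no longer depends on
which system state happens to be active when the function runs.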

[1] According to Oscar Salvador, using this qemu command line, ACPI
memory hotplug operations are raised at SYSTEM_SCHEDULING state:

  $QEMU -enable-kvm -machine pc -smp 4,sockets=4,cores=1,threads=1 -cpu host -monitor pty \
        -m size=$MEM,slots=255,maxmem=4294967296k  \
        -numa node,nodeid=0,cpus=0-3,mem=512 -numa node,nodeid=1,mem=512 \
        -object memory-backend-ram,id=memdimm0,size=134217728 -device pc-dimm,node=0,memdev=memdimm0,id=dimm0,slot=0 \
        -object memory-backend-ram,id=memdimm1,size=134217728 -device pc-dimm,node=0,memdev=memdimm1,id=dimm1,slot=1 \
        -object memory-backend-ram,id=memdimm2,size=134217728 -device pc-dimm,node=0,memdev=memdimm2,id=dimm2,slot=2 \
        -object memory-backend-ram,id=memdimm3,size=134217728 -device pc-dimm,node=0,memdev=memdimm3,id=dimm3,slot=3 \
        -object memory-backend-ram,id=memdimm4,size=134217728 -device pc-dimm,node=1,memdev=memdimm4,id=dimm4,slot=4 \
        -object memory-backend-ram,id=memdimm5,size=134217728 -device pc-dimm,node=1,memdev=memdimm5,id=dimm5,slot=5 \
        -object memory-backend-ram,id=memdimm6,size=134217728 -device pc-dimm,node=1,memdev=memdimm6,id=dimm6,slot=6 \

Fixes: 4fbce633910e ("mm/memory_hotplug.c: make register_mem_sect_under_node() a callback of walk_memory_range()")
Signed-off-by: Laurent Dufour 
Signed-off-by: Andrew Morton 
Reviewed-by: David Hildenbrand 
Reviewed-by: Oscar Salvador 
Acked-by: Michal Hocko 
Cc: Greg Kroah-Hartman 
Cc: "Rafael J. Wysocki" 
Cc: Fenghua Yu 
Cc: Nathan Lynch 
Cc: Scott Cheloha 
Cc: Tony Luck 
Cc: 
Link: https://lkml.kernel.org/r/20200915094143.79181-3-ldufour@linux.ibm.com
Signed-off-by: Linus Torvalds

mm: make register_mem_sect_under_node() static

2019-07-19T00:08:06+00:00

It is only used internally.

Link: http://lkml.kernel.org/r/20190614100114.311-4-david@redhat.com
Signed-off-by: David Hildenbrand 
Reviewed-by: Andrew Morton 
Cc: Greg Kroah-Hartman 
Cc: "Rafael J. Wysocki" 
Cc: Keith Busch 
Cc: Oscar Salvador 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm/memory_hotplug: make unregister_memory_block_under_nodes() never fail

2019-07-19T00:08:06+00:00

We really don't want anything during memory hotunplug to fail.  We
always pass a valid memory block device, so that check can go.  Avoid
allocating memory and eventually failing.  As we are always called under
lock, we can use a static piece of memory.  This avoids having to put
the structure onto the stack, having to guess about the stack size of
callers.
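
The "static piece of memory" approach can be sketched in userspace C.  This
is a model only: `unlink_block_from_nodes()` and `MODEL_MAX_NODES` are
hypothetical names, the per-page node ids are passed in as an array, and the
sketch assumes, as the text does, that every caller already holds a lock.

```c
#include <assert.h>
#include <string.h>

#define MODEL_MAX_NODES 8

/* Returns how many distinct nodes the block was unlinked from. */
static int unlink_block_from_nodes(const int *page_nids, int nr_pages)
{
	/*
	 * Static scratch space: no allocation that could fail and no large
	 * stack frame.  Safe only because callers serialize on a lock.
	 */
	static unsigned char unlinked[MODEL_MAX_NODES];
	int i, count = 0;

	memset(unlinked, 0, sizeof(unlinked));
	for (i = 0; i < nr_pages; i++) {
		int nid = page_nids[i];

		assert(nid >= 0 && nid < MODEL_MAX_NODES);
		if (unlinked[nid])
			continue;  /* links for this node were already removed */
		unlinked[nid] = 1;
		count++;           /* real code would remove the sysfs links here */
	}
	return count;
}
```

The function has no failure path at all, which matches the goal stated
above; the cost is that it must never be called concurrently.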

Patch inspired by a patch from Oscar Salvador.

In the future, there might be no need to iterate over nodes at all.
mem->nid should tell us exactly what to remove.  Memory block devices
with mixed nodes (added during boot) should be properly fenced off and
never removed.

Link: http://lkml.kernel.org/r/20190527111152.16324-11-david@redhat.com
Signed-off-by: David Hildenbrand 
Reviewed-by: Wei Yang 
Reviewed-by: Oscar Salvador 
Acked-by: Michal Hocko 
Cc: Greg Kroah-Hartman 
Cc: "Rafael J. Wysocki" 
Cc: Alex Deucher 
Cc: "David S. Miller" 
Cc: Mark Brown 
Cc: Chris Wilson 
Cc: David Hildenbrand 
Cc: Jonathan Cameron 
Cc: Andrew Banman 
Cc: Andy Lutomirski 
Cc: Anshuman Khandual 
Cc: Ard Biesheuvel 
Cc: Arun KS 
Cc: Baoquan He 
Cc: Benjamin Herrenschmidt 
Cc: Borislav Petkov 
Cc: Catalin Marinas 
Cc: Chintan Pandya 
Cc: Christophe Leroy 
Cc: Dan Williams 
Cc: Dave Hansen 
Cc: Fenghua Yu 
Cc: Heiko Carstens 
Cc: "H. Peter Anvin" 
Cc: Ingo Molnar 
Cc: Joonsoo Kim 
Cc: Jun Yao 
Cc: "Kirill A. Shutemov" 
Cc: Logan Gunthorpe 
Cc: Mark Rutland 
Cc: Masahiro Yamada 
Cc: Mathieu Malaterre 
Cc: Michael Ellerman 
Cc: Mike Rapoport 
Cc: "mike.travis@hpe.com" 
Cc: Nicholas Piggin 
Cc: Paul Mackerras 
Cc: Pavel Tatashin 
Cc: Peter Zijlstra 
Cc: Qian Cai 
Cc: Rich Felker 
Cc: Rob Herring 
Cc: Robin Murphy 
Cc: Thomas Gleixner 
Cc: Tony Luck 
Cc: Vasily Gorbik 
Cc: Will Deacon 
Cc: Yoshinori Sato 
Cc: Yu Zhao 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm/memory_hotplug: remove memory block devices before arch_remove_memory()

2019-07-19T00:08:06+00:00

Let's factor out removing of memory block devices, which is only
necessary for memory added via add_memory() and friends that created
memory block devices.  Remove the devices before calling
arch_remove_memory().

This finishes factoring out memory block device handling from
arch_add_memory() and arch_remove_memory().

Link: http://lkml.kernel.org/r/20190527111152.16324-10-david@redhat.com
Signed-off-by: David Hildenbrand 
Reviewed-by: Dan Williams 
Acked-by: Michal Hocko 
Cc: Greg Kroah-Hartman 
Cc: "Rafael J. Wysocki" 
Cc: David Hildenbrand 
Cc: "mike.travis@hpe.com" 
Cc: Andrew Banman 
Cc: Ingo Molnar 
Cc: Alex Deucher 
Cc: "David S. Miller" 
Cc: Mark Brown 
Cc: Chris Wilson 
Cc: Oscar Salvador 
Cc: Jonathan Cameron 
Cc: Arun KS 
Cc: Mathieu Malaterre 
Cc: Andy Lutomirski 
Cc: Anshuman Khandual 
Cc: Ard Biesheuvel 
Cc: Baoquan He 
Cc: Benjamin Herrenschmidt 
Cc: Borislav Petkov 
Cc: Catalin Marinas 
Cc: Chintan Pandya 
Cc: Christophe Leroy 
Cc: Dave Hansen 
Cc: Fenghua Yu 
Cc: Heiko Carstens 
Cc: "H. Peter Anvin" 
Cc: Joonsoo Kim 
Cc: Jun Yao 
Cc: "Kirill A. Shutemov" 
Cc: Logan Gunthorpe 
Cc: Mark Rutland 
Cc: Masahiro Yamada 
Cc: Michael Ellerman 
Cc: Mike Rapoport 
Cc: Nicholas Piggin 
Cc: Oscar Salvador 
Cc: Paul Mackerras 
Cc: Pavel Tatashin 
Cc: Peter Zijlstra 
Cc: Qian Cai 
Cc: Rich Felker 
Cc: Rob Herring 
Cc: Robin Murphy 
Cc: Thomas Gleixner 
Cc: Tony Luck 
Cc: Vasily Gorbik 
Cc: Wei Yang 
Cc: Will Deacon 
Cc: Yoshinori Sato 
Cc: Yu Zhao 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

node: Add memory-side caching attributes

2019-04-04T16:41:21+00:00

System memory may have caches to help improve access speed to frequently
requested address ranges. While the system provided cache is transparent
to the software accessing these memory ranges, applications can optimize
their own access based on cache attributes.

Provide a new API for the kernel to register these memory-side caches
under the memory node that provides it.

The new sysfs representation is modeled from the existing cpu cacheinfo
attributes, as seen from /sys/devices/system/cpu//cache/.  Unlike CPU
cacheinfo though, the node cache level is reported from the view of the
memory. A higher level number is nearer to the CPU, while lower levels
are closer to the last level memory.

The exported attributes are the cache size, the line size, the associativity
indexing, and the write-back policy.  Also add the attributes for the system
memory caches to the sysfs stable documentation.
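
The level numbering can be illustrated with a small userspace sketch.  It is
a model under stated assumptions: `struct node_cache_model` and
`nearest_to_cpu()` are hypothetical names chosen for this example, and only a
subset of the attributes listed above is modeled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical mirror of some exported attributes of one memory-side cache. */
struct node_cache_model {
	unsigned int level;      /* a higher level is nearer to the CPU */
	unsigned long long size; /* cache size in bytes */
	unsigned int line_size;  /* line size in bytes */
};

/* Return the cache nearest to the CPU, i.e. the one with the highest level. */
static const struct node_cache_model *
nearest_to_cpu(const struct node_cache_model *caches, int n)
{
	const struct node_cache_model *best = NULL;
	int i;

	assert(n >= 0);
	for (i = 0; i < n; i++)
		if (best == NULL || caches[i].level > best->level)
			best = &caches[i];
	return best; /* NULL if the node has no memory-side cache */
}
```

This is the opposite of how CPU cacheinfo levels read, so code that handles
both kinds of cache needs to keep the two orderings apart.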

Signed-off-by: Keith Busch 
Reviewed-by: Rafael J. Wysocki 
Reviewed-by: Brice Goglin 
Tested-by: Brice Goglin 
Signed-off-by: Greg Kroah-Hartman