linux-toradex.git/arch/ia64/mm, branch v5.11

mm: memmap defer init doesn't work as expected

2020-12-29T23:36:49+00:00

VMware observed a performance regression during memmap init on their
platform, and bisected to commit 73a6e474cb376 ("mm: memmap_init:
iterate over memblock regions rather that check each PFN") causing it.

Before the commit:

  [0.033176] Normal zone: 1445888 pages used for memmap
  [0.033176] Normal zone: 89391104 pages, LIFO batch:63
  [0.035851] ACPI: PM-Timer IO Port: 0x448

With commit

  [0.026874] Normal zone: 1445888 pages used for memmap
  [0.026875] Normal zone: 89391104 pages, LIFO batch:63
  [2.028450] ACPI: PM-Timer IO Port: 0x448

The root cause is the current memmap defer init doesn't work as expected.

Before, memmap_init_zone() was used to do memmap init of one whole zone,
to initialize all low zones of one numa node, but defer memmap init of
the last zone in that numa node.  However, since commit 73a6e474cb376,
function memmap_init() is adapted to iterater over memblock regions
inside one zone, then call memmap_init_zone() to do memmap init for each
region.

E.g, on VMware's system, the memory layout is as below, there are two
memory regions in node 2.  The current code will mistakenly initialize the
whole 1st region [mem 0xab00000000-0xfcffffffff], then do memmap defer to
iniatialize only one memmory section on the 2nd region [mem
0x10000000000-0x1033fffffff].  In fact, we only expect to see that there's
only one memory section's memmap initialized.  That's why more time is
costed at the time.

[    0.008842] ACPI: SRAT: Node 0 PXM 0 [mem 0x00000000-0x0009ffff]
[    0.008842] ACPI: SRAT: Node 0 PXM 0 [mem 0x00100000-0xbfffffff]
[    0.008843] ACPI: SRAT: Node 0 PXM 0 [mem 0x100000000-0x55ffffffff]
[    0.008844] ACPI: SRAT: Node 1 PXM 1 [mem 0x5600000000-0xaaffffffff]
[    0.008844] ACPI: SRAT: Node 2 PXM 2 [mem 0xab00000000-0xfcffffffff]
[    0.008845] ACPI: SRAT: Node 2 PXM 2 [mem 0x10000000000-0x1033fffffff]

Now, let's add a parameter 'zone_end_pfn' to memmap_init_zone() to pass
down the real zone end pfn so that defer_init() can use it to judge
whether defer need be taken in zone wide.

Link: https://lkml.kernel.org/r/20201223080811.16211-1-bhe@redhat.com
Link: https://lkml.kernel.org/r/20201223080811.16211-2-bhe@redhat.com
Fixes: commit 73a6e474cb376 ("mm: memmap_init: iterate over memblock regions rather that check each PFN")
Signed-off-by: Baoquan He 
Reported-by: Rahul Gopakumar 
Reviewed-by: Mike Rapoport 
Cc: David Hildenbrand 
Cc: 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

ia64: forbid using VIRTUAL_MEM_MAP with FLATMEM

2020-12-15T20:13:42+00:00

Virtual memory map was intended to avoid wasting memory on the memory map
on systems with large holes in the physical memory layout. Long ago it been
superseded first by DISCONTIGMEM and then by SPARSEMEM. Moreover,
SPARSEMEM_VMEMMAP provide the same functionality in much more portable way.

As the first step to removing the VIRTUAL_MEM_MAP forbid it's usage with
FLATMEM and panic on systems with large holes in the physical memory
layout that try to run FLATMEM kernels.

Link: https://lkml.kernel.org/r/20201101170454.9567-7-rppt@kernel.org
Signed-off-by: Mike Rapoport 
Cc: Alexey Dobriyan 
Cc: Catalin Marinas 
Cc: Geert Uytterhoeven 
Cc: Greg Ungerer 
Cc: John Paul Adrian Glaubitz 
Cc: Jonathan Corbet 
Cc: Matt Turner 
Cc: Meelis Roos 
Cc: Michael Schmitz 
Cc: Russell King 
Cc: Tony Luck 
Cc: Vineet Gupta 
Cc: Will Deacon 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

ia64: split virtual map initialization out of paging_init()

2020-12-15T20:13:42+00:00

For both FLATMEM and DISCONTIGMEM/SPARSEMEM the virtual map initialization
is spread over paging_init() for no good reason.

Split out the bits related to virtual map initialization to a helper
functions, one for FLATMEM and another for !FLATMEM configurations.

Link: https://lkml.kernel.org/r/20201101170454.9567-6-rppt@kernel.org
Signed-off-by: Mike Rapoport 
Cc: Alexey Dobriyan 
Cc: Catalin Marinas 
Cc: Geert Uytterhoeven 
Cc: Greg Ungerer 
Cc: John Paul Adrian Glaubitz 
Cc: Jonathan Corbet 
Cc: Matt Turner 
Cc: Meelis Roos 
Cc: Michael Schmitz 
Cc: Russell King 
Cc: Tony Luck 
Cc: Vineet Gupta 
Cc: Will Deacon 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

ia64: discontig: paging_init(): remove local max_pfn calculation

2020-12-15T20:13:42+00:00

The maximal PFN in the system is calculated during find_memory() time and
it is stored at max_low_pfn then.

Use this value in paging_init() and remove the redundant detection of
max_pfn in that function.

Link: https://lkml.kernel.org/r/20201101170454.9567-5-rppt@kernel.org
Signed-off-by: Mike Rapoport 
Cc: Alexey Dobriyan 
Cc: Catalin Marinas 
Cc: Geert Uytterhoeven 
Cc: Greg Ungerer 
Cc: John Paul Adrian Glaubitz 
Cc: Jonathan Corbet 
Cc: Matt Turner 
Cc: Meelis Roos 
Cc: Michael Schmitz 
Cc: Russell King 
Cc: Tony Luck 
Cc: Vineet Gupta 
Cc: Will Deacon 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

ia64: remove 'ifdef CONFIG_ZONE_DMA32' statements

2020-12-15T20:13:42+00:00

After the removal of SN2 platform (commit cf07cb1ff4ea ("ia64: remove
support for the SGI SN2 platform") IA-64 always has ZONE_DMA32 and there is
no point to guard code with this configuration option.

Remove ifdefery associated with CONFIG_ZONE_DMA32

Link: https://lkml.kernel.org/r/20201101170454.9567-4-rppt@kernel.org
Signed-off-by: Mike Rapoport 
Cc: Alexey Dobriyan 
Cc: Catalin Marinas 
Cc: Geert Uytterhoeven 
Cc: Greg Ungerer 
Cc: John Paul Adrian Glaubitz 
Cc: Jonathan Corbet 
Cc: Matt Turner 
Cc: Meelis Roos 
Cc: Michael Schmitz 
Cc: Russell King 
Cc: Tony Luck 
Cc: Vineet Gupta 
Cc: Will Deacon 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

ia64: remove custom __early_pfn_to_nid()

2020-12-15T20:13:42+00:00

The ia64 implementation of __early_pfn_to_nid() essentially relies on the
same data as the generic implementation.

The correspondence between memory ranges and nodes is set in memblock
during early memory initialization in register_active_ranges() function.

The initialization of sparsemem that requires early_pfn_to_nid() happens
later and it can use the memblock information like the other architectures.

Link: https://lkml.kernel.org/r/20201101170454.9567-3-rppt@kernel.org
Signed-off-by: Mike Rapoport 
Cc: Alexey Dobriyan 
Cc: Catalin Marinas 
Cc: Geert Uytterhoeven 
Cc: Greg Ungerer 
Cc: John Paul Adrian Glaubitz 
Cc: Jonathan Corbet 
Cc: Matt Turner 
Cc: Meelis Roos 
Cc: Michael Schmitz 
Cc: Russell King 
Cc: Tony Luck 
Cc: Vineet Gupta 
Cc: Will Deacon 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: pass migratetype into memmap_init_zone() and move_pfn_range_to_zone()

2020-10-16T18:11:17+00:00

On the memory onlining path, we want to start with MIGRATE_ISOLATE, to
un-isolate the pages after memory onlining is complete.  Let's allow
passing in the migratetype.

Signed-off-by: David Hildenbrand 
Signed-off-by: Andrew Morton 
Reviewed-by: Oscar Salvador 
Acked-by: Michal Hocko 
Cc: Wei Yang 
Cc: Baoquan He 
Cc: Pankaj Gupta 
Cc: Tony Luck 
Cc: Fenghua Yu 
Cc: Logan Gunthorpe 
Cc: Dan Williams 
Cc: Mike Rapoport 
Cc: "Matthew Wilcox (Oracle)" 
Cc: Michel Lespinasse 
Cc: Charan Teja Reddy 
Cc: Mel Gorman 
Link: https://lkml.kernel.org/r/20200819175957.28465-10-david@redhat.com
Signed-off-by: Linus Torvalds

Merge tag 'dma-mapping-5.10' of git://git.infradead.org/users/hch/dma-mapping

2020-10-15T21:43:29+00:00

Pull dma-mapping updates from Christoph Hellwig:

 - rework the non-coherent DMA allocator

 - move private definitions out of 

 - lower CMA_ALIGNMENT (Paul Cercueil)

 - remove the omap1 dma address translation in favor of the common code

 - make dma-direct aware of multiple dma offset ranges (Jim Quinlan)

 - support per-node DMA CMA areas (Barry Song)

 - increase the default seg boundary limit (Nicolin Chen)

 - misc fixes (Robin Murphy, Thomas Tai, Xu Wang)

 - various cleanups

* tag 'dma-mapping-5.10' of git://git.infradead.org/users/hch/dma-mapping: (63 commits)
  ARM/ixp4xx: add a missing include of dma-map-ops.h
  dma-direct: simplify the DMA_ATTR_NO_KERNEL_MAPPING handling
  dma-direct: factor out a dma_direct_alloc_from_pool helper
  dma-direct check for highmem pages in dma_direct_alloc_pages
  dma-mapping: merge  into 
  dma-mapping: move large parts of  to kernel/dma
  dma-mapping: move dma-debug.h to kernel/dma/
  dma-mapping: remove 
  dma-mapping: merge  into 
  dma-contiguous: remove dma_contiguous_set_default
  dma-contiguous: remove dev_set_cma_area
  dma-contiguous: remove dma_declare_contiguous
  dma-mapping: split 
  cma: decrease CMA_ALIGNMENT lower limit to 2
  firewire-ohci: use dma_alloc_pages
  dma-iommu: implement ->alloc_noncoherent
  dma-mapping: add new {alloc,free}_noncoherent dma_map_ops methods
  dma-mapping: add a new dma_alloc_pages API
  dma-mapping: remove dma_cache_sync
  53c700: convert to dma_alloc_noncoherent
  ...

dma-mapping: merge into

2020-10-06T05:07:06+00:00

Move more nitty gritty DMA implementation details into the common
internal header.

Signed-off-by: Christoph Hellwig

mm: replace memmap_context by meminit_context

2020-09-26T17:33:57+00:00

Patch series "mm: fix memory to node bad links in sysfs", v3.

Sometimes, firmware may expose interleaved memory layout like this:

 Early memory node ranges
   node   1: [mem 0x0000000000000000-0x000000011fffffff]
   node   2: [mem 0x0000000120000000-0x000000014fffffff]
   node   1: [mem 0x0000000150000000-0x00000001ffffffff]
   node   0: [mem 0x0000000200000000-0x000000048fffffff]
   node   2: [mem 0x0000000490000000-0x00000007ffffffff]

In that case, we can see memory blocks assigned to multiple nodes in
sysfs:

  $ ls -l /sys/devices/system/memory/memory21
  total 0
  lrwxrwxrwx 1 root root     0 Aug 24 05:27 node1 -> ../../node/node1
  lrwxrwxrwx 1 root root     0 Aug 24 05:27 node2 -> ../../node/node2
  -rw-r--r-- 1 root root 65536 Aug 24 05:27 online
  -r--r--r-- 1 root root 65536 Aug 24 05:27 phys_device
  -r--r--r-- 1 root root 65536 Aug 24 05:27 phys_index
  drwxr-xr-x 2 root root     0 Aug 24 05:27 power
  -r--r--r-- 1 root root 65536 Aug 24 05:27 removable
  -rw-r--r-- 1 root root 65536 Aug 24 05:27 state
  lrwxrwxrwx 1 root root     0 Aug 24 05:25 subsystem -> ../../../../bus/memory
  -rw-r--r-- 1 root root 65536 Aug 24 05:25 uevent
  -r--r--r-- 1 root root 65536 Aug 24 05:27 valid_zones

The same applies in the node's directory with a memory21 link in both
the node1 and node2's directory.

This is wrong but doesn't prevent the system to run.  However when
later, one of these memory blocks is hot-unplugged and then hot-plugged,
the system is detecting an inconsistency in the sysfs layout and a
BUG_ON() is raised:

  kernel BUG at /Users/laurent/src/linux-ppc/mm/memory_hotplug.c:1084!
  LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
  Modules linked in: rpadlpar_io rpaphp pseries_rng rng_core vmx_crypto gf128mul binfmt_misc ip_tables x_tables xfs libcrc32c crc32c_vpmsum autofs4
  CPU: 8 PID: 10256 Comm: drmgr Not tainted 5.9.0-rc1+ #25
  Call Trace:
    add_memory_resource+0x23c/0x340 (unreliable)
    __add_memory+0x5c/0xf0
    dlpar_add_lmb+0x1b4/0x500
    dlpar_memory+0x1f8/0xb80
    handle_dlpar_errorlog+0xc0/0x190
    dlpar_store+0x198/0x4a0
    kobj_attr_store+0x30/0x50
    sysfs_kf_write+0x64/0x90
    kernfs_fop_write+0x1b0/0x290
    vfs_write+0xe8/0x290
    ksys_write+0xdc/0x130
    system_call_exception+0x160/0x270
    system_call_common+0xf0/0x27c

This has been seen on PowerPC LPAR.

The root cause of this issue is that when node's memory is registered,
the range used can overlap another node's range, thus the memory block
is registered to multiple nodes in sysfs.

There are two issues here:

 (a) The sysfs memory and node's layouts are broken due to these
     multiple links

 (b) The link errors in link_mem_sections() should not lead to a system
     panic.

To address (a) register_mem_sect_under_node should not rely on the
system state to detect whether the link operation is triggered by a hot
plug operation or not.  This is addressed by the patches 1 and 2 of this
series.

Issue (b) will be addressed separately.

This patch (of 2):

The memmap_context enum is used to detect whether a memory operation is
due to a hot-add operation or happening at boot time.

Make it general to the hotplug operation and rename it as
meminit_context.

There is no functional change introduced by this patch

Suggested-by: David Hildenbrand 
Signed-off-by: Laurent Dufour 
Signed-off-by: Andrew Morton 
Reviewed-by: David Hildenbrand 
Reviewed-by: Oscar Salvador 
Acked-by: Michal Hocko 
Cc: Greg Kroah-Hartman 
Cc: "Rafael J . Wysocki" 
Cc: Nathan Lynch 
Cc: Scott Cheloha 
Cc: Tony Luck 
Cc: Fenghua Yu 
Cc: 
Link: https://lkml.kernel.org/r/20200915094143.79181-1-ldufour@linux.ibm.com
Link: https://lkml.kernel.org/r/20200915132624.9723-1-ldufour@linux.ibm.com
Signed-off-by: Linus Torvalds