linux-toradex.git/mm/memory_hotplug.c, branch v5.2-rc2

treewide: Add SPDX license identifier for missed files

2019-05-21T08:50:45+00:00

Add SPDX license identifiers to all files which:

 - Have no license information of any form

 - Have EXPORT_.*_SYMBOL_GPL inside which was used in the
   initial scan/conversion to ignore the file

These files fall under the project license, GPL v2 only. The resulting SPDX
license identifier is:

  GPL-2.0-only

Signed-off-by: Thomas Gleixner 
Signed-off-by: Greg Kroah-Hartman

mm: shuffle initial free memory to improve memory-side-cache utilization

2019-05-15T02:52:48+00:00

Patch series "mm: Randomize free memory", v10.

This patch (of 3):

Randomization of the page allocator improves the average utilization of
a direct-mapped memory-side-cache.  Memory side caching is a platform
capability that Linux has been previously exposed to in HPC
(high-performance computing) environments on specialty platforms.  In
that instance it was a smaller pool of high-bandwidth-memory relative to
higher-capacity / lower-bandwidth DRAM.  Now, this capability is going
to be found on general purpose server platforms where DRAM is a cache in
front of higher latency persistent memory [1].

Robert offered an explanation of the state of the art of Linux
interactions with memory-side-caches [2], and I copy it here:

    It's been a problem in the HPC space:
    http://www.nersc.gov/research-and-development/knl-cache-mode-performance-coe/

    A kernel module called zonesort is available to try to help:
    https://software.intel.com/en-us/articles/xeon-phi-software

    and this abandoned patch series proposed that for the kernel:
    https://lkml.kernel.org/r/20170823100205.17311-1-lukasz.daniluk@intel.com

    Dan's patch series doesn't attempt to ensure buffers won't conflict, but
    also reduces the chance that the buffers will. This will make performance
    more consistent, albeit slower than "optimal" (which is near impossible
    to attain in a general-purpose kernel).  That's better than forcing
    users to deploy remedies like:
        "To eliminate this gradual degradation, we have added a Stream
         measurement to the Node Health Check that follows each job;
         nodes are rebooted whenever their measured memory bandwidth
         falls below 300 GB/s."

A replacement for zonesort was merged upstream in commit cc9aec03e58f
("x86/numa_emulation: Introduce uniform split capability").  With this
numa_emulation capability, memory can be split into cache sized
("near-memory" sized) numa nodes.  A bind operation to such a node, and
disabling workloads on other nodes, enables full cache performance.
However, once the workload exceeds the cache size then cache conflicts
are unavoidable.  While HPC environments might be able to tolerate
time-scheduling of cache sized workloads, for general purpose server
platforms, the oversubscribed cache case will be the common case.

The worst case scenario is that a server system owner benchmarks a
workload at boot with an un-contended cache only to see that performance
degrade over time, even below the average cache performance due to
excessive conflicts.  Randomization clips the peaks and fills in the
valleys of cache utilization to yield steady average performance.

Here are some performance impact details of the patches:

1/ An Intel internal synthetic memory bandwidth measurement tool saw a
   3X speedup in a contrived case that tries to force cache conflicts.
   The contrived case used the numa_emulation capability to force an
   instance of the benchmark to be run in two of the near-memory sized
   numa nodes.  If both instances were placed on the same emulated node
   they would fit and cause zero conflicts.  While on separate emulated
   nodes without randomization they underutilized the cache and
   conflicted unnecessarily due to the in-order allocation per node.

2/ A well known Java server application benchmark was run with a heap
   size that exceeded cache size by 3X.  The cache conflict rate was 8%
   for the first run and degraded to 21% after page allocator aging.  With
   randomization enabled the rate levelled out at 11%.

3/ A MongoDB workload did not observe measurable difference in
   cache-conflict rates, but the overall throughput dropped by 7% with
   randomization in one case.

4/ Mel Gorman ran his suite of performance workloads with randomization
   enabled on platforms without a memory-side-cache and saw a mix of some
   improvements and some losses [3].

While there is potentially significant improvement for applications that
depend on low latency access across a wide working-set, the performance
may be negligible to negative for other workloads.  For this reason the
shuffle capability defaults to off unless a direct-mapped
memory-side-cache is detected.  Even then, the page_alloc.shuffle=0
parameter can be specified to disable the randomization on those systems.

Outside of memory-side-cache utilization concerns there is potentially
security benefit from randomization.  Some data exfiltration and
return-oriented-programming attacks rely on the ability to infer the
location of sensitive data objects.  The kernel page allocator, especially
early in system boot, has predictable first-in-first-out behavior for
physical pages.  Pages are freed in physical address order when first
onlined.

Quoting Kees:
    "While we already have a base-address randomization
     (CONFIG_RANDOMIZE_MEMORY), attacks against the same hardware and
     memory layouts would certainly be using the predictability of
     allocation ordering (i.e. for attacks where the base address isn't
     important: only the relative positions between allocated memory).
     This is common in lots of heap-style attacks. They try to gain
     control over ordering by spraying allocations, etc.

     I'd really like to see this because it gives us something similar
     to CONFIG_SLAB_FREELIST_RANDOM but for the page allocator."

While SLAB_FREELIST_RANDOM reduces the predictability of some local slab
caches it leaves the vast bulk of memory to be predictably allocated in
order.
However, it should be noted, the concrete security benefits are hard to
quantify, and no known CVE is mitigated by this randomization.

Introduce shuffle_free_memory(), and its helper shuffle_zone(), to perform
a Fisher-Yates shuffle of the page allocator 'free_area' lists when they
are initially populated with free memory at boot and at hotplug time.  Do
this based on either the presence of a page_alloc.shuffle=Y command line
parameter, or autodetection of a memory-side-cache (to be added in a
follow-on patch).
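
As a rough illustration of the Fisher-Yates technique named above, here
is a minimal userspace sketch that shuffles an array of page frame
numbers.  It is not the kernel's shuffle_zone() implementation, which
operates on the 'free_area' lists themselves; demo_rand() and
demo_shuffle() are hypothetical names, and the xorshift generator is
only a stand-in for a real random source.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a random source: a small xorshift PRNG. */
static uint32_t demo_rand(uint32_t *state)
{
	uint32_t x = *state;

	x ^= x << 13;
	x ^= x >> 17;
	x ^= x << 5;
	*state = x;
	return x;
}

/*
 * Fisher-Yates shuffle: walk from the last slot down to the second,
 * swapping each slot with a randomly chosen slot at or below it.
 */
static void demo_shuffle(unsigned long *pfns, size_t n, uint32_t seed)
{
	uint32_t state = seed ? seed : 1;	/* xorshift state must be nonzero */
	size_t i;

	for (i = n; i > 1; i--) {
		/* 0 <= j < i; the modulo bias is ignored in this sketch */
		size_t j = demo_rand(&state) % i;
		unsigned long tmp = pfns[i - 1];

		pfns[i - 1] = pfns[j];
		pfns[j] = tmp;
	}
}
```

Calling demo_shuffle() on an array of order-10 aligned pfns yields a
permutation of the same pfns; no entry is lost or duplicated.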

The shuffling is done in terms of CONFIG_SHUFFLE_PAGE_ORDER sized free
pages where the default CONFIG_SHUFFLE_PAGE_ORDER is MAX_ORDER-1, i.e. 10,
or 4MB.  This trades off randomization granularity for time spent
shuffling.  MAX_ORDER-1 was chosen to be minimally invasive to the page
allocator while still showing memory-side cache behavior improvements,
and with the expectation that the security implications of finer
granularity randomization are mitigated by CONFIG_SLAB_FREELIST_RANDOM.  The
performance impact of the shuffling appears to be in the noise compared to
other memory initialization work.
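
The order-to-size arithmetic above can be checked with a small sketch.
It assumes 4KB base pages (a page shift of 12), which is what makes
order 10 come out to 4MB; the DEMO_ names and helper functions are
hypothetical and not kernel symbols.

```c
#include <assert.h>

#define DEMO_PAGE_SHIFT 12	/* assumption: 4KB base pages */

/* Size in bytes of one free page block of the given order. */
static unsigned long demo_order_bytes(unsigned int order)
{
	return 1UL << (DEMO_PAGE_SHIFT + order);
}

/* Number of independently shuffled units in a zone of zone_bytes. */
static unsigned long demo_shuffle_units(unsigned long zone_bytes,
					unsigned int order)
{
	return zone_bytes / demo_order_bytes(order);
}
```

Under these assumptions a 1GB zone holds 256 order-10 units, so the
shuffle permutes 256 entries rather than 262144 individual pages.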

This initial randomization can be undone over time so a follow-on patch is
introduced to inject entropy on page free decisions.  It is reasonable to
ask if the page free entropy is sufficient, but it is not enough due to
the in-order initial freeing of pages.  At the start of that process
putting page1 in front or behind page0 still keeps them close together,
page2 is still near page1 and has a high chance of being adjacent.  As
more pages are added ordering diversity improves, but there is still high
page locality for the low address pages and this leads to no significant
impact to the cache conflict rate.

[1]: https://itpeernetwork.intel.com/intel-optane-dc-persistent-memory-operating-modes/
[2]: https://lkml.kernel.org/r/AT5PR8401MB1169D656C8B5E121752FC0F8AB120@AT5PR8401MB1169.NAMPRD84.PROD.OUTLOOK.COM
[3]: https://lkml.org/lkml/2018/10/12/309

[dan.j.williams@intel.com: fix shuffle enable]
  Link: http://lkml.kernel.org/r/154943713038.3858443.4125180191382062871.stgit@dwillia2-desk3.amr.corp.intel.com
[cai@lca.pw: fix SHUFFLE_PAGE_ALLOCATOR help texts]
  Link: http://lkml.kernel.org/r/20190425201300.75650-1-cai@lca.pw
Link: http://lkml.kernel.org/r/154899811738.3165233.12325692939590944259.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams 
Signed-off-by: Qian Cai 
Reviewed-by: Kees Cook 
Acked-by: Michal Hocko 
Cc: Dave Hansen 
Cc: Keith Busch 
Cc: Robert Elliott 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm/memory_hotplug: make __remove_pages() and arch_remove_memory() never fail

2019-05-14T16:47:50+00:00

All callers of arch_remove_memory() ignore errors.  And we should really
try to remove any errors from the memory removal path.  No more errors are
reported from __remove_pages().  BUG() in s390x code in case
arch_remove_memory() is triggered.  We may implement that properly later.
WARN in case powerpc code failed to remove the section mapping, which is
better than ignoring the error completely right now.

Link: http://lkml.kernel.org/r/20190409100148.24703-5-david@redhat.com
Signed-off-by: David Hildenbrand 
Cc: Tony Luck 
Cc: Fenghua Yu 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Michael Ellerman 
Cc: Martin Schwidefsky 
Cc: Heiko Carstens 
Cc: Yoshinori Sato 
Cc: Rich Felker 
Cc: Dave Hansen 
Cc: Andy Lutomirski 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Borislav Petkov 
Cc: "H. Peter Anvin" 
Cc: Michal Hocko 
Cc: Mike Rapoport 
Cc: Oscar Salvador 
Cc: "Kirill A. Shutemov" 
Cc: Christophe Leroy 
Cc: Stefan Agner 
Cc: Nicholas Piggin 
Cc: Pavel Tatashin 
Cc: Vasily Gorbik 
Cc: Arun KS 
Cc: Geert Uytterhoeven 
Cc: Masahiro Yamada 
Cc: Rob Herring 
Cc: Joonsoo Kim 
Cc: Wei Yang 
Cc: Qian Cai 
Cc: Mathieu Malaterre 
Cc: Andrew Banman 
Cc: Greg Kroah-Hartman 
Cc: Ingo Molnar 
Cc: Mike Travis 
Cc: Oscar Salvador 
Cc: "Rafael J. Wysocki" 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm/memory_hotplug: make __remove_section() never fail

2019-05-14T16:47:49+00:00

Let's just warn in case a section is not valid instead of failing to
remove somewhere in the middle of the process, returning an error that
will be mostly ignored by callers.

Link: http://lkml.kernel.org/r/20190409100148.24703-4-david@redhat.com
Signed-off-by: David Hildenbrand 
Reviewed-by: Oscar Salvador 
Cc: Michal Hocko 
Cc: David Hildenbrand 
Cc: Pavel Tatashin 
Cc: Qian Cai 
Cc: Wei Yang 
Cc: Arun KS 
Cc: Mathieu Malaterre 
Cc: Andrew Banman 
Cc: Andy Lutomirski 
Cc: Benjamin Herrenschmidt 
Cc: Borislav Petkov 
Cc: Christophe Leroy 
Cc: Dave Hansen 
Cc: Fenghua Yu 
Cc: Geert Uytterhoeven 
Cc: Greg Kroah-Hartman 
Cc: Heiko Carstens 
Cc: "H. Peter Anvin" 
Cc: Ingo Molnar 
Cc: Joonsoo Kim 
Cc: "Kirill A. Shutemov" 
Cc: Martin Schwidefsky 
Cc: Masahiro Yamada 
Cc: Michael Ellerman 
Cc: Mike Rapoport 
Cc: Mike Travis 
Cc: Nicholas Piggin 
Cc: Oscar Salvador 
Cc: Paul Mackerras 
Cc: Peter Zijlstra 
Cc: "Rafael J. Wysocki" 
Cc: Rich Felker 
Cc: Rob Herring 
Cc: Stefan Agner 
Cc: Thomas Gleixner 
Cc: Tony Luck 
Cc: Vasily Gorbik 
Cc: Yoshinori Sato 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm/memory_hotplug: make unregister_memory_section() never fail

2019-05-14T16:47:49+00:00

Failing while removing memory is mostly ignored and cannot really be
handled.  Let's treat errors in unregister_memory_section() in a nice way,
warning, but continuing.

Link: http://lkml.kernel.org/r/20190409100148.24703-3-david@redhat.com
Signed-off-by: David Hildenbrand 
Cc: Greg Kroah-Hartman 
Cc: "Rafael J. Wysocki" 
Cc: Ingo Molnar 
Cc: Andrew Banman 
Cc: Mike Travis 
Cc: David Hildenbrand 
Cc: Oscar Salvador 
Cc: Michal Hocko 
Cc: Pavel Tatashin 
Cc: Qian Cai 
Cc: Wei Yang 
Cc: Arun KS 
Cc: Mathieu Malaterre 
Cc: Andy Lutomirski 
Cc: Benjamin Herrenschmidt 
Cc: Borislav Petkov 
Cc: Christophe Leroy 
Cc: Dave Hansen 
Cc: Fenghua Yu 
Cc: Geert Uytterhoeven 
Cc: Heiko Carstens 
Cc: "H. Peter Anvin" 
Cc: Ingo Molnar 
Cc: Joonsoo Kim 
Cc: "Kirill A. Shutemov" 
Cc: Martin Schwidefsky 
Cc: Masahiro Yamada 
Cc: Michael Ellerman 
Cc: Mike Rapoport 
Cc: Nicholas Piggin 
Cc: Oscar Salvador 
Cc: Paul Mackerras 
Cc: Peter Zijlstra 
Cc: Rich Felker 
Cc: Rob Herring 
Cc: Stefan Agner 
Cc: Thomas Gleixner 
Cc: Tony Luck 
Cc: Vasily Gorbik 
Cc: Yoshinori Sato 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm/memory_hotplug: release memory resource after arch_remove_memory()

2019-05-14T16:47:49+00:00

Patch series "mm/memory_hotplug: Better error handling when removing
memory", v1.

Error handling when removing memory is somewhat messed up right now.  Some
errors result in warnings, others are completely ignored.  Memory unplug
code can essentially not deal with errors properly as of now.
remove_memory() will never fail.

We have basically two choices:
1. Allow arch_remove_memory() and friends to fail, propagating errors via
   remove_memory(). Might be problematic (e.g. DIMMs consisting of multiple
   pieces added/removed separately).
2. Don't allow the functions to fail, handling errors in a nicer way.

It seems like most errors that can theoretically happen are really corner
cases and mostly theoretical (e.g.  "section not valid").  However e.g.
aborting removal of sections while all callers simply continue in case of
errors is not nice.

If we can guarantee that removal of memory always works (and WARN/skip in
case of theoretical errors so we can figure out what is going on), we can
go ahead and implement better error handling when adding memory.

E.g. via add_memory():

arch_add_memory()
ret = do_stuff()
if (ret) {
	arch_remove_memory();
	goto error;
}

Handling here that arch_remove_memory() might fail is basically
impossible.  So I suggest, let's avoid reporting errors while removing
memory, warning on theoretical errors instead and continuing instead of
aborting.
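
A minimal userspace sketch of the unwind pattern argued for above: when
the removal step returns void, the error path of the add step has
nothing further to handle.  All demo_ names are hypothetical stand-ins,
and the warning counter merely imitates what a WARN would signal in the
kernel.

```c
#include <assert.h>
#include <stdbool.h>

struct demo_state {
	bool arch_added;	/* is the range currently mapped? */
	int warnings;		/* stand-in for WARN reports */
};

static int demo_arch_add_memory(struct demo_state *s)
{
	s->arch_added = true;
	return 0;
}

/* Never fails: an unexpected state is reported and skipped. */
static void demo_arch_remove_memory(struct demo_state *s)
{
	if (!s->arch_added) {
		s->warnings++;
		return;
	}
	s->arch_added = false;
}

/* later_step_result models the return value of do_stuff(). */
static int demo_add_memory(struct demo_state *s, int later_step_result)
{
	int ret = demo_arch_add_memory(s);

	if (ret)
		return ret;

	ret = later_step_result;
	if (ret) {
		/* Unwind cannot fail, so the original error is returned as-is. */
		demo_arch_remove_memory(s);
		return ret;
	}
	return 0;
}
```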

This patch (of 4):

__add_pages() doesn't add the memory resource, so __remove_pages()
shouldn't remove it.  Let's factor it out.  Especially as it is a special
case for memory used as system memory, added via add_memory() and friends.

We now remove the resource after removing the sections instead of doing it
the other way around.  I don't think this change is problematic.

add_memory()
	register memory resource
	arch_add_memory()

remove_memory
	arch_remove_memory()
	release memory resource

While at it, explain why we ignore errors and that it only happens if
we remove memory in a different granularity than we added it.

[david@redhat.com: fix printk warning]
  Link: http://lkml.kernel.org/r/20190417120204.6997-1-david@redhat.com
Link: http://lkml.kernel.org/r/20190409100148.24703-2-david@redhat.com
Signed-off-by: David Hildenbrand 
Reviewed-by: Oscar Salvador 
Cc: Michal Hocko 
Cc: David Hildenbrand 
Cc: Pavel Tatashin 
Cc: Wei Yang 
Cc: Qian Cai 
Cc: Arun KS 
Cc: Mathieu Malaterre 
Cc: Andrew Banman 
Cc: Andy Lutomirski 
Cc: Benjamin Herrenschmidt 
Cc: Borislav Petkov 
Cc: Christophe Leroy 
Cc: Dave Hansen 
Cc: Fenghua Yu 
Cc: Geert Uytterhoeven 
Cc: Greg Kroah-Hartman 
Cc: Heiko Carstens 
Cc: "H. Peter Anvin" 
Cc: Ingo Molnar 
Cc: Joonsoo Kim 
Cc: "Kirill A. Shutemov" 
Cc: Martin Schwidefsky 
Cc: Masahiro Yamada 
Cc: Michael Ellerman 
Cc: Mike Rapoport 
Cc: Mike Travis 
Cc: Nicholas Piggin 
Cc: Oscar Salvador 
Cc: Paul Mackerras 
Cc: Peter Zijlstra 
Cc: "Rafael J. Wysocki" 
Cc: Rich Felker 
Cc: Rob Herring 
Cc: Stefan Agner 
Cc: Thomas Gleixner 
Cc: Tony Luck 
Cc: Vasily Gorbik 
Cc: Yoshinori Sato 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm, memory_hotplug: provide a more generic restrictions for memory hotplug

2019-05-14T16:47:49+00:00

arch_add_memory, __add_pages take a want_memblock which controls whether
the newly added memory should get the sysfs memblock user API (e.g.
ZONE_DEVICE users do not want/need this interface).  Some callers even
want to control where the memmap is allocated from by configuring
altmap.

Add a more generic hotplug context for arch_add_memory and __add_pages.
struct mhp_restrictions contains flags which specify additional features
to be enabled by the memory hotplug (MHP_MEMBLOCK_API currently) and
altmap for an alternative memmap allocator.
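
The shape of such a context structure might look roughly like the
sketch below.  Only the names mhp_restrictions, MHP_MEMBLOCK_API and
altmap come from the description above; the demo_ prefixed types, the
field layout and the flag value are assumptions made for illustration
and may differ from the actual kernel definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed flag bit; the real value is not given in the text. */
#define DEMO_MHP_MEMBLOCK_API (1UL << 0)

/* Hypothetical placeholder for the alternative memmap allocator. */
struct demo_vmem_altmap {
	unsigned long base_pfn;
	unsigned long free;
};

/* One context argument replaces the separate want_memblock/altmap ones. */
struct demo_mhp_restrictions {
	unsigned long flags;			/* features to enable */
	struct demo_vmem_altmap *altmap;	/* NULL: default memmap allocation */
};

static bool demo_wants_memblock(const struct demo_mhp_restrictions *r)
{
	return (r->flags & DEMO_MHP_MEMBLOCK_API) != 0;
}

static bool demo_uses_altmap(const struct demo_mhp_restrictions *r)
{
	return r->altmap != NULL;
}
```

In this sketch a system-memory caller would set the flag and leave
altmap NULL, while a ZONE_DEVICE style caller would leave the flag
clear and may supply an altmap.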

This patch shouldn't introduce any functional change.

[akpm@linux-foundation.org: build fix]
Link: http://lkml.kernel.org/r/20190408082633.2864-3-osalvador@suse.de
Signed-off-by: Michal Hocko 
Signed-off-by: Oscar Salvador 
Cc: Dan Williams 
Cc: David Hildenbrand 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm, memory_hotplug: cleanup memory offline path

2019-05-14T16:47:49+00:00

check_pages_isolated_cb currently accounts the whole pfn range as being
offlined if test_pages_isolated succeeds on the range.  This is based on
the assumption that all pages in the range are freed, which is currently
true in most cases but won't be with later changes, as pages marked
as vmemmap won't be isolated.

Move the offlined pages counting to offline_isolated_pages_cb and rely on
__offline_isolated_pages to return the correct value.
check_pages_isolated_cb will still do its primary job and check the pfn
range.

While we are at it remove check_pages_isolated and offline_isolated_pages
and use walk_system_ram_range directly, as done in online_pages.
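
The counting change can be pictured with a small userspace sketch: the
per-range callback adds up what the offlining helper actually reports
instead of assuming the whole range was offlined.  The demo_ functions
and the demo_is_vmemmap table are hypothetical; demo_walk() only
imitates the callback style of walk_system_ram_range and is not its
implementation.

```c
#include <assert.h>

#define DEMO_NR_PFNS 16

/* Hypothetical marker for pages that will not be isolated/offlined. */
static unsigned char demo_is_vmemmap[DEMO_NR_PFNS];

/* Returns how many pages in [start_pfn, end_pfn) were really offlined. */
static unsigned long demo_offline_isolated_pages(unsigned long start_pfn,
						 unsigned long end_pfn)
{
	unsigned long pfn, offlined = 0;

	for (pfn = start_pfn; pfn < end_pfn && pfn < DEMO_NR_PFNS; pfn++)
		if (!demo_is_vmemmap[pfn])
			offlined++;
	return offlined;
}

/* Callback: accumulate the returned count into the caller's total. */
static int demo_offline_cb(unsigned long start_pfn, unsigned long nr_pages,
			   void *data)
{
	unsigned long *offlined_pages = data;

	*offlined_pages += demo_offline_isolated_pages(start_pfn,
						       start_pfn + nr_pages);
	return 0;
}

/* Calls func on chunks of up to 4 pfns, stopping on the first error. */
static int demo_walk(unsigned long start_pfn, unsigned long nr_pages,
		     void *arg,
		     int (*func)(unsigned long, unsigned long, void *))
{
	unsigned long pfn = start_pfn, end = start_pfn + nr_pages;

	while (pfn < end) {
		unsigned long chunk = end - pfn < 4 ? end - pfn : 4;
		int ret = func(pfn, chunk, arg);

		if (ret)
			return ret;
		pfn += chunk;
	}
	return 0;
}
```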

Link: http://lkml.kernel.org/r/20190408082633.2864-2-osalvador@suse.de
Reviewed-by: David Hildenbrand 
Signed-off-by: Michal Hocko 
Signed-off-by: Oscar Salvador 
Cc: Dan Williams 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm/memory_hotplug.c: fix the wrong usage of N_HIGH_MEMORY

2019-05-14T16:47:46+00:00

In node_states_check_changes_online(), N_HIGH_MEMORY is used to substitute
ZONE_HIGHMEM directly.  This is not right.  N_HIGH_MEMORY is to mark the
memory state of node.  Here zone index is checked, which should be
compared with 'ZONE_HIGHMEM' accordingly.

Replace it with ZONE_HIGHMEM.

This is a code cleanup - no known runtime effects.

Link: http://lkml.kernel.org/r/20190320080732.14933-1-bhe@redhat.com
Fixes: 8efe33f40f3e ("mm/memory_hotplug.c: simplify node_states_check_changes_online")
Signed-off-by: Baoquan He 
Reviewed-by: David Hildenbrand 
Acked-by: Michal Hocko 
Reviewed-by: Oscar Salvador 
Cc: Wei Yang 
Cc: Mike Rapoport 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm,memory_hotplug: drop redundant hugepage_migration_supported check

2019-05-14T16:47:46+00:00

has_unmovable_pages() already checks whether the hugetlb page supports
migration, so all non-migratable hugetlb pages should have been caught
there.  Let us drop the check from scan_movable_pages() as it is redundant.

Link: http://lkml.kernel.org/r/20190320152658.10855-3-osalvador@suse.de
Signed-off-by: Oscar Salvador 
Acked-by: Michal Hocko 
Reviewed-by: David Hildenbrand 
Cc: Mike Kravetz 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds