linux-toradex.git/drivers/base/node.c, branch v4.14-rc7

mm: only display online cpus of the numa node

2017-10-13T23:18:32+00:00

When I execute numactl -H (which reads /sys/devices/system/node/nodeX/cpumap
and displays cpumask_of_node for each node), I get different result
on X86 and arm64.  For each numa node, the former only displayed online
CPUs, and the latter displayed all possible CPUs.  Unfortunately, both
Linux documentation and numactl manual have not described it clear.

I sent a mail to ask for help, and Michal Hocko replied that he
preferred to print online cpus because it doesn't really make much sense
to bind anything on offline nodes.

Will said:
 "I suspect the vast majority (if not all) code that reads this file was
  developed for x86, so having the same behaviour for arm64 sounds like
  something we should do ASAP before people try to special case with
  things like #ifdef __aarch64__. I'd rather have this in 4.14 if
  possible."

Link: http://lkml.kernel.org/r/1506678805-15392-2-git-send-email-thunder.leizhen@huawei.com
Signed-off-by: Zhen Lei 
Acked-by: Michal Hocko 
Cc: Catalin Marinas 
Cc: Will Deacon 
Cc: Greg Kroah-Hartman 
Cc: Tianhong Ding 
Cc: Hanjun Guo 
Cc: Libin 
Cc: Kefeng Wang 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: change the call sites of numa statistics items

2017-09-09T01:26:47+00:00

Patch series "Separate NUMA statistics from zone statistics", v2.

Each page allocation updates a set of per-zone statistics with a call to
zone_statistics().  As discussed in 2017 MM summit, these are a
substantial source of overhead in the page allocator and are very rarely
consumed.  This significant overhead in cache bouncing caused by zone
counters (NUMA associated counters) update in parallel in multi-threaded
page allocation (pointed out by Dave Hansen).

A link to the MM summit slides:
  http://people.netfilter.org/hawk/presentations/MM-summit2017/MM-summit2017-JesperBrouer.pdf

To mitigate this overhead, this patchset separates NUMA statistics from
zone statistics framework, and update NUMA counter threshold to a fixed
size of MAX_U16 - 2, as a small threshold greatly increases the update
frequency of the global counter from local per cpu counter (suggested by
Ying Huang).  The rationality is that these statistics counters don't
need to be read often, unlike other VM counters, so it's not a problem
to use a large threshold and make readers more expensive.

With this patchset, we see 31.3% drop of CPU cycles(537-->369, see
below) for per single page allocation and reclaim on Jesper's
page_bench03 benchmark.  Meanwhile, this patchset keeps the same style
of virtual memory statistics with little end-user-visible effects (only
move the numa stats to show behind zone page stats, see the first patch
for details).

I did an experiment of single page allocation and reclaim concurrently
using Jesper's page_bench03 benchmark on a 2-Socket Broadwell-based
server (88 processors with 126G memory) with different size of threshold
of pcp counter.

Benchmark provided by Jesper D Brouer(increase loop times to 10000000):
  https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm/bench

   Threshold   CPU cycles    Throughput(88 threads)
      32        799         241760478
      64        640         301628829
      125       537         358906028 <==> system by default
      256       468         412397590
      512       428         450550704
      4096      399         482520943
      20000     394         489009617
      30000     395         488017817
      65533     369(-31.3%) 521661345(+45.3%) <==> with this patchset
      N/A       342(-36.3%) 562900157(+56.8%) <==> disable zone_statistics

This patch (of 3):

In this patch, NUMA statistics is separated from zone statistics
framework, all the call sites of NUMA stats are changed to use
numa-stats-specific functions, it does not have any functionality change
except that the number of NUMA stats is shown behind zone page stats
when users *read* the zone info.

E.g. cat /proc/zoneinfo
    ***Base***                           ***With this patch***
nr_free_pages 3976                         nr_free_pages 3976
nr_zone_inactive_anon 0                    nr_zone_inactive_anon 0
nr_zone_active_anon 0                      nr_zone_active_anon 0
nr_zone_inactive_file 0                    nr_zone_inactive_file 0
nr_zone_active_file 0                      nr_zone_active_file 0
nr_zone_unevictable 0                      nr_zone_unevictable 0
nr_zone_write_pending 0                    nr_zone_write_pending 0
nr_mlock     0                             nr_mlock     0
nr_page_table_pages 0                      nr_page_table_pages 0
nr_kernel_stack 0                          nr_kernel_stack 0
nr_bounce    0                             nr_bounce    0
nr_zspages   0                             nr_zspages   0
numa_hit 0                                *nr_free_cma  0*
numa_miss 0                                numa_hit     0
numa_foreign 0                             numa_miss    0
numa_interleave 0                          numa_foreign 0
numa_local   0                             numa_interleave 0
numa_other   0                             numa_local   0
*nr_free_cma 0*                            numa_other 0
    ...                                        ...
vm stats threshold: 10                     vm stats threshold: 10
    ...                                        ...

The next patch updates the numa stats counter size and threshold.

[akpm@linux-foundation.org: coding-style fixes]
Link: http://lkml.kernel.org/r/1503568801-21305-2-git-send-email-kemi.wang@intel.com
Signed-off-by: Kemi Wang 
Reported-by: Jesper Dangaard Brouer 
Acked-by: Mel Gorman 
Cc: Michal Hocko 
Cc: Johannes Weiner 
Cc: Christopher Lameter 
Cc: Dave Hansen 
Cc: Andi Kleen 
Cc: Ying Huang 
Cc: Aaron Lu 
Cc: Tim Chen 
Cc: Dave Hansen 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: drop useless local parameters of __register_one_node()

2017-07-10T23:32:32+00:00

__register_one_node() initializes local parameters "p_node" & "parent"
for register_node().

But, register_node() does not use them.

Remove the related code of "parent" node, cleanup __register_one_node()
and register_node().

Link: http://lkml.kernel.org/r/1498013846-20149-1-git-send-email-douly.fnst@cn.fujitsu.com
Signed-off-by: Dou Liyang 
Acked-by: David Rientjes 
Acked-by: Michal Hocko 
Cc: 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm, memory_hotplug: drop CONFIG_MOVABLE_NODE

2017-07-06T23:24:35+00:00

Commit 20b2f52b73fe ("numa: add CONFIG_MOVABLE_NODE for
movable-dedicated node") has introduced CONFIG_MOVABLE_NODE without a
good explanation on why it is actually useful.

It makes a lot of sense to make movable node semantic opt in but we
already have that because the feature has to be explicitly enabled on
the kernel command line.  A config option on top only makes the
configuration space larger without a good reason.  It also adds an
additional ifdefery that pollutes the code.

Just drop the config option and make it de-facto always enabled.  This
shouldn't introduce any change to the semantic.

Link: http://lkml.kernel.org/r/20170529114141.536-3-mhocko@kernel.org
Signed-off-by: Michal Hocko 
Acked-by: Reza Arbab 
Acked-by: Vlastimil Babka 
Cc: Mel Gorman 
Cc: Andrea Arcangeli 
Cc: Jerome Glisse 
Cc: Yasuaki Ishimatsu 
Cc: Xishi Qiu 
Cc: Kani Toshimitsu 
Cc: Chen Yucong 
Cc: Joonsoo Kim 
Cc: Andi Kleen 
Cc: David Rientjes 
Cc: Daniel Kiper 
Cc: Igor Mammedov 
Cc: Vitaly Kuznetsov 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: vmstat: move slab statistics from zone to node counters

2017-07-06T23:24:35+00:00

Patch series "mm: per-lruvec slab stats"

Josef is working on a new approach to balancing slab caches and the page
cache.  For this to work, he needs slab cache statistics on the lruvec
level.  These patches implement that by adding infrastructure that
allows updating and reading generic VM stat items per lruvec, then
switches some existing VM accounting sites, including the slab
accounting ones, to this new cgroup-aware API.

I'll follow up with more patches on this, because there is actually
substantial simplification that can be done to the memory controller
when we replace private memcg accounting with making the existing VM
accounting sites cgroup-aware.  But this is enough for Josef to base his
slab reclaim work on, so here goes.

This patch (of 5):

To re-implement slab cache vs.  page cache balancing, we'll need the
slab counters at the lruvec level, which, ever since lru reclaim was
moved from the zone to the node, is the intersection of the node, not
the zone, and the memcg.

We could retain the per-zone counters for when the page allocator dumps
its memory information on failures, and have counters on both levels -
which on all but NUMA node 0 is usually redundant.  But let's keep it
simple for now and just move them.  If anybody complains we can restore
the per-zone counters.

[hannes@cmpxchg.org: fix oops]
  Link: http://lkml.kernel.org/r/20170605183511.GA8915@cmpxchg.org
Link: http://lkml.kernel.org/r/20170530181724.27197-3-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner 
Cc: Josef Bacik 
Cc: Michal Hocko 
Cc: Vladimir Davydov 
Cc: Rik van Riel 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm, memory_hotplug: split up register_one_node()

2017-07-06T23:24:32+00:00

Memory hotplug (add_memory_resource) has to reinitialize node
infrastructure if the node is offline (one which went through the
complete add_memory(); remove_memory() cycle).  That involves node
registration to the kobj infrastructure (register_node), the proper
association with cpus (register_cpu_under_node) and finally creation of
node<->memblock symlinks (link_mem_sections).

The last part requires to know node_start_pfn and node_spanned_pages
which we currently have but a leter patch will postpone this
initialization to the onlining phase which happens later.  In fact we do
not need to rely on the early pgdat initialization even now because the
currently hot added pfn range is currently known.

Split register_one_node into core which does all the common work for the
boot time NUMA initialization and the hotplug (__register_one_node).
register_one_node keeps the full initialization while hotplug calls
__register_one_node and manually calls link_mem_sections for the proper
range.

This shouldn't introduce any functional change.

Link: http://lkml.kernel.org/r/20170515085827.16474-6-mhocko@kernel.org
Signed-off-by: Michal Hocko 
Acked-by: Vlastimil Babka 
Cc: Andi Kleen 
Cc: Andrea Arcangeli 
Cc: Balbir Singh 
Cc: Dan Williams 
Cc: Daniel Kiper 
Cc: David Rientjes 
Cc: Heiko Carstens 
Cc: Igor Mammedov 
Cc: Jerome Glisse 
Cc: Joonsoo Kim 
Cc: Martin Schwidefsky 
Cc: Mel Gorman 
Cc: Reza Arbab 
Cc: Tobias Regnery 
Cc: Toshi Kani 
Cc: Vitaly Kuznetsov 
Cc: Xishi Qiu 
Cc: Yasuaki Ishimatsu 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: drop page_initialized check from get_nid_for_pfn

2017-07-06T23:24:32+00:00

Commit c04fc586c1a4 ("mm: show node to memory section relationship with
symlinks in sysfs") has added means to export memblock<->node
association into the sysfs.  It has also introduced get_nid_for_pfn
which is a rather confusing counterpart of pfn_to_nid which checks also
whether the pfn page is already initialized (page_initialized).

This is done by checking page::lru != NULL which doesn't make any sense
at all.  Nothing in this path really relies on the lru list being used
or initialized.  Just remove it because this will become a problem with
later patches.

Thanks to Reza Arbab for testing which revealed this to be a problem
(http://lkml.kernel.org/r/20170403202337.GA12482@dhcp22.suse.cz)

Link: http://lkml.kernel.org/r/20170515085827.16474-4-mhocko@kernel.org
Signed-off-by: Michal Hocko 
Acked-by: Vlastimil Babka 
Cc: Reza Arbab 
Cc: Andi Kleen 
Cc: Andrea Arcangeli 
Cc: Balbir Singh 
Cc: Dan Williams 
Cc: Daniel Kiper 
Cc: David Rientjes 
Cc: Heiko Carstens 
Cc: Igor Mammedov 
Cc: Jerome Glisse 
Cc: Joonsoo Kim 
Cc: Martin Schwidefsky 
Cc: Mel Gorman 
Cc: Tobias Regnery 
Cc: Toshi Kani 
Cc: Vitaly Kuznetsov 
Cc: Xishi Qiu 
Cc: Yasuaki Ishimatsu 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: Adjust system_state check

2017-05-23T08:01:36+00:00

To enable smp_processor_id() and might_sleep() debug checks earlier, it's
required to add system states between SYSTEM_BOOTING and SYSTEM_RUNNING.

get_nid_for_pfn() checks for system_state == BOOTING to decide whether to
use early_pfn_to_nid() when CONFIG_DEFERRED_STRUCT_PAGE_INIT=y.

That check is dubious, because the switch to state RUNNING happes way after
page_alloc_init_late() has been invoked.

Change the check to less than RUNNING state so it covers the new
intermediate states as well.

Signed-off-by: Thomas Gleixner 
Signed-off-by: Peter Zijlstra (Intel) 
Acked-by: Greg Kroah-Hartman 
Cc: Linus Torvalds 
Cc: Mark Rutland 
Cc: Mel Gorman 
Cc: Peter Zijlstra 
Cc: Steven Rostedt 
Link: http://lkml.kernel.org/r/20170516184735.528279534@linutronix.de
Signed-off-by: Ingo Molnar

treewide: replace obsolete _refok by __ref

2016-08-02T21:31:41+00:00

There was only one use of __initdata_refok and __exit_refok

__init_refok was used 46 times against 82 for __ref.

Those definitions are obsolete since commit 312b1485fb50 ("Introduce new
section reference annotations tags: __ref, __refdata, __refconst")

This patch removes the following compatibility definitions and replaces
them treewide.

/* compatibility defines */
#define __init_refok     __ref
#define __initdata_refok __refdata
#define __exit_refok     __ref

I can also provide separate patches if necessary.
(One patch per tree and check in 1 month or 2 to remove old definitions)

[akpm@linux-foundation.org: coding-style fixes]
Link: http://lkml.kernel.org/r/1466796271-3043-1-git-send-email-fabf@skynet.be
Signed-off-by: Fabian Frederick 
Cc: Ingo Molnar 
Cc: Sam Ravnborg 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: track NR_KERNEL_STACK in KiB instead of number of stacks

2016-07-28T23:07:41+00:00

Currently, NR_KERNEL_STACK tracks the number of kernel stacks in a zone.
This only makes sense if each kernel stack exists entirely in one zone,
and allowing vmapped stacks could break this assumption.

Since frv has THREAD_SIZE < PAGE_SIZE, we need to track kernel stack
allocations in a unit that divides both THREAD_SIZE and PAGE_SIZE on all
architectures.  Keep it simple and use KiB.

Link: http://lkml.kernel.org/r/083c71e642c5fa5f1b6898902e1b2db7b48940d4.1468523549.git.luto@kernel.org
Signed-off-by: Andy Lutomirski 
Cc: Vladimir Davydov 
Acked-by: Johannes Weiner 
Cc: Michal Hocko 
Reviewed-by: Josh Poimboeuf 
Reviewed-by: Vladimir Davydov 
Acked-by: Michal Hocko 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds