linux-toradex.git/mm/vmstat.c, branch v3.14.34

mm: vmscan: only update per-cpu thresholds for online CPU

2015-01-30T01:40:53+00:00

commit bb0b6dffa2ccfbd9747ad0cc87c7459622896e60 upstream.

When kswapd is awake reclaiming, the per-cpu stat thresholds are lowered
to get more accurate counts to avoid breaching watermarks.  This
threshold update iterates over all possible CPUs which is unnecessary.
Only online CPUs need to be updated.  If a new CPU is onlined,
refresh_zone_stat_thresholds() will set the thresholds correctly.
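
The difference between walking all possible CPUs and only the online
ones can be sketched in userspace as below.  This is an illustration
under stated assumptions, not the kernel code: NR_CPUS, pcp[],
cpu_online_map[] and set_thresholds_online() are hypothetical stand-ins
for the per-cpu pagesets and the online CPU mask.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 8                      /* hypothetical count of possible CPUs */

struct pcp_stat {
    int stat_threshold;                /* per-cpu drift allowed before folding */
};

static struct pcp_stat pcp[NR_CPUS];   /* one slot per possible CPU */
static bool cpu_online_map[NR_CPUS];   /* stand-in for the online CPU mask */

/* Set the threshold on online CPUs only; return how many were updated. */
static int set_thresholds_online(int threshold)
{
    int cpu, updated = 0;

    for (cpu = 0; cpu < NR_CPUS; cpu++) {
        if (!cpu_online_map[cpu])
            continue;                  /* offline: set up when it comes online */
        pcp[cpu].stat_threshold = threshold;
        updated++;
    }
    return updated;
}
```

With two of the eight slots marked online, only those two are written;
the remaining slots are left for the online-time refresh.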

Signed-off-by: Mel Gorman 
Acked-by: Johannes Weiner 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Mel Gorman 
Signed-off-by: Greg Kroah-Hartman

mm: move zone->pages_scanned into a vmstat counter

2015-01-30T01:40:53+00:00

commit 0d5d823ab4e608ec7b52ac4410de4cb74bbe0edd upstream.

zone->pages_scanned is a write-intensive cache line during page reclaim
and it's also updated during page free.  Move the counter into vmstat to
take advantage of the per-cpu updates and do not update it in the free
paths unless necessary.
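
A minimal single-threaded sketch of the per-cpu update idea follows.  It
is not the vmstat implementation: NR_CPUS, THRESHOLD and the function
names are hypothetical, and the atomic operations and preemption
handling a real kernel needs are left out.  Each CPU accumulates a small
private delta and touches the shared counter only when the delta
exceeds the threshold.

```c
#include <assert.h>

#define NR_CPUS   4                    /* hypothetical */
#define THRESHOLD 32                   /* hypothetical per-cpu drift limit */

static long global_pages_scanned;      /* shared counter, costly to write */
static int  cpu_diff[NR_CPUS];         /* private deltas, cheap to write */

/* Apply delta on the given CPU, folding into the global counter rarely. */
static void mod_pages_scanned(int cpu, int delta)
{
    int x = cpu_diff[cpu] + delta;

    if (x > THRESHOLD || x < -THRESHOLD) {
        global_pages_scanned += x;     /* one write to the shared line */
        x = 0;
    }
    cpu_diff[cpu] = x;
}

/* Precise value: global counter plus every unfolded per-cpu delta. */
static long pages_scanned_snapshot(void)
{
    long sum = global_pages_scanned;
    int cpu;

    for (cpu = 0; cpu < NR_CPUS; cpu++)
        sum += cpu_diff[cpu];
    return sum;
}
```

The trade-off is that the global counter alone may lag by up to
THRESHOLD per CPU; a reader that needs the precise value has to sum the
per-cpu deltas as pages_scanned_snapshot() does.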

On a small UMA machine running tiobench the difference is marginal.  On
a 4-node machine the overhead is more noticeable.  Note that automatic
NUMA balancing was disabled for this test as otherwise the system CPU
overhead is unpredictable.

            3.16.0-rc3    3.16.0-rc3    3.16.0-rc3
               vanilla  rearrange-v5     vmstat-v5
User            746.94        759.78        774.56
System        65336.22      58350.98      32847.27
Elapsed       27553.52      27282.02      27415.04

Note that the overhead reduction will vary depending on where exactly
pages are allocated and freed.

Signed-off-by: Mel Gorman 
Acked-by: Johannes Weiner 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Mel Gorman 
Signed-off-by: Greg Kroah-Hartman

mm: rearrange zone fields into read-only, page alloc, statistics and page reclaim lines

2015-01-30T01:40:52+00:00

commit 3484b2de9499df23c4604a513b36f96326ae81ad upstream.

The arrangement of struct zone has changed over time and now it has
reached the point where there is some inappropriate sharing going on.
On x86-64 for example

o The zone->node field is shared with the zone lock and zone->node is
  accessed frequently from the page allocator due to the fair zone
  allocation policy.

o span_seqlock is almost never used but shares a line with free_area

o Some zone statistics share a cache line with the LRU lock so
  reclaim-intensive and allocator-intensive workloads can bounce the cache
  line on a stat update

This patch rearranges struct zone to put read-only and read-mostly
fields together and then splits the page allocator intensive fields, the
zone statistics and the page reclaim intensive fields into their own
cache lines.  Note that the type of lowmem_reserve changes because the
watermark calculations are signed, which avoids a signed/unsigned
conversion there.
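
The grouping can be illustrated with a small C11 sketch.  It is not the
real struct zone: the field names, the member types and the 64-byte
CACHELINE value are assumptions chosen for the example.  Each group of
write-intensive fields starts on its own cache line so that updates to
one group do not bounce the line holding another.

```c
#include <assert.h>
#include <stddef.h>

#define CACHELINE 64                   /* assumed cache line size in bytes */

struct zone_sketch {
    /* Read-only and read-mostly fields */
    unsigned long watermark[3];
    long lowmem_reserve[3];            /* signed, like the watermark math */
    int node;

    /* Page allocator write-intensive fields */
    _Alignas(CACHELINE) int alloc_lock;
    unsigned long free_pages;

    /* Zone statistics */
    _Alignas(CACHELINE) long vm_stat[8];

    /* Page reclaim write-intensive fields */
    _Alignas(CACHELINE) int lru_lock;
    unsigned long lru_pages;
};

/* Index of the cache line that a given member offset falls into. */
static size_t cacheline_of(size_t offset)
{
    return offset / CACHELINE;
}
```

Comparing cacheline_of(offsetof(struct zone_sketch, node)) with the same
expression for alloc_lock shows that the read-mostly node field no
longer shares a line with the allocator lock in this layout.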

On the test configuration I used, the overall size of struct zone shrank
by one cache line.  On smaller machines, this is not likely to be
noticeable.  However, on a 4-node NUMA machine running tiobench the
system CPU overhead is reduced by this patch.

              3.16.0-rc3      3.16.0-rc3
                 vanilla  rearrange-v5r9
User              746.94          759.78
System          65336.22        58350.98
Elapsed         27553.52        27282.02

Signed-off-by: Mel Gorman 
Acked-by: Johannes Weiner 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Mel Gorman 
Signed-off-by: Greg Kroah-Hartman

mm, x86: Account for TLB flushes only when debugging

2014-01-25T08:10:41+00:00

Bisection between 3.11 and 3.12 fingered commit 9824cf97 ("mm:
vmstats: tlb flush counters") to cause overhead problems.

The counters are undeniably useful but how often do we really
need to debug TLB flush related issues?  It does not justify
taking the penalty everywhere so make it a debugging option.
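
One way to make accounting cost nothing in normal builds is a macro
that expands to an empty statement unless a debug option is defined.
The sketch below shows only that pattern; DEBUG_TLBFLUSH_SKETCH,
count_tlb_event() and flush_tlb_sketch() are hypothetical names made up
for the example, not the identifiers used by the kernel.

```c
#include <assert.h>

/* Build with -DDEBUG_TLBFLUSH_SKETCH to enable the accounting. */
static unsigned long tlb_flush_events;

#ifdef DEBUG_TLBFLUSH_SKETCH
#define count_tlb_event() (tlb_flush_events++)
#else
#define count_tlb_event() do { } while (0)   /* no code in normal builds */
#endif

/* Hypothetical flush path: the counter update vanishes unless debugging. */
static void flush_tlb_sketch(void)
{
    count_tlb_event();
    /* ... the actual flush work would go here ... */
}

/* Read the number of flushes recorded so far (0 when not debugging). */
static unsigned long tlb_flush_count(void)
{
    return tlb_flush_events;
}
```

In a default build flush_tlb_sketch() carries no accounting overhead and
tlb_flush_count() stays at zero.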

Signed-off-by: Mel Gorman 
Tested-by: Davidlohr Bueso 
Reviewed-by: Rik van Riel 
Signed-off-by: Andrew Morton 
Cc: Hugh Dickins 
Cc: Alex Shi 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Link: http://lkml.kernel.org/n/tip-XzxjntugxuwpxXhcrxqqh53b@git.kernel.org
Signed-off-by: Ingo Molnar

mm: numa: return the number of base pages altered by protection changes

2013-11-13T03:09:11+00:00

Commit 0255d4918480 ("mm: Account for a THP NUMA hinting update as one
PTE update") was added to account for the number of PTE updates when
marking pages prot_numa.  task_numa_work was using the old return value
to track how much address space had been updated.  Altering the return
value causes the scanner to do more work than it is configured or
documented to in a single unit of work.

This patch reverts that commit and accounts for the number of THP
updates separately in vmstat.  It is up to the administrator to
interpret the pair of values correctly.  This is a straightforward
operation and likely to only be of interest when actively debugging NUMA
balancing problems.

The impact of this patch is that the NUMA PTE scanner will scan slower
when THP is enabled and workloads may converge slower as a result.  On
the flip side, system CPU usage should be lower than recent tests
reported.  This is an illustrative example of a short single JVM specjbb
test:

specjbb
                       3.12.0                3.12.0
                      vanilla           acctupdates
TPut 1      26143.00 (  0.00%)     25747.00 ( -1.51%)
TPut 7     185257.00 (  0.00%)    183202.00 ( -1.11%)
TPut 13    329760.00 (  0.00%)    346577.00 (  5.10%)
TPut 19    442502.00 (  0.00%)    460146.00 (  3.99%)
TPut 25    540634.00 (  0.00%)    549053.00 (  1.56%)
TPut 31    512098.00 (  0.00%)    519611.00 (  1.47%)
TPut 37    461276.00 (  0.00%)    474973.00 (  2.97%)
TPut 43    403089.00 (  0.00%)    414172.00 (  2.75%)

              3.12.0       3.12.0
             vanilla  acctupdates
User         5169.64      5184.14
System        100.45        80.02
Elapsed       252.75       251.85

Performance is similar but note the reduction in system CPU time.  While
this showed a performance gain, it will not be universal but at least
it'll be behaving as documented.  The vmstats are obviously different but
here is an obvious interpretation of them from mmtests.

                                3.12.0       3.12.0
                               vanilla  acctupdates
NUMA page range updates        1408326     11043064
NUMA huge PMD updates                0        21040
NUMA PTE updates               1408326       291624

"NUMA page range updates" == nr_pte_updates and is the value returned to
the NUMA pte scanner.  NUMA huge PMD updates were the number of THP
updates which in combination can be used to calculate how many ptes were
updated from userspace.
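
As a worked example of combining the two values, the sketch below
reproduces the "NUMA PTE updates" row from the other two rows.  The
relation is inferred from the figures in the table rather than stated
above, and it assumes 512 base pages per huge page (2 MiB THP over
4 KiB pages, as on x86-64); pte_updates_from_vmstat() is a hypothetical
helper name.  It is consistent with each huge PMD update being counted
as a single update.

```c
#include <assert.h>

#define HPAGE_PMD_NR 512UL   /* assumption: base pages per huge page */

/*
 * Derive the PTE-update figure from the page range updates and the
 * huge PMD updates: each huge PMD update covers HPAGE_PMD_NR base
 * pages in the range count but is counted only once here.
 */
static unsigned long pte_updates_from_vmstat(unsigned long page_range_updates,
                                             unsigned long huge_pmd_updates)
{
    return page_range_updates - huge_pmd_updates * (HPAGE_PMD_NR - 1);
}
```

For the acctupdates column, 11043064 - 21040 * 511 = 291624; for the
vanilla column, with no huge PMD updates, both rows are 1408326.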

Signed-off-by: Mel Gorman 
Reported-by: Alex Thorlton 
Reviewed-by: Rik van Riel 
Cc: 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: clear N_CPU from node_states at CPU offline

2013-11-13T03:09:09+00:00

vmstat_cpuup_callback() is a CPU notifier callback, which sets N_CPU for a
node on a CPU online event.  However, it does not update this N_CPU info
on a CPU offline event.

Changed vmstat_cpuup_callback() to clear N_CPU when the last CPU in the
node is taken offline, i.e. the node no longer has any online CPU.
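
The offline check can be sketched as below.  This is a userspace
illustration, not the notifier code: the four-CPU, two-node topology and
the names cpu_to_node_map[], node_has_cpu[], cpu_up() and cpu_down() are
hypothetical stand-ins for the kernel's topology and the N_CPU node
state.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS  4
#define NR_NODES 2

/* Hypothetical topology: CPUs 0-1 on node 0, CPUs 2-3 on node 1. */
static const int cpu_to_node_map[NR_CPUS] = { 0, 0, 1, 1 };
static bool cpu_online_map[NR_CPUS];
static bool node_has_cpu[NR_NODES];    /* stand-in for the N_CPU state */

/* Online event: the CPU's node now has at least one CPU. */
static void cpu_up(int cpu)
{
    cpu_online_map[cpu] = true;
    node_has_cpu[cpu_to_node_map[cpu]] = true;
}

/* Offline event: clear the node state only if no CPU on it remains. */
static void cpu_down(int cpu)
{
    int node = cpu_to_node_map[cpu];
    int i;

    cpu_online_map[cpu] = false;
    for (i = 0; i < NR_CPUS; i++)
        if (cpu_online_map[i] && cpu_to_node_map[i] == node)
            return;                    /* another CPU on this node is online */
    node_has_cpu[node] = false;
}
```

Taking CPU 0 offline leaves node 0 marked because CPU 1 is still
online; taking CPU 1 offline afterwards clears it.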

Signed-off-by: Toshi Kani 
Acked-by: Christoph Lameter 
Reviewed-by: Yasuaki Ishimatsu 
Tested-by: Yasuaki Ishimatsu 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: set N_CPU to node_states during boot

2013-11-13T03:09:09+00:00

After a system has booted, N_CPU is not set for any node, as has_cpu
shows an empty line.

  # cat /sys/devices/system/node/has_cpu
  (show-empty-line)

setup_vmstat() registers its CPU notifier callback,
vmstat_cpuup_callback(), which sets N_CPU for a node when a CPU is
brought online.  However, setup_vmstat() is called after all CPUs are
launched in the boot sequence.

Changed setup_vmstat() to set N_CPU for the nodes with online CPUs at
boot, which is consistent with other operations in
vmstat_cpuup_callback(), i.e. start_cpu_timer() and
refresh_zone_stat_thresholds().

Also added get_online_cpus() to protect the for_each_online_cpu() loop.

Signed-off-by: Toshi Kani 
Acked-by: Christoph Lameter 
Reviewed-by: Yasuaki Ishimatsu 
Tested-by: Yasuaki Ishimatsu 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: vmscan: fix do_try_to_free_pages() livelock

2013-09-11T22:58:01+00:00

This patch is based on KOSAKI's work, with a little more description
added; please refer to https://lkml.org/lkml/2012/6/14/74.

I found that the system can enter a state where a zone has lots of free
pages but only of order 0 and order 1, which means the zone is heavily
fragmented.  A high-order allocation can then stall in the direct reclaim
path for a long time (e.g. 60 seconds), especially in an environment with
no swap and no compaction.  This problem happened on v3.4, but the issue
still seems to live in the current tree.  The reason is that
do_try_to_free_pages() enters a livelock:

kswapd will go to sleep if the zones have been fully scanned and are still
not balanced, as kswapd thinks there is little point in trying all over
again and wants to avoid an infinite loop.  Instead it changes the order
from high-order to order-0, because kswapd thinks order-0 is the most
important.  See 73ce02e9 for details.  If watermarks are ok, kswapd will
go back to sleep and may leave zone->all_unreclaimable = 0.  It assumes
high-order users can still perform direct reclaim if they wish.

Direct reclaim continues to reclaim for a high order which is not a
COSTLY_ORDER, without invoking the oom-killer, until kswapd turns on
zone->all_unreclaimable.  This is to avoid a too-early oom-kill.  So it
means direct reclaim depends on kswapd to break this loop.

In the worst case, direct reclaim may continue to reclaim pages forever
while kswapd sleeps forever, until something like a watchdog detects this
and finally kills the process.  As described in:
http://thread.gmane.org/gmane.linux.kernel.mm/103737

We can't turn on zone->all_unreclaimable from the direct reclaim path
because the direct reclaim path doesn't take any lock, so doing it this
way is racy.  Thus this patch removes the zone->all_unreclaimable field
completely and recalculates the zone reclaimable state every time.
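
Recalculating the state on demand, instead of caching it in a flag that
another path must keep in sync, can be sketched as below.  The struct
and SCAN_FACTOR are assumptions for illustration: the text above does
not give the heuristic, and the factor of 6 is recalled from the kernel
source rather than taken from this log.

```c
#include <assert.h>
#include <stdbool.h>

#define SCAN_FACTOR 6   /* assumption: not stated in the text above */

struct zone_sketch {
    unsigned long pages_scanned;       /* scanned since pages were last freed */
    unsigned long reclaimable_pages;   /* pages that reclaim could free */
};

/*
 * Derive reclaimability from the counters each time it is needed.
 * There is no stored flag, so neither kswapd nor direct reclaim has
 * to set or clear anything for the other.
 */
static bool zone_reclaimable(const struct zone_sketch *z)
{
    return z->pages_scanned < z->reclaimable_pages * SCAN_FACTOR;
}
```

With 100 reclaimable pages the zone counts as reclaimable until 600
pages have been scanned without progress.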

Note: we can't adopt the idea that direct reclaim looks at
zone->pages_scanned directly while kswapd continues to use
zone->all_unreclaimable, because it is racy.  Commit 929bea7c71 (vmscan:
all_unreclaimable() use zone->all_unreclaimable as a name) describes the
details.

[akpm@linux-foundation.org: uninline zone_reclaimable_pages() and zone_reclaimable()]
Cc: Aaditya Kumar 
Cc: Ying Han 
Cc: Nick Piggin 
Acked-by: Rik van Riel 
Cc: Mel Gorman 
Cc: KAMEZAWA Hiroyuki 
Cc: Christoph Lameter 
Cc: Bob Liu 
Cc: Neil Zhang 
Cc: Russell King - ARM Linux 
Reviewed-by: Michal Hocko 
Acked-by: Minchan Kim 
Acked-by: Johannes Weiner 
Signed-off-by: KOSAKI Motohiro 
Signed-off-by: Lisa Du 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

vmstat: use this_cpu() to avoid irqon/off sequence in refresh_cpu_vm_stats

2013-09-11T22:57:31+00:00

Disabling interrupts repeatedly can be avoided in the inner loop if we use
a this_cpu operation.

Signed-off-by: Christoph Lameter 
Cc: KOSAKI Motohiro 
CC: Tejun Heo 
Cc: Joonsoo Kim 
Cc: Alexey Dobriyan 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

vmstat: create fold_diff

2013-09-11T22:57:31+00:00

Both functions that update global counters use the same mechanism.

Create a function that contains the common code.
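
A userspace sketch of such a shared helper follows.  It is an
illustration only: NR_ITEMS and refresh_cpu_stats() are hypothetical,
and the C11 atomic_exchange() merely stands in for a read-and-clear of a
per-cpu slot; it is not equivalent to the kernel's this_cpu operations.

```c
#include <assert.h>
#include <stdatomic.h>

#define NR_ITEMS 3                     /* hypothetical number of counters */

static atomic_long vm_stat[NR_ITEMS];  /* global counters */

/* Common code: add every non-zero collected delta to its global counter. */
static void fold_diff(const long *diff)
{
    int i;

    for (i = 0; i < NR_ITEMS; i++)
        if (diff[i])
            atomic_fetch_add(&vm_stat[i], diff[i]);
}

/* One caller: drain a CPU's pending deltas, then fold them. */
static void refresh_cpu_stats(atomic_int *cpu_diff)
{
    long collected[NR_ITEMS] = { 0 };
    int i;

    for (i = 0; i < NR_ITEMS; i++)
        collected[i] = atomic_exchange(&cpu_diff[i], 0);  /* read and clear */
    fold_diff(collected);
}
```

Any other path that gathers per-cpu deltas into a local array can hand
that array to fold_diff() as well, which is the point of factoring the
loop out.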

Signed-off-by: Christoph Lameter 
Cc: KOSAKI Motohiro 
CC: Tejun Heo 
Cc: Joonsoo Kim 
Cc: Alexey Dobriyan 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds