From 44a70adec910d6929689e42b6e5cee5b7d202d20 Mon Sep 17 00:00:00 2001 From: Michal Hocko Date: Thu, 28 Jul 2016 15:44:43 -0700 Subject: mm, oom_adj: make sure processes sharing mm have same view of oom_score_adj oom_score_adj is shared for the thread groups (via struct signal) but this is not sufficient to cover processes sharing mm (CLONE_VM without CLONE_SIGHAND) and so we can easily end up in a situation when some processes update their oom_score_adj and confuse the oom killer. In the worst case some of those processes might hide from the oom killer altogether via OOM_SCORE_ADJ_MIN while others are eligible. OOM killer would then pick up those eligible but won't be allowed to kill others sharing the same mm so the mm wouldn't release the mm and so the memory. It would be ideal to have the oom_score_adj per mm_struct because that is the natural entity OOM killer considers. But this will not work because some programs are doing vfork() set_oom_adj() exec() We can achieve the same though. oom_score_adj write handler can set the oom_score_adj for all processes sharing the same mm if the task is not in the middle of vfork. As a result all the processes will share the same oom_score_adj. The current implementation is rather pessimistic and checks all the existing processes by default if there is more than 1 holder of the mm but we do not have any reliable way to check for external users yet. Link: http://lkml.kernel.org/r/1466426628-15074-5-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko Acked-by: Oleg Nesterov Cc: Vladimir Davydov Cc: David Rientjes Cc: Tetsuo Handa Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/mm.h | 2 ++ 1 file changed, 2 insertions(+) (limited to 'include/linux') diff --git a/include/linux/mm.h b/include/linux/mm.h index 192c1bbe5fcd..c606fe4f9a7f 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -2284,6 +2284,8 @@ static inline int in_gate_area(struct mm_struct *mm, unsigned long addr) } #endif /* __HAVE_ARCH_GATE_AREA */ +extern bool process_shares_mm(struct task_struct *p, struct mm_struct *mm); + #ifdef CONFIG_SYSCTL extern int sysctl_drop_caches; int drop_caches_sysctl_handler(struct ctl_table *, int, -- cgit v1.2.3 From b18dc5f291c07ddaf31562b9f27b3a122f1f9b7e Mon Sep 17 00:00:00 2001 From: Michal Hocko Date: Thu, 28 Jul 2016 15:44:46 -0700 Subject: mm, oom: skip vforked tasks from being selected vforked tasks are not really sitting on any memory. They are sharing the mm with parent until they exec into a new code. Until then it is just pinning the address space. OOM killer will kill the vforked task along with its parent but we still can end up selecting vforked task when the parent wouldn't be selected. E.g. init doing vfork to launch a task or vforked being a child of oom unkillable task with an updated oom_score_adj to be killable. Add a new helper to check whether a task is in the vfork sharing memory with its parent and use it in oom_badness to skip over these tasks. Link: http://lkml.kernel.org/r/1466426628-15074-6-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko Acked-by: Oleg Nesterov Cc: Vladimir Davydov Cc: David Rientjes Cc: Tetsuo Handa Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/sched.h | 26 ++++++++++++++++++++++++++ 1 file changed, 26 insertions(+) (limited to 'include/linux') diff --git a/include/linux/sched.h b/include/linux/sched.h index d99218a1e043..c0efd80ba40f 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1949,6 +1949,32 @@ static inline int tsk_nr_cpus_allowed(struct task_struct *p) #define TNF_FAULT_LOCAL 0x08 #define TNF_MIGRATE_FAIL 0x10 +static inline bool in_vfork(struct task_struct *tsk) +{ + bool ret; + + /* + * need RCU to access ->real_parent if CLONE_VM was used along with + * CLONE_PARENT. + * + * We check real_parent->mm == tsk->mm because CLONE_VFORK does not + * imply CLONE_VM + * + * CLONE_VFORK can be used with CLONE_PARENT/CLONE_THREAD and thus + * ->real_parent is not necessarily the task doing vfork(), so in + * theory we can't rely on task_lock() if we want to dereference it. + * + * And in this case we can't trust the real_parent->mm == tsk->mm + * check, it can be false negative. But we do not care, if init or + * another oom-unkillable task does this it should blame itself. + */ + rcu_read_lock(); + ret = tsk->vfork_done && tsk->real_parent->mm == tsk->mm; + rcu_read_unlock(); + + return ret; +} + #ifdef CONFIG_NUMA_BALANCING extern void task_numa_fault(int last_node, int node, int pages, int flags); extern pid_t task_numa_group_id(struct task_struct *p); -- cgit v1.2.3 From 1af8bb43269563e458ebcf0ece812e9a970864b3 Mon Sep 17 00:00:00 2001 From: Michal Hocko Date: Thu, 28 Jul 2016 15:44:52 -0700 Subject: mm, oom: fortify task_will_free_mem() task_will_free_mem is rather weak. It doesn't really tell whether the task has chance to drop its mm. 98748bd72200 ("oom: consider multi-threaded tasks in task_will_free_mem") made a first step into making it more robust for multi-threaded applications so now we know that the whole process is going down and probably drop the mm. This patch builds on top for more complex scenarios where mm is shared between different processes - CLONE_VM without CLONE_SIGHAND, or in kernel use_mm(). Make sure that all processes sharing the mm are killed or exiting. This will allow us to replace try_oom_reaper by wake_oom_reaper because task_will_free_mem implies the task is reapable now. Therefore all paths which bypass the oom killer are now reapable and so they shouldn't lock up the oom killer. Link: http://lkml.kernel.org/r/1466426628-15074-8-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko Acked-by: Oleg Nesterov Cc: Vladimir Davydov Cc: David Rientjes Cc: Tetsuo Handa Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/oom.h | 26 +++----------------------- 1 file changed, 3 insertions(+), 23 deletions(-) (limited to 'include/linux') diff --git a/include/linux/oom.h b/include/linux/oom.h index 606137b3b778..5bc0457ee3a8 100644 --- a/include/linux/oom.h +++ b/include/linux/oom.h @@ -73,9 +73,9 @@ static inline bool oom_task_origin(const struct task_struct *p) extern void mark_oom_victim(struct task_struct *tsk); #ifdef CONFIG_MMU -extern void try_oom_reaper(struct task_struct *tsk); +extern void wake_oom_reaper(struct task_struct *tsk); #else -static inline void try_oom_reaper(struct task_struct *tsk) +static inline void wake_oom_reaper(struct task_struct *tsk) { } #endif @@ -107,27 +107,7 @@ extern void oom_killer_enable(void); extern struct task_struct *find_lock_task_mm(struct task_struct *p); -static inline bool task_will_free_mem(struct task_struct *task) -{ - struct signal_struct *sig = task->signal; - - /* - * A coredumping process may sleep for an extended period in exit_mm(), - * so the oom killer cannot assume that the process will promptly exit - * and release memory. - */ - if (sig->flags & SIGNAL_GROUP_COREDUMP) - return false; - - if (!(task->flags & PF_EXITING)) - return false; - - /* Make sure that the whole thread group is going down */ - if (!thread_group_empty(task) && !(sig->flags & SIGNAL_GROUP_EXIT)) - return false; - - return true; -} +bool task_will_free_mem(struct task_struct *task); /* sysctls */ extern int sysctl_oom_dump_tasks; -- cgit v1.2.3 From 11a410d516e89320fe0817606eeab58f36c22968 Mon Sep 17 00:00:00 2001 From: Michal Hocko Date: Thu, 28 Jul 2016 15:44:58 -0700 Subject: mm, oom_reaper: do not attempt to reap a task more than twice oom_reaper relies on the mmap_sem for read to do its job. Many places which might block readers have been converted to use down_write_killable and that has reduced chances of the contention a lot. Some paths where the mmap_sem is held for write can take other locks and they might either be not prepared to fail due to fatal signal pending or too impractical to be changed. This patch introduces MMF_OOM_NOT_REAPABLE flag which gets set after the first attempt to reap a task's mm fails. If the flag is present after the failure then we set MMF_OOM_REAPED to hide this mm from the oom killer completely so it can go and chose another victim. As a result a risk of OOM deadlock when the oom victim would be blocked indefinetly and so the oom killer cannot make any progress should be mitigated considerably while we still try really hard to perform all reclaim attempts and stay predictable in the behavior. Link: http://lkml.kernel.org/r/1466426628-15074-10-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko Acked-by: Oleg Nesterov Cc: Vladimir Davydov Cc: David Rientjes Cc: Tetsuo Handa Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/sched.h | 1 + 1 file changed, 1 insertion(+) (limited to 'include/linux') diff --git a/include/linux/sched.h b/include/linux/sched.h index c0efd80ba40f..553af2923824 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -523,6 +523,7 @@ static inline int get_dumpable(struct mm_struct *mm) #define MMF_HAS_UPROBES 19 /* has uprobes */ #define MMF_RECALC_UPROBES 20 /* MMF_HAS_UPROBES can be wrong */ #define MMF_OOM_REAPED 21 /* mm has been already reaped */ +#define MMF_OOM_NOT_REAPABLE 22 /* mm couldn't be reaped */ #define MMF_INIT_MASK (MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK) -- cgit v1.2.3 From 55779ec759ccc3c12b917b3712a7716e1140c652 Mon Sep 17 00:00:00 2001 From: Johannes Weiner Date: Thu, 28 Jul 2016 15:45:10 -0700 Subject: mm: fix vm-scalability regression in cgroup-aware workingset code Commit 23047a96d7cf ("mm: workingset: per-cgroup cache thrash detection") added a page->mem_cgroup lookup to the cache eviction, refault, and activation paths, as well as locking to the activation path, and the vm-scalability tests showed a regression of -23%. While the test in question is an artificial worst-case scenario that doesn't occur in real workloads - reading two sparse files in parallel at full CPU speed just to hammer the LRU paths - there is still some optimizations that can be done in those paths. Inline the lookup functions to eliminate calls. Also, page->mem_cgroup doesn't need to be stabilized when counting an activation; we merely need to hold the RCU lock to prevent the memcg from being freed. This cuts down on overhead quite a bit: 23047a96d7cfcfca 063f6715e77a7be5770d6081fe ---------------- -------------------------- %stddev %change %stddev \ | \ 21621405 +- 0% +11.3% 24069657 +- 2% vm-scalability.throughput [linux@roeck-us.net: drop unnecessary include file] [hannes@cmpxchg.org: add WARN_ON_ONCE()s] Link: http://lkml.kernel.org/r/20160707194024.GA26580@cmpxchg.org Link: http://lkml.kernel.org/r/20160624175101.GA3024@cmpxchg.org Reported-by: Ye Xiaolong Signed-off-by: Johannes Weiner Acked-by: Michal Hocko Cc: Vladimir Davydov Signed-off-by: Guenter Roeck Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/memcontrol.h | 43 ++++++++++++++++++++++++++++++++++++++++++- include/linux/mm.h | 10 ++++++++++ 2 files changed, 52 insertions(+), 1 deletion(-) (limited to 'include/linux') diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 71aff733a497..1c4df4420258 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -314,7 +314,48 @@ void mem_cgroup_uncharge_list(struct list_head *page_list); void mem_cgroup_migrate(struct page *oldpage, struct page *newpage); -struct lruvec *mem_cgroup_zone_lruvec(struct zone *, struct mem_cgroup *); +static inline struct mem_cgroup_per_zone * +mem_cgroup_zone_zoneinfo(struct mem_cgroup *memcg, struct zone *zone) +{ + int nid = zone_to_nid(zone); + int zid = zone_idx(zone); + + return &memcg->nodeinfo[nid]->zoneinfo[zid]; +} + +/** + * mem_cgroup_zone_lruvec - get the lru list vector for a zone and memcg + * @zone: zone of the wanted lruvec + * @memcg: memcg of the wanted lruvec + * + * Returns the lru list vector holding pages for the given @zone and + * @mem. This can be the global zone lruvec, if the memory controller + * is disabled. + */ +static inline struct lruvec *mem_cgroup_zone_lruvec(struct zone *zone, + struct mem_cgroup *memcg) +{ + struct mem_cgroup_per_zone *mz; + struct lruvec *lruvec; + + if (mem_cgroup_disabled()) { + lruvec = &zone->lruvec; + goto out; + } + + mz = mem_cgroup_zone_zoneinfo(memcg, zone); + lruvec = &mz->lruvec; +out: + /* + * Since a node can be onlined after the mem_cgroup was created, + * we have to be prepared to initialize lruvec->zone here; + * and if offlined then reonlined, we need to reinitialize it. + */ + if (unlikely(lruvec->zone != zone)) + lruvec->zone = zone; + return lruvec; +} + struct lruvec *mem_cgroup_page_lruvec(struct page *, struct zone *); bool task_in_mem_cgroup(struct task_struct *task, struct mem_cgroup *memcg); diff --git a/include/linux/mm.h b/include/linux/mm.h index c606fe4f9a7f..97065e1f0237 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -973,11 +973,21 @@ static inline struct mem_cgroup *page_memcg(struct page *page) { return page->mem_cgroup; } +static inline struct mem_cgroup *page_memcg_rcu(struct page *page) +{ + WARN_ON_ONCE(!rcu_read_lock_held()); + return READ_ONCE(page->mem_cgroup); +} #else static inline struct mem_cgroup *page_memcg(struct page *page) { return NULL; } +static inline struct mem_cgroup *page_memcg_rcu(struct page *page) +{ + WARN_ON_ONCE(!rcu_read_lock_held()); + return NULL; +} #endif /* -- cgit v1.2.3 From 75ef7184053989118d3814c558a9af62e7376a58 Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Thu, 28 Jul 2016 15:45:24 -0700 Subject: mm, vmstat: add infrastructure for per-node vmstats Patchset: "Move LRU page reclaim from zones to nodes v9" This series moves LRUs from the zones to the node. While this is a current rebase, the test results were based on mmotm as of June 23rd. Conceptually, this series is simple but there are a lot of details. Some of the broad motivations for this are; 1. The residency of a page partially depends on what zone the page was allocated from. This is partially combatted by the fair zone allocation policy but that is a partial solution that introduces overhead in the page allocator paths. 2. Currently, reclaim on node 0 behaves slightly different to node 1. For example, direct reclaim scans in zonelist order and reclaims even if the zone is over the high watermark regardless of the age of pages in that LRU. Kswapd on the other hand starts reclaim on the highest unbalanced zone. A difference in distribution of file/anon pages due to when they were allocated results can result in a difference in again. While the fair zone allocation policy mitigates some of the problems here, the page reclaim results on a multi-zone node will always be different to a single-zone node. it was scheduled on as a result. 3. kswapd and the page allocator scan zones in the opposite order to avoid interfering with each other but it's sensitive to timing. This mitigates the page allocator using pages that were allocated very recently in the ideal case but it's sensitive to timing. When kswapd is allocating from lower zones then it's great but during the rebalancing of the highest zone, the page allocator and kswapd interfere with each other. It's worse if the highest zone is small and difficult to balance. 4. slab shrinkers are node-based which makes it harder to identify the exact relationship between slab reclaim and LRU reclaim. The reason we have zone-based reclaim is that we used to have large highmem zones in common configurations and it was necessary to quickly find ZONE_NORMAL pages for reclaim. Today, this is much less of a concern as machines with lots of memory will (or should) use 64-bit kernels. Combinations of 32-bit hardware and 64-bit hardware are rare. Machines that do use highmem should have relatively low highmem:lowmem ratios than we worried about in the past. Conceptually, moving to node LRUs should be easier to understand. The page allocator plays fewer tricks to game reclaim and reclaim behaves similarly on all nodes. The series has been tested on a 16 core UMA machine and a 2-socket 48 core NUMA machine. The UMA results are presented in most cases as the NUMA machine behaved similarly. pagealloc --------- This is a microbenchmark that shows the benefit of removing the fair zone allocation policy. It was tested uip to order-4 but only orders 0 and 1 are shown as the other orders were comparable. 4.7.0-rc4 4.7.0-rc4 mmotm-20160623 nodelru-v9 Min total-odr0-1 490.00 ( 0.00%) 457.00 ( 6.73%) Min total-odr0-2 347.00 ( 0.00%) 329.00 ( 5.19%) Min total-odr0-4 288.00 ( 0.00%) 273.00 ( 5.21%) Min total-odr0-8 251.00 ( 0.00%) 239.00 ( 4.78%) Min total-odr0-16 234.00 ( 0.00%) 222.00 ( 5.13%) Min total-odr0-32 223.00 ( 0.00%) 211.00 ( 5.38%) Min total-odr0-64 217.00 ( 0.00%) 208.00 ( 4.15%) Min total-odr0-128 214.00 ( 0.00%) 204.00 ( 4.67%) Min total-odr0-256 250.00 ( 0.00%) 230.00 ( 8.00%) Min total-odr0-512 271.00 ( 0.00%) 269.00 ( 0.74%) Min total-odr0-1024 291.00 ( 0.00%) 282.00 ( 3.09%) Min total-odr0-2048 303.00 ( 0.00%) 296.00 ( 2.31%) Min total-odr0-4096 311.00 ( 0.00%) 309.00 ( 0.64%) Min total-odr0-8192 316.00 ( 0.00%) 314.00 ( 0.63%) Min total-odr0-16384 317.00 ( 0.00%) 315.00 ( 0.63%) Min total-odr1-1 742.00 ( 0.00%) 712.00 ( 4.04%) Min total-odr1-2 562.00 ( 0.00%) 530.00 ( 5.69%) Min total-odr1-4 457.00 ( 0.00%) 433.00 ( 5.25%) Min total-odr1-8 411.00 ( 0.00%) 381.00 ( 7.30%) Min total-odr1-16 381.00 ( 0.00%) 356.00 ( 6.56%) Min total-odr1-32 372.00 ( 0.00%) 346.00 ( 6.99%) Min total-odr1-64 372.00 ( 0.00%) 343.00 ( 7.80%) Min total-odr1-128 375.00 ( 0.00%) 351.00 ( 6.40%) Min total-odr1-256 379.00 ( 0.00%) 351.00 ( 7.39%) Min total-odr1-512 385.00 ( 0.00%) 355.00 ( 7.79%) Min total-odr1-1024 386.00 ( 0.00%) 358.00 ( 7.25%) Min total-odr1-2048 390.00 ( 0.00%) 362.00 ( 7.18%) Min total-odr1-4096 390.00 ( 0.00%) 362.00 ( 7.18%) Min total-odr1-8192 388.00 ( 0.00%) 363.00 ( 6.44%) This shows a steady improvement throughout. The primary benefit is from reduced system CPU usage which is obvious from the overall times; 4.7.0-rc4 4.7.0-rc4 mmotm-20160623nodelru-v8 User 189.19 191.80 System 2604.45 2533.56 Elapsed 2855.30 2786.39 The vmstats also showed that the fair zone allocation policy was definitely removed as can be seen here; 4.7.0-rc3 4.7.0-rc3 mmotm-20160623 nodelru-v8 DMA32 allocs 28794729769 0 Normal allocs 48432501431 77227309877 Movable allocs 0 0 tiobench on ext4 ---------------- tiobench is a benchmark that artifically benefits if old pages remain resident while new pages get reclaimed. The fair zone allocation policy mitigates this problem so pages age fairly. While the benchmark has problems, it is important that tiobench performance remains constant as it implies that page aging problems that the fair zone allocation policy fixes are not re-introduced. 4.7.0-rc4 4.7.0-rc4 mmotm-20160623 nodelru-v9 Min PotentialReadSpeed 89.65 ( 0.00%) 90.21 ( 0.62%) Min SeqRead-MB/sec-1 82.68 ( 0.00%) 82.01 ( -0.81%) Min SeqRead-MB/sec-2 72.76 ( 0.00%) 72.07 ( -0.95%) Min SeqRead-MB/sec-4 75.13 ( 0.00%) 74.92 ( -0.28%) Min SeqRead-MB/sec-8 64.91 ( 0.00%) 65.19 ( 0.43%) Min SeqRead-MB/sec-16 62.24 ( 0.00%) 62.22 ( -0.03%) Min RandRead-MB/sec-1 0.88 ( 0.00%) 0.88 ( 0.00%) Min RandRead-MB/sec-2 0.95 ( 0.00%) 0.92 ( -3.16%) Min RandRead-MB/sec-4 1.43 ( 0.00%) 1.34 ( -6.29%) Min RandRead-MB/sec-8 1.61 ( 0.00%) 1.60 ( -0.62%) Min RandRead-MB/sec-16 1.80 ( 0.00%) 1.90 ( 5.56%) Min SeqWrite-MB/sec-1 76.41 ( 0.00%) 76.85 ( 0.58%) Min SeqWrite-MB/sec-2 74.11 ( 0.00%) 73.54 ( -0.77%) Min SeqWrite-MB/sec-4 80.05 ( 0.00%) 80.13 ( 0.10%) Min SeqWrite-MB/sec-8 72.88 ( 0.00%) 73.20 ( 0.44%) Min SeqWrite-MB/sec-16 75.91 ( 0.00%) 76.44 ( 0.70%) Min RandWrite-MB/sec-1 1.18 ( 0.00%) 1.14 ( -3.39%) Min RandWrite-MB/sec-2 1.02 ( 0.00%) 1.03 ( 0.98%) Min RandWrite-MB/sec-4 1.05 ( 0.00%) 0.98 ( -6.67%) Min RandWrite-MB/sec-8 0.89 ( 0.00%) 0.92 ( 3.37%) Min RandWrite-MB/sec-16 0.92 ( 0.00%) 0.93 ( 1.09%) 4.7.0-rc4 4.7.0-rc4 mmotm-20160623 approx-v9 User 645.72 525.90 System 403.85 331.75 Elapsed 6795.36 6783.67 This shows that the series has little or not impact on tiobench which is desirable and a reduction in system CPU usage. It indicates that the fair zone allocation policy was removed in a manner that didn't reintroduce one class of page aging bug. There were only minor differences in overall reclaim activity 4.7.0-rc4 4.7.0-rc4 mmotm-20160623nodelru-v8 Minor Faults 645838 647465 Major Faults 573 640 Swap Ins 0 0 Swap Outs 0 0 DMA allocs 0 0 DMA32 allocs 46041453 44190646 Normal allocs 78053072 79887245 Movable allocs 0 0 Allocation stalls 24 67 Stall zone DMA 0 0 Stall zone DMA32 0 0 Stall zone Normal 0 2 Stall zone HighMem 0 0 Stall zone Movable 0 65 Direct pages scanned 10969 30609 Kswapd pages scanned 93375144 93492094 Kswapd pages reclaimed 93372243 93489370 Direct pages reclaimed 10969 30609 Kswapd efficiency 99% 99% Kswapd velocity 13741.015 13781.934 Direct efficiency 100% 100% Direct velocity 1.614 4.512 Percentage direct scans 0% 0% kswapd activity was roughly comparable. There were differences in direct reclaim activity but negligible in the context of the overall workload (velocity of 4 pages per second with the patches applied, 1.6 pages per second in the baseline kernel). pgbench read-only large configuration on ext4 --------------------------------------------- pgbench is a database benchmark that can be sensitive to page reclaim decisions. This also checks if removing the fair zone allocation policy is safe pgbench Transactions 4.7.0-rc4 4.7.0-rc4 mmotm-20160623 nodelru-v8 Hmean 1 188.26 ( 0.00%) 189.78 ( 0.81%) Hmean 5 330.66 ( 0.00%) 328.69 ( -0.59%) Hmean 12 370.32 ( 0.00%) 380.72 ( 2.81%) Hmean 21 368.89 ( 0.00%) 369.00 ( 0.03%) Hmean 30 382.14 ( 0.00%) 360.89 ( -5.56%) Hmean 32 428.87 ( 0.00%) 432.96 ( 0.95%) Negligible differences again. As with tiobench, overall reclaim activity was comparable. bonnie++ on ext4 ---------------- No interesting performance difference, negligible differences on reclaim stats. paralleldd on ext4 ------------------ This workload uses varying numbers of dd instances to read large amounts of data from disk. 4.7.0-rc3 4.7.0-rc3 mmotm-20160623 nodelru-v9 Amean Elapsd-1 186.04 ( 0.00%) 189.41 ( -1.82%) Amean Elapsd-3 192.27 ( 0.00%) 191.38 ( 0.46%) Amean Elapsd-5 185.21 ( 0.00%) 182.75 ( 1.33%) Amean Elapsd-7 183.71 ( 0.00%) 182.11 ( 0.87%) Amean Elapsd-12 180.96 ( 0.00%) 181.58 ( -0.35%) Amean Elapsd-16 181.36 ( 0.00%) 183.72 ( -1.30%) 4.7.0-rc4 4.7.0-rc4 mmotm-20160623 nodelru-v9 User 1548.01 1552.44 System 8609.71 8515.08 Elapsed 3587.10 3594.54 There is little or no change in performance but some drop in system CPU usage. 4.7.0-rc3 4.7.0-rc3 mmotm-20160623 nodelru-v9 Minor Faults 362662 367360 Major Faults 1204 1143 Swap Ins 22 0 Swap Outs 2855 1029 DMA allocs 0 0 DMA32 allocs 31409797 28837521 Normal allocs 46611853 49231282 Movable allocs 0 0 Direct pages scanned 0 0 Kswapd pages scanned 40845270 40869088 Kswapd pages reclaimed 40830976 40855294 Direct pages reclaimed 0 0 Kswapd efficiency 99% 99% Kswapd velocity 11386.711 11369.769 Direct efficiency 100% 100% Direct velocity 0.000 0.000 Percentage direct scans 0% 0% Page writes by reclaim 2855 1029 Page writes file 0 0 Page writes anon 2855 1029 Page reclaim immediate 771 1628 Sector Reads 293312636 293536360 Sector Writes 18213568 18186480 Page rescued immediate 0 0 Slabs scanned 128257 132747 Direct inode steals 181 56 Kswapd inode steals 59 1131 It basically shows that kswapd was active at roughly the same rate in both kernels. There was also comparable slab scanning activity and direct reclaim was avoided in both cases. There appears to be a large difference in numbers of inodes reclaimed but the workload has few active inodes and is likely a timing artifact. stutter ------- stutter simulates a simple workload. One part uses a lot of anonymous memory, a second measures mmap latency and a third copies a large file. The primary metric is checking for mmap latency. stutter 4.7.0-rc4 4.7.0-rc4 mmotm-20160623 nodelru-v8 Min mmap 16.6283 ( 0.00%) 13.4258 ( 19.26%) 1st-qrtle mmap 54.7570 ( 0.00%) 34.9121 ( 36.24%) 2nd-qrtle mmap 57.3163 ( 0.00%) 46.1147 ( 19.54%) 3rd-qrtle mmap 58.9976 ( 0.00%) 47.1882 ( 20.02%) Max-90% mmap 59.7433 ( 0.00%) 47.4453 ( 20.58%) Max-93% mmap 60.1298 ( 0.00%) 47.6037 ( 20.83%) Max-95% mmap 73.4112 ( 0.00%) 82.8719 (-12.89%) Max-99% mmap 92.8542 ( 0.00%) 88.8870 ( 4.27%) Max mmap 1440.6569 ( 0.00%) 121.4201 ( 91.57%) Mean mmap 59.3493 ( 0.00%) 42.2991 ( 28.73%) Best99%Mean mmap 57.2121 ( 0.00%) 41.8207 ( 26.90%) Best95%Mean mmap 55.9113 ( 0.00%) 39.9620 ( 28.53%) Best90%Mean mmap 55.6199 ( 0.00%) 39.3124 ( 29.32%) Best50%Mean mmap 53.2183 ( 0.00%) 33.1307 ( 37.75%) Best10%Mean mmap 45.9842 ( 0.00%) 20.4040 ( 55.63%) Best5%Mean mmap 43.2256 ( 0.00%) 17.9654 ( 58.44%) Best1%Mean mmap 32.9388 ( 0.00%) 16.6875 ( 49.34%) This shows a number of improvements with the worst-case outlier greatly improved. Some of the vmstats are interesting 4.7.0-rc4 4.7.0-rc4 mmotm-20160623nodelru-v8 Swap Ins 163 502 Swap Outs 0 0 DMA allocs 0 0 DMA32 allocs 618719206 1381662383 Normal allocs 891235743 564138421 Movable allocs 0 0 Allocation stalls 2603 1 Direct pages scanned 216787 2 Kswapd pages scanned 50719775 41778378 Kswapd pages reclaimed 41541765 41777639 Direct pages reclaimed 209159 0 Kswapd efficiency 81% 99% Kswapd velocity 16859.554 14329.059 Direct efficiency 96% 0% Direct velocity 72.061 0.001 Percentage direct scans 0% 0% Page writes by reclaim 6215049 0 Page writes file 6215049 0 Page writes anon 0 0 Page reclaim immediate 70673 90 Sector Reads 81940800 81680456 Sector Writes 100158984 98816036 Page rescued immediate 0 0 Slabs scanned 1366954 22683 While this is not guaranteed in all cases, this particular test showed a large reduction in direct reclaim activity. It's also worth noting that no page writes were issued from reclaim context. This series is not without its hazards. There are at least three areas that I'm concerned with even though I could not reproduce any problems in that area. 1. Reclaim/compaction is going to be affected because the amount of reclaim is no longer targetted at a specific zone. Compaction works on a per-zone basis so there is no guarantee that reclaiming a few THP's worth page pages will have a positive impact on compaction success rates. 2. The Slab/LRU reclaim ratio is affected because the frequency the shrinkers are called is now different. This may or may not be a problem but if it is, it'll be because shrinkers are not called enough and some balancing is required. 3. The anon/file reclaim ratio may be affected. Pages about to be dirtied are distributed between zones and the fair zone allocation policy used to do something very similar for anon. The distribution is now different but not necessarily in any way that matters but it's still worth bearing in mind. VM statistic counters for reclaim decisions are zone-based. If the kernel is to reclaim on a per-node basis then we need to track per-node statistics but there is no infrastructure for that. The most notable change is that the old node_page_state is renamed to sum_zone_node_page_state. The new node_page_state takes a pglist_data and uses per-node stats but none exist yet. There is some renaming such as vm_stat to vm_zone_stat and the addition of vm_node_stat and the renaming of mod_state to mod_zone_state. Otherwise, this is mostly a mechanical patch with no functional change. There is a lot of similarity between the node and zone helpers which is unfortunate but there was no obvious way of reusing the code and maintaining type safety. Link: http://lkml.kernel.org/r/1467970510-21195-2-git-send-email-mgorman@techsingularity.net Signed-off-by: Mel Gorman Acked-by: Johannes Weiner Acked-by: Vlastimil Babka Cc: Rik van Riel Cc: Vlastimil Babka Cc: Johannes Weiner Cc: Minchan Kim Cc: Joonsoo Kim Cc: Hillf Danton Cc: Michal Hocko Cc: Minchan Kim Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/mm.h | 5 +++ include/linux/mmzone.h | 13 +++++++ include/linux/vmstat.h | 92 +++++++++++++++++++++++++++++++++++++++++++------- 3 files changed, 98 insertions(+), 12 deletions(-) (limited to 'include/linux') diff --git a/include/linux/mm.h b/include/linux/mm.h index 97065e1f0237..08ed53eeedd5 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -933,6 +933,11 @@ static inline struct zone *page_zone(const struct page *page) return &NODE_DATA(page_to_nid(page))->node_zones[page_zonenum(page)]; } +static inline pg_data_t *page_pgdat(const struct page *page) +{ + return NODE_DATA(page_to_nid(page)); +} + #ifdef SECTION_IN_PAGE_FLAGS static inline void set_page_section(struct page *page, unsigned long section) { diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 19425e988bdc..078ecb81e209 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -160,6 +160,10 @@ enum zone_stat_item { NR_FREE_CMA_PAGES, NR_VM_ZONE_STAT_ITEMS }; +enum node_stat_item { + NR_VM_NODE_STAT_ITEMS +}; + /* * We do arithmetic on the LRU lists in various places in the code, * so it is important to keep the active lists LRU_ACTIVE higher in @@ -267,6 +271,11 @@ struct per_cpu_pageset { #endif }; +struct per_cpu_nodestat { + s8 stat_threshold; + s8 vm_node_stat_diff[NR_VM_NODE_STAT_ITEMS]; +}; + #endif /* !__GENERATING_BOUNDS.H */ enum zone_type { @@ -695,6 +704,10 @@ typedef struct pglist_data { struct list_head split_queue; unsigned long split_queue_len; #endif + + /* Per-node vmstats */ + struct per_cpu_nodestat __percpu *per_cpu_nodestats; + atomic_long_t vm_stat[NR_VM_NODE_STAT_ITEMS]; } pg_data_t; #define node_present_pages(nid) (NODE_DATA(nid)->node_present_pages) diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h index d2da8e053210..d1744aa3ab9c 100644 --- a/include/linux/vmstat.h +++ b/include/linux/vmstat.h @@ -106,20 +106,38 @@ static inline void vm_events_fold_cpu(int cpu) zone_idx(zone), delta) /* - * Zone based page accounting with per cpu differentials. + * Zone and node-based page accounting with per cpu differentials. */ -extern atomic_long_t vm_stat[NR_VM_ZONE_STAT_ITEMS]; +extern atomic_long_t vm_zone_stat[NR_VM_ZONE_STAT_ITEMS]; +extern atomic_long_t vm_node_stat[NR_VM_NODE_STAT_ITEMS]; static inline void zone_page_state_add(long x, struct zone *zone, enum zone_stat_item item) { atomic_long_add(x, &zone->vm_stat[item]); - atomic_long_add(x, &vm_stat[item]); + atomic_long_add(x, &vm_zone_stat[item]); +} + +static inline void node_page_state_add(long x, struct pglist_data *pgdat, + enum node_stat_item item) +{ + atomic_long_add(x, &pgdat->vm_stat[item]); + atomic_long_add(x, &vm_node_stat[item]); } static inline unsigned long global_page_state(enum zone_stat_item item) { - long x = atomic_long_read(&vm_stat[item]); + long x = atomic_long_read(&vm_zone_stat[item]); +#ifdef CONFIG_SMP + if (x < 0) + x = 0; +#endif + return x; +} + +static inline unsigned long global_node_page_state(enum node_stat_item item) +{ + long x = atomic_long_read(&vm_node_stat[item]); #ifdef CONFIG_SMP if (x < 0) x = 0; @@ -161,31 +179,44 @@ static inline unsigned long zone_page_state_snapshot(struct zone *zone, } #ifdef CONFIG_NUMA - -extern unsigned long node_page_state(int node, enum zone_stat_item item); - +extern unsigned long sum_zone_node_page_state(int node, + enum zone_stat_item item); +extern unsigned long node_page_state(struct pglist_data *pgdat, + enum node_stat_item item); #else - -#define node_page_state(node, item) global_page_state(item) - +#define sum_zone_node_page_state(node, item) global_page_state(item) +#define node_page_state(node, item) global_node_page_state(item) #endif /* CONFIG_NUMA */ #define add_zone_page_state(__z, __i, __d) mod_zone_page_state(__z, __i, __d) #define sub_zone_page_state(__z, __i, __d) mod_zone_page_state(__z, __i, -(__d)) +#define add_node_page_state(__p, __i, __d) mod_node_page_state(__p, __i, __d) +#define sub_node_page_state(__p, __i, __d) mod_node_page_state(__p, __i, -(__d)) #ifdef CONFIG_SMP void __mod_zone_page_state(struct zone *, enum zone_stat_item item, long); void __inc_zone_page_state(struct page *, enum zone_stat_item); void __dec_zone_page_state(struct page *, enum zone_stat_item); +void __mod_node_page_state(struct pglist_data *, enum node_stat_item item, long); +void __inc_node_page_state(struct page *, enum node_stat_item); +void __dec_node_page_state(struct page *, enum node_stat_item); + void mod_zone_page_state(struct zone *, enum zone_stat_item, long); void inc_zone_page_state(struct page *, enum zone_stat_item); void dec_zone_page_state(struct page *, enum zone_stat_item); +void mod_node_page_state(struct pglist_data *, enum node_stat_item, long); +void inc_node_page_state(struct page *, enum node_stat_item); +void dec_node_page_state(struct page *, enum node_stat_item); + extern void inc_zone_state(struct zone *, enum zone_stat_item); +extern void inc_node_state(struct pglist_data *, enum node_stat_item); extern void __inc_zone_state(struct zone *, enum zone_stat_item); +extern void __inc_node_state(struct pglist_data *, enum node_stat_item); extern void dec_zone_state(struct zone *, enum zone_stat_item); extern void __dec_zone_state(struct zone *, enum zone_stat_item); +extern void __dec_node_state(struct pglist_data *, enum node_stat_item); void quiet_vmstat(void); void cpu_vm_stats_fold(int cpu); @@ -213,16 +244,34 @@ static inline void __mod_zone_page_state(struct zone *zone, zone_page_state_add(delta, zone, item); } +static inline void __mod_node_page_state(struct pglist_data *pgdat, + enum node_stat_item item, int delta) +{ + node_page_state_add(delta, pgdat, item); +} + static inline void __inc_zone_state(struct zone *zone, enum zone_stat_item item) { atomic_long_inc(&zone->vm_stat[item]); - atomic_long_inc(&vm_stat[item]); + atomic_long_inc(&vm_zone_stat[item]); +} + +static inline void __inc_node_state(struct pglist_data *pgdat, enum node_stat_item item) +{ + atomic_long_inc(&pgdat->vm_stat[item]); + atomic_long_inc(&vm_node_stat[item]); } static inline void __dec_zone_state(struct zone *zone, enum zone_stat_item item) { atomic_long_dec(&zone->vm_stat[item]); - atomic_long_dec(&vm_stat[item]); + atomic_long_dec(&vm_zone_stat[item]); +} + +static inline void __dec_node_state(struct pglist_data *pgdat, enum node_stat_item item) +{ + atomic_long_dec(&pgdat->vm_stat[item]); + atomic_long_dec(&vm_node_stat[item]); } static inline void __inc_zone_page_state(struct page *page, @@ -231,12 +280,26 @@ static inline void __inc_zone_page_state(struct page *page, __inc_zone_state(page_zone(page), item); } +static inline void __inc_node_page_state(struct page *page, + enum node_stat_item item) +{ + __inc_node_state(page_pgdat(page), item); +} + + static inline void __dec_zone_page_state(struct page *page, enum zone_stat_item item) { __dec_zone_state(page_zone(page), item); } +static inline void __dec_node_page_state(struct page *page, + enum node_stat_item item) +{ + __dec_node_state(page_pgdat(page), item); +} + + /* * We only use atomic operations to update counters. So there is no need to * disable interrupts. @@ -245,7 +308,12 @@ static inline void __dec_zone_page_state(struct page *page, #define dec_zone_page_state __dec_zone_page_state #define mod_zone_page_state __mod_zone_page_state +#define inc_node_page_state __inc_node_page_state +#define dec_node_page_state __dec_node_page_state +#define mod_node_page_state __mod_node_page_state + #define inc_zone_state __inc_zone_state +#define inc_node_state __inc_node_state #define dec_zone_state __dec_zone_state #define set_pgdat_percpu_threshold(pgdat, callback) { } -- cgit v1.2.3 From a52633d8e9c35832f1409dc5fa166019048a3f1f Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Thu, 28 Jul 2016 15:45:28 -0700 Subject: mm, vmscan: move lru_lock to the node Node-based reclaim requires node-based LRUs and locking. This is a preparation patch that just moves the lru_lock to the node so later patches are easier to review. It is a mechanical change but note this patch makes contention worse because the LRU lock is hotter and direct reclaim and kswapd can contend on the same lock even when reclaiming from different zones. Link: http://lkml.kernel.org/r/1467970510-21195-3-git-send-email-mgorman@techsingularity.net Signed-off-by: Mel Gorman Reviewed-by: Minchan Kim Acked-by: Johannes Weiner Acked-by: Vlastimil Babka Cc: Hillf Danton Cc: Joonsoo Kim Cc: Michal Hocko Cc: Rik van Riel Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/mm_types.h | 2 +- include/linux/mmzone.h | 10 ++++++++-- 2 files changed, 9 insertions(+), 3 deletions(-) (limited to 'include/linux') diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 79472b22d23f..903200f4ec41 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -118,7 +118,7 @@ struct page { */ union { struct list_head lru; /* Pageout list, eg. active_list - * protected by zone->lru_lock ! + * protected by zone_lru_lock ! * Can be used as a generic list * by the page owner. */ diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 078ecb81e209..cfa870107abe 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -93,7 +93,7 @@ struct free_area { struct pglist_data; /* - * zone->lock and zone->lru_lock are two of the hottest locks in the kernel. + * zone->lock and the zone lru_lock are two of the hottest locks in the kernel. * So add a wild amount of padding here to ensure that they fall into separate * cachelines. There are very few zone structures in the machine, so space * consumption is not a concern here. @@ -496,7 +496,6 @@ struct zone { /* Write-intensive fields used by page reclaim */ /* Fields commonly accessed by the page reclaim scanner */ - spinlock_t lru_lock; struct lruvec lruvec; /* @@ -690,6 +689,9 @@ typedef struct pglist_data { /* Number of pages migrated during the rate limiting time interval */ unsigned long numabalancing_migrate_nr_pages; #endif + /* Write-intensive fields used by page reclaim */ + ZONE_PADDING(_pad1_) + spinlock_t lru_lock; #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT /* @@ -721,6 +723,10 @@ typedef struct pglist_data { #define node_start_pfn(nid) (NODE_DATA(nid)->node_start_pfn) #define node_end_pfn(nid) pgdat_end_pfn(NODE_DATA(nid)) +static inline spinlock_t *zone_lru_lock(struct zone *zone) +{ + return &zone->zone_pgdat->lru_lock; +} static inline unsigned long pgdat_end_pfn(pg_data_t *pgdat) { -- cgit v1.2.3 From 599d0c954f91d0689c9bb421b5bc04ea02437a41 Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Thu, 28 Jul 2016 15:45:31 -0700 Subject: mm, vmscan: move LRU lists to node This moves the LRU lists from the zone to the node and related data such as counters, tracing, congestion tracking and writeback tracking. Unfortunately, due to reclaim and compaction retry logic, it is necessary to account for the number of LRU pages on both zone and node logic. Most reclaim logic is based on the node counters but the retry logic uses the zone counters which do not distinguish inactive and active sizes. It would be possible to leave the LRU counters on a per-zone basis but it's a heavier calculation across multiple cache lines that is much more frequent than the retry checks. Other than the LRU counters, this is mostly a mechanical patch but note that it introduces a number of anomalies. For example, the scans are per-zone but using per-node counters. We also mark a node as congested when a zone is congested. This causes weird problems that are fixed later but is easier to review. In the event that there is excessive overhead on 32-bit systems due to the nodes being on LRU then there are two potential solutions 1. Long-term isolation of highmem pages when reclaim is lowmem When pages are skipped, they are immediately added back onto the LRU list. If lowmem reclaim persisted for long periods of time, the same highmem pages get continually scanned. The idea would be that lowmem keeps those pages on a separate list until a reclaim for highmem pages arrives that splices the highmem pages back onto the LRU. It potentially could be implemented similar to the UNEVICTABLE list. That would reduce the skip rate with the potential corner case is that highmem pages have to be scanned and reclaimed to free lowmem slab pages. 2. Linear scan lowmem pages if the initial LRU shrink fails This will break LRU ordering but may be preferable and faster during memory pressure than skipping LRU pages. Link: http://lkml.kernel.org/r/1467970510-21195-4-git-send-email-mgorman@techsingularity.net Signed-off-by: Mel Gorman Acked-by: Johannes Weiner Acked-by: Vlastimil Babka Cc: Hillf Danton Cc: Joonsoo Kim Cc: Michal Hocko Cc: Minchan Kim Cc: Rik van Riel Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/backing-dev.h | 2 +- include/linux/memcontrol.h | 18 ++++++------ include/linux/mm_inline.h | 21 ++++++++----- include/linux/mmzone.h | 68 ++++++++++++++++++++++++++----------------- include/linux/swap.h | 1 + include/linux/vm_event_item.h | 10 +++---- include/linux/vmstat.h | 17 +++++++++++ 7 files changed, 88 insertions(+), 49 deletions(-) (limited to 'include/linux') diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h index c82794f20110..491a91717788 100644 --- a/include/linux/backing-dev.h +++ b/include/linux/backing-dev.h @@ -197,7 +197,7 @@ static inline int wb_congested(struct bdi_writeback *wb, int cong_bits) } long congestion_wait(int sync, long timeout); -long wait_iff_congested(struct zone *zone, int sync, long timeout); +long wait_iff_congested(struct pglist_data *pgdat, int sync, long timeout); int pdflush_proc_obsolete(struct ctl_table *table, int write, void __user *buffer, size_t *lenp, loff_t *ppos); diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 1c4df4420258..6d2321c148cd 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -339,7 +339,7 @@ static inline struct lruvec *mem_cgroup_zone_lruvec(struct zone *zone, struct lruvec *lruvec; if (mem_cgroup_disabled()) { - lruvec = &zone->lruvec; + lruvec = zone_lruvec(zone); goto out; } @@ -348,15 +348,15 @@ static inline struct lruvec *mem_cgroup_zone_lruvec(struct zone *zone, out: /* * Since a node can be onlined after the mem_cgroup was created, - * we have to be prepared to initialize lruvec->zone here; + * we have to be prepared to initialize lruvec->pgdat here; * and if offlined then reonlined, we need to reinitialize it. */ - if (unlikely(lruvec->zone != zone)) - lruvec->zone = zone; + if (unlikely(lruvec->pgdat != zone->zone_pgdat)) + lruvec->pgdat = zone->zone_pgdat; return lruvec; } -struct lruvec *mem_cgroup_page_lruvec(struct page *, struct zone *); +struct lruvec *mem_cgroup_page_lruvec(struct page *, struct pglist_data *); bool task_in_mem_cgroup(struct task_struct *task, struct mem_cgroup *memcg); struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p); @@ -437,7 +437,7 @@ static inline bool mem_cgroup_online(struct mem_cgroup *memcg) int mem_cgroup_select_victim_node(struct mem_cgroup *memcg); void mem_cgroup_update_lru_size(struct lruvec *lruvec, enum lru_list lru, - int nr_pages); + enum zone_type zid, int nr_pages); unsigned long mem_cgroup_node_nr_lru_pages(struct mem_cgroup *memcg, int nid, unsigned int lru_mask); @@ -612,13 +612,13 @@ static inline void mem_cgroup_migrate(struct page *old, struct page *new) static inline struct lruvec *mem_cgroup_zone_lruvec(struct zone *zone, struct mem_cgroup *memcg) { - return &zone->lruvec; + return zone_lruvec(zone); } static inline struct lruvec *mem_cgroup_page_lruvec(struct page *page, - struct zone *zone) + struct pglist_data *pgdat) { - return &zone->lruvec; + return &pgdat->lruvec; } static inline bool mm_match_cgroup(struct mm_struct *mm, diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h index 5bd29ba4f174..9aadcc781857 100644 --- a/include/linux/mm_inline.h +++ b/include/linux/mm_inline.h @@ -23,25 +23,32 @@ static inline int page_is_file_cache(struct page *page) } static __always_inline void __update_lru_size(struct lruvec *lruvec, - enum lru_list lru, int nr_pages) + enum lru_list lru, enum zone_type zid, + int nr_pages) { - __mod_zone_page_state(lruvec_zone(lruvec), NR_LRU_BASE + lru, nr_pages); + struct pglist_data *pgdat = lruvec_pgdat(lruvec); + + __mod_node_page_state(pgdat, NR_LRU_BASE + lru, nr_pages); + __mod_zone_page_state(&pgdat->node_zones[zid], + NR_ZONE_LRU_BASE + !!is_file_lru(lru), + nr_pages); } static __always_inline void update_lru_size(struct lruvec *lruvec, - enum lru_list lru, int nr_pages) + enum lru_list lru, enum zone_type zid, + int nr_pages) { #ifdef CONFIG_MEMCG - mem_cgroup_update_lru_size(lruvec, lru, nr_pages); + mem_cgroup_update_lru_size(lruvec, lru, zid, nr_pages); #else - __update_lru_size(lruvec, lru, nr_pages); + __update_lru_size(lruvec, lru, zid, nr_pages); #endif } static __always_inline void add_page_to_lru_list(struct page *page, struct lruvec *lruvec, enum lru_list lru) { - update_lru_size(lruvec, lru, hpage_nr_pages(page)); + update_lru_size(lruvec, lru, page_zonenum(page), hpage_nr_pages(page)); list_add(&page->lru, &lruvec->lists[lru]); } @@ -49,7 +56,7 @@ static __always_inline void del_page_from_lru_list(struct page *page, struct lruvec *lruvec, enum lru_list lru) { list_del(&page->lru); - update_lru_size(lruvec, lru, -hpage_nr_pages(page)); + update_lru_size(lruvec, lru, page_zonenum(page), -hpage_nr_pages(page)); } /** diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index cfa870107abe..d4f5cac0a8c3 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -111,12 +111,9 @@ enum zone_stat_item { /* First 128 byte cacheline (assuming 64 bit words) */ NR_FREE_PAGES, NR_ALLOC_BATCH, - NR_LRU_BASE, - NR_INACTIVE_ANON = NR_LRU_BASE, /* must match order of LRU_[IN]ACTIVE */ - NR_ACTIVE_ANON, /* " " " " " */ - NR_INACTIVE_FILE, /* " " " " " */ - NR_ACTIVE_FILE, /* " " " " " */ - NR_UNEVICTABLE, /* " " " " " */ + NR_ZONE_LRU_BASE, /* Used only for compaction and reclaim retry */ + NR_ZONE_LRU_ANON = NR_ZONE_LRU_BASE, + NR_ZONE_LRU_FILE, NR_MLOCK, /* mlock()ed pages found and moved off LRU */ NR_ANON_PAGES, /* Mapped anonymous pages */ NR_FILE_MAPPED, /* pagecache pages mapped into pagetables. @@ -134,12 +131,9 @@ enum zone_stat_item { NR_VMSCAN_WRITE, NR_VMSCAN_IMMEDIATE, /* Prioritise for reclaim when writeback ends */ NR_WRITEBACK_TEMP, /* Writeback using temporary buffers */ - NR_ISOLATED_ANON, /* Temporary isolated pages from anon lru */ - NR_ISOLATED_FILE, /* Temporary isolated pages from file lru */ NR_SHMEM, /* shmem pages (included tmpfs/GEM pages) */ NR_DIRTIED, /* page dirtyings since bootup */ NR_WRITTEN, /* page writings since bootup */ - NR_PAGES_SCANNED, /* pages scanned since last reclaim */ #if IS_ENABLED(CONFIG_ZSMALLOC) NR_ZSPAGES, /* allocated in zsmalloc */ #endif @@ -161,6 +155,15 @@ enum zone_stat_item { NR_VM_ZONE_STAT_ITEMS }; enum node_stat_item { + NR_LRU_BASE, + NR_INACTIVE_ANON = NR_LRU_BASE, /* must match order of LRU_[IN]ACTIVE */ + NR_ACTIVE_ANON, /* " " " " " */ + NR_INACTIVE_FILE, /* " " " " " */ + NR_ACTIVE_FILE, /* " " " " " */ + NR_UNEVICTABLE, /* " " " " " */ + NR_ISOLATED_ANON, /* Temporary isolated pages from anon lru */ + NR_ISOLATED_FILE, /* Temporary isolated pages from file lru */ + NR_PAGES_SCANNED, /* pages scanned since last reclaim */ NR_VM_NODE_STAT_ITEMS }; @@ -219,7 +222,7 @@ struct lruvec { /* Evictions & activations on the inactive file list */ atomic_long_t inactive_age; #ifdef CONFIG_MEMCG - struct zone *zone; + struct pglist_data *pgdat; #endif }; @@ -357,13 +360,6 @@ struct zone { #ifdef CONFIG_NUMA int node; #endif - - /* - * The target ratio of ACTIVE_ANON to INACTIVE_ANON pages on - * this zone's LRU. Maintained by the pageout code. - */ - unsigned int inactive_ratio; - struct pglist_data *zone_pgdat; struct per_cpu_pageset __percpu *pageset; @@ -495,9 +491,6 @@ struct zone { /* Write-intensive fields used by page reclaim */ - /* Fields commonly accessed by the page reclaim scanner */ - struct lruvec lruvec; - /* * When free pages are below this point, additional steps are taken * when reading the number of free pages to avoid per-cpu counter @@ -537,17 +530,20 @@ struct zone { enum zone_flags { ZONE_RECLAIM_LOCKED, /* prevents concurrent reclaim */ - ZONE_CONGESTED, /* zone has many dirty pages backed by + ZONE_FAIR_DEPLETED, /* fair zone policy batch depleted */ +}; + +enum pgdat_flags { + PGDAT_CONGESTED, /* pgdat has many dirty pages backed by * a congested BDI */ - ZONE_DIRTY, /* reclaim scanning has recently found + PGDAT_DIRTY, /* reclaim scanning has recently found * many dirty file pages at the tail * of the LRU. */ - ZONE_WRITEBACK, /* reclaim scanning has recently found + PGDAT_WRITEBACK, /* reclaim scanning has recently found * many pages under writeback */ - ZONE_FAIR_DEPLETED, /* fair zone policy batch depleted */ }; static inline unsigned long zone_end_pfn(const struct zone *zone) @@ -707,6 +703,19 @@ typedef struct pglist_data { unsigned long split_queue_len; #endif + /* Fields commonly accessed by the page reclaim scanner */ + struct lruvec lruvec; + + /* + * The target ratio of ACTIVE_ANON to INACTIVE_ANON pages on + * this node's LRU. Maintained by the pageout code. + */ + unsigned int inactive_ratio; + + unsigned long flags; + + ZONE_PADDING(_pad2_) + /* Per-node vmstats */ struct per_cpu_nodestat __percpu *per_cpu_nodestats; atomic_long_t vm_stat[NR_VM_NODE_STAT_ITEMS]; @@ -728,6 +737,11 @@ static inline spinlock_t *zone_lru_lock(struct zone *zone) return &zone->zone_pgdat->lru_lock; } +static inline struct lruvec *zone_lruvec(struct zone *zone) +{ + return &zone->zone_pgdat->lruvec; +} + static inline unsigned long pgdat_end_pfn(pg_data_t *pgdat) { return pgdat->node_start_pfn + pgdat->node_spanned_pages; @@ -779,12 +793,12 @@ extern int init_currently_empty_zone(struct zone *zone, unsigned long start_pfn, extern void lruvec_init(struct lruvec *lruvec); -static inline struct zone *lruvec_zone(struct lruvec *lruvec) +static inline struct pglist_data *lruvec_pgdat(struct lruvec *lruvec) { #ifdef CONFIG_MEMCG - return lruvec->zone; + return lruvec->pgdat; #else - return container_of(lruvec, struct zone, lruvec); + return container_of(lruvec, struct pglist_data, lruvec); #endif } diff --git a/include/linux/swap.h b/include/linux/swap.h index 0af2bb2028fd..c82f916008b7 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -317,6 +317,7 @@ extern void lru_cache_add_active_or_unevictable(struct page *page, /* linux/mm/vmscan.c */ extern unsigned long zone_reclaimable_pages(struct zone *zone); +extern unsigned long pgdat_reclaimable_pages(struct pglist_data *pgdat); extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order, gfp_t gfp_mask, nodemask_t *mask); extern int __isolate_lru_page(struct page *page, isolate_mode_t mode); diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h index 42604173f122..1798ff542517 100644 --- a/include/linux/vm_event_item.h +++ b/include/linux/vm_event_item.h @@ -26,11 +26,11 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT, PGFREE, PGACTIVATE, PGDEACTIVATE, PGFAULT, PGMAJFAULT, PGLAZYFREED, - FOR_ALL_ZONES(PGREFILL), - FOR_ALL_ZONES(PGSTEAL_KSWAPD), - FOR_ALL_ZONES(PGSTEAL_DIRECT), - FOR_ALL_ZONES(PGSCAN_KSWAPD), - FOR_ALL_ZONES(PGSCAN_DIRECT), + PGREFILL, + PGSTEAL_KSWAPD, + PGSTEAL_DIRECT, + PGSCAN_KSWAPD, + PGSCAN_DIRECT, PGSCAN_DIRECT_THROTTLE, #ifdef CONFIG_NUMA PGSCAN_ZONE_RECLAIM_FAILED, diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h index d1744aa3ab9c..fee321c98550 100644 --- a/include/linux/vmstat.h +++ b/include/linux/vmstat.h @@ -178,6 +178,23 @@ static inline unsigned long zone_page_state_snapshot(struct zone *zone, return x; } +static inline unsigned long node_page_state_snapshot(pg_data_t *pgdat, + enum node_stat_item item) +{ + long x = atomic_long_read(&pgdat->vm_stat[item]); + +#ifdef CONFIG_SMP + int cpu; + for_each_online_cpu(cpu) + x += per_cpu_ptr(pgdat->per_cpu_nodestats, cpu)->vm_node_stat_diff[item]; + + if (x < 0) + x = 0; +#endif + return x; +} + + #ifdef CONFIG_NUMA extern unsigned long sum_zone_node_page_state(int node, enum zone_stat_item item); -- cgit v1.2.3 From 0f66114893997f781029c109b0974b7f61130df7 Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Thu, 28 Jul 2016 15:45:34 -0700 Subject: mm, mmzone: clarify the usage of zone padding Zone padding separates write-intensive fields used by page allocation, compaction and vmstats but the comments are a little misleading and need clarification. Link: http://lkml.kernel.org/r/1467970510-21195-5-git-send-email-mgorman@techsingularity.net Signed-off-by: Mel Gorman Cc: Hillf Danton Acked-by: Johannes Weiner Cc: Joonsoo Kim Cc: Michal Hocko Cc: Minchan Kim Cc: Rik van Riel Cc: Vlastimil Babka Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/mmzone.h | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) (limited to 'include/linux') diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index d4f5cac0a8c3..edafdaf62e90 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -477,20 +477,21 @@ struct zone { unsigned long wait_table_hash_nr_entries; unsigned long wait_table_bits; + /* Write-intensive fields used from the page allocator */ ZONE_PADDING(_pad1_) + /* free areas of different sizes */ struct free_area free_area[MAX_ORDER]; /* zone flags, see below */ unsigned long flags; - /* Write-intensive fields used from the page allocator */ + /* Primarily protects free_area */ spinlock_t lock; + /* Write-intensive fields used by compaction and vmstats. */ ZONE_PADDING(_pad2_) - /* Write-intensive fields used by page reclaim */ - /* * When free pages are below this point, additional steps are taken * when reading the number of free pages to avoid per-cpu counter -- cgit v1.2.3 From 31483b6ad205784d3a82240865135bef5c97c105 Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Thu, 28 Jul 2016 15:45:46 -0700 Subject: mm, vmscan: remove balance gap The balance gap was introduced to apply equal pressure to all zones when reclaiming for a higher zone. With node-based LRU, the need for the balance gap is removed and the code is dead so remove it. [vbabka@suse.cz: Also remove KSWAPD_ZONE_BALANCE_GAP_RATIO] Link: http://lkml.kernel.org/r/1467970510-21195-9-git-send-email-mgorman@techsingularity.net Signed-off-by: Mel Gorman Acked-by: Vlastimil Babka Cc: Hillf Danton Acked-by: Johannes Weiner Cc: Joonsoo Kim Cc: Michal Hocko Cc: Minchan Kim Cc: Rik van Riel Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/swap.h | 9 --------- 1 file changed, 9 deletions(-) (limited to 'include/linux') diff --git a/include/linux/swap.h b/include/linux/swap.h index c82f916008b7..916e2eddecd6 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -157,15 +157,6 @@ enum { #define SWAP_CLUSTER_MAX 32UL #define COMPACT_CLUSTER_MAX SWAP_CLUSTER_MAX -/* - * Ratio between zone->managed_pages and the "gap" that above the per-zone - * "high_wmark". While balancing nodes, We allow kswapd to shrink zones that - * do not meet the (high_wmark + gap) watermark, even which already met the - * high_wmark, in order to provide better per-zone lru behavior. We are ok to - * spend not more than 1% of the memory for this zone balancing "gap". - */ -#define KSWAPD_ZONE_BALANCE_GAP_RATIO 100 - #define SWAP_MAP_MAX 0x3e /* Max duplication count, in first swap_map */ #define SWAP_MAP_BAD 0x3f /* Note pageblock is bad, in first swap_map */ #define SWAP_HAS_CACHE 0x40 /* Flag page is cached, in first swap_map */ -- cgit v1.2.3 From 38087d9b0360987a6db46c2c2c4ece37cd048abe Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Thu, 28 Jul 2016 15:45:49 -0700 Subject: mm, vmscan: simplify the logic deciding whether kswapd sleeps kswapd goes through some complex steps trying to figure out if it should stay awake based on the classzone_idx and the requested order. It is unnecessarily complex and passes in an invalid classzone_idx to balance_pgdat(). What matters most of all is whether a larger order has been requsted and whether kswapd successfully reclaimed at the previous order. This patch irons out the logic to check just that and the end result is less headache inducing. Link: http://lkml.kernel.org/r/1467970510-21195-10-git-send-email-mgorman@techsingularity.net Signed-off-by: Mel Gorman Acked-by: Johannes Weiner Acked-by: Vlastimil Babka Cc: Hillf Danton Cc: Joonsoo Kim Cc: Michal Hocko Cc: Minchan Kim Cc: Rik van Riel Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/mmzone.h | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) (limited to 'include/linux') diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index edafdaf62e90..4062fa74526f 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -668,8 +668,9 @@ typedef struct pglist_data { wait_queue_head_t pfmemalloc_wait; struct task_struct *kswapd; /* Protected by mem_hotplug_begin/end() */ - int kswapd_max_order; - enum zone_type classzone_idx; + int kswapd_order; + enum zone_type kswapd_classzone_idx; + #ifdef CONFIG_COMPACTION int kcompactd_max_order; enum zone_type kcompactd_classzone_idx; -- cgit v1.2.3 From a9dd0a83104c01269ea36a9b4ec42b51edf85427 Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Thu, 28 Jul 2016 15:46:02 -0700 Subject: mm, vmscan: make shrink_node decisions more node-centric Earlier patches focused on having direct reclaim and kswapd use data that is node-centric for reclaiming but shrink_node() itself still uses too much zone information. This patch removes unnecessary zone-based information with the most important decision being whether to continue reclaim or not. Some memcg APIs are adjusted as a result even though memcg itself still uses some zone information. [mgorman@techsingularity.net: optimization] Link: http://lkml.kernel.org/r/1468588165-12461-2-git-send-email-mgorman@techsingularity.net Link: http://lkml.kernel.org/r/1467970510-21195-14-git-send-email-mgorman@techsingularity.net Signed-off-by: Mel Gorman Acked-by: Michal Hocko Acked-by: Vlastimil Babka Cc: Hillf Danton Acked-by: Johannes Weiner Cc: Joonsoo Kim Cc: Minchan Kim Cc: Rik van Riel Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/memcontrol.h | 19 ++++++++++--------- include/linux/mmzone.h | 4 ++-- include/linux/swap.h | 2 +- 3 files changed, 13 insertions(+), 12 deletions(-) (limited to 'include/linux') diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 6d2321c148cd..f4963ee4fdbc 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -324,22 +324,23 @@ mem_cgroup_zone_zoneinfo(struct mem_cgroup *memcg, struct zone *zone) } /** - * mem_cgroup_zone_lruvec - get the lru list vector for a zone and memcg + * mem_cgroup_lruvec - get the lru list vector for a node or a memcg zone + * @node: node of the wanted lruvec * @zone: zone of the wanted lruvec * @memcg: memcg of the wanted lruvec * - * Returns the lru list vector holding pages for the given @zone and - * @mem. This can be the global zone lruvec, if the memory controller + * Returns the lru list vector holding pages for a given @node or a given + * @memcg and @zone. This can be the node lruvec, if the memory controller * is disabled. */ -static inline struct lruvec *mem_cgroup_zone_lruvec(struct zone *zone, - struct mem_cgroup *memcg) +static inline struct lruvec *mem_cgroup_lruvec(struct pglist_data *pgdat, + struct zone *zone, struct mem_cgroup *memcg) { struct mem_cgroup_per_zone *mz; struct lruvec *lruvec; if (mem_cgroup_disabled()) { - lruvec = zone_lruvec(zone); + lruvec = node_lruvec(pgdat); goto out; } @@ -609,10 +610,10 @@ static inline void mem_cgroup_migrate(struct page *old, struct page *new) { } -static inline struct lruvec *mem_cgroup_zone_lruvec(struct zone *zone, - struct mem_cgroup *memcg) +static inline struct lruvec *mem_cgroup_lruvec(struct pglist_data *pgdat, + struct zone *zone, struct mem_cgroup *memcg) { - return zone_lruvec(zone); + return node_lruvec(pgdat); } static inline struct lruvec *mem_cgroup_page_lruvec(struct page *page, diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 4062fa74526f..895c365e3259 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -739,9 +739,9 @@ static inline spinlock_t *zone_lru_lock(struct zone *zone) return &zone->zone_pgdat->lru_lock; } -static inline struct lruvec *zone_lruvec(struct zone *zone) +static inline struct lruvec *node_lruvec(struct pglist_data *pgdat) { - return &zone->zone_pgdat->lruvec; + return &pgdat->lruvec; } static inline unsigned long pgdat_end_pfn(pg_data_t *pgdat) diff --git a/include/linux/swap.h b/include/linux/swap.h index 916e2eddecd6..0ad616d7c381 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -316,7 +316,7 @@ extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg, unsigned long nr_pages, gfp_t gfp_mask, bool may_swap); -extern unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem, +extern unsigned long mem_cgroup_shrink_node(struct mem_cgroup *mem, gfp_t gfp_mask, bool noswap, struct zone *zone, unsigned long *nr_scanned); -- cgit v1.2.3 From ef8f2327996b5c20f11420f64e439e87c7a01604 Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Thu, 28 Jul 2016 15:46:05 -0700 Subject: mm, memcg: move memcg limit enforcement from zones to nodes Memcg needs adjustment after moving LRUs to the node. Limits are tracked per memcg but the soft-limit excess is tracked per zone. As global page reclaim is based on the node, it is easy to imagine a situation where a zone soft limit is exceeded even though the memcg limit is fine. This patch moves the soft limit tree the node. Technically, all the variable names should also change but people are already familiar by the meaning of "mz" even if "mn" would be a more appropriate name now. Link: http://lkml.kernel.org/r/1467970510-21195-15-git-send-email-mgorman@techsingularity.net Signed-off-by: Mel Gorman Acked-by: Michal Hocko Cc: Hillf Danton Acked-by: Johannes Weiner Cc: Joonsoo Kim Cc: Minchan Kim Cc: Rik van Riel Cc: Vlastimil Babka Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/memcontrol.h | 38 +++++++++++++++----------------------- include/linux/swap.h | 2 +- 2 files changed, 16 insertions(+), 24 deletions(-) (limited to 'include/linux') diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index f4963ee4fdbc..b759827b2f1e 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -60,7 +60,7 @@ enum mem_cgroup_stat_index { }; struct mem_cgroup_reclaim_cookie { - struct zone *zone; + pg_data_t *pgdat; int priority; unsigned int generation; }; @@ -118,7 +118,7 @@ struct mem_cgroup_reclaim_iter { /* * per-zone information in memory controller. */ -struct mem_cgroup_per_zone { +struct mem_cgroup_per_node { struct lruvec lruvec; unsigned long lru_size[NR_LRU_LISTS]; @@ -132,10 +132,6 @@ struct mem_cgroup_per_zone { /* use container_of */ }; -struct mem_cgroup_per_node { - struct mem_cgroup_per_zone zoneinfo[MAX_NR_ZONES]; -}; - struct mem_cgroup_threshold { struct eventfd_ctx *eventfd; unsigned long threshold; @@ -314,19 +310,15 @@ void mem_cgroup_uncharge_list(struct list_head *page_list); void mem_cgroup_migrate(struct page *oldpage, struct page *newpage); -static inline struct mem_cgroup_per_zone * -mem_cgroup_zone_zoneinfo(struct mem_cgroup *memcg, struct zone *zone) +static struct mem_cgroup_per_node * +mem_cgroup_nodeinfo(struct mem_cgroup *memcg, int nid) { - int nid = zone_to_nid(zone); - int zid = zone_idx(zone); - - return &memcg->nodeinfo[nid]->zoneinfo[zid]; + return memcg->nodeinfo[nid]; } /** * mem_cgroup_lruvec - get the lru list vector for a node or a memcg zone * @node: node of the wanted lruvec - * @zone: zone of the wanted lruvec * @memcg: memcg of the wanted lruvec * * Returns the lru list vector holding pages for a given @node or a given @@ -334,9 +326,9 @@ mem_cgroup_zone_zoneinfo(struct mem_cgroup *memcg, struct zone *zone) * is disabled. */ static inline struct lruvec *mem_cgroup_lruvec(struct pglist_data *pgdat, - struct zone *zone, struct mem_cgroup *memcg) + struct mem_cgroup *memcg) { - struct mem_cgroup_per_zone *mz; + struct mem_cgroup_per_node *mz; struct lruvec *lruvec; if (mem_cgroup_disabled()) { @@ -344,7 +336,7 @@ static inline struct lruvec *mem_cgroup_lruvec(struct pglist_data *pgdat, goto out; } - mz = mem_cgroup_zone_zoneinfo(memcg, zone); + mz = mem_cgroup_nodeinfo(memcg, pgdat->node_id); lruvec = &mz->lruvec; out: /* @@ -352,8 +344,8 @@ out: * we have to be prepared to initialize lruvec->pgdat here; * and if offlined then reonlined, we need to reinitialize it. */ - if (unlikely(lruvec->pgdat != zone->zone_pgdat)) - lruvec->pgdat = zone->zone_pgdat; + if (unlikely(lruvec->pgdat != pgdat)) + lruvec->pgdat = pgdat; return lruvec; } @@ -446,9 +438,9 @@ unsigned long mem_cgroup_node_nr_lru_pages(struct mem_cgroup *memcg, static inline unsigned long mem_cgroup_get_lru_size(struct lruvec *lruvec, enum lru_list lru) { - struct mem_cgroup_per_zone *mz; + struct mem_cgroup_per_node *mz; - mz = container_of(lruvec, struct mem_cgroup_per_zone, lruvec); + mz = container_of(lruvec, struct mem_cgroup_per_node, lruvec); return mz->lru_size[lru]; } @@ -519,7 +511,7 @@ static inline void mem_cgroup_dec_page_stat(struct page *page, mem_cgroup_update_page_stat(page, idx, -1); } -unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order, +unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order, gfp_t gfp_mask, unsigned long *total_scanned); @@ -611,7 +603,7 @@ static inline void mem_cgroup_migrate(struct page *old, struct page *new) } static inline struct lruvec *mem_cgroup_lruvec(struct pglist_data *pgdat, - struct zone *zone, struct mem_cgroup *memcg) + struct mem_cgroup *memcg) { return node_lruvec(pgdat); } @@ -723,7 +715,7 @@ static inline void mem_cgroup_dec_page_stat(struct page *page, } static inline -unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order, +unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order, gfp_t gfp_mask, unsigned long *total_scanned) { diff --git a/include/linux/swap.h b/include/linux/swap.h index 0ad616d7c381..2a23ddc96edd 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -318,7 +318,7 @@ extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg, bool may_swap); extern unsigned long mem_cgroup_shrink_node(struct mem_cgroup *mem, gfp_t gfp_mask, bool noswap, - struct zone *zone, + pg_data_t *pgdat, unsigned long *nr_scanned); extern unsigned long shrink_all_memory(unsigned long nr_pages); extern int vm_swappiness; -- cgit v1.2.3 From 1e6b10857f91685c60c341703ece4ae9bb775cf3 Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Thu, 28 Jul 2016 15:46:08 -0700 Subject: mm, workingset: make working set detection node-aware Working set and refault detection is still zone-based, fix it. Link: http://lkml.kernel.org/r/1467970510-21195-16-git-send-email-mgorman@techsingularity.net Signed-off-by: Mel Gorman Acked-by: Johannes Weiner Acked-by: Vlastimil Babka Cc: Hillf Danton Cc: Joonsoo Kim Cc: Michal Hocko Cc: Minchan Kim Cc: Rik van Riel Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/mmzone.h | 6 +++--- include/linux/vmstat.h | 1 - 2 files changed, 3 insertions(+), 4 deletions(-) (limited to 'include/linux') diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 895c365e3259..62f477d6cfe8 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -145,9 +145,6 @@ enum zone_stat_item { NUMA_LOCAL, /* allocation from local node */ NUMA_OTHER, /* allocation from other node */ #endif - WORKINGSET_REFAULT, - WORKINGSET_ACTIVATE, - WORKINGSET_NODERECLAIM, NR_ANON_THPS, NR_SHMEM_THPS, NR_SHMEM_PMDMAPPED, @@ -164,6 +161,9 @@ enum node_stat_item { NR_ISOLATED_ANON, /* Temporary isolated pages from anon lru */ NR_ISOLATED_FILE, /* Temporary isolated pages from file lru */ NR_PAGES_SCANNED, /* pages scanned since last reclaim */ + WORKINGSET_REFAULT, + WORKINGSET_ACTIVATE, + WORKINGSET_NODERECLAIM, NR_VM_NODE_STAT_ITEMS }; diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h index fee321c98550..6b7975cd98aa 100644 --- a/include/linux/vmstat.h +++ b/include/linux/vmstat.h @@ -227,7 +227,6 @@ void mod_node_page_state(struct pglist_data *, enum node_stat_item, long); void inc_node_page_state(struct page *, enum node_stat_item); void dec_node_page_state(struct page *, enum node_stat_item); -extern void inc_zone_state(struct zone *, enum zone_stat_item); extern void inc_node_state(struct pglist_data *, enum node_stat_item); extern void __inc_zone_state(struct zone *, enum zone_stat_item); extern void __inc_node_state(struct pglist_data *, enum node_stat_item); -- cgit v1.2.3 From 281e37265f2826ed401d84d6790226448ef3f0e8 Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Thu, 28 Jul 2016 15:46:11 -0700 Subject: mm, page_alloc: consider dirtyable memory in terms of nodes Historically dirty pages were spread among zones but now that LRUs are per-node it is more appropriate to consider dirty pages in a node. Link: http://lkml.kernel.org/r/1467970510-21195-17-git-send-email-mgorman@techsingularity.net Signed-off-by: Mel Gorman Signed-off-by: Johannes Weiner Acked-by: Vlastimil Babka Acked-by: Michal Hocko Cc: Hillf Danton Cc: Joonsoo Kim Cc: Minchan Kim Cc: Rik van Riel Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/mmzone.h | 12 ++++++------ include/linux/writeback.h | 2 +- 2 files changed, 7 insertions(+), 7 deletions(-) (limited to 'include/linux') diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 62f477d6cfe8..fae2fe3c6942 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -363,12 +363,6 @@ struct zone { struct pglist_data *zone_pgdat; struct per_cpu_pageset __percpu *pageset; - /* - * This is a per-zone reserve of pages that are not available - * to userspace allocations. - */ - unsigned long totalreserve_pages; - #ifndef CONFIG_SPARSEMEM /* * Flags for a pageblock_nr_pages block. See pageblock-flags.h. @@ -687,6 +681,12 @@ typedef struct pglist_data { /* Number of pages migrated during the rate limiting time interval */ unsigned long numabalancing_migrate_nr_pages; #endif + /* + * This is a per-node reserve of pages that are not available + * to userspace allocations. + */ + unsigned long totalreserve_pages; + /* Write-intensive fields used by page reclaim */ ZONE_PADDING(_pad1_) spinlock_t lru_lock; diff --git a/include/linux/writeback.h b/include/linux/writeback.h index 717e6149e753..fc1e16c25a29 100644 --- a/include/linux/writeback.h +++ b/include/linux/writeback.h @@ -320,7 +320,7 @@ void laptop_mode_timer_fn(unsigned long data); static inline void laptop_sync_completion(void) { } #endif void throttle_vm_writeout(gfp_t gfp_mask); -bool zone_dirty_ok(struct zone *zone); +bool node_dirty_ok(struct pglist_data *pgdat); int wb_domain_init(struct wb_domain *dom, gfp_t gfp); #ifdef CONFIG_CGROUP_WRITEBACK void wb_domain_exit(struct wb_domain *dom); -- cgit v1.2.3 From 50658e2e04c12d5cd628381c1b9cb69d0093a9c0 Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Thu, 28 Jul 2016 15:46:14 -0700 Subject: mm: move page mapped accounting to the node Reclaim makes decisions based on the number of pages that are mapped but it's mixing node and zone information. Account NR_FILE_MAPPED and NR_ANON_PAGES pages on the node. Link: http://lkml.kernel.org/r/1467970510-21195-18-git-send-email-mgorman@techsingularity.net Signed-off-by: Mel Gorman Acked-by: Vlastimil Babka Acked-by: Michal Hocko Cc: Hillf Danton Acked-by: Johannes Weiner Cc: Joonsoo Kim Cc: Minchan Kim Cc: Rik van Riel Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/mmzone.h | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) (limited to 'include/linux') diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index fae2fe3c6942..95d34d1e1fb5 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -115,9 +115,6 @@ enum zone_stat_item { NR_ZONE_LRU_ANON = NR_ZONE_LRU_BASE, NR_ZONE_LRU_FILE, NR_MLOCK, /* mlock()ed pages found and moved off LRU */ - NR_ANON_PAGES, /* Mapped anonymous pages */ - NR_FILE_MAPPED, /* pagecache pages mapped into pagetables. - only modified from process context */ NR_FILE_PAGES, NR_FILE_DIRTY, NR_WRITEBACK, @@ -164,6 +161,9 @@ enum node_stat_item { WORKINGSET_REFAULT, WORKINGSET_ACTIVATE, WORKINGSET_NODERECLAIM, + NR_ANON_PAGES, /* Mapped anonymous pages */ + NR_FILE_MAPPED, /* pagecache pages mapped into pagetables. + only modified from process context */ NR_VM_NODE_STAT_ITEMS }; -- cgit v1.2.3 From 4b9d0fab7166c9323f06d708518a35cf3a90426c Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Thu, 28 Jul 2016 15:46:17 -0700 Subject: mm: rename NR_ANON_PAGES to NR_ANON_MAPPED NR_FILE_PAGES is the number of file pages. NR_FILE_MAPPED is the number of mapped file pages. NR_ANON_PAGES is the number of mapped anon pages. This is unhelpful naming as it's easy to confuse NR_FILE_MAPPED and NR_ANON_PAGES for mapped pages. This patch renames NR_ANON_PAGES so we have NR_FILE_PAGES is the number of file pages. NR_FILE_MAPPED is the number of mapped file pages. NR_ANON_MAPPED is the number of mapped anon pages. Link: http://lkml.kernel.org/r/1467970510-21195-19-git-send-email-mgorman@techsingularity.net Signed-off-by: Mel Gorman Acked-by: Vlastimil Babka Cc: Hillf Danton Cc: Johannes Weiner Cc: Joonsoo Kim Cc: Michal Hocko Cc: Minchan Kim Cc: Rik van Riel Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/mmzone.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'include/linux') diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 95d34d1e1fb5..2d4a8804eafa 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -161,7 +161,7 @@ enum node_stat_item { WORKINGSET_REFAULT, WORKINGSET_ACTIVATE, WORKINGSET_NODERECLAIM, - NR_ANON_PAGES, /* Mapped anonymous pages */ + NR_ANON_MAPPED, /* Mapped anonymous pages */ NR_FILE_MAPPED, /* pagecache pages mapped into pagetables. only modified from process context */ NR_VM_NODE_STAT_ITEMS -- cgit v1.2.3 From 11fb998986a72aa7e997d96d63d52582a01228c5 Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Thu, 28 Jul 2016 15:46:20 -0700 Subject: mm: move most file-based accounting to the node There are now a number of accounting oddities such as mapped file pages being accounted for on the node while the total number of file pages are accounted on the zone. This can be coped with to some extent but it's confusing so this patch moves the relevant file-based accounted. Due to throttling logic in the page allocator for reliable OOM detection, it is still necessary to track dirty and writeback pages on a per-zone basis. [mgorman@techsingularity.net: fix NR_ZONE_WRITE_PENDING accounting] Link: http://lkml.kernel.org/r/1468404004-5085-5-git-send-email-mgorman@techsingularity.net Link: http://lkml.kernel.org/r/1467970510-21195-20-git-send-email-mgorman@techsingularity.net Signed-off-by: Mel Gorman Acked-by: Vlastimil Babka Acked-by: Michal Hocko Cc: Hillf Danton Acked-by: Johannes Weiner Cc: Joonsoo Kim Cc: Minchan Kim Cc: Rik van Riel Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/mmzone.h | 19 ++++++++++--------- 1 file changed, 10 insertions(+), 9 deletions(-) (limited to 'include/linux') diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 2d4a8804eafa..acd4665c3025 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -114,21 +114,16 @@ enum zone_stat_item { NR_ZONE_LRU_BASE, /* Used only for compaction and reclaim retry */ NR_ZONE_LRU_ANON = NR_ZONE_LRU_BASE, NR_ZONE_LRU_FILE, + NR_ZONE_WRITE_PENDING, /* Count of dirty, writeback and unstable pages */ NR_MLOCK, /* mlock()ed pages found and moved off LRU */ - NR_FILE_PAGES, - NR_FILE_DIRTY, - NR_WRITEBACK, NR_SLAB_RECLAIMABLE, NR_SLAB_UNRECLAIMABLE, NR_PAGETABLE, /* used for pagetables */ NR_KERNEL_STACK, /* Second 128 byte cacheline */ - NR_UNSTABLE_NFS, /* NFS unstable pages */ NR_BOUNCE, NR_VMSCAN_WRITE, NR_VMSCAN_IMMEDIATE, /* Prioritise for reclaim when writeback ends */ - NR_WRITEBACK_TEMP, /* Writeback using temporary buffers */ - NR_SHMEM, /* shmem pages (included tmpfs/GEM pages) */ NR_DIRTIED, /* page dirtyings since bootup */ NR_WRITTEN, /* page writings since bootup */ #if IS_ENABLED(CONFIG_ZSMALLOC) @@ -142,9 +137,6 @@ enum zone_stat_item { NUMA_LOCAL, /* allocation from local node */ NUMA_OTHER, /* allocation from other node */ #endif - NR_ANON_THPS, - NR_SHMEM_THPS, - NR_SHMEM_PMDMAPPED, NR_FREE_CMA_PAGES, NR_VM_ZONE_STAT_ITEMS }; @@ -164,6 +156,15 @@ enum node_stat_item { NR_ANON_MAPPED, /* Mapped anonymous pages */ NR_FILE_MAPPED, /* pagecache pages mapped into pagetables. only modified from process context */ + NR_FILE_PAGES, + NR_FILE_DIRTY, + NR_WRITEBACK, + NR_WRITEBACK_TEMP, /* Writeback using temporary buffers */ + NR_SHMEM, /* shmem pages (included tmpfs/GEM pages) */ + NR_SHMEM_THPS, + NR_SHMEM_PMDMAPPED, + NR_ANON_THPS, + NR_UNSTABLE_NFS, /* NFS unstable pages */ NR_VM_NODE_STAT_ITEMS }; -- cgit v1.2.3 From c4a25635b60d08853a3e4eaae3ab34419a36cfa2 Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Thu, 28 Jul 2016 15:46:23 -0700 Subject: mm: move vmscan writes and file write accounting to the node As reclaim is now node-based, it follows that page write activity due to page reclaim should also be accounted for on the node. For consistency, also account page writes and page dirtying on a per-node basis. After this patch, there are a few remaining zone counters that may appear strange but are fine. NUMA stats are still per-zone as this is a user-space interface that tools consume. NR_MLOCK, NR_SLAB_*, NR_PAGETABLE, NR_KERNEL_STACK and NR_BOUNCE are all allocations that potentially pin low memory and cannot trivially be reclaimed on demand. This information is still useful for debugging a page allocation failure warning. Link: http://lkml.kernel.org/r/1467970510-21195-21-git-send-email-mgorman@techsingularity.net Signed-off-by: Mel Gorman Acked-by: Vlastimil Babka Acked-by: Michal Hocko Cc: Hillf Danton Acked-by: Johannes Weiner Cc: Joonsoo Kim Cc: Minchan Kim Cc: Rik van Riel Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/mmzone.h | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) (limited to 'include/linux') diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index acd4665c3025..e3d6d42722a0 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -122,10 +122,6 @@ enum zone_stat_item { NR_KERNEL_STACK, /* Second 128 byte cacheline */ NR_BOUNCE, - NR_VMSCAN_WRITE, - NR_VMSCAN_IMMEDIATE, /* Prioritise for reclaim when writeback ends */ - NR_DIRTIED, /* page dirtyings since bootup */ - NR_WRITTEN, /* page writings since bootup */ #if IS_ENABLED(CONFIG_ZSMALLOC) NR_ZSPAGES, /* allocated in zsmalloc */ #endif @@ -165,6 +161,10 @@ enum node_stat_item { NR_SHMEM_PMDMAPPED, NR_ANON_THPS, NR_UNSTABLE_NFS, /* NFS unstable pages */ + NR_VMSCAN_WRITE, + NR_VMSCAN_IMMEDIATE, /* Prioritise for reclaim when writeback ends */ + NR_DIRTIED, /* page dirtyings since bootup */ + NR_WRITTEN, /* page writings since bootup */ NR_VM_NODE_STAT_ITEMS }; -- cgit v1.2.3 From a5f5f91da6ad647fb0cc7fce0e17343c0d1c5a9a Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Thu, 28 Jul 2016 15:46:32 -0700 Subject: mm: convert zone_reclaim to node_reclaim As reclaim is now per-node based, convert zone_reclaim to be node_reclaim. It is possible that a node will be reclaimed multiple times if it has multiple zones but this is unavoidable without caching all nodes traversed so far. The documentation and interface to userspace is the same from a configuration perspective and will will be similar in behaviour unless the node-local allocation requests were also limited to lower zones. Link: http://lkml.kernel.org/r/1467970510-21195-24-git-send-email-mgorman@techsingularity.net Signed-off-by: Mel Gorman Acked-by: Vlastimil Babka Cc: Hillf Danton Acked-by: Johannes Weiner Cc: Joonsoo Kim Cc: Michal Hocko Cc: Minchan Kim Cc: Rik van Riel Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/mmzone.h | 18 +++++++++--------- include/linux/swap.h | 9 +++++---- include/linux/topology.h | 2 +- 3 files changed, 15 insertions(+), 14 deletions(-) (limited to 'include/linux') diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index e3d6d42722a0..e19c081c794e 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -372,14 +372,6 @@ struct zone { unsigned long *pageblock_flags; #endif /* CONFIG_SPARSEMEM */ -#ifdef CONFIG_NUMA - /* - * zone reclaim becomes active if more unmapped pages exist. - */ - unsigned long min_unmapped_pages; - unsigned long min_slab_pages; -#endif /* CONFIG_NUMA */ - /* zone_start_pfn == zone_start_paddr >> PAGE_SHIFT */ unsigned long zone_start_pfn; @@ -525,7 +517,6 @@ struct zone { } ____cacheline_internodealigned_in_smp; enum zone_flags { - ZONE_RECLAIM_LOCKED, /* prevents concurrent reclaim */ ZONE_FAIR_DEPLETED, /* fair zone policy batch depleted */ }; @@ -540,6 +531,7 @@ enum pgdat_flags { PGDAT_WRITEBACK, /* reclaim scanning has recently found * many pages under writeback */ + PGDAT_RECLAIM_LOCKED, /* prevents concurrent reclaim */ }; static inline unsigned long zone_end_pfn(const struct zone *zone) @@ -688,6 +680,14 @@ typedef struct pglist_data { */ unsigned long totalreserve_pages; +#ifdef CONFIG_NUMA + /* + * zone reclaim becomes active if more unmapped pages exist. + */ + unsigned long min_unmapped_pages; + unsigned long min_slab_pages; +#endif /* CONFIG_NUMA */ + /* Write-intensive fields used by page reclaim */ ZONE_PADDING(_pad1_) spinlock_t lru_lock; diff --git a/include/linux/swap.h b/include/linux/swap.h index 2a23ddc96edd..b17cc4830fa6 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -326,13 +326,14 @@ extern int remove_mapping(struct address_space *mapping, struct page *page); extern unsigned long vm_total_pages; #ifdef CONFIG_NUMA -extern int zone_reclaim_mode; +extern int node_reclaim_mode; extern int sysctl_min_unmapped_ratio; extern int sysctl_min_slab_ratio; -extern int zone_reclaim(struct zone *, gfp_t, unsigned int); +extern int node_reclaim(struct pglist_data *, gfp_t, unsigned int); #else -#define zone_reclaim_mode 0 -static inline int zone_reclaim(struct zone *z, gfp_t mask, unsigned int order) +#define node_reclaim_mode 0 +static inline int node_reclaim(struct pglist_data *pgdat, gfp_t mask, + unsigned int order) { return 0; } diff --git a/include/linux/topology.h b/include/linux/topology.h index afce69296ac0..cb0775e1ee4b 100644 --- a/include/linux/topology.h +++ b/include/linux/topology.h @@ -54,7 +54,7 @@ int arch_update_cpu_topology(void); /* * If the distance between nodes in a system is larger than RECLAIM_DISTANCE * (in whatever arch specific measurement units returned by node_distance()) - * and zone_reclaim_mode is enabled then the VM will only call zone_reclaim() + * and node_reclaim_mode is enabled then the VM will only call node_reclaim() * on nodes within this distance. */ #define RECLAIM_DISTANCE 30 -- cgit v1.2.3 From e6cbd7f2efb433d717af72aa8510a9db6f7a7e05 Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Thu, 28 Jul 2016 15:46:50 -0700 Subject: mm, page_alloc: remove fair zone allocation policy The fair zone allocation policy interleaves allocation requests between zones to avoid an age inversion problem whereby new pages are reclaimed to balance a zone. Reclaim is now node-based so this should no longer be an issue and the fair zone allocation policy is not free. This patch removes it. Link: http://lkml.kernel.org/r/1467970510-21195-30-git-send-email-mgorman@techsingularity.net Signed-off-by: Mel Gorman Acked-by: Vlastimil Babka Cc: Hillf Danton Acked-by: Johannes Weiner Cc: Joonsoo Kim Cc: Michal Hocko Cc: Minchan Kim Cc: Rik van Riel Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/mmzone.h | 5 ----- 1 file changed, 5 deletions(-) (limited to 'include/linux') diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index e19c081c794e..bd33e6f1bed0 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -110,7 +110,6 @@ struct zone_padding { enum zone_stat_item { /* First 128 byte cacheline (assuming 64 bit words) */ NR_FREE_PAGES, - NR_ALLOC_BATCH, NR_ZONE_LRU_BASE, /* Used only for compaction and reclaim retry */ NR_ZONE_LRU_ANON = NR_ZONE_LRU_BASE, NR_ZONE_LRU_FILE, @@ -516,10 +515,6 @@ struct zone { atomic_long_t vm_stat[NR_VM_ZONE_STAT_ITEMS]; } ____cacheline_internodealigned_in_smp; -enum zone_flags { - ZONE_FAIR_DEPLETED, /* fair zone policy batch depleted */ -}; - enum pgdat_flags { PGDAT_CONGESTED, /* pgdat has many dirty pages backed by * a congested BDI -- cgit v1.2.3 From 16709d1de1954475356a65848f80a01581b4903c Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Thu, 28 Jul 2016 15:46:56 -0700 Subject: mm: vmstat: replace __count_zone_vm_events with a zone id equivalent This is partially a preparation patch for more vmstat work but it also has the slight advantage that __count_zid_vm_events is cheaper to calculate than __count_zone_vm_events(). Link: http://lkml.kernel.org/r/1467970510-21195-32-git-send-email-mgorman@techsingularity.net Signed-off-by: Mel Gorman Acked-by: Vlastimil Babka Cc: Hillf Danton Acked-by: Johannes Weiner Cc: Joonsoo Kim Cc: Michal Hocko Cc: Minchan Kim Cc: Rik van Riel Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/vmstat.h | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) (limited to 'include/linux') diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h index 6b7975cd98aa..613771909b6e 100644 --- a/include/linux/vmstat.h +++ b/include/linux/vmstat.h @@ -101,9 +101,8 @@ static inline void vm_events_fold_cpu(int cpu) #define count_vm_vmacache_event(x) do {} while (0) #endif -#define __count_zone_vm_events(item, zone, delta) \ - __count_vm_events(item##_NORMAL - ZONE_NORMAL + \ - zone_idx(zone), delta) +#define __count_zid_vm_events(item, zid, delta) \ + __count_vm_events(item##_NORMAL - ZONE_NORMAL + zid, delta) /* * Zone and node-based page accounting with per cpu differentials. -- cgit v1.2.3 From 7cc30fcfd2a894589d832a192cac3dc5cd302bb8 Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Thu, 28 Jul 2016 15:46:59 -0700 Subject: mm: vmstat: account per-zone stalls and pages skipped during reclaim The vmstat allocstall was fairly useful in the general sense but node-based LRUs change that. It's important to know if a stall was for an address-limited allocation request as this will require skipping pages from other zones. This patch adds pgstall_* counters to replace allocstall. The sum of the counters will equal the old allocstall so it can be trivially recalculated. A high number of address-limited allocation requests may result in a lot of useless LRU scanning for suitable pages. As address-limited allocations require pages to be skipped, it's important to know how much useless LRU scanning took place so this patch adds pgskip* counters. This yields the following model 1. The number of address-space limited stalls can be accounted for (pgstall) 2. The amount of useless work required to reclaim the data is accounted (pgskip) 3. The total number of scans is available from pgscan_kswapd and pgscan_direct so from that the ratio of useful to useless scans can be calculated. [mgorman@techsingularity.net: s/pgstall/allocstall/] Link: http://lkml.kernel.org/r/1468404004-5085-3-git-send-email-mgorman@techsingularity.netLink: http://lkml.kernel.org/r/1467970510-21195-33-git-send-email-mgorman@techsingularity.net Signed-off-by: Mel Gorman Acked-by: Vlastimil Babka Cc: Hillf Danton Acked-by: Johannes Weiner Cc: Joonsoo Kim Cc: Michal Hocko Cc: Minchan Kim Cc: Rik van Riel Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/vm_event_item.h | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) (limited to 'include/linux') diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h index 1798ff542517..4d6ec58a8d45 100644 --- a/include/linux/vm_event_item.h +++ b/include/linux/vm_event_item.h @@ -23,6 +23,8 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT, FOR_ALL_ZONES(PGALLOC), + FOR_ALL_ZONES(ALLOCSTALL), + FOR_ALL_ZONES(PGSCAN_SKIP), PGFREE, PGACTIVATE, PGDEACTIVATE, PGFAULT, PGMAJFAULT, PGLAZYFREED, @@ -37,7 +39,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT, #endif PGINODESTEAL, SLABS_SCANNED, KSWAPD_INODESTEAL, KSWAPD_LOW_WMARK_HIT_QUICKLY, KSWAPD_HIGH_WMARK_HIT_QUICKLY, - PAGEOUTRUN, ALLOCSTALL, PGROTATED, + PAGEOUTRUN, PGROTATED, DROP_PAGECACHE, DROP_SLAB, #ifdef CONFIG_NUMA_BALANCING NUMA_PTE_UPDATES, -- cgit v1.2.3 From bca6759258dbef378bcf5b872177bcd2259ceb68 Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Thu, 28 Jul 2016 15:47:05 -0700 Subject: mm, vmstat: remove zone and node double accounting by approximating retries The number of LRU pages, dirty pages and writeback pages must be accounted for on both zones and nodes because of the reclaim retry logic, compaction retry logic and highmem calculations all depending on per-zone stats. Many lowmem allocations are immune from OOM kill due to a check in __alloc_pages_may_oom for (ac->high_zoneidx < ZONE_NORMAL) since commit 03668b3ceb0c ("oom: avoid oom killer for lowmem allocations"). The exception is costly high-order allocations or allocations that cannot fail. If the __alloc_pages_may_oom avoids OOM-kill for low-order lowmem allocations then it would fall through to __alloc_pages_direct_compact. This patch will blindly retry reclaim for zone-constrained allocations in should_reclaim_retry up to MAX_RECLAIM_RETRIES. This is not ideal but without per-zone stats there are not many alternatives. The impact it that zone-constrained allocations may delay before considering the OOM killer. As there is no guarantee enough memory can ever be freed to satisfy compaction, this patch avoids retrying compaction for zone-contrained allocations. In combination, that means that the per-node stats can be used when deciding whether to continue reclaim using a rough approximation. While it is possible this will make the wrong decision on occasion, it will not infinite loop as the number of reclaim attempts is capped by MAX_RECLAIM_RETRIES. The final step is calculating the number of dirtyable highmem pages. As those calculations only care about the global count of file pages in highmem. This patch uses a global counter used instead of per-zone stats as it is sufficient. In combination, this allows the per-zone LRU and dirty state counters to be removed. [mgorman@techsingularity.net: fix acct_highmem_file_pages()] Link: http://lkml.kernel.org/r/1468853426-12858-4-git-send-email-mgorman@techsingularity.netLink: http://lkml.kernel.org/r/1467970510-21195-35-git-send-email-mgorman@techsingularity.net Signed-off-by: Mel Gorman Suggested by: Michal Hocko Acked-by: Hillf Danton Cc: Johannes Weiner Cc: Joonsoo Kim Cc: Michal Hocko Cc: Minchan Kim Cc: Rik van Riel Cc: Vlastimil Babka Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/mm_inline.h | 20 +++++++++++++++++--- include/linux/mmzone.h | 4 ---- include/linux/swap.h | 1 - 3 files changed, 17 insertions(+), 8 deletions(-) (limited to 'include/linux') diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h index 9aadcc781857..dd22b08c47be 100644 --- a/include/linux/mm_inline.h +++ b/include/linux/mm_inline.h @@ -4,6 +4,22 @@ #include #include +#ifdef CONFIG_HIGHMEM +extern atomic_t highmem_file_pages; + +static inline void acct_highmem_file_pages(int zid, enum lru_list lru, + int nr_pages) +{ + if (is_highmem_idx(zid) && is_file_lru(lru)) + atomic_add(nr_pages, &highmem_file_pages); +} +#else +static inline void acct_highmem_file_pages(int zid, enum lru_list lru, + int nr_pages) +{ +} +#endif + /** * page_is_file_cache - should the page be on a file LRU or anon LRU? * @page: the page to test @@ -29,9 +45,7 @@ static __always_inline void __update_lru_size(struct lruvec *lruvec, struct pglist_data *pgdat = lruvec_pgdat(lruvec); __mod_node_page_state(pgdat, NR_LRU_BASE + lru, nr_pages); - __mod_zone_page_state(&pgdat->node_zones[zid], - NR_ZONE_LRU_BASE + !!is_file_lru(lru), - nr_pages); + acct_highmem_file_pages(zid, lru, nr_pages); } static __always_inline void update_lru_size(struct lruvec *lruvec, diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index bd33e6f1bed0..a3b7f45aac56 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -110,10 +110,6 @@ struct zone_padding { enum zone_stat_item { /* First 128 byte cacheline (assuming 64 bit words) */ NR_FREE_PAGES, - NR_ZONE_LRU_BASE, /* Used only for compaction and reclaim retry */ - NR_ZONE_LRU_ANON = NR_ZONE_LRU_BASE, - NR_ZONE_LRU_FILE, - NR_ZONE_WRITE_PENDING, /* Count of dirty, writeback and unstable pages */ NR_MLOCK, /* mlock()ed pages found and moved off LRU */ NR_SLAB_RECLAIMABLE, NR_SLAB_UNRECLAIMABLE, diff --git a/include/linux/swap.h b/include/linux/swap.h index b17cc4830fa6..cc753c639e3d 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -307,7 +307,6 @@ extern void lru_cache_add_active_or_unevictable(struct page *page, struct vm_area_struct *vma); /* linux/mm/vmscan.c */ -extern unsigned long zone_reclaimable_pages(struct zone *zone); extern unsigned long pgdat_reclaimable_pages(struct pglist_data *pgdat); extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order, gfp_t gfp_mask, nodemask_t *mask); -- cgit v1.2.3 From 7ee36a14f06cc937f6b2c2932c2e48f590970581 Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Thu, 28 Jul 2016 15:47:17 -0700 Subject: mm, vmscan: Update all zone LRU sizes before updating memcg Minchan Kim reported setting the following warning on a 32-bit system although it can affect 64-bit systems. WARNING: CPU: 4 PID: 1322 at mm/memcontrol.c:998 mem_cgroup_update_lru_size+0x103/0x110 mem_cgroup_update_lru_size(f44b4000, 1, -7): zid 1 lru_size 1 but empty Modules linked in: CPU: 4 PID: 1322 Comm: cp Not tainted 4.7.0-rc4-mm1+ #143 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 Call Trace: dump_stack+0x76/0xaf __warn+0xea/0x110 ? mem_cgroup_update_lru_size+0x103/0x110 warn_slowpath_fmt+0x3b/0x40 mem_cgroup_update_lru_size+0x103/0x110 isolate_lru_pages.isra.61+0x2e2/0x360 shrink_active_list+0xac/0x2a0 ? __delay+0xe/0x10 shrink_node_memcg+0x53c/0x7a0 shrink_node+0xab/0x2a0 do_try_to_free_pages+0xc6/0x390 try_to_free_pages+0x245/0x590 LRU list contents and counts are updated separately. Counts are updated before pages are added to the LRU and updated after pages are removed. The warning above is from a check in mem_cgroup_update_lru_size that ensures that list sizes of zero are empty. The problem is that node-lru needs to account for highmem pages if CONFIG_HIGHMEM is set. One impact of the implementation is that the sizes are updated in multiple passes when pages from multiple zones were isolated. This happens whether HIGHMEM is set or not. When multiple zones are isolated, it's possible for a debugging check in memcg to be tripped. This patch forces all the zone counts to be updated before the memcg function is called. Link: http://lkml.kernel.org/r/1468588165-12461-6-git-send-email-mgorman@techsingularity.net Signed-off-by: Mel Gorman Tested-by: Minchan Kim Reported-by: Minchan Kim Acked-by: Minchan Kim Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/memcontrol.h | 2 +- include/linux/mm_inline.h | 5 ++--- 2 files changed, 3 insertions(+), 4 deletions(-) (limited to 'include/linux') diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index b759827b2f1e..5147e650287a 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -430,7 +430,7 @@ static inline bool mem_cgroup_online(struct mem_cgroup *memcg) int mem_cgroup_select_victim_node(struct mem_cgroup *memcg); void mem_cgroup_update_lru_size(struct lruvec *lruvec, enum lru_list lru, - enum zone_type zid, int nr_pages); + int nr_pages); unsigned long mem_cgroup_node_nr_lru_pages(struct mem_cgroup *memcg, int nid, unsigned int lru_mask); diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h index dd22b08c47be..bcc4ed07fa90 100644 --- a/include/linux/mm_inline.h +++ b/include/linux/mm_inline.h @@ -52,10 +52,9 @@ static __always_inline void update_lru_size(struct lruvec *lruvec, enum lru_list lru, enum zone_type zid, int nr_pages) { -#ifdef CONFIG_MEMCG - mem_cgroup_update_lru_size(lruvec, lru, zid, nr_pages); -#else __update_lru_size(lruvec, lru, zid, nr_pages); +#ifdef CONFIG_MEMCG + mem_cgroup_update_lru_size(lruvec, lru, nr_pages); #endif } -- cgit v1.2.3 From 71c799f4982d340fff86e751898841322f07f235 Mon Sep 17 00:00:00 2001 From: Minchan Kim Date: Thu, 28 Jul 2016 15:47:26 -0700 Subject: mm: add per-zone lru list stat When I did stress test with hackbench, I got OOM message frequently which didn't ever happen in zone-lru. gfp_mask=0x26004c0(GFP_KERNEL|__GFP_REPEAT|__GFP_NOTRACK), order=0 .. .. __alloc_pages_nodemask+0xe52/0xe60 ? new_slab+0x39c/0x3b0 new_slab+0x39c/0x3b0 ___slab_alloc.constprop.87+0x6da/0x840 ? __alloc_skb+0x3c/0x260 ? _raw_spin_unlock_irq+0x27/0x60 ? trace_hardirqs_on_caller+0xec/0x1b0 ? finish_task_switch+0xa6/0x220 ? poll_select_copy_remaining+0x140/0x140 __slab_alloc.isra.81.constprop.86+0x40/0x6d ? __alloc_skb+0x3c/0x260 kmem_cache_alloc+0x22c/0x260 ? __alloc_skb+0x3c/0x260 __alloc_skb+0x3c/0x260 alloc_skb_with_frags+0x4e/0x1a0 sock_alloc_send_pskb+0x16a/0x1b0 ? wait_for_unix_gc+0x31/0x90 ? alloc_set_pte+0x2ad/0x310 unix_stream_sendmsg+0x28d/0x340 sock_sendmsg+0x2d/0x40 sock_write_iter+0x6c/0xc0 __vfs_write+0xc0/0x120 vfs_write+0x9b/0x1a0 ? __might_fault+0x49/0xa0 SyS_write+0x44/0x90 do_fast_syscall_32+0xa6/0x1e0 sysenter_past_esp+0x45/0x74 Mem-Info: active_anon:104698 inactive_anon:105791 isolated_anon:192 active_file:433 inactive_file:283 isolated_file:22 unevictable:0 dirty:0 writeback:296 unstable:0 slab_reclaimable:6389 slab_unreclaimable:78927 mapped:474 shmem:0 pagetables:101426 bounce:0 free:10518 free_pcp:334 free_cma:0 Node 0 active_anon:418792kB inactive_anon:423164kB active_file:1732kB inactive_file:1132kB unevictable:0kB isolated(anon):768kB isolated(file):88kB mapped:1896kB dirty:0kB writeback:1184kB shmem:0kB writeback_tmp:0kB unstable:0kB pages_scanned:1478632 all_unreclaimable? yes DMA free:3304kB min:68kB low:84kB high:100kB present:15992kB managed:15916kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:4088kB kernel_stack:0kB pagetables:2480kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB lowmem_reserve[]: 0 809 1965 1965 Normal free:3436kB min:3604kB low:4504kB high:5404kB present:897016kB managed:858460kB mlocked:0kB slab_reclaimable:25556kB slab_unreclaimable:311712kB kernel_stack:164608kB pagetables:30844kB bounce:0kB free_pcp:620kB local_pcp:104kB free_cma:0kB lowmem_reserve[]: 0 0 9247 9247 HighMem free:33808kB min:512kB low:1796kB high:3080kB present:1183736kB managed:1183736kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:372252kB bounce:0kB free_pcp:428kB local_pcp:72kB free_cma:0kB lowmem_reserve[]: 0 0 0 0 DMA: 2*4kB (UM) 2*8kB (UM) 0*16kB 1*32kB (U) 1*64kB (U) 2*128kB (UM) 1*256kB (U) 1*512kB (M) 0*1024kB 1*2048kB (U) 0*4096kB = 3192kB Normal: 33*4kB (MH) 79*8kB (ME) 11*16kB (M) 4*32kB (M) 2*64kB (ME) 2*128kB (EH) 7*256kB (EH) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 3244kB HighMem: 2590*4kB (UM) 1568*8kB (UM) 491*16kB (UM) 60*32kB (UM) 6*64kB (M) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 33064kB Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB 25121 total pagecache pages 24160 pages in swap cache Swap cache stats: add 86371, delete 62211, find 42865/60187 Free swap = 4015560kB Total swap = 4192252kB 524186 pages RAM 295934 pages HighMem/MovableOnly 9658 pages reserved 0 pages cma reserved The order-0 allocation for normal zone failed while there are a lot of reclaimable memory(i.e., anonymous memory with free swap). I wanted to analyze the problem but it was hard because we removed per-zone lru stat so I couldn't know how many of anonymous memory there are in normal/dma zone. When we investigate OOM problem, reclaimable memory count is crucial stat to find a problem. Without it, it's hard to parse the OOM message so I believe we should keep it. With per-zone lru stat, gfp_mask=0x26004c0(GFP_KERNEL|__GFP_REPEAT|__GFP_NOTRACK), order=0 Mem-Info: active_anon:101103 inactive_anon:102219 isolated_anon:0 active_file:503 inactive_file:544 isolated_file:0 unevictable:0 dirty:0 writeback:34 unstable:0 slab_reclaimable:6298 slab_unreclaimable:74669 mapped:863 shmem:0 pagetables:100998 bounce:0 free:23573 free_pcp:1861 free_cma:0 Node 0 active_anon:404412kB inactive_anon:409040kB active_file:2012kB inactive_file:2176kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:3452kB dirty:0kB writeback:136kB shmem:0kB writeback_tmp:0kB unstable:0kB pages_scanned:1320845 all_unreclaimable? yes DMA free:3296kB min:68kB low:84kB high:100kB active_anon:5540kB inactive_anon:0kB active_file:0kB inactive_file:0kB present:15992kB managed:15916kB mlocked:0kB slab_reclaimable:248kB slab_unreclaimable:2628kB kernel_stack:792kB pagetables:2316kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB lowmem_reserve[]: 0 809 1965 1965 Normal free:3600kB min:3604kB low:4504kB high:5404kB active_anon:86304kB inactive_anon:0kB active_file:160kB inactive_file:376kB present:897016kB managed:858524kB mlocked:0kB slab_reclaimable:24944kB slab_unreclaimable:296048kB kernel_stack:163832kB pagetables:35892kB bounce:0kB free_pcp:3076kB local_pcp:656kB free_cma:0kB lowmem_reserve[]: 0 0 9247 9247 HighMem free:86156kB min:512kB low:1796kB high:3080kB active_anon:312852kB inactive_anon:410024kB active_file:1924kB inactive_file:2012kB present:1183736kB managed:1183736kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:365784kB bounce:0kB free_pcp:3868kB local_pcp:720kB free_cma:0kB lowmem_reserve[]: 0 0 0 0 DMA: 8*4kB (UM) 8*8kB (UM) 4*16kB (M) 2*32kB (UM) 2*64kB (UM) 1*128kB (M) 3*256kB (UME) 2*512kB (UE) 1*1024kB (E) 0*2048kB 0*4096kB = 3296kB Normal: 240*4kB (UME) 160*8kB (UME) 23*16kB (ME) 3*32kB (UE) 3*64kB (UME) 2*128kB (ME) 1*256kB (U) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 3408kB HighMem: 10942*4kB (UM) 3102*8kB (UM) 866*16kB (UM) 76*32kB (UM) 11*64kB (UM) 4*128kB (UM) 1*256kB (M) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 86344kB Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB 54409 total pagecache pages 53215 pages in swap cache Swap cache stats: add 300982, delete 247765, find 157978/226539 Free swap = 3803244kB Total swap = 4192252kB 524186 pages RAM 295934 pages HighMem/MovableOnly 9642 pages reserved 0 pages cma reserved With that, we can see normal zone has a 86M reclaimable memory so we can know something goes wrong(I will fix the problem in next patch) in reclaim. [mgorman@techsingularity.net: rename zone LRU stats in /proc/vmstat] Link: http://lkml.kernel.org/r/20160725072300.GK10438@techsingularity.net Link: http://lkml.kernel.org/r/1469110261-7365-2-git-send-email-mgorman@techsingularity.net Signed-off-by: Minchan Kim Signed-off-by: Mel Gorman Acked-by: Johannes Weiner Cc: Michal Hocko Cc: Vlastimil Babka Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/mm_inline.h | 2 ++ include/linux/mmzone.h | 6 ++++++ 2 files changed, 8 insertions(+) (limited to 'include/linux') diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h index bcc4ed07fa90..9cc130f5feb2 100644 --- a/include/linux/mm_inline.h +++ b/include/linux/mm_inline.h @@ -45,6 +45,8 @@ static __always_inline void __update_lru_size(struct lruvec *lruvec, struct pglist_data *pgdat = lruvec_pgdat(lruvec); __mod_node_page_state(pgdat, NR_LRU_BASE + lru, nr_pages); + __mod_zone_page_state(&pgdat->node_zones[zid], + NR_ZONE_LRU_BASE + lru, nr_pages); acct_highmem_file_pages(zid, lru, nr_pages); } diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index a3b7f45aac56..1a813ad335f4 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -110,6 +110,12 @@ struct zone_padding { enum zone_stat_item { /* First 128 byte cacheline (assuming 64 bit words) */ NR_FREE_PAGES, + NR_ZONE_LRU_BASE, /* Used only for compaction and reclaim retry */ + NR_ZONE_INACTIVE_ANON = NR_ZONE_LRU_BASE, + NR_ZONE_ACTIVE_ANON, + NR_ZONE_INACTIVE_FILE, + NR_ZONE_ACTIVE_FILE, + NR_ZONE_UNEVICTABLE, NR_MLOCK, /* mlock()ed pages found and moved off LRU */ NR_SLAB_RECLAIMABLE, NR_SLAB_UNRECLAIMABLE, -- cgit v1.2.3 From bb4cc2bea6df7854d629bff114ca03237cc718d6 Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Thu, 28 Jul 2016 15:47:29 -0700 Subject: mm, vmscan: remove highmem_file_pages With the reintroduction of per-zone LRU stats, highmem_file_pages is redundant so remove it. [mgorman@techsingularity.net: wrong stat is being accumulated in highmem_dirtyable_memory] Link: http://lkml.kernel.org/r/20160725092324.GM10438@techsingularity.netLink: http://lkml.kernel.org/r/1469110261-7365-3-git-send-email-mgorman@techsingularity.net Signed-off-by: Mel Gorman Acked-by: Johannes Weiner Cc: Minchan Kim Cc: Michal Hocko Cc: Vlastimil Babka Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/mm_inline.h | 17 ----------------- 1 file changed, 17 deletions(-) (limited to 'include/linux') diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h index 9cc130f5feb2..71613e8a720f 100644 --- a/include/linux/mm_inline.h +++ b/include/linux/mm_inline.h @@ -4,22 +4,6 @@ #include #include -#ifdef CONFIG_HIGHMEM -extern atomic_t highmem_file_pages; - -static inline void acct_highmem_file_pages(int zid, enum lru_list lru, - int nr_pages) -{ - if (is_highmem_idx(zid) && is_file_lru(lru)) - atomic_add(nr_pages, &highmem_file_pages); -} -#else -static inline void acct_highmem_file_pages(int zid, enum lru_list lru, - int nr_pages) -{ -} -#endif - /** * page_is_file_cache - should the page be on a file LRU or anon LRU? * @page: the page to test @@ -47,7 +31,6 @@ static __always_inline void __update_lru_size(struct lruvec *lruvec, __mod_node_page_state(pgdat, NR_LRU_BASE + lru, nr_pages); __mod_zone_page_state(&pgdat->node_zones[zid], NR_ZONE_LRU_BASE + lru, nr_pages); - acct_highmem_file_pages(zid, lru, nr_pages); } static __always_inline void update_lru_size(struct lruvec *lruvec, -- cgit v1.2.3 From 5a1c84b404a7176b8b36e2a0041b6f0adb3151a3 Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Thu, 28 Jul 2016 15:47:31 -0700 Subject: mm: remove reclaim and compaction retry approximations If per-zone LRU accounting is available then there is no point approximating whether reclaim and compaction should retry based on pgdat statistics. This is effectively a revert of "mm, vmstat: remove zone and node double accounting by approximating retries" with the difference that inactive/active stats are still available. This preserves the history of why the approximation was retried and why it had to be reverted to handle OOM kills on 32-bit systems. Link: http://lkml.kernel.org/r/1469110261-7365-4-git-send-email-mgorman@techsingularity.net Signed-off-by: Mel Gorman Acked-by: Johannes Weiner Acked-by: Minchan Kim Cc: Michal Hocko Cc: Vlastimil Babka Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/mmzone.h | 1 + include/linux/swap.h | 1 + 2 files changed, 2 insertions(+) (limited to 'include/linux') diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 1a813ad335f4..ca0fbc483441 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -116,6 +116,7 @@ enum zone_stat_item { NR_ZONE_INACTIVE_FILE, NR_ZONE_ACTIVE_FILE, NR_ZONE_UNEVICTABLE, + NR_ZONE_WRITE_PENDING, /* Count of dirty, writeback and unstable pages */ NR_MLOCK, /* mlock()ed pages found and moved off LRU */ NR_SLAB_RECLAIMABLE, NR_SLAB_UNRECLAIMABLE, diff --git a/include/linux/swap.h b/include/linux/swap.h index cc753c639e3d..b17cc4830fa6 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -307,6 +307,7 @@ extern void lru_cache_add_active_or_unevictable(struct page *page, struct vm_area_struct *vma); /* linux/mm/vmscan.c */ +extern unsigned long zone_reclaimable_pages(struct zone *zone); extern unsigned long pgdat_reclaimable_pages(struct pglist_data *pgdat); extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order, gfp_t gfp_mask, nodemask_t *mask); -- cgit v1.2.3 From 319904ad40d413a0e721ed35fcb2374657db6aaa Mon Sep 17 00:00:00 2001 From: Huang Ying Date: Thu, 28 Jul 2016 15:48:03 -0700 Subject: mm, THP: clean up return value of madvise_free_huge_pmd The definition of return value of madvise_free_huge_pmd is not clear before. According to the suggestion of Minchan Kim, change the type of return value to bool and return true if we do MADV_FREE successfully on entire pmd page, otherwise, return false. Comments are added too. Link: http://lkml.kernel.org/r/1467135452-16688-2-git-send-email-ying.huang@intel.com Signed-off-by: "Huang, Ying" Acked-by: Minchan Kim Cc: "Kirill A. Shutemov" Cc: Jerome Marchand Cc: Vlastimil Babka Cc: Dan Williams Cc: Mel Gorman Cc: Andrea Arcangeli Cc: Ebru Akagunduz Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/huge_mm.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'include/linux') diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index 92ce91c03cd0..6f14de45b5ce 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -11,7 +11,7 @@ extern struct page *follow_trans_huge_pmd(struct vm_area_struct *vma, unsigned long addr, pmd_t *pmd, unsigned int flags); -extern int madvise_free_huge_pmd(struct mmu_gather *tlb, +extern bool madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma, pmd_t *pmd, unsigned long addr, unsigned long next); extern int zap_huge_pmd(struct mmu_gather *tlb, -- cgit v1.2.3 From 11db04864336f20e19e16b64ade781eeefc3f6d3 Mon Sep 17 00:00:00 2001 From: Dan Williams Date: Thu, 28 Jul 2016 15:48:11 -0700 Subject: mm: cleanup ifdef guards for vmem_altmap Now that ZONE_DEVICE depends on SPARSEMEM_VMEMMAP we can simplify some ifdef guards to just ZONE_DEVICE. Link: http://lkml.kernel.org/r/146687646788.39261.8020536391978771940.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Dan Williams Reported-by: Vlastimil Babka Cc: Eric Sandeen Cc: Jeff Moyer Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/memremap.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'include/linux') diff --git a/include/linux/memremap.h b/include/linux/memremap.h index bcaa634139a9..93416196ba64 100644 --- a/include/linux/memremap.h +++ b/include/linux/memremap.h @@ -26,7 +26,7 @@ struct vmem_altmap { unsigned long vmem_altmap_offset(struct vmem_altmap *altmap); void vmem_altmap_free(struct vmem_altmap *altmap, unsigned long nr_pfns); -#if defined(CONFIG_SPARSEMEM_VMEMMAP) && defined(CONFIG_ZONE_DEVICE) +#ifdef CONFIG_ZONE_DEVICE struct vmem_altmap *to_vmem_altmap(unsigned long memmap_start); #else static inline struct vmem_altmap *to_vmem_altmap(unsigned long memmap_start) -- cgit v1.2.3 From d30dd8be06a5ae640766b20ea9ae288832bd12ac Mon Sep 17 00:00:00 2001 From: Andy Lutomirski Date: Thu, 28 Jul 2016 15:48:14 -0700 Subject: mm: track NR_KERNEL_STACK in KiB instead of number of stacks Currently, NR_KERNEL_STACK tracks the number of kernel stacks in a zone. This only makes sense if each kernel stack exists entirely in one zone, and allowing vmapped stacks could break this assumption. Since frv has THREAD_SIZE < PAGE_SIZE, we need to track kernel stack allocations in a unit that divides both THREAD_SIZE and PAGE_SIZE on all architectures. Keep it simple and use KiB. Link: http://lkml.kernel.org/r/083c71e642c5fa5f1b6898902e1b2db7b48940d4.1468523549.git.luto@kernel.org Signed-off-by: Andy Lutomirski Cc: Vladimir Davydov Acked-by: Johannes Weiner Cc: Michal Hocko Reviewed-by: Josh Poimboeuf Reviewed-by: Vladimir Davydov Acked-by: Michal Hocko Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/mmzone.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'include/linux') diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index ca0fbc483441..f2e4e90621ec 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -121,7 +121,7 @@ enum zone_stat_item { NR_SLAB_RECLAIMABLE, NR_SLAB_UNRECLAIMABLE, NR_PAGETABLE, /* used for pagetables */ - NR_KERNEL_STACK, + NR_KERNEL_STACK_KB, /* measured in KiB */ /* Second 128 byte cacheline */ NR_BOUNCE, #if IS_ENABLED(CONFIG_ZSMALLOC) -- cgit v1.2.3 From efdc94907977d2db84b4b00cb9bd98ca011f6819 Mon Sep 17 00:00:00 2001 From: Andy Lutomirski Date: Thu, 28 Jul 2016 15:48:17 -0700 Subject: mm: fix memcg stack accounting for sub-page stacks We should account for stacks regardless of stack size, and we need to account in sub-page units if THREAD_SIZE < PAGE_SIZE. Change the units to kilobytes and Move it into account_kernel_stack(). Fixes: 12580e4b54ba8 ("mm: memcontrol: report kernel stack usage in cgroup2 memory.stat") Link: http://lkml.kernel.org/r/9b5314e3ee5eda61b0317ec1563768602c1ef438.1468523549.git.luto@kernel.org Signed-off-by: Andy Lutomirski Cc: Vladimir Davydov Acked-by: Johannes Weiner Cc: Michal Hocko Reviewed-by: Josh Poimboeuf Reviewed-by: Vladimir Davydov Acked-by: Michal Hocko Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/memcontrol.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'include/linux') diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 5147e650287a..5d8ca6e02e39 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -52,7 +52,7 @@ enum mem_cgroup_stat_index { MEM_CGROUP_STAT_SWAP, /* # of pages, swapped out */ MEM_CGROUP_STAT_NSTATS, /* default hierarchy stats */ - MEMCG_KERNEL_STACK = MEM_CGROUP_STAT_NSTATS, + MEMCG_KERNEL_STACK_KB = MEM_CGROUP_STAT_NSTATS, MEMCG_SLAB_RECLAIMABLE, MEMCG_SLAB_UNRECLAIMABLE, MEMCG_SOCK, -- cgit v1.2.3 From e558af65be65713ef2e8b2aa637c6263caeed172 Mon Sep 17 00:00:00 2001 From: Andy Lutomirski Date: Thu, 28 Jul 2016 15:48:20 -0700 Subject: kdb: use task_cpu() instead of task_thread_info()->cpu We'll need this cleanup to make the cpu field in thread_info be optional. Link: http://lkml.kernel.org/r/da298328dc77ea494576c2f20a934218e758a6fa.1468523549.git.luto@kernel.org Signed-off-by: Andy Lutomirski Cc: Jason Wessel Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/kdb.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'include/linux') diff --git a/include/linux/kdb.h b/include/linux/kdb.h index a19bcf9e762e..410decacff8f 100644 --- a/include/linux/kdb.h +++ b/include/linux/kdb.h @@ -177,7 +177,7 @@ extern int kdb_get_kbd_char(void); static inline int kdb_process_cpu(const struct task_struct *p) { - unsigned int cpu = task_thread_info(p)->cpu; + unsigned int cpu = task_cpu(p); if (cpu > num_possible_cpus()) cpu = 0; return cpu; -- cgit v1.2.3 From a571d4eb55d83ff538d98870fa8a8497b24d39bc Mon Sep 17 00:00:00 2001 From: Dennis Chen Date: Thu, 28 Jul 2016 15:48:26 -0700 Subject: mm/memblock.c: add new infrastructure to address the mem limit issue In some cases, memblock is queried by kernel to determine whether a specified address is RAM or not. For example, the ACPI core needs this information to determine which attributes to use when mapping ACPI regions(acpi_os_ioremap). Use of incorrect memory types can result in faults, data corruption, or other issues. Removing memory with memblock_enforce_memory_limit() throws away this information, and so a kernel booted with 'mem=' may suffer from the issues described above. To avoid this, we need to keep those NOMAP regions instead of removing all above the limit, which preserves the information we need while preventing other use of those regions. This patch adds new infrastructure to retain all NOMAP memblock regions while removing others, to cater for this. Link: http://lkml.kernel.org/r/1468475036-5852-2-git-send-email-dennis.chen@arm.com Signed-off-by: Dennis Chen Acked-by: Steve Capper Cc: Catalin Marinas Cc: Ard Biesheuvel Cc: Pekka Enberg Cc: Mel Gorman Cc: Tang Chen Cc: Tony Luck Cc: Ingo Molnar Cc: Rafael J. Wysocki Cc: Will Deacon Cc: Mark Rutland Cc: Matt Fleming Cc: Kaly Xin Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/memblock.h | 1 + 1 file changed, 1 insertion(+) (limited to 'include/linux') diff --git a/include/linux/memblock.h b/include/linux/memblock.h index 6c14b6179727..2925da23505d 100644 --- a/include/linux/memblock.h +++ b/include/linux/memblock.h @@ -332,6 +332,7 @@ phys_addr_t memblock_mem_size(unsigned long limit_pfn); phys_addr_t memblock_start_of_DRAM(void); phys_addr_t memblock_end_of_DRAM(void); void memblock_enforce_memory_limit(phys_addr_t memory_limit); +void memblock_mem_limit_remove_map(phys_addr_t limit); bool memblock_is_memory(phys_addr_t addr); int memblock_is_map_memory(phys_addr_t addr); int memblock_is_region_memory(phys_addr_t base, phys_addr_t size); -- cgit v1.2.3 From c146a2b98eb5898eb0fab15a332257a4102ecae9 Mon Sep 17 00:00:00 2001 From: Alexander Potapenko Date: Thu, 28 Jul 2016 15:49:04 -0700 Subject: mm, kasan: account for object redzone in SLUB's nearest_obj() When looking up the nearest SLUB object for a given address, correctly calculate its offset if SLAB_RED_ZONE is enabled for that cache. Previously, when KASAN had detected an error on an object from a cache with SLAB_RED_ZONE set, the actual start address of the object was miscalculated, which led to random stacks having been reported. When looking up the nearest SLUB object for a given address, correctly calculate its offset if SLAB_RED_ZONE is enabled for that cache. Fixes: 7ed2f9e663854db ("mm, kasan: SLAB support") Link: http://lkml.kernel.org/r/1468347165-41906-2-git-send-email-glider@google.com Signed-off-by: Alexander Potapenko Cc: Andrey Konovalov Cc: Christoph Lameter Cc: Dmitry Vyukov Cc: Steven Rostedt (Red Hat) Cc: Joonsoo Kim Cc: Kostya Serebryany Cc: Andrey Ryabinin Cc: Kuthonuzo Luruo Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/slub_def.h | 10 ++++++---- 1 file changed, 6 insertions(+), 4 deletions(-) (limited to 'include/linux') diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h index 5624c1f3eb0a..cf501cf8e6db 100644 --- a/include/linux/slub_def.h +++ b/include/linux/slub_def.h @@ -119,15 +119,17 @@ static inline void sysfs_slab_remove(struct kmem_cache *s) void object_err(struct kmem_cache *s, struct page *page, u8 *object, char *reason); +void *fixup_red_left(struct kmem_cache *s, void *p); + static inline void *nearest_obj(struct kmem_cache *cache, struct page *page, void *x) { void *object = x - (x - page_address(page)) % cache->size; void *last_object = page_address(page) + (page->objects - 1) * cache->size; - if (unlikely(object > last_object)) - return last_object; - else - return object; + void *result = (unlikely(object > last_object)) ? last_object : object; + + result = fixup_red_left(cache, result); + return result; } #endif /* _LINUX_SLUB_DEF_H */ -- cgit v1.2.3 From 80a9201a5965f4715d5c09790862e0df84ce0614 Mon Sep 17 00:00:00 2001 From: Alexander Potapenko Date: Thu, 28 Jul 2016 15:49:07 -0700 Subject: mm, kasan: switch SLUB to stackdepot, enable memory quarantine for SLUB For KASAN builds: - switch SLUB allocator to using stackdepot instead of storing the allocation/deallocation stacks in the objects; - change the freelist hook so that parts of the freelist can be put into the quarantine. [aryabinin@virtuozzo.com: fixes] Link: http://lkml.kernel.org/r/1468601423-28676-1-git-send-email-aryabinin@virtuozzo.com Link: http://lkml.kernel.org/r/1468347165-41906-3-git-send-email-glider@google.com Signed-off-by: Alexander Potapenko Cc: Andrey Konovalov Cc: Christoph Lameter Cc: Dmitry Vyukov Cc: Steven Rostedt (Red Hat) Cc: Joonsoo Kim Cc: Kostya Serebryany Cc: Andrey Ryabinin Cc: Kuthonuzo Luruo Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/kasan.h | 2 ++ include/linux/slab_def.h | 3 ++- include/linux/slub_def.h | 4 ++++ 3 files changed, 8 insertions(+), 1 deletion(-) (limited to 'include/linux') diff --git a/include/linux/kasan.h b/include/linux/kasan.h index ac4b3c46a84d..c9cf374445d8 100644 --- a/include/linux/kasan.h +++ b/include/linux/kasan.h @@ -77,6 +77,7 @@ void kasan_free_shadow(const struct vm_struct *vm); size_t ksize(const void *); static inline void kasan_unpoison_slab(const void *ptr) { ksize(ptr); } +size_t kasan_metadata_size(struct kmem_cache *cache); #else /* CONFIG_KASAN */ @@ -121,6 +122,7 @@ static inline int kasan_module_alloc(void *addr, size_t size) { return 0; } static inline void kasan_free_shadow(const struct vm_struct *vm) {} static inline void kasan_unpoison_slab(const void *ptr) { } +static inline size_t kasan_metadata_size(struct kmem_cache *cache) { return 0; } #endif /* CONFIG_KASAN */ diff --git a/include/linux/slab_def.h b/include/linux/slab_def.h index 339ba027ade9..4ad2c5a26399 100644 --- a/include/linux/slab_def.h +++ b/include/linux/slab_def.h @@ -88,7 +88,8 @@ struct kmem_cache { }; static inline void *nearest_obj(struct kmem_cache *cache, struct page *page, - void *x) { + void *x) +{ void *object = x - (x - page->s_mem) % cache->size; void *last_object = page->s_mem + (cache->num - 1) * cache->size; diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h index cf501cf8e6db..75f56c2ef2d4 100644 --- a/include/linux/slub_def.h +++ b/include/linux/slub_def.h @@ -104,6 +104,10 @@ struct kmem_cache { unsigned int *random_seq; #endif +#ifdef CONFIG_KASAN + struct kasan_cache kasan_info; +#endif + struct kmem_cache_node *node[MAX_NUMNODES]; }; -- cgit v1.2.3 From 2516035499b9555f6acd373c9f12e44bcb50dbec Mon Sep 17 00:00:00 2001 From: Vlastimil Babka Date: Thu, 28 Jul 2016 15:49:25 -0700 Subject: mm, thp: remove __GFP_NORETRY from khugepaged and madvised allocations After the previous patch, we can distinguish costly allocations that should be really lightweight, such as THP page faults, with __GFP_NORETRY. This means we don't need to recognize khugepaged allocations via PF_KTHREAD anymore. We can also change THP page faults in areas where madvise(MADV_HUGEPAGE) was used to try as hard as khugepaged, as the process has indicated that it benefits from THP's and is willing to pay some initial latency costs. We can also make the flags handling less cryptic by distinguishing GFP_TRANSHUGE_LIGHT (no reclaim at all, default mode in page fault) from GFP_TRANSHUGE (only direct reclaim, khugepaged default). Adding __GFP_NORETRY or __GFP_KSWAPD_RECLAIM is done where needed. The patch effectively changes the current GFP_TRANSHUGE users as follows: * get_huge_zero_page() - the zero page lifetime should be relatively long and it's shared by multiple users, so it's worth spending some effort on it. We use GFP_TRANSHUGE, and __GFP_NORETRY is not added. This also restores direct reclaim to this allocation, which was unintentionally removed by commit e4a49efe4e7e ("mm: thp: set THP defrag by default to madvise and add a stall-free defrag option") * alloc_hugepage_khugepaged_gfpmask() - this is khugepaged, so latency is not an issue. So if khugepaged "defrag" is enabled (the default), do reclaim via GFP_TRANSHUGE without __GFP_NORETRY. We can remove the PF_KTHREAD check from page alloc. As a side-effect, khugepaged will now no longer check if the initial compaction was deferred or contended. This is OK, as khugepaged sleep times between collapsion attempts are long enough to prevent noticeable disruption, so we should allow it to spend some effort. * migrate_misplaced_transhuge_page() - already was masking out __GFP_RECLAIM, so just convert to GFP_TRANSHUGE_LIGHT which is equivalent. * alloc_hugepage_direct_gfpmask() - vma's with VM_HUGEPAGE (via madvise) are now allocating without __GFP_NORETRY. Other vma's keep using __GFP_NORETRY if direct reclaim/compaction is at all allowed (by default it's allowed only for madvised vma's). The rest is conversion to GFP_TRANSHUGE(_LIGHT). [mhocko@suse.com: suggested GFP_TRANSHUGE_LIGHT] Link: http://lkml.kernel.org/r/20160721073614.24395-7-vbabka@suse.cz Signed-off-by: Vlastimil Babka Acked-by: Michal Hocko Acked-by: Mel Gorman Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/gfp.h | 14 ++++++++------ 1 file changed, 8 insertions(+), 6 deletions(-) (limited to 'include/linux') diff --git a/include/linux/gfp.h b/include/linux/gfp.h index c29e9d347bc6..f8041f9de31e 100644 --- a/include/linux/gfp.h +++ b/include/linux/gfp.h @@ -237,9 +237,11 @@ struct vm_area_struct; * are expected to be movable via page reclaim or page migration. Typically, * pages on the LRU would also be allocated with GFP_HIGHUSER_MOVABLE. * - * GFP_TRANSHUGE is used for THP allocations. They are compound allocations - * that will fail quickly if memory is not available and will not wake - * kswapd on failure. + * GFP_TRANSHUGE and GFP_TRANSHUGE_LIGHT are used for THP allocations. They are + * compound allocations that will generally fail quickly if memory is not + * available and will not wake kswapd/kcompactd on failure. The _LIGHT + * version does not attempt reclaim/compaction at all and is by default used + * in page fault path, while the non-light is used by khugepaged. */ #define GFP_ATOMIC (__GFP_HIGH|__GFP_ATOMIC|__GFP_KSWAPD_RECLAIM) #define GFP_KERNEL (__GFP_RECLAIM | __GFP_IO | __GFP_FS) @@ -254,9 +256,9 @@ struct vm_area_struct; #define GFP_DMA32 __GFP_DMA32 #define GFP_HIGHUSER (GFP_USER | __GFP_HIGHMEM) #define GFP_HIGHUSER_MOVABLE (GFP_HIGHUSER | __GFP_MOVABLE) -#define GFP_TRANSHUGE ((GFP_HIGHUSER_MOVABLE | __GFP_COMP | \ - __GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN) & \ - ~__GFP_RECLAIM) +#define GFP_TRANSHUGE_LIGHT ((GFP_HIGHUSER_MOVABLE | __GFP_COMP | \ + __GFP_NOMEMALLOC | __GFP_NOWARN) & ~__GFP_RECLAIM) +#define GFP_TRANSHUGE (GFP_TRANSHUGE_LIGHT | __GFP_DIRECT_RECLAIM) /* Convert GFP flags to their corresponding migrate type */ #define GFP_MOVABLE_MASK (__GFP_RECLAIMABLE|__GFP_MOVABLE) -- cgit v1.2.3 From a5508cd83f10f663e05d212cb81f600a3af46e40 Mon Sep 17 00:00:00 2001 From: Vlastimil Babka Date: Thu, 28 Jul 2016 15:49:28 -0700 Subject: mm, compaction: introduce direct compaction priority In the context of direct compaction, for some types of allocations we would like the compaction to either succeed or definitely fail while trying as hard as possible. Current async/sync_light migration mode is insufficient, as there are heuristics such as caching scanner positions, marking pageblocks as unsuitable or deferring compaction for a zone. At least the final compaction attempt should be able to override these heuristics. To communicate how hard compaction should try, we replace migration mode with a new enum compact_priority and change the relevant function signatures. In compact_zone_order() where struct compact_control is constructed, the priority is mapped to suitable control flags. This patch itself has no functional change, as the current priority levels are mapped back to the same migration modes as before. Expanding them will be done next. Note that !CONFIG_COMPACTION variant of try_to_compact_pages() is removed, as the only caller exists under CONFIG_COMPACTION. Link: http://lkml.kernel.org/r/20160721073614.24395-8-vbabka@suse.cz Signed-off-by: Vlastimil Babka Acked-by: Michal Hocko Acked-by: Mel Gorman Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/compaction.h | 22 +++++++++++++--------- 1 file changed, 13 insertions(+), 9 deletions(-) (limited to 'include/linux') diff --git a/include/linux/compaction.h b/include/linux/compaction.h index 1a02dab16646..0980a6ce4436 100644 --- a/include/linux/compaction.h +++ b/include/linux/compaction.h @@ -1,6 +1,18 @@ #ifndef _LINUX_COMPACTION_H #define _LINUX_COMPACTION_H +/* + * Determines how hard direct compaction should try to succeed. + * Lower value means higher priority, analogically to reclaim priority. + */ +enum compact_priority { + COMPACT_PRIO_SYNC_LIGHT, + MIN_COMPACT_PRIORITY = COMPACT_PRIO_SYNC_LIGHT, + DEF_COMPACT_PRIORITY = COMPACT_PRIO_SYNC_LIGHT, + COMPACT_PRIO_ASYNC, + INIT_COMPACT_PRIORITY = COMPACT_PRIO_ASYNC +}; + /* Return values for compact_zone() and try_to_compact_pages() */ /* When adding new states, please adjust include/trace/events/compaction.h */ enum compact_result { @@ -66,7 +78,7 @@ extern int fragmentation_index(struct zone *zone, unsigned int order); extern enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order, unsigned int alloc_flags, const struct alloc_context *ac, - enum migrate_mode mode, int *contended); + enum compact_priority prio, int *contended); extern void compact_pgdat(pg_data_t *pgdat, int order); extern void reset_isolation_suitable(pg_data_t *pgdat); extern enum compact_result compaction_suitable(struct zone *zone, int order, @@ -151,14 +163,6 @@ extern void kcompactd_stop(int nid); extern void wakeup_kcompactd(pg_data_t *pgdat, int order, int classzone_idx); #else -static inline enum compact_result try_to_compact_pages(gfp_t gfp_mask, - unsigned int order, int alloc_flags, - const struct alloc_context *ac, - enum migrate_mode mode, int *contended) -{ - return COMPACT_CONTINUE; -} - static inline void compact_pgdat(pg_data_t *pgdat, int order) { } -- cgit v1.2.3 From c3486f5376696034d0fcbef8ba70c70cfcb26f51 Mon Sep 17 00:00:00 2001 From: Vlastimil Babka Date: Thu, 28 Jul 2016 15:49:30 -0700 Subject: mm, compaction: simplify contended compaction handling Async compaction detects contention either due to failing trylock on zone->lock or lru_lock, or by need_resched(). Since 1f9efdef4f3f ("mm, compaction: khugepaged should not give up due to need_resched()") the code got quite complicated to distinguish these two up to the __alloc_pages_slowpath() level, so different decisions could be taken for khugepaged allocations. After the recent changes, khugepaged allocations don't check for contended compaction anymore, so we again don't need to distinguish lock and sched contention, and simplify the current convoluted code a lot. However, I believe it's also possible to simplify even more and completely remove the check for contended compaction after the initial async compaction for costly orders, which was originally aimed at THP page fault allocations. There are several reasons why this can be done now: - with the new defaults, THP page faults no longer do reclaim/compaction at all, unless the system admin has overridden the default, or application has indicated via madvise that it can benefit from THP's. In both cases, it means that the potential extra latency is expected and worth the benefits. - even if reclaim/compaction proceeds after this patch where it previously wouldn't, the second compaction attempt is still async and will detect the contention and back off, if the contention persists - there are still heuristics like deferred compaction and pageblock skip bits in place that prevent excessive THP page fault latencies Link: http://lkml.kernel.org/r/20160721073614.24395-9-vbabka@suse.cz Signed-off-by: Vlastimil Babka Acked-by: Michal Hocko Acked-by: Mel Gorman Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/compaction.h | 13 ++----------- 1 file changed, 2 insertions(+), 11 deletions(-) (limited to 'include/linux') diff --git a/include/linux/compaction.h b/include/linux/compaction.h index 0980a6ce4436..d4e106b5dc27 100644 --- a/include/linux/compaction.h +++ b/include/linux/compaction.h @@ -55,14 +55,6 @@ enum compact_result { COMPACT_PARTIAL, }; -/* Used to signal whether compaction detected need_sched() or lock contention */ -/* No contention detected */ -#define COMPACT_CONTENDED_NONE 0 -/* Either need_sched() was true or fatal signal pending */ -#define COMPACT_CONTENDED_SCHED 1 -/* Zone lock or lru_lock was contended in async compaction */ -#define COMPACT_CONTENDED_LOCK 2 - struct alloc_context; /* in mm/internal.h */ #ifdef CONFIG_COMPACTION @@ -76,9 +68,8 @@ extern int sysctl_compact_unevictable_allowed; extern int fragmentation_index(struct zone *zone, unsigned int order); extern enum compact_result try_to_compact_pages(gfp_t gfp_mask, - unsigned int order, - unsigned int alloc_flags, const struct alloc_context *ac, - enum compact_priority prio, int *contended); + unsigned int order, unsigned int alloc_flags, + const struct alloc_context *ac, enum compact_priority prio); extern void compact_pgdat(pg_data_t *pgdat, int order); extern void reset_isolation_suitable(pg_data_t *pgdat); extern enum compact_result compaction_suitable(struct zone *zone, int order, -- cgit v1.2.3