From c73322d098e4b6f5f0f0fa1330bf57e218775539 Mon Sep 17 00:00:00 2001
From: Johannes Weiner
Date: Wed, 3 May 2017 14:51:51 -0700
Subject: mm: fix 100% CPU kswapd busyloop on unreclaimable nodes

Patch series "mm: kswapd spinning on unreclaimable nodes - fixes and
cleanups".

Jia reported a scenario in which the kswapd of a node indefinitely spins
at 100% CPU usage. We have seen similar cases at Facebook.

The kernel's current method of judging its ability to reclaim a node (or
whether to back off and sleep) is based on the amount of scanned pages
in proportion to the amount of reclaimable pages. In Jia's and our
scenarios, however, there are no reclaimable pages in the node, and the
condition for backing off is never met. Kswapd busyloops in an attempt
to restore the watermarks while having nothing to work with.

This series reworks the definition of an unreclaimable node based not on
scanning but on whether kswapd is able to actually reclaim pages in
MAX_RECLAIM_RETRIES (16) consecutive runs. This is the same criterion
the page allocator uses for giving up on direct reclaim and invoking the
OOM killer. If it cannot free any pages, kswapd will go to sleep and
leave further attempts to direct reclaim invocations, which will either
make progress and re-enable kswapd, or invoke the OOM killer.

Patch #1 fixes the immediate problem Jia reported; the remainder are
smaller fixlets, cleanups, and overall phasing out of the old method.

Patch #6 is the odd one out. It's a nice cleanup to get_scan_count(),
and directly related to #5, but in itself not relevant to the series.

If the whole series is too ambitious for 4.11, I would consider the
first three patches fixes, the rest cleanups.
This patch (of 9):

Jia He reports a problem with kswapd spinning at 100% CPU when
requesting more hugepages than memory available in the system:

$ echo 4000 >/proc/sys/vm/nr_hugepages

top - 13:42:59 up 3:37, 1 user, load average: 1.09, 1.03, 1.01
Tasks: 1 total, 1 running, 0 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.0 us, 12.5 sy, 0.0 ni, 85.5 id, 2.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem: 31371520 total, 30915136 used, 456384 free, 320 buffers
KiB Swap: 6284224 total, 115712 used, 6168512 free. 48192 cached Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
 76 root 20 0 0 0 0 R 100.0 0.000 217:17.29 kswapd3

At that time, there are no reclaimable pages left in the node, but as
kswapd fails to restore the high watermarks it refuses to go to sleep.

Kswapd needs to back away from nodes that fail to balance. Up until
commit 1d82de618ddd ("mm, vmscan: make kswapd reclaim in terms of
nodes") kswapd had such a mechanism. It considered zones whose
theoretically reclaimable pages it had reclaimed six times over as
unreclaimable and backed away from them. This guard was erroneously
removed as the patch changed the definition of a balanced node.

However, simply restoring this code wouldn't help in the case reported
here: there *are* no reclaimable pages that could be scanned until the
threshold is met. Kswapd would stay awake anyway.

Introduce a new and much simpler way of backing off. If kswapd runs
through MAX_RECLAIM_RETRIES (16) cycles without reclaiming a single
page, make it back off from the node. This is the same number of shots
direct reclaim takes before declaring OOM. Kswapd will go to sleep on
that node until a direct reclaimer manages to reclaim some pages, thus
proving the node reclaimable again.
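The back-off rule above can be modelled in a few lines of user-space C. This is a sketch under stated assumptions, not the mm/vmscan.c implementation: only kswapd_failures and MAX_RECLAIM_RETRIES (16) come from the patch, while struct node_model, kswapd_balance_run(), reclaim_made_progress() and kswapd_backs_off() are hypothetical names. It assumes, as the description says, that a kswapd run reclaiming nothing bumps the counter and that any successful reclaim on the node, including one by a direct reclaimer, resets it.

```c
#include <assert.h>
#include <stdbool.h>

/* Value quoted in the patch description. */
#define MAX_RECLAIM_RETRIES 16

/* Hypothetical stand-in for the one pg_data_t field this patch adds. */
struct node_model {
	int kswapd_failures;	/* Number of 'reclaimed == 0' runs */
};

/* Any reclaim progress on the node, whether by kswapd or by a direct
 * reclaimer, proves the node reclaimable again and resets the count. */
static void reclaim_made_progress(struct node_model *node)
{
	node->kswapd_failures = 0;
}

/* One kswapd balancing run that reclaimed nr_reclaimed pages in total. */
static void kswapd_balance_run(struct node_model *node,
			       unsigned long nr_reclaimed)
{
	if (nr_reclaimed)
		reclaim_made_progress(node);
	else
		node->kswapd_failures++;
}

/* kswapd stops retrying (and may sleep) once the node has failed
 * MAX_RECLAIM_RETRIES consecutive runs. */
static bool kswapd_backs_off(const struct node_model *node)
{
	assert(node->kswapd_failures >= 0);
	return node->kswapd_failures >= MAX_RECLAIM_RETRIES;
}
```

With this model, 15 fruitless runs leave kswapd working, the 16th makes it back off, and a single successful reclaim afterwards clears the counter so kswapd is eligible to run on the node again.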
[hannes@cmpxchg.org: check kswapd failure against the cumulative nr_reclaimed count]
Link: http://lkml.kernel.org/r/20170306162410.GB2090@cmpxchg.org
[shakeelb@google.com: fix condition for throttle_direct_reclaim]
Link: http://lkml.kernel.org/r/20170314183228.20152-1-shakeelb@google.com
Link: http://lkml.kernel.org/r/20170228214007.5621-2-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner
Signed-off-by: Shakeel Butt
Reported-by: Jia He
Tested-by: Jia He
Acked-by: Michal Hocko
Acked-by: Hillf Danton
Acked-by: Minchan Kim
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 include/linux/mmzone.h | 2 ++
 1 file changed, 2 insertions(+)

(limited to 'include')

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 8e02b3750fe0..d2c50ab6ae40 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -630,6 +630,8 @@ typedef struct pglist_data {
 	int kswapd_order;
 	enum zone_type kswapd_classzone_idx;
 
+	int kswapd_failures; /* Number of 'reclaimed == 0' runs */
+
 #ifdef CONFIG_COMPACTION
 	int kcompactd_max_order;
 	enum zone_type kcompactd_classzone_idx;
-- 
cgit v1.2.3


From c822f6223d03c2c5b026a21da09c6b6d523258cd Mon Sep 17 00:00:00 2001
From: Johannes Weiner
Date: Wed, 3 May 2017 14:52:10 -0700
Subject: mm: delete NR_PAGES_SCANNED and pgdat_reclaimable()

NR_PAGES_SCANNED counts the number of pages scanned since the last page
free event in the allocator. This was used primarily to measure the
reclaimability of zones and nodes, and determine when reclaim should
give up on them. In that role, it has been replaced in the preceding
patches by a different mechanism.

Being implemented as an efficient vmstat counter, it was automatically
exported to userspace as well. It's however unlikely that anyone
outside the kernel is using this counter in any meaningful way.

Remove the counter and the unused pgdat_reclaimable().
Link: http://lkml.kernel.org/r/20170228214007.5621-8-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner
Acked-by: Hillf Danton
Acked-by: Michal Hocko
Cc: Jia He
Cc: Mel Gorman
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 include/linux/mmzone.h | 1 -
 1 file changed, 1 deletion(-)

(limited to 'include')

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index d2c50ab6ae40..04e0969966f6 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -149,7 +149,6 @@ enum node_stat_item {
 	NR_UNEVICTABLE, /* " " " " " */
 	NR_ISOLATED_ANON, /* Temporary isolated pages from anon lru */
 	NR_ISOLATED_FILE, /* Temporary isolated pages from file lru */
-	NR_PAGES_SCANNED, /* pages scanned since last reclaim */
 	WORKINGSET_REFAULT,
 	WORKINGSET_ACTIVATE,
 	WORKINGSET_NODERECLAIM,
-- 
cgit v1.2.3


From a128ca71fb29ed4444b80f38a0148b468826e19b Mon Sep 17 00:00:00 2001
From: Shaohua Li
Date: Wed, 3 May 2017 14:52:22 -0700
Subject: mm: delete unnecessary TTU_* flags

Patch series "mm: fix some MADV_FREE issues", v5.

We are trying to use MADV_FREE in jemalloc. We found several issues;
without solving them, jemalloc can't use the MADV_FREE feature.

- Doesn't support systems without swap enabled. If swap is off, we
  can't age anonymous pages, or can't do so efficiently. And since
  MADV_FREE pages are mixed with other anonymous pages, we can't reclaim
  MADV_FREE pages. In the current implementation, MADV_FREE falls back
  to MADV_DONTNEED when swap is not enabled. But in our environment, a
  lot of machines don't enable swap. This prevents our setup from using
  MADV_FREE.

- Increases memory pressure. Page reclaim biases reclaim toward file
  pages over anonymous pages. This doesn't make sense for MADV_FREE
  pages, because those pages could be freed easily and refilled with a
  very slight penalty. Even if page reclaim didn't bias toward file
  pages, there would still be an issue, because MADV_FREE pages and
  other anonymous pages are mixed together. To reclaim a MADV_FREE
  page, we probably must scan a lot of other anonymous pages, which is
  inefficient. In our test, we usually see OOM with MADV_FREE enabled
  and none without it.

- Accounting. There are two accounting problems. There is no global
  accounting, so if the system behaves abnormally, we don't know if the
  problem comes from the MADV_FREE side. The other problem is RSS
  accounting. MADV_FREE pages are accounted as normal anon pages and
  reclaimed lazily, so the application's RSS becomes bigger. This
  confuses our workloads. We have a monitoring daemon running, and if it
  finds that an application's RSS has become abnormal, it will kill the
  application even though the kernel could reclaim the memory easily.

To address the first two issues, we can either put MADV_FREE pages into
a separate LRU list (Minchan's previous patches and V1 patches), or put
them into the LRU_INACTIVE_FILE list (suggested by Johannes). The
patchset uses the second idea. The reason is that the LRU_INACTIVE_FILE
list is tiny nowadays and should be full of used-once file pages. So we
can still efficiently reclaim MADV_FREE pages there without interfering
with other anon and active file pages. Putting the pages into the
inactive file list also has the advantage of allowing page reclaim to
prioritize MADV_FREE pages and used-once file pages. MADV_FREE pages are
put into the LRU list with their SwapBacked flag cleared, so
PageAnon(page) && !PageSwapBacked(page) indicates a MADV_FREE page.
These pages are freed directly without pageout if they are clean;
otherwise normal swap will reclaim them.

For the third issue, the previous post added global accounting and a
separate RSS count for MADV_FREE pages. The problem is that we never get
accurate accounting for MADV_FREE pages. The pages are mapped to
userspace and can be dirtied without notice from the kernel side. To get
accurate accounting, we could write-protect the page, but then there is
extra page fault overhead, which people don't want to pay.
The jemalloc developers have concerns about the inaccurate accounting,
so this post drops the accounting patches temporarily. The info exported
to /proc/pid/smaps for MADV_FREE pages is kept, which is the only place
we can get accurate accounting right now.

This patch (of 6):

Johannes pointed out TTU_LZFREE is unnecessary. It's true because we
always have the flag set if we want to do an unmap. For cases where we
don't do an unmap, the TTU_LZFREE part of the code should never run.

TTU_UNMAP is also unnecessary. If no other flags are set (for example,
TTU_MIGRATION), an unmap is implied.

The patch includes Johannes's cleanup and dead TTU_ACTION macro removal
code.

Link: http://lkml.kernel.org/r/4be3ea1bc56b26fd98a54d0a6f70bec63f6d8980.1487965799.git.shli@fb.com
Signed-off-by: Shaohua Li
Suggested-by: Johannes Weiner
Acked-by: Johannes Weiner
Acked-by: Minchan Kim
Acked-by: Hillf Danton
Acked-by: Michal Hocko
Cc: Hugh Dickins
Cc: Rik van Riel
Cc: Mel Gorman
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 include/linux/rmap.h | 22 +++++++++-------------
 1 file changed, 9 insertions(+), 13 deletions(-)

(limited to 'include')

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 8c89e902df3e..7a3941492856 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -83,19 +83,17 @@ struct anon_vma_chain {
 };
 
 enum ttu_flags {
-	TTU_UNMAP = 1, /* unmap mode */
-	TTU_MIGRATION = 2, /* migration mode */
-	TTU_MUNLOCK = 4, /* munlock mode */
-	TTU_LZFREE = 8, /* lazy free mode */
-	TTU_SPLIT_HUGE_PMD = 16, /* split huge PMD if any */
-
-	TTU_IGNORE_MLOCK = (1 << 8), /* ignore mlock */
-	TTU_IGNORE_ACCESS = (1 << 9), /* don't age */
-	TTU_IGNORE_HWPOISON = (1 << 10),/* corrupted page is recoverable */
-	TTU_BATCH_FLUSH = (1 << 11), /* Batch TLB flushes where possible
+	TTU_MIGRATION = 0x1, /* migration mode */
+	TTU_MUNLOCK = 0x2, /* munlock mode */
+
+	TTU_SPLIT_HUGE_PMD = 0x4, /* split huge PMD if any */
+	TTU_IGNORE_MLOCK = 0x8, /* ignore mlock */
+	TTU_IGNORE_ACCESS = 0x10, /* don't age */
+	TTU_IGNORE_HWPOISON = 0x20, /* corrupted page is recoverable */
+	TTU_BATCH_FLUSH = 0x40, /* Batch TLB flushes where possible
 				 * and caller guarantees they will
 				 * do a final flush if necessary */
-	TTU_RMAP_LOCKED = (1 << 12) /* do not grab rmap lock:
+	TTU_RMAP_LOCKED = 0x80 /* do not grab rmap lock:
 				 * caller holds it */
 };
 
@@ -193,8 +191,6 @@ static inline void page_dup_rmap(struct page *page, bool compound)
 int page_referenced(struct page *, int is_locked,
 			struct mem_cgroup *memcg, unsigned long *vm_flags);
 
-#define TTU_ACTION(x) ((x) & TTU_ACTION_MASK)
-
 int try_to_unmap(struct page *, enum ttu_flags flags);
 
 /* Avoid racy checks */
-- 
cgit v1.2.3


From f7ad2a6cb9f7c4040004bedee84a70a9b985583e Mon Sep 17 00:00:00 2001
From: Shaohua Li
Date: Wed, 3 May 2017 14:52:29 -0700
Subject: mm: move MADV_FREE pages into LRU_INACTIVE_FILE list

madvise()'s MADV_FREE marks pages as 'lazyfree'. They are still
anonymous pages, but they can be freed without pageout. To distinguish
these from normal anonymous pages, we clear their SwapBacked flag.

MADV_FREE pages can be freed without pageout, so they are much like
used-once file pages. For such pages, we'd like to reclaim them once
there is memory pressure. Also, it might be unfair to always reclaim
MADV_FREE pages before used-once file pages, and we definitely want to
reclaim the pages before other anonymous and file pages.

To speed up MADV_FREE page reclaim, we put the pages into the
LRU_INACTIVE_FILE list. The rationale is that the LRU_INACTIVE_FILE
list is tiny nowadays and should be full of used-once file pages.
Reclaiming MADV_FREE pages will not interfere much with anonymous and
active file pages. And the inactive file pages and MADV_FREE pages will
be reclaimed according to their age, so we don't reclaim too many
MADV_FREE pages either. Putting the MADV_FREE pages into the
LRU_INACTIVE_FILE list also means we can reclaim the pages without swap
support. This idea was suggested by Johannes.
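The classification described above can be sketched as a small user-space model. It is not the mm/swap.c or mm/vmscan.c code: struct page_model, its boolean fields and every function here are hypothetical. It encodes three rules taken from this series: a MADV_FREE page is an anonymous page whose SwapBacked flag has been cleared (PageAnon(page) && !PageSwapBacked(page)); pages without SwapBacked are aged on the inactive file LRU; and, per the later "mm: reclaim MADV_FREE pages" patch, a clean one is freed without pageout while a dirty one gets SwapBacked set again and returns to normal anonymous reclaim.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical subset of LRU lists relevant to this patch. */
enum lru_model {
	LRU_MODEL_INACTIVE_ANON,
	LRU_MODEL_INACTIVE_FILE,
};

/* Hypothetical page model: only the bits the rules above look at. */
struct page_model {
	bool anon;		/* PageAnon() */
	bool swap_backed;	/* PageSwapBacked() */
	bool dirty;		/* written to after MADV_FREE */
	enum lru_model lru;
};

static bool page_is_lazyfree(const struct page_model *p)
{
	return p->anon && !p->swap_backed;
}

/* Pages without SwapBacked are aged on the file LRU, so clearing the
 * flag is what moves a MADV_FREE page to the inactive file list. */
static enum lru_model lru_for(const struct page_model *p)
{
	return p->swap_backed ? LRU_MODEL_INACTIVE_ANON : LRU_MODEL_INACTIVE_FILE;
}

/* MADV_FREE: only ordinary swap-backed anonymous pages qualify. */
static void mark_lazyfree_model(struct page_model *p)
{
	if (!p->anon || !p->swap_backed)
		return;
	p->swap_backed = false;
	p->dirty = false;	/* model assumption: clean right after MADV_FREE */
	p->lru = lru_for(p);
}

/* Reclaim: returns true if the page can be freed without pageout. */
static bool reclaim_lazyfree_model(struct page_model *p)
{
	assert(page_is_lazyfree(p));
	if (!p->dirty)
		return true;		/* clean: contents can be discarded */
	p->swap_backed = true;		/* dirty: back to normal swapout */
	p->lru = lru_for(p);
	return false;
}
```

The dirty case shows why the later patch can simply put such a page back on the anonymous LRU: once SwapBacked is set again, the page is no longer classified as lazyfree.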
This patch doesn't move MADV_FREE pages to the LRU_INACTIVE_FILE list
yet, to avoid bisect failure; the next patch will do it.

The patch is based on Minchan's original patch.

[akpm@linux-foundation.org: coding-style fixes]
Link: http://lkml.kernel.org/r/2f87063c1e9354677b7618c647abde77b07561e5.1487965799.git.shli@fb.com
Signed-off-by: Shaohua Li
Suggested-by: Johannes Weiner
Acked-by: Johannes Weiner
Acked-by: Minchan Kim
Acked-by: Michal Hocko
Acked-by: Hillf Danton
Cc: Hugh Dickins
Cc: Rik van Riel
Cc: Mel Gorman
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 include/linux/swap.h          | 2 +-
 include/linux/vm_event_item.h | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

(limited to 'include')

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 45e91dd6716d..486494e6b2fc 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -279,7 +279,7 @@ extern void lru_add_drain_cpu(int cpu);
 extern void lru_add_drain_all(void);
 extern void rotate_reclaimable_page(struct page *page);
 extern void deactivate_file_page(struct page *page);
-extern void deactivate_page(struct page *page);
+extern void mark_page_lazyfree(struct page *page);
 extern void swap_setup(void);
 
 extern void add_page_to_unevictable_list(struct page *page);
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index a80b7b59cf33..d84ae90ccd5c 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -25,7 +25,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		FOR_ALL_ZONES(PGALLOC),
 		FOR_ALL_ZONES(ALLOCSTALL),
 		FOR_ALL_ZONES(PGSCAN_SKIP),
-		PGFREE, PGACTIVATE, PGDEACTIVATE,
+		PGFREE, PGACTIVATE, PGDEACTIVATE, PGLAZYFREE,
 		PGFAULT, PGMAJFAULT,
 		PGLAZYFREED,
 		PGREFILL,
-- 
cgit v1.2.3


From 802a3a92ad7ac0b9be9df229dee530a1f0a8039b Mon Sep 17 00:00:00 2001
From: Shaohua Li
Date: Wed, 3 May 2017 14:52:32 -0700
Subject: mm: reclaim MADV_FREE pages

When memory pressure is high, we free MADV_FREE pages. If the pages are
not dirty in the pte, they can be freed immediately. Otherwise we can't
reclaim them that way: we put the pages back on the anonymous LRU list
(by setting the SwapBacked flag), and they will be reclaimed through
normal swapout.

We use the normal page reclaim policy. Since MADV_FREE pages are put
into the inactive file list, such pages and inactive file pages are
reclaimed according to their age. This is expected, because we don't
want to reclaim too many MADV_FREE pages before used-once pages.

Based on Minchan's original patch.

[minchan@kernel.org: clean up lazyfree page handling]
Link: http://lkml.kernel.org/r/20170303025237.GB3503@bbox
Link: http://lkml.kernel.org/r/14b8eb1d3f6bf6cc492833f183ac8c304e560484.1487965799.git.shli@fb.com
Signed-off-by: Shaohua Li
Signed-off-by: Minchan Kim
Acked-by: Minchan Kim
Acked-by: Michal Hocko
Acked-by: Johannes Weiner
Acked-by: Hillf Danton
Cc: Hugh Dickins
Cc: Rik van Riel
Cc: Mel Gorman
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 include/linux/rmap.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

(limited to 'include')

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 7a3941492856..fee10d744ebd 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -298,6 +298,6 @@ static inline int page_mkclean(struct page *page)
 #define SWAP_AGAIN 1
 #define SWAP_FAIL 2
 #define SWAP_MLOCK 3
-#define SWAP_LZFREE 4
+#define SWAP_DIRTY 4
 
 #endif /* _LINUX_RMAP_H */
-- 
cgit v1.2.3


From 9a4caf1e9fa4864ce21ba9584a2c336bfbc72740 Mon Sep 17 00:00:00 2001
From: Johannes Weiner
Date: Wed, 3 May 2017 14:52:45 -0700
Subject: mm: memcontrol: provide shmem statistics

Cgroups currently don't report how much shmem they use, which can be
useful data to have, in particular since shmem is included in the
cache/file item while being reclaimed like anonymous memory.

Add a counter to track shmem pages during charging and uncharging.
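A minimal sketch of the charge/uncharge bookkeeping follows. It assumes, as an assumption of this sketch rather than something the commit text states, that a shmem page is recognised as a non-anonymous page with the SwapBacked flag set. It illustrates the point made above: shmem pages still count under CACHE, and the new SHMEM counter exposes the swap-backed share of that figure. The enum entries mirror the mem_cgroup_stat_index hunk of this commit; struct memcg_model and charge_statistics() are hypothetical, and the kernel keeps these counters per CPU rather than in a plain array.

```c
#include <assert.h>
#include <stdbool.h>

/* Subset of enum mem_cgroup_stat_index, named after this commit. */
enum stat_model {
	STAT_CACHE,	/* MEM_CGROUP_STAT_CACHE */
	STAT_RSS,	/* MEM_CGROUP_STAT_RSS */
	STAT_SHMEM,	/* MEM_CGROUP_STAT_SHMEM (new) */
	NR_STAT_MODEL,
};

/* Hypothetical per-cgroup counters. */
struct memcg_model {
	long count[NR_STAT_MODEL];
};

/* Shared by charge (nr_pages > 0) and uncharge (nr_pages < 0). */
static void charge_statistics(struct memcg_model *memcg, bool anon,
			      bool swap_backed, long nr_pages)
{
	if (anon) {
		memcg->count[STAT_RSS] += nr_pages;
		return;
	}
	/* File pages: shmem stays in CACHE but is also counted separately. */
	memcg->count[STAT_CACHE] += nr_pages;
	if (swap_backed)
		memcg->count[STAT_SHMEM] += nr_pages;
}
```

Uncharging passes a negative page count, so the same function keeps CACHE and SHMEM in step in both directions.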
Link: http://lkml.kernel.org/r/20170221164343.32252-1-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner
Reported-by: Chris Down
Cc: Michal Hocko
Cc: Vladimir Davydov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 include/linux/memcontrol.h | 1 +
 1 file changed, 1 insertion(+)

(limited to 'include')

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index bb7250c45cb8..c5ebb32fef49 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -46,6 +46,7 @@ enum mem_cgroup_stat_index {
 	MEM_CGROUP_STAT_CACHE, /* # of pages charged as cache */
 	MEM_CGROUP_STAT_RSS, /* # of pages charged as anon rss */
 	MEM_CGROUP_STAT_RSS_HUGE, /* # of pages charged as anon huge */
+	MEM_CGROUP_STAT_SHMEM, /* # of pages charged as shmem */
 	MEM_CGROUP_STAT_FILE_MAPPED, /* # of pages charged as file rss */
 	MEM_CGROUP_STAT_DIRTY, /* # of dirty pages in page cache */
 	MEM_CGROUP_STAT_WRITEBACK, /* # of pages under writeback */
-- 
cgit v1.2.3


From a6ffdc07847e74cc244c02ab6d0351a4a5d77281 Mon Sep 17 00:00:00 2001
From: Xishi Qiu
Date: Wed, 3 May 2017 14:52:52 -0700
Subject: mm: use is_migrate_highatomic() to simplify the code

Introduce two helpers, is_migrate_highatomic() and
is_migrate_highatomic_page(), to simplify the code. No functional
changes.
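The two helpers are small predicates over a migrate type. The sketch below is self-contained and makes several assumptions: only the first three enum values appear in this commit's hunk, so the MIGRATE_PCPTYPES and MIGRATE_HIGHATOMIC entries are reproduced here from memory of mmzone.h, the struct page stand-in is a toy, and pageblock_migratetype() is a hypothetical substitute for the kernel's pageblock lookup. Following the "static inlines rather than macros" trailer, both helpers are written as static inline functions.

```c
#include <assert.h>
#include <stdbool.h>

/* First entries as in the hunk; the rest are assumed for this sketch. */
enum migratetype {
	MIGRATE_UNMOVABLE,
	MIGRATE_MOVABLE,
	MIGRATE_RECLAIMABLE,
	MIGRATE_PCPTYPES,	/* number of types on the per-cpu lists */
	MIGRATE_HIGHATOMIC = MIGRATE_PCPTYPES,
	MIGRATE_TYPES
};

/* Hypothetical page stand-in that records its pageblock's type. */
struct page {
	enum migratetype pageblock_type;
};

/* Hypothetical substitute for the kernel's pageblock lookup. */
static inline enum migratetype pageblock_migratetype(const struct page *page)
{
	return page->pageblock_type;
}

static inline bool is_migrate_highatomic(enum migratetype migratetype)
{
	return migratetype == MIGRATE_HIGHATOMIC;
}

static inline bool is_migrate_highatomic_page(const struct page *page)
{
	return is_migrate_highatomic(pageblock_migratetype(page));
}
```

The page variant simply composes the pageblock lookup with the type check, so callers no longer spell out the comparison against MIGRATE_HIGHATOMIC themselves.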
[akpm@linux-foundation.org: use static inlines rather than macros, per mhocko]
Link: http://lkml.kernel.org/r/58B94F15.6060606@huawei.com
Signed-off-by: Xishi Qiu
Acked-by: Michal Hocko
Cc: Vlastimil Babka
Cc: Mel Gorman
Cc: Minchan Kim
Cc: Joonsoo Kim
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 include/linux/mmzone.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

(limited to 'include')

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 04e0969966f6..446cf68c1c09 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -35,7 +35,7 @@
  */
 #define PAGE_ALLOC_COSTLY_ORDER 3
 
-enum {
+enum migratetype {
 	MIGRATE_UNMOVABLE,
 	MIGRATE_MOVABLE,
 	MIGRATE_RECLAIMABLE,
-- 
cgit v1.2.3


From 7e7844226f1053236b6f6d5d122a06509fb14fd9 Mon Sep 17 00:00:00 2001
From: Michal Hocko
Date: Wed, 3 May 2017 14:53:09 -0700
Subject: lockdep: allow to disable reclaim lockup detection

The current implementation of the reclaim lockup detection can lead to
false positives. These do happen in practice, and they usually lead
people to tweak the code to silence lockdep by using GFP_NOFS even
though the context can use __GFP_FS just fine. See
http://lkml.kernel.org/r/20160512080321.GA18496@dastard as an example.

=================================
[ INFO: inconsistent lock state ]
4.5.0-rc2+ #4 Tainted: G O
---------------------------------
inconsistent {RECLAIM_FS-ON-R} -> {IN-RECLAIM_FS-W} usage.
kswapd0/543 [HC0[0]:SC0[0]:HE1:SE1] takes:

(&xfs_nondir_ilock_class){++++-+}, at: xfs_ilock+0x177/0x200 [xfs]

{RECLAIM_FS-ON-R} state was registered at:
  mark_held_locks+0x79/0xa0
  lockdep_trace_alloc+0xb3/0x100
  kmem_cache_alloc+0x33/0x230
  kmem_zone_alloc+0x81/0x120 [xfs]
  xfs_refcountbt_init_cursor+0x3e/0xa0 [xfs]
  __xfs_refcount_find_shared+0x75/0x580 [xfs]
  xfs_refcount_find_shared+0x84/0xb0 [xfs]
  xfs_getbmap+0x608/0x8c0 [xfs]
  xfs_vn_fiemap+0xab/0xc0 [xfs]
  do_vfs_ioctl+0x498/0x670
  SyS_ioctl+0x79/0x90
  entry_SYSCALL_64_fastpath+0x12/0x6f

       CPU0
       ----
  lock(&xfs_nondir_ilock_class);
    lock(&xfs_nondir_ilock_class);

 *** DEADLOCK ***

3 locks held by kswapd0/543:

stack backtrace:
CPU: 0 PID: 543 Comm: kswapd0 Tainted: G O 4.5.0-rc2+ #4

Call Trace:
  lock_acquire+0xd8/0x1e0
  down_write_nested+0x5e/0xc0
  xfs_ilock+0x177/0x200 [xfs]
  xfs_reflink_cancel_cow_range+0x150/0x300 [xfs]
  xfs_fs_evict_inode+0xdc/0x1e0 [xfs]
  evict+0xc5/0x190
  dispose_list+0x39/0x60
  prune_icache_sb+0x4b/0x60
  super_cache_scan+0x14f/0x1a0
  shrink_slab.part.63.constprop.79+0x1e9/0x4e0
  shrink_zone+0x15e/0x170
  kswapd+0x4f1/0xa80
  kthread+0xf2/0x110
  ret_from_fork+0x3f/0x70

To quote Dave:
"Ignoring whether reflink should be doing anything or not, that's a
"xfs_refcountbt_init_cursor() gets called both outside and inside
transactions" lockdep false positive case. The problem here is lockdep
has seen this allocation from within a transaction, hence a GFP_NOFS
allocation, and now it's seeing it in a GFP_KERNEL context. Also note
that we have an active reference to this inode.

So, because the reclaim annotations overload the interrupt level
detections and it's seen the inode ilock been taken in reclaim
("interrupt") context, this triggers a reclaim context warning where it
thinks it is unsafe to do this allocation in GFP_KERNEL context holding
the inode ilock..."

This sounds like a fundamental problem of the reclaim lock detection.
It is really impossible to annotate such a special use case IMHO unless
the reclaim lockup detection is reworked completely. Until then it is
much better to provide an "I know what I am doing" flag and mark the
problematic places. This prevents abuse of the GFP_NOFS flag, which has
a runtime effect even on configurations which have lockdep disabled.

Introduce the __GFP_NOLOCKDEP flag, which tells the lockdep gfp tracking
to skip the current allocation request.

While we are at it, also make sure that the radix tree doesn't
accidentally override tags stored in the upper part of the gfp_mask.

Link: http://lkml.kernel.org/r/20170306131408.9828-3-mhocko@kernel.org
Signed-off-by: Michal Hocko
Suggested-by: Peter Zijlstra
Acked-by: Peter Zijlstra (Intel)
Acked-by: Vlastimil Babka
Cc: Dave Chinner
Cc: Theodore Ts'o
Cc: Chris Mason
Cc: David Sterba
Cc: Jan Kara
Cc: Brian Foster
Cc: Darrick J. Wong
Cc: Nikolay Borisov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 include/linux/gfp.h | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

(limited to 'include')

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index db373b9d3223..978232a3b4ae 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -40,6 +40,11 @@ struct vm_area_struct;
 #define ___GFP_DIRECT_RECLAIM 0x400000u
 #define ___GFP_WRITE 0x800000u
 #define ___GFP_KSWAPD_RECLAIM 0x1000000u
+#ifdef CONFIG_LOCKDEP
+#define ___GFP_NOLOCKDEP 0x4000000u
+#else
+#define ___GFP_NOLOCKDEP 0
+#endif
 /* If the above are modified, __GFP_BITS_SHIFT may need updating */
 
 /*
@@ -179,8 +184,11 @@ struct vm_area_struct;
 #define __GFP_NOTRACK ((__force gfp_t)___GFP_NOTRACK)
 #define __GFP_NOTRACK_FALSE_POSITIVE (__GFP_NOTRACK)
 
+/* Disable lockdep for GFP context tracking */
+#define __GFP_NOLOCKDEP ((__force gfp_t)___GFP_NOLOCKDEP)
+
 /* Room for N __GFP_FOO bits */
-#define __GFP_BITS_SHIFT 25
+#define __GFP_BITS_SHIFT (25 + IS_ENABLED(CONFIG_LOCKDEP))
 #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
 
 /*
-- 
cgit v1.2.3


From 9070733b4efac4bf17f299a81b01c15e206f9ff5 Mon Sep 17 00:00:00 2001
From: Michal Hocko
Date: Wed, 3 May 2017 14:53:12 -0700
Subject: xfs: abstract PF_FSTRANS to PF_MEMALLOC_NOFS

xfs defined PF_FSTRANS quite some time ago to declare a scope GFP_NOFS
semantic. We would like to make this concept more generic and use it
for other filesystems as well. Let's start by giving the flag a more
generic name, PF_MEMALLOC_NOFS, which is in line with the existing
PF_MEMALLOC_NOIO already used for the same purpose for GFP_NOIO
contexts. Replace all PF_FSTRANS usage in the xfs code in the first step
before we introduce a full API for it, as xfs uses the flag directly
anyway.

This patch doesn't introduce any functional change.

Link: http://lkml.kernel.org/r/20170306131408.9828-4-mhocko@kernel.org
Signed-off-by: Michal Hocko
Reviewed-by: Darrick J. Wong
Reviewed-by: Brian Foster
Acked-by: Vlastimil Babka
Cc: Dave Chinner
Cc: Theodore Ts'o
Cc: Chris Mason
Cc: David Sterba
Cc: Jan Kara
Cc: Nikolay Borisov
Cc: Peter Zijlstra
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 include/linux/sched.h | 2 ++
 1 file changed, 2 insertions(+)

(limited to 'include')

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 3d4fa448223f..8ac11465ac5b 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1237,6 +1237,8 @@ extern struct pid *cad_pid;
 #define PF_FREEZER_SKIP 0x40000000 /* Freezer should not count it as freezable */
 #define PF_SUSPEND_TASK 0x80000000 /* This thread called freeze_processes() and should not be frozen */
 
+#define PF_MEMALLOC_NOFS PF_FSTRANS /* Transition to a more generic GFP_NOFS scope semantic */
+
 /*
  * Only the _current_ task can read/write to tsk->flags, but other
  * tasks can access tsk->flags in readonly mode for example
-- 
cgit v1.2.3


From 7dea19f9ee636cb244109a4dba426bbb3e5304b7 Mon Sep 17 00:00:00 2001
From: Michal Hocko
Date: Wed, 3 May 2017 14:53:15 -0700
Subject: mm: introduce memalloc_nofs_{save,restore} API

GFP_NOFS context is currently used for the following 5 reasons:

- to prevent deadlocks when the lock held by the allocation context
  would be needed during the memory reclaim
- to prevent stack overflows during the reclaim because the allocation
  is performed from a deep context already
- to prevent lockups when the allocation context depends on other
  reclaimers to make forward progress indirectly
- just in case, because this would be safe from the fs POV
- to silence lockdep false positives

Unfortunately, overuse of this allocation context brings some problems
to the MM. Memory reclaim is much weaker (especially during heavy FS
metadata workloads), and the OOM killer cannot be invoked because the MM
layer doesn't have enough information about how much memory is freeable
by the FS layer.

In many cases it is far from clear why the weaker context is even used,
so it might be used unnecessarily. We would like to get rid of those
cases as much as possible. One way to do that is to use the flag in
scopes rather than isolated cases. Such a scope is declared when really
necessary, tracked per task, and all the allocation requests from within
the context will simply inherit the GFP_NOFS semantic.

Not only is this easier to understand and maintain, because there are
far fewer problematic contexts than specific allocation requests, but it
also helps code paths where the FS layer interacts with other layers
(e.g. crypto, security modules, MM etc...) and there is no easy way to
convey the allocation context between the layers.

Introduce the memalloc_nofs_{save,restore} API to control the scope of
the GFP_NOFS allocation context. This is basically a copy of the
memalloc_noio_{save,restore} API we have for the other restricted
allocation context, GFP_NOIO. The PF_MEMALLOC_NOFS flag already exists
and is just an alias for PF_FSTRANS, which has been xfs specific until
recently. There are no PF_FSTRANS users left, so let's just drop it.
PF_MEMALLOC_NOFS is now checked in the MM layer and drops __GFP_FS
implicitly, just as PF_MEMALLOC_NOIO drops __GFP_IO.
memalloc_noio_flags() is renamed to current_gfp_context() because it now
cares about both PF_MEMALLOC_NOFS and PF_MEMALLOC_NOIO contexts. Xfs
code paths preserve their semantics. kmem_flags_convert() doesn't need
to evaluate the flag anymore.

This patch shouldn't introduce any functional changes.

Let's hope that filesystems will drop direct GFP_NOFS (resp. ~__GFP_FS)
usage as much as possible and only use properly documented
memalloc_nofs_{save,restore} checkpoints where they are appropriate.

[akpm@linux-foundation.org: fix comment typo, reflow comment]
Link: http://lkml.kernel.org/r/20170306131408.9828-5-mhocko@kernel.org
Signed-off-by: Michal Hocko
Acked-by: Vlastimil Babka
Cc: Dave Chinner
Cc: Theodore Ts'o
Cc: Chris Mason
Cc: David Sterba
Cc: Jan Kara
Cc: Brian Foster
Cc: Darrick J. Wong
Cc: Nikolay Borisov
Cc: Peter Zijlstra
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 include/linux/gfp.h      |  8 ++++++++
 include/linux/sched.h    |  8 +++-----
 include/linux/sched/mm.h | 26 +++++++++++++++++++++++---
 3 files changed, 34 insertions(+), 8 deletions(-)

(limited to 'include')

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 978232a3b4ae..2bfcfd33e476 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -210,8 +210,16 @@ struct vm_area_struct;
  *
  * GFP_NOIO will use direct reclaim to discard clean pages or slab pages
  * that do not require the starting of any physical IO.
+ * Please try to avoid using this flag directly and instead use
+ * memalloc_noio_{save,restore} to mark the whole scope which cannot
+ * perform any IO with a short explanation why. All allocation requests
+ * will inherit GFP_NOIO implicitly.
  *
  * GFP_NOFS will use direct reclaim but will not use any filesystem interfaces.
+ * Please try to avoid using this flag directly and instead use
+ * memalloc_nofs_{save,restore} to mark the whole scope which cannot/shouldn't
+ * recurse into the FS layer with a short explanation why. All allocation
+ * requests will inherit GFP_NOFS implicitly.
  *
  * GFP_USER is for userspace allocations that also need to be directly
  * accessibly by the kernel or hardware. It is typically used by hardware
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8ac11465ac5b..993e7e25a3a5 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1224,9 +1224,9 @@ extern struct pid *cad_pid;
 #define PF_USED_ASYNC 0x00004000 /* Used async_schedule*(), used by module init */
 #define PF_NOFREEZE 0x00008000 /* This thread should not be frozen */
 #define PF_FROZEN 0x00010000 /* Frozen for system suspend */
-#define PF_FSTRANS 0x00020000 /* Inside a filesystem transaction */
-#define PF_KSWAPD 0x00040000 /* I am kswapd */
-#define PF_MEMALLOC_NOIO 0x00080000 /* Allocating memory without IO involved */
+#define PF_KSWAPD 0x00020000 /* I am kswapd */
+#define PF_MEMALLOC_NOFS 0x00040000 /* All allocation requests will inherit GFP_NOFS */
+#define PF_MEMALLOC_NOIO 0x00080000 /* All allocation requests will inherit GFP_NOIO */
 #define PF_LESS_THROTTLE 0x00100000 /* Throttle me less: I clean memory */
 #define PF_KTHREAD 0x00200000 /* I am a kernel thread */
 #define PF_RANDOMIZE 0x00400000 /* Randomize virtual address space */
@@ -1237,8 +1237,6 @@ extern struct pid *cad_pid;
 #define PF_FREEZER_SKIP 0x40000000 /* Freezer should not count it as freezable */
 #define PF_SUSPEND_TASK 0x80000000 /* This thread called freeze_processes() and should not be frozen */
 
-#define PF_MEMALLOC_NOFS PF_FSTRANS /* Transition to a more generic GFP_NOFS scope semantic */
-
 /*
  * Only the _current_ task can read/write to tsk->flags, but other
  * tasks can access tsk->flags in readonly mode for example
diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index 830953ebb391..9daabe138c99 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -149,13 +149,21 @@ static inline bool in_vfork(struct task_struct *tsk)
 	return ret;
 }
 
-/* __GFP_IO isn't allowed if PF_MEMALLOC_NOIO is set in current->flags
- * __GFP_FS is also cleared as it implies __GFP_IO.
+/*
+ * Applies per-task gfp context to the given allocation flags.
+ * PF_MEMALLOC_NOIO implies GFP_NOIO
+ * PF_MEMALLOC_NOFS implies GFP_NOFS
  */
-static inline gfp_t memalloc_noio_flags(gfp_t flags)
+static inline gfp_t current_gfp_context(gfp_t flags)
 {
+	/*
+	 * NOIO implies both NOIO and NOFS and it is a weaker context
+	 * so always make sure it makes precendence
+	 */
 	if (unlikely(current->flags & PF_MEMALLOC_NOIO))
 		flags &= ~(__GFP_IO | __GFP_FS);
+	else if (unlikely(current->flags & PF_MEMALLOC_NOFS))
+		flags &= ~__GFP_FS;
 	return flags;
 }
 
@@ -171,4 +179,16 @@ static inline void memalloc_noio_restore(unsigned int flags)
 	current->flags = (current->flags & ~PF_MEMALLOC_NOIO) | flags;
 }
 
+static inline unsigned int memalloc_nofs_save(void)
+{
+	unsigned int flags = current->flags & PF_MEMALLOC_NOFS;
+	current->flags |= PF_MEMALLOC_NOFS;
+	return flags;
+}
+
+static inline void memalloc_nofs_restore(unsigned int flags)
+{
+	current->flags = (current->flags & ~PF_MEMALLOC_NOFS) | flags;
+}
+
 #endif /* _LINUX_SCHED_MM_H */
-- 
cgit v1.2.3


From 81378da64de6d33d0c200885f1de431c9a3e5ccd Mon Sep 17 00:00:00 2001
From: Michal Hocko
Date: Wed, 3 May 2017 14:53:22 -0700
Subject: jbd2: mark the transaction context with the scope GFP_NOFS context

Now that we have the memalloc_nofs_{save,restore} API, we can mark the
whole transaction context as implicitly GFP_NOFS. All allocations will
automatically inherit GFP_NOFS this way. This means that we do not have
to mark any of those requests with GFP_NOFS, and moreover all the
ext4_kv[mz]alloc(GFP_NOFS) calls are also safe now, because even the
hardcoded GFP_KERNEL allocations deep inside vmalloc will be NOFS as
well.
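The scope API can be exercised outside the kernel with a small model. memalloc_nofs_save(), memalloc_nofs_restore() and current_gfp_context() below follow the sched/mm.h hunk of the previous patch, and the PF_* values are the ones from its sched.h hunk. Everything else is an assumption of this sketch: a global current_task replaces current, the __GFP_* bit values and the two-bit GFP_KERNEL are simplified stand-ins, the unlikely() hints are dropped, and handle_start()/handle_stop() are hypothetical stand-ins for where a journalling layer would save and restore the context in a field like saved_alloc_context.

```c
#include <assert.h>

typedef unsigned int gfp_t;

/* Simplified stand-in bit values; only their distinctness matters. */
#define __GFP_IO	0x40u
#define __GFP_FS	0x80u
#define GFP_KERNEL	(__GFP_IO | __GFP_FS)

/* Task flag values from the sched.h hunk of the previous patch. */
#define PF_MEMALLOC_NOFS	0x00040000u
#define PF_MEMALLOC_NOIO	0x00080000u

/* Stand-in for 'current'. */
static struct { unsigned int flags; } current_task;

static gfp_t current_gfp_context(gfp_t flags)
{
	/* NOIO is the weaker context, so it takes precedence. */
	if (current_task.flags & PF_MEMALLOC_NOIO)
		flags &= ~(__GFP_IO | __GFP_FS);
	else if (current_task.flags & PF_MEMALLOC_NOFS)
		flags &= ~__GFP_FS;
	return flags;
}

/* Returns the previous NOFS bit so nested scopes restore correctly. */
static unsigned int memalloc_nofs_save(void)
{
	unsigned int flags = current_task.flags & PF_MEMALLOC_NOFS;

	current_task.flags |= PF_MEMALLOC_NOFS;
	return flags;
}

static void memalloc_nofs_restore(unsigned int flags)
{
	current_task.flags = (current_task.flags & ~PF_MEMALLOC_NOFS) | flags;
}

/* Hypothetical journal handle carrying the saved context. */
struct handle_model {
	unsigned int saved_alloc_context;
};

static void handle_start(struct handle_model *h)
{
	h->saved_alloc_context = memalloc_nofs_save();
}

static void handle_stop(struct handle_model *h)
{
	memalloc_nofs_restore(h->saved_alloc_context);
}
```

A nested pair of handles shows why the API saves the previous bit instead of clearing it unconditionally: when an inner scope ends, the outer scope must keep its NOFS restriction until its own handle_stop().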
[akpm@linux-foundation.org: tweak comments] Link: http://lkml.kernel.org/r/20170306131408.9828-7-mhocko@kernel.org Signed-off-by: Michal Hocko Reviewed-by: Jan Kara Cc: Dave Chinner Cc: Theodore Ts'o Cc: Chris Mason Cc: David Sterba Cc: Brian Foster Cc: Darrick J. Wong Cc: Nikolay Borisov Cc: Peter Zijlstra Cc: Vlastimil Babka Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/jbd2.h | 2 ++ 1 file changed, 2 insertions(+) (limited to 'include') diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h index dfaa1f4dcb0c..606b6bce3a5b 100644 --- a/include/linux/jbd2.h +++ b/include/linux/jbd2.h @@ -491,6 +491,8 @@ struct jbd2_journal_handle unsigned long h_start_jiffies; unsigned int h_requested_credits; + + unsigned int saved_alloc_context; }; -- cgit v1.2.3 From 056b9d8a76924df02011f3941c4f53ace8d6c32a Mon Sep 17 00:00:00 2001 From: Kees Cook Date: Wed, 3 May 2017 14:53:32 -0700 Subject: mm: remove rodata_test_data export, add pr_fmt Since commit 3ad38ceb2769 ("x86/mm: Remove CONFIG_DEBUG_NX_TEST"), nothing is using the exported rodata_test_data variable, so drop the export. This additionally updates the pr_fmt to avoid redundant strings and adjusts some whitespace. 
Link: http://lkml.kernel.org/r/20170307005313.GA85809@beast Signed-off-by: Kees Cook Cc: Jinbum Park Cc: Arjan van de Ven Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/rodata_test.h | 1 - 1 file changed, 1 deletion(-) (limited to 'include') diff --git a/include/linux/rodata_test.h b/include/linux/rodata_test.h index ea05f6c51413..84766bcdd01f 100644 --- a/include/linux/rodata_test.h +++ b/include/linux/rodata_test.h @@ -14,7 +14,6 @@ #define _RODATA_TEST_H #ifdef CONFIG_DEBUG_RODATA_TEST -extern const int rodata_test_data; void rodata_test(void); #else static inline void rodata_test(void) {} -- cgit v1.2.3 From 18863d3a3f593f47b075b9f53ebf9228dc76cf72 Mon Sep 17 00:00:00 2001 From: Minchan Kim Date: Wed, 3 May 2017 14:54:04 -0700 Subject: mm: remove SWAP_DIRTY in ttu If a lazyfree page is found to be dirty, try_to_unmap_one can simply call SetPageSwapBacked on it, much as it handles a PG_mlocked page, and return SWAP_FAIL. That is natural because the page is not swappable right now, so vmscan can activate it. There is no point in introducing a new return value, SWAP_DIRTY, in try_to_unmap at the moment. Link: http://lkml.kernel.org/r/1489555493-14659-3-git-send-email-minchan@kernel.org Signed-off-by: Minchan Kim Acked-by: Hillf Danton Acked-by: Kirill A.
Shutemov Cc: Anshuman Khandual Cc: Johannes Weiner Cc: Michal Hocko Cc: Naoya Horiguchi Cc: Vlastimil Babka Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/rmap.h | 1 - 1 file changed, 1 deletion(-) (limited to 'include') diff --git a/include/linux/rmap.h b/include/linux/rmap.h index fee10d744ebd..b556eefa62bc 100644 --- a/include/linux/rmap.h +++ b/include/linux/rmap.h @@ -298,6 +298,5 @@ static inline int page_mkclean(struct page *page) #define SWAP_AGAIN 1 #define SWAP_FAIL 2 #define SWAP_MLOCK 3 -#define SWAP_DIRTY 4 #endif /* _LINUX_RMAP_H */ -- cgit v1.2.3 From 192d7232569ab61ded40c8be691b12832bc6bcd1 Mon Sep 17 00:00:00 2001 From: Minchan Kim Date: Wed, 3 May 2017 14:54:10 -0700 Subject: mm: make try_to_munlock() return void try_to_munlock returns SWAP_MLOCK if one of the VMAs mapping the page has the VM_LOCKED flag. In that case, the VM sets PG_mlocked on the page, unless the page is a pte-mapped THP, which cannot be mlocked. With that, __munlock_isolated_page can use PageMlocked to check whether try_to_munlock succeeded, without relying on try_to_munlock's return value. This helps simplify try_to_unmap/try_to_unmap_one in upcoming patches. [minchan@kernel.org: remove PG_Mlocked VM_BUG_ON check] Link: http://lkml.kernel.org/r/20170411025615.GA6545@bbox Link: http://lkml.kernel.org/r/1489555493-14659-5-git-send-email-minchan@kernel.org Signed-off-by: Minchan Kim Acked-by: Kirill A.
Shutemov Acked-by: Vlastimil Babka Cc: Anshuman Khandual Cc: Hillf Danton Cc: Johannes Weiner Cc: Michal Hocko Cc: Naoya Horiguchi Cc: Sasha Levin Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/rmap.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'include') diff --git a/include/linux/rmap.h b/include/linux/rmap.h index b556eefa62bc..1b0cd4cf68e3 100644 --- a/include/linux/rmap.h +++ b/include/linux/rmap.h @@ -235,7 +235,7 @@ int page_mkclean(struct page *); * called in munlock()/munmap() path to check for other vmas holding * the page mlocked. */ -int try_to_munlock(struct page *); +void try_to_munlock(struct page *); void remove_migration_ptes(struct page *old, struct page *new, bool locked); -- cgit v1.2.3 From ad6b67041a45497261617d7a28b15159b202cb5a Mon Sep 17 00:00:00 2001 From: Minchan Kim Date: Wed, 3 May 2017 14:54:13 -0700 Subject: mm: remove SWAP_MLOCK in ttu ttu doesn't need to return SWAP_MLOCK. Instead, just return SWAP_FAIL: it means the page is not swappable, so it should move to another LRU list (active or unevictable). The putback helpers will move it to the right list depending on the page's LRU flag. Link: http://lkml.kernel.org/r/1489555493-14659-6-git-send-email-minchan@kernel.org Signed-off-by: Minchan Kim Cc: Anshuman Khandual Cc: Hillf Danton Cc: Johannes Weiner Cc: Kirill A.
Shutemov Cc: Michal Hocko Cc: Naoya Horiguchi Cc: Vlastimil Babka Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/rmap.h | 1 - 1 file changed, 1 deletion(-) (limited to 'include') diff --git a/include/linux/rmap.h b/include/linux/rmap.h index 1b0cd4cf68e3..3630d4dcee13 100644 --- a/include/linux/rmap.h +++ b/include/linux/rmap.h @@ -297,6 +297,5 @@ static inline int page_mkclean(struct page *page) #define SWAP_SUCCESS 0 #define SWAP_AGAIN 1 #define SWAP_FAIL 2 -#define SWAP_MLOCK 3 #endif /* _LINUX_RMAP_H */ -- cgit v1.2.3 From 666e5a406c3ed562e7b3ceff8b631b6067bdaead Mon Sep 17 00:00:00 2001 From: Minchan Kim Date: Wed, 3 May 2017 14:54:20 -0700 Subject: mm: make ttu's return boolean try_to_unmap() returns only SWAP_SUCCESS or SWAP_FAIL, so it is suitable for a boolean return value. This patch converts it. Link: http://lkml.kernel.org/r/1489555493-14659-8-git-send-email-minchan@kernel.org Signed-off-by: Minchan Kim Cc: Naoya Horiguchi Cc: Anshuman Khandual Cc: Hillf Danton Cc: Johannes Weiner Cc: Kirill A.
Shutemov Cc: Michal Hocko Cc: Vlastimil Babka Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/rmap.h | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'include') diff --git a/include/linux/rmap.h b/include/linux/rmap.h index 3630d4dcee13..6028c38d3cac 100644 --- a/include/linux/rmap.h +++ b/include/linux/rmap.h @@ -191,7 +191,7 @@ static inline void page_dup_rmap(struct page *page, bool compound) int page_referenced(struct page *, int is_locked, struct mem_cgroup *memcg, unsigned long *vm_flags); -int try_to_unmap(struct page *, enum ttu_flags flags); +bool try_to_unmap(struct page *, enum ttu_flags flags); /* Avoid racy checks */ #define PVMW_SYNC (1 << 0) @@ -281,7 +281,7 @@ static inline int page_referenced(struct page *page, int is_locked, return 0; } -#define try_to_unmap(page, refs) SWAP_FAIL +#define try_to_unmap(page, refs) false static inline int page_mkclean(struct page *page) { -- cgit v1.2.3 From 1df631ae19819cff343d316eda42eca32d3de7fc Mon Sep 17 00:00:00 2001 From: Minchan Kim Date: Wed, 3 May 2017 14:54:23 -0700 Subject: mm: make rmap_walk() return void There is no user of the return value from rmap_walk() and friends so this patch makes them void-returning functions. Link: http://lkml.kernel.org/r/1489555493-14659-9-git-send-email-minchan@kernel.org Signed-off-by: Minchan Kim Cc: Anshuman Khandual Cc: Hillf Danton Cc: Johannes Weiner Cc: Kirill A. 
Shutemov Cc: Michal Hocko Cc: Naoya Horiguchi Cc: Vlastimil Babka Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/ksm.h | 5 ++--- include/linux/rmap.h | 4 ++-- 2 files changed, 4 insertions(+), 5 deletions(-) (limited to 'include') diff --git a/include/linux/ksm.h b/include/linux/ksm.h index e1cfda4bee58..78b44a024eaa 100644 --- a/include/linux/ksm.h +++ b/include/linux/ksm.h @@ -61,7 +61,7 @@ static inline void set_page_stable_node(struct page *page, struct page *ksm_might_need_to_copy(struct page *page, struct vm_area_struct *vma, unsigned long address); -int rmap_walk_ksm(struct page *page, struct rmap_walk_control *rwc); +void rmap_walk_ksm(struct page *page, struct rmap_walk_control *rwc); void ksm_migrate_page(struct page *newpage, struct page *oldpage); #else /* !CONFIG_KSM */ @@ -94,10 +94,9 @@ static inline int page_referenced_ksm(struct page *page, return 0; } -static inline int rmap_walk_ksm(struct page *page, +static inline void rmap_walk_ksm(struct page *page, struct rmap_walk_control *rwc) { - return 0; } static inline void ksm_migrate_page(struct page *newpage, struct page *oldpage) diff --git a/include/linux/rmap.h b/include/linux/rmap.h index 6028c38d3cac..1d7d457ca0dc 100644 --- a/include/linux/rmap.h +++ b/include/linux/rmap.h @@ -264,8 +264,8 @@ struct rmap_walk_control { bool (*invalid_vma)(struct vm_area_struct *vma, void *arg); }; -int rmap_walk(struct page *page, struct rmap_walk_control *rwc); -int rmap_walk_locked(struct page *page, struct rmap_walk_control *rwc); +void rmap_walk(struct page *page, struct rmap_walk_control *rwc); +void rmap_walk_locked(struct page *page, struct rmap_walk_control *rwc); #else /* !CONFIG_MMU */ -- cgit v1.2.3 From e4b82222712ed15813d35204c91429883d27d1d9 Mon Sep 17 00:00:00 2001 From: Minchan Kim Date: Wed, 3 May 2017 14:54:27 -0700 Subject: mm: make rmap_one boolean function rmap_one's return value controls whether rmap_walk should continue to scan other ptes or not, so it is a
target for changing to boolean. Return true if the scan should continue; otherwise, return false to stop scanning. This patch changes rmap_one's return type to boolean. Link: http://lkml.kernel.org/r/1489555493-14659-10-git-send-email-minchan@kernel.org Signed-off-by: Minchan Kim Cc: Anshuman Khandual Cc: Hillf Danton Cc: Johannes Weiner Cc: Kirill A. Shutemov Cc: Michal Hocko Cc: Naoya Horiguchi Cc: Vlastimil Babka Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/rmap.h | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) (limited to 'include') diff --git a/include/linux/rmap.h b/include/linux/rmap.h index 1d7d457ca0dc..13ed232cbb29 100644 --- a/include/linux/rmap.h +++ b/include/linux/rmap.h @@ -257,7 +257,11 @@ int page_mapped_in_vma(struct page *page, struct vm_area_struct *vma); */ struct rmap_walk_control { void *arg; - int (*rmap_one)(struct page *page, struct vm_area_struct *vma, + /* + * Return false if page table scanning in rmap_walk should be stopped. + * Otherwise, return true. + */ + bool (*rmap_one)(struct page *page, struct vm_area_struct *vma, unsigned long addr, void *arg); int (*done)(struct page *page); struct anon_vma *(*anon_lock)(struct page *page); -- cgit v1.2.3 From 83612a948d3bd2e71b110d7e8735661621bd23d9 Mon Sep 17 00:00:00 2001 From: Minchan Kim Date: Wed, 3 May 2017 14:54:30 -0700 Subject: mm: remove SWAP_[SUCCESS|AGAIN|FAIL] There is no user for it. Remove it. [minchan@kernel.org: use false instead of SWAP_FAIL] Link: http://lkml.kernel.org/r/20170316053313.GA19241@bbox Link: http://lkml.kernel.org/r/1489555493-14659-11-git-send-email-minchan@kernel.org Signed-off-by: Minchan Kim Cc: Anshuman Khandual Cc: Hillf Danton Cc: Johannes Weiner Cc: Kirill A.
Shutemov Cc: Michal Hocko Cc: Naoya Horiguchi Cc: Vlastimil Babka Cc: Sergey Senozhatsky Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/rmap.h | 7 ------- 1 file changed, 7 deletions(-) (limited to 'include') diff --git a/include/linux/rmap.h b/include/linux/rmap.h index 13ed232cbb29..43ef2c30cb0f 100644 --- a/include/linux/rmap.h +++ b/include/linux/rmap.h @@ -295,11 +295,4 @@ static inline int page_mkclean(struct page *page) #endif /* CONFIG_MMU */ -/* - * Return values of try_to_unmap - */ -#define SWAP_SUCCESS 0 -#define SWAP_AGAIN 1 -#define SWAP_FAIL 2 - #endif /* _LINUX_RMAP_H */ -- cgit v1.2.3 From bd33ef3681359343863f2290aded182b0441edee Mon Sep 17 00:00:00 2001 From: Vinayak Menon Date: Wed, 3 May 2017 14:54:42 -0700 Subject: mm: enable page poisoning early at boot On SPARSEMEM systems page poisoning is enabled after buddy is up, because of the dependency on page extension init. This causes the pages released by free_all_bootmem not to be poisoned. This either delays or misses the identification of some issues because the pages have to undergo another cycle of alloc-free-alloc for any corruption to be detected. Enable page poisoning early by getting rid of the PAGE_EXT_DEBUG_POISON flag. Since all the free pages will now be poisoned, the flag need not be verified before checking the poison during an alloc. 
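The poison-on-free, check-on-alloc scheme described above can be sketched in plain C. Assumptions: `0xaa` is used as the poison byte (believed to match the kernel's `PAGE_POISON`), `MODEL_PAGE_SIZE` is illustrative, and all function names are hypothetical; the kernel's real helpers are not reproduced here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Assumptions: 0xaa as the poison byte; page size chosen for illustration. */
#define PAGE_POISON     0xaa
#define MODEL_PAGE_SIZE 4096

/* Fill a freed page with the poison pattern. */
static void poison_page(unsigned char *page)
{
	assert(page != NULL);
	memset(page, PAGE_POISON, MODEL_PAGE_SIZE);
}

/* Return the offset of the first byte that no longer holds the poison, or -1. */
static long first_corrupt_byte(const unsigned char *page)
{
	for (size_t i = 0; i < MODEL_PAGE_SIZE; i++)
		if (page[i] != PAGE_POISON)
			return (long)i;
	return -1;
}

/* Model of the boot-time hand-off (cf. free_all_bootmem): every page
 * released to the allocator is poisoned, including the very first ones. */
static void release_boot_pages(unsigned char pages[][MODEL_PAGE_SIZE], size_t n)
{
	for (size_t i = 0; i < n; i++)
		poison_page(pages[i]);
}

/* On allocation the poison is checked unconditionally: since all free pages
 * are poisoned, no per-page "was poisoned" flag has to be consulted first. */
static bool alloc_page_checked(const unsigned char *page)
{
	return first_corrupt_byte(page) < 0;
}
```

A stray write into a free page (for example `pages[1][100] = 0`) is then caught on the very first allocation instead of after another alloc-free-alloc cycle.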
[vinmenon@codeaurora.org: fix Kconfig] Link: http://lkml.kernel.org/r/1490878002-14423-1-git-send-email-vinmenon@codeaurora.org Link: http://lkml.kernel.org/r/1490358246-11001-1-git-send-email-vinmenon@codeaurora.org Signed-off-by: Vinayak Menon Acked-by: Laura Abbott Tested-by: Laura Abbott Cc: Joonsoo Kim Cc: Michal Hocko Cc: Akinobu Mita Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/mm.h | 1 - 1 file changed, 1 deletion(-) (limited to 'include') diff --git a/include/linux/mm.h b/include/linux/mm.h index 695da2a19b4c..5d22e69f51ea 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -2487,7 +2487,6 @@ extern long copy_huge_page_from_user(struct page *dst_page, #endif /* CONFIG_TRANSPARENT_HUGEPAGE || CONFIG_HUGETLBFS */ extern struct page_ext_operations debug_guardpage_ops; -extern struct page_ext_operations page_poisoning_ops; #ifdef CONFIG_DEBUG_PAGEALLOC extern unsigned int _debug_guardpage_minorder; -- cgit v1.2.3 From 9927e3887642b976d9b391cd77d71388aa521e54 Mon Sep 17 00:00:00 2001 From: Pushkar Jambhlekar Date: Wed, 3 May 2017 14:54:45 -0700 Subject: include/linux/migrate.h: add arg names to prototype It is preferred, and the rest of migrate.h gets it right. 
Link: http://lkml.kernel.org/r/1490336009-8024-1-git-send-email-pushkar.iit@gmail.com Signed-off-by: Pushkar Jambhlekar Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/migrate.h | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) (limited to 'include') diff --git a/include/linux/migrate.h b/include/linux/migrate.h index fa76b516fa47..48e24844b3c5 100644 --- a/include/linux/migrate.h +++ b/include/linux/migrate.h @@ -33,8 +33,9 @@ extern char *migrate_reason_names[MR_TYPES]; #ifdef CONFIG_MIGRATION extern void putback_movable_pages(struct list_head *l); -extern int migrate_page(struct address_space *, - struct page *, struct page *, enum migrate_mode); +extern int migrate_page(struct address_space *mapping, + struct page *newpage, struct page *page, + enum migrate_mode mode); extern int migrate_pages(struct list_head *l, new_page_t new, free_page_t free, unsigned long private, enum migrate_mode mode, int reason); extern int isolate_movable_page(struct page *page, isolate_mode_t mode); -- cgit v1.2.3 From ac2e8e40acf4c73e0ad1addca34b186d855565d7 Mon Sep 17 00:00:00 2001 From: Hao Lee Date: Wed, 3 May 2017 14:54:51 -0700 Subject: mm: fix spelling error Fix variable name error in comments. No code changes. Link: http://lkml.kernel.org/r/20170403161655.5081-1-haolee.swjtu@gmail.com Signed-off-by: Hao Lee Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/gfp.h | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'include') diff --git a/include/linux/gfp.h b/include/linux/gfp.h index 2bfcfd33e476..2b1a44f5bdb6 100644 --- a/include/linux/gfp.h +++ b/include/linux/gfp.h @@ -313,8 +313,8 @@ static inline bool gfpflags_allow_blocking(const gfp_t gfp_flags) /* * GFP_ZONE_TABLE is a word size bitstring that is used for looking up the - * zone to use given the lowest 4 bits of gfp_t. 
Entries are ZONE_SHIFT long - * and there are 16 of them to cover all possible combinations of + * zone to use given the lowest 4 bits of gfp_t. Entries are GFP_ZONES_SHIFT + * bits long and there are 16 of them to cover all possible combinations of * __GFP_DMA, __GFP_DMA32, __GFP_MOVABLE and __GFP_HIGHMEM. * * The zone fallback order is MOVABLE=>HIGHMEM=>NORMAL=>DMA32=>DMA. -- cgit v1.2.3 From 2a2e48854d704214dac7546e87ae0e4daa0e61a0 Mon Sep 17 00:00:00 2001 From: Johannes Weiner Date: Wed, 3 May 2017 14:55:03 -0700 Subject: mm: vmscan: fix IO/refault regression in cache workingset transition Since commit 59dc76b0d4df ("mm: vmscan: reduce size of inactive file list") we noticed bigger IO spikes during changes in cache access patterns. The patch in question shrunk the inactive list size to leave more room for the current workingset in the presence of streaming IO. However, workingset transitions that previously happened on the inactive list are now pushed out of memory and incur more refaults to complete. This patch disables active list protection when refaults are being observed. This accelerates workingset transitions, and allows more of the new set to establish itself from memory, without eating into the ability to protect the established workingset during stable periods. The workloads that were measurably affected for us were hit pretty badly by it, with refault/majfault rates doubling and tripling during cache transitions, and the machines sustaining half-hour periods of 100% IO utilization, where they'd previously have sub-minute peaks at 60-90%. Stateful services that handle user data tend to be more conservative with kernel upgrades. As a result we hit most page cache issues with some delay, as was the case here. The severity seemed to warrant a stable tag.
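The idea of dropping active list protection while refaults are observed can be illustrated with a simplified model. This is a sketch, not the vmscan code: `struct lruvec_model`, `snapshot_refaults()` and the signature of `inactive_list_is_low()` here are hypothetical (the name loosely echoes vmscan's helper), and only the `refaults` snapshot corresponds to the field the patch adds to `struct lruvec` in the mmzone.h hunk that follows.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified, hypothetical model of one LRU vector. Only the refault
 * snapshot corresponds to a field in the patch (lruvec->refaults). */
struct lruvec_model {
	unsigned long inactive_file;  /* pages on the inactive file list */
	unsigned long active_file;    /* pages on the active file list */
	unsigned long refaults;       /* refault count seen at the last reclaim cycle */
};

/*
 * Is the inactive list too small, i.e. should active pages be deactivated?
 * Normally the active list is protected by keeping the inactive list at
 * 1/inactive_ratio of it. If refaults happened since the last cycle, a
 * workingset transition is under way, so the protection is dropped
 * (ratio 0): the inactive list then counts as low whenever the active
 * list is non-empty.
 */
static bool inactive_list_is_low(const struct lruvec_model *lv,
				 unsigned long cur_refaults,
				 unsigned long inactive_ratio)
{
	if (cur_refaults != lv->refaults)
		inactive_ratio = 0;
	return lv->inactive_file * inactive_ratio < lv->active_file;
}

/* After a reclaim cycle, remember the refault count for the next comparison. */
static void snapshot_refaults(struct lruvec_model *lv, unsigned long cur_refaults)
{
	assert(cur_refaults >= lv->refaults);  /* the model's counter only grows */
	lv->refaults = cur_refaults;
}
```

Comparing against a per-cycle snapshot, rather than a global threshold, keeps the check cheap and makes protection come back as soon as a cycle passes without new refaults.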
Fixes: 59dc76b0d4df ("mm: vmscan: reduce size of inactive file list") Link: http://lkml.kernel.org/r/20170404220052.27593-1-hannes@cmpxchg.org Signed-off-by: Johannes Weiner Cc: Rik van Riel Cc: Mel Gorman Cc: Michal Hocko Cc: Vladimir Davydov Cc: [4.7+] Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/memcontrol.h | 64 +++++++++++++++++++++++++++++++++++++++++++--- include/linux/mmzone.h | 2 ++ 2 files changed, 63 insertions(+), 3 deletions(-) (limited to 'include') diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index c5ebb32fef49..cfa91a3ca0ca 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -57,6 +57,9 @@ enum mem_cgroup_stat_index { MEMCG_SLAB_RECLAIMABLE, MEMCG_SLAB_UNRECLAIMABLE, MEMCG_SOCK, + MEMCG_WORKINGSET_REFAULT, + MEMCG_WORKINGSET_ACTIVATE, + MEMCG_WORKINGSET_NODERECLAIM, MEMCG_NR_STAT, }; @@ -495,6 +498,40 @@ extern int do_swap_account; void lock_page_memcg(struct page *page); void unlock_page_memcg(struct page *page); +static inline unsigned long mem_cgroup_read_stat(struct mem_cgroup *memcg, + enum mem_cgroup_stat_index idx) +{ + long val = 0; + int cpu; + + for_each_possible_cpu(cpu) + val += per_cpu(memcg->stat->count[idx], cpu); + + if (val < 0) + val = 0; + + return val; +} + +static inline void mem_cgroup_update_stat(struct mem_cgroup *memcg, + enum mem_cgroup_stat_index idx, int val) +{ + if (!mem_cgroup_disabled()) + this_cpu_add(memcg->stat->count[idx], val); +} + +static inline void mem_cgroup_inc_stat(struct mem_cgroup *memcg, + enum mem_cgroup_stat_index idx) +{ + mem_cgroup_update_stat(memcg, idx, 1); +} + +static inline void mem_cgroup_dec_stat(struct mem_cgroup *memcg, + enum mem_cgroup_stat_index idx) +{ + mem_cgroup_update_stat(memcg, idx, -1); +} + /** * mem_cgroup_update_page_stat - update page state statistics * @page: the page @@ -509,14 +546,14 @@ void unlock_page_memcg(struct page *page); * if (TestClearPageState(page)) * 
mem_cgroup_update_page_stat(page, state, -1); * unlock_page(page) or unlock_page_memcg(page) + * + * Kernel pages are an exception to this, since they'll never move. */ static inline void mem_cgroup_update_page_stat(struct page *page, enum mem_cgroup_stat_index idx, int val) { - VM_BUG_ON(!(rcu_read_lock_held() || PageLocked(page))); - if (page->mem_cgroup) - this_cpu_add(page->mem_cgroup->stat->count[idx], val); + mem_cgroup_update_stat(page->mem_cgroup, idx, val); } static inline void mem_cgroup_inc_page_stat(struct page *page, @@ -741,6 +778,27 @@ static inline bool mem_cgroup_oom_synchronize(bool wait) return false; } +static inline unsigned long mem_cgroup_read_stat(struct mem_cgroup *memcg, + enum mem_cgroup_stat_index idx) +{ + return 0; +} + +static inline void mem_cgroup_update_stat(struct mem_cgroup *memcg, + enum mem_cgroup_stat_index idx, int val) +{ +} + +static inline void mem_cgroup_inc_stat(struct mem_cgroup *memcg, + enum mem_cgroup_stat_index idx) +{ +} + +static inline void mem_cgroup_dec_stat(struct mem_cgroup *memcg, + enum mem_cgroup_stat_index idx) +{ +} + static inline void mem_cgroup_update_page_stat(struct page *page, enum mem_cgroup_stat_index idx, int nr) diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 446cf68c1c09..e0c3c5e3d8a0 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -225,6 +225,8 @@ struct lruvec { struct zone_reclaim_stat reclaim_stat; /* Evictions & activations on the inactive file list */ atomic_long_t inactive_age; + /* Refaults at the time of last reclaim cycle */ + unsigned long refaults; #ifdef CONFIG_MEMCG struct pglist_data *pgdat; #endif -- cgit v1.2.3 From 31176c781508e4e35b1cc4ae2f0a5abd1f4ea689 Mon Sep 17 00:00:00 2001 From: Johannes Weiner Date: Wed, 3 May 2017 14:55:07 -0700 Subject: mm: memcontrol: clean up memory.events counting function We only ever count single events, drop the @nr parameter. Rename the function accordingly. Remove low-information kerneldoc. 
Link: http://lkml.kernel.org/r/20170404220148.28338-1-hannes@cmpxchg.org Signed-off-by: Johannes Weiner Acked-by: Vladimir Davydov Acked-by: Michal Hocko Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/memcontrol.h | 18 +++++------------- 1 file changed, 5 insertions(+), 13 deletions(-) (limited to 'include') diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index cfa91a3ca0ca..bc0c16e284c0 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -287,17 +287,10 @@ static inline bool mem_cgroup_disabled(void) return !cgroup_subsys_enabled(memory_cgrp_subsys); } -/** - * mem_cgroup_events - count memory events against a cgroup - * @memcg: the memory cgroup - * @idx: the event index - * @nr: the number of events to account for - */ -static inline void mem_cgroup_events(struct mem_cgroup *memcg, - enum mem_cgroup_events_index idx, - unsigned int nr) +static inline void mem_cgroup_event(struct mem_cgroup *memcg, + enum mem_cgroup_events_index idx) { - this_cpu_add(memcg->stat->events[idx], nr); + this_cpu_inc(memcg->stat->events[idx]); cgroup_file_notify(&memcg->events_file); } @@ -614,9 +607,8 @@ static inline bool mem_cgroup_disabled(void) return true; } -static inline void mem_cgroup_events(struct mem_cgroup *memcg, - enum mem_cgroup_events_index idx, - unsigned int nr) +static inline void mem_cgroup_event(struct mem_cgroup *memcg, + enum mem_cgroup_events_index idx) { } -- cgit v1.2.3 From df0e53d0619e83b465e363c088bf4eeb2848273b Mon Sep 17 00:00:00 2001 From: Johannes Weiner Date: Wed, 3 May 2017 14:55:10 -0700 Subject: mm: memcontrol: re-use global VM event enum The current duplication is a high-maintenance mess, and it's painful to add new items. This increases the size of the event array, but we'll eventually want most of the VM events tracked on a per-cgroup basis anyway. 
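The enum layout this commit introduces, where cgroup-only items continue numbering after the universal VM events, can be shown in isolation. A minimal sketch: the four-entry `enum vm_event_item` is a stand-in for the kernel's much longer enum (its values differ), and `struct memcg_events_model` and `count_memcg_event()` are hypothetical.

```c
#include <assert.h>

/* Four-entry stand-in for the kernel's enum vm_event_item; the real enum
 * is much longer and its values differ. */
enum vm_event_item {
	PGPGIN,
	PGPGOUT,
	PGFAULT,
	PGMAJFAULT,
	NR_VM_EVENT_ITEMS,
};

/* Cgroup-only events continue numbering where the universal ones stop. */
enum memcg_event_item {
	MEMCG_LOW = NR_VM_EVENT_ITEMS,
	MEMCG_HIGH,
	MEMCG_MAX,
	MEMCG_OOM,
	MEMCG_NR_EVENTS,
};

/* A single counter array sized for both enums. */
struct memcg_events_model {
	unsigned long events[MEMCG_NR_EVENTS];
};

/* Accepts a value from either enum: no switch translating VM events into
 * private memcg indices is required. */
static void count_memcg_event(struct memcg_events_model *m, int idx)
{
	assert(idx >= 0 && idx < MEMCG_NR_EVENTS);
	m->events[idx]++;
}
```

This is what lets the patch replace the `switch (idx)` in mem_cgroup_count_vm_event() with a direct `this_cpu_inc(memcg->stat->events[idx])`, at the cost of a larger per-CPU array.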
Link: http://lkml.kernel.org/r/20170404220148.28338-2-hannes@cmpxchg.org Signed-off-by: Johannes Weiner Acked-by: Vladimir Davydov Cc: Michal Hocko Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/memcontrol.h | 45 ++++++++++++++------------------------------- 1 file changed, 14 insertions(+), 31 deletions(-) (limited to 'include') diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index bc0c16e284c0..0bb5f055bd26 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -69,20 +69,6 @@ struct mem_cgroup_reclaim_cookie { unsigned int generation; }; -enum mem_cgroup_events_index { - MEM_CGROUP_EVENTS_PGPGIN, /* # of pages paged in */ - MEM_CGROUP_EVENTS_PGPGOUT, /* # of pages paged out */ - MEM_CGROUP_EVENTS_PGFAULT, /* # of page-faults */ - MEM_CGROUP_EVENTS_PGMAJFAULT, /* # of major page-faults */ - MEM_CGROUP_EVENTS_NSTATS, - /* default hierarchy events */ - MEMCG_LOW = MEM_CGROUP_EVENTS_NSTATS, - MEMCG_HIGH, - MEMCG_MAX, - MEMCG_OOM, - MEMCG_NR_EVENTS, -}; - /* * Per memcg event counter is incremented at every pagein/pageout. With THP, * it will be incremated by the number of pages. 
This counter is used for @@ -106,6 +92,15 @@ struct mem_cgroup_id { atomic_t ref; }; +/* Cgroup-specific events, on top of universal VM events */ +enum memcg_event_item { + MEMCG_LOW = NR_VM_EVENT_ITEMS, + MEMCG_HIGH, + MEMCG_MAX, + MEMCG_OOM, + MEMCG_NR_EVENTS, +}; + struct mem_cgroup_stat_cpu { long count[MEMCG_NR_STAT]; unsigned long events[MEMCG_NR_EVENTS]; @@ -288,9 +283,9 @@ static inline bool mem_cgroup_disabled(void) } static inline void mem_cgroup_event(struct mem_cgroup *memcg, - enum mem_cgroup_events_index idx) + enum memcg_event_item event) { - this_cpu_inc(memcg->stat->events[idx]); + this_cpu_inc(memcg->stat->events[event]); cgroup_file_notify(&memcg->events_file); } @@ -575,20 +570,8 @@ static inline void mem_cgroup_count_vm_event(struct mm_struct *mm, rcu_read_lock(); memcg = mem_cgroup_from_task(rcu_dereference(mm->owner)); - if (unlikely(!memcg)) - goto out; - - switch (idx) { - case PGFAULT: - this_cpu_inc(memcg->stat->events[MEM_CGROUP_EVENTS_PGFAULT]); - break; - case PGMAJFAULT: - this_cpu_inc(memcg->stat->events[MEM_CGROUP_EVENTS_PGMAJFAULT]); - break; - default: - BUG(); - } -out: + if (likely(memcg)) + this_cpu_inc(memcg->stat->events[idx]); rcu_read_unlock(); } #ifdef CONFIG_TRANSPARENT_HUGEPAGE @@ -608,7 +591,7 @@ static inline bool mem_cgroup_disabled(void) } static inline void mem_cgroup_event(struct mem_cgroup *memcg, - enum mem_cgroup_events_index idx) + enum memcg_event_item event) { } -- cgit v1.2.3 From 71cd31135d4cf030a057ed7079a75a40c0a4a796 Mon Sep 17 00:00:00 2001 From: Johannes Weiner Date: Wed, 3 May 2017 14:55:13 -0700 Subject: mm: memcontrol: re-use node VM page state enum The current duplication is a high-maintenance mess, and it's painful to add new items or query memcg state from the rest of the VM. This increases the size of the stat array marginally, but we should aim to track all these stats on a per-cgroup level anyway. 
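Whichever enum indexes them, the memcg stat counters are per-CPU, and a reader such as mem_cgroup_read_stat() in the memcontrol.h hunk above sums every CPU's slot and clamps a negative total to zero. The sketch below models that in single-threaded userspace code; the sizes, `struct stat_cpu_model`, `mod_state_on_cpu()` and `read_state()` are hypothetical, and no real concurrency is simulated.

```c
#include <assert.h>

/* Hypothetical sizes for the model. */
#define MODEL_NR_CPUS  4
#define MODEL_NR_STATS 2

/* One row of signed counters per CPU, like the per-cpu count[] array in
 * struct mem_cgroup_stat_cpu. A writer only touches its own CPU's row. */
struct stat_cpu_model {
	long count[MODEL_NR_STATS];
};

static struct stat_cpu_model percpu_stats[MODEL_NR_CPUS];

/* Writer side: apply a delta on the given CPU (the kernel uses this_cpu_add()). */
static void mod_state_on_cpu(int cpu, int idx, int val)
{
	assert(cpu >= 0 && cpu < MODEL_NR_CPUS);
	assert(idx >= 0 && idx < MODEL_NR_STATS);
	percpu_stats[cpu].count[idx] += val;
}

/* Reader side: sum every CPU's slot. A single slot can be negative (a page
 * counted on one CPU may be uncounted on another), and a reader racing with
 * writers can see a negative total, so the result is clamped at zero. */
static unsigned long read_state(int idx)
{
	long val = 0;

	for (int cpu = 0; cpu < MODEL_NR_CPUS; cpu++)
		val += percpu_stats[cpu].count[idx];
	if (val < 0)
		val = 0;
	return (unsigned long)val;
}
```

Writes stay cheap and contention-free because each CPU owns its row; the cost moves to readers, which must walk all possible CPUs.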
Link: http://lkml.kernel.org/r/20170404220148.28338-3-hannes@cmpxchg.org Signed-off-by: Johannes Weiner Acked-by: Vladimir Davydov Cc: Michal Hocko Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/memcontrol.h | 100 +++++++++++++++++++-------------------------- 1 file changed, 43 insertions(+), 57 deletions(-) (limited to 'include') diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 0bb5f055bd26..0fa1f5de6841 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -35,40 +35,45 @@ struct page; struct mm_struct; struct kmem_cache; -/* - * The corresponding mem_cgroup_stat_names is defined in mm/memcontrol.c, - * These two lists should keep in accord with each other. - */ -enum mem_cgroup_stat_index { - /* - * For MEM_CONTAINER_TYPE_ALL, usage = pagecache + rss. - */ - MEM_CGROUP_STAT_CACHE, /* # of pages charged as cache */ - MEM_CGROUP_STAT_RSS, /* # of pages charged as anon rss */ - MEM_CGROUP_STAT_RSS_HUGE, /* # of pages charged as anon huge */ - MEM_CGROUP_STAT_SHMEM, /* # of pages charged as shmem */ - MEM_CGROUP_STAT_FILE_MAPPED, /* # of pages charged as file rss */ - MEM_CGROUP_STAT_DIRTY, /* # of dirty pages in page cache */ - MEM_CGROUP_STAT_WRITEBACK, /* # of pages under writeback */ - MEM_CGROUP_STAT_SWAP, /* # of pages, swapped out */ - MEM_CGROUP_STAT_NSTATS, - /* default hierarchy stats */ - MEMCG_KERNEL_STACK_KB = MEM_CGROUP_STAT_NSTATS, +/* Cgroup-specific page state, on top of universal node page state */ +enum memcg_stat_item { + MEMCG_CACHE = NR_VM_NODE_STAT_ITEMS, + MEMCG_RSS, + MEMCG_RSS_HUGE, + MEMCG_SWAP, + MEMCG_SOCK, + /* XXX: why are these zone and not node counters? 
*/ + MEMCG_KERNEL_STACK_KB, MEMCG_SLAB_RECLAIMABLE, MEMCG_SLAB_UNRECLAIMABLE, - MEMCG_SOCK, - MEMCG_WORKINGSET_REFAULT, - MEMCG_WORKINGSET_ACTIVATE, - MEMCG_WORKINGSET_NODERECLAIM, MEMCG_NR_STAT, }; +/* Cgroup-specific events, on top of universal VM events */ +enum memcg_event_item { + MEMCG_LOW = NR_VM_EVENT_ITEMS, + MEMCG_HIGH, + MEMCG_MAX, + MEMCG_OOM, + MEMCG_NR_EVENTS, +}; + struct mem_cgroup_reclaim_cookie { pg_data_t *pgdat; int priority; unsigned int generation; }; +#ifdef CONFIG_MEMCG + +#define MEM_CGROUP_ID_SHIFT 16 +#define MEM_CGROUP_ID_MAX USHRT_MAX + +struct mem_cgroup_id { + int id; + atomic_t ref; +}; + /* * Per memcg event counter is incremented at every pagein/pageout. With THP, * it will be incremated by the number of pages. This counter is used for @@ -82,25 +87,6 @@ enum mem_cgroup_events_target { MEM_CGROUP_NTARGETS, }; -#ifdef CONFIG_MEMCG - -#define MEM_CGROUP_ID_SHIFT 16 -#define MEM_CGROUP_ID_MAX USHRT_MAX - -struct mem_cgroup_id { - int id; - atomic_t ref; -}; - -/* Cgroup-specific events, on top of universal VM events */ -enum memcg_event_item { - MEMCG_LOW = NR_VM_EVENT_ITEMS, - MEMCG_HIGH, - MEMCG_MAX, - MEMCG_OOM, - MEMCG_NR_EVENTS, -}; - struct mem_cgroup_stat_cpu { long count[MEMCG_NR_STAT]; unsigned long events[MEMCG_NR_EVENTS]; @@ -487,7 +473,7 @@ void lock_page_memcg(struct page *page); void unlock_page_memcg(struct page *page); static inline unsigned long mem_cgroup_read_stat(struct mem_cgroup *memcg, - enum mem_cgroup_stat_index idx) + enum memcg_stat_item idx) { long val = 0; int cpu; @@ -502,20 +488,20 @@ static inline unsigned long mem_cgroup_read_stat(struct mem_cgroup *memcg, } static inline void mem_cgroup_update_stat(struct mem_cgroup *memcg, - enum mem_cgroup_stat_index idx, int val) + enum memcg_stat_item idx, int val) { if (!mem_cgroup_disabled()) this_cpu_add(memcg->stat->count[idx], val); } static inline void mem_cgroup_inc_stat(struct mem_cgroup *memcg, - enum mem_cgroup_stat_index idx) + enum memcg_stat_item idx) 
 {
 	mem_cgroup_update_stat(memcg, idx, 1);
 }

 static inline void mem_cgroup_dec_stat(struct mem_cgroup *memcg,
-					enum mem_cgroup_stat_index idx)
+					enum memcg_stat_item idx)
 {
 	mem_cgroup_update_stat(memcg, idx, -1);
 }
@@ -538,20 +524,20 @@ static inline void mem_cgroup_dec_stat(struct mem_cgroup *memcg,
  * Kernel pages are an exception to this, since they'll never move.
  */
 static inline void mem_cgroup_update_page_stat(struct page *page,
-				 enum mem_cgroup_stat_index idx, int val)
+				 enum memcg_stat_item idx, int val)
 {
 	if (page->mem_cgroup)
 		mem_cgroup_update_stat(page->mem_cgroup, idx, val);
 }

 static inline void mem_cgroup_inc_page_stat(struct page *page,
-					    enum mem_cgroup_stat_index idx)
+					    enum memcg_stat_item idx)
 {
 	mem_cgroup_update_page_stat(page, idx, 1);
 }

 static inline void mem_cgroup_dec_page_stat(struct page *page,
-					    enum mem_cgroup_stat_index idx)
+					    enum memcg_stat_item idx)
 {
 	mem_cgroup_update_page_stat(page, idx, -1);
 }
@@ -760,33 +746,33 @@ static inline unsigned long mem_cgroup_read_stat(struct mem_cgroup *memcg,
 }

 static inline void mem_cgroup_update_stat(struct mem_cgroup *memcg,
-					  enum mem_cgroup_stat_index idx, int val)
+					  enum memcg_stat_item idx, int val)
 {
 }

 static inline void mem_cgroup_inc_stat(struct mem_cgroup *memcg,
-				       enum mem_cgroup_stat_index idx)
+				       enum memcg_stat_item idx)
 {
 }

 static inline void mem_cgroup_dec_stat(struct mem_cgroup *memcg,
-				       enum mem_cgroup_stat_index idx)
+				       enum memcg_stat_item idx)
 {
 }

 static inline void mem_cgroup_update_page_stat(struct page *page,
-					       enum mem_cgroup_stat_index idx,
+					       enum memcg_stat_item idx,
 					       int nr)
 {
 }

 static inline void mem_cgroup_inc_page_stat(struct page *page,
-					    enum mem_cgroup_stat_index idx)
+					    enum memcg_stat_item idx)
 {
 }

 static inline void mem_cgroup_dec_page_stat(struct page *page,
-					    enum mem_cgroup_stat_index idx)
+					    enum memcg_stat_item idx)
 {
 }

@@ -906,7 +892,7 @@ static inline int memcg_cache_id(struct mem_cgroup *memcg)
  * @val: number of pages (positive or negative)
  */
 static inline void memcg_kmem_update_page_stat(struct page *page,
-				enum mem_cgroup_stat_index idx, int val)
+				enum memcg_stat_item idx, int val)
 {
 	if (memcg_kmem_enabled() && page->mem_cgroup)
 		this_cpu_add(page->mem_cgroup->stat->count[idx], val);
@@ -935,7 +921,7 @@ static inline void memcg_put_cache_ids(void)
 }

 static inline void memcg_kmem_update_page_stat(struct page *page,
-				enum mem_cgroup_stat_index idx, int val)
+				enum memcg_stat_item idx, int val)
 {
 }
 #endif /* CONFIG_MEMCG && !CONFIG_SLOB */
-- 
cgit v1.2.3


From ccda7f4360be86b87497c50d1f58aab3fd85a9a5 Mon Sep 17 00:00:00 2001
From: Johannes Weiner
Date: Wed, 3 May 2017 14:55:16 -0700
Subject: mm: memcontrol: use node page state naming scheme for memcg

The memory controller's stat function names are awkwardly long and
arbitrarily different from the zone and node stat functions.

The current interface is named:

  mem_cgroup_read_stat()
  mem_cgroup_update_stat()
  mem_cgroup_inc_stat()
  mem_cgroup_dec_stat()
  mem_cgroup_update_page_stat()
  mem_cgroup_inc_page_stat()
  mem_cgroup_dec_page_stat()

This patch renames it to match the corresponding node stat functions:

  memcg_page_state()        [node_page_state()]
  mod_memcg_state()         [mod_node_state()]
  inc_memcg_state()         [inc_node_state()]
  dec_memcg_state()         [dec_node_state()]
  mod_memcg_page_state()    [mod_node_page_state()]
  inc_memcg_page_state()    [inc_node_page_state()]
  dec_memcg_page_state()    [dec_node_page_state()]

Link: http://lkml.kernel.org/r/20170404220148.28338-4-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner
Acked-by: Vladimir Davydov
Acked-by: Michal Hocko
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 include/linux/memcontrol.h | 73 +++++++++++++++++++++++-----------------------
 1 file changed, 37 insertions(+), 36 deletions(-)

(limited to 'include')

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 0fa1f5de6841..899949bbb2f9 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -472,8 +472,8 @@ extern int do_swap_account;
 void lock_page_memcg(struct page *page);
 void unlock_page_memcg(struct page *page);

-static inline unsigned long mem_cgroup_read_stat(struct mem_cgroup *memcg,
-						 enum memcg_stat_item idx)
+static inline unsigned long memcg_page_state(struct mem_cgroup *memcg,
+					     enum memcg_stat_item idx)
 {
 	long val = 0;
 	int cpu;
@@ -487,27 +487,27 @@ static inline unsigned long mem_cgroup_read_stat(struct mem_cgroup *memcg,
 	return val;
 }

-static inline void mem_cgroup_update_stat(struct mem_cgroup *memcg,
-					  enum memcg_stat_item idx, int val)
+static inline void mod_memcg_state(struct mem_cgroup *memcg,
+				   enum memcg_stat_item idx, int val)
 {
 	if (!mem_cgroup_disabled())
 		this_cpu_add(memcg->stat->count[idx], val);
 }

-static inline void mem_cgroup_inc_stat(struct mem_cgroup *memcg,
-				       enum memcg_stat_item idx)
+static inline void inc_memcg_state(struct mem_cgroup *memcg,
+				   enum memcg_stat_item idx)
 {
-	mem_cgroup_update_stat(memcg, idx, 1);
+	mod_memcg_state(memcg, idx, 1);
 }

-static inline void mem_cgroup_dec_stat(struct mem_cgroup *memcg,
-				       enum memcg_stat_item idx)
+static inline void dec_memcg_state(struct mem_cgroup *memcg,
+				   enum memcg_stat_item idx)
 {
-	mem_cgroup_update_stat(memcg, idx, -1);
+	mod_memcg_state(memcg, idx, -1);
 }

 /**
- * mem_cgroup_update_page_stat - update page state statistics
+ * mod_memcg_page_state - update page state statistics
  * @page: the page
  * @idx: page state item to account
  * @val: number of pages (positive or negative)
@@ -518,28 +518,28 @@ static inline void mem_cgroup_dec_stat(struct mem_cgroup *memcg,
  *
  *   lock_page(page) or lock_page_memcg(page)
  *   if (TestClearPageState(page))
- *     mem_cgroup_update_page_stat(page, state, -1);
+ *     mod_memcg_page_state(page, state, -1);
  *   unlock_page(page) or unlock_page_memcg(page)
  *
  * Kernel pages are an exception to this, since they'll never move.
  */
-static inline void mem_cgroup_update_page_stat(struct page *page,
-				 enum memcg_stat_item idx, int val)
+static inline void mod_memcg_page_state(struct page *page,
+					enum memcg_stat_item idx, int val)
 {
 	if (page->mem_cgroup)
-		mem_cgroup_update_stat(page->mem_cgroup, idx, val);
+		mod_memcg_state(page->mem_cgroup, idx, val);
 }

-static inline void mem_cgroup_inc_page_stat(struct page *page,
-					    enum memcg_stat_item idx)
+static inline void inc_memcg_page_state(struct page *page,
+					enum memcg_stat_item idx)
 {
-	mem_cgroup_update_page_stat(page, idx, 1);
+	mod_memcg_page_state(page, idx, 1);
 }

-static inline void mem_cgroup_dec_page_stat(struct page *page,
-					    enum memcg_stat_item idx)
+static inline void dec_memcg_page_state(struct page *page,
+					enum memcg_stat_item idx)
 {
-	mem_cgroup_update_page_stat(page, idx, -1);
+	mod_memcg_page_state(page, idx, -1);
 }

 unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
@@ -739,40 +739,41 @@ static inline bool mem_cgroup_oom_synchronize(bool wait)
 	return false;
 }

-static inline unsigned long mem_cgroup_read_stat(struct mem_cgroup *memcg,
-						 enum mem_cgroup_stat_index idx)
+static inline unsigned long memcg_page_state(struct mem_cgroup *memcg,
+					     enum memcg_stat_item idx)
 {
 	return 0;
 }

-static inline void mem_cgroup_update_stat(struct mem_cgroup *memcg,
-					  enum memcg_stat_item idx, int val)
+static inline void mod_memcg_state(struct mem_cgroup *memcg,
+				   enum memcg_stat_item idx,
+				   int nr)
 {
 }

-static inline void mem_cgroup_inc_stat(struct mem_cgroup *memcg,
-				       enum memcg_stat_item idx)
+static inline void inc_memcg_state(struct mem_cgroup *memcg,
+				   enum memcg_stat_item idx)
 {
 }

-static inline void mem_cgroup_dec_stat(struct mem_cgroup *memcg,
-				       enum memcg_stat_item idx)
+static inline void dec_memcg_state(struct mem_cgroup *memcg,
+				   enum memcg_stat_item idx)
 {
 }

-static inline void mem_cgroup_update_page_stat(struct page *page,
-					       enum memcg_stat_item idx,
-					       int nr)
+static inline void mod_memcg_page_state(struct page *page,
+					enum memcg_stat_item idx,
+					int nr)
 {
 }

-static inline void mem_cgroup_inc_page_stat(struct page *page,
-					    enum memcg_stat_item idx)
+static inline void inc_memcg_page_state(struct page *page,
+					enum memcg_stat_item idx)
 {
 }

-static inline void mem_cgroup_dec_page_stat(struct page *page,
-					    enum memcg_stat_item idx)
+static inline void dec_memcg_page_state(struct page *page,
+					enum memcg_stat_item idx)
 {
 }

-- 
cgit v1.2.3


From df6b7499806bffd233e6dd0465901827b0b385b8 Mon Sep 17 00:00:00 2001
From: Huang Ying
Date: Wed, 3 May 2017 14:55:19 -0700
Subject: mm, swap: remove unused function prototype

This is a code cleanup patch with no functional changes.  There are 2
unused function prototypes in swap.h; they are removed.

Link: http://lkml.kernel.org/r/20170405071017.23677-1-ying.huang@intel.com
Signed-off-by: "Huang, Ying"
Cc: Tim Chen
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 include/linux/swap.h | 3 ---
 1 file changed, 3 deletions(-)

(limited to 'include')

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 486494e6b2fc..ba5882419a7d 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -411,9 +411,6 @@ struct backing_dev_info;
 extern int init_swap_address_space(unsigned int type, unsigned long nr_pages);
 extern void exit_swap_address_space(unsigned int type);

-extern int get_swap_slots(int n, swp_entry_t *slots);
-extern void swapcache_free_batch(swp_entry_t *entries, int n);
-
 #else /* CONFIG_SWAP */

 #define swap_address_space(entry)		(NULL)
-- 
cgit v1.2.3