From ebcbc6ea7d8a604ad8504dae70a6ac1b1e64a0b7 Mon Sep 17 00:00:00 2001
From: Hugh Dickins
Date: Mon, 14 Feb 2022 18:20:24 -0800
Subject: mm/munlock: delete page_mlock() and all its works

We have recommended some applications to mlock their userspace, but that turns out to be counter-productive: when many processes mlock the same file, contention on rmap's i_mmap_rwsem can become intolerable at exit: it is needed for write, to remove any vma mapping that file from rmap's tree; but hogged for read by those with mlocks calling page_mlock() (formerly known as try_to_munlock()) on *each* page mapped from the file (the purpose being to find out whether another process has the page mlocked, so therefore it should not be unmlocked yet).

Several optimizations have been made in the past: one is to skip page_mlock() when mapcount tells that nothing else has this page mapped; but that doesn't help at all when others do have it mapped. This time around, I initially intended to add a preliminary search of the rmap tree for overlapping VM_LOCKED ranges; but that gets messy with locking order, when in doubt whether a page is actually present; and risks adding even more contention on the i_mmap_rwsem.

A solution would be much easier, if only there were space in struct page for an mlock_count... but actually, most of the time, there is space for it - an mlocked page spends most of its life on an unevictable LRU, but since 3.18 removed the scan_unevictable_pages sysctl, that "LRU" has been redundant. Let's try to reuse its page->lru.

But leave that until a later patch: in this patch, clear the ground by removing page_mlock(), and all the infrastructure that has gathered around it - which mostly hinders understanding, and will make reviewing new additions harder. Don't mind those old comments about THPs, they date from before 4.5's refcounting rework: splitting is not a risk here.

Just keep a minimal version of munlock_vma_page(), as reminder of what it should attend to (in particular, the odd way PGSTRANDED is counted out of PGMUNLOCKED), and likewise a stub for munlock_vma_pages_range(). Move unchanged __mlock_posix_error_return() out of the way, down to above its caller: this series then makes no further change after mlock_fixup().

After this and each following commit, the kernel builds, boots and runs; but with deficiencies which may show up in testing of mlock and munlock. The system calls succeed or fail as before, and mlock remains effective in preventing page reclaim; but meminfo's Unevictable and Mlocked amounts may be shown too low after mlock, grow, then stay too high after munlock: with previously mlocked pages remaining unevictable for too long, until finally unmapped and freed and counts corrected. Normal service will be resumed in "mm/munlock: mlock_pte_range() when mlocking or munlocking".

Signed-off-by: Hugh Dickins
Acked-by: Vlastimil Babka
Signed-off-by: Matthew Wilcox (Oracle)
---
 include/linux/rmap.h | 6 ------
 1 file changed, 6 deletions(-)

(limited to 'include/linux/rmap.h')

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index e704b1a4c06c..dc48aa8c2c94 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -237,12 +237,6 @@ unsigned long page_address_in_vma(struct page *, struct vm_area_struct *);
  */
 int folio_mkclean(struct folio *);
-/*
- * called in munlock()/munmap() path to check for other vmas holding
- * the page mlocked.
- */
-void page_mlock(struct page *page);
-
 void remove_migration_ptes(struct page *old, struct page *new, bool locked);
 /*

-- cgit v1.2.3


From cea86fe246b694a191804b47378eb9d77aefabec Mon Sep 17 00:00:00 2001
From: Hugh Dickins
Date: Mon, 14 Feb 2022 18:26:39 -0800
Subject: mm/munlock: rmap call mlock_vma_page() munlock_vma_page()

Add vma argument to mlock_vma_page() and munlock_vma_page(), make them inline functions which check (vma->vm_flags & VM_LOCKED) before calling mlock_page() and munlock_page() in mm/mlock.c.

Add bool compound to mlock_vma_page() and munlock_vma_page(): this is because we have understandable difficulty in accounting pte maps of THPs, and if passed a PageHead page, mlock_page() and munlock_page() cannot tell whether it's a pmd map to be counted or a pte map to be ignored.

Add vma arg to page_add_file_rmap() and page_remove_rmap(), like the others, and use that to call mlock_vma_page() at the end of the page adds, and munlock_vma_page() at the end of page_remove_rmap() (end or beginning? unimportant, but end was easier for assertions in testing).

No page lock is required (although almost all adds happen to hold it): delete the "Serialize with page migration" BUG_ON(!PageLocked(page))s. Certainly page lock did serialize with page migration, but I'm having difficulty explaining why that was ever important.

Mlock accounting on THPs has been hard to define, differed between anon and file, involved PageDoubleMap in some places and not others, required clear_page_mlock() at some points. Keep it simple now: just count the pmds and ignore the ptes, there is no reason for ptes to undo pmd mlocks.

page_add_new_anon_rmap() callers unchanged: they have long been calling lru_cache_add_inactive_or_unevictable(), which does its own VM_LOCKED handling (it also checks for not VM_SPECIAL: I think that's overcautious, and inconsistent with other checks, that mmap_region() already prevents VM_LOCKED on VM_SPECIAL; but haven't quite convinced myself to change it).
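For illustration, a minimal sketch of the wrapper shape described above, assuming the real definitions live in mm/internal.h; the exact THP filtering in the patch may differ, the PageTransCompound() check here just expresses "count pmd maps, ignore pte maps of a THP":

/* Sketch only: filter on VM_LOCKED before calling into mm/mlock.c. */
static inline void mlock_vma_page(struct page *page,
        struct vm_area_struct *vma, bool compound)
{
        if (unlikely(vma->vm_flags & VM_LOCKED) &&
            (compound || !PageTransCompound(page)))
                mlock_page(page);
}

static inline void munlock_vma_page(struct page *page,
        struct vm_area_struct *vma, bool compound)
{
        if (unlikely(vma->vm_flags & VM_LOCKED) &&
            (compound || !PageTransCompound(page)))
                munlock_page(page);
}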
Signed-off-by: Hugh Dickins
Acked-by: Vlastimil Babka
Signed-off-by: Matthew Wilcox (Oracle)
---
 include/linux/rmap.h | 17 +++++++++--------
 1 file changed, 9 insertions(+), 8 deletions(-)

(limited to 'include/linux/rmap.h')

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index dc48aa8c2c94..ac29b076082b 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -167,18 +167,19 @@ struct anon_vma *page_get_anon_vma(struct page *page);
  */
 void page_move_anon_rmap(struct page *, struct vm_area_struct *);
 void page_add_anon_rmap(struct page *, struct vm_area_struct *,
-		unsigned long, bool);
+		unsigned long address, bool compound);
 void do_page_add_anon_rmap(struct page *, struct vm_area_struct *,
-		unsigned long, int);
+		unsigned long address, int flags);
 void page_add_new_anon_rmap(struct page *, struct vm_area_struct *,
-		unsigned long, bool);
-void page_add_file_rmap(struct page *, bool);
-void page_remove_rmap(struct page *, bool);
-
+		unsigned long address, bool compound);
+void page_add_file_rmap(struct page *, struct vm_area_struct *,
+		bool compound);
+void page_remove_rmap(struct page *, struct vm_area_struct *,
+		bool compound);
 void hugepage_add_anon_rmap(struct page *, struct vm_area_struct *,
-		unsigned long);
+		unsigned long address);
 void hugepage_add_new_anon_rmap(struct page *, struct vm_area_struct *,
-		unsigned long);
+		unsigned long address);
 static inline void page_dup_rmap(struct page *page, bool compound)
 {

-- cgit v1.2.3


From eed05e54d275b3cfc5d8c79843c5276a5878e94a Mon Sep 17 00:00:00 2001
From: "Matthew Wilcox (Oracle)"
Date: Thu, 3 Feb 2022 09:06:08 -0500
Subject: mm: Add DEFINE_PAGE_VMA_WALK and DEFINE_FOLIO_VMA_WALK

Instead of declaring a struct page_vma_mapped_walk directly, use these helpers to allow us to transition to a PFN approach in the following patches.

Signed-off-by: Matthew Wilcox (Oracle)
---
 include/linux/rmap.h | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

(limited to 'include/linux/rmap.h')

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index ac29b076082b..0d894a2bfaa1 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -214,6 +214,22 @@ struct page_vma_mapped_walk {
 	unsigned int flags;
 };
+#define DEFINE_PAGE_VMA_WALK(name, _page, _vma, _address, _flags) \
+	struct page_vma_mapped_walk name = { \
+		.page = _page, \
+		.vma = _vma, \
+		.address = _address, \
+		.flags = _flags, \
+	}
+
+#define DEFINE_FOLIO_VMA_WALK(name, _folio, _vma, _address, _flags) \
+	struct page_vma_mapped_walk name = { \
+		.page = &_folio->page, \
+		.vma = _vma, \
+		.address = _address, \
+		.flags = _flags, \
+	}
+
 static inline void page_vma_mapped_walk_done(struct page_vma_mapped_walk *pvmw)
 {
 	/* HugeTLB pte is set to the relevant page table entry without pte_mapped. */

-- cgit v1.2.3


From 2aff7a4755bed2870ee23b75bc88cdc8d76cdd03 Mon Sep 17 00:00:00 2001
From: "Matthew Wilcox (Oracle)"
Date: Thu, 3 Feb 2022 11:40:17 -0500
Subject: mm: Convert page_vma_mapped_walk to work on PFNs

page_mapped_in_vma() really just wants to walk one page, but as the code stands, if passed the head page of a compound page, it will walk every page in the compound page. Extract pfn/nr_pages/pgoff from the struct page early, so they can be overridden by page_mapped_in_vma().
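For illustration, a rough sketch (not the patch's own code) of the trick this enables: once pfn/nr_pages/pgoff live in the walk structure, a caller such as page_mapped_in_vma() can override them to restrict the walk to a single page.

/* Sketch only: check one specific page of a possibly-compound page. */
static bool page_is_mapped_here(struct page *page, struct vm_area_struct *vma,
        unsigned long address)
{
        DEFINE_PAGE_VMA_WALK(pvmw, page, vma, address, PVMW_SYNC);

        pvmw.nr_pages = 1;      /* override: just this page, not the whole folio */
        if (!page_vma_mapped_walk(&pvmw))
                return false;
        page_vma_mapped_walk_done(&pvmw);       /* drop the pte map/lock it left held */
        return true;
}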
Signed-off-by: Matthew Wilcox (Oracle)
---
 include/linux/rmap.h | 17 ++++++++++++-----
 1 file changed, 12 insertions(+), 5 deletions(-)

(limited to 'include/linux/rmap.h')

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 0d894a2bfaa1..0c838ba1a8ee 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -11,6 +11,7 @@
 #include
 #include
 #include
+#include
 /*
  * The anon_vma heads a list of private "related" vmas, to scan if
  *
@@ -201,11 +202,13 @@ int make_device_exclusive_range(struct mm_struct *mm, unsigned long start,
 /* Avoid racy checks */
 #define PVMW_SYNC (1 << 0)
-/* Look for migarion entries rather than present PTEs */
+/* Look for migration entries rather than present PTEs */
 #define PVMW_MIGRATION (1 << 1)
 struct page_vma_mapped_walk {
-	struct page *page;
+	unsigned long pfn;
+	unsigned long nr_pages;
+	pgoff_t pgoff;
 	struct vm_area_struct *vma;
 	unsigned long address;
 	pmd_t *pmd;
@@ -216,7 +219,9 @@ struct page_vma_mapped_walk {
 #define DEFINE_PAGE_VMA_WALK(name, _page, _vma, _address, _flags) \
 	struct page_vma_mapped_walk name = { \
-		.page = _page, \
+		.pfn = page_to_pfn(_page), \
+		.nr_pages = compound_nr(page), \
+		.pgoff = page_to_pgoff(page), \
 		.vma = _vma, \
 		.address = _address, \
 		.flags = _flags, \
@@ -224,7 +229,9 @@ struct page_vma_mapped_walk {
 #define DEFINE_FOLIO_VMA_WALK(name, _folio, _vma, _address, _flags) \
 	struct page_vma_mapped_walk name = { \
-		.page = &_folio->page, \
+		.pfn = folio_pfn(_folio), \
+		.nr_pages = folio_nr_pages(_folio), \
+		.pgoff = folio_pgoff(_folio), \
 		.vma = _vma, \
 		.address = _address, \
 		.flags = _flags, \
@@ -233,7 +240,7 @@ struct page_vma_mapped_walk {
 static inline void page_vma_mapped_walk_done(struct page_vma_mapped_walk *pvmw)
 {
 	/* HugeTLB pte is set to the relevant page table entry without pte_mapped. */
-	if (pvmw->pte && !PageHuge(pvmw->page))
+	if (pvmw->pte && !is_vm_hugetlb_page(pvmw->vma))
 		pte_unmap(pvmw->pte);
 	if (pvmw->ptl)
 		spin_unlock(pvmw->ptl);

-- cgit v1.2.3


From b3ac04132c4b9bc5c9c14608424d410e7ca3b400 Mon Sep 17 00:00:00 2001
From: "Matthew Wilcox (Oracle)"
Date: Fri, 21 Jan 2022 11:27:31 -0500
Subject: mm/rmap: Turn page_referenced() into folio_referenced()

Both its callers pass a page which was previously on an LRU list, so were passing a folio by definition. Use the type system to enforce that and remove a few calls to compound_head().
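For illustration, a hypothetical call site (not from the patch) showing the pattern this conversion produces; page_folio() resolves the head page once, so no later compound_head() calls are needed:

/* Sketch only: hypothetical caller converted to the folio API. */
static bool page_was_referenced(struct page *page, struct mem_cgroup *memcg)
{
        struct folio *folio = page_folio(page);
        unsigned long vm_flags;

        /* second argument is "is_locked", per the declaration above */
        return folio_referenced(folio, 0, memcg, &vm_flags) != 0;
}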
Signed-off-by: Matthew Wilcox (Oracle)
Reviewed-by: Christoph Hellwig
---
 include/linux/rmap.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

(limited to 'include/linux/rmap.h')

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 0c838ba1a8ee..69a1664216de 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -190,7 +190,7 @@ static inline void page_dup_rmap(struct page *page, bool compound)
 /*
  * Called from mm/vmscan.c to handle paging out
  */
-int page_referenced(struct page *, int is_locked,
+int folio_referenced(struct folio *, int is_locked,
 			struct mem_cgroup *memcg, unsigned long *vm_flags);
 void try_to_migrate(struct page *page, enum ttu_flags flags);
@@ -301,7 +301,7 @@ void rmap_walk_locked(struct page *page, struct rmap_walk_control *rwc);
 #define anon_vma_prepare(vma) (0)
 #define anon_vma_link(vma) do {} while (0)
-static inline int page_referenced(struct page *page, int is_locked,
+static inline int folio_referenced(struct folio *folio, int is_locked,
 				  struct mem_cgroup *memcg, unsigned long *vm_flags)
 {

-- cgit v1.2.3


From 869f7ee6f6477341f859c8b0949ae81caf9ca7f3 Mon Sep 17 00:00:00 2001
From: "Matthew Wilcox (Oracle)"
Date: Tue, 15 Feb 2022 09:28:49 -0500
Subject: mm/rmap: Convert try_to_unmap() to take a folio

Change all three callers and the worker function try_to_unmap_one().

Signed-off-by: Matthew Wilcox (Oracle)
---
 include/linux/rmap.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

(limited to 'include/linux/rmap.h')

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 69a1664216de..a0c5c38c733f 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -194,7 +194,7 @@ int folio_referenced(struct folio *, int is_locked,
 			struct mem_cgroup *memcg, unsigned long *vm_flags);
 void try_to_migrate(struct page *page, enum ttu_flags flags);
-void try_to_unmap(struct page *, enum ttu_flags flags);
+void try_to_unmap(struct folio *, enum ttu_flags flags);
 int make_device_exclusive_range(struct mm_struct *mm, unsigned long start,
 				unsigned long end, struct page **pages,
@@ -309,7 +309,7 @@ static inline int folio_referenced(struct folio *folio, int is_locked,
 	return 0;
 }
-static inline void try_to_unmap(struct page *page, enum ttu_flags flags)
+static inline void try_to_unmap(struct folio *folio, enum ttu_flags flags)
 {
 }

-- cgit v1.2.3


From 4b8554c527f3cfa183f6c06d231a9387873205a0 Mon Sep 17 00:00:00 2001
From: "Matthew Wilcox (Oracle)"
Date: Fri, 28 Jan 2022 14:29:43 -0500
Subject: mm/rmap: Convert try_to_migrate() to folios

Convert the callers to pass a folio and the try_to_migrate_one() worker to use a folio throughout. Fixes an assumption that a folio must be <= PMD size.
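For illustration, a hypothetical helper (names invented here) showing the folio-based calls after the two conversions above:

/* Sketch only: unmap a folio, or replace its ptes with migration entries. */
static void unmap_or_migrate_folio(struct folio *folio, enum ttu_flags flags,
        bool migration)
{
        if (migration)
                try_to_migrate(folio, flags);   /* installs migration entries */
        else
                try_to_unmap(folio, flags);

        /* may legitimately still be mapped, e.g. if the folio is mlocked */
        if (folio_mapped(folio))
                pr_debug("folio pfn %lu still mapped\n", folio_pfn(folio));
}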
Signed-off-by: Matthew Wilcox (Oracle)
---
 include/linux/rmap.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

(limited to 'include/linux/rmap.h')

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index a0c5c38c733f..e6e935c81414 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -193,7 +193,7 @@ static inline void page_dup_rmap(struct page *page, bool compound)
 int folio_referenced(struct folio *, int is_locked,
 			struct mem_cgroup *memcg, unsigned long *vm_flags);
-void try_to_migrate(struct page *page, enum ttu_flags flags);
+void try_to_migrate(struct folio *folio, enum ttu_flags flags);
 void try_to_unmap(struct folio *, enum ttu_flags flags);
 int make_device_exclusive_range(struct mm_struct *mm, unsigned long start,

-- cgit v1.2.3


From 4eecb8b9163df82c87c91764a02fff228ef25f6d Mon Sep 17 00:00:00 2001
From: "Matthew Wilcox (Oracle)"
Date: Fri, 28 Jan 2022 23:32:59 -0500
Subject: mm/migrate: Convert remove_migration_ptes() to folios

Convert the implementation and all callers.

Signed-off-by: Matthew Wilcox (Oracle)
---
 include/linux/rmap.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

(limited to 'include/linux/rmap.h')

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index e6e935c81414..21af80d5b711 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -261,7 +261,7 @@ unsigned long page_address_in_vma(struct page *, struct vm_area_struct *);
  */
 int folio_mkclean(struct folio *);
-void remove_migration_ptes(struct page *old, struct page *new, bool locked);
+void remove_migration_ptes(struct folio *src, struct folio *dst, bool locked);
 /*
  * Called by memory-failure.c to kill processes.
  */

-- cgit v1.2.3


From 9595d76942b8714627d670a7e7ae543812c731ae Mon Sep 17 00:00:00 2001
From: "Matthew Wilcox (Oracle)"
Date: Tue, 1 Feb 2022 23:33:08 -0500
Subject: mm/rmap: Turn page_lock_anon_vma_read() into folio_lock_anon_vma_read()

Add back page_lock_anon_vma_read() as a wrapper. This saves a few calls to compound_head(). If any callers were passing a tail page before, this would have failed to lock the anon VMA as page->mapping is not valid for tail pages.

Signed-off-by: Matthew Wilcox (Oracle)
---
 include/linux/rmap.h | 1 +
 1 file changed, 1 insertion(+)

(limited to 'include/linux/rmap.h')

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 21af80d5b711..be020d38b0a5 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -267,6 +267,7 @@ void remove_migration_ptes(struct folio *src, struct folio *dst, bool locked);
  * Called by memory-failure.c to kill processes.
  */
 struct anon_vma *page_lock_anon_vma_read(struct page *page);
+struct anon_vma *folio_lock_anon_vma_read(struct folio *folio);
 void page_unlock_anon_vma_read(struct anon_vma *anon_vma);
 int page_mapped_in_vma(struct page *page, struct vm_area_struct *vma);

-- cgit v1.2.3


From 2f031c6f042cb8a9b221a8b6b80e69de5170f830 Mon Sep 17 00:00:00 2001
From: "Matthew Wilcox (Oracle)"
Date: Sat, 29 Jan 2022 16:06:53 -0500
Subject: mm/rmap: Convert rmap_walk() to take a folio

This ripples all the way through to every calling and called function from rmap.
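For illustration, a hypothetical walk built on the converted callbacks (the helper names are invented; the rmap_one and rmap_walk signatures are the ones declared above):

/* Sketch only: count how many VM_LOCKED vmas map a folio. */
static bool count_locked_one(struct folio *folio, struct vm_area_struct *vma,
        unsigned long addr, void *arg)
{
        int *count = arg;

        if (vma->vm_flags & VM_LOCKED)
                (*count)++;
        return true;    /* true: keep walking the remaining vmas */
}

static int count_locked_mappings(struct folio *folio)
{
        int count = 0;
        struct rmap_walk_control rwc = {
                .rmap_one = count_locked_one,
                .arg = &count,
        };

        rmap_walk(folio, &rwc);
        return count;
}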
Signed-off-by: Matthew Wilcox (Oracle)
---
 include/linux/rmap.h | 11 +++++------
 1 file changed, 5 insertions(+), 6 deletions(-)

(limited to 'include/linux/rmap.h')

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index be020d38b0a5..a74d811c5b3f 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -266,7 +266,6 @@ void remove_migration_ptes(struct folio *src, struct folio *dst, bool locked);
 /*
  * Called by memory-failure.c to kill processes.
  */
-struct anon_vma *page_lock_anon_vma_read(struct page *page);
 struct anon_vma *folio_lock_anon_vma_read(struct folio *folio);
 void page_unlock_anon_vma_read(struct anon_vma *anon_vma);
 int page_mapped_in_vma(struct page *page, struct vm_area_struct *vma);
@@ -286,15 +285,15 @@ struct rmap_walk_control {
	 * Return false if page table scanning in rmap_walk should be stopped.
	 * Otherwise, return true.
	 */
-	bool (*rmap_one)(struct page *page, struct vm_area_struct *vma,
+	bool (*rmap_one)(struct folio *folio, struct vm_area_struct *vma,
					unsigned long addr, void *arg);
-	int (*done)(struct page *page);
-	struct anon_vma *(*anon_lock)(struct page *page);
+	int (*done)(struct folio *folio);
+	struct anon_vma *(*anon_lock)(struct folio *folio);
 	bool (*invalid_vma)(struct vm_area_struct *vma, void *arg);
 };
-void rmap_walk(struct page *page, struct rmap_walk_control *rwc);
-void rmap_walk_locked(struct page *page, struct rmap_walk_control *rwc);
+void rmap_walk(struct folio *folio, struct rmap_walk_control *rwc);
+void rmap_walk_locked(struct folio *folio, struct rmap_walk_control *rwc);
 #else /* !CONFIG_MMU */

-- cgit v1.2.3


From 84fbbe21894bb9be8e16df408cbfbb9466fe396d Mon Sep 17 00:00:00 2001
From: "Matthew Wilcox (Oracle)"
Date: Sat, 29 Jan 2022 16:16:54 -0500
Subject: mm/rmap: Constify the rmap_walk_control argument

The rmap walking functions do not modify the rmap_walk_control, and page_idle_clear_pte_refs() takes advantage of that to move construction of the rmap_walk_control to compile time. This lets us remove an unclean cast.

Signed-off-by: Matthew Wilcox (Oracle)
---
 include/linux/rmap.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

(limited to 'include/linux/rmap.h')

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index a74d811c5b3f..17230c458341 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -292,8 +292,8 @@ struct rmap_walk_control {
 	bool (*invalid_vma)(struct vm_area_struct *vma, void *arg);
 };
-void rmap_walk(struct folio *folio, struct rmap_walk_control *rwc);
-void rmap_walk_locked(struct folio *folio, struct rmap_walk_control *rwc);
+void rmap_walk(struct folio *folio, const struct rmap_walk_control *rwc);
+void rmap_walk_locked(struct folio *folio, const struct rmap_walk_control *rwc);
 #else /* !CONFIG_MMU */

-- cgit v1.2.3
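For illustration, the kind of compile-time construction the constified argument allows; the names below are hypothetical and the per-mapping work is elided, while the real beneficiary named above is page_idle_clear_pte_refs():

/* Sketch only: a file-scope const control structure, no cast needed. */
static bool folio_clear_ref_one(struct folio *folio, struct vm_area_struct *vma,
        unsigned long addr, void *arg)
{
        /* per-mapping work elided in this sketch */
        return true;
}

static const struct rmap_walk_control clear_ref_rwc = {
        .rmap_one = folio_clear_ref_one,
        .anon_lock = folio_lock_anon_vma_read,
};

static void folio_clear_refs(struct folio *folio)
{
        rmap_walk(folio, &clear_ref_rwc);
}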