From 3d3544a6c996e88bb793bb6b2665c3e3f674f5eb Mon Sep 17 00:00:00 2001 From: Lorenzo Stoakes Date: Mon, 13 Apr 2026 11:57:13 +0100 Subject: mm/vma: remove __vma_check_mmap_hook() Commit c50ca15dd496 ("mm: add vm_ops->mapped hook") introduced __vma_check_mmap_hook() in order to assert that a driver doesn't incorrectly implement both an f_op->mmap() and a vm_ops->mapped hook, the latter of which would not ultimately get invoked. However, this did not correctly account for stacked drivers (or drivers that otherwise use the compatibility layer) which might recursively call an mmap_prepare hook via the compatibility layer. Thus the nested mmap_prepare() invocation might result in a VMA which has vm_ops->mapped set with an overlaying mmap() hook, causing the __vma_check_mmap_hook() to fail in vfs_mmap(), wrongly failing the operation. This patch resolves this by simply removing the check, as we can't be certain that an mmap() hook doesn't at some point invoke the compatibility layer, and it's not worth trying to track it. Link: https://lore.kernel.org/20260413105713.92625-1-ljs@kernel.org Fixes: c50ca15dd496 ("mm: add vm_ops->mapped hook") Reported-by: Shinichiro Kawasaki Closes: https://lore.kernel.org/all/adx2ws5z0NMIe5Yj@shinmob/ Signed-off-by: Lorenzo Stoakes Acked-by: Vlastimil Babka (SUSE) Tested-by: Shinichiro Kawasaki Cc: Al Viro Cc: Christian Brauner Cc: David Hildenbrand Cc: Jan Kara Cc: Liam Howlett Cc: Michal Hocko Cc: Mike Rapoport Cc: Suren Baghdasaryan Signed-off-by: Andrew Morton --- include/linux/fs.h | 9 +-------- mm/util.c | 10 ---------- 2 files changed, 1 insertion(+), 18 deletions(-) diff --git a/include/linux/fs.h b/include/linux/fs.h index 0bdccfa70b44..f3ca9b841892 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -2062,20 +2062,13 @@ void compat_set_desc_from_vma(struct vm_area_desc *desc, const struct file *file const struct vm_area_struct *vma); int __compat_vma_mmap(struct vm_area_desc *desc, struct vm_area_struct *vma); int compat_vma_mmap(struct file *file, struct vm_area_struct *vma); -int __vma_check_mmap_hook(struct vm_area_struct *vma); static inline int vfs_mmap(struct file *file, struct vm_area_struct *vma) { - int err; - if (file->f_op->mmap_prepare) return compat_vma_mmap(file, vma); - err = file->f_op->mmap(file, vma); - if (err) - return err; - - return __vma_check_mmap_hook(vma); + return file->f_op->mmap(file, vma); } static inline int vfs_mmap_prepare(struct file *file, struct vm_area_desc *desc) diff --git a/mm/util.c b/mm/util.c index f063fd4de1e8..232c3930a662 100644 --- a/mm/util.c +++ b/mm/util.c @@ -1281,16 +1281,6 @@ int compat_vma_mmap(struct file *file, struct vm_area_struct *vma) } EXPORT_SYMBOL(compat_vma_mmap); -int __vma_check_mmap_hook(struct vm_area_struct *vma) -{ - /* vm_ops->mapped is not valid if mmap() is specified. */ - if (vma->vm_ops && WARN_ON_ONCE(vma->vm_ops->mapped)) - return -EINVAL; - - return 0; -} -EXPORT_SYMBOL(__vma_check_mmap_hook); - static void set_ps_flags(struct page_snapshot *ps, const struct folio *folio, const struct page *page) { -- cgit v1.2.3 From f95fcd7f28082524938db0b3808ce53630b8a718 Mon Sep 17 00:00:00 2001 From: Muchun Song Date: Thu, 5 Mar 2026 19:52:19 +0800 Subject: mm: memcontrol: remove dead code of checking parent memory cgroup MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Patch series "Eliminate Dying Memory Cgroup", v6. 
Introduction
============

This patchset is intended to transfer the LRU pages to the object cgroup
without holding a reference to the original memory cgroup in order to
address the issue of the dying memory cgroup. A consensus has already been
reached regarding this approach recently [1].

Background
==========

The issue of a dying memory cgroup refers to a situation where a memory
cgroup is no longer being used by users, but memory (the metadata
associated with memory cgroups) remains allocated to it. This situation
may result in memory leaks or inefficiencies in memory reclamation and has
persisted as an issue for several years. Any memory allocation that
endures longer than the lifespan (from the users' perspective) of a memory
cgroup can lead to a dying memory cgroup. We have exerted greater efforts
to tackle this problem by introducing the infrastructure of object cgroup
[2].

Presently, numerous types of objects (slab objects, non-slab kernel
allocations, per-CPU objects) are charged to the object cgroup without
holding a reference to the original memory cgroup. The last remaining case
is LRU pages (anonymous pages and file pages), which are charged at
allocation time and continue to hold a reference to the original memory
cgroup until they are reclaimed.

File pages are more complex than anonymous pages as they can be shared
among different memory cgroups and may persist beyond the lifespan of the
memory cgroup. The long-term pinning of file pages to memory cgroups is a
widespread issue that causes recurring problems in practical scenarios
[3]. File pages remain unreclaimed for extended periods. Additionally,
they are accessed by successive instances (second, third, fourth, etc.) of
the same job, which is restarted into a new cgroup each time. As a result,
unreclaimable dying memory cgroups accumulate, leading to memory wastage
and significantly reducing the efficiency of page reclamation.

Fundamentals
============

A folio will no longer pin its corresponding memory cgroup. It is
necessary to ensure that the memory cgroup or the lruvec associated with
the memory cgroup is not released when a user obtains a pointer to the
memory cgroup or lruvec returned by folio_memcg() or folio_lruvec().

Users who are not concerned about the binding stability between the folio
and its corresponding memory cgroup are required to hold the RCU read lock
or acquire a reference to the memory cgroup associated with the folio in
order to prevent its release.

However, some users of folio_lruvec() (i.e., the lruvec lock) desire a
stable binding between the folio and its corresponding memory cgroup. An
approach is needed to ensure the stability of the binding while the lruvec
lock is held, and to detect the situation of holding the incorrect lruvec
lock when there is a race condition during memory cgroup reparenting. The
following four steps are taken to achieve these goals.

1. The first step is to identify all users of both functions
   (folio_memcg() and folio_lruvec()) who are not concerned about binding
   stability and implement appropriate measures (such as holding an RCU
   read lock or temporarily obtaining a reference to the memory cgroup for
   a brief period) to prevent the release of the memory cgroup.

2. Secondly, the following refactoring of folio_lruvec_lock() demonstrates
   how to ensure the binding stability from the perspective of a user of
   folio_lruvec():
   struct lruvec *folio_lruvec_lock(struct folio *folio)
   {
           struct lruvec *lruvec;

           rcu_read_lock();
   retry:
           lruvec = folio_lruvec(folio);
           spin_lock(&lruvec->lru_lock);
           if (unlikely(lruvec_memcg(lruvec) != folio_memcg(folio))) {
                   spin_unlock(&lruvec->lru_lock);
                   goto retry;
           }

           return lruvec;
   }

3. From the perspective of memory cgroup removal, the entire reparenting
   process (altering the binding relationship between folio and its memory
   cgroup and moving the LRU lists to its parental memory cgroup) should be
   carried out under both the lruvec lock of the memory cgroup being
   removed and the lruvec lock of its parent.

4. Finally, transfer the LRU pages to the object cgroup without holding a
   reference to the original memory cgroup.

Effect
======

Finally, it can be observed that the quantity of dying memory cgroups will
not experience a significant increase if the following test script is
executed to reproduce the issue.

#!/bin/bash

# Create a temporary file 'temp' filled with zero bytes
dd if=/dev/zero of=temp bs=4096 count=1

# Display memory-cgroup info from /proc/cgroups
cat /proc/cgroups | grep memory

for i in {0..2000}
do
        mkdir /sys/fs/cgroup/memory/test$i
        echo $$ > /sys/fs/cgroup/memory/test$i/cgroup.procs

        # Append 'temp' file content to 'log'
        cat temp >> log
        echo $$ > /sys/fs/cgroup/memory/cgroup.procs

        # Potentially create a dying memory cgroup
        rmdir /sys/fs/cgroup/memory/test$i
done

# Display memory-cgroup info after test
cat /proc/cgroups | grep memory

rm -f temp log

This patch (of 33):

The non-hierarchical mode has been deprecated since commit bef8620cd8e0
("mm: memcg: deprecate the non-hierarchical mode"). As a result,
parent_mem_cgroup() will not return NULL except when passing the root
memcg, and the root memcg cannot be offline. Hence, it's safe to remove
the check on the returned value of parent_mem_cgroup(). Remove the
corresponding dead code.
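To make the invariant concrete, here is a minimal illustrative sketch (the
helper name below is made up for illustration and is not part of this
patch): offlining only ever runs for non-root memory cgroups, and with the
non-hierarchical mode gone every non-root memcg has a parent, so the NULL
fallback to root_mem_cgroup can never be reached.

/* Illustrative only -- not a function added by this patch. */
static struct mem_cgroup *reparent_target(struct mem_cgroup *memcg)
{
        /* The root memcg is never offlined ... */
        VM_WARN_ON_ONCE(mem_cgroup_is_root(memcg));

        /*
         * ... and every non-root memcg has a parent, so no NULL check
         * on parent_mem_cgroup() is needed.
         */
        return parent_mem_cgroup(memcg);
}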
Link: https://lore.kernel.org/f4481291bf8c6561dd8949045b5a1ed4008a6b63.1772711148.git.zhengqi.arch@bytedance.com Link: https://lore.kernel.org/linux-mm/Z6OkXXYDorPrBvEQ@hm-sls2/ [1] Link: https://lwn.net/Articles/895431/ [2] Link: https://github.com/systemd/systemd/pull/36827 [3] Signed-off-by: Muchun Song Signed-off-by: Qi Zheng Acked-by: Roman Gushchin Acked-by: Johannes Weiner Reviewed-by: Harry Yoo Reviewed-by: Chen Ridong Acked-by: Shakeel Butt Cc: Allen Pais Cc: Axel Rasmussen Cc: Baoquan He Cc: Chengming Zhou Cc: David Hildenbrand Cc: Hamza Mahfooz Cc: Hugh Dickins Cc: Imran Khan Cc: Kamalesh Babulal Cc: Lance Yang Cc: Liam Howlett Cc: Lorenzo Stoakes (Oracle) Cc: Michal Hocko Cc: Michal Koutný Cc: Mike Rapoport Cc: Muchun Song Cc: Nhat Pham Cc: Suren Baghdasaryan Cc: Usama Arif Cc: Vlastimil Babka Cc: Wei Xu Cc: Yosry Ahmed Cc: Yuanchu Xie Cc: Zi Yan Signed-off-by: Andrew Morton --- mm/memcontrol.c | 5 ----- mm/shrinker.c | 6 +----- 2 files changed, 1 insertion(+), 10 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 051b82ebf371..4efa56a91447 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -3423,9 +3423,6 @@ static void memcg_offline_kmem(struct mem_cgroup *memcg) return; parent = parent_mem_cgroup(memcg); - if (!parent) - parent = root_mem_cgroup; - memcg_reparent_list_lrus(memcg, parent); /* @@ -3705,8 +3702,6 @@ struct mem_cgroup *mem_cgroup_private_id_get_online(struct mem_cgroup *memcg, un break; } memcg = parent_mem_cgroup(memcg); - if (!memcg) - memcg = root_mem_cgroup; } return memcg; } diff --git a/mm/shrinker.c b/mm/shrinker.c index c23086bccf4d..76b3f750cf65 100644 --- a/mm/shrinker.c +++ b/mm/shrinker.c @@ -288,14 +288,10 @@ void reparent_shrinker_deferred(struct mem_cgroup *memcg) { int nid, index, offset; long nr; - struct mem_cgroup *parent; + struct mem_cgroup *parent = parent_mem_cgroup(memcg); struct shrinker_info *child_info, *parent_info; struct shrinker_info_unit *child_unit, *parent_unit; - parent = parent_mem_cgroup(memcg); - if (!parent) - parent = root_mem_cgroup; - /* Prevent from concurrent shrinker_info expand */ mutex_lock(&shrinker_mutex); for_each_node(nid) { -- cgit v1.2.3 From 2b33c342f7d4bf61710fd5a59c0a5e06d2d3082f Mon Sep 17 00:00:00 2001 From: Muchun Song Date: Thu, 5 Mar 2026 19:52:20 +0800 Subject: mm: workingset: use folio_lruvec() in workingset_refault() MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Use folio_lruvec() to simplify the code. 
Link: https://lore.kernel.org/11bd2fbbf082f4f7972a1113ca42a61fbe2876a9.1772711148.git.zhengqi.arch@bytedance.com Signed-off-by: Muchun Song Signed-off-by: Qi Zheng Acked-by: Johannes Weiner Reviewed-by: Harry Yoo Acked-by: Shakeel Butt Cc: Allen Pais Cc: Axel Rasmussen Cc: Baoquan He Cc: Chengming Zhou Cc: Chen Ridong Cc: David Hildenbrand Cc: Hamza Mahfooz Cc: Hugh Dickins Cc: Imran Khan Cc: Kamalesh Babulal Cc: Lance Yang Cc: Liam Howlett Cc: Lorenzo Stoakes (Oracle) Cc: Michal Hocko Cc: Michal Koutný Cc: Mike Rapoport Cc: Muchun Song Cc: Nhat Pham Cc: Roman Gushchin Cc: Suren Baghdasaryan Cc: Usama Arif Cc: Vlastimil Babka Cc: Wei Xu Cc: Yosry Ahmed Cc: Yuanchu Xie Cc: Zi Yan Signed-off-by: Andrew Morton --- mm/workingset.c | 7 +------ 1 file changed, 1 insertion(+), 6 deletions(-) diff --git a/mm/workingset.c b/mm/workingset.c index 37a94979900f..5e8b6e62a617 100644 --- a/mm/workingset.c +++ b/mm/workingset.c @@ -541,8 +541,6 @@ bool workingset_test_recent(void *shadow, bool file, bool *workingset, void workingset_refault(struct folio *folio, void *shadow) { bool file = folio_is_file_lru(folio); - struct pglist_data *pgdat; - struct mem_cgroup *memcg; struct lruvec *lruvec; bool workingset; long nr; @@ -564,10 +562,7 @@ void workingset_refault(struct folio *folio, void *shadow) * locked to guarantee folio_memcg() stability throughout. */ nr = folio_nr_pages(folio); - memcg = folio_memcg(folio); - pgdat = folio_pgdat(folio); - lruvec = mem_cgroup_lruvec(memcg, pgdat); - + lruvec = folio_lruvec(folio); mod_lruvec_state(lruvec, WORKINGSET_REFAULT_BASE + file, nr); if (!workingset_test_recent(shadow, file, &workingset, true)) -- cgit v1.2.3 From db128b2c6b7d0c9b514327a0873425bbf18e739b Mon Sep 17 00:00:00 2001 From: Muchun Song Date: Thu, 5 Mar 2026 19:52:21 +0800 Subject: mm: rename unlock_page_lruvec_irq and its variants MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit It is inappropriate to use folio_lruvec_lock() variants in conjunction with unlock_page_lruvec() variants, as this involves the inconsistent operation of locking a folio while unlocking a page. To rectify this, the functions unlock_page_lruvec{_irq, _irqrestore} are renamed to lruvec_unlock{_irq,_irqrestore}. 
Link: https://lore.kernel.org/4e5e05271a250df4d1812e1832be65636a78c957.1772711148.git.zhengqi.arch@bytedance.com Signed-off-by: Muchun Song Signed-off-by: Qi Zheng Acked-by: Roman Gushchin Acked-by: Johannes Weiner Reviewed-by: Harry Yoo Reviewed-by: Chen Ridong Acked-by: David Hildenbrand (Red Hat) Acked-by: Shakeel Butt Cc: Allen Pais Cc: Axel Rasmussen Cc: Baoquan He Cc: Chengming Zhou Cc: Hamza Mahfooz Cc: Hugh Dickins Cc: Imran Khan Cc: Kamalesh Babulal Cc: Lance Yang Cc: Liam Howlett Cc: Lorenzo Stoakes (Oracle) Cc: Michal Hocko Cc: Michal Koutný Cc: Mike Rapoport Cc: Muchun Song Cc: Nhat Pham Cc: Suren Baghdasaryan Cc: Usama Arif Cc: Vlastimil Babka Cc: Wei Xu Cc: Yosry Ahmed Cc: Yuanchu Xie Cc: Zi Yan Signed-off-by: Andrew Morton --- include/linux/memcontrol.h | 10 +++++----- mm/compaction.c | 14 +++++++------- mm/huge_memory.c | 2 +- mm/mlock.c | 2 +- mm/swap.c | 12 ++++++------ mm/vmscan.c | 4 ++-- 6 files changed, 22 insertions(+), 22 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 5173a9f16721..6e88288e90d8 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -1479,17 +1479,17 @@ static inline struct lruvec *parent_lruvec(struct lruvec *lruvec) return mem_cgroup_lruvec(memcg, lruvec_pgdat(lruvec)); } -static inline void unlock_page_lruvec(struct lruvec *lruvec) +static inline void lruvec_unlock(struct lruvec *lruvec) { spin_unlock(&lruvec->lru_lock); } -static inline void unlock_page_lruvec_irq(struct lruvec *lruvec) +static inline void lruvec_unlock_irq(struct lruvec *lruvec) { spin_unlock_irq(&lruvec->lru_lock); } -static inline void unlock_page_lruvec_irqrestore(struct lruvec *lruvec, +static inline void lruvec_unlock_irqrestore(struct lruvec *lruvec, unsigned long flags) { spin_unlock_irqrestore(&lruvec->lru_lock, flags); @@ -1511,7 +1511,7 @@ static inline struct lruvec *folio_lruvec_relock_irq(struct folio *folio, if (folio_matches_lruvec(folio, locked_lruvec)) return locked_lruvec; - unlock_page_lruvec_irq(locked_lruvec); + lruvec_unlock_irq(locked_lruvec); } return folio_lruvec_lock_irq(folio); @@ -1525,7 +1525,7 @@ static inline void folio_lruvec_relock_irqsave(struct folio *folio, if (folio_matches_lruvec(folio, *lruvecp)) return; - unlock_page_lruvec_irqrestore(*lruvecp, *flags); + lruvec_unlock_irqrestore(*lruvecp, *flags); } *lruvecp = folio_lruvec_lock_irqsave(folio, flags); diff --git a/mm/compaction.c b/mm/compaction.c index 1e8f8eca318c..c3e338aaa0ff 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -913,7 +913,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn, */ if (!(low_pfn % COMPACT_CLUSTER_MAX)) { if (locked) { - unlock_page_lruvec_irqrestore(locked, flags); + lruvec_unlock_irqrestore(locked, flags); locked = NULL; } @@ -964,7 +964,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn, } /* for alloc_contig case */ if (locked) { - unlock_page_lruvec_irqrestore(locked, flags); + lruvec_unlock_irqrestore(locked, flags); locked = NULL; } @@ -1053,7 +1053,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn, if (unlikely(page_has_movable_ops(page)) && !PageMovableOpsIsolated(page)) { if (locked) { - unlock_page_lruvec_irqrestore(locked, flags); + lruvec_unlock_irqrestore(locked, flags); locked = NULL; } @@ -1158,7 +1158,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn, /* If we already hold the lock, we can skip some rechecking */ if (lruvec != locked) { if (locked) - 
unlock_page_lruvec_irqrestore(locked, flags); + lruvec_unlock_irqrestore(locked, flags); compact_lock_irqsave(&lruvec->lru_lock, &flags, cc); locked = lruvec; @@ -1226,7 +1226,7 @@ isolate_success_no_list: isolate_fail_put: /* Avoid potential deadlock in freeing page under lru_lock */ if (locked) { - unlock_page_lruvec_irqrestore(locked, flags); + lruvec_unlock_irqrestore(locked, flags); locked = NULL; } folio_put(folio); @@ -1242,7 +1242,7 @@ isolate_fail: */ if (nr_isolated) { if (locked) { - unlock_page_lruvec_irqrestore(locked, flags); + lruvec_unlock_irqrestore(locked, flags); locked = NULL; } putback_movable_pages(&cc->migratepages); @@ -1274,7 +1274,7 @@ isolate_fail: isolate_abort: if (locked) - unlock_page_lruvec_irqrestore(locked, flags); + lruvec_unlock_irqrestore(locked, flags); if (folio) { folio_set_lru(folio); folio_put(folio); diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 42c983821c03..958b580c6619 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -3994,7 +3994,7 @@ static int __folio_freeze_and_split_unmapped(struct folio *folio, unsigned int n folio_ref_unfreeze(folio, folio_cache_ref_count(folio) + 1); if (do_lru) - unlock_page_lruvec(lruvec); + lruvec_unlock(lruvec); if (ci) swap_cluster_unlock(ci); diff --git a/mm/mlock.c b/mm/mlock.c index fdbd1434a35f..8c227fefa2df 100644 --- a/mm/mlock.c +++ b/mm/mlock.c @@ -205,7 +205,7 @@ static void mlock_folio_batch(struct folio_batch *fbatch) } if (lruvec) - unlock_page_lruvec_irq(lruvec); + lruvec_unlock_irq(lruvec); folios_put(fbatch); } diff --git a/mm/swap.c b/mm/swap.c index 78b4aa811fc6..23df893e2ed7 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -91,7 +91,7 @@ static void page_cache_release(struct folio *folio) __page_cache_release(folio, &lruvec, &flags); if (lruvec) - unlock_page_lruvec_irqrestore(lruvec, flags); + lruvec_unlock_irqrestore(lruvec, flags); } void __folio_put(struct folio *folio) @@ -175,7 +175,7 @@ static void folio_batch_move_lru(struct folio_batch *fbatch, move_fn_t move_fn) } if (lruvec) - unlock_page_lruvec_irqrestore(lruvec, flags); + lruvec_unlock_irqrestore(lruvec, flags); folios_put(fbatch); } @@ -349,7 +349,7 @@ void folio_activate(struct folio *folio) lruvec = folio_lruvec_lock_irq(folio); lru_activate(lruvec, folio); - unlock_page_lruvec_irq(lruvec); + lruvec_unlock_irq(lruvec); folio_set_lru(folio); } #endif @@ -963,7 +963,7 @@ void folios_put_refs(struct folio_batch *folios, unsigned int *refs) if (folio_is_zone_device(folio)) { if (lruvec) { - unlock_page_lruvec_irqrestore(lruvec, flags); + lruvec_unlock_irqrestore(lruvec, flags); lruvec = NULL; } if (folio_ref_sub_and_test(folio, nr_refs)) @@ -977,7 +977,7 @@ void folios_put_refs(struct folio_batch *folios, unsigned int *refs) /* hugetlb has its own memcg */ if (folio_test_hugetlb(folio)) { if (lruvec) { - unlock_page_lruvec_irqrestore(lruvec, flags); + lruvec_unlock_irqrestore(lruvec, flags); lruvec = NULL; } free_huge_folio(folio); @@ -991,7 +991,7 @@ void folios_put_refs(struct folio_batch *folios, unsigned int *refs) j++; } if (lruvec) - unlock_page_lruvec_irqrestore(lruvec, flags); + lruvec_unlock_irqrestore(lruvec, flags); if (!j) { folio_batch_reinit(folios); return; diff --git a/mm/vmscan.c b/mm/vmscan.c index 4bf091b1c8af..88bb3337e5eb 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1831,7 +1831,7 @@ bool folio_isolate_lru(struct folio *folio) folio_get(folio); lruvec = folio_lruvec_lock_irq(folio); lruvec_del_folio(lruvec, folio); - unlock_page_lruvec_irq(lruvec); + lruvec_unlock_irq(lruvec); ret = true; } @@ -7898,7 
+7898,7 @@ void check_move_unevictable_folios(struct folio_batch *fbatch) if (lruvec) { __count_vm_events(UNEVICTABLE_PGRESCUED, pgrescued); __count_vm_events(UNEVICTABLE_PGSCANNED, pgscanned); - unlock_page_lruvec_irq(lruvec); + lruvec_unlock_irq(lruvec); } else if (pgscanned) { count_vm_events(UNEVICTABLE_PGSCANNED, pgscanned); } -- cgit v1.2.3 From 676496738b7e6c58fc5efba255e9c35b4896cdd6 Mon Sep 17 00:00:00 2001 From: Qi Zheng Date: Thu, 5 Mar 2026 19:52:22 +0800 Subject: mm: vmscan: prepare for the refactoring the move_folios_to_lru() MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Once we refactor move_folios_to_lru(), its callers will no longer have to hold the lruvec lock; For shrink_inactive_list(), shrink_active_list() and evict_folios(), IRQ disabling is only needed for __count_vm_events() and __mod_node_page_state(). To avoid using local_irq_disable() on the PREEMPT_RT kernel, let's make all callers of move_folios_to_lru() use IRQ-safed count_vm_events() and mod_node_page_state(). Link: https://lore.kernel.org/b3a202f1787b0857bb6cbe059fffb8edefaf67b7.1772711148.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng Acked-by: Johannes Weiner Acked-by: Shakeel Butt Reviewed-by: Chen Ridong Reviewed-by: Harry Yoo Acked-by: Muchun Song Cc: Allen Pais Cc: Axel Rasmussen Cc: Baoquan He Cc: Chengming Zhou Cc: David Hildenbrand Cc: Hamza Mahfooz Cc: Hugh Dickins Cc: Imran Khan Cc: Kamalesh Babulal Cc: Lance Yang Cc: Liam Howlett Cc: Lorenzo Stoakes (Oracle) Cc: Michal Hocko Cc: Michal Koutný Cc: Mike Rapoport Cc: Muchun Song Cc: Nhat Pham Cc: Roman Gushchin Cc: Suren Baghdasaryan Cc: Usama Arif Cc: Vlastimil Babka Cc: Wei Xu Cc: Yosry Ahmed Cc: Yuanchu Xie Cc: Zi Yan Signed-off-by: Andrew Morton --- mm/vmscan.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/mm/vmscan.c b/mm/vmscan.c index 88bb3337e5eb..d88d00f0c2cd 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2021,7 +2021,7 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan, mod_lruvec_state(lruvec, PGDEMOTE_KSWAPD + reclaimer_offset(sc), stat.nr_demoted); - __mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken); + mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken); item = PGSTEAL_KSWAPD + reclaimer_offset(sc); mod_lruvec_state(lruvec, item, nr_reclaimed); mod_lruvec_state(lruvec, PGSTEAL_ANON + file, nr_reclaimed); @@ -2167,10 +2167,10 @@ static void shrink_active_list(unsigned long nr_to_scan, nr_activate = move_folios_to_lru(lruvec, &l_active); nr_deactivate = move_folios_to_lru(lruvec, &l_inactive); - __count_vm_events(PGDEACTIVATE, nr_deactivate); + count_vm_events(PGDEACTIVATE, nr_deactivate); count_memcg_events(lruvec_memcg(lruvec), PGDEACTIVATE, nr_deactivate); - __mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken); + mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken); lru_note_cost_unlock_irq(lruvec, file, 0, nr_rotated); trace_mm_vmscan_lru_shrink_active(pgdat->node_id, nr_taken, nr_activate, -- cgit v1.2.3 From a760b64ee08809fb98874a72f82acf6fd30c5d7e Mon Sep 17 00:00:00 2001 From: Muchun Song Date: Thu, 5 Mar 2026 19:52:23 +0800 Subject: mm: vmscan: refactor move_folios_to_lru() MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit In a subsequent patch, we'll reparent the LRU folios. The folios that are moved to the appropriate LRU list can undergo reparenting during the move_folios_to_lru() process. 
Hence, it's incorrect for the caller to hold a lruvec lock. Instead, we should utilize the more general interface of folio_lruvec_relock_irq() to obtain the correct lruvec lock. This patch involves only code refactoring and doesn't introduce any functional changes. Link: https://lore.kernel.org/6f1dac88b61e2e3cb7a3e90bacdf06b654acfc15.1772711148.git.zhengqi.arch@bytedance.com Signed-off-by: Muchun Song Signed-off-by: Qi Zheng Acked-by: Johannes Weiner Acked-by: Shakeel Butt Reviewed-by: Harry Yoo Cc: Allen Pais Cc: Axel Rasmussen Cc: Baoquan He Cc: Chengming Zhou Cc: Chen Ridong Cc: David Hildenbrand Cc: Hamza Mahfooz Cc: Hugh Dickins Cc: Imran Khan Cc: Kamalesh Babulal Cc: Lance Yang Cc: Liam Howlett Cc: Lorenzo Stoakes (Oracle) Cc: Michal Hocko Cc: Michal Koutný Cc: Mike Rapoport Cc: Muchun Song Cc: Nhat Pham Cc: Roman Gushchin Cc: Suren Baghdasaryan Cc: Usama Arif Cc: Vlastimil Babka Cc: Wei Xu Cc: Yosry Ahmed Cc: Yuanchu Xie Cc: Zi Yan Signed-off-by: Andrew Morton --- mm/vmscan.c | 46 +++++++++++++++++++++------------------------- 1 file changed, 21 insertions(+), 25 deletions(-) diff --git a/mm/vmscan.c b/mm/vmscan.c index d88d00f0c2cd..031fbd35ae10 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1885,24 +1885,27 @@ static bool too_many_isolated(struct pglist_data *pgdat, int file, /* * move_folios_to_lru() moves folios from private @list to appropriate LRU list. * - * Returns the number of pages moved to the given lruvec. + * Returns the number of pages moved to the appropriate lruvec. + * + * Note: The caller must not hold any lruvec lock. */ -static unsigned int move_folios_to_lru(struct lruvec *lruvec, - struct list_head *list) +static unsigned int move_folios_to_lru(struct list_head *list) { int nr_pages, nr_moved = 0; + struct lruvec *lruvec = NULL; struct folio_batch free_folios; folio_batch_init(&free_folios); while (!list_empty(list)) { struct folio *folio = lru_to_folio(list); + lruvec = folio_lruvec_relock_irq(folio, lruvec); VM_BUG_ON_FOLIO(folio_test_lru(folio), folio); list_del(&folio->lru); if (unlikely(!folio_evictable(folio))) { - spin_unlock_irq(&lruvec->lru_lock); + lruvec_unlock_irq(lruvec); folio_putback_lru(folio); - spin_lock_irq(&lruvec->lru_lock); + lruvec = NULL; continue; } @@ -1924,19 +1927,15 @@ static unsigned int move_folios_to_lru(struct lruvec *lruvec, folio_unqueue_deferred_split(folio); if (folio_batch_add(&free_folios, folio) == 0) { - spin_unlock_irq(&lruvec->lru_lock); + lruvec_unlock_irq(lruvec); mem_cgroup_uncharge_folios(&free_folios); free_unref_folios(&free_folios); - spin_lock_irq(&lruvec->lru_lock); + lruvec = NULL; } continue; } - /* - * All pages were isolated from the same lruvec (and isolation - * inhibits memcg migration). 
- */ VM_BUG_ON_FOLIO(!folio_matches_lruvec(folio, lruvec), folio); lruvec_add_folio(lruvec, folio); nr_pages = folio_nr_pages(folio); @@ -1945,11 +1944,12 @@ static unsigned int move_folios_to_lru(struct lruvec *lruvec, workingset_age_nonresident(lruvec, nr_pages); } + if (lruvec) + lruvec_unlock_irq(lruvec); + if (free_folios.nr) { - spin_unlock_irq(&lruvec->lru_lock); mem_cgroup_uncharge_folios(&free_folios); free_unref_folios(&free_folios); - spin_lock_irq(&lruvec->lru_lock); } return nr_moved; @@ -2016,8 +2016,7 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan, nr_reclaimed = shrink_folio_list(&folio_list, pgdat, sc, &stat, false, lruvec_memcg(lruvec)); - spin_lock_irq(&lruvec->lru_lock); - move_folios_to_lru(lruvec, &folio_list); + move_folios_to_lru(&folio_list); mod_lruvec_state(lruvec, PGDEMOTE_KSWAPD + reclaimer_offset(sc), stat.nr_demoted); @@ -2026,6 +2025,7 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan, mod_lruvec_state(lruvec, item, nr_reclaimed); mod_lruvec_state(lruvec, PGSTEAL_ANON + file, nr_reclaimed); + spin_lock_irq(&lruvec->lru_lock); lru_note_cost_unlock_irq(lruvec, file, stat.nr_pageout, nr_scanned - nr_reclaimed); @@ -2162,16 +2162,14 @@ static void shrink_active_list(unsigned long nr_to_scan, /* * Move folios back to the lru list. */ - spin_lock_irq(&lruvec->lru_lock); - - nr_activate = move_folios_to_lru(lruvec, &l_active); - nr_deactivate = move_folios_to_lru(lruvec, &l_inactive); + nr_activate = move_folios_to_lru(&l_active); + nr_deactivate = move_folios_to_lru(&l_inactive); count_vm_events(PGDEACTIVATE, nr_deactivate); count_memcg_events(lruvec_memcg(lruvec), PGDEACTIVATE, nr_deactivate); - mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken); + spin_lock_irq(&lruvec->lru_lock); lru_note_cost_unlock_irq(lruvec, file, 0, nr_rotated); trace_mm_vmscan_lru_shrink_active(pgdat->node_id, nr_taken, nr_activate, nr_deactivate, nr_rotated, sc->priority, file); @@ -4749,14 +4747,14 @@ retry: set_mask_bits(&folio->flags.f, LRU_REFS_FLAGS, BIT(PG_active)); } - spin_lock_irq(&lruvec->lru_lock); - - move_folios_to_lru(lruvec, &list); + move_folios_to_lru(&list); walk = current->reclaim_state->mm_walk; if (walk && walk->batched) { walk->lruvec = lruvec; + spin_lock_irq(&lruvec->lru_lock); reset_batch_size(walk); + spin_unlock_irq(&lruvec->lru_lock); } mod_lruvec_state(lruvec, PGDEMOTE_KSWAPD + reclaimer_offset(sc), @@ -4766,8 +4764,6 @@ retry: mod_lruvec_state(lruvec, item, reclaimed); mod_lruvec_state(lruvec, PGSTEAL_ANON + type, reclaimed); - spin_unlock_irq(&lruvec->lru_lock); - list_splice_init(&clean, &list); if (!list_empty(&list)) { -- cgit v1.2.3 From aa01ec1325e211ee4b57ad1375e4efaa846d7ff3 Mon Sep 17 00:00:00 2001 From: Muchun Song Date: Thu, 5 Mar 2026 19:52:24 +0800 Subject: mm: memcontrol: allocate object cgroup for non-kmem case MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit To allow LRU page reparenting, the objcg infrastructure is no longer solely applicable to the kmem case. In this patch, we extend the scope of the objcg infrastructure beyond the kmem case, enabling LRU folios to reuse it for folio charging purposes. It should be noted that LRU folios are not accounted for at the root level, yet the folio->memcg_data points to the root_mem_cgroup. Hence, the folio->memcg_data of LRU folios always points to a valid pointer. However, the root_mem_cgroup does not possess an object cgroup. Therefore, we also allocate an object cgroup for the root_mem_cgroup. 
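A condensed sketch of the allocation now done for every memory cgroup,
including the root, at css_online() time (the helper name is illustrative
and error unwinding plus the root_obj_cgroup assignment are elided; see
the diff below for the actual change):

/* Illustrative sketch, not the exact hunk applied below. */
static int memcg_alloc_objcg(struct mem_cgroup *memcg)
{
        struct obj_cgroup *objcg = obj_cgroup_alloc();

        if (!objcg)
                return -ENOMEM;

        objcg->memcg = memcg;
        rcu_assign_pointer(memcg->objcg, objcg);
        /* A long-lived reference kept for the memcg's own lifetime. */
        obj_cgroup_get(objcg);
        memcg->orig_objcg = objcg;

        return 0;
}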
Link: https://lore.kernel.org/b77274aa8e3f37c419bedf4782943fd5885dda82.1772711148.git.zhengqi.arch@bytedance.com Signed-off-by: Muchun Song Signed-off-by: Qi Zheng Reviewed-by: Harry Yoo Acked-by: Johannes Weiner Acked-by: Shakeel Butt Reviewed-by: Chen Ridong Cc: Allen Pais Cc: Axel Rasmussen Cc: Baoquan He Cc: Chengming Zhou Cc: David Hildenbrand Cc: Hamza Mahfooz Cc: Hugh Dickins Cc: Imran Khan Cc: Kamalesh Babulal Cc: Lance Yang Cc: Liam Howlett Cc: Lorenzo Stoakes (Oracle) Cc: Michal Hocko Cc: Michal Koutný Cc: Mike Rapoport Cc: Muchun Song Cc: Nhat Pham Cc: Roman Gushchin Cc: Suren Baghdasaryan Cc: Usama Arif Cc: Vlastimil Babka Cc: Wei Xu Cc: Yosry Ahmed Cc: Yuanchu Xie Cc: Zi Yan Signed-off-by: Andrew Morton --- mm/memcontrol.c | 51 ++++++++++++++++++++++++--------------------------- 1 file changed, 24 insertions(+), 27 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 4efa56a91447..2cb2d66579d3 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -206,10 +206,10 @@ static struct obj_cgroup *obj_cgroup_alloc(void) return objcg; } -static void memcg_reparent_objcgs(struct mem_cgroup *memcg, - struct mem_cgroup *parent) +static void memcg_reparent_objcgs(struct mem_cgroup *memcg) { struct obj_cgroup *objcg, *iter; + struct mem_cgroup *parent = parent_mem_cgroup(memcg); objcg = rcu_replace_pointer(memcg->objcg, NULL, true); @@ -3386,30 +3386,17 @@ void folio_split_memcg_refs(struct folio *folio, unsigned old_order, css_get_many(&__folio_memcg(folio)->css, new_refs); } -static int memcg_online_kmem(struct mem_cgroup *memcg) +static void memcg_online_kmem(struct mem_cgroup *memcg) { - struct obj_cgroup *objcg; - if (mem_cgroup_kmem_disabled()) - return 0; + return; if (unlikely(mem_cgroup_is_root(memcg))) - return 0; - - objcg = obj_cgroup_alloc(); - if (!objcg) - return -ENOMEM; - - objcg->memcg = memcg; - rcu_assign_pointer(memcg->objcg, objcg); - obj_cgroup_get(objcg); - memcg->orig_objcg = objcg; + return; static_branch_enable(&memcg_kmem_online_key); memcg->kmemcg_id = memcg->id.id; - - return 0; } static void memcg_offline_kmem(struct mem_cgroup *memcg) @@ -3424,12 +3411,6 @@ static void memcg_offline_kmem(struct mem_cgroup *memcg) parent = parent_mem_cgroup(memcg); memcg_reparent_list_lrus(memcg, parent); - - /* - * Objcg's reparenting must be after list_lru's, make sure list_lru - * helpers won't use parent's list_lru until child is drained. 
- */ - memcg_reparent_objcgs(memcg, parent); } #ifdef CONFIG_CGROUP_WRITEBACK @@ -3930,9 +3911,9 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css) static int mem_cgroup_css_online(struct cgroup_subsys_state *css) { struct mem_cgroup *memcg = mem_cgroup_from_css(css); + struct obj_cgroup *objcg; - if (memcg_online_kmem(memcg)) - goto remove_id; + memcg_online_kmem(memcg); /* * A memcg must be visible for expand_shrinker_info() @@ -3942,6 +3923,15 @@ static int mem_cgroup_css_online(struct cgroup_subsys_state *css) if (alloc_shrinker_info(memcg)) goto offline_kmem; + objcg = obj_cgroup_alloc(); + if (!objcg) + goto free_shrinker; + + objcg->memcg = memcg; + rcu_assign_pointer(memcg->objcg, objcg); + obj_cgroup_get(objcg); + memcg->orig_objcg = objcg; + if (unlikely(mem_cgroup_is_root(memcg)) && !mem_cgroup_disabled()) queue_delayed_work(system_dfl_wq, &stats_flush_dwork, FLUSH_TIME); @@ -3964,9 +3954,10 @@ static int mem_cgroup_css_online(struct cgroup_subsys_state *css) xa_store(&mem_cgroup_private_ids, memcg->id.id, memcg, GFP_KERNEL); return 0; +free_shrinker: + free_shrinker_info(memcg); offline_kmem: memcg_offline_kmem(memcg); -remove_id: mem_cgroup_private_id_remove(memcg); return -ENOMEM; } @@ -3984,6 +3975,12 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css) memcg_offline_kmem(memcg); reparent_deferred_split_queue(memcg); + /* + * The reparenting of objcg must be after the reparenting of the + * list_lru and deferred_split_queue above, which ensures that they will + * not mistakenly get the parent list_lru and deferred_split_queue. + */ + memcg_reparent_objcgs(memcg); reparent_shrinker_deferred(memcg); wb_memcg_offline(memcg); lru_gen_offline_memcg(memcg); -- cgit v1.2.3 From d5aa8c1d136e7de89defb06f42f8108992967a70 Mon Sep 17 00:00:00 2001 From: Muchun Song Date: Thu, 5 Mar 2026 19:52:25 +0800 Subject: mm: memcontrol: return root object cgroup for root memory cgroup MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Memory cgroup functions such as get_mem_cgroup_from_folio() and get_mem_cgroup_from_mm() return a valid memory cgroup pointer, even for the root memory cgroup. In contrast, the situation for object cgroups has been different. Previously, the root object cgroup couldn't be returned because it didn't exist. Now that a valid root object cgroup exists, for the sake of consistency, it's necessary to align the behavior of object-cgroup-related operations with that of memory cgroup APIs. 
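A short caller-side sketch of the new convention (the function name is
illustrative, not from this patch): current_obj_cgroup() now always
returns a usable pointer, possibly the root object cgroup, and charge
paths skip accounting for the root explicitly.

/* Illustrative caller sketch; only the pattern matters. */
static bool try_charge_current_objcg(gfp_t gfp, unsigned int nr_pages)
{
        struct obj_cgroup *objcg = current_obj_cgroup();

        /* A root objcg is valid but is never accounted. */
        if (!objcg || obj_cgroup_is_root(objcg))
                return true;

        return !obj_cgroup_charge_pages(objcg, gfp, nr_pages);
}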
Link: https://lore.kernel.org/e9c3f40ba7681d9753372d4ee2ac7a0216848b95.1772711148.git.zhengqi.arch@bytedance.com Signed-off-by: Muchun Song Signed-off-by: Qi Zheng Acked-by: Johannes Weiner Acked-by: Shakeel Butt Reviewed-by: Harry Yoo Cc: Allen Pais Cc: Axel Rasmussen Cc: Baoquan He Cc: Chengming Zhou Cc: Chen Ridong Cc: David Hildenbrand Cc: Hamza Mahfooz Cc: Hugh Dickins Cc: Imran Khan Cc: Kamalesh Babulal Cc: Lance Yang Cc: Liam Howlett Cc: Lorenzo Stoakes (Oracle) Cc: Michal Hocko Cc: Michal Koutný Cc: Mike Rapoport Cc: Muchun Song Cc: Nhat Pham Cc: Roman Gushchin Cc: Suren Baghdasaryan Cc: Usama Arif Cc: Vlastimil Babka Cc: Wei Xu Cc: Yosry Ahmed Cc: Yuanchu Xie Cc: Zi Yan Signed-off-by: Andrew Morton --- include/linux/memcontrol.h | 26 ++++++++++++++++++++------ mm/memcontrol.c | 45 ++++++++++++++++++++++++--------------------- mm/percpu.c | 2 +- 3 files changed, 45 insertions(+), 28 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 6e88288e90d8..9a015258a2ff 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -332,6 +332,7 @@ struct mem_cgroup { #define MEMCG_CHARGE_BATCH 64U extern struct mem_cgroup *root_mem_cgroup; +extern struct obj_cgroup *root_obj_cgroup; enum page_memcg_data_flags { /* page->memcg_data is a pointer to an slabobj_ext vector */ @@ -548,6 +549,11 @@ static inline bool mem_cgroup_is_root(struct mem_cgroup *memcg) return (memcg == root_mem_cgroup); } +static inline bool obj_cgroup_is_root(const struct obj_cgroup *objcg) +{ + return objcg == root_obj_cgroup; +} + static inline bool mem_cgroup_disabled(void) { return !cgroup_subsys_enabled(memory_cgrp_subsys); @@ -774,23 +780,26 @@ struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css){ static inline bool obj_cgroup_tryget(struct obj_cgroup *objcg) { + if (obj_cgroup_is_root(objcg)) + return true; return percpu_ref_tryget(&objcg->refcnt); } -static inline void obj_cgroup_get(struct obj_cgroup *objcg) +static inline void obj_cgroup_get_many(struct obj_cgroup *objcg, + unsigned long nr) { - percpu_ref_get(&objcg->refcnt); + if (!obj_cgroup_is_root(objcg)) + percpu_ref_get_many(&objcg->refcnt, nr); } -static inline void obj_cgroup_get_many(struct obj_cgroup *objcg, - unsigned long nr) +static inline void obj_cgroup_get(struct obj_cgroup *objcg) { - percpu_ref_get_many(&objcg->refcnt, nr); + obj_cgroup_get_many(objcg, 1); } static inline void obj_cgroup_put(struct obj_cgroup *objcg) { - if (objcg) + if (objcg && !obj_cgroup_is_root(objcg)) percpu_ref_put(&objcg->refcnt); } @@ -1087,6 +1096,11 @@ static inline bool mem_cgroup_is_root(struct mem_cgroup *memcg) return true; } +static inline bool obj_cgroup_is_root(const struct obj_cgroup *objcg) +{ + return true; +} + static inline bool mem_cgroup_disabled(void) { return true; diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 2cb2d66579d3..e7022adcea7f 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -83,6 +83,8 @@ EXPORT_SYMBOL(memory_cgrp_subsys); struct mem_cgroup *root_mem_cgroup __read_mostly; EXPORT_SYMBOL(root_mem_cgroup); +struct obj_cgroup *root_obj_cgroup __read_mostly; + /* Active memory cgroup to use from an interrupt context */ DEFINE_PER_CPU(struct mem_cgroup *, int_active_memcg); EXPORT_PER_CPU_SYMBOL_GPL(int_active_memcg); @@ -2693,15 +2695,14 @@ struct mem_cgroup *mem_cgroup_from_virt(void *p) static struct obj_cgroup *__get_obj_cgroup_from_memcg(struct mem_cgroup *memcg) { - struct obj_cgroup *objcg = NULL; + for (; memcg; memcg = parent_mem_cgroup(memcg)) { + struct 
obj_cgroup *objcg = rcu_dereference(memcg->objcg); - for (; !mem_cgroup_is_root(memcg); memcg = parent_mem_cgroup(memcg)) { - objcg = rcu_dereference(memcg->objcg); if (likely(objcg && obj_cgroup_tryget(objcg))) - break; - objcg = NULL; + return objcg; } - return objcg; + + return NULL; } static struct obj_cgroup *current_objcg_update(void) @@ -2775,18 +2776,17 @@ __always_inline struct obj_cgroup *current_obj_cgroup(void) * Objcg reference is kept by the task, so it's safe * to use the objcg by the current task. */ - return objcg; + return objcg ? : root_obj_cgroup; } memcg = this_cpu_read(int_active_memcg); if (unlikely(memcg)) goto from_memcg; - return NULL; + return root_obj_cgroup; from_memcg: - objcg = NULL; - for (; !mem_cgroup_is_root(memcg); memcg = parent_mem_cgroup(memcg)) { + for (; memcg; memcg = parent_mem_cgroup(memcg)) { /* * Memcg pointer is protected by scope (see set_active_memcg()) * and is pinning the corresponding objcg, so objcg can't go @@ -2795,10 +2795,10 @@ from_memcg: */ objcg = rcu_dereference_check(memcg->objcg, 1); if (likely(objcg)) - break; + return objcg; } - return objcg; + return root_obj_cgroup; } struct obj_cgroup *get_obj_cgroup_from_folio(struct folio *folio) @@ -2812,14 +2812,8 @@ struct obj_cgroup *get_obj_cgroup_from_folio(struct folio *folio) objcg = __folio_objcg(folio); obj_cgroup_get(objcg); } else { - struct mem_cgroup *memcg; - rcu_read_lock(); - memcg = __folio_memcg(folio); - if (memcg) - objcg = __get_obj_cgroup_from_memcg(memcg); - else - objcg = NULL; + objcg = __get_obj_cgroup_from_memcg(__folio_memcg(folio)); rcu_read_unlock(); } return objcg; @@ -2922,7 +2916,7 @@ int __memcg_kmem_charge_page(struct page *page, gfp_t gfp, int order) int ret = 0; objcg = current_obj_cgroup(); - if (objcg) { + if (objcg && !obj_cgroup_is_root(objcg)) { ret = obj_cgroup_charge_pages(objcg, gfp, 1 << order); if (!ret) { obj_cgroup_get(objcg); @@ -3251,7 +3245,7 @@ bool __memcg_slab_post_alloc_hook(struct kmem_cache *s, struct list_lru *lru, * obj_cgroup_get() is used to get a permanent reference. 
*/ objcg = current_obj_cgroup(); - if (!objcg) + if (!objcg || obj_cgroup_is_root(objcg)) return true; /* @@ -3927,6 +3921,9 @@ static int mem_cgroup_css_online(struct cgroup_subsys_state *css) if (!objcg) goto free_shrinker; + if (unlikely(mem_cgroup_is_root(memcg))) + root_obj_cgroup = objcg; + objcg->memcg = memcg; rcu_assign_pointer(memcg->objcg, objcg); obj_cgroup_get(objcg); @@ -5551,6 +5548,9 @@ void obj_cgroup_charge_zswap(struct obj_cgroup *objcg, size_t size) if (!cgroup_subsys_on_dfl(memory_cgrp_subsys)) return; + if (obj_cgroup_is_root(objcg)) + return; + VM_WARN_ON_ONCE(!(current->flags & PF_MEMALLOC)); /* PF_MEMALLOC context, charging must succeed */ @@ -5580,6 +5580,9 @@ void obj_cgroup_uncharge_zswap(struct obj_cgroup *objcg, size_t size) if (!cgroup_subsys_on_dfl(memory_cgrp_subsys)) return; + if (obj_cgroup_is_root(objcg)) + return; + obj_cgroup_uncharge(objcg, size); rcu_read_lock(); diff --git a/mm/percpu.c b/mm/percpu.c index a2107bdebf0b..b0676b8054ed 100644 --- a/mm/percpu.c +++ b/mm/percpu.c @@ -1622,7 +1622,7 @@ static bool pcpu_memcg_pre_alloc_hook(size_t size, gfp_t gfp, return true; objcg = current_obj_cgroup(); - if (!objcg) + if (!objcg || obj_cgroup_is_root(objcg)) return true; if (obj_cgroup_charge(objcg, gfp, pcpu_obj_full_size(size))) -- cgit v1.2.3 From af86590786d7ee1597ff0d8ea4e18f94529d2442 Mon Sep 17 00:00:00 2001 From: Muchun Song Date: Thu, 5 Mar 2026 19:52:26 +0800 Subject: mm: memcontrol: prevent memory cgroup release in get_mem_cgroup_from_folio() MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit In the near future, a folio will no longer pin its corresponding memory cgroup. To ensure safety, it will only be appropriate to hold the rcu read lock or acquire a reference to the memory cgroup returned by folio_memcg(), thereby preventing it from being released. In the current patch, the rcu read lock is employed to safeguard against the release of the memory cgroup in get_mem_cgroup_from_folio(). This serves as a preparatory measure for the reparenting of the LRU pages. 
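The hunk below implements this with a tryget/retry loop; here is a
commented standalone form (the wrapper name is illustrative, and the early
return for uncharged folios is elided):

/* Illustrative wrapper around the pattern added below. */
static struct mem_cgroup *pin_folio_memcg(struct folio *folio)
{
        struct mem_cgroup *memcg;

        rcu_read_lock();
        do {
                /*
                 * Under RCU the memcg cannot be freed, but it may already
                 * be offline with its reference count drained.  If
                 * css_tryget() fails, the folio is being (or, once
                 * reparenting lands later in the series, has been)
                 * reparented, so re-reading folio_memcg() will return an
                 * ancestor that can be pinned.
                 */
                memcg = folio_memcg(folio);
        } while (unlikely(!css_tryget(&memcg->css)));
        rcu_read_unlock();

        return memcg;
}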
Link: https://lore.kernel.org/a5a64c6173a566bd21534606aeaaa9220cb1366d.1772711148.git.zhengqi.arch@bytedance.com Signed-off-by: Muchun Song Signed-off-by: Qi Zheng Reviewed-by: Harry Yoo Acked-by: Shakeel Butt Cc: Allen Pais Cc: Axel Rasmussen Cc: Baoquan He Cc: Chengming Zhou Cc: Chen Ridong Cc: David Hildenbrand Cc: Hamza Mahfooz Cc: Hugh Dickins Cc: Imran Khan Cc: Johannes Weiner Cc: Kamalesh Babulal Cc: Lance Yang Cc: Liam Howlett Cc: Lorenzo Stoakes (Oracle) Cc: Michal Hocko Cc: Michal Koutný Cc: Mike Rapoport Cc: Muchun Song Cc: Nhat Pham Cc: Roman Gushchin Cc: Suren Baghdasaryan Cc: Usama Arif Cc: Vlastimil Babka Cc: Wei Xu Cc: Yosry Ahmed Cc: Yuanchu Xie Cc: Zi Yan Signed-off-by: Andrew Morton --- mm/memcontrol.c | 10 +++++++--- 1 file changed, 7 insertions(+), 3 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index e7022adcea7f..dbcf0d2bf114 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -996,14 +996,18 @@ again: */ struct mem_cgroup *get_mem_cgroup_from_folio(struct folio *folio) { - struct mem_cgroup *memcg = folio_memcg(folio); + struct mem_cgroup *memcg; if (mem_cgroup_disabled()) return NULL; + if (!folio_memcg_charged(folio)) + return root_mem_cgroup; + rcu_read_lock(); - if (!memcg || WARN_ON_ONCE(!css_tryget(&memcg->css))) - memcg = root_mem_cgroup; + do { + memcg = folio_memcg(folio); + } while (unlikely(!css_tryget(&memcg->css))); rcu_read_unlock(); return memcg; } -- cgit v1.2.3 From d10adce2c1a8ec61b46ff1841d3662f3c7a66d7a Mon Sep 17 00:00:00 2001 From: Muchun Song Date: Thu, 5 Mar 2026 19:52:27 +0800 Subject: buffer: prevent memory cgroup release in folio_alloc_buffers() MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit In the near future, a folio will no longer pin its corresponding memory cgroup. To ensure safety, it will only be appropriate to hold the rcu read lock or acquire a reference to the memory cgroup returned by folio_memcg(), thereby preventing it from being released. In the current patch, the function get_mem_cgroup_from_folio() is employed to safeguard against the release of the memory cgroup. This serves as a preparatory measure for the reparenting of the LRU pages. 
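The shape of the change below, as a small sketch (the function name is
illustrative): rather than relying on the folio lock to pin the memcg,
take a reference for the duration of the buffer allocation and drop it
afterwards.

/* Illustrative shape of the folio_alloc_buffers() change below. */
static void alloc_on_behalf_of_folio(struct folio *folio)
{
        struct mem_cgroup *memcg = get_mem_cgroup_from_folio(folio);
        struct mem_cgroup *old_memcg = set_active_memcg(memcg);

        /* ... allocations here are charged to the folio's memcg ... */

        set_active_memcg(old_memcg);
        mem_cgroup_put(memcg);
}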
Link: https://lore.kernel.org/d6d48fdcf329c549373ac0a1c80fd9f38067e34e.1772711148.git.zhengqi.arch@bytedance.com Signed-off-by: Muchun Song Signed-off-by: Qi Zheng Reviewed-by: Harry Yoo Acked-by: Johannes Weiner Acked-by: Shakeel Butt Cc: Allen Pais Cc: Axel Rasmussen Cc: Baoquan He Cc: Chengming Zhou Cc: Chen Ridong Cc: David Hildenbrand Cc: Hamza Mahfooz Cc: Hugh Dickins Cc: Imran Khan Cc: Kamalesh Babulal Cc: Lance Yang Cc: Liam Howlett Cc: Lorenzo Stoakes (Oracle) Cc: Michal Hocko Cc: Michal Koutný Cc: Mike Rapoport Cc: Muchun Song Cc: Nhat Pham Cc: Roman Gushchin Cc: Suren Baghdasaryan Cc: Usama Arif Cc: Vlastimil Babka Cc: Wei Xu Cc: Yosry Ahmed Cc: Yuanchu Xie Cc: Zi Yan Signed-off-by: Andrew Morton --- fs/buffer.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/fs/buffer.c b/fs/buffer.c index f3122160ee2d..bbe42edad59d 100644 --- a/fs/buffer.c +++ b/fs/buffer.c @@ -922,8 +922,7 @@ struct buffer_head *folio_alloc_buffers(struct folio *folio, unsigned long size, long offset; struct mem_cgroup *memcg, *old_memcg; - /* The folio lock pins the memcg */ - memcg = folio_memcg(folio); + memcg = get_mem_cgroup_from_folio(folio); old_memcg = set_active_memcg(memcg); head = NULL; @@ -944,6 +943,7 @@ struct buffer_head *folio_alloc_buffers(struct folio *folio, unsigned long size, } out: set_active_memcg(old_memcg); + mem_cgroup_put(memcg); return head; /* * In case anything failed, we just free everything we got. -- cgit v1.2.3 From 49717c7bd6b8e14329c2d04b1e8ec691175b6f4e Mon Sep 17 00:00:00 2001 From: Muchun Song Date: Thu, 5 Mar 2026 19:52:28 +0800 Subject: writeback: prevent memory cgroup release in writeback module MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit In the near future, a folio will no longer pin its corresponding memory cgroup. To ensure safety, it will only be appropriate to hold the rcu read lock or acquire a reference to the memory cgroup returned by folio_memcg(), thereby preventing it from being released. In the current patch, the function get_mem_cgroup_css_from_folio() and the rcu read lock are employed to safeguard against the release of the memory cgroup. This serves as a preparatory measure for the reparenting of the LRU pages. 
Link: https://lore.kernel.org/645f99bc344575417f67def3744f975596df2793.1772711148.git.zhengqi.arch@bytedance.com Signed-off-by: Muchun Song Signed-off-by: Qi Zheng Reviewed-by: Harry Yoo Acked-by: Johannes Weiner Acked-by: Shakeel Butt Cc: Allen Pais Cc: Axel Rasmussen Cc: Baoquan He Cc: Chengming Zhou Cc: Chen Ridong Cc: David Hildenbrand Cc: Hamza Mahfooz Cc: Hugh Dickins Cc: Imran Khan Cc: Kamalesh Babulal Cc: Lance Yang Cc: Liam Howlett Cc: Lorenzo Stoakes (Oracle) Cc: Michal Hocko Cc: Michal Koutný Cc: Mike Rapoport Cc: Muchun Song Cc: Nhat Pham Cc: Roman Gushchin Cc: Suren Baghdasaryan Cc: Usama Arif Cc: Vlastimil Babka Cc: Wei Xu Cc: Yosry Ahmed Cc: Yuanchu Xie Cc: Zi Yan Signed-off-by: Andrew Morton --- fs/fs-writeback.c | 22 +++++++++++----------- include/linux/memcontrol.h | 9 +++++++-- include/trace/events/writeback.h | 3 +++ mm/memcontrol.c | 14 ++++++++------ 4 files changed, 29 insertions(+), 19 deletions(-) diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c index 7c75ed7e8979..c3442a38450c 100644 --- a/fs/fs-writeback.c +++ b/fs/fs-writeback.c @@ -280,15 +280,13 @@ void __inode_attach_wb(struct inode *inode, struct folio *folio) if (inode_cgwb_enabled(inode)) { struct cgroup_subsys_state *memcg_css; - if (folio) { - memcg_css = mem_cgroup_css_from_folio(folio); - wb = wb_get_create(bdi, memcg_css, GFP_ATOMIC); - } else { - /* must pin memcg_css, see wb_get_create() */ + /* must pin memcg_css, see wb_get_create() */ + if (folio) + memcg_css = get_mem_cgroup_css_from_folio(folio); + else memcg_css = task_get_css(current, memory_cgrp_id); - wb = wb_get_create(bdi, memcg_css, GFP_ATOMIC); - css_put(memcg_css); - } + wb = wb_get_create(bdi, memcg_css, GFP_ATOMIC); + css_put(memcg_css); } if (!wb) @@ -979,16 +977,16 @@ void wbc_account_cgroup_owner(struct writeback_control *wbc, struct folio *folio if (!wbc->wb || wbc->no_cgroup_owner) return; - css = mem_cgroup_css_from_folio(folio); + css = get_mem_cgroup_css_from_folio(folio); /* dead cgroups shouldn't contribute to inode ownership arbitration */ if (!css_is_online(css)) - return; + goto out; id = css->id; if (id == wbc->wb_id) { wbc->wb_bytes += bytes; - return; + goto out; } if (id == wbc->wb_lcand_id) @@ -1001,6 +999,8 @@ void wbc_account_cgroup_owner(struct writeback_control *wbc, struct folio *folio wbc->wb_tcand_bytes += bytes; else wbc->wb_tcand_bytes -= min(bytes, wbc->wb_tcand_bytes); +out: + css_put(css); } EXPORT_SYMBOL_GPL(wbc_account_cgroup_owner); diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 9a015258a2ff..4454f03a4acf 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -894,7 +894,7 @@ static inline bool mm_match_cgroup(struct mm_struct *mm, return match; } -struct cgroup_subsys_state *mem_cgroup_css_from_folio(struct folio *folio); +struct cgroup_subsys_state *get_mem_cgroup_css_from_folio(struct folio *folio); ino_t page_cgroup_ino(struct page *page); static inline bool mem_cgroup_online(struct mem_cgroup *memcg) @@ -1563,9 +1563,14 @@ static inline void mem_cgroup_track_foreign_dirty(struct folio *folio, if (mem_cgroup_disabled()) return; + if (!folio_memcg_charged(folio)) + return; + + rcu_read_lock(); memcg = folio_memcg(folio); - if (unlikely(memcg && &memcg->css != wb->memcg_css)) + if (unlikely(&memcg->css != wb->memcg_css)) mem_cgroup_track_foreign_dirty_slowpath(folio, wb); + rcu_read_unlock(); } void mem_cgroup_flush_foreign(struct bdi_writeback *wb); diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h index 
4d3d8c8f3a1b..b849b8cc96b1 100644 --- a/include/trace/events/writeback.h +++ b/include/trace/events/writeback.h @@ -294,7 +294,10 @@ TRACE_EVENT(track_foreign_dirty, __entry->ino = inode ? inode->i_ino : 0; __entry->memcg_id = wb->memcg_css->id; __entry->cgroup_ino = __trace_wb_assign_cgroup(wb); + + rcu_read_lock(); __entry->page_cgroup_ino = cgroup_ino(folio_memcg(folio)->css.cgroup); + rcu_read_unlock(); ), TP_printk("bdi %s[%llu]: ino=%lu memcg_id=%u cgroup_ino=%lu page_cgroup_ino=%lu", diff --git a/mm/memcontrol.c b/mm/memcontrol.c index dbcf0d2bf114..d7d4b44c5af5 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -243,7 +243,7 @@ DEFINE_STATIC_KEY_FALSE(memcg_bpf_enabled_key); EXPORT_SYMBOL(memcg_bpf_enabled_key); /** - * mem_cgroup_css_from_folio - css of the memcg associated with a folio + * get_mem_cgroup_css_from_folio - acquire a css of the memcg associated with a folio * @folio: folio of interest * * If memcg is bound to the default hierarchy, css of the memcg associated @@ -253,14 +253,16 @@ EXPORT_SYMBOL(memcg_bpf_enabled_key); * If memcg is bound to a traditional hierarchy, the css of root_mem_cgroup * is returned. */ -struct cgroup_subsys_state *mem_cgroup_css_from_folio(struct folio *folio) +struct cgroup_subsys_state *get_mem_cgroup_css_from_folio(struct folio *folio) { - struct mem_cgroup *memcg = folio_memcg(folio); + struct mem_cgroup *memcg; - if (!memcg || !cgroup_subsys_on_dfl(memory_cgrp_subsys)) - memcg = root_mem_cgroup; + if (!cgroup_subsys_on_dfl(memory_cgrp_subsys)) + return &root_mem_cgroup->css; - return &memcg->css; + memcg = get_mem_cgroup_from_folio(folio); + + return memcg ? &memcg->css : &root_mem_cgroup->css; } /** -- cgit v1.2.3 From f995da5341c1854e59415c2c2c6f0b6406b498f2 Mon Sep 17 00:00:00 2001 From: Muchun Song Date: Thu, 5 Mar 2026 19:52:29 +0800 Subject: mm: memcontrol: prevent memory cgroup release in count_memcg_folio_events() MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit In the near future, a folio will no longer pin its corresponding memory cgroup. To ensure safety, it will only be appropriate to hold the rcu read lock or acquire a reference to the memory cgroup returned by folio_memcg(), thereby preventing it from being released. In the current patch, the rcu read lock is employed to safeguard against the release of the memory cgroup in count_memcg_folio_events(). This serves as a preparatory measure for the reparenting of the LRU pages. 
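The resulting helper body, shown standalone with comments (it mirrors the
hunk below; the standalone name is illustrative):

/* Illustrative standalone form of the hunk below. */
static void count_folio_event(struct folio *folio, enum vm_event_item idx,
                              unsigned long nr)
{
        /* Keep the common uncharged case cheap and lock-free. */
        if (!folio_memcg_charged(folio))
                return;

        rcu_read_lock();
        /* folio_memcg() is only stable inside the RCU read-side section. */
        count_memcg_events(folio_memcg(folio), idx, nr);
        rcu_read_unlock();
}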
Link: https://lore.kernel.org/dea6aa0389367f7fd6b715c8837a2cf7506bd889.1772711148.git.zhengqi.arch@bytedance.com Signed-off-by: Muchun Song Signed-off-by: Qi Zheng Reviewed-by: Harry Yoo Acked-by: Johannes Weiner Acked-by: Shakeel Butt Cc: Allen Pais Cc: Axel Rasmussen Cc: Baoquan He Cc: Chengming Zhou Cc: Chen Ridong Cc: David Hildenbrand Cc: Hamza Mahfooz Cc: Hugh Dickins Cc: Imran Khan Cc: Kamalesh Babulal Cc: Lance Yang Cc: Liam Howlett Cc: Lorenzo Stoakes (Oracle) Cc: Michal Hocko Cc: Michal Koutný Cc: Mike Rapoport Cc: Muchun Song Cc: Nhat Pham Cc: Roman Gushchin Cc: Suren Baghdasaryan Cc: Usama Arif Cc: Vlastimil Babka Cc: Wei Xu Cc: Yosry Ahmed Cc: Yuanchu Xie Cc: Zi Yan Signed-off-by: Andrew Morton --- include/linux/memcontrol.h | 11 ++++++++--- 1 file changed, 8 insertions(+), 3 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 4454f03a4acf..ef26ba087844 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -975,10 +975,15 @@ void count_memcg_events(struct mem_cgroup *memcg, enum vm_event_item idx, static inline void count_memcg_folio_events(struct folio *folio, enum vm_event_item idx, unsigned long nr) { - struct mem_cgroup *memcg = folio_memcg(folio); + struct mem_cgroup *memcg; - if (memcg) - count_memcg_events(memcg, idx, nr); + if (!folio_memcg_charged(folio)) + return; + + rcu_read_lock(); + memcg = folio_memcg(folio); + count_memcg_events(memcg, idx, nr); + rcu_read_unlock(); } static inline void count_memcg_events_mm(struct mm_struct *mm, -- cgit v1.2.3 From 1f6f80c2dbb4516dffaaeb54a9009acea2bf61ca Mon Sep 17 00:00:00 2001 From: Muchun Song Date: Thu, 5 Mar 2026 19:52:30 +0800 Subject: mm: page_io: prevent memory cgroup release in page_io module MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit In the near future, a folio will no longer pin its corresponding memory cgroup. To ensure safety, it will only be appropriate to hold the rcu read lock or acquire a reference to the memory cgroup returned by folio_memcg(), thereby preventing it from being released. In the current patch, the rcu read lock is employed to safeguard against the release of the memory cgroup in swap_writeout() and bio_associate_blkg_from_page(). This serves as a preparatory measure for the reparenting of the LRU pages. 
Link: https://lore.kernel.org/7c3708358412fb02c482d0985feb5e9513a863ef.1772711148.git.zhengqi.arch@bytedance.com Signed-off-by: Muchun Song Signed-off-by: Qi Zheng Reviewed-by: Harry Yoo Acked-by: Johannes Weiner Acked-by: Shakeel Butt Cc: Allen Pais Cc: Axel Rasmussen Cc: Baoquan He Cc: Chengming Zhou Cc: Chen Ridong Cc: David Hildenbrand Cc: Hamza Mahfooz Cc: Hugh Dickins Cc: Imran Khan Cc: Kamalesh Babulal Cc: Lance Yang Cc: Liam Howlett Cc: Lorenzo Stoakes (Oracle) Cc: Michal Hocko Cc: Michal Koutný Cc: Mike Rapoport Cc: Muchun Song Cc: Nhat Pham Cc: Roman Gushchin Cc: Suren Baghdasaryan Cc: Usama Arif Cc: Vlastimil Babka Cc: Wei Xu Cc: Yosry Ahmed Cc: Yuanchu Xie Cc: Zi Yan Signed-off-by: Andrew Morton --- mm/page_io.c | 8 ++++++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/mm/page_io.c b/mm/page_io.c index 330abc5ab7b4..93d03d9e2a6a 100644 --- a/mm/page_io.c +++ b/mm/page_io.c @@ -276,10 +276,14 @@ int swap_writeout(struct folio *folio, struct swap_iocb **swap_plug) count_mthp_stat(folio_order(folio), MTHP_STAT_ZSWPOUT); goto out_unlock; } + + rcu_read_lock(); if (!mem_cgroup_zswap_writeback_enabled(folio_memcg(folio))) { + rcu_read_unlock(); folio_mark_dirty(folio); return AOP_WRITEPAGE_ACTIVATE; } + rcu_read_unlock(); __swap_writepage(folio, swap_plug); return 0; @@ -307,11 +311,11 @@ static void bio_associate_blkg_from_page(struct bio *bio, struct folio *folio) struct cgroup_subsys_state *css; struct mem_cgroup *memcg; - memcg = folio_memcg(folio); - if (!memcg) + if (!folio_memcg_charged(folio)) return; rcu_read_lock(); + memcg = folio_memcg(folio); css = cgroup_e_css(memcg->css.cgroup, &io_cgrp_subsys); bio_associate_blkg_from_css(bio, css); rcu_read_unlock(); -- cgit v1.2.3 From 53050890802e25b6b04ab5b243c90e42d10ef777 Mon Sep 17 00:00:00 2001 From: Muchun Song Date: Thu, 5 Mar 2026 19:52:31 +0800 Subject: mm: migrate: prevent memory cgroup release in folio_migrate_mapping() MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit In the near future, a folio will no longer pin its corresponding memory cgroup. To ensure safety, it will only be appropriate to hold the rcu read lock or acquire a reference to the memory cgroup returned by folio_memcg(), thereby preventing it from being released. In __folio_migrate_mapping(), the rcu read lock is employed to safeguard against the release of the memory cgroup in folio_migrate_mapping(). This serves as a preparatory measure for the reparenting of the LRU pages. 
Link: https://lore.kernel.org/0f156c2f1188f256855617953f8305f43e066065.1772711148.git.zhengqi.arch@bytedance.com Signed-off-by: Muchun Song Signed-off-by: Qi Zheng Reviewed-by: Harry Yoo Acked-by: Johannes Weiner Acked-by: Shakeel Butt Cc: Allen Pais Cc: Axel Rasmussen Cc: Baoquan He Cc: Chengming Zhou Cc: Chen Ridong Cc: David Hildenbrand Cc: Hamza Mahfooz Cc: Hugh Dickins Cc: Imran Khan Cc: Kamalesh Babulal Cc: Lance Yang Cc: Liam Howlett Cc: Lorenzo Stoakes (Oracle) Cc: Michal Hocko Cc: Michal Koutný Cc: Mike Rapoport Cc: Muchun Song Cc: Nhat Pham Cc: Roman Gushchin Cc: Suren Baghdasaryan Cc: Usama Arif Cc: Vlastimil Babka Cc: Wei Xu Cc: Yosry Ahmed Cc: Yuanchu Xie Cc: Zi Yan Signed-off-by: Andrew Morton --- mm/migrate.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/mm/migrate.c b/mm/migrate.c index 76142a02192b..8a64291ab5b4 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -672,6 +672,7 @@ static int __folio_migrate_mapping(struct address_space *mapping, struct lruvec *old_lruvec, *new_lruvec; struct mem_cgroup *memcg; + rcu_read_lock(); memcg = folio_memcg(folio); old_lruvec = mem_cgroup_lruvec(memcg, oldzone->zone_pgdat); new_lruvec = mem_cgroup_lruvec(memcg, newzone->zone_pgdat); @@ -699,6 +700,7 @@ static int __folio_migrate_mapping(struct address_space *mapping, mod_lruvec_state(new_lruvec, NR_FILE_DIRTY, nr); __mod_zone_page_state(newzone, NR_ZONE_WRITE_PENDING, nr); } + rcu_read_unlock(); } local_irq_enable(); -- cgit v1.2.3 From c29f90a2dac18bdd407eafc4cbaa57f14665393a Mon Sep 17 00:00:00 2001 From: Muchun Song Date: Thu, 5 Mar 2026 19:52:32 +0800 Subject: mm: mglru: prevent memory cgroup release in mglru MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit In the near future, a folio will no longer pin its corresponding memory cgroup. To ensure safety, it will only be appropriate to hold the rcu read lock or acquire a reference to the memory cgroup returned by folio_memcg(), thereby preventing it from being released. In the current patch, the rcu read lock is employed to safeguard against the release of the memory cgroup in mglru. This serves as a preparatory measure for the reparenting of the LRU pages. 
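As an illustrative contrast (not part of the patch): where the critical section is short, the RCU read lock alone suffices, but lru_gen_look_around() spans the whole PTE scan and cannot conveniently sit inside one RCU section, so it pins the memcg with a reference instead. A minimal sketch of that second pattern, with example_scan() standing in for the real work:

	struct mem_cgroup *memcg;
	struct lruvec *lruvec;

	/* Takes a reference; the memcg cannot be released until mem_cgroup_put(). */
	memcg = get_mem_cgroup_from_folio(folio);
	lruvec = mem_cgroup_lruvec(memcg, folio_pgdat(folio));

	example_scan(lruvec);	/* may be arbitrarily long; placeholder name */

	mem_cgroup_put(memcg);
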
Link: https://lore.kernel.org/9d887662a9d39c425742dd8468e3123316bccfe3.1772711148.git.zhengqi.arch@bytedance.com Signed-off-by: Muchun Song Signed-off-by: Qi Zheng Acked-by: Shakeel Butt Reviewed-by: Harry Yoo Cc: Allen Pais Cc: Axel Rasmussen Cc: Baoquan He Cc: Chengming Zhou Cc: Chen Ridong Cc: David Hildenbrand Cc: Hamza Mahfooz Cc: Hugh Dickins Cc: Imran Khan Cc: Johannes Weiner Cc: Kamalesh Babulal Cc: Lance Yang Cc: Liam Howlett Cc: Lorenzo Stoakes (Oracle) Cc: Michal Hocko Cc: Michal Koutný Cc: Mike Rapoport Cc: Muchun Song Cc: Nhat Pham Cc: Roman Gushchin Cc: Suren Baghdasaryan Cc: Usama Arif Cc: Vlastimil Babka Cc: Wei Xu Cc: Yosry Ahmed Cc: Yuanchu Xie Cc: Zi Yan Signed-off-by: Andrew Morton --- mm/vmscan.c | 22 ++++++++++++++++------ 1 file changed, 16 insertions(+), 6 deletions(-) diff --git a/mm/vmscan.c b/mm/vmscan.c index 031fbd35ae10..6f3f9e20ff67 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -3440,8 +3440,10 @@ static struct folio *get_pfn_folio(unsigned long pfn, struct mem_cgroup *memcg, if (folio_nid(folio) != pgdat->node_id) return NULL; + rcu_read_lock(); if (folio_memcg(folio) != memcg) - return NULL; + folio = NULL; + rcu_read_unlock(); return folio; } @@ -4211,12 +4213,12 @@ bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw, unsigned int nr) unsigned long addr = pvmw->address; struct vm_area_struct *vma = pvmw->vma; struct folio *folio = pfn_folio(pvmw->pfn); - struct mem_cgroup *memcg = folio_memcg(folio); + struct mem_cgroup *memcg; struct pglist_data *pgdat = folio_pgdat(folio); - struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat); - struct lru_gen_mm_state *mm_state = get_mm_state(lruvec); - DEFINE_MAX_SEQ(lruvec); - int gen = lru_gen_from_seq(max_seq); + struct lruvec *lruvec; + struct lru_gen_mm_state *mm_state; + unsigned long max_seq; + int gen; lockdep_assert_held(pvmw->ptl); VM_WARN_ON_ONCE_FOLIO(folio_test_lru(folio), folio); @@ -4251,6 +4253,12 @@ bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw, unsigned int nr) } } + memcg = get_mem_cgroup_from_folio(folio); + lruvec = mem_cgroup_lruvec(memcg, pgdat); + max_seq = READ_ONCE((lruvec)->lrugen.max_seq); + gen = lru_gen_from_seq(max_seq); + mm_state = get_mm_state(lruvec); + lazy_mmu_mode_enable(); pte -= (addr - start) / PAGE_SIZE; @@ -4300,6 +4308,8 @@ bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw, unsigned int nr) if (mm_state && suitable_to_scan(i, young)) update_bloom_filter(mm_state, max_seq, pvmw->pmd); + mem_cgroup_put(memcg); + return true; } -- cgit v1.2.3 From c863aded26d1f98247af2719b1e3ed01e3d0d4f6 Mon Sep 17 00:00:00 2001 From: Muchun Song Date: Thu, 5 Mar 2026 19:52:33 +0800 Subject: mm: memcontrol: prevent memory cgroup release in mem_cgroup_swap_full() MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit In the near future, a folio will no longer pin its corresponding memory cgroup. To ensure safety, it will only be appropriate to hold the rcu read lock or acquire a reference to the memory cgroup returned by folio_memcg(), thereby preventing it from being released. In the current patch, the rcu read lock is employed to safeguard against the release of the memory cgroup in mem_cgroup_swap_full(). This serves as a preparatory measure for the reparenting of the LRU pages. 
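Sketch of the control-flow change only (illustrative, not the patch itself): because the whole ancestor walk now sits under rcu_read_lock(), early returns are replaced by a result flag so that the single rcu_read_unlock() is always reached. swap_is_nearly_full() is a placeholder for the swap.high/swap.max counter checks:

	bool ret = false;

	rcu_read_lock();
	for (memcg = folio_memcg(folio); !mem_cgroup_is_root(memcg);
	     memcg = parent_mem_cgroup(memcg)) {
		if (swap_is_nearly_full(memcg)) {	/* placeholder */
			ret = true;
			break;		/* never return with the RCU lock held */
		}
	}
	rcu_read_unlock();

	return ret;
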
Link: https://lore.kernel.org/21d1abab7342615745ea4c18a88237335ab44d13.1772711148.git.zhengqi.arch@bytedance.com Signed-off-by: Muchun Song Signed-off-by: Qi Zheng Reviewed-by: Harry Yoo Acked-by: Johannes Weiner Acked-by: Shakeel Butt Cc: Allen Pais Cc: Axel Rasmussen Cc: Baoquan He Cc: Chengming Zhou Cc: Chen Ridong Cc: David Hildenbrand Cc: Hamza Mahfooz Cc: Hugh Dickins Cc: Imran Khan Cc: Kamalesh Babulal Cc: Lance Yang Cc: Liam Howlett Cc: Lorenzo Stoakes (Oracle) Cc: Michal Hocko Cc: Michal Koutný Cc: Mike Rapoport Cc: Muchun Song Cc: Nhat Pham Cc: Roman Gushchin Cc: Suren Baghdasaryan Cc: Usama Arif Cc: Vlastimil Babka Cc: Wei Xu Cc: Yosry Ahmed Cc: Yuanchu Xie Cc: Zi Yan Signed-off-by: Andrew Morton --- mm/memcontrol.c | 18 ++++++++++-------- 1 file changed, 10 insertions(+), 8 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index d7d4b44c5af5..10021cef176b 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -5338,27 +5338,29 @@ long mem_cgroup_get_nr_swap_pages(struct mem_cgroup *memcg) bool mem_cgroup_swap_full(struct folio *folio) { struct mem_cgroup *memcg; + bool ret = false; VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio); if (vm_swap_full()) return true; - if (do_memsw_account()) - return false; + if (do_memsw_account() || !folio_memcg_charged(folio)) + return ret; + rcu_read_lock(); memcg = folio_memcg(folio); - if (!memcg) - return false; - for (; !mem_cgroup_is_root(memcg); memcg = parent_mem_cgroup(memcg)) { unsigned long usage = page_counter_read(&memcg->swap); if (usage * 2 >= READ_ONCE(memcg->swap.high) || - usage * 2 >= READ_ONCE(memcg->swap.max)) - return true; + usage * 2 >= READ_ONCE(memcg->swap.max)) { + ret = true; + break; + } } + rcu_read_unlock(); - return false; + return ret; } static int __init setup_swap_account(char *s) -- cgit v1.2.3 From b3ca98297cd98a51ee9d6d491d0a4ee0ca79b515 Mon Sep 17 00:00:00 2001 From: Muchun Song Date: Thu, 5 Mar 2026 19:52:34 +0800 Subject: mm: workingset: prevent memory cgroup release in lru_gen_eviction() MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit In the near future, a folio will no longer pin its corresponding memory cgroup. To ensure safety, it will only be appropriate to hold the rcu read lock or acquire a reference to the memory cgroup returned by folio_memcg(), thereby preventing it from being released. In the current patch, the rcu read lock is employed to safeguard against the release of the memory cgroup in lru_gen_eviction(). This serves as a preparatory measure for the reparenting of the LRU pages. 
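Illustrative sketch (not part of the patch): when a value derived from the memcg must outlive the RCU section, it is copied out while still under the lock. mem_cgroup_private_id() yields an identifier that remains meaningful for the later shadow-entry lookup even if the memcg itself goes away:

	rcu_read_lock();
	memcg = folio_memcg(folio);
	/* ... update the lrugen eviction counters ... */
	memcg_id = mem_cgroup_private_id(memcg);	/* snapshot before unlock */
	rcu_read_unlock();

	/* memcg_id, not memcg, is what gets packed into the shadow entry. */
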
Link: https://lore.kernel.org/f37e8ae2d84ddc690813d834cd75735d52d1bc78.1772711148.git.zhengqi.arch@bytedance.com Signed-off-by: Muchun Song Signed-off-by: Qi Zheng Reviewed-by: Harry Yoo Acked-by: Johannes Weiner Acked-by: Shakeel Butt Cc: Allen Pais Cc: Axel Rasmussen Cc: Baoquan He Cc: Chengming Zhou Cc: Chen Ridong Cc: David Hildenbrand Cc: Hamza Mahfooz Cc: Hugh Dickins Cc: Imran Khan Cc: Kamalesh Babulal Cc: Lance Yang Cc: Liam Howlett Cc: Lorenzo Stoakes (Oracle) Cc: Michal Hocko Cc: Michal Koutný Cc: Mike Rapoport Cc: Muchun Song Cc: Nhat Pham Cc: Roman Gushchin Cc: Suren Baghdasaryan Cc: Usama Arif Cc: Vlastimil Babka Cc: Wei Xu Cc: Yosry Ahmed Cc: Yuanchu Xie Cc: Zi Yan Signed-off-by: Andrew Morton --- mm/workingset.c | 9 +++++++-- 1 file changed, 7 insertions(+), 2 deletions(-) diff --git a/mm/workingset.c b/mm/workingset.c index 5e8b6e62a617..6971aa163e46 100644 --- a/mm/workingset.c +++ b/mm/workingset.c @@ -244,12 +244,15 @@ static void *lru_gen_eviction(struct folio *folio) int refs = folio_lru_refs(folio); bool workingset = folio_test_workingset(folio); int tier = lru_tier_from_refs(refs, workingset); - struct mem_cgroup *memcg = folio_memcg(folio); + struct mem_cgroup *memcg; struct pglist_data *pgdat = folio_pgdat(folio); + unsigned short memcg_id; BUILD_BUG_ON(LRU_GEN_WIDTH + LRU_REFS_WIDTH > BITS_PER_LONG - max(EVICTION_SHIFT, EVICTION_SHIFT_ANON)); + rcu_read_lock(); + memcg = folio_memcg(folio); lruvec = mem_cgroup_lruvec(memcg, pgdat); lrugen = &lruvec->lrugen; min_seq = READ_ONCE(lrugen->min_seq[type]); @@ -257,8 +260,10 @@ static void *lru_gen_eviction(struct folio *folio) hist = lru_hist_from_seq(min_seq); atomic_long_add(delta, &lrugen->evicted[hist][type][tier]); + memcg_id = mem_cgroup_private_id(memcg); + rcu_read_unlock(); - return pack_shadow(mem_cgroup_private_id(memcg), pgdat, token, workingset, type); + return pack_shadow(memcg_id, pgdat, token, workingset, type); } /* -- cgit v1.2.3 From 681d325b23dccbf8f6beda18dc1a61d8e3c715cf Mon Sep 17 00:00:00 2001 From: Qi Zheng Date: Thu, 5 Mar 2026 19:52:35 +0800 Subject: mm: thp: prevent memory cgroup release in folio_split_queue_lock{_irqsave}() MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit In the near future, a folio will no longer pin its corresponding memory cgroup. To ensure safety, it will only be appropriate to hold the rcu read lock or acquire a reference to the memory cgroup returned by folio_memcg(), thereby preventing it from being released. In the current patch, the rcu read lock is employed to safeguard against the release of the memory cgroup in folio_split_queue_lock{_irqsave}(). 
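The locking hand-off used here, shown as a sketch (illustrative only): RCU is needed just long enough to dereference folio_memcg() and reach the split queue; once the queue's spinlock is held, the memcg-offlining path cannot reparent the queue concurrently because it takes the same lock, so the RCU read lock can be dropped before returning:

	rcu_read_lock();
	queue = split_queue_lock(folio_nid(folio), folio_memcg(folio));
	/* queue is now pinned by its own lock, not by RCU. */
	rcu_read_unlock();

	/* ... operate on queue ... */
	split_queue_unlock(queue);
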
Link: https://lore.kernel.org/ca2957c0df1126b2c71b40c738018fd5255525a6.1772711148.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng Reviewed-by: Harry Yoo Acked-by: Johannes Weiner Acked-by: Shakeel Butt Acked-by: David Hildenbrand (Red Hat) Acked-by: Muchun Song Cc: Allen Pais Cc: Axel Rasmussen Cc: Baoquan He Cc: Chengming Zhou Cc: Chen Ridong Cc: Hamza Mahfooz Cc: Hugh Dickins Cc: Imran Khan Cc: Kamalesh Babulal Cc: Lance Yang Cc: Liam Howlett Cc: Lorenzo Stoakes (Oracle) Cc: Michal Hocko Cc: Michal Koutný Cc: Mike Rapoport Cc: Muchun Song Cc: Nhat Pham Cc: Roman Gushchin Cc: Suren Baghdasaryan Cc: Usama Arif Cc: Vlastimil Babka Cc: Wei Xu Cc: Yosry Ahmed Cc: Yuanchu Xie Cc: Zi Yan Signed-off-by: Andrew Morton --- mm/huge_memory.c | 20 ++++++++++++++++++-- 1 file changed, 18 insertions(+), 2 deletions(-) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 958b580c6619..970e077019b7 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1218,13 +1218,29 @@ retry: static struct deferred_split *folio_split_queue_lock(struct folio *folio) { - return split_queue_lock(folio_nid(folio), folio_memcg(folio)); + struct deferred_split *queue; + + rcu_read_lock(); + queue = split_queue_lock(folio_nid(folio), folio_memcg(folio)); + /* + * The memcg destruction path is acquiring the split queue lock for + * reparenting. Once you have it locked, it's safe to drop the rcu lock. + */ + rcu_read_unlock(); + + return queue; } static struct deferred_split * folio_split_queue_lock_irqsave(struct folio *folio, unsigned long *flags) { - return split_queue_lock_irqsave(folio_nid(folio), folio_memcg(folio), flags); + struct deferred_split *queue; + + rcu_read_lock(); + queue = split_queue_lock_irqsave(folio_nid(folio), folio_memcg(folio), flags); + rcu_read_unlock(); + + return queue; } static inline void split_queue_unlock(struct deferred_split *queue) -- cgit v1.2.3 From cf4d6ad54ba14fc8d6899bacb28b0698ee971cc6 Mon Sep 17 00:00:00 2001 From: Qi Zheng Date: Thu, 5 Mar 2026 19:52:36 +0800 Subject: mm: zswap: prevent memory cgroup release in zswap_compress() MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit In the near future, a folio will no longer pin its corresponding memory cgroup. To ensure safety, it will only be appropriate to hold the rcu read lock or acquire a reference to the memory cgroup returned by folio_memcg(), thereby preventing it from being released. In the current patch, the rcu read lock is employed to safeguard against the release of the memory cgroup in zswap_compress(). 
Link: https://lore.kernel.org/340f315050fb8a67caaf01b4836d4f38a41cf1a8.1772711148.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng Acked-by: Johannes Weiner Acked-by: Shakeel Butt Acked-by: Muchun Song Reviewed-by: Harry Yoo Cc: Allen Pais Cc: Axel Rasmussen Cc: Baoquan He Cc: Chengming Zhou Cc: Chen Ridong Cc: David Hildenbrand Cc: Hamza Mahfooz Cc: Hugh Dickins Cc: Imran Khan Cc: Kamalesh Babulal Cc: Lance Yang Cc: Liam Howlett Cc: Lorenzo Stoakes (Oracle) Cc: Michal Hocko Cc: Michal Koutný Cc: Mike Rapoport Cc: Muchun Song Cc: Nhat Pham Cc: Roman Gushchin Cc: Suren Baghdasaryan Cc: Usama Arif Cc: Vlastimil Babka Cc: Wei Xu Cc: Yosry Ahmed Cc: Yuanchu Xie Cc: Zi Yan Signed-off-by: Andrew Morton --- mm/zswap.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/mm/zswap.c b/mm/zswap.c index 0823cadd02b6..a1c883c68ef6 100644 --- a/mm/zswap.c +++ b/mm/zswap.c @@ -893,11 +893,14 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry, * to the active LRU list in the case. */ if (comp_ret || !dlen || dlen >= PAGE_SIZE) { + rcu_read_lock(); if (!mem_cgroup_zswap_writeback_enabled( folio_memcg(page_folio(page)))) { + rcu_read_unlock(); comp_ret = comp_ret ? comp_ret : -EINVAL; goto unlock; } + rcu_read_unlock(); comp_ret = 0; dlen = PAGE_SIZE; dst = kmap_local_page(page); -- cgit v1.2.3 From fe132152c885d482eb232209bfea87ac94bf253a Mon Sep 17 00:00:00 2001 From: Muchun Song Date: Thu, 5 Mar 2026 19:52:37 +0800 Subject: mm: workingset: prevent lruvec release in workingset_refault() MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit In the near future, a folio will no longer pin its corresponding memory cgroup. So an lruvec returned by folio_lruvec() could be released without the rcu read lock or a reference to its memory cgroup. In the current patch, the rcu read lock is employed to safeguard against the release of the lruvec in workingset_refault(). This serves as a preparatory measure for the reparenting of the LRU pages. Link: https://lore.kernel.org/e3a8c19a9b18422b43213f6c89c451c5b6ca1577.1772711148.git.zhengqi.arch@bytedance.com Signed-off-by: Muchun Song Signed-off-by: Qi Zheng Reviewed-by: Harry Yoo Acked-by: Shakeel Butt Cc: Allen Pais Cc: Axel Rasmussen Cc: Baoquan He Cc: David Hildenbrand Cc: Hamza Mahfooz Cc: Hugh Dickins Cc: Imran Khan Cc: Johannes Weiner Cc: Kamalesh Babulal Cc: Lance Yang Cc: Liam Howlett Cc: Lorenzo Stoakes (Oracle) Cc: Michal Hocko Cc: Michal Koutný Cc: Mike Rapoport Cc: Roman Gushchin Cc: Suren Baghdasaryan Cc: Usama Arif Cc: Vlastimil Babka Cc: Wei Xu Cc: Yuanchu Xie Cc: Zi Yan Cc: Chengming Zhou Cc: Chen Ridong Cc: Muchun Song Cc: Nhat Pham Cc: Yosry Ahmed Signed-off-by: Andrew Morton --- mm/workingset.c | 8 ++++++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/mm/workingset.c b/mm/workingset.c index 6971aa163e46..2de2a355f0f8 100644 --- a/mm/workingset.c +++ b/mm/workingset.c @@ -546,6 +546,7 @@ bool workingset_test_recent(void *shadow, bool file, bool *workingset, void workingset_refault(struct folio *folio, void *shadow) { bool file = folio_is_file_lru(folio); + struct mem_cgroup *memcg; struct lruvec *lruvec; bool workingset; long nr; @@ -567,11 +568,12 @@ void workingset_refault(struct folio *folio, void *shadow) * locked to guarantee folio_memcg() stability throughout. 
*/ nr = folio_nr_pages(folio); - lruvec = folio_lruvec(folio); + memcg = get_mem_cgroup_from_folio(folio); + lruvec = mem_cgroup_lruvec(memcg, folio_pgdat(folio)); mod_lruvec_state(lruvec, WORKINGSET_REFAULT_BASE + file, nr); if (!workingset_test_recent(shadow, file, &workingset, true)) - return; + goto out; folio_set_active(folio); workingset_age_nonresident(lruvec, nr); @@ -587,6 +589,8 @@ void workingset_refault(struct folio *folio, void *shadow) lru_note_cost_refault(folio); mod_lruvec_state(lruvec, WORKINGSET_RESTORE_BASE + file, nr); } +out: + mem_cgroup_put(memcg); } /** -- cgit v1.2.3 From d5ddaf4341f70b13a357b7e8800c8087c96ff318 Mon Sep 17 00:00:00 2001 From: Muchun Song Date: Thu, 5 Mar 2026 19:52:38 +0800 Subject: mm: zswap: prevent lruvec release in zswap_folio_swapin() MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit In the near future, a folio will no longer pin its corresponding memory cgroup. So an lruvec returned by folio_lruvec() could be released without the rcu read lock or a reference to its memory cgroup. In the current patch, the rcu read lock is employed to safeguard against the release of the lruvec in zswap_folio_swapin(). This serves as a preparatory measure for the reparenting of the LRU pages. Link: https://lore.kernel.org/02b3f76ee8d1132f69ac5baaedce38fb82b09a48.1772711148.git.zhengqi.arch@bytedance.com Signed-off-by: Muchun Song Signed-off-by: Qi Zheng Acked-by: Nhat Pham Reviewed-by: Chengming Zhou Reviewed-by: Harry Yoo Acked-by: Johannes Weiner Acked-by: Shakeel Butt Cc: Allen Pais Cc: Axel Rasmussen Cc: Baoquan He Cc: Chen Ridong Cc: David Hildenbrand Cc: Hamza Mahfooz Cc: Hugh Dickins Cc: Imran Khan Cc: Kamalesh Babulal Cc: Lance Yang Cc: Liam Howlett Cc: Lorenzo Stoakes (Oracle) Cc: Michal Hocko Cc: Michal Koutný Cc: Mike Rapoport Cc: Muchun Song Cc: Roman Gushchin Cc: Suren Baghdasaryan Cc: Usama Arif Cc: Vlastimil Babka Cc: Wei Xu Cc: Yosry Ahmed Cc: Yuanchu Xie Cc: Zi Yan Signed-off-by: Andrew Morton --- mm/zswap.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/mm/zswap.c b/mm/zswap.c index a1c883c68ef6..4f2e652e8ad3 100644 --- a/mm/zswap.c +++ b/mm/zswap.c @@ -664,8 +664,10 @@ void zswap_folio_swapin(struct folio *folio) struct lruvec *lruvec; if (folio) { + rcu_read_lock(); lruvec = folio_lruvec(folio); atomic_long_inc(&lruvec->zswap_lruvec_state.nr_disk_swapins); + rcu_read_unlock(); } } -- cgit v1.2.3 From 74e225ffaac7bd8d22cc485902a484381cafa1ab Mon Sep 17 00:00:00 2001 From: Muchun Song Date: Thu, 5 Mar 2026 19:52:39 +0800 Subject: mm: swap: prevent lruvec release in lru_gen_clear_refs() MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit In the near future, a folio will no longer pin its corresponding memory cgroup. So an lruvec returned by folio_lruvec() could be released without the rcu read lock or a reference to its memory cgroup. In the current patch, the rcu read lock is employed to safeguard against the release of the lruvec in lru_gen_clear_refs(). This serves as a preparatory measure for the reparenting of the LRU pages. 
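Illustrative sketch (not part of the patch): folio_lruvec() is subject to the same rule as folio_memcg(). Here only one sequence counter is needed, so it is copied out with READ_ONCE() and the RCU section stays minimal:

	unsigned long seq;

	rcu_read_lock();
	seq = READ_ONCE(folio_lruvec(folio)->lrugen.min_seq[type]);
	rcu_read_unlock();

	/* seq is a plain value; comparing it needs no further protection. */
	return gen == lru_gen_from_seq(seq);
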
Link: https://lore.kernel.org/986cd26227191a48a7c34a2a15812d361f4ebd53.1772711148.git.zhengqi.arch@bytedance.com Signed-off-by: Muchun Song Signed-off-by: Qi Zheng Reviewed-by: Harry Yoo Acked-by: Johannes Weiner Acked-by: Shakeel Butt Cc: Allen Pais Cc: Axel Rasmussen Cc: Baoquan He Cc: Chengming Zhou Cc: Chen Ridong Cc: David Hildenbrand Cc: Hamza Mahfooz Cc: Hugh Dickins Cc: Imran Khan Cc: Kamalesh Babulal Cc: Lance Yang Cc: Liam Howlett Cc: Lorenzo Stoakes (Oracle) Cc: Michal Hocko Cc: Michal Koutný Cc: Mike Rapoport Cc: Muchun Song Cc: Nhat Pham Cc: Roman Gushchin Cc: Suren Baghdasaryan Cc: Usama Arif Cc: Vlastimil Babka Cc: Wei Xu Cc: Yosry Ahmed Cc: Yuanchu Xie Cc: Zi Yan Signed-off-by: Andrew Morton --- mm/swap.c | 8 +++++--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/mm/swap.c b/mm/swap.c index 23df893e2ed7..009b32d6d344 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -412,18 +412,20 @@ static void lru_gen_inc_refs(struct folio *folio) static bool lru_gen_clear_refs(struct folio *folio) { - struct lru_gen_folio *lrugen; int gen = folio_lru_gen(folio); int type = folio_is_file_lru(folio); + unsigned long seq; if (gen < 0) return true; set_mask_bits(&folio->flags.f, LRU_REFS_FLAGS | BIT(PG_workingset), 0); - lrugen = &folio_lruvec(folio)->lrugen; + rcu_read_lock(); + seq = READ_ONCE(folio_lruvec(folio)->lrugen.min_seq[type]); + rcu_read_unlock(); /* whether can do without shuffling under the LRU lock */ - return gen == lru_gen_from_seq(READ_ONCE(lrugen->min_seq[type])); + return gen == lru_gen_from_seq(seq); } #else /* !CONFIG_LRU_GEN */ -- cgit v1.2.3 From 507382970b6ad2806fbfd72bc13e3f7c1249c4b1 Mon Sep 17 00:00:00 2001 From: Muchun Song Date: Thu, 5 Mar 2026 19:52:40 +0800 Subject: mm: workingset: prevent lruvec release in workingset_activation() MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit In the near future, a folio will no longer pin its corresponding memory cgroup. So an lruvec returned by folio_lruvec() could be released without the rcu read lock or a reference to its memory cgroup. In the current patch, the rcu read lock is employed to safeguard against the release of the lruvec in workingset_activation(). This serves as a preparatory measure for the reparenting of the LRU pages. Link: https://lore.kernel.org/c6130476affbba0a7d309a887c3df11e0167990b.1772711148.git.zhengqi.arch@bytedance.com Signed-off-by: Muchun Song Signed-off-by: Qi Zheng Reviewed-by: Harry Yoo Acked-by: Johannes Weiner Acked-by: Shakeel Butt Cc: Allen Pais Cc: Axel Rasmussen Cc: Baoquan He Cc: Chengming Zhou Cc: Chen Ridong Cc: David Hildenbrand Cc: Hamza Mahfooz Cc: Hugh Dickins Cc: Imran Khan Cc: Kamalesh Babulal Cc: Lance Yang Cc: Liam Howlett Cc: Lorenzo Stoakes (Oracle) Cc: Michal Hocko Cc: Michal Koutný Cc: Mike Rapoport Cc: Muchun Song Cc: Nhat Pham Cc: Roman Gushchin Cc: Suren Baghdasaryan Cc: Usama Arif Cc: Vlastimil Babka Cc: Wei Xu Cc: Yosry Ahmed Cc: Yuanchu Xie Cc: Zi Yan Signed-off-by: Andrew Morton --- mm/workingset.c | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/mm/workingset.c b/mm/workingset.c index 2de2a355f0f8..95d722a452e1 100644 --- a/mm/workingset.c +++ b/mm/workingset.c @@ -603,8 +603,11 @@ void workingset_activation(struct folio *folio) * Filter non-memcg pages here, e.g. unmap can call * mark_page_accessed() on VDSO pages. 
*/ - if (mem_cgroup_disabled() || folio_memcg_charged(folio)) + if (mem_cgroup_disabled() || folio_memcg_charged(folio)) { + rcu_read_lock(); workingset_age_nonresident(folio_lruvec(folio), folio_nr_pages(folio)); + rcu_read_unlock(); + } } /* -- cgit v1.2.3 From d14f87858178c64cc94ecd05bb41bba474c1c654 Mon Sep 17 00:00:00 2001 From: Qi Zheng Date: Thu, 5 Mar 2026 19:52:41 +0800 Subject: mm: do not open-code lruvec lock MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Now we have lruvec_unlock(), lruvec_unlock_irq() and lruvec_unlock_irqrestore(), but no the paired lruvec_lock(), lruvec_lock_irq() and lruvec_lock_irqsave(). There is currently no use case for lruvec_lock_irqsave(), so only introduce lruvec_lock_irq(), and change all open-code places to use this helper function. This looks cleaner and prepares for reparenting LRU pages, preventing user from missing RCU lock calls due to open-code lruvec lock. Link: https://lore.kernel.org/2d0bafe7564e17ece46dfd58197af22ce57017dc.1772711148.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng Acked-by: Muchun Song Acked-by: Shakeel Butt Reviewed-by: Harry Yoo Cc: Allen Pais Cc: Axel Rasmussen Cc: Baoquan He Cc: Chengming Zhou Cc: Chen Ridong Cc: David Hildenbrand Cc: Hamza Mahfooz Cc: Hugh Dickins Cc: Imran Khan Cc: Johannes Weiner Cc: Kamalesh Babulal Cc: Lance Yang Cc: Liam Howlett Cc: Lorenzo Stoakes (Oracle) Cc: Michal Hocko Cc: Michal Koutný Cc: Mike Rapoport Cc: Muchun Song Cc: Nhat Pham Cc: Roman Gushchin Cc: Suren Baghdasaryan Cc: Usama Arif Cc: Vlastimil Babka Cc: Wei Xu Cc: Yosry Ahmed Cc: Yuanchu Xie Cc: Zi Yan Signed-off-by: Andrew Morton --- include/linux/memcontrol.h | 5 +++++ mm/vmscan.c | 38 +++++++++++++++++++------------------- 2 files changed, 24 insertions(+), 19 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index ef26ba087844..38f94c7271c1 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -1498,6 +1498,11 @@ static inline struct lruvec *parent_lruvec(struct lruvec *lruvec) return mem_cgroup_lruvec(memcg, lruvec_pgdat(lruvec)); } +static inline void lruvec_lock_irq(struct lruvec *lruvec) +{ + spin_lock_irq(&lruvec->lru_lock); +} + static inline void lruvec_unlock(struct lruvec *lruvec) { spin_unlock(&lruvec->lru_lock); diff --git a/mm/vmscan.c b/mm/vmscan.c index 6f3f9e20ff67..d4b649abe645 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1998,7 +1998,7 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan, lru_add_drain(); - spin_lock_irq(&lruvec->lru_lock); + lruvec_lock_irq(lruvec); nr_taken = isolate_lru_folios(nr_to_scan, lruvec, &folio_list, &nr_scanned, sc, lru); @@ -2008,7 +2008,7 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan, mod_lruvec_state(lruvec, item, nr_scanned); mod_lruvec_state(lruvec, PGSCAN_ANON + file, nr_scanned); - spin_unlock_irq(&lruvec->lru_lock); + lruvec_unlock_irq(lruvec); if (nr_taken == 0) return 0; @@ -2025,7 +2025,7 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan, mod_lruvec_state(lruvec, item, nr_reclaimed); mod_lruvec_state(lruvec, PGSTEAL_ANON + file, nr_reclaimed); - spin_lock_irq(&lruvec->lru_lock); + lruvec_lock_irq(lruvec); lru_note_cost_unlock_irq(lruvec, file, stat.nr_pageout, nr_scanned - nr_reclaimed); @@ -2104,7 +2104,7 @@ static void shrink_active_list(unsigned long nr_to_scan, lru_add_drain(); - spin_lock_irq(&lruvec->lru_lock); + lruvec_lock_irq(lruvec); nr_taken = isolate_lru_folios(nr_to_scan, lruvec, &l_hold, 
&nr_scanned, sc, lru); @@ -2113,7 +2113,7 @@ static void shrink_active_list(unsigned long nr_to_scan, mod_lruvec_state(lruvec, PGREFILL, nr_scanned); - spin_unlock_irq(&lruvec->lru_lock); + lruvec_unlock_irq(lruvec); while (!list_empty(&l_hold)) { struct folio *folio; @@ -2169,7 +2169,7 @@ static void shrink_active_list(unsigned long nr_to_scan, count_memcg_events(lruvec_memcg(lruvec), PGDEACTIVATE, nr_deactivate); mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken); - spin_lock_irq(&lruvec->lru_lock); + lruvec_lock_irq(lruvec); lru_note_cost_unlock_irq(lruvec, file, 0, nr_rotated); trace_mm_vmscan_lru_shrink_active(pgdat->node_id, nr_taken, nr_activate, nr_deactivate, nr_rotated, sc->priority, file); @@ -3803,9 +3803,9 @@ static void walk_mm(struct mm_struct *mm, struct lru_gen_mm_walk *walk) } if (walk->batched) { - spin_lock_irq(&lruvec->lru_lock); + lruvec_lock_irq(lruvec); reset_batch_size(walk); - spin_unlock_irq(&lruvec->lru_lock); + lruvec_unlock_irq(lruvec); } cond_resched(); @@ -3965,7 +3965,7 @@ restart: if (seq < READ_ONCE(lrugen->max_seq)) return false; - spin_lock_irq(&lruvec->lru_lock); + lruvec_lock_irq(lruvec); VM_WARN_ON_ONCE(!seq_is_valid(lruvec)); @@ -3980,7 +3980,7 @@ restart: if (inc_min_seq(lruvec, type, swappiness)) continue; - spin_unlock_irq(&lruvec->lru_lock); + lruvec_unlock_irq(lruvec); cond_resched(); goto restart; } @@ -4015,7 +4015,7 @@ restart: /* make sure preceding modifications appear */ smp_store_release(&lrugen->max_seq, lrugen->max_seq + 1); unlock: - spin_unlock_irq(&lruvec->lru_lock); + lruvec_unlock_irq(lruvec); return success; } @@ -4715,7 +4715,7 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec, struct mem_cgroup *memcg = lruvec_memcg(lruvec); struct pglist_data *pgdat = lruvec_pgdat(lruvec); - spin_lock_irq(&lruvec->lru_lock); + lruvec_lock_irq(lruvec); scanned = isolate_folios(nr_to_scan, lruvec, sc, swappiness, &type, &list); @@ -4724,7 +4724,7 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec, if (evictable_min_seq(lrugen->min_seq, swappiness) + MIN_NR_GENS > lrugen->max_seq) scanned = 0; - spin_unlock_irq(&lruvec->lru_lock); + lruvec_unlock_irq(lruvec); if (list_empty(&list)) return scanned; @@ -4762,9 +4762,9 @@ retry: walk = current->reclaim_state->mm_walk; if (walk && walk->batched) { walk->lruvec = lruvec; - spin_lock_irq(&lruvec->lru_lock); + lruvec_lock_irq(lruvec); reset_batch_size(walk); - spin_unlock_irq(&lruvec->lru_lock); + lruvec_unlock_irq(lruvec); } mod_lruvec_state(lruvec, PGDEMOTE_KSWAPD + reclaimer_offset(sc), @@ -5202,7 +5202,7 @@ static void lru_gen_change_state(bool enabled) for_each_node(nid) { struct lruvec *lruvec = get_lruvec(memcg, nid); - spin_lock_irq(&lruvec->lru_lock); + lruvec_lock_irq(lruvec); VM_WARN_ON_ONCE(!seq_is_valid(lruvec)); VM_WARN_ON_ONCE(!state_is_valid(lruvec)); @@ -5210,12 +5210,12 @@ static void lru_gen_change_state(bool enabled) lruvec->lrugen.enabled = enabled; while (!(enabled ? 
fill_evictable(lruvec) : drain_evictable(lruvec))) { - spin_unlock_irq(&lruvec->lru_lock); + lruvec_unlock_irq(lruvec); cond_resched(); - spin_lock_irq(&lruvec->lru_lock); + lruvec_lock_irq(lruvec); } - spin_unlock_irq(&lruvec->lru_lock); + lruvec_unlock_irq(lruvec); } cond_resched(); -- cgit v1.2.3 From 31b54a5e8916fdd4819880e3aed93f65ecbb47e3 Mon Sep 17 00:00:00 2001 From: Muchun Song Date: Thu, 5 Mar 2026 19:52:42 +0800 Subject: mm: memcontrol: prepare for reparenting LRU pages for lruvec lock MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The following diagram illustrates how to ensure the safety of the folio lruvec lock when LRU folios undergo reparenting. In the folio_lruvec_lock(folio) function: rcu_read_lock(); retry: lruvec = folio_lruvec(folio); /* There is a possibility of folio reparenting at this point. */ spin_lock(&lruvec->lru_lock); if (unlikely(lruvec_memcg(lruvec) != folio_memcg(folio))) { /* * The wrong lruvec lock was acquired, and a retry is required. * This is because the folio resides on the parent memcg lruvec * list. */ spin_unlock(&lruvec->lru_lock); goto retry; } /* Reaching here indicates that folio_memcg() is stable. */ In the memcg_reparent_objcgs(memcg) function: spin_lock(&lruvec->lru_lock); spin_lock(&lruvec_parent->lru_lock); /* Transfer folios from the lruvec list to the parent's. */ spin_unlock(&lruvec_parent->lru_lock); spin_unlock(&lruvec->lru_lock); After acquiring the lruvec lock, it is necessary to verify whether the folio has been reparented. If reparenting has occurred, the new lruvec lock must be reacquired. During the LRU folio reparenting process, the lruvec lock will also be acquired (this will be implemented in a subsequent patch). Therefore, folio_memcg() remains unchanged while the lruvec lock is held. Given that lruvec_memcg(lruvec) is always equal to folio_memcg(folio) after the lruvec lock is acquired, the lruvec_memcg_debug() check is redundant. Hence, it is removed. This patch serves as a preparation for the reparenting of LRU folios. Link: https://lore.kernel.org/23f22cbb1419f277a3483018b32158ae2b86c666.1772711148.git.zhengqi.arch@bytedance.com Signed-off-by: Muchun Song Signed-off-by: Qi Zheng Acked-by: Johannes Weiner Acked-by: Shakeel Butt Cc: Allen Pais Cc: Axel Rasmussen Cc: Baoquan He Cc: Chengming Zhou Cc: Chen Ridong Cc: David Hildenbrand Cc: Hamza Mahfooz Cc: Harry Yoo Cc: Hugh Dickins Cc: Imran Khan Cc: Kamalesh Babulal Cc: Lance Yang Cc: Liam Howlett Cc: Lorenzo Stoakes (Oracle) Cc: Michal Hocko Cc: Michal Koutný Cc: Mike Rapoport Cc: Muchun Song Cc: Nhat Pham Cc: Roman Gushchin Cc: Suren Baghdasaryan Cc: Usama Arif Cc: Vlastimil Babka Cc: Wei Xu Cc: Yosry Ahmed Cc: Yuanchu Xie Cc: Zi Yan Signed-off-by: Andrew Morton --- include/linux/memcontrol.h | 34 ++++++++++++++--------------- include/linux/swap.h | 3 +-- mm/compaction.c | 29 +++++++++++++++++++------ mm/memcontrol.c | 53 +++++++++++++++++++++++----------------------- mm/swap.c | 6 +++++- 5 files changed, 73 insertions(+), 52 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 38f94c7271c1..12982875073e 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -741,7 +741,15 @@ out: * folio_lruvec - return lruvec for isolating/putting an LRU folio * @folio: Pointer to the folio. * - * This function relies on folio->mem_cgroup being stable. + * Call with rcu_read_lock() held to ensure the lifetime of the returned lruvec. 
+ * Note that this alone will NOT guarantee the stability of the folio->lruvec + * association; the folio can be reparented to an ancestor if this races with + * cgroup deletion. + * + * Use folio_lruvec_lock() to ensure both lifetime and stability of the binding. + * Once a lruvec is locked, folio_lruvec() can be called on other folios, and + * their binding is stable if the returned lruvec matches the one the caller has + * locked. Useful for lock batching. */ static inline struct lruvec *folio_lruvec(struct folio *folio) { @@ -764,15 +772,6 @@ struct lruvec *folio_lruvec_lock_irq(struct folio *folio); struct lruvec *folio_lruvec_lock_irqsave(struct folio *folio, unsigned long *flags); -#ifdef CONFIG_DEBUG_VM -void lruvec_memcg_debug(struct lruvec *lruvec, struct folio *folio); -#else -static inline -void lruvec_memcg_debug(struct lruvec *lruvec, struct folio *folio) -{ -} -#endif - static inline struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css){ return css ? container_of(css, struct mem_cgroup, css) : NULL; @@ -1198,11 +1197,6 @@ static inline struct lruvec *folio_lruvec(struct folio *folio) return &pgdat->__lruvec; } -static inline -void lruvec_memcg_debug(struct lruvec *lruvec, struct folio *folio) -{ -} - static inline struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *memcg) { return NULL; @@ -1261,6 +1255,7 @@ static inline struct lruvec *folio_lruvec_lock(struct folio *folio) { struct pglist_data *pgdat = folio_pgdat(folio); + rcu_read_lock(); spin_lock(&pgdat->__lruvec.lru_lock); return &pgdat->__lruvec; } @@ -1269,6 +1264,7 @@ static inline struct lruvec *folio_lruvec_lock_irq(struct folio *folio) { struct pglist_data *pgdat = folio_pgdat(folio); + rcu_read_lock(); spin_lock_irq(&pgdat->__lruvec.lru_lock); return &pgdat->__lruvec; } @@ -1278,6 +1274,7 @@ static inline struct lruvec *folio_lruvec_lock_irqsave(struct folio *folio, { struct pglist_data *pgdat = folio_pgdat(folio); + rcu_read_lock(); spin_lock_irqsave(&pgdat->__lruvec.lru_lock, *flagsp); return &pgdat->__lruvec; } @@ -1500,23 +1497,26 @@ static inline struct lruvec *parent_lruvec(struct lruvec *lruvec) static inline void lruvec_lock_irq(struct lruvec *lruvec) { + rcu_read_lock(); spin_lock_irq(&lruvec->lru_lock); } static inline void lruvec_unlock(struct lruvec *lruvec) { spin_unlock(&lruvec->lru_lock); + rcu_read_unlock(); } static inline void lruvec_unlock_irq(struct lruvec *lruvec) { spin_unlock_irq(&lruvec->lru_lock); + rcu_read_unlock(); } -static inline void lruvec_unlock_irqrestore(struct lruvec *lruvec, - unsigned long flags) +static inline void lruvec_unlock_irqrestore(struct lruvec *lruvec, unsigned long flags) { spin_unlock_irqrestore(&lruvec->lru_lock, flags); + rcu_read_unlock(); } /* Test requires a stable folio->memcg binding, see folio_memcg() */ diff --git a/include/linux/swap.h b/include/linux/swap.h index 4b1f13b5bbad..ea08e2afa2b4 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -310,8 +310,7 @@ extern unsigned long totalreserve_pages; /* linux/mm/swap.c */ void lru_note_cost_unlock_irq(struct lruvec *lruvec, bool file, - unsigned int nr_io, unsigned int nr_rotated) - __releases(lruvec->lru_lock); + unsigned int nr_io, unsigned int nr_rotated); void lru_note_cost_refault(struct folio *); void folio_add_lru(struct folio *); void folio_add_lru_vma(struct folio *, struct vm_area_struct *); diff --git a/mm/compaction.c b/mm/compaction.c index c3e338aaa0ff..3648ce22c807 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -518,6 +518,24 @@ static bool 
compact_lock_irqsave(spinlock_t *lock, unsigned long *flags, return true; } +static struct lruvec * +compact_folio_lruvec_lock_irqsave(struct folio *folio, unsigned long *flags, + struct compact_control *cc) +{ + struct lruvec *lruvec; + + rcu_read_lock(); +retry: + lruvec = folio_lruvec(folio); + compact_lock_irqsave(&lruvec->lru_lock, flags, cc); + if (unlikely(lruvec_memcg(lruvec) != folio_memcg(folio))) { + spin_unlock_irqrestore(&lruvec->lru_lock, *flags); + goto retry; + } + + return lruvec; +} + /* * Compaction requires the taking of some coarse locks that are potentially * very heavily contended. The lock should be periodically unlocked to avoid @@ -839,7 +857,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn, { pg_data_t *pgdat = cc->zone->zone_pgdat; unsigned long nr_scanned = 0, nr_isolated = 0; - struct lruvec *lruvec; + struct lruvec *lruvec = NULL; unsigned long flags = 0; struct lruvec *locked = NULL; struct folio *folio = NULL; @@ -1153,18 +1171,17 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn, if (!folio_test_clear_lru(folio)) goto isolate_fail_put; - lruvec = folio_lruvec(folio); + if (locked) + lruvec = folio_lruvec(folio); /* If we already hold the lock, we can skip some rechecking */ - if (lruvec != locked) { + if (lruvec != locked || !locked) { if (locked) lruvec_unlock_irqrestore(locked, flags); - compact_lock_irqsave(&lruvec->lru_lock, &flags, cc); + lruvec = compact_folio_lruvec_lock_irqsave(folio, &flags, cc); locked = lruvec; - lruvec_memcg_debug(lruvec, folio); - /* * Try get exclusive access under lock. If marked for * skip, the scan is aborted unless the current context diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 10021cef176b..0d4eaaea2b54 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -1206,23 +1206,6 @@ void mem_cgroup_scan_tasks(struct mem_cgroup *memcg, } } -#ifdef CONFIG_DEBUG_VM -void lruvec_memcg_debug(struct lruvec *lruvec, struct folio *folio) -{ - struct mem_cgroup *memcg; - - if (mem_cgroup_disabled()) - return; - - memcg = folio_memcg(folio); - - if (!memcg) - VM_BUG_ON_FOLIO(!mem_cgroup_is_root(lruvec_memcg(lruvec)), folio); - else - VM_BUG_ON_FOLIO(lruvec_memcg(lruvec) != memcg, folio); -} -#endif - /** * folio_lruvec_lock - Lock the lruvec for a folio. * @folio: Pointer to the folio. @@ -1232,14 +1215,20 @@ void lruvec_memcg_debug(struct lruvec *lruvec, struct folio *folio) * - folio_test_lru false * - folio frozen (refcount of 0) * - * Return: The lruvec this folio is on with its lock held. + * Return: The lruvec this folio is on with its lock held and rcu read lock held. */ struct lruvec *folio_lruvec_lock(struct folio *folio) { - struct lruvec *lruvec = folio_lruvec(folio); + struct lruvec *lruvec; + rcu_read_lock(); +retry: + lruvec = folio_lruvec(folio); spin_lock(&lruvec->lru_lock); - lruvec_memcg_debug(lruvec, folio); + if (unlikely(lruvec_memcg(lruvec) != folio_memcg(folio))) { + spin_unlock(&lruvec->lru_lock); + goto retry; + } return lruvec; } @@ -1254,14 +1243,20 @@ struct lruvec *folio_lruvec_lock(struct folio *folio) * - folio frozen (refcount of 0) * * Return: The lruvec this folio is on with its lock held and interrupts - * disabled. + * disabled and rcu read lock held. 
*/ struct lruvec *folio_lruvec_lock_irq(struct folio *folio) { - struct lruvec *lruvec = folio_lruvec(folio); + struct lruvec *lruvec; + rcu_read_lock(); +retry: + lruvec = folio_lruvec(folio); spin_lock_irq(&lruvec->lru_lock); - lruvec_memcg_debug(lruvec, folio); + if (unlikely(lruvec_memcg(lruvec) != folio_memcg(folio))) { + spin_unlock_irq(&lruvec->lru_lock); + goto retry; + } return lruvec; } @@ -1277,15 +1272,21 @@ struct lruvec *folio_lruvec_lock_irq(struct folio *folio) * - folio frozen (refcount of 0) * * Return: The lruvec this folio is on with its lock held and interrupts - * disabled. + * disabled and rcu read lock held. */ struct lruvec *folio_lruvec_lock_irqsave(struct folio *folio, unsigned long *flags) { - struct lruvec *lruvec = folio_lruvec(folio); + struct lruvec *lruvec; + rcu_read_lock(); +retry: + lruvec = folio_lruvec(folio); spin_lock_irqsave(&lruvec->lru_lock, *flags); - lruvec_memcg_debug(lruvec, folio); + if (unlikely(lruvec_memcg(lruvec) != folio_memcg(folio))) { + spin_unlock_irqrestore(&lruvec->lru_lock, *flags); + goto retry; + } return lruvec; } diff --git a/mm/swap.c b/mm/swap.c index 009b32d6d344..bcd2b52e5def 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -240,6 +240,7 @@ void folio_rotate_reclaimable(struct folio *folio) void lru_note_cost_unlock_irq(struct lruvec *lruvec, bool file, unsigned int nr_io, unsigned int nr_rotated) __releases(lruvec->lru_lock) + __releases(rcu) { unsigned long cost; @@ -253,6 +254,7 @@ void lru_note_cost_unlock_irq(struct lruvec *lruvec, bool file, cost = nr_io * SWAP_CLUSTER_MAX + nr_rotated; if (!cost) { spin_unlock_irq(&lruvec->lru_lock); + rcu_read_unlock(); return; } @@ -285,8 +287,10 @@ void lru_note_cost_unlock_irq(struct lruvec *lruvec, bool file, spin_unlock_irq(&lruvec->lru_lock); lruvec = parent_lruvec(lruvec); - if (!lruvec) + if (!lruvec) { + rcu_read_unlock(); break; + } spin_lock_irq(&lruvec->lru_lock); } } -- cgit v1.2.3 From 07a6e9a2c199fed361f528781284d56771d0016f Mon Sep 17 00:00:00 2001 From: Qi Zheng Date: Thu, 5 Mar 2026 19:52:43 +0800 Subject: mm: vmscan: prepare for reparenting traditional LRU folios MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit To resolve the dying memcg issue, we need to reparent LRU folios of child memcg to its parent memcg. For traditional LRU list, each lruvec of every memcg comprises four LRU lists. Due to the symmetry of the LRU lists, it is feasible to transfer the LRU lists from a memcg to its parent memcg during the reparenting process. This commit implements the specific function, which will be used during the reparenting process. 
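A hypothetical caller sketch, given purely for illustration (the actual hook-up happens in a later patch of this series, and every name other than lru_reparent_memcg() and parent_mem_cgroup() is a placeholder): during memcg offlining the transfer is expected to run once per node, with the child and parent lruvec locks assumed to be held by the caller as described for memcg_reparent_objcgs() earlier in the series:

	static void example_reparent_lru(struct mem_cgroup *memcg)
	{
		struct mem_cgroup *parent = parent_mem_cgroup(memcg);
		int nid;

		/* Assumes the caller serializes against the lruvec locks. */
		for_each_node(nid)
			lru_reparent_memcg(memcg, parent, nid);
	}
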
Link: https://lore.kernel.org/a92d217a9fc82bd0c401210204a095caaf615b1c.1772711148.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng Reviewed-by: Harry Yoo Acked-by: Johannes Weiner Acked-by: Muchun Song Acked-by: Shakeel Butt Cc: Allen Pais Cc: Axel Rasmussen Cc: Baoquan He Cc: Chengming Zhou Cc: Chen Ridong Cc: David Hildenbrand Cc: Hamza Mahfooz Cc: Hugh Dickins Cc: Imran Khan Cc: Kamalesh Babulal Cc: Lance Yang Cc: Liam Howlett Cc: Lorenzo Stoakes (Oracle) Cc: Michal Hocko Cc: Michal Koutný Cc: Mike Rapoport Cc: Muchun Song Cc: Nhat Pham Cc: Roman Gushchin Cc: Suren Baghdasaryan Cc: Usama Arif Cc: Vlastimil Babka Cc: Wei Xu Cc: Yosry Ahmed Cc: Yuanchu Xie Cc: Zi Yan Signed-off-by: Andrew Morton --- include/linux/swap.h | 21 +++++++++++++++++++++ mm/swap.c | 33 +++++++++++++++++++++++++++++++++ mm/vmscan.c | 19 ------------------- 3 files changed, 54 insertions(+), 19 deletions(-) diff --git a/include/linux/swap.h b/include/linux/swap.h index ea08e2afa2b4..d653fe050b8f 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -546,6 +546,8 @@ static inline int mem_cgroup_swappiness(struct mem_cgroup *memcg) return READ_ONCE(memcg->swappiness); } + +void lru_reparent_memcg(struct mem_cgroup *memcg, struct mem_cgroup *parent, int nid); #else static inline int mem_cgroup_swappiness(struct mem_cgroup *mem) { @@ -610,5 +612,24 @@ static inline bool mem_cgroup_swap_full(struct folio *folio) } #endif +/* for_each_managed_zone_pgdat - helper macro to iterate over all managed zones in a pgdat up to + * and including the specified highidx + * @zone: The current zone in the iterator + * @pgdat: The pgdat which node_zones are being iterated + * @idx: The index variable + * @highidx: The index of the highest zone to return + * + * This macro iterates through all managed zones up to and including the specified highidx. + * The zone iterator enters an invalid state after macro call and must be reinitialized + * before it can be used again. 
+ */ +#define for_each_managed_zone_pgdat(zone, pgdat, idx, highidx) \ + for ((idx) = 0, (zone) = (pgdat)->node_zones; \ + (idx) <= (highidx); \ + (idx)++, (zone)++) \ + if (!managed_zone(zone)) \ + continue; \ + else + #endif /* __KERNEL__*/ #endif /* _LINUX_SWAP_H */ diff --git a/mm/swap.c b/mm/swap.c index bcd2b52e5def..5cc44f0de987 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -1090,6 +1090,39 @@ void folio_batch_remove_exceptionals(struct folio_batch *fbatch) fbatch->nr = j; } +#ifdef CONFIG_MEMCG +static void lruvec_reparent_lru(struct lruvec *child_lruvec, + struct lruvec *parent_lruvec, + enum lru_list lru, int nid) +{ + int zid; + struct zone *zone; + + if (lru != LRU_UNEVICTABLE) + list_splice_tail_init(&child_lruvec->lists[lru], &parent_lruvec->lists[lru]); + + for_each_managed_zone_pgdat(zone, NODE_DATA(nid), zid, MAX_NR_ZONES - 1) { + unsigned long size = mem_cgroup_get_zone_lru_size(child_lruvec, lru, zid); + + mem_cgroup_update_lru_size(parent_lruvec, lru, zid, size); + } +} + +void lru_reparent_memcg(struct mem_cgroup *memcg, struct mem_cgroup *parent, int nid) +{ + enum lru_list lru; + struct lruvec *child_lruvec, *parent_lruvec; + + child_lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(nid)); + parent_lruvec = mem_cgroup_lruvec(parent, NODE_DATA(nid)); + parent_lruvec->anon_cost += child_lruvec->anon_cost; + parent_lruvec->file_cost += child_lruvec->file_cost; + + for_each_lru(lru) + lruvec_reparent_lru(child_lruvec, parent_lruvec, lru, nid); +} +#endif + static const struct ctl_table swap_sysctl_table[] = { { .procname = "page-cluster", diff --git a/mm/vmscan.c b/mm/vmscan.c index d4b649abe645..d225e84b5263 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -269,25 +269,6 @@ static int sc_swappiness(struct scan_control *sc, struct mem_cgroup *memcg) } #endif -/* for_each_managed_zone_pgdat - helper macro to iterate over all managed zones in a pgdat up to - * and including the specified highidx - * @zone: The current zone in the iterator - * @pgdat: The pgdat which node_zones are being iterated - * @idx: The index variable - * @highidx: The index of the highest zone to return - * - * This macro iterates through all managed zones up to and including the specified highidx. - * The zone iterator enters an invalid state after macro call and must be reinitialized - * before it can be used again. - */ -#define for_each_managed_zone_pgdat(zone, pgdat, idx, highidx) \ - for ((idx) = 0, (zone) = (pgdat)->node_zones; \ - (idx) <= (highidx); \ - (idx)++, (zone)++) \ - if (!managed_zone(zone)) \ - continue; \ - else - static void set_task_reclaim_state(struct task_struct *task, struct reclaim_state *rs) { -- cgit v1.2.3 From f304652609eae3814b0e9d11c75c0e0cb62da31f Mon Sep 17 00:00:00 2001 From: Qi Zheng Date: Thu, 5 Mar 2026 19:52:44 +0800 Subject: mm: vmscan: prepare for reparenting MGLRU folios MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Similar to traditional LRU folios, in order to solve the dying memcg problem, we also need to reparenting MGLRU folios to the parent memcg when memcg offline. However, there are the following challenges: 1. Each lruvec has between MIN_NR_GENS and MAX_NR_GENS generations, the number of generations of the parent and child memcg may be different, so we cannot simply transfer MGLRU folios in the child memcg to the parent memcg as we did for traditional LRU folios. 2. The generation information is stored in folio->flags, but we cannot traverse these folios while holding the lru lock, otherwise it may cause softlockup. 
3. In walk_update_folio(), the gen of folio and corresponding lru size may be updated, but the folio is not immediately moved to the corresponding lru list. Therefore, there may be folios of different generations on an LRU list. 4. In lru_gen_del_folio(), the generation to which the folio belongs is found based on the generation information in folio->flags, and the corresponding LRU size will be updated. Therefore, we need to update the lru size correctly during reparenting, otherwise the lru size may be updated incorrectly in lru_gen_del_folio(). Finally, this patch chose a compromise method, which is to splice the lru list in the child memcg to the lru list of the same generation in the parent memcg during reparenting. And in order to ensure that the parent memcg has the same generation, we need to increase the generations in the parent memcg to the MAX_NR_GENS before reparenting. Of course, the same generation has different meanings in the parent and child memcg, this will cause confusion in the hot and cold information of folios. But other than that, this method is simple enough, the lru size is correct, and there is no need to consider some concurrency issues (such as lru_gen_del_folio()). To prepare for the above work, this commit implements the specific functions, which will be used during reparenting. [zhengqi.arch@bytedance.com: use list_splice_tail_init() to reparent child folios] Link: https://lore.kernel.org/20260324114937.28569-1-qi.zheng@linux.dev Link: https://lore.kernel.org/e75050354cdbc42221a04f7cf133292b61105548.1772711148.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng Suggested-by: Harry Yoo Suggested-by: Imran Khan Acked-by: Harry Yoo Cc: Allen Pais Cc: Axel Rasmussen Cc: Baoquan He Cc: Chengming Zhou Cc: Chen Ridong Cc: David Hildenbrand Cc: Hamza Mahfooz Cc: Hugh Dickins Cc: Johannes Weiner Cc: Kamalesh Babulal Cc: Lance Yang Cc: Liam Howlett Cc: Lorenzo Stoakes (Oracle) Cc: Michal Hocko Cc: Michal Koutný Cc: Mike Rapoport Cc: Muchun Song Cc: Muchun Song Cc: Nhat Pham Cc: Roman Gushchin Cc: Shakeel Butt Cc: Suren Baghdasaryan Cc: Usama Arif Cc: Vlastimil Babka Cc: Wei Xu Cc: Yosry Ahmed Cc: Yuanchu Xie Cc: Zi Yan Signed-off-by: Andrew Morton --- include/linux/mmzone.h | 17 ++++++ mm/vmscan.c | 142 +++++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 159 insertions(+) diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 4a20df132258..20f920dede65 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -692,6 +692,9 @@ void lru_gen_online_memcg(struct mem_cgroup *memcg); void lru_gen_offline_memcg(struct mem_cgroup *memcg); void lru_gen_release_memcg(struct mem_cgroup *memcg); void lru_gen_soft_reclaim(struct mem_cgroup *memcg, int nid); +void max_lru_gen_memcg(struct mem_cgroup *memcg, int nid); +bool recheck_lru_gen_max_memcg(struct mem_cgroup *memcg, int nid); +void lru_gen_reparent_memcg(struct mem_cgroup *memcg, struct mem_cgroup *parent, int nid); #else /* !CONFIG_LRU_GEN */ @@ -733,6 +736,20 @@ static inline void lru_gen_soft_reclaim(struct mem_cgroup *memcg, int nid) { } +static inline void max_lru_gen_memcg(struct mem_cgroup *memcg, int nid) +{ +} + +static inline bool recheck_lru_gen_max_memcg(struct mem_cgroup *memcg, int nid) +{ + return true; +} + +static inline +void lru_gen_reparent_memcg(struct mem_cgroup *memcg, struct mem_cgroup *parent, int nid) +{ +} + #endif /* CONFIG_LRU_GEN */ struct lruvec { diff --git a/mm/vmscan.c b/mm/vmscan.c index d225e84b5263..8472aa4bddd5 100644 --- a/mm/vmscan.c +++ 
b/mm/vmscan.c @@ -4426,6 +4426,148 @@ void lru_gen_soft_reclaim(struct mem_cgroup *memcg, int nid) lru_gen_rotate_memcg(lruvec, MEMCG_LRU_HEAD); } +bool recheck_lru_gen_max_memcg(struct mem_cgroup *memcg, int nid) +{ + struct lruvec *lruvec = get_lruvec(memcg, nid); + int type; + + for (type = 0; type < ANON_AND_FILE; type++) { + if (get_nr_gens(lruvec, type) != MAX_NR_GENS) + return false; + } + + return true; +} + +static void try_to_inc_max_seq_nowalk(struct mem_cgroup *memcg, + struct lruvec *lruvec) +{ + struct lru_gen_mm_list *mm_list = get_mm_list(memcg); + struct lru_gen_mm_state *mm_state = get_mm_state(lruvec); + int swappiness = mem_cgroup_swappiness(memcg); + DEFINE_MAX_SEQ(lruvec); + bool success = false; + + /* + * We are not iterating the mm_list here, updating mm_state->seq is just + * to make mm walkers work properly. + */ + if (mm_state) { + spin_lock(&mm_list->lock); + VM_WARN_ON_ONCE(mm_state->seq + 1 < max_seq); + if (max_seq > mm_state->seq) { + WRITE_ONCE(mm_state->seq, mm_state->seq + 1); + success = true; + } + spin_unlock(&mm_list->lock); + } else { + success = true; + } + + if (success) + inc_max_seq(lruvec, max_seq, swappiness); +} + +/* + * We need to ensure that the folios of child memcg can be reparented to the + * same gen of the parent memcg, so the gens of the parent memcg needed be + * incremented to the MAX_NR_GENS before reparenting. + */ +void max_lru_gen_memcg(struct mem_cgroup *memcg, int nid) +{ + struct lruvec *lruvec = get_lruvec(memcg, nid); + int type; + + for (type = 0; type < ANON_AND_FILE; type++) { + while (get_nr_gens(lruvec, type) < MAX_NR_GENS) { + try_to_inc_max_seq_nowalk(memcg, lruvec); + cond_resched(); + } + } +} + +/* + * Compared to traditional LRU, MGLRU faces the following challenges: + * + * 1. Each lruvec has between MIN_NR_GENS and MAX_NR_GENS generations, the + * number of generations of the parent and child memcg may be different, + * so we cannot simply transfer MGLRU folios in the child memcg to the + * parent memcg as we did for traditional LRU folios. + * 2. The generation information is stored in folio->flags, but we cannot + * traverse these folios while holding the lru lock, otherwise it may + * cause softlockup. + * 3. In walk_update_folio(), the gen of folio and corresponding lru size + * may be updated, but the folio is not immediately moved to the + * corresponding lru list. Therefore, there may be folios of different + * generations on an LRU list. + * 4. In lru_gen_del_folio(), the generation to which the folio belongs is + * found based on the generation information in folio->flags, and the + * corresponding LRU size will be updated. Therefore, we need to update + * the lru size correctly during reparenting, otherwise the lru size may + * be updated incorrectly in lru_gen_del_folio(). + * + * Finally, we choose a compromise method, which is to splice the lru list in + * the child memcg to the lru list of the same generation in the parent memcg + * during reparenting. + * + * The same generation has different meanings in the parent and child memcg, + * so this compromise method will cause the LRU inversion problem. But as the + * system runs, this problem will be fixed automatically. 
+ */ +static void __lru_gen_reparent_memcg(struct lruvec *child_lruvec, struct lruvec *parent_lruvec, + int zone, int type) +{ + struct lru_gen_folio *child_lrugen, *parent_lrugen; + enum lru_list lru = type * LRU_INACTIVE_FILE; + int i; + + child_lrugen = &child_lruvec->lrugen; + parent_lrugen = &parent_lruvec->lrugen; + + for (i = 0; i < get_nr_gens(child_lruvec, type); i++) { + int gen = lru_gen_from_seq(child_lrugen->max_seq - i); + long nr_pages = child_lrugen->nr_pages[gen][type][zone]; + int child_lru_active = lru_gen_is_active(child_lruvec, gen) ? LRU_ACTIVE : 0; + int parent_lru_active = lru_gen_is_active(parent_lruvec, gen) ? LRU_ACTIVE : 0; + + /* Assuming that child pages are colder than parent pages */ + list_splice_tail_init(&child_lrugen->folios[gen][type][zone], + &parent_lrugen->folios[gen][type][zone]); + + WRITE_ONCE(child_lrugen->nr_pages[gen][type][zone], 0); + WRITE_ONCE(parent_lrugen->nr_pages[gen][type][zone], + parent_lrugen->nr_pages[gen][type][zone] + nr_pages); + + if (lru_gen_is_active(child_lruvec, gen) != lru_gen_is_active(parent_lruvec, gen)) { + __update_lru_size(child_lruvec, lru + child_lru_active, zone, -nr_pages); + __update_lru_size(parent_lruvec, lru + parent_lru_active, zone, nr_pages); + } + } +} + +void lru_gen_reparent_memcg(struct mem_cgroup *memcg, struct mem_cgroup *parent, int nid) +{ + struct lruvec *child_lruvec, *parent_lruvec; + int type, zid; + struct zone *zone; + enum lru_list lru; + + child_lruvec = get_lruvec(memcg, nid); + parent_lruvec = get_lruvec(parent, nid); + + for_each_managed_zone_pgdat(zone, NODE_DATA(nid), zid, MAX_NR_ZONES - 1) + for (type = 0; type < ANON_AND_FILE; type++) + __lru_gen_reparent_memcg(child_lruvec, parent_lruvec, zid, type); + + for_each_lru(lru) { + for_each_managed_zone_pgdat(zone, NODE_DATA(nid), zid, MAX_NR_ZONES - 1) { + unsigned long size = mem_cgroup_get_zone_lru_size(child_lruvec, lru, zid); + + mem_cgroup_update_lru_size(parent_lruvec, lru, zid, size); + } + } +} + #endif /* CONFIG_MEMCG */ /****************************************************************************** -- cgit v1.2.3 From 131adcc774bb138b55ab2d09201dd333832db87b Mon Sep 17 00:00:00 2001 From: Qi Zheng Date: Thu, 5 Mar 2026 19:52:45 +0800 Subject: mm: memcontrol: refactor memcg_reparent_objcgs() MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Refactor the memcg_reparent_objcgs() to facilitate subsequent reparenting LRU folios here. 
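To illustrate the intent (this is an assumption about the follow-up patch, not code introduced here): reparent_locks()/reparent_unlocks() give the later LRU-reparenting patch a single place to also take the child and parent lruvec locks, in the order shown in the folio_lruvec_lock() description earlier in this series. A hedged sketch of the resulting sequence:

	/* Sketch only; per-node lruvec locking is an assumption about a later patch. */
	reparent_locks(memcg, parent);		/* objcg_lock, then per node:	*/
						/*   child  lruvec->lru_lock	*/
						/*   parent lruvec->lru_lock	*/
	objcg = __memcg_reparent_objcgs(memcg, parent);
	/* ... move LRU folios from the child lruvecs to the parent lruvecs ... */
	reparent_unlocks(memcg, parent);	/* release in reverse order	*/
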
Link: https://lore.kernel.org/2e5696db1993e593a51004c1dacedbc261689629.1772711148.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng Acked-by: Johannes Weiner Acked-by: Shakeel Butt Reviewed-by: Harry Yoo Reviewed-by: Muchun Song Cc: Allen Pais Cc: Axel Rasmussen Cc: Baoquan He Cc: Chengming Zhou Cc: Chen Ridong Cc: David Hildenbrand Cc: Hamza Mahfooz Cc: Hugh Dickins Cc: Imran Khan Cc: Kamalesh Babulal Cc: Lance Yang Cc: Liam Howlett Cc: Lorenzo Stoakes (Oracle) Cc: Michal Hocko Cc: Michal Koutný Cc: Mike Rapoport Cc: Muchun Song Cc: Nhat Pham Cc: Roman Gushchin Cc: Suren Baghdasaryan Cc: Usama Arif Cc: Vlastimil Babka Cc: Wei Xu Cc: Yosry Ahmed Cc: Yuanchu Xie Cc: Zi Yan Signed-off-by: Andrew Morton --- mm/memcontrol.c | 29 ++++++++++++++++++++++++----- 1 file changed, 24 insertions(+), 5 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 0d4eaaea2b54..e43ca8da8daf 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -208,15 +208,12 @@ static struct obj_cgroup *obj_cgroup_alloc(void) return objcg; } -static void memcg_reparent_objcgs(struct mem_cgroup *memcg) +static inline struct obj_cgroup *__memcg_reparent_objcgs(struct mem_cgroup *memcg, + struct mem_cgroup *parent) { struct obj_cgroup *objcg, *iter; - struct mem_cgroup *parent = parent_mem_cgroup(memcg); objcg = rcu_replace_pointer(memcg->objcg, NULL, true); - - spin_lock_irq(&objcg_lock); - /* 1) Ready to reparent active objcg. */ list_add(&objcg->list, &memcg->objcg_list); /* 2) Reparent active objcg and already reparented objcgs to parent. */ @@ -225,7 +222,29 @@ static void memcg_reparent_objcgs(struct mem_cgroup *memcg) /* 3) Move already reparented objcgs to the parent's list */ list_splice(&memcg->objcg_list, &parent->objcg_list); + return objcg; +} + +static inline void reparent_locks(struct mem_cgroup *memcg, struct mem_cgroup *parent) +{ + spin_lock_irq(&objcg_lock); +} + +static inline void reparent_unlocks(struct mem_cgroup *memcg, struct mem_cgroup *parent) +{ spin_unlock_irq(&objcg_lock); +} + +static void memcg_reparent_objcgs(struct mem_cgroup *memcg) +{ + struct obj_cgroup *objcg; + struct mem_cgroup *parent = parent_mem_cgroup(memcg); + + reparent_locks(memcg, parent); + + objcg = __memcg_reparent_objcgs(memcg, parent); + + reparent_unlocks(memcg, parent); percpu_ref_kill(&objcg->refcnt); } -- cgit v1.2.3 From 7404bd37cfbeb2aa06249418c1788ca94bae2875 Mon Sep 17 00:00:00 2001 From: Qi Zheng Date: Thu, 5 Mar 2026 19:52:46 +0800 Subject: mm: workingset: use lruvec_lru_size() to get the number of lru pages MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit For cgroup v2, count_shadow_nodes() is the only place to read non-hierarchical stats (lruvec_stats->state_local). To avoid the need to consider cgroup v2 during subsequent non-hierarchical stats reparenting, use lruvec_lru_size() instead of lruvec_page_state_local() to get the number of lru pages. For NR_SLAB_RECLAIMABLE_B and NR_SLAB_UNRECLAIMABLE_B cases, it appears that the statistics here have already been problematic for a while since slab pages have been reparented. So just ignore it for now. 
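For reference, lruvec_lru_size() simply sums the per-zone LRU sizes maintained by mem_cgroup_update_lru_size(), i.e. it counts folios actually on the LRU lists rather than reading the non-hierarchical local counters, which is why it stays meaningful once folios can be reparented. Roughly (a simplified sketch; the real function in mm/vmscan.c also skips unmanaged zones and handles the mem_cgroup_disabled() case):

	unsigned long lruvec_lru_size(struct lruvec *lruvec, enum lru_list lru, int zone_idx)
	{
		unsigned long size = 0;
		int zid;

		/* sum the per-zone LRU sizes kept up to date by mem_cgroup_update_lru_size() */
		for (zid = 0; zid <= zone_idx; zid++)
			size += mem_cgroup_get_zone_lru_size(lruvec, lru, zid);

		return size;
	}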
Link: https://lore.kernel.org/b1d448c667a8fb377c3390d9aba43bdb7e4d5739.1772711148.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng Acked-by: Shakeel Butt Acked-by: Muchun Song Cc: Allen Pais Cc: Axel Rasmussen Cc: Baoquan He Cc: Chengming Zhou Cc: Chen Ridong Cc: David Hildenbrand Cc: Hamza Mahfooz Cc: Harry Yoo Cc: Hugh Dickins Cc: Imran Khan Cc: Johannes Weiner Cc: Kamalesh Babulal Cc: Lance Yang Cc: Liam Howlett Cc: Lorenzo Stoakes (Oracle) Cc: Michal Hocko Cc: Michal Koutný Cc: Mike Rapoport Cc: Muchun Song Cc: Nhat Pham Cc: Roman Gushchin Cc: Suren Baghdasaryan Cc: Usama Arif Cc: Vlastimil Babka Cc: Wei Xu Cc: Yosry Ahmed Cc: Yuanchu Xie Cc: Zi Yan Signed-off-by: Andrew Morton --- include/linux/swap.h | 1 + mm/vmscan.c | 3 +-- mm/workingset.c | 5 +++-- 3 files changed, 5 insertions(+), 4 deletions(-) diff --git a/include/linux/swap.h b/include/linux/swap.h index d653fe050b8f..7a09df6977a5 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -352,6 +352,7 @@ extern void swap_setup(void); extern unsigned long zone_reclaimable_pages(struct zone *zone); extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order, gfp_t gfp_mask, nodemask_t *mask); +unsigned long lruvec_lru_size(struct lruvec *lruvec, enum lru_list lru, int zone_idx); #define MEMCG_RECLAIM_MAY_SWAP (1 << 1) #define MEMCG_RECLAIM_PROACTIVE (1 << 2) diff --git a/mm/vmscan.c b/mm/vmscan.c index 8472aa4bddd5..1ac4f959ec1c 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -390,8 +390,7 @@ unsigned long zone_reclaimable_pages(struct zone *zone) * @lru: lru to use * @zone_idx: zones to consider (use MAX_NR_ZONES - 1 for the whole LRU list) */ -static unsigned long lruvec_lru_size(struct lruvec *lruvec, enum lru_list lru, - int zone_idx) +unsigned long lruvec_lru_size(struct lruvec *lruvec, enum lru_list lru, int zone_idx) { unsigned long size = 0; int zid; diff --git a/mm/workingset.c b/mm/workingset.c index 95d722a452e1..07e6836d0502 100644 --- a/mm/workingset.c +++ b/mm/workingset.c @@ -691,9 +691,10 @@ static unsigned long count_shadow_nodes(struct shrinker *shrinker, mem_cgroup_flush_stats_ratelimited(sc->memcg); lruvec = mem_cgroup_lruvec(sc->memcg, NODE_DATA(sc->nid)); + for (pages = 0, i = 0; i < NR_LRU_LISTS; i++) - pages += lruvec_page_state_local(lruvec, - NR_LRU_BASE + i); + pages += lruvec_lru_size(lruvec, i, MAX_NR_ZONES - 1); + pages += lruvec_page_state_local( lruvec, NR_SLAB_RECLAIMABLE_B) >> PAGE_SHIFT; pages += lruvec_page_state_local( -- cgit v1.2.3 From 5371e350fda70bbdbee364215ca37b7fea25047b Mon Sep 17 00:00:00 2001 From: Qi Zheng Date: Thu, 5 Mar 2026 19:52:47 +0800 Subject: mm: memcontrol: refactor mod_memcg_state() and mod_memcg_lruvec_state() MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Refactor the memcg_reparent_objcgs() to facilitate subsequent reparenting non-hierarchical stats. 
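Concretely, this splits the mem_cgroup_disabled() bail-out in mod_memcg_state() away from the counter update in a new __mod_memcg_state() (and likewise for the lruvec variant), so that later reparenting code can drive the inner helpers directly. As the "prepare for reparenting non-hierarchical stats" patch further down shows, the eventual consumer boils down to:

	void reparent_memcg_state_local(struct mem_cgroup *memcg,
					struct mem_cgroup *parent, int idx)
	{
		unsigned long value = memcg_page_state_local(memcg, idx);

		/* move the accumulated local count from the dying child to its parent */
		__mod_memcg_state(memcg, idx, -value);
		__mod_memcg_state(parent, idx, value);
	}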
Link: https://lore.kernel.org/7f8bd3aacec2270b9453428fc8585cca9f10751e.1772711148.git.zhengqi.arch@bytedance.com Co-developed-by: Yosry Ahmed Signed-off-by: Yosry Ahmed Signed-off-by: Qi Zheng Acked-by: Shakeel Butt Cc: Allen Pais Cc: Axel Rasmussen Cc: Baoquan He Cc: Chengming Zhou Cc: Chen Ridong Cc: David Hildenbrand Cc: Hamza Mahfooz Cc: Harry Yoo Cc: Hugh Dickins Cc: Imran Khan Cc: Johannes Weiner Cc: Kamalesh Babulal Cc: Lance Yang Cc: Liam Howlett Cc: Lorenzo Stoakes (Oracle) Cc: Michal Hocko Cc: Michal Koutný Cc: Mike Rapoport Cc: Muchun Song Cc: Muchun Song Cc: Nhat Pham Cc: Roman Gushchin Cc: Suren Baghdasaryan Cc: Usama Arif Cc: Vlastimil Babka Cc: Wei Xu Cc: Yuanchu Xie Cc: Zi Yan Signed-off-by: Andrew Morton --- mm/memcontrol.c | 50 +++++++++++++++++++++++++++++++------------------- 1 file changed, 31 insertions(+), 19 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index e43ca8da8daf..271d4c6307b6 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -717,21 +717,12 @@ static int memcg_state_val_in_pages(int idx, int val) return max(val * unit / PAGE_SIZE, 1UL); } -/** - * mod_memcg_state - update cgroup memory statistics - * @memcg: the memory cgroup - * @idx: the stat item - can be enum memcg_stat_item or enum node_stat_item - * @val: delta to add to the counter, can be negative - */ -void mod_memcg_state(struct mem_cgroup *memcg, enum memcg_stat_item idx, - int val) +static void __mod_memcg_state(struct mem_cgroup *memcg, + enum memcg_stat_item idx, int val) { int i = memcg_stats_index(idx); int cpu; - if (mem_cgroup_disabled()) - return; - if (WARN_ONCE(BAD_STAT_IDX(i), "%s: missing stat item %d\n", __func__, idx)) return; @@ -745,6 +736,21 @@ void mod_memcg_state(struct mem_cgroup *memcg, enum memcg_stat_item idx, put_cpu(); } +/** + * mod_memcg_state - update cgroup memory statistics + * @memcg: the memory cgroup + * @idx: the stat item - can be enum memcg_stat_item or enum node_stat_item + * @val: delta to add to the counter, can be negative + */ +void mod_memcg_state(struct mem_cgroup *memcg, enum memcg_stat_item idx, + int val) +{ + if (mem_cgroup_disabled()) + return; + + __mod_memcg_state(memcg, idx, val); +} + #ifdef CONFIG_MEMCG_V1 /* idx can be of type enum memcg_stat_item or node_stat_item. 
*/ unsigned long memcg_page_state_local(struct mem_cgroup *memcg, int idx) @@ -764,21 +770,16 @@ unsigned long memcg_page_state_local(struct mem_cgroup *memcg, int idx) } #endif -static void mod_memcg_lruvec_state(struct lruvec *lruvec, - enum node_stat_item idx, - int val) +static void __mod_memcg_lruvec_state(struct mem_cgroup_per_node *pn, + enum node_stat_item idx, int val) { - struct mem_cgroup_per_node *pn; - struct mem_cgroup *memcg; + struct mem_cgroup *memcg = pn->memcg; int i = memcg_stats_index(idx); int cpu; if (WARN_ONCE(BAD_STAT_IDX(i), "%s: missing stat item %d\n", __func__, idx)) return; - pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec); - memcg = pn->memcg; - cpu = get_cpu(); /* Update memcg */ @@ -794,6 +795,17 @@ static void mod_memcg_lruvec_state(struct lruvec *lruvec, put_cpu(); } +static void mod_memcg_lruvec_state(struct lruvec *lruvec, + enum node_stat_item idx, + int val) +{ + struct mem_cgroup_per_node *pn; + + pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec); + + __mod_memcg_lruvec_state(pn, idx, val); +} + /** * mod_lruvec_state - update lruvec memory statistics * @lruvec: the lruvec -- cgit v1.2.3 From 8285917d6f383aef274fb442eb0e6f948d76abe3 Mon Sep 17 00:00:00 2001 From: Qi Zheng Date: Thu, 5 Mar 2026 19:52:48 +0800 Subject: mm: memcontrol: prepare for reparenting non-hierarchical stats MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit To resolve the dying memcg issue, we need to reparent LRU folios of child memcg to its parent memcg. This could cause problems for non-hierarchical stats. As Yosry Ahmed pointed out: In short, if memory is charged to a dying cgroup at the time of reparenting, when the memory gets uncharged the stats updates will occur at the parent. This will update both hierarchical and non-hierarchical stats of the parent, which would corrupt the parent's non-hierarchical stats (because those counters were never incremented when the memory was charged). Now we have the following two types of non-hierarchical stats, and they are only used in CONFIG_MEMCG_V1: a. memcg->vmstats->state_local[i] b. pn->lruvec_stats->state_local[i] To ensure that these non-hierarchical stats work properly, we need to reparent these non-hierarchical stats after reparenting LRU folios. To this end, this commit makes the following preparations: 1. implement reparent_state_local() to reparent non-hierarchical stats 2. 
make css_killed_work_fn() to be called in rcu work, and implement get_non_dying_memcg_start() and get_non_dying_memcg_end() to avoid race between mod_memcg_state()/mod_memcg_lruvec_state() and reparent_state_local() Link: https://lore.kernel.org/e862995c45a7101a541284b6ebee5e5c32c89066.1772711148.git.zhengqi.arch@bytedance.com Co-developed-by: Yosry Ahmed Signed-off-by: Yosry Ahmed Signed-off-by: Qi Zheng Acked-by: Shakeel Butt Cc: Allen Pais Cc: Axel Rasmussen Cc: Baoquan He Cc: Chengming Zhou Cc: Chen Ridong Cc: David Hildenbrand Cc: Hamza Mahfooz Cc: Harry Yoo Cc: Hugh Dickins Cc: Imran Khan Cc: Johannes Weiner Cc: Kamalesh Babulal Cc: Lance Yang Cc: Liam Howlett Cc: Lorenzo Stoakes (Oracle) Cc: Michal Hocko Cc: Michal Koutný Cc: Mike Rapoport Cc: Muchun Song Cc: Muchun Song Cc: Nhat Pham Cc: Roman Gushchin Cc: Suren Baghdasaryan Cc: Usama Arif Cc: Vlastimil Babka Cc: Wei Xu Cc: Yuanchu Xie Cc: Zi Yan Signed-off-by: Andrew Morton --- kernel/cgroup/cgroup.c | 9 ++--- mm/memcontrol-v1.c | 16 +++++++++ mm/memcontrol-v1.h | 7 ++++ mm/memcontrol.c | 97 ++++++++++++++++++++++++++++++++++++++++++++++++++ 4 files changed, 125 insertions(+), 4 deletions(-) diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c index 01fc2a93f3ef..babf7b456048 100644 --- a/kernel/cgroup/cgroup.c +++ b/kernel/cgroup/cgroup.c @@ -6050,8 +6050,9 @@ out_unlock: */ static void css_killed_work_fn(struct work_struct *work) { - struct cgroup_subsys_state *css = - container_of(work, struct cgroup_subsys_state, destroy_work); + struct cgroup_subsys_state *css; + + css = container_of(to_rcu_work(work), struct cgroup_subsys_state, destroy_rwork); cgroup_lock(); @@ -6072,8 +6073,8 @@ static void css_killed_ref_fn(struct percpu_ref *ref) container_of(ref, struct cgroup_subsys_state, refcnt); if (atomic_dec_and_test(&css->online_cnt)) { - INIT_WORK(&css->destroy_work, css_killed_work_fn); - queue_work(cgroup_offline_wq, &css->destroy_work); + INIT_RCU_WORK(&css->destroy_rwork, css_killed_work_fn); + queue_rcu_work(cgroup_offline_wq, &css->destroy_rwork); } } diff --git a/mm/memcontrol-v1.c b/mm/memcontrol-v1.c index 437cd25784fe..8380adfa0f68 100644 --- a/mm/memcontrol-v1.c +++ b/mm/memcontrol-v1.c @@ -1884,6 +1884,22 @@ static const unsigned int memcg1_events[] = { PGMAJFAULT, }; +void reparent_memcg1_state_local(struct mem_cgroup *memcg, struct mem_cgroup *parent) +{ + int i; + + for (i = 0; i < ARRAY_SIZE(memcg1_stats); i++) + reparent_memcg_state_local(memcg, parent, memcg1_stats[i]); +} + +void reparent_memcg1_lruvec_state_local(struct mem_cgroup *memcg, struct mem_cgroup *parent) +{ + int i; + + for (i = 0; i < NR_LRU_LISTS; i++) + reparent_memcg_lruvec_state_local(memcg, parent, i); +} + void memcg1_stat_format(struct mem_cgroup *memcg, struct seq_buf *s) { unsigned long memory, memsw; diff --git a/mm/memcontrol-v1.h b/mm/memcontrol-v1.h index 1b969294ea6a..f92f81108d5e 100644 --- a/mm/memcontrol-v1.h +++ b/mm/memcontrol-v1.h @@ -73,6 +73,13 @@ void memcg1_uncharge_batch(struct mem_cgroup *memcg, unsigned long pgpgout, unsigned long nr_memory, int nid); void memcg1_stat_format(struct mem_cgroup *memcg, struct seq_buf *s); +void reparent_memcg1_state_local(struct mem_cgroup *memcg, struct mem_cgroup *parent); +void reparent_memcg1_lruvec_state_local(struct mem_cgroup *memcg, struct mem_cgroup *parent); + +void reparent_memcg_state_local(struct mem_cgroup *memcg, + struct mem_cgroup *parent, int idx); +void reparent_memcg_lruvec_state_local(struct mem_cgroup *memcg, + struct mem_cgroup *parent, int idx); void 
memcg1_account_kmem(struct mem_cgroup *memcg, int nr_pages); static inline bool memcg1_tcpmem_active(struct mem_cgroup *memcg) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 271d4c6307b6..c9e5ea0d9fc6 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -225,6 +225,34 @@ static inline struct obj_cgroup *__memcg_reparent_objcgs(struct mem_cgroup *memc return objcg; } +#ifdef CONFIG_MEMCG_V1 +static void __mem_cgroup_flush_stats(struct mem_cgroup *memcg, bool force); + +static inline void reparent_state_local(struct mem_cgroup *memcg, struct mem_cgroup *parent) +{ + if (cgroup_subsys_on_dfl(memory_cgrp_subsys)) + return; + + /* + * Reparent stats exposed non-hierarchically. Flush @memcg's stats first + * to read its stats accurately , and conservatively flush @parent's + * stats after reparenting to avoid hiding a potentially large stat + * update (e.g. from callers of mem_cgroup_flush_stats_ratelimited()). + */ + __mem_cgroup_flush_stats(memcg, true); + + /* The following counts are all non-hierarchical and need to be reparented. */ + reparent_memcg1_state_local(memcg, parent); + reparent_memcg1_lruvec_state_local(memcg, parent); + + __mem_cgroup_flush_stats(parent, true); +} +#else +static inline void reparent_state_local(struct mem_cgroup *memcg, struct mem_cgroup *parent) +{ +} +#endif + static inline void reparent_locks(struct mem_cgroup *memcg, struct mem_cgroup *parent) { spin_lock_irq(&objcg_lock); @@ -472,6 +500,30 @@ unsigned long lruvec_page_state_local(struct lruvec *lruvec, return x; } +#ifdef CONFIG_MEMCG_V1 +static void __mod_memcg_lruvec_state(struct mem_cgroup_per_node *pn, + enum node_stat_item idx, int val); + +void reparent_memcg_lruvec_state_local(struct mem_cgroup *memcg, + struct mem_cgroup *parent, int idx) +{ + int nid; + + for_each_node(nid) { + struct lruvec *child_lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(nid)); + struct lruvec *parent_lruvec = mem_cgroup_lruvec(parent, NODE_DATA(nid)); + unsigned long value = lruvec_page_state_local(child_lruvec, idx); + struct mem_cgroup_per_node *child_pn, *parent_pn; + + child_pn = container_of(child_lruvec, struct mem_cgroup_per_node, lruvec); + parent_pn = container_of(parent_lruvec, struct mem_cgroup_per_node, lruvec); + + __mod_memcg_lruvec_state(child_pn, idx, -value); + __mod_memcg_lruvec_state(parent_pn, idx, value); + } +} +#endif + /* Subset of vm_event_item to report for memcg event stats */ static const unsigned int memcg_vm_event_stat[] = { #ifdef CONFIG_MEMCG_V1 @@ -717,6 +769,42 @@ static int memcg_state_val_in_pages(int idx, int val) return max(val * unit / PAGE_SIZE, 1UL); } +#ifdef CONFIG_MEMCG_V1 +/* + * Used in mod_memcg_state() and mod_memcg_lruvec_state() to avoid race with + * reparenting of non-hierarchical state_locals. 
+ */ +static inline struct mem_cgroup *get_non_dying_memcg_start(struct mem_cgroup *memcg) +{ + if (cgroup_subsys_on_dfl(memory_cgrp_subsys)) + return memcg; + + rcu_read_lock(); + + while (memcg_is_dying(memcg)) + memcg = parent_mem_cgroup(memcg); + + return memcg; +} + +static inline void get_non_dying_memcg_end(void) +{ + if (cgroup_subsys_on_dfl(memory_cgrp_subsys)) + return; + + rcu_read_unlock(); +} +#else +static inline struct mem_cgroup *get_non_dying_memcg_start(struct mem_cgroup *memcg) +{ + return memcg; +} + +static inline void get_non_dying_memcg_end(void) +{ +} +#endif + static void __mod_memcg_state(struct mem_cgroup *memcg, enum memcg_stat_item idx, int val) { @@ -768,6 +856,15 @@ unsigned long memcg_page_state_local(struct mem_cgroup *memcg, int idx) #endif return x; } + +void reparent_memcg_state_local(struct mem_cgroup *memcg, + struct mem_cgroup *parent, int idx) +{ + unsigned long value = memcg_page_state_local(memcg, idx); + + __mod_memcg_state(memcg, idx, -value); + __mod_memcg_state(parent, idx, value); +} #endif static void __mod_memcg_lruvec_state(struct mem_cgroup_per_node *pn, -- cgit v1.2.3 From 01b9da291c4969354807b52956f4aae1f41b4924 Mon Sep 17 00:00:00 2001 From: Qi Zheng Date: Thu, 5 Mar 2026 19:52:49 +0800 Subject: mm: memcontrol: convert objcg to be per-memcg per-node type MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Convert objcg to be per-memcg per-node type, so that when reparent LRU folios later, we can hold the lru lock at the node level, thus avoiding holding too many lru locks at once. [zhengqi.arch@bytedance.com: reset pn->orig_objcg to NULL] Link: https://lore.kernel.org/20260309112939.31937-1-qi.zheng@linux.dev [akpm@linux-foundation.org: fix comment typo, per Usama. Reflow comment to 80 cols] [devnexen@gmail.com: fix obj_cgroup leak in mem_cgroup_css_online() error path] Link: https://lore.kernel.org/20260322193631.45457-1-devnexen@gmail.com [devnexen@gmail.com: add newline, per Qi Zheng] Link: https://lore.kernel.org/20260323063007.7783-1-devnexen@gmail.com Link: https://lore.kernel.org/56c04b1c5d54f75ccdc12896df6c1ca35403ecc3.1772711148.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng Signed-off-by: David Carlier Acked-by: Shakeel Butt Cc: Allen Pais Cc: Axel Rasmussen Cc: Baoquan He Cc: Chengming Zhou Cc: Chen Ridong Cc: David Hildenbrand Cc: Hamza Mahfooz Cc: Harry Yoo Cc: Hugh Dickins Cc: Imran Khan Cc: Johannes Weiner Cc: Kamalesh Babulal Cc: Lance Yang Cc: Liam Howlett Cc: Lorenzo Stoakes (Oracle) Cc: Michal Hocko Cc: Michal Koutný Cc: Mike Rapoport Cc: Muchun Song Cc: Muchun Song Cc: Nhat Pham Cc: Roman Gushchin Cc: Suren Baghdasaryan Cc: Usama Arif Cc: Vlastimil Babka Cc: Wei Xu Cc: Yosry Ahmed Cc: Yuanchu Xie Cc: Zi Yan Cc: Usama Arif Signed-off-by: Andrew Morton --- include/linux/memcontrol.h | 23 ++++++------ include/linux/sched.h | 2 +- mm/memcontrol.c | 92 +++++++++++++++++++++++++++++++--------------- 3 files changed, 75 insertions(+), 42 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 12982875073e..3e836b56bfcb 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -115,6 +115,16 @@ struct mem_cgroup_per_node { unsigned long lru_zone_size[MAX_NR_ZONES][NR_LRU_LISTS]; struct mem_cgroup_reclaim_iter iter; + /* + * objcg is wiped out as a part of the objcg repaprenting process. + * orig_objcg preserves a pointer (and a reference) to the original + * objcg until the end of live of memcg. 
+ */ + struct obj_cgroup __rcu *objcg; + struct obj_cgroup *orig_objcg; + /* list of inherited objcgs, protected by objcg_lock */ + struct list_head objcg_list; + #ifdef CONFIG_MEMCG_NMI_SAFETY_REQUIRES_ATOMIC /* slab stats for nmi context */ atomic_t slab_reclaimable; @@ -179,6 +189,7 @@ struct obj_cgroup { struct list_head list; /* protected by objcg_lock */ struct rcu_head rcu; }; + bool is_root; }; /* @@ -257,15 +268,6 @@ struct mem_cgroup { seqlock_t socket_pressure_seqlock; #endif int kmemcg_id; - /* - * memcg->objcg is wiped out as a part of the objcg repaprenting - * process. memcg->orig_objcg preserves a pointer (and a reference) - * to the original objcg until the end of live of memcg. - */ - struct obj_cgroup __rcu *objcg; - struct obj_cgroup *orig_objcg; - /* list of inherited objcgs, protected by objcg_lock */ - struct list_head objcg_list; struct memcg_vmstats_percpu __percpu *vmstats_percpu; @@ -332,7 +334,6 @@ struct mem_cgroup { #define MEMCG_CHARGE_BATCH 64U extern struct mem_cgroup *root_mem_cgroup; -extern struct obj_cgroup *root_obj_cgroup; enum page_memcg_data_flags { /* page->memcg_data is a pointer to an slabobj_ext vector */ @@ -551,7 +552,7 @@ static inline bool mem_cgroup_is_root(struct mem_cgroup *memcg) static inline bool obj_cgroup_is_root(const struct obj_cgroup *objcg) { - return objcg == root_obj_cgroup; + return objcg->is_root; } static inline bool mem_cgroup_disabled(void) diff --git a/include/linux/sched.h b/include/linux/sched.h index 5a5d3dbc9cdf..0d27775546f8 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1533,7 +1533,7 @@ struct task_struct { /* Used by memcontrol for targeted memcg charge: */ struct mem_cgroup *active_memcg; - /* Cache for current->cgroups->memcg->objcg lookups: */ + /* Cache for current->cgroups->memcg->nodeinfo[nid]->objcg lookups: */ struct obj_cgroup *objcg; #endif diff --git a/mm/memcontrol.c b/mm/memcontrol.c index c9e5ea0d9fc6..1aaa66f729b3 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -83,8 +83,6 @@ EXPORT_SYMBOL(memory_cgrp_subsys); struct mem_cgroup *root_mem_cgroup __read_mostly; EXPORT_SYMBOL(root_mem_cgroup); -struct obj_cgroup *root_obj_cgroup __read_mostly; - /* Active memory cgroup to use from an interrupt context */ DEFINE_PER_CPU(struct mem_cgroup *, int_active_memcg); EXPORT_PER_CPU_SYMBOL_GPL(int_active_memcg); @@ -209,18 +207,21 @@ static struct obj_cgroup *obj_cgroup_alloc(void) } static inline struct obj_cgroup *__memcg_reparent_objcgs(struct mem_cgroup *memcg, - struct mem_cgroup *parent) + struct mem_cgroup *parent, + int nid) { struct obj_cgroup *objcg, *iter; + struct mem_cgroup_per_node *pn = memcg->nodeinfo[nid]; + struct mem_cgroup_per_node *parent_pn = parent->nodeinfo[nid]; - objcg = rcu_replace_pointer(memcg->objcg, NULL, true); + objcg = rcu_replace_pointer(pn->objcg, NULL, true); /* 1) Ready to reparent active objcg. */ - list_add(&objcg->list, &memcg->objcg_list); + list_add(&objcg->list, &pn->objcg_list); /* 2) Reparent active objcg and already reparented objcgs to parent. 
*/ - list_for_each_entry(iter, &memcg->objcg_list, list) + list_for_each_entry(iter, &pn->objcg_list, list) WRITE_ONCE(iter->memcg, parent); /* 3) Move already reparented objcgs to the parent's list */ - list_splice(&memcg->objcg_list, &parent->objcg_list); + list_splice(&pn->objcg_list, &parent_pn->objcg_list); return objcg; } @@ -267,14 +268,17 @@ static void memcg_reparent_objcgs(struct mem_cgroup *memcg) { struct obj_cgroup *objcg; struct mem_cgroup *parent = parent_mem_cgroup(memcg); + int nid; - reparent_locks(memcg, parent); + for_each_node(nid) { + reparent_locks(memcg, parent); - objcg = __memcg_reparent_objcgs(memcg, parent); + objcg = __memcg_reparent_objcgs(memcg, parent, nid); - reparent_unlocks(memcg, parent); + reparent_unlocks(memcg, parent); - percpu_ref_kill(&objcg->refcnt); + percpu_ref_kill(&objcg->refcnt); + } } /* @@ -2830,8 +2834,10 @@ struct mem_cgroup *mem_cgroup_from_virt(void *p) static struct obj_cgroup *__get_obj_cgroup_from_memcg(struct mem_cgroup *memcg) { + int nid = numa_node_id(); + for (; memcg; memcg = parent_mem_cgroup(memcg)) { - struct obj_cgroup *objcg = rcu_dereference(memcg->objcg); + struct obj_cgroup *objcg = rcu_dereference(memcg->nodeinfo[nid]->objcg); if (likely(objcg && obj_cgroup_tryget(objcg))) return objcg; @@ -2895,6 +2901,7 @@ __always_inline struct obj_cgroup *current_obj_cgroup(void) { struct mem_cgroup *memcg; struct obj_cgroup *objcg; + int nid = numa_node_id(); if (IS_ENABLED(CONFIG_MEMCG_NMI_UNSAFE) && in_nmi()) return NULL; @@ -2911,14 +2918,14 @@ __always_inline struct obj_cgroup *current_obj_cgroup(void) * Objcg reference is kept by the task, so it's safe * to use the objcg by the current task. */ - return objcg ? : root_obj_cgroup; + return objcg ? : rcu_dereference_check(root_mem_cgroup->nodeinfo[nid]->objcg, 1); } memcg = this_cpu_read(int_active_memcg); if (unlikely(memcg)) goto from_memcg; - return root_obj_cgroup; + return rcu_dereference_check(root_mem_cgroup->nodeinfo[nid]->objcg, 1); from_memcg: for (; memcg; memcg = parent_mem_cgroup(memcg)) { @@ -2928,12 +2935,12 @@ from_memcg: * away and can be used within the scope without any additional * protection. 
*/ - objcg = rcu_dereference_check(memcg->objcg, 1); + objcg = rcu_dereference_check(memcg->nodeinfo[nid]->objcg, 1); if (likely(objcg)) return objcg; } - return root_obj_cgroup; + return rcu_dereference_check(root_mem_cgroup->nodeinfo[nid]->objcg, 1); } struct obj_cgroup *get_obj_cgroup_from_folio(struct folio *folio) @@ -3876,6 +3883,8 @@ static bool alloc_mem_cgroup_per_node_info(struct mem_cgroup *memcg, int node) if (!pn->lruvec_stats_percpu) goto fail; + INIT_LIST_HEAD(&pn->objcg_list); + lruvec_init(&pn->lruvec); pn->memcg = memcg; @@ -3890,10 +3899,14 @@ static void __mem_cgroup_free(struct mem_cgroup *memcg) { int node; - obj_cgroup_put(memcg->orig_objcg); + for_each_node(node) { + struct mem_cgroup_per_node *pn = memcg->nodeinfo[node]; + if (!pn) + continue; - for_each_node(node) - free_mem_cgroup_per_node_info(memcg->nodeinfo[node]); + obj_cgroup_put(pn->orig_objcg); + free_mem_cgroup_per_node_info(pn); + } memcg1_free_events(memcg); kfree(memcg->vmstats); free_percpu(memcg->vmstats_percpu); @@ -3964,7 +3977,6 @@ static struct mem_cgroup *mem_cgroup_alloc(struct mem_cgroup *parent) #endif memcg1_memcg_init(memcg); memcg->kmemcg_id = -1; - INIT_LIST_HEAD(&memcg->objcg_list); #ifdef CONFIG_CGROUP_WRITEBACK INIT_LIST_HEAD(&memcg->cgwb_list); for (i = 0; i < MEMCG_CGWB_FRN_CNT; i++) @@ -4041,6 +4053,7 @@ static int mem_cgroup_css_online(struct cgroup_subsys_state *css) { struct mem_cgroup *memcg = mem_cgroup_from_css(css); struct obj_cgroup *objcg; + int nid; memcg_online_kmem(memcg); @@ -4052,17 +4065,19 @@ static int mem_cgroup_css_online(struct cgroup_subsys_state *css) if (alloc_shrinker_info(memcg)) goto offline_kmem; - objcg = obj_cgroup_alloc(); - if (!objcg) - goto free_shrinker; + for_each_node(nid) { + objcg = obj_cgroup_alloc(); + if (!objcg) + goto free_objcg; - if (unlikely(mem_cgroup_is_root(memcg))) - root_obj_cgroup = objcg; + if (unlikely(mem_cgroup_is_root(memcg))) + objcg->is_root = true; - objcg->memcg = memcg; - rcu_assign_pointer(memcg->objcg, objcg); - obj_cgroup_get(objcg); - memcg->orig_objcg = objcg; + objcg->memcg = memcg; + rcu_assign_pointer(memcg->nodeinfo[nid]->objcg, objcg); + obj_cgroup_get(objcg); + memcg->nodeinfo[nid]->orig_objcg = objcg; + } if (unlikely(mem_cgroup_is_root(memcg)) && !mem_cgroup_disabled()) queue_delayed_work(system_dfl_wq, &stats_flush_dwork, @@ -4086,7 +4101,24 @@ static int mem_cgroup_css_online(struct cgroup_subsys_state *css) xa_store(&mem_cgroup_private_ids, memcg->id.id, memcg, GFP_KERNEL); return 0; -free_shrinker: +free_objcg: + for_each_node(nid) { + struct mem_cgroup_per_node *pn = memcg->nodeinfo[nid]; + + objcg = rcu_replace_pointer(pn->objcg, NULL, true); + if (objcg) + percpu_ref_kill(&objcg->refcnt); + + if (pn->orig_objcg) { + obj_cgroup_put(pn->orig_objcg); + /* + * Reset pn->orig_objcg to NULL to prevent + * obj_cgroup_put() from being called again in + * __mem_cgroup_free(). + */ + pn->orig_objcg = NULL; + } + } free_shrinker_info(memcg); offline_kmem: memcg_offline_kmem(memcg); -- cgit v1.2.3 From f1cf8d2f36dc369688bbe61ce064fbd829dbc9e1 Mon Sep 17 00:00:00 2001 From: Muchun Song Date: Thu, 5 Mar 2026 19:52:50 +0800 Subject: mm: memcontrol: eliminate the problem of dying memory cgroup for LRU folios MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Now that everything is set up, switch folio->memcg_data pointers to objcgs, update the accessors, and execute reparenting on cgroup death. 
Finally, folio->memcg_data of LRU folios and kmem folios will always point to an object cgroup pointer. The folio->memcg_data of slab folios will point to an vector of object cgroups. Link: https://lore.kernel.org/80cb7af198dc6f2173fe616d1207a4c315ece141.1772711148.git.zhengqi.arch@bytedance.com Signed-off-by: Muchun Song Signed-off-by: Qi Zheng Acked-by: Shakeel Butt Cc: Allen Pais Cc: Axel Rasmussen Cc: Baoquan He Cc: Chengming Zhou Cc: Chen Ridong Cc: David Hildenbrand Cc: Hamza Mahfooz Cc: Harry Yoo Cc: Hugh Dickins Cc: Imran Khan Cc: Johannes Weiner Cc: Kamalesh Babulal Cc: Lance Yang Cc: Liam Howlett Cc: Lorenzo Stoakes (Oracle) Cc: Michal Hocko Cc: Michal Koutný Cc: Mike Rapoport Cc: Muchun Song Cc: Nhat Pham Cc: Roman Gushchin Cc: Suren Baghdasaryan Cc: Usama Arif Cc: Vlastimil Babka Cc: Wei Xu Cc: Yosry Ahmed Cc: Yuanchu Xie Cc: Zi Yan Signed-off-by: Andrew Morton --- include/linux/memcontrol.h | 77 ++++++------------ mm/memcontrol-v1.c | 15 ++-- mm/memcontrol.c | 194 +++++++++++++++++++++++++++------------------ 3 files changed, 151 insertions(+), 135 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 3e836b56bfcb..086158969529 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -369,9 +369,6 @@ enum objext_flags { #define OBJEXTS_FLAGS_MASK (__NR_OBJEXTS_FLAGS - 1) #ifdef CONFIG_MEMCG - -static inline bool folio_memcg_kmem(struct folio *folio); - /* * After the initialization objcg->memcg is always pointing at * a valid memcg, but can be atomically swapped to the parent memcg. @@ -385,43 +382,19 @@ static inline struct mem_cgroup *obj_cgroup_memcg(struct obj_cgroup *objcg) } /* - * __folio_memcg - Get the memory cgroup associated with a non-kmem folio - * @folio: Pointer to the folio. - * - * Returns a pointer to the memory cgroup associated with the folio, - * or NULL. This function assumes that the folio is known to have a - * proper memory cgroup pointer. It's not safe to call this function - * against some type of folios, e.g. slab folios or ex-slab folios or - * kmem folios. - */ -static inline struct mem_cgroup *__folio_memcg(struct folio *folio) -{ - unsigned long memcg_data = folio->memcg_data; - - VM_BUG_ON_FOLIO(folio_test_slab(folio), folio); - VM_BUG_ON_FOLIO(memcg_data & MEMCG_DATA_OBJEXTS, folio); - VM_BUG_ON_FOLIO(memcg_data & MEMCG_DATA_KMEM, folio); - - return (struct mem_cgroup *)(memcg_data & ~OBJEXTS_FLAGS_MASK); -} - -/* - * __folio_objcg - get the object cgroup associated with a kmem folio. + * folio_objcg - get the object cgroup associated with a folio. * @folio: Pointer to the folio. * * Returns a pointer to the object cgroup associated with the folio, * or NULL. This function assumes that the folio is known to have a - * proper object cgroup pointer. It's not safe to call this function - * against some type of folios, e.g. slab folios or ex-slab folios or - * LRU folios. + * proper object cgroup pointer. */ -static inline struct obj_cgroup *__folio_objcg(struct folio *folio) +static inline struct obj_cgroup *folio_objcg(struct folio *folio) { unsigned long memcg_data = folio->memcg_data; VM_BUG_ON_FOLIO(folio_test_slab(folio), folio); VM_BUG_ON_FOLIO(memcg_data & MEMCG_DATA_OBJEXTS, folio); - VM_BUG_ON_FOLIO(!(memcg_data & MEMCG_DATA_KMEM), folio); return (struct obj_cgroup *)(memcg_data & ~OBJEXTS_FLAGS_MASK); } @@ -435,21 +408,30 @@ static inline struct obj_cgroup *__folio_objcg(struct folio *folio) * proper memory cgroup pointer. 
It's not safe to call this function * against some type of folios, e.g. slab folios or ex-slab folios. * - * For a non-kmem folio any of the following ensures folio and memcg binding - * stability: + * For a folio any of the following ensures folio and objcg binding stability: * * - the folio lock * - LRU isolation * - exclusive reference * - * For a kmem folio a caller should hold an rcu read lock to protect memcg - * associated with a kmem folio from being released. + * Based on the stable binding of folio and objcg, for a folio any of the + * following ensures folio and memcg binding stability: + * + * - cgroup_mutex + * - the lruvec lock + * + * If the caller only want to ensure that the page counters of memcg are + * updated correctly, ensure that the binding stability of folio and objcg + * is sufficient. + * + * Note: The caller should hold an rcu read lock or cgroup_mutex to protect + * memcg associated with a folio from being released. */ static inline struct mem_cgroup *folio_memcg(struct folio *folio) { - if (folio_memcg_kmem(folio)) - return obj_cgroup_memcg(__folio_objcg(folio)); - return __folio_memcg(folio); + struct obj_cgroup *objcg = folio_objcg(folio); + + return objcg ? obj_cgroup_memcg(objcg) : NULL; } /* @@ -473,15 +455,10 @@ static inline bool folio_memcg_charged(struct folio *folio) * has an associated memory cgroup pointer or an object cgroups vector or * an object cgroup. * - * For a non-kmem folio any of the following ensures folio and memcg binding - * stability: + * The page and objcg or memcg binding rules can refer to folio_memcg(). * - * - the folio lock - * - LRU isolation - * - exclusive reference - * - * For a kmem folio a caller should hold an rcu read lock to protect memcg - * associated with a kmem folio from being released. + * A caller should hold an rcu read lock to protect memcg associated with a + * page from being released. */ static inline struct mem_cgroup *folio_memcg_check(struct folio *folio) { @@ -490,18 +467,14 @@ static inline struct mem_cgroup *folio_memcg_check(struct folio *folio) * for slabs, READ_ONCE() should be used here. */ unsigned long memcg_data = READ_ONCE(folio->memcg_data); + struct obj_cgroup *objcg; if (memcg_data & MEMCG_DATA_OBJEXTS) return NULL; - if (memcg_data & MEMCG_DATA_KMEM) { - struct obj_cgroup *objcg; - - objcg = (void *)(memcg_data & ~OBJEXTS_FLAGS_MASK); - return obj_cgroup_memcg(objcg); - } + objcg = (void *)(memcg_data & ~OBJEXTS_FLAGS_MASK); - return (struct mem_cgroup *)(memcg_data & ~OBJEXTS_FLAGS_MASK); + return objcg ? 
obj_cgroup_memcg(objcg) : NULL; } static inline struct mem_cgroup *page_memcg_check(struct page *page) diff --git a/mm/memcontrol-v1.c b/mm/memcontrol-v1.c index 8380adfa0f68..433bba9dfe71 100644 --- a/mm/memcontrol-v1.c +++ b/mm/memcontrol-v1.c @@ -613,6 +613,7 @@ void memcg1_commit_charge(struct folio *folio, struct mem_cgroup *memcg) void memcg1_swapout(struct folio *folio, swp_entry_t entry) { struct mem_cgroup *memcg, *swap_memcg; + struct obj_cgroup *objcg; unsigned int nr_entries; VM_BUG_ON_FOLIO(folio_test_lru(folio), folio); @@ -624,12 +625,13 @@ void memcg1_swapout(struct folio *folio, swp_entry_t entry) if (!do_memsw_account()) return; - memcg = folio_memcg(folio); - - VM_WARN_ON_ONCE_FOLIO(!memcg, folio); - if (!memcg) + objcg = folio_objcg(folio); + VM_WARN_ON_ONCE_FOLIO(!objcg, folio); + if (!objcg) return; + rcu_read_lock(); + memcg = obj_cgroup_memcg(objcg); /* * In case the memcg owning these pages has been offlined and doesn't * have an ID allocated to it anymore, charge the closest online @@ -644,7 +646,7 @@ void memcg1_swapout(struct folio *folio, swp_entry_t entry) folio_unqueue_deferred_split(folio); folio->memcg_data = 0; - if (!mem_cgroup_is_root(memcg)) + if (!obj_cgroup_is_root(objcg)) page_counter_uncharge(&memcg->memory, nr_entries); if (memcg != swap_memcg) { @@ -665,7 +667,8 @@ void memcg1_swapout(struct folio *folio, swp_entry_t entry) preempt_enable_nested(); memcg1_check_events(memcg, folio_nid(folio)); - css_put(&memcg->css); + rcu_read_unlock(); + obj_cgroup_put(objcg); } /* diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 1aaa66f729b3..b696823b34d0 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -254,13 +254,17 @@ static inline void reparent_state_local(struct mem_cgroup *memcg, struct mem_cgr } #endif -static inline void reparent_locks(struct mem_cgroup *memcg, struct mem_cgroup *parent) +static inline void reparent_locks(struct mem_cgroup *memcg, struct mem_cgroup *parent, int nid) { spin_lock_irq(&objcg_lock); + spin_lock_nested(&mem_cgroup_lruvec(memcg, NODE_DATA(nid))->lru_lock, 1); + spin_lock_nested(&mem_cgroup_lruvec(parent, NODE_DATA(nid))->lru_lock, 2); } -static inline void reparent_unlocks(struct mem_cgroup *memcg, struct mem_cgroup *parent) +static inline void reparent_unlocks(struct mem_cgroup *memcg, struct mem_cgroup *parent, int nid) { + spin_unlock(&mem_cgroup_lruvec(parent, NODE_DATA(nid))->lru_lock); + spin_unlock(&mem_cgroup_lruvec(memcg, NODE_DATA(nid))->lru_lock); spin_unlock_irq(&objcg_lock); } @@ -271,14 +275,31 @@ static void memcg_reparent_objcgs(struct mem_cgroup *memcg) int nid; for_each_node(nid) { - reparent_locks(memcg, parent); +retry: + if (lru_gen_enabled()) + max_lru_gen_memcg(parent, nid); + + reparent_locks(memcg, parent, nid); + + if (lru_gen_enabled()) { + if (!recheck_lru_gen_max_memcg(parent, nid)) { + reparent_unlocks(memcg, parent, nid); + cond_resched(); + goto retry; + } + lru_gen_reparent_memcg(memcg, parent, nid); + } else { + lru_reparent_memcg(memcg, parent, nid); + } objcg = __memcg_reparent_objcgs(memcg, parent, nid); - reparent_unlocks(memcg, parent); + reparent_unlocks(memcg, parent, nid); percpu_ref_kill(&objcg->refcnt); } + + reparent_state_local(memcg, parent); } /* @@ -823,6 +844,7 @@ static void __mod_memcg_state(struct mem_cgroup *memcg, this_cpu_add(memcg->vmstats_percpu->state[i], val); val = memcg_state_val_in_pages(idx, val); memcg_rstat_updated(memcg, val, cpu); + trace_mod_memcg_state(memcg, idx, val); put_cpu(); @@ -840,7 +862,9 @@ void mod_memcg_state(struct mem_cgroup *memcg, 
enum memcg_stat_item idx, if (mem_cgroup_disabled()) return; + memcg = get_non_dying_memcg_start(memcg); __mod_memcg_state(memcg, idx, val); + get_non_dying_memcg_end(); } #ifdef CONFIG_MEMCG_V1 @@ -900,11 +924,17 @@ static void mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx, int val) { + struct pglist_data *pgdat = lruvec_pgdat(lruvec); struct mem_cgroup_per_node *pn; + struct mem_cgroup *memcg; pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec); + memcg = get_non_dying_memcg_start(pn->memcg); + pn = memcg->nodeinfo[pgdat->node_id]; __mod_memcg_lruvec_state(pn, idx, val); + + get_non_dying_memcg_end(); } /** @@ -1127,6 +1157,8 @@ again: /** * get_mem_cgroup_from_folio - Obtain a reference on a given folio's memcg. * @folio: folio from which memcg should be extracted. + * + * See folio_memcg() for folio->objcg/memcg binding rules. */ struct mem_cgroup *get_mem_cgroup_from_folio(struct folio *folio) { @@ -2722,17 +2754,17 @@ static inline int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask, return try_charge_memcg(memcg, gfp_mask, nr_pages); } -static void commit_charge(struct folio *folio, struct mem_cgroup *memcg) +static void commit_charge(struct folio *folio, struct obj_cgroup *objcg) { VM_BUG_ON_FOLIO(folio_memcg_charged(folio), folio); /* - * Any of the following ensures page's memcg stability: + * Any of the following ensures folio's objcg stability: * * - the page lock * - LRU isolation * - exclusive reference */ - folio->memcg_data = (unsigned long)memcg; + folio->memcg_data = (unsigned long)objcg; } #ifdef CONFIG_MEMCG_NMI_SAFETY_REQUIRES_ATOMIC @@ -2846,6 +2878,17 @@ static struct obj_cgroup *__get_obj_cgroup_from_memcg(struct mem_cgroup *memcg) return NULL; } +static inline struct obj_cgroup *get_obj_cgroup_from_memcg(struct mem_cgroup *memcg) +{ + struct obj_cgroup *objcg; + + rcu_read_lock(); + objcg = __get_obj_cgroup_from_memcg(memcg); + rcu_read_unlock(); + + return objcg; +} + static struct obj_cgroup *current_objcg_update(void) { struct mem_cgroup *memcg; @@ -2947,17 +2990,10 @@ struct obj_cgroup *get_obj_cgroup_from_folio(struct folio *folio) { struct obj_cgroup *objcg; - if (!memcg_kmem_online()) - return NULL; - - if (folio_memcg_kmem(folio)) { - objcg = __folio_objcg(folio); + objcg = folio_objcg(folio); + if (objcg) obj_cgroup_get(objcg); - } else { - rcu_read_lock(); - objcg = __get_obj_cgroup_from_memcg(__folio_memcg(folio)); - rcu_read_unlock(); - } + return objcg; } @@ -3519,7 +3555,7 @@ void folio_split_memcg_refs(struct folio *folio, unsigned old_order, return; new_refs = (1 << (old_order - new_order)) - 1; - css_get_many(&__folio_memcg(folio)->css, new_refs); + obj_cgroup_get_many(folio_objcg(folio), new_refs); } static void memcg_online_kmem(struct mem_cgroup *memcg) @@ -4955,16 +4991,20 @@ void mem_cgroup_calculate_protection(struct mem_cgroup *root, static int charge_memcg(struct folio *folio, struct mem_cgroup *memcg, gfp_t gfp) { - int ret; - - ret = try_charge(memcg, gfp, folio_nr_pages(folio)); - if (ret) - goto out; + int ret = 0; + struct obj_cgroup *objcg; - css_get(&memcg->css); - commit_charge(folio, memcg); + objcg = get_obj_cgroup_from_memcg(memcg); + /* Do not account at the root objcg level. 
*/ + if (!obj_cgroup_is_root(objcg)) + ret = try_charge_memcg(memcg, gfp, folio_nr_pages(folio)); + if (ret) { + obj_cgroup_put(objcg); + return ret; + } + commit_charge(folio, objcg); memcg1_commit_charge(folio, memcg); -out: + return ret; } @@ -5050,7 +5090,7 @@ int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm, } struct uncharge_gather { - struct mem_cgroup *memcg; + struct obj_cgroup *objcg; unsigned long nr_memory; unsigned long pgpgout; unsigned long nr_kmem; @@ -5064,58 +5104,52 @@ static inline void uncharge_gather_clear(struct uncharge_gather *ug) static void uncharge_batch(const struct uncharge_gather *ug) { + struct mem_cgroup *memcg; + + rcu_read_lock(); + memcg = obj_cgroup_memcg(ug->objcg); if (ug->nr_memory) { - memcg_uncharge(ug->memcg, ug->nr_memory); + memcg_uncharge(memcg, ug->nr_memory); if (ug->nr_kmem) { - mod_memcg_state(ug->memcg, MEMCG_KMEM, -ug->nr_kmem); - memcg1_account_kmem(ug->memcg, -ug->nr_kmem); + mod_memcg_state(memcg, MEMCG_KMEM, -ug->nr_kmem); + memcg1_account_kmem(memcg, -ug->nr_kmem); } - memcg1_oom_recover(ug->memcg); + memcg1_oom_recover(memcg); } - memcg1_uncharge_batch(ug->memcg, ug->pgpgout, ug->nr_memory, ug->nid); + memcg1_uncharge_batch(memcg, ug->pgpgout, ug->nr_memory, ug->nid); + rcu_read_unlock(); /* drop reference from uncharge_folio */ - css_put(&ug->memcg->css); + obj_cgroup_put(ug->objcg); } static void uncharge_folio(struct folio *folio, struct uncharge_gather *ug) { long nr_pages; - struct mem_cgroup *memcg; struct obj_cgroup *objcg; VM_BUG_ON_FOLIO(folio_test_lru(folio), folio); /* * Nobody should be changing or seriously looking at - * folio memcg or objcg at this point, we have fully - * exclusive access to the folio. + * folio objcg at this point, we have fully exclusive + * access to the folio. */ - if (folio_memcg_kmem(folio)) { - objcg = __folio_objcg(folio); - /* - * This get matches the put at the end of the function and - * kmem pages do not hold memcg references anymore. 
- */ - memcg = get_mem_cgroup_from_objcg(objcg); - } else { - memcg = __folio_memcg(folio); - } - - if (!memcg) + objcg = folio_objcg(folio); + if (!objcg) return; - if (ug->memcg != memcg) { - if (ug->memcg) { + if (ug->objcg != objcg) { + if (ug->objcg) { uncharge_batch(ug); uncharge_gather_clear(ug); } - ug->memcg = memcg; + ug->objcg = objcg; ug->nid = folio_nid(folio); - /* pairs with css_put in uncharge_batch */ - css_get(&memcg->css); + /* pairs with obj_cgroup_put in uncharge_batch */ + obj_cgroup_get(objcg); } nr_pages = folio_nr_pages(folio); @@ -5123,20 +5157,17 @@ static void uncharge_folio(struct folio *folio, struct uncharge_gather *ug) if (folio_memcg_kmem(folio)) { ug->nr_memory += nr_pages; ug->nr_kmem += nr_pages; - - folio->memcg_data = 0; - obj_cgroup_put(objcg); } else { /* LRU pages aren't accounted at the root level */ - if (!mem_cgroup_is_root(memcg)) + if (!obj_cgroup_is_root(objcg)) ug->nr_memory += nr_pages; ug->pgpgout++; WARN_ON_ONCE(folio_unqueue_deferred_split(folio)); - folio->memcg_data = 0; } - css_put(&memcg->css); + folio->memcg_data = 0; + obj_cgroup_put(objcg); } void __mem_cgroup_uncharge(struct folio *folio) @@ -5160,7 +5191,7 @@ void __mem_cgroup_uncharge_folios(struct folio_batch *folios) uncharge_gather_clear(&ug); for (i = 0; i < folios->nr; i++) uncharge_folio(folios->folios[i], &ug); - if (ug.memcg) + if (ug.objcg) uncharge_batch(&ug); } @@ -5177,6 +5208,7 @@ void __mem_cgroup_uncharge_folios(struct folio_batch *folios) void mem_cgroup_replace_folio(struct folio *old, struct folio *new) { struct mem_cgroup *memcg; + struct obj_cgroup *objcg; long nr_pages = folio_nr_pages(new); VM_BUG_ON_FOLIO(!folio_test_locked(old), old); @@ -5191,21 +5223,24 @@ void mem_cgroup_replace_folio(struct folio *old, struct folio *new) if (folio_memcg_charged(new)) return; - memcg = folio_memcg(old); - VM_WARN_ON_ONCE_FOLIO(!memcg, old); - if (!memcg) + objcg = folio_objcg(old); + VM_WARN_ON_ONCE_FOLIO(!objcg, old); + if (!objcg) return; + rcu_read_lock(); + memcg = obj_cgroup_memcg(objcg); /* Force-charge the new page. The old one will be freed soon */ - if (!mem_cgroup_is_root(memcg)) { + if (!obj_cgroup_is_root(objcg)) { page_counter_charge(&memcg->memory, nr_pages); if (do_memsw_account()) page_counter_charge(&memcg->memsw, nr_pages); } - css_get(&memcg->css); - commit_charge(new, memcg); + obj_cgroup_get(objcg); + commit_charge(new, objcg); memcg1_commit_charge(new, memcg); + rcu_read_unlock(); } /** @@ -5221,7 +5256,7 @@ void mem_cgroup_replace_folio(struct folio *old, struct folio *new) */ void mem_cgroup_migrate(struct folio *old, struct folio *new) { - struct mem_cgroup *memcg; + struct obj_cgroup *objcg; VM_BUG_ON_FOLIO(!folio_test_locked(old), old); VM_BUG_ON_FOLIO(!folio_test_locked(new), new); @@ -5232,18 +5267,18 @@ void mem_cgroup_migrate(struct folio *old, struct folio *new) if (mem_cgroup_disabled()) return; - memcg = folio_memcg(old); + objcg = folio_objcg(old); /* - * Note that it is normal to see !memcg for a hugetlb folio. + * Note that it is normal to see !objcg for a hugetlb folio. * For e.g, it could have been allocated when memory_hugetlb_accounting * was not selected. 
*/ - VM_WARN_ON_ONCE_FOLIO(!folio_test_hugetlb(old) && !memcg, old); - if (!memcg) + VM_WARN_ON_ONCE_FOLIO(!folio_test_hugetlb(old) && !objcg, old); + if (!objcg) return; - /* Transfer the charge and the css ref */ - commit_charge(new, memcg); + /* Transfer the charge and the objcg ref */ + commit_charge(new, objcg); /* Warning should never happen, so don't worry about refcount non-0 */ WARN_ON_ONCE(folio_unqueue_deferred_split(old)); @@ -5426,22 +5461,27 @@ int __mem_cgroup_try_charge_swap(struct folio *folio, swp_entry_t entry) unsigned int nr_pages = folio_nr_pages(folio); struct page_counter *counter; struct mem_cgroup *memcg; + struct obj_cgroup *objcg; if (do_memsw_account()) return 0; - memcg = folio_memcg(folio); - - VM_WARN_ON_ONCE_FOLIO(!memcg, folio); - if (!memcg) + objcg = folio_objcg(folio); + VM_WARN_ON_ONCE_FOLIO(!objcg, folio); + if (!objcg) return 0; + rcu_read_lock(); + memcg = obj_cgroup_memcg(objcg); if (!entry.val) { memcg_memory_event(memcg, MEMCG_SWAP_FAIL); + rcu_read_unlock(); return 0; } memcg = mem_cgroup_private_id_get_online(memcg, nr_pages); + /* memcg is pined by memcg ID. */ + rcu_read_unlock(); if (!mem_cgroup_is_root(memcg) && !page_counter_try_charge(&memcg->swap, nr_pages, &counter)) { -- cgit v1.2.3 From 0a98e13963424d7f1f50211c692f46a3b1e8d03f Mon Sep 17 00:00:00 2001 From: Muchun Song Date: Thu, 5 Mar 2026 19:52:51 +0800 Subject: mm: lru: add VM_WARN_ON_ONCE_FOLIO to lru maintenance helpers MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit We must ensure the folio is deleted from or added to the correct lruvec list. So, add VM_WARN_ON_ONCE_FOLIO() to catch invalid users. The VM_BUG_ON_PAGE() in move_pages_to_lru() can be removed as add_page_to_lru_list() will perform the necessary check. 
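The check the new assertions rely on, folio_matches_lruvec(), verifies that the lruvec passed in matches both the folio's node and its (now objcg-derived) memcg; in mm_inline.h it is essentially (paraphrased, not a verbatim quote):

	static inline bool folio_matches_lruvec(struct folio *folio, struct lruvec *lruvec)
	{
		/* same node and same memcg binding */
		return lruvec_pgdat(lruvec) == folio_pgdat(folio) &&
		       lruvec_memcg(lruvec) == folio_memcg(folio);
	}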
Link: https://lore.kernel.org/2c90fc006d9d730331a3caeef96f7e5dabe2036d.1772711148.git.zhengqi.arch@bytedance.com Signed-off-by: Muchun Song Signed-off-by: Qi Zheng Acked-by: Roman Gushchin Acked-by: Johannes Weiner Acked-by: Shakeel Butt Cc: Allen Pais Cc: Axel Rasmussen Cc: Baoquan He Cc: Chengming Zhou Cc: Chen Ridong Cc: David Hildenbrand Cc: Hamza Mahfooz Cc: Harry Yoo Cc: Hugh Dickins Cc: Imran Khan Cc: Kamalesh Babulal Cc: Lance Yang Cc: Liam Howlett Cc: Lorenzo Stoakes (Oracle) Cc: Michal Hocko Cc: Michal Koutný Cc: Mike Rapoport Cc: Muchun Song Cc: Nhat Pham Cc: Suren Baghdasaryan Cc: Usama Arif Cc: Vlastimil Babka Cc: Wei Xu Cc: Yosry Ahmed Cc: Yuanchu Xie Cc: Zi Yan Signed-off-by: Andrew Morton --- include/linux/mm_inline.h | 6 ++++++ mm/vmscan.c | 1 - 2 files changed, 6 insertions(+), 1 deletion(-) diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h index 7fc2ced00f8f..a171070e15f0 100644 --- a/include/linux/mm_inline.h +++ b/include/linux/mm_inline.h @@ -348,6 +348,8 @@ void lruvec_add_folio(struct lruvec *lruvec, struct folio *folio) { enum lru_list lru = folio_lru_list(folio); + VM_WARN_ON_ONCE_FOLIO(!folio_matches_lruvec(folio, lruvec), folio); + if (lru_gen_add_folio(lruvec, folio, false)) return; @@ -362,6 +364,8 @@ void lruvec_add_folio_tail(struct lruvec *lruvec, struct folio *folio) { enum lru_list lru = folio_lru_list(folio); + VM_WARN_ON_ONCE_FOLIO(!folio_matches_lruvec(folio, lruvec), folio); + if (lru_gen_add_folio(lruvec, folio, true)) return; @@ -376,6 +380,8 @@ void lruvec_del_folio(struct lruvec *lruvec, struct folio *folio) { enum lru_list lru = folio_lru_list(folio); + VM_WARN_ON_ONCE_FOLIO(!folio_matches_lruvec(folio, lruvec), folio); + if (lru_gen_del_folio(lruvec, folio, false)) return; diff --git a/mm/vmscan.c b/mm/vmscan.c index 1ac4f959ec1c..fd120e898c70 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1916,7 +1916,6 @@ static unsigned int move_folios_to_lru(struct list_head *list) continue; } - VM_BUG_ON_FOLIO(!folio_matches_lruvec(folio, lruvec), folio); lruvec_add_folio(lruvec, folio); nr_pages = folio_nr_pages(folio); nr_moved += nr_pages; -- cgit v1.2.3 From 616795d7db00377b76f6918fafd32f61af7f78f0 Mon Sep 17 00:00:00 2001 From: Qi Zheng Date: Fri, 27 Mar 2026 18:16:28 +0800 Subject: mm: memcontrol: correct the type of stats_updates to unsigned long MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Patch series "fix unexpected type conversions and potential overflows", v3. As Harry Yoo pointed out [1], in scenarios where massive state updates occur (e.g., during the reparenting of LRU folios), the values passed to memcg stat update functions can accumulate and exceed the upper limit of a 32-bit integer. If the parameter types are not large enough (like 'int') or are handled incorrectly, it can lead to severe truncation, potential overflow issues, and unexpected type conversion bugs. This series aims to address these issues by correcting the parameter types in the relevant functions, and by fixing an implicit conversion bug in memcg_state_val_in_pages(). This patch (of 3): The memcg_rstat_updated() tracks updates for vmstats_percpu->state and lruvec_stats_percpu->state. Since these state values are of type long, change the val parameter passed to memcg_rstat_updated() to long as well. 
Correspondingly, change the type of stats_updates in struct memcg_vmstats_percpu and struct memcg_vmstats from unsigned int and atomic_t to unsigned long and atomic_long_t respectively to prevent potential overflow when handling large state updates during the reparenting of LRU folios. Link: https://lore.kernel.org/cover.1774604356.git.zhengqi.arch@bytedance.com Link: https://lore.kernel.org/a5b0b468e7b4fe5f26c50e36d5d016f16d92f98f.1774604356.git.zhengqi.arch@bytedance.com Link: https://lore.kernel.org/all/acDxaEgnqPI-Z4be@hyeyoo/ [1] Signed-off-by: Qi Zheng Reviewed-by: Lorenzo Stoakes (Oracle) Reviewed-by: Harry Yoo (Oracle) Cc: Allen Pais Cc: Axel Rasmussen Cc: Baoquan He Cc: David Hildenbrand Cc: Hamza Mahfooz Cc: Hugh Dickins Cc: Imran Khan Cc: Johannes Weiner Cc: Kamalesh Babulal Cc: Lance Yang Cc: Michal Hocko Cc: Michal Koutný Cc: Muchun Song Cc: Roman Gushchin Cc: Shakeel Butt Cc: Usama Arif Cc: Wei Xu Cc: Yuanchu Xie Cc: Zi Yan Signed-off-by: Andrew Morton --- mm/memcontrol.c | 18 +++++++++--------- 1 file changed, 9 insertions(+), 9 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index b696823b34d0..4ee668c20fa6 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -608,7 +608,7 @@ static inline int memcg_events_index(enum vm_event_item idx) struct memcg_vmstats_percpu { /* Stats updates since the last flush */ - unsigned int stats_updates; + unsigned long stats_updates; /* Cached pointers for fast iteration in memcg_rstat_updated() */ struct memcg_vmstats_percpu __percpu *parent_pcpu; @@ -639,7 +639,7 @@ struct memcg_vmstats { unsigned long events_pending[NR_MEMCG_EVENTS]; /* Stats updates since the last flush */ - atomic_t stats_updates; + atomic_long_t stats_updates; }; /* @@ -665,16 +665,16 @@ static u64 flush_last_time; static bool memcg_vmstats_needs_flush(struct memcg_vmstats *vmstats) { - return atomic_read(&vmstats->stats_updates) > + return atomic_long_read(&vmstats->stats_updates) > MEMCG_CHARGE_BATCH * num_online_cpus(); } -static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val, +static inline void memcg_rstat_updated(struct mem_cgroup *memcg, long val, int cpu) { struct memcg_vmstats_percpu __percpu *statc_pcpu; struct memcg_vmstats_percpu *statc; - unsigned int stats_updates; + unsigned long stats_updates; if (!val) return; @@ -697,7 +697,7 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val, continue; stats_updates = this_cpu_xchg(statc_pcpu->stats_updates, 0); - atomic_add(stats_updates, &statc->vmstats->stats_updates); + atomic_long_add(stats_updates, &statc->vmstats->stats_updates); } } @@ -705,7 +705,7 @@ static void __mem_cgroup_flush_stats(struct mem_cgroup *memcg, bool force) { bool needs_flush = memcg_vmstats_needs_flush(memcg->vmstats); - trace_memcg_flush_stats(memcg, atomic_read(&memcg->vmstats->stats_updates), + trace_memcg_flush_stats(memcg, atomic_long_read(&memcg->vmstats->stats_updates), force, needs_flush); if (!force && !needs_flush) @@ -4413,8 +4413,8 @@ static void mem_cgroup_css_rstat_flush(struct cgroup_subsys_state *css, int cpu) } WRITE_ONCE(statc->stats_updates, 0); /* We are in a per-cpu loop here, only do the atomic write once */ - if (atomic_read(&memcg->vmstats->stats_updates)) - atomic_set(&memcg->vmstats->stats_updates, 0); + if (atomic_long_read(&memcg->vmstats->stats_updates)) + atomic_long_set(&memcg->vmstats->stats_updates, 0); } static void mem_cgroup_fork(struct task_struct *task) -- cgit v1.2.3 From 85358bad68f5d72a8cff3d79d46e4c38a91afe06 Mon Sep 17 00:00:00 2001 From: Qi 
Zheng Date: Fri, 27 Mar 2026 18:16:29 +0800 Subject: mm: memcontrol: change val type to long in __mod_memcg_{lruvec_}state() MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The __mod_memcg_state() and __mod_memcg_lruvec_state() functions are also used to reparent non-hierarchical stats. In this scenario, the values passed to them are accumulated statistics that might be extremely large and exceed the upper limit of a 32-bit integer. Change the val parameter type from int to long in these functions and their corresponding tracepoints (memcg_rstat_stats) to prevent potential overflow issues. After that, in memcg_state_val_in_pages(), if the passed val is negative, the expression val * unit / PAGE_SIZE could be implicitly converted to a massive positive number when compared with 1UL in the max() macro. This leads to returning an incorrect massive positive value. Fix this by using abs(val) to calculate the magnitude first, and then restoring the sign of the value before returning the result. Additionally, use mult_frac() to prevent potential overflow during the multiplication of val and unit. Link: https://lore.kernel.org/70a9440e49c464b4dca88bcabc6b491bd335c9f0.1774604356.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng Reported-by: Harry Yoo (Oracle) Reviewed-by: Lorenzo Stoakes (Oracle) Reviewed-by: Harry Yoo (Oracle) Cc: Allen Pais Cc: Axel Rasmussen Cc: Baoquan He Cc: David Hildenbrand Cc: Hamza Mahfooz Cc: Hugh Dickins Cc: Imran Khan Cc: Johannes Weiner Cc: Kamalesh Babulal Cc: Lance Yang Cc: Michal Hocko Cc: Michal Koutný Cc: Muchun Song Cc: Roman Gushchin Cc: Shakeel Butt Cc: Usama Arif Cc: Wei Xu Cc: Yuanchu Xie Cc: Zi Yan Signed-off-by: Andrew Morton --- include/trace/events/memcg.h | 10 +++++----- mm/memcontrol.c | 18 ++++++++++++------ 2 files changed, 17 insertions(+), 11 deletions(-) diff --git a/include/trace/events/memcg.h b/include/trace/events/memcg.h index dfe2f51019b4..51b62c5931fc 100644 --- a/include/trace/events/memcg.h +++ b/include/trace/events/memcg.h @@ -11,14 +11,14 @@ DECLARE_EVENT_CLASS(memcg_rstat_stats, - TP_PROTO(struct mem_cgroup *memcg, int item, int val), + TP_PROTO(struct mem_cgroup *memcg, int item, long val), TP_ARGS(memcg, item, val), TP_STRUCT__entry( __field(u64, id) __field(int, item) - __field(int, val) + __field(long, val) ), TP_fast_assign( @@ -27,20 +27,20 @@ DECLARE_EVENT_CLASS(memcg_rstat_stats, __entry->val = val; ), - TP_printk("memcg_id=%llu item=%d val=%d", + TP_printk("memcg_id=%llu item=%d val=%ld", __entry->id, __entry->item, __entry->val) ); DEFINE_EVENT(memcg_rstat_stats, mod_memcg_state, - TP_PROTO(struct mem_cgroup *memcg, int item, int val), + TP_PROTO(struct mem_cgroup *memcg, int item, long val), TP_ARGS(memcg, item, val) ); DEFINE_EVENT(memcg_rstat_stats, mod_memcg_lruvec_state, - TP_PROTO(struct mem_cgroup *memcg, int item, int val), + TP_PROTO(struct mem_cgroup *memcg, int item, long val), TP_ARGS(memcg, item, val) ); diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 4ee668c20fa6..685e6dd48ce5 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -527,7 +527,7 @@ unsigned long lruvec_page_state_local(struct lruvec *lruvec, #ifdef CONFIG_MEMCG_V1 static void __mod_memcg_lruvec_state(struct mem_cgroup_per_node *pn, - enum node_stat_item idx, int val); + enum node_stat_item idx, long val); void reparent_memcg_lruvec_state_local(struct mem_cgroup *memcg, struct mem_cgroup *parent, int idx) @@ -784,14 +784,20 @@ static int memcg_page_state_unit(int item); * Normalize the value passed 
into memcg_rstat_updated() to be in pages. Round * up non-zero sub-page updates to 1 page as zero page updates are ignored. */ -static int memcg_state_val_in_pages(int idx, int val) +static long memcg_state_val_in_pages(int idx, long val) { int unit = memcg_page_state_unit(idx); + long res; if (!val || unit == PAGE_SIZE) return val; - else - return max(val * unit / PAGE_SIZE, 1UL); + + /* Get the absolute value of (val * unit / PAGE_SIZE). */ + res = mult_frac(abs(val), unit, PAGE_SIZE); + /* Round up zero values. */ + res = res ? : 1; + + return val < 0 ? -res : res; } #ifdef CONFIG_MEMCG_V1 @@ -831,7 +837,7 @@ static inline void get_non_dying_memcg_end(void) #endif static void __mod_memcg_state(struct mem_cgroup *memcg, - enum memcg_stat_item idx, int val) + enum memcg_stat_item idx, long val) { int i = memcg_stats_index(idx); int cpu; @@ -896,7 +902,7 @@ void reparent_memcg_state_local(struct mem_cgroup *memcg, #endif static void __mod_memcg_lruvec_state(struct mem_cgroup_per_node *pn, - enum node_stat_item idx, int val) + enum node_stat_item idx, long val) { struct mem_cgroup *memcg = pn->memcg; int i = memcg_stats_index(idx); -- cgit v1.2.3 From 1c514a2c6e4c3bf2016a1dbbddc36d19fdf52ce5 Mon Sep 17 00:00:00 2001 From: Qi Zheng Date: Fri, 27 Mar 2026 18:16:30 +0800 Subject: mm: memcontrol: correct the nr_pages parameter type of mem_cgroup_update_lru_size() MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The nr_pages parameter of mem_cgroup_update_lru_size() represents a page count. During the reparenting of LRU folios, the value passed to it can potentially exceed the maximum value of a 32-bit integer. It should be declared as long instead of int to match the types used in lruvec size accounting and to prevent possible overflow. Update the parameter type to long to ensure correctness. Link: https://lore.kernel.org/fd4140de44fa0a3978e4e2426731187fe8625f0b.1774604356.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng Reviewed-by: Lorenzo Stoakes (Oracle) Reviewed-by: Harry Yoo (Oracle) Cc: Allen Pais Cc: Axel Rasmussen Cc: Baoquan He Cc: David Hildenbrand Cc: Hamza Mahfooz Cc: Hugh Dickins Cc: Imran Khan Cc: Johannes Weiner Cc: Kamalesh Babulal Cc: Lance Yang Cc: Michal Hocko Cc: Michal Koutný Cc: Muchun Song Cc: Roman Gushchin Cc: Shakeel Butt Cc: Usama Arif Cc: Wei Xu Cc: Yuanchu Xie Cc: Zi Yan Signed-off-by: Andrew Morton --- include/linux/memcontrol.h | 2 +- mm/memcontrol.c | 4 ++-- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 086158969529..dc3fa687759b 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -878,7 +878,7 @@ static inline bool mem_cgroup_online(struct mem_cgroup *memcg) } void mem_cgroup_update_lru_size(struct lruvec *lruvec, enum lru_list lru, - int zid, int nr_pages); + int zid, long nr_pages); static inline unsigned long mem_cgroup_get_zone_lru_size(struct lruvec *lruvec, diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 685e6dd48ce5..c3d98ab41f1f 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -1472,7 +1472,7 @@ retry: * to or just after a page is removed from an lru list. 
*/ void mem_cgroup_update_lru_size(struct lruvec *lruvec, enum lru_list lru, - int zid, int nr_pages) + int zid, long nr_pages) { struct mem_cgroup_per_node *mz; unsigned long *lru_size; @@ -1489,7 +1489,7 @@ void mem_cgroup_update_lru_size(struct lruvec *lruvec, enum lru_list lru, size = *lru_size; if (WARN_ONCE(size < 0, - "%s(%p, %d, %d): lru_size %ld\n", + "%s(%p, %d, %ld): lru_size %ld\n", __func__, lruvec, lru, nr_pages, size)) { VM_BUG_ON(1); *lru_size = 0; -- cgit v1.2.3 From 34c45804ae0535c7bce9e7ce00329382afe68a1f Mon Sep 17 00:00:00 2001 From: "Lorenzo Stoakes (Oracle)" Date: Thu, 26 Mar 2026 18:56:29 +0000 Subject: MAINTAINERS: update MGLRU entry to reflect current status We are moving to a far more proactive model of maintainership within mm and thus put a great deal of emphasis on sub-maintainers being active within the community both in terms of code contributions and review. The MGLRU has not had much activity since being added to the kernel and the current maintainers who kindly stepped up have unfortunately not been able to contribute a great deal to it for over a year, nor engage all that heavily in review. As a result, and with no negative connotations implied whatsoever, it seems appropriate to downgrade the current maintainers to reviewers. At this time nobody is quite exercising the maintainer role in this area of the kernel, but there is encouraging activity from a number of people who are trusted elsewhere in the kernel, and who have contributed relevant work or review. Therefore add further reviewers, and at this stage - to reflect the reality on the ground - we will not have any sub-maintainers listed at all. Each of the files listed is shared with other sections in MAINTAINERS, so this doesn't reduce sub-maintainer coverage. Link: https://lore.kernel.org/20260326185629.355476-1-ljs@kernel.org Signed-off-by: Lorenzo Stoakes (Oracle) Acked-by: Axel Rasmussen Acked-by: Vlastimil Babka (SUSE) Acked-by: David Hildenbrand (Arm) Acked-by: Barry Song Acked-by: SeongJae Park Acked-by: Kairui Song Acked-by: Qi Zheng Acked-by: Yuanchu Xie Acked-by: Shakeel Butt Cc: Suren Baghdasaryan Cc: Wei Xu Cc: Kalesh Singh Signed-off-by: Andrew Morton --- MAINTAINERS | 8 ++++++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/MAINTAINERS b/MAINTAINERS index 76431aa5efbe..16874c32e288 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -16757,8 +16757,12 @@ F: mm/migrate_device.c MEMORY MANAGEMENT - MGLRU (MULTI-GEN LRU) M: Andrew Morton -M: Axel Rasmussen -M: Yuanchu Xie +R: Kairui Song +R: Qi Zheng +R: Shakeel Butt +R: Barry Song +R: Axel Rasmussen +R: Yuanchu Xie R: Wei Xu L: linux-mm@kvack.org S: Maintained -- cgit v1.2.3 From e9d973ef18b0554f5a819b4b0e0d5ac9c3b74657 Mon Sep 17 00:00:00 2001 From: Breno Leitao Date: Mon, 23 Mar 2026 04:12:13 -0700 Subject: mm: kmemleak: add CONFIG_DEBUG_KMEMLEAK_VERBOSE build option Add a Kconfig option to turn kmemleak verbose mode on by default at build time. This option depends on DEBUG_KMEMLEAK_AUTO_SCAN since verbose reporting is only meaningful when the automatic scanning thread is running. When enabled, kmemleak prints full details (backtrace, hex dump, address) of unreferenced objects to dmesg as they are detected during scanning, removing the need to manually read /sys/kernel/debug/kmemleak. Making this a compile-time option rather than a boot parameter allows debug kernel flavors to enable verbose kmemleak reporting by default without requiring changes to boot arguments.
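For example, a debug kernel flavor could carry a config fragment along these lines (illustrative sketch; DEBUG_KMEMLEAK and DEBUG_KMEMLEAK_AUTO_SCAN are the pre-existing kmemleak options that the new option depends on):
CONFIG_DEBUG_KMEMLEAK=y
CONFIG_DEBUG_KMEMLEAK_AUTO_SCAN=y
CONFIG_DEBUG_KMEMLEAK_VERBOSE=y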
A machine can simply swap to a debug kernel and benefit from kmemleak reporting automatically. By surfacing leak reports directly in dmesg, they are automatically forwarded through any kernel logging infrastructure and can be easily captured by log aggregation tooling, making it practical to monitor memory leaks across large fleets. The verbose setting can still be toggled at runtime via /sys/module/kmemleak/parameters/verbose. Link: https://lore.kernel.org/20260323-kmemleak_report-v1-1-ba2cdd9c11b9@debian.org Signed-off-by: Breno Leitao Acked-by: SeongJae Park Acked-by: Vlastimil Babka (SUSE) Reviewed-by: Lorenzo Stoakes (Oracle) Acked-by: Catalin Marinas Cc: David Hildenbrand Cc: Liam Howlett Cc: Michal Hocko Cc: Mike Rapoport Cc: Suren Baghdasaryan Signed-off-by: Andrew Morton --- mm/Kconfig.debug | 11 +++++++++++ mm/kmemleak.c | 2 +- 2 files changed, 12 insertions(+), 1 deletion(-) diff --git a/mm/Kconfig.debug b/mm/Kconfig.debug index 7638d75b27db..91b3e027b753 100644 --- a/mm/Kconfig.debug +++ b/mm/Kconfig.debug @@ -297,6 +297,17 @@ config DEBUG_KMEMLEAK_AUTO_SCAN If unsure, say Y. +config DEBUG_KMEMLEAK_VERBOSE + bool "Default kmemleak to verbose mode" + depends on DEBUG_KMEMLEAK_AUTO_SCAN + help + Say Y here to have kmemleak print unreferenced object details + (backtrace, hex dump, address) to dmesg when new memory leaks are + detected during automatic scanning. This can also be toggled at + runtime via /sys/module/kmemleak/parameters/verbose. + + If unsure, say N. + config PER_VMA_LOCK_STATS bool "Statistics for per-vma locks" depends on PER_VMA_LOCK diff --git a/mm/kmemleak.c b/mm/kmemleak.c index fa8201e23222..2eff0d6b622b 100644 --- a/mm/kmemleak.c +++ b/mm/kmemleak.c @@ -241,7 +241,7 @@ static int kmemleak_skip_disable; /* If there are leaks that can be reported */ static bool kmemleak_found_leaks; -static bool kmemleak_verbose; +static bool kmemleak_verbose = IS_ENABLED(CONFIG_DEBUG_KMEMLEAK_VERBOSE); module_param_named(verbose, kmemleak_verbose, bool, 0600); static void kmemleak_disable(void); -- cgit v1.2.3 From d9e4142e7635f6f7173854667c0695ce5b836bbc Mon Sep 17 00:00:00 2001 From: Breno Leitao Date: Mon, 16 Mar 2026 04:54:31 -0700 Subject: kho: add size parameter to kho_add_subtree() Patch series "kho: history: track previous kernel version and kexec boot count", v9. Use Kexec Handover (KHO) to pass the previous kernel's version string and the number of kexec reboots since the last cold boot to the next kernel, and print it at boot time. Example ======= [ 0.000000] Linux version 6.19.0-rc3-upstream-00047-ge5d992347849 ... [ 0.000000] KHO: exec from: 6.19.0-rc4-next-20260107upstream-00004-g3071b0dc4498 (count 1) Motivation ========== Bugs that only reproduce when kexecing from specific kernel versions are difficult to diagnose. These issues occur when a buggy kernel kexecs into a new kernel, with the bug manifesting only in the second kernel. Recent examples include: * eb2266312507 ("x86/boot: Fix page table access in 5-level to 4-level paging transition") * 77d48d39e991 ("efistub/tpm: Use ACPI reclaim memory for event log to avoid corruption") * 64b45dd46e15 ("x86/efi: skip memattr table on kexec boot") As kexec-based reboots become more common, these version-dependent bugs are appearing more frequently. At scale, correlating crashes to the previous kernel version is challenging, especially when issues only occur in specific transition scenarios. Some bugs manifest only after multiple consecutive kexec reboots. 
Tracking the kexec count helps identify these cases (this metric is already used by the live update sub-system). KHO provides a reliable mechanism to pass information between kernels. By carrying the previous kernel's release string and kexec count forward, we can print this context at boot time to aid debugging. The goal of this feature is to have this information printed early in boot so that users can trace back kernel releases across kexec reboots. Systemd is not helpful because we cannot assume that the previous kernel has systemd or even write access to the disk (common when using Linux as a bootloader). This patch (of 6): kho_add_subtree() assumes the fdt argument is always an FDT and calls fdt_totalsize() on it in the debugfs code path. This assumption will break if a caller passes arbitrary data instead of an FDT. When CONFIG_KEXEC_HANDOVER_DEBUGFS is enabled, kho_debugfs_fdt_add() calls __kho_debugfs_fdt_add(), which executes: f->wrapper.size = fdt_totalsize(fdt); Fix this by adding an explicit size parameter to kho_add_subtree() so callers specify the blob size. This allows subtrees to contain arbitrary data formats, not just FDTs. Update all callers: - memblock.c: use fdt_totalsize(fdt) - luo_core.c: use fdt_totalsize(fdt_out) - test_kho.c: use fdt_totalsize() - kexec_handover.c (root fdt): use fdt_totalsize(kho_out.fdt) Also update __kho_debugfs_fdt_add() to receive the size explicitly instead of computing it internally via fdt_totalsize(). In kho_in_debugfs_init(), pass fdt_totalsize() for the root FDT and sub-blobs since all current users are FDTs. A subsequent patch will persist the size in the KHO FDT so the incoming side can handle non-FDT blobs correctly. Link: https://lore.kernel.org/20260323110747.193569-1-duanchenghao@kylinos.cn Link: https://lore.kernel.org/20260316-kho-v9-1-ed6dcd951988@debian.org Signed-off-by: Breno Leitao Suggested-by: Pratyush Yadav Reviewed-by: Mike Rapoport (Microsoft) Reviewed-by: Pratyush Yadav Cc: Alexander Graf Cc: David Hildenbrand Cc: Jonathan Corbet Cc: "Liam R.
Howlett" Cc: Lorenzo Stoakes Cc: Michal Hocko Cc: Pasha Tatashin Cc: SeongJae Park Cc: Shuah Khan Cc: Suren Baghdasaryan Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- include/linux/kexec_handover.h | 4 ++-- kernel/liveupdate/kexec_handover.c | 8 +++++--- kernel/liveupdate/kexec_handover_debugfs.c | 15 +++++++++------ kernel/liveupdate/kexec_handover_internal.h | 5 +++-- kernel/liveupdate/luo_core.c | 3 ++- lib/test_kho.c | 3 ++- mm/memblock.c | 2 +- 7 files changed, 24 insertions(+), 16 deletions(-) diff --git a/include/linux/kexec_handover.h b/include/linux/kexec_handover.h index ac4129d1d741..abb1d324f42d 100644 --- a/include/linux/kexec_handover.h +++ b/include/linux/kexec_handover.h @@ -32,7 +32,7 @@ void kho_restore_free(void *mem); struct folio *kho_restore_folio(phys_addr_t phys); struct page *kho_restore_pages(phys_addr_t phys, unsigned long nr_pages); void *kho_restore_vmalloc(const struct kho_vmalloc *preservation); -int kho_add_subtree(const char *name, void *fdt); +int kho_add_subtree(const char *name, void *fdt, size_t size); void kho_remove_subtree(void *fdt); int kho_retrieve_subtree(const char *name, phys_addr_t *phys); @@ -97,7 +97,7 @@ static inline void *kho_restore_vmalloc(const struct kho_vmalloc *preservation) return NULL; } -static inline int kho_add_subtree(const char *name, void *fdt) +static inline int kho_add_subtree(const char *name, void *fdt, size_t size) { return -EOPNOTSUPP; } diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c index 532f455c5d4f..8cc25e29ff91 100644 --- a/kernel/liveupdate/kexec_handover.c +++ b/kernel/liveupdate/kexec_handover.c @@ -727,6 +727,7 @@ err_disable_kho: * kho_add_subtree - record the physical address of a sub FDT in KHO root tree. * @name: name of the sub tree. * @fdt: the sub tree blob. + * @size: size of the blob in bytes. * * Creates a new child node named @name in KHO root FDT and records * the physical address of @fdt. 
The pages of @fdt must also be preserved @@ -738,7 +739,7 @@ err_disable_kho: * * Return: 0 on success, error code on failure */ -int kho_add_subtree(const char *name, void *fdt) +int kho_add_subtree(const char *name, void *fdt, size_t size) { phys_addr_t phys = virt_to_phys(fdt); void *root_fdt = kho_out.fdt; @@ -763,7 +764,7 @@ int kho_add_subtree(const char *name, void *fdt) if (err < 0) goto out_pack; - WARN_ON_ONCE(kho_debugfs_fdt_add(&kho_out.dbg, name, fdt, false)); + WARN_ON_ONCE(kho_debugfs_fdt_add(&kho_out.dbg, name, fdt, size, false)); out_pack: fdt_pack(root_fdt); @@ -1431,7 +1432,8 @@ static __init int kho_init(void) } WARN_ON_ONCE(kho_debugfs_fdt_add(&kho_out.dbg, "fdt", - kho_out.fdt, true)); + kho_out.fdt, + fdt_totalsize(kho_out.fdt), true)); return 0; diff --git a/kernel/liveupdate/kexec_handover_debugfs.c b/kernel/liveupdate/kexec_handover_debugfs.c index acf368222682..ca0153736af1 100644 --- a/kernel/liveupdate/kexec_handover_debugfs.c +++ b/kernel/liveupdate/kexec_handover_debugfs.c @@ -25,7 +25,7 @@ struct fdt_debugfs { }; static int __kho_debugfs_fdt_add(struct list_head *list, struct dentry *dir, - const char *name, const void *fdt) + const char *name, const void *fdt, size_t size) { struct fdt_debugfs *f; struct dentry *file; @@ -35,7 +35,7 @@ static int __kho_debugfs_fdt_add(struct list_head *list, struct dentry *dir, return -ENOMEM; f->wrapper.data = (void *)fdt; - f->wrapper.size = fdt_totalsize(fdt); + f->wrapper.size = size; file = debugfs_create_blob(name, 0400, dir, &f->wrapper); if (IS_ERR(file)) { @@ -50,7 +50,7 @@ static int __kho_debugfs_fdt_add(struct list_head *list, struct dentry *dir, } int kho_debugfs_fdt_add(struct kho_debugfs *dbg, const char *name, - const void *fdt, bool root) + const void *fdt, size_t size, bool root) { struct dentry *dir; @@ -59,7 +59,7 @@ int kho_debugfs_fdt_add(struct kho_debugfs *dbg, const char *name, else dir = dbg->sub_fdt_dir; - return __kho_debugfs_fdt_add(&dbg->fdt_list, dir, name, fdt); + return __kho_debugfs_fdt_add(&dbg->fdt_list, dir, name, fdt, size); } void kho_debugfs_fdt_remove(struct kho_debugfs *dbg, void *fdt) @@ -113,7 +113,8 @@ __init void kho_in_debugfs_init(struct kho_debugfs *dbg, const void *fdt) goto err_rmdir; } - err = __kho_debugfs_fdt_add(&dbg->fdt_list, dir, "fdt", fdt); + err = __kho_debugfs_fdt_add(&dbg->fdt_list, dir, "fdt", fdt, + fdt_totalsize(fdt)); if (err) goto err_rmdir; @@ -121,6 +122,7 @@ __init void kho_in_debugfs_init(struct kho_debugfs *dbg, const void *fdt) int len = 0; const char *name = fdt_get_name(fdt, child, NULL); const u64 *fdt_phys; + void *sub_fdt; fdt_phys = fdt_getprop(fdt, child, KHO_FDT_SUB_TREE_PROP_NAME, &len); if (!fdt_phys) @@ -130,8 +132,9 @@ __init void kho_in_debugfs_init(struct kho_debugfs *dbg, const void *fdt) name, len); continue; } + sub_fdt = phys_to_virt(*fdt_phys); err = __kho_debugfs_fdt_add(&dbg->fdt_list, sub_fdt_dir, name, - phys_to_virt(*fdt_phys)); + sub_fdt, fdt_totalsize(sub_fdt)); if (err) { pr_warn("failed to add fdt %s to debugfs: %pe\n", name, ERR_PTR(err)); diff --git a/kernel/liveupdate/kexec_handover_internal.h b/kernel/liveupdate/kexec_handover_internal.h index 9a832a35254c..2a28cb8db9b0 100644 --- a/kernel/liveupdate/kexec_handover_internal.h +++ b/kernel/liveupdate/kexec_handover_internal.h @@ -27,7 +27,7 @@ int kho_debugfs_init(void); void kho_in_debugfs_init(struct kho_debugfs *dbg, const void *fdt); int kho_out_debugfs_init(struct kho_debugfs *dbg); int kho_debugfs_fdt_add(struct kho_debugfs *dbg, const char *name, - const void 
*fdt, bool root); + const void *fdt, size_t size, bool root); void kho_debugfs_fdt_remove(struct kho_debugfs *dbg, void *fdt); #else static inline int kho_debugfs_init(void) { return 0; } @@ -35,7 +35,8 @@ static inline void kho_in_debugfs_init(struct kho_debugfs *dbg, const void *fdt) { } static inline int kho_out_debugfs_init(struct kho_debugfs *dbg) { return 0; } static inline int kho_debugfs_fdt_add(struct kho_debugfs *dbg, const char *name, - const void *fdt, bool root) { return 0; } + const void *fdt, size_t size, + bool root) { return 0; } static inline void kho_debugfs_fdt_remove(struct kho_debugfs *dbg, void *fdt) { } #endif /* CONFIG_KEXEC_HANDOVER_DEBUGFS */ diff --git a/kernel/liveupdate/luo_core.c b/kernel/liveupdate/luo_core.c index 84ac728d63ba..04d06a0906c0 100644 --- a/kernel/liveupdate/luo_core.c +++ b/kernel/liveupdate/luo_core.c @@ -172,7 +172,8 @@ static int __init luo_fdt_setup(void) if (err) goto exit_free; - err = kho_add_subtree(LUO_FDT_KHO_ENTRY_NAME, fdt_out); + err = kho_add_subtree(LUO_FDT_KHO_ENTRY_NAME, fdt_out, + fdt_totalsize(fdt_out)); if (err) goto exit_free; luo_global.fdt_out = fdt_out; diff --git a/lib/test_kho.c b/lib/test_kho.c index 7ef9e4061869..263182437315 100644 --- a/lib/test_kho.c +++ b/lib/test_kho.c @@ -143,7 +143,8 @@ static int kho_test_preserve(struct kho_test_state *state) if (err) goto err_unpreserve_data; - err = kho_add_subtree(KHO_TEST_FDT, folio_address(state->fdt)); + err = kho_add_subtree(KHO_TEST_FDT, folio_address(state->fdt), + fdt_totalsize(folio_address(state->fdt))); if (err) goto err_unpreserve_data; diff --git a/mm/memblock.c b/mm/memblock.c index b3ddfdec7a80..91d4162eec63 100644 --- a/mm/memblock.c +++ b/mm/memblock.c @@ -2510,7 +2510,7 @@ static int __init prepare_kho_fdt(void) if (err) goto err_unpreserve_fdt; - err = kho_add_subtree(MEMBLOCK_KHO_FDT, fdt); + err = kho_add_subtree(MEMBLOCK_KHO_FDT, fdt, fdt_totalsize(fdt)); if (err) goto err_unpreserve_fdt; -- cgit v1.2.3 From 4916ae386760ad666eafa8afc075957bf479afbc Mon Sep 17 00:00:00 2001 From: Breno Leitao Date: Mon, 16 Mar 2026 04:54:32 -0700 Subject: kho: rename fdt parameter to blob in kho_add/remove_subtree() Since kho_add_subtree() now accepts arbitrary data blobs (not just FDTs), rename the parameter from 'fdt' to 'blob' to better reflect its purpose. Apply the same rename to kho_remove_subtree() for consistency. Also rename kho_debugfs_fdt_add() and kho_debugfs_fdt_remove() to kho_debugfs_blob_add() and kho_debugfs_blob_remove() respectively, with the same parameter rename from 'fdt' to 'blob'. Link: https://lore.kernel.org/20260316-kho-v9-2-ed6dcd951988@debian.org Signed-off-by: Breno Leitao Reviewed-by: Mike Rapoport (Microsoft) Reviewed-by: Pratyush Yadav Cc: Alexander Graf Cc: David Hildenbrand Cc: Jonathan Corbet Cc: "Liam R. 
Howlett" Cc: Lorenzo Stoakes Cc: Michal Hocko Cc: Pasha Tatashin Cc: SeongJae Park Cc: Shuah Khan Cc: Suren Baghdasaryan Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- Documentation/admin-guide/mm/kho.rst | 2 +- include/linux/kexec_handover.h | 8 +++---- kernel/liveupdate/kexec_handover.c | 33 +++++++++++++++-------------- kernel/liveupdate/kexec_handover_debugfs.c | 25 +++++++++++----------- kernel/liveupdate/kexec_handover_internal.h | 16 +++++++------- 5 files changed, 43 insertions(+), 41 deletions(-) diff --git a/Documentation/admin-guide/mm/kho.rst b/Documentation/admin-guide/mm/kho.rst index cb9a20f64920..6a4ddf344046 100644 --- a/Documentation/admin-guide/mm/kho.rst +++ b/Documentation/admin-guide/mm/kho.rst @@ -80,5 +80,5 @@ stabilized. it finished to interpret their metadata. ``/sys/kernel/debug/kho/in/sub_fdts/`` - Similar to ``kho/out/sub_fdts/``, but contains sub FDT blobs + Similar to ``kho/out/sub_fdts/``, but contains sub blobs of KHO producers passed from the old kernel. diff --git a/include/linux/kexec_handover.h b/include/linux/kexec_handover.h index abb1d324f42d..0666cf298c7f 100644 --- a/include/linux/kexec_handover.h +++ b/include/linux/kexec_handover.h @@ -32,8 +32,8 @@ void kho_restore_free(void *mem); struct folio *kho_restore_folio(phys_addr_t phys); struct page *kho_restore_pages(phys_addr_t phys, unsigned long nr_pages); void *kho_restore_vmalloc(const struct kho_vmalloc *preservation); -int kho_add_subtree(const char *name, void *fdt, size_t size); -void kho_remove_subtree(void *fdt); +int kho_add_subtree(const char *name, void *blob, size_t size); +void kho_remove_subtree(void *blob); int kho_retrieve_subtree(const char *name, phys_addr_t *phys); void kho_memory_init(void); @@ -97,12 +97,12 @@ static inline void *kho_restore_vmalloc(const struct kho_vmalloc *preservation) return NULL; } -static inline int kho_add_subtree(const char *name, void *fdt, size_t size) +static inline int kho_add_subtree(const char *name, void *blob, size_t size) { return -EOPNOTSUPP; } -static inline void kho_remove_subtree(void *fdt) { } +static inline void kho_remove_subtree(void *blob) { } static inline int kho_retrieve_subtree(const char *name, phys_addr_t *phys) { diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c index 8cc25e29ff91..711b6c3376e7 100644 --- a/kernel/liveupdate/kexec_handover.c +++ b/kernel/liveupdate/kexec_handover.c @@ -724,13 +724,13 @@ err_disable_kho: } /** - * kho_add_subtree - record the physical address of a sub FDT in KHO root tree. + * kho_add_subtree - record the physical address of a sub blob in KHO root tree. * @name: name of the sub tree. - * @fdt: the sub tree blob. + * @blob: the sub tree blob. * @size: size of the blob in bytes. * * Creates a new child node named @name in KHO root FDT and records - * the physical address of @fdt. The pages of @fdt must also be preserved + * the physical address of @blob. The pages of @blob must also be preserved * by KHO for the new kernel to retrieve it after kexec. 
* * A debugfs blob entry is also created at @@ -739,9 +739,9 @@ err_disable_kho: * * Return: 0 on success, error code on failure */ -int kho_add_subtree(const char *name, void *fdt, size_t size) +int kho_add_subtree(const char *name, void *blob, size_t size) { - phys_addr_t phys = virt_to_phys(fdt); + phys_addr_t phys = virt_to_phys(blob); void *root_fdt = kho_out.fdt; int err = -ENOMEM; int off, fdt_err; @@ -764,7 +764,8 @@ int kho_add_subtree(const char *name, void *fdt, size_t size) if (err < 0) goto out_pack; - WARN_ON_ONCE(kho_debugfs_fdt_add(&kho_out.dbg, name, fdt, size, false)); + WARN_ON_ONCE(kho_debugfs_blob_add(&kho_out.dbg, name, blob, + size, false)); out_pack: fdt_pack(root_fdt); @@ -773,9 +774,9 @@ out_pack: } EXPORT_SYMBOL_GPL(kho_add_subtree); -void kho_remove_subtree(void *fdt) +void kho_remove_subtree(void *blob) { - phys_addr_t target_phys = virt_to_phys(fdt); + phys_addr_t target_phys = virt_to_phys(blob); void *root_fdt = kho_out.fdt; int off; int err; @@ -797,7 +798,7 @@ void kho_remove_subtree(void *fdt) if ((phys_addr_t)*val == target_phys) { fdt_del_node(root_fdt, off); - kho_debugfs_fdt_remove(&kho_out.dbg, fdt); + kho_debugfs_blob_remove(&kho_out.dbg, blob); break; } } @@ -1293,11 +1294,11 @@ bool is_kho_boot(void) EXPORT_SYMBOL_GPL(is_kho_boot); /** - * kho_retrieve_subtree - retrieve a preserved sub FDT by its name. - * @name: the name of the sub FDT passed to kho_add_subtree(). - * @phys: if found, the physical address of the sub FDT is stored in @phys. + * kho_retrieve_subtree - retrieve a preserved sub blob by its name. + * @name: the name of the sub blob passed to kho_add_subtree(). + * @phys: if found, the physical address of the sub blob is stored in @phys. * - * Retrieve a preserved sub FDT named @name and store its physical + * Retrieve a preserved sub blob named @name and store its physical * address in @phys. 
* * Return: 0 on success, error code on failure @@ -1431,9 +1432,9 @@ static __init int kho_init(void) init_cma_reserved_pageblock(pfn_to_page(pfn)); } - WARN_ON_ONCE(kho_debugfs_fdt_add(&kho_out.dbg, "fdt", - kho_out.fdt, - fdt_totalsize(kho_out.fdt), true)); + WARN_ON_ONCE(kho_debugfs_blob_add(&kho_out.dbg, "fdt", + kho_out.fdt, + fdt_totalsize(kho_out.fdt), true)); return 0; diff --git a/kernel/liveupdate/kexec_handover_debugfs.c b/kernel/liveupdate/kexec_handover_debugfs.c index ca0153736af1..cab923e4f5c8 100644 --- a/kernel/liveupdate/kexec_handover_debugfs.c +++ b/kernel/liveupdate/kexec_handover_debugfs.c @@ -24,8 +24,9 @@ struct fdt_debugfs { struct dentry *file; }; -static int __kho_debugfs_fdt_add(struct list_head *list, struct dentry *dir, - const char *name, const void *fdt, size_t size) +static int __kho_debugfs_blob_add(struct list_head *list, struct dentry *dir, + const char *name, const void *blob, + size_t size) { struct fdt_debugfs *f; struct dentry *file; @@ -34,7 +35,7 @@ static int __kho_debugfs_fdt_add(struct list_head *list, struct dentry *dir, if (!f) return -ENOMEM; - f->wrapper.data = (void *)fdt; + f->wrapper.data = (void *)blob; f->wrapper.size = size; file = debugfs_create_blob(name, 0400, dir, &f->wrapper); @@ -49,8 +50,8 @@ static int __kho_debugfs_fdt_add(struct list_head *list, struct dentry *dir, return 0; } -int kho_debugfs_fdt_add(struct kho_debugfs *dbg, const char *name, - const void *fdt, size_t size, bool root) +int kho_debugfs_blob_add(struct kho_debugfs *dbg, const char *name, + const void *blob, size_t size, bool root) { struct dentry *dir; @@ -59,15 +60,15 @@ int kho_debugfs_fdt_add(struct kho_debugfs *dbg, const char *name, else dir = dbg->sub_fdt_dir; - return __kho_debugfs_fdt_add(&dbg->fdt_list, dir, name, fdt, size); + return __kho_debugfs_blob_add(&dbg->fdt_list, dir, name, blob, size); } -void kho_debugfs_fdt_remove(struct kho_debugfs *dbg, void *fdt) +void kho_debugfs_blob_remove(struct kho_debugfs *dbg, void *blob) { struct fdt_debugfs *ff; list_for_each_entry(ff, &dbg->fdt_list, list) { - if (ff->wrapper.data == fdt) { + if (ff->wrapper.data == blob) { debugfs_remove(ff->file); list_del(&ff->list); kfree(ff); @@ -113,8 +114,8 @@ __init void kho_in_debugfs_init(struct kho_debugfs *dbg, const void *fdt) goto err_rmdir; } - err = __kho_debugfs_fdt_add(&dbg->fdt_list, dir, "fdt", fdt, - fdt_totalsize(fdt)); + err = __kho_debugfs_blob_add(&dbg->fdt_list, dir, "fdt", fdt, + fdt_totalsize(fdt)); if (err) goto err_rmdir; @@ -133,8 +134,8 @@ __init void kho_in_debugfs_init(struct kho_debugfs *dbg, const void *fdt) continue; } sub_fdt = phys_to_virt(*fdt_phys); - err = __kho_debugfs_fdt_add(&dbg->fdt_list, sub_fdt_dir, name, - sub_fdt, fdt_totalsize(sub_fdt)); + err = __kho_debugfs_blob_add(&dbg->fdt_list, sub_fdt_dir, name, + sub_fdt, fdt_totalsize(sub_fdt)); if (err) { pr_warn("failed to add fdt %s to debugfs: %pe\n", name, ERR_PTR(err)); diff --git a/kernel/liveupdate/kexec_handover_internal.h b/kernel/liveupdate/kexec_handover_internal.h index 2a28cb8db9b0..0399ff107775 100644 --- a/kernel/liveupdate/kexec_handover_internal.h +++ b/kernel/liveupdate/kexec_handover_internal.h @@ -26,19 +26,19 @@ extern unsigned int kho_scratch_cnt; int kho_debugfs_init(void); void kho_in_debugfs_init(struct kho_debugfs *dbg, const void *fdt); int kho_out_debugfs_init(struct kho_debugfs *dbg); -int kho_debugfs_fdt_add(struct kho_debugfs *dbg, const char *name, - const void *fdt, size_t size, bool root); -void kho_debugfs_fdt_remove(struct kho_debugfs *dbg, void 
*fdt); +int kho_debugfs_blob_add(struct kho_debugfs *dbg, const char *name, + const void *blob, size_t size, bool root); +void kho_debugfs_blob_remove(struct kho_debugfs *dbg, void *blob); #else static inline int kho_debugfs_init(void) { return 0; } static inline void kho_in_debugfs_init(struct kho_debugfs *dbg, const void *fdt) { } static inline int kho_out_debugfs_init(struct kho_debugfs *dbg) { return 0; } -static inline int kho_debugfs_fdt_add(struct kho_debugfs *dbg, const char *name, - const void *fdt, size_t size, - bool root) { return 0; } -static inline void kho_debugfs_fdt_remove(struct kho_debugfs *dbg, - void *fdt) { } +static inline int kho_debugfs_blob_add(struct kho_debugfs *dbg, + const char *name, const void *blob, + size_t size, bool root) { return 0; } +static inline void kho_debugfs_blob_remove(struct kho_debugfs *dbg, + void *blob) { } #endif /* CONFIG_KEXEC_HANDOVER_DEBUGFS */ #ifdef CONFIG_KEXEC_HANDOVER_DEBUG -- cgit v1.2.3 From 85e41392820fcf0f7a3f9784cea907905f921358 Mon Sep 17 00:00:00 2001 From: Breno Leitao Date: Mon, 16 Mar 2026 04:54:33 -0700 Subject: kho: persist blob size in KHO FDT kho_add_subtree() accepts a size parameter but only forwards it to debugfs. The size is not persisted in the KHO FDT, so it is lost across kexec. This makes it impossible for the incoming kernel to determine the blob size without understanding the blob format. Store the blob size as a "blob-size" property in the KHO FDT alongside the "preserved-data" physical address. This allows the receiving kernel to recover the size for any blob regardless of format. Also extend kho_retrieve_subtree() with an optional size output parameter so callers can learn the blob size without needing to understand the blob format. Update all callers to pass NULL for the new parameter. Link: https://lore.kernel.org/20260316-kho-v9-3-ed6dcd951988@debian.org Signed-off-by: Breno Leitao Reviewed-by: Mike Rapoport (Microsoft) Reviewed-by: Pratyush Yadav Cc: Alexander Graf Cc: David Hildenbrand Cc: Jonathan Corbet Cc: "Liam R. 
Howlett" Cc: Lorenzo Stoakes Cc: Michal Hocko Cc: Pasha Tatashin Cc: SeongJae Park Cc: Shuah Khan Cc: Suren Baghdasaryan Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- include/linux/kexec_handover.h | 5 +++-- include/linux/kho/abi/kexec_handover.h | 20 ++++++++++++++++---- kernel/liveupdate/kexec_handover.c | 27 ++++++++++++++++++++++----- kernel/liveupdate/kexec_handover_debugfs.c | 3 ++- kernel/liveupdate/luo_core.c | 2 +- lib/test_kho.c | 2 +- mm/memblock.c | 2 +- 7 files changed, 46 insertions(+), 15 deletions(-) diff --git a/include/linux/kexec_handover.h b/include/linux/kexec_handover.h index 0666cf298c7f..8968c56d2d73 100644 --- a/include/linux/kexec_handover.h +++ b/include/linux/kexec_handover.h @@ -34,7 +34,7 @@ struct page *kho_restore_pages(phys_addr_t phys, unsigned long nr_pages); void *kho_restore_vmalloc(const struct kho_vmalloc *preservation); int kho_add_subtree(const char *name, void *blob, size_t size); void kho_remove_subtree(void *blob); -int kho_retrieve_subtree(const char *name, phys_addr_t *phys); +int kho_retrieve_subtree(const char *name, phys_addr_t *phys, size_t *size); void kho_memory_init(void); @@ -104,7 +104,8 @@ static inline int kho_add_subtree(const char *name, void *blob, size_t size) static inline void kho_remove_subtree(void *blob) { } -static inline int kho_retrieve_subtree(const char *name, phys_addr_t *phys) +static inline int kho_retrieve_subtree(const char *name, phys_addr_t *phys, + size_t *size) { return -EOPNOTSUPP; } diff --git a/include/linux/kho/abi/kexec_handover.h b/include/linux/kho/abi/kexec_handover.h index 6b7d8ef550f9..7e847a2339b0 100644 --- a/include/linux/kho/abi/kexec_handover.h +++ b/include/linux/kho/abi/kexec_handover.h @@ -41,25 +41,28 @@ * restore the preserved data.:: * * / { - * compatible = "kho-v2"; + * compatible = "kho-v3"; * * preserved-memory-map = <0x...>; * * { * preserved-data = <0x...>; + * blob-size = <0x...>; * }; * * { * preserved-data = <0x...>; + * blob-size = <0x...>; * }; * ... ... * { * preserved-data = <0x...>; + * blob-size = <0x...>; * }; * }; * * Root KHO Node (/): - * - compatible: "kho-v2" + * - compatible: "kho-v3" * * Indentifies the overall KHO ABI version. * @@ -78,16 +81,25 @@ * * Physical address pointing to a subnode data blob that is also * being preserved. + * + * - blob-size: u64 + * + * Size in bytes of the preserved data blob. This is needed because + * blobs may use arbitrary formats (not just FDT), so the size + * cannot be determined from the blob content alone. */ /* The compatible string for the KHO FDT root node. */ -#define KHO_FDT_COMPATIBLE "kho-v2" +#define KHO_FDT_COMPATIBLE "kho-v3" /* The FDT property for the preserved memory map. */ #define KHO_FDT_MEMORY_MAP_PROP_NAME "preserved-memory-map" /* The FDT property for preserved data blobs. */ -#define KHO_FDT_SUB_TREE_PROP_NAME "preserved-data" +#define KHO_SUB_TREE_PROP_NAME "preserved-data" + +/* The FDT property for the size of preserved data blobs. 
*/ +#define KHO_SUB_TREE_SIZE_PROP_NAME "blob-size" /** * DOC: Kexec Handover ABI for vmalloc Preservation diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c index 711b6c3376e7..adf6541f70f9 100644 --- a/kernel/liveupdate/kexec_handover.c +++ b/kernel/liveupdate/kexec_handover.c @@ -743,6 +743,7 @@ int kho_add_subtree(const char *name, void *blob, size_t size) { phys_addr_t phys = virt_to_phys(blob); void *root_fdt = kho_out.fdt; + u64 size_u64 = size; int err = -ENOMEM; int off, fdt_err; @@ -759,11 +760,16 @@ int kho_add_subtree(const char *name, void *blob, size_t size) goto out_pack; } - err = fdt_setprop(root_fdt, off, KHO_FDT_SUB_TREE_PROP_NAME, + err = fdt_setprop(root_fdt, off, KHO_SUB_TREE_PROP_NAME, &phys, sizeof(phys)); if (err < 0) goto out_pack; + err = fdt_setprop(root_fdt, off, KHO_SUB_TREE_SIZE_PROP_NAME, + &size_u64, sizeof(size_u64)); + if (err < 0) + goto out_pack; + WARN_ON_ONCE(kho_debugfs_blob_add(&kho_out.dbg, name, blob, size, false)); @@ -792,7 +798,7 @@ void kho_remove_subtree(void *blob) const u64 *val; int len; - val = fdt_getprop(root_fdt, off, KHO_FDT_SUB_TREE_PROP_NAME, &len); + val = fdt_getprop(root_fdt, off, KHO_SUB_TREE_PROP_NAME, &len); if (!val || len != sizeof(phys_addr_t)) continue; @@ -1297,13 +1303,14 @@ EXPORT_SYMBOL_GPL(is_kho_boot); * kho_retrieve_subtree - retrieve a preserved sub blob by its name. * @name: the name of the sub blob passed to kho_add_subtree(). * @phys: if found, the physical address of the sub blob is stored in @phys. + * @size: if not NULL and found, the size of the sub blob is stored in @size. * * Retrieve a preserved sub blob named @name and store its physical - * address in @phys. + * address in @phys and optionally its size in @size. * * Return: 0 on success, error code on failure */ -int kho_retrieve_subtree(const char *name, phys_addr_t *phys) +int kho_retrieve_subtree(const char *name, phys_addr_t *phys, size_t *size) { const void *fdt = kho_get_fdt(); const u64 *val; @@ -1319,12 +1326,22 @@ int kho_retrieve_subtree(const char *name, phys_addr_t *phys) if (offset < 0) return -ENOENT; - val = fdt_getprop(fdt, offset, KHO_FDT_SUB_TREE_PROP_NAME, &len); + val = fdt_getprop(fdt, offset, KHO_SUB_TREE_PROP_NAME, &len); if (!val || len != sizeof(*val)) return -EINVAL; *phys = (phys_addr_t)*val; + val = fdt_getprop(fdt, offset, KHO_SUB_TREE_SIZE_PROP_NAME, &len); + if (!val || len != sizeof(*val)) { + pr_warn("broken KHO subnode '%s': missing or invalid blob-size property\n", + name); + return -EINVAL; + } + + if (size) + *size = (size_t)*val; + return 0; } EXPORT_SYMBOL_GPL(kho_retrieve_subtree); diff --git a/kernel/liveupdate/kexec_handover_debugfs.c b/kernel/liveupdate/kexec_handover_debugfs.c index cab923e4f5c8..b416846810d7 100644 --- a/kernel/liveupdate/kexec_handover_debugfs.c +++ b/kernel/liveupdate/kexec_handover_debugfs.c @@ -125,7 +125,8 @@ __init void kho_in_debugfs_init(struct kho_debugfs *dbg, const void *fdt) const u64 *fdt_phys; void *sub_fdt; - fdt_phys = fdt_getprop(fdt, child, KHO_FDT_SUB_TREE_PROP_NAME, &len); + fdt_phys = fdt_getprop(fdt, child, + KHO_SUB_TREE_PROP_NAME, &len); if (!fdt_phys) continue; if (len != sizeof(*fdt_phys)) { diff --git a/kernel/liveupdate/luo_core.c b/kernel/liveupdate/luo_core.c index 04d06a0906c0..48b25c9abeda 100644 --- a/kernel/liveupdate/luo_core.c +++ b/kernel/liveupdate/luo_core.c @@ -88,7 +88,7 @@ static int __init luo_early_startup(void) } /* Retrieve LUO subtree, and verify its format. 
*/ - err = kho_retrieve_subtree(LUO_FDT_KHO_ENTRY_NAME, &fdt_phys); + err = kho_retrieve_subtree(LUO_FDT_KHO_ENTRY_NAME, &fdt_phys, NULL); if (err) { if (err != -ENOENT) { pr_err("failed to retrieve FDT '%s' from KHO: %pe\n", diff --git a/lib/test_kho.c b/lib/test_kho.c index 263182437315..aa6a0956bb8b 100644 --- a/lib/test_kho.c +++ b/lib/test_kho.c @@ -319,7 +319,7 @@ static int __init kho_test_init(void) if (!kho_is_enabled()) return 0; - err = kho_retrieve_subtree(KHO_TEST_FDT, &fdt_phys); + err = kho_retrieve_subtree(KHO_TEST_FDT, &fdt_phys, NULL); if (!err) { err = kho_test_restore(fdt_phys); if (err) diff --git a/mm/memblock.c b/mm/memblock.c index 91d4162eec63..a1c6dd0f6fad 100644 --- a/mm/memblock.c +++ b/mm/memblock.c @@ -2555,7 +2555,7 @@ static void *__init reserve_mem_kho_retrieve_fdt(void) if (fdt) return fdt; - err = kho_retrieve_subtree(MEMBLOCK_KHO_FDT, &fdt_phys); + err = kho_retrieve_subtree(MEMBLOCK_KHO_FDT, &fdt_phys, NULL); if (err) { if (err != -ENOENT) pr_warn("failed to retrieve FDT '%s' from KHO: %d\n", -- cgit v1.2.3 From 062dd306d99cc2e02f761124e064e6a3735e27b0 Mon Sep 17 00:00:00 2001 From: Breno Leitao Date: Mon, 16 Mar 2026 04:54:34 -0700 Subject: kho: fix kho_in_debugfs_init() to handle non-FDT blobs kho_in_debugfs_init() calls fdt_totalsize() to determine blob sizes, which assumes all blobs are FDTs. This breaks for non-FDT blobs like struct kho_kexec_metadata. Fix this by reading the "blob-size" property from the FDT (persisted by kho_add_subtree()) instead of calling fdt_totalsize(). Also rename local variables from fdt_phys/sub_fdt to blob_phys/blob for consistency with the non-FDT-specific naming. Link: https://lore.kernel.org/20260316-kho-v9-4-ed6dcd951988@debian.org Signed-off-by: Breno Leitao Reviewed-by: Pratyush Yadav Cc: Alexander Graf Cc: David Hildenbrand Cc: Jonathan Corbet Cc: "Liam R. 
Howlett" Cc: Lorenzo Stoakes Cc: Michal Hocko Cc: Mike Rapoport (Microsoft) Cc: Pasha Tatashin Cc: SeongJae Park Cc: Shuah Khan Cc: Suren Baghdasaryan Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- kernel/liveupdate/kexec_handover_debugfs.c | 32 ++++++++++++++++++++---------- 1 file changed, 21 insertions(+), 11 deletions(-) diff --git a/kernel/liveupdate/kexec_handover_debugfs.c b/kernel/liveupdate/kexec_handover_debugfs.c index b416846810d7..257ee8a52be6 100644 --- a/kernel/liveupdate/kexec_handover_debugfs.c +++ b/kernel/liveupdate/kexec_handover_debugfs.c @@ -122,24 +122,34 @@ __init void kho_in_debugfs_init(struct kho_debugfs *dbg, const void *fdt) fdt_for_each_subnode(child, fdt, 0) { int len = 0; const char *name = fdt_get_name(fdt, child, NULL); - const u64 *fdt_phys; - void *sub_fdt; + const u64 *blob_phys; + const u64 *blob_size; + void *blob; - fdt_phys = fdt_getprop(fdt, child, + blob_phys = fdt_getprop(fdt, child, KHO_SUB_TREE_PROP_NAME, &len); - if (!fdt_phys) + if (!blob_phys) continue; - if (len != sizeof(*fdt_phys)) { - pr_warn("node %s prop fdt has invalid length: %d\n", - name, len); + if (len != sizeof(*blob_phys)) { + pr_warn("node %s prop %s has invalid length: %d\n", + name, KHO_SUB_TREE_PROP_NAME, len); continue; } - sub_fdt = phys_to_virt(*fdt_phys); + + blob_size = fdt_getprop(fdt, child, + KHO_SUB_TREE_SIZE_PROP_NAME, &len); + if (!blob_size || len != sizeof(*blob_size)) { + pr_warn("node %s missing or invalid %s property\n", + name, KHO_SUB_TREE_SIZE_PROP_NAME); + continue; + } + + blob = phys_to_virt(*blob_phys); err = __kho_debugfs_blob_add(&dbg->fdt_list, sub_fdt_dir, name, - sub_fdt, fdt_totalsize(sub_fdt)); + blob, *blob_size); if (err) { - pr_warn("failed to add fdt %s to debugfs: %pe\n", name, - ERR_PTR(err)); + pr_warn("failed to add blob %s to debugfs: %pe\n", + name, ERR_PTR(err)); continue; } } -- cgit v1.2.3 From 76aa46b9e4049247858309c6e3527d477da2b2fe Mon Sep 17 00:00:00 2001 From: Breno Leitao Date: Mon, 16 Mar 2026 04:54:35 -0700 Subject: kho: kexec-metadata: track previous kernel chain Use Kexec Handover (KHO) to pass the previous kernel's version string and the number of kexec reboots since the last cold boot to the next kernel, and print it at boot time. Example output: [ 0.000000] KHO: exec from: 6.19.0-rc4-next-20260107 (count 1) Motivation ========== Bugs that only reproduce when kexecing from specific kernel versions are difficult to diagnose. These issues occur when a buggy kernel kexecs into a new kernel, with the bug manifesting only in the second kernel. Recent examples include the following commits: * commit eb2266312507 ("x86/boot: Fix page table access in 5-level to 4-level paging transition") * commit 77d48d39e991 ("efistub/tpm: Use ACPI reclaim memory for event log to avoid corruption") * commit 64b45dd46e15 ("x86/efi: skip memattr table on kexec boot") As kexec-based reboots become more common, these version-dependent bugs are appearing more frequently. At scale, correlating crashes to the previous kernel version is challenging, especially when issues only occur in specific transition scenarios. Implementation ============== The kexec metadata is stored as a plain C struct (struct kho_kexec_metadata) rather than FDT format, for simplicity and direct field access. It is registered via kho_add_subtree() as a separate subtree, keeping it independent from the core KHO ABI. 
This design choice: - Keeps the core KHO ABI minimal and stable - Allows the metadata format to evolve independently - Avoids requiring version bumps for all KHO consumers (LUO, etc.) when the metadata format changes The struct kho_kexec_metadata contains two fields: - previous_release: The kernel version that initiated the kexec - kexec_count: Number of kexec boots since last cold boot On cold boot, kexec_count starts at 0 and increments with each kexec. The count helps identify issues that only manifest after multiple consecutive kexec reboots. [leitao@debian.org: call kho_kexec_metadata_init() for both boot paths] Link: https://lore.kernel.org/all/20260309-kho-v8-5-c3abcf4ac750@debian.org/ [1] Link: https://lore.kernel.org/20260409-kho_fix_merge_issue-v1-1-710c84ceaa85@debian.org Link: https://lore.kernel.org/20260316-kho-v9-5-ed6dcd951988@debian.org Signed-off-by: Breno Leitao Acked-by: SeongJae Park Reviewed-by: Mike Rapoport (Microsoft) Reviewed-by: Pratyush Yadav Cc: Alexander Graf Cc: David Hildenbrand Cc: Jonathan Corbet Cc: "Liam R. Howlett" Cc: Lorenzo Stoakes Cc: Michal Hocko Cc: Pasha Tatashin Cc: Shuah Khan Cc: Suren Baghdasaryan Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- include/linux/kho/abi/kexec_metadata.h | 46 ++++++++++++++++ kernel/liveupdate/kexec_handover.c | 98 ++++++++++++++++++++++++++++++++++ 2 files changed, 144 insertions(+) create mode 100644 include/linux/kho/abi/kexec_metadata.h diff --git a/include/linux/kho/abi/kexec_metadata.h b/include/linux/kho/abi/kexec_metadata.h new file mode 100644 index 000000000000..e9e3f7e38a7c --- /dev/null +++ b/include/linux/kho/abi/kexec_metadata.h @@ -0,0 +1,46 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ + +/** + * DOC: Kexec Metadata ABI + * + * The "kexec-metadata" subtree stores optional metadata about the kexec chain. + * It is registered via kho_add_subtree(), keeping it independent from the core + * KHO ABI. This allows the metadata format to evolve without affecting other + * KHO consumers. + * + * The metadata is stored as a plain C struct rather than FDT format for + * simplicity and direct field access. + * + * Copyright (c) 2026 Meta Platforms, Inc. and affiliates. + * Copyright (c) 2026 Breno Leitao + */ + +#ifndef _LINUX_KHO_ABI_KEXEC_METADATA_H +#define _LINUX_KHO_ABI_KEXEC_METADATA_H + +#include +#include + +#define KHO_KEXEC_METADATA_VERSION 1 + +/** + * struct kho_kexec_metadata - Kexec metadata passed between kernels + * @version: ABI version of this struct (must be first field) + * @previous_release: Kernel version string that initiated the kexec + * @kexec_count: Number of kexec boots since last cold boot + * + * This structure is preserved across kexec and allows the new kernel to + * identify which kernel it was booted from and how many kexec reboots + * have occurred. + * + * __NEW_UTS_LEN is part of uABI, so it safe to use it in here. 
+ */ +struct kho_kexec_metadata { + u32 version; + char previous_release[__NEW_UTS_LEN + 1]; + u32 kexec_count; +} __packed; + +#define KHO_METADATA_NODE_NAME "kexec-metadata" + +#endif /* _LINUX_KHO_ABI_KEXEC_METADATA_H */ diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c index adf6541f70f9..94762de1fe5f 100644 --- a/kernel/liveupdate/kexec_handover.c +++ b/kernel/liveupdate/kexec_handover.c @@ -18,7 +18,9 @@ #include #include #include +#include #include +#include #include #include #include @@ -1268,6 +1270,8 @@ EXPORT_SYMBOL_GPL(kho_restore_free); struct kho_in { phys_addr_t fdt_phys; phys_addr_t scratch_phys; + char previous_release[__NEW_UTS_LEN + 1]; + u32 kexec_count; struct kho_debugfs dbg; }; @@ -1392,6 +1396,96 @@ static __init int kho_out_fdt_setup(void) return err; } +static void __init kho_in_kexec_metadata(void) +{ + struct kho_kexec_metadata *metadata; + phys_addr_t metadata_phys; + size_t blob_size; + int err; + + err = kho_retrieve_subtree(KHO_METADATA_NODE_NAME, &metadata_phys, + &blob_size); + if (err) + /* This is fine, previous kernel didn't export metadata */ + return; + + /* Check that, at least, "version" is present */ + if (blob_size < sizeof(u32)) { + pr_warn("kexec-metadata blob too small (%zu bytes)\n", + blob_size); + return; + } + + metadata = phys_to_virt(metadata_phys); + + if (metadata->version != KHO_KEXEC_METADATA_VERSION) { + pr_warn("kexec-metadata version %u not supported (expected %u)\n", + metadata->version, KHO_KEXEC_METADATA_VERSION); + return; + } + + if (blob_size < sizeof(*metadata)) { + pr_warn("kexec-metadata blob too small for v%u (%zu < %zu)\n", + metadata->version, blob_size, sizeof(*metadata)); + return; + } + + /* + * Copy data to the kernel structure that will persist during + * kernel lifetime. + */ + kho_in.kexec_count = metadata->kexec_count; + strscpy(kho_in.previous_release, metadata->previous_release, + sizeof(kho_in.previous_release)); + + pr_info("exec from: %s (count %u)\n", + kho_in.previous_release, kho_in.kexec_count); +} + +/* + * Create kexec metadata to pass kernel version and boot count to the + * next kernel. This keeps the core KHO ABI minimal and allows the + * metadata format to evolve independently. 
+ */ +static __init int kho_out_kexec_metadata(void) +{ + struct kho_kexec_metadata *metadata; + int err; + + metadata = kho_alloc_preserve(sizeof(*metadata)); + if (IS_ERR(metadata)) + return PTR_ERR(metadata); + + metadata->version = KHO_KEXEC_METADATA_VERSION; + strscpy(metadata->previous_release, init_uts_ns.name.release, + sizeof(metadata->previous_release)); + /* kho_in.kexec_count is set to 0 on cold boot */ + metadata->kexec_count = kho_in.kexec_count + 1; + + err = kho_add_subtree(KHO_METADATA_NODE_NAME, metadata, + sizeof(*metadata)); + if (err) + kho_unpreserve_free(metadata); + + return err; +} + +static int __init kho_kexec_metadata_init(const void *fdt) +{ + int err; + + if (fdt) + kho_in_kexec_metadata(); + + /* Populate kexec metadata for the possible next kexec */ + err = kho_out_kexec_metadata(); + if (err) + pr_warn("failed to initialize kexec-metadata subtree: %d\n", + err); + + return err; +} + static __init int kho_init(void) { struct kho_radix_tree *tree = &kho_out.radix_tree; @@ -1425,6 +1519,10 @@ static __init int kho_init(void) if (err) goto err_free_fdt; + err = kho_kexec_metadata_init(fdt); + if (err) + goto err_free_fdt; + if (fdt) { kho_in_debugfs_init(&kho_in.dbg, fdt); return 0; -- cgit v1.2.3 From e524feaad5467f39a56d2697f7db31f02796dc7d Mon Sep 17 00:00:00 2001 From: Breno Leitao Date: Mon, 16 Mar 2026 04:54:36 -0700 Subject: kho: document kexec-metadata tracking feature Add documentation for the kexec-metadata feature that tracks the previous kernel version and kexec boot count across kexec reboots. This helps diagnose bugs that only reproduce when kexecing from specific kernel versions. Link: https://lore.kernel.org/20260316-kho-v9-6-ed6dcd951988@debian.org Signed-off-by: Breno Leitao Suggested-by: Mike Rapoport Reviewed-by: Mike Rapoport (Microsoft) Reviewed-by: Pratyush Yadav Cc: Alexander Graf Cc: David Hildenbrand Cc: Jonathan Corbet Cc: "Liam R. Howlett" Cc: Lorenzo Stoakes Cc: Michal Hocko Cc: Pasha Tatashin Cc: SeongJae Park Cc: Shuah Khan Cc: Suren Baghdasaryan Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- Documentation/admin-guide/mm/kho.rst | 39 ++++++++++++++++++++++++++++++++++++ 1 file changed, 39 insertions(+) diff --git a/Documentation/admin-guide/mm/kho.rst b/Documentation/admin-guide/mm/kho.rst index 6a4ddf344046..2c26e560bd78 100644 --- a/Documentation/admin-guide/mm/kho.rst +++ b/Documentation/admin-guide/mm/kho.rst @@ -42,6 +42,45 @@ For example, if you used ``reserve_mem`` command line parameter to create an early memory reservation, the new kernel will have that memory at the same physical address as the old kernel. +Kexec Metadata +============== + +KHO automatically tracks metadata about the kexec chain, passing information +about the previous kernel to the next kernel. This feature helps diagnose +bugs that only reproduce when kexecing from specific kernel versions. + +On each KHO kexec, the kernel logs the previous kernel's version and the +number of kexec reboots since the last cold boot:: + + [ 0.000000] KHO: exec from: 6.19.0-rc4-next-20260107 (count 1) + +The metadata includes: + +``previous_release`` + The kernel version string (from ``uname -r``) of the kernel that + initiated the kexec. + +``kexec_count`` + The number of kexec boots since the last cold boot. On cold boot, + this counter starts at 0 and increments with each kexec. This helps + identify issues that only manifest after multiple consecutive kexec + reboots. 
+ +Use Cases +--------- + +This metadata is particularly useful for debugging kexec transition bugs, +where a buggy kernel kexecs into a new kernel and the bug manifests only +in the second kernel. Examples of such bugs include: + +- Memory corruption from the previous kernel affecting the new kernel +- Incorrect hardware state left by the previous kernel +- Firmware/ACPI state issues that only appear in kexec scenarios + +At scale, correlating crashes to the previous kernel version enables +faster root cause analysis when issues only occur in specific kernel +transition scenarios. + debugfs Interfaces ================== -- cgit v1.2.3 From 00d0b372374f2528394aabf7b1f53f8dafe294de Mon Sep 17 00:00:00 2001 From: Pasha Tatashin Date: Thu, 26 Mar 2026 16:39:41 +0000 Subject: liveupdate: prevent double management of files Patch series "liveupdate: prevent double preservation", v4. Currently, LUO does not prevent the same file from being managed twice across different active sessions. Because LUO preserves files of entirely different types (memfd, and the upcoming vfiofd [1], iommufd [2] and guestmemfd, and possibly kvmfd/cpufd), there is no common private data or other guarantee that prevents the same file from being preserved twice, short of keying on the inode or using some slower and more expensive method such as hashtables. This patch (of 4): Currently, LUO does not prevent the same file from being managed twice across different active sessions. Use a global xarray luo_preserved_files to keep track of file identifiers being preserved by LUO. Update luo_preserve_file() to check and insert the file identifier into this xarray when it is preserved, and erase it in luo_file_unpreserve_files() when it is released. To allow handlers to define what constitutes a "unique" file (e.g., different struct file objects pointing to the same hardware resource), add a get_id() callback to struct liveupdate_file_ops. If not provided, the default identifier is the struct file pointer itself. This ensures that the same file (or resource) cannot be managed by multiple sessions. If another session attempts to preserve an already managed file, it will now fail with -EBUSY. Link: https://lore.kernel.org/20260326163943.574070-1-pasha.tatashin@soleen.com Link: https://lore.kernel.org/20260326163943.574070-2-pasha.tatashin@soleen.com Link: https://lore.kernel.org/all/20260129212510.967611-1-dmatlack@google.com [1] Link: https://lore.kernel.org/all/20260203220948.2176157-1-skhawaja@google.com [2] Signed-off-by: Pasha Tatashin Reviewed-by: Samiullah Khawaja Reviewed-by: Mike Rapoport (Microsoft) Cc: David Matlack Cc: Pratyush Yadav Cc: Shuah Khan Cc: Christian Brauner Signed-off-by: Andrew Morton --- include/linux/liveupdate.h | 2 ++ kernel/liveupdate/luo_file.c | 32 ++++++++++++++++++++++++++++++-- 2 files changed, 32 insertions(+), 2 deletions(-) diff --git a/include/linux/liveupdate.h b/include/linux/liveupdate.h index dd11fdc76a5f..61325ad26526 100644 --- a/include/linux/liveupdate.h +++ b/include/linux/liveupdate.h @@ -63,6 +63,7 @@ struct liveupdate_file_op_args { * finish, in order to do successful finish calls for all * resources in the session. * @finish: Required. Final cleanup in the new kernel. + * @get_id: Optional. Returns a unique identifier for the file.
* @owner: Module reference * * All operations (except can_preserve) receive a pointer to a @@ -78,6 +79,7 @@ struct liveupdate_file_ops { int (*retrieve)(struct liveupdate_file_op_args *args); bool (*can_finish)(struct liveupdate_file_op_args *args); void (*finish)(struct liveupdate_file_op_args *args); + unsigned long (*get_id)(struct file *file); struct module *owner; }; diff --git a/kernel/liveupdate/luo_file.c b/kernel/liveupdate/luo_file.c index 5acee4174bf0..09103cf81107 100644 --- a/kernel/liveupdate/luo_file.c +++ b/kernel/liveupdate/luo_file.c @@ -108,12 +108,16 @@ #include #include #include +#include #include #include #include "luo_internal.h" static LIST_HEAD(luo_file_handler_list); +/* Keep track of files being preserved by LUO */ +static DEFINE_XARRAY(luo_preserved_files); + /* 2 4K pages, give space for 128 files per file_set */ #define LUO_FILE_PGCNT 2ul #define LUO_FILE_MAX \ @@ -203,6 +207,12 @@ static void luo_free_files_mem(struct luo_file_set *file_set) file_set->files = NULL; } +static unsigned long luo_get_id(struct liveupdate_file_handler *fh, + struct file *file) +{ + return fh->ops->get_id ? fh->ops->get_id(file) : (unsigned long)file; +} + static bool luo_token_is_used(struct luo_file_set *file_set, u64 token) { struct luo_file *iter; @@ -248,6 +258,7 @@ static bool luo_token_is_used(struct luo_file_set *file_set, u64 token) * Context: Can be called from an ioctl handler during normal system operation. * Return: 0 on success. Returns a negative errno on failure: * -EEXIST if the token is already used. + * -EBUSY if the file descriptor is already preserved by another session. * -EBADF if the file descriptor is invalid. * -ENOSPC if the file_set is full. * -ENOENT if no compatible handler is found. @@ -288,10 +299,15 @@ int luo_preserve_file(struct luo_file_set *file_set, u64 token, int fd) if (err) goto err_free_files_mem; - err = luo_flb_file_preserve(fh); + err = xa_insert(&luo_preserved_files, luo_get_id(fh, file), + file, GFP_KERNEL); if (err) goto err_free_files_mem; + err = luo_flb_file_preserve(fh); + if (err) + goto err_erase_xa; + luo_file = kzalloc_obj(*luo_file); if (!luo_file) { err = -ENOMEM; @@ -320,6 +336,8 @@ err_kfree: kfree(luo_file); err_flb_unpreserve: luo_flb_file_unpreserve(fh); +err_erase_xa: + xa_erase(&luo_preserved_files, luo_get_id(fh, file)); err_free_files_mem: luo_free_files_mem(file_set); err_fput: @@ -363,6 +381,8 @@ void luo_file_unpreserve_files(struct luo_file_set *file_set) luo_file->fh->ops->unpreserve(&args); luo_flb_file_unpreserve(luo_file->fh); + xa_erase(&luo_preserved_files, + luo_get_id(luo_file->fh, luo_file->file)); list_del(&luo_file->list); file_set->count--; @@ -606,6 +626,11 @@ int luo_retrieve_file(struct luo_file_set *file_set, u64 token, luo_file->file = args.file; /* Get reference so we can keep this file in LUO until finish */ get_file(luo_file->file); + + WARN_ON(xa_insert(&luo_preserved_files, + luo_get_id(luo_file->fh, luo_file->file), + luo_file->file, GFP_KERNEL)); + *filep = luo_file->file; luo_file->retrieve_status = 1; @@ -701,8 +726,11 @@ int luo_file_finish(struct luo_file_set *file_set) luo_file_finish_one(file_set, luo_file); - if (luo_file->file) + if (luo_file->file) { + xa_erase(&luo_preserved_files, + luo_get_id(luo_file->fh, luo_file->file)); fput(luo_file->file); + } list_del(&luo_file->list); file_set->count--; mutex_destroy(&luo_file->mutex); -- cgit v1.2.3 From bc3a5763f4664c5da812eb3f14d55b0c99abd4ab Mon Sep 17 00:00:00 2001 From: Pasha Tatashin Date: Thu, 26 Mar 2026 16:39:42 +0000 
Subject: memfd: implement get_id for memfd_luo Memfds are identified by their underlying inode. Implement get_id for memfd_luo to return the inode pointer. This prevents the same memfd from being managed twice by LUO if the same inode is pointed by multiple file objects. Link: https://lore.kernel.org/20260326163943.574070-3-pasha.tatashin@soleen.com Signed-off-by: Pasha Tatashin Reviewed-by: Pratyush Yadav (Google) Cc: David Matlack Cc: Mike Rapoport (Microsoft) Cc: Samiullah Khawaja Cc: Shuah Khan Cc: Christian Brauner Signed-off-by: Andrew Morton --- mm/memfd_luo.c | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/mm/memfd_luo.c b/mm/memfd_luo.c index bc7f4f045edf..9130e6ce396d 100644 --- a/mm/memfd_luo.c +++ b/mm/memfd_luo.c @@ -560,6 +560,11 @@ static bool memfd_luo_can_preserve(struct liveupdate_file_handler *handler, return shmem_file(file) && !inode->i_nlink; } +static unsigned long memfd_luo_get_id(struct file *file) +{ + return (unsigned long)file_inode(file); +} + static const struct liveupdate_file_ops memfd_luo_file_ops = { .freeze = memfd_luo_freeze, .finish = memfd_luo_finish, @@ -567,6 +572,7 @@ static const struct liveupdate_file_ops memfd_luo_file_ops = { .preserve = memfd_luo_preserve, .unpreserve = memfd_luo_unpreserve, .can_preserve = memfd_luo_can_preserve, + .get_id = memfd_luo_get_id, .owner = THIS_MODULE, }; -- cgit v1.2.3 From e3e613a33e654a37c4fb34b7eb2776008c461e0c Mon Sep 17 00:00:00 2001 From: Pasha Tatashin Date: Thu, 26 Mar 2026 16:39:43 +0000 Subject: selftests: liveupdate: add test for double preservation Verify that a file can only be preserved once across all active sessions. Attempting to preserve it a second time, whether in the same or a different session, should fail with EBUSY. Link: https://lore.kernel.org/20260326163943.574070-4-pasha.tatashin@soleen.com Signed-off-by: Pasha Tatashin Reviewed-by: Mike Rapoport (Microsoft) Reviewed-by: Samiullah Khawaja Cc: David Matlack Cc: Pratyush Yadav Cc: Shuah Khan Cc: Christian Brauner Signed-off-by: Andrew Morton --- tools/testing/selftests/liveupdate/liveupdate.c | 41 +++++++++++++++++++++++++ 1 file changed, 41 insertions(+) diff --git a/tools/testing/selftests/liveupdate/liveupdate.c b/tools/testing/selftests/liveupdate/liveupdate.c index c2878e3d5ef9..37c808fbe1e9 100644 --- a/tools/testing/selftests/liveupdate/liveupdate.c +++ b/tools/testing/selftests/liveupdate/liveupdate.c @@ -345,4 +345,45 @@ TEST_F(liveupdate_device, preserve_unsupported_fd) ASSERT_EQ(close(session_fd), 0); } +/* + * Test Case: Prevent Double Preservation + * + * Verifies that a file (memfd) can only be preserved once across all active + * sessions. Attempting to preserve it a second time, whether in the same or + * a different session, should fail with EBUSY. 
+ */ +TEST_F(liveupdate_device, prevent_double_preservation) +{ + int session_fd1, session_fd2, mem_fd; + int ret; + + self->fd1 = open(LIVEUPDATE_DEV, O_RDWR); + if (self->fd1 < 0 && errno == ENOENT) + SKIP(return, "%s does not exist", LIVEUPDATE_DEV); + ASSERT_GE(self->fd1, 0); + + session_fd1 = create_session(self->fd1, "double-preserve-session-1"); + ASSERT_GE(session_fd1, 0); + session_fd2 = create_session(self->fd1, "double-preserve-session-2"); + ASSERT_GE(session_fd2, 0); + + mem_fd = memfd_create("test-memfd", 0); + ASSERT_GE(mem_fd, 0); + + /* First preservation should succeed */ + ASSERT_EQ(preserve_fd(session_fd1, mem_fd, 0x1111), 0); + + /* Second preservation in a different session should fail with EBUSY */ + ret = preserve_fd(session_fd2, mem_fd, 0x2222); + EXPECT_EQ(ret, -EBUSY); + + /* Second preservation in the same session (different token) should fail with EBUSY */ + ret = preserve_fd(session_fd1, mem_fd, 0x3333); + EXPECT_EQ(ret, -EBUSY); + + ASSERT_EQ(close(mem_fd), 0); + ASSERT_EQ(close(session_fd1), 0); + ASSERT_EQ(close(session_fd2), 0); +} + TEST_HARNESS_MAIN -- cgit v1.2.3 From 13b6b620910436c29dea398382e83c9499fd13e4 Mon Sep 17 00:00:00 2001 From: Baolin Wang Date: Fri, 27 Mar 2026 18:21:08 +0800 Subject: mm: vmscan: fix dirty folios throttling on cgroup v1 for MGLRU The balance_dirty_pages() won't do the dirty folios throttling on cgroupv1. See commit 9badce000e2c ("cgroup, writeback: don't enable cgroup writeback on traditional hierarchies"). Moreover, after commit 6b0dfabb3555 ("fs: Remove aops->writepage"), we no longer attempt to write back filesystem folios through reclaim. On large memory systems, the flusher may not be able to write back quickly enough. Consequently, MGLRU will encounter many folios that are already under writeback. Since we cannot reclaim these dirty folios, the system may run out of memory and trigger the OOM killer. Hence, for cgroup v1, let's throttle reclaim after waking up the flusher, which is similar to commit 81a70c21d917 ("mm/cgroup/reclaim: fix dirty pages throttling on cgroup v1"), to avoid unnecessary OOM. The following test program can easily reproduce the OOM issue. With this patch applied, the test passes successfully. $mkdir /sys/fs/cgroup/memory/test $echo 256M > /sys/fs/cgroup/memory/test/memory.limit_in_bytes $echo $$ > /sys/fs/cgroup/memory/test/cgroup.procs $dd if=/dev/zero of=/mnt/data.bin bs=1M count=800 Link: https://lore.kernel.org/3445af0f09e8ca945492e052e82594f8c4f2e2f6.1774606060.git.baolin.wang@linux.alibaba.com Fixes: ac35a4902374 ("mm: multi-gen LRU: minimal implementation") Signed-off-by: Baolin Wang Reviewed-by: Barry Song Reviewed-by: Kairui Song Acked-by: Johannes Weiner Acked-by: Shakeel Butt Cc: Axel Rasmussen Cc: David Hildenbrand Cc: Lorenzo Stoakes (Oracle) Cc: Michal Hocko Cc: Qi Zheng Cc: Wei Xu Cc: Yuanchu Xie Signed-off-by: Andrew Morton --- mm/vmscan.c | 17 ++++++++++++++++- 1 file changed, 16 insertions(+), 1 deletion(-) diff --git a/mm/vmscan.c b/mm/vmscan.c index fd120e898c70..4f05a149a0dd 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -5027,9 +5027,24 @@ static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc) * If too many file cache in the coldest generation can't be evicted * due to being dirty, wake up the flusher. 
*/ - if (sc->nr.unqueued_dirty && sc->nr.unqueued_dirty == sc->nr.file_taken) + if (sc->nr.unqueued_dirty && sc->nr.unqueued_dirty == sc->nr.file_taken) { + struct pglist_data *pgdat = lruvec_pgdat(lruvec); + wakeup_flusher_threads(WB_REASON_VMSCAN); + /* + * For cgroupv1 dirty throttling is achieved by waking up + * the kernel flusher here and later waiting on folios + * which are in writeback to finish (see shrink_folio_list()). + * + * Flusher may not be able to issue writeback quickly + * enough for cgroupv1 writeback throttling to work + * on a large system. + */ + if (!writeback_throttling_sane(sc)) + reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK); + } + /* whether this lruvec should be rotated */ return nr_to_scan < 0; } -- cgit v1.2.3 From 277f4e5e398b8c59148ebc33dbee8f9821f087eb Mon Sep 17 00:00:00 2001 From: Pasha Tatashin Date: Fri, 27 Mar 2026 03:33:25 +0000 Subject: liveupdate: safely print untrusted strings Patch series "liveupdate: Fix module unloading and unregister API", v3. This patch series addresses an issue with how LUO handles module reference counting and unregistration during a module unload (e.g., via rmmod). Currently, modules that register live update file handlers are pinned for the entire duration they are registered. This prevents the modules from being unloaded gracefully, even when no live update session is in progress. Furthermore, if a module is forcefully unloaded, the unregistration functions return an error (e.g. -EBUSY) if a session is active, which is ignored by the kernel's module unload path, leaving dangling pointers in the LUO global lists. To resolve these issues, this series introduces the following changes: 1. Adds a global read-write semaphore (luo_register_rwlock) to protect the registration lists for both file handlers and FLBs. 2. Reduces the scope of module reference counting for file handlers and FLBs. Instead of pinning modules indefinitely upon registration, references are now taken only when they are actively used in a live update session (e.g., during preservation, retrieval, or deserialization). 3. Removes the global luo_session_quiesce() mechanism since module unload behavior now handles active sessions implicitly. 4. Introduces auto-unregistration of FLBs during file handler unregistration to prevent leaving dangling resources. 5. Changes the unregistration functions to return void instead of an error code. 6. Fixes a data race in luo_flb_get_private() by introducing a spinlock for thread-safe lazy initialization. 7. Strengthens security by using %.*s when printing untrusted deserialized compatible strings and session names to prevent out-of-bounds reads. This patch (of 10): Deserialized strings from KHO data (such as file handler compatible strings and session names) are provided by the previous kernel and might not be null-terminated if the data is corrupted or maliciously crafted. When printing these strings in error messages, use the %.*s format specifier with the maximum buffer size to prevent out-of-bounds reads into adjacent kernel memory. 
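As an aside (not part of the patch), the effect of the bounded format
specifier can be shown with a minimal user-space sketch; the fixed-size,
unterminated "compatible" buffer below is only a stand-in for the
serialized fields described above:

  #include <stdio.h>
  #include <string.h>

  int main(void)
  {
          /* Fixed-size field that is deliberately NOT null-terminated. */
          char compatible[8];

          memcpy(compatible, "memfd-v1", sizeof(compatible));

          /*
           * "%s" would keep reading past the buffer until it happens to
           * hit a NUL byte; "%.*s" prints at most sizeof(compatible)
           * characters, so a corrupted or unterminated field cannot
           * cause an out-of-bounds read.
           */
          printf("compatible: '%.*s'\n", (int)sizeof(compatible), compatible);
          return 0;
  }

The pr_warn() calls below apply the same bound via (int)sizeof(...) on the
serialized arrays.
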
Link: https://lore.kernel.org/20260327033335.696621-1-pasha.tatashin@soleen.com Link: https://lore.kernel.org/20260327033335.696621-2-pasha.tatashin@soleen.com Signed-off-by: Pasha Tatashin Reviewed-by: Pratyush Yadav (Google) Cc: David Matlack Cc: Mike Rapoport Cc: Samiullah Khawaja Signed-off-by: Andrew Morton --- kernel/liveupdate/luo_file.c | 3 ++- kernel/liveupdate/luo_session.c | 3 ++- 2 files changed, 4 insertions(+), 2 deletions(-) diff --git a/kernel/liveupdate/luo_file.c b/kernel/liveupdate/luo_file.c index 09103cf81107..8fcf302c73b6 100644 --- a/kernel/liveupdate/luo_file.c +++ b/kernel/liveupdate/luo_file.c @@ -813,7 +813,8 @@ int luo_file_deserialize(struct luo_file_set *file_set, } if (!handler_found) { - pr_warn("No registered handler for compatible '%s'\n", + pr_warn("No registered handler for compatible '%.*s'\n", + (int)sizeof(file_ser[i].compatible), file_ser[i].compatible); return -ENOENT; } diff --git a/kernel/liveupdate/luo_session.c b/kernel/liveupdate/luo_session.c index 783677295640..c68a0041bcf2 100644 --- a/kernel/liveupdate/luo_session.c +++ b/kernel/liveupdate/luo_session.c @@ -544,7 +544,8 @@ int luo_session_deserialize(void) session = luo_session_alloc(sh->ser[i].name); if (IS_ERR(session)) { - pr_warn("Failed to allocate session [%s] during deserialization %pe\n", + pr_warn("Failed to allocate session [%.*s] during deserialization %pe\n", + (int)sizeof(sh->ser[i].name), sh->ser[i].name, session); return PTR_ERR(session); } -- cgit v1.2.3 From 38fb71ace230bcf0106b6a09e7361c09255ba332 Mon Sep 17 00:00:00 2001 From: Pasha Tatashin Date: Fri, 27 Mar 2026 03:33:26 +0000 Subject: liveupdate: synchronize lazy initialization of FLB private state The luo_flb_get_private() function, which is responsible for lazily initializing the private state of FLB objects, can be called concurrently from multiple threads. This creates a data race on the 'initialized' flag and can lead to multiple executions of mutex_init() and INIT_LIST_HEAD() on the same memory. Introduce a static spinlock (luo_flb_init_lock) local to the function to synchronize the initialization path. Use smp_load_acquire() and smp_store_release() for memory ordering between the fast path and the slow path. 
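For reference, the locking scheme is ordinary double-checked initialization
with acquire/release ordering; a condensed sketch (with a hypothetical
struct foo/foo_private in place of the FLB types) looks like this:

  static struct foo_private *foo_get_private(struct foo *foo)
  {
          static DEFINE_SPINLOCK(foo_init_lock);
          struct foo_private *private = &foo->private;

          /* Fast path: pairs with the smp_store_release() below. */
          if (smp_load_acquire(&private->initialized))
                  return private;

          guard(spinlock)(&foo_init_lock);
          if (!private->initialized) {
                  mutex_init(&private->lock);
                  INIT_LIST_HEAD(&private->list);
                  /* Publish the fields above before marking initialized. */
                  smp_store_release(&private->initialized, true);
          }

          return private;
  }

A reader that observes initialized == true through the acquire load is
guaranteed to also observe the mutex and list initialization performed
before the release store.
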
Link: https://lore.kernel.org/20260327033335.696621-3-pasha.tatashin@soleen.com Signed-off-by: Pasha Tatashin Reviewed-by: Pratyush Yadav Cc: David Matlack Cc: Mike Rapoport Cc: Samiullah Khawaja Signed-off-by: Andrew Morton --- kernel/liveupdate/luo_flb.c | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/kernel/liveupdate/luo_flb.c b/kernel/liveupdate/luo_flb.c index f52e8114837e..cf4a8f854c83 100644 --- a/kernel/liveupdate/luo_flb.c +++ b/kernel/liveupdate/luo_flb.c @@ -89,13 +89,18 @@ struct luo_flb_link { static struct luo_flb_private *luo_flb_get_private(struct liveupdate_flb *flb) { struct luo_flb_private *private = &ACCESS_PRIVATE(flb, private); + static DEFINE_SPINLOCK(luo_flb_init_lock); + if (smp_load_acquire(&private->initialized)) + return private; + + guard(spinlock)(&luo_flb_init_lock); if (!private->initialized) { mutex_init(&private->incoming.lock); mutex_init(&private->outgoing.lock); INIT_LIST_HEAD(&private->list); private->users = 0; - private->initialized = true; + smp_store_release(&private->initialized, true); } return private; -- cgit v1.2.3 From 9e1e18584548e8ef8b37a2a7f5eb84b91e35a160 Mon Sep 17 00:00:00 2001 From: Pasha Tatashin Date: Fri, 27 Mar 2026 03:33:27 +0000 Subject: liveupdate: protect file handler list with rwsem Because liveupdate file handlers will no longer hold a module reference when registered, we must ensure that the access to the handler list is protected against concurrent module unloading. Utilize the global luo_register_rwlock to protect the global registry of file handlers. Read locks are taken during list traversals in luo_preserve_file() and luo_file_deserialize(). Write locks are taken during registration and unregistration. Link: https://lore.kernel.org/20260327033335.696621-4-pasha.tatashin@soleen.com Signed-off-by: Pasha Tatashin Reviewed-by: Pratyush Yadav (Google) Cc: David Matlack Cc: Mike Rapoport Cc: Samiullah Khawaja Signed-off-by: Andrew Morton --- kernel/liveupdate/luo_core.c | 6 ++++++ kernel/liveupdate/luo_file.c | 22 +++++++++++++++++----- kernel/liveupdate/luo_internal.h | 2 ++ 3 files changed, 25 insertions(+), 5 deletions(-) diff --git a/kernel/liveupdate/luo_core.c b/kernel/liveupdate/luo_core.c index 48b25c9abeda..803f51c84275 100644 --- a/kernel/liveupdate/luo_core.c +++ b/kernel/liveupdate/luo_core.c @@ -54,6 +54,7 @@ #include #include #include +#include #include #include #include @@ -68,6 +69,11 @@ static struct { u64 liveupdate_num; } luo_global; +/* + * luo_register_rwlock - Protects registration of file handlers and FLBs. 
+ */ +DECLARE_RWSEM(luo_register_rwlock); + static int __init early_liveupdate_param(char *buf) { return kstrtobool(buf, &luo_global.enabled); diff --git a/kernel/liveupdate/luo_file.c b/kernel/liveupdate/luo_file.c index 8fcf302c73b6..91edbf4e44ac 100644 --- a/kernel/liveupdate/luo_file.c +++ b/kernel/liveupdate/luo_file.c @@ -288,12 +288,14 @@ int luo_preserve_file(struct luo_file_set *file_set, u64 token, int fd) goto err_fput; err = -ENOENT; + down_read(&luo_register_rwlock); list_private_for_each_entry(fh, &luo_file_handler_list, list) { if (fh->ops->can_preserve(fh, file)) { err = 0; break; } } + up_read(&luo_register_rwlock); /* err is still -ENOENT if no handler was found */ if (err) @@ -805,12 +807,14 @@ int luo_file_deserialize(struct luo_file_set *file_set, bool handler_found = false; struct luo_file *luo_file; + down_read(&luo_register_rwlock); list_private_for_each_entry(fh, &luo_file_handler_list, list) { if (!strcmp(fh->compatible, file_ser[i].compatible)) { handler_found = true; break; } } + up_read(&luo_register_rwlock); if (!handler_found) { pr_warn("No registered handler for compatible '%.*s'\n", @@ -879,32 +883,36 @@ int liveupdate_register_file_handler(struct liveupdate_file_handler *fh) if (!luo_session_quiesce()) return -EBUSY; + down_write(&luo_register_rwlock); /* Check for duplicate compatible strings */ list_private_for_each_entry(fh_iter, &luo_file_handler_list, list) { if (!strcmp(fh_iter->compatible, fh->compatible)) { pr_err("File handler registration failed: Compatible string '%s' already registered.\n", fh->compatible); err = -EEXIST; - goto err_resume; + goto err_unlock; } } /* Pin the module implementing the handler */ if (!try_module_get(fh->ops->owner)) { err = -EAGAIN; - goto err_resume; + goto err_unlock; } INIT_LIST_HEAD(&ACCESS_PRIVATE(fh, flb_list)); INIT_LIST_HEAD(&ACCESS_PRIVATE(fh, list)); list_add_tail(&ACCESS_PRIVATE(fh, list), &luo_file_handler_list); + up_write(&luo_register_rwlock); + luo_session_resume(); liveupdate_test_register(fh); return 0; -err_resume: +err_unlock: + up_write(&luo_register_rwlock); luo_session_resume(); return err; } @@ -938,16 +946,20 @@ int liveupdate_unregister_file_handler(struct liveupdate_file_handler *fh) if (!luo_session_quiesce()) goto err_register; + down_write(&luo_register_rwlock); if (!list_empty(&ACCESS_PRIVATE(fh, flb_list))) - goto err_resume; + goto err_unlock; list_del(&ACCESS_PRIVATE(fh, list)); + up_write(&luo_register_rwlock); + module_put(fh->ops->owner); luo_session_resume(); return 0; -err_resume: +err_unlock: + up_write(&luo_register_rwlock); luo_session_resume(); err_register: liveupdate_test_register(fh); diff --git a/kernel/liveupdate/luo_internal.h b/kernel/liveupdate/luo_internal.h index 8083d8739b09..4bfe00ac8866 100644 --- a/kernel/liveupdate/luo_internal.h +++ b/kernel/liveupdate/luo_internal.h @@ -77,6 +77,8 @@ struct luo_session { struct mutex mutex; }; +extern struct rw_semaphore luo_register_rwlock; + int luo_session_create(const char *name, struct file **filep); int luo_session_retrieve(const char *name, struct file **filep); int __init luo_session_setup_outgoing(void *fdt); -- cgit v1.2.3 From 6b2b22f7c8cf1596490beaac96a989cbafdfea57 Mon Sep 17 00:00:00 2001 From: Pasha Tatashin Date: Fri, 27 Mar 2026 03:33:28 +0000 Subject: liveupdate: protect FLB lists with luo_register_rwlock Because liveupdate FLB objects will soon drop their persistent module references when registered, list traversals must be protected against concurrent module unloading. 
To provide this protection, utilize the global luo_register_rwlock. It protects the global registry of FLBs and the handler's specific list of FLB dependencies. Read locks are used during concurrent list traversals (e.g., during preservation and serialization). Write locks are taken during registration and unregistration. Link: https://lore.kernel.org/20260327033335.696621-5-pasha.tatashin@soleen.com Signed-off-by: Pasha Tatashin Reviewed-by: Pratyush Yadav (Google) Cc: David Matlack Cc: Mike Rapoport Cc: Samiullah Khawaja Signed-off-by: Andrew Morton --- include/linux/liveupdate.h | 1 + kernel/liveupdate/luo_flb.c | 14 ++++++++++++++ 2 files changed, 15 insertions(+) diff --git a/include/linux/liveupdate.h b/include/linux/liveupdate.h index 61325ad26526..9c761d9bacf8 100644 --- a/include/linux/liveupdate.h +++ b/include/linux/liveupdate.h @@ -12,6 +12,7 @@ #include #include #include +#include #include #include diff --git a/kernel/liveupdate/luo_flb.c b/kernel/liveupdate/luo_flb.c index cf4a8f854c83..fdb274410e8f 100644 --- a/kernel/liveupdate/luo_flb.c +++ b/kernel/liveupdate/luo_flb.c @@ -245,17 +245,20 @@ int luo_flb_file_preserve(struct liveupdate_file_handler *fh) struct luo_flb_link *iter; int err = 0; + down_read(&luo_register_rwlock); list_for_each_entry(iter, flb_list, list) { err = luo_flb_file_preserve_one(iter->flb); if (err) goto exit_err; } + up_read(&luo_register_rwlock); return 0; exit_err: list_for_each_entry_continue_reverse(iter, flb_list, list) luo_flb_file_unpreserve_one(iter->flb); + up_read(&luo_register_rwlock); return err; } @@ -277,6 +280,7 @@ void luo_flb_file_unpreserve(struct liveupdate_file_handler *fh) struct list_head *flb_list = &ACCESS_PRIVATE(fh, flb_list); struct luo_flb_link *iter; + guard(rwsem_read)(&luo_register_rwlock); list_for_each_entry_reverse(iter, flb_list, list) luo_flb_file_unpreserve_one(iter->flb); } @@ -297,6 +301,7 @@ void luo_flb_file_finish(struct liveupdate_file_handler *fh) struct list_head *flb_list = &ACCESS_PRIVATE(fh, flb_list); struct luo_flb_link *iter; + guard(rwsem_read)(&luo_register_rwlock); list_for_each_entry_reverse(iter, flb_list, list) luo_flb_file_finish_one(iter->flb); } @@ -360,6 +365,8 @@ int liveupdate_register_flb(struct liveupdate_file_handler *fh, if (!luo_session_quiesce()) return -EBUSY; + down_write(&luo_register_rwlock); + /* Check that this FLB is not already linked to this file handler */ err = -EEXIST; list_for_each_entry(iter, flb_list, list) { @@ -401,11 +408,13 @@ int liveupdate_register_flb(struct liveupdate_file_handler *fh, private->users++; link->flb = flb; list_add_tail(&no_free_ptr(link)->list, flb_list); + up_write(&luo_register_rwlock); luo_session_resume(); return 0; err_resume: + up_write(&luo_register_rwlock); luo_session_resume(); return err; } @@ -449,6 +458,8 @@ int liveupdate_unregister_flb(struct liveupdate_file_handler *fh, if (!luo_session_quiesce()) return -EBUSY; + down_write(&luo_register_rwlock); + /* Find and remove the link from the file handler's list */ list_for_each_entry(iter, flb_list, list) { if (iter->flb == flb) { @@ -473,11 +484,13 @@ int liveupdate_unregister_flb(struct liveupdate_file_handler *fh, module_put(flb->ops->owner); } + up_write(&luo_register_rwlock); luo_session_resume(); return 0; err_resume: + up_write(&luo_register_rwlock); luo_session_resume(); return err; } @@ -643,6 +656,7 @@ void luo_flb_serialize(void) struct liveupdate_flb *gflb; int i = 0; + guard(rwsem_read)(&luo_register_rwlock); list_private_for_each_entry(gflb, &luo_flb_global.list, 
private.list) { struct luo_flb_private *private = luo_flb_get_private(gflb); -- cgit v1.2.3 From 76be9983df33aebd69716edaa8204ed90e72fef1 Mon Sep 17 00:00:00 2001 From: Pasha Tatashin Date: Fri, 27 Mar 2026 03:33:29 +0000 Subject: liveupdate: defer FLB module refcounting to active sessions Stop pinning modules indefinitely upon FLB registration. Instead, dynamically take a module reference when the FLB is actively used in a session (e.g., during preserve and retrieve) and release it when the session concludes. This allows modules providing FLB operations to be cleanly unloaded when not in active use by the live update orchestrator. Link: https://lore.kernel.org/20260327033335.696621-6-pasha.tatashin@soleen.com Signed-off-by: Pasha Tatashin Reviewed-by: Samiullah Khawaja Reviewed-by: Pratyush Yadav (Google) Cc: David Matlack Cc: Mike Rapoport Signed-off-by: Andrew Morton --- kernel/liveupdate/luo_flb.c | 27 +++++++++++++++++---------- 1 file changed, 17 insertions(+), 10 deletions(-) diff --git a/kernel/liveupdate/luo_flb.c b/kernel/liveupdate/luo_flb.c index fdb274410e8f..3d439d1c8ff1 100644 --- a/kernel/liveupdate/luo_flb.c +++ b/kernel/liveupdate/luo_flb.c @@ -115,10 +115,15 @@ static int luo_flb_file_preserve_one(struct liveupdate_flb *flb) struct liveupdate_flb_op_args args = {0}; int err; + if (!try_module_get(flb->ops->owner)) + return -ENODEV; + args.flb = flb; err = flb->ops->preserve(&args); - if (err) + if (err) { + module_put(flb->ops->owner); return err; + } private->outgoing.data = args.data; private->outgoing.obj = args.obj; } @@ -146,6 +151,7 @@ static void luo_flb_file_unpreserve_one(struct liveupdate_flb *flb) private->outgoing.data = 0; private->outgoing.obj = NULL; + module_put(flb->ops->owner); } } } @@ -181,12 +187,17 @@ static int luo_flb_retrieve_one(struct liveupdate_flb *flb) if (!found) return -ENOENT; + if (!try_module_get(flb->ops->owner)) + return -ENODEV; + args.flb = flb; args.data = private->incoming.data; err = flb->ops->retrieve(&args); - if (err) + if (err) { + module_put(flb->ops->owner); return err; + } private->incoming.obj = args.obj; private->incoming.retrieved = true; @@ -220,6 +231,7 @@ static void luo_flb_file_finish_one(struct liveupdate_flb *flb) private->incoming.data = 0; private->incoming.obj = NULL; private->incoming.finished = true; + module_put(flb->ops->owner); } } } @@ -395,11 +407,6 @@ int liveupdate_register_flb(struct liveupdate_file_handler *fh, goto err_resume; } - if (!try_module_get(flb->ops->owner)) { - err = -EAGAIN; - goto err_resume; - } - list_add_tail(&private->list, &luo_flb_global.list); luo_flb_global.count++; } @@ -476,12 +483,11 @@ int liveupdate_unregister_flb(struct liveupdate_file_handler *fh, private->users--; /* * If this is the last file-handler with which we are registred, remove - * from the global list, and relese module reference. + * from the global list. */ if (!private->users) { list_del_init(&private->list); luo_flb_global.count--; - module_put(flb->ops->owner); } up_write(&luo_register_rwlock); @@ -510,7 +516,8 @@ err_resume: * * Return: 0 on success, or a negative errno on failure. -ENODATA means no * incoming FLB data, -ENOENT means specific flb not found in the incoming - * data, and -EOPNOTSUPP when live update is disabled or not configured. + * data, -ENODEV if the FLB's module is unloading, and -EOPNOTSUPP when + * live update is disabled or not configured. 
*/ int liveupdate_flb_get_incoming(struct liveupdate_flb *flb, void **objp) { -- cgit v1.2.3 From 118c3908242076c6e281c7010d29c2d0607c3190 Mon Sep 17 00:00:00 2001 From: Pasha Tatashin Date: Fri, 27 Mar 2026 03:33:30 +0000 Subject: liveupdate: remove luo_session_quiesce() Now that FLB module references are handled dynamically during active sessions, we can safely remove the luo_session_quiesce() and luo_session_resume() mechanism. Link: https://lore.kernel.org/20260327033335.696621-7-pasha.tatashin@soleen.com Signed-off-by: Pasha Tatashin Reviewed-by: Pratyush Yadav (Google) Cc: David Matlack Cc: Mike Rapoport Cc: Samiullah Khawaja Signed-off-by: Andrew Morton --- kernel/liveupdate/luo_file.c | 21 +------------- kernel/liveupdate/luo_flb.c | 59 +++++++--------------------------------- kernel/liveupdate/luo_internal.h | 2 -- kernel/liveupdate/luo_session.c | 43 ----------------------------- 4 files changed, 11 insertions(+), 114 deletions(-) diff --git a/kernel/liveupdate/luo_file.c b/kernel/liveupdate/luo_file.c index 91edbf4e44ac..97342b8b8b69 100644 --- a/kernel/liveupdate/luo_file.c +++ b/kernel/liveupdate/luo_file.c @@ -875,14 +875,6 @@ int liveupdate_register_file_handler(struct liveupdate_file_handler *fh) return -EINVAL; } - /* - * Ensure the system is quiescent (no active sessions). - * This prevents registering new handlers while sessions are active or - * while deserialization is in progress. - */ - if (!luo_session_quiesce()) - return -EBUSY; - down_write(&luo_register_rwlock); /* Check for duplicate compatible strings */ list_private_for_each_entry(fh_iter, &luo_file_handler_list, list) { @@ -905,15 +897,12 @@ int liveupdate_register_file_handler(struct liveupdate_file_handler *fh) list_add_tail(&ACCESS_PRIVATE(fh, list), &luo_file_handler_list); up_write(&luo_register_rwlock); - luo_session_resume(); - liveupdate_test_register(fh); return 0; err_unlock: up_write(&luo_register_rwlock); - luo_session_resume(); return err; } @@ -925,14 +914,12 @@ err_unlock: * reverses the operations of liveupdate_register_file_handler(). * * It ensures safe removal by checking that: - * No live update session is currently in progress. * No FLB registered with this file handler. * * If the unregistration fails, the internal test state is reverted. * * Return: 0 Success. -EOPNOTSUPP when live update is not enabled. -EBUSY A live - * update is in progress, can't quiesce live update or FLB is registred with - * this file handler. + * update is in progress, FLB is registred with this file handler. 
*/ int liveupdate_unregister_file_handler(struct liveupdate_file_handler *fh) { @@ -943,9 +930,6 @@ int liveupdate_unregister_file_handler(struct liveupdate_file_handler *fh) liveupdate_test_unregister(fh); - if (!luo_session_quiesce()) - goto err_register; - down_write(&luo_register_rwlock); if (!list_empty(&ACCESS_PRIVATE(fh, flb_list))) goto err_unlock; @@ -954,14 +938,11 @@ int liveupdate_unregister_file_handler(struct liveupdate_file_handler *fh) up_write(&luo_register_rwlock); module_put(fh->ops->owner); - luo_session_resume(); return 0; err_unlock: up_write(&luo_register_rwlock); - luo_session_resume(); -err_register: liveupdate_test_register(fh); return err; } diff --git a/kernel/liveupdate/luo_flb.c b/kernel/liveupdate/luo_flb.c index 3d439d1c8ff1..13f96d11ecc9 100644 --- a/kernel/liveupdate/luo_flb.c +++ b/kernel/liveupdate/luo_flb.c @@ -348,7 +348,6 @@ int liveupdate_register_flb(struct liveupdate_file_handler *fh, struct luo_flb_link *link __free(kfree) = NULL; struct liveupdate_flb *gflb; struct luo_flb_link *iter; - int err; if (!liveupdate_enabled()) return -EOPNOTSUPP; @@ -369,21 +368,12 @@ int liveupdate_register_flb(struct liveupdate_file_handler *fh, if (!link) return -ENOMEM; - /* - * Ensure the system is quiescent (no active sessions). - * This acts as a global lock for registration: no other thread can - * be in this section, and no sessions can be creating/using FDs. - */ - if (!luo_session_quiesce()) - return -EBUSY; - - down_write(&luo_register_rwlock); + guard(rwsem_write)(&luo_register_rwlock); /* Check that this FLB is not already linked to this file handler */ - err = -EEXIST; list_for_each_entry(iter, flb_list, list) { if (iter->flb == flb) - goto err_resume; + return -EEXIST; } /* @@ -391,20 +381,16 @@ int liveupdate_register_flb(struct liveupdate_file_handler *fh, * is registered */ if (!private->users) { - if (WARN_ON(!list_empty(&private->list))) { - err = -EINVAL; - goto err_resume; - } + if (WARN_ON(!list_empty(&private->list))) + return -EINVAL; - if (luo_flb_global.count == LUO_FLB_MAX) { - err = -ENOSPC; - goto err_resume; - } + if (luo_flb_global.count == LUO_FLB_MAX) + return -ENOSPC; /* Check that compatible string is unique in global list */ list_private_for_each_entry(gflb, &luo_flb_global.list, private.list) { if (!strcmp(gflb->compatible, flb->compatible)) - goto err_resume; + return -EEXIST; } list_add_tail(&private->list, &luo_flb_global.list); @@ -415,15 +401,8 @@ int liveupdate_register_flb(struct liveupdate_file_handler *fh, private->users++; link->flb = flb; list_add_tail(&no_free_ptr(link)->list, flb_list); - up_write(&luo_register_rwlock); - luo_session_resume(); return 0; - -err_resume: - up_write(&luo_register_rwlock); - luo_session_resume(); - return err; } /** @@ -439,12 +418,9 @@ err_resume: * the FLB is removed from the global registry and the reference to its * owner module (acquired during registration) is released. * - * Context: This function ensures the session is quiesced (no active FDs - * being created) during the update. It is typically called from a - * subsystem's module exit function. + * Context: It is typically called from a subsystem's module exit function. * Return: 0 on success. * -EOPNOTSUPP if live update is disabled. - * -EBUSY if the live update session is active and cannot be quiesced. * -ENOENT if the FLB was not found in the file handler's list. 
*/ int liveupdate_unregister_flb(struct liveupdate_file_handler *fh, @@ -458,14 +434,7 @@ int liveupdate_unregister_flb(struct liveupdate_file_handler *fh, if (!liveupdate_enabled()) return -EOPNOTSUPP; - /* - * Ensure the system is quiescent (no active sessions). - * This acts as a global lock for unregistration. - */ - if (!luo_session_quiesce()) - return -EBUSY; - - down_write(&luo_register_rwlock); + guard(rwsem_write)(&luo_register_rwlock); /* Find and remove the link from the file handler's list */ list_for_each_entry(iter, flb_list, list) { @@ -478,7 +447,7 @@ int liveupdate_unregister_flb(struct liveupdate_file_handler *fh, } if (err) - goto err_resume; + return err; private->users--; /* @@ -490,15 +459,7 @@ int liveupdate_unregister_flb(struct liveupdate_file_handler *fh, luo_flb_global.count--; } - up_write(&luo_register_rwlock); - luo_session_resume(); - return 0; - -err_resume: - up_write(&luo_register_rwlock); - luo_session_resume(); - return err; } /** diff --git a/kernel/liveupdate/luo_internal.h b/kernel/liveupdate/luo_internal.h index 4bfe00ac8866..40a011bdfa55 100644 --- a/kernel/liveupdate/luo_internal.h +++ b/kernel/liveupdate/luo_internal.h @@ -85,8 +85,6 @@ int __init luo_session_setup_outgoing(void *fdt); int __init luo_session_setup_incoming(void *fdt); int luo_session_serialize(void); int luo_session_deserialize(void); -bool luo_session_quiesce(void); -void luo_session_resume(void); int luo_preserve_file(struct luo_file_set *file_set, u64 token, int fd); void luo_file_unpreserve_files(struct luo_file_set *file_set); diff --git a/kernel/liveupdate/luo_session.c b/kernel/liveupdate/luo_session.c index c68a0041bcf2..e5d35e83ac3d 100644 --- a/kernel/liveupdate/luo_session.c +++ b/kernel/liveupdate/luo_session.c @@ -602,46 +602,3 @@ err_undo: return err; } -/** - * luo_session_quiesce - Ensure no active sessions exist and lock session lists. - * - * Acquires exclusive write locks on both incoming and outgoing session lists. - * It then validates no sessions exist in either list. - * - * This mechanism is used during file handler un/registration to ensure that no - * sessions are currently using the handler, and no new sessions can be created - * while un/registration is in progress. - * - * This prevents registering new handlers while sessions are active or - * while deserialization is in progress. - * - * Return: - * true - System is quiescent (0 sessions) and locked. - * false - Active sessions exist. The locks are released internally. - */ -bool luo_session_quiesce(void) -{ - down_write(&luo_session_global.incoming.rwsem); - down_write(&luo_session_global.outgoing.rwsem); - - if (luo_session_global.incoming.count || - luo_session_global.outgoing.count) { - up_write(&luo_session_global.outgoing.rwsem); - up_write(&luo_session_global.incoming.rwsem); - return false; - } - - return true; -} - -/** - * luo_session_resume - Unlock session lists and resume normal activity. - * - * Releases the exclusive locks acquired by a successful call to - * luo_session_quiesce(). 
- */ -void luo_session_resume(void) -{ - up_write(&luo_session_global.outgoing.rwsem); - up_write(&luo_session_global.incoming.rwsem); -} -- cgit v1.2.3 From 5ee1c7d6414a0b1cb7285bd4904b4969c0d9fab1 Mon Sep 17 00:00:00 2001 From: Pasha Tatashin Date: Fri, 27 Mar 2026 03:33:31 +0000 Subject: liveupdate: auto unregister FLBs on file handler unregistration To ensure that unregistration is always successful and doesn't leave dangling resources, introduce auto-unregistration of FLBs: when a file handler is unregistered, all FLBs associated with it are automatically unregistered. Introduce a new helper luo_flb_unregister_all() which unregisters all FLBs linked to the given file handler. Link: https://lore.kernel.org/20260327033335.696621-8-pasha.tatashin@soleen.com Signed-off-by: Pasha Tatashin Reviewed-by: Pratyush Yadav (Google) Cc: David Matlack Cc: Mike Rapoport Cc: Samiullah Khawaja Signed-off-by: Andrew Morton --- kernel/liveupdate/luo_file.c | 14 +------ kernel/liveupdate/luo_flb.c | 84 +++++++++++++++++++++++++++------------- kernel/liveupdate/luo_internal.h | 1 + 3 files changed, 60 insertions(+), 39 deletions(-) diff --git a/kernel/liveupdate/luo_file.c b/kernel/liveupdate/luo_file.c index 97342b8b8b69..9ba904c10425 100644 --- a/kernel/liveupdate/luo_file.c +++ b/kernel/liveupdate/luo_file.c @@ -923,26 +923,16 @@ err_unlock: */ int liveupdate_unregister_file_handler(struct liveupdate_file_handler *fh) { - int err = -EBUSY; - if (!liveupdate_enabled()) return -EOPNOTSUPP; liveupdate_test_unregister(fh); - down_write(&luo_register_rwlock); - if (!list_empty(&ACCESS_PRIVATE(fh, flb_list))) - goto err_unlock; - + guard(rwsem_write)(&luo_register_rwlock); + luo_flb_unregister_all(fh); list_del(&ACCESS_PRIVATE(fh, list)); - up_write(&luo_register_rwlock); module_put(fh->ops->owner); return 0; - -err_unlock: - up_write(&luo_register_rwlock); - liveupdate_test_register(fh); - return err; } diff --git a/kernel/liveupdate/luo_flb.c b/kernel/liveupdate/luo_flb.c index 13f96d11ecc9..e069d694163e 100644 --- a/kernel/liveupdate/luo_flb.c +++ b/kernel/liveupdate/luo_flb.c @@ -318,6 +318,62 @@ void luo_flb_file_finish(struct liveupdate_file_handler *fh) luo_flb_file_finish_one(iter->flb); } +static void luo_flb_unregister_one(struct liveupdate_file_handler *fh, + struct liveupdate_flb *flb) +{ + struct luo_flb_private *private = luo_flb_get_private(flb); + struct list_head *flb_list = &ACCESS_PRIVATE(fh, flb_list); + struct luo_flb_link *iter; + bool found = false; + + /* Find and remove the link from the file handler's list */ + list_for_each_entry(iter, flb_list, list) { + if (iter->flb == flb) { + list_del(&iter->list); + kfree(iter); + found = true; + break; + } + } + + if (!found) { + pr_warn("Failed to unregister FLB '%s': not found in file handler '%s'\n", + flb->compatible, fh->compatible); + return; + } + + private->users--; + + /* + * If this is the last file-handler with which we are registred, remove + * from the global list. + */ + if (!private->users) { + list_del_init(&private->list); + luo_flb_global.count--; + } +} + +/** + * luo_flb_unregister_all - Unregister all FLBs associated with a file handler. + * @fh: The file handler whose FLBs should be unregistered. + * + * This function iterates through the list of FLBs associated with the given + * file handler and unregisters them all one by one. 
+ */ +void luo_flb_unregister_all(struct liveupdate_file_handler *fh) +{ + struct list_head *flb_list = &ACCESS_PRIVATE(fh, flb_list); + struct luo_flb_link *iter, *tmp; + + if (!liveupdate_enabled()) + return; + + lockdep_assert_held_write(&luo_register_rwlock); + list_for_each_entry_safe(iter, tmp, flb_list, list) + luo_flb_unregister_one(fh, iter->flb); +} + /** * liveupdate_register_flb - Associate an FLB with a file handler and register it globally. * @fh: The file handler that will now depend on the FLB. @@ -426,38 +482,12 @@ int liveupdate_register_flb(struct liveupdate_file_handler *fh, int liveupdate_unregister_flb(struct liveupdate_file_handler *fh, struct liveupdate_flb *flb) { - struct luo_flb_private *private = luo_flb_get_private(flb); - struct list_head *flb_list = &ACCESS_PRIVATE(fh, flb_list); - struct luo_flb_link *iter; - int err = -ENOENT; - if (!liveupdate_enabled()) return -EOPNOTSUPP; guard(rwsem_write)(&luo_register_rwlock); - /* Find and remove the link from the file handler's list */ - list_for_each_entry(iter, flb_list, list) { - if (iter->flb == flb) { - list_del(&iter->list); - kfree(iter); - err = 0; - break; - } - } - - if (err) - return err; - - private->users--; - /* - * If this is the last file-handler with which we are registred, remove - * from the global list. - */ - if (!private->users) { - list_del_init(&private->list); - luo_flb_global.count--; - } + luo_flb_unregister_one(fh, flb); return 0; } diff --git a/kernel/liveupdate/luo_internal.h b/kernel/liveupdate/luo_internal.h index 40a011bdfa55..22f6901f89ed 100644 --- a/kernel/liveupdate/luo_internal.h +++ b/kernel/liveupdate/luo_internal.h @@ -103,6 +103,7 @@ void luo_file_set_destroy(struct luo_file_set *file_set); int luo_flb_file_preserve(struct liveupdate_file_handler *fh); void luo_flb_file_unpreserve(struct liveupdate_file_handler *fh); void luo_flb_file_finish(struct liveupdate_file_handler *fh); +void luo_flb_unregister_all(struct liveupdate_file_handler *fh); int __init luo_flb_setup_outgoing(void *fdt); int __init luo_flb_setup_incoming(void *fdt); void luo_flb_serialize(void); -- cgit v1.2.3 From 074488008d6e745af067e968d6046f2c04b12537 Mon Sep 17 00:00:00 2001 From: Pasha Tatashin Date: Fri, 27 Mar 2026 03:33:32 +0000 Subject: liveupdate: remove liveupdate_test_unregister() Now that file handler unregistration automatically unregisters all associated file handlers (FLBs), the liveupdate_test_unregister() function is no longer needed. Remove it along with its usages and declarations. 
Link: https://lore.kernel.org/20260327033335.696621-9-pasha.tatashin@soleen.com Signed-off-by: Pasha Tatashin Reviewed-by: Pratyush Yadav (Google) Cc: David Matlack Cc: Mike Rapoport Cc: Samiullah Khawaja Signed-off-by: Andrew Morton --- kernel/liveupdate/luo_file.c | 2 -- kernel/liveupdate/luo_internal.h | 2 -- lib/tests/liveupdate.c | 18 ------------------ 3 files changed, 22 deletions(-) diff --git a/kernel/liveupdate/luo_file.c b/kernel/liveupdate/luo_file.c index 9ba904c10425..4060b6064248 100644 --- a/kernel/liveupdate/luo_file.c +++ b/kernel/liveupdate/luo_file.c @@ -926,8 +926,6 @@ int liveupdate_unregister_file_handler(struct liveupdate_file_handler *fh) if (!liveupdate_enabled()) return -EOPNOTSUPP; - liveupdate_test_unregister(fh); - guard(rwsem_write)(&luo_register_rwlock); luo_flb_unregister_all(fh); list_del(&ACCESS_PRIVATE(fh, list)); diff --git a/kernel/liveupdate/luo_internal.h b/kernel/liveupdate/luo_internal.h index 22f6901f89ed..875844d7a41d 100644 --- a/kernel/liveupdate/luo_internal.h +++ b/kernel/liveupdate/luo_internal.h @@ -110,10 +110,8 @@ void luo_flb_serialize(void); #ifdef CONFIG_LIVEUPDATE_TEST void liveupdate_test_register(struct liveupdate_file_handler *fh); -void liveupdate_test_unregister(struct liveupdate_file_handler *fh); #else static inline void liveupdate_test_register(struct liveupdate_file_handler *fh) { } -static inline void liveupdate_test_unregister(struct liveupdate_file_handler *fh) { } #endif #endif /* _LINUX_LUO_INTERNAL_H */ diff --git a/lib/tests/liveupdate.c b/lib/tests/liveupdate.c index 496d6ef91a30..e4b0ecbee32f 100644 --- a/lib/tests/liveupdate.c +++ b/lib/tests/liveupdate.c @@ -135,24 +135,6 @@ void liveupdate_test_register(struct liveupdate_file_handler *fh) TEST_NFLBS, fh->compatible); } -void liveupdate_test_unregister(struct liveupdate_file_handler *fh) -{ - int err, i; - - for (i = 0; i < TEST_NFLBS; i++) { - struct liveupdate_flb *flb = &test_flbs[i]; - - err = liveupdate_unregister_flb(fh, flb); - if (err) { - pr_err("Failed to unregister %s %pe\n", - flb->compatible, ERR_PTR(err)); - } - } - - pr_info("Unregistered %d FLBs from file handler: [%s]\n", - TEST_NFLBS, fh->compatible); -} - MODULE_LICENSE("GPL"); MODULE_AUTHOR("Pasha Tatashin "); MODULE_DESCRIPTION("In-kernel test for LUO mechanism"); -- cgit v1.2.3 From 2ab7207e7ec6cd5af1912d9be5174f114633286b Mon Sep 17 00:00:00 2001 From: Pasha Tatashin Date: Fri, 27 Mar 2026 03:33:33 +0000 Subject: liveupdate: make unregister functions return void Change liveupdate_unregister_file_handler and liveupdate_unregister_flb to return void instead of an error code. This follows the design principle that unregistration during module unload should not fail, as the unload cannot be stopped at that point. 
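A minimal, hypothetical consumer illustrates why void is the right
contract here (the handler and exit-function names are made up for the
example and are not part of the patch):

  static struct liveupdate_file_handler my_handler = {
          /* .compatible = ..., .ops = &my_file_ops, */
  };

  static void __exit my_driver_exit(void)
  {
          /*
           * A module_exit() callback returns void and cannot abort the
           * unload, so any error returned from unregistration would have
           * to be ignored anyway.  With the void API the helper is
           * expected to always succeed and simply clean up.
           */
          liveupdate_unregister_file_handler(&my_handler);
  }
  module_exit(my_driver_exit);
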
Link: https://lore.kernel.org/20260327033335.696621-10-pasha.tatashin@soleen.com Signed-off-by: Pasha Tatashin Reviewed-by: Pratyush Yadav (Google) Cc: David Matlack Cc: Mike Rapoport Cc: Samiullah Khawaja Signed-off-by: Andrew Morton --- include/linux/liveupdate.h | 14 ++++++-------- kernel/liveupdate/luo_file.c | 14 ++------------ kernel/liveupdate/luo_flb.c | 11 +++-------- 3 files changed, 11 insertions(+), 28 deletions(-) diff --git a/include/linux/liveupdate.h b/include/linux/liveupdate.h index 9c761d9bacf8..30c5a39ff9e9 100644 --- a/include/linux/liveupdate.h +++ b/include/linux/liveupdate.h @@ -231,12 +231,12 @@ bool liveupdate_enabled(void); int liveupdate_reboot(void); int liveupdate_register_file_handler(struct liveupdate_file_handler *fh); -int liveupdate_unregister_file_handler(struct liveupdate_file_handler *fh); +void liveupdate_unregister_file_handler(struct liveupdate_file_handler *fh); int liveupdate_register_flb(struct liveupdate_file_handler *fh, struct liveupdate_flb *flb); -int liveupdate_unregister_flb(struct liveupdate_file_handler *fh, - struct liveupdate_flb *flb); +void liveupdate_unregister_flb(struct liveupdate_file_handler *fh, + struct liveupdate_flb *flb); int liveupdate_flb_get_incoming(struct liveupdate_flb *flb, void **objp); int liveupdate_flb_get_outgoing(struct liveupdate_flb *flb, void **objp); @@ -258,9 +258,8 @@ static inline int liveupdate_register_file_handler(struct liveupdate_file_handle return -EOPNOTSUPP; } -static inline int liveupdate_unregister_file_handler(struct liveupdate_file_handler *fh) +static inline void liveupdate_unregister_file_handler(struct liveupdate_file_handler *fh) { - return -EOPNOTSUPP; } static inline int liveupdate_register_flb(struct liveupdate_file_handler *fh, @@ -269,10 +268,9 @@ static inline int liveupdate_register_flb(struct liveupdate_file_handler *fh, return -EOPNOTSUPP; } -static inline int liveupdate_unregister_flb(struct liveupdate_file_handler *fh, - struct liveupdate_flb *flb) +static inline void liveupdate_unregister_flb(struct liveupdate_file_handler *fh, + struct liveupdate_flb *flb) { - return -EOPNOTSUPP; } static inline int liveupdate_flb_get_incoming(struct liveupdate_flb *flb, diff --git a/kernel/liveupdate/luo_file.c b/kernel/liveupdate/luo_file.c index 4060b6064248..0730865711c1 100644 --- a/kernel/liveupdate/luo_file.c +++ b/kernel/liveupdate/luo_file.c @@ -912,25 +912,15 @@ err_unlock: * * Unregisters the file handler from the liveupdate core. This function * reverses the operations of liveupdate_register_file_handler(). - * - * It ensures safe removal by checking that: - * No FLB registered with this file handler. - * - * If the unregistration fails, the internal test state is reverted. - * - * Return: 0 Success. -EOPNOTSUPP when live update is not enabled. -EBUSY A live - * update is in progress, FLB is registred with this file handler. 
*/ -int liveupdate_unregister_file_handler(struct liveupdate_file_handler *fh) +void liveupdate_unregister_file_handler(struct liveupdate_file_handler *fh) { if (!liveupdate_enabled()) - return -EOPNOTSUPP; + return; guard(rwsem_write)(&luo_register_rwlock); luo_flb_unregister_all(fh); list_del(&ACCESS_PRIVATE(fh, list)); module_put(fh->ops->owner); - - return 0; } diff --git a/kernel/liveupdate/luo_flb.c b/kernel/liveupdate/luo_flb.c index e069d694163e..00f5494812c4 100644 --- a/kernel/liveupdate/luo_flb.c +++ b/kernel/liveupdate/luo_flb.c @@ -475,21 +475,16 @@ int liveupdate_register_flb(struct liveupdate_file_handler *fh, * owner module (acquired during registration) is released. * * Context: It is typically called from a subsystem's module exit function. - * Return: 0 on success. - * -EOPNOTSUPP if live update is disabled. - * -ENOENT if the FLB was not found in the file handler's list. */ -int liveupdate_unregister_flb(struct liveupdate_file_handler *fh, - struct liveupdate_flb *flb) +void liveupdate_unregister_flb(struct liveupdate_file_handler *fh, + struct liveupdate_flb *flb) { if (!liveupdate_enabled()) - return -EOPNOTSUPP; + return; guard(rwsem_write)(&luo_register_rwlock); luo_flb_unregister_one(fh, flb); - - return 0; } /** -- cgit v1.2.3 From 68750e820bc4095d25cf70002782c284e5702415 Mon Sep 17 00:00:00 2001 From: Pasha Tatashin Date: Fri, 27 Mar 2026 03:33:34 +0000 Subject: liveupdate: defer file handler module refcounting to active sessions Stop pinning modules indefinitely upon file handler registration. Instead, dynamically increment the module reference count only when a live update session actively uses the file handler (e.g., during preservation or deserialization), and release it when the session ends. This allows modules providing live update handlers to be gracefully unloaded when no live update is in progress. 
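Conceptually the module reference now follows the lifetime of the
preserved file rather than the registration; a condensed sketch of the
rule (the helper names are illustrative, the patch open-codes these calls
in the preserve/deserialize and unpreserve/finish paths):

  static int pin_handler_module(struct liveupdate_file_handler *fh)
  {
          /* Taken when a session starts using the handler. */
          if (!try_module_get(fh->ops->owner))
                  return -ENOENT; /* implementing module is on its way out */
          return 0;
  }

  static void unpin_handler_module(struct liveupdate_file_handler *fh)
  {
          /* Dropped when the session stops using it. */
          module_put(fh->ops->owner);
  }

If try_module_get() fails, the handler is treated as unavailable, exactly
as if no matching handler had been registered.
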
Link: https://lore.kernel.org/20260327033335.696621-11-pasha.tatashin@soleen.com Signed-off-by: Pasha Tatashin Reviewed-by: Pratyush Yadav (Google) Cc: David Matlack Cc: Mike Rapoport Cc: Samiullah Khawaja Signed-off-by: Andrew Morton --- kernel/liveupdate/luo_file.c | 24 ++++++++++++------------ 1 file changed, 12 insertions(+), 12 deletions(-) diff --git a/kernel/liveupdate/luo_file.c b/kernel/liveupdate/luo_file.c index 0730865711c1..a0a419085e28 100644 --- a/kernel/liveupdate/luo_file.c +++ b/kernel/liveupdate/luo_file.c @@ -291,7 +291,8 @@ int luo_preserve_file(struct luo_file_set *file_set, u64 token, int fd) down_read(&luo_register_rwlock); list_private_for_each_entry(fh, &luo_file_handler_list, list) { if (fh->ops->can_preserve(fh, file)) { - err = 0; + if (try_module_get(fh->ops->owner)) + err = 0; break; } } @@ -304,7 +305,7 @@ int luo_preserve_file(struct luo_file_set *file_set, u64 token, int fd) err = xa_insert(&luo_preserved_files, luo_get_id(fh, file), file, GFP_KERNEL); if (err) - goto err_free_files_mem; + goto err_module_put; err = luo_flb_file_preserve(fh); if (err) @@ -340,6 +341,8 @@ err_flb_unpreserve: luo_flb_file_unpreserve(fh); err_erase_xa: xa_erase(&luo_preserved_files, luo_get_id(fh, file)); +err_module_put: + module_put(fh->ops->owner); err_free_files_mem: luo_free_files_mem(file_set); err_fput: @@ -382,6 +385,7 @@ void luo_file_unpreserve_files(struct luo_file_set *file_set) args.private_data = luo_file->private_data; luo_file->fh->ops->unpreserve(&args); luo_flb_file_unpreserve(luo_file->fh); + module_put(luo_file->fh->ops->owner); xa_erase(&luo_preserved_files, luo_get_id(luo_file->fh, luo_file->file)); @@ -673,6 +677,7 @@ static void luo_file_finish_one(struct luo_file_set *file_set, luo_file->fh->ops->finish(&args); luo_flb_file_finish(luo_file->fh); + module_put(luo_file->fh->ops->owner); } /** @@ -810,7 +815,8 @@ int luo_file_deserialize(struct luo_file_set *file_set, down_read(&luo_register_rwlock); list_private_for_each_entry(fh, &luo_file_handler_list, list) { if (!strcmp(fh->compatible, file_ser[i].compatible)) { - handler_found = true; + if (try_module_get(fh->ops->owner)) + handler_found = true; break; } } @@ -824,8 +830,10 @@ int luo_file_deserialize(struct luo_file_set *file_set, } luo_file = kzalloc_obj(*luo_file); - if (!luo_file) + if (!luo_file) { + module_put(fh->ops->owner); return -ENOMEM; + } luo_file->fh = fh; luo_file->file = NULL; @@ -886,12 +894,6 @@ int liveupdate_register_file_handler(struct liveupdate_file_handler *fh) } } - /* Pin the module implementing the handler */ - if (!try_module_get(fh->ops->owner)) { - err = -EAGAIN; - goto err_unlock; - } - INIT_LIST_HEAD(&ACCESS_PRIVATE(fh, flb_list)); INIT_LIST_HEAD(&ACCESS_PRIVATE(fh, list)); list_add_tail(&ACCESS_PRIVATE(fh, list), &luo_file_handler_list); @@ -921,6 +923,4 @@ void liveupdate_unregister_file_handler(struct liveupdate_file_handler *fh) guard(rwsem_write)(&luo_register_rwlock); luo_flb_unregister_all(fh); list_del(&ACCESS_PRIVATE(fh, list)); - - module_put(fh->ops->owner); } -- cgit v1.2.3 From d14514c66cb9721b54318850796c005c446d76d6 Mon Sep 17 00:00:00 2001 From: Suren Baghdasaryan Date: Sun, 22 Mar 2026 00:08:43 -0700 Subject: mm/vmscan: prevent MGLRU reclaim from pinning address space When shrinking lruvec, MGLRU pins address space before walking it. This is excessive since all it needs for walking the page range is a stable mm_struct to be able to take and release mmap_read_lock and a stable mm->mm_mt tree to walk. 
This address space pinning results in delays when releasing the memory of a dying process. This also prevents mm reapers (both in-kernel oom-reaper and userspace process_mrelease()) from doing their job during MGLRU scan because they check task_will_free_mem() which will yield negative result due to the elevated mm->mm_users. This affects the system in the sense that if the MM of the killed process is being reclaimed by kswapd then reapers won't be able to reap it. Even the process itself (which might have higher-priority than kswapd) will not free its memory until kswapd drops the last reference. IOW, we delay freeing the memory because kswapd is reclaiming it. In Android the visible result for us is that process_mrelease() (userspace reaper) skips MM in such cases and we see process memory not released for an unusually long time (secs). Replace unnecessary address space pinning with mm_struct pinning by replacing mmget/mmput with mmgrab/mmdrop calls. mm_mt is contained within mm_struct itself, therefore it won't be freed as long as mm_struct is stable and it won't change during the walk because mmap_read_lock is being held. Link: https://lore.kernel.org/20260322070843.941997-1-surenb@google.com Fixes: bd74fdaea146 ("mm: multi-gen LRU: support page table walks") Signed-off-by: Suren Baghdasaryan Reviewed-by: Lorenzo Stoakes (Oracle) Cc: Axel Rasmussen Cc: David Hildenbrand Cc: Johannes Weiner Cc: Liam Howlett Cc: Matthew Wilcox (Oracle) Cc: Michal Hocko Cc: Qi Zheng Cc: Shakeel Butt Cc: Suren Baghdasaryan Cc: Wei Xu Cc: Yuanchu Xie Cc: Yu Zhao Cc: Kalesh Singh Signed-off-by: Andrew Morton --- mm/vmscan.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/mm/vmscan.c b/mm/vmscan.c index 4f05a149a0dd..7fd97e0e0ab9 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2863,8 +2863,9 @@ static struct mm_struct *get_next_mm(struct lru_gen_mm_walk *walk) return NULL; clear_bit(key, &mm->lru_gen.bitmap); + mmgrab(mm); - return mmget_not_zero(mm) ? mm : NULL; + return mm; } void lru_gen_add_mm(struct mm_struct *mm) @@ -3064,7 +3065,7 @@ done: reset_bloom_filter(mm_state, walk->seq + 1); if (*iter) - mmput_async(*iter); + mmdrop(*iter); *iter = mm; -- cgit v1.2.3 From 6b1842775a460245e97d36d3a67d0cfba7c4ff79 Mon Sep 17 00:00:00 2001 From: Hao Ge Date: Tue, 31 Mar 2026 16:13:12 +0800 Subject: mm/alloc_tag: clear codetag for pages allocated before page_ext initialization Due to initialization ordering, page_ext is allocated and initialized relatively late during boot. Some pages have already been allocated and freed before page_ext becomes available, leaving their codetag uninitialized. A clear example is in init_section_page_ext(): alloc_page_ext() calls kmemleak_alloc(). If the slab cache has no free objects, it falls back to the buddy allocator to allocate memory. However, at this point page_ext is not yet fully initialized, so these newly allocated pages have no codetag set. These pages may later be reclaimed by KASAN, which causes the warning to trigger when they are freed because their codetag ref is still empty. Use a global array to track pages allocated before page_ext is fully initialized. The array size is fixed at 8192 entries, and will emit a warning if this limit is exceeded. When page_ext initialization completes, set their codetag to empty to avoid warnings when they are freed later. 
This warning is only observed with CONFIG_MEM_ALLOC_PROFILING_DEBUG=Y and mem_profiling_compressed disabled: [ 9.582133] ------------[ cut here ]------------ [ 9.582137] alloc_tag was not set [ 9.582139] WARNING: ./include/linux/alloc_tag.h:164 at __pgalloc_tag_sub+0x40f/0x550, CPU#5: systemd/1 [ 9.582190] CPU: 5 UID: 0 PID: 1 Comm: systemd Not tainted 7.0.0-rc4 #1 PREEMPT(lazy) [ 9.582192] Hardware name: Red Hat KVM, BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014 [ 9.582194] RIP: 0010:__pgalloc_tag_sub+0x40f/0x550 [ 9.582196] Code: 00 00 4c 29 e5 48 8b 05 1f 88 56 05 48 8d 4c ad 00 48 8d 2c c8 e9 87 fd ff ff 0f 0b 0f 0b e9 f3 fe ff ff 48 8d 3d 61 2f ed 03 <67> 48 0f b9 3a e9 b3 fd ff ff 0f 0b eb e4 e8 5e cd 14 02 4c 89 c7 [ 9.582197] RSP: 0018:ffffc9000001f940 EFLAGS: 00010246 [ 9.582200] RAX: dffffc0000000000 RBX: 1ffff92000003f2b RCX: 1ffff110200d806c [ 9.582201] RDX: ffff8881006c0360 RSI: 0000000000000004 RDI: ffffffff9bc7b460 [ 9.582202] RBP: 0000000000000000 R08: 0000000000000000 R09: fffffbfff3a62324 [ 9.582203] R10: ffffffff9d311923 R11: 0000000000000000 R12: ffffea0004001b00 [ 9.582204] R13: 0000000000002000 R14: ffffea0000000000 R15: ffff8881006c0360 [ 9.582206] FS: 00007ffbbcf2d940(0000) GS:ffff888450479000(0000) knlGS:0000000000000000 [ 9.582208] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 9.582210] CR2: 000055ee3aa260d0 CR3: 0000000148b67005 CR4: 0000000000770ef0 [ 9.582211] PKRU: 55555554 [ 9.582212] Call Trace: [ 9.582213] [ 9.582214] ? __pfx___pgalloc_tag_sub+0x10/0x10 [ 9.582216] ? check_bytes_and_report+0x68/0x140 [ 9.582219] __free_frozen_pages+0x2e4/0x1150 [ 9.582221] ? __free_slab+0xc2/0x2b0 [ 9.582224] qlist_free_all+0x4c/0xf0 [ 9.582227] kasan_quarantine_reduce+0x15d/0x180 [ 9.582229] __kasan_slab_alloc+0x69/0x90 [ 9.582232] kmem_cache_alloc_noprof+0x14a/0x500 [ 9.582234] do_getname+0x96/0x310 [ 9.582237] do_readlinkat+0x91/0x2f0 [ 9.582239] ? __pfx_do_readlinkat+0x10/0x10 [ 9.582240] ? get_random_bytes_user+0x1df/0x2c0 [ 9.582244] __x64_sys_readlinkat+0x96/0x100 [ 9.582246] do_syscall_64+0xce/0x650 [ 9.582250] ? __x64_sys_getrandom+0x13a/0x1e0 [ 9.582252] ? __pfx___x64_sys_getrandom+0x10/0x10 [ 9.582254] ? do_syscall_64+0x114/0x650 [ 9.582255] ? ksys_read+0xfc/0x1d0 [ 9.582258] ? __pfx_ksys_read+0x10/0x10 [ 9.582260] ? do_syscall_64+0x114/0x650 [ 9.582262] ? do_syscall_64+0x114/0x650 [ 9.582264] ? __pfx_fput_close_sync+0x10/0x10 [ 9.582266] ? file_close_fd_locked+0x178/0x2a0 [ 9.582268] ? __x64_sys_faccessat2+0x96/0x100 [ 9.582269] ? __x64_sys_close+0x7d/0xd0 [ 9.582271] ? do_syscall_64+0x114/0x650 [ 9.582273] ? do_syscall_64+0x114/0x650 [ 9.582275] ? clear_bhb_loop+0x50/0xa0 [ 9.582277] ? 
clear_bhb_loop+0x50/0xa0 [ 9.582279] entry_SYSCALL_64_after_hwframe+0x76/0x7e [ 9.582280] RIP: 0033:0x7ffbbda345ee [ 9.582282] Code: 0f 1f 40 00 48 8b 15 29 38 0d 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff c3 0f 1f 40 00 f3 0f 1e fa 49 89 ca b8 0b 01 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d fa 37 0d 00 f7 d8 64 89 01 48 [ 9.582284] RSP: 002b:00007ffe2ad8de58 EFLAGS: 00000202 ORIG_RAX: 000000000000010b [ 9.582286] RAX: ffffffffffffffda RBX: 000055ee3aa25570 RCX: 00007ffbbda345ee [ 9.582287] RDX: 000055ee3aa25570 RSI: 00007ffe2ad8dee0 RDI: 00000000ffffff9c [ 9.582288] RBP: 0000000000001000 R08: 0000000000000003 R09: 0000000000001001 [ 9.582289] R10: 0000000000001000 R11: 0000000000000202 R12: 0000000000000033 [ 9.582290] R13: 00007ffe2ad8dee0 R14: 00000000ffffff9c R15: 00007ffe2ad8deb0 [ 9.582292] [ 9.582293] ---[ end trace 0000000000000000 ]--- Link: https://lore.kernel.org/20260331081312.123719-1-hao.ge@linux.dev Fixes: dcfe378c81f72 ("lib: introduce support for page allocation tagging") Signed-off-by: Hao Ge Suggested-by: Suren Baghdasaryan Acked-by: Suren Baghdasaryan Cc: Kent Overstreet Cc: Signed-off-by: Andrew Morton --- include/linux/alloc_tag.h | 2 + include/linux/pgalloc_tag.h | 2 +- lib/alloc_tag.c | 109 ++++++++++++++++++++++++++++++++++++++++++++ mm/page_alloc.c | 10 +++- 4 files changed, 121 insertions(+), 2 deletions(-) diff --git a/include/linux/alloc_tag.h b/include/linux/alloc_tag.h index d40ac39bfbe8..02de2ede560f 100644 --- a/include/linux/alloc_tag.h +++ b/include/linux/alloc_tag.h @@ -163,9 +163,11 @@ static inline void alloc_tag_sub_check(union codetag_ref *ref) { WARN_ONCE(ref && !ref->ct, "alloc_tag was not set\n"); } +void alloc_tag_add_early_pfn(unsigned long pfn); #else static inline void alloc_tag_add_check(union codetag_ref *ref, struct alloc_tag *tag) {} static inline void alloc_tag_sub_check(union codetag_ref *ref) {} +static inline void alloc_tag_add_early_pfn(unsigned long pfn) {} #endif /* Caller should verify both ref and tag to be valid */ diff --git a/include/linux/pgalloc_tag.h b/include/linux/pgalloc_tag.h index 38a82d65e58e..951d33362268 100644 --- a/include/linux/pgalloc_tag.h +++ b/include/linux/pgalloc_tag.h @@ -181,7 +181,7 @@ static inline struct alloc_tag *__pgalloc_tag_get(struct page *page) if (get_page_tag_ref(page, &ref, &handle)) { alloc_tag_sub_check(&ref); - if (ref.ct) + if (ref.ct && !is_codetag_empty(&ref)) tag = ct_to_alloc_tag(ref.ct); put_page_tag_ref(handle); } diff --git a/lib/alloc_tag.c b/lib/alloc_tag.c index 58991ab09d84..ed1bdcf1f8ab 100644 --- a/lib/alloc_tag.c +++ b/lib/alloc_tag.c @@ -6,7 +6,9 @@ #include #include #include +#include #include +#include #include #include #include @@ -758,8 +760,115 @@ static __init bool need_page_alloc_tagging(void) return mem_profiling_support; } +#ifdef CONFIG_MEM_ALLOC_PROFILING_DEBUG +/* + * Track page allocations before page_ext is initialized. + * Some pages are allocated before page_ext becomes available, leaving + * their codetag uninitialized. Track these early PFNs so we can clear + * their codetag refs later to avoid warnings when they are freed. + * + * Early allocations include: + * - Base allocations independent of CPU count + * - Per-CPU allocations (e.g., CPU hotplug callbacks during smp_init, + * such as trace ring buffers, scheduler per-cpu data) + * + * For simplicity, we fix the size to 8192. + * If insufficient, a warning will be triggered to alert the user. 
+ * + * TODO: Replace fixed-size array with dynamic allocation using + * a GFP flag similar to ___GFP_NO_OBJ_EXT to avoid recursion. + */ +#define EARLY_ALLOC_PFN_MAX 8192 + +static unsigned long early_pfns[EARLY_ALLOC_PFN_MAX] __initdata; +static atomic_t early_pfn_count __initdata = ATOMIC_INIT(0); + +static void __init __alloc_tag_add_early_pfn(unsigned long pfn) +{ + int old_idx, new_idx; + + do { + old_idx = atomic_read(&early_pfn_count); + if (old_idx >= EARLY_ALLOC_PFN_MAX) { + pr_warn_once("Early page allocations before page_ext init exceeded EARLY_ALLOC_PFN_MAX (%d)\n", + EARLY_ALLOC_PFN_MAX); + return; + } + new_idx = old_idx + 1; + } while (!atomic_try_cmpxchg(&early_pfn_count, &old_idx, new_idx)); + + early_pfns[old_idx] = pfn; +} + +typedef void alloc_tag_add_func(unsigned long pfn); +static alloc_tag_add_func __rcu *alloc_tag_add_early_pfn_ptr __refdata = + RCU_INITIALIZER(__alloc_tag_add_early_pfn); + +void alloc_tag_add_early_pfn(unsigned long pfn) +{ + alloc_tag_add_func *alloc_tag_add; + + if (static_key_enabled(&mem_profiling_compressed)) + return; + + rcu_read_lock(); + alloc_tag_add = rcu_dereference(alloc_tag_add_early_pfn_ptr); + if (alloc_tag_add) + alloc_tag_add(pfn); + rcu_read_unlock(); +} + +static void __init clear_early_alloc_pfn_tag_refs(void) +{ + unsigned int i; + + if (static_key_enabled(&mem_profiling_compressed)) + return; + + rcu_assign_pointer(alloc_tag_add_early_pfn_ptr, NULL); + /* Make sure we are not racing with __alloc_tag_add_early_pfn() */ + synchronize_rcu(); + + for (i = 0; i < atomic_read(&early_pfn_count); i++) { + unsigned long pfn = early_pfns[i]; + + if (pfn_valid(pfn)) { + struct page *page = pfn_to_page(pfn); + union pgtag_ref_handle handle; + union codetag_ref ref; + + if (get_page_tag_ref(page, &ref, &handle)) { + /* + * An early-allocated page could be freed and reallocated + * after its page_ext is initialized but before we clear it. + * In that case, it already has a valid tag set. + * We should not overwrite that valid tag with CODETAG_EMPTY. + * + * Note: there is still a small race window between checking + * ref.ct and calling set_codetag_empty(). We accept this + * race as it's unlikely and the extra complexity of atomic + * cmpxchg is not worth it for this debug-only code path. + */ + if (ref.ct) { + put_page_tag_ref(handle); + continue; + } + + set_codetag_empty(&ref); + update_page_tag_ref(handle, &ref); + put_page_tag_ref(handle); + } + } + + } +} +#else /* !CONFIG_MEM_ALLOC_PROFILING_DEBUG */ +static inline void __init clear_early_alloc_pfn_tag_refs(void) {} +#endif /* CONFIG_MEM_ALLOC_PROFILING_DEBUG */ + static __init void init_page_alloc_tagging(void) { + clear_early_alloc_pfn_tag_refs(); } struct page_ext_operations page_alloc_tagging_ops = { diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 111b54df8a3c..b1c5430cad4e 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1252,10 +1252,18 @@ void __pgalloc_tag_add(struct page *page, struct task_struct *task, union pgtag_ref_handle handle; union codetag_ref ref; - if (get_page_tag_ref(page, &ref, &handle)) { + if (likely(get_page_tag_ref(page, &ref, &handle))) { alloc_tag_add(&ref, task->alloc_tag, PAGE_SIZE * nr); update_page_tag_ref(handle, &ref); put_page_tag_ref(handle); + } else { + /* + * page_ext is not available yet, record the pfn so we can + * clear the tag ref later when page_ext is initialized. 
+ */ + alloc_tag_add_early_pfn(page_to_pfn(page)); + if (task->alloc_tag) + alloc_tag_set_inaccurate(task->alloc_tag); } } -- cgit v1.2.3
From 1556478e9e86585d4c48fcddb8f490713bd78156 Mon Sep 17 00:00:00 2001 From: "Kanchana P. Sridhar" Date: Tue, 31 Mar 2026 11:33:50 -0700 Subject: mm: zswap: remove redundant checks in zswap_cpu_comp_dead()
Patch series "zswap pool per-CPU acomp_ctx simplifications", v3. This patchset first removes redundant checks on the acomp_ctx and its "req" member in zswap_cpu_comp_dead(). Next, it persists the zswap pool's per-CPU acomp_ctx resources to last until the pool is destroyed. It then simplifies the per-CPU acomp_ctx mutex locking in zswap_compress()/zswap_decompress(). Code comments are added after allocation and before checking whether to deallocate the per-CPU acomp_ctx's members, based on expected crypto API return values and zswap changes this patchset makes. Patch 2 is an independent submission of patch 23 from [1], to facilitate merging.
This patch (of 2): There are presently redundant checks on the per-CPU acomp_ctx and its "req" member in zswap_cpu_comp_dead(): redundant because they are inconsistent with zswap_pool_create() handling of failure in allocating the acomp_ctx, and with the expected NULL return value from the acomp_request_alloc() API when it fails to allocate an acomp_req. Fix these by converting them to NULL checks. Add comments in zswap_cpu_comp_prepare() clarifying the expected return values of the crypto_alloc_acomp_node() and acomp_request_alloc() APIs.
Link: https://lore.kernel.org/20260331183351.29844-2-kanchanapsridhar2026@gmail.com Link: https://patchwork.kernel.org/project/linux-mm/list/?series=1046677 Signed-off-by: Kanchana P. Sridhar Suggested-by: Yosry Ahmed Acked-by: Yosry Ahmed Signed-off-by: Andrew Morton --- mm/zswap.c | 12 ++++++++++-- 1 file changed, 10 insertions(+), 2 deletions(-)
diff --git a/mm/zswap.c b/mm/zswap.c index 4f2e652e8ad3..c59045b59ffe 100644 --- a/mm/zswap.c +++ b/mm/zswap.c @@ -749,6 +749,10 @@ static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node) goto fail; } + /* + * In case of an error, crypto_alloc_acomp_node() returns an + * error pointer, never NULL. + */ acomp = crypto_alloc_acomp_node(pool->tfm_name, 0, 0, cpu_to_node(cpu)); if (IS_ERR(acomp)) { pr_err("could not alloc crypto acomp %s : %pe\n", @@ -757,6 +761,7 @@ goto fail; } + /* acomp_request_alloc() returns NULL in case of an error. */ req = acomp_request_alloc(acomp); if (!req) { pr_err("could not alloc crypto acomp_request %s\n", @@ -802,7 +807,7 @@ static int zswap_cpu_comp_dead(unsigned int cpu, struct hlist_node *node) struct crypto_acomp *acomp; u8 *buffer; - if (IS_ERR_OR_NULL(acomp_ctx)) + if (!acomp_ctx) return 0; mutex_lock(&acomp_ctx->mutex); @@ -817,8 +822,11 @@ /* * Do the actual freeing after releasing the mutex to avoid subtle * locking dependencies causing deadlocks. + * + * If there was an error in allocating @acomp_ctx->req, it + * would be set to NULL. */ - if (!IS_ERR_OR_NULL(req)) + if (req) acomp_request_free(req); if (!IS_ERR_OR_NULL(acomp)) crypto_free_acomp(acomp); -- cgit v1.2.3 From ef3c0f6cb798e2602a8d8ee3f669fb1cc52345ce Mon Sep 17 00:00:00 2001 From: "Kanchana P.
Sridhar" Date: Tue, 31 Mar 2026 11:33:51 -0700 Subject: mm: zswap: tie per-CPU acomp_ctx lifetime to the pool Currently, per-CPU acomp_ctx are allocated on pool creation and/or CPU hotplug, and destroyed on pool destruction or CPU hotunplug. This complicates the lifetime management to save memory while a CPU is offlined, which is not very common. Simplify lifetime management by allocating per-CPU acomp_ctx once on pool creation (or CPU hotplug for CPUs onlined later), and keeping them allocated until the pool is destroyed. Refactor cleanup code from zswap_cpu_comp_dead() into acomp_ctx_free() to be used elsewhere. The main benefit of using the CPU hotplug multi state instance startup callback to allocate the acomp_ctx resources is that it prevents the cores from being offlined until the multi state instance addition call returns. From Documentation/core-api/cpu_hotplug.rst: "The node list add/remove operations and the callback invocations are serialized against CPU hotplug operations." Furthermore, zswap_[de]compress() cannot contend with zswap_cpu_comp_prepare() because: - During pool creation/deletion, the pool is not in the zswap_pools list. - During CPU hot[un]plug, the CPU is not yet online, as Yosry pointed out. zswap_cpu_comp_prepare() will be run on a control CPU, since CPUHP_MM_ZSWP_POOL_PREPARE is in the PREPARE section of "enum cpuhp_state". In both these cases, any recursions into zswap reclaim from zswap_cpu_comp_prepare() will be handled by the old pool. The above two observations enable the following simplifications: 1) zswap_cpu_comp_prepare(): a) acomp_ctx mutex locking: If the process gets migrated while zswap_cpu_comp_prepare() is running, it will complete on the new CPU. In case of failures, we pass the acomp_ctx pointer obtained at the start of zswap_cpu_comp_prepare() to acomp_ctx_free(), which again, can only undergo migration. There appear to be no contention scenarios that might cause inconsistent values of acomp_ctx's members. Hence, it seems there is no need for mutex_lock(&acomp_ctx->mutex) in zswap_cpu_comp_prepare(). b) acomp_ctx mutex initialization: Since the pool is not yet on zswap_pools list, we don't need to initialize the per-CPU acomp_ctx mutex in zswap_pool_create(). This has been restored to occur in zswap_cpu_comp_prepare(). c) Subsequent CPU offline-online transitions: zswap_cpu_comp_prepare() checks upfront if acomp_ctx->acomp is valid. If so, it returns success. This should handle any CPU hotplug online-offline transitions after pool creation is done. 2) CPU offline vis-a-vis zswap ops: Let's suppose the process is migrated to another CPU before the current CPU is dysfunctional. If zswap_[de]compress() holds the acomp_ctx->mutex lock of the offlined CPU, that mutex will be released once it completes on the new CPU. Since there is no teardown callback, there is no possibility of UAF. 3) Pool creation/deletion and process migration to another CPU: During pool creation/deletion, the pool is not in the zswap_pools list. Hence it cannot contend with zswap ops on that CPU. However, the process can get migrated. a) Pool creation --> zswap_cpu_comp_prepare() --> process migrated: * Old CPU offline: no-op. * zswap_cpu_comp_prepare() continues to run on the new CPU to finish allocating acomp_ctx resources for the offlined CPU. b) Pool deletion --> acomp_ctx_free() --> process migrated: * Old CPU offline: no-op. * acomp_ctx_free() continues to run on the new CPU to finish de-allocating acomp_ctx resources for the offlined CPU. 
4) Pool deletion vis-a-vis CPU onlining: The call to cpuhp_state_remove_instance() cannot race with zswap_cpu_comp_prepare() because of hotplug synchronization. The current acomp_ctx_get_cpu_lock()/acomp_ctx_put_unlock() are deleted. Instead, zswap_[de]compress() directly call mutex_[un]lock(&acomp_ctx->mutex). The per-CPU memory cost of not deleting the acomp_ctx resources upon CPU offlining, and only deleting them when the pool is destroyed, is 8.28 KB on x86_64. This cost is only paid when a CPU is offlined, until it is onlined again. Link: https://lore.kernel.org/20260331183351.29844-3-kanchanapsridhar2026@gmail.com Co-developed-by: Kanchana P. Sridhar Signed-off-by: Kanchana P. Sridhar Signed-off-by: Kanchana P Sridhar Acked-by: Yosry Ahmed Cc: Chengming Zhou Cc: Herbert Xu Cc: Johannes Weiner Cc: Nhat Pham Cc: Sergey Senozhatsky Signed-off-by: Andrew Morton --- mm/zswap.c | 180 +++++++++++++++++++++++++++---------------------------------- 1 file changed, 80 insertions(+), 100 deletions(-) diff --git a/mm/zswap.c b/mm/zswap.c index c59045b59ffe..4b5149173b0e 100644 --- a/mm/zswap.c +++ b/mm/zswap.c @@ -242,6 +242,34 @@ static inline struct xarray *swap_zswap_tree(swp_entry_t swp) **********************************/ static void __zswap_pool_empty(struct percpu_ref *ref); +static void acomp_ctx_free(struct crypto_acomp_ctx *acomp_ctx) +{ + if (!acomp_ctx) + return; + + /* + * If there was an error in allocating @acomp_ctx->req, it + * would be set to NULL. + */ + if (acomp_ctx->req) + acomp_request_free(acomp_ctx->req); + + acomp_ctx->req = NULL; + + /* + * We have to handle both cases here: an error pointer return from + * crypto_alloc_acomp_node(); and a) NULL initialization by zswap, or + * b) NULL assignment done in a previous call to acomp_ctx_free(). + */ + if (!IS_ERR_OR_NULL(acomp_ctx->acomp)) + crypto_free_acomp(acomp_ctx->acomp); + + acomp_ctx->acomp = NULL; + + kfree(acomp_ctx->buffer); + acomp_ctx->buffer = NULL; +} + static struct zswap_pool *zswap_pool_create(char *compressor) { struct zswap_pool *pool; @@ -263,19 +291,27 @@ static struct zswap_pool *zswap_pool_create(char *compressor) strscpy(pool->tfm_name, compressor, sizeof(pool->tfm_name)); - pool->acomp_ctx = alloc_percpu(*pool->acomp_ctx); + /* Many things rely on the zero-initialization. */ + pool->acomp_ctx = alloc_percpu_gfp(*pool->acomp_ctx, + GFP_KERNEL | __GFP_ZERO); if (!pool->acomp_ctx) { pr_err("percpu alloc failed\n"); goto error; } - for_each_possible_cpu(cpu) - mutex_init(&per_cpu_ptr(pool->acomp_ctx, cpu)->mutex); - + /* + * This is serialized against CPU hotplug operations. Hence, cores + * cannot be offlined until this finishes. + */ ret = cpuhp_state_add_instance(CPUHP_MM_ZSWP_POOL_PREPARE, &pool->node); + + /* + * cpuhp_state_add_instance() will not cleanup on failure since + * we don't register a hotunplug callback. 
+ */ if (ret) - goto error; + goto cpuhp_add_fail; /* being the current pool takes 1 ref; this func expects the * caller to always add the new pool as the current pool @@ -292,6 +328,10 @@ static struct zswap_pool *zswap_pool_create(char *compressor) ref_fail: cpuhp_state_remove_instance(CPUHP_MM_ZSWP_POOL_PREPARE, &pool->node); + +cpuhp_add_fail: + for_each_possible_cpu(cpu) + acomp_ctx_free(per_cpu_ptr(pool->acomp_ctx, cpu)); error: if (pool->acomp_ctx) free_percpu(pool->acomp_ctx); @@ -322,9 +362,15 @@ static struct zswap_pool *__zswap_pool_create_fallback(void) static void zswap_pool_destroy(struct zswap_pool *pool) { + int cpu; + zswap_pool_debug("destroying", pool); cpuhp_state_remove_instance(CPUHP_MM_ZSWP_POOL_PREPARE, &pool->node); + + for_each_possible_cpu(cpu) + acomp_ctx_free(per_cpu_ptr(pool->acomp_ctx, cpu)); + free_percpu(pool->acomp_ctx); zs_destroy_pool(pool->zs_pool); @@ -738,44 +784,41 @@ static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node) { struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node); struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu); - struct crypto_acomp *acomp = NULL; - struct acomp_req *req = NULL; - u8 *buffer = NULL; - int ret; + int ret = -ENOMEM; - buffer = kmalloc_node(PAGE_SIZE, GFP_KERNEL, cpu_to_node(cpu)); - if (!buffer) { - ret = -ENOMEM; - goto fail; + /* + * To handle cases where the CPU goes through online-offline-online + * transitions, we return if the acomp_ctx has already been initialized. + */ + if (acomp_ctx->acomp) { + WARN_ON_ONCE(IS_ERR(acomp_ctx->acomp)); + return 0; } + acomp_ctx->buffer = kmalloc_node(PAGE_SIZE, GFP_KERNEL, cpu_to_node(cpu)); + if (!acomp_ctx->buffer) + return ret; + /* * In case of an error, crypto_alloc_acomp_node() returns an * error pointer, never NULL. */ - acomp = crypto_alloc_acomp_node(pool->tfm_name, 0, 0, cpu_to_node(cpu)); - if (IS_ERR(acomp)) { + acomp_ctx->acomp = crypto_alloc_acomp_node(pool->tfm_name, 0, 0, cpu_to_node(cpu)); + if (IS_ERR(acomp_ctx->acomp)) { pr_err("could not alloc crypto acomp %s : %pe\n", - pool->tfm_name, acomp); - ret = PTR_ERR(acomp); + pool->tfm_name, acomp_ctx->acomp); + ret = PTR_ERR(acomp_ctx->acomp); goto fail; } /* acomp_request_alloc() returns NULL in case of an error. */ - req = acomp_request_alloc(acomp); - if (!req) { + acomp_ctx->req = acomp_request_alloc(acomp_ctx->acomp); + if (!acomp_ctx->req) { pr_err("could not alloc crypto acomp_request %s\n", pool->tfm_name); - ret = -ENOMEM; goto fail; } - /* - * Only hold the mutex after completing allocations, otherwise we may - * recurse into zswap through reclaim and attempt to hold the mutex - * again resulting in a deadlock. - */ - mutex_lock(&acomp_ctx->mutex); crypto_init_wait(&acomp_ctx->wait); /* @@ -783,83 +826,17 @@ static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node) * crypto_wait_req(); if the backend of acomp is scomp, the callback * won't be called, crypto_wait_req() will return without blocking. 
*/ - acomp_request_set_callback(req, CRYPTO_TFM_REQ_MAY_BACKLOG, + acomp_request_set_callback(acomp_ctx->req, CRYPTO_TFM_REQ_MAY_BACKLOG, crypto_req_done, &acomp_ctx->wait); - acomp_ctx->buffer = buffer; - acomp_ctx->acomp = acomp; - acomp_ctx->req = req; - mutex_unlock(&acomp_ctx->mutex); + mutex_init(&acomp_ctx->mutex); return 0; fail: - if (!IS_ERR_OR_NULL(acomp)) - crypto_free_acomp(acomp); - kfree(buffer); + acomp_ctx_free(acomp_ctx); return ret; } -static int zswap_cpu_comp_dead(unsigned int cpu, struct hlist_node *node) -{ - struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node); - struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu); - struct acomp_req *req; - struct crypto_acomp *acomp; - u8 *buffer; - - if (!acomp_ctx) - return 0; - - mutex_lock(&acomp_ctx->mutex); - req = acomp_ctx->req; - acomp = acomp_ctx->acomp; - buffer = acomp_ctx->buffer; - acomp_ctx->req = NULL; - acomp_ctx->acomp = NULL; - acomp_ctx->buffer = NULL; - mutex_unlock(&acomp_ctx->mutex); - - /* - * Do the actual freeing after releasing the mutex to avoid subtle - * locking dependencies causing deadlocks. - * - * If there was an error in allocating @acomp_ctx->req, it - * would be set to NULL. - */ - if (req) - acomp_request_free(req); - if (!IS_ERR_OR_NULL(acomp)) - crypto_free_acomp(acomp); - kfree(buffer); - - return 0; -} - -static struct crypto_acomp_ctx *acomp_ctx_get_cpu_lock(struct zswap_pool *pool) -{ - struct crypto_acomp_ctx *acomp_ctx; - - for (;;) { - acomp_ctx = raw_cpu_ptr(pool->acomp_ctx); - mutex_lock(&acomp_ctx->mutex); - if (likely(acomp_ctx->req)) - return acomp_ctx; - /* - * It is possible that we were migrated to a different CPU after - * getting the per-CPU ctx but before the mutex was acquired. If - * the old CPU got offlined, zswap_cpu_comp_dead() could have - * already freed ctx->req (among other things) and set it to - * NULL. Just try again on the new CPU that we ended up on. - */ - mutex_unlock(&acomp_ctx->mutex); - } -} - -static void acomp_ctx_put_unlock(struct crypto_acomp_ctx *acomp_ctx) -{ - mutex_unlock(&acomp_ctx->mutex); -} - static bool zswap_compress(struct page *page, struct zswap_entry *entry, struct zswap_pool *pool) { @@ -872,7 +849,9 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry, u8 *dst; bool mapped = false; - acomp_ctx = acomp_ctx_get_cpu_lock(pool); + acomp_ctx = raw_cpu_ptr(pool->acomp_ctx); + mutex_lock(&acomp_ctx->mutex); + dst = acomp_ctx->buffer; sg_init_table(&input, 1); sg_set_page(&input, page, PAGE_SIZE, 0); @@ -938,7 +917,7 @@ unlock: else if (alloc_ret) zswap_reject_alloc_fail++; - acomp_ctx_put_unlock(acomp_ctx); + mutex_unlock(&acomp_ctx->mutex); return comp_ret == 0 && alloc_ret == 0; } @@ -950,7 +929,8 @@ static bool zswap_decompress(struct zswap_entry *entry, struct folio *folio) struct crypto_acomp_ctx *acomp_ctx; int ret = 0, dlen; - acomp_ctx = acomp_ctx_get_cpu_lock(pool); + acomp_ctx = raw_cpu_ptr(pool->acomp_ctx); + mutex_lock(&acomp_ctx->mutex); zs_obj_read_sg_begin(pool->zs_pool, entry->handle, input, entry->length); /* zswap entries of length PAGE_SIZE are not compressed. 
*/ @@ -975,7 +955,7 @@ static bool zswap_decompress(struct zswap_entry *entry, struct folio *folio) } zs_obj_read_sg_end(pool->zs_pool, entry->handle); - acomp_ctx_put_unlock(acomp_ctx); + mutex_unlock(&acomp_ctx->mutex); if (!ret && dlen == PAGE_SIZE) return true; @@ -1795,7 +1775,7 @@ static int zswap_setup(void) ret = cpuhp_setup_state_multi(CPUHP_MM_ZSWP_POOL_PREPARE, "mm/zswap_pool:prepare", zswap_cpu_comp_prepare, - zswap_cpu_comp_dead); + NULL); if (ret) goto hp_fail; -- cgit v1.2.3
From 55da81663b9642dd046b26dd6f1baddbcf337c1e Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Fri, 27 Mar 2026 16:33:14 -0700 Subject: mm/damon/core: fix damon_call() vs kdamond_fn() exit race
Patch series "mm/damon/core: fix damon_call()/damos_walk() vs kdamond exit race". damon_call() and damos_walk() can leak memory and/or deadlock when they race with kdamond terminations. Fix those.
This patch (of 2): When the kdamond_fn() main loop is finished, the function cancels all remaining damon_call() requests and unsets the damon_ctx->kdamond so that API callers and API functions themselves can know the context is terminated. damon_call() adds the caller's request to the queue first. After that, it checks whether the kdamond of the damon_ctx is still running (damon_ctx->kdamond is set). Only if the kdamond is running, damon_call() starts waiting for the kdamond's handling of the newly added request. The damon_call() requests registration and damon_ctx->kdamond unset are protected by different mutexes, though. Hence, damon_call() could race with damon_ctx->kdamond unset, and result in deadlocks. For example, let's suppose kdamond successfully finished the damon_call() requests cancelling. Right after that, damon_call() is called for the context. It registers the new request, and sees the context as still running, because damon_ctx->kdamond unset is not yet done. Hence the damon_call() caller starts waiting for the handling of the request. However, the kdamond is already on the termination steps, so it never handles the new request. As a result, the damon_call() caller thread waits indefinitely.
Fix this by introducing another damon_ctx field, namely call_controls_obsolete. It is protected by the damon_ctx->call_controls_lock, which protects damon_call() requests registration. Initialize (unset) it in kdamond_fn() before letting damon_start() return and set it just before the cancelling of remaining damon_call() requests is executed. damon_call() reads the obsolete field under the lock and avoids adding a new request. After this change, only requests that are guaranteed to be handled or cancelled are registered. Hence the after-registration DAMON context termination check is no longer needed. Remove it together.
Note that the deadlock will not happen when damon_call() is called for a repeat mode request. In this case, damon_call() returns instead of waiting for the handling when the request registration succeeds and the kdamond appears to be running. However, if the request also has dealloc_on_cancel, the request memory would be leaked. The issue is found by sashiko [1].
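For reference, the essence of the fix is that the caller's enqueue check and the kdamond's retirement of the request queue now serialize on the same mutex. A minimal user-space sketch of that pattern follows; the names and the pthread setting are hypothetical, and this is only an illustration of the idea, not the DAMON code itself:

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

struct request { struct request *next; };

static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;
static struct request *queue_head;
static bool queue_obsolete;	/* plays the role of call_controls_obsolete */

/* Caller side: refuse to enqueue once the worker has retired the queue. */
static int submit(struct request *req)
{
	pthread_mutex_lock(&queue_lock);
	if (queue_obsolete) {
		pthread_mutex_unlock(&queue_lock);
		return -1;	/* analogous to returning -ECANCELED */
	}
	req->next = queue_head;
	queue_head = req;
	pthread_mutex_unlock(&queue_lock);
	return 0;	/* worker is now guaranteed to handle or cancel req */
}

/*
 * Worker exit path: mark the queue obsolete under the same lock *before*
 * cancelling leftovers, so nothing new can slip in afterwards.
 */
static void worker_shutdown(void)
{
	pthread_mutex_lock(&queue_lock);
	queue_obsolete = true;
	pthread_mutex_unlock(&queue_lock);
	/* ... cancel whatever is still queued ... */
}

int main(void)
{
	struct request r;

	worker_shutdown();
	printf("submit after shutdown: %d\n", submit(&r));	/* prints -1 */
	return 0;
}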
Link: https://lore.kernel.org/20260327233319.3528-1-sj@kernel.org Link: https://lore.kernel.org/20260327233319.3528-2-sj@kernel.org Link: https://lore.kernel.org/20260325141956.87144-1-sj@kernel.org [1] Fixes: 42b7491af14c ("mm/damon/core: introduce damon_call()") Signed-off-by: SeongJae Park Cc: # 6.14.x Signed-off-by: Andrew Morton --- include/linux/damon.h | 1 + mm/damon/core.c | 45 ++++++++++++++------------------------------- 2 files changed, 15 insertions(+), 31 deletions(-) diff --git a/include/linux/damon.h b/include/linux/damon.h index d9a3babbafc1..5129de70e7b7 100644 --- a/include/linux/damon.h +++ b/include/linux/damon.h @@ -818,6 +818,7 @@ struct damon_ctx { /* lists of &struct damon_call_control */ struct list_head call_controls; + bool call_controls_obsolete; struct mutex call_controls_lock; struct damos_walk_control *walk_control; diff --git a/mm/damon/core.c b/mm/damon/core.c index db6c67e52d2b..9bcda2765ac9 100644 --- a/mm/damon/core.c +++ b/mm/damon/core.c @@ -1573,35 +1573,6 @@ int damon_kdamond_pid(struct damon_ctx *ctx) return pid; } -/* - * damon_call_handle_inactive_ctx() - handle DAMON call request that added to - * an inactive context. - * @ctx: The inactive DAMON context. - * @control: Control variable of the call request. - * - * This function is called in a case that @control is added to @ctx but @ctx is - * not running (inactive). See if @ctx handled @control or not, and cleanup - * @control if it was not handled. - * - * Returns 0 if @control was handled by @ctx, negative error code otherwise. - */ -static int damon_call_handle_inactive_ctx( - struct damon_ctx *ctx, struct damon_call_control *control) -{ - struct damon_call_control *c; - - mutex_lock(&ctx->call_controls_lock); - list_for_each_entry(c, &ctx->call_controls, list) { - if (c == control) { - list_del(&control->list); - mutex_unlock(&ctx->call_controls_lock); - return -EINVAL; - } - } - mutex_unlock(&ctx->call_controls_lock); - return 0; -} - /** * damon_call() - Invoke a given function on DAMON worker thread (kdamond). * @ctx: DAMON context to call the function for. @@ -1619,6 +1590,10 @@ static int damon_call_handle_inactive_ctx( * synchronization. The return value of the function will be saved in * &damon_call_control->return_code. * + * Note that this function should be called only after damon_start() with the + * @ctx has succeeded. Otherwise, this function could fall into an indefinite + * wait. + * * Return: 0 on success, negative error code otherwise. 
*/ int damon_call(struct damon_ctx *ctx, struct damon_call_control *control) @@ -1629,10 +1604,12 @@ int damon_call(struct damon_ctx *ctx, struct damon_call_control *control) INIT_LIST_HEAD(&control->list); mutex_lock(&ctx->call_controls_lock); + if (ctx->call_controls_obsolete) { + mutex_unlock(&ctx->call_controls_lock); + return -ECANCELED; + } list_add_tail(&control->list, &ctx->call_controls); mutex_unlock(&ctx->call_controls_lock); - if (!damon_is_running(ctx)) - return damon_call_handle_inactive_ctx(ctx, control); if (control->repeat) return 0; wait_for_completion(&control->completion); @@ -2952,6 +2929,9 @@ static int kdamond_fn(void *data) pr_debug("kdamond (%d) starts\n", current->pid); + mutex_lock(&ctx->call_controls_lock); + ctx->call_controls_obsolete = false; + mutex_unlock(&ctx->call_controls_lock); complete(&ctx->kdamond_started); kdamond_init_ctx(ctx); @@ -3062,6 +3042,9 @@ done: damon_destroy_targets(ctx); kfree(ctx->regions_score_histogram); + mutex_lock(&ctx->call_controls_lock); + ctx->call_controls_obsolete = true; + mutex_unlock(&ctx->call_controls_lock); kdamond_call(ctx, true); damos_walk_cancel(ctx); -- cgit v1.2.3
From 33c3f6c2b48cd84b441dba1ee3e62290e53930f4 Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Fri, 27 Mar 2026 16:33:15 -0700 Subject: mm/damon/core: fix damos_walk() vs kdamond_fn() exit race
When the kdamond_fn() main loop is finished, the function cancels the remaining damos_walk() request and unsets the damon_ctx->kdamond so that API callers and API functions themselves can know the context is terminated. damos_walk() adds the caller's request to the queue first. After that, it checks whether the kdamond of the damon_ctx is still running (damon_ctx->kdamond is set). Only if the kdamond is running, damos_walk() starts waiting for the kdamond's handling of the newly added request. The damos_walk() requests registration and damon_ctx->kdamond unset are protected by different mutexes, though. Hence, damos_walk() could race with damon_ctx->kdamond unset, and result in deadlocks. For example, let's suppose kdamond successfully finished the damos_walk() request cancelling. Right after that, damos_walk() is called for the context. It registers the new request, and sees the context as still running, because damon_ctx->kdamond unset is not yet done. Hence the damos_walk() caller starts waiting for the handling of the request. However, the kdamond is already on the termination steps, so it never handles the new request. As a result, the damos_walk() caller thread waits indefinitely.
Fix this by introducing another damon_ctx field, namely walk_control_obsolete. It is protected by the damon_ctx->walk_control_lock, which protects damos_walk() request registration. Initialize (unset) it in kdamond_fn() before letting damon_start() return and set it just before the cancelling of the remaining damos_walk() request is executed. damos_walk() reads the obsolete field under the lock and avoids adding a new request. After this change, only requests that are guaranteed to be handled or cancelled are registered. Hence the after-registration DAMON context termination check is no longer needed. Remove it together. The issue is found by sashiko [1].
Link: https://lore.kernel.org/20260327233319.3528-3-sj@kernel.org Link: https://lore.kernel.org/20260325141956.87144-1-sj@kernel.org [1] Fixes: bf0eaba0ff9c ("mm/damon/core: implement damos_walk()") Signed-off-by: SeongJae Park Cc: # 6.14.x Signed-off-by: Andrew Morton --- include/linux/damon.h | 1 + mm/damon/core.c | 21 ++++++++++++++------- 2 files changed, 15 insertions(+), 7 deletions(-) diff --git a/include/linux/damon.h b/include/linux/damon.h index 5129de70e7b7..f2cdb7c3f5e6 100644 --- a/include/linux/damon.h +++ b/include/linux/damon.h @@ -822,6 +822,7 @@ struct damon_ctx { struct mutex call_controls_lock; struct damos_walk_control *walk_control; + bool walk_control_obsolete; struct mutex walk_control_lock; /* diff --git a/mm/damon/core.c b/mm/damon/core.c index 9bcda2765ac9..ddabb93f2377 100644 --- a/mm/damon/core.c +++ b/mm/damon/core.c @@ -1637,6 +1637,10 @@ int damon_call(struct damon_ctx *ctx, struct damon_call_control *control) * passed at least one &damos->apply_interval_us, kdamond marks the request as * completed so that damos_walk() can wakeup and return. * + * Note that this function should be called only after damon_start() with the + * @ctx has succeeded. Otherwise, this function could fall into an indefinite + * wait. + * * Return: 0 on success, negative error code otherwise. */ int damos_walk(struct damon_ctx *ctx, struct damos_walk_control *control) @@ -1644,19 +1648,16 @@ int damos_walk(struct damon_ctx *ctx, struct damos_walk_control *control) init_completion(&control->completion); control->canceled = false; mutex_lock(&ctx->walk_control_lock); + if (ctx->walk_control_obsolete) { + mutex_unlock(&ctx->walk_control_lock); + return -ECANCELED; + } if (ctx->walk_control) { mutex_unlock(&ctx->walk_control_lock); return -EBUSY; } ctx->walk_control = control; mutex_unlock(&ctx->walk_control_lock); - if (!damon_is_running(ctx)) { - mutex_lock(&ctx->walk_control_lock); - if (ctx->walk_control == control) - ctx->walk_control = NULL; - mutex_unlock(&ctx->walk_control_lock); - return -EINVAL; - } wait_for_completion(&control->completion); if (control->canceled) return -ECANCELED; @@ -2932,6 +2933,9 @@ static int kdamond_fn(void *data) mutex_lock(&ctx->call_controls_lock); ctx->call_controls_obsolete = false; mutex_unlock(&ctx->call_controls_lock); + mutex_lock(&ctx->walk_control_lock); + ctx->walk_control_obsolete = false; + mutex_unlock(&ctx->walk_control_lock); complete(&ctx->kdamond_started); kdamond_init_ctx(ctx); @@ -3046,6 +3050,9 @@ done: ctx->call_controls_obsolete = true; mutex_unlock(&ctx->call_controls_lock); kdamond_call(ctx, true); + mutex_lock(&ctx->walk_control_lock); + ctx->walk_control_obsolete = true; + mutex_unlock(&ctx->walk_control_lock); damos_walk_cancel(ctx); pr_debug("kdamond (%d) finishes\n", current->pid); -- cgit v1.2.3 From e04ed278d25bf15769800bf6e35c6737f137186f Mon Sep 17 00:00:00 2001 From: Jackie Liu Date: Tue, 31 Mar 2026 18:15:53 +0800 Subject: mm/damon/stat: fix memory leak on damon_start() failure in damon_stat_start() Destroy the DAMON context and reset the global pointer when damon_start() fails. Otherwise, the context allocated by damon_stat_build_ctx() is leaked, and the stale damon_stat_context pointer will be overwritten on the next enable attempt, making the old allocation permanently unreachable. 
Link: https://lore.kernel.org/20260331101553.88422-1-liu.yun@linux.dev Fixes: 369c415e6073 ("mm/damon: introduce DAMON_STAT module") Signed-off-by: Jackie Liu Reviewed-by: SeongJae Park Cc: # 6.17.x Signed-off-by: Andrew Morton --- mm/damon/stat.c | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/mm/damon/stat.c b/mm/damon/stat.c index cf2c5a541eee..5a742fc157e4 100644 --- a/mm/damon/stat.c +++ b/mm/damon/stat.c @@ -249,8 +249,11 @@ static int damon_stat_start(void) if (!damon_stat_context) return -ENOMEM; err = damon_start(&damon_stat_context, 1, true); - if (err) + if (err) { + damon_destroy_ctx(damon_stat_context); + damon_stat_context = NULL; return err; + } damon_stat_last_refresh_jiffies = jiffies; call_control.data = damon_stat_context; -- cgit v1.2.3
From 40250b2dded0604a112be605f3828700d80ad7c2 Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Sat, 28 Mar 2026 21:38:59 -0700 Subject: mm/damon/core: validate damos_quota_goal->nid for node_mem_{used,free}_bp
Patch series "mm/damon/core: validate damos_quota_goal->nid". node_mem[cg]_{used,free}_bp DAMOS quota goals receive the node id. The node id is used for si_meminfo_node() and NODE_DATA() without proper validation. As a result, privileged users can trigger an out of bounds memory access using DAMON_SYSFS. Fix the issues. The issue was originally reported [1] with a fix by another author. The original author announced [2] that they will stop working on this, including the fix that was still in the review stage. Hence I'm restarting this.
This patch (of 2): Users can set damos_quota_goal->nid with an arbitrary value for node_mem_{used,free}_bp. But DAMON core is using those for si_meminfo_node() without validating the value. This can result in an out-of-bounds memory access. The issue can actually be triggered using the DAMON user-space tool (damo), like below.
$ sudo ./damo start --damos_action stat \ --damos_quota_goal node_mem_used_bp 50% -1 \ --damos_quota_interval 1s $ sudo dmesg [...] [ 65.565986] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000098
Fix this issue by adding the validation of the given node. If an invalid node id is given, it returns 0% for used memory ratio, and 100% for free memory ratio.
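As background (not part of the patch itself): the quota goal metrics are expressed in basis points, where 10000 corresponds to 100%, so the fallback values of 0 and 10000 in the fix below are the 0% and 100% mentioned above. A small stand-alone sketch of that arithmetic, using made-up page counts:

#include <stdio.h>

/* ratio in basis points: 10000 bp == 100% */
static unsigned long mem_bp(unsigned long part, unsigned long total)
{
	return total ? part * 10000UL / total : 0;
}

int main(void)
{
	unsigned long total = 1000;	/* hypothetical page counts */
	unsigned long free_pages = 250;

	printf("used: %lu bp, free: %lu bp\n",
	       mem_bp(total - free_pages, total), mem_bp(free_pages, total));
	return 0;	/* prints "used: 7500 bp, free: 2500 bp" */
}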
Link: https://lore.kernel.org/20260329043902.46163-2-sj@kernel.org Link: https://lore.kernel.org/20260325073034.140353-1-objecting@objecting.org [1] Link: https://lore.kernel.org/20260327040924.68553-1-sj@kernel.org [2] Fixes: 0e1c773b501f ("mm/damon/core: introduce damos quota goal metrics for memory node utilization") Signed-off-by: SeongJae Park Cc: # 6.16.x Signed-off-by: Andrew Morton --- mm/damon/core.c | 12 ++++++++++++ 1 file changed, 12 insertions(+)
diff --git a/mm/damon/core.c b/mm/damon/core.c index ddabb93f2377..9a848d7647ef 100644 --- a/mm/damon/core.c +++ b/mm/damon/core.c @@ -2217,12 +2217,24 @@ static inline u64 damos_get_some_mem_psi_total(void) #endif /* CONFIG_PSI */ #ifdef CONFIG_NUMA +static bool invalid_mem_node(int nid) +{ + return nid < 0 || nid >= MAX_NUMNODES || !node_state(nid, N_MEMORY); +} + static __kernel_ulong_t damos_get_node_mem_bp( struct damos_quota_goal *goal) { struct sysinfo i; __kernel_ulong_t numerator; + if (invalid_mem_node(goal->nid)) { + if (goal->metric == DAMOS_QUOTA_NODE_MEM_USED_BP) + return 0; + else /* DAMOS_QUOTA_NODE_MEM_FREE_BP */ + return 10000; + } + si_meminfo_node(&i, goal->nid); if (goal->metric == DAMOS_QUOTA_NODE_MEM_USED_BP) numerator = i.totalram - i.freeram; -- cgit v1.2.3
From a34dac6482e53e2c76944f25b1489b9b7da3a6e6 Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Sat, 28 Mar 2026 21:39:00 -0700 Subject: mm/damon/core: validate damos_quota_goal->nid for node_memcg_{used,free}_bp
Users can set damos_quota_goal->nid with an arbitrary value for node_memcg_{used,free}_bp. But DAMON core is using those for NODE_DATA() without validating the value. This can result in an out-of-bounds memory access. The issue can actually be triggered using the DAMON user-space tool (damo), like below.
$ sudo mkdir /sys/fs/cgroup/foo $ sudo ./damo start --damos_action stat --damos_quota_interval 1s \ --damos_quota_goal node_memcg_used_bp 50% -1 /foo $ sudo dmesg [...] [ 524.181426] Unable to handle kernel paging request at virtual address 0000000000002c00
Fix this issue by adding the validation of the given node id. If an invalid node id is given, it returns 0% for used memory ratio, and 100% for free memory ratio.
Link: https://lore.kernel.org/20260329043902.46163-3-sj@kernel.org Fixes: b74a120bcf50 ("mm/damon/core: implement DAMOS_QUOTA_NODE_MEMCG_USED_BP") Signed-off-by: SeongJae Park Cc: # 6.19.x Signed-off-by: Andrew Morton --- mm/damon/core.c | 7 +++++++ 1 file changed, 7 insertions(+)
diff --git a/mm/damon/core.c b/mm/damon/core.c index 9a848d7647ef..19642c175568 100644 --- a/mm/damon/core.c +++ b/mm/damon/core.c @@ -2251,6 +2251,13 @@ static unsigned long damos_get_node_memcg_used_bp( unsigned long used_pages, numerator; struct sysinfo i; + if (invalid_mem_node(goal->nid)) { + if (goal->metric == DAMOS_QUOTA_NODE_MEMCG_USED_BP) + return 0; + else /* DAMOS_QUOTA_NODE_MEMCG_FREE_BP */ + return 10000; + } + memcg = mem_cgroup_get_from_id(goal->memcg_id); if (!memcg) { if (goal->metric == DAMOS_QUOTA_NODE_MEMCG_USED_BP) -- cgit v1.2.3
From 049a57421dd67a28c45ae7e92c36df758033e5fa Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Sun, 29 Mar 2026 08:23:05 -0700 Subject: mm/damon/core: use time_in_range_open() for damos quota window start
damos_adjust_quota() uses time_after_eq() to check whether it is time to start a new quota charge window, comparing the current jiffies and the scheduled next charge window start time. If it is, the next charge window start time is updated and the new charge window starts.
The time check and next window start time update are skipped while the scheme is deactivated by the watermarks. Let's suppose the deactivation is kept more than LONG_MAX jiffies (assuming CONFIG_HZ of 250, more than 99 days in 32 bit systems and more than one billion years in 64 bit systems), resulting in the jiffies being larger than the next charge window start time + LONG_MAX. Then, the time_after_eq() call can return false until another LONG_MAX jiffies are passed. This means the scheme can continue working after being reactivated by the watermarks. But, soon, the quota will be exceeded and the scheme will again effectively stop working until the next charge window starts. Because the current charge window is extended to up to LONG_MAX jiffies, however, it will look like it stopped unexpectedly and indefinitely, from the user's perspective. Fix this by using !time_in_range_open() instead. The issue was discovered [1] by sashiko.
Link: https://lore.kernel.org/20260329152306.45796-1-sj@kernel.org Link: https://lore.kernel.org/20260324040722.57944-1-sj@kernel.org [1] Fixes: ee801b7dd782 ("mm/damon/schemes: activate schemes based on a watermarks mechanism") Signed-off-by: SeongJae Park Cc: # 5.16.x Signed-off-by: Andrew Morton --- mm/damon/core.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/mm/damon/core.c b/mm/damon/core.c index 19642c175568..3bc7a2bbfe7d 100644 --- a/mm/damon/core.c +++ b/mm/damon/core.c @@ -2449,7 +2449,8 @@ static void damos_adjust_quota(struct damon_ctx *c, struct damos *s) } /* New charge window starts */ - if (time_after_eq(jiffies, quota->charged_from + + if (!time_in_range_open(jiffies, quota->charged_from, + quota->charged_from + msecs_to_jiffies(quota->reset_interval))) { if (damos_quota_is_set(quota) && quota->charged_sz >= quota->esz) -- cgit v1.2.3
From 0beba407d4585a15b0dc09f2064b5b3ddcb0e857 Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Sun, 29 Mar 2026 08:30:49 -0700 Subject: Docs/admin-guide/mm/damon/reclaim: warn commit_inputs vs param updates race
Patch series "Docs/admin-guide/mm/damon: warn commit_inputs vs other params race". Writing 'Y' to the commit_inputs parameter of DAMON_RECLAIM and DAMON_LRU_SORT, and writing other parameters before the commit_inputs request is completely processed can cause race conditions. While the consequence can be bad, the documentation does not clearly describe that. Add clear warnings. The issue was discovered [1,2] by sashiko.
This patch (of 2): DAMON_RECLAIM handles the commit_inputs request inside the kdamond thread, reading the module parameters. If the user updates the module parameters while the kdamond thread is reading those, races can happen. To avoid this, the commit_inputs parameter shows whether it is still in progress, assuming users wouldn't update parameters in the middle of the work. Some users might ignore that. Add a warning about the behavior. The issue was discovered in [1] by sashiko.
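To make the rule this patch documents concrete, a hedged user-space sketch of the intended usage follows; the module parameter path is assumed to be the usual /sys/module/damon_reclaim/parameters/ location (adjust for DAMON_LRU_SORT), and this is an illustration rather than an official tool:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define COMMIT_INPUTS "/sys/module/damon_reclaim/parameters/commit_inputs"

int main(void)
{
	char c = 0;
	int fd = open(COMMIT_INPUTS, O_RDWR);	/* needs root */

	if (fd < 0)
		return 1;
	if (write(fd, "Y", 1) != 1)
		return 1;
	/* Per the documented rule: touch no other parameter until this reads 'N'. */
	do {
		usleep(100 * 1000);
		if (lseek(fd, 0, SEEK_SET) < 0 || read(fd, &c, 1) != 1)
			return 1;
	} while (c != 'N');
	close(fd);
	return 0;	/* only now is it safe to update parameters again */
}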
Link: https://lore.kernel.org/20260329153052.46657-2-sj@kernel.org Link: https://lore.kernel.org/20260319161620.189392-3-objecting@objecting.org [1] Link: https://lore.kernel.org/20260319161620.189392-2-objecting@objecting.org [3] Fixes: 81a84182c343 ("Docs/admin-guide/mm/damon/reclaim: document 'commit_inputs' parameter") Signed-off-by: SeongJae Park Cc: # 5.19.x Signed-off-by: Andrew Morton --- Documentation/admin-guide/mm/damon/reclaim.rst | 4 ++++ 1 file changed, 4 insertions(+)
diff --git a/Documentation/admin-guide/mm/damon/reclaim.rst b/Documentation/admin-guide/mm/damon/reclaim.rst index 47854c461706..d7a0225b4950 100644 --- a/Documentation/admin-guide/mm/damon/reclaim.rst +++ b/Documentation/admin-guide/mm/damon/reclaim.rst @@ -71,6 +71,10 @@ of parametrs except ``enabled`` again. Once the re-reading is done, this parameter is set as ``N``. If invalid parameters are found while the re-reading, DAMON_RECLAIM will be disabled. +Once ``Y`` is written to this parameter, the user must not write to any +parameters until reading ``commit_inputs`` again returns ``N``. If users +violate this rule, the kernel may exhibit undefined behavior. + min_age ------- -- cgit v1.2.3
From 0c13ed77dd2bc1c2d46db8ef27721213742cccd8 Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Sun, 29 Mar 2026 08:30:50 -0700 Subject: Docs/admin-guide/mm/damon/lru_sort: warn commit_inputs vs param updates race
DAMON_LRU_SORT handles the commit_inputs request inside the kdamond thread, reading the module parameters. If the user updates the module parameters while the kdamond thread is reading those, races can happen. To avoid this, the commit_inputs parameter shows whether it is still in progress, assuming users wouldn't update parameters in the middle of the work. Some users might ignore that. Add a warning about the behavior. The issue was discovered in [1] by sashiko.
Link: https://lore.kernel.org/20260329153052.46657-3-sj@kernel.org Link: https://lore.kernel.org/20260319161620.189392-2-objecting@objecting.org [1] Fixes: 6acfcd0d7524 ("Docs/admin-guide/damon: add a document for DAMON_LRU_SORT") Signed-off-by: SeongJae Park Cc: # 6.0.x Signed-off-by: Andrew Morton --- Documentation/admin-guide/mm/damon/lru_sort.rst | 4 ++++ 1 file changed, 4 insertions(+)
diff --git a/Documentation/admin-guide/mm/damon/lru_sort.rst b/Documentation/admin-guide/mm/damon/lru_sort.rst index a7dea7c75a9b..14cc6b2db897 100644 --- a/Documentation/admin-guide/mm/damon/lru_sort.rst +++ b/Documentation/admin-guide/mm/damon/lru_sort.rst @@ -79,6 +79,10 @@ of parametrs except ``enabled`` again. Once the re-reading is done, this parameter is set as ``N``. If invalid parameters are found while the re-reading, DAMON_LRU_SORT will be disabled. +Once ``Y`` is written to this parameter, the user must not write to any +parameters until reading ``commit_inputs`` again returns ``N``. If users +violate this rule, the kernel may exhibit undefined behavior. + active_mem_bp ------------- -- cgit v1.2.3
From 6fae274ce0e3109cbbc4c18b354eaace1f0af7d7 Mon Sep 17 00:00:00 2001 From: Jackie Liu Date: Wed, 1 Apr 2026 08:57:02 +0800 Subject: mm/mempolicy: fix memory leaks in weighted_interleave_auto_store()
weighted_interleave_auto_store() fetches old_wi_state inside the if (!input) block only. This causes two memory leaks: 1. When a user writes "false" and the current mode is already manual, the function returns early without freeing the freshly allocated new_wi_state. 2. When a user writes "true", old_wi_state stays NULL because the fetch is skipped entirely.
The old state is then overwritten by rcu_assign_pointer() but never freed, since the cleanup path is gated on old_wi_state being non-NULL. A user can trigger this repeatedly by writing "1" in a loop. Fix both leaks by moving the old_wi_state fetch before the input check, making it unconditional. This also allows a unified early return for both "true" and "false" when the requested mode matches the current mode.
Link: https://lore.kernel.org/20260401005702.7096-1-liu.yun@linux.dev Link: https://sashiko.dev/#/patchset/20260331100740.84906-1-liu.yun@linux.dev Fixes: e341f9c3c841 ("mm/mempolicy: Weighted Interleave Auto-tuning") Signed-off-by: Jackie Liu Reviewed-by: Joshua Hahn Reviewed-by: Donet Tom Cc: Gregory Price Cc: Alistair Popple Cc: Byungchul Park Cc: David Hildenbrand Cc: # v6.16+ Signed-off-by: Andrew Morton --- mm/mempolicy.c | 23 ++++++++++++----------- 1 file changed, 12 insertions(+), 11 deletions(-)
diff --git a/mm/mempolicy.c b/mm/mempolicy.c index fd08771e2057..62108a5b74c4 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -3700,18 +3700,19 @@ static ssize_t weighted_interleave_auto_store(struct kobject *kobj, new_wi_state->iw_table[i] = 1; mutex_lock(&wi_state_lock); - if (!input) { - old_wi_state = rcu_dereference_protected(wi_state, - lockdep_is_held(&wi_state_lock)); - if (!old_wi_state) - goto update_wi_state; - if (input == old_wi_state->mode_auto) { - mutex_unlock(&wi_state_lock); - return count; - } + old_wi_state = rcu_dereference_protected(wi_state, + lockdep_is_held(&wi_state_lock)); - memcpy(new_wi_state->iw_table, old_wi_state->iw_table, - nr_node_ids * sizeof(u8)); + if (old_wi_state && input == old_wi_state->mode_auto) { + mutex_unlock(&wi_state_lock); + kfree(new_wi_state); + return count; + } + + if (!input) { + if (old_wi_state) + memcpy(new_wi_state->iw_table, old_wi_state->iw_table, + nr_node_ids * sizeof(u8)); goto update_wi_state; } -- cgit v1.2.3
From 84f4928446e65b9f3f142809f192edf46f67e380 Mon Sep 17 00:00:00 2001 From: "Lorenzo Stoakes (Oracle)" Date: Tue, 31 Mar 2026 08:36:27 +0100 Subject: tools/testing/selftests: add merge test for partial msealed range
Commit 2697dd8ae721 ("mm/mseal: update VMA end correctly on merge") fixed an issue in the loop which iterates through VMAs applying mseal, which was triggered by mseal()'ing a range of VMAs where the second was mseal()'d and the first mergeable with it, once mseal()'d. Add a regression test to assert that this behaviour is correct. We place it in the merge selftests as this is strictly an issue with merging (via a vma_modify() invocation). It also asserts that mseal()'d ranges are correctly merged as you'd expect. The test is implemented such that it is skipped if mseal() is not available on the system.
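As context for the test (a minimal, hedged illustration only, not part of the selftest itself): mseal() makes a mapping immutable, so a later attempt to change it, such as an mprotect(), is expected to fail with EPERM. The sketch assumes a kernel and headers that define __NR_mseal:

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
	long psz = sysconf(_SC_PAGESIZE);
	void *p = mmap(NULL, psz, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return 1;
	if (syscall(__NR_mseal, p, psz, 0))
		return 1;	/* e.g. running on a kernel without mseal() */
	errno = 0;
	mprotect(p, psz, PROT_READ);	/* expected to fail on the sealed VMA */
	printf("mprotect on sealed VMA: %s\n", strerror(errno));
	return 0;
}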
[rppt@kernel.org: fix inclusions, to fix handle_uprobe_upon_merged_vma()] Link: https://lore.kernel.org/ac_mCIUQWRAbuH8F@kernel.org [ljs@kernel.org: simplifications per Pedro] Link: https://lore.kernel.org/1c9c922d-5cb5-4cff-9273-b737cdb57ca1@lucifer.local Link: https://lore.kernel.org/20260331073627.50010-1-ljs@kernel.org Signed-off-by: Lorenzo Stoakes (Oracle) Signed-off-by: Mike Rapoport Cc: David Hildenbrand Cc: Jann Horn Cc: Liam Howlett Cc: Lorenzo Stoakes Cc: Michal Hocko Cc: Pedro Falcato Cc: Shuah Khan Cc: Suren Baghdasaryan Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- tools/testing/selftests/mm/merge.c | 88 ++++++++++++++++++++++++++++++++++++++ 1 file changed, 88 insertions(+) diff --git a/tools/testing/selftests/mm/merge.c b/tools/testing/selftests/mm/merge.c index 10b686102b79..519e5ac02db7 100644 --- a/tools/testing/selftests/mm/merge.c +++ b/tools/testing/selftests/mm/merge.c @@ -48,6 +48,19 @@ static pid_t do_fork(struct procmap_fd *procmap) return 0; } +#ifdef __NR_mseal +static int sys_mseal(void *ptr, size_t len, unsigned long flags) +{ + return syscall(__NR_mseal, (unsigned long)ptr, len, flags); +} +#else +static int sys_mseal(void *ptr, size_t len, unsigned long flags) +{ + errno = ENOSYS; + return -1; +} +#endif + FIXTURE_SETUP(merge) { self->page_size = psize(); @@ -1217,6 +1230,81 @@ TEST_F(merge, mremap_correct_placed_faulted) ASSERT_EQ(procmap->query.vma_end, (unsigned long)ptr + 15 * page_size); } +TEST_F(merge, merge_vmas_with_mseal) +{ + unsigned int page_size = self->page_size; + struct procmap_fd *procmap = &self->procmap; + char *ptr, *ptr2, *ptr3; + /* We need our own as cannot munmap() once sealed. */ + char *carveout; + + /* Invalid mseal() call to see if implemented. */ + ASSERT_EQ(sys_mseal(NULL, 0, ~0UL), -1); + if (errno == ENOSYS) + SKIP(return, "mseal not supported, skipping."); + + /* Map carveout. */ + carveout = mmap(NULL, 5 * page_size, PROT_NONE, + MAP_PRIVATE | MAP_ANON, -1, 0); + ASSERT_NE(carveout, MAP_FAILED); + + /* + * Map 3 separate VMAs: + * + * |-----------|-----------|-----------| + * | RW | RWE | RO | + * |-----------|-----------|-----------| + * ptr ptr2 ptr3 + */ + ptr = mmap(&carveout[page_size], page_size, PROT_READ | PROT_WRITE, + MAP_ANON | MAP_PRIVATE | MAP_FIXED, -1, 0); + ASSERT_NE(ptr, MAP_FAILED); + ptr2 = mmap(&carveout[2 * page_size], page_size, + PROT_READ | PROT_WRITE | PROT_EXEC, + MAP_ANON | MAP_PRIVATE | MAP_FIXED, -1, 0); + ASSERT_NE(ptr2, MAP_FAILED); + ptr3 = mmap(&carveout[3 * page_size], page_size, PROT_READ, + MAP_ANON | MAP_PRIVATE | MAP_FIXED, -1, 0); + ASSERT_NE(ptr3, MAP_FAILED); + + /* + * mseal the second VMA: + * + * |-----------|-----------|-----------| + * | RW | RWES | RO | + * |-----------|-----------|-----------| + * ptr ptr2 ptr3 + */ + ASSERT_EQ(sys_mseal(ptr2, page_size, 0), 0); + + /* Make first VMA mergeable upon mseal. */ + ASSERT_EQ(mprotect(ptr, page_size, + PROT_READ | PROT_WRITE | PROT_EXEC), 0); + /* + * At this point we have: + * + * |-----------|-----------|-----------| + * | RWE | RWES | RO | + * |-----------|-----------|-----------| + * ptr ptr2 ptr3 + * + * Now mseal all of the VMAs. 
+ */ + ASSERT_EQ(sys_mseal(ptr, 3 * page_size, 0), 0); + + /* + * We should end up with: + * + * |-----------------------|-----------| + * | RWES | ROS | + * |-----------------------|-----------| + * ptr ptr3 + */ + ASSERT_TRUE(find_vma_procmap(procmap, ptr)); + ASSERT_EQ(procmap->query.vma_start, (unsigned long)ptr); + ASSERT_EQ(procmap->query.vma_end, (unsigned long)ptr + 2 * page_size); +} + TEST_F(merge_with_fork, mremap_faulted_to_unfaulted_prev) { struct procmap_fd *procmap = &self->procmap; -- cgit v1.2.3 From 047a6d494033db26736b19e247851632cd74959d Mon Sep 17 00:00:00 2001 From: Li Wang Date: Wed, 1 Apr 2026 17:05:20 +0800 Subject: selftests/mm: skip hugetlb_dio tests when DIO alignment is incompatible hugetlb_dio test uses sub-page offsets (pagesize / 2) to verify that hugepages used as DIO user buffers are correctly unpinned at completion. However, on filesystems with a logical block size larger than half the page size (e.g., 4K-sector block devices), these unaligned DIO writes are rejected with -EINVAL, causing the test to fail unexpectedly. Add get_dio_alignment() to query the filesystem's required DIO alignment via statx(STATX_DIOALIGN) and skip individual test cases whose file offset or write size is not a multiple of that alignment. Aligned cases continue to run so the core coverage is preserved. While here, open the temporary file once in main() and share the fd across all test cases instead of reopening it in each invocation. === Reproduce Steps === # dd if=/dev/zero of=/tmp/test.img bs=1M count=512 # losetup --sector-size 4096 /dev/loop0 /tmp/test.img # mkfs.xfs /dev/loop0 # mkdir -p /mnt/dio_test # mount /dev/loop0 /mnt/dio_test // Modify test to open /mnt/dio_test and rebuild it: - fd = open("/tmp", O_TMPFILE | O_RDWR | O_DIRECT, 0664); + fd = open("/mnt/dio_test", O_TMPFILE | O_RDWR | O_DIRECT, 0664); # getconf PAGESIZE 4096 # echo 100 >/proc/sys/vm/nr_hugepages # ./hugetlb_dio TAP version 13 1..4 # No. Free pages before allocation : 100 # No. Free pages after munmap : 100 ok 1 free huge pages from 0-12288 Bail out! Error writing to file : Invalid argument (22) # Planned tests != run tests (4 != 1) # Totals: pass:1 fail:0 xfail:0 xpass:0 skip:0 error:0 Link: https://lore.kernel.org/20260401090520.24018-1-liwang@redhat.com Signed-off-by: Li Wang Suggested-by: Mike Rapoport Suggested-by: David Hildenbrand Acked-by: David Hildenbrand (Arm) Cc: Liam Howlett Cc: Lorenzo Stoakes (Oracle) Cc: Michal Hocko Cc: Mike Rapoport Cc: Shuah Khan Cc: Suren Baghdasaryan Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- tools/testing/selftests/mm/hugetlb_dio.c | 91 ++++++++++++++++++++++++-------- 1 file changed, 69 insertions(+), 22 deletions(-) diff --git a/tools/testing/selftests/mm/hugetlb_dio.c b/tools/testing/selftests/mm/hugetlb_dio.c index 9ac62eb4c97d..31a054fa8134 100644 --- a/tools/testing/selftests/mm/hugetlb_dio.c +++ b/tools/testing/selftests/mm/hugetlb_dio.c @@ -17,12 +17,57 @@ #include #include #include +#include #include "vm_util.h" #include "kselftest.h" -void run_dio_using_hugetlb(unsigned int start_off, unsigned int end_off) +#ifndef STATX_DIOALIGN +#define STATX_DIOALIGN 0x00002000U +#endif + +static int get_dio_alignment(int fd) +{ + struct statx stx; + int ret; + + ret = syscall(__NR_statx, fd, "", AT_EMPTY_PATH, STATX_DIOALIGN, &stx); + if (ret < 0) + return -1; + + /* + * If STATX_DIOALIGN is unsupported, assume no alignment + * constraint and let the test proceed. 
+ */ + if (!(stx.stx_mask & STATX_DIOALIGN) || !stx.stx_dio_offset_align) + return 1; + + return stx.stx_dio_offset_align; +} + +static bool check_dio_alignment(unsigned int start_off, + unsigned int end_off, unsigned int align) +{ + unsigned int writesize = end_off - start_off; + + /* + * The kernel's DIO path checks that file offset, length, and + * buffer address are all multiples of dio_offset_align. When + * this test case's parameters don't satisfy that, the write + * would fail with -EINVAL before exercising the hugetlb unpin + * path, so skip. + */ + if (start_off % align != 0 || writesize % align != 0) { + ksft_test_result_skip("DIO align=%u incompatible with offset %u writesize %u\n", + align, start_off, writesize); + return false; + } + + return true; +} + +static void run_dio_using_hugetlb(int fd, unsigned int start_off, + unsigned int end_off, unsigned int align) { - int fd; char *buffer = NULL; char *orig_buffer = NULL; size_t h_pagesize = 0; @@ -32,6 +77,9 @@ void run_dio_using_hugetlb(unsigned int start_off, unsigned int end_off) const int mmap_flags = MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB; const int mmap_prot = PROT_READ | PROT_WRITE; + if (!check_dio_alignment(start_off, end_off, align)) + return; + writesize = end_off - start_off; /* Get the default huge page size */ @@ -39,10 +87,9 @@ void run_dio_using_hugetlb(unsigned int start_off, unsigned int end_off) if (!h_pagesize) ksft_exit_fail_msg("Unable to determine huge page size\n"); - /* Open the file to DIO */ - fd = open("/tmp", O_TMPFILE | O_RDWR | O_DIRECT, 0664); - if (fd < 0) - ksft_exit_fail_perror("Error opening file\n"); + /* Reset file position since fd is shared across tests */ + if (lseek(fd, 0, SEEK_SET) < 0) + ksft_exit_fail_perror("lseek failed\n"); /* Get the free huge pages before allocation */ free_hpage_b = get_free_hugepages(); @@ -71,7 +118,6 @@ void run_dio_using_hugetlb(unsigned int start_off, unsigned int end_off) /* unmap the huge page */ munmap(orig_buffer, h_pagesize); - close(fd); /* Get the free huge pages after unmap*/ free_hpage_a = get_free_hugepages(); @@ -89,37 +135,38 @@ void run_dio_using_hugetlb(unsigned int start_off, unsigned int end_off) int main(void) { - size_t pagesize = 0; - int fd; + int fd, align; + const size_t pagesize = psize(); ksft_print_header(); - /* Open the file to DIO */ - fd = open("/tmp", O_TMPFILE | O_RDWR | O_DIRECT, 0664); - if (fd < 0) - ksft_exit_skip("Unable to allocate file: %s\n", strerror(errno)); - close(fd); - /* Check if huge pages are free */ if (!get_free_hugepages()) ksft_exit_skip("No free hugepage, exiting\n"); - ksft_set_plan(4); + fd = open("/tmp", O_TMPFILE | O_RDWR | O_DIRECT, 0664); + if (fd < 0) + ksft_exit_skip("Unable to allocate file: %s\n", strerror(errno)); - /* Get base page size */ - pagesize = psize(); + align = get_dio_alignment(fd); + if (align < 0) + ksft_exit_skip("Unable to obtain DIO alignment: %s\n", + strerror(errno)); + ksft_set_plan(4); /* start and end is aligned to pagesize */ - run_dio_using_hugetlb(0, (pagesize * 3)); + run_dio_using_hugetlb(fd, 0, (pagesize * 3), align); /* start is aligned but end is not aligned */ - run_dio_using_hugetlb(0, (pagesize * 3) - (pagesize / 2)); + run_dio_using_hugetlb(fd, 0, (pagesize * 3) - (pagesize / 2), align); /* start is unaligned and end is aligned */ - run_dio_using_hugetlb(pagesize / 2, (pagesize * 3)); + run_dio_using_hugetlb(fd, pagesize / 2, (pagesize * 3), align); /* both start and end are unaligned */ - run_dio_using_hugetlb(pagesize / 2, (pagesize * 3) + (pagesize / 
2)); + run_dio_using_hugetlb(fd, pagesize / 2, (pagesize * 3) + (pagesize / 2), align); + + close(fd); ksft_finished(); } -- cgit v1.2.3 From 744dd97752ef1076a8d8672bb0d8aa2c7abc1144 Mon Sep 17 00:00:00 2001 From: Alistair Popple Date: Tue, 31 Mar 2026 17:34:43 +1100 Subject: lib: test_hmm: evict device pages on file close to avoid use-after-free Patch series "Minor hmm_test fixes and cleanups". Two bugfixes a cleanup for the HMM kernel selftests. These were mostly reported by Zenghui Yu with special thanks to Lorenzo for analysing and pointing out the problems. This patch (of 3): When dmirror_fops_release() is called it frees the dmirror struct but doesn't migrate device private pages back to system memory first. This leaves those pages with a dangling zone_device_data pointer to the freed dmirror. If a subsequent fault occurs on those pages (eg. during coredump) the dmirror_devmem_fault() callback dereferences the stale pointer causing a kernel panic. This was reported [1] when running mm/ksft_hmm.sh on arm64, where a test failure triggered SIGABRT and the resulting coredump walked the VMAs faulting in the stale device private pages. Fix this by calling dmirror_device_evict_chunk() for each devmem chunk in dmirror_fops_release() to migrate all device private pages back to system memory before freeing the dmirror struct. The function is moved earlier in the file to avoid a forward declaration. Link: https://lore.kernel.org/20260331063445.3551404-1-apopple@nvidia.com Link: https://lore.kernel.org/20260331063445.3551404-2-apopple@nvidia.com Fixes: b2ef9f5a5cb3 ("mm/hmm/test: add selftest driver for HMM") Signed-off-by: Alistair Popple Reported-by: Zenghui Yu Closes: https://lore.kernel.org/linux-mm/8bd0396a-8997-4d2e-a13f-5aac033083d7@linux.dev/ Reviewed-by: Balbir Singh Tested-by: Zenghui Yu Cc: David Hildenbrand Cc: Jason Gunthorpe Cc: Leon Romanovsky Cc: Liam Howlett Cc: Lorenzo Stoakes (Oracle) Cc: Michal Hocko Cc: Mike Rapoport Cc: Suren Baghdasaryan Cc: Zenghui Yu Cc: Matthew Brost Cc: Signed-off-by: Andrew Morton --- lib/test_hmm.c | 112 +++++++++++++++++++++++++++++++-------------------------- 1 file changed, 62 insertions(+), 50 deletions(-) diff --git a/lib/test_hmm.c b/lib/test_hmm.c index 0964d53365e6..79fe7d233df1 100644 --- a/lib/test_hmm.c +++ b/lib/test_hmm.c @@ -185,11 +185,73 @@ static int dmirror_fops_open(struct inode *inode, struct file *filp) return 0; } +static void dmirror_device_evict_chunk(struct dmirror_chunk *chunk) +{ + unsigned long start_pfn = chunk->pagemap.range.start >> PAGE_SHIFT; + unsigned long end_pfn = chunk->pagemap.range.end >> PAGE_SHIFT; + unsigned long npages = end_pfn - start_pfn + 1; + unsigned long i; + unsigned long *src_pfns; + unsigned long *dst_pfns; + unsigned int order = 0; + + src_pfns = kvcalloc(npages, sizeof(*src_pfns), GFP_KERNEL | __GFP_NOFAIL); + dst_pfns = kvcalloc(npages, sizeof(*dst_pfns), GFP_KERNEL | __GFP_NOFAIL); + + migrate_device_range(src_pfns, start_pfn, npages); + for (i = 0; i < npages; i++) { + struct page *dpage, *spage; + + spage = migrate_pfn_to_page(src_pfns[i]); + if (!spage || !(src_pfns[i] & MIGRATE_PFN_MIGRATE)) + continue; + + if (WARN_ON(!is_device_private_page(spage) && + !is_device_coherent_page(spage))) + continue; + + order = folio_order(page_folio(spage)); + spage = BACKING_PAGE(spage); + if (src_pfns[i] & MIGRATE_PFN_COMPOUND) { + dpage = folio_page(folio_alloc(GFP_HIGHUSER_MOVABLE, + order), 0); + } else { + dpage = alloc_page(GFP_HIGHUSER_MOVABLE | __GFP_NOFAIL); + order = 0; + } + + /* TODO Support 
splitting here */ + lock_page(dpage); + dst_pfns[i] = migrate_pfn(page_to_pfn(dpage)); + if (src_pfns[i] & MIGRATE_PFN_WRITE) + dst_pfns[i] |= MIGRATE_PFN_WRITE; + if (order) + dst_pfns[i] |= MIGRATE_PFN_COMPOUND; + folio_copy(page_folio(dpage), page_folio(spage)); + } + migrate_device_pages(src_pfns, dst_pfns, npages); + migrate_device_finalize(src_pfns, dst_pfns, npages); + kvfree(src_pfns); + kvfree(dst_pfns); +} + static int dmirror_fops_release(struct inode *inode, struct file *filp) { struct dmirror *dmirror = filp->private_data; + struct dmirror_device *mdevice = dmirror->mdevice; + int i; mmu_interval_notifier_remove(&dmirror->notifier); + + if (mdevice->devmem_chunks) { + for (i = 0; i < mdevice->devmem_count; i++) { + struct dmirror_chunk *devmem = + mdevice->devmem_chunks[i]; + + dmirror_device_evict_chunk(devmem); + } + } + xa_destroy(&dmirror->pt); kfree(dmirror); return 0; @@ -1377,56 +1439,6 @@ static int dmirror_snapshot(struct dmirror *dmirror, return ret; } -static void dmirror_device_evict_chunk(struct dmirror_chunk *chunk) -{ - unsigned long start_pfn = chunk->pagemap.range.start >> PAGE_SHIFT; - unsigned long end_pfn = chunk->pagemap.range.end >> PAGE_SHIFT; - unsigned long npages = end_pfn - start_pfn + 1; - unsigned long i; - unsigned long *src_pfns; - unsigned long *dst_pfns; - unsigned int order = 0; - - src_pfns = kvcalloc(npages, sizeof(*src_pfns), GFP_KERNEL | __GFP_NOFAIL); - dst_pfns = kvcalloc(npages, sizeof(*dst_pfns), GFP_KERNEL | __GFP_NOFAIL); - - migrate_device_range(src_pfns, start_pfn, npages); - for (i = 0; i < npages; i++) { - struct page *dpage, *spage; - - spage = migrate_pfn_to_page(src_pfns[i]); - if (!spage || !(src_pfns[i] & MIGRATE_PFN_MIGRATE)) - continue; - - if (WARN_ON(!is_device_private_page(spage) && - !is_device_coherent_page(spage))) - continue; - - order = folio_order(page_folio(spage)); - spage = BACKING_PAGE(spage); - if (src_pfns[i] & MIGRATE_PFN_COMPOUND) { - dpage = folio_page(folio_alloc(GFP_HIGHUSER_MOVABLE, - order), 0); - } else { - dpage = alloc_page(GFP_HIGHUSER_MOVABLE | __GFP_NOFAIL); - order = 0; - } - - /* TODO Support splitting here */ - lock_page(dpage); - dst_pfns[i] = migrate_pfn(page_to_pfn(dpage)); - if (src_pfns[i] & MIGRATE_PFN_WRITE) - dst_pfns[i] |= MIGRATE_PFN_WRITE; - if (order) - dst_pfns[i] |= MIGRATE_PFN_COMPOUND; - folio_copy(page_folio(dpage), page_folio(spage)); - } - migrate_device_pages(src_pfns, dst_pfns, npages); - migrate_device_finalize(src_pfns, dst_pfns, npages); - kvfree(src_pfns); - kvfree(dst_pfns); -} - /* Removes free pages from the free list so they can't be re-allocated */ static void dmirror_remove_free_pages(struct dmirror_chunk *devmem) { -- cgit v1.2.3 From f9d7975c52c00b3685cf9a90a81023d17817d991 Mon Sep 17 00:00:00 2001 From: Alistair Popple Date: Tue, 31 Mar 2026 17:34:44 +1100 Subject: selftests/mm: hmm-tests: don't hardcode THP size to 2MB Several HMM tests hardcode TWOMEG as the THP size. This is wrong on architectures where the PMD size is not 2MB such as arm64 with 64K base pages where THP is 512MB. Fix this by using read_pmd_pagesize() from vm_util instead. While here also replace the custom file_read_ulong() helper used to parse the default hugetlbfs page size from /proc/meminfo with the existing default_huge_page_size() from vm_util. 
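For context, read_pmd_pagesize() simply asks the kernel for the PMD size at runtime instead of assuming 2MB. A minimal userspace sketch of the same idea follows; the sysfs path and the helper name are illustrative assumptions here, not something this patch adds:

        #include <stdio.h>

        /* Illustrative only: return the THP/PMD size in bytes, or 0 if unknown. */
        static unsigned long pmd_size_sketch(void)
        {
                unsigned long size = 0;
                FILE *f = fopen("/sys/kernel/mm/transparent_hugepage/hpage_pmd_size", "r");

                if (!f)
                        return 0;
                if (fscanf(f, "%lu", &size) != 1)
                        size = 0;
                fclose(f);
                return size;
        }

On arm64 with 64K base pages this reports 512MB rather than 2MB, which is exactly why the hardcoded TWOMEG was wrong.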
Link: https://lore.kernel.org/20260331063445.3551404-3-apopple@nvidia.com Link: https://lore.kernel.org/linux-mm/8bd0396a-8997-4d2e-a13f-5aac033083d7@linux.dev/ Fixes: fee9f6d1b8df ("mm/hmm/test: add selftests for HMM") Fixes: 519071529d2a ("selftests/mm/hmm-tests: new tests for zone device THP migration") Signed-off-by: Alistair Popple Reported-by: Zenghui Yu Closes: https://lore.kernel.org/linux-mm/8bd0396a-8997-4d2e-a13f-5aac033083d7@linux.dev/ Reviewed-by: Balbir Singh Cc: Matthew Brost Cc: David Hildenbrand Cc: Jason Gunthorpe Cc: Leon Romanovsky Cc: Liam Howlett Cc: Lorenzo Stoakes (Oracle) Cc: Michal Hocko Cc: Mike Rapoport Cc: Suren Baghdasaryan Cc: Signed-off-by: Andrew Morton --- tools/testing/selftests/mm/hmm-tests.c | 83 +++++++--------------------------- 1 file changed, 16 insertions(+), 67 deletions(-) diff --git a/tools/testing/selftests/mm/hmm-tests.c b/tools/testing/selftests/mm/hmm-tests.c index e8328c89d855..788689497e92 100644 --- a/tools/testing/selftests/mm/hmm-tests.c +++ b/tools/testing/selftests/mm/hmm-tests.c @@ -34,6 +34,7 @@ */ #include #include +#include struct hmm_buffer { void *ptr; @@ -548,7 +549,7 @@ TEST_F(hmm, anon_write_child) for (migrate = 0; migrate < 2; ++migrate) { for (use_thp = 0; use_thp < 2; ++use_thp) { - npages = ALIGN(use_thp ? TWOMEG : HMM_BUFFER_SIZE, + npages = ALIGN(use_thp ? read_pmd_pagesize() : HMM_BUFFER_SIZE, self->page_size) >> self->page_shift; ASSERT_NE(npages, 0); size = npages << self->page_shift; @@ -728,7 +729,7 @@ TEST_F(hmm, anon_write_huge) int *ptr; int ret; - size = 2 * TWOMEG; + size = 2 * read_pmd_pagesize(); buffer = malloc(sizeof(*buffer)); ASSERT_NE(buffer, NULL); @@ -744,7 +745,7 @@ TEST_F(hmm, anon_write_huge) buffer->fd, 0); ASSERT_NE(buffer->ptr, MAP_FAILED); - size = TWOMEG; + size /= 2; npages = size >> self->page_shift; map = (void *)ALIGN((uintptr_t)buffer->ptr, size); ret = madvise(map, size, MADV_HUGEPAGE); @@ -770,54 +771,6 @@ TEST_F(hmm, anon_write_huge) hmm_buffer_free(buffer); } -/* - * Read numeric data from raw and tagged kernel status files. Used to read - * /proc and /sys data (without a tag) and from /proc/meminfo (with a tag). - */ -static long file_read_ulong(char *file, const char *tag) -{ - int fd; - char buf[2048]; - int len; - char *p, *q; - long val; - - fd = open(file, O_RDONLY); - if (fd < 0) { - /* Error opening the file */ - return -1; - } - - len = read(fd, buf, sizeof(buf)); - close(fd); - if (len < 0) { - /* Error in reading the file */ - return -1; - } - if (len == sizeof(buf)) { - /* Error file is too large */ - return -1; - } - buf[len] = '\0'; - - /* Search for a tag if provided */ - if (tag) { - p = strstr(buf, tag); - if (!p) - return -1; /* looks like the line we want isn't there */ - p += strlen(tag); - } else - p = buf; - - val = strtol(p, &q, 0); - if (*q != ' ') { - /* Error parsing the file */ - return -1; - } - - return val; -} - /* * Write huge TLBFS page. 
*/ @@ -826,15 +779,13 @@ TEST_F(hmm, anon_write_hugetlbfs) struct hmm_buffer *buffer; unsigned long npages; unsigned long size; - unsigned long default_hsize; + unsigned long default_hsize = default_huge_page_size(); unsigned long i; int *ptr; int ret; - default_hsize = file_read_ulong("/proc/meminfo", "Hugepagesize:"); - if (default_hsize < 0 || default_hsize*1024 < default_hsize) + if (!default_hsize) SKIP(return, "Huge page size could not be determined"); - default_hsize = default_hsize*1024; /* KB to B */ size = ALIGN(TWOMEG, default_hsize); npages = size >> self->page_shift; @@ -1606,7 +1557,7 @@ TEST_F(hmm, compound) struct hmm_buffer *buffer; unsigned long npages; unsigned long size; - unsigned long default_hsize; + unsigned long default_hsize = default_huge_page_size(); int *ptr; unsigned char *m; int ret; @@ -1614,10 +1565,8 @@ TEST_F(hmm, compound) /* Skip test if we can't allocate a hugetlbfs page. */ - default_hsize = file_read_ulong("/proc/meminfo", "Hugepagesize:"); - if (default_hsize < 0 || default_hsize*1024 < default_hsize) + if (!default_hsize) SKIP(return, "Huge page size could not be determined"); - default_hsize = default_hsize*1024; /* KB to B */ size = ALIGN(TWOMEG, default_hsize); npages = size >> self->page_shift; @@ -2106,7 +2055,7 @@ TEST_F(hmm, migrate_anon_huge_empty) int *ptr; int ret; - size = TWOMEG; + size = read_pmd_pagesize(); buffer = malloc(sizeof(*buffer)); ASSERT_NE(buffer, NULL); @@ -2158,7 +2107,7 @@ TEST_F(hmm, migrate_anon_huge_zero) int ret; int val; - size = TWOMEG; + size = read_pmd_pagesize(); buffer = malloc(sizeof(*buffer)); ASSERT_NE(buffer, NULL); @@ -2221,7 +2170,7 @@ TEST_F(hmm, migrate_anon_huge_free) int *ptr; int ret; - size = TWOMEG; + size = read_pmd_pagesize(); buffer = malloc(sizeof(*buffer)); ASSERT_NE(buffer, NULL); @@ -2280,7 +2229,7 @@ TEST_F(hmm, migrate_anon_huge_fault) int *ptr; int ret; - size = TWOMEG; + size = read_pmd_pagesize(); buffer = malloc(sizeof(*buffer)); ASSERT_NE(buffer, NULL); @@ -2332,7 +2281,7 @@ TEST_F(hmm, migrate_partial_unmap_fault) { struct hmm_buffer *buffer; unsigned long npages; - unsigned long size = TWOMEG; + unsigned long size = read_pmd_pagesize(); unsigned long i; void *old_ptr; void *map; @@ -2398,7 +2347,7 @@ TEST_F(hmm, migrate_remap_fault) { struct hmm_buffer *buffer; unsigned long npages; - unsigned long size = TWOMEG; + unsigned long size = read_pmd_pagesize(); unsigned long i; void *old_ptr, *new_ptr = NULL; void *map; @@ -2498,7 +2447,7 @@ TEST_F(hmm, migrate_anon_huge_err) int *ptr; int ret; - size = TWOMEG; + size = read_pmd_pagesize(); buffer = malloc(sizeof(*buffer)); ASSERT_NE(buffer, NULL); @@ -2593,7 +2542,7 @@ TEST_F(hmm, migrate_anon_huge_zero_err) int *ptr; int ret; - size = TWOMEG; + size = read_pmd_pagesize(); buffer = malloc(sizeof(*buffer)); ASSERT_NE(buffer, NULL); -- cgit v1.2.3 From af69016dab967346f759016ca503ebc61dd048b5 Mon Sep 17 00:00:00 2001 From: Alistair Popple Date: Tue, 31 Mar 2026 17:34:45 +1100 Subject: lib: test_hmm: implement a device release method Unloading the HMM test module produces the following warning: [ 3782.224783] ------------[ cut here ]------------ [ 3782.226323] Device 'hmm_dmirror0' does not have a release() function, it is broken and must be fixed. See Documentation/core-api/kobject.rst. 
[ 3782.230570] WARNING: drivers/base/core.c:2567 at device_release+0x185/0x210, CPU#20: rmmod/1924 [ 3782.233949] Modules linked in: test_hmm(-) nvidia_uvm(O) nvidia(O) [ 3782.236321] CPU: 20 UID: 0 PID: 1924 Comm: rmmod Tainted: G O 7.0.0-rc1+ #374 PREEMPT(full) [ 3782.240226] Tainted: [O]=OOT_MODULE [ 3782.241639] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.17.0-0-gb52ca86e094d-prebuilt.qemu.org 04/01/2014 [ 3782.246193] RIP: 0010:device_release+0x185/0x210 [ 3782.247860] Code: 00 00 fc ff df 48 8d 7b 50 48 89 fa 48 c1 ea 03 80 3c 02 00 0f 85 86 00 00 00 48 8b 73 50 48 85 f6 74 11 48 8d 3d db 25 29 03 <67> 48 0f b9 3a e9 0d ff ff ff 48 b8 00 00 00 00 00 fc ff df 48 89 [ 3782.254211] RSP: 0018:ffff888126577d98 EFLAGS: 00010246 [ 3782.256054] RAX: dffffc0000000000 RBX: ffffffffc2b70310 RCX: ffffffff8fe61ba1 [ 3782.258512] RDX: 1ffffffff856e062 RSI: ffff88811341eea0 RDI: ffffffff91bbacb0 [ 3782.261041] RBP: ffff888111475000 R08: 0000000000000001 R09: fffffbfff856e069 [ 3782.263471] R10: ffffffffc2b7034b R11: 00000000ffffffff R12: 0000000000000000 [ 3782.265983] R13: dffffc0000000000 R14: ffff88811341eea0 R15: 0000000000000000 [ 3782.268443] FS: 00007fd5a3689040(0000) GS:ffff88842c8d0000(0000) knlGS:0000000000000000 [ 3782.271236] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 3782.273251] CR2: 00007fd5a36d2c10 CR3: 00000001242b8000 CR4: 00000000000006f0 [ 3782.275362] Call Trace: [ 3782.276071] [ 3782.276678] kobject_put+0x146/0x270 [ 3782.277731] hmm_dmirror_exit+0x7a/0x130 [test_hmm] [ 3782.279135] __do_sys_delete_module+0x341/0x510 [ 3782.280438] ? module_flags+0x300/0x300 [ 3782.281547] do_syscall_64+0x111/0x670 [ 3782.282620] entry_SYSCALL_64_after_hwframe+0x4b/0x53 [ 3782.284091] RIP: 0033:0x7fd5a3793b37 [ 3782.285303] Code: 73 01 c3 48 8b 0d c9 82 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 b8 b0 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 99 82 0c 00 f7 d8 64 89 01 48 [ 3782.290708] RSP: 002b:00007ffd68b7dc68 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0 [ 3782.292817] RAX: ffffffffffffffda RBX: 000055e3c0d1c770 RCX: 00007fd5a3793b37 [ 3782.294735] RDX: 0000000000000000 RSI: 0000000000000800 RDI: 000055e3c0d1c7d8 [ 3782.296661] RBP: 0000000000000000 R08: 1999999999999999 R09: 0000000000000000 [ 3782.298622] R10: 00007fd5a3806ac0 R11: 0000000000000206 R12: 00007ffd68b7deb0 [ 3782.300576] R13: 00007ffd68b7e781 R14: 000055e3c0d1b2a0 R15: 00007ffd68b7deb8 [ 3782.301963] [ 3782.302371] irq event stamp: 5019 [ 3782.302987] hardirqs last enabled at (5027): [] __up_console_sem+0x52/0x60 [ 3782.304507] hardirqs last disabled at (5036): [] __up_console_sem+0x37/0x60 [ 3782.306086] softirqs last enabled at (4940): [] __irq_exit_rcu+0xc0/0xf0 [ 3782.307567] softirqs last disabled at (4929): [] __irq_exit_rcu+0xc0/0xf0 [ 3782.309105] ---[ end trace 0000000000000000 ]--- This is because the test module doesn't have a device.release method. In this case one probably isn't needed for correctness - the device structs are in a static array so don't need freeing when the final reference goes away. However some device state is freed on exit, so to ensure this happens at the right time and to silence the warning move the deinitialisation to a release method and assign that as the device release callback. Whilst here also fix a minor error handling bug where cdev_device_del() wasn't being called if allocation failed. 
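The rule being followed here is the usual driver-core one: a refcounted struct device embedded in a driver structure must provide a release() callback, and final teardown belongs in that callback so it only runs once the last reference is dropped via put_device(). A generic sketch of the pattern (my_dev and friends are illustrative names, not the test_hmm ones):

        #include <linux/device.h>
        #include <linux/slab.h>

        struct my_dev {
                struct device device;
                void *state;    /* whatever must stay alive until the last put_device() */
        };

        static void my_dev_release(struct device *dev)
        {
                struct my_dev *md = container_of(dev, struct my_dev, device);

                kfree(md->state);       /* final teardown runs here, not in remove() */
        }

        static int my_dev_init(struct my_dev *md)
        {
                md->device.release = my_dev_release;    /* set before device_initialize() */
                device_initialize(&md->device);
                /* ... device_add()/cdev_device_add(); on failure: put_device() ... */
                return 0;
        }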
Link: https://lore.kernel.org/20260331063445.3551404-4-apopple@nvidia.com Fixes: 6a760f58c792 ("mm/hmm/test: use char dev with struct device to get device node") Signed-off-by: Alistair Popple Acked-by: Balbir Singh Tested-by: Zenghui Yu (Huawei) Cc: David Hildenbrand Cc: Jason Gunthorpe Cc: Leon Romanovsky Cc: Liam Howlett Cc: Lorenzo Stoakes (Oracle) Cc: Michal Hocko Cc: Mike Rapoport Cc: Suren Baghdasaryan Cc: Matthew Brost Cc: Signed-off-by: Andrew Morton --- lib/test_hmm.c | 18 +++++++++++++++--- 1 file changed, 15 insertions(+), 3 deletions(-) diff --git a/lib/test_hmm.c b/lib/test_hmm.c index 79fe7d233df1..213504915737 100644 --- a/lib/test_hmm.c +++ b/lib/test_hmm.c @@ -1738,6 +1738,13 @@ static const struct dev_pagemap_ops dmirror_devmem_ops = { .folio_split = dmirror_devmem_folio_split, }; +static void dmirror_device_release(struct device *dev) +{ + struct dmirror_device *mdevice = container_of(dev, struct dmirror_device, device); + + dmirror_device_remove_chunks(mdevice); +} + static int dmirror_device_init(struct dmirror_device *mdevice, int id) { dev_t dev; @@ -1749,6 +1756,8 @@ static int dmirror_device_init(struct dmirror_device *mdevice, int id) cdev_init(&mdevice->cdevice, &dmirror_fops); mdevice->cdevice.owner = THIS_MODULE; + mdevice->device.release = dmirror_device_release; + device_initialize(&mdevice->device); mdevice->device.devt = dev; @@ -1756,12 +1765,16 @@ static int dmirror_device_init(struct dmirror_device *mdevice, int id) if (ret) goto put_device; + /* Build a list of free ZONE_DEVICE struct pages */ + ret = dmirror_allocate_chunk(mdevice, NULL, false); + if (ret) + goto put_device; + ret = cdev_device_add(&mdevice->cdevice, &mdevice->device); if (ret) goto put_device; - /* Build a list of free ZONE_DEVICE struct pages */ - return dmirror_allocate_chunk(mdevice, NULL, false); + return 0; put_device: put_device(&mdevice->device); @@ -1770,7 +1783,6 @@ put_device: static void dmirror_device_remove(struct dmirror_device *mdevice) { - dmirror_device_remove_chunks(mdevice); cdev_device_del(&mdevice->cdevice, &mdevice->device); put_device(&mdevice->device); } -- cgit v1.2.3 From e3668b371329ea036ff022ce8ecc82f8befcf003 Mon Sep 17 00:00:00 2001 From: Sergey Senozhatsky Date: Tue, 31 Mar 2026 16:42:44 +0900 Subject: zram: do not forget to endio for partial discard requests As reported by Qu Wenruo and Avinesh Kumar, the following getconf PAGESIZE 65536 blkdiscard -p 4k /dev/zram0 takes literally forever to complete. zram doesn't support partial discards and just returns immediately w/o doing any discard work in such cases. The problem is that we forget to endio on our way out, so blkdiscard sleeps forever in submit_bio_wait(). Fix this by jumping to end_bio label, which does bio_endio(). 
Link: https://lore.kernel.org/20260331074255.777019-1-senozhatsky@chromium.org Fixes: 0120dd6e4e20 ("zram: make zram_bio_discard more self-contained") Signed-off-by: Sergey Senozhatsky Reported-by: Qu Wenruo Closes: https://lore.kernel.org/linux-block/92361cd3-fb8b-482e-bc89-15ff1acb9a59@suse.com Tested-by: Qu Wenruo Reported-by: Avinesh Kumar Closes: https://bugzilla.suse.com/show_bug.cgi?id=1256530 Reviewed-by: Christoph Hellwig Cc: Brian Geffon Cc: Jens Axboe Cc: Minchan Kim Cc: Signed-off-by: Andrew Morton --- drivers/block/zram/zram_drv.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c index c2afd1c34f4a..43b68fdd95d6 100644 --- a/drivers/block/zram/zram_drv.c +++ b/drivers/block/zram/zram_drv.c @@ -2678,7 +2678,7 @@ static void zram_bio_discard(struct zram *zram, struct bio *bio) */ if (offset) { if (n <= (PAGE_SIZE - offset)) - return; + goto end_bio; n -= (PAGE_SIZE - offset); index++; @@ -2693,6 +2693,7 @@ static void zram_bio_discard(struct zram *zram, struct bio *bio) n -= PAGE_SIZE; } +end_bio: bio_endio(bio); } -- cgit v1.2.3 From 7cf6d940f4032d87d9cfe6b27c0e49e309818e5d Mon Sep 17 00:00:00 2001 From: Muchun Song Date: Tue, 31 Mar 2026 19:37:24 +0800 Subject: mm/sparse: fix preinited section_mem_map clobbering on failure path sparse_init_nid() is careful to leave alone every section whose vmemmap has already been set up by sparse_vmemmap_init_nid_early(); it only clears section_mem_map for the rest: if (!preinited_vmemmap_section(ms)) ms->section_mem_map = 0; A leftover line after that conditional block ms->section_mem_map = 0; was supposed to be deleted but was missed in the failure path, causing the field to be overwritten for all sections when memory allocation fails, effectively destroying the pre-initialization check. Drop the stray assignment so that preinited sections retain their already valid state. Those pre-inited sections (HugeTLB pages) are not activated. However, such failures are extremely rare, so I don't see any major userspace issues. Link: https://lore.kernel.org/20260331113724.2080833-1-songmuchun@bytedance.com Fixes: d65917c42373 ("mm/sparse: allow for alternate vmemmap section init at boot") Signed-off-by: Muchun Song Acked-by: David Hildenbrand (Arm) Reviewed by: Donet Tom Cc: David Hildenbrand Cc: Frank van der Linden Cc: Liam Howlett Cc: Lorenzo Stoakes (Oracle) Cc: Michal Hocko Cc: Mike Rapoport Cc: Suren Baghdasaryan Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- mm/sparse.c | 1 - 1 file changed, 1 deletion(-) diff --git a/mm/sparse.c b/mm/sparse.c index 007fd52c621e..effdac6b0ab1 100644 --- a/mm/sparse.c +++ b/mm/sparse.c @@ -403,7 +403,6 @@ failed: ms = __nr_to_section(pnum); if (!preinited_vmemmap_section(ms)) ms->section_mem_map = 0; - ms->section_mem_map = 0; } } -- cgit v1.2.3 From ed2a29dc6dcf4630ef19d588704c2ca7b46607bb Mon Sep 17 00:00:00 2001 From: Chenghao Duan Date: Thu, 26 Mar 2026 16:47:21 +0800 Subject: mm/memfd: use folio_nr_pages() for shmem inode accounting I found several modifiable points while reading the code. This patch (of 6): Patch series "Modify memfd_luo code", v3. memfd_luo_retrieve_folios() called shmem_inode_acct_blocks() and shmem_recalc_inode() with hardcoded 1 instead of the actual folio page count. memfd may use large folios (THP/hugepages), causing quota/limit under-accounting and incorrect stat output. Fix by using folio_nr_pages(folio) for both functions. Issue found by AI review and suggested by Pratyush Yadav . 
https://sashiko.dev/#/patchset/20260319012845.29570-1-duanchenghao%40kylinos.cn Link: https://lore.kernel.org/20260326084727.118437-1-duanchenghao@kylinos.cn Link: https://lore.kernel.org/20260326084727.118437-2-duanchenghao@kylinos.cn Signed-off-by: Chenghao Duan Suggested-by: Pratyush Yadav Reviewed-by: Pasha Tatashin Reviewed-by: Pratyush Yadav Cc: Haoran Jiang Cc: Mike Rapoport Signed-off-by: Andrew Morton --- mm/memfd_luo.c | 10 ++++++---- 1 file changed, 6 insertions(+), 4 deletions(-) diff --git a/mm/memfd_luo.c b/mm/memfd_luo.c index 9130e6ce396d..ec4c3e1e2891 100644 --- a/mm/memfd_luo.c +++ b/mm/memfd_luo.c @@ -410,6 +410,7 @@ static int memfd_luo_retrieve_folios(struct file *file, struct inode *inode = file_inode(file); struct address_space *mapping = inode->i_mapping; struct folio *folio; + long npages; int err = -EIO; long i; @@ -456,14 +457,15 @@ static int memfd_luo_retrieve_folios(struct file *file, if (flags & MEMFD_LUO_FOLIO_DIRTY) folio_mark_dirty(folio); - err = shmem_inode_acct_blocks(inode, 1); + npages = folio_nr_pages(folio); + err = shmem_inode_acct_blocks(inode, npages); if (err) { - pr_err("shmem: failed to account folio index %ld: %d\n", - i, err); + pr_err("shmem: failed to account folio index %ld(%ld pages): %d\n", + i, npages, err); goto unlock_folio; } - shmem_recalc_inode(inode, 1, 0); + shmem_recalc_inode(inode, npages, 0); folio_add_lru(folio); folio_unlock(folio); folio_put(folio); -- cgit v1.2.3 From 502d3c2ad8f05d1545ae05f96f71a5916aa88b0f Mon Sep 17 00:00:00 2001 From: Chenghao Duan Date: Thu, 26 Mar 2026 16:47:22 +0800 Subject: mm/memfd_luo: optimize shmem_recalc_inode calls in retrieve path Move shmem_recalc_inode() out of the loop in memfd_luo_retrieve_folios() to improve performance when restoring large memfds. Currently, shmem_recalc_inode() is called for each folio during restore, which is O(n) expensive operations. This patch collects the number of successfully added folios and calls shmem_recalc_inode() once after the loop completes, reducing complexity to O(1). Additionally, fix the error path to also call shmem_recalc_inode() for the folios that were successfully added before the error occurred. 
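The shape of the change is simply hoisting the accounting out of the per-folio loop. A condensed sketch of the idea (account_added_folios() is a made-up wrapper; the folio_nr_pages()/shmem_recalc_inode() usage follows this series, and error handling is elided):

        #include <linux/mm.h>
        #include <linux/shmem_fs.h>

        /* Sketch: account all successfully added folios with a single call. */
        static void account_added_folios(struct inode *inode,
                                         struct folio **folios, long nr)
        {
                long nr_added_pages = 0;
                long i;

                for (i = 0; i < nr; i++)
                        nr_added_pages += folio_nr_pages(folios[i]);

                /* one O(1) update instead of one call per folio */
                shmem_recalc_inode(inode, nr_added_pages, 0);
        }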
Link: https://lore.kernel.org/20260326084727.118437-3-duanchenghao@kylinos.cn Signed-off-by: Chenghao Duan Reviewed-by: Pasha Tatashin Reviewed-by: Pratyush Yadav Cc: Haoran Jiang Cc: Mike Rapoport (Microsoft) Signed-off-by: Andrew Morton --- mm/memfd_luo.c | 8 ++++++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/mm/memfd_luo.c b/mm/memfd_luo.c index ec4c3e1e2891..865b044bee62 100644 --- a/mm/memfd_luo.c +++ b/mm/memfd_luo.c @@ -410,7 +410,7 @@ static int memfd_luo_retrieve_folios(struct file *file, struct inode *inode = file_inode(file); struct address_space *mapping = inode->i_mapping; struct folio *folio; - long npages; + long npages, nr_added_pages = 0; int err = -EIO; long i; @@ -465,12 +465,14 @@ static int memfd_luo_retrieve_folios(struct file *file, goto unlock_folio; } - shmem_recalc_inode(inode, npages, 0); + nr_added_pages += npages; folio_add_lru(folio); folio_unlock(folio); folio_put(folio); } + shmem_recalc_inode(inode, nr_added_pages, 0); + return 0; unlock_folio: @@ -489,6 +491,8 @@ put_folios: folio_put(folio); } + shmem_recalc_inode(inode, nr_added_pages, 0); + return err; } -- cgit v1.2.3 From 4aa6424f37b58a4f8298329166657bd4fd8e9ca8 Mon Sep 17 00:00:00 2001 From: Chenghao Duan Date: Thu, 26 Mar 2026 16:47:23 +0800 Subject: mm/memfd_luo: remove unnecessary memset in zero-size memfd path The memset(kho_vmalloc, 0, sizeof(*kho_vmalloc)) call in the zero-size file handling path is unnecessary because the allocation of the ser structure already uses the __GFP_ZERO flag, ensuring the memory is already zero-initialized. Link: https://lore.kernel.org/20260326084727.118437-4-duanchenghao@kylinos.cn Signed-off-by: Chenghao Duan Reviewed-by: Pratyush Yadav Reviewed-by: Pasha Tatashin Reviewed-by: Mike Rapoport (Microsoft) Cc: Haoran Jiang Signed-off-by: Andrew Morton --- mm/memfd_luo.c | 1 - 1 file changed, 1 deletion(-) diff --git a/mm/memfd_luo.c b/mm/memfd_luo.c index 865b044bee62..5a8ead5be087 100644 --- a/mm/memfd_luo.c +++ b/mm/memfd_luo.c @@ -105,7 +105,6 @@ static int memfd_luo_preserve_folios(struct file *file, if (!size) { *nr_foliosp = 0; *out_folios_ser = NULL; - memset(kho_vmalloc, 0, sizeof(*kho_vmalloc)); return 0; } -- cgit v1.2.3 From 32f6cec5e7511ce3e48d504601035f108844e063 Mon Sep 17 00:00:00 2001 From: Chenghao Duan Date: Thu, 26 Mar 2026 16:47:24 +0800 Subject: mm/memfd_luo: use i_size_write() to set inode size during retrieve Use i_size_write() instead of directly assigning to inode->i_size when restoring the memfd size in memfd_luo_retrieve(), to keep code consistency. No functional change intended. 
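Beyond consistency, i_size_write() wraps the store in the inode's i_size seqcount so that i_size_read() on 32-bit SMP cannot observe a torn 64-bit value; callers are expected to serialise updates themselves. A minimal sketch of the convention (set_restored_size() is an illustrative name; file_inode() and i_size_write() are the regular VFS helpers):

        #include <linux/fs.h>

        /* Sketch: update a file size the way the rest of the VFS expects. */
        static void set_restored_size(struct file *file, loff_t size)
        {
                struct inode *inode = file_inode(file);

                i_size_write(inode, size);      /* paired with i_size_read() readers */
        }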
Link: https://lore.kernel.org/20260326084727.118437-5-duanchenghao@kylinos.cn Signed-off-by: Chenghao Duan Reviewed-by: Pasha Tatashin Cc: Haoran Jiang Cc: Mike Rapoport (Microsoft) Cc: Pratyush Yadav Signed-off-by: Andrew Morton --- mm/memfd_luo.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/memfd_luo.c b/mm/memfd_luo.c index 5a8ead5be087..eb9f4cc0e7ae 100644 --- a/mm/memfd_luo.c +++ b/mm/memfd_luo.c @@ -530,7 +530,7 @@ static int memfd_luo_retrieve(struct liveupdate_file_op_args *args) } vfs_setpos(file, ser->pos, MAX_LFS_FILESIZE); - file->f_inode->i_size = ser->size; + i_size_write(file_inode(file), ser->size); if (ser->nr_folios) { folios_ser = kho_restore_vmalloc(&ser->folios); -- cgit v1.2.3 From 3538f90ab89aaf302782b4b073a0aae66904cd67 Mon Sep 17 00:00:00 2001 From: Chenghao Duan Date: Thu, 26 Mar 2026 16:47:25 +0800 Subject: mm/memfd_luo: fix physical address conversion in put_folios cleanup In memfd_luo_retrieve_folios()'s put_folios cleanup path: 1. kho_restore_folio() expects a phys_addr_t (physical address) but receives a raw PFN (pfolio->pfn). This causes kho_restore_page() to check the wrong physical address (pfn << PAGE_SHIFT instead of the actual physical address). 2. This loop lacks the !pfolio->pfn check that exists in the main retrieval loop and memfd_luo_discard_folios(), which could incorrectly process sparse file holes where pfn=0. Fix by converting PFN to physical address with PFN_PHYS() and adding the !pfolio->pfn check, matching the pattern used elsewhere in this file. This issue was identified by the AI review. https://sashiko.dev/#/patchset/20260323110747.193569-1-duanchenghao@kylinos.cn Link: https://lore.kernel.org/20260326084727.118437-6-duanchenghao@kylinos.cn Fixes: b3749f174d68 ("mm: memfd_luo: allow preserving memfd") Signed-off-by: Chenghao Duan Reviewed-by: Pasha Tatashin Reviewed-by: Pratyush Yadav Cc: Haoran Jiang Cc: Mike Rapoport (Microsoft) Cc: Signed-off-by: Andrew Morton --- mm/memfd_luo.c | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/mm/memfd_luo.c b/mm/memfd_luo.c index eb9f4cc0e7ae..eb611527dedd 100644 --- a/mm/memfd_luo.c +++ b/mm/memfd_luo.c @@ -484,8 +484,13 @@ put_folios: */ for (long j = i + 1; j < nr_folios; j++) { const struct memfd_luo_folio_ser *pfolio = &folios_ser[j]; + phys_addr_t phys; + + if (!pfolio->pfn) + continue; - folio = kho_restore_folio(pfolio->pfn); + phys = PFN_PHYS(pfolio->pfn); + folio = kho_restore_folio(phys); if (folio) folio_put(folio); } -- cgit v1.2.3 From dc44f32fde25c401da6c4746c389ec552ddbc30f Mon Sep 17 00:00:00 2001 From: Chenghao Duan Date: Thu, 26 Mar 2026 16:47:26 +0800 Subject: mm/memfd_luo: remove folio from page cache when accounting fails In memfd_luo_retrieve_folios(), when shmem_inode_acct_blocks() fails after successfully adding the folio to the page cache, the code jumps to unlock_folio without removing the folio from the page cache. While the folio eventually will be freed when the file is released by memfd_luo_retrieve(), it is a good idea to directly remove a folio that was not fully added to the file. This avoids the possibility of accounting mismatches in shmem or filemap core. Fix by adding a remove_from_cache label that calls filemap_remove_folio() before unlocking, matching the error handling pattern in shmem_alloc_and_add_folio(). 
This issue was identified by AI review: https://sashiko.dev/#/patchset/20260323110747.193569-1-duanchenghao@kylinos.cn [pratyush@kernel.org: changelog alterations] Link: https://lore.kernel.org/2vxzzf3lfujq.fsf@kernel.org Link: https://lore.kernel.org/20260326084727.118437-7-duanchenghao@kylinos.cn Signed-off-by: Chenghao Duan Reviewed-by: Pasha Tatashin Reviewed-by: Pratyush Yadav Cc: Haoran Jiang Cc: Mike Rapoport (Microsoft) Signed-off-by: Andrew Morton --- mm/memfd_luo.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/mm/memfd_luo.c b/mm/memfd_luo.c index eb611527dedd..b02b503c750d 100644 --- a/mm/memfd_luo.c +++ b/mm/memfd_luo.c @@ -461,7 +461,7 @@ static int memfd_luo_retrieve_folios(struct file *file, if (err) { pr_err("shmem: failed to account folio index %ld(%ld pages): %d\n", i, npages, err); - goto unlock_folio; + goto remove_from_cache; } nr_added_pages += npages; @@ -474,6 +474,8 @@ static int memfd_luo_retrieve_folios(struct file *file, return 0; +remove_from_cache: + filemap_remove_folio(folio); unlock_folio: folio_unlock(folio); folio_put(folio); -- cgit v1.2.3 From c0620487fc33320ed7ccdfdd9644d996f8c09c5a Mon Sep 17 00:00:00 2001 From: "Mike Rapoport (Microsoft)" Date: Thu, 2 Apr 2026 07:11:42 +0300 Subject: userfaultfd: introduce mfill_copy_folio_locked() helper Patch series "mm, kvm: allow uffd support in guest_memfd", v4. These patches enable support for userfaultfd in guest_memfd. As the groundwork I refactored userfaultfd handling of PTE-based memory types (anonymous and shmem) and converted them to use vm_uffd_ops for allocating a folio or getting an existing folio from the page cache. shmem also implements callbacks that add a folio to the page cache after the data passed in UFFDIO_COPY was copied and remove the folio from the page cache if page table update fails. In order for guest_memfd to notify userspace about page faults, there are new VM_FAULT_UFFD_MINOR and VM_FAULT_UFFD_MISSING that a ->fault() handler can return to inform the page fault handler that it needs to call handle_userfault() to complete the fault. Nikita helped to plumb these new goodies into guest_memfd and provided basic tests to verify that guest_memfd works with userfaultfd. The handling of UFFDIO_MISSING in guest_memfd requires the ability to remove a folio from the page cache; the best way I could find was exporting filemap_remove_folio() to KVM. I deliberately left hugetlb out, at least for the most part. hugetlb handles acquisition of the VMA and, more importantly, establishment of the parent page table entry differently than PTE-based memory types. This is a different abstraction level than what vm_uffd_ops provides and people objected to exposing such low-level APIs as part of VMA operations. Also, to enable uffd in guest_memfd, refactoring of hugetlb is not needed and I prefer to delay it until the dust settles after the changes in this set. This patch (of 4): Split the copying of data that is done while locks are held out of mfill_atomic_pte_copy() into a helper function, mfill_copy_folio_locked(). This improves code readability and makes the complex mfill_atomic_pte_copy() function easier to comprehend. No functional change.
Link: https://lore.kernel.org/20260402041156.1377214-1-rppt@kernel.org Link: https://lore.kernel.org/20260402041156.1377214-2-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) Acked-by: Peter Xu Reviewed-by: David Hildenbrand (Arm) Reviewed-by: Harry Yoo (Oracle) Cc: Andrea Arcangeli Cc: Andrei Vagin Cc: Axel Rasmussen Cc: Baolin Wang Cc: Hugh Dickins Cc: James Houghton Cc: Liam Howlett Cc: Lorenzo Stoakes (Oracle) Cc: Matthew Wilcox (Oracle) Cc: Michal Hocko Cc: Mike Rapoport Cc: Muchun Song Cc: Oscar Salvador Cc: Paolo Bonzini Cc: Sean Christopherson Cc: Shuah Khan Cc: Suren Baghdasaryan Cc: Vlastimil Babka Cc: Harry Yoo Cc: Nikita Kalyazin Cc: David Carlier Signed-off-by: Andrew Morton --- mm/userfaultfd.c | 59 +++++++++++++++++++++++++++++++++----------------------- 1 file changed, 35 insertions(+), 24 deletions(-) diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c index 89879c3ba344..795bafb2c6cc 100644 --- a/mm/userfaultfd.c +++ b/mm/userfaultfd.c @@ -238,6 +238,40 @@ out: return ret; } +static int mfill_copy_folio_locked(struct folio *folio, unsigned long src_addr) +{ + void *kaddr; + int ret; + + kaddr = kmap_local_folio(folio, 0); + /* + * The read mmap_lock is held here. Despite the + * mmap_lock being read recursive a deadlock is still + * possible if a writer has taken a lock. For example: + * + * process A thread 1 takes read lock on own mmap_lock + * process A thread 2 calls mmap, blocks taking write lock + * process B thread 1 takes page fault, read lock on own mmap lock + * process B thread 2 calls mmap, blocks taking write lock + * process A thread 1 blocks taking read lock on process B + * process B thread 1 blocks taking read lock on process A + * + * Disable page faults to prevent potential deadlock + * and retry the copy outside the mmap_lock. + */ + pagefault_disable(); + ret = copy_from_user(kaddr, (const void __user *) src_addr, + PAGE_SIZE); + pagefault_enable(); + kunmap_local(kaddr); + + if (ret) + return -EFAULT; + + flush_dcache_folio(folio); + return ret; +} + static int mfill_atomic_pte_copy(pmd_t *dst_pmd, struct vm_area_struct *dst_vma, unsigned long dst_addr, @@ -245,7 +279,6 @@ static int mfill_atomic_pte_copy(pmd_t *dst_pmd, uffd_flags_t flags, struct folio **foliop) { - void *kaddr; int ret; struct folio *folio; @@ -256,27 +289,7 @@ static int mfill_atomic_pte_copy(pmd_t *dst_pmd, if (!folio) goto out; - kaddr = kmap_local_folio(folio, 0); - /* - * The read mmap_lock is held here. Despite the - * mmap_lock being read recursive a deadlock is still - * possible if a writer has taken a lock. For example: - * - * process A thread 1 takes read lock on own mmap_lock - * process A thread 2 calls mmap, blocks taking write lock - * process B thread 1 takes page fault, read lock on own mmap lock - * process B thread 2 calls mmap, blocks taking write lock - * process A thread 1 blocks taking read lock on process B - * process B thread 1 blocks taking read lock on process A - * - * Disable page faults to prevent potential deadlock - * and retry the copy outside the mmap_lock. 
- */ - pagefault_disable(); - ret = copy_from_user(kaddr, (const void __user *) src_addr, - PAGE_SIZE); - pagefault_enable(); - kunmap_local(kaddr); + ret = mfill_copy_folio_locked(folio, src_addr); /* fallback to copy_from_user outside mmap_lock */ if (unlikely(ret)) { @@ -285,8 +298,6 @@ static int mfill_atomic_pte_copy(pmd_t *dst_pmd, /* don't free the page */ goto out; } - - flush_dcache_folio(folio); } else { folio = *foliop; *foliop = NULL; -- cgit v1.2.3 From db0062d2c0357eb23b1c2dd4978ff4c2e1e5806b Mon Sep 17 00:00:00 2001 From: "Mike Rapoport (Microsoft)" Date: Thu, 2 Apr 2026 07:11:43 +0300 Subject: userfaultfd: introduce struct mfill_state mfill_atomic() passes a lot of parameters down to its callees. Aggregate them all into mfill_state structure and pass this structure to functions that implement various UFFDIO_ commands. Tracking the state in a structure will allow moving the code that retries copying of data for UFFDIO_COPY into mfill_atomic_pte_copy() and make the loop in mfill_atomic() identical for all UFFDIO operations on PTE-mapped memory. The mfill_state definition is deliberately local to mm/userfaultfd.c, hence shmem_mfill_atomic_pte() is not updated. [harry.yoo@oracle.com: properly initialize mfill_state.len to fix folio_add_new_anon_rmap() WARN] Link: https://lore.kernel.org/abehBY7QakYF9bK4@hyeyoo Link: https://lore.kernel.org/20260402041156.1377214-3-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) Signed-off-by: Harry Yoo Acked-by: David Hildenbrand (Arm) Reviewed-by: Harry Yoo (Oracle) Cc: Andrea Arcangeli Cc: Andrei Vagin Cc: Axel Rasmussen Cc: Baolin Wang Cc: Harry Yoo (Oracle) Cc: Hugh Dickins Cc: James Houghton Cc: Liam Howlett Cc: Lorenzo Stoakes (Oracle) Cc: Matthew Wilcox (Oracle) Cc: Michal Hocko Cc: Muchun Song Cc: Nikita Kalyazin Cc: Oscar Salvador Cc: Paolo Bonzini Cc: Peter Xu Cc: Sean Christopherson Cc: Shuah Khan Cc: Suren Baghdasaryan Cc: Vlastimil Babka Cc: David Carlier Signed-off-by: Andrew Morton --- mm/userfaultfd.c | 147 ++++++++++++++++++++++++++++++------------------------- 1 file changed, 81 insertions(+), 66 deletions(-) diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c index 795bafb2c6cc..a12a8411e85e 100644 --- a/mm/userfaultfd.c +++ b/mm/userfaultfd.c @@ -20,6 +20,20 @@ #include "internal.h" #include "swap.h" +struct mfill_state { + struct userfaultfd_ctx *ctx; + unsigned long src_start; + unsigned long dst_start; + unsigned long len; + uffd_flags_t flags; + + struct vm_area_struct *vma; + unsigned long src_addr; + unsigned long dst_addr; + struct folio *folio; + pmd_t *pmd; +}; + static __always_inline bool validate_dst_vma(struct vm_area_struct *dst_vma, unsigned long dst_end) { @@ -272,17 +286,17 @@ static int mfill_copy_folio_locked(struct folio *folio, unsigned long src_addr) return ret; } -static int mfill_atomic_pte_copy(pmd_t *dst_pmd, - struct vm_area_struct *dst_vma, - unsigned long dst_addr, - unsigned long src_addr, - uffd_flags_t flags, - struct folio **foliop) +static int mfill_atomic_pte_copy(struct mfill_state *state) { - int ret; + struct vm_area_struct *dst_vma = state->vma; + unsigned long dst_addr = state->dst_addr; + unsigned long src_addr = state->src_addr; + uffd_flags_t flags = state->flags; + pmd_t *dst_pmd = state->pmd; struct folio *folio; + int ret; - if (!*foliop) { + if (!state->folio) { ret = -ENOMEM; folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, dst_vma, dst_addr); @@ -294,13 +308,13 @@ static int mfill_atomic_pte_copy(pmd_t *dst_pmd, /* fallback to copy_from_user outside mmap_lock */ if 
(unlikely(ret)) { ret = -ENOENT; - *foliop = folio; + state->folio = folio; /* don't free the page */ goto out; } } else { - folio = *foliop; - *foliop = NULL; + folio = state->folio; + state->folio = NULL; } /* @@ -357,10 +371,11 @@ out_put: return ret; } -static int mfill_atomic_pte_zeropage(pmd_t *dst_pmd, - struct vm_area_struct *dst_vma, - unsigned long dst_addr) +static int mfill_atomic_pte_zeropage(struct mfill_state *state) { + struct vm_area_struct *dst_vma = state->vma; + unsigned long dst_addr = state->dst_addr; + pmd_t *dst_pmd = state->pmd; pte_t _dst_pte, *dst_pte; spinlock_t *ptl; int ret; @@ -392,13 +407,14 @@ out: } /* Handles UFFDIO_CONTINUE for all shmem VMAs (shared or private). */ -static int mfill_atomic_pte_continue(pmd_t *dst_pmd, - struct vm_area_struct *dst_vma, - unsigned long dst_addr, - uffd_flags_t flags) +static int mfill_atomic_pte_continue(struct mfill_state *state) { - struct inode *inode = file_inode(dst_vma->vm_file); + struct vm_area_struct *dst_vma = state->vma; + unsigned long dst_addr = state->dst_addr; pgoff_t pgoff = linear_page_index(dst_vma, dst_addr); + struct inode *inode = file_inode(dst_vma->vm_file); + uffd_flags_t flags = state->flags; + pmd_t *dst_pmd = state->pmd; struct folio *folio; struct page *page; int ret; @@ -436,15 +452,15 @@ out_release: } /* Handles UFFDIO_POISON for all non-hugetlb VMAs. */ -static int mfill_atomic_pte_poison(pmd_t *dst_pmd, - struct vm_area_struct *dst_vma, - unsigned long dst_addr, - uffd_flags_t flags) +static int mfill_atomic_pte_poison(struct mfill_state *state) { - int ret; + struct vm_area_struct *dst_vma = state->vma; struct mm_struct *dst_mm = dst_vma->vm_mm; + unsigned long dst_addr = state->dst_addr; + pmd_t *dst_pmd = state->pmd; pte_t _dst_pte, *dst_pte; spinlock_t *ptl; + int ret; _dst_pte = make_pte_marker(PTE_MARKER_POISONED); ret = -EAGAIN; @@ -668,22 +684,20 @@ extern ssize_t mfill_atomic_hugetlb(struct userfaultfd_ctx *ctx, uffd_flags_t flags); #endif /* CONFIG_HUGETLB_PAGE */ -static __always_inline ssize_t mfill_atomic_pte(pmd_t *dst_pmd, - struct vm_area_struct *dst_vma, - unsigned long dst_addr, - unsigned long src_addr, - uffd_flags_t flags, - struct folio **foliop) +static __always_inline ssize_t mfill_atomic_pte(struct mfill_state *state) { + struct vm_area_struct *dst_vma = state->vma; + unsigned long src_addr = state->src_addr; + unsigned long dst_addr = state->dst_addr; + struct folio **foliop = &state->folio; + uffd_flags_t flags = state->flags; + pmd_t *dst_pmd = state->pmd; ssize_t err; - if (uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE)) { - return mfill_atomic_pte_continue(dst_pmd, dst_vma, - dst_addr, flags); - } else if (uffd_flags_mode_is(flags, MFILL_ATOMIC_POISON)) { - return mfill_atomic_pte_poison(dst_pmd, dst_vma, - dst_addr, flags); - } + if (uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE)) + return mfill_atomic_pte_continue(state); + if (uffd_flags_mode_is(flags, MFILL_ATOMIC_POISON)) + return mfill_atomic_pte_poison(state); /* * The normal page fault path for a shmem will invoke the @@ -697,12 +711,9 @@ static __always_inline ssize_t mfill_atomic_pte(pmd_t *dst_pmd, */ if (!(dst_vma->vm_flags & VM_SHARED)) { if (uffd_flags_mode_is(flags, MFILL_ATOMIC_COPY)) - err = mfill_atomic_pte_copy(dst_pmd, dst_vma, - dst_addr, src_addr, - flags, foliop); + err = mfill_atomic_pte_copy(state); else - err = mfill_atomic_pte_zeropage(dst_pmd, - dst_vma, dst_addr); + err = mfill_atomic_pte_zeropage(state); } else { err = shmem_mfill_atomic_pte(dst_pmd, dst_vma, dst_addr, 
src_addr, @@ -718,13 +729,20 @@ static __always_inline ssize_t mfill_atomic(struct userfaultfd_ctx *ctx, unsigned long len, uffd_flags_t flags) { + struct mfill_state state = (struct mfill_state){ + .ctx = ctx, + .dst_start = dst_start, + .src_start = src_start, + .flags = flags, + .len = len, + .src_addr = src_start, + .dst_addr = dst_start, + }; struct mm_struct *dst_mm = ctx->mm; struct vm_area_struct *dst_vma; + long copied = 0; ssize_t err; pmd_t *dst_pmd; - unsigned long src_addr, dst_addr; - long copied; - struct folio *folio; /* * Sanitize the command parameters: @@ -736,10 +754,6 @@ static __always_inline ssize_t mfill_atomic(struct userfaultfd_ctx *ctx, VM_WARN_ON_ONCE(src_start + len <= src_start); VM_WARN_ON_ONCE(dst_start + len <= dst_start); - src_addr = src_start; - dst_addr = dst_start; - copied = 0; - folio = NULL; retry: /* * Make sure the vma is not shared, that the dst range is @@ -750,6 +764,7 @@ retry: err = PTR_ERR(dst_vma); goto out; } + state.vma = dst_vma; /* * If memory mappings are changing because of non-cooperative @@ -790,12 +805,12 @@ retry: uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE)) goto out_unlock; - while (src_addr < src_start + len) { - pmd_t dst_pmdval; + while (state.src_addr < src_start + len) { + VM_WARN_ON_ONCE(state.dst_addr >= dst_start + len); - VM_WARN_ON_ONCE(dst_addr >= dst_start + len); + pmd_t dst_pmdval; - dst_pmd = mm_alloc_pmd(dst_mm, dst_addr); + dst_pmd = mm_alloc_pmd(dst_mm, state.dst_addr); if (unlikely(!dst_pmd)) { err = -ENOMEM; break; @@ -827,34 +842,34 @@ retry: * tables under us; pte_offset_map_lock() will deal with that. */ - err = mfill_atomic_pte(dst_pmd, dst_vma, dst_addr, - src_addr, flags, &folio); + state.pmd = dst_pmd; + err = mfill_atomic_pte(&state); cond_resched(); if (unlikely(err == -ENOENT)) { void *kaddr; up_read(&ctx->map_changing_lock); - uffd_mfill_unlock(dst_vma); - VM_WARN_ON_ONCE(!folio); + uffd_mfill_unlock(state.vma); + VM_WARN_ON_ONCE(!state.folio); - kaddr = kmap_local_folio(folio, 0); + kaddr = kmap_local_folio(state.folio, 0); err = copy_from_user(kaddr, - (const void __user *) src_addr, + (const void __user *)state.src_addr, PAGE_SIZE); kunmap_local(kaddr); if (unlikely(err)) { err = -EFAULT; goto out; } - flush_dcache_folio(folio); + flush_dcache_folio(state.folio); goto retry; } else - VM_WARN_ON_ONCE(folio); + VM_WARN_ON_ONCE(state.folio); if (!err) { - dst_addr += PAGE_SIZE; - src_addr += PAGE_SIZE; + state.dst_addr += PAGE_SIZE; + state.src_addr += PAGE_SIZE; copied += PAGE_SIZE; if (fatal_signal_pending(current)) @@ -866,10 +881,10 @@ retry: out_unlock: up_read(&ctx->map_changing_lock); - uffd_mfill_unlock(dst_vma); + uffd_mfill_unlock(state.vma); out: - if (folio) - folio_put(folio); + if (state.folio) + folio_put(state.folio); VM_WARN_ON_ONCE(copied < 0); VM_WARN_ON_ONCE(err > 0); VM_WARN_ON_ONCE(!copied && !err); -- cgit v1.2.3 From e2e0b826d37419536b91b25fa51ecc0565d27726 Mon Sep 17 00:00:00 2001 From: "Mike Rapoport (Microsoft)" Date: Thu, 2 Apr 2026 07:11:44 +0300 Subject: userfaultfd: introduce mfill_establish_pmd() helper There is a lengthy code chunk in mfill_atomic() that establishes the PMD for UFFDIO operations. This code may be called twice: first time when the copy is performed with VMA/mm locks held and the other time after the copy is retried with locks dropped. Move the code that establishes a PMD into a helper function so it can be reused later during refactoring of mfill_atomic_pte_copy(). 
Link: https://lore.kernel.org/20260402041156.1377214-4-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) Reviewed-by: Harry Yoo (Oracle) Cc: Andrea Arcangeli Cc: Andrei Vagin Cc: Axel Rasmussen Cc: Baolin Wang Cc: David Hildenbrand (Arm) Cc: Harry Yoo Cc: Hugh Dickins Cc: James Houghton Cc: Liam Howlett Cc: Lorenzo Stoakes (Oracle) Cc: Matthew Wilcox (Oracle) Cc: Michal Hocko Cc: Muchun Song Cc: Nikita Kalyazin Cc: Oscar Salvador Cc: Paolo Bonzini Cc: Peter Xu Cc: Sean Christopherson Cc: Shuah Khan Cc: Suren Baghdasaryan Cc: Vlastimil Babka Cc: David Carlier Signed-off-by: Andrew Morton --- mm/userfaultfd.c | 102 ++++++++++++++++++++++++++++--------------------------- 1 file changed, 52 insertions(+), 50 deletions(-) diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c index a12a8411e85e..3f7ed93020bc 100644 --- a/mm/userfaultfd.c +++ b/mm/userfaultfd.c @@ -157,6 +157,56 @@ static void uffd_mfill_unlock(struct vm_area_struct *vma) } #endif +static pmd_t *mm_alloc_pmd(struct mm_struct *mm, unsigned long address) +{ + pgd_t *pgd; + p4d_t *p4d; + pud_t *pud; + + pgd = pgd_offset(mm, address); + p4d = p4d_alloc(mm, pgd, address); + if (!p4d) + return NULL; + pud = pud_alloc(mm, p4d, address); + if (!pud) + return NULL; + /* + * Note that we didn't run this because the pmd was + * missing, the *pmd may be already established and in + * turn it may also be a trans_huge_pmd. + */ + return pmd_alloc(mm, pud, address); +} + +static int mfill_establish_pmd(struct mfill_state *state) +{ + struct mm_struct *dst_mm = state->ctx->mm; + pmd_t *dst_pmd, dst_pmdval; + + dst_pmd = mm_alloc_pmd(dst_mm, state->dst_addr); + if (unlikely(!dst_pmd)) + return -ENOMEM; + + dst_pmdval = pmdp_get_lockless(dst_pmd); + if (unlikely(pmd_none(dst_pmdval)) && + unlikely(__pte_alloc(dst_mm, dst_pmd))) + return -ENOMEM; + + dst_pmdval = pmdp_get_lockless(dst_pmd); + /* + * If the dst_pmd is THP don't override it and just be strict. + * (This includes the case where the PMD used to be THP and + * changed back to none after __pte_alloc().) + */ + if (unlikely(!pmd_present(dst_pmdval) || pmd_leaf(dst_pmdval))) + return -EEXIST; + if (unlikely(pmd_bad(dst_pmdval))) + return -EFAULT; + + state->pmd = dst_pmd; + return 0; +} + /* Check if dst_addr is outside of file's size. Must be called with ptl held. */ static bool mfill_file_over_size(struct vm_area_struct *dst_vma, unsigned long dst_addr) @@ -489,27 +539,6 @@ out: return ret; } -static pmd_t *mm_alloc_pmd(struct mm_struct *mm, unsigned long address) -{ - pgd_t *pgd; - p4d_t *p4d; - pud_t *pud; - - pgd = pgd_offset(mm, address); - p4d = p4d_alloc(mm, pgd, address); - if (!p4d) - return NULL; - pud = pud_alloc(mm, p4d, address); - if (!pud) - return NULL; - /* - * Note that we didn't run this because the pmd was - * missing, the *pmd may be already established and in - * turn it may also be a trans_huge_pmd. - */ - return pmd_alloc(mm, pud, address); -} - #ifdef CONFIG_HUGETLB_PAGE /* * mfill_atomic processing for HUGETLB vmas. 
Note that this routine is @@ -742,7 +771,6 @@ static __always_inline ssize_t mfill_atomic(struct userfaultfd_ctx *ctx, struct vm_area_struct *dst_vma; long copied = 0; ssize_t err; - pmd_t *dst_pmd; /* * Sanitize the command parameters: @@ -808,41 +836,15 @@ retry: while (state.src_addr < src_start + len) { VM_WARN_ON_ONCE(state.dst_addr >= dst_start + len); - pmd_t dst_pmdval; - - dst_pmd = mm_alloc_pmd(dst_mm, state.dst_addr); - if (unlikely(!dst_pmd)) { - err = -ENOMEM; + err = mfill_establish_pmd(&state); + if (err) break; - } - dst_pmdval = pmdp_get_lockless(dst_pmd); - if (unlikely(pmd_none(dst_pmdval)) && - unlikely(__pte_alloc(dst_mm, dst_pmd))) { - err = -ENOMEM; - break; - } - dst_pmdval = pmdp_get_lockless(dst_pmd); - /* - * If the dst_pmd is THP don't override it and just be strict. - * (This includes the case where the PMD used to be THP and - * changed back to none after __pte_alloc().) - */ - if (unlikely(!pmd_present(dst_pmdval) || - pmd_trans_huge(dst_pmdval))) { - err = -EEXIST; - break; - } - if (unlikely(pmd_bad(dst_pmdval))) { - err = -EFAULT; - break; - } /* * For shmem mappings, khugepaged is allowed to remove page * tables under us; pte_offset_map_lock() will deal with that. */ - state.pmd = dst_pmd; err = mfill_atomic_pte(&state); cond_resched(); -- cgit v1.2.3 From b8c03b7f4558219ca09693b5fa4f5e068041d2c2 Mon Sep 17 00:00:00 2001 From: "Mike Rapoport (Microsoft)" Date: Thu, 2 Apr 2026 07:11:45 +0300 Subject: userfaultfd: introduce mfill_get_vma() and mfill_put_vma() Split the code that finds, locks and verifies VMA from mfill_atomic() into a helper function. This function will be used later during refactoring of mfill_atomic_pte_copy(). Add a counterpart mfill_put_vma() helper that unlocks the VMA and releases map_changing_lock. [avagin@google.com: fix lock leak in mfill_get_vma()] Link: https://lore.kernel.org/20260316173829.1126728-1-avagin@google.com Link: https://lore.kernel.org/20260402041156.1377214-5-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) Signed-off-by: Andrei Vagin Reviewed-by: Harry Yoo (Oracle) Cc: Andrea Arcangeli Cc: Axel Rasmussen Cc: Baolin Wang Cc: David Hildenbrand (Arm) Cc: Harry Yoo Cc: Hugh Dickins Cc: James Houghton Cc: Liam Howlett Cc: Lorenzo Stoakes (Oracle) Cc: Matthew Wilcox (Oracle) Cc: Michal Hocko Cc: Muchun Song Cc: Nikita Kalyazin Cc: Oscar Salvador Cc: Paolo Bonzini Cc: Peter Xu Cc: Sean Christopherson Cc: Shuah Khan Cc: Suren Baghdasaryan Cc: Vlastimil Babka Cc: David Carlier Signed-off-by: Andrew Morton --- mm/userfaultfd.c | 125 +++++++++++++++++++++++++++++++++---------------------- 1 file changed, 75 insertions(+), 50 deletions(-) diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c index 3f7ed93020bc..bcba57dc1aee 100644 --- a/mm/userfaultfd.c +++ b/mm/userfaultfd.c @@ -157,6 +157,75 @@ static void uffd_mfill_unlock(struct vm_area_struct *vma) } #endif +static void mfill_put_vma(struct mfill_state *state) +{ + if (!state->vma) + return; + + up_read(&state->ctx->map_changing_lock); + uffd_mfill_unlock(state->vma); + state->vma = NULL; +} + +static int mfill_get_vma(struct mfill_state *state) +{ + struct userfaultfd_ctx *ctx = state->ctx; + uffd_flags_t flags = state->flags; + struct vm_area_struct *dst_vma; + int err; + + /* + * Make sure the vma is not shared, that the dst range is + * both valid and fully within a single existing vma. 
+ */ + dst_vma = uffd_mfill_lock(ctx->mm, state->dst_start, state->len); + if (IS_ERR(dst_vma)) + return PTR_ERR(dst_vma); + + /* + * If memory mappings are changing because of non-cooperative + * operation (e.g. mremap) running in parallel, bail out and + * request the user to retry later + */ + down_read(&ctx->map_changing_lock); + state->vma = dst_vma; + err = -EAGAIN; + if (atomic_read(&ctx->mmap_changing)) + goto out_unlock; + + err = -EINVAL; + + /* + * shmem_zero_setup is invoked in mmap for MAP_ANONYMOUS|MAP_SHARED but + * it will overwrite vm_ops, so vma_is_anonymous must return false. + */ + if (WARN_ON_ONCE(vma_is_anonymous(dst_vma) && + dst_vma->vm_flags & VM_SHARED)) + goto out_unlock; + + /* + * validate 'mode' now that we know the dst_vma: don't allow + * a wrprotect copy if the userfaultfd didn't register as WP. + */ + if ((flags & MFILL_ATOMIC_WP) && !(dst_vma->vm_flags & VM_UFFD_WP)) + goto out_unlock; + + if (is_vm_hugetlb_page(dst_vma)) + return 0; + + if (!vma_is_anonymous(dst_vma) && !vma_is_shmem(dst_vma)) + goto out_unlock; + if (!vma_is_shmem(dst_vma) && + uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE)) + goto out_unlock; + + return 0; + +out_unlock: + mfill_put_vma(state); + return err; +} + static pmd_t *mm_alloc_pmd(struct mm_struct *mm, unsigned long address) { pgd_t *pgd; @@ -767,8 +836,6 @@ static __always_inline ssize_t mfill_atomic(struct userfaultfd_ctx *ctx, .src_addr = src_start, .dst_addr = dst_start, }; - struct mm_struct *dst_mm = ctx->mm; - struct vm_area_struct *dst_vma; long copied = 0; ssize_t err; @@ -783,56 +850,17 @@ static __always_inline ssize_t mfill_atomic(struct userfaultfd_ctx *ctx, VM_WARN_ON_ONCE(dst_start + len <= dst_start); retry: - /* - * Make sure the vma is not shared, that the dst range is - * both valid and fully within a single existing vma. - */ - dst_vma = uffd_mfill_lock(dst_mm, dst_start, len); - if (IS_ERR(dst_vma)) { - err = PTR_ERR(dst_vma); + err = mfill_get_vma(&state); + if (err) goto out; - } - state.vma = dst_vma; - - /* - * If memory mappings are changing because of non-cooperative - * operation (e.g. mremap) running in parallel, bail out and - * request the user to retry later - */ - down_read(&ctx->map_changing_lock); - err = -EAGAIN; - if (atomic_read(&ctx->mmap_changing)) - goto out_unlock; - - err = -EINVAL; - /* - * shmem_zero_setup is invoked in mmap for MAP_ANONYMOUS|MAP_SHARED but - * it will overwrite vm_ops, so vma_is_anonymous must return false. - */ - if (WARN_ON_ONCE(vma_is_anonymous(dst_vma) && - dst_vma->vm_flags & VM_SHARED)) - goto out_unlock; - - /* - * validate 'mode' now that we know the dst_vma: don't allow - * a wrprotect copy if the userfaultfd didn't register as WP. 
- */ - if ((flags & MFILL_ATOMIC_WP) && !(dst_vma->vm_flags & VM_UFFD_WP)) - goto out_unlock; /* * If this is a HUGETLB vma, pass off to appropriate routine */ - if (is_vm_hugetlb_page(dst_vma)) - return mfill_atomic_hugetlb(ctx, dst_vma, dst_start, + if (is_vm_hugetlb_page(state.vma)) + return mfill_atomic_hugetlb(ctx, state.vma, dst_start, src_start, len, flags); - if (!vma_is_anonymous(dst_vma) && !vma_is_shmem(dst_vma)) - goto out_unlock; - if (!vma_is_shmem(dst_vma) && - uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE)) - goto out_unlock; - while (state.src_addr < src_start + len) { VM_WARN_ON_ONCE(state.dst_addr >= dst_start + len); @@ -851,8 +879,7 @@ retry: if (unlikely(err == -ENOENT)) { void *kaddr; - up_read(&ctx->map_changing_lock); - uffd_mfill_unlock(state.vma); + mfill_put_vma(&state); VM_WARN_ON_ONCE(!state.folio); kaddr = kmap_local_folio(state.folio, 0); @@ -881,9 +908,7 @@ retry: break; } -out_unlock: - up_read(&ctx->map_changing_lock); - uffd_mfill_unlock(state.vma); + mfill_put_vma(&state); out: if (state.folio) folio_put(state.folio); -- cgit v1.2.3 From f5f035a724235f6dbef428ca54a3e9f25becc10e Mon Sep 17 00:00:00 2001 From: "Mike Rapoport (Microsoft)" Date: Thu, 2 Apr 2026 07:11:46 +0300 Subject: userfaultfd: retry copying with locks dropped in mfill_atomic_pte_copy() Implementation of UFFDIO_COPY for anonymous memory might fail to copy data from userspace buffer when the destination VMA is locked (either with mm_lock or with per-VMA lock). In that case, mfill_atomic() releases the locks, retries copying the data with locks dropped and then re-locks the destination VMA and re-establishes PMD. Since this retry-reget dance is only relevant for UFFDIO_COPY and it never happens for other UFFDIO_ operations, make it a part of mfill_atomic_pte_copy() that actually implements UFFDIO_COPY for anonymous memory. As a temporal safety measure to avoid breaking biscection mfill_atomic_pte_copy() makes sure to never return -ENOENT so that the loop in mfill_atomic() won't retry copiyng outside of mmap_lock. This is removed later when shmem implementation will be updated later and the loop in mfill_atomic() will be adjusted. 
[akpm@linux-foundation.org: update mfill_copy_folio_retry()] Link: https://lore.kernel.org/20260316173829.1126728-1-avagin@google.com Link: https://lore.kernel.org/20260306171815.3160826-6-rppt@kernel.org Link: https://lore.kernel.org/20260402041156.1377214-6-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) Reviewed-by: Harry Yoo (Oracle) Cc: Andrea Arcangeli Cc: Axel Rasmussen Cc: Baolin Wang Cc: David Hildenbrand (Arm) Cc: Hugh Dickins Cc: James Houghton Cc: Liam Howlett Cc: Lorenzo Stoakes (Oracle) Cc: Matthew Wilcox (Oracle) Cc: Michal Hocko Cc: Muchun Song Cc: Nikita Kalyazin Cc: Oscar Salvador Cc: Paolo Bonzini Cc: Peter Xu Cc: Sean Christopherson Cc: Shuah Khan Cc: Suren Baghdasaryan Cc: Vlastimil Babka Cc: David Carlier Cc: Harry Yoo Signed-off-by: Andrew Morton --- mm/userfaultfd.c | 75 ++++++++++++++++++++++++++++++++++++++------------------ 1 file changed, 51 insertions(+), 24 deletions(-) diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c index bcba57dc1aee..4857be5a7fa2 100644 --- a/mm/userfaultfd.c +++ b/mm/userfaultfd.c @@ -405,35 +405,63 @@ static int mfill_copy_folio_locked(struct folio *folio, unsigned long src_addr) return ret; } +static int mfill_copy_folio_retry(struct mfill_state *state, struct folio *folio) +{ + unsigned long src_addr = state->src_addr; + void *kaddr; + int err; + + /* retry copying with mm_lock dropped */ + mfill_put_vma(state); + + kaddr = kmap_local_folio(folio, 0); + err = copy_from_user(kaddr, (const void __user *) src_addr, PAGE_SIZE); + kunmap_local(kaddr); + if (unlikely(err)) + return -EFAULT; + + flush_dcache_folio(folio); + + /* reget VMA and PMD, they could change underneath us */ + err = mfill_get_vma(state); + if (err) + return err; + + err = mfill_establish_pmd(state); + if (err) + return err; + + return 0; +} + static int mfill_atomic_pte_copy(struct mfill_state *state) { - struct vm_area_struct *dst_vma = state->vma; unsigned long dst_addr = state->dst_addr; unsigned long src_addr = state->src_addr; uffd_flags_t flags = state->flags; - pmd_t *dst_pmd = state->pmd; struct folio *folio; int ret; - if (!state->folio) { - ret = -ENOMEM; - folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, dst_vma, - dst_addr); - if (!folio) - goto out; + folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, state->vma, dst_addr); + if (!folio) + return -ENOMEM; - ret = mfill_copy_folio_locked(folio, src_addr); + ret = -ENOMEM; + if (mem_cgroup_charge(folio, state->vma->vm_mm, GFP_KERNEL)) + goto out_release; - /* fallback to copy_from_user outside mmap_lock */ - if (unlikely(ret)) { - ret = -ENOENT; - state->folio = folio; - /* don't free the page */ - goto out; - } - } else { - folio = state->folio; - state->folio = NULL; + ret = mfill_copy_folio_locked(folio, src_addr); + if (unlikely(ret)) { + /* + * Fallback to copy_from_user outside mmap_lock. + * If retry is successful, mfill_copy_folio_locked() returns + * with locks retaken by mfill_get_vma(). + * If there was an error, we must mfill_put_vma() anyway and it + * will take care of unlocking if needed. 
+ */ + ret = mfill_copy_folio_retry(state, folio); + if (ret) + goto out_release; } /* @@ -443,17 +471,16 @@ static int mfill_atomic_pte_copy(struct mfill_state *state) */ __folio_mark_uptodate(folio); - ret = -ENOMEM; - if (mem_cgroup_charge(folio, dst_vma->vm_mm, GFP_KERNEL)) - goto out_release; - - ret = mfill_atomic_install_pte(dst_pmd, dst_vma, dst_addr, + ret = mfill_atomic_install_pte(state->pmd, state->vma, dst_addr, &folio->page, true, flags); if (ret) goto out_release; out: return ret; out_release: + /* Don't return -ENOENT so that our caller won't retry */ + if (ret == -ENOENT) + ret = -EFAULT; folio_put(folio); goto out; } -- cgit v1.2.3 From a5bb8669872b6b8463b8777a7a259a8305060016 Mon Sep 17 00:00:00 2001 From: "Mike Rapoport (Microsoft)" Date: Thu, 2 Apr 2026 07:11:47 +0300 Subject: userfaultfd: move vma_can_userfault out of line vma_can_userfault() has grown pretty big and it's not called on performance critical path. Move it out of line. No functional changes. Link: https://lore.kernel.org/20260402041156.1377214-7-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) Reviewed-by: David Hildenbrand (Red Hat) Reviewed-by: Liam R. Howlett Cc: Andrea Arcangeli Cc: Andrei Vagin Cc: Axel Rasmussen Cc: Baolin Wang Cc: Harry Yoo Cc: Harry Yoo (Oracle) Cc: Hugh Dickins Cc: James Houghton Cc: Lorenzo Stoakes (Oracle) Cc: Matthew Wilcox (Oracle) Cc: Michal Hocko Cc: Muchun Song Cc: Nikita Kalyazin Cc: Oscar Salvador Cc: Paolo Bonzini Cc: Peter Xu Cc: Sean Christopherson Cc: Shuah Khan Cc: Suren Baghdasaryan Cc: Vlastimil Babka Cc: David Carlier Signed-off-by: Andrew Morton --- include/linux/userfaultfd_k.h | 35 ++--------------------------------- mm/userfaultfd.c | 33 +++++++++++++++++++++++++++++++++ 2 files changed, 35 insertions(+), 33 deletions(-) diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h index d83e349900a3..ce0201c3dd82 100644 --- a/include/linux/userfaultfd_k.h +++ b/include/linux/userfaultfd_k.h @@ -211,39 +211,8 @@ static inline bool userfaultfd_armed(struct vm_area_struct *vma) return vma->vm_flags & __VM_UFFD_FLAGS; } -static inline bool vma_can_userfault(struct vm_area_struct *vma, - vm_flags_t vm_flags, - bool wp_async) -{ - vm_flags &= __VM_UFFD_FLAGS; - - if (vma->vm_flags & VM_DROPPABLE) - return false; - - if ((vm_flags & VM_UFFD_MINOR) && - (!is_vm_hugetlb_page(vma) && !vma_is_shmem(vma))) - return false; - - /* - * If wp async enabled, and WP is the only mode enabled, allow any - * memory type. - */ - if (wp_async && (vm_flags == VM_UFFD_WP)) - return true; - - /* - * If user requested uffd-wp but not enabled pte markers for - * uffd-wp, then shmem & hugetlbfs are not supported but only - * anonymous. - */ - if (!uffd_supports_wp_marker() && (vm_flags & VM_UFFD_WP) && - !vma_is_anonymous(vma)) - return false; - - /* By default, allow any of anon|shmem|hugetlb */ - return vma_is_anonymous(vma) || is_vm_hugetlb_page(vma) || - vma_is_shmem(vma); -} +bool vma_can_userfault(struct vm_area_struct *vma, vm_flags_t vm_flags, + bool wp_async); static inline bool vma_has_uffd_without_event_remap(struct vm_area_struct *vma) { diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c index 4857be5a7fa2..ebdc6e24a2c7 100644 --- a/mm/userfaultfd.c +++ b/mm/userfaultfd.c @@ -2018,6 +2018,39 @@ out: return moved ? 
moved : err; } +bool vma_can_userfault(struct vm_area_struct *vma, vm_flags_t vm_flags, + bool wp_async) +{ + vm_flags &= __VM_UFFD_FLAGS; + + if (vma->vm_flags & VM_DROPPABLE) + return false; + + if ((vm_flags & VM_UFFD_MINOR) && + (!is_vm_hugetlb_page(vma) && !vma_is_shmem(vma))) + return false; + + /* + * If wp async enabled, and WP is the only mode enabled, allow any + * memory type. + */ + if (wp_async && (vm_flags == VM_UFFD_WP)) + return true; + + /* + * If user requested uffd-wp but not enabled pte markers for + * uffd-wp, then shmem & hugetlbfs are not supported but only + * anonymous. + */ + if (!uffd_supports_wp_marker() && (vm_flags & VM_UFFD_WP) && + !vma_is_anonymous(vma)) + return false; + + /* By default, allow any of anon|shmem|hugetlb */ + return vma_is_anonymous(vma) || is_vm_hugetlb_page(vma) || + vma_is_shmem(vma); +} + static void userfaultfd_set_vm_flags(struct vm_area_struct *vma, vm_flags_t vm_flags) { -- cgit v1.2.3 From 0f48947c4232c934885711dde0b49066f9d8ee87 Mon Sep 17 00:00:00 2001 From: "Mike Rapoport (Microsoft)" Date: Thu, 2 Apr 2026 07:11:48 +0300 Subject: userfaultfd: introduce vm_uffd_ops Current userfaultfd implementation works only with memory managed by core MM: anonymous, shmem and hugetlb. First, there is no fundamental reason to limit userfaultfd support only to the core memory types and userfaults can be handled similarly to regular page faults provided a VMA owner implements appropriate callbacks. Second, historically various code paths were conditioned on vma_is_anonymous(), vma_is_shmem() and is_vm_hugetlb_page() and some of these conditions can be expressed as operations implemented by a particular memory type. Introduce vm_uffd_ops extension to vm_operations_struct that will delegate memory type specific operations to a VMA owner. Operations for anonymous memory are handled internally in userfaultfd using anon_uffd_ops that implicitly assigned to anonymous VMAs. Start with a single operation, ->can_userfault() that will verify that a VMA meets requirements for userfaultfd support at registration time. Implement that method for anonymous, shmem and hugetlb and move relevant parts of vma_can_userfault() into the new callbacks. 
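As a rough illustration of how a VMA owner is expected to wire this up (the foo_* names below are hypothetical; the real shmem and hugetlb hooks are in the diffs that follow):

	#ifdef CONFIG_USERFAULTFD
	static bool foo_can_userfault(struct vm_area_struct *vma, vm_flags_t vm_flags)
	{
		/* this (hypothetical) backend only supports missing-mode faults */
		return !(vm_flags & VM_UFFD_MINOR);
	}

	static const struct vm_uffd_ops foo_uffd_ops = {
		.can_userfault	= foo_can_userfault,
	};
	#endif

	static const struct vm_operations_struct foo_vm_ops = {
		/* ... the owner's usual hooks ... */
	#ifdef CONFIG_USERFAULTFD
		.uffd_ops	= &foo_uffd_ops,
	#endif
	};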
[rppt@kernel.org: relocate VM_DROPPABLE test, per Tal] Link: https://lore.kernel.org/adffgfM5ANxtPIEF@kernel.org Link: https://lore.kernel.org/20260402041156.1377214-8-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) Cc: Andrea Arcangeli Cc: Andrei Vagin Cc: Axel Rasmussen Cc: Baolin Wang Cc: David Hildenbrand (Arm) Cc: Harry Yoo Cc: Harry Yoo (Oracle) Cc: Hugh Dickins Cc: James Houghton Cc: Liam Howlett Cc: Lorenzo Stoakes (Oracle) Cc: Matthew Wilcox (Oracle) Cc: Michal Hocko Cc: Muchun Song Cc: Nikita Kalyazin Cc: Oscar Salvador Cc: Paolo Bonzini Cc: Peter Xu Cc: Sean Christopherson Cc: Shuah Khan Cc: Suren Baghdasaryan Cc: Vlastimil Babka Cc: David Carlier Cc: Tal Zussman Signed-off-by: Andrew Morton --- include/linux/mm.h | 5 +++++ include/linux/userfaultfd_k.h | 6 ++++++ mm/hugetlb.c | 15 +++++++++++++++ mm/shmem.c | 15 +++++++++++++++ mm/userfaultfd.c | 38 ++++++++++++++++++++++++++++---------- 5 files changed, 69 insertions(+), 10 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index 8260e28205e9..633bbf9a184a 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -758,6 +758,8 @@ struct vm_fault { */ }; +struct vm_uffd_ops; + /* * These are the virtual MM functions - opening of an area, closing and * unmapping it (needed to keep files on disk up-to-date etc), pointer @@ -865,6 +867,9 @@ struct vm_operations_struct { struct page *(*find_normal_page)(struct vm_area_struct *vma, unsigned long addr); #endif /* CONFIG_FIND_NORMAL_PAGE */ +#ifdef CONFIG_USERFAULTFD + const struct vm_uffd_ops *uffd_ops; +#endif }; #ifdef CONFIG_NUMA_BALANCING diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h index ce0201c3dd82..6d445dbfe8ff 100644 --- a/include/linux/userfaultfd_k.h +++ b/include/linux/userfaultfd_k.h @@ -83,6 +83,12 @@ struct userfaultfd_ctx { extern vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason); +/* VMA userfaultfd operations */ +struct vm_uffd_ops { + /* Checks if a VMA can support userfaultfd */ + bool (*can_userfault)(struct vm_area_struct *vma, vm_flags_t vm_flags); +}; + /* A combined operation mode + behavior flags. */ typedef unsigned int __bitwise uffd_flags_t; diff --git a/mm/hugetlb.c b/mm/hugetlb.c index a786034ac95c..88009cd2a846 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -4792,6 +4792,18 @@ static vm_fault_t hugetlb_vm_op_fault(struct vm_fault *vmf) return 0; } +#ifdef CONFIG_USERFAULTFD +static bool hugetlb_can_userfault(struct vm_area_struct *vma, + vm_flags_t vm_flags) +{ + return true; +} + +static const struct vm_uffd_ops hugetlb_uffd_ops = { + .can_userfault = hugetlb_can_userfault, +}; +#endif + /* * When a new function is introduced to vm_operations_struct and added * to hugetlb_vm_ops, please consider adding the function to shm_vm_ops. 
@@ -4805,6 +4817,9 @@ const struct vm_operations_struct hugetlb_vm_ops = { .close = hugetlb_vm_op_close, .may_split = hugetlb_vm_op_split, .pagesize = hugetlb_vm_op_pagesize, +#ifdef CONFIG_USERFAULTFD + .uffd_ops = &hugetlb_uffd_ops, +#endif }; static pte_t make_huge_pte(struct vm_area_struct *vma, struct folio *folio, diff --git a/mm/shmem.c b/mm/shmem.c index 6fa1e8340c93..389b2d76396e 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -3288,6 +3288,15 @@ out_unacct_blocks: shmem_inode_unacct_blocks(inode, 1); return ret; } + +static bool shmem_can_userfault(struct vm_area_struct *vma, vm_flags_t vm_flags) +{ + return true; +} + +static const struct vm_uffd_ops shmem_uffd_ops = { + .can_userfault = shmem_can_userfault, +}; #endif /* CONFIG_USERFAULTFD */ #ifdef CONFIG_TMPFS @@ -5307,6 +5316,9 @@ static const struct vm_operations_struct shmem_vm_ops = { .set_policy = shmem_set_policy, .get_policy = shmem_get_policy, #endif +#ifdef CONFIG_USERFAULTFD + .uffd_ops = &shmem_uffd_ops, +#endif }; static const struct vm_operations_struct shmem_anon_vm_ops = { @@ -5316,6 +5328,9 @@ static const struct vm_operations_struct shmem_anon_vm_ops = { .set_policy = shmem_set_policy, .get_policy = shmem_get_policy, #endif +#ifdef CONFIG_USERFAULTFD + .uffd_ops = &shmem_uffd_ops, +#endif }; int shmem_init_fs_context(struct fs_context *fc) diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c index ebdc6e24a2c7..3a824e034a09 100644 --- a/mm/userfaultfd.c +++ b/mm/userfaultfd.c @@ -34,6 +34,25 @@ struct mfill_state { pmd_t *pmd; }; +static bool anon_can_userfault(struct vm_area_struct *vma, vm_flags_t vm_flags) +{ + /* anonymous memory does not support MINOR mode */ + if (vm_flags & VM_UFFD_MINOR) + return false; + return true; +} + +static const struct vm_uffd_ops anon_uffd_ops = { + .can_userfault = anon_can_userfault, +}; + +static const struct vm_uffd_ops *vma_uffd_ops(struct vm_area_struct *vma) +{ + if (vma_is_anonymous(vma)) + return &anon_uffd_ops; + return vma->vm_ops ? vma->vm_ops->uffd_ops : NULL; +} + static __always_inline bool validate_dst_vma(struct vm_area_struct *dst_vma, unsigned long dst_end) { @@ -2021,34 +2040,33 @@ out: bool vma_can_userfault(struct vm_area_struct *vma, vm_flags_t vm_flags, bool wp_async) { - vm_flags &= __VM_UFFD_FLAGS; + const struct vm_uffd_ops *ops = vma_uffd_ops(vma); if (vma->vm_flags & VM_DROPPABLE) return false; - if ((vm_flags & VM_UFFD_MINOR) && - (!is_vm_hugetlb_page(vma) && !vma_is_shmem(vma))) - return false; + vm_flags &= __VM_UFFD_FLAGS; /* - * If wp async enabled, and WP is the only mode enabled, allow any + * If WP is the only mode enabled and context is wp async, allow any * memory type. */ if (wp_async && (vm_flags == VM_UFFD_WP)) return true; + /* For any other mode reject VMAs that don't implement vm_uffd_ops */ + if (!ops) + return false; + /* * If user requested uffd-wp but not enabled pte markers for - * uffd-wp, then shmem & hugetlbfs are not supported but only - * anonymous. 
+ * uffd-wp, then only anonymous memory is supported */ if (!uffd_supports_wp_marker() && (vm_flags & VM_UFFD_WP) && !vma_is_anonymous(vma)) return false; - /* By default, allow any of anon|shmem|hugetlb */ - return vma_is_anonymous(vma) || is_vm_hugetlb_page(vma) || - vma_is_shmem(vma); + return ops->can_userfault(vma, vm_flags); } static void userfaultfd_set_vm_flags(struct vm_area_struct *vma, -- cgit v1.2.3 From dfc4d771820a171bd701d06252fcf920d0ede25c Mon Sep 17 00:00:00 2001 From: "Mike Rapoport (Microsoft)" Date: Thu, 2 Apr 2026 07:11:49 +0300 Subject: shmem, userfaultfd: use a VMA callback to handle UFFDIO_CONTINUE When userspace resolves a page fault in a shmem VMA with UFFDIO_CONTINUE it needs to get a folio that already exists in the pagecache backing that VMA. Instead of using shmem_get_folio() for that, add a get_folio_noalloc() method to 'struct vm_uffd_ops' that will return a folio if it exists in the VMA's pagecache at given pgoff. Implement get_folio_noalloc() method for shmem and slightly refactor userfaultfd's mfill_get_vma() and mfill_atomic_pte_continue() to support this new API. Link: https://lore.kernel.org/20260402041156.1377214-9-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) Reviewed-by: James Houghton Cc: Andrea Arcangeli Cc: Andrei Vagin Cc: Axel Rasmussen Cc: Baolin Wang Cc: David Hildenbrand (Arm) Cc: Harry Yoo Cc: Harry Yoo (Oracle) Cc: Hugh Dickins Cc: Liam Howlett Cc: Lorenzo Stoakes (Oracle) Cc: Matthew Wilcox (Oracle) Cc: Michal Hocko Cc: Muchun Song Cc: Nikita Kalyazin Cc: Oscar Salvador Cc: Paolo Bonzini Cc: Peter Xu Cc: Sean Christopherson Cc: Shuah Khan Cc: Suren Baghdasaryan Cc: Vlastimil Babka Cc: David Carlier Signed-off-by: Andrew Morton --- include/linux/userfaultfd_k.h | 7 +++++++ mm/shmem.c | 15 ++++++++++++++- mm/userfaultfd.c | 34 ++++++++++++++++++---------------- 3 files changed, 39 insertions(+), 17 deletions(-) diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h index 6d445dbfe8ff..4bda632dae88 100644 --- a/include/linux/userfaultfd_k.h +++ b/include/linux/userfaultfd_k.h @@ -87,6 +87,13 @@ extern vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason); struct vm_uffd_ops { /* Checks if a VMA can support userfaultfd */ bool (*can_userfault)(struct vm_area_struct *vma, vm_flags_t vm_flags); + /* + * Called to resolve UFFDIO_CONTINUE request. + * Should return the folio found at pgoff in the VMA's pagecache if it + * exists or ERR_PTR otherwise. + * The returned folio is locked and with reference held. + */ + struct folio *(*get_folio_noalloc)(struct inode *inode, pgoff_t pgoff); }; /* A combined operation mode + behavior flags. 
*/ diff --git a/mm/shmem.c b/mm/shmem.c index 389b2d76396e..ed07d0c03312 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -3289,13 +3289,26 @@ out_unacct_blocks: return ret; } +static struct folio *shmem_get_folio_noalloc(struct inode *inode, pgoff_t pgoff) +{ + struct folio *folio; + int err; + + err = shmem_get_folio(inode, pgoff, 0, &folio, SGP_NOALLOC); + if (err) + return ERR_PTR(err); + + return folio; +} + static bool shmem_can_userfault(struct vm_area_struct *vma, vm_flags_t vm_flags) { return true; } static const struct vm_uffd_ops shmem_uffd_ops = { - .can_userfault = shmem_can_userfault, + .can_userfault = shmem_can_userfault, + .get_folio_noalloc = shmem_get_folio_noalloc, }; #endif /* CONFIG_USERFAULTFD */ diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c index 3a824e034a09..5b204c3ec986 100644 --- a/mm/userfaultfd.c +++ b/mm/userfaultfd.c @@ -191,6 +191,7 @@ static int mfill_get_vma(struct mfill_state *state) struct userfaultfd_ctx *ctx = state->ctx; uffd_flags_t flags = state->flags; struct vm_area_struct *dst_vma; + const struct vm_uffd_ops *ops; int err; /* @@ -232,10 +233,12 @@ static int mfill_get_vma(struct mfill_state *state) if (is_vm_hugetlb_page(dst_vma)) return 0; - if (!vma_is_anonymous(dst_vma) && !vma_is_shmem(dst_vma)) + ops = vma_uffd_ops(dst_vma); + if (!ops) goto out_unlock; - if (!vma_is_shmem(dst_vma) && - uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE)) + + if (uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE) && + !ops->get_folio_noalloc) goto out_unlock; return 0; @@ -575,6 +578,7 @@ out: static int mfill_atomic_pte_continue(struct mfill_state *state) { struct vm_area_struct *dst_vma = state->vma; + const struct vm_uffd_ops *ops = vma_uffd_ops(dst_vma); unsigned long dst_addr = state->dst_addr; pgoff_t pgoff = linear_page_index(dst_vma, dst_addr); struct inode *inode = file_inode(dst_vma->vm_file); @@ -584,17 +588,16 @@ static int mfill_atomic_pte_continue(struct mfill_state *state) struct page *page; int ret; - ret = shmem_get_folio(inode, pgoff, 0, &folio, SGP_NOALLOC); - /* Our caller expects us to return -EFAULT if we failed to find folio */ - if (ret == -ENOENT) - ret = -EFAULT; - if (ret) - goto out; - if (!folio) { - ret = -EFAULT; - goto out; + if (!ops) { + VM_WARN_ONCE(1, "UFFDIO_CONTINUE for unsupported VMA"); + return -EOPNOTSUPP; } + folio = ops->get_folio_noalloc(inode, pgoff); + /* Our caller expects us to return -EFAULT if we failed to find folio */ + if (IS_ERR_OR_NULL(folio)) + return -EFAULT; + page = folio_file_page(folio, pgoff); if (PageHWPoison(page)) { ret = -EIO; @@ -607,13 +610,12 @@ static int mfill_atomic_pte_continue(struct mfill_state *state) goto out_release; folio_unlock(folio); - ret = 0; -out: - return ret; + return 0; + out_release: folio_unlock(folio); folio_put(folio); - goto out; + return ret; } /* Handles UFFDIO_POISON for all non-hugetlb VMAs. */ -- cgit v1.2.3 From ad9ac3081332e955bc4b513018a1e0e86683bfb5 Mon Sep 17 00:00:00 2001 From: "Mike Rapoport (Microsoft)" Date: Thu, 2 Apr 2026 07:11:50 +0300 Subject: userfaultfd: introduce vm_uffd_ops->alloc_folio() and use it to refactor mfill_atomic_pte_zeroed_folio() and mfill_atomic_pte_copy(). mfill_atomic_pte_zeroed_folio() and mfill_atomic_pte_copy() perform almost identical actions: * allocate a folio * update folio contents (either copy from userspace of fill with zeros) * update page tables with the new folio Split a __mfill_atomic_pte() helper that handles both cases and uses newly introduced vm_uffd_ops->alloc_folio() to allocate the folio. 
Pass the ops structure from the callers to __mfill_atomic_pte() to later allow using anon_uffd_ops for MAP_PRIVATE mappings of file-backed VMAs. Note, that the new ops method is called alloc_folio() rather than folio_alloc() to avoid clash with alloc_tag macro folio_alloc(). Link: https://lore.kernel.org/20260402041156.1377214-10-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) Reviewed-by: James Houghton Cc: Andrea Arcangeli Cc: Andrei Vagin Cc: Axel Rasmussen Cc: Baolin Wang Cc: David Hildenbrand (Arm) Cc: Harry Yoo Cc: Harry Yoo (Oracle) Cc: Hugh Dickins Cc: Liam Howlett Cc: Lorenzo Stoakes (Oracle) Cc: Matthew Wilcox (Oracle) Cc: Michal Hocko Cc: Muchun Song Cc: Nikita Kalyazin Cc: Oscar Salvador Cc: Paolo Bonzini Cc: Peter Xu Cc: Sean Christopherson Cc: Shuah Khan Cc: Suren Baghdasaryan Cc: Vlastimil Babka Cc: David Carlier Signed-off-by: Andrew Morton --- include/linux/userfaultfd_k.h | 6 +++ mm/userfaultfd.c | 92 ++++++++++++++++++++++--------------------- 2 files changed, 54 insertions(+), 44 deletions(-) diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h index 4bda632dae88..0f508c752741 100644 --- a/include/linux/userfaultfd_k.h +++ b/include/linux/userfaultfd_k.h @@ -94,6 +94,12 @@ struct vm_uffd_ops { * The returned folio is locked and with reference held. */ struct folio *(*get_folio_noalloc)(struct inode *inode, pgoff_t pgoff); + /* + * Called during resolution of UFFDIO_COPY request. + * Should allocate and return a folio or NULL if allocation fails. + */ + struct folio *(*alloc_folio)(struct vm_area_struct *vma, + unsigned long addr); }; /* A combined operation mode + behavior flags. */ diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c index 5b204c3ec986..dd191703b320 100644 --- a/mm/userfaultfd.c +++ b/mm/userfaultfd.c @@ -42,8 +42,26 @@ static bool anon_can_userfault(struct vm_area_struct *vma, vm_flags_t vm_flags) return true; } +static struct folio *anon_alloc_folio(struct vm_area_struct *vma, + unsigned long addr) +{ + struct folio *folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vma, + addr); + + if (!folio) + return NULL; + + if (mem_cgroup_charge(folio, vma->vm_mm, GFP_KERNEL)) { + folio_put(folio); + return NULL; + } + + return folio; +} + static const struct vm_uffd_ops anon_uffd_ops = { .can_userfault = anon_can_userfault, + .alloc_folio = anon_alloc_folio, }; static const struct vm_uffd_ops *vma_uffd_ops(struct vm_area_struct *vma) @@ -456,7 +474,8 @@ static int mfill_copy_folio_retry(struct mfill_state *state, struct folio *folio return 0; } -static int mfill_atomic_pte_copy(struct mfill_state *state) +static int __mfill_atomic_pte(struct mfill_state *state, + const struct vm_uffd_ops *ops) { unsigned long dst_addr = state->dst_addr; unsigned long src_addr = state->src_addr; @@ -464,16 +483,12 @@ static int mfill_atomic_pte_copy(struct mfill_state *state) struct folio *folio; int ret; - folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, state->vma, dst_addr); + folio = ops->alloc_folio(state->vma, state->dst_addr); if (!folio) return -ENOMEM; - ret = -ENOMEM; - if (mem_cgroup_charge(folio, state->vma->vm_mm, GFP_KERNEL)) - goto out_release; - - ret = mfill_copy_folio_locked(folio, src_addr); - if (unlikely(ret)) { + if (uffd_flags_mode_is(flags, MFILL_ATOMIC_COPY)) { + ret = mfill_copy_folio_locked(folio, src_addr); /* * Fallback to copy_from_user outside mmap_lock. 
* If retry is successful, mfill_copy_folio_locked() returns @@ -481,9 +496,15 @@ static int mfill_atomic_pte_copy(struct mfill_state *state) * If there was an error, we must mfill_put_vma() anyway and it * will take care of unlocking if needed. */ - ret = mfill_copy_folio_retry(state, folio); - if (ret) - goto out_release; + if (unlikely(ret)) { + ret = mfill_copy_folio_retry(state, folio); + if (ret) + goto err_folio_put; + } + } else if (uffd_flags_mode_is(flags, MFILL_ATOMIC_ZEROPAGE)) { + clear_user_highpage(&folio->page, state->dst_addr); + } else { + VM_WARN_ONCE(1, "Unknown UFFDIO operation, flags: %x", flags); } /* @@ -496,47 +517,30 @@ static int mfill_atomic_pte_copy(struct mfill_state *state) ret = mfill_atomic_install_pte(state->pmd, state->vma, dst_addr, &folio->page, true, flags); if (ret) - goto out_release; -out: - return ret; -out_release: + goto err_folio_put; + + return 0; + +err_folio_put: + folio_put(folio); /* Don't return -ENOENT so that our caller won't retry */ if (ret == -ENOENT) ret = -EFAULT; - folio_put(folio); - goto out; + return ret; } -static int mfill_atomic_pte_zeroed_folio(pmd_t *dst_pmd, - struct vm_area_struct *dst_vma, - unsigned long dst_addr) +static int mfill_atomic_pte_copy(struct mfill_state *state) { - struct folio *folio; - int ret = -ENOMEM; - - folio = vma_alloc_zeroed_movable_folio(dst_vma, dst_addr); - if (!folio) - return ret; - - if (mem_cgroup_charge(folio, dst_vma->vm_mm, GFP_KERNEL)) - goto out_put; + const struct vm_uffd_ops *ops = vma_uffd_ops(state->vma); - /* - * The memory barrier inside __folio_mark_uptodate makes sure that - * zeroing out the folio become visible before mapping the page - * using set_pte_at(). See do_anonymous_page(). - */ - __folio_mark_uptodate(folio); + return __mfill_atomic_pte(state, ops); +} - ret = mfill_atomic_install_pte(dst_pmd, dst_vma, dst_addr, - &folio->page, true, 0); - if (ret) - goto out_put; +static int mfill_atomic_pte_zeroed_folio(struct mfill_state *state) +{ + const struct vm_uffd_ops *ops = vma_uffd_ops(state->vma); - return 0; -out_put: - folio_put(folio); - return ret; + return __mfill_atomic_pte(state, ops); } static int mfill_atomic_pte_zeropage(struct mfill_state *state) @@ -549,7 +553,7 @@ static int mfill_atomic_pte_zeropage(struct mfill_state *state) int ret; if (mm_forbids_zeropage(dst_vma->vm_mm)) - return mfill_atomic_pte_zeroed_folio(dst_pmd, dst_vma, dst_addr); + return mfill_atomic_pte_zeroed_folio(state); _dst_pte = pte_mkspecial(pfn_pte(zero_pfn(dst_addr), dst_vma->vm_page_prot)); -- cgit v1.2.3 From f74991b4e3836dd38f3adb41b146994b283942a1 Mon Sep 17 00:00:00 2001 From: "Mike Rapoport (Microsoft)" Date: Thu, 2 Apr 2026 07:11:51 +0300 Subject: shmem, userfaultfd: implement shmem uffd operations using vm_uffd_ops Add filemap_add() and filemap_remove() methods to vm_uffd_ops and use them in __mfill_atomic_pte() to add shmem folios to page cache and remove them in case of error. Implement these methods in shmem along with vm_uffd_ops->alloc_folio() and drop shmem_mfill_atomic_pte(). Since userfaultfd now does not reference any functions from shmem, drop include if linux/shmem_fs.h from mm/userfaultfd.c mfill_atomic_install_pte() is not used anywhere outside of mm/userfaultfd, make it static. 
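In rough outline (simplified, with error paths trimmed and names taken from the diffs below), the common UFFDIO_COPY path now drives any backend through the same sequence:

	folio = ops->alloc_folio(vma, dst_addr);	/* anon or shmem allocation */
	if (!folio)
		return -ENOMEM;
	/* fill the folio: copy_from_user() for COPY, clear_user_highpage() for ZEROPAGE */
	__folio_mark_uptodate(folio);
	if (ops->filemap_add) {				/* shmem: insert into the page cache */
		err = ops->filemap_add(folio, vma, dst_addr);
		if (err) {
			folio_put(folio);
			return err;
		}
	}
	err = mfill_atomic_install_pte(pmd, vma, dst_addr, &folio->page, flags);
	if (err) {
		if (ops->filemap_remove)		/* undo the page cache insertion */
			ops->filemap_remove(folio, vma);
		folio_put(folio);
	}
	return err;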
Link: https://lore.kernel.org/20260402041156.1377214-11-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) Reviewed-by: James Houghton Cc: Andrea Arcangeli Cc: Andrei Vagin Cc: Axel Rasmussen Cc: Baolin Wang Cc: David Hildenbrand (Arm) Cc: Harry Yoo Cc: Harry Yoo (Oracle) Cc: Hugh Dickins Cc: Liam Howlett Cc: Lorenzo Stoakes (Oracle) Cc: Matthew Wilcox (Oracle) Cc: Michal Hocko Cc: Muchun Song Cc: Nikita Kalyazin Cc: Oscar Salvador Cc: Paolo Bonzini Cc: Peter Xu Cc: Sean Christopherson Cc: Shuah Khan Cc: Suren Baghdasaryan Cc: Vlastimil Babka Cc: David Carlier Signed-off-by: Andrew Morton --- include/linux/shmem_fs.h | 14 ---- include/linux/userfaultfd_k.h | 19 ++++-- mm/shmem.c | 148 +++++++++++++++--------------------------- mm/userfaultfd.c | 80 +++++++++++------------ 4 files changed, 106 insertions(+), 155 deletions(-) diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h index a8273b32e041..1a345142af7d 100644 --- a/include/linux/shmem_fs.h +++ b/include/linux/shmem_fs.h @@ -221,20 +221,6 @@ static inline pgoff_t shmem_fallocend(struct inode *inode, pgoff_t eof) extern bool shmem_charge(struct inode *inode, long pages); -#ifdef CONFIG_USERFAULTFD -#ifdef CONFIG_SHMEM -extern int shmem_mfill_atomic_pte(pmd_t *dst_pmd, - struct vm_area_struct *dst_vma, - unsigned long dst_addr, - unsigned long src_addr, - uffd_flags_t flags, - struct folio **foliop); -#else /* !CONFIG_SHMEM */ -#define shmem_mfill_atomic_pte(dst_pmd, dst_vma, dst_addr, \ - src_addr, flags, foliop) ({ BUG(); 0; }) -#endif /* CONFIG_SHMEM */ -#endif /* CONFIG_USERFAULTFD */ - /* * Used space is stored as unsigned 64-bit value in bytes but * quota core supports only signed 64-bit values so use that diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h index 0f508c752741..d2920f98ab86 100644 --- a/include/linux/userfaultfd_k.h +++ b/include/linux/userfaultfd_k.h @@ -100,6 +100,20 @@ struct vm_uffd_ops { */ struct folio *(*alloc_folio)(struct vm_area_struct *vma, unsigned long addr); + /* + * Called during resolution of UFFDIO_COPY request. + * Should only be called with a folio returned by alloc_folio() above. + * The folio will be set to locked. + * Returns 0 on success, error code on failure. + */ + int (*filemap_add)(struct folio *folio, struct vm_area_struct *vma, + unsigned long addr); + /* + * Called during resolution of UFFDIO_COPY request on the error + * handling path. + * Should revert the operation of ->filemap_add(). + */ + void (*filemap_remove)(struct folio *folio, struct vm_area_struct *vma); }; /* A combined operation mode + behavior flags. */ @@ -133,11 +147,6 @@ static inline uffd_flags_t uffd_flags_set_mode(uffd_flags_t flags, enum mfill_at /* Flags controlling behavior. These behavior changes are mode-independent. 
*/ #define MFILL_ATOMIC_WP MFILL_ATOMIC_FLAG(0) -extern int mfill_atomic_install_pte(pmd_t *dst_pmd, - struct vm_area_struct *dst_vma, - unsigned long dst_addr, struct page *page, - bool newly_allocated, uffd_flags_t flags); - extern ssize_t mfill_atomic_copy(struct userfaultfd_ctx *ctx, unsigned long dst_start, unsigned long src_start, unsigned long len, uffd_flags_t flags); diff --git a/mm/shmem.c b/mm/shmem.c index ed07d0c03312..5aa43657886c 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -3175,118 +3175,73 @@ static struct inode *shmem_get_inode(struct mnt_idmap *idmap, #endif /* CONFIG_TMPFS_QUOTA */ #ifdef CONFIG_USERFAULTFD -int shmem_mfill_atomic_pte(pmd_t *dst_pmd, - struct vm_area_struct *dst_vma, - unsigned long dst_addr, - unsigned long src_addr, - uffd_flags_t flags, - struct folio **foliop) -{ - struct inode *inode = file_inode(dst_vma->vm_file); - struct shmem_inode_info *info = SHMEM_I(inode); +static struct folio *shmem_mfill_folio_alloc(struct vm_area_struct *vma, + unsigned long addr) +{ + struct inode *inode = file_inode(vma->vm_file); struct address_space *mapping = inode->i_mapping; + struct shmem_inode_info *info = SHMEM_I(inode); + pgoff_t pgoff = linear_page_index(vma, addr); gfp_t gfp = mapping_gfp_mask(mapping); - pgoff_t pgoff = linear_page_index(dst_vma, dst_addr); - void *page_kaddr; struct folio *folio; - int ret; - pgoff_t max_off; - - if (shmem_inode_acct_blocks(inode, 1)) { - /* - * We may have got a page, returned -ENOENT triggering a retry, - * and now we find ourselves with -ENOMEM. Release the page, to - * avoid a BUG_ON in our caller. - */ - if (unlikely(*foliop)) { - folio_put(*foliop); - *foliop = NULL; - } - return -ENOMEM; - } - if (!*foliop) { - ret = -ENOMEM; - folio = shmem_alloc_folio(gfp, 0, info, pgoff); - if (!folio) - goto out_unacct_blocks; + if (unlikely(pgoff >= DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE))) + return NULL; - if (uffd_flags_mode_is(flags, MFILL_ATOMIC_COPY)) { - page_kaddr = kmap_local_folio(folio, 0); - /* - * The read mmap_lock is held here. Despite the - * mmap_lock being read recursive a deadlock is still - * possible if a writer has taken a lock. For example: - * - * process A thread 1 takes read lock on own mmap_lock - * process A thread 2 calls mmap, blocks taking write lock - * process B thread 1 takes page fault, read lock on own mmap lock - * process B thread 2 calls mmap, blocks taking write lock - * process A thread 1 blocks taking read lock on process B - * process B thread 1 blocks taking read lock on process A - * - * Disable page faults to prevent potential deadlock - * and retry the copy outside the mmap_lock. 
- */ - pagefault_disable(); - ret = copy_from_user(page_kaddr, - (const void __user *)src_addr, - PAGE_SIZE); - pagefault_enable(); - kunmap_local(page_kaddr); - - /* fallback to copy_from_user outside mmap_lock */ - if (unlikely(ret)) { - *foliop = folio; - ret = -ENOENT; - /* don't free the page */ - goto out_unacct_blocks; - } + folio = shmem_alloc_folio(gfp, 0, info, pgoff); + if (!folio) + return NULL; - flush_dcache_folio(folio); - } else { /* ZEROPAGE */ - clear_user_highpage(&folio->page, dst_addr); - } - } else { - folio = *foliop; - VM_BUG_ON_FOLIO(folio_test_large(folio), folio); - *foliop = NULL; + if (mem_cgroup_charge(folio, vma->vm_mm, GFP_KERNEL)) { + folio_put(folio); + return NULL; } - VM_BUG_ON(folio_test_locked(folio)); - VM_BUG_ON(folio_test_swapbacked(folio)); + return folio; +} + +static int shmem_mfill_filemap_add(struct folio *folio, + struct vm_area_struct *vma, + unsigned long addr) +{ + struct inode *inode = file_inode(vma->vm_file); + struct address_space *mapping = inode->i_mapping; + pgoff_t pgoff = linear_page_index(vma, addr); + gfp_t gfp = mapping_gfp_mask(mapping); + int err; + __folio_set_locked(folio); __folio_set_swapbacked(folio); - __folio_mark_uptodate(folio); - - ret = -EFAULT; - max_off = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE); - if (unlikely(pgoff >= max_off)) - goto out_release; - ret = mem_cgroup_charge(folio, dst_vma->vm_mm, gfp); - if (ret) - goto out_release; - ret = shmem_add_to_page_cache(folio, mapping, pgoff, NULL, gfp); - if (ret) - goto out_release; + err = shmem_add_to_page_cache(folio, mapping, pgoff, NULL, gfp); + if (err) + goto err_unlock; - ret = mfill_atomic_install_pte(dst_pmd, dst_vma, dst_addr, - &folio->page, true, flags); - if (ret) - goto out_delete_from_cache; + if (shmem_inode_acct_blocks(inode, 1)) { + err = -ENOMEM; + goto err_delete_from_cache; + } + folio_add_lru(folio); shmem_recalc_inode(inode, 1, 0); - folio_unlock(folio); + return 0; -out_delete_from_cache: + +err_delete_from_cache: filemap_remove_folio(folio); -out_release: +err_unlock: + folio_unlock(folio); + return err; +} + +static void shmem_mfill_filemap_remove(struct folio *folio, + struct vm_area_struct *vma) +{ + struct inode *inode = file_inode(vma->vm_file); + + filemap_remove_folio(folio); + shmem_recalc_inode(inode, 0, 0); folio_unlock(folio); - folio_put(folio); -out_unacct_blocks: - shmem_inode_unacct_blocks(inode, 1); - return ret; } static struct folio *shmem_get_folio_noalloc(struct inode *inode, pgoff_t pgoff) @@ -3309,6 +3264,9 @@ static bool shmem_can_userfault(struct vm_area_struct *vma, vm_flags_t vm_flags) static const struct vm_uffd_ops shmem_uffd_ops = { .can_userfault = shmem_can_userfault, .get_folio_noalloc = shmem_get_folio_noalloc, + .alloc_folio = shmem_mfill_folio_alloc, + .filemap_add = shmem_mfill_filemap_add, + .filemap_remove = shmem_mfill_filemap_remove, }; #endif /* CONFIG_USERFAULTFD */ diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c index dd191703b320..8a023d9326c2 100644 --- a/mm/userfaultfd.c +++ b/mm/userfaultfd.c @@ -14,7 +14,6 @@ #include #include #include -#include #include #include #include "internal.h" @@ -338,10 +337,10 @@ static bool mfill_file_over_size(struct vm_area_struct *dst_vma, * This function handles both MCOPY_ATOMIC_NORMAL and _CONTINUE for both shmem * and anon, and for both shared and private VMAs. 
*/ -int mfill_atomic_install_pte(pmd_t *dst_pmd, - struct vm_area_struct *dst_vma, - unsigned long dst_addr, struct page *page, - bool newly_allocated, uffd_flags_t flags) +static int mfill_atomic_install_pte(pmd_t *dst_pmd, + struct vm_area_struct *dst_vma, + unsigned long dst_addr, struct page *page, + uffd_flags_t flags) { int ret; struct mm_struct *dst_mm = dst_vma->vm_mm; @@ -385,9 +384,6 @@ int mfill_atomic_install_pte(pmd_t *dst_pmd, goto out_unlock; if (page_in_cache) { - /* Usually, cache pages are already added to LRU */ - if (newly_allocated) - folio_add_lru(folio); folio_add_file_rmap_pte(folio, page, dst_vma); } else { folio_add_new_anon_rmap(folio, dst_vma, dst_addr, RMAP_EXCLUSIVE); @@ -402,6 +398,9 @@ int mfill_atomic_install_pte(pmd_t *dst_pmd, set_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte); + if (page_in_cache) + folio_unlock(folio); + /* No need to invalidate - it was non-present before */ update_mmu_cache(dst_vma, dst_addr, dst_pte); ret = 0; @@ -514,13 +513,22 @@ static int __mfill_atomic_pte(struct mfill_state *state, */ __folio_mark_uptodate(folio); + if (ops->filemap_add) { + ret = ops->filemap_add(folio, state->vma, state->dst_addr); + if (ret) + goto err_folio_put; + } + ret = mfill_atomic_install_pte(state->pmd, state->vma, dst_addr, - &folio->page, true, flags); + &folio->page, flags); if (ret) - goto err_folio_put; + goto err_filemap_remove; return 0; +err_filemap_remove: + if (ops->filemap_remove) + ops->filemap_remove(folio, state->vma); err_folio_put: folio_put(folio); /* Don't return -ENOENT so that our caller won't retry */ @@ -533,6 +541,18 @@ static int mfill_atomic_pte_copy(struct mfill_state *state) { const struct vm_uffd_ops *ops = vma_uffd_ops(state->vma); + /* + * The normal page fault path for a MAP_PRIVATE mapping in a + * file-backed VMA will invoke the fault, fill the hole in the file and + * COW it right away. The result generates plain anonymous memory. + * So when we are asked to fill a hole in a MAP_PRIVATE mapping, we'll + * generate anonymous memory directly without actually filling the + * hole. For the MAP_PRIVATE case the robustness check only happens in + * the pagetable (to verify it's still none) and not in the page cache. 
+ */ + if (!(state->vma->vm_flags & VM_SHARED)) + ops = &anon_uffd_ops; + return __mfill_atomic_pte(state, ops); } @@ -552,7 +572,8 @@ static int mfill_atomic_pte_zeropage(struct mfill_state *state) spinlock_t *ptl; int ret; - if (mm_forbids_zeropage(dst_vma->vm_mm)) + if (mm_forbids_zeropage(dst_vma->vm_mm) || + (dst_vma->vm_flags & VM_SHARED)) return mfill_atomic_pte_zeroed_folio(state); _dst_pte = pte_mkspecial(pfn_pte(zero_pfn(dst_addr), @@ -609,11 +630,10 @@ static int mfill_atomic_pte_continue(struct mfill_state *state) } ret = mfill_atomic_install_pte(dst_pmd, dst_vma, dst_addr, - page, false, flags); + page, flags); if (ret) goto out_release; - folio_unlock(folio); return 0; out_release: @@ -836,41 +856,19 @@ extern ssize_t mfill_atomic_hugetlb(struct userfaultfd_ctx *ctx, static __always_inline ssize_t mfill_atomic_pte(struct mfill_state *state) { - struct vm_area_struct *dst_vma = state->vma; - unsigned long src_addr = state->src_addr; - unsigned long dst_addr = state->dst_addr; - struct folio **foliop = &state->folio; uffd_flags_t flags = state->flags; - pmd_t *dst_pmd = state->pmd; - ssize_t err; if (uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE)) return mfill_atomic_pte_continue(state); if (uffd_flags_mode_is(flags, MFILL_ATOMIC_POISON)) return mfill_atomic_pte_poison(state); + if (uffd_flags_mode_is(flags, MFILL_ATOMIC_COPY)) + return mfill_atomic_pte_copy(state); + if (uffd_flags_mode_is(flags, MFILL_ATOMIC_ZEROPAGE)) + return mfill_atomic_pte_zeropage(state); - /* - * The normal page fault path for a shmem will invoke the - * fault, fill the hole in the file and COW it right away. The - * result generates plain anonymous memory. So when we are - * asked to fill an hole in a MAP_PRIVATE shmem mapping, we'll - * generate anonymous memory directly without actually filling - * the hole. For the MAP_PRIVATE case the robustness check - * only happens in the pagetable (to verify it's still none) - * and not in the radix tree. - */ - if (!(dst_vma->vm_flags & VM_SHARED)) { - if (uffd_flags_mode_is(flags, MFILL_ATOMIC_COPY)) - err = mfill_atomic_pte_copy(state); - else - err = mfill_atomic_pte_zeropage(state); - } else { - err = shmem_mfill_atomic_pte(dst_pmd, dst_vma, - dst_addr, src_addr, - flags, foliop); - } - - return err; + VM_WARN_ONCE(1, "Unknown UFFDIO operation, flags: %x", flags); + return -EOPNOTSUPP; } static __always_inline ssize_t mfill_atomic(struct userfaultfd_ctx *ctx, -- cgit v1.2.3 From 6ab703034f145ef8e1a705b1630cc317ec8dd8a2 Mon Sep 17 00:00:00 2001 From: "Mike Rapoport (Microsoft)" Date: Thu, 2 Apr 2026 07:11:52 +0300 Subject: userfaultfd: mfill_atomic(): remove retry logic Since __mfill_atomic_pte() handles the retry for both anonymous and shmem, there is no need to retry copying the date from the userspace in the loop in mfill_atomic(). Drop the retry logic from mfill_atomic(). 
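Roughly, the resulting main loop reduces to the following (simplified sketch of the remaining control flow, not a verbatim copy of the function):

	err = mfill_get_vma(&state);
	if (err)
		goto out;

	while (state.src_addr < src_start + len) {
		err = mfill_atomic_pte(&state);	/* each op handles its own copy retry now */
		cond_resched();
		if (err)
			break;
		state.dst_addr += PAGE_SIZE;
		state.src_addr += PAGE_SIZE;
		copied += PAGE_SIZE;
	}

	mfill_put_vma(&state);
out:
	return copied ? copied : err;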
[rppt@kernel.org: remove safety measure of not returning ENOENT from _copy] Link: https://lore.kernel.org/ac5zcDUY8CFHr6Lw@kernel.org Link: https://lore.kernel.org/20260402041156.1377214-12-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) Cc: Andrea Arcangeli Cc: Andrei Vagin Cc: Axel Rasmussen Cc: Baolin Wang Cc: David Hildenbrand (Arm) Cc: Harry Yoo Cc: Harry Yoo (Oracle) Cc: Hugh Dickins Cc: James Houghton Cc: Liam Howlett Cc: Lorenzo Stoakes (Oracle) Cc: Matthew Wilcox (Oracle) Cc: Michal Hocko Cc: Muchun Song Cc: Nikita Kalyazin Cc: Oscar Salvador Cc: Paolo Bonzini Cc: Peter Xu Cc: Sean Christopherson Cc: Shuah Khan Cc: Suren Baghdasaryan Cc: Vlastimil Babka Cc: David Carlier Signed-off-by: Andrew Morton --- mm/userfaultfd.c | 27 --------------------------- 1 file changed, 27 deletions(-) diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c index 8a023d9326c2..885da1e56466 100644 --- a/mm/userfaultfd.c +++ b/mm/userfaultfd.c @@ -29,7 +29,6 @@ struct mfill_state { struct vm_area_struct *vma; unsigned long src_addr; unsigned long dst_addr; - struct folio *folio; pmd_t *pmd; }; @@ -531,9 +530,6 @@ err_filemap_remove: ops->filemap_remove(folio, state->vma); err_folio_put: folio_put(folio); - /* Don't return -ENOENT so that our caller won't retry */ - if (ret == -ENOENT) - ret = -EFAULT; return ret; } @@ -899,7 +895,6 @@ static __always_inline ssize_t mfill_atomic(struct userfaultfd_ctx *ctx, VM_WARN_ON_ONCE(src_start + len <= src_start); VM_WARN_ON_ONCE(dst_start + len <= dst_start); -retry: err = mfill_get_vma(&state); if (err) goto out; @@ -926,26 +921,6 @@ retry: err = mfill_atomic_pte(&state); cond_resched(); - if (unlikely(err == -ENOENT)) { - void *kaddr; - - mfill_put_vma(&state); - VM_WARN_ON_ONCE(!state.folio); - - kaddr = kmap_local_folio(state.folio, 0); - err = copy_from_user(kaddr, - (const void __user *)state.src_addr, - PAGE_SIZE); - kunmap_local(kaddr); - if (unlikely(err)) { - err = -EFAULT; - goto out; - } - flush_dcache_folio(state.folio); - goto retry; - } else - VM_WARN_ON_ONCE(state.folio); - if (!err) { state.dst_addr += PAGE_SIZE; state.src_addr += PAGE_SIZE; @@ -960,8 +935,6 @@ retry: mfill_put_vma(&state); out: - if (state.folio) - folio_put(state.folio); VM_WARN_ON_ONCE(copied < 0); VM_WARN_ON_ONCE(err > 0); VM_WARN_ON_ONCE(!copied && !err); -- cgit v1.2.3 From fb0fca46b9b460f7ac60f66d92ac6276fce9d9e9 Mon Sep 17 00:00:00 2001 From: Chunyu Hu Date: Thu, 2 Apr 2026 09:45:38 +0800 Subject: selftests/mm/guard-regions: skip collapse test when thp not enabled Patch series "selftests/mm: skip several tests when thp is not available", v8. There are several tests that require transparent hugepages; when run on a thp-disabled kernel, such as a realtime kernel, they produce false negatives. Mark those tests as skipped when thp is not available. This patch (of 6): When thp is not available, just skip the collapse tests to avoid false negatives. Without the change, run with a thp disabled kernel: ./run_vmtests.sh -t madv_guard -n 1 # RUN guard_regions.anon.collapse ... # guard-regions.c:2217:collapse:Expected madvise(ptr, size, MADV_NOHUGEPAGE) (-1) == 0 (0) # collapse: Test terminated by assertion # FAIL guard_regions.anon.collapse not ok 2 guard_regions.anon.collapse # RUN guard_regions.shmem.collapse ... # guard-regions.c:2217:collapse:Expected madvise(ptr, size, MADV_NOHUGEPAGE) (-1) == 0 (0) # collapse: Test terminated by assertion # FAIL guard_regions.shmem.collapse not ok 32 guard_regions.shmem.collapse # RUN guard_regions.file.collapse ...
# guard-regions.c:2217:collapse:Expected madvise(ptr, size, MADV_NOHUGEPAGE) (-1) == 0 (0) # collapse: Test terminated by assertion # FAIL guard_regions.file.collapse not ok 62 guard_regions.file.collapse # FAILED: 87 / 90 tests passed. # 17 skipped test(s) detected. Consider enabling relevant config options to improve coverage. # Totals: pass:70 fail:3 xfail:0 xpass:0 skip:17 error:0 With this change, run with thp disabled kernel: ./run_vmtests.sh -t madv_guard -n 1 # RUN guard_regions.anon.collapse ... # SKIP Transparent Hugepages not available # OK guard_regions.anon.collapse ok 2 guard_regions.anon.collapse # SKIP Transparent Hugepages not available # RUN guard_regions.file.collapse ... # SKIP Transparent Hugepages not available # OK guard_regions.file.collapse ok 62 guard_regions.file.collapse # SKIP Transparent Hugepages not available # RUN guard_regions.shmem.collapse ... # SKIP Transparent Hugepages not available # OK guard_regions.shmem.collapse ok 32 guard_regions.shmem.collapse # SKIP Transparent Hugepages not available # PASSED: 90 / 90 tests passed. # 20 skipped test(s) detected. Consider enabling relevant config options to improve coverage. # Totals: pass:70 fail:0 xfail:0 xpass:0 skip:20 error:0 Link: https://lore.kernel.org/20260402014543.1671131-1-chuhu@redhat.com Link: https://lore.kernel.org/20260402014543.1671131-2-chuhu@redhat.com Signed-off-by: Chunyu Hu Reviewed-by: Lorenzo Stoakes (Oracle) Acked-by: David Hildenbrand (Arm) Reviewed-by: Zi Yan Acked-by: Mike Rapoport (Microsoft) Cc: Li Wang Cc: Nico Pache Signed-off-by: Andrew Morton --- tools/testing/selftests/mm/guard-regions.c | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/tools/testing/selftests/mm/guard-regions.c b/tools/testing/selftests/mm/guard-regions.c index dbd21d66d383..48e8b1539be3 100644 --- a/tools/testing/selftests/mm/guard-regions.c +++ b/tools/testing/selftests/mm/guard-regions.c @@ -21,6 +21,7 @@ #include #include #include "vm_util.h" +#include "thp_settings.h" #include "../pidfd/pidfd.h" @@ -2195,6 +2196,9 @@ TEST_F(guard_regions, collapse) char *ptr; int i; + if (!thp_available()) + SKIP(return, "Transparent Hugepages not available\n"); + /* Need file to be correct size for tests for non-anon. */ if (variant->backing != ANON_BACKED) ASSERT_EQ(ftruncate(self->fd, size), 0); -- cgit v1.2.3 From 929d5fbf1a00ed86e02348a0a26dfddc301ababd Mon Sep 17 00:00:00 2001 From: Chunyu Hu Date: Thu, 2 Apr 2026 09:45:39 +0800 Subject: selftests/mm: soft-dirty: skip two tests when thp is not available The test_hugepage test contain two sub tests. If just reporting one skip when thp not available, there will be error in the log because the test count don't match the test plan. Change to skip two tests by running the ksft_test_result_skip twice in this case. 
Without the fix (run test on thp disabled kernel): ./run_vmtests.sh -t soft_dirty # -------------------- # running ./soft-dirty # -------------------- # TAP version 13 # 1..19 # ok 1 Test test_simple # ok 2 Test test_vma_reuse dirty bit of allocated page # ok 3 Test test_vma_reuse dirty bit of reused address page # ok 4 # SKIP Transparent Hugepages not available # ok 5 Test test_mprotect-anon dirty bit of new written page # ok 6 Test test_mprotect-anon soft-dirty clear after clear_refs # ok 7 Test test_mprotect-anon soft-dirty clear after marking RO # ok 8 Test test_mprotect-anon soft-dirty clear after marking RW # ok 9 Test test_mprotect-anon soft-dirty after rewritten # ok 10 Test test_mprotect-file dirty bit of new written page # ok 11 Test test_mprotect-file soft-dirty clear after clear_refs # ok 12 Test test_mprotect-file soft-dirty clear after marking RO # ok 13 Test test_mprotect-file soft-dirty clear after marking RW # ok 14 Test test_mprotect-file soft-dirty after rewritten # ok 15 Test test_merge-anon soft-dirty after remap merge 1st pg # ok 16 Test test_merge-anon soft-dirty after remap merge 2nd pg # ok 17 Test test_merge-anon soft-dirty after mprotect merge 1st pg # ok 18 Test test_merge-anon soft-dirty after mprotect merge 2nd pg # # 1 skipped test(s) detected. Consider enabling relevant config options to improve coverage. # # Planned tests != run tests (19 != 18) # # Totals: pass:17 fail:0 xfail:0 xpass:0 skip:1 error:0 # [FAIL] not ok 52 soft-dirty # exit=1 With the fix (run test on thp disabled kernel): ./run_vmtests.sh -t soft_dirty # -------------------- # running ./soft-dirty # TAP version 13 # -------------------- # running ./soft-dirty # -------------------- # TAP version 13 # 1..19 # ok 1 Test test_simple # ok 2 Test test_vma_reuse dirty bit of allocated page # ok 3 Test test_vma_reuse dirty bit of reused address page # # Transparent Hugepages not available # ok 4 # SKIP Test test_hugepage huge page allocation # ok 5 # SKIP Test test_hugepage huge page dirty bit # ok 6 Test test_mprotect-anon dirty bit of new written page # ok 7 Test test_mprotect-anon soft-dirty clear after clear_refs # ok 8 Test test_mprotect-anon soft-dirty clear after marking RO # ok 9 Test test_mprotect-anon soft-dirty clear after marking RW # ok 10 Test test_mprotect-anon soft-dirty after rewritten # ok 11 Test test_mprotect-file dirty bit of new written page # ok 12 Test test_mprotect-file soft-dirty clear after clear_refs # ok 13 Test test_mprotect-file soft-dirty clear after marking RO # ok 14 Test test_mprotect-file soft-dirty clear after marking RW # ok 15 Test test_mprotect-file soft-dirty after rewritten # ok 16 Test test_merge-anon soft-dirty after remap merge 1st pg # ok 17 Test test_merge-anon soft-dirty after remap merge 2nd pg # ok 18 Test test_merge-anon soft-dirty after mprotect merge 1st pg # ok 19 Test test_merge-anon soft-dirty after mprotect merge 2nd pg # # 2 skipped test(s) detected. Consider enabling relevant config options to improve coverage. 
# # Totals: pass:17 fail:0 xfail:0 xpass:0 skip:2 error:0 # [PASS] ok 1 soft-dirty hwpoison_inject # SUMMARY: PASS=1 SKIP=0 FAIL=0 1..1 Link: https://lore.kernel.org/20260402014543.1671131-3-chuhu@redhat.com Signed-off-by: Chunyu Hu Reviewed-by: Mike Rapoport (Microsoft) Reviewed-by: Lorenzo Stoakes (Oracle) Acked-by: David Hildenbrand (Arm) Reviewed-by: Zi Yan Cc: Li Wang Cc: Nico Pache Signed-off-by: Andrew Morton --- tools/testing/selftests/mm/soft-dirty.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/tools/testing/selftests/mm/soft-dirty.c b/tools/testing/selftests/mm/soft-dirty.c index 59c0dbe99a9b..bcfcac99b436 100644 --- a/tools/testing/selftests/mm/soft-dirty.c +++ b/tools/testing/selftests/mm/soft-dirty.c @@ -82,7 +82,9 @@ static void test_hugepage(int pagemap_fd, int pagesize) int i, ret; if (!thp_is_enabled()) { - ksft_test_result_skip("Transparent Hugepages not available\n"); + ksft_print_msg("Transparent Hugepages not available\n"); + ksft_test_result_skip("Test %s huge page allocation\n", __func__); + ksft_test_result_skip("Test %s huge page dirty bit\n", __func__); return; } -- cgit v1.2.3 From 710d2f307945e892aaa147ae98232fafebe0be33 Mon Sep 17 00:00:00 2001 From: Chunyu Hu Date: Thu, 2 Apr 2026 09:45:40 +0800 Subject: selftests/mm: move write_file helper to vm_util thp_settings provides a write_file() helper for safely writing to a file; it exits when a write failure happens. It's a very low-level helper and many sub-tests need such a helper, not only the thp tests. split_huge_page_test also defines a write_file locally. The two have minor differences in return type and in the exit API used, and there would be conflicts if split_huge_page_test wanted to include thp_settings.h because of the different prototypes, making it less convenient. It's possible to merge the two, although some tests don't use the kselftest infrastructure for testing. It would also work when using the ksft_exit_msg() to exit in my testing, as the counters are all zero. Output will be like: TAP version 13 1..62 Bail out! /proc/sys/vm/drop_caches1 open failed: No such file or directory # Totals: pass:0 fail:0 xfail:0 xpass:0 skip:0 error:0 So here we just keep the version in split_huge_page_test and move it into vm_util. This makes it easier to maintain, and users can just include vm_util.h when they don't need the thp settings helpers. Keep the void return prototype: the function exits on any error, so a return value is not necessary, and this simplifies callers like write_num() and write_string().
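With the helper in vm_util, a caller can do something like the following (illustrative usage only; the sysfs path and value are just examples):

	#include <string.h>
	#include "vm_util.h"

	static void thp_disable_via_sysfs(void)
	{
		const char *val = "never";

		/* write_file() calls ksft_exit_fail_msg() itself on any failure */
		write_file("/sys/kernel/mm/transparent_hugepage/enabled",
			   val, strlen(val) + 1);
	}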
Link: https://lore.kernel.org/20260402014543.1671131-4-chuhu@redhat.com Signed-off-by: Chunyu Hu Reviewed-by: Lorenzo Stoakes (Oracle) Acked-by: David Hildenbrand (Arm) Reviewed-by: Zi Yan Acked-by: Mike Rapoport (Microsoft) Suggested-by: Mike Rapoport Cc: Nico Pache Signed-off-by: Andrew Morton --- tools/testing/selftests/mm/split_huge_page_test.c | 15 ---------- tools/testing/selftests/mm/thp_settings.c | 35 ++--------------------- tools/testing/selftests/mm/thp_settings.h | 1 - tools/testing/selftests/mm/vm_util.c | 15 ++++++++++ tools/testing/selftests/mm/vm_util.h | 2 ++ 5 files changed, 20 insertions(+), 48 deletions(-) diff --git a/tools/testing/selftests/mm/split_huge_page_test.c b/tools/testing/selftests/mm/split_huge_page_test.c index e0167111bdd1..93f205327b84 100644 --- a/tools/testing/selftests/mm/split_huge_page_test.c +++ b/tools/testing/selftests/mm/split_huge_page_test.c @@ -255,21 +255,6 @@ static int check_after_split_folio_orders(char *vaddr_start, size_t len, return status; } -static void write_file(const char *path, const char *buf, size_t buflen) -{ - int fd; - ssize_t numwritten; - - fd = open(path, O_WRONLY); - if (fd == -1) - ksft_exit_fail_msg("%s open failed: %s\n", path, strerror(errno)); - - numwritten = write(fd, buf, buflen - 1); - close(fd); - if (numwritten < 1) - ksft_exit_fail_msg("Write failed\n"); -} - static void write_debugfs(const char *fmt, ...) { char input[INPUT_MAX]; diff --git a/tools/testing/selftests/mm/thp_settings.c b/tools/testing/selftests/mm/thp_settings.c index 574bd0f8ae48..e748ebfb3d4e 100644 --- a/tools/testing/selftests/mm/thp_settings.c +++ b/tools/testing/selftests/mm/thp_settings.c @@ -6,6 +6,7 @@ #include #include +#include "vm_util.h" #include "thp_settings.h" #define THP_SYSFS "/sys/kernel/mm/transparent_hugepage/" @@ -64,29 +65,6 @@ int read_file(const char *path, char *buf, size_t buflen) return (unsigned int) numread; } -int write_file(const char *path, const char *buf, size_t buflen) -{ - int fd; - ssize_t numwritten; - - fd = open(path, O_WRONLY); - if (fd == -1) { - printf("open(%s)\n", path); - exit(EXIT_FAILURE); - return 0; - } - - numwritten = write(fd, buf, buflen - 1); - close(fd); - if (numwritten < 1) { - printf("write(%s)\n", buf); - exit(EXIT_FAILURE); - return 0; - } - - return (unsigned int) numwritten; -} - unsigned long read_num(const char *path) { char buf[21]; @@ -104,10 +82,7 @@ void write_num(const char *path, unsigned long num) char buf[21]; sprintf(buf, "%ld", num); - if (!write_file(path, buf, strlen(buf) + 1)) { - perror(path); - exit(EXIT_FAILURE); - } + write_file(path, buf, strlen(buf) + 1); } int thp_read_string(const char *name, const char * const strings[]) @@ -165,11 +140,7 @@ void thp_write_string(const char *name, const char *val) printf("%s: Pathname is too long\n", __func__); exit(EXIT_FAILURE); } - - if (!write_file(path, val, strlen(val) + 1)) { - perror(path); - exit(EXIT_FAILURE); - } + write_file(path, val, strlen(val) + 1); } unsigned long thp_read_num(const char *name) diff --git a/tools/testing/selftests/mm/thp_settings.h b/tools/testing/selftests/mm/thp_settings.h index 76eeb712e5f1..7748a9009191 100644 --- a/tools/testing/selftests/mm/thp_settings.h +++ b/tools/testing/selftests/mm/thp_settings.h @@ -63,7 +63,6 @@ struct thp_settings { }; int read_file(const char *path, char *buf, size_t buflen); -int write_file(const char *path, const char *buf, size_t buflen); unsigned long read_num(const char *path); void write_num(const char *path, unsigned long num); diff --git 
a/tools/testing/selftests/mm/vm_util.c b/tools/testing/selftests/mm/vm_util.c index a6d4ff7dfdc0..ad96d19d1b85 100644 --- a/tools/testing/selftests/mm/vm_util.c +++ b/tools/testing/selftests/mm/vm_util.c @@ -764,3 +764,18 @@ int unpoison_memory(unsigned long pfn) return ret > 0 ? 0 : -errno; } + +void write_file(const char *path, const char *buf, size_t buflen) +{ + int fd; + ssize_t numwritten; + + fd = open(path, O_WRONLY); + if (fd == -1) + ksft_exit_fail_msg("%s open failed: %s\n", path, strerror(errno)); + + numwritten = write(fd, buf, buflen - 1); + close(fd); + if (numwritten < 1) + ksft_exit_fail_msg("Write failed\n"); +} diff --git a/tools/testing/selftests/mm/vm_util.h b/tools/testing/selftests/mm/vm_util.h index e9c4e24769c1..1a07305ceff4 100644 --- a/tools/testing/selftests/mm/vm_util.h +++ b/tools/testing/selftests/mm/vm_util.h @@ -166,3 +166,5 @@ int unpoison_memory(unsigned long pfn); #define PAGEMAP_PRESENT(ent) (((ent) & (1ull << 63)) != 0) #define PAGEMAP_PFN(ent) ((ent) & ((1ull << 55) - 1)) + +void write_file(const char *path, const char *buf, size_t buflen); -- cgit v1.2.3 From a784a3a39cc58b45807083b6447fa13028fd47e7 Mon Sep 17 00:00:00 2001 From: Chunyu Hu Date: Thu, 2 Apr 2026 09:45:41 +0800 Subject: selftests/mm/vm_util: robust write_file() Add three more checks for buflen and numwritten. The buflen should be at least two, that means at least one char and the null-end. The error case check is added by checking numwriten < 0 instead of numwritten < 1. And the truncate case is checked. The test will exit if any of these conditions aren't met. Additionally, add more print information when a write failure occurs or a truncated write happens, providing clearer diagnostics. Link: https://lore.kernel.org/20260402014543.1671131-5-chuhu@redhat.com Signed-off-by: Chunyu Hu Acked-by: David Hildenbrand (Arm) Reviewed-by: Lorenzo Stoakes Cc: Nico Pache Signed-off-by: Andrew Morton --- tools/testing/selftests/mm/vm_util.c | 15 ++++++++++++--- 1 file changed, 12 insertions(+), 3 deletions(-) diff --git a/tools/testing/selftests/mm/vm_util.c b/tools/testing/selftests/mm/vm_util.c index ad96d19d1b85..db94564f4431 100644 --- a/tools/testing/selftests/mm/vm_util.c +++ b/tools/testing/selftests/mm/vm_util.c @@ -767,15 +767,24 @@ int unpoison_memory(unsigned long pfn) void write_file(const char *path, const char *buf, size_t buflen) { - int fd; + int fd, saved_errno; ssize_t numwritten; + if (buflen < 2) + ksft_exit_fail_msg("Incorrect buffer len: %zu\n", buflen); + fd = open(path, O_WRONLY); if (fd == -1) ksft_exit_fail_msg("%s open failed: %s\n", path, strerror(errno)); numwritten = write(fd, buf, buflen - 1); + saved_errno = errno; close(fd); - if (numwritten < 1) - ksft_exit_fail_msg("Write failed\n"); + errno = saved_errno; + if (numwritten < 0) + ksft_exit_fail_msg("%s write(%.*s) failed: %s\n", path, (int)(buflen - 1), + buf, strerror(errno)); + if (numwritten != buflen - 1) + ksft_exit_fail_msg("%s write(%.*s) is truncated, expected %zu bytes, got %zd bytes\n", + path, (int)(buflen - 1), buf, buflen - 1, numwritten); } -- cgit v1.2.3 From dad4964a34c20cb86dcbedfe64ef7fe0728346df Mon Sep 17 00:00:00 2001 From: Chunyu Hu Date: Thu, 2 Apr 2026 09:45:42 +0800 Subject: selftests/mm: split_huge_page_test: skip the test when thp is not available When thp is not enabled on some kernel config such as realtime kernel, the test will report failure. Fix the false positive by skipping the test directly when thp is not enabled. 
Tested with a thp-disabled kernel: Before the fix: # -------------------------------------------------- # running ./split_huge_page_test /tmp/xfs_dir_Ywup9p # -------------------------------------------------- # TAP version 13 # Bail out! Reading PMD pagesize failed # # Totals: pass:0 fail:0 xfail:0 xpass:0 skip:0 error:0 # [FAIL] not ok 61 split_huge_page_test /tmp/xfs_dir_Ywup9p # exit=1 After the fix: # -------------------------------------------------- # running ./split_huge_page_test /tmp/xfs_dir_YHPUPl # -------------------------------------------------- # TAP version 13 # 1..0 # SKIP Transparent Hugepages not available # [SKIP] ok 6 split_huge_page_test /tmp/xfs_dir_YHPUPl # SKIP Link: https://lore.kernel.org/20260402014543.1671131-6-chuhu@redhat.com Signed-off-by: Chunyu Hu Acked-by: David Hildenbrand (Arm) Reviewed-by: Mike Rapoport (Microsoft) Reviewed-by: Lorenzo Stoakes (Oracle) Reviewed-by: Zi Yan Cc: Li Wang Cc: Nico Pache Signed-off-by: Andrew Morton --- tools/testing/selftests/mm/split_huge_page_test.c | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/tools/testing/selftests/mm/split_huge_page_test.c b/tools/testing/selftests/mm/split_huge_page_test.c index 93f205327b84..500d07c4938b 100644 --- a/tools/testing/selftests/mm/split_huge_page_test.c +++ b/tools/testing/selftests/mm/split_huge_page_test.c @@ -21,6 +21,7 @@ #include #include "vm_util.h" #include "kselftest.h" +#include "thp_settings.h" uint64_t pagesize; unsigned int pageshift; @@ -757,6 +758,9 @@ int main(int argc, char **argv) ksft_finished(); } + if (!thp_is_enabled()) + ksft_exit_skip("Transparent Hugepages not available\n"); + if (argc > 1) optional_xfs_path = argv[1]; -- cgit v1.2.3 From cfe9a446f519f355f2e3741e2d63944e6064c4cc Mon Sep 17 00:00:00 2001 From: Chunyu Hu Date: Thu, 2 Apr 2026 09:45:43 +0800 Subject: selftests/mm: transhuge_stress: skip the test when thp not available The test requires thp; skip the test when thp is not available to avoid a false positive. Tested with a thp-disabled kernel. Before the fix: # -------------------------------- # running ./transhuge-stress -d 20 # -------------------------------- # TAP version 13 # 1..1 # transhuge-stress: allocate 1453 transhuge pages, using 2907 MiB virtual memory and 11 MiB of ram # Bail out! 
MADV_HUGEPAGE# Planned tests != run tests (1 != 0) # # Totals: pass:0 fail:0 xfail:0 xpass:0 skip:0 error:0 # [FAIL] not ok 60 transhuge-stress -d 20 # exit=1 After the fix: # -------------------------------- # running ./transhuge-stress -d 20 # -------------------------------- # TAP version 13 # 1..0 # SKIP Transparent Hugepages not available # [SKIP] ok 5 transhuge-stress -d 20 # SKIP Link: https://lore.kernel.org/20260402014543.1671131-7-chuhu@redhat.com Signed-off-by: Chunyu Hu Acked-by: David Hildenbrand (Arm) Reviewed-by: Mike Rapoport (Microsoft) Reviewed-by: Lorenzo Stoakes (Oracle) Reviewed-by: Zi Yan Cc: Li Wang Cc: Nico Pache Signed-off-by: Andrew Morton --- tools/testing/selftests/mm/transhuge-stress.c | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/tools/testing/selftests/mm/transhuge-stress.c b/tools/testing/selftests/mm/transhuge-stress.c index bcad47c09518..7a9f1035099b 100644 --- a/tools/testing/selftests/mm/transhuge-stress.c +++ b/tools/testing/selftests/mm/transhuge-stress.c @@ -17,6 +17,7 @@ #include #include "vm_util.h" #include "kselftest.h" +#include "thp_settings.h" int backing_fd = -1; int mmap_flags = MAP_ANONYMOUS | MAP_NORESERVE | MAP_PRIVATE; @@ -37,6 +38,9 @@ int main(int argc, char **argv) ksft_print_header(); + if (!thp_is_enabled()) + ksft_exit_skip("Transparent Hugepages not available\n"); + ram = sysconf(_SC_PHYS_PAGES); if (ram > SIZE_MAX / psize() / 4) ram = SIZE_MAX / 4; -- cgit v1.2.3 From df620ec4d4d703f11f3b0adecd4450c34489e0f1 Mon Sep 17 00:00:00 2001 From: David Carlier Date: Thu, 2 Apr 2026 07:14:07 +0100 Subject: mm/page_io: use sio->len for PSWPIN accounting in sio_read_complete() sio_read_complete() uses sio->pages to account global PSWPIN vm events, but sio->pages tracks the number of bvec entries (folios), not base pages. While large folios cannot currently reach this path (SWP_FS_OPS and SWP_SYNCHRONOUS_IO are mutually exclusive, and mTHP swap-in allocation is gated on SWP_SYNCHRONOUS_IO), the accounting is semantically inconsistent with the per-memcg path which correctly uses folio_nr_pages(). Use sio->len >> PAGE_SHIFT instead, which gives the correct base page count since sio->len is accumulated via folio_size(folio). Link: https://lore.kernel.org/20260402061408.36119-1-devnexen@gmail.com Signed-off-by: David Carlier Acked-by: David Hildenbrand (Arm) Cc: Baoquan He Cc: Chris Li Cc: Kairui Song Cc: Kemeng Shi Cc: NeilBrown Cc: Nhat Pham Signed-off-by: Andrew Morton --- mm/page_io.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/page_io.c b/mm/page_io.c index 93d03d9e2a6a..70cea9e24d2f 100644 --- a/mm/page_io.c +++ b/mm/page_io.c @@ -497,7 +497,7 @@ static void sio_read_complete(struct kiocb *iocb, long ret) folio_mark_uptodate(folio); folio_unlock(folio); } - count_vm_events(PSWPIN, sio->pages); + count_vm_events(PSWPIN, sio->len >> PAGE_SHIFT); } else { for (p = 0; p < sio->pages; p++) { struct folio *folio = page_folio(sio->bvec[p].bv_page); -- cgit v1.2.3 From 77c368f057e17b59b23899a1907ee9d4f4d7a532 Mon Sep 17 00:00:00 2001 From: Muchun Song Date: Thu, 2 Apr 2026 18:23:20 +0800 Subject: mm/sparse: fix comment for section map alignment The comment in mmzone.h currently details exhaustive per-architecture bit-width lists and explains alignment using min(PAGE_SHIFT, PFN_SECTION_SHIFT). Such details risk falling out of date over time and may inadvertently be left un-updated. We always expect a single section to cover full pages. 
Therefore, we can safely assume that PFN_SECTION_SHIFT is large enough to accommodate SECTION_MAP_LAST_BIT. We use BUILD_BUG_ON() to ensure this. Update the comment to accurately reflect this consensus, making it clear that we rely on a single section covering full pages. Link: https://lore.kernel.org/20260402102320.3617578-1-songmuchun@bytedance.com Signed-off-by: Muchun Song Acked-by: David Hildenbrand (Arm) Cc: Liam Howlett Cc: Lorenzo Stoakes (Oracle) Cc: Michal Hocko Cc: Mike Rapoport Cc: Petr Tesarik Cc: Suren Baghdasaryan Signed-off-by: Andrew Morton --- include/linux/mmzone.h | 25 ++++++++++--------------- 1 file changed, 10 insertions(+), 15 deletions(-) diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 20f920dede65..07f501a62d67 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -2068,21 +2068,16 @@ static inline struct mem_section *__nr_to_section(unsigned long nr) extern size_t mem_section_usage_size(void); /* - * We use the lower bits of the mem_map pointer to store - * a little bit of information. The pointer is calculated - * as mem_map - section_nr_to_pfn(pnum). The result is - * aligned to the minimum alignment of the two values: - * 1. All mem_map arrays are page-aligned. - * 2. section_nr_to_pfn() always clears PFN_SECTION_SHIFT - * lowest bits. PFN_SECTION_SHIFT is arch-specific - * (equal SECTION_SIZE_BITS - PAGE_SHIFT), and the - * worst combination is powerpc with 256k pages, - * which results in PFN_SECTION_SHIFT equal 6. - * To sum it up, at least 6 bits are available on all architectures. - * However, we can exceed 6 bits on some other architectures except - * powerpc (e.g. 15 bits are available on x86_64, 13 bits are available - * with the worst case of 64K pages on arm64) if we make sure the - * exceeded bit is not applicable to powerpc. + * We use the lower bits of the mem_map pointer to store a little bit of + * information. The pointer is calculated as mem_map - section_nr_to_pfn(). + * The result is aligned to the minimum alignment of the two values: + * + * 1. All mem_map arrays are page-aligned. + * 2. section_nr_to_pfn() always clears PFN_SECTION_SHIFT lowest bits. + * + * We always expect a single section to cover full pages. Therefore, + * we can safely assume that PFN_SECTION_SHIFT is large enough to + * accommodate SECTION_MAP_LAST_BIT. We use BUILD_BUG_ON() to ensure this. */ enum { SECTION_MARKED_PRESENT_BIT, -- cgit v1.2.3 From 19999e479c2a38672789e66b4830f43c645ca1f2 Mon Sep 17 00:00:00 2001 From: Zhaoyang Huang Date: Thu, 12 Feb 2026 11:21:11 +0800 Subject: mm: remove '!root_reclaim' checking in should_abort_scan() Android systems usually use the memory.reclaim interface to implement user space memory management, which expects that the requested reclaim target and the actually reclaimed amount of memory do not diverge by too much. With the current MGLRU implementation there is, however, no bail out when the reclaim target is reached, and this could lead to excessive reclaim that scales with the reclaim hierarchy size. For example, we can get a nr_reclaimed=394/nr_to_reclaim=32 proactive reclaim under a common 1-N cgroup hierarchy. 
This defect arose from the goal of keeping fairness among memcgs: that is, in the path try_to_free_mem_cgroup_pages -> shrink_node_memcgs -> shrink_lruvec -> lru_gen_shrink_lruvec -> try_to_shrink_lruvec, the !root_reclaim(sc) check was there for reclaim fairness, which was necessary before commit b82b530740b9 ("mm: vmscan: restore incremental cgroup iteration") because the fairness depended on attempted proportional reclaim from every memcg under the target memcg. However, after commit b82b530740b9 there is no longer a need to visit every memcg to ensure fairness. Let's have try_to_shrink_lruvec() bail out once nr_reclaimed reaches the target. Link: https://lore.kernel.org/20260318011558.1696310-1-zhaoyang.huang@unisoc.com Link: https://lore.kernel.org/20260212032111.408865-1-zhaoyang.huang@unisoc.com Signed-off-by: Zhaoyang Huang Suggested-by: T.J.Mercier Reviewed-by: T.J. Mercier Acked-by: Shakeel Butt Acked-by: Qi Zheng Reviewed-by: Barry Song Reviewed-by: Kairui Song Cc: Johannes Weiner Cc: Michal Hocko Cc: Rik van Riel Cc: Roman Gushchin Cc: Yu Zhao Cc: Axel Rasmussen Cc: Yuanchu Xie Cc: Wei Xu Signed-off-by: Andrew Morton --- mm/vmscan.c | 4 ---- 1 file changed, 4 deletions(-) diff --git a/mm/vmscan.c b/mm/vmscan.c index 7fd97e0e0ab9..5a8c8fcccbfc 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -4971,10 +4971,6 @@ static bool should_abort_scan(struct lruvec *lruvec, struct scan_control *sc) int i; enum zone_watermarks mark; - /* don't abort memcg reclaim to ensure fairness */ - if (!root_reclaim(sc)) - return false; - if (sc->nr_reclaimed >= max(sc->nr_to_reclaim, compact_gap(sc->order))) return true; -- cgit v1.2.3 From 3bc181c1436373e42220baaa0d8c9b45fa18afe1 Mon Sep 17 00:00:00 2001 From: Pedro Falcato Date: Thu, 2 Apr 2026 15:16:27 +0100 Subject: mm/mprotect: move softleaf code out of the main function Patch series "mm/mprotect: micro-optimization work", v3. Micro-optimize the change_protection functionality and the change_pte_range() routine. This set of functions works in an incredibly tight loop, and even small inefficiencies are incredibly evident when spun hundreds, thousands or hundreds of thousands of times. There was an attempt to keep the batching functionality as much as possible, which introduced some part of the slowness, but not all of it. Removing it for !arm64 architectures would speed mprotect() up even further, but could easily pessimize cases where large folios are mapped (which is not as rare as it seems, particularly when it comes to the page cache these days). The micro-benchmark used for the tests was [0] (usable with google/benchmark and g++ -O2 -lbenchmark repro.cpp) This resulted in the following (first entry is baseline): --------------------------------------------------------- Benchmark Time CPU Iterations --------------------------------------------------------- mprotect_bench 85967 ns 85967 ns 6935 mprotect_bench 70684 ns 70684 ns 9887 After the patchset we can observe an ~18% speedup in mprotect. Wonderful for the elusive mprotect-based workloads! Testing & more ideas welcome. I suspect there is plenty of improvement possible but it would require more time than what I have on my hands right now. The entire inlined function (which inlines into change_protection()) is gigantic - I'm not surprised this is so finicky. Note: per my profiling, the next _big_ bottleneck here is modify_prot_start_ptes, exactly on the xchg() done by x86. ptep_get_and_clear() is _expensive_. I don't think there's a properly safe way to go about it since we do depend on the D bit quite a lot. 
This might not be such an issue on other architectures. Luke Yang reported [1]: : On average, we see improvements ranging from a minimum of 5% to a : maximum of 55%, with most improvements showing around a 25% speed up in : the libmicro/mprot_tw4m micro benchmark. This patch (of 2): Move softleaf change_pte_range code into a separate function. This makes the change_pte_range() function a good bit smaller, and lessens cognitive load when reading through the function. Link: https://lore.kernel.org/20260402141628.3367596-1-pfalcato@suse.de Link: https://lore.kernel.org/20260402141628.3367596-2-pfalcato@suse.de Link: https://lore.kernel.org/all/aY8-XuFZ7zCvXulB@luyang-thinkpadp1gen7.toromso.csb/ Link: https://gist.github.com/heatd/1450d273005aba91fa5744f44dfcd933 [0] Link: https://lore.kernel.org/CAL2CeBxT4jtJ+LxYb6=BNxNMGinpgD_HYH5gGxOP-45Q2OncqQ@mail.gmail.com [1] Signed-off-by: Pedro Falcato Reviewed-by: Lorenzo Stoakes (Oracle) Acked-by: David Hildenbrand (Arm) Tested-by: Luke Yang Reviewed-by: Vlastimil Babka (SUSE) Cc: Dev Jain Cc: Jann Horn Cc: Jiri Hladky Cc: Liam Howlett Cc: Davidlohr Bueso Signed-off-by: Andrew Morton --- mm/mprotect.c | 127 +++++++++++++++++++++++++++++++--------------------------- 1 file changed, 67 insertions(+), 60 deletions(-) diff --git a/mm/mprotect.c b/mm/mprotect.c index 110d47a36d4b..86b9895afe72 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -211,6 +211,72 @@ static void set_write_prot_commit_flush_ptes(struct vm_area_struct *vma, commit_anon_folio_batch(vma, folio, page, addr, ptep, oldpte, ptent, nr_ptes, tlb); } +static long change_softleaf_pte(struct vm_area_struct *vma, + unsigned long addr, pte_t *pte, pte_t oldpte, unsigned long cp_flags) +{ + const bool uffd_wp = cp_flags & MM_CP_UFFD_WP; + const bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE; + softleaf_t entry = softleaf_from_pte(oldpte); + pte_t newpte; + + if (softleaf_is_migration_write(entry)) { + const struct folio *folio = softleaf_to_folio(entry); + + /* + * A protection check is difficult so + * just be safe and disable write + */ + if (folio_test_anon(folio)) + entry = make_readable_exclusive_migration_entry(swp_offset(entry)); + else + entry = make_readable_migration_entry(swp_offset(entry)); + newpte = swp_entry_to_pte(entry); + if (pte_swp_soft_dirty(oldpte)) + newpte = pte_swp_mksoft_dirty(newpte); + } else if (softleaf_is_device_private_write(entry)) { + /* + * We do not preserve soft-dirtiness. See + * copy_nonpresent_pte() for explanation. + */ + entry = make_readable_device_private_entry(swp_offset(entry)); + newpte = swp_entry_to_pte(entry); + if (pte_swp_uffd_wp(oldpte)) + newpte = pte_swp_mkuffd_wp(newpte); + } else if (softleaf_is_marker(entry)) { + /* + * Ignore error swap entries unconditionally, + * because any access should sigbus/sigsegv + * anyway. + */ + if (softleaf_is_poison_marker(entry) || + softleaf_is_guard_marker(entry)) + return 0; + /* + * If this is uffd-wp pte marker and we'd like + * to unprotect it, drop it; the next page + * fault will trigger without uffd trapping. 
+ */ + if (uffd_wp_resolve) { + pte_clear(vma->vm_mm, addr, pte); + return 1; + } + return 0; + } else { + newpte = oldpte; + } + + if (uffd_wp) + newpte = pte_swp_mkuffd_wp(newpte); + else if (uffd_wp_resolve) + newpte = pte_swp_clear_uffd_wp(newpte); + + if (!pte_same(oldpte, newpte)) { + set_pte_at(vma->vm_mm, addr, pte, newpte); + return 1; + } + return 0; +} + static long change_pte_range(struct mmu_gather *tlb, struct vm_area_struct *vma, pmd_t *pmd, unsigned long addr, unsigned long end, pgprot_t newprot, unsigned long cp_flags) @@ -317,66 +383,7 @@ static long change_pte_range(struct mmu_gather *tlb, pages++; } } else { - softleaf_t entry = softleaf_from_pte(oldpte); - pte_t newpte; - - if (softleaf_is_migration_write(entry)) { - const struct folio *folio = softleaf_to_folio(entry); - - /* - * A protection check is difficult so - * just be safe and disable write - */ - if (folio_test_anon(folio)) - entry = make_readable_exclusive_migration_entry( - swp_offset(entry)); - else - entry = make_readable_migration_entry(swp_offset(entry)); - newpte = swp_entry_to_pte(entry); - if (pte_swp_soft_dirty(oldpte)) - newpte = pte_swp_mksoft_dirty(newpte); - } else if (softleaf_is_device_private_write(entry)) { - /* - * We do not preserve soft-dirtiness. See - * copy_nonpresent_pte() for explanation. - */ - entry = make_readable_device_private_entry( - swp_offset(entry)); - newpte = swp_entry_to_pte(entry); - if (pte_swp_uffd_wp(oldpte)) - newpte = pte_swp_mkuffd_wp(newpte); - } else if (softleaf_is_marker(entry)) { - /* - * Ignore error swap entries unconditionally, - * because any access should sigbus/sigsegv - * anyway. - */ - if (softleaf_is_poison_marker(entry) || - softleaf_is_guard_marker(entry)) - continue; - /* - * If this is uffd-wp pte marker and we'd like - * to unprotect it, drop it; the next page - * fault will trigger without uffd trapping. - */ - if (uffd_wp_resolve) { - pte_clear(vma->vm_mm, addr, pte); - pages++; - } - continue; - } else { - newpte = oldpte; - } - - if (uffd_wp) - newpte = pte_swp_mkuffd_wp(newpte); - else if (uffd_wp_resolve) - newpte = pte_swp_clear_uffd_wp(newpte); - - if (!pte_same(oldpte, newpte)) { - set_pte_at(vma->vm_mm, addr, pte, newpte); - pages++; - } + pages += change_softleaf_pte(vma, addr, pte, oldpte, cp_flags); } } while (pte += nr_ptes, addr += nr_ptes * PAGE_SIZE, addr != end); lazy_mmu_mode_disable(); -- cgit v1.2.3 From 89e613bc0b2d6d4a18a09b161131ce4ca5c70f2a Mon Sep 17 00:00:00 2001 From: Pedro Falcato Date: Thu, 2 Apr 2026 15:16:28 +0100 Subject: mm/mprotect: special-case small folios when applying permissions The common order-0 case is important enough to want its own branch, and avoids the hairy, large loop logic that the CPU does not seem to handle particularly well. While at it, encourage the compiler to inline batch PTE logic and resolve constant branches by adding __always_inline strategically. 
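To illustrate the idea behind the nr_ptes == 1 special case, here is a minimal userspace sketch (purely illustrative; the helper and caller names are invented, this is not the kernel code): when an __always_inline helper is called with a literal constant, the compiler can propagate the constant into the inlined body and drop the generic loop, while the rare batched path keeps it.

#include <stdio.h>

/* Generic batched helper: bump nr consecutive counters. */
static inline __attribute__((always_inline)) void bump_batch(int *vals, int nr)
{
	int i;

	for (i = 0; i < nr; i++)
		vals[i]++;
}

/* Mirrors the structure of the change: special-case the common single-entry case. */
static void process(int *vals, int nr)
{
	if (nr == 1)
		bump_batch(vals, 1);	/* constant 1 propagates, the loop folds away */
	else
		bump_batch(vals, nr);	/* rare batched case keeps the generic loop */
}

int main(void)
{
	int v[4] = { 0, 0, 0, 0 };

	process(v, 1);
	process(&v[1], 3);
	printf("%d %d %d %d\n", v[0], v[1], v[2], v[3]);
	return 0;
}

Built with gcc -O2, the nr == 1 branch typically compiles down to a single increment with no loop setup, which is the effect the patch is after for the common small-folio case.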
Link: https://lore.kernel.org/20260402141628.3367596-3-pfalcato@suse.de Signed-off-by: Pedro Falcato Suggested-by: David Hildenbrand (Arm) Reviewed-by: Lorenzo Stoakes (Oracle) Tested-by: Luke Yang Reviewed-by: Vlastimil Babka (SUSE) Cc: Dev Jain Cc: Jann Horn Cc: Jiri Hladky Cc: Liam Howlett Cc: Davidlohr Bueso Signed-off-by: Andrew Morton --- mm/mprotect.c | 91 +++++++++++++++++++++++++++++++++++++---------------------- 1 file changed, 57 insertions(+), 34 deletions(-) diff --git a/mm/mprotect.c b/mm/mprotect.c index 86b9895afe72..9cbf932b028c 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -117,9 +117,9 @@ static int mprotect_folio_pte_batch(struct folio *folio, pte_t *ptep, } /* Set nr_ptes number of ptes, starting from idx */ -static void prot_commit_flush_ptes(struct vm_area_struct *vma, unsigned long addr, - pte_t *ptep, pte_t oldpte, pte_t ptent, int nr_ptes, - int idx, bool set_write, struct mmu_gather *tlb) +static __always_inline void prot_commit_flush_ptes(struct vm_area_struct *vma, + unsigned long addr, pte_t *ptep, pte_t oldpte, pte_t ptent, + int nr_ptes, int idx, bool set_write, struct mmu_gather *tlb) { /* * Advance the position in the batch by idx; note that if idx > 0, @@ -143,7 +143,7 @@ static void prot_commit_flush_ptes(struct vm_area_struct *vma, unsigned long add * !PageAnonExclusive() pages, starting from start_idx. Caller must enforce * that the ptes point to consecutive pages of the same anon large folio. */ -static int page_anon_exclusive_sub_batch(int start_idx, int max_len, +static __always_inline int page_anon_exclusive_sub_batch(int start_idx, int max_len, struct page *first_page, bool expected_anon_exclusive) { int idx; @@ -169,7 +169,7 @@ static int page_anon_exclusive_sub_batch(int start_idx, int max_len, * pte of the batch. Therefore, we must individually check all pages and * retrieve sub-batches. */ -static void commit_anon_folio_batch(struct vm_area_struct *vma, +static __always_inline void commit_anon_folio_batch(struct vm_area_struct *vma, struct folio *folio, struct page *first_page, unsigned long addr, pte_t *ptep, pte_t oldpte, pte_t ptent, int nr_ptes, struct mmu_gather *tlb) { @@ -188,7 +188,7 @@ static void commit_anon_folio_batch(struct vm_area_struct *vma, } } -static void set_write_prot_commit_flush_ptes(struct vm_area_struct *vma, +static __always_inline void set_write_prot_commit_flush_ptes(struct vm_area_struct *vma, struct folio *folio, struct page *page, unsigned long addr, pte_t *ptep, pte_t oldpte, pte_t ptent, int nr_ptes, struct mmu_gather *tlb) { @@ -277,6 +277,45 @@ static long change_softleaf_pte(struct vm_area_struct *vma, return 0; } +static __always_inline void change_present_ptes(struct mmu_gather *tlb, + struct vm_area_struct *vma, unsigned long addr, pte_t *ptep, + int nr_ptes, unsigned long end, pgprot_t newprot, + struct folio *folio, struct page *page, unsigned long cp_flags) +{ + const bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE; + const bool uffd_wp = cp_flags & MM_CP_UFFD_WP; + pte_t ptent, oldpte; + + oldpte = modify_prot_start_ptes(vma, addr, ptep, nr_ptes); + ptent = pte_modify(oldpte, newprot); + + if (uffd_wp) + ptent = pte_mkuffd_wp(ptent); + else if (uffd_wp_resolve) + ptent = pte_clear_uffd_wp(ptent); + + /* + * In some writable, shared mappings, we might want + * to catch actual write access -- see + * vma_wants_writenotify(). + * + * In all writable, private mappings, we have to + * properly handle COW. 
+ * + * In both cases, we can sometimes still change PTEs + * writable and avoid the write-fault handler, for + * example, if a PTE is already dirty and no other + * COW or special handling is required. + */ + if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) && + !pte_write(ptent)) + set_write_prot_commit_flush_ptes(vma, folio, page, + addr, ptep, oldpte, ptent, nr_ptes, tlb); + else + prot_commit_flush_ptes(vma, addr, ptep, oldpte, ptent, + nr_ptes, /* idx = */ 0, /* set_write = */ false, tlb); +} + static long change_pte_range(struct mmu_gather *tlb, struct vm_area_struct *vma, pmd_t *pmd, unsigned long addr, unsigned long end, pgprot_t newprot, unsigned long cp_flags) @@ -287,7 +326,6 @@ static long change_pte_range(struct mmu_gather *tlb, bool is_private_single_threaded; bool prot_numa = cp_flags & MM_CP_PROT_NUMA; bool uffd_wp = cp_flags & MM_CP_UFFD_WP; - bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE; int nr_ptes; tlb_change_page_size(tlb, PAGE_SIZE); @@ -308,7 +346,6 @@ static long change_pte_range(struct mmu_gather *tlb, int max_nr_ptes = (end - addr) >> PAGE_SHIFT; struct folio *folio = NULL; struct page *page; - pte_t ptent; /* Already in the desired state. */ if (prot_numa && pte_protnone(oldpte)) @@ -334,34 +371,20 @@ static long change_pte_range(struct mmu_gather *tlb, nr_ptes = mprotect_folio_pte_batch(folio, pte, oldpte, max_nr_ptes, flags); - oldpte = modify_prot_start_ptes(vma, addr, pte, nr_ptes); - ptent = pte_modify(oldpte, newprot); - - if (uffd_wp) - ptent = pte_mkuffd_wp(ptent); - else if (uffd_wp_resolve) - ptent = pte_clear_uffd_wp(ptent); - /* - * In some writable, shared mappings, we might want - * to catch actual write access -- see - * vma_wants_writenotify(). - * - * In all writable, private mappings, we have to - * properly handle COW. - * - * In both cases, we can sometimes still change PTEs - * writable and avoid the write-fault handler, for - * example, if a PTE is already dirty and no other - * COW or special handling is required. + * Optimize for the small-folio common case by + * special-casing it here. Compiler constant propagation + * plus copious amounts of __always_inline does wonders. */ - if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) && - !pte_write(ptent)) - set_write_prot_commit_flush_ptes(vma, folio, page, - addr, pte, oldpte, ptent, nr_ptes, tlb); - else - prot_commit_flush_ptes(vma, addr, pte, oldpte, ptent, - nr_ptes, /* idx = */ 0, /* set_write = */ false, tlb); + if (likely(nr_ptes == 1)) { + change_present_ptes(tlb, vma, addr, pte, 1, + end, newprot, folio, page, cp_flags); + } else { + change_present_ptes(tlb, vma, addr, pte, + nr_ptes, end, newprot, folio, page, + cp_flags); + } + pages += nr_ptes; } else if (pte_none(oldpte)) { /* -- cgit v1.2.3 From 9a8ea3c1cb251d4fc354d031e649da099140c4f4 Mon Sep 17 00:00:00 2001 From: Kevin Brodsky Date: Tue, 7 Apr 2026 13:51:33 +0100 Subject: docs: proc: document ProtectionKey in smaps The ProtectionKey entry was added in v4.9; back then it was x86-specific, but it now lives in generic code and applies to all architectures supporting pkeys (currently x86, power, arm64). Time to document it: add a paragraph to proc.rst about the ProtectionKey entry. 
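As a quick way to exercise the documented field (assuming a kernel and CPU with pkeys support; on other systems the line is simply absent), a small sketch that dumps each VMA's ProtectionKey entry from the calling process's smaps could look like this:

#include <stdio.h>
#include <string.h>

int main(void)
{
	char line[256];
	FILE *fp = fopen("/proc/self/smaps", "r");

	if (!fp) {
		perror("fopen");
		return 1;
	}
	/* Print each VMA's ProtectionKey entry, if the kernel exposes one. */
	while (fgets(line, sizeof(line), fp)) {
		if (strncmp(line, "ProtectionKey:", 14) == 0)
			fputs(line, stdout);
	}
	fclose(fp);
	return 0;
}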
[akpm@linux-foundation.org: s/system/hardware/, per review discussion] [akpm@linux-foundation.org: s/hardware/CPU/] Link: https://lore.kernel.org/20260407125133.564182-1-kevin.brodsky@arm.com Signed-off-by: Kevin Brodsky Reported-by: Yury Khrustalev Acked-by: Vlastimil Babka (SUSE) Reviewed-by: David Hildenbrand (Arm) Reviewed-by: Lorenzo Stoakes Acked-by: Dave Hansen Cc: Jonathan Corbet Cc: Kevin Brodsky Cc: Marc Rutland Cc: Shuah Khan Cc: Randy Dunlap Signed-off-by: Andrew Morton --- Documentation/filesystems/proc.rst | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst index b0c0d1b45b99..628364b0f69f 100644 --- a/Documentation/filesystems/proc.rst +++ b/Documentation/filesystems/proc.rst @@ -549,6 +549,10 @@ does not take into account swapped out page of underlying shmem objects. naturally aligned THP pages of any currently enabled size. 1 if true, 0 otherwise. +If both the kernel and the CPU support protection keys (pkeys), +"ProtectionKey" indicates the memory protection key associated with the +virtual memory area. + "VmFlags" field deserves a separate description. This member represents the kernel flags associated with the particular virtual memory area in two letter encoded manner. The codes are the following: -- cgit v1.2.3 From 2f529e73d72048743b6eaa241da6ac2bcb28099e Mon Sep 17 00:00:00 2001 From: Andrew Stellman Date: Tue, 7 Apr 2026 11:30:27 -0400 Subject: zram: reject unrecognized type= values in recompress_store() MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit recompress_store() parses the type= parameter with three if statements checking for "idle", "huge", and "huge_idle". An unrecognized value silently falls through with mode left at 0, causing the recompression pass to run with no slot filter — processing all slots instead of the intended subset. Add a !mode check after the type parsing block to return -EINVAL for unrecognized values, consistent with the function's other parameter validation. Link: https://lore.kernel.org/20260407153027.42425-1-astellman@stellman-greene.com Signed-off-by: Andrew Stellman Suggested-by: Sergey Senozhatsky Reviewed-by: Sergey Senozhatsky Cc: Jens Axboe Cc: Minchan Kim Signed-off-by: Andrew Morton --- drivers/block/zram/zram_drv.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c index 43b68fdd95d6..aebc710f0d6a 100644 --- a/drivers/block/zram/zram_drv.c +++ b/drivers/block/zram/zram_drv.c @@ -2546,6 +2546,8 @@ static ssize_t recompress_store(struct device *dev, mode = RECOMPRESS_HUGE; if (!strcmp(val, "huge_idle")) mode = RECOMPRESS_IDLE | RECOMPRESS_HUGE; + if (!mode) + return -EINVAL; continue; } -- cgit v1.2.3 From c45b354911d01565156e38d7f6bc07edb51fc34c Mon Sep 17 00:00:00 2001 From: Thorsten Blum Date: Thu, 9 Apr 2026 12:54:40 +0200 Subject: mm/hugetlb: fix early boot crash on parameters without '=' separator If hugepages, hugepagesz, or default_hugepagesz are specified on the kernel command line without the '=' separator, early parameter parsing passes NULL to hugetlb_add_param(), which dereferences it in strlen() and can crash the system during early boot. Reject NULL values in hugetlb_add_param() and return -EINVAL instead. 
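A minimal userspace sketch of the failure mode and the defensive check (illustrative only; the handler below is a stand-in, not the kernel code): a parameter given without '=' hands the handler a NULL value string, which must be rejected before strlen() touches it.

#include <stdio.h>
#include <string.h>

/* Stand-in for an early-param handler that stores its value string. */
static int add_param(const char *s)
{
	if (!s)			/* e.g. "hugepages" given without "=value" */
		return -1;	/* reject instead of crashing in strlen() */
	printf("queued param of length %zu\n", strlen(s));
	return 0;
}

int main(void)
{
	add_param("16");	/* "hugepages=16" -> value string "16" */
	add_param(NULL);	/* bare "hugepages" -> no value, now rejected */
	return 0;
}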
Link: https://lore.kernel.org/20260409105437.108686-4-thorsten.blum@linux.dev Fixes: 5b47c02967ab ("mm/hugetlb: convert cmdline parameters from setup to early") Signed-off-by: Thorsten Blum Reviewed-by: Muchun Song Cc: David Hildenbrand Cc: Frank van der Linden Cc: Oscar Salvador Cc: Signed-off-by: Andrew Morton --- mm/hugetlb.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 88009cd2a846..e8024574a2d4 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -4226,6 +4226,9 @@ static __init int hugetlb_add_param(char *s, int (*setup)(char *)) size_t len; char *p; + if (!s) + return -EINVAL; + if (hugetlb_param_index >= HUGE_MAX_CMDLINE_ARGS) return -EINVAL; -- cgit v1.2.3 From 2b19bf05719b73f7d04d7d27ec423b459b868852 Mon Sep 17 00:00:00 2001 From: Breno Leitao Date: Thu, 9 Apr 2026 05:26:36 -0700 Subject: mm/vmstat: fix vmstat_shepherd double-scheduling vmstat_update MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit vmstat_shepherd uses delayed_work_pending() to check whether vmstat_update is already scheduled for a given CPU before queuing it. However, delayed_work_pending() only tests WORK_STRUCT_PENDING_BIT, which is cleared the moment a worker thread picks up the work to execute it. This means that while vmstat_update is actively running on a CPU, delayed_work_pending() returns false. If need_update() also returns true at that point (per-cpu counters not yet zeroed mid-flush), the shepherd queues a second invocation with delay=0, causing vmstat_update to run again immediately after finishing. On a 72-CPU system this race is readily observable: before the fix, many CPUs show invocation gaps well below 500 jiffies (the minimum round_jiffies_relative() can produce), with the most extreme cases reaching 0 jiffies—vmstat_update called twice within the same jiffy. Fix this by replacing delayed_work_pending() with work_busy(), which returns non-zero for both WORK_BUSY_PENDING (timer armed or work queued) and WORK_BUSY_RUNNING (work currently executing). The shepherd now correctly skips a CPU in all busy states. After the fix, all sub-jiffy and most sub-100-jiffie gaps disappear. The remaining early invocations have gaps in the 700–999 jiffie range, attributable to round_jiffies_relative() aligning to a nearer jiffie-second boundary rather than to this race. Each spurious vmstat_update invocation has a measurable side effect: refresh_cpu_vm_stats() calls decay_pcp_high() for every zone, which drains idle per-CPU pages back to the buddy allocator via free_pcppages_bulk(), taking the zone spinlock each time. Eliminating the double-scheduling therefore reduces zone lock contention directly. On a 72-CPU stress-ng workload measured with perf lock contention: free_pcppages_bulk contention count: ~55% reduction free_pcppages_bulk total wait time: ~57% reduction free_pcppages_bulk max wait time: ~47% reduction Note: work_busy() is inherently racy—between the check and the subsequent queue_delayed_work_on() call, vmstat_update can finish execution, leaving the work neither pending nor running. In that narrow window the shepherd can still queue a second invocation. After the fix, this residual race is rare and produces only occasional small gaps, a significant improvement over the systematic double-scheduling seen with delayed_work_pending(). 
Link: https://lore.kernel.org/20260409-vmstat-v2-1-e9d9a6db08ad@debian.org Fixes: 7b8da4c7f07774 ("vmstat: get rid of the ugly cpu_stat_off variable") Signed-off-by: Breno Leitao Reviewed-by: Vlastimil Babka (SUSE) Acked-by: Michal Hocko Reviewed-by: Dmitry Ilvokhin Cc: Christoph Lameter Cc: David Hildenbrand Cc: Liam Howlett Cc: Lorenzo Stoakes Cc: Mike Rapoport Cc: Shakeel Butt Cc: Suren Baghdasaryan Signed-off-by: Andrew Morton --- mm/vmstat.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/vmstat.c b/mm/vmstat.c index 2370c6fb1fcd..cc5fdc0d0f29 100644 --- a/mm/vmstat.c +++ b/mm/vmstat.c @@ -2139,7 +2139,7 @@ static void vmstat_shepherd(struct work_struct *w) if (cpu_is_isolated(cpu)) continue; - if (!delayed_work_pending(dw) && need_update(cpu)) + if (!work_busy(&dw->work) && need_update(cpu)) queue_delayed_work_on(cpu, mm_percpu_wq, dw, 0); } -- cgit v1.2.3 From 161ce69c2c89781784b945d8e281ff2da9dede9c Mon Sep 17 00:00:00 2001 From: "Denis M. Karpov" Date: Thu, 9 Apr 2026 13:33:45 +0300 Subject: userfaultfd: allow registration of ranges below mmap_min_addr The current implementation of validate_range() in fs/userfaultfd.c performs a hard check against mmap_min_addr. This is redundant because UFFDIO_REGISTER operates on memory ranges that must already be backed by a VMA. Enforcing mmap_min_addr or capability checks again in userfaultfd is unnecessary and prevents applications like binary compilers from using UFFD for valid memory regions mapped by application. Remove the redundant check for mmap_min_addr. We started using UFFD instead of the classic mprotect approach in the binary translator to track application writes. During development, we encountered this bug. The translator cannot control where the translated application chooses to map its memory and if the app requires a low-address area, UFFD fails, whereas mprotect would work just fine. I believe this is a genuine logic bug rather than an improvement, and I would appreciate including the fix in stable. Link: https://lore.kernel.org/20260409103345.15044-1-komlomal@gmail.com Fixes: 86039bd3b4e6 ("userfaultfd: add new syscall to provide memory externalization") Signed-off-by: Denis M. Karpov Reviewed-by: Lorenzo Stoakes Acked-by: Harry Yoo (Oracle) Reviewed-by: Pedro Falcato Reviewed-by: Liam R. Howlett Reviewed-by: Mike Rapoport (Microsoft) Cc: Alexander Viro Cc: Al Viro Cc: Christian Brauner Cc: Jan Kara Cc: Jann Horn Cc: Peter Xu Cc: Signed-off-by: Andrew Morton --- fs/userfaultfd.c | 2 -- 1 file changed, 2 deletions(-) diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c index bdc84e5219cd..4b53dc4a3266 100644 --- a/fs/userfaultfd.c +++ b/fs/userfaultfd.c @@ -1238,8 +1238,6 @@ static __always_inline int validate_unaligned_range( return -EINVAL; if (!len) return -EINVAL; - if (start < mmap_min_addr) - return -EINVAL; if (start >= task_size) return -EINVAL; if (len > task_size - start) -- cgit v1.2.3 From d432e8847f58f825dada827eb492c34f65cdc82a Mon Sep 17 00:00:00 2001 From: Cao Ruichuang Date: Fri, 10 Apr 2026 12:41:39 +0800 Subject: selftests: mm: skip charge_reserved_hugetlb without killall charge_reserved_hugetlb.sh tears down background writers with killall from psmisc. Minimal Ubuntu images do not always provide that tool, so the selftest fails in cleanup for an environment reason rather than for the hugetlb behavior it is trying to cover. Skip the test when killall is unavailable, similar to the existing root check, so these environments report the dependency clearly instead of failing the test. 
Link: https://lore.kernel.org/20260410044139.67480-1-create0818@163.com Signed-off-by: Cao Ruichuang Acked-by: Mike Rapoport (Microsoft) Cc: David Hildenbrand Cc: "Liam R. Howlett" Cc: Lorenzo Stoakes Cc: Michal Hocko Cc: Shuah Khan Cc: Suren Baghdasaryan Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- tools/testing/selftests/mm/charge_reserved_hugetlb.sh | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/tools/testing/selftests/mm/charge_reserved_hugetlb.sh b/tools/testing/selftests/mm/charge_reserved_hugetlb.sh index 447769657634..44f4e703deb9 100755 --- a/tools/testing/selftests/mm/charge_reserved_hugetlb.sh +++ b/tools/testing/selftests/mm/charge_reserved_hugetlb.sh @@ -11,6 +11,11 @@ if [[ $(id -u) -ne 0 ]]; then exit $ksft_skip fi +if ! command -v killall >/dev/null 2>&1; then + echo "killall not available. Skipping..." + exit $ksft_skip +fi + nr_hugepgs=$(cat /proc/sys/vm/nr_hugepages) fault_limit_file=limit_in_bytes -- cgit v1.2.3 From 57294a97bdd115b06ac05486e0e4a4f50a21ab7b Mon Sep 17 00:00:00 2001 From: Davidlohr Bueso Date: Wed, 11 Feb 2026 17:46:11 -0800 Subject: mm/migrate_device: remove dead migration entry check in migrate_vma_collect_huge_pmd() MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The softleaf_is_migration() check is unreachable as entries that are not device_private are filtered out. Similarly, the PTE-level equivalent in migrate_vma_collect_pmd() skips migration entries. This dead branch also contained a double spin_unlock(ptl) bug. Link: https://lore.kernel.org/20260212014611.416695-1-dave@stgolabs.net Fixes: a30b48bf1b244 ("mm/migrate_device: implement THP migration of zone device pages") Signed-off-by: Davidlohr Bueso Suggested-by: Matthew Brost Reviewed-by: Alistair Popple Acked-by: Balbir Singh Acked-by: David Hildenbrand (Arm) Cc: Byungchul Park Cc: Gregory Price Cc: Jason Gunthorpe Cc: John Hubbard Cc: Joshua Hahn Cc: Mathew Brost Cc: Rakie Kim Cc: Ying Huang Cc: Zi Yan Cc: Thomas Hellström Signed-off-by: Andrew Morton --- mm/migrate_device.c | 6 ------ 1 file changed, 6 deletions(-) diff --git a/mm/migrate_device.c b/mm/migrate_device.c index 2912eba575d5..fbfe5715f635 100644 --- a/mm/migrate_device.c +++ b/mm/migrate_device.c @@ -175,12 +175,6 @@ static int migrate_vma_collect_huge_pmd(pmd_t *pmdp, unsigned long start, return migrate_vma_collect_skip(start, end, walk); } - if (softleaf_is_migration(entry)) { - softleaf_entry_wait_on_locked(entry, ptl); - spin_unlock(ptl); - return -EAGAIN; - } - if (softleaf_is_device_private_write(entry)) write = MIGRATE_PFN_WRITE; } else { -- cgit v1.2.3 From 60087b49f8e7289681586609fc1d012615354754 Mon Sep 17 00:00:00 2001 From: Pasha Tatashin Date: Mon, 13 Apr 2026 12:11:46 +0000 Subject: MAINTAINERS: update kexec/kdump maintainers entries Update KEXEC and KDUMP maintainer entries by adding the live update group maintainers. Remove Vivek Goyal due to inactivity to keep the MAINTAINERS file up-to-date, and add Vivek to the CREDITS file to recognize their contributions. 
Link: https://lore.kernel.org/20260413121146.49215-1-pasha.tatashin@soleen.com Signed-off-by: Pasha Tatashin Acked-by: Pratyush Yadav Acked-by: Mike Rapoport (Microsoft) Cc: Diego Viola Cc: Jakub Kacinski Cc: Magnus Karlsson Cc: Mark Brown Cc: Martin Kepplinger Cc: Masahiro Yamada Signed-off-by: Andrew Morton --- CREDITS | 4 ++++ MAINTAINERS | 7 ++++++- 2 files changed, 10 insertions(+), 1 deletion(-) diff --git a/CREDITS b/CREDITS index 9091bac3d2da..035f70bae0cc 100644 --- a/CREDITS +++ b/CREDITS @@ -1456,6 +1456,10 @@ N: Andy Gospodarek E: andy@greyhouse.net D: Maintenance and contributions to the network interface bonding driver. +N: Vivek Goyal +E: vgoyal@redhat.com +D: KDUMP, KEXEC, and VIRTIO FILE SYSTEM + N: Wolfgang Grandegger E: wg@grandegger.com D: Controller Area Network (device drivers) diff --git a/MAINTAINERS b/MAINTAINERS index 16874c32e288..cc42f7997a7d 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -13786,7 +13786,9 @@ F: scripts/Makefile.kcsan KDUMP M: Andrew Morton M: Baoquan He -R: Vivek Goyal +M: Mike Rapoport +M: Pasha Tatashin +M: Pratyush Yadav R: Dave Young L: kexec@lists.infradead.org S: Maintained @@ -14102,6 +14104,9 @@ F: include/linux/kernfs.h KEXEC M: Andrew Morton M: Baoquan He +M: Mike Rapoport +M: Pasha Tatashin +M: Pratyush Yadav L: kexec@lists.infradead.org W: http://kernel.org/pub/linux/utils/kernel/kexec/ F: include/linux/kexec.h -- cgit v1.2.3 From 320c7234d1d1d3552cbbf58886f4219cc1a5ba48 Mon Sep 17 00:00:00 2001 From: "Pratyush Yadav (Google)" Date: Tue, 14 Apr 2026 12:17:18 +0000 Subject: MAINTAINERS: update KHO and LIVE UPDATE maintainers Patch series "MAINTAINERS: update KHO and LIVE UPDATE entries". This series contains some updates for the Kexec Handover (KHO) and Live update entries. Patch 1 updates the maintainers list and adds the liveupdate tree. Patches 2 and 3 clean up stale files in the list. This patch (of 3): I have been helping out with reviewing and developing KHO. I would also like to help maintain it. Change my entry from R to M for KHO and live update. Alex has been inactive for a while, so to avoid over-crowding the KHO entry and to keep the information up-to-date, move his entry from M to R. We also now have a tree for KHO and live update at liveupdate/linux.git where we plan to start maintaining those subsystems and start queuing the patches. List that in the entries as well. 
Link: https://lore.kernel.org/20260414121752.1912847-1-pratyush@kernel.org Link: https://lore.kernel.org/20260414121752.1912847-2-pratyush@kernel.org Signed-off-by: Pratyush Yadav (Google) Reviewed-by: Alexander Graf Reviewed-by: Pasha Tatashin Acked-by: Mike Rapoport (Microsoft) Cc: Baoquan He Cc: David Hildenbrand Signed-off-by: Andrew Morton --- CREDITS | 4 ++++ MAINTAINERS | 8 +++++--- 2 files changed, 9 insertions(+), 3 deletions(-) diff --git a/CREDITS b/CREDITS index 035f70bae0cc..9c33094c0178 100644 --- a/CREDITS +++ b/CREDITS @@ -1460,6 +1460,10 @@ N: Vivek Goyal E: vgoyal@redhat.com D: KDUMP, KEXEC, and VIRTIO FILE SYSTEM +N: Alexander Graf +E: graf@amazon.com +D: Kexec Handover (KHO) + N: Wolfgang Grandegger E: wg@grandegger.com D: Controller Area Network (device drivers) diff --git a/MAINTAINERS b/MAINTAINERS index cc42f7997a7d..1422aa920964 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -14114,13 +14114,14 @@ F: include/uapi/linux/kexec.h F: kernel/kexec* KEXEC HANDOVER (KHO) -M: Alexander Graf M: Mike Rapoport M: Pasha Tatashin -R: Pratyush Yadav +M: Pratyush Yadav +R: Alexander Graf L: kexec@lists.infradead.org L: linux-mm@kvack.org S: Maintained +T: git git://git.kernel.org/pub/scm/linux/kernel/git/liveupdate/linux.git F: Documentation/admin-guide/mm/kho.rst F: Documentation/core-api/kho/* F: include/linux/kexec_handover.h @@ -14807,9 +14808,10 @@ F: tools/testing/selftests/livepatch/ LIVE UPDATE M: Pasha Tatashin M: Mike Rapoport -R: Pratyush Yadav +M: Pratyush Yadav L: linux-kernel@vger.kernel.org S: Maintained +T: git git://git.kernel.org/pub/scm/linux/kernel/git/liveupdate/linux.git F: Documentation/core-api/liveupdate.rst F: Documentation/mm/memfd_preservation.rst F: Documentation/userspace-api/liveupdate.rst -- cgit v1.2.3 From de61e40bcbb84546972191fb70ef64c5aecdda68 Mon Sep 17 00:00:00 2001 From: "Pratyush Yadav (Google)" Date: Tue, 14 Apr 2026 12:17:19 +0000 Subject: MAINTAINERS: drop include/linux/kho/abi/ from KHO The KHO entry already includes include/linux/kho. Listing its subdirectory is redundant. Link: https://lore.kernel.org/20260414121752.1912847-3-pratyush@kernel.org Signed-off-by: Pratyush Yadav (Google) Reviewed-by: Pasha Tatashin Acked-by: Mike Rapoport (Microsoft) Reviewed-by: David Hildenbrand (Arm) Reviewed-by: SeongJae Park Cc: Alexander Graf Cc: Baoquan He Signed-off-by: Andrew Morton --- MAINTAINERS | 1 - 1 file changed, 1 deletion(-) diff --git a/MAINTAINERS b/MAINTAINERS index 1422aa920964..5f8f8f1b9030 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -14126,7 +14126,6 @@ F: Documentation/admin-guide/mm/kho.rst F: Documentation/core-api/kho/* F: include/linux/kexec_handover.h F: include/linux/kho/ -F: include/linux/kho/abi/ F: kernel/liveupdate/kexec_handover* F: lib/test_kho.c F: tools/testing/selftests/kho/ -- cgit v1.2.3 From b5a9ac2bb0e4f8a2a03c395c5176a85cea273c15 Mon Sep 17 00:00:00 2001 From: "Pratyush Yadav (Google)" Date: Tue, 14 Apr 2026 12:17:20 +0000 Subject: MAINTAINERS: drop include/linux/liveupdate from LIVE UPDATE The directory does not exist any more. 
Link: https://lore.kernel.org/20260414121752.1912847-4-pratyush@kernel.org Signed-off-by: Pratyush Yadav (Google) Reviewed-by: Pasha Tatashin Acked-by: Mike Rapoport (Microsoft) Reviewed-by: David Hildenbrand (Arm) Reviewed-by: SeongJae Park Cc: Alexander Graf Cc: Baoquan He Signed-off-by: Andrew Morton --- MAINTAINERS | 1 - 1 file changed, 1 deletion(-) diff --git a/MAINTAINERS b/MAINTAINERS index 5f8f8f1b9030..d6f1e9751d95 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -14816,7 +14816,6 @@ F: Documentation/mm/memfd_preservation.rst F: Documentation/userspace-api/liveupdate.rst F: include/linux/kho/abi/ F: include/linux/liveupdate.h -F: include/linux/liveupdate/ F: include/uapi/linux/liveupdate.h F: kernel/liveupdate/ F: lib/tests/liveupdate.c -- cgit v1.2.3 From e86ffbe7dfdd869498f1c44edd9ff230286d514e Mon Sep 17 00:00:00 2001 From: Dave Young Date: Wed, 15 Apr 2026 11:29:26 +0800 Subject: MAINTAINERS: update Dave's kdump reviewer email address Use my personal email address, as the Red Hat work will stop soon. Link: https://lore.kernel.org/ad8GFhh3SI1wb7IC@darkstar.users.ipa.redhat.com Signed-off-by: Dave Young Acked-by: Dave Young Signed-off-by: Andrew Morton --- MAINTAINERS | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/MAINTAINERS b/MAINTAINERS index d6f1e9751d95..f065598caa43 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -13789,7 +13789,7 @@ M: Baoquan He M: Mike Rapoport M: Pasha Tatashin M: Pratyush Yadav -R: Dave Young +R: Dave Young L: kexec@lists.infradead.org S: Maintained W: http://lse.sourceforge.net/kdump/ -- cgit v1.2.3 From 3de705a43a465fa92a45c0a494ec13bf0bad2642 Mon Sep 17 00:00:00 2001 From: Arnd Bergmann Date: Tue, 14 Apr 2026 08:51:58 +0200 Subject: mm/vmscan: avoid false-positive -Wuninitialized warning When the -fsanitize=bounds sanitizer is enabled, gcc-16 sometimes runs into a corner case in the read_ctrl_pos() function, where it sees possible undefined behavior from the 'tier' index overflowing, presumably in the case that this was called with a negative tier: In function 'get_tier_idx', inlined from 'isolate_folios' at mm/vmscan.c:4671:14: mm/vmscan.c: In function 'isolate_folios': mm/vmscan.c:4645:29: error: 'pv.refaulted' is used uninitialized [-Werror=uninitialized] Part of the problem seems to be that read_ctrl_pos() has unusual calling conventions since commit 37a260870f2c ("mm/mglru: rework type selection") where passing MAX_NR_TIERS makes it accumulate all tiers but passing a smaller positive number makes it read a single tier instead. Shut up the warning by adding a fake initialization to the two instances of this variable that can run into that corner case. 
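The pattern used by the fix can be shown with a small self-contained sketch (the struct and helper below are invented for the example): an empty-brace initializer zero-fills every member, so a later read is well defined even when a helper bails out before filling the struct, and the compiler no longer flags a maybe-uninitialized use.

#include <stdio.h>

struct pos {
	long refaulted;
	long total;
};

/* May leave *p untouched, like a helper that bails out early. */
static void maybe_fill(struct pos *p, int fill)
{
	if (!fill)
		return;
	p->refaulted = 42;
	p->total = 100;
}

int main(void)
{
	struct pos pv = {};	/* zero-init, matching the "= {}" idiom in the patch */

	maybe_fill(&pv, 0);
	printf("refaulted=%ld total=%ld\n", pv.refaulted, pv.total);
	return 0;
}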
Link: https://lore.kernel.org/all/CAJHvVcjtFW86o5FoQC8MMEXCHAC0FviggaQsd5EmiCHP+1fBpg@mail.gmail.com/ Link: https://lore.kernel.org/20260414065206.3236176-1-arnd@kernel.org Signed-off-by: Arnd Bergmann Cc: Axel Rasmussen Cc: Baolin Wang Cc: Barry Song Cc: David Hildenbrand Cc: Davidlohr Bueso Cc: Johannes Weiner Cc: Kairui Song Cc: Koichiro Den Cc: Lorenzo Stoakes Cc: Michal Hocko Cc: Muchun Song Cc: Qi Zheng Cc: Shakeel Butt Cc: Wei Xu Cc: Yuanchu Xie Signed-off-by: Andrew Morton --- mm/vmscan.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/mm/vmscan.c b/mm/vmscan.c index 5a8c8fcccbfc..bd1b1aa12581 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -4760,7 +4760,7 @@ static int scan_folios(unsigned long nr_to_scan, struct lruvec *lruvec, static int get_tier_idx(struct lruvec *lruvec, int type) { int tier; - struct ctrl_pos sp, pv; + struct ctrl_pos sp, pv = {}; /* * To leave a margin for fluctuations, use a larger gain factor (2:3). @@ -4779,7 +4779,7 @@ static int get_tier_idx(struct lruvec *lruvec, int type) static int get_type_to_scan(struct lruvec *lruvec, int swappiness) { - struct ctrl_pos sp, pv; + struct ctrl_pos sp, pv = {}; if (swappiness <= MIN_SWAPPINESS + 1) return LRU_GEN_FILE; -- cgit v1.2.3 From 0b5e8d7999076ac3c490fc18376a404e2626abff Mon Sep 17 00:00:00 2001 From: Jan Kara Date: Wed, 15 Apr 2026 19:40:40 +0200 Subject: MAINTAINERS: add page cache reviewer Add myself as a page cache reviewer since I tend to review changes in these areas anyway. [akpm@linux-foundation.org: add linux-mm@kvack.org] Link: https://lore.kernel.org/20260415174039.13016-2-jack@suse.cz Signed-off-by: Jan Kara Acked-by: Matthew Wilcox (Oracle) Acked-by: Lorenzo Stoakes Signed-off-by: Andrew Morton --- MAINTAINERS | 2 ++ 1 file changed, 2 insertions(+) diff --git a/MAINTAINERS b/MAINTAINERS index f065598caa43..ab54a9c77603 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -19964,7 +19964,9 @@ F: kernel/padata.c PAGE CACHE M: Matthew Wilcox (Oracle) +R: Jan Kara L: linux-fsdevel@vger.kernel.org +L: linux-mm@kvack.org S: Supported T: git git://git.infradead.org/users/willy/pagecache.git F: Documentation/filesystems/locking.rst -- cgit v1.2.3