summaryrefslogtreecommitdiff
path: root/mm
AgeCommit message (Collapse)Author
2026-06-04memcg: int16_t for cached slab statsShakeel Butt
Currently struct obj_stock_pcp stores cached slab stats in 'int' which is 4 bytes per counter on 64-bit machines. Switch them to int16_t to shrink the cached metadata. The existing PAGE_SIZE flush in __account_obj_stock() bounds *bytes at PAGE_SIZE on 4KiB and 16KiB page archs, well within int16_t. On 64KiB pages PAGE_SIZE is well above S16_MAX so that flush never fires, and a sufficiently long run of accumulations would overflow the cache. Add an explicit S16_MAX guard before each add: when the next add would push abs(*bytes) past S16_MAX, fold the cached value into @nr and flush directly via mod_objcg_mlstate() before the accumulation. Link: https://lore.kernel.org/20260526033931.1760588-4-shakeel.butt@linux.dev Fixes: 01b9da291c49 ("mm: memcontrol: convert objcg to be per-memcg per-node type") Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Tested-by: kernel test robot <oliver.sang@intel.com> Reviewed-by: Harry Yoo (Oracle) <harry@kernel.org> Acked-by: Qi Zheng <qi.zheng@linux.dev> Acked-by: Muchun Song <muchun.song@linux.dev> Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-04memcg: uint16_t for nr_bytes in obj_stock_pcpShakeel Butt
Currently struct obj_stock_pcp stores nr_bytes in an 'unsigned int' which is 4 bytes on 64-bit machines. Switch the field to uint16_t to shrink the per-CPU cache. The kernel supports PAGE_SIZE_4KB, _8KB, _16KB, _32KB, _64KB and _256KB (see HAVE_PAGE_SIZE_* in arch/Kconfig). After the PAGE_SIZE-aligned flush in __refill_obj_stock(), the sub-page remainder fits in uint16_t up through 64KiB pages where PAGE_SIZE - 1 == U16_MAX, but on 256KiB pages PAGE_SIZE - 1 == 0x3FFFF exceeds U16_MAX. The accumulator also needs to stay within uint16_t between page-aligned flushes on 64KiB pages where PAGE_SIZE itself is U16_MAX + 1. Accumulate the new total in an 'unsigned int' local, then on PAGE_SHIFT <= 16 flush whenever the accumulator would hit U16_MAX; together with the existing allow_uncharge flush at PAGE_SIZE this keeps the uint16_t safe. On configs with PAGE_SHIFT > 16 (PAGE_SIZE_256KB on hexagon and powerpc 44x, both 32-bit), uint16_t cannot represent the sub-page remainder. Define obj_stock_bytes_t as 'unsigned int' on those archs so nr_bytes can hold the full remainder and the normal page-boundary flush in __refill_obj_stock() and the page extraction in drain_obj_stock() both work correctly. The single-cache-line layout target only applies to PAGE_SHIFT <= 16; those archs are 32-bit embedded and not the optimization target. Link: https://lore.kernel.org/20260526033931.1760588-3-shakeel.butt@linux.dev Fixes: 01b9da291c49 ("mm: memcontrol: convert objcg to be per-memcg per-node type") Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Tested-by: kernel test robot <oliver.sang@intel.com> Reviewed-by: Harry Yoo (Oracle) <harry@kernel.org> Acked-by: Qi Zheng <qi.zheng@linux.dev> Acked-by: Muchun Song <muchun.song@linux.dev> Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-04memcg: store node_id instead of pglist_data pointerShakeel Butt
Patch series "memcg: shrink obj_stock_pcp and cache multiple objcgs", v3. Commit 01b9da291c49 ("mm: memcontrol: convert objcg to be per-memcg per-node type") split a memcg's single obj_cgroup into one per NUMA node so that reparenting LRU folios can take per-node lru locks. As a side effect, the per-CPU obj_stock_pcp -- which caches a single cached_objcg pointer -- thrashes on workloads where threads of the same memcg run on different NUMA nodes. The kernel test robot reported a 67.7% regression on stress-ng.switch.ops_per_sec from this pattern. Commit d0211878ce06 ("memcg: cache obj_stock by memcg, not by objcg pointer") landed as a temporary fix by treating sibling per-node objcgs as equivalent for the cache lookup, intended to be reverted once per-node kmem accounting is introduced. This series takes a more general approach: cache multiple objcgs per CPU using the multi-slot pattern memcg_stock_pcp already uses, so the per-node objcg variants of one memcg can all coexist in the stock without ever forcing a drain. The temporary fix can then be reverted. To avoid increasing the per-CPU cache footprint, the first three patches shrink the existing single-slot obj_stock_pcp fields. The final patch converts cached_objcg and nr_bytes into NR_OBJ_STOCK=5 slot arrays and reorders the struct so the entire consume/refill/account hot path fits within a single 64-byte cache line on non-debug 64-bit builds (verified with pahole). This patch (of 4): The struct obj_stock_pcp stores a pointer to pglist_data for the slab stats cached on the cpu. On 64-bit machines, this costs 8 bytes. The pointer is not strictly required: NODE_DATA() can recover it from the node id. Replace cached_pgdat with int16_t node_id and use NUMA_NO_NODE as the "no stats cached" sentinel. At the moment all the archs limit MAX_NUMNODES to 1024 so int16_t is plenty; a BUILD_BUG_ON() makes sure we notice if that ever changes. Link: https://lore.kernel.org/20260526033931.1760588-1-shakeel.butt@linux.dev Link: https://lore.kernel.org/20260526033931.1760588-2-shakeel.butt@linux.dev Fixes: 01b9da291c49 ("mm: memcontrol: convert objcg to be per-memcg per-node type") Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Tested-by: kernel test robot <oliver.sang@intel.com> Acked-by: Muchun Song <muchun.song@linux.dev> Reviewed-by: Harry Yoo (Oracle) <harry@kernel.org> Acked-by: Qi Zheng <qi.zheng@linux.dev> Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-04mm: shmem: refactor thpsize_shmem_enabled_show() with helper arraysRan Xiaokai
Replace the hardcoded if/else chain of test_bit() calls and string literals in thpsize_shmem_enabled_show() with a loop over huge_shmem_orders_by_mode[] and huge_shmem_enabled_mode_strings[] arrays. This makes thpsize_shmem_enabled_show() consistent with thpsize_shmem_enabled_store() and eliminates duplicated mode name strings. Link: https://lore.kernel.org/20260525102700.68707-3-ranxiaokai627@163.com Signed-off-by: Ran Xiaokai <ran.xiaokai@zte.com.cn> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Barry Song <baohua@kernel.org> Reviewed-by: Breno Leitao <leitao@debian.org> Reviewed-by: Lorenzo Stoakes <ljs@kernel.org> Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com> Tested-by: Lance Yang <lance.yang@linux.dev> Acked-by: David Hildenbrand (arm) <david@kernel.org> Cc: Dev Jain <dev.jain@arm.com> Cc: Hugh Dickins <hughd@google.com> Cc: Liam R. Howlett <liam@infradead.org> Cc: Nico Pache <npache@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-04mm: shmem: refactor thpsize_shmem_enabled_store() with sysfs_match_string()Ran Xiaokai
Patch series "refactors thpsize_shmem_enabled_store() and thpsize_shmem_enabled_show()", v4. This patch (of 2): Inspired by commit 82d9ff648c6c ("mm: huge_memory: refactor anon_enabled_store() with set_anon_enabled_mode()"), refactor thpsize_shmem_enabled_store() using sysfs_match_string(). This eliminates the duplicated spin_lock/unlock(), set/clear_bit(), calls across all branches, reducing code duplication. Behavioral change: Call start_stop_khugepaged() only when the mode actually changes. If unchanged, call set_recommended_min_free_kbytes() to preserve legacy watermark behavior. This avoids unnecessary khugepaged restarts. Tested with selftests ./run_kselftest.sh -t mm:ksft_thp.sh, all test cases passed. Link: https://lore.kernel.org/20260525102700.68707-1-ranxiaokai627@163.com Link: https://lore.kernel.org/20260525102700.68707-2-ranxiaokai627@163.com Signed-off-by: Ran Xiaokai <ran.xiaokai@zte.com.cn> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Barry Song <baohua@kernel.org> Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com> Tested-by: Lance Yang <lance.yang@linux.dev> Acked-by: David Hildenbrand (arm) <david@kernel.org> Reviewed-by: Lorenzo Stoakes <ljs@kernel.org> Cc: Breno Leitao <leitao@debian.org> Cc: Dev Jain <dev.jain@arm.com> Cc: Hugh Dickins <hughd@google.com> Cc: Liam R. Howlett <liam@infradead.org> Cc: Nico Pache <npache@redhat.com> Cc: Ran Xiaokai <ran.xiaokai@zte.com.cn> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-04mm: make mmap_miss accounting symmetric for VM_SEQ_READUsama Arif
do_sync_mmap_readahead() skips both the mmap_miss increment and the MMAP_LOTSAMISS check for VM_SEQ_READ mappings, since sequential access is non-speculative and should always read ahead. The two decrement sites in do_async_mmap_readahead() and filemap_map_pages() do not mirror this skip, so concurrent faults on a VM_SEQ_READ mapping can still drive ra->mmap_miss down to zero through the decrement paths even though nothing in the sync path ever increments it. The counter itself is per-file (file->f_ra.mmap_miss), so it can be moved by any VMA mapping the file, not just the one currently faulting. Skip the decrement for VM_SEQ_READ in both decrement sites so the counter only moves for mappings that also participate in the increment side. No functional change for VM_SEQ_READ users, since the increment-side gate already prevents the counter from being consulted on their behalf, but it stops a VM_SEQ_READ mapping from biasing the counter for other mappings of the same file. Link: https://lore.kernel.org/20260525145751.2671248-1-usama.arif@linux.dev Signed-off-by: Usama Arif <usama.arif@linux.dev> Closes: https://lore.kernel.org/all/8edc8cd0-f65c-4456-9b3f-362e744c9a96@linux.dev/ Reviewed-by: William Kucharski <william.kucharski@linux.dev> Reviewed-by: Jan Kara <jack@suse.cz> Cc: David Hildenbrand <david@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Shakeel Butt <shakeel.butt@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-04mm/vmscan: unify writeback reclaim statistic and throttlingKairui Song
Currently MGLRU and non-MGLRU handle the reclaim statistic and writeback handling very differently, especially throttling. Basically MGLRU just ignored the throttling part. Let's just unify this part, use a helper to deduplicate the code so both setups will share the same behavior. Test using following reproducer using bash: echo "Setup a slow device using dm delay" dd if=/dev/zero of=/var/tmp/backing bs=1M count=2048 LOOP=$(losetup --show -f /var/tmp/backing) mkfs.ext4 -q $LOOP echo "0 $(blockdev --getsz $LOOP) delay $LOOP 0 0 $LOOP 0 1000" | \ dmsetup create slow_dev mkdir -p /mnt/slow && mount /dev/mapper/slow_dev /mnt/slow echo "Start writeback pressure" sync && echo 3 > /proc/sys/vm/drop_caches mkdir /sys/fs/cgroup/test_wb echo 128M > /sys/fs/cgroup/test_wb/memory.max (echo $BASHPID > /sys/fs/cgroup/test_wb/cgroup.procs && \ dd if=/dev/zero of=/mnt/slow/testfile bs=1M count=192) echo "Clean up" echo "0 $(blockdev --getsz $LOOP) error" | dmsetup load slow_dev dmsetup resume slow_dev umount -l /mnt/slow && sync dmsetup remove slow_dev Before this commit, `dd` will get OOM killed immediately if MGLRU is enabled. Classic LRU is fine. After this commit, throttling is now effective and no more spin on LRU or premature OOM. Stress test on other workloads also looks good. Global throttling is not here yet, we will fix that separately later. Link: https://lore.kernel.org/20260428-mglru-reclaim-v7-15-02fabb92dc43@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Suggested-by: Chen Ridong <chenridong@huaweicloud.com> Tested-by: Leno Hou <lenohou@gmail.com> Reviewed-by: Axel Rasmussen <axelrasmussen@google.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: David Stevens <stevensd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vernon Yang <vernon2gm@gmail.com> Cc: Wei Xu <weixugc@google.com> Cc: Yafang <laoar.shao@gmail.com> Cc: Yuanchu Xie <yuanchu@google.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-04mm/vmscan: remove sc->unqueued_dirtyKairui Song
No one is using it now, just remove it. Link: https://lore.kernel.org/20260428-mglru-reclaim-v7-14-02fabb92dc43@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Suggested-by: Axel Rasmussen <axelrasmussen@google.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Axel Rasmussen <axelrasmussen@google.com> Reviewed-by: Barry Song <baohua@kernel.org> Reviewed-by: Chen Ridong <chenridong@huaweicloud.com> Cc: Chris Li <chrisl@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: David Stevens <stevensd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Leno Hou <lenohou@gmail.com> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vernon Yang <vernon2gm@gmail.com> Cc: Wei Xu <weixugc@google.com> Cc: Yafang <laoar.shao@gmail.com> Cc: Yuanchu Xie <yuanchu@google.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-04mm/vmscan: remove sc->file_takenKairui Song
No one is using it now, just remove it. Link: https://lore.kernel.org/20260428-mglru-reclaim-v7-13-02fabb92dc43@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Axel Rasmussen <axelrasmussen@google.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Chen Ridong <chenridong@huaweicloud.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: David Stevens <stevensd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Leno Hou <lenohou@gmail.com> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vernon Yang <vernon2gm@gmail.com> Cc: Wei Xu <weixugc@google.com> Cc: Yafang <laoar.shao@gmail.com> Cc: Yuanchu Xie <yuanchu@google.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-04mm/mglru: remove no longer used reclaim argument for folio protectionKairui Song
Now dirty reclaim folios are handled after isolation, not before, since dirty reactivation must take the folio off LRU first, and that helps to unify the dirty handling logic. So this argument is no longer needed. Just remove it. Link: https://lore.kernel.org/20260428-mglru-reclaim-v7-12-02fabb92dc43@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Axel Rasmussen <axelrasmussen@google.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Chen Ridong <chenridong@huaweicloud.com> Cc: Chris Li <chrisl@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: David Stevens <stevensd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Leno Hou <lenohou@gmail.com> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vernon Yang <vernon2gm@gmail.com> Cc: Wei Xu <weixugc@google.com> Cc: Yafang <laoar.shao@gmail.com> Cc: Yuanchu Xie <yuanchu@google.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-04mm/mglru: simplify and improve dirty writeback handlingKairui Song
Right now the flusher wakeup mechanism for MGLRU is less responsive and unlikely to trigger compared to classical LRU. The classical LRU wakes the flusher if one batch of folios passed to shrink_folio_list is unevictable due to under writeback. MGLRU instead check and handle this after the whole reclaim loop is done. We previously even saw OOM problems due to passive flusher, which were fixed but still not perfect [1]. We have just unified the dirty folio counting and activation routine, now just move the dirty flush into the loop right after shrink_folio_list. This improves the performance a lot for workloads involving heavy writeback and prepares for throttling too. Test with YCSB workloadb showed a major performance improvement: Before this series: Throughput(ops/sec): 62485.02962831822 AverageLatency(us): 500.9746963330107 pgpgin 159347462 workingset_refault_file 34522071 After this commit: Throughput(ops/sec): 80857.08510208207 AverageLatency(us): 386.653262968934 pgpgin 112233121 workingset_refault_file 19516246 The performance is a lot better with significantly lower refault. We also observed similar or higher performance gain for other real-world workloads. We were concerned that the dirty flush could cause more wear for SSD: that should not be the problem here, since the wakeup condition is when the dirty folios have been pushed to the tail of LRU, which indicates that memory pressure is so high that writeback is blocking the workload already. Link: https://lore.kernel.org/20260428-mglru-reclaim-v7-11-02fabb92dc43@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Axel Rasmussen <axelrasmussen@google.com> Link: https://lore.kernel.org/linux-mm/20241026115714.1437435-1-jingxiangzeng.cas@gmail.com/ [1] Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Chen Ridong <chenridong@huaweicloud.com> Cc: Chris Li <chrisl@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: David Stevens <stevensd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Leno Hou <lenohou@gmail.com> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vernon Yang <vernon2gm@gmail.com> Cc: Wei Xu <weixugc@google.com> Cc: Yafang <laoar.shao@gmail.com> Cc: Yuanchu Xie <yuanchu@google.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-04mm/mglru: use the common routine for dirty/writeback reactivationKairui Song
Currently MGLRU will move the dirty writeback folios to the second oldest gen instead of reactivate them like the classical LRU. This might help to reduce the LRU contention as it skipped the isolation. But as a result we will see these folios at the LRU tail more frequently leading to inefficient reclaim. Besides, the dirty / writeback check after isolation in shrink_folio_list is more accurate and covers more cases. So instead, just drop the special handling for dirty writeback, use the common routine and re-activate it like the classical LRU. This should in theory improve the scan efficiency. These folios will be rotated back to LRU tail once writeback is done so there is no risk of hotness inversion. And now each reclaim loop will have a higher success rate. This also prepares for unifying the writeback and throttling mechanism with classical LRU, we keep these folios far from tail so detecting the tail batch will have a similar pattern with classical LRU. The micro optimization that avoids LRU contention by skipping the isolation is gone, which should be fine. Compared to IO and writeback cost, the isolation overhead is trivial. And using the common routine also keeps the folio's referenced bits (tier bits), which could improve metrics in the long term. Also no more need to clean reclaim bit as the common routine will make use of it. Note the common routine updates a few throttling and writeback counters, which are not used, and never have been for the MGLRU case. We will start making use of these in later commits. Link: https://lore.kernel.org/20260428-mglru-reclaim-v7-10-02fabb92dc43@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Axel Rasmussen <axelrasmussen@google.com> Reviewed-by: Barry Song <baohua@kernel.org> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Chen Ridong <chenridong@huaweicloud.com> Cc: Chris Li <chrisl@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: David Stevens <stevensd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Leno Hou <lenohou@gmail.com> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vernon Yang <vernon2gm@gmail.com> Cc: Wei Xu <weixugc@google.com> Cc: Yafang <laoar.shao@gmail.com> Cc: Yuanchu Xie <yuanchu@google.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-04mm/mglru: remove redundant swap constrained check upon isolationKairui Song
Remove the swap-constrained early reject check upon isolation. This check is a micro optimization when swap IO is not allowed, so folios are rejected early. But it is redundant and overly broad since shrink_folio_list() already handles all these cases with proper granularity. Notably, this check wrongly rejected lazyfree folios, and it doesn't cover all rejection cases. shrink_folio_list() uses may_enter_fs(), which distinguishes non-SWP_FS_OPS devices from filesystem-backed swap and does all the checks after folio is locked, so flags like swap cache are stable. This check also covers dirty file folios, which are not a problem now since sort_folio() already bumps dirty file folios to the next generation, but causes trouble for unifying dirty folio writeback handling. And there should be no performance impact from removing it. We may have lost a micro optimization, but unblocked lazyfree reclaim for NOIO contexts, which is not a common case in the first place. Link: https://lore.kernel.org/20260428-mglru-reclaim-v7-9-02fabb92dc43@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Axel Rasmussen <axelrasmussen@google.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Chen Ridong <chenridong@huaweicloud.com> Reviewed-by: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: David Stevens <stevensd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Leno Hou <lenohou@gmail.com> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vernon Yang <vernon2gm@gmail.com> Cc: Wei Xu <weixugc@google.com> Cc: Yafang <laoar.shao@gmail.com> Cc: Yuanchu Xie <yuanchu@google.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-04mm/mglru: don't abort scan immediately right after agingKairui Song
Right now, if eviction triggers aging, the reclaimer will abort. This is not the optimal strategy for several reasons. Aborting the reclaim early wastes a reclaim cycle when under pressure, and for concurrent reclaim, if the LRU is under aging, all concurrent reclaimers might fail. And if the age has just finished, new cold folios exposed by the aging are not reclaimed until the next reclaim iteration. What's more, the current aging trigger is quite lenient, having 3 gens with a reclaim priority lower than default will trigger aging, and blocks reclaiming from one memcg. This wastes reclaim retry cycles easily. And in the worst case, if the reclaim is making slower progress and all following attempts fail due to being blocked by aging, it triggers unexpected early OOM. And if a lruvec requires aging, it doesn't mean it's hot. Instead, the lruvec could be idle for quite a while, and hence it might contain lots of cold folios to be reclaimed. While it's helpful to rotate memcg LRU after aging for global reclaim, as global reclaim fairness is coupled with the rotation in shrink_many, memcg fairness is instead handled by cgroup iteration in shrink_node_memcgs. So, for memcg level pressure, this abort is not the key part for keeping the fairness. And in most cases, there is no need to age, and fairness must be achieved by upper-level reclaim control. So instead, just keep the scanning going unless one whole batch of folios failed to be isolated or enough folios have been scanned, which is triggered by evict_folios returning 0. And only abort for global reclaim after one batch, so when there are fewer memcgs, progress is still made, and the fairness mechanism described above still works fine. And in most cases, the one more batch attempt for global reclaim might just be enough to satisfy what the reclaimer needs, hence improving global reclaim performance by reducing reclaim retry cycles. Rotation is still there after the reclaim is done, which still follows the comment in mmzone.h. And fairness still looking good. Link: https://lore.kernel.org/20260428-mglru-reclaim-v7-8-02fabb92dc43@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Axel Rasmussen <axelrasmussen@google.com> Reviewed-by: Chen Ridong <chenridong@huaweicloud.com> Reviewed-by: Barry Song <baohua@kernel.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Chris Li <chrisl@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: David Stevens <stevensd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Leno Hou <lenohou@gmail.com> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vernon Yang <vernon2gm@gmail.com> Cc: Wei Xu <weixugc@google.com> Cc: Yafang <laoar.shao@gmail.com> Cc: Yuanchu Xie <yuanchu@google.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-04mm/mglru: use a smaller batch for reclaimKairui Song
With a fixed number to reclaim calculated at the beginning, making each following step smaller should reduce the lock contention and avoid over-aggressive reclaim of folios, as it will abort earlier when the number of folios to be reclaimed is reached. Link: https://lore.kernel.org/20260428-mglru-reclaim-v7-7-02fabb92dc43@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Axel Rasmussen <axelrasmussen@google.com> Reviewed-by: Chen Ridong <chenridong@huaweicloud.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: David Stevens <stevensd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Leno Hou <lenohou@gmail.com> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vernon Yang <vernon2gm@gmail.com> Cc: Wei Xu <weixugc@google.com> Cc: Yafang <laoar.shao@gmail.com> Cc: Yuanchu Xie <yuanchu@google.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-04mm/mglru: avoid reclaim type fall back when isolation makes no progressBarry Song (Xiaomi)
While isolation makes no progress in scan_folios(), we quickly fall back to the other type in isolate_folios(). This is incorrect, as the current type may still have sufficient folios. Falling back can undermine the positive_ctrl_err() result from get_type_to_scan(), which is derived from swappiness. So just continue scanning this type for another round. Worth noting if the cold generations are all reclaimed, scan will no longer make any progress either, which may undermine the swappiness again. This is not a new issue and hence better be fixed later [1]. Link: https://lore.kernel.org/linux-mm/CAGsJ_4zjdOYEtuO6gNjABm7NDxW0skzBFNRNee-k2D6VwsYEQA@mail.gmail.com/ [1] Link: https://lore.kernel.org/20260428-mglru-reclaim-v7-6-02fabb92dc43@tencent.com Signed-off-by: Barry Song (Xiaomi) <baohua@kernel.org> Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Kairui Song <kasong@tencent.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Chen Ridong <chenridong@huaweicloud.com> Cc: Chris Li <chrisl@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: David Stevens <stevensd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Leno Hou <lenohou@gmail.com> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vernon Yang <vernon2gm@gmail.com> Cc: Wei Xu <weixugc@google.com> Cc: Yafang <laoar.shao@gmail.com> Cc: Yuanchu Xie <yuanchu@google.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-04mm/mglru: scan and count the exact number of foliosKairui Song
Make the scan helpers return the exact number of folios being scanned or isolated. Since the reclaim loop now has a natural scan budget that controls the scan progress, returning the scan number and consuming the budget makes the scan more accurate and easier to follow. The number of scanned folios for each iteration is always larger than 0, unless the reclaim must stop for a forced aging, so there is no more need for any special handling when there is no progress made: - `return isolated || !remaining ? scanned : 0` in scan_folios: both the function and the call now just return the exact scan count, combined with the scan budget introduced in the previous commit to avoid livelock or under scan. - `scanned += try_to_inc_min_seq` in evict_folios: adding a bool as a scan count was kind of confusing and no longer needed, as scan number should never be zero as long as there are still evictable gens. We may encounter a empty old gen that returns 0 scan count, to avoid that, do a try_to_inc_min_seq before toisolation which have slight to none overhead in most cases. - `evictable_min_seq + MIN_NR_GENS > max_seq` guard in evict_folios: the per-type get_nr_gens == MIN_NR_GENS check in scan_folios naturally returns 0 when only two gens remain and breaks the loop. Also change try_to_inc_min_seq to return void, as its return value is no longer used by any caller. Call it before isolate_folios to flush any empty gens left by external folio freeing, and again after isolate_folios when scanning moved or protected folios may have emptied the oldest gen. The scan still stops if only two gens are left, as the scan number will be zero. This matches the previous behavior. This forced gen protection may be removed or softened later to improve reclaim further. Link: https://lore.kernel.org/20260428-mglru-reclaim-v7-5-02fabb92dc43@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Axel Rasmussen <axelrasmussen@google.com> Reviewed-by: Chen Ridong <chenridong@huaweicloud.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: David Stevens <stevensd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Leno Hou <lenohou@gmail.com> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vernon Yang <vernon2gm@gmail.com> Cc: Wei Xu <weixugc@google.com> Cc: Yafang <laoar.shao@gmail.com> Cc: Yuanchu Xie <yuanchu@google.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-04mm/mglru: restructure the reclaim loopKairui Song
The current loop will calculate the scan number on each iteration. The number of folios to scan is based on the LRU length, with some unclear behaviors, e.g, the scan number is only shifted by reclaim priority when aging is not needed or when at the default priority, and it couples the number calculation with aging and rotation. Adjust, simplify it, and decouple aging and rotation. Just calculate the scan number for once at the beginning of the reclaim, always respect the reclaim priority, and make the aging and rotation more explicit. This slightly changes how aging and offline memcg reclaim works: Previously, aging was skipped at DEF_PRIORITY even when eviction was no longer possible, so the reclaimer wasted an iteration until the priority escalated. Now aging runs immediately whenever it is needed to make progress; the DEF_PRIORITY skip only applies when eviction is still viable. This may avoid wasted iterations that over-reclaim slab and break reclaim balance in multi-cgroup setups. Similar for offline memcg. Previously, offline memcg wouldn't be aged unless it didn't have any evictable folios. Now, we might age it if it has only 3 generations, which should be fine. On one hand, offline memcg might still hold long-term folios, and in fact, a long-existing offline memcg must be pinned by some long-term folios like shmem. These folios might be used by other memcg, so aging them as ordinary memcg seems correct. Besides, aging enables further reclaim of an offlined memcg, which will certainly happen if we keep shrinking it. And offline memcg might soon be no longer an issue with reparenting. Overall, the memcg LRU rotation, as described in mmzone.h, remains the same. Note that because the scan budget is now pinned at loop entry, tiny lruvec might skip this reclaim pass, also skipping aging, which could be beneficial as aging is not helpful since it will still be un-reclaimable after aging. Reclaim will go on as usual once priority escalates. Link: https://lore.kernel.org/20260428-mglru-reclaim-v7-4-02fabb92dc43@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Axel Rasmussen <axelrasmussen@google.com> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Chen Ridong <chenridong@huaweicloud.com> Cc: Chris Li <chrisl@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: David Stevens <stevensd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Leno Hou <lenohou@gmail.com> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vernon Yang <vernon2gm@gmail.com> Cc: Wei Xu <weixugc@google.com> Cc: Yafang <laoar.shao@gmail.com> Cc: Yuanchu Xie <yuanchu@google.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-04mm/mglru: relocate the LRU scan batch limit to callersKairui Song
Same as active / inactive LRU, MGLRU isolates and scans folios in batches. The batch split is done hidden deep in the helper, which makes the code harder to follow. The helper's arguments are also confusing since callers usually request more folios than the batch size, so the helper almost never processes the full requested amount. Move the batch splitting into the top loop to make it cleaner, there should be no behavior change. Link: https://lore.kernel.org/20260428-mglru-reclaim-v7-3-02fabb92dc43@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Axel Rasmussen <axelrasmussen@google.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Barry Song <baohua@kernel.org> Reviewed-by: Chen Ridong <chenridong@huaweicloud.com> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Chris Li <chrisl@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: David Stevens <stevensd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Leno Hou <lenohou@gmail.com> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vernon Yang <vernon2gm@gmail.com> Cc: Wei Xu <weixugc@google.com> Cc: Yafang <laoar.shao@gmail.com> Cc: Yuanchu Xie <yuanchu@google.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-04mm/mglru: rename variables related to aging and rotationKairui Song
The current variable name isn't helpful. Make the variable names more meaningful. Only naming change, no behavior change. Link: https://lore.kernel.org/20260428-mglru-reclaim-v7-2-02fabb92dc43@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Suggested-by: Barry Song <baohua@kernel.org> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Chen Ridong <chenridong@huaweicloud.com> Reviewed-by: Barry Song <baohua@kernel.org> Reviewed-by: Axel Rasmussen <axelrasmussen@google.com> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Chris Li <chrisl@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: David Stevens <stevensd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Leno Hou <lenohou@gmail.com> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vernon Yang <vernon2gm@gmail.com> Cc: Wei Xu <weixugc@google.com> Cc: Yafang <laoar.shao@gmail.com> Cc: Yuanchu Xie <yuanchu@google.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-04mm/mglru: consolidate common code for retrieving evictable sizeKairui Song
Patch series "mm/mglru: improve reclaim loop and dirty folio", v7. This series cleans up and slightly improves MGLRU's reclaim loop and dirty writeback handling. As a result, we can see an up to ~30% increase in some workloads like MongoDB with YCSB and a huge decrease in file refault, no swap involved. Other common benchmarks have no regression, and LOC is reduced, with less unexpected OOM, too. Some of the problems were found in our production environment, and others were mostly exposed while stress testing during the development of the LSM/MM/BPF topic on improving MGLRU [1]. This series cleans up the code base and fixes several performance issues, preparing for further work. MGLRU's reclaim loop is a bit complex, and hence these problems are somehow related to each other. The aging, scan number calculation, and reclaim loop are coupled together, and the dirty folio handling logic is quite different, making the reclaim loop hard to follow and the dirty flush ineffective. This series slightly cleans up and improves these issues using a scan budget by calculating the number of folios to scan at the beginning of the loop, and decouples aging from the reclaim calculation helpers. Then, move the dirty flush logic inside the reclaim loop so it can kick in more effectively. These issues are somehow related, and this series handles them and improves MGLRU reclaim in many ways. Test results: All tests are done on a 48c96t NUMA machine with 2 nodes and a 128G memory machine using NVME as storage. Classical (non-MGLRU) LRU numbers are included as "MGLRU disabled" for each benchmark below; see [8] and [9] for the longer write-up. MongoDB ======= Running YCSB workloadb [2] (recordcount:20000000 operationcount:6000000, threads:32), which does 95% read and 5% update to generate mixed read and dirty writeback. MongoDB is set up in a 10G cgroup using Docker, and the WiredTiger cache size is set to 4.5G, using NVME as storage. This is close to the case we observed regressing in our production environment: mixed read and writeback pressure, so it is a practical case for evaluation. Not using SWAP. The intent is to isolate the file LRU writeback path. Enabling SWAP would just add noise from anonymous reclaim. MGLRU Before: Throughput(ops/sec): 60653.502655 workingset_refault_file 12904916 pgpgin 165366622 pgpgout 5219588 MGLRU After: Throughput(ops/sec): 82384.354760 (+35.8%, higher is better) workingset_refault_file 7128285 (-44.7%, lower is better) pgpgin 113170693 (-31.5%, lower is better) pgpgout 5639724 MGLRU Disabled: Throughput(ops/sec): 93713.640901 workingset_refault_file 15013443 pgpgin 85365614 pgpgout 5866508 We can see a significant performance improvement after this series. The test is done on NVME and the performance gap would be even larger for slow devices, such as HDD or network storage. We observed over 100% gain for some workloads with slow IO. Note, classical LRU is still faster for this benchmark, MGLRU may catch up later with further work [7]. Chrome & Node.js [3] ==================== Using Yu Zhao's test script [3], testing on a x86_64 NUMA machine with 2 nodes and 128G memory, using 256G ZRAM as swap and spawn 32 memcg 64 workers. Many memcgs each applying roughly equal pressure exercises the LRU's ability to detect/protect each tenant's working set and to balance reclamation fairly between tenants, which makes this a meaningful test for the reclaim mechanism. Fairness is reported via Jain's fairness index (1.0 means all tenants get exactly equal allocation, lower is worse). Under equal pressure, all memcgs should make roughly equal forward progress. See [8] for the longer rationale and per-memcg breakdown. MGLRU before: Total requests: 81898 Per-worker mean: 1279.7 Per-worker 95% CI (mean): [ 1259.0, 1300.4] Jain's fairness index: 0.995893 (1.0 = perfectly fair) Latency: Bucket Count Pct Cumul [0,1)s 28392 34.67% 34.67% [1,2)s 8022 9.80% 44.46% [2,4)s 6130 7.48% 51.95% [4,8)s 39354 48.05% 100.00% MGLRU after: Total requests: 82901 Per-worker mean: 1295.3 Per-worker 95% CI (mean): [ 1265.3, 1325.4] Jain's fairness index: 0.991607 (1.0 = perfectly fair) Latency: Bucket Count Pct Cumul [0,1)s 28128 33.93% 33.93% [1,2)s 8756 10.56% 44.49% [2,4)s 7028 8.48% 52.97% [4,8)s 38989 47.03% 100.00% MGLRU disabled: Total requests: 62399 Per-worker mean: 975.0 Per-worker 95% CI (mean): [ 941.9, 1008.1] Jain's fairness index: 0.982156 (1.0 = perfectly fair) Latency: Bucket Count Pct Cumul [0,1)s 20051 32.13% 32.13% [1,2)s 2255 3.61% 35.75% [2,4)s 6149 9.85% 45.60% [4,8)s 33927 54.37% 99.97% [8,16)s 17 0.03% 100.00% Reclaim is still fair and effective, total requests number seems slightly better. OOM issue with aging and throttling =================================== For the throttling OOM issue, it can be easily reproduced using dd and cgroup limit as demonstrated and fixed by a later patch in this series. The aging OOM is a bit tricky, a specific reproducer can be used to simulate what we encountered in production environment [4]: Spawns multiple workers that keep reading the given file using mmap, and pauses for 120ms after one file read batch. It also spawns another set of workers that keep allocating and freeing a given size of anonymous memory. The total memory size exceeds the memory limit (eg. 14G anon + 8G file, which is 22G vs a 16G memcg limit). - MGLRU disabled: Finished 128 iterations. - MGLRU enabled: OOM with following info after about ~10-20 iterations: [ 62.624130] file_anon_mix_p invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0 [ 62.624999] memory: usage 16777216kB, limit 16777216kB, failcnt 24460 [ 62.640200] swap: usage 0kB, limit 9007199254740988kB, failcnt 0 [ 62.640823] Memory cgroup stats for /demo: [ 62.641017] anon 10604879872 [ 62.641941] file 6574858240 OOM occurs despite there being still evictable file folios. - MGLRU enabled after this series: Finished 128 iterations. Worth noting there is another OOM related issue reported in V1 of this series, which is tested and looking OK now [5]. MySQL: ====== Testing with innodb_buffer_pool_size=26106127360, in a 2G memcg, using ZRAM as swap and test command: sysbench /usr/share/sysbench/oltp_read_only.lua --mysql-db=sb \ --tables=48 --table-size=2000000 --threads=48 --time=600 run A 24G InnoDB buffer pool inside a 2G memcg with ZRAM as swap forces aggressive eviction of cached database anon pages, which exercises the LRU's hot page detection and the eviction path under swap pressure. The workload is practical, and the pressure is higher than what we usually see in production but it is intended to expose the extreme case. MGLRU before: 17313.688333 tps MGLRU after: 17286.195000 tps MGLRU disabled: 16245.330000 tps Seems only noise level changes, no regression. FIO: ==== Testing with the following command, where /mnt/ramdisk is a 64G EXT4 ramdisk, each test file is 3G, in a 10G memcg, 6 test run each: fio --directory=/mnt/ramdisk --filename_format='test.$jobnum.img' \ --name=cached --numjobs=16 --size=3072M --buffered=1 --ioengine=mmap \ --rw=randread --norandommap --time_based \ --ramp_time=1m --runtime=5m --group_reporting Random buffered mmap read on a ramdisk strips out storage variance and stresses purely the LRU's ability to evict and recycle the page cache under heavy random read pressure. MGLRU before: 9033.91 MB/s MGLRU after: 9065.72 MB/s MGLRU disabled: 8254.54 MB/s Also seem only noise level changes and no regression or slightly better. Build kernel: ============= Build kernel test using ZRAM as swap, kernel source on tmpfs, in a memcg with memory.max=3G, using make -j96 and defconfig, measuring system time, 6 test run each. Building the kernel is a classical mixed anon + file workload (lots of small file reads/writes plus parallel anon allocations from cc/ld) and is representative of many real compilation jobs. MGLRU before: 2823.13s MGLRU after: 2801.26s MGLRU disabled: 5023.50s Also seem only noise level changes, no regression or very slightly better. Android: ======== Xinyu reported a performance gain on Android, too, with this series. The test consisted of cold-starting multiple applications sequentially under moderate system load [6]; this is a real Android user-visible scenario, dominated by the LRU's ability to keep the right working set resident and re-fault launch-critical pages quickly. Before: Launch Time Summary (all apps, all runs) Mean 868.0ms P50 888.0ms P90 1274.2ms P95 1399.0ms After: Launch Time Summary (all apps, all runs) Mean 850.5ms (-2.07%) P50 861.5ms (-3.04%) P90 1179.0ms (-8.05%) P95 1228.0ms (-12.2%) This patch (of 15): Merge commonly used code for counting evictable folios in a lruvec. No behavior change. Link: https://lore.kernel.org/20260428-mglru-reclaim-v7-0-02fabb92dc43@tencent.com Link: https://lore.kernel.org/20260428-mglru-reclaim-v7-1-02fabb92dc43@tencent.com Link: https://lore.kernel.org/linux-mm/CAMgjq7BoekNjg-Ra3C8M7=8=75su38w=HD782T5E_cxyeCeH_g@mail.gmail.com/ [1] Link: https://github.com/brianfrankcooper/YCSB/blob/master/workloads/workloadb [2] Link: https://lore.kernel.org/all/20221220214923.1229538-1-yuzhao@google.com/ [3] Link: https://github.com/ryncsn/emm-test-project/tree/master/file-anon-mix-pressure [4] Link: https://lore.kernel.org/linux-mm/acgNCzRDVmSbXrOE@KASONG-MC4/ [5] Link: https://lore.kernel.org/linux-mm/20260417025123.2971253-1-wxy2009nrrr@163.com/ [6] Link: https://lore.kernel.org/linux-mm/20260502-mglru-fg-v1-0-913619b014d9@tencent.com/ [7] Link: https://lore.kernel.org/linux-mm/CAMgjq7BzQAPp8u_3-9e3ueXmRCoW=2sydok0hFM=MYL7VC1YYg@mail.gmail.com/ [8] Link: https://lore.kernel.org/linux-mm/CAMgjq7D+4QmiWe73OPFuH0s+ZKCUJoo+MfcWOdJcV+VO-T2Wmg@mail.gmail.com/ [9] Signed-off-by: Kairui Song <kasong@tencent.com> Acked-by: Yuanchu Xie <yuanchu@google.com> Reviewed-by: Barry Song <baohua@kernel.org> Reviewed-by: Chen Ridong <chenridong@huaweicloud.com> Reviewed-by: Axel Rasmussen <axelrasmussen@google.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Chris Li <chrisl@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: David Stevens <stevensd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vernon Yang <vernon2gm@gmail.com> Cc: Wei Xu <weixugc@google.com> Cc: Yafang <laoar.shao@gmail.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Leno Hou <lenohou@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-04userfaultfd: make functions that are not used outside uffd staticMike Rapoport (Microsoft)
After merging fs/userfaultfd.c into mm/userfaultfd.c, several functions that were previously shared between the two files are now only used within mm/userfaultfd.c. Make them static and remove their declarations from include/linux/userfaultfd_k.h. Link: https://lore.kernel.org/20260523173759.3964908-3-rppt@kernel.org Assisted-by: Copilot:claude-opus-4-6 Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-04userfaultfd: merge fs/userfaultfd.c into mm/userfaultfd.cMike Rapoport (Microsoft)
Patch series "userfaultfd: merge fs/userfaultfd.c into mm/userfaultfd.c", v3. These patches merge fs/userfaultfd.c into mm/userfaultfd.c and make functions used only inside mm/userfaultfd.c static. This patch (of 2): Historically userfaultfd implementation has been split between fs/userfaultfd.c and mm/userfaultfd.c. The mm/ part implemented memory management operations, while the fs/ part implemented file descriptor handling and called into the mm/ part for the actual memory management work. This separation is quite artificial and fs/userfaultfd.c does not seem to belong to fs/ because it's only a user if vfs APIs and like for other users, for example, memfd and secretmem, the file descriptor handling could live in mm/ as well. "Append" fs/userfaultfd.c to mm/userfaultfd and update fs/Makefile and MAINTAINERS accordingly. No intended functional changes. Link: https://lore.kernel.org/20260523173759.3964908-1-rppt@kernel.org Link: https://lore.kernel.org/20260523173759.3964908-2-rppt@kernel.org Assisted-by: Copilot:claude-opus-4-6 Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Christian Brauner (Amutable) <brauner@kernel.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: David Hildenbrand <david@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-04mm/page_alloc: remove VM_BUG_ON()s from pindex helpersBrendan Jackman
Vlastimil pointed out that the VM_BUG_ON()s have fallen out of favour, so remove them. Link: https://lore.kernel.org/20260526-page_alloc-unmapped-prep-v2-1-412f4d486115@google.com Signed-off-by: Brendan Jackman <jackmanb@google.com> Suggested-by: Vlastimil Babka (SUSE) <vbabka@kernel.org> Link: https://lore.kernel.org/all/4074a816-9e75-45a6-8141-25459bcc106b@kernel.org/ Reviewed-by: Vlastimil Babka (SUSE) <vbabka@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-04mm/mglru: use folio_mark_accessed to replace folio_set_activeBarry Song (Xiaomi)
MGLRU gives high priority to folios mapped in page tables. As a result, folio_set_active() is invoked for all folios read during page faults. In practice, however, readahead can bring in many folios that are never accessed via page tables. A previous attempt by Lei Liu proposed introducing a separate LRU for readahead[1] to make readahead pages easier to reclaim, but that approach is likely over-engineered. Before commit 4d5d14a01e2c ("mm/mglru: rework workingset protection"), folios with PG_active were always placed in the youngest generation, leading to over-protection and increased refaults. After that commit, PG_active folios are placed in the second youngest generation, which is still too optimistic given the presence of readahead. In contrast, the classic active/inactive scheme is more conservative. This patch switches to using folio_mark_accessed() and begins prefaulted file folios from the second oldest generation instead of active generations. We should also adjust the following accordingly: - WORKINGSET_ACTIVATE: aligned with setting active for refaulted workingset folios; - lru_gen_folio_seq(): place (pre)faulted file folios into the second oldest generation; - promote second-scanned folios to workingset in folio_check_references(): we now have to depend on folio_lru_refs() > 1, since we previously relied on PG_referenced being set during the first scan, but PG_referenced is now set earlier. On x86, running a kernel build inside a memcg with a 1GB memory limit using 20 threads. w/o patch: real 1m50.764s user 25m32.305s sys 4m0.012s pswpin: 1333245 pswpout: 4366443 pgpgin: 6962592 pgpgout: 17780712 swpout_zero: 1019603 swpin_zero: 14764 refault_file: 287794 refault_anon: 1347963 w/ patch: real 1m48.879s user 25m29.224s sys 3m37.421s pswpin: 568480 pswpout: 2322657 pgpgin: 4073416 pgpgout: 9613408 swpout_zero: 593275 swpin_zero: 9118 refault_file: 262505 refault_anon: 577550 active/inactive LRU: real 1m49.928s user 25m28.196s sys 3m40.740s pswpin: 463452 pswpout: 2309119 pgpgin: 4438856 pgpgout: 9568628 swpout_zero: 743704 swpin_zero: 7244 refault_file: 562555 refault_anon: 470694 Lance and Xueyuan made a huge contribution to this patch through testing. Link: https://lore.kernel.org/20260526130938.66253-1-baohua@kernel.org Link: https://lore.kernel.org/linux-mm/20250916072226.220426-1-liulei.rjpt@vivo.com/ [1] Signed-off-by: Barry Song (Xiaomi) <baohua@kernel.org> Tested-by: Lance Yang <lance.yang@linux.dev> Tested-by: Xueyuan Chen <xueyuan.chen21@gmail.com> Cc: Pedro Falcato <pfalcato@suse.de> Cc: Kairui Song <kasong@tencent.com> Cc: Qi Zheng <qi.zheng@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: wangzicheng <wangzicheng@honor.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Lei Liu <liulei.rjpt@vivo.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Yuanchu Xie <yuanchu@google.com> Cc: Wei Xu <weixugc@google.com> Cc: Will Deacon <will@kernel.org> Cc: Kalesh Singh <kaleshsingh@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-04kasan/test: only do kmalloc_double_kzfree for generic modeWang Wensheng
kmalloc_double_kzfree() would corrupt kernel memory when the just freed memory were allocated by another thread before the second call to kfree_sensitive() and the new allocation tag happened to match the old one. This could not happen in GENERIC mode as it uses quarantine. Link: https://lore.kernel.org/20260524031053.381776-1-wsw9603@163.com Signed-off-by: Wang Wensheng <wsw9603@163.com> Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-04mm/damon/core: trace esz at first setupSeongJae Park
DAMON traces effective size quota from the second update, only if a change has been made by the update. Tracing only changed updates was an intentional decision to avoid unnecessary same value tracing. Always skipping the first value is just an unintended mistake. The mistake makes the tracepoint based investigation incomplete, because the first effective size quota is never traced. It is not a big issue when the 'consist' quota tuner is used, because it keeps changing the quota in the usual setup. However, when the 'temporal' tuner is used, the quota value is not changed before the goal achievement status is completely changed. For example, if the DAMOS scheme is started with an under-achieved goal, the quota is set to the maximum value, and kept the same value until the goal is achieved. Because DAMON skips the first value, the user cannot know what effective quota the current scheme is using. Only after the goal is achieved, the effective quota is changed to zero, and traced. Unconditionally trace the initial quota value to fix this problem. Note that the 'temporal' quota tuner was introduced by commit af738a6a00c1 ("mm/damon/core: introduce DAMOS_QUOTA_GOAL_TUNER_TEMPORAL"), which was added to 7.1-rc1. But even with the 'consist' quota tuner, the tracing is unintentionally incomplete. Hence this commit marks the introduction of the trace event as the broken commit. Link: https://lore.kernel.org/20260520150311.80925-1-sj@kernel.org Fixes: a86d695193bf ("mm/damon: add trace event for effective size quota") Signed-off-by: SeongJae Park <sj@kernel.org> Cc: <stable@vger.kernel.org> # 6.17.x Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-04mm/damon/tests/core-kunit: add damon_set_regions() test casesSeongJae Park
damon_set_regions() is one of the main DAMON kernel API functions that set up the monitoring target memory region boundaries. Implement unit tests for verifying its basic functionalities. Link: https://lore.kernel.org/20260522154026.80546-11-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-04mm/damon/core: remove damon_verify_nr_regions()SeongJae Park
When CONFIG_DAMON_DEBUG_SANITY is enabled, damon_verify_nr_regions() is called for each damon_nr_regions() invocation. damon_veify_nr_regions() iterates all regions. damon_nr_regions() is called for each region in kdamond_reset_aggregated() and damos_apply_scheme(). Hence it imposes O(n**2) overhead where n is the number of regions. Though the verification is enabled only under DAMON_DEBUG_SANITY, which is not for production use cases, it could be too high overhead. Meanwhile, damon_verify_ctx() is doing the damon_nr_regions() test. Because damon_verify_ctx() is called for each kdamond_call(), the test coverage from damon_verify_ctx() could be sufficient. Remove damon_nr_regions() verification. Link: https://lore.kernel.org/20260522154026.80546-10-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-04mm/damon/core: add kdamond_call() debug_sanity checkSeongJae Park
kdamond_call() is the place where DAMON API callers are allowed to access the DAMON context's public internal state including the monitoring results. Hence it is important to ensure it is called with the expected DAMON context state. Do the check under DAMON_DEBUG_SANITY. Link: https://lore.kernel.org/20260522154026.80546-9-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-04mm/damon/core: hide damon_destroy_region()SeongJae Park
damon_destroy_region() is being used by only DAMON core, but exposed to DAMON API callers. Exposing something that is not really being used by others will only increase the maintenance cost. Hide it. Link: https://lore.kernel.org/20260522154026.80546-8-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-04mm/damon/core: hide damon_insert_region()SeongJae Park
damon_insert_region() is being used by only DAMON core, but exposed to DAMON API callers. Exposing something that is not really being used by others will only increase the maintenance cost. Hide it. Link: https://lore.kernel.org/20260522154026.80546-7-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-04mm/damon/core: hide damon_add_region()SeongJae Park
damon_add_region() is being used by only DAMON core, but exposed to DAMON API callers. Exposing something that is not really being used by others will only increase the maintenance cost. Hide it. Link: https://lore.kernel.org/20260522154026.80546-6-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-04mm/damon/tests/vaddr-kunit: replace damon_add_region() with damon_set_regions()SeongJae Park
DAMON virtual address operation set (vaddr) unit tests is using damon_add_region() for setup of DAMON monitoring target region boundaries setup. But, damon_set_regions() is designed for exactly the purpose. All other DAMON API callers use the function for the purpose. Replace damon_add_region() usage in the unit tests with damon_set_regions(), for unifying the use case and reducing the maintenance cost. Link: https://lore.kernel.org/20260522154026.80546-5-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-04mm/damon/core: do not use region out of a loop in damon_set_regions()SeongJae Park
damon_set_regions() assumes the DAMON region iterator is referencing the last region after the region iteration loop is completed. The code is indeed implemented in the way, but that is not a documented safe behavior. Hence it is unreliable and difficult to read. Cleanup the code to avoid the case. No behavioral change is intended. Link: https://lore.kernel.org/20260522154026.80546-3-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-04mm/damon/core: safely handle no region case in damon_set_regions()SeongJae Park
Patch series "mm/damon: minor improvements for code readability and tests". Implement minor improvements on code readability and tests for DAMON. First seven patches are for DAMON code readability and resulting maintenance. Patches 1 and 2 make damon_set_regions() safer and easier to read. Patches 3 and 4 remove fragmented DAMON API use cases. Patches 5-7 hides unused core functions that are unnecessarily exposed to API callers. The following seven patches are for DAMON tests improvement. Patches 8 and 9 adds and removes DAMON_DEBUG_SANITY verifications to ensure reasonable test coverage without too high overhead. Patch 10 adds a new kunit test for damon_set_regions(). Patch 11 makes sysfs.py selftest more gracefully finishes under test failures. Patches 12-13 adds simple sysfs.sh test cases for the monitoring intervals goal directory, the addr_unit file and the pause file. This patch (of 14): damon_set_regions() calls damon_first_region() regardless of the number of DAMON regions in a given DAMON target. damon_first_region() internally uses list_first_entry(), which clearly documents the list is expected to be not empty. Due to the internal implementation of the macro, damon_set_regions() is safe for now. But the internal implementation of the macro can be changed in future. Refactor the function to explicitly and safely handle the empty region list case without depending on the internal implementation. No behavioral change is intended. Link: https://lore.kernel.org/20260522154026.80546-1-sj@kernel.org Link: https://lore.kernel.org/20260522154026.80546-2-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-04mm/vma: eliminate mmap_action->error_hook, introduce error_overrideLorenzo Stoakes
Rather than providing a hook, simplify things by providing the ability to override mmap action errors. This allows us to more carefully validate the value provided and thus ensure only a valid error code is specified, and simplifies the interface. This way, we eliminate all hooks but mmap_prepare and allow only mmap actions to be specified (which core mm controls). This significantly improves robustness and eliminates any unnecessary code duplication in driver mmap hooks. We also update the /dev/mem logic (the only user) to use mmap_action->error_override instead. Link: https://lore.kernel.org/55d13f7d016b827c459946d46a56105635be111c.1780397980.git.ljs@kernel.org Signed-off-by: Lorenzo Stoakes <ljs@kernel.org> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Jann Horn <jannh@google.com> Cc: Liam R. Howlett <liam@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Pedro Falcato <pfalcato@suse.de> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-04mm/vma: remove mmap_action->success_hookLorenzo Stoakes
This hook was introduced to work around code that seemed to absolutely require access to a VMA pointer upon mmap(). However, providing this hook leaves a backdoor to drivers getting access to the very thing mmap_prepare eliminates - a pointer to the VMA. Let's solve this contradiction by removing it. The key intended user was hugetlb, however it seems that the best course now is to avoid allowing all drivers the ability to work around mmap_prepare, and find a different solution there. Link: https://lore.kernel.org/f79434e6d30af6d92999be6b76e197f1847105fa.1780397980.git.ljs@kernel.org Signed-off-by: Lorenzo Stoakes <ljs@kernel.org> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Jann Horn <jannh@google.com> Cc: Liam R. Howlett <liam@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Pedro Falcato <pfalcato@suse.de> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-04drivers/char/mem: eliminate unnecessary use of success_hookLorenzo Stoakes
Patch series "remove mmap_action success, error hooks", v3. The mmap_action->success_hook was a strange beast added to enable code which appeared to absolutely require access to a VMA pointer to work correctly. Primarily this was for hugetlb, however a different approach will be taken there, as clearly more work is required to figure out a sensible way of converting hugetlb to use mmap_prepare. The other user was the memory char driver, specifically /dev/zero which has the unusual property of explicitly setting file-backed VMAs anonymous. Providing the success hook was always foolish, as it allowed drivers a way to workaround the restriction that they should not access a pointer to a not-yet-correctly-initialised VMA - which defeats the purpose of the mmap_prepare work. We can achieve the same thing in memory char driver without needing the success hook, so this series removes that, then removes the success hook altogether. The error hook is also unnecessary - the motivation for this was for functions which need to override the error code when performing an mmap action in order to avoid breaking userspace. We can achieve this by just providing a field for the error code. Doing this means we don't have to worry about the hook doing anything odd. We also add a check to ensure the error code is in fact valid. Again the memory char driver is the only current user of this, so this series updates it to use that. After this change mmap_action has no custom hooks at all, which seems rather more cromulent than before. This patch (of 3): /dev/zero, uniquely, marks memory mapped there as anonymous. This is currently achieved using the mmap_action->success_hook. However this hook circumvents the abstraction of VMA initialisation so it's preferable to do things a different way. To achieve this, this patch firstly defaults the VMA descriptor's vm_ops field to the dummy VMA operations, which is what file-backed VMAs default this field to. That way, we can detect whether a driver sets this field to NULL in order to mark it anonymous. We then introduce vma_desc_set_anonymous() to do this explicitly, and invoke it in mmap_zero_prepare(). This way, any driver which does not explicitly set desc->vm_ops, retains the dummy vm_ops as they would previously. We also update set_vma_user_defined_fields() to make clear that we are either setting vma->vm_ops to what is provided by the driver (or defaulting to dummy_vm_ops if not set), or setting the VMA anonymous. This lays the groundwork for removing the success hook. Link: https://lore.kernel.org/cover.1780397980.git.ljs@kernel.org Link: https://lore.kernel.org/010579cca6787cf7bb057ab1f7228978b10601c8.1780397980.git.ljs@kernel.org Signed-off-by: Lorenzo Stoakes <ljs@kernel.org> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Jann Horn <jannh@google.com> Cc: Liam R. Howlett <liam@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Pedro Falcato <pfalcato@suse.de> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-04mm/page_alloc: fix defrag_mode for non-reclaimable allocationsDmitry Ilvokhin
When defrag_mode is enabled, ALLOC_NOFRAGMENT is enforced to prevent migratetype fallbacks and keep pageblocks clean. The allocator relies on reclaim and compaction to free pages of the correct type before allowing fallback as a last resort. However, non-reclaimable allocations such as GFP_ATOMIC cannot invoke direct reclaim or compaction. With defrag_mode=1, these allocations hit the !can_direct_reclaim bailout in __alloc_pages_slowpath() with ALLOC_NOFRAGMENT still set, and fail without ever attempting a fallback. This causes a large number of SLUB allocation failures for skbuff_head_cache under network-heavy workloads, despite free memory being available in other migratetype freelists. We observed it on a few of the Meta workloads that adopted defrag_mode=1. For the service under load there were 85509 SLUB allocation failures messages in dmesg within 2 hours. All of them are GFP_ATOMIC allocations for skbuff_head_cache, despite free pages being available in other migratetype freelists (~13 GB free). Since it is networking path from the practical point of view, this means dropped packets, failed RPC requests, tail latency spikes and overall service degradation. Clear ALLOC_NOFRAGMENT and retry for allocations that request kswapd reclaim but cannot do direct reclaim themselves (GFP_ATOMIC). Purely speculative allocations like GFP_TRANSHUGE_LIGHT that don't set __GFP_KSWAPD_RECLAIM are left to fail, since they have reasonable fallbacks and should not cause fragmentation. Link: https://lore.kernel.org/20260520122228.201550-1-d@ilvokhin.com Fixes: e3aa7df331bc ("mm: page_alloc: defrag_mode") Signed-off-by: Dmitry Ilvokhin <d@ilvokhin.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Vlastimil Babka (SUSE) <vbabka@kernel.org> Cc: Brendan Jackman <jackmanb@google.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-04buffer: Remove submit_bh()Matthew Wilcox (Oracle)
No users are left; remove this API. Also remove/fix comments mentioning it, and end_bio_bh_io_sync() as it's now unused. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Link: https://patch.msgid.link/20260528173150.1093780-32-willy@infradead.org Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
2026-06-04mm: track DONTCACHE dirty pages per bdi_writebackJeff Layton
Add a per-wb WB_DONTCACHE_DIRTY counter that tracks the number of dirty pages with the dropbehind flag set (i.e., pages dirtied via RWF_DONTCACHE writes). Increment the counter alongside WB_RECLAIMABLE in folio_account_dirtied() when the folio has the dropbehind flag set, and decrement it in folio_clear_dirty_for_io() and folio_account_cleaned(). Also decrement it when a non-DONTCACHE lookup atomically clears the dropbehind flag on a dirty folio in __filemap_get_folio_mpol(), using folio_test_clear_dropbehind() to prevent concurrent lookups from double-decrementing the counter, and guarding the decrement with mapping_can_writeback() to match the increment path. Transfer the counter alongside WB_RECLAIMABLE in inode_do_switch_wbs() so that the stat is properly migrated when an inode switches cgroup writeback domains. The counter will be used by the writeback flusher to determine how many pages to write back when expediting writeback for IOCB_DONTCACHE writes, without flushing the entire BDI's dirty pages. Suggested-by: Jan Kara <jack@suse.cz> Assisted-by: Claude:claude-opus-4-6 Signed-off-by: Jeff Layton <jlayton@kernel.org> Link: https://patch.msgid.link/20260511-dontcache-v7-2-2848ddce8090@kernel.org Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
2026-06-04mm: preserve PG_dropbehind flag during folio splitJeff Layton
__split_folio_to_order() copies page flags from the original folio to newly created sub-folios using an explicit allowlist, but PG_dropbehind is not included. When a large folio with PG_dropbehind set is split, only the head sub-folio retains the flag; all tail sub-folios silently lose it and will not be reclaimed eagerly after writeback completes. Add PG_dropbehind to the flag copy mask so that the drop-behind hint is preserved across folio splits. Fixes: a323281cdfec ("mm: add PG_dropbehind folio flag") Assisted-by: Claude:claude-opus-4-6 Signed-off-by: Jeff Layton <jlayton@kernel.org> Link: https://patch.msgid.link/20260511-dontcache-v7-1-2848ddce8090@kernel.org Acked-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
2026-06-04libfs: drop redundant SB_I_NOEXEC/SB_I_NODEV in init_pseudo() callersJohn Hubbard
init_pseudo() now sets SB_I_NOEXEC and SB_I_NODEV by default, so the per-caller assignments are redundant. Drop them. Signed-off-by: John Hubbard <jhubbard@nvidia.com> Link: https://patch.msgid.link/20260604025315.245910-3-jhubbard@nvidia.com Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
2026-06-03mm/mincore: handle non-swap entries before !CONFIG_SWAP guardUsama Arif
mincore_swap() also fields migration/hwpoison entries (and shmem swapin-error entries), which can exist on !CONFIG_SWAP builds when CONFIG_MIGRATION or CONFIG_MEMORY_FAILURE is enabled. The !IS_ENABLED(CONFIG_SWAP) guard ran before the non-swap-entry early return, so mincore_pte_range() can spuriously WARN and report these pages nonresident on !CONFIG_SWAP kernels. Move the guard below the non-swap-entry check so only true swap entries trip the WARN, and migration/hwpoison entries take the existing "uptodate / non-shmem" path. Link: https://lore.kernel.org/20260602172247.279421-1-usama.arif@linux.dev Fixes: 1f2052755c15 ("mm/mincore: use a helper for checking the swap cache") Signed-off-by: Usama Arif <usama.arif@linux.dev> Reviewed-by: Pedro Falcato <pfalcato@suse.de> Reviewed-by: Kairui Song <kasong@tencent.com> Reviewed-by: Lorenzo Stoakes <ljs@kernel.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Baoquan He <baoquan.he@linux.dev> Cc: Chris Li <chrisl@kernel.org> Cc: Jann Horn <jannh@google.com> Cc: Liam R. Howlett <liam@infradead.org> Cc: Rik van Riel <riel@surriel.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Vlastimil Babka <vbabka@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-03mm/list_lru: drain before clearing xarray entry on reparentShakeel Butt
memcg_reparent_list_lrus() clears the dying memcg's xarray entry with xas_store(&xas, NULL) before reparenting its per-node lists into the parent. This opens a window where a concurrent list_lru_del() arriving for the dying memcg sees xa_load() == NULL, walks to the parent in lock_list_lru_of_memcg(), takes the parent's per-node lock, and calls list_del_init() on an item still physically linked on the dying memcg's list. If another in-flight thread holds the dying memcg's per-node lock at the same moment (another list_lru_del, or a list_lru_walk_one running an isolate callback), both threads modify ->next/->prev pointers on the same physical list under different locks. Adjacent items can corrupt each other's links. Fix it by reversing the order: reparent each per-node list and mark the child's list lru dead and then clear the xarray entry. Any concurrent list_lru op that finds the still-set xarray entry either takes the dying memcg's per-node lock (synchronizing with the drain) or sees LONG_MIN and walks to the parent, where the items now live. Link: https://lore.kernel.org/20260601161501.1444829-1-shakeel.butt@linux.dev Fixes: fb56fdf8b9a2 ("mm/list_lru: split the lock to per-cgroup scope") Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Reported-by: Chris Mason <clm@fb.com> Reviewed-by: Kairui Song <kasong@tencent.com> Acked-by: Muchun Song <muchun.song@linux.dev> Cc: Dave Chinner <david@fromorbit.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-03mm/huge_memory: use correct flags for device private PMD entryLorenzo Stoakes
Commit 65edfda6f3f2 ("mm/rmap: extend rmap and migration support device-private entries") updated set_pmd_migration_entry() to use pmdp_huge_get_and_clear() in the softleaf case, but made no further adjustments to the function itself. Therefore this function continues to incorrectly use pmd_write(), pmd_soft_dirty() and pmd_uffd_wp() to determine whether the installed migration entry should be marked writable, softdirty or uffd-wp respectively. Whilst all are incorrect, the most problematic of these is pmd_write(), as this can lead to corrupted rmap state. On x86-64 _PAGE_SWP_SOFT_DIRTY is aliased to _PAGE_RW. So calling pmd_write() on a softleaf will return the softdirty state encoded in the entry, assuming CONFIG_MEM_SOFT_DIRTY was enabled. This was observed when running the hmm.hmm_device_private.anon_write_child selftest: 1. The test faults in a range then migrates it such that a device-private THP range is established. 2. The parent then migrates it to a device-private writable PMD entry whose folio is entirely AnonExclusive with entire_mapcount=1, softdirty set (accidentally correct write state). 3. The parent forks and the PMD entries are set to device-private read only entries, entire_mapcount=2, softdirty still set. 4. [BUG] The child writes to the range then migrates to RAM - intending to install non-writable migration entries - but replacing parent and child PMD mappings with WRITABLE entries due to misinterpreting the softdirty bit. 5. In remove_migration_pmd(), if !softleaf_is_migration_read(entry) we set the RMAP_EXCLUSIVE flag when calling folio_add_anon_rmap_pmd() for both parent and child, which are therefore AnonExclusive. 6. [SPLAT] Child sets migrated folio entire_mapcount=1, parent sets entire_mapcount=2 and we end up with an AnonExclusive folio with entire_mapcount=2! Assert fires in __folio_add_anon_rmap(): VM_WARN_ON_FOLIO(folio_test_large(folio) && folio_entire_mapcount(folio) > 1 && PageAnonExclusive(cur_page), folio) This patch fixes the issue by correctly referencing the softleaf entry fields for writable, softdirty and uffd-wp in set_pmd_migration_entry(). It also only updates A/D flags if the entry is present as these are otherwise not meaningful for a softleaf entry. This patch also flips the if (!present) { ... } else { ... } logic in set_pmd_migration_entry() so it is easier to understand, and adds some comments to make things clearer. I was able to bisect this to commit 775465fd26a3 ("lib/test_hmm: add zone device private THP test infrastructure") which first exposes this bug as it was the commit that permitted test_hmm to generate the test. However commit 65edfda6f3f2 ("mm/rmap: extend rmap and migration support device-private entries") is the commit that actually enabled this behaviour. Link: https://lore.kernel.org/20260601083044.57132-1-ljs@kernel.org Fixes: 65edfda6f3f2 ("mm/rmap: extend rmap and migration support device-private entries") Signed-off-by: Lorenzo Stoakes <ljs@kernel.org> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Dev Jain <dev.jain@arm.com> Reviewed-by: Balbir Singh <balbirs@nvidia.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Oscar Salvador (SUSE) <osalvador@kernel.org> Reviewed-by: Barry Song <baohua@kernel.org> Reviewed-by: Lance Yang <lance.yang@linux.dev> Reviewed-by: Zi Yan <ziy@nvidia.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Liam R. Howlett <liam@infradead.org> Cc: Nico Pache <npache@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: SeongJae Park <sj@kernel.org> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-03mm/damon/lru_sort: handle ctx allocation failureSeongJae Park
DAMON_LRU_SORT allocates the damon_ctx object for its kdamond in its init function. damon_lru_sort_enabled_store() wrongly assumes the allocation will always succeed once tried. If the damon_ctx allocation was failed, therefore, code execution reaches to damon_commit_ctx() while 'ctx' is NULL. As a result, it dereferences the NULL 'ctx' pointer. Avoid the NULL dereference by returning -ENOMEM if 'ctx' is NULL. Link: https://lore.kernel.org/20260529000104.7006-3-sj@kernel.org Fixes: c4a8e662c839 ("mm/damon/lru_sort: use damon_initialized()") Signed-off-by: SeongJae Park <sj@kernel.org> Cc: <stable@vger.kernel.org> # 6.18.x Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-03mm/damon/reclaim: handle ctx allocation failureSeongJae Park
Patch series "mm/damon/{reclaim,lru_sort}: handle ctx allocation failures". DAMON_RECLAIM and DAMON_LRU_SORT could dereference NULL pointers if their damon_ctx object allocations fail. The bugs are expected to happen infrequently because the allocations are arguably too small to fail on common setups. But theoretically they are possible and the consequences are bad. Fix those. The issues were discovered [1] by Sashiko. This patch (of 2): DAMON_RECLAIM allocates the damon_ctx object for its kdamond in its init function. damon_reclaim_enabled_store() wrongly assumes the allocation will always succeed once tried. If the damon_ctx allocation was failed, therefore, code execution reaches to damon_commit_ctx() while 'ctx' is NULL. As a result, it dereferences the NULL 'ctx' pointer. Avoid the NULL dereference by returning -ENOMEM if 'ctx' is NULL. Link: https://lore.kernel.org/20260529000104.7006-2-sj@kernel.org Link: https://lore.kernel.org/20260419014800.877-1-sj@kernel.org [1] Fixes: 3f7a914ab9a5 ("mm/damon/reclaim: use damon_initialized()") Signed-off-by: SeongJae Park <sj@kernel.org> Cc: <stable@vger.kernel.org> # 6.18.x Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-03mm/cma_sysfs: skip inactive CMA areas in sysfsKaitao Cheng
cma_activate_area() can fail after a CMA area has already been added to cma_areas[]. In that case the area is left in the global array, but it does not reach the point where CMA_ACTIVATED is set. cma_sysfs_init() currently walks all cma_area_count entries and creates sysfs files for every area, including ones that failed activation. These areas are not usable CMA areas and should not be exposed to userspace as valid CMA regions. If such an inactive area is exposed, userspace sees a CMA directory whose read-only accounting files report zeros. total_pages and available_pages report zero because the failed activation path clears cma->count and cma->available_count, while the allocation and release counters also stay at zero because the area cannot service CMA allocations. This makes the failed area look like a valid but empty CMA region and can mislead tests, monitoring, and diagnostics. Skip CMA areas that did not reach CMA_ACTIVATED when creating the sysfs objects. Since inactive entries can now be skipped, make the error unwind tolerate entries that never had cma_kobj initialized. Link: https://lore.kernel.org/20260524140420.61864-1-kaitao.cheng@linux.dev Link: https://lore.kernel.org/20260522131434.78532-1-kaitao.cheng@linux.dev Fixes: 43ca106fa8ec ("mm: cma: support sysfs") Signed-off-by: Kaitao Cheng <chengkaitao@kylinos.cn> Reported-by: David Hildenbrand (Arm) <david@kernel.org> Suggested-by: David Hildenbrand (Arm) <david@kernel.org> Suggested-by: Muchun Song <songmuchun@bytedance.com> Reported-by: Muchun Song <songmuchun@bytedance.com> Closes: https://lore.kernel.org/linux-mm/55481a8b-dcfc-4bef-ba59-aa0b43dca88b@kernel.org/ Acked-by: Muchun Song <muchun.song@linux.dev> Cc: David Hildenbrand <david@kernel.org> Cc: Dmitry Osipenko <digetx@gmail.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>