path: root/mm
Age  Commit message  Author
2026-05-28  mm/sparse: remove unnecessary NULL check before allocating mem_section  Sang-Heon Jeon
Commit 850ed20539a4 ("mm: move array mem_section init code out of memory_present()") moved mem_section allocation logic into memblocks_present(). Before that move, memory_present() could be called multiple times, so unlikely() matched the common case, where most calls found mem_section already allocated. After that move, memblocks_present() is called exactly once from sparse_init(). Under CONFIG_SPARSEMEM_EXTREME, mem_section is always NULL when it is called. So remove the unnecessary NULL check before allocating mem_section. No functional change. Link: https://lore.kernel.org/20260419144225.2875654-1-ekffu200098@gmail.com Signed-off-by: Sang-Heon Jeon <ekffu200098@gmail.com> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Donet Tom <donettom@linux.ibm.com> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Cc: Liam Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-05-28  mm/migrate_device: cleanup up PMD Checks and warnings  Sunny Patel
Remove the odd VM_WARN_ON_FOLIO(!folio, folio) usage and replace it with a simpler VM_WARN_ON_ONCE(!folio) check. Drop the redundant VM_WARN_ON_ONCE(!pmd_none(*pmdp) && !is_huge_zero_pmd(*pmdp)). Refactor the PMD checks, making the control flow clearer and avoiding duplicate condition checks. Link: https://lore.kernel.org/20260419174747.10701-1-nueralspacetech@gmail.com Signed-off-by: Sunny Patel <nueralspacetech@gmail.com> Acked-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Huang Ying <ying.huang@linux.alibaba.com> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Cc: Alistair Popple <apopple@nvidia.com> Cc: Byungchul Park <byungchul@sk.com> Cc: Gregory Price <gourry@gourry.net> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Rakie Kim <rakie.kim@sk.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-05-28  mm/damon/tests/core-kunit: test fail_charge_{num,denom} committing  SeongJae Park
Extend damos_test_commit_quotas() kunit test to ensure damos_commit_quota() handles fail_charge_{num,denom} parameters. Link: https://lore.kernel.org/20260428013402.115171-9-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: David Hildenbrand <david@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-05-28  mm/damon/sysfs-schemes: implement fail_charge_{num,denom} files  SeongJae Park
Implement the user-space ABI for the DAMOS action failed region quota-charge ratio setup. For this, add two new sysfs files under the DAMON sysfs interface for DAMOS quotas. Names of the files are fail_charge_num and fail_charge_denom, and work for reading and setting the numerator and denominator of the failed regions charge ratio. Link: https://lore.kernel.org/20260428013402.115171-5-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: David Hildenbrand <david@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-05-28  mm/damon/core: introduce failed region quota charge ratio  SeongJae Park
DAMOS quota is charged for all memory that a DAMOS action was attempted on, regardless of how much of that memory the action succeeded or failed on. This makes it difficult to understand quota behavior from end-level metrics alone (e.g., the increased amount of free memory for the DAMOS_PAGEOUT action), without DAMOS stats. Also, charging action-failed memory the same as action-successful memory is somewhat unfair, as successful action application will induce more overhead in most cases. Introduce DAMON core API for setting the charge ratio for such action-failed memory. It allows API callers to specify the ratio in a flexible way, by setting the numerator and the denominator. Link: https://lore.kernel.org/20260428013402.115171-4-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: David Hildenbrand <david@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
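The charge arithmetic described above can be sketched in user-space C. This is a minimal illustration, not the kernel's code: the function name charged_bytes() and its parameters are hypothetical, and it assumes that successfully handled bytes are charged in full while failed bytes are scaled by numerator / denominator.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (not a kernel API): bytes charged against the quota
 * when an action was tried on sz_tried bytes and succeeded on sz_applied.
 * Assumption: successful bytes are charged 1:1, failed bytes are charged
 * at fail_num / fail_denom.
 */
static uint64_t charged_bytes(uint64_t sz_tried, uint64_t sz_applied,
			      uint64_t fail_num, uint64_t fail_denom)
{
	uint64_t sz_failed;

	assert(sz_applied <= sz_tried);
	assert(fail_denom != 0);

	sz_failed = sz_tried - sz_applied;
	/* Divide before multiplying so large sizes are less likely to overflow. */
	return sz_applied + sz_failed / fail_denom * fail_num +
	       sz_failed % fail_denom * fail_num / fail_denom;
}
```

With a ratio of 1 / 4096, an action tried on 5 GiB that succeeds on 1 GiB is charged 1 GiB plus 1 MiB under this sketch; with a ratio of 1 / 1 the full 5 GiB is charged, matching the behavior before the series.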
2026-05-28  mm/damon/core: merge regions after applying DAMOS schemes  SeongJae Park
damos_apply_scheme() could split the given region if applying the scheme's action to the entire region can result in violating the quota-set upper limit. Keeping regions that are created by such split operations is unnecessary overhead. The overhead would be negligible in the common case because such split operations could happen only up to the number of installed schemes per scheme apply interval. The following commit could make the impact larger, though. The following commit will allow the action-failed region to be charged in a different ratio. If both the ratio and the remaining quota are quite small while the region to apply the scheme to is quite large and the action is nearly always failing, a high number of split operations could happen. Remove the unnecessary overhead by merging regions after applying schemes is done for each region. The merge operation is made only if it will not lose monitoring information and it keeps the min_nr_regions constraint. In the worst case, the max_nr_regions could still be violated until the next per-aggregation interval merge operation is made. Link: https://lore.kernel.org/20260428013402.115171-3-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: David Hildenbrand <david@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-05-28  mm/damon/core: handle <min_region_sz remaining quota as empty  SeongJae Park
Patch series "mm/damon: introduce DAMOS failed region quota charge ratio".

Let users set different DAMOS quota charge ratios for DAMOS action failed regions, for deterministic and consistent DAMOS action progress.

Common Reports: Unexpectedly Slow DAMOS
=======================================

One common issue report that we get from DAMON users is that DAMOS action applying progress speed is sometimes much slower than expected. And one common root cause is that the DAMOS quota is exceeded by the action applying failed memory regions.

For example, a group of users tried to run DAMOS-based proactive memory reclamation (DAMON_RECLAIM) with 100 MiB per second DAMOS quota. They ran it on a system having no active workload, which means all memory of the system is cold. The expectation was that the system will show 100 MiB per second reclamation until (nearly) all memory is reclaimed. But what they found is that the speed is quite inconsistent: sometimes it becomes much slower than expected, sometimes with no reclamation at all for tens of seconds. The upper limit of the speed (100 MiB per second) was being kept as expected, though.

By monitoring the qt_exceeds (number of DAMOS quota exceed events) DAMOS stat, we found DAMOS quota is always exceeded when the speed is slow. By monitoring sz_tried and sz_applied (the total amount of DAMOS action tried memory and succeeded memory) DAMOS stats together, we found the reclamation attempts nearly always failed when the speed is slow.

DAMOS quota charges DAMOS action tried regions regardless of whether the try succeeded. Hence in the example reported case, there was unreclaimable memory spread around the system memory. Sometimes nearly 100 MiB of memory that DAMOS tried to reclaim in the given quota interval was reclaimable, and therefore showed nearly 100 MiB per second speed.
Sometimes nearly 99 MiB of memory that DAMOS was trying to reclaim in the given quota interval was unreclaimable, and therefore showed only about 1 MiB per second reclaim speed.

We explained it is an expected behavior of the feature rather than a bug, as DAMOS quota is there for only the upper limit of the speed. The users agreed and later reported a huge win from the adoption of DAMON_RECLAIM on their products.

It is Not a Bug but a Feature; But...
=====================================

So nothing is broken. DAMOS quota is working as intended, as the upper limit of the speed. It also provides its behavior observability via DAMOS stat. In a real-world production environment that runs long-term active workloads and where stability matters, the speed sometimes being slow is not a real problem.

But the non-deterministic behavior is sometimes annoying, especially in lab environments. Even in a realistic production environment, when there is a huge amount of memory that the DAMOS action cannot be applied to, the speed could be problematically slow. Let's suppose a virtual machine provider that sets up 99% of the host memory as hugetlb pages that cannot be reclaimed, to give it to virtual machines. Also, when aim-oriented DAMOS auto-tuning is applied, this could confuse the internal feedback loop.

The intention of the current behavior was that trying a DAMOS action on regions would anyway impose some overhead, and should therefore be charged somehow. But in the real world, the overhead for a failed action is much lighter than for a successful action. Charging those at the same ratio may be unfair, or at least suboptimal in some environments.

DAMOS Action Failed Region Quota Charge Ratio
=============================================

Let users set the charge ratio for the action-failed memory, for more optimal and deterministic use of DAMOS. It allows users to specify the numerator and the denominator of the ratio for flexible setup.
For example, let's suppose the numerator and the denominator are set to 1 and 4,096, respectively. The ratio is 1 / 4,096. A DAMOS scheme action is applied to 5 GiB of memory. For 1 GiB of the memory, the action succeeds. For the rest (4 GiB), the action fails. Then, only 1 GiB plus 1 MiB (4 GiB / 4,096) of quota is charged.

The optimal charge ratio will depend on the use case and system/workload. I'd recommend starting by setting the numerator to 1 and the denominator to PAGE_SIZE, then tuning based on the results, because many DAMOS actions are applied at page level.

Tests
=====

I tested this feature in the steps below.

1. Allocate 50% of system memory and mlock() it using a test program.
2. Fill up the page cache to exhaust nearly all free memory.
3. Start DAMON-based proactive reclamation with 100 MiB/second DAMOS hard-quota. Auto-tune the DAMOS soft-quota under the hard-quota for achieving 40% free memory of the system with 'temporal' tuner.

For step 1, I run a simple C program that is written by Gemini. It is quite straightforward, so I'm not sharing the code here.

For step 2, I use dd command like below:

    dd if=/dev/zero of=foo bs=1M count=$50_percent_of_system_memory

For step 3, I use the latest version of DAMON user-space tool (damo) like below.

    sudo damo start --damos_action pageout \
        ` # Do the pageout only up to 100 MiB per second ` \
        --damos_quota_space 100M --damos_quota_interval 1s \
        ` # Auto-tune the quota below the hard quota aiming` \
        ` # 40% free memory of the node 0 ` \
        ` # (entire node of the test system)` \
        --damos_quota_goal node_mem_free_bp 40% 0 \
        ` # use temporal tuner, which is easy to understand ` \
        --damos_quota_goal_tuner temporal

As expected, the progress of the reclamation is not consistent, because the quota is exceeded for the failed reclamation of the unreclaimable memory.

I do this again, but with the failed region charge ratio feature.
For this, the above 'damo' command is used, after appending command line option for setup of the charge ratio like below. Note that the option was added to 'damo' after v3.1.9.

    sudo ./damo start --damos_action pageout \
        [...]
        ` # quota-charge only 1/4096 for pageout-failed regions ` \
        --damos_quota_fail_charge_ratio 1 4096

The progress of the reclamation was nearly 100 MiB per second until the goal was achieved, meeting the expectation.

Patches Sequence
================

The first two patches make preparational changes. Patch 1 updates fully charged quota check to handle <min_region_sz remaining quota, which will be able to exist after this series is applied. Patch 2 merges regions after applying schemes is done as long as it is ok to do, since regions split operations for quota could happen much more frequently under a corner case that this series will make available.

Patch 3 implements the feature and exposes it via DAMON core API. Patch 4 implements DAMON sysfs ABI for the feature. Three following patches (5-7) document the feature and ABI on design, usage, and ABI documents, respectively.

Four patches for testing of the new feature follow. Patch 8 implements a kunit test for the feature. Patches 9 and 10 extend DAMON selftest helpers for DAMON sysfs control and internal state dumping for adding a new selftest for the feature. Patch 11 extends existing DAMON sysfs interface selftest to test the new feature using the extended helper scripts.

This patch (of 11):

Less than min_region_sz remaining quota effectively means the quota is fully charged. In other words, no remaining quota. This is because DAMOS actions are applied in the region granularity, and each region should have min_region_sz or larger size. However the existing fully charged quota check, which is also used for setting charge_target_from and charge_addr_from of the quota, is not aware of the case. For the reason, charge_target_from and charge_addr_from of the quota will not be updated in the case.
This can result in DAMOS action being applied more frequently to a specific area of the memory. The case cannot happen today because quota charging is also made in the region granularity. It could be changed in future, though. Actually, the following commit will make the change, by allowing users to set arbitrary quota charging ratio for action-failed regions. To be prepared for the change, update the fully charged quota checks to treat having less than min_region_sz remaining quota as fully charged. Link: https://lore.kernel.org/20260428013402.115171-1-sj@kernel.org Link: https://lore.kernel.org/20260428013402.115171-2-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: David Hildenbrand <david@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
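The updated "fully charged" rule from this patch can be sketched as a small predicate. This is an illustration under assumptions, not the kernel function: quota_is_full() and the parameter name charged_sz are hypothetical, and esz stands for the effective quota size in bytes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical predicate (not the kernel's code): treat the quota as fully
 * charged when it is used up, or when what remains is smaller than the
 * minimum region size, since no region could fit in the remainder.
 */
static bool quota_is_full(uint64_t esz, uint64_t charged_sz,
			  uint64_t min_region_sz)
{
	assert(min_region_sz > 0);

	if (charged_sz >= esz)
		return true;
	return esz - charged_sz < min_region_sz;
}
```

For example, with an 8192-byte quota and a 4096-byte minimum region size, 4097 charged bytes leave 4095 bytes, which this sketch reports as full, while exactly 4096 charged bytes leave room for one more region.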
2026-05-28  mm/damon: add node_eligible_mem_bp goal metric  Ravi Jonnalagadda
Background and Motivation
=========================

In heterogeneous memory systems, controlling memory distribution across NUMA nodes is essential for performance optimization. This patch enables system-wide page distribution with target-state goals such as "maintain 60% of scheme-eligible memory on DRAM" using PA-mode DAMON schemes.

Rather than using absolute thresholds, this metric tracks the ratio of memory that matches each scheme's access pattern filters on a target node, enabling the quota system to automatically adjust migration aggressiveness to maintain the desired distribution.

What This Metric Measures
=========================

node_eligible_mem_bp:
    scheme_eligible_bytes_on_node / total_scheme_eligible_bytes * 10000

Two-Scheme Setup for Hot Page Distribution
==========================================

For maintaining 60% of hot memory on DRAM (node 0) and 40% on CXL (node 1):

    PULL scheme: migrate_hot to node 0
      goal: node_eligible_mem_bp, nid=0, target=6000
      addr filter: node 1 address range (only migrate FROM CXL)
      "Move hot pages to DRAM if less than 60% of hot data is in DRAM"

    PUSH scheme: migrate_hot to node 1
      goal: node_eligible_mem_bp, nid=1, target=4000
      addr filter: node 0 address range (only migrate FROM DRAM)
      "Move hot pages to CXL if less than 40% of hot data is in CXL"

Each scheme independently measures its own eligible memory and adjusts its quota to achieve its target ratio. The schemes work in concert through DAMON's unified monitoring context, with the quota autotuner balancing their relative aggressiveness.

Implementation Details
======================

The implementation adds a new quota goal metric type DAMOS_QUOTA_NODE_ELIGIBLE_MEM_BP to the existing DAMOS quota goal framework. When this metric is configured for a scheme:

1. During each quota adjustment cycle, damos_get_node_eligible_mem_bp() is called to calculate the current memory distribution.

2. The function iterates through all regions that match the scheme's access pattern (via __damos_valid_target()) and calculates:
   - Total eligible bytes across all nodes
   - Eligible bytes specifically on the target node (goal->nid)

3. For each eligible region, damos_calc_eligible_bytes() walks through the physical address range, using damon_get_folio() to look up each folio and determine its NUMA node via folio_nid().

4. Large folios are handled by calculating the exact overlap between the region boundaries and folio boundaries, ensuring accurate byte counts even when regions partially span folios.

5. The ratio (node_eligible / total_eligible * 10000) is returned as basis points, which the quota autotuner uses to adjust the scheme's effective quota size (esz).

The implementation requires CONFIG_DAMON_PADDR since damon_get_folio() is only available for physical address space monitoring.

Testing Results
===============

Functionally tested on a two-node heterogeneous memory system with DRAM (node 0) and CXL memory (node 1). A PUSH+PULL scheme configuration using migrate_hot actions was used to reach a target hot memory ratio between the two tiers.

With the TEMPORAL tuner, the system converges quickly to the target distribution. The tuner drives esz to maximum when under goal and to zero once the goal is met, forming a simple on/off feedback loop that stabilizes at the desired ratio.

With the CONSIST tuner, the scheme still converges but more slowly, as it migrates and then throttles itself based on quota feedback. The time to reach the goal varies depending on workload intensity.

Note: This metric works with both TEMPORAL and CONSIST goal tuners.
Link: https://lore.kernel.org/20260428030520.701-1-ravis.opensrc@gmail.com Signed-off-by: Ravi Jonnalagadda <ravis.opensrc@gmail.com> Suggested-by: SeongJae Park <sj@kernel.org> Reviewed-by: SeongJae Park <sj@kernel.org> Cc: Honggyu Kim <honggyu.kim@sk.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Yunjeong Mun <yunjeong.mun@sk.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
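The overlap and basis-point calculation described in the implementation steps can be sketched in user-space C. This is only an illustration under assumptions: struct folio_span_sketch, overlap_bytes(), and node_eligible_bp() are hypothetical names, folios are modeled as half-open address spans tagged with a node id rather than looked up via damon_get_folio(), and a single eligible region is considered.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a folio: [start, end) on NUMA node nid. */
struct folio_span_sketch {
	uint64_t start;
	uint64_t end;
	int nid;
};

/* Bytes shared by the half-open ranges [rs, re) and [fs, fe). */
static uint64_t overlap_bytes(uint64_t rs, uint64_t re,
			      uint64_t fs, uint64_t fe)
{
	uint64_t lo = rs > fs ? rs : fs;
	uint64_t hi = re < fe ? re : fe;

	return hi > lo ? hi - lo : 0;
}

/*
 * Share of the region's folio-backed bytes that sit on target_nid,
 * in basis points (10000 == 100%). Returns 0 when nothing is eligible.
 */
static unsigned int node_eligible_bp(uint64_t region_start, uint64_t region_end,
				     const struct folio_span_sketch *folios,
				     size_t nr_folios, int target_nid)
{
	uint64_t total = 0, on_node = 0;
	size_t i;

	assert(region_start <= region_end);

	for (i = 0; i < nr_folios; i++) {
		uint64_t bytes = overlap_bytes(region_start, region_end,
					       folios[i].start, folios[i].end);

		total += bytes;
		if (folios[i].nid == target_nid)
			on_node += bytes;
	}
	if (!total)
		return 0;
	return (unsigned int)(on_node * 10000 / total);
}
```

With a region of [0x1000, 0x6000) that partially spans a 16 KiB folio on node 0 at [0x0000, 0x4000) and another on node 1 at [0x4000, 0x8000), the sketch counts 12 KiB on node 0 and 8 KiB on node 1, giving 6000 and 4000 basis points respectively.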
2026-05-28  mm/damon/core: make charge_addr_from aware of end-address exclusivity  SeongJae Park
The end address of a DAMON region is exclusive, but charge_addr_from is assigned as if the end address were inclusive. As a result, the DAMOS action can be skipped for up to min_region_sz of the memory that follows. The user impact is negligible, but the bug is simple to fix. Fix the wrong assignment to respect the exclusiveness of the address. The issue was discovered [1] by Sashiko. Link: https://lore.kernel.org/20260428042942.118230-1-sj@kernel.org Link: https://lore.kernel.org/20260428032324.115663-1-sj@kernel.org [1] Fixes: 50585192bc2e ("mm/damon/schemes: skip already charged targets and regions") Signed-off-by: SeongJae Park <sj@kernel.org> Cc: <stable@vger.kernel.org> # 5.16.x Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
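The off-by-one at the heart of this fix can be illustrated with a half-open range. This is a sketch with hypothetical names (struct addr_range_sketch, resume_addr(), skipped_bytes()), not the kernel's code; it only shows why resuming at end + 1 skips memory, not how the kernel later turns that into a skip of up to min_region_sz.

```c
#include <assert.h>

/* Hypothetical half-open range: start is inclusive, end is exclusive. */
struct addr_range_sketch {
	unsigned long start;
	unsigned long end;
};

/* First address not covered by r: this is r->end itself, not r->end + 1. */
static unsigned long resume_addr(const struct addr_range_sketch *r)
{
	assert(r->start < r->end);
	return r->end;
}

/* Bytes left unhandled if work resumes at 'resume' after finishing r. */
static unsigned long skipped_bytes(const struct addr_range_sketch *r,
				   unsigned long resume)
{
	return resume > r->end ? resume - r->end : 0;
}
```

Resuming at resume_addr() skips nothing, while resuming at end + 1 (the inclusive-end assumption) leaves the byte at end unhandled in this sketch.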
2026-05-28  mm/memory: update stale locking comments for fault handlers  Aditya Sharma
Update the comments for wp_page_copy(), do_wp_page(), do_swap_page(), do_anonymous_page(), __do_fault(), do_fault(), handle_pte_fault(), __handle_mm_fault(), and handle_mm_fault() to concisely clarify that they can be entered holding either the mmap_lock or the VMA lock, and that the lock may be released upon returning VM_FAULT_RETRY.

Additionally, make the following corrections:

- In do_anonymous_page(), correct the outdated claim that the function is entered with the PTE "mapped but not yet locked". Since handle_pte_fault() unmaps the empty PTE before routing to do_pte_missing(), the comment now correctly states it is entered with the PTE unmapped and unlocked.
- In __do_fault(), update the stale reference from __lock_page_retry() to __folio_lock_or_retry().

Link: https://lore.kernel.org/20260424092217.263648-1-adi.sharma@zohomail.in Signed-off-by: Aditya Sharma <adi.sharma@zohomail.in> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Cc: Liam Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-05-28  mm/gup: cleanup pgtable entry accessors  Alexander Gordeev
PMD and PUD entries revalidation has the same semantics as PTE entry revalidation. Convert the remaining direct entry dereferences to the corresponding accessors. The PTE validation in gup_fast_pte_range() is inconsistent with the prior value acquisition in the sense that it drops the lockless access semantics. Use the lockless accessor not only for the PTE, but also for the PMD validation, which is likewise inconsistent with the prior value acquisition in gup_fast_pmd_range(). Link: https://lore.kernel.org/20260421051754.1691221-1-agordeev@linux.ibm.com Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Kevin Brodsky <kevin.brodsky@arm.com> Cc: Peter Xu <peterx@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-05-28  mm/page_alloc: optimize __free_contig_frozen_range()  Muhammad Usama Anjum
Apply the same batch-freeing optimization from free_contig_range() to the frozen page path. The previous __free_contig_frozen_range() freed each order-0 page individually via free_frozen_pages(), which is slow for the same reason the old free_contig_range() was: each page goes to the order-0 pcp list rather than being coalesced into higher-order blocks.

Rewrite __free_contig_frozen_range() to call free_pages_prepare() for each order-0 page, then batch the prepared pages into the largest possible power-of-2 aligned chunks via free_prepared_contig_range(). If free_pages_prepare() fails (e.g. HWPoison, bad page) the page is deliberately not freed; it should not be returned to the allocator.

I've tested CMA through debugfs. The test allocates 16384 pages per allocation for several iterations. There is 3.5x improvement.

Before: 1406 usec per iteration
After: 402 usec per iteration

Before:

    70.89%  0.69%  cma  [kernel.kallsyms]  [.] free_contig_frozen_range
            |
            |--70.20%--free_contig_frozen_range
            |          |
            |          |--46.41%--__free_frozen_pages
            |          |          |
            |          |           --36.18%--free_frozen_page_commit
            |          |                     |
            |          |                      --29.63%--_raw_spin_unlock_irqrestore
            |          |
            |          |--8.76%--_raw_spin_trylock
            |          |
            |          |--7.03%--__preempt_count_dec_and_test
            |          |
            |          |--4.57%--_raw_spin_unlock
            |          |
            |          |--1.96%--__get_pfnblock_flags_mask.isra.0
            |          |
            |           --1.15%--free_frozen_page_commit
            |
             --0.69%--el0t_64_sync

After:

    23.57%  0.00%  cma  [kernel.kallsyms]  [.] free_contig_frozen_range
            |
            ---free_contig_frozen_range
               |
               |--20.45%--__free_contig_frozen_range
               |          |
               |          |--17.77%--free_pages_prepare
               |          |
               |           --0.72%--free_prepared_contig_range
               |                     |
               |                      --0.55%--__free_frozen_pages
               |
                --3.12%--free_pages_prepare

Link: https://lore.kernel.org/20260401101634.2868165-4-usama.anjum@arm.com Signed-off-by: Muhammad Usama Anjum <usama.anjum@arm.com> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Acked-by: Vlastimil Babka (SUSE) <vbabka@kernel.org> Reviewed-by: Zi Yan <ziy@nvidia.com> Suggested-by: Zi Yan <ziy@nvidia.com> Cc: Brendan Jackman <jackmanb@google.com> Cc: David Sterba <dsterba@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Liam Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nick Terrell <terrelln@fb.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: "Uladzislau Rezki (Sony)" <urezki@gmail.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-05-28  vmalloc: optimize vfree with free_pages_bulk()  Ryan Roberts
Whenever vmalloc allocates high order pages (e.g. for a huge mapping) it must immediately split_page() to order-0 so that it remains compatible with users that want to access the underlying struct page. Commit a06157804399 ("mm/vmalloc: request large order pages from buddy allocator") recently made it much more likely for vmalloc to allocate high order pages which are subsequently split to order-0. Unfortunately this had the side effect of causing performance regressions for tight vmalloc/vfree loops (e.g. test_vmalloc.ko benchmarks). See Closes: tag. This happens because the high order pages must be gotten from the buddy but then because they are split to order-0, when they are freed they are freed to the order-0 pcp. Previously allocation was for order-0 pages so they were recycled from the pcp. It would be preferable if when vmalloc allocates an (e.g.) order-3 page that it also frees that order-3 page to the order-3 pcp, then the regression could be removed. So let's do exactly that; update stats separately first as coalescing is hard to do correctly without complexity. Use free_pages_bulk() which uses the new __free_contig_range() API to batch-free contiguous ranges of pfns. This not only removes the regression, but significantly improves performance of vfree beyond the baseline. A selection of test_vmalloc benchmarks running on arm64 server class system. mm-new is the baseline. Commit a06157804399 ("mm/vmalloc: request large order pages from buddy allocator") was added in v6.19-rc1 where we see regressions. Then with this change performance is much better. 
(>0 is faster, <0 is slower, (R)/(I) = statistically significant Regression/Improvement):

+-----------------+----------------------------------------------------------+------------+-------------+
| Benchmark       | Result Class                                             |     mm-new | this series |
+=================+==========================================================+============+=============+
| micromm/vmalloc | fix_align_alloc_test: p:1, h:0, l:500000 (usec)          | 1331843.33 |  (I) 67.17% |
|                 | fix_size_alloc_test: p:1, h:0, l:500000 (usec)           |  415907.33 |      -5.14% |
|                 | fix_size_alloc_test: p:4, h:0, l:500000 (usec)           |  755448.00 |  (I) 53.55% |
|                 | fix_size_alloc_test: p:16, h:0, l:500000 (usec)          | 1591331.33 |  (I) 57.26% |
|                 | fix_size_alloc_test: p:16, h:1, l:500000 (usec)          | 1594345.67 |  (I) 68.46% |
|                 | fix_size_alloc_test: p:64, h:0, l:100000 (usec)          | 1071826.00 |  (I) 79.27% |
|                 | fix_size_alloc_test: p:64, h:1, l:100000 (usec)          | 1018385.00 |  (I) 84.17% |
|                 | fix_size_alloc_test: p:256, h:0, l:100000 (usec)         | 3970899.67 |  (I) 77.01% |
|                 | fix_size_alloc_test: p:256, h:1, l:100000 (usec)         | 3821788.67 |  (I) 89.44% |
|                 | fix_size_alloc_test: p:512, h:0, l:100000 (usec)         | 7795968.00 |  (I) 82.67% |
|                 | fix_size_alloc_test: p:512, h:1, l:100000 (usec)         | 6530169.67 | (I) 118.09% |
|                 | full_fit_alloc_test: p:1, h:0, l:500000 (usec)           |  626808.33 |      -0.98% |
|                 | kvfree_rcu_1_arg_vmalloc_test: p:1, h:0, l:500000 (usec) |  532145.67 |      -1.68% |
|                 | kvfree_rcu_2_arg_vmalloc_test: p:1, h:0, l:500000 (usec) |  537032.67 |      -0.96% |
|                 | long_busy_list_alloc_test: p:1, h:0, l:500000 (usec)     | 8805069.00 |  (I) 74.58% |
|                 | pcpu_alloc_test: p:1, h:0, l:500000 (usec)               |  500824.67 |       4.35% |
|                 | random_size_align_alloc_test: p:1, h:0, l:500000 (usec)  | 1637554.67 |  (I) 76.99% |
|                 | random_size_alloc_test: p:1, h:0, l:500000 (usec)        | 4556288.67 |  (I) 72.23% |
|                 | vm_map_ram_test: p:1, h:0, l:500000 (usec)               |  107371.00 |      -0.70% |
+-----------------+----------------------------------------------------------+------------+-------------+

Link: https://lore.kernel.org/20260401101634.2868165-3-usama.anjum@arm.com Fixes: a06157804399 ("mm/vmalloc: request large order pages from buddy allocator") Closes: https://lore.kernel.org/all/66919a28-bc81-49c9-b68f-dd7c73395a0d@arm.com/ Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Co-developed-by: Muhammad Usama Anjum <usama.anjum@arm.com> Signed-off-by: Muhammad Usama Anjum <usama.anjum@arm.com> Acked-by: Vlastimil Babka (SUSE) <vbabka@kernel.org> Acked-by: Zi Yan <ziy@nvidia.com> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Brendan Jackman <jackmanb@google.com> Cc: David Sterba <dsterba@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Liam Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nick Terrell <terrelln@fb.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-05-28  mm/page_alloc: optimize free_contig_range()  Ryan Roberts
Patch series "mm: Free contiguous order-0 pages efficiently", v6.

A recent change to vmalloc caused some performance benchmark regressions (see [1]). I'm attempting to fix that (and at the same time significantly improve beyond the baseline) by freeing a contiguous set of order-0 pages as a batch.

At the same time I observed that free_contig_range() was essentially doing the same thing as vfree() so I've fixed it there too. While at it, optimize the __free_contig_frozen_range() as well.

Check that the contiguous range falls in the same section. If they aren't enabled, the if conditions get optimized out by the compiler as memdesc_section() returns 0. See num_pages_contiguous() for more details about it.

This patch (of 3):

Decompose the range of order-0 pages to be freed into the set of largest possible power-of-2 size and aligned chunks and free them to the pcp or buddy. This improves on the previous approach which freed each order-0 page individually in a loop. Testing shows performance to be improved by more than 10x in some cases.

Since each page is order-0, we must decrement each page's reference count individually and only consider the page for freeing as part of a high order chunk if the reference count goes to zero. Additionally free_pages_prepare() must be called for each individual order-0 page too, so that the struct page state and global accounting state can be appropriately managed. But once this is done, the resulting high order chunks can be freed as a unit to the pcp or buddy.

This significantly speeds up the free operation but also has the side benefit that high order blocks are added to the pcp instead of each page ending up on the pcp order-0 list; memory remains more readily available in high orders.

vmalloc will shortly become a user of this new optimized free_contig_range() since it aggressively allocates high order non-compound pages, but then calls split_page() to end up with contiguous order-0 pages. These can now be freed much more efficiently.

The execution time of the following function was measured in a server class arm64 machine:

	static int page_alloc_high_order_test(void)
	{
		unsigned int order = HPAGE_PMD_ORDER;
		struct page *page;
		int i;

		for (i = 0; i < 100000; i++) {
			page = alloc_pages(GFP_KERNEL, order);
			if (!page)
				return -1;
			split_page(page, order);
			free_contig_range(page_to_pfn(page), 1UL << order);
		}

		return 0;
	}

Execution time before: 4097358 usec
Execution time after:   729831 usec

Perf trace before:

    99.63%     0.00%  kthreadd  [kernel.kallsyms]  [.] kthread
            |
            ---kthread
               0xffffb33c12a26af8
               |
               |--98.13%--0xffffb33c12a26060
               |          |
               |          |--97.37%--free_contig_range
               |          |          |
               |          |          |--94.93%--___free_pages
               |          |          |          |
               |          |          |          |--55.42%--__free_frozen_pages
               |          |          |          |          |
               |          |          |          |           --43.20%--free_frozen_page_commit
               |          |          |          |                     |
               |          |          |          |                      --35.37%--_raw_spin_unlock_irqrestore
               |          |          |          |
               |          |          |          |--11.53%--_raw_spin_trylock
               |          |          |          |
               |          |          |          |--8.19%--__preempt_count_dec_and_test
               |          |          |          |
               |          |          |          |--5.64%--_raw_spin_unlock
               |          |          |          |
               |          |          |          |--2.37%--__get_pfnblock_flags_mask.isra.0
               |          |          |          |
               |          |          |           --1.07%--free_frozen_page_commit
               |          |          |
               |          |           --1.54%--__free_frozen_pages
               |          |
               |           --0.77%--___free_pages
               |
                --0.98%--0xffffb33c12a26078
                          alloc_pages_noprof

Perf trace after:

     8.42%     2.90%  kthreadd  [kernel.kallsyms]  [k] __free_contig_range
            |
            |--5.52%--__free_contig_range
            |          |
            |          |--5.00%--free_prepared_contig_range
            |          |          |
            |          |          |--1.43%--__free_frozen_pages
            |          |          |          |
            |          |          |           --0.51%--free_frozen_page_commit
            |          |          |
            |          |          |--1.08%--_raw_spin_trylock
            |          |          |
            |          |           --0.89%--_raw_spin_unlock
            |          |
            |           --0.52%--free_pages_prepare
            |
             --2.90%--ret_from_fork
                       kthread
                       0xffffae1c12abeaf8
                       0xffffae1c12abe7a0
                       |
                        --2.69%--vfree
                                  __free_contig_range

Link: https://lore.kernel.org/20260401101634.2868165-1-usama.anjum@arm.com
Link: https://lore.kernel.org/20260401101634.2868165-2-usama.anjum@arm.com
Link: https://lore.kernel.org/all/66919a28-bc81-49c9-b68f-dd7c73395a0d@arm.com [1]
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Co-developed-by: Muhammad Usama Anjum <usama.anjum@arm.com>
Signed-off-by: Muhammad Usama Anjum <usama.anjum@arm.com>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Acked-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Cc: Brendan Jackman <jackmanb@google.com>
Cc: David Sterba <dsterba@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Liam Howlett <liam@infradead.org>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Nick Terrell <terrelln@fb.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: "Uladzislau Rezki (Sony)" <urezki@gmail.com>
Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
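The decomposition step described in the free_contig_range() commit above can be illustrated in userspace. The following is only a sketch of the "largest possible power-of-2 size and aligned chunks" idea, not the kernel implementation: `struct chunk`, `largest_order()`, `decompose_range()` and the `SKETCH_MAX_ORDER` cap are hypothetical names chosen for illustration, and the per-page refcount handling, free_pages_prepare() calls and section checks mentioned in the commit message are deliberately left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical cap on chunk order; the kernel's real limit may differ. */
#define SKETCH_MAX_ORDER 10

/* One power-of-2 sized, naturally aligned run of pages (illustrative type). */
struct chunk {
	unsigned long pfn;
	unsigned int order;
};

/*
 * Largest order such that pfn is aligned to (1 << order) pages and the
 * chunk still fits inside the remaining page count.
 */
static unsigned int largest_order(unsigned long pfn, unsigned long remaining)
{
	unsigned int order = 0;

	assert(remaining > 0);
	while (order < SKETCH_MAX_ORDER) {
		unsigned long next = 1UL << (order + 1);

		/* Stop when the next order is misaligned or too large. */
		if ((pfn & (next - 1)) || next > remaining)
			break;
		order++;
	}
	return order;
}

/*
 * Split [pfn, pfn + nr_pages) into aligned power-of-2 chunks.
 * Returns the number of chunks written; stops early if out[] is full.
 */
static size_t decompose_range(unsigned long pfn, unsigned long nr_pages,
			      struct chunk *out, size_t max_chunks)
{
	size_t n = 0;

	while (nr_pages && n < max_chunks) {
		unsigned int order = largest_order(pfn, nr_pages);

		out[n].pfn = pfn;
		out[n].order = order;
		n++;
		pfn += 1UL << order;
		nr_pages -= 1UL << order;
	}
	return n;
}
```

For example, a 10-page range starting at pfn 3 splits into four chunks under this sketch: one page at pfn 3, four pages at pfn 4, four pages at pfn 8, and one page at pfn 12. A 512-page range starting at pfn 512 stays a single order-9 chunk, which is the case the benchmark in the commit message exercises.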
2026-05-28mm/vmscan: add balance_pgdat begin/end tracepointsBunyod Suvonov
Vmscan has six main reclaim entry points: try_to_free_pages() for direct reclaim, try_to_free_mem_cgroup_pages() for memcg reclaim, mem_cgroup_shrink_node() for memcg soft limit reclaim, node_reclaim() for node reclaim, shrink_all_memory() for hibernation reclaim, and balance_pgdat() for kswapd reclaim. All of them, except for shrink_all_memory() and balance_pgdat(), already have begin/end tracepoints. This makes it harder to trace which reclaim path is responsible for memory reclaim activity, because kswapd reclaim cannot be identified as cleanly as other reclaim entry points, even though it is the main background reclaim path under memory pressure. There may be no need to trace shrink_all_memory() as it is primarily used during hibernation. So this patch adds the missing tracepoint pair for balance_pgdat(). The begin tracepoint records the node id, requested reclaim order, and the requested classzone bound (highest_zoneidx). The end tracepoint records the node id, the reclaim order that balance_pgdat() finished with, the requested classzone bound, and nr_reclaimed. Together, they show the requested reclaim order and classzone bound, whether reclaim fell back to a lower order, and how much reclaim work was done. The end tracepoint also records highest_zoneidx even though it does not change within a balance_pgdat() invocation. This keeps the end event self-contained, so users can analyze reclaim results directly from end events without depending on begin/end correlation, which is less convenient when tracing is filtered or records are dropped. It also makes it straightforward to relate nr_reclaimed and the final reclaim order to the requested classzone bound. 
Link: https://lore.kernel.org/20260424031418.174597-1-b.suvonov@sjtu.edu.cn Link: https://lore.kernel.org/20260423103753.546582-1-b.suvonov@sjtu.edu.cn Signed-off-by: Bunyod Suvonov <b.suvonov@sjtu.edu.cn> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Barry Song <baohua@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <kasong@tencent.com> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Wei Xu <weixugc@google.com> Cc: Yuanchu Xie <yuanchu@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-05-28mm: convert vmemmap_p?d_populate() to static functionsChengkaitao
Since the vmemmap_p?d_populate functions are unused outside the mm subsystem, we can remove their external declarations and convert them to static functions. Link: https://lore.kernel.org/20260423101441.7089-1-kaitao.cheng@linux.dev Signed-off-by: Chengkaitao <chengkaitao@kylinos.cn> Acked-by: David Hildenbrand (arm) <david@kernel.org> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Acked-by: Oscar Salvador <osalvador@suse.de> Cc: David Hildenbrand <david@kernel.org> Cc: Liam Howlett <liam@infradead.org> Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-05-28mm/huge_memory: fix outdated comment about freeing subpages in __folio_splitBarry Song (Xiaomi)
The comment appears to be outdated. add_to_swap() no longer exists, and the explanation of why we need to call put_page() after splitting could be made more general. Link: https://lore.kernel.org/20260423034917.8234-1-baohua@kernel.org Signed-off-by: Barry Song (Xiaomi) <baohua@kernel.org> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Acked-by: Zi Yan <ziy@nvidia.com> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Liam R. Howlett <liam@infradead.org> Cc: Nico Pache <npache@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Lance Yang <lance.yang@linux.dev> Cc: Chris Li <chrisl@kernel.org> Cc: Kairui Song <kasong@tencent.com> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Baoquan He <bhe@redhat.com> Cc: Youngjun Park <youngjun.park@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-05-28Revert "tmpfs: don't enable large folios if not supported"Baolin Wang
This reverts commit 5a90c155defa684f3a21f68c3f8e40c056e6114c.

Currently, when shmem mounts are initialized, they only use 'sbinfo->huge' to determine whether the shmem mount supports large folios. However, for anonymous shmem, whether it supports large folios can be dynamically configured via sysfs interfaces, so setting or not setting mapping_set_large_folios() during initialization cannot accurately reflect whether anonymous shmem actually supports large folios, which has already caused some confusion[1].

Moreover, for tmpfs mounts, relying on 'sbinfo->huge' cannot keep the mapping_set_large_folios() setting consistent across all mappings in the entire tmpfs mount. In other words, under the same tmpfs mount, after remount, we might end up with some mappings supporting large folios (calling mapping_set_large_folios()) while others don't.

After some investigation, I found that the write performance regression addressed by commit 5a90c155defa has already been fixed by the following commit 665575cff098b ("filemap: move prefaulting out of hot write path"). See the following test data:

Base:
dd if=/dev/zero of=/mnt/tmpfs/test bs=400K count=10485 (3.2 GB/s)
dd if=/dev/zero of=/mnt/tmpfs/test bs=800K count=5242 (3.2 GB/s)
dd if=/dev/zero of=/mnt/tmpfs/test bs=1600K count=2621 (3.1 GB/s)
dd if=/dev/zero of=/mnt/tmpfs/test bs=2200K count=1906 (3.0 GB/s)
dd if=/dev/zero of=/mnt/tmpfs/test bs=3000K count=1398 (3.0 GB/s)
dd if=/dev/zero of=/mnt/tmpfs/test bs=4500K count=932 (3.1 GB/s)

Base + revert 5a90c155defa:
dd if=/dev/zero of=/mnt/tmpfs/test bs=400K count=10485 (3.3 GB/s)
dd if=/dev/zero of=/mnt/tmpfs/test bs=800K count=5242 (3.3 GB/s)
dd if=/dev/zero of=/mnt/tmpfs/test bs=1600K count=2621 (3.2 GB/s)
dd if=/dev/zero of=/mnt/tmpfs/test bs=2200K count=1906 (3.1 GB/s)
dd if=/dev/zero of=/mnt/tmpfs/test bs=3000K count=1398 (3.0 GB/s)
dd if=/dev/zero of=/mnt/tmpfs/test bs=4500K count=932 (3.1 GB/s)

The data is basically consistent with minor fluctuation noise.
So we can now safely revert commit 5a90c155defa to set mapping_set_large_folios() for all shmem mounts unconditionally. Link: https://lore.kernel.org/b2c7deee259a94b0d00a7c320d8d24d2c421f761.1776908112.git.baolin.wang@linux.alibaba.com Link: https://lore.kernel.org/all/ec927492-4577-4192-8fad-85eb1bb43121@linux.alibaba.com/ [1] Link: https://lore.kernel.org/all/116df9f9-4db7-40d4-a4a4-30a87c0feffa@linux.alibaba.com/ Fixes: 5a90c155defa ("tmpfs: don't enable large folios if not supported") Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Acked-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: Lance Yang <lance.yang@linux.dev> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Acked-by: Lorenzo Stoakes <ljs@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-05-28mm/page_alloc: replace kernel_init_pages() with batch page clearingHrushikesh Salunke
When init_on_alloc is enabled, kernel_init_pages() clears every page one at a time via clear_highpage_kasan_tagged(), which incurs per-page kmap_local_page()/kunmap_local() overhead and prevents the architecture clearing primitive from operating on contiguous ranges.

Introduce clear_highpages_kasan_tagged() as a static batch clearing helper in page_alloc.c that calls clear_pages() for the full contiguous range on !HIGHMEM systems, bypassing the per-page kmap overhead and allowing a single invocation of the arch clearing primitive across the entire allocation. The HIGHMEM path falls back to per-page clearing since those pages require kmap.

Replace kernel_init_pages() with direct calls to the new helper, as it becomes a trivial wrapper.

Allocating 8192 x 2MB HugeTLB pages (16GB) with init_on_alloc=1:

  Before: 0.445s
  After:  0.166s (-62.7%, 2.68x faster)

Kernel time (sys) reduction per workload with init_on_alloc=1:

  Workload           Before       After        Change
  Graph500 64C128T   30m 41.8s    15m 14.8s    -50.3%
  Graph500 16C32T    15m 56.7s     9m 43.7s    -39.0%
  Pagerank 32T        1m 58.5s     1m 12.8s    -38.5%
  Pagerank 128T       2m 36.3s     1m 40.4s    -35.7%

[hsalunke@amd.com: move clear_highpages_kasan_tagged() to page_alloc.c]
Link: https://lore.kernel.org/20260504063942.553438-1-hsalunke@amd.com
Link: https://lore.kernel.org/20260422102729.166599-1-hsalunke@amd.com
Signed-off-by: Hrushikesh Salunke <hsalunke@amd.com>
Acked-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
Acked-by: Zi Yan <ziy@nvidia.com>
Acked-by: Pankaj Gupta <pankaj.gupta@amd.com>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Acked-by: Lorenzo Stoakes <ljs@kernel.org>
Cc: Ankur Arora <ankur.a.arora@oracle.com>
Cc: Bharata B Rao <bharata@amd.com>
Cc: Brendan Jackman <jackmanb@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Shivank Garg <shivankg@amd.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-05-28mm/cma: fix reserved page leak on activation failureMuchun Song
If cma_activate_area() fails after allocating only part of the range bitmaps, the cleanup path still has to release the reserved pages when CMA_RESERVE_PAGES_ON_ERROR is clear. That is still worth doing even in this __init path. A bitmap_zalloc() failure does not necessarily mean the system cannot make further progress: freeing the reserved CMA pages can return a substantial amount of memory to the buddy allocator and may relieve the temporary memory shortage that caused the allocation failure in the first place. However, the cleanup path currently uses the bitmap-freeing bound for page release as well. That is only correct for ranges whose bitmap allocation already succeeded. The failed range and all later ranges still keep their reserved pages, so a partial bitmap allocation failure can permanently leak them. Fix this by releasing reserved pages for all ranges. Use the saved early_pfn[] value for ranges whose bitmap allocation already succeeded and for the failed range, and use cmr->early_pfn for later ranges whose bitmap allocation was never attempted. Link: https://lore.kernel.org/20260523060123.2207992-1-songmuchun@bytedance.com Fixes: c009da4258f9 ("mm, cma: support multiple contiguous ranges, if requested") Signed-off-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Oscar Salvador (SUSE) <osalvador@kernel.org> Acked-by: Usama Arif <usama.arif@linux.dev> Cc: David Hildenbrand <david@kernel.org> Cc: Frank van der Linden <fvdl@google.com> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-05-28mm/memory-failure: fix hugetlb_lock AA deadlock in get_huge_page_for_hwpoisonWupeng Ma
Two concurrent madvise(MADV_HWPOISON) calls on the same hugetlb page can trigger a recursive spinlock self-deadlock (AA deadlock) on hugetlb_lock when racing with a concurrent unmap:

  thread#0                            thread#1
  --------                            --------
  madvise(folio, MADV_HWPOISON)
    -> poisons the folio successfully
                                      madvise(folio, MADV_HWPOISON)
  unmap(folio)                          try_memory_failure_hugetlb
                                          get_huge_page_for_hwpoison
                                            spin_lock_irq(&hugetlb_lock)  <- held
                                            __get_huge_page_for_hwpoison
                                              hugetlb_update_hwpoison()
                                                -> MF_HUGETLB_FOLIO_PRE_POISONED
                                              goto out:
                                                folio_put()
                                                  refcount: 1 -> 0
                                                  free_huge_folio()
                                                    spin_lock_irqsave(&hugetlb_lock)
                                                    -> AA DEADLOCK!

The out: path in __get_huge_page_for_hwpoison() calls folio_put() to drop the GUP reference while the hugetlb_lock is still held by the hugetlb.c wrapper get_huge_page_for_hwpoison(). If concurrent unmap has released the page table mapping reference, folio_put() drops the folio refcount to zero, triggering free_huge_folio() which attempts to re-acquire the non-recursive hugetlb_lock.

Fix this by moving hugetlb_lock acquisition from the hugetlb.c wrapper into get_huge_page_for_hwpoison(). Place spin_unlock_irq() before the folio_put() at the out: label so the folio is always released outside the lock.
[akpm@linux-foundation.org: fix race, rename label per Miaohe] Link: https://sashiko.dev/#/patchset/20260522010305.4099834-1-mawupeng1@huawei.com Link: https://lore.kernel.org/f39f405e-4b4b-8f79-70fe-a2b5b62114eb@huawei.com Link: https://lore.kernel.org/20260522010305.4099834-1-mawupeng1@huawei.com Fixes: 405ce051236c ("mm/hwpoison: fix race between hugetlb free/demotion and memory_failure_hugetlb()") Signed-off-by: Wupeng Ma <mawupeng1@huawei.com> Acked-by: Oscar Salvador (SUSE) <osalvador@kernel.org> Acked-by: Muchun Song <muchun.song@linux.dev> Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com> Acked-by: Miaohe Lin <linmiaohe@huawei.com> Cc: David Hildenbrand <david@kernel.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
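The ordering rule behind the hugetlb_lock fix above (drop the reference only after releasing the lock that the free path needs) can be shown with a single-threaded toy model. This is a sketch under stated assumptions, not the kernel code: `toy_lock`, `toy_folio`, `toy_folio_put()` and `toy_handle_poison()` are hypothetical names, and the "lock" is just a flag whose acquire helper asserts that it is not already held, which is how a self-deadlock would show up here.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy non-recursive lock: acquiring it twice trips the assertion. */
struct toy_lock {
	bool held;
};

/* Toy stand-in for a refcounted folio (hypothetical type). */
struct toy_folio {
	int refcount;
	bool freed;
};

static struct toy_lock pool_lock;

static void toy_lock_acquire(struct toy_lock *l)
{
	assert(!l->held);	/* would be the AA deadlock in real code */
	l->held = true;
}

static void toy_lock_release(struct toy_lock *l)
{
	assert(l->held);
	l->held = false;
}

/* Dropping the last reference frees the folio, which needs pool_lock. */
static void toy_folio_put(struct toy_folio *f)
{
	assert(f->refcount > 0);
	if (--f->refcount == 0) {
		toy_lock_acquire(&pool_lock);
		f->freed = true;
		toy_lock_release(&pool_lock);
	}
}

/*
 * Decide under the lock whether the reference must be dropped, but only
 * perform the put after the lock has been released.
 */
static int toy_handle_poison(struct toy_folio *f, bool already_poisoned)
{
	bool drop_ref = false;
	int ret = 0;

	toy_lock_acquire(&pool_lock);
	if (already_poisoned) {
		ret = -1;
		drop_ref = true;
	}
	toy_lock_release(&pool_lock);

	if (drop_ref)
		toy_folio_put(f);	/* safe: pool_lock is no longer held */
	return ret;
}
```

Moving the `toy_folio_put()` call above the `toy_lock_release()` line would trip the assertion in `toy_lock_acquire()` whenever the put drops the last reference, which mirrors the failure mode the commit message describes.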
2026-05-28mm/hugetlb: restore reservation on error in hugetlb folio copy pathsDavid Carlier
Two sites in mm/hugetlb.c allocate a hugetlb folio via alloc_hugetlb_folio() (consuming a VMA reservation) and then call copy_user_large_folio(), which became int-returning in commit 1cb9dc4b475c ("mm: hwpoison: support recovery from HugePage copy-on-write faults") and can now fail (e.g. -EHWPOISON on a hwpoisoned source page). On the failure path, folio_put() restores the global hugetlb pool count through free_huge_folio(), but the per-VMA reservation map entry is left marked consumed:

 - hugetlb_mfill_atomic_pte() resubmission path (UFFDIO_COPY)
 - copy_hugetlb_page_range() fork-time CoW path when hugetlb_try_dup_anon_rmap() fails (rare: pinned hugetlb anon folio under fork)

User-visible effect: on UFFDIO_COPY into a private hugetlb VMA where the resubmission copy fails, the reservation for that address is leaked from the VMA's reserve map. A subsequent fault at the same address takes the no-reservation path, and under hugetlb pool pressure the task is SIGBUSed at an address it had previously reserved. The fork-time CoW path leaks the same way in the child VMA's reserve map, though it requires the much rarer combination of pinned hugetlb anon page + hwpoisoned source.

Add the missing restore_reserve_on_error() call before folio_put() on both error paths.

Link: https://lore.kernel.org/20260520044912.6751-1-devnexen@gmail.com
Fixes: 1cb9dc4b475c ("mm: hwpoison: support recovery from HugePage copy-on-write faults")
Signed-off-by: David Carlier <devnexen@gmail.com>
Reviewed-by: Muchun Song <muchun.song@linux.dev>
Cc: David Hildenbrand <david@kernel.org>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: yuehaibing <yuehaibing@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-05-28mm/cma_debug: fix invalid accesses for inactive CMA areasMuchun Song
cma_activate_area() can fail after allocating range bitmaps. Its cleanup path frees those bitmaps, but only clears cma->count and cma->available_count. It leaves cma->nranges and each range's count in place, so cma_debugfs_init() can still register debugfs files for an area that never activated successfully. That exposes two problems. Reading the bitmap file can make debugfs walk a freed range bitmap and trigger an invalid memory access. Reading maxchunk can also take cma->lock even though that lock is initialized only on the successful activation path. Fix this by creating debugfs entries only for CMA areas that reached CMA_ACTIVATED. c009da4258f9 introduced the invalid access to bitmap file. 2e32b947606d introduced the invalid access to cma->lock. This change applies to both issues. So I added two Fixes tags. Link: https://lore.kernel.org/20260520061025.3971821-1-songmuchun@bytedance.com Fixes: c009da4258f9 ("mm, cma: support multiple contiguous ranges, if requested") Fixes: 2e32b947606d ("mm: cma: add functions to get region pages counters") Signed-off-by: Muchun Song <songmuchun@bytedance.com> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Acked-by: Oscar Salvador (SUSE) <osalvador@kernel.org> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Cc: Dmitry Safonov <0x7f454c46@gmail.com> Cc: Frank van der Linden <fvdl@google.com> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Stefan Strogin <stefan.strogin@gmail.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-05-28memcg: use round-robin victim selection in refill_stockShakeel Butt
Harry Yoo reported that get_random_u32_below() is not safe to call in the nmi context and memcg charge draining can happen in nmi context. More specifically get_random_u32_below() is neither reentrant- nor NMI-safe: it acquires a per-cpu local_lock via local_lock_irqsave() on the batched_entropy_u32 state. An NMI that lands on a CPU mid-update of the ChaCha batch state and recurses into the random subsystem would corrupt that state. The memcg_stock local_trylock prevents re-entry on the percpu stock itself, but cannot protect an unrelated subsystem's per-cpu lock. Replace the random pick with a per-cpu round-robin counter stored in memcg_stock_pcp and serialized by the same local_trylock that already guards cached[] and nr_pages[]. No atomics, no random calls, no extra locks needed. Link: https://lore.kernel.org/20260521223751.3794625-1-shakeel.butt@linux.dev Fixes: f735eebe55f8f ("memcg: multi-memcg percpu charge cache") Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Reported-by: Harry Yoo <harry@kernel.org> Closes: https://lore.kernel.org/4e20f643-6983-4b6e-b12d-c6c4eb20ae0c@kernel.org/ Acked-by: Harry Yoo (Oracle) <harry@kernel.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
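The round-robin replacement described in the refill_stock commit above can be sketched as follows. This is a minimal userspace illustration, not the memcg code: `struct stock_sketch`, `pick_slot()`, the `NR_SLOTS` value and the use of plain integers as owner ids are all assumptions made for the example, and the caller is assumed to already hold whatever lock protects the structure (the commit message says the real counter is serialized by the existing local_trylock).

```c
#include <assert.h>

/* Hypothetical number of cache slots per CPU. */
#define NR_SLOTS 4

/* Illustrative per-CPU cache; owner id 0 marks an empty slot. */
struct stock_sketch {
	int cached[NR_SLOTS];
	unsigned int nr_pages[NR_SLOTS];
	unsigned int next_victim;	/* round-robin cursor */
};

/*
 * Pick the slot to use for 'owner'. Preference order: a slot already
 * caching this owner, then the first empty slot, then a victim chosen
 * by advancing the round-robin cursor. No random number generator and
 * no extra lock is involved; the caller's existing lock covers the
 * cursor update.
 */
static int pick_slot(struct stock_sketch *s, int owner)
{
	int empty = -1;
	int victim;
	int i;

	assert(owner != 0);	/* 0 is reserved for "empty" */

	for (i = 0; i < NR_SLOTS; i++) {
		if (s->cached[i] == owner)
			return i;
		if (s->cached[i] == 0 && empty < 0)
			empty = i;
	}
	if (empty >= 0)
		return empty;

	victim = (int)s->next_victim;
	s->next_victim = (s->next_victim + 1) % NR_SLOTS;
	return victim;
}
```

With all four slots occupied by other owners, successive calls for a new owner evict slots 0, 1, 2, 3 and then wrap back to 0, so eviction pressure is spread evenly without touching the random subsystem.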
2026-05-28mm/hugetlb: avoid false positive lockdep assertionLorenzo Stoakes
Commit 081056dc00a2 ("mm/hugetlb: unshare page tables during VMA split, not before") changed the locking model around hugetlbfs PMD unsharing on VMA split, but did not update the function which asserts the locks, hugetlb_vma_assert_locked().

This function asserts that either the hugetlb VMA lock is held (if a shared mapping) or that the reservation map lock is held (if private).

If you get an unfortunate race between something which results in one of these locks being released and a hugetlb VMA split and you have CONFIG_LOCKDEP enabled, you can therefore see a false positive assertion arise when there is in fact no issue.

Since this change introduced a new take_locks parameter to hugetlb_unshare_pmds(), which, when set to false, indicates that locking is sufficient, simply pass this to the unsharing logic and predicate the lock assertions on this.

This is safe, as we already asserted the file rmap lock and the VMA write lock prior to this (implying exclusive mmap write lock), so we cannot be raced by either rmap or page fault page table walkers which the asserted locks are intended to protect against (we don't mind GUP-fast).

Separate out huge_pmd_unshare() into __huge_pmd_unshare() to add a check_locks parameter, and update hugetlb_unshare_pmds() to pass this parameter to it.

This leaves all other callers of huge_pmd_unshare() still correctly asserting the locks.

The below reproducer will trigger the assert in a kernel with CONFIG_LOCKDEP enabled by racing process teardown (which will release the hugetlb lock) against a hugetlb split.

	void execute_one(void)
	{
		void *ptr;
		pid_t pid;

		/*
		 * Create a hugetlb mapping spanning a PUD entry.
		 *
		 * We force the hugetlb page allocation with populate and
		 * noreserve.
		 *
		 * |---------------------|
		 * |                     |
		 * |---------------------|
		 * 0                 PUD boundary
		 */
		ptr = mmap(0, PUD_SIZE, PROT_READ | PROT_WRITE,
			   MAP_FIXED | MAP_SHARED | MAP_ANON | MAP_NORESERVE |
			   MAP_HUGETLB | MAP_POPULATE, -1, 0);
		if (ptr == MAP_FAILED) {
			perror("mmap");
			exit(EXIT_FAILURE);
		}

		/*
		 * Fork but with a bogus stack pointer so we try to execute code in
		 * a non-VM_EXEC VMA, causing segfault + teardown via exit_mmap().
		 *
		 * The clone will cause PMD page table sharing between the
		 * processes first via:
		 * copy_process() -> ... -> huge_pte_alloc() -> huge_pmd_share()
		 *
		 * Then tear down and release the hugetlb 'VMA' lock via:
		 * exit_mmap() -> ... -> vma_close() -> hugetlb_vma_lock_free()
		 */
		pid = syscall(__NR_clone, 0, 2 * PMD_SIZE, 0, 0, 0);
		if (pid < 0) {
			perror("clone");
			exit(EXIT_FAILURE);
		}
		if (pid == 0) {
			/* Pop stack... */
			return;
		}

		/*
		 * We are the parent process.
		 *
		 * Race the child process's teardown with a PMD unshare.
		 *
		 * We do this by triggering:
		 *
		 * __split_vma() -> hugetlb_split() -> hugetlb_unshare_pmds()
		 *
		 * Which, importantly, doesn't hold the hugetlb VMA lock (nor can
		 * it), meaning we assert in hugetlb_vma_assert_locked().
		 *
		 *            .
		 * |----------.----------|
		 * |          .          |
		 * |----------.----------|
		 * 0          .      PUD boundary
		 */
		mmap(0, PUD_SIZE / 2, PROT_READ | PROT_WRITE,
		     MAP_FIXED | MAP_ANON | MAP_PRIVATE, -1, 0);
	}

	int main(void)
	{
		int i;

		/* Kick off fork children. */
		for (i = 0; i < NUM_FORKS; i++) {
			pid_t pid = fork();

			if (pid < 0) {
				perror("fork");
				exit(EXIT_FAILURE);
			}

			/* Fork children do their work and exit. */
			if (!pid) {
				int j;

				for (j = 0; j < NUM_ITERS; j++)
					execute_one();

				return EXIT_SUCCESS;
			}
		}

		/* If we succeeded, wait on children. */
		for (i = 0; i < NUM_FORKS; i++)
			wait(NULL);

		return EXIT_SUCCESS;
	}

[ljs@kernel.org: account for the !CONFIG_HUGETLB_PMD_PAGE_TABLE_SHARING case]
Link: https://lore.kernel.org/agWZsPGYid08uU6O@lucifer
Link: https://lore.kernel.org/20260513085658.45264-1-ljs@kernel.org
Fixes: 081056dc00a2 ("mm/hugetlb: unshare page tables during VMA split, not before")
Signed-off-by: Lorenzo Stoakes <ljs@kernel.org>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Acked-by: Oscar Salvador <osalvador@suse.de>
Cc: Jann Horn <jannh@google.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-05-28mm/slub: fix typo in sheaves commentWilson Zeng
Fix a typo in the comment describing oversize sheaves handling: "area" should be "are". Signed-off-by: Wilson Zeng <cheng20011202@gmail.com> Acked-by: Harry Yoo (Oracle) <harry@kernel.org> Link: https://patch.msgid.link/20260516164033.1566208-1-cheng20011202@gmail.com Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
2026-05-28mm, slab: simplify returning slab in __refill_objects_node()Vlastimil Babka (SUSE)
When we return slabs to the partial list because we didn't fully refill from them, we observe the min_partial limit when the returned slab is empty, and discard it when over the limit. But it's unlikely for the limit to be reached while we were refilling, and the worst outcome is to have temporarily more free slabs on the list than necessary. So just drop that code and simplify the function. Link: https://patch.msgid.link/20260522-b4-refill-optimistic-return-v3-2-2ba78ec1c6ed@kernel.org Reviewed-by: Hao Li <hao.li@linux.dev> Tested-by: Hao Li <hao.li@linux.dev> Reviewed-by: Harry Yoo (Oracle) <harry@kernel.org> Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
2026-05-28mm, slab: add an optimistic __slab_try_return_freelist()Vlastimil Babka (SUSE)
When we end up returning extraneous objects during refill to a slab where we just did a get_freelist_nofreeze(), it is likely no other CPU has freed objects to it meanwhile. We can then reattach the remainder of the freelist without having to walk the (potentially cache cold) freelist for finding its tail to connect slab->freelist to it. Add a __slab_try_return_freelist() function that does that. As suggested by Hao Li, it doesn't need to also return the slab to the partial list, because there's code in __refill_objects_node() that already does that for any slabs where we don't detach the freelist in the first place. So we just put the slab back to the pc.slabs list. It's no longer likely that the list will be empty now, so remove the unlikely() annotation. However, also change that code to add to the tail of the partial list instead of head to match what __slab_free() did and avoid a regression, that was reported for the earlier version by the kernel test robot [1]. This change will also affect slabs which were grabbed from the partial list and not refilled from even partially, but those should be much more rare than a partial refill. [1] https://lore.kernel.org/all/202605112204.9382cecf-lkp@intel.com/ Reviewed-by: Hao Li <hao.li@linux.dev> Tested-by: Hao Li <hao.li@linux.dev> Reviewed-by: Harry Yoo (Oracle) <harry@kernel.org> Link: https://patch.msgid.link/20260522-b4-refill-optimistic-return-v3-1-2ba78ec1c6ed@kernel.org Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
2026-05-26Merge tag 'mm-hotfixes-stable-2026-05-25-16-22' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull misc fixes from Andrew Morton:
 "13 hotfixes. 9 are for MM. 9 are cc:stable and the remaining 4 address post-7.1 issues or aren't considered suitable for backporting.

  All patches are singletons - please see the individual changelogs for details"

* tag 'mm-hotfixes-stable-2026-05-25-16-22' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  Revert "mm: introduce a new page type for page pool in page type"
  mm/vmalloc: do not trigger BUG() on BH disabled context
  MAINTAINERS, mailmap: change email for Eugen Hristev
  mm/migrate_device: fix pgtable leak in migrate_vma_insert_huge_pmd_page
  kernel/fork: validate exit_signal in kernel_clone()
  mm: memcontrol: propagate NMI slab stats to memcg vmstats
  mm/damon/sysfs-schemes: delete tried region in regions_rmdirs()
  mm/rmap: initialize nr_pages to 1 at loop start in try_to_unmap_one
  zram: fix use-after-free in zram_writeback_endio
  memfd: deny writeable mappings when implying SEAL_WRITE
  ipc: limit next_id allocation to the valid ID range
  Revert "mm/hugetlbfs: update hugetlbfs to use mmap_prepare"
  MAINTAINERS: .mailmap: update after GEHC spin-off
2026-05-26exec_state: relocate dumpable informationChristian Brauner (Amutable)
The dumpable flag captured at execve() is consulted by __ptrace_may_access() and several /proc owner / visibility checks. It lives on mm_struct today, which exit_mm() clears from the task long before the task itself is reaped. exec_state is anchored to the execve() that established the current privilege domain. CLONE_VM siblings refcount-share the parent's exec_state via copy_exec_state(); non-CLONE_VM clones allocate a fresh exec_state inheriting the parent's dumpable mode and user_ns reference via task_exec_state_copy(). execve() allocates a fresh instance (via alloc_task_exec_state() in begin_new_exec()) and installs it under task_lock + exec_update_lock with task_exec_state_replace(). init_task uses a static instance. The dumpable mode now lives on task->exec_state->dumpable. task->mm->flags no longer carries dumpability; MMF_DUMPABLE_MASK is removed, but MMF_DUMPABLE_BITS is reserved so MMF_DUMP_FILTER_* bit positions remain stable for the /proc/<pid>/coredump_filter ABI. The task->user_dumpable cache bit and its assignment in exit_mm() are removed; readers go through get_dumpable(task) directly. coredump_params gains a snapshot field cprm.dumpable, populated from get_dumpable(current) at vfs_coredump() entry, replacing the previous __get_dumpable(cprm->mm_flags) consumers in fs/coredump.c and fs/pidfs.c. The user namespace recorded at execve() is consulted by __ptrace_may_access() and by /proc/PID/* owner derivation. Move the captured user_ns onto task_exec_state, which stays attached to the task past exit_mm() and across exit_files(). bprm grows a user_ns field staged in bprm_mm_init() with the caller's user_ns, narrowed by would_dump() to the closest privileged ancestor, and consumed by exec_mmap() via alloc_task_exec_state(bprm->user_ns). free_bprm() releases the staging reference. mm_struct loses ->user_ns entirely. 
Initializers in init-mm, efi_mm, and the implicit one in mm_init()/dup_mm()/mm_alloc() are removed; __mmdrop() drops the matching put_user_ns(). The kthread_use_mm() WARN_ON_ONCE(!mm->user_ns) is no longer meaningful and goes too. Reviewed-by: Jann Horn <jannh@google.com> Link: https://patch.msgid.link/20260520-work-task_exec_state-v3-4-69f895bc1385@kernel.org Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
2026-05-25memblock: don't touch memblock arrays when memblock_free() is called lateMike Rapoport (Microsoft)
When memblock_free() is called after memblock_discard() on architectures that don't select ARCH_KEEP_MEMBLOCK, it tries to update memblock.reserved that was already discarded and it causes use-after-free, for example [ 8.514775] BUG: KASAN: use-after-free in memblock_isolate_range+0x4ac/0x650 [ 8.514775] Read of size 8 at addr ffff88a07fe6a000 by task swapper/0/1 [ 8.514775] Call Trace: [ 8.514775] <TASK> [ 8.514775] kasan_report+0xb2/0x1b0 [ 8.514775] memblock_isolate_range+0x4ac/0x650 [ 8.514775] memblock_phys_free+0xc4/0x190 [ 8.514775] housekeeping_late_init+0x257/0x280 [ 8.514775] do_one_initcall+0xaa/0x470 [ 8.514775] do_initcalls+0x1b4/0x1f0 [ 8.514775] kernel_init_freeable+0x4b5/0x550 [ 8.514775] kernel_init+0x1c/0x150 [ 8.514775] ret_from_fork+0x5dc/0x8e0 [ 8.514775] ret_from_fork_asm+0x1a/0x30 [ 8.514775] </TASK> Make sure memblock_free() updates memblock.reserved only when called early enough or when ARCH_KEEP_MEMBLOCK is enabled. Reported-by: Waiman Long <longman@redhat.com> Reported-by: Breno Leitao <leitao@debian.org> Closes: https://lore.kernel.org/all/20260505051821.1107133-1-longman@redhat.com Tested-by: Waiman Long <longman@redhat.com> Tested-by: Breno Leitao <leitao@debian.org> Fixes: 87ce9e83ab8b ("memblock, treewide: make memblock_free() handle late freeing") Link: https://patch.msgid.link/20260513105122.502506-1-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
2026-05-25Merge tag 'v7.1-rc5' into driver-core-nextDanilo Krummrich
We need the driver-core fixes in here as well to build on top of. Signed-off-by: Danilo Krummrich <dakr@kernel.org>
2026-05-22Merge tag 'cgroup-for-7.1-rc4-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup fixes from Tejun Heo: "Two rstat fixes: - Out-of-bounds access in the css_rstat_updated() BPF kfunc when called with an unchecked user-supplied cpu - Over-strict NMI guard after the recent switch to try_cmpxchg left sparc and ppc64 unable to queue rstat updates from NMI" * tag 'cgroup-for-7.1-rc4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup: rstat: relax NMI guard after switch to try_cmpxchg cgroup/rstat: validate cpu before css_rstat_cpu() access
2026-05-22Merge tag 'slab-for-7.1-rc4' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab Pull slab fix from Vlastimil Babka: - Stable fix for a missing cpus_read_lock in one of the cpu sheaves flushing paths (Qing Wang) * tag 'slab-for-7.1-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab: mm/slub: hold cpus_read_lock around flush_rcu_sheaves_on_cache()
2026-05-21Revert "mm: introduce a new page type for page pool in page type"Byungchul Park
This reverts commit db359fccf212 ("mm: introduce a new page type for page pool in page type") and a part of 735a309b4bfb9e ("net: add net_iov_init() and use it to initialize ->page_type"). Netpp page_type'ed pages might be used in mapping so as to use @_mapcount. However, since @page_type and @_mapcount are union'ed in struct page, these two can't be used at the same time. Revert the commit introducing page_type for Netpp for now. The patch will be retried once @page_type and @_mapcount get allowed to be used at the same time. The revert also includes removal of @page_type initialization part introduced by commit 735a309b4bfb9e ("net: add net_iov_init() and use it to initialize ->page_type"), which will be restored on the retry. Link: https://lore.kernel.org/20260515034701.17027-1-byungchul@sk.com Fixes: db359fccf212 ("mm: introduce a new page type for page pool in page type") Signed-off-by: Byungchul Park <byungchul@sk.com> Reported-by: Dragos Tatulea <dtatulea@nvidia.com> Closes: https://lore.kernel.org/all/982b9bc1-0a0a-4fc5-8e3a-3672db2b29a1@nvidia.com Acked-by: Jakub Kicinski <kuba@kernel.org> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Acked-by: Harry Yoo (Oracle) <harry@kernel.org> Reviewed-by: Lorenzo Stoakes <ljs@kernel.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Brendan Jackman <jackmanb@google.com> Cc: David S. Miller <davem@davemloft.net> Cc: Eric Dumazet <edumazet@google.com> Cc: Ilias Apalodimas <ilias.apalodimas@linaro.org> Cc: Jesper Dangaard Brouer <hawk@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: John Fastabend <john.fastabend@gmail.com> Cc: Leon Romanovsky <leon@kernel.org> Cc: Liam R. 
Howlett <liam@infradead.org> Cc: Mark Bloch <mbloch@nvidia.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Paolo Abeni <pabeni@redhat.com> Cc: Pavel Begunkov <asml.silence@gmail.com> Cc: Saeed Mahameed <saeedm@nvidia.com> Cc: Simon Horman <horms@kernel.org> Cc: Stanislav Fomichev <sdf@fomichev.me> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Tariq Toukan <tariqt@nvidia.com> Cc: Toke Hoiland-Jorgensen <toke@redhat.com> Cc: Vlastimil Babka <vbabka@kernel.org> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-05-21mm/vmalloc: do not trigger BUG() on BH disabled contextUladzislau Rezki (Sony)
__get_vm_area_node() currently triggers a BUG() if in_interrupt() returns true. However, in_interrupt() also reports true when BH are disabled. The bridge code can call rhashtable_lookup_insert_fast() with bottom halves disabled: __vlan_add() -> br_fdb_add_local() spin_lock_bh(&br->hash_lock); <-- Disable BH -> fdb_add_local() -> fdb_create() -> rhashtable_lookup_insert_fast() -> kvmalloc() -> vmalloc() -> __get_vm_area_node() -> BUG_ON(in_interrupt()) spin_unlock_bh(&br->hash_lock) this triggers the BUG() despite the caller not being in NMI or hard IRQ context. Replace the in_interrupt() check with in_nmi() || in_hardirq(). Link: https://lore.kernel.org/20260515153009.2296191-1-urezki@gmail.com Fixes: c6307674ed82 ("mm: kvmalloc: add non-blocking support for vmalloc") Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Ido Schimmel <idosch@nvidia.com> Reported-by: syzbot+8b12fc6e0fb139765b58@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/69ff8c7c.050a0220.1036b8.000b.GAE@google.com/ Reviewed-by: Baoquan He <baoquan.he@linux.dev> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
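The difference between the two checks can be sketched with a small userspace model of a preempt-count word. This is only an illustration: the bit widths, the macro values and every `model_*` helper are assumptions made for the sketch, not the kernel's definitions. The point it shows is that disabling bottom halves sets bits in the softirq field, so a check that covers the softirq field fires, while a check limited to the NMI and hardirq fields does not.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative layout of a preempt-count word; the widths are assumptions. */
#define PREEMPT_BITS  8
#define SOFTIRQ_BITS  8
#define HARDIRQ_BITS  4
#define NMI_BITS      4

#define SOFTIRQ_SHIFT PREEMPT_BITS
#define HARDIRQ_SHIFT (SOFTIRQ_SHIFT + SOFTIRQ_BITS)
#define NMI_SHIFT     (HARDIRQ_SHIFT + HARDIRQ_BITS)

#define FIELD_MASK(bits, shift) (((1u << (bits)) - 1u) << (shift))
#define SOFTIRQ_MASK  FIELD_MASK(SOFTIRQ_BITS, SOFTIRQ_SHIFT)
#define HARDIRQ_MASK  FIELD_MASK(HARDIRQ_BITS, HARDIRQ_SHIFT)
#define NMI_MASK      FIELD_MASK(NMI_BITS, NMI_SHIFT)

#define SOFTIRQ_OFFSET         (1u << SOFTIRQ_SHIFT)
#define SOFTIRQ_DISABLE_OFFSET (2u * SOFTIRQ_OFFSET)
#define HARDIRQ_OFFSET         (1u << HARDIRQ_SHIFT)

/* Model of disabling bottom halves: bumps the softirq field. */
static inline unsigned int model_bh_disable(unsigned int count)
{
    return count + SOFTIRQ_DISABLE_OFFSET;
}

/* Model of entering a hard interrupt handler. */
static inline unsigned int model_hardirq_enter(unsigned int count)
{
    return count + HARDIRQ_OFFSET;
}

/* Old-style check: fires for NMI, hardirq and anything in the softirq field. */
static inline bool model_old_check_fires(unsigned int count)
{
    return (count & (NMI_MASK | HARDIRQ_MASK | SOFTIRQ_MASK)) != 0;
}

/* New-style check: fires only for NMI or hardirq context. */
static inline bool model_new_check_fires(unsigned int count)
{
    return (count & (NMI_MASK | HARDIRQ_MASK)) != 0;
}
```

Feeding `model_bh_disable(0)` to both predicates reproduces the situation from the bridge call chain: the old predicate reports true, the new one reports false.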
2026-05-21mm/migrate_device: fix pgtable leak in migrate_vma_insert_huge_pmd_pageSunny Patel
When migrate_vma_insert_huge_pmd_page() jumps to unlock_abort due to a PMD check failure, the pgtable allocated earlier via pte_alloc_one() is never freed, causing a memory leak. Add a free_abort label to release the pgtable in the error path. Link: https://lore.kernel.org/20260501115122.23288-1-nueralspacetech@gmail.com Fixes: a30b48bf1b24 ("mm/migrate_device: implement THP migration of zone device pages") Signed-off-by: Sunny Patel <nueralspacetech@gmail.com> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Huang Ying <ying.huang@linux.alibaba.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Balbir Singh <balbirs@nvidia.com> Cc: Byungchul Park <byungchul@sk.com> Cc: Gregory Price <gourry@gourry.net> Cc: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Rakie Kim <rakie.kim@sk.com> Cc: Zi Yan <ziy@nvidia.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
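The shape of such a fix can be sketched as a userspace model of goto-based unwinding. Only the label names unlock_abort and free_abort come from the commit message; the function body, the `prep_ok` and `pmd_ok` parameters and all `model_*` names are hypothetical. Each label undoes one acquisition, and the unlock path falls through into the free path, so no exit can skip releasing the page table.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

static int live_pgtables;  /* page tables allocated and not yet freed */
static bool lock_held;     /* stands in for the page table lock */

static inline void *model_pte_alloc_one(void)
{
    void *p = malloc(64);

    if (p)
        live_pgtables++;
    return p;
}

static inline void model_pte_free(void *p)
{
    free(p);
    live_pgtables--;
}

/* Returns 0 on success and -1 on abort. All parameters are hypothetical. */
static inline int model_insert_huge_pmd(bool prep_ok, bool pmd_ok, void **installed)
{
    void *pgtable;

    assert(installed != NULL);

    pgtable = model_pte_alloc_one();
    if (!pgtable)
        goto abort;

    if (!prep_ok)               /* failure before the lock is taken */
        goto free_abort;

    lock_held = true;
    if (!pmd_ok)                /* PMD check failed under the lock */
        goto unlock_abort;

    *installed = pgtable;       /* ownership moves to the caller */
    lock_held = false;
    return 0;

unlock_abort:
    lock_held = false;
    /* fall through so the page table is released as well */
free_abort:
    model_pte_free(pgtable);
abort:
    return -1;
}
```

After any failing call, `live_pgtables` is back to its previous value and `lock_held` is false, which is the invariant the leak violated.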
2026-05-21mm: memcontrol: propagate NMI slab stats to memcg vmstatsAlexandre Ghiti
flush_nmi_stats() drains per-node NMI slab atomics into the per-node lruvec_stats, but does not propagate them to the memcg-level vmstats. In the non-NMI case, account_slab_nmi_safe() calls mod_memcg_lruvec_state(), which updates both the per-node lruvec_stats and the memcg-level vmstats, so flush_nmi_stats() needs to flush to both as well. Fix this by flushing to the memcg-level vmstats for NMI too. Link: https://lore.kernel.org/20260518082830.599102-1-alex@ghiti.fr Fixes: 940b01fc8dc1 ("memcg: nmi safe memcg stats for specific archs") Signed-off-by: Alexandre Ghiti <alex@ghiti.fr> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Harry Yoo (Oracle) <harry@kernel.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
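The accounting invariant behind this fix can be sketched in a few lines of userspace C. The structure, the two-node layout and every `model_*` name are assumptions for illustration only. The direct path updates the per-node and memcg-level counters together, so a deferred flush has to update both as well, otherwise the memcg-level total drifts away from the sum of the per-node counters.

```c
#include <assert.h>

#define MODEL_NODES 2

struct model_memcg {
    long nmi_pending[MODEL_NODES];  /* deltas queued from NMI context */
    long lruvec_stat[MODEL_NODES];  /* per-node counter */
    long memcg_stat;                /* memcg-level counter */
};

/* Direct (non-NMI) path: both levels are updated together. */
static inline void model_account(struct model_memcg *m, int nid, long delta)
{
    m->lruvec_stat[nid] += delta;
    m->memcg_stat += delta;
}

/* NMI path: only queue the delta for a later flush. */
static inline void model_account_nmi(struct model_memcg *m, int nid, long delta)
{
    m->nmi_pending[nid] += delta;
}

/* Flush queued deltas; propagate_to_memcg selects fixed (1) or old (0) behaviour. */
static inline void model_flush_nmi(struct model_memcg *m, int propagate_to_memcg)
{
    for (int nid = 0; nid < MODEL_NODES; nid++) {
        long pending = m->nmi_pending[nid];

        m->nmi_pending[nid] = 0;
        m->lruvec_stat[nid] += pending;
        if (propagate_to_memcg)
            m->memcg_stat += pending;
    }
}

/* Sum of the per-node counters, which should equal memcg_stat. */
static inline long model_node_sum(const struct model_memcg *m)
{
    long sum = 0;

    for (int nid = 0; nid < MODEL_NODES; nid++)
        sum += m->lruvec_stat[nid];
    return sum;
}
```

With `propagate_to_memcg` set, `memcg_stat` matches `model_node_sum()` after a flush; with it cleared, the NMI-queued deltas are missing from the memcg-level counter.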
2026-05-21mm/damon/sysfs-schemes: delete tried region in regions_rmdirs()SeongJae Park
DAMON sysfs maintains the DAMOS tried region directory objects via a linked list. When the user requests a refresh of the directories, DAMON sysfs removes all the region directories first, and then generates the updated region directories on the empty space. The removal function (damon_sysfs_scheme_regions_rm_dirs()) only puts the kobj objects. Deletion of the container region object from the linked list is done inside the kobj release callback function. If the callback invocation is somehow delayed, the list will contain region objects that are going to be freed. If creation of the updated region directories starts in this situation, the list can be corrupted and a use-after-free can happen. Because the kobj objects are managed only by DAMON sysfs, the issue cannot happen in normal situations. But such delays can occur on kernels that are built with CONFIG_DEBUG_KOBJECT_RELEASE. On such a kernel, the issue can indeed be reproduced as below. # damo start --damos_action stat # cd /sys/kernel/mm/damon/admin/kdamonds/0/ # for i in {1..10}; do echo update_schemes_tried_regions > state; done # dmesg | grep underflow [ 89.296152] refcount_t: underflow; use-after-free. Fix the issue by removing the region object from the list when decrementing the reference count. Also update damos_sysfs_populate_region_dir() to add the region object to the list only after kobject_init_and_add() succeeds, so that a failure of kobject_init_and_add() does not leave the deallocated object on the list. The issue was discovered [1] by Sashiko. Link: https://lore.kernel.org/20260518152559.93038-1-sj@kernel.org Link: https://lore.kernel.org/20260513011920.119183-1-sj@kernel.org [1] Fixes: 9277d0367ba1 ("mm/damon/sysfs-schemes: implement scheme region directory") Signed-off-by: SeongJae Park <sj@kernel.org> Cc: <stable@vger.kernel.org> # 6.2.x Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-05-21mm/rmap: initialize nr_pages to 1 at loop start in try_to_unmap_oneDev Jain
Initialize nr_pages to 1 at the start of each loop iteration, like folio_referenced_one() does. Without this, nr_pages computed by a previous folio_unmap_pte_batch() call can be reused on a later iteration that does not run folio_unmap_pte_batch() again. mmap a 64K large folio with MAP_ANONYMOUS | MAP_DROPPABLE, then call madvise(MADV_FREE), then make the last page device-exclusive via HMM_DMIRROR_EXCLUSIVE. Trigger node reclaim through sysfs. Now, in try_to_unmap_one(), we will first clear the first 15 out of 16 entries mapping the lazyfree folio. This will set nr_pages to 15. In the next pvmw walk, this nr_pages gets reused on a device-exclusive pte, thus potentially corrupting folio refcount/mapcount. At the moment, I have a userspace program which can make the kernel spit out a trace, but the blow up is in folio_referenced_one(), because there are existing bugs in the interaction between device-private and rmap (which too I am investigating). I did a one liner kernel change to avoid going into folio_referenced_one(), and the kernel blows up at folio_remove_rmap_ptes in try_to_unmap_one which is what I wanted. Note that the bug is there not since file folio batching but lazyfree folio batching, since device-exclusive only works for anonymous folios. Userspace visible effect is simply kernel crashing somewhere due to refcount/mapcount corruption. Link: https://lore.kernel.org/20260518063656.3721056-1-dev.jain@arm.com Fixes: 354dffd29575 ("mm: support batched unmap for lazyfree large folios during reclamation") Signed-off-by: Dev Jain <dev.jain@arm.com> Acked-by: Barry Song <baohua@kernel.org> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Lorenzo Stoakes <ljs@kernel.org> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Barry Song <baohua@kernel.org> Cc: Dev Jain <dev.jain@arm.com> Cc: Harry Yoo <harry@kernel.org> Cc: Jann Horn <jannh@google.com> Cc: Liam R. 
Howlett <liam@infradead.org> Cc: Rik van Riel <riel@surriel.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Vlastimil Babka <vbabka@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
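The stale-counter problem described in this commit can be reproduced with a small userspace model. The struct, the function and the `reset_each_iteration` switch are hypothetical; only the numbers (15 batched entries followed by one non-batched entry) mirror the scenario in the message. A per-iteration counter that is set by only one branch must be reset at the top of the loop, or a later iteration that skips that branch inherits the previous value.

```c
#include <assert.h>
#include <stddef.h>

/* One step of the walk: either a batch of entries or a single entry. */
struct model_step {
    int batchable;   /* nonzero if this step computes a batch length */
    int batch_len;   /* number of entries covered when batchable */
};

/*
 * Returns the total number of pages the walk claims to have unmapped.
 * reset_each_iteration selects the fixed (1) or the old (0) behaviour.
 */
static inline int model_unmap(const struct model_step *steps, size_t n,
                              int reset_each_iteration)
{
    int nr_pages = 1;
    int total = 0;

    for (size_t i = 0; i < n; i++) {
        if (reset_each_iteration)
            nr_pages = 1;                 /* the fix: start every pass at 1 */
        if (steps[i].batchable)
            nr_pages = steps[i].batch_len;
        total += nr_pages;                /* stands in for refcount/mapcount updates */
    }
    return total;
}
```

For the two-step walk `{ {1, 15}, {0, 0} }` over a 16-entry folio, the fixed variant accounts for 16 pages, while the old variant accounts for 30 because the second step reuses the batch length of 15.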
2026-05-21memfd: deny writeable mappings when implying SEAL_WRITEPratyush Yadav (Google)
When SEAL_EXEC is added, SEAL_WRITE is implied to make W^X. But the implied seal is set after the check that makes sure the memfd can not have any writable mappings. This means one can use SEAL_EXEC to apply SEAL_WRITE while having writeable mappings. This breaks the contract that SEAL_WRITE provides and can be used by an attacker to pass a memfd that appears to be write sealed but can still be modified arbitrarily. Fix this by adding the implied seals before the call for mapping_deny_writable() is done. Link: https://lore.kernel.org/20260505133922.797635-1-pratyush@kernel.org Fixes: c4f75bc8bd6b ("mm/memfd: add write seals when apply SEAL_EXEC to executable memfd") Signed-off-by: Pratyush Yadav (Google) <pratyush@kernel.org> Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Acked-by: Jeff Xu <jeffxu@google.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Brendan Jackman <jackmanb@google.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kees Cook <kees@kernel.org> Cc: "David Hildenbrand (Arm)" <david@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
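The ordering problem can be sketched with a userspace model. The struct, the flag values and the `model_add_seals()` function are all hypothetical and do not reflect the kernel's code; the `fixed_order` switch exists only to contrast the two orderings. The sketch shows that when the implied write seal is added after the writable-mapping check, the check never sees it.

```c
#include <assert.h>

/* Arbitrary flag values for the model; not the UAPI constants. */
#define MODEL_SEAL_WRITE 0x1u
#define MODEL_SEAL_EXEC  0x2u

struct model_memfd {
    unsigned int seals;
    int writable_mappings;  /* count of currently writable shared mappings */
    int executable;         /* nonzero if the file has exec permission */
};

/*
 * Returns 0 on success and -1 if the request is refused.
 * fixed_order selects whether the implied seal is added before (1)
 * or after (0) the writable-mapping check.
 */
static inline int model_add_seals(struct model_memfd *f, unsigned int seals,
                                  int fixed_order)
{
    int implies_write = (seals & MODEL_SEAL_EXEC) && f->executable;

    if (fixed_order && implies_write)
        seals |= MODEL_SEAL_WRITE;

    /* A write seal must be refused while writable mappings exist. */
    if ((seals & MODEL_SEAL_WRITE) && f->writable_mappings > 0)
        return -1;

    if (!fixed_order && implies_write)
        seals |= MODEL_SEAL_WRITE;   /* too late: the check already passed */

    f->seals |= seals;
    return 0;
}
```

With a writable mapping present, the old ordering returns success and leaves the file marked write-sealed even though it can still be modified; the fixed ordering refuses the request and leaves the seals untouched.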
2026-05-21Revert "mm/hugetlbfs: update hugetlbfs to use mmap_prepare"Lorenzo Stoakes
This reverts commit ea52cb24cd3f ("mm/hugetlbfs: update hugetlbfs to use mmap_prepare") with conflict resolution to account for changes in commit ea52cb24cd3f ("mm/hugetlbfs: update hugetlbfs to use mmap_prepare"). The patch incorrectly handled hugetlb VMA lock allocation at the mmap_prepare stage, where a failed allocation occurring after mmap_prepare is called might result in the lock leaking. There is no risk of a merge causing a similar issues, as VMA_DONTEXPAND_BIT is set for hugetlb mappings. As a first step in addressing this issue, simply revert the change so we can rework how we do this having corrected the underlying issues. We maintain the VMA flags changes as best we can, accounting for the fact that we were working with a VMA descriptor previously and propagating like-for-like changes for this. Note that we invoke vma_set_flags() and do not call vma_start_write() as vm_flags_set() does. This is OK as it's being done in an .mmap hook where the VMA is not yet linked into the tree so nobody else can be accessing it. Link: https://lore.kernel.org/20260512160643.266960-1-ljs@kernel.org Fixes: ea52cb24cd3f ("mm/hugetlbfs: update hugetlbfs to use mmap_prepare") Signed-off-by: Lorenzo Stoakes <ljs@kernel.org> Reported-by: Mingyu Wang <25181214217@stu.xidian.edu.cn> Closes: https://lore.kernel.org/linux-mm/20260425070700.562229-1-25181214217@stu.xidian.edu.cn/ Acked-by: Muchun Song <muchun.song@linux.dev> Acked-by: Oscar Salvador <osalvador@suse.de> Cc: David Hildenbrand <david@kernel.org> Cc: Liam R. Howlett <liam@infradead.org> Cc: Pedro Falcato <pfalcato@suse.de> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-05-19Merge tag 'mm-hotfixes-stable-2026-05-18-21-07' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "14 hotfixes. 9 are for MM. 10 are cc:stable and the remainder are for post-7.1 issues or aren't deemed suitable for backporting. There's a two-patch MAINTAINERS series from Mike Rapoport which updates us for the new KEXEC/KDUMP/crash/LUO/etc arrangements. And another two-patch series from Muchun Song to fix a couple of memory-hotplug issues. Otherwise singletons, please see the changelogs for details" * tag 'mm-hotfixes-stable-2026-05-18-21-07' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: mm/memory: fix spurious warning when unmapping device-private/exclusive pages mm: fix __vm_normal_page() to handle missing support for pmd_special()/pud_special() drivers/base/memory: fix memory block reference leak in poison accounting mm/memory_hotplug: fix memory block reference leak on remove lib: kunit_iov_iter: fix test fail on powerpc mm/page_alloc: fix initialization of tags of the huge zero folio with init_on_free MAINTAINERS: add kexec@ list to LIVE UPDATE ENTRY MAINTAINERS: add tree for KDUMP and KEXEC selftests/mm: run_vmtests.sh: fix destructive tests invocation scripts/gdb: slab: update field names of struct kmem_cache scripts/gdb: mm: cast untyped symbols in x86_page_ops mm/damon: fix damos_stat tracepoint format for sz_applied mm/damon/sysfs-schemes: call missing mem_cgroup_iter_break() mm/migrate_device: fix spinlock leak in migrate_vma_insert_huge_pmd_page
2026-05-18cgroup/rstat: validate cpu before css_rstat_cpu() accessQing Ming
css_rstat_updated() is exposed as a BPF kfunc and accepts a caller-provided cpu argument. The function uses cpu for per-cpu rstat lookups without checking whether it refers to a valid possible CPU. A BPF iter/cgroup program with CAP_BPF and CAP_PERFMON can pass an invalid cpu value. On an unfixed UBSAN_BOUNDS test kernel, cpu == 0x7fffffff triggers: UBSAN: array-index-out-of-bounds in kernel/cgroup/rstat.c:31:9 index 2147483647 is out of range for type 'long unsigned int [64]' Call Trace: css_rstat_updated bpf_iter_run_prog cgroup_iter_seq_show bpf_seq_read Add cpu validation to the BPF-facing css_rstat_updated() kfunc and move the common implementation to __css_rstat_updated() for in-kernel callers. Fixes: a319185be9f5 ("cgroup: bpf: enable bpf programs to integrate with rstat") Signed-off-by: Qing Ming <a0yami@mailbox.org> Signed-off-by: Tejun Heo <tj@kernel.org>
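The split between a validating entry point and a trusted inner helper can be sketched as follows. This is a userspace model: the 64-entry array, the boolean return value and all `model_*` names are assumptions, not the kernel's interfaces. The externally reachable wrapper rejects out-of-range indices, and the inner helper is reserved for callers that already hold a valid CPU number.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_NR_CPUS 64

static unsigned long model_percpu_updates[MODEL_NR_CPUS];

/* Inner helper for trusted callers: the index must already be valid. */
static inline void model_rstat_updated_inner(int cpu)
{
    assert(cpu >= 0 && cpu < MODEL_NR_CPUS);
    model_percpu_updates[cpu]++;
}

/*
 * Entry point for untrusted callers: validate before indexing.
 * Returns true if the update was accepted, false if it was ignored.
 */
static inline bool model_rstat_updated_checked(int cpu)
{
    if (cpu < 0 || cpu >= MODEL_NR_CPUS)
        return false;
    model_rstat_updated_inner(cpu);
    return true;
}
```

Calling `model_rstat_updated_checked(0x7fffffff)` is ignored instead of indexing far past the end of the array, while a valid index such as 3 is forwarded to the inner helper.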
2026-05-15Merge tag 'v7.1-p4' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 Pull crypto fixes from Herbert Xu: - Fix potential dead-lock in rhashtable when used by xattr - Avoid calling kvfree on atomic path in rhashtable * tag 'v7.1-p4' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: rhashtable: Add bucket_table_free_atomic() helper mm/slab: Add kvfree_atomic() helper rhashtable: drop ht->mutex in rhashtable_free_and_destroy()
2026-05-14mm/slub: hold cpus_read_lock around flush_rcu_sheaves_on_cache()Qing Wang
flush_rcu_sheaves_on_cache() calls queue_work_on() in a for_each_online_cpu() loop, which requires the cpu to stay online. But cpus_read_lock() is not held in kvfree_rcu_barrier_on_cache() and the set of "online cpus" is subject to change. There are two paths that call flush_rcu_sheaves_on_cache(): // has cpus_read_lock() flush_all_rcu_sheaves() -> flush_rcu_sheaves_on_cache() // no cpus_read_lock() kvfree_rcu_barrier_on_cache() -> flush_rcu_sheaves_on_cache() Fix this by holding cpus_read_lock() in kvfree_rcu_barrier_on_cache(). Why not move cpus_read_lock() from flush_all_rcu_sheaves() into flush_rcu_sheaves_on_cache()? The reason is it would introduce a new lock order (slab_mutex -> cpu_hotplug_lock). The reverse order (cpu_hotplug_lock -> slab_mutex) is established by - cpuhp_setup_state_nocalls(..., slub_cpu_setup, ...) - kmem_cache_destroy() The two orders together would form an AB-BA deadlock. Finally, add lockdep_assert_cpus_held() in flush_rcu_sheaves_on_cache() to catch the same problem in the future. Fixes: 0f35040de593 ("mm/slab: introduce kvfree_rcu_barrier_on_cache() for cache destruction") Cc: <stable@vger.kernel.org> Signed-off-by: Qing Wang <wangqing7171@gmail.com> Link: https://patch.msgid.link/20260512035035.762317-1-wangqing7171@gmail.com Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
2026-05-14slab: fix kernel-docs for mm-apiMarco Elver
The mm-api kernel-docs have been disconnected from their symbols. While the scripts were previously taught to handle the _noprof suffix added by allocation tagging (in 51a7bf0238c2 "scripts/kernel-doc: drop "_noprof" on function prototypes"), this does not handle cases where the internal implementation function has an additional leading underscore. The added optional parameters (via DECL_KMALLOC_PARAMS) further complicate parsing the internal signatures. When the kernel-doc block remains above the internal implementation function but uses the public API name, the documentation generator fails to associate the documented symbol. Simply moving the docs to the macros in slab.h fixes the association but causes loss of types in the generated documentation (rendering as e.g. untyped 'kmalloc(size, flags)' macro). Fix this by: 1. Moving the kernel-doc comment blocks from slub.c to slab.h, placing them directly above the user-facing macros. 2. Providing explicit, typed C prototypes for the documented APIs inside '#if 0 /* kernel-doc */' blocks. 3. Converting the variadic macros for the documented APIs to use explicit arguments to match the documentation. No functional change intended. Signed-off-by: Marco Elver <elver@google.com> Link: https://patch.msgid.link/20260511200136.3201646-3-elver@google.com Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
2026-05-14slab: support for compiler-assisted type-based slab cache partitioningMarco Elver
Rework the general infrastructure around RANDOM_KMALLOC_CACHES into more flexible KMALLOC_PARTITION_CACHES, with the former being a partitioning mode of the latter. Introduce a new mode, KMALLOC_PARTITION_TYPED, which leverages a feature available in Clang 22 and later, called "allocation tokens" via __builtin_infer_alloc_token() [1]. Unlike KMALLOC_PARTITION_RANDOM (formerly RANDOM_KMALLOC_CACHES), this mode deterministically assigns a slab cache to an allocation of type T, regardless of allocation site. The builtin __builtin_infer_alloc_token(<malloc-args>, ...) instructs the compiler to infer an allocation type from arguments commonly passed to memory-allocating functions and returns a type-derived token ID. The implementation passes kmalloc-args to the builtin: the compiler performs best-effort type inference, and then recognizes common patterns such as `kmalloc(sizeof(T), ...)`, `kmalloc(sizeof(T) * n, ...)`, but also `(T *)kmalloc(...)`. Where the compiler fails to infer a type the fallback token (default: 0) is chosen. Note: kmalloc_obj(..) APIs fix the pattern in which size and result type are expressed, and therefore ensure there's not much drift in which patterns the compiler needs to recognize. Specifically, kmalloc_obj() and friends expand to `(TYPE *)KMALLOC(__obj_size, GFP)`, which the compiler recognizes via the cast to TYPE*. Clang's default token ID calculation is described as [1]: typehashpointersplit: This mode assigns a token ID based on the hash of the allocated type's name, where the top half ID-space is reserved for types that contain pointers and the bottom half for types that do not contain pointers. Separating pointer-containing objects from pointerless objects and data allocations can help mitigate certain classes of memory corruption exploits [2]: attackers who gain a buffer overflow on a primitive buffer cannot use it to directly corrupt pointers or other critical metadata in an object residing in a different, isolated heap region. 
It is important to note that heap isolation strategies offer a best-effort approach, and do not provide a 100% security guarantee, albeit achievable at relatively low performance cost. Note that this also does not prevent cross-cache attacks: while waiting for future features like SLAB_VIRTUAL [3] to provide physical page isolation, this feature should be deployed alongside SHUFFLE_PAGE_ALLOCATOR and init_on_free=1 to mitigate cross-cache attacks and page-reuse attacks as much as possible today. With all that, my kernel (x86 defconfig) shows me a histogram of slab cache object distribution per /proc/slabinfo (after boot): <slab cache> <objs> <hist> kmalloc-part-15 1465 ++++++++++++++ kmalloc-part-14 2988 +++++++++++++++++++++++++++++ kmalloc-part-13 1656 ++++++++++++++++ kmalloc-part-12 1045 ++++++++++ kmalloc-part-11 1697 ++++++++++++++++ kmalloc-part-10 1489 ++++++++++++++ kmalloc-part-09 965 +++++++++ kmalloc-part-08 710 +++++++ kmalloc-part-07 100 + kmalloc-part-06 217 ++ kmalloc-part-05 105 + kmalloc-part-04 4047 ++++++++++++++++++++++++++++++++++++++++ kmalloc-part-03 183 + kmalloc-part-02 283 ++ kmalloc-part-01 316 +++ kmalloc 1422 ++++++++++++++ The above /proc/slabinfo snapshot shows me there are 6673 allocated objects (slabs 00 - 07) that the compiler claims contain no pointers or it was unable to infer the type of, and 12015 objects that contain pointers (slabs 08 - 15). On a whole, this looks relatively sane. Additionally, when I compile my kernel with -Rpass=alloc-token, which provides diagnostics where (after dead-code elimination) type inference failed, I see 186 allocation sites where the compiler failed to identify a type (down from 966 when I sent the RFC [4]). Some initial review confirms these are mostly variable sized buffers, but also include structs with trailing flexible length arrays. 
Link: https://clang.llvm.org/docs/AllocToken.html [1] Link: https://blog.dfsec.com/ios/2025/05/30/blasting-past-ios-18/ [2] Link: https://lwn.net/Articles/944647/ [3] Link: https://lore.kernel.org/all/20250825154505.1558444-1-elver@google.com/ [4] Link: https://discourse.llvm.org/t/rfc-a-framework-for-allocator-partitioning-hints/87434 Acked-by: GONG Ruiqi <gongruiqi1@huawei.com> Co-developed-by: Harry Yoo (Oracle) <harry@kernel.org> Signed-off-by: Harry Yoo (Oracle) <harry@kernel.org> Signed-off-by: Marco Elver <elver@google.com> Reviewed-by: Harry Yoo (Oracle) <harry@kernel.org> Link: https://patch.msgid.link/20260511200136.3201646-1-elver@google.com Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
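The two totals quoted for the histogram can be checked with a short calculation. The object counts below are copied from the /proc/slabinfo snapshot in the message, with the plain kmalloc row treated as partition 00, which is an assumption based on the "slabs 00 - 07" wording; the array and helper names are made up for the sketch.

```c
#include <assert.h>

/* Object counts per partition, index 0 being the plain "kmalloc" row. */
static const int model_partition_objs[16] = {
    1422,  316,  283,  183, 4047,  105,  217,  100,   /* partitions 00..07 */
     710,  965, 1489, 1697, 1045, 1656, 2988, 1465,   /* partitions 08..15 */
};

/* Sum the object counts of partitions first..last, inclusive. */
static inline int model_partition_sum(int first, int last)
{
    int sum = 0;

    assert(first >= 0 && last < 16 && first <= last);
    for (int i = first; i <= last; i++)
        sum += model_partition_objs[i];
    return sum;
}
```

Summing partitions 00 to 07 gives 6673 and summing 08 to 15 gives 12015, which agrees with the figures stated in the message for the pointer-free or uninferred half and the pointer-containing half.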
2026-05-14mm/slub: defer freelist construction until after bulk allocation from a new slabShengming Hu
Allocations from a fresh slab can consume all of its objects, and the freelist built during slab allocation is discarded immediately as a result. Instead of special-casing the whole-slab bulk refill case, defer freelist construction until after objects are emitted from a fresh slab. new_slab() now only allocates the slab and initializes its metadata. refill_objects() then obtains a fresh slab and lets alloc_from_new_slab() emit objects directly, building a freelist only for the objects left unallocated; the same change is applied to alloc_single_from_new_slab(). To keep CONFIG_SLAB_FREELIST_RANDOM=y/n on the same path, introduce a small iterator abstraction for walking free objects in allocation order. The iterator is used both for filling the sheaf and for building the freelist of the remaining objects. Also mark setup_object() inline. After this optimization, the compiler no longer consistently inlines this helper in the hot path, which can hurt performance. Explicitly marking it inline restores the expected code generation. This reduces per-object overhead when allocating from a fresh slab. The most direct benefit is in the paths that allocate objects first and only build a freelist for the remainder afterward: bulk allocation from a new slab in refill_objects(), single-object allocation from a new slab in ___slab_alloc(), and the corresponding early-boot paths that now use the same deferred-freelist scheme. Since refill_objects() is also used to refill sheaves, the optimization is not limited to the small set of kmem_cache_alloc_bulk()/kmem_cache_free_bulk() users; regular allocation workloads may benefit as well when they refill from a fresh slab. In slub_bulk_bench, the time per object drops by about 42% to 70% with CONFIG_SLAB_FREELIST_RANDOM=n, and by about 58% to 69% with CONFIG_SLAB_FREELIST_RANDOM=y. This benchmark is intended to isolate the cost removed by this change: each iteration allocates exactly slab->objects from a fresh slab. 
That makes it a near best-case scenario for deferred freelist construction, because the old path still built a full freelist even when no objects remained, while the new path avoids that work. Realistic workloads may see smaller end-to-end gains depending on how often allocations reach this fresh-slab refill path. Benchmark results (slub_bulk_bench): Machine: qemu-system-x86 -m 1024M -smp 8 -enable-kvm -cpu host Kernel: Linux 7.1.0-rc1-next-20260429 Config: x86_64_defconfig Cpu: 0 Rounds: 20 Total: 256MB - CONFIG_SLAB_FREELIST_RANDOM=n - obj_size=16, batch=256: before: 5.44 +- 0.07 ns/object after: 3.12 +- 0.03 ns/object delta: -42.6% obj_size=32, batch=128: before: 7.57 +- 0.32 ns/object after: 3.79 +- 0.07 ns/object delta: -49.9% obj_size=64, batch=64: before: 11.27 +- 0.09 ns/object after: 4.83 +- 0.06 ns/object delta: -57.2% obj_size=128, batch=32: before: 19.38 +- 0.13 ns/object after: 6.43 +- 0.08 ns/object delta: -66.8% obj_size=256, batch=32: before: 23.59 +- 0.18 ns/object after: 6.97 +- 0.07 ns/object delta: -70.5% obj_size=512, batch=32: before: 21.06 +- 0.14 ns/object after: 7.12 +- 0.17 ns/object delta: -66.2% - CONFIG_SLAB_FREELIST_RANDOM=y - obj_size=16, batch=256: before: 9.42 +- 0.11 ns/object after: 4.36 +- 0.19 ns/object delta: -53.7% obj_size=32, batch=128: before: 12.19 +- 0.62 ns/object after: 4.93 +- 0.07 ns/object delta: -59.6% obj_size=64, batch=64: before: 17.01 +- 0.73 ns/object after: 6.14 +- 0.12 ns/object delta: -63.9% obj_size=128, batch=32: before: 23.71 +- 1.10 ns/object after: 8.35 +- 0.18 ns/object delta: -64.8% obj_size=256, batch=32: before: 29.20 +- 0.35 ns/object after: 9.44 +- 1.32 ns/object delta: -67.7% obj_size=512, batch=32: before: 29.35 +- 0.79 ns/object after: 9.21 +- 0.34 ns/object delta: -68.6% Link: https://github.com/HSM6236/slub_bulk_test.git Suggested-by: Harry Yoo (Oracle) <harry@kernel.org> Reviewed-by: Harry Yoo (Oracle) <harry@kernel.org> Reviewed-by: Hao Li <hao.li@linux.dev> Tested-by: Hao Li 
<hao.li@linux.dev> Signed-off-by: Shengming Hu <hu.shengming@zte.com.cn> Link: https://patch.msgid.link/202604302204413066CxdJnJ3RAGH_7iE4EBIO@zte.com.cn Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
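The saving from deferred freelist construction can be illustrated with a toy model that counts how many freelist links get written. Everything here is hypothetical: the 32-object slab, the index-based links and the `model_*` functions are not the SLUB data structures, and the model ignores freelist randomization. It only contrasts linking every object up front with linking just the objects left over after the allocation.

```c
#include <assert.h>

#define MODEL_SLAB_OBJECTS 32

struct model_slab {
    int next[MODEL_SLAB_OBJECTS];  /* index of the next free object, -1 at the end */
    int freelist_head;             /* first free object, -1 if none */
    int link_writes;               /* number of freelist links written */
};

/* Eager scheme: link every object first, then pop `want` of them. */
static inline int model_alloc_eager(struct model_slab *s, int want, int *out)
{
    int n = 0;

    for (int i = 0; i < MODEL_SLAB_OBJECTS; i++) {
        s->next[i] = (i + 1 < MODEL_SLAB_OBJECTS) ? i + 1 : -1;
        s->link_writes++;
    }
    s->freelist_head = 0;

    while (n < want && s->freelist_head != -1) {
        out[n++] = s->freelist_head;
        s->freelist_head = s->next[s->freelist_head];
    }
    return n;
}

/* Deferred scheme: hand out objects directly, link only the remainder. */
static inline int model_alloc_deferred(struct model_slab *s, int want, int *out)
{
    int n = want < MODEL_SLAB_OBJECTS ? want : MODEL_SLAB_OBJECTS;

    for (int i = 0; i < n; i++)
        out[i] = i;

    s->freelist_head = (n < MODEL_SLAB_OBJECTS) ? n : -1;
    for (int i = n; i < MODEL_SLAB_OBJECTS; i++) {
        s->next[i] = (i + 1 < MODEL_SLAB_OBJECTS) ? i + 1 : -1;
        s->link_writes++;
    }
    return n;
}
```

When a request consumes the whole slab, the eager scheme writes 32 links that are immediately discarded and the deferred scheme writes none, which corresponds to the near best-case scenario the benchmark is described as isolating. For a partial request of 8 objects the deferred scheme writes 24 links, one per remaining object.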
2026-05-13mm/memory: fix spurious warning when unmapping device-private/exclusive pagesAlistair Popple
Device private and exclusive entries are only supported for anonymous folios. This condition is tested in __migrate_device_pages() and make_device_exclusive() using folio_test_anon(). However, the unmap path tests this assumption using vma_is_anonymous().

This is wrong: while anonymous VMAs can only contain folios for which folio_test_anon() is true, the converse does not hold. folio_test_anon() being true for a folio does not imply that vma_is_anonymous() is true for the VMA that maps it. This can occur, for example, when a folio is part of a private file-backed mapping. In this case vma_is_anonymous() is false because the mapping is file-backed, but folio_test_anon() may be true, thus permitting devices to migrate the folio to device private memory.

This can lead to the following spurious warnings during process teardown:

[ 772.737706] ------------[ cut here ]------------
[ 772.739201] WARNING: mm/memory.c:1754 at unmap_page_range.cold+0x26/0x18a, CPU#17: hmm-tests/2041
[ 772.742050] Modules linked in: test_hmm nvidia_uvm(O) nvidia(O)
[ 772.743959] CPU: 17 UID: 0 PID: 2041 Comm: hmm-tests Tainted: G W O 7.0.0+ #387 PREEMPT(full)
[ 772.747104] Tainted: [W]=WARN, [O]=OOT_MODULE
[ 772.748509] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.17.0-0-gb52ca86e094d-prebuilt.qemu.org 04/01/2014
[ 772.752117] RIP: 0010:unmap_page_range.cold+0x26/0x18a
[ 772.753780] Code: 7e fe ff ff 48 89 4c 24 78 4c 89 44 24 38 e8 f2 ff b1 00 48 8b 4c 24 78 4c 8b 44 24 38 48 8b 44 24 18 48 83 78 48 00 74 04 90 <0f> 0b 90 48 89 ca b8 ff ff 37 00 48 c1 ea 03 48 c1 e0 2a 80 3c 02
[ 772.759602] RSP: 0018:ffff888112607550 EFLAGS: 00010286
[ 772.761310] RAX: ffff88811bbf4dc0 RBX: dffffc0000000000 RCX: ffffea03e9bfffd8
[ 772.763583] RDX: 1ffff1102377e9c1 RSI: 0000000000000008 RDI: ffff88811bbf4e08
[ 772.765914] RBP: 0000000000000006 R08: ffff8881059f7448 R09: ffffed10224c0e68
[ 772.768184] R10: ffff888112607347 R11: 0000000000000001 R12: 0000000000000001
[ 772.770461] R13: ffffea03e9bfffc0 R14: ffff888112607908 R15: ffffea03e9bfffc0
[ 772.772782] FS: 00007f327caa2780(0000) GS:ffff888427b7d000(0000) knlGS:0000000000000000
[ 772.775328] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 772.777187] CR2: 00007f327ca89000 CR3: 00000001994d5000 CR4: 00000000000006f0
[ 772.779135] Call Trace:
[ 772.779792] <TASK>
[ 772.780317] ? dmirror_interval_invalidate+0x1a3/0x290 [test_hmm]
[ 772.781873] ? vm_normal_page_pud+0x2b0/0x2b0
[ 772.782992] ? __rwlock_init+0x150/0x150
[ 772.784006] ? lock_release+0x216/0x2b0
[ 772.785008] ? __mmu_notifier_invalidate_range_start+0x505/0x6e0
[ 772.786522] ? lock_release+0x216/0x2b0
[ 772.787498] ? unmap_single_vma+0xb6/0x210
[ 772.788573] unmap_vmas+0x27d/0x520
[ 772.789506] ? unmap_single_vma+0x210/0x210
[ 772.790607] ? mas_update_gap.part.0+0x620/0x620
[ 772.791834] unmap_region+0x19e/0x350
[ 772.792769] ? remove_vma+0x130/0x130
[ 772.793684] ? mas_alloc_nodes+0x1f2/0x300
[ 772.794730] vms_complete_munmap_vmas+0x8c1/0xe20
[ 772.795926] ? unmap_region+0x350/0x350
[ 772.796917] do_vmi_align_munmap+0x36a/0x4e0
[ 772.798018] ? lock_release+0x216/0x2b0
[ 772.799024] ? vma_shrink+0x620/0x620
[ 772.799983] do_vmi_munmap+0x150/0x2c0
[ 772.800939] __vm_munmap+0x161/0x2c0
[ 772.801872] ? expand_downwards+0xd60/0xd60
[ 772.802948] ? clockevents_program_event+0x1ef/0x540
[ 772.804217] ? lock_release+0x216/0x2b0
[ 772.805158] __x64_sys_munmap+0x59/0x80
[ 772.805776] do_syscall_64+0xfc/0x670
[ 772.806336] ? irqentry_exit+0xda/0x580
[ 772.806976] entry_SYSCALL_64_after_hwframe+0x4b/0x53
[ 772.807772] RIP: 0033:0x7f327cbb2717
[ 772.808323] Code: 73 01 c3 48 8b 0d f9 76 0d 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 b8 0b 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d c9 76 0d 00 f7 d8 64 89 01 48
[ 772.811337] RSP: 002b:00007ffde7f57d38 EFLAGS: 00000202 ORIG_RAX: 000000000000000b
[ 772.812564] RAX: ffffffffffffffda RBX: 00007f327cc9c000 RCX: 00007f327cbb2717
[ 772.813733] RDX: 0000000000000000 RSI: 0000000000400000 RDI: 00007f327c289000
[ 772.814867] RBP: 0000000000421360 R08: 000000000000001a R09: 0000000000000000
[ 772.815991] R10: 0000000000000003 R11: 0000000000000202 R12: 00007ffde7f57d74
[ 772.817121] R13: 00007f327c689010 R14: 0000000000100000 R15: 00007f327c289000
[ 772.818272] </TASK>
[ 772.818614] irq event stamp: 0
[ 772.819159] hardirqs last enabled at (0): [<0000000000000000>] 0x0
[ 772.820174] hardirqs last disabled at (0): [<ffffffff82a57ab3>] copy_process+0x19f3/0x6440
[ 772.821511] softirqs last enabled at (0): [<ffffffff82a57b00>] copy_process+0x1a40/0x6440
[ 772.822869] softirqs last disabled at (0): [<0000000000000000>] 0x0
[ 772.823871] ---[ end trace 0000000000000000 ]---

Fix this by using the same check for folio_test_anon() in zap_nonpresent_ptes(). Also add a hmm-test case for this.

Link: https://lore.kernel.org/20260501065116.2057242-1-apopple@nvidia.com
Fixes: 999dad824c39 ("mm/shmem: persist uffd-wp bit across zapping for file-backed")
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Reported-by: Arsen Arsenović <aarsenovic@baylibre.com>
Reviewed-by: Balbir Singh <balbirs@nvidia.com>
Cc: David Hildenbrand <david@kernel.org>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: Liam R. Howlett <liam@infradead.org>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
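Note: the case described in this commit message, a private file-backed mapping that ends up holding an anonymous folio, can be reached from userspace with an ordinary copy-on-write store. The following is a minimal sketch and is not taken from the patch or from hmm-tests; the helper name private_file_write_is_cow() and the temporary file path are hypothetical. It only shows the userspace-visible effect (the store lands in a private copy and the file is unchanged). That the private copy is backed by a folio for which folio_test_anon() is true is kernel-internal behaviour that the program cannot observe directly.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper, for illustration only.
 * Returns 1 if a store through a MAP_PRIVATE file mapping is visible in
 * the mapping but not in the file, 0 if that is not the case, and -1 if
 * the setup failed.
 */
int private_file_write_is_cow(void)
{
	char path[] = "/tmp/cow-demo-XXXXXX";	/* assumed writable location */
	char disk[4] = { 0 };
	long psz = sysconf(_SC_PAGESIZE);
	char *map;
	int ret = -1;
	int fd;

	fd = mkstemp(path);
	if (fd < 0)
		return -1;
	unlink(path);	/* the open descriptor keeps the file alive */

	if (psz <= 0 || ftruncate(fd, psz) != 0 ||
	    pwrite(fd, "file", 4, 0) != 4)
		goto out_close;

	/* File-backed VMA: vma_is_anonymous() would be false for it. */
	map = mmap(NULL, (size_t)psz, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE, fd, 0);
	if (map == MAP_FAILED)
		goto out_close;

	/* Write fault: the page is copied and the copy is private. */
	memcpy(map, "anon", 4);

	/* The mapping sees the new bytes, the file keeps the old ones. */
	if (pread(fd, disk, 4, 0) == 4)
		ret = !memcmp(map, "anon", 4) && !memcmp(disk, "file", 4);

	munmap(map, (size_t)psz);
out_close:
	close(fd);
	return ret;
}
```

Calling private_file_write_is_cow() from a small main() and checking for a return value of 1 is enough to exercise it. A device driver that then migrated the written page to device private memory would be operating on exactly the kind of folio the unmap path previously rejected with its vma_is_anonymous() check.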