summaryrefslogtreecommitdiff
path: root/mm
AgeCommit message (Collapse)Author
2026-06-03mm: simplify the mempool_alloc_bulk APIChristoph Hellwig
The mempool_alloc_bulk was modelled after the alloc_pages_bulk API, including some misunderstanding of it. Remove checking for NULL slots in the array, as alloc_pages_bulk and kmem_cache_alloc_bulk always fill the array from the beginning and thus we know the offset of the first failing allocation. This removes support for working well with alloc_pages_bulk used to refill page arrays that might have an entry removed from in the middle, but that is only used by sunrpc and hopefully on it's way out. Also remove the allocated parameter as it is redundant because the caller can simply specific and offset into the entries array. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://patch.msgid.link/20260602160038.3976341-1-hch@lst.de Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
2026-06-03mm/slab: improve kmem_cache_alloc_bulkChristoph Hellwig
The kmem_cache_alloc_bulk return value is weird. It returns the number of allocated objects, but that must always be 0 or the requested number based on the implementations and the handling in the callers, but that assumption is not actually documented anywhere, which confuses automated review tools. Fix this by returning a bool if the allocation succeeded and adding a kerneldoc comment explaining the API. [rob.clark@oss.qualcomm.com: fixups in msm_iommu_pagetable_prealloc_allocate() ] Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com> # skbuff Link: https://patch.msgid.link/20260528093437.2519248-2-hch@lst.de Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
2026-06-02mm/damon/core: clarify next_intervals_tune_sis update pathniecheng
Patch series "mm/damon: documentation and comment fixes". This patch (of 3): damon_set_attrs() updates next_aggregation_sis and next_ops_update_sis for online attrs updates, but it does not update next_intervals_tune_sis there. This can look like a missing update when reading damon_set_attrs() alone, while next_intervals_tune_sis is actually updated in kdamond_fn(). Add a short comment to make this explicit. Link: https://lore.kernel.org/20260520012104.93602-1-sj@kernel.org Link: https://lore.kernel.org/20260520012104.93602-2-sj@kernel.org Suggested-by: SeongJae Park <sj@kernel.org> Signed-off-by: niecheng <niecheng1@uniontech.com> Signed-off-by: SeongJae Park <sj@kernel.org> Reviewed-by: SeongJae Park <sj@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Shuah Khan <skhan@linuxfoundation.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Cc: Sakurai Shun <ssh1326@icloud.com> Cc: Zenghui Yu <zenghui.yu@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-02mm/damon/vaddr: attempt per-vma lock during page table walkKefeng Wang
Currently, DAMON virtual address operations use mmap_read_lock during page table walks, which can cause unnecessary contention under high concurrency. Introduce damon_va_walk_page_range() to first attempt acquiring a per-vma lock. If the VMA is found and the range is fully contained within it, the page table walk proceeds with the per-vma lock instead of mmap_read_lock. This optimization is expected to be particularly effective for damon_va_young() and damon_va_mkold(), which are frequently called and typically operate within a single VMA. Link: https://lore.kernel.org/20260512151523.2092638-1-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: SeongJae Park <sj@kernel.org> Cc: Nanyong Sun <sunnanyong@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-02mm/memory-failure: use zone_pcp_disable() for poison handlingKaitao Cheng
__page_handle_poison() used drain_all_pages() instead of zone_pcp_disable() because dissolve_free_hugetlb_folio() could restore HVO vmemmap pages and decrement hugetlb_optimize_vmemmap_key. That static key update took cpu_hotplug_lock through static_key_slow_dec(), while zone_pcp_disable() holds pcp_batch_high_lock. CPU hotplug takes the locks in the opposite order through page_alloc_cpu_online/dead(), so the combination could deadlock. That dependency no longer exists. Commit da3e2d1ca43d ("mm/hugetlb: remove hugetlb_optimize_vmemmap_key static key") removed the HVO static key and the static_branch_dec() from hugetlb_vmemmap_restore_folio(). The dissolve_free_hugetlb_folio() path no longer reaches static_key_slow_dec(). Use zone_pcp_disable() again while dissolving the hugetlb folio and taking the target page off the buddy allocator. This prevents the drained PCP lists from being refilled before take_page_off_buddy() runs, making the page isolation deterministic. Link: https://lore.kernel.org/20260514085754.84097-1-kaitao.cheng@linux.dev Signed-off-by: Kaitao Cheng <chengkaitao@kylinos.cn> Reviewed-by: Oscar Salvador <osalvador@suse.de> Acked-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-02mm/vmalloc: free unused pages on vrealloc() shrinkShivam Kalra
When vrealloc() shrinks an allocation and the new size crosses a page boundary, unmap and free the tail pages that are no longer needed. This reclaims physical memory that was previously wasted for the lifetime of the allocation. The heuristic is simple: always free when at least one full page becomes unused. Huge page allocations (page_order > 0) are skipped, as partial freeing would require splitting. Allocations with VM_FLUSH_RESET_PERMS are also skipped, as their direct-map permissions must be reset before pages are returned to the page allocator, which is handled by vm_reset_perms() during vfree(). Additionally, allocations with VM_USERMAP are skipped because remap_vmalloc_range_partial() validates mapping requests against the unchanged vm->size; freeing tail pages would cause vmalloc_to_page() to return NULL for the unmapped range. To protect concurrent readers, the shrink path uses Node lock to synchronize before freeing the pages. Finally, we notify kmemleak of the reduced allocation size using kmemleak_free_part() to prevent the kmemleak scanner from faulting on the newly unmapped virtual addresses. The virtual address reservation (vm->size / vmap_area) is intentionally kept unchanged, preserving the address for potential future grow-in-place support. Link: https://lore.kernel.org/20260519-vmalloc-shrink-v14-4-70b96ee3e9c9@zohomail.in Signed-off-by: Shivam Kalra <shivamkalra98@zohomail.in> Suggested-by: Danilo Krummrich <dakr@kernel.org> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Alice Ryhl <aliceryhl@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-02mm/vmalloc: use physical page count in vread_iter() for VM_ALLOC areasShivam Kalra
For VM_ALLOC areas in vread_iter(), derive the vm area size from vm->nr_pages rather than get_vm_area_size(). Only VM_ALLOC areas are subject to vrealloc() shrinking, which frees pages without reducing the virtual reservation size. Switch to using vm->nr_pages for VM_ALLOC areas so the reader remains correct once shrink support is added. Other mapping types (vmap, ioremap) do not initialize nr_pages and will continue using get_vm_area_size(). [shivamkalra98@zohomail.in: add an nr_pages check] Link: https://lore.kernel.org/aff47da5-4fd5-481d-be18-e1eb99639490@zohomail.in Link: https://lore.kernel.org/20260519-vmalloc-shrink-v14-3-70b96ee3e9c9@zohomail.in Signed-off-by: Shivam Kalra <shivamkalra98@zohomail.in> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Alice Ryhl <aliceryhl@google.com> Cc: Danilo Krummrich <dakr@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-02mm/vmalloc: use physical page count for vrealloc() grow-in-place checkShivam Kalra
Update the grow-in-place check in vrealloc() to compare the requested size against the actual physical page count (vm->nr_pages) rather than the virtual area size (alloced_size, derived from get_vm_area_size()). Currently both values are equivalent, but the upcoming vrealloc() shrink functionality will free pages without reducing the virtual reservation size. After such a shrink, the old alloced_size-based comparison would incorrectly allow a grow-in-place operation to succeed and attempt to access freed pages. Switch to vm->nr_pages now so the check remains correct once shrink support is added. Link: https://lore.kernel.org/20260519-vmalloc-shrink-v14-2-70b96ee3e9c9@zohomail.in Signed-off-by: Shivam Kalra <shivamkalra98@zohomail.in> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Alice Ryhl <aliceryhl@google.com> Cc: Danilo Krummrich <dakr@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-02mm/vmalloc: extract vm_area_free_pages() helper from vfree()Shivam Kalra
Patch series "mm/vmalloc: free unused pages on vrealloc() shrink", v14. This series implements the TODO in vrealloc() to unmap and free unused pages when shrinking across a page boundary. Problem: When vrealloc() shrinks an allocation, it updates bookkeeping (requested_size, KASAN shadow) but does not free the underlying physical pages. This wastes memory for the lifetime of the allocation. Solution: - Patch 1: Extracts a vm_area_free_pages(vm, start_idx, end_idx) helper from vfree() that frees a range of pages with memcg and nr_vmalloc_pages accounting. Freed page pointers are set to NULL to prevent stale references. - Patch 2: Update the grow-in-place check in vrealloc() to compare the requested size against the actual physical page count (vm->nr_pages) rather than the virtual area sizes. This is a prerequisite for shrinking. - Patch 3: For VM_ALLOC areas in vread_iter(), derive the vm area size from vm->nr_pages rather than get_vm_area_size(), which would overestimate the mapped range after a shrink. Other mapping types (vmap, ioremap) don't set nr_pages and keep using get_vm_area_size(). - Patch 4: Uses the helper to free tail pages when vrealloc() shrinks across a page boundary. - Patch 5: Adds a vrealloc test case to lib/test_vmalloc that exercises grow-realloc, shrink-across-boundary, shrink-within-page, and grow-in-place paths. The virtual address reservation is kept intact to preserve the range for potential future grow-in-place support. A concrete user is the Rust binder driver's KVVec::shrink_to [1], which performs explicit vrealloc() shrinks for memory reclamation. This patch (of 5): Extract page freeing and NR_VMALLOC stat accounting from vfree() into a reusable vm_area_free_pages() helper. The helper operates on a range [start_idx, end_idx) of pages from a vm_struct, making it suitable for both full free (vfree) and partial free (upcoming vrealloc shrink). Freed page pointers in vm->pages[] are set to NULL to prevent stale references when the vm_struct outlives the free (as in vrealloc shrink). Link: https://lore.kernel.org/20260519-vmalloc-shrink-v14-0-70b96ee3e9c9@zohomail.in Link: https://lore.kernel.org/20260519-vmalloc-shrink-v14-1-70b96ee3e9c9@zohomail.in Link: https://lore.kernel.org/all/20260216-binder-shrink-vec-v3-v6-0-ece8e8593e53@zohomail.in/ [1] Signed-off-by: Shivam Kalra <shivamkalra98@zohomail.in> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Alice Ryhl <aliceryhl@google.com> Cc: Danilo Krummrich <dakr@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-02mm/damon/sysfs: setup damon_filter->memcg_id from pathSeongJae Park
Find and set the memcg_id for damon_filter from the user-passed memory cgroup path when updating the DAMON input parameters. Link: https://lore.kernel.org/20260518234119.97569-27-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-02mm/damon/sysfs-schemes: move memcg_path_to_id() to sysfs-commonSeongJae Park
The next commit will need to find the memcg id from the user-passed path to the memory cgroup, from sysfs.c. memcg_path_to_id() is doing that, but defined in sysfs-schemes.c as a static function. Move the function to sysfs-common.c and mark it as non-static, so that the next commit can reuse the function. Link: https://lore.kernel.org/20260518234119.97569-26-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-02mm/damon/sysfs: add filters/<F>/path fileSeongJae Park
Introduce a new DAMON sysfs file for letting users setup the target memory cgroup of the belonging memory cgroup attribute monitoring. The file is named 'path', located under the probe filter directory. Users can set the target memory cgroup by writing the path to the memory cgroup from the cgroup mount point to the file. Link: https://lore.kernel.org/20260518234119.97569-25-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-02mm/damon/paddr: support DAMON_FILTER_TYPE_MEMCGSeongJae Park
Implement the support of DAMON_FILTER_TYPE_MEMCG on the DAMON operation set implementation for the physical address space. Link: https://lore.kernel.org/20260518234119.97569-24-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-02mm/damon/core: introduce DAMON_FILTER_TYPE_MEMCGSeongJae Park
Belonging memory cgoup is another data attribute that can be useful to monitor. Introduce a new DAMON filter type, namely DAMON_FILTER_TYPE_MEMCG, for monitoring of this attribute. Link: https://lore.kernel.org/20260518234119.97569-23-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-02mm/damon: trace probe_hitsSeongJae Park
Introduce a new tracepoint for exposing the per-region per-probe positive sample count via tracefs. Link: https://lore.kernel.org/20260518234119.97569-19-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-02mm/damon/sysfs-schemes: implement probe/hits fileSeongJae Park
Implement sysfs file for showing the per-region per-probe hits count. Link: https://lore.kernel.org/20260518234119.97569-18-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-02mm/damon/sysfs-schemes: implement probe dirSeongJae Park
Implement sysfs directory for showing per-probe hits count of each region. Link: https://lore.kernel.org/20260518234119.97569-17-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-02mm/damon/sysfs-schemes: implement tried_regions/<r>/probes/SeongJae Park
Implement a sysfs directory for showing the per-region probe hit counts. It is named 'probes/' and located under the DAMOS tried region directory. Link: https://lore.kernel.org/20260518234119.97569-16-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-02mm/damon/sysfs: setup probes on DAMON core API parametersSeongJae Park
Add user-installed data probes to DAMON core API parameters, so that user inputs for data probes are passed to DAMON core. Link: https://lore.kernel.org/20260518234119.97569-15-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-02mm/damon/sysfs: implement filter dir filesSeongJae Park
Implement sysfs files under the data probe filter directory for letting users to configure each filter. Link: https://lore.kernel.org/20260518234119.97569-14-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-02mm/damon/sysfs: implement filter dirSeongJae Park
Implement a sysfs directory for letting the users to configure each data probe filter. Link: https://lore.kernel.org/20260518234119.97569-13-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-02mm/damon/sysfs: implement filters directorySeongJae Park
Implement a directory for letting users to install data probe filters. Link: https://lore.kernel.org/20260518234119.97569-12-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-02mm/damon/sysfs: implement probe dirSeongJae Park
Implement sysfs directory for letting users install each data probe. Link: https://lore.kernel.org/20260518234119.97569-11-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-02mm/damon/sysfs: implement probes dirSeongJae Park
Implement sysfs directory that can be used by the users to install data probes. Link: https://lore.kernel.org/20260518234119.97569-10-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-02mm/damon/paddr: support data attributes monitoringSeongJae Park
Implement and register damon_operations->apply_probes() callback to support data attributes monitoring. Link: https://lore.kernel.org/20260518234119.97569-9-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-02mm/damon/core: do data attributes monitoringSeongJae Park
Implement the data attributes monitoring execution. Update kdamond to invoke the probes application callback, and reset the aggregated number of per-region per-probe positive samples for every aggregation interval. Link: https://lore.kernel.org/20260518234119.97569-8-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-02mm/damon/core: introduce damon_region->probe_hitsSeongJae Park
Add an array for the per-region per-probe positive samples count. For simple and efficient implementation, add a limit to the number of data probes and set the array to support only the limited number of counters. Link: https://lore.kernel.org/20260518234119.97569-6-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-02mm/damon/core: commit probesSeongJae Park
Update damon_commit_ctx() to commit installed data probes, too. Link: https://lore.kernel.org/20260518234119.97569-5-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-02mm/damon/core: introduce damon_filterSeongJae Park
Define a data structure for constructing damon_probe's attributes check, namely damon_filter. It is very similar to damos_filter but works only for monitoring purposes. Also embed that into damon_probe, implement essential handling of the link, with fundamental helpers. Link: https://lore.kernel.org/20260518234119.97569-4-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-02mm/damon/core: embed damon_probe objects in damon_ctxSeongJae Park
Let damon_probe objects be able to be installed on a given damon_ctx, by adding a linked list header for storing the objects. Add initialization and cleanup of the new field with helper functions, too. Link: https://lore.kernel.org/20260518234119.97569-3-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-02mm/memory-failure: remove hugetlb output parameter from ↵Ye Liu
try_memory_failure_hugetlb() Use -ENOENT return value to distinguish "not a hugetlb page" from "hugetlb handled", instead of carrying an extra output parameter. Link: https://lore.kernel.org/20260515020144.164941-1-ye.liu@linux.dev Signed-off-by: Ye Liu <liuye@kylinos.cn> Suggested-by: Oscar Salvador <osalvador@suse.de> Acked-by: Miaohe Lin <linmiaohe@huawei.com> Acked-by: Oscar Salvador (SUSE) <osalvador@kernel.org> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-02mm, swap: merge zeromap into swap tableKairui Song
By allocating one additional bit in the swap table entry's flags field alongside the count, we can store the zeromap inline For 64 bit systems, zeromap will store in the swap table, avoiding zeromap allocation. It reduces the allocated memory. That is the happy path. For certain 32-bit archs, there might not be enough bits in the swap table to contain both PFN and flags. Therefore, conditionally let each cluster have a zeromap field at build time, and use that instead. If the swapfile cluster is not fully used, it will still save memory for zeromap. The empty cluster does not allocate a zeromap. In the worst case, all cluster are fully populated. We will use memory similar to the previous zeromap implementation. A few macros were moved to different headers for build time struct definition. [akpm@linux-foundation.org: swap_cluster_alloc_table(): remove unused local `ret] [akpm@linux-foundation.org: fix unused label `err_free'] Link: https://lore.kernel.org/20260517-swap-table-p4-v5-12-88ae43e064c7@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Acked-by: Chris Li <chrisl@kernel.org> Reviewed-by: Youngjun Park <youngjun.park@lge.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: David Hildenbrand <david@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-02mm/memcg: remove no longer used swap cgroup arrayKairui Song
Now all swap cgroup records are stored in the swap cluster directly, the static array is no longer needed. Link: https://lore.kernel.org/20260517-swap-table-p4-v5-11-88ae43e064c7@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Acked-by: Chris Li <chrisl@kernel.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: David Hildenbrand <david@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Youngjun Park <youngjun.park@lge.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-02mm/memcg, swap: store cgroup id in cluster table directlyKairui Song
Drop the usage of the swap_cgroup_ctrl, and use the dynamic cluster table instead. The per-cluster memcg table is 1024 / 512 bytes on most archs, and does not need RCU protection: the cgroup data is only read and written under the cluster lock. That keeps things simple, lets the allocation use plain kmalloc with immediate kfree (no deferred free), and keeps fragmentation acceptable. [akpm@linux-foundation.org: memcgv1: don't compile swap functions when CONFIG_SWAP=n] Link: https://lore.kernel.org/202605281711.bSeZlErK-lkp@intel.com [akpm@linux-foundation.org: fix CONFIG_SWAP=n build] Link: https://lore.kernel.org/20260517-swap-table-p4-v5-10-88ae43e064c7@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Acked-by: Chris Li <chrisl@kernel.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: David Hildenbrand <david@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Youngjun Park <youngjun.park@lge.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-02mm, swap: consolidate cluster allocation helpersKairui Song
Swap cluster table management is spread across several narrow helpers. As a result, the allocation and fallback sequences are open-coded in multiple places. A few more per-cluster tables will be added soon, so avoid duplicating these sequences per table type. Fold the existing pairs into cluster-oriented helpers, and rename for consistency. No functional change, only a few sanity checks are slightly adjusted. Link: https://lore.kernel.org/20260517-swap-table-p4-v5-9-88ae43e064c7@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Acked-by: Chris Li <chrisl@kernel.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: David Hildenbrand <david@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Youngjun Park <youngjun.park@lge.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-02mm, swap: delay and unify memcg lookup and charging for swapinKairui Song
Instead of checking the cgroup private ID during page table walk in swap_pte_batch(), move the memcg lookup into __swap_cache_add_check() under the cluster lock. The first pre-alloc check is speculative and skips the memcg check since the post-alloc stable check ensures all slots covered by the folio belong to the same memcg. It is very rare for contiguous and aligned entries across a contiguous region of a page table of the same process or shmem mapping to belong to different memcgs. This also prepares for recording the memcg info in the cluster's table. Also make the order check and fallback more compact. There should be no user-observable behavior change. Link: https://lore.kernel.org/20260517-swap-table-p4-v5-8-88ae43e064c7@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Acked-by: Chris Li <chrisl@kernel.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: David Hildenbrand <david@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Youngjun Park <youngjun.park@lge.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-02mm, swap: support flexible batch freeing of slots in different memcgsKairui Song
Instead of requiring the caller to ensure all slots are in the same memcg, make the function handle different memcgs at once. This is both a micro optimization and required for removing the memcg lookup in the page table layer, so it can be unified at the swap layer. We are not removing the memcg lookup in the page table in this commit. It has to be done after the memcg lookup is deferred to the swap layer. Link: https://lore.kernel.org/20260517-swap-table-p4-v5-7-88ae43e064c7@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Acked-by: Chris Li <chrisl@kernel.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: David Hildenbrand <david@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Youngjun Park <youngjun.park@lge.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-02mm/memcg, swap: tidy up cgroup v1 memsw swap helpersKairui Song
The cgroup v1 swap helpers always operate on swap cache folios whose swap entry is stable: the folio is locked and in the swap cache. There is no need to pass the swap entry or page count as separate parameters when they can be derived from the folio itself. Simplify the redundant parameters and add sanity checks to document the required preconditions. Also rename memcg1_swapout to __memcg1_swapout to indicate it requires special calling context: the folio must be isolated and dying, and the call must be made with interrupts disabled. No functional change. Link: https://lore.kernel.org/20260517-swap-table-p4-v5-6-88ae43e064c7@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Acked-by: Chris Li <chrisl@kernel.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: David Hildenbrand <david@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Youngjun Park <youngjun.park@lge.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-02mm, swap: unify large folio allocationKairui Song
Now that direct large order allocation is supported in the swap cache, both anon and shmem can use it instead of implementing their own methods. This unifies the fallback and swap cache check, which also reduces the TOCTOU race window of swap cache state: previously, high order swapin required checking swap cache states first, then allocating and falling back separately. Now all these steps happen in the same compact loop. Order fallback and statistics are also unified, callers just need to check and pass the acceptable order bitmask. There is basically no behavior change. This only makes things more unified and prepares for later commits. Cgroup and zero map checks can also be moved into the compact loop, further reducing race windows and redundancy Link: https://lore.kernel.org/20260517-swap-table-p4-v5-5-88ae43e064c7@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Chris Li <chrisl@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Youngjun Park <youngjun.park@lge.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-02mm, swap: add support for stable large allocation in swap cache directlyKairui Song
To make it possible to allocate large folios directly in swap cache, provide a new infrastructure helper to handle the swap cache status check, allocation, and order fallback in the swap cache layer The new helper replaces the existing swap_cache_alloc_folio. Based on this, all the separate swap folio allocation that is being done by anon / shmem before is converted to use this helper directly, unifying folio allocation for anon, shmem, and readahead. This slightly consolidates how allocation is synchronized, making it more stable and less prone to errors. The slot-count and cache-conflict check is now always performed with the cluster lock held before allocation, and repeated under the same lock right before cache insertion. This double check produces a stable result compared to the previous anon and shmem mTHP allocation implementation, avoids the false-negative conflict checks that the lockless path can return — large allocations no longer have to be unwound because the range turned out to be occupied — and aborts early for already-freed slots, which helps ordinary swapin and especially readahead, with only a marginal increase in cluster-lock contention (the lock is very lightly contended and stays local in the first place). Hence, callers of swap_cache_alloc_folio() no longer need to check the swap slot count or swap cache status themselves. And now whoever first successfully allocates a folio in the swap cache will be the one who charges it and performs the swap-in. The race window of swapping is also reduced since the loop is much more compact. Link: https://lore.kernel.org/20260517-swap-table-p4-v5-4-88ae43e064c7@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Chris Li <chrisl@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Youngjun Park <youngjun.park@lge.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-02mm/huge_memory: move THP gfp limit helper into headerKairui Song
Shmem has some special requirements for THP GFP and has to limit it in certain zones or provide a more lenient fallback. We'll use this helper for generic swap THP allocation, which needs to support shmem. For a typical GFP_HIGHUSER_MOVABLE swap-in, this helper is basically a no-op. But it's necessary for certain shmem users, mostly drivers. No feature change. Link: https://lore.kernel.org/20260517-swap-table-p4-v5-3-88ae43e064c7@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Acked-by: Chris Li <chrisl@kernel.org> Reviewed-by: Zi Yan <ziy@nvidia.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: David Hildenbrand <david@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Youngjun Park <youngjun.park@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-02mm, swap: move common swap cache operations into standalone helpersKairui Song
Move a few swap cache checking, adding, and deletion operations into standalone helpers to be used later. And while at it, add proper kernel doc. No feature or behavior change. Link: https://lore.kernel.org/20260517-swap-table-p4-v5-2-88ae43e064c7@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Acked-by: Chris Li <chrisl@kernel.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: David Hildenbrand <david@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Youngjun Park <youngjun.park@lge.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-02mm, swap: simplify swap cache allocation helperKairui Song
Patch series "mm, swap: swap table phase IV: unify allocation", v5. This series unifies the allocation and charging of anon and shmem swap in folios, provides better synchronization, consolidates the metadata management, hence dropping the static array and map, and improves the performance. The static metadata overhead is now close to zero, and workload performance is slightly improved. For example, mounting a 1TB swap device saves about 512MB of memory: Before: free -m total used free shared buff/cache available Mem: 1464 805 346 1 382 658 Swap: 1048575 0 1048575 After: free -m total used free shared buff/cache available Mem: 1464 277 899 1 356 1187 Swap: 1048575 0 1048575 Memory usage is ~512M lower, and we now have a close to 0 static overhead. It was about 2 bytes per slot before, now roughly 0.09375 bytes per slot (48 bytes ci info per cluster, which is 512 slots). Performance test is also looking good, testing Redis in a 2G VM using 6G ZRAM as swap: valkey-server --maxmemory 2560M redis-benchmark -r 3000000 -n 3000000 -d 1024 -c 12 -P 32 -t get Before: 3385017.283654 RPS After: 3433309.307292 RPS (1.42% better) Testing with build kernel under global pressure on a 48c96t system, limiting the total memory to 8G, using 12G ZRAM, 24 test runs, enabling THP: make -j96, using defconfig Before: user time 2904.59s system time 4773.99s After: user time 2909.38s system time 4641.55s (2.77% better) Testing with usemem on a 32c machine using 48G brd ramdisk and 16G RAM, 12 test run: usemem --init-time -O -y -x -n 48 1G Before: Throughput (Sum): 6482.58 MB/s Free Latency: 371371.67us After: Throughput (Sum): 6539.28 MB/s Free Latency: 363059.88us Seems similar, or slightly better. This series also reduces memory thrashing, I no longer see any: "Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF", it was shown several times during stress testing before this series when under great pressure: Before: grep -Ri VM_FAULT_OOM <test logs> | wc -l => 18 After: grep -Ri VM_FAULT_OOM <test logs> | wc -l => 0 This patch (of 12): Instead of trying to return the existing folio if the entry is already cached in swap_cache_alloc_folio, simply return an error pointer if the allocation failed, and drop the output argument that indicates what kind of folio is actually returned. And a proper wrapper swap_cache_read_folio that decouples and handles the actual requirement - read in the folio, or return the already read folio in cache. This is what async swapin and readahead actually required. As for zswap swap out, the caller just needs to abort if the allocation fails because the entry is gone or already cached, so removing simplifies the return argument, making it cleaner. No feature change. Link: https://lore.kernel.org/20260517-swap-table-p4-v5-0-88ae43e064c7@tencent.com Link: https://lore.kernel.org/20260517-swap-table-p4-v5-1-88ae43e064c7@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Acked-by: Chris Li <chrisl@kernel.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: David Hildenbrand <david@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Zi Yan <ziy@nvidia.com> Cc: Youngjun Park <youngjun.park@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-02mm: swap_cgroup: fix NULL deref in lookup_swap_cgroup_id on swapless hostJose Fernandez (Anthropic)
lookup_swap_cgroup_id() passes swap_cgroup_ctrl[type].map to __swap_cgroup_id_lookup() without checking that the type was ever registered via swap_cgroup_swapon(). On a swapless host every ctrl->map is NULL, so __swap_cgroup_id_lookup() dereferences NULL + a scaled swp_offset(). Since commit bea67dcc5eea ("mm: attempt to batch free swap entries for zap_pte_range()"), zap_pte_range() -> swap_pte_batch() calls lookup_swap_cgroup_id() on any non-present, non-none PTE that decodes as a real swap entry, without first validating it against swap_info[]. A single PTE corrupted into a type-0 swap entry takes the host down at process exit. We hit this in production on a swapless 6.12.58 host: ~1s of "get_swap_device: Bad swap file entry 3f800204222bb" (do_swap_page() being correctly defensive about the same entry) followed by BUG: unable to handle page fault for address: 000003f800204220 RIP: 0010:lookup_swap_cgroup_id+0x2b/0x60 Call Trace: swap_pte_batch+0xbf/0x230 zap_pte_range+0x4c8/0x780 unmap_page_range+0x190/0x3e0 exit_mmap+0xd9/0x3c0 do_exit+0x20c/0x4b0 syzbot has reported the identical stack. The source of the PTE corruption is a separate bug; this change makes the teardown path as robust as the fault path already is. Every other caller of lookup_swap_cgroup_id() is downstream of a get_swap_device() that has already validated the entry, so the new branch is cold. Link: https://lore.kernel.org/20260504-swap-cgroup-fix-7-0-v1-1-f53ff41ee553@linux.dev Fixes: bea67dcc5eea ("mm: attempt to batch free swap entries for zap_pte_range()") Signed-off-by: Jose Fernandez (Anthropic) <jose.fernandez@linux.dev> Reported-by: syzbot+e12bd9ca48157add237a@syzkaller.appspotmail.com Link: https://lore.kernel.org/r/69859728.050a0220.3b3015.0033.GAE@google.com Assisted-by: Claude:unspecified Cc: Barry Song <baohua@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <ryncsn@gmail.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-02mm/page_alloc: document that alloc_pages_nolock() uses RCUBrendan Jackman
The allocator interacts with cgroups which rely on RCU. RCU does not work everywhere, so the "any context" claim is slightly overstated here. This should already be enforced by objtool, since this function is not marked noinstr the x86 build should fail if you call it from a place where RCU is not watching. But, expecting readers to make that connection for themselves seems a bit cruel (I don't think there is even any documentation of what noinstr means at all, let alone the connection with RCU). Note this is not claiming that any cgroup code called from the allocator would actually break if this restriction was violated, it could very well be that there's no real way for the allocator to act on a cgroup that can disappear concurrently. But, since it's likely nobody has verified this one way or another, better to just be safe and declare that RCU is required. Allocating from an RCU-unsafe context seems a bit crazy anyway. Link: https://lore.kernel.org/20260519-nolock-rcu-comment-v1-1-4a630c8794e5@google.com Signed-off-by: Brendan Jackman <jackmanb@google.com> Suggested-by: Junaid Shahid <junaids@google.com> Acked-by: Harry Yoo (Oracle) <harry@kernel.org> Acked-by: Vlastimil Babka (SUSE) <vbabka@kernel.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-02mm/page_alloc: drop a misleading __always_inlineBrendan Jackman
get_pfnblock_migratetype() is called from outside page_alloc.c, so it cannot always be inlined. Remove the annotation to avoid misleading readers. At least in my minimal config, with GCC, this doesn't change mm/page_alloc.o at all. Link: https://lore.kernel.org/all/20260517-b4-drop-always-inline-v1-1-97b90930e8b8@google.com/ Signed-off-by: Brendan Jackman <jackmanb@google.com> Suggested-by: Vlastimil Babka <vbabka@kernel.org> Link: https://lore.kernel.org/all/016c8bef-57ef-44ef-bf60-86dbfd368dcd@kernel.org/ Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: SeongJae Park <sj@kernel.org> Reviewed-by: Vishal Moola <vishal.moola@gmail.com> Reviewed-by: Vlastimil Babka (SUSE) <vbabka@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-02mm/page_alloc: remove ifdefs from pindex helpersBrendan Jackman
The ifdefs are not technically needed here, everything used here is always defined. Switching to IS_ENABLED() makes the code a bit less tiresome to read. Link: https://lore.kernel.org/20260513-page_alloc-unmapped-prep-v1-4-dacdf5402be8@google.com Signed-off-by: Brendan Jackman <jackmanb@google.com> Reviewed-by: Vlastimil Babka (SUSE) <vbabka@kernel.org> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Barry Song <baohua@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <kasong@tencent.com> Cc: Len Brown <lenb@kernel.org> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Wei Xu <weixugc@google.com> Cc: Yuanchu Xie <yuanchu@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-02mm: rejig pageblock mask definitionsBrendan Jackman
- Add a PAGEBLOCK_ prefix to the names to avoid polluting the "global namespace" too much. - This new prefix makes MIGRATETYPE_AND_ISO_MASK look pretty long. Well, that global mask only exists for quite a specific purpose, and is quite a weird thing to have a name for anyway. So drop it and take advantage of the newly-defined PAGEBLOCK_ISO_MASK. Link: https://lore.kernel.org/20260513-page_alloc-unmapped-prep-v1-3-dacdf5402be8@google.com Signed-off-by: Brendan Jackman <jackmanb@google.com> Reviewed-by: Vlastimil Babka (SUSE) <vbabka@kernel.org> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Barry Song <baohua@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <kasong@tencent.com> Cc: Len Brown <lenb@kernel.org> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Wei Xu <weixugc@google.com> Cc: Yuanchu Xie <yuanchu@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-02mm/page_alloc: don't overload migratetype in find_suitable_fallback()Brendan Jackman
This function currently returns a signed integer that encodes status in-band, as negative numbers, along with a migratetype. Switch to a more explicit/verbose style that encodes the status and migratetype separately. In the spirit of making things more explicit, also create an enum to avoid using magic integer literals with special meanings. This enables documenting the values at their definition instead of in one of the callers. Link: https://lore.kernel.org/20260513-page_alloc-unmapped-prep-v1-2-dacdf5402be8@google.com Signed-off-by: Brendan Jackman <jackmanb@google.com> Reviewed-by: Vlastimil Babka (SUSE) <vbabka@kernel.org> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Barry Song <baohua@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <kasong@tencent.com> Cc: Len Brown <lenb@kernel.org> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Wei Xu <weixugc@google.com> Cc: Yuanchu Xie <yuanchu@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-02mm: introduce for_each_free_list()Brendan Jackman
Patch series "mm: misc cleanups from __GFP_UNMAPPED series". In v2 of the __GFP_UNMAPPED series [0], we realised that some of the patches could potentially be merged as independent cleanups. These are all independent of one another, if you think some are useful cleanups and others are pointless churn, it should be fine to just pick whatever subset you prefer. No functional change intended. This patch (of 4): There are a couple of places that iterate over the freelists with awareness of the data structures' layout. It seems ideally, code outside of mm should not be aware of the page allocator's freelists at all. But, this patch just doesn't hide them completely, it's just a meek incremental step in that direction: provide a macro to iterate over it without needing to be aware of the actual struct fields. Link: https://lore.kernel.org/20260513-page_alloc-unmapped-prep-v1-0-dacdf5402be8@google.com Link: https://lore.kernel.org/20260513-page_alloc-unmapped-prep-v1-1-dacdf5402be8@google.com Link: https://lore.kernel.org/all/20260320-page_alloc-unmapped-v2-0-28bf1bd54f41@google.com/ [0] Signed-off-by: Brendan Jackman <jackmanb@google.com> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Vlastimil Babka (SUSE) <vbabka@kernel.org> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Barry Song <baohua@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <kasong@tencent.com> Cc: Len Brown <lenb@kernel.org> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Wei Xu <weixugc@google.com> Cc: Yuanchu Xie <yuanchu@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>