From f2be745071ffd6793c032ca8443348c3ce0e3e18 Mon Sep 17 00:00:00 2001 From: Kevin Brodsky Date: Mon, 15 Dec 2025 15:03:14 +0000 Subject: mm: clarify lazy_mmu sleeping constraints The lazy MMU mode documentation makes clear that an implementation should not assume that preemption is disabled or any lock is held upon entry to the mode; however it says nothing about what code using the lazy MMU interface should expect. In practice sleeping is forbidden (for generic code) while the lazy MMU mode is active: say it explicitly. Link: https://lkml.kernel.org/r/20251215150323.2218608-6-kevin.brodsky@arm.com Signed-off-by: Kevin Brodsky Reviewed-by: Yeoreum Yun Acked-by: David Hildenbrand (Red Hat) Cc: Alexander Gordeev Cc: Andreas Larsson Cc: Anshuman Khandual Cc: Borislav Betkov Cc: Boris Ostrovsky Cc: Catalin Marinas Cc: Christophe Leroy Cc: David S. Miller Cc: David Woodhouse Cc: "H. Peter Anvin" Cc: Ingo Molnar Cc: Jann Horn Cc: Juegren Gross Cc: Liam Howlett Cc: Lorenzo Stoakes Cc: Madhavan Srinivasan Cc: Michael Ellerman Cc: Michal Hocko Cc: Mike Rapoport Cc: Nicholas Piggin Cc: Peter Zijlstra Cc: Ritesh Harjani (IBM) Cc: Ryan Roberts Cc: Suren Baghdasaryan Cc: Thomas Gleinxer Cc: Venkat Rao Bagalkote Cc: Vlastimil Babka Cc: Will Deacon Signed-off-by: Andrew Morton --- include/linux/pgtable.h | 14 +++++++++----- 1 file changed, 9 insertions(+), 5 deletions(-) (limited to 'include') diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h index 652f287c1ef6..1abc4a1c3d72 100644 --- a/include/linux/pgtable.h +++ b/include/linux/pgtable.h @@ -225,11 +225,15 @@ static inline int pmd_dirty(pmd_t pmd) * up to date. * * In the general case, no lock is guaranteed to be held between entry and exit - * of the lazy mode. So the implementation must assume preemption may be enabled - * and cpu migration is possible; it must take steps to be robust against this. - * (In practice, for user PTE updates, the appropriate page table lock(s) are - * held, but for kernel PTE updates, no lock is held). Nesting is not permitted - * and the mode cannot be used in interrupt context. + * of the lazy mode. (In practice, for user PTE updates, the appropriate page + * table lock(s) are held, but for kernel PTE updates, no lock is held). + * The implementation must therefore assume preemption may be enabled upon + * entry to the mode and cpu migration is possible; it must take steps to be + * robust against this. An implementation may handle this by disabling + * preemption, as a consequence generic code may not sleep while the lazy MMU + * mode is active. + * + * Nesting is not permitted and the mode cannot be used in interrupt context. */ #ifndef __HAVE_ARCH_ENTER_LAZY_MMU_MODE static inline void arch_enter_lazy_mmu_mode(void) {} -- cgit v1.2.3 From 7303ecbfe4f46c00191b9b66acaa918784bad210 Mon Sep 17 00:00:00 2001 From: Kevin Brodsky Date: Mon, 15 Dec 2025 15:03:15 +0000 Subject: mm: introduce CONFIG_ARCH_HAS_LAZY_MMU_MODE Architectures currently opt in for implementing lazy_mmu helpers by defining __HAVE_ARCH_ENTER_LAZY_MMU_MODE. In preparation for introducing a generic lazy_mmu layer that will require storage in task_struct, let's switch to a cleaner approach: instead of defining a macro, select a CONFIG option. This patch introduces CONFIG_ARCH_HAS_LAZY_MMU_MODE and has each arch select it when it implements lazy_mmu helpers. __HAVE_ARCH_ENTER_LAZY_MMU_MODE is removed and relies on the new CONFIG instead. On x86, lazy_mmu helpers are only implemented if PARAVIRT_XXL is selected. 
This creates some complications in arch/x86/boot/, because a few files manually undefine PARAVIRT* options. As a result does not define the lazy_mmu helpers, but this breaks the build as only defines them if !CONFIG_ARCH_HAS_LAZY_MMU_MODE. There does not seem to be a clean way out of this - let's just undefine that new CONFIG too. Link: https://lkml.kernel.org/r/20251215150323.2218608-7-kevin.brodsky@arm.com Signed-off-by: Kevin Brodsky Acked-by: David Hildenbrand Reviewed-by: Ritesh Harjani (IBM) Reviewed-by: Ryan Roberts Reviewed-by: Yeoreum Yun Acked-by: Andreas Larsson [sparc] Cc: Alexander Gordeev Cc: Anshuman Khandual Cc: Borislav Betkov Cc: Boris Ostrovsky Cc: Catalin Marinas Cc: Christophe Leroy Cc: David Hildenbrand (Red Hat) Cc: David S. Miller Cc: David Woodhouse Cc: "H. Peter Anvin" Cc: Ingo Molnar Cc: Jann Horn Cc: Juegren Gross Cc: Liam Howlett Cc: Lorenzo Stoakes Cc: Madhavan Srinivasan Cc: Michael Ellerman Cc: Michal Hocko Cc: Mike Rapoport Cc: Nicholas Piggin Cc: Peter Zijlstra Cc: Suren Baghdasaryan Cc: Thomas Gleinxer Cc: Venkat Rao Bagalkote Cc: Vlastimil Babka Cc: Will Deacon Signed-off-by: Andrew Morton --- include/linux/pgtable.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'include') diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h index 1abc4a1c3d72..d46d86959bd6 100644 --- a/include/linux/pgtable.h +++ b/include/linux/pgtable.h @@ -235,7 +235,7 @@ static inline int pmd_dirty(pmd_t pmd) * * Nesting is not permitted and the mode cannot be used in interrupt context. */ -#ifndef __HAVE_ARCH_ENTER_LAZY_MMU_MODE +#ifndef CONFIG_ARCH_HAS_LAZY_MMU_MODE static inline void arch_enter_lazy_mmu_mode(void) {} static inline void arch_leave_lazy_mmu_mode(void) {} static inline void arch_flush_lazy_mmu_mode(void) {} -- cgit v1.2.3 From 0a096ab7a3a6e2859c3c88988e548c5c213138bc Mon Sep 17 00:00:00 2001 From: Kevin Brodsky Date: Mon, 15 Dec 2025 15:03:16 +0000 Subject: mm: introduce generic lazy_mmu helpers The implementation of the lazy MMU mode is currently entirely arch-specific; core code directly calls arch helpers: arch_{enter,leave}_lazy_mmu_mode(). We are about to introduce support for nested lazy MMU sections. As things stand we'd have to duplicate that logic in every arch implementing lazy_mmu - adding to a fair amount of logic already duplicated across lazy_mmu implementations. This patch therefore introduces a new generic layer that calls the existing arch_* helpers. Two pair of calls are introduced: * lazy_mmu_mode_enable() ... lazy_mmu_mode_disable() This is the standard case where the mode is enabled for a given block of code by surrounding it with enable() and disable() calls. * lazy_mmu_mode_pause() ... lazy_mmu_mode_resume() This is for situations where the mode is temporarily disabled by first calling pause() and then resume() (e.g. to prevent any batching from occurring in a critical section). The documentation in will be updated in a subsequent patch. No functional change should be introduced at this stage. The implementation of enable()/resume() and disable()/pause() is currently identical, but nesting support will change that. Most of the call sites have been updated using the following Coccinelle script: @@ @@ { ... - arch_enter_lazy_mmu_mode(); + lazy_mmu_mode_enable(); ... - arch_leave_lazy_mmu_mode(); + lazy_mmu_mode_disable(); ... } @@ @@ { ... - arch_leave_lazy_mmu_mode(); + lazy_mmu_mode_pause(); ... - arch_enter_lazy_mmu_mode(); + lazy_mmu_mode_resume(); ... 
} A couple of notes regarding x86: * Xen is currently the only case where explicit handling is required for lazy MMU when context-switching. This is purely an implementation detail and using the generic lazy_mmu_mode_* functions would cause trouble when nesting support is introduced, because the generic functions must be called from the current task. For that reason we still use arch_leave() and arch_enter() there. * x86 calls arch_flush_lazy_mmu_mode() unconditionally in a few places, but only defines it if PARAVIRT_XXL is selected, and we are removing the fallback in . Add a new fallback definition to to keep things building. Link: https://lkml.kernel.org/r/20251215150323.2218608-8-kevin.brodsky@arm.com Signed-off-by: Kevin Brodsky Acked-by: David Hildenbrand Reviewed-by: Anshuman Khandual Reviewed-by: Yeoreum Yun Cc: Alexander Gordeev Cc: Andreas Larsson Cc: Borislav Betkov Cc: Boris Ostrovsky Cc: Catalin Marinas Cc: Christophe Leroy Cc: David Hildenbrand (Red Hat) Cc: David S. Miller Cc: David Woodhouse Cc: "H. Peter Anvin" Cc: Ingo Molnar Cc: Jann Horn Cc: Juegren Gross Cc: Liam Howlett Cc: Lorenzo Stoakes Cc: Madhavan Srinivasan Cc: Michael Ellerman Cc: Michal Hocko Cc: Mike Rapoport Cc: Nicholas Piggin Cc: Peter Zijlstra Cc: Ritesh Harjani (IBM) Cc: Ryan Roberts Cc: Suren Baghdasaryan Cc: Thomas Gleinxer Cc: Venkat Rao Bagalkote Cc: Vlastimil Babka Cc: Will Deacon Signed-off-by: Andrew Morton --- include/linux/pgtable.h | 29 +++++++++++++++++++++++++---- 1 file changed, 25 insertions(+), 4 deletions(-) (limited to 'include') diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h index d46d86959bd6..116a18b7916c 100644 --- a/include/linux/pgtable.h +++ b/include/linux/pgtable.h @@ -235,10 +235,31 @@ static inline int pmd_dirty(pmd_t pmd) * * Nesting is not permitted and the mode cannot be used in interrupt context. */ -#ifndef CONFIG_ARCH_HAS_LAZY_MMU_MODE -static inline void arch_enter_lazy_mmu_mode(void) {} -static inline void arch_leave_lazy_mmu_mode(void) {} -static inline void arch_flush_lazy_mmu_mode(void) {} +#ifdef CONFIG_ARCH_HAS_LAZY_MMU_MODE +static inline void lazy_mmu_mode_enable(void) +{ + arch_enter_lazy_mmu_mode(); +} + +static inline void lazy_mmu_mode_disable(void) +{ + arch_leave_lazy_mmu_mode(); +} + +static inline void lazy_mmu_mode_pause(void) +{ + arch_leave_lazy_mmu_mode(); +} + +static inline void lazy_mmu_mode_resume(void) +{ + arch_enter_lazy_mmu_mode(); +} +#else +static inline void lazy_mmu_mode_enable(void) {} +static inline void lazy_mmu_mode_disable(void) {} +static inline void lazy_mmu_mode_pause(void) {} +static inline void lazy_mmu_mode_resume(void) {} #endif #ifndef pte_batch_hint -- cgit v1.2.3 From 9273dfaeaca8ea4d88c7e9fd081922a029984fd4 Mon Sep 17 00:00:00 2001 From: Kevin Brodsky Date: Mon, 15 Dec 2025 15:03:17 +0000 Subject: mm: bail out of lazy_mmu_mode_* in interrupt context The lazy MMU mode cannot be used in interrupt context. This is documented in , but isn't consistently handled across architectures. arm64 ensures that calls to lazy_mmu_mode_* have no effect in interrupt context, because such calls do occur in certain configurations - see commit b81c688426a9 ("arm64/mm: Disable barrier batching in interrupt contexts"). Other architectures do not check this situation, most likely because it hasn't occurred so far. Let's handle this in the new generic lazy_mmu layer, in the same fashion as arm64: bail out of lazy_mmu_mode_* if in_interrupt(). Also remove the arm64 handling that is now redundant. 
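For illustration, a minimal sketch of how generic page-table code is expected to use the wrappers introduced in the previous patch (not taken from this series; the walk and the set_pte_at()/pte_next_pfn() usage are only an example):

static void example_set_ptes(struct mm_struct *mm, unsigned long addr,
                             pte_t *ptep, pte_t pte, unsigned int nr)
{
        unsigned int i;

        lazy_mmu_mode_enable();
        for (i = 0; i < nr; i++, ptep++, addr += PAGE_SIZE) {
                /* The arch may defer/batch this update while the mode is on. */
                set_pte_at(mm, addr, ptep, pte);
                pte = pte_next_pfn(pte);
        }
        lazy_mmu_mode_disable();        /* batched state is flushed here */
}

A lazy_mmu_mode_pause()/lazy_mmu_mode_resume() pair can bracket any part of such a block where batching must not occur.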
Both arm64 and x86/Xen also ensure that any lazy MMU optimisation is disabled while in interrupt (see queue_pte_barriers() and xen_get_lazy_mode() respectively). This will be handled in the generic layer in a subsequent patch. Link: https://lkml.kernel.org/r/20251215150323.2218608-9-kevin.brodsky@arm.com Signed-off-by: Kevin Brodsky Acked-by: David Hildenbrand (Red Hat) Reviewed-by: Anshuman Khandual Reviewed-by: Yeoreum Yun Cc: Alexander Gordeev Cc: Andreas Larsson Cc: Borislav Betkov Cc: Boris Ostrovsky Cc: Catalin Marinas Cc: Christophe Leroy Cc: David Hildenbrand Cc: David S. Miller Cc: David Woodhouse Cc: "H. Peter Anvin" Cc: Ingo Molnar Cc: Jann Horn Cc: Juegren Gross Cc: Liam Howlett Cc: Lorenzo Stoakes Cc: Madhavan Srinivasan Cc: Michael Ellerman Cc: Michal Hocko Cc: Mike Rapoport Cc: Nicholas Piggin Cc: Peter Zijlstra Cc: Ritesh Harjani (IBM) Cc: Ryan Roberts Cc: Suren Baghdasaryan Cc: Thomas Gleinxer Cc: Venkat Rao Bagalkote Cc: Vlastimil Babka Cc: Will Deacon Signed-off-by: Andrew Morton --- include/linux/pgtable.h | 17 ++++++++++++++++- 1 file changed, 16 insertions(+), 1 deletion(-) (limited to 'include') diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h index 116a18b7916c..dddde6873d1e 100644 --- a/include/linux/pgtable.h +++ b/include/linux/pgtable.h @@ -233,26 +233,41 @@ static inline int pmd_dirty(pmd_t pmd) * preemption, as a consequence generic code may not sleep while the lazy MMU * mode is active. * - * Nesting is not permitted and the mode cannot be used in interrupt context. + * The mode is disabled in interrupt context and calls to the lazy_mmu API have + * no effect. + * + * Nesting is not permitted. */ #ifdef CONFIG_ARCH_HAS_LAZY_MMU_MODE static inline void lazy_mmu_mode_enable(void) { + if (in_interrupt()) + return; + arch_enter_lazy_mmu_mode(); } static inline void lazy_mmu_mode_disable(void) { + if (in_interrupt()) + return; + arch_leave_lazy_mmu_mode(); } static inline void lazy_mmu_mode_pause(void) { + if (in_interrupt()) + return; + arch_leave_lazy_mmu_mode(); } static inline void lazy_mmu_mode_resume(void) { + if (in_interrupt()) + return; + arch_enter_lazy_mmu_mode(); } #else -- cgit v1.2.3 From 5ab246749569cff9f815618f02ba0d7cf20e5edd Mon Sep 17 00:00:00 2001 From: Kevin Brodsky Date: Mon, 15 Dec 2025 15:03:18 +0000 Subject: mm: enable lazy_mmu sections to nest MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Despite recent efforts to prevent lazy_mmu sections from nesting, it remains difficult to ensure that it never occurs - and in fact it does occur on arm64 in certain situations (CONFIG_DEBUG_PAGEALLOC). Commit 1ef3095b1405 ("arm64/mm: Permit lazy_mmu_mode to be nested") made nesting tolerable on arm64, but without truly supporting it: the inner call to leave() disables the batching optimisation before the outer section ends. This patch actually enables lazy_mmu sections to nest by tracking the nesting level in task_struct, in a similar fashion to e.g. pagefault_{enable,disable}(). This is fully handled by the generic lazy_mmu helpers that were recently introduced. lazy_mmu sections were not initially intended to nest, so we need to clarify the semantics w.r.t. the arch_*_lazy_mmu_mode() callbacks. This patch takes the following approach: * The outermost calls to lazy_mmu_mode_{enable,disable}() trigger calls to arch_{enter,leave}_lazy_mmu_mode() - this is unchanged. 
* Nested calls to lazy_mmu_mode_{enable,disable}() are not forwarded to the arch via arch_{enter,leave} - lazy MMU remains enabled so the assumption is that these callbacks are not relevant. However, existing code may rely on a call to disable() to flush any batched state, regardless of nesting. arch_flush_lazy_mmu_mode() is therefore called in that situation. A separate interface was recently introduced to temporarily pause the lazy MMU mode: lazy_mmu_mode_{pause,resume}(). pause() fully exits the mode *regardless of the nesting level*, and resume() restores the mode at the same nesting level. pause()/resume() are themselves allowed to nest, so we actually store two nesting levels in task_struct: enable_count and pause_count. A new helper is_lazy_mmu_mode_active() is introduced to determine whether we are currently in lazy MMU mode; this will be used in subsequent patches to replace the various ways arch's currently track whether the mode is enabled. In summary (enable/pause represent the values *after* the call): lazy_mmu_mode_enable() -> arch_enter() enable=1 pause=0 lazy_mmu_mode_enable() -> ø enable=2 pause=0 lazy_mmu_mode_pause() -> arch_leave() enable=2 pause=1 lazy_mmu_mode_resume() -> arch_enter() enable=2 pause=0 lazy_mmu_mode_disable() -> arch_flush() enable=1 pause=0 lazy_mmu_mode_disable() -> arch_leave() enable=0 pause=0 Note: is_lazy_mmu_mode_active() is added to to allow arch headers included by to use it. Link: https://lkml.kernel.org/r/20251215150323.2218608-10-kevin.brodsky@arm.com Signed-off-by: Kevin Brodsky Acked-by: David Hildenbrand (Red Hat) Reviewed-by: Yeoreum Yun Cc: Alexander Gordeev Cc: Andreas Larsson Cc: Anshuman Khandual Cc: Borislav Betkov Cc: Boris Ostrovsky Cc: Catalin Marinas Cc: Christophe Leroy Cc: David Hildenbrand Cc: David S. Miller Cc: David Woodhouse Cc: "H. Peter Anvin" Cc: Ingo Molnar Cc: Jann Horn Cc: Juegren Gross Cc: Liam Howlett Cc: Lorenzo Stoakes Cc: Madhavan Srinivasan Cc: Michael Ellerman Cc: Michal Hocko Cc: Mike Rapoport Cc: Nicholas Piggin Cc: Peter Zijlstra Cc: Ritesh Harjani (IBM) Cc: Ryan Roberts Cc: Suren Baghdasaryan Cc: Thomas Gleinxer Cc: Venkat Rao Bagalkote Cc: Vlastimil Babka Cc: Will Deacon Signed-off-by: Andrew Morton --- include/linux/mm_types_task.h | 5 ++ include/linux/pgtable.h | 114 +++++++++++++++++++++++++++++++++++++++--- include/linux/sched.h | 45 +++++++++++++++++ 3 files changed, 157 insertions(+), 7 deletions(-) (limited to 'include') diff --git a/include/linux/mm_types_task.h b/include/linux/mm_types_task.h index a82aa80c0ba4..11bf319d78ec 100644 --- a/include/linux/mm_types_task.h +++ b/include/linux/mm_types_task.h @@ -88,4 +88,9 @@ struct tlbflush_unmap_batch { #endif }; +struct lazy_mmu_state { + u8 enable_count; + u8 pause_count; +}; + #endif /* _LINUX_MM_TYPES_TASK_H */ diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h index dddde6873d1e..2f0dd3a4ace1 100644 --- a/include/linux/pgtable.h +++ b/include/linux/pgtable.h @@ -236,39 +236,139 @@ static inline int pmd_dirty(pmd_t pmd) * The mode is disabled in interrupt context and calls to the lazy_mmu API have * no effect. * - * Nesting is not permitted. + * The lazy MMU mode is enabled for a given block of code using: + * + * lazy_mmu_mode_enable(); + * + * lazy_mmu_mode_disable(); + * + * Nesting is permitted: may itself use an enable()/disable() pair. + * A nested call to enable() has no functional effect; however disable() causes + * any batched architectural state to be flushed regardless of nesting. 
After a + * call to disable(), the caller can therefore rely on all previous page table + * modifications to have taken effect, but the lazy MMU mode may still be + * enabled. + * + * In certain cases, it may be desirable to temporarily pause the lazy MMU mode. + * This can be done using: + * + * lazy_mmu_mode_pause(); + * + * lazy_mmu_mode_resume(); + * + * pause() ensures that the mode is exited regardless of the nesting level; + * resume() re-enters the mode at the same nesting level. Any call to the + * lazy_mmu_mode_* API between those two calls has no effect. In particular, + * this means that pause()/resume() pairs may nest. + * + * is_lazy_mmu_mode_active() can be used to check whether the lazy MMU mode is + * currently enabled. */ #ifdef CONFIG_ARCH_HAS_LAZY_MMU_MODE +/** + * lazy_mmu_mode_enable() - Enable the lazy MMU mode. + * + * Enters a new lazy MMU mode section; if the mode was not already enabled, + * enables it and calls arch_enter_lazy_mmu_mode(). + * + * Must be paired with a call to lazy_mmu_mode_disable(). + * + * Has no effect if called: + * - While paused - see lazy_mmu_mode_pause() + * - In interrupt context + */ static inline void lazy_mmu_mode_enable(void) { - if (in_interrupt()) + struct lazy_mmu_state *state = ¤t->lazy_mmu_state; + + if (in_interrupt() || state->pause_count > 0) return; - arch_enter_lazy_mmu_mode(); + VM_WARN_ON_ONCE(state->enable_count == U8_MAX); + + if (state->enable_count++ == 0) + arch_enter_lazy_mmu_mode(); } +/** + * lazy_mmu_mode_disable() - Disable the lazy MMU mode. + * + * Exits the current lazy MMU mode section. If it is the outermost section, + * disables the mode and calls arch_leave_lazy_mmu_mode(). Otherwise (nested + * section), calls arch_flush_lazy_mmu_mode(). + * + * Must match a call to lazy_mmu_mode_enable(). + * + * Has no effect if called: + * - While paused - see lazy_mmu_mode_pause() + * - In interrupt context + */ static inline void lazy_mmu_mode_disable(void) { - if (in_interrupt()) + struct lazy_mmu_state *state = ¤t->lazy_mmu_state; + + if (in_interrupt() || state->pause_count > 0) return; - arch_leave_lazy_mmu_mode(); + VM_WARN_ON_ONCE(state->enable_count == 0); + + if (--state->enable_count == 0) + arch_leave_lazy_mmu_mode(); + else /* Exiting a nested section */ + arch_flush_lazy_mmu_mode(); + } +/** + * lazy_mmu_mode_pause() - Pause the lazy MMU mode. + * + * Pauses the lazy MMU mode; if it is currently active, disables it and calls + * arch_leave_lazy_mmu_mode(). + * + * Must be paired with a call to lazy_mmu_mode_resume(). Calls to the + * lazy_mmu_mode_* API have no effect until the matching resume() call. + * + * Has no effect if called: + * - While paused (inside another pause()/resume() pair) + * - In interrupt context + */ static inline void lazy_mmu_mode_pause(void) { + struct lazy_mmu_state *state = ¤t->lazy_mmu_state; + if (in_interrupt()) return; - arch_leave_lazy_mmu_mode(); + VM_WARN_ON_ONCE(state->pause_count == U8_MAX); + + if (state->pause_count++ == 0 && state->enable_count > 0) + arch_leave_lazy_mmu_mode(); } +/** + * lazy_mmu_mode_resume() - Resume the lazy MMU mode. + * + * Resumes the lazy MMU mode; if it was active at the point where the matching + * call to lazy_mmu_mode_pause() was made, re-enables it and calls + * arch_enter_lazy_mmu_mode(). + * + * Must match a call to lazy_mmu_mode_pause(). 
+ * + * Has no effect if called: + * - While paused (inside another pause()/resume() pair) + * - In interrupt context + */ static inline void lazy_mmu_mode_resume(void) { + struct lazy_mmu_state *state = ¤t->lazy_mmu_state; + if (in_interrupt()) return; - arch_enter_lazy_mmu_mode(); + VM_WARN_ON_ONCE(state->pause_count == 0); + + if (--state->pause_count == 0 && state->enable_count > 0) + arch_enter_lazy_mmu_mode(); } #else static inline void lazy_mmu_mode_enable(void) {} diff --git a/include/linux/sched.h b/include/linux/sched.h index da0133524d08..6b563d4e68f6 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1419,6 +1419,10 @@ struct task_struct { struct page_frag task_frag; +#ifdef CONFIG_ARCH_HAS_LAZY_MMU_MODE + struct lazy_mmu_state lazy_mmu_state; +#endif + #ifdef CONFIG_TASK_DELAY_ACCT struct task_delay_info *delays; #endif @@ -1702,6 +1706,47 @@ static inline char task_state_to_char(struct task_struct *tsk) return task_index_to_char(task_state_index(tsk)); } +#ifdef CONFIG_ARCH_HAS_LAZY_MMU_MODE +/** + * __task_lazy_mmu_mode_active() - Test the lazy MMU mode state for a task. + * @tsk: The task to check. + * + * Test whether @tsk has its lazy MMU mode state set to active (i.e. enabled + * and not paused). + * + * This function only considers the state saved in task_struct; to test whether + * current actually is in lazy MMU mode, is_lazy_mmu_mode_active() should be + * used instead. + * + * This function is intended for architectures that implement the lazy MMU + * mode; it must not be called from generic code. + */ +static inline bool __task_lazy_mmu_mode_active(struct task_struct *tsk) +{ + struct lazy_mmu_state *state = &tsk->lazy_mmu_state; + + return state->enable_count > 0 && state->pause_count == 0; +} + +/** + * is_lazy_mmu_mode_active() - Test whether we are currently in lazy MMU mode. + * + * Test whether the current context is in lazy MMU mode. This is true if both: + * 1. We are not in interrupt context + * 2. Lazy MMU mode is active for the current task + * + * This function is intended for architectures that implement the lazy MMU + * mode; it must not be called from generic code. + */ +static inline bool is_lazy_mmu_mode_active(void) +{ + if (in_interrupt()) + return false; + + return __task_lazy_mmu_mode_active(current); +} +#endif + extern struct pid *cad_pid; /* -- cgit v1.2.3 From 8e38607aa4aa8ee7ad4058d183465d248d04dca4 Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Tue, 6 Jan 2026 23:20:02 -0800 Subject: treewide: provide a generic clear_user_page() variant Patch series "mm: folio_zero_user: clear page ranges", v11. This series adds clearing of contiguous page ranges for hugepages. The series improves on the current discontiguous clearing approach in two ways: - clear pages in a contiguous fashion. - use batched clearing via clear_pages() wherever exposed. The first is useful because it allows us to make much better use of hardware prefetchers. The second, enables advertising the real extent to the processor. Where specific instructions support it (ex. string instructions on x86; "mops" on arm64 etc), a processor can optimize based on this because, instead of seeing a sequence of 8-byte stores, or a sequence of 4KB pages, it sees a larger unit being operated on. For instance, AMD Zen uarchs (for extents larger than LLC-size) switch to a mode where they start eliding cacheline allocation. 
This is helpful not just because it results in higher bandwidth, but also because now the cache is not evicting useful cachelines and replacing them with zeroes. Demand faulting a 64GB region shows performance improvement: $ perf bench mem mmap -p $pg-sz -f demand -s 64GB -l 5 baseline +series (GBps +- %stdev) (GBps +- %stdev) pg-sz=2MB 11.76 +- 1.10% 25.34 +- 1.18% [*] +115.47% preempt=* pg-sz=1GB 24.85 +- 2.41% 39.22 +- 2.32% + 57.82% preempt=none|voluntary pg-sz=1GB (similar) 52.73 +- 0.20% [#] +112.19% preempt=full|lazy [*] This improvement is because switching to sequential clearing allows the hardware prefetchers to do a much better job. [#] For pg-sz=1GB a large part of the improvement is because of the cacheline elision mentioned above. preempt=full|lazy improves upon that because, not needing explicit invocations of cond_resched() to ensure reasonable preemption latency, it can clear the full extent as a single unit. In comparison the maximum extent used for preempt=none|voluntary is PROCESS_PAGES_NON_PREEMPT_BATCH (32MB). When provided the full extent the processor forgoes allocating cachelines on this path almost entirely. (The hope is that eventually, in the fullness of time, the lazy preemption model will be able to do the same job that none or voluntary models are used for, allowing us to do away with cond_resched().) Raghavendra also tested previous version of the series on AMD Genoa and sees similar improvement [1] with preempt=lazy. $ perf bench mem map -p $page-size -f populate -s 64GB -l 10 base patched change pg-sz=2MB 12.731939 GB/sec 26.304263 GB/sec 106.6% pg-sz=1GB 26.232423 GB/sec 61.174836 GB/sec 133.2% This patch (of 8): Let's drop all variants that effectively map to clear_page() and provide it in a generic variant instead. We'll use the macro clear_user_page to indicate whether an architecture provides it's own variant. Also, clear_user_page() is only called from the generic variant of clear_user_highpage(), so define it only if the architecture does not provide a clear_user_highpage(). And, for simplicity define it in linux/highmem.h. Note that for parisc, clear_page() and clear_user_page() map to clear_page_asm(), so we can just get rid of the custom clear_user_page() implementation. There is a clear_user_page_asm() function on parisc, that seems to be unused. Not sure what's up with that. Link: https://lkml.kernel.org/r/20260107072009.1615991-1-ankur.a.arora@oracle.com Link: https://lkml.kernel.org/r/20260107072009.1615991-2-ankur.a.arora@oracle.com Signed-off-by: David Hildenbrand Co-developed-by: Ankur Arora Signed-off-by: Ankur Arora Cc: Andy Lutomirski Cc: Ankur Arora Cc: "Borislav Petkov (AMD)" Cc: Boris Ostrovsky Cc: David Hildenbrand Cc: "H. Peter Anvin" Cc: Ingo Molnar Cc: Konrad Rzessutek Wilk Cc: Lance Yang Cc: "Liam R. 
Howlett" Cc: Li Zhe Cc: Lorenzo Stoakes Cc: Mateusz Guzik Cc: Matthew Wilcox (Oracle) Cc: Michal Hocko Cc: Mike Rapoport Cc: Peter Zijlstra Cc: Raghavendra K T Cc: Suren Baghdasaryan Cc: Thomas Gleixner Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- include/linux/highmem.h | 24 ++++++++++++++++++++++-- 1 file changed, 22 insertions(+), 2 deletions(-) (limited to 'include') diff --git a/include/linux/highmem.h b/include/linux/highmem.h index abc20f9810fd..393bd51e5a1f 100644 --- a/include/linux/highmem.h +++ b/include/linux/highmem.h @@ -197,15 +197,35 @@ static inline void invalidate_kernel_vmap_range(void *vaddr, int size) } #endif -/* when CONFIG_HIGHMEM is not set these will be plain clear/copy_page */ #ifndef clear_user_highpage +#ifndef clear_user_page +/** + * clear_user_page() - clear a page to be mapped to user space + * @addr: the address of the page + * @vaddr: the address of the user mapping + * @page: the page + * + * We condition the definition of clear_user_page() on the architecture + * not having a custom clear_user_highpage(). That's because if there + * is some special flushing needed for clear_user_highpage() then it + * is likely that clear_user_page() also needs some magic. And, since + * our only caller is the generic clear_user_highpage(), not defining + * is not much of a loss. + */ +static inline void clear_user_page(void *addr, unsigned long vaddr, struct page *page) +{ + clear_page(addr); +} +#endif + +/* when CONFIG_HIGHMEM is not set these will be plain clear/copy_page */ static inline void clear_user_highpage(struct page *page, unsigned long vaddr) { void *addr = kmap_local_page(page); clear_user_page(addr, vaddr, page); kunmap_local(addr); } -#endif +#endif /* clear_user_highpage */ #ifndef vma_alloc_zeroed_movable_folio /** -- cgit v1.2.3 From 62a9f5a85b98d6d2d9b5e0d67b2d4e5903bc53ec Mon Sep 17 00:00:00 2001 From: Ankur Arora Date: Tue, 6 Jan 2026 23:20:03 -0800 Subject: mm: introduce clear_pages() and clear_user_pages() Introduce clear_pages(), to be overridden by architectures that support more efficient clearing of consecutive pages. Also introduce clear_user_pages(), however, we will not expect this function to be overridden anytime soon. As we do for clear_user_page(), define clear_user_pages() only if the architecture does not define clear_user_highpage(). That is because if the architecture does define clear_user_highpage(), then it likely needs some flushing magic when clearing user pages or highpages. This means we can get away without defining clear_user_pages(), since, much like its single page sibling, its only potential user is the generic clear_user_highpages() which should instead be using clear_user_highpage(). Link: https://lkml.kernel.org/r/20260107072009.1615991-3-ankur.a.arora@oracle.com Signed-off-by: Ankur Arora Acked-by: David Hildenbrand (Red Hat) Cc: Andy Lutomirski Cc: "Borislav Petkov (AMD)" Cc: Boris Ostrovsky Cc: "H. Peter Anvin" Cc: Ingo Molnar Cc: Konrad Rzessutek Wilk Cc: Lance Yang Cc: "Liam R. 
Howlett" Cc: Li Zhe Cc: Lorenzo Stoakes Cc: Mateusz Guzik Cc: Matthew Wilcox (Oracle) Cc: Michal Hocko Cc: Mike Rapoport Cc: Peter Zijlstra Cc: Raghavendra K T Cc: Suren Baghdasaryan Cc: Thomas Gleixner Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- include/linux/highmem.h | 33 +++++++++++++++++++++++++++++++++ include/linux/mm.h | 20 ++++++++++++++++++++ 2 files changed, 53 insertions(+) (limited to 'include') diff --git a/include/linux/highmem.h b/include/linux/highmem.h index 393bd51e5a1f..019ab7d8c841 100644 --- a/include/linux/highmem.h +++ b/include/linux/highmem.h @@ -218,6 +218,39 @@ static inline void clear_user_page(void *addr, unsigned long vaddr, struct page } #endif +/** + * clear_user_pages() - clear a page range to be mapped to user space + * @addr: start address + * @vaddr: start address of the user mapping + * @page: start page + * @npages: number of pages + * + * Assumes that the region (@addr, +@npages) has been validated + * already so this does no exception handling. + * + * If the architecture provides a clear_user_page(), use that; + * otherwise, we can safely use clear_pages(). + */ +static inline void clear_user_pages(void *addr, unsigned long vaddr, + struct page *page, unsigned int npages) +{ + +#ifdef clear_user_page + do { + clear_user_page(addr, vaddr, page); + addr += PAGE_SIZE; + vaddr += PAGE_SIZE; + page++; + } while (--npages); +#else + /* + * Prefer clear_pages() to allow for architectural optimizations + * when operating on contiguous page ranges. + */ + clear_pages(addr, npages); +#endif +} + /* when CONFIG_HIGHMEM is not set these will be plain clear/copy_page */ static inline void clear_user_highpage(struct page *page, unsigned long vaddr) { diff --git a/include/linux/mm.h b/include/linux/mm.h index f0d5be9dc736..d78e294698b0 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -4198,6 +4198,26 @@ static inline void clear_page_guard(struct zone *zone, struct page *page, unsigned int order) {} #endif /* CONFIG_DEBUG_PAGEALLOC */ +#ifndef clear_pages +/** + * clear_pages() - clear a page range for kernel-internal use. + * @addr: start address + * @npages: number of pages + * + * Use clear_user_pages() instead when clearing a page range to be + * mapped to user space. + * + * Does absolutely no exception handling. + */ +static inline void clear_pages(void *addr, unsigned int npages) +{ + do { + clear_page(addr); + addr += PAGE_SIZE; + } while (--npages); +} +#endif + #ifdef __HAVE_ARCH_GATE_AREA extern struct vm_area_struct *get_gate_vma(struct mm_struct *mm); extern int in_gate_area_no_mm(unsigned long addr); -- cgit v1.2.3 From 8d846b723e5723d98d859df9feeab89c2c889fb2 Mon Sep 17 00:00:00 2001 From: Ankur Arora Date: Tue, 6 Jan 2026 23:20:04 -0800 Subject: highmem: introduce clear_user_highpages() Define clear_user_highpages() which uses the range clearing primitive, clear_user_pages(). We can safely use this when CONFIG_HIGHMEM is disabled and if the architecture does not have clear_user_highpage. The first is needed to ensure that contiguous page ranges stay contiguous which precludes intermediate maps via HIGMEM. The second, because if the architecture has clear_user_highpage(), it likely needs flushing magic when clearing the page, magic that we aren't privy to. For both of those cases, just fallback to a loop around clear_user_highpage(). 
Link: https://lkml.kernel.org/r/20260107072009.1615991-4-ankur.a.arora@oracle.com Signed-off-by: Ankur Arora Acked-by: David Hildenbrand (Red Hat) Cc: Andy Lutomirski Cc: "Borislav Petkov (AMD)" Cc: Boris Ostrovsky Cc: "H. Peter Anvin" Cc: Ingo Molnar Cc: Konrad Rzessutek Wilk Cc: Lance Yang Cc: "Liam R. Howlett" Cc: Li Zhe Cc: Lorenzo Stoakes Cc: Mateusz Guzik Cc: Matthew Wilcox (Oracle) Cc: Michal Hocko Cc: Mike Rapoport Cc: Peter Zijlstra Cc: Raghavendra K T Cc: Suren Baghdasaryan Cc: Thomas Gleixner Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- include/linux/highmem.h | 45 ++++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 44 insertions(+), 1 deletion(-) (limited to 'include') diff --git a/include/linux/highmem.h b/include/linux/highmem.h index 019ab7d8c841..af03db851a1d 100644 --- a/include/linux/highmem.h +++ b/include/linux/highmem.h @@ -251,7 +251,14 @@ static inline void clear_user_pages(void *addr, unsigned long vaddr, #endif } -/* when CONFIG_HIGHMEM is not set these will be plain clear/copy_page */ +/** + * clear_user_highpage() - clear a page to be mapped to user space + * @page: start page + * @vaddr: start address of the user mapping + * + * With !CONFIG_HIGHMEM this (and the copy_user_highpage() below) will + * be plain clear_user_page() (and copy_user_page()). + */ static inline void clear_user_highpage(struct page *page, unsigned long vaddr) { void *addr = kmap_local_page(page); @@ -260,6 +267,42 @@ static inline void clear_user_highpage(struct page *page, unsigned long vaddr) } #endif /* clear_user_highpage */ +/** + * clear_user_highpages() - clear a page range to be mapped to user space + * @page: start page + * @vaddr: start address of the user mapping + * @npages: number of pages + * + * Assumes that all the pages in the region (@page, +@npages) are valid + * so this does no exception handling. + */ +static inline void clear_user_highpages(struct page *page, unsigned long vaddr, + unsigned int npages) +{ + +#if defined(clear_user_highpage) || defined(CONFIG_HIGHMEM) + /* + * An architecture defined clear_user_highpage() implies special + * handling is needed. + * + * So we use that or, the generic variant if CONFIG_HIGHMEM is + * enabled. + */ + do { + clear_user_highpage(page, vaddr); + vaddr += PAGE_SIZE; + page++; + } while (--npages); +#else + + /* + * Prefer clear_user_pages() to allow for architectural optimizations + * when operating on contiguous page ranges. + */ + clear_user_pages(page_address(page), vaddr, page, npages); +#endif +} + #ifndef vma_alloc_zeroed_movable_folio /** * vma_alloc_zeroed_movable_folio - Allocate a zeroed page for a VMA. -- cgit v1.2.3 From 94962b2628e6af2c48be6ebdf9f76add28d60ecc Mon Sep 17 00:00:00 2001 From: Ankur Arora Date: Tue, 6 Jan 2026 23:20:08 -0800 Subject: mm: folio_zero_user: clear page ranges Use batch clearing in clear_contig_highpages() instead of clearing a single page at a time. Exposing larger ranges enables the processor to optimize based on extent. To do this we just switch to using clear_user_highpages() which would in turn use clear_user_pages() or clear_pages(). Batched clearing, when running under non-preemptible models, however, has latency considerations. In particular, we need periodic invocations of cond_resched() to keep to reasonable preemption latencies. This is a problem because the clearing primitives do not, or might not be able to, call cond_resched() to check if preemption is needed. 
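The batching described in the next paragraph amounts to something like the following sketch (illustrative only; the real change lives outside include/ and is not shown in this include/-limited digest):

static void example_clear_batched(struct page *page, unsigned long vaddr,
                                  unsigned int npages)
{
        while (npages) {
                unsigned int n = min_t(unsigned int, npages,
                                       PROCESS_PAGES_NON_PREEMPT_BATCH);

                clear_user_highpages(page, vaddr, n);
                page += n;
                vaddr += n * PAGE_SIZE;
                npages -= n;
                /* Bound preemption latency under non-preemptible models. */
                cond_resched();
        }
}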
So, limit the worst case preemption latency by doing the clearing in units of no more than PROCESS_PAGES_NON_PREEMPT_BATCH pages. (Preemptible models already define away most of cond_resched(), so the batch size is ignored when running under those.) PROCESS_PAGES_NON_PREEMPT_BATCH: for architectures with "fast" clear-pages (ones that define clear_pages()), we define it as 32MB worth of pages. This is meant to be large enough to allow the processor to optimize the operation and yet small enough that we see reasonable preemption latency for when this optimization is not possible (ex. slow microarchitectures, memory bandwidth saturation.) This specific value also allows for a cacheline allocation elision optimization (which might help unrelated applications by not evicting potentially useful cache lines) that kicks in recent generations of AMD Zen processors at around LLC-size (32MB is a typical size). At the same time 32MB is small enough that even with poor clearing bandwidth (say ~10GBps), time to clear 32MB should be well below the scheduler's default warning threshold (sysctl_resched_latency_warn_ms=100). "Slow" architectures (don't have clear_pages()) will continue to use the base value (single page). Performance == Testing a demand fault workload shows a decent improvement in bandwidth with pg-sz=1GB. Bandwidth with pg-sz=2MB stays flat. $ perf bench mem mmap -p $pg-sz -f demand -s 64GB -l 5 contiguous-pages batched-pages (GBps +- %stdev) (GBps +- %stdev) pg-sz=2MB 23.58 +- 1.95% 25.34 +- 1.18% + 7.50% preempt=* pg-sz=1GB 25.09 +- 0.79% 39.22 +- 2.32% + 56.31% preempt=none|voluntary pg-sz=1GB 25.71 +- 0.03% 52.73 +- 0.20% [#] +110.16% preempt=full|lazy [#] We perform much better with preempt=full|lazy because, not needing explicit invocations of cond_resched() we can clear the full extent (pg-sz=1GB) as a single unit which the processor can optimize for. (Unless otherwise noted, all numbers are on AMD Genoa (EPYC 9J13); region-size=64GB, local node; 2.56 GHz, boost=0.) Analysis == pg-sz=1GB: the improvement we see falls in two buckets depending on the batch size in use. For batch-size=32MB the number of cachelines allocated (L1-dcache-loads) -- which stay relatively flat for smaller batches, start to drop off because cacheline allocation elision kicks in. And as can be seen below, at batch-size=1GB, we stop allocating cachelines almost entirely. (Not visible here but from testing with intermediate sizes, the allocation change kicks in only at batch-size=32MB and ramps up from there.) 
contigous-pages 6,949,417,798 L1-dcache-loads # 883.599 M/sec ( +- 0.01% ) (35.75%) 3,226,709,573 L1-dcache-load-misses # 46.43% of all L1-dcache accesses ( +- 0.05% ) (35.75%) batched,32MB 2,290,365,772 L1-dcache-loads # 471.171 M/sec ( +- 0.36% ) (35.72%) 1,144,426,272 L1-dcache-load-misses # 49.97% of all L1-dcache accesses ( +- 0.58% ) (35.70%) batched,1GB 63,914,157 L1-dcache-loads # 17.464 M/sec ( +- 8.08% ) (35.73%) 22,074,367 L1-dcache-load-misses # 34.54% of all L1-dcache accesses ( +- 16.70% ) (35.70%) The dropoff is also visible in L2 prefetch hits (miss numbers are on similar lines): contiguous-pages 3,464,861,312 l2_pf_hit_l2.all # 437.722 M/sec ( +- 0.74% ) (15.69%) batched,32MB 883,750,087 l2_pf_hit_l2.all # 181.223 M/sec ( +- 1.18% ) (15.71%) batched,1GB 8,967,943 l2_pf_hit_l2.all # 2.450 M/sec ( +- 17.92% ) (15.77%) This largely decouples the frontend from the backend since the clearing operation does not need to wait on loads from memory (we still need cacheline ownership but that's a shorter path). This is most visible if we rerun the test above with (boost=1, 3.66 GHz). $ perf bench mem mmap -p $pg-sz -f demand -s 64GB -l 5 contiguous-pages batched-pages (GBps +- %stdev) (GBps +- %stdev) pg-sz=2MB 26.08 +- 1.72% 26.13 +- 0.92% - preempt=* pg-sz=1GB 26.99 +- 0.62% 48.85 +- 2.19% + 80.99% preempt=none|voluntary pg-sz=1GB 27.69 +- 0.18% 75.18 +- 0.25% +171.50% preempt=full|lazy Comparing the batched-pages numbers from the boost=0 ones and these: for a clock-speed gain of 42% we gain 24.5% for batch-size=32MB and 42.5% for batch-size=1GB. In comparison the baseline contiguous-pages case and both the pg-sz=2MB ones are largely backend bound so gain no more than ~10%. Other platforms tested, Intel Icelakex (Oracle X9) and ARM64 Neoverse-N1 (Ampere Altra) both show an improvement of ~35% for pg-sz=2MB|1GB. The first goes from around 8GBps to 11GBps and the second from 32GBps to 44 GBPs. [ankur.a.arora@oracle.com: move the unit computation and make it a const Link: https://lkml.kernel.org/r/20260108060406.1693853-1-ankur.a.arora@oracle.com Link: https://lkml.kernel.org/r/20260107072009.1615991-8-ankur.a.arora@oracle.com Signed-off-by: Ankur Arora Acked-by: David Hildenbrand (Red Hat) Cc: Andy Lutomirski Cc: "Borislav Petkov (AMD)" Cc: Boris Ostrovsky Cc: "H. Peter Anvin" Cc: Ingo Molnar Cc: Konrad Rzessutek Wilk Cc: Lance Yang Cc: "Liam R. Howlett" Cc: Li Zhe Cc: Lorenzo Stoakes Cc: Mateusz Guzik Cc: Matthew Wilcox (Oracle) Cc: Michal Hocko Cc: Mike Rapoport Cc: Peter Zijlstra Cc: Raghavendra K T Cc: Suren Baghdasaryan Cc: Thomas Gleixner Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- include/linux/mm.h | 35 +++++++++++++++++++++++++++++++++++ 1 file changed, 35 insertions(+) (limited to 'include') diff --git a/include/linux/mm.h b/include/linux/mm.h index d78e294698b0..ab2e7e30aef9 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -4208,6 +4208,15 @@ static inline void clear_page_guard(struct zone *zone, struct page *page, * mapped to user space. * * Does absolutely no exception handling. + * + * Note that even though the clearing operation is preemptible, clear_pages() + * does not (and on architectures where it reduces to a few long-running + * instructions, might not be able to) call cond_resched() to check if + * rescheduling is required. + * + * When running under preemptible models this is not a problem. Under + * cooperatively scheduled models, however, the caller is expected to + * limit @npages to no more than PROCESS_PAGES_NON_PREEMPT_BATCH. 
*/ static inline void clear_pages(void *addr, unsigned int npages) { @@ -4218,6 +4227,32 @@ static inline void clear_pages(void *addr, unsigned int npages) } #endif +#ifndef PROCESS_PAGES_NON_PREEMPT_BATCH +#ifdef clear_pages +/* + * The architecture defines clear_pages(), and we assume that it is + * generally "fast". So choose a batch size large enough to allow the processor + * headroom for optimizing the operation and yet small enough that we see + * reasonable preemption latency for when this optimization is not possible + * (ex. slow microarchitectures, memory bandwidth saturation.) + * + * With a value of 32MB and assuming a memory bandwidth of ~10GBps, this should + * result in worst case preemption latency of around 3ms when clearing pages. + * + * (See comment above clear_pages() for why preemption latency is a concern + * here.) + */ +#define PROCESS_PAGES_NON_PREEMPT_BATCH (SZ_32M >> PAGE_SHIFT) +#else /* !clear_pages */ +/* + * The architecture does not provide a clear_pages() implementation. Assume + * that clear_page() -- which clear_pages() will fallback to -- is relatively + * slow and choose a small value for PROCESS_PAGES_NON_PREEMPT_BATCH. + */ +#define PROCESS_PAGES_NON_PREEMPT_BATCH 1 +#endif +#endif + #ifdef __HAVE_ARCH_GATE_AREA extern struct vm_area_struct *get_gate_vma(struct mm_struct *mm); extern int in_gate_area_no_mm(unsigned long addr); -- cgit v1.2.3 From 055059ed720ec7546d2bf7122d858814a9f84741 Mon Sep 17 00:00:00 2001 From: Chen Ridong Date: Thu, 11 Dec 2025 01:30:19 +0000 Subject: memcg: remove mem_cgroup_size() MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The mem_cgroup_size helper is used only in apply_proportional_protection to read the current memory usage. Its semantics are unclear and inconsistent with other sites, which directly call page_counter_read for the same purpose. Remove this helper and get its usage via mem_cgroup_protection for clarity. Additionally, rename the local variable 'cgroup_size' to 'usage' to better reflect its meaning. No functional changes intended. Link: https://lkml.kernel.org/r/20251211013019.2080004-3-chenridong@huaweicloud.com Signed-off-by: Chen Ridong Acked-by: Johannes Weiner Acked-by: Michal Hocko Acked-by: Shakeel Butt Cc: Michal Koutný Cc: Axel Rasmussen Cc: Lorenzo Stoakes Cc: Lu Jialin Cc: Muchun Song Cc: Qi Zheng Cc: Roman Gushchin Cc: Wei Xu Cc: Yuanchu Xie Signed-off-by: Andrew Morton --- include/linux/memcontrol.h | 18 +++++++----------- 1 file changed, 7 insertions(+), 11 deletions(-) (limited to 'include') diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 0651865a4564..25908ba30700 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -557,13 +557,15 @@ static inline bool mem_cgroup_disabled(void) static inline void mem_cgroup_protection(struct mem_cgroup *root, struct mem_cgroup *memcg, unsigned long *min, - unsigned long *low) + unsigned long *low, + unsigned long *usage) { - *min = *low = 0; + *min = *low = *usage = 0; if (mem_cgroup_disabled()) return; + *usage = page_counter_read(&memcg->memory); /* * There is no reclaim protection applied to a targeted reclaim. 
* We are special casing this specific case here because @@ -919,8 +921,6 @@ static inline void mem_cgroup_handle_over_high(gfp_t gfp_mask) unsigned long mem_cgroup_get_max(struct mem_cgroup *memcg); -unsigned long mem_cgroup_size(struct mem_cgroup *memcg); - void mem_cgroup_print_oom_context(struct mem_cgroup *memcg, struct task_struct *p); @@ -1102,9 +1102,10 @@ static inline void memcg_memory_event_mm(struct mm_struct *mm, static inline void mem_cgroup_protection(struct mem_cgroup *root, struct mem_cgroup *memcg, unsigned long *min, - unsigned long *low) + unsigned long *low, + unsigned long *usage) { - *min = *low = 0; + *min = *low = *usage = 0; } static inline void mem_cgroup_calculate_protection(struct mem_cgroup *root, @@ -1328,11 +1329,6 @@ static inline unsigned long mem_cgroup_get_max(struct mem_cgroup *memcg) return 0; } -static inline unsigned long mem_cgroup_size(struct mem_cgroup *memcg) -{ - return 0; -} - static inline void mem_cgroup_print_oom_context(struct mem_cgroup *memcg, struct task_struct *p) { -- cgit v1.2.3 From 16cc8b9396f6d63c1331059d67626cf907a7f23c Mon Sep 17 00:00:00 2001 From: Johannes Weiner Date: Wed, 10 Dec 2025 10:43:01 -0500 Subject: mm: memcontrol: rename mem_cgroup_from_slab_obj() In addition to slab objects, this function is used for resolving non-slab kernel pointers. This has caused confusion in recent refactoring work. Rename it to mem_cgroup_from_virt(), sticking with terminology established by the virt_to_() converters. Link: https://lore.kernel.org/linux-mm/20251113161424.GB3465062@cmpxchg.org/ Link: https://lkml.kernel.org/r/20251210154301.720133-1-hannes@cmpxchg.org Signed-off-by: Johannes Weiner Acked-by: Roman Gushchin Reviewed-by: Anshuman Khandual Acked-by: Vlastimil Babka Acked-by: Shakeel Butt Cc: Matthew Wilcox (Oracle) Cc: Michal Hocko Cc: Muchun Song Signed-off-by: Andrew Morton --- include/linux/memcontrol.h | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'include') diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 25908ba30700..fd400082313a 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -1723,7 +1723,7 @@ static inline int memcg_kmem_id(struct mem_cgroup *memcg) return memcg ? memcg->kmemcg_id : -1; } -struct mem_cgroup *mem_cgroup_from_slab_obj(void *p); +struct mem_cgroup *mem_cgroup_from_virt(void *p); static inline void count_objcg_events(struct obj_cgroup *objcg, enum vm_event_item idx, @@ -1795,7 +1795,7 @@ static inline int memcg_kmem_id(struct mem_cgroup *memcg) return -1; } -static inline struct mem_cgroup *mem_cgroup_from_slab_obj(void *p) +static inline struct mem_cgroup *mem_cgroup_from_virt(void *p) { return NULL; } -- cgit v1.2.3 From 4a6ceb7c9744c69546d4ca43b7bd308f4db0927b Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Tue, 16 Dec 2025 00:01:14 -0800 Subject: mm/damon/core: introduce nr_snapshots damos stat Patch series "mm/damon: introduce {,max_}nr_snapshots and tracepoint for damos stats". Introduce three changes for improving DAMOS stat's provided information, deterministic control, and reading usability. DAMOS provides stats that are important for understanding its behavior. It lacks information about how many DAMON-generated monitoring output snapshots it has worked on. Add a new stat, nr_snapshots, to show the information. Users can control DAMOS schemes in multiple ways. Using the online parameters commit feature, they can install and uninstall DAMOS schemes whenever they want while keeping DAMON runs. 
DAMOS quotas and watermarks can be used for manually or automatically turning on/off or adjusting the aggressiveness of the scheme. DAMOS filters can be used for applying the scheme to specific memory entities based on their types and locations. Some users want their DAMOS scheme to be applied to only specific number of DAMON snapshots, for more deterministic control. One example use case is tracepoint based snapshot reading. Add a new knob, max_nr_snapshots, to support this. If the nr_snapshots parameter becomes same to or greater than the value of this parameter, the scheme is deactivated. Users can read DAMOS stats via DAMON's sysfs interface. For deep level investigations on environments having advanced tools like perf and bpftrace, exposing the stats via a tracepoint can be useful. Implement a new tracepoint, namely damon:damos_stat_after_apply_interval. First five patches (patches 1-5) of this series implement the new stat, nr_snapshots, on the core layer (patch 1), expose on DAMON sysfs user interface (patch 2), and update documents (patches 3-5). Following six patches (patches 6-11) are for the new stat based DAMOS deactivation (max_nr_snapshots). The first one (patch 6) of this group updates a kernel-doc comment before making further changes. Then an implementation of it on the core layer (patch 7), an introduction of a new DAMON sysfs interface file for users of the feature (patch 8), and three updates of the documents (patches 9-11) follow. The final one (patch 12) introduces the new tracepoint that exposes the DAMOS stat values for each scheme apply interval. This patch (of 12): DAMON generates monitoring results snapshots for every sampling interval. DAMOS applies given schemes on the regions of the snapshots, for every apply interval of the scheme. DAMOS stat informs a given scheme has tried to how many memory entities and applied, in the region and byte level. In some use cases including user-space oriented tuning and investigations, it is useful to know that in the DAMON-snapshot level. Introduce a new stat, namely nr_snapshots for DAMON core API callers. [sj@kernel.org: fix wrong list_is_last() call in damons_is_last_region()] Link: https://lkml.kernel.org/r/20260114152049.99727-1-sj@kernel.org Link: https://lkml.kernel.org/r/20251216080128.42991-1-sj@kernel.org Link: https://lkml.kernel.org/r/20251216080128.42991-2-sj@kernel.org Signed-off-by: SeongJae Park Cc: Jonathan Corbet Cc: Liam Howlett Cc: Lorenzo Stoakes Cc: "Masami Hiramatsu (Google)" Cc: Mathieu Desnoyers Cc: Michal Hocko Cc: Mike Rapoport Cc: Steven Rostedt Cc: Suren Baghdasaryan Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- include/linux/damon.h | 3 +++ 1 file changed, 3 insertions(+) (limited to 'include') diff --git a/include/linux/damon.h b/include/linux/damon.h index 3813373a9200..1d8a1515e75a 100644 --- a/include/linux/damon.h +++ b/include/linux/damon.h @@ -330,6 +330,8 @@ struct damos_watermarks { * @sz_ops_filter_passed: * Total bytes that passed ops layer-handled DAMOS filters. * @qt_exceeds: Total number of times the quota of the scheme has exceeded. + * @nr_snapshots: + * Total number of DAMON snapshots that the scheme has tried. * * "Tried an action to a region" in this context means the DAMOS core logic * determined the region as eligible to apply the action. 
The access pattern @@ -355,6 +357,7 @@ struct damos_stat { unsigned long sz_applied; unsigned long sz_ops_filter_passed; unsigned long qt_exceeds; + unsigned long nr_snapshots; }; /** -- cgit v1.2.3 From ccaa2d062a35add92832e8f082b8e00eed3f6efd Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Tue, 16 Dec 2025 00:01:19 -0800 Subject: mm/damon: update damos kerneldoc for stat field Commit 0e92c2ee9f45 ("mm/damon/schemes: account scheme actions that successfully applied") has replaced ->stat_count and ->stat_sz of 'struct damos' with ->stat. The commit mistakenly did not update the related kernel doc comment, though. Update the comment. Link: https://lkml.kernel.org/r/20251216080128.42991-7-sj@kernel.org Signed-off-by: SeongJae Park Cc: Jonathan Corbet Cc: Liam Howlett Cc: Lorenzo Stoakes Cc: "Masami Hiramatsu (Google)" Cc: Mathieu Desnoyers Cc: Michal Hocko Cc: Mike Rapoport Cc: Steven Rostedt Cc: Suren Baghdasaryan Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- include/linux/damon.h | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) (limited to 'include') diff --git a/include/linux/damon.h b/include/linux/damon.h index 1d8a1515e75a..43dfbfe2292f 100644 --- a/include/linux/damon.h +++ b/include/linux/damon.h @@ -532,9 +532,7 @@ struct damos_migrate_dests { * unsets @last_applied when each regions walking for applying the scheme is * finished. * - * After applying the &action to each region, &stat_count and &stat_sz is - * updated to reflect the number of regions and total size of regions that the - * &action is applied. + * After applying the &action to each region, &stat is updated. */ struct damos { struct damos_access_pattern pattern; -- cgit v1.2.3 From 84e425c68e6061751adecd2d328789e4f67eac1e Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Tue, 16 Dec 2025 00:01:20 -0800 Subject: mm/damon/core: implement max_nr_snapshots There are DAMOS use cases that require user-space centric control of its activation and deactivation. Having the control plane on the user-space, or using DAMOS as a way for monitoring results collection are such examples. DAMON parameters online commit, DAMOS quotas and watermarks can be useful for this purpose. However, those features work only at the sub-DAMON-snapshot level. In some use cases, the DAMON-snapshot level control is required. For example, in DAMOS-based monitoring results collection use case, the user online-installs a DAMOS scheme with DAMOS_STAT action, wait it be applied to whole regions of a single DAMON-snapshot, retrieves the stats and tried regions information, and online-uninstall the scheme. It is efficient to ensure the lifetime of the scheme as no more no less one snapshot consumption. To support such use cases, introduce a new DAMOS core API per-scheme parameter, namely max_nr_snapshots. As the name implies, it is the upper limit of nr_snapshots, which is a DAMOS stat that represents the number of DAMON-snapshots that the scheme has fully applied. If the limit is set with a non-zero value and nr_snapshots reaches or exceeds the limit, the scheme is deactivated. 
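A minimal sketch of the resulting check (the helper name is hypothetical; the actual deactivation logic is in mm/damon/core.c, outside this include/-limited diff):

static bool example_damos_limit_reached(struct damos *s)
{
        /* max_nr_snapshots == 0 means no limit. */
        return s->max_nr_snapshots &&
               s->stat.nr_snapshots >= s->max_nr_snapshots;
}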
Link: https://lkml.kernel.org/r/20251216080128.42991-8-sj@kernel.org Signed-off-by: SeongJae Park Cc: Jonathan Corbet Cc: Liam Howlett Cc: Lorenzo Stoakes Cc: "Masami Hiramatsu (Google)" Cc: Mathieu Desnoyers Cc: Michal Hocko Cc: Mike Rapoport Cc: Steven Rostedt Cc: Suren Baghdasaryan Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- include/linux/damon.h | 5 +++++ 1 file changed, 5 insertions(+) (limited to 'include') diff --git a/include/linux/damon.h b/include/linux/damon.h index 43dfbfe2292f..a67292a2f09d 100644 --- a/include/linux/damon.h +++ b/include/linux/damon.h @@ -499,6 +499,7 @@ struct damos_migrate_dests { * @ops_filters: ops layer handling &struct damos_filter objects list. * @last_applied: Last @action applied ops-managing entity. * @stat: Statistics of this scheme. + * @max_nr_snapshots: Upper limit of nr_snapshots stat. * @list: List head for siblings. * * For each @apply_interval_us, DAMON finds regions which fit in the @@ -533,6 +534,9 @@ struct damos_migrate_dests { * finished. * * After applying the &action to each region, &stat is updated. + * + * If &max_nr_snapshots is set as non-zero and &stat.nr_snapshots be same to or + * greater than it, the scheme is deactivated. */ struct damos { struct damos_access_pattern pattern; @@ -567,6 +571,7 @@ struct damos { struct list_head ops_filters; void *last_applied; struct damos_stat stat; + unsigned long max_nr_snapshots; struct list_head list; }; -- cgit v1.2.3 From 804c26b961da295bd70c86a3c9dc4bea0b09de88 Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Tue, 16 Dec 2025 00:01:25 -0800 Subject: mm/damon/core: add trace point for damos stat per apply interval DAMON users can read DAMOS stats via DAMON sysfs interface. It enables efficient, simple and flexible usages of the stats. Especially for systems not having advanced tools like perf or bpftrace, that can be useful. But if the advanced tools are available, exposing the stats via tracepoint can reduce unnecessary reimplementation of the wheels. Add a new tracepoint for DAMOS stats, namely damos_stat_after_apply_interval. The tracepoint is triggered for each scheme's apply interval and exposes the whole stat values. If the user needs sub-apply interval information for any chance, damos_before_apply tracepoint could be used. 
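The call site lives in mm/damon/core.c (per the subject) and is therefore not part of this include/-limited diff; emitting the stats once per apply interval amounts to something like the following sketch (the indexes and scheme pointer are placeholders):

static void example_emit_damos_stat(unsigned int context_idx,
                                    unsigned int scheme_idx, struct damos *s)
{
        trace_damos_stat_after_apply_interval(context_idx, scheme_idx, &s->stat);
}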
Link: https://lkml.kernel.org/r/20251216080128.42991-13-sj@kernel.org Signed-off-by: SeongJae Park Reviewed-by: Steven Rostedt (Google) Cc: Jonathan Corbet Cc: Liam Howlett Cc: Lorenzo Stoakes Cc: "Masami Hiramatsu (Google)" Cc: Mathieu Desnoyers Cc: Michal Hocko Cc: Mike Rapoport Cc: Suren Baghdasaryan Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- include/trace/events/damon.h | 41 +++++++++++++++++++++++++++++++++++++++++ 1 file changed, 41 insertions(+) (limited to 'include') diff --git a/include/trace/events/damon.h b/include/trace/events/damon.h index 852d725afea2..24fc402ab3c8 100644 --- a/include/trace/events/damon.h +++ b/include/trace/events/damon.h @@ -9,6 +9,47 @@ #include #include +TRACE_EVENT(damos_stat_after_apply_interval, + + TP_PROTO(unsigned int context_idx, unsigned int scheme_idx, + struct damos_stat *stat), + + TP_ARGS(context_idx, scheme_idx, stat), + + TP_STRUCT__entry( + __field(unsigned int, context_idx) + __field(unsigned int, scheme_idx) + __field(unsigned long, nr_tried) + __field(unsigned long, sz_tried) + __field(unsigned long, nr_applied) + __field(unsigned long, sz_applied) + __field(unsigned long, sz_ops_filter_passed) + __field(unsigned long, qt_exceeds) + __field(unsigned long, nr_snapshots) + ), + + TP_fast_assign( + __entry->context_idx = context_idx; + __entry->scheme_idx = scheme_idx; + __entry->nr_tried = stat->nr_tried; + __entry->sz_tried = stat->sz_tried; + __entry->nr_applied = stat->nr_applied; + __entry->sz_applied = stat->sz_applied; + __entry->sz_ops_filter_passed = stat->sz_ops_filter_passed; + __entry->qt_exceeds = stat->qt_exceeds; + __entry->nr_snapshots = stat->nr_snapshots; + ), + + TP_printk("ctx_idx=%u scheme_idx=%u nr_tried=%lu sz_tried=%lu " + "nr_applied=%lu sz_tried=%lu sz_ops_filter_passed=%lu " + "qt_exceeds=%lu nr_snapshots=%lu", + __entry->context_idx, __entry->scheme_idx, + __entry->nr_tried, __entry->sz_tried, + __entry->nr_applied, __entry->sz_applied, + __entry->sz_ops_filter_passed, __entry->qt_exceeds, + __entry->nr_snapshots) +); + TRACE_EVENT(damos_esz, TP_PROTO(unsigned int context_idx, unsigned int scheme_idx, -- cgit v1.2.3 From 64dd89ae01f2708a508e028c28b7906e4702a9a7 Mon Sep 17 00:00:00 2001 From: Johannes Weiner Date: Mon, 15 Dec 2025 12:57:53 -0500 Subject: mm/block/fs: remove laptop_mode Laptop mode was introduced to save battery, by delaying and consolidating writes and thereby maximize the time rotating hard drives wouldn't have to spin. Luckily, rotating hard drives, with their high spin-up times and power draw, are a thing of the past for battery-powered devices. Reclaim has also since changed to not write single filesystem pages anymore, and regular filesystem writeback is lumpy by design. The juice doesn't appear worth the squeeze anymore. The footprint of the feature is small, but nevertheless it's a complicating factor in mm, block, filesystems. Developers don't think about it, and it likely hasn't been tested with new reclaim and writeback changes in years. Let's sunset it. Keep the sysctl with a deprecation warning around for a few more cycles, but remove all functionality behind it. 
[akpm@linux-foundation.org: fix Documentation/admin-guide/laptops/index.rst] Link: https://lkml.kernel.org/r/20251216185201.GH905277@cmpxchg.org Signed-off-by: Johannes Weiner Suggested-by: Christoph Hellwig Reviewed-by: Christoph Hellwig Acked-by: Jens Axboe Reviewed-by: Shakeel Butt Acked-by: Michal Hocko Cc: Deepanshu Kartikey Signed-off-by: Andrew Morton --- include/linux/backing-dev-defs.h | 3 --- include/linux/writeback.h | 4 ---- include/trace/events/writeback.h | 1 - include/uapi/linux/sysctl.h | 2 +- 4 files changed, 1 insertion(+), 9 deletions(-) (limited to 'include') diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h index 0217c1073735..c88fd4d37d1f 100644 --- a/include/linux/backing-dev-defs.h +++ b/include/linux/backing-dev-defs.h @@ -46,7 +46,6 @@ enum wb_reason { WB_REASON_VMSCAN, WB_REASON_SYNC, WB_REASON_PERIODIC, - WB_REASON_LAPTOP_TIMER, WB_REASON_FS_FREE_SPACE, /* * There is no bdi forker thread any more and works are done @@ -204,8 +203,6 @@ struct backing_dev_info { char dev_name[64]; struct device *owner; - struct timer_list laptop_mode_wb_timer; - #ifdef CONFIG_DEBUG_FS struct dentry *debug_dir; #endif diff --git a/include/linux/writeback.h b/include/linux/writeback.h index f48e8ccffe81..e530112c4b3a 100644 --- a/include/linux/writeback.h +++ b/include/linux/writeback.h @@ -328,9 +328,6 @@ struct dirty_throttle_control { bool dirty_exceeded; }; -void laptop_io_completion(struct backing_dev_info *info); -void laptop_sync_completion(void); -void laptop_mode_timer_fn(struct timer_list *t); bool node_dirty_ok(struct pglist_data *pgdat); int wb_domain_init(struct wb_domain *dom, gfp_t gfp); #ifdef CONFIG_CGROUP_WRITEBACK @@ -342,7 +339,6 @@ extern struct wb_domain global_wb_domain; /* These are exported to sysctl. */ extern unsigned int dirty_writeback_interval; extern unsigned int dirty_expire_interval; -extern int laptop_mode; void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty); unsigned long wb_calc_thresh(struct bdi_writeback *wb, unsigned long thresh); diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h index 311a341e6fe4..b6f94e97788a 100644 --- a/include/trace/events/writeback.h +++ b/include/trace/events/writeback.h @@ -42,7 +42,6 @@ EM( WB_REASON_VMSCAN, "vmscan") \ EM( WB_REASON_SYNC, "sync") \ EM( WB_REASON_PERIODIC, "periodic") \ - EM( WB_REASON_LAPTOP_TIMER, "laptop_timer") \ EM( WB_REASON_FS_FREE_SPACE, "fs_free_space") \ EM( WB_REASON_FORKER_THREAD, "forker_thread") \ EMe(WB_REASON_FOREIGN_FLUSH, "foreign_flush") diff --git a/include/uapi/linux/sysctl.h b/include/uapi/linux/sysctl.h index 63d1464cb71c..6ea9ea8413fa 100644 --- a/include/uapi/linux/sysctl.h +++ b/include/uapi/linux/sysctl.h @@ -183,7 +183,7 @@ enum VM_LOWMEM_RESERVE_RATIO=20,/* reservation ratio for lower memory zones */ VM_MIN_FREE_KBYTES=21, /* Minimum free kilobytes to maintain */ VM_MAX_MAP_COUNT=22, /* int: Maximum number of mmaps/address-space */ - VM_LAPTOP_MODE=23, /* vm laptop mode */ + VM_BLOCK_DUMP=24, /* block dump mode */ VM_HUGETLB_GROUP=25, /* permitted hugetlb group */ VM_VFS_CACHE_PRESSURE=26, /* dcache/icache reclaim pressure */ -- cgit v1.2.3 From bd4526e64bcff4cbeaefbbd91c40d3e38b9920a9 Mon Sep 17 00:00:00 2001 From: Sidhartha Kumar Date: Wed, 3 Dec 2025 22:45:11 +0000 Subject: maple_tree: remove struct maple_alloc struct maple_alloc is deprecated after the maple tree conversion to sheaves, remove the references from the header file. 
Link: https://lkml.kernel.org/r/20251203224511.469978-1-sidhartha.kumar@oracle.com Signed-off-by: Sidhartha Kumar Reviewed-by: Jinjie Ruan Reviewed-by: Liam R. Howlett Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- include/linux/maple_tree.h | 8 -------- 1 file changed, 8 deletions(-) (limited to 'include') diff --git a/include/linux/maple_tree.h b/include/linux/maple_tree.h index 66f98a3da8d8..1323c28a7470 100644 --- a/include/linux/maple_tree.h +++ b/include/linux/maple_tree.h @@ -129,13 +129,6 @@ struct maple_arange_64 { struct maple_metadata meta; }; -struct maple_alloc { - unsigned long total; - unsigned char node_count; - unsigned int request_count; - struct maple_alloc *slot[MAPLE_ALLOC_SLOTS]; -}; - struct maple_topiary { struct maple_pnode *parent; struct maple_enode *next; /* Overlaps the pivot */ @@ -306,7 +299,6 @@ struct maple_node { }; struct maple_range_64 mr64; struct maple_arange_64 ma64; - struct maple_alloc alloc; }; }; -- cgit v1.2.3 From 84355caa271a0eab2d1b55ff73aa8aa3e4627661 Mon Sep 17 00:00:00 2001 From: Thorsten Blum Date: Wed, 17 Dec 2025 12:02:13 +0100 Subject: mm/mm_init: replace simple_strtoul with kstrtobool in set_hashdist Use bool for 'hashdist' and replace simple_strtoul() with kstrtobool() for parsing the 'hashdist=' boot parameter. Unlike simple_strtoul(), which returns an unsigned long, kstrtobool() converts the string directly to bool and avoids implicit casting. Check the return value of kstrtobool() and reject invalid values. This adds error handling while preserving behavior for existing values, and removes use of the deprecated simple_strtoul() helper. The current code silently sets 'hashdist = 0' if parsing fails, instead of leaving the default value (HASHDIST_DEFAULT) unchanged. Additionally, kstrtobool() accepts common boolean strings such as "on" and "off". Link: https://lkml.kernel.org/r/20251217110214.50807-1-thorsten.blum@linux.dev Signed-off-by: Thorsten Blum Reviewed-by: Matthew Wilcox (Oracle) Reviewed-by: Mike Rapoport (Microsoft) Signed-off-by: Andrew Morton --- include/linux/memblock.h | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'include') diff --git a/include/linux/memblock.h b/include/linux/memblock.h index 221118b5a16e..6ec5e9ac0699 100644 --- a/include/linux/memblock.h +++ b/include/linux/memblock.h @@ -598,9 +598,9 @@ extern void *alloc_large_system_hash(const char *tablename, */ #ifdef CONFIG_NUMA #define HASHDIST_DEFAULT IS_ENABLED(CONFIG_64BIT) -extern int hashdist; /* Distribute hashes across NUMA nodes? */ +extern bool hashdist; /* Distribute hashes across NUMA nodes? */ #else -#define hashdist (0) +#define hashdist (false) #endif #ifdef CONFIG_MEMTEST -- cgit v1.2.3 From 241b3a09639c317bdcaeea6721b7d1aabef341f9 Mon Sep 17 00:00:00 2001 From: Brendan Jackman Date: Fri, 19 Dec 2025 11:32:18 +0000 Subject: mm: clarify GFP_ATOMIC/GFP_NOWAIT doc-comment The current description of contexts where it's invalid to make GFP_ATOMIC and GFP_NOWAIT calls is rather vague. Replace this with a direct description of the actual contexts of concern and refer to the RT docs where this is explained more discursively. While rejigging this prose, also move the documentation of GFP_NOWAIT to the GFP_NOWAIT section. 
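A minimal illustration of the distinction the new wording draws (the function and lock names are made up for the example):

	#include <linux/slab.h>
	#include <linux/spinlock.h>

	static DEFINE_SPINLOCK(obj_lock);	/* sleeping lock on PREEMPT_RT */
	static DEFINE_RAW_SPINLOCK(hw_lock);	/* truly non-preemptible */

	static void gfp_context_example(void)
	{
		void *p;

		spin_lock(&obj_lock);
		p = kmalloc(64, GFP_ATOMIC);	/* fine, also on PREEMPT_RT */
		spin_unlock(&obj_lock);
		kfree(p);

		raw_spin_lock(&hw_lock);
		/* kmalloc(64, GFP_ATOMIC) here would be invalid on PREEMPT_RT */
		raw_spin_unlock(&hw_lock);
	}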
Link: https://lore.kernel.org/all/d912480a-5229-4efe-9336-b31acded30f5@suse.cz/ Link: https://lkml.kernel.org/r/20251219-b4-gfp_atomic-comment-v2-1-4c4ce274c2b6@google.com Signed-off-by: Brendan Jackman Acked-by: Vlastimil Babka Acked-by: David Hildenbrand (Red Hat) Cc: Liam Howlett Cc: Lorenzo Stoakes Cc: Michal Hocko Cc: Mike Rapoport Cc: Sebastian Andrzej Siewior Cc: Steven Rostedt Cc: Suren Baghdasaryan Signed-off-by: Andrew Morton --- include/linux/gfp_types.h | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-) (limited to 'include') diff --git a/include/linux/gfp_types.h b/include/linux/gfp_types.h index 3de43b12209e..814bb2892f99 100644 --- a/include/linux/gfp_types.h +++ b/include/linux/gfp_types.h @@ -309,8 +309,10 @@ enum { * * %GFP_ATOMIC users can not sleep and need the allocation to succeed. A lower * watermark is applied to allow access to "atomic reserves". - * The current implementation doesn't support NMI and few other strict - * non-preemptive contexts (e.g. raw_spin_lock). The same applies to %GFP_NOWAIT. + * The current implementation doesn't support NMI, nor contexts that disable + * preemption under PREEMPT_RT. This includes raw_spin_lock() and plain + * preempt_disable() - see "Memory allocation" in + * Documentation/core-api/real-time/differences.rst for more info. * * %GFP_KERNEL is typical for kernel-internal allocations. The caller requires * %ZONE_NORMAL or a lower zone for direct access but can direct reclaim. @@ -321,6 +323,7 @@ enum { * %GFP_NOWAIT is for kernel allocations that should not stall for direct * reclaim, start physical IO or use any filesystem callback. It is very * likely to fail to allocate memory, even for very small allocations. + * The same restrictions on calling contexts apply as for %GFP_ATOMIC. * * %GFP_NOIO will use direct reclaim to discard clean pages or slab pages * that do not require the starting of any physical IO. -- cgit v1.2.3 From 7db0787000d44d52710e5cdd67113458fa28f3cd Mon Sep 17 00:00:00 2001 From: Wentao Guan Date: Thu, 6 Nov 2025 19:09:29 +0800 Subject: mm: cleanup vma_iter_bulk_alloc commit d24062914837 ("fork: use __mt_dup() to duplicate maple tree in dup_mmap()"), removed the only user and mas_expected_entries has been removed, since commit e3852a1213ffc ("maple_tree: Drop bulk insert support"). Also cleanup the mas_expected_entries in maple_tree.h. No functional change. Link: https://lkml.kernel.org/r/20251106110929.3522073-1-guanwentao@uniontech.com Signed-off-by: Wentao Guan Reviewed-by: Liam R. 
Howlett Cc: Anshuman Khandual Cc: Cheng Nie Cc: Guan Wentao Cc: Vlastimil Babka Cc: Lorenzo Stoakes Cc: Jann Horn Cc: Pedro Falcato Signed-off-by: Andrew Morton --- include/linux/maple_tree.h | 1 - 1 file changed, 1 deletion(-) (limited to 'include') diff --git a/include/linux/maple_tree.h b/include/linux/maple_tree.h index 1323c28a7470..7b8aad47121e 100644 --- a/include/linux/maple_tree.h +++ b/include/linux/maple_tree.h @@ -528,7 +528,6 @@ bool mas_nomem(struct ma_state *mas, gfp_t gfp); void mas_pause(struct ma_state *mas); void maple_tree_init(void); void mas_destroy(struct ma_state *mas); -int mas_expected_entries(struct ma_state *mas, unsigned long nr_entries); void *mas_prev(struct ma_state *mas, unsigned long min); void *mas_prev_range(struct ma_state *mas, unsigned long max); -- cgit v1.2.3 From 9e80e66ddaf736e5ca80cba8adf8d497bd53092f Mon Sep 17 00:00:00 2001 From: Gregory Price Date: Sun, 21 Dec 2025 07:56:03 -0500 Subject: mm, hugetlb: implement movable_gigantic_pages sysctl This reintroduces a concept removed by: commit d6cb41cc44c6 ("mm, hugetlb: remove hugepages_treat_as_movable sysctl") This sysctl provides flexibility between ZONE_MOVABLE use cases: 1) onlining memory in ZONE_MOVABLE to maintain hotplug compatibility 2) onlining memory in ZONE_MOVABLE to make hugepage allocate reliable When ZONE_MOVABLE is used to make huge page allocation more reliable, disallowing gigantic pages memory in this region is pointless. If hotplug is not a requirement, we can loosen the restrictions to allow 1GB gigantic pages in ZONE_MOVABLE. Since 1GB can be difficult to migrate / has impacts on compaction / defragmentation, we don't enable this by default. Notably, 1GB pages can only be migrated if another 1GB page is available - so hot-unplug will fail if such a page cannot be found. However, since there are scenarios where gigantic pages are migratable, we should allow use of these on movable regions. When not valid 1GB is available for migration, hot-unplug will retry indefinitely (or until interrupted). For example: echo 0 > node0/hugepages/..-1GB/nr_hugepages # clear node0 1GB pages echo 1 > node1/hugepages/..-1GB/nr_hugepages # reserve node1 1GB page ./alloc_huge_node1 & # Allocate a 1GB page on node1 ./node1_offline & # attempt to offline all node1 memory echo 1 > node0/hugepages/..-1GB/nr_hugepages # reserve node0 1GB page In this example, node1_offline will block indefinitely until the final step, when a node0 1GB page is made available. Note: Boot-time CMA is not possible for driver-managed hotplug memory, as CMA requires the memory to be registered as SystemRAM at boot time. Additionally, 1GB huge pages are not supported by THP. 
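One plausible way the knob ends up exposed via sysctl (the table placement, bounds and registration path are assumptions made for illustration; only the movable_gigantic_pages variable itself is part of the patch):

	static struct ctl_table hugetlb_movable_table[] = {
		{
			.procname	= "movable_gigantic_pages",
			.data		= &movable_gigantic_pages,
			.maxlen		= sizeof(int),
			.mode		= 0644,
			.proc_handler	= proc_dointvec_minmax,
			.extra1		= SYSCTL_ZERO,
			.extra2		= SYSCTL_ONE,
		},
	};
	/* registered during init, e.g. register_sysctl("vm", hugetlb_movable_table) */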
Link: https://lkml.kernel.org/r/20251221125603.2364174-1-gourry@gourry.net Signed-off-by: Gregory Price Suggested-by: David Rientjes Link: https://lore.kernel.org/all/20180201193132.Hk7vI_xaU%25akpm@linux-foundation.org/ Acked-by: David Hildenbrand (Red Hat) Acked-by: David Rientjes Cc: Mel Gorman Cc: Michal Hocko Cc: "David Hildenbrand (Red Hat)" Cc: Gregory Price Cc: Johannes Weiner Cc: Jonathan Corbet Cc: Liam Howlett Cc: Lorenzo Stoakes Cc: Mike Rapoport Cc: Muchun Song Cc: Oscar Salvador Cc: Suren Baghdasaryan Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- include/linux/hugetlb.h | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) (limited to 'include') diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h index e51b8ef0cebd..694f6e83c637 100644 --- a/include/linux/hugetlb.h +++ b/include/linux/hugetlb.h @@ -171,6 +171,7 @@ bool hugetlbfs_pagecache_present(struct hstate *h, struct address_space *hugetlb_folio_mapping_lock_write(struct folio *folio); +extern int movable_gigantic_pages __read_mostly; extern int sysctl_hugetlb_shm_group __read_mostly; extern struct list_head huge_boot_pages[MAX_NUMNODES]; @@ -929,7 +930,7 @@ static inline bool hugepage_movable_supported(struct hstate *h) if (!hugepage_migration_supported(h)) return false; - if (hstate_is_gigantic(h)) + if (hstate_is_gigantic(h) && !movable_gigantic_pages) return false; return true; } -- cgit v1.2.3 From a8d933dc3354bfb9db1fc0e09c289ec1778ee271 Mon Sep 17 00:00:00 2001 From: Wei Yang Date: Thu, 25 Dec 2025 21:02:13 +0000 Subject: mm/vmstat: remove unused node and zone state helpers Several helper functions for managing node and zone states have become obsolete and no longer have any callers within the kernel. inc_node_state() inc_zone_state() dec_zone_state() This commit removes the dead code. 
Link: https://lkml.kernel.org/r/20251225210213.2553-1-richard.weiyang@gmail.com Signed-off-by: Wei Yang Reviewed-by: Joshua Hahn Acked-by: David Hildenbrand (Red Hat) Acked-by: Vlastimil Babka Signed-off-by: Andrew Morton --- include/linux/vmstat.h | 6 ------ 1 file changed, 6 deletions(-) (limited to 'include') diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h index 3398a345bda8..cf559e2ce1d4 100644 --- a/include/linux/vmstat.h +++ b/include/linux/vmstat.h @@ -286,10 +286,8 @@ void mod_node_page_state(struct pglist_data *, enum node_stat_item, long); void inc_node_page_state(struct page *, enum node_stat_item); void dec_node_page_state(struct page *, enum node_stat_item); -extern void inc_node_state(struct pglist_data *, enum node_stat_item); extern void __inc_zone_state(struct zone *, enum zone_stat_item); extern void __inc_node_state(struct pglist_data *, enum node_stat_item); -extern void dec_zone_state(struct zone *, enum zone_stat_item); extern void __dec_zone_state(struct zone *, enum zone_stat_item); extern void __dec_node_state(struct pglist_data *, enum node_stat_item); @@ -394,10 +392,6 @@ static inline void __dec_node_page_state(struct page *page, #define dec_node_page_state __dec_node_page_state #define mod_node_page_state __mod_node_page_state -#define inc_zone_state __inc_zone_state -#define inc_node_state __inc_node_state -#define dec_zone_state __dec_zone_state - #define set_pgdat_percpu_threshold(pgdat, callback) { } static inline void refresh_zone_stat_thresholds(void) { } -- cgit v1.2.3 From f9b74c13b773b7c7e4920d7bc214ea3d5f37b422 Mon Sep 17 00:00:00 2001 From: Wei Yang Date: Wed, 31 Dec 2025 03:00:26 +0000 Subject: mm/mmu_gather: remove @delay_remap of __tlb_remove_page_size() __tlb_remove_page_size() is only used in tlb_remove_page_size() with @delay_remap set to false and it is passed directly to __tlb_remove_folio_pages_size(). Remove @delay_remap of __tlb_remove_page_size() and call __tlb_remove_folio_pages_size() with false @delay_remap. 
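The resulting wrapper in mm/mmu_gather.c is then expected to look roughly like this (sketch based on the commit message; the exact parameter order of __tlb_remove_folio_pages_size() is assumed from its existing callers):

	bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page,
				    int page_size)
	{
		/* a single page, never with delayed rmap */
		return __tlb_remove_folio_pages_size(tlb, page, 1, false, page_size);
	}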
Link: https://lkml.kernel.org/r/20251231030026.15938-1-richard.weiyang@gmail.com Signed-off-by: Wei Yang Acked-by: SeongJae Park Acked-by: David Hildenbrand (Red Hat) Acked-by: Will Deacon Acked-by: Heiko Carstens # s390 Cc: Alexander Gordeev Cc: "Aneesh Kumar K.V" Cc: Arnd Bergmann Cc: Christian Borntraeger Cc: Nicholas Piggin Cc: Peter Zijlstra Cc: Sven Schnelle Cc: Vasily Gorbik Signed-off-by: Andrew Morton --- include/asm-generic/tlb.h | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) (limited to 'include') diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h index 4d679d2a206b..3975f7d11553 100644 --- a/include/asm-generic/tlb.h +++ b/include/asm-generic/tlb.h @@ -287,8 +287,7 @@ struct mmu_gather_batch { */ #define MAX_GATHER_BATCH_COUNT (10000UL/MAX_GATHER_BATCH) -extern bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page, - bool delay_rmap, int page_size); +extern bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page, int page_size); bool __tlb_remove_folio_pages(struct mmu_gather *tlb, struct page *page, unsigned int nr_pages, bool delay_rmap); @@ -510,7 +509,7 @@ static inline void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb) static inline void tlb_remove_page_size(struct mmu_gather *tlb, struct page *page, int page_size) { - if (__tlb_remove_page_size(tlb, page, false, page_size)) + if (__tlb_remove_page_size(tlb, page, page_size)) tlb_flush_mmu(tlb); } -- cgit v1.2.3 From 5173ae0a068d64643ccf4915b7cbedf82810a592 Mon Sep 17 00:00:00 2001 From: Shivank Garg Date: Sun, 18 Jan 2026 19:09:40 +0000 Subject: mm/khugepaged: map dirty/writeback pages failures to EAGAIN Patch series "mm/khugepaged: fix dirty page handling for MADV_COLLAPSE", v5. MADV_COLLAPSE on file-backed mappings fails with -EINVAL when TEXT pages are dirty. This affects scenarios like package/container updates or executing binaries immediately after writing them, etc. The issue is that collapse_file() triggers async writeback and returns SCAN_FAIL (maps to -EINVAL), expecting khugepaged to revisit later. But MADV_COLLAPSE is synchronous and userspace expects immediate success or a clear retry signal. Reproduction: - Compile or copy 2MB-aligned executable to XFS/ext4 FS - Call MADV_COLLAPSE on .text section - First call fails with -EINVAL (text pages dirty from copy) - Second call succeeds (async writeback completed) Issue Report: https://lore.kernel.org/all/4e26fe5e-7374-467c-a333-9dd48f85d7cc@amd.com This patch (of 2): When collapse_file encounters dirty or writeback pages in file-backed mappings, it currently returns SCAN_FAIL which maps to -EINVAL. This is misleading as EINVAL suggests invalid arguments, whereas dirty/writeback pages represent transient conditions that may resolve on retry. Introduce SCAN_PAGE_DIRTY_OR_WRITEBACK to cover both dirty and writeback states, mapping it to -EAGAIN. For MADV_COLLAPSE, this provides userspace with a clear signal that retry may succeed after writeback completes. For khugepaged, this is harmless as it will naturally revisit the range during periodic scans after async writeback completes. 
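From the MADV_COLLAPSE caller's point of view, the new return value enables a simple retry loop along these lines (userspace sketch; the retry bound and sleep are illustrative, and MADV_COLLAPSE must be provided by the libc headers):

	#include <errno.h>
	#include <sys/mman.h>
	#include <unistd.h>

	static int collapse_with_retry(void *addr, size_t len, int max_tries)
	{
		for (int i = 0; i < max_tries; i++) {
			if (!madvise(addr, len, MADV_COLLAPSE))
				return 0;		/* collapsed */
			if (errno != EAGAIN)
				return -1;		/* will not resolve by retrying */
			usleep(10000);			/* give async writeback a moment */
		}
		return -1;
	}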
Link: https://lkml.kernel.org/r/20260118190939.8986-2-shivankg@amd.com Link: https://lkml.kernel.org/r/20260118190939.8986-4-shivankg@amd.com Fixes: 34488399fa08 ("mm/madvise: add file and shmem support to MADV_COLLAPSE") Signed-off-by: Shivank Garg Reported-by: Branden Moore Closes: https://lore.kernel.org/all/4e26fe5e-7374-467c-a333-9dd48f85d7cc@amd.com Reviewed-by: Dev Jain Reviewed-by: Lance Yang Reviewed-by: Baolin Wang Reviewed-by: wang lian Acked-by: David Hildenbrand (Red Hat) Cc: Barry Song Cc: Liam Howlett Cc: Lorenzo Stoakes Cc: Masami Hiramatsu Cc: Mathieu Desnoyers Cc: Nico Pache Cc: Ryan Roberts Cc: Zach O'Keefe Cc: Zi Yan Signed-off-by: Andrew Morton --- include/trace/events/huge_memory.h | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) (limited to 'include') diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h index 4cde53b45a85..4e41bff31888 100644 --- a/include/trace/events/huge_memory.h +++ b/include/trace/events/huge_memory.h @@ -37,7 +37,8 @@ EM( SCAN_PAGE_HAS_PRIVATE, "page_has_private") \ EM( SCAN_STORE_FAILED, "store_failed") \ EM( SCAN_COPY_MC, "copy_poisoned_page") \ - EMe(SCAN_PAGE_FILLED, "page_filled") + EM( SCAN_PAGE_FILLED, "page_filled") \ + EMe(SCAN_PAGE_DIRTY_OR_WRITEBACK, "page_dirty_or_writeback") #undef EM #undef EMe -- cgit v1.2.3 From ba1c86874e25e95de9b253570bb50cc3b5df542e Mon Sep 17 00:00:00 2001 From: "Mike Rapoport (Microsoft)" Date: Sun, 11 Jan 2026 10:20:35 +0200 Subject: alpha: introduce arch_zone_limits_init() Patch series "arch, mm: consolidate hugetlb early reservation", v3. Order in which early memory reservation for hugetlb happens depends on architecture, on configuration options and on command line parameters. Some architectures rely on the core MM to call hugetlb_bootmem_alloc() while others call it very early to allow pre-allocation of HVO-style vmemmap. When hugetlb_cma is supported by an architecture it is initialized during setup_arch() and then later hugetlb_init code needs to understand did it happen or not. To make everything consistent and unified, both reservation of hugetlb memory from bootmem and creation of CMA areas for hugetlb must be called from core MM initialization and it would have been a simple change. However, HVO-style pre-initialization ordering requirements slightly complicate things and for HVO pre-init to work sparse and memory map should be initialized after hugetlb reservations. This required pulling out the call to free_area_init() out of setup_arch() path and moving it MM initialization and this is what the first 23 patches do. These changes are deliberately split into per-arch patches that change how the zone limits are calculated for each architecture and the patches 22 and 23 just remove the calls to free_area_init() and sprase_init() from arch/*. Patch 24 is a simple cleanup for MIPS. Patches 25 and 26 actually consolidate hugetlb reservations and patches 27 and 28 perform some aftermath cleanups. This patch (of 29): Move calculations of zone limits to a dedicated arch_zone_limits_init() function. Later MM core will use this function as an architecture specific callback during nodes and zones initialization and thus there won't be a need to call free_area_init() from every architecture. 
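The shape of the new hook is roughly the following (illustrative only; each architecture fills in its own limits, and max_dma_pfn here stands in for whatever DMA boundary the architecture computes):

	void __init arch_zone_limits_init(unsigned long *max_zone_pfn)
	{
	#ifdef CONFIG_ZONE_DMA
		max_zone_pfn[ZONE_DMA] = min(max_dma_pfn, max_low_pfn);
	#endif
		max_zone_pfn[ZONE_NORMAL] = max_low_pfn;
	}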
Link: https://lkml.kernel.org/r/20260111082105.290734-1-rppt@kernel.org Link: https://lkml.kernel.org/r/20260111082105.290734-2-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) Acked-by: Magnus Lindholm Cc: Alexander Gordeev Cc: Alex Shi Cc: Andreas Larsson Cc: "Borislav Petkov (AMD)" Cc: Catalin Marinas Cc: David Hildenbrand Cc: David S. Miller Cc: Dinh Nguyen Cc: Geert Uytterhoeven Cc: Guo Ren Cc: Heiko Carstens Cc: Helge Deller Cc: Huacai Chen Cc: Ingo Molnar Cc: Johannes Berg Cc: John Paul Adrian Glaubitz Cc: Jonathan Corbet Cc: Klara Modin Cc: Liam Howlett Cc: Lorenzo Stoakes Cc: Matt Turner Cc: Max Filippov Cc: Michael Ellerman Cc: Michal Hocko Cc: Michal Simek Cc: Muchun Song Cc: Oscar Salvador Cc: Palmer Dabbelt Cc: Pratyush Yadav Cc: Richard Weinberger Cc: "Ritesh Harjani (IBM)" Cc: Russell King Cc: Stafford Horne Cc: Suren Baghdasaryan Cc: Thomas Bogendoerfer Cc: Thomas Gleixner Cc: Vasily Gorbik Cc: Vineet Gupta Cc: Vlastimil Babka Cc: Will Deacon Signed-off-by: Andrew Morton --- include/linux/mm.h | 1 + 1 file changed, 1 insertion(+) (limited to 'include') diff --git a/include/linux/mm.h b/include/linux/mm.h index ab2e7e30aef9..477339b7a032 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -3556,6 +3556,7 @@ static inline unsigned long get_num_physpages(void) * free_area_init(max_zone_pfns); */ void free_area_init(unsigned long *max_zone_pfn); +void arch_zone_limits_init(unsigned long *max_zone_pfn); unsigned long node_map_pfn_alignment(void); extern unsigned long absent_pages_in_range(unsigned long start_pfn, unsigned long end_pfn); -- cgit v1.2.3 From d49004c5f0c140bb83c87fab46dcf449cf00eb24 Mon Sep 17 00:00:00 2001 From: "Mike Rapoport (Microsoft)" Date: Sun, 11 Jan 2026 10:20:57 +0200 Subject: arch, mm: consolidate initialization of nodes, zones and memory map To initialize node, zone and memory map data structures every architecture calls free_area_init() during setup_arch() and passes it an array of zone limits. Beside code duplication it creates "interesting" ordering cases between allocation and initialization of hugetlb and the memory map. Some architectures allocate hugetlb pages very early in setup_arch() in certain cases, some only create hugetlb CMA areas in setup_arch() and sometimes hugetlb allocations happen mm_core_init(). With arch_zone_limits_init() helper available now on all architectures it is no longer necessary to call free_area_init() from architecture setup code. Rather core MM initialization can call arch_zone_limits_init() in a single place. This allows to unify ordering of hugetlb vs memory map allocation and initialization. Remove the call to free_area_init() from architecture specific code and place it in a new mm_core_init_early() function that is called immediately after setup_arch(). After this refactoring it is possible to consolidate hugetlb allocations and eliminate differences in ordering of hugetlb and memory map initialization among different architectures. As the first step of this consolidation move hugetlb_bootmem_alloc() to mm_core_early_init(). Link: https://lkml.kernel.org/r/20260111082105.290734-24-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) Cc: Alexander Gordeev Cc: Alex Shi Cc: Andreas Larsson Cc: "Borislav Petkov (AMD)" Cc: Catalin Marinas Cc: David Hildenbrand Cc: David S. 
Miller Cc: Dinh Nguyen Cc: Geert Uytterhoeven Cc: Guo Ren Cc: Heiko Carstens Cc: Helge Deller Cc: Huacai Chen Cc: Ingo Molnar Cc: Johannes Berg Cc: John Paul Adrian Glaubitz Cc: Jonathan Corbet Cc: Klara Modin Cc: Liam Howlett Cc: Lorenzo Stoakes Cc: Magnus Lindholm Cc: Matt Turner Cc: Max Filippov Cc: Michael Ellerman Cc: Michal Hocko Cc: Michal Simek Cc: Muchun Song Cc: Oscar Salvador Cc: Palmer Dabbelt Cc: Pratyush Yadav Cc: Richard Weinberger Cc: "Ritesh Harjani (IBM)" Cc: Russell King Cc: Stafford Horne Cc: Suren Baghdasaryan Cc: Thomas Bogendoerfer Cc: Thomas Gleixner Cc: Vasily Gorbik Cc: Vineet Gupta Cc: Vlastimil Babka Cc: Will Deacon Signed-off-by: Andrew Morton --- include/linux/mm.h | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'include') diff --git a/include/linux/mm.h b/include/linux/mm.h index 477339b7a032..aacabf8a0b58 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -45,6 +45,7 @@ struct pt_regs; struct folio_batch; void arch_mm_preinit(void); +void mm_core_init_early(void); void mm_core_init(void); void init_mm_internals(void); @@ -3540,7 +3541,7 @@ static inline unsigned long get_num_physpages(void) } /* - * Using memblock node mappings, an architecture may initialise its + * FIXME: Using memblock node mappings, an architecture may initialise its * zones, allocate the backing mem_map and account for memory holes in an * architecture independent manner. * @@ -3555,7 +3556,6 @@ static inline unsigned long get_num_physpages(void) * memblock_add_node(base, size, nid, MEMBLOCK_NONE) * free_area_init(max_zone_pfns); */ -void free_area_init(unsigned long *max_zone_pfn); void arch_zone_limits_init(unsigned long *max_zone_pfn); unsigned long node_map_pfn_alignment(void); extern unsigned long absent_pages_in_range(unsigned long start_pfn, -- cgit v1.2.3 From 4267739cabb82da75780c4699fe8208821929944 Mon Sep 17 00:00:00 2001 From: "Mike Rapoport (Microsoft)" Date: Sun, 11 Jan 2026 10:20:58 +0200 Subject: arch, mm: consolidate initialization of SPARSE memory model Every architecture calls sparse_init() during setup_arch() although the data structures created by sparse_init() are not used until the initialization of the core MM. Beside the code duplication, calling sparse_init() from architecture specific code causes ordering differences of vmemmap and HVO initialization on different architectures. Move the call to sparse_init() from architecture specific code to free_area_init() to ensure that vmemmap and HVO initialization order is always the same. Link: https://lkml.kernel.org/r/20260111082105.290734-25-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) Cc: Alexander Gordeev Cc: Alex Shi Cc: Andreas Larsson Cc: "Borislav Petkov (AMD)" Cc: Catalin Marinas Cc: David Hildenbrand Cc: David S. 
Miller Cc: Dinh Nguyen Cc: Geert Uytterhoeven Cc: Guo Ren Cc: Heiko Carstens Cc: Helge Deller Cc: Huacai Chen Cc: Ingo Molnar Cc: Johannes Berg Cc: John Paul Adrian Glaubitz Cc: Jonathan Corbet Cc: Klara Modin Cc: Liam Howlett Cc: Lorenzo Stoakes Cc: Magnus Lindholm Cc: Matt Turner Cc: Max Filippov Cc: Michael Ellerman Cc: Michal Hocko Cc: Michal Simek Cc: Muchun Song Cc: Oscar Salvador Cc: Palmer Dabbelt Cc: Pratyush Yadav Cc: Richard Weinberger Cc: "Ritesh Harjani (IBM)" Cc: Russell King Cc: Stafford Horne Cc: Suren Baghdasaryan Cc: Thomas Bogendoerfer Cc: Thomas Gleixner Cc: Vasily Gorbik Cc: Vineet Gupta Cc: Vlastimil Babka Cc: Will Deacon Signed-off-by: Andrew Morton --- include/linux/mmzone.h | 2 -- 1 file changed, 2 deletions(-) (limited to 'include') diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index fc5d6c88d2f0..eb3815fc94ad 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -2286,9 +2286,7 @@ static inline unsigned long next_present_section_nr(unsigned long section_nr) #define pfn_to_nid(pfn) (0) #endif -void sparse_init(void); #else -#define sparse_init() do {} while (0) #define sparse_index_init(_sec, _nid) do {} while (0) #define sparse_vmemmap_init_nid_early(_nid) do {} while (0) #define sparse_vmemmap_init_nid_late(_nid) do {} while (0) -- cgit v1.2.3 From 9fac145b6d3fe570277438f8d860eabf229dc545 Mon Sep 17 00:00:00 2001 From: "Mike Rapoport (Microsoft)" Date: Sun, 11 Jan 2026 10:21:01 +0200 Subject: mm, arch: consolidate hugetlb CMA reservation Every architecture that supports hugetlb_cma command line parameter reserves CMA areas for hugetlb during setup_arch(). This obfuscates the ordering of hugetlb CMA initialization with respect to the rest initialization of the core MM. Introduce arch_hugetlb_cma_order() callback to allow architectures report the desired order-per-bit of CMA areas and provide a week implementation of arch_hugetlb_cma_order() for architectures that don't support hugetlb with CMA. Use this callback in hugetlb_cma_reserve() instead if passing the order as parameter and call hugetlb_cma_reserve() from mm_core_init_early() rather than have it spread over architecture specific code. Link: https://lkml.kernel.org/r/20260111082105.290734-28-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) Cc: Alexander Gordeev Cc: Alex Shi Cc: Andreas Larsson Cc: "Borislav Petkov (AMD)" Cc: Catalin Marinas Cc: David Hildenbrand Cc: David S. 
Miller Cc: Dinh Nguyen Cc: Geert Uytterhoeven Cc: Guo Ren Cc: Heiko Carstens Cc: Helge Deller Cc: Huacai Chen Cc: Ingo Molnar Cc: Johannes Berg Cc: John Paul Adrian Glaubitz Cc: Jonathan Corbet Cc: Klara Modin Cc: Liam Howlett Cc: Lorenzo Stoakes Cc: Magnus Lindholm Cc: Matt Turner Cc: Max Filippov Cc: Michael Ellerman Cc: Michal Hocko Cc: Michal Simek Cc: Muchun Song Cc: Oscar Salvador Cc: Palmer Dabbelt Cc: Pratyush Yadav Cc: Richard Weinberger Cc: "Ritesh Harjani (IBM)" Cc: Russell King Cc: Stafford Horne Cc: Suren Baghdasaryan Cc: Thomas Bogendoerfer Cc: Thomas Gleixner Cc: Vasily Gorbik Cc: Vineet Gupta Cc: Vlastimil Babka Cc: Will Deacon Signed-off-by: Andrew Morton --- include/linux/hugetlb.h | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) (limited to 'include') diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h index 694f6e83c637..00e6a73e7bba 100644 --- a/include/linux/hugetlb.h +++ b/include/linux/hugetlb.h @@ -281,6 +281,8 @@ void fixup_hugetlb_reservations(struct vm_area_struct *vma); void hugetlb_split(struct vm_area_struct *vma, unsigned long addr); int hugetlb_vma_lock_alloc(struct vm_area_struct *vma); +unsigned int arch_hugetlb_cma_order(void); + #else /* !CONFIG_HUGETLB_PAGE */ static inline void hugetlb_dup_vma_private(struct vm_area_struct *vma) @@ -1322,9 +1324,9 @@ static inline spinlock_t *huge_pte_lock(struct hstate *h, } #if defined(CONFIG_HUGETLB_PAGE) && defined(CONFIG_CMA) -extern void __init hugetlb_cma_reserve(int order); +extern void __init hugetlb_cma_reserve(void); #else -static inline __init void hugetlb_cma_reserve(int order) +static inline __init void hugetlb_cma_reserve(void) { } #endif -- cgit v1.2.3 From 743758ccf8bede3e7c38f3f7d3f5131aa0a7b4a6 Mon Sep 17 00:00:00 2001 From: "Mike Rapoport (Microsoft)" Date: Sun, 11 Jan 2026 10:21:03 +0200 Subject: Revert "mm/hugetlb: deal with multiple calls to hugetlb_bootmem_alloc" This reverts commit d58b2498200724e4f8c12d71a5953da03c8c8bdf. hugetlb_bootmem_alloc() is called only once, no need to check if it was called already at its entry. Other checks performed during HVO initialization are also no longer necessary because sparse_init() that calls hugetlb_vmemmap_init_early() and hugetlb_vmemmap_init_late() is always called after hugetlb_bootmem_alloc(). Link: https://lkml.kernel.org/r/20260111082105.290734-30-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) Acked-by: Muchun Song Cc: Alexander Gordeev Cc: Alex Shi Cc: Andreas Larsson Cc: "Borislav Petkov (AMD)" Cc: Catalin Marinas Cc: David Hildenbrand Cc: David S. 
Miller Cc: Dinh Nguyen Cc: Geert Uytterhoeven Cc: Guo Ren Cc: Heiko Carstens Cc: Helge Deller Cc: Huacai Chen Cc: Ingo Molnar Cc: Johannes Berg Cc: John Paul Adrian Glaubitz Cc: Jonathan Corbet Cc: Klara Modin Cc: Liam Howlett Cc: Lorenzo Stoakes Cc: Magnus Lindholm Cc: Matt Turner Cc: Max Filippov Cc: Michael Ellerman Cc: Michal Hocko Cc: Michal Simek Cc: Oscar Salvador Cc: Palmer Dabbelt Cc: Pratyush Yadav Cc: Richard Weinberger Cc: "Ritesh Harjani (IBM)" Cc: Russell King Cc: Stafford Horne Cc: Suren Baghdasaryan Cc: Thomas Bogendoerfer Cc: Thomas Gleixner Cc: Vasily Gorbik Cc: Vineet Gupta Cc: Vlastimil Babka Cc: Will Deacon Signed-off-by: Andrew Morton --- include/linux/hugetlb.h | 6 ------ 1 file changed, 6 deletions(-) (limited to 'include') diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h index 00e6a73e7bba..94a03591990c 100644 --- a/include/linux/hugetlb.h +++ b/include/linux/hugetlb.h @@ -176,7 +176,6 @@ extern int sysctl_hugetlb_shm_group __read_mostly; extern struct list_head huge_boot_pages[MAX_NUMNODES]; void hugetlb_bootmem_alloc(void); -bool hugetlb_bootmem_allocated(void); extern nodemask_t hugetlb_bootmem_nodes; void hugetlb_bootmem_set_nodes(void); @@ -1306,11 +1305,6 @@ static inline bool hugetlbfs_pagecache_present( static inline void hugetlb_bootmem_alloc(void) { } - -static inline bool hugetlb_bootmem_allocated(void) -{ - return false; -} #endif /* CONFIG_HUGETLB_PAGE */ static inline spinlock_t *huge_pte_lock(struct hstate *h, -- cgit v1.2.3 From 53eb797ffc3abe30418b19777922b55fb339fc1f Mon Sep 17 00:00:00 2001 From: Lorenzo Stoakes Date: Sun, 18 Jan 2026 14:50:41 +0000 Subject: mm/rmap: remove anon_vma_merge() function This function is confusing, we already have the concept of anon_vma merge to adjacent VMA's anon_vma's to increase probability of anon_vma compatibility and therefore VMA merge (see is_mergeable_anon_vma() etc.), as well as anon_vma reuse, along side the usual VMA merge logic. We can remove the anon_vma check as it is redundant - a merge would not have been permitted with removal if the anon_vma's were not the same (and in the case of an unfaulted/faulted merge, we would have already set the unfaulted VMA's anon_vma to vp->remove->anon_vma in dup_anon_vma()). Avoid overloading this term when we're very simply unlinking anon_vma state from a removed VMA upon merge. Link: https://lkml.kernel.org/r/56bbe45e309f7af197b1c4f94a9a0c8931ff2d29.1768746221.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes Reviewed-by: Suren Baghdasaryan Reviewed-by: Liam R. 
Howlett Cc: Barry Song Cc: Chris Li Cc: David Hildenbrand Cc: Harry Yoo Cc: Jann Horn Cc: Michal Hocko Cc: Mike Rapoport Cc: Pedro Falcato Cc: Rik van Riel Cc: Shakeel Butt Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- include/linux/rmap.h | 7 ------- 1 file changed, 7 deletions(-) (limited to 'include') diff --git a/include/linux/rmap.h b/include/linux/rmap.h index daa92a58585d..832bfc0ccfc6 100644 --- a/include/linux/rmap.h +++ b/include/linux/rmap.h @@ -165,13 +165,6 @@ static inline int anon_vma_prepare(struct vm_area_struct *vma) return __anon_vma_prepare(vma); } -static inline void anon_vma_merge(struct vm_area_struct *vma, - struct vm_area_struct *next) -{ - VM_BUG_ON_VMA(vma->anon_vma != next->anon_vma, vma); - unlink_anon_vmas(next); -} - struct anon_vma *folio_get_anon_vma(const struct folio *folio); #ifdef CONFIG_MM_ID -- cgit v1.2.3 From 7549e3d20f1aa9a0b8c77f83144dde54ed6ab4fe Mon Sep 17 00:00:00 2001 From: Lorenzo Stoakes Date: Sun, 18 Jan 2026 14:50:42 +0000 Subject: mm/rmap: make anon_vma functions internal The bulk of the anon_vma operations are only used by mm, so formalise this by putting the function prototypes and inlines in mm/internal.h. This allows us to make changes without having to worry about the rest of the kernel. Link: https://lkml.kernel.org/r/79ec933c3a9c8bf1f64dab253bbfdae8a01cb921.1768746221.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes Reviewed-by: Suren Baghdasaryan Reviewed-by: Liam R. Howlett Cc: Barry Song Cc: Chris Li Cc: David Hildenbrand Cc: Harry Yoo Cc: Jann Horn Cc: Michal Hocko Cc: Mike Rapoport Cc: Pedro Falcato Cc: Rik van Riel Cc: Shakeel Butt Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- include/linux/rmap.h | 60 ---------------------------------------------------- 1 file changed, 60 deletions(-) (limited to 'include') diff --git a/include/linux/rmap.h b/include/linux/rmap.h index 832bfc0ccfc6..dd764951b03d 100644 --- a/include/linux/rmap.h +++ b/include/linux/rmap.h @@ -104,68 +104,8 @@ enum ttu_flags { }; #ifdef CONFIG_MMU -static inline void get_anon_vma(struct anon_vma *anon_vma) -{ - atomic_inc(&anon_vma->refcount); -} - -void __put_anon_vma(struct anon_vma *anon_vma); - -static inline void put_anon_vma(struct anon_vma *anon_vma) -{ - if (atomic_dec_and_test(&anon_vma->refcount)) - __put_anon_vma(anon_vma); -} - -static inline void anon_vma_lock_write(struct anon_vma *anon_vma) -{ - down_write(&anon_vma->root->rwsem); -} -static inline int anon_vma_trylock_write(struct anon_vma *anon_vma) -{ - return down_write_trylock(&anon_vma->root->rwsem); -} - -static inline void anon_vma_unlock_write(struct anon_vma *anon_vma) -{ - up_write(&anon_vma->root->rwsem); -} - -static inline void anon_vma_lock_read(struct anon_vma *anon_vma) -{ - down_read(&anon_vma->root->rwsem); -} - -static inline int anon_vma_trylock_read(struct anon_vma *anon_vma) -{ - return down_read_trylock(&anon_vma->root->rwsem); -} - -static inline void anon_vma_unlock_read(struct anon_vma *anon_vma) -{ - up_read(&anon_vma->root->rwsem); -} - - -/* - * anon_vma helper functions. 
- */ void anon_vma_init(void); /* create anon_vma_cachep */ -int __anon_vma_prepare(struct vm_area_struct *); -void unlink_anon_vmas(struct vm_area_struct *); -int anon_vma_clone(struct vm_area_struct *, struct vm_area_struct *); -int anon_vma_fork(struct vm_area_struct *, struct vm_area_struct *); - -static inline int anon_vma_prepare(struct vm_area_struct *vma) -{ - if (likely(vma->anon_vma)) - return 0; - - return __anon_vma_prepare(vma); -} - -struct anon_vma *folio_get_anon_vma(const struct folio *folio); #ifdef CONFIG_MM_ID static __always_inline void folio_lock_large_mapcount(struct folio *folio) -- cgit v1.2.3 From 85f03a86318c4172bfda4484cdf588ebab5fa410 Mon Sep 17 00:00:00 2001 From: Lorenzo Stoakes Date: Sun, 18 Jan 2026 14:50:43 +0000 Subject: mm/mmap_lock: add vma_is_attached() helper This makes it easy to explicitly check for VMA detachment, which is useful for things like asserts. Note that we intentionally do not allow this function to be available should CONFIG_PER_VMA_LOCK be set - this is because vma_assert_attached() and vma_assert_detached() are no-ops if !CONFIG_PER_VMA_LOCK, so there is no correct state for vma_is_attached() to be in if this configuration option is not specified. Therefore users elsewhere must invoke this function only after checking for CONFIG_PER_VMA_LOCK. We rework the assert functions to utilise this. Link: https://lkml.kernel.org/r/0172d3bf527ca54ba27d8bce8f8476095b241ac7.1768746221.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes Reviewed-by: Suren Baghdasaryan Reviewed-by: Liam R. Howlett Cc: Barry Song Cc: Chris Li Cc: David Hildenbrand Cc: Harry Yoo Cc: Jann Horn Cc: Michal Hocko Cc: Mike Rapoport Cc: Pedro Falcato Cc: Rik van Riel Cc: Shakeel Butt Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- include/linux/mmap_lock.h | 9 +++++++-- 1 file changed, 7 insertions(+), 2 deletions(-) (limited to 'include') diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h index d53f72dba7fe..b50416fbba20 100644 --- a/include/linux/mmap_lock.h +++ b/include/linux/mmap_lock.h @@ -251,6 +251,11 @@ static inline void vma_assert_locked(struct vm_area_struct *vma) !__is_vma_write_locked(vma, &mm_lock_seq), vma); } +static inline bool vma_is_attached(struct vm_area_struct *vma) +{ + return refcount_read(&vma->vm_refcnt); +} + /* * WARNING: to avoid racing with vma_mark_attached()/vma_mark_detached(), these * assertions should be made either under mmap_write_lock or when the object @@ -258,12 +263,12 @@ static inline void vma_assert_locked(struct vm_area_struct *vma) */ static inline void vma_assert_attached(struct vm_area_struct *vma) { - WARN_ON_ONCE(!refcount_read(&vma->vm_refcnt)); + WARN_ON_ONCE(!vma_is_attached(vma)); } static inline void vma_assert_detached(struct vm_area_struct *vma) { - WARN_ON_ONCE(refcount_read(&vma->vm_refcnt)); + WARN_ON_ONCE(vma_is_attached(vma)); } static inline void vma_mark_attached(struct vm_area_struct *vma) -- cgit v1.2.3 From 53a9b4646f67c95df1775aa5f381cb7f42cae957 Mon Sep 17 00:00:00 2001 From: Vlastimil Babka Date: Tue, 6 Jan 2026 12:52:37 +0100 Subject: mm/page_alloc: refactor the initial compaction handling The initial direct compaction done in some cases in __alloc_pages_slowpath() stands out from the main retry loop of reclaim + compaction. We can simplify this by instead skipping the initial reclaim attempt via a new local variable compact_first, and handle the compact_prority as necessary to match the original behavior. No functional change intended. 
Link: https://lkml.kernel.org/r/20260106-thp-thisnode-tweak-v3-2-f5d67c21a193@suse.cz Signed-off-by: Vlastimil Babka Suggested-by: Johannes Weiner Reviewed-by: Joshua Hahn Acked-by: Michal Hocko Cc: Brendan Jackman Cc: David Hildenbrand (Red Hat) Cc: David Rientjes Cc: Liam Howlett Cc: Lorenzo Stoakes Cc: Mike Rapoport Cc: Pedro Falcato Cc: Suren Baghdasaryan Cc: Zi Yan Signed-off-by: Andrew Morton --- include/linux/gfp.h | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-) (limited to 'include') diff --git a/include/linux/gfp.h b/include/linux/gfp.h index b155929af5b1..f9fdc99ae594 100644 --- a/include/linux/gfp.h +++ b/include/linux/gfp.h @@ -407,9 +407,15 @@ extern gfp_t gfp_allowed_mask; /* Returns true if the gfp_mask allows use of ALLOC_NO_WATERMARK */ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask); +/* A helper for checking if gfp includes all the specified flags */ +static inline bool gfp_has_flags(gfp_t gfp, gfp_t flags) +{ + return (gfp & flags) == flags; +} + static inline bool gfp_has_io_fs(gfp_t gfp) { - return (gfp & (__GFP_IO | __GFP_FS)) == (__GFP_IO | __GFP_FS); + return gfp_has_flags(gfp, __GFP_IO | __GFP_FS); } /* -- cgit v1.2.3 From e77786b4682e69336e3de3eaeb12ec994027f611 Mon Sep 17 00:00:00 2001 From: Shakeel Butt Date: Thu, 25 Dec 2025 15:21:09 -0800 Subject: memcg: introduce private id API for in-kernel users Patch series "memcg: separate private and public ID namespaces". The memory cgroup subsystem maintains a private ID infrastructure that is decoupled from the cgroup IDs. This private ID system exists because some kernel objects (like swap entries and shadow entries in the workingset code) can outlive the cgroup they were associated with. The motivation is best described in commit 73f576c04b941 ("mm: memcontrol: fix cgroup creation failure after many small jobs"). Unfortunately, some in-kernel users (DAMON, LRU gen debugfs interface, shrinker debugfs) started exposing these private IDs to userspace. This is problematic because: 1. The private IDs are internal implementation details that could change 2. Userspace already has access to cgroup IDs through the cgroup filesystem 3. Using different ID namespaces in different interfaces is confusing This series cleans up the memcg ID infrastructure by: 1. Explicitly marking the private ID APIs with "private" in their names to make it clear they are for internal use only (swap/workingset) 2. Making the public cgroup ID APIs (mem_cgroup_id/mem_cgroup_get_from_id) unconditionally available 3. Converting DAMON, LRU gen, and shrinker debugfs interfaces to use the public cgroup IDs instead of the private IDs 4. Removing the now-unused wrapper functions and renaming the public APIs for clarity After this series: - mem_cgroup_private_id() / mem_cgroup_from_private_id() are used for internal kernel objects that outlive their cgroup (swap, workingset) - mem_cgroup_id() / mem_cgroup_get_from_id() return the public cgroup ID (from cgroup_id()) for use in userspace-facing interfaces This patch (of 8): The memory cgroup maintains a private ID infrastructure decoupled from the cgroup IDs for swapout records and shadow entries. The main motivation of this private ID infra is best described in the commit 73f576c04b941 ("mm: memcontrol: fix cgroup creation failure after many small jobs"). Unfortunately some users have started exposing these private IDs to the userspace where they should have used the cgroup IDs which are already exposed to the userspace. Let's rename the memcg ID APIs to explicitly mark them private. 
No functional change is intended. Link: https://lkml.kernel.org/r/20251225232116.294540-1-shakeel.butt@linux.dev Link: https://lkml.kernel.org/r/20251225232116.294540-2-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt Acked-by: Michal Hocko Cc: Axel Rasmussen Cc: Dave Chinner Cc: David Hildenbrand Cc: Johannes Weiner Cc: Lorenzo Stoakes Cc: Muchun Song Cc: Qi Zheng Cc: Roman Gushchin Cc: SeongJae Park Cc: Wei Xu Cc: Yuanchu Xie Signed-off-by: Andrew Morton --- include/linux/memcontrol.h | 24 +++++++++++++++++++++--- 1 file changed, 21 insertions(+), 3 deletions(-) (limited to 'include') diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index fd400082313a..1c4224bcfb23 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -65,7 +65,7 @@ struct mem_cgroup_reclaim_cookie { #define MEM_CGROUP_ID_SHIFT 16 -struct mem_cgroup_id { +struct mem_cgroup_private_id { int id; refcount_t ref; }; @@ -191,7 +191,7 @@ struct mem_cgroup { struct cgroup_subsys_state css; /* Private memcg ID. Used to ID objects that outlive the cgroup */ - struct mem_cgroup_id id; + struct mem_cgroup_private_id id; /* Accounted resources */ struct page_counter memory; /* Both v1 & v2 */ @@ -821,13 +821,19 @@ void mem_cgroup_iter_break(struct mem_cgroup *, struct mem_cgroup *); void mem_cgroup_scan_tasks(struct mem_cgroup *memcg, int (*)(struct task_struct *, void *), void *arg); -static inline unsigned short mem_cgroup_id(struct mem_cgroup *memcg) +static inline unsigned short mem_cgroup_private_id(struct mem_cgroup *memcg) { if (mem_cgroup_disabled()) return 0; return memcg->id.id; } +struct mem_cgroup *mem_cgroup_from_private_id(unsigned short id); + +static inline unsigned short mem_cgroup_id(struct mem_cgroup *memcg) +{ + return mem_cgroup_private_id(memcg); +} struct mem_cgroup *mem_cgroup_from_id(unsigned short id); #ifdef CONFIG_SHRINKER_DEBUG @@ -1290,6 +1296,18 @@ static inline struct mem_cgroup *mem_cgroup_from_id(unsigned short id) return NULL; } +static inline unsigned short mem_cgroup_private_id(struct mem_cgroup *memcg) +{ + return 0; +} + +static inline struct mem_cgroup *mem_cgroup_from_private_id(unsigned short id) +{ + WARN_ON_ONCE(id); + /* XXX: This should always return root_mem_cgroup */ + return NULL; +} + #ifdef CONFIG_SHRINKER_DEBUG static inline unsigned long mem_cgroup_ino(struct mem_cgroup *memcg) { -- cgit v1.2.3 From 1d89d7fd592e2490cadd13c253d7b1b9f6116be8 Mon Sep 17 00:00:00 2001 From: Shakeel Butt Date: Thu, 25 Dec 2025 15:21:10 -0800 Subject: memcg: expose mem_cgroup_ino() and mem_cgroup_get_from_ino() unconditionally Remove the CONFIG_SHRINKER_DEBUG guards around mem_cgroup_ino() and mem_cgroup_get_from_ino(). These APIs provide a way to get a memcg's cgroup inode number and to look up a memcg from an inode number respectively. Making these functions unconditionally available allows other in-kernel users to leverage them without requiring CONFIG_SHRINKER_DEBUG to be enabled. No functional change for existing users. 
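A hypothetical in-kernel user would pair the two helpers like this (the function is illustrative; _get_from_ino() follows the usual "get" convention, so the caller is expected to drop the reference):

	static void lookup_and_report(struct mem_cgroup *memcg)
	{
		unsigned long ino = mem_cgroup_ino(memcg);	/* still unsigned long at this point */
		struct mem_cgroup *found = mem_cgroup_get_from_ino(ino);

		if (!IS_ERR_OR_NULL(found)) {
			pr_info("memcg inode: %lu\n", ino);
			mem_cgroup_put(found);
		}
	}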
Link: https://lkml.kernel.org/r/20251225232116.294540-3-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt Acked-by: Michal Hocko Cc: Axel Rasmussen Cc: Dave Chinner Cc: David Hildenbrand Cc: Johannes Weiner Cc: Lorenzo Stoakes Cc: Muchun Song Cc: Qi Zheng Cc: Roman Gushchin Cc: SeongJae Park Cc: Wei Xu Cc: Yuanchu Xie Signed-off-by: Andrew Morton --- include/linux/memcontrol.h | 4 ---- 1 file changed, 4 deletions(-) (limited to 'include') diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 1c4224bcfb23..77f32be26ea8 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -836,14 +836,12 @@ static inline unsigned short mem_cgroup_id(struct mem_cgroup *memcg) } struct mem_cgroup *mem_cgroup_from_id(unsigned short id); -#ifdef CONFIG_SHRINKER_DEBUG static inline unsigned long mem_cgroup_ino(struct mem_cgroup *memcg) { return memcg ? cgroup_ino(memcg->css.cgroup) : 0; } struct mem_cgroup *mem_cgroup_get_from_ino(unsigned long ino); -#endif static inline struct mem_cgroup *mem_cgroup_from_seq(struct seq_file *m) { @@ -1308,7 +1306,6 @@ static inline struct mem_cgroup *mem_cgroup_from_private_id(unsigned short id) return NULL; } -#ifdef CONFIG_SHRINKER_DEBUG static inline unsigned long mem_cgroup_ino(struct mem_cgroup *memcg) { return 0; @@ -1318,7 +1315,6 @@ static inline struct mem_cgroup *mem_cgroup_get_from_ino(unsigned long ino) { return NULL; } -#endif static inline struct mem_cgroup *mem_cgroup_from_seq(struct seq_file *m) { -- cgit v1.2.3 From ea73e364716023b1a47d58b9f12e7c92f3b1e6a7 Mon Sep 17 00:00:00 2001 From: Shakeel Butt Date: Thu, 25 Dec 2025 15:21:12 -0800 Subject: memcg: use cgroup_id() instead of cgroup_ino() for memcg ID Switch mem_cgroup_ino() from using cgroup_ino() to cgroup_id(). The cgroup_ino() returns the kernfs inode number while cgroup_id() returns the kernfs node ID. For 64-bit systems, they are the same. Also cgroup_get_from_id() expects 64-bit node ID which is called by mem_cgroup_get_from_ino(). Change the type from unsigned long to u64 to match cgroup_id()'s return type, and update the format specifiers accordingly. Note that the names mem_cgroup_ino() and mem_cgroup_get_from_ino() are now misnomers since they deal with cgroup IDs rather than inode numbers. A follow-up patch will rename them. Link: https://lkml.kernel.org/r/20251225232116.294540-5-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt Acked-by: Michal Hocko Cc: Axel Rasmussen Cc: Dave Chinner Cc: David Hildenbrand Cc: Johannes Weiner Cc: Lorenzo Stoakes Cc: Muchun Song Cc: Qi Zheng Cc: Roman Gushchin Cc: SeongJae Park Cc: Wei Xu Cc: Yuanchu Xie Signed-off-by: Andrew Morton --- include/linux/memcontrol.h | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) (limited to 'include') diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 77f32be26ea8..c823150ec288 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -836,12 +836,12 @@ static inline unsigned short mem_cgroup_id(struct mem_cgroup *memcg) } struct mem_cgroup *mem_cgroup_from_id(unsigned short id); -static inline unsigned long mem_cgroup_ino(struct mem_cgroup *memcg) +static inline u64 mem_cgroup_ino(struct mem_cgroup *memcg) { - return memcg ? cgroup_ino(memcg->css.cgroup) : 0; + return memcg ? 
cgroup_id(memcg->css.cgroup) : 0; } -struct mem_cgroup *mem_cgroup_get_from_ino(unsigned long ino); +struct mem_cgroup *mem_cgroup_get_from_ino(u64 ino); static inline struct mem_cgroup *mem_cgroup_from_seq(struct seq_file *m) { @@ -1306,12 +1306,12 @@ static inline struct mem_cgroup *mem_cgroup_from_private_id(unsigned short id) return NULL; } -static inline unsigned long mem_cgroup_ino(struct mem_cgroup *memcg) +static inline u64 mem_cgroup_ino(struct mem_cgroup *memcg) { return 0; } -static inline struct mem_cgroup *mem_cgroup_get_from_ino(unsigned long ino) +static inline struct mem_cgroup *mem_cgroup_get_from_ino(u64 ino) { return NULL; } -- cgit v1.2.3 From 5866891a7ab1348686da70f70e925964d9227bf5 Mon Sep 17 00:00:00 2001 From: Shakeel Butt Date: Thu, 25 Dec 2025 15:21:13 -0800 Subject: mm/damon: use cgroup ID instead of private memcg ID DAMON was using the internal private memcg ID which is meant for tracking kernel objects that outlive their cgroup. Switch to using the public cgroup ID instead. Link: https://lkml.kernel.org/r/20251225232116.294540-6-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt Reviewed-by: SeongJae Park Acked-by: Michal Hocko Cc: Axel Rasmussen Cc: Dave Chinner Cc: David Hildenbrand Cc: Johannes Weiner Cc: Lorenzo Stoakes Cc: Muchun Song Cc: Qi Zheng Cc: Roman Gushchin Cc: Wei Xu Cc: Yuanchu Xie Signed-off-by: Andrew Morton --- include/linux/damon.h | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'include') diff --git a/include/linux/damon.h b/include/linux/damon.h index a67292a2f09d..650e7ecfa32b 100644 --- a/include/linux/damon.h +++ b/include/linux/damon.h @@ -203,7 +203,7 @@ struct damos_quota_goal { u64 last_psi_total; struct { int nid; - unsigned short memcg_id; + u64 memcg_id; }; }; struct list_head list; @@ -419,7 +419,7 @@ struct damos_filter { bool matching; bool allow; union { - unsigned short memcg_id; + u64 memcg_id; struct damon_addr_range addr_range; int target_idx; struct damon_size_range sz_range; -- cgit v1.2.3 From 2202e3a8cb80da583670034ee33c995513708949 Mon Sep 17 00:00:00 2001 From: Shakeel Butt Date: Thu, 25 Dec 2025 15:21:15 -0800 Subject: memcg: remove unused mem_cgroup_id() and mem_cgroup_from_id() Now that all callers have been converted to use either: - The private ID APIs (mem_cgroup_private_id/mem_cgroup_from_private_id) for internal kernel objects that outlive their cgroup - The public cgroup ID APIs (mem_cgroup_ino/mem_cgroup_get_from_ino) for external interfaces Remove the unused wrapper functions mem_cgroup_id() and mem_cgroup_from_id() along with their !CONFIG_MEMCG stubs. 
Link: https://lkml.kernel.org/r/20251225232116.294540-8-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt Acked-by: Michal Hocko Cc: Axel Rasmussen Cc: Dave Chinner Cc: David Hildenbrand Cc: Johannes Weiner Cc: Lorenzo Stoakes Cc: Muchun Song Cc: Qi Zheng Cc: Roman Gushchin Cc: SeongJae Park Cc: Wei Xu Cc: Yuanchu Xie Signed-off-by: Andrew Morton --- include/linux/memcontrol.h | 18 ------------------ 1 file changed, 18 deletions(-) (limited to 'include') diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index c823150ec288..3e7d69020b39 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -830,12 +830,6 @@ static inline unsigned short mem_cgroup_private_id(struct mem_cgroup *memcg) } struct mem_cgroup *mem_cgroup_from_private_id(unsigned short id); -static inline unsigned short mem_cgroup_id(struct mem_cgroup *memcg) -{ - return mem_cgroup_private_id(memcg); -} -struct mem_cgroup *mem_cgroup_from_id(unsigned short id); - static inline u64 mem_cgroup_ino(struct mem_cgroup *memcg) { return memcg ? cgroup_id(memcg->css.cgroup) : 0; @@ -1282,18 +1276,6 @@ static inline void mem_cgroup_scan_tasks(struct mem_cgroup *memcg, { } -static inline unsigned short mem_cgroup_id(struct mem_cgroup *memcg) -{ - return 0; -} - -static inline struct mem_cgroup *mem_cgroup_from_id(unsigned short id) -{ - WARN_ON_ONCE(id); - /* XXX: This should always return root_mem_cgroup */ - return NULL; -} - static inline unsigned short mem_cgroup_private_id(struct mem_cgroup *memcg) { return 0; -- cgit v1.2.3 From 95296536eb19c969e91684287cf3bfcb382221d3 Mon Sep 17 00:00:00 2001 From: Shakeel Butt Date: Thu, 25 Dec 2025 15:21:16 -0800 Subject: memcg: rename mem_cgroup_ino() to mem_cgroup_id() Rename mem_cgroup_ino() to mem_cgroup_id() and mem_cgroup_get_from_ino() to mem_cgroup_get_from_id(). These functions now use cgroup IDs (from cgroup_id()) rather than inode numbers, so the names should reflect that. [shakeel.butt@linux.dev: replace ino with id, per SeongJae] Link: https://lkml.kernel.org/r/flkqanhyettp5uq22bjwg37rtmnpeg3mghznsylxcxxgaafpl4@nov2x7tagma7 [akpm@linux-foundation.org: build fix] Link: https://lkml.kernel.org/r/20251225232116.294540-9-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt Acked-by: Michal Hocko Reviewed-by: SeongJae Park Cc: Axel Rasmussen Cc: Dave Chinner Cc: David Hildenbrand Cc: Johannes Weiner Cc: Lorenzo Stoakes Cc: Muchun Song Cc: Qi Zheng Cc: Roman Gushchin Cc: Wei Xu Cc: Yuanchu Xie Signed-off-by: Andrew Morton --- include/linux/memcontrol.h | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) (limited to 'include') diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 3e7d69020b39..ed4764e1a30e 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -830,12 +830,12 @@ static inline unsigned short mem_cgroup_private_id(struct mem_cgroup *memcg) } struct mem_cgroup *mem_cgroup_from_private_id(unsigned short id); -static inline u64 mem_cgroup_ino(struct mem_cgroup *memcg) +static inline u64 mem_cgroup_id(struct mem_cgroup *memcg) { return memcg ? 
cgroup_id(memcg->css.cgroup) : 0; } -struct mem_cgroup *mem_cgroup_get_from_ino(u64 ino); +struct mem_cgroup *mem_cgroup_get_from_id(u64 id); static inline struct mem_cgroup *mem_cgroup_from_seq(struct seq_file *m) { @@ -1288,12 +1288,12 @@ static inline struct mem_cgroup *mem_cgroup_from_private_id(unsigned short id) return NULL; } -static inline u64 mem_cgroup_ino(struct mem_cgroup *memcg) +static inline u64 mem_cgroup_id(struct mem_cgroup *memcg) { return 0; } -static inline struct mem_cgroup *mem_cgroup_get_from_ino(u64 ino) +static inline struct mem_cgroup *mem_cgroup_get_from_id(u64 id) { return NULL; } -- cgit v1.2.3 From 0be909f114c4e82a4fe5964851af1ab8889dc76c Mon Sep 17 00:00:00 2001 From: Sergey Senozhatsky Date: Wed, 7 Jan 2026 14:21:44 +0900 Subject: zsmalloc: use actual object size to detect spans Using class->size to detect spanning objects is not entirely correct, because some size classes can hold a range of object sizes of up to class->size bytes in length, due to size-classes merge. Such classes use padding for cases when actually written objects are smaller than class->size. zs_obj_read_begin() can incorrectly hit the slow path and perform memcpy of such objects, basically copying padding bytes. Instead of class->size zs_obj_read_begin() should use the actual compressed object length (both zram and zswap know it) so that it can correctly handle situations when a written object is small enough to fit into the first physical page. Link: https://lkml.kernel.org/r/20260107052145.3586917-1-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky Reviewed-by: Yosry Ahmed [zsmalloc & zswap] Reviewed-by: Nhat Pham Cc: Brian Geffon Cc: Chengming Zhou Cc: Jens Axboe Cc: Johannes Weiner Cc: Minchan Kim Signed-off-by: Andrew Morton --- include/linux/zsmalloc.h | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'include') diff --git a/include/linux/zsmalloc.h b/include/linux/zsmalloc.h index f3ccff2d966c..5565c3171007 100644 --- a/include/linux/zsmalloc.h +++ b/include/linux/zsmalloc.h @@ -40,9 +40,9 @@ unsigned int zs_lookup_class_index(struct zs_pool *pool, unsigned int size); void zs_pool_stats(struct zs_pool *pool, struct zs_pool_stats *stats); void *zs_obj_read_begin(struct zs_pool *pool, unsigned long handle, - void *local_copy); + size_t mem_len, void *local_copy); void zs_obj_read_end(struct zs_pool *pool, unsigned long handle, - void *handle_mem); + size_t mem_len, void *handle_mem); void zs_obj_write(struct zs_pool *pool, unsigned long handle, void *handle_mem, size_t mem_len); -- cgit v1.2.3 From 01152bd2e44d6bcecd3573d653221ba3944ed0f1 Mon Sep 17 00:00:00 2001 From: Kefeng Wang Date: Fri, 9 Jan 2026 17:31:31 +0800 Subject: mm: debug_vm_pgtable: add debug_vm_pgtable_free_huge_page() Patch series "mm: hugetlb: allocate frozen gigantic folio", v6. Introduce alloc_contig_frozen_pages() and cma_alloc_frozen_compound() which avoid atomic operation about page refcount, and then convert to allocate frozen gigantic folio by the new helpers in hugetlb to cleanup the alloc_gigantic_folio(). This patch (of 6): Add a new helper to free huge page to be consistency to debug_vm_pgtable_alloc_huge_page(), and use HPAGE_PUD_ORDER instead of open-code. Also move the free_contig_range() under CONFIG_ALLOC_CONTIG since all caller are built with CONFIG_ALLOC_CONTIG. 
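A sketch of what the consistency cleanup looks like at a PUD-sized call site. debug_vm_pgtable_alloc_huge_page() and HPAGE_PUD_ORDER exist today; the exact signature of the new debug_vm_pgtable_free_huge_page() counterpart is assumed here for illustration:

static void pud_huge_roundtrip(struct pgtable_debug_args *args)
{
	struct page *page;

	/*
	 * HPAGE_PUD_ORDER replaces the open-coded
	 * (HPAGE_PUD_SHIFT - PAGE_SHIFT) order computation.
	 */
	page = debug_vm_pgtable_alloc_huge_page(args, HPAGE_PUD_ORDER);
	if (!page)
		return;

	/* New counterpart so that allocation and free read symmetrically. */
	debug_vm_pgtable_free_huge_page(args, HPAGE_PUD_ORDER);
}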
Link: https://lkml.kernel.org/r/20260109093136.1491549-2-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang Acked-by: David Hildenbrand Reviewed-by: Zi Yan Reviewed-by: Muchun Song Reviewed-by: Sidhartha Kumar Cc: Brendan Jackman Cc: Jane Chu Cc: Johannes Weiner Cc: Matthew Wilcox (Oracle) Cc: Oscar Salvador Cc: Vlastimil Babka Cc: Claudiu Beznea Cc: Mark Brown Signed-off-by: Andrew Morton --- include/linux/gfp.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'include') diff --git a/include/linux/gfp.h b/include/linux/gfp.h index f9fdc99ae594..627157972f6a 100644 --- a/include/linux/gfp.h +++ b/include/linux/gfp.h @@ -444,8 +444,8 @@ extern struct page *alloc_contig_pages_noprof(unsigned long nr_pages, gfp_t gfp_ int nid, nodemask_t *nodemask); #define alloc_contig_pages(...) alloc_hooks(alloc_contig_pages_noprof(__VA_ARGS__)) -#endif void free_contig_range(unsigned long pfn, unsigned long nr_pages); +#endif #ifdef CONFIG_CONTIG_ALLOC static inline struct folio *folio_alloc_gigantic_noprof(int order, gfp_t gfp, -- cgit v1.2.3 From a9deb800b89efb2050453f7178e73b1d8b124e0f Mon Sep 17 00:00:00 2001 From: Kefeng Wang Date: Fri, 9 Jan 2026 17:31:32 +0800 Subject: mm: page_alloc: add __split_page() Factor out the splitting of non-compound page from make_alloc_exact() and split_page() into a new helper function __split_page(). While at it, convert the VM_BUG_ON_PAGE() into a VM_WARN_ON_PAGE(). Link: https://lkml.kernel.org/r/20260109093136.1491549-3-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang Acked-by: David Hildenbrand Acked-by: Muchun Song Reviewed-by: Zi Yan Reviewed-by: Sidhartha Kumar Cc: Brendan Jackman Cc: Jane Chu Cc: Johannes Weiner Cc: Matthew Wilcox (Oracle) Cc: Oscar Salvador Cc: Vlastimil Babka Cc: Claudiu Beznea Cc: Mark Brown Signed-off-by: Andrew Morton --- include/linux/mmdebug.h | 10 ++++++++++ 1 file changed, 10 insertions(+) (limited to 'include') diff --git a/include/linux/mmdebug.h b/include/linux/mmdebug.h index 14a45979cccc..ab60ffba08f5 100644 --- a/include/linux/mmdebug.h +++ b/include/linux/mmdebug.h @@ -47,6 +47,15 @@ void vma_iter_dump_tree(const struct vma_iterator *vmi); BUG(); \ } \ } while (0) +#define VM_WARN_ON_PAGE(cond, page) ({ \ + int __ret_warn = !!(cond); \ + \ + if (unlikely(__ret_warn)) { \ + dump_page(page, "VM_WARN_ON_PAGE(" __stringify(cond)")");\ + WARN_ON(1); \ + } \ + unlikely(__ret_warn); \ +}) #define VM_WARN_ON_ONCE_PAGE(cond, page) ({ \ static bool __section(".data..once") __warned; \ int __ret_warn_once = !!(cond); \ @@ -122,6 +131,7 @@ void vma_iter_dump_tree(const struct vma_iterator *vmi); #define VM_BUG_ON_MM(cond, mm) VM_BUG_ON(cond) #define VM_WARN_ON(cond) BUILD_BUG_ON_INVALID(cond) #define VM_WARN_ON_ONCE(cond) BUILD_BUG_ON_INVALID(cond) +#define VM_WARN_ON_PAGE(cond, page) BUILD_BUG_ON_INVALID(cond) #define VM_WARN_ON_ONCE_PAGE(cond, page) BUILD_BUG_ON_INVALID(cond) #define VM_WARN_ON_FOLIO(cond, folio) BUILD_BUG_ON_INVALID(cond) #define VM_WARN_ON_ONCE_FOLIO(cond, folio) BUILD_BUG_ON_INVALID(cond) -- cgit v1.2.3 From 6c08cc64d194dc5cc3dfc785517098d3b161c05f Mon Sep 17 00:00:00 2001 From: Kefeng Wang Date: Fri, 9 Jan 2026 17:31:33 +0800 Subject: mm: cma: kill cma_pages_valid() Kill cma_pages_valid() which only used in cma_release(), also cleanup code duplication between cma pages valid checking and cma memrange finding. 
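From the caller's side the net effect is that the separate pre-check disappears; a hypothetical user (not an actual call site) now relies purely on cma_release()'s return value:

#include <linux/cma.h>
#include <linux/mm.h>

static void release_cma_range(struct cma *cma, struct page *page,
			      unsigned long count)
{
	/*
	 * cma_release() performs the "does this range belong to @cma"
	 * validation internally and returns false if it does not, so no
	 * cma_pages_valid() call is needed (or available) beforehand.
	 */
	if (!cma_release(cma, page, count))
		pr_warn("pfn %lu + %lu is not part of this CMA area\n",
			page_to_pfn(page), count);
}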
Link: https://lkml.kernel.org/r/20260109093136.1491549-4-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang Reviewed-by: Jane Chu Reviewed-by: Zi Yan Reviewed-by: Muchun Song Acked-by: David Hildenbrand Cc: Brendan Jackman Cc: Johannes Weiner Cc: Matthew Wilcox (Oracle) Cc: Oscar Salvador Cc: Sidhartha Kumar Cc: Vlastimil Babka Cc: Claudiu Beznea Cc: Mark Brown Signed-off-by: Andrew Morton --- include/linux/cma.h | 1 - 1 file changed, 1 deletion(-) (limited to 'include') diff --git a/include/linux/cma.h b/include/linux/cma.h index 62d9c1cf6326..e5745d2aec55 100644 --- a/include/linux/cma.h +++ b/include/linux/cma.h @@ -49,7 +49,6 @@ extern int cma_init_reserved_mem(phys_addr_t base, phys_addr_t size, struct cma **res_cma); extern struct page *cma_alloc(struct cma *cma, unsigned long count, unsigned int align, bool no_warn); -extern bool cma_pages_valid(struct cma *cma, const struct page *pages, unsigned long count); extern bool cma_release(struct cma *cma, const struct page *pages, unsigned long count); extern int cma_for_each_area(int (*it)(struct cma *cma, void *data), void *data); -- cgit v1.2.3 From e0c1326779cc1b8e3a9e30ae273b89202ed4c82c Mon Sep 17 00:00:00 2001 From: Kefeng Wang Date: Fri, 9 Jan 2026 17:31:34 +0800 Subject: mm: page_alloc: add alloc_contig_frozen_{range,pages}() In order to allocate given range of pages or allocate compound pages without incrementing their refcount, adding two new helper alloc_contig_frozen_{range,pages}() which may be beneficial to some users (eg hugetlb). The new alloc_contig_{range,pages} only take !__GFP_COMP gfp now, and the free_contig_range() is refactored to only free non-compound pages, the only caller to free compound pages in cma_free_folio() is changed accordingly, and the free_contig_frozen_range() is provided to match the alloc_contig_frozen_range(), which is used to free frozen pages. Link: https://lkml.kernel.org/r/20260109093136.1491549-5-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang Reviewed-by: Zi Yan Reviewed-by: Sidhartha Kumar Cc: Brendan Jackman Cc: David Hildenbrand Cc: Jane Chu Cc: Johannes Weiner Cc: Matthew Wilcox (Oracle) Cc: Muchun Song Cc: Oscar Salvador Cc: Vlastimil Babka Cc: Claudiu Beznea Cc: Mark Brown Signed-off-by: Andrew Morton --- include/linux/gfp.h | 52 +++++++++++++++++++++------------------------------- 1 file changed, 21 insertions(+), 31 deletions(-) (limited to 'include') diff --git a/include/linux/gfp.h b/include/linux/gfp.h index 627157972f6a..6ecf6dda93e0 100644 --- a/include/linux/gfp.h +++ b/include/linux/gfp.h @@ -436,40 +436,30 @@ typedef unsigned int __bitwise acr_flags_t; #define ACR_FLAGS_CMA ((__force acr_flags_t)BIT(0)) // allocate for CMA /* The below functions must be run on a range from a single zone. */ -extern int alloc_contig_range_noprof(unsigned long start, unsigned long end, - acr_flags_t alloc_flags, gfp_t gfp_mask); -#define alloc_contig_range(...) alloc_hooks(alloc_contig_range_noprof(__VA_ARGS__)) - -extern struct page *alloc_contig_pages_noprof(unsigned long nr_pages, gfp_t gfp_mask, - int nid, nodemask_t *nodemask); -#define alloc_contig_pages(...) alloc_hooks(alloc_contig_pages_noprof(__VA_ARGS__)) - +int alloc_contig_frozen_range_noprof(unsigned long start, unsigned long end, + acr_flags_t alloc_flags, gfp_t gfp_mask); +#define alloc_contig_frozen_range(...) 
\ + alloc_hooks(alloc_contig_frozen_range_noprof(__VA_ARGS__)) + +int alloc_contig_range_noprof(unsigned long start, unsigned long end, + acr_flags_t alloc_flags, gfp_t gfp_mask); +#define alloc_contig_range(...) \ + alloc_hooks(alloc_contig_range_noprof(__VA_ARGS__)) + +struct page *alloc_contig_frozen_pages_noprof(unsigned long nr_pages, + gfp_t gfp_mask, int nid, nodemask_t *nodemask); +#define alloc_contig_frozen_pages(...) \ + alloc_hooks(alloc_contig_frozen_pages_noprof(__VA_ARGS__)) + +struct page *alloc_contig_pages_noprof(unsigned long nr_pages, gfp_t gfp_mask, + int nid, nodemask_t *nodemask); +#define alloc_contig_pages(...) \ + alloc_hooks(alloc_contig_pages_noprof(__VA_ARGS__)) + +void free_contig_frozen_range(unsigned long pfn, unsigned long nr_pages); void free_contig_range(unsigned long pfn, unsigned long nr_pages); #endif -#ifdef CONFIG_CONTIG_ALLOC -static inline struct folio *folio_alloc_gigantic_noprof(int order, gfp_t gfp, - int nid, nodemask_t *node) -{ - struct page *page; - - if (WARN_ON(!order || !(gfp & __GFP_COMP))) - return NULL; - - page = alloc_contig_pages_noprof(1 << order, gfp, nid, node); - - return page ? page_folio(page) : NULL; -} -#else -static inline struct folio *folio_alloc_gigantic_noprof(int order, gfp_t gfp, - int nid, nodemask_t *node) -{ - return NULL; -} -#endif -/* This should be paired with folio_put() rather than free_contig_range(). */ -#define folio_alloc_gigantic(...) alloc_hooks(folio_alloc_gigantic_noprof(__VA_ARGS__)) - DEFINE_FREE(free_page, void *, free_page((unsigned long)_T)) #endif /* __LINUX_GFP_H */ -- cgit v1.2.3 From 9bda131c6093e9c4a8739e2eeb65ba4d5fbefc2f Mon Sep 17 00:00:00 2001 From: Kefeng Wang Date: Fri, 9 Jan 2026 17:31:35 +0800 Subject: mm: cma: add cma_alloc_frozen{_compound}() Introduce cma_alloc_frozen{_compound}() helper to alloc pages without incrementing their refcount, then convert hugetlb cma to use the cma_alloc_frozen_compound() and cma_release_frozen() and remove the unused cma_{alloc,free}_folio(), also move the cma_validate_zones() into mm/internal.h since no outside user. The set_pages_refcounted() is only called to set non-compound pages after above changes, so remove the processing about PageHead. 
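A rough sketch of the hugetlb-style usage this enables. The cma_alloc_frozen_compound()/cma_release_frozen() helpers match the declarations below; the wrapper functions, the folio conversion and the exact point at which the refcount is (un)frozen are illustrative assumptions rather than the actual hugetlb conversion:

static struct folio *gigantic_folio_from_cma(struct cma *cma, int order)
{
	struct page *page;

	/* The compound page comes back with a frozen (zero) refcount. */
	page = cma_alloc_frozen_compound(cma, order);
	if (!page)
		return NULL;

	return page_folio(page);
}

static void gigantic_folio_to_cma(struct cma *cma, struct folio *folio)
{
	/* Only valid once the folio's refcount has been frozen again. */
	cma_release_frozen(cma, &folio->page, folio_nr_pages(folio));
}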
Link: https://lkml.kernel.org/r/20260109093136.1491549-6-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang Reviewed-by: Zi Yan Cc: Brendan Jackman Cc: David Hildenbrand Cc: Jane Chu Cc: Johannes Weiner Cc: Matthew Wilcox (Oracle) Cc: Muchun Song Cc: Oscar Salvador Cc: Sidhartha Kumar Cc: Vlastimil Babka Cc: Claudiu Beznea Cc: Mark Brown Signed-off-by: Andrew Morton --- include/linux/cma.h | 26 ++++++-------------------- 1 file changed, 6 insertions(+), 20 deletions(-) (limited to 'include') diff --git a/include/linux/cma.h b/include/linux/cma.h index e5745d2aec55..e2a690f7e77e 100644 --- a/include/linux/cma.h +++ b/include/linux/cma.h @@ -51,29 +51,15 @@ extern struct page *cma_alloc(struct cma *cma, unsigned long count, unsigned int bool no_warn); extern bool cma_release(struct cma *cma, const struct page *pages, unsigned long count); +struct page *cma_alloc_frozen(struct cma *cma, unsigned long count, + unsigned int align, bool no_warn); +struct page *cma_alloc_frozen_compound(struct cma *cma, unsigned int order); +bool cma_release_frozen(struct cma *cma, const struct page *pages, + unsigned long count); + extern int cma_for_each_area(int (*it)(struct cma *cma, void *data), void *data); extern bool cma_intersects(struct cma *cma, unsigned long start, unsigned long end); extern void cma_reserve_pages_on_error(struct cma *cma); -#ifdef CONFIG_CMA -struct folio *cma_alloc_folio(struct cma *cma, int order, gfp_t gfp); -bool cma_free_folio(struct cma *cma, const struct folio *folio); -bool cma_validate_zones(struct cma *cma); -#else -static inline struct folio *cma_alloc_folio(struct cma *cma, int order, gfp_t gfp) -{ - return NULL; -} - -static inline bool cma_free_folio(struct cma *cma, const struct folio *folio) -{ - return false; -} -static inline bool cma_validate_zones(struct cma *cma) -{ - return false; -} -#endif - #endif -- cgit v1.2.3 From 4835e2871321fd9cf5bc9702dded323e3e3fbc1a Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Tue, 13 Jan 2026 07:27:06 -0800 Subject: mm/damon/core: introduce [in]active memory ratio damos quota goal metric Patch series "mm/damon: advance DAMOS-based LRU sorting". DAMOS_LRU_[DE]PRIO actions were added to DAMOS for more access-aware LRU lists sorting. For simple usage, a specialized kernel module, namely DAMON_LRU_SORT, has also been introduced. After the introduction of the module, DAMON got a few important new features, including the aim-based quota auto-tuning, age tracking, young page filter, and monitoring intervals auto-tuning. Meanwhile, DAMOS-based LRU sorting had no direct updates. Now we show some rooms to advance for DAMOS-based LRU sorting. Firstly, the aim-oriented quota auto-tuning can simplify the LRU sorting parameters tuning. But there is no good auto-tuning target metric for LRU sorting use case. Secondly, the behavior of DAMOS_LRU_[DE]PRIO are not very symmetric. DAMOS_LRU_DEPRIO directly moves the pages to inactive LRU list, while DAMOS_LRU_PRIO only marks the page as accessed, so that the page can not directly but only eventually moved to the active LRU list. Finally, DAMON_LRU_SORT users cannot utilize the modern features that can be useful for them, too. Improve the situation with the following changes. First, introduce a new DAMOS quota auto-tuning target metric for active:inactive memory size ratio. Since LRU sorting is a kind of balancing of active and inactive pages, the active:inactive memory size ratio can be intuitively set. 
Second, update DAMOS_LRU_[DE]PRIO behaviors to be more intuitive and symmetric, by letting them directly move the pages to [in]active LRU list. Third, update the DAMON_LRU_SORT module user interface to be able to fully utilize the modern features including the [in]active memory size ratio-based quota auto-tuning, young page filter, and monitoring intervals auto-tuning. With these changes, for example, users can now ask DAMON to "find hot/cold memory regions with auto-tuned monitoring intervals, do one more page level access check for found hot/cold memory, and move pages of those to active or inactive LRU lists accordingly, aiming X:Y active to inactive memory ratio." For example, if they know 30% of the memory is better to be protected from reclamation, 30:70 can be set as the target ratio. Test Results ------------ I ran DAMON_LRU_SORT with the features introduced by this series, on a real world server workload. For the active:inactive ratio goal, I set 50:50. I confirmed it achieves the target active:inactive ratio, without manual tuning of the monitoring intervals and the hot/coldness thresholds. The baseline system that was not running the DAMON_LRU_SORT was keeping active:inactive ratio of about 1:10. Note that the test didn't show a clear performance difference, though. I believe that was mainly because the workload was not very memory intensive. Also, whether the 50:50 target ratio was optimum is unclear. Nonetheless, the positive performance impact of the basic LRU sorting idea is already confirmed with the initial DAMON_LRU_SORT introduction patch series. The goal of this patch series is simplifying the parameters tuning of DAMOS-based LRU sorting, and the test confirmed the aimed goals are achieved. Patches Sequence ---------------- First three patches extend DAMOS quota auto-tuning to support [in]active memory ratio target metric type. Those (patches 1-3) introduce new metrics, implement DAMON sysfs support, and update the documentation, respectively. Following patch (patch 4) makes DAMOS_LRU_PRIO action to directly move target pages to active LRU list, instead of only marking them accessed. Following seven patches (patches 5-11) updates DAMON_LRU_SORT to support modern DAMON features. Patch 5 makes it uses not only access frequency but also age at under-quota regions prioritization. Patches 6-11 add the support for young page filtering, active:inactive memory ratio based quota auto-tuning, and monitoring intervals auto-tuning, with appropriate document updates. This patch (of 11): DAMOS_LRU_[DE]PRIO are DAMOS actions for making balance of active and inactive memory size. There is no appropriate DAMOS quota auto-tuning target metric for the use case. Add two new DAMOS quota goal metrics for the purpose, namely DAMOS_QUOTA_[IN]ACTIVE_MEM_BP. Those will represent the ratio of [in]active memory to total (inactive + active) memory. Hence, users will be able to ask DAMON to, for example, "find hot and cold memory, and move pages of those to active and inactive LRU lists, adjusting the hot/cold thresholds aiming 50:50 active:inactive memory ratio." 
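A small sketch of how such a ratio goal translates into the new metric, assuming "_BP" keeps DAMON's usual per-ten-thousand (basis point) scale and that the tuner compares the goal against the active share of total LRU memory; this is not the in-kernel computation:

static unsigned long active_mem_bp(unsigned long active, unsigned long inactive)
{
	if (!active && !inactive)
		return 0;
	return active * 10000 / (active + inactive);
}

Under that assumption, the 50:50 goal mentioned above corresponds to a DAMOS_QUOTA_ACTIVE_MEM_BP target of 5000, and the 30:70 protection example to a target of 3000.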
Link: https://lkml.kernel.org/r/20260113152717.70459-1-sj@kernel.org Link: https://lkml.kernel.org/r/20260113152717.70459-2-sj@kernel.org Signed-off-by: SeongJae Park Cc: David Hildenbrand Cc: Jonathan Corbet Cc: Liam Howlett Cc: Lorenzo Stoakes Cc: Michal Hocko Cc: Mike Rapoport Cc: Suren Baghdasaryan Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- include/linux/damon.h | 4 ++++ 1 file changed, 4 insertions(+) (limited to 'include') diff --git a/include/linux/damon.h b/include/linux/damon.h index 650e7ecfa32b..26fb8e90dff6 100644 --- a/include/linux/damon.h +++ b/include/linux/damon.h @@ -155,6 +155,8 @@ enum damos_action { * @DAMOS_QUOTA_NODE_MEM_FREE_BP: MemFree ratio of a node. * @DAMOS_QUOTA_NODE_MEMCG_USED_BP: MemUsed ratio of a node for a cgroup. * @DAMOS_QUOTA_NODE_MEMCG_FREE_BP: MemFree ratio of a node for a cgroup. + * @DAMOS_QUOTA_ACTIVE_MEM_BP: Active to total LRU memory ratio. + * @DAMOS_QUOTA_INACTIVE_MEM_BP: Inactive to total LRU memory ratio. * @NR_DAMOS_QUOTA_GOAL_METRICS: Number of DAMOS quota goal metrics. * * Metrics equal to larger than @NR_DAMOS_QUOTA_GOAL_METRICS are unsupported. @@ -166,6 +168,8 @@ enum damos_quota_goal_metric { DAMOS_QUOTA_NODE_MEM_FREE_BP, DAMOS_QUOTA_NODE_MEMCG_USED_BP, DAMOS_QUOTA_NODE_MEMCG_FREE_BP, + DAMOS_QUOTA_ACTIVE_MEM_BP, + DAMOS_QUOTA_INACTIVE_MEM_BP, NR_DAMOS_QUOTA_GOAL_METRICS, }; -- cgit v1.2.3 From dc2e4982cb018306f0699cd460a9033467f07be5 Mon Sep 17 00:00:00 2001 From: Sergey Senozhatsky Date: Tue, 13 Jan 2026 12:46:45 +0900 Subject: zsmalloc: introduce SG-list based object read API Currently, zsmalloc performs address linearization on read (which sometimes requires memcpy() to a local buffer). Not all zsmalloc users need a linear address. For example, Crypto API supports SG-list, performing linearization under the hood, if needed. In addition, some compressors can have native SG-list support, completely avoiding the linearization step. Provide an SG-list based zsmalloc read API: - zs_obj_read_sg_begin() - zs_obj_read_sg_end() This API allows callers to obtain an SG representation of the object (one entry for objects that are contained in a single page and two entries for spanning objects), avoiding the need for a bounce buffer and memcpy. 
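A sketch of how a caller such as zram or zswap might use the new pair. The two function signatures are taken from the header change below; the caller-side sg_init_table() and the decompression step are assumptions:

#include <linux/scatterlist.h>
#include <linux/zsmalloc.h>

static void read_object_via_sg(struct zs_pool *pool, unsigned long handle,
			       size_t obj_len)
{
	/* One entry for same-page objects, two for spanning objects. */
	struct scatterlist sg[2];

	sg_init_table(sg, 2);	/* may be redundant if zsmalloc (re)inits it */

	zs_obj_read_sg_begin(pool, handle, sg, obj_len);

	/* ... hand @sg to an SG-aware (de)compressor here ... */

	zs_obj_read_sg_end(pool, handle);
}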
[senozhatsky@chromium.org: make zs_obj_read_sg_begin() return void, per Yosry] Link: https://lkml.kernel.org/r/20260117024900.792237-1-senozhatsky@chromium.org Link: https://lkml.kernel.org/r/20260113034645.2729998-1-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky Acked-by: Herbert Xu Tested-by: Yosry Ahmed Cc: Herbert Xu Cc: Brian Geffon Cc: Johannes Weiner Cc: Minchan Kim Cc: Nhat Pham Signed-off-by: Andrew Morton --- include/linux/zsmalloc.h | 4 ++++ 1 file changed, 4 insertions(+) (limited to 'include') diff --git a/include/linux/zsmalloc.h b/include/linux/zsmalloc.h index 5565c3171007..478410c880b1 100644 --- a/include/linux/zsmalloc.h +++ b/include/linux/zsmalloc.h @@ -22,6 +22,7 @@ struct zs_pool_stats { }; struct zs_pool; +struct scatterlist; struct zs_pool *zs_create_pool(const char *name); void zs_destroy_pool(struct zs_pool *pool); @@ -43,6 +44,9 @@ void *zs_obj_read_begin(struct zs_pool *pool, unsigned long handle, size_t mem_len, void *local_copy); void zs_obj_read_end(struct zs_pool *pool, unsigned long handle, size_t mem_len, void *handle_mem); +void zs_obj_read_sg_begin(struct zs_pool *pool, unsigned long handle, + struct scatterlist *sg, size_t mem_len); +void zs_obj_read_sg_end(struct zs_pool *pool, unsigned long handle); void zs_obj_write(struct zs_pool *pool, unsigned long handle, void *handle_mem, size_t mem_len); -- cgit v1.2.3 From 3d702678f57edc524f73a7865382ae304269f590 Mon Sep 17 00:00:00 2001 From: Jinjiang Tu Date: Tue, 23 Dec 2025 19:05:23 +0800 Subject: mm/mempolicy: fix mpol_rebind_nodemask() for MPOL_F_NUMA_BALANCING commit bda420b98505 ("numa balancing: migrate on fault among multiple bound nodes") adds new flag MPOL_F_NUMA_BALANCING to enable NUMA balancing for MPOL_BIND memory policy. When the cpuset of tasks changes, the mempolicy of the task is rebound by mpol_rebind_nodemask(). When MPOL_F_STATIC_NODES and MPOL_F_RELATIVE_NODES are both not set, the behaviour of rebinding should be same whenever MPOL_F_NUMA_BALANCING is set or not. So, when an application calls set_mempolicy() with MPOL_F_NUMA_BALANCING set but both MPOL_F_STATIC_NODES and MPOL_F_RELATIVE_NODES cleared, mempolicy.w.cpuset_mems_allowed should be set to cpuset_current_mems_allowed nodemask. However, in current implementation, mpol_store_user_nodemask() wrongly returns true, causing mempolicy->w.user_nodemask to be incorrectly set to the user-specified nodemask. Later, when the cpuset of the application changes, mpol_rebind_nodemask() ends up rebinding based on the user-specified nodemask rather than the cpuset_mems_allowed nodemask as intended. I can reproduce with the following steps in qemu with 4 NUMA nodes: 1. echo '+cpuset' > /sys/fs/cgroup/cgroup.subtree_control 2. mkdir /sys/fs/cgroup/test 3. ./reproducer & 4. cat /proc/$pid/numa_maps, the task is bound to NUMA 1 5. echo $pid > /sys/fs/cgroup/test/cgroup.procs 6. cat /proc/$pid/numa_maps, the task is bound to NUMA 0 now. The reproducer code: int main() { struct bitmask *bmp; int ret; bmp = numa_parse_nodestring("1"); ret = set_mempolicy(MPOL_BIND | MPOL_F_NUMA_BALANCING, bmp->maskp, bmp->size + 1); if (ret < 0) { perror("Failed to call set_mempolicy"); exit(-1); } while (1); return 0; } If I call set_mempolicy() without MPOL_F_NUMA_BALANCING in the reproducer code. After step 5, the task is still bound to NUMA 1. To fix this, only set mempolicy->w.user_nodemask to the user-specified nodemask if MPOL_F_STATIC_NODES or MPOL_F_RELATIVE_NODES is present. 
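The shape of the fix in mm/mempolicy.c is presumably a small change on top of the new define below; the following is a sketch reconstructed from the commit message, not the verbatim hunk:

/*
 * Only remember the user-specified nodemask when static/relative node
 * semantics were requested; plain MPOL_F_NUMA_BALANCING must keep
 * rebinding to the cpuset's mems_allowed.
 */
static bool mpol_store_user_nodemask(const struct mempolicy *pol)
{
	return pol->flags & MPOL_USER_NODEMASK_FLAGS;
}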
Link: https://lkml.kernel.org/r/20260120011018.1256654-1-tujinjiang@huawei.com Link: https://lkml.kernel.org/r/20251223110523.1161421-1-tujinjiang@huawei.com Fixes: bda420b98505 ("numa balancing: migrate on fault among multiple bound nodes") Signed-off-by: Jinjiang Tu Reviewed-by: Gregory Price Reviewed-by: Huang Ying Acked-by: David Hildenbrand (Red Hat) Cc: Alistair Popple Cc: Byungchul Park Cc: Joshua Hahn Cc: Kefeng Wang Cc: Mathew Brost Cc: Mel Gorman Cc: Rakie Kim Cc: Zi Yan Signed-off-by: Andrew Morton --- include/uapi/linux/mempolicy.h | 3 +++ 1 file changed, 3 insertions(+) (limited to 'include') diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h index 8fbbe613611a..6c962d866e86 100644 --- a/include/uapi/linux/mempolicy.h +++ b/include/uapi/linux/mempolicy.h @@ -39,6 +39,9 @@ enum { #define MPOL_MODE_FLAGS \ (MPOL_F_STATIC_NODES | MPOL_F_RELATIVE_NODES | MPOL_F_NUMA_BALANCING) +/* Whether the nodemask is specified by users */ +#define MPOL_USER_NODEMASK_FLAGS (MPOL_F_STATIC_NODES | MPOL_F_RELATIVE_NODES) + /* Flags for get_mempolicy */ #define MPOL_F_NODE (1<<0) /* return next IL mode instead of node mask */ #define MPOL_F_ADDR (1<<1) /* look up vma using address */ -- cgit v1.2.3 From 832d95b5314eea558cf4cc9ca40db10122ce8f63 Mon Sep 17 00:00:00 2001 From: "Matthew Wilcox (Oracle)" Date: Fri, 9 Jan 2026 04:13:43 +0000 Subject: migrate: replace RMP_ flags with TTU_ flags Instead of translating between RMP_ and TTU_ flags, remove the RMP_ flags and just use the TTU_ flag space; there's plenty available. Possibly we should rename these to RMAP_ flags, and maybe even pass them in through rmap_walk_arg, but that can be done later. Link: https://lkml.kernel.org/r/20260109041345.3863089-3-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) Acked-by: David Hildenbrand (Red Hat) Reviewed-by: Lorenzo Stoakes Reviewed-by: Zi Yan Cc: Alistair Popple Cc: Byungchul Park Cc: Gregory Price Cc: Jann Horn Cc: Joshua Hahn Cc: Lance Yang Cc: Liam Howlett Cc: Matthew Brost Cc: Rakie Kim Cc: Rik van Riel Cc: Vlastimil Babka Cc: Ying Huang Signed-off-by: Andrew Morton --- include/linux/rmap.h | 9 +++------ 1 file changed, 3 insertions(+), 6 deletions(-) (limited to 'include') diff --git a/include/linux/rmap.h b/include/linux/rmap.h index dd764951b03d..8dc0871e5f00 100644 --- a/include/linux/rmap.h +++ b/include/linux/rmap.h @@ -92,6 +92,7 @@ struct anon_vma_chain { }; enum ttu_flags { + TTU_USE_SHARED_ZEROPAGE = 0x2, /* for unused pages of large folios */ TTU_SPLIT_HUGE_PMD = 0x4, /* split huge PMD if any */ TTU_IGNORE_MLOCK = 0x8, /* ignore mlock */ TTU_SYNC = 0x10, /* avoid racy checks with PVMW_SYNC */ @@ -933,12 +934,8 @@ int mapping_wrprotect_range(struct address_space *mapping, pgoff_t pgoff, int pfn_mkclean_range(unsigned long pfn, unsigned long nr_pages, pgoff_t pgoff, struct vm_area_struct *vma); -enum rmp_flags { - RMP_LOCKED = 1 << 0, - RMP_USE_SHARED_ZEROPAGE = 1 << 1, -}; - -void remove_migration_ptes(struct folio *src, struct folio *dst, int flags); +void remove_migration_ptes(struct folio *src, struct folio *dst, + enum ttu_flags flags); /* * rmap_walk_control: To control rmap traversing for specific needs -- cgit v1.2.3 From c4a0c5ff85b7ca0d5fbd71888965f40e55295b19 Mon Sep 17 00:00:00 2001 From: Rohan McLure Date: Fri, 19 Dec 2025 04:09:35 +1100 Subject: mm/page_table_check: reinstate address parameter in [__]page_table_check_pud[s]_set() This reverts commit 6d144436d954 ("mm/page_table_check: remove unused parameter in [__]page_table_check_pud_set"). 
Reinstate previously unused parameters for the purpose of supporting powerpc platforms, as many do not encode user/kernel ownership of the page in the pte, but instead in the address of the access. Apply this to __page_table_check_puds_set(), page_table_check_puds_set() and the page_table_check_pud_set() wrapper macro. [ajd@linux.ibm.com: rebase on riscv + arm64 changes, update commit message] Link: https://lkml.kernel.org/r/20251219-pgtable_check_v18rebase-v18-3-755bc151a50b@linux.ibm.com Signed-off-by: Rohan McLure Signed-off-by: Andrew Donnellan Reviewed-by: Pasha Tatashin Acked-by: Ingo Molnar # x86 Acked-by: Alexandre Ghiti # riscv Cc: Alexander Gordeev Cc: Alexandre Ghiti Cc: Alistair Popple Cc: Christophe Leroy Cc: "Christophe Leroy (CS GROUP)" Cc: David Hildenbrand Cc: Donet Tom Cc: Guo Weikang Cc: Jason Gunthorpe Cc: Kevin Brodsky Cc: Madhavan Srinivasan Cc: Magnus Lindholm Cc: "Matthew Wilcox (Oracle)" Cc: Michael Ellerman Cc: Nicholas Miehlbradt Cc: Nicholas Piggin Cc: Paul Mackerras Cc: Qi Zheng Cc: "Ritesh Harjani (IBM)" Cc: Sweet Tea Dorminy Cc: Thomas Huth Cc: "Vishal Moola (Oracle)" Cc: Zi Yan Signed-off-by: Andrew Morton --- include/linux/page_table_check.h | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) (limited to 'include') diff --git a/include/linux/page_table_check.h b/include/linux/page_table_check.h index 289620d4aad3..0bf18b884a12 100644 --- a/include/linux/page_table_check.h +++ b/include/linux/page_table_check.h @@ -21,8 +21,8 @@ void __page_table_check_ptes_set(struct mm_struct *mm, pte_t *ptep, pte_t pte, unsigned int nr); void __page_table_check_pmds_set(struct mm_struct *mm, pmd_t *pmdp, pmd_t pmd, unsigned int nr); -void __page_table_check_puds_set(struct mm_struct *mm, pud_t *pudp, pud_t pud, - unsigned int nr); +void __page_table_check_puds_set(struct mm_struct *mm, unsigned long addr, + pud_t *pudp, pud_t pud, unsigned int nr); void __page_table_check_pte_clear_range(struct mm_struct *mm, unsigned long addr, pmd_t pmd); @@ -86,12 +86,12 @@ static inline void page_table_check_pmds_set(struct mm_struct *mm, } static inline void page_table_check_puds_set(struct mm_struct *mm, - pud_t *pudp, pud_t pud, unsigned int nr) + unsigned long addr, pud_t *pudp, pud_t pud, unsigned int nr) { if (static_branch_likely(&page_table_check_disabled)) return; - __page_table_check_puds_set(mm, pudp, pud, nr); + __page_table_check_puds_set(mm, addr, pudp, pud, nr); } static inline void page_table_check_pte_clear_range(struct mm_struct *mm, @@ -137,7 +137,7 @@ static inline void page_table_check_pmds_set(struct mm_struct *mm, } static inline void page_table_check_puds_set(struct mm_struct *mm, - pud_t *pudp, pud_t pud, unsigned int nr) + unsigned long addr, pud_t *pudp, pud_t pud, unsigned int nr) { } @@ -150,6 +150,6 @@ static inline void page_table_check_pte_clear_range(struct mm_struct *mm, #endif /* CONFIG_PAGE_TABLE_CHECK */ #define page_table_check_pmd_set(mm, pmdp, pmd) page_table_check_pmds_set(mm, pmdp, pmd, 1) -#define page_table_check_pud_set(mm, pudp, pud) page_table_check_puds_set(mm, pudp, pud, 1) +#define page_table_check_pud_set(mm, addr, pudp, pud) page_table_check_puds_set(mm, addr, pudp, pud, 1) #endif /* __LINUX_PAGE_TABLE_CHECK_H */ -- cgit v1.2.3 From 6e2d8f9fc4edcbf9f4dd953e1f41b0ff64867e5b Mon Sep 17 00:00:00 2001 From: Rohan McLure Date: Fri, 19 Dec 2025 04:09:36 +1100 Subject: mm/page_table_check: reinstate address parameter in [__]page_table_check_pmd[s]_set() This reverts commit a3b837130b58 ("mm/page_table_check: remove unused 
parameter in [__]page_table_check_pmd_set"). Reinstate previously unused parameters for the purpose of supporting powerpc platforms, as many do not encode user/kernel ownership of the page in the pte, but instead in the address of the access. Apply this to __page_table_check_pmds_set(), page_table_check_pmd_set(), and the page_table_check_pmd_set() wrapper macro. [ajd@linux.ibm.com: rebase on arm64 + riscv changes, update commit message] Link: https://lkml.kernel.org/r/20251219-pgtable_check_v18rebase-v18-4-755bc151a50b@linux.ibm.com Signed-off-by: Rohan McLure Signed-off-by: Andrew Donnellan Reviewed-by: Pasha Tatashin Acked-by: Ingo Molnar # x86 Acked-by: Alexandre Ghiti # riscv Cc: Alexander Gordeev Cc: Alexandre Ghiti Cc: Alistair Popple Cc: Christophe Leroy Cc: "Christophe Leroy (CS GROUP)" Cc: David Hildenbrand Cc: Donet Tom Cc: Guo Weikang Cc: Jason Gunthorpe Cc: Kevin Brodsky Cc: Madhavan Srinivasan Cc: Magnus Lindholm Cc: "Matthew Wilcox (Oracle)" Cc: Michael Ellerman Cc: Nicholas Miehlbradt Cc: Nicholas Piggin Cc: Paul Mackerras Cc: Qi Zheng Cc: "Ritesh Harjani (IBM)" Cc: Sweet Tea Dorminy Cc: Thomas Huth Cc: "Vishal Moola (Oracle)" Cc: Zi Yan Signed-off-by: Andrew Morton --- include/linux/page_table_check.h | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) (limited to 'include') diff --git a/include/linux/page_table_check.h b/include/linux/page_table_check.h index 0bf18b884a12..cf7c28d8d468 100644 --- a/include/linux/page_table_check.h +++ b/include/linux/page_table_check.h @@ -19,8 +19,8 @@ void __page_table_check_pmd_clear(struct mm_struct *mm, pmd_t pmd); void __page_table_check_pud_clear(struct mm_struct *mm, pud_t pud); void __page_table_check_ptes_set(struct mm_struct *mm, pte_t *ptep, pte_t pte, unsigned int nr); -void __page_table_check_pmds_set(struct mm_struct *mm, pmd_t *pmdp, pmd_t pmd, - unsigned int nr); +void __page_table_check_pmds_set(struct mm_struct *mm, unsigned long addr, + pmd_t *pmdp, pmd_t pmd, unsigned int nr); void __page_table_check_puds_set(struct mm_struct *mm, unsigned long addr, pud_t *pudp, pud_t pud, unsigned int nr); void __page_table_check_pte_clear_range(struct mm_struct *mm, @@ -77,12 +77,12 @@ static inline void page_table_check_ptes_set(struct mm_struct *mm, } static inline void page_table_check_pmds_set(struct mm_struct *mm, - pmd_t *pmdp, pmd_t pmd, unsigned int nr) + unsigned long addr, pmd_t *pmdp, pmd_t pmd, unsigned int nr) { if (static_branch_likely(&page_table_check_disabled)) return; - __page_table_check_pmds_set(mm, pmdp, pmd, nr); + __page_table_check_pmds_set(mm, addr, pmdp, pmd, nr); } static inline void page_table_check_puds_set(struct mm_struct *mm, @@ -132,7 +132,7 @@ static inline void page_table_check_ptes_set(struct mm_struct *mm, } static inline void page_table_check_pmds_set(struct mm_struct *mm, - pmd_t *pmdp, pmd_t pmd, unsigned int nr) + unsigned long addr, pmd_t *pmdp, pmd_t pmd, unsigned int nr) { } @@ -149,7 +149,7 @@ static inline void page_table_check_pte_clear_range(struct mm_struct *mm, #endif /* CONFIG_PAGE_TABLE_CHECK */ -#define page_table_check_pmd_set(mm, pmdp, pmd) page_table_check_pmds_set(mm, pmdp, pmd, 1) +#define page_table_check_pmd_set(mm, addr, pmdp, pmd) page_table_check_pmds_set(mm, addr, pmdp, pmd, 1) #define page_table_check_pud_set(mm, addr, pudp, pud) page_table_check_puds_set(mm, addr, pudp, pud, 1) #endif /* __LINUX_PAGE_TABLE_CHECK_H */ -- cgit v1.2.3 From 0a5ae4483177a621f5498c349d31f24b1ef10739 Mon Sep 17 00:00:00 2001 From: Rohan McLure Date: Fri, 19 Dec 2025 04:09:37 
+1100 Subject: mm/page_table_check: provide addr parameter to page_table_check_ptes_set() To provide support for powerpc platforms, provide an addr parameter to the __page_table_check_ptes_set() and page_table_check_ptes_set() routines. This parameter is needed on some powerpc platforms which do not encode whether a mapping is for user or kernel in the pte. On such platforms, this can be inferred from the addr parameter. [ajd@linux.ibm.com: rebase on arm64 + riscv changes, update commit message] Link: https://lkml.kernel.org/r/20251219-pgtable_check_v18rebase-v18-5-755bc151a50b@linux.ibm.com Signed-off-by: Rohan McLure Reviewed-by: Pasha Tatashin Acked-by: Alexandre Ghiti # riscv Signed-off-by: Andrew Donnellan Cc: Alexander Gordeev Cc: Alexandre Ghiti Cc: Alistair Popple Cc: Christophe Leroy Cc: "Christophe Leroy (CS GROUP)" Cc: David Hildenbrand Cc: Donet Tom Cc: Guo Weikang Cc: Ingo Molnar Cc: Jason Gunthorpe Cc: Kevin Brodsky Cc: Madhavan Srinivasan Cc: Magnus Lindholm Cc: "Matthew Wilcox (Oracle)" Cc: Michael Ellerman Cc: Nicholas Miehlbradt Cc: Nicholas Piggin Cc: Paul Mackerras Cc: Qi Zheng Cc: "Ritesh Harjani (IBM)" Cc: Sweet Tea Dorminy Cc: Thomas Huth Cc: "Vishal Moola (Oracle)" Cc: Zi Yan Signed-off-by: Andrew Morton --- include/linux/page_table_check.h | 12 +++++++----- include/linux/pgtable.h | 2 +- 2 files changed, 8 insertions(+), 6 deletions(-) (limited to 'include') diff --git a/include/linux/page_table_check.h b/include/linux/page_table_check.h index cf7c28d8d468..66e109238416 100644 --- a/include/linux/page_table_check.h +++ b/include/linux/page_table_check.h @@ -17,8 +17,8 @@ void __page_table_check_zero(struct page *page, unsigned int order); void __page_table_check_pte_clear(struct mm_struct *mm, pte_t pte); void __page_table_check_pmd_clear(struct mm_struct *mm, pmd_t pmd); void __page_table_check_pud_clear(struct mm_struct *mm, pud_t pud); -void __page_table_check_ptes_set(struct mm_struct *mm, pte_t *ptep, pte_t pte, - unsigned int nr); +void __page_table_check_ptes_set(struct mm_struct *mm, unsigned long addr, + pte_t *ptep, pte_t pte, unsigned int nr); void __page_table_check_pmds_set(struct mm_struct *mm, unsigned long addr, pmd_t *pmdp, pmd_t pmd, unsigned int nr); void __page_table_check_puds_set(struct mm_struct *mm, unsigned long addr, @@ -68,12 +68,13 @@ static inline void page_table_check_pud_clear(struct mm_struct *mm, pud_t pud) } static inline void page_table_check_ptes_set(struct mm_struct *mm, - pte_t *ptep, pte_t pte, unsigned int nr) + unsigned long addr, pte_t *ptep, + pte_t pte, unsigned int nr) { if (static_branch_likely(&page_table_check_disabled)) return; - __page_table_check_ptes_set(mm, ptep, pte, nr); + __page_table_check_ptes_set(mm, addr, ptep, pte, nr); } static inline void page_table_check_pmds_set(struct mm_struct *mm, @@ -127,7 +128,8 @@ static inline void page_table_check_pud_clear(struct mm_struct *mm, pud_t pud) } static inline void page_table_check_ptes_set(struct mm_struct *mm, - pte_t *ptep, pte_t pte, unsigned int nr) + unsigned long addr, pte_t *ptep, + pte_t pte, unsigned int nr) { } diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h index 2f0dd3a4ace1..496873f44f67 100644 --- a/include/linux/pgtable.h +++ b/include/linux/pgtable.h @@ -429,7 +429,7 @@ static inline pte_t pte_advance_pfn(pte_t pte, unsigned long nr) static inline void set_ptes(struct mm_struct *mm, unsigned long addr, pte_t *ptep, pte_t pte, unsigned int nr) { - page_table_check_ptes_set(mm, ptep, pte, nr); + page_table_check_ptes_set(mm, addr, 
ptep, pte, nr); for (;;) { set_pte(ptep, pte); -- cgit v1.2.3 From 2e6ac078ce5d6a9dc96cab861359faac508eb56d Mon Sep 17 00:00:00 2001 From: Rohan McLure Date: Fri, 19 Dec 2025 04:09:38 +1100 Subject: mm/page_table_check: reinstate address parameter in [__]page_table_check_pud_clear() This reverts commit 931c38e16499 ("mm/page_table_check: remove unused parameter in [__]page_table_check_pud_clear"). Reinstate previously unused parameters for the purpose of supporting powerpc platforms, as many do not encode user/kernel ownership of the page in the pte, but instead in the address of the access. [ajd@linux.ibm.com: rebase on arm64 changes] Link: https://lkml.kernel.org/r/20251219-pgtable_check_v18rebase-v18-6-755bc151a50b@linux.ibm.com Signed-off-by: Rohan McLure Signed-off-by: Andrew Donnellan Reviewed-by: Pasha Tatashin Acked-by: Ingo Molnar # x86 Cc: Alexander Gordeev Cc: Alexandre Ghiti Cc: Alexandre Ghiti Cc: Alistair Popple Cc: Christophe Leroy Cc: "Christophe Leroy (CS GROUP)" Cc: David Hildenbrand Cc: Donet Tom Cc: Guo Weikang Cc: Jason Gunthorpe Cc: Kevin Brodsky Cc: Madhavan Srinivasan Cc: Magnus Lindholm Cc: "Matthew Wilcox (Oracle)" Cc: Michael Ellerman Cc: Nicholas Miehlbradt Cc: Nicholas Piggin Cc: Paul Mackerras Cc: Qi Zheng Cc: "Ritesh Harjani (IBM)" Cc: Sweet Tea Dorminy Cc: Thomas Huth Cc: "Vishal Moola (Oracle)" Cc: Zi Yan Signed-off-by: Andrew Morton --- include/linux/page_table_check.h | 11 +++++++---- include/linux/pgtable.h | 2 +- 2 files changed, 8 insertions(+), 5 deletions(-) (limited to 'include') diff --git a/include/linux/page_table_check.h b/include/linux/page_table_check.h index 66e109238416..808cc3a48c28 100644 --- a/include/linux/page_table_check.h +++ b/include/linux/page_table_check.h @@ -16,7 +16,8 @@ extern struct page_ext_operations page_table_check_ops; void __page_table_check_zero(struct page *page, unsigned int order); void __page_table_check_pte_clear(struct mm_struct *mm, pte_t pte); void __page_table_check_pmd_clear(struct mm_struct *mm, pmd_t pmd); -void __page_table_check_pud_clear(struct mm_struct *mm, pud_t pud); +void __page_table_check_pud_clear(struct mm_struct *mm, unsigned long addr, + pud_t pud); void __page_table_check_ptes_set(struct mm_struct *mm, unsigned long addr, pte_t *ptep, pte_t pte, unsigned int nr); void __page_table_check_pmds_set(struct mm_struct *mm, unsigned long addr, @@ -59,12 +60,13 @@ static inline void page_table_check_pmd_clear(struct mm_struct *mm, pmd_t pmd) __page_table_check_pmd_clear(mm, pmd); } -static inline void page_table_check_pud_clear(struct mm_struct *mm, pud_t pud) +static inline void page_table_check_pud_clear(struct mm_struct *mm, + unsigned long addr, pud_t pud) { if (static_branch_likely(&page_table_check_disabled)) return; - __page_table_check_pud_clear(mm, pud); + __page_table_check_pud_clear(mm, addr, pud); } static inline void page_table_check_ptes_set(struct mm_struct *mm, @@ -123,7 +125,8 @@ static inline void page_table_check_pmd_clear(struct mm_struct *mm, pmd_t pmd) { } -static inline void page_table_check_pud_clear(struct mm_struct *mm, pud_t pud) +static inline void page_table_check_pud_clear(struct mm_struct *mm, + unsigned long addr, pud_t pud) { } diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h index 496873f44f67..ed3c28ebeb35 100644 --- a/include/linux/pgtable.h +++ b/include/linux/pgtable.h @@ -801,7 +801,7 @@ static inline pud_t pudp_huge_get_and_clear(struct mm_struct *mm, pud_t pud = *pudp; pud_clear(pudp); - page_table_check_pud_clear(mm, pud); + 
page_table_check_pud_clear(mm, address, pud); return pud; } -- cgit v1.2.3 From 649ec9e3d03c4908ef51731cd7b422c4a3e2ccff Mon Sep 17 00:00:00 2001 From: Rohan McLure Date: Fri, 19 Dec 2025 04:09:39 +1100 Subject: mm/page_table_check: reinstate address parameter in [__]page_table_check_pmd_clear() This reverts commit 1831414cd729 ("mm/page_table_check: remove unused parameter in [__]page_table_check_pmd_clear"). Reinstate previously unused parameters for the purpose of supporting powerpc platforms, as many do not encode user/kernel ownership of the page in the pte, but instead in the address of the access. [ajd@linux.ibm.com: rebase on arm64 changes] Link: https://lkml.kernel.org/r/20251219-pgtable_check_v18rebase-v18-7-755bc151a50b@linux.ibm.com Signed-off-by: Rohan McLure Signed-off-by: Andrew Donnellan Reviewed-by: Pasha Tatashin Acked-by: Ingo Molnar # x86 Acked-by: Alexandre Ghiti # riscv Cc: Alexander Gordeev Cc: Alexandre Ghiti Cc: Alistair Popple Cc: Christophe Leroy Cc: "Christophe Leroy (CS GROUP)" Cc: David Hildenbrand Cc: Donet Tom Cc: Guo Weikang Cc: Jason Gunthorpe Cc: Kevin Brodsky Cc: Madhavan Srinivasan Cc: Magnus Lindholm Cc: "Matthew Wilcox (Oracle)" Cc: Michael Ellerman Cc: Nicholas Miehlbradt Cc: Nicholas Piggin Cc: Paul Mackerras Cc: Qi Zheng Cc: "Ritesh Harjani (IBM)" Cc: Sweet Tea Dorminy Cc: Thomas Huth Cc: "Vishal Moola (Oracle)" Cc: Zi Yan Signed-off-by: Andrew Morton --- include/linux/page_table_check.h | 11 +++++++---- include/linux/pgtable.h | 2 +- 2 files changed, 8 insertions(+), 5 deletions(-) (limited to 'include') diff --git a/include/linux/page_table_check.h b/include/linux/page_table_check.h index 808cc3a48c28..3973b69ae294 100644 --- a/include/linux/page_table_check.h +++ b/include/linux/page_table_check.h @@ -15,7 +15,8 @@ extern struct page_ext_operations page_table_check_ops; void __page_table_check_zero(struct page *page, unsigned int order); void __page_table_check_pte_clear(struct mm_struct *mm, pte_t pte); -void __page_table_check_pmd_clear(struct mm_struct *mm, pmd_t pmd); +void __page_table_check_pmd_clear(struct mm_struct *mm, unsigned long addr, + pmd_t pmd); void __page_table_check_pud_clear(struct mm_struct *mm, unsigned long addr, pud_t pud); void __page_table_check_ptes_set(struct mm_struct *mm, unsigned long addr, @@ -52,12 +53,13 @@ static inline void page_table_check_pte_clear(struct mm_struct *mm, pte_t pte) __page_table_check_pte_clear(mm, pte); } -static inline void page_table_check_pmd_clear(struct mm_struct *mm, pmd_t pmd) +static inline void page_table_check_pmd_clear(struct mm_struct *mm, + unsigned long addr, pmd_t pmd) { if (static_branch_likely(&page_table_check_disabled)) return; - __page_table_check_pmd_clear(mm, pmd); + __page_table_check_pmd_clear(mm, addr, pmd); } static inline void page_table_check_pud_clear(struct mm_struct *mm, @@ -121,7 +123,8 @@ static inline void page_table_check_pte_clear(struct mm_struct *mm, pte_t pte) { } -static inline void page_table_check_pmd_clear(struct mm_struct *mm, pmd_t pmd) +static inline void page_table_check_pmd_clear(struct mm_struct *mm, + unsigned long addr, pmd_t pmd) { } diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h index ed3c28ebeb35..2d1f7369624c 100644 --- a/include/linux/pgtable.h +++ b/include/linux/pgtable.h @@ -788,7 +788,7 @@ static inline pmd_t pmdp_huge_get_and_clear(struct mm_struct *mm, pmd_t pmd = *pmdp; pmd_clear(pmdp); - page_table_check_pmd_clear(mm, pmd); + page_table_check_pmd_clear(mm, address, pmd); return pmd; } -- cgit v1.2.3 From 
d7b4b67eb6b37aef1723a69add88c9a7add81308 Mon Sep 17 00:00:00 2001 From: Rohan McLure Date: Fri, 19 Dec 2025 04:09:40 +1100 Subject: mm/page_table_check: reinstate address parameter in [__]page_table_check_pte_clear() This reverts commit aa232204c468 ("mm/page_table_check: remove unused parameter in [__]page_table_check_pte_clear"). Reinstate previously unused parameters for the purpose of supporting powerpc platforms, as many do not encode user/kernel ownership of the page in the pte, but instead in the address of the access. [ajd@linux.ibm.com: rebase, fix additional occurrence and loop handling] Link: https://lkml.kernel.org/r/20251219-pgtable_check_v18rebase-v18-8-755bc151a50b@linux.ibm.com Signed-off-by: Rohan McLure Signed-off-by: Andrew Donnellan Reviewed-by: Pasha Tatashin Acked-by: Ingo Molnar # x86 Acked-by: Alexandre Ghiti # riscv Cc: Alexander Gordeev Cc: Alexandre Ghiti Cc: Alistair Popple Cc: Christophe Leroy Cc: "Christophe Leroy (CS GROUP)" Cc: David Hildenbrand Cc: Donet Tom Cc: Guo Weikang Cc: Jason Gunthorpe Cc: Kevin Brodsky Cc: Madhavan Srinivasan Cc: Magnus Lindholm Cc: "Matthew Wilcox (Oracle)" Cc: Michael Ellerman Cc: Nicholas Miehlbradt Cc: Nicholas Piggin Cc: Paul Mackerras Cc: Qi Zheng Cc: "Ritesh Harjani (IBM)" Cc: Sweet Tea Dorminy Cc: Thomas Huth Cc: "Vishal Moola (Oracle)" Cc: Zi Yan Signed-off-by: Andrew Morton --- include/linux/page_table_check.h | 11 +++++++---- include/linux/pgtable.h | 4 ++-- 2 files changed, 9 insertions(+), 6 deletions(-) (limited to 'include') diff --git a/include/linux/page_table_check.h b/include/linux/page_table_check.h index 3973b69ae294..12268a32e8be 100644 --- a/include/linux/page_table_check.h +++ b/include/linux/page_table_check.h @@ -14,7 +14,8 @@ extern struct static_key_true page_table_check_disabled; extern struct page_ext_operations page_table_check_ops; void __page_table_check_zero(struct page *page, unsigned int order); -void __page_table_check_pte_clear(struct mm_struct *mm, pte_t pte); +void __page_table_check_pte_clear(struct mm_struct *mm, unsigned long addr, + pte_t pte); void __page_table_check_pmd_clear(struct mm_struct *mm, unsigned long addr, pmd_t pmd); void __page_table_check_pud_clear(struct mm_struct *mm, unsigned long addr, @@ -45,12 +46,13 @@ static inline void page_table_check_free(struct page *page, unsigned int order) __page_table_check_zero(page, order); } -static inline void page_table_check_pte_clear(struct mm_struct *mm, pte_t pte) +static inline void page_table_check_pte_clear(struct mm_struct *mm, + unsigned long addr, pte_t pte) { if (static_branch_likely(&page_table_check_disabled)) return; - __page_table_check_pte_clear(mm, pte); + __page_table_check_pte_clear(mm, addr, pte); } static inline void page_table_check_pmd_clear(struct mm_struct *mm, @@ -119,7 +121,8 @@ static inline void page_table_check_free(struct page *page, unsigned int order) { } -static inline void page_table_check_pte_clear(struct mm_struct *mm, pte_t pte) +static inline void page_table_check_pte_clear(struct mm_struct *mm, + unsigned long addr, pte_t pte) { } diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h index 2d1f7369624c..827dca25c0bc 100644 --- a/include/linux/pgtable.h +++ b/include/linux/pgtable.h @@ -634,7 +634,7 @@ static inline pte_t ptep_get_and_clear(struct mm_struct *mm, { pte_t pte = ptep_get(ptep); pte_clear(mm, address, ptep); - page_table_check_pte_clear(mm, pte); + page_table_check_pte_clear(mm, address, pte); return pte; } #endif @@ -693,7 +693,7 @@ static inline void ptep_clear(struct 
mm_struct *mm, unsigned long addr, * No need for ptep_get_and_clear(): page table check doesn't care about * any bits that could have been set by HW concurrently. */ - page_table_check_pte_clear(mm, pte); + page_table_check_pte_clear(mm, addr, pte); } #ifdef CONFIG_GUP_GET_PXX_LOW_HIGH -- cgit v1.2.3 From cbc064e708b687cd2dbc2b788c473e2a34e10f7c Mon Sep 17 00:00:00 2001 From: Yury Norov Date: Wed, 14 Jan 2026 12:22:13 -0500 Subject: nodemask: propagate boolean for nodes_and{,not} MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Patch series "nodemask: align nodes_and{,not} with underlying bitmap ops". nodes_and{,not} are void despite that underlying bitmap_and(,not) return boolean, true if the result bitmap is non-empty. Align nodemask API, and simplify client code. This patch (of 3): Bitmap functions bitmap_and{,not} return boolean depending on emptiness of the result bitmap. The corresponding nodemask helpers ignore the returned value. Propagate the underlying bitmaps result to nodemasks users, as it simplifies user code. Link: https://lkml.kernel.org/r/20260114172217.861204-1-ynorov@nvidia.com Link: https://lkml.kernel.org/r/20260114172217.861204-2-ynorov@nvidia.com Signed-off-by: Yury Norov Reviewed-by: Gregory Price Reviewed-by: Joshua Hahn Reviewed-by: David Hildenbrand (Red Hat) Cc: Alistair Popple Cc: Byungchul Park Cc: "Huang, Ying" Cc: Johannes Weiner Cc: Liam Howlett Cc: Lorenzo Stoakes Cc: Mathew Brost Cc: Michal Hocko Cc: Michal Koutný Cc: Mike Rapoport Cc: Rakie Kim Cc: Rasmus Villemoes Cc: Suren Baghdasaryan Cc: Tejun Heo Cc: Vlastimil Babka Cc: Waiman Long Cc: Yury Norov (NVIDIA) Cc: Zi Yan Signed-off-by: Andrew Morton --- include/linux/nodemask.h | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) (limited to 'include') diff --git a/include/linux/nodemask.h b/include/linux/nodemask.h index bd38648c998d..204c92462f3c 100644 --- a/include/linux/nodemask.h +++ b/include/linux/nodemask.h @@ -157,10 +157,10 @@ static __always_inline bool __node_test_and_set(int node, nodemask_t *addr) #define nodes_and(dst, src1, src2) \ __nodes_and(&(dst), &(src1), &(src2), MAX_NUMNODES) -static __always_inline void __nodes_and(nodemask_t *dstp, const nodemask_t *src1p, +static __always_inline bool __nodes_and(nodemask_t *dstp, const nodemask_t *src1p, const nodemask_t *src2p, unsigned int nbits) { - bitmap_and(dstp->bits, src1p->bits, src2p->bits, nbits); + return bitmap_and(dstp->bits, src1p->bits, src2p->bits, nbits); } #define nodes_or(dst, src1, src2) \ @@ -181,10 +181,10 @@ static __always_inline void __nodes_xor(nodemask_t *dstp, const nodemask_t *src1 #define nodes_andnot(dst, src1, src2) \ __nodes_andnot(&(dst), &(src1), &(src2), MAX_NUMNODES) -static __always_inline void __nodes_andnot(nodemask_t *dstp, const nodemask_t *src1p, +static __always_inline bool __nodes_andnot(nodemask_t *dstp, const nodemask_t *src1p, const nodemask_t *src2p, unsigned int nbits) { - bitmap_andnot(dstp->bits, src1p->bits, src2p->bits, nbits); + return bitmap_andnot(dstp->bits, src1p->bits, src2p->bits, nbits); } #define nodes_copy(dst, src) __nodes_copy(&(dst), &(src), MAX_NUMNODES) -- cgit v1.2.3 From 4262c53236977de3ceaa3bf2aefdf772c9b874dd Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Thu, 15 Jan 2026 07:20:41 -0800 Subject: mm/damon/core: implement damon_kdamond_pid() Patch series "mm/damon: hide kdamond and kdamond_lock from API callers". 
'kdamond' and 'kdamond_lock' fields initially exposed to DAMON API callers for flexible synchronization and use cases. As DAMON API became somewhat complicated compared to the early days, Keeping those exposed could only encourage the API callers to invent more creative but complicated and difficult-to-debug use cases. Fortunately DAMON API callers didn't invent that many creative use cases. There exist only two use cases of 'kdamond' and 'kdamond_lock'. Finding whether the kdamond is actively running, and getting the pid of the kdamond. For the first use case, a dedicated API function, namely 'damon_is_running()' is provided, and all DAMON API callers are using the function for the use case. Hence only the second use case is where the fields are directly being used by DAMON API callers. To prevent future invention of complicated and erroneous use cases of the fields, hide the fields from the API callers. For that, provide new dedicated DAMON API functions for the remaining use case, namely damon_kdamond_pid(), migrate DAMON API callers to use the new function, and mark the fields as private fields. This patch (of 5): 'kdamond' and 'kdamond_lock' are directly being used by DAMON API callers for getting the pid of the corresponding kdamond. To discourage invention of creative but complicated and erroneous new usages of the fields that require careful synchronization, implement a new API function that can simply be used without the manual synchronizations. Link: https://lkml.kernel.org/r/20260115152047.68415-1-sj@kernel.org Link: https://lkml.kernel.org/r/20260115152047.68415-2-sj@kernel.org Signed-off-by: SeongJae Park Signed-off-by: Andrew Morton --- include/linux/damon.h | 1 + 1 file changed, 1 insertion(+) (limited to 'include') diff --git a/include/linux/damon.h b/include/linux/damon.h index 26fb8e90dff6..5b7ea7082134 100644 --- a/include/linux/damon.h +++ b/include/linux/damon.h @@ -972,6 +972,7 @@ bool damon_initialized(void); int damon_start(struct damon_ctx **ctxs, int nr_ctxs, bool exclusive); int damon_stop(struct damon_ctx **ctxs, int nr_ctxs); bool damon_is_running(struct damon_ctx *ctx); +int damon_kdamond_pid(struct damon_ctx *ctx); int damon_call(struct damon_ctx *ctx, struct damon_call_control *control); int damos_walk(struct damon_ctx *ctx, struct damos_walk_control *control); -- cgit v1.2.3 From 6fe0e6d599a6bb4b65704285d40d4972423b7aaa Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Thu, 15 Jan 2026 07:20:45 -0800 Subject: mm/damon: hide kdamond and kdamond_lock of damon_ctx There is no DAMON API caller that directly access 'kdamond' and 'kdamond_lock' fields of 'struct damon_ctx'. Keeping those exposed could only encourage creative but error-prone usages. Hide them from DAMON API callers by marking those as private fields. Link: https://lkml.kernel.org/r/20260115152047.68415-6-sj@kernel.org Signed-off-by: SeongJae Park Signed-off-by: Andrew Morton --- include/linux/damon.h | 29 ++++++++++++++--------------- 1 file changed, 14 insertions(+), 15 deletions(-) (limited to 'include') diff --git a/include/linux/damon.h b/include/linux/damon.h index 5b7ea7082134..e6930d8574d3 100644 --- a/include/linux/damon.h +++ b/include/linux/damon.h @@ -759,23 +759,20 @@ struct damon_attrs { * of the monitoring. * * @attrs: Monitoring attributes for accuracy/overhead control. - * @kdamond: Kernel thread who does the monitoring. - * @kdamond_lock: Mutex for the synchronizations with @kdamond. * - * For each monitoring context, one kernel thread for the monitoring is - * created. 
The pointer to the thread is stored in @kdamond. + * For each monitoring context, one kernel thread for the monitoring, namely + * kdamond, is created. The pid of kdamond can be retrieved using + * damon_kdamond_pid(). * - * Once started, the monitoring thread runs until explicitly required to be - * terminated or every monitoring target is invalid. The validity of the - * targets is checked via the &damon_operations.target_valid of @ops. The - * termination can also be explicitly requested by calling damon_stop(). - * The thread sets @kdamond to NULL when it terminates. Therefore, users can - * know whether the monitoring is ongoing or terminated by reading @kdamond. - * Reads and writes to @kdamond from outside of the monitoring thread must - * be protected by @kdamond_lock. + * Once started, kdamond runs until explicitly required to be terminated or + * every monitoring target is invalid. The validity of the targets is checked + * via the &damon_operations.target_valid of @ops. The termination can also be + * explicitly requested by calling damon_stop(). To know if a kdamond is + * running, damon_is_running() can be used. * - * Note that the monitoring thread protects only @kdamond via @kdamond_lock. - * Accesses to other fields must be protected by themselves. + * While the kdamond is running, all accesses to &struct damon_ctx from a + * thread other than the kdamond should be made using safe DAMON APIs, + * including damon_call() and damos_walk(). * * @ops: Set of monitoring operations for given use cases. * @addr_unit: Scale factor for core to ops address conversion. @@ -816,10 +813,12 @@ struct damon_ctx { struct damos_walk_control *walk_control; struct mutex walk_control_lock; -/* public: */ + /* Working thread of the given DAMON context */ struct task_struct *kdamond; + /* Protects @kdamond field access */ struct mutex kdamond_lock; +/* public: */ struct damon_operations ops; unsigned long addr_unit; unsigned long min_sz_region; -- cgit v1.2.3 From 3ab981c1fca08721a2cc100d4e097d4e0c9e149b Mon Sep 17 00:00:00 2001 From: Shivank Garg Date: Sun, 18 Jan 2026 19:22:57 +0000 Subject: mm/khugepaged: change collapse_pte_mapped_thp() to return void The only external caller of collapse_pte_mapped_thp() is uprobe, which ignores the return value. Change the external API to return void to simplify the interface. Introduce try_collapse_pte_mapped_thp() for internal use that preserves the return value. This prepares for future patch that will convert the return type to use enum scan_result. Link: https://lkml.kernel.org/r/20260118192253.9263-10-shivankg@amd.com Signed-off-by: Shivank Garg Suggested-by: David Hildenbrand (Red Hat) Acked-by: Lance Yang Acked-by: David Hildenbrand (Red Hat) Reviewed-by: Zi Yan Tested-by: Nico Pache Reviewed-by: Nico Pache Cc: Anshuman Khandual Cc: Baolin Wang Cc: Barry Song Cc: Dev Jain Cc: Liam R. 
Howlett Cc: Lorenzo Stoakes Cc: Ryan Roberts Cc: Wei Yang Signed-off-by: Andrew Morton --- include/linux/khugepaged.h | 9 ++++----- 1 file changed, 4 insertions(+), 5 deletions(-) (limited to 'include') diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h index eb1946a70cff..d7a9053ff4fe 100644 --- a/include/linux/khugepaged.h +++ b/include/linux/khugepaged.h @@ -17,8 +17,8 @@ extern void khugepaged_enter_vma(struct vm_area_struct *vma, vm_flags_t vm_flags); extern void khugepaged_min_free_kbytes_update(void); extern bool current_is_khugepaged(void); -extern int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr, - bool install_pmd); +void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr, + bool install_pmd); static inline void khugepaged_fork(struct mm_struct *mm, struct mm_struct *oldmm) { @@ -42,10 +42,9 @@ static inline void khugepaged_enter_vma(struct vm_area_struct *vma, vm_flags_t vm_flags) { } -static inline int collapse_pte_mapped_thp(struct mm_struct *mm, - unsigned long addr, bool install_pmd) +static inline void collapse_pte_mapped_thp(struct mm_struct *mm, + unsigned long addr, bool install_pmd) { - return 0; } static inline void khugepaged_min_free_kbytes_update(void) -- cgit v1.2.3 From a00de9ba30aa71fe68ab45a9d2df595a7c39dd74 Mon Sep 17 00:00:00 2001 From: "David Hildenbrand (Red Hat)" Date: Tue, 20 Jan 2026 00:01:14 +0100 Subject: mm/balloon_compaction: centralize adjust_managed_page_count() handling MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Let's centralize it by allowing the driver to enable this handling through a new flag (bool for now) in the balloon device info. Note that we now adjust the counter when adding a page to / removing a page from the balloon list: when removing a page to deflate it, the adjustment now happens before the driver communicates with the hypervisor, not afterwards. This shouldn't make a difference in practice. Link: https://lkml.kernel.org/r/20260119230133.3551867-7-david@kernel.org Signed-off-by: David Hildenbrand (Red Hat) Acked-by: Liam R. Howlett Acked-by: Michael S.
Tsirkin Cc: Arnd Bergmann Cc: Christophe Leroy Cc: Eugenio Pérez Cc: Greg Kroah-Hartman Cc: Jason Wang Cc: Jerrin Shaji George Cc: Jonathan Corbet Cc: Lorenzo Stoakes Cc: Madhavan Srinivasan Cc: Michael Ellerman Cc: Michal Hocko Cc: Mike Rapoport Cc: Nicholas Piggin Cc: Oscar Salvador Cc: SeongJae Park Cc: Suren Baghdasaryan Cc: Vlastimil Babka Cc: Xuan Zhuo Cc: Zi Yan Signed-off-by: Andrew Morton --- include/linux/balloon_compaction.h | 2 ++ 1 file changed, 2 insertions(+) (limited to 'include') diff --git a/include/linux/balloon_compaction.h b/include/linux/balloon_compaction.h index 7cfe48769239..3109d3c43d30 100644 --- a/include/linux/balloon_compaction.h +++ b/include/linux/balloon_compaction.h @@ -56,6 +56,7 @@ struct balloon_dev_info { struct list_head pages; /* Pages enqueued & handled to Host */ int (*migratepage)(struct balloon_dev_info *, struct page *newpage, struct page *page, enum migrate_mode mode); + bool adjust_managed_page_count; }; extern struct page *balloon_page_alloc(void); @@ -73,6 +74,7 @@ static inline void balloon_devinfo_init(struct balloon_dev_info *balloon) spin_lock_init(&balloon->pages_lock); INIT_LIST_HEAD(&balloon->pages); balloon->migratepage = NULL; + balloon->adjust_managed_page_count = false; } #ifdef CONFIG_BALLOON_COMPACTION -- cgit v1.2.3 From 8202313e3dfa9bdeb73427b564cfe2bfd02e4807 Mon Sep 17 00:00:00 2001 From: "David Hildenbrand (Red Hat)" Date: Tue, 20 Jan 2026 00:01:16 +0100 Subject: mm/balloon_compaction: use a device-independent balloon (list) lock MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit In order to remove the dependency on the page lock for balloon pages, we need a lock that is independent of the page. It's crucial that we can handle the scenario where balloon deflation (clearing page->private) can race with page isolation (using page->private to obtain the balloon_dev_info where the lock currently resides). The current lock in balloon_dev_info is therefore not suitable. Fortunately, we never really have more than a single balloon device per VM, so we can just keep it simple and use a static lock to protect all balloon devices. Based on this change we will remove the dependency on the page lock next. Link: https://lkml.kernel.org/r/20260119230133.3551867-9-david@kernel.org Signed-off-by: David Hildenbrand (Red Hat) Acked-by: Michael S. Tsirkin Cc: Arnd Bergmann Cc: Christophe Leroy Cc: Eugenio Pérez Cc: Greg Kroah-Hartman Cc: Jason Wang Cc: Jerrin Shaji George Cc: Jonathan Corbet Cc: Liam Howlett Cc: Lorenzo Stoakes Cc: Madhavan Srinivasan Cc: Michael Ellerman Cc: Michal Hocko Cc: Mike Rapoport Cc: Nicholas Piggin Cc: Oscar Salvador Cc: SeongJae Park Cc: Suren Baghdasaryan Cc: Vlastimil Babka Cc: Xuan Zhuo Cc: Zi Yan Signed-off-by: Andrew Morton --- include/linux/balloon_compaction.h | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-) (limited to 'include') diff --git a/include/linux/balloon_compaction.h b/include/linux/balloon_compaction.h index 3109d3c43d30..9a8568fcd477 100644 --- a/include/linux/balloon_compaction.h +++ b/include/linux/balloon_compaction.h @@ -21,10 +21,10 @@ * i. Setting the PG_movable_ops flag and page->private with the following * lock order * +-page_lock(page); - * +--spin_lock_irq(&b_dev_info->pages_lock); + * +--spin_lock_irq(&balloon_pages_lock); * * ii. isolation or dequeueing procedure must remove the page from balloon - * device page list under b_dev_info->pages_lock. 
+ * device page list under balloon_pages_lock * * The functions provided by this interface are placed to help on coping with * the aforementioned balloon page corner case, as well as to ensure the simple @@ -52,7 +52,6 @@ */ struct balloon_dev_info { unsigned long isolated_pages; /* # of isolated pages for migration */ - spinlock_t pages_lock; /* Protection to pages list */ struct list_head pages; /* Pages enqueued & handled to Host */ int (*migratepage)(struct balloon_dev_info *, struct page *newpage, struct page *page, enum migrate_mode mode); @@ -71,7 +70,6 @@ extern size_t balloon_page_list_dequeue(struct balloon_dev_info *b_dev_info, static inline void balloon_devinfo_init(struct balloon_dev_info *balloon) { balloon->isolated_pages = 0; - spin_lock_init(&balloon->pages_lock); INIT_LIST_HEAD(&balloon->pages); balloon->migratepage = NULL; balloon->adjust_managed_page_count = false; -- cgit v1.2.3 From a3fafdd3896719923f7055b6d7f10f6ee1950d8b Mon Sep 17 00:00:00 2001 From: "David Hildenbrand (Red Hat)" Date: Tue, 20 Jan 2026 00:01:17 +0100 Subject: mm/balloon_compaction: remove dependency on page lock MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Let's stop using the page lock in balloon code and instead use only the balloon_device_lock. As soon as we set the PG_movable_ops flag, we might now get isolation callbacks for that page as we are no longer holding the page lock. In there, we'll simply synchronize using the balloon_device_lock. So in balloon_page_isolate() lookup the balloon_dev_info through page->private under balloon_device_lock. It's crucial that we update page->private under the balloon_device_lock, so the isolation callback can properly deal with concurrent deflation. Consequently, make sure that balloon_page_finalize() is called under balloon_device_lock as we remove a page from the list and clear page->private. balloon_page_insert() is already called with the balloon_device_lock held. Note that the core will still lock the pages, for example in isolate_movable_ops_page(). The lock is there still relevant for handling the PageMovableOpsIsolated flag, but that can be later changed to use an atomic test-and-set instead, or moved into the movable_ops backends. Link: https://lkml.kernel.org/r/20260119230133.3551867-10-david@kernel.org Signed-off-by: David Hildenbrand (Red Hat) Acked-by: Michael S. Tsirkin Cc: Arnd Bergmann Cc: Christophe Leroy Cc: Eugenio Pérez Cc: Greg Kroah-Hartman Cc: Jason Wang Cc: Jerrin Shaji George Cc: Jonathan Corbet Cc: Liam Howlett Cc: Lorenzo Stoakes Cc: Madhavan Srinivasan Cc: Michael Ellerman Cc: Michal Hocko Cc: Mike Rapoport Cc: Nicholas Piggin Cc: Oscar Salvador Cc: SeongJae Park Cc: Suren Baghdasaryan Cc: Vlastimil Babka Cc: Xuan Zhuo Cc: Zi Yan Signed-off-by: Andrew Morton --- include/linux/balloon_compaction.h | 25 +++++++++++++------------ 1 file changed, 13 insertions(+), 12 deletions(-) (limited to 'include') diff --git a/include/linux/balloon_compaction.h b/include/linux/balloon_compaction.h index 9a8568fcd477..ad594af6ed10 100644 --- a/include/linux/balloon_compaction.h +++ b/include/linux/balloon_compaction.h @@ -12,25 +12,27 @@ * is derived from the page type (PageOffline()) combined with the * PG_movable_ops flag (PageMovableOps()). 
* + * Once the page type and the PG_movable_ops are set, migration code + * can initiate page isolation by invoking the + * movable_operations()->isolate_page() callback + * + * As long as page->private is set, the page is either on the balloon list + * or isolated for migration. If page->private is not set, the page is + * either still getting inflated, or was deflated to be freed by the balloon + * driver soon. Isolation is impossible in both cases. + * * As the page isolation scanning step a compaction thread does is a lockless * procedure (from a page standpoint), it might bring some racy situations while * performing balloon page compaction. In order to sort out these racy scenarios * and safely perform balloon's page compaction and migration we must, always, * ensure following these simple rules: * - * i. Setting the PG_movable_ops flag and page->private with the following - * lock order - * +-page_lock(page); - * +--spin_lock_irq(&balloon_pages_lock); + * i. Inflation/deflation must set/clear page->private under the + * balloon_pages_lock * * ii. isolation or dequeueing procedure must remove the page from balloon * device page list under balloon_pages_lock * - * The functions provided by this interface are placed to help on coping with - * the aforementioned balloon page corner case, as well as to ensure the simple - * set of exposed rules are satisfied while we are dealing with balloon pages - * compaction / migration. - * * Copyright (C) 2012, Red Hat, Inc. Rafael Aquini */ #ifndef _LINUX_BALLOON_COMPACTION_H @@ -93,8 +95,7 @@ static inline struct balloon_dev_info *balloon_page_device(struct page *page) * @balloon : pointer to balloon device * @page : page to be assigned as a 'balloon page' * - * Caller must ensure the page is locked and the spin_lock protecting balloon - * pages list is held before inserting a page into the balloon device. + * Caller must ensure the balloon_pages_lock is held. */ static inline void balloon_page_insert(struct balloon_dev_info *balloon, struct page *page) @@ -119,7 +120,7 @@ static inline gfp_t balloon_mapping_gfp_mask(void) * balloon list for release to the page allocator * @page: page to be released to the page allocator * - * Caller must ensure that the page is locked. + * Caller must ensure the balloon_pages_lock is held. */ static inline void balloon_page_finalize(struct page *page) { -- cgit v1.2.3 From ddc50a97bef1e34c096bf3f0dc9590d7f570ed7b Mon Sep 17 00:00:00 2001 From: "David Hildenbrand (Red Hat)" Date: Tue, 20 Jan 2026 00:01:18 +0100 Subject: mm/balloon_compaction: make balloon_mops static MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit There is no need to expose this anymore, so let's just make it static. Link: https://lkml.kernel.org/r/20260119230133.3551867-11-david@kernel.org Signed-off-by: David Hildenbrand (Red Hat) Reviewed-by: Lorenzo Stoakes Acked-by: Michael S. 
Tsirkin Cc: Arnd Bergmann Cc: Christophe Leroy Cc: Eugenio Pérez Cc: Greg Kroah-Hartman Cc: Jason Wang Cc: Jerrin Shaji George Cc: Jonathan Corbet Cc: Liam Howlett Cc: Madhavan Srinivasan Cc: Michael Ellerman Cc: Michal Hocko Cc: Mike Rapoport Cc: Nicholas Piggin Cc: Oscar Salvador Cc: SeongJae Park Cc: Suren Baghdasaryan Cc: Vlastimil Babka Cc: Xuan Zhuo Cc: Zi Yan Signed-off-by: Andrew Morton --- include/linux/balloon_compaction.h | 1 - 1 file changed, 1 deletion(-) (limited to 'include') diff --git a/include/linux/balloon_compaction.h b/include/linux/balloon_compaction.h index ad594af6ed10..7db66c2c86cd 100644 --- a/include/linux/balloon_compaction.h +++ b/include/linux/balloon_compaction.h @@ -78,7 +78,6 @@ static inline void balloon_devinfo_init(struct balloon_dev_info *balloon) } #ifdef CONFIG_BALLOON_COMPACTION -extern const struct movable_operations balloon_mops; /* * balloon_page_device - get the b_dev_info descriptor for the balloon device * that enqueues the given page. -- cgit v1.2.3 From aa974cbf949e94c79b46a0053d40229bc634f9be Mon Sep 17 00:00:00 2001 From: "David Hildenbrand (Red Hat)" Date: Tue, 20 Jan 2026 00:01:19 +0100 Subject: mm/balloon_compaction: drop fs.h include from balloon_compaction.h MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Ever since commit 68f2736a8583 ("mm: Convert all PageMovable users to movable_operations") we no longer store an inode in balloon_dev_info, so we can stop including "fs.h". Link: https://lkml.kernel.org/r/20260119230133.3551867-12-david@kernel.org Signed-off-by: David Hildenbrand (Red Hat) Reviewed-by: Lorenzo Stoakes Acked-by: Michael S. Tsirkin Cc: Arnd Bergmann Cc: Christophe Leroy Cc: Eugenio Pérez Cc: Greg Kroah-Hartman Cc: Jason Wang Cc: Jerrin Shaji George Cc: Jonathan Corbet Cc: Liam Howlett Cc: Madhavan Srinivasan Cc: Michael Ellerman Cc: Michal Hocko Cc: Mike Rapoport Cc: Nicholas Piggin Cc: Oscar Salvador Cc: SeongJae Park Cc: Suren Baghdasaryan Cc: Vlastimil Babka Cc: Xuan Zhuo Cc: Zi Yan Signed-off-by: Andrew Morton --- include/linux/balloon_compaction.h | 1 - 1 file changed, 1 deletion(-) (limited to 'include') diff --git a/include/linux/balloon_compaction.h b/include/linux/balloon_compaction.h index 7db66c2c86cd..1452ea063524 100644 --- a/include/linux/balloon_compaction.h +++ b/include/linux/balloon_compaction.h @@ -42,7 +42,6 @@ #include #include #include -#include #include /* -- cgit v1.2.3 From 0fa3e9a48bafde8aa5a5b994b05396e9b86ce156 Mon Sep 17 00:00:00 2001 From: "David Hildenbrand (Red Hat)" Date: Tue, 20 Jan 2026 00:01:21 +0100 Subject: mm/balloon_compaction: remove balloon_page_push/pop() MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Let's remove these helpers as they are unused now. Link: https://lkml.kernel.org/r/20260119230133.3551867-14-david@kernel.org Signed-off-by: David Hildenbrand (Red Hat) Reviewed-by: Lorenzo Stoakes Acked-by: Michael S. 
Tsirkin Cc: Arnd Bergmann Cc: Christophe Leroy Cc: Eugenio Pérez Cc: Greg Kroah-Hartman Cc: Jason Wang Cc: Jerrin Shaji George Cc: Jonathan Corbet Cc: Liam Howlett Cc: Madhavan Srinivasan Cc: Michael Ellerman Cc: Michal Hocko Cc: Mike Rapoport Cc: Nicholas Piggin Cc: Oscar Salvador Cc: SeongJae Park Cc: Suren Baghdasaryan Cc: Vlastimil Babka Cc: Xuan Zhuo Cc: Zi Yan Signed-off-by: Andrew Morton --- include/linux/balloon_compaction.h | 30 ------------------------------ 1 file changed, 30 deletions(-) (limited to 'include') diff --git a/include/linux/balloon_compaction.h b/include/linux/balloon_compaction.h index 1452ea063524..e5451cf1f658 100644 --- a/include/linux/balloon_compaction.h +++ b/include/linux/balloon_compaction.h @@ -126,34 +126,4 @@ static inline void balloon_page_finalize(struct page *page) set_page_private(page, 0); /* PageOffline is sticky until the page is freed to the buddy. */ } - -/* - * balloon_page_push - insert a page into a page list. - * @head : pointer to list - * @page : page to be added - * - * Caller must ensure the page is private and protect the list. - */ -static inline void balloon_page_push(struct list_head *pages, struct page *page) -{ - list_add(&page->lru, pages); -} - -/* - * balloon_page_pop - remove a page from a page list. - * @head : pointer to list - * @page : page to be added - * - * Caller must ensure the page is private and protect the list. - */ -static inline struct page *balloon_page_pop(struct list_head *pages) -{ - struct page *page = list_first_entry_or_null(pages, struct page, lru); - - if (!page) - return NULL; - - list_del(&page->lru); - return page; -} #endif /* _LINUX_BALLOON_COMPACTION_H */ -- cgit v1.2.3 From 9d792ef33e40c8511b00a38e5e2e63f20bd2d815 Mon Sep 17 00:00:00 2001 From: "David Hildenbrand (Red Hat)" Date: Tue, 20 Jan 2026 00:01:22 +0100 Subject: mm/balloon_compaction: fold balloon_mapping_gfp_mask() into balloon_page_alloc() MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Let's just remove balloon_mapping_gfp_mask(). Link: https://lkml.kernel.org/r/20260119230133.3551867-15-david@kernel.org Signed-off-by: David Hildenbrand (Red Hat) Reviewed-by: Lorenzo Stoakes Acked-by: Michael S. 
Tsirkin Cc: Arnd Bergmann Cc: Christophe Leroy Cc: Eugenio Pérez Cc: Greg Kroah-Hartman Cc: Jason Wang Cc: Jerrin Shaji George Cc: Jonathan Corbet Cc: Liam Howlett Cc: Madhavan Srinivasan Cc: Michael Ellerman Cc: Michal Hocko Cc: Mike Rapoport Cc: Nicholas Piggin Cc: Oscar Salvador Cc: SeongJae Park Cc: Suren Baghdasaryan Cc: Vlastimil Babka Cc: Xuan Zhuo Cc: Zi Yan Signed-off-by: Andrew Morton --- include/linux/balloon_compaction.h | 7 ------- 1 file changed, 7 deletions(-) (limited to 'include') diff --git a/include/linux/balloon_compaction.h b/include/linux/balloon_compaction.h index e5451cf1f658..d1d473939897 100644 --- a/include/linux/balloon_compaction.h +++ b/include/linux/balloon_compaction.h @@ -106,13 +106,6 @@ static inline void balloon_page_insert(struct balloon_dev_info *balloon, list_add(&page->lru, &balloon->pages); } -static inline gfp_t balloon_mapping_gfp_mask(void) -{ - if (IS_ENABLED(CONFIG_BALLOON_COMPACTION)) - return GFP_HIGHUSER_MOVABLE; - return GFP_HIGHUSER; -} - /* * balloon_page_finalize - prepare a balloon page that was removed from the * balloon list for release to the page allocator -- cgit v1.2.3 From 03d6a2f68419b808d51ba39c84aedd6e9a6a92d8 Mon Sep 17 00:00:00 2001 From: "David Hildenbrand (Red Hat)" Date: Tue, 20 Jan 2026 00:01:23 +0100 Subject: mm/balloon_compaction: move internal helpers to balloon_compaction.c MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Let's move the helpers that are not required by drivers anymore. While at it, drop the doc of balloon_page_device() as it is trivial. [david@kernel.org: move balloon_page_device() under CONFIG_BALLOON_COMPACTION] Link: https://lkml.kernel.org/r/27f0adf1-54c1-4d99-8b7f-fd45574e7f41@kernel.org Link: https://lkml.kernel.org/r/20260119230133.3551867-16-david@kernel.org Signed-off-by: David Hildenbrand (Red Hat) Reviewed-by: Lorenzo Stoakes Acked-by: Michael S. Tsirkin Cc: Arnd Bergmann Cc: Christophe Leroy Cc: Eugenio Pérez Cc: Greg Kroah-Hartman Cc: Jason Wang Cc: Jerrin Shaji George Cc: Jonathan Corbet Cc: Liam Howlett Cc: Madhavan Srinivasan Cc: Michael Ellerman Cc: Michal Hocko Cc: Mike Rapoport Cc: Nicholas Piggin Cc: Oscar Salvador Cc: SeongJae Park Cc: Suren Baghdasaryan Cc: Vlastimil Babka Cc: Xuan Zhuo Cc: Zi Yan Signed-off-by: Andrew Morton --- include/linux/balloon_compaction.h | 44 -------------------------------------- 1 file changed, 44 deletions(-) (limited to 'include') diff --git a/include/linux/balloon_compaction.h b/include/linux/balloon_compaction.h index d1d473939897..eec8994056a4 100644 --- a/include/linux/balloon_compaction.h +++ b/include/linux/balloon_compaction.h @@ -75,48 +75,4 @@ static inline void balloon_devinfo_init(struct balloon_dev_info *balloon) balloon->migratepage = NULL; balloon->adjust_managed_page_count = false; } - -#ifdef CONFIG_BALLOON_COMPACTION -/* - * balloon_page_device - get the b_dev_info descriptor for the balloon device - * that enqueues the given page. - */ -static inline struct balloon_dev_info *balloon_page_device(struct page *page) -{ - return (struct balloon_dev_info *)page_private(page); -} -#endif /* CONFIG_BALLOON_COMPACTION */ - -/* - * balloon_page_insert - insert a page into the balloon's page list and make - * the page->private assignment accordingly. - * @balloon : pointer to balloon device - * @page : page to be assigned as a 'balloon page' - * - * Caller must ensure the balloon_pages_lock is held. 
- */ -static inline void balloon_page_insert(struct balloon_dev_info *balloon, - struct page *page) -{ - __SetPageOffline(page); - if (IS_ENABLED(CONFIG_BALLOON_COMPACTION)) { - SetPageMovableOps(page); - set_page_private(page, (unsigned long)balloon); - } - list_add(&page->lru, &balloon->pages); -} - -/* - * balloon_page_finalize - prepare a balloon page that was removed from the - * balloon list for release to the page allocator - * @page: page to be released to the page allocator - * - * Caller must ensure the balloon_pages_lock is held. - */ -static inline void balloon_page_finalize(struct page *page) -{ - if (IS_ENABLED(CONFIG_BALLOON_COMPACTION)) - set_page_private(page, 0); - /* PageOffline is sticky until the page is freed to the buddy. */ -} #endif /* _LINUX_BALLOON_COMPACTION_H */ -- cgit v1.2.3 From 92ec9260d53b245d3266f74ecc66d8ea47aaec3d Mon Sep 17 00:00:00 2001 From: "David Hildenbrand (Red Hat)" Date: Tue, 20 Jan 2026 00:01:26 +0100 Subject: mm/balloon_compaction: remove "extern" from functions MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Adding "extern" to functions is frowned-upon. Let's just get rid of it for all functions here. Link: https://lkml.kernel.org/r/20260119230133.3551867-19-david@kernel.org Signed-off-by: David Hildenbrand (Red Hat) Reviewed-by: Lorenzo Stoakes Acked-by: Michael S. Tsirkin Cc: Arnd Bergmann Cc: Christophe Leroy Cc: Eugenio Pérez Cc: Greg Kroah-Hartman Cc: Jason Wang Cc: Jerrin Shaji George Cc: Jonathan Corbet Cc: Liam Howlett Cc: Madhavan Srinivasan Cc: Michael Ellerman Cc: Michal Hocko Cc: Mike Rapoport Cc: Nicholas Piggin Cc: Oscar Salvador Cc: SeongJae Park Cc: Suren Baghdasaryan Cc: Vlastimil Babka Cc: Xuan Zhuo Cc: Zi Yan Signed-off-by: Andrew Morton --- include/linux/balloon_compaction.h | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-) (limited to 'include') diff --git a/include/linux/balloon_compaction.h b/include/linux/balloon_compaction.h index eec8994056a4..7757e0e314fd 100644 --- a/include/linux/balloon_compaction.h +++ b/include/linux/balloon_compaction.h @@ -59,14 +59,14 @@ struct balloon_dev_info { bool adjust_managed_page_count; }; -extern struct page *balloon_page_alloc(void); -extern void balloon_page_enqueue(struct balloon_dev_info *b_dev_info, - struct page *page); -extern struct page *balloon_page_dequeue(struct balloon_dev_info *b_dev_info); -extern size_t balloon_page_list_enqueue(struct balloon_dev_info *b_dev_info, - struct list_head *pages); -extern size_t balloon_page_list_dequeue(struct balloon_dev_info *b_dev_info, - struct list_head *pages, size_t n_req_pages); +struct page *balloon_page_alloc(void); +void balloon_page_enqueue(struct balloon_dev_info *b_dev_info, + struct page *page); +struct page *balloon_page_dequeue(struct balloon_dev_info *b_dev_info); +size_t balloon_page_list_enqueue(struct balloon_dev_info *b_dev_info, + struct list_head *pages); +size_t balloon_page_list_dequeue(struct balloon_dev_info *b_dev_info, + struct list_head *pages, size_t n_req_pages); static inline void balloon_devinfo_init(struct balloon_dev_info *balloon) { -- cgit v1.2.3 From 25b48b4cdf912f70998336b861a4bf767ee3d332 Mon Sep 17 00:00:00 2001 From: "David Hildenbrand (Red Hat)" Date: Tue, 20 Jan 2026 00:01:28 +0100 Subject: mm: rename balloon_compaction.(c|h) to balloon.(c|h) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Even without CONFIG_BALLOON_COMPACTION this infrastructure implements basic list and page management 
for a memory balloon. Link: https://lkml.kernel.org/r/20260119230133.3551867-21-david@kernel.org Signed-off-by: David Hildenbrand (Red Hat) Reviewed-by: Lorenzo Stoakes Acked-by: Michael S. Tsirkin Cc: Arnd Bergmann Cc: Christophe Leroy Cc: Eugenio Pérez Cc: Greg Kroah-Hartman Cc: Jason Wang Cc: Jerrin Shaji George Cc: Jonathan Corbet Cc: Liam Howlett Cc: Madhavan Srinivasan Cc: Michael Ellerman Cc: Michal Hocko Cc: Mike Rapoport Cc: Nicholas Piggin Cc: Oscar Salvador Cc: SeongJae Park Cc: Suren Baghdasaryan Cc: Vlastimil Babka Cc: Xuan Zhuo Cc: Zi Yan Signed-off-by: Andrew Morton --- include/linux/balloon.h | 77 +++++++++++++++++++++++++++++++++++++ include/linux/balloon_compaction.h | 78 -------------------------------------- 2 files changed, 77 insertions(+), 78 deletions(-) create mode 100644 include/linux/balloon.h delete mode 100644 include/linux/balloon_compaction.h (limited to 'include') diff --git a/include/linux/balloon.h b/include/linux/balloon.h new file mode 100644 index 000000000000..82585542300d --- /dev/null +++ b/include/linux/balloon.h @@ -0,0 +1,77 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * Common interface for implementing a memory balloon, including support + * for migration of pages inflated in a memory balloon. + * + * Balloon page migration makes use of the general "movable_ops page migration" + * feature. + * + * page->private is used to reference the responsible balloon device. + * That these pages have movable_ops, and which movable_ops apply, + * is derived from the page type (PageOffline()) combined with the + * PG_movable_ops flag (PageMovableOps()). + * + * Once the page type and the PG_movable_ops are set, migration code + * can initiate page isolation by invoking the + * movable_operations()->isolate_page() callback + * + * As long as page->private is set, the page is either on the balloon list + * or isolated for migration. If page->private is not set, the page is + * either still getting inflated, or was deflated to be freed by the balloon + * driver soon. Isolation is impossible in both cases. + * + * As the page isolation scanning step a compaction thread does is a lockless + * procedure (from a page standpoint), it might bring some racy situations while + * performing balloon page compaction. In order to sort out these racy scenarios + * and safely perform balloon's page compaction and migration we must, always, + * ensure following these simple rules: + * + * i. Inflation/deflation must set/clear page->private under the + * balloon_pages_lock + * + * ii. isolation or dequeueing procedure must remove the page from balloon + * device page list under balloon_pages_lock + * + * Copyright (C) 2012, Red Hat, Inc. Rafael Aquini + */ +#ifndef _LINUX_BALLOON_H +#define _LINUX_BALLOON_H +#include +#include +#include +#include +#include +#include + +/* + * Balloon device information descriptor. + * This struct is used to allow the common balloon compaction interface + * procedures to find the proper balloon device holding memory pages they'll + * have to cope for page compaction / migration, as well as it serves the + * balloon driver as a page book-keeper for its registered balloon devices. 
+ */ +struct balloon_dev_info { + unsigned long isolated_pages; /* # of isolated pages for migration */ + struct list_head pages; /* Pages enqueued & handled to Host */ + int (*migratepage)(struct balloon_dev_info *, struct page *newpage, + struct page *page, enum migrate_mode mode); + bool adjust_managed_page_count; +}; + +struct page *balloon_page_alloc(void); +void balloon_page_enqueue(struct balloon_dev_info *b_dev_info, + struct page *page); +struct page *balloon_page_dequeue(struct balloon_dev_info *b_dev_info); +size_t balloon_page_list_enqueue(struct balloon_dev_info *b_dev_info, + struct list_head *pages); +size_t balloon_page_list_dequeue(struct balloon_dev_info *b_dev_info, + struct list_head *pages, size_t n_req_pages); + +static inline void balloon_devinfo_init(struct balloon_dev_info *balloon) +{ + balloon->isolated_pages = 0; + INIT_LIST_HEAD(&balloon->pages); + balloon->migratepage = NULL; + balloon->adjust_managed_page_count = false; +} +#endif /* _LINUX_BALLOON_H */ diff --git a/include/linux/balloon_compaction.h b/include/linux/balloon_compaction.h deleted file mode 100644 index 7757e0e314fd..000000000000 --- a/include/linux/balloon_compaction.h +++ /dev/null @@ -1,78 +0,0 @@ -/* SPDX-License-Identifier: GPL-2.0 */ -/* - * include/linux/balloon_compaction.h - * - * Common interface definitions for making balloon pages movable by compaction. - * - * Balloon page migration makes use of the general "movable_ops page migration" - * feature. - * - * page->private is used to reference the responsible balloon device. - * That these pages have movable_ops, and which movable_ops apply, - * is derived from the page type (PageOffline()) combined with the - * PG_movable_ops flag (PageMovableOps()). - * - * Once the page type and the PG_movable_ops are set, migration code - * can initiate page isolation by invoking the - * movable_operations()->isolate_page() callback - * - * As long as page->private is set, the page is either on the balloon list - * or isolated for migration. If page->private is not set, the page is - * either still getting inflated, or was deflated to be freed by the balloon - * driver soon. Isolation is impossible in both cases. - * - * As the page isolation scanning step a compaction thread does is a lockless - * procedure (from a page standpoint), it might bring some racy situations while - * performing balloon page compaction. In order to sort out these racy scenarios - * and safely perform balloon's page compaction and migration we must, always, - * ensure following these simple rules: - * - * i. Inflation/deflation must set/clear page->private under the - * balloon_pages_lock - * - * ii. isolation or dequeueing procedure must remove the page from balloon - * device page list under balloon_pages_lock - * - * Copyright (C) 2012, Red Hat, Inc. Rafael Aquini - */ -#ifndef _LINUX_BALLOON_COMPACTION_H -#define _LINUX_BALLOON_COMPACTION_H -#include -#include -#include -#include -#include -#include - -/* - * Balloon device information descriptor. - * This struct is used to allow the common balloon compaction interface - * procedures to find the proper balloon device holding memory pages they'll - * have to cope for page compaction / migration, as well as it serves the - * balloon driver as a page book-keeper for its registered balloon devices. 
- */ -struct balloon_dev_info { - unsigned long isolated_pages; /* # of isolated pages for migration */ - struct list_head pages; /* Pages enqueued & handled to Host */ - int (*migratepage)(struct balloon_dev_info *, struct page *newpage, - struct page *page, enum migrate_mode mode); - bool adjust_managed_page_count; -}; - -struct page *balloon_page_alloc(void); -void balloon_page_enqueue(struct balloon_dev_info *b_dev_info, - struct page *page); -struct page *balloon_page_dequeue(struct balloon_dev_info *b_dev_info); -size_t balloon_page_list_enqueue(struct balloon_dev_info *b_dev_info, - struct list_head *pages); -size_t balloon_page_list_dequeue(struct balloon_dev_info *b_dev_info, - struct list_head *pages, size_t n_req_pages); - -static inline void balloon_devinfo_init(struct balloon_dev_info *balloon) -{ - balloon->isolated_pages = 0; - INIT_LIST_HEAD(&balloon->pages); - balloon->migratepage = NULL; - balloon->adjust_managed_page_count = false; -} -#endif /* _LINUX_BALLOON_COMPACTION_H */ -- cgit v1.2.3 From cd8e95d80bc29b3c72288bd31e845b11755ef6a5 Mon Sep 17 00:00:00 2001 From: "David Hildenbrand (Red Hat)" Date: Tue, 20 Jan 2026 00:01:30 +0100 Subject: mm: rename CONFIG_BALLOON_COMPACTION to CONFIG_BALLOON_MIGRATION MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit While compaction depends on migration, the other direction is not the case. So let's make it clearer that this is all about migration of balloon pages. Adjust all comments/docs in the core to talk about "migration" instead of "compaction". While at it add some "/* CONFIG_BALLOON_MIGRATION */". Link: https://lkml.kernel.org/r/20260119230133.3551867-23-david@kernel.org Signed-off-by: David Hildenbrand (Red Hat) Reviewed-by: Lorenzo Stoakes Acked-by: Michael S. Tsirkin Cc: Arnd Bergmann Cc: Christophe Leroy Cc: Eugenio Pérez Cc: Greg Kroah-Hartman Cc: Jason Wang Cc: Jerrin Shaji George Cc: Jonathan Corbet Cc: Liam Howlett Cc: Madhavan Srinivasan Cc: Michael Ellerman Cc: Michal Hocko Cc: Mike Rapoport Cc: Nicholas Piggin Cc: Oscar Salvador Cc: SeongJae Park Cc: Suren Baghdasaryan Cc: Vlastimil Babka Cc: Xuan Zhuo Cc: Zi Yan Signed-off-by: Andrew Morton --- include/linux/balloon.h | 12 ++++++------ include/linux/vm_event_item.h | 4 ++-- 2 files changed, 8 insertions(+), 8 deletions(-) (limited to 'include') diff --git a/include/linux/balloon.h b/include/linux/balloon.h index 82585542300d..ca5b15150f42 100644 --- a/include/linux/balloon.h +++ b/include/linux/balloon.h @@ -22,9 +22,9 @@ * * As the page isolation scanning step a compaction thread does is a lockless * procedure (from a page standpoint), it might bring some racy situations while - * performing balloon page compaction. In order to sort out these racy scenarios - * and safely perform balloon's page compaction and migration we must, always, - * ensure following these simple rules: + * performing balloon page migration. In order to sort out these racy scenarios + * and safely perform balloon's page migration we must, always, ensure following + * these simple rules: * * i. Inflation/deflation must set/clear page->private under the * balloon_pages_lock @@ -45,10 +45,10 @@ /* * Balloon device information descriptor. 
- * This struct is used to allow the common balloon compaction interface + * This struct is used to allow the common balloon page migration interface * procedures to find the proper balloon device holding memory pages they'll - * have to cope for page compaction / migration, as well as it serves the - * balloon driver as a page book-keeper for its registered balloon devices. + * have to cope for page migration, as well as it serves the balloon driver as + * a page book-keeper for its registered balloon devices. */ struct balloon_dev_info { unsigned long isolated_pages; /* # of isolated pages for migration */ diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h index 92f80b4d69a6..fca34d3473b6 100644 --- a/include/linux/vm_event_item.h +++ b/include/linux/vm_event_item.h @@ -125,9 +125,9 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT, #ifdef CONFIG_MEMORY_BALLOON BALLOON_INFLATE, BALLOON_DEFLATE, -#ifdef CONFIG_BALLOON_COMPACTION +#ifdef CONFIG_BALLOON_MIGRATION BALLOON_MIGRATE, -#endif +#endif /* CONFIG_BALLOON_MIGRATION */ #endif #ifdef CONFIG_DEBUG_TLBFLUSH NR_TLB_REMOTE_FLUSH, /* cpu tried to flush others' tlbs */ -- cgit v1.2.3 From 1421758055ca6028d3b758914863f38d434bf36b Mon Sep 17 00:00:00 2001 From: "David Hildenbrand (Red Hat)" Date: Tue, 20 Jan 2026 00:01:31 +0100 Subject: mm: rename CONFIG_MEMORY_BALLOON -> CONFIG_BALLOON MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Let's make it consistent with the naming of the files but also with the naming of CONFIG_BALLOON_MIGRATION. While at it, add a "/* CONFIG_BALLOON */". Link: https://lkml.kernel.org/r/20260119230133.3551867-24-david@kernel.org Signed-off-by: David Hildenbrand (Red Hat) Reviewed-by: Lorenzo Stoakes Acked-by: Michael S. Tsirkin Cc: Arnd Bergmann Cc: Christophe Leroy Cc: Eugenio Pérez Cc: Greg Kroah-Hartman Cc: Jason Wang Cc: Jerrin Shaji George Cc: Jonathan Corbet Cc: Liam Howlett Cc: Madhavan Srinivasan Cc: Michael Ellerman Cc: Michal Hocko Cc: Mike Rapoport Cc: Nicholas Piggin Cc: Oscar Salvador Cc: SeongJae Park Cc: Suren Baghdasaryan Cc: Vlastimil Babka Cc: Xuan Zhuo Cc: Zi Yan Signed-off-by: Andrew Morton --- include/linux/vm_event_item.h | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'include') diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h index fca34d3473b6..22a139f82d75 100644 --- a/include/linux/vm_event_item.h +++ b/include/linux/vm_event_item.h @@ -122,13 +122,13 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT, THP_SWPOUT, THP_SWPOUT_FALLBACK, #endif -#ifdef CONFIG_MEMORY_BALLOON +#ifdef CONFIG_BALLOON BALLOON_INFLATE, BALLOON_DEFLATE, #ifdef CONFIG_BALLOON_MIGRATION BALLOON_MIGRATE, #endif /* CONFIG_BALLOON_MIGRATION */ -#endif +#endif /* CONFIG_BALLOON */ #ifdef CONFIG_DEBUG_TLBFLUSH NR_TLB_REMOTE_FLUSH, /* cpu tried to flush others' tlbs */ NR_TLB_REMOTE_FLUSH_RECEIVED,/* cpu received ipi for flush */ -- cgit v1.2.3 From 5898aa8f9a0b42fe1f65c7364010ab15ec5c38bf Mon Sep 17 00:00:00 2001 From: Mathieu Desnoyers Date: Wed, 14 Jan 2026 09:36:42 -0500 Subject: mm: fix OOM killer inaccuracy on large many-core systems Use the precise, albeit slower, RSS counter sums for the OOM killer task selection and console dumps. The approximated value is too imprecise on large many-core systems.
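For illustration only (the mm/oom_kill.c side of the change is not shown in the include/ hunk below, and the real call sites may differ), the idea is to take the precise per-CPU counter sums on the slow OOM selection path via the get_mm_rss_sum()/get_mm_counter_sum() helpers; the MM_SWAPENTS term is illustrative:

	/* Hypothetical sketch, not taken from this patch's hunks. */
	static unsigned long oom_points_sketch(struct mm_struct *mm)
	{
		/* precise: iterates all possible CPUs per counter */
		return get_mm_rss_sum(mm) + get_mm_counter_sum(mm, MM_SWAPENTS);
	}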
The following rss tracking issues were noted by Sweet Tea Dorminy [1], which lead to picking wrong tasks as OOM kill target: Recently, several internal services had an RSS usage regression as part of a kernel upgrade. Previously, they were on a pre-6.2 kernel and were able to read RSS statistics in a backup watchdog process to monitor and decide if they'd overrun their memory budget. Now, however, a representative service with five threads, expected to use about a hundred MB of memory, on a 250-cpu machine had memory usage tens of megabytes different from the expected amount -- this constituted a significant percentage of inaccuracy, causing the watchdog to act. This was a result of commit f1a7941243c1 ("mm: convert mm's rss stats into percpu_counter") [1]. Previously, the memory error was bounded by 64*nr_threads pages, a very livable megabyte. Now, however, as a result of scheduler decisions moving the threads around the CPUs, the memory error could be as large as a gigabyte. This is a really tremendous inaccuracy for any few-threaded program on a large machine and impedes monitoring significantly. These stat counters are also used to make OOM killing decisions, so this additional inaccuracy could make a big difference in OOM situations -- either resulting in the wrong process being killed, or in less memory being returned from an OOM-kill than expected. Here is a (possibly incomplete) list of the prior approaches that were used or proposed, along with their downside: 1) Per-thread rss tracking: large error on many-thread processes. 2) Per-CPU counters: up to 12% slower for short-lived processes and 9% increased system time in make test workloads [1]. Moreover, the inaccuracy increases with O(n^2) with the number of CPUs. 3) Per-NUMA-node counters: requires atomics on fast-path (overhead), error is high with systems that have lots of NUMA nodes (32 times the number of NUMA nodes). commit 82241a83cd15 ("mm: fix the inaccurate memory statistics issue for users") introduced get_mm_counter_sum() for precise proc memory status queries for some proc files. The simple fix proposed here is to do the precise per-cpu counters sum every time a counter value needs to be read. This applies to the OOM killer task selection, oom task console dumps (printk). This change increases the latency introduced when the OOM killer executes in favor of doing a more precise OOM target task selection. Effectively, the OOM killer iterates on all tasks, for all relevant page types, for which the precise sum iterates on all possible CPUs. As a reference, here is the execution time of the OOM killer before/after the change: AMD EPYC 9654 96-Core (2 sockets) Within a KVM, configured with 256 logical cpus. | before | after | ----------------------------------|----------|----------| nr_processes=40 | 0.3 ms | 0.5 ms | nr_processes=10000 | 3.0 ms | 80.0 ms | Link: https://lkml.kernel.org/r/20260114143642.47333-1-mathieu.desnoyers@efficios.com Fixes: f1a7941243c1 ("mm: convert mm's rss stats into percpu_counter") Link: https://lore.kernel.org/lkml/20250331223516.7810-2-sweettea-kernel@dorminy.me/ # [1] Signed-off-by: Mathieu Desnoyers Suggested-by: Michal Hocko Acked-by: Michal Hocko Reviewed-by: Baolin Wang Acked-by: Vlastimil Babka Cc: "Paul E. 
McKenney" Cc: Steven Rostedt Cc: Masami Hiramatsu Cc: Mathieu Desnoyers Cc: Dennis Zhou Cc: Tejun Heo Cc: Christoph Lameter Cc: Martin Liu Cc: David Rientjes Cc: Shakeel Butt Cc: SeongJae Park Cc: Michal Hocko Cc: Johannes Weiner Cc: Sweet Tea Dorminy Cc: Lorenzo Stoakes Cc: "Liam R . Howlett" Cc: Mike Rapoport Cc: Suren Baghdasaryan Cc: Christian Brauner Cc: Wei Yang Cc: David Hildenbrand Cc: Miaohe Lin Cc: Al Viro Cc: Yu Zhao Cc: Roman Gushchin Cc: Mateusz Guzik Cc: Matthew Wilcox Cc: Aboorva Devarajan Signed-off-by: Andrew Morton --- include/linux/mm.h | 7 +++++++ 1 file changed, 7 insertions(+) (limited to 'include') diff --git a/include/linux/mm.h b/include/linux/mm.h index aacabf8a0b58..aa90719234f1 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -2906,6 +2906,13 @@ static inline unsigned long get_mm_rss(struct mm_struct *mm) get_mm_counter(mm, MM_SHMEMPAGES); } +static inline unsigned long get_mm_rss_sum(struct mm_struct *mm) +{ + return get_mm_counter_sum(mm, MM_FILEPAGES) + + get_mm_counter_sum(mm, MM_ANONPAGES) + + get_mm_counter_sum(mm, MM_SHMEMPAGES); +} + static inline unsigned long get_mm_hiwater_rss(struct mm_struct *mm) { return max(mm->hiwater_rss, get_mm_rss(mm)); -- cgit v1.2.3 From dc9fe9b7056a44ad65715def880e7d91d32c047f Mon Sep 17 00:00:00 2001 From: Jiayuan Chen Date: Tue, 20 Jan 2026 10:43:48 +0800 Subject: mm/vmscan: mitigate spurious kswapd_failures reset from direct reclaim Patch series "mm/vmscan: add tracepoint and reason for kswapd_failures reset", v4. Currently, kswapd_failures is reset in multiple places (kswapd, direct reclaim, PCP freeing, memory-tiers), but there's no way to trace when and why it was reset, making it difficult to debug memory reclaim issues. This patch: 1. Introduce kswapd_clear_hopeless() as a wrapper function to centralize kswapd_failures reset logic. 2. Introduce kswapd_test_hopeless() to encapsulate hopeless node checks, replacing all open-coded kswapd_failures comparisons. 3. Add kswapd_clear_hopeless_reason enum to distinguish reset sources: - KSWAPD_CLEAR_HOPELESS_KSWAPD: reset from kswapd context - KSWAPD_CLEAR_HOPELESS_DIRECT: reset from direct reclaim - KSWAPD_CLEAR_HOPELESS_PCP: reset from PCP page freeing - KSWAPD_CLEAR_HOPELESS_OTHER: reset from other paths 4. 
Add tracepoints for better observability: - mm_vmscan_kswapd_clear_hopeless: traces each reset with reason - mm_vmscan_kswapd_reclaim_fail: traces each kswapd reclaim failure Test results: $ trace-cmd record -e vmscan:mm_vmscan_kswapd_clear_hopeless -e vmscan:mm_vmscan_kswapd_reclaim_fail $ # generate memory pressure $ trace-cmd report cpus=4 kswapd0-71 [000] 27.216563: mm_vmscan_kswapd_reclaim_fail: nid=0 failures=1 kswapd0-71 [000] 27.217169: mm_vmscan_kswapd_reclaim_fail: nid=0 failures=2 kswapd0-71 [000] 27.217764: mm_vmscan_kswapd_reclaim_fail: nid=0 failures=3 kswapd0-71 [000] 27.218353: mm_vmscan_kswapd_reclaim_fail: nid=0 failures=4 kswapd0-71 [000] 27.218993: mm_vmscan_kswapd_reclaim_fail: nid=0 failures=5 kswapd0-71 [000] 27.219744: mm_vmscan_kswapd_reclaim_fail: nid=0 failures=6 kswapd0-71 [000] 27.220488: mm_vmscan_kswapd_reclaim_fail: nid=0 failures=7 kswapd0-71 [000] 27.221206: mm_vmscan_kswapd_reclaim_fail: nid=0 failures=8 kswapd0-71 [000] 27.221806: mm_vmscan_kswapd_reclaim_fail: nid=0 failures=9 kswapd0-71 [000] 27.222634: mm_vmscan_kswapd_reclaim_fail: nid=0 failures=10 kswapd0-71 [000] 27.223286: mm_vmscan_kswapd_reclaim_fail: nid=0 failures=11 kswapd0-71 [000] 27.223894: mm_vmscan_kswapd_reclaim_fail: nid=0 failures=12 kswapd0-71 [000] 27.224712: mm_vmscan_kswapd_reclaim_fail: nid=0 failures=13 kswapd0-71 [000] 27.225424: mm_vmscan_kswapd_reclaim_fail: nid=0 failures=14 kswapd0-71 [000] 27.226082: mm_vmscan_kswapd_reclaim_fail: nid=0 failures=15 kswapd0-71 [000] 27.226810: mm_vmscan_kswapd_reclaim_fail: nid=0 failures=16 kswapd1-72 [002] 27.386869: mm_vmscan_kswapd_reclaim_fail: nid=1 failures=1 kswapd1-72 [002] 27.387435: mm_vmscan_kswapd_reclaim_fail: nid=1 failures=2 kswapd1-72 [002] 27.388016: mm_vmscan_kswapd_reclaim_fail: nid=1 failures=3 kswapd1-72 [002] 27.388586: mm_vmscan_kswapd_reclaim_fail: nid=1 failures=4 kswapd1-72 [002] 27.389155: mm_vmscan_kswapd_reclaim_fail: nid=1 failures=5 kswapd1-72 [002] 27.389723: mm_vmscan_kswapd_reclaim_fail: nid=1 failures=6 kswapd1-72 [002] 27.390292: mm_vmscan_kswapd_reclaim_fail: nid=1 failures=7 kswapd1-72 [002] 27.392364: mm_vmscan_kswapd_reclaim_fail: nid=1 failures=8 kswapd1-72 [002] 27.392934: mm_vmscan_kswapd_reclaim_fail: nid=1 failures=9 kswapd1-72 [002] 27.393504: mm_vmscan_kswapd_reclaim_fail: nid=1 failures=10 kswapd1-72 [002] 27.394073: mm_vmscan_kswapd_reclaim_fail: nid=1 failures=11 kswapd1-72 [002] 27.394899: mm_vmscan_kswapd_reclaim_fail: nid=1 failures=12 kswapd1-72 [002] 27.395472: mm_vmscan_kswapd_reclaim_fail: nid=1 failures=13 kswapd1-72 [002] 27.396055: mm_vmscan_kswapd_reclaim_fail: nid=1 failures=14 kswapd1-72 [002] 27.396628: mm_vmscan_kswapd_reclaim_fail: nid=1 failures=15 kswapd1-72 [002] 27.397199: mm_vmscan_kswapd_reclaim_fail: nid=1 failures=16 kworker/u18:0-40 [002] 27.410151: mm_vmscan_kswapd_clear_hopeless: nid=0 reason=DIRECT kswapd0-71 [000] 27.439454: mm_vmscan_kswapd_reclaim_fail: nid=0 failures=1 kswapd0-71 [000] 27.440048: mm_vmscan_kswapd_reclaim_fail: nid=0 failures=2 kswapd0-71 [000] 27.440634: mm_vmscan_kswapd_reclaim_fail: nid=0 failures=3 kswapd0-71 [000] 27.441211: mm_vmscan_kswapd_reclaim_fail: nid=0 failures=4 kswapd0-71 [000] 27.441787: mm_vmscan_kswapd_reclaim_fail: nid=0 failures=5 kswapd0-71 [000] 27.442363: mm_vmscan_kswapd_reclaim_fail: nid=0 failures=6 kswapd0-71 [000] 27.443030: mm_vmscan_kswapd_reclaim_fail: nid=0 failures=7 kswapd0-71 [000] 27.443725: mm_vmscan_kswapd_reclaim_fail: nid=0 failures=8 kswapd0-71 [000] 27.444315: mm_vmscan_kswapd_reclaim_fail: nid=0 
failures=9 kswapd0-71 [000] 27.444898: mm_vmscan_kswapd_reclaim_fail: nid=0 failures=10 kswapd0-71 [000] 27.445476: mm_vmscan_kswapd_reclaim_fail: nid=0 failures=11 kswapd0-71 [000] 27.446053: mm_vmscan_kswapd_reclaim_fail: nid=0 failures=12 kswapd0-71 [000] 27.446646: mm_vmscan_kswapd_reclaim_fail: nid=0 failures=13 kswapd0-71 [000] 27.447230: mm_vmscan_kswapd_reclaim_fail: nid=0 failures=14 kswapd0-71 [000] 27.447812: mm_vmscan_kswapd_reclaim_fail: nid=0 failures=15 kswapd0-71 [000] 27.448391: mm_vmscan_kswapd_reclaim_fail: nid=0 failures=16 ann-423 [003] 28.028285: mm_vmscan_kswapd_clear_hopeless: nid=0 reason=PCP This patch (of 2): When kswapd fails to reclaim memory, kswapd_failures is incremented. Once it reaches MAX_RECLAIM_RETRIES, kswapd stops running to avoid futile reclaim attempts. However, any successful direct reclaim unconditionally resets kswapd_failures to 0, which can cause problems. We observed an issue in production on a multi-NUMA system where a process allocated large amounts of anonymous pages on a single NUMA node, causing its watermark to drop below high and evicting most file pages: $ numastat -m Per-node system memory usage (in MBs): Node 0 Node 1 Total --------------- --------------- --------------- MemTotal 128222.19 127983.91 256206.11 MemFree 1414.48 1432.80 2847.29 MemUsed 126807.71 126551.11 252358.82 SwapCached 0.00 0.00 0.00 Active 29017.91 25554.57 54572.48 Inactive 92749.06 95377.00 188126.06 Active(anon) 28998.96 23356.47 52355.43 Inactive(anon) 92685.27 87466.11 180151.39 Active(file) 18.95 2198.10 2217.05 Inactive(file) 63.79 7910.89 7974.68 With swap disabled, only file pages can be reclaimed. When kswapd is woken (e.g., via wake_all_kswapds()), it runs continuously but cannot raise free memory above the high watermark since reclaimable file pages are insufficient. Normally, kswapd would eventually stop after kswapd_failures reaches MAX_RECLAIM_RETRIES. However, containers on this machine have memory.high set in their cgroup. Business processes continuously trigger the high limit, causing frequent direct reclaim that keeps resetting kswapd_failures to 0. This prevents kswapd from ever stopping. The key insight is that direct reclaim triggered by cgroup memory.high performs aggressive scanning to throttle the allocating process. With sufficiently aggressive scanning, even hot pages will eventually be reclaimed, making direct reclaim "successful" at freeing some memory. However, this success does not mean the node has reached a balanced state - the freed memory may still be insufficient to bring free pages above the high watermark. Unconditionally resetting kswapd_failures in this case keeps kswapd alive indefinitely. The result is that kswapd runs endlessly. Unlike direct reclaim which only reclaims from the allocating cgroup, kswapd scans the entire node's memory. This causes hot file pages from all workloads on the node to be evicted, not just those from the cgroup triggering memory.high. These pages constantly refault, generating sustained heavy IO READ pressure across the entire system. Fix this by only resetting kswapd_failures when the node is actually balanced. This allows both kswapd and direct reclaim to clear kswapd_failures upon successful reclaim, but only when the reclaim actually resolves the memory pressure (i.e., the node becomes balanced). 
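As a rough sketch of the intended call-site shape (the mm/vmscan.c changes are not part of the include/ hunk below; 'nr_reclaimed', 'pgdat', 'order' and 'highest_zoneidx' are illustrative locals), a direct reclaim path would do something like:

	/* Only clear kswapd_failures if the node is actually balanced. */
	if (nr_reclaimed)
		kswapd_try_clear_hopeless(pgdat, order, highest_zoneidx);

instead of unconditionally resetting pgdat->kswapd_failures to 0.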
Link: https://lkml.kernel.org/r/20260120024402.387576-1-jiayuan.chen@linux.dev Link: https://lkml.kernel.org/r/20260120024402.387576-2-jiayuan.chen@linux.dev Signed-off-by: Jiayuan Chen Signed-off-by: Jiayuan Chen Acked-by: Shakeel Butt Cc: Axel Rasmussen Cc: Brendan Jackman Cc: David Hildenbrand Cc: Johannes Weiner Cc: Liam Howlett Cc: Lorenzo Stoakes Cc: "Masami Hiramatsu (Google)" Cc: Mathieu Desnoyers Cc: Michal Hocko Cc: Mike Rapoport Cc: Qi Zheng Cc: Steven Rostedt Cc: Suren Baghdasaryan Cc: Vlastimil Babka Cc: Wei Xu Cc: Yuanchu Xie Cc: Zi Yan Signed-off-by: Andrew Morton --- include/linux/mmzone.h | 2 ++ 1 file changed, 2 insertions(+) (limited to 'include') diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index eb3815fc94ad..8881198e85c6 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -1536,6 +1536,8 @@ static inline unsigned long pgdat_end_pfn(pg_data_t *pgdat) void build_all_zonelists(pg_data_t *pgdat); void wakeup_kswapd(struct zone *zone, gfp_t gfp_mask, int order, enum zone_type highest_zoneidx); +void kswapd_try_clear_hopeless(struct pglist_data *pgdat, + unsigned int order, int highest_zoneidx); bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark, int highest_zoneidx, unsigned int alloc_flags, long free_pages); -- cgit v1.2.3 From a45088376d8a847a5e3b1982fcfceb41644e3b1d Mon Sep 17 00:00:00 2001 From: Jiayuan Chen Date: Tue, 20 Jan 2026 10:43:49 +0800 Subject: mm/vmscan: add tracepoint and reason for kswapd_failures reset Currently, kswapd_failures is reset in multiple places (kswapd, direct reclaim, PCP freeing, memory-tiers), but there's no way to trace when and why it was reset, making it difficult to debug memory reclaim issues. This patch: 1. Introduce kswapd_clear_hopeless() as a wrapper function to centralize kswapd_failures reset logic. 2. Introduce kswapd_test_hopeless() to encapsulate hopeless node checks, replacing all open-coded kswapd_failures comparisons. 3. Add kswapd_clear_hopeless_reason enum to distinguish reset sources: - KSWAPD_CLEAR_HOPELESS_KSWAPD: reset from kswapd context - KSWAPD_CLEAR_HOPELESS_DIRECT: reset from direct reclaim - KSWAPD_CLEAR_HOPELESS_PCP: reset from PCP page freeing - KSWAPD_CLEAR_HOPELESS_OTHER: reset from other paths 4. 
Add tracepoints for better observability: - mm_vmscan_kswapd_clear_hopeless: traces each reset with reason - mm_vmscan_kswapd_reclaim_fail: traces each kswapd reclaim failure Test results: $ trace-cmd record -e vmscan:mm_vmscan_kswapd_clear_hopeless -e vmscan:mm_vmscan_kswapd_reclaim_fail $ # generate memory pressure $ trace-cmd report cpus=4 kswapd0-71 [000] 27.216563: mm_vmscan_kswapd_reclaim_fail: nid=0 failures=1 kswapd0-71 [000] 27.217169: mm_vmscan_kswapd_reclaim_fail: nid=0 failures=2 kswapd0-71 [000] 27.217764: mm_vmscan_kswapd_reclaim_fail: nid=0 failures=3 kswapd0-71 [000] 27.218353: mm_vmscan_kswapd_reclaim_fail: nid=0 failures=4 kswapd0-71 [000] 27.218993: mm_vmscan_kswapd_reclaim_fail: nid=0 failures=5 kswapd0-71 [000] 27.219744: mm_vmscan_kswapd_reclaim_fail: nid=0 failures=6 kswapd0-71 [000] 27.220488: mm_vmscan_kswapd_reclaim_fail: nid=0 failures=7 kswapd0-71 [000] 27.221206: mm_vmscan_kswapd_reclaim_fail: nid=0 failures=8 kswapd0-71 [000] 27.221806: mm_vmscan_kswapd_reclaim_fail: nid=0 failures=9 kswapd0-71 [000] 27.222634: mm_vmscan_kswapd_reclaim_fail: nid=0 failures=10 kswapd0-71 [000] 27.223286: mm_vmscan_kswapd_reclaim_fail: nid=0 failures=11 kswapd0-71 [000] 27.223894: mm_vmscan_kswapd_reclaim_fail: nid=0 failures=12 kswapd0-71 [000] 27.224712: mm_vmscan_kswapd_reclaim_fail: nid=0 failures=13 kswapd0-71 [000] 27.225424: mm_vmscan_kswapd_reclaim_fail: nid=0 failures=14 kswapd0-71 [000] 27.226082: mm_vmscan_kswapd_reclaim_fail: nid=0 failures=15 kswapd0-71 [000] 27.226810: mm_vmscan_kswapd_reclaim_fail: nid=0 failures=16 kswapd1-72 [002] 27.386869: mm_vmscan_kswapd_reclaim_fail: nid=1 failures=1 kswapd1-72 [002] 27.387435: mm_vmscan_kswapd_reclaim_fail: nid=1 failures=2 kswapd1-72 [002] 27.388016: mm_vmscan_kswapd_reclaim_fail: nid=1 failures=3 kswapd1-72 [002] 27.388586: mm_vmscan_kswapd_reclaim_fail: nid=1 failures=4 kswapd1-72 [002] 27.389155: mm_vmscan_kswapd_reclaim_fail: nid=1 failures=5 kswapd1-72 [002] 27.389723: mm_vmscan_kswapd_reclaim_fail: nid=1 failures=6 kswapd1-72 [002] 27.390292: mm_vmscan_kswapd_reclaim_fail: nid=1 failures=7 kswapd1-72 [002] 27.392364: mm_vmscan_kswapd_reclaim_fail: nid=1 failures=8 kswapd1-72 [002] 27.392934: mm_vmscan_kswapd_reclaim_fail: nid=1 failures=9 kswapd1-72 [002] 27.393504: mm_vmscan_kswapd_reclaim_fail: nid=1 failures=10 kswapd1-72 [002] 27.394073: mm_vmscan_kswapd_reclaim_fail: nid=1 failures=11 kswapd1-72 [002] 27.394899: mm_vmscan_kswapd_reclaim_fail: nid=1 failures=12 kswapd1-72 [002] 27.395472: mm_vmscan_kswapd_reclaim_fail: nid=1 failures=13 kswapd1-72 [002] 27.396055: mm_vmscan_kswapd_reclaim_fail: nid=1 failures=14 kswapd1-72 [002] 27.396628: mm_vmscan_kswapd_reclaim_fail: nid=1 failures=15 kswapd1-72 [002] 27.397199: mm_vmscan_kswapd_reclaim_fail: nid=1 failures=16 kworker/u18:0-40 [002] 27.410151: mm_vmscan_kswapd_clear_hopeless: nid=0 reason=DIRECT kswapd0-71 [000] 27.439454: mm_vmscan_kswapd_reclaim_fail: nid=0 failures=1 kswapd0-71 [000] 27.440048: mm_vmscan_kswapd_reclaim_fail: nid=0 failures=2 kswapd0-71 [000] 27.440634: mm_vmscan_kswapd_reclaim_fail: nid=0 failures=3 kswapd0-71 [000] 27.441211: mm_vmscan_kswapd_reclaim_fail: nid=0 failures=4 kswapd0-71 [000] 27.441787: mm_vmscan_kswapd_reclaim_fail: nid=0 failures=5 kswapd0-71 [000] 27.442363: mm_vmscan_kswapd_reclaim_fail: nid=0 failures=6 kswapd0-71 [000] 27.443030: mm_vmscan_kswapd_reclaim_fail: nid=0 failures=7 kswapd0-71 [000] 27.443725: mm_vmscan_kswapd_reclaim_fail: nid=0 failures=8 kswapd0-71 [000] 27.444315: mm_vmscan_kswapd_reclaim_fail: nid=0 
failures=9 kswapd0-71 [000] 27.444898: mm_vmscan_kswapd_reclaim_fail: nid=0 failures=10 kswapd0-71 [000] 27.445476: mm_vmscan_kswapd_reclaim_fail: nid=0 failures=11 kswapd0-71 [000] 27.446053: mm_vmscan_kswapd_reclaim_fail: nid=0 failures=12 kswapd0-71 [000] 27.446646: mm_vmscan_kswapd_reclaim_fail: nid=0 failures=13 kswapd0-71 [000] 27.447230: mm_vmscan_kswapd_reclaim_fail: nid=0 failures=14 kswapd0-71 [000] 27.447812: mm_vmscan_kswapd_reclaim_fail: nid=0 failures=15 kswapd0-71 [000] 27.448391: mm_vmscan_kswapd_reclaim_fail: nid=0 failures=16 ann-423 [003] 28.028285: mm_vmscan_kswapd_clear_hopeless: nid=0 reason=PCP Link: https://lkml.kernel.org/r/20260120024402.387576-3-jiayuan.chen@linux.dev Signed-off-by: Jiayuan Chen Signed-off-by: Jiayuan Chen Acked-by: Shakeel Butt Suggested-by: Johannes Weiner Reviewed-by: Steven Rostedt (Google) [tracing] Cc: Axel Rasmussen Cc: Brendan Jackman Cc: David Hildenbrand Cc: Liam Howlett Cc: Lorenzo Stoakes Cc: "Masami Hiramatsu (Google)" Cc: Mathieu Desnoyers Cc: Michal Hocko Cc: Mike Rapoport Cc: Qi Zheng Cc: Suren Baghdasaryan Cc: Vlastimil Babka Cc: Wei Xu Cc: Yuanchu Xie Cc: Zi Yan Signed-off-by: Andrew Morton --- include/linux/mmzone.h | 19 ++++++++++++---- include/trace/events/vmscan.h | 51 +++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 66 insertions(+), 4 deletions(-) (limited to 'include') diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 8881198e85c6..3e51190a55e4 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -1534,16 +1534,27 @@ static inline unsigned long pgdat_end_pfn(pg_data_t *pgdat) #include void build_all_zonelists(pg_data_t *pgdat); -void wakeup_kswapd(struct zone *zone, gfp_t gfp_mask, int order, - enum zone_type highest_zoneidx); -void kswapd_try_clear_hopeless(struct pglist_data *pgdat, - unsigned int order, int highest_zoneidx); bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark, int highest_zoneidx, unsigned int alloc_flags, long free_pages); bool zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark, int highest_zoneidx, unsigned int alloc_flags); + +enum kswapd_clear_hopeless_reason { + KSWAPD_CLEAR_HOPELESS_OTHER = 0, + KSWAPD_CLEAR_HOPELESS_KSWAPD, + KSWAPD_CLEAR_HOPELESS_DIRECT, + KSWAPD_CLEAR_HOPELESS_PCP, +}; + +void wakeup_kswapd(struct zone *zone, gfp_t gfp_mask, int order, + enum zone_type highest_zoneidx); +void kswapd_try_clear_hopeless(struct pglist_data *pgdat, + unsigned int order, int highest_zoneidx); +void kswapd_clear_hopeless(pg_data_t *pgdat, enum kswapd_clear_hopeless_reason reason); +bool kswapd_test_hopeless(pg_data_t *pgdat); + /* * Memory initialization context, use to differentiate memory added by * the platform statically or via memory hotplug interface. diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h index 490958fa10de..ea58e4656abf 100644 --- a/include/trace/events/vmscan.h +++ b/include/trace/events/vmscan.h @@ -40,6 +40,16 @@ {_VMSCAN_THROTTLE_CONGESTED, "VMSCAN_THROTTLE_CONGESTED"} \ ) : "VMSCAN_THROTTLE_NONE" +TRACE_DEFINE_ENUM(KSWAPD_CLEAR_HOPELESS_OTHER); +TRACE_DEFINE_ENUM(KSWAPD_CLEAR_HOPELESS_KSWAPD); +TRACE_DEFINE_ENUM(KSWAPD_CLEAR_HOPELESS_DIRECT); +TRACE_DEFINE_ENUM(KSWAPD_CLEAR_HOPELESS_PCP); + +#define kswapd_clear_hopeless_reason_ops \ + {KSWAPD_CLEAR_HOPELESS_KSWAPD, "KSWAPD"}, \ + {KSWAPD_CLEAR_HOPELESS_DIRECT, "DIRECT"}, \ + {KSWAPD_CLEAR_HOPELESS_PCP, "PCP"}, \ + {KSWAPD_CLEAR_HOPELESS_OTHER, "OTHER"} #define trace_reclaim_flags(file) ( \ (file ? 
RECLAIM_WB_FILE : RECLAIM_WB_ANON) | \ @@ -535,6 +545,47 @@ TRACE_EVENT(mm_vmscan_throttled, __entry->usec_delayed, show_throttle_flags(__entry->reason)) ); + +TRACE_EVENT(mm_vmscan_kswapd_reclaim_fail, + + TP_PROTO(int nid, int failures), + + TP_ARGS(nid, failures), + + TP_STRUCT__entry( + __field(int, nid) + __field(int, failures) + ), + + TP_fast_assign( + __entry->nid = nid; + __entry->failures = failures; + ), + + TP_printk("nid=%d failures=%d", + __entry->nid, __entry->failures) +); + +TRACE_EVENT(mm_vmscan_kswapd_clear_hopeless, + + TP_PROTO(int nid, int reason), + + TP_ARGS(nid, reason), + + TP_STRUCT__entry( + __field(int, nid) + __field(int, reason) + ), + + TP_fast_assign( + __entry->nid = nid; + __entry->reason = reason; + ), + + TP_printk("nid=%d reason=%s", + __entry->nid, + __print_symbolic(__entry->reason, kswapd_clear_hopeless_reason_ops)) +); #endif /* _TRACE_VMSCAN_H */ /* This part must be outside protection */ -- cgit v1.2.3 From c83109e95c9d78e41b39e65b6490e511f4b8fba2 Mon Sep 17 00:00:00 2001 From: Kefeng Wang Date: Mon, 12 Jan 2026 23:09:50 +0800 Subject: mm: page_isolation: introduce page_is_unmovable() Patch series "mm: accelerate gigantic folio allocation". Optimize pfn_range_valid_contig() and replace_free_hugepage_folios() in alloc_contig_frozen_pages() to speed up gigantic folio allocation. The allocation time for 120*1G folios drops from 3.605s to 0.431s. This patch (of 5): Factor out the check if a page is unmovable into a new helper, and will be reused in the following patch. No functional change intended, the minor changes are as follows, 1) Avoid unnecessary calls by checking CONFIG_ARCH_ENABLE_HUGEPAGE_MIGRATION 2) Directly call PageCompound since PageTransCompound may be dropped 3) Using folio_test_hugetlb() Link: https://lkml.kernel.org/r/20260112150954.1802953-1-wangkefeng.wang@huawei.com Link: https://lkml.kernel.org/r/20260112150954.1802953-2-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang Reviewed-by: Zi Yan Reviewed-by: Oscar Salvador Cc: Brendan Jackman Cc: David Hildenbrand Cc: Jane Chu Cc: Johannes Weiner Cc: Matthew Wilcox (Oracle) Cc: Muchun Song Cc: Sidhartha Kumar Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- include/linux/page-isolation.h | 2 ++ 1 file changed, 2 insertions(+) (limited to 'include') diff --git a/include/linux/page-isolation.h b/include/linux/page-isolation.h index 3e2f960e166c..6f8638c9904f 100644 --- a/include/linux/page-isolation.h +++ b/include/linux/page-isolation.h @@ -67,4 +67,6 @@ void undo_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn); int test_pages_isolated(unsigned long start_pfn, unsigned long end_pfn, enum pb_isolate_mode mode); +bool page_is_unmovable(struct zone *zone, struct page *page, + enum pb_isolate_mode mode, unsigned long *step); #endif -- cgit v1.2.3 From 50962b16c0d63725fa73f0a5b4b831f740cf7208 Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Sat, 17 Jan 2026 09:52:48 -0800 Subject: mm/damon: remove damon_operations->cleanup() Patch series "mm/damon: cleanup kdamond, damon_call(), damos filter and DAMON_MIN_REGION". Do miscellaneous code cleanups for improving readability. First three patches cleanup kdamond termination process, by removing unused operation set cleanup callback (patch 1) and moving damon_ctx specific resource cleanups on kdamond termination to synchronization-easy place (patches 2 and 3). 
Next two patches touch damon_call() infrastructure, by refactoring kdamond_call() function to do less and simpler locking operations (patch 4), and documenting when dealloc_on_free does work (patch 5). Final three patches rename things for clear uses of those. Those rename damos_filter_out() to be more explicit about the fact that it is only for core-handled filters (patch 6), DAMON_MIN_REGION macro to be more explicit it is not about number of regions but size of each region (patch 7), and damon_ctx->min_sz_region to be different from damos_access_patern->min_sz_region (patch 8), so that those are not confusing and easy to grep. This patch (of 8): damon_operations->cleanup() was added for a case that an operation set implementation requires additional cleanups. But no such implementation exists at the moment. Remove it. Link: https://lkml.kernel.org/r/20260117175256.82826-1-sj@kernel.org Link: https://lkml.kernel.org/r/20260117175256.82826-2-sj@kernel.org Signed-off-by: SeongJae Park Signed-off-by: Andrew Morton --- include/linux/damon.h | 3 --- 1 file changed, 3 deletions(-) (limited to 'include') diff --git a/include/linux/damon.h b/include/linux/damon.h index e6930d8574d3..bd4c76b126bd 100644 --- a/include/linux/damon.h +++ b/include/linux/damon.h @@ -607,7 +607,6 @@ enum damon_ops_id { * @apply_scheme: Apply a DAMON-based operation scheme. * @target_valid: Determine if the target is valid. * @cleanup_target: Clean up each target before deallocation. - * @cleanup: Clean up the context. * * DAMON can be extended for various address spaces and usages. For this, * users should register the low level operations for their target address @@ -640,7 +639,6 @@ enum damon_ops_id { * @target_valid should check whether the target is still valid for the * monitoring. * @cleanup_target is called before the target will be deallocated. - * @cleanup is called from @kdamond just before its termination. */ struct damon_operations { enum damon_ops_id id; @@ -656,7 +654,6 @@ struct damon_operations { struct damos *scheme, unsigned long *sz_filter_passed); bool (*target_valid)(struct damon_target *t); void (*cleanup_target)(struct damon_target *t); - void (*cleanup)(struct damon_ctx *context); }; /* -- cgit v1.2.3 From 177c8a272968b6bcdbcc8589a72e3eaa32f975d0 Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Sat, 17 Jan 2026 09:52:52 -0800 Subject: mm/damon: document damon_call_control->dealloc_on_cancel repeat behavior damon_call_control->dealloc_on_cancel works only when ->repeat is true. But the behavior is not clearly documented. DAMON API callers can understand the behavior only after reading kdamond_call() code. Document the behavior on the kernel-doc comment of damon_call_control. Link: https://lkml.kernel.org/r/20260117175256.82826-6-sj@kernel.org Signed-off-by: SeongJae Park Signed-off-by: Andrew Morton --- include/linux/damon.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'include') diff --git a/include/linux/damon.h b/include/linux/damon.h index bd4c76b126bd..bdca28e15e40 100644 --- a/include/linux/damon.h +++ b/include/linux/damon.h @@ -663,7 +663,7 @@ struct damon_operations { * @data: Data that will be passed to @fn. * @repeat: Repeat invocations. * @return_code: Return code from @fn invocation. - * @dealloc_on_cancel: De-allocate when canceled. + * @dealloc_on_cancel: If @repeat is true, de-allocate when canceled. * * Control damon_call(), which requests specific kdamond to invoke a given * function. Refer to damon_call() for more details. 
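For readers skimming the log, the documented rule above ("dealloc_on_cancel only takes effect when ->repeat is true") can be modelled in a few lines of standalone C. The struct below is a simplified stand-in, not the real struct damon_call_control, and cancel_control() is a hypothetical helper that illustrates the documented semantics rather than the kdamond_call() implementation:

#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

/* Simplified stand-in for struct damon_call_control (illustration only). */
struct call_control {
        void (*fn)(void *data);
        void *data;
        bool repeat;            /* keep invoking fn on every kdamond iteration */
        bool dealloc_on_cancel; /* per the kernel-doc, only honored when repeat is true */
};

/* Model of the documented cancel behavior: free only if repeat && dealloc_on_cancel. */
static void cancel_control(struct call_control *ctl)
{
        if (ctl->repeat && ctl->dealloc_on_cancel)
                free(ctl);
        /* A one-shot (repeat == false) control stays owned by the caller. */
}

int main(void)
{
        struct call_control *ctl = malloc(sizeof(*ctl));

        if (!ctl)
                return 1;
        ctl->fn = NULL;
        ctl->data = NULL;
        ctl->repeat = true;
        ctl->dealloc_on_cancel = true;
        cancel_control(ctl); /* freed here, because ->repeat is true */
        printf("repeating control deallocated on cancel\n");
        return 0;
}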
-- cgit v1.2.3 From dfb1b0c9dc0d61e422905640e1e7334b3cf6f384 Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Sat, 17 Jan 2026 09:52:54 -0800 Subject: mm/damon: rename DAMON_MIN_REGION to DAMON_MIN_REGION_SZ The macro is for the default minimum size of each DAMON region. There was a case that a reader was confused if it is the minimum number of total DAMON regions, which is set on damon_attrs->min_nr_regions. Make the name more explicit. Link: https://lkml.kernel.org/r/20260117175256.82826-8-sj@kernel.org Signed-off-by: SeongJae Park Signed-off-by: Andrew Morton --- include/linux/damon.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'include') diff --git a/include/linux/damon.h b/include/linux/damon.h index bdca28e15e40..5bf8db1d78fe 100644 --- a/include/linux/damon.h +++ b/include/linux/damon.h @@ -15,7 +15,7 @@ #include /* Minimal region size. Every damon_region is aligned by this. */ -#define DAMON_MIN_REGION PAGE_SIZE +#define DAMON_MIN_REGION_SZ PAGE_SIZE /* Max priority score for DAMON-based operation schemes */ #define DAMOS_MAX_SCORE (99) -- cgit v1.2.3 From cc1db8dff8e751ec3ab352483de366b7f23aefe2 Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Sat, 17 Jan 2026 09:52:55 -0800 Subject: mm/damon: rename min_sz_region of damon_ctx to min_region_sz 'min_sz_region' field of 'struct damon_ctx' represents the minimum size of each DAMON region for the context. 'struct damos_access_pattern' has a field of the same name. It confuses readers and makes 'grep' less optimal for them. Rename it to 'min_region_sz'. Link: https://lkml.kernel.org/r/20260117175256.82826-9-sj@kernel.org Signed-off-by: SeongJae Park Signed-off-by: Andrew Morton --- include/linux/damon.h | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) (limited to 'include') diff --git a/include/linux/damon.h b/include/linux/damon.h index 5bf8db1d78fe..a4fea23da857 100644 --- a/include/linux/damon.h +++ b/include/linux/damon.h @@ -773,7 +773,7 @@ struct damon_attrs { * * @ops: Set of monitoring operations for given use cases. * @addr_unit: Scale factor for core to ops address conversion. - * @min_sz_region: Minimum region size. + * @min_region_sz: Minimum region size. * @adaptive_targets: Head of monitoring targets (&damon_target) list. * @schemes: Head of schemes (&damos) list. 
*/ @@ -818,7 +818,7 @@ struct damon_ctx { /* public: */ struct damon_operations ops; unsigned long addr_unit; - unsigned long min_sz_region; + unsigned long min_region_sz; struct list_head adaptive_targets; struct list_head schemes; @@ -907,7 +907,7 @@ static inline void damon_insert_region(struct damon_region *r, void damon_add_region(struct damon_region *r, struct damon_target *t); void damon_destroy_region(struct damon_region *r, struct damon_target *t); int damon_set_regions(struct damon_target *t, struct damon_addr_range *ranges, - unsigned int nr_ranges, unsigned long min_sz_region); + unsigned int nr_ranges, unsigned long min_region_sz); void damon_update_region_access_rate(struct damon_region *r, bool accessed, struct damon_attrs *attrs); @@ -975,7 +975,7 @@ int damos_walk(struct damon_ctx *ctx, struct damos_walk_control *control); int damon_set_region_biggest_system_ram_default(struct damon_target *t, unsigned long *start, unsigned long *end, - unsigned long min_sz_region); + unsigned long min_region_sz); #endif /* CONFIG_DAMON */ -- cgit v1.2.3 From 25faccd69977d9a72739fd425040c2a1c2d67e46 Mon Sep 17 00:00:00 2001 From: Lorenzo Stoakes Date: Fri, 23 Jan 2026 20:12:11 +0000 Subject: mm/vma: rename VMA_LOCK_OFFSET to VM_REFCNT_EXCLUDE_READERS_FLAG Patch series "mm: add and use vma_assert_stabilised() helper", v4. This series first introduces a number of refactorings, intended to significantly improve readability and abstraction of the code. Sometimes we wish to assert that a VMA is stable, that is - the VMA cannot be changed underneath us. This will be the case if EITHER the VMA lock or the mmap lock is held. We already open-code this in two places - anon_vma_name() in mm/madvise.c and vma_flag_set_atomic() in include/linux/mm.h. This series adds vma_assert_stabilised(), which abstracts this and can be used in these call sites instead. This implementation uses lockdep where possible - that is, for VMA read locks - which correctly track read lock acquisition/release via: vma_start_read() -> rwsem_acquire_read() vma_start_read_locked() -> vma_start_read_locked_nested() -> rwsem_acquire_read() And: vma_end_read() -> vma_refcount_put() -> rwsem_release() We don't track the VMA locks using lockdep for VMA write locks, however these are predicated upon mmap write locks whose lockdep state we do track, and additionally vma_assert_stabilised() performs this check if the VMA read lock is not held, so we get lockdep coverage in this case also. We also add extensive comments to describe what we're doing. There's some tricky stuff around mmap locking and stabilisation races that we have to be careful of, which I describe in the patch introducing vma_assert_stabilised(). This change also lays the foundation for future series to add this assert in further places where we wish to make it clear that we rely upon a stabilised VMA. The motivation for this change was precisely this. This patch (of 10): The VMA_LOCK_OFFSET value encodes a flag which vma->vm_refcnt is set to in order to indicate that a VMA is in the process of having VMA read-locks excluded in __vma_enter_locked() (that is, first checking if there are any VMA read locks held, and if there are, waiting on them to be released). This happens when a VMA write lock is being established, or when a VMA is being marked detached and its reference count is found to be elevated because read-lockers temporarily raised it before noticing that a VMA write lock is in place.
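To make the encoding easier to picture, here is a minimal userspace C sketch of the bit layout this patch introduces (the macro names and values come from the diff below; plain unsigned ints stand in for refcount_t, and the asserts merely restate the refcnt states listed later in this series). This is a model, not kernel code:

#include <assert.h>
#include <stdio.h>

#define VM_REFCNT_EXCLUDE_READERS_BIT  30
#define VM_REFCNT_EXCLUDE_READERS_FLAG (1U << VM_REFCNT_EXCLUDE_READERS_BIT)
#define VM_REFCNT_LIMIT                (VM_REFCNT_EXCLUDE_READERS_FLAG - 1)

/* Readers are being excluded whenever the flag bit is set in vm_refcnt. */
static int excluding_readers(unsigned int refcnt)
{
        return refcnt & VM_REFCNT_EXCLUDE_READERS_FLAG;
}

int main(void)
{
        unsigned int detached    = 0;       /* detached: no one may take a reference */
        unsigned int attached    = 1;       /* attached, unlocked or write-locked */
        unsigned int read_locked = 1 + 3;   /* attached plus three readers */
        unsigned int excl_detach = VM_REFCNT_EXCLUDE_READERS_FLAG;     /* detaching, readers excluded */
        unsigned int excl_attach = VM_REFCNT_EXCLUDE_READERS_FLAG + 1; /* write-locking, readers excluded */

        assert(!excluding_readers(detached));
        assert(!excluding_readers(attached));
        assert(read_locked <= VM_REFCNT_LIMIT); /* readers may only raise refcnt up to the limit */
        assert(excluding_readers(excl_detach));
        assert(excluding_readers(excl_attach));
        printf("flag=0x%x limit=0x%x\n",
               VM_REFCNT_EXCLUDE_READERS_FLAG, VM_REFCNT_LIMIT);
        return 0;
}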
The naming does not convey any of this, so rename VMA_LOCK_OFFSET to VM_REFCNT_EXCLUDE_READERS_FLAG (with a sensible new prefix to differentiate from the newly introduced VMA_*_BIT flags). Also rename VMA_REF_LIMIT to VM_REFCNT_LIMIT to make this consistent also. Update comments to reflect this. No functional change intended. Link: https://lkml.kernel.org/r/817bd763e5fe35f23e01347996f9007e6eb88460.1769198904.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes Reviewed-by: Suren Baghdasaryan Reviewed-by: Vlastimil Babka Reviewed-by: Liam R. Howlett Cc: Boqun Feng Cc: Michal Hocko Cc: Mike Rapoport Cc: Shakeel Butt Cc: Waiman Long Cc: Sebastian Andrzej Siewior Signed-off-by: Andrew Morton --- include/linux/mm_types.h | 17 +++++++++++++---- include/linux/mmap_lock.h | 14 ++++++++------ 2 files changed, 21 insertions(+), 10 deletions(-) (limited to 'include') diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 78950eb8926d..bdbf17c4f26b 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -752,8 +752,17 @@ static inline struct anon_vma_name *anon_vma_name_alloc(const char *name) } #endif -#define VMA_LOCK_OFFSET 0x40000000 -#define VMA_REF_LIMIT (VMA_LOCK_OFFSET - 1) +/* + * While __vma_enter_locked() is working to ensure are no read-locks held on a + * VMA (either while acquiring a VMA write lock or marking a VMA detached) we + * set the VM_REFCNT_EXCLUDE_READERS_FLAG in vma->vm_refcnt to indiciate to + * vma_start_read() that the reference count should be left alone. + * + * Once the operation is complete, this value is subtracted from vma->vm_refcnt. + */ +#define VM_REFCNT_EXCLUDE_READERS_BIT (30) +#define VM_REFCNT_EXCLUDE_READERS_FLAG (1U << VM_REFCNT_EXCLUDE_READERS_BIT) +#define VM_REFCNT_LIMIT (VM_REFCNT_EXCLUDE_READERS_FLAG - 1) struct vma_numab_state { /* @@ -935,10 +944,10 @@ struct vm_area_struct { /* * Can only be written (using WRITE_ONCE()) while holding both: * - mmap_lock (in write mode) - * - vm_refcnt bit at VMA_LOCK_OFFSET is set + * - vm_refcnt bit at VM_REFCNT_EXCLUDE_READERS_FLAG is set * Can be read reliably while holding one of: * - mmap_lock (in read or write mode) - * - vm_refcnt bit at VMA_LOCK_OFFSET is set or vm_refcnt > 1 + * - vm_refcnt bit at VM_REFCNT_EXCLUDE_READERS_BIT is set or vm_refcnt > 1 * Can be read unreliably (using READ_ONCE()) for pessimistic bailout * while holding nothing (except RCU to keep the VMA struct allocated). * diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h index b50416fbba20..5acbd4ba1b52 100644 --- a/include/linux/mmap_lock.h +++ b/include/linux/mmap_lock.h @@ -125,12 +125,14 @@ static inline void vma_lock_init(struct vm_area_struct *vma, bool reset_refcnt) static inline bool is_vma_writer_only(int refcnt) { /* - * With a writer and no readers, refcnt is VMA_LOCK_OFFSET if the vma - * is detached and (VMA_LOCK_OFFSET + 1) if it is attached. Waiting on - * a detached vma happens only in vma_mark_detached() and is a rare - * case, therefore most of the time there will be no unnecessary wakeup. + * With a writer and no readers, refcnt is VM_REFCNT_EXCLUDE_READERS_FLAG + * if the vma is detached and (VM_REFCNT_EXCLUDE_READERS_FLAG + 1) if it is + * attached. Waiting on a detached vma happens only in + * vma_mark_detached() and is a rare case, therefore most of the time + * there will be no unnecessary wakeup. 
*/ - return (refcnt & VMA_LOCK_OFFSET) && refcnt <= VMA_LOCK_OFFSET + 1; + return (refcnt & VM_REFCNT_EXCLUDE_READERS_FLAG) && + refcnt <= VM_REFCNT_EXCLUDE_READERS_FLAG + 1; } static inline void vma_refcount_put(struct vm_area_struct *vma) @@ -159,7 +161,7 @@ static inline bool vma_start_read_locked_nested(struct vm_area_struct *vma, int mmap_assert_locked(vma->vm_mm); if (unlikely(!__refcount_inc_not_zero_limited_acquire(&vma->vm_refcnt, &oldcnt, - VMA_REF_LIMIT))) + VM_REFCNT_LIMIT))) return false; rwsem_acquire_read(&vma->vmlock_dep_map, 0, 1, _RET_IP_); -- cgit v1.2.3 From ef4c0cea1e15dc6b1b5b9bb72fa4605b14f2125e Mon Sep 17 00:00:00 2001 From: Lorenzo Stoakes Date: Fri, 23 Jan 2026 20:12:12 +0000 Subject: mm/vma: document possible vma->vm_refcnt values and reference comment The possible vma->vm_refcnt values are confusing and vague, explain in detail what these can be in a comment describing the vma->vm_refcnt field and reference this comment in various places that read/write this field. No functional change intended. [akpm@linux-foundation.org: fix typo, per Suren] Link: https://lkml.kernel.org/r/d462e7678c6cc7461f94e5b26c776547d80a67e8.1769198904.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes Reviewed-by: Vlastimil Babka Reviewed-by: Suren Baghdasaryan Cc: Boqun Feng Cc: Liam Howlett Cc: Michal Hocko Cc: Mike Rapoport Cc: Shakeel Butt Cc: Waiman Long Cc: Sebastian Andrzej Siewior Signed-off-by: Andrew Morton --- include/linux/mm_types.h | 42 ++++++++++++++++++++++++++++++++++++++++-- include/linux/mmap_lock.h | 7 +++++++ 2 files changed, 47 insertions(+), 2 deletions(-) (limited to 'include') diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index bdbf17c4f26b..3e608d22cab0 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -758,7 +758,8 @@ static inline struct anon_vma_name *anon_vma_name_alloc(const char *name) * set the VM_REFCNT_EXCLUDE_READERS_FLAG in vma->vm_refcnt to indiciate to * vma_start_read() that the reference count should be left alone. * - * Once the operation is complete, this value is subtracted from vma->vm_refcnt. + * See the comment describing vm_refcnt in vm_area_struct for details as to + * which values the VMA reference count can be. */ #define VM_REFCNT_EXCLUDE_READERS_BIT (30) #define VM_REFCNT_EXCLUDE_READERS_FLAG (1U << VM_REFCNT_EXCLUDE_READERS_BIT) @@ -989,7 +990,44 @@ struct vm_area_struct { struct vma_numab_state *numab_state; /* NUMA Balancing state */ #endif #ifdef CONFIG_PER_VMA_LOCK - /* Unstable RCU readers are allowed to read this. */ + /* + * Used to keep track of firstly, whether the VMA is attached, secondly, + * if attached, how many read locks are taken, and thirdly, if the + * VM_REFCNT_EXCLUDE_READERS_FLAG is set, whether any read locks held + * are currently in the process of being excluded. + * + * This value can be equal to: + * + * 0 - Detached. IMPORTANT: when the refcnt is zero, readers cannot + * increment it. + * + * 1 - Attached and either unlocked or write-locked. Write locks are + * identified via __is_vma_write_locked() which checks for equality of + * vma->vm_lock_seq and mm->mm_lock_seq. + * + * >1, < VM_REFCNT_EXCLUDE_READERS_FLAG - Read-locked or (unlikely) + * write-locked with other threads having temporarily incremented the + * reference count prior to determining it is write-locked and + * decrementing it again. + * + * VM_REFCNT_EXCLUDE_READERS_FLAG - Detached, pending + * __vma_exit_locked() completion which will decrement the reference + * count to zero. 
IMPORTANT - at this stage no further readers can + * increment the reference count. It can only be reduced. + * + * VM_REFCNT_EXCLUDE_READERS_FLAG + 1 - A thread is either write-locking + * an attached VMA and has yet to invoke __vma_exit_locked(), OR a + * thread is detaching a VMA and is waiting on a single spurious reader + * in order to decrement the reference count. IMPORTANT - as above, no + * further readers can increment the reference count. + * + * > VM_REFCNT_EXCLUDE_READERS_FLAG + 1 - A thread is either + * write-locking or detaching a VMA is waiting on readers to + * exit. IMPORTANT - as above, no further readers can increment the + * reference count. + * + * NOTE: Unstable RCU readers are allowed to read this. + */ refcount_t vm_refcnt ____cacheline_aligned_in_smp; #ifdef CONFIG_DEBUG_LOCK_ALLOC struct lockdep_map vmlock_dep_map; diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h index 5acbd4ba1b52..a764439d0276 100644 --- a/include/linux/mmap_lock.h +++ b/include/linux/mmap_lock.h @@ -130,6 +130,9 @@ static inline bool is_vma_writer_only(int refcnt) * attached. Waiting on a detached vma happens only in * vma_mark_detached() and is a rare case, therefore most of the time * there will be no unnecessary wakeup. + * + * See the comment describing the vm_area_struct->vm_refcnt field for + * details of possible refcnt values. */ return (refcnt & VM_REFCNT_EXCLUDE_READERS_FLAG) && refcnt <= VM_REFCNT_EXCLUDE_READERS_FLAG + 1; @@ -249,6 +252,10 @@ static inline void vma_assert_locked(struct vm_area_struct *vma) { unsigned int mm_lock_seq; + /* + * See the comment describing the vm_area_struct->vm_refcnt field for + * details of possible refcnt values. + */ VM_BUG_ON_VMA(refcount_read(&vma->vm_refcnt) <= 1 && !__is_vma_write_locked(vma, &mm_lock_seq), vma); } -- cgit v1.2.3 From 180355d4cfbd25f370e2e0912877a36aa350ff64 Mon Sep 17 00:00:00 2001 From: Lorenzo Stoakes Date: Fri, 23 Jan 2026 20:12:13 +0000 Subject: mm/vma: rename is_vma_write_only(), separate out shared refcount put The is_vma_writer_only() function is misnamed - this isn't determining if there is only a write lock, as it checks for the presence of the VM_REFCNT_EXCLUDE_READERS_FLAG. Really, it is checking to see whether readers are excluded, with a possibility of a false positive in the case of a detachment (there we expect the vma->vm_refcnt to eventually be set to VM_REFCNT_EXCLUDE_READERS_FLAG, whereas for an attached VMA we expect it to eventually be set to VM_REFCNT_EXCLUDE_READERS_FLAG + 1). Rename the function accordingly. Relatedly, we use a __refcount_dec_and_test() primitive directly in vma_refcount_put(), using the old value to determine what the reference count ought to be after the operation is complete (ignoring racing reference count adjustments). Wrap this into a __vma_refcount_put_return() function, which we can then utilise in vma_mark_detached() and thus keep the refcount primitive usage abstracted. This function, as the name implies, returns the value after the reference count has been updated. This reduces duplication in the two invocations of this function. Also adjust comments, removing duplicative comments covered elsewhere and adding more to aid understanding. No functional change intended. 
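The wakeup arithmetic described above can be followed in a standalone C model (plain integers instead of refcount_t, a printf instead of rcuwait_wake_up(), and no real concurrency). The helper names mirror those in the diff, but this is only a sketch of the logic, not the kernel implementation:

#include <stdbool.h>
#include <stdio.h>

#define EXCLUDE_READERS_FLAG (1U << 30) /* stand-in for VM_REFCNT_EXCLUDE_READERS_FLAG */

/* Mirror of the check formerly called is_vma_writer_only(): are readers excluded? */
static bool readers_excluded(unsigned int refcnt)
{
        return (refcnt & EXCLUDE_READERS_FLAG) && refcnt <= EXCLUDE_READERS_FLAG + 1;
}

/* Model of __vma_refcount_put_return(): decrement and report the new value. */
static unsigned int refcount_put_return(unsigned int *refcnt)
{
        return --(*refcnt);
}

/* Model of vma_refcount_put(): wake the exclusive waiter when the last reader leaves. */
static void refcount_put(unsigned int *refcnt)
{
        unsigned int newcnt = refcount_put_return(refcnt);

        if (newcnt && readers_excluded(newcnt))
                printf("last reader gone (refcnt=0x%x): wake exclusive waiter\n", newcnt);
}

int main(void)
{
        /* Write-locking an attached VMA with one straggling reader: flag + 1 + 1. */
        unsigned int refcnt = EXCLUDE_READERS_FLAG + 2;

        refcount_put(&refcnt); /* the reader drops out -> the waiter is woken */
        return 0;
}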
Link: https://lkml.kernel.org/r/32053580bff460eb1092ef780b526cefeb748bad.1769198904.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes Reviewed-by: Vlastimil Babka Reviewed-by: Suren Baghdasaryan Cc: Boqun Feng Cc: Liam Howlett Cc: Michal Hocko Cc: Mike Rapoport Cc: Shakeel Butt Cc: Waiman Long Cc: Sebastian Andrzej Siewior Signed-off-by: Andrew Morton --- include/linux/mmap_lock.h | 66 +++++++++++++++++++++++++++++++++++++---------- 1 file changed, 53 insertions(+), 13 deletions(-) (limited to 'include') diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h index a764439d0276..294fb282052d 100644 --- a/include/linux/mmap_lock.h +++ b/include/linux/mmap_lock.h @@ -122,15 +122,22 @@ static inline void vma_lock_init(struct vm_area_struct *vma, bool reset_refcnt) vma->vm_lock_seq = UINT_MAX; } -static inline bool is_vma_writer_only(int refcnt) +/* + * This function determines whether the input VMA reference count describes a + * VMA which has excluded all VMA read locks. + * + * In the case of a detached VMA, we may incorrectly indicate that readers are + * excluded when one remains, because in that scenario we target a refcount of + * VM_REFCNT_EXCLUDE_READERS_FLAG, rather than the attached target of + * VM_REFCNT_EXCLUDE_READERS_FLAG + 1. + * + * However, the race window for that is very small so it is unlikely. + * + * Returns: true if readers are excluded, false otherwise. + */ +static inline bool __vma_are_readers_excluded(int refcnt) { /* - * With a writer and no readers, refcnt is VM_REFCNT_EXCLUDE_READERS_FLAG - * if the vma is detached and (VM_REFCNT_EXCLUDE_READERS_FLAG + 1) if it is - * attached. Waiting on a detached vma happens only in - * vma_mark_detached() and is a rare case, therefore most of the time - * there will be no unnecessary wakeup. - * * See the comment describing the vm_area_struct->vm_refcnt field for * details of possible refcnt values. */ @@ -138,18 +145,51 @@ static inline bool is_vma_writer_only(int refcnt) refcnt <= VM_REFCNT_EXCLUDE_READERS_FLAG + 1; } +/* + * Actually decrement the VMA reference count. + * + * The function returns the reference count as it was immediately after the + * decrement took place. If it returns zero, the VMA is now detached. + */ +static inline __must_check unsigned int +__vma_refcount_put_return(struct vm_area_struct *vma) +{ + int oldcnt; + + if (__refcount_dec_and_test(&vma->vm_refcnt, &oldcnt)) + return 0; + + return oldcnt - 1; +} + +/** + * vma_refcount_put() - Drop reference count in VMA vm_refcnt field due to a + * read-lock being dropped. + * @vma: The VMA whose reference count we wish to decrement. + * + * If we were the last reader, wake up threads waiting to obtain an exclusive + * lock. + */ static inline void vma_refcount_put(struct vm_area_struct *vma) { - /* Use a copy of vm_mm in case vma is freed after we drop vm_refcnt */ + /* Use a copy of vm_mm in case vma is freed after we drop vm_refcnt. */ struct mm_struct *mm = vma->vm_mm; - int oldcnt; + int newcnt; rwsem_release(&vma->vmlock_dep_map, _RET_IP_); - if (!__refcount_dec_and_test(&vma->vm_refcnt, &oldcnt)) { - if (is_vma_writer_only(oldcnt - 1)) - rcuwait_wake_up(&mm->vma_writer_wait); - } + newcnt = __vma_refcount_put_return(vma); + /* + * __vma_enter_locked() may be sleeping waiting for readers to drop + * their reference count, so wake it up if we were the last reader + * blocking it from being acquired. 
+ * + * We may be raced by other readers temporarily incrementing the + * reference count, though the race window is very small, this might + * cause spurious wakeups. + */ + if (newcnt && __vma_are_readers_excluded(newcnt)) + rcuwait_wake_up(&mm->vma_writer_wait); } /* -- cgit v1.2.3 From 1f2e7efc3ee9b32095d5a331d1f8672623f311bf Mon Sep 17 00:00:00 2001 From: Lorenzo Stoakes Date: Fri, 23 Jan 2026 20:12:14 +0000 Subject: mm/vma: add+use vma lockdep acquire/release defines The code is littered with inscrutable and duplicative lockdep incantations, replace these with defines which explain what is going on and add commentary to explain what we're doing. If lockdep is disabled these become no-ops. We must use defines so _RET_IP_ remains meaningful. These are self-documenting and aid readability of the code. Additionally, instead of using the confusing rwsem_*() form for something that is emphatically not an rwsem, we instead explicitly use lock_[acquired, release]_shared/exclusive() lockdep invocations since we are doing something rather custom here and these make more sense to use. No functional change intended. Link: https://lkml.kernel.org/r/fdae72441949ecf3b4a0ed3510da803e881bb153.1769198904.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes Reviewed-by: Suren Baghdasaryan Reviewed-by: Vlastimil Babka Reviewed-by: Sebastian Andrzej Siewior Cc: Boqun Feng Cc: Liam Howlett Cc: Michal Hocko Cc: Mike Rapoport Cc: Shakeel Butt Cc: Waiman Long Signed-off-by: Andrew Morton --- include/linux/mmap_lock.h | 37 ++++++++++++++++++++++++++++++++++--- 1 file changed, 34 insertions(+), 3 deletions(-) (limited to 'include') diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h index 294fb282052d..1887ca55ead7 100644 --- a/include/linux/mmap_lock.h +++ b/include/linux/mmap_lock.h @@ -78,6 +78,37 @@ static inline void mmap_assert_write_locked(const struct mm_struct *mm) #ifdef CONFIG_PER_VMA_LOCK +/* + * VMA locks do not behave like most ordinary locks found in the kernel, so we + * cannot quite have full lockdep tracking in the way we would ideally prefer. + * + * Read locks act as shared locks which exclude an exclusive lock being + * taken. We therefore mark these accordingly on read lock acquire/release. + * + * Write locks are acquired exclusively per-VMA, but released in a shared + * fashion, that is upon vma_end_write_all(), we update the mmap's seqcount such + * that write lock is released. + * + * We therefore cannot track write locks per-VMA, nor do we try. Mitigating this + * is the fact that, of course, we do lockdep-track the mmap lock rwsem which + * must be held when taking a VMA write lock. + * + * We do, however, want to indicate that during either acquisition of a VMA + * write lock or detachment of a VMA that we require the lock held be exclusive, + * so we utilise lockdep to do so. + */ +#define __vma_lockdep_acquire_read(vma) \ + lock_acquire_shared(&vma->vmlock_dep_map, 0, 1, NULL, _RET_IP_) +#define __vma_lockdep_release_read(vma) \ + lock_release(&vma->vmlock_dep_map, _RET_IP_) +#define __vma_lockdep_acquire_exclusive(vma) \ + lock_acquire_exclusive(&vma->vmlock_dep_map, 0, 0, NULL, _RET_IP_) +#define __vma_lockdep_release_exclusive(vma) \ + lock_release(&vma->vmlock_dep_map, _RET_IP_) +/* Only meaningful if CONFIG_LOCK_STAT is defined. 
*/ +#define __vma_lockdep_stat_mark_acquired(vma) \ + lock_acquired(&vma->vmlock_dep_map, _RET_IP_) + static inline void mm_lock_seqcount_init(struct mm_struct *mm) { seqcount_init(&mm->mm_lock_seq); @@ -176,9 +207,9 @@ static inline void vma_refcount_put(struct vm_area_struct *vma) struct mm_struct *mm = vma->vm_mm; int newcnt; - rwsem_release(&vma->vmlock_dep_map, _RET_IP_); - + __vma_lockdep_release_read(vma); newcnt = __vma_refcount_put_return(vma); + /* * __vma_enter_locked() may be sleeping waiting for readers to drop * their reference count, so wake it up if we were the last reader @@ -207,7 +238,7 @@ static inline bool vma_start_read_locked_nested(struct vm_area_struct *vma, int VM_REFCNT_LIMIT))) return false; - rwsem_acquire_read(&vma->vmlock_dep_map, 0, 1, _RET_IP_); + __vma_lockdep_acquire_read(vma); return true; } -- cgit v1.2.3 From 28f590f35da8435f75e2aee51431c6c1b8d91f54 Mon Sep 17 00:00:00 2001 From: Lorenzo Stoakes Date: Fri, 23 Jan 2026 20:12:16 +0000 Subject: mm/vma: clean up __vma_enter/exit_locked() These functions are very confusing indeed. 'Entering' a lock could be interpreted as acquiring it, but this is not what these functions are interacting with. Equally they don't indicate at all what kind of lock we are 'entering' or 'exiting'. Finally they are misleading as we invoke these functions when we already hold a write lock to detach a VMA. These functions are explicitly simply 'entering' and 'exiting' a state in which we hold the EXCLUSIVE lock in order that we can either mark the VMA as being write-locked, or mark the VMA detached. Rename the functions accordingly, and also update __vma_end_exclude_readers() to return detached state with a __must_check directive, as it is simply clumsy to pass an output pointer here to detached state and inconsistent vs. __vma_start_exclude_readers(). Finally, remove the unnecessary 'inline' directives. No functional change intended. Link: https://lkml.kernel.org/r/33273be9389712347d69987c408ca7436f0c1b22.1769198904.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes Reviewed-by: Vlastimil Babka Reviewed-by: Suren Baghdasaryan Cc: Boqun Feng Cc: Liam Howlett Cc: Michal Hocko Cc: Mike Rapoport Cc: Shakeel Butt Cc: Waiman Long Cc: Sebastian Andrzej Siewior Signed-off-by: Andrew Morton --- include/linux/mmap_lock.h | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'include') diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h index 1887ca55ead7..d6df6aad3e24 100644 --- a/include/linux/mmap_lock.h +++ b/include/linux/mmap_lock.h @@ -211,8 +211,8 @@ static inline void vma_refcount_put(struct vm_area_struct *vma) newcnt = __vma_refcount_put_return(vma); /* - * __vma_enter_locked() may be sleeping waiting for readers to drop - * their reference count, so wake it up if we were the last reader + * __vma_start_exclude_readers() may be sleeping waiting for readers to + * drop their reference count, so wake it up if we were the last reader * blocking it from being acquired. * * We may be raced by other readers temporarily incrementing the -- cgit v1.2.3 From e28e575af956c4c3089b443e87be91a6ff7af355 Mon Sep 17 00:00:00 2001 From: Lorenzo Stoakes Date: Fri, 23 Jan 2026 20:12:17 +0000 Subject: mm/vma: introduce helper struct + thread through exclusive lock fns It is confusing to have __vma_start_exclude_readers() return 0, 1 or an error (but only when waiting for readers in TASK_KILLABLE state), and having the return value be stored in a stack variable called 'locked' is further confusion. 
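A rough, standalone illustration of the shared-state idea this patch describes follows; the struct shape and the two handlers are hypothetical (the real state object lives in mm/ and may look quite different) - the point is only that both callers can consume the same "an error arose but we're detached" information in their own way:

#include <stdbool.h>
#include <stdio.h>

/*
 * Hypothetical shape of the state threaded through the exclude-readers
 * helpers; illustration only.
 */
struct exclude_readers_state {
        int err;       /* set when a killable wait was interrupted */
        bool detached; /* the refcount reached zero while excluding readers */
};

/* Caller 1: write-locking an attached VMA - an error with detached set is a bug. */
static int handle_for_write_lock(const struct exclude_readers_state *state)
{
        if (state->err && state->detached)
                fprintf(stderr, "WARN: detached while write-locking?\n");
        return state->err;
}

/* Caller 2: detaching - once detached, a pending error no longer matters. */
static int handle_for_detach(const struct exclude_readers_state *state)
{
        return state->detached ? 0 : state->err;
}

int main(void)
{
        struct exclude_readers_state state = { .err = -4 /* -EINTR */, .detached = true };

        printf("write-lock caller -> %d, detach caller -> %d\n",
               handle_for_write_lock(&state), handle_for_detach(&state));
        return 0;
}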
More generally, we are doing a lot of rather finnicky things during the acquisition of a state in which readers are excluded and moving out of this state, including tracking whether we are detached or not or whether an error occurred. We are implementing logic in __vma_start_exclude_readers() that effectively acts as if 'if one caller calls us do X, if another then do Y', which is very confusing from a control flow perspective. Introducing the shared helper object state helps us avoid this, as we can now handle the 'an error arose but we're detached' condition correctly in both callers - a warning if not detaching, and treating the situation as if no error arose in the case of a VMA detaching. This also acts to help document what's going on and allows us to add some more logical debug asserts. Also update vma_mark_detached() to add a guard clause for the likely 'already detached' state (given we hold the mmap write lock), and add a comment about ephemeral VMA read lock reference count increments to clarify why we are entering/exiting an exclusive locked state here. Finally, separate vma_mark_detached() into its fast-path component and make it inline, then place the slow path for excluding readers in mmap_lock.c. No functional change intended. [akpm@linux-foundation.org: fix function naming in comments, add comment per Vlastimil per Lorenzo] Link: https://lkml.kernel.org/r/7d3084d596c84da10dd374130a5055deba6439c0.1769198904.git.lorenzo.stoakes@oracle.com Link: https://lkml.kernel.org/r/7d3084d596c84da10dd374130a5055deba6439c0.1769198904.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes Reviewed-by: Suren Baghdasaryan Cc: Boqun Feng Cc: Liam Howlett Cc: Michal Hocko Cc: Mike Rapoport Cc: Shakeel Butt Cc: Vlastimil Babka Cc: Waiman Long Cc: Sebastian Andrzej Siewior Signed-off-by: Andrew Morton --- include/linux/mm_types.h | 14 +++++++------- include/linux/mmap_lock.h | 23 ++++++++++++++++++++++- 2 files changed, 29 insertions(+), 8 deletions(-) (limited to 'include') diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 3e608d22cab0..8731606d8d36 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -1011,15 +1011,15 @@ struct vm_area_struct { * decrementing it again. * * VM_REFCNT_EXCLUDE_READERS_FLAG - Detached, pending - * __vma_exit_locked() completion which will decrement the reference - * count to zero. IMPORTANT - at this stage no further readers can - * increment the reference count. It can only be reduced. + * __vma_end_exclude_readers() completion which will decrement the + * reference count to zero. IMPORTANT - at this stage no further readers + * can increment the reference count. It can only be reduced. * * VM_REFCNT_EXCLUDE_READERS_FLAG + 1 - A thread is either write-locking - * an attached VMA and has yet to invoke __vma_exit_locked(), OR a - * thread is detaching a VMA and is waiting on a single spurious reader - * in order to decrement the reference count. IMPORTANT - as above, no - * further readers can increment the reference count. + * an attached VMA and has yet to invoke __vma_end_exclude_readers(), + * OR a thread is detaching a VMA and is waiting on a single spurious + * reader in order to decrement the reference count. IMPORTANT - as + * above, no further readers can increment the reference count. 
* * > VM_REFCNT_EXCLUDE_READERS_FLAG + 1 - A thread is either * write-locking or detaching a VMA is waiting on readers to diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h index d6df6aad3e24..678f90080fa6 100644 --- a/include/linux/mmap_lock.h +++ b/include/linux/mmap_lock.h @@ -358,7 +358,28 @@ static inline void vma_mark_attached(struct vm_area_struct *vma) refcount_set_release(&vma->vm_refcnt, 1); } -void vma_mark_detached(struct vm_area_struct *vma); +void __vma_exclude_readers_for_detach(struct vm_area_struct *vma); + +static inline void vma_mark_detached(struct vm_area_struct *vma) +{ + vma_assert_write_locked(vma); + vma_assert_attached(vma); + + /* + * The VMA still being attached (refcnt > 0) - is unlikely, because the + * vma has been already write-locked and readers can increment vm_refcnt + * only temporarily before they check vm_lock_seq, realize the vma is + * locked and drop back the vm_refcnt. That is a narrow window for + * observing a raised vm_refcnt. + * + * See the comment describing the vm_area_struct->vm_refcnt field for + * details of possible refcnt values. + */ + if (likely(!__vma_refcount_put_return(vma))) + return; + + __vma_exclude_readers_for_detach(vma); +} struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm, unsigned long address); -- cgit v1.2.3 From 22f7639f2f030e58cb55ad8438c77dfcea951fc3 Mon Sep 17 00:00:00 2001 From: Lorenzo Stoakes Date: Fri, 23 Jan 2026 20:12:18 +0000 Subject: mm/vma: improve and document __is_vma_write_locked() We don't actually need to return an output parameter providing mm sequence number, rather we can separate that out into another function - __vma_raw_mm_seqnum() - and have any callers which need to obtain that invoke that instead. The access to the raw sequence number requires that we hold the exclusive mmap lock such that we know we can't race vma_end_write_all(), so move the assert to __vma_raw_mm_seqnum() to make this requirement clear. Also while we're here, convert all of the VM_BUG_ON_VMA()'s to VM_WARN_ON_ONCE_VMA()'s in line with the convention that we do not invoke oopses when we can avoid it. [lorenzo.stoakes@oracle.com: minor tweaks, per Vlastimil] Link: https://lkml.kernel.org/r/3fa89c13-232d-4eee-86cc-96caa75c2c67@lucifer.local Link: https://lkml.kernel.org/r/ef6c415c2d2c03f529dca124ccaed66bc2f60edc.1769198904.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes Reviewed-by: Suren Baghdasaryan Cc: Boqun Feng Cc: Liam Howlett Cc: Michal Hocko Cc: Mike Rapoport Cc: Shakeel Butt Cc: Vlastimil Babka Cc: Waiman Long Cc: Sebastian Andrzej Siewior Signed-off-by: Andrew Morton --- include/linux/mmap_lock.h | 45 ++++++++++++++++++++++++--------------------- 1 file changed, 24 insertions(+), 21 deletions(-) (limited to 'include') diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h index 678f90080fa6..1746a172a81c 100644 --- a/include/linux/mmap_lock.h +++ b/include/linux/mmap_lock.h @@ -258,21 +258,31 @@ static inline void vma_end_read(struct vm_area_struct *vma) vma_refcount_put(vma); } -/* WARNING! Can only be used if mmap_lock is expected to be write-locked */ -static inline bool __is_vma_write_locked(struct vm_area_struct *vma, unsigned int *mm_lock_seq) +static inline unsigned int __vma_raw_mm_seqnum(struct vm_area_struct *vma) { + const struct mm_struct *mm = vma->vm_mm; + + /* We must hold an exclusive write lock for this access to be valid. 
*/ mmap_assert_write_locked(vma->vm_mm); + return mm->mm_lock_seq.sequence; +} +/* + * Determine whether a VMA is write-locked. Must be invoked ONLY if the mmap + * write lock is held. + * + * Returns true if write-locked, otherwise false. + */ +static inline bool __is_vma_write_locked(struct vm_area_struct *vma) +{ /* * current task is holding mmap_write_lock, both vma->vm_lock_seq and * mm->mm_lock_seq can't be concurrently modified. */ - *mm_lock_seq = vma->vm_mm->mm_lock_seq.sequence; - return (vma->vm_lock_seq == *mm_lock_seq); + return vma->vm_lock_seq == __vma_raw_mm_seqnum(vma); } -int __vma_start_write(struct vm_area_struct *vma, unsigned int mm_lock_seq, - int state); +int __vma_start_write(struct vm_area_struct *vma, int state); /* * Begin writing to a VMA. @@ -281,12 +291,10 @@ int __vma_start_write(struct vm_area_struct *vma, unsigned int mm_lock_seq, */ static inline void vma_start_write(struct vm_area_struct *vma) { - unsigned int mm_lock_seq; - - if (__is_vma_write_locked(vma, &mm_lock_seq)) + if (__is_vma_write_locked(vma)) return; - __vma_start_write(vma, mm_lock_seq, TASK_UNINTERRUPTIBLE); + __vma_start_write(vma, TASK_UNINTERRUPTIBLE); } /** @@ -305,30 +313,25 @@ static inline void vma_start_write(struct vm_area_struct *vma) static inline __must_check int vma_start_write_killable(struct vm_area_struct *vma) { - unsigned int mm_lock_seq; - - if (__is_vma_write_locked(vma, &mm_lock_seq)) + if (__is_vma_write_locked(vma)) return 0; - return __vma_start_write(vma, mm_lock_seq, TASK_KILLABLE); + + return __vma_start_write(vma, TASK_KILLABLE); } static inline void vma_assert_write_locked(struct vm_area_struct *vma) { - unsigned int mm_lock_seq; - - VM_BUG_ON_VMA(!__is_vma_write_locked(vma, &mm_lock_seq), vma); + VM_WARN_ON_ONCE_VMA(!__is_vma_write_locked(vma), vma); } static inline void vma_assert_locked(struct vm_area_struct *vma) { - unsigned int mm_lock_seq; - /* * See the comment describing the vm_area_struct->vm_refcnt field for * details of possible refcnt values. */ - VM_BUG_ON_VMA(refcount_read(&vma->vm_refcnt) <= 1 && - !__is_vma_write_locked(vma, &mm_lock_seq), vma); + VM_WARN_ON_ONCE_VMA(refcount_read(&vma->vm_refcnt) <= 1 && + !__is_vma_write_locked(vma), vma); } static inline bool vma_is_attached(struct vm_area_struct *vma) -- cgit v1.2.3 From 256c11937de0039253ee36ed7d1cabc852beae54 Mon Sep 17 00:00:00 2001 From: Lorenzo Stoakes Date: Fri, 23 Jan 2026 20:12:19 +0000 Subject: mm/vma: update vma_assert_locked() to use lockdep We can use lockdep to avoid unnecessary work here, otherwise update the code to logically evaluate all pertinent cases and share code with vma_assert_write_locked(). Make it clear here that we treat the VMA being detached at this point as a bug, this was only implicit before. Additionally, abstract references to vma->vmlock_dep_map by introducing a macro helper __vma_lockdep_map() which accesses this field if lockdep is enabled. Since lock_is_held() is specified as an extern function if lockdep is disabled, we can simply have __vma_lockdep_map() defined as NULL in this case, and then use IS_ENABLED(CONFIG_LOCKDEP) to avoid ugly ifdeffery. 
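As an aside, the refcount reasoning the reworked vma_assert_locked() performs (see the diff below) can be followed in a toy userspace model - a boolean and fprintf stand in for the write-lock check and VM_WARN_ON_ONCE_VMA(), so this is purely illustrative, not kernel code:

#include <stdbool.h>
#include <stdio.h>

/* Toy model of the vma_assert_locked() refcount decision tree. */
static void assert_locked(unsigned int refcnt, bool write_locked)
{
        /* refcnt > 1: read-locked, or write-locked with temporary readers. */
        if (refcnt > 1)
                return;

        /* refcnt == 0: the VMA is detached - asserting here at all is a bug. */
        if (!refcnt) {
                fprintf(stderr, "WARN: assert on a detached vma\n");
                return;
        }

        /* refcnt == 1: attached but unlocked unless a write lock is held. */
        if (!write_locked)
                fprintf(stderr, "WARN: vma is attached but not locked\n");
}

int main(void)
{
        assert_locked(3, false); /* read-locked: fine */
        assert_locked(1, true);  /* write-locked: fine */
        assert_locked(1, false); /* attached, unlocked: warns */
        assert_locked(0, false); /* detached: warns */
        return 0;
}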
[lorenzo.stoakes@oracle.com: add helper macro __vma_lockdep_map(), per Vlastimil] Link: https://lkml.kernel.org/r/7c4b722e-604b-4b20-8e33-03d2f8d55407@lucifer.local Link: https://lkml.kernel.org/r/538762f079cc4fa76ff8bf30a8a9525a09961451.1769198904.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes Reviewed-by: Suren Baghdasaryan Reviewed-by: Vlastimil Babka Cc: Boqun Feng Cc: Liam Howlett Cc: Michal Hocko Cc: Mike Rapoport Cc: Shakeel Butt Cc: Waiman Long Cc: Sebastian Andrzej Siewior Signed-off-by: Andrew Morton --- include/linux/mmap_lock.h | 56 ++++++++++++++++++++++++++++++++++++++++------- 1 file changed, 48 insertions(+), 8 deletions(-) (limited to 'include') diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h index 1746a172a81c..90fc32b683dd 100644 --- a/include/linux/mmap_lock.h +++ b/include/linux/mmap_lock.h @@ -78,6 +78,12 @@ static inline void mmap_assert_write_locked(const struct mm_struct *mm) #ifdef CONFIG_PER_VMA_LOCK +#ifdef CONFIG_LOCKDEP +#define __vma_lockdep_map(vma) (&vma->vmlock_dep_map) +#else +#define __vma_lockdep_map(vma) NULL +#endif + /* * VMA locks do not behave like most ordinary locks found in the kernel, so we * cannot quite have full lockdep tracking in the way we would ideally prefer. @@ -98,16 +104,16 @@ static inline void mmap_assert_write_locked(const struct mm_struct *mm) * so we utilise lockdep to do so. */ #define __vma_lockdep_acquire_read(vma) \ - lock_acquire_shared(&vma->vmlock_dep_map, 0, 1, NULL, _RET_IP_) + lock_acquire_shared(__vma_lockdep_map(vma), 0, 1, NULL, _RET_IP_) #define __vma_lockdep_release_read(vma) \ - lock_release(&vma->vmlock_dep_map, _RET_IP_) + lock_release(__vma_lockdep_map(vma), _RET_IP_) #define __vma_lockdep_acquire_exclusive(vma) \ - lock_acquire_exclusive(&vma->vmlock_dep_map, 0, 0, NULL, _RET_IP_) + lock_acquire_exclusive(__vma_lockdep_map(vma), 0, 0, NULL, _RET_IP_) #define __vma_lockdep_release_exclusive(vma) \ - lock_release(&vma->vmlock_dep_map, _RET_IP_) + lock_release(__vma_lockdep_map(vma), _RET_IP_) /* Only meaningful if CONFIG_LOCK_STAT is defined. */ #define __vma_lockdep_stat_mark_acquired(vma) \ - lock_acquired(&vma->vmlock_dep_map, _RET_IP_) + lock_acquired(__vma_lockdep_map(vma), _RET_IP_) static inline void mm_lock_seqcount_init(struct mm_struct *mm) { @@ -146,7 +152,7 @@ static inline void vma_lock_init(struct vm_area_struct *vma, bool reset_refcnt) #ifdef CONFIG_DEBUG_LOCK_ALLOC static struct lock_class_key lockdep_key; - lockdep_init_map(&vma->vmlock_dep_map, "vm_lock", &lockdep_key, 0); + lockdep_init_map(__vma_lockdep_map(vma), "vm_lock", &lockdep_key, 0); #endif if (reset_refcnt) refcount_set(&vma->vm_refcnt, 0); @@ -319,19 +325,53 @@ int vma_start_write_killable(struct vm_area_struct *vma) return __vma_start_write(vma, TASK_KILLABLE); } +/** + * vma_assert_write_locked() - assert that @vma holds a VMA write lock. + * @vma: The VMA to assert. + */ static inline void vma_assert_write_locked(struct vm_area_struct *vma) { VM_WARN_ON_ONCE_VMA(!__is_vma_write_locked(vma), vma); } +/** + * vma_assert_locked() - assert that @vma holds either a VMA read or a VMA write + * lock and is not detached. + * @vma: The VMA to assert. + */ static inline void vma_assert_locked(struct vm_area_struct *vma) { + unsigned int refcnt; + + if (IS_ENABLED(CONFIG_LOCKDEP)) { + if (!lock_is_held(__vma_lockdep_map(vma))) + vma_assert_write_locked(vma); + return; + } + /* * See the comment describing the vm_area_struct->vm_refcnt field for * details of possible refcnt values. 
*/ - VM_WARN_ON_ONCE_VMA(refcount_read(&vma->vm_refcnt) <= 1 && - !__is_vma_write_locked(vma), vma); + refcnt = refcount_read(&vma->vm_refcnt); + + /* + * In this case we're either read-locked, write-locked with temporary + * readers, or in the midst of excluding readers, all of which means + * we're locked. + */ + if (refcnt > 1) + return; + + /* It is a bug for the VMA to be detached here. */ + VM_WARN_ON_ONCE_VMA(!refcnt, vma); + + /* + * OK, the VMA has a reference count of 1 which means it is either + * unlocked and attached or write-locked, so assert that it is + * write-locked. + */ + vma_assert_write_locked(vma); } static inline bool vma_is_attached(struct vm_area_struct *vma) -- cgit v1.2.3 From 17fd82c3abe03c1e202959bb1a7c4ab448b36bef Mon Sep 17 00:00:00 2001 From: Lorenzo Stoakes Date: Fri, 23 Jan 2026 20:12:20 +0000 Subject: mm/vma: add and use vma_assert_stabilised() Sometimes we wish to assert that a VMA is stable, that is - the VMA cannot be changed underneath us. This will be the case if EITHER the VMA lock or the mmap lock is held. In order to do so, we introduce a new assert, vma_assert_stabilised() - this will make a lockdep assert if lockdep is enabled AND the VMA is read-locked. Currently lockdep tracking for VMA write locks is not implemented, so it suffices to check in this case that we have either an mmap read or write semaphore held. Note that because the VMA lock uses the non-standard vmlock_dep_map naming convention, we cannot use lockdep_assert_is_write_held(), so we have to open-code this ourselves via lockdep-asserting that lock_is_held_type(&vma->vmlock_dep_map, 0). We have to be careful here - for instance when merging a VMA, we use the mmap write lock to stabilise the examination of adjacent VMAs which might be simultaneously VMA read-locked whilst being faulted in. If we were to assert VMA read lock using lockdep we would encounter an incorrect lockdep assert. Also, we have to be careful about asserting mmap locks are held - if we try to address the above issue by first checking whether the mmap lock is held and if so asserting it via lockdep, we may find that we were raced by another thread simultaneously acquiring an mmap read lock that either we don't own (and thus can be released any time - so we are not stable) or that was indeed released since we last checked. So to deal with these complexities we end up with either a precise (if lockdep is enabled) or imprecise (if not) approach - in the first instance we assert the lock is held using lockdep and thus whether we own it. If we do own it, then the check is complete, otherwise we must check for the VMA read lock being held (a VMA write lock implies the mmap write lock, so the mmap lock suffices for this). If lockdep is not enabled we simply check if the mmap lock is held and risk a false negative (i.e. not asserting when we should). There are a couple of places in the kernel where we already do this stabilisation check - the anon_vma_name() helper in mm/madvise.c and vma_flag_set_atomic() in include/linux/mm.h, which we update to use vma_assert_stabilised(). This change abstracts these into vma_assert_stabilised(), uses lockdep if possible, and avoids a duplicate check of whether the mmap lock is held. This is also self-documenting and lays the foundations for further VMA stability checks in the code. The only functional change here is adding the lockdep check.
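The precise-versus-imprecise ordering described above boils down to a small decision tree. The standalone C model below uses plain booleans in place of lockdep_is_held()/rwsem_is_locked() and a fprintf in place of the assert, so it is only a sketch of the reasoning, not the kernel code (which follows in the diff):

#include <stdbool.h>
#include <stdio.h>

#define LOCKDEP_ENABLED 1 /* flip to 0 to model a !CONFIG_LOCKDEP build */

/* Toy model of vma_assert_stabilised(): either lock pins the VMA down. */
static void assert_stabilised(bool mmap_lock_held_by_us,
                              bool mmap_lock_held_by_anyone,
                              bool vma_lock_held)
{
        if (LOCKDEP_ENABLED) {
                /* Precise: only an mmap lock we own guarantees stability. */
                if (mmap_lock_held_by_us)
                        return;
        } else {
                /* Imprecise: someone holds the mmap lock, possibly not us. */
                if (mmap_lock_held_by_anyone)
                        return;
        }

        /* Not stabilised by the mmap lock - the VMA lock must be held. */
        if (!vma_lock_held)
                fprintf(stderr, "WARN: vma accessed without a stabilising lock\n");
}

int main(void)
{
        assert_stabilised(true, true, false);  /* our own mmap lock: stable */
        assert_stabilised(false, false, true); /* VMA read lock: stable */
        assert_stabilised(false, true, false); /* someone else's mmap lock: warns */
        return 0;
}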
Link: https://lkml.kernel.org/r/6c9e64bb2b56ddb6f806fde9237f8a00cb3a776b.1769198904.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes Reviewed-by: Vlastimil Babka Reviewed-by: Suren Baghdasaryan Cc: Boqun Feng Cc: Liam Howlett Cc: Michal Hocko Cc: Mike Rapoport Cc: Shakeel Butt Cc: Waiman Long Cc: Sebastian Andrzej Siewior Signed-off-by: Andrew Morton --- include/linux/mm.h | 5 +---- include/linux/mmap_lock.h | 52 +++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 53 insertions(+), 4 deletions(-) (limited to 'include') diff --git a/include/linux/mm.h b/include/linux/mm.h index aa90719234f1..2c6c6d00ed73 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1008,10 +1008,7 @@ static inline void vma_flag_set_atomic(struct vm_area_struct *vma, { unsigned long *bitmap = ACCESS_PRIVATE(&vma->flags, __vma_flags); - /* mmap read lock/VMA read lock must be held. */ - if (!rwsem_is_locked(&vma->vm_mm->mmap_lock)) - vma_assert_locked(vma); - + vma_assert_stabilised(vma); if (__vma_flag_atomic_valid(vma, bit)) set_bit((__force int)bit, bitmap); } diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h index 90fc32b683dd..93eca48bc443 100644 --- a/include/linux/mmap_lock.h +++ b/include/linux/mmap_lock.h @@ -374,6 +374,52 @@ static inline void vma_assert_locked(struct vm_area_struct *vma) vma_assert_write_locked(vma); } +/** + * vma_assert_stabilised() - assert that this VMA cannot be changed from + * underneath us either by having a VMA or mmap lock held. + * @vma: The VMA whose stability we wish to assess. + * + * If lockdep is enabled we can precisely ensure stability via either an mmap + * lock owned by us or a specific VMA lock. + * + * With lockdep disabled we may sometimes race with other threads acquiring the + * mmap read lock simultaneous with our VMA read lock. + */ +static inline void vma_assert_stabilised(struct vm_area_struct *vma) +{ + /* + * If another thread owns an mmap lock, it may go away at any time, and + * thus is no guarantee of stability. + * + * If lockdep is enabled we can accurately determine if an mmap lock is + * held and owned by us. Otherwise we must approximate. + * + * It doesn't necessarily mean we are not stabilised however, as we may + * hold a VMA read lock (not a write lock as this would require an owned + * mmap lock). + * + * If (assuming lockdep is not enabled) we were to assert a VMA read + * lock first we may also run into issues, as other threads can hold VMA + * read locks simlutaneous to us. + * + * Therefore if lockdep is not enabled we risk a false negative (i.e. no + * assert fired). If accurate checking is required, enable lockdep. + */ + if (IS_ENABLED(CONFIG_LOCKDEP)) { + if (lockdep_is_held(&vma->vm_mm->mmap_lock)) + return; + } else { + if (rwsem_is_locked(&vma->vm_mm->mmap_lock)) + return; + } + + /* + * We're not stabilised by the mmap lock, so assert that we're + * stabilised by a VMA lock. + */ + vma_assert_locked(vma); +} + static inline bool vma_is_attached(struct vm_area_struct *vma) { return refcount_read(&vma->vm_refcnt); @@ -476,6 +522,12 @@ static inline void vma_assert_locked(struct vm_area_struct *vma) mmap_assert_locked(vma->vm_mm); } +static inline void vma_assert_stabilised(struct vm_area_struct *vma) +{ + /* If no VMA locks, then either mmap lock suffices to stabilise. 
*/ + mmap_assert_locked(vma->vm_mm); + } + #endif /* CONFIG_PER_VMA_LOCK */ static inline void mmap_write_lock(struct mm_struct *mm) -- cgit v1.2.3 From bc617c990eae4259cd5014d596477cbe0d596417 Mon Sep 17 00:00:00 2001 From: Nhat Pham Date: Sat, 20 Dec 2025 03:43:37 +0800 Subject: mm/shmem, swap: remove SWAP_MAP_SHMEM The SWAP_MAP_SHMEM state was introduced in commit aaa468653b4a ("swap_info: note SWAP_MAP_SHMEM") to quickly determine if a swap entry belongs to shmem during swapoff. However, swapoff has since been rewritten in commit b56a2d8af914 ("mm: rid swapoff of quadratic complexity"). Now having swap count == SWAP_MAP_SHMEM is basically the same as having swap count == 1, and swap_shmem_alloc() behaves analogously to swap_duplicate(). The only difference of note is that swap_shmem_alloc() does not check for -ENOMEM returned from __swap_duplicate(), but it is OK because shmem never re-duplicates any swap entry it owns. This will still be safe if we use (batched) swap_duplicate() instead. This commit adds swap_duplicate_nr(), the batched variant of swap_duplicate(), and removes the SWAP_MAP_SHMEM state and the associated swap_shmem_alloc() helper to simplify the state machine (both mentally and in terms of actual code). We will also have an extra state/special value that can be repurposed (for swap entries that never get re-duplicated). Link: https://lkml.kernel.org/r/20251220-swap-table-p2-v5-8-8862a265a033@tencent.com Signed-off-by: Kairui Song Signed-off-by: Nhat Pham Reviewed-by: Baolin Wang Tested-by: Baolin Wang Cc: Baoquan He Cc: Barry Song Cc: Chris Li Cc: Rafael J. Wysocki (Intel) Cc: Yosry Ahmed Cc: Deepanshu Kartikey Cc: Johannes Weiner Cc: Kairui Song Signed-off-by: Andrew Morton --- include/linux/swap.h | 15 +++++++-------- 1 file changed, 7 insertions(+), 8 deletions(-) (limited to 'include') diff --git a/include/linux/swap.h b/include/linux/swap.h index 38ca3df68716..bf72b548a96d 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -230,7 +230,6 @@ enum { /* Special value in first swap_map */ #define SWAP_MAP_MAX 0x3e /* Max count */ #define SWAP_MAP_BAD 0x3f /* Note page is bad */ -#define SWAP_MAP_SHMEM 0xbf /* Owned by shmem/tmpfs */ /* Special value in each swap_map continuation */ #define SWAP_CONT_MAX 0x7f /* Max count */ @@ -458,8 +457,7 @@ bool folio_free_swap(struct folio *folio); void put_swap_folio(struct folio *folio, swp_entry_t entry); extern swp_entry_t get_swap_page_of_type(int); extern int add_swap_count_continuation(swp_entry_t, gfp_t); -extern void swap_shmem_alloc(swp_entry_t, int); -extern int swap_duplicate(swp_entry_t); +extern int swap_duplicate_nr(swp_entry_t entry, int nr); extern int swapcache_prepare(swp_entry_t entry, int nr); extern void swap_free_nr(swp_entry_t entry, int nr_pages); extern void free_swap_and_cache_nr(swp_entry_t entry, int nr); @@ -514,11 +512,7 @@ static inline int add_swap_count_continuation(swp_entry_t swp, gfp_t gfp_mask) return 0; } -static inline void swap_shmem_alloc(swp_entry_t swp, int nr) -{ -} - -static inline int swap_duplicate(swp_entry_t swp) +static inline int swap_duplicate_nr(swp_entry_t swp, int nr_pages) { return 0; } @@ -569,6 +563,11 @@ static inline int add_swap_extent(struct swap_info_struct *sis, } #endif /* CONFIG_SWAP */ +static inline int swap_duplicate(swp_entry_t entry) +{ + return swap_duplicate_nr(entry, 1); +} + static inline void free_swap_and_cache(swp_entry_t entry) { free_swap_and_cache_nr(entry, 1); -- cgit v1.2.3 From 2732acda82c93475c5986e1a5640004a5d4f9c3e Mon
Sep 17 00:00:00 2001 From: Kairui Song Date: Sat, 20 Dec 2025 03:43:41 +0800 Subject: mm, swap: use swap cache as the swap in synchronize layer Current swap in synchronization mostly uses the swap_map's SWAP_HAS_CACHE bit. Whoever sets the bit first does the actual work to swap in a folio. This has been causing many issues as it's just a poor implementation of a bit lock. Raced users have no idea what is pinning a slot, so it has to loop with a schedule_timeout_uninterruptible(1), which is ugly and causes long-tailing or other performance issues. Besides, the abuse of SWAP_HAS_CACHE has been causing many other troubles for synchronization or maintenance. This is the first step to remove this bit completely. Now all swap in paths are using the swap cache, and both the swap cache and swap map are protected by the cluster lock. So we can just resolve the swap synchronization with the swap cache layer directly using the cluster lock and folio lock. Whoever inserts a folio in the swap cache first does the swap in work. And because folios are locked during swap operations, other raced swap operations will just wait on the folio lock. The SWAP_HAS_CACHE will be removed in later commit. For now, we still set it for some remaining users. But now we do the bit setting and swap cache folio adding in the same critical section, after swap cache is ready. No one will have to spin on the SWAP_HAS_CACHE bit anymore. This both simplifies the logic and should improve the performance, eliminating issues like the one solved in commit 01626a1823024 ("mm: avoid unconditional one-tick sleep when swapcache_prepare fails"), or the "skip_if_exists" from commit a65b0e7607ccb ("zswap: make shrinking memcg-aware"), which will be removed very soon. [kasong@tencent.com: fix cgroup v1 accounting issue] Link: https://lkml.kernel.org/r/CAMgjq7CGUnzOVG7uSaYjzw9wD7w2dSKOHprJfaEp4CcGLgE3iw@mail.gmail.com Link: https://lkml.kernel.org/r/20251220-swap-table-p2-v5-12-8862a265a033@tencent.com Signed-off-by: Kairui Song Reviewed-by: Baoquan He Cc: Baolin Wang Cc: Barry Song Cc: Chris Li Cc: Nhat Pham Cc: Rafael J. Wysocki (Intel) Cc: Yosry Ahmed Cc: Deepanshu Kartikey Cc: Johannes Weiner Cc: Kairui Song Signed-off-by: Andrew Morton --- include/linux/swap.h | 6 ------ 1 file changed, 6 deletions(-) (limited to 'include') diff --git a/include/linux/swap.h b/include/linux/swap.h index bf72b548a96d..74df3004c850 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -458,7 +458,6 @@ void put_swap_folio(struct folio *folio, swp_entry_t entry); extern swp_entry_t get_swap_page_of_type(int); extern int add_swap_count_continuation(swp_entry_t, gfp_t); extern int swap_duplicate_nr(swp_entry_t entry, int nr); -extern int swapcache_prepare(swp_entry_t entry, int nr); extern void swap_free_nr(swp_entry_t entry, int nr_pages); extern void free_swap_and_cache_nr(swp_entry_t entry, int nr); int swap_type_of(dev_t device, sector_t offset); @@ -517,11 +516,6 @@ static inline int swap_duplicate_nr(swp_entry_t swp, int nr_pages) return 0; } -static inline int swapcache_prepare(swp_entry_t swp, int nr) -{ - return 0; -} - static inline void swap_free_nr(swp_entry_t entry, int nr_pages) { } -- cgit v1.2.3 From 36976159140bc288c3752a9b799090a49f1a8b62 Mon Sep 17 00:00:00 2001 From: Kairui Song Date: Sat, 20 Dec 2025 03:43:43 +0800 Subject: mm, swap: cleanup swap entry management workflow The current swap entry allocation/freeing workflow has never had a clear definition. This makes it hard to debug or add new optimizations. 
This commit introduces a proper definition of how swap entries are allocated and freed. Now, most operations are folio-based, so they will never exceed one swap cluster, and we now have a cleaner border between swap and the rest of mm, making it much easier to follow and debug, especially with the newly added sanity checks. This also makes more optimizations possible. Swap entries will now mostly be duped and freed with a folio bound. The folio lock will be useful for resolving many swap-related races. Now swap allocation (except hibernation) always starts with a folio in the swap cache, and entries are duped/freed under the protection of the folio lock: - folio_alloc_swap() - The only allocation entry point now. Context: The folio must be locked. This allocates one or a set of contiguous swap slots for a folio and binds them to the folio by adding the folio to the swap cache. The swap slots' swap count starts at zero. - folio_dup_swap() - Increase the swap count of one or more entries. Context: The folio must be locked and in the swap cache. For now, the caller still has to lock the new swap entry owner (e.g., PTL). This increases the ref count of swap entries allocated to a folio. Newly allocated swap slots' count has to be increased by this helper as the folio gets unmapped (and swap entries get installed). - folio_put_swap() - Decrease the swap count of one or more entries. Context: The folio must be locked and in the swap cache. For now, the caller still has to lock the new swap entry owner (e.g., PTL). This decreases the ref count of swap entries allocated to a folio. Typically, swapin will decrease the swap count as the folio gets installed back and the swap entry gets uninstalled. This won't remove the folio from the swap cache or free the slot. Lazy freeing of the swap cache is helpful for reducing IO. There is already a folio_free_swap() for immediate cache reclaim. This part could be further optimized later. The above locking constraints could be further relaxed when the swap table is fully implemented. Currently, dup still needs the caller to lock the swap entry container (e.g. PTL), or a concurrent zap may underflow the swap count. Some swap users need to interact with the swap count without involving a folio (e.g. forking/zapping the page table or truncating a mapping without swapin). In such cases, the caller has to ensure there is no race condition on whatever owns the swap count and use the helpers below: - swap_put_entries_direct() - Decrease the swap count directly. Context: The caller must lock whatever is referencing the slots to avoid a race. Typically, page table zapping or shmem mapping truncation will need to free swap slots directly. If a slot is cached (has a folio bound), this will also try to release the swap cache. - swap_dup_entry_direct() - Increase the swap count directly. Context: The caller must lock whatever is referencing the entries to avoid a race, and the entries must already have a swap count > 1. Typically, forking will need to copy the page table and hence needs to increase the swap count of the entries in the table. The page table is locked while referencing the swap entries, so the entries all have a swap count > 1 and can't be freed. The hibernation subsystem is a bit different, so two special wrappers are provided: - swap_alloc_hibernation_slot() - Allocate one entry from one device. - swap_free_hibernation_slot() - Free one entry allocated by the above helper. All hibernation entries are exclusive to the hibernation subsystem and should not interact with ordinary swap routines.
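As a rough ordering sketch of the folio-bound helpers above (argument lists are simplified here and error handling is omitted - see the in-tree definitions for the exact signatures), the lifecycle compressed into one place looks like:

static void folio_swap_lifecycle_sketch(struct folio *folio)
{
        folio_lock(folio);

        /* Allocation: slots start at count 0, folio enters the swap cache. */
        folio_alloc_swap(folio);

        /* Unmap path (PTL held by the caller): count goes 0 -> 1 per installed entry. */
        folio_dup_swap(folio);

        /*
         * Swapin path (PTL held by the caller): count drops back; the swap
         * cache itself is only reclaimed lazily, e.g. via folio_free_swap().
         */
        folio_put_swap(folio);

        folio_unlock(folio);
}

Callers that only hold a swap entry and no folio (fork, zap, shmem truncate) go through swap_dup_entry_direct()/swap_put_entries_direct() instead, relying on the page table lock or mapping lock for serialisation.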
By separating the workflows, it will be possible to bind folio more tightly with swap cache and get rid of the SWAP_HAS_CACHE as a temporary pin. This commit should not introduce any behavior change [kasong@tencent.com: fix leak, per Chris Mason. Remove WARN_ON, per Lai Yi] Link: https://lkml.kernel.org/r/CAMgjq7AUz10uETVm8ozDWcB3XohkOqf0i33KGrAquvEVvfp5cg@mail.gmail.com [ryncsn@gmail.com: fix KSM copy pages for swapoff, per Chris] Link: https://lkml.kernel.org/r/aXxkANcET3l2Xu6J@KASONG-MC4 Link: https://lkml.kernel.org/r/20251220-swap-table-p2-v5-14-8862a265a033@tencent.com Signed-off-by: Kairui Song Signed-off-by: Kairui Song Acked-by: Rafael J. Wysocki (Intel) Reviewed-by: Baoquan He Cc: Baolin Wang Cc: Barry Song Cc: Chris Li Cc: Nhat Pham Cc: Yosry Ahmed Cc: Deepanshu Kartikey Cc: Johannes Weiner Cc: Kairui Song Cc: Chris Mason Cc: Chris Mason Cc: Lai Yi Signed-off-by: Andrew Morton --- include/linux/swap.h | 58 ++++++++++++++++++++++------------------------------ 1 file changed, 25 insertions(+), 33 deletions(-) (limited to 'include') diff --git a/include/linux/swap.h b/include/linux/swap.h index 74df3004c850..aaa868f60b9c 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -452,14 +452,8 @@ static inline long get_nr_swap_pages(void) } extern void si_swapinfo(struct sysinfo *); -int folio_alloc_swap(struct folio *folio); -bool folio_free_swap(struct folio *folio); void put_swap_folio(struct folio *folio, swp_entry_t entry); -extern swp_entry_t get_swap_page_of_type(int); extern int add_swap_count_continuation(swp_entry_t, gfp_t); -extern int swap_duplicate_nr(swp_entry_t entry, int nr); -extern void swap_free_nr(swp_entry_t entry, int nr_pages); -extern void free_swap_and_cache_nr(swp_entry_t entry, int nr); int swap_type_of(dev_t device, sector_t offset); int find_first_swap(dev_t *device); extern unsigned int count_swap_pages(int, int); @@ -471,6 +465,29 @@ struct backing_dev_info; extern struct swap_info_struct *get_swap_device(swp_entry_t entry); sector_t swap_folio_sector(struct folio *folio); +/* + * If there is an existing swap slot reference (swap entry) and the caller + * guarantees that there is no race modification of it (e.g., PTL + * protecting the swap entry in page table; shmem's cmpxchg protects t + * he swap entry in shmem mapping), these two helpers below can be used + * to put/dup the entries directly. + * + * All entries must be allocated by folio_alloc_swap(). And they must have + * a swap count > 1. See comments of folio_*_swap helpers for more info. + */ +int swap_dup_entry_direct(swp_entry_t entry); +void swap_put_entries_direct(swp_entry_t entry, int nr); + +/* + * folio_free_swap tries to free the swap entries pinned by a swap cache + * folio, it has to be here to be called by other components. 
+ */ +bool folio_free_swap(struct folio *folio); + +/* Allocate / free (hibernation) exclusive entries */ +swp_entry_t swap_alloc_hibernation_slot(int type); +void swap_free_hibernation_slot(swp_entry_t entry); + static inline void put_swap_device(struct swap_info_struct *si) { percpu_ref_put(&si->users); @@ -498,10 +515,6 @@ static inline void put_swap_device(struct swap_info_struct *si) #define free_pages_and_swap_cache(pages, nr) \ release_pages((pages), (nr)); -static inline void free_swap_and_cache_nr(swp_entry_t entry, int nr) -{ -} - static inline void free_swap_cache(struct folio *folio) { } @@ -511,12 +524,12 @@ static inline int add_swap_count_continuation(swp_entry_t swp, gfp_t gfp_mask) return 0; } -static inline int swap_duplicate_nr(swp_entry_t swp, int nr_pages) +static inline int swap_dup_entry_direct(swp_entry_t ent) { return 0; } -static inline void swap_free_nr(swp_entry_t entry, int nr_pages) +static inline void swap_put_entries_direct(swp_entry_t ent, int nr) { } @@ -539,11 +552,6 @@ static inline int swp_swapcount(swp_entry_t entry) return 0; } -static inline int folio_alloc_swap(struct folio *folio) -{ - return -EINVAL; -} - static inline bool folio_free_swap(struct folio *folio) { return false; @@ -556,22 +564,6 @@ static inline int add_swap_extent(struct swap_info_struct *sis, return -EINVAL; } #endif /* CONFIG_SWAP */ - -static inline int swap_duplicate(swp_entry_t entry) -{ - return swap_duplicate_nr(entry, 1); -} - -static inline void free_swap_and_cache(swp_entry_t entry) -{ - free_swap_and_cache_nr(entry, 1); -} - -static inline void swap_free(swp_entry_t entry) -{ - swap_free_nr(entry, 1); -} - #ifdef CONFIG_MEMCG static inline int mem_cgroup_swappiness(struct mem_cgroup *memcg) { -- cgit v1.2.3 From 270f095179ff15b7c72f25dd6720dcab3d15cc9b Mon Sep 17 00:00:00 2001 From: Kairui Song Date: Sat, 20 Dec 2025 03:43:44 +0800 Subject: mm, swap: add folio to swap cache directly on allocation The allocator uses SWAP_HAS_CACHE to pin a swap slot upon allocation. SWAP_HAS_CACHE is being deprecated as it caused a lot of confusion. This pinning usage here can be dropped by adding the folio to swap cache directly on allocation. All swap allocations are folio-based now (except for hibernation), so the swap allocator can always take the folio as the parameter. And now both swap cache (swap table) and swap map are protected by the cluster lock, scanning the map and inserting the folio can be done in the same critical section. This eliminates the time window that a slot is pinned by SWAP_HAS_CACHE, but it has no cache, and avoids touching the lock multiple times. This is both a cleanup and an optimization. Link: https://lkml.kernel.org/r/20251220-swap-table-p2-v5-15-8862a265a033@tencent.com Signed-off-by: Kairui Song Reviewed-by: Baoquan He Cc: Baolin Wang Cc: Barry Song Cc: Chris Li Cc: Nhat Pham Cc: Rafael J. 
Wysocki (Intel) Cc: Yosry Ahmed Cc: Deepanshu Kartikey Cc: Johannes Weiner Cc: Kairui Song Signed-off-by: Andrew Morton --- include/linux/swap.h | 5 ----- 1 file changed, 5 deletions(-) (limited to 'include') diff --git a/include/linux/swap.h b/include/linux/swap.h index aaa868f60b9c..517d24e96d8c 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -452,7 +452,6 @@ static inline long get_nr_swap_pages(void) } extern void si_swapinfo(struct sysinfo *); -void put_swap_folio(struct folio *folio, swp_entry_t entry); extern int add_swap_count_continuation(swp_entry_t, gfp_t); int swap_type_of(dev_t device, sector_t offset); int find_first_swap(dev_t *device); @@ -533,10 +532,6 @@ static inline void swap_put_entries_direct(swp_entry_t ent, int nr) { } -static inline void put_swap_folio(struct folio *folio, swp_entry_t swp) -{ -} - static inline int __swap_count(swp_entry_t entry) { return 0; -- cgit v1.2.3 From d3852f9692b8a6af7566f92f7432ee5067c6be15 Mon Sep 17 00:00:00 2001 From: Kairui Song Date: Sat, 20 Dec 2025 03:43:47 +0800 Subject: mm, swap: drop the SWAP_HAS_CACHE flag Now, the swap cache is managed by the swap table. All swap cache users are checking the swap table directly to check the swap cache state. SWAP_HAS_CACHE is now just a temporary pin before the first increase from 0 to 1 of a slot's swap count (swap_dup_entries) after swap allocation (folio_alloc_swap), or before the final free of slots pinned by folio in swap cache (put_swap_folio). Drop these two usages. For the first dup, SWAP_HAS_CACHE pinning was hard to kill because it used to have multiple meanings, more than just "a slot is cached". We have just simplified that and defined that the first dup is always done with folio locked in swap cache (folio_dup_swap), so stop checking the SWAP_HAS_CACHE bit and just check the swap cache (swap table) directly, and add a WARN if a swap entry's count is being increased for the first time while the folio is not in swap cache. As for freeing, just let the swap cache free all swap entries of a folio that have a swap count of zero directly upon folio removal. We have also just cleaned up batch freeing to check the swap cache usage using the swap table: a slot with swap cache in the swap table will not be freed until its cache is gone, and no SWAP_HAS_CACHE bit is involved anymore. And besides, the removal of a folio and freeing of the slots are being done in the same critical section now, which should improve the performance. After these two changes, SWAP_HAS_CACHE no longer has any users. Swap cache synchronization is also done by the swap table directly, so using SWAP_HAS_CACHE to pin a slot before adding the cache is also no longer needed. Remove all related logic and helpers. swap_map is now only used for tracking the count, so all swap_map users can just read it directly, ignoring the swap_count helper, which was previously used to filter out the SWAP_HAS_CACHE bit. The idea of dropping SWAP_HAS_CACHE and using the swap table directly was initially from Chris's idea of merging all the metadata usage of all swaps into one place. Link: https://lkml.kernel.org/r/20251220-swap-table-p2-v5-18-8862a265a033@tencent.com Signed-off-by: Kairui Song Suggested-by: Chris Li Reviewed-by: Baoquan He Cc: Baolin Wang Cc: Barry Song Cc: Nhat Pham Cc: Rafael J. 
Wysocki (Intel) Cc: Yosry Ahmed Cc: Deepanshu Kartikey Cc: Johannes Weiner Cc: Kairui Song Signed-off-by: Andrew Morton --- include/linux/swap.h | 1 - 1 file changed, 1 deletion(-) (limited to 'include') diff --git a/include/linux/swap.h b/include/linux/swap.h index 517d24e96d8c..62fc7499b408 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -224,7 +224,6 @@ enum { #define COMPACT_CLUSTER_MAX SWAP_CLUSTER_MAX /* Bit flag in swap_map */ -#define SWAP_HAS_CACHE 0x40 /* Flag page is cached, in first swap_map */ #define COUNT_CONTINUED 0x80 /* Flag swap_map continuation for full count */ /* Special value in first swap_map */ -- cgit v1.2.3 From 086498aed3f68febb58df7e6141962942abb8944 Mon Sep 17 00:00:00 2001 From: Qi Zheng Date: Tue, 27 Jan 2026 20:13:00 +0800 Subject: mm: convert __HAVE_ARCH_TLB_REMOVE_TABLE to CONFIG_HAVE_ARCH_TLB_REMOVE_TABLE config For architectures that define __HAVE_ARCH_TLB_REMOVE_TABLE, the page tables at the pmd/pud level are generally not of struct ptdesc type, and do not have pt_rcu_head member, thus these architectures cannot support PT_RECLAIM. In preparation for enabling PT_RECLAIM on more architectures, convert __HAVE_ARCH_TLB_REMOVE_TABLE to CONFIG_HAVE_ARCH_TLB_REMOVE_TABLE config, so that we can make conditional judgments in Kconfig. Link: https://lkml.kernel.org/r/5ebfa3d4b56e63c6906bda5eccaa9f7194d3a86b.1769515122.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng Acked-by: David Hildenbrand (Arm) Tested-by: Andreas Larsson [sparc, UP&SMP] Acked-by: Andreas Larsson [sparc] Cc: "Aneesh Kumar K.V" Cc: Anton Ivanov Cc: Borislav Petkov Cc: Dave Hansen Cc: Dev Jain Cc: Helge Deller Cc: "H. Peter Anvin" Cc: Huacai Chen Cc: Ingo Molnar Cc: "James E.J. Bottomley" Cc: Johannes Berg Cc: Lance Yang Cc: "Liam R. Howlett" Cc: Lorenzo Stoakes Cc: Magnus Lindholm Cc: Matt Turner Cc: Michal Hocko Cc: Mike Rapoport Cc: Nicholas Piggin Cc: Peter Zijlstra Cc: Richard Henderson Cc: Richard Weinberger Cc: Suren Baghdasaryan Cc: Thomas Bogendoerfer Cc: Thomas Gleixner Cc: Vlastimil Babka Cc: WANG Xuerui Cc: Wei Yang Cc: Will Deacon Signed-off-by: Andrew Morton --- include/asm-generic/tlb.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'include') diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h index 3975f7d11553..4aeac0c3d3f0 100644 --- a/include/asm-generic/tlb.h +++ b/include/asm-generic/tlb.h @@ -213,7 +213,7 @@ struct mmu_table_batch { #define MAX_TABLE_BATCH \ ((PAGE_SIZE - sizeof(struct mmu_table_batch)) / sizeof(void *)) -#ifndef __HAVE_ARCH_TLB_REMOVE_TABLE +#ifndef CONFIG_HAVE_ARCH_TLB_REMOVE_TABLE static inline void __tlb_remove_table(void *table) { struct ptdesc *ptdesc = (struct ptdesc *)table; -- cgit v1.2.3
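To illustrate what the conversion above means for an architecture that provides its own table freeing (the arch name and file below are hypothetical; the real per-arch changes live in the rest of the series), the preprocessor opt-in next to the override is replaced by a Kconfig select:

/*
 * Hypothetical arch/foo/include/asm/tlb.h. Previously the opt-in was a
 * define placed next to the override:
 *
 *	#define __HAVE_ARCH_TLB_REMOVE_TABLE
 *
 * Now the define is dropped and arch/foo/Kconfig does
 * "select HAVE_ARCH_TLB_REMOVE_TABLE" instead, so the generic fallback in
 * <asm-generic/tlb.h> (guarded by CONFIG_HAVE_ARCH_TLB_REMOVE_TABLE in the
 * hunk above) compiles out and only this override is built.
 */
static inline void __tlb_remove_table(void *table)
{
        /* Arch-specific freeing of the page table page backing @table. */
}

Expressing the opt-in in Kconfig is what lets other options, such as PT_RECLAIM, take it into account at Kconfig level, which a bare preprocessor define could not do.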