path: root/mm
Age  Commit message  Author
34 hours  Merge tag 'mm-hotfixes-stable-2026-01-29-09-41' of ↵  Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "16 hotfixes. 9 are cc:stable, 12 are for MM. There's a patch series from Pratyush Yadav which fixes a few things in the new-in-6.19 LUO memfd code. Plus the usual shower of singletons - please see the changelogs for details" * tag 'mm-hotfixes-stable-2026-01-29-09-41' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: vmcoreinfo: make hwerr_data visible for debugging mm/zone_device: reinitialize large zone device private folios mm/mm_init: don't cond_resched() in deferred_init_memmap_chunk() if called from deferred_grow_zone() mm/kfence: randomize the freelist on initialization kho: kho_preserve_vmalloc(): don't return 0 when ENOMEM kho: init alloc tags when restoring pages from reserved memory mm: memfd_luo: restore and free memfd_luo_ser on failure mm: memfd_luo: use memfd_alloc_file() instead of shmem_file_setup() memfd: export alloc_file() flex_proportions: make fprop_new_period() hardirq safe mailmap: add entry for Viacheslav Bocharov mm/memory-failure: teach kill_accessing_process to accept hugetlb tail page pfn mm/memory-failure: fix missing ->mf_stats count in hugetlb poison mm, swap: restore swap_space attr to avoid kernel panic mm/kasan: fix KASAN poisoning in vrealloc() mm/shmem, swap: fix race of truncate and swap entry split
4 days  mm/zone_device: reinitialize large zone device private folios  Matthew Brost
Reinitialize metadata for large zone device private folios in zone_device_page_init prior to creating a higher-order zone device private folio. This step is necessary when the folio's order changes dynamically between zone_device_page_init calls to avoid building a corrupt folio. As part of the metadata reinitialization, the dev_pagemap must be passed in from the caller because the pgmap stored in the folio page may have been overwritten with a compound head. Without this fix, individual pages could have invalid pgmap fields and flags (with PG_locked being notably problematic) due to prior different order allocations, which can, and will, result in kernel crashes. Link: https://lkml.kernel.org/r/20260116111325.1736137-2-francois.dugast@intel.com Fixes: d245f9b4ab80 ("mm/zone_device: support large zone device private folios") Signed-off-by: Matthew Brost <matthew.brost@intel.com> Signed-off-by: Francois Dugast <francois.dugast@intel.com> Acked-by: Felix Kuehling <felix.kuehling@amd.com> Reviewed-by: Balbir Singh <balbirs@nvidia.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Zi Yan <ziy@nvidia.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: "Christophe Leroy (CS GROUP)" <chleroy@kernel.org> Cc: Alex Deucher <alexander.deucher@amd.com> Cc: "Christian König" <christian.koenig@amd.com> Cc: David Airlie <airlied@gmail.com> Cc: Simona Vetter <simona@ffwll.ch> Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com> Cc: Maxime Ripard <mripard@kernel.org> Cc: Thomas Zimmermann <tzimmermann@suse.de> Cc: Lyude Paul <lyude@redhat.com> Cc: Danilo Krummrich <dakr@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Leon Romanovsky <leon@kernel.org> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
4 days  mm/mm_init: don't cond_resched() in deferred_init_memmap_chunk() if called ↵  Waiman Long
from deferred_grow_zone() Commit 3acb913c9d5b ("mm/mm_init: use deferred_init_memmap_chunk() in deferred_grow_zone()") made deferred_grow_zone() call deferred_init_memmap_chunk() within a pgdat_resize_lock() critical section with irqs disabled. It did check for irqs_disabled() in deferred_init_memmap_chunk() to avoid calling cond_resched(). For a PREEMPT_RT kernel build, however, spin_lock_irqsave() does not disable interrupts; instead, rcu_read_lock() is taken. This leads to the following bug report: BUG: sleeping function called from invalid context at mm/mm_init.c:2091 in_atomic(): 0, irqs_disabled(): 0, non_block: 0, pid: 1, name: swapper/0 preempt_count: 0, expected: 0 RCU nest depth: 1, expected: 0 3 locks held by swapper/0/1: #0: ffff80008471b7a0 (sched_domains_mutex){+.+.}-{4:4}, at: sched_domains_mutex_lock+0x28/0x40 #1: ffff003bdfffef48 (&pgdat->node_size_lock){+.+.}-{3:3}, at: deferred_grow_zone+0x140/0x278 #2: ffff800084acf600 (rcu_read_lock){....}-{1:3}, at: rt_spin_lock+0x1b4/0x408 CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Tainted: G W 6.19.0-rc6-test #1 PREEMPT_{RT,(full) } Tainted: [W]=WARN Call trace: show_stack+0x20/0x38 (C) dump_stack_lvl+0xdc/0xf8 dump_stack+0x1c/0x28 __might_resched+0x384/0x530 deferred_init_memmap_chunk+0x560/0x688 deferred_grow_zone+0x190/0x278 _deferred_grow_zone+0x18/0x30 get_page_from_freelist+0x780/0xf78 __alloc_frozen_pages_noprof+0x1dc/0x348 alloc_slab_page+0x30/0x110 allocate_slab+0x98/0x2a0 new_slab+0x4c/0x80 ___slab_alloc+0x5a4/0x770 __slab_alloc.constprop.0+0x88/0x1e0 __kmalloc_node_noprof+0x2c0/0x598 __sdt_alloc+0x3b8/0x728 build_sched_domains+0xe0/0x1260 sched_init_domains+0x14c/0x1c8 sched_init_smp+0x9c/0x1d0 kernel_init_freeable+0x218/0x358 kernel_init+0x28/0x208 ret_from_fork+0x10/0x20 Fix it by adding a new argument to deferred_init_memmap_chunk() to explicitly tell it whether cond_resched() is allowed, instead of relying on current state information which may vary depending on the exact kernel configuration options that are enabled. Link: https://lkml.kernel.org/r/20260122184343.546627-1-longman@redhat.com Fixes: 3acb913c9d5b ("mm/mm_init: use deferred_init_memmap_chunk() in deferred_grow_zone()") Signed-off-by: Waiman Long <longman@redhat.com> Suggested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: "Paul E . McKenney" <paulmck@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
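A minimal sketch of the shape of this fix, assuming a boolean parameter; the name may_resched and the abridged body are illustrative, not the exact upstream diff:

    /*
     * The caller states explicitly whether sleeping is allowed, instead of
     * the callee guessing from irqs_disabled(), which is misleading on
     * PREEMPT_RT where spin_lock_irqsave() leaves interrupts enabled.
     */
    static unsigned long deferred_init_memmap_chunk(unsigned long start_pfn,
                                                    unsigned long end_pfn,
                                                    struct zone *zone,
                                                    bool may_resched)
    {
            unsigned long nr_pages = 0;

            /* ... initialise struct pages for [start_pfn, end_pfn) ... */

            if (may_resched)
                    cond_resched();

            return nr_pages;
    }

    /* deferred_grow_zone(): runs under pgdat_resize_lock(), must not sleep. */
    nr_pages = deferred_init_memmap_chunk(spfn, epfn, zone, false);

    /* deferred_init_memmap(): plain kthread context, may sleep. */
    nr_pages = deferred_init_memmap_chunk(spfn, epfn, zone, true);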
4 days  mm/kfence: randomize the freelist on initialization  Pimyn Girgis
Randomize the KFENCE freelist during pool initialization to make allocation patterns less predictable. This is achieved by shuffling the order in which metadata objects are added to the freelist using get_random_u32_below(). Additionally, ensure the error path correctly calculates the address range to be reset if initialization fails, as the address increment logic has been moved to a separate loop. Link: https://lkml.kernel.org/r/20260120161510.3289089-1-pimyn@google.com Fixes: 0ce20dd84089 ("mm: add Kernel Electric-Fence infrastructure") Signed-off-by: Pimyn Girgis <pimyn@google.com> Reviewed-by: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Marco Elver <elver@google.com> Cc: Ernesto Martínez García <ernesto.martinezgarcia@tugraz.at> Cc: Greg KH <gregkh@linuxfoundation.org> Cc: Kees Cook <kees@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
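The shuffle itself can be sketched as follows, assuming an index array idx[] (illustrative; the real code in mm/kfence/core.c builds the freelist from kfence_metadata[]):

    unsigned int i, j;
    static u32 idx[CONFIG_KFENCE_NUM_OBJECTS];

    for (i = 0; i < CONFIG_KFENCE_NUM_OBJECTS; i++)
            idx[i] = i;

    /* Fisher-Yates shuffle using the kernel's bounded RNG helper. */
    for (i = CONFIG_KFENCE_NUM_OBJECTS - 1; i > 0; i--) {
            j = get_random_u32_below(i + 1);
            swap(idx[i], idx[j]);
    }

    /* Add the metadata objects to the freelist in randomized order. */
    for (i = 0; i < CONFIG_KFENCE_NUM_OBJECTS; i++)
            list_add_tail(&kfence_metadata[idx[i]].list, &kfence_freelist);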
4 days  mm: memfd_luo: restore and free memfd_luo_ser on failure  Pratyush Yadav (Google)
memfd_luo_ser has the serialization metadata. It is of no use once restoration fails. Free it on failure. Link: https://lkml.kernel.org/r/20260122151842.4069702-4-pratyush@kernel.org Fixes: b3749f174d68 ("mm: memfd_luo: allow preserving memfd") Signed-off-by: Pratyush Yadav (Google) <pratyush@kernel.org> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
4 days  mm: memfd_luo: use memfd_alloc_file() instead of shmem_file_setup()  Pratyush Yadav (Google)
When restoring a memfd, the file is created using shmem_file_setup(). While memfd creation also calls this function to get the file, it also does other things: 1. The O_LARGEFILE flag is set on the file. If this is not done, writes on the memfd exceeding 2 GiB fail. 2. FMODE_LSEEK, FMODE_PREAD, and FMODE_PWRITE are set on the file. This makes sure the file is seekable and can be used with pread() and pwrite(). 3. The security field for the inode is initialized, and inode creation is checked against the security module. Currently, none of those things are done. This means writes above 2 GiB fail, pread() and pwrite() fail, and so on. lseek() happens to work because file_init_path() sets FMODE_LSEEK, since shmem defines fop->llseek. Fix this by using memfd_alloc_file() to get the file, to make sure the initialization sequence for normal and preserved memfds is the same. Link: https://lkml.kernel.org/r/20260122151842.4069702-3-pratyush@kernel.org Fixes: b3749f174d68 ("mm: memfd_luo: allow preserving memfd") Signed-off-by: Pratyush Yadav (Google) <pratyush@kernel.org> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
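Schematically, the change on the restore path (memfd_alloc_file()'s exact signature is an assumption here; the point is that the common memfd setup now runs for restored files too):

    /* Before: bare shmem file, no memfd-specific initialization. */
    file = shmem_file_setup(name, size, flags);

    /*
     * After: the shared memfd helper also sets O_LARGEFILE, FMODE_LSEEK,
     * FMODE_PREAD and FMODE_PWRITE, and runs the LSM inode hooks, so a
     * restored memfd behaves like a freshly created one.
     */
    file = memfd_alloc_file(name, flags);    /* signature illustrative */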
4 days  memfd: export alloc_file()  Pratyush Yadav (Google)
Patch series "mm: memfd_luo hotfixes". This series contains a couple of fixes for memfd preservation using LUO. This patch (of 3): The Live Update Orchestrator's (LUO) memfd preservation works by preserving all the folios of a memfd, re-creating an empty memfd on the next boot, and then inserting back the preserved folios. Currently it creates the file by directly calling shmem_file_setup(). This leaves out other work done by alloc_file() like setting up the file mode, flags, or calling the security hooks. Export alloc_file() to let memfd_luo use it. Rename it to memfd_alloc_file() since it is no longer private and thus needs a subsystem prefix. Link: https://lkml.kernel.org/r/20260122151842.4069702-1-pratyush@kernel.org Link: https://lkml.kernel.org/r/20260122151842.4069702-2-pratyush@kernel.org Signed-off-by: Pratyush Yadav (Google) <pratyush@kernel.org> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
4 days  mm/memory-failure: teach kill_accessing_process to accept hugetlb tail page pfn  Jane Chu
When a hugetlb folio is being poisoned again, try_memory_failure_hugetlb() passes the head pfn to kill_accessing_process(), which is not right. The precise pfn of the poisoned page should be used in order to determine the precise vaddr as the SIGBUS payload. This issue has already been taken care of in the normal path, that is, hwpoison_user_mappings(), see [1][2]. Furthermore, for [3] to work correctly in the hugetlb repoisoning case, it's essential to inform the VM of the precise poisoned page, not the head page. [1] https://lkml.kernel.org/r/20231218135837.3310403-1-willy@infradead.org [2] https://lkml.kernel.org/r/20250224211445.2663312-1-jane.chu@oracle.com [3] https://lore.kernel.org/lkml/20251116013223.1557158-1-jiaqiyan@google.com/ Link: https://lkml.kernel.org/r/20260120232234.3462258-2-jane.chu@oracle.com Signed-off-by: Jane Chu <jane.chu@oracle.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Acked-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Chris Mason <clm@meta.com> Cc: David Hildenbrand <david@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Jiaqi Yan <jiaqiyan@google.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Suren Baghdasaryan <surenb@google.com> Cc: William Roche <william.roche@oracle.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
4 days  mm/memory-failure: fix missing ->mf_stats count in hugetlb poison  Jane Chu
When a newly poisoned subpage ends up in an already poisoned hugetlb folio, 'num_poisoned_pages' is incremented, but the per node ->mf_stats is not. Fix the inconsistency by designating action_result() to update them both. While at it, define __get_huge_page_for_hwpoison() return values in terms of symbol names for better readability. Also rename folio_set_hugetlb_hwpoison() to hugetlb_update_hwpoison(), since the function does more than conventional bit setting, and three possible return values are expected. Link: https://lkml.kernel.org/r/20260120232234.3462258-1-jane.chu@oracle.com Fixes: 18f41fa616ee ("mm: memory-failure: bump memory failure stats to pglist_data") Signed-off-by: Jane Chu <jane.chu@oracle.com> Acked-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Chris Mason <clm@meta.com> Cc: David Hildenbrand <david@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Jiaqi Yan <jiaqiyan@google.com> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Suren Baghdasaryan <surenb@google.com> Cc: William Roche <william.roche@oracle.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
4 days  mm, swap: restore swap_space attr to avoid kernel panic  robin.kuo
commit 8b47299a411a ("mm, swap: mark swap address space ro and add context debug check") made the swap address space read-only. It may lead to a kernel panic if arch_prepare_to_swap() returns a failure under heavy memory pressure, as follows: el1_abort+0x40/0x64 el1h_64_sync_handler+0x48/0xcc el1h_64_sync+0x84/0x88 errseq_set+0x4c/0xb8 (P) __filemap_set_wb_err+0x20/0xd0 shrink_folio_list+0xc20/0x11cc evict_folios+0x1520/0x1be4 try_to_shrink_lruvec+0x27c/0x3dc shrink_one+0x9c/0x228 shrink_node+0xb3c/0xeac do_try_to_free_pages+0x170/0x4f0 try_to_free_pages+0x334/0x534 __alloc_pages_direct_reclaim+0x90/0x158 __alloc_pages_slowpath+0x334/0x588 __alloc_frozen_pages_noprof+0x224/0x2fc __folio_alloc_noprof+0x14/0x64 vma_alloc_zeroed_movable_folio+0x34/0x44 do_pte_missing+0xad4/0x1040 handle_mm_fault+0x4a4/0x790 do_page_fault+0x288/0x5f8 do_translation_fault+0x38/0x54 do_mem_abort+0x54/0xa8 Restore the swap address space as not read-only to avoid the panic. Link: https://lkml.kernel.org/r/20260116062535.306453-2-robin.kuo@mediatek.com Fixes: 8b47299a411a ("mm, swap: mark swap address space ro and add context debug check") Signed-off-by: robin.kuo <robin.kuo@mediatek.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: andrew.yang <andrew.yang@mediatek.com> Cc: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Chinwen Chang <chinwen.chang@mediatek.com> Cc: Chris Li <chrisl@kernel.org> Cc: Kairui Song <kasong@tencent.com> Cc: Kairui Song <ryncsn@gmail.com> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Matthias Brugger <matthias.bgg@gmail.com> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Qun-wei Lin <Qun-wei.Lin@mediatek.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
4 days  mm/kasan: fix KASAN poisoning in vrealloc()  Andrey Ryabinin
A KASAN warning can be triggered when vrealloc() changes the requested size to a value that is not aligned to KASAN_GRANULE_SIZE. ------------[ cut here ]------------ WARNING: CPU: 2 PID: 1 at mm/kasan/shadow.c:174 kasan_unpoison+0x40/0x48 ... pc : kasan_unpoison+0x40/0x48 lr : __kasan_unpoison_vmalloc+0x40/0x68 Call trace: kasan_unpoison+0x40/0x48 (P) vrealloc_node_align_noprof+0x200/0x320 bpf_patch_insn_data+0x90/0x2f0 convert_ctx_accesses+0x8c0/0x1158 bpf_check+0x1488/0x1900 bpf_prog_load+0xd20/0x1258 __sys_bpf+0x96c/0xdf0 __arm64_sys_bpf+0x50/0xa0 invoke_syscall+0x90/0x160 Introduce a dedicated kasan_vrealloc() helper that centralizes KASAN handling for vmalloc reallocations. The helper accounts for KASAN granule alignment when growing or shrinking an allocation and ensures that partial granules are handled correctly. Use this helper from vrealloc_node_align_noprof() to fix poisoning logic. [ryabinin.a.a@gmail.com: move kasan_enabled() check, fix build] Link: https://lkml.kernel.org/r/20260119144509.32767-1-ryabinin.a.a@gmail.com Link: https://lkml.kernel.org/r/20260113191516.31015-1-ryabinin.a.a@gmail.com Fixes: d699440f58ce ("mm: fix vrealloc()'s KASAN poisoning logic") Signed-off-by: Andrey Ryabinin <ryabinin.a.a@gmail.com> Reported-by: Maciej Żenczykowski <maze@google.com> Reported-by: <joonki.min@samsung-slsi.corp-partner.google.com> Closes: https://lkml.kernel.org/r/CANP3RGeuRW53vukDy7WDO3FiVgu34-xVJYkfpm08oLO3odYFrA@mail.gmail.com Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com> Tested-by: Maciej Wieczor-Retman <maciej.wieczor-retman@intel.com> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitriy Vyukov <dvyukov@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Uladzislau Rezki <urezki@gmail.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
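A rough sketch of the shrink-side granule handling such a helper must centralize (simplified; names follow mm/kasan internals, but this is not the exact upstream code):

    if (size < old_size) {
            unsigned long start = round_up(size, KASAN_GRANULE_SIZE);
            unsigned long end = round_up(old_size, KASAN_GRANULE_SIZE);

            /*
             * Poison whole granules only; kasan_poison() requires a
             * granule-aligned start address and size.
             */
            if (start < end)
                    kasan_poison(p + start, end - start,
                                 KASAN_VMALLOC_INVALID, false);

            /* Record the in-use bytes of a partial trailing granule. */
            kasan_poison_last_granule(p, size);
    }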
4 days  mm/shmem, swap: fix race of truncate and swap entry split  Kairui Song
The helper for shmem swap freeing is not handling the order of swap entries correctly. It uses xa_cmpxchg_irq to erase the swap entry, but it gets the entry order before that using xa_get_order without lock protection, and it may get an outdated order value if the entry is split or changed in other ways after the xa_get_order and before the xa_cmpxchg_irq. And besides, the order could grow and be larger than expected, and cause truncation to erase data beyond the end border. For example, if the target entry and following entries are swapped in or freed, and then a large folio is added in place and swapped out using the same entry, the xa_cmpxchg_irq will still succeed; this is very unlikely to happen, though. To fix that, open code the XArray cmpxchg and put the order retrieval and value checking in the same critical section. Also, ensure the order won't exceed the end border, skipping the entry if it goes across the border. Skipping large swap entries that cross the end border is safe here. Shmem truncate iterates the range twice: in the first iteration, find_lock_entries already filtered such entries, and shmem will swap in the entries that cross the end border and partially truncate the folio (split the folio or at least zero part of it). So in the second loop here, if we see a swap entry that crosses the end border, it must at least have its content erased already. I observed random swapoff hangs and kernel panics when stress testing ZSWAP with shmem. After applying this patch, all problems are gone. Link: https://lkml.kernel.org/r/20260120-shmem-swap-fix-v3-1-3d33ebfbc057@tencent.com Fixes: 809bc86517cc ("mm: shmem: support large folio swap out") Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Nhat Pham <nphamcs@gmail.com> Acked-by: Chris Li <chrisl@kernel.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
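A condensed sketch of the fixed helper, with the order lookup and the compare-and-erase in one locked section (the return convention and end check are simplified from the description above):

    XA_STATE(xas, &mapping->i_pages, index);
    int order = -1;

    xas_lock_irq(&xas);
    if (xas_load(&xas) == radswap) {
            order = xas_get_order(&xas);
            /*
             * A concurrently grown entry may now cross the truncation
             * end; skip it, its content was already erased by the
             * first truncate pass.
             */
            if (index + (1UL << order) - 1 <= end)
                    xas_store(&xas, NULL);
            else
                    order = -1;
    }
    xas_unlock_irq(&xas);
    return order;    /* -1: nothing freed */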
9 days  Merge tag 'slab-for-6.19-rc7' of ↵  Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab Pull slab fix from Vlastimil Babka: - A stable fix for kmalloc_nolock() in non-preemptible contexts on PREEMPT_RT (Swaraj Gaikwad) * tag 'slab-for-6.19-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab: slab: fix kmalloc_nolock() context check for PREEMPT_RT
10 days  slab: fix kmalloc_nolock() context check for PREEMPT_RT  Swaraj Gaikwad
On PREEMPT_RT kernels, local_lock becomes a sleeping lock. The current check in kmalloc_nolock() only verifies we're not in NMI or hard IRQ context, but misses the case where preemption is disabled. When a BPF program runs from a tracepoint with preemption disabled (preempt_count > 0), kmalloc_nolock() proceeds to call local_lock_irqsave() which attempts to acquire a sleeping lock, triggering: BUG: sleeping function called from invalid context in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 6128 preempt_count: 2, expected: 0 Fix this by checking !preemptible() on PREEMPT_RT, which directly expresses the constraint that we cannot take a sleeping lock when preemption is disabled. This encompasses the previous checks for NMI and hard IRQ contexts while also catching cases where preemption is disabled. Fixes: af92793e52c3 ("slab: Introduce kmalloc_nolock() and kfree_nolock().") Reported-by: syzbot+b1546ad4a95331b2101e@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=b1546ad4a95331b2101e Signed-off-by: Swaraj Gaikwad <swarajgaikwad1925@gmail.com> Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Acked-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Harry Yoo <harry.yoo@oracle.com> Link: https://patch.msgid.link/20260113150639.48407-1-swarajgaikwad1925@gmail.co Cc: <stable@vger.kernel.org> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
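The corrected check, schematically (the fallback action is illustrative):

    /*
     * On PREEMPT_RT, local_lock_irqsave() takes a sleeping lock, so any
     * non-preemptible context must stay off the locked fast path;
     * !preemptible() subsumes the NMI/hard-IRQ checks used on non-RT.
     */
    if (IS_ENABLED(CONFIG_PREEMPT_RT) ? !preemptible()
                                      : (in_nmi() || in_hardirq()))
            return NULL;    /* or take a trylock-only slow path */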
10 days  Merge tag 'mm-hotfixes-stable-2026-01-20-13-09' of ↵  Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: - A patch series from David Hildenbrand which fixes a few things related to hugetlb PMD sharing - The remainder are singletons, please see their changelogs for details * tag 'mm-hotfixes-stable-2026-01-20-13-09' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: mm: restore per-memcg proactive reclaim with !CONFIG_NUMA mm/kfence: fix potential deadlock in reboot notifier Docs/mm/allocation-profiling: describe sysctrl limitations in debug mode mm: do not copy page tables unnecessarily for VM_UFFD_WP mm/hugetlb: fix excessive IPI broadcasts when unsharing PMD tables using mmu_gather mm/rmap: fix two comments related to huge_pmd_unshare() mm/hugetlb: fix two comments related to huge_pmd_unshare() mm/hugetlb: fix hugetlb_pmd_shared() mm: remove unnecessary and incorrect mmap lock assert x86/kfence: avoid writing L1TF-vulnerable PTEs mm/vma: do not leak memory when .mmap_prepare swaps the file migrate: correct lock ordering for hugetlb file folios panic: only warn about deprecated panic_print on write access fs/writeback: skip AS_NO_DATA_INTEGRITY mappings in wait_sb_inodes() mm: take into account mm_cid size for mm_struct static definitions mm: rename cpu_bitmap field to flexible_array mm: add missing static initializer for init_mm::mm_cid.lock
10 days  Merge tag 'dma-mapping-6.19-2026-01-20' of ↵  Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux Pull dma-mapping fixes from Marek Szyprowski: - minor fixes for the corner cases of the SWIOTLB pool management (Robin Murphy) * tag 'dma-mapping-6.19-2026-01-20' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux: dma/pool: Avoid allocating redundant pools mm_zone: Generalise has_managed_dma() dma/pool: Improve pool lookup
10 days  mm: restore per-memcg proactive reclaim with !CONFIG_NUMA  Yosry Ahmed
Commit 2b7226af730c ("mm/memcg: make memory.reclaim interface generic") moved proactive reclaim logic from memory.reclaim handler to a generic user_proactive_reclaim() helper to be used for per-node proactive reclaim. However, user_proactive_reclaim() was only defined under CONFIG_NUMA, with a stub always returning 0 otherwise. This broke memory.reclaim on !CONFIG_NUMA configs, causing it to report success without actually attempting reclaim. Move the definition of user_proactive_reclaim() outside CONFIG_NUMA, and instead define a stub for __node_reclaim() in the !CONFIG_NUMA case. __node_reclaim() is only called from user_proactive_reclaim() when a write is made to sys/devices/system/node/nodeX/reclaim, which is only defined with CONFIG_NUMA. Link: https://lkml.kernel.org/r/20260116205247.928004-1-yosry.ahmed@linux.dev Fixes: 2b7226af730c ("mm/memcg: make memory.reclaim interface generic") Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@kernel.org> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Xu <weixugc@google.com> Cc: Yuanchu Xie <yuanchu@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
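Schematically, the ifdef arrangement after the fix (the stub's signature is abridged):

    #ifndef CONFIG_NUMA
    /*
     * Stub: __node_reclaim() is only reached via the per-node sysfs
     * reclaim file, which does not exist on !CONFIG_NUMA builds.
     */
    static int __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask,
                              unsigned long nr_pages)
    {
            return 0;
    }
    #endif

    /*
     * user_proactive_reclaim() is now defined outside #ifdef CONFIG_NUMA,
     * so writes to memory.reclaim reach the real reclaim path again.
     */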
10 days  mm/kfence: fix potential deadlock in reboot notifier  Breno Leitao
The reboot notifier callback can deadlock when calling cancel_delayed_work_sync() if toggle_allocation_gate() is blocked in wait_event_idle() waiting for allocations, which might not happen on the shutdown path. The issue is that cancel_delayed_work_sync() waits for the work to complete, but the work is waiting for kfence_allocation_gate > 0, which requires allocations to happen (the gate is incremented by 1 on each allocation) - allocations that may have stopped during shutdown. Fix this by: 1. Using cancel_delayed_work() (non-sync) to avoid blocking. Now the callback succeeds and returns. 2. Adding wake_up() to unblock any waiting toggle_allocation_gate() 3. Adding !kfence_enabled to the wait condition so the wake succeeds The static_branch_disable() IPI will still execute after the wake, but at this early point in shutdown (reboot notifier runs with INT_MAX priority), the system is still functional and CPUs can respond to IPIs. Link: https://lkml.kernel.org/r/20260116-kfence_fix-v1-1-4165a055933f@debian.org Fixes: ce2bba89566b ("mm/kfence: add reboot notifier to disable KFENCE on shutdown") Signed-off-by: Breno Leitao <leitao@debian.org> Reported-by: Chris Mason <clm@meta.com> Closes: https://lore.kernel.org/all/20260113140234.677117-1-clm@meta.com/ Reviewed-by: Marco Elver <elver@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Breno Leitao <leitao@debian.org> Cc: Chris Mason <clm@meta.com> Cc: Dmitriy Vyukov <dvyukov@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
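A sketch of the resulting notifier, using the work and waitqueue names from mm/kfence/core.c (the function name and exact diff are illustrative):

    static int kfence_reboot_notify(struct notifier_block *nb,
                                    unsigned long action, void *data)
    {
            WRITE_ONCE(kfence_enabled, false);
            /*
             * Non-sync cancel: never wait on a work item that may itself
             * be stuck in wait_event_idle() waiting for allocations.
             */
            cancel_delayed_work(&kfence_timer);
            /*
             * Unblock toggle_allocation_gate(); its wait condition also
             * checks !kfence_enabled now, so this wake-up terminates it.
             */
            wake_up(&allocation_wait);
            return NOTIFY_DONE;
    }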
10 days  mm: do not copy page tables unnecessarily for VM_UFFD_WP  Lorenzo Stoakes
Commit ab04b530e7e8 ("mm: introduce copy-on-fork VMAs and make VM_MAYBE_GUARD one") aggregates flags checks in vma_needs_copy(), including VM_UFFD_WP. However in doing so, it incorrectly performed this check against src_vma. This check was done on the assumption that all relevant flags are copied upon fork. However the userfaultfd logic is very innovative in that it implements custom logic on fork in dup_userfaultfd(), including a rather well hidden case where lacking UFFD_FEATURE_EVENT_FORK causes VM_UFFD_WP to not be propagated to the destination VMA. And indeed, vma_needs_copy(), prior to this patch, did check this property on dst_vma, not src_vma. Since all the other relevant flags are copied on fork, we can simply fix this by checking against dst_vma. While we're here, we fix a comment against VM_COPY_ON_FORK (noting that it did indeed already reference dst_vma) to make it abundantly clear that we must check against the destination VMA. Link: https://lkml.kernel.org/r/20260114110006.1047071-1-lorenzo.stoakes@oracle.com Fixes: ab04b530e7e8 ("mm: introduce copy-on-fork VMAs and make VM_MAYBE_GUARD one") Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reported-by: Chris Mason <clm@meta.com> Closes: https://lore.kernel.org/all/20260113231257.3002271-1-clm@meta.com/ Acked-by: David Hildenbrand (Red Hat) <david@kernel.org> Acked-by: Pedro Falcato <pfalcato@suse.de> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
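The essence of the one-line fix, with the surrounding checks abridged:

    static bool vma_needs_copy(struct vm_area_struct *dst_vma,
                               struct vm_area_struct *src_vma)
    {
            /*
             * Check the destination VMA: dup_userfaultfd() may have
             * dropped VM_UFFD_WP on the child when
             * UFFD_FEATURE_EVENT_FORK was not requested, so src_vma's
             * flags are not authoritative here.
             */
            if (userfaultfd_wp(dst_vma))
                    return true;

            /* ... remaining copy-on-fork flag checks ... */
            return false;
    }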
10 days  mm/hugetlb: fix excessive IPI broadcasts when unsharing PMD tables using ↵  David Hildenbrand (Red Hat)
mmu_gather As reported, ever since commit 1013af4f585f ("mm/hugetlb: fix huge_pmd_unshare() vs GUP-fast race") we can end up in some situations where we perform so many IPI broadcasts when unsharing hugetlb PMD page tables that it severely regresses some workloads. In particular, when we fork()+exit(), or when we munmap() a large area backed by many shared PMD tables, we perform one IPI broadcast per unshared PMD table. There are two optimizations to be had: (1) When we process (unshare) multiple such PMD tables, such as during exit(), it is sufficient to send a single IPI broadcast (as long as we respect locking rules) instead of one per PMD table. Locking prevents that any of these PMD tables could get reused before we drop the lock. (2) When we are not the last sharer (> 2 users including us), there is no need to send the IPI broadcast. The shared PMD tables cannot become exclusive (fully unshared) before an IPI will be broadcasted by the last sharer. Concurrent GUP-fast could walk into a PMD table just before we unshared it. It could then succeed in grabbing a page from the shared page table even after munmap() etc succeeded (and suppressed an IPI). But there is no difference compared to GUP-fast just sleeping for a while after grabbing the page and re-enabling IRQs. Most importantly, GUP-fast will never walk into page tables that are no longer shared, because the last sharer will issue an IPI broadcast. (if ever required, checking whether the PUD changed in GUP-fast after grabbing the page like we do in the PTE case could handle this) So let's rework PMD sharing TLB flushing + IPI sync to use the mmu_gather infrastructure so we can implement these optimizations and demystify the code at least a bit. Extend the mmu_gather infrastructure to be able to deal with our special hugetlb PMD table sharing implementation. To make initialization of the mmu_gather easier when working on a single VMA (in particular, when dealing with hugetlb), provide tlb_gather_mmu_vma(). We'll consolidate the handling for (full) unsharing of PMD tables in tlb_unshare_pmd_ptdesc() and tlb_flush_unshared_tables(), and track in "struct mmu_gather" whether we had (full) unsharing of PMD tables. Because locking is very special (concurrent unsharing+reuse must be prevented), we disallow deferring flushing to tlb_finish_mmu() and instead require an explicit earlier call to tlb_flush_unshared_tables(). From hugetlb code, we call huge_pmd_unshare_flush() where we make sure that the expected lock protecting us from concurrent unsharing+reuse is still held. Check with a VM_WARN_ON_ONCE() in tlb_finish_mmu() that tlb_flush_unshared_tables() was properly called earlier. Document it all properly. Notes about tlb_remove_table_sync_one() interaction with unsharing: There are two fairly tricky things: (1) tlb_remove_table_sync_one() is a NOP on architectures without CONFIG_MMU_GATHER_RCU_TABLE_FREE. Here, the assumption is that the previous TLB flush would send an IPI to all relevant CPUs. Careful: some architectures like x86 only send IPIs to all relevant CPUs when tlb->freed_tables is set. The relevant architectures should be selecting MMU_GATHER_RCU_TABLE_FREE, but x86 might not do that in stable kernels and it might have been problematic before this patch. Also, the arch flushing behavior (independent of IPIs) is different when tlb->freed_tables is set. Do we have to enlighten them to also take care of tlb->unshared_tables? So far we didn't care, so hopefully we are fine.
Of course, we could be setting tlb->freed_tables as well, but that might then unnecessarily flush too much, because the semantics of tlb->freed_tables are a bit fuzzy. This patch changes nothing in this regard. (2) tlb_remove_table_sync_one() is not a NOP on architectures with CONFIG_MMU_GATHER_RCU_TABLE_FREE that actually don't need a sync. Take x86 as an example: in the common case (!pv, !X86_FEATURE_INVLPGB) we still issue IPIs during TLB flushes and don't actually need the second tlb_remove_table_sync_one(). This optimization can be implemented on top of this by checking, e.g., in tlb_remove_table_sync_one() whether we really need IPIs. But as described in (1), it really must honor tlb->freed_tables then to send IPIs to all relevant CPUs. Notes on TLB flushing changes: (1) Flushing for non-shared PMD tables We're converting from flush_hugetlb_tlb_range() to tlb_remove_huge_tlb_entry(). Given that we properly initialize the MMU gather in tlb_gather_mmu_vma() to be hugetlb aware, similar to __unmap_hugepage_range(), that should be fine. (2) Flushing for shared PMD tables We're converting from various things (flush_hugetlb_tlb_range(), tlb_flush_pmd_range(), flush_tlb_range()) to tlb_flush_pmd_range(). tlb_flush_pmd_range() achieves the same that tlb_remove_huge_tlb_entry() would achieve in these scenarios. Note that tlb_remove_huge_tlb_entry() also calls __tlb_remove_tlb_entry(), however that is only implemented on powerpc, which does not support PMD table sharing. Similar to (1), tlb_gather_mmu_vma() should make sure that TLB flushing keeps on working as expected. Further, note that the ptdesc_pmd_pts_dec() in huge_pmd_share() is not a concern, as we are holding the i_mmap_lock the whole time, preventing concurrent unsharing. That ptdesc_pmd_pts_dec() usage will be removed separately as a cleanup later. There are plenty more cleanups to be had, but they have to wait until this is fixed. [david@kernel.org: fix kerneldoc] Link: https://lkml.kernel.org/r/f223dd74-331c-412d-93fc-69e360a5006c@kernel.org Link: https://lkml.kernel.org/r/20251223214037.580860-5-david@kernel.org Fixes: 1013af4f585f ("mm/hugetlb: fix huge_pmd_unshare() vs GUP-fast race") Signed-off-by: David Hildenbrand (Red Hat) <david@kernel.org> Reported-by: "Uschakow, Stanislav" <suschako@amazon.de> Closes: https://lore.kernel.org/all/4d3878531c76479d9f8ca9789dc6485d@amazon.de/ Tested-by: Laurence Oberman <loberman@redhat.com> Acked-by: Harry Yoo <harry.yoo@oracle.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Lance Yang <lance.yang@linux.dev> Cc: Liu Shixin <liushixin2@huawei.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Rik van Riel <riel@surriel.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 days  mm/rmap: fix two comments related to huge_pmd_unshare()  David Hildenbrand (Red Hat)
PMD page table unsharing no longer touches the refcount of a PMD page table. Also, it is not about dropping the refcount of a "PMD page" but the "PMD page table". Let's just simplify by saying that the PMD page table was unmapped, consequently also unmapping the folio that was mapped into this page table. This code should be deduplicated in the future. Link: https://lkml.kernel.org/r/20251223214037.580860-4-david@kernel.org Fixes: 59d9094df3d7 ("mm: hugetlb: independent PMD page table shared count") Signed-off-by: David Hildenbrand (Red Hat) <david@kernel.org> Reviewed-by: Rik van Riel <riel@surriel.com> Tested-by: Laurence Oberman <loberman@redhat.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: Oscar Salvador <osalvador@suse.de> Cc: Liu Shixin <liushixin2@huawei.com> Cc: Harry Yoo <harry.yoo@oracle.com> Cc: Lance Yang <lance.yang@linux.dev> Cc: "Uschakow, Stanislav" <suschako@amazon.de> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 days  mm/hugetlb: fix two comments related to huge_pmd_unshare()  David Hildenbrand (Red Hat)
Ever since we stopped using the page count to detect shared PMD page tables, these comments are outdated. The only reason we have to flush the TLB early is because once we drop the i_mmap_rwsem, the previously shared page table could get freed (to then get reallocated and used for other purpose). So we really have to flush the TLB before that could happen. So let's simplify the comments a bit. The "If we unshared PMDs, the TLB flush was not recorded in mmu_gather." part introduced in commit a4a118f2eead ("hugetlbfs: flush TLBs correctly after huge_pmd_unshare") was confusing: sure it is recorded in the mmu_gather, otherwise tlb_flush_mmu_tlbonly() wouldn't do anything. So let's drop that comment while at it as well. We'll centralize these comments in a single helper as we rework the code next. Link: https://lkml.kernel.org/r/20251223214037.580860-3-david@kernel.org Fixes: 59d9094df3d7 ("mm: hugetlb: independent PMD page table shared count") Signed-off-by: David Hildenbrand (Red Hat) <david@kernel.org> Reviewed-by: Rik van Riel <riel@surriel.com> Tested-by: Laurence Oberman <loberman@redhat.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Cc: Liu Shixin <liushixin2@huawei.com> Cc: Lance Yang <lance.yang@linux.dev> Cc: "Uschakow, Stanislav" <suschako@amazon.de> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 days  mm: remove unnecessary and incorrect mmap lock assert  Lorenzo Stoakes
This check was introduced by commit 42fc541404f2 ("mmap locking API: add mmap_assert_locked() and mmap_assert_write_locked()") which replaced a VM_BUG_ON_VMA() over rwsem_is_locked from commit a00cc7d9dd93 ("mm, x86: add support for PUD-sized transparent hugepages"), i.e. the commit that introduced PUD THPs. These seem to be careful asserts introduced to ensure that locks are held in general; however, for a zap we require that VMAs are kept stable, and this is a requirement that has held perfectly well for a long time. These asserts long predate VMA locks, and thus there appears to be no reason to think this assert is there for anything other than 'stabilised VMA'. Asserting that the VMA under examination is stable only in the case of a THP PUD is strange and unnecessary. If we wish to be careful and assert such things, we should do so at the zap level. However, in any case the current situation is already simply incorrect - a VMA lock suffices here. Remove the assert for now as it is unnecessary, incorrect and unhelpful; subsequent work can introduce an assert in general for zapping if required. Link: https://lkml.kernel.org/r/20260114115619.1087466-1-lorenzo.stoakes@oracle.com Fixes: 2ab7f1bbafc9 ("mm/madvise: allow guard page install/remove under VMA lock") Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reported-by: Chris Mason <clm@meta.com> Closes: https://lore.kernel.org/all/20260113220856.2358195-1-clm@meta.com/ Acked-by: David Hildenbrand (Red Hat) <david@kernel.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: SeongJae Park <sj@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 days  mm/vma: do not leak memory when .mmap_prepare swaps the file  Lorenzo Stoakes
The current implementation of mmap() is set up such that a struct file object is obtained for the input fd in ksys_mmap_pgoff() via fget(), and its reference count decremented at the end of the function via fput(). If a merge can be achieved, we are fine to simply decrement the refcount on the file. Otherwise, in __mmap_new_file_vma(), we increment the reference count on the file via get_file() such that the fput() in ksys_mmap_pgoff() does not free the now-referenced file object. The introduction of the f_op->mmap_prepare hook changes things, as it becomes possible for a driver to replace the file object right at the beginning of the mmap operation. The current implementation is buggy if this happens because it unconditionally calls get_file() on the mapping's file whether or not it was replaced (and thus whether or not its reference count will be decremented at the end of ksys_mmap_pgoff()). This results in a memory leak, and was exposed in commit ab04945f91bc ("mm: update mem char driver to use mmap_prepare"). This patch solves the problem by explicitly tracking whether we actually need to call get_file() on the file or not, and only doing so if required. Link: https://lkml.kernel.org/r/20260112155143.661284-1-lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Fixes: ab04945f91bc ("mm: update mem char driver to use mmap_prepare") Reported-by: syzbot+bf5de69ebb4bdf86f59f@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/6964a92b.050a0220.eaf7.008a.GAE@google.com/ Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Pedro Falcato <pfalcato@suse.de> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
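Conceptually, the fix looks like this (orig_file and the surrounding structure are illustrative):

    /*
     * ksys_mmap_pgoff() will fput() the file it passed in; only take an
     * extra reference if the mapping still refers to that same file,
     * i.e. .mmap_prepare did not swap it for another one.
     */
    if (map->file == orig_file)
            vma->vm_file = get_file(map->file);
    else
            vma->vm_file = map->file;    /* already holds its own reference */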
11 days  migrate: correct lock ordering for hugetlb file folios  Matthew Wilcox (Oracle)
Syzbot has found a deadlock (analyzed by Lance Yang): 1) Task (5749): Holds folio_lock, then tries to acquire i_mmap_rwsem(read lock). 2) Task (5754): Holds i_mmap_rwsem(write lock), then tries to acquire folio_lock. migrate_pages() -> migrate_hugetlbs() -> unmap_and_move_huge_page() <- Takes folio_lock! -> remove_migration_ptes() -> __rmap_walk_file() -> i_mmap_lock_read() <- Waits for i_mmap_rwsem(read lock)! hugetlbfs_fallocate() -> hugetlbfs_punch_hole() <- Takes i_mmap_rwsem(write lock)! -> hugetlbfs_zero_partial_page() -> filemap_lock_hugetlb_folio() -> filemap_lock_folio() -> __filemap_get_folio <- Waits for folio_lock! The migration path is the one taking locks in the wrong order according to the documentation at the top of mm/rmap.c. So expand the scope of the existing i_mmap_lock to cover the calls to remove_migration_ptes() too. This is (mostly) how it used to be after commit c0d0381ade79. That was removed by 336bf30eb765 for both file & anon hugetlb pages when it should only have been removed for anon hugetlb pages. Link: https://lkml.kernel.org/r/20260109041345.3863089-2-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Fixes: 336bf30eb765 ("hugetlbfs: fix anon huge page migration race") Reported-by: syzbot+2d9c96466c978346b55f@syzkaller.appspotmail.com Link: https://lore.kernel.org/all/68e9715a.050a0220.1186a4.000d.GAE@google.com Debugged-by: Lance Yang <lance.yang@linux.dev> Acked-by: David Hildenbrand (Red Hat) <david@kernel.org> Acked-by: Zi Yan <ziy@nvidia.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Byungchul Park <byungchul@sk.com> Cc: Gregory Price <gourry@gourry.net> Cc: Jann Horn <jannh@google.com> Cc: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Rakie Kim <rakie.kim@sk.com> Cc: Rik van Riel <riel@surriel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Ying Huang <ying.huang@linux.alibaba.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
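Sketch of the widened lock scope in unmap_and_move_huge_page() (simplified; labels and the surrounding error handling are elided):

    if (!folio_test_anon(src)) {
            /* Trylock inside; failure means a racing hole punch etc. */
            mapping = hugetlb_folio_mapping_lock_write(src);
            if (unlikely(!mapping))
                    goto out;
            ttu |= TTU_RMAP_LOCKED;
    }
    try_to_migrate(src, ttu);
    /* ... move the folio ... */
    if (page_was_mapped)
            remove_migration_ptes(src, dst, mapping ? RMP_LOCKED : 0);
    if (mapping)
            i_mmap_unlock_write(mapping);    /* drop only after the rmap walk */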
11 days  mm: rename cpu_bitmap field to flexible_array  Mathieu Desnoyers
The cpu_bitmap flexible array now contains more than just the cpu_bitmap. In preparation for changing the static mm_struct definitions to cover for the additional space required, change the cpu_bitmap type from "unsigned long" to "char", require an unsigned long alignment of the flexible array, and rename the field from "cpu_bitmap" to "flexible_array". Introduce the MM_STRUCT_FLEXIBLE_ARRAY_INIT macro to statically initialize the flexible array. This covers the init_mm and efi_mm static definitions. This is a preparation step for fixing the missing mm_cid size for static mm_struct definitions. Link: https://lkml.kernel.org/r/20251224173358.647691-3-mathieu.desnoyers@efficios.com Fixes: af7f588d8f73 ("sched: Introduce per-memory-map concurrency ID") Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Reviewed-by: Thomas Gleixner <tglx@kernel.org> Cc: Mark Brown <broonie@kernel.org> Cc: Aboorva Devarajan <aboorvad@linux.ibm.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Christan König <christian.koenig@amd.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Christoph Lameter <cl@linux.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: "Liam R . Howlett" <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Martin Liu <liumartin@google.com> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: "Paul E. McKenney" <paulmck@kernel.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: SeongJae Park <sj@kernel.org> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
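Illustrative shape of the struct change (the initializer body here is an assumption based on the description above):

    struct mm_struct {
            /* ... */
            /*
             * Was "unsigned long cpu_bitmap[];"; the trailing space now
             * also carries mm_cid data, so use a byte array with explicit
             * unsigned-long alignment.
             */
            char flexible_array[] __aligned(sizeof(unsigned long));
    };

    /* Statically reserve and zero the flexible array for init_mm/efi_mm. */
    #define MM_STRUCT_FLEXIBLE_ARRAY_INIT(size) \
            .flexible_array = { [0 ... (size) - 1] = 0 }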
11 days  mm: add missing static initializer for init_mm::mm_cid.lock  Mathieu Desnoyers
Initialize the mm_cid.lock struct member of init_mm. Link: https://lkml.kernel.org/r/20251224173358.647691-2-mathieu.desnoyers@efficios.com Fixes: 8cea569ca785 ("sched/mmcid: Use proper data structures") Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Reviewed-by: Thomas Gleixner <tglx@kernel.org> Cc: Aboorva Devarajan <aboorvad@linux.ibm.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Christan König <christian.koenig@amd.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Christoph Lameter <cl@linux.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: "Liam R . Howlett" <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mark Brown <broonie@kernel.org> Cc: Martin Liu <liumartin@google.com> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: "Paul E. McKenney" <paulmck@kernel.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: SeongJae Park <sj@kernel.org> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-16  Merge tag 'cxl-fixes-6.19-rc6' of ↵  Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl Pull Compute Express Link (CXL) fixes from Dave Jiang: - Recognize all ZONE_DEVICE users as physaddr consumers - Fix format string for extended_linear_cache_size_show() - Fix target list setup for multiple decoders sharing the same downstream port - Restore HBIW check before dereferencing platform data - Fix potential infinite loop in __cxl_dpa_reserve() - Check for invalid addresses returned from translation functions on error * tag 'cxl-fixes-6.19-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl: cxl: Check for invalid addresses returned from translation functions on errors cxl/hdm: Fix potential infinite loop in __cxl_dpa_reserve() cxl/acpi: Restore HBIW check before dereferencing platform_data cxl/port: Fix target list setup for multiple decoders sharing the same dport cxl/region: fix format string for resource_size_t x86/kaslr: Recognize all ZONE_DEVICE users as physaddr consumers
2026-01-14  mm: numa,memblock: include <asm/numa.h> for 'numa_nodes_parsed'  Ben Dooks
'numa_nodes_parsed' is defined in <asm/numa.h>, but this file is not included in mm/numa_memblks.c (build x86_64), so add it to the includes to fix the following sparse warning: mm/numa_memblks.c:13:12: warning: symbol 'numa_nodes_parsed' was not declared. Should it be static? Link: https://lkml.kernel.org/r/20260108101539.229192-1-ben.dooks@codethink.co.uk Fixes: 87482708210f ("mm: introduce numa_memblks") Signed-off-by: Ben Dooks <ben.dooks@codethink.co.uk> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Ben Dooks <ben.dooks@codethink.co.uk> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-14  mm/page_alloc: prevent pcp corruption with SMP=n  Vlastimil Babka
The kernel test robot has reported: BUG: spinlock trylock failure on UP on CPU#0, kcompactd0/28 lock: 0xffff888807e35ef0, .magic: dead4ead, .owner: kcompactd0/28, .owner_cpu: 0 CPU: 0 UID: 0 PID: 28 Comm: kcompactd0 Not tainted 6.18.0-rc5-00127-ga06157804399 #1 PREEMPT 8cc09ef94dcec767faa911515ce9e609c45db470 Call Trace: <IRQ> __dump_stack (lib/dump_stack.c:95) dump_stack_lvl (lib/dump_stack.c:123) dump_stack (lib/dump_stack.c:130) spin_dump (kernel/locking/spinlock_debug.c:71) do_raw_spin_trylock (kernel/locking/spinlock_debug.c:?) _raw_spin_trylock (include/linux/spinlock_api_smp.h:89 kernel/locking/spinlock.c:138) __free_frozen_pages (mm/page_alloc.c:2973) ___free_pages (mm/page_alloc.c:5295) __free_pages (mm/page_alloc.c:5334) tlb_remove_table_rcu (include/linux/mm.h:? include/linux/mm.h:3122 include/asm-generic/tlb.h:220 mm/mmu_gather.c:227 mm/mmu_gather.c:290) ? __cfi_tlb_remove_table_rcu (mm/mmu_gather.c:289) ? rcu_core (kernel/rcu/tree.c:?) rcu_core (include/linux/rcupdate.h:341 kernel/rcu/tree.c:2607 kernel/rcu/tree.c:2861) rcu_core_si (kernel/rcu/tree.c:2879) handle_softirqs (arch/x86/include/asm/jump_label.h:36 include/trace/events/irq.h:142 kernel/softirq.c:623) __irq_exit_rcu (arch/x86/include/asm/jump_label.h:36 kernel/softirq.c:725) irq_exit_rcu (kernel/softirq.c:741) sysvec_apic_timer_interrupt (arch/x86/kernel/apic/apic.c:1052) </IRQ> <TASK> RIP: 0010:_raw_spin_unlock_irqrestore (arch/x86/include/asm/preempt.h:95 include/linux/spinlock_api_smp.h:152 kernel/locking/spinlock.c:194) free_pcppages_bulk (mm/page_alloc.c:1494) drain_pages_zone (include/linux/spinlock.h:391 mm/page_alloc.c:2632) __drain_all_pages (mm/page_alloc.c:2731) drain_all_pages (mm/page_alloc.c:2747) kcompactd (mm/compaction.c:3115) kthread (kernel/kthread.c:465) ? __cfi_kcompactd (mm/compaction.c:3166) ? __cfi_kthread (kernel/kthread.c:412) ret_from_fork (arch/x86/kernel/process.c:164) ? __cfi_kthread (kernel/kthread.c:412) ret_from_fork_asm (arch/x86/entry/entry_64.S:255) </TASK> Matthew has analyzed the report and identified that in drain_pages_zone() we are in a section protected by spin_lock(&pcp->lock) and then get an interrupt that attempts spin_trylock() on the same lock. The code is designed to work this way without disabling IRQs and occasionally fail the trylock with a fallback. However, the SMP=n spinlock implementation assumes spin_trylock() will always succeed, and thus it's normally a no-op. Here the enabled lock debugging catches the problem, but otherwise it could cause a corruption of the pcp structure. The problem has been introduced by commit 574907741599 ("mm/page_alloc: leave IRQs enabled for per-cpu page allocations"). The pcp locking scheme recognizes the need for disabling IRQs to prevent nesting spin_trylock() sections on SMP=n, but the need to prevent the nesting in spin_lock() has not been recognized. Fix it by introducing local wrappers that change the spin_lock() to spin_lock_irqsave() with SMP=n and use them in all places that do spin_lock(&pcp->lock).
[vbabka@suse.cz: add pcp_ prefix to the spin_lock_irqsave wrappers, per Steven] Link: https://lkml.kernel.org/r/20260105-fix-pcp-up-v1-1-5579662d2071@suse.cz Fixes: 574907741599 ("mm/page_alloc: leave IRQs enabled for per-cpu page allocations") Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reported-by: kernel test robot <oliver.sang@intel.com> Closes: https://lore.kernel.org/oe-lkp/202512101320.e2f2dd6f-lkp@intel.com Analyzed-by: Matthew Wilcox <willy@infradead.org> Link: https://lore.kernel.org/all/aUW05pyc9nZkvY-1@casper.infradead.org/ Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Brendan Jackman <jackmanb@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Zi Yan <ziy@nvidia.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
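A sketch of such wrappers (bodies illustrative; the upstream names carry the pcp_ prefix as noted above):

    /*
     * On SMP=n, spin_lock() compiles away, so an IRQ arriving inside the
     * critical section could re-enter via spin_trylock() and corrupt the
     * pcp lists; disabling IRQs closes that window.
     */
    #ifdef CONFIG_SMP
    #define pcp_spin_lock_irqsave(lock, flags)              \
            do { (flags) = 0; spin_lock(lock); } while (0)
    #define pcp_spin_unlock_irqrestore(lock, flags)         \
            spin_unlock(lock)
    #else
    #define pcp_spin_lock_irqsave(lock, flags)              \
            spin_lock_irqsave(lock, flags)
    #define pcp_spin_unlock_irqrestore(lock, flags)         \
            spin_unlock_irqrestore(lock, flags)
    #endif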
2026-01-14  mm: kmsan: fix poisoning of high-order non-compound pages  Ryan Roberts
kmsan_free_page() is called by the page allocator's free_pages_prepare() during page freeing. Its job is to poison all the memory covered by the page. It can be called with an order-0 page, a compound high-order page or a non-compound high-order page. But page_size() only works for order-0 and compound pages. For a non-compound high-order page it will incorrectly return PAGE_SIZE. The implication is that the tail pages of a high-order non-compound page do not get poisoned at free, so any invalid access while they are free could go unnoticed. It looks like the pages will be poisoned again at allocation time, so that would bookend the window. Fix this by using the order parameter to calculate the size. Link: https://lkml.kernel.org/r/20260104134348.3544298-1-ryan.roberts@arm.com Fixes: b073d7f8aee4 ("mm: kmsan: maintain KMSAN metadata for page operations") Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: Alexander Potapenko <glider@google.com> Tested-by: Alexander Potapenko <glider@google.com> Cc: Dmitriy Vyukov <dvyukov@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Marco Elver <elver@google.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
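The essence of the fix, schematically (the real kmsan_free_page() also brackets this with runtime enter/leave calls):

    void kmsan_free_page(struct page *page, unsigned int order)
    {
            /*
             * page_size() returns PAGE_SIZE for a non-compound
             * high-order page, leaving tail pages unpoisoned; derive
             * the size from the order instead.
             */
            kmsan_internal_poison_memory(page_address(page),
                                         PAGE_SIZE << order, GFP_KERNEL,
                                         KMSAN_POISON_CHECK |
                                         KMSAN_POISON_FREE);
    }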
2026-01-14  mm/vma: enforce VMA fork limit on unfaulted,faulted mremap merge too  Lorenzo Stoakes
The is_mergeable_anon_vma() function uses vmg->middle as the source VMA. However when merging a new VMA, this field is NULL. In all cases except mremap(), the new VMA will either be newly established and thus lack an anon_vma, or will be an expansion of an existing VMA, and thus we do not care whether the VMA is CoW'd or not. In the case of an mremap(), we can end up in a situation where we can accidentally allow an unfaulted/faulted merge with a VMA that has been forked, violating the general rule that we do not permit this for reasons of anon_vma lock scalability. Now that we are aware that we are copying a VMA, and know which VMA that is, we can explicitly check for this, so do so. This is pertinent since commit 879bca0a2c4f ("mm/vma: fix incorrectly disallowed anonymous VMA merges"), as this patch permits unfaulted/faulted merges that were previously disallowed running afoul of this issue. While we are here, vma_had_uncowed_parents() is a confusing name, so make it simple and rename it to vma_is_fork_child(). Link: https://lkml.kernel.org/r/6e2b9b3024ae1220961c8b81d74296d4720eaf2b.1767638272.git.lorenzo.stoakes@oracle.com Fixes: 879bca0a2c4f ("mm/vma: fix incorrectly disallowed anonymous VMA merges") Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Reviewed-by: Jeongjun Park <aha310510@gmail.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: David Hildenbrand (Red Hat) <david@kernel.org> Cc: Jann Horn <jannh@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Pedro Falcato <pfalcato@suse.de> Cc: Rik van Riel <riel@surriel.com> Cc: Yeoreum Yun <yeoreum.yun@arm.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-14mm/vma: fix anon_vma UAF on mremap() faulted, unfaulted mergeLorenzo Stoakes
Patch series "mm/vma: fix anon_vma UAF on mremap() faulted, unfaulted merge", v2. Commit 879bca0a2c4f ("mm/vma: fix incorrectly disallowed anonymous VMA merges") introduced the ability to merge previously unavailable VMA merge scenarios. However, it is handling merges incorrectly when it comes to mremap() of a faulted VMA adjacent to an unfaulted VMA. The issues arise in three cases: 1. Previous VMA unfaulted: copied -----| v |-----------|.............| | unfaulted |(faulted VMA)| |-----------|.............| prev 2. Next VMA unfaulted: copied -----| v |.............|-----------| |(faulted VMA)| unfaulted | |.............|-----------| next 3. Both adjacent VMAs unfaulted: copied -----| v |-----------|.............|-----------| | unfaulted |(faulted VMA)| unfaulted | |-----------|.............|-----------| prev next This series fixes each of these cases, and introduces self tests to assert that the issues are corrected. I also test a further case which was already handled, to assert that my changes continues to correctly handle it: 4. prev unfaulted, next faulted: copied -----| v |-----------|.............|-----------| | unfaulted |(faulted VMA)| faulted | |-----------|.............|-----------| prev next This bug was discovered via a syzbot report, linked to in the first patch in the series, I confirmed that this series fixes the bug. I also discovered that we are failing to check that the faulted VMA was not forked when merging a copied VMA in cases 1-3 above, an issue this series also addresses. I also added self tests to assert that this is resolved (and confirmed that the tests failed prior to this). I also cleaned up vma_expand() as part of this work, renamed vma_had_uncowed_parents() to vma_is_fork_child() as the previous name was unduly confusing, and simplified the comments around this function. This patch (of 4): Commit 879bca0a2c4f ("mm/vma: fix incorrectly disallowed anonymous VMA merges") introduced the ability to merge previously unavailable VMA merge scenarios. The key piece of logic introduced was the ability to merge a faulted VMA immediately next to an unfaulted VMA, which relies upon dup_anon_vma() to correctly handle anon_vma state. In the case of the merge of an existing VMA (that is changing properties of a VMA and then merging if those properties are shared by adjacent VMAs), dup_anon_vma() is invoked correctly. However in the case of the merge of a new VMA, a corner case peculiar to mremap() was missed. The issue is that vma_expand() only performs dup_anon_vma() if the target (the VMA that will ultimately become the merged VMA): is not the next VMA, i.e. the one that appears after the range in which the new VMA is to be established. A key insight here is that in all other cases other than mremap(), a new VMA merge either expands an existing VMA, meaning that the target VMA will be that VMA, or would have anon_vma be NULL. Specifically: * __mmap_region() - no anon_vma in place, initial mapping. * do_brk_flags() - expanding an existing VMA. * vma_merge_extend() - expanding an existing VMA. * relocate_vma_down() - no anon_vma in place, initial mapping. In addition, we are in the unique situation of needing to duplicate anon_vma state from a VMA that is neither the previous or next VMA being merged with. dup_anon_vma() deals exclusively with the target=unfaulted, src=faulted case. This leaves four possibilities, in each case where the copied VMA is faulted: 1. 
Previous VMA unfaulted: copied -----| v |-----------|.............| | unfaulted |(faulted VMA)| |-----------|.............| prev target = prev, expand prev to cover. 2. Next VMA unfaulted: copied -----| v |.............|-----------| |(faulted VMA)| unfaulted | |.............|-----------| next target = next, expand next to cover. 3. Both adjacent VMAs unfaulted: copied -----| v |-----------|.............|-----------| | unfaulted |(faulted VMA)| unfaulted | |-----------|.............|-----------| prev next target = prev, expand prev to cover. 4. prev unfaulted, next faulted: copied -----| v |-----------|.............|-----------| | unfaulted |(faulted VMA)| faulted | |-----------|.............|-----------| prev next target = prev, expand prev to cover. Essentially equivalent to 3, but with additional requirement that next's anon_vma is the same as the copied VMA's. This is covered by the existing logic. To account for this very explicitly, we introduce vma_merge_copied_range(), which sets a newly introduced vmg->copied_from field, then invokes vma_merge_new_range() which handles the rest of the logic. We then update the key vma_expand() function to clean up the logic and make what's going on clearer, making the 'remove next' case less special, before invoking dup_anon_vma() unconditionally should we be copying from a VMA. Note that in case 3, the if (remove_next) ... branch will be a no-op, as next=src in this instance and src is unfaulted. In case 4, it won't be, but since in this instance next=src and it is faulted, this will have required tgt=faulted, src=faulted to be compatible, meaning that next->anon_vma == vmg->copied_from->anon_vma, and thus a single dup_anon_vma() of next suffices to copy anon_vma state for the copied-from VMA also. If we are copying from a VMA in a successful merge we must _always_ propagate anon_vma state. This issue can be observed most directly by invoked mremap() to move around a VMA and cause this kind of merge with the MREMAP_DONTUNMAP flag specified. This will result in unlink_anon_vmas() being called after failing to duplicate anon_vma state to the target VMA, which results in the anon_vma itself being freed with folios still possessing dangling pointers to the anon_vma and thus a use-after-free bug. This bug was discovered via a syzbot report, which this patch resolves. We further make a change to update the mergeable anon_vma check to assert the copied-from anon_vma did not have CoW parents, as otherwise dup_anon_vma() might incorrectly propagate CoW ancestors from the next VMA in case 4 despite the anon_vma's being identical for both VMAs. 
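A simplified sketch of the control flow this patch introduces (not the literal kernel code; the dup_anon_vma() out-parameter name is assumed):

	struct vm_area_struct *vma_merge_copied_range(struct vma_merge_struct *vmg,
						      struct vm_area_struct *src)
	{
		/*
		 * Record the VMA being copied from, so vma_expand() can
		 * propagate its anon_vma state unconditionally.
		 */
		vmg->copied_from = src;
		return vma_merge_new_range(vmg);
	}

	/* In vma_expand(): */
	if (remove_next)
		err = dup_anon_vma(target, next, &anon_dup);
	if (!err && vmg->copied_from)
		/* Always propagate anon_vma state from the copied VMA. */
		err = dup_anon_vma(target, vmg->copied_from, &anon_dup);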
Link: https://lkml.kernel.org/r/cover.1767638272.git.lorenzo.stoakes@oracle.com Link: https://lkml.kernel.org/r/b7930ad2b1503a657e29fe928eb33061d7eadf5b.1767638272.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Fixes: 879bca0a2c4f ("mm/vma: fix incorrectly disallowed anonymous VMA merges") Reported-by: syzbot+b165fc2e11771c66d8ba@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/694a2745.050a0220.19928e.0017.GAE@google.com/ Reported-by: syzbot+5272541ccbbb14e2ec30@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/694e3dc6.050a0220.35954c.0066.GAE@google.com/ Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Reviewed-by: Jeongjun Park <aha310510@gmail.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: David Hildenbrand (Red Hat) <david@kernel.org> Cc: Jann Horn <jannh@google.com> Cc: Yeoreum Yun <yeoreum.yun@arm.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com> Cc: Pedro Falcato <pfalcato@suse.de> Cc: Rik van Riel <riel@surriel.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-14mm/zswap: fix error pointer free in zswap_cpu_comp_prepare()Pavel Butsykin
crypto_alloc_acomp_node() may return ERR_PTR(), but the fail path checks only for NULL and can pass an error pointer to crypto_free_acomp(). Use IS_ERR_OR_NULL() to only free valid acomp instances. Link: https://lkml.kernel.org/r/20251231074638.2564302-1-pbutsykin@cloudlinux.com Fixes: 779b9955f643 ("mm: zswap: move allocations during CPU init outside the lock") Signed-off-by: Pavel Butsykin <pbutsykin@cloudlinux.com> Reviewed-by: SeongJae Park <sj@kernel.org> Acked-by: Yosry Ahmed <yosry.ahmed@linux.dev> Acked-by: Nhat Pham <nphamcs@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
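A sketch of the corrected error path (the surrounding zswap_cpu_comp_prepare() context is simplified; the free-side check is the point):

	struct crypto_acomp *acomp = NULL;
	int ret = 0;

	acomp = crypto_alloc_acomp_node(pool->tfm_name, 0, 0, cpu_to_node(cpu));
	if (IS_ERR(acomp)) {
		ret = PTR_ERR(acomp);
		goto fail;
	}
	/* ... further allocations that may also fail ... */
	return 0;

fail:
	/* acomp may be NULL or an ERR_PTR(); only free a real instance. */
	if (!IS_ERR_OR_NULL(acomp))
		crypto_free_acomp(acomp);
	return ret;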
2026-01-14mm/damon/sysfs-scheme: cleanup access_pattern subdirs on scheme dir setup ↵SeongJae Park
failure When a DAMOS-scheme DAMON sysfs directory setup fails after the access_pattern/ directory has been set up, the subdirectories of the access_pattern/ directory are not cleaned up. As a result, the DAMON sysfs interface is nearly broken until the system reboots, and the memory for the unremoved directories is leaked. Clean up the directories on such failures. Link: https://lkml.kernel.org/r/20251225023043.18579-5-sj@kernel.org Fixes: 9bbb820a5bd5 ("mm/damon/sysfs: support DAMOS quotas") Signed-off-by: SeongJae Park <sj@kernel.org> Cc: chongjiapeng <jiapeng.chong@linux.alibaba.com> Cc: <stable@vger.kernel.org> # 5.18.x Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-14mm/damon/sysfs-scheme: cleanup quotas subdirs on scheme dir setup failureSeongJae Park
When a DAMOS-scheme DAMON sysfs directory setup fails after the quotas/ directory has been set up, the subdirectories of the quotas/ directory are not cleaned up. As a result, the DAMON sysfs interface is nearly broken until the system reboots, and the memory for the unremoved directories is leaked. Clean up the directories on such failures. Link: https://lkml.kernel.org/r/20251225023043.18579-4-sj@kernel.org Fixes: 1b32234ab087 ("mm/damon/sysfs: support DAMOS watermarks") Signed-off-by: SeongJae Park <sj@kernel.org> Cc: chongjiapeng <jiapeng.chong@linux.alibaba.com> Cc: <stable@vger.kernel.org> # 5.18.x Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-14mm/damon/sysfs: cleanup attrs subdirs on context dir setup failureSeongJae Park
When a context DAMON sysfs directory setup fails after the attrs/ directory has been set up, the subdirectories of the attrs/ directory are not cleaned up. As a result, the DAMON sysfs interface is nearly broken until the system reboots, and the memory for the unremoved directories is leaked. Clean up the directories on such failures. Link: https://lkml.kernel.org/r/20251225023043.18579-3-sj@kernel.org Fixes: c951cd3b8901 ("mm/damon: implement a minimal stub for sysfs-based DAMON interface") Signed-off-by: SeongJae Park <sj@kernel.org> Cc: chongjiapeng <jiapeng.chong@linux.alibaba.com> Cc: <stable@vger.kernel.org> # 5.18.x Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-14mm/damon/sysfs: cleanup intervals subdirs on attrs dir setup failureSeongJae Park
Patch series "mm/damon/sysfs: free setup failures generated zombie sub-sub dirs". Some DAMON sysfs directory setup functions generates its sub and sub-sub directories. For example, 'monitoring_attrs/' directory setup creates 'intervals/' and 'intervals/intervals_goal/' directories under 'monitoring_attrs/' directory. When such sub-sub directories are successfully made but followup setup is failed, the setup function should recursively clean up the subdirectories. However, such setup functions are only dereferencing sub directory reference counters. As a result, under certain setup failures, the sub-sub directories keep having non-zero reference counters. It means the directories cannot be removed like zombies, and the memory for the directories cannot be freed. The user impact of this issue is limited due to the following reasons. When the issue happens, the zombie directories are still taking the path. Hence attempts to generate the directories again will fail, without additional memory leak. This means the upper bound memory leak is limited. Nonetheless this also implies controlling DAMON with a feature that requires the setup-failed sysfs files will be impossible until the system reboots. Also, the setup operations are quite simple. The certain failures would hence only rarely happen, and are difficult to artificially trigger. This patch (of 4): When attrs/ DAMON sysfs directory setup is failed after setup of intervals/ directory, intervals/intervals_goal/ directory is not cleaned up. As a result, DAMON sysfs interface is nearly broken until the system reboots, and the memory for the unremoved directory is leaked. Cleanup the directory under such failures. Link: https://lkml.kernel.org/r/20251225023043.18579-1-sj@kernel.org Link: https://lkml.kernel.org/r/20251225023043.18579-2-sj@kernel.org Fixes: 8fbbcbeaafeb ("mm/damon/sysfs: implement intervals tuning goal directory") Signed-off-by: SeongJae Park <sj@kernel.org> Cc: chongjiapeng <jiapeng.chong@linux.alibaba.com> Cc: <stable@vger.kernel.org> # 6.15.x Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-14mm/damon/core: remove call_control in inactive contextsSeongJae Park
If damon_call() is executed against a DAMON context that is not running, the function returns an error while leaving the damon_call_control object linked in the context's call_controls list. Suppose the object is deallocated after the damon_call(), and yet another damon_call() is executed against the same context. The function tries to add the new damon_call_control object to the call_controls list, which still holds a pointer to the previous, already-deallocated damon_call_control object. As a result, a use-after-free happens.

This can actually be triggered using the DAMON sysfs interface. It is not easily exploitable, though, since it requires sysfs write permission and a decidedly unusual sequence of file writes. Please refer to the report for more details about the issue reproduction steps.

Fix the issue by making two changes. Firstly, move the final kdamond_call(), which cancels all outstanding damon_call() requests, so that it runs before the ctx->kdamond reset when a DAMON context terminates. Any code that sees a NULL ctx->kdamond can then safely assume the context will not access damon_call() requests anymore. Secondly, make damon_call() clean up damon_call_control objects that were added to an already-terminated DAMON context before returning the error.

Link: https://lkml.kernel.org/r/20251231012315.75835-1-sj@kernel.org Fixes: 004ded6bee11 ("mm/damon: accept parallel damon_call() requests") Signed-off-by: SeongJae Park <sj@kernel.org> Reported-by: JaeJoon Jung <rgbi3307@gmail.com> Closes: https://lore.kernel.org/20251224094401.20384-1-rgbi3307@gmail.com Cc: <stable@vger.kernel.org> # 6.17.x Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
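A sketch of the second change (locking and list handling simplified; a list_head member named 'list' is assumed):

	int damon_call(struct damon_ctx *ctx, struct damon_call_control *control)
	{
		/* ... queue control on ctx->call_controls ... */

		if (!ctx->kdamond) {
			/*
			 * The context has already terminated: unlink the
			 * control we queued, so no stale pointer is left
			 * behind for a later damon_call() to trip over.
			 */
			list_del(&control->list);
			return -EINVAL;
		}
		/* ... */
	}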
2026-01-14mm/hugetlb: ignore hugepage kernel args if hugepages are unsupportedSourabh Jain
Skip processing the hugepage kernel arguments (hugepagesz, hugepages, and default_hugepagesz) when hugepages are not supported by the architecture.

Some architectures may need to disable hugepages based on conditions discovered during kernel boot, and the hugepages_supported() helper allows architecture code to advertise whether hugepages are supported. Currently, normal hugepage allocation is guarded by hugepages_supported(), but gigantic hugepages are allocated regardless of this check.

This causes problems on powerpc for fadump (firmware-assisted dump). In the fadump scenario, a production kernel crash causes the system to boot into a special kernel whose sole purpose is to collect the memory dump and reboot. Features such as hugepages are not required in this environment and should be disabled. For example, when the fadump kernel boots with the following kernel arguments:

  default_hugepagesz=1GB hugepagesz=1GB hugepages=200

Before this patch, the kernel prints the following logs:

  HugeTLB: allocating 200 of page size 1.00 GiB failed. Only allocated 58 hugepages.
  HugeTLB support is disabled!
  HugeTLB: huge pages not supported, ignoring associated command-line parameters
  hugetlbfs: disabling because there are no supported hugepage sizes

Even though the logs state that HugeTLB support is disabled, gigantic hugepages are still allocated. This causes the fadump kernel to run out of memory during boot. After this patch is applied, the kernel prints the following logs for the same command line:

  HugeTLB: hugepages unsupported, ignoring default_hugepagesz=1GB cmdline
  HugeTLB: hugepages unsupported, ignoring hugepagesz=1GB cmdline
  HugeTLB: hugepages unsupported, ignoring hugepages=200 cmdline
  HugeTLB support is disabled!
  hugetlbfs: disabling because there are no supported hugepage sizes

To fix the issue, gigantic hugepage allocation should be guarded by hugepages_supported(). Previously, two approaches were proposed to bring gigantic hugepage allocation under hugepages_supported():

[1] Check hugepages_supported() in the generic code before allocating gigantic hugepages
[2] Make arch_hugetlb_valid_size() return false for all hugetlb sizes

Approach [2] has two minor issues:

1. It prints misleading logs about invalid hugepage sizes
2. The kernel still processes hugepage kernel arguments unnecessarily

To control gigantic hugepage allocation, skip processing the hugepage kernel arguments (default_hugepagesz, hugepagesz and hugepages) when hugepages_supported() returns false.

Note for backporting: This fix is a partial reversion of the commit mentioned in the Fixes tag and is only valid once the change referenced by the Depends-on tag is present. When backporting this patch, the commit mentioned in the Depends-on tag must be included first.
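A sketch of the guard this adds to the early parameter parsers (the same shape applies to all three parameters; only hugepages= is shown, existing parsing elided):

	static int __init hugepages_setup(char *s)
	{
		if (!hugepages_supported()) {
			pr_warn("HugeTLB: hugepages unsupported, ignoring hugepages=%s cmdline\n", s);
			return 1;	/* consume the parameter, allocate nothing */
		}
		/* ... existing parsing and (gigantic) page allocation ... */
	}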
Link: https://lore.kernel.org/all/20250121150419.1342794-1-sourabhjain@linux.ibm.com/ [1] Link: https://lore.kernel.org/all/20250128043358.163372-1-sourabhjain@linux.ibm.com/ [2] Link: https://lkml.kernel.org/r/20251224115524.1272010-1-sourabhjain@linux.ibm.com Fixes: c2833a5bf75b ("hugetlbfs: fix changes to command line processing") Signed-off-by: Sourabh Jain <sourabhjain@linux.ibm.com> Depends-on: 2354ad252b66 ("powerpc/mm: Update default hugetlb size early") Acked-by: David Hildenbrand (Red Hat) <david@kernel.org> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Muchun Song <muchun.song@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-14mm/page_alloc: make percpu_pagelist_high_fraction reads lock-freeAboorva Devarajan
When page isolation loops indefinitely during memory offline, reading /proc/sys/vm/percpu_pagelist_high_fraction blocks on pcp_batch_high_lock, causing hung task warnings. Make procfs reads lock-free: percpu_pagelist_high_fraction is a simple integer with naturally atomic reads, and writers still serialize via the mutex. This prevents hung task warnings when reading the procfs file during long-running memory offline operations. [akpm@linux-foundation.org: add comment, per Michal] Link: https://lkml.kernel.org/r/aS_y9AuJQFydLEXo@tiehlicka Link: https://lkml.kernel.org/r/20251201060009.1420792-1-aboorvad@linux.ibm.com Signed-off-by: Aboorva Devarajan <aboorvad@linux.ibm.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Brendan Jackman <jackmanb@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zi Yan <ziy@nvidia.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
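A sketch of the resulting handler shape (signature follows the usual proc handler convention; the exact code in mm/page_alloc.c differs):

	static int percpu_pagelist_high_fraction_sysctl_handler(
			const struct ctl_table *table, int write,
			void *buffer, size_t *length, loff_t *ppos)
	{
		int ret;

		/*
		 * Reads of a plain int are naturally atomic: do not take
		 * pcp_batch_high_lock, which can be held for a very long
		 * time while memory offline loops on page isolation.
		 */
		if (!write)
			return proc_dointvec_minmax(table, write, buffer,
						    length, ppos);

		mutex_lock(&pcp_batch_high_lock);
		ret = proc_dointvec_minmax(table, write, buffer, length, ppos);
		/* ... apply the new fraction to all zones ... */
		mutex_unlock(&pcp_batch_high_lock);
		return ret;
	}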
2026-01-14mm/damon/core: get memcg reference before accessShakeel Butt
The commit b74a120bcf507 ("mm/damon/core: implement DAMOS_QUOTA_NODE_MEMCG_USED_BP") added accesses to the memcg structure without taking a reference to it. This is unsafe. Let's get the reference before accessing the memcg. Link: https://lkml.kernel.org/r/20251225002904.139543-1-shakeel.butt@linux.dev Fixes: b74a120bcf507 ("mm/damon/core: implement DAMOS_QUOTA_NODE_MEMCG_USED_BP") Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Reviewed-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
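The usual pattern for taking such a reference, as a sketch (DAMON's actual lookup path may differ slightly):

	struct mem_cgroup *memcg;

	rcu_read_lock();
	memcg = mem_cgroup_from_id(memcg_id);
	/* A cgroup being destroyed can fail the tryget; treat as absent. */
	if (memcg && !css_tryget(&memcg->css))
		memcg = NULL;
	rcu_read_unlock();

	if (memcg) {
		/* ... safely access memcg state ... */
		css_put(&memcg->css);
	}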
2026-01-14mm: vmalloc: fix up vrealloc_node_align() kernel-doc macro nameBagas Sanjaya
Sphinx reports a kernel-doc warning:

  WARNING: ./mm/vmalloc.c:4284 expecting prototype for vrealloc_node_align_noprof(). Prototype was for vrealloc_node_align() instead

Fix the macro name in the vrealloc_node_align_noprof() kernel-doc comment. Link: https://lkml.kernel.org/r/20251219014006.16328-5-bagasdotme@gmail.com Fixes: 4c5d3365882d ("mm/vmalloc: allow to set node and align in vrealloc") Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com> Reviewed-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
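For reference, the corrected shape: a kernel-doc comment must name the symbol whose prototype follows it, here the vrealloc_node_align() macro (description elided):

	/**
	 * vrealloc_node_align - reallocate virtually contiguous memory
	 * ...
	 */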
2026-01-14mm_zone: Generalise has_managed_dma()Robin Murphy
It would be useful to be able to check for potential DMA pages beyond just ZONE_DMA - generalise the existing has_managed_dma() function to allow checking other zones too. Signed-off-by: Robin Murphy <robin.murphy@arm.com> Acked-by: David Hildenbrand (Red Hat) <david@kernel.org> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Tested-by: Vladimir Kondratiev <vladimir.kondratiev@mobileye.com> Reviewed-by: Baoquan He <bhe@redhat.com> Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com> Link: https://lore.kernel.org/r/bd002d2351074e57be1ca08f03f333debac658fb.1768230104.git.robin.murphy@arm.com
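One plausible shape of the generalisation (signature assumed for illustration, not quoted from the patch):

	/* Return true if any online node has managed pages in zone 'zt'. */
	bool has_managed_zone(enum zone_type zt)
	{
		struct pglist_data *pgdat;

		for_each_online_pgdat(pgdat) {
			struct zone *zone = &pgdat->node_zones[zt];

			if (managed_zone(zone))
				return true;
		}
		return false;
	}

	/* The existing check becomes a trivial wrapper. */
	bool has_managed_dma(void)
	{
		return has_managed_zone(ZONE_DMA);
	}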
2026-01-05x86/kaslr: Recognize all ZONE_DEVICE users as physaddr consumersDan Williams
Commit 7ffb791423c7 ("x86/kaslr: Reduce KASLR entropy on most x86 systems") is too narrow. The effect being mitigated in that commit is caused by ZONE_DEVICE, on which PCI_P2PDMA depends. ZONE_DEVICE, in general, lets any physical address be added to the direct-map. I.e. not only ACPI hotplug ranges, CXL Memory Windows, or EFI Specific Purpose Memory, but also any PCI MMIO range for the DEVICE_PRIVATE and PCI_P2PDMA cases. Update the mitigation (limiting KASLR entropy) to apply in all ZONE_DEVICE=y cases. Distro kernels typically have PCI_P2PDMA=y, so the practical exposure of this problem is limited to the PCI_P2PDMA=n case. A potential path to recover entropy would be to walk ACPI and determine the limits for hotplug and PCI MMIO before kernel_randomize_memory(). On smaller systems that could yield some KASLR address bits. This needs additional investigation to determine if some limited ACPI table scanning can happen this early without an open coded solution like the one arch/x86/boot/compressed/acpi.c needs to deploy. Cc: Ingo Molnar <mingo@kernel.org> Cc: Kees Cook <kees@kernel.org> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Logan Gunthorpe <logang@deltatee.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: David Hildenbrand <david@redhat.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Mike Rapoport <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Michal Hocko <mhocko@suse.com> Fixes: 7ffb791423c7 ("x86/kaslr: Reduce KASLR entropy on most x86 systems") Cc: <stable@vger.kernel.org> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Balbir Singh <balbirs@nvidia.com> Tested-by: Yasunori Goto <y-goto@fujitsu.com> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Link: http://patch.msgid.link/692e08b2516d4_261c1100a3@dwillia2-mobl4.notmuch Signed-off-by: Dave Jiang <dave.jiang@intel.com>
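The conceptual change, as a heavily hedged sketch (the helper below is hypothetical; the real logic lives in arch/x86/mm/kaslr.c):

	/* Before: only PCI_P2PDMA was treated as a physaddr consumer. */
	if (IS_ENABLED(CONFIG_PCI_P2PDMA))
		assume_max_physmem_for_kaslr();	/* hypothetical helper */

	/*
	 * After: any ZONE_DEVICE user (DEVICE_PRIVATE, FS-DAX, P2PDMA,
	 * CXL, ...) may add arbitrary physical addresses to the direct
	 * map, so the entropy budget must assume the maximum.
	 */
	if (IS_ENABLED(CONFIG_ZONE_DEVICE))
		assume_max_physmem_for_kaslr();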
2025-12-23mm/ksm: fix pte_unmap_unlock of wrong address in break_ksm_pmd_entrySasha Levin
On ARM32 with HIGHMEM/HIGHPTE, break_ksm_pmd_entry() triggers a BUG during KSM unmerging because pte_unmap_unlock() is passed a pointer that may be beyond the mapped PTE page. The issue occurs when the PTE iteration loop completes without finding a KSM page. After the loop, 'ptep' has been incremented past the last PTE entry. On ARM32 LPAE with 512 PTEs per page (512 * 8 = 4096 bytes), this means ptep points to the next page, outside the kmap'd region. When pte_unmap_unlock(ptep, ptl) calls kunmap_local(ptep), it unmaps the wrong page address, leaving the original kmap slot still mapped. The next kmap_local then finds this slot unexpectedly occupied: WARNING: mm/highmem.c:622 kunmap_local_indexed (address mismatch) kernel BUG at mm/highmem.c:564 __kmap_local_pfn_prot (slot not empty) Fix this by passing start_ptep to pte_unmap_unlock(), which always points within the originally mapped PTE page. Reproducer: Run LTP ksm03 test on ARM32 with HIGHMEM enabled. The test triggers KSM merging followed by unmerging (writing 0 then 2 to /sys/kernel/mm/ksm/run), which exercises break_ksm_pmd_entry(). Link: https://lkml.kernel.org/r/20251220202926.318366-1-sashal@kernel.org Fixes: 5d4939fc2258 ("ksm: perform a range-walk in break_ksm") Signed-off-by: Sasha Levin <sashal@kernel.org> Assisted-by: claude-opus-4-5-20251101 Acked-by: David Hildenbrand (Red Hat) <david@kernel.org> Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev> Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com> Cc: xu xin <xu.xin16@zte.com.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
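A sketch of the loop shape and the one-line fix (simplified from break_ksm_pmd_entry()):

	pte_t *start_ptep, *ptep;
	spinlock_t *ptl;

	start_ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);
	if (!start_ptep)
		return 0;

	for (ptep = start_ptep; addr < end; ptep++, addr += PAGE_SIZE) {
		/* ... look for a KSM page, break out if one is found ... */
	}

	/*
	 * If the loop ran to completion, ptep points one past the last
	 * PTE, which on ARM32 LPAE is outside the kmap'd page.  Unmap
	 * with the pointer known to lie inside the mapping.
	 */
	pte_unmap_unlock(start_ptep, ptl);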
2025-12-23mm/page_owner: fix memory leak in page_owner_stack_fops->release()Ran Xiaokai
The page_owner_stack_fops->open() callback invokes seq_open_private(), so its corresponding ->release() callback must call seq_release_private(). Otherwise, the struct stack_print_ctx allocated at open time is leaked. Link: https://lkml.kernel.org/r/20251219074232.136482-1-ranxiaokai627@163.com Fixes: 765973a09803 ("mm,page_owner: display all stacks and their count") Signed-off-by: Ran Xiaokai <ran.xiaokai@zte.com.cn> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Brendan Jackman <jackmanb@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Marco Elver <elver@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Zi Yan <ziy@nvidia.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
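The pairing rule, as a sketch (the open callback name is assumed; seq_read, seq_lseek and seq_release_private are the real seq_file helpers):

	static const struct file_operations page_owner_stack_fops = {
		.open		= page_owner_stack_open,  /* uses seq_open_private() */
		.read		= seq_read,
		.llseek		= seq_lseek,
		/*
		 * seq_open_private() allocates the private context
		 * (struct stack_print_ctx here); seq_release_private()
		 * frees it.  Plain seq_release() would leak it.
		 */
		.release	= seq_release_private,
	};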
2025-12-23mm/memremap: fix spurious large folio warning for FS-DAXJohn Groves
This patch addresses a warning that I discovered while working on famfs, which is an fs-dax file system that virtually always does PMD faults (next famfs patch series coming after the holidays). However, XFS also does PMD faults in fs-dax mode, and it also triggers the warning. It takes some effort to get XFS to do a PMD fault, but instructions to reproduce it are below. The VM_WARN_ON_ONCE(folio_test_large(folio)) check in free_zone_device_folio() incorrectly triggers for MEMORY_DEVICE_FS_DAX when PMD (2MB) mappings are used. FS-DAX legitimately creates large file-backed folios when handling PMD faults. This is a core feature of FS-DAX that provides significant performance benefits by mapping 2MB regions directly to persistent memory. When these mappings are unmapped, the large folios are freed through free_zone_device_folio(), which triggers the spurious warning. The warning was introduced by the commit that added support for large zone device private folios. However, that commit did not account for FS-DAX file-backed folios, which have always supported large (PMD-sized) mappings. The check distinguishes between anonymous folios (which clear AnonExclusive flags for each sub-page) and file-backed folios. For file-backed folios, it assumes large folios are unexpected - but this assumption is incorrect for FS-DAX. The fix is to exempt MEMORY_DEVICE_FS_DAX from the large folio warning, allowing FS-DAX to continue using PMD mappings without triggering false warnings. Link: https://lkml.kernel.org/r/20251219123717.39330-1-john@groves.net Fixes: d245f9b4ab80 ("mm/zone_device: support large zone device private folios") Signed-off-by: John Groves <john@groves.net> Acked-by: David Hildenbrand (Red Hat) <david@kernel.org> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Tested-by: Alison Schofield <alison.schofield@intel.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Balbir Singh <bsingharora@gmail.com> Cc: "Darrick J. Wong" <djwong@kernel.org> Cc: Gregory Price <gourry@gourry.net> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
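A sketch of the exemption (simplified; the real free_zone_device_folio() flow differs in detail):

	/*
	 * Large anonymous device-private folios are handled separately
	 * (AnonExclusive is cleared per sub-page).  For file-backed
	 * folios, large folios are legitimate only for FS-DAX, which
	 * maps PMD-sized folios directly.
	 */
	if (folio_test_large(folio) && !folio_test_anon(folio))
		VM_WARN_ON_ONCE(pgmap->type != MEMORY_DEVICE_FS_DAX);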
2025-12-23mm/page_alloc: report 1 as zone_batchsize for !CONFIG_MMUJoshua Hahn
Commit 2783088ef24e ("mm/page_alloc: prevent reporting pcp->batch = 0") moved the error handling (0-handling) of zone_batchsize from its callers to inside the function. However, the commit left out the error handling for the NOMMU case, leading to deadlocks on NOMMU systems. For NOMMU systems, return 1 instead of 0 from zone_batchsize, which restores the previous deadlock-free behavior. Compared with the behavior before commit 2783088ef24e, no functional difference is expected from this patch, other than the pr_debug in zone_pcp_init now printing 1 instead of 0 for zones on NOMMU systems. Not only is this a pr_debug, the difference is purely semantic anyway. Link: https://lkml.kernel.org/r/20251218083200.2435789-1-joshua.hahnjy@gmail.com Fixes: 2783088ef24e ("mm/page_alloc: prevent reporting pcp->batch = 0") Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com> Reported-by: Daniel Palmer <daniel@thingy.jp> Closes: https://lore.kernel.org/linux-mm/CAFr9PX=_HaM3_xPtTiBn5Gw5-0xcRpawpJ02NStfdr0khF2k7g@mail.gmail.com/ Reported-by: Guenter Roeck <linux@roeck-us.net> Closes: https://lore.kernel.org/all/42143500-c380-41fe-815c-696c17241506@roeck-us.net/ Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Tested-by: Daniel Palmer <daniel@thingy.jp> Tested-by: Guenter Roeck <linux@roeck-us.net> Acked-by: SeongJae Park <sj@kernel.org> Tested-by: Hajime Tazaki <thehajime@gmail.com> Cc: Brendan Jackman <jackmanb@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
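A sketch of the fixed function shape (the MMU-side computation is elided):

	static int zone_batchsize(struct zone *zone)
	{
	#ifdef CONFIG_MMU
		int batch;

		/* ... scale the batch to the zone size, clamp and round ... */
		return batch;
	#else
		/*
		 * NOMMU: the percpu lists are barely used, but pcp->batch
		 * must never be reported as 0, otherwise the free path can
		 * make no progress and the system deadlocks.
		 */
		return 1;
	#endif
	}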
2025-12-23mm: memcg: fix unit conversion for K() macro in OOM logShakeel Butt
The commit bc8e51c05ad5 ("mm: memcg: dump memcg protection info on oom or alloc failures") added functionality to dump memcg protections on OOM or allocation failures. It uses the K() macro to dump the information and passes bytes to the macro. However, the macro takes a number of pages, not bytes. It is defined as:

  #define K(x) ((x) << (PAGE_SHIFT-10))

Let's fix this. Link: https://lkml.kernel.org/r/20251216212054.484079-1-shakeel.butt@linux.dev Fixes: bc8e51c05ad5 ("mm: memcg: dump memcg protection info on oom or alloc failures") Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Reported-by: Chris Mason <clm@fb.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Muchun Song <muchun.song@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
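The unit bug in miniature (illustrative values, not the patch itself):

	#define K(x) ((x) << (PAGE_SHIFT - 10))		/* pages -> KiB */

	unsigned long low = 64UL << 20;			/* 64 MiB, in bytes */

	pr_info("low %lukB\n", K(low));			/* wrong: bytes passed as pages */
	pr_info("low %lukB\n", K(low >> PAGE_SHIFT));	/* right: convert bytes to pages first */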