From d9e4142e7635f6f7173854667c0695ce5b836bbc Mon Sep 17 00:00:00 2001
From: Breno Leitao
Date: Mon, 16 Mar 2026 04:54:31 -0700
Subject: kho: add size parameter to kho_add_subtree()

Patch series "kho: history: track previous kernel version and kexec boot
count", v9.

Use Kexec Handover (KHO) to pass the previous kernel's version string and
the number of kexec reboots since the last cold boot to the next kernel,
and print them at boot time.

Example
=======

[ 0.000000] Linux version 6.19.0-rc3-upstream-00047-ge5d992347849 ...
[ 0.000000] KHO: exec from: 6.19.0-rc4-next-20260107upstream-00004-g3071b0dc4498 (count 1)

Motivation
==========

Bugs that only reproduce when kexecing from specific kernel versions are
difficult to diagnose. These issues occur when a buggy kernel kexecs into
a new kernel, with the bug manifesting only in the second kernel.

Recent examples include:

 * eb2266312507 ("x86/boot: Fix page table access in 5-level to 4-level paging transition")
 * 77d48d39e991 ("efistub/tpm: Use ACPI reclaim memory for event log to avoid corruption")
 * 64b45dd46e15 ("x86/efi: skip memattr table on kexec boot")

As kexec-based reboots become more common, these version-dependent bugs
are appearing more frequently. At scale, correlating crashes to the
previous kernel version is challenging, especially when issues only occur
in specific transition scenarios.

Some bugs manifest only after multiple consecutive kexec reboots.
Tracking the kexec count helps identify these cases (this metric is
already used by the live update subsystem).

KHO provides a reliable mechanism to pass information between kernels. By
carrying the previous kernel's release string and kexec count forward, we
can print this context at boot time to aid debugging.

The goal of this feature is to print this information in early boot so
that users can trace back kernel releases across kexec.
Systemd is not helpful because we cannot assume that the previous kernel
has systemd or even write access to the disk (common when using Linux as
a bootloader).

This patch (of 6):

kho_add_subtree() assumes the fdt argument is always an FDT and calls
fdt_totalsize() on it in the debugfs code path. This assumption will
break if a caller passes arbitrary data instead of an FDT.

When CONFIG_KEXEC_HANDOVER_DEBUGFS is enabled, kho_debugfs_fdt_add() calls
__kho_debugfs_fdt_add(), which executes:

    f->wrapper.size = fdt_totalsize(fdt);

Fix this by adding an explicit size parameter to kho_add_subtree() so
callers specify the blob size. This allows subtrees to contain arbitrary
data formats, not just FDTs. Update all callers:

 - memblock.c: use fdt_totalsize(fdt)
 - luo_core.c: use fdt_totalsize(fdt_out)
 - test_kho.c: use fdt_totalsize()
 - kexec_handover.c (root fdt): use fdt_totalsize(kho_out.fdt)

Also update __kho_debugfs_fdt_add() to receive the size explicitly instead
of computing it internally via fdt_totalsize().

In kho_in_debugfs_init(), pass fdt_totalsize() for the root FDT and
sub-blobs since all current users are FDTs. A subsequent patch will
persist the size in the KHO FDT so the incoming side can handle non-FDT
blobs correctly.

Link: https://lore.kernel.org/20260323110747.193569-1-duanchenghao@kylinos.cn
Link: https://lore.kernel.org/20260316-kho-v9-1-ed6dcd951988@debian.org
Signed-off-by: Breno Leitao
Suggested-by: Pratyush Yadav
Reviewed-by: Mike Rapoport (Microsoft)
Reviewed-by: Pratyush Yadav
Cc: Alexander Graf
Cc: David Hildenbrand
Cc: Jonathan Corbet
Cc: "Liam R. Howlett"
Cc: Lorenzo Stoakes
Cc: Michal Hocko
Cc: Pasha Tatashin
Cc: SeongJae Park
Cc: Shuah Khan
Cc: Suren Baghdasaryan
Cc: Vlastimil Babka
Signed-off-by: Andrew Morton
---
 lib/test_kho.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

(limited to 'lib')

diff --git a/lib/test_kho.c b/lib/test_kho.c
index 7ef9e4061869..263182437315 100644
--- a/lib/test_kho.c
+++ b/lib/test_kho.c
@@ -143,7 +143,8 @@ static int kho_test_preserve(struct kho_test_state *state)
 	if (err)
 		goto err_unpreserve_data;
 
-	err = kho_add_subtree(KHO_TEST_FDT, folio_address(state->fdt));
+	err = kho_add_subtree(KHO_TEST_FDT, folio_address(state->fdt),
+			      fdt_totalsize(folio_address(state->fdt)));
 	if (err)
 		goto err_unpreserve_data;
 
-- 
cgit v1.2.3


From 85e41392820fcf0f7a3f9784cea907905f921358 Mon Sep 17 00:00:00 2001
From: Breno Leitao
Date: Mon, 16 Mar 2026 04:54:33 -0700
Subject: kho: persist blob size in KHO FDT

kho_add_subtree() accepts a size parameter but only forwards it to
debugfs. The size is not persisted in the KHO FDT, so it is lost across
kexec. This makes it impossible for the incoming kernel to determine the
blob size without understanding the blob format.

Store the blob size as a "blob-size" property in the KHO FDT alongside the
"preserved-data" physical address. This allows the receiving kernel to
recover the size for any blob regardless of format.

Also extend kho_retrieve_subtree() with an optional size output parameter
so callers can learn the blob size without needing to understand the blob
format. Update all callers to pass NULL for the new parameter.

Link: https://lore.kernel.org/20260316-kho-v9-3-ed6dcd951988@debian.org
Signed-off-by: Breno Leitao
Reviewed-by: Mike Rapoport (Microsoft)
Reviewed-by: Pratyush Yadav
Cc: Alexander Graf
Cc: David Hildenbrand
Cc: Jonathan Corbet
Cc: "Liam R. Howlett"
Cc: Lorenzo Stoakes
Cc: Michal Hocko
Cc: Pasha Tatashin
Cc: SeongJae Park
Cc: Shuah Khan
Cc: Suren Baghdasaryan
Cc: Vlastimil Babka
Signed-off-by: Andrew Morton
---
 lib/test_kho.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

(limited to 'lib')

diff --git a/lib/test_kho.c b/lib/test_kho.c
index 263182437315..aa6a0956bb8b 100644
--- a/lib/test_kho.c
+++ b/lib/test_kho.c
@@ -319,7 +319,7 @@ static int __init kho_test_init(void)
 	if (!kho_is_enabled())
 		return 0;
 
-	err = kho_retrieve_subtree(KHO_TEST_FDT, &fdt_phys);
+	err = kho_retrieve_subtree(KHO_TEST_FDT, &fdt_phys, NULL);
 	if (!err) {
 		err = kho_test_restore(fdt_phys);
 		if (err)
-- 
cgit v1.2.3


From 074488008d6e745af067e968d6046f2c04b12537 Mon Sep 17 00:00:00 2001
From: Pasha Tatashin
Date: Fri, 27 Mar 2026 03:33:32 +0000
Subject: liveupdate: remove liveupdate_test_unregister()

Now that file handler unregistration automatically unregisters all
associated file handlers (FLBs), the liveupdate_test_unregister() function
is no longer needed. Remove it along with its usages and declarations.

Link: https://lore.kernel.org/20260327033335.696621-9-pasha.tatashin@soleen.com
Signed-off-by: Pasha Tatashin
Reviewed-by: Pratyush Yadav (Google)
Cc: David Matlack
Cc: Mike Rapoport
Cc: Samiullah Khawaja
Signed-off-by: Andrew Morton
---
 lib/tests/liveupdate.c | 18 ------------------
 1 file changed, 18 deletions(-)

(limited to 'lib')

diff --git a/lib/tests/liveupdate.c b/lib/tests/liveupdate.c
index 496d6ef91a30..e4b0ecbee32f 100644
--- a/lib/tests/liveupdate.c
+++ b/lib/tests/liveupdate.c
@@ -135,24 +135,6 @@ void liveupdate_test_register(struct liveupdate_file_handler *fh)
 		TEST_NFLBS, fh->compatible);
 }
 
-void liveupdate_test_unregister(struct liveupdate_file_handler *fh)
-{
-	int err, i;
-
-	for (i = 0; i < TEST_NFLBS; i++) {
-		struct liveupdate_flb *flb = &test_flbs[i];
-
-		err = liveupdate_unregister_flb(fh, flb);
-		if (err) {
-			pr_err("Failed to unregister %s %pe\n",
-			       flb->compatible, ERR_PTR(err));
-		}
-	}
-
-	pr_info("Unregistered %d FLBs from file handler: [%s]\n",
-		TEST_NFLBS, fh->compatible);
-}
-
 MODULE_LICENSE("GPL");
 MODULE_AUTHOR("Pasha Tatashin ");
 MODULE_DESCRIPTION("In-kernel test for LUO mechanism");
-- 
cgit v1.2.3


From 6b1842775a460245e97d36d3a67d0cfba7c4ff79 Mon Sep 17 00:00:00 2001
From: Hao Ge
Date: Tue, 31 Mar 2026 16:13:12 +0800
Subject: mm/alloc_tag: clear codetag for pages allocated before page_ext initialization

Due to initialization ordering, page_ext is allocated and initialized
relatively late during boot. Some pages have already been allocated and
freed before page_ext becomes available, leaving their codetag
uninitialized.

A clear example is in init_section_page_ext(): alloc_page_ext() calls
kmemleak_alloc(). If the slab cache has no free objects, it falls back to
the buddy allocator to allocate memory. However, at this point page_ext is
not yet fully initialized, so these newly allocated pages have no codetag
set.
These pages may later be reclaimed by KASAN, which causes the warning to
trigger when they are freed because their codetag ref is still empty.

Use a global array to track pages allocated before page_ext is fully
initialized. The array size is fixed at 8192 entries, and a warning is
emitted if this limit is exceeded. When page_ext initialization
completes, set their codetag to empty to avoid warnings when they are
freed later.

This warning is only observed with CONFIG_MEM_ALLOC_PROFILING_DEBUG=y and
mem_profiling_compressed disabled:

[ 9.582133] ------------[ cut here ]------------
[ 9.582137] alloc_tag was not set
[ 9.582139] WARNING: ./include/linux/alloc_tag.h:164 at __pgalloc_tag_sub+0x40f/0x550, CPU#5: systemd/1
[ 9.582190] CPU: 5 UID: 0 PID: 1 Comm: systemd Not tainted 7.0.0-rc4 #1 PREEMPT(lazy)
[ 9.582192] Hardware name: Red Hat KVM, BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
[ 9.582194] RIP: 0010:__pgalloc_tag_sub+0x40f/0x550
[ 9.582196] Code: 00 00 4c 29 e5 48 8b 05 1f 88 56 05 48 8d 4c ad 00 48 8d 2c c8 e9 87 fd ff ff 0f 0b 0f 0b e9 f3 fe ff ff 48 8d 3d 61 2f ed 03 <67> 48 0f b9 3a e9 b3 fd ff ff 0f 0b eb e4 e8 5e cd 14 02 4c 89 c7
[ 9.582197] RSP: 0018:ffffc9000001f940 EFLAGS: 00010246
[ 9.582200] RAX: dffffc0000000000 RBX: 1ffff92000003f2b RCX: 1ffff110200d806c
[ 9.582201] RDX: ffff8881006c0360 RSI: 0000000000000004 RDI: ffffffff9bc7b460
[ 9.582202] RBP: 0000000000000000 R08: 0000000000000000 R09: fffffbfff3a62324
[ 9.582203] R10: ffffffff9d311923 R11: 0000000000000000 R12: ffffea0004001b00
[ 9.582204] R13: 0000000000002000 R14: ffffea0000000000 R15: ffff8881006c0360
[ 9.582206] FS: 00007ffbbcf2d940(0000) GS:ffff888450479000(0000) knlGS:0000000000000000
[ 9.582208] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 9.582210] CR2: 000055ee3aa260d0 CR3: 0000000148b67005 CR4: 0000000000770ef0
[ 9.582211] PKRU: 55555554
[ 9.582212] Call Trace:
[ 9.582213] <TASK>
[ 9.582214] ? __pfx___pgalloc_tag_sub+0x10/0x10
[ 9.582216] ? check_bytes_and_report+0x68/0x140
[ 9.582219] __free_frozen_pages+0x2e4/0x1150
[ 9.582221] ? __free_slab+0xc2/0x2b0
[ 9.582224] qlist_free_all+0x4c/0xf0
[ 9.582227] kasan_quarantine_reduce+0x15d/0x180
[ 9.582229] __kasan_slab_alloc+0x69/0x90
[ 9.582232] kmem_cache_alloc_noprof+0x14a/0x500
[ 9.582234] do_getname+0x96/0x310
[ 9.582237] do_readlinkat+0x91/0x2f0
[ 9.582239] ? __pfx_do_readlinkat+0x10/0x10
[ 9.582240] ? get_random_bytes_user+0x1df/0x2c0
[ 9.582244] __x64_sys_readlinkat+0x96/0x100
[ 9.582246] do_syscall_64+0xce/0x650
[ 9.582250] ? __x64_sys_getrandom+0x13a/0x1e0
[ 9.582252] ? __pfx___x64_sys_getrandom+0x10/0x10
[ 9.582254] ? do_syscall_64+0x114/0x650
[ 9.582255] ? ksys_read+0xfc/0x1d0
[ 9.582258] ? __pfx_ksys_read+0x10/0x10
[ 9.582260] ? do_syscall_64+0x114/0x650
[ 9.582262] ? do_syscall_64+0x114/0x650
[ 9.582264] ? __pfx_fput_close_sync+0x10/0x10
[ 9.582266] ? file_close_fd_locked+0x178/0x2a0
[ 9.582268] ? __x64_sys_faccessat2+0x96/0x100
[ 9.582269] ? __x64_sys_close+0x7d/0xd0
[ 9.582271] ? do_syscall_64+0x114/0x650
[ 9.582273] ? do_syscall_64+0x114/0x650
[ 9.582275] ? clear_bhb_loop+0x50/0xa0
[ 9.582277] ? clear_bhb_loop+0x50/0xa0
[ 9.582279] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 9.582280] RIP: 0033:0x7ffbbda345ee
[ 9.582282] Code: 0f 1f 40 00 48 8b 15 29 38 0d 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff c3 0f 1f 40 00 f3 0f 1e fa 49 89 ca b8 0b 01 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d fa 37 0d 00 f7 d8 64 89 01 48
[ 9.582284] RSP: 002b:00007ffe2ad8de58 EFLAGS: 00000202 ORIG_RAX: 000000000000010b
[ 9.582286] RAX: ffffffffffffffda RBX: 000055ee3aa25570 RCX: 00007ffbbda345ee
[ 9.582287] RDX: 000055ee3aa25570 RSI: 00007ffe2ad8dee0 RDI: 00000000ffffff9c
[ 9.582288] RBP: 0000000000001000 R08: 0000000000000003 R09: 0000000000001001
[ 9.582289] R10: 0000000000001000 R11: 0000000000000202 R12: 0000000000000033
[ 9.582290] R13: 00007ffe2ad8dee0 R14: 00000000ffffff9c R15: 00007ffe2ad8deb0
[ 9.582292] </TASK>
[ 9.582293] ---[ end trace 0000000000000000 ]---

Link: https://lore.kernel.org/20260331081312.123719-1-hao.ge@linux.dev
Fixes: dcfe378c81f72 ("lib: introduce support for page allocation tagging")
Signed-off-by: Hao Ge
Suggested-by: Suren Baghdasaryan
Acked-by: Suren Baghdasaryan
Cc: Kent Overstreet
Cc:
Signed-off-by: Andrew Morton
---
 lib/alloc_tag.c | 109 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 109 insertions(+)

(limited to 'lib')

diff --git a/lib/alloc_tag.c b/lib/alloc_tag.c
index 58991ab09d84..ed1bdcf1f8ab 100644
--- a/lib/alloc_tag.c
+++ b/lib/alloc_tag.c
@@ -6,7 +6,9 @@
 #include
 #include
 #include
+#include
 #include
+#include
 #include
 #include
 #include
@@ -758,8 +760,115 @@ static __init bool need_page_alloc_tagging(void)
 	return mem_profiling_support;
 }
 
+#ifdef CONFIG_MEM_ALLOC_PROFILING_DEBUG
+/*
+ * Track page allocations before page_ext is initialized.
+ * Some pages are allocated before page_ext becomes available, leaving
+ * their codetag uninitialized. Track these early PFNs so we can clear
+ * their codetag refs later to avoid warnings when they are freed.
+ *
+ * Early allocations include:
+ *   - Base allocations independent of CPU count
+ *   - Per-CPU allocations (e.g., CPU hotplug callbacks during smp_init,
+ *     such as trace ring buffers, scheduler per-cpu data)
+ *
+ * For simplicity, we fix the size to 8192.
+ * If insufficient, a warning will be triggered to alert the user.
+ *
+ * TODO: Replace fixed-size array with dynamic allocation using
+ * a GFP flag similar to ___GFP_NO_OBJ_EXT to avoid recursion.
+ */
+#define EARLY_ALLOC_PFN_MAX 8192
+
+static unsigned long early_pfns[EARLY_ALLOC_PFN_MAX] __initdata;
+static atomic_t early_pfn_count __initdata = ATOMIC_INIT(0);
+
+static void __init __alloc_tag_add_early_pfn(unsigned long pfn)
+{
+	int old_idx, new_idx;
+
+	do {
+		old_idx = atomic_read(&early_pfn_count);
+		if (old_idx >= EARLY_ALLOC_PFN_MAX) {
+			pr_warn_once("Early page allocations before page_ext init exceeded EARLY_ALLOC_PFN_MAX (%d)\n",
+				     EARLY_ALLOC_PFN_MAX);
+			return;
+		}
+		new_idx = old_idx + 1;
+	} while (!atomic_try_cmpxchg(&early_pfn_count, &old_idx, new_idx));
+
+	early_pfns[old_idx] = pfn;
+}
+
+typedef void alloc_tag_add_func(unsigned long pfn);
+static alloc_tag_add_func __rcu *alloc_tag_add_early_pfn_ptr __refdata =
+		RCU_INITIALIZER(__alloc_tag_add_early_pfn);
+
+void alloc_tag_add_early_pfn(unsigned long pfn)
+{
+	alloc_tag_add_func *alloc_tag_add;
+
+	if (static_key_enabled(&mem_profiling_compressed))
+		return;
+
+	rcu_read_lock();
+	alloc_tag_add = rcu_dereference(alloc_tag_add_early_pfn_ptr);
+	if (alloc_tag_add)
+		alloc_tag_add(pfn);
+	rcu_read_unlock();
+}
+
+static void __init clear_early_alloc_pfn_tag_refs(void)
+{
+	unsigned int i;
+
+	if (static_key_enabled(&mem_profiling_compressed))
+		return;
+
+	rcu_assign_pointer(alloc_tag_add_early_pfn_ptr, NULL);
+	/* Make sure we are not racing with __alloc_tag_add_early_pfn() */
+	synchronize_rcu();
+
+	for (i = 0; i < atomic_read(&early_pfn_count); i++) {
+		unsigned long pfn = early_pfns[i];
+
+		if (pfn_valid(pfn)) {
+			struct page *page = pfn_to_page(pfn);
+			union pgtag_ref_handle handle;
+			union codetag_ref ref;
+
+			if (get_page_tag_ref(page, &ref, &handle)) {
+				/*
+				 * An early-allocated page could be freed and reallocated
+				 * after its page_ext is initialized but before we clear it.
+				 * In that case, it already has a valid tag set.
+				 * We should not overwrite that valid tag with CODETAG_EMPTY.
+				 *
+				 * Note: there is still a small race window between checking
+				 * ref.ct and calling set_codetag_empty(). We accept this
+				 * race as it's unlikely and the extra complexity of atomic
+				 * cmpxchg is not worth it for this debug-only code path.
+				 */
+				if (ref.ct) {
+					put_page_tag_ref(handle);
+					continue;
+				}
+
+				set_codetag_empty(&ref);
+				update_page_tag_ref(handle, &ref);
+				put_page_tag_ref(handle);
+			}
+		}
+
+	}
+}
+#else /* !CONFIG_MEM_ALLOC_PROFILING_DEBUG */
+static inline void __init clear_early_alloc_pfn_tag_refs(void) {}
+#endif /* CONFIG_MEM_ALLOC_PROFILING_DEBUG */
+
 static __init void init_page_alloc_tagging(void)
 {
+	clear_early_alloc_pfn_tag_refs();
 }
 
 struct page_ext_operations page_alloc_tagging_ops = {
-- 
cgit v1.2.3


From 744dd97752ef1076a8d8672bb0d8aa2c7abc1144 Mon Sep 17 00:00:00 2001
From: Alistair Popple
Date: Tue, 31 Mar 2026 17:34:43 +1100
Subject: lib: test_hmm: evict device pages on file close to avoid use-after-free

Patch series "Minor hmm_test fixes and cleanups".

Two bugfixes and a cleanup for the HMM kernel selftests. These were
mostly reported by Zenghui Yu, with special thanks to Lorenzo for
analysing and pointing out the problems.

This patch (of 3):

When dmirror_fops_release() is called it frees the dmirror struct but
doesn't migrate device private pages back to system memory first. This
leaves those pages with a dangling zone_device_data pointer to the freed
dmirror.

If a subsequent fault occurs on those pages (e.g. during coredump) the
dmirror_devmem_fault() callback dereferences the stale pointer, causing a
kernel panic.
This was reported [1] when running mm/ksft_hmm.sh on arm64, where a test
failure triggered SIGABRT and the resulting coredump walked the VMAs
faulting in the stale device private pages.

Fix this by calling dmirror_device_evict_chunk() for each devmem chunk in
dmirror_fops_release() to migrate all device private pages back to system
memory before freeing the dmirror struct. The function is moved earlier
in the file to avoid a forward declaration.

Link: https://lore.kernel.org/20260331063445.3551404-1-apopple@nvidia.com
Link: https://lore.kernel.org/20260331063445.3551404-2-apopple@nvidia.com
Fixes: b2ef9f5a5cb3 ("mm/hmm/test: add selftest driver for HMM")
Signed-off-by: Alistair Popple
Reported-by: Zenghui Yu
Closes: https://lore.kernel.org/linux-mm/8bd0396a-8997-4d2e-a13f-5aac033083d7@linux.dev/
Reviewed-by: Balbir Singh
Tested-by: Zenghui Yu
Cc: David Hildenbrand
Cc: Jason Gunthorpe
Cc: Leon Romanovsky
Cc: Liam Howlett
Cc: Lorenzo Stoakes (Oracle)
Cc: Michal Hocko
Cc: Mike Rapoport
Cc: Suren Baghdasaryan
Cc: Zenghui Yu
Cc: Matthew Brost
Cc:
Signed-off-by: Andrew Morton
---
 lib/test_hmm.c | 112 +++++++++++++++++++++++++++++++--------------------------
 1 file changed, 62 insertions(+), 50 deletions(-)

(limited to 'lib')

diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index 0964d53365e6..79fe7d233df1 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -185,11 +185,73 @@ static int dmirror_fops_open(struct inode *inode, struct file *filp)
 	return 0;
 }
 
+static void dmirror_device_evict_chunk(struct dmirror_chunk *chunk)
+{
+	unsigned long start_pfn = chunk->pagemap.range.start >> PAGE_SHIFT;
+	unsigned long end_pfn = chunk->pagemap.range.end >> PAGE_SHIFT;
+	unsigned long npages = end_pfn - start_pfn + 1;
+	unsigned long i;
+	unsigned long *src_pfns;
+	unsigned long *dst_pfns;
+	unsigned int order = 0;
+
+	src_pfns = kvcalloc(npages, sizeof(*src_pfns), GFP_KERNEL | __GFP_NOFAIL);
+	dst_pfns = kvcalloc(npages, sizeof(*dst_pfns), GFP_KERNEL | __GFP_NOFAIL);
+
+	migrate_device_range(src_pfns, start_pfn, npages);
+	for (i = 0; i < npages; i++) {
+		struct page *dpage, *spage;
+
+		spage = migrate_pfn_to_page(src_pfns[i]);
+		if (!spage || !(src_pfns[i] & MIGRATE_PFN_MIGRATE))
+			continue;
+
+		if (WARN_ON(!is_device_private_page(spage) &&
+			    !is_device_coherent_page(spage)))
+			continue;
+
+		order = folio_order(page_folio(spage));
+		spage = BACKING_PAGE(spage);
+		if (src_pfns[i] & MIGRATE_PFN_COMPOUND) {
+			dpage = folio_page(folio_alloc(GFP_HIGHUSER_MOVABLE,
+						       order), 0);
+		} else {
+			dpage = alloc_page(GFP_HIGHUSER_MOVABLE | __GFP_NOFAIL);
+			order = 0;
+		}
+
+		/* TODO Support splitting here */
+		lock_page(dpage);
+		dst_pfns[i] = migrate_pfn(page_to_pfn(dpage));
+		if (src_pfns[i] & MIGRATE_PFN_WRITE)
+			dst_pfns[i] |= MIGRATE_PFN_WRITE;
+		if (order)
+			dst_pfns[i] |= MIGRATE_PFN_COMPOUND;
+		folio_copy(page_folio(dpage), page_folio(spage));
+	}
+	migrate_device_pages(src_pfns, dst_pfns, npages);
+	migrate_device_finalize(src_pfns, dst_pfns, npages);
+	kvfree(src_pfns);
+	kvfree(dst_pfns);
+}
+
 static int dmirror_fops_release(struct inode *inode, struct file *filp)
 {
 	struct dmirror *dmirror = filp->private_data;
+	struct dmirror_device *mdevice = dmirror->mdevice;
+	int i;
 
 	mmu_interval_notifier_remove(&dmirror->notifier);
+
+	if (mdevice->devmem_chunks) {
+		for (i = 0; i < mdevice->devmem_count; i++) {
+			struct dmirror_chunk *devmem =
+				mdevice->devmem_chunks[i];
+
+			dmirror_device_evict_chunk(devmem);
+		}
+	}
+
 	xa_destroy(&dmirror->pt);
 	kfree(dmirror);
 	return 0;
@@ -1377,56 +1439,6 @@ static int dmirror_snapshot(struct dmirror *dmirror,
 	return ret;
 }
 
-static void dmirror_device_evict_chunk(struct dmirror_chunk *chunk)
-{
-	unsigned long start_pfn = chunk->pagemap.range.start >> PAGE_SHIFT;
-	unsigned long end_pfn = chunk->pagemap.range.end >> PAGE_SHIFT;
-	unsigned long npages = end_pfn - start_pfn + 1;
-	unsigned long i;
-	unsigned long *src_pfns;
-	unsigned long *dst_pfns;
-	unsigned int order = 0;
-
-	src_pfns = kvcalloc(npages, sizeof(*src_pfns), GFP_KERNEL | __GFP_NOFAIL);
-	dst_pfns = kvcalloc(npages, sizeof(*dst_pfns), GFP_KERNEL | __GFP_NOFAIL);
-
-	migrate_device_range(src_pfns, start_pfn, npages);
-	for (i = 0; i < npages; i++) {
-		struct page *dpage, *spage;
-
-		spage = migrate_pfn_to_page(src_pfns[i]);
-		if (!spage || !(src_pfns[i] & MIGRATE_PFN_MIGRATE))
-			continue;
-
-		if (WARN_ON(!is_device_private_page(spage) &&
-			    !is_device_coherent_page(spage)))
-			continue;
-
-		order = folio_order(page_folio(spage));
-		spage = BACKING_PAGE(spage);
-		if (src_pfns[i] & MIGRATE_PFN_COMPOUND) {
-			dpage = folio_page(folio_alloc(GFP_HIGHUSER_MOVABLE,
-						       order), 0);
-		} else {
-			dpage = alloc_page(GFP_HIGHUSER_MOVABLE | __GFP_NOFAIL);
-			order = 0;
-		}
-
-		/* TODO Support splitting here */
-		lock_page(dpage);
-		dst_pfns[i] = migrate_pfn(page_to_pfn(dpage));
-		if (src_pfns[i] & MIGRATE_PFN_WRITE)
-			dst_pfns[i] |= MIGRATE_PFN_WRITE;
-		if (order)
-			dst_pfns[i] |= MIGRATE_PFN_COMPOUND;
-		folio_copy(page_folio(dpage), page_folio(spage));
-	}
-	migrate_device_pages(src_pfns, dst_pfns, npages);
-	migrate_device_finalize(src_pfns, dst_pfns, npages);
-	kvfree(src_pfns);
-	kvfree(dst_pfns);
-}
-
 /* Removes free pages from the free list so they can't be re-allocated */
 static void dmirror_remove_free_pages(struct dmirror_chunk *devmem)
 {
-- 
cgit v1.2.3


From af69016dab967346f759016ca503ebc61dd048b5 Mon Sep 17 00:00:00 2001
From: Alistair Popple
Date: Tue, 31 Mar 2026 17:34:45 +1100
Subject: lib: test_hmm: implement a device release method

Unloading the HMM test module produces the following warning:

[ 3782.224783] ------------[ cut here ]------------
[ 3782.226323] Device 'hmm_dmirror0' does not have a release() function, it is broken and must be fixed. See Documentation/core-api/kobject.rst.
[ 3782.230570] WARNING: drivers/base/core.c:2567 at device_release+0x185/0x210, CPU#20: rmmod/1924
[ 3782.233949] Modules linked in: test_hmm(-) nvidia_uvm(O) nvidia(O)
[ 3782.236321] CPU: 20 UID: 0 PID: 1924 Comm: rmmod Tainted: G O 7.0.0-rc1+ #374 PREEMPT(full)
[ 3782.240226] Tainted: [O]=OOT_MODULE
[ 3782.241639] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.17.0-0-gb52ca86e094d-prebuilt.qemu.org 04/01/2014
[ 3782.246193] RIP: 0010:device_release+0x185/0x210
[ 3782.247860] Code: 00 00 fc ff df 48 8d 7b 50 48 89 fa 48 c1 ea 03 80 3c 02 00 0f 85 86 00 00 00 48 8b 73 50 48 85 f6 74 11 48 8d 3d db 25 29 03 <67> 48 0f b9 3a e9 0d ff ff ff 48 b8 00 00 00 00 00 fc ff df 48 89
[ 3782.254211] RSP: 0018:ffff888126577d98 EFLAGS: 00010246
[ 3782.256054] RAX: dffffc0000000000 RBX: ffffffffc2b70310 RCX: ffffffff8fe61ba1
[ 3782.258512] RDX: 1ffffffff856e062 RSI: ffff88811341eea0 RDI: ffffffff91bbacb0
[ 3782.261041] RBP: ffff888111475000 R08: 0000000000000001 R09: fffffbfff856e069
[ 3782.263471] R10: ffffffffc2b7034b R11: 00000000ffffffff R12: 0000000000000000
[ 3782.265983] R13: dffffc0000000000 R14: ffff88811341eea0 R15: 0000000000000000
[ 3782.268443] FS: 00007fd5a3689040(0000) GS:ffff88842c8d0000(0000) knlGS:0000000000000000
[ 3782.271236] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3782.273251] CR2: 00007fd5a36d2c10 CR3: 00000001242b8000 CR4: 00000000000006f0
[ 3782.275362] Call Trace:
[ 3782.276071] <TASK>
[ 3782.276678] kobject_put+0x146/0x270
[ 3782.277731] hmm_dmirror_exit+0x7a/0x130 [test_hmm]
[ 3782.279135] __do_sys_delete_module+0x341/0x510
[ 3782.280438] ? module_flags+0x300/0x300
[ 3782.281547] do_syscall_64+0x111/0x670
[ 3782.282620] entry_SYSCALL_64_after_hwframe+0x4b/0x53
[ 3782.284091] RIP: 0033:0x7fd5a3793b37
[ 3782.285303] Code: 73 01 c3 48 8b 0d c9 82 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 b8 b0 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 99 82 0c 00 f7 d8 64 89 01 48
[ 3782.290708] RSP: 002b:00007ffd68b7dc68 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
[ 3782.292817] RAX: ffffffffffffffda RBX: 000055e3c0d1c770 RCX: 00007fd5a3793b37
[ 3782.294735] RDX: 0000000000000000 RSI: 0000000000000800 RDI: 000055e3c0d1c7d8
[ 3782.296661] RBP: 0000000000000000 R08: 1999999999999999 R09: 0000000000000000
[ 3782.298622] R10: 00007fd5a3806ac0 R11: 0000000000000206 R12: 00007ffd68b7deb0
[ 3782.300576] R13: 00007ffd68b7e781 R14: 000055e3c0d1b2a0 R15: 00007ffd68b7deb8
[ 3782.301963] </TASK>
[ 3782.302371] irq event stamp: 5019
[ 3782.302987] hardirqs last enabled at (5027): [] __up_console_sem+0x52/0x60
[ 3782.304507] hardirqs last disabled at (5036): [] __up_console_sem+0x37/0x60
[ 3782.306086] softirqs last enabled at (4940): [] __irq_exit_rcu+0xc0/0xf0
[ 3782.307567] softirqs last disabled at (4929): [] __irq_exit_rcu+0xc0/0xf0
[ 3782.309105] ---[ end trace 0000000000000000 ]---

This is because the test module doesn't have a device.release method. In
this case one probably isn't needed for correctness - the device structs
are in a static array so don't need freeing when the final reference goes
away.

However some device state is freed on exit, so to ensure this happens at
the right time and to silence the warning move the deinitialisation to a
release method and assign that as the device release callback. Whilst
here also fix a minor error handling bug where cdev_device_del() wasn't
being called if allocation failed.

Link: https://lore.kernel.org/20260331063445.3551404-4-apopple@nvidia.com
Fixes: 6a760f58c792 ("mm/hmm/test: use char dev with struct device to get device node")
Signed-off-by: Alistair Popple
Acked-by: Balbir Singh
Tested-by: Zenghui Yu (Huawei)
Cc: David Hildenbrand
Cc: Jason Gunthorpe
Cc: Leon Romanovsky
Cc: Liam Howlett
Cc: Lorenzo Stoakes (Oracle)
Cc: Michal Hocko
Cc: Mike Rapoport
Cc: Suren Baghdasaryan
Cc: Matthew Brost
Cc:
Signed-off-by: Andrew Morton
---
 lib/test_hmm.c | 18 +++++++++++++++---
 1 file changed, 15 insertions(+), 3 deletions(-)

(limited to 'lib')

diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index 79fe7d233df1..213504915737 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -1738,6 +1738,13 @@ static const struct dev_pagemap_ops dmirror_devmem_ops = {
 	.folio_split = dmirror_devmem_folio_split,
 };
 
+static void dmirror_device_release(struct device *dev)
+{
+	struct dmirror_device *mdevice = container_of(dev, struct dmirror_device, device);
+
+	dmirror_device_remove_chunks(mdevice);
+}
+
 static int dmirror_device_init(struct dmirror_device *mdevice, int id)
 {
 	dev_t dev;
@@ -1749,6 +1756,8 @@ static int dmirror_device_init(struct dmirror_device *mdevice, int id)
 
 	cdev_init(&mdevice->cdevice, &dmirror_fops);
 	mdevice->cdevice.owner = THIS_MODULE;
+	mdevice->device.release = dmirror_device_release;
+
 	device_initialize(&mdevice->device);
 	mdevice->device.devt = dev;
 
@@ -1756,12 +1765,16 @@ static int dmirror_device_init(struct dmirror_device *mdevice, int id)
 	if (ret)
 		goto put_device;
 
+	/* Build a list of free ZONE_DEVICE struct pages */
+	ret = dmirror_allocate_chunk(mdevice, NULL, false);
+	if (ret)
+		goto put_device;
+
 	ret = cdev_device_add(&mdevice->cdevice, &mdevice->device);
 	if (ret)
 		goto put_device;
 
-	/* Build a list of free ZONE_DEVICE struct pages */
-	return dmirror_allocate_chunk(mdevice, NULL, false);
+	return 0;
 
 put_device:
 	put_device(&mdevice->device);
@@ -1770,7 +1783,6 @@ put_device:
 
 static void dmirror_device_remove(struct dmirror_device *mdevice)
 {
-	dmirror_device_remove_chunks(mdevice);
 	cdev_device_del(&mdevice->cdevice, &mdevice->device);
 	put_device(&mdevice->device);
 }
-- 
cgit v1.2.3
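
Reviewer note on the last patch: the following is a minimal userspace
sketch of the release-callback pattern it adopts, not the driver core's
implementation. Every toy_* name and the simplified container_of() macro
are hypothetical stand-ins introduced for illustration only. The sketch
assumes a plain integer refcount with no locking; the point is that
teardown of the embedded state runs when the last reference is dropped,
rather than when the remove path is entered.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for the kernel's container_of(). */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Toy refcounted object with a release callback (loosely like struct device). */
struct toy_device {
	int refcount;
	void (*release)(struct toy_device *dev);
};

/* Hypothetical driver state that embeds the device (like struct dmirror_device). */
struct toy_mdevice {
	int chunks;			/* stands in for per-device chunk state */
	struct toy_device device;
};

/* Start with one reference held by the creator. */
static void toy_device_initialize(struct toy_device *dev)
{
	dev->refcount = 1;
}

/* Take an additional reference. */
static void toy_get_device(struct toy_device *dev)
{
	dev->refcount++;
}

/* Drop a reference; the release callback runs only on the final put. */
static void toy_put_device(struct toy_device *dev)
{
	assert(dev->refcount > 0);
	if (--dev->refcount == 0 && dev->release)
		dev->release(dev);
}

/* Release callback: recover the outer struct and tear down its state. */
static void toy_mdevice_release(struct toy_device *dev)
{
	struct toy_mdevice *m = container_of(dev, struct toy_mdevice, device);

	m->chunks = 0;
}
```

With a second holder still present, dropping one reference leaves the
chunk state untouched; only the final toy_put_device() call invokes the
release callback, which is the ordering the patch relies on.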