summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2025-11-29rqspinlock: Precede non-head waiter queueing with AA checkKumar Kartikeya Dwivedi
While previous commits sufficiently address the deadlocks, there are still scenarios where queueing of waiters in NMIs can exacerbate the possibility of timeouts. Consider the case below: CPU 0 <NMI> res_spin_lock(A) -> becomes non-head waiter </NMI> lock owner in CS or pending waiter spinning CPU 1 res_spin_lock(A) -> head waiter spinning on owner/pending bits In such a scenario, the non-head waiter in NMI on CPU 0 will not poll for deadlocks or timeout since it will simply queue behind previous waiter (head on CPU 1), and also not enter the trylock fallback since no rqspinlock queue waiter is active on CPU 0. In such a scenario, the transaction initiated by the head waiter on CPU 1 will timeout, signalling the NMI and ending the cyclic dependency, but it will cost 250 ms of time. Instead, the NMI on CPU 0 could simply check for the presence of an AA deadlock and only proceed with queueing on success. Add such a check right before any form of queueing is initiated. The reason the AA deadlock check is not used in conjunction with in_nmi() is that a similar case could occur due to a reentrant path in the owner's critical section, and unconditionally checking for AA before entering the queueing path avoids expensive timeouts. Non-NMI reentrancy only happens at controlled points in the slow path (with specific tracepoints which do not impede the forward progress of a waiter loop), or in the owner CS, while NMIs can land anywhere. While this check is only needed for non-head waiter queueing, checking whether we are head or not is racy without xchg_tail, and after that point, we are already queued, hence for simplicity we must invoke the check unconditionally. Note that a more contrived case could still be constructed by using two locks, and interrupting the progress of the respective owners by non-head waiters of the other lock, in an ABBA fashion, which would still not be covered by the current set of checks and conditions. It would still lead to a timeout though, and not a deadlock. An ABBA check cannot happen optimistically before the queueing, since it can be racy, and needs to be happen continuously during the waiting period, which would then require an unlinking step for queued NMI/reentrant waiters. This is beyond the scope of this patch. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20251128232802.1031906-6-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-11-29rqspinlock: Disable spinning for trylock fallbackKumar Kartikeya Dwivedi
The original trylock fallback was inherited from qspinlock, and then reused for the reentrant NMIs while the slow path is active. However, under contention, it is very unlikely for the trylock to succeed in taking the lock. In addition, a trylock also has no fairness guarantees, and thus is prone to starvation issues under extreme scenarios. The original qspinlock had no choice in terms of returning an error the caller; if the node count was breached, it had to fall back to trylock to attempt to take the lock. In case of rqspinlock, we do have the option of returning to the user. Thus, simply attempt the trylock once, and instead of spinning, return an error in case the lock cannot be taken. This ends up significantly reducing the time spent in the trylock fallback, since we no longer wait for the timeout duration trying to aimlessly acquire the lock when there's a high-probability that under contention, it won't be available to us anyway. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20251128232802.1031906-5-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-11-29rqspinlock: Use trylock fallback when per-CPU rqnode is busyKumar Kartikeya Dwivedi
In addition to deferring to the trylock fallback in NMIs, only do so when an rqspinlock waiter is queued on the current CPU. This is detected by noticing a non-zero node index. This allows NMI waiters to join the waiter queue if it isn't interrupting an existing rqspinlock waiter, and increase the chances of fairly obtaining the lock, performing deadlock detection as the head, and not being starved while attempting the trylock. The trylock path in particular is unlikely to succeed under contention, as it relies on the lock word becoming 0, which indicates no contention. This means that the most likely result for NMIs attempting a trylock is a timeout under contention if they don't hit an AA or ABBA case. The core problem being addressed through the fixed commit was removing the dependency edge between an NMI queue waiter and the queue waiter it is interrupting. Whenever a circular dependency forms, and with no way to break it (as non-head waiters don't poll for deadlocks or timeouts), we would enter into a deadlock. A trylock either breaks such an edge by probing for deadlocks, and finally terminating the waiting loop using a timeout. By excluding queueing on CPUs where the node index is non-zero for NMIs, this sort of dependency is broken. The CPU enters the trylock path for those cases, and falls back to deadlock checks and timeouts. However, in other case where it doesn't interrupt the CPU in the slow path while its queued on the lock, it can join the queue as a normal waiter, and avoid trylock associated starvation and subsequent timeouts. There are a few remaining cases here that matter: the NMI can still preempt the owner in its critical section, and if it queues as a non-head waiter, it can end up impeding the progress of the owner. While this won't deadlock, since the head waiter will eventually signal the NMI waiter to either stop (due to a timeout), it can still lead to long timeouts. These gaps will be addressed in subsequent commits. Note that while the node count detection approach is less conservative than simply deferring NMIs to trylock, it is going to return errors where attempts to lock B in NMI happen while waiters for lock A are in a lower context on the same CPU. However, this only occurs when the lower context is queued in the slow path, and the NMI attempt can proceed without failure in all other cases. To continue to prevent AA deadlocks (or ABBA in a similar NMI interrupting lower context pattern), we'd need a more fleshed out algorithm to unlink NMI waiters after they queue and detect such cases. However, all that complexity isn't appealing yet to reduce the failure rate in the small window inside the slow path. It is important to note that reentrancy in the slow path can also happen through trace_contention_{begin,end}, but in those cases, unlike an NMI, the forward progress of the head waiter (or the predecessor in general) is not being blocked. Fixes: 0d80e7f951be ("rqspinlock: Choose trylock fallback for NMI waiters") Reported-by: Ritesh Oedayrajsingh Varma <ritesh@superluminal.eu> Suggested-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20251128232802.1031906-4-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-11-29rqspinlock: Perform AA checks immediatelyKumar Kartikeya Dwivedi
Currently, while we enter the check_timeout call immediately due to the way the ts.spin is initialized, we still invoke the AA and ABBA checks in the second invocation, and only initialize the timestamp in the first one. Since each iteration is at least done with a 1ms delay, this can add delays in detection of AA deadlocks, up to a ms. Rework check_timeout() to avoid this. First, call check_deadlock_AA() while initializing the timestamps for the wait period. This also means that we only do it once per waiting period, instead of every invocation. Finally, drop check_deadlock() and call check_deadlock_ABBA() directly. To save on unnecessary ktime_get_mono_fast_ns() in case of AA deadlock, sample the time only if it returns 0. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20251128232802.1031906-3-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-11-29rqspinlock: Enclose lock/unlock within lock entry acquisitionsKumar Kartikeya Dwivedi
Ritesh reported that timeouts occurred frequently for rqspinlock despite reentrancy on the same lock on the same CPU in [0]. This patch closes one of the races leading to this behavior, and reduces the frequency of timeouts. We currently have a tiny window between the fast-path cmpxchg and the grabbing of the lock entry where an NMI could land, attempt the same lock that was just acquired, and end up timing out. This is not ideal. Instead, move the lock entry acquisition from the fast path to before the cmpxchg, and remove the grabbing of the lock entry in the slow path, assuming it was already taken by the fast path. The TAS fallback is invoked directly without being preceded by the typical fast path, therefore we must continue to grab the deadlock detection entry in that case. Case on lock leading to missed AA: cmpxchg lock A <NMI> ... rqspinlock acquisition of A ... timeout </NMI> grab_held_lock_entry(A) There is a similar case when unlocking the lock. If the NMI lands between the WRITE_ONCE and smp_store_release, it is possible that we end up in a situation where the NMI fails to diagnose the AA condition, leading to a timeout. Case on unlock leading to missed AA: WRITE_ONCE(rqh->locks[rqh->cnt - 1], NULL) <NMI> ... rqspinlock acquisition of A ... timeout </NMI> smp_store_release(A->locked, 0) The patch changes the order on unlock to smp_store_release() succeeded by WRITE_ONCE() of NULL. This avoids the missed AA detection described above, but may lead to a false positive if the NMI lands between these two statements, which is acceptable (and preferred over a timeout). The original intention of the reverse order on unlock was to prevent the following possible misdiagnosis of an ABBA scenario: grab entry A lock A grab entry B lock B unlock B smp_store_release(B->locked, 0) grab entry B lock B grab entry A lock A ! <detect ABBA> WRITE_ONCE(rqh->locks[rqh->cnt - 1], NULL) If the store release were is after the WRITE_ONCE, the other CPU would not observe B in the table of the CPU unlocking the lock B. However, since the threads are obviously participating in an ABBA deadlock, it is no longer appealing to use the order above since it may lead to a 250 ms timeout due to missed AA detection. [0]: https://lore.kernel.org/bpf/CAH6OuBTjG+N=+GGwcpOUbeDN563oz4iVcU3rbse68egp9wj9_A@mail.gmail.com Fixes: 0d80e7f951be ("rqspinlock: Choose trylock fallback for NMI waiters") Reported-by: Ritesh Oedayrajsingh Varma <ritesh@superluminal.eu> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20251128232802.1031906-2-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-11-28bpf: Disable file_alloc_security hookAmery Hung
A use-after-free bug may be triggered by calling bpf_inode_storage_get() in a BPF LSM program hooked to file_alloc_security. Disable the hook to prevent this from happening. The cause of the bug is shown in the trace below. In alloc_file(), a file struct is first allocated through kmem_cache_alloc(). Then, file_alloc_security hook is invoked. Since the zero initialization or assignment of f->f_inode happen after this LSM hook, a BPF program may get a dangeld inode pointer by walking the file struct. alloc_file() -> alloc_empty_file() -> f = kmem_cache_alloc() -> init_file() -> security_file_alloc() // f->f_inode not init-ed yet! -> f->f_inode = NULL; -> file_init_path() -> f->f_inode = path->dentry->d_inode Reported-by: Kaiyan Mei <M202472210@hust.edu.cn> Reported-by: Yinhao Hu <dddddd@hust.edu.cn> Reported-by: Dongliang Mu <dzm91@hust.edu.cn> Closes: https://lore.kernel.org/bpf/1d2d1968.47cd3.19ab9528e94.Coremail.kaiyanm@hust.edu.cn/ Signed-off-by: Amery Hung <ameryhung@gmail.com> Link: https://lore.kernel.org/r/20251126202927.2584874-1-ameryhung@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-11-28bpf: check for insn arrays in check_ptr_alignmentAnton Protopopov
Do not abuse the strict_alignment_once flag, and check if the map is an instruction array inside the check_ptr_alignment() function. Suggested-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com> Link: https://lore.kernel.org/r/20251128063224.1305482-3-a.s.protopopov@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-11-28bpf: force BPF_F_RDONLY_PROG on insn array creationAnton Protopopov
The original implementation added a hack to check_mem_access() to prevent programs from writing into insn arrays. To get rid of this hack, enforce BPF_F_RDONLY_PROG on map creation. Also fix the corresponding selftest, as the error message changes with this patch. Suggested-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com> Link: https://lore.kernel.org/r/20251128063224.1305482-2-a.s.protopopov@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-11-28Merge branch 'rcu/refscale' into nextFrederic Weisbecker
Add performance testing for common context synchronizations (Preemption, IRQ, Softirq) and per-cpu increments. Those are relevant comparisons against SRCU-fast read side APIs, especially as they are planned to synchronize further tracing fast-path code.
2025-11-28Merge branches 'pm-qos' and 'pm-tools'Rafael J. Wysocki
Merge PM QoS updates and a cpupower utility update for 6.19-rc1: - Introduce and document a QoS limit on CPU exit latency during wakeup from suspend-to-idle (Ulf Hansson) - Add support for building libcpupower statically (Zuo An) * pm-qos: Documentation: power/cpuidle: Document the CPU system wakeup latency QoS cpuidle: Respect the CPU system wakeup QoS limit for cpuidle sched: idle: Respect the CPU system wakeup QoS limit for s2idle pmdomain: Respect the CPU system wakeup QoS limit for cpuidle pmdomain: Respect the CPU system wakeup QoS limit for s2idle PM: QoS: Introduce a CPU system wakeup QoS limit * pm-tools: tools/power/cpupower: Support building libcpupower statically
2025-11-28Merge branches 'pm-em' and 'pm-opp'Rafael J. Wysocki
Merge energy model management updates and operating performance points (OPP) library changes for 6.19-rc1: - Add support for sending netlink notifications to user space on energy model updates (Changwoo Mini, Peng Fan) - Minor improvements to the Rust OPP interface (Tamir Duberstein) - Fixes to scope-based pointers in the OPP library (Viresh Kumar) * pm-em: PM: EM: Add to em_pd_list only when no failure PM: EM: Notify an event when the performance domain changes PM: EM: Implement em_notify_pd_created/updated() PM: EM: Implement em_notify_pd_deleted() PM: EM: Implement em_nl_get_pd_table_doit() PM: EM: Implement em_nl_get_pds_doit() PM: EM: Add an iterator and accessor for the performance domain PM: EM: Add a skeleton code for netlink notification PM: EM: Add em.yaml and autogen files PM: EM: Expose the ID of a performance domain via debugfs PM: EM: Assign a unique ID when creating a performance domain * pm-opp: rust: opp: simplify callers of `to_c_str_array` OPP: Initialize scope-based pointers inline rust: opp: fix broken rustdoc link
2025-11-28Merge branch 'pm-sleep'Rafael J. Wysocki
Merge updates related to system suspend and hibernation for 6.19-rc1: - Replace snprintf() with scnprintf() in show_trace_dev_match() (Kaushlendra Kumar) - Fix memory allocation error handling in pm_vt_switch_required() (Malaya Kumar Rout) - Introduce CALL_PM_OP() macro and use it to simplify code in generic PM operations (Kaushlendra Kumar) - Add module param to backtrace all CPUs in the device power management watchdog (Sergey Senozhatsky) - Rework message printing in swsusp_save() (Rafael Wysocki) - Make it possible to change the number of hibernation compression threads (Xueqin Luo) - Clarify that only cgroup1 freezer uses PM freezer (Tejun Heo) - Add document on debugging shutdown hangs to PM documentation and correct a mistaken configuration option in it (Mario Limonciello) - Shut down wakeup source timer before removing the wakeup source from the list (Kaushlendra Kumar, Rafael Wysocki) - Introduce new PMSG_POWEROFF event for system shutdown handling with the help of PM device callbacks (Mario Limonciello) - Make pm_test delay interruptible by wakeup events (Riwen Lu) - Clean up kernel-doc comment style usage in the core hibernation code and remove unuseful comments from it (Sunday Adelodun, Rafael Wysocki) - Add support for handling wakeup events and aborting the suspend process while it is syncing file systems (Samuel Wu, Rafael Wysocki) * pm-sleep: (21 commits) PM: hibernate: Extra cleanup of comments in swap handling code PM: sleep: Call pm_sleep_fs_sync() instead of ksys_sync_helper() PM: sleep: Add support for wakeup during filesystem sync PM: hibernate: Clean up kernel-doc comment style usage PM: suspend: Make pm_test delay interruptible by wakeup events usb: sl811-hcd: Add PM_EVENT_POWEROFF into suspend callbacks scsi: Add PM_EVENT_POWEROFF into suspend callbacks PM: Introduce new PMSG_POWEROFF event PM: wakeup: Update after recent wakeup source removal ordering change PM: wakeup: Delete timer before removing wakeup source from list Documentation: power: Correct a mistaken configuration option Documentation: power: Add document on debugging shutdown hangs freezer: Clarify that only cgroup1 freezer uses PM freezer PM: hibernate: add sysfs interface for hibernate_compression_threads PM: hibernate: make compression threads configurable PM: hibernate: dynamically allocate crc->unc_len/unc for configurable threads PM: hibernate: Rework message printing in swsusp_save() PM: dpm_watchdog: add module param to backtrace all CPUs PM: sleep: Introduce CALL_PM_OP() macro to simplify code PM: console: Fix memory allocation error handling in pm_vt_switch_required() ...
2025-11-28Merge branches 'pm-core' and 'pm-runtime'Rafael J. Wysocki
Merge a core power management update and runtime PM framework updates for 6.19-rc1: - Add WQ_UNBOUND to pm_wq workqueue (Marco Crivellari) - Add runtime PM wrapper macros for ACQUIRE()/ACQUIRE_ERR() and use them in the PCI core and the ACPI TAD driver (Rafael Wysocki) - Improve runtime PM in the ACPI TAD driver (Rafael Wysocki) - Update pm_runtime_allow/forbid() documentation (Rafael Wysocki) - Fix typos in runtime.c comments (Malaya Kumar Rout) * pm-core: PM: WQ_UNBOUND added to pm_wq workqueue * pm-runtime: PCI/sysfs: Use PM_RUNTIME_ACQUIRE()/PM_RUNTIME_ACQUIRE_ERR() ACPI: TAD: Use PM_RUNTIME_ACQUIRE()/PM_RUNTIME_ACQUIRE_ERR() PM: runtime: Wrapper macros for ACQUIRE()/ACQUIRE_ERR() PM: runtime: fix typos in runtime.c comments ACPI: TAD: Improve runtime PM using guard macros ACPI: TAD: Rearrange runtime PM operations in acpi_tad_remove() PM: runtime: docs: Update pm_runtime_allow/forbid() documentation
2025-11-28refscale: Add SRCU-fast-updown readersPaul E. McKenney
This commit adds refscale readers based on srcu_read_lock_fast_updown() and srcu_read_unlock_fast_updown() ("refscale.scale_type=srcu-fast-updown"). On my x86 laptop, these are about 2.2ns per pair. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: <bpf@vger.kernel.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
2025-11-28bpf: convert bpf_token_create() to FD_PREPARE()Christian Brauner
Link: https://patch.msgid.link/20251123-work-fd-prepare-v4-24-b6efa1706cfd@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-28bpf: convert bpf_iter_new_fd() to FD_PREPARE()Christian Brauner
Link: https://patch.msgid.link/20251123-work-fd-prepare-v4-23-b6efa1706cfd@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-27Merge tag 'dma-mapping-6.18-2025-11-27' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux Pull dma-mapping fixes from Marek Szyprowski: "Two last minute fixes for the recently modified DMA API infrastructure: - proper handling of DMA_ATTR_MMIO in dma_iova_unlink() function (me) - regression fix for the code refactoring related to P2PDMA (Pranjal Shrivastava)" * tag 'dma-mapping-6.18-2025-11-27' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux: dma-direct: Fix missing sg_dma_len assignment in P2PDMA bus mappings iommu/dma: add missing support for DMA_ATTR_MMIO for dma_iova_unlink()
2025-11-27overflow: Introduce struct_offset() to get offset of memberSteven Rostedt
The trace_marker_raw file in tracefs takes a buffer from user space that contains an id as well as a raw data string which is usually a binary structure. The structure used has the following: struct raw_data_entry { struct trace_entry ent; unsigned int id; char buf[]; }; Since the passed in "cnt" variable is both the size of buf as well as the size of id, the code to allocate the location on the ring buffer had: size = struct_size(entry, buf, cnt - sizeof(entry->id)); Which is quite ugly and hard to understand. Instead, add a helper macro called struct_offset() which then changes the above to a simple and easy to understand: size = struct_offset(entry, id) + cnt; This will likely come in handy for other use cases too. Link: https://lore.kernel.org/all/CAHk-=whYZVoEdfO1PmtbirPdBMTV9Nxt9f09CK0k6S+HJD3Zmg@mail.gmail.com/ Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: "Gustavo A. R. Silva" <gustavoars@kernel.org> Link: https://patch.msgid.link/20251126145249.05b1770a@gandalf.local.home Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Reviewed-by: Kees Cook <kees@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-11-27Reinstate "resource: avoid unnecessary lookups in find_next_iomem_res()"Ilias Stamatis
Commit 97523a4edb7b ("kernel/resource: remove first_lvl / siblings_only logic") removed an optimization introduced by commit 756398750e11 ("resource: avoid unnecessary lookups in find_next_iomem_res()"). That was not called out in the message of the first commit explicitly so it's not entirely clear whether removing the optimization happened inadvertently or not. As the original commit message of the optimization explains there is no point considering the children of a subtree in find_next_iomem_res() if the top level range does not match. Reinstating the optimization results in performance improvements in systems where /proc/iomem is ~5k lines long. Calling mmap() on /dev/mem in such platforms takes 700-1500μs without the optimisation and 10-50μs with the optimisation. Note that even though commit 97523a4edb7b removed the 'sibling_only' parameter from next_resource(), newer kernels have basically reinstated it under the name 'skip_children'. Link: https://lore.kernel.org/all/20251124165349.3377826-1-ilstam@amazon.com/T/#u Fixes: 97523a4edb7b ("kernel/resource: remove first_lvl / siblings_only logic") Signed-off-by: Ilias Stamatis <ilstam@amazon.com> Acked-by: David Hildenbrand (Red Hat) <david@kernel.org> Cc: Andriy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Baoquan He <bhe@redhat.com> Cc: "Huang, Ying" <huang.ying.caritas@gmail.com> Cc: Nadav Amit <nadav.amit@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-27vmcoreinfo: track and log recoverable hardware errorsBreno Leitao
Introduce a generic infrastructure for tracking recoverable hardware errors (HW errors that are visible to the OS but does not cause a panic) and record them for vmcore consumption. This aids post-mortem crash analysis tools by preserving a count and timestamp for the last occurrence of such errors. On the other side, correctable errors, which the OS typically remains unaware of because the underlying hardware handles them transparently, are less relevant for crash dump and therefore are NOT tracked in this infrastructure. Add centralized logging for sources of recoverable hardware errors based on the subsystem it has been notified. hwerror_data is write-only at kernel runtime, and it is meant to be read from vmcore using tools like crash/drgn. For example, this is how it looks like when opening the crashdump from drgn. >>> prog['hwerror_data'] (struct hwerror_info[1]){ { .count = (int)844, .timestamp = (time64_t)1752852018, }, ... This helps fleet operators quickly triage whether a crash may be influenced by hardware recoverable errors (which executes a uncommon code path in the kernel), especially when recoverable errors occurred shortly before a panic, such as the bug fixed by commit ee62ce7a1d90 ("page_pool: Track DMA-mapped pages and unmap them when destroying the pool") This is not intended to replace full hardware diagnostics but provides a fast way to correlate hardware events with kernel panics quickly. Rare machine check exceptions—like those indicated by mce_flags.p5 or mce_flags.winchip—are not accounted for in this method, as they fall outside the intended usage scope for this feature's user base. [leitao@debian.org: add hw-recoverable-errors to toctree] Link: https://lkml.kernel.org/r/20251127-vmcoreinfo_fix-v1-1-26f5b1c43da9@debian.org Link: https://lkml.kernel.org/r/20251010-vmcore_hw_error-v5-1-636ede3efe44@debian.org Signed-off-by: Breno Leitao <leitao@debian.org> Suggested-by: Tony Luck <tony.luck@intel.com> Suggested-by: Shuai Xue <xueshuai@linux.alibaba.com> Reviewed-by: Shuai Xue <xueshuai@linux.alibaba.com> Reviewed-by: Hanjun Guo <guohanjun@huawei.com> [APEI] Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Bob Moore <robert.moore@intel.com> Cc: Borislav Betkov <bp@alien8.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Morse <james.morse@arm.com> Cc: Konrad Rzessutek Wilk <konrad.wilk@oracle.com> Cc: Len Brown <lenb@kernel.org> Cc: Mahesh Salgaonkar <mahesh@linux.ibm.com> Cc: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: "Oliver O'Halloran" <oohall@gmail.com> Cc: Omar Sandoval <osandov@osandov.com> Cc: Thomas Gleinxer <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-27kho: fix restoring of contiguous ranges of order-0 pagesMike Rapoport (Microsoft)
When contiguous ranges of order-0 pages are restored, kho_restore_page() calls prep_compound_page() with the first page in the range and order as parameters and then kho_restore_pages() calls split_page() to make sure all pages in the range are order-0. However, since split_page() is not intended to split compound pages and with VM_DEBUG enabled it will trigger a VM_BUG_ON_PAGE(). Update kho_restore_page() so that it will use prep_compound_page() when it restores a folio and make sure it properly sets page count for both large folios and ranges of order-0 pages. Link: https://lkml.kernel.org/r/20251125110917.843744-3-rppt@kernel.org Fixes: a667300bd53f ("kho: add support for preserving vmalloc allocations") Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reported-by: Pratyush Yadav <pratyush@kernel.org> Cc: Alexander Graf <graf@amazon.com> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-27kho: kho_restore_vmalloc: fix initialization of pages arrayMike Rapoport (Microsoft)
Patch series "kho: fixes for vmalloc restoration". Pratyush reported off-list that when kho_restore_vmalloc() is used to restore a vmalloc_huge() allocation it hits VM_BUG_ON() when we reconstruct the struct pages in kho_restore_pages(). These patches fix the issue. This patch (of 2): In case a preserved vmalloc allocation was using huge pages, all pages in the array of pages added to vm_struct during kho_restore_vmalloc() are wrongly set to the same page. Fix the indexing when assigning pages to that array. Link: https://lkml.kernel.org/r/20251125110917.843744-1-rppt@kernel.org Link: https://lkml.kernel.org/r/20251125110917.843744-2-rppt@kernel.org Fixes: a667300bd53f ("kho: add support for preserving vmalloc allocations") Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Pratyush Yadav <pratyush@kernel.org> Cc: Alexander Graf <graf@amazon.com> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-27KHO: fix boot failure due to kmemleak access to non-PRESENT pagesRan Xiaokai
When booting with debug_pagealloc=on while having: CONFIG_KEXEC_HANDOVER_ENABLE_DEFAULT=y CONFIG_DEBUG_KMEMLEAK_DEFAULT_OFF=n the system fails to boot due to page faults during kmemleak scanning. This occurs because: With debug_pagealloc is enabled, __free_pages() invokes debug_pagealloc_unmap_pages(), clearing the _PAGE_PRESENT bit for freed pages in the kernel page table. KHO scratch areas are allocated from memblock and noted by kmemleak. But these areas don't remain reserved but released later to the page allocator using init_cma_reserved_pageblock(). This causes subsequent kmemleak scans access non-PRESENT pages, leading to fatal page faults. Mark scratch areas with kmemleak_ignore_phys() after they are allocated from memblock to exclude them from kmemleak scanning before they are released to buddy allocator to fix this. [ran.xiaokai@zte.com.cn: add comment] Link: https://lkml.kernel.org/r/20251127122700.103927-1-ranxiaokai627@163.com Link: https://lkml.kernel.org/r/20251122182929.92634-1-ranxiaokai627@163.com Signed-off-by: Ran Xiaokai <ran.xiaokai@zte.com.cn> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Pratyush Yadav <pratyush@kernel.org> Cc: Alexander Graf <graf@amazon.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Changyuan Lyu <changyuanl@google.com> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-27kexec: move sysfs entries to /sys/kernel/kexecSourabh Jain
Patch series "kexec: reorganize kexec and kdump sysfs", v6. All existing kexec and kdump sysfs entries are moved to a new location, /sys/kernel/kexec, to keep /sys/kernel/ clean and better organized. Symlinks are created at the old locations for backward compatibility and can be removed in the future [01/03]. While doing this cleanup, the old kexec and kdump sysfs entries are marked as deprecated in the existing ABI documentation [02/03]. This makes it clear that these older interfaces should no longer be used. New ABI documentation is added to describe the reorganized interfaces [03/03], so users and tools can rely on the updated sysfs interfaces going forward. This patch (of 3): Several kexec and kdump sysfs entries are currently placed directly under /sys/kernel/, which clutters the directory and makes it harder to identify unrelated entries. To improve organization and readability, these entries are now moved under a dedicated directory, /sys/kernel/kexec. The following sysfs moved under new kexec sysfs node +------------------------------------+------------------+ | Old sysfs name | New sysfs name | | (under /sys/kernel) | (under /sys/kernel/kexec) | +---------------------------+---------------------------+ | kexec_loaded | loaded | +---------------------------+---------------------------+ | kexec_crash_loaded | crash_loaded | +---------------------------+---------------------------+ | kexec_crash_size | crash_size | +---------------------------+---------------------------+ | crash_elfcorehdr_size | crash_elfcorehdr_size | +---------------------------+---------------------------+ | kexec_crash_cma_ranges | crash_cma_ranges | +---------------------------+---------------------------+ For backward compatibility, symlinks are created at the old locations so that existing tools and scripts continue to work. These symlinks can be removed in the future once users have switched to the new path. While creating symlinks, entries are added in /sys/kernel/ that point to their new locations under /sys/kernel/kexec/. If an error occurs while adding a symlink, it is logged but does not stop initialization of the remaining kexec sysfs symlinks. The /sys/kernel/<crash_elfcorehdr_size | kexec/crash_elfcorehdr_size> entry is now controlled by CONFIG_CRASH_DUMP instead of CONFIG_VMCORE_INFO, as CONFIG_CRASH_DUMP also enables CONFIG_VMCORE_INFO. Link: https://lkml.kernel.org/r/20251118114507.1769455-1-sourabhjain@linux.ibm.com Link: https://lkml.kernel.org/r/20251118114507.1769455-2-sourabhjain@linux.ibm.com Signed-off-by: Sourabh Jain <sourabhjain@linux.ibm.com> Acked-by: Baoquan He <bhe@redhat.com> Cc: Aditya Gupta <adityag@linux.ibm.com> Cc: Dave Young <dyoung@redhat.com> Cc: Hari Bathini <hbathini@linux.ibm.com> Cc: Jiri Bohac <jbohac@suse.cz> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Mahesh J Salgaonkar <mahesh@linux.ibm.com> Cc: Pingfan Liu <piliu@redhat.com> Cc: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Cc: Shivang Upadhyay <shivangu@linux.ibm.com> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-27kho: free chunks using free_page() instead of kfree()Pratyush Yadav
Before commit fa759cd75bce5 ("kho: allocate metadata directly from the buddy allocator"), the chunks were allocated from the slab allocator using kzalloc(). Those were rightly freed using kfree(). When the commit switched to using the buddy allocator directly, it missed updating kho_mem_ser_free() to use free_page() instead of kfree(). Link: https://lkml.kernel.org/r/20251118182218.63044-1-pratyush@kernel.org Fixes: fa759cd75bce5 ("kho: allocate metadata directly from the buddy allocator") Signed-off-by: Pratyush Yadav <pratyush@kernel.org> Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Alexander Graf <graf@amazon.com> Cc: David Matlack <dmatlack@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-27liveupdate: luo_file: add private argument to store runtime statePratyush Yadav
Currently file handlers only get the serialized_data field to store their state. This field has a pointer to the serialized state of the file, and it becomes a part of LUO file's serialized state. File handlers can also need some runtime state to track information that shouldn't make it in the serialized data. One such example is a vmalloc pointer. While kho_preserve_vmalloc() preserves the memory backing a vmalloc allocation, it does not store the original vmap pointer, since that has no use being passed to the next kernel. The pointer is needed to free the memory in case the file is unpreserved. Provide a private field in struct luo_file and pass it to all the callbacks. The field's can be set by preserve, and must be freed by unpreserve. Link: https://lkml.kernel.org/r/20251125165850.3389713-14-pasha.tatashin@soleen.com Signed-off-by: Pratyush Yadav <ptyadav@amazon.de> Co-developed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Tested-by: David Matlack <dmatlack@google.com> Cc: Aleksander Lobakin <aleksander.lobakin@intel.com> Cc: Alexander Graf <graf@amazon.com> Cc: Alice Ryhl <aliceryhl@google.com> Cc: Andriy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: anish kumar <yesanishhere@gmail.com> Cc: Anna Schumaker <anna.schumaker@oracle.com> Cc: Bartosz Golaszewski <bartosz.golaszewski@linaro.org> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Borislav Betkov <bp@alien8.de> Cc: Chanwoo Choi <cw00.choi@samsung.com> Cc: Chen Ridong <chenridong@huawei.com> Cc: Chris Li <chrisl@kernel.org> Cc: Christian Brauner <brauner@kernel.org> Cc: Daniel Wagner <wagi@kernel.org> Cc: Danilo Krummrich <dakr@kernel.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Jeffery <djeffery@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Guixin Liu <kanie@linux.alibaba.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Jann Horn <jannh@google.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Joanthan Cameron <Jonathan.Cameron@huawei.com> Cc: Joel Granados <joel.granados@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Lennart Poettering <lennart@poettering.net> Cc: Leon Romanovsky <leon@kernel.org> Cc: Leon Romanovsky <leonro@nvidia.com> Cc: Lukas Wunner <lukas@wunner.de> Cc: Marc Rutland <mark.rutland@arm.com> Cc: Masahiro Yamada <masahiroy@kernel.org> Cc: Matthew Maurer <mmaurer@google.com> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Myugnjoo Ham <myungjoo.ham@samsung.com> Cc: Parav Pandit <parav@nvidia.com> Cc: Pratyush Yadav <pratyush@kernel.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Saeed Mahameed <saeedm@nvidia.com> Cc: Samiullah Khawaja <skhawaja@google.com> Cc: Song Liu <song@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Stuart Hayes <stuart.w.hayes@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Gleinxer <tglx@linutronix.de> Cc: Thomas Weißschuh <linux@weissschuh.net> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: William Tu <witu@nvidia.com> Cc: Yoann Congal <yoann.congal@smile.fr> Cc: Zhu Yanjun <yanjun.zhu@linux.dev> Cc: Zijun Hu <quic_zijuhu@quicinc.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-27liveupdate: luo_session: add ioctls for file preservationPasha Tatashin
Introducing the userspace interface and internal logic required to manage the lifecycle of file descriptors within a session. Previously, a session was merely a container; this change makes it a functional management unit. The following capabilities are added: A new set of ioctl commands are added, which operate on the file descriptor returned by CREATE_SESSION. This allows userspace to: - LIVEUPDATE_SESSION_PRESERVE_FD: Add a file descriptor to a session to be preserved across the live update. - LIVEUPDATE_SESSION_RETRIEVE_FD: Retrieve a preserved file in the new kernel using its unique token. - LIVEUPDATE_SESSION_FINISH: finish session The session's .release handler is enhanced to be state-aware. When a session's file descriptor is closed, it correctly unpreserves the session based on its current state before freeing all associated file resources. Link: https://lkml.kernel.org/r/20251125165850.3389713-8-pasha.tatashin@soleen.com Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Pratyush Yadav <pratyush@kernel.org> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Tested-by: David Matlack <dmatlack@google.com> Cc: Aleksander Lobakin <aleksander.lobakin@intel.com> Cc: Alexander Graf <graf@amazon.com> Cc: Alice Ryhl <aliceryhl@google.com> Cc: Andriy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: anish kumar <yesanishhere@gmail.com> Cc: Anna Schumaker <anna.schumaker@oracle.com> Cc: Bartosz Golaszewski <bartosz.golaszewski@linaro.org> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Borislav Betkov <bp@alien8.de> Cc: Chanwoo Choi <cw00.choi@samsung.com> Cc: Chen Ridong <chenridong@huawei.com> Cc: Chris Li <chrisl@kernel.org> Cc: Christian Brauner <brauner@kernel.org> Cc: Daniel Wagner <wagi@kernel.org> Cc: Danilo Krummrich <dakr@kernel.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Jeffery <djeffery@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Guixin Liu <kanie@linux.alibaba.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Jann Horn <jannh@google.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Joanthan Cameron <Jonathan.Cameron@huawei.com> Cc: Joel Granados <joel.granados@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Lennart Poettering <lennart@poettering.net> Cc: Leon Romanovsky <leon@kernel.org> Cc: Leon Romanovsky <leonro@nvidia.com> Cc: Lukas Wunner <lukas@wunner.de> Cc: Marc Rutland <mark.rutland@arm.com> Cc: Masahiro Yamada <masahiroy@kernel.org> Cc: Matthew Maurer <mmaurer@google.com> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Myugnjoo Ham <myungjoo.ham@samsung.com> Cc: Parav Pandit <parav@nvidia.com> Cc: Pratyush Yadav <ptyadav@amazon.de> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Saeed Mahameed <saeedm@nvidia.com> Cc: Samiullah Khawaja <skhawaja@google.com> Cc: Song Liu <song@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Stuart Hayes <stuart.w.hayes@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Gleinxer <tglx@linutronix.de> Cc: Thomas Weißschuh <linux@weissschuh.net> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: William Tu <witu@nvidia.com> Cc: Yoann Congal <yoann.congal@smile.fr> Cc: Zhu Yanjun <yanjun.zhu@linux.dev> Cc: Zijun Hu <quic_zijuhu@quicinc.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-27liveupdate: luo_file: implement file systems callbacksPasha Tatashin
This patch implements the core mechanism for managing preserved files throughout the live update lifecycle. It provides the logic to invoke the file handler callbacks (preserve, unpreserve, freeze, unfreeze, retrieve, and finish) at the appropriate stages. During the reboot phase, luo_file_freeze() serializes the final metadata for each file (handler compatible string, token, and data handle) into a memory region preserved by KHO. In the new kernel, luo_file_deserialize() reconstructs the in-memory file list from this data, preparing the session for retrieval. Link: https://lkml.kernel.org/r/20251125165850.3389713-7-pasha.tatashin@soleen.com Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Pratyush Yadav <pratyush@kernel.org> Tested-by: David Matlack <dmatlack@google.com> Cc: Aleksander Lobakin <aleksander.lobakin@intel.com> Cc: Alexander Graf <graf@amazon.com> Cc: Alice Ryhl <aliceryhl@google.com> Cc: Andriy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: anish kumar <yesanishhere@gmail.com> Cc: Anna Schumaker <anna.schumaker@oracle.com> Cc: Bartosz Golaszewski <bartosz.golaszewski@linaro.org> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Borislav Betkov <bp@alien8.de> Cc: Chanwoo Choi <cw00.choi@samsung.com> Cc: Chen Ridong <chenridong@huawei.com> Cc: Chris Li <chrisl@kernel.org> Cc: Christian Brauner <brauner@kernel.org> Cc: Daniel Wagner <wagi@kernel.org> Cc: Danilo Krummrich <dakr@kernel.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Jeffery <djeffery@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Guixin Liu <kanie@linux.alibaba.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Jann Horn <jannh@google.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Joanthan Cameron <Jonathan.Cameron@huawei.com> Cc: Joel Granados <joel.granados@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Lennart Poettering <lennart@poettering.net> Cc: Leon Romanovsky <leon@kernel.org> Cc: Leon Romanovsky <leonro@nvidia.com> Cc: Lukas Wunner <lukas@wunner.de> Cc: Marc Rutland <mark.rutland@arm.com> Cc: Masahiro Yamada <masahiroy@kernel.org> Cc: Matthew Maurer <mmaurer@google.com> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Myugnjoo Ham <myungjoo.ham@samsung.com> Cc: Parav Pandit <parav@nvidia.com> Cc: Pratyush Yadav <ptyadav@amazon.de> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Saeed Mahameed <saeedm@nvidia.com> Cc: Samiullah Khawaja <skhawaja@google.com> Cc: Song Liu <song@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Stuart Hayes <stuart.w.hayes@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Gleinxer <tglx@linutronix.de> Cc: Thomas Weißschuh <linux@weissschuh.net> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: William Tu <witu@nvidia.com> Cc: Yoann Congal <yoann.congal@smile.fr> Cc: Zhu Yanjun <yanjun.zhu@linux.dev> Cc: Zijun Hu <quic_zijuhu@quicinc.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-27liveupdate: luo_core: add user interfacePasha Tatashin
Introduce the user-space interface for the Live Update Orchestrator via ioctl commands, enabling external control over the live update process and management of preserved resources. The idea is that there is going to be a single userspace agent driving the live update, therefore, only a single process can ever hold this device opened at a time. The following ioctl commands are introduced: LIVEUPDATE_IOCTL_CREATE_SESSION Provides a way for userspace to create a named session for grouping file descriptors that need to be preserved. It returns a new file descriptor representing the session. LIVEUPDATE_IOCTL_RETRIEVE_SESSION Allows the userspace agent in the new kernel to reclaim a preserved session by its name, receiving a new file descriptor to manage the restored resources. Link: https://lkml.kernel.org/r/20251125165850.3389713-6-pasha.tatashin@soleen.com Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Pratyush Yadav <pratyush@kernel.org> Tested-by: David Matlack <dmatlack@google.com> Cc: Aleksander Lobakin <aleksander.lobakin@intel.com> Cc: Alexander Graf <graf@amazon.com> Cc: Alice Ryhl <aliceryhl@google.com> Cc: Andriy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: anish kumar <yesanishhere@gmail.com> Cc: Anna Schumaker <anna.schumaker@oracle.com> Cc: Bartosz Golaszewski <bartosz.golaszewski@linaro.org> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Borislav Betkov <bp@alien8.de> Cc: Chanwoo Choi <cw00.choi@samsung.com> Cc: Chen Ridong <chenridong@huawei.com> Cc: Chris Li <chrisl@kernel.org> Cc: Christian Brauner <brauner@kernel.org> Cc: Daniel Wagner <wagi@kernel.org> Cc: Danilo Krummrich <dakr@kernel.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Jeffery <djeffery@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Guixin Liu <kanie@linux.alibaba.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Jann Horn <jannh@google.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Joanthan Cameron <Jonathan.Cameron@huawei.com> Cc: Joel Granados <joel.granados@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Lennart Poettering <lennart@poettering.net> Cc: Leon Romanovsky <leon@kernel.org> Cc: Leon Romanovsky <leonro@nvidia.com> Cc: Lukas Wunner <lukas@wunner.de> Cc: Marc Rutland <mark.rutland@arm.com> Cc: Masahiro Yamada <masahiroy@kernel.org> Cc: Matthew Maurer <mmaurer@google.com> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Myugnjoo Ham <myungjoo.ham@samsung.com> Cc: Parav Pandit <parav@nvidia.com> Cc: Pratyush Yadav <ptyadav@amazon.de> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Saeed Mahameed <saeedm@nvidia.com> Cc: Samiullah Khawaja <skhawaja@google.com> Cc: Song Liu <song@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Stuart Hayes <stuart.w.hayes@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Gleinxer <tglx@linutronix.de> Cc: Thomas Weißschuh <linux@weissschuh.net> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: William Tu <witu@nvidia.com> Cc: Yoann Congal <yoann.congal@smile.fr> Cc: Zhu Yanjun <yanjun.zhu@linux.dev> Cc: Zijun Hu <quic_zijuhu@quicinc.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-27liveupdate: luo_session: add sessions supportPasha Tatashin
Introduce concept of "Live Update Sessions" within the LUO framework. LUO sessions provide a mechanism to group and manage `struct file *` instances (representing file descriptors) that need to be preserved across a kexec-based live update. Each session is identified by a unique name and acts as a container for file objects whose state is critical to a userspace workload, such as a virtual machine or a high-performance database, aiming to maintain their functionality across a kernel transition. This groundwork establishes the framework for preserving file-backed state across kernel updates, with the actual file data preservation mechanisms to be implemented in subsequent patches. [dan.carpenter@linaro.org: fix use after free in luo_session_deserialize()] Link: https://lkml.kernel.org/r/c5dd637d7eed3a3be48c5e9fedb881596a3b1f5a.1764163896.git.dan.carpenter@linaro.org Link: https://lkml.kernel.org/r/20251125165850.3389713-5-pasha.tatashin@soleen.com Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com> Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Pratyush Yadav <pratyush@kernel.org> Tested-by: David Matlack <dmatlack@google.com> Cc: Aleksander Lobakin <aleksander.lobakin@intel.com> Cc: Alexander Graf <graf@amazon.com> Cc: Alice Ryhl <aliceryhl@google.com> Cc: Andriy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: anish kumar <yesanishhere@gmail.com> Cc: Anna Schumaker <anna.schumaker@oracle.com> Cc: Bartosz Golaszewski <bartosz.golaszewski@linaro.org> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Borislav Betkov <bp@alien8.de> Cc: Chanwoo Choi <cw00.choi@samsung.com> Cc: Chen Ridong <chenridong@huawei.com> Cc: Chris Li <chrisl@kernel.org> Cc: Christian Brauner <brauner@kernel.org> Cc: Daniel Wagner <wagi@kernel.org> Cc: Danilo Krummrich <dakr@kernel.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Jeffery <djeffery@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Guixin Liu <kanie@linux.alibaba.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Jann Horn <jannh@google.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Joanthan Cameron <Jonathan.Cameron@huawei.com> Cc: Joel Granados <joel.granados@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Lennart Poettering <lennart@poettering.net> Cc: Leon Romanovsky <leon@kernel.org> Cc: Leon Romanovsky <leonro@nvidia.com> Cc: Lukas Wunner <lukas@wunner.de> Cc: Marc Rutland <mark.rutland@arm.com> Cc: Masahiro Yamada <masahiroy@kernel.org> Cc: Matthew Maurer <mmaurer@google.com> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Myugnjoo Ham <myungjoo.ham@samsung.com> Cc: Parav Pandit <parav@nvidia.com> Cc: Pratyush Yadav <ptyadav@amazon.de> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Saeed Mahameed <saeedm@nvidia.com> Cc: Samiullah Khawaja <skhawaja@google.com> Cc: Song Liu <song@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Stuart Hayes <stuart.w.hayes@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Gleinxer <tglx@linutronix.de> Cc: Thomas Weißschuh <linux@weissschuh.net> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: William Tu <witu@nvidia.com> Cc: Yoann Congal <yoann.congal@smile.fr> Cc: Zhu Yanjun <yanjun.zhu@linux.dev> Cc: Zijun Hu <quic_zijuhu@quicinc.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-27kexec: call liveupdate_reboot() before kexecPasha Tatashin
Modify the kernel_kexec() to call liveupdate_reboot(). This ensures that the Live Update Orchestrator is notified just before the kernel executes the kexec jump. The liveupdate_reboot() function triggers the final freeze event, allowing participating FDs perform last-minute check or state saving within the blackout window. If liveupdate_reboot() returns an error (indicating a failure during LUO finalization), the kexec operation is aborted to prevent proceeding with an inconsistent state. An error is returned to user. Link: https://lkml.kernel.org/r/20251125165850.3389713-4-pasha.tatashin@soleen.com Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Pratyush Yadav <pratyush@kernel.org> Tested-by: David Matlack <dmatlack@google.com> Cc: Aleksander Lobakin <aleksander.lobakin@intel.com> Cc: Alexander Graf <graf@amazon.com> Cc: Alice Ryhl <aliceryhl@google.com> Cc: Andriy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: anish kumar <yesanishhere@gmail.com> Cc: Anna Schumaker <anna.schumaker@oracle.com> Cc: Bartosz Golaszewski <bartosz.golaszewski@linaro.org> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Borislav Betkov <bp@alien8.de> Cc: Chanwoo Choi <cw00.choi@samsung.com> Cc: Chen Ridong <chenridong@huawei.com> Cc: Chris Li <chrisl@kernel.org> Cc: Christian Brauner <brauner@kernel.org> Cc: Daniel Wagner <wagi@kernel.org> Cc: Danilo Krummrich <dakr@kernel.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Jeffery <djeffery@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Guixin Liu <kanie@linux.alibaba.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Jann Horn <jannh@google.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Joanthan Cameron <Jonathan.Cameron@huawei.com> Cc: Joel Granados <joel.granados@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Lennart Poettering <lennart@poettering.net> Cc: Leon Romanovsky <leon@kernel.org> Cc: Leon Romanovsky <leonro@nvidia.com> Cc: Lukas Wunner <lukas@wunner.de> Cc: Marc Rutland <mark.rutland@arm.com> Cc: Masahiro Yamada <masahiroy@kernel.org> Cc: Matthew Maurer <mmaurer@google.com> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Myugnjoo Ham <myungjoo.ham@samsung.com> Cc: Parav Pandit <parav@nvidia.com> Cc: Pratyush Yadav <ptyadav@amazon.de> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Saeed Mahameed <saeedm@nvidia.com> Cc: Samiullah Khawaja <skhawaja@google.com> Cc: Song Liu <song@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Stuart Hayes <stuart.w.hayes@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Gleinxer <tglx@linutronix.de> Cc: Thomas Weißschuh <linux@weissschuh.net> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: William Tu <witu@nvidia.com> Cc: Yoann Congal <yoann.congal@smile.fr> Cc: Zhu Yanjun <yanjun.zhu@linux.dev> Cc: Zijun Hu <quic_zijuhu@quicinc.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-27liveupdate: luo_core: integrate with KHOPasha Tatashin
Integrate the LUO with the KHO framework to enable passing LUO state across a kexec reboot. This patch implements the lifecycle integration with KHO: 1. Incoming State: During early boot (`early_initcall`), LUO checks if KHO is active. If so, it retrieves the "LUO" subtree, verifies the "luo-v1" compatibility string, and reads the `liveupdate-number` to track the update count. 2. Outgoing State: During late initialization (`late_initcall`), LUO allocates a new FDT for the next kernel, populates it with the basic header (compatible string and incremented update number), and registers it with KHO (`kho_add_subtree`). 3. Finalization: The `liveupdate_reboot()` notifier is updated to invoke `kho_finalize()`. This ensures that all memory segments marked for preservation are properly serialized before the kexec jump. Link: https://lkml.kernel.org/r/20251125165850.3389713-3-pasha.tatashin@soleen.com Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Pratyush Yadav <pratyush@kernel.org> Tested-by: David Matlack <dmatlack@google.com> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Aleksander Lobakin <aleksander.lobakin@intel.com> Cc: Alexander Graf <graf@amazon.com> Cc: Alice Ryhl <aliceryhl@google.com> Cc: Andriy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: anish kumar <yesanishhere@gmail.com> Cc: Anna Schumaker <anna.schumaker@oracle.com> Cc: Bartosz Golaszewski <bartosz.golaszewski@linaro.org> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Borislav Betkov <bp@alien8.de> Cc: Chanwoo Choi <cw00.choi@samsung.com> Cc: Chen Ridong <chenridong@huawei.com> Cc: Chris Li <chrisl@kernel.org> Cc: Christian Brauner <brauner@kernel.org> Cc: Daniel Wagner <wagi@kernel.org> Cc: Danilo Krummrich <dakr@kernel.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Jeffery <djeffery@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Guixin Liu <kanie@linux.alibaba.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Jann Horn <jannh@google.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Joanthan Cameron <Jonathan.Cameron@huawei.com> Cc: Joel Granados <joel.granados@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Lennart Poettering <lennart@poettering.net> Cc: Leon Romanovsky <leon@kernel.org> Cc: Leon Romanovsky <leonro@nvidia.com> Cc: Lukas Wunner <lukas@wunner.de> Cc: Marc Rutland <mark.rutland@arm.com> Cc: Masahiro Yamada <masahiroy@kernel.org> Cc: Matthew Maurer <mmaurer@google.com> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Myugnjoo Ham <myungjoo.ham@samsung.com> Cc: Parav Pandit <parav@nvidia.com> Cc: Pratyush Yadav <ptyadav@amazon.de> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Saeed Mahameed <saeedm@nvidia.com> Cc: Samiullah Khawaja <skhawaja@google.com> Cc: Song Liu <song@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Stuart Hayes <stuart.w.hayes@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Gleinxer <tglx@linutronix.de> Cc: Thomas Weißschuh <linux@weissschuh.net> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: William Tu <witu@nvidia.com> Cc: Yoann Congal <yoann.congal@smile.fr> Cc: Zhu Yanjun <yanjun.zhu@linux.dev> Cc: Zijun Hu <quic_zijuhu@quicinc.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-27liveupdate: luo_core: Live Update OrchestratorPasha Tatashin
Patch series "Live Update Orchestrator", v8. This series introduces the Live Update Orchestrator, a kernel subsystem designed to facilitate live kernel updates using a kexec-based reboot. This capability is critical for cloud environments, allowing hypervisors to be updated with minimal downtime for running virtual machines. LUO achieves this by preserving the state of selected resources, such as memory, devices and their dependencies, across the kernel transition. As a key feature, this series includes support for preserving memfd file descriptors, which allows critical in-memory data, such as guest RAM or any other large memory region, to be maintained in RAM across the kexec reboot. The other series that use LUO, are VFIO [1], IOMMU [2], and PCI [3] preservations. Github repo of this series [4]. The core of LUO is a framework for managing the lifecycle of preserved resources through a userspace-driven interface. Key features include: - Session Management Userspace agent (i.e. luod [5]) creates named sessions, each represented by a file descriptor (via centralized agent that controls /dev/liveupdate). The lifecycle of all preserved resources within a session is tied to this FD, ensuring automatic kernel cleanup if the controlling userspace agent crashes or exits unexpectedly. - File Preservation A handler-based framework allows specific file types (demonstrated here with memfd) to be preserved. Handlers manage the serialization, restoration, and lifecycle of their specific file types. - File-Lifecycle-Bound State A new mechanism for managing shared global state whose lifecycle is tied to the preservation of one or more files. This is crucial for subsystems like IOMMU or HugeTLB, where multiple file descriptors may depend on a single, shared underlying resource that must be preserved only once. - KHO Integration LUO drives the Kexec Handover framework programmatically to pass its serialized metadata to the next kernel. The LUO state is finalized and added to the kexec image just before the reboot is triggered. In the future this step will also be removed once stateless KHO is merged [6]. - Userspace Interface Control is provided via ioctl commands on /dev/liveupdate for creating and retrieving sessions, as well as on session file descriptors for managing individual files. - Testing The series includes a set of selftests, including userspace API validation, kexec-based lifecycle tests for various session and file scenarios, and a new in-kernel test module to validate the FLB logic. Introduce LUO, a mechanism intended to facilitate kernel updates while keeping designated devices operational across the transition (e.g., via kexec). The primary use case is updating hypervisors with minimal disruption to running virtual machines. For userspace side of hypervisor update we have copyless migration. LUO is for updating the kernel. This initial patch lays the groundwork for the LUO subsystem. Further functionality, including the implementation of state transition logic, integration with KHO, and hooks for subsystems and file descriptors, will be added in subsequent patches. Create a character device at /dev/liveupdate. A new uAPI header, <uapi/linux/liveupdate.h>, will define the necessary structures. The magic number for IOCTL is registered in Documentation/userspace-api/ioctl/ioctl-number.rst. Link: https://lkml.kernel.org/r/20251125165850.3389713-1-pasha.tatashin@soleen.com Link: https://lkml.kernel.org/r/20251125165850.3389713-2-pasha.tatashin@soleen.com Link: https://lore.kernel.org/all/20251018000713.677779-1-vipinsh@google.com/ [1] Link: https://lore.kernel.org/linux-iommu/20250928190624.3735830-1-skhawaja@google.com [2] Link: https://lore.kernel.org/linux-pci/20250916-luo-pci-v2-0-c494053c3c08@kernel.org [3] Link: https://github.com/googleprodkernel/linux-liveupdate/tree/luo/v8 [4] Link: https://tinyurl.com/luoddesign [5] Link: https://lore.kernel.org/all/20251020100306.2709352-1-jasonmiu@google.com [6] Link: https://lore.kernel.org/all/20251115233409.768044-1-pasha.tatashin@soleen.com [7] Link: https://github.com/soleen/linux/blob/luo/v8b03/diff.v7.v8 [8] Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Pratyush Yadav <pratyush@kernel.org> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Tested-by: David Matlack <dmatlack@google.com> Cc: Aleksander Lobakin <aleksander.lobakin@intel.com> Cc: Alexander Graf <graf@amazon.com> Cc: Alice Ryhl <aliceryhl@google.com> Cc: Andriy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: anish kumar <yesanishhere@gmail.com> Cc: Anna Schumaker <anna.schumaker@oracle.com> Cc: Bartosz Golaszewski <bartosz.golaszewski@linaro.org> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Borislav Betkov <bp@alien8.de> Cc: Chanwoo Choi <cw00.choi@samsung.com> Cc: Chen Ridong <chenridong@huawei.com> Cc: Chris Li <chrisl@kernel.org> Cc: Christian Brauner <brauner@kernel.org> Cc: Daniel Wagner <wagi@kernel.org> Cc: Danilo Krummrich <dakr@kernel.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Jeffery <djeffery@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Guixin Liu <kanie@linux.alibaba.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Jann Horn <jannh@google.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Joanthan Cameron <Jonathan.Cameron@huawei.com> Cc: Joel Granados <joel.granados@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Lennart Poettering <lennart@poettering.net> Cc: Leon Romanovsky <leon@kernel.org> Cc: Leon Romanovsky <leonro@nvidia.com> Cc: Lukas Wunner <lukas@wunner.de> Cc: Marc Rutland <mark.rutland@arm.com> Cc: Masahiro Yamada <masahiroy@kernel.org> Cc: Matthew Maurer <mmaurer@google.com> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Myugnjoo Ham <myungjoo.ham@samsung.com> Cc: Parav Pandit <parav@nvidia.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Saeed Mahameed <saeedm@nvidia.com> Cc: Samiullah Khawaja <skhawaja@google.com> Cc: Song Liu <song@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Stuart Hayes <stuart.w.hayes@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Gleinxer <tglx@linutronix.de> Cc: Thomas Weißschuh <linux@weissschuh.net> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: William Tu <witu@nvidia.com> Cc: Yoann Congal <yoann.congal@smile.fr> Cc: Zijun Hu <quic_zijuhu@quicinc.com> Cc: Pratyush Yadav <ptyadav@amazon.de> Cc: Zhu Yanjun <yanjun.zhu@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-27kho: add Kconfig option to enable KHO by defaultPasha Tatashin
Currently, Kexec Handover must be explicitly enabled via the kernel command line parameter `kho=on`. For workloads that rely on KHO as a foundational requirement (such as the upcoming Live Update Orchestrator), requiring an explicit boot parameter adds redundant configuration steps. Introduce CONFIG_KEXEC_HANDOVER_ENABLE_DEFAULT. When selected, KHO defaults to enabled. This is equivalent to passing kho=on at boot. The behavior can still be disabled at runtime by passing kho=off. Link: https://lkml.kernel.org/r/20251114190002.3311679-14-pasha.tatashin@soleen.com Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Pratyush Yadav <pratyush@kernel.org> Cc: Alexander Graf <graf@amazon.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Baoquan He <bhe@redhat.com> Cc: Coiby Xu <coxu@redhat.com> Cc: Dave Vasilevsky <dave@vasilevsky.ca> Cc: Eric Biggers <ebiggers@google.com> Cc: Kees Cook <kees@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-27kho: allow memory preservation state updates after finalizationPasha Tatashin
Currently, kho_preserve_* and kho_unpreserve_* return -EBUSY if KHO is finalized. This enforces a rigid "freeze" on the KHO memory state. With the introduction of re-entrant finalization, this restriction is no longer necessary. Users should be allowed to modify the preservation set (e.g., adding new pages or freeing old ones) even after an initial finalization. The intended workflow for updates is now: 1. Modify state (preserve/unpreserve). 2. Call kho_finalize() again to refresh the serialized metadata. Remove the kho_out.finalized checks to enable this dynamic behavior. This also allows to convert kho_unpreserve_* functions to void, as they do not return any error anymore. Link: https://lkml.kernel.org/r/20251114190002.3311679-13-pasha.tatashin@soleen.com Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Pratyush Yadav <pratyush@kernel.org> Cc: Alexander Graf <graf@amazon.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Baoquan He <bhe@redhat.com> Cc: Coiby Xu <coxu@redhat.com> Cc: Dave Vasilevsky <dave@vasilevsky.ca> Cc: Eric Biggers <ebiggers@google.com> Cc: Kees Cook <kees@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-27kho: allow kexec load before KHO finalizationPasha Tatashin
Currently, kho_fill_kimage() checks kho_out.finalized and returns early if KHO is not yet finalized. This enforces a strict ordering where userspace must finalize KHO *before* loading the kexec image. This is restrictive, as standard workflows often involve loading the target kernel early in the lifecycle and finalizing the state (FDT) only immediately before the reboot. Since the KHO FDT resides at a physical address allocated during boot (kho_init), its location is stable. We can attach this stable address to the kimage regardless of whether the content has been finalized yet. Relax the check to only require kho_enable, allowing kexec_file_load to proceed at any time. Link: https://lkml.kernel.org/r/20251114190002.3311679-12-pasha.tatashin@soleen.com Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Pratyush Yadav <pratyush@kernel.org> Cc: Alexander Graf <graf@amazon.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Baoquan He <bhe@redhat.com> Cc: Coiby Xu <coxu@redhat.com> Cc: Dave Vasilevsky <dave@vasilevsky.ca> Cc: Eric Biggers <ebiggers@google.com> Cc: Kees Cook <kees@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-27kho: update FDT dynamically for subtree addition/removalPasha Tatashin
Currently, sub-FDTs were tracked in a list (kho_out.sub_fdts) and the final FDT is constructed entirely from scratch during kho_finalize(). We can maintain the FDT dynamically: 1. Initialize a valid, empty FDT in kho_init(). 2. Use fdt_add_subnode and fdt_setprop in kho_add_subtree to update the FDT immediately when a subsystem registers. 3. Use fdt_del_node in kho_remove_subtree to remove entries. This removes the need for the intermediate sub_fdts list and the reconstruction logic in kho_finalize(). kho_finalize() now only needs to trigger memory map serialization. Link: https://lkml.kernel.org/r/20251114190002.3311679-11-pasha.tatashin@soleen.com Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Pratyush Yadav <pratyush@kernel.org> Cc: Alexander Graf <graf@amazon.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Baoquan He <bhe@redhat.com> Cc: Coiby Xu <coxu@redhat.com> Cc: Dave Vasilevsky <dave@vasilevsky.ca> Cc: Eric Biggers <ebiggers@google.com> Cc: Kees Cook <kees@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-27kho: remove abort functionality and support state refreshPasha Tatashin
Previously, KHO required a dedicated kho_abort() function to clean up state before kho_finalize() could be called again. This was necessary to handle complex unwind paths when using notifiers. With the shift to direct memory preservation, the explicit abort step is no longer strictly necessary. Remove kho_abort() and refactor kho_finalize() to handle re-entry. If kho_finalize() is called while KHO is already finalized, it will now automatically clean up the previous memory map and state before generating a new one. This allows the KHO state to be updated/refreshed simply by triggering finalize again. Update debugfs to return -EINVAL if userspace attempts to write 0 to the finalize attribute, as explicit abort is no longer supported. Link: https://lkml.kernel.org/r/20251114190002.3311679-10-pasha.tatashin@soleen.com Suggested-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Pratyush Yadav <pratyush@kernel.org> Cc: Alexander Graf <graf@amazon.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Baoquan He <bhe@redhat.com> Cc: Coiby Xu <coxu@redhat.com> Cc: Dave Vasilevsky <dave@vasilevsky.ca> Cc: Eric Biggers <ebiggers@google.com> Cc: Kees Cook <kees@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-27kho: remove global preserved_mem_map and store state in FDTPasha Tatashin
Currently, the serialized memory map is tracked via kho_out.preserved_mem_map and copied to the FDT during finalization. This double tracking is redundant. Remove preserved_mem_map from kho_out. Instead, maintain the physical address of the head chunk directly in the preserved-memory-map FDT property. Introduce kho_update_memory_map() to manage this property. This function handles: 1. Retrieving and freeing any existing serialized map (handling the abort/retry case). 2. Updating the FDT property with the new chunk address. This establishes the FDT as the single source of truth for the handover state. Link: https://lkml.kernel.org/r/20251114190002.3311679-9-pasha.tatashin@soleen.com Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Pratyush Yadav <pratyush@kernel.org> Cc: Alexander Graf <graf@amazon.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Baoquan He <bhe@redhat.com> Cc: Coiby Xu <coxu@redhat.com> Cc: Dave Vasilevsky <dave@vasilevsky.ca> Cc: Eric Biggers <ebiggers@google.com> Cc: Kees Cook <kees@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-27kho: simplify serialization and remove __kho_abortPasha Tatashin
Currently, __kho_finalize() performs memory serialization in the middle of FDT construction. If FDT construction fails later, the function must manually clean up the serialized memory via __kho_abort(). Refactor __kho_finalize() to perform kho_mem_serialize() only after the FDT has been successfully constructed and finished. This reordering has two benefits: 1. It avoids expensive serialization work if FDT generation fails. 2. It removes the need for cleanup in the FDT error path. As a result, the internal helper __kho_abort() is no longer needed for internal error handling. Inline its remaining logic (cleanup of the preserved memory map) directly into kho_abort() and remove the helper. Link: https://lkml.kernel.org/r/20251114190002.3311679-8-pasha.tatashin@soleen.com Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Pratyush Yadav <pratyush@kernel.org> Cc: Alexander Graf <graf@amazon.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Baoquan He <bhe@redhat.com> Cc: Coiby Xu <coxu@redhat.com> Cc: Dave Vasilevsky <dave@vasilevsky.ca> Cc: Eric Biggers <ebiggers@google.com> Cc: Kees Cook <kees@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-27kho: always expose output FDT in debugfsPasha Tatashin
Currently, the output FDT is added to debugfs only when KHO is finalized and removed when aborted. There is no need to hide the FDT based on the state. Always expose it starting from initialization. This aids the transition toward removing the explicit abort functionality and converting KHO to be fully stateless. Link: https://lkml.kernel.org/r/20251114190002.3311679-7-pasha.tatashin@soleen.com Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Pratyush Yadav <pratyush@kernel.org> Cc: Alexander Graf <graf@amazon.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Baoquan He <bhe@redhat.com> Cc: Coiby Xu <coxu@redhat.com> Cc: Dave Vasilevsky <dave@vasilevsky.ca> Cc: Eric Biggers <ebiggers@google.com> Cc: Kees Cook <kees@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-27kho: verify deserialization status and fix FDT alignment accessPasha Tatashin
During boot, kho_restore_folio() relies on the memory map having been successfully deserialized. If deserialization fails or no map is present, attempting to restore the FDT folio is unsafe. Update kho_mem_deserialize() to return a boolean indicating success. Use this return value in kho_memory_init() to disable KHO if deserialization fails. Also, the incoming FDT folio is never used, there is no reason to restore it. Additionally, use get_unaligned() to retrieve the memory map pointer from the FDT. FDT properties are not guaranteed to be naturally aligned, and accessing a 64-bit value via a pointer that is only 32-bit aligned can cause faults. Link: https://lkml.kernel.org/r/20251114190002.3311679-6-pasha.tatashin@soleen.com Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Pratyush Yadav <pratyush@kernel.org> Cc: Alexander Graf <graf@amazon.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Baoquan He <bhe@redhat.com> Cc: Coiby Xu <coxu@redhat.com> Cc: Dave Vasilevsky <dave@vasilevsky.ca> Cc: Eric Biggers <ebiggers@google.com> Cc: Kees Cook <kees@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-27kho: preserve FDT folio only once during initializationPasha Tatashin
Currently, the FDT folio is preserved inside __kho_finalize(). If the user performs multiple finalize/abort cycles, kho_preserve_folio() is called repeatedly for the same FDT folio. Since the FDT folio is allocated once during kho_init(), it should be marked for preservation at the same time. Move the preservation call to kho_init() to align the preservation state with the object's lifecycle and simplify the finalize path. Also, pre-zero the FDT tree so we do not expose random bits to the user and to the next kernel by using the new kho_alloc_preserve() api. Link: https://lkml.kernel.org/r/20251114190002.3311679-5-pasha.tatashin@soleen.com Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Pratyush Yadav <pratyush@kernel.org> Cc: Alexander Graf <graf@amazon.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Baoquan He <bhe@redhat.com> Cc: Coiby Xu <coxu@redhat.com> Cc: Dave Vasilevsky <dave@vasilevsky.ca> Cc: Eric Biggers <ebiggers@google.com> Cc: Kees Cook <kees@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-27kho: introduce high-level memory allocation APIPasha Tatashin
Currently, clients of KHO must manually allocate memory (e.g., via alloc_pages), calculate the page order, and explicitly call kho_preserve_folio(). Similarly, cleanup requires separate calls to unpreserve and free the memory. Introduce a high-level API to streamline this common pattern: - kho_alloc_preserve(size): Allocates physically contiguous, zeroed memory and immediately marks it for preservation. - kho_unpreserve_free(ptr): Unpreserves and frees the memory in the current kernel. - kho_restore_free(ptr): Restores the struct page state of preserved memory in the new kernel and immediately frees it to the page allocator. [pasha.tatashin@soleen.com: build fixes] Link: https://lkml.kernel.org/r/CA+CK2bBgXDhrHwTVgxrw7YTQ-0=LgW0t66CwPCgG=C85ftz4zw@mail.gmail.com Link: https://lkml.kernel.org/r/20251114190002.3311679-4-pasha.tatashin@soleen.com Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Pratyush Yadav <pratyush@kernel.org> Cc: Alexander Graf <graf@amazon.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Baoquan He <bhe@redhat.com> Cc: Coiby Xu <coxu@redhat.com> Cc: Dave Vasilevsky <dave@vasilevsky.ca> Cc: Eric Biggers <ebiggers@google.com> Cc: Kees Cook <kees@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-27kho: convert __kho_abort() to return voidPasha Tatashin
The internal helper __kho_abort() always returns 0 and has no failure paths. Its return value is ignored by __kho_finalize and checked needlessly by kho_abort. Change the return type to void to reflect that this function cannot fail, and simplify kho_abort by removing dead error handling code. Link: https://lkml.kernel.org/r/20251114190002.3311679-3-pasha.tatashin@soleen.com Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Pratyush Yadav <pratyush@kernel.org> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Alexander Graf <graf@amazon.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Baoquan He <bhe@redhat.com> Cc: Coiby Xu <coxu@redhat.com> Cc: Dave Vasilevsky <dave@vasilevsky.ca> Cc: Eric Biggers <ebiggers@google.com> Cc: Kees Cook <kees@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-27kho: fix misleading log message in kho_populate()Pasha Tatashin
Patch series "kho: simplify state machine and enable dynamic updates", v2. This patch series refactors the Kexec Handover subsystem to transition from a rigid, state-locked model to a dynamic, re-entrant architecture. It also introduces usability improvements. Motivation Currently, KHO relies on a strict state machine where memory preservation is locked upon finalization. If a change is required, the user must explicitly "abort" to reset the state. Additionally, the kexec image cannot be loaded until KHO is finalized, and the FDT is rebuilt from scratch on every finalization. This series simplifies this workflow to support "load early, finalize late" scenarios. Key Changes State Machine Simplification: - Removed kho_abort(). kho_finalize() is now re-entrant; calling it a second time automatically flushes the previous serialized state and generates a fresh one. - Removed kho_out.finalized checks from preservation APIs, allowing drivers to add/remove pages even after an initial finalization. - Decoupled kexec_file_load from KHO finalization. The KHO FDT physical address is now stable from boot, allowing the kexec image to be loaded before the handover metadata is finalized. FDT Management: - The FDT is now updated in-place dynamically when subtrees are added or removed, removing the need for complex reconstruction logic. - The output FDT is always exposed in debugfs (initialized and zeroed at boot), improving visibility and debugging capabilities throughout the system lifecycle. - Removed the redundant global preserved_mem_map pointer, establishing the FDT property as the single source of truth. New Features & API Enhancements: - High-Level Allocators: Introduced kho_alloc_preserve() and friends to reduce boilerplate for drivers that need to allocate, preserve, and eventually restore simple memory buffers. - Configuration: Added CONFIG_KEXEC_HANDOVER_ENABLE_DEFAULT to allow KHO to be active by default without requiring the kho=on command line parameter. Fixes: - Fixed potential alignment faults when accessing 64-bit FDT properties. - Fixed the lifecycle of the FDT folio preservation (now preserved once at init). This patch (of 13): The log message in kho_populate() currently states "Will skip init for some devices". This implies that Kexec Handover always involves skipping device initialization. However, KHO is a generic mechanism used to preserve kernel memory across reboot for various purposes, such as memfd, telemetry, or reserve_mem. Skipping device initialization is a specific property of live update drivers using KHO, not a property of the mechanism itself. Remove the misleading suffix to accurately reflect the generic nature of KHO discovery. Link: https://lkml.kernel.org/r/20251114190002.3311679-2-pasha.tatashin@soleen.com Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Pratyush Yadav <pratyush@kernel.org> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Alexander Graf <graf@amazon.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Baoquan He <bhe@redhat.com> Cc: Coiby Xu <coxu@redhat.com> Cc: Dave Vasilevsky <dave@vasilevsky.ca> Cc: Eric Biggers <ebiggers@google.com> Cc: Kees Cook <kees@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-27liveupdate: kho: use %pe format specifier for error pointer printingZhu Yanjun
Make pr_xxx() call to use the %pe format specifier instead of %d. The %pe specifier prints a symbolic error string (e.g., -ENOMEM, -EINVAL) when given an error pointer created with ERR_PTR(err). This change enhances the clarity and diagnostic value of the error message by showing a descriptive error name rather than a numeric error code. Note, that some err are still printed by value, as those errors might come from libfdt and not regular errnos. Link: https://lkml.kernel.org/r/20251101142325.1326536-10-pasha.tatashin@soleen.com Signed-off-by: Zhu Yanjun <yanjun.zhu@linux.dev> Co-developed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Pratyush Yadav <pratyush@kernel.org> Cc: Alexander Graf <graf@amazon.com> Cc: Changyuan Lyu <changyuanl@google.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Masahiro Yamada <masahiroy@kernel.org> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: "Mike Rapoport (Microsoft)" <rppt@kernel.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-27liveupdate: kho: move to kernel/liveupdatePasha Tatashin
Move KHO to kernel/liveupdate/ in preparation of placing all Live Update core kernel related files to the same place. [pasha.tatashin@soleen.com: disable the menu when DEFERRED_STRUCT_PAGE_INIT] Link: https://lkml.kernel.org/r/CA+CK2bAvh9Oa2SLfsbJ8zztpEjrgr_hr-uGgF1coy8yoibT39A@mail.gmail.com Link: https://lkml.kernel.org/r/20251101142325.1326536-8-pasha.tatashin@soleen.com Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Alexander Graf <graf@amazon.com> Cc: Changyuan Lyu <changyuanl@google.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Masahiro Yamada <masahiroy@kernel.org> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Pratyush Yadav <pratyush@kernel.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Simon Horman <horms@kernel.org> Cc: Tejun Heo <tj@kernel.org> Cc: Zhu Yanjun <yanjun.zhu@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-27kho: don't unpreserve memory during abortPasha Tatashin
KHO allows clients to preserve memory regions at any point before the KHO state is finalized. The finalization process itself involves KHO performing its own actions, such as serializing the overall preserved memory map. If this finalization process is aborted, the current implementation destroys KHO's internal memory tracking structures (`kho_out.ser.track.orders`). This behavior effectively unpreserves all memory from KHO's perspective, regardless of whether those preservations were made by clients before the finalization attempt or by KHO itself during finalization. This premature unpreservation is incorrect. An abort of the finalization process should only undo actions taken by KHO as part of that specific finalization attempt. Individual memory regions preserved by clients prior to finalization should remain preserved, as their lifecycle is managed by the clients themselves. These clients might still need to call kho_unpreserve_folio() or kho_unpreserve_phys() based on their own logic, even after a KHO finalization attempt is aborted. Link: https://lkml.kernel.org/r/20251101142325.1326536-7-pasha.tatashin@soleen.com Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Pratyush Yadav <pratyush@kernel.org> Cc: Alexander Graf <graf@amazon.com> Cc: Changyuan Lyu <changyuanl@google.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Masahiro Yamada <masahiroy@kernel.org> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: "Mike Rapoport (Microsoft)" <rppt@kernel.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Simon Horman <horms@kernel.org> Cc: Tejun Heo <tj@kernel.org> Cc: Zhu Yanjun <yanjun.zhu@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-27kho: add interfaces to unpreserve folios, page ranges, and vmallocPasha Tatashin
Allow users of KHO to cancel the previous preservation by adding the necessary interfaces to unpreserve folio, pages, and vmallocs. Link: https://lkml.kernel.org/r/20251101142325.1326536-4-pasha.tatashin@soleen.com Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Pratyush Yadav <pratyush@kernel.org> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Alexander Graf <graf@amazon.com> Cc: Changyuan Lyu <changyuanl@google.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Masahiro Yamada <masahiroy@kernel.org> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Simon Horman <horms@kernel.org> Cc: Tejun Heo <tj@kernel.org> Cc: Zhu Yanjun <yanjun.zhu@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>