path: root/kernel
2025-11-20  watchdog: add sys_info sysctls to dump sys info on system lockup  (Feng Tang)
When a soft/hard lockup happens, developers may need different kinds of system information (call-stacks, memory info, locks, etc.) to help debugging. Add 'softlockup_sys_info' and 'hardlockup_sys_info' sysctl knobs that take a human-readable string like "tasks,mem,timers,locks,ftrace,...", and when a system lockup happens, all requested information will be printed out (refer to kernel/sys_info.c for more details). Link: https://lkml.kernel.org/r/20251113111039.22701-4-feng.tang@linux.alibaba.com Signed-off-by: Feng Tang <feng.tang@linux.alibaba.com> Reviewed-by: Petr Mladek <pmladek@suse.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Lance Yang <ioworker0@gmail.com> Cc: "Paul E . McKenney" <paulmck@kernel.org> Cc: Petr Mladek <pmladek@suse.com> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-20  hung_task: add hung_task_sys_info sysctl to dump sys info on task-hung  (Feng Tang)
When a task-hung happens, developers may need different kinds of system information (call-stacks, memory info, locks, etc.) to help debugging. Add a 'hung_task_sys_info' sysctl knob that takes a human-readable string like "tasks,mem,timers,locks,ftrace,...", and when a task-hung happens, all requested information will be dumped (refer to kernel/sys_info.c for more details). Meanwhile, the newly introduced sys_info() call is used to unify some existing info-dumping knobs. [feng.tang@linux.alibaba.com: maintain consistency with established behavior, per Lance and Petr] Link: https://lkml.kernel.org/r/aRncJo1mA5Zk77Hr@U-2FWC9VHC-2323.local Link: https://lkml.kernel.org/r/20251113111039.22701-3-feng.tang@linux.alibaba.com Signed-off-by: Feng Tang <feng.tang@linux.alibaba.com> Suggested-by: Petr Mladek <pmladek@suse.com> Reviewed-by: Petr Mladek <pmladek@suse.com> Reviewed-by: Lance Yang <lance.yang@linux.dev> Cc: Jonathan Corbet <corbet@lwn.net> Cc: "Paul E . McKenney" <paulmck@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-20  kernel/hung_task: unexport sysctl_hung_task_timeout_secs  (Christoph Hellwig)
This was added by the bcachefs pull requests despite various objections, and with bcachefs removed is now unused. This reverts commit 5c3273ec3c6a ("kernel/hung_task.c: export sysctl_hung_task_timeout_secs"). Link: https://lkml.kernel.org/r/20251104121920.2430568-1-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Lance Yang <lance.yang@linux.dev> Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Cc: Kent Overstreet <kent.overstreet@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-20  panic: sys_info: align constant definition names with parameters  (Andy Shevchenko)
Align constant definition names with parameters to make it easier to map. It's also better to maintain and extend the names while keeping their uniqueness. Link: https://lkml.kernel.org/r/20251030132007.3742368-3-andriy.shevchenko@linux.intel.com Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Reviewed-by: Feng Tang <feng.tang@linux.alibaba.com> Reviewed-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-20  PM: sleep: Call pm_sleep_fs_sync() instead of ksys_sync_helper()  (Samuel Wu)
Replace the direct calls to ksys_sync_helper() with the new pm_sleep_fs_sync() in suspend and hibernation code paths. This enables the new mechanism allowing the filesystem sync phase to be interrupted. Suggested-by: Saravana Kannan <saravanak@google.com> Signed-off-by: Samuel Wu <wusamuel@google.com> Co-developed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> [ rjw: Subject and changelog edits, tags adjustment ] Link: https://patch.msgid.link/20251119171426.4086783-3-wusamuel@google.com Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2025-11-20  PM: sleep: Add support for wakeup during filesystem sync  (Samuel Wu)
Add helper function pm_sleep_fs_sync() and related data structures as a preparation for allowing system suspend and hibernation to be aborted by wakeup events while syncing file systems. The new function, to be called by the suspend process in order to sync file systems, uses a dedicated ordered workqueue to run ksys_sync_helper() in parallel with the calling process. Next, it waits for the completion of the filesystem sync and periodically checks if any system wakeup events are pending, in which case it will return an error. If that happens while the filesystem sync is still in progress, it will continue, possibly after pm_sleep_fs_sync() has returned, and if that function is called again before the sync is complete, a new work item to run ksys_sync_helper() again will be queued (and waited for) to increase the likelihood of writing all of the dirty pages in memory back to persistent storage. Suggested-by: Saravana Kannan <saravanak@google.com> Signed-off-by: Samuel Wu <wusamuel@google.com> Co-developed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> [ rjw: Subject and changelog rewrite, tags adjustment ] Link: https://patch.msgid.link/20251119171426.4086783-2-wusamuel@google.com Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
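A minimal sketch of the sync-with-wakeup-check pattern described above, assuming an ordered workqueue plus a completion; the names pm_fs_sync_wq, fs_sync_work and fs_sync_done are illustrative assumptions, not the actual symbols used by the patch:

  static struct workqueue_struct *pm_fs_sync_wq;   /* ordered workqueue */
  static DECLARE_COMPLETION(fs_sync_done);

  static void fs_sync_work_fn(struct work_struct *work)
  {
          ksys_sync_helper();
          complete_all(&fs_sync_done);
  }
  static DECLARE_WORK(fs_sync_work, fs_sync_work_fn);

  int pm_sleep_fs_sync(void)
  {
          /* Sketch only: completion re-init and re-queueing on a repeated
           * call while a previous sync is still running are omitted. */
          queue_work(pm_fs_sync_wq, &fs_sync_work);

          /* Wait for the sync, but bail out if a wakeup event is pending. */
          while (!wait_for_completion_timeout(&fs_sync_done,
                                              msecs_to_jiffies(100))) {
                  if (pm_wakeup_pending())
                          return -EBUSY;
          }
          return 0;
  }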
2025-11-20  Merge back material related to system sleep for 6.19  (Rafael J. Wysocki)
2025-11-20  sched: Provide and use set_need_resched_current()  (Peter Zijlstra)
set_tsk_need_resched(current) requires a matching set_preempt_need_resched() to work correctly outside of the scheduler. Provide set_need_resched_current(), which wraps this correctly, and replace all the open-coded instances. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://patch.msgid.link/20251116174750.665769842@linutronix.de
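A plausible shape for the new helper (a sketch; the real implementation may differ in details such as assertions):

  static __always_inline void set_need_resched_current(void)
  {
          /* Both operations must be done together for the reschedule
           * request to take effect outside of scheduler-internal code. */
          set_tsk_need_resched(current);
          set_preempt_need_resched();
  }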
2025-11-20  workqueue: Init rescuer's affinities as wq_unbound_cpumask  (Lai Jiangshan)
The affinity to set on the rescuers should be consistent in all paths when a rescuer is in the detached state. The affinity could be either wq_unbound_cpumask or unbound_effective_cpumask(wq).

Related paths:
  rescuer's worker_detach_from_pool()
  update wq_unbound_cpumask
  update wq's cpumask
  init_rescuer()

Both affinities are OK as long as they are consistent in all paths. Commit 449b31ad2937 ("workqueue: Init rescuer's affinities as the wq's effective cpumask") made init_rescuer() use unbound_effective_cpumask(wq), which was consistent with the then-current apply_wqattrs_commit(). But using unbound_effective_cpumask(wq) requires much more code to maintain the consistency, and it doesn't make much sense since the affinity is only effective when the rescuer is not processing works. wq_unbound_cpumask is more favorable.

So apply_wqattrs_commit() and the path of "updating wq's cpumask" have been changed to not update the rescuer's affinity, and both the paths of "updating wq_unbound_cpumask" and "rescuer's worker_detach_from_pool()" have been changed to use wq_unbound_cpumask. Now, make init_rescuer() use wq_unbound_cpumask for the rescuer's affinity and make all the paths consistent.

Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Waiman Long <longman@redhat.com>
Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
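The change essentially boils down to which mask the freshly created rescuer is bound to; roughly (a sketch based on the description above, not the literal diff):

  /* Sketch: in init_rescuer(), after the rescuer task has been created. */

  /* Before (449b31ad2937): follow the wq's effective cpumask. */
  kthread_bind_mask(rescuer->task, unbound_effective_cpumask(wq));

  /* After: follow wq_unbound_cpumask, consistent with the detach path. */
  kthread_bind_mask(rescuer->task, wq_unbound_cpumask);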
2025-11-20  workqueue: Let DISASSOCIATED workers follow unbound wq cpumask changes  (Lai Jiangshan)
When workqueue cpumask changes are committed, the DISASSOCIATED workers' affinity is not touched, and this might be a problem down the line for isolated setups when the DISASSOCIATED pools still have works to run after the cpu is offline. Make sure the workers' affinity is updated every time a workqueue cpumask changes, so these workers can't break isolation. Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Waiman Long <longman@redhat.com> Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-11-20  workqueue: Update the rescuer's affinity only when it is detached  (Lai Jiangshan)
When a rescuer is attached to a pool, its affinity should be managed only by the pool. But updating the detached rescuer's affinity is still meaningful so that it will not disrupt isolated CPUs when it is to be woken up.

But commit d64f2fa064f8 ("kernel/workqueue: Let rescuers follow unbound wq cpumask changes") updates the affinity unconditionally, and causes some issues:

1) It also changes the affinity when the rescuer is already attached to a pool, which violates the affinity management.

2) The said commit tries to update the affinity of the rescuers, but it misses the rescuers of the PERCPU workqueues, and isolated CPUs can possibly be disrupted by these rescuers when they are summoned.

3) The affinity to set on the rescuers should be consistent in all paths when a rescuer is in the detached state. The affinity could be either wq_unbound_cpumask or unbound_effective_cpumask(wq).

   Related paths:
     rescuer's worker_detach_from_pool()
     update wq_unbound_cpumask
     update wq's cpumask
     init_rescuer()

   Both affinities are OK as long as they are consistent in all paths. But using unbound_effective_cpumask(wq) requires much more code to maintain the consistency, and it doesn't make much sense since the affinity is only effective when the rescuer is not processing works. wq_unbound_cpumask is more favorable.

Fix issue 1) by testing rescuer->pool before updating, with wq_pool_attach_mutex held.

Fix issue 2) by moving the rescuer's affinity-updating code to the place updating wq_unbound_cpumask and making it also update for PERCPU workqueues.

Partially clean up the consistency issue 3) by using wq_unbound_cpumask, so that the path of "updating wq's cpumask" doesn't need to maintain it, and both the paths of "updating wq_unbound_cpumask" and "rescuer's worker_detach_from_pool()" use wq_unbound_cpumask. Cleanup of init_rescuer()'s affinity consistency can be done in the future.

Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Waiman Long <longman@redhat.com>
Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-11-20  timers/migration: Exclude isolated cpus from hierarchy  (Gabriele Monaco)
The timer migration mechanism allows active CPUs to pull timers from idle ones to improve the overall idle time. This is however undesired when CPU-intensive workloads run on isolated cores, as the algorithm would move the timers from housekeeping to isolated cores, negatively affecting the isolation.

Exclude isolated cores from the timer migration algorithm and extend the concept of unavailable cores, currently used for offline ones, to isolated ones:

* A core is unavailable if isolated or offline;
* A core is available if non-isolated and online;

A core is considered unavailable as isolated if it belongs to:

* the isolcpus (domain) list
* an isolated cpuset

Except if it is:

* in the nohz_full list (already idle for the hierarchy)
* the nohz timekeeper core (must be available to handle global timers)

CPUs are added to the hierarchy during late boot, excluding isolated ones; the hierarchy is also adapted when the cpuset isolation changes.

Due to how the timer migration algorithm works, any CPU that is part of the hierarchy can have its global timers pulled by remote CPUs and has to pull remote timers; only skipping the pulling of remote timers would break the logic. For this reason, prevent isolated CPUs from pulling remote global timers, but also the other way around: any global timer started on an isolated CPU will run there. This does not break the concept of isolation (global timers don't come from outside the CPU) and, if considered inappropriate, can usually be mitigated with other isolation techniques (e.g. IRQ pinning).

This effect was noticed on a 128-core machine running oslat on the isolated cores (1-31,33-63,65-95,97-127). The tool monopolises CPUs, and the CPU with the lowest count in a timer migration hierarchy (here 1 and 65) appears as always active and continuously pulls global timers from the housekeeping CPUs. This ends up moving driver work (e.g. delayed work) to isolated CPUs and causes latency spikes:

before the change:
  # oslat -c 1-31,33-63,65-95,97-127 -D 62s
  ...
  Maximum: 1203 10 3 4 ... 5 (us)

after the change:
  # oslat -c 1-31,33-63,65-95,97-127 -D 62s
  ...
  Maximum: 10 4 3 4 3 ... 5 (us)

The same behaviour was observed on a machine with as few as 20 cores / 40 threads with isolcpus set to 1-9,11-39, using rtla-osnoise-top.

Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: John B. Wyatt IV <jwyatt@redhat.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://patch.msgid.link/20251120145653.296659-8-gmonaco@redhat.com
2025-11-20  sched/isolation: Force housekeeping if isolcpus and nohz_full don't leave any  (Gabriele Monaco)
Currently the user can set up isolcpus and nohz_full in such a way that leaves no housekeeping CPU (i.e. no CPU that is neither domain-isolated nor nohz_full). This can be a problem for other subsystems (e.g. the timer wheel migration). Prevent this configuration by invalidating the last setting in case the union of isolcpus (domain) and nohz_full covers all CPUs. Signed-off-by: Gabriele Monaco <gmonaco@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Waiman Long <longman@redhat.com> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://patch.msgid.link/20251120145653.296659-6-gmonaco@redhat.com
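Conceptually the validation rejects a setup where the union of the two masks covers every possible CPU; a rough sketch (mask and function names are illustrative assumptions):

  /* Sketch: true if the combined masks would leave no housekeeping CPU.
   * Only called during early, single-threaded boot-parameter parsing. */
  static bool no_housekeeping_left(const struct cpumask *isol,
                                   const struct cpumask *nohz)
  {
          static struct cpumask combined;

          cpumask_or(&combined, isol, nohz);
          return cpumask_equal(&combined, cpu_possible_mask);
  }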
2025-11-20  cgroup/cpuset: Rename update_unbound_workqueue_cpumask() to update_isolation_cpumasks()  (Gabriele Monaco)
update_unbound_workqueue_cpumask() updates unbound workqueue settings when there's a change in isolated CPUs, but it can be used by other subsystems requiring updates when isolated CPUs change. Generalise the name to update_isolation_cpumasks() to prepare for other functions unrelated to workqueues to be called in that spot. [longman: Change the function name to update_isolation_cpumasks()] Signed-off-by: Gabriele Monaco <gmonaco@redhat.com> Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Chen Ridong <chenridong@huaweicloud.com> Acked-by: Frederic Weisbecker <frederic@kernel.org> Acked-by: Waiman Long <longman@redhat.com> Link: https://patch.msgid.link/20251120145653.296659-5-gmonaco@redhat.com
2025-11-20  timers/migration: Use scoped_guard on available flag set/clear  (Gabriele Monaco)
Cleanup tmigr_clear_cpu_available() and tmigr_set_cpu_available() to prepare for easier checks on the available flag. Signed-off-by: Gabriele Monaco <gmonaco@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://patch.msgid.link/20251120145653.296659-4-gmonaco@redhat.com
2025-11-20  timers/migration: Add mask for CPUs available in the hierarchy  (Gabriele Monaco)
Keep track of the CPUs available for timer migration in a cpumask. This prepares the ground to generalise the concept of unavailable CPUs. Signed-off-by: Gabriele Monaco <gmonaco@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://patch.msgid.link/20251120145653.296659-3-gmonaco@redhat.com
2025-11-20  timers/migration: Rename 'online' bit to 'available'  (Gabriele Monaco)
The timer migration hierarchy excludes offline CPUs via the tmigr_is_not_available function, which is essentially checking the online bit for the CPU. Rename the online bit to available and all references in function names and tracepoint to generalise the concept of available CPUs. Signed-off-by: Gabriele Monaco <gmonaco@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://patch.msgid.link/20251120145653.296659-2-gmonaco@redhat.com
2025-11-20  Merge tag 'sched_ext-for-6.18-rc6-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext  (Linus Torvalds)
Pull sched_ext fix from Tejun Heo:
 "One low risk and obvious fix: scx_enable() was dereferencing an error pointer on helper kthread creation failure. Fixed"

* tag 'sched_ext-for-6.18-rc6-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
  sched_ext: Fix scx_enable() crash on helper kthread creation failure
2025-11-20  PCI/P2PDMA: Simplify bus address mapping API  (Leon Romanovsky)
Update the pci_p2pdma_bus_addr_map() function to take a direct pointer to the p2pdma_provider structure instead of the pci_p2pdma_map_state. This simplifies the API by removing the need for callers to extract the provider from the state structure. The change updates all callers across the kernel (block layer, IOMMU, DMA direct, and HMM) to pass the provider pointer directly, making the code more explicit and reducing unnecessary indirection. This also removes the runtime warning check since callers now have direct control over which provider they use. Tested-by: Alex Mastro <amastro@fb.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Acked-by: Ankit Agrawal <ankita@nvidia.com> Link: https://lore.kernel.org/r/20251120-dmabuf-vfio-v9-2-d7f71607f371@nvidia.com Signed-off-by: Alex Williamson <alex@shazbot.org>
2025-11-20  sched_ext: Fix scx_enable() crash on helper kthread creation failure  (Saket Kumar Bhaskar)
A crash was observed when the sched_ext selftests runner was terminated with Ctrl+\ while test 15 was running:

  NIP [c00000000028fa58] scx_enable.constprop.0+0x358/0x12b0
  LR [c00000000028fa2c] scx_enable.constprop.0+0x32c/0x12b0
  Call Trace:
    scx_enable.constprop.0+0x32c/0x12b0 (unreliable)
    bpf_struct_ops_link_create+0x18c/0x22c
    __sys_bpf+0x23f8/0x3044
    sys_bpf+0x2c/0x6c
    system_call_exception+0x124/0x320
    system_call_vectored_common+0x15c/0x2ec

kthread_run_worker() returns an ERR_PTR() on failure rather than NULL, but the current code in scx_alloc_and_add_sched() only checks for a NULL helper. In case of failure on SIGQUIT, the error is not handled in scx_alloc_and_add_sched() and scx_enable() ends up dereferencing an error pointer.

Fix the error handling in scx_alloc_and_add_sched() to propagate PTR_ERR() into ret, so that scx_enable() jumps to the existing error path, avoiding a random dereference on failure.

Fixes: bff3b5aec1b7 ("sched_ext: Move disable machinery into scx_sched")
Cc: stable@vger.kernel.org # v6.16+
Reported-and-tested-by: Samir Mulani <samir@linux.ibm.com>
Signed-off-by: Saket Kumar Bhaskar <skb99@linux.ibm.com>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Reviewed-by: Vishal Chourasia <vishalc@linux.ibm.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
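The shape of the fix, in general terms (a sketch; the surrounding code, the error label and the exact kthread_run_worker() arguments are assumptions):

  helper = kthread_run_worker(0, "sched_ext_helper");
  if (IS_ERR(helper)) {
          /* ERR_PTR(), not NULL, signals failure here. */
          ret = PTR_ERR(helper);
          goto err_free;          /* take the existing error path */
  }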
2025-11-20  Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net  (Jakub Kicinski)
Cross-merge networking fixes after downstream PR (net-6.18-rc7).

No conflicts, adjacent changes:

tools/testing/selftests/net/af_unix/Makefile
  e1bb28bf13f4 ("selftest: af_unix: Add test for SO_PEEK_OFF.")
  45a1cd8346ca ("selftests: af_unix: Add tests for ECONNRESET and EOF semantics")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-20  sched/deadline: Walk up cpuset hierarchy to decide root domain when hot-unplug  (Pingfan Liu)
*** Bug description ***

When testing kexec-reboot on a 144-cpu machine with isolcpus=managed_irq,domain,1-71,73-143 on the kernel command line, I encountered the following bug:

  [ 97.114759] psci: CPU142 killed (polled 0 ms)
  [ 97.333236] Failed to offline CPU143 - error=-16
  [ 97.333246] ------------[ cut here ]------------
  [ 97.342682] kernel BUG at kernel/cpu.c:1569!
  [ 97.347049] Internal error: Oops - BUG: 00000000f2000800 [#1] SMP
  [...]

In essence, the issue originates from the CPU hot-removal process and is not limited to kexec. It can be reproduced by writing a SCHED_DEADLINE program that waits indefinitely on a semaphore, spawning multiple instances to ensure some run on CPU 72, and then offlining CPUs 1–143 one by one. When attempting this, CPU 143 failed to go offline.

  bash -c 'taskset -cp 0 $$ && for i in {1..143}; do echo 0 > /sys/devices/system/cpu/cpu$i/online 2>/dev/null; done'

Tracking down this issue, I found that dl_bw_deactivate() returned -EBUSY, which caused sched_cpu_deactivate() to fail on the last CPU. But that is not actually the case; the failure is contributed by the following factors: when a CPU is inactive, cpu_rq()->rd is set to def_root_domain. For a blocked-state deadline task (in this case, "cppc_fie"), it was not migrated to CPU0, and its task_rq() information is stale. So its rq->rd points to def_root_domain instead of the one shared with CPU0. As a result, its bandwidth is wrongly accounted into the wrong root domain during the domain rebuild.

*** Issue ***

The key point is that root_domain is only tracked through active rq->rd. To avoid using a global data structure to track all root_domains in the system, there should be a method to locate an active CPU within the corresponding root_domain.

*** Solution ***

To locate the active cpu, the following rules of the deadline sub-system are useful:

-1. any cpu belongs to a unique root domain at a given time
-2. the DL bandwidth checker ensures that the root domain has active cpus

Now, let's examine the blocked-state task P. If P is attached to a cpuset that is a partition root, it is straightforward to find an active CPU. If P is attached to a cpuset that has changed from 'root' to 'member', the active CPUs are grouped into the parent root domain. Naturally, the CPUs' capacity and reserved DL bandwidth are taken into account in the ancestor root domain. (In practice, it may be unsafe to attach P to an arbitrary root domain, since that domain may lack sufficient DL bandwidth for P.) Again, it is straightforward to find an active CPU in the ancestor root domain.

This patch groups CPUs into isolated and housekeeping sets. For the housekeeping group, it walks up the cpuset hierarchy to find active CPUs in P's root domain and retrieves the valid rd from cpu_rq(cpu)->rd.

Signed-off-by: Pingfan Liu <piliu@redhat.com>
Cc: Waiman Long <longman@redhat.com>
Cc: Chen Ridong <chenridong@huaweicloud.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Pierre Gondois <pierre.gondois@arm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Valentin Schneider <vschneid@redhat.com>
To: linux-kernel@vger.kernel.org
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-11-20  cgroup/cpuset: Introduce cpuset_cpus_allowed_locked()  (Pingfan Liu)
cpuset_cpus_allowed() uses a reader lock that is sleepable under RT, which means it cannot be called inside raw_spin_lock_t context. Introduce a new cpuset_cpus_allowed_locked() helper that performs the same function as cpuset_cpus_allowed() except that the caller must have acquired the cpuset_mutex so that no further locking will be needed. Suggested-by: Waiman Long <longman@redhat.com> Signed-off-by: Pingfan Liu <piliu@redhat.com> Cc: Waiman Long <longman@redhat.com> Cc: Tejun Heo <tj@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Koutný <mkoutny@suse.com> Cc: linux-kernel@vger.kernel.org To: cgroups@vger.kernel.org Reviewed-by: Chen Ridong <chenridong@huawei.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-11-20  timekeeping: Fix resource leak in tk_aux_sysfs_init() error paths  (Malaya Kumar Rout)
tk_aux_sysfs_init() returns immediately on error during the auxiliary clock initialization loop without cleaning up previously allocated kobjects and sysfs groups. If kobject_create_and_add() or sysfs_create_group() fails during loop iteration, the parent kobjects (tko and auxo) and any previously created child kobjects are leaked. Fix this by adding proper error handling with goto labels to ensure all allocated resources are cleaned up on failure. kobject_put() on the parent kobjects will handle cleanup of their children. Fixes: 7b95663a3d96 ("timekeeping: Provide interface to control auxiliary clocks") Signed-off-by: Malaya Kumar Rout <mrout@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://patch.msgid.link/20251120150213.246777-1-mrout@redhat.com
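The standard shape of such unwinding with kobject_put(); a sketch only, with illustrative kobject names and attribute group (the real tk_aux_sysfs_init() creates several per-clock child kobjects in a loop):

  static int __init aux_sysfs_init_sketch(void)
  {
          struct kobject *tko, *auxo;

          tko = kobject_create_and_add("time", kernel_kobj);
          if (!tko)
                  return -ENOMEM;

          auxo = kobject_create_and_add("aux_clocks", tko);
          if (!auxo)
                  goto err_tko;

          if (sysfs_create_group(auxo, &aux_attr_group))
                  goto err_auxo;

          return 0;

  err_auxo:
          kobject_put(auxo);      /* releases auxo and any children created so far */
  err_tko:
          kobject_put(tko);
          return -ENOMEM;
  }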
2025-11-20  sched/mmcid: Use cpumask_weighted_or()  (Thomas Gleixner)
Use cpumask_weighted_or() instead of cpumask_or() and cpumask_weight() on the result, which walks the same bitmap twice. Results in 10-20% less cycles, which reduces the runqueue lock hold time. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Acked-by: Yury Norov (NVIDIA) <yury.norov@gmail.com> Link: https://patch.msgid.link/20251119172549.511736272@linutronix.de
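The transformation is mechanical; roughly (a sketch, assuming cpumask_weighted_or() returns the weight of the resulting mask; the mask names are illustrative):

  /* Before: the union and its weight require two walks over the bitmap. */
  cpumask_or(dstmask, dstmask, srcmask);
  weight = cpumask_weight(dstmask);

  /* After: one walk computes both. */
  weight = cpumask_weighted_or(dstmask, dstmask, srcmask);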
2025-11-20  sched/mmcid: Prevent pointless work in mm_update_cpus_allowed()  (Thomas Gleixner)
mm_update_cpus_allowed() is not required to be invoked for affinity changes due to migrate_disable() and migrate_enable(). migrate_disable() restricts the task temporarily to a CPU on which the task was already allowed to run, so nothing changes. migrate_enable() restores the actual task affinity mask. If that mask changed between migrate_disable() and migrate_enable() then that change was already accounted for. Move the invocation to the proper place to avoid that. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20251119172549.385208276@linutronix.de
2025-11-20  sched/mmcid: Move scheduler code out of global header  (Thomas Gleixner)
This is only used in the scheduler core code, so there is no point to have it in a global header. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Acked-by: Yury Norov (NVIDIA) <yury.norov@gmail.com> Link: https://patch.msgid.link/20251119172549.321259077@linutronix.de
2025-11-20  sched: Fixup whitespace damage  (Thomas Gleixner)
With whitespace checks enabled in the editor this makes eyes bleed. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20251119172549.258651925@linutronix.de
2025-11-20  sched/mmcid: Use proper data structures  (Thomas Gleixner)
Having a lot of CID functionality specific members in struct task_struct and struct mm_struct is not really making the code easier to read. Encapsulate the CID specific parts in data structures and keep them separate from the stuff they are embedded in. No functional change. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20251119172549.131573768@linutronix.de
2025-11-20  sched/mmcid: Revert the complex CID management  (Thomas Gleixner)
The CID management is a complex beast, which affects both scheduling and task migration. The compaction mechanism forces random tasks of a process into task work on exit to user space, causing latency spikes. Revert back to the initial simple bitmap-allocating mechanics, which are known to have scalability issues, as that allows a replacement functionality to be built up gradually in a reviewable way. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20251119172549.068197830@linutronix.de
2025-11-20  perf: Fix 0 count issue of cpu-clock  (Dapeng Mi)
Currently the cpu-clock event always returns a 0 count, e.g.:

  perf stat -e cpu-clock -- sleep 1

  Performance counter stats for 'sleep 1':
              0      cpu-clock        #    0.000 CPUs utilized
    1.002308394 seconds time elapsed

The root cause is that commit bc4394e5e79c ("perf: Fix the throttle error of some clock events") added a PERF_EF_UPDATE flag check before calling cpu_clock_event_update() to update the count; however, the PERF_EF_UPDATE flag is never set when the cpu-clock event is stopped in counting mode (pmu->del() -> cpu_clock_event_del() -> cpu_clock_event_stop()). This leads to the cpu-clock event count never being updated.

To fix this issue, force the PERF_EF_UPDATE flag to be set for the cpu-clock event, just like task-clock does.

Fixes: bc4394e5e79c ("perf: Fix the throttle error of some clock events")
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ian Rogers <irogers@google.com>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Link: https://patch.msgid.link/20251112080526.3971392-1-dapeng1.mi@linux.intel.com
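In essence the fix mirrors what the task-clock event already does on removal (a sketch of the idea, not necessarily the literal diff):

  static void cpu_clock_event_del(struct perf_event *event, int flags)
  {
          /* Force a final count update, as task-clock already does. */
          cpu_clock_event_stop(event, flags | PERF_EF_UPDATE);
  }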
2025-11-19  tick/sched: Fix bogus condition in report_idle_softirq()  (Wen Yang)
In commit 0345691b24c0 ("tick/rcu: Stop allowing RCU_SOFTIRQ in idle") the new function report_idle_softirq() was created by breaking code out of the existing can_stop_idle_tick() for kernels v5.18 and newer. In doing so, the code essentially went from this form:

  if (A) {
          static int ratelimit;

          if (ratelimit < 10 && !C && A&D) {
                  pr_warn("NOHZ tick-stop error: ...");
                  ratelimit++;
          }
          return false;
  }

to a new function:

  static bool report_idle_softirq(void)
  {
          static int ratelimit;

          if (likely(!A))
                  return false;

          if (ratelimit < 10)
                  return false;
          ...
          pr_warn("NOHZ tick-stop error: local softirq work is pending, handler #%02x!!!\n", pending);
          ratelimit++;
          return true;
  }

Commit a7e282c77785 ("tick/rcu: Fix bogus ratelimit condition") realized that ratelimit was essentially set to zero instead of ten, and hence *no* softirq pending messages would ever be issued, but "fixed" it as:

  -	if (ratelimit < 10)
  +	if (ratelimit >= 10)
  		return false;

However, this fix introduced another issue: when ratelimit is greater than or equal to 10, even if A is true, the function will directly return false, whereas ratelimit in the original code was only used to control printing and did not affect the return value.

Restore the original logic and restrict ratelimit to controlling the printk and not the return value.

Fixes: 0345691b24c0 ("tick/rcu: Stop allowing RCU_SOFTIRQ in idle")
Fixes: a7e282c77785 ("tick/rcu: Fix bogus ratelimit condition")
Signed-off-by: Wen Yang <wen.yang@linux.dev>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://patch.msgid.link/20251119174525.29470-1-wen.yang@linux.dev
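With the restored logic, the function ends up roughly as follows, with ratelimit gating only the warning (a sketch):

  static bool report_idle_softirq(void)
  {
          static int ratelimit;
          unsigned int pending = local_softirq_pending();

          if (likely(!pending))
                  return false;

          /* ... existing exceptions (boot, RT-offloaded softirqs) ... */

          if (ratelimit < 10) {
                  pr_warn("NOHZ tick-stop error: local softirq work is pending, handler #%02x!!!\n",
                          pending);
                  ratelimit++;
          }
          return true;
  }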
2025-11-19  smp: Introduce a helper function to check for pending IPIs  (Ulf Hansson)
When governors used during cpuidle try to find the most optimal idle state for a CPU or a group of CPUs, they are known to quite often fail. One reason for this is, that they are not taking into account whether there has been an IPI scheduled for any of the CPUs that are affected by the selected idle state. To enable pending IPIs to be taken into account for cpuidle decisions, introduce a new helper function, cpus_peek_for_pending_ipi(). Suggested-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
2025-11-19  printk: Avoid scheduling irq_work on suspend  (John Ogness)
Allowing irq_work to be scheduled while trying to suspend has shown to cause problems as some architectures interpret the pending interrupts as a reason to not suspend. This became a problem for printk() with the introduction of NBCON consoles. With every printk() call, NBCON console printing kthreads are woken by queueing irq_work. This means that irq_work continues to be queued due to printk() calls late in the suspend procedure. Avoid this problem by preventing printk() from queueing irq_work once console suspending has begun. This applies to triggering NBCON and legacy deferred printing as well as klogd waiters. Since triggering of NBCON threaded printing relies on irq_work, the pr_flush() within console_suspend_all() is used to perform the final flushing before suspending consoles and blocking irq_work queueing. NBCON consoles that are not suspended (due to the usage of the "no_console_suspend" boot argument) transition to atomic flushing. Introduce a new global variable @console_irqwork_blocked to flag when irq_work queueing is to be avoided. The flag is used by printk_get_console_flush_type() to avoid allowing deferred printing and switch NBCON consoles to atomic flushing. It is also used by vprintk_emit() to avoid klogd waking. Add WARN_ON_ONCE(console_irqwork_blocked) to the irq_work queuing functions to catch any code that attempts to queue printk irq_work during the suspending/resuming procedure. Cc: stable@vger.kernel.org # 6.13.x because no drivers in 6.12.x Fixes: 6b93bb41f6ea ("printk: Add non-BKL (nbcon) console basic infrastructure") Closes: https://lore.kernel.org/lkml/DB9PR04MB8429E7DDF2D93C2695DE401D92C4A@DB9PR04MB8429.eurprd04.prod.outlook.com Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Tested-by: Sherry Sun <sherry.sun@nxp.com> Link: https://patch.msgid.link/20251113160351.113031-3-john.ogness@linutronix.de Signed-off-by: Petr Mladek <pmladek@suse.com>
2025-11-19  printk: Allow printk_trigger_flush() to flush all types  (John Ogness)
Currently printk_trigger_flush() only triggers legacy offloaded flushing, even if that may not be the appropriate method to flush for currently registered consoles. (The function predates the NBCON consoles.) Since commit 6690d6b52726 ("printk: Add helper for flush type logic") there is printk_get_console_flush_type(), which also considers NBCON consoles and reports all the methods of flushing appropriate based on the system state and consoles available. Update printk_trigger_flush() to use printk_get_console_flush_type() to appropriately flush registered consoles. Suggested-by: Petr Mladek <pmladek@suse.com> Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/stable/20251113160351.113031-2-john.ogness%40linutronix.de Tested-by: Sherry Sun <sherry.sun@nxp.com> Link: https://patch.msgid.link/20251113160351.113031-2-john.ogness@linutronix.de Signed-off-by: Petr Mladek <pmladek@suse.com>
2025-11-19  ima: Access decompressed kernel module to verify appended signature  (Coiby Xu)
Currently, when in-kernel module decompression (CONFIG_MODULE_DECOMPRESS) is enabled, IMA has no way to verify the appended module signature as it can't decompress the module. Define a new kernel_read_file_id enumerator READING_MODULE_COMPRESSED so IMA can calculate the compressed kernel module data hash on READING_MODULE_COMPRESSED and defer appraising/measuring it until READING_MODULE, when the module has been decompressed. Before enabling in-kernel module decompression, a kernel module in the initramfs can still be loaded with ima_policy=secure_boot. So adjust the kernel module rule in the secure_boot policy to allow either an IMA signature OR an appended signature, i.e. to use "appraise func=MODULE_CHECK appraise_type=imasig|modsig". Reported-by: Karel Srot <ksrot@redhat.com> Suggested-by: Mimi Zohar <zohar@linux.ibm.com> Suggested-by: Paul Moore <paul@paul-moore.com> Signed-off-by: Coiby Xu <coxu@redhat.com> Signed-off-by: Mimi Zohar <zohar@linux.ibm.com>
2025-11-19  tracing: Switch to use %ptSp  (Andy Shevchenko)
Use %ptSp instead of open coded variants to print content of struct timespec64 in human readable format. Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Link: https://patch.msgid.link/20251113150217.3030010-22-andriy.shevchenko@linux.intel.com Signed-off-by: Petr Mladek <pmladek@suse.com>
2025-11-19  watch_queue: Use local kmap in post_one_notification()  (Davidlohr Bueso)
Replace the now deprecated kmap_atomic() with kmap_local_page(). Optimize for the non-highmem cases and avoid disabling preemption and pagefaults; the caller's context is atomic anyway, but that is irrelevant to kmap. The memcpy itself does not require any such semantics, and the mapping would hold valid across context switches anyway. Further, highmem is planned to be removed [1]. [1] https://lore.kernel.org/all/4ff89b72-03ff-4447-9d21-dd6a5fe1550f@app.fastmail.com/ Signed-off-by: Davidlohr Bueso <dave@stgolabs.net> Link: https://patch.msgid.link/20251118210706.1816303-1-dave@stgolabs.net Signed-off-by: Christian Brauner <brauner@kernel.org>
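The replacement follows the usual conversion pattern (a sketch; the variable names are illustrative):

  /* Before: disables preemption and pagefaults as a side effect. */
  p = kmap_atomic(page);
  memcpy(p + offset, data, len);
  kunmap_atomic(p);

  /* After: plain local mapping, a no-op kmap on !HIGHMEM configurations. */
  p = kmap_local_page(page);
  memcpy(p + offset, data, len);
  kunmap_local(p);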
2025-11-18  bpf: Replace bpf memory allocator with kmalloc_nolock() in local storage  (Amery Hung)
Replace the bpf memory allocator with kmalloc_nolock() to reduce memory wastage due to preallocation. In bpf_selem_free(), an selem now needs to wait for an RCU grace period before being freed when reuse_now == true. Therefore, rcu_barrier() should always be called in bpf_local_storage_map_free(). In bpf_local_storage_free(), since smap->storage_ma is no longer needed to return the memory, the function is now independent of smap. Remove the outdated comment in bpf_local_storage_alloc(). We already free the selem after an RCU grace period in bpf_local_storage_update() when bpf_local_storage_alloc() failed the cmpxchg, since commit c0d63f309186 ("bpf: Add bpf_selem_free()"). Signed-off-by: Amery Hung <ameryhung@gmail.com> Reviewed-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20251114201329.3275875-5-ameryhung@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-11-18  bpf: Save memory allocation info in bpf_local_storage  (Amery Hung)
Save the memory allocation method used for bpf_local_storage in the struct explicitly so that we don't need to go through the hassle of finding out the info. When a later patch replaces the BPF memory allocator with kmalloc_nolock(), bpf_local_storage_free() will no longer need smap->storage_ma to return the memory, completely removing the dependency on smap in bpf_local_storage_free(). Signed-off-by: Amery Hung <ameryhung@gmail.com> Reviewed-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20251114201329.3275875-4-ameryhung@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-11-18  bpf: Remove smap argument from bpf_selem_free()  (Amery Hung)
Since selem already saves a pointer to smap, use it instead of an additional argument in bpf_selem_free(). This requires moving the SDATA(selem)->smap assignment from bpf_selem_link_map() to bpf_selem_alloc() since bpf_selem_free() may be called without the selem being linked to smap in bpf_local_storage_update(). Signed-off-by: Amery Hung <ameryhung@gmail.com> Reviewed-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20251114201329.3275875-3-ameryhung@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-11-18  bpf: Always charge/uncharge memory when allocating/unlinking storage elements  (Amery Hung)
Since commit a96a44aba556 ("bpf: bpf_sk_storage: Fix invalid wait context lockdep report"), {charge,uncharge}_mem are always true when allocating a bpf_local_storage_elem or unlinking a bpf_local_storage_elem from local storage, so drop these arguments. No functional change. Signed-off-by: Amery Hung <ameryhung@gmail.com> Reviewed-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20251114201329.3275875-2-ameryhung@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-11-18  genirq: Use raw_spinlock_irq() in irq_set_affinity_notifier()  (Chengkaitao)
Since irq_set_affinity_notifier() may sleep, it is always called with interrupts enabled, so raw_spinlock_irqsave() can be replaced with raw_spinlock_irq(). Signed-off-by: Chengkaitao <chengkaitao@kylinos.cn> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://patch.msgid.link/20251118012754.61805-1-pilgrimtao@gmail.com
2025-11-17  bpf: Fix invalid prog->stats access when update_effective_progs fails  (Pu Lehui)
Syzkaller triggers an invalid memory access issue following fault injection in update_effective_progs. The issue can be described as follows:

  __cgroup_bpf_detach
    update_effective_progs
      compute_effective_progs
        bpf_prog_array_alloc <-- fault inject
    purge_effective_progs
      /* change to dummy_bpf_prog */
      array->items[index] = &dummy_bpf_prog.prog

  ---softirq start---
  __do_softirq
    ...
    __cgroup_bpf_run_filter_skb
      __bpf_prog_run_save_cb
        bpf_prog_run
          stats = this_cpu_ptr(prog->stats) /* invalid memory access */
          flags = u64_stats_update_begin_irqsave(&stats->syncp)
  ---softirq end---

    static_branch_dec(&cgroup_bpf_enabled_key[atype])

The reason is that fault injection caused update_effective_progs to fail and then changed the original prog into dummy_bpf_prog.prog in purge_effective_progs. Then a softirq came, and accessing the members of dummy_bpf_prog.prog in the softirq triggers invalid mem access.

To fix it, skip updating stats when stats is NULL.

Fixes: 492ecee892c2 ("bpf: enable program stats")
Signed-off-by: Pu Lehui <pulehui@huawei.com>
Link: https://lore.kernel.org/r/20251115102343.2200727-1-pulehui@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
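The guard amounts to skipping the per-CPU stats update when the program has no stats area, as is the case for dummy_bpf_prog; a simplified sketch of that pattern (run_time_ns and the surrounding context are illustrative, not the literal fix):

  struct bpf_prog_stats *stats;
  unsigned long flags;

  if (unlikely(!prog->stats))
          return;                         /* dummy_bpf_prog has no stats */

  stats = this_cpu_ptr(prog->stats);
  flags = u64_stats_update_begin_irqsave(&stats->syncp);
  u64_stats_inc(&stats->cnt);
  u64_stats_add(&stats->nsecs, run_time_ns);
  u64_stats_update_end_irqrestore(&stats->syncp, flags);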
2025-11-17  PM: hibernate: Clean up kernel-doc comment style usage  (Sunday Adelodun)
Several static functions in kernel/power/swap.c were described using the kernel-doc comment style (/** ... */) even though they are not exported or referenced by generated documentation. This led to kernel-doc warnings and stylistic inconsistencies. Convert these unnecessary kernel-doc blocks to regular C comments, remove comment blocks that are no longer useful, relocate comments to more appropriate positions where needed, and fix a few "Return:" descriptions that were either missing or incorrectly formatted. No functional changes. Signed-off-by: Sunday Adelodun <adelodunolaoluwa@yahoo.com> [ rjw: Subject adjustment, changelog edits, comment edits ] Link: https://patch.msgid.link/20251114220438.52448-1-adelodunolaoluwa@yahoo.com Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2025-11-17  Merge tag 'vfs-6.18-rc7.fixes' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs  (Linus Torvalds)
Pull vfs fixes from Christian Brauner:

 - Fix uninitialized variable in statmount_string()
 - Fix hostfs mounting when passing host root during boot
 - Fix dynamic lookup to fail on cell lookup failure
 - Fix missing file type when reading bfs inodes from disk
 - Enforce checking of sb_min_blocksize() calls and update all callers accordingly
 - Restore write access before closing files opened by open_exec() in binfmt_misc
 - Always freeze efivarfs during suspend/hibernate cycles
 - Fix statmount()'s and listmount()'s grab_requested_mnt_ns() helper to actually allow mount namespace file descriptors in addition to mount namespace ids
 - Fix tmpfs remount when noswap is specified
 - Switch Landlock to iput_not_last() to remove false positives from might_sleep() annotations in iput()
 - Remove dead node_to_mnt_ns() code
 - Ensure that per-queue kobjects are successfully created

* tag 'vfs-6.18-rc7.fixes' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs:
  landlock: fix splats from iput() after it started calling might_sleep()
  fs: add iput_not_last()
  shmem: fix tmpfs reconfiguration (remount) when noswap is set
  fs/namespace: correctly handle errors returned by grab_requested_mnt_ns
  power: always freeze efivarfs
  binfmt_misc: restore write access before closing files opened by open_exec()
  block: add __must_check attribute to sb_min_blocksize()
  virtio-fs: fix incorrect check for fsvq->kobj
  xfs: check the return value of sb_min_blocksize() in xfs_fs_fill_super
  isofs: check the return value of sb_min_blocksize() in isofs_fill_super
  exfat: check return value of sb_min_blocksize in exfat_read_boot_sector
  vfat: fix missing sb_min_blocksize() return value checks
  mnt: Remove dead code which might prevent from building
  bfs: Reconstruct file type when loading from disk
  afs: Fix dynamic lookup to fail on cell lookup failure
  hostfs: Fix only passing host root in boot stage with new mount
  fs: Fix uninitialized 'offp' in statmount_string()
2025-11-17  Merge tag 'sched_ext-for-6.18-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext  (Linus Torvalds)
Pull sched_ext fixes from Tejun Heo:
 "Five fixes addressing PREEMPT_RT compatibility and locking issues.

  Three commits fix potential deadlocks and sleeps in atomic contexts on RT kernels by converting locks to raw spinlocks and ensuring IRQ work runs in hard-irq context. The remaining two fix unsafe locking in the debug dump path and a variable dereference typo"

* tag 'sched_ext-for-6.18-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
  sched_ext: Use IRQ_WORK_INIT_HARD() to initialize rq->scx.kick_cpus_irq_work
  sched_ext: Fix possible deadlock in the deferred_irq_workfn()
  sched/ext: convert scx_tasks_lock to raw spinlock
  sched_ext: Fix unsafe locking in the scx_dump_state()
  sched_ext: Fix use of uninitialized variable in scx_bpf_cpuperf_set()
2025-11-17  sched/fair: Proportional newidle balance  (Peter Zijlstra)
Add a randomized algorithm that runs newidle balancing proportionally to its success rate.

This improves schbench significantly:

  6.18-rc4:                2.22 Mrps/s
  6.18-rc4+revert:         2.04 Mrps/s
  6.18-rc4+revert+random:  2.18 Mrps/s

Conversely, per Adam Li this affects SpecJBB slightly, reducing it by 1%:

  6.17:                    -6%
  6.17+revert:              0%
  6.17+revert+random:      -1%

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Chris Mason <clm@meta.com>
Link: https://lkml.kernel.org/r/6825c50d-7fa7-45d8-9b81-c6e7e25738e2@meta.com
Link: https://patch.msgid.link/20251107161739.770122091@infradead.org
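One way to picture "proportional to its success rate" is a weighted coin flip based on how often recent newidle balancing actually pulled a task; a purely illustrative sketch (the counters and helper are assumptions, not the actual implementation):

  /*
   * Sketch: decide whether to attempt newidle balancing, with probability
   * roughly equal to its recent success rate.
   */
  static bool newidle_balance_worth_trying(unsigned int attempts,
                                           unsigned int successes)
  {
          if (!attempts)
                  return true;            /* no history yet, just try */
          return get_random_u32_below(attempts) < successes;
  }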
2025-11-17  sched/fair: Small cleanup to update_newidle_cost()  (Peter Zijlstra)
Simplify code by adding a few variables. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Tested-by: Chris Mason <clm@meta.com> Link: https://patch.msgid.link/20251107161739.655208666@infradead.org
2025-11-17  sched/fair: Small cleanup to sched_balance_newidle()  (Peter Zijlstra)
Pull out the !sd check to simplify code. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Tested-by: Chris Mason <clm@meta.com> Link: https://patch.msgid.link/20251107161739.525916173@infradead.org