path: root/kernel
Age          Commit message                                                      Author
2025-10-21  bpf: save the start of functions in bpf_prog_aux  (Anton Protopopov)
Introduce a new subprog_start field in bpf_prog_aux. This field may be used by JIT compilers wanting to know the real absolute xlated offset of the function being jitted. The func_info[func_id] may have served this purpose, but func_info may be NULL, so JIT compilers can't rely on it. Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20251019202145.3944697-3-a.s.protopopov@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-10-21  bpf: fix the return value of push_stack  (Anton Protopopov)
In [1] Eduard mentioned that on push_stack failure verifier code should return -ENOMEM instead of -EFAULT. After checking with the other call sites I've found that code randomly returns either -ENOMEM or -EFAULT. This patch unifies the return values for the push_stack (and similar push_async_cb) functions such that error codes are always assigned properly. [1] https://lore.kernel.org/bpf/20250615085943.3871208-1-a.s.protopopov@gmail.com Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20251019202145.3944697-2-a.s.protopopov@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
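For illustration, the caller-side convention the changelog unifies, assuming push_stack() keeps returning NULL on allocation failure as it does today (the call shown is the verifier's conditional-jump handler; not the actual patch hunk):

  other_branch = push_stack(env, *insn_idx + insn->off + 1, *insn_idx, false);
  if (!other_branch)
          return -ENOMEM;   /* allocation failure, not a verifier-internal bug (-EFAULT) */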
2025-10-21  bpf: Sync pending IRQ work before freeing ring buffer  (Noorain Eqbal)
Fix a race where irq_work can be queued in bpf_ringbuf_commit() but the ring buffer is freed before the work executes. In the syzbot reproducer, a BPF program attached to sched_switch triggers bpf_ringbuf_commit(), queuing an irq_work. If the ring buffer is freed before this work executes, the irq_work thread may access freed memory. Calling `irq_work_sync(&rb->work)` ensures that any pending irq_work completes before the buffer is freed. Fixes: 457f44363a88 ("bpf: Implement BPF ring buffer and verifier support for it") Reported-by: syzbot+2617fc732430968b45d2@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=2617fc732430968b45d2 Tested-by: syzbot+2617fc732430968b45d2@syzkaller.appspotmail.com Signed-off-by: Noorain Eqbal <nooraineqbal@gmail.com> Link: https://lore.kernel.org/r/20251020180301.103366-1-nooraineqbal@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
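A minimal sketch of the ordering this fix establishes; only irq_work_sync(&rb->work) is named in the changelog, the rest of the teardown is abridged:

  static void bpf_ringbuf_free(struct bpf_ringbuf *rb)
  {
          /* Wait for irq_work queued by bpf_ringbuf_commit() to finish,
           * so it can no longer touch the buffer once it is freed. */
          irq_work_sync(&rb->work);

          /* ... unmap and release the ring buffer pages ... */
  }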
2025-10-21  bpf: Clarify get_outer_instance() handling in propagate_to_outer_instance()  (Shardul Bankar)
propagate_to_outer_instance() calls get_outer_instance() and uses the returned pointer to reset and commit stack write marks. Under normal conditions, update_instance() guarantees that an outer instance exists, so get_outer_instance() cannot return an ERR_PTR. However, explicitly checking for IS_ERR(outer_instance) makes this code more robust and self-documenting. It reduces cognitive load when reading the control flow and silences potential false-positive reports from static analysis or automated tooling. No functional change intended. Signed-off-by: Shardul Bankar <shardulsb08@gmail.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20251021080849.860072-1-shardulsb08@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-10-21  seqlock: Change thread_group_cputime() to use scoped_seqlock_read()  (Oleg Nesterov)
To simplify the code and make it more readable. While at it, change thread_group_cputime() to use __for_each_thread(sig). [peterz: update to new interface] Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2025-10-21  locking/spinlock/debug: Fix data-race in do_raw_write_lock  (Alexander Sverdlin)
KCSAN reports:

  BUG: KCSAN: data-race in do_raw_write_lock / do_raw_write_lock

  write (marked) to 0xffff800009cf504c of 4 bytes by task 1102 on cpu 1:
   do_raw_write_lock+0x120/0x204
   _raw_write_lock_irq
   do_exit
   call_usermodehelper_exec_async
   ret_from_fork

  read to 0xffff800009cf504c of 4 bytes by task 1103 on cpu 0:
   do_raw_write_lock+0x88/0x204
   _raw_write_lock_irq
   do_exit
   call_usermodehelper_exec_async
   ret_from_fork

  value changed: 0xffffffff -> 0x00000001

  Reported by Kernel Concurrency Sanitizer on:
  CPU: 0 PID: 1103 Comm: kworker/u4:1 6.1.111

Commit 1a365e822372 ("locking/spinlock/debug: Fix various data races") has addressed most of these races, but it seems to be incomplete: in do_raw_write_lock() only the debug_write_lock_after() part has been converted to WRITE_ONCE(), but not the debug_write_lock_before() part. Do it now. Fixes: 1a365e822372 ("locking/spinlock/debug: Fix various data races") Reported-by: Adrian Freihofer <adrian.freihofer@siemens.com> Signed-off-by: Alexander Sverdlin <alexander.sverdlin@siemens.com> Signed-off-by: Boqun Feng <boqun.feng@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Acked-by: Waiman Long <longman@redhat.com> Cc: stable@vger.kernel.org
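A hedged sketch of what "do it now" amounts to: debug_write_lock_before() only reads the owner fields, so the marked accesses would be READ_ONCE(), mirroring the WRITE_ONCE() already used in debug_write_lock_after(); the actual hunk may differ:

  static inline void debug_write_lock_before(rwlock_t *lock)
  {
          RWLOCK_BUG_ON(lock->magic != RWLOCK_MAGIC, lock, "bad magic");
          RWLOCK_BUG_ON(READ_ONCE(lock->owner) == current, lock, "recursion");
          RWLOCK_BUG_ON(READ_ONCE(lock->owner_cpu) == raw_smp_processor_id(),
                        lock, "cpu recursion");
  }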
2025-10-20  Merge tag 'cgroup-for-6.18-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup  (Linus Torvalds)
Pull cgroup fixes from Tejun Heo:

 - Fix seqcount lockdep assertion failure in cgroup freezer on PREEMPT_RT. Plain seqcount_t expects preemption disabled, but PREEMPT_RT spinlocks don't disable preemption. Switch to seqcount_spinlock_t to properly associate css_set_lock with the freeze timing seqcount.

 - Misc changes including kernel-doc warning fix for misc_res_type enum and improved selftest diagnostics.

* tag 'cgroup-for-6.18-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup/misc: fix misc_res_type kernel-doc warning
  selftests: cgroup: Use values_close_report in test_cpu
  selftests: cgroup: add values_close_report helper
  cgroup: Fix seqcount lockdep assertion in cgroup freezer
2025-10-20  PM: hibernate: Rework message printing in swsusp_save()  (Rafael J. Wysocki)
The messages printed by swsusp_save() are basically only useful for debug, so printing them every time a hibernation image is created at the "info" log level is not particularly useful. Also printing a message on a failing memory allocation is redundant. Use pm_deferred_pr_dbg() for printing those messages so they will only be printed when requested; the "deferred" variant is used because this code runs in a deeply atomic context (one CPU with interrupts off, no functional devices). Also drop the useless message printed when a memory allocation fails. While at it, extend one of the messages in question so it is less cryptic. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> [ rjw: Dropped a useless colon at the end of one of the messages ] Link: https://patch.msgid.link/10750389.nUPlyArG6x@rafael.j.wysocki
2025-10-20  genirq/msi: Slightly simplify msi_domain_alloc()  (Christophe JAILLET)
The return value of irq_find_mapping() is only tested, not used for anything else. Replace it with irq_resolve_mapping(), which is used internally by irq_find_mapping() and allows a simple boolean decision. Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://patch.msgid.link/1ce680114cdb8d40b072c54d7f015696a540e5a6.1760863194.git.christophe.jaillet@wanadoo.fr
2025-10-20  timekeeping: Fix aux clocks sysfs initialization loop bound  (Haofeng Li)
The loop in tk_aux_sysfs_init() uses `i <= MAX_AUX_CLOCKS` as the termination condition, which results in 9 iterations (i=0 to 8) when MAX_AUX_CLOCKS is defined as 8. However, the kernel is designed to support only up to 8 auxiliary clocks. This off-by-one error causes the creation of a 9th sysfs entry that exceeds the intended auxiliary clock range. Fix the loop bound to use `i < MAX_AUX_CLOCKS` to ensure exactly 8 auxiliary clock entries are created, matching the design specification. Fixes: 7b95663a3d96 ("timekeeping: Provide interface to control auxiliary clocks") Signed-off-by: Haofeng Li <lihaofeng@kylinos.cn> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://patch.msgid.link/tencent_2376993D9FC06A3616A4F981B3DE1C599607@qq.com
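The corrected loop shape, for illustration (the entry-creation body in tk_aux_sysfs_init() is elided):

  for (i = 0; i < MAX_AUX_CLOCKS; i++) {          /* was: i <= MAX_AUX_CLOCKS */
          /* ... create the sysfs entry for auxiliary clock i ... */
  }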
2025-10-20  cgroup/cpuset: Don't track # of local child partitions  (Waiman Long)
The cpuset structure has a nr_subparts field which tracks the number of child local partitions underneath a particular cpuset. Right now, nr_subparts is only used in partition_is_populated() to avoid iteration of child cpusets if the condition is right. So by always performing the child iteration, we can avoid tracking the number of child partitions and simplify the code a bit. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-10-20  rv: Make rtapp/pagefault monitor depends on CONFIG_MMU  (Nam Cao)
There is no page fault without an MMU. Compiling the rtapp/pagefault monitor without CONFIG_MMU fails as the page fault tracepoints' definitions are not available. Make the rtapp/pagefault monitor depend on CONFIG_MMU. Fixes: 9162620eb604 ("rv: Add rtapp_pagefault monitor") Signed-off-by: Nam Cao <namcao@linutronix.de> Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202509260455.6Z9Vkty4-lkp@intel.com/ Cc: stable@vger.kernel.org Reviewed-by: Gabriele Monaco <gmonaco@redhat.com> Link: https://lore.kernel.org/r/20251002082317.973839-1-namcao@linutronix.de Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
2025-10-20  rv: Fully convert enabled_monitors to use list_head as iterator  (Nam Cao)
The callbacks in enabled_monitors_seq_ops are inconsistent. Some treat the iterator as struct rv_monitor *, while others treat the iterator as struct list_head *. This causes a wrong type cast and crashes the system as reported by Nathan. Convert everything to use struct list_head * as iterator. This also makes enabled_monitors consistent with available_monitors. Fixes: de090d1ccae1 ("rv: Fix wrong type cast in enabled_monitors_next()") Reported-by: Nathan Chancellor <nathan@kernel.org> Closes: https://lore.kernel.org/linux-trace-kernel/20250923002004.GA2836051@ax162/ Signed-off-by: Nam Cao <namcao@linutronix.de> Cc: stable@vger.kernel.org Reviewed-by: Gabriele Monaco <gmonaco@redhat.com> Link: https://lore.kernel.org/r/20251002082235.973099-1-namcao@linutronix.de Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
2025-10-19  Merge tag 'sched_urgent_for_v6.18_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds)
Pull scheduler fixes from Borislav Petkov:

 - Make sure the check for lost pelt idle time is done unconditionally to have correct lost idle time accounting

 - Stop the deadline server task before a CPU goes offline

* tag 'sched_urgent_for_v6.18_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/fair: Fix pelt lost idle time detection
  sched/deadline: Stop dl_server before CPU goes offline
2025-10-19  Merge tag 'perf_urgent_for_v6.18_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds)
Pull perf fixes from Borislav Petkov:

 - Make sure perf reporting works correctly in setups using overlayfs or FUSE

 - Move the uprobe optimization to a better location logically

* tag 'perf_urgent_for_v6.18_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/core: Fix MMAP2 event device with backing files
  perf/core: Fix MMAP event path names with backing files
  perf/core: Fix address filter match with backing files
  uprobe: Move arch_uprobe_optimize right after handlers execution
2025-10-18  bpf: mark vma->{vm_mm,vm_file} as __safe_trusted_or_null  (Yafang Shao)
The vma->vm_mm might be NULL and it can be accessed outside of RCU. Thus, we can mark it as trusted_or_null. With this change, BPF helpers can safely access vma->vm_mm to retrieve the associated mm_struct from the VMA. Then we can make a policy decision based on the VMA. The "trusted" annotation enables direct access to vma->vm_mm within kfuncs marked with KF_TRUSTED_ARGS or KF_RCU, such as bpf_task_get_cgroup1() and bpf_task_under_cgroup(). Conversely, "null" enforcement requires all callsites using vma->vm_mm to perform NULL checks. The lsm selftest must be modified because it directly accesses vma->vm_mm without a NULL pointer check; otherwise it will break due to this change. For the VMA based THP policy, the use case is as follows:

  @mm = @vma->vm_mm;        // vm_area_struct::vm_mm is trusted or null
  if (!@mm)
          return;

  bpf_rcu_read_lock();      // rcu lock must be held to dereference the owner
  @owner = @mm->owner;      // mm_struct::owner is rcu trusted or null
  if (!@owner)
          goto out;

  @cgroup1 = bpf_task_get_cgroup1(@owner, MEMCG_HIERARCHY_ID);
  /* make the decision based on the @cgroup1 attribute */
  bpf_cgroup_release(@cgroup1);   // release the associated cgroup

out:
  bpf_rcu_read_unlock();

PSI memory information can be obtained from the associated cgroup to inform policy decisions. Since upstream PSI support is currently limited to cgroup v2, the following example demonstrates a cgroup v2 implementation:

  @owner = @mm->owner;
  if (@owner) {
          // @ancestor_cgid is user-configured
          @ancestor = bpf_cgroup_from_id(@ancestor_cgid);
          if (bpf_task_under_cgroup(@owner, @ancestor)) {
                  @psi_group = @ancestor->psi;
                  /* Extract PSI metrics from @psi_group and
                   * implement policy logic based on the values */
          }
  }

The vma::vm_file can also be marked with __safe_trusted_or_null. No additional selftests are required since vma->vm_file and vma->vm_mm are already validated in the existing selftest suite. Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Acked-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com> Link: https://lore.kernel.org/r/20251016063929.13830-3-laoar.shao@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-10-18  bpf: mark mm->owner as __safe_rcu_or_null  (Yafang Shao)
When CONFIG_MEMCG is enabled, we can access mm->owner under RCU. The owner can be NULL. With this change, BPF helpers can safely access mm->owner to retrieve the associated task from the mm. We can then make a policy decision based on the task attribute. The typical use case is as follows:

  bpf_rcu_read_lock();      // rcu lock must be held for rcu trusted field
  @owner = @mm->owner;      // mm_struct::owner is rcu trusted or null
  if (!@owner)
          goto out;

  /* Do something based on the task attribute */

out:
  bpf_rcu_read_unlock();

Suggested-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Acked-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Link: https://lore.kernel.org/r/20251016063929.13830-2-laoar.shao@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-10-18  Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf at 6.18-rc2  (Alexei Starovoitov)
Cross-merge BPF and other fixes after downstream PR. No conflicts. Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-10-18  sched_ext: Allow forcibly picking an scx task  (Andrea Righi)
Refactor pick_task_scx() adding a new argument to forcibly pick a SCHED_EXT task, ignoring any higher-priority sched class activity. This refactoring prepares the code for future scenarios, e.g., allowing the ext dl_server to force a SCHED_EXT task selection. No functional changes. Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Andrea Righi <arighi@nvidia.com> Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-10-18  dma: contiguous: Reserve default CMA heap  (Maxime Ripard)
The CMA code, in addition to the reserved-memory regions in the device tree, will also register a default CMA region if the device tree doesn't provide any, with its size and position coming from either the kernel command-line or configuration. Let's register that one as well so that a heap can be created for it. Reviewed-by: T.J. Mercier <tjmercier@google.com> Signed-off-by: Maxime Ripard <mripard@kernel.org> Acked-by: Marek Szyprowski <m.szyprowski@samsung.com> Signed-off-by: Sumit Semwal <sumit.semwal@linaro.org> Link: https://lore.kernel.org/r/20251013-dma-buf-ecc-heap-v8-4-04ce150ea3d9@kernel.org
2025-10-18  dma: contiguous: Register reusable CMA regions at boot  (Maxime Ripard)
In order to create a CMA dma-buf heap instance for each CMA heap region in the system, we need to collect all of them during boot. They are created from two main sources: the reserved-memory regions in the device tree, and the default CMA region created from the configuration or command line parameters, if no default region is provided in the device tree. Let's collect all the device-tree defined CMA regions flagged as reusable. Reviewed-by: T.J. Mercier <tjmercier@google.com> Signed-off-by: Maxime Ripard <mripard@kernel.org> Acked-by: Marek Szyprowski <m.szyprowski@samsung.com> Signed-off-by: Sumit Semwal <sumit.semwal@linaro.org> Link: https://lore.kernel.org/r/20251013-dma-buf-ecc-heap-v8-3-04ce150ea3d9@kernel.org
2025-10-18  PM: console: Fix memory allocation error handling in pm_vt_switch_required()  (Malaya Kumar Rout)
The pm_vt_switch_required() function fails silently when memory allocation fails, offering no indication to callers that the operation was unsuccessful. This behavior prevents drivers from handling allocation errors correctly or implementing retry mechanisms. By ensuring that failures are reported back to the caller, drivers can make informed decisions, improve robustness, and avoid unexpected behavior during critical power management operations. Change the function signature to return an integer error code and modify the implementation to return -ENOMEM when kmalloc() fails. Update both the function declaration and the inline stub in include/linux/pm.h to maintain consistency across CONFIG_VT_CONSOLE_SLEEP configurations. The function now returns: - 0 on success (including when updating existing entries) - -ENOMEM when memory allocation fails This change improves error reporting without breaking existing callers, as the current callers in drivers/video/fbdev/core/fbmem.c already ignore the return value, making this a backward-compatible improvement. Reviewed-by: Lyude Paul <lyude@redhat.com> Signed-off-by: Malaya Kumar Rout <mrout@redhat.com> Reviewed-by: Dhruva Gole <d-gole@ti.com> Reviewed-by: Lyude Paul <lyude@redhat.com> Link: https://patch.msgid.link/20251013193028.89570-1-mrout@redhat.com Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
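A sketch of the reworked calling convention (list search, locking and list insertion elided; only the int return type and the -ENOMEM path follow from the changelog):

  int pm_vt_switch_required(struct device *dev, bool required)
  {
          struct pm_vt_switch *entry;

          /* ... return 0 early if an existing entry for @dev was updated ... */

          entry = kmalloc(sizeof(*entry), GFP_KERNEL);
          if (!entry)
                  return -ENOMEM;         /* previously a silent failure */

          entry->dev = dev;
          entry->required = required;
          /* ... add the entry to pm_vt_switch_list ... */
          return 0;
  }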
2025-10-16  sched_ext: Merge branch 'sched/core' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into for-6.19  (Tejun Heo)
Pull in tip/sched/core to receive:

  50653216e4ff ("sched: Add support to pick functions to take rf")
  4c95380701f5 ("sched/ext: Fold balance_scx() into pick_task_scx()")

which will enable clean integration of DL server support among other things. This conflicts with the following from sched_ext/for-6.18-fixes:

  a8ad873113d3 ("sched_ext: defer queue_balance_callback() until after ops.dispatch")

which adds maybe_queue_balance_callback() to balance_scx() which is removed by 50653216e4ff. Resolve by moving the invocation to pick_task_scx() in the equivalent location. Signed-off-by: Tejun Heo <tj@kernel.org>
2025-10-16  sched_ext: Merge branch 'for-6.18-fixes' into for-6.19  (Tejun Heo)
Pull sched_ext/for-6.18-fixes to sync trees to receive: 05e63305c85c ("sched_ext: Fix scx_kick_pseqs corruption on concurrent scheduler loads") to avoid conflicts with planned cgroup sub-sched support. Signed-off-by: Tejun Heo <tj@kernel.org>
2025-10-16  sched_ext: fix flag check for deferred callbacks  (Emil Tsalapatis)
When scheduling the deferred balance callbacks, check SCX_RQ_BAL_CB_PENDING instead of SCX_RQ_BAL_PENDING. This way schedule_deferred() properly tests whether there is already a pending request for queue_balance_callback() to be invoked at the end of .balance(). Fixes: a8ad873113d3 ("sched_ext: defer queue_balance_callback() until after ops.dispatch") Signed-off-by: Emil Tsalapatis <emil@etsalapatis.com> Signed-off-by: Tejun Heo <tj@kernel.org>
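The gist of the one-flag change, as a sketch (the surrounding schedule_deferred() logic is elided; flag names come from the changelog):

  if (rq->scx.flags & SCX_RQ_BAL_CB_PENDING)      /* was: SCX_RQ_BAL_PENDING */
          return;         /* a queue_balance_callback() request is already pending */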
2025-10-16  bpf: Fix memory leak in __lookup_instance error path  (Shardul Bankar)
When __lookup_instance() allocates a func_instance structure but fails to allocate the must_write_set array, it returns an error without freeing the previously allocated func_instance. This causes a memory leak of 192 bytes (sizeof(struct func_instance)) each time this error path is triggered. Fix by freeing 'result' on must_write_set allocation failure. Fixes: b3698c356ad9 ("bpf: callchain sensitive stack liveness tracking using CFG") Reported-by: BPF Runtime Fuzzer (BRF) Signed-off-by: Shardul Bankar <shardulsb08@gmail.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://patch.msgid.link/20251016063330.4107547-1-shardulsb08@gmail.com
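A hedged sketch of the fixed error path; the allocator calls, the n_frames size and the ERR_PTR convention are assumptions, only the free-on-failure ordering comes from the changelog:

  result = kzalloc(sizeof(*result), GFP_KERNEL_ACCOUNT);
  if (!result)
          return ERR_PTR(-ENOMEM);

  result->must_write_set = kcalloc(n_frames, sizeof(*result->must_write_set),
                                   GFP_KERNEL_ACCOUNT);
  if (!result->must_write_set) {
          kfree(result);                  /* this was leaked before the fix */
          return ERR_PTR(-ENOMEM);
  }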
2025-10-16  sched/ext: Fold balance_scx() into pick_task_scx()  (Peter Zijlstra)
With pick_task() having an rf argument, it is possible to do the lock-break there and get rid of the weird balance/pick_task hack. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Tejun Heo <tj@kernel.org>
2025-10-16  sched: Add support to pick functions to take rf  (Joel Fernandes)
Some pick functions like the internal pick_next_task_fair() already take rf, but some others don't. We need this for scx's server pick function. Prepare for this by having pick functions accept it.

[peterz:
 - added RETRY_TASK handling
 - removed pick_next_task_fair indirection]

Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Tejun Heo <tj@kernel.org>
2025-10-16  sched: Detect per-class runqueue changes  (Peter Zijlstra)
Have enqueue/dequeue set a per-class bit in rq->queue_mask. This then enables easy tracking of which runqueues are modified over a lock-break. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Tejun Heo <tj@kernel.org>
2025-10-16  sched: Mandate shared flags for sched_change  (Peter Zijlstra)
Shrikanth noted that the sched_change pattern relies on using shared flags. Suggested-by: Shrikanth Hegde <sshegde@linux.ibm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2025-10-16  sched: Cleanup the sched_change NOCLOCK usage  (Peter Zijlstra)
Teach the sched_change pattern how to do update_rq_clock(); this allows for some simplifications / cleanups. Suggested-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Juri Lelli <juri.lelli@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
2025-10-16  sched: Match __task_rq_{,un}lock()  (Peter Zijlstra)
In preparation to adding more rules to __task_rq_lock(), such that __task_rq_unlock() will no longer be equivalent to rq_unlock(), make sure every __task_rq_lock() is matched by a __task_rq_unlock() and vice-versa. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Juri Lelli <juri.lelli@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
2025-10-16  sched: Add locking comments to sched_class methods  (Peter Zijlstra)
'Document' the locking context the various sched_class methods are called under. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Juri Lelli <juri.lelli@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
2025-10-16  sched: Make __do_set_cpus_allowed() use the sched_change pattern  (Peter Zijlstra)
Now that do_set_cpus_allowed() holds all the regular locks, convert it to use the sched_change pattern helper. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Juri Lelli <juri.lelli@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
2025-10-16  sched: Rename do_set_cpus_allowed()  (Peter Zijlstra)
Hopefully saner naming. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Juri Lelli <juri.lelli@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
2025-10-16  sched: Fix do_set_cpus_allowed() locking  (Peter Zijlstra)
All callers of do_set_cpus_allowed() only take p->pi_lock, which is not sufficient to actually change the cpumask. Again, this is mostly ok in these cases, but it results in unnecessarily complicated reasoning. Furthermore, there is no reason whatsoever not to just take all the required locks, so do just that. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Juri Lelli <juri.lelli@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
2025-10-16  sched: Fix migrate_disable_switch() locking  (Peter Zijlstra)
For some reason migrate_disable_switch() was more complicated than it needs to be, resulting in mind bending locking of dubious quality. Recognise that migrate_disable_switch() must be called before a context switch, but any place before that switch is equally good. Since the current place results in troubled locking, simply move the thing before taking rq->lock. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Juri Lelli <juri.lelli@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
2025-10-16  sched: Move sched_class::prio_changed() into the change pattern  (Peter Zijlstra)
Move sched_class::prio_changed() into the change pattern. And while there, extend it with sched_class::get_prio() in order to fix the deadline situation. Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Juri Lelli <juri.lelli@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
2025-10-16  sched: Cleanup sched_delayed handling for class switches  (Peter Zijlstra)
Use the new sched_class::switching_from() method to dequeue delayed tasks before switching to another class. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Reviewed-by: Juri Lelli <juri.lelli@redhat.com> Acked-by: Tejun Heo <tj@kernel.org>
2025-10-16  sched: Fold sched_class::switch{ing,ed}_{to,from}() into the change pattern  (Peter Zijlstra)
Add {DE,EN}QUEUE_CLASS and fold the sched_class::switch* methods into the change pattern. This completes and makes the pattern more symmetric. This changes the order of callbacks slightly:

     OLD                             NEW
                                   |
                                   | switching_from()
     dequeue_task();               | dequeue_task()
     put_prev_task();              | put_prev_task()
                                   | switched_from()
                                   |
     ... change task ...           | ... change task ...
                                   |
     switching_to();               | switching_to()
     enqueue_task();               | enqueue_task()
     set_next_task();              | set_next_task()
     prev_class->switched_from()   |
     switched_to()                 | switched_to()
                                   |

Notably, it moves the switched_from() callback right after the dequeue/put. Existing implementations don't appear to be affected by this change in location -- specifically the task isn't enqueued on the class in question in either location. Make (CLASS)^(SAVE|MOVE), because there is nothing to save-restore when changing scheduling classes. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Juri Lelli <juri.lelli@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
2025-10-16  sched/deadline: Prepare for switched_from() change  (Peter Zijlstra)
Prepare for the sched_class::switch*() methods getting folded into the change pattern. As a result of that, the location of switched_from will change slightly. SCHED_DEADLINE is affected by this change in location:

     OLD                             NEW
                                   |
                                   | switching_from()
     dequeue_task();               | dequeue_task()
     put_prev_task();              | put_prev_task()
                                   | switched_from()
                                   |
     ... change task ...           | ... change task ...
                                   |
     switching_to();               | switching_to()
     enqueue_task();               | enqueue_task()
     set_next_task();              | set_next_task()
     prev_class->switched_from()   |
     switched_to()                 | switched_to()
                                   |

Notably, where switched_from() was called *after* the change to the task, it will get called before it. Specifically, switched_from_dl() uses dl_task(p) which uses p->prio; which is changed when switching class (it might be the reason to switch class in case of PI). When switched_from_dl() gets called, the task will have left the deadline class and dl_task() must be false, while when doing dequeue_dl_entity() the task must be a dl_task(), otherwise we'd have called a different dequeue method. Reported-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2025-10-16  sched: Re-arrange the {EN,DE}QUEUE flags  (Peter Zijlstra)
Ensure the matched flags are in the low word while the unmatched flags go into the second word. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Juri Lelli <juri.lelli@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
2025-10-16  sched: Employ sched_change guards  (Peter Zijlstra)
As proposed a long while ago -- and half done by scx -- wrap the scheduler's 'change' pattern in a guard helper. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Juri Lelli <juri.lelli@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
2025-10-16  sched/fair: Only update stats for allowed CPUs when looking for dst group  (Adam Li)
Load imbalance is observed when the workload frequently forks new threads. Due to CPU affinity, the workload can run on CPUs 0-7 in the first group, but only on CPUs 8-11 in the second group. CPUs 12-15 are always idle.

    { 0 1 2 3 4 5 6 7 }  {8 9 10 11 12 13 14 15}
      * * * * * * * *      * *  *  *

When looking for a dst group for newly forked threads, update_sg_wakeup_stats() often reports that the second group has more idle CPUs than the first group. The scheduler thinks the second group is less busy. It then selects the least busy CPUs among CPUs 8-11. Therefore CPUs 8-11 can become crowded with newly forked threads while, at the same time, CPUs 0-7 can be idle. A task may not use all the CPUs in a schedule group due to CPU affinity. Only update schedule group statistics for allowed CPUs. Signed-off-by: Adam Li <adamli@os.amperecomputing.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
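The idea in miniature (only the cpumask filter is the point; the statistics accumulation in update_sg_wakeup_stats() is elided):

  int i;

  for_each_cpu(i, sched_group_span(group)) {
          if (!cpumask_test_cpu(i, p->cpus_ptr))
                  continue;       /* the task can never run here, don't count it */

          /* ... accumulate load, util and idle-CPU statistics for CPU i ... */
  }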
2025-10-16  sched: Create architecture specific sched domain distances  (Tim Chen)
Allow architecture specific sched domain NUMA distances that are modified from actual NUMA node distances for the purpose of building NUMA sched domains. Keep the actual NUMA distances separately if modified distances are used for building sched domains. Such distances are still needed as NUMA balancing benefits from finding the NUMA nodes that are actually closer to a task numa_group. Consolidate the recording of unique NUMA distances in an array into sched_record_numa_dist() so the function can be reused to record NUMA distances when the NUMA distance metric is changed. No functional change, and no additional distance array is allocated if there are no arch specific NUMA distances defined. Co-developed-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Chen Yu <yu.c.chen@intel.com>
2025-10-16  sched/deadline: only set free_cpus for online runqueues  (Doug Berger)
Commit 16b269436b72 ("sched/deadline: Modify cpudl::free_cpus to reflect rd->online") introduced the cpudl_set/clear_freecpu functions to allow the cpudl::free_cpus mask to be manipulated by the deadline scheduler class rq_on/offline callbacks so the mask would also reflect this state. Commit 9659e1eeee28 ("sched/deadline: Remove cpu_active_mask from cpudl_find()") removed the check of the cpu_active_mask to save some processing on the premise that the cpudl::free_cpus mask already reflected the runqueue online state. Unfortunately, there are cases where it is possible for the cpudl_clear function to set the free_cpus bit for a CPU when the deadline runqueue is offline. When this occurs while a CPU is connected to the default root domain the flag may retain the bad state after the CPU has been unplugged. Later, a different CPU that is transitioning through the default root domain may push a deadline task to the powered down CPU when cpudl_find sees its free_cpus bit is set. If this happens the task will not have the opportunity to run. One example is outlined here: https://lore.kernel.org/lkml/20250110233010.2339521-1-opendmb@gmail.com Another occurs when the last deadline task is migrated from a CPU that has an offlined runqueue. The dequeue_task member of the deadline scheduler class will eventually call cpudl_clear and set the free_cpus bit for the CPU. This commit modifies the cpudl_clear function to be aware of the online state of the deadline runqueue so that the free_cpus mask can be updated appropriately. It is no longer necessary to manage the mask outside of the cpudl_set/clear functions so the cpudl_set/clear_freecpu functions are removed. In addition, since the free_cpus mask is now only updated under the cpudl lock the code was changed to use the non-atomic __cpumask functions. Signed-off-by: Doug Berger <opendmb@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2025-10-16  sched/fair: Forfeit vruntime on yield  (Fernand Sieber)
If a task yields, the scheduler may decide to pick it again. The task in turn may decide to yield immediately or shortly after, leading to a tight loop of yields. If there's another runnable task at this point, the deadline will be increased by the slice at each loop. This can cause the deadline to run away pretty quickly, and cause subsequent elevated run delays later on as the task doesn't get picked again. The reason the scheduler can pick the same task again and again despite its deadline increasing is because it may be the only eligible task at that point. Fix this by making the task forfeit its remaining vruntime and pushing the deadline one slice ahead. This implements yield behavior more authentically. We limit the forfeiting to eligible tasks. This is because core scheduling prefers running ineligible tasks rather than force idling. As such, without the condition, we can end up in a yield loop which makes the vruntime increase rapidly, leading to anomalous run delays later down the line. Fixes: 147f3efaa24182 ("sched/fair: Implement an EEVDF-like scheduling policy") Signed-off-by: Fernand Sieber <sieberf@amazon.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20250401123622.584018-1-sieberf@amazon.com Link: https://lore.kernel.org/r/20250911095113.203439-1-sieberf@amazon.com Link: https://lore.kernel.org/r/20250916140228.452231-1-sieberf@amazon.com
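A rough sketch of the described yield semantics in yield_task_fair() terms; the exact statements in the patch may differ, only the forfeit/push behavior and the eligibility condition come from the changelog:

  if (entity_eligible(cfs_rq, se)) {
          se->vruntime = se->deadline;                    /* forfeit remaining vruntime */
          se->deadline += calc_delta_fair(se->slice, se); /* push deadline one slice ahead */
  }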
2025-10-15  dma-debug: don't report false positives with DMA_BOUNCE_UNALIGNED_KMALLOC  (Marek Szyprowski)
Commit 370645f41e6e ("dma-mapping: force bouncing if the kmalloc() size is not cache-line-aligned") introduced the DMA_BOUNCE_UNALIGNED_KMALLOC feature and permitted architecture specific code to configure kmalloc slabs with sizes smaller than the value of dma_get_cache_alignment(). When that feature is enabled, the physical address of some small kmalloc()-ed buffers might not be aligned to the CPU cachelines, thus not really suitable for typical DMA. To properly handle that case SWIOTLB buffer bouncing is used, so no CPU cache corruption occurs. When that happens, there is no point reporting a false-positive DMA-API warning that the buffer is not properly aligned, as this is not a client driver fault. [m.szyprowski@samsung.com: replace is_swiotlb_allocated() with is_swiotlb_active(), per Catalin] Link: https://lkml.kernel.org/r/20251010173009.3916215-1-m.szyprowski@samsung.com Link: https://lkml.kernel.org/r/20251009141508.2342138-1-m.szyprowski@samsung.com Fixes: 370645f41e6e ("dma-mapping: force bouncing if the kmalloc() size is not cache-line-aligned") Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Inki Dae <m.szyprowski@samsung.com> Cc: Robin Murphy <robin.murphy@arm.com> Cc: "Isaac J. Manjarres" <isaacmanjarres@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
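The shape of the suppression, heavily hedged: the placement and the exact condition are assumptions, only is_swiotlb_active() and the config symbol come from the changelog:

  if (IS_ENABLED(CONFIG_DMA_BOUNCE_UNALIGNED_KMALLOC) && is_swiotlb_active(dev))
          return;         /* buffer may have been bounced, not a driver bug */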
2025-10-15  sched_ext: Add lockless peek operation for DSQs  (Ryan Newton)
The builtin DSQ queue data structures are meant to be used by a wide range of different sched_ext schedulers with different demands on these data structures. They might be per-cpu with low-contention, or high-contention shared queues. Unfortunately, DSQs have a coarse-grained lock around the whole data structure. Without going all the way to a lock-free, more scalable implementation, a small step we can take to reduce lock contention is to allow a lockless, small-fixed-cost peek at the head of the queue. This change allows certain custom SCX schedulers to cheaply peek at queues, e.g. during load balancing, before locking them. But it represents a few extra memory operations to update the pointer each time the DSQ is modified, including a memory barrier on ARM so the write appears correctly ordered. This commit adds a first_task pointer field which is updated atomically when the DSQ is modified, and allows any thread to peek at the head of the queue without holding the lock. Signed-off-by: Ryan Newton <newton@meta.com> Reviewed-by: Andrea Righi <arighi@nvidia.com> Reviewed-by: Christian Loehle <christian.loehle@arm.com> Signed-off-by: Tejun Heo <tj@kernel.org>
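A hedged sketch of the reader side: the helper name is invented, the first_task field comes from the changelog, and the publishing barrier described there lives on the update side:

  static struct task_struct *scx_dsq_peek(struct scx_dispatch_q *dsq)
  {
          /* Racy by design: a cheap, best-effort view of the current head. */
          return READ_ONCE(dsq->first_task);
  }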
2025-10-15  livepatch: Match old_sympos 0 and 1 in klp_find_func()  (Song Liu)
When there is only one function of the same name, old_sympos of 0 and 1 are logically identical. Match them in klp_find_func(). This is to avoid a corner case with different toolchain behavior. In this specific issue, two versions of kpatch-build were used to build livepatch for the same kernel. One assigns old_sympos == 0 for unique local functions, the other assigns old_sympos == 1 for unique local functions. Both versions work fine by themselves. (PS: This behavior change was introduced in a downstream version of kpatch-build. This change does not exist in upstream kpatch-build.) However, during livepatch upgrade (with the replace flag set) from a patch built with one version of kpatch-build to the same fix built with the other version of kpatch-build, livepatching fails with errors like:

  [   14.218706] sysfs: cannot create duplicate filename 'xxx/somefunc,1'
  ...
  [   14.219466] Call Trace:
  [   14.219468]  <TASK>
  [   14.219469]  dump_stack_lvl+0x47/0x60
  [   14.219474]  sysfs_warn_dup.cold+0x17/0x27
  [   14.219476]  sysfs_create_dir_ns+0x95/0xb0
  [   14.219479]  kobject_add_internal+0x9e/0x260
  [   14.219483]  kobject_add+0x68/0x80
  [   14.219485]  ? kstrdup+0x3c/0xa0
  [   14.219486]  klp_enable_patch+0x320/0x830
  [   14.219488]  patch_init+0x443/0x1000 [ccc_0_6]
  [   14.219491]  ? 0xffffffffa05eb000
  [   14.219492]  do_one_initcall+0x2e/0x190
  [   14.219494]  do_init_module+0x67/0x270
  [   14.219496]  init_module_from_file+0x75/0xa0
  [   14.219499]  idempotent_init_module+0x15a/0x240
  [   14.219501]  __x64_sys_finit_module+0x61/0xc0
  [   14.219503]  do_syscall_64+0x5b/0x160
  [   14.219505]  entry_SYSCALL_64_after_hwframe+0x4b/0x53
  [   14.219507] RIP: 0033:0x7f545a4bd96d
  ...
  [   14.219516] kobject: kobject_add_internal failed for somefunc,1 with -EEXIST, don't try to register things with the same name
  ...

This happens because klp_find_func() thinks somefunc with old_sympos==0 is not the same as somefunc with old_sympos==1, and klp_add_object_nops adds another xxx/func,1 to the list of functions to patch. Signed-off-by: Song Liu <song@kernel.org> Acked-by: Josh Poimboeuf <jpoimboe@kernel.org> [pmladek@suse.com: Fixed some typos.] Reviewed-by: Petr Mladek <pmladek@suse.com> Tested-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Petr Mladek <pmladek@suse.com>
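A sketch of the equivalence being matched (the helper name is invented for illustration; klp_find_func() is where the changelog applies it):

  static bool klp_sympos_equal(unsigned long a, unsigned long b)
  {
          /* For a symbol with a single instance, sympos 0 ("any, must be
           * unique") and sympos 1 ("first occurrence") mean the same thing. */
          return a == b || a + b == 1;
  }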