summaryrefslogtreecommitdiff
path: root/kernel/sched
AgeCommit message (Collapse)Author
2026-06-02sched: Add blocked_donor link to task for smarter mutex handoffsPeter Zijlstra
Add link to the task this task is proxying for, and use it so the mutex owner can do an intelligent hand-off of the mutex to the task that the owner is running on behalf. [jstultz: This patch was split out from larger proxy patch] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Connor O'Brien <connoro@google.com> Signed-off-by: John Stultz <jstultz@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260512025635.2840817-8-jstultz@google.com
2026-06-02sched: Add is_blocked task flagJohn Stultz
Add a new is_blocked flag to the task struct. This flag is set by try_to_block_task() and cleared by ttwu_do_wakeup() and tracks if the task is blocked. Traditionally this would mirror !p->on_rq, however due things like DELAY_DEQUEUE and PROXY_EXEC, this can diverge, so its useful to manage separately. Additionally with this, we might be able to get rid of the p->se.sched_delayed (ab)use in the core code (eventually). Taken whole cloth from Peter's email: https://lore.kernel.org/lkml/20260501132143.GC1026330@noisy.programming.kicks-ass.net/ With a few additional p->is_blocked = 0 in a few cases where we return current if blocked_on gets zeroed or there is no owner. This may hint that these current special cases might be dropped eventually. This change also helps resolve wait-queue stalls seen with proxy-execution. See previous patch attempts for details: https://lore.kernel.org/lkml/20260430215103.2978955-2-jstultz@google.com/ Reported-by: Vineeth Pillai <vineethrp@google.com> Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: John Stultz <jstultz@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260512025635.2840817-7-jstultz@google.com
2026-06-02sched: Have try_to_wake_up() handle return-migration for PROXY_WAKING caseJohn Stultz
This patch adds logic so try_to_wake_up() will notice if we are waking a task where blocked_on == PROXY_WAKING, and if necessary dequeue the task so the wakeup will naturally return-migrate the donor task back to a cpu it can run on. This helps performance as we do the dequeue and wakeup under the locks normally taken in the try_to_wake_up() and avoids having to do proxy_force_return() from __schedule(), which has to re-take similar locks and then force a pick again loop. This was split out from the larger proxy patch, and significantly reworked. Credits for the original patch go to: Peter Zijlstra (Intel) <peterz@infradead.org> Juri Lelli <juri.lelli@redhat.com> Valentin Schneider <valentin.schneider@arm.com> Connor O'Brien <connoro@google.com> Signed-off-by: John Stultz <jstultz@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260512025635.2840817-6-jstultz@google.com
2026-06-02sched: Rework block_task so it can be directly calledJohn Stultz
Pull most of the logic out of try_to_block_task() and put it into block_task() directly, so that we can call block_task() and not have to worry about the failing cases in try_to_block_task() Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: John Stultz <jstultz@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260512025635.2840817-5-jstultz@google.com
2026-06-02sched: deadline: Add dl_rq->curr pointer to address issues with Proxy ExecJohn Stultz
The DL scheduler keeps the current task in the rbtree, since the deadline value isn't usually chagned while the task is runnable. This results in set_next_task() and put_prev_task() being simpler, but unfortunately this causes complexity elsewhere. Specifically when update_curr_dl() updates the deadline, it has to dequeue and then enqueue the task. From put_prev_task_dl(), we first call update_curr_dl(), and then call enqueue_pushable_dl_task(). However, with Proxy Exec this goes awry. Since when a mutex is released, we might wake the waiting rq->donor. This will cause put_prev_task() to be called on the donor to take it off the cpu for return migration. At that point, from put_prev_task_dl() the update_curr_dl() logic will dequeue & enqueue the task, and the enqueue function will call enqueue_pushable_dl_task() (since the task_current() check won't prevent it). Then back up the callstack in put_prev_task_dl() we'll end up calling enqueue_pushable_dl_task() again, tripping the !RB_EMPTY_NODE(&p->pushable_dl_tasks) warning. So to avoid this, use Peter's suggested[1] approach, and add a dl_rq->curr pointer that is set/cleared from set_next_task()/ put_prev_task(), which effectively tracks the rq->donor. We can then use this to avoid adding the active donor to the pushable list from enqueue_task_dl(). [1]: https://lore.kernel.org/lkml/20260304095123.GP606826@noisy.programming.kicks-ass.net/ Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: John Stultz <jstultz@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260512025635.2840817-4-jstultz@google.com
2026-06-02sched: deadline: Add some helper variables to cleanup deadline logicJohn Stultz
As part of an improvement to handling pushable deadline tasks, Peter suggested this cleanup[1], to use helper values for dl_entity and dl_rq in the enqueue_task_dl() and put_prev_task_dl() functions. There should be no functional change from this patch. To make sure this cleanup change doesn't obscure later logic changes, I've split it into its own patch. [1]: https://lore.kernel.org/lkml/20260304095123.GP606826@noisy.programming.kicks-ass.net/ Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: John Stultz <jstultz@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260512025635.2840817-3-jstultz@google.com
2026-06-02sched: Rework prev_balance() to avoid stale prev referencesJohn Stultz
Historically, the prev value from __schedule() was the rq->curr. This prev value is passed down through numerous functions, and used in the class scheduler implementations. The fact that prev was on_cpu until the end of __schedule(), meant it was stable across the rq lock drops that the class->balance() implementations often do. However, with proxy-exec, the prev passed to functions called by __schedule() is rq->donor, which may not be the same as rq->curr and may not be on_cpu, this makes the prev value potentially unstable across rq lock drops. A recently found issue with proxy-exec, is when we begin doing return migration from try_to_wake_up(), its possible we may be waking up the rq->donor. When we do this, we proxy_resched_idle() to put_prev_set_next() setting the rq->donor to rq->idle, allowing the rq->donor to be return migrated and allowed to run. This however runs into trouble, as on another cpu we might be in the middle of calling __schedule(). Conceptually the rq lock is held for the majority of the time, but in calling prev_balance() its possible the class->balance() handler call may briefly drop the rq lock. This opens a window for try_to_wake_up() to wake and return migrate the rq->donor before the class logic reacquires the rq lock. Unfortunately prev_balance() pass in a prev argument, to which we pass rq->donor. However this prev value can now become stale and incorrect across a rq lock drop. So, to correct this, rework the prev_balance() call so that it does not take a "prev" argument. Signed-off-by: John Stultz <jstultz@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260512025635.2840817-2-jstultz@google.com
2026-06-02Merge branch 'tip/sched/urgent'Peter Zijlstra
Pick up urgent fixes. Signed-off-by: Peter Zijlstra <peterz@infradead.org>
2026-05-31sched_ext: Guard BPF arena helper calls to fix 32-bit buildTejun Heo
BPF arena (kernel/bpf/arena.c) is compiled only on MMU && 64BIT, while SCHED_CLASS_EXT depends on BPF_SYSCALL && BPF_JIT && DEBUG_INFO_BTF with no 64BIT requirement. On a 32-bit arch with a BPF JIT, SCX builds while the arena helpers are absent, so the cid-form code's unconditional calls to bpf_prog_arena() and bpf_arena_map_kern_vm_start() fail to link: build_policy.o: undefined reference to `bpf_prog_arena' build_policy.o: undefined reference to `bpf_arena_map_kern_vm_start' Guard the three call sites with the same MMU && 64BIT condition that gates arena.o. A cid-form scheduler needs a BPF arena, which isn't available on such builds, so it can't run there regardless. cpu-form schedulers don't touch the arena and are unaffected. This is a quick workaround to get past the build errors. A fuller fix may make the whole cid-form path conditional on the same condition, or drop 32-bit support outright. Fixes: 0e2819cba977 ("sched_ext: Require an arena for cid-form schedulers") Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202605310454.U9iByL2n-lkp@intel.com/ Closes: https://lore.kernel.org/oe-kbuild-all/202605310926.APXMc0RJ-lkp@intel.com/ Signed-off-by: Tejun Heo <tj@kernel.org>
2026-05-29sched/fair: Use rq_clock() in update_tg_load_avg() rate-limitRik van Riel
update_tg_load_avg() is called once per leaf cfs_rq from the __update_blocked_fair() walk that runs inside the NOHZ idle-balance softirq, and again from update_load_avg() with UPDATE_TG. Its first operation after the trivial early-outs is unconditionally: now = sched_clock_cpu(cpu_of(rq_of(cfs_rq))); if (now - cfs_rq->last_update_tg_load_avg < NSEC_PER_MSEC) return; Jakub ran into a system where nohz_idle_balance() was taking 75% of a CPU (which is handling network traffic and doing many irq_exit_cpu calls), with 35% of that CPU spent in update_load_avg, and 17% of the CPU in sched_clock_cpu(), reading the TSC. In a quick synthetic test, it looks like this patch reduces the CPU use of sched_balance_update_blocked_averages by about 20%. Switch the rate-limit to read rq_clock(rq_of(cfs_rq)) instead. This eliminates the rdtsc, and uses a fairly fresh timestamp, because all callers of update_tg_load_avg() and clear_tg_load_avg() hold rq->lock and have called update_rq_clock(rq) within microseconds: caller pre-state __update_blocked_fair encloser did update_rq_clock(rq) update_load_avg's three UPDATE_TG sites under rq->lock after enqueue/dequeue/update_curr attach_/detach_entity_cfs_rq preceded by update_load_avg(...) clear_tg_load_avg via offline path rq_clock_start_loop_update(rq) upfront so rq->clock is fresh at every call. Since cfs_rqs are per-CPU per-task_group, cfs_rq->last_update_tg_load_avg is always compared against the same rq's clock; no cross-rq drift. Signed-off-by: Rik van Riel <riel@surriel.com> Assisted-by: Claude (Anthropic) Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://patch.msgid.link/20260527110250.6a91718d@fangorn
2026-05-29sched_ext: Auto-register/unregister dl_server reservationsAndrea Righi
Commit cd959a3562050d ("sched_ext: Add a DL server for sched_ext tasks") introduced an ext_server deadline server to protect sched_ext tasks from fair/RT starvation, mirroring the existing fair_server. Currently, both servers reserve their 50ms/1000ms bandwidth at boot, regardless of whether a BPF scheduler is loaded. Unused bandwidth is still reclaimed at runtime by other classes, but the static reservation prevents the RT class from implicitly using that headroom when one of the two classes is guaranteed to be empty. A sysadmin can work around this by writing /sys/kernel/debug/sched/{fair,ext}_server/cpu*/runtime, but that requires manual action and not all systems expose debugfs. A better approach is to make server bandwidth reservations dynamic: only the scheduling policy that is currently active should register its reservation, while the inactive one should not artificially hold capacity (keeping both reservations only when the BPF scheduler is running in partial mode): +---------------------------------------------+-------------+------------+ | BPF scheduler state | fair server | ext server | +---------------------------------------------+-------------+------------+ | not loaded (default boot) | reserved | none | | loaded full mode (!SCX_OPS_SWITCH_PARTIAL) | none | reserved | | loaded partial mode (SCX_OPS_SWITCH_PARTIAL)| reserved | reserved | +---------------------------------------------+-------------+------------+ To achieve this, introduce an "attached/detached" state for each deadline server, so the kernel can decide whether a server's bandwidth should be accounted in global bandwidth tracking. At boot, the system starts with only the fair server contributing to bandwidth accounting. When a BPF scheduler is enabled, the ext server is attached and may replace or complement the fair server depending on whether full or partial mode is used. When sched_ext is disabled, the system restores the previous deadline bandwidth values and behavior. The transition logic ensures that switching between scheduling modes is consistent and reversible, without losing runtime configuration or requiring manual intervention. Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Juri Lelli <juri.lelli@redhat.com> Link: https://patch.msgid.link/20260526164420.638711-2-arighi@nvidia.com
2026-05-29sched/deadline: Reject debugfs dl_server writes for offline CPUsAndrea Righi
Writing runtime or period via the per-CPU dl_server debugfs files (/sys/kernel/debug/sched/{fair,ext}_server/cpu*/{runtime,period}) on an offline CPU can trigger two distinct kernel issues: 1) Divide-by-zero in dl_server_apply_params(): Oops: divide error: 0000 [#1] SMP NOPTI RIP: 0010:dl_server_apply_params+0x239/0x3a0 Call Trace: sched_server_write_common.isra.0+0x21a/0x3c0 full_proxy_write+0x78/0xd0 vfs_write+0xe7/0x6e0 Both __dl_sub() and __dl_add() divide by cpus internally, which can be 0 once the CPU has been removed from any active root-domain span (this has been latent since the debugfs interface was introduced). 2) WARN_ON_ONCE in dl_server_start(): WARNING: kernel/sched/deadline.c:1805 at dl_server_start+0x232/0x270 Commit ee6e44dfe6e5 ("sched/deadline: Stop dl_server before CPU goes offline") added this check to catch enqueueing the server on an offline rq. There's no meaningful semantics for re-configuring the per-CPU dl_server bandwidth while the CPU is offline, so simply reject the write with -EBUSY so userspace gets a clear error. Closes: https://lore.kernel.org/all/20260526092228.3B6891F00A3A@smtp.kernel.org/ Fixes: d741f297bcea ("sched/fair: Fair server interface") Reported-by: Sashiko <sashiko-bot@kernel.org> Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Juri Lelli <juri.lelli@redhat.com> Tested-by: abaci-kreproducer <abaci@linux.alibaba.com> Link: https://patch.msgid.link/20260526100502.575774-1-arighi@nvidia.com
2026-05-29sched/topology: Provide arch_llc_mask for cache aware schedulingShrikanth Hegde
Venkat Reported a boot kernel panic next-20260522. Git bisect pointed to b5ea300a17e3 ("sched/cache: Make LLC id continuous") Stacktrace points to llc_mask being null. NIP [c000000000e58504] _find_first_bit+0x44/0x130 LR [c000000000e58500] _find_first_bit+0x40/0x130 Call Trace: build_sched_domains+0xad8/0xe50 sched_init_smp+0xa8/0x164 kernel_init_freeable+0x250/0x370 ret_from_kernel_user_thread+0x14/0x1c On powerpc, cpu_coregroup_mask is available only when the underlying hardware support coregroup. In shared LPAR, QEMU guest or power9 etc coregroup isn't supported. In such cases llc_mask was being referenced when it was null leading to panic. On powerpc, LLC is at SMT core level. So assumption that coregroup(MC) domain point to LLC is wrong. Provide a way for archs to say where its LLC is if it not at MC domain. Fixes: b5ea300a17e3 ("sched/cache: Make LLC id continuous") Closes: https://lore.kernel.org/all/51154de7-3700-4cb4-82f2-1b3a8fa427f7@linux.ibm.com/ Reported-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com> Co-developed-by: Chen, Yu C <yu.c.chen@intel.com> Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Chen Yu <yu.c.chen@intel.com> Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com> Tested-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://patch.msgid.link/20260529075712.1181039-1-sshegde@linux.ibm.com
2026-05-27sched_ext: idle: Fix errno loss in scx_idle_init()Cheng-Yang Chou
|| is a boolean operator, any nonzero (error) return short-circuits to 1 rather than the actual errno. The caller in scx_init() logs and propagates this value, so the wrong code reaches upper layers. Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-05-26sched: Remove sched_class::pick_next_task()Peter Zijlstra
The reason for pick_next_task_fair() is the put/set optimization that avoids touching the common ancestors. However, it is possible to implement this in the put_prev_task() and set_next_task() calls as used in put_prev_set_next_task(). Notably, put_prev_set_next_task() is the only site that: - calls put_prev_task() with a .next argument; - calls set_next_task() with .first = true. This means that put_prev_task() can determine the common hierarchy and stop there, and then set_next_task() can terminate where put_prev_task stopped. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://patch.msgid.link/20260511120628.057634261@infradead.org
2026-05-26sched/fair: Add newidle balance to pick_task_fair()Peter Zijlstra
With commit 50653216e4ff ("sched: Add support to pick functions to take rf") removing the balance callback, the pick_task() callback is in charge of newidle balancing. This means pick_task_fair() should do so too. This hasn't been a problem in practise because pick_next_task_fair() is used. However, since we'll be removing that one shortly, make sure pick_next_task() is up to scratch. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://patch.msgid.link/20260511120627.944705718@infradead.org
2026-05-26sched/debug: Collapse subsequent CONFIG_SCHED_CLASS_EXT sectionsPeter Zijlstra
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260511120627.281160085@infradead.org
2026-05-26sched: Use {READ,WRITE}_ONCE() for preempt_dynamic_modePeter Zijlstra
Robots figured out you can read and write this concurrently and got 'upset'. Gemini even noted sched_dynamic_show() can generate 'confusing' output if it observed different values during the printing. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260511120627.176946327@infradead.org
2026-05-26sched/debug: Use char * instead of char (*)[]Peter Zijlstra
Some of the fancy AI robots are getting 'upset'. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260511120627.065013766@infradead.org
2026-05-26sched/fair: Fix RCU usage in NOHZ exit path on CPU offlineAndrea Righi
Commit c9d93a73ce87 ("sched/fair: Drop redundant RCU read lock in NOHZ kick path") removed the rcu_read_lock()/unlock() pair from set_cpu_sd_state_busy() and set_cpu_sd_state_idle() on the assumption that all callers run in a safe context for rcu_dereference_all(): IRQs disabled or cpus_write_lock() held. That assumption is wrong for the CPU hotplug teardown path. When CPUs are taken offline, set_cpu_sd_state_busy() is invoked via: cpuhp/N kthread cpuhp_thread_fun() cpuhp_invoke_callback() sched_cpu_deactivate() nohz_balance_exit_idle() set_cpu_sd_state_busy() rcu_dereference_all(per_cpu(sd_llc, cpu)) The cpuhp kthread holds cpu_hotplug_lock (percpu-rwsem) but runs with preemption and IRQs enabled. As a result, lockdep correctly reports a suspicious RCU usage on CPU offline, e.g.: # echo 0 > /sys/devices/system/cpu/cpu1/online ============================= WARNING: suspicious RCU usage ----------------------------- kernel/sched/fair.c:12793 suspicious rcu_dereference_check() usage! ... 2 locks held by cpuhp/1/20: #0: (cpu_hotplug_lock){++++}-{0:0}, at: cpuhp_thread_fun+0x42/0x1ae #1: (cpuhp_state-down){+.+.}-{0:0}, at: cpuhp_thread_fun+0x72/0x1ae Call Trace: lockdep_rcu_suspicious nohz_balance_exit_idle sched_cpu_deactivate cpuhp_invoke_callback cpuhp_thread_fun smpboot_thread_fn Fix this by adding RCU read lock coverage to the one caller that lacks it: nohz_balance_exit_idle() in the CPU hotplug teardown. The other callers (nohz_balancer_kick() and nohz_balance_enter_idle()) genuinely run with IRQs disabled, so they remain unchanged. Fixes: c9d93a73ce87 ("sched/fair: Drop redundant RCU read lock in NOHZ kick path") Closes: https://lore.kernel.org/all/38fe0a1d-1a48-435a-910a-c278024d9ac9@samsung.com/ Reported-by: Marek Szyprowski <m.szyprowski@samsung.com> Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260522092523.2046095-1-arighi@nvidia.com
2026-05-25sched_ext: Convert ops.set_cmask() to arena-resident cmaskTejun Heo
ops_cid.set_cmask() expects a cmask. The kernel couldn't write into the arena, so it translated cpumask -> cmask in kernel memory and passed the result as a trusted pointer. The BPF cmask helpers all operate on arena cmasks though, so the BPF side had to word-by-word probe-read the kernel cmask into an arena cmask via cmask_copy_from_kernel() before any helper could touch it. It works, but is clumsy. With direct kernel-side arena access now in place, build the cmask in the arena. The kernel writes to it through the kern_va side of the dual mapping. BPF directly dereferences it via an __arena pointer like any other arena struct. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
2026-05-25sched_ext: Sub-allocator over kernel-claimed BPF arena pagesTejun Heo
Build a per-scheduler sub-allocator on top of pages claimed from the BPF arena registered in the previous patch. Subsequent kernel-managed arena-resident structures (e.g. per-CPU set_cmask cmask) carve their storage from this pool. scx_arena_pool_init() creates a gen_pool. scx_arena_alloc() returns the kernel VA. On exhaustion, the pool grows by claiming more pages via bpf_arena_alloc_pages_sleepable(). Chunks are added at the kernel-side mapping address. Callers translate to the BPF-arena form themselves if needed. Allocations sleep (GFP_KERNEL) - they may grow the pool through vzalloc and arena page allocation. All current consumers run from the enable path (after ops.init() and the kernel-side arena auto-discovery, before validate_ops()), where sleeping is fine. scx_arena_pool_destroy() walks each chunk, returns outstanding ranges to the gen_pool with gen_pool_free() and then calls gen_pool_destroy(). The underlying arena pages are released when the arena map itself is torn down, so the pool destroy doesn't free them explicitly. v2: Switch scx_arena_alloc() to a loop. (Andrea) Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-05-25sched_ext: Require an arena for cid-form schedulersTejun Heo
Upcoming patches will let the kernel place arena-resident scratch shared with the BPF program (e.g. per-CPU set_cmask cmask) so the BPF side can dereference it directly via __arena pointers, replacing the current cmask_copy_from_kernel() probe-read loop. That requires each cid-form scheduler to expose its arena to the kernel. Kernel- side accesses are recovered by the per-arena scratch-page mechanism. bpf_scx_reg_cid() walks the struct_ops member progs via bpf_struct_ops_for_each_prog() and reads each prog's arena via bpf_prog_arena(). The verifier enforces one arena per program, so each member prog contributes at most one arena. All non-NULL contributions must match and at least one member prog must use an arena. The map ref is held on scx_sched and dropped on sched destroy. cpu-form schedulers (bpf_scx_reg) are unchanged - no arena requirement. Signed-off-by: Tejun Heo <tj@kernel.org>
2026-05-25Merge branch 'arena_direct_access' of ↵Tejun Heo
git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next into for-7.2
2026-05-22Merge tag 'sched_ext-for-7.1-rc4-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext Pull sched_ext fixes from Tejun Heo: - Spurious WARN in ops_dequeue() racing with concurrent dispatch - Self-deadlock between scheduler disable and a concurrent sub-sched enable * tag 'sched_ext-for-7.1-rc4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: sched_ext: Fix spurious WARN on stale ops_state in ops_dequeue() sched_ext: Fix deadlock between scx_root_disable() and concurrent forks
2026-05-21sched_ext: Fix spurious WARN on stale ops_state in ops_dequeue()Samuele Mariotti
ops_dequeue() can race with finish_dispatch() and spuriously trigger the "queued task must be in BPF scheduler's custody" warning. ops_dequeue() snapshots p->scx.ops_state via atomic_long_read_acquire() and then, in the SCX_OPSS_QUEUED arm, asserts that SCX_TASK_IN_CUSTODY is set. The two reads are not atomic w.r.t. a concurrent finish_dispatch() running on another CPU: CPU 1 CPU 2 ===== ===== dequeue_task_scx() ops_dequeue() opss = read_acquire(ops_state) = SCX_OPSS_QUEUED finish_dispatch() cmpxchg ops_state: SCX_OPSS_QUEUED -> SCX_OPSS_DISPATCHING [succeeds] dispatch_enqueue(SCX_DSQ_GLOBAL, SCX_ENQ_CLEAR_OPSS) call_task_dequeue() p->scx.flags &= ~SCX_TASK_IN_CUSTODY WARN_ON_ONCE(!(p->scx.flags & SCX_TASK_IN_CUSTODY)) /* opss is stale: QUEUED, * but task already claimed */ set_release(ops_state, SCX_OPSS_NONE) The race has been observed via two distinct call chains: the most common goes through sched_setaffinity(), a rarer variant through sched_change_begin(). For SCX_DSQ_GLOBAL / SCX_DSQ_BYPASS, dispatch_enqueue() clears SCX_TASK_IN_CUSTODY before clearing ops_state to SCX_OPSS_NONE (intentional, to avoid concurrent non-atomic RMW of p->scx.flags against ops_dequeue()). The window between those two writes is exactly what ops_dequeue() observes as "QUEUED without custody". The observed state is not actually inconsistent, it just means CPU 1 has already claimed the task and the QUEUED value held by CPU 2 is stale. Re-read ops_state in that case; the next read is guaranteed to return SCX_OPSS_DISPATCHING or SCX_OPSS_NONE, both of which exit the switch cleanly. The retry is bounded: once IN_CUSTODY is cleared, ops_state has already advanced past QUEUED for this dispatch cycle, and a fresh QUEUED would require re-enqueue under p's rq lock, which CPU 2 holds. Changes in v2: - Use READ_ONCE() for p->scx.flags to ensure fresh reads and prevent compiler reordering in the lockless path - Add cpu_relax() to reduce power consumption and improve performance during the spin-wait - Use unlikely() to optimize branch prediction for the common case - Expand the in-code comment to document the race condition and bounded retry guarantee Fixes: ebf1ccff79c4 ("sched_ext: Fix ops.dequeue() semantics") Suggested-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Samuele Mariotti <smariotti@disroot.org> Signed-off-by: Paolo Valente <paolo.valente@unimore.it> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-05-20sched_ext: Add cmask mask opsTejun Heo
Sub-sched cap code and other upcoming consumers need bulk cmask ops, both mutating (and/or/copy/andnot) and predicate (subset/intersects/empty). cmask_walk_op2() walks the intersection of two ranges word by word; cmask_walk_op1() walks one range. Both are __always_inline and dispatched on a compile-time-constant op enum, so each public entry collapses to a specialized loop with the inner switch reduced to one arm. Two-cmask ops only touch bits in the intersection of the two ranges; bits outside are left unchanged. scx_cmask_or_racy() and scx_cmask_copy_racy() mirror the locking forms but read @src word-by-word through data_race(); callers handle ordering with concurrent writers themselves. v2: Add scx_cmask_empty(). Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-05-20sched_ext: Track bits[] storage size in struct scx_cmaskTejun Heo
scx_cmask carries @base and @nr_cids but not the bits[] allocation size, so helpers reshaping the active range have no way to check it fits and later kfuncs taking caller-provided storage can't validate it. Add @alloc_words (u64 word count) annotated with __counted_by, and split the bit-range API into three helpers: - SCX_CMASK_DEFINE() / __SCX_CMASK_DEFINE() define an on-stack cmask, the latter taking an explicit capacity for oversized storage. SCX_CMASK_DEFINE_SHARD() is a thin wrapper that always reserves SCX_CID_SHARD_MAX_CPUS bits of storage. - scx_cmask_init() / __scx_cmask_init() initialize a cmask, with the same tight-vs-explicit split. - scx_cmask_reframe() reshapes the active range without resizing storage. The BPF mirror (cmask_init / __cmask_init / cmask_reframe) gets the same shape. Add scx_cmask_clear() and scx_cmask_fill() to zero and set the active-range bits respectively. scx_cpumask_to_cmask() uses scx_cmask_clear(); scx_cmask_init() would otherwise re-write @alloc_words on every call. A later patch uses @alloc_words in scx_cmask_ref_shard() to refuse output storage that can't hold the requested shard. v2: Init per-CPU scx_set_cmask_scratch (was zero-init, emitted empty cmasks). Add nr_cids/alloc_cids check in BPF __cmask_init(). (sashiko AI) Widen SCX_CMASK_NR_WORDS()/CMASK_NR_WORDS() to compute in u64 so that @nr_cids near U32_MAX no longer wraps to a small value and bypasses the bounds check in cmask_reframe(). (Andrea) Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-05-20sched_ext: Rename scx_cmask.nr_bits to nr_cidsTejun Heo
struct scx_cmask is a base-windowed bitmap over cid space. Each bit represents one cid, so the count of active bits is the count of cids. The sibling struct scx_cid_shard already uses nr_cids. Rename as a prep so the following patches that grow the cmask API can use the consistent name. v2: Also rename src->nr_bits / dst->nr_bits in cmask_copy_from_kernel(). (sashiko AI) Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-05-19sched/topology: Allow multiple domains to claim sched_domain_sharedK Prateek Nayak
Recent optimizations of sd->shared assignment moved to allocating a single instance of per-CPU sched_domain_shared objects per s_data. Recent optimizations to select_idle_capacity() moved the sd->shared assignments to "sd_asym" domain when ASYM_CPUCAPACITY is detected but cache-aware scheduling mandates the presence of "sd_llc_shared" to compute and cache per-LLC statistics. Use an "alloc_flags" union in sched_domain_shared to claim a sched_domain_shared object per sched_domain. Allocation starts searching for an available / matching sched_domain_shared instance from the first CPU of sched_domain_span(sd) (sd can be sd_llc, or sd_asym). If the shared object is claimed by another domain, the instance corresponding to next CPU in the domain span is explored until a matching / available instance is found. In case of a single CPU in sched_domain_span(), the domain will be degenerated and a temporary overlap of ->shared objects across different domains is acceptable. "alloc_flags" forms a union with "nr_idle_scan" and the stale flags are left as is when the sd->shared is published. The expectation is for the first load balancing instance to correct the value just like the current behavior, except the initial value is no longer 0. Originally-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Andrea Righi <arighi@nvidia.com>
2026-05-19Merge branch 'sched/cache'Peter Zijlstra
Merge the cache aware balancer topic branch. # Conflicts: # kernel/sched/topology.c
2026-05-19sched/rt: Have RT_PUSH_IPI be default off for non PREEMPT_RTSteven Rostedt
RT migration is done aggressively. When a CPU schedules out a high priority RT task for a lower priority task, it will look to see if there's any RT tasks that are waiting to run on another CPU that is of higher priority than the task this CPU is about to run. If it finds one, it will pull that task over to the CPU and allow it to run there instead. Normally, this pulling is done by looking at the RT overloaded mask (rto) which contains all the CPUs in the scheduler domain with RT tasks that are waiting to run due to a higher priority RT task currently running on their CPU. The CPU that is about to schedule a lower priority task will grab the rq lock of the overloaded CPU and move the RT task from that CPU's runqueue to the local one and schedule the higher priority RT task. This caused issues when a lot of CPUs would schedule a lower priority task at the same time. They would all try to grab the same runqueue lock of the CPU with the overloaded RT tasks. Only the first CPU that got in will get that task. All the others would wait until they got the runqueue lock and see there's nothing to pull and do nothing. On systems with lots of CPUs, this caused a large latency (up to 500us) which is beyond what PREEMPT_RT is to allow. The solution to that was to create an RT_PUSH_IPI logic. When any CPU wanted to pull a task, instead of grabbing the runqueue lock of the overloaded CPU, it would start by sending an IPI to the overloaded CPU, and that IPI handler would have the CPU with the waiting RT task do a push instead. Then that handler would send an IPI to the next CPU with overloaded RT tasks, and so on. Note, after the first CPU starts this process, if another CPU wanted to do a pull, it would see that the process has already begun and would only increment a counter to have the IPIs continue again. The RT_PUSH_IPI solved the latency problem with PREEMPT_RT but could cause a new issue with non PREEMPT_RT. Namely, softirqs run in a threaded context on PREEMPT_RT but they can run in an interrupt context in non-RT. If an IPI lands on a CPU that has just woken up multiple RT tasks and the current CPU is running a non RT or a low priority RT task, instead of doing a push, it would simply do a schedule on that CPU. But if a softirq was also executing on this CPU, the schedule would need to wait until the softirq finished. Until then, the CPU would still be considered overloaded as there are RT tasks still waiting to run on it. A live lock occurred on a workload that was doing heavy networking traffic on a large machine where the softirqs would run 500us out of 750us. And it would also be waking up RT tasks, causing the RT pull logic to be constantly executed. When a softirq triggered on a CPU with RT tasks queued but not running yet, and the other CPUs would see this CPU as being overloaded, they would send an IPI over to it. The CPU would notice that the waiting RT tasks are of higher priority than the currently running task and simply schedule that CPU instead. But because the softirq was executing, before it could schedule, it would receive another IPI to do the same. The amount of IPIs would slow down the currently running softirq so much that before it could return back to task context, it would execute another softirq never allowing the CPU to schedule. This live locked that CPU. As RT_PUSH_IPI was created to help PREEMPT_RT, make it default off if PREEMPT_RT is not enabled. Fixes: b6366f048e0c ("sched/rt: Use IPI to trigger RT task push migration instead of pulling") Closes: https://lore.kernel.org/all/20260506235716.2530720-1-tj@kernel.org/ Reported-by: Tejun Heo <tj@kernel.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@vger.kernel.org Link: https://patch.msgid.link/20260515103740.25ccbed8@gandalf.local.home
2026-05-19sched: Switch rq->next_class on proxy_resched_idle()John Stultz
K Prateek noticed we weren't setting the rq->next_class in proxy_resched_idle(), when I was debugging an issue seen with CONFIG_SCHED_PROXY_EXEC and some of Peter's new patches, and suggested this fix. So set rq->next_class when we temporarily switch the donor to idle, so we don't accidentally call wakeup_preempt_fair() with idle as the donor. Suggested-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: John Stultz <jstultz@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260514234732.3170197-1-jstultz@google.com
2026-05-19sched/fair: Add SIS_UTIL support to select_idle_capacity()K Prateek Nayak
Add to select_idle_capacity() the same SIS_UTIL-controlled idle-scan mechanism, already used by select_idle_cpu(): when sched_feat(SIS_UTIL) is enabled and the LLC domain has sched_domain_shared data, derive the per-attempt scan limit from sd->shared->nr_idle_scan. That bounds the walk on large LLCs: once nr_idle_scan is exhausted, return the best CPU seen so far. The early exit is gated on !has_idle_core so an active idle-core search (SMT with idle cores reported by test_idle_cores()) isn't cut short before it gets a chance to find one. Co-developed-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://patch.msgid.link/20260509180955.1840064-6-arighi@nvidia.com
2026-05-19sched/fair: Reject misfit pulls onto busy SMT siblings on asym-capacityAndrea Righi
When SD_ASYM_CPUCAPACITY load balancing considers pulling a misfit task, capacity_of(dst_cpu) can overstate available compute if the SMT sibling is busy: the core does not deliver its full nominal capacity. If SMT is active and dst_cpu is not on a fully idle core, skip this destination so we do not migrate a misfit expecting a capacity upgrade we cannot actually provide. Reported-by: Felix Abecassis <fabecassis@nvidia.com> Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://patch.msgid.link/20260509180955.1840064-5-arighi@nvidia.com
2026-05-19sched/fair: Prefer fully-idle SMT cores in asym-capacity idle selectionAndrea Righi
On systems with asymmetric CPU capacity (e.g., ACPI/CPPC reporting different per-core frequencies), the wakeup path uses select_idle_capacity() and prioritizes idle CPUs with higher capacity for better task placement. However, when those CPUs belong to SMT cores, their effective capacity can be much lower than the nominal capacity when the sibling thread is busy: SMT siblings compete for shared resources, so a "high capacity" CPU that is idle but whose sibling is busy does not deliver its full capacity. This effective capacity reduction cannot be modeled by the static capacity value alone. Introduce SMT awareness in the asym-capacity idle selection policy: when SMT is active, always prefer fully-idle SMT cores over partially-idle ones. Prioritizing fully-idle SMT cores yields better task placement because the effective capacity of partially-idle SMT cores is reduced; always preferring them when available leads to more accurate capacity usage on task wakeup. On an SMT system with asymmetric CPU capacities (NVIDIA Vera Rubin), SMT-aware idle selection has been shown to improve throughput by around 15-18% over NO_ASYM mainline and by around 60% over ASYM mainline, for CPU-bound workloads (NVBLAS) running an amount of tasks equal to the amount of SMT cores. Reported-by: Felix Abecassis <fabecassis@nvidia.com> Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com> Link: https://patch.msgid.link/20260511142502.3873984-1-arighi@nvidia.com
2026-05-19sched/fair: Attach sched_domain_shared to sd_asym_cpucapacityK Prateek Nayak
On asymmetric CPU capacity systems, the wakeup path uses select_idle_capacity(), which scans the span of sd_asym_cpucapacity rather than sd_llc. The has_idle_cores hint however lives on sd_llc->shared, so the wakeup-time read of has_idle_cores operates on an LLC-scoped blob while the actual scan/decision spans the asym domain; nr_busy_cpus also lives in the same shared sched_domain data, but it's never used in the asym CPU capacity scenario. Therefore, move the sched_domain_shared object to sd_asym_cpucapacity whenever the CPU has a SD_ASYM_CPUCAPACITY_FULL ancestor and that ancestor is non-overlapping (i.e., not built from SD_NUMA). In that case the scope of has_idle_cores matches the scope of the wakeup scan. Fall back to attaching the shared object to sd_llc in three cases: 1) plain symmetric systems (no SD_ASYM_CPUCAPACITY_FULL anywhere); 2) CPUs in an exclusive cpuset that carves out a symmetric capacity island: has_asym is system-wide but those CPUs have no SD_ASYM_CPUCAPACITY_FULL ancestor in their hierarchy and follow the symmetric LLC path in select_idle_sibling(); 3) exotic topologies where SD_ASYM_CPUCAPACITY_FULL lands on an SD_NUMA-built domain. init_sched_domain_shared() keys the shared blob off cpumask_first(span), which on overlapping NUMA domains would alias unrelated spans onto the same blob. Keep the shared object on the LLC there; select_idle_capacity() gracefully skips the has_idle_cores preference when sd->shared is NULL. While at it, also rename the per-CPU sd_llc_shared to sd_balance_shared, as it is no longer strictly tied to the LLC. Co-developed-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com> Acked-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://patch.msgid.link/20260516055850.1345932-1-arighi@nvidia.com
2026-05-19sched/fair: Drop redundant RCU read lock in NOHZ kick pathAndrea Righi
nohz_balancer_kick() is reached from sched_balance_trigger(), which is called from sched_tick(). sched_tick() runs with IRQs disabled, so the additional rcu_read_lock/unlock() used around sched_domain accesses in this path is redundant. Rely on the existing IRQ-disabled context (and the rcu_dereference_all() checking) instead. The same applies to set_cpu_sd_state_idle(), called from the idle entry path with IRQs disabled, and to set_cpu_sd_state_busy(), reachable via nohz_balance_exit_idle() from two contexts: nohz_balancer_kick() (IRQs disabled, as above) and sched_cpu_deactivate() (the CPUHP_AP_ACTIVE teardown, which runs under cpus_write_lock(), so it cannot race with sched-domain rebuilds). In both cases the rcu_dereference_all() validation is sufficient. No functional change intended. Suggested-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://patch.msgid.link/20260509180955.1840064-2-arighi@nvidia.com
2026-05-19sched: Unify SMT active check via sched_smt_active()Shrikanth Hegde
There is a use of sched_smt_active() and explicit use of sched_smt_present. Remove the explicit usage for better code maintenance and readability. Note that this differs slightly for update_idle_core. It used to call static_branch_unlikely earlier and now it will call static_branch_likely. Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Link: https://patch.msgid.link/20260515172456.542799-5-sshegde@linux.ibm.com
2026-05-19sched/fair: Add sched_smt_active check for fastpathsShrikanth Hegde
For fastpaths such as wakeup and load balance even minimal code additions can add up. is_core_idle is accessed during load balance. Other callsites of is_core_idle make sched_smt_active() check first. Make the same check in should_we_balance. Rest of access to cpu_smt_mask isn't in fastpath. Note: Remove the stale comment above is_core_idle. Enqueue methods of fair aren't close to it anymore. Suggested-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Link: https://patch.msgid.link/20260515172456.542799-4-sshegde@linux.ibm.com
2026-05-19sched: Simplify ifdeffery around cpu_smt_maskShrikanth Hegde
Now, that cpu_smt_mask is defined as cpumask_of(cpu) for CONFIG_SCHED_SMT=n, it is possible to get rid of the ifdeffery. Effectively, - This makes sched_smt_present is defined always - cpumask_weight(cpumask_of(cpu)) == 1. So sched_smt_present_inc/dec will never enable the sched_smt_present. Which is expected. - Paths that were compile-time eliminated become runtime guarded using static keys. - Defines set_idle_cores, test_idle_cores, etc which could likely benefit the CONFIG_SCHED_SMT=n systems to use the same optimizations within the LLC at wakeups. - This will expose sched_smt_present symbol for CONFIG_SCHED_SMT=n. Likely not a concern. - There is a bloat of code CONFIG_SCHED_SMT=n. (NR_CPUS=2048) add/remove: 24/18 grow/shrink: 26/28 up/down: 6396/-3188 (3208) Total: Before=30629880, After=30633088, chg +0.01% - No code bloat for CONFIG_SCHED_SMT=y, which is expected. - Add comments around stop_core_cpuslocked on why ifdefs are not removed. - This leaves the remaining uses of CONFIG_SCHED_SMT mainly for topology building bits which has a policy based decision. Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Phil Auld <pauld@redhat.com> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Link: https://patch.msgid.link/20260515172456.542799-3-sshegde@linux.ibm.com
2026-05-19sched/fair: Update util_est after updating util_avg during dequeueVincent Guittot
util_est_update() must be called after updating util_avg during the dequeue of a task and only when the task is not delayed dequeue. Move util_est_update() in update_load_avg(). Fixes: b55945c500c5 ("sched: Fix pick_next_task_fair() vs try_to_wake_up() race") Closes: https://lore.kernel.org/all/20260512124653.305275-1-qyousef@layalina.io/ Reported-by: Qais Yousef <qyousef@layalina.io> Reviewed-and-tested-by: Qais Yousef <qyousef@layalina.io> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260518102345.268452-1-vincent.guittot@linaro.org
2026-05-19sched/cputime: Drop now-stale mul_u64_u64_div_u64() over-approximation guardNicolas Pitre
Commit 77baa5bafcbe ("sched/cputime: Fix mul_u64_u64_div_u64() precision for cputime") added a clamp in cputime_adjust(): if (unlikely(stime > rtime)) stime = rtime; The justification was that mul_u64_u64_div_u64() could over-approximate on some architectures (notably arm64 and the old 32-bit fallback), so the mathematically impossible stime > rtime was nevertheless reachable and would underflow utime = rtime - stime. That premise no longer holds. Commit b29a62d87cc0 ("mul_u64_u64_div_u64: make it precise always") replaced the fallback implementation with an exact 128-bit long division, and the x86_64 inline asm already produced exact results. The helper now returns the mathematically correct floor(a*b/d) on every architecture, so stime <= rtime is guaranteed by stime <= stime + utime and the clamp is dead code. Remove it along with its stale comment. Signed-off-by: Nicolas Pitre <nico@fluxnic.net> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260514202629.673539-1-nico@fluxnic.net
2026-05-19sched/deadline: Fix replenishment logic for non-deferred serversYuri Andriaccio
Enqueue and replenish non-deferred deadline servers when their runtime is exhausted and the replenishment timer could not be started because it is too close to the wake-up instant. Signed-off-by: Yuri Andriaccio <yurand2000@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260430213835.62217-2-yurand2000@gmail.com
2026-05-19sched/rt: Update default bandwidth for real-time tasks to ONEYuri Andriaccio
Set the default total bandwidth for SCHED_DEADLINE tasks and servers to ONE. FIFO/RR tasks are already throttled by fair-servers and ext-servers, and the sysctl_sched_rt_runtime parameter now only defines the total bw that is allowed to deadline entities. Signed-off-by: Yuri Andriaccio <yurand2000@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260430213835.62217-22-yurand2000@gmail.com
2026-05-18sched/cache: Fix stale preferred_llc for a new taskChen Yu
On fork without CLONE_VM, the child gets a new mm, the parent's preferred_llc value is stale for the child. Fix this by resetting the task's preferred_llc to -1. This bug was reported by sashiko. Fixes: 47d8696b95f7 ("sched/cache: Assign preferred LLC ID to processes") Signed-off-by: Chen Yu <yu.c.chen@intel.com> Co-developed-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/0ec7309d0e24ede97656754d1505b7490403d966.1778703694.git.tim.c.chen@linux.intel.com
2026-05-18sched/cache: Fix has_multi_llcs iff at least one partition has multiple LLCsChen Yu
sched_cache_present is a global static key, but build_sched_domains() is called per partition from the "Build new domains" loop in partition_sched_domains_locked(). Each call unconditionally sets the key based solely on the has_multi_llcs local variable for that partition. The call to the last partition set the value even when there are previous partitions with multiple LLCs. If partition A (multi-LLC) is built first, the key is enabled. Then when partition B (single-LLC) is built, the key is disabled. The multi-LLC partition A is still active but the key is now off. Fix it by doing a similar thing as sched_energy_present: check the multi-LLCs during the iteration over all the partitions rather than checking it on a single partition. This bug was reported by sashiko. Fixes: d59f4fd1d303 ("sched/cache: Enable cache aware scheduling for multi LLCs NUMA node") Signed-off-by: Chen Yu <yu.c.chen@intel.com> Co-developed-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/c541af2547d54509fbfd3b3a1e8072e2e5c7ff68.1778703694.git.tim.c.chen@linux.intel.com
2026-05-18sched/cache: Fix cache aware scheduling enabling for multi LLCs systemChen Yu
If there are multiple LLCs in the system, cache aware scheduling should be enabled. However, there is a corner case where, if there is a single NUMA node and a single LLC per node, cache aware scheduling will be turned on in the current implementation - because at this moment, the parent domain has not yet been degenerated, and it is possible that the current domain has the same cpu span as its parent. There is no need to turn cache aware scheduling on in this scenario. Fix it by iterating the parent domains to find a domain that is a superset of the current sd_llc, so that later, after the duplicated parent domains have been degenerated, cache aware scheduling will take effect. For example, the expected behavior would be: 2 sockets, 1 LLC per socket: MC span=0-3, PKG span=0-7, has_multi_llcs=true 1 socket, 2 LLCs per socket: MC span=0-3, PKG span=0-7, has_multi_llcs=true 2 sockets, 2 LLCs per socket: MC span=0-3, PKG span=0-7, has_multi_llcs=true 1 socket, 1 LLC per socket: MC span=0-3, PKG span=0-3, has_multi_llcs=false This bug was reported by sashiko. Fixes: d59f4fd1d303 ("sched/cache: Enable cache aware scheduling for multi LLCs NUMA node") Signed-off-by: Chen Yu <yu.c.chen@intel.com> Co-developed-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/6328a8a7f40925cec2a712d81ee58128a4c4444a.1778703694.git.tim.c.chen@linux.intel.com
2026-05-18sched/cache: Fix race condition during sched domain rebuildChen Yu
sched_cache_active_set_unlocked() checks hardware support without locks: static void sched_cache_active_set(bool locked) { /* hardware does not support */ if (!static_branch_likely(&sched_cache_present)) { _sched_cache_active_set(false, locked); return; } ... If build_sched_domains() runs concurrently during CPU hotplug, it can disable sched_cache_present under sched_domains_mutex and the CPU hotplug lock. If a debugfs write thread evaluates sched_cache_present as true right before that, and then blocks or gets preempted, it might proceed to enable sched_cache_active after the hardware support has been marked as absent. Make it safer by acquiring cpus_read_lock() and sched_domains_mutex_lock() when the user changes sched_cache_active via debugfs. This bug was reported by sashiko. Fixes: 067a31358143 ("sched/cache: Allow the user space to turn on and off cache aware scheduling") Signed-off-by: Chen Yu <yu.c.chen@intel.com> Co-developed-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/9afddf439687f04bb56b46625bd9f153eb8abad5.1778703694.git.tim.c.chen@linux.intel.com
2026-05-18sched/cache: Fix checking active load balance by only considering the CFS taskChen Yu
The currently running task cur may not be a CFS task, such as an RT or Deadline task. For non-CFS tasks, the task_util(cur) utilization average is not maintained, so this might pass a stale or meaningless value to can_migrate_llc(). Check if the task is CFS before getting its task_util(). This bug was reported by sashiko. Fixes: 714059f79ff0 ("sched/cache: Handle moving single tasks to/from their preferred LLC") Signed-off-by: Chen Yu <yu.c.chen@intel.com> Co-developed-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/f9161133cf040d286dca11344a112c5ef2a5253d.1778703694.git.tim.c.chen@linux.intel.com