summaryrefslogtreecommitdiff
path: root/kernel/sched
AgeCommit message (Collapse)Author
2026-01-15sched: Fix build for modules using set_tsk_need_resched()Gabriele Monaco
Commit adcc3bfa8806 ("sched: Adapt sched tracepoints for RV task model") added a tracepoint to the need_resched action that can be triggered also by set_tsk_need_resched. This function was previously accessible from out-of-tree modules but it's no longer available because the __trace_set_need_resched() symbol is not exported (together with the tracepoint itself, which was exported in a separate patch) and building such modules fails. Export __trace_set_need_resched to modules to fix those build issues. Fixes: adcc3bfa8806 ("sched: Adapt sched tracepoints for RV task model") Signed-off-by: Gabriele Monaco <gmonaco@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Phil Auld <pauld@redhat.com> Link: https://patch.msgid.link/20260112140413.362202-1-gmonaco@redhat.com
2026-01-15sched/deadline: Use ENQUEUE_MOVE to allow priority changePeter Zijlstra
Pierre reported hitting balance callback warnings for deadline tasks after commit 6455ad5346c9 ("sched: Move sched_class::prio_changed() into the change pattern"). It turns out that DEQUEUE_SAVE+ENQUEUE_RESTORE does not preserve DL priority and subsequently trips a balance pass -- where one was not expected. From discussion with Juri and Luca, the purpose of this clause was to deal with tasks new to DL and all those sites will have MOVE set (as well as CLASS, but MOVE is move conservative at this point). Per the previous patches MOVE is audited to always run the balance callbacks, so switch enqueue_dl_entity() to use MOVE for this case. Fixes: 6455ad5346c9 ("sched: Move sched_class::prio_changed() into the change pattern") Reported-by: Pierre Gondois <pierre.gondois@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Pierre Gondois <pierre.gondois@arm.com> Tested-by: Juri Lelli <juri.lelli@redhat.com> Link: https://patch.msgid.link/20260114130528.GB831285@noisy.programming.kicks-ass.net
2026-01-15sched: Deadline has dynamic priorityPeter Zijlstra
While FIFO/RR have static priority, DEADLINE is a dynamic priority scheme. Notably it has static priority -1. Do not assume the priority doesn't change for deadline tasks just because the static priority doesn't change. This ensures DL always sees {DE,EN}QUEUE_MOVE where appropriate. Fixes: ff77e4685359 ("sched/rt: Fix PI handling vs. sched_setscheduler()") Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Pierre Gondois <pierre.gondois@arm.com> Tested-by: Juri Lelli <juri.lelli@redhat.com> Link: https://patch.msgid.link/20260114130528.GB831285@noisy.programming.kicks-ass.net
2026-01-15sched: Audit MOVE vs balance_callbacksPeter Zijlstra
The {DE,EN}QUEUE_MOVE flag indicates a task is allowed to change priority, which means there could be balance callbacks queued. Therefore audit all MOVE users and make sure they do run balance callbacks before dropping rq-lock. Fixes: 6455ad5346c9 ("sched: Move sched_class::prio_changed() into the change pattern") Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Pierre Gondois <pierre.gondois@arm.com> Tested-by: Juri Lelli <juri.lelli@redhat.com> Link: https://patch.msgid.link/20260114130528.GB831285@noisy.programming.kicks-ass.net
2026-01-15sched: Fold rq-pin swizzle into __balance_callbacks()Peter Zijlstra
Prepare for more users needing the rq-pin swizzle. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Pierre Gondois <pierre.gondois@arm.com> Tested-by: Juri Lelli <juri.lelli@redhat.com> Link: https://patch.msgid.link/20260114130528.GB831285@noisy.programming.kicks-ass.net
2026-01-15sched/deadline: Avoid double update_rq_clock()Peter Zijlstra
When setup_new_dl_entity() is called from enqueue_task_dl() -> enqueue_dl_entity(), the rq-clock should already be updated, and calling update_rq_clock() again is not right. Move the update_rq_clock() to the one other caller of setup_new_dl_entity(): sched_init_dl_server(). Fixes: 9f239df55546 ("sched/deadline: Initialize dl_servers after SMP") Reported-by: Pierre Gondois <pierre.gondois@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Pierre Gondois <pierre.gondois@arm.com> Link: https://patch.msgid.link/20260113115622.GA831285@noisy.programming.kicks-ass.net
2026-01-15sched/deadline: Ensure get_prio_dl() is up-to-datePeter Zijlstra
Pratheek tripped a WARN and noted the following issue: > Inspecting the set of events that led to the warning being triggered > showed the following: > > systemd-1 [008] dN.31 ...: do_set_cpus_allowed: set_cpus_allowed begin! > > systemd-1 [008] dN.31 ...: sched_change_begin: Begin! > systemd-1 [008] dN.31 ...: sched_change_begin: Before dequeue_task()! > systemd-1 [008] dN.31 ...: update_curr_dl_se: update_curr_dl_se: ENQUEUE_REPLENISH > systemd-1 [008] dN.31 ...: enqueue_dl_entity: enqueue_dl_entity: ENQUEUE_REPLENISH > systemd-1 [008] dN.31 ...: replenish_dl_entity: Replenish before: 14815760217 > systemd-1 [008] dN.31 ...: replenish_dl_entity: Replenish after: 14816960047 > systemd-1 [008] dN.31 ...: sched_change_begin: Before put_prev_task()! > > systemd-1 [008] dN.31 ...: sched_change_end: Before enqueue_task()! > systemd-1 [008] dN.31 ...: sched_change_end: Before put_prev_task()! > systemd-1 [008] dN.31 ...: prio_changed_dl: Queuing pull task on prio change: 14815760217 -> 14816960047 > systemd-1 [008] dN.31 ...: prio_changed_dl: Queuing balance callback! > systemd-1 [008] dN.31 ...: sched_change_end: End! > > systemd-1 [008] dN.31 ...: do_set_cpus_allowed: set_cpus_allowed end! > systemd-1 [008] dN.21 ...: __schedule: Woops! Balance callback found! > > 1. sched_change_begin() from guard(sched_change) in > do_set_cpus_allowed() stashes the priority, which for the deadline > task, is "p->dl.deadline". > 2. The dequeue of the deadline task replenishes the deadline. > 3. The task is enqueued back after guard's scope ends and since there is > no *_CLASS flags set, sched_change_end() calls > dl_sched_class->prio_changed() which compares the deadline. > 4. Since deadline was moved on dequeue, prio_changed_dl() sees the value > differ from the stashed value and queues a balance pull callback. > 5. do_set_cpus_allowed() finishes and drops the rq_lock without doing a > do_balance_callbacks(). > 6. Grabbing the rq_lock() at subsequent __schedule() triggers the > warning since the balance pull callback was never executed before > dropping the lock. Meaning get_prio_dl() ought to update current and return an up-to-date value. Fixes: 6455ad5346c9 ("sched: Move sched_class::prio_changed() into the change pattern") Reported-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Link: https://patch.msgid.link/20260106104113.GX3707891@noisy.programming.kicks-ass.net
2026-01-14Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf after rc5Alexei Starovoitov
Cross-merge BPF and other fixes after downstream PR. No conflicts. Adjacent: Auto-merging MAINTAINERS Auto-merging Makefile Auto-merging kernel/bpf/verifier.c Auto-merging kernel/sched/ext.c Auto-merging mm/memcontrol.c Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-01-13sched: Export hidden tracepoints to modulesGabriele Monaco
The tracepoints sched_entry, sched_exit and sched_set_need_resched are not exported to tracefs as trace events, this allows only kernel code to access them. Helper modules like [1] can be used to still have the tracepoints available to ftrace for debugging purposes, but they do rely on the tracepoints being exported. Export the 3 not exported tracepoints. Note that sched_set_state is already exported as the macro is called from modules. [1] - https://github.com/qais-yousef/sched_tp.git Fixes: adcc3bfa8806 ("sched: Adapt sched tracepoints for RV task model") Signed-off-by: Gabriele Monaco <gmonaco@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Phil Auld <pauld@redhat.com> Link: https://patch.msgid.link/20251205131621.135513-9-gmonaco@redhat.com
2026-01-13sched/deadline: Fix server stopping with runnable tasksGabriele Monaco
The deadline server can currently stop due to idle although fair tasks are runnable. This happens essentially when: * the server is set to idle, a task wakes up, the server stops * a task wakes up, the server sets itself to idle and stops right away Address both cases by clearing the server idle flag whenever a fair task wakes up and accounting also for pending tasks in the definition of idle. Fixes: f5a538c07df2 ("sched/deadline: Fix dl_server stop condition") Signed-off-by: Gabriele Monaco <gmonaco@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260113085159.114226-3-gmonaco@redhat.com
2026-01-13sched: Provide idle_rq() helperPeter Zijlstra
A fix for the dl_server 'requires' idle_cpu() usage, which made me note that it and available_idle_cpu() are extern function calls. And while idle_cpu() is used outside of kernel/sched/, available_idle_cpu() is not. This makes it hard to make idle_cpu() an inline helper, so provide idle_rq() and implement idle_cpu() and available_idle_cpu() using that. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2026-01-13sched/deadline: Fix potential race in dl_add_task_root_domain()Pingfan Liu
The access rule for local_cpu_mask_dl requires it to be called on the local CPU with preemption disabled. However, dl_add_task_root_domain() currently violates this rule. Without preemption disabled, the following race can occur: 1. ThreadA calls dl_add_task_root_domain() on CPU 0 2. Gets pointer to CPU 0's local_cpu_mask_dl 3. ThreadA is preempted and migrated to CPU 1 4. ThreadA continues using CPU 0's local_cpu_mask_dl 5. Meanwhile, the scheduler on CPU 0 calls find_later_rq() which also uses local_cpu_mask_dl (with preemption properly disabled) 6. Both contexts now corrupt the same per-CPU buffer concurrently Fix this by moving the local_cpu_mask_dl access to the preemption disabled section. Closes: https://lore.kernel.org/lkml/aSBjm3mN_uIy64nz@jlelli-thinkpadt14gen4.remote.csb Fixes: 318e18ed22e8 ("sched/deadline: Walk up cpuset hierarchy to decide root domain when hot-unplug") Reported-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: Pingfan Liu <piliu@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Juri Lelli <juri.lelli@redhat.com> Acked-by: Waiman Long <longman@redhat.com> Link: https://patch.msgid.link/20251125032630.8746-3-piliu@redhat.com
2026-01-13sched/deadline: Remove unnecessary comment in dl_add_task_root_domain()Pingfan Liu
The comments above dl_get_task_effective_cpus() and dl_add_task_root_domain() already explain how to fetch a valid root domain and protect against races. There's no need to repeat this inside dl_add_task_root_domain(). Remove the redundant comment to keep the code clean. No functional change. Signed-off-by: Pingfan Liu <piliu@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Juri Lelli <juri.lelli@redhat.com> Acked-by: Waiman Long <longman@redhat.com> Link: https://patch.msgid.link/20251125032630.8746-2-piliu@redhat.com
2026-01-12sched: Move clock related paravirt code to kernel/schedJuergen Gross
Paravirt clock related functions are available in multiple archs. In order to share the common parts, move the common static keys to kernel/sched/ and remove them from the arch specific files. Make a common paravirt_steal_clock() implementation available in kernel/sched/cputime.c, guarding it with a new config option CONFIG_HAVE_PV_STEAL_CLOCK_GEN, which can be selected by an arch in case it wants to use that common variant. Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260105110520.21356-7-jgross@suse.com
2026-01-12paravirt: Remove asm/paravirt_api_clock.hJuergen Gross
All architectures supporting CONFIG_PARAVIRT share the same contents of asm/paravirt_api_clock.h: #include <asm/paravirt.h> So remove all incarnations of asm/paravirt_api_clock.h and remove the only place where it is included, as there asm/paravirt.h is included anyway. Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com> # powerpc, scheduler bits Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260105110520.21356-6-jgross@suse.com
2026-01-11Merge tag 'sched-urgent-2026-01-11' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fix from Ingo Molnar: "Fix a crash in sched_mm_cid_after_execve()" * tag 'sched-urgent-2026-01-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/mm_cid: Prevent NULL mm dereference in sched_mm_cid_after_execve()
2026-01-11treewide: Update email addressThomas Gleixner
In a vain attempt to consolidate the email zoo switch everything to the kernel.org account. Signed-off-by: Thomas Gleixner <tglx@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-01-09sched/mm_cid: Prevent NULL mm dereference in sched_mm_cid_after_execve()Cong Wang
sched_mm_cid_after_execve() is called in bprm_execve()'s cleanup path even when exec_binprm() fails. For the init task's first execve(), this causes a problem: 1. current->mm is NULL (kernel threads don't have an mm) 2. sched_mm_cid_before_execve() exits early because mm is NULL 3. exec_binprm() fails (e.g., ENOENT for missing script interpreter) 4. sched_mm_cid_after_execve() is called with mm still NULL 5. sched_mm_cid_fork() is called unconditionally, triggering WARN_ON This is easily reproduced by booting with an init that is a shell script (#!/bin/sh) where the interpreter doesn't exist in the initramfs. Fix this by checking if t->mm is NULL before calling sched_mm_cid_fork(), matching the behavior of sched_mm_cid_before_execve() which already handles this case via sched_mm_cid_exit()'s early return. Fixes: b0c3d51b54f8 ("sched/mmcid: Provide precomputed maximal value") Signed-off-by: Cong Wang <cwang@multikernel.io> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Acked-by: Will Deacon <will@kernel.org> Link: https://patch.msgid.link/20251223215113.639686-1-xiyou.wangcong@gmail.com
2026-01-08sched: Further restrict the preemption modesPeter Zijlstra
The introduction of PREEMPT_LAZY was for multiple reasons: - PREEMPT_RT suffered from over-scheduling, hurting performance compared to !PREEMPT_RT. - the introduction of (more) features that rely on preemption; like folio_zero_user() which can do large memset() without preemption checks. (Xen already had a horrible hack to deal with long running hypercalls) - the endless and uncontrolled sprinkling of cond_resched() -- mostly cargo cult or in response to poor to replicate workloads. By moving to a model that is fundamentally preemptable these things become managable and avoid needing to introduce more horrible hacks. Since this is a requirement; limit PREEMPT_NONE to architectures that do not support preemption at all. Further limit PREEMPT_VOLUNTARY to those architectures that do not yet have PREEMPT_LAZY support (with the eventual goal to make this the empty set and completely remove voluntary preemption and cond_resched() -- notably VOLUNTARY is already limited to !ARCH_NO_PREEMPT.) This leaves up-to-date architectures (arm64, loongarch, powerpc, riscv, s390, x86) with only two preemption models: full and lazy. While Lazy has been the recommended setting for a while, not all distributions have managed to make the switch yet. Force things along. Keep the patch minimal in case of hard to address regressions that might pop up. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Link: https://patch.msgid.link/20251219101502.GB1132199@noisy.programming.kicks-ass.net
2026-01-08sched: Reorder some fields in struct rqBlake Jones
This colocates some hot fields in "struct rq" to be on the same cache line as others that are often accessed at the same time or in similar ways. Using data from a Google-internal fleet-scale profiler, I found three distinct groups of hot fields in struct rq: - (1) The runqueue lock: __lock. - (2) Those accessed from hot code in pick_next_task_fair(): nr_running, nr_numa_running, nr_preferred_running, ttwu_pending, cpu_capacity, curr, idle. - (3) Those accessed from some other hot codepaths, e.g. update_curr(), update_rq_clock(), and scheduler_tick(): clock_task, clock_pelt, clock, lost_idle_time, clock_update_flags, clock_pelt_idle, clock_idle. The cycles spent on accessing these different groups of fields broke down roughly as follows: - 50% on group (1) (the runqueue lock, always read-write) - 39% on group (2) (load:store ratio around 38:1) - 8% on group (3) (load:store ratio around 5:1) - 3% on all the other fields Most of the fields in group (3) are already in a cache line grouping; this patch just adds "clock" and "clock_update_flags" to that group. The fields in group (2) are scattered across several cache lines; the main effect of this patch is to group them together, on a single line at the beginning of the structure. A few other less performance-critical fields (nr_switches, numa_migrate_on, has_blocked_load, nohz_csd, last_blocked_load_update_tick) were also reordered to reduce holes in the data structure. Since the runqueue lock is acquired from so many different contexts, and is basically always accessed using an atomic operation, putting it on either of the cache lines for groups (2) or (3) would slow down accesses to those fields dramatically, since those groups are read-mostly accesses. To test this, I wrote a focused load test that would put load on the pick_next_task_fair() path. A parent process would fork many child processes, and each child would nanosleep() for 1 msec many times in a loop. The load test was monitored with "perf", and I looked at the amount of cycles that were spent with sched_balance_rq() on the stack. The test was reliably spending ~5% of all of its cycles there. I ran it 100 times on a pair of 2-socket Intel Haswell machines (72 vCPUs per machine) - one running the tip of sched/core, the other running this change - using 360 child processes and 8192 1-msec sleeps per child. The mean cycle count dropped from 5.14B to 4.91B, or a *4.6% decrease* in relevant scheduler cycles. Given that this change reduces cache misses in a very hot kernel codepath, there's likely to be additional application performance improvement due to reduced cache conflicts from kernel data structures. On a Power11 system with 128-byte cache lines, my test showed a ~5% decrease in relevant scheduler cycles, along with a slight increase in user time - both positive indicators. This data comes from https://lore.kernel.org/lkml/affdc6b1-9980-44d1-89db-d90730c1e384@linux.ibm.com/ This is the case even though the additional "____cacheline_aligned" that puts the runqueue lock on the next cache line adds an additional 64 bytes of padding on those machines. This patch does not change the size of "struct rq" on machines with 64-byte cache lines. I also ran "hackbench" to try to test this change, but it didn't show conclusive results. Looking at a CPU cycle profile of the hackbench run, it was spending 95% of its cycles inside __alloc_skb(), __kfree_skb(), or kmem_cache_free() - almost all of which was spent updating memcg counters or contending on the list_lock in kmem_cache_node. And it spent less than 0.5% of its cycles inside either schedule() or try_to_wake_up(). So it's not surprising that it didn't show useful results here. The "__no_randomize_layout" was added to reflect the fact that performance of code that references this data structure is unusually sensitive to placement of its members. Signed-off-by: Blake Jones <blakejones@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Madadi Vineeth Reddy <vineethr@linux.ibm.com> Reviewed-by: Josh Don <joshdon@google.com> Tested-by: Madadi Vineeth Reddy <vineethr@linux.ibm.com> Link: https://patch.msgid.link/20251202023743.1524247-1-blakejones@google.com
2026-01-08sched/fair: Use cpumask_weight_and() in sched_balance_find_dst_group()Yury Norov (NVIDIA)
In the group_has_spare case, the function creates a temporary cpumask to just calculate weight of (p->cpus_ptr & sched_group_span(local)). We've got a dedicated helper for it. Signed-off-by: Yury Norov (NVIDIA) <yury.norov@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com> Reviewed-by: Fernand Sieber <sieberf@amazon.com> Link: https://patch.msgid.link/20251207034247.402926-1-yury.norov@gmail.com
2026-01-08sched/fair: Simplify task_numa_find_cpu()Yury Norov (NVIDIA)
Use for_each_cpu_and() and drop some housekeeping code. Signed-off-by: Yury Norov (NVIDIA) <yury.norov@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com> Reviewed-by: Phil Auld <pauld@redhat.com> Link: https://patch.msgid.link/20251207033037.399608-1-yury.norov@gmail.com
2026-01-08sched/fair: Drop useless cpumask_empty() in find_energy_efficient_cpu()Yury Norov (NVIDIA)
cpumask_empty() call is O(N) and useless because the previous cpumask_and() returns false for empty 'cpus'. Drop it. Signed-off-by: Yury Norov (NVIDIA) <yury.norov@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Madadi Vineeth Reddy <vineethr@linux.ibm.com> Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://patch.msgid.link/20251207040543.407695-1-yury.norov@gmail.com
2026-01-05sched: Enable context analysis for core.c and fair.cMarco Elver
This demonstrates a larger conversion to use Clang's context analysis. The benefit is additional static checking of locking rules, along with better documentation. Notably, kernel/sched contains sufficiently complex synchronization patterns, and application to core.c & fair.c demonstrates that the latest Clang version has become powerful enough to start applying this to more complex subsystems (with some modest annotations and changes). Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20251219154418.3592607-37-elver@google.com
2026-01-02bpf: Remove redundant KF_TRUSTED_ARGS flag from all kfuncsPuranjay Mohan
Now that KF_TRUSTED_ARGS is the default for all kfuncs, remove the explicit KF_TRUSTED_ARGS flag from all kfunc definitions and remove the flag itself. Acked-by: Eduard Zingerman <eddyz87@gmail.com> Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Signed-off-by: Puranjay Mohan <puranjay@kernel.org> Link: https://lore.kernel.org/r/20260102180038.2708325-3-puranjay@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-12-22sched_ext: Avoid multiple irq_work_queue() calls in destroy_dsq()Zqiang
llist_add() returns true only when adding to an empty list, which indicates that no IRQ work is currently queued or running. Therefore, we only need to call irq_work_queue() when llist_add() returns true, to avoid unnecessarily re-queueing IRQ work that is already pending or executing. Signed-off-by: Zqiang <qiang.zhang@linux.dev> Reviewed-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-12-22sched_ext: Use the resched_cpu() to replace resched_curr() in the ↵Zqiang
bypass_lb_node() For the PREEMPT_RT kernels, the scx_bypass_lb_timerfn() running in the preemptible per-CPU ktimer kthread context, this means that the following scenarios will occur(for x86 platform): cpu1 cpu2 ktimer kthread: ->scx_bypass_lb_timerfn ->bypass_lb_node ->for_each_cpu(cpu, resched_mask) migration/1: by preempt by migration/2: multi_cpu_stop() multi_cpu_stop() ->take_cpu_down() ->__cpu_disable() ->set cpu1 offline ->rq1 = cpu_rq(cpu1) ->resched_curr(rq1) ->smp_send_reschedule(cpu1) ->native_smp_send_reschedule(cpu1) ->if(unlikely(cpu_is_offline(cpu))) { WARN(1, "sched: Unexpected reschedule of offline CPU#%d!\n", cpu); return; } This commit therefore use the resched_cpu() to replace resched_curr() in the bypass_lb_node() to avoid send-ipi to offline CPUs. Signed-off-by: Zqiang <qiang.zhang@linux.dev> Reviewed-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-12-19sched_ext: Fix some comments in ext.cZqiang
This commit update balance_scx() in the comments to balance_one(). Signed-off-by: Zqiang <qiang.zhang@linux.dev> Reviewed-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-12-19sched/fair: Fix sched_avg foldPeter Zijlstra
After the robot reported a regression wrt commit: 089d84203ad4 ("sched/fair: Fold the sched_avg update"), Shrikanth noted that two spots missed a factor se_weight(). Fixes: 089d84203ad4 ("sched/fair: Fold the sched_avg update") Reported-by: kernel test robot <oliver.sang@intel.com> Closes: https://lore.kernel.org/oe-lkp/202512181208.753b9f6e-lkp@intel.com Debugged-by: Shrikanth Hegde <sshegde@linux.ibm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20251218102020.GO3707891@noisy.programming.kicks-ass.net
2025-12-17sched: Fix faulty assertion in sched_change_end()Peter Zijlstra
Commit 47efe2ddccb1f ("sched/core: Add assertions to QUEUE_CLASS") added an assert to sched_change_end() verifying that a class demotion would result in a reschedule. As it turns out; rt_mutex_setprio() does not force a resched on class demontion. Furthermore, this is only relevant to running tasks. Change the warning into a reschedule and make sure to only do so for running tasks. Fixes: 47efe2ddccb1f ("sched/core: Add assertions to QUEUE_CLASS") Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org> Tested-by: Linux Kernel Functional Testing <lkft@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20251216141725.GW3707837@noisy.programming.kicks-ass.net
2025-12-17sched/core: Rework sched_class::wakeup_preempt() and rq_modified_*()Peter Zijlstra
Change sched_class::wakeup_preempt() to also get called for cross-class wakeups, specifically those where the woken task is of a higher class than the previous highest class. In order to do this, track the current highest class of the runqueue in rq::next_class and have wakeup_preempt() track this upwards for each new wakeup. Additionally have schedule() re-set the value on pick. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://patch.msgid.link/20251127154725.901391274@infradead.org
2025-12-16sched_ext: fix uninitialized ret on alloc_percpu() failureLiang Jie
Smatch reported: kernel/sched/ext.c:5332 scx_alloc_and_add_sched() warn: passing zero to 'ERR_PTR' In scx_alloc_and_add_sched(), the alloc_percpu() failure path jumps to err_free_gdsqs without initializing @ret. That can lead to returning ERR_PTR(0), which violates the ERR_PTR() convention and confuses callers. Set @ret to -ENOMEM before jumping to the error path when alloc_percpu() fails. Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/r/202512141601.yAXDAeA9-lkp@intel.com/ Reported-by: Dan Carpenter <error27@gmail.com> Fixes: c201ea1578d3 ("sched_ext: Move event_stats_cpu into scx_sched") Signed-off-by: Liang Jie <liangjie@lixiang.com> Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Reviewed-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-12-15sched_ext: Remove unused code in the do_pick_task_scx()Zqiang
The kick_idle variable is no longer used, this commit therefore remove it and also remove associated code in the do_pick_task_scx(). Signed-off-by: Zqiang <qiang.zhang@linux.dev> Reviewed-by: Andrea Righi <arighi@nvidia.com> Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-12-15sched/fair: Sort out 'blocked_load*' namespace noiseIngo Molnar
There's three layers of logic in the scheduler that deal with 'has_blocked' (load) handling of the NOHZ code: (1) nohz.has_blocked, (2) rq->has_blocked_load, deal with NOHZ idle balancing, (3) and cfs_rq_has_blocked(), which is part of the layer that is passing the SMP load-balancing signal to the NOHZ layers. The 'has_blocked' and 'has_blocked_load' names are used in a mixed fashion, sometimes within the same function. Standardize on 'has_blocked_load' to make it all easy to read and easy to grep. No change in functionality. Suggested-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Frederic Weisbecker <frederic@kernel.org> Cc: Shrikanth Hegde <sshegde@linux.ibm.com> Link: https://patch.msgid.link/aS6yvxyc3JfMxxQW@gmail.com
2025-12-15sched/fair: Introduce and use the vruntime_cmp() and vruntime_op() wrappers ↵Ingo Molnar
for wrapped-signed aritmetics We have to be careful with vruntime comparisons and subtraction, due to the possibility of wrapping, so we have macros like: #define vruntime_gt(field, lse, rse) ({ (s64)((lse)->field - (rse)->field) > 0; }) Which is used like this: if (vruntime_gt(min_vruntime, se, rse)) se->min_vruntime = rse->min_vruntime; Replace this with an easier to read pattern that uses the regular arithmetics operators: if (vruntime_cmp(se->min_vruntime, ">", rse->min_vruntime)) se->min_vruntime = rse->min_vruntime; Also replace vruntime subtractions with vruntime_op(): - delta = (s64)(sea->vruntime - seb->vruntime) + - (s64)(cfs_rqb->zero_vruntime_fi - cfs_rqa->zero_vruntime_fi); + delta = vruntime_op(sea->vruntime, "-", seb->vruntime) + + vruntime_op(cfs_rqb->zero_vruntime_fi, "-", cfs_rqa->zero_vruntime_fi); In the vruntime_cmp() and vruntime_op() macros use Use __builtin_strcmp(), because of __HAVE_ARCH_STRCMP might turn off the compiler optimizations we rely on here to catch usage bugs. No change in functionality. Signed-off-by: Ingo Molnar <mingo@kernel.org>
2025-12-15sched/fair: Rename cfs_rq::avg_vruntime to ::sum_w_vruntime, and helper ↵Ingo Molnar
functions The ::avg_vruntime field is a misnomer: it says it's an 'average vruntime', but in reality it's the momentary sum of the weighted vruntimes of all queued tasks, which is at least a division away from being an average. This is clear from comments about the math of fair scheduling: * \Sum (v_i - v0) * w_i := cfs_rq->avg_vruntime This confusion is increased by the cfs_avg_vruntime() function, which does perform the division and returns a true average. The sum of all weighted vruntimes should be named thusly, so rename the field to ::sum_w_vruntime. (As arguably ::sum_weighted_vruntime would be a bit of a mouthful.) Understanding the scheduler is hard enough already, without extra layers of obfuscated naming. ;-) Also rename related helper functions: sum_vruntime_add() => sum_w_vruntime_add() sum_vruntime_sub() => sum_w_vruntime_sub() sum_vruntime_update() => sum_w_vruntime_update() With the notable exception of cfs_avg_vruntime(), which was named accurately. Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://patch.msgid.link/20251201064647.1851919-7-mingo@kernel.org
2025-12-15sched/fair: Rename cfs_rq::avg_load to cfs_rq::sum_weightIngo Molnar
The ::avg_load field is a long-standing misnomer: it says it's an 'average load', but in reality it's the momentary sum of the load of all currently runnable tasks. We'd have to also perform a division by nr_running (or use time-decay) to arrive at any sort of average value. This is clear from comments about the math of fair scheduling: * \Sum w_i := cfs_rq->avg_load The sum of all weights is ... the sum of all weights, not the average of all weights. To make it doubly confusing, there's also an ::avg_load in the load-balancing struct sg_lb_stats, which *is* a true average. The second part of the field's name is a minor misnomer as well: it says 'load', and it is indeed a load_weight structure as it shares code with the load-balancer - but it's only in an SMP load-balancing context where load = weight, in the fair scheduling context the primary purpose is the weighting of different nice levels. So rename the field to ::sum_weight instead, which makes the terminology of the EEVDF math match up with our implementation of it: * \Sum w_i := cfs_rq->sum_weight Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://patch.msgid.link/20251201064647.1851919-6-mingo@kernel.org
2025-12-15sched/fair: Clean up comments in 'struct cfs_rq'Ingo Molnar
- Fix vertical alignment - Fix typos - Fix capitalization Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://patch.msgid.link/20251201064647.1851919-3-mingo@kernel.org
2025-12-15sched/fair: Join two #ifdef CONFIG_FAIR_GROUP_SCHED blocksIngo Molnar
Join two identical #ifdef blocks: #ifdef CONFIG_FAIR_GROUP_SCHED ... #endif #ifdef CONFIG_FAIR_GROUP_SCHED ... #endif Also mark nested #ifdef blocks in the usual fashion, to make it more apparent where in a nested hierarchy of #ifdefs we are at a glance. Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com> Link: https://patch.msgid.link/20251201064647.1851919-2-mingo@kernel.org
2025-12-14sched/core: Add assertions to QUEUE_CLASSPeter Zijlstra
Add some checks to the sched_change pattern to validate assumptions around changing classes. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://patch.msgid.link/20251127154725.771691954@infradead.org
2025-12-14sched/fair: Limit hrtick workPeter Zijlstra
The task_tick_fair() function does: - update the hierarchical runtimes - drive NUMA-balancing - update load-balance statistics - drive force-idle preemption All but the very first can be limited to the periodic tick. Let hrtick only update accounting and drive preemption, not load-balancing and other bits. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://patch.msgid.link/20250918080205.563385766@infradead.org
2025-12-14sched/fair: Remove superfluous rcu_read_lock()Peter Zijlstra
With fair switched to rcu_dereference_all() validation, having IRQ or preemption disabled is sufficient, remove the rcu_read_lock() clutter. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://patch.msgid.link/20251127154725.647502625@infradead.org
2025-12-14sched/fair: Switch to rcu_dereference_all()Peter Zijlstra
With the {rcu,sched,bh} RCU flavours being unified, it doesn't really make sense to check for just the rcu one. Switch to the _all family of verification which includes all 3 of the listed flavours. Notably, this will enable us to remove some superfluous rcu_read_lock() regions when we know they are inside preempt/IRQ disabled regions. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2025-12-14sched/headers: Rename rcu_dereference_check_sched_domain() => ↵Peter Zijlstra
rcu_dereference_sched_domain() Remove check from the name for being surplus to requirements. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2025-12-14sched/fair: Avoid rq->lock bouncing in sched_balance_newidle()Peter Zijlstra
While poking at this code recently I noted we do a pointless unlock+lock cycle in sched_balance_newidle(). We drop the rq->lock (so we can balance) but then instantly grab the same rq->lock again in sched_balance_update_blocked_averages(). Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://patch.msgid.link/20251127154725.532469061@infradead.org
2025-12-14sched/fair: Fold the sched_avg updatePeter Zijlstra
Nine (and a half) instances of the same pattern is just silly, fold the lot. Notably, the half instance in enqueue_load_avg() is right after setting cfs_rq->avg.load_sum to cfs_rq->avg.load_avg * get_pelt_divider(&cfs_rq->avg). Since get_pelt_divisor() >= PELT_MIN_DIVIDER, this ends up being a no-op change. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Shrikanth Hegde <sshegde@linux.ibm.com> Cc: Valentin Schneider <vschneid@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Link: https://patch.msgid.link/20251127154725.413564507@infradead.org
2025-12-12sched_ext: Fix missing post-enqueue handling in move_local_task_to_local_dsq()Tejun Heo
move_local_task_to_local_dsq() is used when moving a task from a non-local DSQ to a local DSQ on the same CPU. It directly manipulates the local DSQ without going through dispatch_enqueue() and was missing the post-enqueue handling that triggers preemption when SCX_ENQ_PREEMPT is set or the idle task is running. The function is used by move_task_between_dsqs() which backs scx_bpf_dsq_move() and may be called while the CPU is busy. Add local_dsq_post_enq() call to move_local_task_to_local_dsq(). As the dispatch path doesn't need post-enqueue handling, add SCX_RQ_IN_BALANCE early exit to keep consume_dispatch_q() behavior unchanged and avoid triggering unnecessary resched when scx_bpf_dsq_move() is used from the dispatch path. Fixes: 4c30f5ce4f7a ("sched_ext: Implement scx_bpf_dispatch[_vtime]_from_dsq()") Cc: stable@vger.kernel.org # v6.12+ Reviewed-by: Andrea Righi <arighi@nvidia.com> Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-12-12sched_ext: Factor out local_dsq_post_enq() from dispatch_enqueue()Tejun Heo
Factor out local_dsq_post_enq() which performs post-enqueue handling for local DSQs - triggering resched_curr() if SCX_ENQ_PREEMPT is specified or if the current CPU is idle. No functional change. This will be used by the next patch to fix move_local_task_to_local_dsq(). Cc: stable@vger.kernel.org # v6.12+ Reviewed-by: Andrea Righi <arighi@nvidia.com> Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-12-11sched_ext: Fix bypass depth leak on scx_enable() failureTejun Heo
scx_enable() calls scx_bypass(true) to initialize in bypass mode and then scx_bypass(false) on success to exit. If scx_enable() fails during task initialization - e.g. scx_cgroup_init() or scx_init_task() returns an error - it jumps to err_disable while bypass is still active. scx_disable_workfn() then calls scx_bypass(true/false) for its own bypass, leaving the bypass depth at 1 instead of 0. This causes the system to remain permanently in bypass mode after a failed scx_enable(). Failures after task initialization is complete - e.g. scx_tryset_enable_state() at the end - already call scx_bypass(false) before reaching the error path and are not affected. This only affects a subset of failure modes. Fix it by tracking whether scx_enable() called scx_bypass(true) in a bool and having scx_disable_workfn() call an extra scx_bypass(false) to clear it. This is a temporary measure as the bypass depth will be moved into the sched instance, which will make this tracking unnecessary. Fixes: 8c2090c504e9 ("sched_ext: Initialize in bypass mode") Cc: stable@vger.kernel.org # v6.12+ Reported-by: Chris Mason <clm@meta.com> Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Link: https://lore.kernel.org/stable/286e6f7787a81239e1ce2989b52391ce%40kernel.org Signed-off-by: Tejun Heo <tj@kernel.org>
2025-12-08sched/ext: Avoid null ptr traversal when ->put_prev_task() is called with ↵John Stultz
NULL next Early when trying to get sched_ext and proxy-exe working together, I kept tripping over NULL ptr in put_prev_task_scx() on the line: if (sched_class_above(&ext_sched_class, next->sched_class)) { Which was due to put_prev_task() passes a NULL next, calling: prev->sched_class->put_prev_task(rq, prev, NULL); put_prev_task_scx() already guards for a NULL next in the switch_class case, but doesn't seem to have a guard for sched_class_above() check. I can't say I understand why this doesn't trip usually without proxy-exec. And in newer kernels there are way fewer put_prev_task(), and I can't easily reproduce the issue now even with proxy-exec. But we still have one put_prev_task() call left in core.c that seems like it could trip this, so I wanted to send this out for consideration. tj: put_prev_task() can be called with NULL @next; however, when @p is queued, that doesn't happen, so this condition shouldn't currently be triggerable. The connection isn't straightforward or necessarily reliable, so add the NULL check even if it can't currently be triggered. Link: http://lkml.kernel.org/r/20251206022218.1541878-1-jstultz@google.com Signed-off-by: John Stultz <jstultz@google.com> Reviewed-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>