summaryrefslogtreecommitdiff
path: root/kernel/sched
AgeCommit message (Collapse)Author
2026-05-18sched/cache: Fix unpaired account_llc_enqueue/dequeueChen Yu
There is a race condition that, after a task is enqueued on a runqueue, task_llc(p) may change due to CPU hotplug, because the llc_id is dynamically allocated and adjusted at runtime. Therefore, checking task_llc(p) to determine whether the task is being dequeued from its preferred LLC is unreliable and can cause inconsistent values. To fix this problem, record whether p is enqueued on its preferred LLC, in order to pair with account_llc_dequeue() to maintain a consistent nr_pref_llc_running per runqueue. This bug was reported by sashiko, and the solution was once suggested by Prateek. Fixes: 46afe3af7ead ("sched/cache: Track LLC-preferred tasks per runqueue") Suggested-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: Chen Yu <yu.c.chen@intel.com> Co-developed-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/0c8c6a1571d66792a4d2ff0103ba3cc13e059046.1778703694.git.tim.c.chen@linux.intel.com
2026-05-18sched/cache: Annotate lockless accesses to mm->sc_stat.cpuChen Yu
mm->sc_stat.cpu is written by task_cache_work() and could be read locklessly by several functions on other CPUs. Use READ_ONCE and WRITE_ONCE on mm->sc_stat.cpu access and write to prevent inconsistent values from compiler optimizations when there are multiple accesses. For example in get_pref_llc(), if the writer updated the field between two compiler-generated loads, the validation (e.g., cpu != -1) and subsequent use (e.g., llc_id(cpu)) could operate on different values, allowing a negative CPU ID to be used as an index. Leave plain write in mm_init_sched(), where the mm is not yet visible to other CPUs. This bug was reported by sashiko. Fixes: 47d8696b95f7 ("sched/cache: Assign preferred LLC ID to processes") Signed-off-by: Chen Yu <yu.c.chen@intel.com> Co-developed-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/63ea494f12efcf265d7134400a06cd75d7f2c310.1778703694.git.tim.c.chen@linux.intel.com
2026-05-18sched/cache: Fix potential NULL mm pointer accessChen Yu
A concurrent task exit might cause a NULL pointer dereference in account_mm_sched(). Use the locally cached mm pointer instead, since the active_mm reference guarantees the structure remains allocated. Meanwhile, skip the kernel thread because it has nothing to do with cache aware scheduling. This bug was reported by sashiko and Vern. Fixes: df0d98475954 ("sched/cache: Introduce infrastructure for cache-aware load balancing") Reported-by: Vern Hao <haoxing990@gmail.com> Signed-off-by: Chen Yu <yu.c.chen@intel.com> Co-developed-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/all/09cf7ee3-6e27-4505-9692-4b4a4707c8b2@gmail.com/ Link: https://patch.msgid.link/066d8cfa45d4822bf4367e788c50377c66bbcc82.1778703694.git.tim.c.chen@linux.intel.com
2026-05-18sched/cache: Fix rcu warning when accessing sd_llc domainChen Yu
rcu_dereference_all() should be used to access the sd_llc domain under RCU protection. This bug was reported by sashiko. Fixes: df0d98475954 ("sched/cache: Introduce infrastructure for cache-aware load balancing") Signed-off-by: Chen Yu <yu.c.chen@intel.com> Co-developed-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/2dc49455e861215d8059a1c877953f0b95990038.1778703694.git.tim.c.chen@linux.intel.com
2026-05-18sched/cache: Add user control to adjust the aggressiveness of cache-aware ↵Chen Yu
scheduling Introduce a set of debugfs knobs to control how aggressively the cache aware scheduling does the task aggregation. (1) aggr_tolerance With sched_cache enabled, the scheduler uses a process's footprint as a proxy for its LLC footprint to determine if aggregating tasks on the preferred LLC could cause cache contention. If the footprint exceeds the LLC size, aggregation is skipped. Since the kernel cannot efficiently track per-task cache usage (resctrl is user-space only), userspace can provide a more accurate hint. Introduce /sys/kernel/debug/sched/llc_balancing/aggr_tolerance to let users control how strictly footprint limits aggregation. Values range from 0 to 100: - 0: Cache-aware scheduling is disabled. - 1: Strict; tasks with footprint larger than LLC size are skipped. - >=100: Aggressive; tasks are aggregated regardless of footprint. For example, with a 32MB L3 cache: - aggr_tolerance=1 -> tasks with footprint > 32MB are skipped. - aggr_tolerance=99 -> tasks with footprint > 784GB are skipped (784GB = (1 + (99 - 1) * 256) * 32MB). Similarly, /sys/kernel/debug/sched/llc_balancing/aggr_tolerance also controls how strictly the number of active threads is considered when doing cache aware load balance. The number of SMTs is also considered. High SMT counts reduce the aggregation capacity, preventing excessive task aggregation on SMT-heavy systems like Power10/Power11. Yangyu suggested introducing separate aggregation controls for the number of active threads and memory footprint checks. Since there are plans to add per-process/task group controls, fine-grained tunables are deferred to that implementation. (2) epoch_period, epoch_affinity_timeout, imb_pct, overaggr_pct are also turned into tunables. Suggested-by: K Prateek Nayak <kprateek.nayak@amd.com> Suggested-by: Madadi Vineeth Reddy <vineethr@linux.ibm.com> Suggested-by: Shrikanth Hegde <sshegde@linux.ibm.com> Suggested-by: Tingyin Duan <tingyin.duan@gmail.com> Suggested-by: Jianyong Wu <jianyong.wu@outlook.com> Suggested-by: Yangyu Chen <cyy@cyyself.name> Signed-off-by: Chen Yu <yu.c.chen@intel.com> Co-developed-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Tingyin Duan <tingyin.duan@gmail.com> Link: https://patch.msgid.link/1c62cc060ba2b33d7b1f0ed98b3390128edbae93.1778703694.git.tim.c.chen@linux.intel.com
2026-05-18sched/cache: Avoid cache-aware scheduling for memory-heavy processesChen Yu
Prateek and Tingyin reported that memory-intensive workloads (such as stream) can saturate memory bandwidth and caches on the preferred LLC when sched_cache aggregates too many threads. To mitigate this, estimate a process's memory footprint by comparing its NUMA balancing fault statistics to the size of the LLC. If the footprint exceeds the LLC size, skip cache-aware scheduling. Note that footprint is only an approximation of the memory footprint, since the kernel lacks suitable metrics to estimate the real working set. If a user-provided hint is available in the future, it would be more accurate. A later patch will allow users to provide a hint to adjust this threshold. Suggested-by: K Prateek Nayak <kprateek.nayak@amd.com> Suggested-by: Vern Hao <vernhao@tencent.com> Signed-off-by: Chen Yu <yu.c.chen@intel.com> Co-developed-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Tingyin Duan <tingyin.duan@gmail.com> Link: https://patch.msgid.link/95cf64a385bcc12f18dcebe9d59e8d3ba8bb318f.1778703694.git.tim.c.chen@linux.intel.com
2026-05-18sched/cache: Calculate the LLC size and store it in sched_domainChen Yu
Cache aware scheduling needs to know the LLC size that a process can use, so as to avoid memory-intensive tasks from being over-aggregated on a single LLC. Introduce a preparation patch to add get_effective_llc_bytes() to get the LLC size that a CPU can use. The function can be further enhanced by subtracting the LLC cache ways reserved by resctrl (CAT in Intel RDT, etc). Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Chen Yu <yu.c.chen@intel.com> Co-developed-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Tingyin Duan <tingyin.duan@gmail.com> Link: https://patch.msgid.link/37afee09ff608034da0ce149e72d33b6f4698edf.1778703694.git.tim.c.chen@linux.intel.com
2026-05-18sched/cache: Skip cache-aware scheduling for single-threaded processesChen Yu
For a single thread, the current wakeup path tends to place it on the same LLC where it was previously running with cache-hot data. There is no need to enable cache-aware scheduling for single-threaded processes for the following reasons: 1. Cache-aware scheduling primarily benefits multi-threaded processes where threads share data. Single-threaded processes typically have no inter-thread data sharing and thus gain little. 2. Enabling it incurs the additional overhead of tracking the thread's residency in the LLCs. 3. Bypassing single-threaded processes avoids excessive concentration of such tasks on a single LLC. Nevertheless, this check can be omitted if users explicitly provide hints for such single-threaded workloads where different processes have shared memory, e.g., via prctl() or other interfaces to be added in the future. Signed-off-by: Chen Yu <yu.c.chen@intel.com> Co-developed-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Tingyin Duan <tingyin.duan@gmail.com> Link: https://patch.msgid.link/8a59a13aa58fdb48e410ecb2aabd97fe3ea5d256.1778703694.git.tim.c.chen@linux.intel.com
2026-05-18sched/cache: Disable cache aware scheduling for processes with high thread ↵Chen Yu
counts A performance regression was observed by Prateek when running hackbench with many threads per process (high fd count). To avoid this, processes with a large number of active threads are excluded from cache-aware scheduling. With sched_cache enabled, record the number of active threads in each process during the periodic task_cache_work(). While iterating over CPUs, if the currently running task belongs to the same process as the task that launched task_cache_work(), increment the active thread count. If the number of active threads within the process exceeds the number of Cores (divided by the SMT number) in the LLC, do not enable cache-aware scheduling. However, on systems with a smaller number of CPUs within 1 LLC, like Power10/Power11 with SMT4 and an LLC size of 4, this check effectively disables cache-aware scheduling for any process. One possible solution suggested by Peter is to use an LLC-mask instead of a single LLC value for preference. Once there are a 'few' LLCs as preference, this constraint becomes a little easier. It could be an enhancement in the future. For users who wish to perform task aggregation regardless, a debugfs knob is provided for tuning in a subsequent change. Suggested-by: K Prateek Nayak <kprateek.nayak@amd.com> Suggested-by: Aaron Lu <ziqianlu@bytedance.com> Signed-off-by: Chen Yu <yu.c.chen@intel.com> Co-developed-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Tingyin Duan <tingyin.duan@gmail.com> Link: https://patch.msgid.link/d076cd21a8e6c6341d1e2d927e118db770ebb650.1778703694.git.tim.c.chen@linux.intel.com
2026-05-18sched/cache: Allow only 1 thread of the process to calculate the LLC occupancyJianyong Wu
Scanning online CPUs to calculate the occupancy might be time-consuming. Only allow 1 thread of the process to scan the CPUs at the same time, which is similar to what NUMA balance does in task_numa_work(). Signed-off-by: Jianyong Wu <wujianyong@hygon.cn> Signed-off-by: Chen Yu <yu.c.chen@intel.com> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/5672b52e588b855b01e5a1a17822f7c6c7237a3d.1778703694.git.tim.c.chen@linux.intel.com
2026-05-17Merge branch 'for-7.1-fixes' into for-7.2Tejun Heo
Pull to receive: 39e25a210060 ("sched_ext: Drop NONE early return in scx_disable_and_exit_task()") b273b75b8d67 ("sched_ext: INIT_LIST_HEAD() &sch->all in scx_alloc_and_add_sched()") cceb874eee46 ("sched_ext: Defer sub_kset base put to scx_sched_free_rcu_work") 6ae315d37924 ("sched_ext: Use HK_TYPE_DOMAIN_BOOT to detect isolcpus= domain isolation") 515e3996a4c2 ("sched_ext: Fix deadlock between scx_root_disable() and concurrent forks") to prepare for-7.2 for further sub-sched changes. Signed-off-by: Tejun Heo <tj@kernel.org>
2026-05-17sched_ext: Fix deadlock between scx_root_disable() and concurrent forksTejun Heo
scx_root_disable() enters SCX_DISABLING before it grabs scx_enable_mutex to clear __scx_switched_all and scx_switching_all. task_should_scx() short-circuits on DISABLING, so forks in that window land on fair while next_active_class() still skips fair - the new tasks stall. This can deadlock the disable path itself: scx_alloc_and_add_sched() runs under scx_enable_mutex and creates a helper kthread; if that new kthread is one of the stalled fair tasks, the mutex holder waits forever and scx_root_disable() can never make progress. Only sub-sched support exposes this, since sub-sched enables are the only path where scx_alloc_and_add_sched() can race the root's disable. Move the DISABLING check after @scx_switching_all. @scx_switching_all serves as a proxy for __scx_switched_all, so while it's set, forks keep going to scx. Once cleared, DISABLING applies normally. v2: Reword in-source comment and description. (Andrea) Fixes: 337ec00b1d9c ("sched_ext: Implement cgroup sub-sched enabling and disabling") Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-05-13Merge tag 'sched_ext-for-7.1-rc3-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext Pull sched_ext fixes from Tejun Heo: "The bulk of this is hardening of the new sub-scheduler infrastructure. - UAFs and lifecycle bugs on the sub-sched attach/detach paths: parent sub_kset freed under a racing child, list_del_rcu on an uninitialized list head, ops->priv stomped by concurrent attach/detach, and a UAF in the init-failure error path - Task state-machine reorg closing concurrent enable-vs-dead races: a task exiting during the unlocked init window could trip NULL ops derefs or skip exit_task() cleanup - A scx_link_sched() self-deadlock on scx_sched_lock - isolcpus: stop dereferencing the now-RCU-protected HK_TYPE_DOMAIN cpumask without RCU, and stop rejecting BPF schedulers when only cpuset isolated partitions are active - PREEMPT_RT: disable irq_work runs in hardirq context so dumps show the failing task rather than the irq_work kthread - Assorted !CONFIG_EXT_SUB_SCHED, randconfig, and selftest build fixes" * tag 'sched_ext-for-7.1-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: sched_ext: Use HK_TYPE_DOMAIN_BOOT to detect isolcpus= domain isolation sched_ext: Defer sub_kset base put to scx_sched_free_rcu_work sched_ext: INIT_LIST_HEAD() &sch->all in scx_alloc_and_add_sched() sched_ext: Drop NONE early return in scx_disable_and_exit_task() sched_ext: Avoid UAF in scx_root_enable_workfn() init failure path sched_ext: Clear ops->priv on scx_alloc_and_add_sched() error paths sched_ext: Fix ops->priv clobber on concurrent attach/detach selftests/sched_ext: Fix build error in dequeue selftest sched_ext: Handle SCX_TASK_NONE in disable/switched_from paths sched_ext: Close sub-sched init race with post-init DEAD recheck sched_ext: Close root-enable vs sched_ext_dead() race with SCX_TASK_INIT_BEGIN sched_ext: Replace SCX_TASK_OFF_TASKS flag with SCX_TASK_DEAD state sched_ext: Inline scx_init_task() and move RESET_RUNNABLE_AT into scx_set_task_state() sched_ext: Cleanups in preparation for the SCX_TASK_INIT_BEGIN/DEAD work sched_ext: Use IRQ_WORK_INIT_HARD() to initialize sch->disable_irq_work sched_ext: Fix !CONFIG_EXT_SUB_SCHED build warnings sched_ext: Drop unused scx_find_sub_sched() stub sched_ext: Move scx_error() out of scx_link_sched()'s lock region
2026-05-13Merge tag 'cgroup-for-7.1-rc3-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup fixes from Tejun Heo: - cpuset fixes: - Partition invalidation could return CPUs still in use by sibling partitions, producing overlapping effective_cpus - cpuset_can_attach() over-reserved DL bandwidth on moves that stayed within the same root domain - Pending DL migration state leaked into later attaches when a later can_attach() check failed - Reorder PF_EXITING and __GFP_HARDWALL checks so dying tasks can allocate from any node and exit quickly - dmem: propagate -ENOMEM instead of spinning forever when the fallback pool allocation also fails - selftests/cgroup: percpu test error-path leak, bogus numeric comparison of cpuset strings, and a zero-length read() that silently passed OOM-kill tests * tag 'cgroup-for-7.1-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup/cpuset: Return only actually allocated CPUs during partition invalidation selftests/cgroup: Fix error path leaks in test_percpu_basic cgroup/cpuset: Reserve DL bandwidth only for root-domain moves cgroup/cpuset: Reset DL migration state on can_attach() failure selftests/cgroup: Fix string comparison in write_test selftests/cgroup: Fix cg_read_strcmp() empty string comparison cgroup/dmem: Return -ENOMEM on failed pool preallocation cgroup/cpuset: move PF_EXITING check before __GFP_HARDWALL in cpuset_current_node_allowed()
2026-05-13sched_ext: Use HK_TYPE_DOMAIN_BOOT to detect isolcpus= domain isolationAndrea Righi
scx_enable() refuses to attach a BPF scheduler when isolcpus=domain is in effect by comparing housekeeping_cpumask(HK_TYPE_DOMAIN) against cpu_possible_mask. Since commit 27c3a5967f05 ("sched/isolation: Convert housekeeping cpumasks to rcu pointers"), HK_TYPE_DOMAIN's cpumask is RCU protected and dereferencing it requires either RCU read lock, the cpu_hotplug write lock, or the cpuset lock; scx_enable() holds none of these, so booting with isolcpus=domain and attaching any BPF scheduler triggers the following lockdep splat: ============================= WARNING: suspicious RCU usage ----------------------------- kernel/sched/isolation.c:60 suspicious rcu_dereference_check() usage! 1 lock held by scx_flash/281: #0: ffffffff8379fce0 (update_mutex){+.+.}-{4:4}, at: bpf_struct_ops_link_create+0x134/0x1c0 Call Trace: dump_stack_lvl+0x6f/0xb0 lockdep_rcu_suspicious.cold+0x37/0x70 housekeeping_cpumask+0xcd/0xe0 scx_enable.isra.0+0x17/0x120 bpf_scx_reg+0x5e/0x80 bpf_struct_ops_link_create+0x151/0x1c0 __sys_bpf+0x1e4b/0x33c0 __x64_sys_bpf+0x21/0x30 do_syscall_64+0x117/0xf80 entry_SYSCALL_64_after_hwframe+0x77/0x7f In addition, commit 03ff73510169 ("cpuset: Update HK_TYPE_DOMAIN cpumask from cpuset") made HK_TYPE_DOMAIN include cpuset isolated partitions as well, which means the current check also rejects BPF schedulers when a cpuset partition is active. That contradicts the original intent of commit 9f391f94a173 ("sched_ext: Disallow loading BPF scheduler if isolcpus= domain isolation is in effect"), which explicitly noted that cpuset partitions are honored through per-task cpumasks and should not be rejected. Switch to housekeeping_enabled(HK_TYPE_DOMAIN_BOOT), which reads only the housekeeping flag bit (no RCU dereference) and reflects exactly the boot-time isolcpus= configuration that the error message refers to. Fixes: 27c3a5967f05 ("sched/isolation: Convert housekeeping cpumasks to rcu pointers") Cc: stable@vger.kernel.org # v7.0+ Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Frederic Weisbecker <frederic@kernel.org>
2026-05-12sched_ext: Defer sub_kset base put to scx_sched_free_rcu_workTejun Heo
scx_sub_enable_workfn() pins parent->kobj before dropping scx_sched_lock, but that does not pin parent->sub_kset. Concurrent disable can kset_unregister and free sub_kset before scx_alloc_and_add_sched() dereferences it. Split sub_kset teardown: kobject_del() at disable keeps sysfs removal; defer kobject_put() to scx_sched_free_rcu_work so the memory survives. A racing child sees state_in_sysfs=0 with valid memory, sysfs_create_dir() fails, and the existing exit_kind gate in scx_link_sched() turns it away with -ENOENT. Fixes: 411d3ef1a705 ("sched_ext: Unregister sub_kset on scheduler disable") Signed-off-by: Tejun Heo <tj@kernel.org>
2026-05-12sched_ext: INIT_LIST_HEAD() &sch->all in scx_alloc_and_add_sched()Tejun Heo
On scx_link_sched() error paths (parent disabled, hash insert failure), &sch->all is never added to scx_sched_all. The cleanup path runs scx_unlink_sched() unconditionally, which calls list_del_rcu(&sch->all) on a list_head that was never initialized triggering a corruption warning. Initialize &sch->all. Fixes: 54be8de4236a ("sched_ext: Factor out scx_link_sched() and scx_unlink_sched()") Signed-off-by: Tejun Heo <tj@kernel.org>
2026-05-12sched_ext: Drop NONE early return in scx_disable_and_exit_task()Tejun Heo
d3e73a0808dd ("sched_ext: Handle SCX_TASK_NONE in disable/switched_from paths") skipped the trailing scx_set_task_sched(p, NULL) on NONE tasks. After scx_fail_parent() parks a task at NONE/sched=parent and the parent is later freed via queue_rcu_work() during root_disable, the preserved p->scx.sched dangles - print_scx_info() from sched_show_task() reads sch->ops.name from freed memory. Drop the early return. __scx_disable_and_exit_task() already short- circuits on NONE and the SUB_INIT block was cleared by scx_fail_parent()'s earlier call, so clearing p->scx.sched is the only work left - and the one thing the path actually needs. v2: Extend the SUB_INIT block comment to note that the flag is only set on the sub-enable path, so it's always clear on the NONE re-entry (Andrea). Fixes: d3e73a0808dd ("sched_ext: Handle SCX_TASK_NONE in disable/switched_from paths") Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-05-11Merge branch 'for-7.1-fixes' into for-7.2Tejun Heo
Pull to receive: 9a415cc53711 ("sched_ext: Avoid UAF in scx_root_enable_workfn() init failure path") Conflicts with for-7.2's scx_task_iter_relock() rework. The fix moves put_task_struct(p) past scx_error(); for-7.2 still has it at the old position. Resolved by dropping the old one. Signed-off-by: Tejun Heo <tj@kernel.org>
2026-05-11sched_ext: Avoid UAF in scx_root_enable_workfn() init failure pathTejun Heo
In scx_root_enable_workfn(), put_task_struct(p) is called before scx_error() dereferences p->comm and p->pid. If the iterator's reference is the last drop, the task is freed synchronously and the deref becomes a UAF. Move put_task_struct() past scx_error(). Reported-by: Sashiko <sashiko-bot@kernel.org> Closes: https://lore.kernel.org/all/20260511214031.AF5E9C2BCB0@smtp.kernel.org/ Fixes: f0e1a0643a59 ("sched_ext: Implement BPF extensible scheduler class") Cc: stable@vger.kernel.org # v6.12+ Signed-off-by: Tejun Heo <tj@kernel.org>
2026-05-11sched_ext: Mark !CONFIG_EXT_SUB_SCHED dummy stubs static inlineTejun Heo
Mark !CONFIG_EXT_SUB_SCHED dummy stubs static inline to avoid -Wunused-function in configs without callers. No functional change. Signed-off-by: Tejun Heo <tj@kernel.org>
2026-05-11cgroup/cpuset: Reserve DL bandwidth only for root-domain movesGuopeng Zhang
cpuset_can_attach() currently adds the bandwidth of all migrating SCHED_DEADLINE tasks to sum_migrate_dl_bw. If the source and destination cpuset effective CPU masks do not overlap, the whole sum is then reserved in the destination root domain. set_cpus_allowed_dl(), however, subtracts bandwidth from the source root domain only when the affinity change really moves the task between root domains. A DL task can move between cpusets that are still in the same root domain, so including that task in sum_migrate_dl_bw can reserve destination bandwidth without a matching source-side subtraction. Share the root-domain move test with set_cpus_allowed_dl(). Keep nr_migrate_dl_tasks counting all migrating deadline tasks for cpuset DL task accounting, but add to sum_migrate_dl_bw only for tasks that need a root-domain bandwidth move. Keep using the destination cpuset effective CPU mask and leave the broader can_attach()/attach() transaction model unchanged. Fixes: 2ef269ef1ac0 ("cgroup/cpuset: Free DL BW in case can_attach() fails") Cc: stable@vger.kernel.org # v6.10+ Signed-off-by: Guopeng Zhang <zhangguopeng@kylinos.cn> Reviewed-by: Waiman Long <longman@redhat.com> Acked-by: Juri Lelli <juri.lelli@redhat.com> Tested-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-05-11sched_ext: Replace tryget_task_struct() with get_task_struct()Andrea Righi
The tryget_task_struct() calls in scx_sub_disable(), scx_root_enable_workfn() and scx_sub_enable_workfn() can never fail at the points they're invoked: - scx_root_enable_workfn() iterates over scx_tasks under scx_tasks_lock and rq lock. sched_ext_dead() removes tasks from scx_tasks under the same scx_tasks_lock before put_task_struct_rcu_user() runs in finish_task_switch(). So any task observed in scx_tasks must have usage > 0; put_task_struct_rcu_user() hasn't been called and the delayed_put_task_struct() callback that decrements usage cannot have been queued. - scx_sub_disable() and scx_sub_enable_workfn() iterate via css_task_iter, which takes a reference on each task in css_task_iter_next() and holds it until the next iter_next() call, so usage > 0 is guaranteed by the iter itself. The actual filter for dead tasks is the SCX_TASK_DEAD check inside scx_task_iter_next_locked(), not tryget; tryget only fails on zero usage, a state that can't be reached for tasks visible to these iters. Commit b7d4b28db7da ("sched_ext: Use SCX_TASK_READY test instead of tryget_task_struct() during class switch") removed an analogous tryget in the class-switch loop. Convert the remaining tryget calls to plain get_task_struct() and update the comment in scx_root_enable_workfn() that suggested tasks could be observed with zero @usage waiting for an RCU grace period. Link: https://lore.kernel.org/all/agCLBxHEUqWIepx8@google.com Suggested-by: Alice Ryhl <aliceryhl@google.com> Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-05-10sched_ext: Clear ops->priv on scx_alloc_and_add_sched() error pathsAndrea Righi
scx_alloc_and_add_sched() can fail after @sch has been assigned to ops->priv. In those cases @sch is torn down (either via kfree() through the err_free_* chain or via kobject_put() -> scx_kobj_release() -> RCU work), but @ops->priv is left pointing at the about-to-be-freed pointer. With the recent -EBUSY gate in scx_root_enable_workfn() and scx_sub_enable_workfn() that rejects an attach when @ops->priv is still non-NULL, see commit bbf30b383cf6 ("sched_ext: Fix ops->priv clobber on concurrent attach/detach"), a dangling @ops->priv permanently locks the kdata out: every future attach attempt sees a stale binding and returns -EBUSY even though no scheduler is actually attached. Clear @ops->priv on the post-assign failure paths so that the kdata returns to its pre-attach state when the function returns ERR_PTR(). Fixes: bbf30b383cf6 ("sched_ext: Fix ops->priv clobber on concurrent attach/detach") Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-05-10sched_ext: Fix ops->priv clobber on concurrent attach/detachAndrea Righi
Under heavy concurrent attach/detach operations, scx_claim_exit() can trigger a NULL pointer dereference. This can be reproduced running the reload_loop kselftests inside a virtme-ng session: $ vng -v -- ./tools/testing/selftests/sched_ext/runner -t reload_loop ... BUG: kernel NULL pointer dereference, address: 0000000000000400 RIP: 0010:scx_claim_exit+0x3b/0x120 Call Trace: <TASK> bpf_scx_unreg+0x45/0xb0 bpf_struct_ops_map_link_dealloc+0x39/0x50 bpf_link_release+0x18/0x20 __fput+0x10b/0x2e0 __x64_sys_close+0x47/0xa0 The underlying race (diagnosed by Tejun Heo) is a stomp of @ops->priv, not a missing NULL check: T2 unreg(K) T1 reg(K) ----------- --------- sch = ops->priv = sch_b800 scx_disable; flush_disable_work [scx_root_disable: scx_root=NULL, mutex_unlock, state=DISABLED] mutex_lock; state ok scx_alloc_and_add_sched: ops->priv = sch_a800 scx_root = sch_a800; init=0 state=ENABLED; mutex_unlock [flush returns] RCU_INIT_POINTER(ops->priv, NULL) <-- clobbers sch_a800 kobject_put(sch_b800) T1 acquires scx_enable_mutex inside scx_root_disable()'s mutex_unlock window and starts a fresh attach on the same kdata, assigning sch_a800 to @ops->priv. T2 then continues out of scx_disable()/flush_disable_work and clobbers @ops->priv to NULL, leaking sch_a800; the bpf_link is gone but state stays SCX_ENABLED, so all future attaches fail with -EBUSY permanently. The next bpf_scx_unreg() on that kdata then reads NULL @ops->priv and dereferences it in scx_claim_exit(). Make @ops->priv the lifecycle binding: in scx_root_enable_workfn() and scx_sub_enable_workfn(), after the existing state check and still under scx_enable_mutex, refuse with -EBUSY if @ops->priv is non-NULL. This rejects an attempt to reuse a kdata that is still bound to a previous scheduler instance, closing the race without changing the unreg side. Fixes: 105dcd005be2 ("sched_ext: Introduce scx_prog_sched()") Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-05-10Merge branch 'for-7.1-fixes' into for-7.2Tejun Heo
Conflict between: [1] 41e3312861ea ("sched_ext: add p->scx.tid and SCX_OPS_TID_TO_TASK lookup") [2] c941d7391f25 ("sched_ext: Close root-enable vs sched_ext_dead() race with SCX_TASK_INIT_BEGIN") in scx_root_enable_workfn()'s post-init block. [1] added a tid hash insertion under a scoped_guard() for scx_tasks_lock; [2] wraps the same region in task_rq_lock() for a DEAD recheck. A naive merge would invert the iter's outer/inner order. [3] f25ad1e3cbaa ("sched_ext: Add scx_task_iter_relock() and use it in scx_root_enable_workfn()") was added to for-7.2 for a clean resolution: scx_task_iter_relock(iter, p) takes both scx_tasks_lock and @p's rq lock in iter order. Resolved by routing both sides through [3]'s dual-lock helper: the post-init region runs under a single scx_task_iter_relock() acquisition, with [2]'s state machine and [1]'s hash insert in sequence inside it. Signed-off-by: Tejun Heo <tj@kernel.org>
2026-05-10sched_ext: Add scx_task_iter_relock() and use it in scx_root_enable_workfn()Tejun Heo
scx_root_enable_workfn()'s post-init block re-acquires scx_tasks_lock briefly via a scoped_guard() for the tid hash insertion. c941d7391f25 ("sched_ext: Close root-enable vs sched_ext_dead() race with SCX_TASK_INIT_BEGIN") on for-7.1-fixes adds a post-init DEAD recheck that holds the task's rq lock across the state-machine updates in the same region. A naive merge would acquire scx_tasks_lock while the rq lock is held, inverting the iter's outer/inner order (scx_tasks_lock then rq lock). Add scx_task_iter_relock(iter, p), the counterpart to scx_task_iter_unlock(), that re-acquires scx_tasks_lock and, if @p is non-NULL, @p's rq lock. The locks are tracked in @iter so subsequent iteration releases them. Use it in scx_root_enable_workfn()'s post-init block and drop the now-redundant scoped_guard on the hash insertion. The post-init region now runs with both scx_tasks_lock and the task's rq lock held across the init failure check, the state-machine updates and the hash insert. v2: Move scx_task_iter_relock() earlier to ease the for-7.1-fixes merge. Signed-off-by: Tejun Heo <tj@kernel.org>
2026-05-10sched_ext: Handle SCX_TASK_NONE in disable/switched_from pathsTejun Heo
scx_fail_parent() leaves cgroup tasks at (state=NONE, sched=parent, sched_class=ext) until the parent itself is torn down by the scx_error() it raised. When the later root_disable iterates them, two paths trip on NONE. scx_disable_and_exit_task() re-enters the wrapper at NONE: the inner switch returns early but the trailing scx_set_task_sched(p, NULL) clobbers the parent sched left by scx_fail_parent(), and scx_set_task_state(p, NONE) wastes a write on an already-NONE task. switched_from_scx() then calls scx_disable_task(), which WARNs on non-ENABLED state and writes state=READY, producing a NONE -> READY transition the validation matrix rejects. Treat NONE as "nothing to do" in both paths. Add a NONE early-return at the top of scx_disable_and_exit_task() and a parallel NONE check in switched_from_scx() next to task_dead_and_done(). Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-05-10sched_ext: Close sub-sched init race with post-init DEAD recheckTejun Heo
scx_sub_enable_workfn()'s init pass and scx_sub_disable() migration both drop the rq lock to call __scx_init_task() against the other sched. A TASK_DEAD @p can fall through sched_ext_dead() in that window. sched_ext_dead() runs ops.exit_task() on the sched @p was attached to, not on the sched whose init just completed, so the new allocation leaks. Reuse the DEAD signal set by sched_ext_dead(). After __scx_init_task() returns, take task_rq_lock(p) and check for DEAD; on hit, call scx_sub_init_cancel_task() against the sub sched the init ran for and drop @p; on miss, proceed as before. Reported-by: zhidao su <suzhidao@xiaomi.com> Link: https://lore.kernel.org/all/20260429133155.3825247-1-suzhidao@xiaomi.com/ Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-05-10sched_ext: Close root-enable vs sched_ext_dead() race with SCX_TASK_INIT_BEGINTejun Heo
scx_root_enable_workfn() drops the iter rq lock for ops.init_task() and a TASK_DEAD @p can fall through sched_ext_dead() in that window. The race hits when sched_ext_dead() observes SCX_TASK_INIT (the intermediate state before @p->scx.sched is published) and dereferences NULL via SCX_HAS_OP(NULL, exit_task), or observes SCX_TASK_NONE during the unlocked init window and skips cleanup so exit_task() never runs. Add SCX_TASK_INIT_BEGIN. The enable path writes NONE -> INIT_BEGIN under the iter rq lock, then takes the rq lock again after init to walk INIT_BEGIN -> INIT -> READY. sched_ext_dead() that wins the rq-lock race observes INIT_BEGIN and sets DEAD without calling into ops; the post-init recheck unwinds via scx_sub_init_cancel_task(). scx_fork() runs single-threaded against sched_ext_dead() (the task is not on scx_tasks until scx_post_fork() adds it) so its INIT_BEGIN -> INIT walk needs no rq-lock pairing; it rolls back to NONE on ops.init_task() failure. The validation matrix grows the INIT_BEGIN row and the INIT_BEGIN -> DEAD edge; INIT now requires INIT_BEGIN as the predecessor. scx_sub_disable()'s migration writes INIT_BEGIN as a synthetic predecessor to satisfy the tightened verification. The sub-sched paths still race with sched_ext_dead() during the unlocked init window. This will be fixed by the next patch. Reported-by: zhidao su <suzhidao@xiaomi.com> Link: https://lore.kernel.org/all/20260429133155.3825247-1-suzhidao@xiaomi.com/ Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-05-10sched_ext: Replace SCX_TASK_OFF_TASKS flag with SCX_TASK_DEAD stateTejun Heo
SCX_TASK_OFF_TASKS marked tasks already through sched_ext_dead() so cgroup task iteration would skip them. This can be expressed better with a task state. Replace the flag with SCX_TASK_DEAD. scx_disable_and_exit_task() resets state to NONE on its way out, so sched_ext_dead() now sets DEAD after the wrapper returns. The validation matrix grows NONE -> DEAD, warns on DEAD -> NONE, and tightens READY's predecessor to INIT or ENABLED so the new DEAD value cannot silently transition to READY. Prepares for the following enable vs dead race fix. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-05-10sched_ext: Inline scx_init_task() and move RESET_RUNNABLE_AT into ↵Tejun Heo
scx_set_task_state() Prepare for the SCX_TASK_INIT_BEGIN/DEAD work that follows by collapsing the scx_init_task() helper. Move the SCX_TASK_RESET_RUNNABLE_AT setting into scx_set_task_state() on the INIT transition (it was set unconditionally at every INIT site through the scx_init_task() helper), inline scx_init_task() into scx_fork() and scx_root_enable_workfn(), and drop the helper. As a side effect, scx_sub_disable() migration sequence now also sets RESET_RUNNABLE_AT (it previously wrote INIT directly without going through scx_init_task()). The flag triggers a runnable_at reset on the next set_task_runnable(), which is harmless on a task that has just been moved between scheds. On root-enable, p->scx.flags is written without the task's rq lock. The task isn't visible to scx yet, and a follow-up patch restores the lock-held write. v2: Note p->scx.flags rq-lock relaxation on root-enable path. (Andrea) Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-05-10sched_ext: Cleanups in preparation for the SCX_TASK_INIT_BEGIN/DEAD workTejun Heo
Cleanups in preparation for the state-machine work that follows: - Convert three sub-sched call sites that open-code rcu_assign_pointer(p->scx.sched, ...) to scx_set_task_sched(). - Move scx_get_task_state()/scx_set_task_state() above the SCX task iter section so scx_task_iter_next_locked() can use them without a forward declaration. No functional change. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-05-09sched_ext: Fix ops_cid layout assertTejun Heo
ca1d48a86fab ("sched_ext: Use offsetofend on both sides of the ops_cid layout assert") replaced sizeof() with offsetofend() to dodge 32-bit PPC trailing padding, but the resulting check is tautological: with CID_OFFSET_MATCH(priv, priv) already enforcing offsetof(priv) equality and @priv being the same type in both structs, the two offsetofends are equal by construction. The original protection - catching a stray field added past @priv in sched_ext_ops_cid - is gone. Anchor on a zero-size __end[] marker appended after @priv. Its offset sits flush after @priv regardless of trailing struct padding; if a field is inserted past @priv, __end shifts and the assert fires. Closes: https://lore.kernel.org/all/20260508215211.0C03AC2BCB0@smtp.kernel.org/ Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
2026-05-08Merge tag 'sched-urgent-2026-05-09' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Ingo Molnar: - Fix spurious failures in rseq self-tests (Mark Brown) - Fix rseq rseq::cpu_id_start ABI regression due to TCMalloc's creative use of the supposedly read-only field The fix is to introduce a new ABI variant based on a new (larger) rseq area registration size, to keep the TCMalloc use of rseq backwards compatible on new kernels (Thomas Gleixner) - Fix wakeup_preempt_fair() for not waking up task (Vincent Guittot) - Fix s64 mult overflow in vruntime_eligible() (Zhan Xusheng) * tag 'sched-urgent-2026-05-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/fair: Fix wakeup_preempt_fair() for not waking up task sched/fair: Fix overflow in vruntime_eligible() selftests/rseq: Expand for optimized RSEQ ABI v2 rseq: Reenable performance optimizations conditionally rseq: Implement read only ABI enforcement for optimized RSEQ V2 mode selftests/rseq: Validate legacy behavior selftests/rseq: Make registration flexible for legacy and optimized mode selftests/rseq: Skip tests if time slice extensions are not available rseq: Revert to historical performance killing behaviour rseq: Don't advertise time slice extensions if disabled rseq: Protect rseq_reset() against interrupts rseq: Set rseq::cpu_id_start to 0 on unregistration selftests/rseq: Don't run tests with runner scripts outside of the scripts
2026-05-08sched_ext: Use offsetofend on both sides of the ops_cid layout assertTejun Heo
sizeof() includes trailing struct pad, offsetofend() doesn't. On 32-bit PPC, sched_ext_ops_cid tail-pads 4 bytes past @priv and the assert trips. Use offsetofend() on both sides. Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202605081637.DbH4SZ1E-lkp@intel.com/ Fixes: 7e655ed7b953 ("sched_ext: Add bpf_sched_ext_ops_cid struct_ops type") Signed-off-by: Tejun Heo <tj@kernel.org>
2026-05-08sched_ext: Use IRQ_WORK_INIT_HARD() to initialize sch->disable_irq_workZqiang
For built with PREEMPT_RT kernels, the scx_disable_irq_workfn() is called from per-cpu irq_work kthreads context, this means that when call the scx_dump_state() in the scx_disable_irq_workfn() to output current->comm/pid, it always output current irq_work kthread's comm/pid. this commit therefore use the IRQ_WORK_INIT_HARD() to initialize sch->disable_irq_work to make scx_disable_irq_workfn() is called from hardirq context. Fixes: f4a6c506d118 ("sched_ext: Always bounce scx_disable() through irq_work") Signed-off-by: Zqiang <qiang.zhang@linux.dev> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-05-07sched_ext: Fix !CONFIG_EXT_SUB_SCHED build warningsTejun Heo
W=1 with CONFIG_EXT_SUB_SCHED=n flags 'err_msg' uninitialized and 'err_free_lb_resched' unused. Initialize err_msg and gate the label. Signed-off-by: Tejun Heo <tj@kernel.org>
2026-05-07sched_ext: Drop unused scx_find_sub_sched() stubTejun Heo
scx_find_sub_sched()'s only caller, scx_bpf_sub_dispatch(), is gated on CONFIG_EXT_SUB_SCHED. When CONFIG_EXT_SUB_SCHED=n the caller compiles out and the stub becomes dead code, tripping -Wunused-function on randconfigs. Drop the stub. Fixes: 25037af712eb ("sched_ext: Add rhashtable lookup for sub-schedulers") Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/all/202605080556.42PXw8U9-lkp@intel.com/ Signed-off-by: Tejun Heo <tj@kernel.org>
2026-05-07sched_ext: Move scx_error() out of scx_link_sched()'s lock regionTejun Heo
scx_link_sched() holds scx_sched_lock. The scx_error() calls inside take the same lock through scx_claim_exit() and deadlock. Move them out of the guard. Fixes: 6b4576b09714 ("sched_ext: Reject sub-sched attachment to a disabled parent") Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-05-06sched/fair: Fix wakeup_preempt_fair() for not waking up taskVincent Guittot
Make sure to only call pick_next_entity() on an non-empty cfs_rq. The assumption that p is always enqueued and not delayed, is only true for wakeup. If p was moved while delayed, pick_next_entity() will dequeue it and the cfs might become empty. Test if there are still queued tasks before trying again to determine if p could be the next one to be picked. There are at least 2 cases: When cfs becomes idle, it tries to pull tasks but if those pulled tasks are delayed, they will be dequeued when attached to cfs. attach_tasks() -> attach_task() -> wakeup_preempt(rq, p, 0); A misfit task running on cfs A triggers a load balance to be pulled on a better cpu, the load balance on cfs B starts an active load balance to pulled the running misfit task. If there is a delayed dequeue task on cfs A, it can be pulled instead of the previously running misfit task. attach_one_task() -> attach_task() -> wakeup_preempt(rq, p, 0); Fixes: ac8e69e69363 ("sched/fair: Fix wakeup_preempt_fair() vs delayed dequeue") Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260503104503.1732682-1-vincent.guittot@linaro.org
2026-05-06sched/fair: Fix overflow in vruntime_eligible()Zhan Xusheng
Zhan Xusheng reported running into sporadic a s64 mult overflow in vruntime_eligible(). When constructing a worst case scenario: If you have cgroups, then you can have an entity of weight 2 (per calc_group_shares()), and its vlag should then be bounded by: (slice+TICK_NSEC) * NICE_0_LOAD, which is around 44 bits as per the comment on entity_key(). The other extreme is 100*NICE_0_LOAD, thus you get: {key, weight}[] := { puny: { (slice + TICK_NSEC) * NICE_0_LOAD, 2 }, max: { 0, 100*NICE_0_LOAD }, } The avg_vruntime() would end up being very close to 0 (which is zero_vruntime), so no real help making that more accurate. vruntime_eligible(puny) ends up with: avg = 2 * puny.key (+ 0) load = 2 + 100 * NICE_0_LOAD avg >= puny.key * load And that is: (slice + TICK_NSEC) * NICE_0_LOAD * NICE_0_LOAD * 100, which will overflow s64. Zhan suggested using __builtin_mul_overflow(), however after staring at compiler output for various architectures using godbolt, it seems that using an __int128 multiplication often results in better code. Specifically, a number of architectures already compute the __int128 product to determine the overflow. Eg. arm64 already has the 'smulh' instruction used. By explicitly doing an __int128 multiply, it will emit the 'mul; smulh' pattern, which modern cores can fuse (armv8-a clang-22.1.0). x86_64 has less branches (no OF handling). Since Linux has ARCH_SUPPORTS_INT128 to gate __int128 usage, also provide the __builtin_mul_overflow() variant as a fallback. [peterz: Changelog and __int128 bits] Fixes: 556146ce5e94 ("sched/fair: Avoid overflow in enqueue_entity()") Reported-by: Zhan Xusheng <zhanxusheng1024@gmail.com> Closes: https://patch.msgid.link/20260415145742.10359-1-zhanxusheng%40xiaomi.com Signed-off-by: Zhan Xusheng <zhanxusheng@xiaomi.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260505103155.GN3102924%40noisy.programming.kicks-ass.net
2026-05-05Merge tag 'sched_ext-for-7.1-rc2-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext Pull sched_ext fixes from Tejun Heo: - Fix idle CPU selection returning prev_cpu outside the task's cpus_ptr when the BPF caller's allowed mask was wider. Stable backport. - Two opposite-direction gaps in scx_task_iter's cgroup-scoped mode versus the global mode: - Tasks past exit_signals() are filtered by the cgroup walk but kept by global. Sub-scheduler enable abort leaked __scx_init_task() state. Add a CSS_TASK_ITER_WITH_DEAD flag to cgroup's task iterator (scx_task_iter is its only user) and use it. - Tasks past sched_ext_dead() are still returned, tripping WARN_ON_ONCE() in callers or making them touch torn-down state. Mark and skip under the per-task rq lock. * tag 'sched_ext-for-7.1-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: sched_ext: idle: Recheck prev_cpu after narrowing allowed mask sched_ext: Skip past-sched_ext_dead() tasks in scx_task_iter_next_locked() cgroup, sched_ext: Include exiting tasks in cgroup iter
2026-05-05rseq: Revert to historical performance killing behaviourThomas Gleixner
The recent RSEQ optimization work broke the TCMalloc abuse of the RSEQ ABI as it not longer unconditionally updates the CPU, node, mm_cid fields, which are documented as read only for user space. Due to the observed behavior of the kernel it was possible for TCMalloc to overwrite the cpu_id_start field for their own purposes and rely on the kernel to update it unconditionally after each context switch and before signal delivery. The RSEQ ABI only guarantees that these fields are updated when the data changes, i.e. the task is migrated or the MMCID of the task changes due to switching from or to per CPU ownership mode. The optimization work eliminated the unconditional updates and reduced them to the documented ABI guarantees, which results in a massive performance win for syscall, scheduling heavy work loads, which in turn breaks the TCMalloc expectations. There have been several options discussed to restore the TCMalloc functionality while preserving the optimization benefits. They all end up in a series of hard to maintain workarounds, which in the worst case introduce overhead for everyone, e.g. in the scheduler. The requirements of TCMalloc and the optimization work are diametral and the required work arounds are a maintainence burden. They end up as fragile constructs, which are blocking further optimization work and are pretty much guaranteed to cause more subtle issues down the road. The optimization work heavily depends on the generic entry code, which is not used by all architectures yet. So the rework preserved the original mechanism moslty unmodified to keep the support for architectures, which handle rseq in their own exit to user space loop. That code is currently optimized out by the compiler on architectures which use the generic entry code. This allows to revert back to the original behaviour by replacing the compile time constant conditions with a runtime condition where required, which disables the optimization and the dependend time slice extension feature until the run-time condition can be enabled in the RSEQ registration code on a per task basis again. The following changes are required to restore the original behavior, which makes TCMalloc work again: 1) Replace the compile time constant conditionals with runtime conditionals where appropriate to prevent the compiler from optimizing the legacy mode out 2) Enforce unconditional update of IDs on context switch for the non-optimized v1 mode 3) Enforce update of IDs in the pre signal delivery path for the non-optimized v1 mode 4) Enforce update of IDs in the membarrier(RSEQ) IPI for the non-optimized v1 mode 5) Make time slice and future extensions depend on optimized v2 mode This brings back the full performance problems, but preserves the v2 optimization code and for generic entry code using architectures also the TIF_RSEQ optimization which avoids a full evaluation of the exit to user mode loop in many cases. Fixes: 566d8015f7ee ("rseq: Avoid CPU/MM CID updates when no event pending") Reported-by: Mathias Stearn <mathias@mongodb.com> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Tested-by: Dmitry Vyukov <dvyukov@google.com> Closes: https://lore.kernel.org/CAHnCjA25b+nO2n5CeifknSKHssJpPrjnf+dtr7UgzRw4Zgu=oA@mail.gmail.com Link: https://patch.msgid.link/20260428224427.517051752%40kernel.org Cc: stable@vger.kernel.org
2026-05-05sched: Use trace_call__<tp>() to save a static branchGabriele Monaco
The wrapper functions __trace_set_current_state() and __trace_set_need_resched() allow the tracepoints to be called from code outside sched/core.c, those calls are already guarded by a tracepoint_enabled(<tp>) so there is no need to repeat this check once again inside the call using trace_<tp>(). Use the new trace_call__<tp>() API to directly call the tracepoint without check. Those helper functions must be called after the appropriate check. Signed-off-by: Gabriele Monaco <gmonaco@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260429094227.34087-1-gmonaco@redhat.com
2026-05-05sched/membarrier: Modernize membarrier_global_expedited with cleanup guardsAniket Gattani
Replace explicit lock/unlock and free calls with scoped guards and automatic cleanup constructs. Signed-off-by: Aniket Gattani <aniketgattani@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260503212205.3714217-3-aniketgattani@google.com
2026-05-05sched/membarrier: Use per-CPU mutexes for targeted commandsAniket Gattani
Currently, the membarrier system call uses a single global mutex (`membarrier_ipi_mutex`) to serialize expedited commands. This causes significant contention on large systems when multiple threads invoke membarrier concurrently, even if they target different CPUs. This contention becomes critical when combined with CFS bandwidth throttling/unthrottling, during which interrupts can be disabled for relatively long periods on target CPUs. If membarrier is waiting for a response from such a CPU, it holds the global mutex, blocking all other membarrier calls on the system. This cascade effect can lead to hard lockups when thousands of threads stall waiting for the mutex. Optimize `MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ` when a specific CPU is targeted by introducing per-CPU mutexes. Broadcast commands and commands without a specific CPU target continue to use the global mutex. This prevents the cascade lockup scenario. As measured by the stress test introduced in the subsequent patch, on an AMD Turin machine with 384 CPUs (2 NUMA nodes with SMT=2), this optimization yields 200x more throughput. Signed-off-by: Aniket Gattani <aniketgattani@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260503212205.3714217-2-aniketgattani@google.com
2026-05-04sched_ext: Add __printf format attributes to scx_vexit() and bstr formattersTejun Heo
scx_vexit() forwards (fmt, args) to vscnprintf(); bstr_format() and __bstr_format() forward fmt to bstr_printf(); the BPF kfunc wrappers scx_bpf_exit_bstr(), scx_bpf_error_bstr() and scx_bpf_dump_bstr() in turn forward fmt to those formatters. None of them have __printf(), so clang -Wmissing-format-attribute fires on the forwarded calls and C-side callers don't get format-string checking. Annotate the six functions with __printf(N, 0) matching the fmt parameter position in each. Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/all/202605041112.Y6OG7v9r-lkp@intel.com/ Signed-off-by: Tejun Heo <tj@kernel.org>
2026-05-04sched_ext: idle: Recheck prev_cpu after narrowing allowed maskDavid Carlier
scx_select_cpu_dfl() narrows @allowed to @cpus_allowed & @p->cpus_ptr when the BPF caller supplies a @cpus_allowed that differs from @p->cpus_ptr and @p doesn't have full affinity. However, @is_prev_allowed was computed against the original (wider) @cpus_allowed, so the prev_cpu fast paths could pick a @prev_cpu that is in @cpus_allowed but not in @p->cpus_ptr, violating the intended invariant that the returned CPU is always usable by @p. The kernel masks this via the SCX_EV_SELECT_CPU_FALLBACK fallback, but the behavior contradicts the documented contract. Move the @is_prev_allowed evaluation past the narrowing block so it tests against the final @allowed mask. Fixes: ee9a4e92799d ("sched_ext: idle: Properly handle invalid prev_cpu during idle selection") Cc: stable@vger.kernel.org # v6.16+ Assisted-by: Claude <noreply@anthropic.com> Signed-off-by: David Carlier <devnexen@gmail.com> Reviewed-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-05-04sched_ext: Normalize exit dump header to "on CPU N"Cheng-Yang Chou
Unify to uppercase to match the UEI output. Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>