| Age | Commit message (Collapse) | Author |
|
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Including fixes from Netfilter, IPsec, Bluetooth and WiFi.
Current release - fix to a fix:
- ipmr: add __rcu to netns_ipv4.mrt, make sure we hold the RCU lock
in all relevant places
Current release - new code bugs:
- fixes for the recently added resizable hash tables
- ipv6: make sure we default IPv6 tunnel drivers to =m now that IPv6
itself is built in
- drv: octeontx2-af: fixes for parser/CAM fixes
Previous releases - regressions:
- phy: micrel: fix LAN8814 QSGMII soft reset
- wifi:
- cw1200: revert "Fix locking in error paths"
- ath12k: fix crash on WCN7850, due to adding the same queue
buffer to a list multiple times
Previous releases - always broken:
- number of info leak fixes
- ipv6: implement limits on extension header parsing
- wifi: number of fixes for missing bound checks in the drivers
- Bluetooth: fixes for races and locking issues
- af_unix:
- fix an issue between garbage collection and PEEK
- fix yet another issue with OOB data
- xfrm: esp: avoid in-place decrypt on shared skb frags
- netfilter: replace skb_try_make_writable() by skb_ensure_writable()
- openvswitch: vport: fix race between tunnel creation and linking
leading to invalid memory accesses (type confusion)
- drv: amd-xgbe: fix PTP addend overflow causing frozen clock
Misc:
- sched/isolation: make HK_TYPE_KTHREAD an alias of HK_TYPE_DOMAIN
(for relevant IPVS change)"
* tag 'net-7.1-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (190 commits)
net: sparx5: configure serdes for 1000BASE-X in sparx5_port_init()
net: sparx5: fix wrong chip ids for TSN SKUs
net: stmmac: dwmac-nuvoton: fix NULL pointer dereference in nvt_set_phy_intf_sel()
tcp: Fix dst leak in tcp_v6_connect().
ipmr: Call ipmr_fib_lookup() under RCU.
net: phy: broadcom: Save PHY counters during suspend
net/smc: fix missing sk_err when TCP handshake fails
af_unix: Reject SIOCATMARK on non-stream sockets
veth: fix OOB txq access in veth_poll() with asymmetric queue counts
eth: fbnic: fix double-free of PCS on phylink creation failure
net: ethernet: cortina: Drop half-assembled SKB
selftests: mptcp: pm: restrict 'unknown' check to pm_nl_ctl
selftests: mptcp: check output: catch cmd errors
mptcp: pm: prio: skip closed subflows
mptcp: pm: ADD_ADDR rtx: return early if no retrans
mptcp: pm: ADD_ADDR rtx: skip inactive subflows
mptcp: pm: ADD_ADDR rtx: resched blocked ADD_ADDR quicker
mptcp: pm: ADD_ADDR rtx: free sk if last
mptcp: pm: ADD_ADDR rtx: always decrease sk refcount
mptcp: pm: ADD_ADDR rtx: fix potential data-race
...
|
|
Since commit 041ee6f3727a ("kthread: Rely on HK_TYPE_DOMAIN for preferred
affinity management"), kthreads default to use the HK_TYPE_DOMAIN
cpumask. IOW, it is no longer affected by the setting of the nohz_full
boot kernel parameter.
That means HK_TYPE_KTHREAD should now be an alias of HK_TYPE_DOMAIN
instead of HK_TYPE_KERNEL_NOISE to correctly reflect the current kthread
behavior. Make the change as HK_TYPE_KTHREAD is still being used in
some networking code.
Fixes: 041ee6f3727a ("kthread: Rely on HK_TYPE_DOMAIN for preferred affinity management")
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
scx_task_iter's cgroup-scoped mode can return tasks whose
sched_ext_dead() has already completed: cgroup_task_dead() removes
from cset->tasks after sched_ext_dead() in finish_task_switch() and is
irq-work deferred on PREEMPT_RT. The global mode is fine -
sched_ext_dead() removes from scx_tasks via list_del_init() first.
Callers (sub-sched enable prep/abort/apply, scx_sub_disable(),
scx_fail_parent()) assume returned tasks are still on @sch and trip
WARN_ON_ONCE() or operate on torn-down state otherwise.
Set %SCX_TASK_OFF_TASKS in sched_ext_dead() under @p's rq lock and
have scx_task_iter_next_locked() skip flagged tasks under the same
lock. Setter and reader serialize on the per-task rq lock - no race.
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull runtime verification updates from Steven Rostedt:
- Refactor da_monitor header to share handlers across monitor types
No functional changes, only less code duplication.
- Add Hybrid Automata model class
Add a new model class that extends deterministic automata by adding
constraints on transitions and states. Those constraints can take
into account wall-clock time and as such allow RV monitor to make
assertions on real time. Add documentation and code generation
scripts.
- Add stall monitor as hybrid automaton example
Add a monitor that triggers a violation when a task is stalling as an
example of automaton working with real time variables.
- Convert the opid monitor to a hybrid automaton
The opid monitor can be heavily simplified if written as a hybrid
automaton: instead of tracking preempt and interrupt enable/disable
events, it can just run constraints on the preemption/interrupt
states when events like wakeup and need_resched verify.
- Add support for per-object monitors in DA/HA
Allow writing deterministic and hybrid automata monitors for generic
objects (e.g. any struct), by exploiting a hash table where objects
are saved. This allows to track more than just tasks in RV. For
instance it will be used to track deadline entities in deadline
monitors.
- Add deadline tracepoints and move some deadline utilities
Prepare the ground for deadline monitors by defining events and
exporting helpers.
- Add nomiss deadline monitor
Add first example of deadline monitor asserting all entities complete
before their deadline.
- Improve rvgen error handling
Introduce AutomataError exception class and better handle expected
exceptions while showing a backtrace for unexpected ones.
- Improve python code quality in rvgen
Refactor the rvgen generation scripts to align with python best
practices: use f-strings instead of %, use len() instead of
__len__(), remove semicolons, use context managers for file
operations, fix whitespace violations, extract magic strings into
constants, remove unused imports and methods.
- Fix small bugs in rvgen
The generator scripts presented some corner case bugs: logical error
in validating what a correct dot file looks like, fix an isinstance()
check, enforce a dot file has an initial state, fix type annotations
and typos in comments.
- rvgen refactoring
Refactor automata.py to use iterator-based parsing and handle
required arguments directly in argparse.
- Allow epoll in rtapp-sleep monitor
The epoll_wait call is now rt-friendly so it should be allowed in the
sleep monitor as a valid sleep method.
* tag 'trace-rv-v7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: (32 commits)
rv: Allow epoll in rtapp-sleep monitor
rv/rvgen: fix _fill_states() return type annotation
rv/rvgen: fix unbound loop variable warning
rv/rvgen: enforce presence of initial state
rv/rvgen: extract node marker string to class constant
rv/rvgen: fix isinstance check in Variable.expand()
rv/rvgen: make monitor arguments required in rvgen
rv/rvgen: remove unused __get_main_name method
rv/rvgen: remove unused sys import from dot2c
rv/rvgen: refactor automata.py to use iterator-based parsing
rv/rvgen: use class constant for init marker
rv/rvgen: fix DOT file validation logic error
rv/rvgen: fix PEP 8 whitespace violations
rv/rvgen: fix typos in automata and generator docstring and comments
rv/rvgen: use context managers for file operations
rv/rvgen: remove unnecessary semicolons
rv/rvgen: replace __len__() calls with len()
rv/rvgen: replace % string formatting with f-strings
rv/rvgen: remove bare except clauses in generator
rv/rvgen: introduce AutomataError exception class
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext
Pull sched_ext updates from Tejun Heo:
- cgroup sub-scheduler groundwork
Multiple BPF schedulers can be attached to cgroups and the dispatch
path is made hierarchical. This involves substantial restructuring of
the core dispatch, bypass, watchdog, and dump paths to be
per-scheduler, along with new infrastructure for scheduler ownership
enforcement, lifecycle management, and cgroup subtree iteration
The enqueue path is not yet updated and will follow in a later cycle
- scx_bpf_dsq_reenq() generalized to support any DSQ including remote
local DSQs and user DSQs
Built on top of this, SCX_ENQ_IMMED guarantees that tasks dispatched
to local DSQs either run immediately or get reenqueued back through
ops.enqueue(), giving schedulers tighter control over queueing
latency
Also useful for opportunistic CPU sharing across sub-schedulers
- ops.dequeue() was only invoked when the core knew a task was in BPF
data structures, missing scheduling property change events and
skipping callbacks for non-local DSQ dispatches from ops.select_cpu()
Fixed to guarantee exactly one ops.dequeue() call when a task leaves
BPF scheduler custody
- Kfunc access validation moved from runtime to BPF verifier time,
removing runtime mask enforcement
- Idle SMT sibling prioritization in the idle CPU selection path
- Documentation, selftest, and tooling updates. Misc bug fixes and
cleanups
* tag 'sched_ext-for-7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: (134 commits)
tools/sched_ext: Add explicit cast from void* in RESIZE_ARRAY()
sched_ext: Make string params of __ENUM_set() const
tools/sched_ext: Kick home CPU for stranded tasks in scx_qmap
sched_ext: Drop spurious warning on kick during scheduler disable
sched_ext: Warn on task-based SCX op recursion
sched_ext: Rename scx_kf_allowed_on_arg_tasks() to scx_kf_arg_task_ok()
sched_ext: Remove runtime kfunc mask enforcement
sched_ext: Add verifier-time kfunc context filter
sched_ext: Drop redundant rq-locked check from scx_bpf_task_cgroup()
sched_ext: Decouple kfunc unlocked-context check from kf_mask
sched_ext: Fix ops.cgroup_move() invocation kf_mask and rq tracking
sched_ext: Track @p's rq lock across set_cpus_allowed_scx -> ops.set_cpumask
sched_ext: Add select_cpu kfuncs to scx_kfunc_ids_unlocked
sched_ext: Drop TRACING access to select_cpu kfuncs
selftests/sched_ext: Fix wrong DSQ ID in peek_dsq error message
sched_ext: Documentation: improve accuracy of task lifecycle pseudo-code
selftests/sched_ext: Improve runner error reporting for invalid arguments
sched_ext: Documentation: Fix scx_bpf_move_to_local kfunc name
sched_ext: Documentation: Add ops.dequeue() to task lifecycle
tools/sched_ext: Fix off-by-one in scx_sdt payload zeroing
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
"Fair scheduling updates:
- Skip SCHED_IDLE rq for SCHED_IDLE tasks (Christian Loehle)
- Remove superfluous rcu_read_lock() in the wakeup path (K Prateek Nayak)
- Simplify the entry condition for update_idle_cpu_scan() (K Prateek Nayak)
- Simplify SIS_UTIL handling in select_idle_cpu() (K Prateek Nayak)
- Avoid overflow in enqueue_entity() (K Prateek Nayak)
- Update overutilized detection (Vincent Guittot)
- Prevent negative lag increase during delayed dequeue (Vincent Guittot)
- Clear buddies for preempt_short (Vincent Guittot)
- Implement more complex proportional newidle balance (Peter Zijlstra)
- Increase weight bits for avg_vruntime (Peter Zijlstra)
- Use full weight to __calc_delta() (Peter Zijlstra)
RT and DL scheduling updates:
- Fix incorrect schedstats for rt and dl thread (Dengjun Su)
- Skip group schedulable check with rt_group_sched=0 (Michal Koutný)
- Move group schedulability check to sched_rt_global_validate()
(Michal Koutný)
- Add reporting of runtime left & abs deadline to sched_getattr()
for DEADLINE tasks (Tommaso Cucinotta)
Scheduling topology updates by K Prateek Nayak:
- Compute sd_weight considering cpuset partitions
- Extract "imb_numa_nr" calculation into a separate helper
- Allocate per-CPU sched_domain_shared in s_data
- Switch to assigning "sd->shared" from s_data
- Remove sched_domain_shared allocation with sd_data
Energy-aware scheduling updates:
- Filter false overloaded_group case for EAS (Vincent Guittot)
- PM: EM: Switch to rcu_dereference_all() in wakeup path
(Dietmar Eggemann)
Infrastructure updates:
- Replace use of system_unbound_wq with system_dfl_wq (Marco Crivellari)
Proxy scheduling updates by John Stultz:
- Make class_schedulers avoid pushing current, and get rid of proxy_tag_curr()
- Minimise repeated sched_proxy_exec() checking
- Fix potentially missing balancing with Proxy Exec
- Fix and improve task::blocked_on et al handling
- Add assert_balance_callbacks_empty() helper
- Add logic to zap balancing callbacks if we pick again
- Move attach_one_task() and attach_task() helpers to sched.h
- Handle blocked-waiter migration (and return migration)
- Add K Prateek Nayak to scheduler reviewers for proxy execution
Misc cleanups and fixes by John Stultz, Joseph Salisbury, Peter
Zijlstra, K Prateek Nayak, Michal Koutný, Randy Dunlap, Shrikanth
Hegde, Vincent Guittot, Zhan Xusheng, Xie Yuanbin and Vincent Guittot"
* tag 'sched-core-2026-04-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (46 commits)
sched/eevdf: Clear buddies for preempt_short
sched/rt: Cleanup global RT bandwidth functions
sched/rt: Move group schedulability check to sched_rt_global_validate()
sched/rt: Skip group schedulable check with rt_group_sched=0
sched/fair: Avoid overflow in enqueue_entity()
sched: Use u64 for bandwidth ratio calculations
sched/fair: Prevent negative lag increase during delayed dequeue
sched/fair: Use sched_energy_enabled()
sched: Handle blocked-waiter migration (and return migration)
sched: Move attach_one_task and attach_task helpers to sched.h
sched: Add logic to zap balance callbacks if we pick again
sched: Add assert_balance_callbacks_empty helper
sched/locking: Add special p->blocked_on==PROXY_WAKING value for proxy return-migration
sched: Fix modifying donor->blocked on without proper locking
locking: Add task::blocked_lock to serialize blocked_on state
sched: Fix potentially missing balancing with Proxy Exec
sched: Minimise repeated sched_proxy_exec() checking
sched: Make class_schedulers avoid pushing current, and get rid of proxy_tag_curr()
MAINTAINERS: Add K Prateek Nayak to scheduler reviewers
sched/core: Get this cpu once in ttwu_queue_cond()
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking updates from Ingo Molnar:
"Mutexes:
- Add killable flavor to guard definitions (Davidlohr Bueso)
- Remove the list_head from struct mutex (Matthew Wilcox)
- Rename mutex_init_lockep() (Davidlohr Bueso)
rwsems:
- Remove the list_head from struct rw_semaphore and
replace it with a single pointer (Matthew Wilcox)
- Fix logic error in rwsem_del_waiter() (Andrei Vagin)
Semaphores:
- Remove the list_head from struct semaphore (Matthew Wilcox)
Jump labels:
- Use ATOMIC_INIT() for initialization of .enabled (Thomas Weißschuh)
- Remove workaround for old compilers in initializations
(Thomas Weißschuh)
Lock context analysis changes and improvements:
- Add context analysis for rwsems (Peter Zijlstra)
- Fix rwlock and spinlock lock context annotations (Bart Van Assche)
- Fix rwlock support in <linux/spinlock_up.h> (Bart Van Assche)
- Add lock context annotations in the spinlock implementation
(Bart Van Assche)
- signal: Fix the lock_task_sighand() annotation (Bart Van Assche)
- ww-mutex: Fix the ww_acquire_ctx function annotations
(Bart Van Assche)
- Add lock context support in do_raw_{read,write}_trylock()
(Bart Van Assche)
- arm64, compiler-context-analysis: Permit alias analysis through
__READ_ONCE() with CONFIG_LTO=y (Marco Elver)
- Add __cond_releases() (Peter Zijlstra)
- Add context analysis for mutexes (Peter Zijlstra)
- Add context analysis for rtmutexes (Peter Zijlstra)
- Convert futexes to compiler context analysis (Peter Zijlstra)
Rust integration updates:
- Add atomic fetch_sub() implementation (Andreas Hindborg)
- Refactor various rust_helper_ methods for expansion (Boqun Feng)
- Add Atomic<*{mut,const} T> support (Boqun Feng)
- Add atomic operation helpers over raw pointers (Boqun Feng)
- Add performance-optimal Flag type for atomic booleans, to avoid
slow byte-sized RMWs on architectures that don't support them.
(FUJITA Tomonori)
- Misc cleanups and fixes (Andreas Hindborg, Boqun Feng, FUJITA
Tomonori)
LTO support updates:
- arm64: Optimize __READ_ONCE() with CONFIG_LTO=y (Marco Elver)
- compiler: Simplify generic RELOC_HIDE() (Marco Elver)
Miscellaneous fixes and cleanups by Peter Zijlstra, Randy Dunlap,
Thomas Weißschuh, Davidlohr Bueso and Mikhail Gavrilov"
* tag 'locking-core-2026-04-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (39 commits)
compiler: Simplify generic RELOC_HIDE()
locking: Add lock context annotations in the spinlock implementation
locking: Add lock context support in do_raw_{read,write}_trylock()
locking: Fix rwlock support in <linux/spinlock_up.h>
lockdep: Raise default stack trace limits when KASAN is enabled
cleanup: Optimize guards
jump_label: remove workaround for old compilers in initializations
jump_label: use ATOMIC_INIT() for initialization of .enabled
futex: Convert to compiler context analysis
locking/rwsem: Fix logic error in rwsem_del_waiter()
locking/rwsem: Add context analysis
locking/rtmutex: Add context analysis
locking/mutex: Add context analysis
compiler-context-analysys: Add __cond_releases()
locking/mutex: Remove the list_head from struct mutex
locking/semaphore: Remove the list_head from struct semaphore
locking/rwsem: Remove the list_head from struct rw_semaphore
rust: atomic: Update a safety comment in impl of `fetch_add()`
rust: sync: atomic: Update documentation for `fetch_add()`
rust: sync: atomic: Add fetch_sub()
...
|
|
Now that scx_kfunc_context_filter enforces context-sensitive kfunc
restrictions at BPF load time, the per-task runtime enforcement via
scx_kf_mask is redundant. Remove it entirely:
- Delete enum scx_kf_mask, the kf_mask field on sched_ext_entity, and
the scx_kf_allow()/scx_kf_disallow()/scx_kf_allowed() helpers along
with the higher_bits()/highest_bit() helpers they used.
- Strip the @mask parameter (and the BUILD_BUG_ON checks) from the
SCX_CALL_OP[_RET]/SCX_CALL_OP_TASK[_RET]/SCX_CALL_OP_2TASKS_RET
macros and update every call site. Reflow call sites that were
wrapped only to fit the old 5-arg form and now collapse onto a single
line under ~100 cols.
- Remove the in-kfunc scx_kf_allowed() runtime checks from
scx_dsq_insert_preamble(), scx_dsq_move(), scx_bpf_dispatch_nr_slots(),
scx_bpf_dispatch_cancel(), scx_bpf_dsq_move_to_local___v2(),
scx_bpf_sub_dispatch(), scx_bpf_reenqueue_local(), and the per-call
guard inside select_cpu_from_kfunc().
scx_bpf_task_cgroup() and scx_kf_allowed_on_arg_tasks() were already
cleaned up in the "drop redundant rq-locked check" patch.
scx_kf_allowed_if_unlocked() was rewritten in the preceding "decouple"
patch. No further changes to those helpers here.
Co-developed-by: Juntong Deng <juntong.deng@outlook.com>
Signed-off-by: Juntong Deng <juntong.deng@outlook.com>
Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
|
|
Some utility functions on sched_dl_entity can be useful outside of
deadline.c , for instance for modelling, without relying on raw
structure fields.
Move functions like dl_task_of and dl_is_implicit to deadline.h to make
them available outside.
Acked-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://lore.kernel.org/r/20260330111010.153663-12-gmonaco@redhat.com
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
|
|
Commit 8e8e23dea43e ("sched/topology: Compute sd_weight considering
cpuset partitions") ends up relying on the fact that structure
initialization should not touch the flexible array.
However, the official GCC specification for "Arrays of Length Zero"
[*] says:
Although the size of a zero-length array is zero, an array member of
this kind may increase the size of the enclosing type as a result of
tail padding.
Additionally, structure initialization will zero tail padding. With
the end result that since offsetof(*type, member) < sizeof(*type),
array initialization will clobber the flex array.
Luckily, the way flexible array sizes are calculated is:
sizeof(*type) + count * sizeof(*type->member)
This means we have the complete size of the flex array *outside* of
sizeof(*type), so use that instead of relying on the broken flex array
definition.
[*] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
Fixes: 8e8e23dea43e ("sched/topology: Compute sd_weight considering cpuset partitions")
Reported-by: Nathan Chancellor <nathan@kernel.org>
Debugged-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Jon Hunter <jonathanh@nvidia.com>
Tested-by: Chen Yu <yu.c.chen@intel.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Link: https://patch.msgid.link/20260323093627.GY3738010@noisy.programming.kicks-ass.net
|
|
Now that "sd->shared" assignments are using the sched_domain_shared
objects allocated with s_data, remove the sd_data based allocations.
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://patch.msgid.link/20260312044434.1974-6-kprateek.nayak@amd.com
|
|
Add SCX_ENQ_IMMED enqueue flag for local DSQ insertions. Once a task is
dispatched with IMMED, it either gets on the CPU immediately and stays on it,
or gets reenqueued back to the BPF scheduler. It will never linger on a local
DSQ behind other tasks or on a CPU taken by a higher-priority class.
rq_is_open() uses rq->next_class to determine whether the rq is available,
and wakeup_preempt_scx() triggers reenqueue when a higher-priority class task
arrives. These capture all higher class preemptions. Combined with reenqueue
points in the dispatch path, all cases where an IMMED task would not execute
immediately are covered.
SCX_TASK_IMMED persists in p->scx.flags until the next fresh enqueue, so the
guarantee survives SAVE/RESTORE cycles. If preempted while running,
put_prev_task_scx() reenqueues through ops.enqueue() with
SCX_TASK_REENQ_PREEMPTED instead of silently placing the task back on the
local DSQ.
This enables tighter scheduling latency control by preventing tasks from
piling up on local DSQs. It also enables opportunistic CPU sharing across
sub-schedulers - without this, a sub-scheduler can stuff the local DSQ of a
shared CPU, making it difficult for others to use.
v2: - Rewrite is_curr_done() as rq_is_open() using rq->next_class and
implement wakeup_preempt_scx() to achieve complete coverage of all
cases where IMMED tasks could get stranded.
- Track IMMED persistently in p->scx.flags and reenqueue
preempted-while-running tasks through ops.enqueue().
- Bound deferred reenq cycles (SCX_REENQ_LOCAL_MAX_REPEAT).
- Misc renames, documentation.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
|
|
Add a new clone3() flag CLONE_AUTOREAP that makes a child process
auto-reap on exit without ever becoming a zombie. This is a per-process
property in contrast to the existing auto-reap mechanism via
SA_NOCLDWAIT or SIG_IGN for SIGCHLD which applies to all children of a
given parent.
Currently the only way to automatically reap children is to set
SA_NOCLDWAIT or SIG_IGN on SIGCHLD. This is a parent-scoped property
affecting all children which makes it unsuitable for libraries or
applications that need selective auto-reaping of specific children while
still being able to wait() on others.
CLONE_AUTOREAP stores an autoreap flag in the child's signal_struct.
When the child exits do_notify_parent() checks this flag and causes
exit_notify() to transition the task directly to EXIT_DEAD. Since the
flag lives on the child it survives reparenting: if the original parent
exits and the child is reparented to a subreaper or init the child still
auto-reaps when it eventually exits.
CLONE_AUTOREAP can be combined with CLONE_PIDFD to allow the parent to
monitor the child's exit via poll() and retrieve exit status via
PIDFD_GET_INFO. Without CLONE_PIDFD it provides a fire-and-forget
pattern where the parent simply doesn't care about the child's exit
status. No exit signal is delivered so exit_signal must be zero.
CLONE_AUTOREAP is rejected in combination with CLONE_PARENT. If a
CLONE_AUTOREAP child were to clone(CLONE_PARENT) the new grandchild
would inherit exit_signal == 0 from the autoreap parent's group leader
but without signal->autoreap. This grandchild would become a zombie that
never sends a signal and is never autoreaped - confusing and arguably
broken behavior.
The flag is not inherited by the autoreap process's own children. Each
child that should be autoreaped must be explicitly created with
CLONE_AUTOREAP.
Link: https://github.com/uapi-group/kernel-features/issues/45
Link: https://patch.msgid.link/20260226-work-pidfs-autoreap-v5-1-d148b984a989@kernel.org
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into for-7.1
Pull sched/core to resolve conflicts between:
c2a57380df9dd ("sched: Replace use of system_unbound_wq with system_dfl_wq")
from the tip tree and commit:
cde94c032b32b ("sched_ext: Make watchdog sub-sched aware")
The latter moves around code modiefied by the former. Apply the changes in
the new locations.
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
SCX_ENQ_REENQ indicates that a task is being re-enqueued but doesn't tell the
BPF scheduler why. Add SCX_TASK_REENQ_REASON flags using bits 12-13 of
p->scx.flags to communicate the reason during ops.enqueue():
- NONE: Not being reenqueued
- KFUNC: Reenqueued by scx_bpf_dsq_reenq() and friends
More reasons will be added.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
|
|
Task states (NONE, INIT, READY, ENABLED) were defined in a separate enum with
unshifted values and then shifted when stored in scx_entity.flags. Simplify by
defining them as pre-shifted values directly in scx_ent_flags and removing the
separate scx_task_state enum. This removes the need for shifting when
reading/writing state values.
scx_get_task_state() now returns the masked flags value directly.
scx_set_task_state() accepts the pre-shifted state value. scx_dump_task()
shifts down for display to maintain readable output.
No functional changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
|
|
scx_bpf_dsq_reenq() currently only supports local DSQs. Extend it to support
user-defined DSQs by adding a deferred re-enqueue mechanism similar to the
local DSQ handling.
Add per-cpu deferred_reenq_user_node/flags to scx_dsq_pcpu and
deferred_reenq_users list to scx_rq. When scx_bpf_dsq_reenq() is called on a
user DSQ, the DSQ's per-cpu node is added to the current rq's deferred list.
process_deferred_reenq_users() then iterates the DSQ using the cursor helpers
and re-enqueues each task.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
|
|
Factor out cursor-based DSQ iteration from bpf_iter_scx_dsq_next() into
nldsq_cursor_next_task() and the task-lost check from scx_dsq_move() into
nldsq_cursor_lost_task() to prepare for reuse.
As ->priv is only used to record dsq->seq for cursors, update
INIT_DSQ_LIST_CURSOR() to take the DSQ pointer and set ->priv from dsq->seq
so that users don't have to read it manually. Move scx_dsq_iter_flags enum
earlier so nldsq_cursor_next_task() can use SCX_DSQ_ITER_REV.
bypass_lb_cpu() now sets cursor.priv to dsq->seq but doesn't use it.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
|
|
Add per-CPU data structure to dispatch queues. Each DSQ now has a percpu
scx_dsq_pcpu which contains a back-pointer to the DSQ. This will be used by
future changes to implement per-CPU reenqueue tracking for user DSQs.
init_dsq() now allocates the percpu data and can fail, so it returns an
error code. All callers are updated to handle failures. exit_dsq() is added
to free the percpu data and is called from all DSQ cleanup paths.
In scx_bpf_create_dsq(), init_dsq() is called before rcu_read_lock() since
alloc_percpu() requires GFP_KERNEL context, and dsq->sched is set
afterwards.
v2: Fix err_free_pcpu to only exit_dsq() initialized bypass DSQs (Andrea
Righi).
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
|
|
The preceding changes implemented the framework to support cgroup
sub-scheds and updated scheduling paths and kfuncs so that they have
minimal but working support for sub-scheds. However, actual sub-sched
enabling/disabling hasn't been implemented yet and all tasks stayed on
scx_root.
Implement cgroup sub-sched enabling and disabling to actually activate
sub-scheds:
- Both enable and disable operations bypass only the tasks in the subtree
of the child being enabled or disabled to limit disruptions.
- When enabling, all candidate tasks are first initialized for the child
sched. Once that succeeds, the tasks are exited for the parent and then
switched over to the child. This adds a bit of complication but
guarantees that child scheduler failures are always contained.
- Disabling works the same way in the other direction. However, when the
parent may fail to initialize a task, disabling is propagated up to the
parent. While this means that a parent sched fail due to a child sched
event, the failure can only originate from the parent itself (its
ops.init_task()). The only effect a malfunctioning child can have on the
parent is attempting to move the tasks back to the parent.
After this change, although not all the necessary mechanisms are in place
yet, sub-scheds can take control of their tasks and schedule them.
v2: Fix missing scx_cgroup_unlock()/percpu_up_write() in abort path
(Cheng-Yang Chou).
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
|
|
In preparation of multiple scheduler support, add p->scx.sched which points
to the scx_sched instance that the task is scheduled by, which is currently
always scx_root. Add scx_task_sched[_rcu]() accessors which return the
associated scx_sched of the specified task and replace the raw scx_root
dereferences with it where applicable. scx_task_on_sched() is also added to
test whether a given task is on the specified sched.
As scx_root is still the only scheduler, this shouldn't introduce
user-visible behavior changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
|
|
A system often runs multiple workloads especially in multi-tenant server
environments where a system is split into partitions servicing separate
more-or-less independent workloads each requiring an application-specific
scheduler. To support such and other use cases, sched_ext is in the process
of growing multiple scheduler support.
When partitioning a system in terms of CPUs for such use cases, an
oft-taken approach is hard partitioning the system using cpuset. While it
would be possible to tie sched_ext multiple scheduler support to cpuset
partitions, such an approach would have fundamental limitations stemming
from the lack of dynamism and flexibility.
Users often don't care which specific CPUs are assigned to which workload
and want to take advantage of optimizations which are enabled by running
workloads on a larger machine - e.g. opportunistic over-commit, improving
latency critical workload characteristics while maintaining bandwidth
fairness, employing control mechanisms based on different criteria than
on-CPU time for e.g. flexible memory bandwidth isolation, packing similar
parts from different workloads on same L3s to improve cache efficiency,
and so on.
As this sort of dynamic behaviors are impossible or difficult to implement
with hard partitioning, sched_ext is implementing cgroup sub-sched support
where schedulers can be attached to the cgroup hierarchy and a parent
scheduler is responsible for controlling the CPUs that each child can use
at any given moment. This makes CPU distribution dynamically controlled by
BPF allowing high flexibility.
This patch adds the skeletal sched_ext cgroup sub-sched support:
- sched_ext_ops.sub_cgroup_id and .sub_attach/detach() are added. Non-zero
sub_cgroup_id indicates that the scheduler is to be attached to the
identified cgroup. A sub-sched is attached to the cgroup iff the nearest
ancestor scheduler implements .sub_attach() and grants the attachment. Max
nesting depth is limited by SCX_SUB_MAX_DEPTH.
- When a scheduler exits, all its descendant schedulers are exited
together. Also, cgroup.scx_sched added which points to the effective
scheduler instance for the cgroup. This is updated on scheduler
init/exit and inherited on cgroup online. When a cgroup is offlined, the
attached scheduler is automatically exited.
- Sub-sched support is gated on CONFIG_EXT_SUB_SCHED which is
automatically enabled if both SCX and cgroups are enabled. Sub-sched
support is not tied to the CPU controller but rather the cgroup
hierarchy itself. This is intentional as the support for cpu.weight and
cpu.max based resource control is orthogonal to sub-sched support. Note
that CONFIG_CGROUPS around cgroup subtree iteration support for
scx_task_iter is replaced with CONFIG_EXT_SUB_SCHED for consistency.
- This allows loading sub-scheds and most framework operations such as
propagating disable down the hierarchy work. However, sub-scheds are not
operational yet and all tasks stay with the root sched. This will serve
as the basis for building up full sub-sched support.
- DSQs point to the scx_sched they belong to.
- scx_qmap is updated to allow attachment of sub-scheds and also serving
as sub-scheds.
- scx_is_descendant() is added but not yet used in this patch. It is used by
later changes in the series and placed here as this is where the function
belongs.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
|
|
lock_task_sighand() may return NULL. Make this clear in its lock context
annotation.
Fixes: 04e49d926f43 ("sched: Enable context analysis for core.c and fair.c")
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Marco Elver <elver@google.com>
Link: https://patch.msgid.link/20260225183244.4035378-3-bvanassche@acm.org
|
|
Reorder struct sched_ext_entity to place ops_state, ddsp_dsq_id, and
ddsp_enq_flags immediately after dsq. These fields are accessed together
in the do_enqueue_task() and finish_dispatch() hot paths but were
previously spread across three different cache lines. Grouping them on
the same cache line reduces cache misses on every enqueue and dispatch
operation.
Signed-off-by: David Carlier <devnexen@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
Currently, ops.dequeue() is only invoked when the sched_ext core knows
that a task resides in BPF-managed data structures, which causes it to
miss scheduling property change events. In addition, ops.dequeue()
callbacks are completely skipped when tasks are dispatched to non-local
DSQs from ops.select_cpu(). As a result, BPF schedulers cannot reliably
track task state.
Fix this by guaranteeing that each task entering the BPF scheduler's
custody triggers exactly one ops.dequeue() call when it leaves that
custody, whether the exit is due to a dispatch (regular or via a core
scheduling pick) or to a scheduling property change (e.g.
sched_setaffinity(), sched_setscheduler(), set_user_nice(), NUMA
balancing, etc.).
BPF scheduler custody concept: a task is considered to be in the BPF
scheduler's custody when the scheduler is responsible for managing its
lifecycle. This includes tasks dispatched to user-created DSQs or stored
in the BPF scheduler's internal data structures from ops.enqueue().
Custody ends when the task is dispatched to a terminal DSQ (such as the
local DSQ or %SCX_DSQ_GLOBAL), selected by core scheduling, or removed
due to a property change.
Tasks directly dispatched to terminal DSQs bypass the BPF scheduler
entirely and are never in its custody. Terminal DSQs include:
- Local DSQs (%SCX_DSQ_LOCAL or %SCX_DSQ_LOCAL_ON): per-CPU queues
where tasks go directly to execution.
- Global DSQ (%SCX_DSQ_GLOBAL): the built-in fallback queue where the
BPF scheduler is considered "done" with the task.
As a result, ops.dequeue() is not invoked for tasks directly dispatched
to terminal DSQs.
To identify dequeues triggered by scheduling property changes, introduce
the new ops.dequeue() flag %SCX_DEQ_SCHED_CHANGE: when this flag is set,
the dequeue was caused by a scheduling property change.
New ops.dequeue() semantics:
- ops.dequeue() is invoked exactly once when the task leaves the BPF
scheduler's custody, in one of the following cases:
a) regular dispatch: a task dispatched to a user DSQ or stored in
internal BPF data structures is moved to a terminal DSQ
(ops.dequeue() called without any special flags set),
b) core scheduling dispatch: core-sched picks task before dispatch
(ops.dequeue() called with %SCX_DEQ_CORE_SCHED_EXEC flag set),
c) property change: task properties modified before dispatch,
(ops.dequeue() called with %SCX_DEQ_SCHED_CHANGE flag set).
This allows BPF schedulers to:
- reliably track task ownership and lifecycle,
- maintain accurate accounting of managed tasks,
- update internal state when tasks change properties.
Cc: Tejun Heo <tj@kernel.org>
Cc: Emil Tsalapatis <emil@etsalapatis.com>
Cc: Kuba Piecuch <jpiecuch@google.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
It turns out that a few workloads (easyWave, fio) have a fairly low
success rate on newidle balance, but still benefit greatly from having
it anyway.
Luckliky these workloads have a faily low newidle rate, so the cost if
doing the newidle is relatively low, even if unsuccessfull.
Add a simple rate based part to the newidle ratio compute, such that
low rate newidle will still have a high newidle ratio.
This cures the easyWave and fio workloads while not affecting the
schbench numbers either (which have a very high newidle rate).
Reported-by: Mario Roy <marioeroy@gmail.com>
Reported-by: "Mohamed Abuelfotoh, Hazem" <abuehaze@amazon.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Mario Roy <marioeroy@gmail.com>
Tested-by: "Mohamed Abuelfotoh, Hazem" <abuehaze@amazon.com>
Link: https://patch.msgid.link/20260127151748.GA1079264@noisy.programming.kicks-ass.net
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 paravirt updates from Borislav Petkov:
- A nice cleanup to the paravirt code containing a unification of the
paravirt clock interface, taming the include hell by splitting the
pv_ops structure and removing of a bunch of obsolete code (Juergen
Gross)
* tag 'x86_paravirt_for_v7.0_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (23 commits)
x86/paravirt: Use XOR r32,r32 to clear register in pv_vcpu_is_preempted()
x86/paravirt: Remove trailing semicolons from alternative asm templates
x86/pvlocks: Move paravirt spinlock functions into own header
x86/paravirt: Specify pv_ops array in paravirt macros
x86/paravirt: Allow pv-calls outside paravirt.h
objtool: Allow multiple pv_ops arrays
x86/xen: Drop xen_mmu_ops
x86/xen: Drop xen_cpu_ops
x86/xen: Drop xen_irq_ops
x86/paravirt: Move pv_native_*() prototypes to paravirt.c
x86/paravirt: Introduce new paravirt-base.h header
x86/paravirt: Move paravirt_sched_clock() related code into tsc.c
x86/paravirt: Use common code for paravirt_steal_clock()
riscv/paravirt: Use common code for paravirt_steal_clock()
loongarch/paravirt: Use common code for paravirt_steal_clock()
arm64/paravirt: Use common code for paravirt_steal_clock()
arm/paravirt: Use common code for paravirt_steal_clock()
sched: Move clock related paravirt code to kernel/sched
paravirt: Remove asm/paravirt_api_clock.h
x86/paravirt: Move thunk macros to paravirt_types.h
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking updates from Ingo Molnar:
"Lock debugging:
- Implement compiler-driven static analysis locking context checking,
using the upcoming Clang 22 compiler's context analysis features
(Marco Elver)
We removed Sparse context analysis support, because prior to
removal even a defconfig kernel produced 1,700+ context tracking
Sparse warnings, the overwhelming majority of which are false
positives. On an allmodconfig kernel the number of false positive
context tracking Sparse warnings grows to over 5,200... On the plus
side of the balance actual locking bugs found by Sparse context
analysis is also rather ... sparse: I found only 3 such commits in
the last 3 years. So the rate of false positives and the
maintenance overhead is rather high and there appears to be no
active policy in place to achieve a zero-warnings baseline to move
the annotations & fixers to developers who introduce new code.
Clang context analysis is more complete and more aggressive in
trying to find bugs, at least in principle. Plus it has a different
model to enabling it: it's enabled subsystem by subsystem, which
results in zero warnings on all relevant kernel builds (as far as
our testing managed to cover it). Which allowed us to enable it by
default, similar to other compiler warnings, with the expectation
that there are no warnings going forward. This enforces a
zero-warnings baseline on clang-22+ builds (Which are still limited
in distribution, admittedly)
Hopefully the Clang approach can lead to a more maintainable
zero-warnings status quo and policy, with more and more subsystems
and drivers enabling the feature. Context tracking can be enabled
for all kernel code via WARN_CONTEXT_ANALYSIS_ALL=y (default
disabled), but this will generate a lot of false positives.
( Having said that, Sparse support could still be added back,
if anyone is interested - the removal patch is still
relatively straightforward to revert at this stage. )
Rust integration updates: (Alice Ryhl, Fujita Tomonori, Boqun Feng)
- Add support for Atomic<i8/i16/bool> and replace most Rust native
AtomicBool usages with Atomic<bool>
- Clean up LockClassKey and improve its documentation
- Add missing Send and Sync trait implementation for SetOnce
- Make ARef Unpin as it is supposed to be
- Add __rust_helper to a few Rust helpers as a preparation for
helper LTO
- Inline various lock related functions to avoid additional function
calls
WW mutexes:
- Extend ww_mutex tests and other test-ww_mutex updates (John
Stultz)
Misc fixes and cleanups:
- rcu: Mark lockdep_assert_rcu_helper() __always_inline (Arnd
Bergmann)
- locking/local_lock: Include more missing headers (Peter Zijlstra)
- seqlock: fix scoped_seqlock_read kernel-doc (Randy Dunlap)
- rust: sync: Replace `kernel::c_str!` with C-Strings (Tamir
Duberstein)"
* tag 'locking-core-2026-02-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (90 commits)
locking/rwlock: Fix write_trylock_irqsave() with CONFIG_INLINE_WRITE_TRYLOCK
rcu: Mark lockdep_assert_rcu_helper() __always_inline
compiler-context-analysis: Remove __assume_ctx_lock from initializers
tomoyo: Use scoped init guard
crypto: Use scoped init guard
kcov: Use scoped init guard
compiler-context-analysis: Introduce scoped init guards
cleanup: Make __DEFINE_LOCK_GUARD handle commas in initializers
seqlock: fix scoped_seqlock_read kernel-doc
tools: Update context analysis macros in compiler_types.h
rust: sync: Replace `kernel::c_str!` with C-Strings
rust: sync: Inline various lock related methods
rust: helpers: Move #define __rust_helper out of atomic.c
rust: wait: Add __rust_helper to helpers
rust: time: Add __rust_helper to helpers
rust: task: Add __rust_helper to helpers
rust: sync: Add __rust_helper to helpers
rust: refcount: Add __rust_helper to helpers
rust: rcu: Add __rust_helper to helpers
rust: processor: Add __rust_helper to helpers
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks
Pull kthread updates from Frederic Weisbecker:
"The kthread code provides an infrastructure which manages the
preferred affinity of unbound kthreads (node or custom cpumask)
against housekeeping (CPU isolation) constraints and CPU hotplug
events.
One crucial missing piece is the handling of cpuset: when an isolated
partition is created, deleted, or its CPUs updated, all the unbound
kthreads in the top cpuset become indifferently affine to _all_ the
non-isolated CPUs, possibly breaking their preferred affinity along
the way.
Solve this with performing the kthreads affinity update from cpuset to
the kthreads consolidated relevant code instead so that preferred
affinities are honoured and applied against the updated cpuset
isolated partitions.
The dispatch of the new isolated cpumasks to timers, workqueues and
kthreads is performed by housekeeping, as per the nice Tejun's
suggestion.
As a welcome side effect, HK_TYPE_DOMAIN then integrates both the set
from boot defined domain isolation (through isolcpus=) and cpuset
isolated partitions. Housekeeping cpumasks are now modifiable with a
specific RCU based synchronization. A big step toward making
nohz_full= also mutable through cpuset in the future"
* tag 'kthread-for-7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks: (33 commits)
doc: Add housekeeping documentation
kthread: Document kthread_affine_preferred()
kthread: Comment on the purpose and placement of kthread_affine_node() call
kthread: Honour kthreads preferred affinity after cpuset changes
sched/arm64: Move fallback task cpumask to HK_TYPE_DOMAIN
sched: Switch the fallback task allowed cpumask to HK_TYPE_DOMAIN
kthread: Rely on HK_TYPE_DOMAIN for preferred affinity management
kthread: Include kthreadd to the managed affinity list
kthread: Include unbound kthreads in the managed affinity list
kthread: Refine naming of affinity related fields
PCI: Remove superfluous HK_TYPE_WQ check
sched/isolation: Remove HK_TYPE_TICK test from cpu_is_isolated()
cpuset: Remove cpuset_cpu_is_isolated()
timers/migration: Remove superfluous cpuset isolation test
cpuset: Propagate cpuset isolation update to timers through housekeeping
cpuset: Propagate cpuset isolation update to workqueue through housekeeping
PCI: Flush PCI probe workqueue on cpuset isolated partition change
sched/isolation: Flush vmstat workqueues on cpuset isolated partition change
sched/isolation: Flush memcg workqueues on cpuset isolated partition change
cpuset: Update HK_TYPE_DOMAIN cpumask from cpuset
...
|
|
It doesn't make sense to use nohz_full without also isolating the
related CPUs from the domain topology, either through the use of
isolcpus= or cpuset isolated partitions.
And now HK_TYPE_DOMAIN includes all kinds of domain isolated CPUs.
This means that HK_TYPE_DOMAIN should always be a subset of
HK_TYPE_KERNEL_NOISE (of which HK_TYPE_TICK is only an alias).
Therefore if a CPU is not HK_TYPE_DOMAIN, it shouldn't be
HK_TYPE_KERNEL_NOISE either. Testing the former is then enough.
Simplify cpu_is_isolated() accordingly.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Waiman Long <longman@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Marco Crivellari <marco.crivellari@suse.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Waiman Long <longman@redhat.com>
|
|
The set of cpuset isolated CPUs is now included in HK_TYPE_DOMAIN
housekeeping cpumask. There is no usecase left interested in just
checking what is isolated by cpuset and not by the isolcpus= kernel
boot parameter.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Waiman Long <longman@redhat.com>
Cc: "Michal Koutný" <mkoutny@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Marco Crivellari <marco.crivellari@suse.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Waiman Long <longman@redhat.com>
Cc: cgroups@vger.kernel.org
|
|
Until now, HK_TYPE_DOMAIN used to only include boot defined isolated
CPUs passed through isolcpus= boot option. Users interested in also
knowing the runtime defined isolated CPUs through cpuset must use
different APIs: cpuset_cpu_is_isolated(), cpu_is_isolated(), etc...
There are many drawbacks to that approach:
1) Most interested subsystems want to know about all isolated CPUs, not
just those defined on boot time.
2) cpuset_cpu_is_isolated() / cpu_is_isolated() are not synchronized with
concurrent cpuset changes.
3) Further cpuset modifications are not propagated to subsystems
Solve 1) and 2) and centralize all isolated CPUs within the
HK_TYPE_DOMAIN housekeeping cpumask.
Subsystems can rely on RCU to synchronize against concurrent changes.
The propagation mentioned in 3) will be handled in further patches.
[Chen Ridong: Fix cpu_hotplug_lock deadlock and use correct static
branch API]
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Waiman Long <longman@redhat.com>
Reviewed-by: Chen Ridong <chenridong@huawei.com>
Signed-off-by: Chen Ridong <chenridong@huawei.com>
Cc: "Michal Koutný" <mkoutny@suse.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Marco Crivellari <marco.crivellari@suse.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Waiman Long <longman@redhat.com>
Cc: cgroups@vger.kernel.org
|
|
HK_TYPE_DOMAIN will soon integrate not only boot defined isolcpus= CPUs
but also cpuset isolated partitions.
Housekeeping still needs a way to record what was initially passed
to isolcpus= in order to keep these CPUs isolated after a cpuset
isolated partition is modified or destroyed while containing some of
them.
Create a new HK_TYPE_DOMAIN_BOOT to keep track of those.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Phil Auld <pauld@redhat.com>
Reviewed-by: Waiman Long <longman@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Marco Crivellari <marco.crivellari@suse.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Waiman Long <longman@redhat.com>
|
|
Patch series "mm kernel-doc fixes".
Here are kernel-doc fixes for mm subsystem. I'm also including textsearch
fix since there's currently no maintainer for include/linux/textsearch.h
(get_maintainer.pl only shows LKML).
This patch (of 4):
Sphinx reports kernel-doc warning:
WARNING: ./include/linux/sched/mm.h:332 function parameter 'flags' not described in 'memalloc_flags_save'
Describe @flags to fix it.
Link: https://lkml.kernel.org/r/20251219014006.16328-2-bagasdotme@gmail.com
Link: https://lkml.kernel.org/r/20251219014006.16328-3-bagasdotme@gmail.com
Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com>
Fixes: 3f6d5e6a468d ("mm: introduce memalloc_flags_{save,restore}")
Acked-by: David Hildenbrand (Red Hat) <david@kernel.org>
Acked-by: Harry Yoo <harry.yoo@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Paravirt clock related functions are available in multiple archs.
In order to share the common parts, move the common static keys
to kernel/sched/ and remove them from the arch specific files.
Make a common paravirt_steal_clock() implementation available in
kernel/sched/cputime.c, guarding it with a new config option
CONFIG_HAVE_PV_STEAL_CLOCK_GEN, which can be selected by an arch
in case it wants to use that common variant.
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260105110520.21356-7-jgross@suse.com
|
|
This demonstrates a larger conversion to use Clang's context
analysis. The benefit is additional static checking of locking rules,
along with better documentation.
Notably, kernel/sched contains sufficiently complex synchronization
patterns, and application to core.c & fair.c demonstrates that the
latest Clang version has become powerful enough to start applying this
to more complex subsystems (with some modest annotations and changes).
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251219154418.3592607-37-elver@google.com
|
|
As discussed in [1], removing __cond_lock() will improve the readability
of trylock code. Now that Sparse context tracking support has been
removed, we can also remove __cond_lock().
Change existing APIs to either drop __cond_lock() completely, or make
use of the __cond_acquires() function attribute instead.
In particular, spinlock and rwlock implementations required switching
over to inline helpers rather than statement-expressions for their
trylock_* variants.
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/20250207082832.GU7145@noisy.programming.kicks-ass.net/ [1]
Link: https://patch.msgid.link/20251219154418.3592607-25-elver@google.com
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
"__vmalloc()/kvmalloc() and no-block support" (Uladzislau Rezki)
Rework the vmalloc() code to support non-blocking allocations
(GFP_ATOIC, GFP_NOWAIT)
"ksm: fix exec/fork inheritance" (xu xin)
Fix a rare case where the KSM MMF_VM_MERGE_ANY prctl state is not
inherited across fork/exec
"mm/zswap: misc cleanup of code and documentations" (SeongJae Park)
Some light maintenance work on the zswap code
"mm/page_owner: add debugfs files 'show_handles' and 'show_stacks_handles'" (Mauricio Faria de Oliveira)
Enhance the /sys/kernel/debug/page_owner debug feature by adding
unique identifiers to differentiate the various stack traces so
that userspace monitoring tools can better match stack traces over
time
"mm/page_alloc: pcp->batch cleanups" (Joshua Hahn)
Minor alterations to the page allocator's per-cpu-pages feature
"Improve UFFDIO_MOVE scalability by removing anon_vma lock" (Lokesh Gidra)
Address a scalability issue in userfaultfd's UFFDIO_MOVE operation
"kasan: cleanups for kasan_enabled() checks" (Sabyrzhan Tasbolatov)
"drivers/base/node: fold node register and unregister functions" (Donet Tom)
Clean up the NUMA node handling code a little
"mm: some optimizations for prot numa" (Kefeng Wang)
Cleanups and small optimizations to the NUMA allocation hinting
code
"mm/page_alloc: Batch callers of free_pcppages_bulk" (Joshua Hahn)
Address long lock hold times at boot on large machines. These were
causing (harmless) softlockup warnings
"optimize the logic for handling dirty file folios during reclaim" (Baolin Wang)
Remove some now-unnecessary work from page reclaim
"mm/damon: allow DAMOS auto-tuned for per-memcg per-node memory usage" (SeongJae Park)
Enhance the DAMOS auto-tuning feature
"mm/damon: fixes for address alignment issues in DAMON_LRU_SORT and DAMON_RECLAIM" (Quanmin Yan)
Fix DAMON_LRU_SORT and DAMON_RECLAIM with certain userspace
configuration
"expand mmap_prepare functionality, port more users" (Lorenzo Stoakes)
Enhance the new(ish) file_operations.mmap_prepare() method and port
additional callsites from the old ->mmap() over to ->mmap_prepare()
"Fix stale IOTLB entries for kernel address space" (Lu Baolu)
Fix a bug (and possible security issue on non-x86) in the IOMMU
code. In some situations the IOMMU could be left hanging onto a
stale kernel pagetable entry
"mm/huge_memory: cleanup __split_unmapped_folio()" (Wei Yang)
Clean up and optimize the folio splitting code
"mm, swap: misc cleanup and bugfix" (Kairui Song)
Some cleanups and a minor fix in the swap discard code
"mm/damon: misc documentation fixups" (SeongJae Park)
"mm/damon: support pin-point targets removal" (SeongJae Park)
Permit userspace to remove a specific monitoring target in the
middle of the current targets list
"mm: MISC follow-up patches for linux/pgalloc.h" (Harry Yoo)
A couple of cleanups related to mm header file inclusion
"mm/swapfile.c: select swap devices of default priority round robin" (Baoquan He)
improve the selection of swap devices for NUMA machines
"mm: Convert memory block states (MEM_*) macros to enums" (Israel Batista)
Change the memory block labels from macros to enums so they will
appear in kernel debug info
"ksm: perform a range-walk to jump over holes in break_ksm" (Pedro Demarchi Gomes)
Address an inefficiency when KSM unmerges an address range
"mm/damon/tests: fix memory bugs in kunit tests" (SeongJae Park)
Fix leaks and unhandled malloc() failures in DAMON userspace unit
tests
"some cleanups for pageout()" (Baolin Wang)
Clean up a couple of minor things in the page scanner's
writeback-for-eviction code
"mm/hugetlb: refactor sysfs/sysctl interfaces" (Hui Zhu)
Move hugetlb's sysfs/sysctl handling code into a new file
"introduce VM_MAYBE_GUARD and make it sticky" (Lorenzo Stoakes)
Make the VMA guard regions available in /proc/pid/smaps and
improves the mergeability of guarded VMAs
"mm: perform guard region install/remove under VMA lock" (Lorenzo Stoakes)
Reduce mmap lock contention for callers performing VMA guard region
operations
"vma_start_write_killable" (Matthew Wilcox)
Start work on permitting applications to be killed when they are
waiting on a read_lock on the VMA lock
"mm/damon/tests: add more tests for online parameters commit" (SeongJae Park)
Add additional userspace testing of DAMON's "commit" feature
"mm/damon: misc cleanups" (SeongJae Park)
"make VM_SOFTDIRTY a sticky VMA flag" (Lorenzo Stoakes)
Address the possible loss of a VMA's VM_SOFTDIRTY flag when that
VMA is merged with another
"mm: support device-private THP" (Balbir Singh)
Introduce support for Transparent Huge Page (THP) migration in zone
device-private memory
"Optimize folio split in memory failure" (Zi Yan)
"mm/huge_memory: Define split_type and consolidate split support checks" (Wei Yang)
Some more cleanups in the folio splitting code
"mm: remove is_swap_[pte, pmd]() + non-swap entries, introduce leaf entries" (Lorenzo Stoakes)
Clean up our handling of pagetable leaf entries by introducing the
concept of 'software leaf entries', of type softleaf_t
"reparent the THP split queue" (Muchun Song)
Reparent the THP split queue to its parent memcg. This is in
preparation for addressing the long-standing "dying memcg" problem,
wherein dead memcg's linger for too long, consuming memory
resources
"unify PMD scan results and remove redundant cleanup" (Wei Yang)
A little cleanup in the hugepage collapse code
"zram: introduce writeback bio batching" (Sergey Senozhatsky)
Improve zram writeback efficiency by introducing batched bio
writeback support
"memcg: cleanup the memcg stats interfaces" (Shakeel Butt)
Clean up our handling of the interrupt safety of some memcg stats
"make vmalloc gfp flags usage more apparent" (Vishal Moola)
Clean up vmalloc's handling of incoming GFP flags
"mm: Add soft-dirty and uffd-wp support for RISC-V" (Chunyan Zhang)
Teach soft dirty and userfaultfd write protect tracking to use
RISC-V's Svrsw60t59b extension
"mm: swap: small fixes and comment cleanups" (Youngjun Park)
Fix a small bug and clean up some of the swap code
"initial work on making VMA flags a bitmap" (Lorenzo Stoakes)
Start work on converting the vma struct's flags to a bitmap, so we
stop running out of them, especially on 32-bit
"mm/swapfile: fix and cleanup swap list iterations" (Youngjun Park)
Address a possible bug in the swap discard code and clean things
up a little
[ This merge also reverts commit ebb9aeb980e5 ("vfio/nvgrace-gpu:
register device memory for poison handling") because it looks
broken to me, I've asked for clarification - Linus ]
* tag 'mm-stable-2025-12-03-21-26' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (321 commits)
mm: fix vma_start_write_killable() signal handling
mm/swapfile: use plist_for_each_entry in __folio_throttle_swaprate
mm/swapfile: fix list iteration when next node is removed during discard
fs/proc/task_mmu.c: fix make_uffd_wp_huge_pte() huge pte handling
mm/kfence: add reboot notifier to disable KFENCE on shutdown
memcg: remove inc/dec_lruvec_kmem_state helpers
selftests/mm/uffd: initialize char variable to Null
mm: fix DEBUG_RODATA_TEST indentation in Kconfig
mm: introduce VMA flags bitmap type
tools/testing/vma: eliminate dependency on vma->__vm_flags
mm: simplify and rename mm flags function for clarity
mm: declare VMA flags by bit
zram: fix a spelling mistake
mm/page_alloc: optimize lowmem_reserve max lookup using its semantic monotonicity
mm/vmscan: skip increasing kswapd_failures when reclaim was boosted
pagemap: update BUDDY flag documentation
mm: swap: remove scan_swap_map_slots() references from comments
mm: swap: change swap_alloc_slow() to void
mm, swap: remove redundant comment for read_swap_cache_async
mm, swap: use SWP_SOLIDSTATE to determine if swap is rotational
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext
Pull sched_ext updates from Tejun Heo:
- Improve recovery from misbehaving BPF schedulers.
When a scheduler puts many tasks with varying affinity restrictions
on a shared DSQ, CPUs scanning through tasks they cannot run can
overwhelm the system, causing lockups.
Bypass mode now uses per-CPU DSQs with a load balancer to avoid this,
and hooks into the hardlockup detector to attempt recovery.
Add scx_cpu0 example scheduler to demonstrate this scenario.
- Add lockless peek operation for DSQs to reduce lock contention for
schedulers that need to query queue state during load balancing.
- Allow scx_bpf_reenqueue_local() to be called from anywhere in
preparation for deprecating cpu_acquire/release() callbacks in favor
of generic BPF hooks.
- Prepare for hierarchical scheduler support: add
scx_bpf_task_set_slice() and scx_bpf_task_set_dsq_vtime() kfuncs,
make scx_bpf_dsq_insert*() return bool, and wrap kfunc args in
structs for future aux__prog parameter.
- Implement cgroup_set_idle() callback to notify BPF schedulers when a
cgroup's idle state changes.
- Fix migration tasks being incorrectly downgraded from
stop_sched_class to rt_sched_class across sched_ext enable/disable.
Applied late as the fix is low risk and the bug subtle but needs
stable backporting.
- Various fixes and cleanups including cgroup exit ordering,
SCX_KICK_WAIT reliability, and backward compatibility improvements.
* tag 'sched_ext-for-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: (44 commits)
sched_ext: Fix incorrect sched_class settings for per-cpu migration tasks
sched_ext: tools: Removing duplicate targets during non-cross compilation
sched_ext: Use kvfree_rcu() to release per-cpu ksyncs object
sched_ext: Pass locked CPU parameter to scx_hardlockup() and add docs
sched_ext: Update comments replacing breather with aborting mechanism
sched_ext: Implement load balancer for bypass mode
sched_ext: Factor out abbreviated dispatch dequeue into dispatch_dequeue_locked()
sched_ext: Factor out scx_dsq_list_node cursor initialization into INIT_DSQ_LIST_CURSOR
sched_ext: Add scx_cpu0 example scheduler
sched_ext: Hook up hardlockup detector
sched_ext: Make handle_lockup() propagate scx_verror() result
sched_ext: Refactor lockup handlers into handle_lockup()
sched_ext: Make scx_exit() and scx_vexit() return bool
sched_ext: Exit dispatch and move operations immediately when aborting
sched_ext: Simplify breather mechanism with scx_aborting flag
sched_ext: Use per-CPU DSQs instead of per-node global DSQs in bypass mode
sched_ext: Refactor do_enqueue_task() local and global DSQ paths
sched_ext: Use shorter slice in bypass mode
sched_ext: Mark racy bitfields to prevent adding fields that can't tolerate races
sched_ext: Minor cleanups to scx_task_iter
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
"Scalability and load-balancing improvements:
- Enable scheduler feature NEXT_BUDDY (Mel Gorman)
- Reimplement NEXT_BUDDY to align with EEVDF goals (Mel Gorman)
- Skip sched_balance_running cmpxchg when balance is not due (Tim
Chen)
- Implement generic code for architecture specific sched domain NUMA
distances (Tim Chen)
- Optimize the NUMA distances of the sched-domains builds of Intel
Granite Rapids (GNR) and Clearwater Forest (CWF) platforms (Tim
Chen)
- Implement proportional newidle balance: a randomized algorithm that
runs newidle balancing proportional to its success rate. (Peter
Zijlstra)
Scheduler infrastructure changes:
- Implement the 'sched_change' scoped_guard() pattern for the entire
scheduler (Peter Zijlstra)
- More broadly utilize the sched_change guard (Peter Zijlstra)
- Add support to pick functions to take runqueue-flags (Joel
Fernandes)
- Provide and use set_need_resched_current() (Peter Zijlstra)
Fair scheduling enhancements:
- Forfeit vruntime on yield (Fernand Sieber)
- Only update stats for allowed CPUs when looking for dst group (Adam
Li)
CPU-core scheduling enhancements:
- Optimize core cookie matching check (Fernand Sieber)
Deadline scheduler fixes:
- Only set free_cpus for online runqueues (Doug Berger)
- Fix dl_server time accounting (Peter Zijlstra)
- Fix dl_server stop condition (Peter Zijlstra)
Proxy scheduling fixes:
- Yield the donor task (Fernand Sieber)
Fixes and cleanups:
- Fix do_set_cpus_allowed() locking (Peter Zijlstra)
- Fix migrate_disable_switch() locking (Peter Zijlstra)
- Remove double update_rq_clock() in __set_cpus_allowed_ptr_locked()
(Hao Jia)
- Increase sched_tick_remote timeout (Phil Auld)
- sched/deadline: Use cpumask_weight_and() in dl_bw_cpus() (Shrikanth
Hegde)
- sched/deadline: Clean up select_task_rq_dl() (Shrikanth Hegde)"
* tag 'sched-core-2025-12-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (44 commits)
sched: Provide and use set_need_resched_current()
sched/fair: Proportional newidle balance
sched/fair: Small cleanup to update_newidle_cost()
sched/fair: Small cleanup to sched_balance_newidle()
sched/fair: Revert max_newidle_lb_cost bump
sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals
sched/fair: Enable scheduler feature NEXT_BUDDY
sched: Increase sched_tick_remote timeout
sched/fair: Have SD_SERIALIZE affect newidle balancing
sched/fair: Skip sched_balance_running cmpxchg when balance is not due
sched/deadline: Minor cleanup in select_task_rq_dl()
sched/deadline: Use cpumask_weight_and() in dl_bw_cpus
sched/deadline: Document dl_server
sched/deadline: Fix dl_server stop condition
sched/deadline: Fix dl_server time accounting
sched/core: Remove double update_rq_clock() in __set_cpus_allowed_ptr_locked()
sched/eevdf: Fix min_vruntime vs avg_vruntime
sched/core: Add comment explaining force-idle vruntime snapshots
sched/core: Optimize core cookie matching check
sched/proxy: Yield the donor task
...
|
|
Add a randomized algorithm that runs newidle balancing proportional to
its success rate.
This improves schbench significantly:
6.18-rc4: 2.22 Mrps/s
6.18-rc4+revert: 2.04 Mrps/s
6.18-rc4+revert+random: 2.18 Mrps/S
Conversely, per Adam Li this affects SpecJBB slightly, reducing it by 1%:
6.17: -6%
6.17+revert: 0%
6.17+revert+random: -1%
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Chris Mason <clm@meta.com>
Link: https://lkml.kernel.org/r/6825c50d-7fa7-45d8-9b81-c6e7e25738e2@meta.com
Link: https://patch.msgid.link/20251107161739.770122091@infradead.org
|
|
mm_get_unmapped_area() is a wrapper around arch_get_unmapped_area() /
arch_get_unmapped_area_topdown(), both of which search current->mm for
some free space. Neither take an mm_struct - they implicitly operate on
current->mm.
But the wrapper takes an mm_struct and uses it to decide whether to search
bottom up or top down. All callers pass in current->mm for this, so
everything is working consistently. But it feels like an accident waiting
to happen; eventually someone will call that function with a different mm,
expecting to find free space in it, but what gets returned is free space
in the current mm.
So let's simplify by removing the parameter and have the wrapper use
current->mm to decide which end to start at. Now everything is consistent
and self-documenting.
Link: https://lkml.kernel.org/r/20251003155306.2147572-1-ryan.roberts@arm.com
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Dev Jain <dev.jain@arm.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
might_alloc() catches invalid blocking allocations in contexts where
sleeping is not allowed.
However when PF_MEMALLOC is set, the page allocator already skips reclaim
and other blocking paths. In such cases, a blocking gfp_mask does not
actually lead to blocking, so triggering might_alloc() splats is
misleading.
Adjust might_alloc() to skip warnings when the current task has
PF_MEMALLOC set, matching the allocator's actual blocking behaviour.
Link: https://lkml.kernel.org/r/20251007122035.56347-9-urezki@gmail.com
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Reviewed-by: Baoquan He <bhe@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Marco Elver <elver@google.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
With the buddy lockup detector, smp_processor_id() returns the detecting CPU,
not the locked CPU, making scx_hardlockup()'s printouts confusing. Pass the
locked CPU number from watchdog_hardlockup_check() as a parameter instead.
Also add kerneldoc comments to handle_lockup(), scx_hardlockup(), and
scx_rcu_cpu_stall() documenting their return value semantics.
Suggested-by: Doug Anderson <dianders@chromium.org>
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Acked-by: Andrea Righi <arighi@nvidia.com>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
INIT_DSQ_LIST_CURSOR
Factor out scx_dsq_list_node cursor initialization into INIT_DSQ_LIST_CURSOR
macro in preparation for additional users.
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Cc: Dan Schatzberg <schatzberg.dan@gmail.com>
Acked-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
A poorly behaving BPF scheduler can trigger hard lockup. For example, on a
large system with many tasks pinned to different subsets of CPUs, if the BPF
scheduler puts all tasks in a single DSQ and lets all CPUs at it, the DSQ lock
can be contended to the point where hardlockup triggers. Unfortunately,
hardlockup can be the first signal out of such situations, thus requiring
hardlockup handling.
Hook scx_hardlockup() into the hardlockup detector to try kicking out the
current scheduler in an attempt to recover the system to a good state. The
handling strategy can delay watchdog taking its own action by one polling
period; however, given that the only remediation for hardlockup is crash, this
is likely an acceptable trade-off.
v2: Add missing dummy scx_hardlockup() definition for
!CONFIG_SCHED_CLASS_EXT (kernel test bot).
Reported-by: Dan Schatzberg <schatzberg.dan@gmail.com>
Cc: Emil Tsalapatis <etsal@meta.com>
Cc: Douglas Anderson <dianders@chromium.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
Bypass mode routes tasks through fallback dispatch queues. Originally a single
global DSQ, b7b3b2dbae73 ("sched_ext: Split the global DSQ per NUMA node")
changed this to per-node DSQs to resolve NUMA-related livelocks.
Dan Schatzberg found per-node DSQs can still livelock when many threads are
pinned to different small CPU subsets: each CPU must scan many incompatible
tasks to find runnable ones, causing severe contention with high CPU counts.
Switch to per-CPU bypass DSQs. Each task queues on its current CPU. Default
idle CPU selection and direct dispatch handle most cases well.
This introduces a failure mode when tasks concentrate on one CPU in
over-saturated systems. If the BPF scheduler severely skews placement before
triggering bypass, that CPU's queue may be too long to drain, causing RCU
stalls. A load balancer in a future patch will address this. The bypass DSQ is
separate from local DSQ to enable load balancing: local DSQs use rq locks,
preventing efficient scanning and transfer across CPUs, especially problematic
when systems are already contended.
v2: Clarified why bypass DSQ is separate from local DSQ (Andrea Righi).
Reported-by: Dan Schatzberg <schatzberg.dan@gmail.com>
Reviewed-by: Dan Schatzberg <schatzberg.dan@gmail.com>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
There have been reported cases of bypass mode not making forward progress fast
enough. The 20ms default slice is unnecessarily long for bypass mode where the
primary goal is ensuring all tasks can make forward progress.
Introduce SCX_SLICE_BYPASS set to 5ms and make the scheduler automatically
switch to it when entering bypass mode. Also make the bypass slice value
tunable through the slice_bypass_us module parameter (adjustable between 100us
and 100ms) to make it easier to test whether slice durations are a factor in
problem cases.
v3: Use READ_ONCE/WRITE_ONCE for scx_slice_dfl access (Dan).
v2: Removed slice_dfl_us module parameter. Fixed typos (Andrea).
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Cc: Dan Schatzberg <schatzberg.dan@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
We don't actually modify it.
Link: https://patch.msgid.link/20251103-work-creds-guards-prepare_creds-v1-7-b447b82f2c9b@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
finish_task_switch()
sched_ext_free() was called from __put_task_struct() when the last reference
to the task is dropped, which could be long after the task has finished
running. This causes cgroup-related problems:
- ops.init_task() can be called on a cgroup which didn't get ops.cgroup_init()'d
during scheduler load, because the cgroup might be destroyed/unlinked
while the zombie or dead task is still lingering on the scx_tasks list.
- ops.cgroup_exit() could be called before ops.exit_task() is called on all
member tasks, leading to incorrect exit ordering.
Fix by moving it to finish_task_switch() to be called right after the final
context switch away from the dying task, matching when sched_class->task_dead()
is called. Rename it to sched_ext_dead() to match the new calling context.
By calling sched_ext_dead() before cgroup_task_dead(), we ensure that:
- Tasks visible on scx_tasks list have valid cgroups during scheduler load,
as cgroup_mutex prevents cgroup destruction while the task is still linked.
- All member tasks have ops.exit_task() called and are removed from scx_tasks
before the cgroup can be destroyed and trigger ops.cgroup_exit().
This fix is made possible by the cgroup_task_dead() split in the previous patch.
This also makes more sense resource-wise as there's no point in keeping
scheduler side resources around for dead tasks.
Reported-by: Dan Schatzberg <dschatzberg@meta.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|