linux-toradex.git/kernel/locking, branch master

locking/rt: Fix the incorrect RCU protection in rt_spin_unlock()

2026-06-21T09:51:13+00:00

rt_spin_unlock() releases the RCU protection before unlocking the
lock. That opens the door for the following UAF scenario:

 T1					T2
 spin_lock(&p->lock);		rcu_read_lock();
 invalidate(p);			p = rcu_dereference(ptr);
 rcu_assign_pointer(ptr, NULL);	if (!p) return;
 spin_unlock(&p->lock);		spin_lock(&p->lock)
 				   lock(&lock->lock);
				   rcu_read_lock();
 kfree_rcu(p);			rcu_read_unlock();
				....
				spin_unlock(&p->lock)
				  rcu_read_unlock(); // Ends grace period
 rcu_do_batch()
   kfree(p);
			    UAF ->	  rt_mutex_cmpxchg_release(&lock->lock...)

Regular spinlocks keep preemption disabled accross the unlock operation,
which provides full RCU protection, but the RT substitution fails to
resemble that. Same applies for the rwlock substitution.

Move the rcu_read_unlock() invocation past the unlock operations to match
the non-RT semantics. This makes it asymmetric vs. rt_xxx_lock(), but
that's harmless as the caller needs to hold RCU read lock across the lock
operation. The migrate_enable() call stays before the unlock operation
because there is no per CPU operation in the unlock path which would
require migration to be kept disabled.

Fixes: 0f383b6dc96e ("locking/spinlock: Provide RT variant")
Reported-by: syzbot+000c800a02097aaa10ed@syzkaller.appspotmail.com
Decoded-by: Jann Horn 
Signed-off-by: Thomas Gleixner 
Reviewed-by: Sebastian Andrzej Siewior 
Acked-by: Al Viro 
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/87jyrud75z.ffs@fw13

Merge tag 'sched-core-2026-06-14' of gitolite.kernel.org:pub/scm/linux/kernel/git/tip/tip

2026-06-15T09:20:18+00:00

Pull scheduler updates from Ingo Molnar:
 "SMP load-balancing updates:

   - A large series to introduce infrastructure for cache-aware load
     balancing, with the goal of co-locating tasks that share data
     within the same Last Level Cache (LLC) domain. By improving cache
     locality, the scheduler can reduce cache bouncing and cache misses,
     ultimately improving data access efficiency.

     Implemented by Chen Yu and Tim Chen, based on early prototype work
     by Peter Zijlstra, with fixes by Jianyong Wu, Peter Zijlstra and
     Shrikanth Hegde.

   - A series to simplify CONFIG_SCHED_SMT ifdef usage (Shrikanth Hegde)

  Fair scheduler updates:

   - A series to improve SD_ASYM_CPUCAPACITY scheduling by introducing
     SMT awareness (Andrea Righi, K Prateek Nayak)

   - A series to optimize cfs_rq and sched_entity allocation for better
     data locality (Zecheng Li)

   - A preparatory series to change fair/cgroup scheduling to a single
     runqueue, without the final change (Peter Zijlstra)

   - Auto-manage ext/fair dl_server bandwidth (Andrea Righi)

   - Fix cpu_util runnable_avg arithmetic (Hongyan Xia)

   - Optimize update_tg_load_avg()'s rate-limiting code (Rik van Riel)

   - Allow account_cfs_rq_runtime() to throttle current hierarchy
     (K Prateek Nayak)

   - Update util_est after updating util_avg during dequeue, to fix the
     util signal update logic, which reduces signal noise (Vincent
     Guittot)

  Scheduler topology updates:

   - Allow multiple domains to claim sched_domain_shared (K Prateek
     Nayak)

   - Add parameter to split LLC (Peter Zijlstra)

  Core scheduler updates:

   - Use trace_call__() to save a static branch (Gabriele Monaco)

  Scheduler statistics updates:

   - Drop now-stale mul_u64_u64_div_u64() cputime over-approximation
     guard (Nicolas Pitre)

  Deadline scheduler updates:

   - Reject debugfs dl_server writes for offline CPUs (Andrea Righi)

   - Fix replenishment logic for non-deferred servers (Yuri Andriaccio)

  RT scheduling updates:

   - Turn RT_PUSH_IPI default off for non PREEMPT_RT (Steven Rostedt)

   - Update default bandwidth for real-time tasks to 1.0 (Yuri
     Andriaccio)

  Proxy scheduling updates:

   - A series to implement Optimized Donor Migration for Proxy Execution
     (John Stultz, Peter Zijlstra)

   - Various proxy scheduling cleanups and fixes (Peter Zijlstra,
     K Prateek Nayak)

  Misc fixes, improvements and cleanups by Aaron Lu, Andrea Righi,
  Zenghui Yu, Chen Yu, Guanyou.Chen, John Stultz, Shrikanth Hegde,
  Peter Zijlstra, Liang Luo and Yiyang Chen"

* tag 'sched-core-2026-06-14' of gitolite.kernel.org:pub/scm/linux/kernel/git/tip/tip: (91 commits)
  sched/fair: Fix newidle vs core-sched
  sched/deadline: Use task_on_rq_migrating() helper
  sched/core: Combine separate 'else' and 'if' statements
  sched/fair: Fix cpu_util runnable_avg arithmetic
  sched/fair: Unify cfs_rq throttling via account_cfs_rq_runtime()
  sched/fair: Move the throttled tasks to a local list in tg_unthrottle_up()
  sched/fair: Call update_curr() before unthrottling the hierarchy
  sched/fair: Use throttled_csd_list for local unthrottle
  sched/fair: Convert cfs bandwidth throttling to use guards
  sched/fair: Allocate cfs_tg_state with percpu allocator
  sched/fair: Remove task_group->se pointer array
  sched/fair: Co-locate cfs_rq and sched_entity in cfs_tg_state
  sched: restore timer_slack_ns when resetting RT policy on fork
  MAINTAINERS: Fix spelling mistake in Peter's name
  sched: Simplify ttwu_runnable()
  sched/proxy: Remove superfluous clear_task_blocked_in()
  sched/proxy: Remove PROXY_WAKING
  sched/proxy: Switch proxy to use p->is_blocked
  sched/proxy: Only return migrate when needed
  sched: Be more strict about p->is_blocked
  ...

Merge tag 'locking-core-2026-06-14' of gitolite.kernel.org:pub/scm/linux/kernel/git/tip/tip

2026-06-15T08:51:14+00:00

Pull locking updates from Ingo Molnar:
 "Futex updates:

   - Optimize futex hash bucket access patterns (Peter Zijlstra)

   - Large series to address the robust futex unlock race for real, by
     Thomas Gleixner:

      "The robust futex unlock mechanism is racy in respect to the
       clearing of the robust_list_head::list_op_pending pointer because
       unlock and clearing the pointer are not atomic.

       The race window is between the unlock and clearing the pending op
       pointer. If the task is forced to exit in this window, exit will
       access a potentially invalid pending op pointer when cleaning up
       the robust list.

       That happens if another task manages to unmap the object
       containing the lock before the cleanup, which results in an UAF.

       In the worst case this UAF can lead to memory corruption when
       unrelated content has been mapped to the same address by the time
       the access happens.

       User space can't solve this problem without help from the kernel.
       This series provides the kernel side infrastructure to help it
       along:

        1) Combined unlock, pointer clearing, wake-up for the
           contended case

        2) VDSO based unlock and pointer clearing helpers with a
           fix-up function in the kernel when user space was interrupted
           within the critical section.

      ... with help by André Almeida:

        - Add a note about robust list race condition (André Almeida)
        - Add self-tests for robust release operations (André Almeida)

  Context analysis updates:

   - Implement context analysis for 'struct rt_mutex'. (Bart Van Assche)
   - Bump required Clang version to 23 (Marco Elver)

  Guard infrastructure updates:

   - Series to remove NULL check from unconditional guards (Dmitry
     Ilvokhin)

  Lockdep updates:

   - Restore self-test migrate_disable() and sched_rt_mutex state on
     PREEMPT_RT (Karl Mehltretter)

  Membarriers updates:

   - Use per-CPU mutexes for targeted commands (Aniket Gattani)
   - Modernize membarrier_global_expedited with cleanup guards (Aniket
     Gattani)
   - Add rseq stress test for CFS throttle interactions (Aniket Gattani)

  percpu-rwsems updates:

   - Extract __percpu_up_read() to optimize inlining overhead (Dmitry
     Ilvokhin)

  Seqlocks updates:

   - Allow UBSAN_ALIGNMENT to fail optimizing (Heiko Carstens)

  Lock tracing:

   - Add contended_release tracepoint to sleepable locks such as
     mutexes, percpu-rwsems, rtmutexes, rwsems and semaphores (Dmitry
     Ilvokhin)

  MAINTAINERS updates:

   - MAINTAINERS: Add RUST [SYNC] entry (Boqun Feng)

  Misc updates and fixes by Randy Dunlap, YE WEI-HONG, Fabricio Parra,
  Dmitry Ilvokhin and Peter Zijlstra"

* tag 'locking-core-2026-06-14' of gitolite.kernel.org:pub/scm/linux/kernel/git/tip/tip: (36 commits)
  locking: Add contended_release tracepoint to sleepable locks
  locking/percpu-rwsem: Extract __percpu_up_read()
  tracing/lock: Remove unnecessary linux/sched.h include
  futex: Optimize futex hash bucket access patterns
  rust: sync: completion: Mark inline complete_all and wait_for_completion
  MAINTAINERS: Add RUST [SYNC] entry
  cleanup: Specify nonnull argument index
  selftests: futex: Add tests for robust release operations
  Documentation: futex: Add a note about robust list race condition
  x86/vdso: Implement __vdso_futex_robust_try_unlock()
  x86/vdso: Prepare for robust futex unlock support
  futex: Provide infrastructure to plug the non contended robust futex unlock race
  futex: Add robust futex unlock IP range
  futex: Add support for unlocking robust futexes
  futex: Cleanup UAPI defines
  x86: Select ARCH_MEMORY_ORDER_TSO
  uaccess: Provide unsafe_atomic_store_release_user()
  futex: Provide UABI defines for robust list entry modifiers
  futex: Move futex related mm_struct data into a struct
  futex: Make futex_mm_init() void
  ...

locking: Add contended_release tracepoint to sleepable locks

2026-06-11T11:41:25+00:00

Add the contended_release trace event. This tracepoint fires on the
holder side when a contended lock is released, complementing the
existing contention_begin/contention_end tracepoints which fire on the
waiter side.

This enables correlating lock hold time under contention with waiter
events by lock address.

Add trace_contended_release()/trace_call__contended_release() calls to
the slowpath unlock paths of sleepable locks: mutex, rtmutex, semaphore,
rwsem, percpu-rwsem, and RT-specific rwbase locks.

Where possible, trace_contended_release() fires before the lock is
released and before the waiter is woken. For some lock types, the
tracepoint fires after the release but before the wake. Making the
placement consistent across all lock types is not worth the added
complexity.

For reader/writer locks, the tracepoint fires for every reader releasing
while a writer is waiting, not only for the last reader.

Signed-off-by: Dmitry Ilvokhin 
Signed-off-by: Peter Zijlstra (Intel) 
Acked-by: Paul E. McKenney 
Acked-by: Usama Arif 
Link: https://patch.msgid.link/02f4f6c5ce6761e7f6587cf0ff2289d962ecddd4.1780506267.git.d@ilvokhin.com

locking/percpu-rwsem: Extract __percpu_up_read()

2026-06-11T11:41:25+00:00

Move the percpu_up_read() slowpath out of the inline function into a new
__percpu_up_read() to avoid binary size increase from adding a
tracepoint to an inlined function.

Signed-off-by: Dmitry Ilvokhin 
Signed-off-by: Peter Zijlstra (Intel) 
Acked-by: Usama Arif 
Link: https://patch.msgid.link/3dd2a1b9ab4f469e1892766cb63f41d6b0f53d29.1780506267.git.d@ilvokhin.com

locking/rtmutex: Skip remove_waiter() when waiter is not enqueued

2026-06-03T20:11:53+00:00

syzbot triggered the following splat in remove_waiter() via
FUTEX_CMP_REQUEUE_PI:

  KASAN: null-ptr-deref in range [0x0000000000000a88-0x0000000000000a8f]
   class_raw_spinlock_constructor
   remove_waiter+0x159/0x1200 kernel/locking/rtmutex.c:1561
   rt_mutex_start_proxy_lock+0x103/0x120
   futex_requeue+0x10e4/0x20d0
   __x64_sys_futex+0x34f/0x4d0

task_blocks_on_rt_mutex() does not arm the waiter upon deadlock detection,
leaving waiter->task nil, where 3bfdc63936dd ("rtmutex: Use waiter::task instead
of current in remove_waiter()") made this fatal.

Furthermore, rt_mutex_start_proxy_lock() should not be calling into remove_waiter()
upon a successfully grabbing the rtmutex. 1a1fb985f2e2 ("futex: Handle early deadlock
return correctly"), moved the remove_waiter() out of __rt_mutex_start_proxy_lock()
(where 'ret' was only ever 0 or < 0) into the wrapper. Tighten this check to
account for try_to_take_rt_mutex().

Fixes: 3bfdc63936dd ("rtmutex: Use waiter::task instead of current in remove_waiter()")
Reported-by: syzbot+78147abe6c524f183ee9@syzkaller.appspotmail.com
Signed-off-by: Davidlohr Bueso 
Signed-off-by: Thomas Gleixner 
Cc: stable@vger.kernel.org
Closes: https://lore.kernel.org/all/69f114ac.050a0220.ac8b.0003.GAE@google.com/
Link: https://patch.msgid.link/20260507112913.1019537-1-dave@stgolabs.net

sched/proxy: Remove PROXY_WAKING

2026-06-02T10:26:09+00:00

Now that the proxy path uses ->is_blocked, use the '->is_blocked &&
!->blocked_on' state instead of PROXY_WAKING. Notably, this is where a
blocked_on relation is broken but the donor task might still need a return
migration.

Signed-off-by: K Prateek Nayak 
Signed-off-by: Peter Zijlstra (Intel) 
Link: https://patch.msgid.link/20260526113322.596522894%40infradead.org

sched: Add blocked_donor link to task for smarter mutex handoffs

2026-06-02T10:26:07+00:00

Add link to the task this task is proxying for, and use it so
the mutex owner can do an intelligent hand-off of the mutex to
the task that the owner is running on behalf.

[jstultz: This patch was split out from larger proxy patch]
Signed-off-by: Peter Zijlstra (Intel) 
Signed-off-by: Juri Lelli 
Signed-off-by: Valentin Schneider 
Signed-off-by: Connor O'Brien 
Signed-off-by: John Stultz 
Signed-off-by: Peter Zijlstra (Intel) 
Link: https://patch.msgid.link/20260512025635.2840817-8-jstultz@google.com

locking: mutex: Fix proxy-exec potentially deactivating tasks marked TASK_RUNNING

2026-06-02T10:26:05+00:00

Vineeth found came up with a test driver that could trip up
workqueue stalls. After fixing one issue this test found,
Vineeth reported the test was still failing.

Greatly simplified, a task that tries to take a mutex already
owned by another task that is sleeping, can hit a edge case in
the mutex_lock_common() case.

If the task fails to get the lock, calls into schedule, but gets
a spurious wakeup, it will find that it is first waiter, and
go into the mutex_optimistic_spin() logic. Though before calling
mutex_optimistic_spin(), we clear task blocked_on state, since
mutex_optimistic_spin() may call schedule() if need_resched() is
set.

After mutex_optimistic_spin() fails, we set blocked_on again,
restart the main mutex loop, try to take the lock and call into
schedule_preempt_disabled().

From there, with proxy-execution, we'll see the task is
blocked_on, follow the chain, see the owner is sleeping and
dequeue the waiting task from the runqueue.

This all sounds fine and reasonable.  But what I had missed is
that in mutex_optimistic_spin(), not only do we call schedule()
but we set TASK_RUNNABLE right before doing so.

This is ok for that invocation of schedule(). But when we come
back we re-set the blocked_on we had just cleared, but we do not
re-set the task state to TASK_INTERRUPTIBLE/UNINTERRUPTIBLE.

This means we have a task that is blocked_on & TASK_RUNNABLE,
so when the proxy execution code dequeues the task, we are
in trouble since future wakeups will be shortcut by the
ttwu_state_match() check.

Thus, to avoid this, after mutex_optimistic_spin(), set the task
state back when we set blocked_on.

Many many thanks again to Vineeth for his very useful testing
driver that uncovered this long hidden bug, that I hadn't
tripped in all my testing! Very impressed with the problems he's
uncovered!

Reported-by: Vineeth Pillai 
Signed-off-by: John Stultz 
Signed-off-by: Peter Zijlstra (Intel) 
Tested-by: Vineeth Pillai 
Link: https://patch.msgid.link/20260430215103.2978955-3-jstultz@google.com

locking/rtmutex: Annotate API and implementation

2026-05-19T11:49:01+00:00

Enable context analysis for struct rt_mutex and annotate all functions that
accept a struct rt_mutex pointer. In the __rt_mutex_lock_common() callers,
instead of adding the __no_context_analysis annotation, emit a runtime
warning if the __rt_mutex_lock_common() return value is not zero and add an
__acquire() statement.

Signed-off-by: Bart Van Assche 
Signed-off-by: Peter Zijlstra (Intel) 
Link: https://patch.msgid.link/20260508174520.1416285-1-bvanassche@acm.org