linux-toradex.git/include/linux/rseq_entry.h, branch v7.0-rc5

rseq: Mark rseq_arm_slice_extension_timer() __always_inline

2026-02-23T10:19:19+00:00

objtool warns about this function being called inside of a uaccess
section:

kernel/entry/common.o: warning: objtool: irqentry_exit+0x1dc: call to rseq_arm_slice_extension_timer() with UACCESS enabled

Interestingly, this happens with CONFIG_RSEQ_SLICE_EXTENSION disabled,
so this is an empty function, as the normal implementation is
already marked __always_inline.

I could reproduce this multiple times with gcc-11 but not with gcc-15,
so the compiler probably got better at identifying the trivial function.

Mark all the empty helpers for !RSEQ_SLICE_EXTENSION as __always_inline
for consistency, avoiding this warning.

Fixes: 0ac3b5c3dc45 ("rseq: Implement time slice extension enforcement timer")
Signed-off-by: Arnd Bergmann 
Signed-off-by: Peter Zijlstra (Intel) 
Link: https://patch.msgid.link/20260206074122.709580-1-arnd@kernel.org

rseq: Implement rseq_grant_slice_extension()

2026-01-22T10:11:18+00:00

Provide the actual decision function, which decides whether a time slice
extension is granted in the exit to user mode path when NEED_RESCHED is
evaluated.

The decision is made in two stages. First an inline quick check to avoid
going into the actual decision function. This checks whether:

 #1 the functionality is enabled

 #2 the exit is a return from interrupt to user mode

 #3 any TIF bit, which causes extra work is set. That includes TIF_RSEQ,
    which means the task was already scheduled out.

The slow path, which implements the actual user space ABI, is invoked
when:

  A) #1 is true, #2 is true and #3 is false

     It checks whether user space requested a slice extension by setting
     the request bit in the rseq slice_ctrl field. If so, it grants the
     extension and stores the slice expiry time, so that the actual exit
     code can double check whether the slice is already exhausted before
     going back.

  B) #1 - #3 are true _and_ a slice extension was granted in a previous
     loop iteration

     In this case the grant is revoked.

In case that the user space access faults or invalid state is detected, the
task is terminated with SIGSEGV.

Signed-off-by: Thomas Gleixner 
Signed-off-by: Peter Zijlstra (Intel) 
Link: https://patch.msgid.link/20251215155709.195303303@linutronix.de

rseq: Reset slice extension when scheduled

2026-01-22T10:11:18+00:00

When a time slice extension was granted in the need_resched() check on exit
to user space, the task can still be scheduled out in one of the other
pending work items. When it gets scheduled back in, and need_resched() is
not set, then the stale grant would be preserved, which is just wrong.

RSEQ already keeps track of that and sets TIF_RSEQ, which invokes the
critical section and ID update mechanisms.

Utilize them and clear the user space slice control member of struct rseq
unconditionally within the existing user access sections. That's just an
unconditional store more in that path.

Signed-off-by: Thomas Gleixner 
Signed-off-by: Peter Zijlstra (Intel) 
Reviewed-by: Mathieu Desnoyers 
Link: https://patch.msgid.link/20251215155709.131081527@linutronix.de

rseq: Implement time slice extension enforcement timer

2026-01-22T10:11:18+00:00

If a time slice extension is granted and the reschedule delayed, the kernel
has to ensure that user space cannot abuse the extension and exceed the
maximum granted time.

It was suggested to implement this via the existing hrtick() timer in the
scheduler, but that turned out to be problematic for several reasons:

   1) It creates a dependency on CONFIG_SCHED_HRTICK, which can be disabled
      independently of CONFIG_HIGHRES_TIMERS

   2) HRTICK usage in the scheduler can be runtime disabled or is only used
      for certain aspects of scheduling.

   3) The function is calling into the scheduler code and that might have
      unexpected consequences when this is invoked due to a time slice
      enforcement expiry. Especially when the task managed to clear the
      grant via sched_yield(0).

It would be possible to address #2 and #3 by storing state in the
scheduler, but that is extra complexity and fragility for no value.

Implement a dedicated per CPU hrtimer instead, which is solely used for the
purpose of time slice enforcement.

The timer is armed when an extension was granted right before actually
returning to user mode in rseq_exit_to_user_mode_restart().

It is disarmed, when the task relinquishes the CPU. This is expensive as
the timer is probably the first expiring timer on the CPU, which means it
has to reprogram the hardware. But that's less expensive than going through
a full hrtimer interrupt cycle for nothing.

Signed-off-by: Thomas Gleixner 
Signed-off-by: Peter Zijlstra (Intel) 
Reviewed-by: Mathieu Desnoyers 
Link: https://patch.msgid.link/20251215155709.068329497@linutronix.de

rseq: Add statistics for time slice extensions

2026-01-22T10:11:17+00:00

Extend the quick statistics with time slice specific fields.

Signed-off-by: Thomas Gleixner 
Signed-off-by: Peter Zijlstra (Intel) 
Link: https://patch.msgid.link/20251215155708.795202254@linutronix.de

rseq: Provide static branch for time slice extensions

2026-01-22T10:11:16+00:00

Guard the time slice extension functionality with a static key, which can
be disabled on the kernel command line.

Signed-off-by: Thomas Gleixner 
Signed-off-by: Peter Zijlstra (Intel) 
Link: https://patch.msgid.link/20251215155708.733429292@linutronix.de

rseq: Always inline rseq_debug_syscall_return()

2025-12-12T09:26:26+00:00

To get the full benefit of:

  eaa9088d568c ("rseq: Use static branch for syscall exit debug when GENERIC_IRQ_ENTRY=y")

clang needs an __always_inline instead of a plain inline qualifier:

	$ for i in {1..10}; do taskset -c 4 perf5 bench syscall basic -l 100000000 | grep "ops/sec"; done

		 Before	     After
	ops/sec  15424491    15872221   +2.9%

Signed-off-by: Eric Dumazet 
Signed-off-by: Thomas Gleixner 
Signed-off-by: Ingo Molnar 
Reviewed-by: Mathieu Desnoyers 
Link: https://patch.msgid.link/20251205100753.4073221-1-edumazet@google.com

rseq: Switch to TIF_RSEQ if supported

2025-11-04T07:35:37+00:00

TIF_NOTIFY_RESUME is a multiplexing TIF bit, which is suboptimal especially
with the RSEQ fast path depending on it, but not really handling it.

Define a separate TIF_RSEQ in the generic TIF space and enable the full
separation of fast and slow path for architectures which utilize that.

That avoids the hassle with invocations of resume_user_mode_work() from
hypervisors, which clear TIF_NOTIFY_RESUME. It makes the therefore required
re-evaluation at the end of vcpu_run() a NOOP on architectures which
utilize the generic TIF space and have a separate TIF_RSEQ.

The hypervisor TIF handling does not include the separate TIF_RSEQ as there
is no point in doing so. The guest does neither know nor care about the VMM
host applications RSEQ state. That state is only relevant when the ioctl()
returns to user space.

The fastpath implementation still utilizes TIF_NOTIFY_RESUME for failure
handling, but this only happens within exit_to_user_mode_loop(), so
arguably the hypervisor ioctl() code is long done when this happens.

Signed-off-by: Thomas Gleixner 
Signed-off-by: Peter Zijlstra (Intel) 
Signed-off-by: Ingo Molnar 
Reviewed-by: Mathieu Desnoyers 
Link: https://patch.msgid.link/20251027084307.903622031@linutronix.de

rseq: Split up rseq_exit_to_user_mode()

2025-11-04T07:35:30+00:00

Separate the interrupt and syscall exit handling. Syscall exit does not
require to clear the user_irq bit as it can't be set. On interrupt exit it
can be set when the interrupt did not result in a scheduling event and
therefore the return path did not invoke the TIF work handling, which would
have cleared it.

The debug check for the event state is also not really required even when
debug mode is enabled via the static key. Debug mode is largely aiding user
space by enabling a larger amount of validation checks, which cause a
segfault when a malformed critical section is detected. In production mode
the critical section handling takes the content mostly as is and lets user
space keep the pieces when it screwed up.

On kernel changes in that area the state check is useful, but that can be
done when lockdep is enabled, which is anyway a required test scenario for
fundamental changes.

Signed-off-by: Thomas Gleixner 
Signed-off-by: Peter Zijlstra (Intel) 
Signed-off-by: Ingo Molnar 
Reviewed-by: Mathieu Desnoyers 
Link: https://patch.msgid.link/20251027084307.842785700@linutronix.de

rseq: Implement fast path for exit to user

2025-11-04T07:34:18+00:00

Implement the actual logic for handling RSEQ updates in a fast path after
handling the TIF work and at the point where the task is actually returning
to user space.

This is the right point to do that because at this point the CPU and the MM
CID are stable and cannot longer change due to yet another reschedule.
That happens when the task is handling it via TIF_NOTIFY_RESUME in
resume_user_mode_work(), which is invoked from the exit to user mode work
loop.

The function is invoked after the TIF work is handled and runs with
interrupts disabled, which means it cannot resolve page faults. It
therefore disables page faults and in case the access to the user space
memory faults, it:

  - notes the fail in the event struct
  - raises TIF_NOTIFY_RESUME
  - returns false to the caller

The caller has to go back to the TIF work, which runs with interrupts
enabled and therefore can resolve the page faults. This happens mostly on
fork() when the memory is marked COW.

If the user memory inspection finds invalid data, the function returns
false as well and sets the fatal flag in the event struct along with
TIF_NOTIFY_RESUME. The slow path notify handler has to evaluate that flag
and terminate the task with SIGSEGV as documented.

The initial decision to invoke any of this is based on one flags in the
event struct: @sched_switch. The decision is in pseudo ASM:

      load	tsk::event::sched_switch
      jnz	inspect_user_space
      mov	$0, tsk::event::events
      ...
      leave

So for the common case where the task was not scheduled out, this really
boils down to three instructions before going out if the compiler is not
completely stupid (and yes, some of them are).

If the condition is true, then it checks, whether CPU ID or MM CID have
changed. If so, then the CPU/MM IDs have to be updated and are thereby
cached for the next round. The update unconditionally retrieves the user
space critical section address to spare another user*begin/end() pair.  If
that's not zero and tsk::event::user_irq is set, then the critical section
is analyzed and acted upon. If either zero or the entry came via syscall
the critical section analysis is skipped.

If the comparison is false then the critical section has to be analyzed
because the event flag is then only true when entry from user was by
interrupt.

This is provided without the actual hookup to let reviewers focus on the
implementation details. The hookup happens in the next step.

Note: As with quite some other optimizations this depends on the generic
entry infrastructure and is not enabled to be sucked into random
architecture implementations.

Signed-off-by: Thomas Gleixner 
Signed-off-by: Peter Zijlstra (Intel) 
Signed-off-by: Ingo Molnar 
Reviewed-by: Mathieu Desnoyers 
Link: https://patch.msgid.link/20251027084307.638929615@linutronix.de