| author | Thomas Gleixner <tglx@linutronix.de> | 2025-12-15 17:52:04 +0100 |
|---|---|---|
| committer | Peter Zijlstra <peterz@infradead.org> | 2026-01-22 11:11:16 +0100 |
| commit | d7a5da7a0f7fa7ff081140c4f6f971db98882703 (patch) | |
| tree | 5470dbdccb47bf19beda69bdcfd8d078d87f17c0 /Documentation/userspace-api | |
| parent | 4fe82cf3024a4bdd2571d584efc25598533d5c96 (diff) | |
rseq: Add fields and constants for time slice extension
Aside of a Kconfig knob add the following items:
- Two flag bits for the rseq user space ABI, which allow user space to
query the availability and enablement without a syscall.
- A new member to the user space ABI struct rseq, which is going to be
used to communicate request and grant between kernel and user space.
- A rseq state struct to hold the kernel state of this
- Documentation of the new mechanism
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251215155708.669472597@linutronix.de
Diffstat (limited to 'Documentation/userspace-api')
| -rw-r--r-- | Documentation/userspace-api/index.rst | 1 |
| -rw-r--r-- | Documentation/userspace-api/rseq.rst | 135 |
2 files changed, 136 insertions, 0 deletions
diff --git a/Documentation/userspace-api/index.rst b/Documentation/userspace-api/index.rst
index 8a61ac4c1bf1..fa0fe8ada68e 100644
--- a/Documentation/userspace-api/index.rst
+++ b/Documentation/userspace-api/index.rst
@@ -21,6 +21,7 @@ System calls
    ebpf/index
    ioctl/index
    mseal
+   rseq
 
 Security-related interfaces
 ===========================
diff --git a/Documentation/userspace-api/rseq.rst b/Documentation/userspace-api/rseq.rst
new file mode 100644
index 000000000000..e1fdb0d5ce69
--- /dev/null
+++ b/Documentation/userspace-api/rseq.rst
@@ -0,0 +1,135 @@
+=====================
+Restartable Sequences
+=====================
+
+Restartable Sequences allow to register a per thread userspace memory area
+to be used as an ABI between kernel and userspace for three purposes:
+
+ * userspace restartable sequences
+
+ * quick access to read the current CPU number, node ID from userspace
+
+ * scheduler time slice extensions
+
+Restartable sequences (per-cpu atomics)
+---------------------------------------
+
+Restartable sequences allow userspace to perform update operations on
+per-cpu data without requiring heavyweight atomic operations. The actual
+ABI is unfortunately only available in the code and selftests.
+
+Quick access to CPU number, node ID
+-----------------------------------
+
+Allows to implement per CPU data efficiently. Documentation is in code and
+selftests. :(
+
+Scheduler time slice extensions
+-------------------------------
+
+This allows a thread to request a time slice extension when it enters a
+critical section to avoid contention on a resource when the thread is
+scheduled out inside of the critical section.
+
+The prerequisites for this functionality are:
+
+    * Enabled in Kconfig
+
+    * Enabled at boot time (default is enabled)
+
+    * A rseq userspace pointer has been registered for the thread
+
+The thread has to enable the functionality via prctl(2)::
+
+    prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_SET,
+          PR_RSEQ_SLICE_EXT_ENABLE, 0, 0);
+
+prctl() returns 0 on success or otherwise with the following error codes:
+
+========= ==============================================================
+Errorcode Meaning
+========= ==============================================================
+EINVAL    Functionality not available or invalid function arguments.
+          Note: arg4 and arg5 must be zero
+ENOTSUPP  Functionality was disabled on the kernel command line
+ENXIO     Available, but no rseq user struct registered
+========= ==============================================================
+
+The state can be also queried via prctl(2)::
+
+  prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_GET, 0, 0, 0);
+
+prctl() returns ``PR_RSEQ_SLICE_EXT_ENABLE`` when it is enabled or 0 if
+disabled. Otherwise it returns with the following error codes:
+
+========= ==============================================================
+Errorcode Meaning
+========= ==============================================================
+EINVAL    Functionality not available or invalid function arguments.
+          Note: arg3 and arg4 and arg5 must be zero
+========= ==============================================================
+
+The availability and status is also exposed via the rseq ABI struct flags
+field via the ``RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE_BIT`` and the
+``RSEQ_CS_FLAG_SLICE_EXT_ENABLED_BIT``. These bits are read-only for user
+space and only for informational purposes.
+
+If the mechanism was enabled via prctl(), the thread can request a time
+slice extension by setting rseq::slice_ctrl::request to 1. If the thread is
+interrupted and the interrupt results in a reschedule request in the
+kernel, then the kernel can grant a time slice extension and return to
+userspace instead of scheduling out. The length of the extension is
+determined by the ``rseq_slice_extension_nsec`` sysctl.
+
+The kernel indicates the grant by clearing rseq::slice_ctrl::request and
+setting rseq::slice_ctrl::granted to 1. If there is a reschedule of the
+thread after granting the extension, the kernel clears the granted bit to
+indicate that to userspace.
+
+If the request bit is still set when the leaving the critical section,
+userspace can clear it and continue.
+
+If the granted bit is set, then userspace invokes rseq_slice_yield(2) when
+leaving the critical section to relinquish the CPU. The kernel enforces
+this by arming a timer to prevent misbehaving userspace from abusing this
+mechanism.
+
+If both the request bit and the granted bit are false when leaving the
+critical section, then this indicates that a grant was revoked and no
+further action is required by userspace.
+
+The required code flow is as follows::
+
+    rseq->slice_ctrl.request = 1;
+    barrier();  // Prevent compiler reordering
+    critical_section();
+    barrier();  // Prevent compiler reordering
+    rseq->slice_ctrl.request = 0;
+    if (rseq->slice_ctrl.granted)
+        rseq_slice_yield();
+
+As all of this is strictly CPU local, there are no atomicity requirements.
+Checking the granted state is racy, but that cannot be avoided at all::
+
+    if (rseq->slice_ctrl.granted)
+      -> Interrupt results in schedule and grant revocation
+        rseq_slice_yield();
+
+So there is no point in pretending that this might be solved by an atomic
+operation.
+
+If the thread issues a syscall other than rseq_slice_yield(2) within the
+granted timeslice extension, the grant is also revoked and the CPU is
+relinquished immediately when entering the kernel. This is required as
+syscalls might consume arbitrary CPU time until they reach a scheduling
+point when the preemption model is either NONE or VOLUNTARY and therefore
+might exceed the grant by far.
+
+The preferred solution for user space is to use rseq_slice_yield(2) which
+is side effect free. The support for arbitrary syscalls is required to
+support onion layer architectured applications, where the code handling the
+critical section and requesting the time slice extension has no control
+over the code within the critical section.
+
+The kernel enforces flag consistency and terminates the thread with SIGSEGV
+if it detects a violation.
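
The "required code flow" described in the new rseq.rst can be sketched as a compilable user-space helper. This is only an illustration under stated assumptions, not part of the patch: `struct slice_ctrl_mock`, `run_with_slice_request()`, `mock_grant_section()`, `noop_section()` and `mock_yield()` are hypothetical names, the two-byte struct is a stand-in for `rseq::slice_ctrl` whose real layout is defined by the kernel UAPI header, and the `yield_fn` callback stands in for however user space ends up invoking rseq_slice_yield(2). The mock "kernel" behaviour (clearing request and setting granted) follows the prose in the documentation.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for rseq::slice_ctrl; the real UAPI layout may differ. */
struct slice_ctrl_mock {
	volatile uint8_t request;	/* set by user space, cleared by the kernel on grant */
	volatile uint8_t granted;	/* set by the kernel on grant, cleared on revocation */
};

/* Compiler-only barrier (GCC/Clang), matching barrier() in the documented flow. */
#define barrier() __asm__ __volatile__("" ::: "memory")

/*
 * Documented flow: set request, run the critical section, clear request,
 * then yield if a grant is still visible. yield_fn is an assumption that
 * stands in for rseq_slice_yield(2). Returns 1 if yield_fn was called.
 */
static int run_with_slice_request(struct slice_ctrl_mock *ctrl,
				  void (*critical_section)(void *),
				  void *arg, void (*yield_fn)(void))
{
	ctrl->request = 1;
	barrier();
	critical_section(arg);
	barrier();
	ctrl->request = 0;
	if (ctrl->granted) {
		yield_fn();
		return 1;
	}
	return 0;
}

/* Test double: pretends the kernel granted an extension mid-section. */
void mock_grant_section(void *arg)
{
	struct slice_ctrl_mock *ctrl = arg;

	assert(ctrl->request == 1);	/* request must be set before entry */
	ctrl->request = 0;		/* kernel clears the request ... */
	ctrl->granted = 1;		/* ... and marks the grant */
}

/* Test double: critical section during which no grant happens. */
void noop_section(void *arg)
{
	(void)arg;
}

/* Test double for rseq_slice_yield(2): only counts invocations. */
int yield_count;

void mock_yield(void)
{
	yield_count++;
}
```

Calling `run_with_slice_request(&ctrl, noop_section, &ctrl, mock_yield)` models the case where the request bit is still set on exit and is simply cleared; passing `mock_grant_section` models the grant case, in which the helper relinquishes the CPU through the yield callback. As the documentation notes, the check of the granted state is inherently racy, so the sketch makes no attempt to use atomic operations.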
