summaryrefslogtreecommitdiff
path: root/include
AgeCommit message (Collapse)Author
2026-01-23bonding: annotate data-races around slave->last_rxEric Dumazet
slave->last_rx and slave->target_last_arp_rx[...] can be read and written locklessly. Add READ_ONCE() and WRITE_ONCE() annotations. syzbot reported: BUG: KCSAN: data-race in bond_rcv_validate / bond_rcv_validate write to 0xffff888149f0d428 of 8 bytes by interrupt on cpu 1: bond_rcv_validate+0x202/0x7a0 drivers/net/bonding/bond_main.c:3335 bond_handle_frame+0xde/0x5e0 drivers/net/bonding/bond_main.c:1533 __netif_receive_skb_core+0x5b1/0x1950 net/core/dev.c:6039 __netif_receive_skb_one_core net/core/dev.c:6150 [inline] __netif_receive_skb+0x59/0x270 net/core/dev.c:6265 netif_receive_skb_internal net/core/dev.c:6351 [inline] netif_receive_skb+0x4b/0x2d0 net/core/dev.c:6410 ... write to 0xffff888149f0d428 of 8 bytes by interrupt on cpu 0: bond_rcv_validate+0x202/0x7a0 drivers/net/bonding/bond_main.c:3335 bond_handle_frame+0xde/0x5e0 drivers/net/bonding/bond_main.c:1533 __netif_receive_skb_core+0x5b1/0x1950 net/core/dev.c:6039 __netif_receive_skb_one_core net/core/dev.c:6150 [inline] __netif_receive_skb+0x59/0x270 net/core/dev.c:6265 netif_receive_skb_internal net/core/dev.c:6351 [inline] netif_receive_skb+0x4b/0x2d0 net/core/dev.c:6410 br_netif_receive_skb net/bridge/br_input.c:30 [inline] NF_HOOK include/linux/netfilter.h:318 [inline] ... value changed: 0x0000000100005365 -> 0x0000000100005366 Fixes: f5b2b966f032 ("[PATCH] bonding: Validate probe replies in ARP monitor") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Link: https://patch.msgid.link/20260122162914.2299312-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-23Merge branch 'for-7.0/cxl-init' into cxl-for-nextDave Jiang
Merge in patches to support several patch series such as Soft Reserve handling, type2 accelerator enabling, and LSA 2.1 labeling support. Mainly addition of cxl_memdev_attach() to allow the memdev probe to make a decision of proceed/fail depending success of CXL topology enumeration. dax/hmem, e820, resource: Defer Soft Reserved insertion until hmem is ready cxl/mem: Introduce cxl_memdev_attach for CXL-dependent operation cxl/mem: Drop @host argument to devm_cxl_add_memdev() cxl/mem: Convert devm_cxl_add_memdev() to scope-based-cleanup cxl/port: Arrange for always synchronous endpoint attach cxl/mem: Arrange for always-synchronous memdev attach cxl/mem: Fix devm_cxl_memdev_edac_release() confusion
2026-01-23Merge tag 'block-6.19-20260122' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux Pull block fixes from Jens Axboe: - A set of selftest fixes for ublk - Fix for a pid mismatch in ublk, comparing PIDs in different namespaces if run inside a namespace - Fix for a regression added in this release with polling, where the nvme tcp connect code would spin forever - Zoned device error path fix - Tweak the blkzoned uapi additions from this kernel release, making them more easily discoverable - Fix for a regression in bcache with bio endio handling added in this release * tag 'block-6.19-20260122' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: bcache: use bio cloning for detached device requests blk-mq: use BLK_POLL_ONESHOT for synchronous poll completion selftests/ublk: fix garbage output in foreground mode selftests/ublk: fix error handling for starting device selftests/ublk: fix IO thread idle check block: make the new blkzoned UAPI constants discoverable ublk: fix ublksrv pid handling for pid namespaces block: Fix an error path in disk_update_zone_resources()
2026-01-23net: add queue config validation callbackJakub Kicinski
I imagine (tm) that as the number of per-queue configuration options grows some of them may conflict for certain drivers. While the drivers can obviously do all the validation locally doing so is fairly inconvenient as the config is fed to drivers piecemeal via different ops (for different params and NIC-wide vs per-queue). Add a centralized callback for validating the queue config in queue ops. The callback gets invoked before memory provider is installed, and in the future should also be called when ring params are modified. The validation is done after each layer of configuration. Since we can't fail MP un-binding we must make sure that the config is valid both before and after MP overrides are applied. This is moot for now since the set of MP and device configs are disjoint. It will matter significantly in the future, so adding it now so that we don't forget.. Link: https://patch.msgid.link/20260122005113.2476634-6-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-23net: use netdev_queue_config() for mp restartJakub Kicinski
We should follow the prepare/commit approach for queue configuration. The qcfg struct should be added to dev->cfg rather than directly to queue objects so that we can clone and discard the pending config easily. Remove the qcfg in struct netdev_rx_queue, and switch remaining callers to netdev_queue_config(). netdev_queue_config() will construct the qcfg on the fly based on device defaults and state of the queue. ndo_default_qcfg becomes optional because having the callback itself does not have any meaningful semantics to us. Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Link: https://patch.msgid.link/20260122005113.2476634-5-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-23net: introduce a trivial netdev_queue_config()Jakub Kicinski
We may choose to extend or reimplement the logic which renders the per-queue config. The drivers should not poke directly into the queue state. Add a helper for drivers to use when they want to query the config for a specific queue. Link: https://patch.msgid.link/20260122005113.2476634-3-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-23geneve: add netlink support for GRO hintPaolo Abeni
Allow configuring and dumping the new device option, and cache its value into the geneve socket itself. The new option is not tie to it any code yet. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Link: https://patch.msgid.link/2295d4e4d1e919a3189425141bbc71c7850a2de0.1769011015.git.pabeni@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-23geneve: expose gso partial features for tunnel offloadPaolo Abeni
GSO partial features for tunnels do not require any kind of support from the underlying device: we can safely add them to the geneve UDP tunnel. The only point of attention is the skb required features propagation in the device xmit op: partial features must be stripped, except for UDP_TUNNEL*. Keep partial features disabled by default. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Link: https://patch.msgid.link/d851ca8e928cf05d68310bcbaeaa5e9e0b01e058.1769011015.git.pabeni@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-23net: introduce mangleid_featuresPaolo Abeni
Some/most devices implementing gso_partial need to disable the GSO partial features when the IP ID can't be mangled; to that extend each of them implements something alike the following[1]: if (skb->encapsulation && !(features & NETIF_F_TSO_MANGLEID)) features &= ~NETIF_F_TSO; in the ndo_features_check() op, which leads to a bit of duplicate code. Later patch in the series will implement GSO partial support for virtual devices, and the current status quo will require more duplicate code and a new indirect call in the TX path for them. Introduce the mangleid_features mask, allowing the core to disable NIC features based on/requiring MANGLEID, without any further intervention from the driver. The same functionality could be alternatively implemented adding a single boolean flag to the struct net_device, but would require an additional checks in ndo_features_check(). Also note that [1] is incorrect if the NIC additionally implements NETIF_F_GSO_UDP_L4, mangleid_features transparently handle even such a case. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/5a7cdaeea40b0a29b88e525b6c942d73ed3b8ce7.1769011015.git.pabeni@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-23rqspinlock: Fix TAS fallback lock entry creationKumar Kartikeya Dwivedi
The TAS fallback can be invoked directly when queued spin locks are disabled, and through the slow path when paravirt is enabled for queued spin locks. In the latter case, the res_spin_lock macro will attempt the fast path and already hold the entry when entering the slow path. This will lead to creation of extraneous entries that are not released, which may cause false positives for deadlock detection. Fix this by always preceding invocation of the TAS fallback in every case with the grabbing of the held lock entry, and add a comment to make note of this. Fixes: c9102a68c070 ("rqspinlock: Add a test-and-set fallback") Reported-by: Amery Hung <ameryhung@gmail.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Tested-by: Amery Hung <ameryhung@gmail.com> Link: https://lore.kernel.org/r/20260122115911.3668985-1-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-01-23Merge tag 'drm-fixes-2026-01-23' of https://gitlab.freedesktop.org/drm/kernelLinus Torvalds
Pull drm fixes from Dave Airlie: "Probably a good thing you decided to do an rc8 in this round. Nothing stands out, but xe/amdgpu and mediatek all have a bunch of fixes, and then there are a few other single patches. Hopefully next week is calmer for release. xe: - Disallow bind-queue sharing across multiple VMs - Fix xe userptr in the absence of CONFIG_DEVICE_PRIVATE - Fix a missed page count update - Fix a confused argument to alloc_workqueue() - Kernel-doc fixes - Disable a workaround on VFs - Fix a job lock assert - Update wedged.mode only after successful reset policy change - Select CONFIG_DEVICE_PRIVATE when DRM_XE_GPUSVM is selected amdgpu: - fix color pipeline string leak - GC 12 fix - Misc error path fixes - DC analog fix - SMU 6 fixes - TLB flush fix - DC idle optimization fix amdkfd: - GC 11 cooperative launch fix imagination: - sync wait for logtype update completion to ensure FW trace is available bridge/synopsis: - Fix error paths in dw_dp_bind nouveau: - Add and implement missing DSB connector types, and improve unknown connector handling - Set missing atomic function ops intel: - place 3D lut at correct place in pipeline - fix color pipeline string leak vkms: - fix color pipeline string leak mediatek: - Fix platform_get_irq() error checking - HDMI DDC v2 driver fixes - dpi: Find next bridge during probe - mtk_gem: Partial refactor and use drm_gem_dma_object - dt-bindings: Fix typo 'hardwares' to 'hardware'" * tag 'drm-fixes-2026-01-23' of https://gitlab.freedesktop.org/drm/kernel: (38 commits) Revert "drm/amd/display: pause the workload setting in dm" drm/xe: Select CONFIG_DEVICE_PRIVATE when DRM_XE_GPUSVM is selected drm, drm/xe: Fix xe userptr in the absence of CONFIG_DEVICE_PRIVATE drm/i915/display: Fix color pipeline enum name leak drm/vkms: Fix color pipeline enum name leak drm/amd/display: Fix color pipeline enum name leak drm/i915/color: Place 3D LUT after CSC in plane color pipeline drm/nouveau/disp: Set drm_mode_config_funcs.atomic_(check|commit) drm/nouveau: implement missing DCB connector types; gracefully handle unknown connectors drm/nouveau: add missing DCB connector types drm/amdgpu: fix type for wptr in ring backup drm/amdgpu: Fix validating flush_gpu_tlb_pasid() drm/amd/pm: Workaround SI powertune issue on Radeon 430 (v2) drm/amd/pm: Don't clear SI SMC table when setting power limit drm/amd/pm: Fix si_dpm mmCG_THERMAL_INT setting drm/xe: Update wedged.mode only after successful reset policy change drm/xe/migrate: fix job lock assert drm/xe/uapi: disallow bind queue sharing drm/amd/display: Only poll analog connectors drm/amdgpu: fix error handling in ib_schedule() ...
2026-01-23fsnotify: Track inode connectors for a superblockJan Kara
Introduce a linked list tracking all inode connectors for a superblock. We will use this list when the superblock is getting shutdown to properly clean up all the inode marks instead of relying on scanning all inodes in the superblock which can get rather slow. Suggested-by: Amir Goldstein <amir73il@gmail.com> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Jan Kara <jack@suse.cz>
2026-01-23erofs: pass inode to trace_erofs_read_folioHongbo Li
The trace_erofs_read_folio accesses inode information through folio, but this method fails if the real inode is not associated with the folio(such as in the upcoming page cache sharing case). Therefore, we pass the real inode to it so that the inode information can be printed out in that case. Signed-off-by: Hongbo Li <lihongbo22@huawei.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2026-01-23Merge branch 'vfs-7.0.iomap' of ↵Gao Xiang
ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull 'vfs-7.0.iomap' to allow iomap page cache users to set `iomap_iter::private` for the upcoming page cache sharing support. It also includes a patch to avoid triggering inline data reads for the FIEMAP operation. Signed-off-by: Gao Xiang <xiang@kernel.org>
2026-01-23firmware: xilinx: Add firmware API's to support aes-gcm in Versal deviceHarsh Jain
Add aes-gcm crypto API's for AMD/Xilinx Versal device. Signed-off-by: Harsh Jain <h.jain@amd.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2026-01-23firmware: zynqmp: Add helper API to self discovery the deviceHarsh Jain
Add API to get SoC version and family info. Signed-off-by: Harsh Jain <h.jain@amd.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2026-01-23firmware: zynqmp: Move crypto API's to separate fileHarsh Jain
For better maintainability move crypto related API's to new zynqmp-crypto.c file. Signed-off-by: Harsh Jain <h.jain@amd.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2026-01-22Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR (net-6.19-rc7). Conflicts: drivers/net/ethernet/huawei/hinic3/hinic3_irq.c b35a6fd37a00 ("hinic3: Add adaptive IRQ coalescing with DIM") fb2bb2a1ebf7 ("hinic3: Fix netif_queue_set_napi queue_index input parameter error") https://lore.kernel.org/fc0a7fdf08789a52653e8ad05281a0a849e79206.1768915707.git.zhuyikai1@h-partners.com drivers/net/wireless/ath/ath12k/mac.c drivers/net/wireless/ath/ath12k/wifi7/hw.c 31707572108d ("wifi: ath12k: Fix wrong P2P device link id issue") c26f294fef2a ("wifi: ath12k: Move ieee80211_ops callback to the arch specific module") https://lore.kernel.org/20260114123751.6a208818@canb.auug.org.au Adjacent changes: drivers/net/wireless/ath/ath12k/mac.c 8b8d6ee53dfd ("wifi: ath12k: Fix scan state stuck in ABORTING after cancel_remain_on_channel") 914c890d3b90 ("wifi: ath12k: Add framework for hardware specific ieee80211_ops registration") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-22ublk: add new feature UBLK_F_BATCH_IOMing Lei
Add new feature UBLK_F_BATCH_IO which replaces the following two per-io commands: - UBLK_U_IO_FETCH_REQ - UBLK_U_IO_COMMIT_AND_FETCH_REQ with three per-queue batch io uring_cmd: - UBLK_U_IO_PREP_IO_CMDS - UBLK_U_IO_COMMIT_IO_CMDS - UBLK_U_IO_FETCH_IO_CMDS Then ublk can deliver batch io commands to ublk server in single multishort uring_cmd, also allows to prepare & commit multiple commands in batch style via single uring_cmd, communication cost is reduced a lot. This feature also doesn't limit task context any more for all supported commands, so any allowed uring_cmd can be issued in any task context. ublk server implementation becomes much easier. Meantime load balance becomes much easier to support with this feature. The command `UBLK_U_IO_FETCH_IO_CMDS` can be issued from multiple task contexts, so each task can adjust this command's buffer length or number of inflight commands for controlling how much load is handled by current task. Later, priority parameter will be added to command `UBLK_U_IO_FETCH_IO_CMDS` for improving load balance support. UBLK_U_IO_NEED_GET_DATA isn't supported in batch io yet, but it may be enabled in future via its batch pair. Reviewed-by: Caleb Sander Mateos <csander@purestorage.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-22ublk: add UBLK_U_IO_FETCH_IO_CMDS for batch I/O processingMing Lei
Add UBLK_U_IO_FETCH_IO_CMDS command to enable efficient batch processing of I/O requests. This multishot uring_cmd allows the ublk server to fetch multiple I/O commands in a single operation, significantly reducing submission overhead compared to individual FETCH_REQ* commands. Key Design Features: 1. Multishot Operation: One UBLK_U_IO_FETCH_IO_CMDS can fetch many I/O commands, with the batch size limited by the provided buffer length. 2. Dynamic Load Balancing: Multiple fetch commands can be submitted simultaneously, but only one is active at any time. This enables efficient load distribution across multiple server task contexts. 3. Implicit State Management: The implementation uses three key variables to track state: - evts_fifo: Queue of request tags awaiting processing - fcmd_head: List of available fetch commands - active_fcmd: Currently active fetch command (NULL = none active) States are derived implicitly: - IDLE: No fetch commands available - READY: Fetch commands available, none active - ACTIVE: One fetch command processing events 4. Lockless Reader Optimization: The active fetch command can read from evts_fifo without locking (single reader guarantee), while writers (ublk_queue_rq/ublk_queue_rqs) use evts_lock protection. The memory barrier pairing plays key role for the single lockless reader optimization. Implementation Details: - ublk_queue_rq() and ublk_queue_rqs() save request tags to evts_fifo - __ublk_acquire_fcmd() selects an available fetch command when events arrive and no command is currently active - ublk_batch_dispatch() moves tags from evts_fifo to the fetch command's buffer and posts completion via io_uring_mshot_cmd_post_cqe() - State transitions are coordinated via evts_lock to maintain consistency Reviewed-by: Caleb Sander Mateos <csander@purestorage.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-22ublk: handle UBLK_U_IO_COMMIT_IO_CMDSMing Lei
Handle UBLK_U_IO_COMMIT_IO_CMDS by walking the uring_cmd fixed buffer: - read each element into one temp buffer in batch style - parse and apply each element for committing io result Reviewed-by: Caleb Sander Mateos <csander@purestorage.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-22ublk: handle UBLK_U_IO_PREP_IO_CMDSMing Lei
This commit implements the handling of the UBLK_U_IO_PREP_IO_CMDS command, which allows userspace to prepare a batch of I/O requests. The core of this change is the `ublk_walk_cmd_buf` function, which iterates over the elements in the uring_cmd fixed buffer. For each element, it parses the I/O details, finds the corresponding `ublk_io` structure, and prepares it for future dispatch. Add per-io lock for protecting concurrent delivery and committing. Reviewed-by: Caleb Sander Mateos <csander@purestorage.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-22ublk: add new batch command UBLK_U_IO_PREP_IO_CMDS & UBLK_U_IO_COMMIT_IO_CMDSMing Lei
Add new command UBLK_U_IO_PREP_IO_CMDS, which is the batch version of UBLK_IO_FETCH_REQ. Add new command UBLK_U_IO_COMMIT_IO_CMDS, which is for committing io command result only, still the batch version. The new command header type is `struct ublk_batch_io`. This patch doesn't actually implement these commands yet, just validates the SQE fields. Reviewed-by: Caleb Sander Mateos <csander@purestorage.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-23Merge tag 'drm-misc-next-2026-01-22' of ↵Dave Airlie
https://gitlab.freedesktop.org/drm/misc/kernel into drm-next drm-misc-next for 6.20: Core Changes: - buddy: Fix free_trees memory leak, prevent a BUG_ON - dma-buf: Start to introduce cgroup memory accounting in heaps, Remove sysfs stats, add new tracepoints - hdmi: Limit infoframes exposure to userspace based on driver capabilities - property: Account for property blobs in memcg Driver Changes: - atmel-hlcdc: Switch to drmm resources, Support nomodeset parameter, various patches to use newish helpers and fix memory safety bugs - hisilicon: Fix various DisplayPort related bugs - imagination: Introduce hardware version checks - renesas: Fix kernel panic on reboot - rockchip: Fix RK3576 HPD interrupt handling, Improve RK3588 HPD interrupt handling - v3d: Convert to drm logging helpers - bridge: - Continuation of the refcounting effort - new bridge: Algoltek AG6311 - panel: - new panel: Anbernic RG-DS Signed-off-by: Dave Airlie <airlied@redhat.com> From: Maxime Ripard <mripard@redhat.com> Link: https://patch.msgid.link/20260122-antique-sexy-junglefowl-1bc5a8@houat
2026-01-22tcp: move tcp_rate_check_app_limited() to tcp.cEric Dumazet
tcp_rate_check_app_limited() is used from tcp_sendmsg_locked() fast path and from other callers. Move it to tcp.c so that it can be inlined in tcp_sendmsg_locked(). Small increase of code, for better TCP performance. $ scripts/bloat-o-meter -t vmlinux.old vmlinux.new add/remove: 0/0 grow/shrink: 1/0 up/down: 87/0 (87) Function old new delta tcp_sendmsg_locked 4217 4304 +87 Total: Before=22566462, After=22566549, chg +0.00% Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Neal Cardwell <ncardwell@google.com> Link: https://patch.msgid.link/20260121095923.3134639-3-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-22tcp: move tcp_rate_gen to tcp_input.cEric Dumazet
This function is called from one caller only, in TCP fast path. Move it to tcp_input.c so that compiler can inline it. $ scripts/bloat-o-meter -t vmlinux.old vmlinux.new add/remove: 0/2 grow/shrink: 1/0 up/down: 226/-300 (-74) Function old new delta tcp_ack 5405 5631 +226 __pfx_tcp_rate_gen 16 - -16 tcp_rate_gen 284 - -284 Total: Before=22566536, After=22566462, chg -0.00% Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Neal Cardwell <ncardwell@google.com> Link: https://patch.msgid.link/20260121095923.3134639-2-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-22net: stmmac: dwmac-imx: keep preamble before sfd on i.MX8MPStefan Eichenberger
The stmmac implementation used by NXP for the i.MX8MP SoC is subject to errata ERR050694. According to this errata, when no preamble byte is transferred before the SFD from the PHY to the MAC, the MAC will discard the frame. Setting the PHY_F_KEEP_PREAMBLE_BEFORE_SFD flag instructs PHYs that support it to keep the preamble byte before the SFD. This ensures that the MAC successfully receives frames. As this is an issue in the MAC implementation, only enable the flag for the i.MX8MP SoC where the errata applies but not for other SoCs using a working stmmac implementation. The exact wording of the errata ERR050694 from NXP: The IEEE 802.3 standard states that, in MII/GMII modes, the byte preceding the SFD (0xD5), SMD-S (0xE6,0x4C, 0x7F, or 0xB3), or SMD-C (0x61, 0x52, 0x9E, or 0x2A) byte can be a non-PREAMBLE byte or there can be no preceding preamble byte. The MAC receiver must successfully receive a packet without any preamble(0x55) byte preceding the SFD, SMD-S, or SMD-C byte. However due to the defect, in configurations where frame preemption is enabled, when preamble byte does not precede the SFD, SMD-S, or SMD-C byte, the received packet is discarded by the MAC receiver. This is because, the start-of-packet detection logic of the MAC receiver incorrectly checks for a preamble byte. NXP refers to IEEE 802.3 where in clause 35.2.3.2.2 Receive case (GMII) they show two tables one where the preamble is preceding the SFD and one where it is not. The text says: The operation of 1000 Mb/s PHYs can result in shrinkage of the preamble between transmission at the source GMII and reception at the destination GMII. Table 35-3 depicts the case where no preamble bytes are conveyed across the GMII. This case may not be possible with a specific PHY, but illustrates the minimum preamble with which MAC shall be able to operate. Table 35-4 depicts the case where the entire preamble is conveyed across the GMII. This workaround was tested on a Verdin iMX8MP by enforcing 10 MBit/s: ethtool -s end0 speed 10 Without keeping the preamble, no packet were received. With keeping the preamble, everything worked as expected. Signed-off-by: Stefan Eichenberger <stefan.eichenberger@toradex.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Tested-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Link: https://patch.msgid.link/20260120203905.23805-4-eichest@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-22net: phy: add a new phy_device flag to keep preamble before sfdStefan Eichenberger
Add a new flag, PHY_F_KEEP_PREAMBLE_BEFORE_SFD, to indicate that the PHY shall not remove the preamble before the SFD if it supports it. MACs that do not support receiving frames without a preamble can set this flag. Signed-off-by: Stefan Eichenberger <stefan.eichenberger@toradex.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Tested-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Link: https://patch.msgid.link/20260120203905.23805-2-eichest@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-22PCI/IDE: Fix off by one error calculating VF RID rangeLi Ming
The VF ID range of an SR-IOV device is [0, num_VFs - 1]. pci_ide_stream_alloc() mistakenly uses num_VFs to represent the last ID. Fix that off by one error to stay in bounds of the range. Fixes: 1e4d2ff3ae45 ("PCI/IDE: Add IDE establishment helpers") Signed-off-by: Li Ming <ming.li@zohomail.com> Reviewed-by: Xu Yilun <yilun.xu@linux.intel.com> Link: https://patch.msgid.link/20260114111455.550984-1-ming.li@zohomail.com Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2026-01-22Revert "PCI/TSM: Report active IDE streams"Dan Williams
The proposed ABI failed to account for multiple host bridges with the same stream name. The fix needs to namespace streams or otherwise link back to the host bridge, but a change like that is too big for a fix. Given this ABI never saw a released kernel, delete it for now and bring it back later with this issue addressed. Reported-by: Xu Yilun <yilun.xu@linux.intel.com> Reported-by: Yi Lai <yi1.lai@intel.com> Closes: http://lore.kernel.org/20251223085601.2607455-1-yilun.xu@linux.intel.com Link: http://patch.msgid.link/6972c872acbb9_1d3310035@dwillia2-mobl4.notmuch Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2026-01-22io_uring: introduce non-circular SQPavel Begunkov
Outside of SQPOLL, normally SQ entries are consumed by the time the submission syscall returns. For those cases we don't need a circular buffer and the head/tail tracking, instead the kernel can assume that entries always start from the beginning of the SQ at index 0. This patch introduces a setup flag doing exactly that. It's a simpler and helps to keeps SQEs hot in cache. The feature is optional and enabled by setting IORING_SETUP_SQ_REWIND. The flag is rejected if passed together with SQPOLL as it'd require waiting for SQ before each submission. It also requires IORING_SETUP_NO_SQARRAY, which can be supported but it's unlikely there will be users, so leave more space for future optimisations. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-22PCI/AER: Report CXL or PCIe bus type in AER trace loggingTerry Bowman
The AER service driver and aer_event tracing currently log 'PCIe Bus Type' for all errors. Update the driver and aer_event tracing to log 'CXL Bus Type' for CXL device errors. This requires that AER can identify and distinguish between PCIe errors and CXL errors. Introduce boolean 'is_cxl' to 'struct aer_err_info'. Add assignment in aer_get_device_error_info() and pci_print_aer(). Update the aer_event trace routine to accept a bus type string parameter. Signed-off-by: Terry Bowman <terry.bowman@amd.com> Co-developed-by: Dan Williams <dan.j.williams@intel.com> Acked-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Dave Jiang <dave.jiang@intel.com> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com> Link: https://patch.msgid.link/20260114182055.46029-15-terry.bowman@amd.com Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Dave Jiang <dave.jiang@intel.com>
2026-01-22PCI/AER: Export pci_aer_unmask_internal_errors()Terry Bowman
Internal PCIe errors are not enabled by default during initialization because their behavior is too device-specific and there is no standard way to reason about them. However, for CXL an internal error is the standard mechanism for conveying CXL protocol errors. Export pci_aer_unmask_internal_errors() for CXL, but make it clear that they are only meant for CXL and the status quo for leaving them masked for PCIe in general remains. Signed-off-by: Terry Bowman <terry.bowman@amd.com> Reviewed-by: Dave Jiang <dave.jiang@intel.com> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com> Link: https://patch.msgid.link/20260114182055.46029-10-terry.bowman@amd.com Co-developed-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Acked-by: Bjorn Helgaas <bhelgaas@google.com> Signed-off-by: Dave Jiang <dave.jiang@intel.com>
2026-01-23Merge tag 'drm-xe-fixes-2026-01-22' of ↵Dave Airlie
https://gitlab.freedesktop.org/drm/xe/kernel into drm-fixes UAPI Changes: - Disallow bind-queue sharing across multiple VMs (Matt Auld) Core Changes: - Fix xe userptr in the absence of CONFIG_DEVICE_PRIVATE (Thomas) Driver Changes: - Fix a missed page count update (Matt Brost) - Fix a confused argument to alloc_workqueue() (Marco Crivellari) - Kernel-doc fixes (Jani) - Disable a workaround on VFs (Matt Brost) - Fix a job lock assert (Matt Auld) - Update wedged.mode only after successful reset policy change (Lukasz) - Select CONFIG_DEVICE_PRIVATE when DRM_XE_GPUSVM is selected (Thomas) Signed-off-by: Dave Airlie <airlied@redhat.com> From: Thomas Hellstrom <thomas.hellstrom@linux.intel.com> Link: https://patch.msgid.link/aXIdiXaY-RxoaviV@fedora
2026-01-22PCI: Introduce pcie_is_cxl()Terry Bowman
CXL is a protocol that runs on top of PCIe electricals. Its error model also runs on top of the PCIe AER error model by standardizing "internal" errors as "CXL" errors. Linux has historically ignored internal errors. CXL protocol error handling is then a task of enhancing the PCIe AER core to understand that PCIe ports (upstream and downstream) and endpoints may throw internal errors that represent standard CXL protocol errors. The proposed method to make that determination is to teach 'struct pci_dev' to cache when its link has trained the CXL.mem and/or CXL.cache protocols and then treat all internal errors as CXL errors. A design goal is to not burden the PCIe AER core with CXL knowledge beyond just enough to forward error notifications to the CXL RAS core. The forwarded notification looks up a 'struct cxl_port' or 'struct cxl_dport' companion device to the PCI device. Introduce set_pcie_cxl() with logic checking for CXL.mem or CXL.cache status in the CXL Flex Bus DVSEC status register. The CXL Flex Bus DVSEC presence is used because it is required for all the CXL PCIe devices.[1] [1] CXL 3.1 Spec, 8.1.1 PCIe Designated Vendor-Specific Extended Capability (DVSEC) ID Assignment, Table 8-2 Signed-off-by: Terry Bowman <terry.bowman@amd.com> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Reviewed-by: Dave Jiang <dave.jiang@intel.com> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Reviewed-by: Alejandro Lucero <alucerop@amd.com> Reviewed-by: Ben Cheatham <benjamin.cheatham@amd.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Acked-by: Bjorn Helgaas <bhelgaas@google.com> Link: https://patch.msgid.link/20260114182055.46029-4-terry.bowman@amd.com Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Dave Jiang <dave.jiang@intel.com>
2026-01-22PCI: Update CXL DVSEC definitionsTerry Bowman
CXL DVSEC definitions were recently moved into uapi/pci_regs.h, but the newly added macros do not follow the file's existing naming conventions. The current format uses CXL_DVSEC_XYZ, while the new CXL entries must instead use the PCI_DVSEC_CXL_XYZ prefix to match the conventions already established in pci_regs.h. The new CXL DVSEC macros also introduce _MASK and _OFFSET suffixes, which are not used anywhere else in the file. These suffixes lengthen the identifiers and reduce readability. Remove _MASK and _OFFSET from the recently added definitions. Additionally, remove PCI_DVSEC_HEADER1_LENGTH, as it duplicates the existing PCI_DVSEC_HEADER1_LEN() macro. Update all existing references to use the new macro names. Finally, update the inline documentation to reference the latest revision of the CXL specification. Signed-off-by: Terry Bowman <terry.bowman@amd.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Acked-by: Bjorn Helgaas <bhelgaas@google.com> Link: https://patch.msgid.link/20260114182055.46029-3-terry.bowman@amd.com Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Dave Jiang <dave.jiang@intel.com>
2026-01-22PCI: Move CXL DVSEC definitions into uapi/linux/pci_regs.hTerry Bowman
The CXL DVSECs are currently defined in cxl/core/cxlpci.h. These are not accessible to other subsystems. Move these to uapi/linux/pci_regs.h. The CXL DVSEC definitions will be renamed and reformatted to fit better with existing defines. Signed-off-by: Terry Bowman <terry.bowman@amd.com> Reviewed-by: Dave Jiang <dave.jiang@intel.com> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Acked-by: Bjorn Helgaas <bhelgaas@google.com> Link: https://patch.msgid.link/20260114182055.46029-2-terry.bowman@amd.com Signed-off-by: Dave Jiang <dave.jiang@intel.com>
2026-01-22Merge tag 'net-6.19-rc7' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Including fixes from CAN and wireless. Pretty big, but hard to make up any cohesive story that would explain it, a random collection of fixes. The two reverts of bad patches from this release here feel like stuff that'd normally show up by rc5 or rc6. Perhaps obvious thing to say, given the holiday timing. That said, no active investigations / regressions. Let's see what the next week brings. Current release - fix to a fix: - can: alloc_candev_mqs(): add missing default CAN capabilities Current release - regressions: - usbnet: fix crash due to missing BQL accounting after resume - Revert "net: wwan: mhi_wwan_mbim: Avoid -Wflex-array-member-not ... Previous releases - regressions: - Revert "nfc/nci: Add the inconsistency check between the input ... Previous releases - always broken: - number of driver fixes for incorrect use of seqlocks on stats - rxrpc: fix recvmsg() unconditional requeue, don't corrupt rcv queue when MSG_PEEK was set - ipvlan: make the addrs_lock be per port avoid races in the port hash table - sched: enforce that teql can only be used as root qdisc - virtio: coalesce only linear skb - wifi: ath12k: fix dead lock while flushing management frames - eth: igc: reduce TSN TX packet buffer from 7KB to 5KB per queue" * tag 'net-6.19-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (96 commits) Octeontx2-af: Add proper checks for fwdata dpll: Prevent duplicate registrations net/sched: act_ife: avoid possible NULL deref hinic3: Fix netif_queue_set_napi queue_index input parameter error vsock/test: add stream TX credit bounds test vsock/virtio: cap TX credit to local buffer size vsock/test: fix seqpacket message bounds test vsock/virtio: fix potential underflow in virtio_transport_get_credit() net: fec: account for VLAN header in frame length calculations net: openvswitch: fix data race in ovs_vport_get_upcall_stats octeontx2-af: Fix error handling net: pcs: pcs-mtk-lynxi: report in-band capability for 2500Base-X rxrpc: Fix data-race warning and potential load/store tearing net: dsa: fix off-by-one in maximum bridge ID determination net: bcmasp: Fix network filter wake for asp-3.0 bonding: provide a net pointer to __skb_flow_dissect() selftests: net: amt: wait longer for connection before sending packets be2net: Fix NULL pointer dereference in be_cmd_get_mac_from_list Revert "net: wwan: mhi_wwan_mbim: Avoid -Wflex-array-member-not-at-end warning" netrom: fix double-free in nr_route_frame() ...
2026-01-22netfilter: nf_tables: add .abort_skip_removal flag for set typesPablo Neira Ayuso
The pipapo set backend is the only user of the .abort interface so far. To speed up pipapo abort path, removals are skipped. The follow up patch updates the rbtree to use to build an array of ordered elements, then use binary search. This needs a new .abort interface but, unlike pipapo, it also need to undo/remove elements. Add a flag and use it from the pipapo set backend. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Florian Westphal <fw@strlen.de>
2026-01-22driver-core: move devres_for_each_res() to base.hDanilo Krummrich
devres_for_each_res() is only used by .../firmware_loader/main.c, which already includes base.h. The usage of devres_for_each_res() by code outside of driver-core is questionable, hence move it to base.h. Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Link: https://patch.msgid.link/20260119162920.77189-1-dakr@kernel.org Signed-off-by: Danilo Krummrich <dakr@kernel.org>
2026-01-22Merge tag 'wireless-2026-11-22' of ↵Jakub Kicinski
https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless Johannes Berg says: ==================== Another set of updates: - various small fixes for ath10k/ath12k/mwifiex/rsi - cfg80211 fix for HE bitrate overflow - mac80211 fixes - S1G beacon handling in scan - skb tailroom handling for HW encryption - CSA fix for multi-link - handling of disabled links during association * tag 'wireless-2026-11-22' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless: wifi: cfg80211: ignore link disabled flag from userspace wifi: mac80211: apply advertised TTLM from association response wifi: mac80211: parse all TTLM entries wifi: mac80211: don't increment crypto_tx_tailroom_needed_cnt twice wifi: mac80211: don't perform DA check on S1G beacon wifi: ath12k: Fix wrong P2P device link id issue wifi: ath12k: fix dead lock while flushing management frames wifi: ath12k: Fix scan state stuck in ABORTING after cancel_remain_on_channel wifi: ath12k: cancel scan only on active scan vdev wifi: mwifiex: Fix a loop in mwifiex_update_ampdu_rxwinsize() wifi: mac80211: correctly check if CSA is active wifi: cfg80211: Fix bitrate calculation overflow for HE rates wifi: rsi: Fix memory corruption due to not set vif driver data size wifi: ath12k: don't force radio frequency check in freq_to_idx() wifi: ath12k: fix dma_free_coherent() pointer wifi: ath10k: fix dma_free_coherent() pointer ==================== Link: https://patch.msgid.link/20260122110248.15450-3-johannes@sipsolutions.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-22KVM: arm64: Add exit to userspace on {LD,ST}64B* outside of memslotsMarc Zyngier
The main use of {LD,ST}64B* is to talk to a device, which is hopefully directly assigned to the guest and requires no additional handling. However, this does not preclude a VMM from exposing a virtual device to the guest, and to allow 64 byte accesses as part of the programming interface. A direct consequence of this is that we need to be able to forward such access to userspace. Given that such a contraption is very unlikely to ever exist, we choose to offer a limited service: userspace gets (as part of a new exit reason) the ESR, the IPA, and that's it. It is fully expected to handle the full semantics of the instructions, deal with ACCDATA, the return values and increment PC. Much fun. A canonical implementation can also simply inject an abort and be done with it. Frankly, don't try to do anything else unless you have time to waste. Acked-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Oliver Upton <oupton@kernel.org> Signed-off-by: Marc Zyngier <maz@kernel.org> Signed-off-by: Yicong Yang <yangyicong@hisilicon.com> Signed-off-by: Zhou Wang <wangzhou1@hisilicon.com> Signed-off-by: Will Deacon <will@kernel.org>
2026-01-22net: bonding: use workqueue to make sure peer notify updated in lacp modeTonghao Zhang
The rtnl lock might be locked, preventing ad_cond_set_peer_notif() from acquiring the lock and updating send_peer_notif. This patch addresses the issue by using a workqueue. Since updating send_peer_notif does not require high real-time performance, such delayed updates are entirely acceptable. In fact, checking this value and using it in multiple places, all operations are protected at the same time by rtnl lock, such as - read send_peer_notif - send_peer_notif-- - bond_should_notify_peers By the way, rtnl lock is still required, when accessing bond.params.* for updating send_peer_notif. In lacp mode, resetting send_peer_notif in workqueue is safe, simple and effective way. Additionally, this patch introduces bond_peer_notify_may_events(), which is used to check whether an event should be sent. This function will be used in both patch 1 and 2. Cc: Jay Vosburgh <jv@jvosburgh.net> Cc: "David S. Miller" <davem@davemloft.net> Cc: Eric Dumazet <edumazet@google.com> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Paolo Abeni <pabeni@redhat.com> Cc: Simon Horman <horms@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Andrew Lunn <andrew+netdev@lunn.ch> Cc: Nikolay Aleksandrov <razor@blackwall.org> Cc: Hangbin Liu <liuhangbin@gmail.com> Cc: Jason Xing <kerneljasonxing@gmail.com> Suggested-by: Hangbin Liu <liuhangbin@gmail.com> Signed-off-by: Tonghao Zhang <tonghao@bamaicloud.com> Reviewed-by: Hangbin Liu <liuhangbin@gmail.com> Link: https://patch.msgid.link/f95accb5db0b10ce3ed2f834fc70f716c9abbb9c.1768709239.git.tonghao@bamaicloud.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-01-22sched: remove task_struct->faults_disabled_mappingChristoph Hellwig
This reverts commit 2b69987be575 ("sched: Add task_struct->faults_disabled_mapping"), which added this field without review or maintainer signoff. With bcachefs removed from the tree it is also unused now. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260122085223.487092-1-hch@lst.de
2026-01-22rseq: Allow registering RSEQ with slice extensionPeter Zijlstra
Since glibc cares about the number of syscalls required to initialize a new thread, allow initializing rseq with slice extension on. This avoids having to do another prctl(). Requested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260121143207.814193010@infradead.org
2026-01-22rseq: Implement rseq_grant_slice_extension()Thomas Gleixner
Provide the actual decision function, which decides whether a time slice extension is granted in the exit to user mode path when NEED_RESCHED is evaluated. The decision is made in two stages. First an inline quick check to avoid going into the actual decision function. This checks whether: #1 the functionality is enabled #2 the exit is a return from interrupt to user mode #3 any TIF bit, which causes extra work is set. That includes TIF_RSEQ, which means the task was already scheduled out. The slow path, which implements the actual user space ABI, is invoked when: A) #1 is true, #2 is true and #3 is false It checks whether user space requested a slice extension by setting the request bit in the rseq slice_ctrl field. If so, it grants the extension and stores the slice expiry time, so that the actual exit code can double check whether the slice is already exhausted before going back. B) #1 - #3 are true _and_ a slice extension was granted in a previous loop iteration In this case the grant is revoked. In case that the user space access faults or invalid state is detected, the task is terminated with SIGSEGV. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20251215155709.195303303@linutronix.de
2026-01-22rseq: Reset slice extension when scheduledThomas Gleixner
When a time slice extension was granted in the need_resched() check on exit to user space, the task can still be scheduled out in one of the other pending work items. When it gets scheduled back in, and need_resched() is not set, then the stale grant would be preserved, which is just wrong. RSEQ already keeps track of that and sets TIF_RSEQ, which invokes the critical section and ID update mechanisms. Utilize them and clear the user space slice control member of struct rseq unconditionally within the existing user access sections. That's just an unconditional store more in that path. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20251215155709.131081527@linutronix.de
2026-01-22rseq: Implement time slice extension enforcement timerThomas Gleixner
If a time slice extension is granted and the reschedule delayed, the kernel has to ensure that user space cannot abuse the extension and exceed the maximum granted time. It was suggested to implement this via the existing hrtick() timer in the scheduler, but that turned out to be problematic for several reasons: 1) It creates a dependency on CONFIG_SCHED_HRTICK, which can be disabled independently of CONFIG_HIGHRES_TIMERS 2) HRTICK usage in the scheduler can be runtime disabled or is only used for certain aspects of scheduling. 3) The function is calling into the scheduler code and that might have unexpected consequences when this is invoked due to a time slice enforcement expiry. Especially when the task managed to clear the grant via sched_yield(0). It would be possible to address #2 and #3 by storing state in the scheduler, but that is extra complexity and fragility for no value. Implement a dedicated per CPU hrtimer instead, which is solely used for the purpose of time slice enforcement. The timer is armed when an extension was granted right before actually returning to user mode in rseq_exit_to_user_mode_restart(). It is disarmed, when the task relinquishes the CPU. This is expensive as the timer is probably the first expiring timer on the CPU, which means it has to reprogram the hardware. But that's less expensive than going through a full hrtimer interrupt cycle for nothing. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20251215155709.068329497@linutronix.de
2026-01-22rseq: Implement syscall entry work for time slice extensionsThomas Gleixner
The kernel sets SYSCALL_WORK_RSEQ_SLICE when it grants a time slice extension. This allows to handle the rseq_slice_yield() syscall, which is used by user space to relinquish the CPU after finishing the critical section for which it requested an extension. In case the kernel state is still GRANTED, the kernel resets both kernel and user space state with a set of sanity checks. If the kernel state is already cleared, then this raced against the timer or some other interrupt and just clears the work bit. Doing it in syscall entry work allows to catch misbehaving user space, which issues an arbitrary syscall, i.e. not rseq_slice_yield(), from the critical section. Contrary to the initial strict requirement to use rseq_slice_yield() arbitrary syscalls are not considered a violation of the ABI contract anymore to allow onion architecture applications, which cannot control the code inside a critical section, to utilize this as well. If the code detects inconsistent user space that result in a SIGSEGV for the application. If the grant was still active and the task was not preempted yet, the work code reschedules immediately before continuing through the syscall. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20251215155709.005777059@linutronix.de
2026-01-22rseq: Implement sys_rseq_slice_yield()Thomas Gleixner
Provide a new syscall which has the only purpose to yield the CPU after the kernel granted a time slice extension. sched_yield() is not suitable for that because it unconditionally schedules, but the end of the time slice extension is not required to schedule when the task was already preempted. This also allows to have a strict check for termination to catch user space invoking random syscalls including sched_yield() from a time slice extension region. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Acked-by: Arnd Bergmann <arnd@arndb.de> Link: https://patch.msgid.link/20251215155708.929634896@linutronix.de