summaryrefslogtreecommitdiff
path: root/io_uring
AgeCommit message (Collapse)Author
5 daysMerge tag 'for-7.2/io_uring-epoll-20260616' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux Pull io_uring epoll update from Jens Axboe: "As discussed a few months ago, this pull request gets rid of allowing nested epoll notification contexts via io_uring. Nested contexts have been a source of issues on the epoll side, and there should not be a need to support them from io_uring. The epoll io_uring side exists mainly to facilitate a gradual migration from a notification based epoll setup to an io_uring ditto" * tag 'for-7.2/io_uring-epoll-20260616' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: io_uring/epoll: disallow adding an epoll file to an epoll context io_uring/epoll: switch to using do_epoll_ctl_file() interface
7 daysMerge tag 'net-next-7.2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking updates from Jakub Kicinski: "Core & protocols: - Work on removing rtnl_lock protection throughout the stack continues. In this chapter: - don't use rtnl_lock for IPv6 multicast routing configuration - don't take rtnl_lock in ethtool for modern drivers - prepare Qdisc dump callbacks for rtnl_lock removal - Support dumping just ifindex + name of all interfaces, under RCU. It's a common operation for Netlink CLI tools (when translating names to ifindexes) and previously required full rtnl_lock. - Support dumping qdiscs and page pools for a specific netdev. Even tho user space wants a dump of all netdevs, most of the time, the OOO programming model results in repeating the dump for each netdev. Which, in absence of a cache, leads to a O(n^2) behavior. - Flush nexthops once on multi-nexthop removal (e.g. when device goes down), another O(n^2) -> O(n) improvement. - Rehash locally generated traffic to a different nexthop on retransmit timeout. - Honor oif when choosing nexthop for locally generated IPv6 traffic. - Convert TCP Auth Option to crypto library, and drop non-RFC algos. - Increase subflow limits in MPTCP to 64 and endpoint limit to 256. - Support MPTCP signaling of IPv6 address + port (ADD_ADDR). We need to selectively skip reporting of the standard TCP Timestamp option, because they won't fit into the header space together (12 + 30 > 40). - Support using bridge neighbor suppression, Duplicate Address Detection, Gratuitous ARP and unsolicited NA forwarding - in EVPN deployments, e.g. VXLAN fabrics (IPv4 and IPv6). - Improve link state reporting for upper netdevs (e.g. macvlan) over tunnel devices (again, mostly for EVPN deployments). - Support binding GENEVE tunnels to a local address. - Speed up UDP tunnel destruction (remove one synchronize_rcu()). - Support exponential field encoding in multicast (IGMPv3 and MLDv2). - Support attaching PSP crypto offload to containers (veth, netkit). - Add a new IPSec Netlink message XFRM_MSG_MIGRATE_STATE that allows migrating individual IPsec SAs independently of their policies. The existing XFRM_MSG_MIGRATE is tightly coupled to policy+SA migration, lacks SPI for unique SA identification, and cannot express reqid changes or migrate Transport mode selectors. The new interface identifies the SA via SPI and mark, supports reqid changes, address family changes, encap removal, and uses an atomic create+install flow under x->lock to prevent SN/IV reuse during AEAD SA migration. - Implement GRO/GSO support for PPPoE. - Convert sockopt callbacks in a number of protocols to iov_iter. Cross-tree stuff: - Remove support for Crypto TFM cloning (unblocked after the TCP Auth Option rework). This feature regressed performance for all crypto API users, since it changed crypto transformation objects into reference-counted objects. - Add FCrypt-PCBC implementation to rxrpc and remove it from the global crypto API as obsolete and insecure. Wireless: - Major rework of station bandwidth handling, fixing issues with lower capability than AP. - Cleanups for EMLSR spec issues (drafts differed). - More Neighbor Awareness Networking (Wi-Fi Aware) work (multicast, schedule improvements, multi-station etc.) - Some Ultra High Reliability (UHR) / IEEE 802.11bn (D1.4) work (e.g. non-primary channel access, UHR DBE support). - Fine Timing Measurement ranging (i.e. distance measurement) APIs. Netfilter: - Use per-rule hash initval in nf_conncount. This avoids unnecessary lock contention with short keys (e.g. conntrack zones) in different namespaces. - Various safety improvements, both in packet parsing and object lifetimes. Notably add refcounts to conntrack timeout policy. Deletions: - Remove TLS + sockmap integration. TLS wants to pin user pages to avoid a copy, and sockmap wants to write to the input stream. More work on this integration is clearly needed, and we can't find any users (original author admitted that they never deployed it). - Remove support for TLS offload with TCP Offload Engine (the far more common opportunistic offload is retained). The locking looks unfixable (driver sleeps under TCP spin locks) and people from the vendor that added this are AWOL. - Remove more ATM code, trying to leave behind only what PPPoATM needs, AAL5 and br2684 with permanent circuits. - Remove AppleTalk. Let it join hamradio in our out of tree protocol graveyard, I mean, repository. - Disable 32-bit x_tables compatibility (32bit binaries on 64bit kernel) interface in user namespaces. To be deleted completely, soon. - Remove 5/10 MHz support from cfg80211/mac80211. Drivers: - Software: - Support DEVMEM/DMABUF Tx over NETMEM_TX_NO_DMA devices (netkit) - bonding: add knob to strictly follow 802.3ad for link state - New drivers: - Alibaba Elastic Ethernet Adaptor (cloud vNIC). - NXP NETC switch within i.MX94. - DPLL: - Add operational state to pins (implement in zl3073x). - Add generic DPLL type, for daisy-chaining DPLLs (implement in ice). - Ethernet high-speed NICs: - Huawei (hinic3): - enhance tc flow offload support with queue selection, tunnels - nVidia/Mellanox: - avoid over-copying payload to the skb's linear part (up to 60% win for LRO on slow CPUs like ARM64 V2) - expose more per-queue stats over the standard API - support additional, unprivileged PFs in the DPU configuration - support Socket Direct (multi-PF) with switchdev offloads - add a pool / frag allocator for DMA mapped buffers for control objects, save memory on systems with 64kB page size - take advantage of the ability to dynamically change RSS table size, even when table is configured by the user - increase the max RSS table size for even traffic distribution - Ethernet NICs: - Marvell/Aquantia: - AQC113 PTP support - Realtek USB (r8152): - support 10Gbit Link Speeds and Energy-Efficient Ethernet (EEE) - support firmware loaded (for RTL8157/RTL8159) - support for the RTL8159 - Intel (ixgbe): - support Energy-Efficient Ethernet (EEE) on E610 devices - Ethernet switches: - Airoha: - support multiple netdevs on a single GDM block / port - Marvell (mv88e6xxx): - support SERDES of mv88e6321 - Microchip (ksz8/9): - rework the driver callbacks to remove one indirection layer - Motorcomm (yt921x): - support port rate policing - support TBF qdisc offload - support ACL/flower offload - nVidia/Mellanox: - expose per-PG rx_discards - Realtek: - rtl8365mb: bridge offloading and VLAN support - Ethernet PHYs: - Airoha: - support Airoha AN8801R Gigabit PHYs. - Micrel: - implement 3 low-loss cable tunables - Realtek: - support MDI swapping for RTL8226-CG - support MDIO for RTL931x - Qualcomm: - at803x: Rx and Tx clock management for IPQ5018 PHY - Motorcomm: - support YT8522 100M RMII PHY - set drive strength in YT8531s RGMII - TI: - dp83822: add optional external PHY clock - Bluetooth: - hci_sync: add support for HCI_LE_Set_Host_Feature [v2] - SMP: use AES-CMAC library API - Intel: - support Product level reset - support smart trigger dump - Mediatek: - add event filter to filter specific event - Realtek: - fix RTL8761B/BU broken LE extended scan - WiFi: - Broadcom (b43): - new support for a 11n device - MediaTek (mt76): - support mt7927 - mt792x: broken usb transport detection - mt7921: regulatory improvements - Qualcomm (ath9k): - GPIO interface improvements - Qualcomm (ath12k): - WDS support - replace dynamic memory allocation in WMI Rx path - thermal throttling/cooling device support - 6 GHz incumbent interference detection - channel 177 in 5 GHz - Realtek (rt89): - RTL8922AU support - USB 3 mode switch for performance - better monitor radiotap support - RTL8922DE preparations" * tag 'net-next-7.2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1778 commits) ipv4: fib_rule: Move fib4_rules_exit() to ->exit(). net: serialize netif_running() check in enqueue_to_backlog() net: skmsg: preserve sg.copy across SG transforms appletalk: move the protocol out of tree appletalk: stop storing per-interface state in struct net_device selftests/bpf: test that TLS crypto is rejected on a sockmap socket selftests/bpf: drop the unused kTLS program from test_sockmap selftests/bpf: remove sockmap + ktls tests tls: remove dead sockmap (psock) handling from the SW path tls: reject the combination of TLS and sockmap atm: remove orphaned uAPI for deleted drivers, protocols and SVCs atm: remove unused ATM PHY operations atm: remove the unused pre_send and send_bh device operations atm: remove the unused change_qos device operation atm: remove SVC socket support and the signaling daemon interface atm: remove the local ATM (NSAP) address registry atm: remove dead SONET PHY ioctls atm: remove the unused send_oam / push_oam callbacks atm: remove AAL3/4 transport support net: dsa: sja1105: fix lastused timestamp in flower stats ...
8 daysMerge tag 'for-7.2/block-20260615' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux Pull block updates from Jens Axboe: - NVMe pull request via Keith: - Per-controller admin and IO timeout sysfs attributes, and letting the block layer set request timeouts (Maurizio, Maximilian) - Multipath passthrough iostats, and PCI P2PDMA enablement for multipath devices (Keith, Kiran) - A new diag sysfs attribute group exporting per-controller counters (retries, multipath failover, error counters, requeue and failure counts, reset and reconnect events) (Nilay) - FDP configuration validation and bounds check fixes (liuxixin) - Various nvmet fixes, including a pre-auth out-of-bounds read in the Discovery Get Log Page handler, auth payload bounds validation, and tcp error-path leak fixes (Bryam, Tianchu, Geliang) - nvme-tcp lockdep and workqueue fixes (Shin'ichiro, Kuniyuki, Eric) - Assorted other fixes and cleanups (John, Yao, Chao, Mateusz, Achkinazi, Wentao) - MD pull request via Yu Kuai: - raid1/raid10 fixes for a deadlock in the read error recovery path, error-path detection and bio accounting with cloned bios, and an nr_pending leak in the REQ_ATOMIC bad-block error path (Abd-Alrhman) - PCI P2PDMA propagation from member devices to the RAID device (Kiran) - dm-raid bio requeue fix, and various smaller fixes and cleanups (Benjamin, Chen, Li, Thorsten) - Enable Clang lock context analysis for the block layer, with the accompanying annotations across queue limits, the blk_holder_ops callbacks, crypto, cgroup, iocost, kyber and mq-deadline (Bart) - Block status code infrastructure work: a tagged status table, a str_to_blk_op() helper, a bio_endio_status() helper, and on top of that a new configurable block-layer error injection facility (Christoph) - DRBD netlink rework, replacing the genl_magic machinery with explicit netlink serialization and moving the DRBD UAPI headers to include/uapi/linux/ (Christoph Böhmwalder) - bvec improvements: a bvec_folio() helper and making the bvec_iter helpers proper inline functions (Willy, Christoph) - ublk cleanups and a canceling-flag fix for the disk-not-allocated case (Caleb, Ming) - Partition handling fixes: bound the AIX pp_count scan, fix an of_node refcount leak, and replace __get_free_page() with kmalloc() (Bryam, Wentao, Mike) - Convert numa_node to int in blk_mq_hw_ctx and ->init_request, and add WQ_PERCPU to the block workqueue users (Mateusz, Marco) - Block statistics and tracing: propagate in-flight to the whole disk on partition IO, export passthrough stats, and a new block_rq_tag_wait tracepoint (Tang, Keith, Aaron) - A round of removals, unexports and cleanups across bio, direct-io and the bvec helpers (Christoph) - Various driver fixes (mtip32xx use-after-free, rbd snap_count validation and strscpy conversion, nbd socket lockdep reclassify, virtio-blk zone report clamp, floppy) and a batch of MAINTAINERS email/list updates (Coly, Li, Yu, Christoph Böhmwalder) - Other little fixes and cleanups all over * tag 'for-7.2/block-20260615' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (117 commits) MAINTAINERS: Update Coly Li's email address block: check bio split for unaligned bvec nbd: Reclassify sockets to avoid lockdep circular dependency block: add configurable error injection block: add a str_to_blk_op helper block: add a "tag" for block status codes block: add a macro to initialize the status table floppy: Drop unused pnp driver data block: propagate in_flight to whole disk on partition I/O virtio-blk: clamp zone report to the report buffer capacity block: optimize I/O merge hot path with unlikely() hints drivers/block/rbd: Use strscpy() to copy strings into arrays partitions: aix: bound the pp_count scan to the ppe array block: Enable lock context analysis block/mq-deadline: Make the lock context annotations compatible with Clang block/Kyber: Make the lock context annotations compatible with Clang block/blk-mq-debugfs: Improve lock context annotations block/blk-iocost: Inline iocg_lock() and iocg_unlock() block/blk-iocost: Split ioc_rqos_throttle() block/crypto: Annotate the crypto functions ...
8 daysMerge tag 'for-7.2/io_uring-20260615' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux Pull io_uring updates from Jens Axboe: - Rework the task_work infrastructure. Both the local (DEFER_TASKRUN) and the normal (tctx) task_work lists were llist based, which is LIFO ordered, and hence each run had to do an O(n) list reversal pass first to restore queue order. Additionally, to cap the amount of task_work run, each method needed a retry list as well. Add a lockless MPCS FIFO queue (based on Dmitry Vyukov's intrusive MPSC algorithm) and switch both task_work lists to it. It performs better than llists and we can then also ditch the retry lists as well as entries are popped one-at-the-time. On top of those changes, run the tctx fallback task_work directly and remove the now-unused per-ctx fallback machinery entirely. - zcrx user notifications. Add a mechanism for zcrx to communicate conditions back to userspace via a dedicated CQE, with the initial users being notification on running out of buffers and on a frag copy fallback, plus shared-memory notification statistics. Alongside that, a series of zcrx reliability and cleanup fixes: more reliable scrubbing, poisoning pointers on unregistration, dropping an extra ifq close, adding a ctx back-pointer, reordering fd allocation in the export path, and killing a dead 'sock' member. - Allow using io_uring registered buffers for plain SEND and RECV, not just for the zero-copy send path. This enables targets like ublk's NBD backend to push/pull IO data directly to/from a registered buffer over a plain send/recv on a TCP socket. - Registered buffer improvements: account huge pages correctly, bump the io_mapped_ubuf length field to size_t, and raise the previous 1GB registered buffer size limit. - Restrict the ctx access exposed to io_uring BPF struct_ops programs by handing them an opaque type rather than the full io_ring_ctx, and add a separate MAINTAINERS entry for the bpf-ops code. - Allow opcode filtering on IORING_OP_CONNECT. - Validate ring-provided buffer addresses with access_ok(), and align the legacy buffer add limit with MAX_BIDS_PER_BGID. - Various other cleanups and minor fixes, including avoiding msghdr async data on connect/bind, dropping async_size for OP_LISTEN, making the POLL_FIRST receive side checks consistent, re-checking IO_WQ_BIT_EXIT for each linked work item, and using trace_call__##name() at guarded tracepoint call sites. * tag 'for-7.2/io_uring-20260615' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (31 commits) io_uring/bpf-ops: add a separate maintainer entry io_uring/net: make POLL_FIRST receive side checks consistent io_uring: remove the per-ctx fallback task_work machinery io_uring: run the tctx task_work fallback directly io_uring: switch normal task_work to a mpscq io_uring: switch local task_work to a mpscq io_uring/mpscq: add lockless multi-producer, single-consumer FIFO queue io_uring: grab RCU read lock marking task run io_uring/zcrx: kill dead 'sock' member in struct io_zcrx_args io_uring/kbuf: validate ring provided buffer addresses with access_ok() io_uring/net: support registered buffer for plain send and recv io_uring/nop: Drop a wrong comment in struct io_nop io_uring/net: Remove async_size for OP_LISTEN io_uring/net: Avoid msghdr on op_connect/op_bind async data io_uring/bpf-ops: restrict ctx access to BPF io_uring/io-wq: re-check IO_WQ_BIT_EXIT for each linked work item io_uring/kbuf: align legacy buffer add limit with MAX_BIDS_PER_BGID io_uring/zcrx: add shared-memory notification statistics io_uring/zcrx: notify user on frag copy fallback io_uring/zcrx: notify user when out of buffers ...
8 daysMerge tag 'v7.2-p1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 Pull crypto updates from Herbert Xu: "API: - Drop support for off-CPU cryptography in af_alg - Document that af_alg is *always* slower - Document the deprecation of af_alg - Remove zero-copy support from skcipher and aead in af_alg - Cap AEAD AD length to 0x80000000 in af_alg - Free default RNG on module exit Algorithms: - Fix vli multiplication carry overflow in ecc - Drop unused cipher_null crypto_alg - Remove unused variants of drbg - Use lib/crypto in drbg - Use memcpy_from/to_sglist in authencesn - Allow authenc(hmac(sha{256,384}),cts(cbc(aes))) in FIPS mode - Disallow RSA PKCS#1 SHA-1 sig algs in FIPS mode - Filter out async aead implementations at alloc in krb5 - Fix non-parallel fallback by rstoring callback in pcrypt - Validate poly1305 template argument in chacha20poly1305 Drivers: - Add sysfs PCI reset support to qat - Add KPT support for GEN6 devices to qat - Remove unused character device and ioctls from qat - Add support for hw access via SMCC to mtk - Remove prng support from crypto4xx - Remove prng support from hisi-trng - Remove prng support from sun4i-ss - Remove prng support from xilinx-trng - Remove loongson-rng - Remove exynos-rng Others: - Remove support for AIO on sockets" * tag 'v7.2-p1' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (196 commits) crypto: tegra - fix refcount leak in tegra_se_host1x_submit() crypto: rng - Free default RNG on module exit crypto: testmgr - allow authenc(hmac(sha{256,384}),cts(cbc(aes))) in FIPS mode hwrng: jh7110 - fix refcount leak in starfive_trng_read() crypto: atmel-ecc - drop dead code in atmel_ecdh_max_size crypto: cavium/cpt - fix DMA cleanup using wrong loop index crypto: marvell/octeontx - fix DMA cleanup using wrong loop index MAINTAINERS: make myself the maintainer of the Qualcomm QCE driver crypto: amcc - convert irq_of_parse_and_map to platform_get_irq crypto: sun4i-ss - Remove insecure and unused rng_alg hwrng: xilinx - Move xilinx-rng into drivers/char/hw_random/ crypto: xilinx-trng - Replace crypto_drbg_ctr_df() with HMAC-SHA512 crypto: xilinx-trng - Fix return value of xtrng_hwrng_trng_read() crypto: xilinx-trng - Remove crypto_rng interface crypto: exynos-rng - Remove exynos-rng driver hwrng: hisi-trng - Move hisi-trng into drivers/char/hw_random/ crypto: hisi-trng - Remove crypto_rng interface crypto: loongson - Remove broken and unused loongson-rng crypto: crypto4xx - Remove insecure and unused rng_alg crypto: qat - validate RSA CRT component lengths ...
8 daysMerge tag 'slab-for-7.2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab Pull slab updates from Vlastimil Babka: - Support for "allocation tokens" (currently available in Clang 22+) for smarter partitioning of kmalloc caches based on the allocated object type, which can be enabled instead of the "random" per-caller-address-hash partitioning. It should be able to deterministically separate types containing a pointer from those that do not (Marco Elver) - Improvements and simplification of the kmem_cache_alloc_bulk() and mempool_alloc_bulk() API. This includes adaptation of callers (Christoph Hellwig) - Performance improvements and cleanups related mostly to sheaves refill (Hao Li, Shengming Hu, Vlastimil Babka) - Several fixups for the slabinfo tool (Xuewen Wang) * tag 'slab-for-7.2' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab: mm/slab: do not limit zeroing to orig_size when only red zoning is enabled mm/slub: preserve original size in _kmalloc_nolock_noprof retry path mm: simplify the mempool_alloc_bulk API mm/slab: improve kmem_cache_alloc_bulk mm/slub: detach and reattach partial slabs in batch mm/slub: introduce helpers for node partial slab state mm/slub: use empty sheaf helpers for oversized sheaves tools/mm/slabinfo: remove redundant slab->partial assignment tools/mm/slabinfo: remove dead assignment in get_obj_and_str() tools/mm/slabinfo: Fix trace disable logic inversion MAINTAINERS: add slab-related scripts and tools to SLAB ALLOCATOR mm/slub: fix typo in sheaves comment mm, slab: simplify returning slab in __refill_objects_node() mm, slab: add an optimistic __slab_try_return_freelist() slab: fix kernel-docs for mm-api slab: improve KMALLOC_PARTITION_RANDOM randomness slab: support for compiler-assisted type-based slab cache partitioning mm/slub: defer freelist construction until after bulk allocation from a new slab
8 daysnetdev: expose io_uring rx_page_order order via netlinkDragos Tatulea
This adds observability for the io_uring zcrx rx-buf-len configuration. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Yael Chemla <ychemla@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Acked-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://patch.msgid.link/20260612211709.1456966-3-dtatulea@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
9 daysMerge tag 'locking-core-2026-06-14' of ↵Linus Torvalds
gitolite.kernel.org:pub/scm/linux/kernel/git/tip/tip Pull locking updates from Ingo Molnar: "Futex updates: - Optimize futex hash bucket access patterns (Peter Zijlstra) - Large series to address the robust futex unlock race for real, by Thomas Gleixner: "The robust futex unlock mechanism is racy in respect to the clearing of the robust_list_head::list_op_pending pointer because unlock and clearing the pointer are not atomic. The race window is between the unlock and clearing the pending op pointer. If the task is forced to exit in this window, exit will access a potentially invalid pending op pointer when cleaning up the robust list. That happens if another task manages to unmap the object containing the lock before the cleanup, which results in an UAF. In the worst case this UAF can lead to memory corruption when unrelated content has been mapped to the same address by the time the access happens. User space can't solve this problem without help from the kernel. This series provides the kernel side infrastructure to help it along: 1) Combined unlock, pointer clearing, wake-up for the contended case 2) VDSO based unlock and pointer clearing helpers with a fix-up function in the kernel when user space was interrupted within the critical section. ... with help by André Almeida: - Add a note about robust list race condition (André Almeida) - Add self-tests for robust release operations (André Almeida) Context analysis updates: - Implement context analysis for 'struct rt_mutex'. (Bart Van Assche) - Bump required Clang version to 23 (Marco Elver) Guard infrastructure updates: - Series to remove NULL check from unconditional guards (Dmitry Ilvokhin) Lockdep updates: - Restore self-test migrate_disable() and sched_rt_mutex state on PREEMPT_RT (Karl Mehltretter) Membarriers updates: - Use per-CPU mutexes for targeted commands (Aniket Gattani) - Modernize membarrier_global_expedited with cleanup guards (Aniket Gattani) - Add rseq stress test for CFS throttle interactions (Aniket Gattani) percpu-rwsems updates: - Extract __percpu_up_read() to optimize inlining overhead (Dmitry Ilvokhin) Seqlocks updates: - Allow UBSAN_ALIGNMENT to fail optimizing (Heiko Carstens) Lock tracing: - Add contended_release tracepoint to sleepable locks such as mutexes, percpu-rwsems, rtmutexes, rwsems and semaphores (Dmitry Ilvokhin) MAINTAINERS updates: - MAINTAINERS: Add RUST [SYNC] entry (Boqun Feng) Misc updates and fixes by Randy Dunlap, YE WEI-HONG, Fabricio Parra, Dmitry Ilvokhin and Peter Zijlstra" * tag 'locking-core-2026-06-14' of gitolite.kernel.org:pub/scm/linux/kernel/git/tip/tip: (36 commits) locking: Add contended_release tracepoint to sleepable locks locking/percpu-rwsem: Extract __percpu_up_read() tracing/lock: Remove unnecessary linux/sched.h include futex: Optimize futex hash bucket access patterns rust: sync: completion: Mark inline complete_all and wait_for_completion MAINTAINERS: Add RUST [SYNC] entry cleanup: Specify nonnull argument index selftests: futex: Add tests for robust release operations Documentation: futex: Add a note about robust list race condition x86/vdso: Implement __vdso_futex_robust_try_unlock() x86/vdso: Prepare for robust futex unlock support futex: Provide infrastructure to plug the non contended robust futex unlock race futex: Add robust futex unlock IP range futex: Add support for unlocking robust futexes futex: Cleanup UAPI defines x86: Select ARCH_MEMORY_ORDER_TSO uaccess: Provide unsafe_atomic_store_release_user() futex: Provide UABI defines for robust list entry modifiers futex: Move futex related mm_struct data into a struct futex: Make futex_mm_init() void ...
11 daysio_uring/net: make POLL_FIRST receive side checks consistentJens Axboe
io_recv() and io_recvzc() are the odd ones out, as they checks for whether POLL_FIRST should be honored before checking if the file is a socket. It doesn't really matter, but might as well make it consistent across all receive and send types. Signed-off-by: Jens Axboe <axboe@kernel.dk>
11 daysio_uring: remove the per-ctx fallback task_work machineryJens Axboe
With the tctx fallback running its entries directly, the per-ctx fallback work has a single user left: moving local (DEFER_TASKRUN) task_work entries out of a ring that is going away. Both of its call sites are process context and don't hold ->uring_lock, the same conditions the deferred fallback work itself ran under - so run the entries in cancel mode right there instead, and rename the helper to io_cancel_local_task_work() to match what it now does. With that, ->fallback_llist, ->fallback_work, io_fallback_req_func() and __io_fallback_tw() can all go away, along with the fallback work flushing in the ring exit and cancel paths. Requests that get orphaned by an exiting task now run via the tctx fallback work, which the ring exit side implicitly waits on through the ctx refs those requests hold. Signed-off-by: Jens Axboe <axboe@kernel.dk>
11 daysio_uring: run the tctx task_work fallback directlyJens Axboe
The fallback work drains the tctx queue only to redistribute the entries into the per-ctx fallback lists, bouncing them through a second (per-ctx) work item before they finally run. That made sense when the producer side did the draining and could be in any context, but the fallback work is a regular process context kworker: it can just run the entries itself. Reuse the normal run loop - if run from the fallback kernel thread, ts.cancel will get set, and the work terminated. Signed-off-by: Jens Axboe <axboe@kernel.dk>
11 daysio_uring: switch normal task_work to a mpscqJens Axboe
Like the local task_work list, the normal (tctx) task_work list is an llist, and hence needs the O(n) llist_reverse_order() pass before running entries in queue order. On top of that, capped runs - sqpoll processing IORING_TW_CAP_ENTRIES_VALUE entries at a time - need the claimed-but-unprocessed leftovers carried in a separate retry_list, as they can't be pushed back to the shared list. Switch tctx->task_list to a mpscq, like what was done for the DEFER_TASKRUN paths as well. Signed-off-by: Jens Axboe <axboe@kernel.dk>
11 daysio_uring: switch local task_work to a mpscqJens Axboe
The local (DEFER_TASKRUN) task_work list is an llist, which is LIFO ordered, and hence __io_run_local_work() has to restore the right running order with an O(n) llist_reverse_order() pass first. On top of that, a batch that gets capped by max_events needs the leftover entries parked on a separate ->retry_llist, as they can't be pushed back to the shared list. Switch it to the FIFO mpscq. Adds are wait-free instead of a cmpxchg retry loop, entries are popped in queue order with no reversal pass, capping a run simply leaves the remainder on the queue, and ->retry_llist goes away entirely. The consumer cursor, ->work_head, lives with the rest of the ->uring_lock protected state rather than next to the queue, so that popping entries doesn't dirty the producer side cacheline. For low amounts of task_work, this ends up being a bit more efficient than the existing scheme. As an example of that, doing multishot receives for 8 clients has the following task_work overhead: 1.02% sock-test [kernel.kallsyms] [k] io_req_local_work_add 0.88% sock-test [kernel.kallsyms] [k] __io_run_local_work_loop 0.60% sock-test [kernel.kallsyms] [k] llist_reverse_order 0.14% sock-test [kernel.kallsyms] [k] __io_run_local_work 2.64% at ~46Gb/sec and after this change: 1.08% sock-test [kernel.kallsyms] [k] io_req_local_work_add 1.03% sock-test [kernel.kallsyms] [k] __io_run_local_work 2.11% at ~53Gb/sec which has less overhead even though that test run was faster. For a case of having 1024 clients on a single ring: 2.22% sock-test [kernel.kallsyms] [k] llist_reverse_order 0.84% sock-test [kernel.kallsyms] [k] __io_run_local_work_loop 0.42% sock-test [kernel.kallsyms] [k] io_req_local_work_add 0.02% sock-test [kernel.kallsyms] [k] __io_run_local_work 3.50% at ~24Gb/sec we start to see the llist reversing taking a considerable amount of time, and the total add+run task_work overhead is around 3.5%. After the change: 0.90% sock-test [kernel.kallsyms] [k] __io_run_local_work 0.42% sock-test [kernel.kallsyms] [k] io_req_local_work_add 1.32% at ~26Gb/sec most of that overhead is gone, and performance is better as well. Caleb Sander Mateos <csander@purestorage.com> reports that it improves the performance of a ublk 4kb workload by 4% [1], while testing v1 of this patchset. [1] https://lore.kernel.org/io-uring/CADUfDZr-MMYBaP-e+y9+xuRhuiunO2sBTUCmwZyd7AgT8sVtiQ@mail.gmail.com/ Signed-off-by: Jens Axboe <axboe@kernel.dk>
11 daysio_uring/mpscq: add lockless multi-producer, single-consumer FIFO queueJens Axboe
Local task_work is currently using llists for managing the work, but that's a LIFO type of list. This means that running this task_work needs to reverse the list first, to ensure fairness in running the queued items. Add a lockless FIFO queued, based on Dmitry Vyukov's intrusive MPSC node-based queue algorithm, modified with an externally held consumer cursor and conditional stub reinsertion. See comments in the header. Producers are wait-free: a push is a single xchg() on the queue tail, which serializes concurrent producers and defines the FIFO order, plus a store linking the node to its predecessor. There are no cmpxchg retry loops, and pushing is safe from any context, including hardirq. The cost of linked list FIFO ordering is that a push publishes the node in two steps - the xchg() makes it visible as the new tail before the subsequent store links it into the chain that is reachable from the head. A consumer hitting that window gets a NULL from mpscq_pop() while mpscq_empty() reports false, and must retry later rather than treat the queue as empty. The window is two instructions wide, but a producer can get preempted inside it, so the consumer must not busy wait on it. The consumer side supports a single consumer at a time, with callers providing their own serialization. A stub node, which also defines the empty state (tail == stub), allows the consumer to detach the final node without racing against producer link stores: that node is only handed out once the stub has been cmpxchg'ed back in as the tail. This also guarantees that the previous tail returned by mpscq_push() cannot get freed before that push has linked it, making it always valid for comparisons. The consumer cursor is deliberately not part of the queue struct - the caller owns it and passes it to mpscq_pop(). This is done to separate the consumer and producers cacheline. The cursor is written for every popped entry, and keeping it on the same cacheline as ->tail would have the consumer invalidating the line that producers need for every push. Keeping it external lets the caller place it with its own consumer side data instead. Reviewed-by: Caleb Sander Mateos <csander@purestorage.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
11 daysio_uring: grab RCU read lock marking task runJens Axboe
Not required right now, as io_req_local_work_add() already calls this helper with the RCU read lock held. But in preparation for that not being the case, grab it locally. Reviewed-by: Caleb Sander Mateos <csander@purestorage.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
11 daysMerge tag 'io_uring-7.1-20260611' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux Pull io_uring fixes from Jens Axboe: - Tweak for an off-by-one in the CQ ring accounting for the min wait support. - Don't truncate end buffer length for a bundle, as the transfer might not happen. It's not required in the first place, as the completion side handles this condition already. * tag 'io_uring-7.1-20260611' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: io_uring/wait: fix min_timeout behavior io_uring/kbuf: don't truncate end buffer for bundles
12 daysio_uring/zcrx: kill dead 'sock' member in struct io_zcrx_argsJens Axboe
This member is only ever assigned, never read. Kill it. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-06-08io_uring/kbuf: validate ring provided buffer addresses with access_ok()Jens Axboe
Commit: 809b997a5ce9 ("x86-64/arm64/powerpc: clean up and rename __copy_from_user_flushcache") sanitized that any provided copy helper should separately validate destination and source addresses, but we should also ensure that anything that is retrieved from a buffer is validated upfront. For ring provided buffers, always include an access_ok() when grabbing a new buffer. Fixes: c7fb19428d67 ("io_uring: add support for ring mapped supplied buffers") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-06-08io_uring/net: support registered buffer for plain send and recvMing Lei
So far IORING_RECVSEND_FIXED_BUF is only honoured on the SEND_ZC path, even though the import wiring is already present for plain send and completely absent for recv. Targets such as ublk's NBD backend want to push/pull I/O data directly to/from an io_uring registered buffer over a plain send/recv on a TCP socket. Wire IORING_RECVSEND_FIXED_BUF into the plain IORING_OP_SEND and IORING_OP_RECV paths: - Accept the flag in SENDMSG_FLAGS / RECVMSG_FLAGS and, at prep time, restrict it to the non-vectorized IORING_OP_SEND / IORING_OP_RECV opcodes. It is mutually exclusive with buffer select, bundles and (for recv) multishot, and records sqe->buf_index. - For recv, set REQ_F_IMPORT_BUFFER in setup so the registered buffer is imported lazily at issue time, mirroring the send path. - In io_send()/io_recv(), import the registered buffer via io_import_reg_buf() (ITER_SOURCE for send, ITER_DEST for recv) and clear REQ_F_IMPORT_BUFFER. The resulting bvec iter persists in async_data, so MSG_WAITALL partial send/recv retries resume at the right offset. Signed-off-by: Ming Lei <tom.leiming@gmail.com> Link: https://patch.msgid.link/20260608142511.659240-2-ming.lei@redhat.com [axboe: combine flags checks] Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-06-07io_uring/wait: fix min_timeout behaviorChristian A. Ehrhardt
The wakeup condition if a min timeout is present and has expired is that at least _one_ CQE was posted. Thus set the cq_tail target to ->cq_min_tail + 1. Without this commit a spurious wakeup can result in a premature wakeup because io_should_wake() will return true even if _no_ CQE was posted at all. Cc: Tip ten Brink <tip@tenbrinkmeijs.com> Fixes: e15cb2200b93 ("io_uring: fix min_wait wakeups for SQPOLL") Cc: stable@vger.kernel.org Signed-off-by: Christian A. Ehrhardt <lk@c--e.de> Link: https://patch.msgid.link/20260606201120.1441447-1-lk@c--e.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-06-07io_uring/kbuf: don't truncate end buffer for bundlesJens Axboe
If buffers have been peeked for a bundle receive, the kernel will truncate the end buffer, if the available length is shorter than the buffer itself. This is unnecessary, as applications iterating bundle receives must always use the minimum size of the buffer length and the remaining number of bytes in the bundle. The examples in liburing do that as well, eg examples/proxy.c. If the kernel does truncate this buffer AND the current transfer fails, then the buffer will be left with a smaller size than what is otherwise available. Just remove the buffer truncation, as it's not necessary in the first place. Link: https://lore.kernel.org/io-uring/CAAEr8jbY60noGj1fw_k91UJRBkyiRVoS6=nLhZ7Svwidjn4CAA@mail.gmail.com/ Reported-by: Federico Brasili <federico.brasili@gmail.com> Cc: stable@vger.kernel.org Fixes: 35c8711c8fc4 ("io_uring/kbuf: add helpers for getting/peeking multiple buffers") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-06-05Merge tag 'io_uring-7.1-20260605' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux Pull io_uring fix from Jens Axboe: "A single fix for a missing flag mask when multishot is used with an incrementally consumed buffer ring, potentially leading to application confusion because of lack of IORING_CQE_F_BUF_MORE consistency" * tag 'io_uring-7.1-20260605' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: io_uring/net: inherit IORING_CQE_F_BUF_MORE across bundle recv retries
2026-06-05io_uring/net: inherit IORING_CQE_F_BUF_MORE across bundle recv retriesClément Léger
When a bundle recv retries inside io_recv_finish(), the merge logic OR the saved cflags from the previous iteration with the cflags returned by the new iteration: cflags = req->cqe.flags | (cflags & CQE_F_MASK); Bits listed in CQE_F_MASK are inherited from the new iteration, and all other bits (notably IORING_CQE_F_BUFFER and the buffer ID) come from the saved cflags. Before this change CQE_F_MASK covered only IORING_CQE_F_SOCK_NONEMPTY and IORING_CQE_F_MORE. When using provided buffer rings (IOU_PBUF_RING_INC) with incremental mode, and bundle recv, io_kbuf_inc_commit() can leave the head ring entry partially consumed, __io_put_kbufs() then sets IORING_CQE_F_BUF_MORE on the returned cflags so userspace knows the buffer ID will be reused for subsequent completions. Because IORING_CQE_F_BUF_MORE was not in CQE_F_MASK, the merge above silently dropped it whenever the final retry iteration partially consumed the buffer, and the subsequent req->cqe.flags = cflags & ~CQE_F_MASK save would have left a stale IORING_CQE_F_BUF_MORE in the carried-over cflags had one been present. Userspace would then wrongfully advance it ring head past an entry the kernel still uses. Add IORING_CQE_F_BUF_MORE to CQE_F_MASK so it is both inherited from the new iteration into the user-visible CQE and stripped from the saved cflags between iterations. Cc: stable@vger.kernel.org Signed-off-by: Clément Léger <cleger@meta.com> Assisted-by: Claude:claude-opus-4.6 Fixes: ae98dbf43d75 ("io_uring/kbuf: add support for incremental buffer consumption") Link: https://patch.msgid.link/20260604160715.2482972-1-cleger@meta.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-06-03mm/slab: improve kmem_cache_alloc_bulkChristoph Hellwig
The kmem_cache_alloc_bulk return value is weird. It returns the number of allocated objects, but that must always be 0 or the requested number based on the implementations and the handling in the callers, but that assumption is not actually documented anywhere, which confuses automated review tools. Fix this by returning a bool if the allocation succeeded and adding a kerneldoc comment explaining the API. [rob.clark@oss.qualcomm.com: fixups in msm_iommu_pagetable_prealloc_allocate() ] Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com> # skbuff Link: https://patch.msgid.link/20260528093437.2519248-2-hch@lst.de Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
2026-06-03futex: Add support for unlocking robust futexesThomas Gleixner
Unlocking robust non-PI futexes happens in user space with the following sequence: 1) robust_list_set_op_pending(mutex); 2) robust_list_remove(mutex); lval = 0; 3) lval = atomic_xchg(lock, lval); 4) if (lval & WAITERS) 5) sys_futex(WAKE,....); 6) robust_list_clear_op_pending(); That opens a window between #3 and #6 where the mutex could be acquired by some other task which observes that it is the last user and: A) unmaps the mutex memory B) maps a different file, which ends up covering the same address When the original task exits before reaching #6 then the kernel robust list handling observes the pending op entry and tries to fix up user space. In case that the newly mapped data contains the TID of the exiting thread at the address of the mutex/futex the kernel will set the owner died bit in that memory and therefore corrupting unrelated data. PI futexes have a similar problem both for the non-contented user space unlock and the in kernel unlock: 1) robust_list_set_op_pending(mutex); 2) robust_list_remove(mutex); lval = gettid(); 3) if (!atomic_try_cmpxchg(lock, lval, 0)) 4) sys_futex(UNLOCK_PI,....); 5) robust_list_clear_op_pending(); Address the first part of the problem where the futexes have waiters and need to enter the kernel anyway. Add a new FUTEX_ROBUST_UNLOCK flag, which is valid for the sys_futex() FUTEX_UNLOCK_PI, FUTEX_WAKE, FUTEX_WAKE_BITSET operations. This deliberately omits FUTEX_WAKE_OP from this treatment as it's unclear whether this is needed and there is no usage of it in glibc either to investigate. For the futex2 syscall family this needs to be implemented with a new syscall. The sys_futex() case [ab]uses the @uaddr2 argument to hand the pointer to robust_list_head::list_pending_op into the kernel. This argument is only evaluated when the FUTEX_ROBUST_UNLOCK bit is set and is therefore backward compatible. This is an explicit argument to avoid the lookup of the robust list pointer and retrieving the pending op pointer from there. User space has the pointer already available so it can just put it into the @uaddr2 argument. Aside of that this allows the usage of multiple robust lists in the future without any changes to the internal functions as they just operate on the provided pointer. This requires a second flag FUTEX_ROBUST_LIST32 which indicates that the robust list pointer points to an u32 and not to an u64. This is required for two reasons: 1) sys_futex() has no compat variant 2) The gaming emulators use both both 64-bit and compat 32-bit robust lists in the same 64-bit application As a consequence 32-bit applications have to set this flag unconditionally so they can run on a 64-bit kernel in compat mode unmodified. 32-bit kernels return an error code when the flag is not set. 64-bit kernels will happily clear the full 64 bits if user space fails to set it. In case of FUTEX_UNLOCK_PI this clears the robust list pending op when the unlock succeeded. In case of errors, the user space value is still locked by the caller and therefore the above cannot happen. In case of FUTEX_WAKE* this does the unlock of the futex in the kernel and clears the robust list pending op when the unlock was successful. If not, the user space value is still locked and user space has to deal with the returned error. That means that the unlocking of non-PI robust futexes has to use the same try_cmpxchg() unlock scheme as PI futexes. If the clearing of the pending list op fails (fault) then the kernel clears the registered robust list pointer if it matches to prevent that exit() will try to handle invalid data. That's a valid paranoid decision because the robust list head sits usually in the TLS and if the TLS is not longer accessible then the chance for fixing up the resulting mess is very close to zero. The problem of non-contended unlocks still exists and will be addressed separately. Signed-off-by: Thomas Gleixner <tglx@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: André Almeida <andrealmeid@igalia.com> Link: https://patch.msgid.link/20260602090535.670514505@kernel.org
2026-06-02io_uring/nop: Drop a wrong comment in struct io_nopGabriel Krisman Bertazi
This was copy-pasted from io_rw, where the comment actually makes sense. Here, we don't have struct kiocb. Drop the comment. Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de> Link: https://patch.msgid.link/20260602215327.1885109-4-krisman@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-06-02io_uring/net: Remove async_size for OP_LISTENGabriel Krisman Bertazi
OP_LISTEN does not use async_data. Remove it. Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de> Link: https://patch.msgid.link/20260602215327.1885109-3-krisman@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-06-02io_uring/net: Avoid msghdr on op_connect/op_bind async dataGabriel Krisman Bertazi
Both IORING_OP_CONNECT and IORING_OP_BIND reuse the msghdr object just to store the sockaddr. Beyond allocating a much larger object than needed, msghdr can also wrap an iovec, which will be recycled unnecessarily. This uses the sockaddr directly. Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de> Link: https://patch.msgid.link/20260602215327.1885109-2-krisman@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-06-02io_uring/bpf-ops: restrict ctx access to BPFPavel Begunkov
BPF programs should have no need in looking into struct io_ring_ctx, if anything, most of such cases would be anti patterns like looking up ring indices directly via the context. Replace it with a new empty structure, which is just an alias to struct io_ring_ctx. It'll create a new BTF type and fail verification if a BPF program tries to access it (beyond the first byte). It'll also give more flexibility for the future, and otherwise it can be made aligned with io_ring_ctx as before with struct groups if ever needed or extended in a different way. Fixes: d0e437b76bd3c ("io_uring/bpf-ops: implement loop_step with BPF struct_ops") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://patch.msgid.link/5f6ca3649e9e0bae8667db4357e28dd00cd07901.1780394491.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-05-29Merge tag 'io_uring-7.1-20260529' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux Pull io_uring fix from Jens Axboe: "Just a single fix for a regression introduced in this cycle, where we should ensure the node is visible before the entry is added to the tctx list" * tag 'io_uring-7.1-20260529' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: io_uring/tctx: set ->io_uring before publishing the tctx node
2026-05-29block: Add bvec_folio()Matthew Wilcox (Oracle)
This is a simple helper which replaces page_folio(bvec->bv_page). Minor improvement in readability, but the real motivation is to reduce the number of references to bvec->bv_page so that it can be changed with less work. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Leon Romanovsky <leon@kernel.org> Reviewed-by: Hannes Reinecke <hare@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: William Kucharski <william.kucharski@linux.dev> Link: https://patch.msgid.link/20260528175905.1102280-2-willy@infradead.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-05-29net: Remove support for AIO on socketsDemi Marie Obenour
The only user of msg->msg_iocb was AF_ALG, but that's deprecated. It can be removed entirely at the cost of only supporting synchronous operations. This doesn't break userspace, which will silently block (for a bounded amount of time) in io_submit instead of operating asynchronously. This also makes struct msghdr smaller, helping every other caller of sendmsg(). Signed-off-by: Demi Marie Obenour <demiobenour@gmail.com> Acked-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2026-05-28io_uring/io-wq: re-check IO_WQ_BIT_EXIT for each linked work itemRunyu Xiao
commit 10dc95939817 ("io_uring/io-wq: check IO_WQ_BIT_EXIT inside work run loop") fixed the obvious case where io_worker_handle_work() took one exit-bit snapshot before draining pending work, but the fix stops one level too early. io_worker_handle_work() now re-checks IO_WQ_BIT_EXIT in its outer work run loop, yet it still snapshots that bit once before processing a whole dependent linked-work chain. If io_wq_exit_start() sets IO_WQ_BIT_EXIT after the first linked item has started, the remaining linked items can still reuse stale do_kill = false, skip IO_WQ_WORK_CANCEL, and continue running after exit has begun. Move the check further inside, so it covers linked items too. Note: this is a syzbot special as it loves setting up tons of slow linked work on weird devices like msr that take forever to read, and immediately close the ring. Exit then takes a long time. Fixes: 10dc95939817 ("io_uring/io-wq: check IO_WQ_BIT_EXIT inside work run loop") Cc: stable@vger.kernel.org Signed-off-by: Runyu Xiao <runyu.xiao@seu.edu.cn> Link: https://patch.msgid.link/20260527172203.2043962-1-runyu.xiao@seu.edu.cn Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-05-28io_uring/kbuf: align legacy buffer add limit with MAX_BIDS_PER_BGIDliyouhong
io_provide_buffers_prep() accepts nbufs up to MAX_BIDS_PER_BGID, but io_add_buffers() stops when bl->nbufs reaches USHRT_MAX. This makes the effective add limit one lower than the validated limit. Use MAX_BIDS_PER_BGID in the add-side boundary check so validation and execution use the same limit, and update the comment to refer to the actual limit constant. Signed-off-by: liyouhong <liyouhong@kylinos.cn> Link: https://patch.msgid.link/20260528024936.3672659-1-dayou5941@163.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-05-26io_uring/zcrx: add shared-memory notification statisticsClément Léger
Add support for an optional stats struct embedded in the refill queue region, allowing userspace to monitor copy-fallback in real-time. Userspace queries the stats struct size and alignment via IO_URING_QUERY_ZCRX_NOTIF (notif_stats_size / notif_stats_alignment), then provides a stats_offset in zcrx_notification_desc pointing to a location within the refill queue region. The kernel updates the stats counters in-place on every copy-fallback event. Signed-off-by: Clément Léger <cleger@meta.com> [pavel: rename io_uring_zcrx_notif_stats] Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://patch.msgid.link/f6af5a21015efea4b733b9d77aba22c637788fe4.1779189667.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-05-26io_uring/zcrx: notify user on frag copy fallbackClément Léger
Add a ZCRX_NOTIF_COPY notification type to signal userspace when a received fragment could not be delivered using zero-copy and was instead copied into a buffer. Signed-off-by: Clément Léger <cleger@meta.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://patch.msgid.link/3d54bcd8bf10b3a1e88beb0cd39c40c3937bea4f.1779189667.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-05-26io_uring/zcrx: notify user when out of buffersPavel Begunkov
There are currently no easy ways for the user to know if zcrx is out of buffers and page pool fails to allocate. Add uapi for zcrx to communicate it back. It's implemented as a separate CQE, which for now is posted to the creator ctx. To use it, on registration the user space needs to pass an instance of struct zcrx_notification_desc, which tells the kernel the user_data for resulting CQEs and which event types are expected / allowed. When an allowed event happens, zcrx will post a CQE containing the specified user_data, and lower bits of cqe->res will be set to the event mask. Before the kernel could post another notification of the given type, the user needs to acknowledge that it processed the previous one by issuing IORING_REGISTER_ZCRX_CTRL with ZCRX_CTRL_ARM_NOTIFICATION. The only notification type the patch implements is ZCRX_NOTIF_NO_BUFFERS, but we'll need more of them in the future. Co-developed-by: Vishwanath Seshagiri <vishs@meta.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Vishwanath Seshagiri <vishs@meta.com> Link: https://patch.msgid.link/35cd307a03a43583838a2e151fc641c69abd786f.1779189667.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-05-26io_uring/zcrx: add ctx pointer to zcrxPavel Begunkov
zcrx will need to have a pointer to an owning ctx to communicate different events. Reference the ctx while it's attached to zcrx, and rely on zcrx termination to drop the ctx to avoid circular ref deps. Co-developed-by: Vishwanath Seshagiri <vishs@meta.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Vishwanath Seshagiri <vishs@meta.com> Link: https://patch.msgid.link/b60514b3d1bd92f571e3bd91751166f8c3599256.1779189667.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-05-26io_uring/zcrx: reorder fd allocation in zcrx_export()Bertie Tryner
Currently, zcrx_export() allocates a file descriptor and copies the control structure to userspace before the backing file is created. While the operation returns an error on failure, it is cleaner to follow the standard kernel pattern of performing the copy_to_user() and fd_install() only after all resource allocations (like the anon_inode) have succeeded. This aligns the code with other fd-publishing paths in the VFS. Signed-off-by: Bertie Tryner <Bertie.Tryner@warwick.ac.uk> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://patch.msgid.link/1513a3f4ae7161692ca6e991b9f01278a6bc60e4.1779189667.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-05-26io_uring/zcrx: remove extra ifq closePavel Begunkov
By the time io_zcrx_ifq_free() is called the interface queue should already be closed, so io_close_queue() will be a no-op. Remove the call and add a couple of warnings. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://patch.msgid.link/be6c4a283a5bab5440e22fbccafe7b885acb7abc.1779189667.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-05-26io_uring/zcrx: poison pointers on unregistrationPavel Begunkov
Nobody should be touching area and other pointers after zcrx destruction, poison them instead of zeroing. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://patch.msgid.link/19112d1412539dcfc04a0317b5812e968623bc51.1779189667.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-05-26io_uring/zcrx: make scrubbing more reliablePavel Begunkov
Currently, scrubbing is done once before killing all recvzc requests. It's fine as those are cancelled and don't return buffers afterwards, but it'll be more reliable not to rely that much on cancellations. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://patch.msgid.link/c4ea127023494cbbedebd21a2b7ae5ff0448eb95.1779189667.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-05-24io_uring/tctx: set ->io_uring before publishing the tctx nodeLim HyeonJun
io_register_iowq_max_workers() walks ctx->tctx_list under ctx->tctx_lock and dereferences each node's task->io_uring without a NULL check: list_for_each_entry(node, &ctx->tctx_list, ctx_node) { tctx = node->task->io_uring; if (WARN_ON_ONCE(!tctx->io_wq)) continue; ... } __io_uring_add_tctx_node() installs the node into ctx->tctx_list (via io_tctx_install_node(), which does the list_add() under tctx_lock) and only assigns current->io_uring = tctx afterwards. A task doing its first io_uring operation on a shared ring therefore has a window in which its node is already visible on ctx->tctx_list while node->task->io_uring is still NULL. A concurrent IORING_REGISTER_IOWQ_MAX_WORKERS on the same ring reads that NULL and dereferences tctx->io_wq: KASAN: null-ptr-deref in range [0x0000000000000018-0x000000000000001f] RIP: io_register_iowq_max_workers io_uring/register.c:423 Publish current->io_uring = tctx before installing the node, so any node visible on ctx->tctx_list always has a valid task->io_uring. Fixes: 7880174e1e5e ("io_uring/tctx: clean up __io_uring_add_tctx_node() error handling") Signed-off-by: Lim HyeonJun <shja0831@gmail.com> Link: https://patch.msgid.link/20260524110853.115634-1-shja0831@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-05-22Merge tag 'io_uring-7.1-20260522' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux Pull io_uring fixes from Jens Axboe: - Fix for an issue with IORING_OP_NOP and using injection results - Fix for an issue in IORING_OP_WAITID, where the info state was assumed cleared by the lower level syscall handler, but for some cases it is not. Just clear the data upfront, so that non-initialized data isn't copied back to userspace - Fix for a lockdep reported issue, where IORING_OP_BIND enters file create and hence hits mnt_want_write(), which creates a three part lockdep cycle between the super lock, io_uring's uring_lock, and the cred mutex - Fix a regression introduced in this cycle with how linked timeouts are deleted - Ensure that the ->opcode nospec indexing on the opcode issue side covers all the cases * tag 'io_uring-7.1-20260522' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: io_uring/nop: pass all errors to userspace io_uring/timeout: splice timed out link in timeout handler io_uring: propagate array_index_nospec opcode into req->opcode io_uring/waitid: clear waitid info before copying it to userspace io_uring/net: punt IORING_OP_BIND async if it needs file create
2026-05-21io_uring/nop: pass all errors to userspaceAlexander A. Klimov
This fixes an inconsistency where io_nop() called req_set_fail() based on ret, but passed just nop->result to userspace. Originally, ret is a even copy of nop->result, but is set to an error when such happens subsequently. Now that's also passed to userspace. Fixes: a85f31052bce ("io_uring/nop: add support for testing registered files and buffers") Signed-off-by: Alexander A. Klimov <grandmaster@al2klimov.de> Link: https://patch.msgid.link/20260520180045.538533-1-grandmaster@al2klimov.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-05-20io_uring/timeout: splice timed out link in timeout handlerJens Axboe
A previous commit deferred this to the task_work part of it, so it could be protected by ->uring_lock. But that's actually not necessary here, and in fact the head clearing is not enough to make that safe. For those two reasons, just re-instate the local splicing. Fixes: 49ae66eb8c27 ("io_uring: defer linked-timeout chain splice out of hrtimer context") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-05-18io_uring: propagate array_index_nospec opcode into req->opcodeMichael Bommarito
Commit 1e988c3fe126 ("io_uring: prevent opcode speculation") added array_index_nospec() to io_init_req(), but applied it only to a local opcode variable. req->opcode is initialized from sqe->opcode before the bounds check and remains the raw value. Keep req->opcode as the canonical opcode in io_init_req(): reject out-of-range values architecturally, then write the array_index_nospec() result back to req->opcode before any table lookup. This keeps downstream users of req->opcode from observing the raw user byte on a mispredicted path. No functional change: array_index_nospec() is a no-op for opcodes in [0, IORING_OP_LAST), and out-of-range opcodes are still rejected at the bounds check above the assignment. Fixes: 1e988c3fe126 ("io_uring: prevent opcode speculation") Assisted-by: Claude:claude-opus-4-7 Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com> Link: https://patch.msgid.link/20260517213010.696135-1-michael.bommarito@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-05-16io_uring/waitid: clear waitid info before copying it to userspaceHeechan Kang
IORING_OP_WAITID stores its result fields in struct io_waitid::info and later copies them to userspace siginfo. The prep path initializes the request arguments, but it does not initialize info itself. If the wait operation completes without reporting a child event, the common wait code can return without writing wo_info. In that case io_waitid_finish() still copies iw->info to userspace, exposing stale bytes from the reused io_kiocb command storage. Clear the result storage during prep so the io_uring path matches the regular waitid syscall, which uses a zero-initialized struct waitid_info. Fixes: f31ecf671ddc ("io_uring: add IORING_OP_WAITID support") Cc: stable@vger.kernel.org # 6.7+ Signed-off-by: Heechan Kang <gganji11@naver.com> Link: https://patch.msgid.link/20260516184709.852814-1-gganji11@naver.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-05-15Merge tag 'io_uring-7.1-20260515' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux Pull io_uring fixes from Jens Axboe: - Small series sanitizing the locking done for either modifying or reading a chain of requests - If the application has a pid namespace, ensure that the sqthread pid is correctly printed in fdinfo - Fix for a hashing issue in the io-wq thread pool, which could lead to a use-after-free - Kill dead argument from io_prep_rw_pi() - Fix for a missed validation of the CQ ring head, affecting CQE refill * tag 'io_uring-7.1-20260515' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: io_uring: validate user-controlled cq.head in io_cqe_cache_refill() io-wq: check that the predecessor is hashed in io_wq_remove_pending() io_uring/rw: drop unused attr_type_mask from io_prep_rw_pi() io_uring: hold uring_lock across io_kill_timeouts() in cancel path io_uring: defer linked-timeout chain splice out of hrtimer context io_uring: hold uring_lock when walking link chain in io_wq_free_work() io_uring/fdinfo: translate SqThread PID through caller's pid_ns
2026-05-15io_uring/net: punt IORING_OP_BIND async if it needs file createJens Axboe
For two reasons: 1) An opcode cannot block inside io_uring_enter() doing submissions, as it'll stall the submission side pipeline. 2) Ending up in sb_start_write() -> __sb_start_write() -> percpu_down_read_freezable() introduces a new lockdep edge, which it correctly complains about. Check if the socket type is AF_UNIX and has a non-empty pathname. If it does, mark it REQ_F_FORCE_ASYNC to punt the submission to io-wq rather than attempt to do it inline. Fixes: 7481fd93fa0a ("io_uring: Introduce IORING_OP_BIND") Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>