| Age | Commit message (Collapse) | Author |
|
git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux
Pull io_uring epoll update from Jens Axboe:
"As discussed a few months ago, this pull request gets rid of allowing
nested epoll notification contexts via io_uring.
Nested contexts have been a source of issues on the epoll side, and
there should not be a need to support them from io_uring. The epoll
io_uring side exists mainly to facilitate a gradual migration from a
notification based epoll setup to an io_uring ditto"
* tag 'for-7.2/io_uring-epoll-20260616' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
io_uring/epoll: disallow adding an epoll file to an epoll context
io_uring/epoll: switch to using do_epoll_ctl_file() interface
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from Jakub Kicinski:
"Core & protocols:
- Work on removing rtnl_lock protection throughout the stack
continues. In this chapter:
- don't use rtnl_lock for IPv6 multicast routing configuration
- don't take rtnl_lock in ethtool for modern drivers
- prepare Qdisc dump callbacks for rtnl_lock removal
- Support dumping just ifindex + name of all interfaces, under RCU.
It's a common operation for Netlink CLI tools (when translating
names to ifindexes) and previously required full rtnl_lock.
- Support dumping qdiscs and page pools for a specific netdev. Even
tho user space wants a dump of all netdevs, most of the time, the
OOO programming model results in repeating the dump for each
netdev. Which, in absence of a cache, leads to a O(n^2) behavior.
- Flush nexthops once on multi-nexthop removal (e.g. when device goes
down), another O(n^2) -> O(n) improvement.
- Rehash locally generated traffic to a different nexthop on
retransmit timeout.
- Honor oif when choosing nexthop for locally generated IPv6 traffic.
- Convert TCP Auth Option to crypto library, and drop non-RFC algos.
- Increase subflow limits in MPTCP to 64 and endpoint limit to 256.
- Support MPTCP signaling of IPv6 address + port (ADD_ADDR). We need
to selectively skip reporting of the standard TCP Timestamp option,
because they won't fit into the header space together (12 + 30 >
40).
- Support using bridge neighbor suppression, Duplicate Address
Detection, Gratuitous ARP and unsolicited NA forwarding - in EVPN
deployments, e.g. VXLAN fabrics (IPv4 and IPv6).
- Improve link state reporting for upper netdevs (e.g. macvlan) over
tunnel devices (again, mostly for EVPN deployments).
- Support binding GENEVE tunnels to a local address.
- Speed up UDP tunnel destruction (remove one synchronize_rcu()).
- Support exponential field encoding in multicast (IGMPv3 and MLDv2).
- Support attaching PSP crypto offload to containers (veth, netkit).
- Add a new IPSec Netlink message XFRM_MSG_MIGRATE_STATE that allows
migrating individual IPsec SAs independently of their policies.
The existing XFRM_MSG_MIGRATE is tightly coupled to policy+SA
migration, lacks SPI for unique SA identification, and cannot
express reqid changes or migrate Transport mode selectors.
The new interface identifies the SA via SPI and mark, supports
reqid changes, address family changes, encap removal, and uses an
atomic create+install flow under x->lock to prevent SN/IV reuse
during AEAD SA migration.
- Implement GRO/GSO support for PPPoE.
- Convert sockopt callbacks in a number of protocols to iov_iter.
Cross-tree stuff:
- Remove support for Crypto TFM cloning (unblocked after the TCP Auth
Option rework). This feature regressed performance for all crypto
API users, since it changed crypto transformation objects into
reference-counted objects.
- Add FCrypt-PCBC implementation to rxrpc and remove it from the
global crypto API as obsolete and insecure.
Wireless:
- Major rework of station bandwidth handling, fixing issues with
lower capability than AP.
- Cleanups for EMLSR spec issues (drafts differed).
- More Neighbor Awareness Networking (Wi-Fi Aware) work (multicast,
schedule improvements, multi-station etc.)
- Some Ultra High Reliability (UHR) / IEEE 802.11bn (D1.4) work
(e.g. non-primary channel access, UHR DBE support).
- Fine Timing Measurement ranging (i.e. distance measurement) APIs.
Netfilter:
- Use per-rule hash initval in nf_conncount. This avoids unnecessary
lock contention with short keys (e.g. conntrack zones) in different
namespaces.
- Various safety improvements, both in packet parsing and object
lifetimes. Notably add refcounts to conntrack timeout policy.
Deletions:
- Remove TLS + sockmap integration. TLS wants to pin user pages to
avoid a copy, and sockmap wants to write to the input stream. More
work on this integration is clearly needed, and we can't find any
users (original author admitted that they never deployed it).
- Remove support for TLS offload with TCP Offload Engine (the far
more common opportunistic offload is retained). The locking looks
unfixable (driver sleeps under TCP spin locks) and people from the
vendor that added this are AWOL.
- Remove more ATM code, trying to leave behind only what PPPoATM
needs, AAL5 and br2684 with permanent circuits.
- Remove AppleTalk. Let it join hamradio in our out of tree protocol
graveyard, I mean, repository.
- Disable 32-bit x_tables compatibility (32bit binaries on 64bit
kernel) interface in user namespaces. To be deleted completely,
soon.
- Remove 5/10 MHz support from cfg80211/mac80211.
Drivers:
- Software:
- Support DEVMEM/DMABUF Tx over NETMEM_TX_NO_DMA devices (netkit)
- bonding: add knob to strictly follow 802.3ad for link state
- New drivers:
- Alibaba Elastic Ethernet Adaptor (cloud vNIC).
- NXP NETC switch within i.MX94.
- DPLL:
- Add operational state to pins (implement in zl3073x).
- Add generic DPLL type, for daisy-chaining DPLLs (implement in ice).
- Ethernet high-speed NICs:
- Huawei (hinic3):
- enhance tc flow offload support with queue selection,
tunnels
- nVidia/Mellanox:
- avoid over-copying payload to the skb's linear part (up to
60% win for LRO on slow CPUs like ARM64 V2)
- expose more per-queue stats over the standard API
- support additional, unprivileged PFs in the DPU
configuration
- support Socket Direct (multi-PF) with switchdev offloads
- add a pool / frag allocator for DMA mapped buffers for
control objects, save memory on systems with 64kB page size
- take advantage of the ability to dynamically change RSS
table size, even when table is configured by the user
- increase the max RSS table size for even traffic
distribution
- Ethernet NICs:
- Marvell/Aquantia:
- AQC113 PTP support
- Realtek USB (r8152):
- support 10Gbit Link Speeds and Energy-Efficient Ethernet
(EEE)
- support firmware loaded (for RTL8157/RTL8159)
- support for the RTL8159
- Intel (ixgbe):
- support Energy-Efficient Ethernet (EEE) on E610 devices
- Ethernet switches:
- Airoha:
- support multiple netdevs on a single GDM block / port
- Marvell (mv88e6xxx):
- support SERDES of mv88e6321
- Microchip (ksz8/9):
- rework the driver callbacks to remove one indirection layer
- Motorcomm (yt921x):
- support port rate policing
- support TBF qdisc offload
- support ACL/flower offload
- nVidia/Mellanox:
- expose per-PG rx_discards
- Realtek:
- rtl8365mb: bridge offloading and VLAN support
- Ethernet PHYs:
- Airoha:
- support Airoha AN8801R Gigabit PHYs.
- Micrel:
- implement 3 low-loss cable tunables
- Realtek:
- support MDI swapping for RTL8226-CG
- support MDIO for RTL931x
- Qualcomm:
- at803x: Rx and Tx clock management for IPQ5018 PHY
- Motorcomm:
- support YT8522 100M RMII PHY
- set drive strength in YT8531s RGMII
- TI:
- dp83822: add optional external PHY clock
- Bluetooth:
- hci_sync: add support for HCI_LE_Set_Host_Feature [v2]
- SMP: use AES-CMAC library API
- Intel:
- support Product level reset
- support smart trigger dump
- Mediatek:
- add event filter to filter specific event
- Realtek:
- fix RTL8761B/BU broken LE extended scan
- WiFi:
- Broadcom (b43):
- new support for a 11n device
- MediaTek (mt76):
- support mt7927
- mt792x: broken usb transport detection
- mt7921: regulatory improvements
- Qualcomm (ath9k):
- GPIO interface improvements
- Qualcomm (ath12k):
- WDS support
- replace dynamic memory allocation in WMI Rx path
- thermal throttling/cooling device support
- 6 GHz incumbent interference detection
- channel 177 in 5 GHz
- Realtek (rt89):
- RTL8922AU support
- USB 3 mode switch for performance
- better monitor radiotap support
- RTL8922DE preparations"
* tag 'net-next-7.2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1778 commits)
ipv4: fib_rule: Move fib4_rules_exit() to ->exit().
net: serialize netif_running() check in enqueue_to_backlog()
net: skmsg: preserve sg.copy across SG transforms
appletalk: move the protocol out of tree
appletalk: stop storing per-interface state in struct net_device
selftests/bpf: test that TLS crypto is rejected on a sockmap socket
selftests/bpf: drop the unused kTLS program from test_sockmap
selftests/bpf: remove sockmap + ktls tests
tls: remove dead sockmap (psock) handling from the SW path
tls: reject the combination of TLS and sockmap
atm: remove orphaned uAPI for deleted drivers, protocols and SVCs
atm: remove unused ATM PHY operations
atm: remove the unused pre_send and send_bh device operations
atm: remove the unused change_qos device operation
atm: remove SVC socket support and the signaling daemon interface
atm: remove the local ATM (NSAP) address registry
atm: remove dead SONET PHY ioctls
atm: remove the unused send_oam / push_oam callbacks
atm: remove AAL3/4 transport support
net: dsa: sja1105: fix lastused timestamp in flower stats
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux
Pull block updates from Jens Axboe:
- NVMe pull request via Keith:
- Per-controller admin and IO timeout sysfs attributes, and
letting the block layer set request timeouts (Maurizio,
Maximilian)
- Multipath passthrough iostats, and PCI P2PDMA enablement for
multipath devices (Keith, Kiran)
- A new diag sysfs attribute group exporting per-controller
counters (retries, multipath failover, error counters, requeue
and failure counts, reset and reconnect events) (Nilay)
- FDP configuration validation and bounds check fixes (liuxixin)
- Various nvmet fixes, including a pre-auth out-of-bounds read in
the Discovery Get Log Page handler, auth payload bounds
validation, and tcp error-path leak fixes (Bryam, Tianchu,
Geliang)
- nvme-tcp lockdep and workqueue fixes (Shin'ichiro, Kuniyuki,
Eric)
- Assorted other fixes and cleanups (John, Yao, Chao, Mateusz,
Achkinazi, Wentao)
- MD pull request via Yu Kuai:
- raid1/raid10 fixes for a deadlock in the read error recovery
path, error-path detection and bio accounting with cloned bios,
and an nr_pending leak in the REQ_ATOMIC bad-block error path
(Abd-Alrhman)
- PCI P2PDMA propagation from member devices to the RAID device
(Kiran)
- dm-raid bio requeue fix, and various smaller fixes and cleanups
(Benjamin, Chen, Li, Thorsten)
- Enable Clang lock context analysis for the block layer, with the
accompanying annotations across queue limits, the blk_holder_ops
callbacks, crypto, cgroup, iocost, kyber and mq-deadline (Bart)
- Block status code infrastructure work: a tagged status table, a
str_to_blk_op() helper, a bio_endio_status() helper, and on top of
that a new configurable block-layer error injection facility
(Christoph)
- DRBD netlink rework, replacing the genl_magic machinery with explicit
netlink serialization and moving the DRBD UAPI headers to
include/uapi/linux/ (Christoph Böhmwalder)
- bvec improvements: a bvec_folio() helper and making the bvec_iter
helpers proper inline functions (Willy, Christoph)
- ublk cleanups and a canceling-flag fix for the disk-not-allocated
case (Caleb, Ming)
- Partition handling fixes: bound the AIX pp_count scan, fix an of_node
refcount leak, and replace __get_free_page() with kmalloc() (Bryam,
Wentao, Mike)
- Convert numa_node to int in blk_mq_hw_ctx and ->init_request, and add
WQ_PERCPU to the block workqueue users (Mateusz, Marco)
- Block statistics and tracing: propagate in-flight to the whole disk
on partition IO, export passthrough stats, and a new
block_rq_tag_wait tracepoint (Tang, Keith, Aaron)
- A round of removals, unexports and cleanups across bio, direct-io and
the bvec helpers (Christoph)
- Various driver fixes (mtip32xx use-after-free, rbd snap_count
validation and strscpy conversion, nbd socket lockdep reclassify,
virtio-blk zone report clamp, floppy) and a batch of MAINTAINERS
email/list updates (Coly, Li, Yu, Christoph Böhmwalder)
- Other little fixes and cleanups all over
* tag 'for-7.2/block-20260615' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (117 commits)
MAINTAINERS: Update Coly Li's email address
block: check bio split for unaligned bvec
nbd: Reclassify sockets to avoid lockdep circular dependency
block: add configurable error injection
block: add a str_to_blk_op helper
block: add a "tag" for block status codes
block: add a macro to initialize the status table
floppy: Drop unused pnp driver data
block: propagate in_flight to whole disk on partition I/O
virtio-blk: clamp zone report to the report buffer capacity
block: optimize I/O merge hot path with unlikely() hints
drivers/block/rbd: Use strscpy() to copy strings into arrays
partitions: aix: bound the pp_count scan to the ppe array
block: Enable lock context analysis
block/mq-deadline: Make the lock context annotations compatible with Clang
block/Kyber: Make the lock context annotations compatible with Clang
block/blk-mq-debugfs: Improve lock context annotations
block/blk-iocost: Inline iocg_lock() and iocg_unlock()
block/blk-iocost: Split ioc_rqos_throttle()
block/crypto: Annotate the crypto functions
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux
Pull io_uring updates from Jens Axboe:
- Rework the task_work infrastructure.
Both the local (DEFER_TASKRUN) and the normal (tctx) task_work lists
were llist based, which is LIFO ordered, and hence each run had to do
an O(n) list reversal pass first to restore queue order.
Additionally, to cap the amount of task_work run, each method needed
a retry list as well.
Add a lockless MPCS FIFO queue (based on Dmitry Vyukov's intrusive
MPSC algorithm) and switch both task_work lists to it. It performs
better than llists and we can then also ditch the retry lists as well
as entries are popped one-at-the-time.
On top of those changes, run the tctx fallback task_work directly and
remove the now-unused per-ctx fallback machinery entirely.
- zcrx user notifications.
Add a mechanism for zcrx to communicate conditions back to userspace
via a dedicated CQE, with the initial users being notification on
running out of buffers and on a frag copy fallback, plus
shared-memory notification statistics.
Alongside that, a series of zcrx reliability and cleanup fixes: more
reliable scrubbing, poisoning pointers on unregistration, dropping an
extra ifq close, adding a ctx back-pointer, reordering fd allocation
in the export path, and killing a dead 'sock' member.
- Allow using io_uring registered buffers for plain SEND and RECV, not
just for the zero-copy send path.
This enables targets like ublk's NBD backend to push/pull IO data
directly to/from a registered buffer over a plain send/recv on a TCP
socket.
- Registered buffer improvements: account huge pages correctly, bump
the io_mapped_ubuf length field to size_t, and raise the previous 1GB
registered buffer size limit.
- Restrict the ctx access exposed to io_uring BPF struct_ops programs
by handing them an opaque type rather than the full io_ring_ctx, and
add a separate MAINTAINERS entry for the bpf-ops code.
- Allow opcode filtering on IORING_OP_CONNECT.
- Validate ring-provided buffer addresses with access_ok(), and align
the legacy buffer add limit with MAX_BIDS_PER_BGID.
- Various other cleanups and minor fixes, including avoiding msghdr
async data on connect/bind, dropping async_size for OP_LISTEN, making
the POLL_FIRST receive side checks consistent, re-checking
IO_WQ_BIT_EXIT for each linked work item, and using
trace_call__##name() at guarded tracepoint call sites.
* tag 'for-7.2/io_uring-20260615' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (31 commits)
io_uring/bpf-ops: add a separate maintainer entry
io_uring/net: make POLL_FIRST receive side checks consistent
io_uring: remove the per-ctx fallback task_work machinery
io_uring: run the tctx task_work fallback directly
io_uring: switch normal task_work to a mpscq
io_uring: switch local task_work to a mpscq
io_uring/mpscq: add lockless multi-producer, single-consumer FIFO queue
io_uring: grab RCU read lock marking task run
io_uring/zcrx: kill dead 'sock' member in struct io_zcrx_args
io_uring/kbuf: validate ring provided buffer addresses with access_ok()
io_uring/net: support registered buffer for plain send and recv
io_uring/nop: Drop a wrong comment in struct io_nop
io_uring/net: Remove async_size for OP_LISTEN
io_uring/net: Avoid msghdr on op_connect/op_bind async data
io_uring/bpf-ops: restrict ctx access to BPF
io_uring/io-wq: re-check IO_WQ_BIT_EXIT for each linked work item
io_uring/kbuf: align legacy buffer add limit with MAX_BIDS_PER_BGID
io_uring/zcrx: add shared-memory notification statistics
io_uring/zcrx: notify user on frag copy fallback
io_uring/zcrx: notify user when out of buffers
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto updates from Herbert Xu:
"API:
- Drop support for off-CPU cryptography in af_alg
- Document that af_alg is *always* slower
- Document the deprecation of af_alg
- Remove zero-copy support from skcipher and aead in af_alg
- Cap AEAD AD length to 0x80000000 in af_alg
- Free default RNG on module exit
Algorithms:
- Fix vli multiplication carry overflow in ecc
- Drop unused cipher_null crypto_alg
- Remove unused variants of drbg
- Use lib/crypto in drbg
- Use memcpy_from/to_sglist in authencesn
- Allow authenc(hmac(sha{256,384}),cts(cbc(aes))) in FIPS mode
- Disallow RSA PKCS#1 SHA-1 sig algs in FIPS mode
- Filter out async aead implementations at alloc in krb5
- Fix non-parallel fallback by rstoring callback in pcrypt
- Validate poly1305 template argument in chacha20poly1305
Drivers:
- Add sysfs PCI reset support to qat
- Add KPT support for GEN6 devices to qat
- Remove unused character device and ioctls from qat
- Add support for hw access via SMCC to mtk
- Remove prng support from crypto4xx
- Remove prng support from hisi-trng
- Remove prng support from sun4i-ss
- Remove prng support from xilinx-trng
- Remove loongson-rng
- Remove exynos-rng
Others:
- Remove support for AIO on sockets"
* tag 'v7.2-p1' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (196 commits)
crypto: tegra - fix refcount leak in tegra_se_host1x_submit()
crypto: rng - Free default RNG on module exit
crypto: testmgr - allow authenc(hmac(sha{256,384}),cts(cbc(aes))) in FIPS mode
hwrng: jh7110 - fix refcount leak in starfive_trng_read()
crypto: atmel-ecc - drop dead code in atmel_ecdh_max_size
crypto: cavium/cpt - fix DMA cleanup using wrong loop index
crypto: marvell/octeontx - fix DMA cleanup using wrong loop index
MAINTAINERS: make myself the maintainer of the Qualcomm QCE driver
crypto: amcc - convert irq_of_parse_and_map to platform_get_irq
crypto: sun4i-ss - Remove insecure and unused rng_alg
hwrng: xilinx - Move xilinx-rng into drivers/char/hw_random/
crypto: xilinx-trng - Replace crypto_drbg_ctr_df() with HMAC-SHA512
crypto: xilinx-trng - Fix return value of xtrng_hwrng_trng_read()
crypto: xilinx-trng - Remove crypto_rng interface
crypto: exynos-rng - Remove exynos-rng driver
hwrng: hisi-trng - Move hisi-trng into drivers/char/hw_random/
crypto: hisi-trng - Remove crypto_rng interface
crypto: loongson - Remove broken and unused loongson-rng
crypto: crypto4xx - Remove insecure and unused rng_alg
crypto: qat - validate RSA CRT component lengths
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab
Pull slab updates from Vlastimil Babka:
- Support for "allocation tokens" (currently available in Clang 22+)
for smarter partitioning of kmalloc caches based on the allocated
object type, which can be enabled instead of the "random"
per-caller-address-hash partitioning.
It should be able to deterministically separate types containing a
pointer from those that do not (Marco Elver)
- Improvements and simplification of the kmem_cache_alloc_bulk() and
mempool_alloc_bulk() API. This includes adaptation of callers
(Christoph Hellwig)
- Performance improvements and cleanups related mostly to sheaves
refill (Hao Li, Shengming Hu, Vlastimil Babka)
- Several fixups for the slabinfo tool (Xuewen Wang)
* tag 'slab-for-7.2' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab:
mm/slab: do not limit zeroing to orig_size when only red zoning is enabled
mm/slub: preserve original size in _kmalloc_nolock_noprof retry path
mm: simplify the mempool_alloc_bulk API
mm/slab: improve kmem_cache_alloc_bulk
mm/slub: detach and reattach partial slabs in batch
mm/slub: introduce helpers for node partial slab state
mm/slub: use empty sheaf helpers for oversized sheaves
tools/mm/slabinfo: remove redundant slab->partial assignment
tools/mm/slabinfo: remove dead assignment in get_obj_and_str()
tools/mm/slabinfo: Fix trace disable logic inversion
MAINTAINERS: add slab-related scripts and tools to SLAB ALLOCATOR
mm/slub: fix typo in sheaves comment
mm, slab: simplify returning slab in __refill_objects_node()
mm, slab: add an optimistic __slab_try_return_freelist()
slab: fix kernel-docs for mm-api
slab: improve KMALLOC_PARTITION_RANDOM randomness
slab: support for compiler-assisted type-based slab cache partitioning
mm/slub: defer freelist construction until after bulk allocation from a new slab
|
|
This adds observability for the io_uring zcrx rx-buf-len configuration.
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Yael Chemla <ychemla@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Acked-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://patch.msgid.link/20260612211709.1456966-3-dtatulea@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
gitolite.kernel.org:pub/scm/linux/kernel/git/tip/tip
Pull locking updates from Ingo Molnar:
"Futex updates:
- Optimize futex hash bucket access patterns (Peter Zijlstra)
- Large series to address the robust futex unlock race for real, by
Thomas Gleixner:
"The robust futex unlock mechanism is racy in respect to the
clearing of the robust_list_head::list_op_pending pointer because
unlock and clearing the pointer are not atomic.
The race window is between the unlock and clearing the pending op
pointer. If the task is forced to exit in this window, exit will
access a potentially invalid pending op pointer when cleaning up
the robust list.
That happens if another task manages to unmap the object
containing the lock before the cleanup, which results in an UAF.
In the worst case this UAF can lead to memory corruption when
unrelated content has been mapped to the same address by the time
the access happens.
User space can't solve this problem without help from the kernel.
This series provides the kernel side infrastructure to help it
along:
1) Combined unlock, pointer clearing, wake-up for the
contended case
2) VDSO based unlock and pointer clearing helpers with a
fix-up function in the kernel when user space was interrupted
within the critical section.
... with help by André Almeida:
- Add a note about robust list race condition (André Almeida)
- Add self-tests for robust release operations (André Almeida)
Context analysis updates:
- Implement context analysis for 'struct rt_mutex'. (Bart Van Assche)
- Bump required Clang version to 23 (Marco Elver)
Guard infrastructure updates:
- Series to remove NULL check from unconditional guards (Dmitry
Ilvokhin)
Lockdep updates:
- Restore self-test migrate_disable() and sched_rt_mutex state on
PREEMPT_RT (Karl Mehltretter)
Membarriers updates:
- Use per-CPU mutexes for targeted commands (Aniket Gattani)
- Modernize membarrier_global_expedited with cleanup guards (Aniket
Gattani)
- Add rseq stress test for CFS throttle interactions (Aniket Gattani)
percpu-rwsems updates:
- Extract __percpu_up_read() to optimize inlining overhead (Dmitry
Ilvokhin)
Seqlocks updates:
- Allow UBSAN_ALIGNMENT to fail optimizing (Heiko Carstens)
Lock tracing:
- Add contended_release tracepoint to sleepable locks such as
mutexes, percpu-rwsems, rtmutexes, rwsems and semaphores (Dmitry
Ilvokhin)
MAINTAINERS updates:
- MAINTAINERS: Add RUST [SYNC] entry (Boqun Feng)
Misc updates and fixes by Randy Dunlap, YE WEI-HONG, Fabricio Parra,
Dmitry Ilvokhin and Peter Zijlstra"
* tag 'locking-core-2026-06-14' of gitolite.kernel.org:pub/scm/linux/kernel/git/tip/tip: (36 commits)
locking: Add contended_release tracepoint to sleepable locks
locking/percpu-rwsem: Extract __percpu_up_read()
tracing/lock: Remove unnecessary linux/sched.h include
futex: Optimize futex hash bucket access patterns
rust: sync: completion: Mark inline complete_all and wait_for_completion
MAINTAINERS: Add RUST [SYNC] entry
cleanup: Specify nonnull argument index
selftests: futex: Add tests for robust release operations
Documentation: futex: Add a note about robust list race condition
x86/vdso: Implement __vdso_futex_robust_try_unlock()
x86/vdso: Prepare for robust futex unlock support
futex: Provide infrastructure to plug the non contended robust futex unlock race
futex: Add robust futex unlock IP range
futex: Add support for unlocking robust futexes
futex: Cleanup UAPI defines
x86: Select ARCH_MEMORY_ORDER_TSO
uaccess: Provide unsafe_atomic_store_release_user()
futex: Provide UABI defines for robust list entry modifiers
futex: Move futex related mm_struct data into a struct
futex: Make futex_mm_init() void
...
|
|
io_recv() and io_recvzc() are the odd ones out, as they checks for
whether POLL_FIRST should be honored before checking if the file is a
socket. It doesn't really matter, but might as well make it consistent
across all receive and send types.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
With the tctx fallback running its entries directly, the per-ctx
fallback work has a single user left: moving local (DEFER_TASKRUN)
task_work entries out of a ring that is going away. Both of its call
sites are process context and don't hold ->uring_lock, the same
conditions the deferred fallback work itself ran under - so run the
entries in cancel mode right there instead, and rename the helper to
io_cancel_local_task_work() to match what it now does.
With that, ->fallback_llist, ->fallback_work, io_fallback_req_func()
and __io_fallback_tw() can all go away, along with the fallback work
flushing in the ring exit and cancel paths. Requests that get
orphaned by an exiting task now run via the tctx fallback work, which
the ring exit side implicitly waits on through the ctx refs those
requests hold.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The fallback work drains the tctx queue only to redistribute the entries
into the per-ctx fallback lists, bouncing them through a second
(per-ctx) work item before they finally run. That made sense when the
producer side did the draining and could be in any context, but the
fallback work is a regular process context kworker: it can just run the
entries itself. Reuse the normal run loop - if run from the fallback
kernel thread, ts.cancel will get set, and the work terminated.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Like the local task_work list, the normal (tctx) task_work list is an
llist, and hence needs the O(n) llist_reverse_order() pass before
running entries in queue order. On top of that, capped runs - sqpoll
processing IORING_TW_CAP_ENTRIES_VALUE entries at a time - need the
claimed-but-unprocessed leftovers carried in a separate retry_list,
as they can't be pushed back to the shared list.
Switch tctx->task_list to a mpscq, like what was done for the
DEFER_TASKRUN paths as well.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The local (DEFER_TASKRUN) task_work list is an llist, which is LIFO
ordered, and hence __io_run_local_work() has to restore the right
running order with an O(n) llist_reverse_order() pass first. On top of
that, a batch that gets capped by max_events needs the leftover entries
parked on a separate ->retry_llist, as they can't be pushed back to the
shared list.
Switch it to the FIFO mpscq. Adds are wait-free instead of a cmpxchg
retry loop, entries are popped in queue order with no reversal pass,
capping a run simply leaves the remainder on the queue, and
->retry_llist goes away entirely. The consumer cursor, ->work_head,
lives with the rest of the ->uring_lock protected state rather than
next to the queue, so that popping entries doesn't dirty the producer
side cacheline.
For low amounts of task_work, this ends up being a bit more efficient
than the existing scheme. As an example of that, doing multishot
receives for 8 clients has the following task_work overhead:
1.02% sock-test [kernel.kallsyms] [k] io_req_local_work_add
0.88% sock-test [kernel.kallsyms] [k] __io_run_local_work_loop
0.60% sock-test [kernel.kallsyms] [k] llist_reverse_order
0.14% sock-test [kernel.kallsyms] [k] __io_run_local_work
2.64% at ~46Gb/sec
and after this change:
1.08% sock-test [kernel.kallsyms] [k] io_req_local_work_add
1.03% sock-test [kernel.kallsyms] [k] __io_run_local_work
2.11% at ~53Gb/sec
which has less overhead even though that test run was faster. For a case
of having 1024 clients on a single ring:
2.22% sock-test [kernel.kallsyms] [k] llist_reverse_order
0.84% sock-test [kernel.kallsyms] [k] __io_run_local_work_loop
0.42% sock-test [kernel.kallsyms] [k] io_req_local_work_add
0.02% sock-test [kernel.kallsyms] [k] __io_run_local_work
3.50% at ~24Gb/sec
we start to see the llist reversing taking a considerable amount of
time, and the total add+run task_work overhead is around 3.5%. After
the change:
0.90% sock-test [kernel.kallsyms] [k] __io_run_local_work
0.42% sock-test [kernel.kallsyms] [k] io_req_local_work_add
1.32% at ~26Gb/sec
most of that overhead is gone, and performance is better as well.
Caleb Sander Mateos <csander@purestorage.com> reports that it improves
the performance of a ublk 4kb workload by 4% [1], while testing v1 of
this patchset.
[1] https://lore.kernel.org/io-uring/CADUfDZr-MMYBaP-e+y9+xuRhuiunO2sBTUCmwZyd7AgT8sVtiQ@mail.gmail.com/
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Local task_work is currently using llists for managing the work,
but that's a LIFO type of list. This means that running this task_work
needs to reverse the list first, to ensure fairness in running the
queued items.
Add a lockless FIFO queued, based on Dmitry Vyukov's intrusive MPSC
node-based queue algorithm, modified with an externally held consumer
cursor and conditional stub reinsertion. See comments in the header.
Producers are wait-free: a push is a single xchg() on the queue tail,
which serializes concurrent producers and defines the FIFO order, plus
a store linking the node to its predecessor. There are no cmpxchg retry
loops, and pushing is safe from any context, including hardirq.
The cost of linked list FIFO ordering is that a push publishes the node
in two steps - the xchg() makes it visible as the new tail before the
subsequent store links it into the chain that is reachable from the
head. A consumer hitting that window gets a NULL from mpscq_pop() while
mpscq_empty() reports false, and must retry later rather than treat the
queue as empty. The window is two instructions wide, but a producer can
get preempted inside it, so the consumer must not busy wait on it.
The consumer side supports a single consumer at a time, with callers
providing their own serialization. A stub node, which also defines the
empty state (tail == stub), allows the consumer to detach the final
node without racing against producer link stores: that node is only
handed out once the stub has been cmpxchg'ed back in as the tail. This
also guarantees that the previous tail returned by mpscq_push() cannot
get freed before that push has linked it, making it always valid for
comparisons.
The consumer cursor is deliberately not part of the queue struct - the
caller owns it and passes it to mpscq_pop(). This is done to separate
the consumer and producers cacheline. The cursor is written for every
popped entry, and keeping it on the same cacheline as ->tail would have
the consumer invalidating the line that producers need for every push.
Keeping it external lets the caller place it with its own consumer side
data instead.
Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Not required right now, as io_req_local_work_add() already calls this
helper with the RCU read lock held. But in preparation for that not
being the case, grab it locally.
Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux
Pull io_uring fixes from Jens Axboe:
- Tweak for an off-by-one in the CQ ring accounting for the min wait
support.
- Don't truncate end buffer length for a bundle, as the transfer might
not happen. It's not required in the first place, as the completion
side handles this condition already.
* tag 'io_uring-7.1-20260611' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
io_uring/wait: fix min_timeout behavior
io_uring/kbuf: don't truncate end buffer for bundles
|
|
This member is only ever assigned, never read. Kill it.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Commit:
809b997a5ce9 ("x86-64/arm64/powerpc: clean up and rename __copy_from_user_flushcache")
sanitized that any provided copy helper should separately validate
destination and source addresses, but we should also ensure that
anything that is retrieved from a buffer is validated upfront. For ring
provided buffers, always include an access_ok() when grabbing a new
buffer.
Fixes: c7fb19428d67 ("io_uring: add support for ring mapped supplied buffers")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
So far IORING_RECVSEND_FIXED_BUF is only honoured on the SEND_ZC path,
even though the import wiring is already present for plain send and
completely absent for recv. Targets such as ublk's NBD backend want to
push/pull I/O data directly to/from an io_uring registered buffer over a
plain send/recv on a TCP socket.
Wire IORING_RECVSEND_FIXED_BUF into the plain IORING_OP_SEND and
IORING_OP_RECV paths:
- Accept the flag in SENDMSG_FLAGS / RECVMSG_FLAGS and, at prep time,
restrict it to the non-vectorized IORING_OP_SEND / IORING_OP_RECV
opcodes. It is mutually exclusive with buffer select, bundles and
(for recv) multishot, and records sqe->buf_index.
- For recv, set REQ_F_IMPORT_BUFFER in setup so the registered buffer
is imported lazily at issue time, mirroring the send path.
- In io_send()/io_recv(), import the registered buffer via
io_import_reg_buf() (ITER_SOURCE for send, ITER_DEST for recv) and
clear REQ_F_IMPORT_BUFFER. The resulting bvec iter persists in
async_data, so MSG_WAITALL partial send/recv retries resume at the
right offset.
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Link: https://patch.msgid.link/20260608142511.659240-2-ming.lei@redhat.com
[axboe: combine flags checks]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The wakeup condition if a min timeout is present and has expired is that
at least _one_ CQE was posted. Thus set the cq_tail target to
->cq_min_tail + 1. Without this commit a spurious wakeup can result in a
premature wakeup because io_should_wake() will return true even if _no_
CQE was posted at all.
Cc: Tip ten Brink <tip@tenbrinkmeijs.com>
Fixes: e15cb2200b93 ("io_uring: fix min_wait wakeups for SQPOLL")
Cc: stable@vger.kernel.org
Signed-off-by: Christian A. Ehrhardt <lk@c--e.de>
Link: https://patch.msgid.link/20260606201120.1441447-1-lk@c--e.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
If buffers have been peeked for a bundle receive, the kernel will
truncate the end buffer, if the available length is shorter than the
buffer itself. This is unnecessary, as applications iterating bundle
receives must always use the minimum size of the buffer length and the
remaining number of bytes in the bundle. The examples in liburing do
that as well, eg examples/proxy.c.
If the kernel does truncate this buffer AND the current transfer fails,
then the buffer will be left with a smaller size than what is otherwise
available.
Just remove the buffer truncation, as it's not necessary in the first
place.
Link: https://lore.kernel.org/io-uring/CAAEr8jbY60noGj1fw_k91UJRBkyiRVoS6=nLhZ7Svwidjn4CAA@mail.gmail.com/
Reported-by: Federico Brasili <federico.brasili@gmail.com>
Cc: stable@vger.kernel.org
Fixes: 35c8711c8fc4 ("io_uring/kbuf: add helpers for getting/peeking multiple buffers")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux
Pull io_uring fix from Jens Axboe:
"A single fix for a missing flag mask when multishot is used with
an incrementally consumed buffer ring, potentially leading to
application confusion because of lack of IORING_CQE_F_BUF_MORE
consistency"
* tag 'io_uring-7.1-20260605' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
io_uring/net: inherit IORING_CQE_F_BUF_MORE across bundle recv retries
|
|
When a bundle recv retries inside io_recv_finish(), the merge logic OR
the saved cflags from the previous iteration with the cflags returned by
the new iteration:
cflags = req->cqe.flags | (cflags & CQE_F_MASK);
Bits listed in CQE_F_MASK are inherited from the new iteration, and all
other bits (notably IORING_CQE_F_BUFFER and the buffer ID) come from the
saved cflags. Before this change CQE_F_MASK covered only
IORING_CQE_F_SOCK_NONEMPTY and IORING_CQE_F_MORE.
When using provided buffer rings (IOU_PBUF_RING_INC) with incremental
mode, and bundle recv, io_kbuf_inc_commit() can leave the head ring
entry partially consumed, __io_put_kbufs() then sets
IORING_CQE_F_BUF_MORE on the returned cflags so userspace knows the
buffer ID will be reused for subsequent completions.
Because IORING_CQE_F_BUF_MORE was not in CQE_F_MASK, the merge above
silently dropped it whenever the final retry iteration partially
consumed the buffer, and the subsequent req->cqe.flags = cflags &
~CQE_F_MASK save would have left a stale IORING_CQE_F_BUF_MORE in the
carried-over cflags had one been present. Userspace would then
wrongfully advance it ring head past an entry the kernel still uses.
Add IORING_CQE_F_BUF_MORE to CQE_F_MASK so it is both inherited from the
new iteration into the user-visible CQE and stripped from the saved
cflags between iterations.
Cc: stable@vger.kernel.org
Signed-off-by: Clément Léger <cleger@meta.com>
Assisted-by: Claude:claude-opus-4.6
Fixes: ae98dbf43d75 ("io_uring/kbuf: add support for incremental buffer consumption")
Link: https://patch.msgid.link/20260604160715.2482972-1-cleger@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The kmem_cache_alloc_bulk return value is weird. It returns the number
of allocated objects, but that must always be 0 or the requested number
based on the implementations and the handling in the callers, but that
assumption is not actually documented anywhere, which confuses automated
review tools.
Fix this by returning a bool if the allocation succeeded and adding a
kerneldoc comment explaining the API.
[rob.clark@oss.qualcomm.com: fixups in
msm_iommu_pagetable_prealloc_allocate() ]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com> # skbuff
Link: https://patch.msgid.link/20260528093437.2519248-2-hch@lst.de
Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
|
|
Unlocking robust non-PI futexes happens in user space with the following
sequence:
1) robust_list_set_op_pending(mutex);
2) robust_list_remove(mutex);
lval = 0;
3) lval = atomic_xchg(lock, lval);
4) if (lval & WAITERS)
5) sys_futex(WAKE,....);
6) robust_list_clear_op_pending();
That opens a window between #3 and #6 where the mutex could be acquired by
some other task which observes that it is the last user and:
A) unmaps the mutex memory
B) maps a different file, which ends up covering the same address
When the original task exits before reaching #6 then the kernel robust list
handling observes the pending op entry and tries to fix up user space.
In case that the newly mapped data contains the TID of the exiting thread
at the address of the mutex/futex the kernel will set the owner died bit in
that memory and therefore corrupting unrelated data.
PI futexes have a similar problem both for the non-contented user space
unlock and the in kernel unlock:
1) robust_list_set_op_pending(mutex);
2) robust_list_remove(mutex);
lval = gettid();
3) if (!atomic_try_cmpxchg(lock, lval, 0))
4) sys_futex(UNLOCK_PI,....);
5) robust_list_clear_op_pending();
Address the first part of the problem where the futexes have waiters and
need to enter the kernel anyway. Add a new FUTEX_ROBUST_UNLOCK flag, which
is valid for the sys_futex() FUTEX_UNLOCK_PI, FUTEX_WAKE, FUTEX_WAKE_BITSET
operations.
This deliberately omits FUTEX_WAKE_OP from this treatment as it's unclear
whether this is needed and there is no usage of it in glibc either to
investigate.
For the futex2 syscall family this needs to be implemented with a new
syscall.
The sys_futex() case [ab]uses the @uaddr2 argument to hand the pointer to
robust_list_head::list_pending_op into the kernel. This argument is only
evaluated when the FUTEX_ROBUST_UNLOCK bit is set and is therefore backward
compatible.
This is an explicit argument to avoid the lookup of the robust list pointer
and retrieving the pending op pointer from there. User space has the
pointer already available so it can just put it into the @uaddr2
argument. Aside of that this allows the usage of multiple robust lists in
the future without any changes to the internal functions as they just operate
on the provided pointer.
This requires a second flag FUTEX_ROBUST_LIST32 which indicates that the
robust list pointer points to an u32 and not to an u64. This is required
for two reasons:
1) sys_futex() has no compat variant
2) The gaming emulators use both both 64-bit and compat 32-bit robust
lists in the same 64-bit application
As a consequence 32-bit applications have to set this flag unconditionally
so they can run on a 64-bit kernel in compat mode unmodified. 32-bit
kernels return an error code when the flag is not set. 64-bit kernels will
happily clear the full 64 bits if user space fails to set it.
In case of FUTEX_UNLOCK_PI this clears the robust list pending op when the
unlock succeeded. In case of errors, the user space value is still locked
by the caller and therefore the above cannot happen.
In case of FUTEX_WAKE* this does the unlock of the futex in the kernel and
clears the robust list pending op when the unlock was successful. If not,
the user space value is still locked and user space has to deal with the
returned error. That means that the unlocking of non-PI robust futexes has
to use the same try_cmpxchg() unlock scheme as PI futexes.
If the clearing of the pending list op fails (fault) then the kernel clears
the registered robust list pointer if it matches to prevent that exit()
will try to handle invalid data. That's a valid paranoid decision because
the robust list head sits usually in the TLS and if the TLS is not longer
accessible then the chance for fixing up the resulting mess is very close
to zero.
The problem of non-contended unlocks still exists and will be addressed
separately.
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: André Almeida <andrealmeid@igalia.com>
Link: https://patch.msgid.link/20260602090535.670514505@kernel.org
|
|
This was copy-pasted from io_rw, where the comment actually makes sense.
Here, we don't have struct kiocb. Drop the comment.
Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
Link: https://patch.msgid.link/20260602215327.1885109-4-krisman@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
OP_LISTEN does not use async_data. Remove it.
Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
Link: https://patch.msgid.link/20260602215327.1885109-3-krisman@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Both IORING_OP_CONNECT and IORING_OP_BIND reuse the msghdr object just
to store the sockaddr. Beyond allocating a much larger object than
needed, msghdr can also wrap an iovec, which will be recycled
unnecessarily. This uses the sockaddr directly.
Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
Link: https://patch.msgid.link/20260602215327.1885109-2-krisman@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
BPF programs should have no need in looking into struct io_ring_ctx, if
anything, most of such cases would be anti patterns like looking up ring
indices directly via the context.
Replace it with a new empty structure, which is just an alias to struct
io_ring_ctx. It'll create a new BTF type and fail verification if a BPF
program tries to access it (beyond the first byte). It'll also give more
flexibility for the future, and otherwise it can be made aligned with
io_ring_ctx as before with struct groups if ever needed or extended in a
different way.
Fixes: d0e437b76bd3c ("io_uring/bpf-ops: implement loop_step with BPF struct_ops")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://patch.msgid.link/5f6ca3649e9e0bae8667db4357e28dd00cd07901.1780394491.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux
Pull io_uring fix from Jens Axboe:
"Just a single fix for a regression introduced in this cycle, where
we should ensure the node is visible before the entry is added to
the tctx list"
* tag 'io_uring-7.1-20260529' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
io_uring/tctx: set ->io_uring before publishing the tctx node
|
|
This is a simple helper which replaces page_folio(bvec->bv_page).
Minor improvement in readability, but the real motivation is to reduce
the number of references to bvec->bv_page so that it can be changed
with less work.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Leon Romanovsky <leon@kernel.org>
Reviewed-by: Hannes Reinecke <hare@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: William Kucharski <william.kucharski@linux.dev>
Link: https://patch.msgid.link/20260528175905.1102280-2-willy@infradead.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The only user of msg->msg_iocb was AF_ALG, but that's deprecated.
It can be removed entirely at the cost of only supporting synchronous
operations. This doesn't break userspace, which will silently block
(for a bounded amount of time) in io_submit instead of operating
asynchronously.
This also makes struct msghdr smaller, helping every other caller of
sendmsg().
Signed-off-by: Demi Marie Obenour <demiobenour@gmail.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
|
|
commit 10dc95939817 ("io_uring/io-wq: check IO_WQ_BIT_EXIT inside work
run loop") fixed the obvious case where io_worker_handle_work() took one
exit-bit snapshot before draining pending work, but the fix stops one
level too early.
io_worker_handle_work() now re-checks IO_WQ_BIT_EXIT in its outer work
run loop, yet it still snapshots that bit once before processing a whole
dependent linked-work chain. If io_wq_exit_start() sets IO_WQ_BIT_EXIT
after the first linked item has started, the remaining linked items can
still reuse stale do_kill = false, skip IO_WQ_WORK_CANCEL, and continue
running after exit has begun.
Move the check further inside, so it covers linked items too. Note: this
is a syzbot special as it loves setting up tons of slow linked work on
weird devices like msr that take forever to read, and immediately close
the ring. Exit then takes a long time.
Fixes: 10dc95939817 ("io_uring/io-wq: check IO_WQ_BIT_EXIT inside work run loop")
Cc: stable@vger.kernel.org
Signed-off-by: Runyu Xiao <runyu.xiao@seu.edu.cn>
Link: https://patch.msgid.link/20260527172203.2043962-1-runyu.xiao@seu.edu.cn
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
io_provide_buffers_prep() accepts nbufs up to MAX_BIDS_PER_BGID, but
io_add_buffers() stops when bl->nbufs reaches USHRT_MAX. This makes the
effective add limit one lower than the validated limit.
Use MAX_BIDS_PER_BGID in the add-side boundary check so validation and
execution use the same limit, and update the comment to refer to the
actual limit constant.
Signed-off-by: liyouhong <liyouhong@kylinos.cn>
Link: https://patch.msgid.link/20260528024936.3672659-1-dayou5941@163.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Add support for an optional stats struct embedded in the refill queue
region, allowing userspace to monitor copy-fallback in real-time.
Userspace queries the stats struct size and alignment via
IO_URING_QUERY_ZCRX_NOTIF (notif_stats_size / notif_stats_alignment),
then provides a stats_offset in zcrx_notification_desc pointing to a
location within the refill queue region.
The kernel updates the stats counters in-place on every copy-fallback
event.
Signed-off-by: Clément Léger <cleger@meta.com>
[pavel: rename io_uring_zcrx_notif_stats]
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://patch.msgid.link/f6af5a21015efea4b733b9d77aba22c637788fe4.1779189667.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Add a ZCRX_NOTIF_COPY notification type to signal userspace when a
received fragment could not be delivered using zero-copy and was
instead copied into a buffer.
Signed-off-by: Clément Léger <cleger@meta.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://patch.msgid.link/3d54bcd8bf10b3a1e88beb0cd39c40c3937bea4f.1779189667.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
There are currently no easy ways for the user to know if zcrx is out of
buffers and page pool fails to allocate. Add uapi for zcrx to communicate
it back.
It's implemented as a separate CQE, which for now is posted to the creator
ctx. To use it, on registration the user space needs to pass an instance
of struct zcrx_notification_desc, which tells the kernel the user_data
for resulting CQEs and which event types are expected / allowed.
When an allowed event happens, zcrx will post a CQE containing the
specified user_data, and lower bits of cqe->res will be set to the event
mask. Before the kernel could post another notification of the given
type, the user needs to acknowledge that it processed the previous one
by issuing IORING_REGISTER_ZCRX_CTRL with ZCRX_CTRL_ARM_NOTIFICATION.
The only notification type the patch implements is
ZCRX_NOTIF_NO_BUFFERS, but we'll need more of them in the future.
Co-developed-by: Vishwanath Seshagiri <vishs@meta.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Vishwanath Seshagiri <vishs@meta.com>
Link: https://patch.msgid.link/35cd307a03a43583838a2e151fc641c69abd786f.1779189667.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
zcrx will need to have a pointer to an owning ctx to communicate
different events. Reference the ctx while it's attached to zcrx, and
rely on zcrx termination to drop the ctx to avoid circular ref deps.
Co-developed-by: Vishwanath Seshagiri <vishs@meta.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Vishwanath Seshagiri <vishs@meta.com>
Link: https://patch.msgid.link/b60514b3d1bd92f571e3bd91751166f8c3599256.1779189667.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Currently, zcrx_export() allocates a file descriptor and copies the
control structure to userspace before the backing file is created.
While the operation returns an error on failure, it is cleaner to
follow the standard kernel pattern of performing the copy_to_user()
and fd_install() only after all resource allocations (like the
anon_inode) have succeeded. This aligns the code with other
fd-publishing paths in the VFS.
Signed-off-by: Bertie Tryner <Bertie.Tryner@warwick.ac.uk>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://patch.msgid.link/1513a3f4ae7161692ca6e991b9f01278a6bc60e4.1779189667.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
By the time io_zcrx_ifq_free() is called the interface queue should
already be closed, so io_close_queue() will be a no-op. Remove the call
and add a couple of warnings.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://patch.msgid.link/be6c4a283a5bab5440e22fbccafe7b885acb7abc.1779189667.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Nobody should be touching area and other pointers after zcrx
destruction, poison them instead of zeroing.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://patch.msgid.link/19112d1412539dcfc04a0317b5812e968623bc51.1779189667.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Currently, scrubbing is done once before killing all recvzc requests.
It's fine as those are cancelled and don't return buffers afterwards,
but it'll be more reliable not to rely that much on cancellations.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://patch.msgid.link/c4ea127023494cbbedebd21a2b7ae5ff0448eb95.1779189667.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
io_register_iowq_max_workers() walks ctx->tctx_list under ctx->tctx_lock
and dereferences each node's task->io_uring without a NULL check:
list_for_each_entry(node, &ctx->tctx_list, ctx_node) {
tctx = node->task->io_uring;
if (WARN_ON_ONCE(!tctx->io_wq))
continue;
...
}
__io_uring_add_tctx_node() installs the node into ctx->tctx_list (via
io_tctx_install_node(), which does the list_add() under tctx_lock) and
only assigns current->io_uring = tctx afterwards. A task doing its first
io_uring operation on a shared ring therefore has a window in which its
node is already visible on ctx->tctx_list while node->task->io_uring is
still NULL. A concurrent IORING_REGISTER_IOWQ_MAX_WORKERS on the same
ring reads that NULL and dereferences tctx->io_wq:
KASAN: null-ptr-deref in range [0x0000000000000018-0x000000000000001f]
RIP: io_register_iowq_max_workers io_uring/register.c:423
Publish current->io_uring = tctx before installing the node, so any node
visible on ctx->tctx_list always has a valid task->io_uring.
Fixes: 7880174e1e5e ("io_uring/tctx: clean up __io_uring_add_tctx_node() error handling")
Signed-off-by: Lim HyeonJun <shja0831@gmail.com>
Link: https://patch.msgid.link/20260524110853.115634-1-shja0831@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux
Pull io_uring fixes from Jens Axboe:
- Fix for an issue with IORING_OP_NOP and using injection results
- Fix for an issue in IORING_OP_WAITID, where the info state was
assumed cleared by the lower level syscall handler, but for some
cases it is not. Just clear the data upfront, so that non-initialized
data isn't copied back to userspace
- Fix for a lockdep reported issue, where IORING_OP_BIND enters file
create and hence hits mnt_want_write(), which creates a three part
lockdep cycle between the super lock, io_uring's uring_lock, and the
cred mutex
- Fix a regression introduced in this cycle with how linked timeouts
are deleted
- Ensure that the ->opcode nospec indexing on the opcode issue side
covers all the cases
* tag 'io_uring-7.1-20260522' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
io_uring/nop: pass all errors to userspace
io_uring/timeout: splice timed out link in timeout handler
io_uring: propagate array_index_nospec opcode into req->opcode
io_uring/waitid: clear waitid info before copying it to userspace
io_uring/net: punt IORING_OP_BIND async if it needs file create
|
|
This fixes an inconsistency where io_nop() called req_set_fail()
based on ret, but passed just nop->result to userspace.
Originally, ret is a even copy of nop->result, but is set to an error
when such happens subsequently. Now that's also passed to userspace.
Fixes: a85f31052bce ("io_uring/nop: add support for testing registered files and buffers")
Signed-off-by: Alexander A. Klimov <grandmaster@al2klimov.de>
Link: https://patch.msgid.link/20260520180045.538533-1-grandmaster@al2klimov.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
A previous commit deferred this to the task_work part of it, so it could
be protected by ->uring_lock. But that's actually not necessary here,
and in fact the head clearing is not enough to make that safe. For those
two reasons, just re-instate the local splicing.
Fixes: 49ae66eb8c27 ("io_uring: defer linked-timeout chain splice out of hrtimer context")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Commit 1e988c3fe126 ("io_uring: prevent opcode speculation") added
array_index_nospec() to io_init_req(), but applied it only to a local
opcode variable. req->opcode is initialized from sqe->opcode before the
bounds check and remains the raw value.
Keep req->opcode as the canonical opcode in io_init_req(): reject
out-of-range values architecturally, then write the array_index_nospec()
result back to req->opcode before any table lookup. This keeps downstream
users of req->opcode from observing the raw user byte on a mispredicted
path.
No functional change: array_index_nospec() is a no-op for opcodes in
[0, IORING_OP_LAST), and out-of-range opcodes are still rejected at the
bounds check above the assignment.
Fixes: 1e988c3fe126 ("io_uring: prevent opcode speculation")
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com>
Link: https://patch.msgid.link/20260517213010.696135-1-michael.bommarito@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
IORING_OP_WAITID stores its result fields in struct io_waitid::info and
later copies them to userspace siginfo. The prep path initializes the
request arguments, but it does not initialize info itself.
If the wait operation completes without reporting a child event, the common
wait code can return without writing wo_info. In that case io_waitid_finish()
still copies iw->info to userspace, exposing stale bytes from the reused
io_kiocb command storage.
Clear the result storage during prep so the io_uring path matches the
regular waitid syscall, which uses a zero-initialized struct waitid_info.
Fixes: f31ecf671ddc ("io_uring: add IORING_OP_WAITID support")
Cc: stable@vger.kernel.org # 6.7+
Signed-off-by: Heechan Kang <gganji11@naver.com>
Link: https://patch.msgid.link/20260516184709.852814-1-gganji11@naver.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux
Pull io_uring fixes from Jens Axboe:
- Small series sanitizing the locking done for either modifying or
reading a chain of requests
- If the application has a pid namespace, ensure that the sqthread pid
is correctly printed in fdinfo
- Fix for a hashing issue in the io-wq thread pool, which could lead to
a use-after-free
- Kill dead argument from io_prep_rw_pi()
- Fix for a missed validation of the CQ ring head, affecting CQE refill
* tag 'io_uring-7.1-20260515' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
io_uring: validate user-controlled cq.head in io_cqe_cache_refill()
io-wq: check that the predecessor is hashed in io_wq_remove_pending()
io_uring/rw: drop unused attr_type_mask from io_prep_rw_pi()
io_uring: hold uring_lock across io_kill_timeouts() in cancel path
io_uring: defer linked-timeout chain splice out of hrtimer context
io_uring: hold uring_lock when walking link chain in io_wq_free_work()
io_uring/fdinfo: translate SqThread PID through caller's pid_ns
|
|
For two reasons:
1) An opcode cannot block inside io_uring_enter() doing submissions, as
it'll stall the submission side pipeline.
2) Ending up in sb_start_write() -> __sb_start_write() ->
percpu_down_read_freezable() introduces a new lockdep edge, which it
correctly complains about.
Check if the socket type is AF_UNIX and has a non-empty pathname. If it
does, mark it REQ_F_FORCE_ASYNC to punt the submission to io-wq rather
than attempt to do it inline.
Fixes: 7481fd93fa0a ("io_uring: Introduce IORING_OP_BIND")
Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|