path: root/block
Age | Commit message | Author
2026-01-20 | mm/block/fs: remove laptop_mode | Johannes Weiner
Laptop mode was introduced to save battery, by delaying and consolidating writes and thereby maximize the time rotating hard drives wouldn't have to spin. Luckily, rotating hard drives, with their high spin-up times and power draw, are a thing of the past for battery-powered devices. Reclaim has also since changed to not write single filesystem pages anymore, and regular filesystem writeback is lumpy by design. The juice doesn't appear worth the squeeze anymore. The footprint of the feature is small, but nevertheless it's a complicating factor in mm, block, filesystems. Developers don't think about it, and it likely hasn't been tested with new reclaim and writeback changes in years. Let's sunset it. Keep the sysctl with a deprecation warning around for a few more cycles, but remove all functionality behind it. [akpm@linux-foundation.org: fix Documentation/admin-guide/laptops/index.rst] Link: https://lkml.kernel.org/r/20251216185201.GH905277@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Suggested-by: Christoph Hellwig <hch@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Deepanshu Kartikey <kartikey406@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-20 | block: pass io_comp_batch to rq_end_io_fn callback | Ming Lei
Add a third parameter 'const struct io_comp_batch *' to the rq_end_io_fn callback signature. This allows end_io handlers to access the completion batch context when requests are completed via blk_mq_end_request_batch(). The io_comp_batch is passed from blk_mq_end_request_batch(), while NULL is passed from __blk_mq_end_request() and blk_mq_put_rq_ref() which don't have batch context. This infrastructure change enables drivers to detect whether they're being called from a batched completion path (like iopoll) and access additional context stored in the io_comp_batch.
Update all rq_end_io_fn implementations:
- block/blk-mq.c: blk_end_sync_rq
- block/blk-flush.c: flush_end_io, mq_flush_data_end_io
- drivers/nvme/host/ioctl.c: nvme_uring_cmd_end_io
- drivers/nvme/host/core.c: nvme_keep_alive_end_io
- drivers/nvme/host/pci.c: abort_endio, nvme_del_queue_end, nvme_del_cq_end
- drivers/nvme/target/passthru.c: nvmet_passthru_req_done
- drivers/scsi/scsi_error.c: eh_lock_door_done
- drivers/scsi/sg.c: sg_rq_end_io
- drivers/scsi/st.c: st_scsi_execute_end
- drivers/target/target_core_pscsi.c: pscsi_req_done
- drivers/md/dm-rq.c: end_clone_request
Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Kanchan Joshi <joshi.k@samsung.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
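The callback change described above can be sketched in userspace C. This is only an illustration under stated assumptions: the structure contents, the `int error` argument, `demo_end_io`, `complete_batched` and `complete_single` are hypothetical stand-ins, not the kernel's definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; the real kernel structures carry far more state. */
struct io_comp_batch { int nr_completed; };
struct request { int tag; };

enum rq_end_io_ret { RQ_END_IO_NONE, RQ_END_IO_FREE };

/* Simplified callback type with the new third parameter. */
typedef enum rq_end_io_ret (*rq_end_io_fn)(struct request *rq, int error,
                                           const struct io_comp_batch *iob);

/* Illustrative handler state: records whether the last call was batched. */
bool last_call_was_batched;

enum rq_end_io_ret demo_end_io(struct request *rq, int error,
                               const struct io_comp_batch *iob)
{
    assert(rq != NULL);
    (void)error;
    /* A non-NULL batch means the call came from a batched completion path. */
    last_call_was_batched = (iob != NULL);
    return RQ_END_IO_NONE;
}

/* Model of a batched call site: forwards the batch context. */
void complete_batched(struct request *rq, rq_end_io_fn fn,
                      const struct io_comp_batch *iob)
{
    fn(rq, 0, iob);
}

/* Model of a non-batched call site: there is no batch, so pass NULL. */
void complete_single(struct request *rq, rq_end_io_fn fn)
{
    fn(rq, 0, NULL);
}
```

A handler written this way has to tolerate a NULL batch pointer, since the commit text says the non-batched paths pass NULL.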
2026-01-19 | block: Fix an error path in disk_update_zone_resources() | Bart Van Assche
Any queue_limits_start_update() call must be followed either by a queue_limits_commit_update() call or by a queue_limits_cancel_update() call. Make sure that the error path near the start of disk_update_zone_resources() follows this requirement. Remove the "goto unfreeze" statement from that error path to make the code easier to verify.
This was detected by annotating the queue_limits_*() calls with Clang thread-safety attributes and by building the kernel with thread-safety checking enabled. Without this patch and with thread-safety checking enabled, the following error is reported:
  block/blk-zoned.c:2020:1: error: mutex 'disk->queue->limits_lock' is not held on every path through here [-Werror,-Wthread-safety-analysis]
   2020 | }
        | ^
  block/blk-zoned.c:1959:8: note: mutex acquired here
   1959 | lim = queue_limits_start_update(q);
        | ^
Cc: Damien Le Moal <dlemoal@kernel.org> Cc: Christoph Hellwig <hch@lst.de> Fixes: bba4322e3f30 ("block: freeze queue when updating zone resources") Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20260114192803.4171847-3-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
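The start/commit/cancel pairing rule can be illustrated with a small userspace model. All names here (`struct queue`, `limits_start_update`, `limits_commit_update`, `limits_cancel_update`, `update_zone_resources`) are hypothetical, and a boolean stands in for the real limits mutex; this is a sketch of the rule, not the kernel code.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical miniature of a queue whose limits are guarded by a lock. */
struct queue_limits { unsigned int max_open_zones; };
struct queue {
    bool limits_locked;            /* stands in for the limits mutex */
    struct queue_limits limits;
};

/* Begin an update: take the lock and hand back a working copy. */
struct queue_limits limits_start_update(struct queue *q)
{
    assert(!q->limits_locked);
    q->limits_locked = true;
    return q->limits;
}

/* Finish an update by applying the new limits and dropping the lock. */
int limits_commit_update(struct queue *q, const struct queue_limits *lim)
{
    q->limits = *lim;
    q->limits_locked = false;
    return 0;
}

/* Abandon an update: drop the lock without applying anything. */
void limits_cancel_update(struct queue *q)
{
    q->limits_locked = false;
}

/* Every exit after limits_start_update() goes through commit or cancel. */
int update_zone_resources(struct queue *q, unsigned int nr_zones)
{
    struct queue_limits lim = limits_start_update(q);

    if (nr_zones == 0) {
        /* Early error path: cancel rather than return with the lock held. */
        limits_cancel_update(q);
        return -EINVAL;
    }
    lim.max_open_zones = nr_zones;
    return limits_commit_update(q, &lim);
}
```

Keeping the cancel call next to the early return, instead of jumping to a shared label, is what makes each path easy to check by eye or by a static analyzer.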
2026-01-18 | Merge branch 'for-7.0/blk-pvec' into for-7.0/block | Jens Axboe
* for-7.0/blk-pvec:
  types: move phys_vec definition to common header
  nvme-pci: Use size_t for length fields to handle larger sizes
2026-01-16 | Merge tag 'block-6.19-20260116' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux | Linus Torvalds
Pull block fixes from Jens Axboe:
- NVMe pull request via Keith:
  - Device quirk to disable faulty temperature (Ilikara)
  - TCP target null pointer fix from bad host protocol usage (Shivam)
  - Add apple,t8103-nvme-ans2 as a compatible apple controller (Janne)
  - FC tagset leak fix (Chaitanya)
  - TCP socket deadlock fix (Hannes)
  - Target name buffer overrun fix (Shin'ichiro)
- Fix for an underflow for rnbd during device unmap
- Zero the non-PI part of the auto integrity buffer
- Fix for a configfs memory leak in the null block driver
* tag 'block-6.19-20260116' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
  rnbd-clt: fix refcount underflow in device unmap path
  nvme: fix PCIe subsystem reset controller state transition
  nvmet: do not copy beyond sybsysnqn string length
  nvmet-tcp: fixup hang in nvmet_tcp_listen_data_ready()
  null_blk: fix kmemleak by releasing references to fault configfs items
  block: zero non-PI portion of auto integrity buffer
  nvme-fc: release admin tagset if init fails
  nvme-apple: add "apple,t8103-nvme-ans2" as compatible
  nvme-tcp: fix NULL pointer dereferences in nvmet_tcp_build_pdu_iovec
  nvme-pci: disable secondary temp for Wodposit WPBSNM8
2026-01-15 | block: improve blk_op_str() comment | Damien Le Moal
Replace XXX with what it actually means. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-15 | block: fix blk_zone_cond_str() comment | Damien Le Moal
Fix the comment for blk_zone_cond_str() by replacing the meaningless BLK_ZONE_ZONE_XXX comment with the correct BLK_ZONE_COND_name, thus also replacing the XXX with what that actually means. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-13 | block, nvme: remove unused dma_iova_state function parameter | Nitesh Shetty
DMA IOVA state is not used inside blk_rq_dma_map_iter_next, get rid of the argument. Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-11 | Merge branch 'block-6.19' into for-7.0/block | Jens Axboe
Merge in fixes that went to 6.19 after for-7.0/block was branched. Pending ublk changes depend in particular on the async scan work.
* block-6.19:
  block: zero non-PI portion of auto integrity buffer
  ublk: fix use-after-free in ublk_partition_scan_work
  blk-mq: avoid stall during boot due to synchronize_rcu_expedited
  loop: add missing bd_abort_claiming in loop_set_status
  block: don't merge bios with different app_tags
  blk-rq-qos: Remove unlikely() hints from QoS checks
  loop: don't change loop device under exclusive opener in loop_set_status
  block, bfq: update outdated comment
  blk-mq: skip CPU offline notify on unmapped hctx
  selftests/ublk: fix Makefile to rebuild on header changes
  selftests/ublk: add test for async partition scan
  ublk: scan partition in async way
  block,bfq: fix aux stat accumulation destination
  md: Fix forward incompatibility from configurable logical block size
  md: Fix logical_block_size configuration being overwritten
  md: suspend array while updating raid_disks via sysfs
  md/raid5: fix possible null-pointer dereferences in raid5_store_group_thread_cnt()
  md: Fix static checker warning in analyze_sbs
2026-01-11 | blk-crypto: handle the fallback above the block layer | Christoph Hellwig
Add a blk_crypto_submit_bio helper that either submits the bio when it is not encrypted or inline encryption is provided, but otherwise handles the encryption before going down into the low-level driver. This reduces the risk from bio reordering and keeps memory allocation as high up in the stack as possible. Note that if the submitter knows that inline encryption is supported by the underlying driver, it can still use plain submit_bio. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-11 | blk-crypto: optimize data unit alignment checking | Christoph Hellwig
Avoid the relatively high overhead of constructing and walking per-page segment bio_vecs for data unit alignment checking by merging the checks into existing loops. For hardware-supported crypto, perform the check in bio_split_io_at, which already contains a similar alignment check applied for all I/O. This means bio-based drivers that do not call bio_split_to_limits, should they ever grow blk-crypto support, need to implement the check themselves, just like all other queue limits checks. For blk-crypto-fallback do it in the encryption/decryption loops. This means alignment errors for decryption will only be detected after I/O has completed, but that seems like a worthwhile trade-off. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-11 | blk-crypto: use mempool_alloc_bulk for encrypted bio page allocation | Christoph Hellwig
Calling mempool_alloc in a loop is not safe unless the maximum allocation size times the maximum number of threads using it is less than the minimum pool size. Use the new mempool_alloc_bulk helper to allocate all missing elements in one pass to remove this deadlock risk. This also means that non-pool allocations now use alloc_pages_bulk which can be significantly faster than a loop over individual page allocations. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-11 | blk-crypto: use on-stack skcipher requests for fallback en/decryption | Christoph Hellwig
Allocating a skcipher request dynamically can deadlock or cause unexpected I/O failures when called from writeback context. Avoid the allocation entirely by using on-stack skciphers, similar to what the non-blk-crypto fscrypt path already does. This drops the incomplete support for asynchronous algorithms, which previously could be used, but only synchronously. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-11 | blk-crypto: optimize bio splitting in blk_crypto_fallback_encrypt_bio | Christoph Hellwig
The current code in blk_crypto_fallback_encrypt_bio is inefficient and prone to deadlocks under memory pressure: It first walks the passed in plaintext bio to see how much of it can fit into a single encrypted bio using up to BIO_MAX_VEC PAGE_SIZE segments, and then allocates a plaintext clone that fits the size, only to allocate another bio for the ciphertext later. While the plaintext clone uses a bioset to avoid deadlocks when allocations could fail, the ciphertext one uses bio_kmalloc which is a no-go in the file system I/O path. Switch blk_crypto_fallback_encrypt_bio to walk the source plaintext bio while consuming bi_iter without cloning it, and instead allocate a ciphertext bio at the beginning and whenever we fill up the previous one. The existing bio_set for the plaintext clones is reused for the ciphertext bios to remove the deadlock risk. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-11 | blk-crypto: submit the encrypted bio in blk_crypto_fallback_bio_prep | Christoph Hellwig
Restructure blk_crypto_fallback_bio_prep so that it always submits the encrypted bio instead of passing it back to the caller. This simplifies the calling conventions for blk_crypto_fallback_bio_prep and blk_crypto_bio_prep: they never have to return a bio, and instead return true to indicate that the caller should submit the bio, and false to indicate that the blk-crypto code consumed it. The submission is handled by the on-stack bio list in the current task_struct by the block layer and does not cause additional stack usage or major overhead. It also prepares for the following optimization and fixes for the blk-crypto fallback write path. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-10 | block: account for bi_bvec_done in bio_may_need_split() | Ming Lei
When checking if a bio fits in a single segment, bio_may_need_split() compares bi_size against the current bvec's bv_len. However, for partially consumed bvecs (bi_bvec_done > 0), such as in cloned or split bios, the number of remaining bytes in the current bvec is actually (bv_len - bi_bvec_done), not bv_len. This could cause bio_may_need_split() to incorrectly return false, leading to nr_phys_segments being set to 1 when the bio actually spans multiple segments. This triggers the WARN_ON in __blk_rq_map_sg() when the actual mapped segments exceed the expected count. Fix by subtracting bi_bvec_done from bv_len in the comparison. Reported-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com> Closes: https://lore.kernel.org/linux-block/9687cf2b-1f32-44e1-b58d-2492dc6e7185@linux.ibm.com/ Reported-and-bisected-by: Christoph Hellwig <hch@infradead.org> Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com> Tested-by: Christoph Hellwig <hch@infradead.org> Fixes: ee623c892aa5 ("block: use bvec iterator helper for bio_may_need_split()") Cc: Nitesh Shetty <nj.shetty@samsung.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
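The arithmetic behind this fix can be shown with a reduced model. The structures and both function names below are hypothetical, and only the length comparison is modelled; the real bio_may_need_split() performs other checks that are left out here.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, trimmed-down versions of the kernel's bvec and iterator. */
struct bvec { unsigned int bv_len; };
struct bvec_iter_model {
    unsigned int bi_size;       /* bytes left in the whole bio */
    unsigned int bi_bvec_done;  /* bytes already consumed in the current bvec */
};

/* Old comparison: ignores the bytes already consumed in the current bvec. */
bool may_need_split_old(const struct bvec *bv, const struct bvec_iter_model *it)
{
    return it->bi_size > bv->bv_len;
}

/* Fixed comparison: only (bv_len - bi_bvec_done) bytes remain in this bvec. */
bool may_need_split_fixed(const struct bvec *bv, const struct bvec_iter_model *it)
{
    assert(it->bi_bvec_done <= bv->bv_len);
    return it->bi_size > bv->bv_len - it->bi_bvec_done;
}
```

For example, with bv_len = 4096, bi_bvec_done = 1024 and bi_size = 4096, only 3072 bytes remain in the current bvec, so the bio must continue into the next one. The old comparison sees 4096 > 4096 as false, while the fixed one sees 4096 > 3072 as true.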
2026-01-10 | block: use pi_tuple_size in bi_offload_capable() | Caleb Sander Mateos
bi_offload_capable() returns whether a block device's metadata size matches its PI tuple size. Use pi_tuple_size instead of switching on csum_type. This makes the code considerably simpler and less branchy. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Anuj Gupta <anuj20.g@samsung.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-10 | block: zero non-PI portion of auto integrity buffer | Caleb Sander Mateos
The auto-generated integrity buffer for writes needs to be fully initialized before being passed to the underlying block device, otherwise the uninitialized memory can be read back by userspace or anyone with physical access to the storage device. If protection information is generated, that portion of the integrity buffer is already initialized. The integrity data is also zeroed if PI generation is disabled via sysfs or the PI tuple size is 0. However, this misses the case where PI is generated and the PI tuple size is nonzero, but the metadata size is larger than the PI tuple. In this case, the remainder ("opaque") of the metadata is left uninitialized. Generalize the BLK_INTEGRITY_CSUM_NONE check to cover any case when the metadata is larger than just the PI tuple. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Fixes: c546d6f43833 ("block: only zero non-PI metadata tuples in bio_integrity_prep") Reviewed-by: Anuj Gupta <anuj20.g@samsung.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
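A minimal sketch of the generalized zeroing condition, assuming a hypothetical `struct integrity_profile` whose field names follow the commit text. The helpers `needs_zeroing` and `prep_integrity_buf` are illustrative only and are not the kernel's functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical integrity profile; field names follow the commit text. */
struct integrity_profile {
    size_t metadata_size;  /* bytes of metadata per interval */
    size_t pi_tuple_size;  /* bytes of that metadata covered by the PI tuple */
    bool   generate;       /* false when PI generation is disabled */
};

/* True when some metadata bytes would not be written by PI generation. */
bool needs_zeroing(const struct integrity_profile *bi)
{
    if (!bi->generate)
        return true;                               /* nothing is generated */
    return bi->metadata_size > bi->pi_tuple_size;  /* opaque bytes remain */
}

/* Prepare a buffer for nr_intervals intervals before PI generation runs. */
void prep_integrity_buf(unsigned char *buf, size_t nr_intervals,
                        const struct integrity_profile *bi)
{
    assert(buf != NULL);
    if (needs_zeroing(bi))
        memset(buf, 0, nr_intervals * bi->metadata_size);
}
```

The case with no PI tuple at all (pi_tuple_size of 0) falls out of the same size comparison, which is the generalization the commit describes.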
2026-01-09 | Merge tag 'block-6.19-20260109' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux | Linus Torvalds
Pull block fixes from Jens Axboe:
- Kill unlikely checks for blk-rq-qos. These checks are really all-or-nothing, either the branch is taken all the time, or it's not. Depending on the configuration, either one of those cases may be true. Just remove the annotation
- Fix for merging bios with different app tags set
- Fix for a recently introduced slowdown due to RCU synchronization
- Fix for a status change on loop while it's in use, and then a later fix for that fix
- Fix for the async partition scanning in ublk
* tag 'block-6.19-20260109' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
  ublk: fix use-after-free in ublk_partition_scan_work
  blk-mq: avoid stall during boot due to synchronize_rcu_expedited
  loop: add missing bd_abort_claiming in loop_set_status
  block: don't merge bios with different app_tags
  blk-rq-qos: Remove unlikely() hints from QoS checks
  loop: don't change loop device under exclusive opener in loop_set_status
2026-01-07 | blk-mq: avoid stall during boot due to synchronize_rcu_expedited | Mikulas Patocka
On the kernel 6.19-rc, I am experiencing a 15-second boot stall in a virtual machine when probing a virtio-scsi disk:
  [ 1.011641] SCSI subsystem initialized
  [ 1.013972] virtio_scsi virtio6: 16/0/0 default/read/poll queues
  [ 1.015983] scsi host0: Virtio SCSI HBA
  [ 1.019578] ACPI: \_SB_.GSIA: Enabled at IRQ 16
  [ 1.020225] ahci 0000:00:1f.2: AHCI vers 0001.0000, 32 command slots, 1.5 Gbps, SATA mode
  [ 1.020228] ahci 0000:00:1f.2: 6/6 ports implemented (port mask 0x3f)
  [ 1.020230] ahci 0000:00:1f.2: flags: 64bit ncq only
  [ 1.024688] scsi host1: ahci
  [ 1.025432] scsi host2: ahci
  [ 1.025966] scsi host3: ahci
  [ 1.026511] scsi host4: ahci
  [ 1.028371] scsi host5: ahci
  [ 1.028918] scsi host6: ahci
  [ 1.029266] ata1: SATA max UDMA/133 abar m4096@0xfea23000 port 0xfea23100 irq 16 lpm-pol 1
  [ 1.029305] ata2: SATA max UDMA/133 abar m4096@0xfea23000 port 0xfea23180 irq 16 lpm-pol 1
  [ 1.029316] ata3: SATA max UDMA/133 abar m4096@0xfea23000 port 0xfea23200 irq 16 lpm-pol 1
  [ 1.029327] ata4: SATA max UDMA/133 abar m4096@0xfea23000 port 0xfea23280 irq 16 lpm-pol 1
  [ 1.029341] ata5: SATA max UDMA/133 abar m4096@0xfea23000 port 0xfea23300 irq 16 lpm-pol 1
  [ 1.029356] ata6: SATA max UDMA/133 abar m4096@0xfea23000 port 0xfea23380 irq 16 lpm-pol 1
  [ 1.118111] scsi 0:0:0:0: Direct-Access QEMU QEMU HARDDISK 2.5+ PQ: 0 ANSI: 5
  [ 1.348916] ata1: SATA link down (SStatus 0 SControl 300)
  [ 1.350713] ata2: SATA link down (SStatus 0 SControl 300)
  [ 1.351025] ata6: SATA link down (SStatus 0 SControl 300)
  [ 1.351160] ata5: SATA link down (SStatus 0 SControl 300)
  [ 1.351326] ata3: SATA link down (SStatus 0 SControl 300)
  [ 1.351536] ata4: SATA link down (SStatus 0 SControl 300)
  [ 1.449153] input: ImExPS/2 Generic Explorer Mouse as /devices/platform/i8042/serio1/input/input2
  [ 16.483477] sd 0:0:0:0: Power-on or device reset occurred
  [ 16.483691] sd 0:0:0:0: [sda] 2097152 512-byte logical blocks: (1.07 GB/1.00 GiB)
  [ 16.483762] sd 0:0:0:0: [sda] Write Protect is off
  [ 16.483877] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
  [ 16.569225] sd 0:0:0:0: [sda] Attached SCSI disk
I bisected it and it is caused by the commit 89e1fb7ceffd which introduces calls to synchronize_rcu_expedited. This commit replaces synchronize_rcu_expedited and kfree with a call to kfree_rcu_mightsleep, avoiding the 15-second delay. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Fixes: 89e1fb7ceffd ("blk-mq: fix potential uaf for 'queue_hw_ctx'") Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-07 | block: don't initialize bi_vcnt for cloned bio in bio_iov_bvec_set() | Ming Lei
bio_iov_bvec_set() creates a cloned bio that borrows a bvec array from an iov_iter. For cloned bios, bi_vcnt is meaningless because iteration is controlled entirely by bi_iter (bi_idx, bi_size, bi_bvec_done), not by bi_vcnt. Remove the incorrect bi_vcnt assignment. Explicitly initialize bi_iter.bi_idx to 0 to ensure iteration starts at the first bvec. While bi_idx is typically already zero from bio initialization, making this explicit improves clarity and correctness. This change also avoids accessing iter->nr_segs, which is an iov_iter implementation detail that block code should not depend on. Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-07 | block: use bvec iterator helper for bio_may_need_split() | Ming Lei
bio_may_need_split() uses bi_vcnt to determine if a bio has a single segment, but bi_vcnt is unreliable for cloned bios. Cloned bios share the parent's bi_io_vec array but iterate over a subset via bi_iter, so bi_vcnt may not reflect the actual segment count being iterated. Replace the bi_vcnt check with bvec iterator access via __bvec_iter_bvec(), comparing bi_iter.bi_size against the current bvec's length. This correctly handles both cloned and non-cloned bios. Move bi_io_vec into the first cache line adjacent to bi_iter. This is a sensible layout since bi_io_vec and bi_iter are commonly accessed together throughout the block layer - every bvec iteration requires both fields. This displaces bi_end_io to the second cache line, which is acceptable since bi_end_io and bi_private are always fetched together in bio_endio() anyway. The struct layout change requires bio_reset() to preserve and restore bi_io_vec across the memset, since it now falls within BIO_RESET_BYTES. Nitesh verified that this patch doesn't regress NVMe 512-byte IO perf [1]. Link: https://lore.kernel.org/linux-block/20251220081607.tvnrltcngl3cc2fh@green245.gost/ [1] Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-06 | block: don't merge bios with different app_tags | Caleb Sander Mateos
nvme_set_app_tag() uses the app_tag value from the bio_integrity_payload of the struct request's first bio. This assumes all the request's bios have the same app_tag. However, it is possible for bios with different app_tag values to be merged into a single request. Add a check in blk_integrity_merge_{bio,rq}() to prevent the merging of bios/requests with different app_tag values if BIP_CHECK_APPTAG is set. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Fixes: 3d8b5a22d404 ("block: add support to pass user meta buffer") Signed-off-by: Jens Axboe <axboe@kernel.dk>
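The merge rule can be sketched as a small predicate. `struct integrity_payload`, the `DEMO_BIP_CHECK_APPTAG` value and `integrity_can_merge` are hypothetical, and the flags-equality test is an assumption of this sketch rather than something the commit message states.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_BIP_CHECK_APPTAG (1u << 0)  /* hypothetical flag value */

/* Hypothetical slice of an integrity payload. */
struct integrity_payload {
    unsigned int flags;
    uint16_t     app_tag;
};

/*
 * Two payloads may merge only if they agree on flags (an assumption of this
 * sketch) and, when the app tag is checked, on the app_tag value itself.
 */
bool integrity_can_merge(const struct integrity_payload *a,
                         const struct integrity_payload *b)
{
    assert(a != 0 && b != 0);
    if (a->flags != b->flags)
        return false;
    if ((a->flags & DEMO_BIP_CHECK_APPTAG) && a->app_tag != b->app_tag)
        return false;
    return true;
}
```

When the check flag is clear, differing app_tag values do not block a merge, since the tag is not verified in that case.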
2026-01-06 | blk-rq-qos: Remove unlikely() hints from QoS checks | Breno Leitao
The unlikely() annotations on QUEUE_FLAG_QOS_ENABLED checks are counterproductive. Writeback throttling (WBT) might be enabled by default, mainly because CONFIG_BLK_WBT_MQ defaults to 'y'. Branch profiling on Meta servers, which have WBT enabled, confirms 100% misprediction rates on these checks. Remove the unlikely() annotations to let the CPU's branch predictor learn the actual behavior, potentially improving I/O path performance. Signed-off-by: Breno Leitao <leitao@debian.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-06 | types: move phys_vec definition to common header | Leon Romanovsky
Move the struct phys_vec definition from block/blk-mq-dma.c to include/linux/types.h to make it available for use across the kernel. The phys_vec structure represents a physical address range with a length, which is used by the new physical address-based DMA mapping API. This structure is already used by the block layer and will be needed for DMA phys API users. Moving this definition to types.h provides a centralized location for this common data structure and eliminates code duplication across subsystems that need to work with physical address ranges. Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-06 | nvme-pci: Use size_t for length fields to handle larger sizes | Leon Romanovsky
This patch changes the length variables from unsigned int to size_t. Using size_t ensures that we can handle larger sizes, as size_t is always equal to or larger than the previously used u32 type. Originally, u32 was used because blk-mq-dma code evolved from scatter-gather implementation, which uses unsigned int to describe length. This change will also allow us to reuse the existing struct phys_vec in places that don't need scatter-gather. Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-02 | Merge tag 'block-6.19-20260102' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux | Linus Torvalds
Pull block fixes from Jens Axboe:
- Scan partition tables asynchronously for ublk, similarly to how nvme does it. This avoids potential deadlocks, which is why nvme does it that way too. Includes a set of selftests as well.
- MD pull request via Yu:
  - Fix null-pointer dereference in raid5 sysfs group_thread_cnt store (Tuo Li)
  - Fix possible mempool corruption during raid1 raid_disks update via sysfs (FengWei Shih)
  - Fix logical_block_size configuration being overwritten during super_1_validate() (Li Nan)
  - Fix forward incompatibility with configurable logical block size: arrays assembled on new kernels could not be assembled on older kernels (v6.18 and before) due to non-zero reserved pad rejection (Li Nan)
  - Fix static checker warning about iterator not incremented (Li Nan)
- Skip CPU offlining notifications on unmapped hardware queues
- bfq-iosched block stats fix
- Fix outdated comment in bfq-iosched
* tag 'block-6.19-20260102' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
  block, bfq: update outdated comment
  blk-mq: skip CPU offline notify on unmapped hctx
  selftests/ublk: fix Makefile to rebuild on header changes
  selftests/ublk: add test for async partition scan
  ublk: scan partition in async way
  block,bfq: fix aux stat accumulation destination
  md: Fix forward incompatibility from configurable logical block size
  md: Fix logical_block_size configuration being overwritten
  md: suspend array while updating raid_disks via sysfs
  md/raid5: fix possible null-pointer dereferences in raid5_store_group_thread_cnt()
  md: Fix static checker warning in analyze_sbs
2026-01-01 | block, bfq: update outdated comment | Julia Lawall
The function bfq_bfqq_may_idle() was renamed as bfq_better_to_idle() in commit 277a4a9b56cd ("block, bfq: give a better name to bfq_bfqq_may_idle"). Update the comment accordingly. Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-12-30 | blk-mq: skip CPU offline notify on unmapped hctx | Cong Zhang
If an hctx has no software ctx mapped, blk_mq_map_swqueue() never allocates tags and leaves hctx->tags NULL. The CPU hotplug offline notifier can still run for that hctx. Return early in that case, since such an hctx cannot hold any requests. Signed-off-by: Cong Zhang <cong.zhang@oss.qualcomm.com> Fixes: bf0beec0607d ("blk-mq: drain I/O when all CPUs in a hctx are offline") Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
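The early return can be sketched as follows. `struct hw_queue_model`, `struct tag_set_model` and `hctx_notify_offline` are hypothetical names, and the `drained` field is a placeholder for the real work of waiting for in-flight requests.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for a hardware queue and its tag set. */
struct tag_set_model { int nr_tags; };
struct hw_queue_model {
    struct tag_set_model *tags;  /* NULL when no software ctx is mapped */
    int drained;                 /* set once the drain step has run */
};

/* Offline notifier sketch: returns 0 on success. */
int hctx_notify_offline(struct hw_queue_model *hctx)
{
    assert(hctx != NULL);
    if (hctx->tags == NULL)
        return 0;        /* unmapped queue: no requests can be in flight */
    hctx->drained = 1;   /* placeholder for "wait for in-flight requests" */
    return 0;
}
```

Checking for the unmapped case first keeps any later code from dereferencing the NULL tags pointer.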
2025-12-28 | block,bfq: fix aux stat accumulation destination | shechenglong
Route bfqg_stats_add_aux() time accumulation into the destination stats object instead of the source, aligning with other stat fields. Reviewed-by: Yu Kuai <yukuai@fnnas.com> Signed-off-by: shechenglong <shechenglong@xfusion.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-12-20 | Merge tag 'block-6.19-20251218' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux | Linus Torvalds
Pull block fixes from Jens Axboe:
- ublk selftests for missing coverage
- two fixes for the block integrity code
- fix for the newly added PR read keys ioctl, limiting the memory that can be allocated
- work around for a deadlock that can occur with ublk, where partition scanning ends up recursing back into file closure, which needs the same mutex grabbed. Not the prettiest thing in the world, but an acceptable work-around until we can eliminate the reliance on disk->open_mutex for this
- fix for a race between enabling writeback throttling and new IO submissions
- move a bit of bio flag handling code. No changes, but needed for a patchset for a future kernel
- fix for an init time id leak failure in rnbd
- loop/zloop state check fix
* tag 'block-6.19-20251218' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
  block: validate interval_exp integrity limit
  block: validate pi_offset integrity limit
  block: rnbd-clt: Fix leaked ID in init_dev()
  ublk: fix deadlock when reading partition table
  block: add allocation size check in blkdev_pr_read_keys()
  Documentation: admin-guide: blockdev: replace zone_capacity with zone_capacity_mb when creating devices
  zloop: use READ_ONCE() to read lo->lo_state in queue_rq path
  loop: use READ_ONCE() to read lo->lo_state without locking
  block: fix race between wbt_enable_default and IO submission
  selftests: ublk: add user copy test cases
  selftests: ublk: add support for user copy to kublk
  selftests: ublk: forbid multiple data copy modes
  selftests: ublk: don't share backing files between ublk servers
  selftests: ublk: use auto_zc for PER_IO_DAEMON tests in stress_04
  selftests: ublk: fix fio arguments in run_io_and_recover()
  selftests: ublk: remove unused ios map in seq_io.bt
  selftests: ublk: correct last_rw map type in seq_io.bt
  selftests: ublk: fix overflow in ublk_queue_auto_zc_fallback()
  block: move around bio flagging helpers
2025-12-18 | block: validate interval_exp integrity limit | Caleb Sander Mateos
Various code assumes that the integrity interval is at least 1 sector and evenly divides the logical block size. Add these checks to blk_validate_integrity_limits(). This guards against block drivers that report invalid interval_exp values. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
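The two conditions named above can be written as a single predicate. This sketch assumes a power-of-two logical block size and 512-byte sectors; `interval_exp_valid` and `DEMO_SECTOR_SHIFT` are hypothetical names. Any special handling the kernel may apply to particular values, such as a default when the field is unset, is not modelled.

```c
#include <assert.h>
#include <stdbool.h>

#define DEMO_SECTOR_SHIFT 9  /* 512-byte sectors */

/* logical_block_size is assumed to be a non-zero power of two. */
bool interval_exp_valid(unsigned int interval_exp,
                        unsigned int logical_block_size)
{
    assert(logical_block_size != 0 &&
           (logical_block_size & (logical_block_size - 1)) == 0);
    if (interval_exp < DEMO_SECTOR_SHIFT)
        return false;   /* interval would be shorter than one sector */
    if (interval_exp >= 32)
        return false;   /* would overflow the 32-bit shift below */
    /* A power-of-two interval divides the block size iff it is no larger. */
    return (1u << interval_exp) <= logical_block_size;
}
```

Because both quantities are powers of two, the divisibility check reduces to a size comparison.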
2025-12-18 | block: validate pi_offset integrity limit | Caleb Sander Mateos
The PI tuple must be contained within the metadata value, so validate that pi_offset + pi_tuple_size <= metadata_size. This guards against block drivers that report invalid pi_offset values. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
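The containment condition pi_offset + pi_tuple_size <= metadata_size can be checked without risking integer overflow by rearranging it. `pi_offset_valid` is a hypothetical helper for illustration, not the kernel's validation routine.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Returns true when a PI tuple of pi_tuple_size bytes starting at pi_offset
 * lies entirely within metadata_size bytes of metadata.
 */
bool pi_offset_valid(unsigned int pi_offset, unsigned int pi_tuple_size,
                     unsigned int metadata_size)
{
    if (pi_tuple_size > metadata_size)
        return false;   /* the tuple alone does not fit */
    /* Same condition as offset + size <= metadata_size, without the addition. */
    return pi_offset <= metadata_size - pi_tuple_size;
}
```

For instance, an 8-byte tuple in 16 bytes of metadata may start at any offset from 0 to 8, and offset 9 is rejected.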
2025-12-17 | block: add allocation size check in blkdev_pr_read_keys() | Deepanshu Kartikey
blkdev_pr_read_keys() takes num_keys from userspace and uses it to calculate the allocation size for keys_info via struct_size(). While there is a check for SIZE_MAX (integer overflow), there is no upper bound validation on the allocation size itself. A malicious or buggy userspace can pass a large num_keys value that doesn't trigger overflow but still results in an excessive allocation attempt, causing a warning in the page allocator when the order exceeds MAX_PAGE_ORDER. Fix this by introducing PR_KEYS_MAX to limit the number of keys to a sane value. This makes the SIZE_MAX check redundant, so remove it. Also switch to kvzalloc/kvfree to handle larger allocations gracefully. Fixes: 22a1ffea5f80 ("block: add IOC_PR_READ_KEYS ioctl") Tested-by: syzbot+660d079d90f8a1baf54d@syzkaller.appspotmail.com Reported-by: syzbot+660d079d90f8a1baf54d@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=660d079d90f8a1baf54d Link: https://lore.kernel.org/all/20251212013510.3576091-1-kartikey406@gmail.com/T/ [v1] Signed-off-by: Deepanshu Kartikey <kartikey406@gmail.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
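A userspace sketch of bounding a caller-supplied count before sizing an allocation. The cap value `DEMO_PR_KEYS_MAX`, the structure layout, the error codes and `alloc_keys_info` are all assumptions for illustration, and `calloc` stands in for the kernel allocator.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

#define DEMO_PR_KEYS_MAX 65536u  /* hypothetical cap, not the kernel's value */

/* Hypothetical result buffer with a flexible array sized by num_keys. */
struct keys_info_model {
    uint32_t num_keys;
    uint64_t keys[];
};

/* Allocate room for num_keys keys, rejecting requests above the cap. */
int alloc_keys_info(uint32_t num_keys, struct keys_info_model **out)
{
    assert(out != NULL);
    if (num_keys > DEMO_PR_KEYS_MAX)
        return -EINVAL;
    /* With num_keys bounded, this size computation cannot overflow. */
    size_t size = sizeof(struct keys_info_model) +
                  (size_t)num_keys * sizeof(uint64_t);
    struct keys_info_model *info = calloc(1, size);
    if (info == NULL)
        return -ENOMEM;
    info->num_keys = num_keys;
    *out = info;
    return 0;
}
```

Once the count is capped, a separate overflow check on the computed size is no longer needed, which matches the commit's reason for dropping the SIZE_MAX test.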
2025-12-12 | block: fix race between wbt_enable_default and IO submission | Ming Lei
When wbt_enable_default() is moved out of queue freezing in elevator_change(), it can cause the wbt inflight counter to become negative (-1), leading to hung tasks in the writeback path. Tasks get stuck in wbt_wait() because the counter is in an inconsistent state.
The issue occurs because wbt_enable_default() could race with IO submission, allowing the counter to be decremented before proper initialization. This manifests as:
  rq_wait[0]: inflight: -1 has_waiters: True
rwb_enabled() checks the state, which can be updated exactly between wbt_wait() (rq_qos_throttle()) and wbt_track() (rq_qos_track()), so the inflight counter becomes negative. This results in hung task warnings like:
  task:kworker/u24:39 state:D stack:0 pid:14767
  Call Trace:
    rq_qos_wait+0xb4/0x150
    wbt_wait+0xa9/0x100
    __rq_qos_throttle+0x24/0x40
    blk_mq_submit_bio+0x672/0x7b0
    ...
Fix this by:
1. Splitting wbt_enable_default() into:
   - __wbt_enable_default(): Returns true if wbt_init() should be called
   - wbt_enable_default(): Wrapper for existing callers (no init)
   - wbt_init_enable_default(): New function that checks and inits WBT
2. Using wbt_init_enable_default() in blk_register_queue() to ensure proper initialization during queue registration
3. Move wbt_init() out of wbt_enable_default() which is only for enabling disabled wbt from bfq and iocost, and wbt_init() isn't needed. Then the original lock warning can be avoided.
4. Removing the ELEVATOR_FLAG_ENABLE_WBT_ON_EXIT flag and its handling code since it's no longer needed
This ensures WBT is properly initialized before any IO can be submitted, preventing the counter from going negative.
Cc: Nilay Shroff <nilay@linux.ibm.com> Cc: Yu Kuai <yukuai@fnnas.com> Cc: Guangwu Zhang <guazhang@redhat.com> Fixes: 78c271344b6f ("block: move wbt_enable_default() out of queue freezing from sched ->exit()") Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
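The counter underflow can be reproduced in a single-threaded model by changing the enabled state between the throttle and track steps. This is one reading of the race described above, not the kernel's implementation: `struct wbt_model`, `struct rq_model`, `model_wait`, `model_track` and `model_done` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical throttling state and per-request tracking flag. */
struct wbt_model { bool enabled; int inflight; };
struct rq_model { bool tracked; };

/* Throttle step: only counts the request if throttling is enabled now. */
void model_wait(struct wbt_model *w)
{
    assert(w != NULL);
    if (w->enabled)
        w->inflight++;
}

/* Track step: marks the request if throttling is enabled now. */
void model_track(struct wbt_model *w, struct rq_model *rq)
{
    rq->tracked = w->enabled;
}

/* Completion step: every tracked request decrements the counter. */
void model_done(struct wbt_model *w, struct rq_model *rq)
{
    if (rq->tracked)
        w->inflight--;
}
```

If throttling is off during the wait step and switched on before the track step, the request is tracked without having been counted, so completion drives the counter to -1. Enabling throttling before any IO can be submitted closes that window in this model.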
2025-12-12Merge tag 'block-6.19-20251211' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux Pull block fixes from Jens Axboe: - Always initialize DMA state, fixing a potentially nasty issue on the block side - btrfs zoned write fix with cached zone reports - Fix corruption issues in bcache with chained bio's, and further make it clear that the chained IO handler is simply a marker, it's not code meant to be executed - Kill old code dealing with synchronous IO polling in the block layer, that has been dead for a long time. Only async polling is supported these days - Fix a lockdep issue in tag_set management, moving it to RCU - Fix an issue with ublk's bio_vec iteration - Don't unconditionally enforce blocking issue of ublk control commands, allow some of them with non-blocking issue as they do not block * tag 'block-6.19-20251211' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: blk-mq-dma: always initialize dma state blk-mq: delete task running check in blk_hctx_poll() block: fix cached zone reports on devices with native zone append block: Use RCU in blk_mq_[un]quiesce_tagset() instead of set->tag_list_lock ublk: don't mutate struct bio_vec in iteration block: prohibit calls to bio_chain_endio bcache: fix improper use of bi_end_io ublk: allow non-blocking ctrl cmds in IO_URING_F_NONBLOCK issue
2025-12-10blk-mq-dma: always initialize dma stateKeith Busch
Ensure the dma state is initialized when we're not using the contiguous iova, otherwise the caller may be using a stale state from a previous request that could use the coalesced iova allocation. Fixes: 2f6b2565d43cdb5 ("block: accumulate memory segment gaps per bio") Reported-by: Sebastian Ott <sebott@redhat.com> Tested-by: Sebastian Ott <sebott@redhat.com> Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-12-10blk-mq: delete task running check in blk_hctx_poll()Fengnan Chang
blk_hctx_poll() always checks if the task is running or not, and returns 1 if the task is running. This is a leftover from when polled IO was purely for synchronous IO, and doesn't make sense anymore when polled IO is purely asynchronous. Similarly, marking the task as TASK_RUNNING is also superfluous, as the task very much has to be running to enter the function in the first place. It looks like this check has been kept for historical reasons, and in very early versions of this function the user would set the process state to TASK_UNINTERRUPTIBLE. Signed-off-by: Diangang Li <lidiangang@bytedance.com> Signed-off-by: Fengnan Chang <changfengnan@bytedance.com> [axboe: kill all remnants of task running, pointless now. massage message] Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-12-09block: fix cached zone reports on devices with native zone appendJohannes Thumshirn
When mounting a btrfs file system on virtio-blk which supports native Zone Append there has been a WARN triggering in btrfs' space management code. Further looking into btrfs' zoned statistics uncovered the filesystem expecting the zones to be used, but the write pointers being 0: # cat /sys/fs/btrfs/8eabd2e7-3294-4f9e-9b58-7e64135c8bf4/zoned_stats active block-groups: 4 reclaimable: 0 unused: 0 need reclaim: false data relocation block-group: 1342177280 active zones: start: 1073741824, wp: 0 used: 0, reserved: 0, unusable: 0 start: 1342177280, wp: 0 used: 0, reserved: 0, unusable: 0 start: 1610612736, wp: 0 used: 16384, reserved: 0, unusable: 18446744073709535232 start: 1879048192, wp: 0 used: 131072, reserved: 0, unusable: 18446744073709420544 Looking at the blkzone report output for the zone in question (1610612736) the write pointer on the device moved, but the filesystem did not see a change on the write pointer: # blkzone report -c 1 -o 0x300000 /dev/vda start: 0x000300000, len 0x080000, cap 0x080000, wptr 0x000040 reset:0 non-seq:0, zcond: 2(oi) [type: 2(SEQ_WRITE_REQUIRED)] The zone write pointer is 0 because btrfs is using the cached version of blkdev_report_zones() and, as virtio-blk supports native zone append, blkdev_revalidate_zones() does not initialize the zone write plugs in this case. Not skipping the revalidate of sequential zones in the blkdev_revalidate_zones() callchain fixes this issue. Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Fixes: a6aa36e957a1 ("block: Remove zone write plugs when handling native zone append writes") Signed-off-by: Jens Axboe <axboe@kernel.dk>
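The numbers in the report above can be cross-checked with a little arithmetic. The sketch below makes two assumptions that are not stated in the commit message: that the `blkzone` offset is expressed in 512-byte sectors, and that the `unusable` value behaves like an unsigned 64-bit subtraction of `used` from the write pointer offset (inferred only from the fact that the reported values equal 2^64 minus `used`). The `model_*` helpers are hypothetical names used for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: blkzone reports offsets in 512-byte sectors. */
#define MODEL_SECTOR_SHIFT 9

/* Converts a sector offset into a byte offset. */
static uint64_t model_sectors_to_bytes(uint64_t sectors)
{
	return sectors << MODEL_SECTOR_SHIFT;
}

/*
 * Assumption: "unusable" is computed as write pointer minus used bytes
 * in unsigned 64-bit arithmetic, so a stale write pointer of 0 wraps
 * around instead of going negative.
 */
static uint64_t model_unusable(uint64_t wp, uint64_t used)
{
	return wp - used;
}
```

Under these assumptions, `model_sectors_to_bytes(0x300000)` gives 1610612736, matching the zone start shown in `zoned_stats`, and `model_unusable(0, 16384)` gives 18446744073709535232, matching the implausibly large `unusable` value that results from a cached write pointer of 0.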
2025-12-09block: Use RCU in blk_mq_[un]quiesce_tagset() instead of set->tag_list_lockMohamed Khalfella
The blk_mq_{add,del}_queue_tag_set() functions add and remove queues from a tagset. They make sure that the tagset and queues are marked as shared when two or more queues are attached to the same tagset. Initially a tagset starts as unshared and when the number of added queues reaches two, blk_mq_add_queue_tag_set() marks it as shared along with all the queues attached to it. When the number of attached queues drops to 1, blk_mq_del_queue_tag_set() needs to mark both the tagset and the remaining queues as unshared. Both functions need to freeze current queues in the tagset before setting or unsetting the BLK_MQ_F_TAG_QUEUE_SHARED flag. While doing so, both functions hold the set->tag_list_lock mutex, which makes sense as we do not want queues to be added or deleted in the process. This used to work fine until commit 98d81f0df70c ("nvme: use blk_mq_[un]quiesce_tagset") made the nvme driver quiesce the tagset instead of quiescing individual queues. blk_mq_quiesce_tagset() does the job and quiesces the queues in set->tag_list while also holding set->tag_list_lock. This results in a deadlock between two threads with these stacktraces: __schedule+0x47c/0xbb0 ? timerqueue_add+0x66/0xb0 schedule+0x1c/0xa0 schedule_preempt_disabled+0xa/0x10 __mutex_lock.constprop.0+0x271/0x600 blk_mq_quiesce_tagset+0x25/0xc0 nvme_dev_disable+0x9c/0x250 nvme_timeout+0x1fc/0x520 blk_mq_handle_expired+0x5c/0x90 bt_iter+0x7e/0x90 blk_mq_queue_tag_busy_iter+0x27e/0x550 ? __blk_mq_complete_request_remote+0x10/0x10 ? __blk_mq_complete_request_remote+0x10/0x10 ? __call_rcu_common.constprop.0+0x1c0/0x210 blk_mq_timeout_work+0x12d/0x170 process_one_work+0x12e/0x2d0 worker_thread+0x288/0x3a0 ? rescuer_thread+0x480/0x480 kthread+0xb8/0xe0 ? kthread_park+0x80/0x80 ret_from_fork+0x2d/0x50 ? kthread_park+0x80/0x80 ret_from_fork_asm+0x11/0x20 __schedule+0x47c/0xbb0 ? xas_find+0x161/0x1a0 schedule+0x1c/0xa0 blk_mq_freeze_queue_wait+0x3d/0x70 ? destroy_sched_domains_rcu+0x30/0x30 blk_mq_update_tag_set_shared+0x44/0x80 blk_mq_exit_queue+0x141/0x150 del_gendisk+0x25a/0x2d0 nvme_ns_remove+0xc9/0x170 nvme_remove_namespaces+0xc7/0x100 nvme_remove+0x62/0x150 pci_device_remove+0x23/0x60 device_release_driver_internal+0x159/0x200 unbind_store+0x99/0xa0 kernfs_fop_write_iter+0x112/0x1e0 vfs_write+0x2b1/0x3d0 ksys_write+0x4e/0xb0 do_syscall_64+0x5b/0x160 entry_SYSCALL_64_after_hwframe+0x4b/0x53 The top stacktrace shows nvme_timeout() called to handle an nvme command timeout. The timeout handler is trying to disable the controller and, as a first step, it needs to call blk_mq_quiesce_tagset() to tell blk-mq not to call queue callback handlers. The thread is stuck waiting for set->tag_list_lock as it tries to walk the queues in set->tag_list. The lock is held by the second thread in the bottom stack, which is waiting for one of the queues to be frozen. The queue usage counter will drop to zero after nvme_timeout() finishes, and this will not happen because the thread will wait for this mutex forever. Given that [un]quiescing a queue is an operation that does not need to sleep, update blk_mq_[un]quiesce_tagset() to use RCU instead of taking set->tag_list_lock, and update blk_mq_{add,del}_queue_tag_set() to use RCU safe list operations. Also, delete INIT_LIST_HEAD(&q->tag_set_list) in blk_mq_del_queue_tag_set() because we can not re-initialize it while the list is being traversed under RCU. The deleted queue will not be added/deleted to/from a tagset and it will be freed in blk_free_queue() after the end of the RCU grace period. Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com> Fixes: 98d81f0df70c ("nvme: use blk_mq_[un]quiesce_tagset") Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-12-09block: prohibit calls to bio_chain_endioShida Zhang
Now that all potential callers of bio_chain_endio have been eliminated, completely prohibit any future calls to this function. Suggested-by: Ming Lei <ming.lei@redhat.com> Suggested-by: Andreas Gruenbacher <agruenba@redhat.com> Suggested-by: Christoph Hellwig <hch@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Shida Zhang <zhangshida@kylinos.cn> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-12-09Merge tag 'block-6.19-20251208' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux Pull block updates from Jens Axboe: "Followup set of fixes and updates for block for the 6.19 merge window. NVMe had some last minute debates which led to dropping some patches from that tree, which is why the initial PR didn't have NVMe included. It's here now. This pull request contains: - NVMe pull request via Keith: - Subsystem usage cleanups (Max) - Endpoint device fixes (Shin'ichiro) - Debug statements (Gerd) - FC fabrics cleanups and fixes (Daniel) - Consistent alloc API usages (Israel) - Code comment updates (Chu) - Authentication retry fix (Justin) - Fix a memory leak in the discard ioctl code, if the task is being interrupted by a signal at just the wrong time - Zoned write plugging fixes - Add ioctls for persistent reservations - Enable per-cpu bio caching by default - Various little fixes and tweaks" * tag 'block-6.19-20251208' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (27 commits) nvme-fabrics: add ENOKEY to no retry criteria for authentication failures nvme-auth: use kvfree() for memory allocated with kvcalloc() nvmet-tcp: use kvcalloc for commands array nvmet-rdma: use kvcalloc for commands and responses arrays nvme: fix typo error in nvme target nvmet-fc: use pr_* print macros instead of dev_* nvmet-fcloop: remove unused lsdir member. nvmet-fcloop: check all request and response have been processed nvme-fc: check all request and response have been processed block: fix memory leak in __blkdev_issue_zero_pages block: fix comment for op_is_zone_mgmt() to include RESET_ALL block: Clear BLK_ZONE_WPLUG_PLUGGED when aborting plugged BIOs blk-mq: Abort suspend when wakeup events are pending blk-mq: add blk_rq_nr_bvec() helper block: add IOC_PR_READ_RESERVATION ioctl block: add IOC_PR_READ_KEYS ioctl nvme: reject invalid pr_read_keys() num_keys values scsi: sd: reject invalid pr_read_keys() num_keys values block: enable per-cpu bio cache by default block: use bio_alloc_bioset for passthru IO by default ...
2025-12-04Merge tag 'vfio-v6.19-rc1' of https://github.com/awilliam/linux-vfioLinus Torvalds
Pull VFIO updates from Alex Williamson: - Move libvfio selftest artifacts in preparation of more tightly coupled integration with KVM selftests (David Matlack) - Fix comment typo in mtty driver (Chu Guangqing) - Support for new hardware revision in the hisi_acc vfio-pci variant driver where the migration registers can now be accessed via the PF. When enabled for this support, the full BAR can be exposed to the user (Longfang Liu) - Fix vfio cdev support for VF token passing, using the correct size for the kernel structure, thereby actually allowing userspace to provide a non-zero UUID token. Also set the match token callback for the hisi_acc, fixing VF token support for this vfio-pci variant driver (Raghavendra Rao Ananta) - Introduce internal callbacks on vfio devices to simplify and consolidate duplicate code for generating VFIO_DEVICE_GET_REGION_INFO data, removing various ioctl intercepts with a more structured solution (Jason Gunthorpe) - Introduce dma-buf support for vfio-pci devices, allowing MMIO regions to be exposed through dma-buf objects with lifecycle managed through move operations. This enables low-level interactions such as vfio-pci based SPDK drivers interacting directly with dma-buf capable RDMA devices to enable peer-to-peer operations. IOMMUFD is also now able to build upon this support to fill a long standing feature gap versus the legacy vfio type1 IOMMU backend with an implementation of P2P support for VM use cases that better manages the lifecycle of the P2P mapping (Leon Romanovsky, Jason Gunthorpe, Vivek Kasireddy) - Convert eventfd triggering for error and request signals to use RCU mechanisms in order to avoid a 3-way lockdep reported deadlock issue (Alex Williamson) - Fix a 32-bit overflow introduced via dma-buf support manifesting with large DMA buffers (Alex Mastro) - Convert nvgrace-gpu vfio-pci variant driver to insert mappings on fault rather than at mmap time. This conversion serves both to make use of huge PFNMAPs but also to both avoid corrected RAS events during reset by now being subject to vfio-pci-core's use of unmap_mapping_range(), and to enable a device readiness test after reset (Ankit Agrawal) - Refactoring of vfio selftests to support multi-device tests and split code to provide better separation between IOMMU and device objects. This work also enables a new test suite addition to measure parallel device initialization latency (David Matlack) * tag 'vfio-v6.19-rc1' of https://github.com/awilliam/linux-vfio: (65 commits) vfio: selftests: Add vfio_pci_device_init_perf_test vfio: selftests: Eliminate INVALID_IOVA vfio: selftests: Split libvfio.h into separate header files vfio: selftests: Move vfio_selftests_*() helpers into libvfio.c vfio: selftests: Rename vfio_util.h to libvfio.h vfio: selftests: Stop passing device for IOMMU operations vfio: selftests: Move IOVA allocator into iova_allocator.c vfio: selftests: Move IOMMU library code into iommu.c vfio: selftests: Rename struct vfio_dma_region to dma_region vfio: selftests: Upgrade driver logging to dev_err() vfio: selftests: Prefix logs with device BDF where relevant vfio: selftests: Eliminate overly chatty logging vfio: selftests: Support multiple devices in the same container/iommufd vfio: selftests: Introduce struct iommu vfio: selftests: Rename struct vfio_iommu_mode to iommu_mode vfio: selftests: Allow passing multiple BDFs on the command line vfio: selftests: Split run.sh into separate scripts vfio: selftests: Move run.sh into scripts directory vfio/nvgrace-gpu: wait for the GPU mem to be ready vfio/nvgrace-gpu: Inform devmem unmapped after reset ...
2025-12-04block: fix memory leak in __blkdev_issue_zero_pagesShaurya Rane
Move the fatal signal check before bio_alloc() to prevent a memory leak when BLKDEV_ZERO_KILLABLE is set and a fatal signal is pending. Previously, the bio was allocated before checking for a fatal signal. If a signal was pending, the code would break out of the loop without freeing or chaining the just-allocated bio, causing a memory leak. This matches the pattern already used in __blkdev_issue_write_zeroes() where the signal check precedes the allocation. Fixes: bf86bcdb4012 ("blk-lib: check for kill signal in ioctl BLKZEROOUT") Reported-by: syzbot+527a7e48a3d3d315d862@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=527a7e48a3d3d315d862 Signed-off-by: Shaurya Rane <ssrane_b23@ee.vjti.ac.in> Reviewed-by: Keith Busch <kbusch@kernel.org> Tested-by: syzbot+527a7e48a3d3d315d862@syzkaller.appspotmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
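The ordering issue described above can be shown with a small userspace model. This is a sketch only: the `model_*` functions are hypothetical stand-ins (a counter replaces real bio allocation, and a loop index replaces a real pending signal), so it illustrates the check-before-allocate pattern rather than the kernel code itself.

```c
#include <assert.h>
#include <stdbool.h>

/* Number of model "bios" allocated but not yet submitted. */
static int model_outstanding;

/* Hypothetical stand-in for allocating a bio. */
static void model_alloc(void)
{
	model_outstanding++;
}

/* Hypothetical stand-in for chaining/submitting a bio, which hands it off. */
static void model_submit(void)
{
	model_outstanding--;
}

/* Models a fatal signal arriving at iteration signal_at (-1: never). */
static bool model_signal_pending(int iter, int signal_at)
{
	return signal_at >= 0 && iter >= signal_at;
}

/* Old ordering: allocate first, then check; breaking out strands a bio. */
static int model_zero_alloc_first(int nr, int signal_at)
{
	model_outstanding = 0;
	for (int i = 0; i < nr; i++) {
		model_alloc();
		if (model_signal_pending(i, signal_at))
			break;          /* the just-allocated bio is leaked */
		model_submit();
	}
	return model_outstanding;
}

/* Fixed ordering: check first, so nothing is allocated on the exit path. */
static int model_zero_check_first(int nr, int signal_at)
{
	model_outstanding = 0;
	for (int i = 0; i < nr; i++) {
		if (model_signal_pending(i, signal_at))
			break;
		model_alloc();
		model_submit();
	}
	return model_outstanding;
}
```

Both variants return the number of stranded allocations: with a signal at iteration 2 of 4, the allocate-first loop leaves one behind while the check-first loop leaves none.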
2025-12-04block: Clear BLK_ZONE_WPLUG_PLUGGED when aborting plugged BIOsDamien Le Moal
Commit fe0418eb9bd6 ("block: Prevent potential deadlocks in zone write plug error recovery") added a WARN check in disk_put_zone_wplug() to verify that when the last reference to a zone write plug is dropped, this zone write plug does not have the BLK_ZONE_WPLUG_PLUGGED flag set, that is, that it is not plugged. However, the function disk_zone_wplug_abort(), which is called for zone reset and zone finish operations, does not clear this flag after emptying a zone write plug BIO list. This can cause the disk_put_zone_wplug() warning to trigger if the user (erroneously, as that is bad practice) issues zone reset or zone finish operations while the target zone still has plugged BIOs. Modify disk_zone_wplug_abort() to clear the BLK_ZONE_WPLUG_PLUGGED flag. And while at it, also add a lockdep annotation to ensure that this function is called with the zone write plug spinlock held. Fixes: fe0418eb9bd6 ("block: Prevent potential deadlocks in zone write plug error recovery") Cc: stable@vger.kernel.org Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Niklas Cassel <cassel@kernel.org> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
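The interaction between the abort path and the warning check can be sketched in userspace as follows. All `model_*` names and the flag value are hypothetical; the sketch only assumes what the message states, namely that aborting empties the BIO list and that the last-reference check warns when the plugged flag is still set.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag bit standing in for BLK_ZONE_WPLUG_PLUGGED. */
#define MODEL_WPLUG_PLUGGED (1u << 0)

/* Hypothetical, simplified zone write plug. */
struct model_wplug {
	unsigned int flags;
	int nr_bios;            /* number of plugged BIOs in the list */
};

/* Abort path before the fix: empties the list, leaves the flag set. */
static void model_abort_old(struct model_wplug *zw)
{
	zw->nr_bios = 0;
}

/* Abort path after the fix: empties the list and clears the flag. */
static void model_abort_fixed(struct model_wplug *zw)
{
	zw->nr_bios = 0;
	zw->flags &= ~MODEL_WPLUG_PLUGGED;
}

/* Models the WARN condition checked when the last reference is dropped. */
static bool model_put_would_warn(const struct model_wplug *zw)
{
	return (zw->flags & MODEL_WPLUG_PLUGGED) != 0;
}
```

In this model, a plug that was plugged with pending BIOs still trips `model_put_would_warn()` after `model_abort_old()`, but not after `model_abort_fixed()`.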
2025-12-04blk-mq: Abort suspend when wakeup events are pendingCong Zhang
During system suspend, wakeup capable IRQs for block devices can be delayed, which can cause blk_mq_hctx_notify_offline() to hang indefinitely while waiting for pending requests to complete. Skip the request waiting loop and abort suspend when wakeup events are pending to prevent the deadlock. Fixes: bf0beec0607d ("blk-mq: drain I/O when all CPUs in a hctx are offline") Signed-off-by: Cong Zhang <cong.zhang@oss.qualcomm.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-12-04block: add IOC_PR_READ_RESERVATION ioctlStefan Hajnoczi
Add a Persistent Reservations ioctl to read the current reservation. This calls the pr_ops->read_reservation() function that was previously added in commit c787f1baa503 ("block: Add PR callouts for read keys and reservation") but was only used by the in-kernel SCSI target so far. The IOC_PR_READ_RESERVATION ioctl is necessary so that userspace applications that rely on Persistent Reservations ioctls have a way of inspecting the current state. Cluster managers and validation tests need this functionality. Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-12-04block: add IOC_PR_READ_KEYS ioctlStefan Hajnoczi
Add a Persistent Reservations ioctl to read the list of currently registered reservation keys. This calls the pr_ops->read_keys() function that was previously added in commit c787f1baa503 ("block: Add PR callouts for read keys and reservation") but was only used by the in-kernel SCSI target so far. The IOC_PR_READ_KEYS ioctl is necessary so that userspace applications that rely on Persistent Reservations ioctls have a way of inspecting the current state. Cluster managers and validation tests need this functionality. Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-12-04block: enable per-cpu bio cache by defaultFengnan Chang
Since commit 12e4e8c7ab59 ("io_uring/rw: enable bio caches for IRQ rw"), bio_put is safe for task and irq context, and bio_alloc_bioset is safe for task context and is not called from irq context, so we can enable the per-cpu bio cache by default. Benchmarked with t/io_uring and ext4+nvme: taskset -c 6 /root/fio/t/io_uring -p0 -d128 -b4096 -s1 -c1 -F1 -B1 -R1 -X1 -n1 -P1 /mnt/testfile base IOPS is 562K, patch IOPS is 574K. The CPU usage of bio_alloc_bioset decreases from 1.42% to 1.22%. The worst case is allocating a bio on CPU A but freeing it on CPU B; still using t/io_uring and ext4+nvme: base IOPS is 648K, patch IOPS is 647K. Also tested ext4/xfs with fio using libaio/sync/io_uring on null_blk and nvme; no obvious performance regression. Signed-off-by: Fengnan Chang <changfengnan@bytedance.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
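Why the cross-CPU case is the worst case can be seen in a toy model of a per-CPU free cache. This is a sketch under assumptions: the `model_*` names, the two-CPU layout, and the unbounded cache are all hypothetical simplifications (a real cache would cap how many objects it holds), and no timing is modelled, only whether an allocation is served from the local cache or falls back to the slower general allocator.

```c
#include <assert.h>

#define MODEL_NR_CPUS 2

/* Per-CPU count of cached free objects (hypothetical, unbounded). */
static int model_cached[MODEL_NR_CPUS];

/* Number of allocations that missed the local cache. */
static int model_slow_allocs;

static void model_reset(void)
{
	for (int cpu = 0; cpu < MODEL_NR_CPUS; cpu++)
		model_cached[cpu] = 0;
	model_slow_allocs = 0;
}

/* Allocation: take from this CPU's cache, else use the slow path. */
static void model_alloc_on(int cpu)
{
	assert(cpu >= 0 && cpu < MODEL_NR_CPUS);
	if (model_cached[cpu] > 0)
		model_cached[cpu]--;
	else
		model_slow_allocs++;
}

/* Free: the object goes to the cache of the CPU doing the free. */
static void model_free_on(int cpu)
{
	assert(cpu >= 0 && cpu < MODEL_NR_CPUS);
	model_cached[cpu]++;
}

/* Runs alloc/free pairs and returns how many allocations missed the cache. */
static int model_run(int rounds, int alloc_cpu, int free_cpu)
{
	model_reset();
	for (int i = 0; i < rounds; i++) {
		model_alloc_on(alloc_cpu);
		model_free_on(free_cpu);
	}
	return model_slow_allocs;
}
```

With allocation and free on the same CPU, only the first allocation misses; when every free lands on a different CPU, the allocating CPU's cache never refills and every allocation takes the slow path, which is the scenario the second benchmark above measures.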
2025-12-04block: use bio_alloc_bioset for passthru IO by defaultFengnan Chang
Use bio_alloc_bioset for passthru IO by default, so that we can enable the bio cache for irq and polled passthru IO later. Signed-off-by: Fengnan Chang <changfengnan@bytedance.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>