| Age | Commit message (Collapse) | Author |
|
Conversion performed via this Coccinelle script:
// SPDX-License-Identifier: GPL-2.0-only
// Options: --include-headers-for-types --all-includes --include-headers --keep-comments
virtual patch
@gfp depends on patch && !(file in "tools") && !(file in "samples")@
identifier ALLOC = {kmalloc_obj,kmalloc_objs,kmalloc_flex,
kzalloc_obj,kzalloc_objs,kzalloc_flex,
kvmalloc_obj,kvmalloc_objs,kvmalloc_flex,
kvzalloc_obj,kvzalloc_objs,kvzalloc_flex};
@@
ALLOC(...
- , GFP_KERNEL
)
$ make coccicheck MODE=patch COCCI=gfp.cocci
Build and boot tested x86_64 with Fedora 42's GCC and Clang:
Linux version 6.19.0+ (user@host) (gcc (GCC) 15.2.1 20260123 (Red Hat 15.2.1-7), GNU ld version 2.44-12.fc42) #1 SMP PREEMPT_DYNAMIC 1970-01-01
Linux version 6.19.0+ (user@host) (clang version 20.1.8 (Fedora 20.1.8-4.fc42), LLD 20.1.8) #1 SMP PREEMPT_DYNAMIC 1970-01-01
Signed-off-by: Kees Cook <kees@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This converts some of the visually simpler cases that have been split
over multiple lines. I only did the ones that are easy to verify the
resulting diff by having just that final GFP_KERNEL argument on the next
line.
Somebody should probably do a proper coccinelle script for this, but for
me the trivial script actually resulted in an assertion failure in the
middle of the script. I probably had made it a bit _too_ trivial.
So after fighting that far a while I decided to just do some of the
syntactically simpler cases with variations of the previous 'sed'
scripts.
The more syntactically complex multi-line cases would mostly really want
whitespace cleanup anyway.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This was done entirely with mindless brute force, using
git grep -l '\<k[vmz]*alloc_objs*(.*, GFP_KERNEL)' |
xargs sed -i 's/\(alloc_objs*(.*\), GFP_KERNEL)/\1)/'
to convert the new alloc_obj() users that had a simple GFP_KERNEL
argument to just drop that argument.
Note that due to the extreme simplicity of the scripting, any slightly
more complex cases spread over multiple lines would not be triggered:
they definitely exist, but this covers the vast bulk of the cases, and
the resulting diff is also then easier to check automatically.
For the same reason the 'flex' versions will be done as a separate
conversion.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull kmalloc_obj conversion from Kees Cook:
"This does the tree-wide conversion to kmalloc_obj() and friends using
coccinelle, with a subsequent small manual cleanup of whitespace
alignment that coccinelle does not handle.
This uncovered a clang bug in __builtin_counted_by_ref(), so the
conversion is preceded by disabling that for current versions of
clang. The imminent clang 22.1 release has the fix.
I've done allmodconfig build tests for x86_64, arm64, i386, and arm. I
did defconfig builds for alpha, m68k, mips, parisc, powerpc, riscv,
s390, sparc, sh, arc, csky, xtensa, hexagon, and openrisc"
* tag 'kmalloc_obj-treewide-v7.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
kmalloc_obj: Clean up after treewide replacements
treewide: Replace kmalloc with kmalloc_obj for non-scalar types
compiler_types: Disable __builtin_counted_by_ref for Clang
|
|
This is the result of running the Coccinelle script from
scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to
avoid scalar types (which need careful case-by-case checking), and
instead replace kmalloc-family calls that allocate struct or union
object instances:
Single allocations: kmalloc(sizeof(TYPE), ...)
are replaced with: kmalloc_obj(TYPE, ...)
Array allocations: kmalloc_array(COUNT, sizeof(TYPE), ...)
are replaced with: kmalloc_objs(TYPE, COUNT, ...)
Flex array allocations: kmalloc(struct_size(PTR, FAM, COUNT), ...)
are replaced with: kmalloc_flex(*PTR, FAM, COUNT, ...)
(where TYPE may also be *VAR)
The resulting allocations no longer return "void *", instead returning
"TYPE *".
Signed-off-by: Kees Cook <kees@kernel.org>
|
|
For SQE128, sqe->cmd provides 80 bytes for uring_cmd. Add macro to
check if size of user struct does not exceed 80 bytes at compile time.
User doesn't have to track this manually during development.
Replace io_uring_sqe_cmd() inline func with macro and add
io_uring_sqe128_cmd() which checks struct
size for 16 bytes cmd and 80 bytes cmd respectively.
Signed-off-by: Govindarajulu Varadarajan <govind.varadar@gmail.com>
Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux
Pull block updates from Jens Axboe:
- Support for batch request processing for ublk, improving the
efficiency of the kernel/ublk server communication. This can yield
nice 7-12% performance improvements
- Support for integrity data for ublk
- Various other ublk improvements and additions, including a ton of
selftests additions and updated
- Move the handling of blk-crypto software fallback from below the
block layer to above it. This reduces the complexity of dealing with
bio splitting
- Series fixing a number of potential deadlocks in blk-mq related to
the queue usage counter and writeback throttling and rq-qos debugfs
handling
- Add an async_depth queue attribute, to resolve a performance
regression that's been around for a qhilw related to the scheduler
depth handling
- Only use task_work for IOPOLL completions on NVMe, if it is necessary
to do so. An earlier fix for an issue resulted in all these
completions being punted to task_work, to guarantee that completions
were only run for a given io_uring ring when it was local to that
ring. With the new changes, we can detect if it's necessary to use
task_work or not, and avoid it if possible.
- rnbd fixes:
- Fix refcount underflow in device unmap path
- Handle PREFLUSH and NOUNMAP flags properly in protocol
- Fix server-side bi_size for special IOs
- Zero response buffer before use
- Fix trace format for flags
- Add .release to rnbd_dev_ktype
- MD pull requests via Yu Kuai
- Fix raid5_run() to return error when log_init() fails
- Fix IO hang with degraded array with llbitmap
- Fix percpu_ref not resurrected on suspend timeout in llbitmap
- Fix GPF in write_page caused by resize race
- Fix NULL pointer dereference in process_metadata_update
- Fix hang when stopping arrays with metadata through dm-raid
- Fix any_working flag handling in raid10_sync_request
- Refactor sync/recovery code path, improve error handling for
badblocks, and remove unused recovery_disabled field
- Consolidate mddev boolean fields into mddev_flags
- Use mempool to allocate stripe_request_ctx and make sure
max_sectors is not less than io_opt in raid5
- Fix return value of mddev_trylock
- Fix memory leak in raid1_run()
- Add Li Nan as mdraid reviewer
- Move phys_vec definitions to the kernel types, mostly in preparation
for some VFIO and RDMA changes
- Improve the speed for secure erase for some devices
- Various little rust updates
- Various other minor fixes, improvements, and cleanups
* tag 'for-7.0/block-20260206' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (162 commits)
blk-mq: ABI/sysfs-block: fix docs build warnings
selftests: ublk: organize test directories by test ID
block: decouple secure erase size limit from discard size limit
block: remove redundant kill_bdev() call in set_blocksize()
blk-mq: add documentation for new queue attribute async_dpeth
block, bfq: convert to use request_queue->async_depth
mq-deadline: covert to use request_queue->async_depth
kyber: covert to use request_queue->async_depth
blk-mq: add a new queue sysfs attribute async_depth
blk-mq: factor out a helper blk_mq_limit_depth()
blk-mq-sched: unify elevators checking for async requests
block: convert nr_requests to unsigned int
block: don't use strcpy to copy blockdev name
blk-mq-debugfs: warn about possible deadlock
blk-mq-debugfs: add missing debugfs_mutex in blk_mq_debugfs_register_hctxs()
blk-mq-debugfs: remove blk_mq_debugfs_unregister_rqos()
blk-mq-debugfs: make blk_mq_debugfs_register_rqos() static
blk-rq-qos: fix possible debugfs_mutex deadlock
blk-mq-debugfs: factor out a helper to register debugfs for all rq_qos
blk-wbt: fix possible deadlock to nest pcpu_alloc_mutex under q_usage_counter
...
|
|
The initial state of dma_needs_unmap may be false, but change to true
while mapping the data iterator. Enabling swiotlb is one such case that
can change the result. The nvme driver needs to save the mapped dma
vectors to be unmapped later, so allocate as needed during iteration
rather than assume it was always allocated at the beginning. This fixes
a NULL dereference from accessing an uninitialized dma_vecs when the
device dma unmapping requirements change mid-iteration.
Fixes: b8b7570a7ec8 ("nvme-pci: fix dma unmapping when using PRPs and not using the IOVA mapping")
Link: https://lore.kernel.org/linux-nvme/20260202125738.1194899-1-pradeep.pragallapati@oss.qualcomm.com/
Reported-by: Pradeep P V K <pradeep.pragallapati@oss.qualcomm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
The call to nvme_free_sgls() in nvme_unmap_data() has the sg_list and sge
parameters swapped. This wasn't noticed by the compiler because both share
the same type. On a Xen PV hardware domain, and possibly any other
architectures that takes that path, this leads to corruption of the NVMe
contents.
Fixes: f0887e2a52d4 ("nvme-pci: create common sgl unmapping helper")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
When multiple io_uring rings poll on the same NVMe queue, one ring can
find completions belonging to another ring. The current code always
uses task_work to handle this, but this adds overhead for the common
single-ring case.
This patch passes the polling io_ring_ctx through io_comp_batch's new
poll_ctx field. In io_do_iopoll(), the polling ring's context is stored
in iob.poll_ctx before calling the iopoll callbacks.
In nvme_uring_cmd_end_io(), we now compare iob->poll_ctx with the
request's owning io_ring_ctx (via io_uring_cmd_ctx_handle()). If they
match (local context), we complete inline with io_uring_cmd_done32().
If they differ (remote context) or iob is NULL (non-iopoll path), we
use task_work as before.
This optimization eliminates task_work scheduling overhead for the
common case where a ring polls and finds its own completions.
~10% IOPS improvement is observed in the following benchmark:
fio/t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -O0 -P1 -u1 -n1 /dev/ng0n1
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Add a third parameter 'const struct io_comp_batch *' to the rq_end_io_fn
callback signature. This allows end_io handlers to access the completion
batch context when requests are completed via blk_mq_end_request_batch().
The io_comp_batch is passed from blk_mq_end_request_batch(), while NULL
is passed from __blk_mq_end_request() and blk_mq_put_rq_ref() which don't
have batch context.
This infrastructure change enables drivers to detect whether they're
being called from a batched completion path (like iopoll) and access
additional context stored in the io_comp_batch.
Update all rq_end_io_fn implementations:
- block/blk-mq.c: blk_end_sync_rq
- block/blk-flush.c: flush_end_io, mq_flush_data_end_io
- drivers/nvme/host/ioctl.c: nvme_uring_cmd_end_io
- drivers/nvme/host/core.c: nvme_keep_alive_end_io
- drivers/nvme/host/pci.c: abort_endio, nvme_del_queue_end, nvme_del_cq_end
- drivers/nvme/target/passthru.c: nvmet_passthru_req_done
- drivers/scsi/scsi_error.c: eh_lock_door_done
- drivers/scsi/sg.c: sg_rq_end_io
- drivers/scsi/st.c: st_scsi_execute_end
- drivers/target/target_core_pscsi.c: pscsi_req_done
- drivers/md/dm-rq.c: end_clone_request
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
* for-7.0/blk-pvec:
types: move phys_vec definition to common header
nvme-pci: Use size_t for length fields to handle larger sizes
|
|
The commit d2fe192348f9 (“nvme: only allow entering LIVE from CONNECTING
state”) disallows controller state transitions directly from RESETTING
to LIVE. However, the NVMe PCIe subsystem reset path relies on this
transition to recover the controller on PowerPC (PPC) systems.
On PPC systems, issuing a subsystem reset causes a temporary loss of
communication with the NVMe adapter. A subsequent PCIe MMIO read then
triggers EEH recovery, which restores the PCIe link and brings the
controller back online. For EEH recovery to proceed correctly, the
controller must transition back to the LIVE state.
Due to the changes introduced by commit d2fe192348f9 (“nvme: only allow
entering LIVE from CONNECTING state”), the controller can no longer
transition directly from RESETTING to LIVE. As a result, EEH recovery
exits prematurely, leaving the controller stuck in the RESETTING state.
Fix this by explicitly transitioning the controller state from RESETTING
to CONNECTING and then to LIVE. This satisfies the updated state
transition rules and allows the controller to be successfully recovered
on PPC systems following a PCIe subsystem reset.
Cc: stable@vger.kernel.org
Fixes: d2fe192348f9 ("nvme: only allow entering LIVE from CONNECTING state")
Reviewed-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
DMA IOVA state is not used inside blk_rq_dma_map_iter_next, get
rid of the argument.
Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
nvme_fabrics creates an NVMe/FC controller in following path:
nvmf_dev_write()
-> nvmf_create_ctrl()
-> nvme_fc_create_ctrl()
-> nvme_fc_init_ctrl()
nvme_fc_init_ctrl() allocates the admin blk-mq resources right after
nvme_add_ctrl() succeeds. If any of the subsequent steps fail (changing
the controller state, scheduling connect work, etc.), we jump to the
fail_ctrl path, which tears down the controller references but never
frees the admin queue/tag set. The leaked blk-mq allocations match the
kmemleak report seen during blktests nvme/fc.
Check ctrl->ctrl.admin_tagset in the fail_ctrl path and call
nvme_remove_admin_tag_set() when it is set so that all admin queue
allocations are reclaimed whenever controller setup aborts.
Reported-by: Yi Zhang <yi.zhang@redhat.com>
Reviewed-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: Chaitanya Kulkarni <ckulkarnilinux@gmail.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
After discussion with the devicetree maintainers we agreed to not extend
lists with the generic compatible "apple,nvme-ans2" anymore [1]. Add
"apple,t8103-nvme-ans2" as fallback compatible as it is the SoC the
driver and bindings were written for.
[1]: https://lore.kernel.org/asahi/12ab93b7-1fc2-4ce0-926e-c8141cfe81bf@kernel.org/
Cc: stable@vger.kernel.org # v6.18+
Fixes: 5bd2927aceba ("nvme-apple: Add initial Apple SoC NVMe driver")
Reviewed-by: Neal Gompa <neal@gompa.dev>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Janne Grunau <j@jannau.net>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
This patch changes the length variables from unsigned int to size_t.
Using size_t ensures that we can handle larger sizes, as size_t is
always equal to or larger than the previously used u32 type.
Originally, u32 was used because blk-mq-dma code evolved from
scatter-gather implementation, which uses unsigned int to describe length.
This change will also allow us to reuse the existing struct phys_vec in places
that don't need scatter-gather.
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Secondary temperature thresholds (temp2_{min,max}) were not reported
properly on this NVMe SSD. This resulted in an error while attempting to
read these values with sensors(1):
ERROR: Can't get value of subfeature temp2_min: I/O error
ERROR: Can't get value of subfeature temp2_max: I/O error
Add the device to the nvme_id_table with the
NVME_QUIRK_NO_SECONDARY_TEMP_THRESH flag to suppress access to all non-
composite temperature thresholds.
Cc: stable@vger.kernel.org
Tested-by: Wu Haotian <rigoligo03@gmail.com>
Signed-off-by: Ilikara Zheng <ilikara@aosc.io>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
Pull NVMe updates from Keith:
"- Subsystem usage cleanups (Max)
- Endpoint device fixes (Shin'ichiro)
- Debug statements (Gerd)
- FC fabrics cleanups and fixes (Daniel)
- Consistent alloc API usages (Israel)
- Code comment updates (Chu)
- Authentication retry fix (Justin)"
* tag 'nvme-6.19-2025-12-04' of git://git.infradead.org/nvme:
nvme-fabrics: add ENOKEY to no retry criteria for authentication failures
nvme-auth: use kvfree() for memory allocated with kvcalloc()
nvmet-tcp: use kvcalloc for commands array
nvmet-rdma: use kvcalloc for commands and responses arrays
nvme: fix typo error in nvme target
nvmet-fc: use pr_* print macros instead of dev_*
nvmet-fcloop: remove unused lsdir member.
nvmet-fcloop: check all request and response have been processed
nvme-fc: check all request and response have been processed
nvme-fc: don't hold rport lock when putting ctrl
nvme-pci: add debug message on fail to read CSTS
nvme-pci: print error message on failure in nvme_probe
nvmet: pci-epf: fix DMA channel debug print
nvmet: pci-epf: move DMA initialization to EPC init callback
nvmet: remove redundant subsysnqn field from ctrl
nvmet: add sanity checks when freeing subsystem
|
|
With authentication, in addition to EKEYREJECTED there is also no point in
retrying reconnects when status is ENOKEY. Thus, add -ENOKEY as another
criteria to determine when to stop retries.
Cc: Daniel Wagner <wagi@kernel.org>
Cc: Hannes Reinecke <hare@suse.de>
Closes: https://lore.kernel.org/linux-nvme/20250829-nvme-fc-sync-v3-0-d69c87e63aee@kernel.org/
Signed-off-by: Justin Tee <justintee8345@gmail.com>
Tested-by: Daniel Wagner <wagi@kernel.org>
Reviewed-by: Daniel Wagner <wagi@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
Memory allocated by kvcalloc() may come from vmalloc or kmalloc,
so use kvfree() instead of kfree() for proper deallocation.
Fixes: aa36d711e945 ("nvme-auth: convert dhchap_auth_list to an array")
Signed-off-by: Israel Rukshin <israelr@nvidia.com>
Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
When the rport is removed there shouldn't be any in flight request or
responses.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Daniel Wagner <wagi@kernel.org>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
The pr_read_keys() interface has a u32 num_keys parameter. The NVMe
Reservation Report command has a u32 maximum length. Reject num_keys
values that are too large to fit.
This will become important when pr_read_keys() is exposed to untrusted
userspace via an <linux/pr.h> ioctl.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Use bio_alloc_bioset for passthru IO by default, so that we can enable
bio cache for irq and polled passthru IO in later.
Signed-off-by: Fengnan Chang <changfengnan@bytedance.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux
Pull block updates from Jens Axboe:
- Fix head insertion for mq-deadline, a regression from when priority
support was added
- Series simplifying and improving the ublk user copy code
- Various ublk related cleanups
- Fixup REQ_NOWAIT handling in loop/zloop, clearing NOWAIT when the
request is punted to a thread for handling
- Merge and then later revert loop dio nowait support, as it ended up
causing excessive stack usage for when the inline issue code needs to
dip back into the full file system code
- Improve auto integrity code, making it less deadlock prone
- Speedup polled IO handling, but manually managing the hctx lookups
- Fixes for blk-throttle for SSD devices
- Small series with fixes for the S390 dasd driver
- Add support for caching zones, avoiding unnecessary report zone
queries
- MD pull requests via Yu:
- fix null-ptr-dereference regression for dm-raid0
- fix IO hang for raid5 when array is broken with IO inflight
- remove legacy 1s delay to speed up system shutdown
- change maintainer's email address
- data can be lost if array is created with different lbs devices,
fix this problem and record lbs of the array in metadata
- fix rcu protection for md_thread
- fix mddev kobject lifetime regression
- enable atomic writes for md-linear
- some cleanups
- bcache updates via Coly
- remove useless discard and cache device code
- improve usage of per-cpu workqueues
- Reorganize the IO scheduler switching code, fixing some lockdep
reports as well
- Improve the block layer P2P DMA support
- Add support to the block tracing code for zoned devices
- Segment calculation improves, and memory alignment flexibility
improvements
- Set of prep and cleanups patches for ublk batching support. The
actual batching hasn't been added yet, but helps shrink down the
workload of getting that patchset ready for 6.20
- Fix for how the ps3 block driver handles segments offsets
- Improve how block plugging handles batch tag allocations
- nbd fixes for use-after-free of the configuration on device clear/put
- Set of improvements and fixes for zloop
- Add Damien as maintainer of the block zoned device code handling
- Various other fixes and cleanups
* tag 'for-6.19/block-20251201' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (162 commits)
block/rnbd: correct all kernel-doc complaints
blk-mq: use queue_hctx in blk_mq_map_queue_type
md: remove legacy 1s delay in md_notify_reboot
md/raid5: fix IO hang when array is broken with IO inflight
md: warn about updating super block failure
md/raid0: fix NULL pointer dereference in create_strip_zones() for dm-raid
sbitmap: fix all kernel-doc warnings
ublk: add helper of __ublk_fetch()
ublk: pass const pointer to ublk_queue_is_zoned()
ublk: refactor auto buffer register in ublk_dispatch_req()
ublk: add `union ublk_io_buf` with improved naming
ublk: add parameter `struct io_uring_cmd *` to ublk_prep_auto_buf_reg()
kfifo: add kfifo_alloc_node() helper for NUMA awareness
blk-mq: fix potential uaf for 'queue_hw_ctx'
blk-mq: use array manage hctx map instead of xarray
ublk: prevent invalid access with DEBUG
s390/dasd: Use scnprintf() instead of sprintf()
s390/dasd: Move device name formatting into separate function
s390/dasd: Remove unnecessary debugfs_create() return checks
s390/dasd: Fix gendisk parent after copy pair swap
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux
Pull io_uring updates from Jens Axboe:
- Unify how task_work cancelations are detected, placing it in the
task_work running state rather than needing to check the task state
- Series cleaning up and moving the cancelation code to where it
belongs, in cancel.c
- Cleanup of waitid and futex argument handling
- Add support for mixed sized SQEs. 6.18 added support for mixed sized
CQEs, improving flexibility and efficiency of workloads that need big
CQEs. This adds similar support for SQEs, where the occasional need
for a 128b SQE doesn't necessitate having all SQEs be 128b in size
- Introduce zcrx and SQ/CQ layout queries. The former returns what zcrx
features are available. And both return the ring size information to
help with allocation size calculation for user provided rings like
IORING_SETUP_NO_MMAP and IORING_MEM_REGION_TYPE_USER
- Zcrx updates for 6.19. It includes a bunch of small patches,
IORING_REGISTER_ZCRX_CTRL and RQ flushing and David's work on sharing
zcrx b/w multiple io_uring instances
- Series cleaning up ring initializations, notable deduplicating ring
size and offset calculations. It also moves most of the checking
before doing any allocations, making the code simpler
- Add support for getsockname and getpeername, which is mostly a
trivial hookup after a bit of refactoring on the networking side
- Various fixes and cleanups
* tag 'for-6.19/io_uring-20251201' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (68 commits)
io_uring: Introduce getsockname io_uring cmd
socket: Split out a getsockname helper for io_uring
socket: Unify getsockname and getpeername implementation
io_uring/query: drop unused io_handle_query_entry() ctx arg
io_uring/kbuf: remove obsolete buf_nr_pages and update comments
io_uring/register: use correct location for io_rings_layout
io_uring/zcrx: share an ifq between rings
io_uring/zcrx: add io_fill_zcrx_offsets()
io_uring/zcrx: export zcrx via a file
io_uring/zcrx: move io_zcrx_scrub() and dependencies up
io_uring/zcrx: count zcrx users
io_uring/zcrx: add sync refill queue flushing
io_uring/zcrx: introduce IORING_REGISTER_ZCRX_CTRL
io_uring/zcrx: elide passing msg flags
io_uring/zcrx: use folio_nr_pages() instead of shift operation
io_uring/zcrx: convert to use netmem_desc
io_uring/query: introduce rings info query
io_uring/query: introduce zcrx query
io_uring: move cq/sq user offset init around
io_uring: pre-calculate scq layout
...
|
|
nvme_fc_ctrl_put can acquire the rport lock when freeing the
ctrl object:
nvme_fc_ctrl_put
nvme_fc_ctrl_free
spin_lock_irqsave(rport->lock)
Thus we can't hold the rport lock when calling nvme_fc_ctrl_put.
Justin suggested use the safe list iterator variant because
nvme_fc_ctrl_put will also modify the rport->list.
Cc: Justin Tee <justin.tee@broadcom.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Daniel Wagner <wagi@kernel.org>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
Add a debug log spelling out that reading the CSTS register failed - to
distinguish this from other reasons for ENODEV.
Reviewed-by: Wilfred Mallawa <wilfred.mallawa@wdc.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Gerd Bayer <gbayer@linux.ibm.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
Add a new error message that makes failures to probe visible in the
kernel log, like:
nvme 0008:00:00.0: error -ENODEV: probe failed
This highlights issues with a particular device right away instead of
leaving users to search for missing drives.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Wilfred Mallawa <wilfred.mallawa@wdc.com>
Signed-off-by: Gerd Bayer <gbayer@linux.ibm.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
Conflicts:
net/xdp/xsk.c
0ebc27a4c67d ("xsk: avoid data corruption on cq descriptor number")
8da7bea7db69 ("xsk: add indirect call for xsk_destruct_skb")
30ed05adca4a ("xsk: use a smaller new lock for shared pool case")
https://lore.kernel.org/20251127105450.4a1665ec@canb.auug.org.au
https://lore.kernel.org/eb4eee14-7e24-4d1b-b312-e9ea738fefee@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
nvme_fc_delete_assocation() waits for pending I/O to complete before
returning, and an error can cause ->ioerr_work to be queued after
cancel_work_sync() had been called. Move the call to cancel_work_sync() to
be after nvme_fc_delete_association() to ensure ->ioerr_work is not running
when the nvme_fc_ctrl object is freed. Otherwise the following can occur:
[ 1135.911754] list_del corruption, ff2d24c8093f31f8->next is NULL
[ 1135.917705] ------------[ cut here ]------------
[ 1135.922336] kernel BUG at lib/list_debug.c:52!
[ 1135.926784] Oops: invalid opcode: 0000 [#1] SMP NOPTI
[ 1135.931851] CPU: 48 UID: 0 PID: 726 Comm: kworker/u449:23 Kdump: loaded Not tainted 6.12.0 #1 PREEMPT(voluntary)
[ 1135.943490] Hardware name: Dell Inc. PowerEdge R660/0HGTK9, BIOS 2.5.4 01/16/2025
[ 1135.950969] Workqueue: 0x0 (nvme-wq)
[ 1135.954673] RIP: 0010:__list_del_entry_valid_or_report.cold+0xf/0x6f
[ 1135.961041] Code: c7 c7 98 68 72 94 e8 26 45 fe ff 0f 0b 48 c7 c7 70 68 72 94 e8 18 45 fe ff 0f 0b 48 89 fe 48 c7 c7 80 69 72 94 e8 07 45 fe ff <0f> 0b 48 89 d1 48 c7 c7 a0 6a 72 94 48 89 c2 e8 f3 44 fe ff 0f 0b
[ 1135.979788] RSP: 0018:ff579b19482d3e50 EFLAGS: 00010046
[ 1135.985015] RAX: 0000000000000033 RBX: ff2d24c8093f31f0 RCX: 0000000000000000
[ 1135.992148] RDX: 0000000000000000 RSI: ff2d24d6bfa1d0c0 RDI: ff2d24d6bfa1d0c0
[ 1135.999278] RBP: ff2d24c8093f31f8 R08: 0000000000000000 R09: ffffffff951e2b08
[ 1136.006413] R10: ffffffff95122ac8 R11: 0000000000000003 R12: ff2d24c78697c100
[ 1136.013546] R13: fffffffffffffff8 R14: 0000000000000000 R15: ff2d24c78697c0c0
[ 1136.020677] FS: 0000000000000000(0000) GS:ff2d24d6bfa00000(0000) knlGS:0000000000000000
[ 1136.028765] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1136.034510] CR2: 00007fd207f90b80 CR3: 000000163ea22003 CR4: 0000000000f73ef0
[ 1136.041641] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1136.048776] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
[ 1136.055910] PKRU: 55555554
[ 1136.058623] Call Trace:
[ 1136.061074] <TASK>
[ 1136.063179] ? show_trace_log_lvl+0x1b0/0x2f0
[ 1136.067540] ? show_trace_log_lvl+0x1b0/0x2f0
[ 1136.071898] ? move_linked_works+0x4a/0xa0
[ 1136.075998] ? __list_del_entry_valid_or_report.cold+0xf/0x6f
[ 1136.081744] ? __die_body.cold+0x8/0x12
[ 1136.085584] ? die+0x2e/0x50
[ 1136.088469] ? do_trap+0xca/0x110
[ 1136.091789] ? do_error_trap+0x65/0x80
[ 1136.095543] ? __list_del_entry_valid_or_report.cold+0xf/0x6f
[ 1136.101289] ? exc_invalid_op+0x50/0x70
[ 1136.105127] ? __list_del_entry_valid_or_report.cold+0xf/0x6f
[ 1136.110874] ? asm_exc_invalid_op+0x1a/0x20
[ 1136.115059] ? __list_del_entry_valid_or_report.cold+0xf/0x6f
[ 1136.120806] move_linked_works+0x4a/0xa0
[ 1136.124733] worker_thread+0x216/0x3a0
[ 1136.128485] ? __pfx_worker_thread+0x10/0x10
[ 1136.132758] kthread+0xfa/0x240
[ 1136.135904] ? __pfx_kthread+0x10/0x10
[ 1136.139657] ret_from_fork+0x31/0x50
[ 1136.143236] ? __pfx_kthread+0x10/0x10
[ 1136.146988] ret_from_fork_asm+0x1a/0x30
[ 1136.150915] </TASK>
Fixes: 19fce0470f05 ("nvme-fc: avoid calling _nvme_fc_abort_outstanding_ios from interrupt context")
Cc: stable@vger.kernel.org
Tested-by: Marco Patalano <mpatalan@redhat.com>
Reviewed-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: Ewan D. Milne <emilne@redhat.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
Now target is removed from nvme_fc_ctrl_free() which is the ctrl->ref
release handler. And even admin queue is unquiesced there, this way
is definitely wrong because the ctr->ref is grabbed when submitting
command.
And Marco observed that nvme_fc_ctrl_free() can be called from request
completion code path, and trigger kernel warning since request completes
from softirq context.
Fix the issue by moveing target removal into nvme_fc_delete_ctrl(),
which is also aligned with nvme-tcp and nvme-rdma.
Patch originally proposed by Ming Lei, then modified to move the tagset
removal down to after nvme_fc_delete_association() after further testing.
Cc: Marco Patalano <mpatalan@redhat.com>
Cc: Ewan Milne <emilne@redhat.com>
Cc: James Smart <james.smart@broadcom.com>
Cc: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Cc: stable@vger.kernel.org
Tested-by: Marco Patalano <mpatalan@redhat.com>
Reviewed-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: Ewan D. Milne <emilne@redhat.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
Blktests test cases nvme/014, 057 and 058 fail occasionally due to a
lockdep WARN. As reported in the Closes tag URL, the WARN indicates that
a deadlock can happen due to the dependency among disk->open_mutex,
kblockd workqueue completion and partition_scan_work completion.
To avoid the lockdep WARN and the potential deadlock, cut the dependency
by running the partition_scan_work not by kblockd workqueue but by
nvme_wq.
Reported-by: Yi Zhang <yi.zhang@redhat.com>
Closes: https://lore.kernel.org/linux-block/CAHj4cs8mJ+R_GmQm9R8ebResKAWUE8kF5+_WVg0v8zndmqd6BQ@mail.gmail.com/
Link: https://lore.kernel.org/linux-block/oeyzci6ffshpukpfqgztsdeke5ost5hzsuz4rrsjfmvpqcevax@5nhnwbkzbrpa/
Fixes: 1f021341eef4 ("nvme-multipath: defer partition scanning")
Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
In commit eadaa8b255f3 ("dma-mapping: introduce new DMA attribute to
indicate MMIO memory"), DMA_ATTR_MMIO attribute was added to describe
MMIO addresses, which require to avoid any memory cache flushing, as
an outcome of the discussion pointed in Link tag below.
In case of PCI_P2PDMA_MAP_THRU_HOST_BRIDGE transfer, blk-mq-dm logic
treated this as regular page and relied on "struct page" DMA flow.
That flow performs CPU cache flushing, which shouldn't be done here,
and doesn't set IOMMU_MMIO flag in DMA-IOMMU case.
As a solution, let's encode peer-to-peer transaction type in NVMe IOD
flags variable and provide it to blk-mq-dma API.
Link: https://lore.kernel.org/all/f912c446-1ae9-4390-9c11-00dce7bf0fd3@arm.com/
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
After introduction of dma_map_phys(), there is no need to convert
from physical address to struct page in order to map page. So let's
use it directly.
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The nvme virtual boundary is only required for the PRP format. Devices
that can use SGL for DMA don't need it for IO queues. Drop reporting it
for such devices; rdma fabrics controllers will continue to use the
limit as they currently don't report any boundary requirements, but tcp
and fc never needed it in the first place so they get to report no
virtual boundary.
Applications may continue to align to the same virtual boundaries for
optimization purposes if they want, and the driver will continue to
decide whether to use the PRP format the same as before if the IO allows
it.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Cross-merge networking fixes after downstream PR (net-6.18-rc5).
Conflicts:
drivers/net/wireless/ath/ath12k/mac.c
9222582ec524 ("Revert "wifi: ath12k: Fix missing station power save configuration"")
6917e268c433 ("wifi: ath12k: Defer vdev bring-up until CSA finalize to avoid stale beacon")
https://lore.kernel.org/11cece9f7e36c12efd732baa5718239b1bf8c950.camel@sipsolutions.net
Adjacent changes:
drivers/net/ethernet/intel/Kconfig
b1d16f7c0063 ("libie: depend on DEBUG_FS when building LIBIE_FWLOG")
93f53db9f9dc ("ice: switch to Page Pool")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The namespaces can access the controller's admin request_queue, and
stale references on the namespaces may exist after tearing down the
controller. Ensure the admin request_queue is active by moving the
controller's 'put' to after all controller references have been released
to ensure no one is can access the request_queue. This fixes a reported
use-after-free bug:
BUG: KASAN: slab-use-after-free in blk_queue_enter+0x41c/0x4a0
Read of size 8 at addr ffff88c0a53819f8 by task nvme/3287
CPU: 67 UID: 0 PID: 3287 Comm: nvme Tainted: G E 6.13.2-ga1582f1a031e #15
Tainted: [E]=UNSIGNED_MODULE
Hardware name: Jabil /EGS 2S MB1, BIOS 1.00 06/18/2025
Call Trace:
<TASK>
dump_stack_lvl+0x4f/0x60
print_report+0xc4/0x620
? _raw_spin_lock_irqsave+0x70/0xb0
? _raw_read_unlock_irqrestore+0x30/0x30
? blk_queue_enter+0x41c/0x4a0
kasan_report+0xab/0xe0
? blk_queue_enter+0x41c/0x4a0
blk_queue_enter+0x41c/0x4a0
? __irq_work_queue_local+0x75/0x1d0
? blk_queue_start_drain+0x70/0x70
? irq_work_queue+0x18/0x20
? vprintk_emit.part.0+0x1cc/0x350
? wake_up_klogd_work_func+0x60/0x60
blk_mq_alloc_request+0x2b7/0x6b0
? __blk_mq_alloc_requests+0x1060/0x1060
? __switch_to+0x5b7/0x1060
nvme_submit_user_cmd+0xa9/0x330
nvme_user_cmd.isra.0+0x240/0x3f0
? force_sigsegv+0xe0/0xe0
? nvme_user_cmd64+0x400/0x400
? vfs_fileattr_set+0x9b0/0x9b0
? cgroup_update_frozen_flag+0x24/0x1c0
? cgroup_leave_frozen+0x204/0x330
? nvme_ioctl+0x7c/0x2c0
blkdev_ioctl+0x1a8/0x4d0
? blkdev_common_ioctl+0x1930/0x1930
? fdget+0x54/0x380
__x64_sys_ioctl+0x129/0x190
do_syscall_64+0x5b/0x160
entry_SYSCALL_64_after_hwframe+0x4b/0x53
RIP: 0033:0x7f765f703b0b
Code: ff ff ff 85 c0 79 9b 49 c7 c4 ff ff ff ff 5b 5d 4c 89 e0 41 5c c3 66 0f 1f 84 00 00 00 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d dd 52 0f 00 f7 d8 64 89 01 48
RSP: 002b:00007ffe2cefe808 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 00007ffe2cefe860 RCX: 00007f765f703b0b
RDX: 00007ffe2cefe860 RSI: 00000000c0484e41 RDI: 0000000000000003
RBP: 0000000000000000 R08: 0000000000000003 R09: 0000000000000000
R10: 00007f765f611d50 R11: 0000000000000202 R12: 0000000000000003
R13: 00000000c0484e41 R14: 0000000000000001 R15: 00007ffe2cefea60
</TASK>
Reported-by: Casey Chen <cachen@purestorage.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
Commit b76b840fd933 ("dm: Fix dm-zoned-reclaim zone write pointer
alignment") introduced an indirect call for the callback function of a
report zones executed with blkdev_report_zones(). This is necessary so
that the function disk_zone_wplug_sync_wp_offset() can be called to
refresh a zone write plug zone write pointer offset after a write error.
However, this solution makes following the path of a zone information
harder to understand.
Clean this up by introducing the new blk_report_zones_args structure to
define a zone report callback and its private data and introduce the
helper function disk_report_zone() which calls both
disk_zone_wplug_sync_wp_offset() and the zone report user callback
function for all zones of a zone report. This helper function must be
called by all block device drivers that implement the report zones
block operation in order to correctly report a zone information.
All block device drivers supporting the report_zones block operation are
updated to use this new scheme.
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Update all struct proto_ops connect() callback function prototypes from
"struct sockaddr *" to "struct sockaddr_unsized *" to avoid lying to the
compiler about object sizes. Calls into struct proto handlers gain casts
that will be removed in the struct proto conversion patch.
No binary changes expected.
Signed-off-by: Kees Cook <kees@kernel.org>
Link: https://patch.msgid.link/20251104002617.2752303-3-kees@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Update all struct proto_ops bind() callback function prototypes from
"struct sockaddr *" to "struct sockaddr_unsized *" to avoid lying to the
compiler about object sizes. Calls into struct proto handlers gain casts
that will be removed in the struct proto conversion patch.
No binary changes expected.
Signed-off-by: Kees Cook <kees@kernel.org>
Link: https://patch.msgid.link/20251104002617.2752303-2-kees@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
io_uring task work dispatch makes an indirect call to struct io_kiocb's
io_task_work.func field to allow running arbitrary task work functions.
In the uring_cmd case, this calls io_uring_cmd_work(), which immediately
makes another indirect call to struct io_uring_cmd's task_work_cb field.
Change the uring_cmd task work callbacks to functions whose signatures
match io_req_tw_func_t. Add a function io_uring_cmd_from_tw() to convert
from the task work's struct io_tw_req argument to struct io_uring_cmd *.
Define a constant IO_URING_CMD_TASK_WORK_ISSUE_FLAGS to avoid
manufacturing issue_flags in the uring_cmd task work callbacks. Now
uring_cmd task work dispatch makes a single indirect call to the
uring_cmd implementation's callback. This also allows removing the
task_work_cb field from struct io_uring_cmd, freeing up 8 bytes for
future storage.
Since fuse_uring_send_in_task() now has access to the io_tw_token_t,
check its cancel field directly instead of relying on the
IO_URING_F_TASK_DEAD issue flag.
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The dma_map_bvec helper doesn't work for p2p data, so use the same
blk_map_iter method that sgl uses for this memory type.
Reported-by: Leon Romanovsky <leon@kernel.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
With TLS enabled, records that are encrypted and appended to TLS TX
list can fail to see a retry if the underlying TCP socket is busy, for
example, hitting an EAGAIN from tcp_sendmsg_locked(). This is not known
to the NVMe TCP driver, as the TLS layer successfully generated a record.
Typically, the TLS write_space() callback would ensure such records are
retried, but in the NVMe TCP Host driver, write_space() invokes
nvme_tcp_write_space(). This causes a partially sent record in the TLS TX
list to timeout after not being retried.
This patch fixes the above by calling queue->write_space(), which calls
into the TLS layer to retry any pending records.
Fixes: be8e82caa685 ("nvme-tcp: enable TLS handshake upcall")
Signed-off-by: Wilfred Mallawa <wilfred.mallawa@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
The sc_c field is currently not updated in the host response to the
controller challenge leading to failures while attempting secure
channel concatenation. Fix this by adding a new sc_c variable to the
dhchap queue context structure which is appropriately set during
negotiate and then used in the host response.
Fixes: e88a7595b57f ("nvme-tcp: request secure channel concatenation")
Signed-off-by: Martin George <marting@netapp.com>
Signed-off-by: Prashanth Adurthi <prashana@netapp.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
For queue-depth I/O policy, this patch fixes unbalanced I/Os across
nvme multipaths.
Issue Description:
The RETRY disposition incorrectly increments ns->ctrl->nr_active
counter and reinitializes iostat start-time. In such cases nr_active
counter never goes back to zero until that path disconnects and
reconnects.
Such a path is not chosen for new I/Os if multiple RETRY cases on a given
a path cause its queue-depth counter to be artificially higher compared
to other paths. This leads to unbalanced I/Os across paths.
The patch skips incrementing nr_active if NVME_MPATH_CNT_ACTIVE is already
set. And it skips restarting io stats if NVME_MPATH_IO_STATS is already set.
base-commit: e989a3da2d371a4b6597ee8dee5c72e407b4db7a
Fixes: d4d957b53d91eeb ("nvme-multipath: support io stats on the mpath device")
Signed-off-by: Amit Chaudhary <achaudhary@purestorage.com>
Reviewed-by: Randy Jennings <randyj@purestorage.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux
Pull block updates from Jens Axboe:
- NVMe pull request via Keith:
- FC target fixes (Daniel)
- Authentication fixes and updates (Martin, Chris)
- Admin controller handling (Kamaljit)
- Target lockdep assertions (Max)
- Keep-alive updates for discovery (Alastair)
- Suspend quirk (Georg)
- MD pull request via Yu:
- Add support for a lockless bitmap.
A key feature for the new bitmap are that the IO fastpath is
lockless. If a user issues lots of write IO to the same bitmap
bit in a short time, only the first write has additional overhead
to update bitmap bit, no additional overhead for the following
writes.
By supporting only resync or recover written data, means in the
case creating new array or replacing with a new disk, there is no
need to do a full disk resync/recovery.
- Switch ->getgeo() and ->bios_param() to using struct gendisk rather
than struct block_device.
- Rust block changes via Andreas. This series adds configuration via
configfs and remote completion to the rnull driver. The series also
includes a set of changes to the rust block device driver API: a few
cleanup patches, and a few features supporting the rnull changes.
The series removes the raw buffer formatting logic from
`kernel::block` and improves the logic available in `kernel::string`
to support the same use as the removed logic.
- floppy arch cleanups
- Reduce the number of dereferencing needed for ublk commands
- Restrict supported sockets for nbd. Mostly done to eliminate a class
of issues perpetually reported by syzbot, by using nonsensical socket
setups.
- A few s390 dasd block fixes
- Fix a few issues around atomic writes
- Improve DMA interation for integrity requests
- Improve how iovecs are treated with regards to O_DIRECT aligment
constraints.
We used to require each segment to adhere to the constraints, now
only the request as a whole needs to.
- Clean up and improve p2p support, enabling use of p2p for metadata
payloads
- Improve locking of request lookup, using SRCU where appropriate
- Use page references properly for brd, avoiding very long RCU sections
- Fix ordering of recursively submitted IOs
- Clean up and improve updating nr_requests for a live device
- Various fixes and cleanups
* tag 'for-6.18/block-20250929' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (164 commits)
s390/dasd: enforce dma_alignment to ensure proper buffer validation
s390/dasd: Return BLK_STS_INVAL for EINVAL from do_dasd_request
ublk: remove redundant zone op check in ublk_setup_iod()
nvme: Use non zero KATO for persistent discovery connections
nvmet: add safety check for subsys lock
nvme-core: use nvme_is_io_ctrl() for I/O controller check
nvme-core: do ioccsz/iorcsz validation only for I/O controllers
nvme-core: add method to check for an I/O controller
blk-cgroup: fix possible deadlock while configuring policy
blk-mq: fix null-ptr-deref in blk_mq_free_tags() from error path
blk-mq: Fix more tag iteration function documentation
selftests: ublk: fix behavior when fio is not installed
ublk: don't access ublk_queue in ublk_unmap_io()
ublk: pass ublk_io to __ublk_complete_rq()
ublk: don't access ublk_queue in ublk_need_complete_req()
ublk: don't access ublk_queue in ublk_check_commit_and_fetch()
ublk: don't pass ublk_queue to ublk_fetch()
ublk: don't access ublk_queue in ublk_config_io_buf()
ublk: don't access ublk_queue in ublk_check_fetch_buf()
ublk: pass q_id and tag to __ublk_check_and_get_req()
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux
Pull io_uring updates from Jens Axboe:
- Store ring provided buffers locally for the users, rather than stuff
them into struct io_kiocb.
These types of buffers must always be fully consumed or recycled in
the current context, and leaving them in struct io_kiocb is hence not
a good ideas as that struct has a vastly different life time.
Basically just an architecture cleanup that can help prevent issues
with ring provided buffers in the future.
- Support for mixed CQE sizes in the same ring.
Before this change, a CQ ring either used the default 16b CQEs, or it
was setup with 32b CQE using IORING_SETUP_CQE32. For use cases where
a few 32b CQEs were needed, this caused everything else to use big
CQEs. This is wasteful both in terms of memory usage, but also memory
bandwidth for the posted CQEs.
With IORING_SETUP_CQE_MIXED, applications may use request types that
post both normal 16b and big 32b CQEs on the same ring.
- Add helpers for async data management, to make it harder for opcode
handlers to mess it up.
- Add support for multishot for uring_cmd, which ublk can use. This
helps improve efficiency, by providing a persistent request type that
can trigger multiple CQEs.
- Add initial support for ring feature querying.
We had basic support for probe operations, but the API isn't great.
Rather than expand that, add support for QUERY which is easily
expandable and can cover a lot more cases than the existing probe
support. This will help applications get a better idea of what
operations are supported on a given host.
- zcrx improvements from Pavel:
- Improve refill entry alignment for better caching
- Various cleanups, especially around deduplicating normal
memory vs dmabuf setup.
- Generalisation of the niov size (Patch 12). It's still hard
coded to PAGE_SIZE on init, but will let the user to specify
the rx buffer length on setup.
- Syscall / synchronous bufer return. It'll be used as a slow
fallback path for returning buffers when the refill queue is
full. Useful for tolerating slight queue size misconfiguration
or with inconsistent load.
- Accounting more memory to cgroups.
- Additional independent cleanups that will also be useful for
mutli-area support.
- Various fixes and cleanups
* tag 'for-6.18/io_uring-20250929' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (68 commits)
io_uring/cmd: drop unused res2 param from io_uring_cmd_done()
io_uring: fix nvme's 32b cqes on mixed cq
io_uring/query: cap number of queries
io_uring/query: prevent infinite loops
io_uring/zcrx: account niov arrays to cgroup
io_uring/zcrx: allow synchronous buffer return
io_uring/zcrx: introduce io_parse_rqe()
io_uring/zcrx: don't adjust free cache space
io_uring/zcrx: use guards for the refill lock
io_uring/zcrx: reduce netmem scope in refill
io_uring/zcrx: protect netdev with pp_lock
io_uring/zcrx: rename dma lock
io_uring/zcrx: make niov size variable
io_uring/zcrx: set sgt for umem area
io_uring/zcrx: remove dmabuf_offset
io_uring/zcrx: deduplicate area mapping
io_uring/zcrx: pass ifq to io_zcrx_alloc_fallback()
io_uring/zcrx: check all niovs filled with dma addresses
io_uring/zcrx: move area reg checks into io_import_area
io_uring/zcrx: don't pass slot to io_zcrx_create_area
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
Pull SoC driver updates from Arnd Bergmann:
"Lots of platform specific updates for Qualcomm SoCs, including a new
TEE subsystem driver for the Qualcomm QTEE firmware interface.
Added support for the Apple A11 SoC in drivers that are shared with
the M1/M2 series, among more updates for those.
Smaller platform specific driver updates for Renesas, ASpeed,
Broadcom, Nvidia, Mediatek, Amlogic, TI, Allwinner, and Freescale
SoCs.
Driver updates in the cache controller, memory controller and reset
controller subsystems.
SCMI firmware updates to add more features and improve robustness.
This includes support for having multiple SCMI providers in a single
system.
TEE subsystem support for protected DMA-bufs, allowing hardware to
access memory areas that managed by the kernel but remain inaccessible
from the CPU in EL1/EL0"
* tag 'soc-drivers-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc: (139 commits)
soc/fsl/qbman: Use for_each_online_cpu() instead of for_each_cpu()
soc: fsl: qe: Drop legacy-of-mm-gpiochip.h header from GPIO driver
soc: fsl: qe: Change GPIO driver to a proper platform driver
tee: fix register_shm_helper()
pmdomain: apple: Add "apple,t8103-pmgr-pwrstate"
dt-bindings: spmi: Add Apple A11 and T2 compatible
serial: qcom-geni: Load UART qup Firmware from linux side
spi: geni-qcom: Load spi qup Firmware from linux side
i2c: qcom-geni: Load i2c qup Firmware from linux side
soc: qcom: geni-se: Add support to load QUP SE Firmware via Linux subsystem
soc: qcom: geni-se: Cleanup register defines and update copyright
dt-bindings: qcom: se-common: Add QUP Peripheral-specific properties for I2C, SPI, and SERIAL bus
Documentation: tee: Add Qualcomm TEE driver
tee: qcom: enable TEE_IOC_SHM_ALLOC ioctl
tee: qcom: add primordial object
tee: add Qualcomm TEE driver
tee: increase TEE_MAX_ARG_SIZE to 4096
tee: add TEE_IOCTL_PARAM_ATTR_TYPE_OBJREF
tee: add TEE_IOCTL_PARAM_ATTR_TYPE_UBUF
tee: add close_context to TEE driver operation
...
|
|
for-6.18/block
Pull NVMe updates from Keith:
" - FC target fixes (Daniel)
- Authentication fixes and updates (Martin, Chris)
- Admin controller handling (Kamaljit)
- Target lockdep assertions (Max)
- Keep-alive updates for discovery (Alastair)
- Suspend quirk (Georg)"
* tag 'nvme-6.18-2025-09-23' of git://git.infradead.org/nvme:
nvme: Use non zero KATO for persistent discovery connections
nvmet: add safety check for subsys lock
nvme-core: use nvme_is_io_ctrl() for I/O controller check
nvme-core: do ioccsz/iorcsz validation only for I/O controllers
nvme-core: add method to check for an I/O controller
nvme-pci: Add TUXEDO IBS Gen8 to Samsung sleep quirk
nvme-auth: use hkdf_expand_label()
nvme-auth: add hkdf_expand_label()
nvme-tcp: send only permitted commands for secure concat
nvme-fc: use lock accessing port_state and rport state
nvmet-fcloop: call done callback even when remote port is gone
nvmet-fc: avoid scheduling association deletion twice
nvmet-fc: move lsop put work to nvmet_fc_ls_req_op
nvme-auth: update bi_directional flag
|