summaryrefslogtreecommitdiff
path: root/drivers/md
AgeCommit message (Collapse)Author
2025-11-08md: avoid repeated calls to del_gendiskXiao Ni
There is a uaf problem which is found by case 23rdev-lifetime: Oops: general protection fault, probably for non-canonical address 0xdead000000000122 RIP: 0010:bdi_unregister+0x4b/0x170 Call Trace: <TASK> __del_gendisk+0x356/0x3e0 mddev_unlock+0x351/0x360 rdev_attr_store+0x217/0x280 kernfs_fop_write_iter+0x14a/0x210 vfs_write+0x29e/0x550 ksys_write+0x74/0xf0 do_syscall_64+0xbb/0x380 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7ff5250a177e The sequence is: 1. rdev remove path gets reconfig_mutex 2. rdev remove path release reconfig_mutex in mddev_unlock 3. md stop calls do_md_stop and sets MD_DELETED 4. rdev remove path calls del_gendisk because MD_DELETED is set 5. md stop path release reconfig_mutex and calls del_gendisk again So there is a race condition we should resolve. This patch adds a flag MD_DO_DELETE to avoid the race condition. Link: https://lore.kernel.org/linux-raid/20251029063419.21700-1-xni@redhat.com Fixes: 9e59d609763f ("md: call del_gendisk in control path") Signed-off-by: Xiao Ni <xni@redhat.com> Suggested-by: Yu Kuai <yukuai@fnnas.com> Reviewed-by: Li Nan <linan122@huawei.com> Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2025-11-08md/md-llbitmap: Remove unneeded semicolonChen Ni
Remove unnecessary semicolons reported by Coccinelle/coccicheck and the semantic patch at scripts/coccinelle/misc/semicolon.cocci. Link: https://lore.kernel.org/linux-raid/20250910091912.25624-1-nichen@iscas.ac.cn Signed-off-by: Chen Ni <nichen@iscas.ac.cn> Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2025-11-08md/md-linear: Enable atomic writesJohn Garry
All the infrastructure has already been plumbed to support this for stacked devices, so just enable the request_queue limits features flag. A note about chunk sectors for linear arrays: While it is possible to set a chunk sectors param for building a linear array, this is for specifying the granularity at which data sectors from the device are used. It is not the same as a stripe size, like for RAID0. As such, it is not appropriate to set chunk_sectors request queue limit to the same value, as chunk_sectors request limit is a boundary for which requests cannot straddle. However, request_queue limit max_hw_sectors is set to chunk sectors, which almost has the same effect as setting chunk_sectors limit. Link: https://lore.kernel.org/linux-raid/20250903161052.3326176-1-john.g.garry@oracle.com Signed-off-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Yu Kuai <yukuai3@fnnas.com> Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2025-11-08Factor out code into md_should_do_recovery()Wu Guanghao
In md_check_recovery(), use new helper to make code cleaner. Link: https://lore.kernel.org/linux-raid/e62894c8-d916-94bc-ef48-3c60e6e1fc5d@huawei.com Signed-off-by: Wu Guanghao <wuguanghao3@huawei.com> Reviewed-by: Yu Kuai <yukuai3@fnnas.com> Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2025-11-08md: fix rcu protection in md_wakeup_threadYun Zhou
We attempted to use RCU to protect the pointer 'thread', but directly passed the value when calling md_wakeup_thread(). This means that the RCU pointer has been acquired before rcu_read_lock(), which renders rcu_read_lock() ineffective and could lead to a use-after-free. Link: https://lore.kernel.org/linux-raid/20251015083227.1079009-1-yun.zhou@windriver.com Fixes: 446931543982 ("md: protect md_thread with rcu") Signed-off-by: Yun Zhou <yun.zhou@windriver.com> Reviewed-by: Li Nan <linan122@huawei.com> Reviewed-by: Yu Kuai <yukuai@fnnas.com> Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2025-11-08md: delete mddev kobj before deleting gendisk kobjXiao Ni
In sync del gendisk path, it deletes gendisk first and the directory /sys/block/md is removed. Then it releases mddev kobj in a delayed work. If we enable debug log in sysfs_remove_group, we can see the debug log 'sysfs group bitmap not found for kobject md'. It's the reason that the parent kobj has been deleted, so it can't find parent directory. In creating path, it allocs gendisk first, then adds mddev kobj. So it should delete mddev kobj before deleting gendisk. Before commit 9e59d609763f ("md: call del_gendisk in control path"), it releases mddev kobj first. If the kobj hasn't been deleted, it does clean job and deletes the kobj. Then it calls del_gendisk and releases gendisk kobj. So it doesn't need to call kobject_del to delete mddev kobj. After this patch, in sync del gendisk path, the sequence changes. So it needs to call kobject_del to delete mddev kobj. After this patch, the sequence is: 1. kobject del mddev kobj 2. del_gendisk deletes gendisk kobj 3. mddev_delayed_delete releases mddev kobj 4. md_kobj_release releases gendisk kobj Link: https://lore.kernel.org/linux-raid/20250928012424.61370-1-xni@redhat.com Fixes: 9e59d609763f ("md: call del_gendisk in control path") Signed-off-by: Xiao Ni <xni@redhat.com> Reviewed-by: Li Nan <linan122@huawei.com> Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2025-11-05block: introduce disk_report_zone()Damien Le Moal
Commit b76b840fd933 ("dm: Fix dm-zoned-reclaim zone write pointer alignment") introduced an indirect call for the callback function of a report zones executed with blkdev_report_zones(). This is necessary so that the function disk_zone_wplug_sync_wp_offset() can be called to refresh a zone write plug zone write pointer offset after a write error. However, this solution makes following the path of a zone information harder to understand. Clean this up by introducing the new blk_report_zones_args structure to define a zone report callback and its private data and introduce the helper function disk_report_zone() which calls both disk_zone_wplug_sync_wp_offset() and the zone report user callback function for all zones of a zone report. This helper function must be called by all block device drivers that implement the report zones block operation in order to correctly report a zone information. All block device drivers supporting the report_zones block operation are updated to use this new scheme. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-10-24treewide: Remove in_irq()Matthew Wilcox (Oracle)
This old alias for in_hardirq() has been marked as deprecated since 2020; remove the stragglers. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://patch.msgid.link/20251024180654.1691095-1-willy@infradead.org
2025-10-20dm-verity: use 2-way interleaved SHA-256 hashing when supportedEric Biggers
When the crypto library provides an optimized implementation of sha256_finup_2x(), use it to interleave the hashing of pairs of data blocks. On some CPUs this nearly doubles hashing performance. The increase in overall throughput of cold-cache dm-verity reads that I'm seeing on arm64 and x86_64 is roughly 35% (though this metric is hard to measure as it jumps around a lot). For now this is done only on data blocks, not Merkle tree blocks. We could use sha256_finup_2x() on Merkle tree blocks too, but that is less important as there aren't as many Merkle tree blocks as data blocks, and that would require some additional code restructuring. Signed-off-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-10-20dm-verity: reduce scope of real and wanted digestsEric Biggers
In preparation for supporting interleaved hashing where dm-verity will need to keep track of the real and wanted digests for multiple data blocks simultaneously, stop using the want_digest and real_digest fields of struct dm_verity_io from so many different places. Specifically: - Make various functions take want_digest as a parameter rather than having it be implicitly passed via the struct dm_verity_io. - Add a new tmp_digest field, and use this instead of real_digest when computing a digest solely for the purpose of immediately checking it. The result is that real_digest and want_digest are used only by verity_verify_io(). Signed-off-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-10-20dm-verity: use SHA-256 library for SHA-256Eric Biggers
When the hash algorithm is SHA-256 and the verity version is not 0, use the SHA-256 library instead of crypto_shash. This is a prerequisite for making dm-verity interleave the computation of SHA-256 hashes for increased performance. That optimization is available in the SHA-256 library but not in crypto_shash. Even without interleaved hashing, switching to the library also slightly improves performance by itself because it avoids the overhead of crypto_shash, including indirect calls and other API overhead. (Benchmark on x86_64, AMD Zen 5: hashing 4K blocks gets 2.1% faster.) SHA-256 is by far the most common hash algorithm used with dm-verity. It makes sense to optimize for the common case and fall back to the generic crypto layer for uncommon cases, as suggested by Linus: https://lore.kernel.org/r/CAHk-=wgp-fOSsZsYrbyzqCAfEvrt5jQs1jL-97Wc4seMNTUyng@mail.gmail.com Signed-off-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-10-20dm-verity: remove log message with shash driver nameEric Biggers
I added this log message in commit bbf6a566920e ("dm verity: log the hash algorithm implementation"), to help people debug issues where they forgot to enable the architecture-optimized SHA-256 code in their kconfig or accidentally enabled a slow hardware offload driver (such as QCE) that overrode the faster CPU-accelerated code. However: - The crypto layer now always enables the architecture-optimized SHA-1, SHA-256, and SHA-512 code. Moreover, for simplicity the driver name is now fixed at "sha1-lib", "sha256-lib", etc. - dm-verity now uses crypto_shash instead of crypto_ahash, preventing the mistake of accidentally using a slow driver such as QCE. Therefore, this log message generally no longer provides useful information. Remove it. Signed-off-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-10-20dm: Fix deadlock when reloading a multipath tableBenjamin Marzinski
Request-based devices (dm-multipath) queue I/O in blk-mq on noflush suspends. Any queued IO will make it impossible to freeze the queue. If a process attempts to update the queue limits while there is queued IO, it can be get stuck holding the limits lock, while unable to freeze the queue. If device-mapper then attempts to update the limits during a table swap, it will deadlock trying to grab the limits lock while making it impossible to flush the IO. Disallow updating the queue limits during a table swap, when updating an immutable request-based dm device (dm-multipath) during a noflush suspend. It is userspace's responsibility to make sure that the new table uses the same limits as the existing table if it asks for a noflush suspend. Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-10-20dm: sysfs: use sysfs_emit() in dm-sysfs.cVivek BalachandharTN
Replace sprintf()+strlen() with sysfs_emit(), the preferred helper for sysfs show() routines. sysfs_emit() returns the number of bytes written, guarantees NUL-termination, and clamps to PAGE_SIZE-1. Reference: Documentation/filesystems/sysfs.rst. No functional change intended. Signed-off-by: Vivek BalachandharTN <vivek.balachandhar@gmail.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-10-20dm: remove useless md->nr_zones variableBenjamin Marzinski
md->nr_zones is no longer used for anything. Remove it. Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-10-20dm-crypt: use folio_nr_pages() instead of shift operationPedro Demarchi Gomes
folio_nr_pages() is a faster helper function to get the number of pages when NR_PAGES_IN_LARGE_FOLIO is enabled. Signed-off-by: Pedro Demarchi Gomes <pedrodemargomes@gmail.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-10-20dm-crypt: Use MD5 library instead of crypto_shashEric Biggers
The lmk IV mode, which dm-crypt supports for Loop-AES compatibility, involves an MD5 computation. Update its implementation to use the MD5 library API instead of crypto_shash. This has many benefits, such as: - Simpler code. Notably, much of the error-handling code is no longer needed, since the library functions can't fail. - Reduced stack usage. crypt_iv_lmk_one() now allocates only 112 bytes on the stack instead of 520 bytes. - The library functions are strongly typed, preventing bugs like https://lore.kernel.org/r/f1625ddc-e82e-4b77-80c2-dc8e45b54848@gmail.com - Slightly improved performance, as the library provides direct access to the MD5 code without unnecessary overhead such as indirect calls. To preserve the existing behavior of lmk support being disabled when the kernel is booted with "fips=1", make crypt_iv_lmk_ctr() check fips_enabled itself. Previously it relied on crypto_alloc_shash("md5") failing. (I don't know for sure that lmk *actually* needs to be disallowed in FIPS mode; this just preserves the existing behavior.) Signed-off-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Acked-by: Ard Biesheuvel <ardb@kernel.org>
2025-10-03Merge tag 'for-6.18/dm-changes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper updates from Mikulas Patocka: - a new dm-pcache target for read/write caching on persistent memory - fix typos in docs - misc small refactoring - mark dm-error with DM_TARGET_PASSES_INTEGRITY - dm-request-based: fix NULL pointer dereference and quiesce_depth out of sync - dm-linear: optimize REQ_PREFLUSH - dm-vdo: return error on corrupted metadata - dm-integrity: support asynchronous hash interface * tag 'for-6.18/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (27 commits) dm raid: use proper md_ro_state enumerators dm-integrity: prefer synchronous hash interface dm-integrity: enable asynchronous hash interface dm-integrity: rename internal_hash dm-integrity: add the "offset" argument dm-integrity: allocate the recalculate buffer with kmalloc dm-integrity: introduce integrity_kmap and integrity_kunmap dm-integrity: replace bvec_kmap_local with kmap_local_page dm-integrity: use internal variable for digestsize dm vdo: return error on corrupted metadata in start_restoring_volume functions dm vdo: Update code to use mem_is_zero dm: optimize REQ_PREFLUSH with data when using the linear target dm-pcache: use int type to store negative error codes dm: fix "writen"->"written" dm-pcache: cleanup: fix coding style report by checkpatch.pl dm-pcache: remove ctrl_lock for pcache_cache_segment dm: fix NULL pointer dereference in __dm_suspend() dm: fix queue start/stop imbalance under suspend/load/resume races dm-pcache: add persistent cache target in device-mapper dm error: mark as DM_TARGET_PASSES_INTEGRITY ...
2025-10-02Merge tag 'for-6.18/block-20250929' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux Pull block updates from Jens Axboe: - NVMe pull request via Keith: - FC target fixes (Daniel) - Authentication fixes and updates (Martin, Chris) - Admin controller handling (Kamaljit) - Target lockdep assertions (Max) - Keep-alive updates for discovery (Alastair) - Suspend quirk (Georg) - MD pull request via Yu: - Add support for a lockless bitmap. A key feature for the new bitmap are that the IO fastpath is lockless. If a user issues lots of write IO to the same bitmap bit in a short time, only the first write has additional overhead to update bitmap bit, no additional overhead for the following writes. By supporting only resync or recover written data, means in the case creating new array or replacing with a new disk, there is no need to do a full disk resync/recovery. - Switch ->getgeo() and ->bios_param() to using struct gendisk rather than struct block_device. - Rust block changes via Andreas. This series adds configuration via configfs and remote completion to the rnull driver. The series also includes a set of changes to the rust block device driver API: a few cleanup patches, and a few features supporting the rnull changes. The series removes the raw buffer formatting logic from `kernel::block` and improves the logic available in `kernel::string` to support the same use as the removed logic. - floppy arch cleanups - Reduce the number of dereferencing needed for ublk commands - Restrict supported sockets for nbd. Mostly done to eliminate a class of issues perpetually reported by syzbot, by using nonsensical socket setups. - A few s390 dasd block fixes - Fix a few issues around atomic writes - Improve DMA interation for integrity requests - Improve how iovecs are treated with regards to O_DIRECT aligment constraints. We used to require each segment to adhere to the constraints, now only the request as a whole needs to. - Clean up and improve p2p support, enabling use of p2p for metadata payloads - Improve locking of request lookup, using SRCU where appropriate - Use page references properly for brd, avoiding very long RCU sections - Fix ordering of recursively submitted IOs - Clean up and improve updating nr_requests for a live device - Various fixes and cleanups * tag 'for-6.18/block-20250929' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (164 commits) s390/dasd: enforce dma_alignment to ensure proper buffer validation s390/dasd: Return BLK_STS_INVAL for EINVAL from do_dasd_request ublk: remove redundant zone op check in ublk_setup_iod() nvme: Use non zero KATO for persistent discovery connections nvmet: add safety check for subsys lock nvme-core: use nvme_is_io_ctrl() for I/O controller check nvme-core: do ioccsz/iorcsz validation only for I/O controllers nvme-core: add method to check for an I/O controller blk-cgroup: fix possible deadlock while configuring policy blk-mq: fix null-ptr-deref in blk_mq_free_tags() from error path blk-mq: Fix more tag iteration function documentation selftests: ublk: fix behavior when fio is not installed ublk: don't access ublk_queue in ublk_unmap_io() ublk: pass ublk_io to __ublk_complete_rq() ublk: don't access ublk_queue in ublk_need_complete_req() ublk: don't access ublk_queue in ublk_check_commit_and_fetch() ublk: don't pass ublk_queue to ublk_fetch() ublk: don't access ublk_queue in ublk_config_io_buf() ublk: don't access ublk_queue in ublk_check_fetch_buf() ublk: pass q_id and tag to __ublk_check_and_get_req() ...
2025-09-29Merge tag 'dlm-6.18' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm Pull dlm updates from David Teigland: "This adds a dlm_release_lockspace() flag to request that node-failure recovery be performed for the node leaving the lockspace. The implementation of this flag requires coordination with userland clustering components. It's been requested for use by GFS2" * tag 'dlm-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm: dlm: check for undefined release_option values dlm: handle release_option as unsigned dlm: move to rinfo for all middle conversion cases dlm: handle invalid lockspace member remove dlm: add new flag DLM_RELEASE_RECOVER for dlm_lockspace_release dlm: add new configfs entry release_recover for lockspace members dlm: add new RELEASE_RECOVER uevent attribute for release_lockspace dlm: use defines for force values in dlm_release_lockspace dlm: check for defined force value in dlm_lockspace_release
2025-09-23dm raid: use proper md_ro_state enumeratorsHeinz Mauelshagen
The dm-raid code was using hardcoded integer values to represent the read-only/read-write state of RAID arrays instead of the proper enumeration constants defined in the md_ro_state enumerator type. Changes: - Replace hardcoded integers with the appropriate md_ro_state enumerator values - Add the missing MD_RDONLY setting in the post_suspend function (no failures have been attributed to this inconsistency, the fix ensures correct state transitions for completeness) This improves code clarity and maintainability by using the defined enumeration constants rather than magic numbers, ensuring the code properly conforms to the established API interface. Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-09-23dm-integrity: prefer synchronous hash interfaceMikulas Patocka
The previous patch preferred async interface for the purpose of testing. However, the synchronous interface is faster, so it should be preferred. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-09-23dm-integrity: enable asynchronous hash interfaceMikulas Patocka
This commit enables the asynchronous hash interface in dm-integrity. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-09-23dm-integrity: rename internal_hashMikulas Patocka
Rename "internal_hash" to "internal_shash" and introduce a boolean value "internal_hash". Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-09-23dm-integrity: add the "offset" argumentMikulas Patocka
Make sure that the "data" argument passed to integrity_sector_checksum is always page-aligned and add an "offset" argument that specifies the offset from the start of the page. This will enable us to use the asynchronous hash interface later. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-09-23dm-integrity: allocate the recalculate buffer with kmallocMikulas Patocka
Allocate the recalculate buffer with kmalloc rather than vmalloc. This will be needed later, for the simplification of the asynchronous hash interface. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-09-23dm-integrity: introduce integrity_kmap and integrity_kunmapMikulas Patocka
This abstraction will be used later, for the asynchronous hash interface. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-09-23dm-integrity: replace bvec_kmap_local with kmap_local_pageMikulas Patocka
Replace bvec_kmap_local with kmap_local_page - it will be needed for the upcoming patches that make kmap_local_page optional, depending on whether asynchronous hash interface is used or not. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-09-23dm-integrity: use internal variable for digestsizeMikulas Patocka
Instead of calling digestsize() each time the digestsize for the internal hash is needed, store the digestsize in a new field internal_hash_digestsize within struct dm_integrity_c once and use this value when needed. Signed-off-by: Harald Freudenberger <freude@linux.ibm.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-09-23dm vdo: return error on corrupted metadata in start_restoring_volume functionsIvan Abramov
The return values of VDO_ASSERT calls that validate metadata are not acted upon. Return UDS_CORRUPT_DATA in case of an error. Found by Linux Verification Center (linuxtesting.org) with SVACE. Fixes: a4eb7e255517 ("dm vdo: implement the volume index") Signed-off-by: Ivan Abramov <i.abramov@mt-integration.ru> Reviewed-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-09-23dm vdo: Update code to use mem_is_zeroBruce Johnston
Remove function that would check if data was all zeroes. Use the built-in kernel function mem_is_zero() instead. Signed-off-by: Bruce Johnston <bjohnsto@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-09-19Merge tag 'block-6.17-20250918' of git://git.kernel.dk/linuxLinus Torvalds
Pull block fixes from Jens Axboe: "A set of fixes for an issue with md array assembly and drbd for devices supporting write zeros" * tag 'block-6.17-20250918' of git://git.kernel.dk/linux: drbd: init queue_limits->max_hw_wzeroes_unmap_sectors parameter md: init queue_limits->max_hw_wzeroes_unmap_sectors parameter
2025-09-18dm: optimize REQ_PREFLUSH with data when using the linear targetMikulas Patocka
If the table has only linear targets and there is just one underlying device, we can optimize REQ_PREFLUSH with data - we don't have to split it to two bios - a flush and a write. We can pass it to the linear target directly. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Tested-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Bart Van Assche <bvanassche@acm.org>
2025-09-17Merge tag 'for-6.17/dm-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper fixes from Mikulas Patocka: - fix integer overflow in dm-stripe - limit tag size in dm-integrity to 255 bytes - fix 'alignment inconsistency' warning in dm-raid * tag 'for-6.17/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm-raid: don't set io_min and io_opt for raid1 dm-integrity: limit MAX_TAG_SIZE to 255 dm-stripe: fix a possible integer overflow
2025-09-17dm-raid: don't set io_min and io_opt for raid1Mikulas Patocka
These commands modprobe brd rd_size=1048576 vgcreate vg /dev/ram* lvcreate -m4 -L10 -n lv vg trigger the following warnings: device-mapper: table: 252:10: adding target device (start sect 0 len 24576) caused an alignment inconsistency device-mapper: table: 252:10: adding target device (start sect 0 len 24576) caused an alignment inconsistency The warnings are caused by the fact that io_min is 512 and physical block size is 4096. If there's chunk-less raid, such as raid1, io_min shouldn't be set to zero because it would be raised to 512 and it would trigger the warning. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Cc: stable@vger.kernel.org
2025-09-17md: init queue_limits->max_hw_wzeroes_unmap_sectors parameterZhang Yi
The parameter max_hw_wzeroes_unmap_sectors in queue_limits should be equal to max_write_zeroes_sectors if it is set to a non-zero value. However, the stacked md drivers call md_init_stacking_limits() to initialize this parameter to UINT_MAX but only adjust max_write_zeroes_sectors when setting limits. Therefore, this discrepancy triggers a value check failure in blk_validate_limits(). $ modprobe scsi_debug num_parts=2 dev_size_mb=8 lbprz=1 lbpws=1 $ mdadm --create /dev/md0 --level=0 --raid-device=2 /dev/sda1 /dev/sda2 mdadm: Defaulting to version 1.2 metadata mdadm: RUN_ARRAY failed: Invalid argument Fix this failure by explicitly setting max_hw_wzeroes_unmap_sectors to max_write_zeroes_sectors. Since the linear and raid0 drivers support write zeroes, so they can support unmap write zeroes operation if all of the backend devices support it. However, the raid1/10/5 drivers don't support write zeroes, so we have to set it to zero. Fixes: 0c40d7cb5ef3 ("block: introduce max_{hw|user}_wzeroes_unmap_sectors to queue limits") Reported-by: John Garry <john.g.garry@oracle.com> Closes: https://lore.kernel.org/linux-block/803a2183-a0bb-4b7a-92f1-afc5097630d2@oracle.com/ Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Tested-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Li Nan <linan122@huawei.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/linux-raid/20250910111107.3247530-2-yi.zhang@huaweicloud.com Signed-off-by: Yu Kuai <yukuai3@huawei.com>
2025-09-10md/md-llbitmap: Use DIV_ROUND_UP_SECTOR_TNathan Chancellor
When building for 32-bit platforms, there are several link (if builtin) or modpost (if a module) errors due to dividends of type 'sector_t' in DIV_ROUND_UP: arm-linux-gnueabi-ld: drivers/md/md-llbitmap.o: in function `llbitmap_resize': drivers/md/md-llbitmap.c:1017:(.text+0xae8): undefined reference to `__aeabi_uldivmod' arm-linux-gnueabi-ld: drivers/md/md-llbitmap.c:1020:(.text+0xb10): undefined reference to `__aeabi_uldivmod' arm-linux-gnueabi-ld: drivers/md/md-llbitmap.o: in function `llbitmap_end_discard': drivers/md/md-llbitmap.c:1114:(.text+0xf14): undefined reference to `__aeabi_uldivmod' arm-linux-gnueabi-ld: drivers/md/md-llbitmap.o: in function `llbitmap_start_discard': drivers/md/md-llbitmap.c:1097:(.text+0x1808): undefined reference to `__aeabi_uldivmod' arm-linux-gnueabi-ld: drivers/md/md-llbitmap.o: in function `llbitmap_read_sb': drivers/md/md-llbitmap.c:867:(.text+0x2080): undefined reference to `__aeabi_uldivmod' arm-linux-gnueabi-ld: drivers/md/md-llbitmap.o:drivers/md/md-llbitmap.c:895: more undefined references to `__aeabi_uldivmod' follow Use DIV_ROUND_UP_SECTOR_T instead of DIV_ROUND_UP, which exists to handle this exact situation. Fixes: 5ab829f1971d ("md/md-llbitmap: introduce new lockless bitmap") Signed-off-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-10md/raid0: convert raid0_make_request() to use bio_submit_split_bioset()Yu Kuai
Currently, raid0_make_request() will remap the original bio to underlying disks to prevent reordered IO. Now that bio_submit_split_bioset() will put original bio to the head of current->bio_list, it's safe converting to use this helper and bio will still be ordered. CC: Jan Kara <jack@suse.cz> Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-10md/md-linear: convert to use bio_submit_split_bioset()Yu Kuai
Unify bio split code, prepare to fix reordered split IO. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-10md/raid5: convert to use bio_submit_split_bioset()Yu Kuai
Unify bio split code, prepare to fix ordering of split IO. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-10md/raid10: convert read/write to use bio_submit_split_bioset()Yu Kuai
Unify bio split code, prepare to fix ordering of split IO, the error path is modified a bit, however no functional changes are intended: - bio_submit_split_bioset() can fail the original bio directly by split error, set R10BIO_Uptodate in this case to notify raid_end_bio_io() that the original bio is returned already. - set R10BIO_Uptodate and set error value to -EIO is useless now, for r10_bio without R10BIO_Uptodate, -EIO will be returned for original bio. And discard is not handled, because discard is only split for unaligned head and tail, and this can be considered slow path, the reorder here does not matter much. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-10md/raid10: add a new r10bio flag R10BIO_ReturnedYu Kuai
The new helper bio_submit_split_bioset() can failed the orginal bio on split errors, prepare to handle this case in raid_end_bio_io(). The flag name is refer to the r1bio flag name. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-10md/raid1: convert to use bio_submit_split_bioset()Yu Kuai
Unify bio split code, and prepare to fix ordering of split IO. Noted that bio_submit_split_bioset() can fail the original bio directly by split error, set R1BIO_Returned in this case to notify raid_end_bio_io() that the original bio is returned already. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-10md/raid0: convert raid0_handle_discard() to use bio_submit_split_bioset()Yu Kuai
Unify bio split code, and prepare to fix ordering of split IO Noted commit 319ff40a5427 ("md/raid0: Fix performance regression for large sequential writes") already fix ordering of split IO by remapping bio to underlying disks before resubmitting it, with the respect md_submit_bio() already split it by sectors, and raid0_make_request() will split at most once for unaligned IO. This is a bit hacky and we'll convert this to solution in general later. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-10md: fix mssing blktrace bio split eventsYu Kuai
If bio is split by internal handling like chunksize or badblocks, the corresponding trace_block_split() is missing, resulting in blktrace inability to catch BIO split events and making it harder to analyze the BIO sequence. Cc: stable@vger.kernel.org Fixes: 4b1faf931650 ("block: Kill bio_pair_split()") Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-09Merge tag 'md-6.18-20250909' of ↵Jens Axboe
gitolite.kernel.org:pub/scm/linux/kernel/git/mdraid/linux into for-6.18/block Pull MD changes from Yu Kuai: "Redundant data is used to enhance data fault tolerance, and the storage method for redundant data vary depending on the RAID levels. And it's important to maintain the consistency of redundant data. Bitmap is used to record which data blocks have been synchronized and which ones need to be resynchronized or recovered. Each bit in the bitmap represents a segment of data in the array. When a bit is set, it indicates that the multiple redundant copies of that data segment may not be consistent. Data synchronization can be performed based on the bitmap after power failure or readding a disk. If there is no bitmap, a full disk synchronization is required. Due to known performance issues with md-bitmap and the unreasonable implementations: - self-managed IO submitting like filemap_write_page(); - global spin_lock I have decided not to continue optimizing based on the current bitmap implementation, this new bitmap is invented without locking from IO fast path and can be used with fast disks. Key features for the new bitmap: - IO fastpath is lockless, if user issues lots of write IO to the same bitmap bit in a short time, only the first write has additional overhead to update bitmap bit, no additional overhead for the following writes; - support only resync or recover written data, means in the case creating new array or replacing with a new disk, there is no need to do a full disk resync/recovery;" * tag 'md-6.18-20250909' of gitolite.kernel.org:pub/scm/linux/kernel/git/mdraid/linux: (24 commits) md/md-llbitmap: introduce new lockless bitmap md/md-bitmap: make method bitmap_ops->daemon_work optional md: add a new recovery_flag MD_RECOVERY_LAZY_RECOVER md/md-bitmap: add a new method blocks_synced() in bitmap_operations md/md-bitmap: add a new method skip_sync_blocks() in bitmap_operations md/md-bitmap: delay registration of bitmap_ops until creating bitmap md/md-bitmap: add a new sysfs api bitmap_type md: add a new mddev field 'bitmap_id' md/md-bitmap: support discard for bitmap ops md: factor out a helper raid_is_456() md: add a new parameter 'offset' to md_super_write() md/md-bitmap: introduce CONFIG_MD_BITMAP md: check before referencing mddev->bitmap_ops md/dm-raid: check before referencing mddev->bitmap_ops md/raid5: check before referencing mddev->bitmap_ops md/raid10: check before referencing mddev->bitmap_ops md/raid1: check before referencing mddev->bitmap_ops md/raid1: check bitmap before behind write md/md-bitmap: handle the case bitmap is not enabled before end_sync() md/md-bitmap: handle the case bitmap is not enabled before start_sync() ...
2025-09-09block: remove the bi_inline_vecs variable sized array from struct bioChristoph Hellwig
Bios are embedded into other structures, and at least spare is unhappy about embedding structures with variable sized arrays. There's no real need to the array anyway, we can replace it with a helper pointing to the memory just behind the bio, and with the previous cleanups there is very few site doing anything special with it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: John Garry <john.g.garry@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-09block: add a bio_init_inline helperChristoph Hellwig
Just a simpler wrapper around bio_init for callers that want to initialize a bio with inline bvecs. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-08dm-integrity: limit MAX_TAG_SIZE to 255Mikulas Patocka
MAX_TAG_SIZE was 0x1a8 and it may be truncated in the "bi->metadata_size = ic->tag_size" assignment. We need to limit it to 255. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-09-06md/md-llbitmap: introduce new lockless bitmapYu Kuai
Redundant data is used to enhance data fault tolerance, and the storage method for redundant data vary depending on the RAID levels. And it's important to maintain the consistency of redundant data. Bitmap is used to record which data blocks have been synchronized and which ones need to be resynchronized or recovered. Each bit in the bitmap represents a segment of data in the array. When a bit is set, it indicates that the multiple redundant copies of that data segment may not be consistent. Data synchronization can be performed based on the bitmap after power failure or readding a disk. If there is no bitmap, a full disk synchronization is required. Due to known performance issues with md-bitmap and the unreasonable implementations: - self-managed IO submitting like filemap_write_page(); - global spin_lock I have decided not to continue optimizing based on the current bitmap implementation, this new bitmap is invented without locking from IO fast path and can be used with fast disks. For designs and details, see the comments in drivers/md-llbitmap.c. Link: https://lore.kernel.org/linux-raid/20250829080426.1441678-12-yukuai1@huaweicloud.com Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Li Nan <linan122@huawei.com>