2026-04-07  btrfs: stop printing condition result in assertion failure messages  (Filipe Manana)
It's useless to print the result of the condition, it's always 0 if the assertion is triggered, so it doesn't provide any useful information. Examples:

  assertion failed: cb->bbio.bio.bi_iter.bi_size == disk_num_bytes :: 0, in inode.c:9991
  assertion failed: folio_test_writeback(folio) :: 0, in subpage.c:476

So stop printing that, it's always ":: 0" for any assertion triggered (except for conditions that are just an identifier).

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
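As a userspace sketch (the real macro lives in the btrfs sources and its exact shape is not shown in this message), the trimmed message format could look like the following. The helper name is hypothetical:

```c
#include <stdio.h>

/* Hypothetical model of the trimmed assertion message: the ":: 0"
 * suffix is dropped, since a triggered assertion's condition always
 * evaluated to 0 anyway and carries no information. */
static int format_assert_msg(char *buf, unsigned long len,
			     const char *cond_str, const char *file,
			     int line)
{
	return snprintf(buf, len, "assertion failed: %s, in %s:%d",
			cond_str, file, line);
}
```

The caller would pass the stringified condition, so the only information lost relative to the old format is the constant ":: 0".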
2026-04-07  btrfs: constify arguments of some functions  (Filipe Manana)
There are several functions that take pointer arguments but don't need to modify the objects they point to, so add the const qualifiers. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07  btrfs: avoid unnecessary root node COW during snapshotting  (Filipe Manana)
There's no need to COW the root node of the subvolume we are snapshotting because we then call btrfs_copy_root(), which creates a copy of the root node and sets its generation to the current transaction. So remove this redundant COW right before calling btrfs_copy_root(), saving one extent allocation, memory allocation, copying things, etc, and making the code less confusing. Also rename the extent buffer variable from "old" to "root_eb" since that name no longer makes any sense after removing the unnecessary COW operation. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07  btrfs: check snapshot_force_cow earlier in can_nocow_file_extent()  (Chen Guan Jie)
When a snapshot is being created, the atomic counter snapshot_force_cow is incremented to force incoming writes to fall back to COW. This is a critical mechanism to protect the consistency of the snapshot being taken.

Currently, can_nocow_file_extent() checks this counter only after performing several checks, most notably the expensive cross-reference check via btrfs_cross_ref_exist(). btrfs_cross_ref_exist() releases the path and performs a search in the extent tree or backref cache, which involves btree traversals and locking overhead.

Move the snapshot_force_cow check to the very beginning of can_nocow_file_extent(). This reordering is safe and beneficial because:

1. args->writeback_path is invariant for the duration of the call (set by the caller run_delalloc_nocow).
2. is_freespace_inode is a static property of the inode.
3. The state of snapshot_force_cow is driven by the btrfs_mksnapshot() process. Checking it earlier does not change the outcome of the NOCOW decision, but effectively prunes the expensive code path when a fallback to COW is inevitable.

By failing fast when a snapshot is pending, we avoid the unnecessary overhead of btrfs_cross_ref_exist() and other extent item checks in the scenario where NOCOW is already known to be impossible.

Signed-off-by: Chen Guan Jie <jk.chen1095@gmail.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
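The fail-fast reordering can be sketched as a simplified userspace model (struct fields and helper names here are illustrative, not the kernel's actual signatures):

```c
#include <stdbool.h>

/* Illustrative model of the reordered checks: consult the cheap
 * atomic counter first, before any expensive cross-reference
 * lookup in the extent tree or backref cache. */
struct inode_model {
	int snapshot_force_cow;   /* models root->snapshot_force_cow */
	bool is_freespace_inode;
	bool writeback_path;
};

/* Stand-in for the expensive btrfs_cross_ref_exist() search. */
static bool cross_ref_exists(const struct inode_model *inode)
{
	(void)inode;
	return false;
}

static bool can_nocow(const struct inode_model *inode)
{
	/* Fail fast: a pending snapshot forces COW, so skip the
	 * expensive checks entirely. */
	if (inode->writeback_path && !inode->is_freespace_inode &&
	    inode->snapshot_force_cow > 0)
		return false;

	if (cross_ref_exists(inode))
		return false;
	return true;
}
```

The outcome is identical either way; only the amount of work done before reaching the COW fallback changes.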
2026-04-07  btrfs: do not mark inode incompressible after inline attempt fails  (Qu Wenruo)
[BUG]
The following sequence will set the nocompress flag on the file:

  # mkfs.btrfs -f $dev
  # mount $dev $mnt -o max_inline=4,compress
  # xfs_io -f -c "pwrite 0 2k" -c sync $mnt/foobar

The inode will have the NOCOMPRESS flag, even if the content itself (all 0xcd) can still be compressed very well:

  item 4 key (257 INODE_ITEM 0) itemoff 15879 itemsize 160
    generation 9 transid 10 size 2097152 nbytes 1052672
    block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0
    sequence 257 flags 0x8(NOCOMPRESS)

Please note that this behavior exists even before commit 59615e2c1f63 ("btrfs: reject single block sized compression early").

[CAUSE]
At compress_file_range(), after the btrfs_compress_folios() call, we try making an inlined extent by calling cow_file_range_inline(). But cow_file_range_inline() calls can_cow_file_range_inline(), which has more accurate checks on whether the range can be inlined.

One of the user configurable conditions is the "max_inline=" mount option. If that value is set low (like the example, 4 bytes, which cannot store any header), or the compressed content is just slightly larger than 2K (the default value, meaning a 50% compression ratio), cow_file_range_inline() will return 1 immediately.

And since we're here only to try inlining the compressed data, the range is no larger than a single fs block. Thus compression is never going to make it a win, and we unavoidably fall back to marking the inode incompressible.

[FIX]
Add an extra check after the inline attempt, so that if the inline attempt failed, we do not set the nocompress flag, as there is no way to remove that flag and the default 50% compression ratio is way too strict for the whole inode.

CC: stable@vger.kernel.org # 6.12+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
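The decision the fix changes can be modeled as a tiny predicate (hypothetical helper name; the real logic is spread across compress_file_range()):

```c
#include <stdbool.h>

/* Hypothetical model: decide whether to set the NOCOMPRESS inode
 * flag after a compression attempt.  If we only compressed the
 * range to try an inline extent and that inline attempt failed,
 * the range was at most one block, so the failure says nothing
 * about how compressible the rest of the inode is. */
static bool should_mark_nocompress(bool compression_won,
				   bool inline_only_attempt,
				   bool inline_failed)
{
	if (inline_only_attempt && inline_failed)
		return false;	/* the fix: don't punish the inode */
	return !compression_won;
}
```

A genuine compression loss on a larger range still sets the flag; only the inline-attempt failure path stops doing so.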
2026-04-07  btrfs: remove folio parameter from ordered io related functions  (Qu Wenruo)
Both btrfs_finish_ordered_extent() and btrfs_mark_ordered_io_finished() accept an optional folio parameter. That @folio is passed into can_finish_ordered_extent(), which later will test and clear the ordered flag for the involved range.

However I do not think there is any other call site that can clear the ordered flags of a page cache folio and affect can_finish_ordered_extent(). There are a limited number of *_clear_ordered() callers outside of the can_finish_ordered_extent() function:

- btrfs_migrate_folio()
  This is completely unrelated, it's just migrating the ordered flag to the new folio.

- btrfs_cleanup_ordered_extents()
  We manually clear the ordered flags of all involved folios, then call btrfs_mark_ordered_io_finished() without a @folio parameter. So it doesn't need and didn't pass a @folio parameter in the first place.

- btrfs_writepage_fixup_worker()
  This function is going to be removed soon, and we should not hit that function anymore.

- btrfs_invalidate_folio()
  This is the real call site we need to bother with. If we already have a bio running, btrfs_finish_ordered_extent() in end_bbio_data_write() will be executed first, as btrfs_invalidate_folio() will wait for the writeback to finish. Thus if there is a running bio, it will not see the range as having ordered flags, and will just skip to the next range.
  If there is no bio running, meaning the ordered extent is created but the folio is not yet submitted, then btrfs_invalidate_folio() will manually clear the folio ordered range, but then manually finish the ordered extent with btrfs_dec_test_ordered_pending() without touching the folio ordered flags. Meaning an OE range with folio ordered flags set will be finished manually, without the need to call can_finish_ordered_extent().

This means all can_finish_ordered_extent() call sites should get a range that has the folio ordered flag set, thus the old "return false" branch should never be triggered.
Now we can:

- Remove the @folio parameter from the involved functions:
  * btrfs_mark_ordered_io_finished()
  * btrfs_finish_ordered_extent()
  For call sites passing a @folio into those functions, let them manually clear the ordered flag of the involved folios.

- Move btrfs_finish_ordered_extent() out of the loop in end_bbio_data_write()
  We only need to call btrfs_finish_ordered_extent() once per bbio, not per folio.

- Add an ASSERT() to make sure all folio ranges have ordered flags
  It's only for end_bbio_data_write(). And we already have enough safety nets to catch over-accounting of ordered extents.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
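The loop restructure can be sketched with counters in a userspace model (all names here are illustrative stand-ins, not the kernel functions):

```c
/* Userspace sketch of the restructured end_bbio_data_write():
 * ordered-flag clearing stays per folio, while finishing the
 * ordered extent moves out of the loop so it runs once per bbio. */
static int clear_calls;
static int finish_calls;

static void folio_clear_ordered_model(void)
{
	clear_calls++;			/* per-folio work */
}

static void btrfs_finish_ordered_extent_model(void)
{
	finish_calls++;			/* per-bbio work */
}

static void end_bbio_data_write_model(int nr_folios)
{
	for (int i = 0; i < nr_folios; i++)
		folio_clear_ordered_model();	/* inside the loop */
	btrfs_finish_ordered_extent_model();	/* moved out: once */
}
```

With the folio parameter gone, the per-bbio call no longer needs to be repeated for every folio just to clear flags.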
2026-04-07  btrfs: remove the btrfs_inode parameter from btrfs_remove_ordered_extent()  (Qu Wenruo)
We already have btrfs_ordered_extent::inode, thus there is no need to pass a btrfs_inode parameter to btrfs_remove_ordered_extent(). Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07  btrfs: remove out-of-date comments in btree_writepages()  (Qu Wenruo)
There is a lengthy comment introduced in commit b3ff8f1d380e ("btrfs: Don't submit any btree write bio if the fs has errors") and commit c9583ada8cc4 ("btrfs: avoid double clean up when submit_one_bio() failed"), explaining two things:

- Why we don't want to submit metadata writes if the fs has errors
- Why we re-set @ret to 0 if it's positive

However it's no longer up to date, for the following reasons:

- We have better checks nowadays
  Commit 2618849f31e7 ("btrfs: ensure no dirty metadata is written back for an fs with errors") introduced better checks, so that if the fs is in an error state, metadata writes will not result in any bio but instead complete immediately. That covers all metadata writes better.

- It mentions an incorrect function name
  Commit c9583ada8cc4 ("btrfs: avoid double clean up when submit_one_bio() failed") introduced this ret > 0 handling, but at that time the function name submit_extent_page() was already incorrect. It was submit_eb_page() that could return >0 at that time; submit_extent_page() could only return 0 or <0 for errors, never >0.
  Later commit b35397d1d325 ("btrfs: convert submit_extent_page() to use a folio") changed "submit_extent_page()" to "submit_extent_folio()" in the comment, but that doesn't make any difference since the function name has been wrong from day 1.
  Finally commit 5e121ae687b8 ("btrfs: use buffer xarray for extent buffer writeback operations") completely reworked how metadata writeback works and removed submit_eb_page(), leaving only the wrong function name in the comment.
  Furthermore, the function submit_extent_folio() still exists in the latest code base, but is never used for metadata writeback, causing more confusion.

Just remove the lengthy comment, and replace the "if (ret > 0)" check with an ASSERT(), since only btrfs_check_meta_write_pointer() can modify @ret and it returns 0 or <0 for errors.
Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07  btrfs: remove bogus root search condition in load_extent_tree_free()  (Filipe Manana)
There's no need to pass the maximum between the block group's start offset and BTRFS_SUPER_INFO_OFFSET (64K), since we can't have any block groups allocated in the first megabyte, as that's reserved space. Furthermore, even if we could, the correct thing to do would be to pass the block group's start offset anyway - and that's precisely what we do for block groups that happen to contain a superblock mirror (the range for the super block is never marked as free and it's marked as dirty in the fs_info->excluded_extents io tree). So simplify this and get rid of that maximum expression.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07  btrfs: remove duplicate include of delayed-inode.h in disk-io.c  (Chen Ni)
Remove duplicate inclusion of delayed-inode.h in disk-io.c to clean up redundant code. Signed-off-by: Chen Ni <nichen@iscas.ac.cn> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07  btrfs: pass literal booleans to functions that take boolean arguments  (Filipe Manana)
We have several functions with parameters defined as booleans but then we have callers passing integers, 0 or 1, instead of false and true. While this isn't a bug since 0 and 1 are converted to false and true, it is odd and less readable. Change the callers to pass true and false literals instead. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07  btrfs: remove pointless out label in qgroup_account_snapshot()  (Filipe Manana)
The 'out' label is pointless as there are no cleanups to perform there, we can replace every goto with a direct return. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07  btrfs: rename btrfs_csum_file_blocks() to btrfs_insert_data_csums()  (Qu Wenruo)
The function btrfs_csum_file_blocks() is a little confusing, unlike btrfs_csum_one_bio(), it is not calculating the checksum of some file blocks. Instead it's just inserting the already calculated checksums into a given root (can be a csum root or a log tree). So rename it to btrfs_insert_data_csums() to reflect its behavior better. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07  btrfs: make add_pending_csums() take an ordered extent as a parameter  (Qu Wenruo)
The structure btrfs_ordered_extent has a lot of list heads for different purposes, and passing a raw list_head pointer is never a good idea: if the wrong list is passed in, the type casting goes wrong and the fs will be screwed up. Instead pass the btrfs_ordered_extent pointer, and grab the csum_list inside add_pending_csums() to make it a little safer. Since we're here, also update the comments to follow the current style.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
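The hazard being removed can be illustrated with simplified structs (the `_model` types below are illustrative, not the kernel's definitions): with a raw list_head parameter, nothing stops a caller from passing `&ordered->log_list` where `csum_list` was meant, and `container_of()` on the wrong member would silently compute a bogus pointer.

```c
/* Illustrative structs modeling the safer interface: the function
 * takes the ordered extent and picks the right list internally,
 * so callers can no longer hand in the wrong list head. */
struct list_head_model { struct list_head_model *next, *prev; };

struct ordered_extent_model {
	struct list_head_model csum_list;  /* pending checksums */
	struct list_head_model log_list;   /* an unrelated list */
	int num_csums;
};

static int add_pending_csums_model(struct ordered_extent_model *ordered)
{
	struct list_head_model *csums = &ordered->csum_list;

	/* ... walk csums and insert each checksum item ... */
	(void)csums;
	return ordered->num_csums;
}
```

Moving the member selection inside the callee turns a runtime corruption hazard into something the compiler enforces by type.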
2026-04-07  btrfs: rename btrfs_ordered_extent::list to csum_list  (Qu Wenruo)
That list head records all pending checksums for that ordered extent. And unlike other lists, we just use the name "list", which can be very confusing for readers. Rename it to "csum_list" which follows the remaining lists, showing the purpose of the list. And since we're here, remove a comment inside btrfs_finish_ordered_zoned() where we have "ASSERT(!list_empty(&ordered->csum_list))" to make sure the OE has pending csums. That comment is only here to make sure we do not call list_first_entry() before checking BTRFS_ORDERED_PREALLOC. But since we already have that bit checked and even have a dedicated ASSERT(), there is no need for that comment anymore. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07  btrfs: change return type of cache_save_setup to void  (Johannes Thumshirn)
None of the callers of `cache_save_setup` care about the return value, as the space cache is purely an optimization. Also, the free space cache is a deprecated feature that is being phased out. Change the return type of `cache_save_setup` to void to reflect this.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07  btrfs: avoid starting new transaction and commit in relocate_block_group()  (Filipe Manana)
We join a transaction with the goal of catching the current transaction and then commit it to get rid of pinned extents and reclaim free space, but a join can create a new transaction if there isn't any running, and if right before we did the join the current transaction happened to be committed by someone else (like the transaction kthread for example), we end up starting and committing a new transaction, causing rotation of the super block backup roots besides extra and useless IO. So instead of doing a transaction join followed by a commit, use the helper btrfs_commit_current_transaction() which ensures no transaction is created if there isn't any running. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
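The behavioral difference between join-then-commit and the helper can be modeled in a few lines (a toy userspace model; the names mirror the btrfs helpers but the logic is heavily simplified):

```c
#include <stdbool.h>

/* Toy model: join_transaction() starts a new transaction when none
 * is running, while commit_current_transaction() never creates one. */
struct fs_model {
	bool running;	/* is a transaction currently open? */
	int started;	/* how many transactions were created */
};

static void join_transaction(struct fs_model *fs)
{
	if (!fs->running) {	/* a join creates one if needed */
		fs->running = true;
		fs->started++;
	}
}

static void commit_transaction(struct fs_model *fs)
{
	fs->running = false;
}

static void commit_current_transaction(struct fs_model *fs)
{
	if (fs->running)	/* no-op when nothing is running */
		commit_transaction(fs);
}
```

In the race described above, the join path would bump `started` even though the previous transaction was just committed; the helper leaves it untouched, avoiding the useless commit and the backup root rotation.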
2026-04-07  btrfs: remove redundant extent_buffer_uptodate() checks after read_tree_block()  (Filipe Manana)
We have several places that call extent_buffer_uptodate() after reading a tree block with read_tree_block(), but that is redundant since we already call extent_buffer_uptodate() in the call chain of read_tree_block(): read_tree_block() btrfs_read_extent_buffer() read_extent_buffer_pages() returns -EIO if extent_buffer_uptodate() returns false So remove those redundant checks. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07  btrfs: use the helper extent_buffer_uptodate() everywhere  (Filipe Manana)
Instead of open coding testing the uptodate bit on the extent buffer's flags, use the existing helper extent_buffer_uptodate() (which is even shorter to type). Also change the helper's return value from int to bool, since we always use it in a boolean context. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07  btrfs: zoned: add zone reclaim flush state for DATA space_info  (Johannes Thumshirn)
On zoned block devices, DATA block groups can accumulate large amounts of zone_unusable space (space between the write pointer and zone end). When zone_unusable reaches high levels (e.g., 98% of total space), new allocations fail with ENOSPC even though space could be reclaimed by relocating data and resetting zones. The existing flush states don't handle this scenario effectively - they either try to free cached space (which doesn't exist for zone_unusable) or reset empty zones (which doesn't help when zones contain valid data mixed with zone_unusable space). Add a new RECLAIM_ZONES flush state that triggers the block group reclaim machinery. This state: - Calls btrfs_reclaim_sweep() to identify reclaimable block groups - Calls btrfs_reclaim_bgs() to queue reclaim work - Waits for reclaim_bgs_work to complete via flush_work() - Commits the transaction to finalize changes The reclaim work (btrfs_reclaim_bgs_work) safely relocates valid data from fragmented block groups to other locations before resetting zones, converting zone_unusable space back into usable space. Insert RECLAIM_ZONES before RESET_ZONES in data_flush_states so that we attempt to reclaim partially-used block groups before falling back to resetting completely empty ones. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
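The ordering requirement on the flush states can be captured as a plain enum (the values here are illustrative; only the relative order of the two zone states reflects the description above):

```c
/* Illustrative ordering of the DATA flush states on zoned devices:
 * the new reclaim step comes before resetting empty zones, so
 * partially-used block groups are reclaimed first. */
enum data_flush_state_model {
	MODEL_FLUSH_DELALLOC,
	MODEL_RECLAIM_ZONES,	/* new: relocate data, then reset zones */
	MODEL_RESET_ZONES,	/* existing: reset fully empty zones */
};
```

Because the flush machinery walks states in ascending order, placing RECLAIM_ZONES before RESET_ZONES is what guarantees the "reclaim partially-used before resetting empty" policy.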
2026-04-07  btrfs: zoned: move partially zone_unusable block groups to reclaim list  (Johannes Thumshirn)
On zoned block devices, block groups accumulate zone_unusable space (space between the write pointer and zone end that cannot be allocated until the zone is reset). When a block group becomes mostly zone_unusable but still contains some valid data and it gets added to the unused_bgs list it can never be deleted because it's not actually empty. The deletion code (btrfs_delete_unused_bgs) skips these block groups due to the btrfs_is_block_group_used() check, leaving them on the unused_bgs list indefinitely. This causes two problems: 1. The block groups are never reclaimed, permanently wasting space 2. Eventually leads to ENOSPC even though reclaimable space exists Fix by detecting block groups where zone_unusable exceeds 50% of the block group size. Move these to the reclaim_bgs list instead of skipping them. This triggers btrfs_reclaim_bgs_work() which: 1. Marks the block group read-only 2. Relocates the remaining valid data via btrfs_relocate_chunk() 3. Removes the emptied block group 4. Resets the zones, converting zone_unusable back to usable space The 50% threshold ensures we only reclaim block groups where most space is unusable, making relocation worthwhile. Block groups with less zone_unusable are left on unused_bgs to potentially become fully empty through normal deletion. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
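The 50% rule from the description can be written as a standalone predicate (a hypothetical helper, not the kernel function):

```c
#include <stdbool.h>
#include <stdint.h>

/* Sketch of the reclaim decision: reclaim only block groups that
 * still hold data (so normal deletion can't remove them) but whose
 * zone_unusable space exceeds half the block group size. */
static bool should_reclaim_zoned_bg(uint64_t bg_length,
				    uint64_t zone_unusable,
				    uint64_t used)
{
	if (used == 0)		/* empty: normal deletion handles it */
		return false;
	return zone_unusable * 2 > bg_length;
}
```

Block groups below the threshold stay on unused_bgs, where they may still become fully empty and be deleted without the cost of relocation.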
2026-04-07  btrfs: zoned: cap delayed refs metadata reservation to avoid overcommit  (Johannes Thumshirn)
On zoned filesystems metadata space accounting can become overly optimistic due to delayed refs reservations growing without a hard upper bound. The delayed_refs_rsv block reservation is allowed to speculatively grow and is only backed by actual metadata space when refilled. On zoned devices this can result in delayed_refs_rsv reserving a large portion of metadata space that is already effectively unusable due to zone write pointer constraints. As a result, space_info->may_use can grow far beyond the usable metadata capacity, causing the allocator to believe space is available when it is not. This leads to premature ENOSPC failures and "cannot satisfy tickets" reports even though commits would be able to make progress by flushing delayed refs. Analysis of "-o enospc_debug" dumps using a Python debug script confirmed that delayed_refs_rsv was responsible for the majority of metadata overcommit on zoned devices. By correlating space_info counters (total, used, may_use, zone_unusable) across transactions, the analysis showed that may_use continued to grow even after usable metadata space was exhausted, with delayed refs refills accounting for the excess reservations. 
Here's the output of the analysis:

  ======================================================================
  Space Type: METADATA
  ======================================================================
  Raw Values:
    Total:          256.00 MB (268435456 bytes)
    Used:           128.00 KB (131072 bytes)
    Pinned:          16.00 KB (16384 bytes)
    Reserved:       144.00 KB (147456 bytes)
    May Use:        255.48 MB (267894784 bytes)
    Zone Unusable:  192.00 KB (196608 bytes)

  Calculated Metrics:
    Actually Usable: 255.81 MB (total - zone_unusable)
    Committed:       255.77 MB (used + pinned + reserved + may_use)
    Consumed:        320.00 KB (used + zone_unusable)

  Percentages:
    Zone Unusable: 0.07% of total
    May Use:       99.80% of total

Fix this by adding a zoned-specific cap in btrfs_delayed_refs_rsv_refill(): before reserving additional metadata bytes, limit the delayed refs reservation based on the usable metadata space (total bytes minus zone_unusable). If the reservation would exceed this cap, return -EAGAIN to trigger the existing flush/commit logic instead of overcommitting metadata space.

This preserves the existing reservation and flushing semantics while preventing metadata overcommit on zoned devices. The change is limited to metadata space and does not affect non-zoned filesystems.

This patch addresses premature metadata ENOSPC conditions on zoned devices and ensures delayed refs are throttled before exhausting usable metadata.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
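The cap itself reduces to a simple bound check (a simplified userspace model of the described change, not btrfs_delayed_refs_rsv_refill() itself):

```c
#include <stdint.h>

#define MODEL_EAGAIN 11

/* Sketch of the zoned cap: refuse to grow the delayed refs
 * reservation past the usable metadata space (total minus
 * zone_unusable), returning -EAGAIN so the caller falls back to
 * the existing flush/commit logic instead of overcommitting. */
static int refill_delayed_refs_rsv_model(uint64_t total_bytes,
					 uint64_t zone_unusable,
					 uint64_t rsv_size,
					 uint64_t to_reserve)
{
	uint64_t usable = total_bytes - zone_unusable;

	if (rsv_size + to_reserve > usable)
		return -MODEL_EAGAIN;
	return 0;	/* reservation fits within usable space */
}
```

With the numbers from the dump above (256 MB total, 192 KB zone_unusable), a refill only fails once the reservation would cross roughly 255.81 MB, which is exactly the "Actually Usable" figure the analysis computed.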
2026-04-07  btrfs: remove duplicated eb uptodate check in btrfs_buffer_uptodate()  (Filipe Manana)
We are calling extent_buffer_uptodate() twice, and the result will not change before the second call. So remove the second call. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07  btrfs: fix the inline compressed extent check in inode_need_compress()  (Qu Wenruo)
[BUG]
Since commit 59615e2c1f63 ("btrfs: reject single block sized compression early"), the following script will result in the inode having the NOCOMPRESS flag, while old kernels don't set it:

  # mkfs.btrfs -f $dev
  # mount $dev $mnt -o max_inline=2k,compress=zstd
  # truncate -s 8k $mnt/foobar
  # xfs_io -f -c "pwrite 0 2k" $mnt/foobar
  # sync

Before that commit, the inode will not have the NOCOMPRESS flag:

  item 4 key (257 INODE_ITEM 0) itemoff 15879 itemsize 160
    generation 9 transid 9 size 8192 nbytes 4096
    block group 0 mode 100644 links 1 uid 0 gid 0 rdev 0
    sequence 3 flags 0x0(none)

But after that commit, the inode will have the NOCOMPRESS flag:

  item 4 key (257 INODE_ITEM 0) itemoff 15879 itemsize 160
    generation 9 transid 10 size 8192 nbytes 4096
    block group 0 mode 100644 links 1 uid 0 gid 0 rdev 0
    sequence 3 flags 0x8(NOCOMPRESS)

This will cause a lot of files to no longer be compressed.

[CAUSE]
The old compressed inline check looks like this:

  if (total_compressed <= blocksize &&
      (start > 0 || end + 1 < inode->disk_i_size))
          goto cleanup_and_bail_uncompressed;

That inline part check is equal to "!(start == 0 && end + 1 >= inode->disk_i_size)", but the new check no longer has that disk_i_size check. This means any single block sized write at file offset 0 will pass the inline check, which is wrong.

Furthermore, since we have merged the old check into inode_need_compress(), there is no disk_i_size based inline check anymore, so we will always try compressing that single block at file offset 0, then later find out it's not a net win and go to the mark_incompressible tag. This results in the inode having the NOCOMPRESS flag.

[FIX]
Add back the missing disk_i_size based check into inode_need_compress(). Now the same script no longer causes the NOCOMPRESS flag to be set.
Fixes: 59615e2c1f63 ("btrfs: reject single block sized compression early") Reported-by: Chris Mason <clm@meta.com> Link: https://lore.kernel.org/linux-btrfs/20260208183840.975975-1-clm@meta.com/ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07  btrfs: set written super flag once in write_all_supers()  (Filipe Manana)
In case we have multiple devices, there is no point in setting the written flag in the super block on every iteration over the device list. Just do it once before the loop. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07  btrfs: remove max_mirrors argument from write_all_supers()  (Filipe Manana)
There's no need to pass max_mirrors to write_all_supers(), since from the given transaction handle we can infer whether we are in a transaction commit or fsync context, so we can determine how many mirrors we need to use. So remove the max_mirrors argument from write_all_supers() and stop adjusting it in the callees write_dev_supers() and wait_dev_supers(), simplifying them as well as the parameter list of write_all_supers().

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07  btrfs: tag error branches as unlikely during super block writes  (Filipe Manana)
Mark all the unexpected error checks as unlikely, to make it more clear they are unexpected and to allow the compiler to potentially generate better code. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07  btrfs: abort transaction on error in write_all_supers()  (Filipe Manana)
We are in a transaction context and have a handle, so instead of using the less preferred btrfs_handle_fs_error(), abort the transaction and log an error to give some context information.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07  btrfs: pass transaction handle to write_all_supers()  (Filipe Manana)
We are holding a transaction in every context where we call write_all_supers(), so pass the transaction handle instead of fs_info to it. This will allow us to abort the transaction in write_all_supers() instead of calling btrfs_handle_fs_error() in a later patch.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07  btrfs: mark all error and warning checks as unlikely in btrfs_validate_super()  (Filipe Manana)
When validating a super block, either when mounting or every time we write a super block to disk, we do many checks for error and warnings and we don't expect to hit any. So mark each one as unlikely to reflect that and allow the compiler to potentially generate better code. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07  btrfs: update comment for BTRFS_RESERVE_NO_FLUSH  (Filipe Manana)
The comment is incomplete as BTRFS_RESERVE_NO_FLUSH is used for more reasons than currently holding a transaction handle open. Update the comment with all the other reasons and give some details. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07  btrfs: don't allow log trees to consume global reserve or overcommit metadata  (Filipe Manana)
For a fsync we never reserve space in advance, we just start a transaction without reserving space and we use an empty block reserve for a log tree. We reserve space as we need while updating a log tree, we end up in btrfs_use_block_rsv() when reserving space for the allocation of a log tree extent buffer and we attempt first to reserve without flushing, and if that fails we attempt to consume from the global reserve or overcommit metadata. This makes us consume space that may be the last resort for a transaction commit to succeed, therefore increasing the chances for a transaction abort with -ENOSPC. So make btrfs_use_block_rsv() fail if we can't reserve metadata space for a log tree extent buffer allocation without flushing, making the fsync fallback to a transaction commit and avoid using critical space that could be the only resort for a transaction commit to succeed when we are in a critical space situation. Reviewed-by: Leo Martins <loemra.dev@gmail.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
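The policy change can be sketched as a decision function (a simplified userspace model of the described btrfs_use_block_rsv() behavior, not the kernel code):

```c
#include <stdbool.h>

#define MODEL_ENOSPC 28

/* Sketch of the new policy: a log tree reservation either succeeds
 * without flushing or fails outright; it never falls through to
 * the global reserve or to metadata overcommit.  A failed log tree
 * reservation makes the fsync fall back to a transaction commit. */
static int use_block_rsv_model(bool is_log_tree,
			       bool no_flush_reserve_ok,
			       bool global_rsv_has_space)
{
	if (no_flush_reserve_ok)
		return 0;		/* normal success path */
	if (is_log_tree)
		return -MODEL_ENOSPC;	/* new: no last-resort space */
	if (global_rsv_has_space)
		return 0;		/* other trees may still dip in */
	return -MODEL_ENOSPC;
}
```

The point of the asymmetry is that a failed fsync reservation has a safe fallback (commit the transaction), while a failed reservation during the commit itself does not, so the global reserve is kept for the latter.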
2026-04-07  btrfs: be less aggressive with metadata overcommit when we can do full flushing  (Filipe Manana)
Over the years we often get reports of some -ENOSPC failure while updating metadata that leads to a transaction abort. I have seen this happen for filesystems of all sizes and with workloads that are very user/customer specific and that I was unable to reproduce, but Aleksandar recently reported a simple way to reproduce this with a 1G filesystem and using the bonnie++ benchmark tool. The following test script reproduces the failure:

  $ cat test.sh
  #!/bin/bash

  # Create and use a 1G null block device, memory backed, otherwise
  # the test takes a very long time.
  modprobe null_blk nr_devices="0"
  null_dev="/sys/kernel/config/nullb/nullb0"
  mkdir "$null_dev"
  size=$((1 * 1024)) # in MB
  echo 2 > "$null_dev/submit_queues"
  echo "$size" > "$null_dev/size"
  echo 1 > "$null_dev/memory_backed"
  echo 1 > "$null_dev/discard"
  echo 1 > "$null_dev/power"

  DEV=/dev/nullb0
  MNT=/mnt/nullb0

  mkfs.btrfs -f $DEV
  mount $DEV $MNT
  mkdir $MNT/test/
  bonnie++ -d $MNT/test/ -m BTRFS -u 0 -s 256M -r 128M -b
  umount $MNT

  echo 0 > "$null_dev/power"
  rmdir "$null_dev"

When running this, bonnie++ fails in the phase where it deletes test directories and files:

  $ ./test.sh
  (...)
  Using uid:0, gid:0.
  Writing a byte at a time...done
  Writing intelligently...done
  Rewriting...done
  Reading a byte at a time...done
  Reading intelligently...done
  start 'em...done...done...done...done...done...
  Create files in sequential order...done.
  Stat files in sequential order...done.
  Delete files in sequential order...done.
  Create files in random order...done.
  Stat files in random order...done.
  Delete files in random order...Can't sync directory, turning off dir-sync.
  Can't delete file 9Bq7sr0000000338
  Cleaning up test directory after error.
  Bonnie: drastic I/O error (rmdir): Read-only file system

And in the syslog/dmesg we can see the following transaction abort trace:

  [161915.501506] BTRFS warning (device nullb0): Skipping commit of aborted transaction.
[161915.502983] ------------[ cut here ]------------ [161915.503832] BTRFS: Transaction aborted (error -28) [161915.504748] WARNING: fs/btrfs/transaction.c:2045 at btrfs_commit_transaction+0xa21/0xd30 [btrfs], CPU#11: bonnie++/3377975 [161915.506786] Modules linked in: btrfs dm_zero dm_snapshot (...) [161915.518759] CPU: 11 UID: 0 PID: 3377975 Comm: bonnie++ Tainted: G W 6.19.0-rc7-btrfs-next-224+ #4 PREEMPT(full) [161915.520857] Tainted: [W]=WARN [161915.521405] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014 [161915.523414] RIP: 0010:btrfs_commit_transaction+0xa24/0xd30 [btrfs] [161915.524630] Code: 48 8b 7c 24 (...) [161915.526982] RSP: 0018:ffffd3fe8206fda8 EFLAGS: 00010292 [161915.527707] RAX: 0000000000000002 RBX: ffff8f4886d3c000 RCX: 0000000000000000 [161915.528723] RDX: 0000000002040001 RSI: 00000000ffffffe4 RDI: ffffffffc088f780 [161915.529691] RBP: ffff8f4f5adae7e0 R08: 0000000000000000 R09: ffffd3fe8206fb90 [161915.530842] R10: ffff8f4f9c1fffa8 R11: 0000000000000003 R12: 00000000ffffffe4 [161915.532027] R13: ffff8f4ef2cf2400 R14: ffff8f4f5adae708 R15: ffff8f4f62d18000 [161915.533229] FS: 00007ff93112a780(0000) GS:ffff8f4ff63ee000(0000) knlGS:0000000000000000 [161915.534611] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [161915.535575] CR2: 00005571b3072000 CR3: 0000000176080005 CR4: 0000000000370ef0 [161915.536758] Call Trace: [161915.537185] <TASK> [161915.537575] btrfs_sync_file+0x431/0x530 [btrfs] [161915.538473] do_fsync+0x39/0x80 [161915.539042] __x64_sys_fsync+0xf/0x20 [161915.539750] do_syscall_64+0x50/0xf20 [161915.540396] entry_SYSCALL_64_after_hwframe+0x76/0x7e [161915.541301] RIP: 0033:0x7ff930ca49ee [161915.541904] Code: 08 0f 85 f5 (...) 
[161915.544830] RSP: 002b:00007ffd94291f38 EFLAGS: 00000246 ORIG_RAX: 000000000000004a [161915.546152] RAX: ffffffffffffffda RBX: 00007ff93112a780 RCX: 00007ff930ca49ee [161915.547263] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000003 [161915.548383] RBP: 0000000000000dab R08: 0000000000000000 R09: 0000000000000000 [161915.549853] R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffd94291fb0 [161915.551196] R13: 00007ffd94292350 R14: 0000000000000001 R15: 00007ffd94292340 [161915.552161] </TASK> [161915.552457] ---[ end trace 0000000000000000 ]--- [161915.553232] BTRFS info (device nullb0 state A): dumping space info: [161915.553236] BTRFS info (device nullb0 state A): space_info DATA (sub-group id 0) has 12582912 free, is not full [161915.553239] BTRFS info (device nullb0 state A): space_info total=12582912, used=0, pinned=0, reserved=0, may_use=0, readonly=0 zone_unusable=0 [161915.553243] BTRFS info (device nullb0 state A): space_info METADATA (sub-group id 0) has -5767168 free, is full [161915.553245] BTRFS info (device nullb0 state A): space_info total=53673984, used=6635520, pinned=46956544, reserved=16384, may_use=5767168, readonly=65536 zone_unusable=0 [161915.553251] BTRFS info (device nullb0 state A): space_info SYSTEM (sub-group id 0) has 8355840 free, is not full [161915.553254] BTRFS info (device nullb0 state A): space_info total=8388608, used=16384, pinned=16384, reserved=0, may_use=0, readonly=0 zone_unusable=0 [161915.553257] BTRFS info (device nullb0 state A): global_block_rsv: size 5767168 reserved 5767168 [161915.553261] BTRFS info (device nullb0 state A): trans_block_rsv: size 0 reserved 0 [161915.553263] BTRFS info (device nullb0 state A): chunk_block_rsv: size 0 reserved 0 [161915.553265] BTRFS info (device nullb0 state A): remap_block_rsv: size 0 reserved 0 [161915.553268] BTRFS info (device nullb0 state A): delayed_block_rsv: size 0 reserved 0 [161915.553270] BTRFS info (device nullb0 state A): delayed_refs_rsv: size 0 
reserved 0 [161915.553272] BTRFS: error (device nullb0 state A) in cleanup_transaction:2045: errno=-28 No space left [161915.554463] BTRFS info (device nullb0 state EA): forced readonly The problem is that we allow for a very aggressive metadata overcommit, about 1/8th of the currently available space, even when the task attempting the reservation allows for full flushing. Over time this allows more and more tasks to overcommit without getting a transaction commit to release pinned extents, joining the same transaction, and eventually leads to a transaction abort when attempting some tree update, as the extent allocator is not able to find any available metadata extent and it's not able to allocate a new metadata block group either (not enough unallocated space for that). Fix this by allowing the overcommit to be up to 1/64th of the available (unallocated) space instead and for that limit to apply to both types of full flushing, BTRFS_RESERVE_FLUSH_ALL and BTRFS_RESERVE_FLUSH_ALL_STEAL. This way we get more frequent transaction commits to release pinned extents in case our caller is in a context where full flushing is allowed. Note that the space infos dump in the dmesg/syslog right after the transaction abort gives the wrong idea that we have plenty of unallocated space when the abort happened. During the bonnie++ workload we had a metadata chunk allocation attempt and it failed with -ENOSPC because at that time we had a bunch of data block groups allocated, which then became empty and got deleted by the cleaner kthread after the metadata chunk allocation failed with -ENOSPC and before the transaction abort happened and dumped the space infos. The custom tracing (some trace_printk() calls spread in strategic places) was used to check that: mount-1793735 [011] ...1. 28877.261096: btrfs_add_bg_to_space_info: added bg offset 13631488 length 8388608 flags 1 to space_info->flags 1 total_bytes 8388608 bytes_used 0 bytes_may_use 0 mount-1793735 [011] ...1. 
28877.261098: btrfs_add_bg_to_space_info: added bg offset 22020096 length 8388608 flags 34 to space_info->flags 2 total_bytes 8388608 bytes_used 16384 bytes_may_use 0 mount-1793735 [011] ...1. 28877.261100: btrfs_add_bg_to_space_info: added bg offset 30408704 length 53673984 flags 36 to space_info->flags 4 total_bytes 53673984 bytes_used 131072 bytes_may_use 0 These are from loading the block groups created by mkfs during mount. Then when bonnie++ starts doing its thing: kworker/u48:5-1792004 [011] ..... 28886.122050: btrfs_create_chunk: gather_device_info 1 ctl->dev_extent_min = 65536 dev_extent_want 1073741824 kworker/u48:5-1792004 [011] ..... 28886.122053: btrfs_create_chunk: gather_device_info 2 ctl->dev_extent_min = 65536 dev_extent_want 1073741824 max_avail 927596544 kworker/u48:5-1792004 [011] ..... 28886.122055: btrfs_make_block_group: make bg offset 84082688 size 117440512 type 1 kworker/u48:5-1792004 [011] ...1. 28886.122064: btrfs_add_bg_to_space_info: added bg offset 84082688 length 117440512 flags 1 to space_info->flags 1 total_bytes 125829120 bytes_used 0 bytes_may_use 5251072 First allocation of a data block group of 112M. kworker/u48:5-1792004 [011] ..... 28886.192408: btrfs_create_chunk: gather_device_info 1 ctl->dev_extent_min = 65536 dev_extent_want 1073741824 kworker/u48:5-1792004 [011] ..... 28886.192413: btrfs_create_chunk: gather_device_info 2 ctl->dev_extent_min = 65536 dev_extent_want 1073741824 max_avail 810156032 kworker/u48:5-1792004 [011] ..... 28886.192415: btrfs_make_block_group: make bg offset 201523200 size 117440512 type 1 kworker/u48:5-1792004 [011] ...1. 28886.192425: btrfs_add_bg_to_space_info: added bg offset 201523200 length 117440512 flags 1 to space_info->flags 1 total_bytes 243269632 bytes_used 0 bytes_may_use 122691584 Another 112M data block group allocated. kworker/u48:5-1792004 [011] ..... 
28886.260935: btrfs_create_chunk: gather_device_info 1 ctl->dev_extent_min = 65536 dev_extent_want 1073741824 kworker/u48:5-1792004 [011] ..... 28886.260941: btrfs_create_chunk: gather_device_info 2 ctl->dev_extent_min = 65536 dev_extent_want 1073741824 max_avail 692715520 kworker/u48:5-1792004 [011] ..... 28886.260943: btrfs_make_block_group: make bg offset 318963712 size 117440512 type 1 kworker/u48:5-1792004 [011] ...1. 28886.260954: btrfs_add_bg_to_space_info: added bg offset 318963712 length 117440512 flags 1 to space_info->flags 1 total_bytes 360710144 bytes_used 0 bytes_may_use 240132096 Yet another one. bonnie++-1793755 [010] ..... 28886.280407: btrfs_create_chunk: gather_device_info 1 ctl->dev_extent_min = 65536 dev_extent_want 1073741824 bonnie++-1793755 [010] ..... 28886.280412: btrfs_create_chunk: gather_device_info 2 ctl->dev_extent_min = 65536 dev_extent_want 1073741824 max_avail 575275008 bonnie++-1793755 [010] ..... 28886.280414: btrfs_make_block_group: make bg offset 436404224 size 117440512 type 1 bonnie++-1793755 [010] ...1. 28886.280419: btrfs_add_bg_to_space_info: added bg offset 436404224 length 117440512 flags 1 to space_info->flags 1 total_bytes 478150656 bytes_used 0 bytes_may_use 268435456 One more. kworker/u48:5-1792004 [011] ..... 28886.566233: btrfs_create_chunk: gather_device_info 1 ctl->dev_extent_min = 65536 dev_extent_want 1073741824 kworker/u48:5-1792004 [011] ..... 28886.566238: btrfs_create_chunk: gather_device_info 2 ctl->dev_extent_min = 65536 dev_extent_want 1073741824 max_avail 457834496 kworker/u48:5-1792004 [011] ..... 28886.566241: btrfs_make_block_group: make bg offset 553844736 size 117440512 type 1 kworker/u48:5-1792004 [011] ...1. 28886.566250: btrfs_add_bg_to_space_info: added bg offset 553844736 length 117440512 flags 1 to space_info->flags 1 total_bytes 595591168 bytes_used 268435456 bytes_may_use 209723392 Another one. bonnie++-1793755 [009] ..... 
28886.613446: btrfs_create_chunk: gather_device_info 1 ctl->dev_extent_min = 65536 dev_extent_want 1073741824 bonnie++-1793755 [009] ..... 28886.613451: btrfs_create_chunk: gather_device_info 2 ctl->dev_extent_min = 65536 dev_extent_want 1073741824 max_avail 340393984 bonnie++-1793755 [009] ..... 28886.613453: btrfs_make_block_group: make bg offset 671285248 size 117440512 type 1 bonnie++-1793755 [009] ...1. 28886.613458: btrfs_add_bg_to_space_info: added bg offset 671285248 length 117440512 flags 1 to space_info->flags 1 total_bytes 713031680 bytes_used 268435456 bytes_may_use 268435456 Another one. bonnie++-1793755 [009] ..... 28886.674953: btrfs_create_chunk: gather_device_info 1 ctl->dev_extent_min = 65536 dev_extent_want 1073741824 bonnie++-1793755 [009] ..... 28886.674957: btrfs_create_chunk: gather_device_info 2 ctl->dev_extent_min = 65536 dev_extent_want 1073741824 max_avail 222953472 bonnie++-1793755 [009] ..... 28886.674959: btrfs_make_block_group: make bg offset 788725760 size 117440512 type 1 bonnie++-1793755 [009] ...1. 28886.674963: btrfs_add_bg_to_space_info: added bg offset 788725760 length 117440512 flags 1 to space_info->flags 1 total_bytes 830472192 bytes_used 268435456 bytes_may_use 134217728 Another one. bonnie++-1793755 [009] ..... 28886.674981: btrfs_create_chunk: gather_device_info 1 ctl->dev_extent_min = 65536 dev_extent_want 1073741824 bonnie++-1793755 [009] ..... 28886.674982: btrfs_create_chunk: gather_device_info 2 ctl->dev_extent_min = 65536 dev_extent_want 1073741824 max_avail 105512960 bonnie++-1793755 [009] ..... 28886.674983: btrfs_make_block_group: make bg offset 906166272 size 105512960 type 1 bonnie++-1793755 [009] ...1. 28886.674984: btrfs_add_bg_to_space_info: added bg offset 906166272 length 105512960 flags 1 to space_info->flags 1 total_bytes 935985152 bytes_used 268435456 bytes_may_use 67108864 Another one, but a bit smaller (~100.6M) since we now have less space. bonnie++-1793758 [009] ..... 
28891.962096: btrfs_create_chunk: gather_device_info 1 ctl->dev_extent_min = 65536 dev_extent_want 1073741824 bonnie++-1793758 [009] ..... 28891.962103: btrfs_create_chunk: gather_device_info 2 ctl->dev_extent_min = 65536 dev_extent_want 1073741824 max_avail 12582912 bonnie++-1793758 [009] ..... 28891.962105: btrfs_make_block_group: make bg offset 1011679232 size 12582912 type 1 bonnie++-1793758 [009] ...1. 28891.962114: btrfs_add_bg_to_space_info: added bg offset 1011679232 length 12582912 flags 1 to space_info->flags 1 total_bytes 948568064 bytes_used 268435456 bytes_may_use 8192 Another one, this one even smaller (12M). kworker/u48:5-1792004 [011] ..... 28892.112802: btrfs_chunk_alloc: enter first metadata chunk alloc attempt kworker/u48:5-1792004 [011] ..... 28892.112805: btrfs_create_chunk: gather_device_info 1 ctl->dev_extent_min = 131072 dev_extent_want 536870912 kworker/u48:5-1792004 [011] ..... 28892.112806: btrfs_create_chunk: gather_device_info 2 ctl->dev_extent_min = 131072 dev_extent_want 536870912 max_avail 0 536870912 is 512M, the standard 256M metadata chunk size times 2 because of the DUP profile for metadata. 'max_avail' is what find_free_dev_extent() returns to us in gather_device_info(). As a result, gather_device_info() sets ctl->ndevs to 0, making decide_stripe_size() fail with -ENOSPC, and therefore metadata chunk allocation fails while we are attempting to run delayed items during the transaction commit. kworker/u48:5-1792004 [011] ..... 28892.112807: btrfs_create_chunk: decide_stripe_size fail -ENOSPC In the syslog/dmesg pasted above, which happened after the transaction was aborted, the space info dumps did not account for all these data block groups that were allocated during bonnie++'s workload. 
And that is because after the metadata chunk allocation failed with -ENOSPC and before the transaction abort happened, most of the data block groups had become empty and got deleted by the cleaner kthread - when the abort happened, we had bonnie++ in the middle of deleting the files it created. But by dumping the space infos right after the metadata chunk allocation fails (adding a call to btrfs_dump_space_info_for_trans_abort() in decide_stripe_size() when it returns -ENOSPC), we get: [29972.409295] BTRFS info (device nullb0): dumping space info: [29972.409300] BTRFS info (device nullb0): space_info DATA (sub-group id 0) has 673341440 free, is not full [29972.409303] BTRFS info (device nullb0): space_info total=948568064, used=0, pinned=275226624, reserved=0, may_use=0, readonly=0 zone_unusable=0 [29972.409305] BTRFS info (device nullb0): space_info METADATA (sub-group id 0) has 3915776 free, is not full [29972.409306] BTRFS info (device nullb0): space_info total=53673984, used=163840, pinned=42827776, reserved=147456, may_use=6553600, readonly=65536 zone_unusable=0 [29972.409308] BTRFS info (device nullb0): space_info SYSTEM (sub-group id 0) has 7979008 free, is not full [29972.409310] BTRFS info (device nullb0): space_info total=8388608, used=16384, pinned=0, reserved=0, may_use=393216, readonly=0 zone_unusable=0 [29972.409311] BTRFS info (device nullb0): global_block_rsv: size 5767168 reserved 5767168 [29972.409313] BTRFS info (device nullb0): trans_block_rsv: size 0 reserved 0 [29972.409314] BTRFS info (device nullb0): chunk_block_rsv: size 393216 reserved 393216 [29972.409315] BTRFS info (device nullb0): remap_block_rsv: size 0 reserved 0 [29972.409316] BTRFS info (device nullb0): delayed_block_rsv: size 0 reserved 0 So here we see there's ~904.6M of data space, ~51.2M of metadata space and 8M of system space, making a total of 963.8M. 
Reported-by: Aleksandar Gerasimovski <Aleksandar.Gerasimovski@belden.com> Link: https://lore.kernel.org/linux-btrfs/SA1PR18MB56922F690C5EC2D85371408B998FA@SA1PR18MB5692.namprd18.prod.outlook.com/ Link: https://lore.kernel.org/linux-btrfs/CAL3q7H61vZ3_+eqJ1A9po2WcgNJJjUu9MJQoYB2oDSAAecHaug@mail.gmail.com/ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07btrfs: use per-profile available space in calc_available_free_space()Qu Wenruo
For the following disk layout, can_overcommit() can cause false confidence in available space: devid 1 unallocated: 1GiB devid 2 unallocated: 50GiB metadata type: RAID1 As can_overcommit() simply uses the unallocated space scaled by the profile factor to calculate the allocatable metadata chunk size, it results in 25.5GiB of available space. But in reality we can only allocate one 1GiB RAID1 chunk; the remaining 49GiB on devid 2 will never be utilized to fulfill a RAID1 chunk. This leads to various ENOSPC-related transaction aborts that flip the fs read-only. Now use the per-profile available space in calc_available_free_space(), and only when that fails do we fall back to the old factor based estimation. And for zoned devices or for the very low chance of a temporary memory allocation failure, we will still fall back to the factor based estimation. But I hope that in reality it's very rare. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07btrfs: update per-profile available estimationQu Wenruo
This involves the following timing: - Chunk allocation - Chunk removal - After mount - New device - Device removal - Device shrink - Device enlarge And since the function btrfs_update_per_profile_avail() will not return an error, this won't cause new error handling paths. Although when btrfs_update_per_profile_avail() fails (only ENOSPC is possible) it will mark the per-profile available estimation as unreliable, so that later btrfs_get_per_profile_avail() will return false and require the caller to have a fallback solution. The function btrfs_update_per_profile_avail() will be executed with chunk_mutex held, thus it will slightly slow down those involved functions, but not by much. As all the core workload is just various u64 calculations inside a loop, without any tree search, the overhead should be acceptable even for all 9 supported profiles. For 4 disks (which exercises all 9 profiles), the execution time of that function will still be less than 10 us. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07btrfs: introduce the device layout aware per-profile available spaceQu Wenruo
[BUG] There is a long-known bug that if metadata is using RAID1 on two disks with unbalanced sizes, there is a very high chance of hitting an ENOSPC-related transaction abort. [CAUSE] The root cause is in the available space estimation code: - Factor based calculation Just use all unallocated space, divided by the profile factor. One obvious user is can_overcommit(). This cannot handle the following example: devid 1 unallocated: 1GiB devid 2 unallocated: 50GiB metadata type: RAID1 If using factor based estimation, we can use (1GiB + 50GiB) / 2 = 25.5GiB free space for metadata. Thus we can continue allocating metadata (over-commit) way beyond the 1GiB limit. But this estimation is completely wrong; in reality we can only allocate one single 1GiB RAID1 block group, thus if we continue to over-commit, at some point we will hit ENOSPC in some critical path and flip the fs read-only. [SOLUTION] This patch will introduce per-profile available space estimation, which can provide chunk-allocator-like behavior to give a (mostly) accurate result, with under-estimating corner cases. There are some differences between the estimation and the real chunk allocator: - No consideration of hole size It's fine for most cases, as all data/metadata stripes are 1GiB in size, thus there should not be any hole wasting much space. And the chunk allocator is able to use smaller stripes when there is really no other choice. Although in theory this means it can lead to some over-estimation, it should not cause too much hassle in the real world. The other benefit of such behavior is that we avoid the dev-extent tree search completely, thus the overhead is very small. - No true balance for certain cases If we have a 3-disk RAID1, and each device has 2GiB unallocated space, we can load balance the chunk allocation so that we can allocate 3GiB of RAID1 chunks, and that's what the chunk allocator will do. But the current estimation code uses the largest available space to do a single allocation. 
Meaning the estimation will be 2GiB, thus an under-estimate. Such under-estimation is fine, and after the first chunk allocation the estimation will be updated and still give a correct 2GiB estimation. So this only means the estimation will be a little conservative, which is safer for call sites like the metadata over-commit check. With that facility, for the above 1GiB + 50GiB case, it will give a RAID1 estimation of 1GiB, instead of the incorrect 25.5GiB. Or for a more complex example: devid 1 unallocated: 1T devid 2 unallocated: 1T devid 3 unallocated: 10T We will get an array of:

RAID10:  2T
RAID1:   2T
RAID1C3: 1T
RAID1C4: 0 (not enough devices)
DUP:     6T
RAID0:   3T
SINGLE:  12T
RAID5:   2T
RAID6:   1T

[IMPLEMENTATION] And for each profile, we do a chunk-allocator-level calculation. The pseudo code looks like:

clear_virtual_used_space_of_all_rw_devices();
do {
	/*
	 * The same as the chunk allocator, besides real used space
	 * we also take virtual used space into consideration.
	 */
	sort_device_with_virtual_free_space();

	/*
	 * Unlike the chunk allocator, we don't need to bother with
	 * hole/stripe size, so we use the smallest device to make sure
	 * we can allocate as many stripes as the regular chunk allocator.
	 */
	stripe_size = device_with_smallest_free->avail_space;
	stripe_size = min(stripe_size, to_alloc / ndevs);

	/*
	 * Allocate a virtual chunk; an allocated virtual chunk will
	 * increase virtual used space, allowing the next iteration to
	 * properly emulate the chunk allocator behavior.
	 */
	ret = alloc_virtual_chunk(stripe_size, &allocated_size);
	if (ret == 0)
		avail += allocated_size;
} while (ret == 0);

This minimal available-space based calculation is not perfect, but the important part is that the estimation never exceeds the real available space. This patch just introduces the infrastructure; no hooks are executed yet. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07btrfs: zoned: remove redundant space_info lock and variable in ↵Jiasheng Jiang
do_allocation_zoned() In do_allocation_zoned(), the code acquires space_info->lock before block_group->lock. However, the critical section does not access or modify any members of the space_info structure. Thus, the lock is redundant as it provides no necessary synchronization here. This change simplifies the locking logic and aligns the function with other zoned paths, such as __btrfs_add_free_space_zoned(), which only rely on block_group->lock. Since the 'space_info' local variable is no longer used after removing the lock calls, it is also removed. Removing this unnecessary lock reduces contention on the global space_info lock, improving concurrency in the zoned allocation path. Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: Jiasheng Jiang <jiashengjiangcool@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07btrfs: move min sys chunk array size check to validate_sys_chunk_array()Filipe Manana
We check the minimum size of the sys chunk array in btrfs_validate_super() but we have a better place for that, the helper validate_sys_chunk_array() which we use for every other sys chunk array check. So move it there, also converting the return error from -EINVAL to -EUCLEAN, which is a better fit and also consistent with the other checks. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07btrfs: remove duplicate system chunk array max size overflow checkFilipe Manana
We check it twice, once in validate_sys_chunk_array() and then again in its caller, btrfs_validate_super(), right after it calls validate_sys_chunk_array(). So remove the duplicated check from btrfs_validate_super(). Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07btrfs: pass boolean literals as the last argument to inc_block_group_ro()Filipe Manana
The last argument of inc_block_group_ro() is defined as a boolean, but every caller is passing an integer literal, 0 or 1 for false and true respectively. While this is not incorrect, as 0 and 1 are converted to false and true, it's less readable and somewhat awkward since the argument is defined as boolean. Replace 0 and 1 with false and true. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07btrfs: tests: zoned: add tests cases for zoned codeNaohiro Aota
Add a test function for the zoned code; for now it tests btrfs_load_block_group_by_raid_type() with various test cases, defined in the load_zone_info_tests[] array. Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07fs/ntfs3: terminate the cached volume label after UTF-8 conversionPengpeng Hou
ntfs_fill_super() loads the on-disk volume label with utf16s_to_utf8s() and stores the result in sbi->volume.label. The converted label is later exposed through ntfs3_label_show() using %s, but utf16s_to_utf8s() only returns the number of bytes written and does not add a trailing NUL. If the converted label fills the entire fixed buffer, ntfs3_label_show() can read past the end of sbi->volume.label while looking for a terminator. Terminate the cached label explicitly after a successful conversion and clamp the exact-full case to the last byte of the buffer. Fixes: 82cae269cfa9 ("fs/ntfs3: Add initialization of super block") Signed-off-by: Pengpeng Hou <pengpeng@iscas.ac.cn> Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
2026-04-07fs/ntfs3: fix potential double iput on d_make_root() failureZhan Xusheng
d_make_root() consumes the reference to the passed inode: it either attaches it to the newly created dentry on success, or drops it via iput() on failure. In the error path, the code currently does: sb->s_root = d_make_root(inode); if (!sb->s_root) goto put_inode_out; which leads to a second iput(inode) in put_inode_out. This results in a double iput and may trigger a use-after-free if the inode gets freed after the first iput(). Fix this by jumping directly to the common cleanup path, avoiding the extra iput(inode). Signed-off-by: Zhan Xusheng <zhanxusheng@xiaomi.com> Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
2026-04-07ntfs3: fix integer overflow in run_unpack() volume boundary checkTobias Gaertner
The volume boundary check `lcn + len > sbi->used.bitmap.nbits` uses raw addition which can wrap around for large lcn and len values, bypassing the validation. Use check_add_overflow() as is already done for the adjacent prev_lcn + dlcn and vcn64 + len checks added by commit 3ac37e100385 ("ntfs3: Fix integer overflow in run_unpack()"). Found by fuzzing with a source-patched harness (LibAFL + QEMU). Fixes: 82cae269cfa95 ("fs/ntfs3: Add initialization of super block") Cc: stable@vger.kernel.org Signed-off-by: Tobias Gaertner <tob.gaertner@me.com> Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
2026-04-07ntfs3: add buffer boundary checks to run_unpack()Tobias Gaertner
run_unpack() checks `run_buf < run_last` at the top of the while loop but then reads size_size and offset_size bytes via run_unpack_s64() without verifying they fit within the remaining buffer. A crafted NTFS image with truncated run data in an MFT attribute triggers an OOB heap read of up to 15 bytes when the filesystem is mounted. Add boundary checks before each run_unpack_s64() call to ensure the declared field size does not exceed the remaining buffer. Found by fuzzing with a source-patched harness (LibAFL + QEMU). Fixes: 82cae269cfa95 ("fs/ntfs3: Add initialization of super block") Cc: stable@vger.kernel.org Signed-off-by: Tobias Gaertner <tob.gaertner@me.com> Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
2026-04-07ntfs3: fix mount failure on volumes with fragmented MFT bitmapKonstantin Komarov
When the $MFT's $BITMAP attribute is fragmented across multiple MFT records (base record + extent records), ntfs_fill_super() fails with -ENOENT during wnd_init() because the MFT bitmap's run list only contains runs from the base MFT record. The issue is that wnd_init() (which calls wnd_rescan()) is invoked before ni_load_all_mi(), so the extent MFT records containing additional $BITMAP runs have not been loaded yet. When wnd_rescan() tries to look up a VCN beyond the base record's runs, run_lookup_entry() fails and returns -ENOENT. This affects NTFS volumes with a large or heavily fragmented MFT, which is common on long-used Windows systems where the MFT bitmap's run list doesn't fit in the base MFT record and spills into extent records. Fix this by: 1. Moving ni_load_all_mi() before wnd_init() so all extent records are available. 2. After ni_load_all_mi(), iterating through the attribute list to find any $BITMAP extent attributes and unpacking their runs into sbi->mft.bitmap.run before wnd_init() is called. Tested on a 664GB NTFS volume with 86 MFT bitmap runs spanning records 0 (VCN 0-105) and 17 (VCN 106-165). Before the fix, mount fails with -ENOENT. After the fix, mount succeeds and all read/write operations work correctly. Stress-tested with 8 test categories (large file integrity, 10K small files, copy, move, delete/recreate cycles, concurrent writes, deep directories, overwrite persistence). Signed-off-by: Ruslan Elishev <relishev@gmail.com> Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
2026-04-07fs/ntfs3: fix $LXDEV xattr lookupZhan Xusheng
Use correct xattr name ("$LXDEV") and buffer size when calling ntfs_get_ea(), otherwise the attribute may not be read. Signed-off-by: Zhan Xusheng <zhanxusheng@xiaomi.com> Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
2026-04-07ntfs3: fix OOB write in attr_wof_frame_info()0xkato
In attr_wof_frame_info(), the offset-table read range for a nonresident WofCompressedData stream is: u64 from = vbo[i] & ~(u64)(PAGE_SIZE - 1); u64 to = min(from + PAGE_SIZE, wof_size); ... ntfs_read_run(sbi, run, addr, from, to - from); A crafted image sets WofCompressedData.nres.data_size to 0xfff while the file is large enough to request frame 1024 (offset 0x400000). This gives from=0x1000, to=0xfff. The unsigned (to - from) wraps to 0xffffffffffffffff and ntfs_read_write_run() overflows the single-page offs_folio via memcpy. Triggered by pread() on a mounted NTFS image. Depending on adjacent memory layout at the time of the overflow, KASAN reports this as slab-out-of-bounds, use-after-free, or slab-use-after-free all at ntfs_read_write_run(). Secondary corruption/panic paths were also observed. Reject the read when the offset-table page is outside the stream. Signed-off-by: 0xkato <0xkkato@gmail.com> Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
2026-04-07orangefs: validate getxattr response lengthHyungJung Joo
orangefs_inode_getxattr() trusts the userspace-client-controlled downcall.resp.getxattr.val_sz and uses it as a memcpy() length both for the temporary user buffer and the cached xattr buffer. Reject malformed negative or oversized lengths before copying response bytes. Reported-by: Hyungjung Joo <jhj140711@gmail.com> Signed-off-by: HyungJung Joo <jhj140711@gmail.com> Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2026-04-07orangefs_readahead: don't overflow the bufmap slot.Mike Marshall
generic/340 showed that this caller of wait_for_direct_io was sometimes asking for more than a bufmap slot could hold. This splits the calls up if needed. Signed-off-by: Mike Marshall <hubcap@omnibond.com>