path: root/fs
2026-02-03  btrfs: remove experimental offload csum mode  (Qu Wenruo)
The offload csum mode was introduced to allow developers to compare the performance of generating checksums for data writes at different timings:

- During btrfs_submit_chunk()
  This is the most common one; we go this path if any of the following conditions is met:
  * The csum is fast (for now that's CRC32C and xxhash)
  * It's a synchronous write
  * Zoned
- Delay the checksum generation to a workqueue

However since commit dd57c78aec39 ("btrfs: introduce btrfs_bio::async_csum") we no longer need to bother with either of them. If it's an experimental build, async checksum generation in the background will be faster anyway. And if it's not an experimental build, we won't even have the offload csum mode support.

Considering the async csum will be the new default, remove the offload csum mode code. There will be no impact to end users, as offload csum mode is still under experimental features.

Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03  btrfs: split btrfs_fs_closing() and change return type to bool  (David Sterba)
There are two tests in btrfs_fs_closing() but checking the BTRFS_FS_CLOSING_DONE bit is done only in one place, load_extent_tree_free(). As this is an inline function we can reduce the size of the generated code. The return type can also be changed to bool as this becomes a simple condition.

   text    data     bss     dec     hex filename
1674006  146704   15560 1836270  1c04ee pre/btrfs.ko
1673772  146704   15560 1836036  1c0404 post/btrfs.ko

DELTA: -234

Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
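A minimal sketch of what the split inline helpers could look like, assuming the DONE check simply moves into a second helper (the new helper's name here is illustrative, not necessarily what the patch uses):

	static inline bool btrfs_fs_closing(struct btrfs_fs_info *fs_info)
	{
		return test_bit(BTRFS_FS_CLOSING_START, &fs_info->flags);
	}

	/* Hypothetical name for the single caller that also cares about DONE. */
	static inline bool btrfs_fs_closing_done(struct btrfs_fs_info *fs_info)
	{
		return test_bit(BTRFS_FS_CLOSING_DONE, &fs_info->flags);
	}

Each helper compiles down to a single test_bit(), which is where the text size reduction comes from.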
2026-02-03  btrfs: reject single block sized compression early  (Qu Wenruo)
Currently for an inode that needs compression, even if there is a delalloc range that is single fs block sized and can not be inlined, we will still go through the compression path. Then inside compress_file_range(), we have one extra check to reject such a single block sized range and fall back to a regular uncompressed write. This rejection is in fact a little too late: we have already allocated memory for async_chunk and delayed the submission, just to fall back to the same uncompressed write. Change the behavior to reject such cases earlier, at inode_need_compress(), so for such a single block sized range we won't even bother trying to go through the compress path. And since the inline small block check is inside inode_need_compress() and compress_file_range() also calls that function, we no longer need a dedicated check inside compress_file_range(). Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
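A minimal sketch of the kind of early check described here, assuming a hypothetical can_be_inlined() stand-in for whatever inline-extent eligibility test btrfs already performs (the real condition in inode_need_compress() may be structured differently):

	/* Illustrative only, not the exact patch. */
	static bool worth_compressing(struct btrfs_inode *inode, u64 start, u64 end)
	{
		const struct btrfs_fs_info *fs_info = inode->root->fs_info;
		const u64 len = end - start + 1;

		/*
		 * A single-block range that cannot be inlined ends up as a plain
		 * uncompressed write anyway, so reject it before allocating any
		 * async_chunk state or deferring the submission.
		 */
		if (len <= fs_info->sectorsize && !can_be_inlined(inode, len))
			return false;

		return true;	/* fall through to the existing heuristics */
	}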
2026-02-03  btrfs: update outdated comment in __add_block_group_free_space()  (Julia Lawall)
The function add_block_group_free_space() was renamed to btrfs_add_block_group_free_space() by commit 6fc5ef782988 ("btrfs: add btrfs prefix to free space tree exported functions"). Update the comment accordingly. Do some reorganization of the next few lines to keep the comment within 80 characters. Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03  btrfs: add mount time auto fix for orphan fst entries  (Qu Wenruo)
[BUG]
Before the btrfs-progs v6.16.1 release, mkfs.btrfs can leave free space tree entries for deleted chunks:

  # mkfs.btrfs -f -O fst $dev
  # btrfs ins dump-tree -t chunk $dev
  btrfs-progs v6.16
  chunk tree
  leaf 22036480 items 4 free space 15781 generation 8 owner CHUNK_TREE
  leaf 22036480 flags 0x1(WRITTEN) backref revision 1
      item 0 key (DEV_ITEMS DEV_ITEM 1) itemoff 16185 itemsize 98
      item 1 key (FIRST_CHUNK_TREE CHUNK_ITEM 13631488) itemoff 16105 itemsize 80
      ^^^ The first chunk is at 13631488
      item 2 key (FIRST_CHUNK_TREE CHUNK_ITEM 22020096) itemoff 15993 itemsize 112
      item 3 key (FIRST_CHUNK_TREE CHUNK_ITEM 30408704) itemoff 15881 itemsize 112

  # btrfs ins dump-tree -t free-space-tree $dev
  btrfs-progs v6.16
  free space tree key (FREE_SPACE_TREE ROOT_ITEM 0)
  leaf 30556160 items 13 free space 15918 generation 8 owner FREE_SPACE_TREE
  leaf 30556160 flags 0x1(WRITTEN) backref revision 1
      item 0 key (1048576 FREE_SPACE_INFO 4194304) itemoff 16275 itemsize 8
          free space info extent count 1 flags 0
      item 1 key (1048576 FREE_SPACE_EXTENT 4194304) itemoff 16275 itemsize 0
          free space extent
      item 2 key (5242880 FREE_SPACE_INFO 8388608) itemoff 16267 itemsize 8
          free space info extent count 1 flags 0
      item 3 key (5242880 FREE_SPACE_EXTENT 8388608) itemoff 16267 itemsize 0
          free space extent
      ^^^ Above 4 items are all before the first chunk.
      item 4 key (13631488 FREE_SPACE_INFO 8388608) itemoff 16259 itemsize 8
          free space info extent count 1 flags 0
      item 5 key (13631488 FREE_SPACE_EXTENT 8388608) itemoff 16259 itemsize 0
          free space extent
  ...

This can trigger btrfs check errors.

[CAUSE]
It's a bug in the free space tree implementation of btrfs-progs, which doesn't delete the involved fst entries for the to-be-deleted chunk/block group.

[ENHANCEMENT]
The most common fix is to clear the space cache and rebuild it, but that requires a ro->rw remount which may not be possible for a rootfs, and it also relies on users using the "clear_cache" mount option manually.

Here introduce a kernel fix for it, which automatically deletes any entries that are before the first block group at the first RW mount. For filesystems without this problem the overhead is just a single tree search and no modification to the free space tree, so the overhead should be minimal.

Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03  btrfs: simplify check for zoned NODATASUM writes in btrfs_submit_chunk()  (Zhen Ni)
This function already dereferences 'inode' multiple times earlier, making the additional NULL check at line 840 redundant since the function would have crashed already if inode were NULL. After commit 81cea6cd7041 ("btrfs: remove btrfs_bio::fs_info by extracting it from btrfs_bio::inode"), the btrfs_bio::inode field is mandatory for all btrfs_bio allocations and is guaranteed to be non-NULL. Simplify the condition for allocating dummy checksums for zoned NODATASUM data by removing the unnecessary 'inode &&' check. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Zhen Ni <zhen.ni@easystack.cn> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03  btrfs: avoid transaction commit on error in insert_balance_item()  (Filipe Manana)
There's no point in committing the transaction if we failed to insert the balance item, since we haven't done anything else after we started/joined the transaction. Also stop using two variables for tracking the return value and use only 'ret'. Reviewed-by: Daniel Vacek <neelx@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
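A rough sketch of the resulting error flow (declarations, item setup and path handling of insert_balance_item() are elided; only the shape of the change is shown):

	trans = btrfs_start_transaction(root, 1);
	if (IS_ERR(trans))
		return PTR_ERR(trans);

	ret = btrfs_insert_empty_item(trans, root, path, &key, sizeof(*item));
	if (ret) {
		/* Nothing was modified yet, so ending the transaction is enough. */
		btrfs_end_transaction(trans);
		return ret;
	}

	/* ... fill in the balance item ... */

	return btrfs_commit_transaction(trans);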
2026-02-03  btrfs: move unlikely checks around btrfs_is_shutdown() into the helper  (Filipe Manana)
Instead of surrounding every caller of btrfs_is_shutdown() with unlikely, move the unlikely into the helper itself, like we do in other places in btrfs and as is common in the kernel outside btrfs too. Also make the fs_info argument of btrfs_is_shutdown() const. On an x86_64 box using gcc 14.2.0-19 from Debian, this resulted in a slight reduction of the module's text size.

Before:
$ size fs/btrfs/btrfs.ko
   text    data     bss     dec     hex filename
1939044  172568   15592 2127204  207564 fs/btrfs/btrfs.ko

After:
$ size fs/btrfs/btrfs.ko
   text    data     bss     dec     hex filename
1938876  172568   15592 2127036  2074bc fs/btrfs/btrfs.ko

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
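A minimal sketch of what the helper looks like with the hint folded in; the shutdown state bit name used here is an assumption, not necessarily the exact flag in the tree:

	static inline bool btrfs_is_shutdown(const struct btrfs_fs_info *fs_info)
	{
		/* BTRFS_FS_STATE_SHUTDOWN is assumed here for illustration. */
		return unlikely(test_bit(BTRFS_FS_STATE_SHUTDOWN, &fs_info->fs_state));
	}

Callers can then write a plain "if (btrfs_is_shutdown(fs_info))" without repeating unlikely() at every call site.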
2026-02-03  btrfs: tag as unlikely error conditions in the transaction commit path  (Filipe Manana)
Errors are unexpected during the transaction commit path, and when they happen we abort the transaction (by calling cleanup_transaction() under the label 'cleanup_transaction' in btrfs_commit_transaction()). So mark every error check in the transaction commit path as unlikely, to hint the compiler so that it can possibly generate better code, and to make it clear to a reader that this is unexpected. On an x86_64 box using gcc 14.2.0-19 from Debian, this resulted in a slight reduction of the module's text size.

Before:
$ size fs/btrfs/btrfs.ko
   text    data     bss     dec     hex filename
1939476  172568   15592 2127636  207714 fs/btrfs/btrfs.ko

After:
$ size fs/btrfs/btrfs.ko
   text    data     bss     dec     hex filename
1939044  172568   15592 2127204  207564 fs/btrfs/btrfs.ko

Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03  btrfs: remove unreachable return after btrfs_backref_panic() in btrfs_backref_finish_upper_links()  (Zhen Ni)
The return statement after btrfs_backref_panic() is unreachable since btrfs_backref_panic() calls BUG() which never returns. Remove the return to unify it with the other calls to btrfs_backref_panic(). Signed-off-by: Zhen Ni <zhen.ni@easystack.cn> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03  btrfs: refactor the main loop of cow_file_range()  (Qu Wenruo)
Currently inside the main loop of cow_file_range(), we do the following sequence:

- Reserve an extent
- Lock the IO tree range
- Create an IO extent map
- Create an ordered extent

Every step needs extra cleanup steps in the following order:

- Drop the newly created extent map
- Unlock the extent range and clean up the involved folios
- Free the reserved extent

However currently the error handling is done inconsistently:

- The extent map drop is handled in a dedicated tag outside the main loop, making it much harder to track.
- The extent unlock and folio cleanup are done separately.
  The extent is unlocked through btrfs_unlock_extent(), then extent_clear_unlock_delalloc() again in a dedicated tag. Meanwhile all other call sites (compression/encoded/nocow) just call extent_clear_unlock_delalloc() to handle unlock and folio cleanup in one go.
- The reserved extent freeing is handled in a dedicated tag outside the main loop, making it much harder to track.
- The error handling of btrfs_reloc_clone_csums() relies on out-of-loop tags.
  This is due to the special requirement to finish ordered extents to handle the metadata reserved space.

Enhance the error handling and align the behavior by:

- Introducing a dedicated cow_one_range() helper, which does the reserve/lock/allocation and also handles the errors inside the helper. No more dedicated tags outside the main loop.
- Using a single extent_clear_unlock_delalloc() to unlock and clean up folios.
- Moving the btrfs_reloc_clone_csums() error handling into the new helper. Thankfully it's not that complex compared to the other cases.

And since we're here, also reduce the width of the following local variables to u32:

- cur_alloc_size
- min_alloc_size
  Each allocation won't go beyond 128M, thus u32 is more than enough.
- blocksize
  The maximum is 64K, no need for u64.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03  btrfs: zoned: print block-group type for zoned statistics  (Johannes Thumshirn)
When printing the zoned statistics, also include the block-group type in the block-group listing output. The updated output looks as follows:

  device /dev/vda mounted on /mnt with fstype btrfs
  zoned statistics:
  active block-groups: 9 reclaimable: 0 unused: 2 need reclaim: false
  data relocation block-group: 3221225472
  active zones:
  start: 1073741824, wp: 268419072 used: 268419072, reserved: 0, unusable: 0 (DATA)
  start: 1342177280, wp: 0 used: 0, reserved: 0, unusable: 0 (DATA)
  start: 1610612736, wp: 81920 used: 16384, reserved: 16384, unusable: 49152 (SYSTEM)
  start: 1879048192, wp: 2031616 used: 1458176, reserved: 65536, unusable: 507904 (METADATA)
  start: 2147483648, wp: 268419072 used: 268419072, reserved: 0, unusable: 0 (DATA)
  start: 2415919104, wp: 268419072 used: 268419072, reserved: 0, unusable: 0 (DATA)
  start: 2684354560, wp: 268419072 used: 268419072, reserved: 0, unusable: 0 (DATA)
  start: 2952790016, wp: 65536 used: 65536, reserved: 0, unusable: 0 (DATA)
  start: 3221225472, wp: 0 used: 0, reserved: 0, unusable: 0 (DATA)

Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03  btrfs: move space_info_flag_to_str() to space-info.h  (Johannes Thumshirn)
Move space_info_flag_to_str() to space-info.h and, as it is no longer static to space-info.c, prefix it with 'btrfs_'. This way it can be re-used in other places. Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03  btrfs: zoned: show statistics about zoned filesystems in mountstats  (Johannes Thumshirn)
Add statistics output to /proc/<pid>/mountstats for zoned BTRFS, similar to the zoned statistics from XFS in mountstats. The output for /proc/<pid>/mountstats on an example filesystem will be as follows:

  device /dev/vda mounted on /mnt with fstype btrfs
  zoned statistics:
  active block-groups: 7 reclaimable: 0 unused: 5 need reclaim: false
  data relocation block-group: 1342177280
  active zones:
  start: 1073741824, wp: 268419072 used: 0, reserved: 268419072, unusable: 0
  start: 1342177280, wp: 0 used: 0, reserved: 0, unusable: 0
  start: 1610612736, wp: 49152 used: 16384, reserved: 16384, unusable: 16384
  start: 1879048192, wp: 950272 used: 131072, reserved: 622592, unusable: 196608
  start: 2147483648, wp: 212238336 used: 0, reserved: 212238336, unusable: 0
  start: 2415919104, wp: 0 used: 0, reserved: 0, unusable: 0
  start: 2684354560, wp: 0 used: 0, reserved: 0, unusable: 0

Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03  btrfs: don't call btrfs_handle_fs_error() in btrfs_commit_transaction()  (Filipe Manana)
There's no need to call btrfs_handle_fs_error() as we are inside a transaction and if we get an error we jump to the 'scrub_continue' label and end up calling cleanup_transaction(), which aborts the transaction. This is odd given that we have a transaction handle and that in the transaction commit path any error makes us abort the transaction, and it's the only place that calls btrfs_handle_fs_error(). Remove the btrfs_handle_fs_error() call and replace it with an error message so that if it happens we know what went wrong during the transaction commit. Also annotate the condition in the if statement with 'unlikely' since this is not expected to happen. We've been wanting to remove btrfs_handle_fs_error(), so this removes one user that does not even need it. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03  btrfs: don't call btrfs_handle_fs_error() in qgroup_account_snapshot()  (Filipe Manana)
There's no need to call btrfs_handle_fs_error() as we are inside a transaction and we propagate the error returned from btrfs_write_and_wait_transaction() to the caller, where it ends up going up the call chain to btrfs_commit_transaction() (returned by the call to create_pending_snapshots()), where we jump to the 'unlock_reloc' label and end up calling cleanup_transaction(), which aborts the transaction. This is odd given that we have a transaction handle and that in the transaction commit path any error makes us abort the transaction and, besides another place inside btrfs_commit_transaction(), it's the only place that calls btrfs_handle_fs_error(). Remove the btrfs_handle_fs_error() call and replace it with an error message so that if it happens we know what went wrong during the transaction commit. Also annotate the condition in the if statement with 'unlikely' since this is not expected to happen. We've been wanting to remove btrfs_handle_fs_error(), so this removes one user that does not even need it. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03  btrfs: don't call btrfs_handle_fs_error() after failure to delete orphan item  (Filipe Manana)
In btrfs_find_orphan_roots() we don't need to call btrfs_handle_fs_error() if we fail to delete the orphan item for the current root. This is because we haven't done anything yet regarding the current root and previous iterations of the loop dealt with other roots, so there's nothing we need to undo. Instead log an error message and return the error to the caller, which will result either in a mount failure or remount failure (the only contexts it's called from). Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03  btrfs: don't call btrfs_handle_fs_error() after failure to join transaction  (Filipe Manana)
In btrfs_find_orphan_roots() we don't need to call btrfs_handle_fs_error() if we fail to join a transaction. This is because we haven't done anything yet regarding the current root and previous iterations of the loop dealt with other roots, so there's nothing we need to undo. Instead log an error message and return the error to the caller, which will result either in a mount failure or remount failure (the only contexts it's called from). Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03  btrfs: remove redundant path release in btrfs_find_orphan_roots()  (Filipe Manana)
There's no need to release the path in the if branch used when the root does not exist, since we released the path before the call to btrfs_get_fs_root(). So remove that redundant btrfs_release_path() call. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03  btrfs: use single return variable in btrfs_find_orphan_roots()  (Filipe Manana)
We use both 'ret' and 'err', which is a pattern that generates confusion and has resulted in subtle bugs in the past. Remove 'err' and use only 'ret'. Also simplify the error flow by directly returning from the function instead of breaking out of the loop, since there are no resources to clean up after the loop. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03  btrfs: avoid transaction commit on error in del_balance_item()  (Filipe Manana)
There's no point in committing the transaction if we failed to delete the item, since we haven't done anything before. Also stop using two variables for tracking the return value and use only 'ret'. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03  btrfs: update stale comment in __cow_file_range_inline()  (Filipe Manana)
We mention that the reserved data space is page size aligned but that's not true anymore, as it's sector size aligned instead. In commit 0bb067ca64e3 ("btrfs: fix the qgroup data free range for inline data extents") we updated the amount passed to btrfs_qgroup_free_data() from page size to sector size, but forgot to update the comment. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03  btrfs: remove duplicated root key setup in btrfs_create_tree()  (Filipe Manana)
There's no need for an on stack key to define the root's key as we have already defined the key in the root itself. So remove the stack variable and use the key in the root. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03  btrfs: zoned: re-flow prepare_allocation_zoned()  (Johannes Thumshirn)
Re-flow prepare_allocation_zoned() to make it a bit more readable by returning early and removing unnecessary indentation. This patch does not change any functionality. Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03  btrfs: shrink the size of btrfs_bio  (Qu Wenruo)
This is done by:

- Shrinking btrfs_bio::mirror_num from a 32-bit unsigned int to u16.
  Normally the btrfs mirror number is either 0 (all profiles), 1 (all profiles), 2 (DUP/RAID1/RAID10/RAID5), 3 (RAID1C3) or 4 (RAID1C4). But for RAID6 the mirror number can go as large as the number of devices of that chunk. Currently the limit for the number of devices for a data chunk is BTRFS_MAX_DEVS(), which is around 500 for the default 16K nodesize. And with the maximum 64K nodesize, we can have a little over 2000 devices for a chunk. Although I'd argue it's way overkill, we don't reject such cases yet, thus u8 is not going to cut it and we have to use u16 (maxing out at 64K).

- Using bit fields for boolean members.
  Although it's not always safe for racy call sites, these members are safe:
  * csum_search_commit_root
  * is_scrub
    These two are set immediately after bbio allocation with no more writes after that, thus they are very safe.
  * async_csum
  * can_use_append
    These two are set for each split range, and after that there are no writes into them from different threads, thus they are also safe.
  And there is space for 4 more bits before increasing the size of btrfs_bio again, which should be future proof enough.

- Reordering the structure members.
  Now we always put the largest member first (after the huge 120 bytes union), making it easier to fill any holes.

This reduces the size of btrfs_bio by 8 bytes, from 312 bytes to 304 bytes.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
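A rough sketch of the resulting layout, using only the member names mentioned above (the ordering and the elided members are illustrative, not the exact struct):

	struct btrfs_bio {
		/* ... the 120-byte union and other large members come first ... */

		u16 mirror_num;			/* was a 32-bit unsigned int */

		/* Boolean flags packed into bit fields, 4 spare bits left. */
		bool csum_search_commit_root : 1;
		bool is_scrub : 1;
		bool async_csum : 1;
		bool can_use_append : 1;

		/* ... remaining members ... */
	};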
2026-02-03  btrfs: remove ASSERT compatibility for gcc < 8.x  (David Sterba)
The minimum gcc version has been 8 since commit 118c40b7b50340 ("kbuild: require gcc-8 and binutils-2.30"), so the workaround for missing __VA_OPT__ support is not needed anymore. Signed-off-by: David Sterba <dsterba@suse.com>
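For illustration, this is the general __VA_OPT__ idiom that a gcc >= 8 baseline allows an ASSERT()-style macro to rely on; it is not the exact btrfs macro, just a sketch of the pattern:

	/* Print an optional extra message only when one was passed in. */
	#define ASSERT(cond, ...)						\
	do {									\
		if (unlikely(!(cond))) {					\
			pr_crit("assertion failed: %s, in %s:%d\n",		\
				#cond, __FILE__, __LINE__);			\
			__VA_OPT__(pr_crit(__VA_ARGS__);)			\
			BUG();							\
		}								\
	} while (0)

Without __VA_OPT__, older compilers needed extra helper macros (or the GNU ", ##__VA_ARGS__" extension) to make the message argument optional.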
2026-02-03  btrfs: pass level to _btrfs_printk() to avoid parsing level from string  (David Sterba)
There's code in _btrfs_printk() to parse the message level from the input string so we can augment the message with the level description for better visibility in the logs. The parsing code has evolved over time, see commits:

- 40f7828b36e3b9 ("btrfs: better handle btrfs_printk() defaults")
- 262c5e86fec7cf ("printk/btrfs: handle more message headers")
- 533574c6bc30cf ("btrfs: use printk_get_level and printk_skip_level, add __printf, fix fallout")
- 4da35113426d16 ("btrfs: add varargs to btrfs_error")

As we are using the specific level helpers everywhere we can simply pass the message level so we don't have to parse it. The proper printk() message header is created as KERN_SOH + "level". Signed-off-by: David Sterba <dsterba@suse.com>
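A minimal sketch of building such a header from an explicit level instead of parsing it out of the format string; this only illustrates the KERN_SOH + level idea, not the actual _btrfs_printk() change:

	/* 'level' is the single level character, e.g. '6' for KERN_INFO. */
	static void btrfs_printk_sketch(char level, const char *fmt, ...)
	{
		char hdr[3] = { '\001' /* KERN_SOH_ASCII */, level, '\0' };
		struct va_format vaf;
		va_list args;

		va_start(args, fmt);
		vaf.fmt = fmt;
		vaf.va = &args;
		/* printk() picks the level up from the formatted prefix. */
		printk("%sBTRFS: %pV\n", hdr, &vaf);
		va_end(args);
	}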
2026-02-03  btrfs: simplify internal btrfs_printk helpers  (David Sterba)
The printk() can be compiled out depending on CONFIG_PRINTK, this is reflected in our helpers. The indirection is provided by btrfs_printk() used in the ratelimited and RCU wrapper macros. Drop the btrfs_printk() helper and define the ratelimit and RCU helpers directly when CONFIG_PRINTK is undefined. This will allow further changes to the _btrfs_printk() interface (which is internal), any message in other code should use the level-specific helpers. Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03  btrfs: rename btrfs_create_block_group_cache to btrfs_create_block_group  (Johannes Thumshirn)
struct btrfs_block_group used to be called struct btrfs_block_group_cache but got renamed to btrfs_block_group with commit 32da5386d9a4 ("btrfs: rename btrfs_block_group_cache"). Rename btrfs_create_block_group_cache() to btrfs_create_block_group() to reflect that change. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03  btrfs: merge setting ret and return ret  (David Sterba)
In many places we have the pattern:

	ret = ...;
	return ret;

This can be simplified to a direct return, removing 'ret' if it's not otherwise needed. The places in self tests are not converted so we can add more test cases without changing the surrounding code (extent-map-tests.c:test_case_4()). Signed-off-by: David Sterba <dsterba@suse.com>
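For illustration (the called function below is a stand-in name, not one from the patch), the transformation looks like:

	/* Before */
	static int example_before(struct btrfs_trans_handle *trans)
	{
		int ret;

		ret = btrfs_some_operation(trans);	/* hypothetical call */
		return ret;
	}

	/* After */
	static int example_after(struct btrfs_trans_handle *trans)
	{
		return btrfs_some_operation(trans);
	}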
2026-02-03  btrfs: remove dead assignment in prepare_one_folio()  (Massimiliano Pellizzer)
In prepare_one_folio(), ret is initialized to 0 at declaration, and in an error path we assign ret = 0 before jumping to the again label to retry the operation. However, ret is immediately overwritten by ret = set_folio_extent_mapped(folio) after the again label. Both assignments are never observed by any code path, therefore they can be safely removed. Signed-off-by: Massimiliano Pellizzer <mpellizzer.dev@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03  btrfs: replace for_each_set_bit() with for_each_set_bitmap()  (Qu Wenruo)
Inside extent_io.c, there are several simple call sites doing things like:

	for_each_set_bit(bit, bitmap, bitmap_size) {
		/* handle one fs block */
	}

The workload includes:

- set_bit()
  Inside extent_writepage_io(). This can be replaced with a bitmap_set().
- btrfs_folio_set_lock()
- btrfs_mark_ordered_io_finished()
  Inside writepage_delalloc(). Instead of calling it multiple times, we can pass a range into the function with one call.

Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03  btrfs: concentrate the error handling of submit_one_sector()  (Qu Wenruo)
Currently submit_one_sector() has only one failure path from btrfs_get_extent(). However the error handling is split into two parts, one inside submit_one_sector(), which clears the dirty flag and finishes the writeback for the fs block. The other part is to submit any remaining bio inside bio_ctrl and mark the ordered extent finished for the fs block. There is no special reason that we must split the error handling, let's just concentrate all the error handling into submit_one_sector(). Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03  btrfs: search for larger extent maps inside btrfs_do_readpage()  (Qu Wenruo)
[CORNER CASE]
If we have the following file extents layout, btrfs_get_extent() can return a smaller hole during read, and cause unnecessary extra tree searches:

	item 6 key (257 EXTENT_DATA 0) itemoff 15810 itemsize 53
		generation 9 type 1 (regular)
		extent data disk byte 13631488 nr 4096
		extent data offset 0 nr 4096 ram 4096
		extent compression 0 (none)
	item 7 key (257 EXTENT_DATA 32768) itemoff 15757 itemsize 53
		generation 9 type 1 (regular)
		extent data disk byte 13635584 nr 4096
		extent data offset 0 nr 4096 ram 4096
		extent compression 0 (none)

In the above case, ranges [0, 4K) and [32K, 36K) are regular extents, and there is a hole in range [4K, 32K). The fs has the "no-holes" feature, meaning the hole will not have a file extent item.

[INEFFICIENCY]
Assume the system has 4K page size, and we're doing readahead for range [4K, 32K), with no large folio yet.

	btrfs_readahead() for range [4K, 32K)
	|- btrfs_do_readpage() for folio 4K
	|  |- get_extent_map() for range [4K, 8K)
	|     |- btrfs_get_extent() for range [4K, 8K)
	|        We hit item 6, then the next item 7.
	|        At this stage we know range [4K, 32K) is a hole.
	|        But our search range is only [4K, 8K), not reaching 32K, thus
	|        we go into the not_found: tag, returning a hole em for [4K, 8K).
	|
	|- btrfs_do_readpage() for folio 8K
	|  |- get_extent_map() for range [8K, 12K)
	|     |- btrfs_get_extent() for range [8K, 12K)
	|        We hit the same item 6, and then item 7.
	|        But still we go to the not_found tag, inserting a new hole em,
	|        which will be merged with the previous one.
	|
	|  [ Repeat the same btrfs_get_extent() calls until the end ]

So we're calling btrfs_get_extent() again and again, just for a different part of the same hole range [4K, 32K).

[ENHANCEMENT]
Make btrfs_do_readpage() search for a larger extent map if readahead is involved. For btrfs_readahead() we have bio_ctrl::ractl set, and we lock extents for the whole readahead range. If we find bio_ctrl::ractl is set, we can use that end range as the extent map search end. This allows btrfs_get_extent() to return a much larger hole, thus reducing the need to call btrfs_get_extent() again and again.

	btrfs_readahead() for range [4K, 32K)
	|- btrfs_do_readpage() for folio 4K
	|  |- get_extent_map() for range [4K, 32K)
	|     |- btrfs_get_extent() for range [4K, 32K)
	|        We hit item 6, then the next item 7.
	|        At this stage we know range [4K, 32K) is a hole.
	|        So the hole em for range [4K, 32K) is returned.
	|
	|- btrfs_do_readpage() for folio 8K
	|  |- get_extent_map() for range [8K, 32K)
	|     The cached hole em range [4K, 32K) covers the range,
	|     and we reuse that em.
	|
	|  [ Repeat the same for the remaining folios until the end ]

Now we only call btrfs_get_extent() once for the whole range [4K, 32K), instead of the old 8 times. This change reduces the overhead of reading large holes a little. For the current experimental build (with larger folios) on aarch64, there is a tiny but consistent ~1% improvement reading a large hole file.

Reading a 1GiB sparse file (all hole) using xfs_io, with 64K block size; the result is the time needed to read the whole file, as reported by xfs_io. 32 runs, experimental build (with large folios), 64K page size, 4K fs block size:

- Avg before: 0.20823 s
- Avg after:  0.20635 s
- Diff:       -0.9%

Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03  btrfs: introduce BTRFS_PATH_AUTO_RELEASE() helper  (Qu Wenruo)
There are already several bugs involving on-stack btrfs_path, even though it is already a little safer than btrfs_path pointers (it only leaks the extent buffers, not the btrfs_path structure itself):

- Patch "btrfs: make sure extent and csum paths are always released in scrub_raid56_parity_stripe()"
- Patch "btrfs: fix a potential path leak in print_data_reloc_error()"

Thus there is a real need to apply auto release to those on-stack paths.

Introduce a new macro, BTRFS_PATH_AUTO_RELEASE(), which defines one on-stack btrfs_path structure, initializes it to all zeros, and calls btrfs_release_path() on it when exiting the scope.

This applies to the current 3 on-stack path usages:

- defrag_get_extent() in defrag.c
- print_data_reloc_error() in inode.c
  There is a special case where we want to release the path early, before the time consuming iterate_extent_inodes() call, thus that manual early release is kept as-is, with an extra comment added.
- scrub_raid56_parity_stripe() in scrub.c

Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
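A plausible sketch of such a macro built on the kernel's scope-based cleanup attribute; the helper name and exact definition here are assumptions, not necessarily what the patch uses:

	static inline void btrfs_path_auto_release(struct btrfs_path *path)
	{
		btrfs_release_path(path);
	}

	/* Declares a zero-initialized on-stack path released at end of scope. */
	#define BTRFS_PATH_AUTO_RELEASE(name)					\
		struct btrfs_path name __cleanup(btrfs_path_auto_release) = { 0 }

A caller would then write BTRFS_PATH_AUTO_RELEASE(path); and pass &path to btrfs_search_slot() as usual, without needing a btrfs_release_path() on every exit path.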
2026-02-03  btrfs: enable direct IO for bs > ps cases  (Qu Wenruo)
Previously direct IO was disabled if the fs block size was larger than the page size, for the following reasons:

- Iomap direct IO can split the range ignoring the fs block alignment, which could trigger the bio size check from btrfs_submit_bio().
- The buffer is only ensured to be contiguous in user space memory. The underlying physical memory is not ensured to be contiguous, and that can cause problems for the checksum generation/verification and RAID56 handling.

However the above problems are solved by the following upstream commits:

- 001397f5ef49 ("iomap: add IOMAP_DIO_FSBLOCK_ALIGNED flag")
  Which added an extra flag that can be utilized by the fs to ensure the bio submitted by iomap is always aligned to the fs block size.
- ec20799064c8 ("btrfs: enable encoded read/write/send for bs > ps cases")
- 8870dbeedcf9 ("btrfs: raid56: enable bs > ps support")
  Which make btrfs handle bios that are not backed by large folios but are still aligned to the fs block size.

As the commits have been merged we can enable direct IO support for bs > ps cases.

Reviewed-by: Neal Gompa <neal@gompa.dev> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03  btrfs: switch to library APIs for checksums  (Eric Biggers)
Make btrfs use the library APIs instead of crypto_shash, for all checksum computations. This has many benefits:

- Allows future checksum types, e.g. XXH3 or CRC64, to be more easily supported. Only a library API will be needed, not crypto_shash too.
- Eliminates the overhead of the generic crypto layer, including an indirect call for every function call and other API overhead. A microbenchmark of btrfs_check_read_bio() with crc32c checksums shows a speedup from 658 cycles to 608 cycles per 4096-byte block.
- Decreases the stack usage of btrfs by reducing the size of checksum contexts from 384 bytes to 240 bytes, and by eliminating the need for some functions to declare a checksum context at all.
- Increases reliability. The library functions always succeed and return void. In contrast, crypto_shash can fail and return errors. Also, the library functions are guaranteed to be available when btrfs is loaded; there's no longer any need to use module softdeps to try to work around the crypto modules sometimes not being loaded.
- Fixes a bug where blake2b checksums didn't work on kernels booted with fips=1. Since btrfs checksums are for integrity only, it's fine for them to use non-FIPS-approved algorithms.

Note that with having to handle 4 algorithms instead of just 1-2, this commit does result in a slightly positive diffstat. That being said, this wouldn't have been the case if btrfs had actually checked for errors from crypto_shash, which technically it should have been doing.

Reviewed-by: Ard Biesheuvel <ardb@kernel.org> Reviewed-by: Neal Gompa <neal@gompa.dev> Signed-off-by: Eric Biggers <ebiggers@kernel.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
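For illustration, the kind of direct library calls this switch enables looks roughly like the sketch below for the crc32c and sha256 cases (the seed handling and how btrfs actually wraps these are simplified assumptions; only crc32c() and sha256() are assumed to be the lib APIs):

	#include <linux/crc32c.h>	/* crc32c() from lib */
	#include <crypto/sha2.h>	/* sha256() from lib/crypto */
	#include <linux/unaligned.h>

	/* Never fails, unlike a crypto_shash digest call. */
	static void csum_block_crc32c(const u8 *data, size_t len, u8 *out)
	{
		u32 crc = crc32c(~0U, data, len);	/* seed shown for illustration */

		put_unaligned_le32(crc, out);
	}

	static void csum_block_sha256(const u8 *data, size_t len, u8 *out)
	{
		sha256(data, len, out);
	}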
2026-02-03  btrfs: zoned: don't zone append to conventional zone  (Johannes Thumshirn)
In case of a zoned RAID, it can happen that a data write is targeting a sequential write required zone and a conventional zone. In this case the bio will be marked as REQ_OP_ZONE_APPEND but for the conventional zone, this needs to be REQ_OP_WRITE. The setting of REQ_OP_ZONE_APPEND is deferred to the last possible time in btrfs_submit_dev_bio(), but the decision if we can use zone append is cached in btrfs_bio. CC: Naohiro Aota <naohiro.aota@wdc.com> Fixes: e9b9b911e03c ("btrfs: add raid stripe tree to features enabled with debug config") Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03  btrfs: relax squota parent qgroup deletion rule  (Boris Burkov)
Currently, with squotas, we do not allow removing a parent qgroup with no members if it still has usage accounted to it. This makes it really difficult to recover from accounting bugs, as we have no good way of getting back to 0 usage. Instead, allow the deletion (it's safe at 0 members) while still warning about the inconsistency by adding a squota parent check. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03  btrfs: check squota parent usage on membership change  (Boris Burkov)
We could have detected the quick inherit bug more directly if we had an extra warning about squota hierarchy consistency while modifying the hierarchy. In squotas, the parent usage always simply adds up to the sum of its children, so we can just check for that when changing membership and detect more accounting bugs. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03  btrfs: simplify boolean argument for btrfs_inc_ref()/btrfs_dec_ref()  (Sun YangKai)
Replace open-coded if/else blocks with the boolean directly and introduce local const bool variables, making the code shorter and easier to read. Signed-off-by: Sun YangKai <sunk67188@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03  btrfs: use true/false for boolean parameters in btrfs_inc_ref()/btrfs_dec_ref()  (Sun YangKai)
Replace integer literals 0/1 with true/false when calling btrfs_inc_ref() and btrfs_dec_ref() to make the code self-documenting and avoid mixing bool/integer types. Signed-off-by: Sun YangKai <sunk67188@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03  btrfs: update comment for visit_node_for_delete()  (Sun YangKai)
Drop the obsolete @refs parameter from the comment so the argument list matches the current function signature after commit f8c4d59de23c9 ("btrfs: drop unused parameter refs from visit_node_for_delete()"). Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Sun YangKai <sunk67188@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03  btrfs: raid56: fix memory leak of btrfs_raid_bio::stripe_uptodate_bitmap  (Filipe Manana)
We allocate the bitmap but we never free it in free_raid_bio_pointers(). Fix this by adding a bitmap_free() call against the stripe_uptodate_bitmap of a raid bio. Fixes: 1810350b04ef ("btrfs: raid56: move sector_ptr::uptodate into a dedicated bitmap") Reported-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/linux-btrfs/20260126045315.GA31641@lst.de/ Reviewed-by: Qu Wenruo <wqu@suse.com> Tested-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03  erofs: avoid some unnecessary #ifdefs  (Ferry Meng)
They can either be removed or replaced with IS_ENABLED(). Signed-off-by: Ferry Meng <mengferry@linux.alibaba.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2026-02-03  erofs: handle end of filesystem properly for file-backed mounts  (Gao Xiang)
I/O requests beyond the end of the filesystem should be zeroed out, similar to loopback devices and that is what we expect. Fixes: ce63cb62d794 ("erofs: support unencoded inodes for fileio") Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2026-02-03  erofs: separate plain and compressed filesystems formally  (Gao Xiang)
The EROFS on-disk format uses a tiny, plain metadata design that prioritizes performance and minimizes complex inconsistencies compared to common writable disk filesystems (almost all serious metadata inconsistencies cannot happen in well-designed immutable filesystems like EROFS). EROFS deliberately avoids artificial design flaws to eliminate serious security risks from untrusted remote sources by design, although human-made implementation bugs can still happen sometimes.

Currently, there is no strict check to prevent compressed inodes, especially LZ4-compressed inodes, from being read in plain filesystems. Starting with erofs-utils 1.0 and Linux 5.3, the LZ4_0PADDING sb feature is automatically enabled for LZ4-compressed EROFS images to support in-place decompression. Furthermore, since Linux 5.4 LTS is no longer supported, we no longer need to handle ancient LZ4-compressed EROFS images generated by erofs-utils prior to 1.0.

To formally distinguish different filesystem types for improved security:

- Use the presence of LZ4_0PADDING or a non-zero `dsb->u1.lz4_max_distance` as a marker for compressed filesystems containing LZ4-compressed inodes only;
- For other algorithms, use the `dsb->u1.available_compr_algs` bitmap.

Note: LZ4_0PADDING has been supported since Linux 5.4 (the first formal kernel version), so exposing it via sysfs is no longer necessary and is now deprecated (but keep it for five more years, until 2031): `dsb->u1` has been strictly non-zero for all EROFS images containing compressed inodes starting with erofs-utils v1.3, and it is actually a much better marker for compressed filesystems.

Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2026-02-03  erofs: use inode_set_cached_link()  (Gao Xiang)
Symlink lengths are now cached in in-memory inodes directly so that readlink can be sped up. Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2026-02-02  fs: consolidate fsverity_info lookup in buffer.c  (Christoph Hellwig)
Look up the fsverity_info once in end_buffer_async_read_io, and then pass it along to the I/O completion workqueue in struct postprocess_bh_ctx. This amortizes the lookup better once it becomes less efficient. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20260202060754.270269-8-hch@lst.de Signed-off-by: Eric Biggers <ebiggers@kernel.org>
2026-02-02  fsverity: push out fsverity_info lookup  (Christoph Hellwig)
Pass a struct fsverity_info to the verification and readahead helpers, and push the lookup into the callers. Right now this is a very dumb almost mechanic move that open codes a lot of fsverity_info_addr() calls in the file systems. The subsequent patches will clean this up. This prepares for reducing the number of fsverity_info lookups, which will allow to amortize them better when using a more expensive lookup method. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Acked-by: David Sterba <dsterba@suse.com> # btrfs Link: https://lore.kernel.org/r/20260202060754.270269-7-hch@lst.de Signed-off-by: Eric Biggers <ebiggers@kernel.org>