summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
8 daysbtrfs: zoned: fixup last alloc pointer after extent removal for RAID1Naohiro Aota
When a block group is composed of a sequential write zone and a conventional zone, we recover the (pseudo) write pointer of the conventional zone using the end of the last allocated position. However, if the last extent in a block group is removed, the last extent position will be smaller than the other real write pointer position. Then, that will cause an error due to mismatch of the write pointers. We can fixup this case by moving the alloc_offset to the corresponding write pointer position. Fixes: 568220fa9657 ("btrfs: zoned: support RAID0/1/10 on top of raid stripe tree") CC: stable@vger.kernel.org # 6.12+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
8 daysbtrfs: remove out label in btrfs_wait_for_commit()Filipe Manana
There is no point in having the label since all it does is return the value in the 'ret' variable. Instead make every goto return directly and remove the label. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
8 daysbtrfs: remove out label in btrfs_init_space_info()Filipe Manana
There is no point in having the label since all it does is return the value in the 'ret' variable. Instead make every goto return directly and remove the label. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
8 daysbtrfs: remove out label in btrfs_check_rw_degradable()Filipe Manana
There is no point in having the label since all it does is return the value in the 'ret' variable. Instead make every goto return directly and remove the label. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
8 daysbtrfs: remove out label in finish_verity()Filipe Manana
There is no point in having the label since all it does is return the value in the 'ret' variable. Instead make every goto return directly and remove the label. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
8 daysbtrfs: remove out label in scrub_find_fill_first_stripe()Filipe Manana
There is no point in having the label since all it does is return the value in the 'ret' variable. Instead make every goto return directly and remove the label. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
8 daysbtrfs: remove out label in lzo_decompress()Filipe Manana
There is no point in having the label since all it does is return the value in the 'ret' variable. Instead make every goto return directly and remove the label. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
8 daysbtrfs: remove out label in btrfs_mark_extent_written()Filipe Manana
There is no point in having the label since all it does is return the value in the 'ret' variable. Instead make every goto return directly and remove the label. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
8 daysbtrfs: remove out label in btrfs_csum_file_blocks()Filipe Manana
There is no point in having the label since all it does is return the value in the 'ret' variable. Instead make every goto return directly and remove the label. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
8 daysbtrfs: remove out_failed label in find_lock_delalloc_range()Filipe Manana
There is no point in having the label since all it does is return the value in the 'found' variable. Instead make every goto return directly and remove the label. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
8 daysbtrfs: remove out label in load_extent_tree_free()Filipe Manana
There is no point in having the label since all it does is return the value in the 'ret' variable. Instead make every goto return directly and remove the label. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
8 daysbtrfs: remove pointless out labels from uuid-tree.cFilipe Manana
Some functions (btrfs_uuid_iter_rem() and btrfs_check_uuid_tree_entry()) have an 'out' label that does nothing but return, making it pointless. Simplify this by removing the label and returning instead of gotos plus setting the 'ret' variable. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
8 daysbtrfs: remove pointless out labels from inode.cFilipe Manana
Some functions (insert_inline_extent() and insert_reserved_file_extent()) have an 'out' label that does nothing but return, making it pointless. Simplify this by removing the label and returning instead of gotos plus setting the 'ret' variable. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
8 daysbtrfs: remove pointless out labels from free-space-cache.cFilipe Manana
Some functions (update_cache_item(), find_free_space(), trim_bitmaps(), btrfs_remove_free_space() and cleanup_free_space_cache_v1()) have an 'out' label that does nothing but return, making it pointless. Simplify this by removing the label and returning instead of gotos plus setting the 'ret' variable. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
8 daysbtrfs: remove pointless out labels from extent-tree.cFilipe Manana
Some functions (lookup_extent_data_ref(), __btrfs_mod_ref() and btrfs_free_tree_block()) have an 'out' label that does nothing but return, making it pointless. Simplify this by removing the label and returning instead of gotos plus setting the 'ret' variable. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
8 daysbtrfs: remove pointless out labels from disk-io.cFilipe Manana
Some functions (btrfs_validate_extent_buffer() and btrfs_start_pre_rw_mount()) have an 'out' label that does nothing but return, making it pointless. Simplify this by removing the label and returning instead of gotos plus setting the 'ret' variable. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
8 daysbtrfs: remove pointless out labels from qgroup.cFilipe Manana
Some functions (__del_qgroup_relation() and qgroup_trace_new_subtree_blocks()) have an 'out' label that does nothing but return, making it pointless. Simplify this by removing the label and returning instead of gotos plus setting the 'ret' variable. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
8 daysbtrfs: remove pointless out labels from send.cFilipe Manana
Some functions (process_extent(), process_recorded_refs_if_needed(), changed_inode(), compare_refs() and changed_cb()) have an 'out' label that does nothing but return, making it pointless. Simplify this by removing the label and returning instead of gotos plus setting the 'ret' variable. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
8 daysbtrfs: remove pointless out labels from ioctl.cFilipe Manana
Some functions (__btrfs_ioctl_snap_create(), btrfs_ioctl_subvol_setflags() and copy_to_sk()) have an 'out' label that does nothing but return, making it pointless. Simplify this by removing the label and returning instead of gotos plus setting up the 'ret' variable. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
8 daysbtrfs: qgroup: return correct error when deleting qgroup relation itemFilipe Manana
If we fail to delete the second qgroup relation item, we end up returning success or -ENOENT in case the first item does not exist, instead of returning the error from the second item deletion. Fixes: 73798c465b66 ("btrfs: qgroup: Try our best to delete qgroup relations") Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
8 daysbtrfs: pass btrfs_fs_info to btrfs_first_delayed_node()David Sterba
As the delayed root is now in the fs_info we can pass it to btrfs_first_delayed_node(). Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
8 daysbtrfs: don't use local variables for fs_info->delayed_rootDavid Sterba
In all cases the delayed_root is used once in a function, we don't need to use a local variable for that. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
8 daysbtrfs: reorder members in btrfs_delayed_root for better packingDavid Sterba
There are two unnecessary 4B holes in btrfs_delayed_root; struct btrfs_delayed_root { spinlock_t lock; /* 0 4 */ /* XXX 4 bytes hole, try to pack */ struct list_head node_list; /* 8 16 */ struct list_head prepare_list; /* 24 16 */ atomic_t items; /* 40 4 */ atomic_t items_seq; /* 44 4 */ int nodes; /* 48 4 */ /* XXX 4 bytes hole, try to pack */ wait_queue_head_t wait; /* 56 24 */ /* size: 80, cachelines: 2, members: 7 */ /* sum members: 72, holes: 2, sum holes: 8 */ /* last cacheline: 16 bytes */ }; Reordering 'nodes' after 'lock' reduces size by 8B, to 72 on release config. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
8 daysbtrfs: embed delayed root to struct btrfs_fs_infoDavid Sterba
The fs_info::delayed_root is allocated dynamically but there's only one instance per filesystem so we can embed it into the fs_info itself. The two object have the same lifetime and delayed roots are always present so we don't need to allocate it on demand from slab. There's still some space left in fs_info until the 4K so there won't be an spill over to next page on release config (size grows from 3880 to 3952). In case we want to shrink fs_info there are still holes to fill or we can separate other non-core or optional structures if needed. Link: https://lore.kernel.org/all/cover.1767979013.git.dsterba@suse.com/ Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
8 daysbtrfs: add strict extent map alignment checksQu Wenruo
Currently we do not check the alignment of extent_map structure. The reasons are the inode and extent-map tests use unaligned values for start offsets and lengths. Thankfully those legacy problems are properly addressed by previous patches, now we can finally put the alignment checks into validate_extent_map(). Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
8 daysbtrfs: tests: prepare extent map tests for strict alignment checksQu Wenruo
Currently the extent map self tests have the following points that will cause false alerts for the incoming strict extent map alignment checks: - Incorrect inlined extent map size Which is not following what the kernel is doing for inlined extents, as btrfs_extent_item_to_extent_map() always uses the fs block size as the length, not the ram_bytes. Fix it by using SZ_4K as extent map's length. - Incorrect btrfs_fs_info::sectorsize As we always use PAGE_SIZE, which can be values larger than 4K. Meanwhile all the immediate numbers used are based on 4K fs block size in the test case. Fix it by using fixed SZ_4K fs block size when allocating the dummy btrfs_fs_info. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
8 daysbtrfs: tests: remove invalid file extent map testsQu Wenruo
In the inode self tests, there are several problems: - Invalid file extents E.g. hole range [4K, 4K + 4) is completely invalid. Only inlined extent maps can have an unaligned ram_bytes, and even for that case, the generated extent map will use sectorsize as em->len. - Unaligned hole after inlined extent The kernel never does this by itself, the current btrfs_get_extent() will only return a single inlined extent map that covers the first block. - Incorrect numbers in the comment E.g. 12291 no matter if you add or dec 1, is not aligned to 4K. The properly number for 12K is 12288, I don't know why there is even a diff of 3, and this completely doesn't match the extent map we inserted later. - Hard-to-modify sequence in setup_file_extents() If some unfortunate person, just like me, needs to modify setup_file_extents(), good luck not screwing up the file offset. Fix them by: - Remove invalid unaligned extent maps This mostly means remove the [4K, 4K + 4) hole case. The remaining ones are already properly aligned. This slightly changes the on-disk data extent allocation, with that removed, the regular extents at [4K, 8K) and [8K , 12K) can be merged. So also add a 4K gap between those two data extents to prevent em merge. - Remove the implied hole after an inlined extent Just like what the kernel is doing for inlined extents in the real world. - Update the commit using proper numbers with 'K' suffixes Since there is no unaligned range except the first inlined one, we can always use numbers with 'K' suffixes, which is way more easier to read, and will always be aligned to 1024 at least. - Add comments in setup_file_extents() So that we're clear about the file offset for each test file extent. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
8 daysbtrfs: unfold transaction aborts in btrfs_finish_one_ordered()Filipe Manana
We have a single transaction abort that can be caused either by a failure from a call to btrfs_mark_extent_written(), if we are dealing with a write to a prealloc extent, or otherwise from a call to insert_ordered_extent_file_extent(). So when the transaction abort happens we can not know for sure which case failed. Unfold the aborts so that it's clear in case of a failure. Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
8 daysbtrfs: deal with missing root in sample_block_group_extent_item()Filipe Manana
In case the root does not exists, which is unexpected, btrfs_extent_root() returns NULL, but we ignore that and so if it happens we can trigger a NULL pointer dereference later. So verify if we found the root and log an error message in case it's missing. Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
8 daysbtrfs: remove bogus root search condition in sample_block_group_extent_item()Filipe Manana
There's no need to pass the maximum between the block group's start offset and BTRFS_SUPER_INFO_OFFSET (64K) since we can't have any block groups allocated in the first megabyte, as that's reserved space. Furthermore, even if we could, the correct thing to do was to pass the block group's start offset anyway - and that's precisely what we do for block groups that happen to contain superblock mirror (the range for the super block is never marked as free and it's marked as dirty in the fs_info->excluded_extents io tree). So simplify this and get rid of that maximum expression. Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
8 daysbtrfs: fallback to buffered IO if the data profile has duplicationQu Wenruo
[BACKGROUND] Inspired by a recent kernel bug report, which is related to direct IO buffer modification during writeback, that leads to contents mismatch of different RAID1 mirrors. [CAUSE AND PROBLEMS] The root cause is exactly the same explained in commit 968f19c5b1b7 ("btrfs: always fallback to buffered write if the inode requires checksum"), that we can not trust direct IO buffer which can be modified halfway during writeback. Unlike data checksum verification, if this happened on inodes without data checksum but has the data has extra mirrors, it will lead to stealth data mismatch on different mirrors. This will be way harder to detect without data checksum. Furthermore for RAID56, we can even have data without checksum and data with checksum mixed inside the same full stripe. In that case if the direct IO buffer got changed halfway for the nodatasum part, the data with checksum immediately lost its ability to recover, e.g.: " " = Good old data or parity calculated using good old data "X" = Data modified during writeback 0 32K 64K Data 1 | | Has csum Data 2 |XXXXXXXXXXXXXXXX | No csum Parity | | In above case, the parity is calculated using data 1 (has csum, from page cache, won't change during writeback), and old data 2 (has no csum, direct IO write). After parity is calculated, but before submission to the storage, direct IO buffer of data 2 is modified, causing the range [0, 32K) of data 2 has a different content. Now all data is submitted to the storage, and the fs got fully synced. Then the device of data 1 is lost, has to be rebuilt from data 2 and parity. But since the data 2 has some modified data, and the parity is calculated using old data, the recovered data is no the same for data 1, causing data checksum mismatch. [FIX] Fix the problem by checking the data allocation profile. If our data allocation profile is either RAID0 or SINGLE, we can allow true zero-copy direct IO and the end user is fully responsible for any race. However this is not going to fix all situations, as it's still possible to race with balance where the fs got a new data profile after the data allocation profile check. But this fix should still greatly reduce the window of the original bug. Link: https://bugzilla.kernel.org/show_bug.cgi?id=99171 Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
8 daysbtrfs: assert block group is locked in btrfs_use_block_group_size_class()Filipe Manana
It's supposed to be called with the block group locked, in order to read and set its size_class member, so assert it's locked. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
8 daysbtrfs: don't pass block group argument to load_block_group_size_class()Filipe Manana
There's no need to pass the block group since we can extract it from the given caching control structure. Same goes for its helper function sample_block_group_extent_item(). Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
8 daysbtrfs: allocate path on stack in load_block_group_size_class()Filipe Manana
Instead of allocating and freeing a path in every iteration of load_block_group_size_class(), through its helper function sample_block_group_extent_item(), allocate the path in the former and pass it to the later. The path is allocated on stack since it's short and we are in a workqueue context so there's not much stack usage. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
8 daysbtrfs: make load_block_group_size_class() return voidFilipe Manana
There's no point in returning anything since determining and setting a size class for a block group is an optimization, not something critical. The only caller of load_block_group_size_class() (the caching thread) does not do anything with the return value anyway, exactly because having a size class is just an optimization and it can always be set later when adding reserved bytes to a block group (btrfs_add_reserved_bytes()). Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
8 daysbtrfs: zstd: use folio_iter to handle zstd_decompress_bio()Qu Wenruo
Currently zstd_decompress_bio() is using compressed_bio->compressed_folios[] array to grab each compressed folio. However cb->compressed_folios[] is just a pointer to each folio of the compressed bio, meaning we can just replace the compressed_folios[] array by just grabbing the folio inside the compressed bio. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
8 daysbtrfs: zlib: use folio_iter to handle zlib_decompress_bio()Qu Wenruo
Currently zlib_decompress_bio() is using compressed_bio->compressed_folios[] array to grab each compressed folio. However cb->compressed_folios[] is just a pointer to each folio of the compressed bio, meaning we can just replace the compressed_folios[] array by just grabbing the folio inside the compressed bio. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
8 daysbtrfs: lzo: use folio_iter to handle lzo_decompress_bio()Qu Wenruo
Currently lzo_decompress_bio() is using compressed_bio->compressed_folios[] array to grab each compressed folio. This is making the code much easier to read, as we only need to maintain a single iterator, @cur_in, and can easily grab any random folio using @cur_in >> min_folio_shift as an index. However lzo_decompress_bio() itself is ensured to only advance to the next folio at one time, and compressed_folios[] is just a pointer to each folio of the compressed bio, thus we have no real random access requirement for lzo_decompress_bio(). Replace the compressed_folios[] access by a helper, get_current_folio(), which uses folio_iter and an external folio counter to properly switch the folio when needed. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
8 daysbtrfs: consolidate reclaim readiness checks in btrfs_should_reclaim()Sun YangKai
Move the filesystem state validation from btrfs_reclaim_bgs_work() into btrfs_should_reclaim() to centralize the reclaim eligibility logic. Since it is the only caller of btrfs_should_reclaim(), there's no functional change. Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Sun YangKai <sunk67188@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
8 daysbtrfs: fix periodic reclaim conditionSun YangKai
Problems with current implementation: 1. reclaimable_bytes is signed while chunk_sz is unsigned, causing negative reclaimable_bytes to trigger reclaim unexpectedly 2. The "space must be freed between scans" assumption breaks the two-scan requirement: first scan marks block groups, second scan reclaims them. Without the second scan, no reclamation occurs. Instead, track actual reclaim progress: pause reclaim when block groups will be reclaimed, and resume only when progress is made. This ensures reclaim continues until no further progress can be made. And resume periodic reclaim when there's enough free space. And we take care if reclaim is making any progress now, so it's unnecessary to set periodic_reclaim_ready to false when failed to reclaim a block group. Fixes: 813d4c6422516 ("btrfs: prevent pathological periodic reclaim loops") CC: stable@vger.kernel.org # 6.12+ Suggested-by: Boris Burkov <boris@bur.io> Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Sun YangKai <sunk67188@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>
8 daysbtrfs: don't pass io_ctl to __btrfs_write_out_cache()Johannes Thumshirn
There's no need to pass both the block_group and block_group::io_ctl to __btrfs_write_out_cache(). Remove passing io_ctl to __btrfs_write_out_cache() and dereference it inside __btrfs_write_out_cache(). Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
8 daysbtrfs: use the btrfs_extent_map_end() helper everywhereFilipe Manana
We have a helper to calculate an extent map's exclusive end offset, but we only use it in some places. Update every site that open codes the calculation to use the helper. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
8 daysbtrfs: use the btrfs_block_group_end() helper everywhereFilipe Manana
We have a helper to calculate a block group's exclusive end offset, but we only use it in some places. Update every site that open codes the calculation to use the helper. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
8 daysbtrfs: remove bogus NULL checks in __btrfs_write_out_cache()Johannes Thumshirn
Dan reported a new smatch warning in free-space-cache.c: New smatch warnings: fs/btrfs/free-space-cache.c:1207 write_pinned_extent_entries() warn: variable dereferenced before check 'block_group' (see line 1203) But the check if the block_group pointer is NULL is bogus, because to get to this point block_group::io_ctl has already been dereferenced further up the call-chain when calling __btrfs_write_out_cache() from btrfs_write_out_cache(). Remove the bogus checks for block_group == NULL in __btrfs_write_out_cache() and it's siblings. Reported-by: kernel test robot <lkp@intel.com> Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Closes: https://lore.kernel.org/r/202601170636.WsePMV5H-lkp@intel.com/ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
8 daysbtrfs: populate fully_remapped_bgs_list on mountMark Harmstone
Add a function btrfs_populate_fully_remapped_bgs_list() which gets called on mount, which looks for fully remapped block groups (i.e. identity_remap_count == 0) which haven't yet had their chunk stripes and device extents removed. This happens when a filesystem is unmounted while async discard has not yet finished, as otherwise the data range occupied by the chunk stripes would be permanently unusable. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Mark Harmstone <mark@harmstone.com> Signed-off-by: David Sterba <dsterba@suse.com>
8 daysbtrfs: handle discarding fully-remapped block groupsMark Harmstone
Discard normally works by iterating over the free-space entries of a block group. This doesn't work for fully-remapped block groups, as we removed their free-space entries when we started relocation. For sync discard, call btrfs_discard_extent() when we commit the transaction in which the last identity remap was removed. For async discard, add a new function btrfs_trim_fully_remapped_block_group() to be called by the discard worker, which iterates over the block group's range using the normal async discard rules. Once we reach the end, remove the chunk's stripes and device extents to get back its free space. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Mark Harmstone <mark@harmstone.com> Signed-off-by: David Sterba <dsterba@suse.com>
8 daysbtrfs: allow balancing remap treeMark Harmstone
Balancing the METADATA_REMAP chunk, i.e. the chunk in which the remap tree lives, is a special case. We can't use the remap tree itself for this, as then we'd have no way to boostrap it on mount. And we can't use the pre-remap tree code for this as it relies on walking the extent tree, and we're not creating backrefs for METADATA_REMAP chunks. So instead, if a balance would relocate any METADATA_REMAP block groups, mark those block groups as readonly and COW every leaf of the remap tree. There's more sophisticated ways of doing this, such as only COWing nodes within a block group that's to be relocated, but they're fiddly and with lots of edge cases. Plus it's not anticipated that a) the number of METADATA_REMAP chunks is going to be particularly large, or b) that users will want to only relocate some of these chunks - the main use case here is to unbreak RAID conversion and device removal. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Mark Harmstone <mark@harmstone.com> Signed-off-by: David Sterba <dsterba@suse.com>
8 daysbtrfs: add do_remap parameter to btrfs_discard_extent()Mark Harmstone
btrfs_discard_extent() can be called either when an extent is removed or from walking the free-space tree. With a remapped block group these two things are no longer equivalent: the extent's addresses are remapped, while the free-space tree exclusively uses underlying addresses. Add a do_remap parameter to btrfs_discard_extent() and btrfs_map_discard(), saying whether or not the address needs to be run through the remap tree first. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Mark Harmstone <mark@harmstone.com> Signed-off-by: David Sterba <dsterba@suse.com>
8 daysbtrfs: replace identity remaps with actual remaps when doing relocationsMark Harmstone
Add a function do_remap_tree_reloc(), which does the actual work of doing a relocation using the remap tree. In a loop we call do_remap_reloc_trans(), which searches for the first identity remap for the block group. We call btrfs_reserve_extent() to find space elsewhere for it, and read the data into memory and write it to the new location. We then carve out the identity remap and replace it with an actual remap, which points to the new location in which to look. Once the last identity remap has been removed we call last_identity_remap_gone(), which, as with deletions, removes the chunk's stripes and device extents. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Mark Harmstone <mark@harmstone.com> Signed-off-by: David Sterba <dsterba@suse.com>
8 daysbtrfs: move existing remaps before relocating block groupMark Harmstone
If when relocating a block group we find that `remap_bytes` > 0 in its block group item, that means that it has been the destination block group for another that has been remapped. We need to search the remap tree for any remap backrefs within this range, and move the data to a third block group. This is because otherwise btrfs_translate_remap() could end up following an unbounded chain of remaps, which would only get worse over time. We only relocate one block group at a time, so `remap_bytes` will only ever go down while we are doing this. Once we're finished we set the REMAPPED flag on the block group, which will permanently prevent any other data from being moved to within it. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Mark Harmstone <mark@harmstone.com> Signed-off-by: David Sterba <dsterba@suse.com>