summaryrefslogtreecommitdiff
path: root/fs/btrfs
AgeCommit message (Collapse)Author
8 daysbtrfs: adjust subpage bit start based on sectorsizeJosef Bacik
[ Upstream commit e08e49d986f82c30f42ad0ed43ebbede1e1e3739 ] When running machines with 64k page size and a 16k nodesize we started seeing tree log corruption in production. This turned out to be because we were not writing out dirty blocks sometimes, so this in fact affects all metadata writes. When writing out a subpage EB we scan the subpage bitmap for a dirty range. If the range isn't dirty we do bit_start++; to move onto the next bit. The problem is the bitmap is based on the number of sectors that an EB has. So in this case, we have a 64k pagesize, 16k nodesize, but a 4k sectorsize. This means our bitmap is 4 bits for every node. With a 64k page size we end up with 4 nodes per page. To make this easier this is how everything looks [0 16k 32k 48k ] logical address [0 4 8 12 ] radix tree offset [ 64k page ] folio [ 16k eb ][ 16k eb ][ 16k eb ][ 16k eb ] extent buffers [ | | | | | | | | | | | | | | | | ] bitmap Now we use all of our addressing based on fs_info->sectorsize_bits, so as you can see the above our 16k eb->start turns into radix entry 4. When we find a dirty range for our eb, we correctly do bit_start += sectors_per_node, because if we start at bit 0, the next bit for the next eb is 4, to correspond to eb->start 16k. However if our range is clean, we will do bit_start++, which will now put us offset from our radix tree entries. In our case, assume that the first time we check the bitmap the block is not dirty, we increment bit_start so now it == 1, and then we loop around and check again. This time it is dirty, and we go to find that start using the following equation start = folio_start + bit_start * fs_info->sectorsize; so in the case above, eb->start 0 is now dirty, and we calculate start as 0 + 1 * fs_info->sectorsize = 4096 4096 >> 12 = 1 Now we're looking up the radix tree for 1, and we won't find an eb. What's worse is now we're using bit_start == 1, so we do bit_start += sectors_per_node, which is now 5. If that eb is dirty we will run into the same thing, we will look at an offset that is not populated in the radix tree, and now we're skipping the writeout of dirty extent buffers. The best fix for this is to not use sectorsize_bits to address nodes, but that's a larger change. Since this is a fs corruption problem fix it simply by always using sectors_per_node to increment the start bit. Fixes: c4aec299fa8f ("btrfs: introduce submit_eb_subpage() to submit a subpage metadata page") CC: stable@vger.kernel.org # 5.15+ Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com> [ Adjust context ] Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
8 daysbtrfs: avoid load/store tearing races when checking if an inode was loggedFilipe Manana
[ Upstream commit 986bf6ed44dff7fbae7b43a0882757ee7f5ba21b ] At inode_logged() we do a couple lockless checks for ->logged_trans, and these are generally safe except the second one in case we get a load or store tearing due to a concurrent call updating ->logged_trans (either at btrfs_log_inode() or later at inode_logged()). In the first case it's safe to compare to the current transaction ID since once ->logged_trans is set the current transaction, we never set it to a lower value. In the second case, where we check if it's greater than zero, we are prone to load/store tearing races, since we can have a concurrent task updating to the current transaction ID with store tearing for example, instead of updating with a single 64 bits write, to update with two 32 bits writes or four 16 bits writes. In that case the reading side at inode_logged() could see a positive value that does not match the current transaction and then return a false negative. Fix this by doing the second check while holding the inode's spinlock, add some comments about it too. Also add the data_race() annotation to the first check to avoid any reports from KCSAN (or similar tools) and comment about it. Fixes: 0f8ce49821de ("btrfs: avoid inode logging during rename and link when possible") Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
8 daysbtrfs: fix race between setting last_dir_index_offset and inode loggingFilipe Manana
[ Upstream commit 59a0dd4ab98970086fd096281b1606c506ff2698 ] At inode_logged() if we find that the inode was not logged before we update its ->last_dir_index_offset to (u64)-1 with the goal that the next directory log operation will see the (u64)-1 and then figure out it must check what was the index of the last logged dir index key and update ->last_dir_index_offset to that key's offset (this is done in update_last_dir_index_offset()). This however has a possibility for a time window where a race can happen and lead to directory logging skipping dir index keys that should be logged. The race happens like this: 1) Task A calls inode_logged(), sees ->logged_trans as 0 and then checks that the inode item was logged before, but before it sets the inode's ->last_dir_index_offset to (u64)-1... 2) Task B is at btrfs_log_inode() which calls inode_logged() early, and that has set ->last_dir_index_offset to (u64)-1; 3) Task B then enters log_directory_changes() which calls update_last_dir_index_offset(). There it sees ->last_dir_index_offset is (u64)-1 and that the inode was logged before (ctx->logged_before is true), and so it searches for the last logged dir index key in the log tree and it finds that it has an offset (index) value of N, so it sets ->last_dir_index_offset to N, so that we can skip index keys that are less than or equal to N (later at process_dir_items_leaf()); 4) Task A now sets ->last_dir_index_offset to (u64)-1, undoing the update that task B just did; 5) Task B will now skip every index key when it enters process_dir_items_leaf(), since ->last_dir_index_offset is (u64)-1. Fix this by making inode_logged() not touch ->last_dir_index_offset and initializing it to 0 when an inode is loaded (at btrfs_alloc_inode()) and then having update_last_dir_index_offset() treat a value of 0 as meaning we must check the log tree and update with the index of the last logged index key. This is fine since the minimum possible value for ->last_dir_index_offset is 1 (BTRFS_DIR_START_INDEX - 1 = 2 - 1 = 1). This also simplifies the management of ->last_dir_index_offset and now all accesses to it are done under the inode's log_mutex. Fixes: 0f8ce49821de ("btrfs: avoid inode logging during rename and link when possible") Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
8 daysbtrfs: fix race between logging inode and checking if it was logged beforeFilipe Manana
[ Upstream commit ef07b74e1be56f9eafda6aadebb9ebba0743c9f0 ] There's a race between checking if an inode was logged before and logging an inode that can cause us to mark an inode as not logged just after it was logged by a concurrent task: 1) We have inode X which was not logged before neither in the current transaction not in past transaction since the inode was loaded into memory, so it's ->logged_trans value is 0; 2) We are at transaction N; 3) Task A calls inode_logged() against inode X, sees that ->logged_trans is 0 and there is a log tree and so it proceeds to search in the log tree for an inode item for inode X. It doesn't see any, but before it sets ->logged_trans to N - 1... 3) Task B calls btrfs_log_inode() against inode X, logs the inode and sets ->logged_trans to N; 4) Task A now sets ->logged_trans to N - 1; 5) At this point anyone calling inode_logged() gets 0 (inode not logged) since ->logged_trans is greater than 0 and less than N, but our inode was really logged. As a consequence operations like rename, unlink and link that happen afterwards in the current transaction end up not updating the log when they should. Fix this by ensuring inode_logged() only updates ->logged_trans in case the inode item is not found in the log tree if after tacking the inode's lock (spinlock struct btrfs_inode::lock) the ->logged_trans value is still zero, since the inode lock is what protects setting ->logged_trans at btrfs_log_inode(). Fixes: 0f8ce49821de ("btrfs: avoid inode logging during rename and link when possible") Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-08-28btrfs: send: make fs_path_len() inline and constify its argumentFilipe Manana
[ Upstream commit 920e8ee2bfcaf886fd8c0ad9df097a7dddfeb2d8 ] The helper function fs_path_len() is trivial and doesn't need to change its path argument, so make it inline and constify the argument. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-08-28btrfs: send: use fallocate for hole punching with send stream v2Filipe Manana
[ Upstream commit 005b0a0c24e1628313e951516b675109a92cacfe ] Currently holes are sent as writes full of zeroes, which results in unnecessarily using disk space at the receiving end and increasing the stream size. In some cases we avoid sending writes of zeroes, like during a full send operation where we just skip writes for holes. But for some cases we fill previous holes with writes of zeroes too, like in this scenario: 1) We have a file with a hole in the range [2M, 3M), we snapshot the subvolume and do a full send. The range [2M, 3M) stays as a hole at the receiver since we skip sending write commands full of zeroes; 2) We punch a hole for the range [3M, 4M) in our file, so that now it has a 2M hole in the range [2M, 4M), and snapshot the subvolume. Now if we do an incremental send, we will send write commands full of zeroes for the range [2M, 4M), removing the hole for [2M, 3M) at the receiver. We could improve cases such as this last one by doing additional comparisons of file extent items (or their absence) between the parent and send snapshots, but that's a lot of code to add plus additional CPU and IO costs. Since the send stream v2 already has a fallocate command and btrfs-progs implements a callback to execute fallocate since the send stream v2 support was added to it, update the kernel to use fallocate for punching holes for V2+ streams. Test coverage is provided by btrfs/284 which is a version of btrfs/007 that exercises send stream v2 instead of v1, using fsstress with random operations and fssum to verify file contents. Link: https://github.com/kdave/btrfs-progs/issues/1001 CC: stable@vger.kernel.org # 6.1+ Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-08-28btrfs: send: avoid path allocation for the current inode when issuing commandsFilipe Manana
[ Upstream commit 374d45af6435534a11b01b88762323abf03dd755 ] Whenever we issue a command we allocate a path and then compute it. For the current inode this is not necessary since we have one preallocated and computed in the send context structure, so we can use it instead and avoid allocating and freeing a path. For example if we have 100 extents to send (100 write commands) for a file, we are allocating and freeing paths 100 times. So improve on this by avoiding path allocation and freeing whenever a command is for the current inode by using the current inode's path stored in the send context structure. A test was run before applying this patch and the previous one in the series: "btrfs: send: keep the current inode's path cached" The test script is the following: $ cat test.sh #!/bin/bash DEV=/dev/nullb0 MNT=/mnt/nullb0 mkfs.btrfs -f $DEV > /dev/null mount $DEV $MNT DIR="$MNT/one/two/three/four" FILE="$DIR/foobar" mkdir -p $DIR # Create some empty files to get a deeper btree and therefore make # path computations slower. for ((i = 1; i <= 30000; i++)); do echo -n > "$DIR/filler_$i" done for ((i = 0; i < 10000; i += 2)); do offset=$(( i * 4096 )) xfs_io -f -c "pwrite -S 0xab $offset 4K" $FILE > /dev/null done btrfs subvolume snapshot -r $MNT $MNT/snap start=$(date +%s%N) btrfs send -f /dev/null $MNT/snap end=$(date +%s%N) echo -e "\nsend took $(( (end - start) / 1000000 )) milliseconds" umount $MNT Result before applying the 2 patches: 1121 milliseconds Result after applying the 2 patches: 815 milliseconds (-31.6%) Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Stable-dep-of: 005b0a0c24e1 ("btrfs: send: use fallocate for hole punching with send stream v2") Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-08-28btrfs: send: keep the current inode's path cachedFilipe Manana
[ Upstream commit fc746acb7aa9aeaa2cb5dcba449323319ba5c8eb ] Whenever we need to send a command for the current inode, like sending writes, xattr updates, truncates, utimes, etc, we compute the inode's path each time, which implies doing some memory allocations and traversing the inode hierarchy to extract the name of the inode and each ancestor directory, and that implies doing lookups in the subvolume tree amongst other operations. Most of the time, by far, the current inode's path doesn't change while we are processing it (like if we need to issue 100 write commands, the path remains the same and it's pointless to compute it 100 times). To avoid this keep the current inode's path cached in the send context and invalidate it or update it whenever it's needed (after unlinks or renames). A performance test, and its results, is mentioned in the next patch in the series (subject: "btrfs: send: avoid path allocation for the current inode when issuing commands"). Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Stable-dep-of: 005b0a0c24e1 ("btrfs: send: use fallocate for hole punching with send stream v2") Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-08-28btrfs: send: add and use helper to rename current inode when processing refsFilipe Manana
[ Upstream commit ec666c84deba56f714505b53556a97565f72db86 ] Extract the logic to rename the current inode at process_recorded_refs() into a helper function and use it, therefore removing duplicated logic and making it easier for an upcoming patch by avoiding yet more duplicated logic. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Stable-dep-of: 005b0a0c24e1 ("btrfs: send: use fallocate for hole punching with send stream v2") Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-08-28btrfs: send: only use boolean variables at process_recorded_refs()Filipe Manana
[ Upstream commit 9453fe329789073d9a971de01da5902c32c1a01a ] We have several local variables at process_recorded_refs() that are used as booleans, with some of them having a 'bool' type while two of them having an 'int' type. Change this to make them all use the 'bool' type which is more clear and to make everything more consistent. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Stable-dep-of: 005b0a0c24e1 ("btrfs: send: use fallocate for hole punching with send stream v2") Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-08-28btrfs: send: factor out common logic when sending xattrsFilipe Manana
[ Upstream commit 17f6a74d0b89092e38e3328b66eda1ab29a195d4 ] We always send xattrs for the current inode only and both callers of send_set_xattr() pass a path for the current inode. So move the path allocation and computation to send_set_xattr(), reducing duplicated code. This also facilitates an upcoming patch. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Stable-dep-of: 005b0a0c24e1 ("btrfs: send: use fallocate for hole punching with send stream v2") Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-08-28btrfs: populate otime when logging an inode itemQu Wenruo
[ Upstream commit 1ef94169db0958d6de39f9ea6e063ce887342e2d ] [TEST FAILURE WITH EXPERIMENTAL FEATURES] When running test case generic/508, the test case will fail with the new btrfs shutdown support: generic/508 - output mismatch (see /home/adam/xfstests/results//generic/508.out.bad) # --- tests/generic/508.out 2022-05-11 11:25:30.806666664 +0930 # +++ /home/adam/xfstests/results//generic/508.out.bad 2025-07-02 14:53:22.401824212 +0930 # @@ -1,2 +1,6 @@ # QA output created by 508 # Silence is golden # +Before: # +After : stat.btime = Thu Jan 1 09:30:00 1970 # +Before: # +After : stat.btime = Wed Jul 2 14:53:22 2025 # ... # (Run 'diff -u /home/adam/xfstests/tests/generic/508.out /home/adam/xfstests/results//generic/508.out.bad' to see the entire diff) Ran: generic/508 Failures: generic/508 Failed 1 of 1 tests Please note that the test case requires shutdown support, thus the test case will be skipped using the current upstream kernel, as it doesn't have shutdown ioctl support. [CAUSE] The direct cause the 0 time stamp in the log tree: leaf 30507008 items 2 free space 16057 generation 9 owner TREE_LOG leaf 30507008 flags 0x1(WRITTEN) backref revision 1 checksum stored e522548d checksum calced e522548d fs uuid 57d45451-481e-43e4-aa93-289ad707a3a0 chunk uuid d52bd3fd-5163-4337-98a7-7986993ad398 item 0 key (257 INODE_ITEM 0) itemoff 16123 itemsize 160 generation 9 transid 9 size 0 nbytes 0 block group 0 mode 100644 links 1 uid 0 gid 0 rdev 0 sequence 1 flags 0x0(none) atime 1751432947.492000000 (2025-07-02 14:39:07) ctime 1751432947.492000000 (2025-07-02 14:39:07) mtime 1751432947.492000000 (2025-07-02 14:39:07) otime 0.0 (1970-01-01 09:30:00) <<< But the old fs tree has all the correct time stamp: btrfs-progs v6.12 fs tree key (FS_TREE ROOT_ITEM 0) leaf 30425088 items 2 free space 16061 generation 5 owner FS_TREE leaf 30425088 flags 0x1(WRITTEN) backref revision 1 checksum stored 48f6c57e checksum calced 48f6c57e fs uuid 57d45451-481e-43e4-aa93-289ad707a3a0 chunk uuid d52bd3fd-5163-4337-98a7-7986993ad398 item 0 key (256 INODE_ITEM 0) itemoff 16123 itemsize 160 generation 3 transid 0 size 0 nbytes 16384 block group 0 mode 40755 links 1 uid 0 gid 0 rdev 0 sequence 0 flags 0x0(none) atime 1751432947.0 (2025-07-02 14:39:07) ctime 1751432947.0 (2025-07-02 14:39:07) mtime 1751432947.0 (2025-07-02 14:39:07) otime 1751432947.0 (2025-07-02 14:39:07) <<< The root cause is that fill_inode_item() in tree-log.c is only populating a/c/m time, not the otime (or btime in statx output). Part of the reason is that, the vfs inode only has a/c/m time, no native btime support yet. [FIX] Thankfully btrfs has its otime stored in btrfs_inode::i_otime_sec and btrfs_inode::i_otime_nsec. So what we really need is just fill the otime time stamp in fill_inode_item() of tree-log.c There is another fill_inode_item() in inode.c, which is doing the proper otime population. Fixes: 94edf4ae43a5 ("Btrfs: don't bother committing delayed inode updates when fsyncing") CC: stable@vger.kernel.org Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-08-28btrfs: constify more pointer parametersDavid Sterba
[ Upstream commit ca283ea9920ac20ae23ed398b693db3121045019 ] Continue adding const to parameters. This is for clarity and minor addition to safety. There are some minor effects, in the assembly code and .ko measured on release config. Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-08-28btrfs: fix ssd_spread overallocationBoris Burkov
[ Upstream commit 807d9023e75fc20bfd6dd2ac0408ce4af53f1648 ] If the ssd_spread mount option is enabled, then we run the so called clustered allocator for data block groups. In practice, this results in creating a btrfs_free_cluster which caches a block_group and borrows its free extents for allocation. Since the introduction of allocation size classes in 6.1, there has been a bug in the interaction between that feature and ssd_spread. find_free_extent() has a number of nested loops. The loop going over the allocation stages, stored in ffe_ctl->loop and managed by find_free_extent_update_loop(), the loop over the raid levels, and the loop over all the block_groups in a space_info. The size class feature relies on the block_group loop to ensure it gets a chance to see a block_group of a given size class. However, the clustered allocator uses the cached cluster block_group and breaks that loop. Each call to do_allocation() will really just go back to the same cached block_group. Normally, this is OK, as the allocation either succeeds and we don't want to loop any more or it fails, and we clear the cluster and return its space to the block_group. But with size classes, the allocation can succeed, then later fail, outside of do_allocation() due to size class mismatch. That latter failure is not properly handled due to the highly complex multi loop logic. The result is a painful loop where we continue to allocate the same num_bytes from the cluster in a tight loop until it fails and releases the cluster and lets us try a new block_group. But by then, we have skipped great swaths of the available block_groups and are likely to fail to allocate, looping the outer loop. In pathological cases like the reproducer below, the cached block_group is often the very last one, in which case we don't perform this tight bg loop but instead rip through the ffe stages to LOOP_CHUNK_ALLOC and allocate a chunk, which is now the last one, and we enter the tight inner loop until an allocation failure. Then allocation succeeds on the final block_group and if the next allocation is a size mismatch, the exact same thing happens again. Triggering this is as easy as mounting with -o ssd_spread and then running: mount -o ssd_spread $dev $mnt dd if=/dev/zero of=$mnt/big bs=16M count=1 &>/dev/null dd if=/dev/zero of=$mnt/med bs=4M count=1 &>/dev/null sync if you do the two writes + sync in a loop, you can force btrfs to spin an excessive amount on semi-successful clustered allocations, before ultimately failing and advancing to the stage where we force a chunk allocation. This results in 2G of data allocated per iteration, despite only using ~20M of data. By using a small size classed extent, the inner loop takes longer and we can spin for longer. The simplest, shortest term fix to unbreak this is to make the clustered allocator size_class aware in the dumbest way, where it fails on size class mismatch. This may hinder the operation of the clustered allocator, but better hindered than completely broken and terribly overallocating. Further re-design improvements are also in the works. Fixes: 52bb7a2166af ("btrfs: introduce size class to block group allocator") CC: stable@vger.kernel.org # 6.1+ Reported-by: David Sterba <dsterba@suse.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-08-28btrfs: open code timespec64 in struct btrfs_inodeDavid Sterba
[ Upstream commit c6e8f898f56fae2cb5bc4396bec480f23cd8b066 ] The type of timespec64::tv_nsec is 'unsigned long', while we have only u32 for on-disk and in-memory. This wastes a few bytes in btrfs_inode. Add separate members for sec and nsec with the corresponding type width. This creates a 4 byte hole in btrfs_inode which can be utilized in the future. Signed-off-by: David Sterba <dsterba@suse.com> Stable-dep-of: 1ef94169db09 ("btrfs: populate otime when logging an inode item") Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-08-28btrfs: abort transaction on unexpected eb generation at btrfs_copy_root()Filipe Manana
[ Upstream commit 33e8f24b52d2796b8cfb28c19a1a7dd6476323a8 ] If we find an unexpected generation for the extent buffer we are cloning at btrfs_copy_root(), we just WARN_ON() and don't error out and abort the transaction, meaning we allow to persist metadata with an unexpected generation. Instead of warning only, abort the transaction and return -EUCLEAN. CC: stable@vger.kernel.org # 6.1+ Reviewed-by: Daniel Vacek <neelx@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-08-28btrfs: always abort transaction on failure to add block group to free space treeFilipe Manana
[ Upstream commit 1f06c942aa709d397cf6bed577a0d10a61509667 ] Only one of the callers of __add_block_group_free_space() aborts the transaction if the call fails, while the others don't do it and it's either never done up the call chain or much higher in the call chain. So make sure we abort the transaction at __add_block_group_free_space() if it fails, which brings a couple benefits: 1) If some call chain never aborts the transaction, we avoid having some metadata inconsistency because BLOCK_GROUP_FLAG_NEEDS_FREE_SPACE is cleared when we enter __add_block_group_free_space() and therefore __add_block_group_free_space() is never called again to add the block group items to the free space tree, since the function is only called when that flag is set in a block group; 2) If the call chain already aborts the transaction, then we get a better trace that points to the exact step from __add_block_group_free_space() which failed, which is better for analysis. So abort the transaction at __add_block_group_free_space() if any of its steps fails. CC: stable@vger.kernel.org # 6.6+ Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-08-28btrfs: move transaction aborts to the error site in add_block_group_free_space()David Sterba
[ Upstream commit b63c8c1ede4407835cb8c8bed2014d96619389f3 ] Transaction aborts should be done next to the place the error happens, which was not done in add_block_group_free_space(). Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Stable-dep-of: 1f06c942aa70 ("btrfs: always abort transaction on failure to add block group to free space tree") Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-08-28btrfs: qgroup: fix race between quota disable and quota rescan ioctlFilipe Manana
[ Upstream commit e1249667750399a48cafcf5945761d39fa584edf ] There's a race between a task disabling quotas and another running the rescan ioctl that can result in a use-after-free of qgroup records from the fs_info->qgroup_tree rbtree. This happens as follows: 1) Task A enters btrfs_ioctl_quota_rescan() -> btrfs_qgroup_rescan(); 2) Task B enters btrfs_quota_disable() and calls btrfs_qgroup_wait_for_completion(), which does nothing because at that point fs_info->qgroup_rescan_running is false (it wasn't set yet by task A); 3) Task B calls btrfs_free_qgroup_config() which starts freeing qgroups from fs_info->qgroup_tree without taking the lock fs_info->qgroup_lock; 4) Task A enters qgroup_rescan_zero_tracking() which starts iterating the fs_info->qgroup_tree tree while holding fs_info->qgroup_lock, but task B is freeing qgroup records from that tree without holding the lock, resulting in a use-after-free. Fix this by taking fs_info->qgroup_lock at btrfs_free_qgroup_config(). Also at btrfs_qgroup_rescan() don't start the rescan worker if quotas were already disabled. Reported-by: cen zhang <zzzccc427@gmail.com> Link: https://lore.kernel.org/linux-btrfs/CAFRLqsV+cMDETFuzqdKSHk_FDm6tneea45krsHqPD6B3FetLpQ@mail.gmail.com/ CC: stable@vger.kernel.org # 6.1+ Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> [ Check for BTRFS_FS_QUOTA_ENABLED, instead of btrfs_qgroup_full_accounting() ] Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-08-28btrfs: don't ignore inode missing when replaying log treeFilipe Manana
[ Upstream commit 7ebf381a69421a88265d3c49cd0f007ba7336c9d ] During log replay, at add_inode_ref(), we return -ENOENT if our current inode isn't found on the subvolume tree or if a parent directory isn't found. The error comes from btrfs_iget_logging() <- btrfs_iget() <- btrfs_read_locked_inode(). The single caller of add_inode_ref(), replay_one_buffer(), ignores an -ENOENT error because it expects that error to mean only that a parent directory wasn't found and that is ok. Before commit 5f61b961599a ("btrfs: fix inode lookup error handling during log replay") we were converting any error when getting a parent directory to -ENOENT and any error when getting the current inode to -EIO, so our caller would fail log replay in case we can't find the current inode. After that commit however in case the current inode is not found we return -ENOENT to the caller and therefore it ignores the critical fact that the current inode was not found in the subvolume tree. Fix this by converting -ENOENT to 0 when we don't find a parent directory, returning -ENOENT when we don't find the current inode and making the caller, replay_one_buffer(), not ignore -ENOENT anymore. Fixes: 5f61b961599a ("btrfs: fix inode lookup error handling during log replay") CC: stable@vger.kernel.org # 6.16 Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> [ adapted btrfs_inode pointer usage to older inode API ] Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-08-28btrfs: zoned: fix write time activation failure for metadata block groupNaohiro Aota
commit 5c4b93f4c8e5c53574c1a48d66a27a2c68b414af upstream. Since commit 13bb483d32ab ("btrfs: zoned: activate metadata block group on write time"), we activate a metadata block group at the write time. If the zone capacity is small enough, we can allocate the entire region before the first write. Then, we hit the btrfs_zoned_bg_is_full() in btrfs_zone_activate() and the activation fails. For a data block group, we activate it at the allocation time and we should check the fullness condition in the caller side. Add, a WARN to check the fullness condition. For a metadata block group, we don't need the fullness check because we activate it at the write time. Instead, activating it once it is written should be invalid. Catch that with a WARN too. Fixes: 13bb483d32ab ("btrfs: zoned: activate metadata block group on write time") CC: stable@vger.kernel.org # 6.6+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-08-28btrfs: do not allow relocation of partially dropped subvolumesQu Wenruo
commit 4289b494ac553e74e86fed1c66b2bf9530bc1082 upstream. [BUG] There is an internal report that balance triggered transaction abort, with the following call trace: item 85 key (594509824 169 0) itemoff 12599 itemsize 33 extent refs 1 gen 197740 flags 2 ref#0: tree block backref root 7 item 86 key (594558976 169 0) itemoff 12566 itemsize 33 extent refs 1 gen 197522 flags 2 ref#0: tree block backref root 7 ... BTRFS error (device loop0): extent item not found for insert, bytenr 594526208 num_bytes 16384 parent 449921024 root_objectid 934 owner 1 offset 0 BTRFS error (device loop0): failed to run delayed ref for logical 594526208 num_bytes 16384 type 182 action 1 ref_mod 1: -117 ------------[ cut here ]------------ BTRFS: Transaction aborted (error -117) WARNING: CPU: 1 PID: 6963 at ../fs/btrfs/extent-tree.c:2168 btrfs_run_delayed_refs+0xfa/0x110 [btrfs] And btrfs check doesn't report anything wrong related to the extent tree. [CAUSE] The cause is a little complex, firstly the extent tree indeed doesn't have the backref for 594526208. The extent tree only have the following two backrefs around that bytenr on-disk: item 65 key (594509824 METADATA_ITEM 0) itemoff 13880 itemsize 33 refs 1 gen 197740 flags TREE_BLOCK tree block skinny level 0 (176 0x7) tree block backref root CSUM_TREE item 66 key (594558976 METADATA_ITEM 0) itemoff 13847 itemsize 33 refs 1 gen 197522 flags TREE_BLOCK tree block skinny level 0 (176 0x7) tree block backref root CSUM_TREE But the such missing backref item is not an corruption on disk, as the offending delayed ref belongs to subvolume 934, and that subvolume is being dropped: item 0 key (934 ROOT_ITEM 198229) itemoff 15844 itemsize 439 generation 198229 root_dirid 256 bytenr 10741039104 byte_limit 0 bytes_used 345571328 last_snapshot 198229 flags 0x1000000000001(RDONLY) refs 0 drop_progress key (206324 EXTENT_DATA 2711650304) drop_level 2 level 2 generation_v2 198229 And that offending tree block 594526208 is inside the dropped range of that subvolume. That explains why there is no backref item for that bytenr and why btrfs check is not reporting anything wrong. But this also shows another problem, as btrfs will do all the orphan subvolume cleanup at a read-write mount. So half-dropped subvolume should not exist after an RW mount, and balance itself is also exclusive to subvolume cleanup, meaning we shouldn't hit a subvolume half-dropped during relocation. The root cause is, there is no orphan item for this subvolume. In fact there are 5 subvolumes from around 2021 that have the same problem. It looks like the original report has some older kernels running, and caused those zombie subvolumes. Thankfully upstream commit 8d488a8c7ba2 ("btrfs: fix subvolume/snapshot deletion not triggered on mount") has long fixed the bug. [ENHANCEMENT] For repairing such old fs, btrfs-progs will be enhanced. Considering how delayed the problem will show up (at run delayed ref time) and at that time we have to abort transaction already, it is too late. Instead here we reject any half-dropped subvolume for reloc tree at the earliest time, preventing confusion and extra time wasted on debugging similar bugs. CC: stable@vger.kernel.org # 5.15+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-08-28btrfs: zoned: do not select metadata BG as finish targetNaohiro Aota
commit 3a931e9b39c7ff8066657042f5f00d3b7e6ad315 upstream. We call btrfs_zone_finish_one_bg() to zone finish one block group and make room to activate another block group. Currently, we can choose a metadata block group as a target. But, as we reserve an active metadata block group, we no longer want to select a metadata block group. So, skip it in the loop. CC: stable@vger.kernel.org # 6.6+ Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-08-28btrfs: fix log tree replay failure due to file with 0 links and extentsFilipe Manana
commit 0a32e4f0025a74c70dcab4478e9b29c22f5ecf2f upstream. If we log a new inode (not persisted in a past transaction) that has 0 links and extents, then log another inode with an higher inode number, we end up with failing to replay the log tree with -EINVAL. The steps for this are: 1) create new file A 2) write some data to file A 3) open an fd on file A 4) unlink file A 5) fsync file A using the previously open fd 6) create file B (has higher inode number than file A) 7) fsync file B 8) power fail before current transaction commits Now when attempting to mount the fs, the log replay will fail with -ENOENT at replay_one_extent() when attempting to replay the first extent of file A. The failure comes when trying to open the inode for file A in the subvolume tree, since it doesn't exist. Before commit 5f61b961599a ("btrfs: fix inode lookup error handling during log replay"), the returned error was -EIO instead of -ENOENT, since we converted any errors when attempting to read an inode during log replay to -EIO. The reason for this is that the log replay procedure fails to ignore the current inode when we are at the stage LOG_WALK_REPLAY_ALL, our current inode has 0 links and last inode we processed in the previous stage has a non 0 link count. In other words, the issue is that at replay_one_extent() we only update wc->ignore_cur_inode if the current replay stage is LOG_WALK_REPLAY_INODES. Fix this by updating wc->ignore_cur_inode whenever we find an inode item regardless of the current replay stage. This is a simple solution and easy to backport, but later we can do other alternatives like avoid logging extents or inode items other than the inode item for inodes with a link count of 0. The problem with the wc->ignore_cur_inode logic has been around since commit f2d72f42d5fa ("Btrfs: fix warning when replaying log after fsync of a tmpfile") but it only became frequent to hit since the more recent commit 5e85262e542d ("btrfs: fix fsync of files with no hard links not persisting deletion"), because we stopped skipping inodes with a link count of 0 when logging, while before the problem would only be triggered if trying to replay a log tree created with an older kernel which has a logged inode with 0 links. A test case for fstests will be submitted soon. Reported-by: Peter Jung <ptr1337@cachyos.org> Link: https://lore.kernel.org/linux-btrfs/fce139db-4458-4788-bb97-c29acf6cb1df@cachyos.org/ Reported-by: burneddi <burneddi@protonmail.com> Link: https://lore.kernel.org/linux-btrfs/lh4W-Lwc0Mbk-QvBhhQyZxf6VbM3E8VtIvU3fPIQgweP_Q1n7wtlUZQc33sYlCKYd-o6rryJQfhHaNAOWWRKxpAXhM8NZPojzsJPyHMf2qY=@protonmail.com/#t Reported-by: Russell Haley <yumpusamongus@gmail.com> Link: https://lore.kernel.org/linux-btrfs/598ecc75-eb80-41b3-83c2-f2317fbb9864@gmail.com/ Fixes: f2d72f42d5fa ("Btrfs: fix warning when replaying log after fsync of a tmpfile") CC: stable@vger.kernel.org # 5.4+ Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-08-28btrfs: clear dirty status from extent buffer on error at insert_new_root()Filipe Manana
commit c0d013495a80cbb53e2288af7ae0ec4170aafd7c upstream. If we failed to insert the tree mod log operation, we are not removing the dirty status from the allocated and dirtied extent buffer before we free it. Removing the dirty status is needed for several reasons such as to adjust the fs_info->dirty_metadata_bytes counter and remove the dirty status from the respective folios. So add the missing call to btrfs_clear_buffer_dirty(). Fixes: f61aa7ba08ab ("btrfs: do not BUG_ON() on tree mod log failure at insert_new_root()") CC: stable@vger.kernel.org # 6.6+ Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-08-28btrfs: zoned: do not remove unwritten non-data block groupNaohiro Aota
commit 3061801420469610c8fa6080a950e56770773ef1 upstream. There are some reports of "unable to find chunk map for logical 2147483648 length 16384" error message appears in dmesg. This means some IOs are occurring after a block group is removed. When a metadata tree node is cleaned on a zoned setup, we keep that node still dirty and write it out not to create a write hole. However, this can make a block group's used bytes == 0 while there is a dirty region left. Such an unused block group is moved into the unused_bg list and processed for removal. When the removal succeeds, the block group is removed from the transaction->dirty_bgs list, so the unused dirty nodes in the block group are not sent at the transaction commit time. It will be written at some later time e.g, sync or umount, and causes "unable to find chunk map" errors. This can happen relatively easy on SMR whose zone size is 256MB. However, calling do_zone_finish() on such block group returns -EAGAIN and keep that block group intact, which is why the issue is hidden until now. Fixes: afba2bc036b0 ("btrfs: zoned: implement active zone tracking") CC: stable@vger.kernel.org # 6.1+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-08-28btrfs: abort transaction during log replay if walk_log_tree() failedFilipe Manana
commit 2a5898c4aac67494c2f0f7fe38373c95c371c930 upstream. If we failed walking a log tree during replay, we have a missing transaction abort to prevent committing a transaction where we didn't fully replay all the changes from a log tree and therefore can leave the respective subvolume tree in some inconsistent state. So add the missing transaction abort. CC: stable@vger.kernel.org # 6.1+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-08-28btrfs: zoned: use filesystem size not disk size for reclaim decisionJohannes Thumshirn
commit 55f7c65b2f69c7e4cb7aa7c1654a228ccf734fd8 upstream. When deciding if a zoned filesystem is reaching the threshold to reclaim data block groups, look at the size of the filesystem not to potentially total available size of all drives in the filesystem. Especially if a filesystem was created with mkfs' -b option, constraining it to only a portion of the block device, the numbers won't match and potentially garbage collection is kicking in too late. Fixes: 3687fcb0752a ("btrfs: zoned: make auto-reclaim less aggressive") CC: stable@vger.kernel.org # 6.1+ Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Tested-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-07-17btrfs: fix assertion when building free space treeFilipe Manana
[ Upstream commit 1961d20f6fa8903266ed9bd77c691924c22c8f02 ] When building the free space tree with the block group tree feature enabled, we can hit an assertion failure like this: BTRFS info (device loop0 state M): rebuilding free space tree assertion failed: ret == 0, in fs/btrfs/free-space-tree.c:1102 ------------[ cut here ]------------ kernel BUG at fs/btrfs/free-space-tree.c:1102! Internal error: Oops - BUG: 00000000f2000800 [#1] SMP Modules linked in: CPU: 1 UID: 0 PID: 6592 Comm: syz-executor322 Not tainted 6.15.0-rc7-syzkaller-gd7fa1af5b33e #0 PREEMPT Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/07/2025 pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) pc : populate_free_space_tree+0x514/0x518 fs/btrfs/free-space-tree.c:1102 lr : populate_free_space_tree+0x514/0x518 fs/btrfs/free-space-tree.c:1102 sp : ffff8000a4ce7600 x29: ffff8000a4ce76e0 x28: ffff0000c9bc6000 x27: ffff0000ddfff3d8 x26: ffff0000ddfff378 x25: dfff800000000000 x24: 0000000000000001 x23: ffff8000a4ce7660 x22: ffff70001499cecc x21: ffff0000e1d8c160 x20: ffff0000e1cb7800 x19: ffff0000e1d8c0b0 x18: 00000000ffffffff x17: ffff800092f39000 x16: ffff80008ad27e48 x15: ffff700011e740c0 x14: 1ffff00011e740c0 x13: 0000000000000004 x12: ffffffffffffffff x11: ffff700011e740c0 x10: 0000000000ff0100 x9 : 94ef24f55d2dbc00 x8 : 94ef24f55d2dbc00 x7 : 0000000000000001 x6 : 0000000000000001 x5 : ffff8000a4ce6f98 x4 : ffff80008f415ba0 x3 : ffff800080548ef0 x2 : 0000000000000000 x1 : 0000000100000000 x0 : 000000000000003e Call trace: populate_free_space_tree+0x514/0x518 fs/btrfs/free-space-tree.c:1102 (P) btrfs_rebuild_free_space_tree+0x14c/0x54c fs/btrfs/free-space-tree.c:1337 btrfs_start_pre_rw_mount+0xa78/0xe10 fs/btrfs/disk-io.c:3074 btrfs_remount_rw fs/btrfs/super.c:1319 [inline] btrfs_reconfigure+0x828/0x2418 fs/btrfs/super.c:1543 reconfigure_super+0x1d4/0x6f0 fs/super.c:1083 do_remount fs/namespace.c:3365 [inline] path_mount+0xb34/0xde0 fs/namespace.c:4200 do_mount fs/namespace.c:4221 [inline] __do_sys_mount fs/namespace.c:4432 [inline] __se_sys_mount fs/namespace.c:4409 [inline] __arm64_sys_mount+0x3e8/0x468 fs/namespace.c:4409 __invoke_syscall arch/arm64/kernel/syscall.c:35 [inline] invoke_syscall+0x98/0x2b8 arch/arm64/kernel/syscall.c:49 el0_svc_common+0x130/0x23c arch/arm64/kernel/syscall.c:132 do_el0_svc+0x48/0x58 arch/arm64/kernel/syscall.c:151 el0_svc+0x58/0x17c arch/arm64/kernel/entry-common.c:767 el0t_64_sync_handler+0x78/0x108 arch/arm64/kernel/entry-common.c:786 el0t_64_sync+0x198/0x19c arch/arm64/kernel/entry.S:600 Code: f0047182 91178042 528089c3 9771d47b (d4210000) ---[ end trace 0000000000000000 ]--- This happens because we are processing an empty block group, which has no extents allocated from it, there are no items for this block group, including the block group item since block group items are stored in a dedicated tree when using the block group tree feature. It also means this is the block group with the highest start offset, so there are no higher keys in the extent root, hence btrfs_search_slot_for_read() returns 1 (no higher key found). Fix this by asserting 'ret' is 0 only if the block group tree feature is not enabled, in which case we should find a block group item for the block group since it's stored in the extent root and block group item keys are greater than extent item keys (the value for BTRFS_BLOCK_GROUP_ITEM_KEY is 192 and for BTRFS_EXTENT_ITEM_KEY and BTRFS_METADATA_ITEM_KEY the values are 168 and 169 respectively). In case 'ret' is 1, we just need to add a record to the free space tree which spans the whole block group, and we can achieve this by making 'ret == 0' as the while loop's condition. Reported-by: syzbot+36fae25c35159a763a2a@syzkaller.appspotmail.com Link: https://lore.kernel.org/linux-btrfs/6841dca8.a00a0220.d4325.0020.GAE@google.com/ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-07-17btrfs: fix inode lookup error handling during log replayFilipe Manana
[ Upstream commit 5f61b961599acbd2bed028d3089105a1f7d224b8 ] When replaying log trees we use read_one_inode() to get an inode, which is just a wrapper around btrfs_iget_logging(), which in turn is a wrapper for btrfs_iget(). But read_one_inode() always returns NULL for any error that btrfs_iget_logging() / btrfs_iget() may return and this is a problem because: 1) In many callers of read_one_inode() we convert the NULL into -EIO, which is not accurate since btrfs_iget() may return -ENOMEM and -ENOENT for example, besides -EIO and other errors. So during log replay we may end up reporting a false -EIO, which is confusing since we may not have had any IO error at all; 2) When replaying directory deletes, at replay_dir_deletes(), we assume the NULL returned from read_one_inode() means that the inode doesn't exist and then proceed as if no error had happened. This is wrong because unless btrfs_iget() returned ERR_PTR(-ENOENT), we had an actual error and the target inode may exist in the target subvolume root - this may later result in the log replay code failing at a later stage (if we are "lucky") or succeed but leaving some inconsistency in the filesystem. So fix this by not ignoring errors from btrfs_iget_logging() and as a consequence remove the read_one_inode() wrapper and just use btrfs_iget_logging() directly. Also since btrfs_iget_logging() is supposed to be called only against subvolume roots, just like read_one_inode() which had a comment about it, add an assertion to btrfs_iget_logging() to check that the target root corresponds to a subvolume root. Fixes: 5d4f98a28c7d ("Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)") Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-07-17btrfs: return a btrfs_inode from btrfs_iget_logging()Filipe Manana
[ Upstream commit a488d8ac2c4d96ecc7da59bb35a573277204ac6b ] All callers of btrfs_iget_logging() are interested in the btrfs_inode structure rather than the VFS inode, so make btrfs_iget_logging() return the btrfs_inode instead, avoiding lots of BTRFS_I() calls. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Stable-dep-of: 5f61b961599a ("btrfs: fix inode lookup error handling during log replay") Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-07-17btrfs: remove redundant root argument from fixup_inode_link_count()Filipe Manana
[ Upstream commit 8befc61cbba2d4567122d400542da8900a352971 ] The root argument for fixup_inode_link_count() always matches the root of the given inode, so remove the root argument and get it from the inode argument. This also applies to the helpers count_inode_extrefs() and count_inode_refs() used by fixup_inode_link_count() - they don't need the root argument, as it always matches the root of the inode passed to them. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Stable-dep-of: 5f61b961599a ("btrfs: fix inode lookup error handling during log replay") Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-07-17btrfs: remove redundant root argument from btrfs_update_inode_fallback()Filipe Manana
[ Upstream commit 0a5d0dc55fcb15da016fa28d27bf50ca7f17ec11 ] The root argument for btrfs_update_inode_fallback() always matches the root of the given inode, so remove the root argument and get it from the inode argument. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Stable-dep-of: 5f61b961599a ("btrfs: fix inode lookup error handling during log replay") Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-07-17btrfs: remove noinline from btrfs_update_inode()Filipe Manana
[ Upstream commit cddaaacca9339d2f13599a822dc2f68be71d2e0d ] The noinline attribute of btrfs_update_inode() is pointless as the function is exported and widely used, so remove it. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Stable-dep-of: 5f61b961599a ("btrfs: fix inode lookup error handling during log replay") Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-07-10btrfs: fix qgroup reservation leak on failure to allocate ordered extentFilipe Manana
[ Upstream commit 1f2889f5594a2bc4c6a52634c4a51b93e785def5 ] If we fail to allocate an ordered extent for a COW write we end up leaking a qgroup data reservation since we called btrfs_qgroup_release_data() but we didn't call btrfs_qgroup_free_refroot() (which would happen when running the respective data delayed ref created by ordered extent completion or when finishing the ordered extent in case an error happened). So make sure we call btrfs_qgroup_free_refroot() if we fail to allocate an ordered extent for a COW write. Fixes: 7dbeaad0af7d ("btrfs: change timing for qgroup reserved space for ordered extents to fix reserved space leak") CC: stable@vger.kernel.org # 6.1+ Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-07-10btrfs: use btrfs_record_snapshot_destroy() during rmdirFilipe Manana
[ Upstream commit 157501b0469969fc1ba53add5049575aadd79d80 ] We are setting the parent directory's last_unlink_trans directly which may result in a concurrent task starting to log the directory not see the update and therefore can log the directory after we removed a child directory which had a snapshot within instead of falling back to a transaction commit. Replaying such a log tree would result in a mount failure since we can't currently delete snapshots (and subvolumes) during log replay. This is the type of failure described in commit 1ec9a1ae1e30 ("Btrfs: fix unreplayable log after snapshot delete + parent dir fsync"). Fix this by using btrfs_record_snapshot_destroy() which updates the last_unlink_trans field while holding the inode's log_mutex lock. Fixes: 44f714dae50a ("Btrfs: improve performance on fsync against new inode after rename/unlink") Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-07-10btrfs: propagate last_unlink_trans earlier when doing a rmdirFilipe Manana
[ Upstream commit c466e33e729a0ee017d10d919cba18f503853c60 ] In case the removed directory had a snapshot that was deleted, we are propagating its inode's last_unlink_trans to the parent directory after we removed the entry from the parent directory. This leaves a small race window where someone can log the parent directory after we removed the entry and before we updated last_unlink_trans, and as a result if we ever try to replay such a log tree, we will fail since we will attempt to remove a snapshot during log replay, which is currently not possible and results in the log replay (and mount) to fail. This is the type of failure described in commit 1ec9a1ae1e30 ("Btrfs: fix unreplayable log after snapshot delete + parent dir fsync"). So fix this by propagating the last_unlink_trans to the parent directory before we remove the entry from it. Fixes: 44f714dae50a ("Btrfs: improve performance on fsync against new inode after rename/unlink") Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-07-10btrfs: rename err to ret in btrfs_rmdir()Anand Jain
[ Upstream commit c3a1cc8ff48875b050fda5285ac8a9889eb7898d ] Unify naming of return value to the preferred way. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Stable-dep-of: c466e33e729a ("btrfs: propagate last_unlink_trans earlier when doing a rmdir") Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-07-10btrfs: fix iteration of extrefs during log replayFilipe Manana
[ Upstream commit 54a7081ed168b72a8a2d6ef4ba3a1259705a2926 ] At __inode_add_ref() when processing extrefs, if we jump into the next label we have an undefined value of victim_name.len, since we haven't initialized it before we did the goto. This results in an invalid memory access in the next iteration of the loop since victim_name.len was not initialized to the length of the name of the current extref. Fix this by initializing victim_name.len with the current extref's name length. Fixes: e43eec81c516 ("btrfs: use struct qstr instead of name and namelen pairs") Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-07-10btrfs: fix missing error handling when searching for inode refs during log ↵Filipe Manana
replay [ Upstream commit 6561a40ceced9082f50c374a22d5966cf9fc5f5c ] During log replay, at __add_inode_ref(), when we are searching for inode ref keys we totally ignore if btrfs_search_slot() returns an error. This may make a log replay succeed when there was an actual error and leave some metadata inconsistency in a subvolume tree. Fix this by checking if an error was returned from btrfs_search_slot() and if so, return it to the caller. Fixes: e02119d5a7b4 ("Btrfs: Add a write ahead tree log to optimize synchronous operations") Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-07-06btrfs: update superblock's device bytes_used when dropping chunkMark Harmstone
commit ae4477f937569d097ca5dbce92a89ba384b49bc6 upstream. Each superblock contains a copy of the device item for that device. In a transaction which drops a chunk but doesn't create any new ones, we were correctly updating the device item in the chunk tree but not copying over the new bytes_used value to the superblock. This can be seen by doing the following: # dd if=/dev/zero of=test bs=4096 count=2621440 # mkfs.btrfs test # mount test /root/temp # cd /root/temp # for i in {00..10}; do dd if=/dev/zero of=$i bs=4096 count=32768; done # sync # rm * # sync # btrfs balance start -dusage=0 . # sync # cd # umount /root/temp # btrfs check test For btrfs-check to detect this, you will also need my patch at https://github.com/kdave/btrfs-progs/pull/991. Change btrfs_remove_dev_extents() so that it adds the devices to the fs_info->post_commit_list if they're not there already. This causes btrfs_commit_device_sizes() to be called, which updates the bytes_used value in the superblock. Fixes: bbbf7243d62d ("btrfs: combine device update operations during transaction commit") CC: stable@vger.kernel.org # 5.10+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Mark Harmstone <maharmstone@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-07-06btrfs: fix a race between renames and directory loggingFilipe Manana
commit 3ca864de852bc91007b32d2a0d48993724f4abad upstream. We have a race between a rename and directory inode logging that if it happens and we crash/power fail before the rename completes, the next time the filesystem is mounted, the log replay code will end up deleting the file that was being renamed. This is best explained following a step by step analysis of an interleaving of steps that lead into this situation. Consider the initial conditions: 1) We are at transaction N; 2) We have directories A and B created in a past transaction (< N); 3) We have inode X corresponding to a file that has 2 hardlinks, one in directory A and the other in directory B, so we'll name them as "A/foo_link1" and "B/foo_link2". Both hard links were persisted in a past transaction (< N); 4) We have inode Y corresponding to a file that as a single hard link and is located in directory A, we'll name it as "A/bar". This file was also persisted in a past transaction (< N). The steps leading to a file loss are the following and for all of them we are under transaction N: 1) Link "A/foo_link1" is removed, so inode's X last_unlink_trans field is updated to N, through btrfs_unlink() -> btrfs_record_unlink_dir(); 2) Task A starts a rename for inode Y, with the goal of renaming from "A/bar" to "A/baz", so we enter btrfs_rename(); 3) Task A inserts the new BTRFS_INODE_REF_KEY for inode Y by calling btrfs_insert_inode_ref(); 4) Because the rename happens in the same directory, we don't set the last_unlink_trans field of directoty A's inode to the current transaction id, that is, we don't cal btrfs_record_unlink_dir(); 5) Task A then removes the entries from directory A (BTRFS_DIR_ITEM_KEY and BTRFS_DIR_INDEX_KEY items) when calling __btrfs_unlink_inode() (actually the dir index item is added as a delayed item, but the effect is the same); 6) Now before task A adds the new entry "A/baz" to directory A by calling btrfs_add_link(), another task, task B is logging inode X; 7) Task B starts a fsync of inode X and after logging inode X, at btrfs_log_inode_parent() it calls btrfs_log_all_parents(), since inode X has a last_unlink_trans value of N, set at in step 1; 8) At btrfs_log_all_parents() we search for all parent directories of inode X using the commit root, so we find directories A and B and log them. Bu when logging direct A, we don't have a dir index item for inode Y anymore, neither the old name "A/bar" nor for the new name "A/baz" since the rename has deleted the old name but has not yet inserted the new name - task A hasn't called yet btrfs_add_link() to do that. Note that logging directory A doesn't fallback to a transaction commit because its last_unlink_trans has a lower value than the current transaction's id (see step 4); 9) Task B finishes logging directories A and B and gets back to btrfs_sync_file() where it calls btrfs_sync_log() to persist the log tree; 10) Task B successfully persisted the log tree, btrfs_sync_log() completed with success, and a power failure happened. We have a log tree without any directory entry for inode Y, so the log replay code deletes the entry for inode Y, name "A/bar", from the subvolume tree since it doesn't exist in the log tree and the log tree is authorative for its index (we logged a BTRFS_DIR_LOG_INDEX_KEY item that covers the index range for the dentry that corresponds to "A/bar"). Since there's no other hard link for inode Y and the log replay code deletes the name "A/bar", the file is lost. The issue wouldn't happen if task B synced the log only after task A called btrfs_log_new_name(), which would update the log with the new name for inode Y ("A/bar"). Fix this by pinning the log root during renames before removing the old directory entry, and unpinning after btrfs_log_new_name() is called. Fixes: 259c4b96d78d ("btrfs: stop doing unnecessary log updates during a rename") CC: stable@vger.kernel.org # 5.18+ Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-07-06btrfs: handle csum tree error with rescue=ibadroots correctlyQu Wenruo
[ Upstream commit 547e836661554dcfa15c212a3821664e85b4191a ] [BUG] There is syzbot based reproducer that can crash the kernel, with the following call trace: (With some debug output added) DEBUG: rescue=ibadroots parsed BTRFS: device fsid 14d642db-7b15-43e4-81e6-4b8fac6a25f8 devid 1 transid 8 /dev/loop0 (7:0) scanned by repro (1010) BTRFS info (device loop0): first mount of filesystem 14d642db-7b15-43e4-81e6-4b8fac6a25f8 BTRFS info (device loop0): using blake2b (blake2b-256-generic) checksum algorithm BTRFS info (device loop0): using free-space-tree BTRFS warning (device loop0): checksum verify failed on logical 5312512 mirror 1 wanted 0xb043382657aede36608fd3386d6b001692ff406164733d94e2d9a180412c6003 found 0x810ceb2bacb7f0f9eb2bf3b2b15c02af867cb35ad450898169f3b1f0bd818651 level 0 DEBUG: read tree root path failed for tree csum, ret=-5 BTRFS warning (device loop0): checksum verify failed on logical 5328896 mirror 1 wanted 0x51be4e8b303da58e6340226815b70e3a93592dac3f30dd510c7517454de8567a found 0x51be4e8b303da58e634022a315b70e3a93592dac3f30dd510c7517454de8567a level 0 BTRFS warning (device loop0): checksum verify failed on logical 5292032 mirror 1 wanted 0x1924ccd683be9efc2fa98582ef58760e3848e9043db8649ee382681e220cdee4 found 0x0cb6184f6e8799d9f8cb335dccd1d1832da1071d12290dab3b85b587ecacca6e level 0 process 'repro' launched './file2' with NULL argv: empty string added DEBUG: no csum root, idatacsums=0 ibadroots=134217728 Oops: general protection fault, probably for non-canonical address 0xdffffc0000000041: 0000 [#1] SMP KASAN NOPTI KASAN: null-ptr-deref in range [0x0000000000000208-0x000000000000020f] CPU: 5 UID: 0 PID: 1010 Comm: repro Tainted: G OE 6.15.0-custom+ #249 PREEMPT(full) Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 02/02/2022 RIP: 0010:btrfs_lookup_csum+0x93/0x3d0 [btrfs] Call Trace: <TASK> btrfs_lookup_bio_sums+0x47a/0xdf0 [btrfs] btrfs_submit_bbio+0x43e/0x1a80 [btrfs] submit_one_bio+0xde/0x160 [btrfs] btrfs_readahead+0x498/0x6a0 [btrfs] read_pages+0x1c3/0xb20 page_cache_ra_order+0x4b5/0xc20 filemap_get_pages+0x2d3/0x19e0 filemap_read+0x314/0xde0 __kernel_read+0x35b/0x900 bprm_execve+0x62e/0x1140 do_execveat_common.isra.0+0x3fc/0x520 __x64_sys_execveat+0xdc/0x130 do_syscall_64+0x54/0x1d0 entry_SYSCALL_64_after_hwframe+0x76/0x7e ---[ end trace 0000000000000000 ]--- [CAUSE] Firstly the fs has a corrupted csum tree root, thus to mount the fs we have to go "ro,rescue=ibadroots" mount option. Normally with that mount option, a bad csum tree root should set BTRFS_FS_STATE_NO_DATA_CSUMS flag, so that any future data read will ignore csum search. But in this particular case, we have the following call trace that caused NULL csum root, but not setting BTRFS_FS_STATE_NO_DATA_CSUMS: load_global_roots_objectid(): ret = btrfs_search_slot(); /* Succeeded */ btrfs_item_key_to_cpu() found = true; /* We found the root item for csum tree. */ root = read_tree_root_path(); if (IS_ERR(root)) { if (!btrfs_test_opt(fs_info, IGNOREBADROOTS)) /* * Since we have rescue=ibadroots mount option, * @ret is still 0. */ break; if (!found || ret) { /* @found is true, @ret is 0, error handling for csum * tree is skipped. */ } This means we completely skipped to set BTRFS_FS_STATE_NO_DATA_CSUMS if the csum tree is corrupted, which results unexpected later csum lookup. [FIX] If read_tree_root_path() failed, always populate @ret to the error number. As at the end of the function, we need @ret to determine if we need to do the extra error handling for csum tree. Fixes: abed4aaae4f7 ("btrfs: track the csum, extent, and free space trees in a rb tree") Reported-by: Zhiyu Zhang <zhiyuzhang999@gmail.com> Reported-by: Longxing Li <coregee2000@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-06-19btrfs: scrub: fix a wrong error type when metadata bytenr mismatchesQu Wenruo
[ Upstream commit f2c19541e421b3235efc515dad88b581f00592ae ] When the bytenr doesn't match for a metadata tree block, we will report it as an csum error, which is incorrect and should be reported as a metadata error instead. Fixes: a3ddbaebc7c9 ("btrfs: scrub: introduce a helper to verify one metadata block") Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-06-19btrfs: scrub: update device stats when an error is detectedQu Wenruo
[ Upstream commit ec1f3a207cdf314eae4d4ae145f1ffdb829f0652 ] [BUG] Since the migration to the new scrub_stripe interface, scrub no longer updates the device stats when hitting an error, no matter if it's a read or checksum mismatch error. E.g: BTRFS info (device dm-2): scrub: started on devid 1 BTRFS error (device dm-2): unable to fixup (regular) error at logical 13631488 on dev /dev/mapper/test-scratch1 physical 13631488 BTRFS warning (device dm-2): checksum error at logical 13631488 on dev /dev/mapper/test-scratch1, physical 13631488, root 5, inode 257, offset 0, length 4096, links 1 (path: file) BTRFS error (device dm-2): unable to fixup (regular) error at logical 13631488 on dev /dev/mapper/test-scratch1 physical 13631488 BTRFS warning (device dm-2): checksum error at logical 13631488 on dev /dev/mapper/test-scratch1, physical 13631488, root 5, inode 257, offset 0, length 4096, links 1 (path: file) BTRFS info (device dm-2): scrub: finished on devid 1 with status: 0 Note there is no line showing the device stats error update. [CAUSE] In the migration to the new scrub_stripe interface, we no longer call btrfs_dev_stat_inc_and_print(). [FIX] - Introduce a new bitmap for metadata generation errors * A new bitmap @meta_gen_error_bitmap is introduced to record which blocks have metadata generation mismatch errors. * A new counter for that bitmap @init_nr_meta_gen_errors, is also introduced to store the number of generation mismatch errors that are found during the initial read. This is for the error reporting at scrub_stripe_report_errors(). * New dedicated error message for unrepaired generation mismatches * Update @meta_gen_error_bitmap if a transid mismatch is hit - Add btrfs_dev_stat_inc_and_print() calls to the following call sites * scrub_stripe_report_errors() * scrub_write_endio() This is only for the write errors. This means there is a minor behavior change: - The timing of device stats error message Since we concentrate the error messages at scrub_stripe_report_errors(), the device stats error messages will all show up in one go, after the detailed scrub error messages: BTRFS error (device dm-2): unable to fixup (regular) error at logical 13631488 on dev /dev/mapper/test-scratch1 physical 13631488 BTRFS warning (device dm-2): checksum error at logical 13631488 on dev /dev/mapper/test-scratch1, physical 13631488, root 5, inode 257, offset 0, length 4096, links 1 (path: file) BTRFS error (device dm-2): unable to fixup (regular) error at logical 13631488 on dev /dev/mapper/test-scratch1 physical 13631488 BTRFS warning (device dm-2): checksum error at logical 13631488 on dev /dev/mapper/test-scratch1, physical 13631488, root 5, inode 257, offset 0, length 4096, links 1 (path: file) BTRFS error (device dm-2): bdev /dev/mapper/test-scratch1 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0 BTRFS error (device dm-2): bdev /dev/mapper/test-scratch1 errs: wr 0, rd 0, flush 0, corrupt 2, gen 0 Fixes: e02ee89baa66 ("btrfs: scrub: switch scrub_simple_mirror() to scrub_stripe infrastructure") Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-06-04btrfs: check folio mapping after unlock in relocate_one_folio()Boris Burkov
commit 3e74859ee35edc33a022c3f3971df066ea0ca6b9 upstream. When we call btrfs_read_folio() to bring a folio uptodate, we unlock the folio. The result of that is that a different thread can modify the mapping (like remove it with invalidate) before we call folio_lock(). This results in an invalid page and we need to try again. In particular, if we are relocating concurrently with aborting a transaction, this can result in a crash like the following: BUG: kernel NULL pointer dereference, address: 0000000000000000 PGD 0 P4D 0 Oops: 0000 [#1] SMP CPU: 76 PID: 1411631 Comm: kworker/u322:5 Workqueue: events_unbound btrfs_reclaim_bgs_work RIP: 0010:set_page_extent_mapped+0x20/0xb0 RSP: 0018:ffffc900516a7be8 EFLAGS: 00010246 RAX: ffffea009e851d08 RBX: ffffea009e0b1880 RCX: 0000000000000000 RDX: 0000000000000000 RSI: ffffc900516a7b90 RDI: ffffea009e0b1880 RBP: 0000000003573000 R08: 0000000000000001 R09: ffff88c07fd2f3f0 R10: 0000000000000000 R11: 0000194754b575be R12: 0000000003572000 R13: 0000000003572fff R14: 0000000000100cca R15: 0000000005582fff FS: 0000000000000000(0000) GS:ffff88c07fd00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000000 CR3: 000000407d00f002 CR4: 00000000007706f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 PKRU: 55555554 Call Trace: <TASK> ? __die+0x78/0xc0 ? page_fault_oops+0x2a8/0x3a0 ? __switch_to+0x133/0x530 ? wq_worker_running+0xa/0x40 ? exc_page_fault+0x63/0x130 ? asm_exc_page_fault+0x22/0x30 ? set_page_extent_mapped+0x20/0xb0 relocate_file_extent_cluster+0x1a7/0x940 relocate_data_extent+0xaf/0x120 relocate_block_group+0x20f/0x480 btrfs_relocate_block_group+0x152/0x320 btrfs_relocate_chunk+0x3d/0x120 btrfs_reclaim_bgs_work+0x2ae/0x4e0 process_scheduled_works+0x184/0x370 worker_thread+0xc6/0x3e0 ? blk_add_timer+0xb0/0xb0 kthread+0xae/0xe0 ? flush_tlb_kernel_range+0x90/0x90 ret_from_fork+0x2f/0x40 ? flush_tlb_kernel_range+0x90/0x90 ret_from_fork_asm+0x11/0x20 </TASK> This occurs because cleanup_one_transaction() calls destroy_delalloc_inodes() which calls invalidate_inode_pages2() which takes the folio_lock before setting mapping to NULL. We fail to check this, and subsequently call set_extent_mapping(), which assumes that mapping != NULL (in fact it asserts that in debug mode) Note that the "fixes" patch here is not the one that introduced the race (the very first iteration of this code from 2009) but a more recent change that made this particular crash happen in practice. Fixes: e7f1326cc24e ("btrfs: set page extent mapped after read_folio in relocate_one_page") CC: stable@vger.kernel.org # 6.1+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Zhaoyang Li <lizy04@hust.edu.cn> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-06-04btrfs: avoid NULL pointer dereference if no valid csum treeQu Wenruo
[ Upstream commit f95d186255b319c48a365d47b69bd997fecb674e ] [BUG] When trying read-only scrub on a btrfs with rescue=idatacsums mount option, it will crash with the following call trace: BUG: kernel NULL pointer dereference, address: 0000000000000208 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page CPU: 1 UID: 0 PID: 835 Comm: btrfs Tainted: G O 6.15.0-rc3-custom+ #236 PREEMPT(full) Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 02/02/2022 RIP: 0010:btrfs_lookup_csums_bitmap+0x49/0x480 [btrfs] Call Trace: <TASK> scrub_find_fill_first_stripe+0x35b/0x3d0 [btrfs] scrub_simple_mirror+0x175/0x290 [btrfs] scrub_stripe+0x5f7/0x6f0 [btrfs] scrub_chunk+0x9a/0x150 [btrfs] scrub_enumerate_chunks+0x333/0x660 [btrfs] btrfs_scrub_dev+0x23e/0x600 [btrfs] btrfs_ioctl+0x1dcf/0x2f80 [btrfs] __x64_sys_ioctl+0x97/0xc0 do_syscall_64+0x4f/0x120 entry_SYSCALL_64_after_hwframe+0x76/0x7e [CAUSE] Mount option "rescue=idatacsums" will completely skip loading the csum tree, so that any data read will not find any data csum thus we will ignore data checksum verification. Normally call sites utilizing csum tree will check the fs state flag NO_DATA_CSUMS bit, but unfortunately scrub does not check that bit at all. This results in scrub to call btrfs_search_slot() on a NULL pointer and triggered above crash. [FIX] Check both extent and csum tree root before doing any tree search. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-06-04btrfs: send: return -ENAMETOOLONG when attempting a path that is too longFilipe Manana
[ Upstream commit a77749b3e21813566cea050bbb3414ae74562eba ] When attempting to build a too long path we are currently returning -ENOMEM, which is very odd and misleading. So update fs_path_ensure_buf() to return -ENAMETOOLONG instead. Also, while at it, move the WARN_ON() into the if statement's expression, as it makes it clear what is being tested and also has the effect of adding 'unlikely' to the statement, which allows the compiler to generate better code as this condition is never expected to happen. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-06-04btrfs: get zone unusable bytes while holding lock at btrfs_reclaim_bgs_work()Filipe Manana
[ Upstream commit 1283b8c125a83bf7a7dbe90c33d3472b6d7bf612 ] At btrfs_reclaim_bgs_work(), we are grabbing a block group's zone unusable bytes while not under the protection of the block group's spinlock, so this can trigger race reports from KCSAN (or similar tools) since that field is typically updated while holding the lock, such as at __btrfs_add_free_space_zoned() for example. Fix this by grabbing the zone unusable bytes while we are still in the critical section holding the block group's spinlock, which is right above where we are currently grabbing it. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-06-04btrfs: fix non-empty delayed iputs list on unmount due to async workersFilipe Manana
[ Upstream commit cda76788f8b0f7de3171100e3164ec1ce702292e ] At close_ctree() after we have ran delayed iputs either explicitly through calling btrfs_run_delayed_iputs() or later during the call to btrfs_commit_super() or btrfs_error_commit_super(), we assert that the delayed iputs list is empty. We have (another) race where this assertion might fail because we have queued an async write into the fs_info->workers workqueue. Here's how it happens: 1) We are submitting a data bio for an inode that is not the data relocation inode, so we call btrfs_wq_submit_bio(); 2) btrfs_wq_submit_bio() submits a work for the fs_info->workers queue that will run run_one_async_done(); 3) We enter close_ctree(), flush several work queues except fs_info->workers, explicitly run delayed iputs with a call to btrfs_run_delayed_iputs() and then again shortly after by calling btrfs_commit_super() or btrfs_error_commit_super(), which also run delayed iputs; 4) run_one_async_done() is executed in the work queue, and because there was an IO error (bio->bi_status is not 0) it calls btrfs_bio_end_io(), which drops the final reference on the associated ordered extent by calling btrfs_put_ordered_extent() - and that adds a delayed iput for the inode; 5) At close_ctree() we find that after stopping the cleaner and transaction kthreads the delayed iputs list is not empty, failing the following assertion: ASSERT(list_empty(&fs_info->delayed_iputs)); Fix this by flushing the fs_info->workers workqueue before running delayed iputs at close_ctree(). David reported this when running generic/648, which exercises IO error paths by using the DM error table. Reported-by: David Sterba <dsterba@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>