| author | Leo Martins <loemra.dev@gmail.com> | 2026-02-26 01:51:07 -0800 |
|---|---|---|
| committer | David Sterba <dsterba@suse.com> | 2026-04-07 18:56:00 +0200 |
| commit | f9a48549a15aa369d42cebc08a6a72b71a53d547 (patch) | |
| tree | 36cc48e9af28a2fda0bada4b6285406f976b42ed | |
| parent | cab4c8b594e23649591317c5f0606ea6a8a27236 (diff) | |
btrfs: inhibit extent buffer writeback to prevent COW amplification

Inhibit writeback on COW'd extent buffers for the lifetime of the
transaction handle, preventing background writeback from setting
BTRFS_HEADER_FLAG_WRITTEN and causing unnecessary re-COW.

COW amplification occurs when background writeback flushes an extent
buffer that a transaction handle is still actively modifying. When
lock_extent_buffer_for_io() transitions a buffer from dirty to
writeback, it sets BTRFS_HEADER_FLAG_WRITTEN, marking the block as
having been persisted to disk at its current bytenr. Once WRITTEN is
set, should_cow_block() must either COW the block again or overwrite
it in place, both of which are unnecessary overhead when the buffer
is still being modified by the same handle that allocated it. By
inhibiting background writeback on actively-used buffers, WRITTEN is
never set while a transaction handle holds a reference to the buffer,
avoiding this overhead entirely.

Add an atomic_t writeback_inhibitors counter to struct extent_buffer,
which fits in an existing 6-byte hole without increasing struct size.
When a buffer is COW'd in btrfs_force_cow_block(), call
btrfs_inhibit_eb_writeback() to store the eb in the transaction
handle's writeback_inhibited_ebs xarray (keyed by eb->start), take a
reference, and increment writeback_inhibitors. The function handles
dedup (same eb inhibited twice by the same handle) and replacement
(different eb at the same logical address). Allocation failure is
graceful: the buffer simply falls back to the pre-existing behavior
where it may be written back and re-COW'd.

Also inhibit writeback in should_cow_block() when COW is skipped,
so that every transaction handle that reuses an already-COW'd buffer
also inhibits its writeback. Without this, if handle A COWs a block
and inhibits it, and handle B later reuses the same block without
inhibiting, handle A's uninhibit on end_transaction leaves the buffer
unprotected while handle B is still using it. This ensures all handles
that access a COW'd buffer contribute to the inhibitor count, and the
buffer remains protected until the last handle releases it.

In lock_extent_buffer_for_io(), when writeback_inhibitors is non-zero
and the writeback mode is WB_SYNC_NONE, skip the buffer. WB_SYNC_NONE
is used by the VM flusher threads for background and periodic
writeback, which are the only paths that cause COW amplification by
opportunistically writing out dirty extent buffers mid-transaction.
Skipping these is safe because the buffers remain dirty in the page
cache and will be written out at transaction commit time.

WB_SYNC_ALL must always proceed regardless of writeback_inhibitors.
This is required for correctness in the fsync path: btrfs_sync_log()
writes log tree blocks via filemap_fdatawrite_range() (WB_SYNC_ALL)
while the transaction handle that inhibited those same blocks is still
active. Without the WB_SYNC_ALL bypass, those inhibited log tree
blocks would be silently skipped, resulting in an incomplete log on
disk and corruption on replay. btrfs_write_and_wait_transaction()
also uses WB_SYNC_ALL via filemap_fdatawrite_range(); for that path,
inhibitors are already cleared beforehand, but the bypass ensures
correctness regardless.

Uninhibit in __btrfs_end_transaction() before atomic_dec(num_writers)
to prevent a race where the committer proceeds while buffers are still
inhibited. Also uninhibit in btrfs_commit_transaction() before writing
and in cleanup_transaction() for the error path.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Sun YangKai <sunk67188@gmail.com>
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Leo Martins <loemra.dev@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>

| Mode | Path | Lines |
|---|---|---|
| -rw-r--r-- | fs/btrfs/ctree.c | 9 |
| -rw-r--r-- | fs/btrfs/extent_io.c | 64 |
| -rw-r--r-- | fs/btrfs/extent_io.h | 6 |
| -rw-r--r-- | fs/btrfs/transaction.c | 19 |
| -rw-r--r-- | fs/btrfs/transaction.h | 3 |

5 files changed, 98 insertions, 3 deletions
diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index b9004345cf3c..e8d260ecdcf6 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -21,6 +21,7 @@
 #include "fs.h"
 #include "accessors.h"
 #include "extent-tree.h"
+#include "extent_io.h"
 #include "relocation.h"
 #include "file-item.h"
@@ -590,6 +591,9 @@ int btrfs_force_cow_block(struct btrfs_trans_handle *trans,
 	btrfs_tree_unlock(buf);
 	free_extent_buffer_stale(buf);
 	btrfs_mark_buffer_dirty(trans, cow);
+
+	btrfs_inhibit_eb_writeback(trans, cow);
+
 	*cow_ret = cow;
 	return 0;
@@ -599,9 +603,9 @@ error_unlock_cow:
 	return ret;
 }
 
-static inline bool should_cow_block(const struct btrfs_trans_handle *trans,
+static inline bool should_cow_block(struct btrfs_trans_handle *trans,
 				    const struct btrfs_root *root,
-				    const struct extent_buffer *buf)
+				    struct extent_buffer *buf)
 {
 	if (btrfs_is_testing(root->fs_info))
 		return false;
@@ -635,6 +639,7 @@ static inline bool should_cow_block(const struct btrfs_trans_handle *trans,
 	if (btrfs_header_flag(buf, BTRFS_HEADER_FLAG_RELOC))
 		return true;
 
+	btrfs_inhibit_eb_writeback(trans, buf);
 	return false;
 }
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 33b1afbee0a6..77dcec8b6b62 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -14,6 +14,7 @@
 #include <linux/pagevec.h>
 #include <linux/prefetch.h>
 #include <linux/fsverity.h>
+#include <linux/lockdep.h>
 #include "extent_io.h"
 #include "extent-io-tree.h"
 #include "extent_map.h"
@@ -1955,7 +1956,9 @@ static noinline_for_stack bool lock_extent_buffer_for_io(struct extent_buffer *e
 	 * of time.
 	 */
 	spin_lock(&eb->refs_lock);
-	if (test_and_clear_bit(EXTENT_BUFFER_DIRTY, &eb->bflags)) {
+	if ((wbc->sync_mode == WB_SYNC_ALL ||
+	     atomic_read(&eb->writeback_inhibitors) == 0) &&
+	    test_and_clear_bit(EXTENT_BUFFER_DIRTY, &eb->bflags)) {
 		XA_STATE(xas, &fs_info->buffer_tree, eb->start >> fs_info->nodesize_bits);
 		unsigned long flags;
@@ -2990,6 +2993,64 @@ static inline void btrfs_release_extent_buffer(struct extent_buffer *eb)
 	kmem_cache_free(extent_buffer_cache, eb);
 }
 
+/*
+ * Inhibit writeback on buffer during transaction.
+ *
+ * @trans: transaction handle that will own the inhibitor
+ * @eb:    extent buffer to inhibit writeback on
+ *
+ * Attempt to track this extent buffer in the transaction's inhibited set. If
+ * memory allocation fails, the buffer is simply not tracked. It may be written
+ * back and need re-COW, which is the original behavior. This is acceptable
+ * since inhibiting writeback is an optimization.
+ */
+void btrfs_inhibit_eb_writeback(struct btrfs_trans_handle *trans, struct extent_buffer *eb)
+{
+	unsigned long index = eb->start >> trans->fs_info->nodesize_bits;
+	void *old;
+
+	lockdep_assert_held(&eb->lock);
+
+	/* Check if already inhibited by this handle. */
+	old = xa_load(&trans->writeback_inhibited_ebs, index);
+	if (old == eb)
+		return;
+
+	/* Take reference for the xarray entry. */
+	refcount_inc(&eb->refs);
+
+	old = xa_store(&trans->writeback_inhibited_ebs, index, eb, GFP_NOFS);
+	if (xa_is_err(old)) {
+		/* Allocation failed, just skip inhibiting this buffer. */
+		free_extent_buffer(eb);
+		return;
+	}
+
+	/* Handle replacement of different eb at same index. */
+	if (old && old != eb) {
+		struct extent_buffer *old_eb = old;
+
+		atomic_dec(&old_eb->writeback_inhibitors);
+		free_extent_buffer(old_eb);
+	}
+
+	atomic_inc(&eb->writeback_inhibitors);
+}
+
+/*
+ * Uninhibit writeback on all extent buffers.
+ */
+void btrfs_uninhibit_all_eb_writeback(struct btrfs_trans_handle *trans)
+{
+	struct extent_buffer *eb;
+	unsigned long index;
+
+	xa_for_each(&trans->writeback_inhibited_ebs, index, eb) {
+		atomic_dec(&eb->writeback_inhibitors);
+		free_extent_buffer(eb);
+	}
+	xa_destroy(&trans->writeback_inhibited_ebs);
+}
+
 static struct extent_buffer *__alloc_extent_buffer(struct btrfs_fs_info *fs_info,
 						   u64 start)
 {
@@ -3000,6 +3061,7 @@ static struct extent_buffer *__alloc_extent_buffer(struct btrfs_fs_info *fs_info
 	eb->len = fs_info->nodesize;
 	eb->fs_info = fs_info;
 	init_rwsem(&eb->lock);
+	atomic_set(&eb->writeback_inhibitors, 0);
 
 	btrfs_leak_debug_add_eb(eb);
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index d5851cc06f32..a3b0ab501361 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -99,6 +99,8 @@ struct extent_buffer {
 	spinlock_t refs_lock;
 	refcount_t refs;
 	int read_mirror;
+	/* Inhibit WB_SYNC_NONE writeback when > 0. */
+	atomic_t writeback_inhibitors;
 	/* >= 0 if eb belongs to a log tree, -1 otherwise */
 	s8 log_index;
 	u8 folio_shift;
@@ -381,4 +383,8 @@ void btrfs_extent_buffer_leak_debug_check(struct btrfs_fs_info *fs_info);
 #define btrfs_extent_buffer_leak_debug_check(fs_info) do {} while (0)
 #endif
 
+void btrfs_inhibit_eb_writeback(struct btrfs_trans_handle *trans,
+				struct extent_buffer *eb);
+void btrfs_uninhibit_all_eb_writeback(struct btrfs_trans_handle *trans);
+
 #endif
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 13e59ce3f5fb..79c488f120b2 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -15,6 +15,7 @@
 #include "misc.h"
 #include "ctree.h"
 #include "disk-io.h"
+#include "extent_io.h"
 #include "transaction.h"
 #include "locking.h"
 #include "tree-log.h"
@@ -696,6 +697,8 @@ again:
 		goto alloc_fail;
 	}
 
+	xa_init(&h->writeback_inhibited_ebs);
+
 	/*
 	 * If we are JOIN_NOLOCK we're already committing a transaction and
 	 * waiting on this guy, so we don't need to do the sb_start_intwrite
@@ -1092,6 +1095,13 @@ static int __btrfs_end_transaction(struct btrfs_trans_handle *trans,
 	if (trans->type & __TRANS_FREEZABLE)
 		sb_end_intwrite(info->sb);
 
+	/*
+	 * Uninhibit extent buffer writeback before decrementing num_writers,
+	 * since the decrement wakes the committing thread which needs all
+	 * buffers uninhibited to write them to disk.
+	 */
+	btrfs_uninhibit_all_eb_writeback(trans);
+
 	WARN_ON(cur_trans != info->running_transaction);
 	WARN_ON(atomic_read(&cur_trans->num_writers) < 1);
 	atomic_dec(&cur_trans->num_writers);
@@ -2124,6 +2134,7 @@ static void cleanup_transaction(struct btrfs_trans_handle *trans, int err)
 	if (!test_bit(BTRFS_FS_RELOC_RUNNING, &fs_info->flags))
 		btrfs_scrub_cancel(fs_info);
 
+	btrfs_uninhibit_all_eb_writeback(trans);
 	kmem_cache_free(btrfs_trans_handle_cachep, trans);
 }
 
@@ -2563,6 +2574,14 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
 	    fs_info->cleaner_kthread)
 		wake_up_process(fs_info->cleaner_kthread);
 
+	/*
+	 * Uninhibit writeback on all extent buffers inhibited during this
+	 * transaction before writing them to disk. Inhibiting prevented
+	 * writeback while the transaction was building, but now we need
+	 * them written.
+	 */
+	btrfs_uninhibit_all_eb_writeback(trans);
+
 	ret = btrfs_write_and_wait_transaction(trans);
 	if (unlikely(ret)) {
 		btrfs_err(fs_info, "error while writing out transaction: %d", ret);
diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
index 18ef069197e5..7d70fe486758 100644
--- a/fs/btrfs/transaction.h
+++ b/fs/btrfs/transaction.h
@@ -12,6 +12,7 @@
 #include <linux/time64.h>
 #include <linux/mutex.h>
 #include <linux/wait.h>
+#include <linux/xarray.h>
 #include "btrfs_inode.h"
 #include "delayed-ref.h"
@@ -162,6 +163,8 @@ struct btrfs_trans_handle {
 	struct btrfs_fs_info *fs_info;
 	struct list_head new_bgs;
 	struct btrfs_block_rsv delayed_rsv;
+	/* Extent buffers with writeback inhibited by this handle. */
+	struct xarray writeback_inhibited_ebs;
 };
