diff options
| author | Linus Torvalds <torvalds@linux-foundation.org> | 2025-05-26 12:43:30 -0700 |
|---|---|---|
| committer | Linus Torvalds <torvalds@linux-foundation.org> | 2025-05-26 12:43:30 -0700 |
| commit | 522544fc71c27b4b432386c7919f71ecc79a3bfb (patch) | |
| tree | 55c6f337ca94ad02480e2fa7fbec8a937a028a73 /fs/bcachefs/backpointers.c | |
| parent | 8fdabcd9c01d9ac585d8109a921e0734a69479fb (diff) | |
| parent | 9caea9208fc3fbdbd4a41a2de8c6a0c969b030f9 (diff) | |
Merge tag 'bcachefs-2025-05-24' of git://evilpiepirate.org/bcachefs
Pull bcachefs updates from Kent Overstreet:
- Poisoned extents can now be moved: this lets us handle bitrotted data
without deleting it. For now, reading from poisoned extents only
returns -EIO: in the future we'll have an API for specifying "read
this data even if there were bitflips".
- Incompatible features may now be enabled at runtime, via
"opts/version_upgrade" in sysfs. Toggle it to incompatible, and then
toggle it back - option changes via the sysfs interface are
persistent.
- Various changes to support deployable disk images:
- RO mounts now use less memory
- Images may be stripped of alloc info, particularly useful for
slimming them down if they will primarily be mounted RO. Alloc
info will be automatically regenerated on first RW mount, and
this is quite fast
- Filesystem images generated with 'bcachefs image' will be
automatically resized the first time they're mounted on a larger
device
The images 'bcachefs image' generates with compression enabled have
been comparable in size to those generated by squashfs and erofs -
but you get a full RW capable filesystem
- Major error message improvements for btree node reads, data reads,
and elsewhere. We now build up a single error message that lists all
the errors encountered, actions taken to repair, and success/failure
of the IO. This extends to other error paths that may kick off other
actions, e.g. scheduling recovery passes: actions we took because of
an error are included in that error message, with
grouping/indentation so we can see what caused what.
- New option, 'rebalance_on_ac_only'. Does exactly what the name
suggests, quite handy with background compression.
- Repair/self healing:
- We can now kick off recovery passes and run them in the
background if we detect errors. Currently, this is just used by
code that walks backpointers. We now also check for missing
backpointers at runtime and run check_extents_to_backpointers if
required. The messy 6.14 upgrade left missing backpointers for
some users, and this will correct that automatically instead of
requiring a manual fsck - some users noticed this as copygc
spinning and not making progress.
In the future, as more recovery passes come online, we'll be able
to repair and recover from nearly anything - except for
unreadable btree nodes, and that's why you're using replication,
of course - without shutting down the filesystem.
- There's a new recovery pass, for checking the rebalance_work
btree, which tracks extents that rebalance will process later.
- Hardening:
- Close the last known hole in btree iterator/btree locking
assertions: path->should_be_locked paths must stay locked until
the end of the transaction. This shook out a few bugs, including
a performance issue that was causing unnecessary path_upgrade
transaction restarts.
- Performance:
- Faster snapshot deletion: this is an incompatible feature, as it
requires new sentinal values, for safety. Snapshot deletion no
longer has to do a full metadata scan, it now just scans the
inodes btree: if an extent/dirent/xattr is present for a given
snapshot ID, we already require that an inode be present with
that same snapshot ID.
If/when users hit scalability limits again (ridiculously huge
filesystems with lots of inodes, and many sparse snapshots), let
me know - the next step will be to add an index from snapshot ID
-> inode number, which won't be too hard.
- Faster device removal: the "scan for pointers to this device" no
longer does a full metadata scan, instead it walks backpointers.
Like fast snapshot deletion this is another incompat feature: it
also requires a new sentinal value, because we don't want to
reuse these device IDs until after a fsck.
- We're now coalescing redundant accounting updates prior to
transaction commit, taking some pressure off the journal. Shortly
we'll also be doing multiple extent updates in a transaction in
the main write path, which combined with the previous should
drastically cut down on the amount of metadata updates we have to
journal.
- Stack usage improvements: All allocator state has been moved off the
stack
- Debug improvements:
- enumerated refcounts: The debug code previously used for
filesystem write refs is now a small library, and used for other
heavily used refcounts. Different users of a refcount are
enumerated, making it much easier to debug refcount issues.
- Async object debugging: There's a new kconfig option that makes
various async objects (different types of bios, data updates,
write ops, etc.) visible in debugfs, and it should be fast enough
to leave on in production.
- Various sets of assertions no longer require
CONFIG_BCACHEFS_DEBUG, instead they're controlled by module
parameters and static keys, meaning users won't need to compile
custom kernels as often to help debug issues.
- bch2_trans_kmalloc() calls can be tracked (there's a new kconfig
option). With it on you can check the btree_transaction_stats in
debugfs to see the bch2_trans_kmalloc() calls a transaction did
when it used the most memory.
* tag 'bcachefs-2025-05-24' of git://evilpiepirate.org/bcachefs: (218 commits)
bcachefs: Don't mount bs > ps without TRANSPARENT_HUGEPAGE
bcachefs: Fix btree_iter_next_node() for new locking asserts
bcachefs: Ensure we don't use a blacklisted journal seq
bcachefs: Small check_fix_ptr fixes
bcachefs: Fix opts.recovery_pass_last
bcachefs: Fix allocate -> self healing path
bcachefs: Fix endianness in casefold check/repair
bcachefs: Path must be locked if trans->locked && should_be_locked
bcachefs: Simplify bch2_path_put()
bcachefs: Plumb btree_trans for more locking asserts
bcachefs: Clear trans->locked before unlock
bcachefs: Clear should_be_locked before unlock in key_cache_drop()
bcachefs: bch2_path_get() reuses paths if upgrade_fails & !should_be_locked
bcachefs: Give out new path if upgrade fails
bcachefs: Fix btree_path_get_locks when not doing trans restart
bcachefs: btree_node_locked_type_nowrite()
bcachefs: Kill bch2_path_put_nokeep()
bcachefs: bch2_journal_write_checksum()
bcachefs: Reduce stack usage in data_update_index_update()
bcachefs: bch2_trans_log_str()
...
Diffstat (limited to 'fs/bcachefs/backpointers.c')
| -rw-r--r-- | fs/bcachefs/backpointers.c | 256 |
1 files changed, 183 insertions, 73 deletions
diff --git a/fs/bcachefs/backpointers.c b/fs/bcachefs/backpointers.c index 5f195d2280a4..cde7dd115267 100644 --- a/fs/bcachefs/backpointers.c +++ b/fs/bcachefs/backpointers.c @@ -12,9 +12,20 @@ #include "disk_accounting.h" #include "error.h" #include "progress.h" +#include "recovery_passes.h" #include <linux/mm.h> +static int bch2_bucket_bitmap_set(struct bch_dev *, struct bucket_bitmap *, u64); + +static inline struct bbpos bp_to_bbpos(struct bch_backpointer bp) +{ + return (struct bbpos) { + .btree = bp.btree_id, + .pos = bp.pos, + }; +} + int bch2_backpointer_validate(struct bch_fs *c, struct bkey_s_c k, struct bkey_validate_context from) { @@ -96,6 +107,8 @@ static noinline int backpointer_mod_err(struct btree_trans *trans, { struct bch_fs *c = trans->c; struct printbuf buf = PRINTBUF; + bool will_check = c->recovery.passes_to_run & + BIT_ULL(BCH_RECOVERY_PASS_check_extents_to_backpointers); int ret = 0; if (insert) { @@ -110,9 +123,7 @@ static noinline int backpointer_mod_err(struct btree_trans *trans, prt_printf(&buf, "for "); bch2_bkey_val_to_text(&buf, c, orig_k); - - bch_err(c, "%s", buf.buf); - } else if (c->curr_recovery_pass > BCH_RECOVERY_PASS_check_extents_to_backpointers) { + } else if (!will_check) { prt_printf(&buf, "backpointer not found when deleting\n"); printbuf_indent_add(&buf, 2); @@ -128,8 +139,7 @@ static noinline int backpointer_mod_err(struct btree_trans *trans, bch2_bkey_val_to_text(&buf, c, orig_k); } - if (c->curr_recovery_pass > BCH_RECOVERY_PASS_check_extents_to_backpointers && - __bch2_inconsistent_error(c, &buf)) + if (!will_check && __bch2_inconsistent_error(c, &buf)) ret = -BCH_ERR_erofs_unfixed_errors; bch_err(c, "%s", buf.buf); @@ -174,7 +184,7 @@ err: static int bch2_backpointer_del(struct btree_trans *trans, struct bpos pos) { - return (likely(!bch2_backpointers_no_use_write_buffer) + return (!static_branch_unlikely(&bch2_backpointers_no_use_write_buffer) ? bch2_btree_delete_at_buffered(trans, BTREE_ID_backpointers, pos) : bch2_btree_delete(trans, BTREE_ID_backpointers, pos, 0)) ?: bch2_trans_commit(trans, NULL, NULL, BCH_TRANS_COMMIT_no_enospc); @@ -184,7 +194,7 @@ static inline int bch2_backpointers_maybe_flush(struct btree_trans *trans, struct bkey_s_c visiting_k, struct bkey_buf *last_flushed) { - return likely(!bch2_backpointers_no_use_write_buffer) + return !static_branch_unlikely(&bch2_backpointers_no_use_write_buffer) ? bch2_btree_write_buffer_maybe_flush(trans, visiting_k, last_flushed) : 0; } @@ -478,7 +488,8 @@ found: bytes = p.crc.compressed_size << 9; - struct bch_dev *ca = bch2_dev_get_ioref(c, dev, READ); + struct bch_dev *ca = bch2_dev_get_ioref(c, dev, READ, + BCH_DEV_READ_REF_check_extent_checksums); if (!ca) return false; @@ -515,7 +526,8 @@ err: if (bio) bio_put(bio); kvfree(data_buf); - percpu_ref_put(&ca->io_ref[READ]); + enumerated_ref_put(&ca->io_ref[READ], + BCH_DEV_READ_REF_check_extent_checksums); printbuf_exit(&buf); return ret; } @@ -669,22 +681,33 @@ static int check_extent_to_backpointers(struct btree_trans *trans, rcu_read_lock(); struct bch_dev *ca = bch2_dev_rcu_noerror(c, p.ptr.dev); - bool check = ca && test_bit(PTR_BUCKET_NR(ca, &p.ptr), ca->bucket_backpointer_mismatches); - bool empty = ca && test_bit(PTR_BUCKET_NR(ca, &p.ptr), ca->bucket_backpointer_empty); + if (!ca) { + rcu_read_unlock(); + continue; + } + + if (p.ptr.cached && dev_ptr_stale_rcu(ca, &p.ptr)) { + rcu_read_unlock(); + continue; + } + + u64 b = PTR_BUCKET_NR(ca, &p.ptr); + if (!bch2_bucket_bitmap_test(&ca->bucket_backpointer_mismatch, b)) { + rcu_read_unlock(); + continue; + } - bool stale = p.ptr.cached && (!ca || dev_ptr_stale_rcu(ca, &p.ptr)); + bool empty = bch2_bucket_bitmap_test(&ca->bucket_backpointer_empty, b); rcu_read_unlock(); - if ((check || empty) && !stale) { - struct bkey_i_backpointer bp; - bch2_extent_ptr_to_bp(c, btree, level, k, p, entry, &bp); + struct bkey_i_backpointer bp; + bch2_extent_ptr_to_bp(c, btree, level, k, p, entry, &bp); - int ret = check - ? check_bp_exists(trans, s, &bp, k) - : bch2_bucket_backpointer_mod(trans, k, &bp, true); - if (ret) - return ret; - } + int ret = !empty + ? check_bp_exists(trans, s, &bp, k) + : bch2_bucket_backpointer_mod(trans, k, &bp, true); + if (ret) + return ret; } return 0; @@ -722,14 +745,6 @@ err: return ret; } -static inline struct bbpos bp_to_bbpos(struct bch_backpointer bp) -{ - return (struct bbpos) { - .btree = bp.btree_id, - .pos = bp.pos, - }; -} - static u64 mem_may_pin_bytes(struct bch_fs *c) { struct sysinfo i; @@ -788,6 +803,13 @@ static int bch2_get_btree_in_memory_pos(struct btree_trans *trans, return ret; } +static inline int bch2_fs_going_ro(struct bch_fs *c) +{ + return test_bit(BCH_FS_going_ro, &c->flags) + ? -EROFS + : 0; +} + static int bch2_check_extents_to_backpointers_pass(struct btree_trans *trans, struct extents_to_bp_state *s) { @@ -815,6 +837,7 @@ static int bch2_check_extents_to_backpointers_pass(struct btree_trans *trans, ret = for_each_btree_key_continue(trans, iter, 0, k, ({ bch2_progress_update_iter(trans, &progress, &iter, "extents_to_backpointers"); + bch2_fs_going_ro(c) ?: check_extent_to_backpointers(trans, s, btree_id, level, k) ?: bch2_trans_commit(trans, NULL, NULL, BCH_TRANS_COMMIT_no_enospc); })); @@ -854,6 +877,7 @@ static int data_type_to_alloc_counter(enum bch_data_type t) static int check_bucket_backpointers_to_extents(struct btree_trans *, struct bch_dev *, struct bpos); static int check_bucket_backpointer_mismatch(struct btree_trans *trans, struct bkey_s_c alloc_k, + bool *had_mismatch, struct bkey_buf *last_flushed) { struct bch_fs *c = trans->c; @@ -861,6 +885,8 @@ static int check_bucket_backpointer_mismatch(struct btree_trans *trans, struct b const struct bch_alloc_v4 *a = bch2_alloc_to_v4(alloc_k, &a_convert); bool need_commit = false; + *had_mismatch = false; + if (a->data_type == BCH_DATA_sb || a->data_type == BCH_DATA_journal || a->data_type == BCH_DATA_parity) @@ -931,12 +957,18 @@ static int check_bucket_backpointer_mismatch(struct btree_trans *trans, struct b goto err; } - if (!sectors[ALLOC_dirty] && - !sectors[ALLOC_stripe] && - !sectors[ALLOC_cached]) - __set_bit(alloc_k.k->p.offset, ca->bucket_backpointer_empty); - else - __set_bit(alloc_k.k->p.offset, ca->bucket_backpointer_mismatches); + bool empty = (sectors[ALLOC_dirty] + + sectors[ALLOC_stripe] + + sectors[ALLOC_cached]) == 0; + + ret = bch2_bucket_bitmap_set(ca, &ca->bucket_backpointer_mismatch, + alloc_k.k->p.offset) ?: + (empty + ? bch2_bucket_bitmap_set(ca, &ca->bucket_backpointer_empty, + alloc_k.k->p.offset) + : 0); + + *had_mismatch = true; } err: bch2_dev_put(ca); @@ -960,8 +992,14 @@ static bool backpointer_node_has_missing(struct bch_fs *c, struct bkey_s_c k) goto next; struct bpos bucket = bp_pos_to_bucket(ca, pos); - bucket.offset = find_next_bit(ca->bucket_backpointer_mismatches, - ca->mi.nbuckets, bucket.offset); + u64 next = ca->mi.nbuckets; + + unsigned long *bitmap = READ_ONCE(ca->bucket_backpointer_mismatch.buckets); + if (bitmap) + next = min_t(u64, next, + find_next_bit(bitmap, ca->mi.nbuckets, bucket.offset)); + + bucket.offset = next; if (bucket.offset == ca->mi.nbuckets) goto next; @@ -1070,28 +1108,6 @@ int bch2_check_extents_to_backpointers(struct bch_fs *c) { int ret = 0; - /* - * Can't allow devices to come/go/resize while we have bucket bitmaps - * allocated - */ - down_read(&c->state_lock); - - for_each_member_device(c, ca) { - BUG_ON(ca->bucket_backpointer_mismatches); - ca->bucket_backpointer_mismatches = kvcalloc(BITS_TO_LONGS(ca->mi.nbuckets), - sizeof(unsigned long), - GFP_KERNEL); - ca->bucket_backpointer_empty = kvcalloc(BITS_TO_LONGS(ca->mi.nbuckets), - sizeof(unsigned long), - GFP_KERNEL); - if (!ca->bucket_backpointer_mismatches || - !ca->bucket_backpointer_empty) { - bch2_dev_put(ca); - ret = -BCH_ERR_ENOMEM_backpointer_mismatches_bitmap; - goto err_free_bitmaps; - } - } - struct btree_trans *trans = bch2_trans_get(c); struct extents_to_bp_state s = { .bp_start = POS_MIN }; @@ -1100,23 +1116,24 @@ int bch2_check_extents_to_backpointers(struct bch_fs *c) ret = for_each_btree_key(trans, iter, BTREE_ID_alloc, POS_MIN, BTREE_ITER_prefetch, k, ({ - check_bucket_backpointer_mismatch(trans, k, &s.last_flushed); + bool had_mismatch; + bch2_fs_going_ro(c) ?: + check_bucket_backpointer_mismatch(trans, k, &had_mismatch, &s.last_flushed); })); if (ret) goto err; - u64 nr_buckets = 0, nr_mismatches = 0, nr_empty = 0; + u64 nr_buckets = 0, nr_mismatches = 0; for_each_member_device(c, ca) { nr_buckets += ca->mi.nbuckets; - nr_mismatches += bitmap_weight(ca->bucket_backpointer_mismatches, ca->mi.nbuckets); - nr_empty += bitmap_weight(ca->bucket_backpointer_empty, ca->mi.nbuckets); + nr_mismatches += ca->bucket_backpointer_mismatch.nr; } - if (!nr_mismatches && !nr_empty) + if (!nr_mismatches) goto err; bch_info(c, "scanning for missing backpointers in %llu/%llu buckets", - nr_mismatches + nr_empty, nr_buckets); + nr_mismatches, nr_buckets); while (1) { ret = bch2_pin_backpointer_nodes_with_missing(trans, s.bp_start, &s.bp_end); @@ -1147,23 +1164,71 @@ int bch2_check_extents_to_backpointers(struct bch_fs *c) s.bp_start = bpos_successor(s.bp_end); } + + for_each_member_device(c, ca) { + bch2_bucket_bitmap_free(&ca->bucket_backpointer_mismatch); + bch2_bucket_bitmap_free(&ca->bucket_backpointer_empty); + } err: bch2_trans_put(trans); bch2_bkey_buf_exit(&s.last_flushed, c); bch2_btree_cache_unpin(c); -err_free_bitmaps: - for_each_member_device(c, ca) { - kvfree(ca->bucket_backpointer_empty); - ca->bucket_backpointer_empty = NULL; - kvfree(ca->bucket_backpointer_mismatches); - ca->bucket_backpointer_mismatches = NULL; - } - up_read(&c->state_lock); bch_err_fn(c, ret); return ret; } +static int check_bucket_backpointer_pos_mismatch(struct btree_trans *trans, + struct bpos bucket, + bool *had_mismatch, + struct bkey_buf *last_flushed) +{ + struct btree_iter alloc_iter; + struct bkey_s_c k = bch2_bkey_get_iter(trans, &alloc_iter, + BTREE_ID_alloc, bucket, + BTREE_ITER_cached); + int ret = bkey_err(k); + if (ret) + return ret; + + ret = check_bucket_backpointer_mismatch(trans, k, had_mismatch, last_flushed); + bch2_trans_iter_exit(trans, &alloc_iter); + return ret; +} + +int bch2_check_bucket_backpointer_mismatch(struct btree_trans *trans, + struct bch_dev *ca, u64 bucket, + bool copygc, + struct bkey_buf *last_flushed) +{ + struct bch_fs *c = trans->c; + bool had_mismatch; + int ret = lockrestart_do(trans, + check_bucket_backpointer_pos_mismatch(trans, POS(ca->dev_idx, bucket), + &had_mismatch, last_flushed)); + if (ret || !had_mismatch) + return ret; + + u64 nr = ca->bucket_backpointer_mismatch.nr; + u64 allowed = copygc ? ca->mi.nbuckets >> 7 : 0; + + struct printbuf buf = PRINTBUF; + __bch2_log_msg_start(ca->name, &buf); + + prt_printf(&buf, "Detected missing backpointers in bucket %llu, now have %llu/%llu with missing\n", + bucket, nr, ca->mi.nbuckets); + + bch2_run_explicit_recovery_pass(c, &buf, + BCH_RECOVERY_PASS_check_extents_to_backpointers, + nr < allowed ? RUN_RECOVERY_PASS_ratelimit : 0); + + bch2_print_str(c, KERN_ERR, buf.buf); + printbuf_exit(&buf); + return 0; +} + +/* backpointers -> extents */ + static int check_one_backpointer(struct btree_trans *trans, struct bbpos start, struct bbpos end, @@ -1279,3 +1344,48 @@ int bch2_check_backpointers_to_extents(struct bch_fs *c) bch_err_fn(c, ret); return ret; } + +static int bch2_bucket_bitmap_set(struct bch_dev *ca, struct bucket_bitmap *b, u64 bit) +{ + scoped_guard(mutex, &b->lock) { + if (!b->buckets) { + b->buckets = kvcalloc(BITS_TO_LONGS(ca->mi.nbuckets), + sizeof(unsigned long), GFP_KERNEL); + if (!b->buckets) + return -BCH_ERR_ENOMEM_backpointer_mismatches_bitmap; + } + + b->nr += !__test_and_set_bit(bit, b->buckets); + } + + return 0; +} + +int bch2_bucket_bitmap_resize(struct bucket_bitmap *b, u64 old_size, u64 new_size) +{ + scoped_guard(mutex, &b->lock) { + if (!b->buckets) + return 0; + + unsigned long *n = kvcalloc(BITS_TO_LONGS(new_size), + sizeof(unsigned long), GFP_KERNEL); + if (!n) + return -BCH_ERR_ENOMEM_backpointer_mismatches_bitmap; + + memcpy(n, b->buckets, + BITS_TO_LONGS(min(old_size, new_size)) * sizeof(unsigned long)); + kvfree(b->buckets); + b->buckets = n; + } + + return 0; +} + +void bch2_bucket_bitmap_free(struct bucket_bitmap *b) +{ + mutex_lock(&b->lock); + kvfree(b->buckets); + b->buckets = NULL; + b->nr = 0; + mutex_unlock(&b->lock); +} |
