Commit Graph

94808 Commits

Author SHA1 Message Date
Kent Overstreet
c6b74e6733 bcachefs: bcachefs_metadata_version_disk_accounting_big_endian
Fix sort order for disk accounting keys, in order to fix a regression on
mount times.

The typetag is now the most significant byte of the key, meaning disk
accounting keys of the same type now sort together.

This lets us skip over disk accounting keys that aren't mirrored in
memory when reading accounting at startup, instead of having them
interleaved with other counter types.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-14 22:49:02 -05:00
Kent Overstreet
ab7eb8e365 bcachefs: bcachefs_metadata_version_backpointer_bucket_gen
New on disk format version: backpointers new include the generation
number of the bucket they refer to, and the obsolete bucket_offset field
(no longer needed because we no longer store backpointers in alloc keys)
is gone.

This is an expensive forced upgrade - hopefully the last; we have to run
the extents_to_backpointers recovery pass to regenerate backpointers.

It's a forced incompatible upgrade because the alternative would've been
permamently making backpointers bigger, and as one of the biggest btrees
(along with the extents btree) that's not an ideal option.

It's worth it though, because this allows us to make the
check_extents_to_backpointers pass drastically cheaper: an upcoming
patch changes it to sum up backpointers in a bucket and check the sum
against the sector counts for that bucket, only looking for missing
backpointers if they don't match (and then only for specific buckets).

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-14 22:49:02 -05:00
Kent Overstreet
8062b34861 bcachefs: bch2_btree_path_peek_slot() doesn't return errors
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-14 22:49:02 -05:00
Kent Overstreet
0694b43ff9 bcachefs: trace_key_cache_fill
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-14 22:49:02 -05:00
Kent Overstreet
dd0d1ff378 bcachefs: Log message in journal for snapshot deletion
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-14 22:48:30 -05:00
Kent Overstreet
50dd5a0edf bcachefs: bch2_trans_log_msg()
Export a helper for logging to the journal when we're already in a
transaction context.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-14 22:48:30 -05:00
Kent Overstreet
5d9b21a555 bcachefs: Kill snapshot_t->equiv
Now entirely dead code.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-14 22:48:30 -05:00
Kent Overstreet
7c410e21d8 bcachefs: Snapshot deletion no longer uses snapshot_t->equiv
Switch to generating a private list of interior nodes to delete, instead
of using the equivalence class in the global data structure.

This eliminates possible races with snapshot creation, and is much
cleaner - it'll let us delete a lot of janky code for calculating and
maintaining the equivalence classes.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-14 22:48:30 -05:00
Kent Overstreet
46f92a9e99 bcachefs: Kill equiv_seen arg to delete_dead_snapshots_process_key()
When deleting dead snapshots, we move keys from redundant interior
snapshot nodes to child nodes - unless there's already a key, in which
case the ancestor key is deleted.

Previously, we tracked via equiv_seen whether the child snapshot had a
key, but this was tricky w.r.t. transaction restarts, and not
transactionally safe w.r.t. updates in the child snapshot.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-14 22:48:30 -05:00
Kent Overstreet
75bca41052 bcachefs: Don't run overwrite triggers before insert
This breaks when the trigger is inserting updates for the same btree, as
the inode trigger now does.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-14 22:48:30 -05:00
Kent Overstreet
395d7f5e24 bcachefs: alloc_data_type_set() happens in alloc trigger
Originally, we ran insert triggers before overwrite so that if an extent
was being moved (by fallocate insert/collapse range), the bucket sector
count wouldn't hit 0 partway through, and so we don't trigger state
changes caused by that too soon.

But this is better solved by just moving the data type change to the
alloc trigger itself, where it's already called.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-14 22:48:30 -05:00
Kent Overstreet
760fbaf1a8 bcachefs: Fix key cache + BTREE_ITER_all_snapshots
Normally, whitouts (KEY_TYPE_whitout) are filtered from btree lookups,
since they exist only to represent deletions of keys in ancestor
snapshots - except, they should not be filtered in
BTREE_ITER_all_snapshots mode, so that e.g. snapshot deletion can clean
them up.

This means that that the key cache has to store whiteouts, and key cache
fills cannot filter them.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-14 22:48:30 -05:00
Kent Overstreet
bf5ec9b976 bcachefs: Fix btree_trans_peek_key_cache() BTREE_ITER_all_snapshots
In BTREE_ITER_all_snapshots mode, we're required to only return keys
where the snapshot field matches the iterator position -
BTREE_ITER_filter_snapshots requires pulling keys into the key cache
from ancestor snapshots, so we have to check for that.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-14 22:48:30 -05:00
Kent Overstreet
eece59055b bcachefs: tidy btree_trans_peek_journal()
Change to match bch2_btree_trans_peek_updates() calling convention.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-14 22:48:30 -05:00
Kent Overstreet
52ee09e70b bcachefs: tidy up __bch2_btree_iter_peek()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-14 22:47:28 -05:00
Kent Overstreet
cf44b080f7 bcachefs: check_indirect_extents can run online
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-14 22:46:16 -05:00
Kent Overstreet
bb4ae1459d bcachefs: Refactor c->opts.reconstruct_alloc
Now handled in one place.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-14 22:46:16 -05:00
Nathan Chancellor
cee2a479ac bcachefs: Add empty statement between label and declaration in check_inode_hash_info_matches_root()
Clang 18 and newer warns (or errors with CONFIG_WERROR=y):

  fs/bcachefs/str_hash.c:164:2: error: label followed by a declaration is a C23 extension [-Werror,-Wc23-extensions]
    164 |         struct bch_inode_unpacked inode;
        |         ^

In Clang 17 and prior, this is an unconditional hard error:

  fs/bcachefs/str_hash.c:164:2: error: expected expression
    164 |         struct bch_inode_unpacked inode;
        |         ^
  fs/bcachefs/str_hash.c:165:30: error: use of undeclared identifier 'inode'
    165 |         ret = bch2_inode_unpack(k, &inode);
        |                                     ^
  fs/bcachefs/str_hash.c:169:55: error: use of undeclared identifier 'inode'
    169 |         struct bch_hash_info hash2 = bch2_hash_info_init(c, &inode);
        |                                                              ^
  fs/bcachefs/str_hash.c:171:40: error: use of undeclared identifier 'inode'
    171 |                 ret = repair_inode_hash_info(trans, &inode);
        |                                                      ^

Add an empty statement between the label and the declaration to fix the
warning/error without disturbing the code too much.

Fixes: 2519d3b0d6 ("bcachefs: bch2_str_hash_check_key() now checks inode hash info")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202412092339.QB7hffGC-lkp@intel.com/
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-14 22:46:16 -05:00
Kent Overstreet
42c1d1a954 bcachefs: trace_write_buffer_maybe_flush
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-14 22:46:16 -05:00
Kent Overstreet
152c28eef5 bcachefs: bch2_snapshot_exists()
bch2_snapshot_equiv() is going away; convert users that just wanted to
know if the snapshot exists to something better

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-14 22:46:16 -05:00
Kent Overstreet
d54b4f311f bcachefs: bch2_check_key_has_snapshot() prints btree id
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-14 22:46:15 -05:00
Kent Overstreet
eccc694e14 bcachefs: bch2_str_hash_check_key() now checks inode hash info
Versions of the same inode in different snapshots must have the same
hash info; this is critical for lookups to work correctly.

We're going to be running the str_hash checks online, at readdir or
xattr list time, so we now need str_hash_check_key() to check for inode
hash seed mismatches, since it won't be run right after check_inodes().

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-14 22:46:15 -05:00
Kent Overstreet
90ae216d58 bcachefs: Don't BUG_ON() inode unpack error
Bkey validation checks that inodes are well-formed and unpack
successfully, so an unpack error should always indicate memory
corruption or some other kind of hardware bug - but these are still
errors we can recover from.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-14 22:46:15 -05:00
Kent Overstreet
5cd80c5f33 bcachefs: Use proper errcodes for inode unpack errors
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-14 22:46:15 -05:00
Kent Overstreet
c5022a702e bcachefs: kill sysfs internal/accounting
Since we added per-inode counters there's now far too many counters to
show in one shot - if we want this in the future, it'll have to be in
debugfs.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-14 22:46:15 -05:00
Kent Overstreet
2a11567a57 bcachefs: Kill unnecessary mark_lock usage
We can't hold mark_lock while calling fsck_err() - that's a deadlock,
mark_lock is meant to be a leaf node lock.

It's also unnecessary for gc_bucket() and bucket_gen(); rcu suffices
since the bucket_gens array describes its size, and we can't race with
device removal or resize during gc/fsck since that takes state lock.

Reported-by: syzbot+38641fcbda1aaffefdd4@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-14 22:46:15 -05:00
Kent Overstreet
0939c611ce bcachefs: Don't start rewriting btree nodes until after journal replay
This fixes a deadlock during journal replay when btree node read errors
kick off a ton of rewrites: we don't want them competing with journal
replay.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-14 22:46:15 -05:00
Kent Overstreet
d7f6becfe0 bcachefs: Fix reuse of bucket before journal flush on multiple empty -> nonempty transition
For each bucket we track when the bucket became nonempty and when it
became empty again: if we can ensure that there will be no journal
flushes in the range [nonempty, empty) (possibly because they occured at
the same journal sequence number), then it's safe to reuse the bucket
without waiting for a journal commit.

This is a major performance optimization for erasure coding, where
writes are initially replicated, but the extra replicas are quickly
dropped: if those buckets are reused and overwritten without issuing a
cache flush to the underlying device, then they only cost bus bandwidth.

But there's a tricky corner case when there's multiple empty -> nonempty
-> empty transitions in quick succession, i.e. when data is getting
overwritten immediately as it's being written.

If this happens and the previous empty transition hasn't been flushed,
we need to continue tracking the previous nonempty transition - not
start a new one.

Fixing this means we now need to track both the nonempty and empty
transitions in bch_alloc_v4.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-14 22:46:15 -05:00
Kent Overstreet
c106801642 bcachefs: bch2_journal_noflush_seq() now takes [start, end)
Harder to screw up if we're explicit about the range, and more correct
as journal reservations can be outstanding on multiple journal entries
simultaneously.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-14 22:46:15 -05:00
Kent Overstreet
f3b4692b79 bcachefs: Set bucket needs discard, inc gen on empty -> nonempty transition
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-14 22:46:15 -05:00
Kent Overstreet
8f367a5c8e bcachefs: Don't add unknown accounting types to eytzinger tree
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-14 22:46:15 -05:00
Kent Overstreet
e8d604148b bcachefs: Plumb bkey_validate_context to journal_entry_validate
This lets us print the exact location in the journal if it was found in
the journal, or correctly print if it was found in the superblock.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-14 22:46:15 -05:00
Kent Overstreet
488249b3f6 bcachefs: Use a heap for handling overwrites in btree node scan
Fix an O(n^2) issue when we find many overlapping (overwritten) btree
nodes - especially when one node overwrites many smaller nodes.

This was discovered to be an issue with the bcachefs
merge_torture_flakey test - if we had a large btree that was then
emptied, the number of difficult overwrites can be unbounded.

Cc: Kuan-Wei Chiu <visitorckw@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-14 22:46:15 -05:00
Kent Overstreet
e0e0d738ca bcachefs: Minor bucket alloc optimization
Check open buckets and buckets waiting for journal commit before doing
other expensive lookups.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-14 22:46:15 -05:00
Kent Overstreet
2c77b17015 bcachefs: Mark more errors autofix
tested repairing from a bug uncovered by the merge_torture_flakey test

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-14 22:46:15 -05:00
Kent Overstreet
8ceb549abd bcachefs: fix bch2_btree_node_header_to_text() format string
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-14 22:46:15 -05:00
Kent Overstreet
6315b49e95 bcachefs: Journal space calculations should skip durability=0 devices
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-14 22:46:15 -05:00
Kent Overstreet
0ecfac8b60 bcachefs: factor out str_hash.c
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-14 22:46:15 -05:00
Kent Overstreet
7dacc22d76 bcachefs: kill flags param to bch2_subvolume_get()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-14 22:46:15 -05:00
Kent Overstreet
cea5427fba bcachefs: Don't call bch2_btree_interior_update_will_free_node() until after update succeeds
Originally, btree splits always succeeded once we got to the point of
recursing to the btree_insert_node() call.

But that changed when we switched to not taking intent locks all the way
up to the root, and that introduced a bug, because
bch2_btree_interior_update_will_free_node() cancels paending writes and
reparents a node that's going to be made visible on disk by another
btree update to the current btree update.

This was discovered in recent backpointers work, because
bch2_btree_interior_update_will_free_node() also clears the
will_make_reachable flag, causing backpointer target lookup to
spuriously thing it had found a dangling backpointer (when the
backpointer just hadn't been created yet by
btree_update_nodes_written()).

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-14 22:46:15 -05:00
Kent Overstreet
61ab7cbbaa bcachefs: Make sure __bch2_run_explicit_recovery_pass() signals to rewind
We should always signal to rewind if the requested pass hasn't been run,
even if called multiple times.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-14 22:46:14 -05:00
Kent Overstreet
8ea098f248 bcachefs: Call bch2_btree_lost_data() on btree read error
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-14 22:46:14 -05:00
Kent Overstreet
70feb569f2 bcachefs: Journal write path refactoring, debug improvements
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-14 22:46:14 -05:00
Kent Overstreet
0a1a0391c4 bcachefs: dev_alloc_list.devs -> dev_alloc_list.data
This lets us use darray macros on dev_alloc_list (and it will become a
darray eventually, when we increase the maximum number of devices).

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-14 22:46:14 -05:00
Kent Overstreet
effc7a1c06 bcachefs: Fix failure to allocate journal write on discard retry
When allocating a journal write fails, then retries after doing
discards, we were failing to count already allocated replicas.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-14 22:46:14 -05:00
Kent Overstreet
fd2a164b5c bcachefs: BCH_ERR_insufficient_journal_devices
kill another standard error code use

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-14 22:46:14 -05:00
Kent Overstreet
fe23688192 bcachefs: Silence "unable to allocate journal write" if we're already RO
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-14 22:46:14 -05:00
Kent Overstreet
2a21c9dea9 bcachefs: trace_accounting_mem_insert
Add a tracepoint for inserting new accounting entries: we're seeing odd
spinning behaviour in accounting read.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-14 22:46:14 -05:00
Kent Overstreet
fdfafffb03 bcachefs: Advance to next bp on BCH_ERR_backpointer_to_overwritten_btree_node
Don't spin.

Fixes: de95cc201a ("bcachefs: Kill bch2_get_next_backpointer()")
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-14 22:46:14 -05:00
Kent Overstreet
52a0da6fcd bcachefs: Simplify disk accounting validate late
The validate late path was iterating over accounting entries in
eytzinger order, which is unnecessarily tricky when we may have to
remove entries.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-14 22:46:14 -05:00