Commit Graph

94721 Commits

Author SHA1 Message Date
Kent Overstreet
2d66d3160d bcachefs: Fix dup/misordered check in btree node read
We were checking for out of order keys, but not duplicate keys.

Reported-by: syzbot+dedbd67513939979f84f@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-09 06:19:03 -05:00
Kent Overstreet
46522a75a4 bcachefs: Bad btree roots are now autofix
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-09 06:19:02 -05:00
Kent Overstreet
873a885d1a bcachefs: Kill bch2_bucket_alloc_new_fs()
The early-early allocation path, bch2_bucket_alloc_new_fs(), is no
longer needed - and inconsistencies around new_fs_bucket_idx have been a
frequent source of bugs.

Reported-by: syzbot+592425844580a6598410@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-09 06:19:02 -05:00
Kent Overstreet
3307caf863 bcachefs: Fix btree node scan when unknown btree IDs are present
btree_root entries for unknown btree IDs are created during recovery,
before reading those btree roots.

But btree_node_scan may find btree nodes with unknown btree IDs when we
haven't seen roots for those btrees.

Reported-by: syzbot+1f202d4da221ec6ebf8e@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-09 06:19:02 -05:00
Kent Overstreet
658ca21817 bcachefs: backpointer_to_missing_ptr is now autofix
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-09 06:19:02 -05:00
Kent Overstreet
ba91f39cd4 bcachefs: Fix accounting_read when we rewind
If we rewind recovery to run topology repair, that causes
accounting_read to run twice.

This fixes accounting being double counted.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-09 06:19:02 -05:00
Kent Overstreet
8b53739160 bcachefs: disk_accounting: bch2_dev_rcu -> bch2_dev_rcu_noerror
Accounting keys that reference invalid devices are corrected by fsck,
they shouldn't cause an emergency shutdown.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-09 06:19:02 -05:00
Kent Overstreet
f3542deaa9 bcachefs: errcode cleanup: journal errors
Instead of throwing standard error codes, we should be throwing
dedicated private error codes, this greatly improves debugability.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-09 06:19:02 -05:00
Kent Overstreet
3ed349d91e bcachefs: Use separate rhltable for bch2_inode_or_descendents_is_open()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-09 06:19:02 -05:00
Kent Overstreet
2ae6c5e05d bcachefs: BCH_ERR_btree_node_read_error_cached
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-09 06:19:02 -05:00
Kent Overstreet
0e796cf804 bcachefs: btree_write_buffer_flush_seq() no longer closes journal
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-09 06:19:02 -05:00
Kent Overstreet
4e378cabba bcachefs: discard fastpath now uses bch2_discard_one_bucket()
The discard bucket fastpath previously was using its own code for
discarding buckets and clearing them in the need_discard btree, which
didn't have any of the consistency checks of the main discard path.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-09 06:19:02 -05:00
Kent Overstreet
2c9a60bc31 bcachefs: Bias reads more in favor of faster device
Per reports of performance issues on mixed multi device filesystems
where we're issuing too much IO to the spinning rust - tweak this
algorithm.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-09 06:19:02 -05:00
Kent Overstreet
709336f96d bcachefs: trivial btree write buffer refactoring
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-09 06:19:02 -05:00
Kent Overstreet
01d8d04564 bcachefs: Can now block journal activity without closing cur entry
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-09 06:19:02 -05:00
Kent Overstreet
1f3c4ab3fb bcachefs: New backpointers helpers
- bch2_backpointer_del()
- bch2_backpointer_maybe_flush()

Kill a bit of open coding and make sure we're properly handling the
btree write buffer.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-09 06:19:02 -05:00
Kent Overstreet
da89857b5f bcachefs: kill bch_backpointer.bucket_offset usage
bch_backpointer.bucket_offset is going away - it's no longer needed
since we no longer store backpointers in alloc keys, the same
information is in the key position itself.

And we'll be reclaiming the space in bch_backpointer for the bucket
generation number.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-09 06:19:02 -05:00
Kent Overstreet
283dcbb80c bcachefs: Fix check_backpointers_to_extents range limiting
bch2_get_btree_in_memory_pos() will return positions that refer directly
to the btree it's checking will fit in memory - i.e. backpointer
positions, not buckets.

This also means check_bp_exists() no longer has to refer to the device,
and we can delete some code.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-09 06:19:02 -05:00
Kent Overstreet
ad5834890f bcachefs: bch_backpointer -> bkey_i_backpointer
Since we no longer store backpointers in alloc keys, there's no reason
not to pass around bkey_i_backpointers; this means we don't have to pass
the bucket pos separately.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-09 06:19:02 -05:00
Kent Overstreet
165ca83f55 bcachefs: Drop swab code for backpointers in alloc keys
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-09 06:19:02 -05:00
Kent Overstreet
bbc2ccccfd bcachefs: bucket_pos_to_bp_end()
Better helpers for iterating over backpointers within a specific bucket

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-09 06:19:02 -05:00
Kent Overstreet
3f2e467845 bcachefs: check for backpointers to invalid device
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-09 06:19:01 -05:00
Kent Overstreet
62b185571a bcachefs: fix bp_pos_to_bucket_nodev_noerror
_noerror means don't produce inconsistent errors, so it should be using
bch2_dev_rcu_noerror().

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-09 06:19:01 -05:00
Kent Overstreet
16de129896 bcachefs: Fix evacuate_bucket tracepoint
86a494c8ee ("bcachefs: Kill bch2_get_next_backpointer()") dropped some
things the tracepoint emitted because bch2_evacuate_bucket() no longer
looks at the alloc key - but we did want at least some of that.

We still no longer look at the alloc key so we can't report on the
fragmentation number, but that's a direct function of dirty_sectors and
a copygc concern anyways - copygc should get its own tracepoint that
includes information from the fragmentation LRU.

But we can report on the number of sectors we moved and the bucket size.

Co-developed-by: Piotr Zalewski <pZ010001011111@proton.me>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-09 06:18:49 -05:00
Kent Overstreet
92084feca4 bcachefs: fix O(n^2) issue with whiteouts in journal keys
The journal_keys array can't be substantially modified after we go RW,
because lookups need to be able to check it locklessly - thus we're
limited on what we can do when a key in the journal has been
overwritten.

This is a problem when there's many overwrites to skip over for peek()
operations. To fix this, add tracking of ranges of overwrites: we create
a range entry when there's more than one contiguous whiteout.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-08 23:56:19 -05:00
Kent Overstreet
1a8f5adc20 bcachefs: btree_and_journal_iter: don't iterate over too many whiteouts when prefetching
To help ameloriate issues with peek operations having to skip over
deletions in the journal - just bail out if all we're doing is
prefetching btree nodes.

Since btree node prefetching runs every time we iterate to a new node,
and has to sequentially scan ahead, this avoids another O(n^2).

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-08 23:56:19 -05:00
Kent Overstreet
1d1374a083 bcachefs: journal keys: sort keys for interior nodes first
There's an unavoidable issue with btree lookups when we're overlaying
journal keys and the journal has many deletions for keys present in the
btree - peek operations will have to iterate over all those deletions to
find the next live key to return.

This is mainly a problem for lookups in interior nodes, if we have to
traverse to a leaf. Looking up an insert position in a leaf (for journal
replay) doesn't have to find the next live key, but walking down the
btree does.

So to ameloriate this, change journal key sort ordering so that we
replay keys from roots and interior nodes first.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-08 23:56:19 -05:00
Kent Overstreet
ed144047ef bcachefs: kill bch2_journal_entries_free()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-08 23:56:19 -05:00
Kent Overstreet
b287adb628 bcachefs: Don't BUG_ON() when superblock feature wasn't set for compressed data
We don't allocate the mempools for compression/decompression unless we
need them - but that means there's an inconsistency to check for.

Reported-by: syzbot+cb3fbcfb417448cfd278@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-08 23:56:19 -05:00
Kent Overstreet
3a1897837a bcachefs: Don't use a shared decompress workspace mempool
gzip and zstd require different decompress workspace sizes, and if we
start with one and then start using the other at runtime we may not get
the correct size

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-08 23:56:18 -05:00
Kent Overstreet
3c0fc088af bcachefs: compression workspaces should be indexed by opt, not type
type includes lz4 and lz4_old, which do not get different compression
workspaces, and incompressible, a fake type - BCH_COMPRESSION_OPTS() is
the correct enum to use.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-08 23:56:18 -05:00
Kent Overstreet
d5b149f310 bcachefs: add missing BTREE_ITER_intent
this fixes excessive transaction restarts due to trans_commit having to
upgrade

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-08 23:56:18 -05:00
Kent Overstreet
7fdfb0cbea bcachefs: Kill bch2_get_next_backpointer()
Since for quite some time backpointers have only been stored in the
backpointers btree, not alloc keys (an aborted experiment, support for
which has been removed) - we can replace get_next_backpointer() with
simple btree iteration.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-08 23:56:18 -05:00
Kent Overstreet
59fad23c7a bcachefs: Delete backpointers check in try_alloc_bucket()
try_alloc_bucket() has a "safety" check, which avoids allocating a
bucket if there's any backpointers present.

But backpointers are not the source of truth for live data in a bucket,
the bucket sector counts are; this check was fairly useless, and we're
also deferring backpointers checks from fsck to runtime in the near
future.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-08 23:56:18 -05:00
Kent Overstreet
3140d0052a bcachefs: peek_prev_min(): Search forwards for extents, snapshots
With extents and snapshots, for slightly different reasons, we may have
to search forwards to find a key that compares equal to iter->pos (i.e.
a key that peek_prev() should return, as it returns keys <= iter->pos).

peek_slot() does this, and is an easy way to fix this case.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-08 23:56:18 -05:00
Kent Overstreet
632bcf3865 bcachefs: Implement bch2_btree_iter_prev_min()
A user contributed a filessytem dump, where the dump was actually
corrupted (due to being taken while the filesystem was online), but
which exposed an interesting bug in fsck - reconstruct_inode().

When itearting in BTREE_ITER_filter_snapshots mode, it's required to
give an end position for the iteration and it can't span inode numbers;
continuing into the next inode might mean we start seeing keys from a
different snapshot tree, that the is_ancestor() checks always filter,
thus we're never able to return a key and stop iterating.

Backwards iteration never implemented the end position because nothing
else needed it - except for reconstuct_inode().

Additionally, backwards iteration is now able to overlay keys from the
journal, which will be useful if we ever decide to start doing journal
replay in the background.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-08 23:56:18 -05:00
Kent Overstreet
a7df326af0 bcachefs: discard_one_bucket() now uses need_discard_or_freespace_err()
More conversion of inconsistent errors to fsck errors.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-08 23:56:18 -05:00
Kent Overstreet
cbc079bcff bcachefs: bch2_bucket_do_index(): inconsistent_err -> fsck_err
Factor out a common helper, need_discard_or_freespace_err(), which is
now used by both fsck and the runtime checks, and can repair.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-08 23:56:18 -05:00
Kent Overstreet
c51b601907 bcachefs: try_alloc_bucket() now uses bch2_check_discard_freespace_key()
check_discard_freespace_key() was doing all the same checks as
try_alloc_bucket(), but with repair.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-08 23:56:18 -05:00
Kent Overstreet
ecadaf9ae3 bcachefs: rework bch2_bucket_alloc_freelist() freelist iteration
Prep work for converting try_alloc_bucket() to use
bch2_check_discard_freespace_key().

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-08 23:56:18 -05:00
Kent Overstreet
3de116ce17 bcachefs: kill inconsistent err in invalidate_one_bucket()
Change it to a normal fsck_err() - meaning it'll get repaired at runtime
when that's flipped on.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-08 23:56:18 -05:00
Kent Overstreet
df4270ccd3 bcachefs: Don't delete reflink pointers to missing indirect extents
To avoid tragic loss in the event of transient errors (i.e., a btree
node topology error that was later corrected by btree node scan), we
can't delete reflink pointers to correct errors.

This adds a new error bit to bch_reflink_p, indicating that it is known
to point to a missing indirect extent, and the error has already been
reported.

Indirect extent lookups now use bch2_lookup_indirect_extent(), which on
error reports it as a fsck_err() and sets the error bit, and clears it
if necessary on succesful lookup.

This also gets rid of the bch2_inconsistent_error() call in
__bch2_read_indirect_extent, and in the reflink_p trigger: part of the
online self healing project.

An on disk format change isn't necessary here: setting the error bit
will be interpreted by older versions as pointing to a different index,
which will also be missing - which is fine.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-08 23:56:18 -05:00
Kent Overstreet
6cf666ffb5 bcachefs: Reorganize reflink.c a bit
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-08 23:56:17 -05:00
Kent Overstreet
b36ff5dc08 bcachefs: Reserve 8 bits in bch_reflink_p
Better repair for reflink pointers, as well as propagating new inode
options to indirect extents, are going to require a few extra bits
bch_reflink_p: so claim a few from the high end of the destination
index.

Also add some missing bounds checking.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-08 23:56:17 -05:00
Kent Overstreet
4a370320dc bcachefs: Kill FSCK_NEED_FSCK
If we find an error that indicates that we need to run fsck, we can
specify that directly with run_explicit_recovery_pass().

These are now log_fsck_err() calls: we're just logging in the superblock
that an error occurred - and possibly doing an emergency shutdown,
depending on policy.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-08 23:56:17 -05:00
Kent Overstreet
d6ad842cf7 bcachefs: lru errors are expected when reconstructing alloc
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-08 23:56:17 -05:00
Kent Overstreet
7ad0ba0e18 bcachefs: Delete dead code from bch2_discard_one_bucket()
alloc key validation ensures that if a bucket is in need_discard state
the sector counts are all zero - we don't have to check for that.

The NEED_INC_GEN check appears to be dead code, as well: we only see
buckets in the need_discard btree, and it's an error if they aren't in
the need_discard state.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-08 23:56:17 -05:00
Kent Overstreet
7fda5f1508 bcachefs: bch2_btree_bit_mod_iter()
factor out a new helper, make it handle extents bitset btrees
(freespace).

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-08 23:56:17 -05:00
Kent Overstreet
2228901e43 bcachefs: delete dead code
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-08 23:56:17 -05:00
Kent Overstreet
a0678f9c85 bcachefs: Fix shutdown message
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-08 23:56:17 -05:00