Rewrite fsck/gc for the new accounting scheme.
This adds a second set of in-memory accounting counters for gc to use;
like with other parts of gc we run all trigger in TRIGGER_GC mode, then
compare what we calculated to existing in-memory accounting at the end.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Main part of the disk accounting rewrite.
This is a wholesale rewrite of the existing disk space accounting, which
relies on percepu counters that are sharded by journal buffer, and
rolled up and added to each journal write.
With the new scheme, every set of counters is a distinct key in the
accounting btree; this fixes scaling limitations of the old scheme,
where counters took up space in each journal entry and required multiple
percpu counters.
Now, in memory accounting requires a single set of percpu counters - not
multiple for each in flight journal buffer - and in the future we'll
probably also have counters that don't use in memory percpu counters,
they're not strictly required.
An accounting update is now a normal btree update, using the btree write
buffer path. At transaction commit time, we apply accounting updates to
the in memory counters, which are percpu counters indexed in an
eytzinger tree by the accounting key.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Compatibility fix - we no longer have a separate table for which order
gc walks btrees in, and special case the stripes btree directly.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We're about to start using bch_validate_flags for superblock section
validation - it's no longer bkey specific.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Wrapper around bch2_dev_have_ref() for open_buckets; we do guarantee
that the device an open_bucket points to exists.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Currently, the reflink_p gc trigger does repair as well - turning a
reflink_p key into an error key if the reflink_v it points to doesn't
exist.
This won't work with online check/repair, because the repair path once
online will be subject to transaction restarts, but BTREE_TRIGGER_gc is
not idempotant - we can't run it multiple times if we get a transaction
restart.
So we need to split these paths; to do so this patch calls
check_fix_ptrs() by a new general path - a new trigger type,
BTREE_TRIGGER_check_repair.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
If we hit an inconsistency when updating allocation information, we
don't want to fail the update if it's for a deletion - only if it's for
a new key.
Rename check_bucket_ref() -> bucket_ref_update() so we can centralize
the logic to do this.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This eliminates some duplicated logic, and the gc path now handles
stripe updates and deletions - we need this since soon we're bringing
back runtime gc.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Start to work on unifying mark_stripe_bucket() and
trans_mark_stripe_bucket(); first, clean up all the unnecessary and
gratuitious differences.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We're working on potentially unifying bch2_check_bucket_ref() and
bch2_check_fix_ptrs() - or at least eliminating gratuitious differences.
Most immediately, there's a bunch of cleanups to be done regarding
BCH_DATA_stripe.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Combine iter/update/trigger/str_hash flags into a single enum, and
x-macroize them for a to_text() function later.
These flags are all for a specific iter/key/update context, so it makes
sense to group them together - iter/update/trigger flags were already
given distinct bits, this cleans up and unifies that handling.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Previosuly, the transaction commit path would have to add keys to the
btree write buffer as a separate operation, requiring additional global
synchronization.
This patch introduces a new journal entry type, which indicates that the
keys need to be copied into the btree write buffer prior to being
written out. We switch the journal entry type back to
JSET_ENTRY_btree_keys prior to write, so this is not an on disk format
change.
Flushing the btree write buffer may require pulling keys out of journal
entries yet to be written, and quiescing outstanding journal
reservations; we previously added journal->buf_lock for synchronization
with the journal write path.
We also can't put strict bounds on the number of keys in the journal
destined for the write buffer, which means we might overflow the size of
the preallocated buffer and have to reallocate - this introduces a
potentially fatal memory allocation failure. This is something we'll
have to watch for, if it becomes an issue in practice we can do
additional mitigation.
The transaction commit path no longer has to explicitly check if the
write buffer is full and wait on flushing; this is another performance
optimization. Instead, when the btree write buffer is close to full we
change the journal watermark, so that only reservations for journal
reclaim are allowed.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
for_each_btree_key() handles transaction restarts, like
for_each_btree_key2(), but only calls bch2_trans_begin() after a
transaction restart - for_each_btree_key2() wraps every loop iteration
in a transaction.
The for_each_btree_key() behaviour is problematic when it leads to
holding the SRCU lock that prevents key cache reclaim for an unbounded
amount of time - there's no real need to keep it around.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
__bch2_btree_write_buffer_flush() now assumes a write ref is already
held (as called by the transaction commit path); and the wrappers
bch2_write_buffer_flush() and flush_sync() take an explicit write ref.
This means internally the write buffer code can always use
BTREE_INSERT_NOCHECK_RW, instead of in the previous code passing flags
around and hoping the NOCHECK_RW flag was always carried around
correctly.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Now we can print out filesystem flags in sysfs, useful for debugging
various "what's my filesystem doing" issues.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We can't create stripes if we don't have enough devices - this
manifested as an integer underflow bug later.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We now include the name of the device in the error message - and also
increment the number of checksum errors on that device.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We're not supposed to have more than one btree_trans at a time in a
given thread - that causes recursive locking deadlocks.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This patch adds a superblock error counter for every distinct fsck
error; this means that when analyzing filesystems out in the wild we'll
be able to see what sorts of inconsistencies are being found and repair,
and hence what bugs to look for.
Errors validating bkeys are not yet considered distinct fsck errors, but
this patch adds a new helper, bkey_fsck_err(), in order to add distinct
error types for them as well.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We now track IO errors per device since filesystem creation.
IO error counts can be viewed in sysfs, or with the 'bcachefs
show-super' command.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We're using more stack than we'd like in a number of functions, and
btree_trans is the biggest object that we stack allocate.
But we have to do a heap allocatation to initialize it anyways, so
there's no real downside to heap allocating the entire thing.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>