Commit Graph

181 Commits

Kent Overstreet
e1b8f5f5ca bcachefs: Plumb btree_id & level to trans_mark
For backpointers, we'll need the full key location - that means btree_id
and btree level. This patch plumbs it through.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:32 -04:00
Kent Overstreet
275c8426fb bcachefs: Add rw to .key_invalid()
This adds a new parameter to .key_invalid() methods for whether the key
is being read or written; the idea being that methods can do more
aggressive checks when a key is newly created and being written, when we
wouldn't want to delete the key because of those checks.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:30 -04:00
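
A minimal userspace sketch of the idea in the commit above, with made-up
names (my_key, my_key_invalid) rather than the actual .key_invalid()
interface: the validator takes a read/write flag and only applies the
stricter check when the key is being newly written.

    /* Illustrative only - not bcachefs code. */
    #include <stdio.h>

    enum rw_flag { KEY_READ, KEY_WRITE };

    struct my_key {
        unsigned size;      /* must always be nonzero */
        unsigned version;   /* new keys must carry the current version */
    };

    #define CURRENT_VERSION 2

    static const char *my_key_invalid(const struct my_key *k, enum rw_flag rw)
    {
        if (!k->size)
            return "zero size";                 /* always an error */

        if (rw == KEY_WRITE && k->version != CURRENT_VERSION)
            return "stale version on new key";  /* only rejected at write time */

        return NULL;                            /* key is valid */
    }

    int main(void)
    {
        struct my_key old = { .size = 8, .version = 1 };
        const char *err;

        err = my_key_invalid(&old, KEY_READ);
        printf("read:  %s\n", err ? err : "ok");    /* existing key still readable */
        err = my_key_invalid(&old, KEY_WRITE);
        printf("write: %s\n", err ? err : "ok");    /* new write would be rejected */
        return 0;
    }
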
Kent Overstreet
f0ac7df23d bcachefs: Convert .key_invalid methods to printbufs
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:30 -04:00
Kent Overstreet
f25d8215f4 bcachefs: Kill allocator threads & freelists
Now that we have new persistent data structures for the allocator, this
patch converts the allocator to use them.

Now, foreground bucket allocation uses the freespace btree to find
buckets to allocate, instead of popping buckets off the freelist.

The background allocator threads are no longer needed and are deleted,
as well as the allocator freelists. Now we only need background tasks
for invalidating buckets containing cached data (when we are low on
empty buckets), and for issuing discards.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:29 -04:00
Kent Overstreet
b17d3cec14 bcachefs: Run btree updates after write out of write_point
In the write path, after the write to the block device(s) complete we
have to punt to process context to do the btree update.

Instead of using the work item embedded in op->cl, this patch switches
to a per write-point work item. This helps with two different issues:

 - lock contention: btree updates to the same writepoint will (usually)
   be updating the same alloc keys
 - context switch overhead: when we're bottlenecked on btree updates,
   having a thread (running out of a work item) checking the write point
   for completed ops is cheaper than queueing up a new work item and
   waking up a kworker.

In an arbitrary benchmark, 4k random writes with fio running inside a
VM, this patch resulted in a 10% improvement in total iops.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:29 -04:00
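
To illustrate the batching described above, here is a small pthread-based
sketch with hypothetical names (write_point, op_complete,
write_point_work), not the bcachefs write path: completed ops are parked
on a per-write-point list and a single work function drains the whole
list, rather than queueing one work item per completed op.

    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>

    struct write_op {
        struct write_op *next;
        int id;
    };

    struct write_point {
        pthread_mutex_t  lock;
        struct write_op *completed;     /* ops whose device writes finished */
    };

    /* Called from write completion: just park the op - cheap. */
    static void op_complete(struct write_point *wp, struct write_op *op)
    {
        pthread_mutex_lock(&wp->lock);
        op->next = wp->completed;
        wp->completed = op;
        pthread_mutex_unlock(&wp->lock);
    }

    /* One work item per write point: grab everything that has completed
     * and do the (expensive) index updates in one batch. */
    static void write_point_work(struct write_point *wp)
    {
        pthread_mutex_lock(&wp->lock);
        struct write_op *list = wp->completed;
        wp->completed = NULL;
        pthread_mutex_unlock(&wp->lock);

        while (list) {
            struct write_op *op = list;
            list = op->next;
            printf("btree update for op %d\n", op->id);
            free(op);
        }
    }

    int main(void)
    {
        struct write_point wp = { PTHREAD_MUTEX_INITIALIZER, NULL };

        for (int i = 0; i < 3; i++) {
            struct write_op *op = malloc(sizeof(*op));
            op->id = i;
            op_complete(&wp, op);
        }
        write_point_work(&wp);          /* drains all three in one pass */
        return 0;
    }
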
Kent Overstreet
5f41739403 bcachefs: bch2_btree_update_start() refactoring
This simplifies the logic in bch2_btree_update_start() a bit, handling
the unlock/block logic more locally.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:29 -04:00
Kent Overstreet
31f63fd124 bcachefs: Introduce a separate journal watermark for copygc
Since journal reclaim -> btree key cache flushing may require the
allocation of new btree nodes, it has an implicit dependency on copygc
in order to make forward progress - so we should avoid blocking copygc
unless the journal is really close to full.

This introduces watermarks to replace our single MAY_GET_UNRESERVED bit
in the journal, and adds a watermark for copygc and plumbs it through.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:29 -04:00
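
A hedged sketch of the watermark idea, using invented names and
thresholds: each class of journal user carries a watermark, and a
reservation only succeeds while the journal still has at least that
watermark's worth of free space, so copygc is blocked later than
ordinary users are.

    #include <stdbool.h>
    #include <stdio.h>

    enum journal_watermark {
        WATERMARK_any,      /* ordinary users: need the most free space */
        WATERMARK_copygc,   /* copygc: may dip further into the reserve */
        WATERMARK_reclaim,  /* journal reclaim: may use nearly everything */
    };

    /* Minimum free journal entries per watermark (made-up numbers). */
    static const unsigned watermark_min_free[] = {
        [WATERMARK_any]     = 64,
        [WATERMARK_copygc]  = 16,
        [WATERMARK_reclaim] = 4,
    };

    static bool journal_res_ok(unsigned free_entries, enum journal_watermark w)
    {
        return free_entries >= watermark_min_free[w];
    }

    int main(void)
    {
        unsigned free_entries = 32;

        printf("normal user: %s\n",
               journal_res_ok(free_entries, WATERMARK_any) ? "ok" : "blocked");
        printf("copygc:      %s\n",
               journal_res_ok(free_entries, WATERMARK_copygc) ? "ok" : "blocked");
        return 0;
    }
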
Kent Overstreet
3e1547116f bcachefs: x-macroize alloc_reserve enum
This makes an array of strings available, like our other enums.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:29 -04:00
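
For readers unfamiliar with the pattern, this is the x-macro technique the
commit refers to, shown with made-up reserve names: a single list
definition expands into both the enum and a parallel array of strings.

    #include <stdio.h>

    #define ALLOC_RESERVES()    \
        x(btree_movinggc)       \
        x(btree)                \
        x(movinggc)             \
        x(none)

    enum alloc_reserve {
    #define x(name) RESERVE_##name,
        ALLOC_RESERVES()
    #undef x
    };

    static const char * const alloc_reserve_strs[] = {
    #define x(name) #name,
        ALLOC_RESERVES()
    #undef x
    };

    int main(void)
    {
        /* Prints "movinggc" - the string array stays in sync with the enum. */
        printf("%s\n", alloc_reserve_strs[RESERVE_movinggc]);
        return 0;
    }
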
Kent Overstreet
2a6870ada4 bcachefs: Use darray for extra_journal_entries
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:28 -04:00
Kent Overstreet
7071878bab bcachefs: Add a missing btree_path_set_dirty() call
bch2_btree_iter_next_node() was mucking with other btree_path state
without setting path->update to be consistent with the fact that the
path is very much no longer uptodate - oops.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:28 -04:00
Kent Overstreet
3098553776 bcachefs: Fix usage of six lock's percpu mode
Six locks have a percpu mode, which we use for interior btree nodes, as
well as btree key cache keys for the subvolumes btree. We've been
switching locks back and forth between percpu and non percpu mode as
needed, but it turns out this is racy - when we're reusing an existing
node, other threads could be attempting to lock it while we're switching
it between modes.

This patch fixes this by never switching 'struct btree' between the two
modes, and instead segregating them between two different freed lists.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:26 -04:00
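
A small illustrative sketch of the shape of the fix (not the real btree
node cache code): freed nodes are kept on one of two lists according to
their lock mode, and an allocation for one mode never reuses a node freed
in the other, so a node's mode never has to change.

    #include <stdbool.h>
    #include <stdio.h>
    #include <stdlib.h>

    struct node {
        struct node *next;
        bool percpu_locks;              /* fixed for the node's lifetime */
    };

    static struct node *freed[2];       /* [0] = normal mode, [1] = percpu mode */

    static void free_node(struct node *n)
    {
        n->next = freed[n->percpu_locks];
        freed[n->percpu_locks] = n;
    }

    static struct node *alloc_node(bool percpu_locks)
    {
        struct node *n = freed[percpu_locks];

        if (n) {
            freed[percpu_locks] = n->next;      /* reuse, same mode as before */
        } else {
            n = calloc(1, sizeof(*n));
            n->percpu_locks = percpu_locks;
        }
        return n;
    }

    int main(void)
    {
        struct node *a = alloc_node(true);
        free_node(a);

        /* A request for a non-percpu node never reuses 'a'. */
        struct node *b = alloc_node(false);
        printf("reused: %s\n", a == b ? "yes" : "no");      /* no */
        free_node(b);
        return 0;
    }
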
Kent Overstreet
ee68105f61 bcachefs: Simplify parameters to bch2_btree_update_start()
We don't need to pass the number of nodes required to
bch2_btree_update_start, just whether we're doing a split at @level.
This is prep work for a fix to our usage of six lock's percpu mode,
which is going to require us to count up and allocate interior nodes and
leaf nodes separately.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:26 -04:00
Kent Overstreet
b66fbf3342 bcachefs: Drop unneeded journal pin in bch2_btree_update_start()
When we do an interior btree update, we create new btree nodes and link
them into the btree in memory, but they don't become reachable on disk
until later, when btree_update_nodes_written_trans() runs.

Updates to the new nodes can thus happen before they're reachable on
disk, and if the updates to those new nodes are written before the nodes
become reachable, we would then drop the journal pin for those updates
before the btree has them.

This is what the journal pin in bch2_btree_update_start() was protecting
against. However, it's not actually needed because we don't allow
subsequent append writes to btree nodes until the node is reachable on
disk.

Dropping this unneeded pin also fixes a bug introduced by "bcachefs:
Journal seq now incremented at entry open, not close" - in the new code,
if the journal is completely empty a journal pin list for
journal_cur_seq() won't exist.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:26 -04:00
Kent Overstreet
bf3efff5e4 bcachefs: Fix race leading to btree node write getting stuck
Checking btree_node_may_write() isn't atomic with the other btree flags,
dirty and need_write in particular. There was a rare race where we'd
unblock a node from writing while __btree_node_flush() was setting
need_write, and no thread would notice that the node was now both able
to write and needed to be written.

Fix this by adding btree node flags for will_make_reachable and
write_blocked that can be checked in the cmpxchg loop in
__bch2_btree_node_write.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:26 -04:00
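
A userspace sketch of the pattern, with illustrative flag names: all of
the "should this node be written now?" checks happen inside a single
compare-and-exchange loop over one flags word, so they cannot race with
another thread setting need_write.

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdio.h>

    #define NODE_dirty            (1u << 0)
    #define NODE_need_write       (1u << 1)
    #define NODE_write_blocked    (1u << 2)
    #define NODE_write_in_flight  (1u << 3)

    /* Returns true if the caller should now issue the write. */
    static bool try_start_write(atomic_uint *flags)
    {
        unsigned old = atomic_load(flags), new;

        do {
            if (!(old & NODE_need_write))
                return false;           /* nothing to write */
            if (old & (NODE_write_blocked | NODE_write_in_flight))
                return false;           /* someone else will get to it */

            new = old;
            new &= ~(NODE_dirty | NODE_need_write);
            new |= NODE_write_in_flight;
        } while (!atomic_compare_exchange_weak(flags, &old, new));

        return true;
    }

    int main(void)
    {
        atomic_uint flags = NODE_dirty | NODE_need_write;

        printf("start write: %s\n", try_start_write(&flags) ? "yes" : "no");
        printf("start again: %s\n", try_start_write(&flags) ? "yes" : "no");
        return 0;
    }
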
Kent Overstreet
82732ef510 bcachefs: Improve btree_node_write_if_need()
btree_node_write_if_need() kicks off a btree node write only if
need_write is set; this makes the locking easier to reason about by
moving the check into the cmpxchg loop in __bch2_btree_node_write().

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:26 -04:00
Kent Overstreet
de517c9551 bcachefs: Use x-macros for btree node flags
This is for adding an array of strings for btree node flag names.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:26 -04:00
Kent Overstreet
55334d7897 bcachefs: Kill BCH_FS_HOLD_BTREE_WRITES
This was just dead code.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:25 -04:00
Kent Overstreet
a0a07c59f5 bcachefs: Fix btree path sorting
In btree_update_interior.c, we were changing a path's level directly -
which affects path sort order - without re-sorting paths, leading to
assertions when bch2_path_get() verified paths were sorted correctly.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:25 -04:00
Kent Overstreet
fa8e94faee bcachefs: Heap allocate printbufs
This patch changes printbufs to dynamically allocate and reallocate a
buffer as needed. Stack usage has become a bit of a problem, and a major
cause of that has been statically sized string buffers on the stack.

The most involved part of this refactoring is that printbufs must now be
exited with printbuf_exit().

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:25 -04:00
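
A minimal sketch of a heap-allocated, growable print buffer that has to be
released with an explicit exit call - illustrative only, not the actual
printbuf implementation:

    #include <stdarg.h>
    #include <stdio.h>
    #include <stdlib.h>

    struct printbuf {
        char   *buf;
        size_t  size;
        size_t  pos;
    };

    #define PRINTBUF ((struct printbuf) { NULL, 0, 0 })

    static void pr_buf(struct printbuf *out, const char *fmt, ...)
    {
        va_list args;
        int needed;

        va_start(args, fmt);
        needed = vsnprintf(NULL, 0, fmt, args);
        va_end(args);

        if (out->pos + needed + 1 > out->size) {
            out->size = (out->pos + needed + 1) * 2;
            out->buf  = realloc(out->buf, out->size);   /* grows on demand */
        }

        va_start(args, fmt);
        vsnprintf(out->buf + out->pos, out->size - out->pos, fmt, args);
        va_end(args);
        out->pos += needed;
    }

    static void printbuf_exit(struct printbuf *out)
    {
        free(out->buf);
        *out = PRINTBUF;
    }

    int main(void)
    {
        struct printbuf out = PRINTBUF;

        pr_buf(&out, "key at %u:%u ", 1, 2);
        pr_buf(&out, "(%s)", "invalid");
        puts(out.buf);
        printbuf_exit(&out);    /* the new requirement: every printbuf is exited */
        return 0;
    }
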
Kent Overstreet
ae94c78fb1 bcachefs: bch2_trans_mark_key() now takes a bkey_i *
We're now coming up with triggers that modify the update being done. A
bkey_s_c is const - bkey_i is the correct type to be using here.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:25 -04:00
Kent Overstreet
6e44568cc3 bcachefs: Set BTREE_NODE_SEQ() correctly in merge path
BTREE_NODE_SEQ() is supposed to give us a time ordering of btree nodes
on disk, so that we can tell which btree node is newer if we ever have
to scan the entire device to find btree nodes.

The btree node merge path wasn't setting it correctly on the new node -
oops.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:25 -04:00
Kent Overstreet
c7ce27328b bcachefs: Also show when blocked on write locks
This consolidates some of the btree node lock path, so that when we're
blocked taking a write lock on a node it shows up in
bch2_btree_trans_to_text(), along with intent and read locks.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:23 -04:00
Kent Overstreet
35228ecb7e bcachefs: Don't keep nodes in btree_reserve locked
These nodes aren't reachable by other threads, so there's no need to
keep them locked - and this fixes a bug with the assertion in
bch2_trans_unlock() firing on transaction restart.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:23 -04:00
Kent Overstreet
b674bfadd8 bcachefs: Use BTREE_INSERT_USE_RESERVE in btree_update_key()
bch2_btree_update_key() is used in the btree node write path - before
delivering the completion we have to update the parent pointer with the
number of sectors written.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:22 -04:00
Kent Overstreet
669f87a5da bcachefs: Switch to __func__ for recording where btree_trans was initialized
Symbol decoding, via %ps, isn't supported in userspace - this will also
be faster when we're using trans->fn in the fast path, as with the new
BCH_JSET_ENTRY_log journal messages.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:21 -04:00
Kent Overstreet
d8601afca8 bcachefs: Simplify journal replay
With BTREE_ITER_WITH_JOURNAL, there's no longer any restrictions on the
order we have to replay keys from the journal in, and we can also start
up journal reclaim right away - and delete a bunch of code.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:21 -04:00
Kent Overstreet
5222a4607c bcachefs: BTREE_ITER_WITH_JOURNAL
This adds a new btree iterator flag, BTREE_ITER_WITH_JOURNAL, that is
automatically enabled when initializing a btree iterator before journal
replay has completed - it overlays the contents of the journal with the
btree.

This lets us delete bch2_btree_and_journal_walk() and just use the
normal btree iterator interface instead - which also lets us delete a
significant amount of duplicated code.

Note that BTREE_ITER_WITH_JOURNAL is still unoptimized in this patch -
we're redoing the binary search over keys in the journal every time we
call bch2_btree_iter_peek().

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:21 -04:00
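
Conceptually the overlay works like a merge of two sorted key streams,
with the journal side overriding the btree side when both contain the
same key. A small standalone sketch, using plain arrays to stand in for
the btree and the journal keys:

    #include <stdio.h>

    struct kv { int key; const char *val; };

    static void walk_with_journal(const struct kv *btree,   int nr_btree,
                                  const struct kv *journal, int nr_journal)
    {
        int b = 0, j = 0;

        while (b < nr_btree || j < nr_journal) {
            if (j >= nr_journal ||
                (b < nr_btree && btree[b].key < journal[j].key)) {
                printf("%d = %s (btree)\n", btree[b].key, btree[b].val);
                b++;
            } else {
                printf("%d = %s (journal)\n", journal[j].key, journal[j].val);
                if (b < nr_btree && btree[b].key == journal[j].key)
                    b++;        /* journal version overrides the btree's */
                j++;
            }
        }
    }

    int main(void)
    {
        const struct kv btree[]   = { {1, "a"}, {2, "b"}, {4, "d"} };
        const struct kv journal[] = { {2, "B"}, {3, "c"} };

        walk_with_journal(btree, 3, journal, 2);
        return 0;
    }
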
Kent Overstreet
57af63b286 bcachefs: bch2_alloc_sectors_append_ptrs() now takes cached flag
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:20 -04:00
Kent Overstreet
8244f3209b bcachefs: Option improvements
This adds flags for options that must be a power of two (block size and
btree node size), and options that are stored in the superblock as a
power of two (encoded extent max).

Also, options are now stored in memory in the same units they're
displayed in (bytes); we now convert when getting and setting from the
superblock.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:19 -04:00
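
A sketch of the unit handling described above, with invented helper
names: the superblock stores log2 of the value, while memory and the
user-visible interface work in bytes, converting at the get/set boundary.

    #include <assert.h>
    #include <stdio.h>

    static unsigned ilog2(unsigned long long v)
    {
        unsigned r = 0;

        while (v > 1) {
            v >>= 1;
            r++;
        }
        return r;
    }

    /* Setting: the value must be a power of two; store its log2. */
    static unsigned opt_to_sb(unsigned long long bytes)
    {
        assert(bytes && !(bytes & (bytes - 1)));    /* power-of-two check */
        return ilog2(bytes);
    }

    /* Getting: decode back to bytes for in-memory use and display. */
    static unsigned long long opt_from_sb(unsigned sb_val)
    {
        return 1ULL << sb_val;
    }

    int main(void)
    {
        unsigned sb = opt_to_sb(128 * 1024);        /* 128 KiB */

        printf("stored as log2 = %u, read back = %llu bytes\n",
               sb, opt_from_sb(sb));
        return 0;
    }
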
Kent Overstreet
f3e1f44433 bcachefs: BTREE_ITER_NOPRESERVE
This adds a flag to not mark the initial btree_path as preserve, for
paths that we expect to be cheap to reconstitute if necessary - this
solves a btree_path overflow caused by need_whiteout_for_snapshot().

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:19 -04:00
Kent Overstreet
991ba02112 bcachefs: Add more time_stats
This adds more latency/event measurements and breaks some apart into
more events. Journal writes are broken apart into flush writes and
noflush writes, btree compactions are broken out from btree splits,
btree merges are added, as well as btree_interior_updates - foreground
and total.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:18 -04:00
Kent Overstreet
f3cf0999ac bcachefs: bch2_btree_node_rewrite() now returns transaction restarts
We have been getting away from handling transaction restarts locally -
convert bch2_btree_node_rewrite() to the newer style.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:15 -04:00
Kent Overstreet
d355c6f4f7 bcachefs: for_each_btree_node() now returns errors directly
This changes for_each_btree_node() to work like for_each_btree_key(),
and to that end bch2_btree_iter_peek_node() and next_node() also return
error ptrs.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:14 -04:00
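
The change leans on the kernel's error-pointer convention, where a
pointer-returning function can encode an errno in the pointer value
itself. A userspace sketch of that convention, with a stand-in
peek_node() rather than the real iterator functions:

    #include <errno.h>
    #include <stdio.h>

    #define MAX_ERRNO 4095

    static void *ERR_PTR(long err)      { return (void *) err; }
    static long  PTR_ERR(const void *p) { return (long) p; }
    static int   IS_ERR(const void *p)
    {
        return (unsigned long) p >= (unsigned long) -MAX_ERRNO;
    }

    struct btree_node { int level; };

    /* May return a node, NULL when there are no more nodes, or an error
     * pointer when e.g. reading the node failed. */
    static struct btree_node *peek_node(int fail)
    {
        static struct btree_node node = { .level = 1 };

        if (fail)
            return ERR_PTR(-EIO);
        return &node;
    }

    int main(void)
    {
        struct btree_node *n = peek_node(1);

        if (IS_ERR(n))
            printf("error: %ld\n", PTR_ERR(n));     /* -5 (EIO) */
        else if (n)
            printf("node at level %d\n", n->level);
        return 0;
    }
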
Kent Overstreet
d697b9abba bcachefs: More btree iterator fixes
 - check for getting to the end of the btree in bch2_path_verify_locks
   and __btree_path_traverse_all(); this fixes an infinite loop in
   __btree_path_traverse_all().
 - relax the requirement in bch2_btree_node_upgrade() that we must want an
   intent lock; this fixes bugs with paths that point to interior nodes
   (nonzero level).
 - bch2_btree_node_update_key(): fix it to upgrade the path to an intent
   lock, if necessary.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:13 -04:00
Kent Overstreet
71ed0056dc bcachefs: Fix an assertion
We can end up in a strange situation where a btree_path points to a node
being freed even after pointers to it should have been replaced by
pointers to the new node - if the btree node has been reused since the
pointer to it was created.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:13 -04:00
Kent Overstreet
22b383ad7e bcachefs: Kill retry loop in btree merge path
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:11 -04:00
Kent Overstreet
1d3ecd7ea7 bcachefs: Tighten up btree locking invariants
New rule is: if a btree path holds any locks it should be holding
precisely the locks wanted (according to path->level and
path->locks_want).

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:11 -04:00
Kent Overstreet
67e0dd8f0d bcachefs: btree_path
This splits btree_iter into two components: btree_iter is now the
externally visible component, and it points to a btree_path which is now
reference counted.

This means we no longer have to clone iterators up front if they might
be mutated - btree_path can be shared by multiple iterators, and cloned
if an iterator would mutate a shared btree_path. This will help us use
iterators more efficiently, as well as slimming down the main long lived
state in btree_trans, and significantly cleans up the logic for iterator
lifetimes.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:11 -04:00
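
A very small sketch of the sharing model (not the real btree_path and
btree_iter structures): iterators hold references to a shared path, and a
path is only cloned when an iterator is about to mutate one that is still
shared.

    #include <stdio.h>
    #include <stdlib.h>

    struct path {
        int refcount;
        int pos;        /* stand-in for the btree position/level state */
    };

    static struct path *path_get(struct path *p)
    {
        p->refcount++;
        return p;
    }

    static void path_put(struct path *p)
    {
        if (!--p->refcount)
            free(p);
    }

    /* Called before an iterator mutates its path: clone only if shared. */
    static struct path *path_make_mut(struct path *p)
    {
        if (p->refcount > 1) {
            struct path *copy = malloc(sizeof(*copy));

            *copy = *p;
            copy->refcount = 1;
            p->refcount--;
            return copy;
        }
        return p;
    }

    int main(void)
    {
        struct path *a = malloc(sizeof(*a));
        *a = (struct path) { .refcount = 1, .pos = 10 };

        struct path *b = path_get(a);   /* a second iterator shares the path */
        b = path_make_mut(b);           /* about to move it: clone happens */
        b->pos = 20;

        printf("a->pos = %d, b->pos = %d\n", a->pos, b->pos);  /* 10, 20 */
        path_put(a);
        path_put(b);
        return 0;
    }
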
Kent Overstreet
cab8e23373 bcachefs: Add an assertion for removing btree nodes from cache
Chasing a bug that has something to do with the btree node cache.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:11 -04:00
Kent Overstreet
a0a568794d bcachefs: More renaming
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:11 -04:00
Kent Overstreet
f7a966a3e2 bcachefs: Clean up/rename bch2_trans_node_* fns
These utility functions are for managing btree node state within a
btree_trans - rename them for consistency, and drop some unneeded
arguments.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:11 -04:00
Kent Overstreet
78cf784eaa bcachefs: Further reduce iter->trans usage
This is prep work for splitting btree_path out from btree_iter -
btree_path will not have a pointer to btree_trans.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:11 -04:00
Kent Overstreet
9f6bd30703 bcachefs: Reduce iter->trans usage
iter->trans is disfavoured, and should go away.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:10 -04:00
Kent Overstreet
1a488e7306 bcachefs: Kill BTREE_INSERT_NOUNLOCK
With the recent transaction restart changes, it's no longer needed - all
transaction commits have BTREE_INSERT_NOUNLOCK semantics.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:10 -04:00
Kent Overstreet
e5af273fce bcachefs: trans->restarted
Start tracking when btree transactions have been restarted - and assert
that we're always calling bch2_trans_begin() immediately after
transaction restart.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:09 -04:00
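
A sketch of the invariant being enforced, with simplified names: a
restart marks the transaction, and any further use before the retry loop
calls trans_begin() trips an assertion.

    #include <assert.h>
    #include <stdbool.h>
    #include <stdio.h>

    struct trans { bool restarted; };

    static void trans_restart(struct trans *t) { t->restarted = true; }
    static void trans_begin(struct trans *t)   { t->restarted = false; }

    static void trans_update(struct trans *t)
    {
        /* Using a restarted transaction without going back to the top of
         * the retry loop is a bug we want to catch immediately. */
        assert(!t->restarted);
        printf("update ok\n");
    }

    int main(void)
    {
        struct trans t = { false };

        trans_update(&t);       /* fine */
        trans_restart(&t);      /* e.g. lock ordering forced a restart */
        trans_begin(&t);        /* the retry loop must start over first */
        trans_update(&t);       /* fine again */
        return 0;
    }
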
Kent Overstreet
a88171c9e6 bcachefs: Clean up interior update paths
Btree node merging now happens prior to transaction commit, not after,
so we don't need to pay attention to BTREE_INSERT_NOUNLOCK.

Also, foreground_maybe_merge shouldn't be calling
bch2_btree_iter_traverse_all() - this is becoming private to the btree
iterator code and should only be called by bch2_trans_begin().

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:09 -04:00
Kent Overstreet
9f1833cadd bcachefs: Update btree ptrs after every write
This closes a significant hole (and last known hole) in our ability to
verify metadata. Previously, since btree nodes are log structured, we
couldn't detect lost btree writes that weren't the first write to a
given node. Additionally, this seems to have led to some significant
metadata corruption on multi-device filesystems with metadata
replication: since a write may have made it to one device and not
another, if we read that btree node back from the replica that did have
that write and started appending after that point, the other replica
would have a gap in the bset entries and reading from that replica
wouldn't find the rest of the bsets.

But, since updates to interior btree nodes are now journalled, we can
close this hole by updating pointers to btree nodes after every write
with the currently written number of sectors, without negatively
affecting performance. This means we will always detect lost or corrupt
metadata - it also means that our btree is now a curious hybrid of COW
and non COW btrees, with all the benefits of both (excluding
complexity).

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:08 -04:00
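
A sketch of the mechanism with illustrative structures (not the real bkey
layout): after each append write, the parent's pointer to the node is
updated with how many sectors are now valid, so a replica that is missing
the tail of the node can be detected.

    #include <stdio.h>

    struct node_ptr {
        unsigned long long offset;
        unsigned sectors_written;       /* updated after every write */
    };

    struct btree_node {
        unsigned sectors_written;
    };

    static void btree_node_write_done(struct btree_node *n,
                                      struct node_ptr *parent_ptr,
                                      unsigned sectors_this_write)
    {
        n->sectors_written += sectors_this_write;

        /* Journalled interior update: the parent now records exactly how
         * much of the node is supposed to exist on disk. */
        parent_ptr->sectors_written = n->sectors_written;
    }

    int main(void)
    {
        struct btree_node n   = { 0 };
        struct node_ptr   ptr = { .offset = 4096, .sectors_written = 0 };

        btree_node_write_done(&n, &ptr, 8);
        btree_node_write_done(&n, &ptr, 16);

        printf("parent expects %u sectors at offset %llu\n",
               ptr.sectors_written, ptr.offset);    /* 24 sectors */
        return 0;
    }
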
Kent Overstreet
9f6e1f7bb0 bcachefs: Fix an allocator shutdown deadlock
On fstest generic/388, we were seeing sporadic deadlocks in the
emergency shutdown, where we'd get stuck shutting down the allocator
because bch2_btree_update_start() -> bch2_btree_reserve_get() allocated
and then deallocated some btree nodes, putting them back on the
btree_reserve_cache, after the allocator shutdown code had already
cleared out that cache.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:08 -04:00
Kent Overstreet
19d5432445 bcachefs: Really don't hold btree locks while btree IOs are in flight
This is something we've attempted to stick to for quite some time, as it
helps guarantee filesystem latency - but there are a few remaining paths
that this patch fixes.

This is also necessary for an upcoming patch to update btree pointers
after every btree write - since the btree write completion path will now
be doing btree operations.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:08 -04:00
Kent Overstreet
e3a67bdb6e bcachefs: Regularize argument passing of btree_trans
btree_trans should always be passed when we have one - iter->trans is
disfavoured. This mainly updates old code in btree_update_interior.c,
some of which predates btree_trans.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:08 -04:00