refactoring the BTREE_ITER_WITH_UPDATES code, prep for removing the flag
and making it always-on
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
XXX: we're allocating memory with btree locks held - bad
We need to plumb through an error path so we can do
allocate_dropping_locks() - but we're merging this now because it fixes
a transaction path overflow caused by indirect extent fragmentation, and
the resize path is rare.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Start to plumb through dynamically growable btree_paths; this patch
replaces most BTREE_ITER_MAX references with trans->nr_paths.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
the reflink triggers are also bumping up against the maximum number of
paths in a transaction - and generating proportional numbers of updates.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
- Some tweaks to greatly reduce locking overhead for the list of btree
transactions, so that it can always be enabled: leave btree_trans
objects on the list when they're on the percpu single item freelist,
and only check for duplicates in the same process when
CONFIG_BCACHEFS_DEBUG is enabled
- don't zero out the full btree_trans() unless we allocated it from
the mempool
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Upcoming patches are going to be changing trans->paths to a
reallocatable buffer. We need to guard against use after free when it's
used by other threads; this introduces RCU protection to those paths and
changes them to check for trans->paths == NULL
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
path->idx is now a code smell: we should be using path_idx_t, since it's
stable across btree path reallocation.
This is also a bit faster, using the same loop counter vs. fetching
path->idx from each path we iterate over.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
In the future we'll be making trans->paths resizable and potentially
having _many_ more paths (for fsck); we need to start fixing algorithms
that walk each path in a transaction where possible.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Instead of using a darray, we now allocate journal entries for the
transaction commit path with our normal bump allocator - with an inlined
fastpath, and using btree_transaction_stats to remember how much to
initially allocate so as to avoid transaction restarts.
This is prep work for converting write buffer updates to use this
mechanism.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
These were for extra info in tracepoints for debugging a specialized
issue - we do not want to bloat btree_path for this, at least in release
builds.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
In the CI, we're seeing tests failing due to excessive would_deadlock
transaction restarts - the tracepoint now includes the lock cycle that
occured.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Now we can print out filesystem flags in sysfs, useful for debugging
various "what's my filesystem doing" issues.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
For BTREE_ITER_WITH_JOURNAL, we memoize lookups in the journal keys, to
avoid the binary search overhead.
Previously we stashed the pos of the last key returned from the journal,
in order to force the lookup to be redone when rewinding.
Now bch2_journal_keys_peek_upto() handles rewinding itself when
necessary - so we can slim down btree_iter.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
As discussed in the previous patch, BTREE_ITER_ALL_LEVELS appears to be
racy with concurrent interior node updates - and perhaps it is fixable,
but it's tricky and unnecessary.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
path->level was being read, but never used.
Reported-by: Colin Ian King <colin.i.king@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
peek_upto() checks against the end position and bails out before
FILTER_SNAPSHOTS checks; this is because if we end up at a different
inode number than the original search key none of the keys we see might
be visibile in the current snapshot - we might be looking at inode in a
completely different subvolume.
But this is broken, because when we're iterating over extents we're
checking against the extent start position to decide when to bail out,
and the extent start position isn't monotonically increasing until after
we've run FILTER_SNAPSHOTS.
Fix this by adding a simple inode number check where the old bailout
check was, and moving the main check to the correct position.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Reported-by: "Carl E. Thompson" <list-bcachefs@carlthompson.net>
When bch2_fs_alloc() gets an error before calling
bch2_fs_btree_iter_init(), bch2_fs_btree_iter_exit() makes an invalid
memory access because btree_trans_list is uninitialized.
Signed-off-by: Thomas Bertschinger <tahbertschinger@gmail.com>
Fixes: 6bd68ec266 ("bcachefs: Heap allocate btree_trans")
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The btree iterator code overlays keys from the journal until journal
replay is finished; since we're now starting copygc/rebalance etc.
before replay is finished, this is multithreaded access and thus needs
refcounting.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This deletes the complicated and somewhat expensive journal
pre-reservation machinery in favor of just using journal watermarks:
when the journal is more than half full, we run journal reclaim more
aggressively, and when the journal is more than 3/4s full we only allow
journal reclaim to get new journal reservations.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
BTREE_INSERT_NOJOURNAL is primarily used for a performance optimization
related to inode updates and fsync - document it.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The SRCU read lock that btree_trans takes exists to make it safe for
bch2_trans_relock() to deref pointers to btree nodes/key cache items we
don't have locked, but as a side effect it blocks reclaim from freeing
those items.
Thus, it's important to not hold it for too long: we need to
differentiate between bch2_trans_unlock() calls that will be only for a
short duration, and ones that will be for an unbounded duration.
This introduces bch2_trans_unlock_long(), to be used mainly by the data
move paths.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We should only be downgrading locks on success - otherwise, our
transaction restarts won't be getting the correct locks and we'll
livelock.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Since we can run with unknown btree IDs, we can't directly index btree
IDs into fixed size arrays.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We're using more stack than we'd like in a number of functions, and
btree_trans is the biggest object that we stack allocate.
But we have to do a heap allocatation to initialize it anyways, so
there's no real downside to heap allocating the entire thing.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
There are several spelling mistakes in error messages. Fix these.
Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This changes mark_btree_node_locked() to take an enum
btree_node_locked_type, not a six_lock_type, since BTREE_NODE_UNLOCKED
is -1 which may cause problems converting back and forth to
six_lock_type if short enums are in use.
With this change, we never store BTREE_NODE_UNLOCKED in a six_lock_type
enum.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
subvolume.c has gotten a bit large, this splits out a separate file just
for managing snapshot trees - BTREE_ID_snapshots.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This fixes a bug in the cycle detector, bch2_check_for_deadlock() - we
have to make sure the node pointers in the btree paths array are set to
something not-garbage before another thread may see them.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Split out a new file from recovery.c for managing the list of keys we
read from the journal: before journal replay finishes the btree iterator
code needs to be able to iterate over and return keys from the journal
as well, so there's a fair bit of code here.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
clang had a few more warnings about enum conversion, and also didn't
like the opts.c initializer.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
- endianness fixes
- mark some things static
- fix a few __percpu annotations
- fix silent enum conversions
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We need to allow filesystems with metadata from newer versions to be
mountable and usable by older versions.
This patch enables us to roll out new btrees without a new major version
number; we can now handle btree roots for unknown btree types.
The unknown btree roots will be retained, and fsck (including
backpointers) will check them, the same as other btree types.
We add a dynamic array for the extra, unknown btree roots, in addition
to the fixed size btree root array, and add new helpers for looking up
btree roots.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We can't be holding btree_trans_lock while copying to user space, which
might incur a page fault. To fix this, convert it to a seqmutex so we
can unlock/relock.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Using drop_locks_do() ensures that every unlock() is paired with a
relock(), with proper error checking.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
When allocating memory, gfp flags should generally be
- GFP_NOWAIT|__GFP_NOWARN if btree locks are held
- GFP_NOFS if in the IO path or otherwise holding resources needed for
IO submission
- GFP_KERNEL otherwise
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Add a new helper for the common pattern of:
- trans_unlock()
- do something
- trans_relock()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
bch2_btree_trans_to_text() is used on btree_trans objects that are owned
by different threads - when printing out deadlock cycles - so we need a
safe version of trans_for_each_path(), else we race with seeing a
btree_path that was just allocated and not fully initialized:
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
As suggested by Linus, this drops the six_lock_state union in favor of
raw bitmasks.
On the one hand, bitfields give more type-level structure to the code.
However, a significant amount of the code was working with
six_lock_state as a u64/atomic64_t, and the conversions from the
bitfields to the u64 were deemed a bit too out-there.
More significantly, because bitfield order is poorly defined (#ifdef
__LITTLE_ENDIAN_BITFIELD can be used, but is gross), incrementing the
sequence number would overflow into the rest of the bitfield if the
compiler didn't put the sequence number at the high end of the word.
The new code is a bit saner when we're on an architecture without real
atomic64_t support - all accesses to lock->state now go through
atomic64_*() operations.
On architectures with real atomic64_t support, we additionally use
atomic bit ops for setting/clearing individual bits.
Text size: 7467 bytes -> 4649 bytes - compilers still suck at
bitfields.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
bch2_path_put_nokeep() is sketchy, and we should consider removing it:
it unconditionally frees btree_paths once their ref hits 0.
The assumption is that we only use it for paths that have never been
visible outside the btree core btree code; i.e. higher level code will
never be making assumptions about locking based on these paths.
However, there's subtle brokenness with this approach:
- If we call bch2_path_put(), then bch2_path_put_nokeep(),
bch2_path_put() may free the first path on the assumption that we we
have another path keeping a node locked - but then
bch2_path_put_nokeep() just unconditionally frees it.
The same bug may arise if we're calling bch2_path_put() and
bch2_path_put_nokeep() on the same (refcounted) path, or two adjacent
paths that point to the same btree node.
This patch hacks around one of these bugs by calling
bch2_path_put_nokeep() first in bch2_trans_iter_exit.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This adds private error codes for most (but not all) of our ENOMEM uses,
which makes it easier to track down assorted allocation failures.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Soon, __bch2_btree_node_write() is going to require a btree_trans: zoned
device support is going to require a new allocation for every btree node
write. This is a bit of prep work.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
A thread should never be using more than one btree_trans - doing so is
an invitation for deadlocks.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This adds a new method of doing btree updates - a straight write buffer,
implemented as a flat fixed size array.
This is only useful when we don't need to read from the btree in order
to do the update, and when reading is infrequent - perfect for the LRU
btree.
This will make LRU btree updates fast enough that we'll be able to use
it for persistently indexing buckets by fragmentation, which will be a
massive boost to copygc performance.
Changes:
- A new btree_insert_type enum, for btree_insert_entries. Specifies
btree, btree key cache, or btree write buffer.
- bch2_trans_update_buffered(): updates via the btree write buffer
don't need a btree path, so we need a new update path.
- Transaction commit path changes:
The update to the btree write buffer both mutates global, and can
fail if there isn't currently room. Therefore we do all write buffer
updates in the transaction all at once, and also if it fails we have
to revert filesystem usage counter changes.
If there isn't room we flush the write buffer in the transaction
commit error path and retry.
- A new persistent option, for specifying the number of entries in the
write buffer.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
When we unlock in order to submit IO, the next relock event is likely to
fail if submit_bio() blocked - we shouldn't those events in our _fail
stats, since those are expected events and shouldn't cause test
failures.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We need to call bch2_trans_update_max_paths() before marking the new
path as allocated, since we're not initializing it yet.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
It's important that in BTREE_ITER_FILTER_SNAPSHOTS mode we always use
peek_upto() and provide an end for the interval we're searching for -
otherwise, when we hit the end of the inode the next inode be in a
different subvolume and not have any keys in the current snapshot, and
we'd iterate over arbitrarily many keys before returning one.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This uses the new _ip() interface to six locks and hooks it up to
btree_path->ip_allocated, when available.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This replaces various BUG_ON() assertions with panics that tell us where
the restart was done and the restart type.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
In debug mode, we now track where btree iterators and paths are
initialized/allocated - helpful in tracking down btree path overflows.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This is something we need to do more widely: instead of bothering with
GFP_NOIO/GFP_NOFS, if we need to allocate memory while holding locks:
- first attempt the allocation with GFP_NOWAIT
- if that fails, drop btree locks with bch2_trans_unlock(), then
retry with GFP_KERNEL.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We need to take a ref on a path while we're traversing it: this fixes a
bug with paths getting reused while being traversed, in the key cache
fill code.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
While a btree transaction is running, we hold a SRCU read lock on the
btree key cache that prevents btree key cache keys from being freed -
this is so that relock() operations won't access freed memory.
The downside of this is that long running btree transactions prevent
memory from being freed from the key cache. This adds a check in
bch2_trans_begin() - if the transaction has been running longer than 1
second, drop and retake the SRCU read lock and zero out pointers to
unlock key cache paths.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
When we started stashing the key being overwritten in
btree_insert_entry, this introduced a typical iterator invalidation
problem, triggered by btree node splits or resorts.
Previously, dealt with this by unconditionally re-validating those
stashed pointers in the transaction commit path. This patch gets rid of
that by doing it only when needed, in bch2_trans_node_add() or
bch2_trans_node_reinit_iter().
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
bch2_btree_iter_peek_upto() in snapshots mode may need to keep a
btree_path for the insert position, not just the position of the key
we're returning. The code was incorrectly assuming this would be in the
same btree node - we were missing a bch2_btree_path_traverse() call.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This fixes a (harmless) broken invariant in __bch2_btree_path_set_pos():
iterators to interior nodes should point to the first non whiteout.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This switches btree_key_cache_fill() to use a btree iterator, not a
btree path, so that it can search for keys in previous snapshots.
We also add another iterator flag, BTREE_ITER_KEY_CACHE_FILL, to avoid
recursion back into the key cache.
This will allow us to re-enable the key cache for inodes in the next
patch.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
There was no reason for this to be a separate helper - we always want
the relock call that btree_trans_peek_key_cache() did.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This patch introduces
- bpos_eq()
- bpos_lt()
- bpos_le()
- bpos_gt()
- bpos_ge()
and equivalent replacements for bkey_cmp().
Looking at the generated assembly these could probably be improved
further, but we already see a significant code size improvement with
this patch.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
When flags & btree_id are constants, we can constant fold the entire
calculation of the actual iterator flags - and the whole thing becomes
small enough to inline.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>