Split busy extent tracking from struct xfs_perag into its own private
structure, which can be pointed to by the generic group structure.
Note that this structure is now dynamically allocated instead of embedded
as the upcoming zone XFS code doesn't need it and will also have an
unusually high number of groups due to hardware constraints. Dynamically
allocating the structure this is a big memory saver for this case.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Add helpers to convert an agbno to a daddr or fsbno based on a pag
structure.
This provides a simpler conversion and better type safety compared to the
existing code that passes the mount structure and the agno separately.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Convert the boolean to skip discard on free into a proper flags field so
that we can add more flags in the next patch.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Create a helper function to compute the number of fsblocks needed to
store a maximally-sized extended attribute value.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Andrey Albershteyn <aalbersh@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
In preparation for supporting repair of indexed file-based metadata
(such as realtime bitmaps, directories, and extended attribute data),
add a function to reap the old blocks after a metadata repair finishes.
IOWs, this is an elaborate bunmapi call that deals with crosslinked
blocks by unmapping them without freeing them, and also scans for incore
buffers to invalidate.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
In an upcoming patch, we will need to be able to look for xfs_buf
objects caching file-based metadata blocks without needing to walk the
(possibly corrupt) structures to find all the buffers. Repair already
has most of the code needed to scan the buffer cache, so hoist these
utility functions.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Rebuild the reverse mapping btree from all primary metadata. This first
patch establishes the bare mechanics of finding records and putting
together a new ondisk tree; more complex pieces are needed to make it
work properly.
Link: Documentation/filesystems/xfs-online-fsck-design.rst
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Try to repair errors that we see in file CoW forks so that we don't do
stupid things like remap garbage into a file. There's not a lot we can
do with the COW fork -- the ondisk metadata record only that the COW
staging extents are owned by the refcount btree, which effectively means
that we can't reconstruct this incore structure from scratch.
Actually, this is even worse -- we can't touch written extents, because
those map space that are actively under writeback, and there's not much
to do with delalloc reservations. Hence we can only detect crosslinked
unwritten extents and fix them by punching out the problematic parts and
replacing them with delalloc extents.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Back in commit a55e07308831b ("xfs: only allow reaping of per-AG
blocks in xrep_reap_extents"), we removed from the reaping code the
ability to handle bmbt blocks. At the time, the reaping code only
walked single blocks, didn't correctly detect crosslinked blocks, and
the special casing made the function hard to understand. It was easier
to remove unneeded functionality prior to fixing all the bugs.
Now that we've fixed the problems, we want again the ability to reap
file metadata blocks. Reintroduce the per-file reaping functionality
atop the current implementation. We require that sc->sa is
uninitialized, so that we can use it to hold all the per-AG context for
a given extent.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Move struct xagb_bitmap to its own pair of C and header files per
request of Christoph.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Create a version of the xbitmap that handles 32-bit integer intervals
and adapt the xfs_agblock_t bitmap to use it. This reduces the size of
the interval tree nodes from 48 to 36 bytes and enables us to use a more
efficient slab (:0000040 instead of :0000048) which allows us to pack
more nodes into a single slab page (102 vs 85).
As a side effect, the users of these bitmaps no longer have to convert
between u32 and u64 quantities just to use the bitmap; and the hairy
overflow checking code in xagb_bitmap_test goes away.
Later in this patchset we're going to add bitmaps for xfs_agino_t,
xfs_rgblock_t, and xfs_dablk_t, so the increase in code size (5622 vs.
9959 bytes) seems worth it.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Overall, this function tries to find and invalidate all buffers for a
given extent of space on the data device. The inner for loop in this
function tries to find all xfs_bufs for a given daddr. The lengths of
all possible cached buffers range from 1 fsblock to the largest needed
to contain a 64k xattr value (~17fsb). The scan is capped to avoid
looking at anything buffer going past the given extent.
Unfortunately, the loop continuation test is wrong -- max_fsbs is the
largest size we want to scan, not one past that. Put another way, this
loop is actually 1-indexed, not 0-indexed. Therefore, the continuation
test should use <=, not <.
As a result, online repairs of btree blocks fails to stale any buffers
for btrees that are being torn down, which causes later assertions in
the buffer cache when another thread creates a different-sized buffer.
This happens in xfs/709 when allocating an inode cluster buffer:
------------[ cut here ]------------
WARNING: CPU: 0 PID: 3346128 at fs/xfs/xfs_message.c:104 assfail+0x3a/0x40 [xfs]
CPU: 0 PID: 3346128 Comm: fsstress Not tainted 6.7.0-rc4-djwx #rc4
RIP: 0010:assfail+0x3a/0x40 [xfs]
Call Trace:
<TASK>
_xfs_buf_obj_cmp+0x4a/0x50
xfs_buf_get_map+0x191/0xba0
xfs_trans_get_buf_map+0x136/0x280
xfs_ialloc_inode_init+0x186/0x340
xfs_ialloc_ag_alloc+0x254/0x720
xfs_dialloc+0x21f/0x870
xfs_create_tmpfile+0x1a9/0x2f0
xfs_rename+0x369/0xfd0
xfs_vn_rename+0xfa/0x170
vfs_rename+0x5fb/0xc30
do_renameat2+0x52d/0x6e0
__x64_sys_renameat2+0x4b/0x60
do_syscall_64+0x3b/0xe0
entry_SYSCALL_64_after_hwframe+0x46/0x4e
A later refactoring patch in the online repair series fixed this by
accident, which is why I didn't notice this until I started testing only
the patches that are likely to end up in 6.8.
Fixes: 1c7ce115e521 ("xfs: reap large AG metadata extents when possible")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
Introduce the concept of a defer ops barrier to separate consecutively
queued pending work items of the same type. With a barrier in place,
the two work items will be tracked separately, and receive separate log
intent items. The goal here is to prevent reaping of old metadata
blocks from creating unnecessarily huge EFIs that could then run the
risk of overflowing the scrub transaction.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
xfs_free_extent_later is a trivial helper, so remove it to reduce the
amount of thinking required to understand the deferred freeing
interface. This will make it easier to introduce automatic reaping of
speculative allocations in the next patch.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
The AGFL repair code uses a series of bitmaps to figure out where there
are OWN_AG blocks that are not claimed by the free space and rmap
btrees. These blocks become the new AGFL, and any overflow is reaped.
The bitmaps current track xfs_fsblock_t even though we already know the
AG number.
In the last patch, we introduced a new bitmap "type" for tracking
xfs_agblock_t extents. Port the reaping code and the AGFL repair to use
this new type, which makes it very obvious what we're tracking. This
also eliminates a bunch of unnecessary agblock <-> fsblock conversions.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
When we're freeing extents that have been set in a bitmap, break the
bitmap extent into multiple sub-extents organized by fate, and reap the
extents. This enables us to dispose of old resources more efficiently
than doing them block by block.
While we're at it, rename the reaping functions to make it clear that
they're reaping per-AG extents.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
After an online repair, we need to invalidate buffers representing the
blocks from the old metadata that we're replacing. It's possible that
parts of a tree that were previously cached in memory are no longer
accessible due to media failure or other corruption on interior nodes,
so repair figures out the old blocks from the reverse mapping data and
scans the buffer cache directly.
In other words, online fsck needs to find all the live (i.e. non-stale)
buffers for a range of fsblocks so that it can invalidate them.
Unfortunately, the current buffer cache code triggers asserts if the
rhashtable lookup finds a non-stale buffer of a different length than
the key we searched for. For regular operation this is desirable, but
for this repair procedure, we don't care since we're going to forcibly
stale the buffer anyway. Add an internal lookup flag to avoid the
assert. Skip buffers that are already XBF_STALE.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Rearrange the logic inside xrep_reap_block to make it more obvious that
crosslinked metadata blocks are handled differently. Add a couple of
tracepoints so that we can tell what's going on at the end of a btree
rebuild operation.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Use deferred frees (EFIs) to reap the blocks of a btree that we just
replaced. This helps us to shrink the window in which those old blocks
could be lost due to a system crash, though we try to flush the EFIs
every few hundred blocks so that we don't also overflow the transaction
reservations during and after we commit the new btree.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Now that we've refactored btree cursors to require the caller to pass in
a perag structure, there are numerous problems in xrep_reap_extents if
it's being called to reap extents for an inode metadata repair. We
don't have any repair functions that can do that, so drop the support
for now.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
When we're discarding old btree blocks after a repair, only invalidate
the buffers for the ones that we're freeing -- if the metadata was
crosslinked with another data structure, we don't want to touch it.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reaping blocks after a repair is a complicated affair involving a lot of
rmap btree lookups and figuring out if we're going to unmap or free old
metadata blocks that might be crosslinked. Eventually, we will need to
be able to reap per-AG metadata blocks, bmbt blocks from inode forks,
garbage CoW staging extents, and (even later) blocks from btrees rooted
in inodes. This results in a lot of reaping code, so we might as well
split that off while it's easy.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>