This patch series fixes a data corruption that occurs in a specific
multi-threaded write workload. The workload combined
racing unaligned adjacent buffered writes with low memory conditions
that caused both writeback and memory reclaim to race with the
writes.
The result of this was random partial blocks containing zeroes
instead of the correct data. The underlying problem is that iomap
caches the write iomap for the duration of the write() operation,
but it fails to take into account that the extent underlying the
iomap can change whilst the write is in progress.
The short story is that an iomap can span mutliple folios, and so
under low memory writeback can be cleaning folios the write()
overlaps. Whilst the overlapping data is cached in memory, this
isn't a problem, but because the folios are now clean they can be
reclaimed. Once reclaimed, the write() does the wrong thing when
re-instantiating partial folios because the iomap no longer reflects
the underlying state of the extent. e.g. it thinks the extent is
unwritten, so it zeroes the partial range, when in fact the
underlying extent is now written and so it should have read the data
from disk. This is how we get random zero ranges in the file
instead of the correct data.
The gory details of the race condition can be found here:
https://lore.kernel.org/linux-xfs/20220817093627.GZ3600936@dread.disaster.area/
Fixing the problem has two aspects. The first aspect of the problem
is ensuring that iomap can detect a stale cached iomap during a
write in a race-free manner. We already do this stale iomap
detection in the writeback path, so we have a mechanism for
detecting that the iomap backing the data range may have changed
and needs to be remapped.
In the case of the write() path, we have to ensure that the iomap is
validated at a point in time when the page cache is stable and
cannot be reclaimed from under us. We also need to validate the
extent before we start performing any modifications to the folio
state or contents. Combine these two requirements together, and the
only "safe" place to validate the iomap is after we have looked up
and locked the folio we are going to copy the data into, but before
we've performed any initialisation operations on that folio.
If the iomap fails validation, we then mark it stale, unlock the
folio and end the write. This effectively means a stale iomap
results in a short write. Filesystems should already be able to
handle this, as write operations can end short for many reasons and
need to iterate through another mapping cycle to be completed. Hence
the iomap changes needed to detect and handle stale iomaps during
write() operations is relatively simple...
However, the assumption is that filesystems should already be able
to handle write failures safely, and that's where the second
(first?) part of the problem exists. That is, handling a partial
write is harder than just "punching out the unused delayed
allocation extent". This is because mmap() based faults can race
with writes, and if they land in the delalloc region that the write
allocated, then punching out the delalloc region can cause data
corruption.
This data corruption problem is exposed by generic/346 when iomap is
converted to detect stale iomaps during write() operations. Hence
write failure in the filesytems needs to handle the fact that the
write() in progress doesn't necessarily own the data in the page
cache over the range of the delalloc extent it just allocated.
As a result, we can't just truncate the page cache over the range
the write() didn't reach and punch all the delalloc extent. We have
to walk the page cache over the untouched range and skip over any
dirty data region in the cache in that range. Which is ....
non-trivial.
That is, iterating the page cache has to handle partially populated
folios (i.e. block size < page size) that contain data. The data
might be discontiguous within a folio. Indeed, there might be
*multiple* discontiguous data regions within a single folio. And to
make matters more complex, multi-page folios mean we just don't know
how many sub-folio regions we might have to iterate to find all
these regions. All the corner cases between the conversions and
rounding between filesystem block size, folio size and multi-page
folio size combined with unaligned write offsets kept breaking my
brain.
However, if we convert the code to track the processed
write regions by byte ranges instead of fileystem block or page
cache index, we could simply use mapping_seek_hole_data() to find
the start and end of each discrete data region within the range we
needed to scan. SEEK_DATA finds the start of the cached data region,
SEEK_HOLE finds the end of the region. These are byte based
interfaces that understand partially uptodate folio regions, and so
can iterate discrete sub-folio data regions directly. This largely
solved the problem of discovering the dirty regions we need to keep
the delalloc extent over.
However, to use mapping_seek_hole_data() without needing to export
it, we have to move all the delalloc extent cleanup to the iomap
core and so now the iomap core can clean up delayed allocation
extents in a safe, sane and filesystem neutral manner.
With all this done, the original data corruption never occurs
anymore, and we now have a generic mechanism for ensuring that page
cache writes do not do the wrong thing when writeback and reclaim
change the state of the physical extent and/or page cache contents
whilst the write is in progress.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQJIBAABCgAyFiEEmJOoJ8GffZYWSjj/regpR/R1+h0FAmOFSzwUHGRhdmlkQGZy
b21vcmJpdC5jb20ACgkQregpR/R1+h3djhAAwOf9VeLO7TW/0B1XfE3ktWGiDmEG
ekB8mkB7CAHB9SBq7TZMHjktJIJxY81q5+Iq9qHGiW3asoVbmWvkeRSJgXljhTby
D2KsUIT1NK/X6DhC9FhNjv/Q2GJ0nY6s65RLudUEkelYBFhGMM0kdXX+fZmtZ4yT
T/lRYk/KBFpeQCaGRcFXK55TnB/B9muOI9FyKvh2DNWe6r0Xu3Obb3a9k+snZA9R
EeUpAosDSrXzP4c2w2ovpU2eutUdo4eYTHIzXKGkhktbRhmCRLn4NlxvFCanoe8h
eSS85sb8DHUh2iyaaB8yrJ6LL3MuBytOi24rNBeyd1KAyEtT21+cTUK/QAahzble
pL8l6TA7ZXbhYcbk5uQvFEIAInR+0ffjde61uE14N55awq0Vdrym7C7D2ri60iw6
ts45AVYKYeF61coAbwvmaJyvqvQ0tUlmVZXI4lQzN2O17Lr6004gJFMjDRsXXU7H
eHLUt496Geq39rglw85y8G0vmxxGZ9iIGkeC1kUSSCmlvx/JfuJlbWBgyMGtNRBI
qzv0jmk67Ft1seQSMWQJttxCZs4uOF2gwERYGAF7iUR8F4PGob/N1e2/hpg75G8q
0S8u1N1p8Cv5u/jwybqy8FnSC2MlUZl6SQURaVQDy2DLMKHb4T1diu0jrCbiSPiF
JKfQ7aNQxaEZIJw=
=cv9i
-----END PGP SIGNATURE-----
Merge tag 'xfs-iomap-stale-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs into xfs-6.2-mergeB
xfs, iomap: fix data corruption due to stale cached iomaps
This patch series fixes a data corruption that occurs in a specific
multi-threaded write workload. The workload combined
racing unaligned adjacent buffered writes with low memory conditions
that caused both writeback and memory reclaim to race with the
writes.
The result of this was random partial blocks containing zeroes
instead of the correct data. The underlying problem is that iomap
caches the write iomap for the duration of the write() operation,
but it fails to take into account that the extent underlying the
iomap can change whilst the write is in progress.
The short story is that an iomap can span mutliple folios, and so
under low memory writeback can be cleaning folios the write()
overlaps. Whilst the overlapping data is cached in memory, this
isn't a problem, but because the folios are now clean they can be
reclaimed. Once reclaimed, the write() does the wrong thing when
re-instantiating partial folios because the iomap no longer reflects
the underlying state of the extent. e.g. it thinks the extent is
unwritten, so it zeroes the partial range, when in fact the
underlying extent is now written and so it should have read the data
from disk. This is how we get random zero ranges in the file
instead of the correct data.
The gory details of the race condition can be found here:
https://lore.kernel.org/linux-xfs/20220817093627.GZ3600936@dread.disaster.area/
Fixing the problem has two aspects. The first aspect of the problem
is ensuring that iomap can detect a stale cached iomap during a
write in a race-free manner. We already do this stale iomap
detection in the writeback path, so we have a mechanism for
detecting that the iomap backing the data range may have changed
and needs to be remapped.
In the case of the write() path, we have to ensure that the iomap is
validated at a point in time when the page cache is stable and
cannot be reclaimed from under us. We also need to validate the
extent before we start performing any modifications to the folio
state or contents. Combine these two requirements together, and the
only "safe" place to validate the iomap is after we have looked up
and locked the folio we are going to copy the data into, but before
we've performed any initialisation operations on that folio.
If the iomap fails validation, we then mark it stale, unlock the
folio and end the write. This effectively means a stale iomap
results in a short write. Filesystems should already be able to
handle this, as write operations can end short for many reasons and
need to iterate through another mapping cycle to be completed. Hence
the iomap changes needed to detect and handle stale iomaps during
write() operations is relatively simple...
However, the assumption is that filesystems should already be able
to handle write failures safely, and that's where the second
(first?) part of the problem exists. That is, handling a partial
write is harder than just "punching out the unused delayed
allocation extent". This is because mmap() based faults can race
with writes, and if they land in the delalloc region that the write
allocated, then punching out the delalloc region can cause data
corruption.
This data corruption problem is exposed by generic/346 when iomap is
converted to detect stale iomaps during write() operations. Hence
write failure in the filesytems needs to handle the fact that the
write() in progress doesn't necessarily own the data in the page
cache over the range of the delalloc extent it just allocated.
As a result, we can't just truncate the page cache over the range
the write() didn't reach and punch all the delalloc extent. We have
to walk the page cache over the untouched range and skip over any
dirty data region in the cache in that range. Which is ....
non-trivial.
That is, iterating the page cache has to handle partially populated
folios (i.e. block size < page size) that contain data. The data
might be discontiguous within a folio. Indeed, there might be
*multiple* discontiguous data regions within a single folio. And to
make matters more complex, multi-page folios mean we just don't know
how many sub-folio regions we might have to iterate to find all
these regions. All the corner cases between the conversions and
rounding between filesystem block size, folio size and multi-page
folio size combined with unaligned write offsets kept breaking my
brain.
However, if we convert the code to track the processed
write regions by byte ranges instead of fileystem block or page
cache index, we could simply use mapping_seek_hole_data() to find
the start and end of each discrete data region within the range we
needed to scan. SEEK_DATA finds the start of the cached data region,
SEEK_HOLE finds the end of the region. These are byte based
interfaces that understand partially uptodate folio regions, and so
can iterate discrete sub-folio data regions directly. This largely
solved the problem of discovering the dirty regions we need to keep
the delalloc extent over.
However, to use mapping_seek_hole_data() without needing to export
it, we have to move all the delalloc extent cleanup to the iomap
core and so now the iomap core can clean up delayed allocation
extents in a safe, sane and filesystem neutral manner.
With all this done, the original data corruption never occurs
anymore, and we now have a generic mechanism for ensuring that page
cache writes do not do the wrong thing when writeback and reclaim
change the state of the physical extent and/or page cache contents
whilst the write is in progress.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
* tag 'xfs-iomap-stale-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs:
xfs: drop write error injection is unfixable, remove it
xfs: use iomap_valid method to detect stale cached iomaps
iomap: write iomap validity checks
xfs: xfs_bmap_punch_delalloc_range() should take a byte range
iomap: buffered write failure should not truncate the page cache
xfs,iomap: move delalloc punching to iomap
xfs: use byte ranges for write cleanup ranges
xfs: punching delalloc extents on write failure is racy
xfs: write page faults in iomap are not buffered writes
With the changes to scan the page cache for dirty data to avoid data
corruptions from partial write cleanup racing with other page cache
operations, the drop writes error injection no longer works the same
way it used to and causes xfs/196 to fail. This is because xfs/196
writes to the file and populates the page cache before it turns on
the error injection and starts failing -overwrites-.
The result is that the original drop-writes code failed writes only
-after- overwriting the data in the cache, followed by invalidates
the cached data, then punching out the delalloc extent from under
that data.
On the surface, this looks fine. The problem is that page cache
invalidation *doesn't guarantee that it removes anything from the
page cache* and it doesn't change the dirty state of the folio. When
block size == page size and we do page aligned IO (as xfs/196 does)
everything happens to align perfectly and page cache invalidation
removes the single page folios that span the written data. Hence the
followup delalloc punch pass does not find cached data over that
range and it can punch the extent out.
IOWs, xfs/196 "works" for block size == page size with the new
code. I say "works", because it actually only works for the case
where IO is page aligned, and no data was read from disk before
writes occur. Because the moment we actually read data first, the
readahead code allocates multipage folios and suddenly the
invalidate code goes back to zeroing subfolio ranges without
changing dirty state.
Hence, with multipage folios in play, block size == page size is
functionally identical to block size < page size behaviour, and
drop-writes is manifestly broken w.r.t to this case. Invalidation of
a subfolio range doesn't result in the folio being removed from the
cache, just the range gets zeroed. Hence after we've sequentially
walked over a folio that we've dirtied (via write data) and then
invalidated, we end up with a dirty folio full of zeroed data.
And because the new code skips punching ranges that have dirty
folios covering them, we end up leaving the delalloc range intact
after failing all the writes. Hence failed writes now end up
writing zeroes to disk in the cases where invalidation zeroes folios
rather than removing them from cache.
This is a fundamental change of behaviour that is needed to avoid
the data corruption vectors that exist in the old write fail path,
and it renders the drop-writes injection non-functional and
unworkable as it stands.
As it is, I think the error injection is also now unnecessary, as
partial writes that need delalloc extent are going to be a lot more
common with stale iomap detection in place. Hence this patch removes
the drop-writes error injection completely. xfs/196 can remain for
testing kernels that don't have this data corruption fix, but those
that do will report:
xfs/196 3s ... [not run] XFS error injection drop_writes unknown on this kernel.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Now that iomap supports a mechanism to validate cached iomaps for
buffered write operations, hook it up to the XFS buffered write ops
so that we can avoid data corruptions that result from stale cached
iomaps. See:
https://lore.kernel.org/linux-xfs/20220817093627.GZ3600936@dread.disaster.area/
or the ->iomap_valid() introduction commit for exact details of the
corruption vector.
The validity cookie we store in the iomap is based on the type of
iomap we return. It is expected that the iomap->flags we set in
xfs_bmbt_to_iomap() is not perturbed by the iomap core and are
returned to us in the iomap passed via the .iomap_valid() callback.
This ensures that the validity cookie is always checking the correct
inode fork sequence numbers to detect potential changes that affect
the extent cached by the iomap.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
A recent multithreaded write data corruption has been uncovered in
the iomap write code. The core of the problem is partial folio
writes can be flushed to disk while a new racing write can map it
and fill the rest of the page:
writeback new write
allocate blocks
blocks are unwritten
submit IO
.....
map blocks
iomap indicates UNWRITTEN range
loop {
lock folio
copyin data
.....
IO completes
runs unwritten extent conv
blocks are marked written
<iomap now stale>
get next folio
}
Now add memory pressure such that memory reclaim evicts the
partially written folio that has already been written to disk.
When the new write finally gets to the last partial page of the new
write, it does not find it in cache, so it instantiates a new page,
sees the iomap is unwritten, and zeros the part of the page that
it does not have data from. This overwrites the data on disk that
was originally written.
The full description of the corruption mechanism can be found here:
https://lore.kernel.org/linux-xfs/20220817093627.GZ3600936@dread.disaster.area/
To solve this problem, we need to check whether the iomap is still
valid after we lock each folio during the write. We have to do it
after we lock the page so that we don't end up with state changes
occurring while we wait for the folio to be locked.
Hence we need a mechanism to be able to check that the cached iomap
is still valid (similar to what we already do in buffered
writeback), and we need a way for ->begin_write to back out and
tell the high level iomap iterator that we need to remap the
remaining write range.
The iomap needs to grow some storage for the validity cookie that
the filesystem provides to travel with the iomap. XFS, in
particular, also needs to know some more information about what the
iomap maps (attribute extents rather than file data extents) to for
the validity cookie to cover all the types of iomaps we might need
to validate.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
All the callers of xfs_bmap_punch_delalloc_range() jump through
hoops to convert a byte range to filesystem blocks before calling
xfs_bmap_punch_delalloc_range(). Instead, pass the byte range to
xfs_bmap_punch_delalloc_range() and have it do the conversion to
filesystem blocks internally.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
iomap_file_buffered_write_punch_delalloc() currently invalidates the
page cache over the unused range of the delalloc extent that was
allocated. While the write allocated the delalloc extent, it does
not own it exclusively as the write does not hold any locks that
prevent either writeback or mmap page faults from changing the state
of either the page cache or the extent state backing this range.
Whilst xfs_bmap_punch_delalloc_range() already handles races in
extent conversion - it will only punch out delalloc extents and it
ignores any other type of extent - the page cache truncate does not
discriminate between data written by this write or some other task.
As a result, truncating the page cache can result in data corruption
if the write races with mmap modifications to the file over the same
range.
generic/346 exercises this workload, and if we randomly fail writes
(as will happen when iomap gets stale iomap detection later in the
patchset), it will randomly corrupt the file data because it removes
data written by mmap() in the same page as the write() that failed.
Hence we do not want to punch out the page cache over the range of
the extent we failed to write to - what we actually need to do is
detect the ranges that have dirty data in cache over them and *not
punch them out*.
To do this, we have to walk the page cache over the range of the
delalloc extent we want to remove. This is made complex by the fact
we have to handle partially up-to-date folios correctly and this can
happen even when the FSB size == PAGE_SIZE because we now support
multi-page folios in the page cache.
Because we are only interested in discovering the edges of data
ranges in the page cache (i.e. hole-data boundaries) we can make use
of mapping_seek_hole_data() to find those transitions in the page
cache. As we hold the invalidate_lock, we know that the boundaries
are not going to change while we walk the range. This interface is
also byte-based and is sub-page block aware, so we can find the data
ranges in the cache based on byte offsets rather than page, folio or
fs block sized chunks. This greatly simplifies the logic of finding
dirty cached ranges in the page cache.
Once we've identified a range that contains cached data, we can then
iterate the range folio by folio. This allows us to determine if the
data is dirty and hence perform the correct delalloc extent punching
operations. The seek interface we use to iterate data ranges will
give us sub-folio start/end granularity, so we may end up looking up
the same folio multiple times as the seek interface iterates across
each discontiguous data region in the folio.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Because that's what Christoph wants for this error handling path
only XFS uses.
It requires a new iomap export for handling errors over delalloc
ranges. This is basically the XFS code as is stands, but even though
Christoph wants this as iomap funcitonality, we still have
to call it from the filesystem specific ->iomap_end callback, and
call into the iomap code with yet another filesystem specific
callback to punch the delalloc extent within the defined ranges.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
xfs_buffered_write_iomap_end() currently converts the byte ranges
passed to it to filesystem blocks to pass them to the bmap code to
punch out delalloc blocks, but then has to convert filesytem
blocks back to byte ranges for page cache truncate.
We're about to make the page cache truncate go away and replace it
with a page cache walk, so having to convert everything to/from/to
filesystem blocks is messy and error-prone. It is much easier to
pass around byte ranges and convert to page indexes and/or
filesystem blocks only where those units are needed.
In preparation for the page cache walk being added, add a helper
that converts byte ranges to filesystem blocks and calls
xfs_bmap_punch_delalloc_range() and convert
xfs_buffered_write_iomap_end() to calculate limits in byte ranges.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
xfs_buffered_write_iomap_end() has a comment about the safety of
punching delalloc extents based holding the IOLOCK_EXCL. This
comment is wrong, and punching delalloc extents is not race free.
When we punch out a delalloc extent after a write failure in
xfs_buffered_write_iomap_end(), we punch out the page cache with
truncate_pagecache_range() before we punch out the delalloc extents.
At this point, we only hold the IOLOCK_EXCL, so there is nothing
stopping mmap() write faults racing with this cleanup operation,
reinstantiating a folio over the range we are about to punch and
hence requiring the delalloc extent to be kept.
If this race condition is hit, we can end up with a dirty page in
the page cache that has no delalloc extent or space reservation
backing it. This leads to bad things happening at writeback time.
To avoid this race condition, we need the page cache truncation to
be atomic w.r.t. the extent manipulation. We can do this by holding
the mapping->invalidate_lock exclusively across this operation -
this will prevent new pages from being inserted into the page cache
whilst we are removing the pages and the backing extent and space
reservation.
Taking the mapping->invalidate_lock exclusively in the buffered
write IO path is safe - it naturally nests inside the IOLOCK (see
truncate and fallocate paths). iomap_zero_range() can be called from
under the mapping->invalidate_lock (from the truncate path via
either xfs_zero_eof() or xfs_truncate_page(), but iomap_zero_iter()
will not instantiate new delalloc pages (because it skips holes) and
hence will not ever need to punch out delalloc extents on failure.
Fix the locking issue, and clean up the code logic a little to avoid
unnecessary work if we didn't allocate the delalloc extent or wrote
the entire region we allocated.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
The following error occurred during the fsstress test:
XFS: Assertion failed: VFS_I(ip)->i_nlink >= 2, file: fs/xfs/xfs_inode.c, line: 2452
The problem was that inode race condition causes incorrect i_nlink to be
written to disk, and then it is read into memory. Consider the following
call graph, inodes that are marked as both XFS_IFLUSHING and
XFS_IRECLAIMABLE, i_nlink will be reset to 1 and then restored to original
value in xfs_reinit_inode(). Therefore, the i_nlink of directory on disk
may be set to 1.
xfsaild
xfs_inode_item_push
xfs_iflush_cluster
xfs_iflush
xfs_inode_to_disk
xfs_iget
xfs_iget_cache_hit
xfs_iget_recycle
xfs_reinit_inode
inode_init_always
xfs_reinit_inode() needs to hold the ILOCK_EXCL as it is changing internal
inode state and can race with other RCU protected inode lookups. On the
read side, xfs_iflush_cluster() grabs the ILOCK_SHARED while under rcu +
ip->i_flags_lock, and so xfs_iflush/xfs_inode_to_disk() are protected from
racing inode updates (during transactions) by that lock.
Fixes: ff7bebeb91f8 ("xfs: refactor the inode recycling code") # goes further back than this
Signed-off-by: Long Li <leo.lilong@huawei.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
As of now only device names are printed out over __xfs_printk().
The device names are not persistent across reboots which in case
of searching for origin of corruption brings another task to properly
identify the devices. This patch add XFS UUID upon every mount/umount
event which will make the identification much easier.
Signed-off-by: Lukas Herbolt <lukas@herbolt.com>
[sandeen: rebase onto current upstream kernel]
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
When lazysbcount is enabled, fsstress and loop mount/unmount test report
the following problems:
XFS (loop0): SB summary counter sanity check failed
XFS (loop0): Metadata corruption detected at xfs_sb_write_verify+0x13b/0x460,
xfs_sb block 0x0
XFS (loop0): Unmount and run xfs_repair
XFS (loop0): First 128 bytes of corrupted metadata buffer:
00000000: 58 46 53 42 00 00 10 00 00 00 00 00 00 28 00 00 XFSB.........(..
00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00000020: 69 fb 7c cd 5f dc 44 af 85 74 e0 cc d4 e3 34 5a i.|._.D..t....4Z
00000030: 00 00 00 00 00 20 00 06 00 00 00 00 00 00 00 80 ..... ..........
00000040: 00 00 00 00 00 00 00 81 00 00 00 00 00 00 00 82 ................
00000050: 00 00 00 01 00 0a 00 00 00 00 00 04 00 00 00 00 ................
00000060: 00 00 0a 00 b4 b5 02 00 02 00 00 08 00 00 00 00 ................
00000070: 00 00 00 00 00 00 00 00 0c 09 09 03 14 00 00 19 ................
XFS (loop0): Corruption of in-memory data (0x8) detected at _xfs_buf_ioapply
+0xe1e/0x10e0 (fs/xfs/xfs_buf.c:1580). Shutting down filesystem.
XFS (loop0): Please unmount the filesystem and rectify the problem(s)
XFS (loop0): log mount/recovery failed: error -117
XFS (loop0): log mount failed
This corruption will shutdown the file system and the file system will
no longer be mountable. The following script can reproduce the problem,
but it may take a long time.
#!/bin/bash
device=/dev/sda
testdir=/mnt/test
round=0
function fail()
{
echo "$*"
exit 1
}
mkdir -p $testdir
while [ $round -lt 10000 ]
do
echo "******* round $round ********"
mkfs.xfs -f $device
mount $device $testdir || fail "mount failed!"
fsstress -d $testdir -l 0 -n 10000 -p 4 >/dev/null &
sleep 4
killall -w fsstress
umount $testdir
xfs_repair -e $device > /dev/null
if [ $? -eq 2 ];then
echo "ERR CODE 2: Dirty log exception during repair."
exit 1
fi
round=$(($round+1))
done
With lazysbcount is enabled, There is no additional lock protection for
reading m_ifree and m_icount in xfs_log_sb(), if other cpu modifies the
m_ifree, this will make the m_ifree greater than m_icount. For example,
consider the following sequence and ifreedelta is postive:
CPU0 CPU1
xfs_log_sb xfs_trans_unreserve_and_mod_sb
---------- ------------------------------
percpu_counter_sum(&mp->m_icount)
percpu_counter_add_batch(&mp->m_icount,
idelta, XFS_ICOUNT_BATCH)
percpu_counter_add(&mp->m_ifree, ifreedelta);
percpu_counter_sum(&mp->m_ifree)
After this, incorrect inode count (sb_ifree > sb_icount) will be writen to
the log. In the subsequent writing of sb, incorrect inode count (sb_ifree >
sb_icount) will fail to pass the boundary check in xfs_validate_sb_write()
that cause the file system shutdown.
When lazysbcount is enabled, we don't need to guarantee that Lazy sb
counters are completely correct, but we do need to guarantee that sb_ifree
<= sb_icount. On the other hand, the constraint that m_ifree <= m_icount
must be satisfied any time that there /cannot/ be other threads allocating
or freeing inode chunks. If the constraint is violated under these
circumstances, sb_i{count,free} (the ondisk superblock inode counters)
maybe incorrect and need to be marked sick at unmount, the count will
be rebuilt on the next mount.
Fixes: 8756a5af1819 ("libxfs: add more bounds checking to sb sanity checks")
Signed-off-by: Long Li <leo.lilong@huawei.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Clean up resources if resetting the dotdot entry doesn't succeed.
Observed through code inspection.
Fixes: 5838d0356bb3 ("xfs: reset child dir '..' entry when unlinking child")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Andrey Albershteyn <aalbersh@redhat.com>
Running the online fsck QA fuzz tests, I noticed that we were
consistently missing fuzzed records in the inode cores of the realtime
freespace files and the quota files. This patch adds the ability to
check inode cores in xchk_metadata_inode_forks.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAmN1fwwACgkQ+H93GTRK
tOuqAw/+JLnoGvDkkWfLHqFqroINQp29ZEuk2Zgcn9E1Lk+7xfZimqRNdlJI47J3
Dz9ob5gO/GZfZoPtfmxQegJggaHnjRb1LLQGphb5EDhB7ezo23JQt8lM7A4iPVx9
KHhb/+7IIJLNt07p2RpyTxPjdmxk1s0vYpnjchnSrxUR1Lqzq5LU1CLVEbwfHyRI
PJX3OCG57fbk5jOFo/KgkKXP/cf6//PP/pGT56ytzveBMZt77D3dWtRSEiQGlUc2
QFlR77CR4KuLhkRjr69B1X2RVEopj4/sRItSMpFB/EoPpfDsYrGw27Pm8fgdVyTo
VLUzCS8NlhkVikoEO8EBawLidRT1QlgluqNVUUKx/+NSA5DsroymkEj8+BzL7mJO
vsHEGZw/IeTn9ATqapQtjHc6ZnbZFHCaf0b+eaJhuoMjyOI+guleYjIHQcaLTK73
pivmCoj6VE6V1IZNZAnFEAohNPCxpfzu85Sg7mVUcrL24QReOIuy21m207Tj3q94
6ytXENkOdJJ3ZJq1MoIIhAvOJPmzJvGEEpv+auSdu+OfeYK/kXbtcCPFIwtcI0p5
1fnKKq/y9mZyr+37Ifx74DBAiCeO+GHikkYOIP70Lkw0YQ3AVnU7PQBy1xmJCzYG
AlAebztXXED+uJbhvNICOrxuSXzVynKpN6Br3VYT7qfgCGjS8L8=
=mvcx
-----END PGP SIGNATURE-----
Merge tag 'scrub-check-metadata-inode-records-6.2_2022-11-16' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.2-mergeA
xfs: scrub inode core when checking metadata files
Running the online fsck QA fuzz tests, I noticed that we were
consistently missing fuzzed records in the inode cores of the realtime
freespace files and the quota files. This patch adds the ability to
check inode cores in xchk_metadata_inode_forks.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
* tag 'scrub-check-metadata-inode-records-6.2_2022-11-16' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux:
xfs: check inode core when scrubbing metadata files
xfs: don't warn about files that are exactly s_maxbytes long
This series strengthens the file extent mapping scrubber in various
ways, such as confirming that there are enough bmap records to match up
with the rmap records for this file, checking delalloc reservations,
checking for no unwritten extents in metadata files, invalid CoW fork
formats, and weird things like shared CoW fork extents.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAmN1fwsACgkQ+H93GTRK
tOtCsQ/7BZO0D8QyAJH3DBBNoK90EpM19YjB7LbD6ExHLMTCzR2bitrR0W0NPey1
kS74/owG6cmCvqIem5bLIQB8hhnQAnhswEVEW4HmhmKq0QBZFdGhJK+C/pgpL072
SgKzNzU86CB0+StVJTGRd5jgWWzLSajEPbRXoNS1nat63F6aqx44zI8Gehn5qjhD
S2ldj7FUIWK0qV2dfb0szP8zzD8H7GvE1msXjEuFRWdVqFMHbKaTC+sfoYIE0cMK
WllJ8RVXX4BqgWmeJSuidy/ryWhOXGsW0/UDXxe8yLlw/eIYSP2hm/J/KPBf+Zd8
2ntJnnK19ggjM0TusmkXjupjrXR8zMP3gh3lcYJ3a37HM991aIZ18ffMNXj7y4+n
1T2CerePVWaUQ0K2L/tErBAeY9dolM0pux1dwEdvU9LdYTD1SgAmnOJ3HhVYiXbG
vIZ/e7F6VrccUKXHxmokj21PZldxJ5GpJgPMIsVOuMilh8Thnc+C/rD8XALytHww
fydvqTSKuIgKw7VJNnXhLgZK2JQBr6+jHhsC9nUay394tooNdoyZDwLeYJcM0Tnt
8d8/PalGOdGJHZH212LqeyQ44xIjAxjx4CuE4XLkuTEz7FPDqiXpzFpK+wYa3Ocn
c75ngS5WAqslbna2U/UtV/vol2np2gA+QYhELpCl5r2EPsrPyes=
=ZRqi
-----END PGP SIGNATURE-----
Merge tag 'scrub-bmap-enhancements-6.2_2022-11-16' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.2-mergeA
xfs: strengthen file mapping scrub
This series strengthens the file extent mapping scrubber in various
ways, such as confirming that there are enough bmap records to match up
with the rmap records for this file, checking delalloc reservations,
checking for no unwritten extents in metadata files, invalid CoW fork
formats, and weird things like shared CoW fork extents.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
* tag 'scrub-bmap-enhancements-6.2_2022-11-16' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux:
xfs: teach scrub to flag non-extents format cow forks
xfs: check that CoW fork extents are not shared
xfs: check quota files for unwritten extents
xfs: block map scrub should handle incore delalloc reservations
xfs: teach scrub to check for adjacent bmaps when rmap larger than bmap
xfs: fix perag loop in xchk_bmap_check_rmaps
This series makes two changes to the fs summary counter scrubber: first,
we should mark the scrub incomplete when we can't read the AG headers.
Second, it fixes a functionality gap where we don't actually check the
free rt extent count.
v23.2: fix pointless inline
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAmN1fwsACgkQ+H93GTRK
tOv8EA/8CYkmhVy79KQSEJ33/b9qf62gNOsUlIDr3800FxWEyz/j1luK2yWF/4+4
cU8V/GcR8G9WLETHbCAFGsA1DTRsoDyIBZ51m3jmMsRaeF2sGCMaLR9kAs+4Mzbr
fhRIAlnQGL0BHkQW/lC/3GR1d9w5mBYgScvY5rzqo01AjEMs3nWr2IgHzmTlRKdz
26gPQcaHLhEkadv013XJYTNpcP3iBb1QeQ5E1zmBO32cvzyY9Jcza7dpDx6KvAvN
99SjAuxt2ZD8ecf+eiaCvWwdqduoZRGI2J+7YpHnk1GS18AjC4mepzkUQ9lNra3j
9BzekTpxTK0jZ3wQTkIiYRTpGAPkfXRJBDkvy7WEu6S434mneL/KqJBX5UZiDduy
7486xkBh1agncOHfU6XypC/k2b3oa1svMGvVDcUzYzH9EStaAm+CY37/DjEVcwgM
i2TAVignUJjqjx9nS7PmM76hrFBMksHhRU/crBK1yxF2JrKkdoeluFa5MabJVj9M
sicMwrra4wfuueAMRHZlfAhDW9a+ud6mSnnP7ndlBQXowc8PgrYvovfwZpL6C7q+
gr8/DV+1+zvLFQedyUDcn7rnzxMZ5yUQO7ehBzqt/0QoMT5DLBD/5i63nTh+Ms1G
pgIDx0BDL4GhDBmwGP84xwuEBcwNKmD9s8cMgUB/D58/vZAyXTc=
=IZks
-----END PGP SIGNATURE-----
Merge tag 'scrub-fscounters-enhancements-6.2_2022-11-16' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.2-mergeA
xfs: enhance fs summary counter scrubber
This series makes two changes to the fs summary counter scrubber: first,
we should mark the scrub incomplete when we can't read the AG headers.
Second, it fixes a functionality gap where we don't actually check the
free rt extent count.
v23.2: fix pointless inline
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
* tag 'scrub-fscounters-enhancements-6.2_2022-11-16' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux:
xfs: online checking of the free rt extent count
xfs: skip fscounters comparisons when the scan is incomplete
This short series makes some small changes to the way we handle the
realtime metadata inodes. First, we now make sure that the bitmap and
summary file forks are always loaded at mount time so that every
scrubber won't have to call xfs_iread_extents. This won't be easy if
we're, say, cross-referencing realtime space allocations. The second
change makes the ILOCK annotations more consistent with the rest of XFS.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAmN1fwoACgkQ+H93GTRK
tOsLwg/+K27LJjHR9FF/Wpj4A4yupWPj91LYtB10Vd2TVjd6xUoCPxuYBUDctjGG
jb40eEg3hE8qFjaOavA8TgY5Od4M0EcdZKnZm9rLIB2w/WYjq00ZW13dnk6n5GXr
9y3OzSEHOyzimbvjddOLv/jyaeUJV8m6uKls653Ou2wRWXmT4oVmYeJGPSv8ZKjL
NLmE4zC9o7Qi8UfDOX6r9mrwQgLZLF3qHTGAQrGI0GG+TPjp4PI/SeN4WB27H0Ue
VklFPlMh2/Nmo4mfKoNupRLw2G0ZgMM4akkhdG2b0rHO9DC27/swycYUzKUV/pMP
L5rlzXE7a2eeia2OF1/HECBJ4meUvJI1qlxm8wMZTPCu92J49swE4o5CUhNhyo3e
31LLf9kiFaJnmWxxDOrDUGP8tK461MfRX3moGpLPNMPjks0QiLE61fbW6FsH7Dlu
CwPwVXSgr6JSfYjHpMMDwmYGI9li2KUmuDXXvjEvx0yAU/D2+R8HyTZiVmJiXHXK
kseQUTYv8BEloQAFmejf5pPqe3IRQ0CTGuS89k4uoGE2bXaQGwFsvof8ASePM02L
ITI4x1aYRRTMnfrj4VFIlK2oAvet9opV7mODW1repSM3OpYXSW7b/uHg5CY/u4eC
JrOr222viO1C26p+cCqHaiWn/WxIHy6/Dl3Ju5R55dINNjpFCfs=
=Nuyw
-----END PGP SIGNATURE-----
Merge tag 'scrub-fix-rtmeta-ilocking-6.2_2022-11-16' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.2-mergeA
xfs: improve rt metadata use for scrub
This short series makes some small changes to the way we handle the
realtime metadata inodes. First, we now make sure that the bitmap and
summary file forks are always loaded at mount time so that every
scrubber won't have to call xfs_iread_extents. This won't be easy if
we're, say, cross-referencing realtime space allocations. The second
change makes the ILOCK annotations more consistent with the rest of XFS.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
* tag 'scrub-fix-rtmeta-ilocking-6.2_2022-11-16' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux:
xfs: make rtbitmap ILOCKing consistent when scanning the rt bitmap file
xfs: load rtbitmap and rtsummary extent mapping btrees at mount time
Here we fix a couple of problems with the errno values that we return to
userspace.
v23.2: fix vague wording of comment
v23.3: fix the commit message to discuss what's really going on in this
patch
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAmN1fwkACgkQ+H93GTRK
tOtqHxAAiKPUgjMDRHIBmVTYEmHo0QuxnBcPAMUmwlX5sAjtVZFHnIZ1ChfdR/K2
eDzfFLbNnxyW8HdkI6O7aP7XD/HVFkKRmzWdVkzA463scaFnCVrLl8NDecmZRtVS
TTHCCnDmzdZYM2uDayPJyZAHKZZejp6j33+N6bFBwtNYij9yBEFkHyIP+xZRmPVr
RU1rzzmXhrZqB3ci/8bfOQbkaGzotEAu6R+pFddvMQ55hSwtY7OZ1IctlqHkRlks
H86jGbcd8KNsi0SKo0RN+IuS2LtZNesI0Sm95LUOn1HnlEtmeHrJZYqC5WrKB3cn
gUKJlzgEaDU7eCjMhd6TolqE6zNi/3xMaqjxS/Utoe3wSwSkQvGFVIuJiew7OXFQ
BIDXxpOV/4C+vyi83CDcD147iLXJqz3RFECMD1HXDClYDXEEu9Qk3shJaDQQsOS/
Ddn7UhuF80JcFsf0ZjpwuLwJdsAvDGruD/652V/3D3zkOHiqL9ejMI/InHuU/Fnu
zGeignES4yVcJZMeOL8+crYM1EHGYjZRwe8qOEv4ufND+LnwAu71vyGmtfVqdki5
Qo3n0x1K0dk0Um+YPJrZSfeuSNdpGfBexR1rvWOdAuPJfrizaqHmWg/7fs3ZNSqG
Zfye7jpUcFNt8cxLmwHcNPxI5ll7Kev/4eaWzsz23v4ZTP39m0A=
=UIK1
-----END PGP SIGNATURE-----
Merge tag 'scrub-fix-return-value-6.2_2022-11-16' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.2-mergeA
xfs: fix incorrect return values in online fsck
Here we fix a couple of problems with the errno values that we return to
userspace.
v23.2: fix vague wording of comment
v23.3: fix the commit message to discuss what's really going on in this
patch
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
* tag 'scrub-fix-return-value-6.2_2022-11-16' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux:
xfs: don't return -EFSCORRUPTED from repair when resources cannot be grabbed
xfs: don't retry repairs harder when EAGAIN is returned
xfs: fix return code when fatal signal encountered during dquot scrub
xfs: return EINTR when a fatal signal terminates scrub
This series standardizes the GFP_ flags that we use for memory
allocation in online scrub, and convert the callers away from the old
kmem_alloc code that was ported from Irix.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAmN1fwgACgkQ+H93GTRK
tOsdEA/+PrgIsICn+TE+ypTNnpCTmO+EbJ5uhmK8c/P7JcWuyuM2kHSfo5Q0y6Pv
lOF+MXUgPBXbWB1UIxSDJX4U9yaGGZE41tNpILN3Q1hd9WhMAtHLgvDwv+UdGV2J
/yILLD03pytSmE/SQunO4/0MgA2v1vIhyyoLXF63BbLq3HZ5I1JsezHE4gaZ8vdU
CnVUYGs9/qvDPDzYqVyoCBRI9yWx9GebanwceVP+kuwzmVN+ITSsrn5MjUVWbzro
lxTz7xBGkXF0Q4l5RHpDzVGrUj2Zs1GnoWrd8VAGfZqQi/lHsMXZp8yctzlviPm0
AJkfPWuWLBmmaUdDlgNOHLOojSTYj4n4P/t5OccylnJs8cPuP5UsJdMip1GQfHOO
mxjctZxrdIv5AdJBmG4M8hF7B4dHGIY66B+imfD0xueiB5b9pMHJOYuVn33iPQZe
Ga70dTEQfqD/MNUy9cAYevRHxcy8kU6B96TgYmNNHdhvJCweQozQ3xyEQB1LHReh
pJRnlARauMZJyZmj9GMB2VMnFi8Ghb/JP8qPEHWdm+WkqOS0oRI88kUVRfm2IHIG
n5/BfVWSFAV6xrMrlUnlM9qMpGjNYpg5wyDBO7VRK3VjNoFoYqpGTfQ4LrAX6mUh
KzEtcm7i4Ur7MenC+P9lx0ZOl4CpD73cIs17AJyLozbjZnaalZg=
=foY/
-----END PGP SIGNATURE-----
Merge tag 'scrub-cleanup-malloc-6.2_2022-11-16' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.2-mergeA
xfs: clean up memory allocations in online fsck
This series standardizes the GFP_ flags that we use for memory
allocation in online scrub, and convert the callers away from the old
kmem_alloc code that was ported from Irix.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
* tag 'scrub-cleanup-malloc-6.2_2022-11-16' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux:
xfs: pivot online scrub away from kmem.[ch]
xfs: initialize the check_owner object fully
xfs: standardize GFP flags usage in online scrub
While reading through the online fsck code, I noticed that the setup
code for AG metadata scrubs will attach the AGI, the AGF, and the AGFL
buffers to the transaction. It isn't necessary to hold the AGFL buffer,
since any code that wants to do anything with the AGFL will need to hold
the AGF to know which parts of the AGFL are active. Therefore, we only
need to hold the AGFL when scrubbing the AGFL itself.
The second bug fixed by this patchset is one that I observed while
testing online repair. When a buffer is held across a transaction roll,
its buffer log item will be detached if the bli was clean before the
roll. If we are holding the AG headers to maintain a lock on an AG, we
then need to set the buffer type on the new bli to avoid confusing the
logging code later.
There's also a bug fix for uninitialized memory in the directory scanner
that didn't fit anywhere else.
Ths patchset finishes off by teaching the AGFL repair function to look
for and discard crosslinked blocks instead of putting them back on the
AGFL.
v23.2: Log the buffers before rolling the transaction to keep the moving
forward in the log and avoid the bli falling off.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAmN1fwAACgkQ+H93GTRK
tOvOrA//duBabr1KtTxEnTALDiQCsXJwSV/G6wQ8OO7KGpi+cBxGMiTStp28oDPw
rzg2qBFpt/Yr3vpyNPZz3qjaYY3/EHlz+SZrSlMnHXXXXl2diDXCGZYWpl2EQtEX
MljQ7HW9ux4CDzT1vfkAnwuYPlTzX166ISFtvJrJKoK4qDEn2nCdZn9fmrDVXlTU
sMtvYzgDofo83JoZTDS8rbVEXP883E9nUwdDgKxrGUef4QIeArEcIhCnb8xSF/6q
kQWDn9axn80JMhF5UKd0uwow9q11LvvGnYB+uK5RV6Kv7fja016spVN2xOT1B09E
vf+IBauz3/UaGuOIhq+9/rW3Pg4I3eeQUI3K+Px9OIW65Ascc+Zm2E2JWVaz/ZJd
jtt2iRmhYvSCwCRrP1nmMhtK6PCN8luY1hNvHfIul6MVYm6fi0pDfmevFEUEsoG/
UIRdzO5HcBZac4B9GbLGIL+tySimIwqND3yXuDlXKygHMXnkXstDcKDUb52MzFy2
mPoKNPGs4vmwymooPJHvD9ANPYQ9PWkrU0DLj2Sn1cs72yHYuKyQM4VPqTJRO7WZ
bcGkqMMYpmqjf+fePE3u7DQnNLYUMXIFR4caCyxjESc52ZPQbigBOfqliYveUgtQ
y7hXCxL7nVvUOGYs/S6JgMFfpwMeoAOrqBaZYLlPf1N95TNVgBc=
=1yOQ
-----END PGP SIGNATURE-----
Merge tag 'scrub-fix-ag-header-handling-6.2_2022-11-16' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.2-mergeA
xfs: fix handling of AG[IF] header buffers during scrub
While reading through the online fsck code, I noticed that the setup
code for AG metadata scrubs will attach the AGI, the AGF, and the AGFL
buffers to the transaction. It isn't necessary to hold the AGFL buffer,
since any code that wants to do anything with the AGFL will need to hold
the AGF to know which parts of the AGFL are active. Therefore, we only
need to hold the AGFL when scrubbing the AGFL itself.
The second bug fixed by this patchset is one that I observed while
testing online repair. When a buffer is held across a transaction roll,
its buffer log item will be detached if the bli was clean before the
roll. If we are holding the AG headers to maintain a lock on an AG, we
then need to set the buffer type on the new bli to avoid confusing the
logging code later.
There's also a bug fix for uninitialized memory in the directory scanner
that didn't fit anywhere else.
Ths patchset finishes off by teaching the AGFL repair function to look
for and discard crosslinked blocks instead of putting them back on the
AGFL.
v23.2: Log the buffers before rolling the transaction to keep the moving
forward in the log and avoid the bli falling off.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
* tag 'scrub-fix-ag-header-handling-6.2_2022-11-16' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux:
xfs: make AGFL repair function avoid crosslinked blocks
xfs: log the AGI/AGF buffers when rolling transactions during an AG repair
xfs: don't track the AGFL buffer in the scrub AG context
xfs: fully initialize xfs_da_args in xchk_directory_blocks
Metadata files (e.g. realtime bitmaps and quota files) do not show up in
the bulkstat output, which means that scrub-by-handle does not work;
they can only be checked through a specific scrub type. Therefore, each
scrub type calls xchk_metadata_inode_forks to check the metadata for
whatever's in the file.
Unfortunately, that function doesn't actually check the inode record
itself. Refactor the function a bit to make that happen.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
We can handle files that are exactly s_maxbytes bytes long; we just
can't handle anything larger than that.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
CoW forks only exist in memory, which means that they can only ever have
an incore extent tree. Hence they must always be FMT_EXTENTS, so check
this when we're scrubbing them.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Ensure that extents in an inode's CoW fork are not marked as shared in
the refcount btree.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Teach scrub to flag quota files containing unwritten extents.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Enhance the block map scrubber to check delayed allocation reservations.
Though there are no physical space allocations to check, we do need to
make sure that the range of file offsets being mapped are correct, and
to bump the lastoff cursor so that key order checking works correctly.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
When scrub is checking file fork mappings against rmap records and
the rmap record starts before or ends after the bmap record, check the
adjacent bmap records to make sure that they're adjacent to the one
we're checking. This helps us to detect cases where the rmaps cover
territory that the bmaps do not.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
sparse complains that we can return an uninitialized error from this
function and that pag could be uninitialized. We know that there are no
zero-AG filesystems and hence we had to call xchk_bmap_check_ag_rmaps at
least once, so this is not actually possible, but I'm too worn out from
automated complaints from unsophisticated AIs so let's just fix this and
move on to more interesting problems, eh?
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Teach the summary count checker to count the number of free realtime
extents and compare that to the superblock copy.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
xfs_rtalloc_query_range scans the realtime bitmap file in order of
increasing file offset, so this caller can take ILOCK_SHARED on the rt
bitmap inode instead of ILOCK_EXCL. This isn't going to yield any
practical benefits at mount time, but we'd like to make the locking
usage consistent around xfs_rtalloc_query_all calls. Make all the
places we do this use the same xfs_ilock lockflags for consistency.
Fixes: 4c934c7dd60c ("xfs: report realtime space information via the rtbitmap")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
If we tried to repair something but the repair failed with -EDEADLOCK,
that means that the repair function couldn't grab some resource it
needed and wants us to try again. If we try again (with TRY_HARDER) but
still can't get all the resources we need, the repair fails and errors
remain on the filesystem.
Right now, repair returns the -EDEADLOCK to the caller as -EFSCORRUPTED,
which results in XFS_SCRUB_OFLAG_CORRUPT being passed out to userspace.
This is not correct because repair has not determined that anything is
corrupt. If the repair had been invoked on an object that could be
optimized but wasn't corrupt (OFLAG_PREEN), the inability to grab
resources will be reported to userspace as corrupt metadata, and users
will be unnecessarily alarmed that their suboptimal metadata turned into
a corruption.
Fix this by returning zero so that the results of the actual scrub will
be copied back out to userspace.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
If any part of the per-AG summary counter scan loop aborts without
collecting all of the data we need, the scrubber's observation data will
be invalid. Set the incomplete flag so that we abort the scrub without
reporting false corruptions. Document the data dependency here too.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
It turns out that GETFSMAP and online fsck have had a bug for years due
to their use of ILOCK_SHARED to coordinate their linear scans of the
realtime bitmap. If the bitmap file's data fork happens to be in BTREE
format and the scan occurs immediately after mounting, the incore bmbt
will not be populated, leading to ASSERTs tripping over the incorrect
inode state. Because the bitmap scans always lock bitmap buffers in
increasing order of file offset, it is appropriate for these two callers
to take a shared ILOCK to improve scalability.
To fix this problem, load both data and attr fork state into memory when
mounting the realtime inodes. Realtime metadata files aren't supposed
to have an attr fork so the second step is likely a nop.
On most filesystems this is unlikely since the rtbitmap data fork is
usually in extents format, but it's possible to craft a filesystem that
will by fragmenting the free space in the data section and growfsing the
rt section.
Fixes: 4c934c7dd60c ("xfs: report realtime space information via the rtbitmap")
Also-Fixes: 46d9bfb5e706 ("xfs: cross-reference the realtime bitmap")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Convert all the online scrub code to use the Linux slab allocator
functions directly instead of going through the kmem wrappers.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Repair functions will not return EAGAIN -- if they were not able to
obtain resources, they should return EDEADLOCK (like the rest of online
fsck) to signal that we need to grab all the resources and try again.
Hence we don't need to deal with this case except as a debugging
assertion.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Initialize the check_owner list head so that we don't corrupt the list.
Reduce the scope of the object pointer.
Fixes: 858333dcf021 ("xfs: check btree block ownership with bnobt/rmapbt when scrubbing btree")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
If the scrub process is sent a fatal signal while we're checking dquots,
the predicate for this will set the error code to -EINTR. Don't then
squash that into -ECANCELED, because the wrong errno turns up in the
trace output.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
If the program calling online fsck is terminated with a fatal signal,
bail out to userspace by returning EINTR, not EAGAIN. EAGAIN is used by
scrubbers to indicate that we should try again with more resources
locked, and not to indicate that the operation was cancelled. The
miswiring is mostly harmless, but it shows up in the trace data.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Teach the AGFL repair function to check each block of the proposed AGFL
against the rmap btree. If the rmapbt finds any mappings that are not
OWN_AG, strike that block from the list.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Memory allocation usage is the same throughout online fsck -- we want
kernel memory, we have to be able to back out if we can't allocate
memory, and we don't want to spray dmesg with memory allocation failure
reports. Standardize the GFP flag usage and document these requirements.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Currently, the only way to lock an allocation group is to hold the AGI
and AGF buffers. If a repair needs to roll the transaction while
repairing some AG metadata, it maintains that lock by holding the two
buffers across the transaction roll and joins them afterwards.
However, repair is not like other parts of XFS that employ the bhold -
roll - bjoin sequence because it's possible that the AGI or AGF buffers
are not actually dirty before the roll. This presents two problems --
First, we need to redirty those buffers to keep them moving along in the
log to avoid pinning the log tail. Second, a clean buffer log item can
detach from the buffer. If this happens, the buffer type state is
discarded along with the bli and must be reattached before the next time
the buffer is logged. If it is not, the logging code will complain and
log recovery will not work properly.
An earlier version of this patch tried to fix the second problem by
re-setting the buffer type in the bli after joining the buffer to the
new transaction, but that looked weird and didn't solve the first
problem. Instead, solve both problems by logging the buffer before
rolling the transaction.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
While scrubbing an allocation group, we don't need to hold the AGFL
buffer as part of the scrub context. All that is necessary to lock an
AG is to hold the AGI and AGF buffers, so fix all the existing users of
the AGFL buffer to grab them only when necessary.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
While running the online fsck test suite, I noticed the following
assertion in the kernel log (edited for brevity):
XFS: Assertion failed: 0, file: fs/xfs/xfs_health.c, line: 571
------------[ cut here ]------------
WARNING: CPU: 3 PID: 11667 at fs/xfs/xfs_message.c:104 assfail+0x46/0x4a [xfs]
CPU: 3 PID: 11667 Comm: xfs_scrub Tainted: G W 5.19.0-rc7-xfsx #rc7 6e6475eb29fd9dda3181f81b7ca7ff961d277a40
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1 04/01/2014
RIP: 0010:assfail+0x46/0x4a [xfs]
Call Trace:
<TASK>
xfs_dir2_isblock+0xcc/0xe0
xchk_directory_blocks+0xc7/0x420
xchk_directory+0x53/0xb0
xfs_scrub_metadata+0x2b6/0x6b0
xfs_scrubv_metadata+0x35e/0x4d0
xfs_ioc_scrubv_metadata+0x111/0x160
xfs_file_ioctl+0x4ec/0xef0
__x64_sys_ioctl+0x82/0xa0
do_syscall_64+0x2b/0x80
entry_SYSCALL_64_after_hwframe+0x46/0xb0
This assertion triggers in xfs_dirattr_mark_sick when the caller passes
in a whichfork value that is neither of XFS_{DATA,ATTR}_FORK. The cause
of this is that xchk_directory_blocks only partially initializes the
xfs_da_args structure that is passed to xfs_dir2_isblock. If the data
fork is not correct, the XFS_IS_CORRUPT clause will trigger. My
development branch reports this failure to the health monitoring
subsystem, which accesses the uninitialized args->whichfork field,
leading the the assertion tripping. We really shouldn't be passing
random stack contents around, so the solution here is to force the
compiler to zero-initialize the struct.
Found by fuzzing u3.bmx[0].blockcount = middlebit on xfs/1554.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
When we reserve a delalloc region in xfs_buffered_write_iomap_begin,
we mark the iomap as IOMAP_F_NEW so that the the write context
understands that it allocated the delalloc region.
If we then fail that buffered write, xfs_buffered_write_iomap_end()
checks for the IOMAP_F_NEW flag and if it is set, it punches out
the unused delalloc region that was allocated for the write.
The assumption this code makes is that all buffered write operations
that can allocate space are run under an exclusive lock (i_rwsem).
This is an invalid assumption: page faults in mmap()d regions call
through this same function pair to map the file range being faulted
and this runs only holding the inode->i_mapping->invalidate_lock in
shared mode.
IOWs, we can have races between page faults and write() calls that
fail the nested page cache write operation that result in data loss.
That is, the failing iomap_end call will punch out the data that
the other racing iomap iteration brought into the page cache. This
can be reproduced with generic/34[46] if we arbitrarily fail page
cache copy-in operations from write() syscalls.
Code analysis tells us that the iomap_page_mkwrite() function holds
the already instantiated and uptodate folio locked across the iomap
mapping iterations. Hence the folio cannot be removed from memory
whilst we are mapping the range it covers, and as such we do not
care if the mapping changes state underneath the iomap iteration
loop:
1. if the folio is not already dirty, there is no writeback races
possible.
2. if we allocated the mapping (delalloc or unwritten), the folio
cannot already be dirty. See #1.
3. If the folio is already dirty, it must be up to date. As we hold
it locked, it cannot be reclaimed from memory. Hence we always
have valid data in the page cache while iterating the mapping.
4. Valid data in the page cache can exist when the underlying
mapping is DELALLOC, UNWRITTEN or WRITTEN. Having the mapping
change from DELALLOC->UNWRITTEN or UNWRITTEN->WRITTEN does not
change the data in the page - it only affects actions if we are
initialising a new page. Hence #3 applies and we don't care
about these extent map transitions racing with
iomap_page_mkwrite().
5. iomap_page_mkwrite() checks for page invalidation races
(truncate, hole punch, etc) after it locks the folio. We also
hold the mapping->invalidation_lock here, and hence the mapping
cannot change due to extent removal operations while we are
iterating the folio.
As such, filesystems that don't use bufferheads will never fail
the iomap_folio_mkwrite_iter() operation on the current mapping,
regardless of whether the iomap should be considered stale.
Further, the range we are asked to iterate is limited to the range
inside EOF that the folio spans. Hence, for XFS, we will only map
the exact range we are asked for, and we will only do speculative
preallocation with delalloc if we are mapping a hole at the EOF
page. The iterator will consume the entire range of the folio that
is within EOF, and anything beyond the EOF block cannot be accessed.
We never need to truncate this post-EOF speculative prealloc away in
the context of the iomap_page_mkwrite() iterator because if it
remains unused we'll remove it when the last reference to the inode
goes away.
Hence we don't actually need an .iomap_end() cleanup/error handling
path at all for iomap_page_mkwrite() for XFS. This means we can
separate the page fault processing from the complexity of the
.iomap_end() processing in the buffered write path. This also means
that the buffered write path will also be able to take the
mapping->invalidate_lock as necessary.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
- Fix region creation crash with pass-through decoders
- Fix region creation crash when no decoder allocation fails
- Fix region creation crash when scanning regions to enforce the
increasing physical address order constraint that CXL mandates
- Fix a memory leak for cxl_pmem_region objects, track 1:N instead of
1:1 memory-device-to-region associations.
- Fix a memory leak for cxl_region objects when regions with active
targets are deleted
- Fix assignment of NUMA nodes to CXL regions by CFMWS (CXL Window)
emulated proximity domains.
- Fix region creation failure for switch attached devices downstream of
a single-port host-bridge
- Fix false positive memory leak of cxl_region objects by recycling
recently used region ids rather than freeing them
- Add regression test infrastructure for a pass-through decoder
configuration
- Fix some mailbox payload handling corner cases
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQSbo+XnGs+rwLz9XGXfioYZHlFsZwUCY2f0dwAKCRDfioYZHlFs
Z93zAQCHzy4qbEdw95SPQ/BpUJ2rxcWzruFZkaUTU1RHM5lApwEApP9Fjvdkgo9I
dlQTRON1nSqqoEXqSxbt8RU0I9Z11ws=
=pBN4
-----END PGP SIGNATURE-----
Merge tag 'cxl-fixes-for-6.1-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl
Pull cxl fixes from Dan Williams:
"Several fixes for CXL region creation crashes, leaks and failures.
This is mainly fallout from the original implementation of dynamic CXL
region creation (instantiate new physical memory pools) that arrived
in v6.0-rc1.
Given the theme of "failures in the presence of pass-through decoders"
this also includes new regression test infrastructure for that case.
Summary:
- Fix region creation crash with pass-through decoders
- Fix region creation crash when no decoder allocation fails
- Fix region creation crash when scanning regions to enforce the
increasing physical address order constraint that CXL mandates
- Fix a memory leak for cxl_pmem_region objects, track 1:N instead of
1:1 memory-device-to-region associations.
- Fix a memory leak for cxl_region objects when regions with active
targets are deleted
- Fix assignment of NUMA nodes to CXL regions by CFMWS (CXL Window)
emulated proximity domains.
- Fix region creation failure for switch attached devices downstream
of a single-port host-bridge
- Fix false positive memory leak of cxl_region objects by recycling
recently used region ids rather than freeing them
- Add regression test infrastructure for a pass-through decoder
configuration
- Fix some mailbox payload handling corner cases"
* tag 'cxl-fixes-for-6.1-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl:
cxl/region: Recycle region ids
cxl/region: Fix 'distance' calculation with passthrough ports
tools/testing/cxl: Add a single-port host-bridge regression config
tools/testing/cxl: Fix some error exits
cxl/pmem: Fix cxl_pmem_region and cxl_memdev leak
cxl/region: Fix cxl_region leak, cleanup targets at region delete
cxl/region: Fix region HPA ordering validation
cxl/pmem: Use size_add() against integer overflow
cxl/region: Fix decoder allocation crash
ACPI: NUMA: Add CXL CFMWS 'nodes' to the possible nodes set
cxl/pmem: Fix failure to account for 8 byte header for writes to the device LSA.
cxl/region: Fix null pointer dereference due to pass through decoder commit
cxl/mbox: Add a check on input payload size
Fix two regressions:
- Commit 54cc3dbfc10d ("hwmon: (pmbus) Add regulator supply into macro")
resulted in regulator undercount when disabling regulators. Revert it.
- The thermal subsystem rework caused the scmi driver to no longer register
with the thermal subsystem because index values no longer match.
To fix the problem, the scmi driver now directly registers with the
thermal subsystem, no longer through the hwmon core.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEiHPvMQj9QTOCiqgVyx8mb86fmYEFAmNnv0YACgkQyx8mb86f
mYEAPQ/7BfM6TjP3L85oGHPUxhP+AEkEoCfD1gplgRD4JPxYZZlwv8F+YGTV/PPQ
mp12Bh8AH+NSryu6PhuhHKZxIvKcgxg+FEedfnT4gPYcMKtevPF84TBIbbFmLpuo
culgJm4w3w2qwqP480kF0UTow1pRXqwjZYd3/GILzcaCIjWS2JS55w//YMQrE5J+
+OkZsJSGWJITAJPYhIb0ETFcqjp8GROWfMZos2IdpErzmcuS31tcrD0igSwLXFwP
bylDDXb9O9AiLT/qzkc0kjZvaohC2zoP5we+7JyXj0sUSODf+wBCEClvM1DI00JN
CcS0zmgrmscktkTkonZDIz95vqO2Il5UmKXlK6dkpdDSS78JAX8g78LU1J7aFJ/9
3iZnmYg4J8NXsO01plr4P3cGdPRqDoZMEYojSWq4WQs2VaKNIxsk5Zzcl6XN6OT1
MT4w55d8QSQEghDu8vUvnmMg1X5Pkzc50RFYcWyeyMDrtue3rP4uD8m+wl9Sjgvs
tWdayF9JmTQBaVvh3dUO3OR+z+66T8Qk1YBE1xqb+ocptYPNA1+wGum/O5wT+3ew
lE8Jfn/8q7DAJ7lVNXKMyXMvPhZJZ5afhs+gpIJgPB2+m8p3plKgpGdW6Ew36AKe
PKwmGx8kgfzQ6RYqXmCMG6wCr+xiLrYDPNOVz5ewC5LvYVtibs4=
=u6Ip
-----END PGP SIGNATURE-----
Merge tag 'hwmon-for-v6.1-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging
Pull hwmon fixes from Guenter Roeck:
"Fix two regressions:
- Commit 54cc3dbfc10d ("hwmon: (pmbus) Add regulator supply into
macro") resulted in regulator undercount when disabling regulators.
Revert it.
- The thermal subsystem rework caused the scmi driver to no longer
register with the thermal subsystem because index values no longer
match. To fix the problem, the scmi driver now directly registers
with the thermal subsystem, no longer through the hwmon core"
* tag 'hwmon-for-v6.1-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging:
Revert "hwmon: (pmbus) Add regulator supply into macro"
hwmon: (scmi) Register explicitly with Thermal Framework
fixed microcode revisions checking quirk
- Update Icelake and Sapphire Rapids events constraints
- Use the standard energy unit for Sapphire Rapids in RAPL
- Fix the hw_breakpoint test to fail more graciously on !SMP configs
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmNnr/4ACgkQEsHwGGHe
VUrFRg//dyB0lnQcdvIaPd7DWn3WGop+MeZv0NZI7uYk+SqjtJ3yJ/c4ktcaIgJV
MhTk8Q/gxHvuT+MZarC/f1QYtTqzRQ//rKD2aO/l9Gr813Hu4R0z2AEwrNKDmzyd
BYy3O5GXGeBAiLxtmKZ2bDlS5z8a9L3dlbLCWqjq6iGIVncljWmEDmNQmA3YPury
v8f+V8EqfSE4iWcpnNsZOdrmkMkXEzA8X5vRswQ9l2y6qMmnEeUk9Hn9mFlG+QK4
VDyxkQEB+vZVfWL2UjD3dpEaH5LVyfCQBwOaVdFfHhMmLhoTO2VmRMLza3Qd9ejZ
RIE1hlRibqGMqyHDTjZvnkPgnz4QQqayDf8UIIwVdaMVdIaZmxcIQwfsbQS12E5b
9EBzbaD6TJx42E56WuQHM+ZYt6nz0ktPz0IeBFJIwbU30gqJwdi0uIz2kXNpkthC
eX4Bq/iM9C41A58mj9+uerF9jshi/DJU74KcMGUZiJ7IeGDJgL9CfViOTueMOjr2
OI8nvLOtwBpj8X3AO1nEVkevSt4KPoTD+NVCNpXmjVm9DNFvMRo2EUsRHHrCkLJN
EO7iF14rTlSI7IAE+qxNgRsmXPCyuVBhB3S3/3YmCqsH1kQXqlgxT/2eOJN6kCGz
tlaWnD3TEaifH/DQQVGmv9nNFjS0C49MSxrZ7Oe7phnmSn3vaGY=
=midC
-----END PGP SIGNATURE-----
Merge tag 'perf_urgent_for_v6.1_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Borislav Petkov:
- Add Cooper Lake's stepping to the PEBS guest/host events isolation
fixed microcode revisions checking quirk
- Update Icelake and Sapphire Rapids events constraints
- Use the standard energy unit for Sapphire Rapids in RAPL
- Fix the hw_breakpoint test to fail more graciously on !SMP configs
* tag 'perf_urgent_for_v6.1_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86/intel: Add Cooper Lake stepping to isolation_ucodes[]
perf/x86/intel: Fix pebs event constraints for SPR
perf/x86/intel: Fix pebs event constraints for ICL
perf/x86/rapl: Use standard Energy Unit for SPR Dram RAPL domain
perf/hw_breakpoint: test: Skip the test if dependencies unmet
- Enforce that TDX guests are successfully loaded only on TDX hardware
where virtualization exception (#VE) delivery on kernel memory is
disabled because handling those in all possible cases is "essentially
impossible"
- Add the proper include to the syscall wrappers so that BTF can see the
real pt_regs definition and not only the forward declaration
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmNnrUgACgkQEsHwGGHe
VUoC2w//T6+5SlusY9uYIUpL/cGYj+888b/ysO0H0S37IVATUiI5m0eFAA+pcWON
pzn81oBqk1Lstm7x/jT2mzxsZ2fIFbe6EA8hnLAexA4KY70oGhall9Q6O363CmFa
DtUjd0LKjH6GkNH1RUcb5icJGVY3vZPCfSuxlYJUD66NBUx2pEF8l5hzZ0W20Yhq
cHVY0i1HoCNNDRBOODrH7MEY/kWMSvhFybCYOfRMhoVd3aJhsLlq+7/7Ic5wabyy
2mE8b0GU8or9mluU51OiCDjp+qnpB+BTFjV+88ji5jNEKLIarAXkoHDDD06xLhOK
a2L44zZ55RAFxxCBm9L10OE0ta3kUqpq+YKQkh0gGGdDdAylUp8IF0zXRl/6jRDC
T76jM1QOvC791HWD6kDf5XizY+PeaVD9LzAREezG6778mZbNNQwOtkECHZF0U3UP
n/NIabDlZIncuQQbT0sSshrIyfwtkH5E+epcyLuuchYUYnDGkvNkVU31ndiwFhUG
fW8I53XBnIlk5PunJ0jhaq4+Tugr7APipUs75y8IpFEINj6gxuoSdXyezlQVpmQ+
tL1UXqxSlQaCoW295Fr19p3ZBBfqRKXSCS/toCluB/ekhP3ISzIZV7/cB1smmsIR
JpgXQtcAMtXjIv9A1ZexQVlp2srk7Y6WrFocMNc47lKxmHZ78KY=
=nqZp
-----END PGP SIGNATURE-----
Merge tag 'x86_urgent_for_v6.1_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Borislav Petkov:
- Add new Intel CPU models
- Enforce that TDX guests are successfully loaded only on TDX hardware
where virtualization exception (#VE) delivery on kernel memory is
disabled because handling those in all possible cases is "essentially
impossible"
- Add the proper include to the syscall wrappers so that BTF can see
the real pt_regs definition and not only the forward declaration
* tag 'x86_urgent_for_v6.1_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/cpu: Add several Intel server CPU model numbers
x86/tdx: Panic on bad configs that #VE on "private" memory access
x86/tdx: Prepare for using "INFO" call for a second purpose
x86/syscall: Include asm/ptrace.h in syscall_wrapper header
- Use POSIX-compatible grep option.
- Document git-related tips for reproducible builds.
- Fix a typo in the modpost rule.
- Suppress SIGPIPE error message from gcc-ar and llvm-ar.
- Fix segmentation fault in the menuconfig search.
-----BEGIN PGP SIGNATURE-----
iQJJBAABCgAzFiEEbmPs18K1szRHjPqEPYsBB53g2wYFAmNmPH8VHG1hc2FoaXJv
eUBrZXJuZWwub3JnAAoJED2LAQed4NsGZ3AQALBPaJ5OBpz8PzAUVdWJkVMAJYeu
e0oPrRJPmxlvYZ4U4acAxxH9QGdAFopa+EBRWiCwb+L5lDQagvtb/boN5fyVHKWc
aQKoNanmzzxNoO9w3bH6ApTeDxZ9O54V3G5I6xiM/cVy+HfFQePvfAuF1tnxGpYi
RAftq2PhBo94ltpzhky00wnijYF8kU37RmTiZ/wUdSccOQ3cH/nhOduhnjXFpc+K
JbwocFT9PtvqSy1gSMzZbBikQL4jktK2CIslhJEsG3Pn5zi0eL6UQcY9Drc3oIF5
qOmtswtVJ6AiwJkdXb3/Vx5bS92wzIph3VOPpY2Vq8WkOA0t4gtByj13lzH2yJ0Q
05OsqXu1v5nilQOjHSWoyFaw6x3Exh/qa1hLOcPrfTAC7vP8LHO7L0ujySqtlbxe
pdmba/58YMIKdDPfZ3uFoMk4s3XuqDhBLkQl2ctoIfvX3KFWwcNE7oiyCkJXkE6v
asyH0gWYz2hyM29ulm15yA/eDt+OKweldz17e/GIOlA5hr8kt/96E/lEHW9r/tSK
Bw0u4HiWf92vlZWWKjDWkWD4T4FkM2n4Jn9zOU5fauS21BQG217LIHIh62bs1Luw
5Rb1UF7cAPEQxJZsTMdkdmWudZsabjpPFV68p8IucmKSQeHpgH1naJxXWCtln6V6
ZHnLnUNELUoMEg7F
=1MFY
-----END PGP SIGNATURE-----
Merge tag 'kbuild-fixes-v6.1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
Pull Kbuild fixes from Masahiro Yamada:
- Use POSIX-compatible grep options
- Document git-related tips for reproducible builds
- Fix a typo in the modpost rule
- Suppress SIGPIPE error message from gcc-ar and llvm-ar
- Fix segmentation fault in the menuconfig search
* tag 'kbuild-fixes-v6.1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
kconfig: fix segmentation fault in menuconfig search
kbuild: fix SIGPIPE error message for AR=gcc-ar and AR=llvm-ar
kbuild: fix typo in modpost
Documentation: kbuild: Add description of git for reproducible builds
kbuild: use POSIX-compatible grep option