Commit Graph

701 Commits

Author SHA1 Message Date
Darrick J. Wong
3f31406aef xfs: fix corruptions in the directory tree
Repair corruptions in the directory tree itself.  Cycles are broken by
removing an incoming parent->child link.  Multiply-owned directories are
fixed by pruning the extra parent -> child links  Disconnected subtrees
are reconnected to the lost and found.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-23 16:55:17 -07:00
Allison Henderson
1c12949e50 xfs: Add parent pointers to xfs_cross_rename
Cross renames are handled separately from standard renames, and
need different handling to update the parent attributes correctly.

Signed-off-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-23 07:46:59 -07:00
Allison Henderson
5a8338c882 xfs: Add parent pointers to rename
This patch removes the old parent pointer attribute during the rename
operation, and re-adds the updated parent pointer.

Signed-off-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
[djwong: adjust to new ondisk format]
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-23 07:46:59 -07:00
Allison Henderson
d2d18330f6 xfs: remove parent pointers in unlink
This patch removes the parent pointer attribute during unlink

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
[djwong: adjust to new ondisk format, minor rebase fixes]
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-23 07:46:59 -07:00
Allison Henderson
f1097be220 xfs: add parent attributes to link
This patch modifies xfs_link to add a parent pointer to the inode.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
[djwong: minor rebase fixes]
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-23 07:46:58 -07:00
Allison Henderson
b7c62d90c1 xfs: parent pointer attribute creation
Add parent pointer attribute during xfs_create, and subroutines to
initialize attributes.  Note that the xfs_attr_intent object contains a
pointer to the caller's xfs_da_args object, so the latter must persist
until transaction commit.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
[djwong: shorten names, adjust to new format, set init_xattrs for parent
pointers]
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-23 07:46:58 -07:00
Allison Henderson
297da63379 xfs: Expose init_xattrs in xfs_create_tmpfile
Tmp files are used as part of rename operations and will need attr forks
initialized for parent pointers.  Expose the init_xattrs parameter to
the calling function to initialize the fork.

Signed-off-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-23 07:46:57 -07:00
Christoph Hellwig
6a94b1acda xfs: reinstate delalloc for RT inodes (if sb_rextsize == 1)
Commit aff3a9edb7 ("xfs: Use preallocation for inodes with extsz
hints") disabled delayed allocation for all inodes with extent size
hints due a data exposure problem.  It turns out we fixed this data
exposure problem since by always creating unwritten extents for
delalloc conversions due to more data exposure problems, but the
writeback path doesn't actually support extent size hints when
converting delalloc these days, which probably isn't a problem given
that people using the hints know what they get.

However due to the way how xfs_get_extsz_hint is implemented, it
always claims an extent size hint for RT inodes even if the RT
extent size is a single FSB.  Due to that the above commit effectively
disabled delalloc support for RT inodes.

Switch xfs_get_extsz_hint to return 0 for this case and work around
that in a few places to reinstate delalloc support for RT inodes on
file systems with an sb_rextsize of 1.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-04-22 18:00:50 +05:30
Allison Henderson
69291726ca xfs: Hold inode locks in xfs_rename
Modify xfs_rename to hold all inode locks across a rename operation
We will need this later when we add parent pointers

Signed-off-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Catherine Hoang <catherine.hoang@oracle.com>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-15 14:59:02 -07:00
Allison Henderson
bd5562111d xfs: Hold inode locks in xfs_trans_alloc_dir
Modify xfs_trans_alloc_dir to hold locks after return.  Caller will be
responsible for manual unlock.  We will need this later to hold locks
across parent pointer operations

Signed-off-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Catherine Hoang <catherine.hoang@oracle.com>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-15 14:59:02 -07:00
Allison Henderson
267979b4ce xfs: Hold inode locks in xfs_ialloc
Modify xfs_ialloc to hold locks after return.  Caller will be
responsible for manual unlock.  We will need this later to hold locks
across parent pointer operations

Signed-off-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Catherine Hoang <catherine.hoang@oracle.com>
[djwong: hold the parent ilocked across transaction rolls too]
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-15 14:59:02 -07:00
Allison Henderson
7560c937b4 xfs: Increase XFS_DEFER_OPS_NR_INODES to 5
Renames that generate parent pointer updates can join up to 5
inodes locked in sorted order.  So we need to increase the
number of defer ops inodes and relock them in the same way.

Signed-off-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Catherine Hoang <catherine.hoang@oracle.com>
[djwong: have one sorting function]
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-15 14:59:01 -07:00
Darrick J. Wong
5f204051d9 xfs: pin inodes that would otherwise overflow link count
The VFS inc_nlink function does not explicitly check for integer
overflows in the i_nlink field.  Instead, it checks the link count
against s_max_links in the vfs_{link,create,rename} functions.  XFS
sets the maximum link count to 2.1 billion, so integer overflows should
not be a problem.

However.  It's possible that online repair could find that a file has
more than four billion links, particularly if the link count got
corrupted while creating hardlinks to the file.  The di_nlinkv2 field is
not large enough to store a value larger than 2^32, so we ought to
define a magic pin value of ~0U which means that the inode never gets
deleted.  This will prevent a UAF error if the repair finds this
situation and users begin deleting links to the file.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-15 14:58:59 -07:00
Darrick J. Wong
10d587ecb7 xfs: check AGI unlinked inode buckets
Look for corruptions in the AGI unlinked bucket chains.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-15 14:58:58 -07:00
Darrick J. Wong
1e58a8ccf2 xfs: move orphan files to the orphanage
When we're repairing a directory structure or fixing the dotdot entry of
a subdirectory, it's possible that we won't ever find a parent for the
subdirectory.  When this is the case, move it to the orphanage, aka
/lost+found.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-15 14:58:56 -07:00
Darrick J. Wong
8d81082a8c xfs: inactivate directory data blocks
Teach inode inactivation to delete all the incore buffers backing a
directory.  In normal runtime this should never happen because the VFS
forbids rmdir on a non-empty directory.

In the next patch, online directory repair stands up a new directory,
exchanges it with the broken directory, and then drops the private
temporary directory.  If we cancel the repair just prior to exchanging
the directory contents, the new directory will need to be torn down.
Note: If we commit the repair, reaping will take care of all the ondisk
space allocations and incore buffers for the old corrupt directory.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-15 14:58:55 -07:00
Darrick J. Wong
e921533ef1 xfs: ensure unlinked list state is consistent with nlink during scrub
Now that we have the means to tell if an inode is on an unlinked inode
list or not, we can check that an inode with zero link count is on the
unlinked list; and an inode that has nonzero link count is not on that
list.  Make repair clean things up too.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-15 14:58:54 -07:00
Darrick J. Wong
84c14ee39d xfs: create temporary files and directories for online repair
Teach the online repair code how to create temporary files or
directories.  These temporary files can be used to stage reconstructed
information until we're ready to perform an atomic extent swap to commit
the new metadata.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-15 14:58:48 -07:00
Darrick J. Wong
ee20808d84 xfs: create a new helper to return a file's allocation unit
Create a new helper function to calculate the fundamental allocation
unit (i.e. the smallest unit of space we can allocate) of a file.
Things are going to get hairy with range-exchange on the realtime
device, so prepare for this now.

Remove the static attribute from xfs_is_falloc_aligned since the next
patch will need it.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-15 14:54:10 -07:00
Darrick J. Wong
a4db266a70 xfs: move inode lease breaking functions to xfs_inode.c
The lease breaking functions operate at the scope of the entire VFS
inode, not subranges of a file.  Move them to xfs_inode.c since they're
already declared in xfs_inode.h.  This cleanup moves us closer to
having xfs_FOO.h declare only the symbols in xfs_FOO.c.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-15 14:54:07 -07:00
Darrick J. Wong
549d3c9a29 xfs: pass xfs_buf lookup flags to xfs_*read_agi
Allow callers to pass buffer lookup flags to xfs_read_agi and
xfs_ialloc_read_agi.  This will be used in the next patch to fix a
deadlock in the online fsck inode scanner.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-15 14:54:03 -07:00
Andrey Albershteyn
e23d7e82b7 xfs: allow cross-linking special files without project quota
There's an issue that if special files is created before quota
project is enabled, then it's not possible to link this file. This
works fine for normal files. This happens because xfs_quota skips
special files (no ioctls to set necessary flags). The check for
having the same project ID for source and destination then fails as
source file doesn't have any ID.

mkfs.xfs -f /dev/sda
mount -o prjquota /dev/sda /mnt/test

mkdir /mnt/test/foo
mkfifo /mnt/test/foo/fifo1

xfs_quota -xc "project -sp /mnt/test/foo 9" /mnt/test
> Setting up project 9 (path /mnt/test/foo)...
> xfs_quota: skipping special file /mnt/test/foo/fifo1
> Processed 1 (/etc/projects and cmdline) paths for project 9 with recursion depth infinite (-1).

ln /mnt/test/foo/fifo1 /mnt/test/foo/fifo1_link
> ln: failed to create hard link '/mnt/test/testdir/fifo1_link' => '/mnt/test/testdir/fifo1': Invalid cross-device link

mkfifo /mnt/test/foo/fifo2
ln /mnt/test/foo/fifo2 /mnt/test/foo/fifo2_link

Fix this by allowing linking of special files to the project quota
if special files doesn't have any ID set (ID = 0).

Signed-off-by: Andrey Albershteyn <aalbersh@redhat.com>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-04-01 11:55:49 +05:30
Darrick J. Wong
0e24ec3c56 xfs: remember sick inodes that get inactivated
If an unhealthy inode gets inactivated, remember this fact in the
per-fs health summary.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-02-22 12:33:03 -08:00
Darrick J. Wong
baf44fa5c3 xfs: report inode corruption errors to the health system
Whenever we encounter corrupt inode records, we should report that to
the health monitoring system for later reporting.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-02-22 12:32:43 -08:00
Darrick J. Wong
de6077ec41 xfs: report ag header corruption errors to the health tracking system
Whenever we encounter a corrupt AG header, we should report that to the
health monitoring system for later reporting.  Buffer readers that don't
respond to corruption events with a _mark_sick call can be detected with
the following script:

#!/bin/bash

# Detect missing calls to xfs_*_mark_sick

filter=cat
tty -s && filter=less

git grep -A10  -E '( = xfs_trans_read_buf| = xfs_buf_read\()' fs/xfs/*.[ch] fs/xfs/libxfs/*.[ch] | awk '
BEGIN {
	ignore = 0;
	lineno = 0;
	delete lines;
}
{
	if ($0 == "--") {
		if (!ignore) {
			for (i = 0; i < lineno; i++) {
				print(lines[i]);
			}
			printf("--\n");
		}
		delete lines;
		lineno = 0;
		ignore = 0;
	} else if ($0 ~ /mark_sick/) {
		ignore = 1;
	} else {
		lines[lineno++] = $0;
	}
}
' | $filter

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-02-22 12:31:03 -08:00
Darrick J. Wong
86a1746eea xfs: track directory entry updates during live nlinks fsck
Create the necessary hooks in the directory operations
(create/link/unlink/rename) code so that our live nlink scrub code can
stay up to date with link count updates in the rest of the filesystem.
This will be the means to keep our shadow link count information up to
date while the scan runs in real time.

In online fsck part 2, we'll use these same hooks to handle repairs
to directories and parent pointer information.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-02-22 12:30:59 -08:00
Darrick J. Wong
ebd610fe82 xfs: create a helper to count per-device inode block usage
Create a helper to compute the number of blocks that a file has
allocated from the data realtime volumes.  This patch was
split out to reduce the size of the upcoming quotacheck patch.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-02-22 12:30:53 -08:00
Matthew Wilcox (Oracle)
785dd13152 xfs: Remove mrlock wrapper
mrlock was an rwsem wrapper that also recorded whether the lock was
held for read or write.  Now that we can ask the generic code whether
the lock is held for read or write, we can remove this wrapper and use
an rwsem directly.

As the comment says, we can't use lockdep to assert that the ILOCK is
held for write, because we might be in a workqueue, and we aren't able
to tell lockdep that we do in fact own the lock.

Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-02-19 21:19:33 +05:30
Matthew Wilcox (Oracle)
3fed24fffc xfs: Replace xfs_isilocked with xfs_assert_ilocked
To use the new rwsem_assert_held()/rwsem_assert_held_write(), we can't
use the existing ASSERT macro.  Add a new xfs_assert_ilocked() and
convert all the callers.

Fix an apparent bug in xfs_isilocked(): If the caller specifies
XFS_IOLOCK_EXCL | XFS_ILOCK_EXCL, xfs_assert_ilocked() will check both
the IOLOCK and the ILOCK are held for write.  xfs_isilocked() only
checked that the ILOCK was held for write.

xfs_assert_ilocked() is always on, even if DEBUG or XFS_WARN aren't
defined.  It's a cheap check, so I don't think it's worth defining
it away.

Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-02-19 21:19:33 +05:30
Dave Chinner
d4c75a1b40 xfs: convert remaining kmem_free() to kfree()
The remaining callers of kmem_free() are freeing heap memory, so
we can convert them directly to kfree() and get rid of kmem_free()
altogether.

This conversion was done with:

$ for f in `git grep -l kmem_free fs/xfs`; do
> sed -i s/kmem_free/kfree/ $f
> done
$

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-02-13 18:07:34 +05:30
Christoph Hellwig
6e145f943b xfs: make if_data a void pointer
The xfs_ifork structure currently has a union of the if_root void pointer
and the if_data char pointer.  In either case it is an opaque pointer
that depends on the fork format.  Replace the union with a single if_data
void pointer as that is what almost all callers want.  Only the symlink
NULL termination code in xfs_init_local_fork actually needs a new local
variable now.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2023-12-29 13:37:03 +05:30
Darrick J. Wong
a59eb5fc21 xfs: create a new inode fork block unmap helper
Create a new helper to unmap blocks from an inode's fork.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2023-12-15 10:03:43 -08:00
Darrick J. Wong
d9041681dd xfs: set inode sick state flags when we zap either ondisk fork
In a few patches, we'll add some online repair code that tries to
massage the ondisk inode record just enough to get it to pass the inode
verifiers so that we can continue with more file repairs.  Part of that
massaging can include zapping the ondisk forks to clear errors.  After
that point, the bmap fork repair functions will rebuild the zapped
forks.

Christoph asked for stronger protections against online repair zapping a
fork to get the inode to load vs. other threads trying to access the
partially repaired file.  Do this by adding a special "[DA]FORK_ZAPPED"
inode health flag whenever repair zaps a fork, and sprinkling checks for
that flag into the various file operations for things that don't like
handling an unexpected zero-extents fork.

In practice xfs_scrub will scrub and fix the forks almost immediately
after zapping them, so the window is very small.  However, if a crash or
unmount should occur, we can still detect these zapped inode forks by
looking for a zero-extents fork when data was expected.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2023-12-15 10:03:35 -08:00
Linus Torvalds
34f7632627 New code for 6.7:
* Realtime device subsystem
     - Cleanup usage of xfs_rtblock_t and xfs_fsblock_t data types.
     - Replace open coded conversions between rt blocks and rt extents with
       calls to static inline helpers.
     - Replace open coded realtime geometry compuation and macros with helper
       functions.
     - CPU usage optimizations for realtime allocator.
     - Misc. Bug fixes associated with Realtime device.
   * Allow read operations to execute while an FICLONE ioctl is being serviced.
   * Misc. bug fixes
     - Alert user when xfs_droplink() encounters an inode with a link count of zero.
     - Handle the case where the allocator could return zero extents when
       servicing an fallocate request.
 
 Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQQjMC4mbgVeU7MxEIYH7y4RirJu9AUCZUEvIgAKCRAH7y4RirJu
 9JnQAQCtnQAhZHbh9U2BNJI4hrpNm4Mh54DVlZvPFHW1N96AUAEA0Hnic/Zusrfc
 9aaHQbzs4qGSZ5UJWOU6GxcWob/tggs=
 =Ay05
 -----END PGP SIGNATURE-----

Merge tag 'xfs-6.7-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs updates from Chandan Babu:

 - Realtime device subsystem:
    - Cleanup usage of xfs_rtblock_t and xfs_fsblock_t data types
    - Replace open coded conversions between rt blocks and rt extents
      with calls to static inline helpers
    - Replace open coded realtime geometry compuation and macros with
      helper functions
    - CPU usage optimizations for realtime allocator
    - Misc bug fixes associated with Realtime device

 - Allow read operations to execute while an FICLONE ioctl is being
   serviced

 - Misc bug fixes:
    - Alert user when xfs_droplink() encounters an inode with a link
      count of zero
    - Handle the case where the allocator could return zero extents when
      servicing an fallocate request

* tag 'xfs-6.7-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (40 commits)
  xfs: allow read IO and FICLONE to run concurrently
  xfs: handle nimaps=0 from xfs_bmapi_write in xfs_alloc_file_space
  xfs: introduce protection for drop nlink
  xfs: don't look for end of extent further than necessary in xfs_rtallocate_extent_near()
  xfs: don't try redundant allocations in xfs_rtallocate_extent_near()
  xfs: limit maxlen based on available space in xfs_rtallocate_extent_near()
  xfs: return maximum free size from xfs_rtany_summary()
  xfs: invert the realtime summary cache
  xfs: simplify rt bitmap/summary block accessor functions
  xfs: simplify xfs_rtbuf_get calling conventions
  xfs: cache last bitmap block in realtime allocator
  xfs: use accessor functions for summary info words
  xfs: consolidate realtime allocation arguments
  xfs: create helpers for rtsummary block/wordcount computations
  xfs: use accessor functions for bitmap words
  xfs: create helpers for rtbitmap block/wordcount computations
  xfs: create a helper to handle logging parts of rt bitmap/summary blocks
  xfs: convert rt summary macros to helpers
  xfs: convert open-coded xfs_rtword_t pointer accesses to helper
  xfs: remove XFS_BLOCKWSIZE and XFS_BLOCKWMASK macros
  ...
2023-11-08 13:22:16 -08:00
Catherine Hoang
14a537983b xfs: allow read IO and FICLONE to run concurrently
One of our VM cluster management products needs to snapshot KVM image
files so that they can be restored in case of failure. Snapshotting is
done by redirecting VM disk writes to a sidecar file and using reflink
on the disk image, specifically the FICLONE ioctl as used by
"cp --reflink". Reflink locks the source and destination files while it
operates, which means that reads from the main vm disk image are blocked,
causing the vm to stall. When an image file is heavily fragmented, the
copy process could take several minutes. Some of the vm image files have
50-100 million extent records, and duplicating that much metadata locks
the file for 30 minutes or more. Having activities suspended for such
a long time in a cluster node could result in node eviction.

Clone operations and read IO do not change any data in the source file,
so they should be able to run concurrently. Demote the exclusive locks
taken by FICLONE to shared locks to allow reads while cloning. While a
clone is in progress, writes will take the IOLOCK_EXCL, so they block
until the clone completes.

Link: https://lore.kernel.org/linux-xfs/8911B94D-DD29-4D6E-B5BC-32EAF1866245@oracle.com/
Signed-off-by: Catherine Hoang <catherine.hoang@oracle.com>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2023-10-23 12:02:26 +05:30
Cheng Lin
2b99e410b2 xfs: introduce protection for drop nlink
When abnormal drop_nlink are detected on the inode,
return error, to avoid corruption propagation.

Signed-off-by: Cheng Lin <cheng.lin130@zte.com.cn>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2023-10-23 11:04:47 +05:30
Jeff Layton
75d1e312bb
xfs: convert to new timestamp accessors
Convert to using the new inode timestamp accessor functions.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/r/20231004185347.80880-75-jlayton@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-10-18 14:08:29 +02:00
Darrick J. Wong
537c013b14 xfs: fix reloading entire unlinked bucket lists
During review of the patcheset that provided reloading of the incore
iunlink list, Dave made a few suggestions, and I updated the copy in my
dev tree.  Unfortunately, I then got distracted by ... who even knows
what ... and forgot to backport those changes from my dev tree to my
release candidate branch.  I then sent multiple pull requests with stale
patches, and that's what was merged into -rc3.

So.

This patch re-adds the use of an unlocked iunlink list check to
determine if we want to allocate the resources to recreate the incore
list.  Since lost iunlinked inodes are supposed to be rare, this change
helps us avoid paying the transaction and AGF locking costs every time
we open any inode.

This also re-adds the shutdowns on failure, and re-applies the
restructuring of the inner loop in xfs_inode_reload_unlinked_bucket, and
re-adds a requested comment about the quotachecking code.

Retain the original RVB tag from Dave since there's no code change from
the last submission.

Fixes: 68b957f64f ("xfs: load uncached unlinked inodes into memory on demand")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-09-24 18:12:13 -07:00
Darrick J. Wong
49813a21ed xfs: make inode unlinked bucket recovery work with quotacheck
Teach quotacheck to reload the unlinked inode lists when walking the
inode table.  This requires extra state handling, since it's possible
that a reloaded inode will get inactivated before quotacheck tries to
scan it; in this case, we need to ensure that the reloaded inode does
not have dquots attached when it is freed.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2023-09-12 10:31:07 -07:00
Darrick J. Wong
83771c50e4 xfs: reload entire unlinked bucket lists
The previous patch to reload unrecovered unlinked inodes when adding a
newly created inode to the unlinked list is missing a key piece of
functionality.  It doesn't handle the case that someone calls xfs_iget
on an inode that is not the last item in the incore list.  For example,
if at mount time the ondisk iunlink bucket looks like this:

AGI -> 7 -> 22 -> 3 -> NULL

None of these three inodes are cached in memory.  Now let's say that
someone tries to open inode 3 by handle.  We need to walk the list to
make sure that inodes 7 and 22 get loaded cold, and that the
i_prev_unlinked of inode 3 gets set to 22.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2023-09-12 10:31:07 -07:00
Darrick J. Wong
f12b96683d xfs: use i_prev_unlinked to distinguish inodes that are not on the unlinked list
Alter the definition of i_prev_unlinked slightly to make it more obvious
when an inode with 0 link count is not part of the iunlink bucket lists
rooted in the AGI.  This distinction is necessary because it is not
sufficient to check inode.i_nlink to decide if an inode is on the
unlinked list.  Updates to i_nlink can happen while holding only
ILOCK_EXCL, but updates to an inode's position in the AGI unlinked list
(which happen after the nlink update) requires both ILOCK_EXCL and the
AGI buffer lock.

The next few patches will make it possible to reload an entire unlinked
bucket list when we're walking the inode table or performing handle
operations and need more than the ability to iget the last inode in the
chain.

The upcoming directory repair code also needs to be able to make this
distinction to decide if a zero link count directory should be moved to
the orphanage or allowed to inactivate.  An upcoming enhancement to the
online AGI fsck code will need this distinction to check and rebuild the
AGI unlinked buckets.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2023-09-12 10:31:07 -07:00
Darrick J. Wong
68b957f64f xfs: load uncached unlinked inodes into memory on demand
shrikanth hegde reports that filesystems fail shortly after mount with
the following failure:

	WARNING: CPU: 56 PID: 12450 at fs/xfs/xfs_inode.c:1839 xfs_iunlink_lookup+0x58/0x80 [xfs]

This of course is the WARN_ON_ONCE in xfs_iunlink_lookup:

	ip = radix_tree_lookup(&pag->pag_ici_root, agino);
	if (WARN_ON_ONCE(!ip || !ip->i_ino)) { ... }

From diagnostic data collected by the bug reporters, it would appear
that we cleanly mounted a filesystem that contained unlinked inodes.
Unlinked inodes are only processed as a final step of log recovery,
which means that clean mounts do not process the unlinked list at all.

Prior to the introduction of the incore unlinked lists, this wasn't a
problem because the unlink code would (very expensively) traverse the
entire ondisk metadata iunlink chain to keep things up to date.
However, the incore unlinked list code complains when it realizes that
it is out of sync with the ondisk metadata and shuts down the fs, which
is bad.

Ritesh proposed to solve this problem by unconditionally parsing the
unlinked lists at mount time, but this imposes a mount time cost for
every filesystem to catch something that should be very infrequent.
Instead, let's target the places where we can encounter a next_unlinked
pointer that refers to an inode that is not in cache, and load it into
cache.

Note: This patch does not address the problem of iget loading an inode
from the middle of the iunlink list and needing to set i_prev_unlinked
correctly.

Reported-by: shrikanth hegde <sshegde@linux.vnet.ibm.com>
Triaged-by: Ritesh Harjani <ritesh.list@gmail.com>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-09-12 10:31:07 -07:00
Darrick J. Wong
76e589013f xfs: allow inode inactivation during a ro mount log recovery
In the next patch, we're going to prohibit log recovery if the primary
superblock contains an unrecognized rocompat feature bit even on
readonly mounts.  This requires removing all the code in the log
mounting process that temporarily disables the readonly state.

Unfortunately, inode inactivation disables itself on readonly mounts.
Clearing the iunlinked lists after log recovery needs inactivation to
run to free the unreferenced inodes, which (AFAICT) is the only reason
why log mounting plays games with the readonly state in the first place.

Therefore, change the inactivation predicates to allow inactivation
during log recovery of a readonly mount.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-09-12 10:31:07 -07:00
Jeff Layton
a0a415e34b xfs: convert to ctime accessor functions
In later patches, we're going to change how the inode's ctime field is
used. Switch to using accessor functions instead of raw accesses of
inode->i_ctime.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Message-Id: <20230705190309.579783-80-jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-07-24 10:30:06 +02:00
Dave Chinner
d4d12c02bf xfs: collect errors from inodegc for unlinked inode recovery
Unlinked list recovery requires errors removing the inode the from
the unlinked list get fed back to the main recovery loop. Now that
we offload the unlinking to the inodegc work, we don't get errors
being fed back when we trip over a corruption that prevents the
inode from being removed from the unlinked list.

This means we never clear the corrupt unlinked list bucket,
resulting in runtime operations eventually tripping over it and
shutting down.

Fix this by collecting inodegc worker errors and feed them
back to the flush caller. This is largely best effort - the only
context that really cares is log recovery, and it only flushes a
single inode at a time so we don't need complex synchronised
handling. Essentially the inodegc workers will capture the first
error that occurs and the next flush will gather them and clear
them. The flush itself will only report the first gathered error.

In the cases where callers can return errors, propagate the
collected inodegc flush error up the error handling chain.

In the case of inode unlinked list recovery, there are several
superfluous calls to flush queued unlinked inodes -
xlog_recover_iunlink_bucket() guarantees that it has flushed the
inodegc and collected errors before it returns. Hence nothing in the
calling path needs to run a flush, even when an error is returned.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2023-06-05 14:48:15 +10:00
Linus Torvalds
c0927a7a53 New code for 6.3-rc1, part 2:
* Fix a deadlock in the free space allocator due to the AG-walking
    algorithm forgetting to follow AG-order locking rules.
  * Make the inode allocator prefer existing free inodes instead of
    failing to allocate new inode chunks when free space is low.
  * Set minleft correctly when setting allocator parameters for bmap
    changes.
  * Fix uninitialized variable access in the getfsmap code.
  * Make a distinction between active and passive per-AG structure
    references.  For now, active references are taken to perform some
    work in an AG on behalf of a high level operation; passive references
    are used by lower level code to finish operations started by other
    threads.  Eventually this will become part of online shrink.
  * Split out all the different allocator strategies into separate
    functions to move us away from design antipattern of filling out a
    huge structure for various differentish things and issuing a single
    function multiplexing call.
  * Various cleanups in the filestreams allocator code, which we might
    very well want to deprecate instead of continuing.
  * Fix a bug with the agi rotor code that was introduced earlier in this
    series.
 
 Signed-off-by: Darrick J. Wong <djwong@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCY/zgqgAKCRBKO3ySh0YR
 plIkAQDIscqdqXGH01gF19/ncqG2GUaXY+/zeOReuk1Iv3VEVgD+MVXf+QvHk7LD
 /LTWNl2K6NQmE/9RtaBt0aFNDzvIAgU=
 =k7r8
 -----END PGP SIGNATURE-----

Merge tag 'xfs-6.3-merge-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull moar xfs updates from Darrick Wong:
 "This contains a fix for a deadlock in the allocator. It continues the
  slow march towards being able to offline AGs, and it refactors the
  interface to the xfs allocator to be less indirection happy.

  Summary:

   - Fix a deadlock in the free space allocator due to the AG-walking
     algorithm forgetting to follow AG-order locking rules

   - Make the inode allocator prefer existing free inodes instead of
     failing to allocate new inode chunks when free space is low

   - Set minleft correctly when setting allocator parameters for bmap
     changes

   - Fix uninitialized variable access in the getfsmap code

   - Make a distinction between active and passive per-AG structure
     references. For now, active references are taken to perform some
     work in an AG on behalf of a high level operation; passive
     references are used by lower level code to finish operations
     started by other threads. Eventually this will become part of
     online shrink

   - Split out all the different allocator strategies into separate
     functions to move us away from design antipattern of filling out a
     huge structure for various differentish things and issuing a single
     function multiplexing call

   - Various cleanups in the filestreams allocator code, which we might
     very well want to deprecate instead of continuing

   - Fix a bug with the agi rotor code that was introduced earlier in
     this series"

* tag 'xfs-6.3-merge-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (44 commits)
  xfs: restore old agirotor behavior
  xfs: fix uninitialized variable access
  xfs: refactor the filestreams allocator pick functions
  xfs: return a referenced perag from filestreams allocator
  xfs: pass perag to filestreams tracing
  xfs: use for_each_perag_wrap in xfs_filestream_pick_ag
  xfs: track an active perag reference in filestreams
  xfs: factor out MRU hit case in xfs_filestream_select_ag
  xfs: remove xfs_filestream_select_ag() longest extent check
  xfs: merge new filestream AG selection into xfs_filestream_select_ag()
  xfs: merge filestream AG lookup into xfs_filestream_select_ag()
  xfs: move xfs_bmap_btalloc_filestreams() to xfs_filestreams.c
  xfs: use xfs_bmap_longest_free_extent() in filestreams
  xfs: get rid of notinit from xfs_bmap_longest_free_extent
  xfs: factor out filestreams from xfs_bmap_btalloc_nullfb
  xfs: convert trim to use for_each_perag_range
  xfs: convert xfs_alloc_vextent_iterate_ags() to use perag walker
  xfs: move the minimum agno checks into xfs_alloc_vextent_check_args
  xfs: fold xfs_alloc_ag_vextent() into callers
  xfs: move allocation accounting to xfs_alloc_vextent_set_fsbno()
  ...
2023-02-28 16:08:30 -08:00
Dave Chinner
692b6cddeb xfs: t_firstblock is tracking AGs not blocks
The tp->t_firstblock field is now raelly tracking the highest AG we
have locked, not the block number of the highest allocation we've
made. It's purpose is to prevent AGF locking deadlocks, so rename it
to "highest AG" and simplify the implementation to just track the
agno rather than a fsbno.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-02-11 04:11:06 +11:00
Christian Brauner
c14329d39f
fs: port fs{g,u}id helpers to mnt_idmap
Convert to struct mnt_idmap.

Last cycle we merged the necessary infrastructure in
256c8aed2b ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.

Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.

Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.

Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2023-01-19 09:24:30 +01:00
Christian Brauner
e67fe63341
fs: port i_{g,u}id_into_vfs{g,u}id() to mnt_idmap
Convert to struct mnt_idmap.
Remove legacy file_mnt_user_ns() and mnt_user_ns().

Last cycle we merged the necessary infrastructure in
256c8aed2b ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.

Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.

Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.

Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2023-01-19 09:24:29 +01:00
Christian Brauner
f2d40141d5
fs: port inode_init_owner() to mnt_idmap
Convert to struct mnt_idmap.

Last cycle we merged the necessary infrastructure in
256c8aed2b ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.

Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.

Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.

Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2023-01-19 09:24:28 +01:00