Commit Graph

76982 Commits

Author SHA1 Message Date
Darrick J. Wong
4613b17cc4 xfs: introduce in-memory inode unlink log items
To facilitate future improvements in inode logging and improving
 inode cluster buffer locking order consistency, we need a new
 mechanism for defering inode cluster buffer modifications during
 unlinked list modifications.
 
 The unlinked inode list buffer locking is complex. The unlinked
 list is unordered - we add to the tail, remove from where-ever the
 inode is in the list. Hence we might need to lock two inode buffers
 here (previous inode in list and the one being removed). While we
 can order the locking of these buffers correctly within the confines
 of the unlinked list, there may be other inodes that need buffer
 locking in the same transaction. e.g. O_TMPFILE being linked into a
 directory also modifies the directory inode.
 
 Hence we need a mechanism for defering unlinked inode list updates
 until a point where we know that all modifications have been made
 and all that remains is to lock and modify the cluster buffers.
 
 We can do this by first observing that we serialise unlinked list
 modifications by holding the AGI buffer lock. IOWs, the AGI is going
 to be locked until the transaction commits any time we modify the
 unlinked list. Hence it doesn't matter when in the unlink
 transactions that we actually load, lock and modify the inode
 cluster buffer.
 
 We add an in-memory unlinked inode log item to defer the inode
 cluster buffer update to transaction commit time where it can be
 ordered with all the other inode cluster operations that need to be
 done. Essentially all we need to do is record the inodes that need
 to have their unlinked list pointer updated in a new log item that
 we attached to the transaction.
 
 This log item exists purely for the purpose of delaying the update
 of the unlinked list pointer until the inode cluster buffer can be
 locked in the correct order around the other inode cluster buffers.
 It plays no part in the actual commit, and there's no change to
 anything that is written to the log. i.e. the inode cluster buffers
 still have to be fully logged here (not just ordered) as log
 recovery depedends on this to replay mods to the unlinked inode
 list.
 
 Hence if we add a "precommit" hook into xfs_trans_commit()
 to run a "precommit" operation on these iunlink log items, we can
 delay the locking, modification and logging of the inode cluster
 buffer until after all other modifications have been made. The
 precommit hook reuires us to sort the items that are going to be run
 so that we can lock precommit items in the correct order as we
 perform the modifications they describe.
 
 To make this unlinked inode list processing simpler and easier to
 implement as a log item, we need to change the way we track the
 unlinked list in memory. Starting from the observation that an inode
 on the unlinked list is pinned in memory by the VFS, we can use the
 xfs_inode itself to track the unlinked list. To do this efficiently,
 we want the unlinked list to be a double linked list. The problem
 here is that we need a list per AGI unlinked list, and there are 64
 of these per AGI. The approach taken in this patchset is to shadow
 the AGI unlinked list heads in the perag, and link inodes by agino,
 hence requiring only 8 extra bytes per inode to track this state.
 
 We can then use the agino pointers for lockless inode cache lookups
 to retreive the inode. The aginos in the inode are modified only
 under the AGI lock, just like the cluster buffer pointers, so we
 don't need any extra locking here.  The i_next_unlinked field tracks
 the on-disk value of the unlinked list, and the i_prev_unlinked is a
 purely in-memory pointer that enables us to efficiently remove
 inodes from the middle of the list.
 
 This results in moving a lot of the unlink modification work into
 the precommit operations on the unlink log item. Tracking all the
 unlinked inodes in the inodes themselves also gets rid of the
 unlinked list reference hash table that is used to track this back
 pointer relationship. This greatly simplifies the the unlinked list
 modification code, and removes memory allocations in this hot path
 to track back pointers. This, overall, slightly reduces the CPU
 overhead of the unlink path.
 
 The result of this log item means that we move all the actual
 manipulation of objects to be logged out of the iunlink path and
 into the iunlink item. This allows for future optimisation of this
 mechanism without needing changes to high level unlink path, as
 well as making the unlink lock ordering predictable and synchronised
 with other operations that may require inode cluster locking.
 
 Signed-off-by: Dave Chinner <dchinner@redhat.com>
 -----BEGIN PGP SIGNATURE-----
 
 iQJIBAABCgAyFiEEmJOoJ8GffZYWSjj/regpR/R1+h0FAmLPvZAUHGRhdmlkQGZy
 b21vcmJpdC5jb20ACgkQregpR/R1+h2CDBAAj9QH4/XIe8JIx/mKgAGzcNNwQxu8
 geBqb5S2ri0oB22pRXKc/3zArw/8zPwcgZF83ChkFrQ6tLn4JGkEEuvIKr3b8k50
 2AEghYf8dqCaXRpkdvIGjJtdK54MFZIHv9TYRwHVzBp3WLtrz7uHmKeRf2qeSBMI
 DLurzVIbcocMptvHxrZZCpf1ajuVdovXtuw8ExiORZZKLOeF+3xBGztenkfh2BTO
 8Kh8qJVSNN41XQ8h87PWyQtmah6JouqURXXGERJcgLbr80pTSw2EBihJvmXUmn8y
 qnoT27TCPAMOEDTWe+SHzLOVRLvhN+at/lFWbvas6PwOvDGAwQQtZkv/QyLTSgqD
 6Zg9xJeeSgHhHP2kCeLlKmvW1dRptcUzhCWOrQ9Ry+WZnKK5ZenevkaAXAva6ucS
 NXwIU1DnWfJ51SHIYQiQIci2g+vF+pnJRQq1DtYUuwtBSWfsmw1uquNZodgbA9Ue
 k6hfk4qVua63k+vXsd5gVdCHT+Liw+1ldTInl2GNhT/riNzewO0HY3zmc1aZQyMM
 mymHXKVcQJbLpJvwqB5SXq8a37fbpoQDYlycptSF/YxxBhiCKKWuc6q7Tl6Y9VSS
 qpSHvh+MkJcP8PYtPjNUcJ9yeXhYJgkv1KK47zkIKzOD9a+zh4SIrfiBUflZbpQq
 M9ubXGHVmFqS4Nk=
 =59M0
 -----END PGP SIGNATURE-----

Merge tag 'xfs-iunlink-item-5.20' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs into xfs-5.20-mergeB

xfs: introduce in-memory inode unlink log items

To facilitate future improvements in inode logging and improving
inode cluster buffer locking order consistency, we need a new
mechanism for defering inode cluster buffer modifications during
unlinked list modifications.

The unlinked inode list buffer locking is complex. The unlinked
list is unordered - we add to the tail, remove from where-ever the
inode is in the list. Hence we might need to lock two inode buffers
here (previous inode in list and the one being removed). While we
can order the locking of these buffers correctly within the confines
of the unlinked list, there may be other inodes that need buffer
locking in the same transaction. e.g. O_TMPFILE being linked into a
directory also modifies the directory inode.

Hence we need a mechanism for defering unlinked inode list updates
until a point where we know that all modifications have been made
and all that remains is to lock and modify the cluster buffers.

We can do this by first observing that we serialise unlinked list
modifications by holding the AGI buffer lock. IOWs, the AGI is going
to be locked until the transaction commits any time we modify the
unlinked list. Hence it doesn't matter when in the unlink
transactions that we actually load, lock and modify the inode
cluster buffer.

We add an in-memory unlinked inode log item to defer the inode
cluster buffer update to transaction commit time where it can be
ordered with all the other inode cluster operations that need to be
done. Essentially all we need to do is record the inodes that need
to have their unlinked list pointer updated in a new log item that
we attached to the transaction.

This log item exists purely for the purpose of delaying the update
of the unlinked list pointer until the inode cluster buffer can be
locked in the correct order around the other inode cluster buffers.
It plays no part in the actual commit, and there's no change to
anything that is written to the log. i.e. the inode cluster buffers
still have to be fully logged here (not just ordered) as log
recovery depedends on this to replay mods to the unlinked inode
list.

Hence if we add a "precommit" hook into xfs_trans_commit()
to run a "precommit" operation on these iunlink log items, we can
delay the locking, modification and logging of the inode cluster
buffer until after all other modifications have been made. The
precommit hook reuires us to sort the items that are going to be run
so that we can lock precommit items in the correct order as we
perform the modifications they describe.

To make this unlinked inode list processing simpler and easier to
implement as a log item, we need to change the way we track the
unlinked list in memory. Starting from the observation that an inode
on the unlinked list is pinned in memory by the VFS, we can use the
xfs_inode itself to track the unlinked list. To do this efficiently,
we want the unlinked list to be a double linked list. The problem
here is that we need a list per AGI unlinked list, and there are 64
of these per AGI. The approach taken in this patchset is to shadow
the AGI unlinked list heads in the perag, and link inodes by agino,
hence requiring only 8 extra bytes per inode to track this state.

We can then use the agino pointers for lockless inode cache lookups
to retreive the inode. The aginos in the inode are modified only
under the AGI lock, just like the cluster buffer pointers, so we
don't need any extra locking here.  The i_next_unlinked field tracks
the on-disk value of the unlinked list, and the i_prev_unlinked is a
purely in-memory pointer that enables us to efficiently remove
inodes from the middle of the list.

This results in moving a lot of the unlink modification work into
the precommit operations on the unlink log item. Tracking all the
unlinked inodes in the inodes themselves also gets rid of the
unlinked list reference hash table that is used to track this back
pointer relationship. This greatly simplifies the the unlinked list
modification code, and removes memory allocations in this hot path
to track back pointers. This, overall, slightly reduces the CPU
overhead of the unlink path.

The result of this log item means that we move all the actual
manipulation of objects to be logged out of the iunlink path and
into the iunlink item. This allows for future optimisation of this
mechanism without needing changes to high level unlink path, as
well as making the unlink lock ordering predictable and synchronised
with other operations that may require inode cluster locking.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>

* tag 'xfs-iunlink-item-5.20' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs:
  xfs: add in-memory iunlink log item
  xfs: add log item precommit operation
  xfs: combine iunlink inode update functions
  xfs: clean up xfs_iunlink_update_inode()
  xfs: double link the unlinked inode list
  xfs: introduce xfs_iunlink_lookup
  xfs: refactor xlog_recover_process_iunlinks()
  xfs: track the iunlink list pointer in the xfs_inode
  xfs: factor the xfs_iunlink functions
  xfs: flush inode gc workqueue before clearing agi bucket
2022-07-14 09:21:42 -07:00
Dave Chinner
784eb7d8dd xfs: add in-memory iunlink log item
Now that we have a clean operation to update the di_next_unlinked
field of inode cluster buffers, we can easily defer this operation
to transaction commit time so we can order the inode cluster buffer
locking consistently.

To do this, we introduce a new in-memory log item to track the
unlinked list item modification that we are going to make. This
follows the same observations as the in-memory double linked list
used to track unlinked inodes in that the inodes on the list are
pinned in memory and cannot go away, and hence we can simply
reference them for the duration of the transaction without needing
to take active references or pin them or look them up.

This allows us to pass the xfs_inode to the transaction commit code
along with the modification to be made, and then order the logged
modifications via the ->iop_sort and ->iop_precommit operations
for the new log item type. As this is an in-memory log item, it
doesn't have formatting, CIL or AIL operational hooks - it exists
purely to run the inode unlink modifications and is then removed
from the transaction item list and freed once the precommit
operation has run.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2022-07-14 11:47:42 +10:00
Dave Chinner
fad743d7cd xfs: add log item precommit operation
For inodes that are dirty, we have an attached cluster buffer that
we want to use to track the dirty inode through the AIL.
Unfortunately, locking the cluster buffer and adding it to the
transaction when the inode is first logged in a transaction leads to
buffer lock ordering inversions.

The specific problem is ordering against the AGI buffer. When
modifying unlinked lists, the buffer lock order is AGI -> inode
cluster buffer as the AGI buffer lock serialises all access to the
unlinked lists. Unfortunately, functionality like xfs_droplink()
logs the inode before calling xfs_iunlink(), as do various directory
manipulation functions. The inode can be logged way down in the
stack as far as the bmapi routines and hence, without a major
rewrite of lots of APIs there's no way we can avoid the inode being
logged by something until after the AGI has been logged.

As we are going to be using ordered buffers for inode AIL tracking,
there isn't a need to actually lock that buffer against modification
as all the modifications are captured by logging the inode item
itself. Hence we don't actually need to join the cluster buffer into
the transaction until just before it is committed. This means we do
not perturb any of the existing buffer lock orders in transactions,
and the inode cluster buffer is always locked last in a transaction
that doesn't otherwise touch inode cluster buffers.

We do this by introducing a precommit log item method.  This commit
just introduces the mechanism; the inode item implementation is in
followup commits.

The precommit items need to be sorted into consistent order as we
may be locking multiple items here. Hence if we have two dirty
inodes in cluster buffers A and B, and some other transaction has
two separate dirty inodes in the same cluster buffers, locking them
in different orders opens us up to ABBA deadlocks. Hence we sort the
items on the transaction based on the presence of a sort log item
method.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2022-07-14 11:47:26 +10:00
Dave Chinner
062efdb080 xfs: combine iunlink inode update functions
Combine the logging of the inode unlink list update into the
calling function that looks up the buffer we end up logging. These
do not need to be separate functions as they are both short, simple
operations and there's only a single call path through them. This
new function will end up being the core of the iunlink log item
processing...

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2022-07-14 11:46:59 +10:00
Dave Chinner
5301f87013 xfs: clean up xfs_iunlink_update_inode()
We no longer need to have this function return the previous next
agino value from the on-disk inode as we have it in the in-core
inode now.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2022-07-14 11:46:46 +10:00
Dave Chinner
2fd26cc07e xfs: double link the unlinked inode list
Now we have forwards traversal via the incore inode in place, we now
need to add back pointers to the incore inode to entirely replace
the back reference cache. We use the same lookup semantics and
constraints as for the forwards pointer lookups during unlinks, and
so we can look up any inode in the unlinked list directly and update
the list pointers, forwards or backwards, at any time.

The only wrinkle in converting the unlinked list manipulations to
use in-core previous pointers is that log recovery doesn't have the
incore inode state built up so it can't just read in an inode and
release it to finish off the unlink. Hence we need to modify the
traversal in recovery to read one inode ahead before we
release the inode at the head of the list. This populates the
next->prev relationship sufficient to be able to replay the unlinked
list and hence greatly simplify the runtime code.

This recovery algorithm also requires that we actually remove inodes
from the unlinked list one at a time as background inode
inactivation will result in unlinked list removal racing with the
building of the in-memory unlinked list state. We could serialise
this by holding the AGI buffer lock when constructing the in memory
state, but all that does is lockstep background processing with list
building. It is much simpler to flush the inodegc immediately after
releasing the inode so that it is unlinked immediately and there is
no races present at all.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2022-07-14 11:46:43 +10:00
Dave Chinner
a83d5a8b1d xfs: introduce xfs_iunlink_lookup
When an inode is on an unlinked list during normal operation, it is
guaranteed to be pinned in memory as it is either referenced by the
current unlink operation or it has a open file descriptor that
references it and has it pinned in memory. Hence to look up an inode
on the unlinked list, we can do a direct inode cache lookup and
always expect the lookup to succeed.

Add a function to do this lookup based on the agino that we use to
link the chain of unlinked inodes together so we can begin the
conversion the unlinked list manipulations to use in-memory inodes
rather than inode cluster buffers and remove the backref cache.

Use this lookup function to replace the on-disk inode buffer walk
when removing inodes from the unlinked list with an in-core inode
unlinked list walk.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2022-07-14 11:43:09 +10:00
Dave Chinner
04755d2e58 xfs: refactor xlog_recover_process_iunlinks()
For upcoming changes to the way inode unlinked list processing is
done, the structure of recovery needs to change slightly. We also
really need to untangle the messy error handling in list recovery
so that actions like emptying the bucket on inode lookup failure
are associated with the bucket list walk failing, not failing
to look up the inode.

Refactor the recovery code now to keep the re-organisation seperate
to the algorithm changes.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2022-07-14 11:42:39 +10:00
Dave Chinner
4fcc94d653 xfs: track the iunlink list pointer in the xfs_inode
Having direct access to the i_next_unlinked pointer in unlinked
inodes greatly simplifies the processing of inodes on the unlinked
list. We no longer need to look up the inode buffer just to find
next inode in the list if the xfs_inode is in memory. These
improvements will be realised over upcoming patches as other
dependencies on the inode buffer for unlinked list processing are
removed.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2022-07-14 11:38:54 +10:00
Dave Chinner
a4454cd69c xfs: factor the xfs_iunlink functions
Prep work that separates the locking that protects the unlinked list
from the actual operations being performed. This also helps document
the fact they are performing list insert  and remove operations. No
functional code change.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2022-07-14 11:36:40 +10:00
Zhang Yi
04a98a036c xfs: flush inode gc workqueue before clearing agi bucket
In the procedure of recover AGI unlinked lists, if something bad
happenes on one of the unlinked inode in the bucket list, we would call
xlog_recover_clear_agi_bucket() to clear the whole unlinked bucket list,
not the unlinked inodes after the bad one. If we have already added some
inodes to the gc workqueue before the bad inode in the list, we could
get below error when freeing those inodes, and finaly fail to complete
the log recover procedure.

 XFS (ram0): Internal error xfs_iunlink_remove at line 2456 of file
 fs/xfs/xfs_inode.c.  Caller xfs_ifree+0xb0/0x360 [xfs]

The problem is xlog_recover_clear_agi_bucket() clear the bucket list, so
the gc worker fail to check the agino in xfs_verify_agino(). Fix this by
flush workqueue before clearing the bucket.

Fixes: ab23a77687 ("xfs: per-cpu deferred inode inactivation queues")
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2022-07-14 11:36:36 +10:00
Andrey Strachuk
0f38063d7a xfs: removed useless condition in function xfs_attr_node_get
At line 1561, variable "state" is being compared
with NULL every loop iteration.

-------------------------------------------------------------------
1561	for (i = 0; state != NULL && i < state->path.active; i++) {
1562		xfs_trans_brelse(args->trans, state->path.blk[i].bp);
1563		state->path.blk[i].bp = NULL;
1564	}
-------------------------------------------------------------------

However, it cannot be NULL.

----------------------------------------
1546	state = xfs_da_state_alloc(args);
----------------------------------------

xfs_da_state_alloc calls kmem_cache_zalloc. kmem_cache_zalloc is
called with __GFP_NOFAIL flag and, therefore, it cannot return NULL.

--------------------------------------------------------------------------
	struct xfs_da_state *
	xfs_da_state_alloc(
	struct xfs_da_args	*args)
	{
		struct xfs_da_state	*state;

		state = kmem_cache_zalloc(xfs_da_state_cache, GFP_NOFS | __GFP_NOFAIL);
		state->args = args;
		state->mp = args->dp->i_mount;
		return state;
	}
--------------------------------------------------------------------------

Found by Linux Verification Center (linuxtesting.org) with SVACE.

Signed-off-by: Andrey Strachuk <strochuk@ispras.ru>

Fixes: 4d0cdd2bb8 ("xfs: clean up xfs_attr_node_hasname")
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-07-09 10:56:02 -07:00
Eric Sandeen
70b589a37e xfs: add selinux labels to whiteout inodes
We got a report that "renameat2() with flags=RENAME_WHITEOUT doesn't
apply an SELinux label on xfs" as it does on other filesystems
(for example, ext4 and tmpfs.)  While I'm not quite sure how labels
may interact w/ whiteout files, leaving them as unlabeled seems
inconsistent at best. Now that xfs_init_security is not static,
rename it to xfs_inode_init_security per dchinner's suggestion.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-07-09 10:56:02 -07:00
Darrick J. Wong
fddb564f62 xfs: per-ag conversions for 5.20
This series drives the perag down into the AGI, AGF and AGFL access
 routines and unifies the perag structure initialisation with the
 high level AG header read functions. This largely replaces the
 xfs_mount/agno pair that is passed to all these functions with a
 perag, and in most places we already have a perag ready to pass in.
 There are a few places where perags need to be grabbed before
 reading the AG header buffers - some of these will need to be driven
 to higher layers to ensure we can run operations on AGs without
 getting stuck part way through waiting on a perag reference.
 
 The latter section of this patchset moves some of the AG geometry
 information from the xfs_mount to the xfs_perag, and starts
 converting code that requires geometry validation to use a perag
 instead of a mount and having to extract the AGNO from the object
 location. This also allows us to store the AG size in the perag and
 then we can stop having to compare the agno against sb_agcount to
 determine if the AG is the last AG and so has a runt size.  This
 greatly simplifies some of the type validity checking we do and
 substantially reduces the CPU overhead of type validity checking. It
 also cuts over 1.2kB out of the binary size.
 
 Signed-off-by: Dave Chinner <dchinner@redhat.com>
 -----BEGIN PGP SIGNATURE-----
 
 iQJIBAABCgAyFiEEmJOoJ8GffZYWSjj/regpR/R1+h0FAmLHawsUHGRhdmlkQGZy
 b21vcmJpdC5jb20ACgkQregpR/R1+h09BhAAzsBr2K8yj+2eCwGZO7g4/Cynf/bb
 nbukVogeMkFuUOmxniQzY6F3NUNo9du0FcDgEfbh6mYJtGRVQZaVCBKyPmcMXPOZ
 q/1VRTod20XdMdvPfXvXP3FJoSp1W7/dPIx9Mxl4b5zCdFfeUTTfScl12MAePrdW
 TaJpRBmxP+CZ0/bocAyHL4/2kqY2FNVgbR4vGxxHgqyjfwgQrdmOBetw7xoVxze1
 lK3Iogm0btBd1bkGO83x7DceGl41JGEutx+92gI+/43rzP7Q4BRqXZm5Ik5taWVN
 QLcLgAyj5X91D/e5dmg9NkDvVyzo7QHx0/0O/HOfw5XbzNcVw81se49yUjUS7+VM
 n2LocVCLYsx9/DxqSJxd9lJUXlLtY/YvY7lewNknmeASwtHH8ReOvMPS89L5PJqD
 InPDKay7OVsBkJd9I2yG43Q/MzQTuJuVWmbP5yVoFqR/wX9V8bjf8ng9kkkfVqMj
 1nXnMyCr/41zSwvM12fEkv67ilwsbke3j5jYrO/TcfjAV8xqb6D6HaEx1PCRQpT2
 w1FATNRGDdc2m+ojU4/ETe36KHYO/eivio1oBtqzdoaE13GRJjCqQVErg9nsDSQn
 34zEcShHuhKwn5VLsR6ngZVZOOHSkAjAw4G6XsvoUKpjwtVH0Ueu/bja7/ZwOq7W
 ySka+/95OgHNS64=
 =71q3
 -----END PGP SIGNATURE-----

Merge tag 'xfs-perag-conv-5.20' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs into xfs-5.20-mergeA

xfs: per-ag conversions for 5.20

This series drives the perag down into the AGI, AGF and AGFL access
routines and unifies the perag structure initialisation with the
high level AG header read functions. This largely replaces the
xfs_mount/agno pair that is passed to all these functions with a
perag, and in most places we already have a perag ready to pass in.
There are a few places where perags need to be grabbed before
reading the AG header buffers - some of these will need to be driven
to higher layers to ensure we can run operations on AGs without
getting stuck part way through waiting on a perag reference.

The latter section of this patchset moves some of the AG geometry
information from the xfs_mount to the xfs_perag, and starts
converting code that requires geometry validation to use a perag
instead of a mount and having to extract the AGNO from the object
location. This also allows us to store the AG size in the perag and
then we can stop having to compare the agno against sb_agcount to
determine if the AG is the last AG and so has a runt size.  This
greatly simplifies some of the type validity checking we do and
substantially reduces the CPU overhead of type validity checking. It
also cuts over 1.2kB out of the binary size.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>

* tag 'xfs-perag-conv-5.20' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs:
  xfs: make is_log_ag() a first class helper
  xfs: replace xfs_ag_block_count() with perag accesses
  xfs: Pre-calculate per-AG agino geometry
  xfs: Pre-calculate per-AG agbno geometry
  xfs: pass perag to xfs_alloc_read_agfl
  xfs: pass perag to xfs_alloc_put_freelist
  xfs: pass perag to xfs_alloc_get_freelist
  xfs: pass perag to xfs_read_agf
  xfs: pass perag to xfs_read_agi
  xfs: pass perag to xfs_alloc_read_agf()
  xfs: kill xfs_alloc_pagf_init()
  xfs: pass perag to xfs_ialloc_read_agi()
  xfs: kill xfs_ialloc_pagi_init()
  xfs: make last AG grow/shrink perag centric
2022-07-09 10:55:44 -07:00
Darrick J. Wong
dd81dc0559 xfs: improve CIL scalability
This series aims to improve the scalability of XFS transaction
 commits on large CPU count machines. My 32p machine hits contention
 limits in xlog_cil_commit() at about 700,000 transaction commits a
 section. It hits this at 16 thread workloads, and 32 thread
 workloads go no faster and just burn CPU on the CIL spinlocks.
 
 This patchset gets rid of spinlocks and global serialisation points
 in the xlog_cil_commit() path. It does this by moving to a
 combination of per-cpu counters, unordered per-cpu lists and
 post-ordered per-cpu lists.
 
 This results in transaction commit rates exceeding 1.4 million
 commits/s under unlink certain workloads, and while the log lock
 contention is largely gone there is still significant lock
 contention in the VFS (dentry cache, inode cache and security layers)
 at >600,000 transactions/s that still limit scalability.
 
 The changes to the CIL accounting and behaviour, combined with the
 structural changes to xlog_write() in prior patchsets make the
 per-cpu restructuring possible and sane. This allows us to move to
 precalculated reservation requirements that allow for reservation
 stealing to be accounted across multiple CPUs accurately.
 
 That is, instead of trying to account for continuation log opheaders
 on a "growth" basis, we pre-calculate how many iclogs we'll need to
 write out a maximally sized CIL checkpoint and steal that reserveD
 that space one commit at a time until the CIL has a full
 reservation. If we ever run a commit when we are already at the hard
 limit (because post-throttling) we simply take an extra reservation
 from each commit that is run when over the limit. Hence we don't
 need to do space usage math in the fast path and so never need to
 sum the per-cpu counters in this fast path.
 
 Similarly, per-cpu lists have the problem of ordering - we can't
 remove an item from a per-cpu list if we want to move it forward in
 the CIL. We solve this problem by using an atomic counter to give
 every commit a sequence number that is copied into the log items in
 that transaction. Hence relogging items just overwrites the sequence
 number in the log item, and does not move it in the per-cpu lists.
 Once we reaggregate the per-cpu lists back into a single list in the
 CIL push work, we can run it through list-sort() and reorder it back
 into a globally ordered list. This costs a bit of CPU time, but now
 that the CIL can run multiple works and pipelines properly, this is
 not a limiting factor for performance. It does increase fsync
 latency when the CIL is full, but workloads issuing large numbers of
 fsync()s or sync transactions end up with very small CILs and so the
 latency impact or sorting is not measurable for such workloads.
 
 OVerall, this pushes the transaction commit bottleneck out to the
 lockless reservation grant head updates. These atomic updates don't
 start to be a limiting fact until > 1.5 million transactions/s are
 being run, at which point the accounting functions start to show up
 in profiles as the highest CPU users. Still, this series doubles
 transaction throughput without increasing CPU usage before we get
 to that cacheline contention breakdown point...
 `
 Signed-off-by: Dave Chinner <dchinner@redhat.com>
 -----BEGIN PGP SIGNATURE-----
 
 iQJIBAABCgAyFiEEmJOoJ8GffZYWSjj/regpR/R1+h0FAmLHai8UHGRhdmlkQGZy
 b21vcmJpdC5jb20ACgkQregpR/R1+h3JZQ//bb9HyBiBkeuK9MvqH40hOfazfGXD
 8+pdP9r22qWp9LHhjz/EtH4Wy1sYe6a99mtPxqlsT3DqSl8GiolA1VFn+T3Sadu4
 nqmB/ppzMLE0LLzKoVrb3/Zw+mEaz5Is3WLpr86CpK5gNW6gBHCj4B68lWiBtvjs
 OW5fTm0E44BnNORh/AdSUkJxxEB2OQhVk5omY/Op8vO5frviG5yqYakAeoQ3vFpS
 UKadwlGjei91c63g9se360Re+DXTBhzbgXz0oNV4YbgWba2O9lnut5zqlcJMvVAU
 YgGBxttT0OqCdSNp0vtwOG8UFeUqfWSY+AFwfDkNycltLASvU53efqC94kQHouoh
 9++2VrPwPg0KOcQsvQo5WViQqWrr0+KlsaiTRO/TE0XCGFx4xQKEuhZ6QAnHiiVU
 en34SMqY51qa5D3LSbs6F278rEZNcLQguiH6Urxe5KRmkJDfoxtsWQ/DpV8itbnk
 raCUFlhW8GIBrRvizB7Na+hDWj1/HGQRIEs+xlfqPcFDV9bkECE/IpbD04+JDbil
 wsDoy2IO15oG/rX05/bkXAY7fFuhWbnVAbKrqvl+50w8Oo5w0+X3ZHlqhiLqCzVr
 e/TL5lc+9Ciq4uG8TCwal4HoktYLwqez4qxz396YpE4LN1ax2ICFgR9HyY4GLqmU
 0H1qSxZmOkeueCU=
 =vLZn
 -----END PGP SIGNATURE-----

Merge tag 'xfs-cil-scale-5.20' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs into xfs-5.20-mergeA

xfs: improve CIL scalability

This series aims to improve the scalability of XFS transaction
commits on large CPU count machines. My 32p machine hits contention
limits in xlog_cil_commit() at about 700,000 transaction commits a
section. It hits this at 16 thread workloads, and 32 thread
workloads go no faster and just burn CPU on the CIL spinlocks.

This patchset gets rid of spinlocks and global serialisation points
in the xlog_cil_commit() path. It does this by moving to a
combination of per-cpu counters, unordered per-cpu lists and
post-ordered per-cpu lists.

This results in transaction commit rates exceeding 1.4 million
commits/s under unlink certain workloads, and while the log lock
contention is largely gone there is still significant lock
contention in the VFS (dentry cache, inode cache and security layers)
at >600,000 transactions/s that still limit scalability.

The changes to the CIL accounting and behaviour, combined with the
structural changes to xlog_write() in prior patchsets make the
per-cpu restructuring possible and sane. This allows us to move to
precalculated reservation requirements that allow for reservation
stealing to be accounted across multiple CPUs accurately.

That is, instead of trying to account for continuation log opheaders
on a "growth" basis, we pre-calculate how many iclogs we'll need to
write out a maximally sized CIL checkpoint and steal that reserveD
that space one commit at a time until the CIL has a full
reservation. If we ever run a commit when we are already at the hard
limit (because post-throttling) we simply take an extra reservation
from each commit that is run when over the limit. Hence we don't
need to do space usage math in the fast path and so never need to
sum the per-cpu counters in this fast path.

Similarly, per-cpu lists have the problem of ordering - we can't
remove an item from a per-cpu list if we want to move it forward in
the CIL. We solve this problem by using an atomic counter to give
every commit a sequence number that is copied into the log items in
that transaction. Hence relogging items just overwrites the sequence
number in the log item, and does not move it in the per-cpu lists.
Once we reaggregate the per-cpu lists back into a single list in the
CIL push work, we can run it through list-sort() and reorder it back
into a globally ordered list. This costs a bit of CPU time, but now
that the CIL can run multiple works and pipelines properly, this is
not a limiting factor for performance. It does increase fsync
latency when the CIL is full, but workloads issuing large numbers of
fsync()s or sync transactions end up with very small CILs and so the
latency impact or sorting is not measurable for such workloads.

OVerall, this pushes the transaction commit bottleneck out to the
lockless reservation grant head updates. These atomic updates don't
start to be a limiting fact until > 1.5 million transactions/s are
being run, at which point the accounting functions start to show up
in profiles as the highest CPU users. Still, this series doubles
transaction throughput without increasing CPU usage before we get
to that cacheline contention breakdown point...
`
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>

* tag 'xfs-cil-scale-5.20' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs:
  xfs: expanding delayed logging design with background material
  xfs: xlog_sync() manually adjusts grant head space
  xfs: avoid cil push lock if possible
  xfs: move CIL ordering to the logvec chain
  xfs: convert log vector chain to use list heads
  xfs: convert CIL to unordered per cpu lists
  xfs: Add order IDs to log items in CIL
  xfs: convert CIL busy extents to per-cpu
  xfs: track CIL ticket reservation in percpu structure
  xfs: implement percpu cil space used calculation
  xfs: introduce per-cpu CIL tracking structure
  xfs: rework per-iclog header CIL reservation
  xfs: lift init CIL reservation out of xc_cil_lock
  xfs: use the CIL space used counter for emptiness checks
2022-07-09 10:55:21 -07:00
Dave Chinner
36029dee38 xfs: make is_log_ag() a first class helper
We check if an ag contains the log in many places, so make this
a first class XFS helper by lifting it to fs/xfs/libxfs/xfs_ag.h and
renaming it xfs_ag_contains_log(). The convert all the places that
check if the AG contains the log to use this helper.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2022-07-07 19:13:21 +10:00
Dave Chinner
3829c9a10f xfs: replace xfs_ag_block_count() with perag accesses
Many of the places that call xfs_ag_block_count() have a perag
available. These places can just read pag->block_count directly
instead of calculating the AG block count from first principles.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2022-07-07 19:13:17 +10:00
Dave Chinner
2d6ca8321c xfs: Pre-calculate per-AG agino geometry
There is a lot of overhead in functions like xfs_verify_agino() that
repeatedly calculate the geometry limits of an AG. These can be
pre-calculated as they are static and the verification context has
a per-ag context it can quickly reference.

In the case of xfs_verify_agino(), we now always have a perag
context handy, so we can store the minimum and maximum agino values
in the AG in the perag. This means we don't have to calculate
it on every call and it can be inlined in callers if we move it
to xfs_ag.h.

xfs_verify_agino_or_null() gets the same perag treatment.

xfs_agino_range() is moved to xfs_ag.c as it's not really a type
function, and it's use is largely restricted as the first and last
aginos can be grabbed straight from the perag in most cases.

Note that we leave the original xfs_verify_agino in place in
xfs_types.c as a static function as other callers in that file do
not have per-ag contexts so still need to go the long way. It's been
renamed to xfs_verify_agno_agino() to indicate it takes both an agno
and an agino to differentiate it from new function.

$ size --totals fs/xfs/built-in.a
	   text    data     bss     dec     hex filename
before	1482185	 329588	    572	1812345	 1ba779	(TOTALS)
after	1481937	 329588	    572	1812097	 1ba681	(TOTALS)

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2022-07-07 19:13:10 +10:00
Dave Chinner
0800169e3e xfs: Pre-calculate per-AG agbno geometry
There is a lot of overhead in functions like xfs_verify_agbno() that
repeatedly calculate the geometry limits of an AG. These can be
pre-calculated as they are static and the verification context has
a per-ag context it can quickly reference.

In the case of xfs_verify_agbno(), we now always have a perag
context handy, so we can store the AG length and the minimum valid
block in the AG in the perag. This means we don't have to calculate
it on every call and it can be inlined in callers if we move it
to xfs_ag.h.

Move xfs_ag_block_count() to xfs_ag.c because it's really a
per-ag function and not an XFS type function. We need a little
bit of rework that is specific to xfs_initialise_perag() to allow
growfs to calculate the new perag sizes before we've updated the
primary superblock during the grow (chicken/egg situation).

Note that we leave the original xfs_verify_agbno in place in
xfs_types.c as a static function as other callers in that file do
not have per-ag contexts so still need to go the long way. It's been
renamed to xfs_verify_agno_agbno() to indicate it takes both an agno
and an agbno to differentiate it from new function.

Future commits will make similar changes for other per-ag geometry
validation functions.

Further:

$ size --totals fs/xfs/built-in.a
	   text    data     bss     dec     hex filename
before	1483006	 329588	    572	1813166	 1baaae	(TOTALS)
after	1482185	 329588	    572	1812345	 1ba779	(TOTALS)

This rework reduces the binary size by ~820 bytes, indicating
that much less work is being done to bounds check the agbno values
against on per-ag geometry information.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2022-07-07 19:13:02 +10:00
Dave Chinner
cec7bb7d58 xfs: pass perag to xfs_alloc_read_agfl
We have the perag in most places we call xfs_alloc_read_agfl, so
pass the perag instead of a mount/agno pair.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2022-07-07 19:08:15 +10:00
Dave Chinner
8c392eb27f xfs: pass perag to xfs_alloc_put_freelist
It's available in all callers, so pass it in so that the perag can
be passed further down the stack.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2022-07-07 19:08:08 +10:00
Dave Chinner
49f0d84ec1 xfs: pass perag to xfs_alloc_get_freelist
It's available in all callers, so pass it in so that the perag can
be passed further down the stack.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2022-07-07 19:08:01 +10:00
Dave Chinner
fa044ae70c xfs: pass perag to xfs_read_agf
We have the perag in most places we call xfs_read_agf, so pass the
perag instead of a mount/agno pair.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2022-07-07 19:07:54 +10:00
Dave Chinner
61021deb1f xfs: pass perag to xfs_read_agi
We have the perag in most palces we call xfs_read_agi, so pass the
perag instead of a mount/agno pair.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2022-07-07 19:07:47 +10:00
Dave Chinner
08d3e84fee xfs: pass perag to xfs_alloc_read_agf()
xfs_alloc_read_agf() initialises the perag if it hasn't been done
yet, so it makes sense to pass it the perag rather than pull a
reference from the buffer. This allows callers to be per-ag centric
rather than passing mount/agno pairs everywhere.

Whilst modifying the xfs_reflink_find_shared() function definition,
declare it static and remove the extern declaration as it is an
internal function only these days.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2022-07-07 19:07:40 +10:00
Dave Chinner
76b47e528e xfs: kill xfs_alloc_pagf_init()
Trivial wrapper around xfs_alloc_read_agf(), can be easily replaced
by passing a NULL agfbp to xfs_alloc_read_agf().

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2022-07-07 19:07:32 +10:00
Dave Chinner
99b13c7f0b xfs: pass perag to xfs_ialloc_read_agi()
xfs_ialloc_read_agi() initialises the perag if it hasn't been done
yet, so it makes sense to pass it the perag rather than pull a
reference from the buffer. This allows callers to be per-ag centric
rather than passing mount/agno pairs everywhere.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2022-07-07 19:07:24 +10:00
Dave Chinner
a95fee40e3 xfs: kill xfs_ialloc_pagi_init()
This is just a basic wrapper around xfs_ialloc_read_agi(), which can
be entirely handled by xfs_ialloc_read_agi() by passing a NULL
agibpp....

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2022-07-07 19:07:16 +10:00
Dave Chinner
c6aee24814 xfs: make last AG grow/shrink perag centric
Because the perag must exist for these operations, look it up as
part of the common shrink operations and pass it instead of the
mount/agno pair.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2022-07-07 19:07:09 +10:00
Dave Chinner
d9f68777b2 xfs: xlog_sync() manually adjusts grant head space
When xlog_sync() rounds off the tail the iclog that is being
flushed, it manually subtracts that space from the grant heads. This
space is actually reserved by the transaction ticket that covers
the xlog_sync() call from xlog_write(), but we don't plumb the
ticket down far enough for it to account for the space consumed in
the current log ticket.

The grant heads are hot, so we really should be accounting this to
the ticket is we can, rather than adding thousands of extra grant
head updates every CIL commit.

Interestingly, this actually indicates a potential log space overrun
can occur when we force the log. By the time that xfs_log_force()
pushes out an active iclog and consumes the roundoff space, the
reservation for that roundoff space has been returned to the grant
heads and is no longer covered by a reservation. In theory the
roundoff added to log force on an already full log could push the
write head past the tail. In practice, the CIL commit that writes to
the log and needs the iclog pushed will have reserved space for
roundoff, so when it releases the ticket there will still be
physical space for the roundoff to be committed to the log, even
though it is no longer reserved. This roundoff won't be enough space
to allow a transaction to be woken if the log is full, so overruns
should not actually occur in practice.

That said, it indicates that we should not release the CIL context
log ticket until after we've released the commit iclog. It also
means that xlog_sync() still needs the direct grant head
manipulation if we don't provide it with a ticket. Log forces are
rare when we are in fast paths running 1.5 million transactions/s
that make the grant heads hot, so let's optimise the hot case and
pass CIL log tickets down to the xlog_sync() code.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2022-07-07 18:56:09 +10:00
Dave Chinner
1ccb0745a9 xfs: avoid cil push lock if possible
Because now it hurts when the CIL fills up.

  - 37.20% __xfs_trans_commit
      - 35.84% xfs_log_commit_cil
         - 19.34% _raw_spin_lock
            - do_raw_spin_lock
                 19.01% __pv_queued_spin_lock_slowpath
         - 4.20% xfs_log_ticket_ungrant
              0.90% xfs_log_space_wake


Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2022-07-07 18:56:08 +10:00
Dave Chinner
4eb56069cb xfs: move CIL ordering to the logvec chain
Adding a list_sort() call to the CIL push work while the xc_ctx_lock
is held exclusively has resulted in fairly long lock hold times and
that stops all front end transaction commits from making progress.

We can move the sorting out of the xc_ctx_lock if we can transfer
the ordering information to the log vectors as they are detached
from the log items and then we can sort the log vectors.  With these
changes, we can move the list_sort() call to just before we call
xlog_write() when we aren't holding any locks at all.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2022-07-07 18:56:08 +10:00
Dave Chinner
169248536a xfs: convert log vector chain to use list heads
Because the next change is going to require sorting log vectors, and
that requires arbitrary rearrangement of the list which cannot be
done easily with a single linked list.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2022-07-07 18:55:59 +10:00
Dave Chinner
c0fb4765c5 xfs: convert CIL to unordered per cpu lists
So that we can remove the cil_lock which is a global serialisation
point. We've already got ordering sorted, so all we need to do is
treat the CIL list like the busy extent list and reconstruct it
before the push starts.

This is what we're trying to avoid:

 -   75.35%     1.83%  [kernel]            [k] xfs_log_commit_cil
    - 46.35% xfs_log_commit_cil
       - 41.54% _raw_spin_lock
          - 67.30% do_raw_spin_lock
               66.96% __pv_queued_spin_lock_slowpath

Which happens on a 32p system when running a 32-way 'rm -rf'
workload. After this patch:

-   20.90%     3.23%  [kernel]               [k] xfs_log_commit_cil
   - 17.67% xfs_log_commit_cil
      - 6.51% xfs_log_ticket_ungrant
           1.40% xfs_log_space_wake
        2.32% memcpy_erms
      - 2.18% xfs_buf_item_committing
         - 2.12% xfs_buf_item_release
            - 1.03% xfs_buf_unlock
                 0.96% up
              0.72% xfs_buf_rele
        1.33% xfs_inode_item_format
        1.19% down_read
        0.91% up_read
        0.76% xfs_buf_item_format
      - 0.68% kmem_alloc_large
         - 0.67% kmem_alloc
              0.64% __kmalloc
        0.50% xfs_buf_item_size

It kinda looks like the workload is running out of log space all
the time. But all the spinlock contention is gone and the
transaction commit rate has gone from 800k/s to 1.3M/s so the amount
of real work being done has gone up a *lot*.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2022-07-07 18:54:59 +10:00
Dave Chinner
016a23388c xfs: Add order IDs to log items in CIL
Before we split the ordered CIL up into per cpu lists, we need a
mechanism to track the order of the items in the CIL. We need to do
this because there are rules around the order in which related items
must physically appear in the log even inside a single checkpoint
transaction.

An example of this is intents - an intent must appear in the log
before it's intent done record so that log recovery can cancel the
intent correctly. If we have these two records misordered in the
CIL, then they will not be recovered correctly by journal replay.

We also will not be able to move items to the tail of
the CIL list when they are relogged, hence the log items will need
some mechanism to allow the correct log item order to be recreated
before we write log items to the hournal.

Hence we need to have a mechanism for recording global order of
transactions in the log items  so that we can recover that order
from un-ordered per-cpu lists.

Do this with a simple monotonic increasing commit counter in the CIL
context. Each log item in the transaction gets stamped with the
current commit order ID before it is added to the CIL. If the item
is already in the CIL, leave it where it is instead of moving it to
the tail of the list and instead sort the list before we start the
push work.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2022-07-07 18:53:59 +10:00
Dave Chinner
df7a4a2134 xfs: convert CIL busy extents to per-cpu
To get them out from under the CIL lock.

This is an unordered list, so we can simply punt it to per-cpu lists
during transaction commits and reaggregate it back into a single
list during the CIL push work.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2022-07-07 18:52:59 +10:00
Dave Chinner
1dd2a2c18e xfs: track CIL ticket reservation in percpu structure
To get it out from under the cil spinlock.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2022-07-07 18:51:59 +10:00
Dave Chinner
7c8ade2121 xfs: implement percpu cil space used calculation
Now that we have the CIL percpu structures in place, implement the
space used counter as a per-cpu counter.

We have to be really careful now about ensuring that the checks and
updates run without arbitrary delays, which means they need to run
with pre-emption disabled. We do this by careful placement of
the get_cpu_ptr/put_cpu_ptr calls to access the per-cpu structures
for that CPU.

We need to be able to reliably detect that the CIL has reached
the hard limit threshold so we can take extra reservations for the
iclog headers when the space used overruns the original reservation.
hence we factor out xlog_cil_over_hard_limit() from
xlog_cil_push_background().

The global CIL space used is an atomic variable that is backed by
per-cpu aggregation to minimise the number of atomic updates we do
to the global state in the fast path. While we are under the soft
limit, we aggregate only when the per-cpu aggregation is over the
proportion of the soft limit assigned to that CPU. This means that
all CPUs can use all but one byte of their aggregation threshold
and we will not go over the soft limit.

Hence once we detect that we've gone over both a per-cpu aggregation
threshold and the soft limit, we know that we have only
exceeded the soft limit by one per-cpu aggregation threshold. Even
if all CPUs hit this at the same time, we can't be over the hard
limit, so we can run an aggregation back into the atomic counter
at this point and still be under the hard limit.

At this point, we will be over the soft limit and hence we'll
aggregate into the global atomic used space directly rather than the
per-cpu counters, hence providing accurate detection of hard limit
excursion for accounting and reservation purposes.

Hence we get the best of both worlds - lockless, scalable per-cpu
fast path plus accurate, atomic detection of hard limit excursion.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2022-07-07 18:50:59 +10:00
Linus Torvalds
20855e4cb3 Fixes for 5.19-rc5:
- Fix statfs blocking on background inode gc workers
  - Fix some broken inode lock assertion code
  - Fix xattr leaf buffer leaks when cancelling a deferred xattr update
    operation
  - Clean up xattr recovery to make it easier to understand.
  - Fix xattr leaf block verifiers tripping over empty blocks.
  - Remove complicated and error prone xattr leaf block bholding mess.
  - Fix a bug where an rt extent crossing EOF was treated as "posteof"
    blocks and cleaned unnecessarily.
  - Fix a UAF when log shutdown races with unmount.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAmK/kVMACgkQ+H93GTRK
 tOs0tQ/+PYRhEDKrgocxZGJFNvnxqPRdEDu9k5XCnO2Y/DZRAF52F0JZaPtuiFH4
 12e9vzYYRNrE9KifzPWo4j2L067kFszt4XcAjytJuf5f6k/duX7XbsdMb17Qxd28
 mZDtBBSQCc9fcQo21u5SdZlPaD1SC1843jB4Oe7Sbo3AFvVAMwuBUgnp2TSDA8V0
 0q25PUD0ZvWP3UTQS4M4fW4WhFa5wF+GnLR1DZjryFIzuUp9JwdCQZHIFnp6cHq9
 TZMDJ4WhD9igMSzicRfgPoC8z/D3Mm0cFmRoURbG3GLzAeJ+e7PJ43rvlwq6Ajcv
 v5DhyQvFkiVjKLsrtJyvvUGSpkLL/touNG8MUE9I0heiiwb0QbP108aHWU8AS1Dr
 q7XHIxPaOhvlzVZN1uTuZE4N51/0NWITGKBwF0XU1b5D3wLyvOY6fbI7KLfkX2Sa
 4zHKn4QpHUIE9fs5Na3H6L+ndlJclo2DJA6lF26pLgmrT7NLmJG+r97XagBsp/pr
 X8qOvVMg1XJA37Vy1bTN5cfEYzTTksJk/fQ3AvSKHDCeP5u87kiZ6hqNnW6dD0YF
 D8VTX29rVQr5HavbcGCmAyBZpk4CfclCsWCQrZu9MCnQSW37HnObXPJkIWvzt8Mn
 j6emhPcYHy5TwSChxdpzl733ZX0KdkdOAgkWgqtod2E/7fe+g7Q=
 =8QeL
 -----END PGP SIGNATURE-----

Merge tag 'xfs-5.19-fixes-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs fixes from Darrick Wong:
 "This fixes some stalling problems and corrects the last of the
  problems (I hope) observed during testing of the new atomic xattr
  update feature.

   - Fix statfs blocking on background inode gc workers

   - Fix some broken inode lock assertion code

   - Fix xattr leaf buffer leaks when cancelling a deferred xattr update
     operation

   - Clean up xattr recovery to make it easier to understand.

   - Fix xattr leaf block verifiers tripping over empty blocks.

   - Remove complicated and error prone xattr leaf block bholding mess.

   - Fix a bug where an rt extent crossing EOF was treated as "posteof"
     blocks and cleaned unnecessarily.

   - Fix a UAF when log shutdown races with unmount"

* tag 'xfs-5.19-fixes-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: prevent a UAF when log IO errors race with unmount
  xfs: dont treat rt extents beyond EOF as eofblocks to be cleared
  xfs: don't hold xattr leaf buffers across transaction rolls
  xfs: empty xattr leaf header blocks are not corruption
  xfs: clean up the end of xfs_attri_item_recover
  xfs: always free xattri_leaf_bp when cancelling a deferred op
  xfs: use invalidate_lock to check the state of mmap_lock
  xfs: factor out the common lock flags assert
  xfs: introduce xfs_inodegc_push()
  xfs: bound maximum wait time for inodegc work
2022-07-03 09:42:17 -07:00
Linus Torvalds
69cb6c6556 Notable regression fixes:
- Fix NFSD crash during NFSv4.2 READ_PLUS operation
 - Fix incorrect status code returned by COMMIT operation
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmLAhFIACgkQM2qzM29m
 f5fK2w//TyUAzwoTZQEgmFbxNhXgUYC6uwWPqTCLMieStKtPwT8GsuXviY34QziF
 2vER0NW9Am2GQyL3gtiYFM07OoHhQ4gr86ltaeIHAHhcwm3eIs879ARwEsyN6eDR
 +RDpqnONwtg+yaepfCMc4Bki9Jex+mmoXro86nFPmH+TDM5QiIRY0ncBWSLVWvYT
 YciAgvL6vfo2G79NYOzohoTb15ydotmy6m9H70nN+a2l6bKOIT8cF4S8lZETJZXA
 Rlj+R0eE0iXZTtp7VsAfVAHHfOzexGJjE85hpVzGiZWbxe6o0WNmBpoHICXs9VoP
 WRkq6vBC9P6wJ7EOu84/SRlR/rQaxfrYB7beqTC2kO6t5Ka5/xpJMKZOhfukCMV+
 7+uPAUnSIRqpHG0hWm1kPto00bfXm9fMETQbOEw7UQ9dH/322iOJnXwy03ZEXtJq
 9G0k+yNcydoIs2g5OpNrw4f/wTQcdlbf1ZA5O9dAsxwS1ZTNpKUiE0Sd0Ez/0VIJ
 t5DZFWH44rs/60avxRYxtw965gNugTxb1YiKdXObvbFGf+xYtH0zePce++4kTJlE
 t886stLcHbuPYs4/XItwTxLMSnDM5UR22Icho7ElRHvPiWjZmc6o2fxCkWTzCOE2
 ylZowLFMYZ/KbrP5mcqLARKEP6oFJERXt/U8U0CWeVbZomy9GVY=
 =z5W8
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-5.19-2' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux

Pull nfsd fixes from Chuck Lever:
 "Notable regression fixes:

   - Fix NFSD crash during NFSv4.2 READ_PLUS operation

   - Fix incorrect status code returned by COMMIT operation"

* tag 'nfsd-5.19-2' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
  SUNRPC: Fix READ_PLUS crasher
  NFSD: restore EINVAL error translation in nfsd_commit()
2022-07-02 11:20:56 -07:00
Linus Torvalds
76ff294e16 More NFS Client Bugfixes for Linux 5.19-rc
- Bugfixes:
   - Allocate a fattr for _nfs4_discover_trunking()
   - Fix module reference count leak in nfs4_run_state_manager()
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAmK/NHYACgkQ18tUv7Cl
 QOtbXBAAhivn5bqZvrKz4WI4WjddTRcjyvITiW6m26GZZVgNHz1Sc2Tp6TA6QEL8
 c7OLkj2SoA0SmO4kZX+gCpOzsapQA7FUWULFxpPxJFp/NCsgoxYqO5ZfX0qVpk6t
 1w8lGFqsMve3LdRmcqvbaIrvzJPMdsVvixrZwXRQMe/atvtUMmgHo4pkBPuQ5nv9
 FzWh0KRiohhEWSmncD0fjdYBCq4VOqrUEEn4BTeMoXwvg/noLj5GX3mS14CFH12Y
 +iOqQ05O48Ny8qhzeQ8bRat43t4cZoCpFUcwEPB0CWoNCqS4Qoqvn48Ic+4WMxpN
 nPg2CqkqaG2RUJozSJz8m+GQNbEohoGkruZXJh7TQaqWXrIJGRMBhKhI2b7hujBG
 meu0ypETzlbofjleCpevfvnNStoRTuakssMpcU/hfKjnfNIsHnADSYey0HWHWTAH
 ZaBT6N5Z3C6hRWz7wXIn3uTuWadZbDC9+HGvvyMpuP+PDQFM9exJfhoO0JQvQc/G
 dPB7SyINVj2Z9gguJaew4csRoSxZqxeLB28XHQ7yyYsmo7lbUWaeUgGGpynrrsZL
 Ysuh7JVoZyhSSdNb3b2vlnI7zK+F1qIkzg0/u2ZN8CLPfMaBCmPL031oFudpWJ2n
 TLnxoFHhsG2MTwua3CaOfk8QvNc2dT7ZsvxXHl7HxBlxGtBgSN4=
 =OYKI
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-5.19-3' of git://git.linux-nfs.org/projects/anna/linux-nfs

Pull NFS client fixes from Anna Schumaker:

 - Allocate a fattr for _nfs4_discover_trunking()

 - Fix module reference count leak in nfs4_run_state_manager()

* tag 'nfs-for-5.19-3' of git://git.linux-nfs.org/projects/anna/linux-nfs:
  NFSv4: Add an fattr allocation to _nfs4_discover_trunking()
  NFS: restore module put when manager exits.
2022-07-01 11:11:32 -07:00
Linus Torvalds
6f8693ea2b A filesystem fix, marked for stable. There appears to be a deeper
issue on the MDS side, but for now we are going with this one-liner
 to avoid busy looping and potential soft lockups.
 -----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAmK/C8MTHGlkcnlvbW92
 QGdtYWlsLmNvbQAKCRBKf944AhHzi1cEB/9CiJoDsc1v+DrP/4Ud/AbI4LMffMcr
 tkHmUo8ZT5D4feUzSFE6iKgb3gRCJUkYKzesywQ7Xhv7Mr6/DKB4+t9QtrympZFd
 sAg775mHkL0NI6/OLnLSRva/r627PFk6f1v8OWENOjsw01PLOtWAB/B5FqlgN8tG
 EQLfX0G83o4AXt4NcPCcsucPh7FxC2iKe8XWqAE6VTjkKnyz3IQHvSLweWV68U8R
 ht6eun8H+slx8Kw1lSZfW/XoFGFO4uKntCh/CKKH28ZqaXrxrdsfmXSVOMlOi351
 qxPfrTPgaSfvWQLbYQfPdQZCsfyyPgP2wdAVfpy56vk0yoxi2TLGBPsD
 =bu9O
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-5.19-rc5' of https://github.com/ceph/ceph-client

Pull ceph fix from Ilya Dryomov:
 "A ceph filesystem fix, marked for stable.

  There appears to be a deeper issue on the MDS side, but for now we are
  going with this one-liner to avoid busy looping and potential soft
  lockups"

* tag 'ceph-for-5.19-rc5' of https://github.com/ceph/ceph-client:
  ceph: wait on async create before checking caps for syncfs
2022-07-01 11:06:21 -07:00
Linus Torvalds
0a35d1622d io_uring-5.19-2022-07-01
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmK+6CoQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpsZPD/9xPZTAJhX3/HNTjbi+FlSvTaJ/4rll98No
 1pzW+nZyBVr4yesnHW2qtLwLRaYMNAFjdJmakn1BIUau4IT4Eqhb8NEz4ZCKnDD2
 Kwi0q/9c0I/GxTnVXmwXPQzQkZarYLa8cppQr1L/L3el1xTU9qXUdpR7+vxPKi4J
 ADDP+7buRYp7Td2RfBD2lD4B7jNMpZYVC/2/Y3fixkuJvK4eYKuf+5K7zgmbahm5
 YOm86k3P7QN7saTxUeyUrwR/G6CoY99Dd54KadQAS4XkU1f6XuNjF6IsYjPUEZ1B
 pKlhK4mhGieMlW8yBti0BdJLLTAHVsL9Pa0Aqsv1EdZ3x/Mfp9kmwig9RAGREyQX
 gNs316VgsfnZb+AdImZ9EItRnPZ/1Z0//VOWiDy7CijKABCZCSFXqOwQ+Yonyfab
 ZoVXlwlvOaxmiQAWhJe2XKxzRtAfeQgyirmF95N+c/wtIH6dWzJeIs2xFLPIKCaY
 tkv5Ah4IBGxofJj1SNqKNRUcv6N/Hr7zs/p6yTQpVEoUzsKqzh1eNz8PDA3ewrq4
 C6nkXnZfidyqPuUZJIfOa02N/cPLUSclxdll6pHQfIMiwLBlV60pFcSsylgdYTE+
 XT/iwiiaSTPUUIkCTYhyoUpfZnNX6IoVpxKOuh5gLOmTz/+xlRfcRjcjuXIoneHQ
 D9qlUWbYLA==
 =Edge
 -----END PGP SIGNATURE-----

Merge tag 'io_uring-5.19-2022-07-01' of git://git.kernel.dk/linux-block

Pull io_uring fixes from Jens Axboe:
 "Two minor tweaks:

   - While we still can, adjust the send/recv based flags to be in
     ->ioprio rather than in ->addr2. This is consistent with eg accept,
     and also doesn't waste a full 64-bit field for flags (Pavel)

   - 5.18-stable fix for re-importing provided buffers. Not much real
     world relevance here as it'll only impact non-pollable files gone
     async, which is more of a practical test case rather than something
     that is used in the wild (Dylan)"

* tag 'io_uring-5.19-2022-07-01' of git://git.kernel.dk/linux-block:
  io_uring: fix provided buffer import
  io_uring: keep sendrecv flags in ioprio
2022-07-01 10:52:01 -07:00
Dave Chinner
af1c2146a5 xfs: introduce per-cpu CIL tracking structure
The CIL push lock is highly contended on larger machines, becoming a
hard bottleneck that about 700,000 transaction commits/s on >16p
machines. To address this, start moving the CIL tracking
infrastructure to utilise per-CPU structures.

We need to track the space used, the amount of log reservation space
reserved to write the CIL, the log items in the CIL and the busy
extents that need to be completed by the CIL commit.  This requires
a couple of per-cpu counters, an unordered per-cpu list and a
globally ordered per-cpu list.

Create a per-cpu structure to hold these and all the management
interfaces needed, as well as the hooks to handle hotplug CPUs.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2022-07-02 02:13:52 +10:00
Dave Chinner
31151cc342 xfs: rework per-iclog header CIL reservation
For every iclog that a CIL push will use up, we need to ensure we
have space reserved for the iclog header in each iclog. It is
extremely difficult to do this accurately with a per-cpu counter
without expensive summing of the counter in every commit. However,
we know what the maximum CIL size is going to be because of the
hard space limit we have, and hence we know exactly how many iclogs
we are going to need to write out the CIL.

We are constrained by the requirement that small transactions only
have reservation space for a single iclog header built into them.
At commit time we don't know how much of the current transaction
reservation is made up of iclog header reservations as calculated by
xfs_log_calc_unit_res() when the ticket was reserved. As larger
reservations have multiple header spaces reserved, we can steal
more than one iclog header reservation at a time, but we only steal
the exact number needed for the given log vector size delta.

As a result, we don't know exactly when we are going to steal iclog
header reservations, nor do we know exactly how many we are going to
need for a given CIL.

To make things simple, start by calculating the worst case number of
iclog headers a full CIL push will require. Record this into an
atomic variable in the CIL. Then add a byte counter to the log
ticket that records exactly how much iclog header space has been
reserved in this ticket by xfs_log_calc_unit_res(). This tells us
exactly how much space we can steal from the ticket at transaction
commit time.

Now, at transaction commit time, we can check if the CIL has a full
iclog header reservation and, if not, steal the entire reservation
the current ticket holds for iclog headers. This minimises the
number of times we need to do atomic operations in the fast path,
but still guarantees we get all the reservations we need.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2022-07-02 02:12:52 +10:00
Dave Chinner
12380d237b xfs: lift init CIL reservation out of xc_cil_lock
The xc_cil_lock is the most highly contended lock in XFS now. To
start the process of getting rid of it, lift the initial reservation
of the CIL log space out from under the xc_cil_lock.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2022-07-02 02:11:52 +10:00
Dave Chinner
88591e7f06 xfs: use the CIL space used counter for emptiness checks
In the next patches we are going to make the CIL list itself
per-cpu, and so we cannot use list_empty() to check is the list is
empty. Replace the list_empty() checks with a flag in the CIL to
indicate we have committed at least one transaction to the CIL and
hence the CIL is not empty.

We need this flag to be an atomic so that we can clear it without
holding any locks in the commit fast path, but we also need to be
careful to avoid atomic operations in the fast path. Hence we use
the fact that test_bit() is not an atomic op to first check if the
flag is set and then run the atomic test_and_clear_bit() operation
to clear it and steal the initial unit reservation for the CIL
context checkpoint.

When we are switching to a new context in a push, we place the
setting of the XLOG_CIL_EMPTY flag under the xc_push_lock. THis
allows all the other places that need to check whether the CIL is
empty to use test_bit() and still be serialised correctly with the
CIL context swaps that set the bit.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2022-07-02 02:10:52 +10:00
Darrick J. Wong
7561cea5db xfs: prevent a UAF when log IO errors race with unmount
KASAN reported the following use after free bug when running
generic/475:

 XFS (dm-0): Mounting V5 Filesystem
 XFS (dm-0): Starting recovery (logdev: internal)
 XFS (dm-0): Ending recovery (logdev: internal)
 Buffer I/O error on dev dm-0, logical block 20639616, async page read
 Buffer I/O error on dev dm-0, logical block 20639617, async page read
 XFS (dm-0): log I/O error -5
 XFS (dm-0): Filesystem has been shut down due to log error (0x2).
 XFS (dm-0): Unmounting Filesystem
 XFS (dm-0): Please unmount the filesystem and rectify the problem(s).
 ==================================================================
 BUG: KASAN: use-after-free in do_raw_spin_lock+0x246/0x270
 Read of size 4 at addr ffff888109dd84c4 by task 3:1H/136

 CPU: 3 PID: 136 Comm: 3:1H Not tainted 5.19.0-rc4-xfsx #rc4 8e53ab5ad0fddeb31cee5e7063ff9c361915a9c4
 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1 04/01/2014
 Workqueue: xfs-log/dm-0 xlog_ioend_work [xfs]
 Call Trace:
  <TASK>
  dump_stack_lvl+0x34/0x44
  print_report.cold+0x2b8/0x661
  ? do_raw_spin_lock+0x246/0x270
  kasan_report+0xab/0x120
  ? do_raw_spin_lock+0x246/0x270
  do_raw_spin_lock+0x246/0x270
  ? rwlock_bug.part.0+0x90/0x90
  xlog_force_shutdown+0xf6/0x370 [xfs 4ad76ae0d6add7e8183a553e624c31e9ed567318]
  xlog_ioend_work+0x100/0x190 [xfs 4ad76ae0d6add7e8183a553e624c31e9ed567318]
  process_one_work+0x672/0x1040
  worker_thread+0x59b/0xec0
  ? __kthread_parkme+0xc6/0x1f0
  ? process_one_work+0x1040/0x1040
  ? process_one_work+0x1040/0x1040
  kthread+0x29e/0x340
  ? kthread_complete_and_exit+0x20/0x20
  ret_from_fork+0x1f/0x30
  </TASK>

 Allocated by task 154099:
  kasan_save_stack+0x1e/0x40
  __kasan_kmalloc+0x81/0xa0
  kmem_alloc+0x8d/0x2e0 [xfs]
  xlog_cil_init+0x1f/0x540 [xfs]
  xlog_alloc_log+0xd1e/0x1260 [xfs]
  xfs_log_mount+0xba/0x640 [xfs]
  xfs_mountfs+0xf2b/0x1d00 [xfs]
  xfs_fs_fill_super+0x10af/0x1910 [xfs]
  get_tree_bdev+0x383/0x670
  vfs_get_tree+0x7d/0x240
  path_mount+0xdb7/0x1890
  __x64_sys_mount+0x1fa/0x270
  do_syscall_64+0x2b/0x80
  entry_SYSCALL_64_after_hwframe+0x46/0xb0

 Freed by task 154151:
  kasan_save_stack+0x1e/0x40
  kasan_set_track+0x21/0x30
  kasan_set_free_info+0x20/0x30
  ____kasan_slab_free+0x110/0x190
  slab_free_freelist_hook+0xab/0x180
  kfree+0xbc/0x310
  xlog_dealloc_log+0x1b/0x2b0 [xfs]
  xfs_unmountfs+0x119/0x200 [xfs]
  xfs_fs_put_super+0x6e/0x2e0 [xfs]
  generic_shutdown_super+0x12b/0x3a0
  kill_block_super+0x95/0xd0
  deactivate_locked_super+0x80/0x130
  cleanup_mnt+0x329/0x4d0
  task_work_run+0xc5/0x160
  exit_to_user_mode_prepare+0xd4/0xe0
  syscall_exit_to_user_mode+0x1d/0x40
  entry_SYSCALL_64_after_hwframe+0x46/0xb0

This appears to be a race between the unmount process, which frees the
CIL and waits for in-flight iclog IO; and the iclog IO completion.  When
generic/475 runs, it starts fsstress in the background, waits a few
seconds, and substitutes a dm-error device to simulate a disk falling
out of a machine.  If the fsstress encounters EIO on a pure data write,
it will exit but the filesystem will still be online.

The next thing the test does is unmount the filesystem, which tries to
clean the log, free the CIL, and wait for iclog IO completion.  If an
iclog was being written when the dm-error switch occurred, it can race
with log unmounting as follows:

Thread 1				Thread 2

					xfs_log_unmount
					xfs_log_clean
					xfs_log_quiesce
xlog_ioend_work
<observe error>
xlog_force_shutdown
test_and_set_bit(XLOG_IOERROR)
					xfs_log_force
					<log is shut down, nop>
					xfs_log_umount_write
					<log is shut down, nop>
					xlog_dealloc_log
					xlog_cil_destroy
					<wait for iclogs>
spin_lock(&log->l_cilp->xc_push_lock)
<KABOOM>

Therefore, free the CIL after waiting for the iclogs to complete.  I
/think/ this race has existed for quite a few years now, though I don't
remember the ~2014 era logging code well enough to know if it was a real
threat then or if the actual race was exposed only more recently.

Fixes: ac983517ec ("xfs: don't sleep in xlog_cil_force_lsn on shutdown")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2022-07-01 09:09:52 -07:00
Amir Goldstein
868f9f2f8e vfs: fix copy_file_range() regression in cross-fs copies
A regression has been reported by Nicolas Boichat, found while using the
copy_file_range syscall to copy a tracefs file.

Before commit 5dae222a5f ("vfs: allow copy_file_range to copy across
devices") the kernel would return -EXDEV to userspace when trying to
copy a file across different filesystems.  After this commit, the
syscall doesn't fail anymore and instead returns zero (zero bytes
copied), as this file's content is generated on-the-fly and thus reports
a size of zero.

Another regression has been reported by He Zhe - the assertion of
WARN_ON_ONCE(ret == -EOPNOTSUPP) can be triggered from userspace when
copying from a sysfs file whose read operation may return -EOPNOTSUPP.

Since we do not have test coverage for copy_file_range() between any two
types of filesystems, the best way to avoid these sort of issues in the
future is for the kernel to be more picky about filesystems that are
allowed to do copy_file_range().

This patch restores some cross-filesystem copy restrictions that existed
prior to commit 5dae222a5f ("vfs: allow copy_file_range to copy across
devices"), namely, cross-sb copy is not allowed for filesystems that do
not implement ->copy_file_range().

Filesystems that do implement ->copy_file_range() have full control of
the result - if this method returns an error, the error is returned to
the user.  Before this change this was only true for fs that did not
implement the ->remap_file_range() operation (i.e.  nfsv3).

Filesystems that do not implement ->copy_file_range() still fall-back to
the generic_copy_file_range() implementation when the copy is within the
same sb.  This helps the kernel can maintain a more consistent story
about which filesystems support copy_file_range().

nfsd and ksmbd servers are modified to fall-back to the
generic_copy_file_range() implementation in case vfs_copy_file_range()
fails with -EOPNOTSUPP or -EXDEV, which preserves behavior of
server-side-copy.

fall-back to generic_copy_file_range() is not implemented for the smb
operation FSCTL_DUPLICATE_EXTENTS_TO_FILE, which is arguably a correct
change of behavior.

Fixes: 5dae222a5f ("vfs: allow copy_file_range to copy across devices")
Link: https://lore.kernel.org/linux-fsdevel/20210212044405.4120619-1-drinkcat@chromium.org/
Link: https://lore.kernel.org/linux-fsdevel/CANMq1KDZuxir2LM5jOTm0xx+BnvW=ZmpsG47CyHFJwnw7zSX6Q@mail.gmail.com/
Link: https://lore.kernel.org/linux-fsdevel/20210126135012.1.If45b7cdc3ff707bc1efa17f5366057d60603c45f@changeid/
Link: https://lore.kernel.org/linux-fsdevel/20210630161320.29006-1-lhenriques@suse.de/
Reported-by: Nicolas Boichat <drinkcat@chromium.org>
Reported-by: kernel test robot <oliver.sang@intel.com>
Signed-off-by: Luis Henriques <lhenriques@suse.de>
Fixes: 64bf5ff58d ("vfs: no fallback for ->copy_file_range")
Link: https://lore.kernel.org/linux-fsdevel/20f17f64-88cb-4e80-07c1-85cb96c83619@windriver.com/
Reported-by: He Zhe <zhe.he@windriver.com>
Tested-by: Namjae Jeon <linkinjeon@kernel.org>
Tested-by: Luis Henriques <lhenriques@suse.de>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-06-30 15:16:38 -07:00
Scott Mayhew
4f40a5b554 NFSv4: Add an fattr allocation to _nfs4_discover_trunking()
This was missed in c3ed222745 ("NFSv4: Fix free of uninitialized
nfs4_label on referral lookup.") and causes a panic when mounting
with '-o trunkdiscovery':

PID: 1604   TASK: ffff93dac3520000  CPU: 3   COMMAND: "mount.nfs"
 #0 [ffffb79140f738f8] machine_kexec at ffffffffaec64bee
 #1 [ffffb79140f73950] __crash_kexec at ffffffffaeda67fd
 #2 [ffffb79140f73a18] crash_kexec at ffffffffaeda76ed
 #3 [ffffb79140f73a30] oops_end at ffffffffaec2658d
 #4 [ffffb79140f73a50] general_protection at ffffffffaf60111e
    [exception RIP: nfs_fattr_init+0x5]
    RIP: ffffffffc0c18265  RSP: ffffb79140f73b08  RFLAGS: 00010246
    RAX: 0000000000000000  RBX: ffff93dac304a800  RCX: 0000000000000000
    RDX: ffffb79140f73bb0  RSI: ffff93dadc8cbb40  RDI: d03ee11cfaf6bd50
    RBP: ffffb79140f73be8   R8: ffffffffc0691560   R9: 0000000000000006
    R10: ffff93db3ffd3df8  R11: 0000000000000000  R12: ffff93dac4040000
    R13: ffff93dac2848e00  R14: ffffb79140f73b60  R15: ffffb79140f73b30
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #5 [ffffb79140f73b08] _nfs41_proc_get_locations at ffffffffc0c73d53 [nfsv4]
 #6 [ffffb79140f73bf0] nfs4_proc_get_locations at ffffffffc0c83e90 [nfsv4]
 #7 [ffffb79140f73c60] nfs4_discover_trunking at ffffffffc0c83fb7 [nfsv4]
 #8 [ffffb79140f73cd8] nfs_probe_fsinfo at ffffffffc0c0f95f [nfs]
 #9 [ffffb79140f73da0] nfs_probe_server at ffffffffc0c1026a [nfs]
    RIP: 00007f6254fce26e  RSP: 00007ffc69496ac8  RFLAGS: 00000246
    RAX: ffffffffffffffda  RBX: 0000000000000000  RCX: 00007f6254fce26e
    RDX: 00005600220a82a0  RSI: 00005600220a64d0  RDI: 00005600220a6520
    RBP: 00007ffc69496c50   R8: 00005600220a8710   R9: 003035322e323231
    R10: 0000000000000000  R11: 0000000000000246  R12: 00007ffc69496c50
    R13: 00005600220a8440  R14: 0000000000000010  R15: 0000560020650ef9
    ORIG_RAX: 00000000000000a5  CS: 0033  SS: 002b

Fixes: c3ed222745 ("NFSv4: Fix free of uninitialized nfs4_label on referral lookup.")
Signed-off-by: Scott Mayhew <smayhew@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2022-06-30 16:13:00 -04:00