commit 7f8a44f372 upstream.
For a sparse inodes filesystem, mkfs.xfs computes the values of
sb_spino_align and sb_inoalignmt with the following code:
int cluster_size = XFS_INODE_BIG_CLUSTER_SIZE;
if (cfg->sb_feat.crcs_enabled)
cluster_size *= cfg->inodesize / XFS_DINODE_MIN_SIZE;
sbp->sb_spino_align = cluster_size >> cfg->blocklog;
sbp->sb_inoalignmt = XFS_INODES_PER_CHUNK *
cfg->inodesize >> cfg->blocklog;
On a V5 filesystem with 64k fsblocks and 512 byte inodes, this results
in cluster_size = 8192 * (512 / 256) = 16384. As a result,
sb_spino_align and sb_inoalignmt are both set to zero. Unfortunately,
this trips the new sb_spino_align check that was just added to
xfs_validate_sb_common, and the mkfs fails:
# mkfs.xfs -f -b size=64k, /dev/sda
meta-data=/dev/sda isize=512 agcount=4, agsize=81136 blks
= sectsz=512 attr=2, projid32bit=1
= crc=1 finobt=1, sparse=1, rmapbt=1
= reflink=1 bigtime=1 inobtcount=1 nrext64=1
= exchange=0 metadir=0
data = bsize=65536 blocks=324544, imaxpct=25
= sunit=0 swidth=0 blks
naming =version 2 bsize=65536 ascii-ci=0, ftype=1, parent=0
log =internal log bsize=65536 blocks=5006, version=2
= sectsz=512 sunit=0 blks, lazy-count=1
realtime =none extsz=65536 blocks=0, rtextents=0
= rgcount=0 rgsize=0 extents
Discarding blocks...Sparse inode alignment (0) is invalid.
Metadata corruption detected at 0x560ac5a80bbe, xfs_sb block 0x0/0x200
libxfs_bwrite: write verifier failed on xfs_sb bno 0x0/0x1
mkfs.xfs: Releasing dirty buffer to free list!
found dirty buffer (bulk) on free list!
Sparse inode alignment (0) is invalid.
Metadata corruption detected at 0x560ac5a80bbe, xfs_sb block 0x0/0x200
libxfs_bwrite: write verifier failed on xfs_sb bno 0x0/0x1
mkfs.xfs: writing AG headers failed, err=22
Prior to commit 59e43f5479 this all worked fine, even if "sparse"
inodes are somewhat meaningless when everything fits in a single
fsblock. Adjust the checks to handle existing filesystems.
Cc: <stable@vger.kernel.org> # v6.13-rc1
Fixes: 59e43f5479 ("xfs: sb_spino_align is not verified")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit a440a28ddb upstream.
In commit ca6448aed4, we created an "end_daddr" variable to fix
fsmap reporting when the end of the range requested falls in the middle
of an unknown (aka free on the rmapbt) region. Unfortunately, I didn't
notice that the the code sets end_daddr to the last sector of the device
but then uses that quantity to compute the length of the synthesized
mapping.
Zizhi Wo later observed that when end_daddr isn't set, we still don't
report the last fsblock on a device because in that case (aka when
info->last is true), the info->high mapping that we pass to
xfs_getfsmap_group_helper has a startblock that points to the last
fsblock. This is also wrong because the code uses startblock to
compute the length of the synthesized mapping.
Fix the second problem by setting end_daddr unconditionally, and fix the
first problem by setting start_daddr to one past the end of the range to
query.
Cc: <stable@vger.kernel.org> # v6.11
Fixes: ca6448aed4 ("xfs: Fix missing interval for missing_owner in xfs fsmap")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reported-by: Zizhi Wo <wozizhi@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 1332533358 upstream.
The runt AG at the end of a filesystem is almost always smaller than
the mp->m_sb.sb_agblocks. Unfortunately, when setting the max_agbno
limit for the inode chunk allocation, we do not take this into
account. This means we can allocate a sparse inode chunk that
overlaps beyond the end of an AG. When we go to allocate an inode
from that sparse chunk, the irec fails validation because the
agbno of the start of the irec is beyond valid limits for the runt
AG.
Prevent this from happening by taking into account the size of the
runt AG when allocating inode chunks. Also convert the various
checks for valid inode chunk agbnos to use xfs_ag_block_count()
so that they will also catch such issues in the future.
Fixes: 56d1115c9b ("xfs: allocate sparse inode chunks on full chunk allocation failure")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
[djwong: backport to stable because upstream maintainer ignored cc-stable]
Link: https://lore.kernel.org/linux-xfs/20241112231539.GG9438@frogsfrogsfrogs/
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 59e43f5479 upstream.
It's just read in from the superblock and used without doing any
validity checks at all on the value.
Fixes: fb4f2b4e5a ("xfs: add sparse inode chunk alignment superblock field")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
[djwong: actually tag for 6.12 because upstream maintainer ignored cc-stable tag]
Link: https://lore.kernel.org/linux-xfs/20241024165544.GI21853@frogsfrogsfrogs/
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 6422cde1b0 ]
For many use cases (e.g. container images are just fetched from remote),
performance will be impacted if underlay page cache is up-to-date but
direct i/o flushes dirty pages first.
Instead, let's use buffered I/O by default to keep in sync with loop
devices and add a (re)mount option to explicitly give a try to use
direct I/O if supported by the underlying files.
The container startup time is improved as below:
[workload] docker.io/library/workpress:latest
unpack 1st run non-1st runs
EROFS snapshotter buffered I/O file 4.586404265s 0.308s 0.198s
EROFS snapshotter direct I/O file 4.581742849s 2.238s 0.222s
EROFS snapshotter loop 4.596023152s 0.346s 0.201s
Overlayfs snapshotter 5.382851037s 0.206s 0.214s
Fixes: fb17675026 ("erofs: add file-backed mount support")
Cc: Derek McGowan <derek@mcg.dev>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20241212134336.2059899-1-hsiangkao@linux.alibaba.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit f8d920a402 ]
Record `m_sb` and `m_dif` to replace `m_fscache`, `m_daxdev`, `m_fp`
and `m_dax_part_off` in order to simplify the codebase.
Note that `m_bdev` is still left since it can be assigned from
`sb->s_bdev` directly.
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20241212235401.2857246-1-hsiangkao@linux.alibaba.com
Stable-dep-of: 6422cde1b0 ("erofs: use buffered I/O for file-backed mounts by default")
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 7b00af2c54 ]
Instead of just listing each one directly in `struct erofs_sb_info`
except that we still use `sb->s_bdev` for the primary block device.
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20241216125310.930933-2-hsiangkao@linux.alibaba.com
Stable-dep-of: 6422cde1b0 ("erofs: use buffered I/O for file-backed mounts by default")
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit e2de3c1bf6 ]
Unify the common parts of erofs_fc_free() and erofs_kill_sb() as
erofs_sb_free().
Thus, fput() in erofs_fc_get_tree() is no longer needed, too.
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20241212133504.2047178-1-hsiangkao@linux.alibaba.com
Stable-dep-of: 6422cde1b0 ("erofs: use buffered I/O for file-backed mounts by default")
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit bb57c81e97 ]
The cifs_io_request struct (a wrapper around netfs_io_request) holds open
the file on the server, even beyond the local Linux file being closed.
This can cause problems with Windows-based filesystems as the file's name
still exists after deletion until the file is closed, preventing the parent
directory from being removed and causing spurious test failures in xfstests
due to inability to remove a directory. The symptom looks something like
this in the test output:
rm: cannot remove '/mnt/scratch/test/p0/d3': Directory not empty
rm: cannot remove '/mnt/scratch/test/p1/dc/dae': Directory not empty
Fix this by waiting in unlink and rename for any outstanding I/O requests
to be completed on the target file before removing that file.
Note that this doesn't prevent Linux from trying to start new requests
after deletion if it still has the file open locally - something that's
perfectly acceptable on a UNIX system.
Note also that whilst I've marked this as fixing the commit to make cifs
use netfslib, I don't know that it won't occur before that.
Fixes: 3ee1a1fc39 ("cifs: Cut over to using netfslib")
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Paulo Alcantara (Red Hat) <pc@manguebit.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-cifs@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 53b001a21c upstream.
Debugging a filesystem patch with generic/475 caused the system to hang
after observing the following sequences in dmesg:
XFS (dm-0): metadata I/O error in "xfs_imap_to_bp+0x61/0xe0 [xfs]" at daddr 0x491520 len 32 error 5
XFS (dm-0): metadata I/O error in "xfs_btree_read_buf_block+0xba/0x160 [xfs]" at daddr 0x3445608 len 8 error 5
XFS (dm-0): metadata I/O error in "xfs_imap_to_bp+0x61/0xe0 [xfs]" at daddr 0x138e1c0 len 32 error 5
XFS (dm-0): log I/O error -5
XFS (dm-0): Metadata I/O Error (0x1) detected at xfs_trans_read_buf_map+0x1ea/0x4b0 [xfs] (fs/xfs/xfs_trans_buf.c:311). Shutting down filesystem.
XFS (dm-0): Please unmount the filesystem and rectify the problem(s)
XFS (dm-0): Internal error dqp->q_ino.reserved < dqp->q_ino.count at line 869 of file fs/xfs/xfs_trans_dquot.c. Caller xfs_trans_dqresv+0x236/0x440 [xfs]
XFS (dm-0): Corruption detected. Unmount and run xfs_repair
XFS (dm-0): Unmounting Filesystem be6bcbcc-9921-4deb-8d16-7cc94e335fa7
The system is stuck in unmount trying to lock a couple of inodes so that
they can be purged. The dquot corruption notice above is a clue to what
happened -- a link() call tried to set up a transaction to link a child
into a directory. Quota reservation for the transaction failed after IO
errors shut down the filesystem, but then we forgot to unlock the inodes
on our way out. Fix that.
Cc: <stable@vger.kernel.org> # v6.10
Fixes: bd5562111d ("xfs: Hold inode locks in xfs_trans_alloc_dir")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 44d9b07e52 upstream.
Committing a transaction tx0 with a defer ops chain of (A, B, C)
creates a chain of transactions that looks like this:
tx0 -> txA -> txB -> txC
Prior to commit cb04211748, __xfs_trans_commit would run precommits
on tx0, then call xfs_defer_finish_noroll to convert A-C to tx[A-C].
Unfortunately, after the finish_noroll loop we forgot to run precommits
on txC. That was fixed by adding the second precommit call.
Unfortunately, none of us remembered that xfs_defer_finish_noroll
calls __xfs_trans_commit a second time to commit tx0 before finishing
work A in txA and committing that. In other words, we run precommits
twice on tx0:
xfs_trans_commit(tx0)
__xfs_trans_commit(tx0, false)
xfs_trans_run_precommits(tx0)
xfs_defer_finish_noroll(tx0)
xfs_trans_roll(tx0)
txA = xfs_trans_dup(tx0)
__xfs_trans_commit(tx0, true)
xfs_trans_run_precommits(tx0)
This currently isn't an issue because the inode item precommit is
idempotent; the iunlink item precommit deletes itself so it can't be
called again; and the buffer/dquot item precommits only check the incore
objects for corruption. However, it doesn't make sense to run
precommits twice.
Fix this situation by only running precommits after finish_noroll.
Cc: <stable@vger.kernel.org> # v6.4
Fixes: cb04211748 ("xfs: defered work could create precommits")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit ffc3ea4f3c upstream.
Fix a minor mistakes in the scrub tracepoints that can manifest when
inode-rooted btrees are enabled. The existing code worked fine for bmap
btrees, but we should tighten the code up to be less sloppy.
Cc: <stable@vger.kernel.org> # v5.7
Fixes: 92219c292a ("xfs: convert btree cursor inode-private member names")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 7f8b718c58 upstream.
V4 symlink blocks didn't have headers, so return early if this is a V4
filesystem.
Cc: <stable@vger.kernel.org> # v5.1
Fixes: 39708c20ab ("xfs: miscellaneous verifier magic value fixups")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit af9f02457f upstream.
xfs_bmap_rtalloc initializes the bno_hint variable to NULLRTBLOCK (aka
NULLFSBLOCK). If the allocation request is for a file range that's
adjacent to an existing mapping, it will then change bno_hint to the
blkno hint in the bmalloca structure.
In other words, bno_hint is either a rt block number, or it's all 1s.
Unfortunately, commit ec12f97f1b didn't take the NULLRTBLOCK state
into account, which means that it tries to translate that into a
realtime extent number. We then end up with an obnoxiously high rtx
number and pointlessly feed that to the near allocator. This often
fails and falls back to the by-size allocator. Seeing as we had no
locality hint anyway, this is a waste of time.
Fix the code to detect a lack of bno_hint correctly. This was detected
by running xfs/009 with metadir enabled and a 28k rt extent size.
Cc: <stable@vger.kernel.org> # v6.12
Fixes: ec12f97f1b ("xfs: make the rtalloc start hint a xfs_rtblock_t")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit bd27c7bcdc upstream.
With the nrext64 feature enabled, it's possible for a data fork to have
2^48 extent mappings. Even with a 64k fsblock size, that maps out to
a bmbt containing more than 2^32 blocks. Therefore, this predicate must
return a u64 count to avoid an integer wraparound that will cause scrub
to do the wrong thing.
It's unlikely that any such filesystem currently exists, because the
incore bmbt would consume more than 64GB of kernel memory on its own,
and so far nobody except me has driven a filesystem that far, judging
from the lack of complaints.
Cc: <stable@vger.kernel.org> # v5.19
Fixes: df9ad5cc7a ("xfs: Introduce macros to represent new maximum extent counts for data/attr forks")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 7ce31f20a0 upstream.
Way back when we first implemented FICLONE for XFS, life was simple --
either the the entire remapping completed, or something happened and we
had to return an errno explaining what happened. Neither of those
ioctls support returning partial results, so it's all or nothing.
Then things got complicated when copy_file_range came along, because it
actually can return the number of bytes copied, so commit 3f68c1f562
tried to make it so that we could return a partial result if the
REMAP_FILE_CAN_SHORTEN flag is set. This is also how FIDEDUPERANGE can
indicate that the kernel performed a partial deduplication.
Unfortunately, the logic is wrong if an error stops the remapping and
CAN_SHORTEN is not set. Because those callers cannot return partial
results, it is an error for ->remap_file_range to return a positive
quantity that is less than the @len passed in. Implementations really
should be returning a negative errno in this case, because that's what
btrfs (which introduced FICLONE{,RANGE}) did.
Therefore, ->remap_range implementations cannot silently drop an errno
that they might have when the number of bytes remapped is less than the
number of bytes requested and CAN_SHORTEN is not set.
Found by running generic/562 on a 64k fsblock filesystem and wondering
why it reported corrupt files.
Cc: <stable@vger.kernel.org> # v4.20
Fixes: 3fc9f5e409 ("xfs: remove xfs_reflink_remap_range")
Really-Fixes: 3f68c1f562 ("xfs: support returning partial reflink results")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 6d7b4bc1c3 upstream.
In commit 2c813ad66a, I partially fixed a bug wherein xfs_btree_insrec
would erroneously try to update the parent's key for a block that had
been split if we decided to insert the new record into the new block.
The solution was to detect this situation and update the in-core key
value that we pass up to the caller so that the caller will (eventually)
add the new block to the parent level of the tree with the correct key.
However, I missed a subtlety about the way inode-rooted btrees work. If
the full block was a maximally sized inode root block, we'll solve that
fullness by moving the root block's records to a new block, resizing the
root block, and updating the root to point to the new block. We don't
pass a pointer to the new block to the caller because that work has
already been done. The new record will /always/ land in the new block,
so in this case we need to use xfs_btree_update_keys to update the keys.
This bug can theoretically manifest itself in the very rare case that we
split a bmbt root block and the new record lands in the very first slot
of the new block, though I've never managed to trigger it in practice.
However, it is very easy to reproduce by running generic/522 with the
realtime rmapbt patchset if rtinherit=1.
Cc: <stable@vger.kernel.org> # v4.8
Fixes: 2c813ad66a ("xfs: support btrees with overlapping intervals for keys")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 6f4669708a upstream.
If we need to reset a symlink target to the "durr it's busted" string,
then we clear the zapped flag as well. However, this should be using
the provided helper so that we don't set the zapped state on an
otherwise ok symlink.
Cc: <stable@vger.kernel.org> # v6.10
Fixes: 2651923d8d ("xfs: online repair of symbolic links")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit b95629435b upstream.
Increment the session reference count within the lock for lookup to avoid
racy issue with session expire.
Cc: stable@vger.kernel.org
Reported-by: zdi-disclosures@trendmicro.com # ZDI-CAN-25737
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This reverts commit dfe6c5692f ("ocfs2: fix the la space leak when
unmounting an ocfs2 volume").
In commit dfe6c5692f, the commit log "This bug has existed since the
initial OCFS2 code." is wrong. The correct introduction commit is
30dd3478c3 ("ocfs2: correctly use ocfs2_find_next_zero_bit()").
The influence of commit dfe6c5692f is that it provides a correct
fix for the latest kernel. however, it shouldn't be pushed to stable
branches. Let's use this commit to revert all branches that include
dfe6c5692f and use a new fix method to fix commit 30dd3478c3.
Fixes: dfe6c5692f ("ocfs2: fix the la space leak when unmounting an ocfs2 volume")
Signed-off-by: Heming Zhao <heming.zhao@suse.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit b29bf7119d upstream.
The fix for a memory corruption contained a off-by-one error and
caused the compressor to fail in legit cases.
Cc: Kinsey Moore <kinsey.moore@oarcorp.com>
Cc: stable@vger.kernel.org
Fixes: fe051552f5 ("jffs2: Prevent rtime decompress memory corruption")
Signed-off-by: Richard Weinberger <richard@nod.at>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit fe051552f5 upstream.
The rtime decompression routine does not fully check bounds during the
entirety of the decompression pass and can corrupt memory outside the
decompression buffer if the compressed data is corrupted. This adds the
required check to prevent this failure mode.
Cc: stable@vger.kernel.org
Signed-off-by: Kinsey Moore <kinsey.moore@oarcorp.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 231825b2e1 ]
This reverts commit 5c26d2f1d3.
It turns out that we can't do this, because while the old behavior of
ignoring ignorable code points was most definitely wrong, we have
case-folding filesystems with on-disk hash values with that wrong
behavior.
So now you can't look up those names, because they hash to something
different.
Of course, it's also entirely possible that in the meantime people have
created *new* files with the new ("more correct") case folding logic,
and reverting will just make other things break.
The correct solution is to not do case folding in filesystems, but
sadly, people seem to never really understand that. People still see it
as a feature, not a bug.
Reported-by: Qi Han <hanqi@vivo.com>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=219586
Cc: Gabriel Krisman Bertazi <krisman@suse.de>
Requested-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 9c803c474c ]
When activating a swap file we acquire the root's snapshot drew lock and
then check if the root is dead, failing and returning with -EPERM if it's
dead but without unlocking the root's snapshot lock. Fix this by adding
the missing unlock.
Fixes: 60021bd754 ("btrfs: prevent subvol with swapfile from being deleted")
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 951a3f59d2 ]
[BUG]
The following reproducer can cause btrfs mount to fail:
dev="/dev/test/scratch1"
mnt1="/mnt/test"
mnt2="/mnt/scratch"
mkfs.btrfs -f $dev
mount $dev $mnt1
btrfs subvolume create $mnt1/subvol1
btrfs subvolume create $mnt1/subvol2
umount $mnt1
mount $dev $mnt1 -o subvol=subvol1
while mount -o remount,ro $mnt1; do mount -o remount,rw $mnt1; done &
bg=$!
while mount $dev $mnt2 -o subvol=subvol2; do umount $mnt2; done
kill $bg
wait
umount -R $mnt1
umount -R $mnt2
The script will fail with the following error:
mount: /mnt/scratch: /dev/mapper/test-scratch1 already mounted on /mnt/test.
dmesg(1) may have more information after failed mount system call.
umount: /mnt/test: target is busy.
umount: /mnt/scratch/: not mounted
And there is no kernel error message.
[CAUSE]
During the btrfs mount, to support mounting different subvolumes with
different RO/RW flags, we need to detect that and retry if needed:
Retry with matching RO flags if the initial mount fail with -EBUSY.
The problem is, during that retry we do not hold any super block lock
(s_umount), this means there can be a remount process changing the RO
flags of the original fs super block.
If so, we can have an EBUSY error during retry. And this time we treat
any failure as an error, without any retry and cause the above EBUSY
mount failure.
[FIX]
The current retry behavior is racy because we do not have a super block
thus no way to hold s_umount to prevent the race with remount.
Solve the root problem by allowing fc->sb_flags to mismatch from the
sb->s_flags at btrfs_get_tree_super().
Then at the re-entry point btrfs_get_tree_subvol(), manually check the
fc->s_flags against sb->s_flags, if it's a RO->RW mismatch, then
reconfigure with s_umount lock hold.
Reported-by: Enno Gotthold <egotthold@suse.com>
Reported-by: Fabian Vogt <fvogt@suse.com>
[ Special thanks for the reproducer and early analysis pointing to btrfs. ]
Fixes: f044b31867 ("btrfs: handle the ro->rw transition for mounting different subvolumes")
Link: https://bugzilla.suse.com/show_bug.cgi?id=1231836
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 01c5db782e ]
The only caller passes NULL, we can drop the parameter. This is since
the new mount option parser done in 3bb17a25bc ("btrfs: add get_tree
callback for new mount API").
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Stable-dep-of: 951a3f59d2 ("btrfs: fix mount failure due to remount races")
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 87cbab8636 ]
Since the new mount option parser in commit ad21f15b0f ("btrfs:
switch to the new mount API") we don't pass the options like that
anymore.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Stable-dep-of: 951a3f59d2 ("btrfs: fix mount failure due to remount races")
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 36008fe6e3 ]
We can't properly support chasing DFS links in cifs_tree_connect()
because
(1) We don't support creating new sessions while we're reconnecting,
which would be required for DFS interlinks.
(2) ->is_path_accessible() can't be called from cifs_tree_connect()
as it would deadlock with smb2_reconnect(). This is required for
checking if new DFS target is a nested DFS link.
By unconditionally trying to get an DFS referral from new DFS target
isn't correct because if the new DFS target (interlink) is an DFS
standalone namespace, then we would end up getting -ELOOP and then
potentially leaving tcon disconnected.
Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit e2705dd3d1 ]
Use local runs_tree instead of cached. This way excludes rw_semaphore lock.
Reported-by: syzbot+1c25748a40fe79b8a119@syzkaller.appspotmail.com
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 009a8241a8 ]
Quoted:
"at this time, there are still 1086911 extent nodes in this zombie
extent tree that need to be cleaned up.
crash_arm64_sprd_v8.0.3++> extent_tree.node_cnt ffffff80896cc500
node_cnt = {
counter = 1086911
},
"
As reported by Xiuhong, there will be a huge number of extent nodes
in extent tree, it may potentially cause:
- slab memory fragments
- extreme long time shrink on extent tree
- low mapping efficiency
Let's add a sysfs node to limit max read extent count for each inode,
by default, value of this threshold is 10240, it can be updated
according to user's requirement.
Reported-by: Xiuhong Wang <xiuhong.wang@unisoc.com>
Closes: https://lore.kernel.org/linux-f2fs-devel/20241112110627.1314632-1-xiuhong.wang@unisoc.com/
Signed-off-by: Xiuhong Wang <xiuhong.wang@unisoc.com>
Signed-off-by: Zhiguo Niu <zhiguo.niu@unisoc.com>
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 3fc5d5a182 ]
We use rwlock to protect core structure data of extent tree during
its shrink, however, if there is a huge number of extent nodes in
extent tree, during shrink of extent tree, it may hold rwlock for
a very long time, which may trigger kernel hang issue.
This patch fixes to shrink read extent node in batches, so that,
critical region of the rwlock can be shrunk to avoid its extreme
long time hold.
Reported-by: Xiuhong Wang <xiuhong.wang@unisoc.com>
Closes: https://lore.kernel.org/linux-f2fs-devel/20241112110627.1314632-1-xiuhong.wang@unisoc.com/
Signed-off-by: Xiuhong Wang <xiuhong.wang@unisoc.com>
Signed-off-by: Zhiguo Niu <zhiguo.niu@unisoc.com>
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 81520c684c ]
If fs corruption occurs in f2fs_new_node_page(), let's print
more information about corrupted metadata into kernel log.
Meanwhile, it updates to record ERROR_INCONSISTENT_NAT instead
of ERROR_INVALID_BLKADDR if blkaddr in nat entry is not
NULL_ADDR which means nat bitmap and nat entry is inconsistent.
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit f9417fcfca ]
When mounting of a corrupted disk image fails, the error message printed
can reference uninitialized inode fields. To prevent that from happening,
always initialize those fields.
Reported-by: syzbot+aa0730b0a42646eb1359@syzkaller.appspotmail.com
Signed-off-by: Qianqiang Liu <qianqiang.liu@163.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit d5c367ef82 ]
creating a large files during checkpoint disable until it runs out of
space and then delete it, then remount to enable checkpoint again, and
then unmount the filesystem triggers the f2fs_bug_on as below:
------------[ cut here ]------------
kernel BUG at fs/f2fs/inode.c:896!
CPU: 2 UID: 0 PID: 1286 Comm: umount Not tainted 6.11.0-rc7-dirty #360
Oops: invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
RIP: 0010:f2fs_evict_inode+0x58c/0x610
Call Trace:
__die_body+0x15/0x60
die+0x33/0x50
do_trap+0x10a/0x120
f2fs_evict_inode+0x58c/0x610
do_error_trap+0x60/0x80
f2fs_evict_inode+0x58c/0x610
exc_invalid_op+0x53/0x60
f2fs_evict_inode+0x58c/0x610
asm_exc_invalid_op+0x16/0x20
f2fs_evict_inode+0x58c/0x610
evict+0x101/0x260
dispose_list+0x30/0x50
evict_inodes+0x140/0x190
generic_shutdown_super+0x2f/0x150
kill_block_super+0x11/0x40
kill_f2fs_super+0x7d/0x140
deactivate_locked_super+0x2a/0x70
cleanup_mnt+0xb3/0x140
task_work_run+0x61/0x90
The root cause is: creating large files during disable checkpoint
period results in not enough free segments, so when writing back root
inode will failed in f2fs_enable_checkpoint. When umount the file
system after enabling checkpoint, the root inode is dirty in
f2fs_evict_inode function, which triggers BUG_ON. The steps to
reproduce are as follows:
dd if=/dev/zero of=f2fs.img bs=1M count=55
mount f2fs.img f2fs_dir -o checkpoint=disable:10%
dd if=/dev/zero of=big bs=1M count=50
sync
rm big
mount -o remount,checkpoint=enable f2fs_dir
umount f2fs_dir
Let's redirty inode when there is not free segments during checkpoint
is disable.
Signed-off-by: Qi Han <hanqi@vivo.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit f69b0187f8 ]
Like commit f1f047bd7c ("smb: client: Fix -Wstringop-overflow issues"),
adjust the memcpy() destination address to be based off the surrounding
object rather than based off the 4-byte "Protocol" member. This avoids a
build-time warning when compiling under CONFIG_FORTIFY_SOURCE with GCC 15:
In function 'fortify_memcpy_chk',
inlined from 'CIFSSMBSetPathInfo' at ../fs/smb/client/cifssmb.c:5358:2:
../include/linux/fortify-string.h:571:25: error: call to '__write_overflow_field' declared with attribute warning: detected write beyond size of field (1st parameter); maybe use struct_group()? [-Werror=attribute-warning]
571 | __write_overflow_field(p_size_field, size);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Signed-off-by: Kees Cook <kees@kernel.org>
Reviewed-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 522249f05c ]
When working in "fd mode", fanotify_read() needs to open an fd
from a dentry to report event->fd to userspace.
Opening an fd from dentry can fail for several reasons.
For example, when tasks are gone and we try to open their
/proc files or we try to open a WRONLY file like in sysfs
or when trying to open a file that was deleted on the
remote network server.
Add a new flag FAN_REPORT_FD_ERROR for fanotify_init().
For a group with FAN_REPORT_FD_ERROR, we will send the
event with the error instead of the open fd, otherwise
userspace may not get the error at all.
For an overflow event, we report -EBADF to avoid confusing FAN_NOFD
with -EPERM. Similarly for pidfd open errors we report either -ESRCH
or the open error instead of FAN_NOPIDFD and FAN_EPIDFD.
In any case, userspace will not know which file failed to
open, so add a debug print for further investigation.
Reported-by: Krishna Vivek Vitta <kvitta@microsoft.com>
Link: https://lore.kernel.org/linux-fsdevel/SI2P153MB07182F3424619EDDD1F393EED46D2@SI2P153MB0718.APCP153.PROD.OUTLOOK.COM/
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20241003142922.111539-1-amir73il@gmail.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit b98333c67d ]
This patch fixes a possible null pointer dereference when this function is
called from request_lock() as lkb->lkb_resource is not assigned yet,
only after validate_lock_args() by calling attach_lkb(). Another issue
is that a resource name could be a non printable bytearray and we cannot
assume to be ASCII coded.
The log functionality is probably never being hit when DLM is used in
normal way and no debug logging is enabled. The null pointer dereference
can only occur on a new created lkb that does not have the resource
assigned yet, it probably never hits the null pointer dereference but we
should be sure that other changes might not change this behaviour and we
actually can hit the mentioned null pointer dereference.
In this patch we just drop the printout of the resource name, the lkb id
is enough to make a possible connection to a resource name if this
exists.
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit c7fc0366c6 ]
Using mapped writes, it's technically possible to expose stale
post-eof data on a truncate up operation. Consider the following
example:
$ xfs_io -fc "pwrite 0 2k" -c "mmap 0 4k" -c "mwrite 2k 2k" \
-c "truncate 8k" -c "pread -v 2k 16" <file>
...
00000800: 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 XXXXXXXXXXXXXXXX
...
This shows that the post-eof data written via mwrite lands within
EOF after a truncate up. While this is deliberate of the test case,
behavior is somewhat unpredictable because writeback does post-eof
zeroing, and writeback can occur at any time in the background. For
example, an fsync inserted between the mwrite and truncate causes
the subsequent read to instead return zeroes. This basically means
that there is a race window in this situation between any subsequent
extending operation and writeback that dictates whether post-eof
data is exposed to the file or zeroed.
To prevent this problem, perform partial block zeroing as part of
the various inode size extending operations that are susceptible to
it. For truncate extension, zero around the original eof similar to
how truncate down does partial zeroing of the new eof. For extension
via writes and fallocate related operations, zero the newly exposed
range of the file to cover any partial zeroing that must occur at
the original and new eof blocks.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Link: https://patch.msgid.link/20240919160741.208162-2-bfoster@redhat.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 70958a949d ]
If you follow the seed/sprout wiki, it suggests the following workflow:
btrfstune -S 1 seed_dev
mount seed_dev mnt
btrfs device add sprout_dev
mount -o remount,rw mnt
The first mount mounts the FS readonly, which results in not setting
BTRFS_FS_OPEN, and setting the readonly bit on the sb. The device add
somewhat surprisingly clears the readonly bit on the sb (though the
mount is still practically readonly, from the users perspective...).
Finally, the remount checks the readonly bit on the sb against the flag
and sees no change, so it does not run the code intended to run on
ro->rw transitions, leaving BTRFS_FS_OPEN unset.
As a result, when the cleaner_kthread runs, it sees no BTRFS_FS_OPEN and
does no work. This results in leaking deleted snapshots until we run out
of space.
I propose fixing it at the first departure from what feels reasonable:
when we clear the readonly bit on the sb during device add.
A new fstest I have written reproduces the bug and confirms the fix.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 7e06de7c83 ]
[PROBLEM]
Currently btrfs accepts any file path for its device, resulting some
weird situation:
# ./mount_by_fd /dev/test/scratch1 /mnt/btrfs/
The program has the following source code:
#include <fcntl.h>
#include <stdio.h>
#include <sys/mount.h>
int main(int argc, char *argv[]) {
int fd = open(argv[1], O_RDWR);
char path[256];
snprintf(path, sizeof(path), "/proc/self/fd/%d", fd);
return mount(path, argv[2], "btrfs", 0, NULL);
}
Then we can have the following weird device path:
BTRFS: device fsid 2378be81-fe12-46d2-a9e8-68cf08dd98d5 devid 1 transid 7 /proc/self/fd/3 (253:2) scanned by mount_by_fd (18440)
Normally it's not a big deal, and later udev can trigger a device path
rename. But if udev didn't trigger, the device path "/proc/self/fd/3"
will show up in mtab.
[CAUSE]
For filename "/proc/self/fd/3", it means the opened file descriptor 3.
In above case, it's exactly the device we want to open, aka points to
"/dev/test/scratch1" which is another symlink pointing to "/dev/dm-2".
Inside kernel we solve the mount source using LOOKUP_FOLLOW, which
follows the symbolic link and grab the proper block device.
But inside btrfs we also save the filename into btrfs_device::name, and
utilize that member to report our mount source, which leads to the above
situation.
[FIX]
Instead of unconditionally trust the path, check if the original file
(not following the symbolic link) is inside "/dev/", if not, then
manually lookup the path to its final destination, and use that as our
device path.
This allows us to still use symbolic links, like
"/dev/mapper/test-scratch" from LVM2, which is required for fstests runs
with LVM2 setup.
And for really weird names, like the above case, we solve it to
"/dev/dm-2" instead.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Link: https://bugzilla.suse.com/show_bug.cgi?id=1230641
Reported-by: Fabian Vogt <fvogt@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 2e8b6bc0ab ]
[PROBLEM]
It is very common for udev to trigger device scan, and every time a
mounted btrfs device got re-scan from different soft links, we will get
some of unnecessary device path updates, this is especially common
for LVM based storage:
# lvs
scratch1 test -wi-ao---- 10.00g
scratch2 test -wi-a----- 10.00g
scratch3 test -wi-a----- 10.00g
scratch4 test -wi-a----- 10.00g
scratch5 test -wi-a----- 10.00g
test test -wi-a----- 10.00g
# mkfs.btrfs -f /dev/test/scratch1
# mount /dev/test/scratch1 /mnt/btrfs
# dmesg -c
[ 205.705234] BTRFS: device fsid 7be2602f-9e35-4ecf-a6ff-9e91d2c182c9 devid 1 transid 6 /dev/mapper/test-scratch1 (253:4) scanned by mount (1154)
[ 205.710864] BTRFS info (device dm-4): first mount of filesystem 7be2602f-9e35-4ecf-a6ff-9e91d2c182c9
[ 205.711923] BTRFS info (device dm-4): using crc32c (crc32c-intel) checksum algorithm
[ 205.713856] BTRFS info (device dm-4): using free-space-tree
[ 205.722324] BTRFS info (device dm-4): checking UUID tree
So far so good, but even if we just touched any soft link of
"dm-4", we will get quite some unnecessary device path updates.
# touch /dev/mapper/test-scratch1
# dmesg -c
[ 469.295796] BTRFS info: devid 1 device path /dev/mapper/test-scratch1 changed to /dev/dm-4 scanned by (udev-worker) (1221)
[ 469.300494] BTRFS info: devid 1 device path /dev/dm-4 changed to /dev/mapper/test-scratch1 scanned by (udev-worker) (1221)
Such device path rename is unnecessary and can lead to random path
change due to the udev race.
[CAUSE]
Inside device_list_add(), we are using a very primitive way checking if
the device has changed, strcmp().
Which can never handle links well, no matter if it's hard or soft links.
So every different link of the same device will be treated as a different
device, causing the unnecessary device path update.
[FIX]
Introduce a helper, is_same_device(), and use path_equal() to properly
detect the same block device.
So that the different soft links won't trigger the rename race.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Link: https://bugzilla.suse.com/show_bug.cgi?id=1230641
Reported-by: Fabian Vogt <fvogt@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 914eec5e98 upstream.
The following INFO level message was seen:
seq_file: buggy .next function ocfs2_dlm_seq_next [ocfs2] did not
update position index
Fix:
Update *pos (so m->index) to make seq_read_iter happy though the index its
self makes no sense to ocfs2_dlm_seq_next.
Link: https://lkml.kernel.org/r/20241119174500.9198-1-wen.gang.wang@oracle.com
Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Jun Piao <piaojun@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>