The NFSv4.1 OP_EXCHANGE_ID response from the server may carry server
implementation details (domain, name, and build time) in the optional
nfs_impl_id4 field. Currently nfsd does not fill in this field.
Send this information in the NFSv4.1 OP_EXCHANGE_ID response, filled in
the same way the Linux NFSv4.1 client does it: the domain is hardcoded
to "kernel.org", the name is composed like "uname -srvm" output, and
the build time is hardcoded to zeros.
The NFSv4.1 client and server implementation fields are useful for
statistics or for identifying the types of clients and servers.
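For reference, RFC 8881 defines the field as below; the snprintf fragment
is an illustrative sketch of composing the name from utsname(), not the
exact nfsd code:

    /* RFC 8881: optional implementation ID carried in EXCHANGE_ID */
    struct nfs_impl_id4 {
            utf8str_cis nii_domain; /* hardcoded to "kernel.org"      */
            utf8str_cs  nii_name;   /* composed like "uname -srvm"    */
            nfstime4    nii_date;   /* build time, hardcoded to zeros */
    };

    char name[128];  /* illustrative size */

    /* sketch: build nii_name the way "uname -srvm" prints it */
    snprintf(name, sizeof(name), "%s %s %s %s",
             utsname()->sysname, utsname()->release,
             utsname()->version, utsname()->machine);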
Signed-off-by: Pali Rohár <pali@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
NLMv2 is a completely different protocol from NLMv1 and NLMv3; in the
original Sun implementation it is used for RPC loopback callbacks from the
statd to lockd services. Linux neither uses nor implements NLMv2.
Hence, NLMv3 is not backward compatible with NLMv2, but it is backward
compatible with NLMv1. Fix the comment.
Signed-off-by: Pali Rohár <pali@kernel.org>
Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Turn the nfsd_compound_encode_err tracepoint into a class and add a new
nfsd_compound_op_err tracepoint.
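A minimal sketch of the pattern (the fields are illustrative, not the
exact nfsd tracepoint definitions):

    /* one class carries the shared entry/printk logic ... */
    DECLARE_EVENT_CLASS(nfsd_compound_err_class,
            TP_PROTO(const struct svc_rqst *rqstp, u32 op, __be32 status),
            TP_ARGS(rqstp, op, status),
            TP_STRUCT__entry(
                    __field(u32, xid)
                    __field(u32, op)
                    __field(int, status)
            ),
            TP_fast_assign(
                    __entry->xid = be32_to_cpu(rqstp->rq_xid);
                    __entry->op = op;
                    __entry->status = be32_to_cpu(status);
            ),
            TP_printk("xid=0x%08x op=%u status=%d",
                      __entry->xid, __entry->op, __entry->status)
    );

    /* ... and each event becomes a one-line instantiation of it */
    DEFINE_EVENT(nfsd_compound_err_class, nfsd_compound_encode_err,
            TP_PROTO(const struct svc_rqst *rqstp, u32 op, __be32 status),
            TP_ARGS(rqstp, op, status));
    DEFINE_EVENT(nfsd_compound_err_class, nfsd_compound_op_err,
            TP_PROTO(const struct svc_rqst *rqstp, u32 op, __be32 status),
            TP_ARGS(rqstp, op, status));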
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Merge tag 'for-6.13/block-20241118' of git://git.kernel.dk/linux
Pull block updates from Jens Axboe:
- NVMe updates via Keith:
- Use uring_cmd helper (Pavel)
- Host Memory Buffer allocation enhancements (Christoph)
- Target persistent reservation support (Guixin)
- Persistent reservation tracing (Guixin)
- NVMe 2.1 specification support (Keith)
- Rotational Meta Support (Matias, Wang, Keith)
- Volatile cache detection enhancement (Guixin)
- MD updates via Song:
- Maintainers update
- raid5 sync IO fix
- Enhance handling of faulty and blocked devices
- raid5-ppl atomic improvement
- md-bitmap fix
- Support for manually defining embedded partition tables
- Zone append fixes and cleanups
- Stop sending the queued requests in the plug list to the driver's
->queue_rqs() handler in reverse order.
- Zoned write plug cleanups
- Clean up disk stats tracking and add support for disk stats for
passthrough IO
- Add preparatory support for file system atomic writes
- Add lockdep support for queue freezing. Already found a bunch of
issues, and some fixes for that are in here. More will be coming.
- Fix race between queue stopping/quiescing and IO queueing
- ublk recovery improvements
- Fix ublk mmap for 64k pages
- Various fixes and cleanups
* tag 'for-6.13/block-20241118' of git://git.kernel.dk/linux: (118 commits)
MAINTAINERS: Update git tree for mdraid subsystem
block: make struct rq_list available for !CONFIG_BLOCK
block/genhd: use seq_put_decimal_ull for diskstats decimal values
block: don't reorder requests in blk_mq_add_to_batch
block: don't reorder requests in blk_add_rq_to_plug
block: add a rq_list type
block: remove rq_list_move
virtio_blk: reverse request order in virtio_queue_rqs
nvme-pci: reverse request order in nvme_queue_rqs
btrfs: validate queue limits
block: export blk_validate_limits
nvmet: add tracing of reservation commands
nvme: parse reservation commands' action and rtype to string
nvmet: report ns's vwc not present
md/raid5: Increase r5conf.cache_name size
block: remove the ioprio field from struct request
block: remove the write_hint field from struct request
nvme: check ns's volatile write cache not present
nvme: add rotational support
nvme: use command set independent id ns if available
...
Merge tag 'for-6.13-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
"Changes outside of btrfs: add io_uring command flag to track a dying
task (the rest will go via the block git tree).
User visible changes:
- wire encoded read (ioctl) to io_uring commands, this can be used on
its own; in the future this will allow 'send' to be asynchronous. As
a consequence, the encoded read ioctl can also work in non-blocking
mode
- new ioctl to wait for cleaned subvolumes, no need to use the
generic and root-only SEARCH_TREE ioctl, will be used by "btrfs
subvol sync"
- recognize different paths/symlinks for the same devices and don't
report them during rescanning, this can be observed with LVM or DM
- seeding device use case change, the sprout device (the one
capturing new writes) will not clear the read-only status of the
super block; this prevents accumulating space from deleted
snapshots
Performance improvements:
- reduce lock contention when traversing extent buffers
- reduce extent tree lock contention when searching for inline
backref
- switch from rb-trees to xarray for delayed ref tracking,
improvements due to better cache locality, branching factors and
more compact data structures
- enable extent map shrinker again (prevent memory exhaustion under
some types of IO load), reworked to run in a single worker thread
(there used to be problems causing long stalls under memory
pressure)
Core changes:
- raid-stripe-tree feature updates:
- make device replace and scrub work
- implement partial deletion of stripe extents
- new selftests
- split the config option BTRFS_DEBUG and add EXPERIMENTAL for
features that are experimental or with known problems so we don't
misuse debugging config for that
- subpage mode updates (sector < page):
- update compression implementations
- update writepage, writeback
- continued folio API conversions:
- buffered writes
- make buffered write copy one page at a time, preparatory work for
future integration with large folios, may cause performance drop
- proper locking of root item regarding starting send
- error handling improvements
- code cleanups and refactoring:
- dead code removal
- unused parameter reduction
- lockdep assertions"
* tag 'for-6.13-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (119 commits)
btrfs: send: check for read-only send root under critical section
btrfs: send: check for dead send root under critical section
btrfs: remove check for NULL fs_info at btrfs_folio_end_lock_bitmap()
btrfs: fix warning on PTR_ERR() against NULL device at btrfs_control_ioctl()
btrfs: fix a typo in btrfs_use_zone_append
btrfs: avoid superfluous calls to free_extent_map() in btrfs_encoded_read()
btrfs: simplify logic to decrement snapshot counter at btrfs_mksnapshot()
btrfs: remove hole from struct btrfs_delayed_node
btrfs: update stale comment for struct btrfs_delayed_ref_node::add_list
btrfs: add new ioctl to wait for cleaned subvolumes
btrfs: simplify range tracking in cow_file_range()
btrfs: remove conditional path allocation in btrfs_read_locked_inode()
btrfs: push cleanup into btrfs_read_locked_inode()
io_uring/cmd: let cmds to know about dying task
btrfs: add struct io_btrfs_cmd as type for io_uring_cmd_to_pdu()
btrfs: add io_uring command for encoded reads (ENCODED_READ ioctl)
btrfs: move priv off stack in btrfs_encoded_read_regular_fill_pages()
btrfs: don't sleep in btrfs_encoded_read() if IOCB_NOWAIT is set
btrfs: change btrfs_encoded_read() so that reading of extent is done by caller
btrfs: remove pointless iocb::ki_pos addition in btrfs_encoded_read()
...
Merge tag 'ext4_for_linus-6.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 updates from Ted Ts'o:
"A lot of miscellaneous ext4 bug fixes and cleanups this cycle, most
notably in the journaling code, buffered I/O, and compiler warning
cleanups"
* tag 'ext4_for_linus-6.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (33 commits)
jbd2: Fix comment describing journal_init_common()
ext4: prevent an infinite loop in the lazyinit thread
ext4: use struct_size() to improve ext4_htree_store_dirent()
ext4: annotate struct fname with __counted_by()
jbd2: avoid dozens of -Wflex-array-member-not-at-end warnings
ext4: use str_yes_no() helper function
ext4: prevent delalloc to nodelalloc on remount
jbd2: make b_frozen_data allocation always succeed
ext4: cleanup variable name in ext4_fc_del()
ext4: use string choices helpers
jbd2: remove the 'success' parameter from the jbd2_do_replay() function
jbd2: remove useless 'block_error' variable
jbd2: factor out jbd2_do_replay()
jbd2: refactor JBD2_COMMIT_BLOCK process in do_one_pass()
jbd2: unified release of buffer_head in do_one_pass()
jbd2: remove redundant judgments for check v1 checksum
ext4: use ERR_CAST to return an error-valued pointer
mm: zero range of eof folio exposed by inode size extension
ext4: partial zero eof block on unaligned inode size extension
ext4: disambiguate the return value of ext4_dio_write_end_io()
...
Merge tag 'pull-statx' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull statx updates from Al Viro:
"Sanitize struct filename and lookup flags handling in statx and
friends"
* tag 'pull-statx' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
libfs: kill empty_dir_getattr()
fs: Simplify getattr interface function checking AT_GETATTR_NOSEC flag
fs/stat.c: switch to CLASS(fd_raw)
kill getname_statx_lookup_flags()
io_statx_prep(): use getname_uflags()
Merge tag 'pull-ufs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull ufs updates from Al Viro:
"ufs cleanups, fixes and folio conversion"
* tag 'pull-ufs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
ufs: ufs_sb_private_info: remove unused s_{2,3}apb fields
ufs: Convert ufs_change_blocknr() to take a folio
ufs: Pass a folio to ufs_new_fragments()
ufs: Convert ufs_inode_getfrag() to take a folio
ufs: Convert ufs_extend_tail() to take a folio
ufs: Convert ufs_inode_getblock() to take a folio
ufs: take the handling of free block counters into a helper
clean ufs_trunc_direct() up a bit...
ufs: get rid of ubh_{ubhcpymem,memcpyubh}()
ufs_inode_getfrag(): remove junk comment
ufs_free_fragments(): fix the braino in sanity check
ufs_clusteracct(): switch to passing fragment number
ufs: untangle ubh_...block...(), part 3
ufs: untangle ubh_...block...(), part 2
ufs: untangle ubh_...block...() macros, part 1
ufs: fix ufs_read_cylinder() failure handling
ufs: missing ->splice_write()
ufs: fix handling of delete_entry and set_link failures
Merge tag 'pull-xattr' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull xattr updates from Al Viro:
"Sanitize xattr and io_uring interactions with it, add *xattrat()
syscalls, sanitize struct filename handling in there"
* tag 'pull-xattr' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
xattr: remove redundant check on variable err
fs/xattr: add *at family syscalls
new helpers: file_removexattr(), filename_removexattr()
new helpers: file_listxattr(), filename_listxattr()
replace do_getxattr() with saner helpers.
replace do_setxattr() with saner helpers.
new helper: import_xattr_name()
fs: rename struct xattr_ctx to kernel_xattr_ctx
xattr: switch to CLASS(fd)
io_[gs]etxattr_prep(): just use getname()
io_uring: IORING_OP_F[GS]ETXATTR is fine with REQ_F_FIXED_FILE
getname_maybe_null() - the third variant of pathname copy-in
teach filename_lookup() to treat NULL filename as ""
Merge tag 'pull-fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull 'struct fd' class updates from Al Viro:
"The bulk of struct fd memory safety stuff
Making sure that struct fd instances are destroyed in the same scope
where they'd been created, getting rid of reassignments and passing
them by reference, converting to CLASS(fd{,_pos,_raw}).
We are getting very close to having the memory safety of that stuff
trivial to verify"
* tag 'pull-fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (28 commits)
deal with the last remaining boolean uses of fd_file()
css_set_fork(): switch to CLASS(fd_raw, ...)
memcg_write_event_control(): switch to CLASS(fd)
assorted variants of irqfd setup: convert to CLASS(fd)
do_pollfd(): convert to CLASS(fd)
convert do_select()
convert vfs_dedupe_file_range().
convert cifs_ioctl_copychunk()
convert media_request_get_by_fd()
convert spu_run(2)
switch spufs_calls_{get,put}() to CLASS() use
convert cachestat(2)
convert do_preadv()/do_pwritev()
fdget(), more trivial conversions
fdget(), trivial conversions
privcmd_ioeventfd_assign(): don't open-code eventfd_ctx_fdget()
o2hb_region_dev_store(): avoid goto around fdget()/fdput()
introduce "fd_pos" class, convert fdget_pos() users to it.
fdget_raw() users: switch to CLASS(fd_raw)
convert vmsplice() to CLASS(fd)
...
Merge tag 'vfs-6.13.ecryptfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull ecryptfs updates from Christian Brauner:
"The folio project is about to remove page->index. This contains the
work required for ecryptfs"
* tag 'vfs-6.13.ecryptfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
ecryptfs: Pass the folio index to crypt_extent()
ecryptfs: Convert lower_offset_for_page() to take a folio
ecryptfs: Convert ecryptfs_decrypt_page() to take a folio
ecryptfs: Convert ecryptfs_encrypt_page() to take a folio
ecryptfs: Convert ecryptfs_write_lower_page_segment() to take a folio
ecryptfs: Convert ecryptfs_write() to use a folio
ecryptfs: Convert ecryptfs_read_lower_page_segment() to take a folio
ecryptfs: Convert ecryptfs_copy_up_encrypted_with_header() to take a folio
ecryptfs: Use a folio throughout ecryptfs_read_folio()
ecryptfs: Convert ecryptfs_writepage() to ecryptfs_writepages()
Merge tag 'vfs-6.13.untorn.writes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs untorn write support from Christian Brauner:
"An atomic write is a write issed with torn-write protection. This
means for a power failure or any hardware failure all or none of the
data from the write will be stored, never a mix of old and new data.
This work is already supported for block devices. If a block device is
opened with O_DIRECT and the block device supports atomic write, then
FMODE_CAN_ATOMIC_WRITE is added to the file of the opened block
device.
This contains the work to expand atomic write support to filesystems,
specifically ext4 and XFS. Currently, only support for writing exactly
one filesystem block atomically is added.
Since it's now possible to have filesystem block size > page size for
XFS, it's possible to write 4K+ blocks atomically on x86"
* tag 'vfs-6.13.untorn.writes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
iomap: drop an obsolete comment in iomap_dio_bio_iter
ext4: Do not fallback to buffered-io for DIO atomic write
ext4: Support setting FMODE_CAN_ATOMIC_WRITE
ext4: Check for atomic writes support in write iter
ext4: Add statx support for atomic writes
xfs: Support setting FMODE_CAN_ATOMIC_WRITE
xfs: Validate atomic writes
xfs: Support atomic write for statx
fs: iomap: Atomic write support
fs: Export generic_atomic_write_valid()
block: Add bdev atomic write limits helpers
fs/block: Check for IOCB_DIRECT in generic_atomic_write_valid()
block/fs: Pass an iocb to generic_atomic_write_valid()
Merge tag 'vfs-6.13.tmpfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull tmpfs case folding updates from Christian Brauner:
"This adds case-insensitive support for tmpfs.
The work contained in here adds support for case-insensitive file
names lookups in tmpfs. The main difference from other casefold
filesystems is that tmpfs has no information on disk, just on RAM, so
we can't use mkfs to create a case-insensitive tmpfs. For this
implementation, there's a mount option for casefolding. The rest of
the patchset follows a similar approach as ext4 and f2fs.
The use case for this feature is similar to the use case for ext4, to
better support compatibility layers (like Wine), particularly in
combination with sandboxing/container tools (like Flatpak).
Those containerization tools can share a subset of the host filesystem
with an application. In the container, the root directory and any
parent directories required for a shared directory are on tmpfs, with
the shared directories bind-mounted into the container's view of the
filesystem.
If the host filesystem is using case-insensitive directories, then the
application can do lookups inside those directories in a
case-insensitive way, without this needing to be implemented in
user-space. However, if the host is only sharing a subset of a
case-insensitive directory with the application, then the parent
directories of the mount point will be part of the container's root
tmpfs. When the application tries to do case-insensitive lookups of
those parent directories on a case-sensitive tmpfs, the lookup will
fail"
* tag 'vfs-6.13.tmpfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
tmpfs: Initialize sysfs during tmpfs init
tmpfs: Fix type for sysfs' casefold attribute
libfs: Fix kernel-doc warning in generic_ci_validate_strict_name
docs: tmpfs: Add casefold options
tmpfs: Expose filesystem features via sysfs
tmpfs: Add flag FS_CASEFOLD_FL support for tmpfs dirs
tmpfs: Add casefold lookup support
libfs: Export generic_ci_ dentry functions
unicode: Recreate utf8_parse_version()
unicode: Export latest available UTF-8 version number
ext4: Use generic_ci_validate_strict_name helper
libfs: Create the helper function generic_ci_validate_strict_name()
Merge tag 'vfs-6.13.pidfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull pidfs update from Christian Brauner:
"This adds a new ioctl to retrieve information about a pidfd.
A common pattern when using pidfds is having to get information about
the process, which currently requires /proc being mounted, resolving
the fd to a pid, and then do manual string parsing of /proc/N/status
and friends. This needs to be reimplemented over and over in all
userspace projects (e.g.: it has been reimplemented in systemd, dbus,
dbus-daemon, polkit so far), and requires additional care in checking
that the fd is still valid after having parsed the data, to avoid
races.
Having a programmatic API that can be used directly removes all these
requirements, including having /proc mounted.
As discussed at LPC24, add an ioctl with an extensible struct so that
more parameters can be added later if needed. Start with returning
pid/tgid/ppid and some creds unconditionally, and cgroupid optionally"
* tag 'vfs-6.13.pidfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
pidfd: add ioctl to retrieve pid info
Merge tag 'vfs-6.13.ovl' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull overlayfs updates from Christian Brauner:
"Make overlayfs support specifying layers through file descriptors.
Currently overlayfs only allows specifying layers through path names.
This is inconvenient for users that want to assemble an overlayfs
mount purely based on file descriptors:
This enables users to specify both:
fsconfig(fd_overlay, FSCONFIG_SET_FD, "upperdir+", NULL, fd_upper);
fsconfig(fd_overlay, FSCONFIG_SET_FD, "workdir+", NULL, fd_work);
fsconfig(fd_overlay, FSCONFIG_SET_FD, "lowerdir+", NULL, fd_lower1);
fsconfig(fd_overlay, FSCONFIG_SET_FD, "lowerdir+", NULL, fd_lower2);
in addition to:
fsconfig(fd_overlay, FSCONFIG_SET_STRING, "upperdir+", "/upper", 0);
fsconfig(fd_overlay, FSCONFIG_SET_STRING, "workdir+", "/work", 0);
fsconfig(fd_overlay, FSCONFIG_SET_STRING, "lowerdir+", "/lower1", 0);
fsconfig(fd_overlay, FSCONFIG_SET_STRING, "lowerdir+", "/lower2", 0);
There's also a large set of new overlayfs selftests to test new
features and some older properties"
* tag 'vfs-6.13.ovl' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
selftests: add test for specifying 500 lower layers
selftests: add overlayfs fd mounting selftests
selftests: use shared header
Documentation,ovl: document new file descriptor based layers
ovl: specify layers via file descriptors
fs: add helper to use mount option as path or fd
Merge tag 'vfs-6.13.file' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs file updates from Christian Brauner:
"This contains changes the changes for files for this cycle:
- Introduce a new reference counting mechanism for files.
As atomic_inc_not_zero() is implemented with a try_cmpxchg() loop
it has O(N^2) behaviour under contention with N concurrent
operations and it is in a hot path in __fget_files_rcu().
The rcuref infrastructure remedies this problem by using an
unconditional increment, relying on safe and dead zones to make
this work and requiring RCU protection for the data structure in
question. This not only scales better, it also introduces overflow
protection.
However, in contrast to the generic rcuref, files require a memory
barrier and thus cannot rely on *_relaxed() atomic operations; they
also need to be built on atomic_long_t, since massive reference
counts aren't unheard of, even if only as the result of an attack.
This adds a file specific variant instead of making this a generic
library.
This has been tested by various people and it gives consistent
improvement up to 3-5% on workloads with loads of threads.
- Add a fastpath for find_next_zero_bit(). Skip the two-level search
via find_next_zero_bit() when there is a free slot in the word that
contains the next fd (see the sketch after this message). This
improves pts/blogbench-1.1.0 read by 8% and write by 4% on Intel
ICX 160.
- Conditionally clear full_fds_bits since it's very likely that a bit
in full_fds_bits has been cleared during __clear_open_fds(). This
improves pts/blogbench-1.1.0 read up to 13%, and write up to 5% on
Intel ICX 160.
- Get rid of all lookup_*_fdget_rcu() variants. They were used to
lookup files without taking a reference count. That became invalid
once files were switched to SLAB_TYPESAFE_BY_RCU and now we're
always taking a reference count. Switch to an already existing
helper and remove the legacy variants.
- Remove pointless includes of <linux/fdtable.h>.
- Avoid cmpxchg() in close_files() as nobody else has a reference to
the files_struct at that point.
- Move close_range() into fs/file.c and fold __close_range() into it.
- Cleanup calling conventions of alloc_fdtable() and expand_files().
- Merge __{set,clear}_close_on_exec() into one.
- Make __set_open_fd() set cloexec as well instead of doing it in two
separate steps"
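A sketch of the find_next_fd() fast path mentioned above, close to the
idea described (not necessarily the exact fs/file.c code):

    static unsigned int find_next_fd(struct fdtable *fdt, unsigned int start)
    {
            unsigned int maxfd = fdt->max_fds; /* multiple of BITS_PER_LONG */
            unsigned int maxbit = maxfd / BITS_PER_LONG;
            unsigned int bitbit = start / BITS_PER_LONG;
            unsigned int bit;

            /* Fast path: a zero bit in the word containing 'start' lets
             * us skip the two-level full_fds_bits search entirely. */
            bit = find_next_zero_bit(&fdt->open_fds[bitbit], BITS_PER_LONG,
                                     start & (BITS_PER_LONG - 1));
            if (bit < BITS_PER_LONG)
                    return bit + bitbit * BITS_PER_LONG;

            /* Slow path: find the first word that isn't full, search it. */
            bitbit = find_next_zero_bit(fdt->full_fds_bits, maxbit, bitbit) *
                     BITS_PER_LONG;
            if (bitbit >= maxfd)
                    return maxfd;
            if (bitbit > start)
                    start = bitbit;
            return find_next_zero_bit(fdt->open_fds, maxfd, start);
    }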
* tag 'vfs-6.13.file' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
selftests: add file SLAB_TYPESAFE_BY_RCU recycling stressor
fs: port files to file_ref
fs: add file_ref
expand_files(): simplify calling conventions
make __set_open_fd() set cloexec state as well
fs: protect backing files with rcu
file.c: merge __{set,clear}_close_on_exec()
alloc_fdtable(): change calling conventions.
fs/file.c: add fast path in find_next_fd()
fs/file.c: conditionally clear full_fds
fs/file.c: remove sanity_check and add likely/unlikely in alloc_fd()
move close_range(2) into fs/file.c, fold __close_range() into it
close_files(): don't bother with xchg()
remove pointless includes of <linux/fdtable.h>
get rid of ...lookup...fdget_rcu() family
Merge tag 'vfs-6.13.netfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull netfs updates from Christian Brauner:
"Various fixes for the netfs library and related infrastructure:
cachefiles:
- Fix a dentry leak in cachefiles_open_file()
- Fix incorrect length return value in
cachefiles_ondemand_fd_write_iter()
- Fix missing pos updates in cachefiles_ondemand_fd_write_iter()
- Clean up in cachefiles_commit_tmpfile()
- Fix NULL pointer dereference in object->file
- Add a memory barrier for FSCACHE_VOLUME_CREATING
netfs:
- Remove call to folio_index()
- Fix a few minor bugs in netfs_page_mkwrite()
- Remove unnecessary references to pages"
* tag 'vfs-6.13.netfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
netfs/fscache: Add a memory barrier for FSCACHE_VOLUME_CREATING
cachefiles: Fix NULL pointer dereference in object->file
cachefiles: Clean up in cachefiles_commit_tmpfile()
cachefiles: Fix missing pos updates in cachefiles_ondemand_fd_write_iter()
cachefiles: Fix incorrect length return value in cachefiles_ondemand_fd_write_iter()
netfs: Remove unnecessary references to pages
netfs: Fix a few minor bugs in netfs_page_mkwrite()
netfs: Remove call to folio_index()
Merge tag 'vfs-6.13.pagecache' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs pagecache updates from Christian Brauner:
"Cleanup filesystem page flag usage: This continues the work to make
the mappedtodisk/owner_2 flag available to filesystems which don't use
buffer heads. Further patches remove uses of Private2. This brings us
very close to being rid of it entirely"
* tag 'vfs-6.13.pagecache' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
migrate: Remove references to Private2
ceph: Remove call to PagePrivate2()
btrfs: Switch from using the private_2 flag to owner_2
mm: Remove PageMappedToDisk
nilfs2: Convert nilfs_copy_buffer() to use folios
fs: Move clearing of mappedtodisk to buffer.c
Merge tag 'vfs-6.13.rust.file' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs rust file abstractions from Christian Brauner:
"This contains the file abstractions needed by the Rust implementation
of the Binder driver and other parts of the kernel.
Let's treat this as a first attempt at getting something working but I
do expect the actual interfaces to change significantly over time.
Simply because we are still figuring out what actually works. But
there's no point in further theorizing. Let's see how it holds up with
actual users"
* tag 'vfs-6.13.rust.file' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
rust: task: adjust safety comments in Task methods
rust: add seqfile abstraction
rust: file: add abstraction for `poll_table`
rust: file: add `Kuid` wrapper
rust: file: add `FileDescriptorReservation`
rust: security: add abstraction for secctx
rust: cred: add Rust abstraction for `struct cred`
rust: file: add Rust abstraction for `struct file`
rust: task: add `Task::current_raw`
rust: types: add `NotThreadSafe`
Merge tag 'vfs-6.13.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull misc vfs updates from Christian Brauner:
"Features:
- Fixup and improve NLM and kNFSD file lock callbacks
Last year both GFS2 and OCFS2 had some work done to make their
locking more robust when exported over NFS. Unfortunately, part of
that work caused both NLM (for NFS v3 exports) and kNFSD (for
NFSv4.1+ exports) to no longer send lock notifications to clients
This in itself is not a huge problem because most NFS clients will
still poll the server in order to acquire a conflicted lock
It's important for NLM and kNFSD that they do not block their
kernel threads inside filesystems' file_lock implementations
because that can produce deadlocks. We used to make sure of this by
only trusting that posix_lock_file() can correctly handle blocking
lock calls asynchronously, so the lock managers would only setup
their file_lock requests for async callbacks if the filesystem did
not define its own lock() file operation
However, when GFS2 and OCFS2 grew the capability to correctly
handle blocking lock requests asynchronously, they started
signalling this behavior with EXPORT_OP_ASYNC_LOCK, and the check
for also trusting posix_lock_file() was inadvertently dropped, so
now most filesystems no longer produce lock notifications when
exported over NFS
Fix this by using an fop_flag, which greatly simplifies the problem
and paves the way for future uses by both filesystems and lock
managers alike
- Add a sysctl to delete the dentry when a file is removed instead of
making it a negative dentry
Commit 681ce86235 ("vfs: Delete the associated dentry when
deleting a file") introduced an unconditional deletion of the
associated dentry when a file is removed. However, this led to
performance regressions in specific benchmarks, such as
filebench.sum_operations/s, prompting a revert in commit
4a4be1ad3a ("Revert "vfs: Delete the associated dentry when
deleting a file""). This reintroduces the concept conditionally
through a sysctl
- Expand the statmount() system call:
* Report the filesystem subtype in a new fs_subtype field to
e.g., report fuse filesystem subtypes
* Report the superblock source in a new sb_source field
* Add a new way to return filesystem specific mount options in an
option array that returns them separated by zero bytes and
unescaped. This allows callers to retrieve filesystem specific
mount options and immediately pass them to e.g., fsconfig()
without having to unescape or split them
* Report security (LSM) specific mount options in a separate
security option array. We don't lump them together with
filesystem specific mount options as security mount options are
generic and most users aren't interested in them
The format is the same as for the filesystem specific mount
option array
- Support relative paths in fsconfig()'s FSCONFIG_SET_STRING command
- Optimize acl_permission_check() to avoid costly {g,u}id ownership
checks if possible
- Use smp_mb__after_spinlock() to avoid full smp_mb() in evict()
- Add synchronous wakeup support for ep_poll_callback.
Currently, epoll only uses wake_up() to wake up tasks. But sometimes
there are epoll users which want to use the synchronous wakeup flag
to give a hint to the scheduler, e.g., the Android binder driver.
So add a wake_up_sync() define, and use wake_up_sync() when sync is
true in ep_poll_callback() (sketched after this message)
Fixes:
- Fix kernel documentation for inode_insert5() and iget5_locked()
- Annotate racy epoll check on file->f_ep
- Make F_DUPFD_QUERY associative
- Avoid filename buffer overrun in initramfs
- Don't let statmount() return empty strings
- Add a cond_resched() to dump_user_range() to avoid hogging the CPU
- Don't query the device logical blocksize multiple times for hfsplus
- Make filemap_read() check that the offset is positive or zero
Cleanups:
- Various typo fixes
- Cleanup wbc_attach_fdatawrite_inode()
- Add __releases annotation to wbc_attach_and_unlock_inode()
- Add hugetlbfs tracepoints
- Fix various vfs kernel doc parameters
- Remove obsolete TODO comment from io_cancel()
- Convert wbc_account_cgroup_owner() to take a folio
- Fix comments for BANDWIDTH_INTERVAL and wb_domain_writeout_add()
- Reorder struct posix_acl to save 8 bytes
- Annotate struct posix_acl with __counted_by()
- Replace one-element array with flexible array member in freevxfs
- Use idiomatic atomic64_inc_return() in alloc_mnt_ns()"
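A sketch of the synchronous-wakeup item above; the epoll internals are
simplified and the wait-queue name is illustrative:

    /* the define, added alongside the existing wake_up() helpers */
    #define wake_up_sync(x)         __wake_up_sync((x), TASK_NORMAL)

    /* in ep_poll_callback(), roughly: */
    if (sync)
            wake_up_sync(&ep->wq);  /* hint: the waker will block soon */
    else
            wake_up(&ep->wq);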
* tag 'vfs-6.13.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (35 commits)
statmount: retrieve security mount options
vfs: make evict() use smp_mb__after_spinlock instead of smp_mb
statmount: add flag to retrieve unescaped options
fs: add the ability for statmount() to report the sb_source
writeback: wbc_attach_fdatawrite_inode out of line
writeback: add a __releases annoation to wbc_attach_and_unlock_inode
fs: add the ability for statmount() to report the fs_subtype
fs: don't let statmount return empty strings
fs:aio: Remove TODO comment suggesting hash or array usage in io_cancel()
hfsplus: don't query the device logical block size multiple times
freevxfs: Replace one-element array with flexible array member
fs: optimize acl_permission_check()
initramfs: avoid filename buffer overrun
fs/writeback: convert wbc_account_cgroup_owner to take a folio
acl: Annotate struct posix_acl with __counted_by()
acl: Realign struct posix_acl to save 8 bytes
epoll: Add synchronous wakeup support for ep_poll_callback
coredump: add cond_resched() to dump_user_range
mm/page-writeback.c: Fix comment of wb_domain_writeout_add()
mm/page-writeback.c: Update comment for BANDWIDTH_INTERVAL
...
Merge tag 'vfs-6.13.mount.api' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs mount api conversions from Christian Brauner:
"Convert adfs, affs, befs, hfs, hfsplus, jfs, and hpfs to the new mount
api"
* tag 'vfs-6.13.mount.api' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
efs: fix the efs new mount api implementation
ubifs: Convert ubifs to use the new mount API
hpfs: convert hpfs to use the new mount api
jfs: convert jfs to use the new mount api
hfsplus: convert hfsplus to use the new mount api
hfs: convert hfs to use the new mount api
befs: convert befs to use the new mount api
affs: convert affs to use the new mount api
adfs: convert adfs to use the new mount api
Merge tag 'vfs-6.13.mgtime' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs multigrain timestamps from Christian Brauner:
"This is another try at implementing multigrain timestamps. This time
with significant help from the timekeeping maintainers to reduce the
performance impact.
Thomas provided a base branch that contains the required timekeeping
interfaces for the VFS. It serves as the base for the multi-grain
timestamp work:
- Multigrain timestamps allow the kernel to use fine-grained
timestamps when an inode's attributes are being actively observed
via ->getattr(). With this support, it's possible for a file to get
a fine-grained timestamp, and another modified after it to get a
coarse-grained stamp that is earlier than the fine-grained time. If
this happens then the files can appear to have been modified in
reverse order, which breaks VFS ordering guarantees.
To prevent this, a floor value is maintained for multigrain
timestamps. Whenever a fine-grained timestamp is handed out, record
it, and when later coarse-grained stamps are handed out, ensure
they are not earlier than that value. If the coarse-grained
timestamp is earlier than the fine-grained floor, return the floor
value instead.
The timekeeper changes add a static singleton atomic64_t into
timekeeper.c that is used to keep track of the latest fine-grained
time ever handed out. This is tracked as a monotonic ktime_t value
to ensure that it isn't affected by clock jumps. Because it is
updated at different times than the rest of the timekeeper object,
the floor value is managed independently of the timekeeper via a
cmpxchg() operation, and sits on its own cacheline.
Two new public timekeeper interfaces are added:
(1) ktime_get_coarse_real_ts64_mg() fills a timespec64 with the
later of the coarse-grained clock and the floor time
(2) ktime_get_real_ts64_mg() gets the fine-grained clock value,
and tries to swap it into the floor. A timespec64 is filled
with the result.
- The VFS has always used coarse-grained timestamps when updating the
ctime and mtime after a change. This has the benefit of allowing
filesystems to optimize away a lot of metadata updates, down to around
1 per jiffy, even when a file is under heavy writes.
Unfortunately, this has always been an issue when we're exporting
via NFSv3, which relies on timestamps to validate caches. A lot of
changes can happen in a jiffy, so timestamps aren't sufficient to
help the client decide when to invalidate the cache. Even with
NFSv4, a lot of exported filesystems don't properly support a
change attribute and are subject to the same problems with
timestamp granularity. Other applications have similar issues with
timestamps (e.g. backup applications).
If we were to always use fine-grained timestamps, that would
improve the situation, but that becomes rather expensive, as the
underlying filesystem would have to log a lot more metadata
updates.
This adds a way to only use fine-grained timestamps when they are
being actively queried. Use the (unused) top bit in
inode->i_ctime_nsec as a flag that indicates whether the current
timestamps have been queried via stat() or the like. When it's set,
we allow the kernel to use a fine-grained timestamp iff it's
necessary to make the ctime show a different value.
This solves the problem of being able to distinguish the timestamp
between updates, but introduces a new problem: it's now possible
for a file being changed to get a fine-grained timestamp. A file
that is altered just a bit later can then get a coarse-grained one
that appears older than the earlier fine-grained time. This
violates timestamp ordering guarantees.
This is where the earlier mentioned timekeeping interfaces help (a
code sketch follows this message). A
global monotonic atomic64_t value is kept that acts as a timestamp
floor. When we go to stamp a file, we first get the latter of the
current floor value and the current coarse-grained time. If the
inode ctime hasn't been queried then we just attempt to stamp it
with that value.
If it has been queried, then first see whether the current coarse
time is later than the existing ctime. If it is, then we accept
that value. If it isn't, then we get a fine-grained time and try to
swap that into the global floor. Whether that succeeds or fails, we
take the resulting floor time, convert it to realtime and try to
swap that into the ctime.
We take the result of the ctime swap whether it succeeds or fails,
since either is just as valid.
Filesystems can opt into this by setting the FS_MGTIME fstype flag.
Others should be unaffected (other than being subject to the same
floor value as multigrain filesystems)"
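A minimal sketch of the floor handling described above; names and locking
are simplified relative to the real timekeeping code:

    static atomic64_t mg_floor; /* latest fine-grained time handed out */

    /* fine-grained read: fetch the clock, try to advance the floor */
    void ktime_get_real_ts64_mg(struct timespec64 *ts)
    {
            ktime_t mono = ktime_get();     /* monotonic, jump-proof */
            ktime_t old = atomic64_read(&mg_floor);

            if (ktime_after(mono, old) &&
                atomic64_try_cmpxchg(&mg_floor, &old, mono)) {
                    ktime_get_real_ts64(ts);    /* we own the new floor */
            } else {
                    /* raced with a later fine-grained reader: hand out
                     * the (newer) floor value instead */
                    *ts = ktime_to_timespec64(ktime_mono_to_real(old));
            }
    }

    /* coarse-grained read: never return a time below the floor */
    void ktime_get_coarse_real_ts64_mg(struct timespec64 *ts)
    {
            ktime_t floor = ktime_mono_to_real(atomic64_read(&mg_floor));

            ktime_get_coarse_real_ts64(ts);
            if (ktime_before(timespec64_to_ktime(*ts), floor))
                    *ts = ktime_to_timespec64(floor);
    }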
* tag 'vfs-6.13.mgtime' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
fs: reduce pointer chasing in is_mgtime() test
tmpfs: add support for multigrain timestamps
btrfs: convert to multigrain timestamps
ext4: switch to multigrain timestamps
xfs: switch to multigrain timestamps
Documentation: add a new file documenting multigrain timestamps
fs: add percpu counters for significant multigrain timestamp events
fs: tracepoints around multigrain timestamp events
fs: handle delegated timestamps in setattr_copy_mgtime
timekeeping: Add percpu counter for tracking floor swap events
timekeeping: Add interfaces for handling timestamps with a floor value
fs: have setattr_copy handle multigrain timestamps appropriately
fs: add infrastructure for multigrain timestamps
Correct spelling here and there as suggested by codespell.
Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
strcpy() performs no bounds checking on the destination buffer. This
could result in linear overflows beyond the end of the buffer, leading
to all kinds of misbehaviors [1].
This fixes checkpatch warning:
WARNING: Prefer strscpy over strcpy
[1] https://www.kernel.org/doc/html/latest/process/deprecated.html#strcpy
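Illustrative before/after (the buffer is made up):

    char name[32];

    strcpy(name, src);                /* unbounded: can overflow name    */
    strscpy(name, src, sizeof(name)); /* bounded, always NUL-terminates; */
                                      /* returns -E2BIG on truncation    */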
[ idryomov: formatting ]
Signed-off-by: Abdul Rahim <abdul.rahim@myyahoo.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Previously, the "name" in the new device syntax "<name>@<fsid>.<fsname>"
was ignored because (presumably) tests were done using mount.ceph which
also passed the entity name using "-o name=foo". If mounting is done
without the mount.ceph helper, the new device id syntax fails to set
the name properly.
Cc: stable@vger.kernel.org
Link: https://tracker.ceph.com/issues/68516
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
ceph_caps_revoking() has been unused since 2017's commit
3fb99d483e ("ceph: nuke startsync op")
ceph_mdsc_open_export_target_sessions() has been unused since 2013's
commit 11df2dfb61 ("ceph: add imported caps when handling cap export message")
Remove them.
Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Ensure the superblock is kept alive until we're done with iput().
Holding a reference to an inode is not allowed unless we ensure the
superblock stays alive, which fsnotify does by keeping the
watched_objects count elevated, so iput() must happen before the
watched_objects decrement.
This can lead to a UAF of something like sb->s_fs_info in tmpfs, but the
UAF is hard to hit because race orderings that oops are more likely, thanks
to the CHECK_DATA_CORRUPTION() block in generic_shutdown_super().
Also, ensure that fsnotify_put_sb_watched_objects() doesn't call
fsnotify_sb_watched_objects() on a superblock that may have already been
freed, which would cause a UAF read of sb->s_fsnotify_info.
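In other words, the required teardown order is (sketch, with the
surrounding fsnotify details omitted):

    /* the watched-objects count is what keeps the sb alive here, so the
     * inode reference must be dropped while it is still elevated */
    iput(inode);                         /* may drop the last inode ref */
    fsnotify_put_sb_watched_objects(sb); /* only now may the sb go away */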
Cc: stable@kernel.org
Fixes: d2f277e26f ("fsnotify: rename fsnotify_{get,put}_sb_connectors()")
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Jan Kara <jack@suse.cz>
If dlm_recover_members() fails, we don't drop the references on the
previously created root_list, which holds and keeps all rsbs alive during
recovery. This is not an unlikely event, because ping_members() can run
into -EINTR if another recovery was triggered again.
Fixes: 3a747f4a2e ("dlm: move rsb root_list to ls_recover() stack")
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
This reverts commit b571cfcb9d.
This patch appears to assume that if one request is complete, then the
others will complete too before unlocking. That is not a valid
assumption, since other requests could hit a non-fatal error or a short
write that would cause them not to complete.
Reported-by: Igor Raits <igor@gooddata.com>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=219508
Fixes: b571cfcb9d ("nfs: don't reuse partially completed requests in nfs_lock_and_join_requests")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
When using FUSE DAX with virtiofs, cache coherency is managed by the
host. Disk persistence is handled via fsync() and friends, which are
passed directly via the FUSE layer to the host. Therefore, there's no
need to do dax_writeback_mapping_range(). All that ends up doing is a
cache flush operation, which is not caught by KVM and doesn't do much,
since the host and guest are already cache-coherent.
Since dax_writeback_mapping_range() checks that the inode block size is
equal to PAGE_SIZE, this fixes a spurious WARN when virtiofs is used
with a mismatched guest PAGE_SIZE and virtiofs backing FS block size
(this happens, for example, when it's a tmpfs and the host and guest
have a different PAGE_SIZE). FUSE DAX does not require any particular FS
block size, since it always performs DAX mappings in aligned 2MiB
blocks.
See discussion in [1].
[1] https://lore.kernel.org/lkml/20241101-dax-page-size-v1-1-eedbd0c6b08f@asahilina.net/T/#u
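For context, the callback this makes redundant boiled down to the
following (a sketch of fs/fuse/dax.c; details may differ):

    static int fuse_dax_writepages(struct address_space *mapping,
                                   struct writeback_control *wbc)
    {
            struct inode *inode = mapping->host;
            struct fuse_conn *fc = get_fuse_conn(inode);

            /* with a cache-coherent host this is only a CPU cache flush;
             * persistence is handled by fsync() through the FUSE layer */
            return dax_writeback_mapping_range(mapping, fc->dax->dev, wbc);
    }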
[SzM: remove the empty callback]
Suggested-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Asahi Lina <lina@asahilina.net>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
fuse_direntplus_link() might call fuse_iget() to initialize a new
fuse_inode and change its attributes. If fi->attr_version is always
initialized to 0, then even if the attributes returned by the FUSE_READDIR
request are stale, fuse_change_attributes() will still apply the stale
attributes to the inode, because the new fi->attr_version is 0. This wrong
behaviour may cause file size inconsistency even when there are no changes
on the server side.
To reproduce the issue, consider the following 2 programs (A and B) are
running concurrently,
A B
---------------------------------- --------------------------------
{ /fusemnt/dir/f is a file path in a fuse mount, the size of f is 0. }
readdir(/fusemnt/dir) start
//Daemon set size 0 to f direntry
fallocate(f, 1024)
stat(f) // B see size 1024
echo 2 > /proc/sys/vm/drop_caches
readdir(/fusemnt/dir) reply to kernel
Kernel set 0 to the I_NEW inode
stat(f) // B see size 0
In the above case, only program B modifies the file size; however, B
observes the file size changing between its two 'readonly' stat() calls.
To fix this issue, we should make sure readdirplus still follows the rule
of attr_version staleness checking even if fi->attr_version is lost due to
inode eviction.
To identify this situation, the new fc->evict_ctr is used to record whether
the eviction of inodes occurs during the readdirplus request processing.
If it does, the result of readdirplus may be inaccurate; otherwise, the
result of readdirplus can be trusted. Although this may still lead to
incorrect invalidation, considering the relatively low frequency of
evict occurrences, it should be acceptable.
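A sketch of the counter check; fuse_get_evict_ctr() is a hypothetical
accessor here, and the real patch plumbs the value through the readdir
code rather than open-coding it like this:

    /* snapshot the eviction counter before sending FUSE_READDIRPLUS */
    u64 evict_ctr = fuse_get_evict_ctr(fc);   /* hypothetical helper */

    /* ... send the request, parse the reply ... */

    /* an eviction in the meantime could have reset fi->attr_version,
     * defeating the staleness check: don't trust the reply attributes */
    if (fuse_get_evict_ctr(fc) != evict_ctr)
            attrs_trustworthy = false;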
Link: https://lore.kernel.org/lkml/20230711043405.66256-2-zhangjiachen.jaycee@bytedance.com/
Link: https://lore.kernel.org/lkml/20241114070905.48901-1-zhangtianci.1997@bytedance.com/
Reported-by: Jiachen Zhang <zhangjiachen.jaycee@bytedance.com>
Suggested-by: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Zhang Tianci <zhangtianci.1997@bytedance.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Adjust sb->s_blocksize{,_bits} directly for file-backed
mounts when the fs block size is smaller than PAGE_SIZE.
Previously, EROFS used sb_set_blocksize(), which caused
a panic if a bdev-backed mount is not used.
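A minimal sketch of the adjustment (illustrative; erofs takes the bits
from the on-disk super block):

    /* sb_set_blocksize() dereferences sb->s_bdev, which file-backed
     * mounts don't have; set the fields directly instead */
    sb->s_blocksize = 1 << blkszbits;
    sb->s_blocksize_bits = blkszbits;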
Fixes: fb17675026 ("erofs: add file-backed mount support")
Signed-off-by: Hongzhen Luo <hongzhen@linux.alibaba.com>
Link: https://lore.kernel.org/r/20241015103836.3757438-1-hongzhen@linux.alibaba.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
After commit 927e5010ff ("erofs: use kmap_local_page() only for
erofs_bread()"), `buf->kmap_type` actually has no use at all.
Let's get rid of `buf->kmap_type` now.
Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20241114095813.839866-1-hsiangkao@linux.alibaba.com
Add a sysfs node to drop compression-related caches, currently used to
drop in-memory pclusters and cached compressed folios.
Signed-off-by: Chunhai Guo <guochunhai@vivo.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20241113041148.749129-1-guochunhai@vivo.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Once a pcluster is fully decompressed and there are no attached cached
folios, its corresponding `struct z_erofs_pcluster` will be freed. This
will significantly reduce the frequency of calls to erofs_shrink_scan()
and the memory allocated for `struct z_erofs_pcluster`.
The tables below show approximately a 96% reduction in the calls to
erofs_shrink_scan() and in the memory allocated for `struct
z_erofs_pcluster` after applying this patch. The results were obtained
by performing a test to copy a 4.1GB partition on ARM64 Android devices
running the 6.6 kernel with an 8-core CPU and 12GB of memory.
1. The reduction in calls to erofs_shrink_scan():
+-----------------+-----------+----------+---------+
| | w/o patch | w/ patch | diff |
+-----------------+-----------+----------+---------+
| Average (times) | 11390 | 390 | -96.57% |
+-----------------+-----------+----------+---------+
2. The reduction in memory released by erofs_shrink_scan():
+-----------------+-----------+----------+---------+
| | w/o patch | w/ patch | diff |
+-----------------+-----------+----------+---------+
| Average (Byte) | 133612656 | 4434552 | -96.68% |
+-----------------+-----------+----------+---------+
Signed-off-by: Chunhai Guo <guochunhai@vivo.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20241112043235.546164-1-guochunhai@vivo.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
`struct erofs_workgroup` was introduced to provide a unique header
for all physically indexed objects. However, after big pclusters and
shared pclusters are implemented upstream, it seems that all EROFS
encoded data (which requires transformation) can be represented with
`struct z_erofs_pcluster` directly.
Move all members into `struct z_erofs_pcluster` for simplicity.
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20241021035323.3280682-3-hsiangkao@linux.alibaba.com
Customers have reported use-after-free in @ses->auth_key.response with
SMB2.1 + sign mounts, which occurs due to the following race:
task A task B
cifs_mount()
dfs_mount_share()
get_session()
cifs_mount_get_session() cifs_send_recv()
cifs_get_smb_ses() compound_send_recv()
cifs_setup_session() smb2_setup_request()
kfree_sensitive() smb2_calc_signature()
crypto_shash_setkey() *UAF*
Fix this by ensuring that we have a valid @ses->auth_key.response by
checking whether @ses->ses_status is SES_GOOD or SES_EXITING with
@ses->ses_lock held. After commit 24a9799aa8 ("smb: client: fix UAF
in smb2_reconnect_server()"), we made sure to call ->logoff() only
when @ses was known to be good (e.g. valid ->auth_key.response), so
it's safe to access signing key when @ses->ses_status == SES_EXITING.
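A sketch of the check (simplified; the error value is illustrative and
the real fix applies this where the signing key is looked up):

    spin_lock(&ses->ses_lock);
    if (ses->ses_status != SES_GOOD && ses->ses_status != SES_EXITING) {
            /* ->auth_key.response cannot be trusted yet */
            spin_unlock(&ses->ses_lock);
            return -EAGAIN;
    }
    /* SES_GOOD or SES_EXITING: the signing key is valid to use */
    spin_unlock(&ses->ses_lock);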
Cc: stable@vger.kernel.org
Reported-by: Jay Shin <jaeshin@redhat.com>
Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Remove hard-coded strings by using the str_yes_no() helper function.
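For example (an illustrative call site, not a line from the patch):

    #include <linux/string_choices.h>

    /* before: seq_printf(m, "Signing: %s\n", sign ? "yes" : "no"); */
    seq_printf(m, "Signing: %s\n", str_yes_no(sign));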
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Signed-off-by: Steve French <stfrench@microsoft.com>
Like commit f1f047bd7c ("smb: client: Fix -Wstringop-overflow issues"),
adjust the memcpy() destination address to be based off the surrounding
object rather than based off the 4-byte "Protocol" member. This avoids a
build-time warning when compiling under CONFIG_FORTIFY_SOURCE with GCC 15:
In function 'fortify_memcpy_chk',
inlined from 'CIFSSMBSetPathInfo' at ../fs/smb/client/cifssmb.c:5358:2:
../include/linux/fortify-string.h:571:25: error: call to '__write_overflow_field' declared with attribute warning: detected write beyond size of field (1st parameter); maybe use struct_group()? [-Werror=attribute-warning]
571 | __write_overflow_field(p_size_field, size);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
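The pattern, roughly (member and offsets are illustrative, not the exact
cifssmb.c expression):

    /* before: destination based off the 4-byte member, so fortify
     * assumes only that member may legally be written */
    memcpy(&req->hdr.Protocol, data, count);    /* warns if count > 4 */

    /* after: base the destination on the surrounding object, so the
     * compiler knows the write stays inside the whole structure */
    memcpy((char *)req + offsetof(typeof(*req), hdr.Protocol),
           data, count);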
Signed-off-by: Kees Cook <kees@kernel.org>
Reviewed-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
CIFSSMBCopy() is unused, remove it.
It seems to have been that way pre-git; looking in a historic
archive, I think it landed around May 2004 in Linus'
BKrev: 40ab7591J_OgkpHW-qhzZukvAUAw9g
and was unused back then.
Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Acked-by: Tom Talpey <tom@talpey.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Merge tag 'mm-hotfixes-stable-2024-11-16-15-33' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull hotfixes from Andrew Morton:
"10 hotfixes, 7 of which are cc:stable. All singletons, please see the
changelogs for details"
* tag 'mm-hotfixes-stable-2024-11-16-15-33' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
mm: revert "mm: shmem: fix data-race in shmem_getattr()"
ocfs2: uncache inode which has failed entering the group
mm: fix NULL pointer dereference in alloc_pages_bulk_noprof
mm, doc: update read_ahead_kb for MADV_HUGEPAGE
fs/proc/task_mmu: prevent integer overflow in pagemap_scan_get_args()
sched/task_stack: fix object_is_on_stack() for KASAN tagged pointers
crash, powerpc: default to CRASH_DUMP=n on PPC_BOOK3S_32
mm/mremap: fix address wraparound in move_page_tables()
tools/mm: fix compile error
mm, swap: fix allocation and scanning race with swapoff
The helper functions v9fs_cache_register and v9fs_cache_unregister are
trivial helper functions that don't offer any extra functionality and
are unnecessary. Replace them with direct calls to v9fs_init_inode_cache
and v9fs_destroy_inode_cache respectively to simplify the code.
Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Message-ID: <20241107095756.10261-1-colin.i.king@gmail.com>
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
The __filemap_get_folio() function returns error pointers.
It never returns NULL. So use IS_ERR() to check it.
Fixes: 1da86618bd ("fs: Convert aops->write_begin to take a folio")
Signed-off-by: ZhangPeng <zhangpeng362@huawei.com>
Acked-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: Richard Weinberger <richard@nod.at>
In one special case, recovery is unable to reliably rebuild
lock state by simply recreating lkb structs as sent from the
lock holders. That case is when the lkb's include conversions
between PR and CW modes.
The recovery code has always recognized this special case,
but the implementation has always been broken, and would set
invalid modes in recovered lkb's. Unpredictable or bogus
errors could then be returned for further locking calls on
these locks.
This bug has gone unnoticed for so long due to some
combination of:
- applications never or infrequently converting between PR/CW
- recovery not occurring during these conversions
- if the recovery bug does occur, the caller may not notice,
depending on what further locking calls are made, e.g. if
the lock is simply unlocked it may go unnoticed
However, a core analysis from a recent gfs2 bug report points
to this broken code.
PR = Protected Read
CW = Concurrent Write
PR and CW are incompatible
PR and PR are compatible
CW and CW are compatible
Example 1
node C, resource R
granted: PR node A
granted: PR node B
granted: NL node C
granted: NL node D
- A sends convert PR->CW to C
- C fails before A gets a reply
- recovery occurs
At this point, A does not know if it still holds
the lock in PR, or if its conversion to CW was granted:
- If A's conversion to CW was granted, then another
node's CW lock may also have been granted.
- If A's conversion to CW was not granted, it still
holds a PR lock, and other nodes may also hold PR locks.
So, the new master of R cannot simply recreate the lock
from A using granted mode PR and requested mode CW.
The new master must look at all the recovered locks to
determine the correct granted modes, and ensure that all
the recovered locks are recreated in compatible states.
The correct lock recovery steps in this example are:
- node D becomes the new master of R
- node B sends D its lkb, granted PR
- node A sends D its lkb, convert PR->CW
- D determines the correct lock state is:
granted: PR node B
convert: PR->CW node A
The lkb sent by each node was recreated without
any change on the new master node.
Example 2
node C, resource R
granted: PR node A
granted: NL node C
granted: NL node D
waiting: CW node B
- A sends convert PR->CW to C
- C grants the conversion to CW for A
- C grants the waiting request for CW to B
- C sends granted message to B, but fails
before it can send the granted message to A
- B receives the granted message from C
At this point:
- A believes it is converting PR->CW
- B believes it is holding a CW lock
The correct lock recovery steps in this example are:
- node D becomes the new master of R
- node A sends D its lkb, convert PR->CW
- node B sends D its lkb, granted CW
- D determines the correct lock state is:
granted: CW node B
granted: CW node A
The lkb sent by B is recreated without change,
but the lkb sent by A is changed because the
granted mode was not compatible.
Fixes to make this work correctly:
recover_convert_waiter: should not make any changes
to a converting lkb that is still waiting for a reply
message. It was previously setting grmode to IV, which
is an invalid state, so the lkb would not be handled
correctly by other code.
receive_rcom_lock_args: was checking the wrong lkb field
(wait_type instead of status) to determine if the lkb is
being converted, and in need of inspection for this special
recovery. It was also setting grmode to IV in the lkb,
causing it to be mishandled by other code.
Now, this function just puts the lkb, directly as sent,
onto the convert queue of the resource being recovered,
and corrects it in recover_conversion() later, if needed.
recover_conversion: the job of this function is to detect
and correct lkb states for the special PR/CW conversions.
The new code now checks for recovered lkbs on the granted
queue with grmode PR or CW, and takes the real grmode from
that. Then it looks for lkbs on the convert queue with an
incompatible grmode (i.e. grmode PR when the real grmode is
CW, or vice versa). These converting lkbs need to be fixed.
They are fixed by temporarily setting their grmode to NL,
so that grmodes are not incompatible and won't confuse other
locking code. The converting lkb will then be granted at
the end of recovery, replacing the temporary NL grmode.
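A rough sketch of the recover_conversion() logic described above (the
queue walks are illustrative; the real patch differs in detail):
```
int real_grmode = DLM_LOCK_IV;

/* a recovered granted lock with PR or CW reveals the real grmode */
list_for_each_entry(lkb, &r->res_grantqueue, lkb_statequeue)
        if (lkb->lkb_grmode == DLM_LOCK_PR ||
            lkb->lkb_grmode == DLM_LOCK_CW)
                real_grmode = lkb->lkb_grmode;

/* converting lkbs with an incompatible grmode get a temporary NL
 * grmode; they are granted properly at the end of recovery */
list_for_each_entry(lkb, &r->res_convertqueue, lkb_statequeue)
        if (real_grmode != DLM_LOCK_IV &&
            !dlm_modes_compat(lkb->lkb_grmode, real_grmode))
                lkb->lkb_grmode = DLM_LOCK_NL;
```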
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmc2aGIACgkQxWXV+ddt
WDu0AQ/9FLfC/e3X2GjZ0auna/7A/rF8MPoATUdAyHn75Md6Hc8PXpi1YvMph+ba
pvoufOqrU66/g0UNeacgsp6rF4rKJHg0q9Id+7wueLnDr41g9paXjsLYItq4j26w
GusDiZvUDwuDmb70vlTXrgAfnjooIdSwqJTlIzJxvl4wrNzOiUlUJtTMzmUrwn/9
Lf/iByWlGcPKKBc+1ZzFz4HlVOZZSt9YePeJw2/Aul2OMtuI3RTTAL/NtjaFIlYc
pb+NHVqFrrfgC+xo68hLBmnsBfS41EGR58rYRjEuQo0+hARa8WbxL3DNA/E/Vi5X
dsq/wQVlD7IVIWCoF9J94/iyDdwlDOGFMoL6FUrJwDtPGN/v/xxtA6ruvuC7k5zy
bHCR8ZVrJWVaxE7u0Gtl+hFPpDTwNTR7SfvK69gxPfci1cN0m2wCNK02SEUJwV09
N82N2ENGGwyWS+nOl/ERB+7A0QxViMr3JpUrPzSYqsmn8bwDvovSjK2fFouJoSey
bpAzbFWj+OS0O9nnRqabTJDM/Tk9O0s0Ye76aUS+Vfk9d5EuVfAg6pHiOBcFDhsK
UEG9QbPltfh6LPDHCdV93HOOsC0uNxCTCSpbQ9LFGKBICQsPIX/vZeV45fNFJDLX
j5kEtHFVU3snU+jA97nvYXPRANDnnNx/EzXv7zo0Ye8L+plecBs=
=ssYj
-----END PGP SIGNATURE-----
Merge tag 'for-6.12-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fix from David Sterba:
"One more fix that seems urgent and good to have in 6.12 final.
It could potentially lead to unexpected transaction aborts, due to
wrong comparison and order of processing of delayed refs"
* tag 'for-6.12-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix incorrect comparison for delayed refs
Clang warns (or errors with CONFIG_WERROR=y):
fs/ubifs/journal.c:986:20: error: variable 'err' is uninitialized when used here [-Werror,-Wuninitialized]
986 | ubifs_ro_mode(c, err);
| ^~~
Set err to -EPERM before the call to ubifs_ro_mode() and reuse it in the
return statement to resolve the warning.
Fixes: 957e1c4e17 ("ubifs: ubifs_jnl_write_inode: Only check once for the limitation of xattr count")
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Richard Weinberger <richard@nod.at>
Under the new mount API, mount options are parsed one at a time.
Any validation that examines multiple options must be done after parsing
is complete, so factor out an ecryptfs_validate_options() which can be
called separately.
To facilitate this, temporarily move the local variables that tracked
whether various options have been set in the parsing function, into the
ecryptfs_mount_crypt_stat structure so that they can be examined later.
These will be moved to a more ephemeral struct in the mount api conversion
patch to follow.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Link: https://lore.kernel.org/r/20241028143359.605061-2-sandeen@redhat.com
Acked-by: Tyler Hicks <code@tyhicks.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Teach open_by_handle_at(2) about the type format of "explicit connectable"
file handles that were created using the AT_HANDLE_CONNECTABLE flag to
name_to_handle_at(2).
When decoding an "explicit connectable" file handle, open_by_handle_at(2)
should fail if it cannot open a "connected" fd with a known path, which is
accessible (to a capable user) from the mount fd path.
Note that this does not check if the path is accessible to the calling
user, just that it is accessible wrt the mount namespace, so if there
is no "connected" alias, or if parts of the path are hidden in the
mount namespace, open_by_handle_at(2) will return -ESTALE.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Link: https://lore.kernel.org/r/20241011090023.655623-4-amir73il@gmail.com
Fixes: 570df4e9c2 ("ceph: snapshot nfs re-export")
Acked-by:
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
nfsd encodes "connectable" file handles for the subtree_check feature,
which can be resolved to an open file with a connected path.
So far, userspace nfs server could not make use of this functionality.
Introduce a new flag AT_HANDLE_CONNECTABLE to name_to_handle_at(2).
When used, the encoded file handle is "explicitly connectable".
The "explicitly connectable" file handle sets bits in the high 16bit of
the handle_type field, so open_by_handle_at(2) will know that it needs
to open a file with a connected path.
Old kernels will not recognize a handle_type with high bits set,
so "explicitly connectable" file handles cannot be decoded by
open_by_handle_at(2) on old kernels.
The flag AT_HANDLE_CONNECTABLE is not allowed together with either
AT_HANDLE_FID or AT_EMPTY_PATH.
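A userspace sketch of encoding a connectable handle (error handling
trimmed; AT_HANDLE_CONNECTABLE is the new flag added here):
```
/* needs _GNU_SOURCE, <fcntl.h>, <stdlib.h>, <err.h> */
struct file_handle *fh;
int mount_id;

fh = malloc(sizeof(*fh) + MAX_HANDLE_SZ);
fh->handle_bytes = MAX_HANDLE_SZ;

if (name_to_handle_at(AT_FDCWD, "dir/file", fh, &mount_id,
                      AT_HANDLE_CONNECTABLE) == -1)
        err(1, "name_to_handle_at");
/* fh->handle_type now has high bits set to mark it connectable */
```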
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Link: https://lore.kernel.org/r/20241011090023.655623-3-amir73il@gmail.com
Fixes: 570df4e9c2 ("ceph: snapshot nfs re-export")
Acked-by:
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
We would like to use the high 16bit of the handle_type field to encode
file handle traits, such as "connectable".
In preparation for this change, make sure that filesystems do not return
a handle_type value with upper bits set and that the open_by_handle_at(2)
syscall rejects these handle types.
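Conceptually, the validation amounts to something like this (the mask
value is illustrative only):
```
/* illustrative check: the high 16 bits are reserved for traits */
if (handle->handle_type & 0xffff0000)
        return -EINVAL;
```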
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Link: https://lore.kernel.org/r/20241011090023.655623-2-amir73il@gmail.com
Fixes: 570df4e9c2 ("ceph: snapshot nfs re-export")
Acked-by:
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Stop using struct fd to return a real file from ovl_real_fdget(),
because we no longer return a temporary file object and the callers
always get a borrowed file reference.
Rename the helper to ovl_real_file(), return a borrowed reference of
the real file that is referenced from the overlayfs file or an error.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Stop using struct fd to return a real file from ovl_real_fdget_path(),
because we no longer return a temporary file object and the callers
always get a borrowed file reference.
Rename the helper to ovl_real_file_path(), return a borrowed reference
of the real file that is referenced from the overlayfs file or an error.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
When an overlayfs file is opened as lower and then the file is copied up,
every operation on the overlayfs open file will open a temporary backing
file to the upper dentry and close it at the end of the operation.
Store the upper real file along side the original (lower) real file in
ovl_file instead of opening a temporary upper file on every operation.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Instead of using ->private_data to point at realfile directly, so
that we can add more context per ovl open file.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
ovl_fsync() with !datasync opens a backing file from the top most dentry
in the stack, checks if this dentry is non-upper and skips the fsync.
In case of an overlay dentry stack with lower data and lower metadata
above it, but without an upper metadata above it, the backing file is
opened from the top most lower metadata dentry and never used.
Refactor the helper ovl_real_fdget_meta() into ovl_real_fdget_path() and
open code the checks for non-upper inode in ovl_fsync(), so in that case
we can avoid the unneeded backing file open.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Use override_creds_light() in ovl_override_creds() and
revert_creds_light() in ovl_revert_creds().
The _light() functions do not change the 'usage' of the credentials in
question, as they refer to the credentials associated with the
mounter, which have a longer lifetime.
In ovl_setup_cred_for_create(), we do not need to modify the 'usage'
counter of the mounter credentials (returned by override_creds_light()).
Add a warning to verify that we are indeed working with the mounter
credentials (stored in the superblock). Failure in this assumption
means that creds may leak.
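A sketch of the resulting pattern (illustrative; creator_cred is the
mounter's long-lived cred stored in the superblock):
```
const struct cred *old_cred;

/* no get_cred()/put_cred(): the mounter creds outlive this override */
old_cred = override_creds_light(ofs->creator_cred);
/* ... privileged work on the upper/lower layers ... */
revert_creds_light(old_cred);
```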
Suggested-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Syzbot has reported the following BUG:
kernel BUG at fs/ocfs2/uptodate.c:509!
...
Call Trace:
<TASK>
? __die_body+0x5f/0xb0
? die+0x9e/0xc0
? do_trap+0x15a/0x3a0
? ocfs2_set_new_buffer_uptodate+0x145/0x160
? do_error_trap+0x1dc/0x2c0
? ocfs2_set_new_buffer_uptodate+0x145/0x160
? __pfx_do_error_trap+0x10/0x10
? handle_invalid_op+0x34/0x40
? ocfs2_set_new_buffer_uptodate+0x145/0x160
? exc_invalid_op+0x38/0x50
? asm_exc_invalid_op+0x1a/0x20
? ocfs2_set_new_buffer_uptodate+0x2e/0x160
? ocfs2_set_new_buffer_uptodate+0x144/0x160
? ocfs2_set_new_buffer_uptodate+0x145/0x160
ocfs2_group_add+0x39f/0x15a0
? __pfx_ocfs2_group_add+0x10/0x10
? __pfx_lock_acquire+0x10/0x10
? mnt_get_write_access+0x68/0x2b0
? __pfx_lock_release+0x10/0x10
? rcu_read_lock_any_held+0xb7/0x160
? __pfx_rcu_read_lock_any_held+0x10/0x10
? smack_log+0x123/0x540
? mnt_get_write_access+0x68/0x2b0
? mnt_get_write_access+0x68/0x2b0
? mnt_get_write_access+0x226/0x2b0
ocfs2_ioctl+0x65e/0x7d0
? __pfx_ocfs2_ioctl+0x10/0x10
? smack_file_ioctl+0x29e/0x3a0
? __pfx_smack_file_ioctl+0x10/0x10
? lockdep_hardirqs_on_prepare+0x43d/0x780
? __pfx_lockdep_hardirqs_on_prepare+0x10/0x10
? __pfx_ocfs2_ioctl+0x10/0x10
__se_sys_ioctl+0xfb/0x170
do_syscall_64+0xf3/0x230
entry_SYSCALL_64_after_hwframe+0x77/0x7f
...
</TASK>
When 'ioctl(OCFS2_IOC_GROUP_ADD, ...)' has failed for the particular
inode in 'ocfs2_verify_group_and_input()', corresponding buffer head
remains cached and subsequent call to the same 'ioctl()' for the same
inode issues the BUG() in 'ocfs2_set_new_buffer_uptodate()' (trying
to cache the same buffer head of that inode). Fix this by uncaching
the buffer head with 'ocfs2_remove_from_cache()' on error path in
'ocfs2_group_add()'.
Link: https://lkml.kernel.org/r/20241114043844.111847-1-dmantipov@yandex.ru
Fixes: 7909f2bf83 ("[PATCH 2/2] ocfs2: Implement group add for online resize")
Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru>
Reported-by: syzbot+453873f1588c2d75b447@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=453873f1588c2d75b447
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Dmitry Antipov <dmantipov@yandex.ru>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Jun Piao <piaojun@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The "arg->vec_len" variable is a u64 that comes from the user at the start
of the function. The "arg->vec_len * sizeof(struct page_region)"
multiplication can lead to integer wrapping. Use size_mul() to avoid
that.
Also the size_add/mul() functions work on unsigned long so for 32bit
systems we need to ensure that "arg->vec_len" fits in an unsigned long.
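A sketch of the hardened check (placement and surrounding code are
illustrative):
```
/* vec_len is a u64 from userspace; refuse values that cannot be
 * represented in an unsigned long on 32-bit kernels */
if (arg->vec_len > ULONG_MAX)
        return -EINVAL;

/* size_mul() saturates at SIZE_MAX instead of wrapping */
if (!access_ok((void __user *)(long)arg->vec,
               size_mul(arg->vec_len, sizeof(struct page_region))))
        return -EFAULT;
```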
Link: https://lkml.kernel.org/r/39d41335-dd4d-48ed-8a7f-402c57d8ea84@stanley.mountain
Fixes: 52526ca7fd ("fs/proc/task_mmu: implement IOCTL to get and optionally clear info about PTEs")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Cc: Andrei Vagin <avagin@google.com>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: David Hildenbrand <david@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Cc: Muhammad Usama Anjum <usama.anjum@collabora.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Peter Xu <peterx@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The rtime decompression routine does not fully check bounds during the
entirety of the decompression pass and can corrupt memory outside the
decompression buffer if the compressed data is corrupted. This adds the
required check to prevent this failure mode.
Cc: stable@vger.kernel.org
Signed-off-by: Kinsey Moore <kinsey.moore@oarcorp.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
The check for outpos > pos is always false because outpos is zero
and pos is at least zero; outpos can never be greater than pos.
The check is redundant and can be removed.
Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
When building the kernel with -Wmaybe-uninitialized, the compiler
reports this warning:
In function 'jffs2_mark_erased_block',
inlined from 'jffs2_erase_pending_blocks' at fs/jffs2/erase.c:116:4:
fs/jffs2/erase.c:474:9: warning: 'bad_offset' may be used uninitialized [-Wmaybe-uninitialized]
474 | jffs2_erase_failed(c, jeb, bad_offset);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
fs/jffs2/erase.c: In function 'jffs2_erase_pending_blocks':
fs/jffs2/erase.c:402:18: note: 'bad_offset' was declared here
402 | uint32_t bad_offset;
| ^~~~~~~~~~
When mtd->point() is used, jffs2_erase_pending_blocks can return -EIO
without initializing bad_offset, which is later used at the filebad
label in jffs2_mark_erased_block.
Fix it by initializing this variable.
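The fix is a one-line initialization, for instance (the fallback value
here is an assumption; see the patch for the actual choice):
```
/* give the -EIO path at the filebad label a defined offset to report */
uint32_t bad_offset = jeb->offset;
```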
Fixes: 8a0f572397 ("[JFFS2] Return values of jffs2_block_check_erase error paths")
Signed-off-by: Qingfang Deng <qingfang.deng@siflower.com.cn>
Reviewed-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
After an insertion in TNC, the tree might split and cause a node to
change its `znode->parent`. A further deletion of other nodes in the
tree (which also could free the nodes), the aforementioned node's
`znode->cparent` could still point to a freed node. This
`znode->cparent` may not be updated when getting nodes to commit in
`ubifs_tnc_start_commit()`. This could then trigger a use-after-free
when accessing the `znode->cparent` in `write_index()` in
`ubifs_tnc_end_commit()`.
This can be triggered by running
rm -f /etc/test-file.bin
dd if=/dev/urandom of=/etc/test-file.bin bs=1M count=60 conv=fsync
in a loop, and with `CONFIG_UBIFS_FS_AUTHENTICATION`. KASAN then
reports:
BUG: KASAN: use-after-free in ubifs_tnc_end_commit+0xa5c/0x1950
Write of size 32 at addr ffffff800a3af86c by task ubifs_bgt0_20/153
Call trace:
dump_backtrace+0x0/0x340
show_stack+0x18/0x24
dump_stack_lvl+0x9c/0xbc
print_address_description.constprop.0+0x74/0x2b0
kasan_report+0x1d8/0x1f0
kasan_check_range+0xf8/0x1a0
memcpy+0x84/0xf4
ubifs_tnc_end_commit+0xa5c/0x1950
do_commit+0x4e0/0x1340
ubifs_bg_thread+0x234/0x2e0
kthread+0x36c/0x410
ret_from_fork+0x10/0x20
Allocated by task 401:
kasan_save_stack+0x38/0x70
__kasan_kmalloc+0x8c/0xd0
__kmalloc+0x34c/0x5bc
tnc_insert+0x140/0x16a4
ubifs_tnc_add+0x370/0x52c
ubifs_jnl_write_data+0x5d8/0x870
do_writepage+0x36c/0x510
ubifs_writepage+0x190/0x4dc
__writepage+0x58/0x154
write_cache_pages+0x394/0x830
do_writepages+0x1f0/0x5b0
filemap_fdatawrite_wbc+0x170/0x25c
file_write_and_wait_range+0x140/0x190
ubifs_fsync+0xe8/0x290
vfs_fsync_range+0xc0/0x1e4
do_fsync+0x40/0x90
__arm64_sys_fsync+0x34/0x50
invoke_syscall.constprop.0+0xa8/0x260
do_el0_svc+0xc8/0x1f0
el0_svc+0x34/0x70
el0t_64_sync_handler+0x108/0x114
el0t_64_sync+0x1a4/0x1a8
Freed by task 403:
kasan_save_stack+0x38/0x70
kasan_set_track+0x28/0x40
kasan_set_free_info+0x28/0x4c
__kasan_slab_free+0xd4/0x13c
kfree+0xc4/0x3a0
tnc_delete+0x3f4/0xe40
ubifs_tnc_remove_range+0x368/0x73c
ubifs_tnc_remove_ino+0x29c/0x2e0
ubifs_jnl_delete_inode+0x150/0x260
ubifs_evict_inode+0x1d4/0x2e4
evict+0x1c8/0x450
iput+0x2a0/0x3c4
do_unlinkat+0x2cc/0x490
__arm64_sys_unlinkat+0x90/0x100
invoke_syscall.constprop.0+0xa8/0x260
do_el0_svc+0xc8/0x1f0
el0_svc+0x34/0x70
el0t_64_sync_handler+0x108/0x114
el0t_64_sync+0x1a4/0x1a8
The offending `memcpy()` in `ubifs_copy_hash()` has a use-after-free
when a node becomes root in TNC but still has a `cparent` to an already
freed node. More specifically, consider the following TNC:
zroot
/
/
zp1
/
/
zn
Inserting a new node `zn_new` with a key smaller than `zn` will trigger
a split in `tnc_insert()` if `zp1` is full:
zroot
/ \
/ \
zp1 zp2
/ \
/ \
zn_new zn
`zn->parent` has now been moved to `zp2`, *but* `zn->cparent` still
points to `zp1`.
Now, consider a removal of all the nodes _except_ `zn`. Just when
`tnc_delete()` is about to delete `zroot` and `zp2`:
zroot
\
\
zp2
\
\
zn
`zroot` and `zp2` get freed and the tree collapses:
zn
`zn` now becomes the new `zroot`.
`get_znodes_to_commit()` will now only find `zn`, the new `zroot`, and
`write_index()` will check its `znode->cparent` that wrongly points to
the already freed `zp1`. `ubifs_copy_hash()` thus gets wrongly called
with `znode->cparent->zbranch[znode->iip].hash` that triggers the
use-after-free!
Fix this by explicitly setting `znode->cparent` to `NULL` in
`get_znodes_to_commit()` for the root node. The search for the dirty
nodes is bottom-up in the tree. Thus, when `find_next_dirty(znode)`
returns NULL, the current `znode` _is_ the root node. Add an assert for
this.
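A sketch of the fix in `get_znodes_to_commit()` (illustrative;
`find_next_dirty()` is the existing bottom-up walker):
```
cnext = find_next_dirty(znode);
if (!cnext) {
        /* bottom-up walk ended: znode is the root */
        ubifs_assert(c, znode == c->zroot.znode);
        /* cparent may still point at a node freed by tnc_delete() */
        znode->cparent = NULL;
        break;
}
```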
Fixes: 16a26b20d2 ("ubifs: authentication: Add hashes to index nodes")
Tested-by: Waqar Hameed <waqar.hameed@axis.com>
Co-developed-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Waqar Hameed <waqar.hameed@axis.com>
Reviewed-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
- Assorted tiny syzbot fixes
- Shutdown path fix: "bch2_btree_write_buffer_flush_going_ro()"
The shutdown path wasn't flushing the btree write buffer, leading to
shutting down while we still had operations in flight. This fixes a
whole slew of syzbot bugs, and undoubtedly other strange heisenbugs.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEKnAFLkS8Qha+jvQrE6szbY3KbnYFAmc1ceQACgkQE6szbY3K
bna56w//V4On/z8q8r4vyj44NXfFNXL87QNB7e5BAP8lmow8S1KnqpHlpUq+MaJm
/LI1OPPaDJITokrQ4eUKyfRLQaH9PlDk7WafgzixnYmaXKVIraqGzWfISSbNhLdC
Hw/EjYdqmuQMAxTgj1upyx2UijvBRaT0dvjeaTX2OgQiqIlBUWGdUDFHWzc1sRgO
8PrJf9CYscdS+P+XrDBKLLCiKGn73Z93lSsFTwVzDictfrqUT8X5J5/a+ofCwp8W
sEiL/13bp05ABb3YjVrBupApyY6szC6YrKNwJ8RoHxBGo0zl5L/gd9FnBJX0T1wc
10dDMv4gZWf9mF2KePDH5EYBBV62N8qEwRAHrjkGu/87mNBRf6sUZ3fbTiMMmVds
nSTJUHH6t6uiwtezPPzDKtPYQBcGsP95DBdvL4un97jTlUtszHLU8OqVErojTYzZ
sZLilUUMz+CK/qfoJjUcSJwslevHeiri5Hto4jOvf8M7uH0Xx29LvmwqWLdZP2uT
RNjslFLvqzZoWGsbJEfe2YihJwJd9kkiE/0xQK60169bzoNEI6oWNmOK2Ts1Dr0X
tEkdHmWqNI65+nbjomGOa3u1wysrACuaRwu6fq0562+IvlDtVXLUWbmlS58+bqz2
KwUmZp4Y2OibeKFT+y0iqxpw9WBPQb7dwj9xwpwk3xgph1uVwbY=
=L6kv
-----END PGP SIGNATURE-----
Merge tag 'bcachefs-2024-11-13' of git://evilpiepirate.org/bcachefs
Pull bcachefs fixes from Kent Overstreet:
"This fixes one minor regression from the btree cache fixes (in the
scan_for_btree_nodes repair path) - and the shutdown path fix is the
big one here, in terms of bugs closed:
- Assorted tiny syzbot fixes
- Shutdown path fix: "bch2_btree_write_buffer_flush_going_ro()"
The shutdown path wasn't flushing the btree write buffer, leading
to shutting down while we still had operations in flight. This
fixes a whole slew of syzbot bugs, and undoubtedly other strange
heisenbugs.
* tag 'bcachefs-2024-11-13' of git://evilpiepirate.org/bcachefs:
bcachefs: Fix assertion pop in bch2_ptr_swab()
bcachefs: Fix journal_entry_dev_usage_to_text() overrun
bcachefs: Allow for unknown key types in backpointers fsck
bcachefs: Fix assertion pop in topology repair
bcachefs: Fix hidden btree errors when reading roots
bcachefs: Fix validate_bset() repair path
bcachefs: Fix missing validation for bch_backpointer.level
bcachefs: Fix bch_member.btree_bitmap_shift validation
bcachefs: bch2_btree_write_buffer_flush_going_ro()
Move a pair of kfree() calls behind the label “out_err”
so that two statements can be better reused at the end of
this function implementation.
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Reviewed-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
An iput(xino) call was immediately used after a return value check
for a remove_xattr() call in this function implementation.
Thus call this function only once, directly before the check.
This issue was transformed by using the Coccinelle software.
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Reviewed-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
Since commit e874dcde1c ("ubifs: Reserve one leb for each journal
head while doing budget"), available space is calulated by deducting
reservation for all journal heads. However, the total block count (
which is only used by statfs) is not updated yet, which will cause
the wrong displaying for used space(total - available).
Fix it by deducting reservation for all journal heads from total
block count.
Fixes: e874dcde1c ("ubifs: Reserve one leb for each journal head while doing budget")
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
As opposed to open-coding the cast, using the ERR_CAST macro clearly
indicates that this is a pointer to an error value and a type conversion
was performed.
Signed-off-by: Shen Lichuan <shenlichuan@vivo.com>
Reviewed-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
In commit ae8c511757 ("fs: add FS_IOC_GETFSSYSFSPATH"), a
new fs ioctl was introduced to standardize exporting data from
sysfs across filesystems. The returned path will always be of the
form "$FSTYP/$SYSFS_IDENTIFIER", where the sysfs identifier may
be a UUID or a device name.
ubifs is a filesystem based on a char device, so the common method
of filling s_sysfs_name (super_set_sysfs_name_bdev) is unavailable.
So, in order to support the FS_IOC_GETFSSYSFSPATH ioctl, fill
s_sysfs_name from the ubi_volume_info members, which keeps the
format defined in the macro UBIFS_DFS_DIR_NAME, by using
super_set_sysfs_name_generic.
For ubifs, it will output "ubifs/<dev>".
```
$ ./ioctl_getfssysfs_path /mnt/ubifs/testfile
path: ubifs/ubi0_0
$ ls /sys/fs/ubifs/ubi0_0/
errors_crc errors_magic errors_node
```
Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
Reviewed-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
In ubifs, ubifs_fileattr_get and ubifs_fileattr_set have been
implemented, so the GETFLAGS and SETFLAGS ioctls are no longer handled
in the filesystem's own ioctl helper. Since these flags' cases are now
dead code in ubifs's ioctl helper, we can remove them.
Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
Reviewed-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
Display the inode number in the error message when the same orphan
inode is added twice, which could provide more information for debugging.
Signed-off-by: Liu Mingrui <liumingrui@huawei.com>
Reviewed-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
Function ubifs_evict_xattr_inode() was introduced by commit 272eda8298
("ubifs: Correctly evict xattr inodes") to reclaim xattr inode when
the host inode is deleted.
The xattr inode is evicted in the host inode deleting process since
commit 7959cf3a75 ("ubifs: journal: Handle xattrs like files").
So the ineffective function ubifs_evict_xattr_inode() can be deleted
safely.
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
There is no need to check the xattr count limit every time in
ubifs_jnl_write_inode(), because 'ui->xattr_cnt' won't be modified
by others during the inode eviction process.
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
Add the ability to retrieve security mount options. Keep them separate
from filesystem specific mount options so it's easy to tell them apart.
Also allow retrieving them separately from other mount options, as most
of the time users won't be interested in security-specific mount options.
Link: https://lore.kernel.org/r/20241114-radtour-ofenrohr-ff34b567b40a@brauner
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
When I reworked delayed ref comparison in cf4f04325b ("btrfs: move
->parent and ->ref_root into btrfs_delayed_ref_node"), I made a mistake
and returned -1 for the case where ref1->ref_root was > than
ref2->ref_root. This is a subtle bug that can result in improper
delayed ref running order, which can result in transaction aborts.
Fixes: cf4f04325b ("btrfs: move ->parent and ->ref_root into btrfs_delayed_ref_node")
CC: stable@vger.kernel.org # 6.10+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
ovl_setup_cred_for_create() decrements one refcount of new creds and
ovl_revert_creds() in callers decrements the last refcount.
In preparation to revert_creds_light() back to caller creds, pass an
explicit reference of the creators creds to the callers and drop the
refcount explicitly in the callers after ovl_revert_creds().
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
* arm64/for-next/perf:
perf: Switch back to struct platform_driver::remove()
perf: arm_pmuv3: Add support for Samsung Mongoose PMU
dt-bindings: arm: pmu: Add Samsung Mongoose core compatible
perf/dwc_pcie: Fix typos in event names
perf/dwc_pcie: Add support for Ampere SoCs
ARM: pmuv3: Add missing write_pmuacr()
perf/marvell: Marvell PEM performance monitor support
perf/arm_pmuv3: Add PMUv3.9 per counter EL0 access control
perf/dwc_pcie: Convert the events with mixed case to lowercase
perf/cxlpmu: Support missing events in 3.1 spec
perf: imx_perf: add support for i.MX91 platform
dt-bindings: perf: fsl-imx-ddr: Add i.MX91 compatible
drivers perf: remove unused field pmu_node
* for-next/gcs: (42 commits)
: arm64 Guarded Control Stack user-space support
kselftest/arm64: Fix missing printf() argument in gcs/gcs-stress.c
arm64/gcs: Fix outdated ptrace documentation
kselftest/arm64: Ensure stable names for GCS stress test results
kselftest/arm64: Validate that GCS push and write permissions work
kselftest/arm64: Enable GCS for the FP stress tests
kselftest/arm64: Add a GCS stress test
kselftest/arm64: Add GCS signal tests
kselftest/arm64: Add test coverage for GCS mode locking
kselftest/arm64: Add a GCS test program built with the system libc
kselftest/arm64: Add very basic GCS test program
kselftest/arm64: Always run signals tests with GCS enabled
kselftest/arm64: Allow signals tests to specify an expected si_code
kselftest/arm64: Add framework support for GCS to signal handling tests
kselftest/arm64: Add GCS as a detected feature in the signal tests
kselftest/arm64: Verify the GCS hwcap
arm64: Add Kconfig for Guarded Control Stack (GCS)
arm64/ptrace: Expose GCS via ptrace and core files
arm64/signal: Expose GCS state in signal frames
arm64/signal: Set up and restore the GCS context for signal handlers
arm64/mm: Implement map_shadow_stack()
...
* for-next/probes:
: Various arm64 uprobes/kprobes cleanups
arm64: insn: Simulate nop instruction for better uprobe performance
arm64: probes: Remove probe_opcode_t
arm64: probes: Cleanup kprobes endianness conversions
arm64: probes: Move kprobes-specific fields
arm64: probes: Fix uprobes for big-endian kernels
arm64: probes: Fix simulate_ldr*_literal()
arm64: probes: Remove broken LDR (literal) uprobe support
* for-next/asm-offsets:
: arm64 asm-offsets.c cleanup (remove unused offsets)
arm64: asm-offsets: remove PREEMPT_DISABLE_OFFSET
arm64: asm-offsets: remove DMA_{TO,FROM}_DEVICE
arm64: asm-offsets: remove VM_EXEC and PAGE_SZ
arm64: asm-offsets: remove MM_CONTEXT_ID
arm64: asm-offsets: remove COMPAT_{RT_,}SIGFRAME_REGS_OFFSET
arm64: asm-offsets: remove VMA_VM_*
arm64: asm-offsets: remove TSK_ACTIVE_MM
* for-next/tlb:
: TLB flushing optimisations
arm64: optimize flush tlb kernel range
arm64: tlbflush: add __flush_tlb_range_limit_excess()
* for-next/misc:
: Miscellaneous patches
arm64: tls: Fix context-switching of tpidrro_el0 when kpti is enabled
arm64/ptrace: Clarify documentation of VL configuration via ptrace
acpi/arm64: remove unnecessary cast
arm64/mm: Change protval as 'pteval_t' in map_range()
arm64: uprobes: Optimize cache flushes for xol slot
acpi/arm64: Adjust error handling procedure in gtdt_parse_timer_block()
arm64: fix .data.rel.ro size assertion when CONFIG_LTO_CLANG
arm64/ptdump: Test both PTE_TABLE_BIT and PTE_VALID for block mappings
arm64/mm: Sanity check PTE address before runtime P4D/PUD folding
arm64/mm: Drop setting PTE_TYPE_PAGE in pte_mkcont()
ACPI: GTDT: Tighten the check for the array of platform timer structures
arm64/fpsimd: Fix a typo
arm64: Expose ID_AA64ISAR1_EL1.XS to sanitised feature consumers
arm64: Return early when break handler is found on linked-list
arm64/mm: Re-organize arch_make_huge_pte()
arm64/mm: Drop _PROT_SECT_DEFAULT
arm64: Add command-line override for ID_AA64MMFR0_EL1.ECV
arm64: head: Drop SWAPPER_TABLE_SHIFT
arm64: cpufeature: add POE to cpucap_is_possible()
arm64/mm: Change pgattr_change_is_safe() arguments as pteval_t
* for-next/mte:
: Various MTE improvements
selftests: arm64: add hugetlb mte tests
hugetlb: arm64: add mte support
* for-next/sysreg:
: arm64 sysreg updates
arm64/sysreg: Update ID_AA64MMFR1_EL1 to DDI0601 2024-09
* for-next/stacktrace:
: arm64 stacktrace improvements
arm64: preserve pt_regs::stackframe during exec*()
arm64: stacktrace: unwind exception boundaries
arm64: stacktrace: split unwind_consume_stack()
arm64: stacktrace: report recovered PCs
arm64: stacktrace: report source of unwind data
arm64: stacktrace: move dump_backtrace() to kunwind_stack_walk()
arm64: use a common struct frame_record
arm64: pt_regs: swap 'unused' and 'pmr' fields
arm64: pt_regs: rename "pmr_save" -> "pmr"
arm64: pt_regs: remove stale big-endian layout
arm64: pt_regs: assert pt_regs is a multiple of 16 bytes
* for-next/hwcap3:
: Add AT_HWCAP3 support for arm64 (also wire up AT_HWCAP4)
arm64: Support AT_HWCAP3
binfmt_elf: Wire up AT_HWCAP3 at AT_HWCAP4
* for-next/kselftest: (30 commits)
: arm64 kselftest fixes/cleanups
kselftest/arm64: Try harder to generate different keys during PAC tests
kselftest/arm64: Don't leak pipe fds in pac.exec_sign_all()
kselftest/arm64: Corrupt P0 in the irritator when testing SSVE
kselftest/arm64: Add FPMR coverage to fp-ptrace
kselftest/arm64: Expand the set of ZA writes fp-ptrace does
kselftets/arm64: Use flag bits for features in fp-ptrace assembler code
kselftest/arm64: Enable build of PAC tests with LLVM=1
kselftest/arm64: Check that SVCR is 0 in signal handlers
kselftest/arm64: Fix printf() compiler warnings in the arm64 syscall-abi.c tests
kselftest/arm64: Fix printf() warning in the arm64 MTE prctl() test
kselftest/arm64: Fix printf() compiler warnings in the arm64 fp tests
kselftest/arm64: Fix build with stricter assemblers
kselftest/arm64: Test signal handler state modification in fp-stress
kselftest/arm64: Provide a SIGUSR1 handler in the kernel mode FP stress test
kselftest/arm64: Implement irritators for ZA and ZT
kselftest/arm64: Remove unused ADRs from irritator handlers
kselftest/arm64: Correct misleading comments on fp-stress irritators
kselftest/arm64: Poll less often while waiting for fp-stress children
kselftest/arm64: Increase frequency of signal delivery in fp-stress
kselftest/arm64: Fix encoding for SVE B16B16 test
...
* for-next/crc32:
: Optimise CRC32 using PMULL instructions
arm64/crc32: Implement 4-way interleave using PMULL
arm64/crc32: Reorganize bit/byte ordering macros
arm64/lib: Handle CRC-32 alternative in C code
* for-next/guest-cca:
: Support for running Linux as a guest in Arm CCA
arm64: Document Arm Confidential Compute
virt: arm-cca-guest: TSM_REPORT support for realms
arm64: Enable memory encrypt for Realms
arm64: mm: Avoid TLBI when marking pages as valid
arm64: Enforce bounce buffers for realm DMA
efi: arm64: Map Device with Prot Shared
arm64: rsi: Map unprotected MMIO as decrypted
arm64: rsi: Add support for checking whether an MMIO is protected
arm64: realm: Query IPA size from the RMM
arm64: Detect if in a realm and set RIPAS RAM
arm64: rsi: Add RSI definitions
* for-next/haft:
: Support for arm64 FEAT_HAFT
arm64: pgtable: Warn unexpected pmdp_test_and_clear_young()
arm64: Enable ARCH_HAS_NONLEAF_PMD_YOUNG
arm64: Add support for FEAT_HAFT
arm64: setup: name 'tcr2' register
arm64/sysreg: Update ID_AA64MMFR1_EL1 register
* for-next/scs:
: Dynamic shadow call stack fixes
arm64/scs: Drop unused prototype __pi_scs_patch_vmlinux()
arm64/scs: Deal with 64-bit relative offsets in FDE frames
arm64/scs: Fix handling of DWARF augmentation data in CIE/FDE frames
The is_mgtime test checks whether the FS_MGTIME flag is set in the
fstype. To get there from the inode though, we have to dereference 3
pointers.
Add a new IOP_MGTIME flag, and have inode_init_always() set that flag
when the fstype flag is set. Then, make is_mgtime test for IOP_MGTIME
instead.
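A sketch of the described change (placement is illustrative):
```
static inline bool is_mgtime(const struct inode *inode)
{
        /* one load instead of inode->i_sb->s_type->fs_flags */
        return inode->i_opflags & IOP_MGTIME;
}

/* in inode_init_always(): */
if (sb->s_type->fs_flags & FS_MGTIME)
        inode->i_opflags |= IOP_MGTIME;
```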
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/r/20241113-mgtime-v1-1-84e256980e11@kernel.org
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
It literally directly follows a spin_lock() call.
This whacks an explicit barrier on x86-64.
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Link: https://lore.kernel.org/r/20241113155103.4194099-1-mjguzik@gmail.com
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
As the size of a directory increases, item creation slows down.
Optimizing access to s_children removes this bottleneck.
dirents are already pinned into the cache, there is no need to scan the
s_children list looking for duplicate Items. The configfs_dirent_exists
check is moved to a location where it is called only during subsystem
initialization.
d_lookup will only need to call configfs_lookup in the case where the
item in question is not pinned to dcache. The only items not pinned to
dcache are attributes. These are placed at the front of the s_children
list, whilst pinned items are inserted at the back. configfs_lookup
stops scanning when it encounters the first pinned entry in s_children.
The assumption of the above optimizations is that there will be few
attributes, but potentially many Items in a given directory.
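A sketch of the resulting scan (illustrative; CONFIGFS_NOT_PINNED marks
attributes, which sit at the front of s_children):
```
list_for_each_entry(sd, &parent_sd->s_children, s_sibling) {
        if (!(sd->s_type & CONFIGFS_NOT_PINNED))
                break;  /* first pinned item: the rest are in the dcache */
        /* attribute: may need a configfs_lookup() match */
        if (!strcmp(configfs_get_name(sd), dentry->d_name.name))
                return configfs_attach_attr(sd, dentry);
}
```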
Signed-off-by: Seamus Connor <sconnor@purestorage.com>
Reviewed-by: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
configfs_hash_and_remove() has been unused since it was added in 2005
by commit
7063fbf226 ("[PATCH] configfs: User-driven configuration filesystem")
Remove it.
Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Call blk_validate_limits on the queue limits used for zone append
splitting so that calculated values get filled in and any stacking
conflicts get caught.
Without this there isn't a max_zone_append_sectors limit as of commit
559218d43e ("block: pre-calculate max_zone_append_sectors").
Fixes: 559218d43e ("block: pre-calculate max_zone_append_sectors")
Reported-by: Yi Zhang <yi.zhang@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20241113084541.34315-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The code indicates that journal_init_common() fills the journal_t object
it returns while the comment incorrectly states that only a few fields are
initialised. Also, the comment claims that journal structures could be
created from scratch which isn't possible as journal_init_common() calls
journal_load_superblock() which loads and checks journal superblock from
disk.
Signed-off-by: Daniel Martín Gómez <dalme@riseup.net>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20241107144538.3544-1-dalme@riseup.net
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Use ktime_get_ns instead of ktime_get_real_ns when computing lr_timeout
so as not to be affected by system time jumps.
Use a boolean instead of the MAX_JIFFY_OFFSET value to determine whether
the next_wakeup value has been set. Comparing elr->lr_next_sched to
MAX_JIFFY_OFFSET can cause the lazyinit thread to loop indefinitely.
Co-developed-by: Lukas Skupinski <lukas.skupinski@landisgyr.com>
Signed-off-by: Lukas Skupinski <lukas.skupinski@landisgyr.com>
Signed-off-by: Mathieu Othacehe <othacehe@gnu.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20241106134741.26948-2-othacehe@gnu.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Inline and use struct_size() to calculate the number of bytes to
allocate for new_fn and remove the local variable len.
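For instance, the allocation might become (names taken from the
surrounding description; the exact call is in the patch):
```
/* header plus name bytes, computed overflow-safely in one expression */
new_fn = kzalloc(struct_size(new_fn, name, ent_name->len + 1),
                 GFP_KERNEL);
```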
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Link: https://patch.msgid.link/20241105103353.11590-2-thorsten.blum@linux.dev
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Add the __counted_by compiler attribute to the flexible array member
name to improve access bounds-checking via CONFIG_UBSAN_BOUNDS and
CONFIG_FORTIFY_SOURCE.
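Schematically (struct layout abridged):
```
struct fname {
        /* ... */
        u8   name_len;
        char name[] __counted_by(name_len);
};
```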
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Link: https://patch.msgid.link/20241105101813.10864-2-thorsten.blum@linux.dev
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
We got a report that adding a fanotify filesystem watch prevents tail -f
from receiving events.
Reproducer:
1. Create 3 windows / login sessions. Become root in each session.
2. Choose a mounted filesystem that is pretty quiet; I picked /boot.
3. In the first window, run: fsnotifywait -S -m /boot
4. In the second window, run: echo data >> /boot/foo
5. In the third window, run: tail -f /boot/foo
6. Go back to the second window and run: echo more data >> /boot/foo
7. Observe that the tail command doesn't show the new data.
8. In the first window, hit control-C to interrupt fsnotifywait.
9. In the second window, run: echo still more data >> /boot/foo
10. Observe that the tail command in the third window has now printed
the missing data.
When stracing tail, we observed that when fanotify filesystem mark is
set, tail does get the inotify event, but the event is received with
the filename:
read(4, "\1\0\0\0\2\0\0\0\0\0\0\0\20\0\0\0foo\0\0\0\0\0\0\0\0\0\0\0\0\0",
50) = 32
This is unexpected, because tail is watching the file itself and not its
parent and is inconsistent with the inotify event received by tail when
fanotify filesystem mark is not set:
read(4, "\1\0\0\0\2\0\0\0\0\0\0\0\0\0\0\0", 50) = 16
The interference between different fsnotify groups was caused by the fact
that the mark on the sb requires the filename, so the filename is passed
to fsnotify(). Later on, fsnotify_handle_event() tries to take care of
not passing the filename to groups (such as inotify) that are interested
in the filename only when the parent is watching.
But the logic was incorrect for the case where no group is watching the
parent, some groups are watching the sb, and some are watching the inode.
Reported-by: Miklos Szeredi <miklos@szeredi.hu>
Fixes: 7372e79c9e ("fanotify: fix logic of reporting name info with watched parent")
Cc: stable@vger.kernel.org # 5.10+
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
10 hotfixes, 7 of which are cc:stable. 7 are MM, 3 are not. All
singletons.
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZzP1ZAAKCRDdBJ7gKXxA
jmBUAP9n2zTKoNeF/WpS0aSg+SpG78mtyMIwSUW2PPfGObYTBwD/bncG9U3fnno1
v6Sey0OjAKwGdV+gTd+5ymWJKPSQbgA=
=HxTA
-----END PGP SIGNATURE-----
Merge tag 'mm-hotfixes-stable-2024-11-12-16-39' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
"10 hotfixes, 7 of which are cc:stable. 7 are MM, 3 are not. All
singletons"
* tag 'mm-hotfixes-stable-2024-11-12-16-39' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
mm: swapfile: fix cluster reclaim work crash on rotational devices
selftests: hugetlb_dio: fixup check for initial conditions to skip in the start
mm/thp: fix deferred split queue not partially_mapped: fix
mm/gup: avoid an unnecessary allocation call for FOLL_LONGTERM cases
nommu: pass NULL argument to vma_iter_prealloc()
ocfs2: fix UBSAN warning in ocfs2_verify_volume()
nilfs2: fix null-ptr-deref in block_dirty_buffer tracepoint
nilfs2: fix null-ptr-deref in block_touch_buffer tracepoint
mm: page_alloc: move mlocked flag clearance into free_pages_prepare()
mm: count zeromap read and set for swapout and swapin
It's used only to initialize ->getattr in one inode_operations instance
(empty_dir_inode_operations) and its behaviour had always been equivalent
to what we get with NULL ->getattr.
Just remove that initializer, along with empty_dir_getattr() itself.
While we are at it, the same instance has ->permission initialized to
generic_permission, which is what NULL ->permission ends up doing.
Again, no point keeping it.
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Commit 8a924db2d7 ("fs: Pass AT_GETATTR_NOSEC flag to getattr interface
function") introduced the AT_GETATTR_NOSEC flag to ensure that the call
paths only call vfs_getattr_nosec if it is set, instead of vfs_getattr.
Now, simplify the getattr interface functions of filesystems where the flag
AT_GETATTR_NOSEC is checked.
There is only a single caller of the inode_operations getattr function
and it is located in fs/stat.c in vfs_getattr_nosec. The caller there is
the only one that passes the AT_GETATTR_NOSEC flag.
Two filesystems are checking this flag in .getattr and the flag is always
passed to them unconditionally from only vfs_getattr_nosec:
- ecryptfs: Simplify by always calling vfs_getattr_nosec in
ecryptfs_getattr. From there the flag is passed to no other
function and this function is not called otherwise.
- overlayfs: Simplify by always calling vfs_getattr_nosec in
ovl_getattr. From there the flag is passed to no other
function and this function is not called otherwise.
The query_flags in vfs_getattr_nosec will mask-out AT_GETATTR_NOSEC from
any caller using AT_STATX_SYNC_TYPE as mask so that the flag is not
important inside this function. Also, since no filesystem is checking the
flag anymore, remove the flag entirely now, including the BUG_ON check that
never triggered.
The net change of the changes here combined with the original commit is
that ecryptfs and overlayfs do not call vfs_getattr but only
vfs_getattr_nosec.
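After the simplification, the stacked getattr reduces to something like
this sketch (helper names from ecryptfs; not the literal hunk):
```
/* ecryptfs_getattr(), schematically: always the nosec variant, since
 * the only caller already is vfs_getattr_nosec() */
rc = vfs_getattr_nosec(ecryptfs_dentry_to_lower_path(dentry),
                       &lower_stat, request_mask, flags);
```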
Fixes: 8a924db2d7 ("fs: Pass AT_GETATTR_NOSEC flag to getattr interface function")
Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Closes: https://lore.kernel.org/linux-fsdevel/20241101011724.GN1350452@ZenIV/T/#u
Cc: Tyler Hicks <code@tyhicks.com>
Cc: ecryptfs@vger.kernel.org
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: linux-unionfs@vger.kernel.org
Cc: Christian Brauner <brauner@kernel.org>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Stefan Berger <stefanb@linux.ibm.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
LOOKUP_EMPTY is ignored by the only remaining user, and without
that the 'getname_' prefix makes no sense.
Remove LOOKUP_EMPTY part, rename to statx_lookup_flags() and make
static. It most likely is _not_ statx() specific, either, but
that's the next step.
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Filesystem options can be retrieved with STATMOUNT_MNT_OPTS, which
returns a string of comma separated options, where some characters are
escaped using the \OOO notation.
Add a new flag, STATMOUNT_OPT_ARRAY, which instead returns the raw
option values separated with '\0' characters.
Since escaped characters are rare, this interface is preferable for
non-libmount users which likely don't want to deal with option
de-escaping.
Example code:
if (st->mask & STATMOUNT_OPT_ARRAY) {
        const char *opt = st->str + st->opt_array;

        for (unsigned int i = 0; i < st->opt_num; i++) {
                printf("opt_array[%i]: <%s>\n", i, opt);
                opt += strlen(opt) + 1;
        }
}
Example output:
(1) mnt_opts: <lowerdir+=/l\054w\054r,lowerdir+=/l\054w\054r1,upperdir=/upp\054r,workdir=/w\054rk,redirect_dir=nofollow,uuid=null>
(2) opt_array[0]: <lowerdir+=/l,w,r>
opt_array[1]: <lowerdir+=/l,w,r1>
opt_array[2]: <upperdir=/upp,r>
opt_array[3]: <workdir=/w,rk>
opt_array[4]: <redirect_dir=nofollow>
opt_array[5]: <uuid=null>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Link: https://lore.kernel.org/r/20241112101006.30715-1-mszeredi@redhat.com
Acked-by: Jeff Layton <jlayton@kernel.org>
[brauner: tweak variable naming and parsing add example output]
Signed-off-by: Christian Brauner <brauner@kernel.org>
Ugh, indeed - and not needed for nearly a decade. It had been
added for the sake of inode_sb_list_lock and that spinlock had become
a per-superblock (->s_inode_list_lock) in March 2015...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20241112213842.GC3387508@ZenIV
Christoph Hellwig <hch@lst.de> says:
This fixes one (of multiple) sparse warnings in fs-writeback.c, and
then reshuffles the code a bit so that only the proper high-level API,
instead of low-level helpers, is exported.
* patches from https://lore.kernel.org/r/20241112054403.1470586-1-hch@lst.de:
writeback: wbc_attach_fdatawrite_inode out of line
writeback: add a __releases annotation to wbc_attach_and_unlock_inode
Link: https://lore.kernel.org/r/20241112054403.1470586-1-hch@lst.de
Signed-off-by: Christian Brauner <brauner@kernel.org>
/proc/self/mountinfo displays the source for the mount, but statmount()
doesn't yet have a way to return it. Add a new STATMOUNT_SB_SOURCE flag,
and claim the 32-bit __spare1 field to hold the offset into the str[] array.
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/r/20241111-statmount-v4-3-2eaf35d07a80@kernel.org
Acked-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
The b_frozen_data allocation should not be failed during journal
committing process, otherwise jbd2 will abort.
Since commit 490c1b444ce653d ("jbd2: do not fail journal because of
frozen_buffer allocation failure") already added the '__GFP_NOFAIL' flag
in do_get_write_access(), just add '__GFP_NOFAIL' flag for all allocations
in jbd2_journal_write_metadata_buffer(), like 'new_bh' allocation does.
Besides, remove all error handling branches for do_get_write_access().
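Schematically, the remaining allocation in
jbd2_journal_write_metadata_buffer() becomes (sketch):
```
/* 'new_bh' already uses __GFP_NOFAIL; do the same for the escape copy,
 * since neither allocation may fail during commit without aborting */
tmp = jbd2_alloc(bh_in->b_size, GFP_NOFS | __GFP_NOFAIL);
```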
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20241012085530.2147846-1-chengzhihao@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
The variables "&EXT4_SB(inode->i_sb)->s_fc_lock" and "&sbi->s_fc_lock"
are the same lock. This function uses a mix of both, which is a bit
unsightly and confuses Smatch.
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Link: https://patch.msgid.link/96008557-8ff4-44cc-b5e3-ce242212f1a3@stanley.mountain
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Keep 'success' internally to track if any error happened and then
return it at the end in do_one_pass(). If jbd2_do_replay() returns
-ENOMEM, then stop replaying the journal.
Signed-off-by: Ye Bin <yebin10@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20240930005942.626942-7-yebin@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
The judgement 'if (block_error && success == 0)' is never true. Just
remove the useless 'block_error' variable.
Signed-off-by: Ye Bin <yebin10@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20240930005942.626942-6-yebin@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Factor out jbd2_do_replay(); no functional change.
Signed-off-by: Ye Bin <yebin10@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20240930005942.626942-5-yebin@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Make the JBD2_COMMIT_BLOCK processing cleaner; no functional change.
Signed-off-by: Ye Bin <yebin10@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20240930005942.626942-4-yebin@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Currently buffer_head freeing is scattered across do_one_pass(); unify
the release of buffer_heads in do_one_pass().
Signed-off-by: Ye Bin <yebin10@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20240930005942.626942-3-yebin@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
'need_check_commit_time' is only used by the v2/v3 checksums, so there
is no need to check 'need_check_commit_time' in the v1 checksum logic.
Signed-off-by: Ye Bin <yebin10@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20240930005942.626942-2-yebin@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Instead of directly casting and returning an error-valued pointer,
use ERR_CAST to make the error handling more explicit and improve
code clarity.
Signed-off-by: Yu Jiaoliang <yujiaoliang@vivo.com>
Link: https://patch.msgid.link/20240920021440.1959243-1-yujiaoliang@vivo.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Using mapped writes, it's technically possible to expose stale
post-eof data on a truncate up operation. Consider the following
example:
$ xfs_io -fc "pwrite 0 2k" -c "mmap 0 4k" -c "mwrite 2k 2k" \
-c "truncate 8k" -c "pread -v 2k 16" <file>
...
00000800: 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 XXXXXXXXXXXXXXXX
...
This shows that the post-eof data written via mwrite lands within
EOF after a truncate up. While this is deliberate of the test case,
behavior is somewhat unpredictable because writeback does post-eof
zeroing, and writeback can occur at any time in the background. For
example, an fsync inserted between the mwrite and truncate causes
the subsequent read to instead return zeroes. This basically means
that there is a race window in this situation between any subsequent
extending operation and writeback that dictates whether post-eof
data is exposed to the file or zeroed.
To prevent this problem, perform partial block zeroing as part of
the various inode size extending operations that are susceptible to
it. For truncate extension, zero around the original eof similar to
how truncate down does partial zeroing of the new eof. For extension
via writes and fallocate related operations, zero the newly exposed
range of the file to cover any partial zeroing that must occur at
the original and new eof blocks.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Link: https://patch.msgid.link/20240919160741.208162-2-bfoster@redhat.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
The commit 91562895f8 ("ext4: properly sync file size update after O_SYNC
direct IO") causes confusion about the meaning of the return value of
ext4_dio_write_end_io().
Specifically, when the ext4_handle_inode_extension() operation succeeds,
ext4_dio_write_end_io() directly returns count instead of 0.
This does not cause a bug in the current kernel, but the semantics of the
return value of the ext4_dio_write_end_io() function are wrong, which is
likely to introduce bugs in the future code evolution.
Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20240919082539.381626-1-alexjlzheng@tencent.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Commit 449813515d ("block, fs: Restore the per-bio/request data
lifetime fields") restored write-hint support in ext4. But that is
applicable only for direct IO. This patch supports passing
write-hint for buffered IO from ext4 file system to block layer
by filling bi_write_hint of struct bio in io_submit_add_bh().
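The essence of the change, as a hedged one-line sketch of what happens in
io_submit_add_bh() (context simplified):
	/* propagate the inode's lifetime write hint to the writeback bio */
	bio->bi_write_hint = inode->i_write_hint;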
Signed-off-by: j.xia <j.xia@samsung.com>
Link: https://patch.msgid.link/20240919020341.2657646-1-j.xia@samsung.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
When I enabled ext4 debug for fault injection testing, I encountered the
following warning:
EXT4-fs error (device sda): ext4_read_inode_bitmap:201: comm fsstress:
Cannot read inode bitmap - block_group = 8, inode_bitmap = 1051
WARNING: CPU: 0 PID: 511 at fs/buffer.c:1181 mark_buffer_dirty+0x1b3/0x1d0
The root cause of the issue lies in the improper implementation of ext4's
buffer_head read fault injection. The actual completion of buffer_head
read and the buffer_head fault injection are not atomic, which can lead
to the uptodate flag being cleared on normally used buffer_heads in race
conditions.
[CPU0]                          [CPU1]                          [CPU2]
ext4_read_inode_bitmap
  ext4_read_bh()
  <bh read complete>
                                ext4_read_inode_bitmap
                                  if (buffer_uptodate(bh))
                                    return bh
                                                                jbd2_journal_commit_transaction
                                                                  __jbd2_journal_refile_buffer
                                                                    __jbd2_journal_unfile_buffer
                                                                      __jbd2_journal_temp_unlink_buffer
  ext4_simulate_fail_bh()
    clear_buffer_uptodate
                                                                  mark_buffer_dirty
                                                                    <report warning>
                                                                    WARN_ON_ONCE(!buffer_uptodate(bh))
The best approach would be to perform the fault injection in the IO
completion callback function, rather than after IO completion. However,
the IO completion callback cannot get at the fault injection information
in the sb. Fix it instead by passing the result of the fault injection
into the bh read functions and simulating the faults within the bh read
functions themselves. This requires adding an extra parameter to the bh
read functions that need fault injection.
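A hedged sketch of the resulting interface (parameter name assumed from
the description above):
	/* the caller decides up front whether this read should simulate a fault */
	int ext4_read_bh(struct buffer_head *bh, blk_opf_t op_flags,
			 bh_end_io_t *end_io, bool simu_fail);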
Fixes: 46f870d690 ("ext4: simulate various I/O and checksum errors when reading metadata")
Signed-off-by: Long Li <leo.lilong@huawei.com>
Link: https://patch.msgid.link/20240906091746.510163-1-leo.lilong@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
When converting a delalloc extent in ext4_es_insert_extent(), we only
want to pass along whether the quota has already been claimed when the
allocation is a direct allocation from ext4_map_create_blocks(). There
is no need to pass the full mapping flags, so change the code to pass
just whether the EXT4_GET_BLOCKS_DELALLOC_RESERVE bit is set.
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20240906061401.2980330-1-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
When ctx_set_flags() is unused, it prevents kernel builds
with clang, `make W=1` and CONFIG_WERROR=y:
.../ext4/super.c:2120:1: error: unused function 'ctx_set_flags' [-Werror,-Wunused-function]
2120 | EXT4_SET_CTX(flags); /* set only */
| ^~~~~~~~~~~~~~~~~~~
Fix this by marking ctx_*_flags() with __maybe_unused
(mark both for the sake of symmetry).
See also commit 6863f5643d ("kbuild: allow Clang to find unused static
inline functions for W=1 build").
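For illustration, the general shape of the fix (macro body simplified;
field names are assumptions, not verified against ext4):
	#define EXT4_SET_CTX(name)						\
	static inline __maybe_unused						\
	void ctx_set_##name(struct ext4_fs_context *ctx, unsigned long flag)	\
	{									\
		ctx->mask_s_##name |= flag;					\
		ctx->vals_s_##name |= flag;					\
	}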
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Link: https://patch.msgid.link/20240905163229.140522-1-andriy.shevchenko@linux.intel.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
cma_alloc() keeps failing in our system because a jh->bh->b_page cannot
be migrated out of the CMA area[1], as the jh has a cp_transaction
pending on it due to j_free > j_max_transaction_buffers[2][3][4][5][6].
We temporarily solve this by launching jbd2_log_do_checkpoint forcefully
somewhere. Since journaling is a mechanism common to all journaling
filesystems, and a cp_transaction has relatively few opportunities to be
launched, cma_alloc() could be affected by the same scenario. This patch
makes ext4's buffer_heads avoid CMA pages when doing sb_getblk (see the
sketch after the crash data below).
[1]
crash_arm64_v8.0.4++> kmem -p|grep ffffff808f0aa150(sb->s_bdev->bd_inode->i_mapping)
fffffffe01a51c00 e9470000 ffffff808f0aa150 3 2 8000000008020 lru,private
fffffffe03d189c0 174627000 ffffff808f0aa150 4 2 2004000000008020 lru,private
fffffffe03d88e00 176238000 ffffff808f0aa150 3f9 2 2008000000008020 lru,private
fffffffe03d88e40 176239000 ffffff808f0aa150 6 2 2008000000008020 lru,private
fffffffe03d88e80 17623a000 ffffff808f0aa150 5 2 2008000000008020 lru,private
fffffffe03d88ec0 17623b000 ffffff808f0aa150 1 2 2008000000008020 lru,private
fffffffe03d88f00 17623c000 ffffff808f0aa150 0 2 2008000000008020 lru,private
fffffffe040e6540 183995000 ffffff808f0aa150 3f4 2 2004000000008020 lru,private
[2] page -> buffer_head
crash_arm64_v8.0.4++> struct page.private fffffffe01a51c00 -x
private = 0xffffff802fca0c00
[3] buffer_head -> journal_head
crash_arm64_v8.0.4++> struct buffer_head.b_private 0xffffff802fca0c00
b_private = 0xffffff8041338e10,
[4] journal_head -> b_cp_transaction
crash_arm64_v8.0.4++> struct journal_head.b_cp_transaction 0xffffff8041338e10 -x
b_cp_transaction = 0xffffff80410f1900,
[5] transaction_t -> journal
crash_arm64_v8.0.4++> struct transaction_t.t_journal 0xffffff80410f1900 -x
t_journal = 0xffffff80e70f3000,
[6] j_free & j_max_transaction_buffers
crash_arm64_v8.0.4++> struct journal_t.j_free,j_max_transaction_buffers 0xffffff80e70f3000 -x
j_free = 0x3f1,
j_max_transaction_buffers = 0x100,
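A hedged sketch of the idea (gfp details simplified): sb_getblk()
normally allocates with __GFP_MOVABLE, which allows the backing page to
come from CMA, so allocate through sb_getblk_gfp() without that flag
instead.
	/* request a non-movable page so it cannot be placed in the CMA area */
	bh = sb_getblk_gfp(sb, block, __GFP_NOFAIL);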
Suggested-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20240904075300.1148836-1-zhaoyang.huang@unisoc.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
The original implementation of ext4's FS_IOC_GETFSMAP handling only
worked when the range of queried blocks included at least one free
(unallocated) block range. This is because the metadata blocks were
emitted only as a side effect of ext4_mballoc_query_range() calling
ext4_getfsmap_datadev_helper(), and that function was only called when a
free block range was identified. As a result, this caused generic/365 to
fail.
Fix this by creating a new function ext4_getfsmap_meta_helper() which
gets called so that blocks before the first free block range in a
block group can get properly reported.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
The maximum length of a filename is 255 and the minimum block size is 1024,
so it is always guaranteed that the number of entries is greater than or
equal to 2 when do_split() is called. So unless ext4_dx_add_entry() and
make_indexed_dir() or some other functions are buggy, 'split == 0' will
not occur.
Setting 'continued' to 0 in this case masks the fact that the file
system has become corrupted, even though it prevents possible
out-of-bounds access. Hence use WARN_ON_ONCE() to check whether 'split'
is 0, and if it is, warn and return an error to abort the split.
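A hedged sketch of the shape of such a check (error path and label are
hypothetical):
	if (WARN_ON_ONCE(split == 0)) {
		/* corrupted directory: abort instead of faking continued = 0 */
		err = -EFSCORRUPTED;
		goto out;
	}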
Suggested-by: Theodore Ts'o <tytso@mit.edu>
Link: https://lore.kernel.org/r/20240823160518.GA424729@mit.edu
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20241008121152.3771906-1-libaokun@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
After commit 21175ca434 ("ext4: make prefetch_block_bitmaps default"),
we enable 'prefetch_block_bitmaps' by default, but this is not shown in
the '/proc/fs/ext4/sdx/options' procfs interface.
This makes it impossible to distinguish whether the feature is enabled
by default or not, so show 'prefetch_block_bitmaps' in the 'options'
procfs interface whenever it is enabled by default. That way, changes to
the default mount options between versions are easy to notice through
the '/proc/fs/ext4/sdx/options' procfs interface.
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20241008120134.3758097-1-libaokun@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
These two fields are populated and stored as a "frequently used value"
in ufs_fill_super, but are not used afterwards in the driver.
Moreover, one of the shifts triggers UBSAN: shift-out-of-bounds when
apbshift is 12 because 12 * 3 = 36 and 1 << 36 does not fit in the 32
bit integer used to store the value.
Closes: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2087853
Signed-off-by: Agathe Porte <agathe.porte@canonical.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This will eliminate the need for index recalculation during the fast
path.
Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Message-Id: <20241006184341.9081-1-mgurtovoy@nvidia.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Enhance the device probing process by adding a log message when a new
virtio-fs tag is successfully discovered. This improvement provides
better visibility into the initialization of virtio-fs devices.
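A hedged sketch of such a probe-time message (exact wording may differ
from the patch):
	/* let admins correlate virtio-fs tags with the probed devices */
	dev_info(&vdev->dev, "discovered new tag: %s\n", fs->tag);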
Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Message-Id: <20241006184324.8497-1-mgurtovoy@nvidia.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
This allows exporting this high-level interface only while keeping
wbc_attach_and_unlock_inode private in fs-writeback.c and unexporting
__inode_attach_wb.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20241112054403.1470586-3-hch@lst.de
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
This shuts up a sparse lock context tracking warning.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20241112054403.1470586-2-hch@lst.de
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
/proc/self/mountinfo prints out the sb->s_subtype after the type. This
is particularly useful for disambiguating FUSE mounts (at least when the
userland driver bothers to set it). Add STATMOUNT_FS_SUBTYPE and claim
one of the __spare2 fields to point to the offset into the str[] array.
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ian Kent <raven@themaw.net>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/r/20241111-statmount-v4-2-2eaf35d07a80@kernel.org
Acked-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
When one of the statmount_string() handlers doesn't emit anything to
seq, the kernel currently sets the corresponding flag and emits an empty
string.
Given that statmount() returns a mask of accessible fields, just leave
the bit unset in this case, and skip any NULL termination. If nothing
was emitted to the seq, then the EOVERFLOW and EAGAIN cases aren't
applicable and the function can just return immediately.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/r/20241111-statmount-v4-1-2eaf35d07a80@kernel.org
Acked-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
The comment suggests a hash or array approach to
store the active requests. Currently the code iterates
through all the active requests in the linked list and,
when the requested one is found, deletes it.
However, io_cancel() isn't a frequently used operation,
optimizing it wouldn't bring a substantial benefit
to real users, and the increased complexity of maintaining
a hashtable for this would be significant and would slow
down other operations. Therefore remove this TODO
to avoid people spending time improving this.
Signed-off-by: Mohammed Anees <pvmohammedanees2003@gmail.com>
Link: https://lore.kernel.org/r/20241112113906.15825-1-pvmohammedanees2003@gmail.com
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Reorganize xfs_ondisk.h to group the build checks by type, then add a
bunch of missing checks that were in xfs/122 but not the build system.
With this, we can get rid of xfs/122.
With a bit of luck, this should all go splendidly.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZyqQdAAKCRBKO3ySh0YR
ph8MAP4nWT1zCLAEDpFqVQmJO7wjlS02x5xFTHEzcms30ptsIwEA73ycS74tAmP5
CGsPxFVzhG8oMkWgrWXnXuOGfpQBPQo=
=Dx2e
-----END PGP SIGNATURE-----
Merge tag 'better-ondisk-6.13_2024-11-05' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into staging-merge
xfs: improve ondisk structure checks [v5.5 10/10]
Reorganize xfs_ondisk.h to group the build checks by type, then add a
bunch of missing checks that were in xfs/122 but not the build system.
With this, we can get rid of xfs/122.
With a bit of luck, this should all go splendidly.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Actually enable this very large feature, which adds metadata directory
trees, allocation groups on the realtime volume, persistent quota
options, and quota for realtime files.
With a bit of luck, this should all go splendidly.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZyqQdAAKCRBKO3ySh0YR
pvjKAP9A1v0QwkBo65ARf1gQb2AZYlebV8o6faSrpc+10vOPxgD/SV3GKhxQrtaT
gOswA6f9QJN3E82IJ6fR2PZgSke+iQs=
=xjZL
-----END PGP SIGNATURE-----
Merge tag 'metadir-6.13_2024-11-05' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into staging-merge
xfs: enable metadir [v5.5 09/10]
Actually enable this very large feature, which adds metadata directory
trees, allocation groups on the realtime volume, persistent quota
options, and quota for realtime files.
With a bit of luck, this should all go splendidly.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
At some point, I realized that I've refactored enough of the quota code
in XFS that I should evaluate whether or not quota actually works on
realtime volumes. It turns out that it nearly works: the only broken
pieces are chown and delayed allocation, and reporting of project
quotas in the statvfs output for projinherit+rtinherit directories.
Fix these things and we can have realtime quotas again after 20 years.
With a bit of luck, this should all go splendidly.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZyqQdAAKCRBKO3ySh0YR
pkh4AQCtjI73mwU9rhzs2MO5nLNlpg9bgOxute+G4eqGCP02CwEAvg/LpT9yA6qk
1jM5x8C6xy03yIWTUc+DcMPqoCqJzwc=
=2QMv
-----END PGP SIGNATURE-----
Merge tag 'realtime-quotas-6.13_2024-11-05' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into staging-merge
xfs: enable quota for realtime volumes [v5.5 08/10]
At some point, I realized that I've refactored enough of the quota code
in XFS that I should evaluate whether or not quota actually works on
realtime volumes. It turns out that it nearly works: the only broken
pieces are chown and delayed allocation, and reporting of project
quotas in the statvfs output for projinherit+rtinherit directories.
Fix these things and we can have realtime quotas again after 20 years.
With a bit of luck, this should all go splendidly.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Store the quota files in the metadata directory tree instead of the
superblock. Since we're introducing a new incompat feature flag, let's
also make the mount process bring up quotas in whatever state they were
when the filesystem was last unmounted, instead of requiring sysadmins
to remember that themselves.
With a bit of luck, this should all go splendidly.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZyqQdAAKCRBKO3ySh0YR
pod/AP9+FSHh3UF9ccpBQXMAG0NmhXYJBQ7grnAp4q89ko6fCAD+JyUsqi55zDw6
KnLSWZgNuO+aaCmMKXX4hEttA0//gAY=
=0YFz
-----END PGP SIGNATURE-----
Merge tag 'metadir-quotas-6.13_2024-11-05' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into staging-merge
xfs: persist quota options with metadir [v5.5 07/10]
Store the quota files in the metadata directory tree instead of the
superblock. Since we're introducing a new incompat feature flag, let's
also make the mount process bring up quotas in whatever state they were
when the filesystem was last unmounted, instead of requiring sysadmins
to remember that themselves.
With a bit of luck, this should all go splendidly.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Right now, the realtime section uses a single pair of metadata inodes to
store the free space information. This presents a scalability problem
since every thread trying to allocate or free rt extents has to lock
these files. Solve this problem by sharding the realtime section into
separate realtime allocation groups.
While we're at it, define a superblock to be stamped into the start of
the rt section. This enables utilities such as blkid to identify block
devices containing realtime sections, and avoids the situation where
anything written into block 0 of the realtime extent can be
misinterpreted as file data.
The best advantage for rtgroups will become evident later when we get to
adding rmap and reflink to the realtime volume, since the geometry
constraints are the same for rt groups and AGs. Hence we can reuse all
that code directly.
This is a very large patchset, but it catches us up with 20 years of
technical debt that has accumulated.
With a bit of luck, this should all go splendidly.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZyqQdAAKCRBKO3ySh0YR
pqk4AQD31pupAefiZ39TFLz0oA1+Q2WUOoLxH/3Ovqin1GJNPgD9EG04/14fDmRU
WDUSVfU8JKKJYEXXZnLeJLsvEUL2EQ0=
=1/oh
-----END PGP SIGNATURE-----
Merge tag 'realtime-groups-6.13_2024-11-05' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into staging-merge
xfs: shard the realtime section [v5.5 06/10]
Right now, the realtime section uses a single pair of metadata inodes to
store the free space information. This presents a scalability problem
since every thread trying to allocate or free rt extents has to lock
these files. Solve this problem by sharding the realtime section into
separate realtime allocation groups.
While we're at it, define a superblock to be stamped into the start of
the rt section. This enables utilities such as blkid to identify block
devices containing realtime sections, and avoids the situation where
anything written into block 0 of the realtime extent can be
misinterpreted as file data.
The best advantage for rtgroups will become evident later when we get to
adding rmap and reflink to the realtime volume, since the geometry
constraints are the same for rt groups and AGs. Hence we can reuse all
that code directly.
This is a very large patchset, but it catches us up with 20 years of
technical debt that has accumulated.
With a bit of luck, this should all go splendidly.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Prepare for realtime groups by adding a few bug fixes and generic code
that will be necessary.
With a bit of luck, this should all go splendidly.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZyqQdAAKCRBKO3ySh0YR
pgmeAP980iZAY49aL85dhr/QNl0G5YmuLOx6UW8DiAALCJxyxAEAnD5ilA7vQz40
80/cn+Y77fT3LptpAHTM5/FY+42IOgM=
=AzwL
-----END PGP SIGNATURE-----
Merge tag 'rtgroups-prep-6.13_2024-11-05' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into staging-merge
xfs: preparation for realtime allocation groups [v5.5 05/10]
Prepare for realtime groups by adding a few bug fixes and generic code
that will be necessary.
With a bit of luck, this should all go splendidly.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Add in-memory data structures for sharding the realtime volume into
independent allocation groups. For existing filesystems, the entire rt
volume is modelled as having a single large group, with (potentially) a
number of rt extents exceeding 2^32 blocks, though these are not likely
to exist because the codebase has been a bit broken for decades. The
next series fills in the ondisk format and other supporting structures.
With a bit of luck, this should all go splendidly.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZyqQdAAKCRBKO3ySh0YR
ptrrAP41PURivFpHWXqg0sajsIUUezhuAdfg41fJqOop81qWDAEA2CsLf1z0c9/P
CQS/tlQ3xdwZ0MYZMaw2o0EgSHYjwg8=
=qVdv
-----END PGP SIGNATURE-----
Merge tag 'incore-rtgroups-6.13_2024-11-05' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into staging-merge
xfs: create incore rt allocation groups [v5.5 04/10]
Add in-memory data structures for sharding the realtime volume into
independent allocation groups. For existing filesystems, the entire rt
volume is modelled as having a single large group, with (potentially) a
number of rt extents exceeding 2^32 blocks, though these are not likely
to exist because the codebase has been a bit broken for decades. The
next series fills in the ondisk format and other supporting structures.
With a bit of luck, this should all go splendidly.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
This series delivers a new feature -- metadata inode directories. This
is a separate directory tree (rooted in the superblock) that contains
only inodes that contain filesystem metadata. Different metadata
objects can be looked up with regular paths.
Start by creating xfs_imeta{dir,file}* functions to mediate access to
the metadata directory tree. By the end of this mega series, all
existing metadata inodes (rt+quota) will use this directory tree instead
of the superblock.
Next, define the metadir on-disk format, which consists of marking
inodes with a new iflag that says they're metadata. This prevents
bulkstat and friends from ever getting their hands on fs metadata files.
With a bit of luck, this should all go splendidly.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZyqQdAAKCRBKO3ySh0YR
ppNiAP9JgPFENv3P0UCJiCDqtWZlNWfz8a9ngAehm4AQMA0P9gD/XgVYNKZRY2Q3
P+3Sh1TVZ63dcEENlmEFE3myKjeJKAQ=
=IJLc
-----END PGP SIGNATURE-----
Merge tag 'metadata-directory-tree-6.13_2024-11-05' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into staging-merge
xfs: metadata inode directory trees [v5.5 03/10]
This series delivers a new feature -- metadata inode directories. This
is a separate directory tree (rooted in the superblock) that contains
only inodes that contain filesystem metadata. Different metadata
objects can be looked up with regular paths.
Start by creating xfs_imeta{dir,file}* functions to mediate access to
the metadata directory tree. By the end of this mega series, all
existing metadata inodes (rt+quota) will use this directory tree instead
of the superblock.
Next, define the metadir on-disk format, which consists of marking
inodes with a new iflag that says they're metadata. This prevents
bulkstat and friends from ever getting their hands on fs metadata files.
With a bit of luck, this should all go splendidly.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Soon we'll be sharding the realtime volume into separate allocation
groups. These rt groups will /mostly/ behave the same as the ones on
the data device, but since rt groups don't have quite the same set of
struct fields as perags, let's hoist the parts that will be shared by
both into a common xfs_group object.
With a bit of luck, this should all go splendidly.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZyqQdAAKCRBKO3ySh0YR
pnJDAQCh14f3aSCHslr4XeG1YkT7NZ44AtMjBqEi5GRN26wC0wD+PeaVSRgt+5wy
nONT/nFqU5pApe1w2pq78SoJ+vLJxAk=
=la16
-----END PGP SIGNATURE-----
Merge tag 'generic-groups-6.13_2024-11-05' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into staging-merge
xfs: create a generic allocation group structure [v5.5 02/10]
Soon we'll be sharding the realtime volume into separate allocation
groups. These rt groups will /mostly/ behave the same as the ones on
the data device, but since rt groups don't have quite the same set of
struct fields as perags, let's hoist the parts that will be shared by
both into a common xfs_group object.
With a bit of luck, this should all go splendidly.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Convert the xfs_mount perag tree to use an xarray instead of a radix
tree. There should be no functional changes here.
With a bit of luck, this should all go splendidly.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZyqQcwAKCRBKO3ySh0YR
pmKsAQDko7YQ/dJZsjDvBJRUEnjrJqK9ukR9Ovc00ak0WXIDYQD/cdm8xzkixHn2
yfwsuFxIm6k0PPzb9koSCDT/CQS6QAA=
=WGih
-----END PGP SIGNATURE-----
Merge tag 'perag-xarray-6.13_2024-11-05' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into staging-merge
xfs: convert perag to use xarrays [v5.5 01/10]
Convert the xfs_mount perag tree to use an xarray instead of a radix
tree. There should be no functional changes here.
With a bit of luck, this should all go splendidly.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
This runs on extents that haven't yet been validated, so we don't want
to assert that we have a valid entry type.
Reported-by: syzbot+4f29c3f12f864d8a8d17@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
If the jset_entry_dev_usage is malformed, and too small, our nr_entries
calculation will be incorrect - just bail out.
Reported-by: syzbot+05d7520be047c9be86e0@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
When events are reported to userland and prefer_busy_poll is set, irqs
are temporarily suspended using napi_suspend_irqs.
If no events are found and ep_poll would go to sleep, irq suspension is
cancelled using napi_resume_irqs.
Signed-off-by: Martin Karsten <mkarsten@uwaterloo.ca>
Co-developed-by: Joe Damato <jdamato@fastly.com>
Signed-off-by: Joe Damato <jdamato@fastly.com>
Tested-by: Joe Damato <jdamato@fastly.com>
Tested-by: Martin Karsten <mkarsten@uwaterloo.ca>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Reviewed-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Link: https://patch.msgid.link/20241109050245.191288-5-jdamato@fastly.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Setting prefer_busy_poll now leads to an effectively nonblocking
iteration through napi_busy_loop, even when busy_poll_usecs is 0.
Signed-off-by: Martin Karsten <mkarsten@uwaterloo.ca>
Co-developed-by: Joe Damato <jdamato@fastly.com>
Signed-off-by: Joe Damato <jdamato@fastly.com>
Tested-by: Joe Damato <jdamato@fastly.com>
Tested-by: Martin Karsten <mkarsten@uwaterloo.ca>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Reviewed-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Link: https://patch.msgid.link/20241109050245.191288-4-jdamato@fastly.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Isolation now no longer takes the global per-node list_lru lock, only
the per-cgroup lock. Since that lock is inside the list_lru_one being
walked, there is no longer any need to pass the lock explicitly.
Link: https://lkml.kernel.org/r/20241104175257.60853-7-ryncsn@gmail.com
Signed-off-by: Kairui Song <kasong@tencent.com>
Cc: Chengming Zhou <zhouchengming@bytedance.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Waiman Long <longman@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Currently, every list_lru has a per-node lock that protects adding,
deletion, isolation, and reparenting of all list_lru_one instances
belonging to this list_lru on this node. This lock contention is heavy
when multiple cgroups modify the same list_lru.
This lock can be split into per-cgroup scope to reduce contention.
To achieve this, we need a stable list_lru_one for every cgroup. This
commit adds a lock to each list_lru_one and introduces a helper function,
lock_list_lru_of_memcg, making it possible to pin the list_lru of a memcg.
The reparenting process is then reworked.
Reparenting will switch the list_lru_one instances one by one. By locking
each instance and marking it dead using the nr_items counter, reparenting
ensures that all items in the corresponding cgroup (on-list or not,
because items have a stable cgroup, see below) will see the list_lru_one
switch synchronously.
Objcg reparent is also moved after list_lru reparent so items will have a
stable mem cgroup until all list_lru_one instances are drained.
The only callers that don't go through the *_obj interfaces are direct
calls to list_lru_{add,del}. But those are only used by zswap, which is
also objcg-based, so this is fine.
This also changes the behaviour of the isolation function when LRU_RETRY
or LRU_REMOVED_RETRY is returned: because releasing the lock can now
unblock reparenting and free the list_lru_one, the isolation function
has to return without re-locking the lru.
prepare() {
        mkdir /tmp/test-fs
        modprobe brd rd_nr=1 rd_size=33554432
        mkfs.xfs -f /dev/ram0
        mount -t xfs /dev/ram0 /tmp/test-fs
        for i in $(seq 1 512); do
                mkdir "/tmp/test-fs/$i"
                for j in $(seq 1 10240); do
                        echo TEST-CONTENT > "/tmp/test-fs/$i/$j"
                done &
        done; wait
}
do_test() {
        read_worker() {
                sleep 1
                tar -cv "$1" &>/dev/null
        }
        read_in_all() {
                cd "/tmp/test-fs" && ls
                for i in $(seq 1 512); do
                        (exec sh -c 'echo "$PPID"') > "/sys/fs/cgroup/benchmark/$i/cgroup.procs"
                        read_worker "$i" &
                done; wait
        }
        for i in $(seq 1 512); do
                mkdir -p "/sys/fs/cgroup/benchmark/$i"
        done
        echo +memory > /sys/fs/cgroup/benchmark/cgroup.subtree_control
        echo 512M > /sys/fs/cgroup/benchmark/memory.max
        echo 3 > /proc/sys/vm/drop_caches
        time read_in_all
}
The above script simulates compression of small files in multiple
cgroups under memory pressure. Run prepare(), then run do_test() 6 times:
Before:
real 0m7.762s user 0m11.340s sys 3m11.224s
real 0m8.123s user 0m11.548s sys 3m2.549s
real 0m7.736s user 0m11.515s sys 3m11.171s
real 0m8.539s user 0m11.508s sys 3m7.618s
real 0m7.928s user 0m11.349s sys 3m13.063s
real 0m8.105s user 0m11.128s sys 3m14.313s
After this commit (about 15% faster):
real 0m6.953s user 0m11.327s sys 2m42.912s
real 0m7.453s user 0m11.343s sys 2m51.942s
real 0m6.916s user 0m11.269s sys 2m43.957s
real 0m6.894s user 0m11.528s sys 2m45.346s
real 0m6.911s user 0m11.095s sys 2m43.168s
real 0m6.773s user 0m11.518s sys 2m40.774s
Link: https://lkml.kernel.org/r/20241104175257.60853-6-ryncsn@gmail.com
Signed-off-by: Kairui Song <kasong@tencent.com>
Cc: Chengming Zhou <zhouchengming@bytedance.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Waiman Long <longman@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
When using the "block:block_dirty_buffer" tracepoint, mark_buffer_dirty()
may cause a NULL pointer dereference, or a general protection fault when
KASAN is enabled.
This happens because, since the tracepoint was added in
mark_buffer_dirty(), it references the dev_t member bh->b_bdev->bd_dev
regardless of whether the buffer head has a pointer to a block_device
structure.
In the current implementation, nilfs_grab_buffer(), which grabs a buffer
to read (or create) a block of metadata, including b-tree node blocks,
does not set the block device itself; instead, each of its caller block
reading functions does so, and only if the buffer is not in the
"uptodate" state. However, if the uptodate flag is set on a folio/page,
and the buffer heads are detached from it by try_to_free_buffers(), and
new buffer heads are then attached by create_empty_buffers(), the
uptodate flag may be restored to each buffer without the block device
being set in bh->b_bdev, and mark_buffer_dirty() may be called later in
that state, resulting in the bug mentioned above.
Fix this issue by making nilfs_grab_buffer() always set the block device
of the super block structure to the buffer head, regardless of the state
of the buffer's uptodate flag.
Link: https://lkml.kernel.org/r/20241106160811.3316-3-konishi.ryusuke@gmail.com
Fixes: 5305cb8308 ("block: add block_{touch|dirty}_buffer tracepoint")
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Ubisectech Sirius <bugreport@valiantsec.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "nilfs2: fix null-ptr-deref bugs on block tracepoints".
This series fixes null pointer dereference bugs that occur when using
nilfs2 and two block-related tracepoints.
This patch (of 2):
It has been reported that when using "block:block_touch_buffer"
tracepoint, touch_buffer() called from __nilfs_get_folio_block() causes a
NULL pointer dereference, or a general protection fault when KASAN is
enabled.
This happens because since the tracepoint was added in touch_buffer(), it
references the dev_t member bh->b_bdev->bd_dev regardless of whether the
buffer head has a pointer to a block_device structure. In the current
implementation, the block_device structure is set after the function
returns to the caller.
Here, touch_buffer() is used to mark the folio/page that owns the buffer
head as accessed, but the common search helper for folio/page used by the
caller function was optimized to mark the folio/page as accessed when it
was reimplemented a long time ago, eliminating the need to call
touch_buffer() here in the first place.
So this solves the issue by eliminating the touch_buffer() call itself.
Link: https://lkml.kernel.org/r/20241106160811.3316-1-konishi.ryusuke@gmail.com
Link: https://lkml.kernel.org/r/20241106160811.3316-2-konishi.ryusuke@gmail.com
Fixes: 5305cb8308 ("block: add block_{touch|dirty}_buffer tracepoint")
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Reported-by: Ubisectech Sirius <bugreport@valiantsec.com>
Closes: https://lkml.kernel.org/r/86bd3013-887e-4e38-960f-ca45c657f032.bugreport@valiantsec.com
Reported-by: syzbot+9982fb8d18eba905abe2@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=9982fb8d18eba905abe2
Tested-by: syzbot+9982fb8d18eba905abe2@syzkaller.appspotmail.com
Cc: Tejun Heo <tj@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Syzbot has reported the following KMSAN splat:
BUG: KMSAN: uninit-value in ocfs2_file_read_iter+0x9a4/0xf80
ocfs2_file_read_iter+0x9a4/0xf80
__io_read+0x8d4/0x20f0
io_read+0x3e/0xf0
io_issue_sqe+0x42b/0x22c0
io_wq_submit_work+0xaf9/0xdc0
io_worker_handle_work+0xd13/0x2110
io_wq_worker+0x447/0x1410
ret_from_fork+0x6f/0x90
ret_from_fork_asm+0x1a/0x30
Uninit was created at:
__alloc_pages_noprof+0x9a7/0xe00
alloc_pages_mpol_noprof+0x299/0x990
alloc_pages_noprof+0x1bf/0x1e0
allocate_slab+0x33a/0x1250
___slab_alloc+0x12ef/0x35e0
kmem_cache_alloc_bulk_noprof+0x486/0x1330
__io_alloc_req_refill+0x84/0x560
io_submit_sqes+0x172f/0x2f30
__se_sys_io_uring_enter+0x406/0x41c0
__x64_sys_io_uring_enter+0x11f/0x1a0
x64_sys_call+0x2b54/0x3ba0
do_syscall_64+0xcd/0x1e0
entry_SYSCALL_64_after_hwframe+0x77/0x7f
Since an instance of 'struct kiocb' may be passed from the block layer
with its 'private' field uninitialized, introduce
'ocfs2_iocb_init_rw_locked()' and use it wherever 'ocfs2_dio_end_io()'
may later care about those flags, i.e. in 'ocfs2_file_read_iter()' and
'ocfs2_file_write_iter()'.
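A hedged sketch of the new helper's likely shape (simplified): give
'private' a well-defined initial value before any rw-lock flag bits are
consulted.
	static inline void ocfs2_iocb_init_rw_locked(struct kiocb *iocb)
	{
		/* 'private' carries the rw lock flags; start from a clean state */
		iocb->private = NULL;
	}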
Link: https://lkml.kernel.org/r/20241029091736.1501946-1-dmantipov@yandex.ru
Fixes: 7cdfc3a1c3 ("ocfs2: Remember rw lock level during direct io")
Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru>
Reported-by: syzbot+a73e253cca4f0230a5a5@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=a73e253cca4f0230a5a5
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Jun Piao <piaojun@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Currently we pass back the size and whether it has been modified, but
those just mirror values tracked inside the delegation. In a later
patch, we'll need to get at the timestamps in the delegation too, so
just pass back a reference to the write delegation, and use that to
properly override values in the iattr.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
We already have a slot for this in the kstat structure. Just overwrite
that instead of keeping a copy.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
This is always the same value, and in a later patch we're going to need
to set bits in WORD2. We can simplify this code and save a little space
in the delegation too. Just hardcode the bitmap in the callback encode
function.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
The inode that nfs4_open_delegation() passes to this function is
wrong, which throws off the result. The inode will end up getting a
directory-style change attr instead of a regular-file-style one.
Fix up nfs4_delegation_stat() to fetch STATX_MODE, and then drop the
inode parameter from nfsd4_change_attribute(), since it's no longer
needed.
Fixes: c5967721e1 ("NFSD: handle GETATTR conflict with write delegation")
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Clean up: Commit 65294c1f2c ("nfsd: add a new struct file caching
facility to nfsd") moved the fh_verify() call site out of
nfsd_open(). That was the only user of nfsd_open's @rqstp parameter,
so that parameter can be removed.
Reviewed-by: NeilBrown <neilb@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
The posix_acl_entry pointer pe is already initialized by the
FOREACH_ACL_ENTRY() macro. Remove the unnecessary initialization.
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
If the tag length is >= U32_MAX - 3 then the "length + 4" addition
can result in an integer overflow. Address this by splitting the
decoding into several steps so that decode_cb_compound4res() does
not have to perform arithmetic on the unsafe length value.
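For illustration, a minimal sketch of the step-by-step decoding style
(not the exact nfsd code):
	/* consume the opaque tag body first ... */
	p = xdr_inline_decode(xdr, length);
	if (unlikely(!p))
		return -EIO;
	/* ... then the next 4-byte field separately, never computing length + 4 */
	p = xdr_inline_decode(xdr, XDR_UNIT);
	if (unlikely(!p))
		return -EIO;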
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Cc: stable@vger.kernel.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
In fscache_create_volume(), there is a missing memory barrier between the
bit-clearing operation and the wake-up operation. This may cause a
situation where, after a wake-up, the bit-clearing operation hasn't been
detected yet, leading to an indefinite wait. The triggering process is as
follows:
[cookie1]                       [cookie2]                       [volume_work]
fscache_perform_lookup
  fscache_create_volume
                                fscache_perform_lookup
                                  fscache_create_volume
                                                                fscache_create_volume_work
                                                                  cachefiles_acquire_volume
                                                                  clear_and_wake_up_bit
  test_and_set_bit
                                  test_and_set_bit
                                    goto maybe_wait
  goto no_wait
In the above process, cookie1 and cookie2 have the same volume. When
cookie1 enters the -no_wait- process, it clears the bit and wakes up the
waiting process. If a barrier is missing, cookie2 may remain in the
-wait- process indefinitely.
In commit 3288666c72 ("fscache: Use clear_and_wake_up_bit() in
fscache_create_volume_work()"), barriers were added to similar operations
in fscache_create_volume_work(), but fscache_create_volume() was missed.
Fix this issue by combining the clear and wake operations into
clear_and_wake_up_bit().
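A hedged sketch of the fix in diff form (bit name and context assumed
from the fscache volume code):
	-	clear_bit_unlock(FSCACHE_VOLUME_CREATING, &volume->flags);
	-	wake_up_bit(&volume->flags, FSCACHE_VOLUME_CREATING);
	+	clear_and_wake_up_bit(FSCACHE_VOLUME_CREATING, &volume->flags);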
Fixes: bfa22da3ed ("fscache: Provide and use cache methods to lookup/create/free a volume")
Signed-off-by: Zizhi Wo <wozizhi@huawei.com>
Link: https://lore.kernel.org/r/20241107110649.3980193-6-wozizhi@huawei.com
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
At present, object->file is subject to a NULL pointer dereference in
ondemand mode. The root cause is that the lifetimes of the allocated fd
and of object->file are inconsistent, while the user-space invocation
through the anon_fd uses object->file. The following is the process that
triggers the issue:
[write fd]                              [umount]
cachefiles_ondemand_fd_write_iter
                                        fscache_cookie_state_machine
                                          cachefiles_withdraw_cookie
  if (!file) return -ENOBUFS
                                          cachefiles_clean_up_object
                                            cachefiles_unmark_inode_in_use
                                            fput(object->file)
                                            object->file = NULL
  // file NULL pointer dereference!
  __cachefiles_write(..., file, ...)
Fix this issue by taking an additional reference to object->file before
write/llseek and dropping it after the operation has finished.
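For illustration, a minimal sketch of the pin-across-IO pattern (locking
against concurrent withdrawal omitted; not the exact cachefiles code):
	struct file *file = object->file;
	if (!file)
		return -ENOBUFS;
	get_file(file);		/* extra reference held across the write */
	ret = __cachefiles_write(object, file, pos, iter, NULL, NULL);
	fput(file);		/* dropped once the write has finished */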
Fixes: c838305450 ("cachefiles: notify the user daemon when looking up cookie")
Signed-off-by: Zizhi Wo <wozizhi@huawei.com>
Link: https://lore.kernel.org/r/20241107110649.3980193-5-wozizhi@huawei.com
Reviewed-by: David Howells <dhowells@redhat.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Currently, cachefiles_commit_tmpfile() will only be called if object->flags
is set to CACHEFILES_OBJECT_USING_TMPFILE. Only cachefiles_create_file()
and cachefiles_invalidate_cookie() set this flag. Both of these functions
replace object->file with the new tmpfile, and both are called by
fscache_cookie_state_machine(), so there are no concurrency issues.
So the equation "d_backing_inode(dentry) == file_inode(object->file)" in
cachefiles_commit_tmpfile() will never hold true according to the above
conditions. This patch removes this part of the redundant code and does not
involve any other logical changes.
Signed-off-by: Zizhi Wo <wozizhi@huawei.com>
Link: https://lore.kernel.org/r/20241107110649.3980193-4-wozizhi@huawei.com
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
In the erofs on-demand loading scenario, read and write operations are
usually delivered through the "off" and "len" contained in the read req
from user mode, and pwrite is naturally used to write at a specific
offset.
However, if the write (not pwrite) syscall is called multiple times in
the read-ahead scenario, ki_pos must be updated manually after each
write operation so that file->f_pos advances.
This step is currently missing from cachefiles_ondemand_fd_write_iter();
add it to address this issue.
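A hedged sketch of the missing step (simplified):
	ret = __cachefiles_write(object, file, kiocb->ki_pos, iter, NULL, NULL);
	if (!ret) {
		ret = len;
		kiocb->ki_pos += ret;	/* keep f_pos advancing for plain write() */
	}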
Fixes: c838305450 ("cachefiles: notify the user daemon when looking up cookie")
Signed-off-by: Zizhi Wo <wozizhi@huawei.com>
Link: https://lore.kernel.org/r/20241107110649.3980193-3-wozizhi@huawei.com
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
cachefiles_ondemand_fd_write_iter() first aligns "pos" and "len" to
block boundaries. When calling __cachefiles_write(), the aligned "pos"
is passed in, but "len" is the original unaligned value (iter->count).
Additionally, the length returned from the write operation is the "len"
that was aligned up to the block size, which is unreasonable.
The alignment of "pos" and "len" is intended only to check whether the
cache has enough space. The aligned "len" should not be used as the
return value of cachefiles_ondemand_fd_write_iter(), because the length
we passed to __cachefiles_write() is the original "len". Doing so results
in a mismatch with the data written on demand. For example, if the
user-supplied length is not aligned to the block size (the preread
scenario, DIO writes that only need 512-byte alignment, or fault
injection), the returned length will differ from the length actually
written.
To solve this issue, since __cachefiles_prepare_write() modifies the
size of "len", pass "aligned_len" to __cachefiles_prepare_write() to
calculate the free blocks, and use the original "len" as the return
value of cachefiles_ondemand_fd_write_iter().
Fixes: c838305450 ("cachefiles: notify the user daemon when looking up cookie")
Signed-off-by: Zizhi Wo <wozizhi@huawei.com>
Link: https://lore.kernel.org/r/20241107110649.3980193-2-wozizhi@huawei.com
Reviewed-by: David Howells <dhowells@redhat.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
We're checking if the send root is read-only without being under the
protection of the root's root_item_lock spinlock, which is what protects
the root's flags when clearing the read-only flag, done at
btrfs_ioctl_subvol_setflags(). Furthermore, it should be done in the
same critical section that increments the root's send_in_progress counter,
as btrfs_ioctl_subvol_setflags() clears the read-only flag in the same
critical section that checks the counter's value.
So fix this by moving the read-only check under the critical section
delimited by the root's root_item_lock which also increments the root's
send_in_progress counter.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We're checking if the send root is dead without the protection of the
root's root_item_lock spinlock, which is what protects the root's flags.
The inverse, setting the dead flag on a root, is done under the protection
of that lock, at btrfs_delete_subvolume(). Also checking and updating the
root's send_in_progress counter is supposed to be done in the same
critical section as checking for or setting the root dead flag, so that
these operations are done atomically as a single step (which is correctly
done by btrfs_delete_subvolume()).
So fix this by checking if the send root is dead in the same critical
section that updates the send_in_progress counter, which is protected by
the root's root_item_lock spinlock.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Smatch complains about possibly dereferencing a NULL fs_info at
btrfs_folio_end_lock_bitmap():
fs/btrfs/subpage.c:332 btrfs_folio_end_lock_bitmap() warn: variable dereferenced before check 'fs_info' (see line 326)
because we access fs_info to set the 'start_bit' variable before doing the
check for a NULL fs_info.
However fs_info is never NULL, since the only caller of
btrfs_folio_end_lock_bitmap() is extent_writepage(), where we have an
inode, which always has a non-NULL fs_info.
So remove the check for a NULL fs_info at btrfs_folio_end_lock_bitmap().
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Smatch complains about calling PTR_ERR() against a NULL pointer:
fs/btrfs/super.c:2272 btrfs_control_ioctl() warn: passing zero to 'PTR_ERR'
Fix this by calling PTR_ERR() against the device pointer only if it
contains an error.
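For illustration, the general pattern (simplified; this is equivalent to
PTR_ERR_OR_ZERO()):
	/* only extract an errno when the pointer actually encodes an error */
	if (IS_ERR(device))
		ret = PTR_ERR(device);
	else
		ret = 0;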
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
REQ_OP_ZONE_APPNED -> REQ_OP_ZONE_APPEND.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Change the control flow of btrfs_encoded_read() so that it doesn't call
free_extent_map() when we know that this has already been done.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Mark Harmstone <maharmstone@fb.com>
Suggested-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There's no point in having a 'snapshot_force_cow' variable to track if we
need to decrement the root->snapshot_force_cow counter, as we never jump
to the 'out' label after incrementing the counter. Simplify this by
removing the variable and always decrementing the counter before the 'out'
label, right after the call to btrfs_mksubvol().
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The comment refers to a list in the respective delayed ref head that no
longer exists (ref_list), it was replaced with a rbtree (ref_tree) in
commit 0e0adbcfdc ("btrfs: track refs in a rb_tree instead of a list").
So update the stale comment to refer to the rbtree instead of the old
list.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add a new unprivileged ioctl that will let the command
'btrfs subvolume sync' work without the (privileged) SEARCH_TREE ioctl.
There are several modes of operation, where the most common ones are to
wait on a specific subvolume or all currently queued for cleaning. This
is utilized e.g. in backup applications that delete subvolumes and wait
until they're cleaned to check for remaining space.
The other modes are for flexibility, e.g. for monitoring or
checkpoints in the queue of deleted subvolumes, again without the need
to use SEARCH_TREE.
Notes:
- waiting is interruptible, the timeout is set to 1 second and is not
configurable
- repeated calls to the ioctl see a different state, so this is
inherently racy when using e.g. the count or peek next/last
Use cases:
- a subvolume A was deleted, wait for cleaning (WAIT_FOR_ONE)
- a bunch of subvolumes were deleted, wait for all (WAIT_FOR_QUEUED or
PEEK_LAST + WAIT_FOR_ONE)
- count how many are queued (not blocking), for monitoring purposes
- report progress (PEEK_NEXT), may miss some if cleaning is quick
- own waiting in user space (PEEK_LAST until it's 0)
Signed-off-by: David Sterba <dsterba@suse.com>
Simplify tracking of the range processed by using cur_alloc_size only to
store the reserved part that may fail to be allocated as an extent.
Remove ram_size as well, since it is always equal to cur_alloc_size in
this context. Advance start in the normal path until extent allocation
succeeds, and keep start unchanged in the error handling path.
Passed the fstest generic/475 test a hundred times with quota enabled,
and a modified generic/475 test (with the sleep time removed) a hundred
times. About one tenth of the runs entered the error handling path due
to failing to reserve an extent.
Suggested-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Haisu Wang <haisuwang@tencent.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Remove conditional path allocation from btrfs_read_locked_inode(). Add
an ASSERT(path) to indicate it should never be called with a NULL path.
Call btrfs_read_locked_inode() directly from btrfs_iget(). This causes
code duplication between btrfs_iget() and btrfs_iget_path(), but I
think this is justifiable as it removes the need for conditionally
allocating the path inside of btrfs_read_locked_inode(). This makes the
code easier to reason about and makes it clear who has the
responsibility of allocating and freeing the path.
Signed-off-by: Leo Martins <loemra.dev@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Move btrfs_add_inode_to_root() so it can be called from
btrfs_read_locked_inode(), no changes were made to the function.
Move cleanup code from btrfs_iget_path() to btrfs_read_locked_inode().
This improves readability and fixes a leaky abstraction. Previously
btrfs_iget_path() had to handle a positive error case resulting from a
call to btrfs_search_slot(), but it makes more sense to handle this
closer to the source of the call.
Signed-off-by: Leo Martins <loemra.dev@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add struct io_btrfs_cmd as a wrapper type for io_uring_cmd_to_pdu(),
rather than using a raw pointer.
Suggested-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Mark Harmstone <maharmstone@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add an io_uring command for encoded reads, using the same interface as
the existing BTRFS_IOC_ENCODED_READ ioctl.
btrfs_uring_encoded_read() is an io_uring version of
btrfs_ioctl_encoded_read(), which validates the user input and calls
btrfs_encoded_read() to read the appropriate metadata. If we determine
that we need to read an extent from disk, we call
btrfs_encoded_read_regular_fill_pages() through
btrfs_uring_read_extent() to prepare the bio.
The existing btrfs_encoded_read_regular_fill_pages() is changed so that
if it is passed a valid uring_ctx, rather than waking up any waiting
threads it calls btrfs_uring_read_extent_endio(). This in turn copies
the read data back to userspace, and calls io_uring_cmd_done() to
complete the io_uring command.
Because we're potentially doing a non-blocking read,
btrfs_uring_read_extent() doesn't clean up after itself if it returns
-EIOCBQUEUED. Instead, it allocates a priv struct, populates the fields
there that we will need to unlock the inode and free our allocations,
and defers this to the btrfs_uring_read_finished() that gets called when
the bio completes.
Signed-off-by: Mark Harmstone <maharmstone@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Change btrfs_encoded_read_regular_fill_pages() so that the priv struct
is allocated rather than stored on the stack, in preparation for adding
an asynchronous mode to the function.
Signed-off-by: Mark Harmstone <maharmstone@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Change btrfs_encoded_read() so that it returns -EAGAIN rather than sleeps
if IOCB_NOWAIT is set in iocb->ki_flags. The conditions that require
sleeping are: inode lock, writeback, extent lock, ordered range.
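For illustration, the canonical shape of one such check, here the inode
lock (a sketch using generic VFS helpers, not the exact btrfs code):
	if (iocb->ki_flags & IOCB_NOWAIT) {
		if (!inode_trylock_shared(inode))
			return -EAGAIN;	/* caller retries from a blocking context */
	} else {
		inode_lock_shared(inode);
	}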
Signed-off-by: Mark Harmstone <maharmstone@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Change the behaviour of btrfs_encoded_read() so that if it needs to read
an extent from disk, it leaves the extent and inode locked and returns
-EIOCBQUEUED. The caller is then responsible for doing the I/O via
btrfs_encoded_read_regular() and unlocking the extent and inode.
Signed-off-by: Mark Harmstone <maharmstone@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
iocb->ki_pos isn't used after this function, so there's no point in
changing its value.
Signed-off-by: Mark Harmstone <maharmstone@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
After the previous patch, which converted the rb-tree used to track
delayed ref heads into an xarray, the find_ref_head() function is now
used only by one caller which always passes false to the 'return_bigger'
argument. So remove the 'return_bigger' logic, simplifying the function,
and move all the function code to the single caller.
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently we use a red black tree (rb-tree) to track the delayed ref
heads (in struct btrfs_delayed_ref_root::href_root). This however is not
very efficient when the number of delayed ref heads is large (and it's
very common to be at least in the order of thousands) since rb-trees are
binary trees. For example for 10K delayed ref heads, the tree has a depth
of 13. Besides that, inserting into the tree requires navigating through
it and pulling useless cache lines in the process since the red black tree
nodes are embedded within the delayed ref head structure - on the other
hand, by being embedded, it requires no extra memory allocations.
We can improve this by using an xarray instead which has a much higher
branching factor than a red black tree (binary balanced tree) and is more
cache friendly and behaves like a resizable array, with a much better
search and insertion complexity than a red black tree. This only has one
small disadvantage which is that insertion will sometimes require
allocating memory for the xarray - which may fail (not that often since
it uses a kmem_cache) - but on the other hand we can reduce the delayed
ref head structure size by 24 bytes (from 152 down to 128 bytes) after
removing the embedded red black tree node, meaning that we can now fit
32 delayed ref heads per 4K page instead of 26, and that gain compensates
for the occasional memory allocations needed for the xarray nodes. We
also end up using only 2 cache lines instead of 3 per delayed ref head.
Running the following fs_mark test showed some improvements:
$ cat test.sh
#!/bin/bash
DEV=/dev/nullb0
MNT=/mnt/nullb0
MOUNT_OPTIONS="-o ssd"
FILES=100000
THREADS=$(nproc --all)
echo "performance" | \
tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
mkfs.btrfs -f $DEV
mount $MOUNT_OPTIONS $DEV $MNT
OPTS="-S 0 -L 5 -n $FILES -s 0 -t $THREADS -k"
for ((i = 1; i <= $THREADS; i++)); do
OPTS="$OPTS -d $MNT/d$i"
done
fs_mark $OPTS
umount $MNT
Before this patch:
FSUse% Count Size Files/sec App Overhead
10 1200000 0 171845.7 12253839
16 2400000 0 230898.7 12308254
23 3600000 0 212292.9 12467768
30 4800000 0 195737.8 12627554
46 6000000 0 171055.2 12783329
After this patch:
FSUse% Count Size Files/sec App Overhead
10 1200000 0 173835.0 12246131
16 2400000 0 233537.8 12271746
23 3600000 0 220398.7 12307737
30 4800000 0 204483.6 12392318
40 6000000 0 182923.3 12771843
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
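For illustration, a minimal sketch of tracking heads in an xarray; the
struct and index choice are illustrative, the xarray API calls are real:
#include <linux/xarray.h>

/* Track delayed ref heads by index (e.g. bytenr >> shift). */
static int track_head(struct xarray *heads, unsigned long index,
		      void *head)
{
	/* May allocate xarray nodes, hence can fail (e.g. -ENOMEM). */
	return xa_insert(heads, index, head, GFP_NOFS);
}

static void *find_head(struct xarray *heads, unsigned long index)
{
	/* High branching factor: few node hops even for many entries. */
	return xa_load(heads, index);
}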
Add some comments to struct btrfs_delayed_ref_root's fields to mention
what its spinlock protects.
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The delayed refs lock must be held when calling add_delayed_ref_head(),
so assert that it's being held.
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
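A sketch of the assertion pattern, assuming the delayed ref root's
spinlock is its 'lock' field; under CONFIG_PROVE_LOCKING this splats if
the lock isn't held, and otherwise compiles away:
#include <linux/lockdep.h>

static void add_head_locked(struct btrfs_delayed_ref_root *delayed_refs)
{
	/* Document and enforce the locking contract at entry. */
	lockdep_assert_held(&delayed_refs->lock);
	/* ... insertion of the head goes here ... */
}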
The delayed refs lock must be held when calling find_first_ref_head(), so
assert that it's being held.
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have 3 callers for find_ref_head() so assert at find_ref_head() that we
have the delayed refs lock held, removing the assertion from one of its
callers (btrfs_find_delayed_ref_head()).
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
One of the following patches in the series will need to access fs_info at
btrfs_delete_ref_head(), so pass a fs_info argument to it.
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
One of the following patches in the series will need to access fs_info in
the function find_ref_head(), so pass a fs_info argument to it as well as
to the functions btrfs_select_ref_head() and btrfs_find_delayed_ref_head()
which call find_ref_head().
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The unselect_delayed_ref_head() at extent-tree.c doesn't really belong
in that file as it's a delayed refs specific detail and therefore should
be at delayed-ref.c. Further, its inverse, btrfs_select_ref_head(), is
at delayed-ref.c, so it only makes sense to have it there too.
So move unselect_delayed_ref_head() into delayed-ref.c and rename it to
btrfs_unselect_ref_head() so that its name closely matches its inverse
(btrfs_select_ref_head()).
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of doing it in two steps outside of delayed-ref.c, leaking low
level details such as locking, move the logic entirely to delayed-ref.c
under btrfs_select_ref_head(), reducing code and making things simpler
for the caller.
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function only returns 0, meaning it was able to lock the delayed ref
head, or -EAGAIN in case it wasn't able to lock it. So simplify this and
use a boolean return type instead, returning true if it was able to lock
and false otherwise.
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
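A small sketch of the design choice, with a plain mutex standing in for
the ref head's lock; a trylock-style helper reads more naturally
returning bool than 0/-EAGAIN:
#include <linux/mutex.h>

static bool ref_head_trylock(struct mutex *head_mutex)
{
	return mutex_trylock(head_mutex);	/* true when we got the lock */
}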
The atomic counter 'num_entries' is not used anymore: we increment and
decrement it, but we never read it for any logic.
Its last use was removed with commit 61a56a992f ("btrfs: delayed refs
pre-flushing should only run the heads we have"). So remove it.
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of open coding it, use the find_first_ref_head() helper at
btrfs_destroy_delayed_refs(). This avoids duplicating the logic,
especially with the upcoming changes in subsequent patches.
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When destroying delayed refs during a transaction abort, we have open
coded the removal of a delayed ref, which is also done by the static
helper function drop_delayed_ref(). So remove that duplicated code and
use drop_delayed_ref() instead.
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The fs_info parameter is redundant because it can be extracted from the
transaction given as another parameter. So remove it and use the fs_info
accessible from the transaction.
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The fs_info parameter is redundant because it can be extracted from the
transaction given as another parameter. So remove it and use the fs_info
accessible from the transaction.
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
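The pattern behind these two cleanups is simply that the transaction
handle already carries the fs_info, so callers need not pass it
separately:
struct btrfs_fs_info *fs_info = trans->fs_info;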
It's better suited to delayed-ref.c since it's about delayed refs and
contains logic to iterate over them (using the red black tree, doing all
the locking, freeing, etc), so move it from disk-io.c, which is pretty
big, into delayed-ref.c, hiding implementation details of how delayed
refs are tracked and managed. This also facilitates the next patches in
the series.
This change moves the code between files but also does the following
simple cleanups:
1) Rename the 'cache' variable to 'bg', since it's a block group
(the 'cache' logic comes from old days where the block group
structure was named 'btrfs_block_group_cache');
2) Move the 'ref' variable declaration to the scope of the inner
while loop, since it's not used outside that loop.
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
At btrfs_destroy_delayed_refs() it's unexpected not to find the block
group to which a delayed reference's extent belongs, so we have this
BUG_ON(), not just because it's highly unexpected but also because we
don't know what to do there.
Since we are in the transaction abort path, there's nothing we can do
other than proceed and clean up all the resources we can. So remove
the BUG_ON() and deal with a missing block group by logging an error
message and continuing to clean up all we can related to the current
delayed ref head, then moving on to other delayed refs.
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When inserting an extent backref, in order to check whether refs other
than inline refs are used, we always search the tree with
path->keep_locks set, which increases lock contention on the extent
tree.
We do not need the parent node every time to determine whether normal
refs are used; it is only needed when the extent item is the last item
in a leaf.
Therefore, change it to first search with keep_locks=0. If the extent
item happens to be the last item in the leaf, search again with
keep_locks=1, reducing lock contention.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Robbie Ko <robbieko@synology.com>
Signed-off-by: David Sterba <dsterba@suse.com>
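A sketch of the two-pass search described above; btrfs_search_slot(),
btrfs_release_path() and the path fields are real btrfs APIs, but the
surrounding flow is illustrative and error handling is trimmed:
path->keep_locks = 0;
ret = btrfs_search_slot(trans, root, &key, path, 1, 1);
if (ret < 0)
	goto out;
if (path->slots[0] == btrfs_header_nritems(path->nodes[0]) - 1) {
	/* Last item in the leaf: we may need to peek at the next
	 * leaf, so redo the search keeping locks on upper levels. */
	btrfs_release_path(path);
	path->keep_locks = 1;
	ret = btrfs_search_slot(trans, root, &key, path, 1, 1);
}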
Implement self-tests for partial deletion of RAID stripe-tree entries.
These two new tests cover both the deletion of the front of a RAID
stripe-tree stripe extent as well as truncation of an item to make it
smaller.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In our CI system, the RAID stripe tree configuration sometimes fails with
the following ASSERT():
assertion failed: found_start >= start && found_end <= end, in fs/btrfs/raid-stripe-tree.c:64
This ASSERT() triggers because, for the initial design of the RAID
stripe-tree, I had the "one ordered-extent equals one bio" rule of zoned
btrfs in mind.
But for a RAID stripe-tree based system that is hosted on a regular
device rather than on a zoned storage device, this rule doesn't apply.
So in case the range we want to delete starts in the middle of the
previous item, grab the item and "truncate" its length. That is, clone
the item, subtract the deleted portion from the key's offset, delete the
old item and insert the new one.
In case the range to delete ends in the middle of an item, we have to
adjust both the item's key as well as the stripe extents and then
re-insert the modified clone into the tree after deleting the old stripe
extent.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
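A rough sketch of the arithmetic for the first case, assuming the
stripe extent key stores the logical start in 'objectid' and the length
in 'offset'; variable names are illustrative:
u64 item_start = key.objectid;
u64 item_end = item_start + key.offset;
u64 deleted = item_end - del_start;	/* overlap with the deleted range */

key.offset -= deleted;			/* shrink the stripe extent's length */
/* then: delete the old item and insert the clone under the new key */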
When fgp_flags and gfp_flags are zero, use filemap_get_folio(A, B)
instead of __filemap_get_folio(A, B, 0, 0), since there's no need for
the extra 0, 0 arguments.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
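The two calls below are equivalent; filemap_get_folio() is simply
__filemap_get_folio() with the fgp_flags and gfp arguments fixed to 0:
#include <linux/pagemap.h>

folio = __filemap_get_folio(mapping, index, 0, 0);
folio = filemap_get_folio(mapping, index);	/* preferred shorthand */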
The buffered write path still heavily uses the page interface. Since we
have converted it to do page-by-page copying, it's much easier to
convert all involved functions to the folio interface, this involves:
- btrfs_copy_from_user()
- btrfs_drop_folio()
- prepare_uptodate_page()
- prepare_one_page()
- lock_and_cleanup_extent_if_need()
- btrfs_dirty_page()
All functions are changed to accept a folio parameter, and if the word
"page" is in the function name, it is changed to "folio" too.
The function btrfs_dirty_page() is exported for the v1 space cache, so
convert the v1 cache call site to pass a folio to the new interface.
And there is a small enhancement for prepare_one_folio(): instead of
manually waiting for the page writeback, let __filemap_get_folio()
handle that by using FGP_WRITEBEGIN, which implies
(FGP_LOCK | FGP_WRITE | FGP_CREAT | FGP_STABLE).
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
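A sketch of the enhancement mentioned above: FGP_STABLE, implied by
FGP_WRITEBEGIN, makes __filemap_get_folio() itself wait for writeback,
so the caller no longer has to. The gfp choice here is illustrative:
#include <linux/pagemap.h>

folio = __filemap_get_folio(mapping, index, FGP_WRITEBEGIN,
			    mapping_gfp_mask(mapping));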
Currently btrfs_buffered_write() prepares multiple pages at a time,
allowing better performance.
But the current trend is to support large folios as an optimization,
instead of implementing our own multi-page optimization.
This is inspired by generic_perform_write(), which copies one folio at
a time.
This change prepares us for migrating to the write_begin() and
write_end() callbacks, and makes every involved function a little
simpler.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_do_encoded_write() was converted to use folios in 400b172b8c,
but we're still allocating based on sizeof(struct page *) rather than
sizeof(struct folio *). There's no functional change.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Mark Harmstone <maharmstone@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Remove hard-coded strings by using the str_yes_no() and str_no_yes()
helper functions.
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
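For illustration, the helper replaces open-coded string pairs; the
message text below is made up, only str_yes_no() comes from the change:
#include <linux/string_choices.h>

/* Instead of: pr_info("ssd optimizations: %s\n", ssd ? "yes" : "no"); */
pr_info("ssd optimizations: %s\n", str_yes_no(ssd));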
Since there are no users of the reader locks, rename the writer locks
into a more generic name, by removing the "_writer" part from the name.
And also rename btrfs_subpage::writer into btrfs_subpage::locked.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since commit d7172f52e9 ("btrfs: use per-buffer locking for
extent_buffer reading"), metadata reads no longer rely on the subpage
reader locking.
This means we do not need to maintain a different metadata/data split
for locking, so we can convert the existing reader lock users by:
- add_ra_bio_pages()
Convert to btrfs_folio_set_writer_lock()
- end_folio_read()
Convert to btrfs_folio_end_writer_lock()
- begin_folio_read()
Convert to btrfs_folio_set_writer_lock()
- folio_range_has_eb()
Remove the subpage->readers checks, since it is always 0.
- Remove btrfs_subpage_start_reader() and btrfs_subpage_end_reader()
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function is not really suitable for locking a folio, as it lacks
the proper mapping checks, thus the locked folio may not even belong to
btrfs.
For that reason, its last user inside lock_delalloc_folios() has
already been removed,
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If you follow the seed/sprout wiki, it suggests the following workflow:
btrfstune -S 1 seed_dev
mount seed_dev mnt
btrfs device add sprout_dev
mount -o remount,rw mnt
The first mount mounts the FS readonly, which results in not setting
BTRFS_FS_OPEN, and setting the readonly bit on the sb. The device add
somewhat surprisingly clears the readonly bit on the sb (though the
mount is still practically readonly, from the user's perspective...).
Finally, the remount checks the readonly bit on the sb against the flag
and sees no change, so it does not run the code intended to run on
ro->rw transitions, leaving BTRFS_FS_OPEN unset.
As a result, when the cleaner_kthread runs, it sees no BTRFS_FS_OPEN and
does no work. This results in leaking deleted snapshots until we run out
of space.
I propose fixing it at the first departure from what feels reasonable:
when we clear the readonly bit on the sb during device add.
A new fstest I have written reproduces the bug and confirms the fix.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
It's redundant to have the 'gen' variable since we already have the same
value in the local btrfs_tree_parent_check structure. So remove it and
instead use the structure's field.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It's pointless to initialize the has_first_key field of the stack local
btrfs_tree_parent_check structure at btrfs_tree_parent_check() and at
btrfs_qgroup_trace_subtree() since all fields not explicitly initialized
are zeroed out. In the case of the first function it's a bit odd
because we are assigning 0 to a field of type bool, however it's not
incorrect since 0 is converted to false.
Just remove the explicit initializations due to their redundancy.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The only caller of btrfs_verify_level_key() is read_block_for_search() and
it's passing 3 arguments to it that can be extracted from its on-stack
variable of type struct btrfs_tree_parent_check.
So change btrfs_verify_level_key() to accept an argument of type
struct btrfs_tree_parent_check instead of level, first key and parent
transid arguments.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The level parameter passed to read_block_for_search() always matches the
level of the extent buffer passed in the "eb_ret" parameter, which we are
also extracting into the "parent_level" local variable.
So remove the level parameter and instead use the "parent_level" variable
which in fact has a better name (it's the level of the parent node from
which we are reading a child node/leaf).
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that the extent map shrinker can only be run by a single task and runs
asynchronously as a work queue job, enable it as it can no longer cause
stalls on tasks allocating memory and entering the extent map shrinker
through the fs shrinker (implemented by btrfs_free_cached_objects()).
This is crucial to prevent exhaustion of memory due to unbounded extent
map creation, primarily with direct IO but also for buffered IO on files
with holes. This problem, for the direct IO case, was first reported in
the Link tag below. That report was added to a Link tag of the first patch
that introduced the extent map shrinker, commit 956a17d9d0 ("btrfs: add
a shrinker for extent maps"), however the Link tag disappeared somehow
from the committed patch (but was included in the patch submitted to
the mailing list), so it is added below for future reference.
Link: https://lore.kernel.org/linux-btrfs/13f94633dcf04d29aaf1f0a43d42c55e@amazon.com/
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The names for the members of struct btrfs_fs_info related to the extent
map shrinker are a bit too long, so rename them to be shorter by replacing
the "extent_map_" prefix with the "em_" prefix.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that the extent map shrinker can only be run by a single task (as a
work queue item) there is no need to keep the progress of the shrinker
protected by a spinlock and passing the progress to trace events as
parameters. So remove the lock and simplify the arguments for the trace
events.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently the extent map shrinker is run synchronously for kswapd tasks
that end up calling the fs shrinker (fs/super.c:super_cache_scan()).
This has some disadvantages, and for some heavy workloads with memory
pressure it can cause delays and stalls that make a machine
unresponsive for some periods. This happens because:
1) We can have several kswapd tasks on machines with multiple NUMA zones,
and running the extent map shrinker concurrently can cause high
contention on some spin locks, namely the spin locks that protect
the radix tree that tracks roots, the per root xarray that tracks
open inodes and the list of delayed iputs. This not only delays the
shrinker but also causes high CPU consumption and makes the task
running the shrinker monopolize a core, resulting in the symptoms
of an unresponsive system. This was noted in previous commits such as
commit ae1e766f62 ("btrfs: only run the extent map shrinker from
kswapd tasks");
2) The extent map shrinker's iteration over inodes can often be slow, even
after changing the data structure that tracks open inodes for a root
from a red black tree (up to kernel 6.10) to an xarray (kernel 6.10+).
While the transition to the xarray made things a bit faster, it's
still somewhat slow - for example in a test scenario with 10000 inodes
that have no extent maps loaded, the extent map shrinker took between
5ms and 8ms, using a release (non-debug) kernel. Iterating over the
extent maps of an inode can also be slow if the inode has many
thousands of extent maps, since we use a red black tree to track and
search extent maps. So having the extent map shrinker run synchronously
adds extra delay for other things a kswapd task does.
So make the extent map shrinker run asynchronously, as a job for the
system unbound workqueue, just like what we do for data and metadata
space reclaim jobs.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
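A minimal sketch of deferring work to the system unbound workqueue;
the function and work item names are illustrative, the workqueue API
calls are real:
#include <linux/workqueue.h>

static void em_shrinker_worker(struct work_struct *work)
{
	/* ... iterate inodes and drop cached extent maps ... */
}
static DECLARE_WORK(em_shrinker_work, em_shrinker_worker);

static void queue_em_shrinker(void)
{
	/* No-op if already queued, so reclaim callers stay cheap. */
	queue_work(system_unbound_wq, &em_shrinker_work);
}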
Move the common code to remove an extent map from its inode's tree into a
helper function and use it, reducing duplicated code.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When traversing the btree, if an eb cache miss occurs, we switch to
the eb read lock and release all previous locks (including the parent
lock) to reduce lock contention.
If an eb cache miss occurs in a leaf and IO is needed, before this
change we released locks only from level 2 and up, and read the leaf's
content from disk while still holding the lock on its parent (level 1),
causing unnecessary lock contention on the parent. After this change we
release locks from level 1 and up, lock level 0 instead, and then read
the leaf's content from disk.
Because we have prepared the check parameters and hold the eb's read
lock, we can ensure that no race will occur during the check and cause
unexpected errors.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Robbie Ko <robbieko@synology.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The compression heuristic pass does not need a level, so we can drop the
parameter.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Cascaded removal of fs_info that is not needed in several functions.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function got split in commit 6ab6ebb760 ("btrfs: split
alloc_log_tree()") and since then the transaction parameter has been
unused.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The only caller passes NULL, so we can drop the parameter. This has
been the case since the new mount option parser added in 3bb17a25bc
("btrfs: add get_tree callback for new mount API").
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since the new mount option parser added in commit ad21f15b0f ("btrfs:
switch to the new mount API") we don't pass the options like that
anymore.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The parameter was added in 8ff8466d29 ("btrfs: support subpage for
extent buffer page release") for the page handling but hasn't been used
since, so we can drop it.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The mask parameter used for allocations got unified to GFP_NOFS and
removed from relevant functions in 1d12680044 ("btrfs: drop gfp from
parameter extent state helpers").
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The parameter duplicates what can be effectively obtained from
wc->refs[level - 1] and this is what's actually used inside. Added in
commit 2b73c7e761 ("btrfs: unify logic to decide if we need to walk
down into a node during snapshot delete").
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The parameter 'from' has never been used since commit b8d8e1fd57
("btrfs: introduce btrfs_write_check()"); this is for buffered writes.
Direct IO writes need it, so it was probably an interface thing, but we
can drop it.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The file_offset parameter used to be passed to the encoded read struct
but was removed in commit b665affe93 ("btrfs: remove unused members
from struct btrfs_encoded_read_private").
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We don't need the offset for inline extents, as they always start at 0.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We don't need the inode pointer to read an inline extent, as everything
needed is accessible from the path pointer.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We don't need the user-passed parameter; rescan is a filesystem
operation, so fs_info is sufficient.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>