linux

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2025-01-08 22:32:55 +00:00

Author	SHA1	Message	Date
Darrick J. Wong	9bb5127347	xfs: check that rtblock extents do not break rtsupers or rtgroups Check that rt block pointers do not point to the realtime superblock and that allocated rt space extents do not cross rtgroup boundaries. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-11-05 13:38:39 -08:00
Darrick J. Wong	8edde94d64	xfs: export realtime group geometry via XFS_FSOP_GEOM Export the realtime geometry information so that userspace can query it. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-11-05 13:38:39 -08:00
Darrick J. Wong	76d3be00df	xfs: update realtime super every time we update the primary fs super Every time we update parts of the primary filesystem superblock that are echoed in the rt superblock, we must update the rt super. Avoid changing the log to support logging to the rt device by using ordered buffers. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-11-05 13:38:39 -08:00
Darrick J. Wong	18618e7100	xfs: check the realtime superblock at mount time Check the realtime superblock at mount time, to ensure that the label and uuids actually match the primary superblock on the data device. If the rt superblock is good, attach it to the xfs_mount so that the log can use ordered buffers to keep the rt super in sync with the primary super on the data device. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-11-05 13:38:39 -08:00
Darrick J. Wong	96768e9151	xfs: define the format of rt groups Define the ondisk format of realtime group metadata, and a superblock for realtime volumes. rt supers are conditionally enabled by a predicate function so that they can be disabled if we ever implement zoned storage support for the realtime volume. For rt group enabled file systems there is a separate bitmap and summary file for each group and thus the number of bitmap and summary blocks needs to be calculated differently. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-11-05 13:38:39 -08:00
Christoph Hellwig	f220f6da5f	xfs: make RT extent numbers relative to the rtgroup To prepare for adding per-rtgroup bitmap files, make the xfs_rtxnum_t type encode the RT extent number relative to the rtgroup. The biggest part of this is to clearly distinguish between the relative extent number that gets masked when converting from a global block number and length values that just have a factor applied to them when converting from file system blocks. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-11-05 13:38:38 -08:00
Darrick J. Wong	dca94251f6	xfs: fix rt device offset calculations for FITRIM FITRIM on xfs has this bizarro uapi where we flatten all the physically addressable storage across two block devices into a linear address space. In this address space, the realtime device comes immediately after the data device. Therefore, the xfs_trim_rtdev_extents has to convert its input parameters from the linear address space to actual rtdev block addresses on the realtime volume. Right now the address space conversion is done in units of rtblocks. However, a future patchset will convert xfs_rtblock_t to be a segmented address space (group:blkno) like the data device. Change the conversion code to be done in units of daddrs since those will never be segmented. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-11-05 13:38:38 -08:00
Christoph Hellwig	f8c5a8415f	xfs: refactor xfs_rtsummary_blockcount Make xfs_rtsummary_blockcount take all the required information from the mount structure and return the number of summary levels from it as well. This cleans up many of the callers and prepares for making the rtsummary files per-rtgroup where they need to look at different values. This means we recalculate some values in some callers, but as all these calculations are outside the fast path and cheap, that seems like a price worth paying. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-11-05 13:38:38 -08:00
Christoph Hellwig	5a7566c8d6	xfs: refactor xfs_rtbitmap_blockcount Rename the existing xfs_rtbitmap_blockcount to xfs_rtbitmap_blockcount_len and add a new xfs_rtbitmap_blockcount wrapper around it that takes the number of extents from the mount structure. This will simplify the move to per-rtgroup bitmaps as those will need to pass in the number of extents per rtgroup instead. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-11-05 13:38:38 -08:00
Christoph Hellwig	bde86b42d2	xfs: factor out a xfs_growfs_check_rtgeom helper Split the check that the rtsummary fits into the log into a separate helper, and use xfs_growfs_rt_alloc_fake_mount to calculate the new RT geometry. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> [djwong: avoid division for the 0-rtx growfs check] Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-11-05 13:38:38 -08:00
Christoph Hellwig	fc233f1fb0	xfs: use xfs_growfs_rt_alloc_fake_mount in xfs_growfs_rt_alloc_blocks Use xfs_growfs_rt_alloc_fake_mount instead of manually recalculating the RT bitmap geometry. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-11-05 13:38:37 -08:00
Darrick J. Wong	1029f08dc5	xfs: factor out a xfs_growfs_rt_alloc_fake_mount helper Split the code to set up a fake mount point to calculate new RT geometry out of xfs_growfs_rt_bmblock so that it can be reused. Note that this changes the rmblocks calculation method to be based on the passed in rblocks and extsize and not the explicitly passed one, but both methods will always lead to the same result. The new version just does a little bit more math while being more general. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-11-05 13:38:37 -08:00
Christoph Hellwig	cb9cd6e56e	xfs: calculate RT bitmap and summary blocks based on sb_rextents Use the on-disk rextents to calculate the bitmap and summary blocks instead of the calculated one so that we can refactor the helpers for calculating them. As the RT bitmap and summary scrubbers already check that sb_rextents match the block count this does not change coverage of the scrubber. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-11-05 13:38:37 -08:00
Darrick J. Wong	c1442d22a0	xfs: remove XFS_ILOCK_RT* Now that we've centralized the realtime metadata locking routines, get rid of the ILOCK subclasses since we now use explicit lockdep classes. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-11-05 13:38:37 -08:00
Christoph Hellwig	ae897e0bed	xfs: support creating per-RTG files in growfs To support adding new RT groups in growfs, we need to be able to create the per-RT group files. Add a new xfs_rtginode_create helper to create a given per-RTG file. Most of the code for that is shared, but the details of the actual file are abstracted out using a new create method in struct xfs_rtginode_ops. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-11-05 13:38:37 -08:00
Christoph Hellwig	e3088ae2dc	xfs: move RT bitmap and summary information to the rtgroup Move the pointers to the RT bitmap and summary inodes as well as the summary cache to the rtgroups structure to prepare for having a separate bitmap and summary inodes for each rtgroup. Code using the inodes now needs to operate on a rtgroup. Where easily possible such code is converted to iterate over all rtgroups, else rtgroup 0 (the only one that can currently exist) is hardcoded. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-11-05 13:38:37 -08:00
Christoph Hellwig	c8edf1cbef	xfs: split xfs_trim_rtdev_extents Split xfs_trim_rtdev_extents into two parts to prepare for reusing the main validation also for RT group aware file systems. Use the fully featured xfs_daddr_to_rtb helper to convert from a daddr to a xfs_rtblock_t to prepare for segmented addressing in RT groups. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-11-05 13:38:36 -08:00
Christoph Hellwig	d6d5c90ada	xfs: cleanup xfs_getfsmap_rtdev_rtbitmap Use mp->m_sb.sb_rblocks to calculate the end instead of sb_rextents that needs a conversion, use consistent names for xfs_rtblock_t types, and only calculate them when they are needed. Remove the pointless "high" local variable that only has a single user. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-11-05 13:38:36 -08:00
Christoph Hellwig	9154b5008c	xfs: factor out a xfs_growfs_rt_alloc_blocks helper Split out a helper to allocate or grow the rtbitmap and rtsummary files in preparation of per-RT group bitmap and summary files. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-11-05 13:38:36 -08:00
Christoph Hellwig	cd8d049082	xfs: add a xfs_qm_unmount_rt helper RT group enabled file systems fix the bug where we pointlessly attach quotas to the RT bitmap and summary files. Split the code to detach the quotas into a helper, make it conditional and document the differing behavior for RT group and pre-RT group file systems. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-11-05 13:38:36 -08:00
Christoph Hellwig	9c3cfb9c96	xfs: add a xfs_bmap_free_rtblocks helper Split the RT extent freeing logic from xfs_bmap_del_extent_real because it will become more complicated when adding RT groups. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-11-05 13:38:36 -08:00
Darrick J. Wong	cd5b26f0c0	xfs: add rtgroup-based realtime scrubbing context management Create a state tracking structure and helpers to initialize the tracking structure so that we can check metadata records against the realtime space management metadata. Right now this is limited to grabbing the incore rtgroup object, but we'll eventually add to the tracking structure the ILOCK state and btree cursors. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-11-05 13:38:36 -08:00
Darrick J. Wong	0d2c636e48	xfs: repair metadata directory file path connectivity Fix disconnected or incorrect metadata directory paths. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-11-05 13:38:35 -08:00
Darrick J. Wong	65b1231b8c	xfs: support caching rtgroup metadata inodes Create the necessary per-rtgroup infrastructure that we need to load metadata inodes into memory. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-11-05 13:38:35 -08:00
Darrick J. Wong	c29237a65c	xfs: add a lockdep class key for rtgroup inodes Add a dynamic lockdep class key for rtgroup inodes. This will enable lockdep to deduce inconsistencies in the rtgroup metadata ILOCK locking order. Each class can have 8 subclasses, and for now we will only have 2 inodes per group. This enables rtgroup order and inode order checks when nesting ILOCKs. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-11-05 13:38:35 -08:00
Darrick J. Wong	0e4875b3fb	xfs: define locking primitives for realtime groups Define helper functions to lock all metadata inodes related to a realtime group. There's not much to look at now, but this will become important when we add per-rtgroup metadata files and online fsck code for them. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-11-05 13:38:35 -08:00
Darrick J. Wong	87fe4c34a3	xfs: create incore realtime group structures Create an incore object that will contain information about a realtime allocation group. This will eventually enable us to shard the realtime section in a similar manner to how we shard the data section, but for now just a single object for the entire RT subvolume is created. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-11-05 13:38:35 -08:00
Christoph Hellwig	dcfc65befb	xfs: clean up xfs_getfsmap_helper arguments The calling conventions for xfs_getfsmap_helper are confusing -- callers pass in an rmap record, but they must also supply startblock and blockcount in daddr units. This was bolted onto the original fsmap implementation so that we could report something for realtime volumes, which do not support rmap and hence can draw only from the rt free space bitmap. Free space on the rt volume can be more than 2^32 fsblocks long, which means that we can't use the rmap startblock or blockcount fields. This is confusing for callers, because they must supply redundant data, but not all of it is used. Streamline this by creating a separate fsmap irec structure that contains exactly the data we need, once. Note that we actually do need rm_startblock for rmap key comparisons when we're actually querying an rmap btree, so leave that field but document why it's there. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-11-05 13:38:35 -08:00
Darrick J. Wong	87b7c205da	xfs: confirm dotdot target before replacing it during a repair xfs_dir_replace trips an assertion if you tell it to change a dirent to point to an inumber that it already points at. Look up the dotdot entry directly to confirm that we need to make a change. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-11-05 13:38:34 -08:00
Darrick J. Wong	b3c03efa59	xfs: check metadata directory file path connectivity Create a new scrubber type that checks that well known metadata directory paths are connected to the metadata inode that the incore structures think is in use. For example, check that "/quota/user" in the metadata directory tree actually points to mp->m_quotainfo->qi_uquotaip->i_ino. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-11-05 13:38:34 -08:00
Darrick J. Wong	9dc31acb01	xfs: move repair temporary files to the metadata directory tree Due to resource acquisition rules, we have to create the ondisk temporary files used to stage a filesystem repair before we can acquire a reference to the inode that we actually want to repair. Therefore, we do not know at tempfile creation time whether the tempfile will belong to the regular directory tree or the metadata directory tree. This distinction becomes important when the swapext code tries to figure out the quota accounting of the two files whose mappings are being swapped. The swapext code assumes that accounting updates are required for a file if dqattach attaches dquots. Metadir files are never accounted in quota, which means that swapext must not update the quota accounting when swapping in a repaired directory/xattr/rtbitmap structure. Prior to the swapext call, therefore, both files must be marked as METADIR for dqattach so that dqattach will ignore them. Add support for a repair tempfile to be switched to the metadir tree and switched back before being released so that ifree will just free the file. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-11-05 13:38:34 -08:00
Darrick J. Wong	dcde94bdee	xfs: check the metadata directory inumber in superblocks When metadata directories are enabled, make sure that the secondary superblocks point to the metadata directory. This isn't strictly required because the secondaries are only used to recover damaged filesystems, and the metadir root inumber is fixed. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-11-05 13:38:34 -08:00
Darrick J. Wong	3d2c341111	xfs: scrub metadata directories Teach online scrub about the metadata directory tree. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-11-05 13:38:34 -08:00
Darrick J. Wong	5dab2daa8a	xfs: fix di_metatype field of inodes that won't load Make sure that the di_metatype field is at least set plausibly so that later scrubbers could set the real type. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-11-05 13:38:34 -08:00
Darrick J. Wong	aec2eb7da8	xfs: adjust parent pointer scrubber for sb-rooted metadata files Starting with the metadata directory feature, we're allowed to call the directory and parent pointer scrubbers for every metadata file, including the ones that are children of the superblock. For these children, checking the link count against the number of parent pointers is a bit funny -- there's no such thing as a parent pointer for a child of the superblock since there's no corresponding dirent. For purposes of validating nlink, we pretend that there is a parent pointer. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-11-05 13:38:33 -08:00
Darrick J. Wong	91fb4232be	xfs: metadata files can have xattrs if metadir is enabled If parent pointers are enabled, then metadata files will store parent pointers in xattrs, just like files in the user visible directory tree. Therefore, scrub and repair need to handle attr forks for metadata files on metadir filesystems. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-11-05 13:38:33 -08:00
Darrick J. Wong	13af229ee0	xfs: do not count metadata directory files when doing online quotacheck Previously, we stated that files in the metadata directory tree are not counted in the dquot information. Fix the online quotacheck code to reflect this. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-11-05 13:38:33 -08:00
Darrick J. Wong	679b098b59	xfs: refactor directory tree root predicates Metadata directory trees make reasoning about the parent of a file more difficult. Traditionally, user files are children of sb_rootino, and metadata files are "children" of the superblock. Now, we add a third possibility -- some metadata files can be children of sb_metadirino, but the classic ones (rt free space data and quotas) are left alone. Let's add some helper functions (instead of open-coding the logic everywhere) to make scrub logic easier to understand. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-11-05 13:38:33 -08:00
Darrick J. Wong	be42fc1393	xfs: record health problems with the metadata directory Make a report to the health monitoring subsystem any time we encounter something in the metadata directory tree that looks like corruption. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-11-05 13:38:33 -08:00
Darrick J. Wong	61b6bdb30a	xfs: adjust xfs_bmap_add_attrfork for metadir Online repair might use the xfs_bmap_add_attrfork to repair a file in the metadata directory tree if (say) the metadata file lacks the correct parent pointers. In that case, it is not correct to check that the file is dqattached -- metadata files must not have /any/ dquot attached at all. Adjust the assertions appropriately. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-11-05 13:38:32 -08:00
Darrick J. Wong	cc0cf84aa7	xfs: mark quota inodes as metadata files When we're creating quota files at mount time, make sure to mark them as metadir inodes if appropriate. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-11-05 13:38:32 -08:00
Darrick J. Wong	382e275f0e	xfs: don't count metadata directory files to quota Files in the metadata directory tree are internal to the filesystem. Don't count the inodes or the blocks they use in the root dquot because users do not need to know about their resource usage. This will also quiet down complaints about dquot usage not matching du output. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-11-05 13:38:32 -08:00
Darrick J. Wong	df866c538f	xfs: allow bulkstat to return metadata directories Allow the V5 bulkstat ioctl to return information about metadata directory files so that xfs_scrub can find and scrub them, since they are otherwise ordinary directories. (Metadata files of course require per-file scrub code and hence do not need exposure.) Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-11-05 13:38:32 -08:00
Darrick J. Wong	688828d8f8	xfs: advertise metadata directory feature Advertise the existence of the metadata directory feature; this will be used by scrub to decide if it needs to scan the metadir too. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-11-05 13:38:32 -08:00
Darrick J. Wong	bb6cdd5529	xfs: hide metadata inodes from everyone because they are special Metadata inodes are private files and therefore cannot be exposed to userspace. This means no bulkstat, no open-by-handle, no linking them into the directory tree, and no feeding them to LSMs. As such, we mark them S_PRIVATE, which stops all that. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-11-05 13:38:32 -08:00
Darrick J. Wong	8651b410ae	xfs: disable the agi rotor for metadata inodes Ideally, we'd put all the metadata inodes in one place if we could, so that the metadata all stay reasonably close together instead of spreading out over the disk. Furthermore, if the log is internal we'd probably prefer to keep the metadata near the log. Therefore, disable AGI rotoring for metadata inode allocations. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-11-05 13:38:31 -08:00
Darrick J. Wong	5d9b54a4ef	xfs: read and write metadata inode directory tree Plumb in the bits we need to load metadata inodes from a named entry in a metadir directory, create (or hardlink) inodes into a metadir directory, create metadir directories, and flag inodes as being metadata files. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-11-05 13:38:31 -08:00
Darrick J. Wong	7297fd0beb	xfs: enforce metadata inode flag Add checks for the metadata inode flag so that we don't ever leak metadata inodes out to userspace, and we don't ever try to read a regular inode as metadata. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-11-05 13:38:31 -08:00
Darrick J. Wong	c555dd9b8c	xfs: load metadata directory root at mount time Load the metadata directory root inode into memory at mount time and release it at unmount time. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-11-05 13:38:31 -08:00
Darrick J. Wong	dcf6069143	xfs: iget for metadata inodes Create a xfs_trans_metafile_iget function for metadata inodes to ensure that when we try to iget a metadata file, the inode is allocated and its file mode matches the metadata file type the caller expects. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-11-05 13:38:31 -08:00
Darrick J. Wong	4f3d4dd1b0	xfs: define the on-disk format for the metadir feature Define the on-disk layout and feature flags for the metadata inode directory feature. Add a xfs_sb_version_hasmetadir for benefit of xfs_repair, which needs to know where the new end of the superblock lies. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-11-05 13:38:31 -08:00
Christoph Hellwig	e5e5cae05b	xfs: store a generic group structure in the intents Replace the pag pointers in the extent free, bmap, rmap and refcount intent structures with a pointer to the generic group to prepare for adding intents for realtime groups. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-11-05 13:38:30 -08:00
Darrick J. Wong	ecc8065dfa	xfs: standardize EXPERIMENTAL warning generation Refactor the open-coded warnings about EXPERIMENTAL feature use into a standard helper before we go adding more experimental features. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-11-05 13:38:30 -08:00
Darrick J. Wong	4d272929a5	xfs: rename metadata inode predicates The predicate xfs_internal_inum tells us if an inumber refers to one of the inodes rooted in the superblock. Soon we're going to have internal inodes in a metadata directory tree, so this helper should be renamed to capture its limited scope. Ondisk inodes will soon have a flag to indicate that they're metadata inodes. Head off some confusion by renaming the xfs_is_metadata_inode predicate to xfs_is_internal_inode. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-11-05 13:38:30 -08:00
Darrick J. Wong	fdf5703b61	xfs: constify the xfs_inode predicates Change the xfs_inode predicates to take a const struct xfs_inode pointer because they do not change the inode. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-11-05 13:38:30 -08:00
Darrick J. Wong	8d939f4bd7	xfs: constify the xfs_sb predicates Change the xfs_sb predicates to take a const struct xfs_sb pointer because they do not change the superblock. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-11-05 13:38:30 -08:00
Christoph Hellwig	ba102a682d	xfs: remove xfs_group_intent_hold and xfs_group_intent_rele Each of them just has a single caller, so fold them. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-11-05 13:38:29 -08:00
Christoph Hellwig	759cc1989a	xfs: add group based bno conversion helpers Add/move the blocks, blklog and blkmask fields to the generic groups structure so that code can work with AGs and RTGs by just using the right index into the array. Then, add convenience helpers to convert block numbers based on the generic group. This will allow writing code that doesn't care if it is used on AGs or the upcoming realtime groups. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-11-05 13:38:29 -08:00
Christoph Hellwig	198febb9fe	xfs: store a generic xfs_group pointer in xfs_getfsmap_info Replace the pag and rtg pointers with a generic group pointer. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-11-05 13:38:29 -08:00
Christoph Hellwig	77a530e6c4	xfs: add a generic group pointer to the btree cursor Replace the pag pointers in the type specific union with a generic xfs_group pointer. This prepares for adding realtime group support. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-11-05 13:38:29 -08:00
Christoph Hellwig	adbc76aa0f	xfs: convert busy extent tracking to the generic group structure Split busy extent tracking from struct xfs_perag into its own private structure, which can be pointed to by the generic group structure. Note that this structure is now dynamically allocated instead of embedded as the upcoming zone XFS code doesn't need it and will also have an unusually high number of groups due to hardware constraints. Dynamically allocating the structure is a big memory saver for this case. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-11-05 13:38:29 -08:00
Christoph Hellwig	0e10cb98f1	xfs: convert extent busy tracepoints to the generic group structure Prepare for tracking busy RT extents by passing the generic group structure to the xfs_extent_busy_class tracepoints. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-11-05 13:38:29 -08:00
Christoph Hellwig	6af1300d47	xfs: return the busy generation from xfs_extent_busy_list_empty This avoids having to poke into the internals of the busy tracking in xrep_setup_ag_allocbt. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-11-05 13:38:28 -08:00
Christoph Hellwig	eb4a84a3c2	xfs: move the online repair rmap hooks to the generic group structure Prepare for the upcoming realtime groups feature by moving the online repair rmap hooks to the generic xfs_group structure. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-11-05 13:38:28 -08:00
Christoph Hellwig	34cf3a6f39	xfs: move draining of deferred operations to the generic group structure Prepare for supporting the upcoming realtime groups feature by moving the deferred operation draining to the generic xfs_group structure. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-11-05 13:38:28 -08:00
Christoph Hellwig	2ed27a5464	xfs: mark xfs_perag_intent_{hold,rele} static These two functions are only used inside of xfs_drain.c, so mark them static. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-11-05 13:38:28 -08:00
Christoph Hellwig	5c8483cec3	xfs: move metadata health tracking to the generic group structure Prepare for also tracking the health status of the upcoming realtime groups by moving the health tracking code to the generic xfs_group structure. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-11-05 13:38:28 -08:00
Christoph Hellwig	86437e6abb	xfs: switch perag iteration from the for_each macros to a while based iterator The current for_each_perag* macros are a bit annoying in that they require the caller to both provide an object and an index iterator, and also somewhat obfuscate the underlying control flow mechanism. Switch to open coded while loops using new xfs_perag_next{,_from,_range} helpers that return the next pag structure to iterate on based on the previous one or NULL for the loop start. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-11-05 13:38:28 -08:00
Christoph Hellwig	d66496578b	xfs: insert the pag structures into the xarray later Cleaning up is much easier if a structure can't be looked up yet, so only insert the pag once it is fully set up. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-11-05 13:38:27 -08:00
Christoph Hellwig	819928770b	xfs: add a xfs_group_next_range helper Add a helper to iterate over all groups, which can be used as a simple while loop: struct xfs_group *xg = NULL; while ((xg = xfs_group_next_range(mp, xg, 0, MAX_GROUP))) { ... } This will be wrapped by the realtime group code first, and eventually replace the for_each_rtgroup_from and for_each_rtgroup_range helpers. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-11-05 13:38:27 -08:00
Christoph Hellwig	201c5fa342	xfs: split xfs_initialize_perag Factor out a xfs_perag_alloc helper that allocates a single perag structure. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-11-05 13:38:27 -08:00
Christoph Hellwig	e9c4d8bfb2	xfs: factor out a generic xfs_group structure Split the lookup and refcount handling of struct xfs_perag into an embedded xfs_group structure that can be reused for the upcoming realtime groups. It will be extended with more features later. Note that the xg_type field will only need a single bit even with realtime group support. For now it fills a hole, but it might be worth folding it into another field if we can use this space better. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-11-05 13:38:27 -08:00
Christoph Hellwig	c4ae021bcb	xfs: convert remaining trace points to pass pag structures Convert all tracepoints that take [mp,agno] tuples to take a pag argument instead so that decoding only happens when tracepoints are enabled and to clean up the callers. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-11-05 13:38:27 -08:00
Christoph Hellwig	0a4d79741d	xfs: factor out a xfs_iwalk_args helper Add a helper to share more code between xfs_iwalk and xfs_inobt_walk, and at the same time do away with the extra flags indirection so that everyone uses the same names for the same flags when using the common iwalk code. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-11-05 13:38:27 -08:00
Christoph Hellwig	dc8df7e382	xfs: pass the pag to the xrep_newbt_extent_class tracepoints This requires moving a few of the callsites a little bit to ensure that we already have the reference, but allows for the decoding to only happen when tracing is actually enabled, and cleans up the callsites a bit. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-11-05 13:38:26 -08:00
Christoph Hellwig	934dde65b2	xfs: pass the pag to the trace_xrep_calc_ag_resblks{,_btsize} trace points This requires holding the pag refcount a little longer, but allows for the decoding to only happen when tracing is actually enabled, and cleans up the callsites a bit. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-11-05 13:38:26 -08:00
Christoph Hellwig	618a27a94d	xfs: pass objects to the xrep_ibt_walk_rmap tracepoint Pass the perag structure and the irec so that the decoding is only done when tracing is actually enabled and the call sites look a lot neater, and remove the pointless class indirection. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-11-05 13:38:26 -08:00
Christoph Hellwig	1209d360eb	xfs: pass the iunlink item to the xfs_iunlink_update_dinode trace point So that decoding is only done when tracing is actually enabled and the call site looks a lot neater. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-11-05 13:38:26 -08:00
Christoph Hellwig	487092ceaa	xfs: pass objects to the xfs_irec_merge_{pre,post} trace points Pass the perag structure and the irec to these tracepoints so that the decoding is only done when tracing is actually enabled and the call sites look a lot neater. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-11-05 13:38:26 -08:00
Christoph Hellwig	835ddb592f	xfs: pass a perag structure to the xfs_ag_resv_init_error trace point And remove the single instance class indirection for it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-11-05 13:38:26 -08:00
Christoph Hellwig	2337ac79e9	xfs: constify pag arguments to trace points Trace points never modify their arguments. Mark all the pag objects passed to trace points. The exception is the xfs_ag_resv_class, which uses the xfs_perag_resv helper that can't be marked const due to other users modifying the returned structure. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-11-05 13:38:25 -08:00
Christoph Hellwig	3c39444939	xfs: remove the unused xrep_bmap_walk_rmap trace point Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-11-05 13:38:25 -08:00
Christoph Hellwig	c896fb44f6	xfs: remove the unused trace_xfs_iwalk_ag trace point Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-11-05 13:38:25 -08:00
Christoph Hellwig	8dcf5e617f	xfs: remove the mount field from struct xfs_busy_extents The mount field is only passed to xfs_extent_busy_clear, which never uses it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-11-05 13:38:25 -08:00
Christoph Hellwig	4a137e0915	xfs: keep a reference to the pag for busy extents Processing of busy extents requires the perag structure, so keep the reference while they are in flight. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-11-05 13:38:25 -08:00
Christoph Hellwig	b6dc8c6dd2	xfs: pass a pag to xfs_extent_busy_{search,reuse} Replace the [mp,agno] tuple with the perag structure, which will become more useful later. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-11-05 13:38:25 -08:00
Christoph Hellwig	6abd82ab6e	xfs: add a xfs_agino_to_ino helper Add a helper to convert an agino to an ino based on a pag structure. This provides a simpler conversion and better type safety compared to the existing code that passes the mount structure and the agno separately. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-11-05 13:38:24 -08:00
Christoph Hellwig	856a920ac2	xfs: add xfs_agbno_to_fsb and xfs_agbno_to_daddr helpers Add helpers to convert an agbno to a daddr or fsbno based on a pag structure. This provides a simpler conversion and better type safety compared to the existing code that passes the mount structure and the agno separately. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-11-05 13:38:24 -08:00
Christoph Hellwig	db129fa011	xfs: remove the agno argument to xfs_free_ag_extent xfs_free_ag_extent already has a pointer to the pag structure through the agf buffer. Use that instead of passing the redundant argument, and do the same for the tracepoint. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-11-05 13:38:24 -08:00
Christoph Hellwig	67ce5ba575	xfs: pass a pag to xfs_difree_inode_chunk We'll want to use more than just the agno field in a bit. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-11-05 13:38:24 -08:00
Christoph Hellwig	9943b45732	xfs: remove the unused pag_active_wq field in struct xfs_perag pag_active_wq is only woken, but never waited for. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-11-05 13:38:24 -08:00
Christoph Hellwig	4e071d79e4	xfs: remove the unused pagb_count field in struct xfs_perag Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-11-05 13:38:23 -08:00
Christoph Hellwig	cd8ae42a82	xfs: fix superfluous clearing of info->low in __xfs_getfsmap_datadev The for_each_perag helpers update the agno passed in for each iteration, and thus the "if (pag->pag_agno == start_ag)" check will always be true. Add another variable for the loop iterator so that the field is only cleared after the first iteration. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-11-05 13:38:23 -08:00
Darrick J. Wong	62027820eb	xfs: fix simplify extent lookup in xfs_can_free_eofblocks In commit `11f4c3a53a`, we tried to simplify the extent lookup in xfs_can_free_eofblocks so that it doesn't incur the overhead of all the extra stuff that xfs_bmapi_read does around the iext lookup. Unfortunately, this causes regressions on generic/603, xfs/108, generic/219, xfs/173, generic/694, xfs/052, generic/230, and xfs/441 when always_cow is turned on. In all cases, the regressions take the form of alwayscow files consuming rather more space than the golden output is expecting. I observed that in all these cases, the cause of the excess space usage was due to CoW fork delalloc reservations that go beyond EOF. For alwayscow files we allow posteof delalloc CoW reservations because all writes go through the CoW fork. Recall that all extents in the CoW fork are accounted for via i_delayed_blks, which means that prior to this patch, we'd invoke xfs_free_eofblocks on first close if anything was in the CoW fork. Now we don't do that. Fix the problem by reverting the removal of the i_delayed_blks check. Cc: <stable@vger.kernel.org> # v6.12-rc1 Fixes: `11f4c3a53a` ("xfs: simplify extent lookup in xfs_can_free_eofblocks") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-11-05 13:38:23 -08:00
Matthew Wilcox (Oracle)	9b4bb82244	ecryptfs: Pass the folio index to crypt_extent() We need to pass pages, not folios, to crypt_extent() as we may be working with a plain page rather than a folio. But we need to know the index in the file, so pass it in from the caller. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Link: https://lore.kernel.org/r/20241025190822.1319162-11-willy@infradead.org Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-11-05 17:20:00 +01:00
Matthew Wilcox (Oracle)	bf64913dfe	ecryptfs: Convert lower_offset_for_page() to take a folio Both callers have a folio, so pass it in and use folio->index instead of page->index. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Link: https://lore.kernel.org/r/20241025190822.1319162-10-willy@infradead.org Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-11-05 17:20:00 +01:00
Matthew Wilcox (Oracle)	c15b81461d	ecryptfs: Convert ecryptfs_decrypt_page() to take a folio Both callers have a folio, so pass it in and use it throughout. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Link: https://lore.kernel.org/r/20241025190822.1319162-9-willy@infradead.org Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-11-05 17:20:00 +01:00
Matthew Wilcox (Oracle)	6b9c0e8137	ecryptfs: Convert ecryptfs_encrypt_page() to take a folio All three callers have a folio, so pass it in and use it throughout. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Link: https://lore.kernel.org/r/20241025190822.1319162-8-willy@infradead.org Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-11-05 17:19:59 +01:00
Matthew Wilcox (Oracle)	de5ced2721	ecryptfs: Convert ecryptfs_write_lower_page_segment() to take a folio Both callers now have a folio, so pass it in and use it throughout. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Link: https://lore.kernel.org/r/20241025190822.1319162-7-willy@infradead.org Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-11-05 17:19:59 +01:00
Matthew Wilcox (Oracle)	4d3727fd06	ecryptfs: Convert ecryptfs_write() to use a folio Remove ecryptfs_get_locked_page() and call read_mapping_folio() directly. Use the folio throughout this function. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Link: https://lore.kernel.org/r/20241025190822.1319162-6-willy@infradead.org Reviewed-by: Pankaj Raghav <p.raghav@samsung.com> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-11-05 17:19:59 +01:00
Matthew Wilcox (Oracle)	890d477a0f	ecryptfs: Convert ecryptfs_read_lower_page_segment() to take a folio All callers have a folio, so pass it in and use it directly. This will not work for large folios, but I doubt anybody wants to use large folios with ecryptfs. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Link: https://lore.kernel.org/r/20241025190822.1319162-5-willy@infradead.org Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-11-05 17:19:59 +01:00
Matthew Wilcox (Oracle)	497eb79c31	ecryptfs: Convert ecryptfs_copy_up_encrypted_with_header() to take a folio Both callers have a folio, so pass it in and use it throughout. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Link: https://lore.kernel.org/r/20241025190822.1319162-4-willy@infradead.org Reviewed-by: Pankaj Raghav <p.raghav@samsung.com> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-11-05 17:19:58 +01:00
Matthew Wilcox (Oracle)	064fe6b475	ecryptfs: Use a folio throughout ecryptfs_read_folio() Remove the conversion to a struct page. Removes a few hidden calls to compound_head(). Use 'err' instead of 'rc' for clarity. Also remove the unnecessary call to ClearPageUptodate(); the uptodate flag is already clear if this function is being called. That lets us switch to folio_end_read() which does one atomic flag operation instead of two. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Link: https://lore.kernel.org/r/20241025190822.1319162-3-willy@infradead.org Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-11-05 17:19:58 +01:00
Matthew Wilcox (Oracle)	807a11dab9	ecryptfs: Convert ecryptfs_writepage() to ecryptfs_writepages() By adding a ->migrate_folio implementation, there is no need to keep the ->writepage implementation. The new writepages removes the unnecessary call to SetPageUptodate(); the folio should already be uptodate at this point. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Link: https://lore.kernel.org/r/20241025190822.1319162-2-willy@infradead.org Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-11-05 17:19:58 +01:00
Christoph Hellwig	fe4e0faac9	xfs: remove xfs_page_mkwrite_iomap_ops Share the regular buffered write iomap_ops with the page fault path and just check for the IOMAP_FAULT flag to skip delalloc punching. This keeps the delalloc punching checks in one place, and will make it easier to convert iomap to an iter model where the begin and end handlers are merged into a single callback. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2024-11-05 13:52:57 +01:00
Christoph Hellwig	a7fd3327d3	xfs: remove __xfs_filemap_fault xfs_filemap_huge_fault only ever serves DAX faults, so hard code the call to xfs_dax_read_fault and open code __xfs_filemap_fault in the only remaining caller. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2024-11-05 13:52:57 +01:00
Christoph Hellwig	1eb6fc0447	xfs: split write fault handling out of __xfs_filemap_fault Only two of the callers of __xfs_filemap_fault ever handle read faults. Split the write_fault handling out of __xfs_filemap_fault so that all callers call that directly either conditionally or unconditionally and only leave the read fault handling in __xfs_filemap_fault. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2024-11-05 13:52:57 +01:00
Christoph Hellwig	1171de3296	xfs: split the page fault trace event Split the xfs_filemap_fault trace event into separate ones for read and write faults and move them into the applicable locations. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2024-11-05 13:52:57 +01:00
Dave Chinner	59e43f5479	xfs: sb_spino_align is not verified It's just read in from the superblock and used without doing any validity checks at all on the value. Fixes: `fb4f2b4e5a` ("xfs: add sparse inode chunk alignment superblock field") Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2024-11-05 13:51:59 +01:00
Christoph Hellwig	792ef2745d	xfs: simplify sector number calculation in xfs_zero_extent xfs_zero_extent does some really odd gymnastics to calculate the block layer sector numbers passed to blkdev_issue_zeroout. This is because it used to call sb_issue_zeroout and the calculations in that helper got open coded here in the rather misleadingly named commit `3dc2916107` ("dax: use sb_issue_zerout instead of calling dax_clear_sectors"). Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2024-11-05 13:51:59 +01:00
Long Li	8b9b261594	xfs: remove the redundant xfs_alloc_log_agf There are two invocations of xfs_alloc_log_agf in xfs_alloc_put_freelist. The AGF does not change between the two calls. Although this does not pose any practical problems, it seems like a small mistake. Therefore, fix it by removing the first xfs_alloc_log_agf invocation. Signed-off-by: Long Li <leo.lilong@huawei.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2024-11-05 13:51:55 +01:00
Namjae Jeon	0a77d947f5	ksmbd: check outstanding simultaneous SMB operations If a client sends simultaneous SMB operations to ksmbd, it exhausts too much memory through the "ksmbd_work_cache". This will cause an OOM issue. ksmbd has a credit mechanism but it can't handle this problem. This patch adds a check for exceeding max credits to prevent this problem, by assuming that one SMB request consumes at least one credit. Cc: stable@vger.kernel.org # v5.15+ Reported-by: Norbert Szetei <norbert@doyensec.com> Tested-by: Norbert Szetei <norbert@doyensec.com> Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>	2024-11-05 09:26:38 +09:00
Namjae Jeon	b8fc56fbca	ksmbd: fix slab-use-after-free in smb3_preauth_hash_rsp ksmbd_user_session_put should be called under smb3_preauth_hash_rsp(). It will avoid freeing session before calling smb3_preauth_hash_rsp(). Cc: stable@vger.kernel.org # v5.15+ Reported-by: Norbert Szetei <norbert@doyensec.com> Tested-by: Norbert Szetei <norbert@doyensec.com> Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>	2024-11-05 09:26:37 +09:00
Namjae Jeon	0a77715db2	ksmbd: fix slab-use-after-free in ksmbd_smb2_session_create There is a race condition between ksmbd_smb2_session_create and ksmbd_expire_session. This patch adds the missing sessions_table_lock while adding/deleting a session from the global session table. Cc: stable@vger.kernel.org # v5.15+ Reported-by: Norbert Szetei <norbert@doyensec.com> Tested-by: Norbert Szetei <norbert@doyensec.com> Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>	2024-11-05 09:26:35 +09:00
John Garry	3af5298ce9	xfs: Support setting FMODE_CAN_ATOMIC_WRITE Set FMODE_CAN_ATOMIC_WRITE flag if we can atomic write for that inode. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: John Garry <john.g.garry@oracle.com> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: John Garry <john.g.garry@oracle.com> Tested-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> #On ppc64	2024-11-04 16:22:11 -08:00
John Garry	f096207d32	xfs: Validate atomic writes Validate that an atomic write adheres to length/offset rules. Currently we can only write a single FS block. For an IOCB with IOCB_ATOMIC set to get as far as xfs_file_write_iter(), FMODE_CAN_ATOMIC_WRITE will need to be set for the file; for this, ATOMICWRITES flags would also need to be set for the inode. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: John Garry <john.g.garry@oracle.com> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-11-04 16:22:10 -08:00
John Garry	6432c6e723	xfs: Support atomic write for statx Support providing info on atomic write unit min and max for an inode. For simplicity, currently we limit the min at the FS block size. As for max, we limit also at FS block size, as there is no current method to guarantee extent alignment or granularity for regular files. Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: John Garry <john.g.garry@oracle.com> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-11-04 16:22:10 -08:00
John Garry	9e0933c21c	fs: iomap: Atomic write support Support direct I/O atomic writes by producing a single bio with REQ_ATOMIC flag set. Initially FSes (XFS) should only support writing a single FS block atomically. As with any atomic write, we should produce a single bio which covers the complete write length. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> [djwong: clarify a couple of things in the docs] Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-11-04 16:14:02 -08:00
John Garry	a570bad16b	fs: Export generic_atomic_write_valid() The XFS code will need this. Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: John Garry <john.g.garry@oracle.com> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-11-04 16:14:02 -08:00
Mike Snitzer	867da60d46	nfs: avoid i_lock contention in nfs_clear_invalid_mapping Multi-threaded buffered reads to the same file exposed significant inode spinlock contention in nfs_clear_invalid_mapping(). Eliminate this spinlock contention by checking flags without locking, instead using smp_rmb and smp_load_acquire accordingly, but then take spinlock and double-check these inode flags. Also refactor nfs_set_cache_invalid() slightly to use smp_store_release() to pair with nfs_clear_invalid_mapping()'s smp_load_acquire(). While this fix is beneficial for all multi-threaded buffered reads issued by an NFS client, this issue was identified in the context of surprisingly low LOCALIO performance with 4K multi-threaded buffered read IO. This fix dramatically speeds up LOCALIO performance: before: read: IOPS=1583k, BW=6182MiB/s (6482MB/s)(121GiB/20002msec) after: read: IOPS=3046k, BW=11.6GiB/s (12.5GB/s)(232GiB/20001msec) Fixes: `17dfeb9113` ("NFS: Fix races in nfs_revalidate_mapping") Signed-off-by: Mike Snitzer <snitzer@kernel.org> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>	2024-11-04 10:24:19 -05:00
Mike Snitzer	bc29408695	nfs_common: fix localio to cope with racing nfs_local_probe() Fix the possibility of racing nfs_local_probe() resulting in: list_add double add: new=ffff8b99707f9f58, prev=ffff8b99707f9f58, next=ffffffffc0f30000. ------------[ cut here ]------------ kernel BUG at lib/list_debug.c:35! Add nfs_uuid_init() to properly initialize all nfs_uuid_t members (particularly its list_head). Switch to returning bool from nfs_uuid_begin(), returns false if nfs_uuid_t is already in-use (its list_head is on a list). Update nfs_local_probe() to return early if the nfs_client's cl_uuid (nfs_uuid_t) is in-use. Also, switch nfs_uuid_begin() from using list_add_tail_rcu() to list_add_tail() -- rculist was used in an earlier version of the localio code that had a lockless nfs_uuid_lookup interface. Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>	2024-11-04 10:24:19 -05:00
Trond Myklebust	40f45ab381	NFS: Further fixes to attribute delegation a/mtime changes When asked to set both an atime and an mtime to the current system time, ensure that the setting is atomic by calling inode_update_timestamps() only once with the appropriate flags. Fixes: `e12912d941` ("NFSv4: Add support for delegated atime and mtime attributes") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>	2024-11-04 10:24:19 -05:00
Trond Myklebust	d054c5eb28	NFS: Fix attribute delegation behaviour on exclusive create When the client does an exclusive create and the server decides to store the verifier in the timestamps, a SETATTR is subsequently sent to fix up those timestamps. When that is the case, suppress the exceptions for attribute delegations in nfs4_bitmap_copy_adjust(). Fixes: `32215c1f89` ("NFSv4: Don't request atime/mtime/size if they are delegated to us") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>	2024-11-04 10:24:19 -05:00
Roberto Sassu	dc270d7159	nfs: Fix KMSAN warning in decode_getfattr_attrs() Fix the following KMSAN warning: CPU: 1 UID: 0 PID: 7651 Comm: cp Tainted: G B Tainted: [B]=BAD_PAGE Hardware name: QEMU Standard PC (Q35 + ICH9, 2009) ===================================================== ===================================================== BUG: KMSAN: uninit-value in decode_getfattr_attrs+0x2d6d/0x2f90 decode_getfattr_attrs+0x2d6d/0x2f90 decode_getfattr_generic+0x806/0xb00 nfs4_xdr_dec_getattr+0x1de/0x240 rpcauth_unwrap_resp_decode+0xab/0x100 rpcauth_unwrap_resp+0x95/0xc0 call_decode+0x4ff/0xb50 __rpc_execute+0x57b/0x19d0 rpc_execute+0x368/0x5e0 rpc_run_task+0xcfe/0xee0 nfs4_proc_getattr+0x5b5/0x990 __nfs_revalidate_inode+0x477/0xd00 nfs_access_get_cached+0x1021/0x1cc0 nfs_do_access+0x9f/0xae0 nfs_permission+0x1e4/0x8c0 inode_permission+0x356/0x6c0 link_path_walk+0x958/0x1330 path_lookupat+0xce/0x6b0 filename_lookup+0x23e/0x770 vfs_statx+0xe7/0x970 vfs_fstatat+0x1f2/0x2c0 __se_sys_newfstatat+0x67/0x880 __x64_sys_newfstatat+0xbd/0x120 x64_sys_call+0x1826/0x3cf0 do_syscall_64+0xd0/0x1b0 entry_SYSCALL_64_after_hwframe+0x77/0x7f The KMSAN warning is triggered in decode_getfattr_attrs(), when calling decode_attr_mdsthreshold(). It appears that fattr->mdsthreshold is not initialized. Fix the issue by initializing fattr->mdsthreshold to NULL in nfs_fattr_init(). Cc: stable@vger.kernel.org # v3.5.x Fixes: `88034c3d88` ("NFSv4.1 mdsthreshold attribute xdr") Signed-off-by: Roberto Sassu <roberto.sassu@huawei.com> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>	2024-11-04 10:24:18 -05:00
NeilBrown	6e2a10343e	NFSv3: only use NFS timeout for MOUNT when protocols are compatible If a timeout is specified in the mount options, it currently applies to both the NFS protocol and (with v3) the MOUNT protocol. This is sensible when they both use the same underlying protocol, or those protocols are compatible w.r.t timeouts as RDMA and TCP are. However if, for example, NFS is using TCP and MOUNT is using UDP then using the same timeout doesn't make much sense. If you mount -o vers=3,proto=tcp,mountproto=udp,timeo=600,retrans=5 \ server:/path /mountpoint then the timeo=600 which was intended for the NFS/TCP request will apply to the MOUNT/UDP requests with the result that there will only be one request sent (because UDP has a maximum timeout of 60 seconds). This is not what a reasonable person might expect. This patch disables the sharing of timeout information in cases where the underlying protocols are not compatible. Fixes: `c9301cb35b` ("nfs: hornor timeo and retrans option when mounting NFSv3") Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>	2024-11-04 10:24:18 -05:00
Kuniyuki Iwashima	ef7134c7fc	smb: client: Fix use-after-free of network namespace. Recently, we got a customer report that CIFS triggers oops while reconnecting to a server. [0] The workload runs on Kubernetes, and some pods mount CIFS servers in non-root network namespaces. The problem rarely happened, but it was always while the pod was dying. The root cause is wrong reference counting for network namespace. CIFS uses kernel sockets, which do not hold refcnt of the netns that the socket belongs to. That means CIFS must ensure the socket is always freed before its netns; otherwise, use-after-free happens. The repro steps are roughly: 1. mount CIFS in a non-root netns 2. drop packets from the netns 3. destroy the netns 4. unmount CIFS We can reproduce the issue quickly with the script [1] below and see the splat [2] if CONFIG_NET_NS_REFCNT_TRACKER is enabled. When the socket is TCP, it is hard to guarantee the netns lifetime without holding refcnt due to async timers. Let's hold netns refcnt for each socket as done for SMC in commit `9744d2bf19` ("smc: Fix use-after-free in tcp_write_timer_handler()."). Note that we need to move put_net() from cifs_put_tcp_session() to clean_demultiplex_info(); otherwise, __sock_create() still could touch a freed netns while cifsd tries to reconnect from cifs_demultiplex_thread(). Also, maybe_get_net() cannot be put just before __sock_create() because the code is not under RCU and there is a small chance that the same address happened to be reallocated to another netns. [0]: CIFS: VFS: \\XXXXXXXXXXX has not responded in 15 seconds. Reconnecting... 
CIFS: Serverclose failed 4 times, giving up Unable to handle kernel paging request at virtual address 14de99e461f84a07 Mem abort info: ESR = 0x0000000096000004 EC = 0x25: DABT (current EL), IL = 32 bits SET = 0, FnV = 0 EA = 0, S1PTW = 0 FSC = 0x04: level 0 translation fault Data abort info: ISV = 0, ISS = 0x00000004 CM = 0, WnR = 0 [14de99e461f84a07] address between user and kernel address ranges Internal error: Oops: 0000000096000004 [#1] SMP Modules linked in: cls_bpf sch_ingress nls_utf8 cifs cifs_arc4 cifs_md4 dns_resolver tcp_diag inet_diag veth xt_state xt_connmark nf_conntrack_netlink xt_nat xt_statistic xt_MASQUERADE xt_mark xt_addrtype ipt_REJECT nf_reject_ipv4 nft_chain_nat nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xt_comment nft_compat nf_tables nfnetlink overlay nls_ascii nls_cp437 sunrpc vfat fat aes_ce_blk aes_ce_cipher ghash_ce sm4_ce_cipher sm4 sm3_ce sm3 sha3_ce sha512_ce sha512_arm64 sha1_ce ena button sch_fq_codel loop fuse configfs dmi_sysfs sha2_ce sha256_arm64 dm_mirror dm_region_hash dm_log dm_mod dax efivarfs CPU: 5 PID: 2690970 Comm: cifsd Not tainted 6.1.103-109.184.amzn2023.aarch64 #1 Hardware name: Amazon EC2 r7g.4xlarge/, BIOS 1.0 11/1/2018 pstate: 00400005 (nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) pc : fib_rules_lookup+0x44/0x238 lr : __fib_lookup+0x64/0xbc sp : ffff8000265db790 x29: ffff8000265db790 x28: 0000000000000000 x27: 000000000000bd01 x26: 0000000000000000 x25: ffff000b4baf8000 x24: ffff00047b5e4580 x23: ffff8000265db7e0 x22: 0000000000000000 x21: ffff00047b5e4500 x20: ffff0010e3f694f8 x19: 14de99e461f849f7 x18: 0000000000000000 x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000 x14: 0000000000000000 x13: 0000000000000000 x12: 3f92800abd010002 x11: 0000000000000001 x10: ffff0010e3f69420 x9 : ffff800008a6f294 x8 : 0000000000000000 x7 : 0000000000000006 x6 : 0000000000000000 x5 : 0000000000000001 x4 : ffff001924354280 x3 : ffff8000265db7e0 x2 : 0000000000000000 x1 : ffff0010e3f694f8 x0 : 
ffff00047b5e4500 Call trace: fib_rules_lookup+0x44/0x238 __fib_lookup+0x64/0xbc ip_route_output_key_hash_rcu+0x2c4/0x398 ip_route_output_key_hash+0x60/0x8c tcp_v4_connect+0x290/0x488 __inet_stream_connect+0x108/0x3d0 inet_stream_connect+0x50/0x78 kernel_connect+0x6c/0xac generic_ip_connect+0x10c/0x6c8 [cifs] __reconnect_target_unlocked+0xa0/0x214 [cifs] reconnect_dfs_server+0x144/0x460 [cifs] cifs_reconnect+0x88/0x148 [cifs] cifs_readv_from_socket+0x230/0x430 [cifs] cifs_read_from_socket+0x74/0xa8 [cifs] cifs_demultiplex_thread+0xf8/0x704 [cifs] kthread+0xd0/0xd4 Code: aa0003f8 f8480f13 eb18027f 540006c0 (b9401264) [1]: CIFS_CRED="/root/cred.cifs" CIFS_USER="Administrator" CIFS_PASS="Password" CIFS_IP="X.X.X.X" CIFS_PATH="//${CIFS_IP}/Users/Administrator/Desktop/CIFS_TEST" CIFS_MNT="/mnt/smb" DEV="enp0s3" cat <<EOF > ${CIFS_CRED} username=${CIFS_USER} password=${CIFS_PASS} domain=EXAMPLE.COM EOF unshare -n bash -c " mkdir -p ${CIFS_MNT} ip netns attach root 1 ip link add eth0 type veth peer veth0 netns root ip link set eth0 up ip -n root link set veth0 up ip addr add 192.168.0.2/24 dev eth0 ip -n root addr add 192.168.0.1/24 dev veth0 ip route add default via 192.168.0.1 dev eth0 ip netns exec root sysctl net.ipv4.ip_forward=1 ip netns exec root iptables -t nat -A POSTROUTING -s 192.168.0.2 -o ${DEV} -j MASQUERADE mount -t cifs ${CIFS_PATH} ${CIFS_MNT} -o vers=3.0,sec=ntlmssp,credentials=${CIFS_CRED},rsize=65536,wsize=65536,cache=none,echo_interval=1 touch ${CIFS_MNT}/a.txt ip netns exec root iptables -t nat -D POSTROUTING -s 192.168.0.2 -o ${DEV} -j MASQUERADE " umount ${CIFS_MNT} [2]: ref_tracker: net notrefcnt@000000004bbc008d has 1/1 users at sk_alloc (./include/net/net_namespace.h:339 net/core/sock.c:2227) inet_create (net/ipv4/af_inet.c:326 net/ipv4/af_inet.c:252) __sock_create (net/socket.c:1576) generic_ip_connect (fs/smb/client/connect.c:3075) cifs_get_tcp_session.part.0 (fs/smb/client/connect.c:3160 fs/smb/client/connect.c:1798) cifs_mount_get_session 
(fs/smb/client/trace.h:959 fs/smb/client/connect.c:3366) dfs_mount_share (fs/smb/client/dfs.c:63 fs/smb/client/dfs.c:285) cifs_mount (fs/smb/client/connect.c:3622) cifs_smb3_do_mount (fs/smb/client/cifsfs.c:949) smb3_get_tree (fs/smb/client/fs_context.c:784 fs/smb/client/fs_context.c:802 fs/smb/client/fs_context.c:794) vfs_get_tree (fs/super.c:1800) path_mount (fs/namespace.c:3508 fs/namespace.c:3834) __x64_sys_mount (fs/namespace.c:3848 fs/namespace.c:4057 fs/namespace.c:4034 fs/namespace.c:4034) do_syscall_64 (arch/x86/entry/common.c:52 arch/x86/entry/common.c:83) entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130) Fixes: `26abe14379` ("net: Modify sk_alloc to not reference count the netns of kernel sockets.") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Acked-by: Tom Talpey <tom@talpey.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2024-11-03 19:28:31 -06:00
Linus Torvalds	a8cc743272	17 hotfixes. 9 are cc:stable. 13 are MM and 4 are non-MM. The usual collection of singletons - please see the changelogs. -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZyfGDAAKCRDdBJ7gKXxA jr19AQD6bfDF/6L2Alq1QG26pgrgccEbKzDSzR6pBajwCbdrNQD/XPhiv3zRJfGf lgt0Qkqwe/ApBhVYUnL8y1CePv3EDgA= =W5W0 -----END PGP SIGNATURE----- Merge tag 'mm-hotfixes-stable-2024-11-03-10-50' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "17 hotfixes. 9 are cc:stable. 13 are MM and 4 are non-MM. The usual collection of singletons - please see the changelogs" * tag 'mm-hotfixes-stable-2024-11-03-10-50' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: mm: multi-gen LRU: use {ptep,pmdp}_clear_young_notify() mm: multi-gen LRU: remove MM_LEAF_OLD and MM_NONLEAF_TOTAL stats mm, mmap: limit THP alignment of anonymous mappings to PMD-aligned sizes mm: shrinker: avoid memleak in alloc_shrinker_info .mailmap: update e-mail address for Eugen Hristev vmscan,migrate: fix page count imbalance on node stats when demoting pages mailmap: update Jarkko's email addresses mm: allow set/clear page_type again nilfs2: fix potential deadlock with newly created symlinks Squashfs: fix variable overflow in squashfs_readpage_block kasan: remove vmalloc_percpu test tools/mm: -Werror fixes in page-types/slabinfo mm, swap: avoid over reclaim of full clusters mm: fix PSWPIN counter for large folios swap-in mm: avoid VM_BUG_ON when try to map an anon large folio to zero page. mm/codetag: fix null pointer check logic for ref and tag mm/gup: stop leaking pinned pages in low memory conditions	2024-11-03 10:25:05 -10:00
Al Viro	a71874379e	xattr: switch to CLASS(fd) Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2024-11-03 13:29:31 -05:00
Al Viro	8935989798	do_pollfd(): convert to CLASS(fd) lift setting ->revents into the caller, so that failure exits (including the early one) would be plain returns. We need the scope of our struct fd to end before the store to ->revents, since that's shared with the failure exits prior to the point where we can do fdget(). Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2024-11-03 01:28:07 -05:00
Al Viro	d000e073ca	convert do_select() take the logic from fdget() to fdput() into an inlined helper - with existing wait_key_set() subsumed into that. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2024-11-03 01:28:07 -05:00
Al Viro	6b1a5ae9b5	convert vfs_dedupe_file_range(). fdput() is followed by checking fatal_signal_pending() (and aborting the loop in such case). fdput() is transposable with that check. Yes, it'll probably end up with slightly fatter code (call after the check has returned false + call on the almost never taken out-of-line path instead of one call before the check), but it's not worth bothering with explicit extra scope there (or dragging the check into the loop condition, for that matter). Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2024-11-03 01:28:07 -05:00
Al Viro	9bd812744d	convert cifs_ioctl_copychunk() fdput() moved past mnt_drop_file_write(); harmless, if somewhat cringeworthy. Reordering could be avoided either by adding an explicit scope or by making mnt_drop_file_write() called via __cleanup. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2024-11-03 01:28:07 -05:00
Al Viro	20d9eb3b87	convert do_preadv()/do_pwritev() fdput() can be transposed with add_{r,w}char() and inc_sysc{r,w}(); it's the same story as with do_readv()/do_writev(), only with fdput() instead of fdput_pos(). Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2024-11-03 01:28:06 -05:00
Al Viro	8152f82010	fdget(), more trivial conversions all failure exits prior to fdget() leave the scope, all matching fdput() are immediately followed by leaving the scope. [xfs_ioc_commit_range() chunk moved here as well] Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2024-11-03 01:28:06 -05:00
Al Viro	6348be02ee	fdget(), trivial conversions fdget() is the first thing done in scope, all matching fdput() are immediately followed by leaving the scope. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2024-11-03 01:28:06 -05:00
Al Viro	554ceb7a5e	o2hb_region_dev_store(): avoid goto around fdget()/fdput() Preparation for CLASS(fd) conversion. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2024-11-03 01:28:06 -05:00
Al Viro	d7a9616ce0	introduce "fd_pos" class, convert fdget_pos() users to it. fdget_pos() for constructor, fdput_pos() for cleanup, all users of fd..._pos() converted trivially. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2024-11-03 01:28:06 -05:00
Al Viro	048181992c	fdget_raw() users: switch to CLASS(fd_raw) Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2024-11-03 01:28:06 -05:00
Al Viro	a6f46579d7	convert vmsplice() to CLASS(fd) Irregularity here is fdput() not in the same scope as fdget(); we could just lift it out of vmsplice_type() into vmsplice(2), but there's not much point keeping vmsplice_type() separate after that... Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2024-11-03 01:28:06 -05:00
Al Viro	0d113fcbc2	simplify xfs_find_handle() a bit XFS_IOC_FD_TO_HANDLE can grab a reference to copied ->f_path and let the file go; results in simpler control flow - cleanup is the same for both "by descriptor" and "by pathname" cases. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2024-11-03 01:28:06 -05:00
Al Viro	919a7a1aac	timerfd: switch to CLASS(fd) Fold timerfd_fget() into both callers to have fdget() and fdput() in the same scope. Could be done in different ways, but this is probably the smallest solution. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2024-11-03 01:28:06 -05:00
Al Viro	05e555642c	regularize emptiness checks in finit_module(2) and vfs_dedupe_file_range() With few exceptions emptiness checks are done as fd_file(...) in boolean context (usually something like if (!fd_file(f))...); those will be taken care of later. However, there's a couple of places where we do those checks as 'store fd_file(...) into a variable, then check if this variable is NULL' and those are harder to spot. Get rid of those now. Use fd_empty() instead of extracting file and then checking it for NULL. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2024-11-03 01:28:06 -05:00
Linus Torvalds	3e5e6c9900	nfsd-6.12 fixes: - Fix two async COPY bugs found during NFS bake-a-thon - Fix an svcrdma memory leak -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmcmT5wACgkQM2qzM29m f5coUg/9FQMf1IeXhyGlDtV0ELUxkHoXaZ2T6zhfXFoLmyl/DU4AWMH4YbSAqk2M NnR47sVsM08tIwE3KYQgYzLbbFF41nQUa1sckeel+nYGtpcN6IwOlU5LYYpNNFeQ vsECxV78BA6FGdjaXwQ07r4G6lpVhCCqM/RZpDrwNSyoIWVLo77KBUVCSoQb5wzG z7OBvO9M7HfVJVOHcPd+tVcZaGAF0fhW812fibZQKV2mrWdhOOe+gWVs8ro3tmm1 GocbTTQW2hlYcLCZPe1przTI9flfwon6Lk8TmIZuU5IrzcaAB+U3P140aKgl9427 v4WdLuKYlKi+xISBdRG3omyaLroNUs8IHW4KoBXAW3FinyLzNsAyoPxb02m7SEge sOJ/gbeLtb2u+ur4wAp4gDmVKfg3TGyh05Hdt96LXsbQUuWIlwEcPurl+nY93Eoq vrPLIdPOXrOD5jBIaVQkBYlaCn04mDg+VTNbG9hW1wjorVFpWKS7MwCjVloXSVIn uE++cVpQtIKp8aTYBbFVXqtVREatczl++f+Npnlm8xlcquDbaORkk4ZBOs4vQuHo pNuZcWO0rIBR6hakr44OjTLnJBIwChPBYvBVgtq1E6oAbwHvC2SVXbiJB8IcCPOx nB2jnF0/tpTs2LnrHAxdAGdU9Om6RGmahfz/uwh/8djdbCVH+Ik= =NiQh -----END PGP SIGNATURE----- Merge tag 'nfsd-6.12-3' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux Pull nfsd fixes from Chuck Lever: - Fix two async COPY bugs found during NFS bake-a-thon - Fix an svcrdma memory leak * tag 'nfsd-6.12-3' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: rpcrdma: Always release the rpcrdma_device's xa_array NFSD: Never decrement pending_async_copies on error NFSD: Initialize struct nfsd4_copy earlier	2024-11-02 09:27:11 -10:00
Linus Torvalds	f6a7b4ec74	XFS bug fixes for 6.12-rc6 * fix a syzbot reported crash on filestreams * Reduce cpu time spent searching for extents in a very fragmented FS * Check for delayed allocations before setting extsize Signed-off-by: Carlos Maiolino <cem@kernel.org> -----BEGIN PGP SIGNATURE----- iJUEABMJAB0WIQQMHYkcUKcy4GgPe2RGdaER5QtfpgUCZyIMDwAKCRBGdaER5Qtf pllxAYCkk+mtDTD5xBfOVGZWO5MMFz8HqYcro5wrSCzgL8HDmW29kXTBYFviGn3R 3l/H6BEBgOk0EkI5qGOzijpzbsWyJeLzPzZtxQFPD8zFBdxSERCtbpqFDLLvLQWG M+TLhUNkPQ== =kKX4 -----END PGP SIGNATURE----- Merge tag 'xfs-6.12-fixes-6' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux Pull xfs fixes from Carlos Maiolino: - fix a syzbot reported crash on filestreams - Reduce cpu time spent searching for extents in a very fragmented FS - Check for delayed allocations before setting extsize * tag 'xfs-6.12-fixes-6' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: xfs: streamline xfs_filestream_pick_ag xfs: fix finding a last resort AG in xfs_filestream_pick_ag xfs: Reduce unnecessary searches when searching for the best extents xfs: Check for delayed allocations before setting extsize	2024-11-02 09:22:16 -10:00
Linus Torvalds	17fa6a5f93	vfs-6.12-rc6.iomap -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZyTGVAAKCRCRxhvAZXjc oltEAP9r8cWa3Tdv8DzMNWu/jezTUXoW/mX5Qe+c1L6faqj0WQD/dIVtBtG37Tfq 3Ci9F/GEWjKijtCQ5lwMGUq27jQJ1gk= =/0iA -----END PGP SIGNATURE----- Merge tag 'vfs-6.12-rc6.iomap' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs Pull iomap fixes from Christian Brauner: "Fixes for iomap to prevent data corruption bugs in the fallocate unshare range implementation of fsdax and a small cleanup to turn iomap_want_unshare_iter() into an inline function" * tag 'vfs-6.12-rc6.iomap' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs: iomap: turn iomap_want_unshare_iter into an inline function fsdax: dax_unshare_iter needs to copy entire blocks fsdax: remove zeroing code from dax_unshare_iter iomap: share iomap_unshare_iter predicate code with fsdax xfs: don't allocate COW extents when unsharing a hole	2024-11-01 07:45:00 -10:00
Linus Torvalds	d56239a82e	vfs-6.12-rc6.fixes -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZyTGAQAKCRCRxhvAZXjc opd6AQCal4omyfS8FYe4VRRZ/0XHouagq99I0U0TAmKkvoKAsgD/XrdE+pSTEkPX Pv4T9phh1cZRxcyKVu77UoYkuHJEDAg= =Lu9R -----END PGP SIGNATURE----- Merge tag 'vfs-6.12-rc6.fixes' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs Pull filesystem fixes from Christian Brauner: "VFS: - Fix copy_page_from_iter_atomic() if KMAP_LOCAL_FORCE_MAP=y is set - Add a get_tree_bdev_flags() helper that allows to modify e.g., whether errors are logged into the filesystem context during superblock creation. This is used by erofs to fix a userspace regression where an error is currently logged when it's used on a regular file, which is a newly allowed mode in erofs. netfs: - Fix the sysfs debug path in the documentation. - Fix iov_iter_get_pages() for folio queues by skipping the page extraction if we're at the end of a folio. afs: - Fix moving subdirectories to different parent directory. autofs: - Fix handling of AUTOFS_DEV_IOCTL_TIMEOUT_CMD ioctl in validate_dev_ioctl(). The actual ioctl number, not the ioctl command needs to be checked for autofs" * tag 'vfs-6.12-rc6.fixes' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs: iov_iter: fix copy_page_from_iter_atomic() if KMAP_LOCAL_FORCE_MAP autofs: fix thinko in validate_dev_ioctl() iov_iter: Fix iov_iter_get_pages*() for folio_queue afs: Fix missing subdir edit when renamed between parent dirs doc: correcting the debug path for cachefiles erofs: use get_tree_bdev_flags() to avoid misleading messages fs/super.c: introduce get_tree_bdev_flags()	2024-11-01 07:37:10 -10:00
Linus Torvalds	6b4926494e	for-6.12-rc5-tag -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmck8eQACgkQxWXV+ddt WDu05g/6AwrnvPkivC4iVOv4Wkzrpk4gm76smx91Y9B8tSDLI1pHaS27CvJz9iWl vBKXPN3PQVQHwo6SPn+NjsFOSMkXlbBOVKpPU+MlZwH9Tuw66qcC+EnUCK2wEuAy 3TN7cUGIA4r/j+SkhgIz+Irlr5pjdb1KkPIMBEVGcVFqDIuvDaTEGBqTn2i/V5aa dMn+gK+9rfngTOJ68t/pEFaX7SEWCvgMIcBpBB4/vs1gHm3ve2bcc1sBAdMxb1Se SrxgZfq+Rc5tkMn540JaWGwkb0rLzwXlurK6ygTKDKCpH0IMX+pBvDkexh9Zj0ux jejlRxiuDzTx3z2a7FjHDyp2sdZWMpq3sPsowpJ1Dsgi5EtSxTy4irmQuSAZY1Uj /uo6YwV9aTGeiNDwZeKqKc/wOuAttaMZLr14s37pro9KxndFJ/XZBxeyB+euUCOw B8AvAQVVIJAYQLyWINWruNKppqlgiO2RaN15RvvT2pX01d0TOx1KX1XFQku7YFxb M/8ZNXzJ96XtkeyHL3euo3zj7N5jWtnCvPINugUG1ADQa+bc8aX336gld1neD6fs QqIFIgzZG0l4N95viJilACrI6tW9zFnBqMyNFRhucKiX9aP9glOvhSfxfjcpDuQ/ i/LIyxVLwp8M3hPNvv8tC345+1C2ug9AD0OyhWjjIYPuiOxtTWs= =alpB -----END PGP SIGNATURE----- Merge tag 'for-6.12-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "A few more stability fixes. There's one patch adding export of MIPS cmpxchg helper, used in the error propagation fix. - fix error propagation from split bios to the original btrfs bio - fix merging of adjacent extents (normal operation, defragmentation) - fix potential use after free after freeing btrfs device structures" * tag 'for-6.12-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: fix defrag not merging contiguous extents due to merged extent maps btrfs: fix extent map merging not happening for adjacent extents btrfs: fix use-after-free of block device file in __btrfs_free_extra_devids() btrfs: fix error propagation of split bios MIPS: export __cmpxchg_small()	2024-11-01 07:31:47 -10:00
Linus Torvalds	7b83601da4	bcachefs fixes for 6.12-rc6 Various syzbot fixes, and the more notable ones: - Fix for pointers in an extent overflowing the max (16) on a filesystem with many devices: we were creating too many cached copies when moving data around. Now, we only create at most one cached copy if there's a promote target set. Caching will be a bit broken for reflinked data until 6.13: I have larger series queued up which significantly improves the plumbing for data options down into the extent (bch_extent_rebalance) to fix this. - Fix for deadlock on -ENOSPC on tiny filesystems Allocation from the partial open_bucket list wasn't correctly accounting partial open_buckets as free: this fixes the main cause of tests timing out in the automated tests. -----BEGIN PGP SIGNATURE----- iQJOBAABCAA4FiEEKnAFLkS8Qha+jvQrE6szbY3KbnYFAmckUrIaHGtlbnQub3Zl cnN0cmVldEBsaW51eC5kZXYACgkQE6szbY3KbnaFYA//Qd9SD8v+ypavnaogWhqk 3bufCO8YJDV5DRQVuX/z36fia8zOKzWQGRAYvq0vF0mmgagwBE+AcBh6vvfDCxqZ m1937IXcv/hHh2FFQau9gWEItTH9dGwyQjeDjB3xaTL5ZTGsAdA9558ygf8GAVOe wD+W8Z8Qj09hAErnNS7y50t/PGbZDuG7AV2Dy2unp+fp6U0FVrZ3Z0bhFuhxcR7/ e3j49DoW4EZL7Gu1svn7nzehjWK4wx1wX7QhynPgSOVIhdj2Fc3XG76b3mBsuZF6 A/cBRKmSZsYL9MBK0vferqizqeuwlIJsvwpo/6zzukpyf8QOl+0IqPuAXFoz8vg3 vrdp9cdvzWvQNexTD2+7PYosCKoUswOvo0oIy8Iopkg4VGSreZib1sZeCPzw2FBK AZcQaQSBLKojWpYsn9Dl2AlqEHHTvnopjr5wRXiimqKe/OcA3ugIvebUw2UE2ACp /Z2ZQu615BtRYQM+dRIJJQ2CAy0F3EZxIXEXwc/yrH7kL2VBay8QCKp/k/9YYy4e Nlxxw7alb/XGgT8GQgu24tho3yMKT621dLFOaAZ7x2HtLP8T56zL/L/wKWsocW/V R8Kwqot6F1EVb3Q0LECUJottYQ+5I1Et7ZpVyOPxfqF1y7KOsuxKOmZFLO7i3Spc fg0gOt/fyKrAF3zuSmWXne8= =hzm/ -----END PGP SIGNATURE----- Merge tag 'bcachefs-2024-10-31' of git://evilpiepirate.org/bcachefs Pull bcachefs fixes from Kent Overstreet: "Various syzbot fixes, and the more notable ones: - Fix for pointers in an extent overflowing the max (16) on a filesystem with many devices: we were creating too many cached copies when moving data around. 
Now, we only create at most one cached copy if there's a promote target set. Caching will be a bit broken for reflinked data until 6.13: I have larger series queued up which significantly improves the plumbing for data options down into the extent (bch_extent_rebalance) to fix this. - Fix for deadlock on -ENOSPC on tiny filesystems Allocation from the partial open_bucket list wasn't correctly accounting partial open_buckets as free: this fixes the main cause of tests timing out in the automated tests" * tag 'bcachefs-2024-10-31' of git://evilpiepirate.org/bcachefs: bcachefs: Fix NULL ptr dereference in btree_node_iter_and_journal_peek bcachefs: fix possible null-ptr-deref in __bch2_ec_stripe_head_get() bcachefs: Fix deadlock on -ENOSPC w.r.t. partial open buckets bcachefs: Don't filter partial list buckets in open_buckets_to_text() bcachefs: Don't keep tons of cached pointers around bcachefs: init freespace inited bits to 0 in bch2_fs_initialize bcachefs: Fix unhandled transaction restart in fallocate bcachefs: Fix UAF in bch2_reconstruct_alloc() bcachefs: fix null-ptr-deref in have_stripes() bcachefs: fix shift oob in alloc_lru_idx_fragmentation bcachefs: Fix invalid shift in validate_sb_layout()	2024-11-01 07:21:03 -10:00
Linus Torvalds	cb80d9074f	fs: optimize acl_permission_check() generic_permission() turned out to be costlier than expected. The reason is that acl_permission_check() performs owner checks that involve costly pointer dereferences. There's already code that skips expensive group checks if the group and other permission bits are the same with respect to the mask that is checked. This logic can be extended to also shortcut permission checking in acl_permission_check(). If the inode doesn't have POSIX ACLs then ownership doesn't matter. If there are no missing UGO permissions the permission check can be shortcut. Acked-by: Al Viro <viro@zeniv.linux.org.uk> Link: https://lore.kernel.org/all/CAHk-=wgXEoAOFRkDg+grxs+p1U+QjWXLixRGmYEfd=vG+OBuFw@mail.gmail.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-11-01 14:12:34 +01:00
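The shortcut described in the commit above can be illustrated with a small model. This is only a sketch under stated assumptions, not the kernel code: the function name `can_skip_owner_checks` and the `has_posix_acl` parameter are hypothetical, and the octal bit trick is just one way to test that none of the requested permission bits is missing from the user, group, or other triplets.

```python
# Permission mask bits, matching the usual rwx encoding (r=4, w=2, x=1).
MAY_EXEC, MAY_WRITE, MAY_READ = 1, 2, 4


def can_skip_owner_checks(mode: int, mask: int, has_posix_acl: bool) -> bool:
    """Return True when owner/group lookups cannot change the outcome.

    Hypothetical model: if every one of user, group, and other already
    grants the requested bits and the inode has no POSIX ACL, the caller's
    identity is irrelevant and access can be granted immediately.
    """
    # Spread the requested rwx bits across all three triplets (u, g, o).
    requested = (mask & 7) * 0o111
    # Bits that were requested but are not set in the inode mode.
    missing = requested & ~mode
    return missing == 0 and not has_posix_acl


if __name__ == "__main__":
    for mode, mask in [(0o644, MAY_READ), (0o644, MAY_WRITE), (0o755, MAY_EXEC)]:
        print(oct(mode), mask, can_skip_owner_checks(mode, mask, False))
```

For a 0644 file a read request takes the shortcut, while a write request does not, because group and other lack the write bit and the owner check is still needed.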
Kalesh Singh	e4d32142d1	tracing: Fix tracefs mount options Commit `78ff640819` ("vfs: Convert tracefs to use the new mount API") converted tracefs to use the new mount APIs, which caused mount options (e.g. gid=<gid>) to not take effect. The tracefs superblock can be updated from multiple paths: - on fs_initcall() to init_trace_printk_function_export() - from a work queue to initialize eventfs, tracer_init_tracefs_work_func() - fsconfig() syscall to mount or remount of tracefs The tracefs superblock root inode gets created early on in init_trace_printk_function_export(). With the new mount API, tracefs effectively uses get_tree_single() instead of the old API mount_single(). Previously, mount_single() ensured that the options are always applied to the superblock root inode: (1) If the root inode didn't exist, call fill_super() to create it and apply the options. (2) If the root inode exists, call reconfigure_single() which effectively calls tracefs_apply_options() to parse and apply options to the superblock's fs_info and inode and remount eventfs (if necessary). On the other hand, get_tree_single() effectively calls vfs_get_super() which: (3) If the root inode doesn't exist, calls fill_super() to create it and apply the options. (4) If the root inode already exists, updates the fs_context root with the superblock's root inode. (4) above is always the case for tracefs mounts, since the super block's root inode will already be created by init_trace_printk_function_export(). This means that the mount options get ignored: - Since they aren't applied to the superblock's root inode, they don't get inherited by the children. - Since eventfs is initialized from a separate work queue before the call to mount with the options, it doesn't get remounted for the mount. Ensure that the mount options are applied to the super block and eventfs is remounted to respect the mount options. 
To understand this better, if fstab has the following: tracefs /sys/kernel/tracing tracefs nosuid,nodev,noexec,gid=tracing 0 0 On boot up, permissions look like: # ls -l /sys/kernel/tracing/trace -rw-r----- 1 root root 0 Nov 1 08:37 /sys/kernel/tracing/trace When it should look like: # ls -l /sys/kernel/tracing/trace -rw-r----- 1 root tracing 0 Nov 1 08:37 /sys/kernel/tracing/trace Link: https://lore.kernel.org/r/536e99d3-345c-448b-adee-a21389d7ab4b@redhat.com/ Cc: Eric Sandeen <sandeen@redhat.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Ali Zahraee <ahzahraee@gmail.com> Cc: Christian Brauner <brauner@kernel.org> Cc: David Howells <dhowells@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: stable@vger.kernel.org Fixes: `78ff640819` ("vfs: Convert tracefs to use the new mount API") Link: https://lore.kernel.org/20241030171928.4168869-2-kaleshsingh@google.com Signed-off-by: Kalesh Singh <kaleshsingh@google.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2024-11-01 08:38:14 -04:00
Jakub Kicinski	5b1c965956	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Cross-merge networking fixes after downstream PR (net-6.12-rc6). Conflicts: drivers/net/wireless/intel/iwlwifi/mvm/mld-mac80211.c `cbe84e9ad5` ("wifi: iwlwifi: mvm: really send iwl_txpower_constraints_cmd") `188a1bf894` ("wifi: mac80211: re-order assigning channel in activate links") https://lore.kernel.org/all/20241028123621.7bbb131b@canb.auug.org.au/ net/mac80211/cfg.c `c4382d5ca1` ("wifi: mac80211: update the right link for tx power") `8dd0498983` ("wifi: mac80211: Fix setting txpower with emulate_chanctx") drivers/net/ethernet/intel/ice/ice_ptp_hw.h `6e58c33106` ("ice: fix crash on probe for DPLL enabled E810 LOM") `e4291b64e1` ("ice: Align E810T GPIO to other products") `ebb2693f8f` ("ice: Read SDP section from NVM for pin definitions") `ac532f4f42` ("ice: Cleanup unused declarations") https://lore.kernel.org/all/20241030120524.1ee1af18@canb.auug.org.au/ No adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-10-31 18:10:07 -07:00
Filipe Manana	77b0d113ee	btrfs: fix defrag not merging contiguous extents due to merged extent maps When running defrag (manual defrag) against a file that has extents that are contiguous and we already have the respective extent maps loaded and merged, we end up not defragging the range covered by those contiguous extents. This happens when we have an extent map that was the result of merging multiple extent maps for contiguous extents and the length of the merged extent map is greater than or equal to the defrag threshold length. The script below reproduces this scenario: $ cat test.sh #!/bin/bash DEV=/dev/sdi MNT=/mnt/sdi mkfs.btrfs -f $DEV mount $DEV $MNT # Create a 256K file with 4 extents of 64K each. xfs_io -f -c "falloc 0 64K" \ -c "pwrite 0 64K" \ -c "falloc 64K 64K" \ -c "pwrite 64K 64K" \ -c "falloc 128K 64K" \ -c "pwrite 128K 64K" \ -c "falloc 192K 64K" \ -c "pwrite 192K 64K" \ $MNT/foo umount $MNT echo -n "Initial number of file extent items: " btrfs inspect-internal dump-tree -t 5 $DEV | grep EXTENT_DATA | wc -l mount $DEV $MNT # Read the whole file in order to load and merge extent maps. cat $MNT/foo > /dev/null btrfs filesystem defragment -t 128K $MNT/foo umount $MNT echo -n "Number of file extent items after defrag with 128K threshold: " btrfs inspect-internal dump-tree -t 5 $DEV | grep EXTENT_DATA | wc -l mount $DEV $MNT # Read the whole file in order to load and merge extent maps. 
cat $MNT/foo > /dev/null btrfs filesystem defragment -t 256K $MNT/foo umount $MNT echo -n "Number of file extent items after defrag with 256K threshold: " btrfs inspect-internal dump-tree -t 5 $DEV | grep EXTENT_DATA | wc -l Running it: $ ./test.sh Initial number of file extent items: 4 Number of file extent items after defrag with 128K threshold: 4 Number of file extent items after defrag with 256K threshold: 4 The 4 extents don't get merged because we have an extent map with a size of 256K that is the result of merging the individual extent maps for each of the four 64K extents and at defrag_lookup_extent() we have a value of zero for the generation threshold ('newer_than' argument) since this is a manual defrag. As a consequence we don't call defrag_get_extent() to get an extent map representing a single file extent item in the inode's subvolume tree, so we end up using the merged extent map at defrag_collect_targets() and decide not to defrag. Fix this by updating defrag_lookup_extent() to always discard extent maps that were merged and call defrag_get_extent() regardless of the minimum generation threshold ('newer_than' argument). A test case for fstests will be sent along soon. CC: stable@vger.kernel.org # 6.1+ Fixes: `199257a78b` ("btrfs: defrag: don't use merged extent map for their generation check") Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-10-31 16:46:41 +01:00
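The decision described in the commit above can be modelled in a few lines. This is a simplified sketch, not the btrfs implementation: `ExtentMap`, `lookup_for_defrag`, and `count_defrag_targets` are hypothetical names, and the real target collection applies more conditions than a single length comparison.

```python
from dataclasses import dataclass
from typing import List

KIB = 1024


@dataclass
class ExtentMap:
    start: int
    length: int
    merged: bool = False  # True when built by merging several cached maps


def lookup_for_defrag(cached: ExtentMap, file_extent_items: List[ExtentMap],
                      discard_merged: bool) -> List[ExtentMap]:
    """Pick the extent maps that defrag will evaluate for a range.

    With discard_merged=False (the behaviour being fixed) the merged cached
    map is used as-is; with discard_merged=True it is dropped in favour of
    maps that each represent a single file extent item.
    """
    if cached.merged and discard_merged:
        return file_extent_items
    return [cached]


def count_defrag_targets(maps: List[ExtentMap], threshold: int) -> int:
    """Extents at or above the threshold are treated as large enough already."""
    return sum(1 for em in maps if em.length < threshold)


if __name__ == "__main__":
    items = [ExtentMap(i * 64 * KIB, 64 * KIB) for i in range(4)]
    cached = ExtentMap(0, 256 * KIB, merged=True)
    for discard in (False, True):
        maps = lookup_for_defrag(cached, items, discard)
        print(discard, count_defrag_targets(maps, 128 * KIB))
```

In this model, a 128K threshold selects no targets while the merged 256K map is used, and selects all four 64K extents once merged maps are discarded.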
Filipe Manana	a0f0625390	btrfs: fix extent map merging not happening for adjacent extents If we have 3 or more adjacent extents in a file, that is, consecutive file extent items pointing to adjacent extents, within a contiguous file range and compatible flags, we end up not merging all the extents into a single extent map. For example: $ mkfs.btrfs -f /dev/sdc $ mount /dev/sdc /mnt/sdc $ xfs_io -f -d -c "pwrite -b 64K 0 64K" \ -c "pwrite -b 64K 64K 64K" \ -c "pwrite -b 64K 128K 64K" \ -c "pwrite -b 64K 192K 64K" \ /mnt/sdc/foo After all the ordered extents complete we unpin the extent maps and try to merge them, but instead of getting a single extent map we get two because: 1) When the first ordered extent completes (file range [0, 64K)) we unpin its extent map and attempt to merge it with the extent map for the range [64K, 128K), but we can't because that extent map is still pinned; 2) When the second ordered extent completes (file range [64K, 128K)), we unpin its extent map and merge it with the previous extent map, for file range [0, 64K), but we can't merge with the next extent map, for the file range [128K, 192K), because this one is still pinned. The merged extent map for the file range [0, 128K) gets the flag EXTENT_MAP_MERGED set; 3) When the third ordered extent completes (file range [128K, 192K)), we unpin its extent map and attempt to merge it with the previous extent map, for file range [0, 128K), but we can't because that extent map has the flag EXTENT_MAP_MERGED set (mergeable_maps() returns false due to different flags) while the extent map for the range [128K, 192K) doesn't have that flag set. We also can't merge it with the next extent map, for file range [192K, 256K), because that one is still pinned. At this moment we have 3 extent maps: One for file range [0, 128K), with the flag EXTENT_MAP_MERGED set. One for file range [128K, 192K). 
One for file range [192K, 256K) which is still pinned; 4) When the fourth and final extent completes (file range [192K, 256K)), we unpin its extent map and attempt to merge it with the previous extent map, for file range [128K, 192K), which succeeds since none of these extent maps have the EXTENT_MAP_MERGED flag set. So we end up with 2 extent maps: One for file range [0, 128K), with the flag EXTENT_MAP_MERGED set. One for file range [128K, 256K), with the flag EXTENT_MAP_MERGED set. Since after merging extent maps we don't attempt to merge again, that is, merge the resulting extent map with the one that is now preceding it (and the one following it), we end up with those two extent maps, when we could have had a single extent map to represent the whole file. Fix this by making mergeable_maps() ignore the EXTENT_MAP_MERGED flag. While this doesn't present any functional issue, it prevents extent maps from being merged, which wastes memory, and it can also cause defrag to not merge extents (that will be addressed in the next patch). Fixes: `199257a78b` ("btrfs: defrag: don't use merged extent map for their generation check") CC: stable@vger.kernel.org # 6.1+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-10-31 16:45:16 +01:00
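The four-step sequence in the commit above can be replayed with a toy simulation. This is a sketch under simplifying assumptions, not the btrfs code: `ExtentMap`, `mergeable`, `complete_ordered_extent`, and `simulate` are hypothetical names, each map carries only a range plus `pinned` and `merged` booleans, and the `ignore_merged_flag` switch stands in for the change to the flag comparison.

```python
from dataclasses import dataclass
from typing import List, Tuple

KIB = 1024


@dataclass
class ExtentMap:
    start: int
    end: int               # exclusive
    pinned: bool = True    # still owned by an in-flight ordered extent
    merged: bool = False   # models the EXTENT_MAP_MERGED flag


def mergeable(a: ExtentMap, b: ExtentMap, ignore_merged_flag: bool) -> bool:
    """Two adjacent, unpinned maps merge only if their flags are compatible."""
    if a.pinned or b.pinned or a.end != b.start:
        return False
    return ignore_merged_flag or a.merged == b.merged


def complete_ordered_extent(maps: List[ExtentMap], index: int,
                            ignore_merged_flag: bool) -> None:
    """Unpin maps[index], then try its previous and next neighbour once each."""
    em = maps[index]
    em.pinned = False
    if index > 0 and mergeable(maps[index - 1], em, ignore_merged_flag):
        prev = maps.pop(index - 1)
        index -= 1
        em.start, em.merged = prev.start, True
    if index + 1 < len(maps) and mergeable(em, maps[index + 1], ignore_merged_flag):
        nxt = maps.pop(index + 1)
        em.end, em.merged = nxt.end, True


def simulate(ignore_merged_flag: bool) -> List[Tuple[int, int]]:
    """Complete four 64K ordered extents in file order; return ranges in KiB."""
    maps = [ExtentMap(i * 64 * KIB, (i + 1) * 64 * KIB) for i in range(4)]
    for offset in range(0, 256 * KIB, 64 * KIB):
        index = next(i for i, em in enumerate(maps) if em.start <= offset < em.end)
        complete_ordered_extent(maps, index, ignore_merged_flag)
    return [(em.start // KIB, em.end // KIB) for em in maps]


if __name__ == "__main__":
    print(simulate(ignore_merged_flag=False))
    print(simulate(ignore_merged_flag=True))
```

With the flag compared, the model ends with the two maps [0, 128K) and [128K, 256K) described in step 4; with the flag ignored, it ends with a single map covering [0, 256K).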
Ryusuke Konishi	b3a033e3ec	nilfs2: fix potential deadlock with newly created symlinks Syzbot reported that page_symlink(), called by nilfs_symlink(), triggers memory reclamation involving the filesystem layer, which can result in circular lock dependencies among the reader/writer semaphore nilfs->ns_segctor_sem, s_writers percpu_rwsem (intwrite) and the fs_reclaim pseudo lock. This is because after commit `21fc61c73c` ("don't put symlink bodies in pagecache into highmem"), the gfp flags of the page cache for symbolic links are overwritten to GFP_KERNEL via inode_nohighmem(). This is not a problem for symlinks read from the backing device, because the __GFP_FS flag is dropped after inode_nohighmem() is called. However, when a new symlink is created with nilfs_symlink(), the gfp flags remain overwritten to GFP_KERNEL. Then, memory allocation called from page_symlink() etc. triggers memory reclamation including the FS layer, which may call nilfs_evict_inode() or nilfs_dirty_inode(). And these can cause a deadlock if they are called while nilfs->ns_segctor_sem is held. Fix this issue by dropping the __GFP_FS flag from the page cache GFP flags of newly created symlinks in the same way that nilfs_new_inode() and __nilfs_read_inode() do, as a workaround until we adopt nofs allocation scope consistently or improve the locking constraints. Link: https://lkml.kernel.org/r/20241020050003.4308-1-konishi.ryusuke@gmail.com Fixes: `21fc61c73c` ("don't put symlink bodies in pagecache into highmem") Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Reported-by: syzbot+9ef37ac20608f4836256@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=9ef37ac20608f4836256 Tested-by: syzbot+9ef37ac20608f4836256@syzkaller.appspotmail.com Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-10-30 20:14:12 -07:00
Phillip Lougher	d31638ff6c	Squashfs: fix variable overflow in squashfs_readpage_block Syzbot reports a slab out of bounds access in squashfs_readpage_block(). This is caused by an attempt to read page index 0x2000000000. This value (start_index) is stored in an integer loop variable which overflows producing a value of 0. This causes a loop which iterates over pages start_index -> end_index to iterate over 0 -> end_index, which ultimately causes an out of bounds page array access. Fix by changing variable to a loff_t, and rename to index to make it clearer it is a page index, and not a loop count. Link: https://lkml.kernel.org/r/20241020232200.837231-1-phillip@squashfs.org.uk Signed-off-by: Phillip Lougher <phillip@squashfs.org.uk> Reported-by: "Lai, Yi" <yi1.lai@linux.intel.com> Closes: https://lore.kernel.org/all/ZwzcnCAosIPqQ9Ie@ly-workstation/ Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-10-30 20:14:12 -07:00
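The truncation described in the Squashfs fix above can be reproduced in isolation. The sketch below is a userspace illustration, not the squashfs code: both function names are hypothetical, a `uint32_t` stands in for the old 32-bit loop variable so that the truncation is well defined in standard C, and an `int64_t` stands in for the kernel's `loff_t`.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: keep a page index in a 32-bit variable, the way
 * the old loop variable did. Bits 32 and above are silently dropped.
 */
static uint64_t page_index_32bit(uint64_t start_index)
{
	uint32_t i = (uint32_t)start_index;

	return i;
}

/*
 * Hypothetical helper: keep the page index in a 64-bit signed variable,
 * standing in for loff_t. The full value survives.
 */
static uint64_t page_index_64bit(uint64_t start_index)
{
	int64_t index = (int64_t)start_index;

	return (uint64_t)index;
}
```

With the page index 0x2000000000 from the report, the low 32 bits are all zero, so the narrow variable starts the loop at 0 while the 64-bit variable keeps the intended start.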
Jan Kara	76486b1041	ext4: avoid remount errors with 'abort' mount option When we remount a filesystem with the 'abort' mount option while changing other mount options as well (as an LTP test does), we can return an error from the system call after commit `d3476f3dad` ("ext4: don't set SB_RDONLY after filesystem errors") because the application of mount option changes detects the shutdown filesystem and refuses to do anything. The behavior of application of other mount options in the presence of the 'abort' mount option is currently rather arbitrary, as some mount option changes are handled before 'abort' and some after it. Move aborting of the filesystem to the end of remount handling so all requested changes are properly applied before the filesystem is shut down, to have reasonably consistent behavior. Fixes: `d3476f3dad` ("ext4: don't set SB_RDONLY after filesystem errors") Reported-by: Jan Stancek <jstancek@redhat.com> Link: https://lore.kernel.org/all/Zvp6L+oFnfASaoHl@t14s Signed-off-by: Jan Kara <jack@suse.cz> Tested-by: Jan Stancek <jstancek@redhat.com> Link: https://patch.msgid.link/20241004221556.19222-1-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-10-30 17:42:44 -04:00
Jeongjun Park	902cc179c9	ext4: supress data-race warnings in ext4_free_inodes_{count,set}() find_group_other() and find_group_orlov() read _lo, _hi with ext4_free_inodes_count without additional locking. This can cause data-race warning, but since the lock is held for most writes and free inodes value is generally not a problem even if it is incorrect, it is more appropriate to use READ_ONCE()/WRITE_ONCE() than to add locking. ================================================================== BUG: KCSAN: data-race in ext4_free_inodes_count / ext4_free_inodes_set write to 0xffff88810404300e of 2 bytes by task 6254 on cpu 1: ext4_free_inodes_set+0x1f/0x80 fs/ext4/super.c:405 __ext4_new_inode+0x15ca/0x2200 fs/ext4/ialloc.c:1216 ext4_symlink+0x242/0x5a0 fs/ext4/namei.c:3391 vfs_symlink+0xca/0x1d0 fs/namei.c:4615 do_symlinkat+0xe3/0x340 fs/namei.c:4641 __do_sys_symlinkat fs/namei.c:4657 [inline] __se_sys_symlinkat fs/namei.c:4654 [inline] __x64_sys_symlinkat+0x5e/0x70 fs/namei.c:4654 x64_sys_call+0x1dda/0x2d60 arch/x86/include/generated/asm/syscalls_64.h:267 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0x54/0x120 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x76/0x7e read to 0xffff88810404300e of 2 bytes by task 6257 on cpu 0: ext4_free_inodes_count+0x1c/0x80 fs/ext4/super.c:349 find_group_other fs/ext4/ialloc.c:594 [inline] __ext4_new_inode+0x6ec/0x2200 fs/ext4/ialloc.c:1017 ext4_symlink+0x242/0x5a0 fs/ext4/namei.c:3391 vfs_symlink+0xca/0x1d0 fs/namei.c:4615 do_symlinkat+0xe3/0x340 fs/namei.c:4641 __do_sys_symlinkat fs/namei.c:4657 [inline] __se_sys_symlinkat fs/namei.c:4654 [inline] __x64_sys_symlinkat+0x5e/0x70 fs/namei.c:4654 x64_sys_call+0x1dda/0x2d60 arch/x86/include/generated/asm/syscalls_64.h:267 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0x54/0x120 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x76/0x7e Cc: stable@vger.kernel.org Signed-off-by: Jeongjun Park <aha310510@gmail.com> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Link: https://patch.msgid.link/20241003125337.47283-1-aha310510@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-10-30 17:42:44 -04:00
Markus Elfring	d431a2cd28	ext4: Call ext4_journal_stop(handle) only once in ext4_dio_write_iter() An ext4_journal_stop(handle) call was immediately used after a return value check for a ext4_orphan_add() call in this function implementation. Thus call such a function only once instead directly before the check. This issue was transformed by using the Coccinelle software. Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/cf895072-43cf-412c-bced-8268498ad13e@web.de Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-10-30 17:42:44 -04:00
Chuck Lever	8286f8b622	NFSD: Never decrement pending_async_copies on error The error flow in nfsd4_copy() calls cleanup_async_copy(), which already decrements nn->pending_async_copies. Reported-by: Olga Kornievskaia <okorniev@redhat.com> Fixes: `aadc3bbea1` ("NFSD: Limit the number of concurrent async COPY operations") Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2024-10-30 14:12:16 -04:00
Christoph Hellwig	81a1e1c32e	xfs: streamline xfs_filestream_pick_ag Directly return the error from xfs_bmap_longest_free_extent instead of breaking from the loop and handling it there, and use a done label to directly jump to the exit when we found a suitable perag structure to reduce the indentation level and pag/max_pag check complexity in the tail of the function. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2024-10-30 11:27:18 +01:00
Christoph Hellwig	dc60992ce7	xfs: fix finding a last resort AG in xfs_filestream_pick_ag When the main loop in xfs_filestream_pick_ag fails to find a suitable AG it tries to just pick the online AG. But the loop for that uses args->pag as loop iterator while the later code expects pag to be set. Fix this by reusing the max_pag case for this last resort, and also add a check for impossible case of no AG just to make sure that the uninitialized pag doesn't even escape in theory. Reported-by: syzbot+4125a3c514e3436a02e6@syzkaller.appspotmail.com Signed-off-by: Christoph Hellwig <hch@lst.de> Tested-by: syzbot+4125a3c514e3436a02e6@syzkaller.appspotmail.com Fixes: `f8f1ed1ab3` ("xfs: return a referenced perag from filestreams allocator") Cc: <stable@vger.kernel.org> # v6.3 Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2024-10-30 11:27:18 +01:00
Chi Zhiling	3ef2268403	xfs: Reduce unnecessary searches when searching for the best extents Recently, we found that the CPU spent a lot of time in xfs_alloc_ag_vextent_size when the filesystem has millions of fragmented spaces. The reason is that we conducted much extra searching for extents that could not yield a better result, and these searches would cost a lot of time when there were millions of extents to search through. Even if we get the same result length, we don't switch our choice to the new one, so we can definitely terminate the search early. Since the result length cannot exceed the found length, when the found length equals the best result length we already have, we can conclude the search. We did a test in that filesystem: [root@localhost ~]# xfs_db -c freesp /dev/vdb from to extents blocks pct 1 1 215 215 0.01 2 3 994476 1988952 99.99 Before this patch: 0) | xfs_alloc_ag_vextent_size [xfs]() { 0) * 15597.94 us | } After this patch: 0) | xfs_alloc_ag_vextent_size [xfs]() { 0) 19.176 us | } Signed-off-by: Chi Zhiling <chizhiling@kylinos.cn> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2024-10-30 11:27:18 +01:00
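The early exit described in the xfs commit above can be sketched on a plain array. This is a simplified userspace model, not the code in xfs_alloc_ag_vextent_size(): `struct free_extent`, its `usable` field and `pick_best_len()` are hypothetical names, the candidates are assumed to be visited from longest to shortest, and `usable` (the length left after trimming) is assumed never to exceed `len`.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of one free extent candidate. */
struct free_extent {
	unsigned int len;	/* length found in the free space index */
	unsigned int usable;	/* length left after trimming, <= len */
};

/*
 * Walk candidates sorted by descending len and return the best usable
 * length. Because usable <= len, a candidate whose len is not larger
 * than the current best cannot improve the result, and neither can any
 * later (shorter) candidate, so the walk stops there.
 */
static unsigned int pick_best_len(const struct free_extent *ext, size_t n,
				  size_t *visited)
{
	unsigned int best = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		if (ext[i].len <= best)
			break;
		if (ext[i].usable > best)
			best = ext[i].usable;
	}
	*visited = i;	/* number of candidates actually examined */
	return best;
}
```

Stopping on "less than or equal" rather than only on "less than" is what avoids scanning a long run of equally sized extents, which is the situation the freesp output in the commit message shows.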
Ojaswin Mujoo	2a492ff666	xfs: Check for delayed allocations before setting extsize Extsize should only be allowed to be set on files with no data in them. For this, we check if the files have extents but fail to check if delayed extents are present. This patch adds that check. While we are at it, also refactor this check into a helper since it's used in some other places as well like xfs_inactive() or xfs_ioctl_setattr_xflags() Without the patch (SUCCEEDS) $ xfs_io -c 'open -f testfile' -c 'pwrite 0 1024' -c 'extsize 65536' wrote 1024/1024 bytes at offset 0 1 KiB, 1 ops; 0.0002 sec (4.628 MiB/sec and 4739.3365 ops/sec) With the patch (FAILS as expected) $ xfs_io -c 'open -f testfile' -c 'pwrite 0 1024' -c 'extsize 65536' wrote 1024/1024 bytes at offset 0 1 KiB, 1 ops; 0.0002 sec (4.628 MiB/sec and 4739.3365 ops/sec) xfs_io: FS_IOC_FSSETXATTR testfile: Invalid argument Fixes: `e94af02a9c` ("[XFS] fix old xfs_setattr mis-merge from irix; mostly harmless esp if not using xfs rt") Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: John Garry <john.g.garry@oracle.com> Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2024-10-30 11:27:18 +01:00
Christian Brauner	2ec67bb4f9	Merge branch 'work.fdtable' into vfs.file Bring in the fdtable changes for this cycle. Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-10-30 09:58:02 +01:00
Christian Brauner	90ee6ed776	fs: port files to file_ref Port files to rely on file_ref reference to improve scaling and gain overflow protection. - We continue to WARN during get_file() in case a file that is already marked dead is revived as get_file() is only valid if the caller already holds a reference to the file. This hasn't changed, just the check changes. - The semantics for epoll and ttm's dmabuf usage have changed. Both epoll and ttm synchronize with __fput() to prevent the underlying file from being freed. (1) epoll Explaining epoll is straightforward using a simple diagram. Essentially, the mutex of the epoll instance needs to be taken in both __fput() and around epi_fget() preventing the file from being freed while it is polled or preventing the file from being resurrected. CPU1 CPU2 fput(file) -> __fput(file) -> eventpoll_release(file) -> eventpoll_release_file(file) mutex_lock(&ep->mtx) epi_item_poll() -> epi_fget() -> file_ref_get(file) mutex_unlock(&ep->mtx) mutex_lock(&ep->mtx); __ep_remove() mutex_unlock(&ep->mtx); -> kmem_cache_free(file) (2) ttm dmabuf This explanation is a bit more involved. 
A regular dmabuf file stashed the dmabuf in file->private_data and the file in dmabuf->file: file->private_data = dmabuf; dmabuf->file = file; The generic release method of a dmabuf file handles file specific things: f_op->release::dma_buf_file_release() while the generic dentry release method of a dmabuf handles dmabuf freeing including driver specific things: dentry->d_release::dma_buf_release() During ttm dmabuf initialization in ttm_object_device_init() the ttm driver copies the provided struct dma_buf_ops into a private location: struct ttm_object_device { spinlock_t object_lock; struct dma_buf_ops ops; void (*dmabuf_release)(struct dma_buf *dma_buf); struct idr idr; }; ttm_object_device_init(const struct dma_buf_ops *ops) { // copy original dma_buf_ops in private location tdev->ops = *ops; // stash the release method of the original struct dma_buf_ops tdev->dmabuf_release = tdev->ops.release; // override the release method in the copy of the struct dma_buf_ops // with ttm's own dmabuf release method tdev->ops.release = ttm_prime_dmabuf_release; } When a new dmabuf is created the struct dma_buf_ops with the overridden release method set to ttm_prime_dmabuf_release is passed in exp_info.ops: DEFINE_DMA_BUF_EXPORT_INFO(exp_info); exp_info.ops = &tdev->ops; exp_info.size = prime->size; exp_info.flags = flags; exp_info.priv = prime; The call to dma_buf_export() then sets mutex_lock_interruptible(&prime->mutex); dma_buf = dma_buf_export(&exp_info) { dmabuf->ops = exp_info->ops; } mutex_unlock(&prime->mutex); which creates a new dmabuf file and then installs a file descriptor for it in the caller's file descriptor table: ret = dma_buf_fd(dma_buf, flags); When that dmabuf file is closed we now get: fput(file) -> __fput(file) -> f_op->release::dma_buf_file_release() -> dput() -> d_op->d_release::dma_buf_release() -> dmabuf->ops->release::ttm_prime_dmabuf_release() mutex_lock(&prime->mutex); if (prime->dma_buf == dma_buf) prime->dma_buf = NULL; mutex_unlock(&prime->mutex); 
Where we can see that prime->dma_buf is set to NULL. So when we have the following diagram: CPU1 CPU2 fput(file) -> __fput(file) -> f_op->release::dma_buf_file_release() -> dput() -> d_op->d_release::dma_buf_release() -> dmabuf->ops->release::ttm_prime_dmabuf_release() ttm_prime_handle_to_fd() mutex_lock_interruptible(&prime->mutex) dma_buf = prime->dma_buf dma_buf && get_dma_buf_unless_doomed(dma_buf) -> file_ref_get(dma_buf->file) mutex_unlock(&prime->mutex); mutex_lock(&prime->mutex); if (prime->dma_buf == dma_buf) prime->dma_buf = NULL; mutex_unlock(&prime->mutex); -> kmem_cache_free(file) The logic of the mechanism is the same as for epoll: sync with __fput() preventing the file from being freed. Here the synchronization happens through the ttm instance's prime->mutex. Basically, the lifetime of the dma_buf and the file are tightly coupled. Both (1) and (2) used to call atomic_inc_not_zero() to check whether the file has already been marked dead and then refuse to revive it. This is only safe because both (1) and (2) sync with __fput() and thus prevent kmem_cache_free() on the file being called and thus prevent the file from being immediately recycled due to SLAB_TYPESAFE_BY_RCU. Both (1) and (2) have been ported from atomic_inc_not_zero() to file_ref_get(). That means a file that is already in the process of being marked as FILE_REF_DEAD: file_ref_put() cnt = atomic_long_dec_return() -> __file_ref_put(cnt) if (cnt == FILE_REF_NOREF) atomic_long_try_cmpxchg_release(cnt, FILE_REF_DEAD) can be revived again: CPU1 CPU2 file_ref_put() cnt = atomic_long_dec_return() -> __file_ref_put(cnt) if (cnt == FILE_REF_NOREF) file_ref_get() // Brings reference back to FILE_REF_ONEREF atomic_long_add_negative() atomic_long_try_cmpxchg_release(cnt, FILE_REF_DEAD) This is fine and inherent to the file_ref_get()/file_ref_put() semantics. 
For both (1) and (2) this is safe because __fput() is prevented from making progress if file_ref_get() fails due to the aforementioned synchronization mechanisms. Two cases need to be considered that affect both (1) epoll and (2) ttm dmabuf: (i) fput()'s file_ref_put() marks the file as FILE_REF_NOREF but before that fput() can mark the file as FILE_REF_DEAD someone manages to sneak in a file_ref_get() and brings the refcount back from FILE_REF_NOREF to FILE_REF_ONEREF. In that case the original fput() doesn't call __fput(). For epoll the poll will finish and for ttm dmabuf the file can be used again. For ttm dmabuf this is actually an advantage because it avoids immediately allocating a new dmabuf object. CPU1 CPU2 file_ref_put() cnt = atomic_long_dec_return() -> __file_ref_put(cnt) if (cnt == FILE_REF_NOREF) file_ref_get() // Brings reference back to FILE_REF_ONEREF atomic_long_add_negative() atomic_long_try_cmpxchg_release(cnt, FILE_REF_DEAD) (ii) fput()'s file_ref_put() marks the file FILE_REF_NOREF and also succeeds in actually marking it FILE_REF_DEAD and then calls into __fput() to free the file. When either (1) or (2) call file_ref_get() they fail as atomic_long_add_negative() will return true. At the same time, both (1) and (2) call file_ref_get() under mutexes that __fput() must also acquire preventing kmem_cache_free() from freeing the file. So while this might be treated as a change in semantics for (1) and (2) it really isn't. If it should end up causing issues this can be fixed by adding a helper that does something like: long cnt = atomic_long_read(&ref->refcnt); do { if (cnt < 0) return false; } while (!atomic_long_try_cmpxchg(&ref->refcnt, &cnt, cnt + 1)); return true; which would block FILE_REF_NOREF to FILE_REF_ONEREF transitions. - Jann correctly pointed out that kmem_cache_zalloc() cannot be used anymore once files have been ported to file_ref_t. The kmem_cache_zalloc() call will memset() the whole struct file to zero when it is reallocated. 
This will also set file->f_ref to zero which means that a concurrent file_ref_get() can return true: CPU1 CPU2 __get_file_rcu() rcu_dereference_raw() close() [frees file] alloc_empty_file() kmem_cache_zalloc() [reallocates same file] memset(..., 0, ...) file_ref_get() [increments 0->1, returns true] init_file() file_ref_init(..., 1) [sets to 0] rcu_dereference_raw() fput() file_ref_put() [decrements 0->FILE_REF_NOREF, frees file] [UAF] causing a concurrent __get_file_rcu() call to acquire a reference to the file that is about to be reallocated and immediately freeing it on realizing that it has been recycled. This causes a UAF for the task that reallocated/recycled the file. This is prevented by switching from kmem_cache_zalloc() to kmem_cache_alloc() and initializing the fields manually. With file->f_ref initialized last. Note that a memset() also isn't guaranteed to atomically update an unsigned long so it's theoretically possible to see torn and therefore bogus counter values. Link: https://lore.kernel.org/r/20241007-brauner-file-rcuref-v2-3-387e24dc9163@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-10-30 09:57:43 +01:00
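The fallback helper that the file_ref commit message outlines ("if (cnt < 0) return false; ... cmpxchg to cnt + 1") can be tried out in userspace. The sketch below uses C11 atomics instead of the kernel's atomic_long_* API; `struct ref` and `ref_get_unless_negative()` are hypothetical names, and the only convention carried over from the message is that a negative counter means no reference may be taken.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for a reference counter. */
struct ref {
	atomic_long refcnt;
};

/*
 * Take a reference only while the counter is non-negative. A failed
 * compare-exchange reloads the current value into cnt, so the sign
 * check is repeated against fresh data on every retry.
 */
static bool ref_get_unless_negative(struct ref *ref)
{
	long cnt = atomic_load(&ref->refcnt);

	do {
		if (cnt < 0)
			return false;
	} while (!atomic_compare_exchange_weak(&ref->refcnt, &cnt, cnt + 1));

	return true;
}
```

Unlike an unconditional add followed by a sign test, this loop never writes to a counter that is already negative, which is why it would block the revival transition discussed in the message.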
Jakub Kicinski	71e0ad3451	wireless-next patches for v6.13 The first -next "new features" pull request for v6.13. This is a big one as we have not been able to send one earlier. We have also some patches affecting other subsystems: in staging we deleted the rtl8192e driver and in debugfs added a new interface to save struct file_operations memory; both were acked by GregKH. Because of the lib80211/libipw move there were quite a lot of conflicts and to solve those we decided to merge net-next into wireless-next. Currently there's one conflict in Documentation/networking/net_cachelines/net_device.rst. To fix that just remove the iw_public_data line: https://lore.kernel.org/all/20241011121014.674661a0@canb.auug.org.au/ And when net is merged to net-next there will be another simple conflict in net/mac80211/cfg.c: https://lore.kernel.org/all/20241024115523.4cd35dde@canb.auug.org.au/ Major changes: cfg80211/mac80211 * stop exporting wext symbols * new mac80211 op to indicate that a new interface is to be added * support radio separation of multi-band devices Wireless Extensions * move wext spy implementation to libiw * remove iw_public_data from struct net_device brcmfmac * optional LPO clock support ipw2x00 * move remaining lib80211 code into libiw wilc1000 * WILC3000 support rtw89 * RTL8852BE and RTL8852BE-VT BT-coexistence improvements Merge tag 'wireless-next-2024-10-25' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next Kalle Valo says: ==================== wireless-next patches for v6.13 The first -next "new features" pull request for v6.13. This is a big one as we have not been able to send one earlier. We have also some patches affecting other subsystems: in staging we deleted the rtl8192e driver and in debugfs added a new interface to save struct file_operations memory; both were acked by GregKH. Because of the lib80211/libipw move there were quite a lot of conflicts and to solve those we decided to merge net-next into wireless-next. Major changes: cfg80211/mac80211 * stop exporting wext symbols * new mac80211 op to indicate that a new interface is to be added * support radio separation of multi-band devices Wireless Extensions * move wext spy implementation to libiw * remove iw_public_data from struct net_device brcmfmac * optional LPO clock support ipw2x00 * move remaining lib80211 code into libiw wilc1000 * WILC3000 support rtw89 * RTL8852BE and RTL8852BE-VT BT-coexistence improvements * tag 'wireless-next-2024-10-25' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (126 commits) mac80211: Remove NOP call to ieee80211_hw_config wifi: iwlwifi: work around -Wenum-compare-conditional warning wifi: mac80211: re-order assigning channel in activate links wifi: mac80211: convert debugfs files to short fops debugfs: add small file operations for most files wifi: mac80211: remove misleading j_0 construction parts wifi: mac80211_hwsim: use hrtimer_active() wifi: mac80211: refactor BW limitation check for CSA parsing wifi: mac80211: filter on monitor interfaces based on configured channel wifi: mac80211: refactor ieee80211_rx_monitor wifi: mac80211: add support for the monitor SKIP_TX flag wifi: cfg80211: add monitor SKIP_TX flag wifi: mac80211: add flag to opt out of virtual monitor support wifi: cfg80211: pass net_device to .set_monitor_channel wifi: mac80211: remove status->ampdu_delimiter_crc wifi: cfg80211: report per wiphy radio antenna mask wifi: mac80211: use vif radio mask to limit creating chanctx wifi: mac80211: use vif radio mask to limit ibss scan frequencies wifi: cfg80211: add option for vif allowed radios wifi: iwlwifi: allow IWL_FW_CHECK() with just a string ... ==================== Link: https://patch.msgid.link/20241025170705.5F6B2C4CEC3@smtp.kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-10-29 18:50:58 -07:00
Nihar Chaithanya	a174706ba4	jfs: add a check to prevent array-index-out-of-bounds in dbAdjTree When the value of lp is 0 at the beginning of the for loop, it will become negative in the next assignment and we should bail out. Reported-by: syzbot+412dea214d8baa3f7483@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=412dea214d8baa3f7483 Tested-by: syzbot+412dea214d8baa3f7483@syzkaller.appspotmail.com Signed-off-by: Nihar Chaithanya <niharchaithanya@gmail.com> Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>	2024-10-29 17:43:41 -05:00
Artem Sadovnikov	d9f9d96136	jfs: xattr: check invalid xattr size more strictly Commit `7c55b78818` ("jfs: xattr: fix buffer overflow for invalid xattr") also addresses this issue but it only fixes it for positive values, while ea_size is an integer type and can take negative values, e.g. in case of a corrupted filesystem. This still breaks validation and would overflow because of implicit conversion from int to size_t in print_hex_dump(). Fix this issue by clamping the ea_size value instead. Found by Linux Verification Center (linuxtesting.org) with Syzkaller. Cc: stable@vger.kernel.org Signed-off-by: Artem Sadovnikov <ancowi69@gmail.com> Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>	2024-10-29 17:17:43 -05:00
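The difference between capping only the upper bound and clamping both bounds, as described in the jfs xattr commit above, can be shown with a small userspace sketch. `dump_len()` and its `max_size` parameter are hypothetical names chosen for illustration (with `max_size` assumed non-negative); this is not the jfs code itself.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: clamp a possibly corrupted, possibly negative
 * int size into [0, max_size] before it is converted to size_t.
 * Without the lower bound, a negative value would turn into a huge
 * unsigned length on conversion.
 */
static size_t dump_len(int ea_size, int max_size)
{
	int size = ea_size;

	if (size < 0)
		size = 0;
	if (size > max_size)
		size = max_size;
	return (size_t)size;
}
```

A plain "minimum of size and limit" would let -5 through unchanged, and converting -5 to size_t yields a value far above any sane buffer size, which is the implicit conversion problem the commit message points at.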
Ghanshyam Agrawal	839f102efb	jfs: fix array-index-out-of-bounds in jfs_readdir The stbl might contain some invalid values. Added a check to return error code in that case. Reported-by: syzbot+0315f8fe99120601ba88@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=0315f8fe99120601ba88 Signed-off-by: Ghanshyam Agrawal <ghanshyam1898@gmail.com> Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>	2024-10-29 17:03:41 -05:00
Ghanshyam Agrawal	a5f5e4698f	jfs: fix shift-out-of-bounds in dbSplit When dmt_budmin is less than zero, it causes errors in the later stages. Added a check to return an error beforehand in dbAllocCtl itself. Reported-by: syzbot+b5ca8a249162c4b9a7d0@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=b5ca8a249162c4b9a7d0 Signed-off-by: Ghanshyam Agrawal <ghanshyam1898@gmail.com> Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>	2024-10-29 16:53:18 -05:00
Ghanshyam Agrawal	ca84a2c9be	jfs: array-index-out-of-bounds fix in dtReadFirst The value of stbl can sometimes be out of bounds due to a bad filesystem. Added a check with appropriate return of error code in that case. Reported-by: syzbot+65fa06e29859e41a83f3@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=65fa06e29859e41a83f3 Signed-off-by: Ghanshyam Agrawal <ghanshyam1898@gmail.com> Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>	2024-10-29 16:13:36 -05:00
Zhihao Cheng	aec8e6bf83	btrfs: fix use-after-free of block device file in __btrfs_free_extra_devids() Mounting btrfs from two images (which have the same one fsid and two different dev_uuids) in certain executing order may trigger an UAF for variable 'device->bdev_file' in __btrfs_free_extra_devids(). And following are the details: 1. Attach image_1 to loop0, attach image_2 to loop1, and scan btrfs devices by ioctl(BTRFS_IOC_SCAN_DEV): / btrfs_device_1 → loop0 fs_device \ btrfs_device_2 → loop1 2. mount /dev/loop0 /mnt btrfs_open_devices btrfs_device_1->bdev_file = btrfs_get_bdev_and_sb(loop0) btrfs_device_2->bdev_file = btrfs_get_bdev_and_sb(loop1) btrfs_fill_super open_ctree fail: btrfs_close_devices // -ENOMEM btrfs_close_bdev(btrfs_device_1) fput(btrfs_device_1->bdev_file) // btrfs_device_1->bdev_file is freed btrfs_close_bdev(btrfs_device_2) fput(btrfs_device_2->bdev_file) 3. mount /dev/loop1 /mnt btrfs_open_devices btrfs_get_bdev_and_sb(&bdev_file) // EIO, btrfs_device_1->bdev_file is not assigned, // which points to a freed memory area btrfs_device_2->bdev_file = btrfs_get_bdev_and_sb(loop1) btrfs_fill_super open_ctree btrfs_free_extra_devids if (btrfs_device_1->bdev_file) fput(btrfs_device_1->bdev_file) // UAF ! Fix it by setting 'device->bdev_file' as 'NULL' after closing the btrfs_device in btrfs_close_one_device(). Fixes: `1423881941` ("btrfs: do not background blkdev_put()") CC: stable@vger.kernel.org # 4.19+ Link: https://bugzilla.kernel.org/show_bug.cgi?id=219408 Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-10-29 21:59:25 +01:00
Chuck Lever	63fab04cbd	NFSD: Initialize struct nfsd4_copy earlier Ensure the refcount and async_copies fields are initialized early. cleanup_async_copy() will reference these fields if an error occurs in nfsd4_copy(). If they are not correctly initialized, at the very least, a refcount underflow occurs. Reported-by: Olga Kornievskaia <okorniev@redhat.com> Fixes: `aadc3bbea1` ("NFSD: Limit the number of concurrent async COPY operations") Reviewed-by: Jeff Layton <jlayton@kernel.org> Tested-by: Olga Kornievskaia <okorniev@redhat.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2024-10-29 15:31:18 -04:00
Christoph Hellwig	2f5a65ef30	block: add a bdev_limits helper Add a helper to get the queue_limits from the bdev without having to poke into the request_queue. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: John Garry <john.g.garry@oracle.com> Link: https://lore.kernel.org/r/20241029141937.249920-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-10-29 09:15:00 -06:00
Piotr Zalewski	3726a1970b	bcachefs: Fix NULL ptr dereference in btree_node_iter_and_journal_peek Add NULL check for key returned from bch2_btree_and_journal_iter_peek in btree_node_iter_and_journal_peek to avoid NULL ptr dereference in bch2_bkey_buf_reassemble. When key returned from bch2_btree_and_journal_iter_peek is NULL it means that btree topology needs repair. Print topology error message with position at which node wasn't found, its parent node information and btree_id with level. Return error code returned by bch2_topology_error to ensure that topology error is handled properly by recovery. Reported-by: syzbot+005ef9aa519f30d97657@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=005ef9aa519f30d97657 Fixes: `5222a4607c` ("bcachefs: BTREE_ITER_WITH_JOURNAL") Suggested-by: Alan Huang <mmpgouride@gmail.com> Suggested-by: Kent Overstreet <kent.overstreet@linux.dev> Signed-off-by: Piotr Zalewski <pZ010001011111@proton.me> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-29 06:34:11 -04:00
Gaosheng Cui	ca959e328b	bcachefs: fix possible null-ptr-deref in __bch2_ec_stripe_head_get() The function ec_new_stripe_head_alloc() returns nullptr if kzalloc() fails. It is crucial to verify its return value before dereferencing it to avoid a potential nullptr dereference. Fixes: `035d72f72c` ("bcachefs: bch2_ec_stripe_head_get() now checks for change in rw devices") Signed-off-by: Gaosheng Cui <cuigaosheng1@huawei.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-29 06:34:10 -04:00
Kent Overstreet	778ac324cc	bcachefs: Fix deadlock on -ENOSPC w.r.t. partial open buckets Open buckets on the partial list should not count as allocated when we're trying to allocate from the partial list. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-29 06:34:10 -04:00
Kent Overstreet	e0fafac5c4	bcachefs: Don't filter partial list buckets in open_buckets_to_text() these are an important source of stranded buckets we need to be able to watch Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-29 06:34:10 -04:00
Kent Overstreet	a34eef6dd1	bcachefs: Don't keep tons of cached pointers around We had a bug report where the data update path was creating an extent that failed to validate because it had too many pointers; almost all of them were cached. To fix this, we have: - want_cached_ptr(), a new helper that checks if we even want a cached pointer (is on appropriate target, device is readable). - bch2_extent_set_ptr_cached() now only sets a pointer cached if we want it. - bch2_extent_normalize_by_opts() now ensures that we only have a single cached pointer that we want. While working on this, it was noticed that this doesn't work well with reflinked data and per-file options. Another patch series is coming that plumbs through additional io path options through bch_extent_rebalance, with improved option handling. Reported-by: Reed Riley <reed@riley.engineer> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-29 06:34:10 -04:00
Piotr Zalewski	3fd27e9c57	bcachefs: init freespace inited bits to 0 in bch2_fs_initialize Initialize freespace_initialized bits to 0 in member's flags and update member's cached version for each device in bch2_fs_initialize. It's possible for the bits to be set to 1 before fs is initialized and if call to bch2_trans_mark_dev_sbs (just before bch2_fs_freespace_init) fails bits remain to be 1 which can later indirectly trigger BUG condition in bch2_bucket_alloc_freelist during shutdown. Reported-by: syzbot+2b6a17991a6af64f9489@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=2b6a17991a6af64f9489 Fixes: `bbe682c767` ("bcachefs: Ensure devices are always correctly initialized") Suggested-by: Kent Overstreet <kent.overstreet@linux.dev> Signed-off-by: Piotr Zalewski <pZ010001011111@proton.me> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-29 06:34:10 -04:00
Kent Overstreet	c1fa854acc	bcachefs: Fix unhandled transaction restart in fallocate This used to not matter, but now we're being more strict. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-29 06:34:10 -04:00
Ryusuke Konishi	41e192ad27	nilfs2: fix kernel bug due to missing clearing of checked flag Syzbot reported that in directory operations after nilfs2 detects filesystem corruption and degrades to read-only, __block_write_begin_int(), which is called to prepare block writes, may fail the BUG_ON check for accesses exceeding the folio/page size, triggering a kernel bug. This was found to be because the "checked" flag of a page/folio was not cleared when it was discarded by nilfs2's own routine, which causes the sanity check of directory entries to be skipped when the directory page/folio is reloaded. So, fix that. This was necessary when the use of nilfs2's own page discard routine was applied to more than just metadata files. Link: https://lkml.kernel.org/r/20241017193359.5051-1-konishi.ryusuke@gmail.com Fixes: `8c26c4e269` ("nilfs2: fix issue with flush kernel thread after remount in RO mode because of driver's internal error or metadata corruption") Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Reported-by: syzbot+d6ca2daf692c7a82f959@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=d6ca2daf692c7a82f959 Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-10-28 21:40:40 -07:00
Edward Adam Davis	bc0a2f3a73	ocfs2: pass u64 to ocfs2_truncate_inline maybe overflow Syzbot reported a kernel BUG in ocfs2_truncate_inline. There are two reasons for this: first, the parameter value passed is greater than ocfs2_max_inline_data_with_xattr, second, the start and end parameters of ocfs2_truncate_inline are "unsigned int". So, we need to add a sanity check for byte_start and byte_len right before ocfs2_truncate_inline() in ocfs2_remove_inode_range(), if they are greater than ocfs2_max_inline_data_with_xattr return -EINVAL. Link: https://lkml.kernel.org/r/tencent_D48DB5122ADDAEDDD11918CFB68D93258C07@qq.com Fixes: `1afc32b952` ("ocfs2: Write support for inline data") Signed-off-by: Edward Adam Davis <eadavis@qq.com> Reported-by: syzbot+81092778aac03460d6b7@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=81092778aac03460d6b7 Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: Jun Piao <piaojun@huawei.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-10-28 21:40:40 -07:00
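The narrowing problem described in the ocfs2 fix above can be illustrated in isolation. What follows is a minimal userspace sketch, not the kernel code: `truncate_inline()`, `remove_range()`, and `MAX_INLINE` are hypothetical stand-ins for `ocfs2_truncate_inline()`, `ocfs2_remove_inode_range()`, and the limit returned by `ocfs2_max_inline_data_with_xattr()`, and the check on the sum is an extra assumption of this sketch. It shows why a 64-bit range should be validated before it is handed to a function that takes `unsigned int`.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical inline-data capacity; the real limit depends on the inode. */
#define MAX_INLINE 3896u

/* Stand-in for a helper whose parameters are only 32 bits wide. */
static unsigned int truncate_inline(unsigned int start, unsigned int end)
{
	return end - start; /* number of bytes that would be zeroed */
}

/*
 * Validate the 64-bit range before narrowing it. Without the check,
 * a value such as 0x100000010 would silently become 0x10 when
 * converted to unsigned int.
 */
static int remove_range(uint64_t byte_start, uint64_t byte_len)
{
	if (byte_start > MAX_INLINE || byte_len > MAX_INLINE ||
	    byte_start + byte_len > MAX_INLINE)
		return -EINVAL;

	return (int)truncate_inline((unsigned int)byte_start,
				    (unsigned int)(byte_start + byte_len));
}
```

Calling `remove_range()` with an offset above 4 GiB returns `-EINVAL` instead of operating on the truncated low 32 bits.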
Lorenzo Stoakes	f64e67e5d3	fork: do not invoke uffd on fork if error occurs Patch series "fork: do not expose incomplete mm on fork". During fork we may place the virtual memory address space into an inconsistent state before the fork operation is complete. In addition, we may encounter an error during the fork operation that indicates that the virtual memory address space is invalidated. As a result, we should not be exposing it in any way to external machinery that might interact with the mm or VMAs, machinery that is not designed to deal with incomplete state. We specifically update the fork logic to defer khugepaged and ksm to the end of the operation and only to be invoked if no error arose, and disallow uffd from observing fork events should an error have occurred. This patch (of 2): Currently on fork we expose the virtual address space of a process to userland unconditionally if uffd is registered in VMAs, regardless of whether an error arose in the fork. This is performed in dup_userfaultfd_complete() which is invoked unconditionally, and performs two duties - invoking registered handlers for the UFFD_EVENT_FORK event via dup_fctx(), and clearing down userfaultfd_fork_ctx objects established in dup_userfaultfd(). This is problematic, because the virtual address space may not yet be correctly initialised if an error arose. The change in commit `d240629148` ("fork: use __mt_dup() to duplicate maple tree in dup_mmap()") makes this more pertinent as we may be in a state where entries in the maple tree are not yet consistent. We address this by, on fork error, ensuring that we roll back state that we would otherwise expect to clean up through the event being handled by userland and perform the memory freeing duty otherwise performed by dup_userfaultfd_complete(). We do this by implementing a new function, dup_userfaultfd_fail(), which performs the same loop, only decrementing reference counts. 
Note that we perform mmgrab() on the parent and child mm's, however userfaultfd_ctx_put() will mmdrop() this once the reference count drops to zero, so we will avoid memory leaks correctly here. Link: https://lkml.kernel.org/r/cover.1729014377.git.lorenzo.stoakes@oracle.com Link: https://lkml.kernel.org/r/d3691d58bb58712b6fb3df2be441d175bd3cdf07.1729014377.git.lorenzo.stoakes@oracle.com Fixes: `d240629148` ("fork: use __mt_dup() to duplicate maple tree in dup_mmap()") Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reported-by: Jann Horn <jannh@google.com> Reviewed-by: Jann Horn <jannh@google.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Linus Torvalds <torvalds@linuxfoundation.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-10-28 21:40:38 -07:00
André Almeida	58e55efd6c	tmpfs: Add casefold lookup support Enable casefold lookup in tmpfs, based on the encoding defined by userspace. That means that instead of comparing a file name byte by byte, it compares a case-insensitive equivalent of the Unicode string. * Dcache handling There's a special need when dealing with case-insensitive dentries. First of all, we currently invalidate every negative casefold dentry. That happens because the VFS code currently has no proper support for them, given that it could incorrectly reuse a previous filename for a new file that has a casefold match. For instance, this could happen: $ mkdir DIR $ rm -r DIR $ mkdir dir $ ls DIR/ This would be perceived as an inconsistency from the userspace point of view, because even though we match files in a case-insensitive manner, we still honor whatever the initial filename is. Along with that, tmpfs stores only the first equivalent name dentry used in the dcache, preventing duplication of dentries in the dcache. The d_compare() version for casefold files uses a normalized string, so the filename under lookup will be compared to another normalized string for the existing file, achieving a casefolded lookup. * Enabling casefold via mount options Most filesystems have their data stored on disk, so the casefold option needs to be enabled when building a filesystem on a device (via mkfs). However, as tmpfs is a RAM-backed filesystem, there's no disk information and thus no mkfs to store information about casefold. For tmpfs, create casefold options for mounting. Userspace can then enable casefold support for a mount point using: $ mount -t tmpfs -o casefold=utf8-12.1.0 fs_name mount_dir/ Userspace must specify which version of the Unicode standard it is targeting. The available options depend on what the kernel Unicode subsystem supports. 
And for strict encoding: $ mount -t tmpfs -o casefold=utf8-12.1.0,strict_encoding fs_name mount_dir/ Strict encoding means that tmpfs will refuse to create invalid UTF-8 sequences. When this option is not enabled, any invalid sequence will be treated as an opaque byte sequence, ignoring the encoding, and thus cannot be looked up in a case-insensitive way. * Check for casefold dirs on simple_lookup() On simple_lookup(), do not create dentries for casefold directories. Currently, VFS does not support case-insensitive negative dentries and can create inconsistencies in the filesystem. Prevent such dentries from being created in the first place. Reviewed-by: Gabriel Krisman Bertazi <gabriel@krisman.be> Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de> Signed-off-by: André Almeida <andrealmeid@igalia.com> Link: https://lore.kernel.org/r/20241021-tonyk-tmpfs-v8-6-f443d5814194@igalia.com Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-10-28 13:36:55 +01:00
André Almeida	458532c8df	libfs: Export generic_ci_ dentry functions Export generic_ci_ dentry functions so they can be used by case-insensitive filesystems that need something more custom than the default one set by `struct generic_ci_dentry_ops`. Reviewed-by: Gabriel Krisman Bertazi <gabriel@krisman.be> Signed-off-by: André Almeida <andrealmeid@igalia.com> Link: https://lore.kernel.org/r/20241021-tonyk-tmpfs-v8-5-f443d5814194@igalia.com Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-10-28 13:36:54 +01:00
André Almeida	142fa60f61	unicode: Recreate utf8_parse_version() All filesystems that currently support UTF-8 casefold can fetch the UTF-8 version from the filesystem metadata stored on disk. They can get the data stored and directly match it to an integer, so they can skip the string parsing step, which motivated the removal of this function in the first place. However, for tmpfs, the only way to tell the kernel which UTF-8 version we are about to use is via mount options, using a string. Re-introduce utf8_parse_version() to be used by tmpfs. This version differs from the original by skipping the intermediate step of copying the version string to an auxiliary string before calling match_token(). This version calls match_token() on the argument string. The parameters are simpler now as well. utf8_parse_version() was created by `9d53690f0d` ("unicode: implement higher level API for string handling") and later removed by `49bd03cc7e` ("unicode: pass a UNICODE_AGE() tripple to utf8_load"). Signed-off-by: André Almeida <andrealmeid@igalia.com> Link: https://lore.kernel.org/r/20241021-tonyk-tmpfs-v8-4-f443d5814194@igalia.com Reviewed-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-10-28 13:36:54 +01:00
André Almeida	04dad6c6d3	unicode: Export latest available UTF-8 version number Export latest available UTF-8 version number so filesystems can easily load the newest one. Signed-off-by: André Almeida <andrealmeid@igalia.com> Link: https://lore.kernel.org/r/20241021-tonyk-tmpfs-v8-3-f443d5814194@igalia.com Acked-by: Gabriel Krisman Bertazi <krisman@suse.de> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-10-28 13:36:54 +01:00
André Almeida	3f5ad0d21d	ext4: Use generic_ci_validate_strict_name helper Use the helper function to check the requirements for casefold directories using strict encoding. Suggested-by: Gabriel Krisman Bertazi <krisman@suse.de> Signed-off-by: André Almeida <andrealmeid@igalia.com> Link: https://lore.kernel.org/r/20241021-tonyk-tmpfs-v8-2-f443d5814194@igalia.com Acked-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-10-28 13:36:53 +01:00
Pankaj Raghav	30dac24e14	fs/writeback: convert wbc_account_cgroup_owner to take a folio Most of the callers of wbc_account_cgroup_owner() are converting a folio to page before calling the function. wbc_account_cgroup_owner() is converting the page back to a folio to call mem_cgroup_css_from_folio(). Convert wbc_account_cgroup_owner() to take a folio instead of a page, and convert all callers to pass a folio directly except f2fs. Convert the page to folio for all the callers from f2fs as they were the only callers calling wbc_account_cgroup_owner() with a page. As f2fs is already in the process of converting to folios, these call sites might also soon be calling wbc_account_cgroup_owner() with a folio directly in the future. No functional changes. Only compile tested. Signed-off-by: Pankaj Raghav <p.raghav@samsung.com> Link: https://lore.kernel.org/r/20240926140121.203821-1-kernel@pankajraghav.com Acked-by: David Sterba <dsterba@suse.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-10-28 13:26:54 +01:00
Ian Kent	f19910006e	autofs: fix thinko in validate_dev_ioctl() I was so sure the per-dentry expire timeout patch worked ok but my testing was flawed. In validate_dev_ioctl() the check for ioctl AUTOFS_DEV_IOCTL_TIMEOUT_CMD should use the ioctl number not the passed in ioctl command. Fixes: `433f9d76a0` ("autofs: add per dentry expire timeout") Cc: <stable@vger.kernel.org> # mainline only Signed-off-by: Ian Kent <raven@themaw.net> Link: https://lore.kernel.org/r/20241027224732.5507-1-raven@themaw.net Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-10-28 13:16:56 +01:00
Jinjie Ruan	3abab905b1	ksmbd: Fix the missing xa_store error check xa_store() can fail: it returns xa_err(-EINVAL) if the entry cannot be stored in an XArray, or xa_err(-ENOMEM) if memory allocation failed. Check the return value of xa_store() to handle these errors. Cc: stable@vger.kernel.org Fixes: `b685757c7b` ("ksmbd: Implements sess->rpc_handle_list as xarray") Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com> Acked-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>	2024-10-28 08:30:05 +09:00
Linus Torvalds	a8b3be2617	XFS bug fixes for 6.12-rc5 * fix recovery of allocator ops after a growfs * Do not fail repairs on metadata files with no attr fork Signed-off-by: Carlos Maiolino <cem@kernel.org> -----BEGIN PGP SIGNATURE----- iJUEABMJAB0WIQQMHYkcUKcy4GgPe2RGdaER5QtfpgUCZxjo7gAKCRBGdaER5Qtf pr26AYCUc9+Vlg5iReesrghYHJgeCaMYZm2i4WdNdI+BO8d+5+AA1oUO55ib3xWd fX8A0MEBf32eeMR0E+K0NeKsmHnbGHXyWRg/27IlNRniL4/yldssEFB8X3b7Gkw5 /geUVdz99A== =+NGs -----END PGP SIGNATURE----- Merge tag 'xfs-6.12-fixes-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux Pull xfs fixes from Carlos Maiolino: - Fix recovery of allocator ops after a growfs - Do not fail repairs on metadata files with no attr fork * tag 'xfs-6.12-fixes-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: xfs: update the pag for the last AG at recovery time xfs: don't use __GFP_RETRY_MAYFAIL in xfs_initialize_perag xfs: error out when a superblock buffer update reduces the agcount xfs: update the file system geometry after recoverying superblock buffers xfs: merge the perag freeing helpers xfs: pass the exact range to initialize to xfs_initialize_perag xfs: don't fail repairs on metadata files with no attr fork	2024-10-27 08:23:49 -10:00
Linus Torvalds	850925a813	Revert patches causing inode collision problems -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEE/IPbcYBuWt0zoYhOq06b7GqY5nAFAmccC7UACgkQq06b7GqY 5nBYbA/7BP/Me80+ofClszxd0F3nlXCGJPe31vC36EeFsap5p1TrIM7jvj0iR0Zr HXLxyTwocrubFgSE28rUrSS2nhmk016g3EUEVeoaO+hYSbvtdrRhoTSCubEwb4N5 68vUz1ebGqZ35oT9oVZeAd+xdwxpKPktqrN6zCJsyWbSiZevE8YMkV+dgYrGde+Y j7BMypcxtQ2+u8donJ1PnGz04k0RNhqcqGGED0VuNw1Sp6quY8n0EWvHEeYjrb+N hFYB1lSeA245fVm8bmnUFHLJ2w47bxfJZ8hANbO9C0zLI98vlmH6xItXQnGofc4P 1VtCHxoZnIFMMjK/iTPjvi4JsanNr0mS1Aa5w8HAHqyHYeIoBCpn78XFf79IhFAl pO3RV6nxE8fELXBM39Yyq7I3rQfyFmNBeob0UWIbazQ7cC54xBHpu0RDMj2gD6xy TBbZL2euRfGEswKWfrdR32U1jzSCv+jTAWew9sdx+z0RX9inRyaNBHFVsjOnVFZS iLJn20cotgw0WU2OhghWvgsV4o/Ckx4eJAC9oNxZ9pT1s7BqSvW1qg/Zlpn9wYgf UGQGXrCdXGY6XZ3W53c2joS+QWtmU6N8mAC6jCeqQm8pCQQeiTHE1QycyNc0xPGd ELhQYzCA1hEiNDki3e/vqzbPScpAer9dc/6hMKUT55SVfjcCBoc= =O6o1 -----END PGP SIGNATURE----- Merge tag '9p-for-6.12-rc5' of https://github.com/martinetd/linux Pull more 9p reverts from Dominique Martinet: "Revert patches causing inode collision problems. The code simplification introduced significant regressions on servers that do not remap inode numbers when exporting multiple underlying filesystems with colliding inodes. See the top-most revert (commit `be2ca38253`) for details. This problem had been ignored for too long and the reverts will also head to stable (6.9+). I'm confident this set of patches gets us back to previous behaviour (another related patch had already been reverted back in April and we're almost back to square 1, and the rest didn't touch inode lifecycle)" * tag '9p-for-6.12-rc5' of https://github.com/martinetd/linux: Revert "fs/9p: simplify iget to remove unnecessary paths" Revert "fs/9p: fix uaf in in v9fs_stat2inode_dotl" Revert "fs/9p: remove redundant pointer v9ses" Revert " fs/9p: mitigate inode collisions"	2024-10-25 15:25:02 -07:00
Linus Torvalds	c71f8fb4dc	two fixes for stable -----BEGIN PGP SIGNATURE----- iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmcblTgACgkQiiy9cAdy T1HlwgwAm1aDntHOUht5HqhmZpNT89hVPWqbH8KuAUTUewNE2g5GEgqtX2Fqbl+i Nh09ipjFVqQzTL17t0lxlE2dTann5HoyZFCplGYo0KQUNgDPKpi6oeEDXufLOTkj SXE4CTw0fZWSNRwZKffbRfPRkMLr8/Gn1BBiPKU+fd9G2+sszuc+h6ovy8pXNM+W w49agufwENOVmMoJBpDBOvDlorzpWCoIV8NKHsbxBR4dijR6oRQWbb1Dk/YtivAn vC9rHbgyOVVR5IOJKUAF4PjNwXxDpRlitUyY2GF4gGxITy/gHK8054i4mfT79IAa RVL0wMoxE1IZfv6fDiP0cXvN3w7X935Z6ggjKgcvP3zOajcs11PCOPSIX4LEeTI5 B2Tn8i+4+uB7INcFUBxyoFb3lM2v8L/ejWii0cTsy3HOnaU8D36NZJLLQT0g2PIk WcGvzz/+l63tSBWZY9uEcD21H50WJbZpK4DKnkG8gClmC2QJBw5nR4F/XvbO3Z/7 k3yE1Gfp =agOS -----END PGP SIGNATURE----- Merge tag 'v6.12-rc4-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6 Pull smb client fixes from Steve French: - Fix init module error case - Fix memory allocation error path (for passwords) in mount * tag 'v6.12-rc4-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6: cifs: fix warning when destroy 'cifs_io_request_pool' smb: client: Handle kstrdup failures for passwords	2024-10-25 11:45:22 -07:00
Linus Torvalds	81dcc79758	fuse fixes for 6.12-rc5 -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCZxu02AAKCRDh3BK/laaZ PLDHAPwMzz4c+wbqz8Qo2IEo3lxvgPjgzMNXQetCgZFKvxKRlwD+PaIeRixGwwmB ON3IsScZjROphzb+ofroUpj7lLEM7Ag= =9rYN -----END PGP SIGNATURE----- Merge tag 'fuse-fixes-6.12-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse Pull fuse fixes from Miklos Szeredi: - Fix cached size after passthrough writes This fix needed a trivial change in the backing-file API, which resulted in some non-fuse files being touched. - Revert a commit meant as a cleanup but which triggered a WARNING - Remove a stray debug line left-over * tag 'fuse-fixes-6.12-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: fuse: remove stray debug line Revert "fuse: move initialization of fuse_file to fuse_writepages() instead of in callback" fuse: update inode size after extending passthrough write fs: pass offset and result to backing_file end_write() callback	2024-10-25 11:41:18 -07:00
Linus Torvalds	f647053312	nfsd-6.12 fixes: - Fix a couple of use-after-free bugs -----BEGIN PGP SIGNATURE----- iQIyBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmcbp9YACgkQM2qzM29m f5dzAg/4lCFRbsebia7qktW88ZDBbAoZyKolk2DHfjZXCuq5DHb/5I/Hk1rGyTYs VaJmCU59ZdpyBFSdQhOYKf2xvgNPvJG02U8il5KWtMAY5cStXFjeU0FSDoC5O4Dl 9IoaVbtAWGMCjxWJ1WEGpU82JoM7moSVB4G718LlxF+4cUS7idq5se0uK31WQvft DmJsOfmnch1Y/7+RRcDwbBu0HwP2ZQHS8zMYMQ2JGXPDZJFenTibezVb36YyzyZn WQfvaW6MmdiVL9omZxvURL9WuBKA2L+Ly/92PyHaflcAXSngcpfu28IIQbzp/m7K JT/3ad32lB7F3LrDP5I4gwVh8oGLYiI5r7RBWo0e98LPvAR/89gBVdZHhjasstCh nAL7Kk6P/jQdbM/KR9T+yS7xTVScI5Wp4Xcitz2mlHgU4br67GO9gpo1e/tqXenm Gasapkg5qCduz+ksj2vwpeFXKQi+qwJgfVGKMxELoV8qTazyr09Dfgouqe045ztl /0khkOrLkw0bYLDNJKhj/XG0ZEV5V10c/0PEnivC2BHVQioBIDRQAaH2S2G/8vQH MdWayGhNlTV0g4DdtMVPxf5uN+uQmfMsj5BIe17NIUYxiJnw0CSM/bzHUcJ15+DH nTblaek6sa5BFpZ2fed8by4il4/tme3NOJEeHnSqe/3QKyZgOQ== =1YOr -----END PGP SIGNATURE----- Merge tag 'nfsd-6.12-2' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux Pull nfsd fixes from Chuck Lever: - Fix a couple of use-after-free bugs * tag 'nfsd-6.12-2' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: nfsd: cancel nfsd_shrinker_work using sync mode in nfs4_state_shutdown_net nfsd: fix race between laundromat and free_stateid	2024-10-25 11:38:15 -07:00
Kent Overstreet	8e910ca20e	bcachefs: Fix UAF in bch2_reconstruct_alloc() write_super() -> sb_counters_from_cpu() may reallocate the superblock Reported-by: syzbot+9fc4dac4775d07bcfe34@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-25 13:17:23 -04:00
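The bug class behind this fix, a pointer left dangling after the buffer it points into is reallocated, can be shown in a few lines. This is a generic userspace sketch, not bcachefs code: `struct sb`, `sb_grow()`, `sb_field()`, and `update_field_after_grow()` are hypothetical names. The point is that any pointer derived from the buffer has to be re-derived after a call that may reallocate it.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable superblock buffer. */
struct sb {
	char   *data;
	size_t  size;
};

/* May move the buffer: every pointer into sb->data is stale afterwards. */
static int sb_grow(struct sb *sb, size_t new_size)
{
	char *p;

	if (new_size <= sb->size)
		return 0; /* nothing to do, buffer stays where it is */

	p = realloc(sb->data, new_size);
	if (!p)
		return -1;

	memset(p + sb->size, 0, new_size - sb->size);
	sb->data = p;
	sb->size = new_size;
	return 0;
}

/* Look a field up by offset each time instead of caching the pointer. */
static char *sb_field(struct sb *sb, size_t offset)
{
	return offset < sb->size ? sb->data + offset : NULL;
}

/* Grow the buffer, then write one byte; returns the byte written or -1. */
static int update_field_after_grow(struct sb *sb, size_t offset, char v)
{
	char *field;

	if (sb_grow(sb, sb->size * 2))
		return -1;

	field = sb_field(sb, offset); /* re-derived after the reallocation */
	if (!field)
		return -1;

	*field = v;
	return *field;
}
```

Had `field` been computed before `sb_grow()`, the write could land in freed memory whenever `realloc()` moved the buffer.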
Jeongjun Park	a25a83de45	bcachefs: fix null-ptr-deref in have_stripes() c->btree_roots_known[i].b can be NULL. In this case, a NULL pointer dereference occurs, so you need to add code to check the variable. Reported-by: syzbot+b468b9fef56949c3b528@syzkaller.appspotmail.com Fixes: `7773df19c3` ("bcachefs: metadata version bucket_stripe_sectors") Signed-off-by: Jeongjun Park <aha310510@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-25 13:17:06 -04:00
Miklos Szeredi	d34a5575e6	fuse: remove stray debug line It wasn't there when the patch was posted for review, but somehow made it into the pull. Link: https://lore.kernel.org/all/20240913104703.1673180-1-mszeredi@redhat.com/ Fixes: `efad7153bf` ("fuse: allow O_PATH fd for FUSE_DEV_IOC_BACKING_OPEN") Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2024-10-25 17:05:49 +02:00
Jeongjun Park	5c41f75d1b	bcachefs: fix shift oob in alloc_lru_idx_fragmentation The size of a.data_type is set abnormally large, causing shift-out-of-bounds. To fix this, we need to add validation on a.data_type in alloc_lru_idx_fragmentation(). Reported-by: syzbot+7f45fa9805c40db3f108@syzkaller.appspotmail.com Fixes: `260af1562e` ("bcachefs: Kill alloc_v4.fragmentation_lru") Signed-off-by: Jeongjun Park <aha310510@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-24 17:41:43 -04:00
Gianfranco Trad	2045fc4295	bcachefs: Fix invalid shift in validate_sb_layout() Add check on layout->sb_max_size_bits against BCH_SB_LAYOUT_SIZE_BITS_MAX to prevent UBSAN shift-out-of-bounds in validate_sb_layout(). Reported-by: syzbot+089fad5a3a5e77825426@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=089fad5a3a5e77825426 Fixes: `03ef80b469` ("bcachefs: Ignore unknown mount options") Tested-by: syzbot+089fad5a3a5e77825426@syzkaller.appspotmail.com Signed-off-by: Gianfranco Trad <gianf.trad@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-24 17:41:43 -04:00
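Both this fix and the `alloc_lru_idx_fragmentation()` fix in the previous entry guard a shift whose operand comes from on-disk data. A minimal sketch of that pattern follows. `SIZE_BITS_MAX` and `layout_max_sectors()` are hypothetical stand-ins, and the limit of 16 is an assumed value, not necessarily that of `BCH_SB_LAYOUT_SIZE_BITS_MAX`. In C, shifting by a count greater than or equal to the width of the promoted left operand is undefined behaviour, which is what UBSAN reports as shift-out-of-bounds.

```c
#include <errno.h>
#include <stdint.h>

/* Assumed upper bound for the exponent read from disk. */
#define SIZE_BITS_MAX 16

/*
 * Compute 1 << sb_max_size_bits only after checking the exponent.
 * On failure *out is left untouched.
 */
static int layout_max_sectors(uint8_t sb_max_size_bits, uint64_t *out)
{
	if (sb_max_size_bits > SIZE_BITS_MAX)
		return -EINVAL;

	*out = UINT64_C(1) << sb_max_size_bits;
	return 0;
}
```

Rejecting the value up front keeps the untrusted exponent from ever reaching the shift operator.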
Dominique Martinet	be2ca38253	Revert "fs/9p: simplify iget to remove unnecessary paths" This reverts commit `724a08450f`. This code simplification introduced significant regressions on servers that do not remap inode numbers when exporting multiple underlying filesystems with colliding inodes, as can be illustrated with simple tmpfs exports in qemu with remapping disabled:
```
# host side
cd /tmp/linux-test
mkdir m1 m2
mount -t tmpfs tmpfs m1
mount -t tmpfs tmpfs m2
mkdir m1/dir m2/dir
echo foo > m1/dir/foo
echo bar > m2/dir/bar

# guest side
# started with -virtfs local,path=/tmp/linux-test,mount_tag=tmp,security_model=mapped-file
mount -t 9p -o trans=virtio,debug=1 tmp /mnt/t
ls /mnt/t/m1/dir
# foo
ls /mnt/t/m2/dir
# bar (works ok if directory isn't open)

# cd to keep first dir's inode alive
cd /mnt/t/m1/dir
ls /mnt/t/m2/dir
# foo (should be bar)
```
Other examples can be crafted with regular files with fscache enabled, in which case I/Os just happen to the wrong file, leading to corruptions, or the guest failing to boot with:
| VFS: Lookup of 'com.android.runtime' in 9p 9p would have caused loop
In theory, we'd want the servers to be smart enough to ensure they never send us two different files with the same 'qid.path'. But while qemu has an option to remap that is recommended (and qemu prints a warning if this case happens), there are many other servers which do not (kvmtool, nfs-ganesha, probably diod...), so we should at least ensure we don't cause regressions on this:
- assume servers can't be trusted and operations that should get a 'new' inode properly do so. commit `d05dcfdf5e` (" fs/9p: mitigate inode collisions") attempted to do this, but v9fs_fid_iget_dotl() was not called so some higher level of caching got in the way; this needs to be fixed properly before we can re-apply the patches.
- if we ever want to really simplify this code, we will need to add some negotiation with the server at mount time where the server could claim they handle this properly, at which point we could optimize this out. (but that might not be needed at all if we properly handle the 'new' check?) Fixes: `724a08450f` ("fs/9p: simplify iget to remove unnecessary paths") Reported-by: Will Deacon <will@kernel.org> Link: https://lore.kernel.org/all/20240408141436.GA17022@redhat.com/ Link: https://lkml.kernel.org/r/20240923100508.GA32066@willie-the-truck Cc: stable@vger.kernel.org # v6.9+ Message-ID: <20241024-revert_iget-v1-4-4cac63d25f72@codewreck.org> Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>	2024-10-25 06:26:09 +09:00
Dominique Martinet	26f8dd2dde	Revert "fs/9p: fix uaf in in v9fs_stat2inode_dotl" This reverts commit `11763a8598`. This is a requirement to revert commit `724a08450f` ("fs/9p: simplify iget to remove unnecessary paths"), see that revert for details. Fixes: `724a08450f` ("fs/9p: simplify iget to remove unnecessary paths") Reported-by: Will Deacon <will@kernel.org> Link: https://lkml.kernel.org/r/20240923100508.GA32066@willie-the-truck Cc: stable@vger.kernel.org # v6.9+ Message-ID: <20241024-revert_iget-v1-3-4cac63d25f72@codewreck.org> Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>	2024-10-25 06:26:09 +09:00
Dominique Martinet	fedd06210b	Revert "fs/9p: remove redundant pointer v9ses" This reverts commit `10211b4a23`. This is a requirement to revert commit `724a08450f` ("fs/9p: simplify iget to remove unnecessary paths"), see that revert for details. Fixes: `724a08450f` ("fs/9p: simplify iget to remove unnecessary paths") Reported-by: Will Deacon <will@kernel.org> Link: https://lkml.kernel.org/r/20240923100508.GA32066@willie-the-truck Cc: stable@vger.kernel.org # v6.9+ Message-ID: <20241024-revert_iget-v1-2-4cac63d25f72@codewreck.org> Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>	2024-10-25 06:26:09 +09:00
Dominique Martinet	f69999b5f9	Revert " fs/9p: mitigate inode collisions" This reverts commit `d05dcfdf5e`. This is a requirement to revert commit `724a08450f` ("fs/9p: simplify iget to remove unnecessary paths"), see that revert for details. Fixes: `724a08450f` ("fs/9p: simplify iget to remove unnecessary paths") Reported-by: Will Deacon <will@kernel.org> Link: https://lkml.kernel.org/r/20240923100508.GA32066@willie-the-truck Cc: stable@vger.kernel.org # v6.9+ Message-ID: <20241024-revert_iget-v1-1-4cac63d25f72@codewreck.org> Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>	2024-10-25 06:26:08 +09:00
Linus Torvalds	4e46774408	for-6.12-rc4-tag -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmcZGqsACgkQxWXV+ddt WDsZTg//dSLAAswT3dTAvWDt7rGxPhJC1V+gnzWdj/0Q3CUimNaw7zHp1QHtJDFT S+3TVvJJ2JDvOnbi3u24s/bL5YdkWcvIyy87oVE4trJMbQPc/E45pkYFqF3TRQAH wEYAVEnO9f9WUY+ekxX2XBzhmKp3xol93j9BBHDXOF9kHsDC5lI0D5YVYEhRY8Qu D4GcqAhqUj5Hj6I6ppiqO47NCJBNzw1Se9QsgruPpmItRbB0/LYJFhUfevwosTKg xVpkRVntFqQjkIdIdBfBv/ZGWfxJyM7K4M49QwLUqUQfxugu7BiGYuEjkBkiy07a pZDEOF9s8wUZsxvRVohqKlhL0zHBF+/pAANowYuhKNW1sqKt4GVCdN3V34AbH8ST JIbPvC2g1tUzIc2wckE8GO/NnsNR4r3k6iPB53MdCHrIo3jnENOeb9wF0GiVDb6s OrhCa3ph2ps80YC1aCnc4Jr/yV2ONebSivnqvHCIUQEZfpjtAc07atm0i946/7nx eiBE+9zSVJZB0LoooIVz2I3HX3lRnm3Wwi4nj8U/sNL/IbGaHNCurIETR23NivYP yWVql2njwE+yMc8q9YZXs4MBdKsSGP6eGJW3ZwKi6ru2PJdf5eIib+ffvR4bLqXD UUDfMyC1esGsQB24sc8wppk97wmmMrdqQUj9WnmNlTP8FUFggvQ= =sYxX -----END PGP SIGNATURE----- Merge tag 'for-6.12-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: - mount option fixes: - fix handling of compression mount options on remount - reject rw remount in case there are options that don't work in read-write mode (like rescue options) - fix zone accounting of unusable space - fix in-memory corruption when merging extent maps - fix delalloc range locking for sector < page - use more convenient default value of drop subtree threshold, clean more subvolumes without the fallback to marking quotas inconsistent - fix smatch warning about incorrect value passed to ERR_PTR * tag 'for-6.12-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: fix passing 0 to ERR_PTR in btrfs_search_dir_index_item() btrfs: reject ro->rw reconfiguration if there are hard ro requirements btrfs: fix read corruption due to race with extent map merging btrfs: fix the delalloc range locking if sector size < page size btrfs: qgroup: set a more sane default value for subtree drop threshold btrfs: clear force-compress on remount when compress mount 
option is given btrfs: zoned: fix zone unusable accounting for freed reserved extent	2024-10-24 13:04:15 -07:00
Linus Torvalds	6cc65abee8	Fix a regression introduced in 6.12-rc1 -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEEIodevzQLVs53l6BhNqiEXrVAjGQFAmcY+6QACgkQNqiEXrVA jGRitw//TWl2UsPGzHCJXrdQTDmceGpCXoH+2C04eRe1Kk0ILqkJdSobbM1e10My 0eb/leiOqP90d+6WnrwWIpqcBOriaXs1NN7H6qdZpyE/xmmEPLfLKMU0SB6jMDaz A9BNoRGvKmxy960kid0OsDd79uOKyu95PT1VsRET5bm10LYUSM1E/qwMxaLydX3s g8kVLakgKPKq0iKXNpOuFgTKxhNJACaIm1a1+Lq55lXXtg8HN3QuLKwM/rXD+yOD 0+imvhTKcQBQGqHeRJtssk93BxKqA42FXsn36Zi+zBmLY75WshKZd0USihpCYlwE zL7JUSPmtFHtxkbjfr3pK6HzDlSg7rzp78BPIMeWOymZh4tQ9W8G6UtPHTb6KYoP WLY/LCe4IVQDnpyeKDcCPcLEe4/Jgk1QbCGtqjoOzrSuIEGReAP6LSxs9A9Kc7VD AVa2OLMXoyD+oLyHk7uuIuZch51bD34mhaGhOB+fyX5xz6ZocSOei+1/yiTWHwdD 3AAmbIvVc2rNxlg/O7t+UvrAACWLZhZ/uqQLqcAmdT5CNDZZHdNRZGy3pe5ELa+F PIzERoDlrALPAASOGqvmFsbVEZSG6z89uTREhFxAErryWBESHK4nPJ8WqHlybSwm ISQeTJ/h9+vgw7poeHtYDmghh0Eh8H6/iy+DTLCqZ16lxytfvRE= =sZ6X -----END PGP SIGNATURE----- Merge tag 'jfs-6.12-rc5' of github.com:kleikamp/linux-shaggy Pull jfs fix from David Kleikamp: "Fix a regression introduced in 6.12-rc1" * tag 'jfs-6.12-rc5' of github.com:kleikamp/linux-shaggy: jfs: Fix sanity check in dbMount	2024-10-24 12:47:01 -07:00
Linus Torvalds	c1e822754c	bcachefs fixes for 6.12-rc5 Lots of hotfixes: - transaction restart injection has been shaking out a few things - fix a data corruption in the buffered write path on -ENOSPC, found by xfstests generic/299 - Some small show_options fixes - Repair mismatches in inode hash type, seed: different snapshot versions of an inode must have the same hash/type seed, used for directory entries and xattrs. We were checking the hash seed, but not the type, and a user contributed a filesystem where the hash type on one inode had somehow been flipped; these fixes allow his filesystem to repair. Additionally, the hash type flip made some directory entries invisible, which were then recreated by userspace; so the hash check code now checks for duplicate non dangling dirents, and renames one of them if necessary. - Don't use wait_event_interruptible() in recovery: this fixes some filesystems failing to mount with -ERESTARTSYS - Workaround for kvmalloc not supporting > INT_MAX allocations, causing an -ENOMEM when allocating the sorted array of journal keys: this allows a 75 TB filesystem to mount - Make sure bch_inode_unpacked.bi_snapshot is set in the old inode compat path: this allows Marcin's filesystem (in use since before 6.7) to repair and mount. 
-----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEKnAFLkS8Qha+jvQrE6szbY3KbnYFAmcX4vYACgkQE6szbY3K bnbywxAArBfIJfshWq5Wk9WztenzUmyUmV2HIgntT/iN4ty4eIpZ26VSvHcGvgkU j3wx+OuxMTPBGc3fjUS+gALf/BGcQEgh6oPZCV+6M3kasTzNzG2jYOCkLqKbpcO1 V5n/Le/SM1X2grkgTm/H+TulGHNgG9gJ2U4kjihroJrTbTesZhzcW/qlz6RWo7U1 02NvLop4WE9M6WaW9RzsHK2llRUAl2Z3oRMuwNz3IIijCpm98STGD4gyvGoMV2b8 qNsXjy7b2lkYObKI29yWF0caRzWK1LRz79afRlnNVSJb6DK1QB83ms5Qa8rprCU4 uOq0wsGWyg6lzwQ19X+2TvUYABopVk2HXLlzTO/lJrWeMTuYJVPZ7KZi3l6ubw5T GIsAD5qMdCm8E5nXX8hG//0rOIl6QK288+zMQyRCvAkCL+iN2k0TU8qKAEEC44de vj6ZyNqbuLR39LLz9K09ZhzIZGk09ELpxOJ2Wwwj4ZFriwphWDtFgBtBUpNo/KWA inBfq2lZJsmNjfns9vCqOmNOStOJxXnyMOR25sTv7wM69QPGkl41dPY3oeuG8lRk cU/qJQKlpTKJbFeXiEKWKDnMzWxOnovqLFC0tKu2qAYM6vAz+AtwTXgthVFGh21U QoUDbsnQCCixMkS2AksCo7nivLrxmV/EeYm5pgeiU38VdA5ofBM= =OpYN -----END PGP SIGNATURE----- Merge tag 'bcachefs-2024-10-22' of https://github.com/koverstreet/bcachefs Pull bcachefs fixes from Kent Overstreet: "Lots of hotfixes: - transaction restart injection has been shaking out a few things - fix a data corruption in the buffered write path on -ENOSPC, found by xfstests generic/299 - Some small show_options fixes - Repair mismatches in inode hash type, seed: different snapshot versions of an inode must have the same hash/type seed, used for directory entries and xattrs. We were checking the hash seed, but not the type, and a user contributed a filesystem where the hash type on one inode had somehow been flipped; these fixes allow his filesystem to repair. Additionally, the hash type flip made some directory entries invisible, which were then recreated by userspace; so the hash check code now checks for duplicate non dangling dirents, and renames one of them if necessary. 
- Don't use wait_event_interruptible() in recovery: this fixes some filesystems failing to mount with -ERESTARTSYS - Workaround for kvmalloc not supporting > INT_MAX allocations, causing an -ENOMEM when allocating the sorted array of journal keys: this allows a 75 TB filesystem to mount - Make sure bch_inode_unpacked.bi_snapshot is set in the old inode compat path: this allows Marcin's filesystem (in use since before 6.7) to repair and mount" * tag 'bcachefs-2024-10-22' of https://github.com/koverstreet/bcachefs: (26 commits) bcachefs: Set bch_inode_unpacked.bi_snapshot in old inode path bcachefs: Mark more errors as AUTOFIX bcachefs: Workaround for kvmalloc() not supporting > INT_MAX allocations bcachefs: Don't use wait_event_interruptible() in recovery bcachefs: Fix __bch2_fsck_err() warning bcachefs: fsck: Improve hash_check_key() bcachefs: bch2_hash_set_or_get_in_snapshot() bcachefs: Repair mismatches in inode hash seed, type bcachefs: Add hash seed, type to inode_to_text() bcachefs: INODE_STR_HASH() for bch_inode_unpacked bcachefs: Run in-kernel offline fsck without ratelimit errors bcachefs: skip mount option handle for empty string. bcachefs: fix incorrect show_options results bcachefs: Fix data corruption on -ENOSPC in buffered write path bcachefs: bch2_folio_reservation_get_partial() is now better behaved bcachefs: fix disk reservation accounting in bch2_folio_reservation_get() bcachefS: ec: fix data type on stripe deletion bcachefs: Don't use commit_do() unnecessarily bcachefs: handle restarts in bch2_bucket_io_time_reset() bcachefs: fix restart handling in __bch2_resume_logged_op_finsert() ...	2024-10-24 12:38:59 -07:00
Dominique Martinet	f009e946c1	Revert "9p: Enable multipage folios" This reverts commit `1325e4a91a`. Using multipage folios apparently breaks some madvise operations, such as MADV_PAGEOUT, which no longer reliably unload the specified page. Revert the patch until that is figured out. Reported-by: Andrii Nakryiko <andrii@kernel.org> Fixes: `1325e4a91a` ("9p: Enable multipage folios") Signed-off-by: Dominique Martinet <asmadeus@codewreck.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2024-10-24 11:24:05 -07:00
Luca Boccassi	cdda1f26e7	pidfd: add ioctl to retrieve pid info A common pattern when using pid fds is having to get information about the process, which currently requires /proc being mounted, resolving the fd to a pid, and then do manual string parsing of /proc/N/status and friends. This needs to be reimplemented over and over in all userspace projects (e.g.: I have reimplemented resolving in systemd, dbus, dbus-daemon, polkit so far), and requires additional care in checking that the fd is still valid after having parsed the data, to avoid races. Having a programmatic API that can be used directly removes all these requirements, including having /proc mounted. As discussed at LPC24, add an ioctl with an extensible struct so that more parameters can be added later if needed. Start with returning pid/tgid/ppid and creds unconditionally, and cgroupid optionally. Signed-off-by: Luca Boccassi <luca.boccassi@gmail.com> Link: https://lore.kernel.org/r/20241010155401.2268522-1-luca.boccassi@gmail.com Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-10-24 13:54:51 +02:00
David Howells	247d65fb12	afs: Fix missing subdir edit when renamed between parent dirs When rename moves an AFS subdirectory between parent directories, the subdir also needs a bit of editing: the ".." entry needs updating to point to the new parent (though I don't make use of the info) and the DV needs incrementing by 1 to reflect the change of content. The server also sends a callback break notification on the subdirectory if we have one, but we can take care of recovering the promise next time we access the subdir. This can be triggered by something like: mount -t afs %example.com:xfstest.test20 /xfstest.test/ mkdir /xfstest.test/{aaa,bbb,aaa/ccc} touch /xfstest.test/bbb/ccc/d mv /xfstest.test/{aaa/ccc,bbb/ccc} touch /xfstest.test/bbb/ccc/e When the pathwalk for the second touch hits "ccc", kafs spots that the DV is incorrect and downloads it again (so the fix is not critical). Fix this, if the rename target is a directory and the old and new parents are different, by: (1) Incrementing the DV number of the target locally. (2) Editing the ".." entry in the target to refer to its new parent's vnode ID and uniquifier. Link: https://lore.kernel.org/r/3340431.1729680010@warthog.procyon.org.uk Fixes: `63a4681ff3` ("afs: Locally edit directory data for mkdir/create/unlink/...") cc: David Howells <dhowells@redhat.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-10-24 13:50:27 +02:00
Naohiro Aota	d48e1dea39	btrfs: fix error propagation of split bios The purpose of btrfs_bbio_propagate_error() is to propagate an error of a split bio to its original btrfs_bio, and to report the error to the upper layer. However, it does not work well in some cases. * Case 1. Immediate (or quick) end_bio with an error When btrfs sends btrfs_bio to mirrored devices, btrfs calls btrfs_bio_end_io() when all the mirroring bios are completed. If that btrfs_bio was split, it is from btrfs_clone_bioset and its end_io function is btrfs_orig_write_end_io. For this case, btrfs_bbio_propagate_error() accesses the orig_bbio's bio context to increase the error count. That works well in most cases. However, if the end_io is called fast enough, orig_bbio's (remaining part after split) bio context may not be properly set at that time. Since the bio context is set when the orig_bbio (the last btrfs_bio) is sent to devices, that might be too late for earlier split btrfs_bio's completion. That will result in a NULL pointer dereference. That bug is easily reproducible by running btrfs/146 on zoned devices [1] and it shows the following trace. [1] You need the raid-stripe-tree feature as it creates a "-d raid0 -m raid1" FS. 
BUG: kernel NULL pointer dereference, address: 0000000000000020 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 0 P4D 0 Oops: Oops: 0000 [#1] PREEMPT SMP PTI CPU: 1 UID: 0 PID: 13 Comm: kworker/u32:1 Not tainted 6.11.0-rc7-BTRFS-ZNS+ #474 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 Workqueue: writeback wb_workfn (flush-btrfs-5) RIP: 0010:btrfs_bio_end_io+0xae/0xc0 [btrfs] BTRFS error (device dm-0): bdev /dev/mapper/error-test errs: wr 2, rd 0, flush 0, corrupt 0, gen 0 RSP: 0018:ffffc9000006f248 EFLAGS: 00010246 RAX: 0000000000000000 RBX: ffff888005a7f080 RCX: ffffc9000006f1dc RDX: 0000000000000000 RSI: 000000000000000a RDI: ffff888005a7f080 RBP: ffff888011dfc540 R08: 0000000000000000 R09: 0000000000000001 R10: ffffffff82e508e0 R11: 0000000000000005 R12: ffff88800ddfbe58 R13: ffff888005a7f080 R14: ffff888005a7f158 R15: ffff888005a7f158 FS: 0000000000000000(0000) GS:ffff88803ea80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000020 CR3: 0000000002e22006 CR4: 0000000000370ef0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> ? __die_body.cold+0x19/0x26 ? page_fault_oops+0x13e/0x2b0 ? _printk+0x58/0x73 ? do_user_addr_fault+0x5f/0x750 ? exc_page_fault+0x76/0x240 ? asm_exc_page_fault+0x22/0x30 ? btrfs_bio_end_io+0xae/0xc0 [btrfs] ? btrfs_log_dev_io_error+0x7f/0x90 [btrfs] btrfs_orig_write_end_io+0x51/0x90 [btrfs] dm_submit_bio+0x5c2/0xa50 [dm_mod] ? find_held_lock+0x2b/0x80 ? blk_try_enter_queue+0x90/0x1e0 __submit_bio+0xe0/0x130 ? ktime_get+0x10a/0x160 ? lockdep_hardirqs_on+0x74/0x100 submit_bio_noacct_nocheck+0x199/0x410 btrfs_submit_bio+0x7d/0x150 [btrfs] btrfs_submit_chunk+0x1a1/0x6d0 [btrfs] ? lockdep_hardirqs_on+0x74/0x100 ? __folio_start_writeback+0x10/0x2c0 btrfs_submit_bbio+0x1c/0x40 [btrfs] submit_one_bio+0x44/0x60 [btrfs] submit_extent_folio+0x13f/0x330 [btrfs] ? 
btrfs_set_range_writeback+0xa3/0xd0 [btrfs] extent_writepage_io+0x18b/0x360 [btrfs] extent_write_locked_range+0x17c/0x340 [btrfs] ? __pfx_end_bbio_data_write+0x10/0x10 [btrfs] run_delalloc_cow+0x71/0xd0 [btrfs] btrfs_run_delalloc_range+0x176/0x500 [btrfs] ? find_lock_delalloc_range+0x119/0x260 [btrfs] writepage_delalloc+0x2ab/0x480 [btrfs] extent_write_cache_pages+0x236/0x7d0 [btrfs] btrfs_writepages+0x72/0x130 [btrfs] do_writepages+0xd4/0x240 ? find_held_lock+0x2b/0x80 ? wbc_attach_and_unlock_inode+0x12c/0x290 ? wbc_attach_and_unlock_inode+0x12c/0x290 __writeback_single_inode+0x5c/0x4c0 ? do_raw_spin_unlock+0x49/0xb0 writeback_sb_inodes+0x22c/0x560 __writeback_inodes_wb+0x4c/0xe0 wb_writeback+0x1d6/0x3f0 wb_workfn+0x334/0x520 process_one_work+0x1ee/0x570 ? lock_is_held_type+0xc6/0x130 worker_thread+0x1d1/0x3b0 ? __pfx_worker_thread+0x10/0x10 kthread+0xee/0x120 ? __pfx_kthread+0x10/0x10 ret_from_fork+0x30/0x50 ? __pfx_kthread+0x10/0x10 ret_from_fork_asm+0x1a/0x30 </TASK> Modules linked in: dm_mod btrfs blake2b_generic xor raid6_pq rapl CR2: 0000000000000020 * Case 2. Earlier completion of orig_bbio for mirrored btrfs_bios btrfs_bbio_propagate_error() assumes the end_io function for orig_bbio is called last among split bios. In that case, btrfs_orig_write_end_io() sets the bio->bi_status to BLK_STS_IOERR by seeing the bioc->error [2]. Otherwise, the increased orig_bio's bioc->error is not checked by anyone and return BLK_STS_OK to the upper layer. [2] Actually, this is not true. Because we only increases orig_bioc->errors by max_errors, the condition "atomic_read(&bioc->error) > bioc->max_errors" is still not met if only one split btrfs_bio fails. * Case 3. Later completion of orig_bbio for un-mirrored btrfs_bios In contrast to the above case, btrfs_bbio_propagate_error() is not working well if un-mirrored orig_bbio is completed last. It sets orig_bbio->bio.bi_status to the btrfs_bio's error. But, that is easily over-written by orig_bbio's completion status. 
If the status is BLK_STS_OK, the upper layer would not know the failure. * Solution Considering the above cases, we can only save the error status in the orig_bbio (remaining part after split) itself as it is always available. Also, the saved error status should be propagated when all the split btrfs_bios are finished (i.e, bbio->pending_ios == 0). This commit introduces "status" to btrfs_bbio and saves the first error of split bios to original btrfs_bio's "status" variable. When all the split bios are finished, the saved status is loaded into original btrfs_bio's status. With this commit, btrfs/146 on zoned devices does not hit the NULL pointer dereference anymore. Fixes: `852eee62d3` ("btrfs: allow btrfs_submit_bio to split bios") CC: stable@vger.kernel.org # 6.6+ Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-10-23 18:17:43 +02:00
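The solution described in the commit above (save the first error in the original bio, and hand it to the upper layer only once all split parts have finished) can be sketched in userspace C11. This is an illustrative model under stated assumptions, not the btrfs code: `struct parent_io`, `parent_io_init()` and `part_done()` are hypothetical names, and plain `int` error codes stand in for `blk_status_t`.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical stand-in for the original (parent) I/O; not a btrfs type. */
struct parent_io {
	atomic_int pending;  /* split parts still in flight */
	atomic_int status;   /* first non-zero error seen, 0 if none */
	int final_status;    /* what the upper layer is eventually told */
	int completed;       /* set once the last part has finished */
};

static void parent_io_init(struct parent_io *p, int parts)
{
	assert(parts > 0);
	atomic_init(&p->pending, parts);
	atomic_init(&p->status, 0);
	p->final_status = 0;
	p->completed = 0;
}

/* Called once per finished part, in any completion order. */
static void part_done(struct parent_io *p, int err)
{
	if (err) {
		int expected = 0;

		/* Keep only the first error; later errors do not overwrite it. */
		atomic_compare_exchange_strong(&p->status, &expected, err);
	}

	/* fetch_sub returns the old value, so 1 means this was the last part. */
	if (atomic_fetch_sub(&p->pending, 1) == 1) {
		p->final_status = atomic_load(&p->status);
		p->completed = 1;
	}
}
```

Because the error lives in the parent itself rather than in a context that is set up later, it does not matter whether the failing part completes first or last; a successful final completion cannot overwrite it.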
Johannes Berg	8dc6d81c6b	debugfs: add small file operations for most files As struct file_operations is really big, but (most) debugfs files only use simple_open, read, write and perhaps seek, and don't need anything else, this wastes a lot of space for NULL pointers. Add a struct debugfs_short_fops and some bookkeeping code in debugfs so that users can use that with debugfs_create_file() using _Generic to figure out which function to use. Converting mac80211 to use it where possible saves quite a bit of space: 1010127 205064 1220 1216411 128f9b net/mac80211/mac80211.ko (before) 981199 205064 1220 1187483 121e9b net/mac80211/mac80211.ko (after) ------- -28928 = ~28KiB With a marginal space cost in debugfs: 8701 550 16 9267 2433 fs/debugfs/inode.o (before) 25233 325 32 25590 63f6 fs/debugfs/file.o (before) 8914 558 16 9488 2510 fs/debugfs/inode.o (after) 25380 325 32 25737 6489 fs/debugfs/file.o (after) --------------- +360 +8 (All on x86-64) A simple spatch suggests there are more than 300 instances, not even counting the ones hidden in macros like in mac80211, that could be trivially converted, for additional savings of about 240 bytes for each. Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Link: https://patch.msgid.link/20241022151838.26f9925fb959.Ia80b55e934bbfc45ce0df42a3233d34b35508046@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2024-10-23 16:47:01 +02:00
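The `_Generic` dispatch mentioned in the debugfs commit above can be illustrated with a small userspace sketch. The structure layouts and the names `full_fops`, `short_fops`, `create_file_full()`, `create_file_short()` and `create_file()` are assumptions made for this example only; they are not the kernel's definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; the real kernel structures are much larger. */
struct full_fops {
	int (*open)(void);
	int (*read)(void);
	int (*write)(void);
	int (*mmap)(void);
	int (*poll)(void);
};

struct short_fops {
	int (*read)(void);
	int (*write)(void);
};

enum fops_kind { FOPS_FULL, FOPS_SHORT };

static enum fops_kind create_file_full(const char *name,
				       const struct full_fops *fops)
{
	(void)name;
	(void)fops;
	return FOPS_FULL;
}

static enum fops_kind create_file_short(const char *name,
					const struct short_fops *fops)
{
	(void)name;
	(void)fops;
	return FOPS_SHORT;
}

/* Pick the helper at compile time from the static type of the fops pointer. */
#define create_file(name, fops)					\
	_Generic((fops),					\
		const struct full_fops *:  create_file_full,	\
		struct full_fops *:        create_file_full,	\
		const struct short_fops *: create_file_short,	\
		struct short_fops *:       create_file_short)(name, fops)
```

Callers keep a single entry point, and a file that only needs read and write can pass the smaller structure without carrying the unused NULL pointers.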
Ye Bin	2ce1007f42	cifs: fix warning when destroy 'cifs_io_request_pool' There's an issue as follows: WARNING: CPU: 1 PID: 27826 at mm/slub.c:4698 free_large_kmalloc+0xac/0xe0 RIP: 0010:free_large_kmalloc+0xac/0xe0 Call Trace: <TASK> ? __warn+0xea/0x330 mempool_destroy+0x13f/0x1d0 init_cifs+0xa50/0xff0 [cifs] do_one_initcall+0xdc/0x550 do_init_module+0x22d/0x6b0 load_module+0x4e96/0x5ff0 init_module_from_file+0xcd/0x130 idempotent_init_module+0x330/0x620 __x64_sys_finit_module+0xb3/0x110 do_syscall_64+0xc1/0x1d0 entry_SYSCALL_64_after_hwframe+0x77/0x7f Obviously, 'cifs_io_request_pool' is not created by mempool_create(). So just use mempool_exit() to revert 'cifs_io_request_pool'. Fixes: `edea94a697` ("cifs: Add mempools for cifs_io_request and cifs_io_subrequest structs") Signed-off-by: Ye Bin <yebin10@huawei.com> Acked-by: David Howells <dhowells@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2024-10-23 07:42:44 -05:00
Henrique Carvalho	9a5dd61151	smb: client: Handle kstrdup failures for passwords In smb3_reconfigure(), after duplicating ctx->password and ctx->password2 with kstrdup(), we need to check for allocation failures. If ses->password allocation fails, return -ENOMEM. If ses->password2 allocation fails, free ses->password, set it to NULL, and return -ENOMEM. Fixes: `c1eb537bf4` ("cifs: allow changing password during remount") Reviewed-by: David Howells <dhowells@redhat.com> Signed-off-by: Haoxiang Li <make24@iscas.ac.cn> Signed-off-by: Henrique Carvalho <henrique.carvalho@suse.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2024-10-23 07:42:22 -05:00
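The error handling described in the commit above follows a common two-allocation pattern: if the second copy fails, undo the first before returning. A minimal userspace sketch, assuming both input strings are non-NULL; `struct session`, `session_set_passwords()`, `dup_string()` and the `dup_budget` failure-injection knob are hypothetical names for this example, with `malloc()` standing in for `kstrdup()`.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

struct session {
	char *password;
	char *password2;
};

/* Failure injection for the example: -1 = never fail, n = fail after n copies. */
static int dup_budget = -1;

static char *dup_string(const char *s)
{
	size_t n = strlen(s) + 1;
	char *copy;

	if (dup_budget == 0)
		return NULL;            /* simulated allocation failure */
	if (dup_budget > 0)
		dup_budget--;

	copy = malloc(n);
	if (copy)
		memcpy(copy, s, n);
	return copy;
}

static int session_set_passwords(struct session *ses,
				 const char *pw, const char *pw2)
{
	ses->password = dup_string(pw);
	if (!ses->password)
		return -ENOMEM;

	ses->password2 = dup_string(pw2);
	if (!ses->password2) {
		/* Do not leave a half-updated session behind. */
		free(ses->password);
		ses->password = NULL;
		return -ENOMEM;
	}
	return 0;
}
```

Resetting `password` to NULL after freeing it matters as much as the free itself, since later code may test the pointer to decide whether a password is set.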
Dave Kleikamp	67373ca840	jfs: Fix sanity check in dbMount MAXAG is a legitimate value for bmp->db_numag Fixes: `e63866a475` ("jfs: fix out-of-bounds in dbNextAG() and diAlloc()") Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>	2024-10-22 09:40:37 -05:00
Jens Axboe	fdad1a20cd	Merge branch 'for-6.13/block-atomic' into for-6.13/block Merge in block/fs prep patches for the atomic write support. * for-6.13/block-atomic: block: Add bdev atomic write limits helpers fs/block: Check for IOCB_DIRECT in generic_atomic_write_valid() block/fs: Pass an iocb to generic_atomic_write_valid()	2024-10-22 08:21:51 -06:00
Yue Haibing	75f49c3dc7	btrfs: fix passing 0 to ERR_PTR in btrfs_search_dir_index_item() ret may be zero in btrfs_search_dir_index_item() and should not be passed to ERR_PTR(). Since btrfs_unlink_subvol() is now its only caller, rework the code to use ERR_PTR(-ENOENT) when ret >= 0. This fixes smatch warnings: fs/btrfs/dir-item.c:353 btrfs_search_dir_index_item() warn: passing zero to 'ERR_PTR' Fixes: `9dcbe16fcc` ("btrfs: use btrfs_for_each_slot in btrfs_search_dir_index_item") CC: stable@vger.kernel.org # 6.1+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Yue Haibing <yuehaibing@huawei.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-10-22 16:10:55 +02:00
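Why passing zero to ERR_PTR() is a problem can be shown with userspace re-implementations of the error-pointer helpers. The three helpers below imitate the kernel's macros of the same names; `finish_buggy()` and `finish_fixed()` are hypothetical functions modeled loosely on the commit description, not the actual btrfs code.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* Userspace imitations of the kernel's error-pointer helpers. */
static inline void *ERR_PTR(long error)
{
	return (void *)(intptr_t)error;
}

static inline long PTR_ERR(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

static inline int IS_ERR(const void *ptr)
{
	/* Only the top MAX_ERRNO addresses count as encoded errors. */
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* "Not found" path with ret > 0 folded to 0: yields ERR_PTR(0), i.e. NULL. */
static void *finish_buggy(int ret)
{
	if (ret > 0)
		ret = 0;
	return ERR_PTR(ret);
}

/* Any non-negative ret on the "not found" path becomes -ENOENT. */
static void *finish_fixed(int ret)
{
	if (ret >= 0)
		ret = -ENOENT;
	return ERR_PTR(ret);
}
```

With the buggy variant a caller that checks only IS_ERR() sees a NULL pointer as success; with the fixed variant every "not found" result is a proper error pointer.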
Qu Wenruo	3c36a72c1d	btrfs: reject ro->rw reconfiguration if there are hard ro requirements [BUG] Syzbot reports the following crash: BTRFS info (device loop0 state MCS): disabling free space tree BTRFS info (device loop0 state MCS): clearing compat-ro feature flag for FREE_SPACE_TREE (0x1) BTRFS info (device loop0 state MCS): clearing compat-ro feature flag for FREE_SPACE_TREE_VALID (0x2) Oops: general protection fault, probably for non-canonical address 0xdffffc0000000003: 0000 [#1] PREEMPT SMP KASAN NOPTI KASAN: null-ptr-deref in range [0x0000000000000018-0x000000000000001f] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014 RIP: 0010:backup_super_roots fs/btrfs/disk-io.c:1691 [inline] RIP: 0010:write_all_supers+0x97a/0x40f0 fs/btrfs/disk-io.c:4041 Call Trace: <TASK> btrfs_commit_transaction+0x1eae/0x3740 fs/btrfs/transaction.c:2530 btrfs_delete_free_space_tree+0x383/0x730 fs/btrfs/free-space-tree.c:1312 btrfs_start_pre_rw_mount+0xf28/0x1300 fs/btrfs/disk-io.c:3012 btrfs_remount_rw fs/btrfs/super.c:1309 [inline] btrfs_reconfigure+0xae6/0x2d40 fs/btrfs/super.c:1534 btrfs_reconfigure_for_mount fs/btrfs/super.c:2020 [inline] btrfs_get_tree_subvol fs/btrfs/super.c:2079 [inline] btrfs_get_tree+0x918/0x1920 fs/btrfs/super.c:2115 vfs_get_tree+0x90/0x2b0 fs/super.c:1800 do_new_mount+0x2be/0xb40 fs/namespace.c:3472 do_mount fs/namespace.c:3812 [inline] __do_sys_mount fs/namespace.c:4020 [inline] __se_sys_mount+0x2d6/0x3c0 fs/namespace.c:3997 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f [CAUSE] To support mounting different subvolume with different RO/RW flags for the new mount APIs, btrfs introduced two workaround to support this feature: - Skip mount option/feature checks if we are mounting a different subvolume - Reconfigure the fs to RW if the initial mount is RO Combining these two, we can have the following sequence: - 
Mount the fs ro,rescue=all,clear_cache,space_cache=v1 rescue=all will mark the fs as hard read-only, so no v2 cache clearing will happen. - Mount a subvolume rw of the same fs. We go into btrfs_get_tree_subvol(), but fc_mount() returns EBUSY because our new fc is RW, different from the original fs. Now we enter btrfs_reconfigure_for_mount(), which switches the RO flag first so that we can grab the existing fs_info. Then we reconfigure the fs to RW. - During reconfiguration, option/features check is skipped This means we will restart the v2 cache clearing, and convert back to v1 cache. This will trigger fs writes, and since the original fs has "rescue=all" option, it skips the csum tree read. And eventually causing NULL pointer dereference in super block writeback. [FIX] For reconfiguration caused by different subvolume RO/RW flags, ensure we always run btrfs_check_options() to ensure we have proper hard RO requirements met. In fact the function btrfs_check_options() doesn't really do many complex checks, but hard RO requirement and some feature dependency checks, thus there is no special reason not to do the check for mount reconfiguration. Reported-by: syzbot+56360f93efa90ff15870@syzkaller.appspotmail.com Link: https://lore.kernel.org/linux-btrfs/0000000000008c5d090621cb2770@google.com/ Fixes: `f044b31867` ("btrfs: handle the ro->rw transition for mounting different subvolumes") CC: stable@vger.kernel.org # 6.8+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-10-22 16:10:51 +02:00
Boris Burkov	7a2339058e	btrfs: fix read corruption due to race with extent map merging In debugging some corrupt squashfs files, we observed symptoms of corrupt page cache pages but correct on-disk contents. Further investigation revealed that the exact symptom was a correct page followed by an incorrect, duplicate, page. This got us thinking about extent maps. commit `ac05ca913e` ("Btrfs: fix race between using extent maps and merging them") enforces a reference count on the primary `em` extent_map being merged, as that one gets modified. However, since, commit `3d2ac99224` ("btrfs: introduce new members for extent_map") both 'em' and 'merge' get modified, which started modifying 'merge' and thus introduced the same race. We were able to reproduce this by looping the affected squashfs workload in parallel on a bunch of separate btrfs-es while also dropping caches. We are still working on a simple enough reproducer to make into an fstest. The simplest fix is to stop modifying 'merge', which is not essential, as it is dropped immediately after the merge. This behavior is simply a consequence of the order of the two extent maps being important in computing the new values. Modify merge_ondisk_extents to take prev and next by const* and also take a third merged parameter that it puts the results in. Note that this introduces the rather odd behavior of passing 'em' to merge_ondisk_extents as a const * and as a regular ptr. Fixes: `3d2ac99224` ("btrfs: introduce new members for extent_map") CC: stable@vger.kernel.org # 6.11+ Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>	2024-10-22 16:10:13 +02:00
Qu Wenruo	f10f59f91a	btrfs: fix the delalloc range locking if sector size < page size Inside lock_delalloc_folios(), there are several problems related to sector size < page size handling: - Set the writer locks without checking if the folio is still valid We call btrfs_folio_start_writer_lock() just like it's folio_lock(). But since the folio may not even be the folio of the current mapping, we can easily screw up the folio->private. - The range is not clamped inside the page This means we can over write other bitmaps if the start/len is not properly handled, and trigger the btrfs_subpage_assert(). - @processed_end is always rounded up to page end If the delalloc range is not page aligned, and we need to retry (returning -EAGAIN), then we will unlock to the page end. Thankfully this is not a huge problem, as now btrfs_folio_end_writer_lock() can handle range larger than the locked range, and only unlock what is already locked. Fix all these problems by: - Lock and check the folio first, then call btrfs_folio_set_writer_lock() So that if we got a folio not belonging to the inode, we won't touch folio->private. - Properly truncate the range inside the page - Update @processed_end to the locked range end Fixes: `1e1de38792` ("btrfs: make process_one_page() to handle subpage locking") CC: stable@vger.kernel.org # 6.1+ Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-10-22 16:09:44 +02:00
Qu Wenruo	5f9062a48d	btrfs: qgroup: set a more sane default value for subtree drop threshold Since commit `011b46c304` ("btrfs: skip subtree scan if it's too high to avoid low stall in btrfs_commit_transaction()"), btrfs qgroup can automatically skip large subtree scan at the cost of marking qgroup inconsistent. It's designed to address the final performance problem of snapshot drop with qgroup enabled, but to be safe the default value is BTRFS_MAX_LEVEL, requiring a user space daemon to set a different value to make it work. I'd say it's not a good idea to rely on a user space tool to set this default value, especially when some operations (snapshot dropping) can be triggered immediately after mount, leaving a very small window to set that sysfs interface. So instead of disabling this new feature by default, enable it with a low threshold (3), so that large subvolume tree drop at mount time won't cause huge qgroup workload. CC: stable@vger.kernel.org # 6.1 Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-10-22 16:09:11 +02:00
Filipe Manana	3510e684b8	btrfs: clear force-compress on remount when compress mount option is given After the migration to use fs context for processing mount options we had a slight change in the semantics for remounting a filesystem that was mounted with compress-force. Before we could clear compress-force by passing only "-o compress[=algo]" during a remount, but after that change that does not work anymore, force-compress is still present and one needs to pass "-o compress-force=no,compress[=algo]" to the mount command. Example, when running on a kernel 6.8+: $ mount -o compress-force=zlib:9 /dev/sdi /mnt/sdi $ mount \| grep sdi /dev/sdi on /mnt/sdi type btrfs (rw,relatime,compress-force=zlib:9,discard=async,space_cache=v2,subvolid=5,subvol=/) $ mount -o remount,compress=zlib:5 /mnt/sdi $ mount \| grep sdi /dev/sdi on /mnt/sdi type btrfs (rw,relatime,compress-force=zlib:5,discard=async,space_cache=v2,subvolid=5,subvol=/) On a 6.7 kernel (or older): $ mount -o compress-force=zlib:9 /dev/sdi /mnt/sdi $ mount \| grep sdi /dev/sdi on /mnt/sdi type btrfs (rw,relatime,compress-force=zlib:9,discard=async,space_cache=v2,subvolid=5,subvol=/) $ mount -o remount,compress=zlib:5 /mnt/sdi $ mount \| grep sdi /dev/sdi on /mnt/sdi type btrfs (rw,relatime,compress=zlib:5,discard=async,space_cache=v2,subvolid=5,subvol=/) So update btrfs_parse_param() to clear "compress-force" when "compress" is given, providing the same semantics as kernel 6.7 and older. Reported-by: Roman Mamedov <rm@romanrm.net> Link: https://lore.kernel.org/linux-btrfs/20241014182416.13d0f8b0@nvm/ CC: stable@vger.kernel.org # 6.8+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-10-22 16:07:53 +02:00
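The remount semantics restored by the commit above can be modeled with two flags. This is a simplified sketch: `struct mount_opts` and `apply_option()` are hypothetical, the `=algo` suffix is ignored, and the real logic lives in btrfs_parse_param().

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical model of the two compression-related mount flags. */
struct mount_opts {
	bool compress;
	bool force_compress;
};

static void apply_option(struct mount_opts *o, const char *key)
{
	if (strcmp(key, "compress") == 0) {
		o->compress = true;
		/* Plain "compress" also clears a previous "compress-force". */
		o->force_compress = false;
	} else if (strcmp(key, "compress-force") == 0) {
		o->compress = true;
		o->force_compress = true;
	}
}
```

Applying "compress-force" and then "compress" leaves compression enabled but not forced, which matches the 6.7 output shown in the commit message.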
Christoph Hellwig	4a201dcfa1	xfs: update the pag for the last AG at recovery time Currently log recovery never updates the in-core perag values for the last allocation group when they were grown by growfs. This leads to btree record validation failures for the alloc, ialloc or finotbt trees if a transaction references this new space. Found by Brian's new growfs recovery stress test. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2024-10-22 13:37:19 +02:00
Christoph Hellwig	069cf5e32b	xfs: don't use __GFP_RETRY_MAYFAIL in xfs_initialize_perag __GFP_RETRY_MAYFAIL increases the likelihood of allocations failing, which isn't really helpful during log recovery. Remove the flag and stick to the default GFP_KERNEL policies. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2024-10-22 13:37:18 +02:00
Christoph Hellwig	b882b0f813	xfs: error out when a superblock buffer update reduces the agcount XFS currently does not support reducing the agcount, so error out if a logged sb buffer tries to shrink the agcount. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2024-10-22 13:37:18 +02:00
Christoph Hellwig	6a18765b54	xfs: update the file system geometry after recoverying superblock buffers Primary superblock buffers that change the file system geometry after a growfs operation can affect the operation of later CIL checkpoints that make use of the newly added space and allocation groups. Apply the changes to the in-memory structures as part of recovery pass 2, to ensure recovery works fine for such cases. In the future we should apply the logic to other updates such as features bits as well. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2024-10-22 13:37:18 +02:00
Christoph Hellwig	aa67ec6a25	xfs: merge the perag freeing helpers There is no good reason to have two different routines for freeing perag structures for the unmount and error cases. Add two arguments to specify the range of AGs to free to xfs_free_perag, and use that to replace xfs_free_unused_perag_range. The additional RCU grace period for the error case is harmless, and the extra check for the AG to actually exist is not required now that the callers pass the exact known allocated range. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2024-10-22 13:37:18 +02:00
Christoph Hellwig	82742f8c3f	xfs: pass the exact range to initialize to xfs_initialize_perag Currently only the new agcount is passed to xfs_initialize_perag, which requires lookups of existing AGs to skip them and complicates error handling. Also pass the previous agcount so that the range that xfs_initialize_perag operates on is exactly defined. That way the extra lookups can be avoided, and error handling can clean up the exact range from the old count to the last added perag structure. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2024-10-22 13:37:18 +02:00
Darrick J. Wong	af8512c527	xfs: don't fail repairs on metadata files with no attr fork Fix a minor bug where we fail repairs on metadata files that do not have attr forks because xrep_metadata_inode_subtype doesn't filter ENOENT. Cc: stable@vger.kernel.org # v6.8 Fixes: `5a8e07e799` ("xfs: repair the inode core and forks of a metadata inode") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2024-10-22 13:37:18 +02:00
Thorsten Blum	8c6e03ffed	acl: Annotate struct posix_acl with __counted_by() Add the __counted_by compiler attribute to the flexible array member a_entries to improve access bounds-checking via CONFIG_UBSAN_BOUNDS and CONFIG_FORTIFY_SOURCE. Use struct_size() to calculate the number of bytes to allocate for new and cloned acls and remove the local size variables. Change the posix_acl_alloc() function parameter count from int to unsigned int to match posix_acl's a_count data type. Add identifier names to the function definition to silence two checkpatch warnings. Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Link: https://lore.kernel.org/r/20241018121426.155247-2-thorsten.blum@linux.dev Cc: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-10-22 11:16:59 +02:00
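The pattern in the commit above (a flexible array member tied to its count field, plus an overflow-aware size calculation) can be sketched in userspace. `struct acl`, `acl_size()`, `acl_alloc()` and the `COUNTED_BY` wrapper are hypothetical names for this example; the attribute is applied only where the compiler reports support for it, and `acl_size()` imitates the saturating behaviour of the kernel's struct_size() rather than reusing it.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Use the counted_by attribute only if the compiler supports it. */
#if defined(__has_attribute)
# if __has_attribute(counted_by)
#  define COUNTED_BY(member) __attribute__((counted_by(member)))
# endif
#endif
#ifndef COUNTED_BY
# define COUNTED_BY(member)
#endif

struct acl_entry {
	short tag;
	unsigned short perm;
	unsigned int id;
};

struct acl {
	unsigned int count;
	struct acl_entry entries[] COUNTED_BY(count);
};

/* Header plus count entries, saturating to SIZE_MAX on overflow. */
static size_t acl_size(unsigned int count)
{
	size_t max_entries =
		(SIZE_MAX - sizeof(struct acl)) / sizeof(struct acl_entry);

	if (count > max_entries)
		return SIZE_MAX;
	return sizeof(struct acl) + (size_t)count * sizeof(struct acl_entry);
}

static struct acl *acl_alloc(unsigned int count)
{
	size_t size = acl_size(count);
	struct acl *acl;

	if (size == SIZE_MAX)
		return NULL;

	acl = calloc(1, size);
	if (acl)
		acl->count = count;  /* set the bound before touching entries */
	return acl;
}
```

Taking the count as `unsigned int` mirrors the commit's change to the allocation function's parameter, so the argument and the stored count have the same type.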
Xuewen Yan	900bbaae67	epoll: Add synchronous wakeup support for ep_poll_callback Currently, epoll only uses the wake_up() interface to wake up tasks. However, some epoll users, such as the Android binder driver, want to use the synchronous wakeup flag to hint the scheduler. So add a wake_up_sync() define, and use wake_up_sync() when sync is true in ep_poll_callback(). Co-developed-by: Jing Xia <jing.xia@unisoc.com> Signed-off-by: Jing Xia <jing.xia@unisoc.com> Signed-off-by: Xuewen Yan <xuewen.yan@unisoc.com> Link: https://lore.kernel.org/r/20240426080548.8203-1-xuewen.yan@unisoc.com Tested-by: Brian Geffon <bgeffon@google.com> Reviewed-by: Brian Geffon <bgeffon@google.com> Reported-by: Benoit Lize <lizeb@google.com> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-10-22 11:16:59 +02:00
Rik van Riel	0dfcb72d33	coredump: add cond_resched() to dump_user_range The loop between elf_core_dump() and dump_user_range() can run for so long that the system shows softlockup messages, with side effects like workqueues and RCU getting stuck on the core dumping CPU. Add a cond_resched() in dump_user_range() to avoid that softlockup. Signed-off-by: Rik van Riel <riel@surriel.com> Link: https://lore.kernel.org/r/20241010113651.50cb0366@imladris.surriel.com Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-10-22 11:16:58 +02:00
Andrew Kreimer	80d3ab2227	fs/inode: Fix a typo Fix a typo in comments: wether v-> whether. Signed-off-by: Andrew Kreimer <algonell@gmail.com> Link: https://lore.kernel.org/r/20241008121602.16778-1-algonell@gmail.com Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-10-22 11:16:58 +02:00
Christian Brauner	2714b0d1f3	fcntl: make F_DUPFD_QUERY associative Currently when passing a closed file descriptor to fcntl(fd, F_DUPFD_QUERY, fd_dup) the order matters: fd = open("/dev/null"); fd_dup = dup(fd); When we now close one of the file descriptors we get: (1) fcntl(fd, fd_dup) // -EBADF (2) fcntl(fd_dup, fd) // 0 aka not equal depending on which file descriptor is passed first. That's not a huge deal but it gives the api a slightly weird feel. Make it so that the order doesn't matter by requiring that both file descriptors are valid: (1') fcntl(fd, fd_dup) // -EBADF (2') fcntl(fd_dup, fd) // -EBADF Link: https://lore.kernel.org/r/20241008-duften-formel-251f967602d5@brauner Fixes: `c62b758bae` ("fcntl: add F_DUPFD_QUERY fcntl()") Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-By: Lennart Poettering <lennart@poettering.net> Cc: stable@vger.kernel.org Reported-by: Lennart Poettering <lennart@poettering.net> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-10-22 11:16:58 +02:00
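A small wrapper shows how userspace might issue the query discussed in the commit above. It is a sketch under assumptions: `same_open_file()` is a hypothetical helper, the fallback value for `F_DUPFD_QUERY` is assumed to match the Linux uapi headers (F_LINUX_SPECIFIC_BASE + 3), and kernels that predate the command are expected to reject it with an error rather than answer.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

#ifndef F_DUPFD_QUERY
/* Assumed value: F_LINUX_SPECIFIC_BASE (1024) + 3 in the Linux uapi headers. */
#define F_DUPFD_QUERY (1024 + 3)
#endif

/*
 * Returns 1 if both descriptors refer to the same open file description,
 * 0 if they do not, and a negative errno value on failure.
 */
static int same_open_file(int fd_a, int fd_b)
{
	int ret = fcntl(fd_a, F_DUPFD_QUERY, fd_b);

	return ret < 0 ? -errno : ret;
}
```

With the change described above, a caller no longer has to care which of the two descriptors is passed first: if either one is closed, the result is -EBADF in both orders.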
Andreas Gruenbacher	c298638743	vfs: inode insertion kdoc corrections Some minor corrections to the inode_insert5 and iget5_locked kernel documentation. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Link: https://lore.kernel.org/r/20241004115151.44834-1-agruenba@redhat.com Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-10-22 11:16:57 +02:00
Uros Bizjak	0cb9c994e7	namespace: Use atomic64_inc_return() in alloc_mnt_ns() Use atomic64_inc_return(&ref) instead of atomic64_add_return(1, &ref) to use optimized implementation and ease register pressure around the primitive for targets that implement optimized variant. Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Link: https://lore.kernel.org/r/20241007085303.48312-1-ubizjak@gmail.com Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-10-22 11:16:57 +02:00
Julia Lawall	1e756248be	fs: Reorganize kerneldoc parameter names Reorganize kerneldoc parameter names to match the parameter order in the function header. Problems identified using Coccinelle. Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr> Link: https://lore.kernel.org/r/20240930112121.95324-9-Julia.Lawall@inria.fr Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-10-22 11:16:57 +02:00
Yafang Shao	e6957c99dc	vfs: Add a sysctl for automated deletion of dentry Commit `681ce86235` ("vfs: Delete the associated dentry when deleting a file") introduced an unconditional deletion of the associated dentry when a file is removed. However, this led to performance regressions in specific benchmarks, such as filebench.sum_operations/s [0], prompting a revert in commit `4a4be1ad3a` ("Revert "vfs: Delete the associated dentry when deleting a file""). This patch seeks to reintroduce the concept conditionally, where the associated dentry is deleted only when the user explicitly opts for it during file removal. A new sysctl fs.automated_deletion_of_dentry is added for this purpose. Its default value is set to 0. There are practical use cases for this proactive dentry reclamation. Besides the Elasticsearch use case mentioned in commit `681ce86235`, additional examples have surfaced in our production environment. For instance, in video rendering services that continuously generate temporary files, upload them to persistent storage servers, and then delete them, a large number of negative dentries, serving no useful purpose, accumulate. Users in such cases would benefit from proactively reclaiming these negative dentries. Link: https://lore.kernel.org/linux-fsdevel/202405291318.4dfbb352-oliver.sang@intel.com [0] Link: https://lore.kernel.org/all/20240912-programm-umgibt-a1145fa73bb6@brauner/ Suggested-by: Christian Brauner <brauner@kernel.org> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Link: https://lore.kernel.org/r/20240929122831.92515-1-laoar.shao@gmail.com Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Mateusz Guzik <mjguzik@gmail.com> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-10-22 11:16:57 +02:00
Christian Brauner	6474353a5e	epoll: annotate racy check Epoll relies on a racy fastpath check during __fput() in eventpoll_release() to avoid the hit of pointlessly acquiring a semaphore. Annotate that race by using WRITE_ONCE() and READ_ONCE(). Link: https://lore.kernel.org/r/66edfb3c.050a0220.3195df.001a.GAE@google.com Link: https://lore.kernel.org/r/20240925-fungieren-anbauen-79b334b00542@brauner Reviewed-by: Jan Kara <jack@suse.cz> Reported-by: syzbot+3b6b32dc50537a49bb4a@syzkaller.appspotmail.com Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-10-22 11:16:56 +02:00
Linus Torvalds	7166c32651	vfs-6.12-rc5.fixes -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZxY6XAAKCRCRxhvAZXjc opmUAQCu4KhzBBdZmFw3AfZFNJvYb1onT4FiU0pnyGgfvzEdEwD6AlnlgQ7DL3ZN WBqBzUl+DpGYJfzhkqoEGH89Fagx7QM= =mm68 -----END PGP SIGNATURE----- Merge tag 'vfs-6.12-rc5.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs fixes from Christian Brauner: "afs: - Fix a lock recursion in afs_wake_up_async_call() on ->notify_lock netfs: - Drop the references to a folio immediately after the folio has been extracted to prevent races with future I/O collection - Fix a documentation build error - Downgrade the i_rwsem for buffered writes to fix a cifs reported performance regression when switching to netfslib vfs: - Explicitly return -E2BIG from openat2() if the specified size is unexpectedly large. This aligns openat2() with other extensible struct based system calls - When copying a mount namespace ensure that we only try to remove the new copy from the mount namespace rbtree if it has already been added to it nilfs: - Clear the buffer delay flag when clearing the buffer state flags when a buffer head is discarded to prevent a kernel OOPs ocfs2: - Fix an uninitialized value warning in ocfs2_setattr() proc: - Fix a kernel doc warning" * tag 'vfs-6.12-rc5.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: proc: Fix W=1 build kernel-doc warning afs: Fix lock recursion fs: Fix uninitialized value issue in from_kuid and from_kgid fs: don't try and remove empty rbtree node netfs: Downgrade i_rwsem for a buffered write nilfs2: fix kernel bug due to missing clearing of buffer delay flag openat2: explicitly return -E2BIG for (usize > PAGE_SIZE) netfs: fix documentation build error netfs: In readahead, put the folio refs as soon extracted	2024-10-21 10:48:24 -07:00
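The openat2() item in the pull request above ("explicitly return -E2BIG for (usize > PAGE_SIZE)") concerns the size check that extensible-struct system calls perform on the user-supplied struct size. A minimal userspace sketch of such a check follows. The names `toy_check_usize`, `TOY_PAGE_SIZE` and `TOY_SIZE_VER0`, the concrete values 4096 and 24, and the -EINVAL case for undersized structs are all assumptions made for illustration and are not taken from the commit text.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define TOY_PAGE_SIZE 4096u  /* assumed page size for this sketch */
#define TOY_SIZE_VER0 24u    /* assumed size of the first struct version */

/*
 * Validate the size of an extensible argument struct passed from userspace.
 * Returns 0 if the size is acceptable, -EINVAL if it is smaller than the
 * first published version (an assumption of this model), and -E2BIG if it
 * is unexpectedly large.
 */
static int toy_check_usize(size_t usize)
{
	if (usize < TOY_SIZE_VER0)
		return -EINVAL;
	if (usize > TOY_PAGE_SIZE)
		return -E2BIG;
	return 0;
}
```

Returning a distinct error for the oversized case lets callers tell "struct too new or too large for this kernel" apart from a plainly malformed argument.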
Christoph Hellwig	6db388585e	iomap: turn iomap_want_unshare_iter into an inline function iomap_want_unshare_iter currently sits in fs/iomap/buffered-io.c, which depends on CONFIG_BLOCK. It is also used in fs/dax.c, which has no such dependency. Given that it is a trivial check, turn it into an inline in include/linux/iomap.h to fix the DAX && !BLOCK build. Fixes: `6ef6a0e821` ("iomap: share iomap_unshare_iter predicate code with fsdax") Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20241015041350.118403-1-hch@lst.de Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-10-21 17:01:01 +02:00
Jan Kara	fb6f20ecb1	reiserfs: The last commit Deprecation period of reiserfs ends with the end of this year so it is time to remove it from the kernel. Acked-by: Darrick J. Wong <djwong@kernel.org> Acked-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Jan Kara <jack@suse.cz>	2024-10-21 16:29:38 +02:00
Yang Erkun	d5ff2fb2e7	nfsd: cancel nfsd_shrinker_work using sync mode in nfs4_state_shutdown_net In the normal case, when we excute `echo 0 > /proc/fs/nfsd/threads`, the function `nfs4_state_destroy_net` in `nfs4_state_shutdown_net` will release all resources related to the hashed `nfs4_client`. If the `nfsd_client_shrinker` is running concurrently, the `expire_client` function will first unhash this client and then destroy it. This can lead to the following warning. Additionally, numerous use-after-free errors may occur as well. nfsd_client_shrinker echo 0 > /proc/fs/nfsd/threads expire_client nfsd_shutdown_net unhash_client ... nfs4_state_shutdown_net /* won't wait shrinker exit / / cancel_work(&nn->nfsd_shrinker_work) * nfsd_file for this /* won't destroy unhashed client1 / client1 still alive nfs4_state_destroy_net / nfsd_file_cache_shutdown / trigger warning / kmem_cache_destroy(nfsd_file_slab) kmem_cache_destroy(nfsd_file_mark_slab) / release nfsd_file and mark */ __destroy_client ==================================================================== BUG nfsd_file (Not tainted): Objects remaining in nfsd_file on __kmem_cache_shutdown() -------------------------------------------------------------------- CPU: 4 UID: 0 PID: 764 Comm: sh Not tainted 6.12.0-rc3+ #1 dump_stack_lvl+0x53/0x70 slab_err+0xb0/0xf0 __kmem_cache_shutdown+0x15c/0x310 kmem_cache_destroy+0x66/0x160 nfsd_file_cache_shutdown+0xac/0x210 [nfsd] nfsd_destroy_serv+0x251/0x2a0 [nfsd] nfsd_svc+0x125/0x1e0 [nfsd] write_threads+0x16a/0x2a0 [nfsd] nfsctl_transaction_write+0x74/0xa0 [nfsd] vfs_write+0x1a5/0x6d0 ksys_write+0xc1/0x160 do_syscall_64+0x5f/0x170 entry_SYSCALL_64_after_hwframe+0x76/0x7e ==================================================================== BUG nfsd_file_mark (Tainted: G B W ): Objects remaining nfsd_file_mark on __kmem_cache_shutdown() -------------------------------------------------------------------- dump_stack_lvl+0x53/0x70 slab_err+0xb0/0xf0 
__kmem_cache_shutdown+0x15c/0x310 kmem_cache_destroy+0x66/0x160 nfsd_file_cache_shutdown+0xc8/0x210 [nfsd] nfsd_destroy_serv+0x251/0x2a0 [nfsd] nfsd_svc+0x125/0x1e0 [nfsd] write_threads+0x16a/0x2a0 [nfsd] nfsctl_transaction_write+0x74/0xa0 [nfsd] vfs_write+0x1a5/0x6d0 ksys_write+0xc1/0x160 do_syscall_64+0x5f/0x170 entry_SYSCALL_64_after_hwframe+0x76/0x7e To resolve this issue, cancel `nfsd_shrinker_work` using synchronous mode in nfs4_state_shutdown_net. Fixes: `7c24fa2250` ("NFSD: replace delayed_work with work_struct for nfsd_client_shrinker") Signed-off-by: Yang Erkun <yangerkun@huaweicloud.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2024-10-21 10:27:36 -04:00
Gao Xiang	14c2d97265	erofs: use get_tree_bdev_flags() to avoid misleading messages Users can pass in an arbitrary source path for the proper type of mount, so use get_tree_bdev_flags() to avoid storing the misleading "Can't lookup blockdev" error message. Reported-by: Allison Karlitskaya <allison.karlitskaya@redhat.com> Closes: https://lore.kernel.org/r/CAOYeF9VQ8jKVmpy5Zy9DNhO6xmWSKMB-DO8yvBB0XvBE7=3Ugg@mail.gmail.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20241009033151.2334888-2-hsiangkao@linux.alibaba.com Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-10-21 14:30:27 +02:00
Gao Xiang	4021e68513	fs/super.c: introduce get_tree_bdev_flags() As Allison reported [1], currently get_tree_bdev() will store "Can't lookup blockdev" error message. Although it makes sense for pure bdev-based fses, this message may mislead users who try to use EROFS file-backed mounts since get_tree_nodev() is used as a fallback then. Add get_tree_bdev_flags() to specify extensible flags [2] and GET_TREE_BDEV_QUIET_LOOKUP to silence "Can't lookup blockdev" message since it's misleading to EROFS file-backed mounts now. [1] https://lore.kernel.org/r/CAOYeF9VQ8jKVmpy5Zy9DNhO6xmWSKMB-DO8yvBB0XvBE7=3Ugg@mail.gmail.com [2] https://lore.kernel.org/r/ZwUkJEtwIpUA4qMz@infradead.org Suggested-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20241009033151.2334888-1-hsiangkao@linux.alibaba.com Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-10-21 14:30:26 +02:00
Miklos Szeredi	184429a17f	Revert "fuse: move initialization of fuse_file to fuse_writepages() instead of in callback" This reverts commit `672c3b7457`. fuse_writepages() might be called with no dirty pages after all writable opens were closed. In this case __fuse_write_file_get() will return NULL which will trigger the WARNING. The exact conditions under which this is triggered are unclear and syzbot didn't find a reproducer yet. Reported-by: syzbot+217a976dc26ef2fa8711@syzkaller.appspotmail.com Link: https://lore.kernel.org/all/CAJnrk1aQwfvb51wQ5rUSf9N8j1hArTFeSkHqC_3T-mU6_BCD=A@mail.gmail.com/ Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2024-10-21 10:02:51 +02:00
Ingo Molnar	d1fb8a78b2	Linux 6.12-rc4 -----BEGIN PGP SIGNATURE----- iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmcVgfoeHHRvcnZhbGRz QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGhCYH/0Sdfp3cIq3JWLRv HCkWhPkPbEvR5XQlYQsAvTPVrEc0ZG9PKlXCaYaa8Tvt8xQ7WT/VDTjKgaWEhr8s qa6bNTx1zggiNBTP/3jYsNliOyAYfw5qjxA7fpEmueAeuT5y1XKZFKPHEXE/1qbR 8zeISKTkE0qwUmLqCdXe2qBWFnCC5i+78RcI6IN7uErnuNWk7ssapldgU4DB+dEl DDRxi1FTvARGPQGl8T+jPkfJiugv87ksG7l4WsqcYgoW+045K76C7I6vQjkDOrsd wqtPIow/yPmGQbbdRhWLxNU+wDmselYQ6xp7aMxppNF45HoHtzNm+X+T2ZU3bPoP iT2Mkbg= =+GXK -----END PGP SIGNATURE----- Merge tag 'v6.12-rc4' into sched/core, to resolve conflict Overlapping fixes solving the same bug slightly differently: `7266f0a6d3` fs/bcachefs: Fix __wait_on_freeing_inode() definition of waitqueue entry `3b80552e70` bcachefs: __wait_for_freeing_inode: Switch to wait_bit_queue_entry Use the upstream version. Signed-off-by: Ingo Molnar <mingo@kernel.org>	2024-10-21 08:14:15 +02:00
Kent Overstreet	a069f01479	bcachefs: Set bch_inode_unpacked.bi_snapshot in old inode path This fixes a fsck bug on a very old filesystem (pre mainline merge). Fixes: `72350ee0ea` ("bcachefs: Kill snapshot arg to fsck_write_inode()") Reported-by: Marcin Mirosław <marcin@mejor.pl> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-20 18:09:09 -04:00
Kent Overstreet	e04ee86089	bcachefs: Mark more errors as AUTOFIX Reported-by: Marcin Mirosław <marcin@mejor.pl> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-20 18:08:53 -04:00
Kent Overstreet	f0d3302073	bcachefs: Workaround for kvmalloc() not supporting > INT_MAX allocations kvmalloc() doesn't support allocations > INT_MAX, but vmalloc() does - the limit should be lifted, but we can work around this for now. A user with a 75 TB filesystem reported the following journal replay error: https://github.com/koverstreet/bcachefs/issues/769 In journal replay we have to sort and dedup all the keys from the journal, which means we need a large contiguous allocation. Given that the user has 128GB of ram, the 2GB limit on allocation size has become far too small. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-20 16:50:14 -04:00
Kent Overstreet	3956ff8bc2	bcachefs: Don't use wait_event_interruptible() in recovery Fix a bug where mount was failing with -ERESTARTSYS: https://github.com/koverstreet/bcachefs/issues/741 We only want the interruptible wait when called from fsync. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-20 16:50:14 -04:00
Kent Overstreet	eb5db64c45	bcachefs: Fix __bch2_fsck_err() warning We only warn about having a btree_trans that wasn't passed in if we'll be prompting. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-20 16:50:14 -04:00
Al Viro	e896474fe4	getname_maybe_null() - the third variant of pathname copy-in Semantics used by statx(2) (and later xattrat(2)): without AT_EMPTY_PATH it's standard getname() (i.e. ERR_PTR(-ENOENT) on empty string, ERR_PTR(-EFAULT) on NULL), with AT_EMPTY_PATH both empty string and NULL are accepted. Calling conventions: getname_maybe_null(user_pointer, flags) returns * pointer to struct filename when non-empty string had been successfully read * ERR_PTR(...) on error * NULL if an empty string or NULL pointer had been given with AT_EMPTY_PATH in the flags argument. It tries to avoid allocation in the last case; it's not always able to do so, in which case the temporary struct filename instance is freed and NULL returned anyway. Fast path is inlined. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2024-10-19 20:33:34 -04:00
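The three outcomes listed in the getname_maybe_null() commit above can be modelled in userspace as follows. This is a sketch under stated assumptions: userspace has no ERR_PTR(), so the hypothetical `struct toy_name_result` carries a tag instead of an encoded pointer, and `toy_getname_maybe_null()` and `TOY_AT_EMPTY_PATH` are illustrative names rather than kernel API.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

#define TOY_AT_EMPTY_PATH 0x1000  /* hypothetical flag bit for this sketch */

enum toy_name_kind { TOY_NAME_OK, TOY_NAME_NONE, TOY_NAME_ERR };

/* Tagged result standing in for "pointer / NULL / ERR_PTR(...)". */
struct toy_name_result {
	enum toy_name_kind kind;
	int err;     /* negative errno when kind == TOY_NAME_ERR */
	char *name;  /* heap copy when kind == TOY_NAME_OK; caller frees */
};

static struct toy_name_result toy_getname_maybe_null(const char *path, int flags)
{
	struct toy_name_result r = { TOY_NAME_ERR, 0, NULL };
	int empty_ok = (flags & TOY_AT_EMPTY_PATH) != 0;

	if (path == NULL || path[0] == '\0') {
		if (empty_ok) {
			/* Empty string or NULL with the flag set: "no name". */
			r.kind = TOY_NAME_NONE;
			return r;
		}
		/* Without the flag: -ENOENT on empty string, -EFAULT on NULL. */
		r.err = path ? -ENOENT : -EFAULT;
		return r;
	}

	r.name = malloc(strlen(path) + 1);
	if (r.name == NULL) {
		r.err = -ENOMEM;
		return r;
	}
	strcpy(r.name, path);
	r.kind = TOY_NAME_OK;
	return r;
}
```

In this model the "no name" case never allocates, which mirrors the commit's stated goal of avoiding an allocation when an empty path is given with AT_EMPTY_PATH.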
Al Viro	5b313bcb6e	teach filename_lookup() to treat NULL filename as "" Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2024-10-19 20:32:39 -04:00
John Garry	c3be7ebbbc	fs/block: Check for IOCB_DIRECT in generic_atomic_write_valid() Currently FMODE_CAN_ATOMIC_WRITE is set if the bdev can atomic write and the file is open for direct IO. This does not work if the file is not opened for direct IO, yet fcntl(O_DIRECT) is used on the fd later. Change to check for direct IO on a per-IO basis in generic_atomic_write_valid(). Since we want to report -EOPNOTSUPP for non-direct IO for an atomic write, change to return an error code. Relocate the block fops atomic write checks to the common write path, as to catch non-direct IO. Fixes: `c34fc6f26a` ("fs: Initial atomic write support") Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: John Garry <john.g.garry@oracle.com> Link: https://lore.kernel.org/r/20241019125113.369994-3-john.g.garry@oracle.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-10-19 16:48:22 -06:00
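The per-I/O check described in the generic_atomic_write_valid() commit above can be sketched in userspace like this. Only the "not direct I/O gives -EOPNOTSUPP" rule comes from the commit message; the power-of-two length and alignment checks, the `struct toy_iocb` layout, and the `TOY_IOCB_DIRECT` bit are assumptions added so the example has something to validate.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_IOCB_DIRECT 0x1u  /* hypothetical flag bit for this sketch */

/* Hypothetical per-I/O control block: file position plus flags. */
struct toy_iocb {
	uint64_t pos;
	unsigned int flags;
};

static int toy_is_power_of_2(size_t n)
{
	return n != 0 && (n & (n - 1)) == 0;
}

/*
 * Returns 0 if an atomic write of len bytes may proceed, or a negative
 * errno. The direct-I/O flag is examined on every call instead of once at
 * open time, so a later change of O_DIRECT on the descriptor is seen.
 */
static int toy_atomic_write_valid(const struct toy_iocb *iocb, size_t len)
{
	if (!toy_is_power_of_2(len))          /* assumed illustrative check */
		return -EINVAL;
	if (iocb->pos % len != 0)             /* assumed illustrative check */
		return -EINVAL;
	if (!(iocb->flags & TOY_IOCB_DIRECT)) /* rule from the commit message */
		return -EOPNOTSUPP;
	return 0;
}
```

Returning an error code rather than a boolean is what allows the caller to distinguish an invalid request from an unsupported one, as the commit message notes.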
John Garry	9a8dbdadae	block/fs: Pass an iocb to generic_atomic_write_valid() Darrick and Hannes both thought it better that generic_atomic_write_valid() should be passed a struct iocb, and not just the member of that struct which is referenced; see [0] and [1]. I think that makes a more generic and clean API, so make that change. [0] https://lore.kernel.org/linux-block/680ce641-729b-4150-b875-531a98657682@suse.de/ [1] https://lore.kernel.org/linux-xfs/20240620212401.GA3058325@frogsfrogsfrogs/ Fixes: `c34fc6f26a` ("fs: Initial atomic write support") Suggested-by: Darrick J. Wong <djwong@kernel.org> Suggested-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: John Garry <john.g.garry@oracle.com> Link: https://lore.kernel.org/r/20241019125113.369994-2-john.g.garry@oracle.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-10-19 16:48:21 -06:00
Linus Torvalds	9197b73fd7	Mashed-up update that I sat on too long: - fix for multiple slabs created with the same name - enable multipage folios - theorical fix to also look for opened fids by inode if none was found by dentry -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEE/IPbcYBuWt0zoYhOq06b7GqY5nAFAmcS81AACgkQq06b7GqY 5nACpBAAtXOGRjg+dushCwUVKBlnI3oTwE2G+ywnphNZg2A0emlMOxos7x1OTiM3 Fu0b10MCUWHIXo4jD6ALVPWITJTfjiXR8s90Q/ozypcIXXhkDDShhV31b2h6Iplr YyKyjEehDFRiS7rqWC2a9mce99sOpwdQRmnssnWbjYvpJ4imFbl+50Z1I5Nc/Omu j2y02eMuikiWF/shKj0Dx1mmpZ4InSv3kvlM+V2D2YdWKNonGZe/xFZhid95LXmr Upt55R8k9qR2pn4VU22eKP6c34DIZGDlrcQdPUCNP5QuaAdGZov3TjNQdjE1bJmF E2QdxvUNfvvHqlvaRrlWa27uMgXMcy7QV3LEKwmo3tmaYVw2PDMRbFXc9zQdxy91 zqXjjGasnwzE8ca36y79vZjFTHAyY5VK/3cHCL3ai+ysu4UL3k2QgmVegREG/xKk G8Nz4UO/R6s8Wc2VqxKJdZS5NMLlADS+Aes0PG+9AxQz7iR9Ktgwrw39KDxMi+Lm PeH3Gz2rP9+EPoa3usoBQtvvvmJKM/Wb9qdPW9vTtRbRJ7bVclJoizFoLMA/TiW1 Jru+HYGBO75s8RynwEDLMiJhkjZWHfVgDjPsY6YsGVH8W2gOcJ7egQ2J2EsuurN3 tzKz4uQilV+VeDuWs8pWKrX/c3Y3KpSYV+oayg7Je7LoTlQBmU8= =VG4t -----END PGP SIGNATURE----- Merge tag '9p-for-6.12-rc4' of https://github.com/martinetd/linux Pull 9p fixes from Dominique Martinet: "Mashed-up update that I sat on too long: - fix for multiple slabs created with the same name - enable multipage folios - theorical fix to also look for opened fids by inode if none was found by dentry" [ Enabling multi-page folios should have been done during the merge window, but it's a one-liner, and the actual meat of the enablement is in netfs and already in use for other filesystems... - Linus ] * tag '9p-for-6.12-rc4' of https://github.com/martinetd/linux: 9p: Avoid creating multiple slab caches with the same name 9p: Enable multipage folios 9p: v9fs_fid_find: also lookup by inode if not found dentry	2024-10-19 08:44:10 -07:00
Christian Brauner	08ef26ea9a	fs: add file_ref As atomic_inc_not_zero() is implemented with a try_cmpxchg() loop it has O(N^2) behaviour under contention with N concurrent operations and it is in a hot path in __fget_files_rcu(). The rcuref infrastructure remedies this problem by using an unconditional increment relying on safe- and dead zones to make this work and requiring rcu protection for the data structure in question. This not only scales better, it also introduces overflow protection. However, in contrast to generic rcuref, files require a memory barrier and thus cannot rely on _relaxed() atomic operations and also need to be built on atomic_long_t, as having massive numbers of references isn't unheard of even if it is just an attack. As suggested by Linus, add a file specific variant instead of making this a generic library. Files are SLAB_TYPESAFE_BY_RCU and thus don't have "regular" rcu protection. In short, freeing of files isn't delayed until a grace period has elapsed. Instead, they are freed immediately and thus can be reused (multiple times) within the same grace period. So when picking a file from the file descriptor table via its file descriptor number it is thus possible to see an elevated reference count on file->f_count even though the file has already been recycled possibly multiple times by another task. To guard against this the vfs will pick the file from the file descriptor table twice. Once before the refcount increment and once after to compare the pointers (grossly simplified). If they match then the file is still valid. If not the caller needs to fput() it. 
The unconditional increment makes the following race possible as illustrated by rcuref: > Deconstruction race > =================== > > The release operation must be protected by prohibiting a grace period in > order to prevent a possible use after free: > > T1 T2 > put() get() > // ref->refcnt = ONEREF > if (!atomic_add_negative(-1, &ref->refcnt)) > return false; <- Not taken > > // ref->refcnt == NOREF > --> preemption > // Elevates ref->refcnt to ONEREF > if (!atomic_add_negative(1, &ref->refcnt)) > return true; <- taken > > if (put(&p->ref)) { <-- Succeeds > remove_pointer(p); > kfree_rcu(p, rcu); > } > > RCU grace period ends, object is freed > > atomic_cmpxchg(&ref->refcnt, NOREF, DEAD); <- UAF > > [...] it prevents the grace period which keeps the object alive until > all put() operations complete. Having files by SLAB_TYPESAFE_BY_RCU shouldn't cause any problems for this deconstruction race. Afaict, the only interesting case would be someone freeing the file and someone immediately recycling it within the same grace period and reinitializing file->f_count to ONEREF while a concurrent fput() is doing atomic_cmpxchg(&ref->refcnt, NOREF, DEAD) as in the race above. But this is safe from SLAB_TYPESAFE_BY_RCU's perspective and it should be safe from rcuref's perspective. T1 T2 T3 fput() fget() // f_count->refcnt = ONEREF if (!atomic_add_negative(-1, &f_count->refcnt)) return false; <- Not taken // f_count->refcnt == NOREF --> preemption // Elevates f_count->refcnt to ONEREF if (!atomic_add_negative(1, &f_count->refcnt)) return true; <- taken if (put(&f_count)) { <-- Succeeds remove_pointer(p); / * Cache is SLAB_TYPESAFE_BY_RCU * so this is freed without a grace period. / kmem_cache_free(p); } kmem_cache_alloc() init_file() { // Sets f_count->refcnt to ONEREF rcuref_long_init(&f->f_count, 1); } Object has been reused within the same grace period via kmem_cache_alloc()'s SLAB_TYPESAFE_BY_RCU. 
/ * With SLAB_TYPESAFE_BY_RCU this would be a safe UAF access and * it would work correctly because the atomic_cmpxchg() * will fail because the refcount has been reset to ONEREF by T3. */ atomic_cmpxchg(&ref->refcnt, NOREF, DEAD); <- UAF However, there are other cases to consider: (1) Benign race due to multiple atomic_long_read() CPU1 CPU2 file_ref_put() // last reference // => count goes negative/FILE_REF_NOREF atomic_long_add_negative_release(-1, &ref->refcnt) -> __file_ref_put() file_ref_get() // goes back from negative/FILE_REF_NOREF to 0 // and file_ref_get() succeeds atomic_long_add_negative(1, &ref->refcnt) // This is immediately followed by file_ref_put() // managing to set FILE_REF_DEAD file_ref_put() // __file_ref_put() continues and sees // cnt > FILE_REF_RELEASED // and splats with // "imbalanced put on file reference count" cnt = atomic_long_read(&ref->refcnt); The race however is benign and the problem is the atomic_long_read(). Instead of performing a separate read this uses atomic_long_dec_return() and pass the value to __file_ref_put(). Thanks to Linus for pointing out that braino. (2) SLAB_TYPESAFE_BY_RCU may cause recycled files to be marked dead When a file is recycled the following race exists: CPU1 CPU2 // @file is already dead and thus // cnt >= FILE_REF_RELEASED. file_ref_get(file) atomic_long_add_negative(1, &ref->refcnt) // We thus call into __file_ref_get() -> __file_ref_get() // which sees cnt >= FILE_REF_RELEASED cnt = atomic_long_read(&ref->refcnt); // In the meantime @file gets freed kmem_cache_free() // and is immediately recycled file = kmem_cache_zalloc() // and the reference count is reinitialized // and the file alive again in someone // else's file descriptor table file_ref_init(&ref->refcnt, 1); // the __file_ref_get() slowpath now continues // and as it saw earlier that cnt >= FILE_REF_RELEASED // it wants to ensure that we're staying in the middle // of the deadzone and unconditionally sets // FILE_REF_DEAD. 
// This marks @file dead for CPU2... atomic_long_set(&ref->refcnt, FILE_REF_DEAD); // Caller issues a close() system call to close @file close(fd) file = file_close_fd_locked() filp_flush() // The caller sees that cnt >= FILE_REF_RELEASED // and warns the first time... CHECK_DATA_CORRUPTION(file_count(file) == 0) // and then splats a second time because // __file_ref_put() sees cnt >= FILE_REF_RELEASED file_ref_put(&ref->refcnt); -> __file_ref_put() My initial inclination was to replace the unconditional atomic_long_set() with an atomic_long_try_cmpxchg() but Linus pointed out that: > I think we should just make file_ref_get() do a simple > > return !atomic_long_add_negative(1, &ref->refcnt)); > > and nothing else. Yes, multiple CPU's can race, and you can increment > more than once, but the gap - even on 32-bit - between DEAD and > becoming close to REF_RELEASED is so big that we simply don't care. > That's the point of having a gap. I've been testing this with will-it-scale using fstat() on a machine that Jens gave me access (thank you very much!): processor : 511 vendor_id : AuthenticAMD cpu family : 25 model : 160 model name : AMD EPYC 9754 128-Core Processor and I consistently get a 3-5% improvement on 256+ threads. Reported-by: kernel test robot <oliver.sang@intel.com> Closes: https://lore.kernel.org/oe-lkp/202410151043.5d224a27-oliver.sang@intel.com Closes: https://lore.kernel.org/all/202410151611.f4cd71f2-oliver.sang@intel.com Link: https://lore.kernel.org/r/20241007-brauner-file-rcuref-v2-2-387e24dc9163@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-10-19 14:16:45 +02:00
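The unconditional-increment scheme discussed in the file_ref commit above can be sketched with C11 atomics. This is a single-file model under stated assumptions, not the kernel implementation: the counter is stored biased by one (0 means one reference, -1 means none), `TOY_REF_DEAD` is an arbitrary value deep in the negative range, the names are hypothetical, and the warning and saturation handling of the real slow paths is omitted.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

#define TOY_REF_NOREF (-1L)          /* last reference just dropped */
#define TOY_REF_DEAD  (LONG_MIN / 4) /* hypothetical middle of the dead zone */

struct toy_ref {
	atomic_long cnt; /* biased: 0 == one reference held */
};

static void toy_ref_init(struct toy_ref *r, long refs)
{
	atomic_init(&r->cnt, refs - 1);
}

/*
 * Unconditional increment: no compare-and-swap loop. The get succeeded
 * only if the counter is still non-negative after the increment.
 */
static bool toy_ref_get(struct toy_ref *r)
{
	return atomic_fetch_add(&r->cnt, 1) + 1 >= 0;
}

/*
 * Returns true when the caller dropped the last reference and won the
 * race to mark the object dead, so it is the one that must free it.
 */
static bool toy_ref_put(struct toy_ref *r)
{
	long newv = atomic_fetch_sub(&r->cnt, 1) - 1;
	long expected = TOY_REF_NOREF;

	if (newv >= 0)
		return false;
	/* Fails if a concurrent get resurrected the counter in between. */
	return atomic_compare_exchange_strong(&r->cnt, &expected, TOY_REF_DEAD);
}
```

The large gap between `TOY_REF_DEAD` and zero is what makes stray increments on a dead object harmless in this model: a failed get leaves the counter negative, so later gets keep failing.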
Matthew Wilcox (Oracle)	b57c010e70	ufs: Convert ufs_change_blocknr() to take a folio Now that ufs_new_fragments() has a folio, pass it to ufs_change_blocknr() as a folio instead of converting it from folio to page to folio. This removes the last use of struct page in UFS. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2024-10-18 17:35:31 -04:00
Matthew Wilcox (Oracle)	14bcb7bb68	ufs: Pass a folio to ufs_new_fragments() All callers now have a folio, pass it to ufs_new_fragments() instead of converting back to a page. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2024-10-18 17:35:31 -04:00
Matthew Wilcox (Oracle)	24a87a0adb	ufs: Convert ufs_inode_getfrag() to take a folio Pass bh->b_folio instead of bh->b_page. They're in a union, so no code change expected. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2024-10-18 17:35:31 -04:00
Matthew Wilcox (Oracle)	b6250a013d	ufs: Convert ufs_extend_tail() to take a folio Pass bh->b_folio instead of bh->b_page. They're in a union, so no code change expected. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2024-10-18 17:35:31 -04:00
Matthew Wilcox (Oracle)	d9036c488c	ufs: Convert ufs_inode_getblock() to take a folio Pass bh->b_folio instead of bh->b_page. They're in a union, so no code change expected. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2024-10-18 17:35:31 -04:00
Al Viro	6b103cc0ba	ufs: take the handling of free block counters into a helper There are 3 places where those counters (many and varied...) are adjusted - when we are freeing fragments and get an entire block freed, when we are freeing blocks and (in opposite direction) when we are grabbing a block. The logic is identical (modulo the sign of adjustment) in all three; better take it into a helper - less duplication and less clutter in the callers that way. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2024-10-18 17:35:31 -04:00
Al Viro	64f30e80d6	clean ufs_trunc_direct() up a bit... For short files (== no indirect blocks needed) UFS allows the last block to be a partial one. That creates some complications for truncation down to "short file" lengths. ufs_trunc_direct() is called when we'd already made sure that new EOF is not in a hole; nothing needs to be done if we are extending the file and in case we are shrinking the file it needs to * shrink or free the old final block. * free all full direct blocks between the new and old EOF. * possibly shrink the new final block. The logics is needlessly complicated by trying to keep all cases handled by the same sequence of operations. if not shrinking nothing to do else if number of full blocks unchanged free the tail of possibly partial last block else free the tail of (currently full) new last block free all present (full) blocks in between free the (possibly partial) old last block is easier to follow than the result of trying to unify these cases. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2024-10-18 17:35:31 -04:00
Al Viro	db57044217	ufs: get rid of ubh_{ubhcpymem,memcpyubh}() used only in ufs_read_cylinder_structures()/ufs_put_super_internal() and there we can just as well avoid bothering with ufs_buffer_head and just deal with it fragment-by-fragment. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2024-10-18 17:35:31 -04:00
Al Viro	ae79ce9d06	ufs_inode_getfrag(): remove junk comment It used to be a stubbed out beginning of ufs2 support, which had been implemented differently quite a while ago. Remove the commented-out (pseudo-)code. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2024-10-18 17:35:31 -04:00
Al Viro	426f07ad3e	ufs_free_fragments(): fix the braino in sanity check The function expects that all fragments it's been asked to free will be within the same block. And it even has a sanity check verifying that - it takes the fragment number modulo the number of fragments per block, adds the count and checks if that's too high. Unfortunately, it misspells the upper limit - instead of ->s_fpb (fragments per block) it says ->s_fpg (fragments per cylinder group). So "too high" ends up being insanely lenient. Had been that way since 2.1.112, when UFS write support had been added. 27 years to spot a typo... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2024-10-18 17:35:31 -04:00
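The sanity check described in the ufs_free_fragments() commit above is a one-line predicate. The sketch below models it in userspace; `toy_frags_in_one_block()` is a hypothetical name, and `fpb` stands for the fragments-per-block value the commit calls ->s_fpb.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * True if the run of count fragments starting at fragment stays within a
 * single block of fpb fragments: take the fragment number modulo the
 * number of fragments per block, add the count, and compare to the limit.
 */
static bool toy_frags_in_one_block(uint64_t fragment, unsigned int count,
				   unsigned int fpb)
{
	return (fragment % fpb) + count <= fpb;
}
```

Passing a much larger limit in place of `fpb`, as happens when fragments per cylinder group is used by mistake, makes the predicate accept runs that cross block boundaries, which is the leniency the commit describes.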
Al Viro	c5df105f7d	ufs_clusteracct(): switch to passing fragment number Currently all callers pass it a block number. All of them have it derived from a fragment number (both fragment and block numbers are within a cylinder group, and thus 32bit). Pass it the fragment number instead; none of the callers has other uses for the block number, so that ends up with cleaner code. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2024-10-18 17:35:31 -04:00
Al Viro	dce3e8d33a	ufs: untangle ubh_...block...(), part 3 Pass fragment number instead of a block one. It's available in all callers and it makes the logic inside those helpers much simpler. The bitmap they operate upon is with bit per fragment, block being an aligned group of 1, 2, 4 or 8 adjacent fragments. We still need a switch by the number of fragments in block (== number of bits to check/set/clear), but finding the byte we need to work with becomes uniform and that makes the things easier to follow. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2024-10-18 17:35:31 -04:00
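The bitmap layout described in the commit above (one bit per fragment, a block being an aligned group of 1, 2, 4 or 8 fragments) can be sketched as follows. This is an illustration, not the kernel helpers: the function names are hypothetical, and the mask is computed arithmetically here, whereas the commit says the kernel code keeps a switch on the number of fragments per block.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Mask covering the block that contains fragment frag, within the byte
 * at index frag >> 3. fpb must be 1, 2, 4 or 8.
 */
static uint8_t toy_block_mask(unsigned int frag, unsigned int fpb)
{
	unsigned int shift;

	assert(fpb == 1 || fpb == 2 || fpb == 4 || fpb == 8);
	/* Round the bit position down to the start of its block. */
	shift = (frag & 7u) & ~(fpb - 1u);
	return (uint8_t)(((1u << fpb) - 1u) << shift);
}

/* Set every bit of the block containing frag. */
static void toy_set_block(uint8_t *bitmap, unsigned int frag, unsigned int fpb)
{
	bitmap[frag >> 3] |= toy_block_mask(frag, fpb);
}

/* True if every bit of the block containing frag is set. */
static bool toy_is_block_set(const uint8_t *bitmap, unsigned int frag,
			     unsigned int fpb)
{
	uint8_t mask = toy_block_mask(frag, fpb);

	return (bitmap[frag >> 3] & mask) == mask;
}
```

With a fragment number as input, the byte index is always `frag >> 3` regardless of block size, which is the uniformity the commit message points out.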
Al Viro	8bec0618a4	ufs: untangle ubh_...block...(), part 2 pass cylinder group descriptor instead of its buffer head (ubh, always UCPI_UBH(ucpi)) and its ->c_freeoff. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2024-10-18 17:35:31 -04:00
Al Viro	65136e46a0	ufs: untangle ubh_...block...() macros, part 1 passing implicit argument to a macro by having it in a variable with special name is Not Nice(tm); just pass it explicitly. kill an unused macro, while we are at it... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2024-10-18 17:35:31 -04:00
Al Viro	0bfd3e1078	ufs: fix ufs_read_cylinder() failure handling 1) ufs_load_cylinder() should return NULL on ufs_read_cylinder() failures. ufs_error() is not enough. As it is, IO failure on attempt to read a part of cylinder group metadata is likely to end up with an oops. 2) we drop the wrong buffer heads when undoing sb_bread() on IO failure halfway through the read - we need to brelse() what we've got from sb_bread(), TYVM... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2024-10-18 17:35:31 -04:00
Al Viro	7f71d6e346	ufs: missing ->splice_write() normal ->write_iter()-based ->splice_write() works here just fine... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2024-10-18 17:35:31 -04:00
Al Viro	6a1c4c4688	ufs: fix handling of delete_entry and set_link failures similar to minixfs series - make ufs_set_link() report failures, lift folio_release_kmap() into the callers of ufs_set_link() and ufs_delete_entry(), make ufs_rename() handle failures in both. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2024-10-18 17:35:30 -04:00
Olga Kornievskaia	8dd91e8d31	nfsd: fix race between laundromat and free_stateid There is a race between laundromat handling of revoked delegations and a client sending a free_stateid operation. The laundromat thread finds that a delegation has expired and needs to be revoked, so it marks the delegation stid revoked and puts it on a reaper list, but then it unlocks the state lock and the actual delegation revocation happens without the lock. Once the stid is marked revoked, a racing free_stateid processing thread does the following: (1) it calls list_del_init(), which removes it from the reaper list, and (2) it frees the delegation stid structure. The laundromat thread ends up not calling the revoke_delegation() function for this particular delegation, which means it will not release the lock lease that exists on the file. Now a new open for this file comes in, finds that the lease list isn't empty, and calls nfsd_breaker_owns_lease(), which ends up trying to dereference a freed delegation stateid, leading to the following use-after-free KASAN warning: kernel: ================================================================== kernel: BUG: KASAN: slab-use-after-free in nfsd_breaker_owns_lease+0x140/0x160 [nfsd] kernel: Read of size 8 at addr ffff0000e73cd0c8 by task nfsd/6205 kernel: kernel: CPU: 2 UID: 0 PID: 6205 Comm: nfsd Kdump: loaded Not tainted 6.11.0-rc7+ #9 kernel: Hardware name: Apple Inc.
Apple Virtualization Generic Platform, BIOS 2069.0.0.0.0 08/03/2024 kernel: Call trace: kernel: dump_backtrace+0x98/0x120 kernel: show_stack+0x1c/0x30 kernel: dump_stack_lvl+0x80/0xe8 kernel: print_address_description.constprop.0+0x84/0x390 kernel: print_report+0xa4/0x268 kernel: kasan_report+0xb4/0xf8 kernel: __asan_report_load8_noabort+0x1c/0x28 kernel: nfsd_breaker_owns_lease+0x140/0x160 [nfsd] kernel: nfsd_file_do_acquire+0xb3c/0x11d0 [nfsd] kernel: nfsd_file_acquire_opened+0x84/0x110 [nfsd] kernel: nfs4_get_vfs_file+0x634/0x958 [nfsd] kernel: nfsd4_process_open2+0xa40/0x1a40 [nfsd] kernel: nfsd4_open+0xa08/0xe80 [nfsd] kernel: nfsd4_proc_compound+0xb8c/0x2130 [nfsd] kernel: nfsd_dispatch+0x22c/0x718 [nfsd] kernel: svc_process_common+0x8e8/0x1960 [sunrpc] kernel: svc_process+0x3d4/0x7e0 [sunrpc] kernel: svc_handle_xprt+0x828/0xe10 [sunrpc] kernel: svc_recv+0x2cc/0x6a8 [sunrpc] kernel: nfsd+0x270/0x400 [nfsd] kernel: kthread+0x288/0x310 kernel: ret_from_fork+0x10/0x20 This patch proposes a fix based on adding two new sc_status values for the stid that help coordinate between the laundromat and other operations (nfsd4_free_stateid() and nfsd4_delegreturn()). First, to make sure that once the stid is marked revoked it is not removed by nfsd4_free_stateid(), the laundromat takes a reference on the stateid. Then, to coordinate whether the stid has been put on the cl_revoked list or we are processing FREE_STATEID and need to make sure to remove it from the list, each side checks that state and acts accordingly. If the laundromat has added it to the cl_revoked list before the arrival of FREE_STATEID, then nfsd4_free_stateid() knows to remove it from the list. If nfsd4_free_stateid() finds that the operation arrived before the laundromat placed it on the cl_revoked list, it marks the state freed and the laundromat will no longer add it to the list.
Also, for nfsd4_delegreturn(), when looking for the specified stid, we need to access stids that are marked removed or freeable; this means the laundromat has started processing the stid but hasn't finished, and this delegreturn needs to return nfserr_deleg_revoked and not nfserr_bad_stateid. The latter will not trigger a FREE_STATEID, and the lack of it will leave this stid on the cl_revoked list indefinitely. Fixes: `2d4a532d38` ("nfsd: ensure that clp->cl_revoked list is protected by clp->cl_lock") CC: stable@vger.kernel.org Signed-off-by: Olga Kornievskaia <okorniev@redhat.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2024-10-18 16:40:37 -04:00
Linus Torvalds	b04ae0f451	two fixes for stable, and two small cleanup fixes -----BEGIN PGP SIGNATURE----- iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmcSdmYACgkQiiy9cAdy T1EnnAwAoNbY+odLB9atHIuaBftpyINrhzRrzpwTfYNtPKUPGxxGk2fiP29YqMLb OF4jnC87E3P/xhydoZHXXe3kKBQFVMAkJZKHiZBvJd+brk/EadfQnNmIio1pwOGh zFNxSujFtsM/1HU/ZoI2kaHzrqj5KxWKWFytZ6umd8C3NyKK9Lo/lcqUBKv8MpJy XXkMBh+7HGKRfDQlU+n6NQ5+dqFL5xDjTXlm9dM8LXuInKy5oKTGnRhLA7OA8lt7 EenFo8joy0IpXUByHt+ksQ8P88NCnU2h9kGp1UrGrBPh90+MokRr9GAcH8twK8jt /bpL4yzAwuk1TAg+L9mSLT2OtWYsDpsQZmsBMbxBZGr2qmtjwgbxSgjf6DNiJZgn jz15nFsuEsU5AbX4EAE67fwRWAo9AmQFyOOcYgkiIWOFHaRU6D/2NzCxCDZ+mfpy Z5f7dF/sA158iY4wmB5BrQpFamxzpLADz6Qy4NA9hXjEKsbyFAuf22EjE64ruxZ4 8nMB3buh =peum -----END PGP SIGNATURE----- Merge tag 'v6.12-rc3-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6 Pull smb client fixes from Steve French: - Fix possible double free setting xattrs - Fix slab out of bounds with large ioctl payload - Remove three unused functions, and an unused variable that could be confusing * tag 'v6.12-rc3-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6: cifs: Remove unused functions smb/client: Fix logically dead code smb: client: fix OOBs when building SMB2_IOCTL request smb: client: fix possible double free in smb2_set_ea()	2024-10-18 11:37:12 -07:00
Linus Torvalds	568570fdf2	XFS Bug fixes for 6.12-rc4 * Fix integer overflow in xrep_bmap * Fix stale dealloc punching for COW IO Signed-off-by: Carlos Maiolino <cem@kernel.org> -----BEGIN PGP SIGNATURE----- iJUEABMJAB0WIQQMHYkcUKcy4GgPe2RGdaER5QtfpgUCZw5LIwAKCRBGdaER5Qtf puRlAYDezbvs1dDSkKIGOt3inGdLptNAu4qniXBUkbYI9BzmtIVDueWP4Wo0dV3d gu3xrWQBfjFXdmEuBlwLuAFrp07AN18BVMj+DWCiEShsPHSoSPcF/IrDiz4BHvGv MKYq9CywFw== =Gj9b -----END PGP SIGNATURE----- Merge tag 'xfs-6.12-fixes-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux Pull xfs fixes from Carlos Maiolino: - Fix integer overflow in xrep_bmap - Fix stale dealloc punching for COW IO * tag 'xfs-6.12-fixes-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: xfs: punch delalloc extents from the COW fork for COW writes xfs: set IOMAP_F_SHARED for all COW fork allocations xfs: share more code in xfs_buffered_write_iomap_begin xfs: support the COW fork in xfs_bmap_punch_delalloc_range xfs: IOMAP_ZERO and IOMAP_UNSHARE already hold invalidate_lock xfs: take XFS_MMAPLOCK_EXCL xfs_file_write_zero_eof xfs: factor out a xfs_file_write_zero_eof helper iomap: move locking out of iomap_write_delalloc_release iomap: remove iomap_file_buffered_write_punch_delalloc iomap: factor out a iomap_last_written_block helper xfs: fix integer overflow in xrep_bmap	2024-10-18 11:28:39 -07:00
Thorsten Blum	197231da7f	proc: Fix W=1 build kernel-doc warning Building the kernel with W=1 generates the following warning: fs/proc/fd.c:81: warning: This comment starts with '/**', but isn't a kernel-doc comment. Use a normal comment for the helper function proc_fdinfo_permission(). Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Link: https://lore.kernel.org/r/20241018102705.92237-2-thorsten.blum@linux.dev Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-10-18 13:02:47 +02:00
Kent Overstreet	bc6d2d1041	bcachefs: fsck: Improve hash_check_key() hash_check_key() checks and repairs the hash table btrees: dirents and xattrs are open addressing hash tables. We recently had a corruption reported where the hash type on an inode somehow got flipped, which made the existing dirents invisible and allowed new ones to be created with the same name. Now, hash_check_key() can repair duplicates: it will delete one of them, if it has an xattr or dangling dirent, but if it has two valid dirents one of them gets renamed. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-18 00:49:48 -04:00
Kent Overstreet	dc96656b20	bcachefs: bch2_hash_set_or_get_in_snapshot() Add a variant of bch2_hash_set_in_snapshot() that returns the existing key on -EEXIST. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-18 00:49:48 -04:00
Kent Overstreet	15a3836c8e	bcachefs: Repair mismatches in inode hash seed, type Different versions of the same inode (same inode number, different snapshot ID) must have the same hash seed and type - lookups require this, since they see keys from different snapshots simultaneously. To repair we only need to make the inodes consistent, hash_check_key() will do the rest. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-18 00:49:48 -04:00
Kent Overstreet	d8e879377f	bcachefs: Add hash seed, type to inode_to_text() This helped with discovering some filesystem corruption fsck was having trouble with: the str_hash type had gotten flipped on one snapshot's version of an inode. All versions of a given inode number have the same hash seed and hash type, since lookups will be done with a single hash/seed and type and see dirents/xattrs from multiple snapshots. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-18 00:49:48 -04:00
Kent Overstreet	78cf0ae636	bcachefs: INODE_STR_HASH() for bch_inode_unpacked Trivial cleanup - add a normal BITMASK() helper for bch_inode_unpacked. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-18 00:49:48 -04:00
Kent Overstreet	b96f8cd387	bcachefs: Run in-kernel offline fsck without ratelimit errors Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-18 00:49:48 -04:00
Hongbo Li	489ecc4cfd	bcachefs: skip mount option handle for empty string. Option parsing in get_tree splits the options buffer, and strsep() yields an empty string for the last token. After commit ea0eeb89b1d5 ("bcachefs: reject unknown mount options") was merged, unknown mount options are not allowed (here, the empty string), and this causes errors. This can be reproduced with the following steps: bcachefs format /dev/loop mount -t bcachefs -o metadata_target=loop1 /dev/loop1 /mnt/bcachefs/ Fixes: ea0eeb89b1d5 ("bcachefs: reject unknown mount options") Signed-off-by: Hongbo Li <lihongbo22@huawei.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-18 00:49:48 -04:00
Hongbo Li	07cf8bac2d	bcachefs: fix incorrect show_options results When show_options is called in bcachefs, the options buffer is appended to the seq variable. In fact, it requires an additional comma to be appended first. This will affect the remount process when reading existing mount options. Fixes: 9305cf91d05e ("bcachefs: bch2_opts_to_text()") Signed-off-by: Hongbo Li <lihongbo22@huawei.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-18 00:49:48 -04:00
Kent Overstreet	97535cd84f	bcachefs: Fix data corruption on -ENOSPC in buffered write path Found by generic/299: When we have to truncate a write due to -ENOSPC, we may have to read in the folio we're writing to if we're now no longer doing a complete write to a !uptodate folio. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-18 00:49:48 -04:00
Kent Overstreet	335d318ef5	bcachefs: bch2_folio_reservation_get_partial() is now better behaved bch2_folio_reservation_get_partial(), on partial success, will now return a reservation that's aligned to the filesystem blocksize. This is a partial fix for fstests generic/299 - fio verify is badly behaved in the presence of short writes that aren't aligned to its blocksize. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-18 00:49:48 -04:00
Kent Overstreet	81e0b6c7c1	bcachefs: fix disk reservation accounting in bch2_folio_reservation_get() bch2_disk_reservation_put() zeroes out the reservation - oops. This fixes a disk reservation leak when getting a quota reservation returned an error. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-18 00:49:48 -04:00
Kent Overstreet	4007bbb203	bcachefs: ec: fix data type on stripe deletion Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-18 00:49:48 -04:00
Kent Overstreet	a0d11feefb	bcachefs: Don't use commit_do() unnecessarily Using commit_do() to call alloc_sectors_start_trans() breaks when we're randomly injecting transaction restarts - the restart in the commit causes us to leak the lock that alloc_sectors_start_trans() takes. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-18 00:49:48 -04:00
Kent Overstreet	6bee2a04c5	bcachefs: handle restarts in bch2_bucket_io_time_reset() bch2_bucket_io_time_reset() doesn't need to succeed, which is why it didn't previously retry on transaction restart - but we're now treating these as errors. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-18 00:49:48 -04:00
Kent Overstreet	29fd10a36a	bcachefs: fix restart handling in __bch2_resume_logged_op_finsert() Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-18 00:49:48 -04:00
Kent Overstreet	d8b5059774	bcachefs: fix restart handling in bch2_alloc_write_key() This is ugly: We may discover in alloc_write_key that the data type we calculated is wrong, because BCH_DATA_need_discard is checked/set elsewhere, and the disk accounting counters we calculated need to be updated. But bch2_alloc_key_to_dev_counters(..., BTREE_TRIGGER_gc) is not safe w.r.t. transaction restarts, so we need to propagate the fixup back to our gc state in case we take a transaction restart. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-18 00:49:47 -04:00
Kent Overstreet	7ee4be9c62	bcachefs: fix restart handling in bch2_do_invalidates_work() this one is fairly harmless since the invalidate worker will just run again later if it needs to, but still worth fixing Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-18 00:49:47 -04:00
Kent Overstreet	028f3c1d9b	bcachefs: fix missing restart handling in bch2_read_retry_nodecode() Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-18 00:49:47 -04:00
Kent Overstreet	e1c4d2f082	bcachefs: fix restart handling in bch2_fiemap() We were leaking transaction restart errors to userspace. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-18 00:49:47 -04:00
Kent Overstreet	94bdeec8f5	bcachefs: fix bch2_hash_delete() error path we were exiting an iterator that hadn't been initialized Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-18 00:49:47 -04:00
Kent Overstreet	74ec2f3024	bcachefs: fix restart handling in bch2_rename2() This should be impossible to hit in practice; the first lookup within a transaction won't return a restart due to lock ordering, but we're adding fault injection for transaction restarts and shaking out bugs. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-18 00:49:47 -04:00
Linus Torvalds	4d939780b7	28 hotfixes. 13 are cc:stable. 23 are MM. It is the usual shower of unrelated singletons - please see the individual changelogs for details. -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZxGY5wAKCRDdBJ7gKXxA js6RAQC16zQ7WRV091i79cEi1C5648NbZjMCU626hZjuyfbzKgEA2v8PYtjj9w2e UGLxMY+PYZki2XNEh75Sikdkiyl9Vgg= =xcWT -----END PGP SIGNATURE----- Merge tag 'mm-hotfixes-stable-2024-10-17-16-08' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "28 hotfixes. 13 are cc:stable. 23 are MM. It is the usual shower of unrelated singletons - please see the individual changelogs for details" * tag 'mm-hotfixes-stable-2024-10-17-16-08' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (28 commits) maple_tree: add regression test for spanning store bug maple_tree: correct tree corruption on spanning store mm/mglru: only clear kswapd_failures if reclaimable mm/swapfile: skip HugeTLB pages for unuse_vma selftests: mm: fix the incorrect usage() info of khugepaged MAINTAINERS: add Jann as memory mapping/VMA reviewer mm: swap: prevent possible data-race in __try_to_reclaim_swap mm: khugepaged: fix the incorrect statistics when collapsing large file folios MAINTAINERS: kasan, kcov: add bugzilla links mm: don't install PMD mappings when THPs are disabled by the hw/process/vma mm: huge_memory: add vma_thp_disabled() and thp_disabled_by_hw() Docs/damon/maintainer-profile: update deprecated awslabs GitHub URLs Docs/damon/maintainer-profile: add missing '_' suffixes for external web links maple_tree: check for MA_STATE_BULK on setting wr_rebalance mm: khugepaged: fix the arguments order in khugepaged_collapse_file trace point mm/damon/tests/sysfs-kunit.h: fix memory leak in damon_sysfs_test_add_targets() mm: remove unused stub for can_swapin_thp() mailmap: add an entry for Andy Chiu MAINTAINERS: add memory mapping/VMA co-maintainers fs/proc: fix build with GCC 15 due to 
-Werror=unterminated-string-initialization ...	2024-10-17 16:33:06 -07:00
Mark Brown	4e6e8c2b75	binfmt_elf: Wire up AT_HWCAP3 at AT_HWCAP4 AT_HWCAP3 and AT_HWCAP4 were recently defined for use on PowerPC in commit `3281366a8e` ("uapi/auxvec: Define AT_HWCAP3 and AT_HWCAP4 aux vector, entries"). Since we want to start using AT_HWCAP3 on arm64 add support for exposing both these new hwcaps via binfmt_elf. Signed-off-by: Mark Brown <broonie@kernel.org> Acked-by: Kees Cook <kees@kernel.org> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Link: https://lore.kernel.org/r/20241004-arm64-elf-hwcap3-v2-1-799d1daad8b0@kernel.org Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>	2024-10-17 18:38:49 +01:00
Naohiro Aota	bf9821ba47	btrfs: zoned: fix zone unusable accounting for freed reserved extent When btrfs reserves an extent and does not use it (e.g, by an error), it calls btrfs_free_reserved_extent() to free the reserved extent. In the process, it calls btrfs_add_free_space() and then it accounts the region bytes as block_group->zone_unusable. However, it leaves the space_info->bytes_zone_unusable side not updated. As a result, ENOSPC can happen while a space_info reservation succeeded. The reservation is fine because the freed region is not added in space_info->bytes_zone_unusable, leaving that space as "free". OTOH, corresponding block group counts it as zone_unusable and its allocation pointer is not rewound, we cannot allocate an extent from that block group. That will also negate space_info's async/sync reclaim process, and cause an ENOSPC error from the extent allocation process. Fix that by returning the space to space_info->bytes_zone_unusable. Ideally, since a bio is not submitted for this reserved region, we should return the space to free space and rewind the allocation pointer. But, it needs rework on extent allocation handling, so let it work in this way for now. Fixes: `169e0da91a` ("btrfs: zoned: track unusable bytes for zones") CC: stable@vger.kernel.org # 5.15+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-10-17 16:16:46 +02:00
David Howells	610a79ffea	afs: Fix lock recursion afs_wake_up_async_call() can incur lock recursion. The problem is that it is called from AF_RXRPC whilst holding the ->notify_lock, but it tries to take a ref on the afs_call struct in order to pass it to a work queue - but if the afs_call is already queued, we then have an extraneous ref that must be put... calling afs_put_call() may call back down into AF_RXRPC through rxrpc_kernel_shutdown_call(), however, which might try taking the ->notify_lock again. This case isn't very common, however, so defer it to a workqueue. The oops looks something like: BUG: spinlock recursion on CPU#0, krxrpcio/7001/1646 lock: 0xffff888141399b30, .magic: dead4ead, .owner: krxrpcio/7001/1646, .owner_cpu: 0 CPU: 0 UID: 0 PID: 1646 Comm: krxrpcio/7001 Not tainted 6.12.0-rc2-build3+ #4351 Hardware name: ASUS All Series/H97-PLUS, BIOS 2306 10/09/2014 Call Trace: <TASK> dump_stack_lvl+0x47/0x70 do_raw_spin_lock+0x3c/0x90 rxrpc_kernel_shutdown_call+0x83/0xb0 afs_put_call+0xd7/0x180 rxrpc_notify_socket+0xa0/0x190 rxrpc_input_split_jumbo+0x198/0x1d0 rxrpc_input_data+0x14b/0x1e0 ? rxrpc_input_call_packet+0xc2/0x1f0 rxrpc_input_call_event+0xad/0x6b0 rxrpc_input_packet_on_conn+0x1e1/0x210 rxrpc_input_packet+0x3f2/0x4d0 rxrpc_io_thread+0x243/0x410 ? __pfx_rxrpc_io_thread+0x10/0x10 kthread+0xcf/0xe0 ? __pfx_kthread+0x10/0x10 ret_from_fork+0x24/0x40 ? __pfx_kthread+0x10/0x10 ret_from_fork_asm+0x1a/0x30 </TASK> Signed-off-by: David Howells <dhowells@redhat.com> Link: https://lore.kernel.org/r/1394602.1729162732@warthog.procyon.org.uk cc: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org cc: linux-fsdevel@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-10-17 15:33:46 +02:00
Alessandro Zanni	15f3434748	fs: Fix uninitialized value issue in from_kuid and from_kgid ocfs2_setattr() uses attr->ia_mode, attr->ia_uid and attr->ia_gid in a trace point even though ATTR_MODE, ATTR_UID and ATTR_GID aren't set. Initialize all fields of newattrs to avoid uninitialized variables, by checking if ATTR_MODE, ATTR_UID, ATTR_GID are initialized, otherwise 0. Reported-by: syzbot+6c55f725d1bdc8c52058@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=6c55f725d1bdc8c52058 Signed-off-by: Alessandro Zanni <alessandro.zanni87@gmail.com> Link: https://lore.kernel.org/r/20241017120553.55331-1-alessandro.zanni87@gmail.com Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-10-17 15:33:43 +02:00
Christian Brauner	229fd15908	fs: don't try and remove empty rbtree node When copying a namespace we won't have added the new copy into the namespace rbtree until after the copy succeeded. Calling free_mnt_ns() will try to remove the copy from the rbtree which is invalid. Simply free the namespace skeleton directly. Link: https://lore.kernel.org/r/20241016-adapter-seilwinde-83c508a7bde1@brauner Fixes: `1901c92497` ("fs: keep an index of current mount namespaces") Tested-by: Brad Spengler <spender@grsecurity.net> Cc: stable@vger.kernel.org # v6.11+ Reported-by: Brad Spengler <spender@grsecurity.net> Suggested-by: Brad Spengler <spender@grsecurity.net> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-10-17 15:33:43 +02:00
David Howells	d6a77668a7	netfs: Downgrade i_rwsem for a buffered write In the I/O locking code borrowed from NFS into netfslib, i_rwsem is held locked across a buffered write - but this causes a performance regression in cifs as it excludes buffered reads for the duration (cifs didn't use any locking for buffered reads). Mitigate this somewhat by downgrading the i_rwsem to a read lock across the buffered write. This at least allows parallel reads to occur whilst excluding other writes, DIO, truncate and setattr. Note that this shouldn't be a problem for a buffered write as a read through an mmap can circumvent i_rwsem anyway. Also note that we might want to make this change in NFS also. Signed-off-by: David Howells <dhowells@redhat.com> Link: https://lore.kernel.org/r/1317958.1729096113@warthog.procyon.org.uk cc: Steve French <sfrench@samba.org> cc: Paulo Alcantara <pc@manguebit.com> cc: Trond Myklebust <trondmy@kernel.org> cc: Jeff Layton <jlayton@kernel.org> cc: netfs@lists.linux.dev cc: linux-cifs@vger.kernel.org cc: linux-nfs@vger.kernel.org cc: linux-fsdevel@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-10-17 15:33:42 +02:00
Brahmajit Das	5778ace04e	fs/proc: fix build with GCC 15 due to -Werror=unterminated-string-initialization show_smap_vma_flags() has been using a misspelled initializer in mnemonics[] - it needed to initialize a 2-element array of char and it used NUL-padded 2-character string literals (i.e. a 3-element initializer). This has been spotted by gcc-15[]; prior to that gcc quietly dropped the 3rd element of initializers. To fix this we are increasing the size of mnemonics[] (from mnemonics[BITS_PER_LONG][2] to mnemonics[BITS_PER_LONG][3]) to accommodate the NUL-padded string literals. This also helps us simplify the logic for printing the flags: instead of printing each character from mnemonics[], we can just print mnemonics[] using seq_printf. []: fs/proc/task_mmu.c:917:49: error: initializer-string for array of `char' is too long [-Werror=unterminated-string-initialization] 917 | [0 ... (BITS_PER_LONG-1)] = "??", | ^~~~ ... Stephen pointed out: : The C standard explicitly allows for a string initializer to be too long : due to the NUL byte at the end ... so this warning may be overzealous. but let's make the warning go away anyway.
Link: https://lkml.kernel.org/r/20241005063700.2241027-1-brahmajit.xyz@gmail.com Link: https://lkml.kernel.org/r/20241003093040.47c08382@canb.auug.org.au Signed-off-by: Brahmajit Das <brahmajit.xyz@gmail.com> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: David Hildenbrand <david@redhat.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-10-17 00:28:07 -07:00
OGAWA Hirofumi	963a7f4d3b	fat: fix uninitialized variable syzbot produced this with a corrupted fs image. In theory, however, an IO error would also trigger this. This affects just an error report, so should not be a serious error. Link: https://lkml.kernel.org/r/87r08wjsnh.fsf@mail.parknet.co.jp Link: https://lkml.kernel.org/r/66ff2c95.050a0220.49194.03e9.GAE@google.com Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Reported-by: syzbot+ef0d7bc412553291aa86@syzkaller.appspotmail.com Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-10-17 00:28:06 -07:00
Ryusuke Konishi	08cfa12adf	nilfs2: propagate directory read errors from nilfs_find_entry() Syzbot reported that a task hang occurs in vcs_open() during a fuzzing test for nilfs2. The root cause of this problem is that nilfs_find_entry(), which searches for directory entries, ignores errors when loading a directory page/folio via nilfs_get_folio() fails. If the filesystem image is corrupted, and the i_size of the directory inode is large, and the directory page/folio is successfully read but fails the sanity check, for example when it is zero-filled, nilfs_check_folio() may continue to spit out error messages in bursts. Fix this issue by propagating the error to the callers when loading a page/folio fails in nilfs_find_entry(). The current interface of nilfs_find_entry() and its callers is outdated and cannot propagate error codes such as -EIO and -ENOMEM returned via nilfs_find_entry(), so fix it together. Link: https://lkml.kernel.org/r/20241004033640.6841-1-konishi.ryusuke@gmail.com Fixes: `2ba466d74e` ("nilfs2: directory entry operations") Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Reported-by: Lizhi Xu <lizhi.xu@windriver.com> Closes: https://lkml.kernel.org/r/20240927013806.3577931-1-lizhi.xu@windriver.com Reported-by: syzbot+8a192e8d090fa9a31135@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=8a192e8d090fa9a31135 Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-10-17 00:28:06 -07:00
Bart Van Assche	f4dd946c77	fs/procfs: Switch to irq_get_nr_irqs() Use the irq_get_nr_irqs() function instead of the global variable 'nr_irqs'. Prepare for changing 'nr_irqs' from an exported global variable into a variable with file scope. Signed-off-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20241015190953.1266194-21-bvanassche@acm.org	2024-10-16 21:56:59 +02:00
Linus Torvalds	667b1d41b2	for-6.12-rc3-tag -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmcPxtAACgkQxWXV+ddt WDu9lA//WfB88fwEKnqBYDRo6aiSMIAzLDuXkJ9i8d7rcjZO1OIZkEnMOsxhvTcZ KxgjNjkgzoTyUwoAUlG+ZpvMeSNMhBdr2NFkXmYzN9oanFE4zplpZiWx6tGSApRU 0ilngjXBsr8p03HmB88Yb05DVYQ2elMP6Jx3VETDBa0CNyp4//tGKzusNhZdA7KM XLZmkKRk3ZKabNo+p2J5t8UGJCl2L18U0o/EphfSkODKadUnsBbAPZUt2EGQCZwv uZhDFAUkgTFBkeRO7JwTfDrNi51M4zwmh+kEduzg4Ny4TdFb1UapU7K1N330WMru 4Qa953Met9I4NB/kKI+fZP1lN4NGuD2qEU6yoZVSy4UiqRp1gEg8kOUfVGFbNJa1 VFYcwdrBad0I4PjnQc5bpZVjzqJT5wWiZxjlWrB7VyIfdmnvQxe5h4DBwBhN5FJr +MEtuY2QNFygjDAZ5z0Ss8hegqI+FYi562Cjy9QRLhb3qGD8STF2BIChaILIn3oA UVJUlUP6CUmCu1RZRMFB4/WkeHO46FmZxJErGfFeXqJInThf0/rdSZOQgIP0JsUq N8FINEgXFAMCkK1PT7MNvAYkSP0tR7B0JjGKcSlGS3v3F0URCNGvHSiqbLedAtXT lc1MdXTZxub8h6xhIvgY1j7HRAFGrunn7LD6MIKRWX1SZPWwAGI= =DEUA -----END PGP SIGNATURE----- Merge tag 'for-6.12-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: - regression fix: dirty extents tracked in xarray for qgroups must be adjusted for 32bit platforms - fix potentially freeing uninitialized name in fscrypt structure - fix warning about unneeded variable in a send callback * tag 'for-6.12-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: fix uninitialized pointer free on read_alloc_one_name() error btrfs: send: cleanup unneeded return variable in changed_verity() btrfs: fix uninitialized pointer free in add_inode_ref() btrfs: use sector numbers as keys for the dirty extents xarray	2024-10-16 09:30:20 -07:00
Linus Torvalds	9f635d44d7	two ksmbd server fixes -----BEGIN PGP SIGNATURE----- iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmcPUXEACgkQiiy9cAdy T1Gm7QwAlPW5//Cb4B0gpjzRcUws51IZ4yFhp4IQWmsd0RqdjZ4TxSCOPF3u3HR3 0OPxyLdbUn6h5g0S2ayzqomHx2VBOQTjgyuMtaTWzokToMNu8kqvxK1MTslkBior 9YEHUz9+5f0OJ+JBGNUzjfy4Plygr5y09udaLfqIknuY8+SeuooxNNUNfkIvrP7C JsSAWJznN9VMpKJmszYc4ntyTiz1XVXyyjJmjhRQ27ah8LUghqZ0mamgigTS5UFa U7eYBDfs6+9i5Lvkd4bJPdGyov9g/EPViLURZMfNaz3+p0TfosN8s2UZuhHC+zuv BDQ+wHGRqzmteZspLanrGBt9y9svHXp1CD7MwqWeGR3GhKsfsxCMJpE931fBhsxM vlJdd/xCs128fv48AvNyHA9abN0U1FpskOJhOzjDgvhKqDoIQ4TCC7QFDEttsPRv ZiQmyOCPyZZY28EmfoltU4CFcMIwKQ81nPUSOJFgKmHBbSpc+Qtnv5QgRHZCzj7n StJfaIMv =WhJj -----END PGP SIGNATURE----- Merge tag 'v6.12-rc3-ksmbd-fixes' of git://git.samba.org/ksmbd Pull smb server fixes from Steve French: - fix race between session setup and session logoff - add supplementary group support * tag 'v6.12-rc3-ksmbd-fixes' of git://git.samba.org/ksmbd: ksmbd: add support for supplementary groups ksmbd: fix user-after-free from session log off	2024-10-16 09:15:43 -07:00
Amir Goldstein	522249f05c	fanotify: allow reporting errors on failure to open fd When working in "fd mode", fanotify_read() needs to open an fd from a dentry to report event->fd to userspace. Opening an fd from dentry can fail for several reasons. For example, when tasks are gone and we try to open their /proc files or we try to open a WRONLY file like in sysfs or when trying to open a file that was deleted on the remote network server. Add a new flag FAN_REPORT_FD_ERROR for fanotify_init(). For a group with FAN_REPORT_FD_ERROR, we will send the event with the error instead of the open fd, otherwise userspace may not get the error at all. For an overflow event, we report -EBADF to avoid confusing FAN_NOFD with -EPERM. Similarly for pidfd open errors we report either -ESRCH or the open error instead of FAN_NOPIDFD and FAN_EPIDFD. In any case, userspace will not know which file failed to open, so add a debug print for further investigation. Reported-by: Krishna Vivek Vitta <kvitta@microsoft.com> Link: https://lore.kernel.org/linux-fsdevel/SI2P153MB07182F3424619EDDD1F393EED46D2@SI2P153MB0718.APCP153.PROD.OUTLOOK.COM/ Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20241003142922.111539-1-amir73il@gmail.com	2024-10-16 17:43:05 +02:00
Yang Shi	25c17c4b55	hugetlb: arm64: add mte support Enable MTE support for hugetlb. The MTE page flags will be set on the folio only. When copying hugetlb folio (for example, CoW), the tags for all subpages will be copied when copying the first subpage. When freeing hugetlb folio, the MTE flags will be cleared. Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Yang Shi <yang@os.amperecomputing.com> Link: https://lore.kernel.org/r/20241001225220.271178-1-yang@os.amperecomputing.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>	2024-10-16 14:50:47 +01:00
Ryusuke Konishi	6ed469df0b	nilfs2: fix kernel bug due to missing clearing of buffer delay flag Syzbot reported that after nilfs2 reads a corrupted file system image and degrades to read-only, the BUG_ON check for the buffer delay flag in submit_bh_wbc() may fail, causing a kernel bug. This is because the buffer delay flag is not cleared when clearing the buffer state flags to discard a page/folio or a buffer head. So, fix this. This became necessary when the use of nilfs2's own page clear routine was expanded. This state inconsistency does not occur if the buffer is written normally by log writing. Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Link: https://lore.kernel.org/r/20241015213300.7114-1-konishi.ryusuke@gmail.com Fixes: `8c26c4e269` ("nilfs2: fix issue with flush kernel thread after remount in RO mode because of driver's internal error or metadata corruption") Reported-by: syzbot+985ada84bf055a575c07@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=985ada84bf055a575c07 Cc: stable@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-10-16 15:05:32 +02:00
Amir Goldstein	20121d3f58	fuse: update inode size after extending passthrough write yangyun reported that libfuse test test_copy_file_range() copies zero bytes from a newly written file when fuse passthrough is enabled. The reason is that extending passthrough write is not updating the fuse inode size and when vfs_copy_file_range() observes a zero size inode, it returns without calling the filesystem copy_file_range() method. Fix this by adjusting the fuse inode size after an extending passthrough write. This does not provide cache coherency of fuse inode attributes and backing inode attributes, but it should prevent situations where fuse inode size is too small, causing read/copy to be wrongly shortened. Reported-by: yangyun <yangyun50@huawei.com> Closes: https://github.com/libfuse/libfuse/issues/1048 Fixes: `57e1176e60` ("fuse: implement read/write passthrough") Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2024-10-16 13:18:21 +02:00
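The size adjustment described in the commit above can be illustrated with a small userspace sketch. This is not the fuse or backing-file kernel code: `struct sketch_inode` and `sketch_end_write()` are hypothetical names, and the sketch only assumes that the end-of-write hook receives the write offset and the write result.

```c
#include <stdint.h>

/* Hypothetical stand-in for an inode; only the cached size matters here. */
struct sketch_inode {
	int64_t size;
};

/*
 * Hypothetical end-of-write hook (illustration only, not a kernel API).
 * `pos` is the offset the write started at, `ret` is the number of bytes
 * written or a negative error code. The cached size only ever grows, and
 * only when the write ended past the current end of file.
 */
void sketch_end_write(struct sketch_inode *inode, int64_t pos, int64_t ret)
{
	int64_t end;

	if (ret <= 0)
		return;		/* failed or empty write: nothing to update */

	end = pos + ret;
	if (end > inode->size)
		inode->size = end;
}
```

A write that stays inside the file leaves the size alone, so a later size check sees neither a stale zero nor a shrunken value.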
Amir Goldstein	f03b296e8b	fs: pass offset and result to backing_file end_write() callback This is needed for extending fuse inode size after fuse passthrough write. Suggested-by: Miklos Szeredi <miklos@szeredi.hu> Link: https://lore.kernel.org/linux-fsdevel/CAJfpegs=cvZ_NYy6Q_D42XhYS=Sjj5poM1b5TzXzOVvX=R36aA@mail.gmail.com/ Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2024-10-16 13:17:45 +02:00
Dr. David Alan Gilbert	6aca91c416	cifs: Remove unused functions cifs_ses_find_chan() has been unused since commit `f486ef8e20` ("cifs: use the chans_need_reconnect bitmap for reconnect status") cifs_read_page_from_socket() has been unused since commit `d08089f649` ("cifs: Change the I/O paths to use an iterator rather than a page list") cifs_chan_in_reconnect() has been unused since commit `bc962159e8` ("cifs: avoid race conditions with parallel reconnects") Remove them. Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org> Signed-off-by: Steve French <stfrench@microsoft.com>	2024-10-16 00:30:52 -05:00
Advait Dhamorikar	3dfea293f4	smb/client: Fix logically dead code The if condition in collect_sample: can never be satisfied because of a logical contradiction. The indicated dead code may have performed some action; that action will never occur. Fixes: `94ae8c3fee` ("smb: client: compress: LZ77 code improvements cleanup") Signed-off-by: Advait Dhamorikar <advaitdhamorikar@gmail.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2024-10-16 00:30:52 -05:00
Paulo Alcantara	1ab60323c5	smb: client: fix OOBs when building SMB2_IOCTL request When using encryption, either enforced by the server or when using 'seal' mount option, the client will squash all compound request buffers down for encryption into a single iov in smb2_set_next_command(). SMB2_ioctl_init() allocates a small buffer (448 bytes) to hold the SMB2_IOCTL request in the first iov, and if the user passes an input buffer that is greater than 328 bytes, smb2_set_next_command() will end up writing off the end of @rqst->iov[0].iov_base as shown below: mount.cifs //srv/share /mnt -o ...,seal ln -s $(perl -e "print('a')for 1..1024") /mnt/link BUG: KASAN: slab-out-of-bounds in smb2_set_next_command.cold+0x1d6/0x24c [cifs] Write of size 4116 at addr ffff8881148fcab8 by task ln/859 CPU: 1 UID: 0 PID: 859 Comm: ln Not tainted 6.12.0-rc3 #1 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-2.fc40 04/01/2014 Call Trace: <TASK> dump_stack_lvl+0x5d/0x80 ? smb2_set_next_command.cold+0x1d6/0x24c [cifs] print_report+0x156/0x4d9 ? smb2_set_next_command.cold+0x1d6/0x24c [cifs] ? __virt_addr_valid+0x145/0x310 ? __phys_addr+0x46/0x90 ? smb2_set_next_command.cold+0x1d6/0x24c [cifs] kasan_report+0xda/0x110 ? smb2_set_next_command.cold+0x1d6/0x24c [cifs] kasan_check_range+0x10f/0x1f0 __asan_memcpy+0x3c/0x60 smb2_set_next_command.cold+0x1d6/0x24c [cifs] smb2_compound_op+0x238c/0x3840 [cifs] ? kasan_save_track+0x14/0x30 ? kasan_save_free_info+0x3b/0x70 ? vfs_symlink+0x1a1/0x2c0 ? do_symlinkat+0x108/0x1c0 ? __pfx_smb2_compound_op+0x10/0x10 [cifs] ? kmem_cache_free+0x118/0x3e0 ? cifs_get_writable_path+0xeb/0x1a0 [cifs] smb2_get_reparse_inode+0x423/0x540 [cifs] ? __pfx_smb2_get_reparse_inode+0x10/0x10 [cifs] ? rcu_is_watching+0x20/0x50 ? __kmalloc_noprof+0x37c/0x480 ? smb2_create_reparse_symlink+0x257/0x490 [cifs] ? smb2_create_reparse_symlink+0x38f/0x490 [cifs] smb2_create_reparse_symlink+0x38f/0x490 [cifs] ? __pfx_smb2_create_reparse_symlink+0x10/0x10 [cifs] ? 
find_held_lock+0x8a/0xa0 ? hlock_class+0x32/0xb0 ? __build_path_from_dentry_optional_prefix+0x19d/0x2e0 [cifs] cifs_symlink+0x24f/0x960 [cifs] ? __pfx_make_vfsuid+0x10/0x10 ? __pfx_cifs_symlink+0x10/0x10 [cifs] ? make_vfsgid+0x6b/0xc0 ? generic_permission+0x96/0x2d0 vfs_symlink+0x1a1/0x2c0 do_symlinkat+0x108/0x1c0 ? __pfx_do_symlinkat+0x10/0x10 ? strncpy_from_user+0xaa/0x160 __x64_sys_symlinkat+0xb9/0xf0 do_syscall_64+0xbb/0x1d0 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7f08d75c13bb Reported-by: David Howells <dhowells@redhat.com> Fixes: `e77fe73c7e` ("cifs: we can not use small padding iovs together with encryption") Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2024-10-16 00:30:52 -05:00
Su Hui	19ebc1e6ca	smb: client: fix possible double free in smb2_set_ea() Clang static checker (scan-build) warning: fs/smb/client/smb2ops.c:1304:2: Attempt to free released memory. 1304 | kfree(ea); | ^~~~~~~~~ A double free occurs in the following case: 'ea is initialized to NULL' -> 'first successful memory allocation for ea' -> 'something failed, goto sea_exit' -> 'first memory release for ea' -> 'goto replay_again' -> 'second goto sea_exit before allocating memory for ea' -> 'second memory release for ea results in a double free'. Re-initialize 'ea' to NULL near the replay_again label to fix this double free. Fixes: `4f1fffa237` ("cifs: commands that are retried should have replay flag set") Reviewed-by: Dan Carpenter <dan.carpenter@linaro.org> Signed-off-by: Su Hui <suhui@nfschina.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2024-10-16 00:25:54 -05:00
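The control flow behind this double free can be reproduced in a few lines of userspace C. The sketch below is an assumption-laden model, not the code in smb2ops.c: `sketch_set_ea()` and its parameters are hypothetical, and `malloc()`/`free()` stand in for the kernel allocator. It shows why resetting the pointer at the retry label makes the second pass through the exit path harmless.

```c
#include <stdlib.h>

/*
 * Hypothetical retry loop modelled on the flow in the commit message.
 * `attempts` is how many times the operation is tried in total;
 * `fail_early_on_retry` makes every retry bail out before it allocates.
 * Returns how many times a live (non-NULL) buffer was released.
 */
int sketch_set_ea(int attempts, int fail_early_on_retry)
{
	char *ea = NULL;
	int live_frees = 0;
	int attempt = 0;

replay_again:
	ea = NULL;	/* the fix: forget the pointer the previous attempt freed */
	attempt++;

	if (attempt > 1 && fail_early_on_retry)
		goto sea_exit;	/* leaves before ea is allocated */

	ea = malloc(16);
	if (!ea)
		goto sea_exit;

sea_exit:
	if (ea)
		live_frees++;
	free(ea);	/* free(NULL) is a no-op, so a reset pointer is safe */

	if (attempt < attempts)
		goto replay_again;
	return live_frees;
}
```

Without the reset at `replay_again`, the early exit on the second attempt would hand the already-released pointer to `free()` again.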
Linus Torvalds	bdc7276512	bcachefs fixes for 6.12-rc4 - New metadata version inode_has_child_snapshots This fixes bugs with handling of unlinked inodes + snapshots, in particular when an inode is reattached after taking a snapshot; deleted inodes now get correctly cleaned up across snapshots. - Disk accounting rewrite fixes - validation fixes for when a device has been removed - fix journal replay failing with "journal_reclaim_would_deadlock" - Some more small fixes for erasure coding + device removal - Assorted small syzbot fixes Merge tag 'bcachefs-2024-10-14' of git://evilpiepirate.org/bcachefs Pull bcachefs fixes from Kent Overstreet: - New metadata version inode_has_child_snapshots This fixes bugs with handling of unlinked inodes + snapshots, in particular when an inode is reattached after taking a snapshot; deleted inodes now get correctly cleaned up across snapshots. 
- Disk accounting rewrite fixes - validation fixes for when a device has been removed - fix journal replay failing with "journal_reclaim_would_deadlock" - Some more small fixes for erasure coding + device removal - Assorted small syzbot fixes * tag 'bcachefs-2024-10-14' of git://evilpiepirate.org/bcachefs: (27 commits) bcachefs: Fix sysfs warning in fstests generic/730,731 bcachefs: Handle race between stripe reuse, invalidate_stripe_to_dev bcachefs: Fix kasan splat in new_stripe_alloc_buckets() bcachefs: Add missing validation for bch_stripe.csum_granularity_bits bcachefs: Fix missing bounds checks in bch2_alloc_read() bcachefs: fix uaf in bch2_dio_write_done() bcachefs: Improve check_snapshot_exists() bcachefs: Fix bkey_nocow_lock() bcachefs: Fix accounting replay flags bcachefs: Fix invalid shift in member_to_text() bcachefs: Fix bch2_have_enough_devs() for BCH_SB_MEMBER_INVALID bcachefs: __wait_for_freeing_inode: Switch to wait_bit_queue_entry bcachefs: Check if stuck in journal_res_get() closures: Add closure_wait_event_timeout() bcachefs: Fix state lock involved deadlock bcachefs: Fix NULL pointer dereference in bch2_opt_to_text bcachefs: Release transaction before wake up bcachefs: add check for btree id against max in try read node bcachefs: Disk accounting device validation fixes bcachefs: bch2_inode_or_descendents_is_open() ...	2024-10-15 11:06:45 -07:00
Bill O'Donnell	51ceeb1a81	efs: fix the efs new mount api implementation Commit `39a6c668e4` (efs: convert efs to use the new mount api) did not include anything from v2 and v3 that were also submitted. Fix this by bringing in those changes that were proposed in v2 and v3. Fixes: `39a6c668e4` efs: convert efs to use the new mount api. Signed-off-by: Bill O'Donnell <bodonnel@redhat.com> Link: https://lore.kernel.org/r/20241014190241.4093825-1-bodonnel@redhat.com Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-10-15 15:58:36 +02:00
Christoph Hellwig	f6f91d290c	xfs: punch delalloc extents from the COW fork for COW writes When ->iomap_end is called on a short write to the COW fork it needs to punch stale delalloc data from the COW fork and not the data fork. Ensure that IOMAP_F_NEW is set for new COW fork allocations in xfs_buffered_write_iomap_begin, and then use the IOMAP_F_SHARED flag in xfs_buffered_write_delalloc_punch to decide which fork to punch. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2024-10-15 11:37:42 +02:00
Christoph Hellwig	7d6fe5c586	xfs: set IOMAP_F_SHARED for all COW fork allocations Change xfs_buffered_write_iomap_begin to always set IOMAP_F_SHARED for COW fork allocations even if they don't overlap existing data fork extents, which will allow the iomap_end callback to detect if it has to punch stale delalloc blocks from the COW fork instead of the data fork. It also means we sample the sequence counter for both the data and the COW fork when writing to the COW fork, which ensures we properly revalidate when only COW fork changes happen. This is essentially a revert of commit `72a048c105` ("xfs: only set IOMAP_F_SHARED when providing a srcmap to a write"). This is fine because the problem that the commit fixed has now been dealt with in iomap by only looking at the actual srcmap and not the fallback to the write iomap. Note that the direct I/O path was never changed and has always set IOMAP_F_SHARED for all COW fork allocations. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2024-10-15 11:37:42 +02:00
Christoph Hellwig	c29440ff66	xfs: share more code in xfs_buffered_write_iomap_begin Introduce a local iomap_flags variable so that the code allocating new delalloc blocks in the data fork can fall through to the found_imap label and reuse the code to unlock and fill the iomap. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2024-10-15 11:37:42 +02:00
Christoph Hellwig	8fe3b21efa	xfs: support the COW fork in xfs_bmap_punch_delalloc_range xfs_buffered_write_iomap_begin can also create delalloc reservations that need cleaning up. Prepare for that by adding support for the COW fork in xfs_bmap_punch_delalloc_range. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2024-10-15 11:37:42 +02:00
Christoph Hellwig	abd7d651ad	xfs: IOMAP_ZERO and IOMAP_UNSHARE already hold invalidate_lock All XFS callers of iomap_zero_range and iomap_file_unshare already hold invalidate_lock, so we can't take it again in iomap_file_buffered_write_punch_delalloc. Use the passed in flags argument to detect if we're called from a zero or unshare operation and don't take the lock again in this case. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2024-10-15 11:37:42 +02:00
Christoph Hellwig	acfbac7764	xfs: take XFS_MMAPLOCK_EXCL xfs_file_write_zero_eof xfs_file_write_zero_eof is the only caller of xfs_zero_range that does not take XFS_MMAPLOCK_EXCL (aka the invalidate lock). Currently that is actually the right thing, as an error in the iomap zeroing code will also take the invalidate_lock to clean up, but to fix that deadlock we need a consistent locking pattern first. The only extra thing that XFS_MMAPLOCK_EXCL will lock out are read pagefaults, which isn't really needed here, but also not actively harmful. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2024-10-15 11:37:42 +02:00
Christoph Hellwig	3c399374af	xfs: factor out a xfs_file_write_zero_eof helper Split a helper from xfs_file_write_checks that just deals with the post-EOF zeroing to keep the code readable. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2024-10-15 11:37:42 +02:00
Christoph Hellwig	b784951662	iomap: move locking out of iomap_write_delalloc_release XFS (which currently is the only user of iomap_write_delalloc_release) already holds invalidate_lock for most zeroing operations. To be able to avoid a deadlock it needs to stop taking the lock, but doing so in iomap would leak XFS locking details into iomap. To avoid this require the caller to hold invalidate_lock when calling iomap_write_delalloc_release instead of taking it there. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2024-10-15 11:37:42 +02:00
Christoph Hellwig	caf0ea451d	iomap: remove iomap_file_buffered_write_punch_delalloc Currently iomap_file_buffered_write_punch_delalloc can be called from XFS either with the invalidate lock held or not. To fix this while keeping the locking in the file system and not the iomap library code we'll need to lift the locking up into the file system. To prepare for that, open code iomap_file_buffered_write_punch_delalloc in the only caller, and instead export iomap_write_delalloc_release. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2024-10-15 11:37:42 +02:00
Christoph Hellwig	c0adf8c3a9	iomap: factor out a iomap_last_written_block helper Split out a piece of logic from iomap_file_buffered_write_punch_delalloc that is useful for all iomap_end implementations. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2024-10-15 11:37:41 +02:00
Linus Torvalds	eca631b8fe	f2fs fix for 6.12-rc4 This includes an urgent fix to resolve DIO read performance regression caused by `0cac51185e` ("f2fs: fix to avoid racing in between read and OPU dio write"). Merge tag 'f2fs-6.12-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs Pull f2fs fix from Jaegeuk Kim: "An urgent fix to resolve DIO read performance regression caused by 'f2fs: fix to avoid racing in between read and OPU dio write'" * tag 'f2fs-6.12-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: f2fs: allow parallel DIO reads	2024-10-14 11:19:19 -07:00
Linus Torvalds	63fa605041	Changes since last update: - Make sure only regular inodes can be used for file-backed mounts; - Two minor codebase cleanups. Merge tag 'erofs-for-6.12-rc4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs Pull erofs fixes from Gao Xiang: "The main one fixes a syzbot issue due to the invalid inode type out of file-backed mounts. The others are minor cleanups without actual logic changes. Summary: - Make sure only regular inodes can be used for file-backed mounts - Two minor codebase cleanups" * tag 'erofs-for-6.12-rc4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs: erofs: get rid of kaddr in `struct z_erofs_maprecorder` erofs: get rid of z_erofs_try_to_claim_pcluster() erofs: ensure regular inodes for file-backed mounts	2024-10-14 11:12:09 -07:00
Song Liu	1cda52f1b4	fsnotify, lsm: Decouple fsnotify from lsm Currently, fsnotify_open_perm() is called from security_file_open(). This is a bit unexpected and creates an otherwise unnecessary dependency of CONFIG_FANOTIFY_ACCESS_PERMISSIONS on CONFIG_SECURITY. Fix this by calling fsnotify_open_perm() directly. Signed-off-by: Song Liu <song@kernel.org> Acked-by: Paul Moore <paul@paul-moore.com> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20241013002248.3984442-1-song@kernel.org	2024-10-14 17:38:27 +02:00
Christian Brauner	58439f6c48	Merge patch series "ovl: file descriptors based layer setup" Christian Brauner <brauner@kernel.org> says: Currently overlayfs only allows specifying layers through path names. This is inconvenient for users such as systemd that want to assemble an overlayfs mount purely based on file descriptors. When porting overlayfs to the new mount api I already mentioned this. This enables users to specify both: fsconfig(fd_overlay, FSCONFIG_SET_FD, "upperdir+", NULL, fd_upper); fsconfig(fd_overlay, FSCONFIG_SET_FD, "workdir+", NULL, fd_work); fsconfig(fd_overlay, FSCONFIG_SET_FD, "lowerdir+", NULL, fd_lower1); fsconfig(fd_overlay, FSCONFIG_SET_FD, "lowerdir+", NULL, fd_lower2); in addition to: fsconfig(fd_overlay, FSCONFIG_SET_STRING, "upperdir+", "/upper", 0); fsconfig(fd_overlay, FSCONFIG_SET_STRING, "workdir+", "/work", 0); fsconfig(fd_overlay, FSCONFIG_SET_STRING, "lowerdir+", "/lower1", 0); fsconfig(fd_overlay, FSCONFIG_SET_STRING, "lowerdir+", "/lower2", 0); The selftests contain an example for this. * patches from https://lore.kernel.org/r/20241014-work-overlayfs-v3-0-32b3fed1286e@kernel.org: selftests: add overlayfs fd mounting selftests selftests: use shared header Documentation,ovl: document new file descriptor based layers ovl: specify layers via file descriptors fs: add helper to use mount option as path or fd Link: https://lore.kernel.org/r/20241014-work-overlayfs-v3-0-32b3fed1286e@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-10-14 16:31:22 +02:00
Christian Brauner	a08557d19e	ovl: specify layers via file descriptors Currently overlayfs only allows specifying layers through path names. This is inconvenient for users such as systemd that want to assemble an overlayfs mount purely based on file descriptors. This enables users to specify both: fsconfig(fd_overlay, FSCONFIG_SET_FD, "upperdir+", NULL, fd_upper); fsconfig(fd_overlay, FSCONFIG_SET_FD, "workdir+", NULL, fd_work); fsconfig(fd_overlay, FSCONFIG_SET_FD, "lowerdir+", NULL, fd_lower1); fsconfig(fd_overlay, FSCONFIG_SET_FD, "lowerdir+", NULL, fd_lower2); in addition to: fsconfig(fd_overlay, FSCONFIG_SET_STRING, "upperdir+", "/upper", 0); fsconfig(fd_overlay, FSCONFIG_SET_STRING, "workdir+", "/work", 0); fsconfig(fd_overlay, FSCONFIG_SET_STRING, "lowerdir+", "/lower1", 0); fsconfig(fd_overlay, FSCONFIG_SET_STRING, "lowerdir+", "/lower2", 0); Link: https://lore.kernel.org/r/20241014-work-overlayfs-v3-2-32b3fed1286e@kernel.org Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-10-14 16:31:16 +02:00
Christian Brauner	c2f8fde868	fs: add helper to use mount option as path or fd Allow filesystems to use a mount option either as a file or path. Link: https://lore.kernel.org/r/20241014-work-overlayfs-v3-1-32b3fed1286e@kernel.org Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-10-14 16:31:15 +02:00
Mathieu Desnoyers	7e019dcc47	sched: Improve cache locality of RSEQ concurrency IDs for intermittent workloads commit `223baf9d17` ("sched: Fix performance regression introduced by mm_cid") introduced a per-mm/cpu current concurrency id (mm_cid), which keeps a reference to the concurrency id allocated for each CPU. This reference expires shortly after a 100ms delay. These per-CPU references keep the per-mm-cid data cache-local in situations where threads are running at least once on each CPU within each 100ms window, thus keeping the per-cpu reference alive. However, intermittent workloads behaving in bursts spaced by more than 100ms on each CPU exhibit bad cache locality and degraded performance compared to purely per-cpu data indexing, because concurrency IDs are allocated over various CPUs and cores, therefore losing cache locality of the associated data. Introduce the following changes to improve per-mm-cid cache locality: - Add a "recent_cid" field to the per-mm/cpu mm_cid structure to keep track of which mm_cid value was last used, and use it as a hint to attempt re-allocating the same concurrency ID the next time this mm/cpu needs to allocate a concurrency ID, - Add a per-mm CPUs allowed mask, which keeps track of the union of CPUs allowed for all threads belonging to this mm. This cpumask is only set during the lifetime of the mm, never cleared, so it represents the union of all the CPUs allowed since the beginning of the mm lifetime (note that the mm_cpumask() is really arch-specific and tailored to the TLB flush needs, and is thus _not_ a viable approach for this), - Add a per-mm nr_cpus_allowed to keep track of the weight of the per-mm CPUs allowed mask (for fast access), - Add a per-mm max_nr_cid to keep track of the highest number of concurrency IDs allocated for the mm. 
This is used for expanding the concurrency ID allocation within the upper bound defined by: min(mm->nr_cpus_allowed, mm->mm_users) When the next unused CID value reaches this threshold, stop trying to expand the cid allocation and use the first available cid value instead. Spreading allocation to use all the cid values within the range [ 0, min(mm->nr_cpus_allowed, mm->mm_users) - 1 ] improves cache locality while preserving mm_cid compactness within the expected user limits, - In __mm_cid_try_get, only return cid values within the range [ 0, mm->nr_cpus_allowed ] rather than [ 0, nr_cpu_ids ]. This prevents allocating cids above the number of allowed cpus in rare scenarios where cid allocation races with a concurrent remote-clear of the per-mm/cpu cid. This improvement is made possible by the addition of the per-mm CPUs allowed mask, - In sched_mm_cid_migrate_to, use mm->nr_cpus_allowed rather than t->nr_cpus_allowed. This criterion was really meant to compare the number of mm->mm_users to the number of CPUs allowed for the entire mm. Therefore, the prior comparison worked fine when all threads shared the same CPUs allowed mask, but not so much in scenarios where those threads have different masks (e.g. each thread pinned to a single CPU). This improvement is made possible by the addition of the per-mm CPUs allowed mask. * Benchmarks Each thread increments 16kB worth of 8-bit integers in bursts, with a configurable delay between each thread's execution. Each thread run one after the other (no threads run concurrently). The order of thread execution in the sequence is random. The thread execution sequence begins again after all threads have executed. The 16kB areas are allocated with rseq_mempool and indexed by either cpu_id, mm_cid (not cache-local), or cache-local mm_cid. Each thread is pinned to its own core. 
Testing configurations:

8-core/1-L3: Use 8 cores within a single L3
24-core/24-L3: Use 24 cores, 1 core per L3
192-core/24-L3: Use 192 cores (all cores in the system)
384-thread/24-L3: Use 384 HW threads (all HW threads in the system)

Intermittent workload delays between threads: 200ms, 10ms.

Hardware:

CPU(s): 384
On-line CPU(s) list: 0-383
Vendor ID: AuthenticAMD
Model name: AMD EPYC 9654 96-Core Processor
Thread(s) per core: 2
Core(s) per socket: 96
Socket(s): 2
Caches (sum of all):
L1d: 6 MiB (192 instances)
L1i: 6 MiB (192 instances)
L2: 192 MiB (192 instances)
L3: 768 MiB (24 instances)

Each result is an average of 5 test runs. The cache-local speedup is calculated as: (mm_cid) / (cache-local mm_cid).

Intermittent workload delay: 200ms

                    per-cpu   mm_cid   cache-local mm_cid   cache-local speedup
                    (ns)      (ns)     (ns)
8-core/1-L3         1374      19289    1336                 14.4x
24-core/24-L3       2423      26721    1594                 16.7x
192-core/24-L3      2291      15826    2153                 7.3x
384-thread/24-L3    1874      13234    1907                 6.9x

Intermittent workload delay: 10ms

                    per-cpu   mm_cid   cache-local mm_cid   cache-local speedup
                    (ns)      (ns)     (ns)
8-core/1-L3         662       756      686                  1.1x
24-core/24-L3       1378      3648     1035                 3.5x
192-core/24-L3      1439      10833    1482                 7.3x
384-thread/24-L3    1503      10570    1556                 6.8x

[ This deprecates the prior "sched: NUMA-aware per-memory-map concurrency IDs" patch series with a simpler and more general approach. ] [ This patch applies on top of v6.12-rc1. ] Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Marco Elver <elver@google.com> Link: https://lore.kernel.org/lkml/20240823185946.418340-1-mathieu.desnoyers@efficios.com/	2024-10-14 12:52:40 +02:00
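The allocation order described in the mm_cid commit above (reuse the recent cid, then expand up to min(nr_cpus_allowed, mm_users), then take the first free cid) can be sketched in single-threaded userspace C. Everything here is a simplified assumption for illustration: `struct sketch_mm`, `sketch_cid_get()` and `sketch_cid_put()` are hypothetical names, a plain array replaces the cpumask, and none of the atomics or per-CPU state of the scheduler code is modelled.

```c
#include <stdbool.h>

#define SKETCH_MAX_CIDS 64	/* assumed upper bound for nr_cpus_allowed */

/* Hypothetical, simplified stand-in for the per-mm state. */
struct sketch_mm {
	bool cid_used[SKETCH_MAX_CIDS];	/* which concurrency IDs are taken */
	int nr_cpus_allowed;		/* weight of the per-mm allowed mask */
	int mm_users;			/* number of threads using the mm */
	int max_nr_cid;			/* how far the allocation has expanded */
};

/*
 * Pick a concurrency ID. `recent_cid` is the hint remembered for this
 * mm/cpu pair, or -1 when there is none. Returns -1 if nothing is free.
 */
int sketch_cid_get(struct sketch_mm *mm, int recent_cid)
{
	int limit = mm->nr_cpus_allowed < mm->mm_users ?
		    mm->nr_cpus_allowed : mm->mm_users;
	int cid;

	/* 1. Prefer the cid used last here: its data may still be cached. */
	if (recent_cid >= 0 && recent_cid < mm->nr_cpus_allowed &&
	    !mm->cid_used[recent_cid]) {
		mm->cid_used[recent_cid] = true;
		return recent_cid;
	}

	/* 2. Expand while below min(nr_cpus_allowed, mm_users). */
	while (mm->max_nr_cid < limit) {
		cid = mm->max_nr_cid++;
		if (!mm->cid_used[cid]) {
			mm->cid_used[cid] = true;
			return cid;
		}
	}

	/* 3. Fall back to the first free cid below nr_cpus_allowed. */
	for (cid = 0; cid < mm->nr_cpus_allowed; cid++) {
		if (!mm->cid_used[cid]) {
			mm->cid_used[cid] = true;
			return cid;
		}
	}
	return -1;
}

/* Release a cid so a later call can hand it out again. */
void sketch_cid_put(struct sketch_mm *mm, int cid)
{
	mm->cid_used[cid] = false;
}
```

In this model a thread that releases its cid and later asks again with the same hint gets the same value back, which is the locality effect the benchmark columns compare.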
Kent Overstreet	5e3b72324d	bcachefs: Fix sysfs warning in fstests generic/730,731 sysfs warns if we're removing a symlink from a directory that's no longer in sysfs; this is triggered by fstests generic/730, which simulates hot removal of a block device. This patch is however not a correct fix, since checking kobj->state_in_sysfs on a kobj owned by another subsystem is racy. A better fix would be to add the appropriate check to sysfs_remove_link() - and sysfs_create_link() as well. But kobject_add_internal()/kobject_del() do not as of today have locking that would support that. Note that the block/holder.c code appears to be subject to this race as well. Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-14 05:43:01 -04:00
Kent Overstreet	cb6055e66f	bcachefs: Handle race between stripe reuse, invalidate_stripe_to_dev When creating a new stripe, we may reuse an existing stripe that has some empty and some nonempty blocks. Generally, the existing stripe won't change underneath us - except for block sector counts, which we copy to the new key in ec_stripe_key_update. But the device removal path can now invalidate stripe pointers to a device, and that can race with stripe reuse. Change ec_stripe_key_update() to check for and resolve this inconsistency. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-13 22:03:03 -04:00
Kent Overstreet	b1e562265e	bcachefs: Fix kasan splat in new_stripe_alloc_buckets() Update for BCH_SB_MEMBER_INVALID. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-13 22:03:01 -04:00
Kent Overstreet	9f25dbe0bf	bcachefs: Add missing validation for bch_stripe.csum_granularity_bits Reported-by: syzbot+f8c98a50c323635be65d@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-13 17:55:33 -04:00
Kent Overstreet	a319aeaebb	bcachefs: Fix missing bounds checks in bch2_alloc_read() We were checking that the alloc key was for a valid device, but not a valid bucket. This is the upgrade path from versions prior to bcachefs being mainlined. Reported-by: syzbot+a1b59c8e1a3f022fd301@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-13 17:55:33 -04:00
Kent Overstreet	573ddcdc56	bcachefs: fix uaf in bch2_dio_write_done() Reported-by: syzbot+19ad84d5133871207377@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-13 17:55:33 -04:00
Linus Torvalds	cfea70e835	two fixes for Windows symlink handling -----BEGIN PGP SIGNATURE----- iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmcMBf4ACgkQiiy9cAdy T1Hf7Qv/f/TEXZisWIGshUpIerxOAWmN70bTw4sNID9ge8mVWwtVJBs57rlSjPTc 97Jj95urqnKEAGk/KC8qntp5QCMBQAeBFILigZph2c7vqEXPQy0dpbDUEUFuRN2G mq0wn7IcJZcPJmhZGx9JJeteHk/24drJRSM+jyklwI2Rmev6Y6dlsv4JyMuvP7iI YuCdbN7rYXsRBkpnK5AbiWCRdxwQMiMuGsppNQyBVSZKkt/g+8R16Z6WKxSbkaZf XajVsywhlP5Bg9HRAk/YTPK4enKVi8ISp9qfS9EuinwM/VFzEnXnYrec/fiD0Ukg rEemM7iF/YQdQq/2q8gm5KpoOjnLbaew+Zb+OoWyXMK7RJygD79+uMHn3v1cdi7B BWCgbQQ7KiRi6rOo0Xzz8Rmw3L4+DHjTvIbh46jz90qQyuumR2hUSa7cPl2ATO4l lxA50Q8xPE1i0Cfob1w/XHlrfmWMyovtHSKDvaeOMclp/VAHDfS6nB0x/ngyY8UH ii2czaDd =uI8y -----END PGP SIGNATURE----- Merge tag '6.12-rc2-cifs-fixes' of git://git.samba.org/sfrench/cifs-2.6 Pull smb client fixes from Steve French: "Two fixes for Windows symlink handling" * tag '6.12-rc2-cifs-fixes' of git://git.samba.org/sfrench/cifs-2.6: cifs: Fix creating native symlinks pointing to current or parent directory cifs: Improve creating native symlinks pointing to directory	2024-10-13 10:52:39 -07:00
Kent Overstreet	c986dd7ecb	bcachefs: Improve check_snapshot_exists() Check if we have snapshot_trees or subvolumes that refer to the snapshot node being reconstructed, and use them. With this, the kill_btree_root test that blows away the snapshots btree now passes, and we're able to successfully reconstruct. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-12 05:02:48 -04:00
Kent Overstreet	9183c2b11e	bcachefs: Fix bkey_nocow_lock() This fixes an assertion pop in nocow_locking.c 00243 kernel BUG at fs/bcachefs/nocow_locking.c:41! 00243 Internal error: Oops - BUG: 00000000f2000800 [#1] SMP 00243 Modules linked in: 00243 Hardware name: linux,dummy-virt (DT) 00243 pstate: 60001005 (nZCv daif -PAN -UAO -TCO -DIT +SSBS BTYPE=--) 00244 pc : bch2_bucket_nocow_unlock (/home/testdashboard/linux-7/fs/bcachefs/nocow_locking.c:41) 00244 lr : bkey_nocow_lock (/home/testdashboard/linux-7/fs/bcachefs/data_update.c:79) 00244 sp : ffffff80c82373b0 00244 x29: ffffff80c82373b0 x28: ffffff80e08958c0 x27: ffffff80e0880000 00244 x26: ffffff80c8237a98 x25: 00000000000000a0 x24: ffffff80c8237ab0 00244 x23: 00000000000000c0 x22: 0000000000000008 x21: 0000000000000000 00244 x20: ffffff80c8237a98 x19: 0000000000000018 x18: 0000000000000000 00244 x17: 0000000000000000 x16: 000000000000003f x15: 0000000000000000 00244 x14: 0000000000000008 x13: 0000000000000018 x12: 0000000000000000 00244 x11: 0000000000000000 x10: ffffff80e0880000 x9 : ffffffc0803ac1a4 00244 x8 : 0000000000000018 x7 : ffffff80c8237a88 x6 : ffffff80c8237ab0 00244 x5 : ffffff80e08988d0 x4 : 00000000ffffffff x3 : 0000000000000000 00244 x2 : 0000000000000004 x1 : 0003000000000d1e x0 : ffffff80e08988c0 00244 Call trace: 00244 bch2_bucket_nocow_unlock (/home/testdashboard/linux-7/fs/bcachefs/nocow_locking.c:41) 00245 bch2_data_update_init (/home/testdashboard/linux-7/fs/bcachefs/data_update.c:627 (discriminator 1)) 00245 promote_alloc.isra.0 (/home/testdashboard/linux-7/fs/bcachefs/io_read.c:242 /home/testdashboard/linux-7/fs/bcachefs/io_read.c:304) 00245 __bch2_read_extent (/home/testdashboard/linux-7/fs/bcachefs/io_read.c:949) 00246 __bch2_read (/home/testdashboard/linux-7/fs/bcachefs/io_read.c:1215) 00246 bch2_direct_IO_read (/home/testdashboard/linux-7/fs/bcachefs/fs-io-direct.c:132) 00246 bch2_read_iter (/home/testdashboard/linux-7/fs/bcachefs/fs-io-direct.c:201) 00247 aio_read.constprop.0 
(/home/testdashboard/linux-7/fs/aio.c:1602) 00247 io_submit_one.constprop.0 (/home/testdashboard/linux-7/fs/aio.c:2003 /home/testdashboard/linux-7/fs/aio.c:2052) 00248 __arm64_sys_io_submit (/home/testdashboard/linux-7/fs/aio.c:2111 /home/testdashboard/linux-7/fs/aio.c:2081 /home/testdashboard/linux-7/fs/aio.c:2081) 00248 invoke_syscall.constprop.0 (/home/testdashboard/linux-7/arch/arm64/include/asm/syscall.h:61 /home/testdashboard/linux-7/arch/arm64/kernel/syscall.c:54) 00248 ========= FAILED TIMEOUT tiering_variable_buckets_replicas in 1200s Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-12 05:01:52 -04:00
Kent Overstreet	672f75238e	bcachefs: Fix accounting replay flags BCH_TRANS_COMMIT_journal_reclaim without BCH_WATERMARK_reclaim means "return an error if low on journal space" - but accounting replay must succeed. Fixes https://github.com/koverstreet/bcachefs/issues/656 Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-12 03:02:16 -04:00
Kent Overstreet	c1bd21bb65	bcachefs: Fix invalid shift in member_to_text() Reported-by: syzbot+064ce437a1ad63d3f6ef@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-12 03:02:16 -04:00
Kent Overstreet	7d84d9f449	bcachefs: Fix bch2_have_enough_devs() for BCH_SB_MEMBER_INVALID This fixes a kasan splat in the ec device removal tests. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-11 22:20:51 -04:00
Linus Torvalds	6254d53727	NFS Client Bugfixes for Linux 6.12-rc Localio Bugfixes: * Remove duplicated include in localio.c * Fix race in NFS calls to nfsd_file_put_local() and nfsd_serv_put() * Fix Kconfig for NFS_COMMON_LOCALIO_SUPPORT * Fix nfsd_file tracepoints to handle NULL rqstp pointers Other Bugfixes: * Fix program selection loop in svc_process_common * Fix integer overflow in decode_rc_list() * Prevent NULL-pointer dereference in nfs42_complete_copies() * Fix CB_RECALL performance issues when using a large number of delegations -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAmcJjjQACgkQ18tUv7Cl QOvgJw/6A33s+pjyBVLIKT6oMCPkUJeQ4Rhg9Je0Qw/ji0eFkT4Eyd65kRz3T9M/ qRrCfWaUd2dTYcbKQyhuGTlEfICZa9R4I0/Ztk9yvf9xcd1xFXKzTkFekGUVeHQA OcngDu9psFxhvyzKI8nAHs1ephX/T7TywvTKANMRbeRCYYvVkytAt9YeVMigYZa5 dnchoUdGUdL6B6RXCU/Qhf0A1uYyA4hkk/FTBCPgv+kYx5pnjFq0y/yIIHDzCR3I +yE1ss3EpVTQgt2Ca/cmDyYXsa7G8G51U7cS5AeIoXfsf1EGtTujowWcBY4oqFEC ixx58fQe48AqwsP5XDZn8gnsuYH9snnw5rIB0IVqq55/a+XLMupHayyf/iziMV3s JWgT4gKDyFca2pT+bJ8iWweU+ecRYxKGnh2NydyBiqowogsHZm4uKh0vELvqqkBd RIjCyIiQVhYBII2jqpjRnxrqhGUT5XO99NQdQIGV0bUjCEP4YAjY4ChfEVcWXhnB ppyBP+r8N5O77NcVqsVQS26U0/jb9K30LyYl9VT43ank3d+VVtHA5ZqnUflWtwuc 2XiGDvXW9mIvbVraWIZXUNVy39bzRclDf5bx4jeYLnKCMym81rkEIBOvBKQKZTrl v+1Nhaj+fSw+rFSUm0KPqms0UDiT0Ol7ltu84ifadYqubbSEbqU= =QBvR -----END PGP SIGNATURE----- Merge tag 'nfs-for-6.12-2' of git://git.linux-nfs.org/projects/anna/linux-nfs Pull NFS client fixes from Anna Schumaker: "Localio Bugfixes: - remove duplicated include in localio.c - fix race in NFS calls to nfsd_file_put_local() and nfsd_serv_put() - fix Kconfig for NFS_COMMON_LOCALIO_SUPPORT - fix nfsd_file tracepoints to handle NULL rqstp pointers Other Bugfixes: - fix program selection loop in svc_process_common - fix integer overflow in decode_rc_list() - prevent NULL-pointer dereference in nfs42_complete_copies() - fix CB_RECALL performance issues when using a large number of delegations" * tag 'nfs-for-6.12-2' 
of git://git.linux-nfs.org/projects/anna/linux-nfs: NFS: remove revoked delegation from server's delegation list nfsd/localio: fix nfsd_file tracepoints to handle NULL rqstp nfs_common: fix Kconfig for NFS_COMMON_LOCALIO_SUPPORT nfs_common: fix race in NFS calls to nfsd_file_put_local() and nfsd_serv_put() NFSv4: Prevent NULL-pointer dereference in nfs42_complete_copies() SUNRPC: Fix integer overflow in decode_rc_list() sunrpc: fix prog selection loop in svc_process_common nfs: Remove duplicated include in localio.c	2024-10-11 15:37:15 -07:00
Roi Martin	2ab5e243c2	btrfs: fix uninitialized pointer free on read_alloc_one_name() error The function read_alloc_one_name() does not initialize the name field of the passed fscrypt_str struct if kmalloc fails to allocate the corresponding buffer. Thus, it is not guaranteed that fscrypt_str.name is initialized when freeing it. This is a follow-up to the linked patch that fixes the remaining instances of the bug introduced by commit `e43eec81c5` ("btrfs: use struct qstr instead of name and namelen pairs"). Link: https://lore.kernel.org/linux-btrfs/20241009080833.1355894-1-jroi.martin@gmail.com/ Fixes: `e43eec81c5` ("btrfs: use struct qstr instead of name and namelen pairs") CC: stable@vger.kernel.org # 6.1+ Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Roi Martin <jroi.martin@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-10-11 19:55:04 +02:00
Christian Heusel	a0af4936e4	btrfs: send: cleanup unneeded return variable in changed_verity() As all changed_* functions need to return something, just return 0 directly here, as the verity status is passed via the context. Reported by LKP: fs/btrfs/send.c:6877:5-8: Unneeded variable: "ret". Return "0" on line 6883 Reported-by: kernel test robot <lkp@intel.com> Link: https://lore.kernel.org/oe-kbuild-all/202410092305.WbyqspH8-lkp@intel.com/ Signed-off-by: Christian Heusel <christian@heusel.eu> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-10-11 19:54:58 +02:00
Roi Martin	66691c6e2f	btrfs: fix uninitialized pointer free in add_inode_ref() The add_inode_ref() function does not initialize the "name" struct when it is declared. If any of the following calls to read_one_inode() returns NULL, dir = read_one_inode(root, parent_objectid); if (!dir) { ret = -ENOENT; goto out; } inode = read_one_inode(root, inode_objectid); if (!inode) { ret = -EIO; goto out; } then "name.name" would be freed on "out" before being initialized. out: ... kfree(name.name); This issue was reported by Coverity with CID 1526744. Fixes: `e43eec81c5` ("btrfs: use struct qstr instead of name and namelen pairs") CC: stable@vger.kernel.org # 6.6+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Roi Martin <jroi.martin@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-10-11 19:54:52 +02:00
Filipe Manana	97420be7bd	btrfs: use sector numbers as keys for the dirty extents xarray We are using the logical address ("bytenr") of an extent as the key for qgroup records in the dirty extents xarray. This is a problem because the xarrays use "unsigned long" for keys/indices, meaning that on a 32-bit platform the key of any extent starting at or beyond 4G is truncated, which is far too low a limit since virtually everyone uses storage with more than 4G of space. This means a "bytenr" of 4G gets truncated to 0, and so does 8G and 16G for example, resulting in incorrect qgroup accounting. Fix this by using sector numbers as keys instead, that is, using keys that match the logical address right shifted by fs_info->sectorsize_bits, which is what we do for the fs_info->buffer_radix that tracks extent buffers (radix trees also use an "unsigned long" type for keys). This also makes the index space more dense which helps optimize the xarray (as mentioned at Documentation/core-api/xarray.rst). Fixes: `3cce39a8ca` ("btrfs: qgroup: use xarray to track dirty extents in transaction") Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-10-11 18:33:35 +02:00
Namjae Jeon	a77e0e02af	ksmbd: add support for supplementary groups Even though a system user has a supplementary group, it gets NT_STATUS_ACCESS_DENIED when attempting to create a file or directory. This patch adds KSMBD_EVENT_LOGIN_REQUEST_EXT/RESPONSE_EXT netlink events to get the supplementary groups list. The new netlink event doesn't break backward compatibility when using old ksmbd-tools. Co-developed-by: Atte Heikkilä <atteh.mailbox@gmail.com> Signed-off-by: Atte Heikkilä <atteh.mailbox@gmail.com> Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>	2024-10-11 11:02:14 -05:00
Jaegeuk Kim	332fade75d	f2fs: allow parallel DIO reads This fixes a regression which prevents parallel DIO reads. Fixes: `0cac51185e` ("f2fs: fix to avoid racing in between read and OPU dio write") Reviewed-by: Daeho Jeong <daehojeong@google.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2024-10-11 15:12:07 +00:00
Darrick J. Wong	0fb823f1cf	xfs: fix integer overflow in xrep_bmap The variable declaration in this function predates the merge of the nrext64 (aka 64-bit extent counters) feature, which means that the variable declaration type is insufficient to avoid an integer overflow. Fix that by redeclaring the variable to be xfs_extnum_t. Coverity-id: 1630958 Fixes: `8f71bede8e` ("xfs: repair inode fork block mapping data structures") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2024-10-11 12:32:48 +02:00
Gao Xiang	ae54567eaa	erofs: get rid of kaddr in `struct z_erofs_maprecorder` `kaddr` becomes useless after switching to metabuf. Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20241010235830.1535616-1-hsiangkao@linux.alibaba.com	2024-10-11 13:36:58 +08:00
Gao Xiang	2402082e53	erofs: get rid of z_erofs_try_to_claim_pcluster() Just fold it into the caller for simplicity. Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20241010090420.405871-1-hsiangkao@linux.alibaba.com	2024-10-11 13:36:58 +08:00
Gao Xiang	416a8b2c02	erofs: ensure regular inodes for file-backed mounts Only regular inodes are allowed for file-backed mounts, not directories (as seen in the original syzbot case) or special inodes. Also ensure that .read_folio() is implemented on the underlying fs for the primary device. Fixes: `fb17675026` ("erofs: add file-backed mount support") Reported-by: syzbot+001306cd9c92ce0df23f@syzkaller.appspotmail.com Closes: https://lore.kernel.org/r/00000000000011bdde0622498ee3@google.com Tested-by: syzbot+001306cd9c92ce0df23f@syzkaller.appspotmail.com Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20240917130803.32418-1-hsiangkao@linux.alibaba.com	2024-10-11 13:36:41 +08:00
Linus Torvalds	eb952c47d1	for-6.12-rc2-tag -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmcH4qAACgkQxWXV+ddt WDtDig//czntfO+iRvDERZWTIB6vdVExfLd3r3ZNYlO1pIvgCuvqx3iYva+0ZhGW 8A+gcRax7cz0jCaxDp/+5lIRfdNxZH6/LwjZsDgU8Ly7himeRmwhtn2fCgNeiH/K bUl92+ZMo2vwqTKXYa3xF1g3Hz6cRXVW7gJrMwNhb1hpPTGx+lgYJU02m/Io/vjK 1jcrZ84OEPIOY5uiAoDyO2hgsT/zVEeuuOiSTpKSzrghPbo0vmjLiYJ5T+CE5Uw3 u3w7/Fqnw49NwucqtncvyFFDXY9EWNuQhowi3hqJgOYTInqwwJigIpQV0hDDwYxb ohGUGjazGfAEf/cy1jZXMbwCVgg8/Nj9x0eDKKhfs19VYUbMkEYQ8LKRTUlCeBwS H/2AmqpqHEEO+tPY3P+w6MVwkNho8JNpWPdP5OzJs7XrD067IViOjD06HPM/k5ci TU3zp9NYvgHVtmfZK1Aqsg9OYVhI1klVXejmlAzOLxejRPWXK/1hBw3kXbC6I+k1 50l0Yh1dgEnclMI3yWsKoj8IYUAkh2eudt0pNsot4a5vICMY++NVS2eukdz5UcEz ix7hcpYcCcmzoOaelyEgmdAncWVGJT5w2Nzy85YaOp+Z1C65Ywb41utU+sSY+swB kZfwl9vrsfu754vX7UKBherCvvYo+Lnj3GeX8Oe+1LoT2BP0TPk= =lTqc -----END PGP SIGNATURE----- Merge tag 'for-6.12-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: - update fstrim loop and add more cancellation points, fix reported delayed or blocked suspend if there's a huge chunk queued - fix error handling in recent qgroup xarray conversion - in zoned mode, fix warning printing device path without RCU protection - again fix invalid extent xarray state (`6252690f7e`), lost due to refactoring * tag 'for-6.12-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: fix clear_dirty and writeback ordering in submit_one_sector() btrfs: zoned: fix missing RCU locking in error message when loading zone info btrfs: fix missing error handling when adding delayed ref with qgroups enabled btrfs: add cancellation points to trim loops btrfs: split remaining space to discard in chunks	2024-10-10 10:02:59 -07:00
Linus Torvalds	5870963f6c	nfsd-6.12 fixes: - Fix NFSD bring-up / shutdown - Fix a UAF when releasing a stateid -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmcHx6IACgkQM2qzM29m f5ecExAAheSpSP9D+1n3hKeOlNfhY8FUzd41Arn4NYV+jIBJbtx94/FlSNOxA0mp Ovm1I8uAxy4TR8TLt7tsxfT7JDOStKwXFl3QlOUZT/+uyyJr7/q5R959R3oMiccR Rfrpj6j2yYWrI8qGDGHca4Vv2bDSxr4mzztWwDe0SHSsjwf4OAcv+XF5vcfZ/CJN Bxulb9WfNU8XvdFcRDHokMfk6jiY6/+FCTwX8ckvbVEG6gHT8+CRYSUJ05j0LJGo xKZV913NgzcuV7PH0vq6vExJE6+rEPt/ejDAT5FM5yeNe+WJ4RTDgsYyIr9iLbHF mWB9M4NnP+EZhejtOCbZ9RZjjKro09ilEPpqILuuGQPtcHSeWmhNbFz0kwLe+zYZ CdtjnPZhjB0ITWgZ1HCtoJ8k/ZcMa7iiM/kApMLGr9fVj8/BHHFzS95PK7K/Fqur FLdhvo6CzZCnRd16e2kqWsG7wO2lPWcz4NWTf9wxIG5GCunXoVCEnK1VfHvnldbH BIFXZ+ib5qnL2i3Qmz7bQxmfIp5ryZnNx1mF0OM8imR9K/rsnARd7JfQ99lpMy8D mD4coZVTMMk/Zg9zuH8k5GBzB2zXXqgngp4IJIxqrKR7/AsuSU3R7r+O9CWN91GQ GKpRtMn/rVUg81jxDr3qoKquyxONoyVrVXAKsj1PgUSQdjUJgqU= =Rud7 -----END PGP SIGNATURE----- Merge tag 'nfsd-6.12-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux Pull nfsd fixes from Chuck Lever: - Fix NFSD bring-up / shutdown - Fix a UAF when releasing a stateid * tag 'nfsd-6.12-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: nfsd: fix possible badness in FREE_STATEID nfsd: nfsd_destroy_serv() must call svc_destroy() even if nfsd_startup_net() failed NFSD: Mark filecache "down" if init fails	2024-10-10 09:52:49 -07:00
Linus Torvalds	825ec756af	Bug fixes for 6.12-rc3 * A few small typo fixes * fstests xfs/538 DEBUG-only fix * Performance fix on blockgc on COW'ed files, by skipping trims on cowblock inodes currently opened for write * Prevent cowblocks to be freed under dirty pagecache during unshare * Update MAINTAINERS file to quote the new maintainer Signed-off-by: Carlos Maiolino <cem@kernel.org> -----BEGIN PGP SIGNATURE----- iJUEABMJAB0WIQQMHYkcUKcy4GgPe2RGdaER5QtfpgUCZwY6mgAKCRBGdaER5Qtf poE8AYCZzMJr9wMrs2RsWRnaEhMRJNZIPQmSKXgHAK3mV5AbXtdHRc8yGVNHf+mW Nh0fwAkBf1Ix0VJWkXOSFHZI9O2lLRsCogbNjFhwYF0MHZch2/mq1Wa4Tj1SDlfg Ny2PJBNHyA== =OkRo -----END PGP SIGNATURE----- Merge tag 'xfs-6.12-fixes-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux Pull xfs fixes from Carlos Maiolino: - A few small typo fixes - fstests xfs/538 DEBUG-only fix - Performance fix on blockgc on COW'ed files, by skipping trims on cowblock inodes currently opened for write - Prevent cowblocks to be freed under dirty pagecache during unshare - Update MAINTAINERS file to quote the new maintainer * tag 'xfs-6.12-fixes-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: xfs: fix a typo xfs: don't free cowblocks from under dirty pagecache on unshare xfs: skip background cowblock trims on inodes open for write xfs: support lowmode allocations in xfs_bmap_exact_minlen_extent_alloc xfs: call xfs_bmap_exact_minlen_extent_alloc from xfs_bmap_btalloc xfs: don't ifdef around the exact minlen allocations xfs: fold xfs_bmap_alloc_userdata into xfs_bmapi_allocate xfs: distinguish extra split from real ENOSPC from xfs_attr_node_try_addname xfs: distinguish extra split from real ENOSPC from xfs_attr3_leaf_split xfs: return bool from xfs_attr3_leaf_add xfs: merge xfs_attr_leaf_try_add into xfs_attr_leaf_addname xfs: Use try_cmpxchg() in xlog_cil_insert_pcp_aggregate() xfs: scrub: convert comma to semicolon xfs: Remove empty declartion in header file MAINTAINERS: add Carlos Maiolino as XFS release manager	2024-10-10 09:45:45 -07:00
Aleksa Sarai	f92f0a1b05	openat2: explicitly return -E2BIG for (usize > PAGE_SIZE) While we do currently return -EFAULT in this case, it seems prudent to follow the behaviour of other syscalls like clone3. It seems quite unlikely that anyone depends on this error code being EFAULT, but we can always revert this if it turns out to be an issue. Cc: stable@vger.kernel.org # v5.6+ Fixes: `fddb5d430a` ("open: introduce openat2(2) syscall") Signed-off-by: Aleksa Sarai <cyphar@cyphar.com> Link: https://lore.kernel.org/r/20241010-extensible-structs-check_fields-v3-3-d2833dfe6edd@cyphar.com Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-10-10 12:09:03 +02:00
Christian Brauner	b40508ca5d	Merge patch series "timekeeping/fs: multigrain timestamp redux" Jeff Layton <jlayton@kernel.org> says: The VFS has always used coarse-grained timestamps when updating the ctime and mtime after a change. This has the benefit of allowing filesystems to optimize away a lot of metadata updates, down to around 1 per jiffy, even when a file is under heavy writes. Unfortunately, this has always been an issue when we're exporting via NFSv3, which relies on timestamps to validate caches. A lot of changes can happen in a jiffy, so timestamps aren't sufficient to help the client decide when to invalidate the cache. Even with NFSv4, a lot of exported filesystems don't properly support a change attribute and are subject to the same problems with timestamp granularity. Other applications have similar issues with timestamps (e.g. backup applications). If we were to always use fine-grained timestamps, that would improve the situation, but that becomes rather expensive, as the underlying filesystem would have to log a lot more metadata updates. What we need is a way to only use fine-grained timestamps when they are being actively queried. Use the (unused) top bit in inode->i_ctime_nsec as a flag that indicates whether the current timestamps have been queried via stat() or the like. When it's set, we allow the kernel to use a fine-grained timestamp iff it's necessary to make the ctime show a different value. This solves the problem of being able to distinguish the timestamp between updates, but introduces a new problem: it's now possible for a file being changed to get a fine-grained timestamp. A file that is altered just a bit later can then get a coarse-grained one that appears older than the earlier fine-grained time. This violates timestamp ordering guarantees. To remedy this, keep a global monotonic atomic64_t value that acts as a timestamp floor. 
When we go to stamp a file, we first get the later of the current floor value and the current coarse-grained time. If the inode ctime hasn't been queried then we just attempt to stamp it with that value. If it has been queried, then first see whether the current coarse time is later than the existing ctime. If it is, then we accept that value. If it isn't, then we get a fine-grained time and try to swap that into the global floor. Whether that succeeds or fails, we take the resulting floor time, convert it to realtime and try to swap that into the ctime. We take the result of the ctime swap whether it succeeds or fails, since either is just as valid. Filesystems can opt into this by setting the FS_MGTIME fstype flag. Others should be unaffected (other than being subject to the same floor value as multigrain filesystems). * patches from https://lore.kernel.org/r/20241002-mgtime-v10-0-d1c4717f5284@kernel.org: tmpfs: add support for multigrain timestamps btrfs: convert to multigrain timestamps ext4: switch to multigrain timestamps xfs: switch to multigrain timestamps Documentation: add a new file documenting multigrain timestamps fs: add percpu counters for significant multigrain timestamp events fs: tracepoints around multigrain timestamp events fs: handle delegated timestamps in setattr_copy_mgtime fs: have setattr_copy handle multigrain timestamps appropriately fs: add infrastructure for multigrain timestamps Link: https://lore.kernel.org/r/20241002-mgtime-v10-0-d1c4717f5284@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-10-10 10:20:57 +02:00
Jeff Layton	e2e801d6e6	btrfs: convert to multigrain timestamps Enable multigrain timestamps, which should ensure that there is an apparent change to the timestamp whenever it has been written after being actively observed via getattr. Beyond enabling the FS_MGTIME flag, this patch eliminates update_time_for_write, which goes to great pains to avoid in-memory stores. Just have it overwrite the timestamps unconditionally. Note that this also drops the IS_I_VERSION check and unconditionally bumps the change attribute, since SB_I_VERSION is always set on btrfs. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Tested-by: Randy Dunlap <rdunlap@infradead.org> # documentation bits Signed-off-by: Jeff Layton <jlayton@kernel.org> Link: https://lore.kernel.org/r/20241002-mgtime-v10-11-d1c4717f5284@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-10-10 10:20:53 +02:00
Jeff Layton	d0382c698f	ext4: switch to multigrain timestamps Enable multigrain timestamps, which should ensure that there is an apparent change to the timestamp whenever it has been written after being actively observed via getattr. For ext4, we only need to enable the FS_MGTIME flag. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Jan Kara <jack@suse.cz> Tested-by: Randy Dunlap <rdunlap@infradead.org> # documentation bits Signed-off-by: Jeff Layton <jlayton@kernel.org> Link: https://lore.kernel.org/r/20241002-mgtime-v10-10-d1c4717f5284@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-10-10 10:20:53 +02:00
Jeff Layton	1cf7e834a6	xfs: switch to multigrain timestamps Enable multigrain timestamps, which should ensure that there is an apparent change to the timestamp whenever it has been written after being actively observed via getattr. Also, anytime the mtime changes, the ctime must also change, and those are now the only two options for xfs_trans_ichgtime. Have that function unconditionally bump the ctime, and ASSERT that XFS_ICHGTIME_CHG is always set. Finally, stop setting STATX_CHANGE_COOKIE in getattr, since the ctime should give us better semantics now. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Tested-by: Randy Dunlap <rdunlap@infradead.org> # documentation bits Signed-off-by: Jeff Layton <jlayton@kernel.org> Link: https://lore.kernel.org/r/20241002-mgtime-v10-9-d1c4717f5284@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-10-10 10:20:52 +02:00
Jeff Layton	73a47cf40f	fs: add percpu counters for significant multigrain timestamp events New percpu counters for counting various stats around multigrain timestamp events, and a new debugfs file for displaying them when CONFIG_DEBUG_FS is enabled: - number of attempted ctime updates - number of successful i_ctime_nsec swaps - number of fine-grained timestamp fetches - number of floor value swap events Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Tested-by: Randy Dunlap <rdunlap@infradead.org> # documentation bits Signed-off-by: Jeff Layton <jlayton@kernel.org> Link: https://lore.kernel.org/r/20241002-mgtime-v10-7-d1c4717f5284@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-10-10 10:20:52 +02:00
Jeff Layton	c86e3c4718	fs: tracepoints around multigrain timestamp events Add some tracepoints around various multigrain timestamp events. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Tested-by: Randy Dunlap <rdunlap@infradead.org> # documentation bits Signed-off-by: Jeff Layton <jlayton@kernel.org> Link: https://lore.kernel.org/r/20241002-mgtime-v10-6-d1c4717f5284@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-10-10 10:20:52 +02:00
Jeff Layton	7f2c86cba3	fs: handle delegated timestamps in setattr_copy_mgtime An update to the inode ctime typically requires the latest clock value possible. The exception to this rule is when there is a nfsd write delegation and the server is proxying timestamps from the client. When nfsd gets a CB_GETATTR response, update the timestamp value in the inode to the values that the client is tracking. The client doesn't send a ctime value (since that's always determined by the exported filesystem), but it can send a mtime value. In the case where it does, update the ctime to a value commensurate with that instead of the current time. If ATTR_DELEG is set, then use ia_ctime value instead of setting the timestamp to the current time. With the addition of delegated timestamps, the server may receive a request to update only the atime, which doesn't involve a ctime update. Trust the ATTR_CTIME flag in the update and only update the ctime when it's set. Tested-by: Randy Dunlap <rdunlap@infradead.org> # documentation bits Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Jeff Layton <jlayton@kernel.org> Link: https://lore.kernel.org/r/20241002-mgtime-v10-5-d1c4717f5284@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-10-10 10:20:51 +02:00
Namjae Jeon	7aa8804c0b	ksmbd: fix user-after-free from session log off There is a race between smb2 session log off and smb2 session setup. It can cause a use-after-free from session log off. This adds session_lock when setting SMB2_SESSION_EXPIRED and a reference count to the session struct so that the session is not freed while it is being used. Cc: stable@vger.kernel.org # v5.15+ Reported-by: zdi-disclosures@trendmicro.com # ZDI-CAN-25282 Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>	2024-10-09 21:23:17 -05:00
Linus Torvalds	d3d1556696	12 hotfixes, 5 of which are c:stable. All singletons, about half of which are MM. -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZwcILgAKCRDdBJ7gKXxA jjnMAQDRl+UfscRUeMippi7wnL3ee6MKyhhZVOhoxP24uB7yBwD/Ulq4oE+mLHml YTlK/wj5qTZIsdxGaBzM1yifqp3L7gU= =lFmJ -----END PGP SIGNATURE----- Merge tag 'mm-hotfixes-stable-2024-10-09-15-46' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "12 hotfixes, 5 of which are c:stable. All singletons, about half of which are MM" * tag 'mm-hotfixes-stable-2024-10-09-15-46' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: mm: zswap: delete comments for "value" member of 'struct zswap_entry'. CREDITS: sort alphabetically by name secretmem: disable memfd_secret() if arch cannot set direct map .mailmap: update Fangrui's email mm/huge_memory: check pmd_special() only after pmd_present() resource, kunit: fix user-after-free in resource_test_region_intersects() fs/proc/kcore.c: allow translation of physical memory addresses selftests/mm: fix incorrect buffer->mirror size in hmm2 double_map test device-dax: correct pgoff align in dax_set_mapping() kthread: unpark only parked kthread Revert "mm: introduce PF_MEMALLOC_NORECLAIM, PF_MEMALLOC_NOWARN" bcachefs: do not use PF_MEMALLOC_NORECLAIM	2024-10-09 16:01:40 -07:00
Kent Overstreet	3b80552e70	bcachefs: __wait_for_freeing_inode: Switch to wait_bit_queue_entry inode_bit_waitqueue() is changing - this update clears the way for sched changes. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-09 16:58:18 -04:00
Kent Overstreet	a7e2dd58fb	bcachefs: Check if stuck in journal_res_get() Like how we already do when the allocator seems to be stuck, check if we're waiting too long for a journal reservation and print some debug info. This is specifically to track down https://github.com/koverstreet/bcachefs/issues/656 which is showing up in userspace where we don't have sysfs/debugfs to get the journal debug info. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-09 16:57:59 -04:00
Alan Huang	9205d24cf7	bcachefs: Fix state lock involved deadlock We increased write ref, if the fs went to RO, that would lead to a deadlock, it actually happens: 00171 ========= TEST generic/279 00171 00172 bcachefs (vdb): starting version 1.12: rebalance_work_acct_fix opts=nocow 00172 bcachefs (vdb): recovering from clean shutdown, journal seq 35 00172 bcachefs (vdb): accounting_read... done 00172 bcachefs (vdb): alloc_read... done 00172 bcachefs (vdb): stripes_read... done 00172 bcachefs (vdb): snapshots_read... done 00172 bcachefs (vdb): journal_replay... done 00172 bcachefs (vdb): resume_logged_ops... done 00172 bcachefs (vdb): going read-write 00172 bcachefs (vdb): done starting filesystem 00172 FSTYP -- bcachefs 00172 PLATFORM -- Linux/aarch64 farm3-kvm 6.11.0-rc1-ktest-g3e290a0b8e34 #7030 SMP Tue Oct 8 14:15:12 UTC 2024 00172 MKFS_OPTIONS -- --nocow /dev/vdc 00172 MOUNT_OPTIONS -- /dev/vdc /mnt/scratch 00172 00172 bcachefs (vdc): starting version 1.12: rebalance_work_acct_fix opts=nocow 00172 bcachefs (vdc): initializing new filesystem 00172 bcachefs (vdc): going read-write 00172 bcachefs (vdc): marking superblocks 00172 bcachefs (vdc): initializing freespace 00172 bcachefs (vdc): done initializing freespace 00172 bcachefs (vdc): reading snapshots table 00172 bcachefs (vdc): reading snapshots done 00172 bcachefs (vdc): done starting filesystem 00173 bcachefs (vdc): shutting down 00173 bcachefs (vdc): going read-only 00173 bcachefs (vdc): finished waiting for writes to stop 00173 bcachefs (vdc): flushing journal and stopping allocators, journal seq 4 00173 bcachefs (vdc): flushing journal and stopping allocators complete, journal seq 6 00173 bcachefs (vdc): shutdown complete, journal seq 7 00173 bcachefs (vdc): marking filesystem clean 00173 bcachefs (vdc): shutdown complete 00173 bcachefs (vdb): shutting down 00173 bcachefs (vdb): going read-only 00361 INFO: task umount:6180 blocked for more than 122 seconds. 
00361 Not tainted 6.11.0-rc1-ktest-g3e290a0b8e34 #7030 00361 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 00361 task:umount state:D stack:0 pid:6180 tgid:6180 ppid:6176 flags:0x00000004 00361 Call trace: 00362 __switch_to (arch/arm64/kernel/process.c:556) 00362 __schedule (kernel/sched/core.c:5191 kernel/sched/core.c:6529) 00363 schedule (include/asm-generic/bitops/generic-non-atomic.h:128 include/linux/thread_info.h:192 include/linux/sched.h:2084 kernel/sched/core.c:6608 kernel/sched/core.c:6621) 00365 bch2_fs_read_only (fs/bcachefs/super.c:346 (discriminator 41)) 00367 __bch2_fs_stop (fs/bcachefs/super.c:620) 00368 bch2_put_super (fs/bcachefs/fs.c:1942) 00369 generic_shutdown_super (include/linux/list.h:373 (discriminator 2) fs/super.c:650 (discriminator 2)) 00371 bch2_kill_sb (fs/bcachefs/fs.c:2170) 00372 deactivate_locked_super (fs/super.c:434 fs/super.c:475) 00373 deactivate_super (fs/super.c:508) 00374 cleanup_mnt (fs/namespace.c:250 fs/namespace.c:1374) 00376 __cleanup_mnt (fs/namespace.c:1381) 00376 task_work_run (include/linux/sched.h:2024 kernel/task_work.c:224) 00377 do_notify_resume (include/linux/resume_user_mode.h:50 arch/arm64/kernel/entry-common.c:151) 00377 el0_svc (arch/arm64/include/asm/daifflags.h:28 arch/arm64/kernel/entry-common.c:171 arch/arm64/kernel/entry-common.c:178 arch/arm64/kernel/entry-common.c:713) 00377 el0t_64_sync_handler (arch/arm64/kernel/entry-common.c:731) 00378 el0t_64_sync (arch/arm64/kernel/entry.S:598) 00378 INFO: task tee:6182 blocked for more than 122 seconds. 00378 Not tainted 6.11.0-rc1-ktest-g3e290a0b8e34 #7030 00378 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 
00378 task:tee state:D stack:0 pid:6182 tgid:6182 ppid:533 flags:0x00000004 00378 Call trace: 00378 __switch_to (arch/arm64/kernel/process.c:556) 00378 __schedule (kernel/sched/core.c:5191 kernel/sched/core.c:6529) 00378 schedule (include/asm-generic/bitops/generic-non-atomic.h:128 include/linux/thread_info.h:192 include/linux/sched.h:2084 kernel/sched/core.c:6608 kernel/sched/core.c:6621) 00378 schedule_preempt_disabled (kernel/sched/core.c:6680) 00379 rwsem_down_read_slowpath (kernel/locking/rwsem.c:1073 (discriminator 1)) 00379 down_read (kernel/locking/rwsem.c:1529) 00381 bch2_gc_gens (fs/bcachefs/sb-members.h:77 fs/bcachefs/sb-members.h:88 fs/bcachefs/sb-members.h:128 fs/bcachefs/btree_gc.c:1240) 00383 bch2_fs_store_inner (fs/bcachefs/sysfs.c:473) 00385 bch2_fs_internal_store (fs/bcachefs/sysfs.c:417 fs/bcachefs/sysfs.c:580 fs/bcachefs/sysfs.c:576) 00386 sysfs_kf_write (fs/sysfs/file.c:137) 00387 kernfs_fop_write_iter (fs/kernfs/file.c:334) 00389 vfs_write (fs/read_write.c:497 fs/read_write.c:590) 00390 ksys_write (fs/read_write.c:643) 00391 __arm64_sys_write (fs/read_write.c:652) 00391 invoke_syscall.constprop.0 (arch/arm64/include/asm/syscall.h:61 arch/arm64/kernel/syscall.c:54) 00392 do_el0_svc (include/linux/thread_info.h:127 (discriminator 2) arch/arm64/kernel/syscall.c:140 (discriminator 2) arch/arm64/kernel/syscall.c:151 (discriminator 2)) 00392 el0_svc (arch/arm64/include/asm/irqflags.h:55 arch/arm64/include/asm/irqflags.h:76 arch/arm64/kernel/entry-common.c:165 arch/arm64/kernel/entry-common.c:178 arch/arm64/kernel/entry-common.c:713) 00392 el0t_64_sync_handler (arch/arm64/kernel/entry-common.c:731) 00392 el0t_64_sync (arch/arm64/kernel/entry.S:598) Signed-off-by: Alan Huang <mmpgouride@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-09 16:42:54 -04:00
Mohammed Anees	a30f32222d	bcachefs: Fix NULL pointer dereference in bch2_opt_to_text This patch adds a bounds check to the bch2_opt_to_text function to prevent NULL pointer dereferences when accessing the opt->choices array. This ensures that the index used is within valid bounds before dereferencing. The new version enhances the readability. Reported-and-tested-by: syzbot+37186860aa7812b331d5@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=37186860aa7812b331d5 Signed-off-by: Mohammed Anees <pvmohammedanees2003@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-09 16:42:53 -04:00
Alan Huang	a154154148	bcachefs: Release transaction before wake up We will get this if we wake up first: Kernel panic - not syncing: btree_node_write_done leaked btree_trans since there are still transactions waiting for cycle detectors after BTREE_NODE_write_in_flight is cleared. Signed-off-by: Alan Huang <mmpgouride@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-09 16:42:53 -04:00
Piotr Zalewski	0151d10a48	bcachefs: add check for btree id against max in try read node Add check for read node's btree_id against BTREE_ID_NR_MAX in try_read_btree_node to prevent triggering EBUG_ON condition in bch2_btree_id_root[1]. [1] https://syzkaller.appspot.com/bug?extid=cf7b2215b5d70600ec00 Reported-by: syzbot+cf7b2215b5d70600ec00@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=cf7b2215b5d70600ec00 Fixes: `4409b8081d` ("bcachefs: Repair pass for scanning for btree nodes") Signed-off-by: Piotr Zalewski <pZ010001011111@proton.me> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-09 16:42:53 -04:00
Kent Overstreet	19773ec997	bcachefs: Disk accounting device validation fixes - Fix failure to validate that accounting replicas entries point to valid devices: this wasn't a real bug since they'd be cleaned up by GC, but is still something we should know about - Fix failure to validate that dev_data_type entries point to valid devices: this does fix a real bug, since bch2_accounting_read() would then try to copy the counters to that device and pop an inconsistent error when the device didn't exist - Remove accounting entries that are zeroed or invalid: if we're not validating them we need to get rid of them: they might not exist in the superblock, so we need the to trigger the superblock mark path when they're readded. This fixes the replication.ktest rereplicate test, which was failing with "superblock not marked for replicas..." Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-09 16:42:53 -04:00
Kent Overstreet	9d86178782	bcachefs: bch2_inode_or_descendents_is_open() fsck can now correctly check if inodes in interior snapshot nodes are open/in use. - Tweak the vfs inode rhashtable so that the subvolume ID isn't hashed, meaning inums in different subvolumes will hash to the same slot. Note that this is a hack, and will cause problems if anyone ever has the same file in many different snapshots open all at the same time. - Then check if any of those subvolumes is a descendent of the snapshot ID being checked Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-09 16:42:53 -04:00
Kent Overstreet	84878e8245	bcachefs: Kill bch2_propagate_key_to_snapshot_leaves() Dead code now. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-09 16:42:53 -04:00
Kent Overstreet	9b23fdbd5d	bcachefs: bcachefs_metadata_version_inode_has_child_snapshots There's an inherent race in taking a snapshot while an unlinked file is open, and then reattaching it in the child snapshot. In the interior snapshot node the file will appear unlinked, as though it should be deleted - it's not referenced by anything in that snapshot - but we can't delete it, because the file data is referenced by the child snapshot. This was being handled incorrectly with propagate_key_to_snapshot_leaves() - but that doesn't resolve the fundamental inconsistency of "this file looks like it should be deleted according to normal rules, but - ". To fix this, we need to fix the rule for when an inode is deleted. The previous rule, ignoring snapshots (there was no well-defined rule for with snapshots) was: Unlinked, non open files are deleted, either at recovery time or during online fsck The new rule is: Unlinked, non open files, that do not exist in child snapshots, are deleted. To make this work transactionally, we add a new inode flag, BCH_INODE_has_child_snapshot; it overrides BCH_INODE_unlinked when considering whether to delete an inode, or put it on the deleted list. For transactional consistency, clearing it handled by the inode trigger: when deleting an inode we check if there are parent inodes which can now have the BCH_INODE_has_child_snapshot flag cleared. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-09 16:42:51 -04:00
Alexander Gordeev	3d5854d75e	fs/proc/kcore.c: allow translation of physical memory addresses When /proc/kcore is read an attempt to read the first two pages results in HW-specific page swap on s390 and another (so called prefix) pages are accessed instead. That leads to a wrong read. Allow architecture-specific translation of memory addresses using kc_xlate_dev_mem_ptr() and kc_unxlate_dev_mem_ptr() callbacks similarly to /dev/mem xlate_dev_mem_ptr() and unxlate_dev_mem_ptr() callbacks. That way an architecture can deal with specific physical memory ranges. Re-use the existing /dev/mem callback implementation on s390, which handles the described prefix pages swapping correctly. For other architectures the default callback is basically NOP. It is expected the condition (vaddr == __va(__pa(vaddr))) always holds true for KCORE_RAM memory type. Link: https://lkml.kernel.org/r/20240930122119.1651546-1-agordeev@linux.ibm.com Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> Suggested-by: Heiko Carstens <hca@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-10-09 12:47:19 -07:00
Michal Hocko	9897713fe1	bcachefs: do not use PF_MEMALLOC_NORECLAIM Patch series "remove PF_MEMALLOC_NORECLAIM" v3. This patch (of 2): bch2_new_inode relies on PF_MEMALLOC_NORECLAIM to try to allocate a new inode to achieve GFP_NOWAIT semantic while holding locks. If this allocation fails it will drop locks and use GFP_NOFS allocation context. We would like to drop PF_MEMALLOC_NORECLAIM because it is really dangerous to use if the caller doesn't control the full call chain with this flag set. E.g. if any of the function down the chain needed GFP_NOFAIL request the PF_MEMALLOC_NORECLAIM would override this and cause unexpected failure. While this is not the case in this particular case using the scoped gfp semantic is not really needed because we can easily push the allocation context down the chain without too much clutter. [akpm@linux-foundation.org: fix kerneldoc warnings] Link: https://lkml.kernel.org/r/20240926172940.167084-1-mhocko@kernel.org Link: https://lkml.kernel.org/r/20240926172940.167084-2-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz> # For vfs changes Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: James Morris <jmorris@namei.org> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Paul Moore <paul@paul-moore.com> Cc: Serge E. Hallyn <serge@hallyn.com> Cc: Yafang Shao <laoar.shao@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-10-09 12:47:18 -07:00
Dai Ngo	7ef6010806	NFS: remove revoked delegation from server's delegation list After the delegation is returned to the NFS server remove it from the server's delegations list to reduce the time it takes to scan this list. Network trace captured while running the below script shows the time taken to service the CB_RECALL increases gradually due to the overhead of traversing the delegation list in nfs_delegation_find_inode_server. The NFS server in this test is a Solaris server which issues CB_RECALL when receiving the all-zero stateid in the SETATTR. mount=/mnt/data for i in $(seq 1 20) do echo $i mkdir $mount/testtarfile$i time tar -C $mount/testtarfile$i -xf 5000_files.tar done Signed-off-by: Dai Ngo <dai.ngo@oracle.com> Reviewed-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>	2024-10-09 15:39:22 -04:00
Linus Torvalds	ff9d4099e6	unicode updates * Patch to handle code-points with the Ignorable property as regular character instead of treating them as an empty string. (Me) Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de> -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQS3XO7QfvpFoONBhH1OwQgI3t8RJgUCZwbNtQAKCRBOwQgI3t8R JrlrAP4yCrZCp4YPlXO6oQGfS9RIeYpmcMzGmp1IAeqlzpB5qwD/YS53kiAzF4qV +eD2fl/O4qNhZcWqBZKSH4shZBbXJAg= =XCsY -----END PGP SIGNATURE----- Merge tag 'unicode-fixes-6.12-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/krisman/unicode Pull unicode fix from Gabriel Krisman Bertazi: - Handle code-points with the Ignorable property as regular character instead of treating them as an empty string (me) * tag 'unicode-fixes-6.12-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/krisman/unicode: unicode: Don't special case ignorable code points	2024-10-09 12:22:02 -07:00
Gabriel Krisman Bertazi	5c26d2f1d3	unicode: Don't special case ignorable code points We don't need to handle them separately. Instead, just let them decompose/casefold to themselves. Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>	2024-10-09 13:34:01 -04:00
Al Viro	6a8126f077	expand_files(): simplify calling conventions All callers treat 0 and 1 returned by expand_files() in the same way now since the call in alloc_fd() had been made conditional. Just make it return 0 on success and be done with it... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2024-10-09 11:28:55 -04:00
Al Viro	b8ea429d72	make __set_open_fd() set cloexec state as well ->close_on_exec[] state is maintained only for opened descriptors; as the result, anything that marks a descriptor opened has to set its cloexec state explicitly. As the result, all calls of __set_open_fd() are followed by __set_close_on_exec(); might as well fold it into __set_open_fd() so that cloexec state is defined as soon as the descriptor is marked opened. [braino fix folded] Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2024-10-09 11:28:06 -04:00
Naohiro Aota	e761be2a07	btrfs: fix clear_dirty and writeback ordering in submit_one_sector() This commit is a replay of commit `6252690f7e` ("btrfs: fix invalid mapping of extent xarray state"). We need to call btrfs_folio_clear_dirty() before btrfs_set_range_writeback(), so that xarray DIRTY tag is cleared. With a refactoring commit `8189197425` ("btrfs: refactor __extent_writepage_io() to do sector-by-sector submission"), it screwed up and the order is reversed and causing the same hang. Fix the ordering now in submit_one_sector(). Fixes: `8189197425` ("btrfs: refactor __extent_writepage_io() to do sector-by-sector submission") Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-10-09 13:23:51 +02:00
Filipe Manana	fe4cd7ed12	btrfs: zoned: fix missing RCU locking in error message when loading zone info At btrfs_load_zone_info() we have an error path that is dereferencing the name of a device which is a RCU string but we are not holding a RCU read lock, which is incorrect. Fix this by using btrfs_err_in_rcu() instead of btrfs_err(). The problem is there since commit `08e11a3db0` ("btrfs: zoned: load zone's allocation offset"), back then at btrfs_load_block_group_zone_info() but then later on that code was factored out into the helper btrfs_load_zone_info() by commit `09a46725cc` ("btrfs: zoned: factor out per-zone logic from btrfs_load_block_group_zone_info"). Fixes: `08e11a3db0` ("btrfs: zoned: load zone's allocation offset") Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-10-09 13:23:51 +02:00
Andrew Kreimer	77bfe1b11e	xfs: fix a typo Fix a typo in comments. Signed-off-by: Andrew Kreimer <algonell@gmail.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2024-10-09 10:05:26 +02:00
Brian Foster	4390f019ad	xfs: don't free cowblocks from under dirty pagecache on unshare fallocate unshare mode explicitly breaks extent sharing. When a command completes, it checks the data fork for any remaining shared extents to determine whether the reflink inode flag and COW fork preallocation can be removed. This logic doesn't consider in-core pagecache and I/O state, however, which means we can unsafely remove COW fork blocks that are still needed under certain conditions. For example, consider the following command sequence: xfs_io -fc "pwrite 0 1k" -c "reflink <file> 0 256k 1k" \ -c "pwrite 0 32k" -c "funshare 0 1k" <file> This allocates a data block at offset 0, shares it, and then overwrites it with a larger buffered write. The overwrite triggers COW fork preallocation, 32 blocks by default, which maps the entire 32k write to delalloc in the COW fork. All but the shared block at offset 0 remains hole mapped in the data fork. The unshare command redirties and flushes the folio at offset 0, removing the only shared extent from the inode. Since the inode no longer maps shared extents, unshare purges the COW fork before the remaining 28k may have written back. This leaves dirty pagecache backed by holes, which writeback quietly skips, thus leaving clean, non-zeroed pagecache over holes in the file. To verify, fiemap shows holes in the first 32k of the file and reads return different data across a remount: $ xfs_io -c "fiemap -v" <file> <file>: EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS ... 1: [8..511]: hole 504 ... $ xfs_io -c "pread -v 4k 8" <file> 00001000: cd cd cd cd cd cd cd cd ........ $ umount <mnt>; mount <dev> <mnt> $ xfs_io -c "pread -v 4k 8" <file> 00001000: 00 00 00 00 00 00 00 00 ........ To avoid this problem, make unshare follow the same rules used for background cowblock scanning and never purge the COW fork for inodes with dirty pagecache or in-flight I/O. 
Fixes: `46afb0628b` ("xfs: only flush the unshared range in xfs_reflink_unshare") Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2024-10-09 10:05:10 +02:00


95157 Commits