Bring in the changes to the file infrastructure for this cycle. Mostly
cleanups and some performance tweaks.
* file: remove __receive_fd()
* file: stop exposing receive_fd_user()
* fs: replace f_rcuhead with f_task_work
* file: remove pointless wrapper
* file: s/close_fd_get_file()/file_close_fd()/g
* Improve __fget_files_rcu() code generation (and thus __fget_light())
* file: massage cleanup of files that failed to open
Signed-off-by: Christian Brauner <brauner@kernel.org>
Do the replacement:
s/simply passs @nop_mnt_idmap/simply pass @nop_mnt_idmap/
in the fs/ tree.
Found by chance while working on support for idmapped mounts in fuse.
Cc: Jan Kara <jack@suse.cz>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: <linux-fsdevel@vger.kernel.org>
Cc: <linux-kernel@vger.kernel.org>
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Link: https://lore.kernel.org/r/20231215130927.136917-1-aleksandr.mikhalitsyn@canonical.com
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
A file that has never gotten FMODE_OPENED will never have RCU-accessed
references, its final fput() is equivalent to file_free() and if it
doesn't have FMODE_BACKING either, it can be done from any context and
won't need task_work treatment.
Now that we have SLAB_TYPESAFE_BY_RCU we can simplify this and have
other callers benefit. All of that can be achieved easier is to make
fput() recoginze that case and call file_free() directly.
No need to introduce a special primitive for that. It also allowed
things like failing dentry_open() could benefit from that as well.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
[Christian Brauner <brauner@kernel.org>: massage commit message]
Link: https://lore.kernel.org/r/20231126020834.GC38156@ZenIV
Signed-off-by: Christian Brauner <brauner@kernel.org>
... and fix the directory locking documentation and proof of correctness.
Holding ->s_vfs_rename_mutex *almost* prevents ->d_parent changes; the
case where we really don't want it is splicing the root of disconnected
tree to somewhere.
In other words, ->s_vfs_rename_mutex is sufficient to stabilize "X is an
ancestor of Y" only if X and Y are already in the same tree. Otherwise
it can go from false to true, and one can construct a deadlock on that.
Make lock_two_directories() report an error in such case and update the
callers of lock_rename()/lock_rename_child() to handle such errors.
And yes, such conditions are not impossible to create ;-/
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
We should never lock two subdirectories without having taken
->s_vfs_rename_mutex; inode pointer order or not, the "order" proposed
in 28eceeda13 "fs: Lock moved directories" is not transitive, with
the usual consequences.
The rationale for locking renamed subdirectory in all cases was
the possibility of race between rename modifying .. in a subdirectory to
reflect the new parent and another thread modifying the same subdirectory.
For a lot of filesystems that's not a problem, but for some it can lead
to trouble (e.g. the case when short directory contents is kept in the
inode, but creating a file in it might push it across the size limit
and copy its contents into separate data block(s)).
However, we need that only in case when the parent does change -
otherwise ->rename() doesn't need to do anything with .. entry in the
first place. Some instances are lazy and do a tautological update anyway,
but it's really not hard to avoid.
Amended locking rules for rename():
find the parent(s) of source and target
if source and target have the same parent
lock the common parent
else
lock ->s_vfs_rename_mutex
lock both parents, in ancestor-first order; if neither
is an ancestor of another, lock the parent of source
first.
find the source and target.
if source and target have the same parent
if operation is an overwriting rename of a subdirectory
lock the target subdirectory
else
if source is a subdirectory
lock the source
if target is a subdirectory
lock the target
lock non-directories involved, in inode pointer order if both
source and target are such.
That way we are guaranteed that parents are locked (for obvious reasons),
that any renamed non-directory is locked (nfsd relies upon that),
that any victim is locked (emptiness check needs that, among other things)
and subdirectory that changes parent is locked (needed to protect the update
of .. entries). We are also guaranteed that any operation locking more
than one directory either takes ->s_vfs_rename_mutex or locks a parent
followed by its child.
Cc: stable@vger.kernel.org
Fixes: 28eceeda13 "fs: Lock moved directories"
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZTpoQAAKCRCRxhvAZXjc
ovFNAQDgIRjXfZ1Ku+USxsRRdqp8geJVaNc3PuMmYhOYhUenqgEAmC1m+p0y31dS
P6+HlL16Mqgu0tpLCcJK9BibpDZ0Ew4=
=7yD1
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.7.misc' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs
Pull misc vfs updates from Christian Brauner:
"This contains the usual miscellaneous features, cleanups, and fixes
for vfs and individual fses.
Features:
- Rename and export helpers that get write access to a mount. They
are used in overlayfs to get write access to the upper mount.
- Print the pretty name of the root device on boot failure. This
helps in scenarios where we would usually only print
"unknown-block(1,2)".
- Add an internal SB_I_NOUMASK flag. This is another part in the
endless POSIX ACL saga in a way.
When POSIX ACLs are enabled via SB_POSIXACL the vfs cannot strip
the umask because if the relevant inode has POSIX ACLs set it might
take the umask from there. But if the inode doesn't have any POSIX
ACLs set then we apply the umask in the filesytem itself. So we end
up with:
(1) no SB_POSIXACL -> strip umask in vfs
(2) SB_POSIXACL -> strip umask in filesystem
The umask semantics associated with SB_POSIXACL allowed filesystems
that don't even support POSIX ACLs at all to raise SB_POSIXACL
purely to avoid umask stripping. That specifically means NFS v4 and
Overlayfs. NFS v4 does it because it delegates this to the server
and Overlayfs because it needs to delegate umask stripping to the
upper filesystem, i.e., the filesystem used as the writable layer.
This went so far that SB_POSIXACL is raised eve on kernels that
don't even have POSIX ACL support at all.
Stop this blatant abuse and add SB_I_NOUMASK which is an internal
superblock flag that filesystems can raise to opt out of umask
handling. That should really only be the two mentioned above. It's
not that we want any filesystems to do this. Ideally we have all
umask handling always in the vfs.
- Make overlayfs use SB_I_NOUMASK too.
- Now that we have SB_I_NOUMASK, stop checking for SB_POSIXACL in
IS_POSIXACL() if the kernel doesn't have support for it. This is a
very old patch but it's only possible to do this now with the wider
cleanup that was done.
- Follow-up work on fake path handling from last cycle. Citing mostly
from Amir:
When overlayfs was first merged, overlayfs files of regular files
and directories, the ones that are installed in file table, had a
"fake" path, namely, f_path is the overlayfs path and f_inode is
the "real" inode on the underlying filesystem.
In v6.5, we took another small step by introducing of the
backing_file container and the file_real_path() helper. This change
allowed vfs and filesystem code to get the "real" path of an
overlayfs backing file. With this change, we were able to make
fsnotify work correctly and report events on the "real" filesystem
objects that were accessed via overlayfs.
This method works fine, but it still leaves the vfs vulnerable to
new code that is not aware of files with fake path. A recent
example is commit db1d1e8b98 ("IMA: use vfs_getattr_nosec to get
the i_version"). This commit uses direct referencing to f_path in
IMA code that otherwise uses file_inode() and file_dentry() to
reference the filesystem objects that it is measuring.
This contains work to switch things around: instead of having
filesystem code opt-in to get the "real" path, have generic code
opt-in for the "fake" path in the few places that it is needed.
Is it far more likely that new filesystems code that does not use
the file_dentry() and file_real_path() helpers will end up causing
crashes or averting LSM/audit rules if we keep the "fake" path
exposed by default.
This change already makes file_dentry() moot, but for now we did
not change this helper just added a WARN_ON() in ovl_d_real() to
catch if we have made any wrong assumptions.
After the dust settles on this change, we can make file_dentry() a
plain accessor and we can drop the inode argument to ->d_real().
- Switch struct file to SLAB_TYPESAFE_BY_RCU. This looks like a small
change but it really isn't and I would like to see everyone on
their tippie toes for any possible bugs from this work.
Essentially we've been doing most of what SLAB_TYPESAFE_BY_RCU for
files since a very long time because of the nasty interactions
between the SCM_RIGHTS file descriptor garbage collection. So
extending it makes a lot of sense but it is a subtle change. There
are almost no places that fiddle with file rcu semantics directly
and the ones that did mess around with struct file internal under
rcu have been made to stop doing that because it really was always
dodgy.
I forgot to put in the link tag for this change and the discussion
in the commit so adding it into the merge message:
https://lore.kernel.org/r/20230926162228.68666-1-mjguzik@gmail.com
Cleanups:
- Various smaller pipe cleanups including the removal of a spin lock
that was only used to protect against writes without pipe_lock()
from O_NOTIFICATION_PIPE aka watch queues. As that was never
implemented remove the additional locking from pipe_write().
- Annotate struct watch_filter with the new __counted_by attribute.
- Clarify do_unlinkat() cleanup so that it doesn't look like an extra
iput() is done that would cause issues.
- Simplify file cleanup when the file has never been opened.
- Use module helper instead of open-coding it.
- Predict error unlikely for stale retry.
- Use WRITE_ONCE() for mount expiry field instead of just commenting
that one hopes the compiler doesn't get smart.
Fixes:
- Fix readahead on block devices.
- Fix writeback when layztime is enabled and inodes whose timestamp
is the only thing that changed reside on wb->b_dirty_time. This
caused excessively large zombie memory cgroup when lazytime was
enabled as such inodes weren't handled fast enough.
- Convert BUG_ON() to WARN_ON_ONCE() in open_last_lookups()"
* tag 'vfs-6.7.misc' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs: (26 commits)
file, i915: fix file reference for mmap_singleton()
vfs: Convert BUG_ON to WARN_ON_ONCE in open_last_lookups
writeback, cgroup: switch inodes with dirty timestamps to release dying cgwbs
chardev: Simplify usage of try_module_get()
ovl: rely on SB_I_NOUMASK
fs: fix umask on NFS with CONFIG_FS_POSIX_ACL=n
fs: store real path instead of fake path in backing file f_path
fs: create helper file_user_path() for user displayed mapped file path
fs: get mnt_writers count for an open backing file's real path
vfs: stop counting on gcc not messing with mnt_expiry_mark if not asked
vfs: predict the error in retry_estale as unlikely
backing file: free directly
vfs: fix readahead(2) on block devices
io_uring: use files_lookup_fd_locked()
file: convert to SLAB_TYPESAFE_BY_RCU
vfs: shave work on failed file open
fs: simplify misleading code to remove ambiguity regarding ihold()/iput()
watch_queue: Annotate struct watch_filter with __counted_by
fs/pipe: use spinlock in pipe_read() only if there is a watch_queue
fs/pipe: remove unnecessary spinlock from pipe_write()
...
The calling code actually handles -ECHILD, so this BUG_ON
can be converted to WARN_ON_ONCE.
Signed-off-by: Bernd Schubert <bschubert@ddn.com>
Link: https://lore.kernel.org/r/20231023184718.11143-1-bschubert@ddn.com
Cc: Christian Brauner <brauner@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Dharmendra Singh <dsingh@ddn.com>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
Failed opens (mostly ENOENT) legitimately happen a lot, for example here
are stats from stracing kernel build for few seconds (strace -fc make):
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ------------------
0.76 0.076233 5 15040 3688 openat
(this is tons of header files tried in different paths)
In the common case of there being nothing to close (only the file object
to free) there is a lot of overhead which can be avoided.
This is most notably delegation of freeing to task_work, which comes
with an enormous cost (see 021a160abf ("fs: use __fput_sync in
close(2)" for an example).
Benchmarked with will-it-scale with a custom testcase based on
tests/open1.c, stuffed into tests/openneg.c:
[snip]
while (1) {
int fd = open("/tmp/nonexistent", O_RDONLY);
assert(fd == -1);
(*iterations)++;
}
[/snip]
Sapphire Rapids, openneg_processes -t 1 (ops/s):
before: 1950013
after: 2914973 (+49%)
file refcount is checked as a safety belt against buggy consumers with
an atomic cmpxchg. Technically it is not necessary, but it happens to
not be measurable due to several other atomics which immediately follow.
Optmizing them away to make this atomic into a problem is left as an
exercise for the reader.
v2:
- unexport fput_badopen and move to fs/internal.h
- handle the refcount with cmpxchg, adjust commentary accordingly
- tweak the commit message
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Link: https://lore.kernel.org/r/20230926162228.68666-1-mjguzik@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
Because 'inode' is being initialised before checking if 'dentry' is negative
it looks like an extra iput() on 'inode' may happen since the ihold() is
done only if the dentry is *not* negative. In reality this doesn't happen
because d_is_negative() is never true if ->d_inode is NULL. This patch only
makes the code easier to understand, as I was initially mislead by it.
Signed-off-by: Luís Henriques <lhenriques@suse.de>
Link: https://lore.kernel.org/r/20230928152341.303-1-lhenriques@suse.de
Signed-off-by: Christian Brauner <brauner@kernel.org>
SB_POSIXACL must be set when a filesystem supports POSIX ACLs, but NFSv4
also sets this flag to prevent the VFS from applying the umask on
newly-created files. NFSv4 doesn't support POSIX ACLs however, which
causes confusion when other subsystems try to test for them.
Add a new SB_I_NOUMASK flag that allows filesystems to opt-in to umask
stripping without advertising support for POSIX ACLs. Set the new flag
on NFSv4 instead of SB_POSIXACL.
Also, move mode_strip_umask to namei.h and convert init_mknod and
init_mkdir to use it.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Message-Id: <20230911-acl-fix-v3-1-b25315333f6c@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
An io_uring openat operation can update an audit reference count
from multiple threads resulting in the call trace below.
A call to io_uring_submit() with a single openat op with a flag of
IOSQE_ASYNC results in the following reference count updates.
These first part of the system call performs two increments that do not race.
do_syscall_64()
__do_sys_io_uring_enter()
io_submit_sqes()
io_openat_prep()
__io_openat_prep()
getname()
getname_flags() /* update 1 (increment) */
__audit_getname() /* update 2 (increment) */
The openat op is queued to an io_uring worker thread which starts the
opportunity for a race. The system call exit performs one decrement.
do_syscall_64()
syscall_exit_to_user_mode()
syscall_exit_to_user_mode_prepare()
__audit_syscall_exit()
audit_reset_context()
putname() /* update 3 (decrement) */
The io_uring worker thread performs one increment and two decrements.
These updates can race with the system call decrement.
io_wqe_worker()
io_worker_handle_work()
io_wq_submit_work()
io_issue_sqe()
io_openat()
io_openat2()
do_filp_open()
path_openat()
__audit_inode() /* update 4 (increment) */
putname() /* update 5 (decrement) */
__audit_uring_exit()
audit_reset_context()
putname() /* update 6 (decrement) */
The fix is to change the refcnt member of struct audit_names
from int to atomic_t.
kernel BUG at fs/namei.c:262!
Call Trace:
...
? putname+0x68/0x70
audit_reset_context.part.0.constprop.0+0xe1/0x300
__audit_uring_exit+0xda/0x1c0
io_issue_sqe+0x1f3/0x450
? lock_timer_base+0x3b/0xd0
io_wq_submit_work+0x8d/0x2b0
? __try_to_del_timer_sync+0x67/0xa0
io_worker_handle_work+0x17c/0x2b0
io_wqe_worker+0x10a/0x350
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/lkml/MW2PR2101MB1033FFF044A258F84AEAA584F1C9A@MW2PR2101MB1033.namprd21.prod.outlook.com/
Fixes: 5bd2182d58 ("audit,io_uring,io-wq: add some basic audit support to io_uring")
Signed-off-by: Dan Clash <daclash@linux.microsoft.com>
Link: https://lore.kernel.org/r/20231012215518.GA4048@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Christian Brauner <brauner@kernel.org>
These have a variety of causes and a corresponding variety of solutions.
Signed-off-by: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Message-Id: <20230818200824.2720007-1-willy@infradead.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
The d_hash_and_lookup() function returns error pointers or NULL.
Most incorrect error checks were fixed, but the one in int path_pts()
was forgotten.
Fixes: eedf265aa0 ("devpts: Make each mount of devpts an independent filesystem.")
Signed-off-by: Wang Ming <machel@vivo.com>
Message-Id: <20230713120555.7025-1-machel@vivo.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
The @source inode must be valid. It is even checked via IS_SWAPFILE()
above making it pretty clear. So no need to check it when we unlock.
What doesn't need to exist is the @target inode. The lock_two_inodes()
helper currently swaps the @inode1 and @inode2 arguments if @inode1 is
NULL to have consistent lock class usage. However, we know that at least
for vfs_rename() that @inode1 is @source and thus is never NULL as per
above. We also know that @source is a different inode than @target as
that is checked right at the beginning of vfs_rename(). So we know that
@source is valid and locked and that @target is locked. So drop the
check whether @source is non-NULL.
Fixes: 28eceeda13 ("fs: Lock moved directories")
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/r/202307030026.9sE2pk2x-lkp@intel.com
Message-Id: <20230703-vfs-rename-source-v1-1-37eebb29b65b@kernel.org>
[brauner: use commit message from patch I sent concurrently]
Signed-off-by: Christian Brauner <brauner@kernel.org>
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZJU4WgAKCRCRxhvAZXjc
oofvAQDs9RJwQUyWHJmQA+tWz5cUE5DviVWCwwul5dQRRCqgaQEA2OIO0gPFaVoq
1OYOeLyUjl/cpS8e3u4uJtw34jttdQA=
=AwcR
-----END PGP SIGNATURE-----
Merge tag 'v6.5/vfs.file' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs file handling updates from Christian Brauner:
"This contains Amir's work to fix a long-standing problem where an
unprivileged overlayfs mount can be used to avoid fanotify permission
events that were requested for an inode or superblock on the
underlying filesystem.
Some background about files opened in overlayfs. If a file is opened
in overlayfs @file->f_path will refer to a "fake" path. What this
means is that while @file->f_inode will refer to inode of the
underlying layer, @file->f_path refers to an overlayfs
{dentry,vfsmount} pair. The reasons for doing this are out of scope
here but it is the reason why the vfs has been providing the
open_with_fake_path() helper for overlayfs for very long time now. So
nothing new here.
This is for sure not very elegant and everyone including the overlayfs
maintainers agree. Improving this significantly would involve more
fragile and potentially rather invasive changes.
In various codepaths access to the path of the underlying filesystem
is needed for such hybrid file. The best example is fsnotify where
this becomes security relevant. Passing the overlayfs
@file->f_path->dentry will cause fsnotify to skip generating fsnotify
events registered on the underlying inode or superblock.
To fix this we extend the vfs provided open_with_fake_path() concept
for overlayfs to create a backing file container that holds the real
path and to expose a helper that can be used by relevant callers to
get access to the path of the underlying filesystem through the new
file_real_path() helper. This pattern is similar to what we do in
d_real() and d_real_inode().
The first beneficiary is fsnotify and fixes the security sensitive
problem mentioned above.
There's a couple of nice cleanups included as well.
Over time, the old open_with_fake_path() helper added specifically for
overlayfs a long time ago started to get used in other places such as
cachefiles. Even though cachefiles have nothing to do with hybrid
files.
The only reason cachefiles used that concept was that files opened
with open_with_fake_path() aren't charged against the caller's open
file limit by raising FMODE_NOACCOUNT. It's just mere coincidence that
both overlayfs and cachefiles need to ensure to not overcharge the
caller for their internal open calls.
So this work disentangles FMODE_NOACCOUNT use cases and backing file
use-cases by adding the FMODE_BACKING flag which indicates that the
file can be used to retrieve the backing file of another filesystem.
(Fyi, Jens will be sending you a really nice cleanup from Christoph
that gets rid of 3 FMODE_* flags otherwise this would be the last
fmode_t bit we'd be using.)
So now overlayfs becomes the sole user of the renamed
open_with_fake_path() helper which is now named backing_file_open().
For internal kernel users such as cachefiles that are only interested
in FMODE_NOACCOUNT but not in FMODE_BACKING we add a new
kernel_file_open() helper which opens a file without being charged
against the caller's open file limit. All new helpers are properly
documented and clearly annotated to mention their special uses.
We also rename vfs_tmpfile_open() to kernel_tmpfile_open() to clearly
distinguish it from vfs_tmpfile() and align it the other kernel_*()
internal helpers"
* tag 'v6.5/vfs.file' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
ovl: enable fsnotify events on underlying real files
fs: use backing_file container for internal files with "fake" f_path
fs: move kmem_cache_zalloc() into alloc_empty_file*() helpers
fs: use a helper for opening kernel internal files
fs: rename {vfs,kernel}_tmpfile_open()
Overlayfs and cachefiles use vfs_open_tmpfile() to open a tmpfile
without accounting for nr_files.
Rename this helper to kernel_tmpfile_open() to better reflect this
helper is used for kernel internal users.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Message-Id: <20230615112229.2143178-2-amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
When a directory is moved to a different directory, some filesystems
(udf, ext4, ocfs2, f2fs, and likely gfs2, reiserfs, and others) need to
update their pointer to the parent and this must not race with other
operations on the directory. Lock the directories when they are moved.
Although not all filesystems need this locking, we perform it in
vfs_rename() because getting the lock ordering right is really difficult
and we don't want to expose these locking details to filesystems.
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Message-Id: <20230601105830.13168-5-jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Currently the locking order of inode locks for directories that are not
in ancestor relationship is not defined because all operations that
needed to lock two directories like this were serialized by
sb->s_vfs_rename_mutex. However some filesystems need to lock two
subdirectories for RENAME_EXCHANGE operations and for this we need the
locking order established even for two tree-unrelated directories.
Provide a helper function lock_two_inodes() that establishes lock
ordering for any two inodes and use it in lock_two_directories().
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Message-Id: <20230601105830.13168-4-jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmRJ1QwACgkQiiy9cAdy
T1FoMwv/dQv3nlq7CFW/p82xRGQg1rqE/bLw8iAz2kl2QMNI32mnNZ5vsnAOx5CT
CXUsdfQTgVsdKQ0yAV0WIAopxOxf4ASmKn8Ej1wJBEREFuwqQUgFE16DALKDPhcW
77cuPqFEGD1CfbOvM+IjSN1OlDmsu2a3qSuD1OfTxC14s3+mPZitGpSm06PMsaT6
C17RG6dC+PJmzXOT79OGfPmP1e6103abhv7PUdy7tZdiHi3I2ESXvCUGAoeFzjG9
i4vxM+WdfStun0sXkZysS+iiyajfbFF8cOkaXp4QG+aplDD5Z2gjV6w+n3phegiX
N1QvTZ3C8dQEhvmy40DGyLgj2A3/dFqh5bEfA3Mtni2NqEHKG4ng2dQsmkMiNg0M
iuCHwLUz2kgvc+zl4+HJpKDVF0HG3RHAlzRq0oNEw+lQApzfg/Q+p5ZcZUBjcoSj
n4fxxNO4bggHmCBHZNDL7VNG81m0YEg8xnDpBPvwtLMO1yEj8RFwW+RTfIwqyOYE
VK1dcRc1
=DweL
-----END PGP SIGNATURE-----
Merge tag '6.4-rc-ksmbd-server-fixes' of git://git.samba.org/ksmbd
Pull ksmbd server updates from Steve French:
- SMB3.1.1 negotiate context fixes and cleanup
- new lock_rename_child VFS helper
- ksmbd fix to avoid unlink race and to use the new VFS helper to avoid
rename race
* tag '6.4-rc-ksmbd-server-fixes' of git://git.samba.org/ksmbd:
ksmbd: fix racy issue from using ->d_parent and ->d_name
ksmbd: remove unused compression negotiate ctx packing
ksmbd: avoid duplicate negotiate ctx offset increments
ksmbd: set NegotiateContextCount once instead of every inc
fs: introduce lock_rename_child() helper
ksmbd: remove internal.h include
Al pointed out that ksmbd has racy issue from using ->d_parent and ->d_name
in ksmbd_vfs_unlink and smb2_vfs_rename(). and use new lock_rename_child()
to lock stable parent while underlying rename racy.
Introduce vfs_path_parent_lookup helper to avoid out of share access and
export vfs functions like the following ones to use
vfs_path_parent_lookup().
- rename __lookup_hash() to lookup_one_qstr_excl().
- export lookup_one_qstr_excl().
- export getname_kernel() and putname().
vfs_path_parent_lookup() is used for parent lookup of destination file
using absolute pathname given from FILE_RENAME_INFORMATION request.
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Pass the dentry of a source file and the dentry of a destination directory
to lock parent inodes for rename. As soon as this function returns,
->d_parent of the source file dentry is stable and inodes are properly
locked for calling vfs-rename. This helper is needed for ksmbd server.
rename request of SMB protocol has to rename an opened file, no matter
which directory it's in.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Update the description of vfs_tmpfile() to match the current parameters of
that function.
Fixes: 9751b33865 ("vfs: move open right after ->tmpfile()")
Signed-off-by: Roberto Sassu <roberto.sassu@huawei.com>
Acked-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Two significant security enhancements are part of this release:
* NFSD's RPC header encoding and decoding, including RPCSEC GSS
and gssproxy header parsing, has been overhauled to make it
more memory-safe.
* Support for Kerberos AES-SHA2-based encryption types has been
added for both the NFS client and server. This provides a clean
path for deprecating and removing insecure encryption types
based on DES and SHA-1. AES-SHA2 is also FIPS-140 compliant, so
that NFS with Kerberos may now be used on systems with fips
enabled.
In addition to these, NFSD is now able to handle crossing into an
auto-mounted mount point on an exported NFS mount. A number of
fixes have been made to NFSD's server-side copy implementation.
RPC metrics have been converted to per-CPU variables. This helps
reduce unnecessary cross-CPU and cross-node memory bus traffic,
and significantly reduces noise when KCSAN is enabled.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmPzgiYACgkQM2qzM29m
f5dB2A//eqjpj+FgAN+UjygrwMC4ahAsPX3Sc3FG8/lTAiao3NFVFY2gxAiCPyVE
CFk+tUyfL23oXvbyfIBe3LhxSBOf621xU6up2OzqAzJqh1Q9iUWB6as3I14to8ZU
sWpxXo5ofwk1hzkbrvOAVkyfY0emwsr00iBeWMawkpBe8FZEQA31OYj3/xHr6bBI
zEVlZPBZAZlp0DZ74tb+bBLs/EOnqKj+XLWcogCH13JB3sn2umF6cQNkYgsxvHGa
TNQi4LEdzWZGme242LfBRiGGwm1xuVIjlAhYV/R1wIjaknE3QBzqfXc6lJx74WII
HaqpRJGrKqdo7B+1gaXCl/AMS7YluED1CBrxuej0wBG7l2JEB7m2MFMQ4LTQjgsn
nrr3P70DgbB4LuPCPyUS7dtsMmUXabIqP7niiCR4T1toH6lBmHAgEi4cFmkzg7Cd
EoFzn888mtDpfx4fghcsRWS5oKXEzbPJfu5+IZOD63+UB+NGpi0Xo2s23sJPK8vz
kqK/X63JYOUxWUvK0zkj/c/wW1cLqIaBwnSKbShou5/BL+cZVI+uJYrnEesgpoB2
5fh/cZv3hdcoOPO7OfcjCLQYy4J6RCWajptnk/hcS3lMvBTBrnq697iAqCVURDKU
Xfmlf7XbBwje+sk4eHgqVGEqqVjrEmoqbmA2OS44WSS5LDvxXdI=
=ZG/7
-----END PGP SIGNATURE-----
Merge tag 'nfsd-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux
Pull nfsd updates from Chuck Lever:
"Two significant security enhancements are part of this release:
- NFSD's RPC header encoding and decoding, including RPCSEC GSS and
gssproxy header parsing, has been overhauled to make it more
memory-safe.
- Support for Kerberos AES-SHA2-based encryption types has been added
for both the NFS client and server. This provides a clean path for
deprecating and removing insecure encryption types based on DES and
SHA-1. AES-SHA2 is also FIPS-140 compliant, so that NFS with
Kerberos may now be used on systems with fips enabled.
In addition to these, NFSD is now able to handle crossing into an
auto-mounted mount point on an exported NFS mount. A number of fixes
have been made to NFSD's server-side copy implementation.
RPC metrics have been converted to per-CPU variables. This helps
reduce unnecessary cross-CPU and cross-node memory bus traffic, and
significantly reduces noise when KCSAN is enabled"
* tag 'nfsd-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: (121 commits)
NFSD: Clean up nfsd_symlink()
NFSD: copy the whole verifier in nfsd_copy_write_verifier
nfsd: don't fsync nfsd_files on last close
SUNRPC: Fix occasional warning when destroying gss_krb5_enctypes
nfsd: fix courtesy client with deny mode handling in nfs4_upgrade_open
NFSD: fix problems with cleanup on errors in nfsd4_copy
nfsd: fix race to check ls_layouts
nfsd: don't hand out delegation on setuid files being opened for write
SUNRPC: Remove ->xpo_secure_port()
SUNRPC: Clean up the svc_xprt_flags() macro
nfsd: remove fs/nfsd/fault_inject.c
NFSD: fix leaked reference count of nfsd4_ssc_umount_item
nfsd: clean up potential nfsd_file refcount leaks in COPY codepath
nfsd: zero out pointers after putting nfsd_files on COPY setup error
SUNRPC: Fix whitespace damage in svcauth_unix.c
nfsd: eliminate __nfs4_get_fd
nfsd: add some kerneldoc comments for stateid preprocessing functions
nfsd: eliminate find_deleg_file_locked
nfsd: don't take nfsd4_copy ref for OP_OFFLOAD_STATUS
SUNRPC: Add encryption self-tests
...
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCY+5NlQAKCRCRxhvAZXjc
orOaAP9i2h3OJy95nO2Fpde0Bt2UT+oulKCCcGlvXJ8/+TQpyQD/ZQq47gFQ0EAz
Br5NxeyGeecAb0lHpFz+CpLGsxMrMwQ=
=+BG5
-----END PGP SIGNATURE-----
Merge tag 'fs.idmapped.v6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping
Pull vfs idmapping updates from Christian Brauner:
- Last cycle we introduced the dedicated struct mnt_idmap type for
mount idmapping and the required infrastucture in 256c8aed2b ("fs:
introduce dedicated idmap type for mounts"). As promised in last
cycle's pull request message this converts everything to rely on
struct mnt_idmap.
Currently we still pass around the plain namespace that was attached
to a mount. This is in general pretty convenient but it makes it easy
to conflate namespaces that are relevant on the filesystem with
namespaces that are relevant on the mount level. Especially for
non-vfs developers without detailed knowledge in this area this was a
potential source for bugs.
This finishes the conversion. Instead of passing the plain namespace
around this updates all places that currently take a pointer to a
mnt_userns with a pointer to struct mnt_idmap.
Now that the conversion is done all helpers down to the really
low-level helpers only accept a struct mnt_idmap argument instead of
two namespace arguments.
Conflating mount and other idmappings will now cause the compiler to
complain loudly thus eliminating the possibility of any bugs. This
makes it impossible for filesystem developers to mix up mount and
filesystem idmappings as they are two distinct types and require
distinct helpers that cannot be used interchangeably.
Everything associated with struct mnt_idmap is moved into a single
separate file. With that change no code can poke around in struct
mnt_idmap. It can only be interacted with through dedicated helpers.
That means all filesystems are and all of the vfs is completely
oblivious to the actual implementation of idmappings.
We are now also able to extend struct mnt_idmap as we see fit. For
example, we can decouple it completely from namespaces for users that
don't require or don't want to use them at all. We can also extend
the concept of idmappings so we can cover filesystem specific
requirements.
In combination with the vfs{g,u}id_t work we finished in v6.2 this
makes this feature substantially more robust and thus difficult to
implement wrong by a given filesystem and also protects the vfs.
- Enable idmapped mounts for tmpfs and fulfill a longstanding request.
A long-standing request from users had been to make it possible to
create idmapped mounts for tmpfs. For example, to share the host's
tmpfs mount between multiple sandboxes. This is a prerequisite for
some advanced Kubernetes cases. Systemd also has a range of use-cases
to increase service isolation. And there are more users of this.
However, with all of the other work going on this was way down on the
priority list but luckily someone other than ourselves picked this
up.
As usual the patch is tiny as all the infrastructure work had been
done multiple kernel releases ago. In addition to all the tests that
we already have I requested that Rodrigo add a dedicated tmpfs
testsuite for idmapped mounts to xfstests. It is to be included into
xfstests during the v6.3 development cycle. This should add a slew of
additional tests.
* tag 'fs.idmapped.v6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping: (26 commits)
shmem: support idmapped mounts for tmpfs
fs: move mnt_idmap
fs: port vfs{g,u}id helpers to mnt_idmap
fs: port fs{g,u}id helpers to mnt_idmap
fs: port i_{g,u}id_into_vfs{g,u}id() to mnt_idmap
fs: port i_{g,u}id_{needs_}update() to mnt_idmap
quota: port to mnt_idmap
fs: port privilege checking helpers to mnt_idmap
fs: port inode_owner_or_capable() to mnt_idmap
fs: port inode_init_owner() to mnt_idmap
fs: port acl to mnt_idmap
fs: port xattr to mnt_idmap
fs: port ->permission() to pass mnt_idmap
fs: port ->fileattr_set() to pass mnt_idmap
fs: port ->set_acl() to pass mnt_idmap
fs: port ->get_acl() to pass mnt_idmap
fs: port ->tmpfile() to pass mnt_idmap
fs: port ->rename() to pass mnt_idmap
fs: port ->mknod() to pass mnt_idmap
fs: port ->mkdir() to pass mnt_idmap
...
This function is only used by NFSD to cross mount points.
If a mount point is of type auto mount, follow_down() will
not uncover it. Add LOOKUP_AUTOMOUNT to the lookup flags
to have ->d_automount() called when NFSD walks down the
mount tree.
Signed-off-by: Richard Weinberger <richard@nod.at>
Reviewed-by: Ian Kent <raven@themaw.net>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Convert to struct mnt_idmap.
Remove legacy file_mnt_user_ns() and mnt_user_ns().
Last cycle we merged the necessary infrastructure in
256c8aed2b ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.
Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.
Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.
Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Convert to struct mnt_idmap.
Last cycle we merged the necessary infrastructure in
256c8aed2b ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.
Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.
Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.
Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Convert to struct mnt_idmap.
Last cycle we merged the necessary infrastructure in
256c8aed2b ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.
Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.
Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.
Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Convert to struct mnt_idmap.
Last cycle we merged the necessary infrastructure in
256c8aed2b ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.
Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.
Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.
Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Convert to struct mnt_idmap.
Last cycle we merged the necessary infrastructure in
256c8aed2b ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.
Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.
Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.
Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Convert to struct mnt_idmap.
Last cycle we merged the necessary infrastructure in
256c8aed2b ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.
Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.
Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.
Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Convert to struct mnt_idmap.
Last cycle we merged the necessary infrastructure in
256c8aed2b ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.
Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.
Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.
Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Convert to struct mnt_idmap.
Last cycle we merged the necessary infrastructure in
256c8aed2b ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.
Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.
Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.
Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Convert to struct mnt_idmap.
Last cycle we merged the necessary infrastructure in
256c8aed2b ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.
Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.
Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.
Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Convert to struct mnt_idmap.
Last cycle we merged the necessary infrastructure in
256c8aed2b ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.
Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.
Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.
Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Convert to struct mnt_idmap.
Last cycle we merged the necessary infrastructure in
256c8aed2b ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.
Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.
Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.
Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Convert to struct mnt_idmap.
Last cycle we merged the necessary infrastructure in
256c8aed2b ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.
Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.
Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.
Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Convert to struct mnt_idmap.
Last cycle we merged the necessary infrastructure in
256c8aed2b ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.
Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.
Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.
Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
The file locking definitions have lived in fs.h since the dawn of time,
but they are only used by a small subset of the source files that
include it.
Move the file locking definitions to a new header file, and add the
appropriate #include directives to the source files that need them. By
doing this we trim down fs.h a bit and limit the amount of rebuilding
that has to be done when we make changes to the file locking APIs.
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Howells <dhowells@redhat.com>
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Acked-by: Chuck Lever <chuck.lever@oracle.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Steve French <stfrench@microsoft.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
-----BEGIN PGP SIGNATURE-----
iIYEABYIAC4WIQSVyBthFV4iTW/VU1/l49DojIL20gUCY5b27RAcbWljQGRpZ2lr
b2QubmV0AAoJEOXj0OiMgvbSg9YA/0K10H+VsGt1+qqR4+w9SM7SFzbgszrV3Yw9
rwiPgaPVAP9rxXPr2bD2hAk7/Lv9LeJ2kfM9RzMErP1A6UsC5YVbDA==
=mAG7
-----END PGP SIGNATURE-----
Merge tag 'landlock-6.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mic/linux
Pull landlock updates from Mickaël Salaün:
"This adds file truncation support to Landlock, contributed by Günther
Noack. As described by Günther [1], the goal of these patches is to
work towards a more complete coverage of file system operations that
are restrictable with Landlock.
The known set of currently unsupported file system operations in
Landlock is described at [2]. Out of the operations listed there,
truncate is the only one that modifies file contents, so these patches
should make it possible to prevent the direct modification of file
contents with Landlock.
The new LANDLOCK_ACCESS_FS_TRUNCATE access right covers both the
truncate(2) and ftruncate(2) families of syscalls, as well as open(2)
with the O_TRUNC flag. This includes usages of creat() in the case
where existing regular files are overwritten.
Additionally, this introduces a new Landlock security blob associated
with opened files, to track the available Landlock access rights at
the time of opening the file. This is in line with Unix's general
approach of checking the read and write permissions during open(), and
associating this previously checked authorization with the opened
file. An ongoing patch documents this use case [3].
In order to treat truncate(2) and ftruncate(2) calls differently in an
LSM hook, we split apart the existing security_path_truncate hook into
security_path_truncate (for truncation by path) and
security_file_truncate (for truncation of previously opened files)"
Link: https://lore.kernel.org/r/20221018182216.301684-1-gnoack3000@gmail.com [1]
Link: https://www.kernel.org/doc/html/v6.1/userspace-api/landlock.html#filesystem-flags [2]
Link: https://lore.kernel.org/r/20221209193813.972012-1-mic@digikod.net [3]
* tag 'landlock-6.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mic/linux:
samples/landlock: Document best-effort approach for LANDLOCK_ACCESS_FS_REFER
landlock: Document Landlock's file truncation support
samples/landlock: Extend sample tool to support LANDLOCK_ACCESS_FS_TRUNCATE
selftests/landlock: Test ftruncate on FDs created by memfd_create(2)
selftests/landlock: Test FD passing from restricted to unrestricted processes
selftests/landlock: Locally define __maybe_unused
selftests/landlock: Test open() and ftruncate() in multiple scenarios
selftests/landlock: Test file truncation support
landlock: Support file truncation
landlock: Document init_layer_masks() helper
landlock: Refactor check_access_path_dual() into is_access_to_paths_allowed()
security: Create file_truncate hook from path_truncate hook
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCY5bspgAKCRCRxhvAZXjc
opEWAQDpF5rnZn1vv4/uOTij9ztcA4yLxu/Q19CdqBaoHlWZ9AD/d3eecee3bh5h
iPHtlUK5/VspfD9LPpdc5ZbPCdZ2pA4=
=t6NN
-----END PGP SIGNATURE-----
Merge tag 'fs.vfsuid.conversion.v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping
Pull vfsuid updates from Christian Brauner:
"Last cycle we introduced the vfs{g,u}id_t types and associated helpers
to gain type safety when dealing with idmapped mounts. That initial
work already converted a lot of places over but there were still some
left,
This converts all remaining places that still make use of non-type
safe idmapping helpers to rely on the new type safe vfs{g,u}id based
helpers.
Afterwards it removes all the old non-type safe helpers"
* tag 'fs.vfsuid.conversion.v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping:
fs: remove unused idmapping helpers
ovl: port to vfs{g,u}id_t and associated helpers
fuse: port to vfs{g,u}id_t and associated helpers
ima: use type safe idmapping helpers
apparmor: use type safe idmapping helpers
caps: use type safe idmapping helpers
fs: use type safe idmapping helpers
mnt_idmapping: add missing helpers
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCY5bwTgAKCRCRxhvAZXjc
ovd2AQCK00NAtGjQCjQPQGyTa4GAPqvWgq1ef0lnhv+TL5US5gD9FncQ8UofeMXt
pBfjtAD6ettTPCTxUQfnTwWEU4rc7Qg=
=27Wm
-----END PGP SIGNATURE-----
Merge tag 'fs.acl.rework.v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping
Pull VFS acl updates from Christian Brauner:
"This contains the work that builds a dedicated vfs posix acl api.
The origins of this work trace back to v5.19 but it took quite a while
to understand the various filesystem specific implementations in
sufficient detail and also come up with an acceptable solution.
As we discussed and seen multiple times the current state of how posix
acls are handled isn't nice and comes with a lot of problems: The
current way of handling posix acls via the generic xattr api is error
prone, hard to maintain, and type unsafe for the vfs until we call
into the filesystem's dedicated get and set inode operations.
It is already the case that posix acls are special-cased to death all
the way through the vfs. There are an uncounted number of hacks that
operate on the uapi posix acl struct instead of the dedicated vfs
struct posix_acl. And the vfs must be involved in order to interpret
and fixup posix acls before storing them to the backing store, caching
them, reporting them to userspace, or for permission checking.
Currently a range of hacks and duct tape exist to make this work. As
with most things this is really no ones fault it's just something that
happened over time. But the code is hard to understand and difficult
to maintain and one is constantly at risk of introducing bugs and
regressions when having to touch it.
Instead of continuing to hack posix acls through the xattr handlers
this series builds a dedicated posix acl api solely around the get and
set inode operations.
Going forward, the vfs_get_acl(), vfs_remove_acl(), and vfs_set_acl()
helpers must be used in order to interact with posix acls. They
operate directly on the vfs internal struct posix_acl instead of
abusing the uapi posix acl struct as we currently do. In the end this
removes all of the hackiness, makes the codepaths easier to maintain,
and gets us type safety.
This series passes the LTP and xfstests suites without any
regressions. For xfstests the following combinations were tested:
- xfs
- ext4
- btrfs
- overlayfs
- overlayfs on top of idmapped mounts
- orangefs
- (limited) cifs
There's more simplifications for posix acls that we can make in the
future if the basic api has made it.
A few implementation details:
- The series makes sure to retain exactly the same security and
integrity module permission checks. Especially for the integrity
modules this api is a win because right now they convert the uapi
posix acl struct passed to them via a void pointer into the vfs
struct posix_acl format to perform permission checking on the mode.
There's a new dedicated security hook for setting posix acls which
passes the vfs struct posix_acl not a void pointer. Basing checking
on the posix acl stored in the uapi format is really unreliable.
The vfs currently hacks around directly in the uapi struct storing
values that frankly the security and integrity modules can't
correctly interpret as evidenced by bugs we reported and fixed in
this area. It's not necessarily even their fault it's just that the
format we provide to them is sub optimal.
- Some filesystems like 9p and cifs need access to the dentry in
order to get and set posix acls which is why they either only
partially or not even at all implement get and set inode
operations. For example, cifs allows setxattr() and getxattr()
operations but doesn't allow permission checking based on posix
acls because it can't implement a get acl inode operation.
Thus, this patch series updates the set acl inode operation to take
a dentry instead of an inode argument. However, for the get acl
inode operation we can't do this as the old get acl method is
called in e.g., generic_permission() and inode_permission(). These
helpers in turn are called in various filesystem's permission inode
operation. So passing a dentry argument to the old get acl inode
operation would amount to passing a dentry to the permission inode
operation which we shouldn't and probably can't do.
So instead of extending the existing inode operation Christoph
suggested to add a new one. He also requested to ensure that the
get and set acl inode operation taking a dentry are consistently
named. So for this version the old get acl operation is renamed to
->get_inode_acl() and a new ->get_acl() inode operation taking a
dentry is added. With this we can give both 9p and cifs get and set
acl inode operations and in turn remove their complex custom posix
xattr handlers.
In the future I hope to get rid of the inode method duplication but
it isn't like we have never had this situation. Readdir is just one
example. And frankly, the overall gain in type safety and the more
pleasant api wise are simply too big of a benefit to not accept
this duplication for a while.
- We've done a full audit of every codepaths using variant of the
current generic xattr api to get and set posix acls and
surprisingly it isn't that many places. There's of course always a
chance that we might have missed some and if so I'm sure we'll find
them soon enough.
The crucial codepaths to be converted are obviously stacking
filesystems such as ecryptfs and overlayfs.
For a list of all callers currently using generic xattr api helpers
see [2] including comments whether they support posix acls or not.
- The old vfs generic posix acl infrastructure doesn't obey the
create and replace semantics promised on the setxattr(2) manpage.
This patch series doesn't address this. It really is something we
should revisit later though.
The patches are roughly organized as follows:
(1) Change existing set acl inode operation to take a dentry
argument (Intended to be a non-functional change)
(2) Rename existing get acl method (Intended to be a non-functional
change)
(3) Implement get and set acl inode operations for filesystems that
couldn't implement one before because of the missing dentry.
That's mostly 9p and cifs (Intended to be a non-functional
change)
(4) Build posix acl api, i.e., add vfs_get_acl(), vfs_remove_acl(),
and vfs_set_acl() including security and integrity hooks
(Intended to be a non-functional change)
(5) Implement get and set acl inode operations for stacking
filesystems (Intended to be a non-functional change)
(6) Switch posix acl handling in stacking filesystems to new posix
acl api now that all filesystems it can stack upon support it.
(7) Switch vfs to new posix acl api (semantical change)
(8) Remove all now unused helpers
(9) Additional regression fixes reported after we merged this into
linux-next
Thanks to Seth for a lot of good discussion around this and
encouragement and input from Christoph"
* tag 'fs.acl.rework.v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping: (36 commits)
posix_acl: Fix the type of sentinel in get_acl
orangefs: fix mode handling
ovl: call posix_acl_release() after error checking
evm: remove dead code in evm_inode_set_acl()
cifs: check whether acl is valid early
acl: make vfs_posix_acl_to_xattr() static
acl: remove a slew of now unused helpers
9p: use stub posix acl handlers
cifs: use stub posix acl handlers
ovl: use stub posix acl handlers
ecryptfs: use stub posix acl handlers
evm: remove evm_xattr_acl_change()
xattr: use posix acl api
ovl: use posix acl api
ovl: implement set acl method
ovl: implement get acl method
ecryptfs: implement set acl method
ecryptfs: implement get acl method
ksmbd: use vfs_remove_acl()
acl: add vfs_remove_acl()
...
If O_EXCL is *not* specified, then linkat() can be
used to link the temporary file into the filesystem.
If O_EXCL is specified then linkat() should fail (-1).
After commit 863f144f12 ("vfs: open inside ->tmpfile()")
the O_EXCL flag is no longer honored by the vfs layer for
tmpfile, which means the file can be linked even if O_EXCL
flag is specified, which is a change in behaviour for
userspace!
The open flags was previously passed as a parameter, so it
was uneffected by the changes to file->f_flags caused by
finish_open(). This patch fixes the issue by storing
file->f_flags in a local variable so the O_EXCL test
logic is restored.
This regression was detected by Android CTS Bionic fcntl()
tests running on android-mainline [1].
[1] https://android.googlesource.com/platform/bionic/+/
refs/heads/master/tests/fcntl_test.cpp#352
Fixes: 863f144f12 ("vfs: open inside ->tmpfile()")
Acked-by: Miklos Szeredi <mszeredi@redhat.com>
Tested-by: Will McVicker <willmcvicker@google.com>
Signed-off-by: Peter Griffin <peter.griffin@linaro.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
We already ported most parts and filesystems over for v6.0 to the new
vfs{g,u}id_t type and associated helpers for v6.0. Convert the remaining
places so we can remove all the old helpers.
This is a non-functional change.
Reviewed-by: Seth Forshee (DigitalOcean) <sforshee@kernel.org>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
The current way of setting and getting posix acls through the generic
xattr interface is error prone and type unsafe. The vfs needs to
interpret and fixup posix acls before storing or reporting it to
userspace. Various hacks exist to make this work. The code is hard to
understand and difficult to maintain in it's current form. Instead of
making this work by hacking posix acls through xattr handlers we are
building a dedicated posix acl api around the get and set inode
operations. This removes a lot of hackiness and makes the codepaths
easier to maintain. A lot of background can be found in [1].
The current inode operation for getting posix acls takes an inode
argument but various filesystems (e.g., 9p, cifs, overlayfs) need access
to the dentry. In contrast to the ->set_acl() inode operation we cannot
simply extend ->get_acl() to take a dentry argument. The ->get_acl()
inode operation is called from:
acl_permission_check()
-> check_acl()
-> get_acl()
which is part of generic_permission() which in turn is part of
inode_permission(). Both generic_permission() and inode_permission() are
called in the ->permission() handler of various filesystems (e.g.,
overlayfs). So simply passing a dentry argument to ->get_acl() would
amount to also having to pass a dentry argument to ->permission(). We
should avoid this unnecessary change.
So instead of extending the existing inode operation rename it from
->get_acl() to ->get_inode_acl() and add a ->get_acl() method later that
passes a dentry argument and which filesystems that need access to the
dentry can implement instead of ->get_inode_acl(). Filesystems like cifs
which allow setting and getting posix acls but not using them for
permission checking during lookup can simply not implement
->get_inode_acl().
This is intended to be a non-functional change.
Link: https://lore.kernel.org/all/20220801145520.1532837-1-brauner@kernel.org [1]
Suggested-by/Inspired-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Like path_truncate, the file_truncate hook also restricts file
truncation, but is called in the cases where truncation is attempted
on an already-opened file.
This is required in a subsequent commit to handle ftruncate()
operations differently to truncate() operations.
Acked-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: John Johansen <john.johansen@canonical.com>
Acked-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: Günther Noack <gnoack3000@gmail.com>
Link: https://lore.kernel.org/r/20221018182216.301684-2-gnoack3000@gmail.com
Signed-off-by: Mickaël Salaün <mic@digikod.net>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCY0DP2AAKCRBZ7Krx/gZQ
6/+qAQCEGQWpcC5MB17zylaX7gqzhgAsDrwtpevlno3aIv/1pQD/YWr/E8tf7WTW
ERXRXMRx1cAzBJhUhVgIY+3ANfU2Rg4=
=cko4
-----END PGP SIGNATURE-----
Merge tag 'pull-tmpfile' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs tmpfile updates from Al Viro:
"Miklos' ->tmpfile() signature change; pass an unopened struct file to
it, let it open the damn thing. Allows to add tmpfile support to FUSE"
* tag 'pull-tmpfile' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
fuse: implement ->tmpfile()
vfs: open inside ->tmpfile()
vfs: move open right after ->tmpfile()
vfs: make vfs_tmpfile() static
ovl: use vfs_tmpfile_open() helper
cachefiles: use vfs_tmpfile_open() helper
cachefiles: only pass inode to *mark_inode_inuse() helpers
cachefiles: tmpfile error handling cleanup
hugetlbfs: cleanup mknod and tmpfile
vfs: add vfs_tmpfile_open() helper
linux-next for a couple of months without, to my knowledge, any negative
reports (or any positive ones, come to that).
- Also the Maple Tree from Liam R. Howlett. An overlapping range-based
tree for vmas. It it apparently slight more efficient in its own right,
but is mainly targeted at enabling work to reduce mmap_lock contention.
Liam has identified a number of other tree users in the kernel which
could be beneficially onverted to mapletrees.
Yu Zhao has identified a hard-to-hit but "easy to fix" lockdep splat
(https://lkml.kernel.org/r/CAOUHufZabH85CeUN-MEMgL8gJGzJEWUrkiM58JkTbBhh-jew0Q@mail.gmail.com).
This has yet to be addressed due to Liam's unfortunately timed
vacation. He is now back and we'll get this fixed up.
- Dmitry Vyukov introduces KMSAN: the Kernel Memory Sanitizer. It uses
clang-generated instrumentation to detect used-unintialized bugs down to
the single bit level.
KMSAN keeps finding bugs. New ones, as well as the legacy ones.
- Yang Shi adds a userspace mechanism (madvise) to induce a collapse of
memory into THPs.
- Zach O'Keefe has expanded Yang Shi's madvise(MADV_COLLAPSE) to support
file/shmem-backed pages.
- userfaultfd updates from Axel Rasmussen
- zsmalloc cleanups from Alexey Romanov
- cleanups from Miaohe Lin: vmscan, hugetlb_cgroup, hugetlb and memory-failure
- Huang Ying adds enhancements to NUMA balancing memory tiering mode's
page promotion, with a new way of detecting hot pages.
- memcg updates from Shakeel Butt: charging optimizations and reduced
memory consumption.
- memcg cleanups from Kairui Song.
- memcg fixes and cleanups from Johannes Weiner.
- Vishal Moola provides more folio conversions
- Zhang Yi removed ll_rw_block() :(
- migration enhancements from Peter Xu
- migration error-path bugfixes from Huang Ying
- Aneesh Kumar added ability for a device driver to alter the memory
tiering promotion paths. For optimizations by PMEM drivers, DRM
drivers, etc.
- vma merging improvements from Jakub Matěn.
- NUMA hinting cleanups from David Hildenbrand.
- xu xin added aditional userspace visibility into KSM merging activity.
- THP & KSM code consolidation from Qi Zheng.
- more folio work from Matthew Wilcox.
- KASAN updates from Andrey Konovalov.
- DAMON cleanups from Kaixu Xia.
- DAMON work from SeongJae Park: fixes, cleanups.
- hugetlb sysfs cleanups from Muchun Song.
- Mike Kravetz fixes locking issues in hugetlbfs and in hugetlb core.
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCY0HaPgAKCRDdBJ7gKXxA
joPjAQDZ5LlRCMWZ1oxLP2NOTp6nm63q9PWcGnmY50FjD/dNlwEAnx7OejCLWGWf
bbTuk6U2+TKgJa4X7+pbbejeoqnt5QU=
=xfWx
-----END PGP SIGNATURE-----
Merge tag 'mm-stable-2022-10-08' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
- Yu Zhao's Multi-Gen LRU patches are here. They've been under test in
linux-next for a couple of months without, to my knowledge, any
negative reports (or any positive ones, come to that).
- Also the Maple Tree from Liam Howlett. An overlapping range-based
tree for vmas. It it apparently slightly more efficient in its own
right, but is mainly targeted at enabling work to reduce mmap_lock
contention.
Liam has identified a number of other tree users in the kernel which
could be beneficially onverted to mapletrees.
Yu Zhao has identified a hard-to-hit but "easy to fix" lockdep splat
at [1]. This has yet to be addressed due to Liam's unfortunately
timed vacation. He is now back and we'll get this fixed up.
- Dmitry Vyukov introduces KMSAN: the Kernel Memory Sanitizer. It uses
clang-generated instrumentation to detect used-unintialized bugs down
to the single bit level.
KMSAN keeps finding bugs. New ones, as well as the legacy ones.
- Yang Shi adds a userspace mechanism (madvise) to induce a collapse of
memory into THPs.
- Zach O'Keefe has expanded Yang Shi's madvise(MADV_COLLAPSE) to
support file/shmem-backed pages.
- userfaultfd updates from Axel Rasmussen
- zsmalloc cleanups from Alexey Romanov
- cleanups from Miaohe Lin: vmscan, hugetlb_cgroup, hugetlb and
memory-failure
- Huang Ying adds enhancements to NUMA balancing memory tiering mode's
page promotion, with a new way of detecting hot pages.
- memcg updates from Shakeel Butt: charging optimizations and reduced
memory consumption.
- memcg cleanups from Kairui Song.
- memcg fixes and cleanups from Johannes Weiner.
- Vishal Moola provides more folio conversions
- Zhang Yi removed ll_rw_block() :(
- migration enhancements from Peter Xu
- migration error-path bugfixes from Huang Ying
- Aneesh Kumar added ability for a device driver to alter the memory
tiering promotion paths. For optimizations by PMEM drivers, DRM
drivers, etc.
- vma merging improvements from Jakub Matěn.
- NUMA hinting cleanups from David Hildenbrand.
- xu xin added aditional userspace visibility into KSM merging
activity.
- THP & KSM code consolidation from Qi Zheng.
- more folio work from Matthew Wilcox.
- KASAN updates from Andrey Konovalov.
- DAMON cleanups from Kaixu Xia.
- DAMON work from SeongJae Park: fixes, cleanups.
- hugetlb sysfs cleanups from Muchun Song.
- Mike Kravetz fixes locking issues in hugetlbfs and in hugetlb core.
Link: https://lkml.kernel.org/r/CAOUHufZabH85CeUN-MEMgL8gJGzJEWUrkiM58JkTbBhh-jew0Q@mail.gmail.com [1]
* tag 'mm-stable-2022-10-08' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (555 commits)
hugetlb: allocate vma lock for all sharable vmas
hugetlb: take hugetlb vma_lock when clearing vma_lock->vma pointer
hugetlb: fix vma lock handling during split vma and range unmapping
mglru: mm/vmscan.c: fix imprecise comments
mm/mglru: don't sync disk for each aging cycle
mm: memcontrol: drop dead CONFIG_MEMCG_SWAP config symbol
mm: memcontrol: use do_memsw_account() in a few more places
mm: memcontrol: deprecate swapaccounting=0 mode
mm: memcontrol: don't allocate cgroup swap arrays when memcg is disabled
mm/secretmem: remove reduntant return value
mm/hugetlb: add available_huge_pages() func
mm: remove unused inline functions from include/linux/mm_inline.h
selftests/vm: add selftest for MADV_COLLAPSE of uffd-minor memory
selftests/vm: add file/shmem MADV_COLLAPSE selftest for cleared pmd
selftests/vm: add thp collapse shmem testing
selftests/vm: add thp collapse file and tmpfs testing
selftests/vm: modularize thp collapse memory operations
selftests/vm: dedup THP helpers
mm/khugepaged: add tracepoint to hpage_collapse_scan_file()
mm/madvise: add file and shmem support to MADV_COLLAPSE
...
Functions implementing the a_ops->write_end() interface accept the `void
*fsdata` parameter that is supposed to be initialized by the corresponding
a_ops->write_begin() (which accepts `void **fsdata`).
However not all a_ops->write_begin() implementations initialize `fsdata`
unconditionally, so it may get passed uninitialized to a_ops->write_end(),
resulting in undefined behavior.
Fix this by initializing fsdata with NULL before the call to
write_begin(), rather than doing so in all possible a_ops implementations.
This patch covers only the following cases found by running x86 KMSAN
under syzkaller:
- generic_perform_write()
- cont_expand_zero() and generic_cont_expand_simple()
- page_symlink()
Other cases of passing uninitialized fsdata may persist in the codebase.
Link: https://lkml.kernel.org/r/20220915150417.722975-43-glider@google.com
Signed-off-by: Alexander Potapenko <glider@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Eric Biggers <ebiggers@google.com>
Cc: Eric Biggers <ebiggers@kernel.org>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Ilya Leoshkevich <iii@linux.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Marco Elver <elver@google.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vegard Nossum <vegard.nossum@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
This is in preparation for adding tmpfile support to fuse, which requires
that the tmpfile creation and opening are done as a single operation.
Replace the 'struct dentry *' argument of i_op->tmpfile with
'struct file *'.
Call finish_open_simple() as the last thing in ->tmpfile() instances (may
be omitted in the error case).
Change d_tmpfile() argument to 'struct file *' as well to make callers more
readable.
Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Create a helper finish_open_simple() that opens the file with the original
dentry. Handle the error case here as well to simplify callers.
Call this helper right after ->tmpfile() is called.
Next patch will change the tmpfile API and move this call into tmpfile
instances.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
No callers outside of fs/namei.c anymore.
Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
This helper unifies tmpfile creation with opening.
Existing vfs_tmpfile() callers outside of fs/namei.c will be converted to
using this helper. There are two such callers: cachefile and overlayfs.
The cachefiles code currently uses the open_with_fake_path() helper to open
the tmpfile, presumably to disable accounting of the open file. Overlayfs
uses tmpfile for copy_up, which means these struct file instances will be
short lived, hence it doesn't really matter if they are accounted or not.
Disable accounting in this helper too, which should be okay for both
callers.
Add MAY_OPEN permission checking for consistency. Like for create(2)
read/write permissions are not checked.
Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
-----BEGIN PGP SIGNATURE-----
iHQEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCYvIYIgAKCRCRxhvAZXjc
omE8AQDAZG2YjNJfMnUUaaWYO3+zTaHlQp7OQkQTXIHfcfViXQD4vPt3Wxh3rrF+
J8fwNcWmXhSNei8HP6cA06QmSajnDQ==
=GF9/
-----END PGP SIGNATURE-----
Merge tag 'fs.setgid.v6.0' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux
Pull setgid updates from Christian Brauner:
"This contains the work to move setgid stripping out of individual
filesystems and into the VFS itself.
Creating files that have both the S_IXGRP and S_ISGID bit raised in
directories that themselves have the S_ISGID bit set requires
additional privileges to avoid security issues.
When a filesystem creates a new inode it needs to take care that the
caller is either in the group of the newly created inode or they have
CAP_FSETID in their current user namespace and are privileged over the
parent directory of the new inode. If any of these two conditions is
true then the S_ISGID bit can be raised for an S_IXGRP file and if not
it needs to be stripped.
However, there are several key issues with the current implementation:
- S_ISGID stripping logic is entangled with umask stripping.
For example, if the umask removes the S_IXGRP bit from the file
about to be created then the S_ISGID bit will be kept.
The inode_init_owner() helper is responsible for S_ISGID stripping
and is called before posix_acl_create(). So we can end up with two
different orderings:
1. FS without POSIX ACL support
First strip umask then strip S_ISGID in inode_init_owner().
In other words, if a filesystem doesn't support or enable POSIX
ACLs then umask stripping is done directly in the vfs before
calling into the filesystem:
2. FS with POSIX ACL support
First strip S_ISGID in inode_init_owner() then strip umask in
posix_acl_create().
In other words, if the filesystem does support POSIX ACLs then
unmask stripping may be done in the filesystem itself when
calling posix_acl_create().
Note that technically filesystems are free to impose their own
ordering between posix_acl_create() and inode_init_owner() meaning
that there's additional ordering issues that influence S_ISGID
inheritance.
(Note that the commit message of commit 1639a49ccd ("fs: move
S_ISGID stripping into the vfs_*() helpers") gets the ordering
between inode_init_owner() and posix_acl_create() the wrong way
around. I realized this too late.)
- Filesystems that don't rely on inode_init_owner() don't get S_ISGID
stripping logic.
While that may be intentional (e.g. network filesystems might just
defer setgid stripping to a server) it is often just a security
issue.
Note that mandating the use of inode_init_owner() was proposed as
an alternative solution but that wouldn't fix the ordering issues
and there are examples such as afs where the use of
inode_init_owner() isn't possible.
In any case, we should also try the cleaner and generalized
solution first before resorting to this approach.
- We still have S_ISGID inheritance bugs years after the initial
round of S_ISGID inheritance fixes:
e014f37db1 ("xfs: use setattr_copy to set vfs inode attributes")
01ea173e10 ("xfs: fix up non-directory creation in SGID directories")
fd84bfdddd ("ceph: fix up non-directory creation in SGID directories")
All of this led us to conclude that the current state is too messy.
While we won't be able to make it completely clean as
posix_acl_create() is still a filesystem specific call we can improve
the S_SIGD stripping situation quite a bit by hoisting it out of
inode_init_owner() and into the respective vfs creation operations.
The obvious advantage is that we don't need to rely on individual
filesystems getting S_ISGID stripping right and instead can
standardize the ordering between S_ISGID and umask stripping directly
in the VFS.
A few short implementation notes:
- The stripping logic needs to happen in vfs_*() helpers for the sake
of stacking filesystems such as overlayfs that rely on these
helpers taking care of S_ISGID stripping.
- Security hooks have never seen the mode as it is ultimately seen by
the filesystem because of the ordering issue we mentioned. Nothing
is changed for them. We simply continue to strip the umask before
passing the mode down to the security hooks.
- The following filesystems use inode_init_owner() and thus relied on
S_ISGID stripping: spufs, 9p, bfs, btrfs, ext2, ext4, f2fs,
hfsplus, hugetlbfs, jfs, minix, nilfs2, ntfs3, ocfs2, omfs,
overlayfs, ramfs, reiserfs, sysv, ubifs, udf, ufs, xfs, zonefs,
bpf, tmpfs.
We've audited all callchains as best as we could. More details can
be found in the commit message to 1639a49ccd ("fs: move S_ISGID
stripping into the vfs_*() helpers")"
* tag 'fs.setgid.v6.0' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
ceph: rely on vfs for setgid stripping
fs: move S_ISGID stripping into the vfs_*() helpers
fs: Add missing umask strip in vfs_tmpfile
fs: add mode_strip_sgid() helper
Move setgid handling out of individual filesystems and into the VFS
itself to stop the proliferation of setgid inheritance bugs.
Creating files that have both the S_IXGRP and S_ISGID bit raised in
directories that themselves have the S_ISGID bit set requires additional
privileges to avoid security issues.
When a filesystem creates a new inode it needs to take care that the
caller is either in the group of the newly created inode or they have
CAP_FSETID in their current user namespace and are privileged over the
parent directory of the new inode. If any of these two conditions is
true then the S_ISGID bit can be raised for an S_IXGRP file and if not
it needs to be stripped.
However, there are several key issues with the current implementation:
* S_ISGID stripping logic is entangled with umask stripping.
If a filesystem doesn't support or enable POSIX ACLs then umask
stripping is done directly in the vfs before calling into the
filesystem.
If the filesystem does support POSIX ACLs then unmask stripping may be
done in the filesystem itself when calling posix_acl_create().
Since umask stripping has an effect on S_ISGID inheritance, e.g., by
stripping the S_IXGRP bit from the file to be created and all relevant
filesystems have to call posix_acl_create() before inode_init_owner()
where we currently take care of S_ISGID handling S_ISGID handling is
order dependent. IOW, whether or not you get a setgid bit depends on
POSIX ACLs and umask and in what order they are called.
Note that technically filesystems are free to impose their own
ordering between posix_acl_create() and inode_init_owner() meaning
that there's additional ordering issues that influence S_SIGID
inheritance.
* Filesystems that don't rely on inode_init_owner() don't get S_ISGID
stripping logic.
While that may be intentional (e.g. network filesystems might just
defer setgid stripping to a server) it is often just a security issue.
This is not just ugly it's unsustainably messy especially since we do
still have bugs in this area years after the initial round of setgid
bugfixes.
So the current state is quite messy and while we won't be able to make
it completely clean as posix_acl_create() is still a filesystem specific
call we can improve the S_SIGD stripping situation quite a bit by
hoisting it out of inode_init_owner() and into the vfs creation
operations. This means we alleviate the burden for filesystems to handle
S_ISGID stripping correctly and can standardize the ordering between
S_ISGID and umask stripping in the vfs.
We add a new helper vfs_prepare_mode() so S_ISGID handling is now done
in the VFS before umask handling. This has S_ISGID handling is
unaffected unaffected by whether umask stripping is done by the VFS
itself (if no POSIX ACLs are supported or enabled) or in the filesystem
in posix_acl_create() (if POSIX ACLs are supported).
The vfs_prepare_mode() helper is called directly in vfs_*() helpers that
create new filesystem objects. We need to move them into there to make
sure that filesystems like overlayfs hat have callchains like:
sys_mknod()
-> do_mknodat(mode)
-> .mknod = ovl_mknod(mode)
-> ovl_create(mode)
-> vfs_mknod(mode)
get S_ISGID stripping done when calling into lower filesystems via
vfs_*() creation helpers. Moving vfs_prepare_mode() into e.g.
vfs_mknod() takes care of that. This is in any case semantically cleaner
because S_ISGID stripping is VFS security requirement.
Security hooks so far have seen the mode with the umask applied but
without S_ISGID handling done. The relevant hooks are called outside of
vfs_*() creation helpers so by calling vfs_prepare_mode() from vfs_*()
helpers the security hooks would now see the mode without umask
stripping applied. For now we fix this by passing the mode with umask
settings applied to not risk any regressions for LSM hooks. IOW, nothing
changes for LSM hooks. It is worth pointing out that security hooks
never saw the mode that is seen by the filesystem when actually creating
the file. They have always been completely misplaced for that to work.
The following filesystems use inode_init_owner() and thus relied on
S_ISGID stripping: spufs, 9p, bfs, btrfs, ext2, ext4, f2fs, hfsplus,
hugetlbfs, jfs, minix, nilfs2, ntfs3, ocfs2, omfs, overlayfs, ramfs,
reiserfs, sysv, ubifs, udf, ufs, xfs, zonefs, bpf, tmpfs.
All of the above filesystems end up calling inode_init_owner() when new
filesystem objects are created through the ->mkdir(), ->mknod(),
->create(), ->tmpfile(), ->rename() inode operations.
Since directories always inherit the S_ISGID bit with the exception of
xfs when irix_sgid_inherit mode is turned on S_ISGID stripping doesn't
apply. The ->symlink() and ->link() inode operations trivially inherit
the mode from the target and the ->rename() inode operation inherits the
mode from the source inode. All other creation inode operations will get
S_ISGID handling via vfs_prepare_mode() when called from their relevant
vfs_*() helpers.
In addition to this there are filesystems which allow the creation of
filesystem objects through ioctl()s or - in the case of spufs -
circumventing the vfs in other ways. If filesystem objects are created
through ioctl()s the vfs doesn't know about it and can't apply regular
permission checking including S_ISGID logic. Therfore, a filesystem
relying on S_ISGID stripping in inode_init_owner() in their ioctl()
callpath will be affected by moving this logic into the vfs. We audited
those filesystems:
* btrfs allows the creation of filesystem objects through various
ioctls(). Snapshot creation literally takes a snapshot and so the mode
is fully preserved and S_ISGID stripping doesn't apply.
Creating a new subvolum relies on inode_init_owner() in
btrfs_new_subvol_inode() but only creates directories and doesn't
raise S_ISGID.
* ocfs2 has a peculiar implementation of reflinks. In contrast to e.g.
xfs and btrfs FICLONE/FICLONERANGE ioctl() that is only concerned with
the actual extents ocfs2 uses a separate ioctl() that also creates the
target file.
Iow, ocfs2 circumvents the vfs entirely here and did indeed rely on
inode_init_owner() to strip the S_ISGID bit. This is the only place
where a filesystem needs to call mode_strip_sgid() directly but this
is self-inflicted pain.
* spufs doesn't go through the vfs at all and doesn't use ioctl()s
either. Instead it has a dedicated system call spufs_create() which
allows the creation of filesystem objects. But spufs only creates
directories and doesn't allo S_SIGID bits, i.e. it specifically only
allows 0777 bits.
* bpf uses vfs_mkobj() but also doesn't allow S_ISGID bits to be created.
The patch will have an effect on ext2 when the EXT2_MOUNT_GRPID mount
option is used, on ext4 when the EXT4_MOUNT_GRPID mount option is used,
and on xfs when the XFS_FEAT_GRPID mount option is used. When any of
these filesystems are mounted with their respective GRPID option then
newly created files inherit the parent directories group
unconditionally. In these cases non of the filesystems call
inode_init_owner() and thus did never strip the S_ISGID bit for newly
created files. Moving this logic into the VFS means that they now get
the S_ISGID bit stripped. This is a user visible change. If this leads
to regressions we will either need to figure out a better way or we need
to revert. However, given the various setgid bugs that we found just in
the last two years this is a regression risk we should take.
Associated with this change is a new set of fstests to enforce the
semantics for all new filesystems.
Link: https://lore.kernel.org/ceph-devel/20220427092201.wvsdjbnc7b4dttaw@wittgenstein [1]
Link: e014f37db1 ("xfs: use setattr_copy to set vfs inode attributes") [2]
Link: 01ea173e10 ("xfs: fix up non-directory creation in SGID directories") [3]
Link: fd84bfdddd ("ceph: fix up non-directory creation in SGID directories") [4]
Link: https://lore.kernel.org/r/1657779088-2242-3-git-send-email-xuyang2018.jy@fujitsu.com
Suggested-by: Dave Chinner <david@fromorbit.com>
Suggested-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-and-Tested-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Yang Xu <xuyang2018.jy@fujitsu.com>
[<brauner@kernel.org>: rewrote commit message]
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
All creation paths except for O_TMPFILE handle umask in the vfs directly
if the filesystem doesn't support or enable POSIX ACLs. If the filesystem
does then umask handling is deferred until posix_acl_create().
Because, O_TMPFILE misses umask handling in the vfs it will not honor
umask settings. Fix this by adding the missing umask handling.
Link: https://lore.kernel.org/r/1657779088-2242-2-git-send-email-xuyang2018.jy@fujitsu.com
Fixes: 60545d0d46 ("[O_TMPFILE] it's still short a few helpers, but infrastructure should be OK now...")
Cc: <stable@vger.kernel.org> # 4.19+
Reported-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-and-Tested-by: Jeff Layton <jlayton@kernel.org>
Acked-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Yang Xu <xuyang2018.jy@fujitsu.com>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Note that validation of ->d_seq after ->d_inode fetch is gone, along
with fetching of ->d_inode itself.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
make handle_mounts() always fetch it. This is just the first step -
the callers of step_into() will stop trying to calculate the sucker,
etc.
The passed value should be equal to dentry->d_inode in all cases;
in RCU mode - fetched after we'd sampled ->d_seq. Might as well
fetch it here. We do need to validate ->d_seq, which duplicates
the check currently done in lookup_fast(); that duplication will
go away shortly.
After that change handle_mounts() always ignores the initial value of
*inode and always sets it on success.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
New field: nd->next_seq. Set to 0 outside of RCU mode, holds the sampled
value for the next dentry to be considered. Used instead of an arseload
of local variables, arguments, etc.
step_into() has lost seq argument; nd->next_seq is used, so dentry passed
to it must be the one ->next_seq is about.
There are two requirements for RCU pathwalk:
1) it should not give a hard failure (other than -ECHILD) unless
non-RCU pathwalk might fail that way given suitable timings.
2) it should not succeed unless non-RCU pathwalk might succeed
with the same end location given suitable timings.
The use of seq numbers is the way we achieve that. Invariant we want
to maintain is:
if RCU pathwalk can reach the state with given nd->path, nd->inode
and nd->seq after having traversed some part of pathname, it must be possible
for non-RCU pathwalk to reach the same nd->path and nd->inode after having
traversed the same part of pathname, and observe the nd->path.dentry->d_seq
equal to what RCU pathwalk has in nd->seq
For transition from parent to child, we sample child's ->d_seq
and verify that parent's ->d_seq remains unchanged. Anything that
disrupts parent-child relationship would've bumped ->d_seq on both.
For transitions from child to parent we sample parent's ->d_seq
and verify that child's ->d_seq has not changed. Same reasoning as
for the previous case applies.
For transition from mountpoint to root of mounted we sample
the ->d_seq of root and verify that nobody has touched mount_lock since
the beginning of pathwalk. That guarantees that mount we'd found had
been there all along, with these mountpoint and root of the mounted.
It would be possible for a non-RCU pathwalk to reach the previous state,
find the same mount and observe its root at the moment we'd sampled
->d_seq of that
For transitions from root of mounted to mountpoint we sample
->d_seq of mountpoint and verify that mount_lock had not been touched
since the beginning of pathwalk. The same reasoning as in the
previous case applies.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
try_to_unlazy()/try_to_unlazy_next() drop LOOKUP_RCU in the
very beginning and do rcu_read_unlock() only at the very end.
However, nothing done in between even looks at the flag in
question; might as well clear it at the same time we unlock.
Note that try_to_unlazy_next() used to call legitimize_mnt(),
which might drop/regain rcu_read_lock() in some cases. This
is no longer true, so we really have rcu_read_lock() held
all along until the end.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The tricky case (__legitimize_mnt() failing after having grabbed
a reference) can be trivially dealt with by leaving nd->path.mnt
non-NULL, for terminate_walk() to drop it.
legitimize_mnt() becomes static after that.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Instead of returning NULL when we are in root, just make it return
the current position (and set *seqp and *inodep accordingly).
That collapses the calls of step_into() in handle_dots()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
read_seqcount_retry() et.al. are inlined and there's enough annotations
for compiler to figure out that those are unlikely to return non-zero.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Validate mount_lock seqcount as soon as we cross into mount in RCU
mode. Sure, ->mnt_root is pinned and will remain so until we
do rcu_read_unlock() anyway, and we will eventually fail to unlazy if
the mount_lock had been touched, but we might run into a hard error
(e.g. -ENOENT) before trying to unlazy. And it's possible to end
up with RCU pathwalk racing with rename() and umount() in a way
that would fail with -ENOENT while non-RCU pathwalk would've
succeeded with any timings.
Once upon a time we hadn't needed that, but analysis had been subtle,
brittle and went out of window as soon as RENAME_EXCHANGE had been
added.
It's narrow, hard to hit and won't get you anything other than
stray -ENOENT that could be arranged in much easier way with the
same priveleges, but it's a bug all the same.
Cc: stable@kernel.org
X-sky-is-falling: unlikely
Fixes: da1ce0670c "vfs: add cross-rename"
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCYpvrrgAKCRBZ7Krx/gZQ
6+eZAP9r0c8E1UnUxRI32kQrzndJO2z9mKq6rI9D7GgJe+MtogEAlfynzklqbHzo
fFvczeD0tzLu4aDExtGn6GNGf5gSpw4=
=1blQ
-----END PGP SIGNATURE-----
Merge tag 'pull-18-rc1-work.namei' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs pathname updates from Al Viro:
"Several cleanups in fs/namei.c"
* tag 'pull-18-rc1-work.namei' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
namei: cleanup double word in comment
get rid of dead code in legitimize_root()
fs/namei.c:reserve_stack(): tidy up the call of try_to_unlazy()
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCYo+nGwAKCRDh3BK/laaZ
PBouAP0VBH/jygclzc42jlRkKjp+wJnF1FifpWOJEtTPiYqhtAD/UWjR/2Sy4TMT
fRsw9N9/FXxcXShjg3U42fpCNSVEqgM=
=oY4z
-----END PGP SIGNATURE-----
Merge tag 'ovl-update-5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs
Pull overlayfs updates from Miklos Szeredi:
- Support idmapped layers in overlayfs (Christian Brauner)
- Add a fix to exportfs that is relevant to open_by_handle_at(2) as
well
- Introduce new lookup helpers that allow passing mnt_userns into
inode_permission()
* tag 'ovl-update-5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
ovl: support idmapped layers
ovl: handle idmappings in ovl_xattr_{g,s}et()
ovl: handle idmappings in layer open helpers
ovl: handle idmappings in ovl_permission()
ovl: use ovl_copy_{real,upper}attr() wrappers
ovl: store lower path in ovl_inode
ovl: handle idmappings for layer lookup
ovl: handle idmappings for layer fileattrs
ovl: use ovl_path_getxattr() wrapper
ovl: use ovl_lookup_upper() wrapper
ovl: use ovl_do_notify_change() wrapper
ovl: pass layer mnt to ovl_open_realfile()
ovl: pass ofs to setattr operations
ovl: handle idmappings in creation operations
ovl: add ovl_upper_mnt_userns() wrapper
ovl: pass ofs to creation operations
ovl: use wrappers to all vfs_*xattr() calls
exportfs: support idmapped mounts
fs: add two trivial lookup helpers
subsystems. Most notably some maintenance work in ocfs2 and initramfs.
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCYo/6xQAKCRDdBJ7gKXxA
jkD9AQCPczLBbRWpe1edL+5VHvel9ePoHQmvbHQnufdTh9rB5QEAu0Uilxz4q9cx
xSZypNhj2n9f8FCYca/ZrZneBsTnAA8=
=AJEO
-----END PGP SIGNATURE-----
Merge tag 'mm-nonmm-stable-2022-05-26' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc updates from Andrew Morton:
"The non-MM patch queue for this merge window.
Not a lot of material this cycle. Many singleton patches against
various subsystems. Most notably some maintenance work in ocfs2
and initramfs"
* tag 'mm-nonmm-stable-2022-05-26' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (65 commits)
kcov: update pos before writing pc in trace function
ocfs2: dlmfs: fix error handling of user_dlm_destroy_lock
ocfs2: dlmfs: don't clear USER_LOCK_ATTACHED when destroying lock
fs/ntfs: remove redundant variable idx
fat: remove time truncations in vfat_create/vfat_mkdir
fat: report creation time in statx
fat: ignore ctime updates, and keep ctime identical to mtime in memory
fat: split fat_truncate_time() into separate functions
MAINTAINERS: add Muchun as a memcg reviewer
proc/sysctl: make protected_* world readable
ia64: mca: drop redundant spinlock initialization
tty: fix deadlock caused by calling printk() under tty_port->lock
relay: remove redundant assignment to pointer buf
fs/ntfs3: validate BOOT sectors_per_clusters
lib/string_helpers: fix not adding strarray to device's resource list
kernel/crash_core.c: remove redundant check of ck_cmdline
ELF, uapi: fixup ELF_ST_TYPE definition
ipc/mqueue: use get_tree_nodev() in mqueue_get_tree()
ipc: update semtimedop() to use hrtimer
ipc/sem: remove redundant assignments
...
Combination of LOOKUP_IS_SCOPED and NULL nd->root.mnt is impossible
after successful path_init(). All places where ->root.mnt might
become NULL do that only if LOOKUP_IS_SCOPED is not there and
path_init() itself can return success without setting nd->root
only if ND_ROOT_PRESET had been set (in which case nd->root
had been set by caller and never changed) or if the name had
been a relative one *and* none of the bits in LOOKUP_IS_SCOPED
had been present.
Since all calls of legitimize_root() must be downstream of successful
path_init(), the check for !nd->root.mnt && (nd->flags & LOOKUP_IS_SCOPED)
is pure paranoia.
FWIW, it had been discussed (and agreed upon) with Aleksa back when
scoped lookups had been merged; looks like that had fallen through the
cracks back then.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
!foo() != 0 is a strange way to spell !foo(); fallout from
"fs: make unlazy_walk() error handling consistent"...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
protected_* files have 600 permissions which prevents non-superuser from
reading them.
Container like "AWS greengrass" refuse to launch unless
protected_hardlinks and protected_symlinks are set. When containers like
these run with "userns-remap" or "--user" mapping container's root to
non-superuser on host, they fail to run due to denied read access to these
files.
As these protections are hardly a secret, and do not possess any security
risk, making them world readable.
Though above greengrass usecase needs read access to only
protected_hardlinks and protected_symlinks files, setting all other
protected_* files to 644 to keep consistency.
Link: http://lkml.kernel.org/r/20200709235115.56954-1-jpitti@cisco.com
Fixes: 800179c9b8 ("fs: add link restrictions")
Signed-off-by: Julius Hemanth Pitti <jpitti@cisco.com>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Luis Chamberlain <mcgrof@kernel.org>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
pagecache_write_begin() and pagecache_write_end() are now trivial
wrappers, so call the aops directly.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Stop using AOP_FLAG_NOFS in favour of the scoped memory API.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
There are no callers of __page_symlink() left, so we can remove that
entry point.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Similar to the addition of lookup_one() add a version of
lookup_one_unlocked() and lookup_one_positive_unlocked() that take
idmapped mounts into account. This is required to port overlay to
support idmapped base layers.
Cc: <linux-fsdevel@vger.kernel.org>
Tested-by: Giuseppe Scrivano <gscrivan@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
When asked to create a path ending '/', but which is not to be a
directory (LOOKUP_DIRECTORY not set), filename_create() will never try
to create the file. If it doesn't exist, -ENOENT is reported.
However, it still passes LOOKUP_CREATE|LOOKUP_EXCL to the filesystems
->lookup() function, even though there is no intent to create. This is
misleading and can cause incorrect behaviour.
If you try
ln -s foo /path/dir/
where 'dir' is a directory on an NFS filesystem which is not currently
known in the dcache, this will fail with ENOENT.
But as the name is not in the dcache, nfs_lookup gets called with
LOOKUP_CREATE|LOOKUP_EXCL and so it returns NULL without performing any
lookup, with the expectation that a subsequent call to create the target
will be made, and the lookup can be combined with the creation. In the
case with a trailing '/' and no LOOKUP_DIRECTORY, that call is never
made. Instead filename_create() sees that the dentry is not (yet)
positive and returns -ENOENT - even though the directory actually
exists.
So only set LOOKUP_CREATE|LOOKUP_EXCL if there really is an intent to
create, and use the absence of these flags to decide if -ENOENT should
be returned.
Note that filename_parentat() is only interested in LOOKUP_REVAL, so we
split that out and store it in 'reval_flag'. __lookup_hash() then gets
reval_flag combined with whatever create flags were determined to be
needed.
Reviewed-by: David Disseldorp <ddiss@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: NeilBrown <neilb@suse.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAmHz0QsACgkQnJ2qBz9k
QNkN+AgA6XqWHKYyElfgJFt1UqaoNMz/Faz9H/+PKiBNSTf6/+67D+V7DFz6jJrv
dDwHNzfDg9kR+pbAAPwhl2jfnQoUlsr019Hrqa5HpWlj5geVpbdunYUzS2WOkwqD
/m+brcLgPdKb2uIysj6wOh9B7wa8V9ksl3EjQvvwaHaU0p1YLUqidVXucYvs8DUo
bgXNaj9GmeysxnmU+aILotWuuXH2vOP4Q2Uk4qz3rN6xW9eEXtpQ4y7gWBp/GA8y
Ia8FtFdQdvlSDOJYMdPOTBu5RB7gY9ElrapvVaWNYtCWI/jRv666nZsLaERYNhLN
uUsG4MWjYbOqW5XqFDbSOwbDqvMh5Q==
=mEFA
-----END PGP SIGNATURE-----
Merge tag 'fsnotify_for_v5.17-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull fsnotify fixes from Jan Kara:
"Fixes for userspace breakage caused by fsnotify changes ~3 years ago
and one fanotify cleanup"
* tag 'fsnotify_for_v5.17-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
fsnotify: fix fsnotify hooks in pseudo filesystems
fsnotify: invalidate dcache before IN_DELETE event
fanotify: remove variable set but not used
Apparently, there are some applications that use IN_DELETE event as an
invalidation mechanism and expect that if they try to open a file with
the name reported with the delete event, that it should not contain the
content of the deleted file.
Commit 49246466a9 ("fsnotify: move fsnotify_nameremove() hook out of
d_delete()") moved the fsnotify delete hook before d_delete() so fsnotify
will have access to a positive dentry.
This allowed a race where opening the deleted file via cached dentry
is now possible after receiving the IN_DELETE event.
To fix the regression, create a new hook fsnotify_delete() that takes
the unlinked inode as an argument and use a helper d_delete_notify() to
pin the inode, so we can pass it to fsnotify_delete() after d_delete().
Backporting hint: this regression is from v5.3. Although patch will
apply with only trivial conflicts to v5.4 and v5.10, it won't build,
because fsnotify_delete() implementation is different in each of those
versions (see fsnotify_link()).
A follow up patch will fix the fsnotify_unlink/rmdir() calls in pseudo
filesystem that do not need to call d_delete().
Link: https://lore.kernel.org/r/20220120215305.282577-1-amir73il@gmail.com
Reported-by: Ivan Delalande <colona@arista.com>
Link: https://lore.kernel.org/linux-fsdevel/YeNyzoDM5hP5LtGW@visor/
Fixes: 49246466a9 ("fsnotify: move fsnotify_nameremove() hook out of d_delete()")
Cc: stable@vger.kernel.org # v5.3+
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
kernel/sysctl.c is a kitchen sink where everyone leaves their dirty
dishes, this makes it very difficult to maintain.
To help with this maintenance let's start by moving sysctls to places
where they actually belong. The proc sysctl maintainers do not want to
know what sysctl knobs you wish to add for your own piece of code, we
just care about the core logic.
So move namei's own sysctl knobs to its own file.
Other than the move we also avoid initializing two static variables to 0
as this is not needed:
* sysctl_protected_symlinks
* sysctl_protected_hardlinks
Link: https://lkml.kernel.org/r/20211129205548.605569-8-mcgrof@kernel.org
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Antti Palosaari <crope@iki.fi>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Lukas Middendorf <kernel@tuxforce.de>
Cc: Stephen Kitt <steve@sk2.org>
Cc: Xiaoming Ni <nixiaoming@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-----BEGIN PGP SIGNATURE-----
iQJHBAABCAAxFiEES8DXskRxsqGE6vXTAA5oQRlWghUFAmF77NQTHGpsYXl0b25A
a2VybmVsLm9yZwAKCRAADmhBGVaCFfP9D/wN8rCAPA2J6SpBjdJXSpSQS4PoAOqC
Z002bOc7sq/zg2cWk+pX1aOR/+wUpk+PvaQdyvfO+o4TVCpsTOklRh/yfYuOkJdP
PoINUR7vb43/CGqd04YI45+pxOFMJk9JoLkNha0uIY4ukXdt9mA6u/+qBEDboyDQ
Jbn1JGitRc9WYaE7BV26ba0l+Deb5h2/4c1DiDlsgmLkDPhpowkOznovUCkBnH7H
bfwlssjZ5P0K5ttZDw6VlkC2xE+Yr56BsEco2bXO42LwUHOx6r6ZNp04rh9zh1Zp
hFPybgU+ur17EOOmBbCq8aHZqRRcjQQDH/rZ1heHSOfTrEWWth1xNjmeewSgRZHL
0oMi3yIJPwvuDBQPEQg0VD87k5Z8xbRPql6eM6GeGxDZvzXWqqYKW2OYXtNxG91m
bGvu2OOGkdF/4WGYBixdjUQb5KjcOqdIFkq3/oHfLQ+cS2uc6eOfnCdxa7cTnTdd
BcFDgZmWQDLFs6/DIbwUI0KWMAiLFMZnZACE937JvlE74EGiHu47JMzwcU15J6zO
VD0Oq0XsWQN+TgY2RnjuTFqm6DTpbrkgw88sNDr5g3jZbgJZiZ/r/3M55lcBVWvk
PFT4fjKhD1mT6/SscAAmOxUKUeDbN7EktiRsZmH9C2sUCERufDb/cmY/RYZ00C4t
01ovPjs7VukS/w==
=bcaq
-----END PGP SIGNATURE-----
Merge tag 'locks-v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux
Pull file locking updates from Jeff Layton:
"Most of this is just follow-on cleanup work of documentation and
comments from the mandatory locking removal in v5.15.
The only real functional change is that LOCK_MAND flock() support is
also being removed, as it has basically been non-functional since the
v2.5 days"
* tag 'locks-v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux:
fs: remove leftover comments from mandatory locking removal
locks: remove changelog comments
docs: fs: locks.rst: update comment about mandatory file locking
Documentation: remove reference to now removed mandatory-locking doc
locks: remove LOCK_MAND flock lock support
Mixing NULL and ERR_PTR() just in case is a Bad Idea(tm). For
struct filename the former is wrong - failures are reported
as ERR_PTR(...), not as NULL.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
filename_create() has two variants, one which drops the caller's
reference to filename (filename_create) and one which does
not (__filename_create). This can be confusing as it's unusual to drop a
caller's reference. Remove filename_create, rename __filename_create
to filename_create, and convert all callers.
Link: https://lore.kernel.org/linux-fsdevel/f6238254-35bd-7e97-5b27-21050c745874@oracle.com/
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Stephen Brennan <stephen.s.brennan@oracle.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
filename_lookup() has two variants, one which drops the caller's
reference to filename (filename_lookup), and one which does
not (__filename_lookup). This can be confusing as it's unusual to drop a
caller's reference. Remove filename_lookup, rename __filename_lookup
to filename_lookup, and convert all callers. The cost is a few slightly
longer functions, but the clarity is greater.
[AV: consuming a reference is not at all unusual, actually; look at
e.g. do_mkdirat(), for example. It's more that we want non-consuming
variant for close relative of that function...]
Link: https://lore.kernel.org/linux-fsdevel/YS+dstZ3xfcLxhoB@zeniv-ca.linux.org.uk/
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Stephen Brennan <stephen.s.brennan@oracle.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
In 0ee50b4753 ("namei: change filename_parentat() calling
conventions"), filename_parentat() was made to always call putname() on
the filename before returning, and kern_path_locked() was migrated to
this calling convention. However, kern_path_locked() uses the "last"
parameter to lookup and potentially create a new dentry. The last
parameter contains the last component of the path and points within the
filename, which was recently freed at the end of filename_parentat().
Thus, when kern_path_locked() calls __lookup_hash(), it is using the
filename after it has already been freed.
In other words, these calling conventions had been wrong for the
only remaining caller of filename_parentat(). Everything else
is using __filename_parentat(), which does not drop the reference;
so should kern_path_locked().
Switch kern_path_locked() to use of __filename_parentat() and move
getting/dropping struct filename into wrapper. Remove filename_parentat(),
now that we have no remaining callers.
Fixes: 0ee50b4753 ("namei: change filename_parentat() calling conventions")
Link: https://lore.kernel.org/linux-fsdevel/YS9D4AlEsaCxLFV0@infradead.org/
Link: https://lore.kernel.org/linux-fsdevel/YS+csMTV2tTXKg3s@zeniv-ca.linux.org.uk/
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Reported-by: syzbot+fb0d60a179096e8c2731@syzkaller.appspotmail.com
Signed-off-by: Stephen Brennan <stephen.s.brennan@oracle.com>
Co-authored-by: Dmitry Kadashev <dkadashev@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Merge misc updates from Andrew Morton:
"173 patches.
Subsystems affected by this series: ia64, ocfs2, block, and mm (debug,
pagecache, gup, swap, shmem, memcg, selftests, pagemap, mremap,
bootmem, sparsemem, vmalloc, kasan, pagealloc, memory-failure,
hugetlb, userfaultfd, vmscan, compaction, mempolicy, memblock,
oom-kill, migration, ksm, percpu, vmstat, and madvise)"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (173 commits)
mm/madvise: add MADV_WILLNEED to process_madvise()
mm/vmstat: remove unneeded return value
mm/vmstat: simplify the array size calculation
mm/vmstat: correct some wrong comments
mm/percpu,c: remove obsolete comments of pcpu_chunk_populated()
selftests: vm: add COW time test for KSM pages
selftests: vm: add KSM merging time test
mm: KSM: fix data type
selftests: vm: add KSM merging across nodes test
selftests: vm: add KSM zero page merging test
selftests: vm: add KSM unmerge test
selftests: vm: add KSM merge test
mm/migrate: correct kernel-doc notation
mm: wire up syscall process_mrelease
mm: introduce process_mrelease system call
memblock: make memblock_find_in_range method private
mm/mempolicy.c: use in_task() in mempolicy_slab_node()
mm/mempolicy: unify the create() func for bind/interleave/prefer-many policies
mm/mempolicy: advertise new MPOL_PREFERRED_MANY
mm/hugetlb: add support for mempolicy MPOL_PREFERRED_MANY
...
We had a recurring situation in which admin procedures setting up
swapfiles would race with test preparation clearing away swapfiles; and
just occasionally that got stuck on a swapfile "(deleted)" which could
never be swapped off. That is not supposed to be possible.
2.6.28 commit f9454548e1 ("don't unlink an active swapfile") admitted
that it was leaving a race window open: now close it.
may_delete() makes the IS_SWAPFILE check (amongst many others) before
inode_lock has been taken on target: now repeat just that simple check in
vfs_unlink() and vfs_rename(), after taking inode_lock.
Which goes most of the way to fixing the race, but swapon() must also
check after it acquires inode_lock, that the file just opened has not
already been unlinked.
Link: https://lkml.kernel.org/r/e17b91ad-a578-9a15-5e3-4989e0f999b5@google.com
Fixes: f9454548e1 ("don't unlink an active swapfile")
Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmEs2NIACgkQxWXV+ddt
WDsJMQ/+PJ/yXfI85mAeAzTJLWQ0zD6YO3iBhf3wOeyychWC4on435pj+zW8zR/U
/bix25ygoWF4MvGF6p0uyv4Z5mnvkZXE5lapUcJu6wXG7se1QRPH0broTh05IBXK
SnT93Eb9RexaiNFk7DVma9XkviqZ/ZISPtkJ9wYrfIba7j/U/wa+PtEFS7wk58hP
rFQXgV64xm/pcP28YYHfOkCjdyUMdJrnBUvfKOlX6d94lmYbP5lyiTL+XJEXExzN
wPakD0UsnXPr4TRvf+YRTPeFHPPUgyORII7otVUOKmGywWtcJrELX8rXFoW+6GwB
dzZIcSYXHUxU5UrtMbZgiztVBJ+bQY5juYMIrj13eYOMYkijxAqPP84iDO15+TSV
zNqyAVjUglHCGUGjhSpAxnAmtp+IJTZfVAWcvIKq3VqvJtb8tssQsk9bqFjH1xlH
qNJLE57CYe3tjw05K9y0keMh2iJWRWkXZYkgI/zjwo5nreemobpN+3fO4yneVLh7
ecdBmSl/JVSzAB1NamLOCZNGZLUqiiuTvZlJtI6ZsekrN1+4A6QzVcU/MGjSYL1v
C7W0hK0LF+e3xIBkxTKVq8noolsgbmlWacxJq8fZq9HwZy5IVJOVm9STDlCuLaIo
gPr0V0itkclcsMU0CHTyCjMsfuHYUwJZXwg93wKfJf5UCzS4OWU=
=ALO9
-----END PGP SIGNATURE-----
Merge tag 'for-5.15-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
"The highlights of this round are integrations with fs-verity and
idmapped mounts, the rest is usual mix of minor improvements, speedups
and cleanups.
There are some patches outside of btrfs, namely updating some VFS
interfaces, all straightforward and acked.
Features:
- fs-verity support, using standard ioctls, backward compatible with
read-only limitation on inodes with previously enabled fs-verity
- idmapped mount support
- make mount with rescue=ibadroots more tolerant to partially damaged
trees
- allow raid0 on a single device and raid10 on two devices,
degenerate cases but might be useful as an intermediate step during
conversion to other profiles
- zoned mode block group auto reclaim can be disabled via sysfs knob
Performance improvements:
- continue readahead of node siblings even if target node is in
memory, could speed up full send (on sample test +11%)
- batching of delayed items can speed up creating many files
- fsync/tree-log speedups
- avoid unnecessary work (gains +2% throughput, -2% run time on
sample load)
- reduced lock contention on renames (on dbench +4% throughput,
up to -30% latency)
Fixes:
- various zoned mode fixes
- preemptive flushing threshold tuning, avoid excessive work on
almost full filesystems
Core:
- continued subpage support, preparation for implementing remaining
features like compression and defragmentation; with some
limitations, write is now enabled on 64K page systems with 4K
sectors, still considered experimental
- no readahead on compressed reads
- inline extents disabled
- disabled raid56 profile conversion and mount
- improved flushing logic, fixing early ENOSPC on some workloads
- inode flags have been internally split to read-only and read-write
incompat bit parts, used by fs-verity
- new tree items for fs-verity
- descriptor item
- Merkle tree item
- inode operations extended to be namespace-aware
- cleanups and refactoring
Generic code changes:
- fs: new export filemap_fdatawrite_wbc
- fs: removed sync_inode
- block: bio_trim argument type fixups
- vfs: add namespace-aware lookup"
* tag 'for-5.15-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (114 commits)
btrfs: reset replace target device to allocation state on close
btrfs: zoned: fix ordered extent boundary calculation
btrfs: do not do preemptive flushing if the majority is global rsv
btrfs: reduce the preemptive flushing threshold to 90%
btrfs: tree-log: check btrfs_lookup_data_extent return value
btrfs: avoid unnecessarily logging directories that had no changes
btrfs: allow idmapped mount
btrfs: handle ACLs on idmapped mounts
btrfs: allow idmapped INO_LOOKUP_USER ioctl
btrfs: allow idmapped SUBVOL_SETFLAGS ioctl
btrfs: allow idmapped SET_RECEIVED_SUBVOL ioctls
btrfs: relax restrictions for SNAP_DESTROY_V2 with subvolids
btrfs: allow idmapped SNAP_DESTROY ioctls
btrfs: allow idmapped SNAP_CREATE/SUBVOL_CREATE ioctls
btrfs: check whether fsgid/fsuid are mapped during subvolume creation
btrfs: allow idmapped permission inode op
btrfs: allow idmapped setattr inode op
btrfs: allow idmapped tmpfile inode op
btrfs: allow idmapped symlink inode op
btrfs: allow idmapped mkdir inode op
...
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmEs8fUQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpio4D/9cGrHIbbZsuDIHzhaK2JIUrSG7G4GkcaG/
NAqbOp7KvF+1elMY08DWLT0nnFqHM7REHIS4Lv55KCNtktTFfdYmxso4lPrRu67o
MNbMJcEAglgIDw0xP4MfP/vZ0ftXJv8+OXSfL51pD4U40nWIZVpqn8WbWKRqjhGf
nQhiANbl2mO2Ec7I/UgAIqwczQnF5HveCkX5106dAppma8yEH+v2TkvZyZp/TCU3
h0ec26hLi+4QRBFm4O0yrVWj1gMS7yfHuEFSGw+jhp/WNTpH9A5pXFQjn7pIyJNi
uqrwM7knrod9ZH2pE1825w0TrbqkOdcZCo+/NvJHOAy03LUBJ/9qDc+JJUWsEmLZ
cpd8auaCfuAFx6ForHmKd+Pw1bANebWBMsClyQSh38+fsJ9myci3c3tkkzmO+dSW
G+rZZochiG4nFSl+CvlUoFfztuu8rdbOLKI/9usPMHNcDiY4yAAmz80B9uQdtQp7
tRLqegplsDODefLNvl0/Uj7WFJl6w5furchTXPmc+GSPFc+mpW08Olh7ScaCyD8c
a8YXaQi5hwuUR1N7uW65Df/HGMbIDvxOStcurIakP0mOSvRKrojZgQhbJ8zuCG4y
cRCwRUzvreNIoKK2ZxEvhLjhE5POaWgy6AtN/UI9k9BeVGQdboKVBGvub5Mv+ZKE
HpchbANk8Q==
=T7Zv
-----END PGP SIGNATURE-----
Merge tag 'for-5.15/io_uring-vfs-2021-08-30' of git://git.kernel.dk/linux-block
Pull io_uring mkdirat/symlinkat/linkat support from Jens Axboe:
"This adds io_uring support for mkdirat, symlinkat, and linkat"
* tag 'for-5.15/io_uring-vfs-2021-08-30' of git://git.kernel.dk/linux-block:
io_uring: add support for IORING_OP_LINKAT
io_uring: add support for IORING_OP_SYMLINKAT
io_uring: add support for IORING_OP_MKDIRAT
namei: update do_*() helpers to return ints
namei: make do_linkat() take struct filename
namei: add getname_uflags()
namei: make do_symlinkat() take struct filename
namei: make do_mknodat() take struct filename
namei: make do_mkdirat() take struct filename
namei: change filename_parentat() calling conventions
namei: ignore ERR/NULL names in putname()
IORING_OP_LINKAT behaves like linkat(2) and takes the same flags and
arguments.
In some internal places 'hardlink' is used instead of 'link' to avoid
confusion with the SQE links. Name 'link' conflicts with the existing
'link' member of io_kiocb.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Suggested-by: Christian Brauner <christian.brauner@ubuntu.com>
Link: https://lore.kernel.org/io-uring/20210514145259.wtl4xcsp52woi6ab@wittgenstein/
Signed-off-by: Dmitry Kadashev <dkadashev@gmail.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Link: https://lore.kernel.org/r/20210708063447.3556403-12-dkadashev@gmail.com
[axboe: add splice_fd_in check]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Update the following to return int rather than long, for uniformity with
the rest of the do_* helpers in namei.c:
* do_rmdir()
* do_unlinkat()
* do_mkdirat()
* do_mknodat()
* do_symlinkat()
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/io-uring/20210514143202.dmzfcgz5hnauy7ze@wittgenstein/
Signed-off-by: Dmitry Kadashev <dkadashev@gmail.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Link: https://lore.kernel.org/r/20210708063447.3556403-9-dkadashev@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>