v6.6-vfs.tmpfs

-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZOXTkgAKCRCRxhvAZXjc
 ouZsAPwNBHB2aPKtzWURuKx5RX02vXTzHX+A/LpuDz5WBFe8zQD+NlaBa4j0MBtS
 rVYM+CjOXnjnsLc8W0euMnfYNvViKgQ=
 =L2+2
 -----END PGP SIGNATURE-----

Merge tag 'v6.6-vfs.tmpfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull libfs and tmpfs updates from Christian Brauner:
 "This cycle saw a lot of work for tmpfs that required changes to the
  vfs layer. Andrew, Hugh, and I decided to take tmpfs through vfs this
  cycle. Things will go back to mm next cycle.

  Features
  ========

   - By far the biggest work is the quota support for tmpfs. New tmpfs
     quota infrastructure is added to support it and a new QFMT_SHMEM
     uapi option is exposed.

     This offers user and group quotas to tmpfs (project quotas will be
     added later). Similar to other filesystems tmpfs quota are not
     supported within user namespaces yet.

   - Add support for user xattrs. While tmpfs already supports security
     xattrs (security.*) and POSIX ACLs for a long time it lacked
     support for user xattrs (user.*). With this pull request tmpfs will
     be able to support a limited number of user xattrs.

     This is accompanied by a fix (see below) to limit persistent simple
     xattr allocations.

   - Add support for stable directory offsets. Currently tmpfs relies on
     the libfs provided cursor-based mechanism for readdir. This causes
     issues when a tmpfs filesystem is exported via NFS.

     NFS clients do not open directories. Instead, each server-side
     readdir operation opens the directory, reads it, and then closes
     it. Since the cursor state for that directory is associated with
     the opened file it is discarded after each readdir operation. Such
     directory offsets are not just cached by NFS clients but also
     various userspace libraries based on these clients.

     As it stands there is no way to invalidate the caches when
     directory offsets have changed and the whole application depends on
     unchanging directory offsets.

     At LSFMM we discussed how to solve this problem and decided to
     support stable directory offsets. libfs now allows filesystems like
     tmpfs to use an xarrary to map a directory offset to a dentry. This
     mechanism is currently only used by tmpfs but can be supported by
     others as well.

  Fixes
  =====

   - Change persistent simple xattrs allocations in libfs from
     GFP_KERNEL to GPF_KERNEL_ACCOUNT so they're subject to memory
     cgroup limits. Since this is a change to libfs it affects both
     tmpfs and kernfs.

   - Correctly verify {g,u}id mount options.

     A new filesystem context is created via fsopen() which records the
     namespace that becomes the owning namespace of the superblock when
     fsconfig(FSCONFIG_CMD_CREATE) is called for filesystems that are
     mountable in namespaces. However, fsconfig() calls can occur in a
     namespace different from the namespace where fsopen() has been
     called.

     Currently, when fsconfig() is called to set {g,u}id mount options
     the requested {g,u}id is mapped into a k{g,u}id according to the
     namespace where fsconfig() was called from. The resulting k{g,u}id
     is not guaranteed to be resolvable in the namespace of the
     filesystem (the one that fsopen() was called in).

     This means it's possible for an unprivileged user to create files
     owned by any group in a tmpfs mount since it's possible to set the
     setid bits on the tmpfs directory.

     The contract for {g,u}id mount options and {g,u}id values in
     general set from userspace has always been that they are translated
     according to the caller's idmapping. In so far, tmpfs has been
     doing the correct thing. But since tmpfs is mountable in
     unprivileged contexts it is also necessary to verify that the
     resulting {k,g}uid is representable in the namespace of the
     superblock to avoid such bugs.

     The new mount api's cross-namespace delegation abilities are
     already widely used. Having talked to a bunch of userspace this is
     the most faithful solution with minimal regression risks"

* tag 'v6.6-vfs.tmpfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  tmpfs,xattr: GFP_KERNEL_ACCOUNT for simple xattrs
  mm: invalidation check mapping before folio_contains
  tmpfs: trivial support for direct IO
  tmpfs,xattr: enable limited user extended attributes
  tmpfs: track free_ispace instead of free_inodes
  xattr: simple_xattr_set() return old_xattr to be freed
  tmpfs: verify {g,u}id mount options correctly
  shmem: move spinlock into shmem_recalc_inode() to fix quota support
  libfs: Remove parent dentry locking in offset_iterate_dir()
  libfs: Add a lock class for the offset map's xa_lock
  shmem: stable directory offsets
  shmem: Refactor shmem_symlink()
  libfs: Add directory operations for stable offsets
  shmem: fix quota lock nesting in huge hole handling
  shmem: Add default quota limit mount options
  shmem: quota support
  shmem: prepare shmem quota infrastructure
  quota: Check presence of quota operation structures instead of ->quota_read and ->quota_write callbacks
  shmem: make shmem_get_inode() return ERR_PTR instead of NULL
  shmem: make shmem_inode_acct_block() return error
This commit is contained in:
Linus Torvalds 2023-08-28 09:55:25 -07:00
commit ecd7db2047
19 changed files with 1411 additions and 296 deletions

View File

@ -85,13 +85,14 @@ prototypes::
struct dentry *dentry, struct fileattr *fa);
int (*fileattr_get)(struct dentry *dentry, struct fileattr *fa);
struct posix_acl * (*get_acl)(struct mnt_idmap *, struct dentry *, int);
struct offset_ctx *(*get_offset_ctx)(struct inode *inode);
locking rules:
all may block
============== =============================================
============== ==================================================
ops i_rwsem(inode)
============== =============================================
============== ==================================================
lookup: shared
create: exclusive
link: exclusive (both)
@ -115,7 +116,8 @@ atomic_open: shared (exclusive if O_CREAT is set in open flags)
tmpfile: no
fileattr_get: no or exclusive
fileattr_set: exclusive
============== =============================================
get_offset_ctx no
============== ==================================================
Additionally, ->rmdir(), ->unlink() and ->rename() have ->i_rwsem

View File

@ -21,8 +21,8 @@ explained further below, some of which can be reconfigured dynamically on the
fly using a remount ('mount -o remount ...') of the filesystem. A tmpfs
filesystem can be resized but it cannot be resized to a size below its current
usage. tmpfs also supports POSIX ACLs, and extended attributes for the
trusted.* and security.* namespaces. ramfs does not use swap and you cannot
modify any parameter for a ramfs filesystem. The size limit of a ramfs
trusted.*, security.* and user.* namespaces. ramfs does not use swap and you
cannot modify any parameter for a ramfs filesystem. The size limit of a ramfs
filesystem is how much memory you have available, and so care must be taken if
used so to not run out of memory.
@ -97,6 +97,9 @@ mount with such options, since it allows any user with write access to
use up all the memory on the machine; but enhances the scalability of
that instance in a system with many CPUs making intensive use of it.
If nr_inodes is not 0, that limited space for inodes is also used up by
extended attributes: "df -i"'s IUsed and IUse% increase, IFree decreases.
tmpfs blocks may be swapped out, when there is a shortage of memory.
tmpfs has a mount option to disable its use of swap:
@ -123,6 +126,37 @@ sysfs file /sys/kernel/mm/transparent_hugepage/shmem_enabled: which can
be used to deny huge pages on all tmpfs mounts in an emergency, or to
force huge pages on all tmpfs mounts for testing.
tmpfs also supports quota with the following mount options
======================== =================================================
quota User and group quota accounting and enforcement
is enabled on the mount. Tmpfs is using hidden
system quota files that are initialized on mount.
usrquota User quota accounting and enforcement is enabled
on the mount.
grpquota Group quota accounting and enforcement is enabled
on the mount.
usrquota_block_hardlimit Set global user quota block hard limit.
usrquota_inode_hardlimit Set global user quota inode hard limit.
grpquota_block_hardlimit Set global group quota block hard limit.
grpquota_inode_hardlimit Set global group quota inode hard limit.
======================== =================================================
None of the quota related mount options can be set or changed on remount.
Quota limit parameters accept a suffix k, m or g for kilo, mega and giga
and can't be changed on remount. Default global quota limits are taking
effect for any and all user/group/project except root the first time the
quota entry for user/group/project id is being accessed - typically the
first time an inode with a particular id ownership is being created after
the mount. In other words, instead of the limits being initialized to zero,
they are initialized with the particular value provided with these mount
options. The limits can be changed for any user/group id at any time as they
normally can be.
Note that tmpfs quotas do not support user namespaces so no uid/gid
translation is done if quotas are enabled inside user namespaces.
tmpfs has a mount option to set the NUMA memory allocation policy for
all files in that instance (if CONFIG_NUMA is enabled) - which can be
adjusted on the fly via 'mount -o remount ...'

View File

@ -515,6 +515,7 @@ As of kernel 2.6.22, the following members are defined:
int (*fileattr_set)(struct mnt_idmap *idmap,
struct dentry *dentry, struct fileattr *fa);
int (*fileattr_get)(struct dentry *dentry, struct fileattr *fa);
struct offset_ctx *(*get_offset_ctx)(struct inode *inode);
};
Again, all methods are called without any locks being held, unless
@ -675,7 +676,10 @@ otherwise noted.
called on ioctl(FS_IOC_SETFLAGS) and ioctl(FS_IOC_FSSETXATTR) to
change miscellaneous file flags and attributes. Callers hold
i_rwsem exclusive. If unset, then fall back to f_op->ioctl().
``get_offset_ctx``
called to get the offset context for a directory inode. A
filesystem must define this operation to use
simple_offset_dir_operations.
The Address Space Object
========================

View File

@ -205,8 +205,8 @@ config TMPFS_XATTR
Extended attributes are name:value pairs associated with inodes by
the kernel or by users (see the attr(5) manual page for details).
Currently this enables support for the trusted.* and
security.* namespaces.
This enables support for the trusted.*, security.* and user.*
namespaces.
You need this for POSIX ACL support on tmpfs.
@ -233,6 +233,18 @@ config TMPFS_INODE64
If unsure, say N.
config TMPFS_QUOTA
bool "Tmpfs quota support"
depends on TMPFS
select QUOTA
help
Quota support allows to set per user and group limits for tmpfs
usage. Say Y to enable quota support. Once enabled you can control
user and group quota enforcement with quota, usrquota and grpquota
mount options.
If unsure, say N.
config ARCH_SUPPORTS_HUGETLBFS
def_bool n

View File

@ -556,7 +556,7 @@ void kernfs_put(struct kernfs_node *kn)
kfree_const(kn->name);
if (kn->iattr) {
simple_xattrs_free(&kn->iattr->xattrs);
simple_xattrs_free(&kn->iattr->xattrs, NULL);
kmem_cache_free(kernfs_iattrs_cache, kn->iattr);
}
spin_lock(&kernfs_idr_lock);

View File

@ -305,11 +305,17 @@ int kernfs_xattr_get(struct kernfs_node *kn, const char *name,
int kernfs_xattr_set(struct kernfs_node *kn, const char *name,
const void *value, size_t size, int flags)
{
struct simple_xattr *old_xattr;
struct kernfs_iattrs *attrs = kernfs_iattrs(kn);
if (!attrs)
return -ENOMEM;
return simple_xattr_set(&attrs->xattrs, name, value, size, flags, NULL);
old_xattr = simple_xattr_set(&attrs->xattrs, name, value, size, flags);
if (IS_ERR(old_xattr))
return PTR_ERR(old_xattr);
simple_xattr_free(old_xattr);
return 0;
}
static int kernfs_vfs_xattr_get(const struct xattr_handler *handler,
@ -341,7 +347,7 @@ static int kernfs_vfs_user_xattr_add(struct kernfs_node *kn,
{
atomic_t *sz = &kn->iattr->user_xattr_size;
atomic_t *nr = &kn->iattr->nr_user_xattrs;
ssize_t removed_size;
struct simple_xattr *old_xattr;
int ret;
if (atomic_inc_return(nr) > KERNFS_MAX_USER_XATTRS) {
@ -354,13 +360,18 @@ static int kernfs_vfs_user_xattr_add(struct kernfs_node *kn,
goto dec_size_out;
}
ret = simple_xattr_set(xattrs, full_name, value, size, flags,
&removed_size);
if (!ret && removed_size >= 0)
size = removed_size;
else if (!ret)
old_xattr = simple_xattr_set(xattrs, full_name, value, size, flags);
if (!old_xattr)
return 0;
if (IS_ERR(old_xattr)) {
ret = PTR_ERR(old_xattr);
goto dec_size_out;
}
ret = 0;
size = old_xattr->size;
simple_xattr_free(old_xattr);
dec_size_out:
atomic_sub(size, sz);
dec_count_out:
@ -375,18 +386,19 @@ static int kernfs_vfs_user_xattr_rm(struct kernfs_node *kn,
{
atomic_t *sz = &kn->iattr->user_xattr_size;
atomic_t *nr = &kn->iattr->nr_user_xattrs;
ssize_t removed_size;
int ret;
struct simple_xattr *old_xattr;
ret = simple_xattr_set(xattrs, full_name, value, size, flags,
&removed_size);
old_xattr = simple_xattr_set(xattrs, full_name, value, size, flags);
if (!old_xattr)
return 0;
if (removed_size >= 0) {
atomic_sub(removed_size, sz);
atomic_dec(nr);
}
if (IS_ERR(old_xattr))
return PTR_ERR(old_xattr);
return ret;
atomic_sub(old_xattr->size, sz);
atomic_dec(nr);
simple_xattr_free(old_xattr);
return 0;
}
static int kernfs_vfs_user_xattr_set(const struct xattr_handler *handler,

View File

@ -239,6 +239,254 @@ const struct inode_operations simple_dir_inode_operations = {
};
EXPORT_SYMBOL(simple_dir_inode_operations);
static void offset_set(struct dentry *dentry, u32 offset)
{
dentry->d_fsdata = (void *)((uintptr_t)(offset));
}
static u32 dentry2offset(struct dentry *dentry)
{
return (u32)((uintptr_t)(dentry->d_fsdata));
}
static struct lock_class_key simple_offset_xa_lock;
/**
* simple_offset_init - initialize an offset_ctx
* @octx: directory offset map to be initialized
*
*/
void simple_offset_init(struct offset_ctx *octx)
{
xa_init_flags(&octx->xa, XA_FLAGS_ALLOC1);
lockdep_set_class(&octx->xa.xa_lock, &simple_offset_xa_lock);
/* 0 is '.', 1 is '..', so always start with offset 2 */
octx->next_offset = 2;
}
/**
* simple_offset_add - Add an entry to a directory's offset map
* @octx: directory offset ctx to be updated
* @dentry: new dentry being added
*
* Returns zero on success. @so_ctx and the dentry offset are updated.
* Otherwise, a negative errno value is returned.
*/
int simple_offset_add(struct offset_ctx *octx, struct dentry *dentry)
{
static const struct xa_limit limit = XA_LIMIT(2, U32_MAX);
u32 offset;
int ret;
if (dentry2offset(dentry) != 0)
return -EBUSY;
ret = xa_alloc_cyclic(&octx->xa, &offset, dentry, limit,
&octx->next_offset, GFP_KERNEL);
if (ret < 0)
return ret;
offset_set(dentry, offset);
return 0;
}
/**
* simple_offset_remove - Remove an entry to a directory's offset map
* @octx: directory offset ctx to be updated
* @dentry: dentry being removed
*
*/
void simple_offset_remove(struct offset_ctx *octx, struct dentry *dentry)
{
u32 offset;
offset = dentry2offset(dentry);
if (offset == 0)
return;
xa_erase(&octx->xa, offset);
offset_set(dentry, 0);
}
/**
* simple_offset_rename_exchange - exchange rename with directory offsets
* @old_dir: parent of dentry being moved
* @old_dentry: dentry being moved
* @new_dir: destination parent
* @new_dentry: destination dentry
*
* Returns zero on success. Otherwise a negative errno is returned and the
* rename is rolled back.
*/
int simple_offset_rename_exchange(struct inode *old_dir,
struct dentry *old_dentry,
struct inode *new_dir,
struct dentry *new_dentry)
{
struct offset_ctx *old_ctx = old_dir->i_op->get_offset_ctx(old_dir);
struct offset_ctx *new_ctx = new_dir->i_op->get_offset_ctx(new_dir);
u32 old_index = dentry2offset(old_dentry);
u32 new_index = dentry2offset(new_dentry);
int ret;
simple_offset_remove(old_ctx, old_dentry);
simple_offset_remove(new_ctx, new_dentry);
ret = simple_offset_add(new_ctx, old_dentry);
if (ret)
goto out_restore;
ret = simple_offset_add(old_ctx, new_dentry);
if (ret) {
simple_offset_remove(new_ctx, old_dentry);
goto out_restore;
}
ret = simple_rename_exchange(old_dir, old_dentry, new_dir, new_dentry);
if (ret) {
simple_offset_remove(new_ctx, old_dentry);
simple_offset_remove(old_ctx, new_dentry);
goto out_restore;
}
return 0;
out_restore:
offset_set(old_dentry, old_index);
xa_store(&old_ctx->xa, old_index, old_dentry, GFP_KERNEL);
offset_set(new_dentry, new_index);
xa_store(&new_ctx->xa, new_index, new_dentry, GFP_KERNEL);
return ret;
}
/**
* simple_offset_destroy - Release offset map
* @octx: directory offset ctx that is about to be destroyed
*
* During fs teardown (eg. umount), a directory's offset map might still
* contain entries. xa_destroy() cleans out anything that remains.
*/
void simple_offset_destroy(struct offset_ctx *octx)
{
xa_destroy(&octx->xa);
}
/**
* offset_dir_llseek - Advance the read position of a directory descriptor
* @file: an open directory whose position is to be updated
* @offset: a byte offset
* @whence: enumerator describing the starting position for this update
*
* SEEK_END, SEEK_DATA, and SEEK_HOLE are not supported for directories.
*
* Returns the updated read position if successful; otherwise a
* negative errno is returned and the read position remains unchanged.
*/
static loff_t offset_dir_llseek(struct file *file, loff_t offset, int whence)
{
switch (whence) {
case SEEK_CUR:
offset += file->f_pos;
fallthrough;
case SEEK_SET:
if (offset >= 0)
break;
fallthrough;
default:
return -EINVAL;
}
return vfs_setpos(file, offset, U32_MAX);
}
static struct dentry *offset_find_next(struct xa_state *xas)
{
struct dentry *child, *found = NULL;
rcu_read_lock();
child = xas_next_entry(xas, U32_MAX);
if (!child)
goto out;
spin_lock(&child->d_lock);
if (simple_positive(child))
found = dget_dlock(child);
spin_unlock(&child->d_lock);
out:
rcu_read_unlock();
return found;
}
static bool offset_dir_emit(struct dir_context *ctx, struct dentry *dentry)
{
u32 offset = dentry2offset(dentry);
struct inode *inode = d_inode(dentry);
return ctx->actor(ctx, dentry->d_name.name, dentry->d_name.len, offset,
inode->i_ino, fs_umode_to_dtype(inode->i_mode));
}
static void offset_iterate_dir(struct inode *inode, struct dir_context *ctx)
{
struct offset_ctx *so_ctx = inode->i_op->get_offset_ctx(inode);
XA_STATE(xas, &so_ctx->xa, ctx->pos);
struct dentry *dentry;
while (true) {
dentry = offset_find_next(&xas);
if (!dentry)
break;
if (!offset_dir_emit(ctx, dentry)) {
dput(dentry);
break;
}
dput(dentry);
ctx->pos = xas.xa_index + 1;
}
}
/**
* offset_readdir - Emit entries starting at offset @ctx->pos
* @file: an open directory to iterate over
* @ctx: directory iteration context
*
* Caller must hold @file's i_rwsem to prevent insertion or removal of
* entries during this call.
*
* On entry, @ctx->pos contains an offset that represents the first entry
* to be read from the directory.
*
* The operation continues until there are no more entries to read, or
* until the ctx->actor indicates there is no more space in the caller's
* output buffer.
*
* On return, @ctx->pos contains an offset that will read the next entry
* in this directory when offset_readdir() is called again with @ctx.
*
* Return values:
* %0 - Complete
*/
static int offset_readdir(struct file *file, struct dir_context *ctx)
{
struct dentry *dir = file->f_path.dentry;
lockdep_assert_held(&d_inode(dir)->i_rwsem);
if (!dir_emit_dots(file, ctx))
return 0;
offset_iterate_dir(d_inode(dir), ctx);
return 0;
}
const struct file_operations simple_offset_dir_operations = {
.llseek = offset_dir_llseek,
.iterate_shared = offset_readdir,
.read = generic_read_dir,
.fsync = noop_fsync,
};
static struct dentry *find_next_child(struct dentry *parent, struct dentry *prev)
{
struct dentry *child = NULL;

View File

@ -2367,7 +2367,7 @@ int dquot_load_quota_sb(struct super_block *sb, int type, int format_id,
if (!fmt)
return -ESRCH;
if (!sb->s_op->quota_write || !sb->s_op->quota_read ||
if (!sb->dq_op || !sb->s_qcop ||
(type == PRJQUOTA && sb->dq_op->get_projid == NULL)) {
error = -EINVAL;
goto out_fmt;

View File

@ -1040,12 +1040,32 @@ const char *xattr_full_name(const struct xattr_handler *handler,
EXPORT_SYMBOL(xattr_full_name);
/**
* free_simple_xattr - free an xattr object
* simple_xattr_space - estimate the memory used by a simple xattr
* @name: the full name of the xattr
* @size: the size of its value
*
* This takes no account of how much larger the two slab objects actually are:
* that would depend on the slab implementation, when what is required is a
* deterministic number, which grows with name length and size and quantity.
*
* Return: The approximate number of bytes of memory used by such an xattr.
*/
size_t simple_xattr_space(const char *name, size_t size)
{
/*
* Use "40" instead of sizeof(struct simple_xattr), to return the
* same result on 32-bit and 64-bit, and even if simple_xattr grows.
*/
return 40 + size + strlen(name);
}
/**
* simple_xattr_free - free an xattr object
* @xattr: the xattr object
*
* Free the xattr object. Can handle @xattr being NULL.
*/
static inline void free_simple_xattr(struct simple_xattr *xattr)
void simple_xattr_free(struct simple_xattr *xattr)
{
if (xattr)
kfree(xattr->name);
@ -1073,7 +1093,7 @@ struct simple_xattr *simple_xattr_alloc(const void *value, size_t size)
if (len < sizeof(*new_xattr))
return NULL;
new_xattr = kvmalloc(len, GFP_KERNEL);
new_xattr = kvmalloc(len, GFP_KERNEL_ACCOUNT);
if (!new_xattr)
return NULL;
@ -1164,7 +1184,6 @@ int simple_xattr_get(struct simple_xattrs *xattrs, const char *name,
* @value: the value to store along the xattr
* @size: the size of @value
* @flags: the flags determining how to set the xattr
* @removed_size: the size of the removed xattr
*
* Set a new xattr object.
* If @value is passed a new xattr object will be allocated. If XATTR_REPLACE
@ -1181,29 +1200,27 @@ int simple_xattr_get(struct simple_xattrs *xattrs, const char *name,
* nothing if XATTR_CREATE is specified in @flags or @flags is zero. For
* XATTR_REPLACE we fail as mentioned above.
*
* Return: On success zero and on error a negative error code is returned.
* Return: On success, the removed or replaced xattr is returned, to be freed
* by the caller; or NULL if none. On failure a negative error code is returned.
*/
int simple_xattr_set(struct simple_xattrs *xattrs, const char *name,
const void *value, size_t size, int flags,
ssize_t *removed_size)
struct simple_xattr *simple_xattr_set(struct simple_xattrs *xattrs,
const char *name, const void *value,
size_t size, int flags)
{
struct simple_xattr *xattr = NULL, *new_xattr = NULL;
struct simple_xattr *old_xattr = NULL, *new_xattr = NULL;
struct rb_node *parent = NULL, **rbp;
int err = 0, ret;
if (removed_size)
*removed_size = -1;
/* value == NULL means remove */
if (value) {
new_xattr = simple_xattr_alloc(value, size);
if (!new_xattr)
return -ENOMEM;
return ERR_PTR(-ENOMEM);
new_xattr->name = kstrdup(name, GFP_KERNEL);
new_xattr->name = kstrdup(name, GFP_KERNEL_ACCOUNT);
if (!new_xattr->name) {
free_simple_xattr(new_xattr);
return -ENOMEM;
simple_xattr_free(new_xattr);
return ERR_PTR(-ENOMEM);
}
}
@ -1217,12 +1234,12 @@ int simple_xattr_set(struct simple_xattrs *xattrs, const char *name,
else if (ret > 0)
rbp = &(*rbp)->rb_right;
else
xattr = rb_entry(*rbp, struct simple_xattr, rb_node);
if (xattr)
old_xattr = rb_entry(*rbp, struct simple_xattr, rb_node);
if (old_xattr)
break;
}
if (xattr) {
if (old_xattr) {
/* Fail if XATTR_CREATE is requested and the xattr exists. */
if (flags & XATTR_CREATE) {
err = -EEXIST;
@ -1230,12 +1247,10 @@ int simple_xattr_set(struct simple_xattrs *xattrs, const char *name,
}
if (new_xattr)
rb_replace_node(&xattr->rb_node, &new_xattr->rb_node,
&xattrs->rb_root);
rb_replace_node(&old_xattr->rb_node,
&new_xattr->rb_node, &xattrs->rb_root);
else
rb_erase(&xattr->rb_node, &xattrs->rb_root);
if (!err && removed_size)
*removed_size = xattr->size;
rb_erase(&old_xattr->rb_node, &xattrs->rb_root);
} else {
/* Fail if XATTR_REPLACE is requested but no xattr is found. */
if (flags & XATTR_REPLACE) {
@ -1260,12 +1275,10 @@ int simple_xattr_set(struct simple_xattrs *xattrs, const char *name,
out_unlock:
write_unlock(&xattrs->lock);
if (err)
free_simple_xattr(new_xattr);
else
free_simple_xattr(xattr);
return err;
if (!err)
return old_xattr;
simple_xattr_free(new_xattr);
return ERR_PTR(err);
}
static bool xattr_is_trusted(const char *name)
@ -1370,14 +1383,17 @@ void simple_xattrs_init(struct simple_xattrs *xattrs)
/**
* simple_xattrs_free - free xattrs
* @xattrs: xattr header whose xattrs to destroy
* @freed_space: approximate number of bytes of memory freed from @xattrs
*
* Destroy all xattrs in @xattr. When this is called no one can hold a
* reference to any of the xattrs anymore.
*/
void simple_xattrs_free(struct simple_xattrs *xattrs)
void simple_xattrs_free(struct simple_xattrs *xattrs, size_t *freed_space)
{
struct rb_node *rbp;
if (freed_space)
*freed_space = 0;
rbp = rb_first(&xattrs->rb_root);
while (rbp) {
struct simple_xattr *xattr;
@ -1386,7 +1402,10 @@ void simple_xattrs_free(struct simple_xattrs *xattrs)
rbp_next = rb_next(rbp);
xattr = rb_entry(rbp, struct simple_xattr, rb_node);
rb_erase(&xattr->rb_node, &xattrs->rb_root);
free_simple_xattr(xattr);
if (freed_space)
*freed_space += simple_xattr_space(xattr->name,
xattr->size);
simple_xattr_free(xattr);
rbp = rbp_next;
}
}

View File

@ -1842,6 +1842,7 @@ struct dir_context {
struct iov_iter;
struct io_uring_cmd;
struct offset_ctx;
struct file_operations {
struct module *owner;
@ -1935,6 +1936,7 @@ struct inode_operations {
int (*fileattr_set)(struct mnt_idmap *idmap,
struct dentry *dentry, struct fileattr *fa);
int (*fileattr_get)(struct dentry *dentry, struct fileattr *fa);
struct offset_ctx *(*get_offset_ctx)(struct inode *inode);
} ____cacheline_aligned;
static inline ssize_t call_read_iter(struct file *file, struct kiocb *kio,
@ -3065,6 +3067,22 @@ extern ssize_t simple_read_from_buffer(void __user *to, size_t count,
extern ssize_t simple_write_to_buffer(void *to, size_t available, loff_t *ppos,
const void __user *from, size_t count);
struct offset_ctx {
struct xarray xa;
u32 next_offset;
};
void simple_offset_init(struct offset_ctx *octx);
int simple_offset_add(struct offset_ctx *octx, struct dentry *dentry);
void simple_offset_remove(struct offset_ctx *octx, struct dentry *dentry);
int simple_offset_rename_exchange(struct inode *old_dir,
struct dentry *old_dentry,
struct inode *new_dir,
struct dentry *new_dentry);
void simple_offset_destroy(struct offset_ctx *octx);
extern const struct file_operations simple_offset_dir_operations;
extern int __generic_file_fsync(struct file *, loff_t, loff_t, int);
extern int generic_file_fsync(struct file *, loff_t, loff_t, int);

View File

@ -13,6 +13,10 @@
/* inode in-kernel data */
#ifdef CONFIG_TMPFS_QUOTA
#define SHMEM_MAXQUOTAS 2
#endif
struct shmem_inode_info {
spinlock_t lock;
unsigned int seals; /* shmem seals */
@ -27,6 +31,10 @@ struct shmem_inode_info {
atomic_t stop_eviction; /* hold when working on inode */
struct timespec64 i_crtime; /* file creation time */
unsigned int fsflags; /* flags for FS_IOC_[SG]ETFLAGS */
#ifdef CONFIG_TMPFS_QUOTA
struct dquot *i_dquot[MAXQUOTAS];
#endif
struct offset_ctx dir_offsets; /* stable entry offsets */
struct inode vfs_inode;
};
@ -35,11 +43,18 @@ struct shmem_inode_info {
(FS_IMMUTABLE_FL | FS_APPEND_FL | FS_NODUMP_FL | FS_NOATIME_FL)
#define SHMEM_FL_INHERITED (FS_NODUMP_FL | FS_NOATIME_FL)
struct shmem_quota_limits {
qsize_t usrquota_bhardlimit; /* Default user quota block hard limit */
qsize_t usrquota_ihardlimit; /* Default user quota inode hard limit */
qsize_t grpquota_bhardlimit; /* Default group quota block hard limit */
qsize_t grpquota_ihardlimit; /* Default group quota inode hard limit */
};
struct shmem_sb_info {
unsigned long max_blocks; /* How many blocks are allowed */
struct percpu_counter used_blocks; /* How many are allocated */
unsigned long max_inodes; /* How many inodes are allowed */
unsigned long free_inodes; /* How many are left for allocation */
unsigned long free_ispace; /* How much ispace left for allocation */
raw_spinlock_t stat_lock; /* Serialize shmem_sb_info changes */
umode_t mode; /* Mount mode for root directory */
unsigned char huge; /* Whether to try for hugepages */
@ -53,6 +68,7 @@ struct shmem_sb_info {
spinlock_t shrinklist_lock; /* Protects shrinklist */
struct list_head shrinklist; /* List of shinkable inodes */
unsigned long shrinklist_len; /* Length of shrinklist */
struct shmem_quota_limits qlimits; /* Default quota limits */
};
static inline struct shmem_inode_info *SHMEM_I(struct inode *inode)
@ -172,4 +188,17 @@ extern int shmem_mfill_atomic_pte(pmd_t *dst_pmd,
#endif /* CONFIG_SHMEM */
#endif /* CONFIG_USERFAULTFD */
/*
* Used space is stored as unsigned 64-bit value in bytes but
* quota core supports only signed 64-bit values so use that
* as a limit
*/
#define SHMEM_QUOTA_MAX_SPC_LIMIT 0x7fffffffffffffffLL /* 2^63-1 */
#define SHMEM_QUOTA_MAX_INO_LIMIT 0x7fffffffffffffffLL
#ifdef CONFIG_TMPFS_QUOTA
extern const struct dquot_operations shmem_quota_operations;
extern struct quota_format_type shmem_quota_format;
#endif /* CONFIG_TMPFS_QUOTA */
#endif

View File

@ -114,13 +114,15 @@ struct simple_xattr {
};
void simple_xattrs_init(struct simple_xattrs *xattrs);
void simple_xattrs_free(struct simple_xattrs *xattrs);
void simple_xattrs_free(struct simple_xattrs *xattrs, size_t *freed_space);
size_t simple_xattr_space(const char *name, size_t size);
struct simple_xattr *simple_xattr_alloc(const void *value, size_t size);
void simple_xattr_free(struct simple_xattr *xattr);
int simple_xattr_get(struct simple_xattrs *xattrs, const char *name,
void *buffer, size_t size);
int simple_xattr_set(struct simple_xattrs *xattrs, const char *name,
const void *value, size_t size, int flags,
ssize_t *removed_size);
struct simple_xattr *simple_xattr_set(struct simple_xattrs *xattrs,
const char *name, const void *value,
size_t size, int flags);
ssize_t simple_xattr_list(struct inode *inode, struct simple_xattrs *xattrs,
char *buffer, size_t size);
void simple_xattr_add(struct simple_xattrs *xattrs,

View File

@ -77,6 +77,7 @@
#define QFMT_VFS_V0 2
#define QFMT_OCFS2 3
#define QFMT_VFS_V1 4
#define QFMT_SHMEM 5
/* Size of block in which space limits are passed through the quota
* interface */

View File

@ -51,7 +51,7 @@ obj-y := filemap.o mempool.o oom_kill.o fadvise.o \
readahead.o swap.o truncate.o vmscan.o shmem.o \
util.o mmzone.o vmstat.o backing-dev.o \
mm_init.o percpu.o slab_common.o \
compaction.o show_mem.o\
compaction.o show_mem.o shmem_quota.o\
interval_tree.o list_lru.o workingset.o \
debug.o gup.o mmap_lock.o $(mmu-y)

View File

@ -2520,7 +2520,7 @@ static void __split_huge_page(struct page *page, struct list_head *list,
struct address_space *swap_cache = NULL;
unsigned long offset = 0;
unsigned int nr = thp_nr_pages(head);
int i;
int i, nr_dropped = 0;
/* complete memcg works before add pages to LRU */
split_page_memcg(head, nr);
@ -2545,7 +2545,7 @@ static void __split_huge_page(struct page *page, struct list_head *list,
struct folio *tail = page_folio(head + i);
if (shmem_mapping(head->mapping))
shmem_uncharge(head->mapping->host, 1);
nr_dropped++;
else if (folio_test_clear_dirty(tail))
folio_account_cleaned(tail,
inode_to_wb(folio->mapping->host));
@ -2582,6 +2582,8 @@ static void __split_huge_page(struct page *page, struct list_head *list,
}
local_irq_enable();
if (nr_dropped)
shmem_uncharge(head->mapping->host, nr_dropped);
remap_page(folio, nr);
if (PageSwapCache(head)) {

View File

@ -1955,10 +1955,6 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
goto xa_locked;
}
}
if (!shmem_charge(mapping->host, 1)) {
result = SCAN_FAIL;
goto xa_locked;
}
nr_none++;
continue;
}
@ -2145,8 +2141,13 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
*/
try_to_unmap_flush();
if (result != SCAN_SUCCEED)
if (result == SCAN_SUCCEED && nr_none &&
!shmem_charge(mapping->host, nr_none))
result = SCAN_FAIL;
if (result != SCAN_SUCCEED) {
nr_none = 0;
goto rollback;
}
/*
* The old pages are locked, so they won't change anymore.
@ -2283,8 +2284,8 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
if (nr_none) {
xas_lock_irq(&xas);
mapping->nrpages -= nr_none;
shmem_uncharge(mapping->host, nr_none);
xas_unlock_irq(&xas);
shmem_uncharge(mapping->host, nr_none);
}
list_for_each_entry_safe(page, tmp, &pagelist, lru) {

File diff suppressed because it is too large Load Diff

350
mm/shmem_quota.c Normal file
View File

@ -0,0 +1,350 @@
// SPDX-License-Identifier: GPL-2.0-only
/*
* In memory quota format relies on quota infrastructure to store dquot
* information for us. While conventional quota formats for file systems
* with persistent storage can load quota information into dquot from the
* storage on-demand and hence quota dquot shrinker can free any dquot
* that is not currently being used, it must be avoided here. Otherwise we
* can lose valuable information, user provided limits, because there is
* no persistent storage to load the information from afterwards.
*
* One information that in-memory quota format needs to keep track of is
* a sorted list of ids for each quota type. This is done by utilizing
* an rb tree which root is stored in mem_dqinfo->dqi_priv for each quota
* type.
*
* This format can be used to support quota on file system without persistent
* storage such as tmpfs.
*
* Author: Lukas Czerner <lczerner@redhat.com>
* Carlos Maiolino <cmaiolino@redhat.com>
*
* Copyright (C) 2023 Red Hat, Inc.
*/
#include <linux/errno.h>
#include <linux/fs.h>
#include <linux/mount.h>
#include <linux/kernel.h>
#include <linux/init.h>
#include <linux/module.h>
#include <linux/slab.h>
#include <linux/rbtree.h>
#include <linux/shmem_fs.h>
#include <linux/quotaops.h>
#include <linux/quota.h>
#ifdef CONFIG_TMPFS_QUOTA
/*
* The following constants define the amount of time given a user
* before the soft limits are treated as hard limits (usually resulting
* in an allocation failure). The timer is started when the user crosses
* their soft limit, it is reset when they go below their soft limit.
*/
#define SHMEM_MAX_IQ_TIME 604800 /* (7*24*60*60) 1 week */
#define SHMEM_MAX_DQ_TIME 604800 /* (7*24*60*60) 1 week */
struct quota_id {
struct rb_node node;
qid_t id;
qsize_t bhardlimit;
qsize_t bsoftlimit;
qsize_t ihardlimit;
qsize_t isoftlimit;
};
static int shmem_check_quota_file(struct super_block *sb, int type)
{
/* There is no real quota file, nothing to do */
return 1;
}
/*
* There is no real quota file. Just allocate rb_root for quota ids and
* set limits
*/
static int shmem_read_file_info(struct super_block *sb, int type)
{
struct quota_info *dqopt = sb_dqopt(sb);
struct mem_dqinfo *info = &dqopt->info[type];
info->dqi_priv = kzalloc(sizeof(struct rb_root), GFP_NOFS);
if (!info->dqi_priv)
return -ENOMEM;
info->dqi_max_spc_limit = SHMEM_QUOTA_MAX_SPC_LIMIT;
info->dqi_max_ino_limit = SHMEM_QUOTA_MAX_INO_LIMIT;
info->dqi_bgrace = SHMEM_MAX_DQ_TIME;
info->dqi_igrace = SHMEM_MAX_IQ_TIME;
info->dqi_flags = 0;
return 0;
}
static int shmem_write_file_info(struct super_block *sb, int type)
{
/* There is no real quota file, nothing to do */
return 0;
}
/*
* Free all the quota_id entries in the rb tree and rb_root.
*/
static int shmem_free_file_info(struct super_block *sb, int type)
{
struct mem_dqinfo *info = &sb_dqopt(sb)->info[type];
struct rb_root *root = info->dqi_priv;
struct quota_id *entry;
struct rb_node *node;
info->dqi_priv = NULL;
node = rb_first(root);
while (node) {
entry = rb_entry(node, struct quota_id, node);
node = rb_next(&entry->node);
rb_erase(&entry->node, root);
kfree(entry);
}
kfree(root);
return 0;
}
static int shmem_get_next_id(struct super_block *sb, struct kqid *qid)
{
struct mem_dqinfo *info = sb_dqinfo(sb, qid->type);
struct rb_node *node = ((struct rb_root *)info->dqi_priv)->rb_node;
qid_t id = from_kqid(&init_user_ns, *qid);
struct quota_info *dqopt = sb_dqopt(sb);
struct quota_id *entry = NULL;
int ret = 0;
if (!sb_has_quota_active(sb, qid->type))
return -ESRCH;
down_read(&dqopt->dqio_sem);
while (node) {
entry = rb_entry(node, struct quota_id, node);
if (id < entry->id)
node = node->rb_left;
else if (id > entry->id)
node = node->rb_right;
else
goto got_next_id;
}
if (!entry) {
ret = -ENOENT;
goto out_unlock;
}
if (id > entry->id) {
node = rb_next(&entry->node);
if (!node) {
ret = -ENOENT;
goto out_unlock;
}
entry = rb_entry(node, struct quota_id, node);
}
got_next_id:
*qid = make_kqid(&init_user_ns, qid->type, entry->id);
out_unlock:
up_read(&dqopt->dqio_sem);
return ret;
}
/*
* Load dquot with limits from existing entry, or create the new entry if
* it does not exist.
*/
static int shmem_acquire_dquot(struct dquot *dquot)
{
struct mem_dqinfo *info = sb_dqinfo(dquot->dq_sb, dquot->dq_id.type);
struct rb_node **n = &((struct rb_root *)info->dqi_priv)->rb_node;
struct shmem_sb_info *sbinfo = dquot->dq_sb->s_fs_info;
struct rb_node *parent = NULL, *new_node = NULL;
struct quota_id *new_entry, *entry;
qid_t id = from_kqid(&init_user_ns, dquot->dq_id);
struct quota_info *dqopt = sb_dqopt(dquot->dq_sb);
int ret = 0;
mutex_lock(&dquot->dq_lock);
down_write(&dqopt->dqio_sem);
while (*n) {
parent = *n;
entry = rb_entry(parent, struct quota_id, node);
if (id < entry->id)
n = &(*n)->rb_left;
else if (id > entry->id)
n = &(*n)->rb_right;
else
goto found;
}
/* We don't have entry for this id yet, create it */
new_entry = kzalloc(sizeof(struct quota_id), GFP_NOFS);
if (!new_entry) {
ret = -ENOMEM;
goto out_unlock;
}
new_entry->id = id;
if (dquot->dq_id.type == USRQUOTA) {
new_entry->bhardlimit = sbinfo->qlimits.usrquota_bhardlimit;
new_entry->ihardlimit = sbinfo->qlimits.usrquota_ihardlimit;
} else if (dquot->dq_id.type == GRPQUOTA) {
new_entry->bhardlimit = sbinfo->qlimits.grpquota_bhardlimit;
new_entry->ihardlimit = sbinfo->qlimits.grpquota_ihardlimit;
}
new_node = &new_entry->node;
rb_link_node(new_node, parent, n);
rb_insert_color(new_node, (struct rb_root *)info->dqi_priv);
entry = new_entry;
found:
/* Load the stored limits from the tree */
spin_lock(&dquot->dq_dqb_lock);
dquot->dq_dqb.dqb_bhardlimit = entry->bhardlimit;
dquot->dq_dqb.dqb_bsoftlimit = entry->bsoftlimit;
dquot->dq_dqb.dqb_ihardlimit = entry->ihardlimit;
dquot->dq_dqb.dqb_isoftlimit = entry->isoftlimit;
if (!dquot->dq_dqb.dqb_bhardlimit &&
!dquot->dq_dqb.dqb_bsoftlimit &&
!dquot->dq_dqb.dqb_ihardlimit &&
!dquot->dq_dqb.dqb_isoftlimit)
set_bit(DQ_FAKE_B, &dquot->dq_flags);
spin_unlock(&dquot->dq_dqb_lock);
/* Make sure flags update is visible after dquot has been filled */
smp_mb__before_atomic();
set_bit(DQ_ACTIVE_B, &dquot->dq_flags);
out_unlock:
up_write(&dqopt->dqio_sem);
mutex_unlock(&dquot->dq_lock);
return ret;
}
static bool shmem_is_empty_dquot(struct dquot *dquot)
{
struct shmem_sb_info *sbinfo = dquot->dq_sb->s_fs_info;
qsize_t bhardlimit;
qsize_t ihardlimit;
if (dquot->dq_id.type == USRQUOTA) {
bhardlimit = sbinfo->qlimits.usrquota_bhardlimit;
ihardlimit = sbinfo->qlimits.usrquota_ihardlimit;
} else if (dquot->dq_id.type == GRPQUOTA) {
bhardlimit = sbinfo->qlimits.grpquota_bhardlimit;
ihardlimit = sbinfo->qlimits.grpquota_ihardlimit;
}
if (test_bit(DQ_FAKE_B, &dquot->dq_flags) ||
(dquot->dq_dqb.dqb_curspace == 0 &&
dquot->dq_dqb.dqb_curinodes == 0 &&
dquot->dq_dqb.dqb_bhardlimit == bhardlimit &&
dquot->dq_dqb.dqb_ihardlimit == ihardlimit))
return true;
return false;
}
/*
* Store limits from dquot in the tree unless it's fake. If it is fake
* remove the id from the tree since there is no useful information in
* there.
*/
static int shmem_release_dquot(struct dquot *dquot)
{
struct mem_dqinfo *info = sb_dqinfo(dquot->dq_sb, dquot->dq_id.type);
struct rb_node *node = ((struct rb_root *)info->dqi_priv)->rb_node;
qid_t id = from_kqid(&init_user_ns, dquot->dq_id);
struct quota_info *dqopt = sb_dqopt(dquot->dq_sb);
struct quota_id *entry = NULL;
mutex_lock(&dquot->dq_lock);
/* Check whether we are not racing with some other dqget() */
if (dquot_is_busy(dquot))
goto out_dqlock;
down_write(&dqopt->dqio_sem);
while (node) {
entry = rb_entry(node, struct quota_id, node);
if (id < entry->id)
node = node->rb_left;
else if (id > entry->id)
node = node->rb_right;
else
goto found;
}
/* We should always find the entry in the rb tree */
WARN_ONCE(1, "quota id %u from dquot %p, not in rb tree!\n", id, dquot);
up_write(&dqopt->dqio_sem);
mutex_unlock(&dquot->dq_lock);
return -ENOENT;
found:
if (shmem_is_empty_dquot(dquot)) {
/* Remove entry from the tree */
rb_erase(&entry->node, info->dqi_priv);
kfree(entry);
} else {
/* Store the limits in the tree */
spin_lock(&dquot->dq_dqb_lock);
entry->bhardlimit = dquot->dq_dqb.dqb_bhardlimit;
entry->bsoftlimit = dquot->dq_dqb.dqb_bsoftlimit;
entry->ihardlimit = dquot->dq_dqb.dqb_ihardlimit;
entry->isoftlimit = dquot->dq_dqb.dqb_isoftlimit;
spin_unlock(&dquot->dq_dqb_lock);
}
clear_bit(DQ_ACTIVE_B, &dquot->dq_flags);
up_write(&dqopt->dqio_sem);
out_dqlock:
mutex_unlock(&dquot->dq_lock);
return 0;
}
static int shmem_mark_dquot_dirty(struct dquot *dquot)
{
return 0;
}
static int shmem_dquot_write_info(struct super_block *sb, int type)
{
return 0;
}
static const struct quota_format_ops shmem_format_ops = {
.check_quota_file = shmem_check_quota_file,
.read_file_info = shmem_read_file_info,
.write_file_info = shmem_write_file_info,
.free_file_info = shmem_free_file_info,
};
struct quota_format_type shmem_quota_format = {
.qf_fmt_id = QFMT_SHMEM,
.qf_ops = &shmem_format_ops,
.qf_owner = THIS_MODULE
};
const struct dquot_operations shmem_quota_operations = {
.acquire_dquot = shmem_acquire_dquot,
.release_dquot = shmem_release_dquot,
.alloc_dquot = dquot_alloc,
.destroy_dquot = dquot_destroy,
.write_info = shmem_dquot_write_info,
.mark_dirty = shmem_mark_dquot_dirty,
.get_next_id = shmem_get_next_id,
};
#endif /* CONFIG_TMPFS_QUOTA */

View File

@ -657,11 +657,11 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
}
folio_lock(folio);
VM_BUG_ON_FOLIO(!folio_contains(folio, indices[i]), folio);
if (folio->mapping != mapping) {
if (unlikely(folio->mapping != mapping)) {
folio_unlock(folio);
continue;
}
VM_BUG_ON_FOLIO(!folio_contains(folio, indices[i]), folio);
folio_wait_writeback(folio);
if (folio_mapped(folio))