License cleanup: add SPDX GPL-2.0 license identifier to files with no license
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.
By default all files without license information are under the default
license of the kernel, which is GPL version 2.
Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.
This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.
How this work was done:
Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information in it.
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information,
Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.
The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.
The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.
Criteria used to select files for SPDX license identifier tagging were:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if <5
lines).
All documentation files were explicitly excluded.
The following heuristics were used to determine which SPDX license
identifiers to apply.
- when both scanners couldn't find any license traces, file was
considered to have no license information in it, and the top level
COPYING file license applied.
For non */uapi/* files that summary was:
SPDX license identifier                             # files
----------------------------------------------------|------
GPL-2.0                                               11139
and resulted in the first patch in this series.
If that file was a */uapi/* path one, it was "GPL-2.0 WITH
Linux-syscall-note", otherwise it was "GPL-2.0". The results were:
SPDX license identifier                             # files
----------------------------------------------------|------
GPL-2.0 WITH Linux-syscall-note                         930
and resulted in the second patch in this series.
- if a file had some form of licensing information in it, and was one
of the */uapi/* ones, it was denoted with the Linux-syscall-note if
any GPL family license was found in the file or had no licensing in
it (per prior point). Results summary:
SPDX license identifier                             # files
----------------------------------------------------|------
GPL-2.0 WITH Linux-syscall-note                         270
GPL-2.0+ WITH Linux-syscall-note                        169
((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)      21
((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)      17
LGPL-2.1+ WITH Linux-syscall-note                        15
GPL-1.0+ WITH Linux-syscall-note                         14
((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)      5
LGPL-2.0+ WITH Linux-syscall-note                         4
LGPL-2.1 WITH Linux-syscall-note                          3
((GPL-2.0 WITH Linux-syscall-note) OR MIT)                3
((GPL-2.0 WITH Linux-syscall-note) AND MIT)               1
and that resulted in the third patch in this series.
- when the two scanners agreed on the detected license(s), that became
the concluded license(s).
- when there was disagreement between the two scanners (one detected a
license but the other didn't, or they both detected different
licenses) a manual inspection of the file occurred.
- In most cases a manual inspection of the information in the file
resulted in a clear resolution of the license that should apply (and
which scanner probably needed to revisit its heuristics).
- When it was not immediately clear, the license identifier was
confirmed with lawyers working with the Linux Foundation.
- If there was any question as to the appropriate license identifier,
the file was flagged for further research and to be revisited later
in time.
In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.
Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there were new insights. The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.
Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.
In the initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.
Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
- a full scancode scan run, collecting the matched texts, detected
license ids and scores
- reviewing anything where there was a license detected (about 500+
files) to ensure that the applied SPDX license was correct
- reviewing anything where there was no detection but the patch license
was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
SPDX license was correct
This produced a worksheet with 20 files needing minor correction. This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.
These .csv files were then reviewed by Greg. Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected. This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.) Finally Greg ran the script using the .csv files to
generate the patches.
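The header/source distinction mentioned above can be illustrated with a small C sketch. This is not the script that Thomas and Greg used (that script is not shown here). The helper name `spdx_line` and its extension check are hypothetical, and the sketch assumes only two file types: `.c` files take a `//` comment and `.h` files take a `/* */` comment.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: format the SPDX tag line for `path` into `out`.
 * Returns 0 on success, -1 for an unhandled extension or a short buffer.
 */
static int spdx_line(const char *path, const char *license,
                     char *out, size_t outlen)
{
    const char *ext;
    int n;

    assert(path != NULL && license != NULL && out != NULL);

    ext = strrchr(path, '.');
    if (ext == NULL)
        return -1;

    if (strcmp(ext, ".c") == 0)        /* source file: C++-style comment */
        n = snprintf(out, outlen, "// SPDX-License-Identifier: %s", license);
    else if (strcmp(ext, ".h") == 0)   /* header file: C-style comment */
        n = snprintf(out, outlen, "/* SPDX-License-Identifier: %s */", license);
    else
        return -1;

    return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}
```

Called with "fs/read_write.c" and "GPL-2.0", the helper produces the same first line that this file carries. Other file types (Makefiles, assembly, scripts) would need their own comment syntax and are rejected here.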
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-01 15:07:57 +01:00
|
|
|
// SPDX-License-Identifier: GPL-2.0
|
2005-04-16 15:20:36 -07:00
|
|
|
/*
|
|
|
|
* linux/fs/read_write.c
|
|
|
|
*
|
|
|
|
* Copyright (C) 1991, 1992 Linus Torvalds
|
|
|
|
*/
|
|
|
|
|
2017-02-08 18:51:33 +01:00
|
|
|
#include <linux/slab.h>
|
2005-04-16 15:20:36 -07:00
|
|
|
#include <linux/stat.h>
|
2017-02-08 18:51:33 +01:00
|
|
|
#include <linux/sched/xacct.h>
|
2005-04-16 15:20:36 -07:00
|
|
|
#include <linux/fcntl.h>
|
|
|
|
#include <linux/file.h>
|
|
|
|
#include <linux/uio.h>
|
[PATCH] inotify
inotify is intended to correct the deficiencies of dnotify, particularly
its inability to scale and its terrible user interface:
* dnotify requires the opening of one fd per each directory
that you intend to watch. This quickly results in too many
open files and pins removable media, preventing unmount.
* dnotify is directory-based. You only learn about changes to
directories. Sure, a change to a file in a directory affects
the directory, but you are then forced to keep a cache of
stat structures.
* dnotify's interface to user-space is awful. Signals?
inotify provides a more usable, simple, powerful solution to file change
notification:
* inotify's interface is a system call that returns a fd, not SIGIO.
You get a single fd, which is select()-able.
* inotify has an event that says "the filesystem that the item
you were watching is on was unmounted."
* inotify can watch directories or files.
Inotify is currently used by Beagle (a desktop search infrastructure),
Gamin (a FAM replacement), and other projects.
See Documentation/filesystems/inotify.txt.
Signed-off-by: Robert Love <rml@novell.com>
Cc: John McCutchan <ttb@tentacle.dhs.org>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-07-12 17:06:03 -04:00
|
|
|
#include <linux/fsnotify.h>
|
2005-04-16 15:20:36 -07:00
|
|
|
#include <linux/security.h>
|
2011-11-16 23:57:37 -05:00
|
|
|
#include <linux/export.h>
|
2005-04-16 15:20:36 -07:00
|
|
|
#include <linux/syscalls.h>
|
2006-01-04 16:20:40 -08:00
|
|
|
#include <linux/pagemap.h>
|
2007-06-04 09:59:47 +02:00
|
|
|
#include <linux/splice.h>
|
2013-02-24 10:52:26 -05:00
|
|
|
#include <linux/compat.h>
|
2015-11-10 16:53:30 -05:00
|
|
|
#include <linux/mount.h>
|
2016-02-16 22:20:59 +01:00
|
|
|
#include <linux/fs.h>
|
2013-03-20 13:19:30 -04:00
|
|
|
#include "internal.h"
|
2005-04-16 15:20:36 -07:00
|
|
|
|
2016-12-24 11:46:01 -08:00
|
|
|
#include <linux/uaccess.h>
|
2005-04-16 15:20:36 -07:00
|
|
|
#include <asm/unistd.h>
|
|
|
|
|
2006-03-28 01:56:42 -08:00
|
|
|
const struct file_operations generic_ro_fops = {
|
2005-04-16 15:20:36 -07:00
|
|
|
.llseek = generic_file_llseek,
|
2014-04-02 14:33:16 -04:00
|
|
|
.read_iter = generic_file_read_iter,
|
2005-04-16 15:20:36 -07:00
|
|
|
.mmap = generic_file_readonly_mmap,
|
2023-05-22 14:50:15 +01:00
|
|
|
.splice_read = filemap_splice_read,
|
2005-04-16 15:20:36 -07:00
|
|
|
};
|
|
|
|
|
|
|
|
EXPORT_SYMBOL(generic_ro_fops);
|
|
|
|
|
2017-07-06 18:58:37 +02:00
|
|
|
static inline bool unsigned_offsets(struct file *file)
|
2010-10-01 14:20:22 -07:00
|
|
|
{
|
2024-08-09 12:38:56 +02:00
|
|
|
return file->f_op->fop_flags & FOP_UNSIGNED_OFFSET;
|
2010-10-01 14:20:22 -07:00
|
|
|
}
|
|
|
|
|
2013-06-25 12:02:13 +08:00
|
|
|
/**
|
2024-08-30 15:04:46 +02:00
|
|
|
* vfs_setpos_cookie - update the file offset for lseek and reset cookie
|
2013-06-25 12:02:13 +08:00
|
|
|
* @file: file structure in question
|
|
|
|
* @offset: file offset to seek to
|
|
|
|
* @maxsize: maximum file size
|
2024-08-30 15:04:46 +02:00
|
|
|
* @cookie: cookie to reset
|
2013-06-25 12:02:13 +08:00
|
|
|
*
|
2024-08-30 15:04:46 +02:00
|
|
|
* Update the file offset to the value specified by @offset if the given
|
|
|
|
* offset is valid and it is not equal to the current file offset and
|
|
|
|
* reset the specified cookie to indicate that a seek happened.
|
2013-06-25 12:02:13 +08:00
|
|
|
*
|
|
|
|
* Return the specified offset on success and -EINVAL on invalid offset.
|
|
|
|
*/
|
2024-08-30 15:04:46 +02:00
|
|
|
static loff_t vfs_setpos_cookie(struct file *file, loff_t offset,
|
|
|
|
loff_t maxsize, u64 *cookie)
|
2011-09-15 16:06:48 -07:00
|
|
|
{
|
|
|
|
if (offset < 0 && !unsigned_offsets(file))
|
|
|
|
return -EINVAL;
|
|
|
|
if (offset > maxsize)
|
|
|
|
return -EINVAL;
|
|
|
|
|
|
|
|
if (offset != file->f_pos) {
|
|
|
|
file->f_pos = offset;
|
2024-08-30 15:05:01 +02:00
|
|
|
if (cookie)
|
|
|
|
*cookie = 0;
|
2011-09-15 16:06:48 -07:00
|
|
|
}
|
|
|
|
return offset;
|
|
|
|
}
|
|
|
|
|
2008-08-11 15:37:17 +02:00
|
|
|
/**
|
2024-08-30 15:04:46 +02:00
|
|
|
* vfs_setpos - update the file offset for lseek
|
|
|
|
* @file: file structure in question
|
2008-08-11 15:37:17 +02:00
|
|
|
* @offset: file offset to seek to
|
2024-08-30 15:04:46 +02:00
|
|
|
* @maxsize: maximum file size
|
2008-08-11 15:37:17 +02:00
|
|
|
*
|
2024-08-30 15:04:46 +02:00
|
|
|
* This is a low-level filesystem helper for updating the file offset to
|
|
|
|
* the value specified by @offset if the given offset is valid and it is
|
|
|
|
* not equal to the current file offset.
|
2011-09-15 16:06:48 -07:00
|
|
|
*
|
2024-08-30 15:04:46 +02:00
|
|
|
* Return the specified offset on success and -EINVAL on invalid offset.
|
2008-08-11 15:37:17 +02:00
|
|
|
*/
|
2024-08-30 15:04:46 +02:00
|
|
|
loff_t vfs_setpos(struct file *file, loff_t offset, loff_t maxsize)
|
|
|
|
{
|
2024-08-30 15:05:01 +02:00
|
|
|
return vfs_setpos_cookie(file, offset, maxsize, NULL);
|
2024-08-30 15:04:46 +02:00
|
|
|
}
|
2013-06-25 12:02:13 +08:00
|
|
|
EXPORT_SYMBOL(vfs_setpos);
|
2011-09-15 16:06:48 -07:00
|
|
|
|
2024-08-30 15:04:47 +02:00
|
|
|
/**
|
|
|
|
* must_set_pos - check whether f_pos has to be updated
|
|
|
|
* @file: file to seek on
|
|
|
|
* @offset: offset to use
|
|
|
|
* @whence: type of seek operation
|
|
|
|
* @eof: end of file
|
|
|
|
*
|
|
|
|
* Check whether f_pos needs to be updated and update @offset according
|
|
|
|
* to @whence.
|
|
|
|
*
|
|
|
|
* Return: 0 if f_pos doesn't need to be updated, 1 if f_pos has to be
|
|
|
|
* updated, and negative error code on failure.
|
|
|
|
*/
|
2024-08-30 15:04:48 +02:00
|
|
|
static int must_set_pos(struct file *file, loff_t *offset, int whence, loff_t eof)
|
2005-04-16 15:20:36 -07:00
|
|
|
{
|
2012-12-17 15:59:39 -08:00
|
|
|
switch (whence) {
|
2008-08-11 15:37:17 +02:00
|
|
|
case SEEK_END:
|
2024-08-30 15:04:47 +02:00
|
|
|
*offset += eof;
|
2008-08-11 15:37:17 +02:00
|
|
|
break;
|
|
|
|
case SEEK_CUR:
|
2008-11-10 17:08:08 -08:00
|
|
|
/*
|
|
|
|
* Here we special-case the lseek(fd, 0, SEEK_CUR)
|
|
|
|
* position-querying operation. Avoid rewriting the "same"
|
|
|
|
* f_pos value back to the file because a concurrent read(),
|
|
|
|
* write() or lseek() might have altered it
|
|
|
|
*/
|
2024-08-30 15:04:47 +02:00
|
|
|
if (*offset == 0) {
|
|
|
|
*offset = file->f_pos;
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
break;
|
2011-07-18 13:21:35 -04:00
|
|
|
case SEEK_DATA:
|
|
|
|
/*
|
|
|
|
* In the generic case the entire file is data, so as long as
|
|
|
|
* offset isn't at the end of the file then the offset is data.
|
|
|
|
*/
|
2024-08-30 15:04:47 +02:00
|
|
|
if ((unsigned long long)*offset >= eof)
|
2011-07-18 13:21:35 -04:00
|
|
|
return -ENXIO;
|
|
|
|
break;
|
|
|
|
case SEEK_HOLE:
|
|
|
|
/*
|
|
|
|
* There is a virtual hole at the end of the file, so as long as
|
|
|
|
* offset isn't i_size or larger, return i_size.
|
|
|
|
*/
|
2024-08-30 15:04:47 +02:00
|
|
|
if ((unsigned long long)*offset >= eof)
|
2011-07-18 13:21:35 -04:00
|
|
|
return -ENXIO;
|
2024-08-30 15:04:47 +02:00
|
|
|
*offset = eof;
|
2011-07-18 13:21:35 -04:00
|
|
|
break;
|
2005-04-16 15:20:36 -07:00
|
|
|
}
|
2008-08-11 15:37:17 +02:00
|
|
|
|
2024-08-30 15:04:47 +02:00
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
|
2008-08-11 15:37:17 +02:00
|
|
|
/**
|
2011-09-15 16:06:50 -07:00
|
|
|
* generic_file_llseek_size - generic llseek implementation for regular files
|
2008-08-11 15:37:17 +02:00
|
|
|
* @file: file structure to seek on
|
|
|
|
* @offset: file offset to seek to
|
2012-12-17 15:59:39 -08:00
|
|
|
* @whence: type of seek
|
2023-08-11 09:43:59 +08:00
|
|
|
* @maxsize: max size of this file in file system
|
vfs: allow custom EOF in generic_file_llseek code
For ext3/4 htree directories, using the vfs llseek function with
SEEK_END goes to i_size like for any other file, but in reality
we want the maximum possible hash value. Recent changes
in ext4 have cut & pasted generic_file_llseek() back into fs/ext4/dir.c,
but replicating this core code seems like a bad idea, especially
since the copy has already diverged from the vfs.
This patch updates generic_file_llseek_size to accept
both a custom maximum offset, and a custom EOF position. With this
in place, ext4_dir_llseek can pass in the appropriate maximum hash
position for both maxsize and eof, and get what it wants.
As far as I know, this does not fix any bugs - nfs in the kernel
doesn't use SEEK_END, and I don't know of any user who does. But
some ext4 folks seem keen on doing the right thing here, and I can't
really argue.
(Patch also fixes up some comments slightly)
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-04-30 13:11:29 -05:00
|
|
|
* @eof: offset used for SEEK_END position
|
2008-08-11 15:37:17 +02:00
|
|
|
*
|
2011-09-15 16:06:50 -07:00
|
|
|
* This is a variant of generic_file_llseek that allows passing in a custom
|
2012-04-30 13:11:29 -05:00
|
|
|
* maximum file size and a custom EOF position, for e.g. hashed directories
|
2011-09-15 16:06:48 -07:00
|
|
|
*
|
|
|
|
* Synchronization:
|
2011-09-15 16:06:50 -07:00
|
|
|
* SEEK_SET and SEEK_END are unsynchronized (but atomic on 64bit platforms)
|
2011-09-15 16:06:48 -07:00
|
|
|
* SEEK_CUR is synchronized against other SEEK_CURs, but not read/writes.
|
|
|
|
* read/writes behave like SEEK_SET against seeks.
|
2008-08-11 15:37:17 +02:00
|
|
|
*/
|
2008-06-27 11:05:24 +02:00
|
|
|
loff_t
|
2012-12-17 15:59:39 -08:00
|
|
|
generic_file_llseek_size(struct file *file, loff_t offset, int whence,
|
2012-04-30 13:11:29 -05:00
|
|
|
loff_t maxsize, loff_t eof)
|
2005-04-16 15:20:36 -07:00
|
|
|
{
|
2024-08-30 15:04:48 +02:00
|
|
|
int ret;
|
|
|
|
|
|
|
|
ret = must_set_pos(file, &offset, whence, eof);
|
|
|
|
if (ret < 0)
|
|
|
|
return ret;
|
|
|
|
if (ret == 0)
|
2011-09-15 16:06:48 -07:00
|
|
|
return offset;
|
2024-08-30 15:04:48 +02:00
|
|
|
|
|
|
|
if (whence == SEEK_CUR) {
|
2011-07-18 13:21:35 -04:00
|
|
|
/*
|
2024-08-30 15:04:48 +02:00
|
|
|
* f_lock protects against read/modify/write race with
|
|
|
|
* other SEEK_CURs. Note that parallel writes and reads
|
|
|
|
* behave like SEEK_SET.
|
2011-07-18 13:21:35 -04:00
|
|
|
*/
|
2024-08-30 15:04:48 +02:00
|
|
|
guard(spinlock)(&file->f_lock);
|
|
|
|
return vfs_setpos(file, file->f_pos + offset, maxsize);
|
2005-04-16 15:20:36 -07:00
|
|
|
}
|
2008-08-11 15:37:17 +02:00
|
|
|
|
2013-06-25 12:02:13 +08:00
|
|
|
return vfs_setpos(file, offset, maxsize);
|
2011-09-15 16:06:50 -07:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(generic_file_llseek_size);
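The offset arithmetic of must_set_pos(), vfs_setpos() and generic_file_llseek_size() can be tried out in userspace with a small model. This is a sketch under stated assumptions, not kernel code: `struct model_file`, `model_setpos` and `model_llseek` are hypothetical names, locking is omitted, SEEK_DATA and SEEK_HOLE are left out, and an unknown whence is rejected here although the kernel validates whence earlier, in the syscall entry points.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>   /* SEEK_SET, SEEK_CUR, SEEK_END */

/* Hypothetical stand-in for the parts of struct file used by the seek path. */
struct model_file {
    int64_t pos;            /* models file->f_pos */
    bool unsigned_offsets;  /* models the FOP_UNSIGNED_OFFSET flag */
};

/* Models vfs_setpos(): validate the offset, store it only if it changed. */
static int64_t model_setpos(struct model_file *f, int64_t offset,
                            int64_t maxsize)
{
    if (offset < 0 && !f->unsigned_offsets)
        return -EINVAL;
    if (offset > maxsize)
        return -EINVAL;
    if (offset != f->pos)
        f->pos = offset;
    return offset;
}

/* Models generic_file_llseek_size() for SEEK_SET, SEEK_CUR and SEEK_END. */
static int64_t model_llseek(struct model_file *f, int64_t offset, int whence,
                            int64_t maxsize, int64_t eof)
{
    assert(f != NULL);

    switch (whence) {
    case SEEK_SET:
        break;
    case SEEK_END:
        offset += eof;          /* eof may differ from the real file size */
        break;
    case SEEK_CUR:
        if (offset == 0)
            return f->pos;      /* position query: nothing is written back */
        offset += f->pos;
        break;
    default:
        return -EINVAL;         /* simplification of this model only */
    }
    return model_setpos(f, offset, maxsize);
}
```

Passing separate `maxsize` and `eof` values mirrors the reason the kernel function takes both: a caller such as a hashed directory can place SEEK_END somewhere other than i_size while still bounding the offset.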
|
|
|
|
|
2024-08-30 15:04:49 +02:00
|
|
|
/**
|
|
|
|
* generic_llseek_cookie - versioned llseek implementation
|
|
|
|
* @file: file structure to seek on
|
|
|
|
* @offset: file offset to seek to
|
|
|
|
* @whence: type of seek
|
|
|
|
* @cookie: cookie to update
|
|
|
|
*
|
|
|
|
* See generic_file_llseek for a general description and locking assumptions.
|
|
|
|
*
|
|
|
|
* In contrast to generic_file_llseek, this function also resets a
|
|
|
|
* specified cookie to indicate a seek took place.
|
|
|
|
*/
|
|
|
|
loff_t generic_llseek_cookie(struct file *file, loff_t offset, int whence,
|
|
|
|
u64 *cookie)
|
|
|
|
{
|
|
|
|
struct inode *inode = file->f_mapping->host;
|
|
|
|
loff_t maxsize = inode->i_sb->s_maxbytes;
|
|
|
|
loff_t eof = i_size_read(inode);
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
if (WARN_ON_ONCE(!cookie))
|
|
|
|
return -EINVAL;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Require that this is only used for directories that guarantee
|
|
|
|
* synchronization between readdir and seek so that an update to
|
|
|
|
* @cookie is correctly synchronized with concurrent readdir.
|
|
|
|
*/
|
|
|
|
if (WARN_ON_ONCE(!(file->f_mode & FMODE_ATOMIC_POS)))
|
|
|
|
return -EINVAL;
|
|
|
|
|
|
|
|
ret = must_set_pos(file, &offset, whence, eof);
|
|
|
|
if (ret < 0)
|
|
|
|
return ret;
|
|
|
|
if (ret == 0)
|
|
|
|
return offset;
|
|
|
|
|
|
|
|
/* No need to hold f_lock because we know that f_pos_lock is held. */
|
|
|
|
if (whence == SEEK_CUR)
|
|
|
|
return vfs_setpos_cookie(file, file->f_pos + offset, maxsize, cookie);
|
|
|
|
|
|
|
|
return vfs_setpos_cookie(file, offset, maxsize, cookie);
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(generic_llseek_cookie);
|
|
|
|
|
2011-09-15 16:06:50 -07:00
|
|
|
/**
|
|
|
|
* generic_file_llseek - generic llseek implementation for regular files
|
|
|
|
* @file: file structure to seek on
|
|
|
|
* @offset: file offset to seek to
|
2012-12-17 15:59:39 -08:00
|
|
|
* @whence: type of seek
|
2011-09-15 16:06:50 -07:00
|
|
|
*
|
|
|
|
* This is a generic implementation of ->llseek usable for all normal local
|
|
|
|
* filesystems. It just updates the file offset to the value specified by
|
2013-04-29 15:06:07 -07:00
|
|
|
* @offset and @whence.
|
2011-09-15 16:06:50 -07:00
|
|
|
*/
|
2012-12-17 15:59:39 -08:00
|
|
|
loff_t generic_file_llseek(struct file *file, loff_t offset, int whence)
|
2011-09-15 16:06:50 -07:00
|
|
|
{
|
|
|
|
struct inode *inode = file->f_mapping->host;
|
|
|
|
|
2012-12-17 15:59:39 -08:00
|
|
|
return generic_file_llseek_size(file, offset, whence,
|
2012-04-30 13:11:29 -05:00
|
|
|
inode->i_sb->s_maxbytes,
|
|
|
|
i_size_read(inode));
|
2005-04-16 15:20:36 -07:00
|
|
|
}
|
2008-06-27 11:05:24 +02:00
|
|
|
EXPORT_SYMBOL(generic_file_llseek);
|
2005-04-16 15:20:36 -07:00
|
|
|
|
2013-06-16 20:27:42 +04:00
|
|
|
/**
|
|
|
|
* fixed_size_llseek - llseek implementation for fixed-sized devices
|
|
|
|
* @file: file structure to seek on
|
|
|
|
* @offset: file offset to seek to
|
|
|
|
* @whence: type of seek
|
|
|
|
* @size: size of the file
|
|
|
|
*
|
|
|
|
*/
|
|
|
|
loff_t fixed_size_llseek(struct file *file, loff_t offset, int whence, loff_t size)
|
|
|
|
{
|
|
|
|
switch (whence) {
|
|
|
|
case SEEK_SET: case SEEK_CUR: case SEEK_END:
|
|
|
|
return generic_file_llseek_size(file, offset, whence,
|
|
|
|
size, size);
|
|
|
|
default:
|
|
|
|
return -EINVAL;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(fixed_size_llseek);
|
|
|
|
|
2015-12-05 22:04:48 -05:00
|
|
|
/**
|
|
|
|
* no_seek_end_llseek - llseek implementation for fixed-sized devices
|
|
|
|
* @file: file structure to seek on
|
|
|
|
* @offset: file offset to seek to
|
|
|
|
* @whence: type of seek
|
|
|
|
*
|
|
|
|
*/
|
|
|
|
loff_t no_seek_end_llseek(struct file *file, loff_t offset, int whence)
|
|
|
|
{
|
|
|
|
switch (whence) {
|
|
|
|
case SEEK_SET: case SEEK_CUR:
|
|
|
|
return generic_file_llseek_size(file, offset, whence,
|
2016-02-16 22:20:59 +01:00
|
|
|
OFFSET_MAX, 0);
|
2015-12-05 22:04:48 -05:00
|
|
|
default:
|
|
|
|
return -EINVAL;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(no_seek_end_llseek);
|
|
|
|
|
|
|
|
/**
|
|
|
|
* no_seek_end_llseek_size - llseek implementation for fixed-sized devices
|
|
|
|
* @file: file structure to seek on
|
|
|
|
* @offset: file offset to seek to
|
|
|
|
* @whence: type of seek
|
|
|
|
* @size: maximal offset allowed
|
|
|
|
*
|
|
|
|
*/
|
|
|
|
loff_t no_seek_end_llseek_size(struct file *file, loff_t offset, int whence, loff_t size)
|
|
|
|
{
|
|
|
|
switch (whence) {
|
|
|
|
case SEEK_SET: case SEEK_CUR:
|
|
|
|
return generic_file_llseek_size(file, offset, whence,
|
|
|
|
size, 0);
|
|
|
|
default:
|
|
|
|
return -EINVAL;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(no_seek_end_llseek_size);
|
|
|
|
|
2010-05-26 14:44:48 -07:00
|
|
|
/**
|
|
|
|
* noop_llseek - No Operation Performed llseek implementation
|
|
|
|
* @file: file structure to seek on
|
|
|
|
* @offset: file offset to seek to
|
2012-12-17 15:59:39 -08:00
|
|
|
* @whence: type of seek
|
2010-05-26 14:44:48 -07:00
|
|
|
*
|
|
|
|
* This is an implementation of ->llseek useable for the rare special case when
|
|
|
|
* userspace expects the seek to succeed but the (device) file is actually not
|
|
|
|
* able to perform the seek. In this case you use noop_llseek() instead of
|
|
|
|
* falling back to the default implementation of ->llseek.
|
|
|
|
*/
|
2012-12-17 15:59:39 -08:00
|
|
|
loff_t noop_llseek(struct file *file, loff_t offset, int whence)
|
2010-05-26 14:44:48 -07:00
|
|
|
{
|
|
|
|
return file->f_pos;
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(noop_llseek);
|
|
|
|
|
2012-12-17 15:59:39 -08:00
|
|
|
loff_t default_llseek(struct file *file, loff_t offset, int whence)
|
2005-04-16 15:20:36 -07:00
|
|
|
{
|
2013-01-23 17:07:38 -05:00
|
|
|
struct inode *inode = file_inode(file);
|
2008-04-22 15:09:22 +02:00
|
|
|
loff_t retval;
|
2005-04-16 15:20:36 -07:00
|
|
|
|
2016-01-22 15:40:57 -05:00
|
|
|
inode_lock(inode);
|
2012-12-17 15:59:39 -08:00
|
|
|
switch (whence) {
|
2007-05-08 00:24:13 -07:00
|
|
|
case SEEK_END:
|
2011-07-18 13:21:35 -04:00
|
|
|
offset += i_size_read(inode);
|
2005-04-16 15:20:36 -07:00
|
|
|
break;
|
2007-05-08 00:24:13 -07:00
|
|
|
case SEEK_CUR:
|
2008-11-10 17:08:08 -08:00
|
|
|
if (offset == 0) {
|
|
|
|
retval = file->f_pos;
|
|
|
|
goto out;
|
|
|
|
}
|
2005-04-16 15:20:36 -07:00
|
|
|
offset += file->f_pos;
|
2011-07-18 13:21:35 -04:00
|
|
|
break;
|
|
|
|
case SEEK_DATA:
|
|
|
|
/*
|
|
|
|
* In the generic case the entire file is data, so as
|
|
|
|
* long as offset isn't at the end of the file then the
|
|
|
|
* offset is data.
|
|
|
|
*/
|
2011-07-26 17:25:20 +03:00
|
|
|
if (offset >= inode->i_size) {
|
|
|
|
retval = -ENXIO;
|
|
|
|
goto out;
|
|
|
|
}
|
2011-07-18 13:21:35 -04:00
|
|
|
break;
|
|
|
|
case SEEK_HOLE:
|
|
|
|
/*
|
|
|
|
* There is a virtual hole at the end of the file, so
|
|
|
|
* as long as offset isn't i_size or larger, return
|
|
|
|
* i_size.
|
|
|
|
*/
|
2011-07-26 17:25:20 +03:00
|
|
|
if (offset >= inode->i_size) {
|
|
|
|
retval = -ENXIO;
|
|
|
|
goto out;
|
|
|
|
}
|
2011-07-18 13:21:35 -04:00
|
|
|
offset = inode->i_size;
|
|
|
|
break;
|
2005-04-16 15:20:36 -07:00
|
|
|
}
|
|
|
|
retval = -EINVAL;
|
2010-12-17 07:44:05 -05:00
|
|
|
if (offset >= 0 || unsigned_offsets(file)) {
|
2024-08-30 15:05:01 +02:00
|
|
|
if (offset != file->f_pos)
|
2005-04-16 15:20:36 -07:00
|
|
|
file->f_pos = offset;
|
|
|
|
retval = offset;
|
|
|
|
}
|
2008-11-10 17:08:08 -08:00
|
|
|
out:
|
2016-01-22 15:40:57 -05:00
|
|
|
inode_unlock(inode);
|
2005-04-16 15:20:36 -07:00
|
|
|
return retval;
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(default_llseek);
|
|
|
|
|
2012-12-17 15:59:39 -08:00
|
|
|
loff_t vfs_llseek(struct file *file, loff_t offset, int whence)
|
2005-04-16 15:20:36 -07:00
|
|
|
{
|
2022-06-29 15:06:59 +02:00
|
|
|
if (!(file->f_mode & FMODE_LSEEK))
|
|
|
|
return -ESPIPE;
|
|
|
|
return file->f_op->llseek(file, offset, whence);
|
2005-04-16 15:20:36 -07:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(vfs_llseek);
|
|
|
|
|
2020-06-06 14:49:58 +02:00
|
|
|
static off_t ksys_lseek(unsigned int fd, off_t offset, unsigned int whence)
|
2005-04-16 15:20:36 -07:00
|
|
|
{
|
|
|
|
off_t retval;
|
2024-05-31 22:10:12 -04:00
|
|
|
CLASS(fd_pos, f)(fd);
|
|
|
|
if (fd_empty(f))
|
2012-08-28 12:52:22 -04:00
|
|
|
return -EBADF;
|
2005-04-16 15:20:36 -07:00
|
|
|
|
|
|
|
retval = -EINVAL;
|
2012-12-17 15:59:39 -08:00
|
|
|
if (whence <= SEEK_MAX) {
|
2024-05-31 14:12:01 -04:00
|
|
|
loff_t res = vfs_llseek(fd_file(f), offset, whence);
|
2005-04-16 15:20:36 -07:00
|
|
|
retval = res;
|
|
|
|
if (res != (loff_t)retval)
|
|
|
|
retval = -EOVERFLOW; /* LFS: should only happen on 32 bit platforms */
|
|
|
|
}
|
|
|
|
return retval;
|
|
|
|
}
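The narrowing check at the end of ksys_lseek() can be modelled in isolation. The sketch below assumes a 32-bit off_t and uses fixed-width types to stand in for off_t and loff_t; `narrow_offset` is a hypothetical name, not a kernel function.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/*
 * Models "retval = res; if (res != (loff_t)retval) retval = -EOVERFLOW;"
 * with int32_t standing in for a 32-bit off_t and int64_t for loff_t.
 */
static int32_t narrow_offset(int64_t res)
{
    /* The conversion may truncate; the result is implementation-defined. */
    int32_t retval = (int32_t)res;

    /* Widening back detects any value that did not survive the round trip. */
    if (res != (int64_t)retval)
        retval = -EOVERFLOW;
    return retval;
}

/* A few expected results, written as assertions. */
static void narrow_offset_examples(void)
{
    assert(narrow_offset(4096) == 4096);
    assert(narrow_offset(-EINVAL) == -EINVAL);   /* errors pass through */
    assert(narrow_offset((int64_t)1 << 32) == -EOVERFLOW);
}
```

On a 64-bit kernel off_t and loff_t have the same width, so the comparison can never fail there, which is what the "should only happen on 32 bit platforms" comment refers to.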
|
|
|
|
|
2018-03-13 21:51:17 +01:00
|
|
|
SYSCALL_DEFINE3(lseek, unsigned int, fd, off_t, offset, unsigned int, whence)
|
|
|
|
{
|
|
|
|
return ksys_lseek(fd, offset, whence);
|
|
|
|
}
|
|
|
|
|
2013-02-24 10:52:26 -05:00
|
|
|
#ifdef CONFIG_COMPAT
|
|
|
|
COMPAT_SYSCALL_DEFINE3(lseek, unsigned int, fd, compat_off_t, offset, unsigned int, whence)
|
|
|
|
{
|
2018-03-13 21:51:17 +01:00
|
|
|
return ksys_lseek(fd, offset, whence);
|
2013-02-24 10:52:26 -05:00
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
2020-03-20 11:20:12 +01:00
|
|
|
#if !defined(CONFIG_64BIT) || defined(CONFIG_COMPAT) || \
|
|
|
|
defined(__ARCH_WANT_SYS_LLSEEK)
|
2009-01-14 14:14:21 +01:00
|
|
|
SYSCALL_DEFINE5(llseek, unsigned int, fd, unsigned long, offset_high,
|
|
|
|
unsigned long, offset_low, loff_t __user *, result,
|
2012-12-17 15:59:39 -08:00
|
|
|
unsigned int, whence)
|
2005-04-16 15:20:36 -07:00
|
|
|
{
|
|
|
|
int retval;
|
2024-05-31 22:10:12 -04:00
|
|
|
CLASS(fd_pos, f)(fd);
|
2005-04-16 15:20:36 -07:00
|
|
|
loff_t offset;
|
|
|
|
|
2024-05-31 22:10:12 -04:00
|
|
|
if (fd_empty(f))
|
2012-08-28 12:52:22 -04:00
|
|
|
return -EBADF;
|
2005-04-16 15:20:36 -07:00
|
|
|
|
2012-12-17 15:59:39 -08:00
|
|
|
if (whence > SEEK_MAX)
|
2024-05-31 22:10:12 -04:00
|
|
|
return -EINVAL;
|
2005-04-16 15:20:36 -07:00
|
|
|
|
2024-05-31 14:12:01 -04:00
|
|
|
offset = vfs_llseek(fd_file(f), ((loff_t) offset_high << 32) | offset_low,
|
2012-12-17 15:59:39 -08:00
|
|
|
whence);
|
2005-04-16 15:20:36 -07:00
|
|
|
|
|
|
|
retval = (int)offset;
|
|
|
|
if (offset >= 0) {
|
|
|
|
retval = -EFAULT;
|
|
|
|
if (!copy_to_user(result, &offset, sizeof(offset)))
|
|
|
|
retval = 0;
|
|
|
|
}
|
|
|
|
return retval;
|
|
|
|
}
|
|
|
|
#endif
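The llseek syscall above receives the 64-bit offset as two halves. A minimal sketch of how they are joined follows; `combine_offset` is a hypothetical helper, and it assumes each half carries exactly 32 significant bits, which is why it takes uint32_t rather than the syscall's unsigned long.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: join the high and low halves of an llseek offset. */
static int64_t combine_offset(uint32_t offset_high, uint32_t offset_low)
{
    /* Shift in unsigned 64-bit arithmetic, then convert to a signed offset. */
    return (int64_t)(((uint64_t)offset_high << 32) | offset_low);
}

/* A few expected results, written as assertions. */
static void combine_offset_examples(void)
{
    assert(combine_offset(0, 4096) == 4096);
    assert(combine_offset(1, 0) == 4294967296LL);   /* 2^32 */
    assert(combine_offset(1, 5) == 4294967301LL);   /* 2^32 + 5 */
}
```

The result goes back to userspace through the `result` pointer rather than the return value, because a 32-bit return register cannot hold the full offset.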
|
|
|
|
|
2013-06-19 15:26:04 +04:00
|
|
|
int rw_verify_area(int read_write, struct file *file, const loff_t *ppos, size_t count)
|
2005-04-16 15:20:36 -07:00
|
|
|
{
|
2023-12-12 11:44:40 +02:00
|
|
|
int mask = read_write == READ ? MAY_READ : MAY_WRITE;
|
|
|
|
int ret;
|
|
|
|
|
2006-01-04 16:20:40 -08:00
|
|
|
if (unlikely((ssize_t) count < 0))
|
2021-08-24 13:12:59 +02:00
|
|
|
return -EINVAL;
|
2005-04-16 15:20:36 -07:00
|
|
|
|
2019-04-12 12:31:57 +03:00
|
|
|
if (ppos) {
|
|
|
|
loff_t pos = *ppos;
|
|
|
|
|
|
|
|
if (unlikely(pos < 0)) {
|
|
|
|
if (!unsigned_offsets(file))
|
2021-08-24 13:12:59 +02:00
|
|
|
return -EINVAL;
|
2019-04-12 12:31:57 +03:00
|
|
|
if (count >= -pos) /* both values are in 0..LLONG_MAX */
|
|
|
|
return -EOVERFLOW;
|
|
|
|
} else if (unlikely((loff_t) (pos + count) < 0)) {
|
|
|
|
if (!unsigned_offsets(file))
|
2021-08-24 13:12:59 +02:00
|
|
|
return -EINVAL;
|
2019-04-12 12:31:57 +03:00
|
|
|
}
|
2006-01-04 16:20:40 -08:00
|
|
|
}
|
2019-04-12 12:31:57 +03:00
|
|
|
|
2023-12-12 11:44:40 +02:00
|
|
|
ret = security_file_permission(file, mask);
|
|
|
|
if (ret)
|
|
|
|
return ret;
|
|
|
|
|
|
|
|
return fsnotify_file_area_perm(file, mask, ppos, count);
|
2005-04-16 15:20:36 -07:00
|
|
|
}
|
2019-09-04 12:13:25 -07:00
|
|
|
EXPORT_SYMBOL(rw_verify_area);
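The range checks at the top of rw_verify_area() can be modelled in userspace as below. This is a sketch under assumptions: `check_rw_range` is a hypothetical name, the boolean parameter stands in for `unsigned_offsets(file)`, and the security and fsnotify permission hooks that follow in the real function are left out entirely.

```c
#include <errno.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the position/count validation only. */
int check_rw_range(long long pos, size_t count, bool unsigned_offsets)
{
	/* Same effect as "(ssize_t) count < 0": the top bit must be clear. */
	if (count > SIZE_MAX / 2)
		return -EINVAL;

	if (pos < 0) {
		/* Negative positions are valid only for unsigned-offset files. */
		if (!unsigned_offsets)
			return -EINVAL;
		/* Magnitude of pos, computed without negating a signed value. */
		if (count >= 0ULL - (unsigned long long)pos)
			return -EOVERFLOW;
	} else if (count > (unsigned long long)LLONG_MAX - (unsigned long long)pos) {
		/* pos + count would pass LLONG_MAX, i.e. wrap to a negative offset. */
		if (!unsigned_offsets)
			return -EINVAL;
	}
	return 0;
}
```

The `else if` branch restates the kernel's `(loff_t) (pos + count) < 0` test as a comparison that cannot itself overflow.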
|
2005-04-16 15:20:36 -07:00
|
|
|
|
2015-04-03 15:41:18 -04:00
|
|
|
static ssize_t new_sync_read(struct file *filp, char __user *buf, size_t len, loff_t *ppos)
|
2014-02-11 18:37:41 -05:00
|
|
|
{
|
|
|
|
struct kiocb kiocb;
|
|
|
|
struct iov_iter iter;
|
|
|
|
ssize_t ret;
|
|
|
|
|
|
|
|
init_sync_kiocb(&kiocb, filp);
|
2019-04-12 12:31:57 +03:00
|
|
|
kiocb.ki_pos = (ppos ? *ppos : 0);
|
2022-09-15 20:25:47 -04:00
|
|
|
iov_iter_ubuf(&iter, ITER_DEST, buf, len);
|
2014-02-11 18:37:41 -05:00
|
|
|
|
2023-08-28 17:13:18 +02:00
|
|
|
ret = filp->f_op->read_iter(&kiocb, &iter);
|
2015-02-11 19:59:44 +01:00
|
|
|
BUG_ON(ret == -EIOCBQUEUED);
|
2019-04-12 12:31:57 +03:00
|
|
|
if (ppos)
|
|
|
|
*ppos = kiocb.ki_pos;
|
2014-02-11 18:37:41 -05:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2020-09-03 16:22:33 +02:00
|
|
|
static int warn_unsupported(struct file *file, const char *op)
|
|
|
|
{
|
|
|
|
pr_warn_ratelimited(
|
|
|
|
"kernel %s not supported for file %pD4 (pid: %d comm: %.20s)\n",
|
|
|
|
op, file, current->pid, current->comm);
|
|
|
|
return -EINVAL;
|
|
|
|
}
|
|
|
|
|
2020-05-08 08:54:16 +02:00
|
|
|
ssize_t __kernel_read(struct file *file, void *buf, size_t count, loff_t *pos)
|
|
|
|
{
|
2020-09-03 16:22:33 +02:00
|
|
|
struct kvec iov = {
|
|
|
|
.iov_base = buf,
|
|
|
|
.iov_len = min_t(size_t, count, MAX_RW_COUNT),
|
|
|
|
};
|
|
|
|
struct kiocb kiocb;
|
|
|
|
struct iov_iter iter;
|
2020-05-08 08:54:16 +02:00
|
|
|
ssize_t ret;
|
|
|
|
|
|
|
|
if (WARN_ON_ONCE(!(file->f_mode & FMODE_READ)))
|
|
|
|
return -EINVAL;
|
|
|
|
if (!(file->f_mode & FMODE_CAN_READ))
|
|
|
|
return -EINVAL;
|
2020-09-03 16:22:33 +02:00
|
|
|
/*
|
|
|
|
* Also fail if ->read_iter and ->read are both wired up as that
|
|
|
|
* implies very convoluted semantics.
|
|
|
|
*/
|
|
|
|
if (unlikely(!file->f_op->read_iter || file->f_op->read))
|
|
|
|
return warn_unsupported(file, "read");
|
2020-05-08 08:54:16 +02:00
|
|
|
|
2020-09-03 16:22:33 +02:00
|
|
|
init_sync_kiocb(&kiocb, file);
|
2020-10-03 03:55:23 +01:00
|
|
|
kiocb.ki_pos = pos ? *pos : 0;
|
2022-09-15 20:25:47 -04:00
|
|
|
iov_iter_kvec(&iter, ITER_DEST, &iov, 1, iov.iov_len);
|
2020-09-03 16:22:33 +02:00
|
|
|
ret = file->f_op->read_iter(&kiocb, &iter);
|
2020-05-08 08:54:16 +02:00
|
|
|
if (ret > 0) {
|
2020-10-03 03:55:23 +01:00
|
|
|
if (pos)
|
|
|
|
*pos = kiocb.ki_pos;
|
2020-05-08 08:54:16 +02:00
|
|
|
fsnotify_access(file);
|
|
|
|
add_rchar(current, ret);
|
|
|
|
}
|
|
|
|
inc_syscr(current);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2017-09-01 17:39:13 +02:00
|
|
|
ssize_t kernel_read(struct file *file, void *buf, size_t count, loff_t *pos)
|
2017-09-01 17:39:12 +02:00
|
|
|
{
|
2020-05-08 09:00:28 +02:00
|
|
|
ssize_t ret;
|
2017-09-01 17:39:12 +02:00
|
|
|
|
2020-05-08 09:00:28 +02:00
|
|
|
ret = rw_verify_area(READ, file, pos, count);
|
|
|
|
if (ret)
|
|
|
|
return ret;
|
|
|
|
return __kernel_read(file, buf, count, pos);
|
2017-09-01 17:39:12 +02:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(kernel_read);
|
2014-11-05 17:01:17 +02:00
|
|
|
|
2005-04-16 15:20:36 -07:00
|
|
|
ssize_t vfs_read(struct file *file, char __user *buf, size_t count, loff_t *pos)
|
|
|
|
{
|
|
|
|
ssize_t ret;
|
|
|
|
|
|
|
|
if (!(file->f_mode & FMODE_READ))
|
|
|
|
return -EBADF;
|
2014-02-11 17:49:24 -05:00
|
|
|
if (!(file->f_mode & FMODE_CAN_READ))
|
2005-04-16 15:20:36 -07:00
|
|
|
return -EINVAL;
|
Remove 'type' argument from access_ok() function
Nobody has actually used the type (VERIFY_READ vs VERIFY_WRITE) argument
of the user address range verification function since we got rid of the
old racy i386-only code to walk page tables by hand.
It existed because the original 80386 would not honor the write protect
bit when in kernel mode, so you had to do COW by hand before doing any
user access. But we haven't supported that in a long time, and these
days the 'type' argument is a purely historical artifact.
A discussion about extending 'user_access_begin()' to do the range
checking resulted in this patch, because there is no way we're going to
move the old VERIFY_xyz interface to that model. And it's best done at
the end of the merge window when I've done most of my merges, so let's
just get this done once and for all.
This patch was mostly done with a sed-script, with manual fix-ups for
the cases that weren't of the trivial 'access_ok(VERIFY_xyz' form.
There were a couple of notable cases:
- csky still had the old "verify_area()" name as an alias.
- the iter_iov code had magical hardcoded knowledge of the actual
values of VERIFY_{READ,WRITE} (not that they mattered, since nothing
really used it)
- microblaze used the type argument for a debug printout
but other than those oddities this should be a total no-op patch.
I tried to fix up all architectures, did fairly extensive grepping for
access_ok() uses, and the changes are trivial, but I may have missed
something. Any missed conversion should be trivially fixable, though.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-01-03 18:57:57 -08:00
|
|
|
if (unlikely(!access_ok(buf, count)))
|
2005-04-16 15:20:36 -07:00
|
|
|
return -EFAULT;
|
|
|
|
|
|
|
|
ret = rw_verify_area(READ, file, pos, count);
|
2020-05-08 11:17:46 +02:00
|
|
|
if (ret)
|
|
|
|
return ret;
|
|
|
|
if (count > MAX_RW_COUNT)
|
|
|
|
count = MAX_RW_COUNT;
|
2005-04-16 15:20:36 -07:00
|
|
|
|
2020-05-08 11:17:46 +02:00
|
|
|
if (file->f_op->read)
|
|
|
|
ret = file->f_op->read(file, buf, count, pos);
|
|
|
|
else if (file->f_op->read_iter)
|
|
|
|
ret = new_sync_read(file, buf, count, pos);
|
|
|
|
else
|
|
|
|
ret = -EINVAL;
|
|
|
|
if (ret > 0) {
|
|
|
|
fsnotify_access(file);
|
|
|
|
add_rchar(current, ret);
|
|
|
|
}
|
|
|
|
inc_syscr(current);
|
2005-04-16 15:20:36 -07:00
|
|
|
return ret;
|
|
|
|
}
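The tail of vfs_read() clamps the request size and then picks a method: `->read` if present, otherwise `->read_iter`, otherwise `-EINVAL`. The sketch below models just that dispatch. All names in it are hypothetical, the two callbacks are toy stand-ins, and the value used for the clamp assumes `MAX_RW_COUNT` is `INT_MAX` rounded down to a 4 KiB page boundary, which this listing does not show.

```c
#include <errno.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed value: INT_MAX rounded down to a 4 KiB page boundary. */
#define MODEL_MAX_RW_COUNT ((size_t)INT_MAX & ~(size_t)4095)

/* Hypothetical miniature of the two read methods in file_operations. */
struct model_fops {
	long (*read)(size_t count);
	long (*read_iter)(size_t count);
};

/* Toy method: pretends every requested byte was read. */
long echo_read(size_t count)
{
	return (long)count;
}

/* Toy method: pretends only half of the request was read. */
long half_read(size_t count)
{
	return (long)(count / 2);
}

/* Clamp the count, then prefer ->read over ->read_iter. */
long model_vfs_read(const struct model_fops *ops, size_t count)
{
	if (count > MODEL_MAX_RW_COUNT)
		count = MODEL_MAX_RW_COUNT;

	if (ops->read)
		return ops->read(count);
	if (ops->read_iter)
		return ops->read_iter(count);
	return -EINVAL;
}
```

The mode checks, access_ok(), rw_verify_area() and the accounting calls of the real function are deliberately omitted.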
|
|
|
|
|
2015-04-03 15:41:18 -04:00
|
|
|
static ssize_t new_sync_write(struct file *filp, const char __user *buf, size_t len, loff_t *ppos)
|
2014-02-11 18:37:41 -05:00
|
|
|
{
|
|
|
|
struct kiocb kiocb;
|
|
|
|
struct iov_iter iter;
|
|
|
|
ssize_t ret;
|
|
|
|
|
|
|
|
init_sync_kiocb(&kiocb, filp);
|
2019-04-12 12:31:57 +03:00
|
|
|
kiocb.ki_pos = (ppos ? *ppos : 0);
|
2022-09-15 20:25:47 -04:00
|
|
|
iov_iter_ubuf(&iter, ITER_SOURCE, (void __user *)buf, len);
|
2014-02-11 18:37:41 -05:00
|
|
|
|
2023-08-28 17:13:18 +02:00
|
|
|
ret = filp->f_op->write_iter(&kiocb, &iter);
|
2015-02-11 19:59:44 +01:00
|
|
|
BUG_ON(ret == -EIOCBQUEUED);
|
2019-04-12 12:31:57 +03:00
|
|
|
if (ret > 0 && ppos)
|
2015-04-06 20:50:38 -04:00
|
|
|
*ppos = kiocb.ki_pos;
|
2014-02-11 18:37:41 -05:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2020-05-07 19:33:03 +02:00
|
|
|
/* caller is responsible for file_start_write/file_end_write */
|
2022-09-26 11:59:14 -04:00
|
|
|
ssize_t __kernel_write_iter(struct file *file, struct iov_iter *from, loff_t *pos)
|
2013-03-20 13:19:30 -04:00
|
|
|
{
|
2020-09-03 16:22:33 +02:00
|
|
|
struct kiocb kiocb;
|
2013-03-20 13:19:30 -04:00
|
|
|
ssize_t ret;
|
|
|
|
|
2020-05-08 08:55:03 +02:00
|
|
|
if (WARN_ON_ONCE(!(file->f_mode & FMODE_WRITE)))
|
|
|
|
return -EBADF;
|
2014-02-11 17:49:24 -05:00
|
|
|
if (!(file->f_mode & FMODE_CAN_WRITE))
|
2013-03-27 15:20:30 +00:00
|
|
|
return -EINVAL;
|
2020-09-03 16:22:33 +02:00
|
|
|
/*
|
|
|
|
* Also fail if ->write_iter and ->write are both wired up as that
|
|
|
|
* implies very convoluted semantics.
|
|
|
|
*/
|
|
|
|
if (unlikely(!file->f_op->write_iter || file->f_op->write))
|
|
|
|
return warn_unsupported(file, "write");
|
2013-03-27 15:20:30 +00:00
|
|
|
|
2020-09-03 16:22:33 +02:00
|
|
|
init_sync_kiocb(&kiocb, file);
|
2020-10-03 03:55:22 +01:00
|
|
|
kiocb.ki_pos = pos ? *pos : 0;
|
2022-09-26 11:59:14 -04:00
|
|
|
ret = file->f_op->write_iter(&kiocb, from);
|
2013-03-20 13:19:30 -04:00
|
|
|
if (ret > 0) {
|
2020-10-03 03:55:22 +01:00
|
|
|
if (pos)
|
|
|
|
*pos = kiocb.ki_pos;
|
2013-03-20 13:19:30 -04:00
|
|
|
fsnotify_modify(file);
|
|
|
|
add_wchar(current, ret);
|
|
|
|
}
|
|
|
|
inc_syscw(current);
|
|
|
|
return ret;
|
|
|
|
}
|
2022-09-26 11:59:14 -04:00
|
|
|
|
|
|
|
/* caller is responsible for file_start_write/file_end_write */
|
|
|
|
ssize_t __kernel_write(struct file *file, const void *buf, size_t count, loff_t *pos)
|
|
|
|
{
|
|
|
|
struct kvec iov = {
|
|
|
|
.iov_base = (void *)buf,
|
|
|
|
.iov_len = min_t(size_t, count, MAX_RW_COUNT),
|
|
|
|
};
|
|
|
|
struct iov_iter iter;
|
2022-09-15 20:25:47 -04:00
|
|
|
iov_iter_kvec(&iter, ITER_SOURCE, &iov, 1, iov.iov_len);
|
2022-09-26 11:59:14 -04:00
|
|
|
return __kernel_write_iter(file, &iter, pos);
|
|
|
|
}
|
2020-09-29 17:18:34 -07:00
|
|
|
/*
|
|
|
|
* This "EXPORT_SYMBOL_GPL()" is more of a "EXPORT_SYMBOL_DONTUSE()",
|
|
|
|
* but autofs is one of the few internal kernel users that actually
|
|
|
|
* wants this _and_ can be built as a module. So we need to export
|
|
|
|
* this symbol for autofs, even though it really isn't appropriate
|
|
|
|
* for any other kernel modules.
|
|
|
|
*/
|
|
|
|
EXPORT_SYMBOL_GPL(__kernel_write);
|
2014-08-19 11:48:09 -04:00
|
|
|
|
2017-09-01 17:39:14 +02:00
|
|
|
ssize_t kernel_write(struct file *file, const void *buf, size_t count,
|
|
|
|
loff_t *pos)
|
2017-09-01 17:39:11 +02:00
|
|
|
{
|
2020-05-07 19:33:03 +02:00
|
|
|
ssize_t ret;
|
2017-09-01 17:39:11 +02:00
|
|
|
|
2020-05-07 19:33:03 +02:00
|
|
|
ret = rw_verify_area(WRITE, file, pos, count);
|
|
|
|
if (ret)
|
|
|
|
return ret;
|
2017-09-01 17:39:11 +02:00
|
|
|
|
2020-05-07 19:33:03 +02:00
|
|
|
file_start_write(file);
|
|
|
|
ret = __kernel_write(file, buf, count, pos);
|
|
|
|
file_end_write(file);
|
|
|
|
return ret;
|
2017-09-01 17:39:11 +02:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(kernel_write);
|
|
|
|
|
2005-04-16 15:20:36 -07:00
|
|
|
ssize_t vfs_write(struct file *file, const char __user *buf, size_t count, loff_t *pos)
|
|
|
|
{
|
|
|
|
ssize_t ret;
|
|
|
|
|
|
|
|
if (!(file->f_mode & FMODE_WRITE))
|
|
|
|
return -EBADF;
|
2014-02-11 17:49:24 -05:00
|
|
|
if (!(file->f_mode & FMODE_CAN_WRITE))
|
2005-04-16 15:20:36 -07:00
|
|
|
return -EINVAL;
|
2019-01-03 18:57:57 -08:00
|
|
|
if (unlikely(!access_ok(buf, count)))
|
2005-04-16 15:20:36 -07:00
|
|
|
return -EFAULT;
|
|
|
|
|
|
|
|
ret = rw_verify_area(WRITE, file, pos, count);
|
2020-05-13 08:51:46 +02:00
|
|
|
if (ret)
|
|
|
|
return ret;
|
|
|
|
if (count > MAX_RW_COUNT)
|
|
|
|
count = MAX_RW_COUNT;
|
|
|
|
file_start_write(file);
|
|
|
|
if (file->f_op->write)
|
|
|
|
ret = file->f_op->write(file, buf, count, pos);
|
|
|
|
else if (file->f_op->write_iter)
|
|
|
|
ret = new_sync_write(file, buf, count, pos);
|
|
|
|
else
|
|
|
|
ret = -EINVAL;
|
|
|
|
if (ret > 0) {
|
|
|
|
fsnotify_modify(file);
|
|
|
|
add_wchar(current, ret);
|
2005-04-16 15:20:36 -07:00
|
|
|
}
|
2020-05-13 08:51:46 +02:00
|
|
|
inc_syscw(current);
|
|
|
|
file_end_write(file);
|
2005-04-16 15:20:36 -07:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2019-04-12 12:31:57 +03:00
|
|
|
/* file_ppos returns &file->f_pos or NULL if file is stream */
|
|
|
|
static inline loff_t *file_ppos(struct file *file)
|
2005-04-16 15:20:36 -07:00
|
|
|
{
|
2019-04-12 12:31:57 +03:00
|
|
|
return file->f_mode & FMODE_STREAM ? NULL : &file->f_pos;
|
2005-04-16 15:20:36 -07:00
|
|
|
}
|
|
|
|
|
2018-03-13 21:56:26 +01:00
|
|
|
ssize_t ksys_read(unsigned int fd, char __user *buf, size_t count)
|
2005-04-16 15:20:36 -07:00
|
|
|
{
|
2024-05-31 22:10:12 -04:00
|
|
|
CLASS(fd_pos, f)(fd);
|
2005-04-16 15:20:36 -07:00
|
|
|
ssize_t ret = -EBADF;
|
|
|
|
|
2024-05-31 22:10:12 -04:00
|
|
|
if (!fd_empty(f)) {
|
2024-05-31 14:12:01 -04:00
|
|
|
loff_t pos, *ppos = file_ppos(fd_file(f));
|
2019-04-12 12:31:57 +03:00
|
|
|
if (ppos) {
|
|
|
|
pos = *ppos;
|
|
|
|
ppos = &pos;
|
|
|
|
}
|
2024-05-31 14:12:01 -04:00
|
|
|
ret = vfs_read(fd_file(f), buf, count, ppos);
|
2019-04-12 12:31:57 +03:00
|
|
|
if (ret >= 0 && ppos)
|
2024-05-31 14:12:01 -04:00
|
|
|
fd_file(f)->f_pos = pos;
|
2005-04-16 15:20:36 -07:00
|
|
|
}
|
|
|
|
return ret;
|
|
|
|
}
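ksys_read() does not hand `&file->f_pos` to vfs_read() directly. It copies the position into a local, passes a pointer to that copy, and stores the result back only when the read did not fail. A userspace sketch of the pattern follows, with hypothetical types and a toy read routine standing in for vfs_read().

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical miniature of struct file for this illustration. */
struct model_file {
	long long f_pos;  /* shared position, like file->f_pos */
	long long size;   /* bytes available in this toy file */
	bool is_stream;   /* models FMODE_STREAM: no position at all */
	bool fail;        /* force the toy read to fail */
};

/* Toy stand-in for vfs_read(): advances *ppos by the bytes "read". */
long model_read(struct model_file *file, size_t count, long long *ppos)
{
	long long pos = ppos ? *ppos : 0;
	long long left = file->size - pos;
	long n;

	if (file->fail) {
		/* Scribble on the position before failing, as a worst case. */
		if (ppos)
			*ppos = pos + 1;
		return -5; /* stands in for a negative errno */
	}
	if (left <= 0)
		return 0;
	n = count < (unsigned long long)left ? (long)count : (long)left;
	if (ppos)
		*ppos = pos + n;
	return n;
}

/* Same shape as ksys_read(): snapshot, call, commit on success only. */
long model_ksys_read(struct model_file *file, size_t count)
{
	long long pos;
	long long *ppos = file->is_stream ? NULL : &file->f_pos;
	long ret;

	if (ppos) {
		pos = *ppos;
		ppos = &pos;
	}
	ret = model_read(file, count, ppos);
	if (ret >= 0 && ppos)
		file->f_pos = pos;
	return ret;
}
```

Because the callee only ever sees the local copy, a failed call leaves the shared position untouched even if the callee modified it on the way out.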
|
|
|
|
|
2018-03-13 21:56:26 +01:00
|
|
|
SYSCALL_DEFINE3(read, unsigned int, fd, char __user *, buf, size_t, count)
|
|
|
|
{
|
|
|
|
return ksys_read(fd, buf, count);
|
|
|
|
}
|
|
|
|
|
2018-03-11 11:34:41 +01:00
|
|
|
ssize_t ksys_write(unsigned int fd, const char __user *buf, size_t count)
|
2005-04-16 15:20:36 -07:00
|
|
|
{
|
2024-05-31 22:10:12 -04:00
|
|
|
CLASS(fd_pos, f)(fd);
|
2005-04-16 15:20:36 -07:00
|
|
|
ssize_t ret = -EBADF;
|
|
|
|
|
2024-05-31 22:10:12 -04:00
|
|
|
if (!fd_empty(f)) {
|
2024-05-31 14:12:01 -04:00
|
|
|
loff_t pos, *ppos = file_ppos(fd_file(f));
|
2019-04-12 12:31:57 +03:00
|
|
|
if (ppos) {
|
|
|
|
pos = *ppos;
|
|
|
|
ppos = &pos;
|
|
|
|
}
|
2024-05-31 14:12:01 -04:00
|
|
|
ret = vfs_write(fd_file(f), buf, count, ppos);
|
2019-04-12 12:31:57 +03:00
|
|
|
if (ret >= 0 && ppos)
|
2024-05-31 14:12:01 -04:00
|
|
|
fd_file(f)->f_pos = pos;
|
2005-04-16 15:20:36 -07:00
|
|
|
}
|
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2018-03-11 11:34:41 +01:00
|
|
|
SYSCALL_DEFINE3(write, unsigned int, fd, const char __user *, buf,
|
|
|
|
size_t, count)
|
|
|
|
{
|
|
|
|
return ksys_write(fd, buf, count);
|
|
|
|
}
|
|
|
|
|
2018-03-19 17:38:31 +01:00
|
|
|
ssize_t ksys_pread64(unsigned int fd, char __user *buf, size_t count,
|
|
|
|
loff_t pos)
|
2005-04-16 15:20:36 -07:00
|
|
|
{
|
|
|
|
if (pos < 0)
|
|
|
|
return -EINVAL;
|
|
|
|
|
2024-07-19 21:19:02 -04:00
|
|
|
CLASS(fd, f)(fd);
|
|
|
|
if (fd_empty(f))
|
|
|
|
return -EBADF;
|
2005-04-16 15:20:36 -07:00
|
|
|
|
2024-07-19 21:19:02 -04:00
|
|
|
if (fd_file(f)->f_mode & FMODE_PREAD)
|
|
|
|
return vfs_read(fd_file(f), buf, count, &pos);
|
|
|
|
|
|
|
|
return -ESPIPE;
|
2005-04-16 15:20:36 -07:00
|
|
|
}
|
|
|
|
|
2018-03-19 17:38:31 +01:00
|
|
|
SYSCALL_DEFINE4(pread64, unsigned int, fd, char __user *, buf,
|
|
|
|
size_t, count, loff_t, pos)
|
|
|
|
{
|
|
|
|
return ksys_pread64(fd, buf, count, pos);
|
|
|
|
}
|
|
|
|
|
2022-04-05 15:13:05 +08:00
|
|
|
#if defined(CONFIG_COMPAT) && defined(__ARCH_WANT_COMPAT_PREAD64)
|
|
|
|
COMPAT_SYSCALL_DEFINE5(pread64, unsigned int, fd, char __user *, buf,
|
|
|
|
size_t, count, compat_arg_u64_dual(pos))
|
|
|
|
{
|
|
|
|
return ksys_pread64(fd, buf, count, compat_arg_u64_glue(pos));
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
2018-03-19 17:38:31 +01:00
|
|
|
ssize_t ksys_pwrite64(unsigned int fd, const char __user *buf,
|
|
|
|
size_t count, loff_t pos)
|
2005-04-16 15:20:36 -07:00
|
|
|
{
|
|
|
|
if (pos < 0)
|
|
|
|
return -EINVAL;
|
|
|
|
|
2024-07-19 21:19:02 -04:00
|
|
|
CLASS(fd, f)(fd);
|
|
|
|
if (fd_empty(f))
|
|
|
|
return -EBADF;
|
2005-04-16 15:20:36 -07:00
|
|
|
|
2024-07-19 21:19:02 -04:00
|
|
|
if (fd_file(f)->f_mode & FMODE_PWRITE)
|
|
|
|
return vfs_write(fd_file(f), buf, count, &pos);
|
|
|
|
|
|
|
|
return -ESPIPE;
|
2005-04-16 15:20:36 -07:00
|
|
|
}
|
|
|
|
|
2018-03-19 17:38:31 +01:00
|
|
|
SYSCALL_DEFINE4(pwrite64, unsigned int, fd, const char __user *, buf,
|
|
|
|
size_t, count, loff_t, pos)
|
|
|
|
{
|
|
|
|
return ksys_pwrite64(fd, buf, count, pos);
|
|
|
|
}
|
|
|
|
|
2022-04-05 15:13:05 +08:00
|
|
|
#if defined(CONFIG_COMPAT) && defined(__ARCH_WANT_COMPAT_PWRITE64)
|
|
|
|
COMPAT_SYSCALL_DEFINE5(pwrite64, unsigned int, fd, const char __user *, buf,
|
|
|
|
size_t, count, compat_arg_u64_dual(pos))
|
|
|
|
{
|
|
|
|
return ksys_pwrite64(fd, buf, count, compat_arg_u64_glue(pos));
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
2015-03-20 20:10:21 -04:00
|
|
|
static ssize_t do_iter_readv_writev(struct file *filp, struct iov_iter *iter,
|
2017-07-06 18:58:37 +02:00
|
|
|
loff_t *ppos, int type, rwf_t flags)
|
2014-02-11 18:37:41 -05:00
|
|
|
{
|
|
|
|
struct kiocb kiocb;
|
|
|
|
ssize_t ret;
|
|
|
|
|
|
|
|
init_sync_kiocb(&kiocb, filp);
|
fs: Initial atomic write support
An atomic write is a write issued with torn-write protection, meaning
that for a power failure or any other hardware failure, all or none of the
data from the write will be stored, but never a mix of old and new data.
Userspace may add flag RWF_ATOMIC to pwritev2() to indicate that the
write is to be issued with torn-write prevention, according to special
alignment and length rules.
For any syscall interface utilizing struct iocb, add IOCB_ATOMIC for
iocb->ki_flags field to indicate the same.
A call to statx will give the relevant atomic write info for a file:
- atomic_write_unit_min
- atomic_write_unit_max
- atomic_write_segments_max
Both min and max values must be a power-of-2.
Applications can avail of atomic write feature by ensuring that the total
length of a write is a power-of-2 in size and also sized between
atomic_write_unit_min and atomic_write_unit_max, inclusive. Applications
must ensure that the write is at a naturally-aligned offset in the file
wrt the total write length. The value in atomic_write_segments_max
indicates the upper limit for IOV_ITER iovcnt.
Add file mode flag FMODE_CAN_ATOMIC_WRITE, so files which do not have the
flag set will have RWF_ATOMIC rejected and not just ignored.
Add a type argument to kiocb_set_rw_flags() to allow reads which have
RWF_ATOMIC set to be rejected.
Helper function generic_atomic_write_valid() can be used by FSes to verify
compliant writes. There we check for iov_iter type is for ubuf, which
implies iovcnt==1 for pwritev2(), which is an initial restriction for
atomic_write_segments_max. Initially the only user will be bdev file
operations write handler. We will rely on the block BIO submission path to
ensure write sizes are compliant for the bdev, so we don't need to check
atomic writes sizes yet.
Signed-off-by: Prasad Singamsetty <prasad.singamsetty@oracle.com>
jpg: merge into single patch and much rewrite
Acked-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Link: https://lore.kernel.org/r/20240620125359.2684798-4-john.g.garry@oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-20 12:53:52 +00:00
|
|
|
ret = kiocb_set_rw_flags(&kiocb, flags, type);
|
2017-06-20 07:05:40 -05:00
|
|
|
if (ret)
|
|
|
|
return ret;
|
2019-04-12 12:31:57 +03:00
|
|
|
kiocb.ki_pos = (ppos ? *ppos : 0);
|
2014-02-11 18:37:41 -05:00
|
|
|
|
2017-02-20 16:51:23 +01:00
|
|
|
if (type == READ)
|
2023-08-28 17:13:18 +02:00
|
|
|
ret = filp->f_op->read_iter(&kiocb, iter);
|
2017-02-20 16:51:23 +01:00
|
|
|
else
|
2023-08-28 17:13:18 +02:00
|
|
|
ret = filp->f_op->write_iter(&kiocb, iter);
|
2015-02-11 19:59:44 +01:00
|
|
|
BUG_ON(ret == -EIOCBQUEUED);
|
2019-04-12 12:31:57 +03:00
|
|
|
if (ppos)
|
|
|
|
*ppos = kiocb.ki_pos;
|
2014-02-11 18:37:41 -05:00
|
|
|
return ret;
|
|
|
|
}
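The atomic write commit message quoted inside do_iter_readv_writev() states three conditions on a write: its total length is a power of two, the length lies between atomic_write_unit_min and atomic_write_unit_max inclusive, and the file offset is naturally aligned to the length. The sketch below encodes only those stated rules. The function names are hypothetical, and this is not the kernel's generic_atomic_write_valid() helper.

```c
#include <stdbool.h>
#include <stdint.h>

/* True when exactly one bit is set. */
bool is_power_of_two(uint64_t v)
{
	return v != 0 && (v & (v - 1)) == 0;
}

/*
 * Hypothetical check of the rules as the commit message words them.
 * unit_min and unit_max are the values a caller would obtain from statx.
 */
bool atomic_write_ok(uint64_t pos, uint64_t len,
		     uint64_t unit_min, uint64_t unit_max)
{
	if (!is_power_of_two(len))
		return false;
	if (len < unit_min || len > unit_max)
		return false;
	/* len is a power of two, so natural alignment is a mask test. */
	return (pos & (len - 1)) == 0;
}
```

With limits of 512 and 65536 bytes, a 4096-byte write at offset 8192 passes, while the same write at offset 2048 fails the alignment rule.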
|
|
|
|
|
2006-09-30 23:28:47 -07:00
|
|
|
/* Do it by hand, with file-ops */
|
2015-03-20 20:10:21 -04:00
|
|
|
static ssize_t do_loop_readv_writev(struct file *filp, struct iov_iter *iter,
|
2017-07-06 18:58:37 +02:00
|
|
|
loff_t *ppos, int type, rwf_t flags)
|
2006-09-30 23:28:47 -07:00
|
|
|
{
|
|
|
|
ssize_t ret = 0;
|
|
|
|
|
2016-03-03 16:04:01 +01:00
|
|
|
if (flags & ~RWF_HIPRI)
|
2016-03-03 16:03:58 +01:00
|
|
|
return -EOPNOTSUPP;
|
|
|
|
|
2015-03-20 20:10:21 -04:00
|
|
|
while (iov_iter_count(iter)) {
|
2006-09-30 23:28:47 -07:00
|
|
|
ssize_t nr;
|
|
|
|
|
2017-02-20 16:51:23 +01:00
|
|
|
if (type == READ) {
|
2023-03-29 09:16:45 -06:00
|
|
|
nr = filp->f_op->read(filp, iter_iov_addr(iter),
|
|
|
|
iter_iov_len(iter), ppos);
|
2017-02-20 16:51:23 +01:00
|
|
|
} else {
|
2023-03-29 09:16:45 -06:00
|
|
|
nr = filp->f_op->write(filp, iter_iov_addr(iter),
|
|
|
|
iter_iov_len(iter), ppos);
|
2017-02-20 16:51:23 +01:00
|
|
|
}
|
2006-09-30 23:28:47 -07:00
|
|
|
|
|
|
|
if (nr < 0) {
|
|
|
|
if (!ret)
|
|
|
|
ret = nr;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
ret += nr;
|
2023-03-29 09:16:45 -06:00
|
|
|
if (nr != iter_iov_len(iter))
|
2006-09-30 23:28:47 -07:00
|
|
|
break;
|
2015-03-20 20:10:21 -04:00
|
|
|
iov_iter_advance(iter, nr);
|
2006-09-30 23:28:47 -07:00
|
|
|
}
|
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
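The control flow of do_loop_readv_writev() has two early exits: an error ends the loop and is reported only if nothing was transferred yet, and a short transfer ends the loop with the running total. A userspace sketch of that accumulation logic is given below. The segment array replaces the iov_iter, and `loop_segments` and `budget_op` are hypothetical names introduced for the illustration.

```c
#include <stddef.h>

/* Per-segment operation: returns bytes transferred or a negative error. */
typedef long (*seg_op)(size_t seg_len, void *ctx);

/* Hypothetical model of the loop: one call of op per segment. */
long loop_segments(const size_t *seg_len, size_t nsegs, seg_op op, void *ctx)
{
	long ret = 0;

	for (size_t i = 0; i < nsegs; i++) {
		long nr = op(seg_len[i], ctx);

		if (nr < 0) {
			/* Report the error only if no progress was made. */
			if (!ret)
				ret = nr;
			break;
		}
		ret += nr;
		/* A short transfer ends the loop with the partial total. */
		if ((size_t)nr != seg_len[i])
			break;
	}
	return ret;
}

/*
 * Toy operation: ctx points to a byte budget. An empty budget is an
 * error; otherwise at most the remaining budget is transferred.
 */
long budget_op(size_t seg_len, void *ctx)
{
	long *budget = ctx;
	long n;

	if (*budget == 0)
		return -5; /* stands in for a negative errno */
	n = (long)seg_len < *budget ? (long)seg_len : *budget;
	*budget -= n;
	return n;
}
```

With three 4-byte segments, a budget of 6 returns 6 after a short second segment, and a budget of 4 returns 4 because the error on the second segment is masked by the earlier progress.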
|
|
|
|
|
2023-07-16 14:47:14 +03:00
|
|
|
ssize_t vfs_iocb_iter_read(struct file *file, struct kiocb *iocb,
|
|
|
|
struct iov_iter *iter)
|
2005-04-16 15:20:36 -07:00
|
|
|
{
|
|
|
|
size_t tot_len;
|
2017-02-20 16:51:23 +01:00
|
|
|
ssize_t ret = 0;
|
2005-04-16 15:20:36 -07:00
|
|
|
|
2023-07-16 14:47:14 +03:00
|
|
|
if (!file->f_op->read_iter)
|
|
|
|
return -EINVAL;
|
2017-05-27 11:16:49 +03:00
|
|
|
if (!(file->f_mode & FMODE_READ))
|
|
|
|
return -EBADF;
|
|
|
|
if (!(file->f_mode & FMODE_CAN_READ))
|
|
|
|
return -EINVAL;
|
|
|
|
|
2017-02-20 16:51:23 +01:00
|
|
|
tot_len = iov_iter_count(iter);
|
2015-03-21 19:40:11 -04:00
|
|
|
if (!tot_len)
|
|
|
|
goto out;
|
2023-07-16 14:47:14 +03:00
|
|
|
ret = rw_verify_area(READ, file, &iocb->ki_pos, tot_len);
|
2006-01-04 16:20:40 -08:00
|
|
|
if (ret < 0)
|
2017-05-27 11:16:48 +03:00
|
|
|
return ret;
|
2005-04-16 15:20:36 -07:00
|
|
|
|
2023-08-28 17:13:18 +02:00
|
|
|
ret = file->f_op->read_iter(iocb, iter);
|
2005-04-16 15:20:36 -07:00
|
|
|
out:
|
2017-05-27 11:16:48 +03:00
|
|
|
if (ret >= 0)
|
|
|
|
fsnotify_access(file);
|
2005-04-16 15:20:36 -07:00
|
|
|
return ret;
|
|
|
|
}
|
2023-07-16 14:47:14 +03:00
|
|
|
EXPORT_SYMBOL(vfs_iocb_iter_read);
|
2005-04-16 15:20:36 -07:00
|
|
|
|
2023-07-16 14:47:14 +03:00
|
|
|
ssize_t vfs_iter_read(struct file *file, struct iov_iter *iter, loff_t *ppos,
|
|
|
|
rwf_t flags)
|
2019-11-20 17:45:25 +08:00
|
|
|
{
|
|
|
|
size_t tot_len;
|
|
|
|
ssize_t ret = 0;
|
|
|
|
|
|
|
|
if (!file->f_op->read_iter)
|
|
|
|
return -EINVAL;
|
|
|
|
if (!(file->f_mode & FMODE_READ))
|
|
|
|
return -EBADF;
|
|
|
|
if (!(file->f_mode & FMODE_CAN_READ))
|
|
|
|
return -EINVAL;
|
|
|
|
|
|
|
|
tot_len = iov_iter_count(iter);
|
|
|
|
if (!tot_len)
|
|
|
|
goto out;
|
2023-07-16 14:47:14 +03:00
|
|
|
ret = rw_verify_area(READ, file, ppos, tot_len);
|
2019-11-20 17:45:25 +08:00
|
|
|
if (ret < 0)
|
|
|
|
return ret;
|
|
|
|
|
2023-07-16 14:47:14 +03:00
|
|
|
ret = do_iter_readv_writev(file, iter, ppos, READ, flags);
|
2019-11-20 17:45:25 +08:00
|
|
|
out:
|
|
|
|
if (ret >= 0)
|
|
|
|
fsnotify_access(file);
|
|
|
|
return ret;
|
|
|
|
}
|
2017-05-27 11:16:51 +03:00
|
|
|
EXPORT_SYMBOL(vfs_iter_read);
|
2017-02-20 16:51:23 +01:00
|
|
|
|
2023-11-22 14:27:12 +02:00
|
|
|
/*
|
|
|
|
* Caller is responsible for calling kiocb_end_write() on completion
|
|
|
|
* if async iocb was queued.
|
|
|
|
*/
|
2019-11-20 17:45:25 +08:00
|
|
|
ssize_t vfs_iocb_iter_write(struct file *file, struct kiocb *iocb,
|
|
|
|
struct iov_iter *iter)
|
|
|
|
{
|
|
|
|
size_t tot_len;
|
|
|
|
ssize_t ret = 0;
|
|
|
|
|
|
|
|
if (!file->f_op->write_iter)
|
|
|
|
return -EINVAL;
|
|
|
|
if (!(file->f_mode & FMODE_WRITE))
|
|
|
|
return -EBADF;
|
|
|
|
if (!(file->f_mode & FMODE_CAN_WRITE))
|
|
|
|
return -EINVAL;
|
|
|
|
|
|
|
|
tot_len = iov_iter_count(iter);
|
|
|
|
if (!tot_len)
|
|
|
|
return 0;
|
|
|
|
ret = rw_verify_area(WRITE, file, &iocb->ki_pos, tot_len);
|
|
|
|
if (ret < 0)
|
|
|
|
return ret;
|
|
|
|
|
2023-11-22 14:27:12 +02:00
|
|
|
kiocb_start_write(iocb);
|
2023-08-28 17:13:18 +02:00
|
|
|
ret = file->f_op->write_iter(iocb, iter);
|
2023-11-22 14:27:12 +02:00
|
|
|
if (ret != -EIOCBQUEUED)
|
|
|
|
kiocb_end_write(iocb);
|
2019-11-20 17:45:25 +08:00
|
|
|
if (ret > 0)
|
|
|
|
fsnotify_modify(file);
|
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(vfs_iocb_iter_write);
|
|
|
|
|
2017-05-27 11:16:52 +03:00
|
|
|
ssize_t vfs_iter_write(struct file *file, struct iov_iter *iter, loff_t *ppos,
|
2023-11-22 14:27:09 +02:00
|
|
|
rwf_t flags)
|
2017-05-27 11:16:52 +03:00
|
|
|
{
|
2023-07-16 14:47:14 +03:00
|
|
|
size_t tot_len;
|
|
|
|
ssize_t ret;
|
2023-11-22 14:27:09 +02:00
|
|
|
|
2023-07-16 14:47:14 +03:00
|
|
|
if (!(file->f_mode & FMODE_WRITE))
|
|
|
|
return -EBADF;
|
|
|
|
if (!(file->f_mode & FMODE_CAN_WRITE))
|
|
|
|
return -EINVAL;
|
2017-05-27 11:16:52 +03:00
|
|
|
if (!file->f_op->write_iter)
|
|
|
|
return -EINVAL;
|
2023-11-22 14:27:09 +02:00
|
|
|
|
2023-07-16 14:47:14 +03:00
|
|
|
tot_len = iov_iter_count(iter);
|
|
|
|
if (!tot_len)
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
ret = rw_verify_area(WRITE, file, ppos, tot_len);
|
|
|
|
if (ret < 0)
|
|
|
|
return ret;
|
|
|
|
|
2023-11-22 14:27:09 +02:00
|
|
|
file_start_write(file);
|
2023-07-16 14:47:14 +03:00
|
|
|
ret = do_iter_readv_writev(file, iter, ppos, WRITE, flags);
|
|
|
|
if (ret > 0)
|
|
|
|
fsnotify_modify(file);
|
2023-11-22 14:27:09 +02:00
|
|
|
file_end_write(file);
|
|
|
|
|
|
|
|
return ret;
|
2017-05-27 11:16:52 +03:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(vfs_iter_write);
|
|
|
|
|
2020-09-03 16:22:34 +02:00
|
|
|
static ssize_t vfs_readv(struct file *file, const struct iovec __user *vec,
|
2023-07-16 14:47:14 +03:00
|
|
|
unsigned long vlen, loff_t *pos, rwf_t flags)
|
2005-04-16 15:20:36 -07:00
|
|
|
{
|
2017-02-20 16:51:23 +01:00
|
|
|
struct iovec iovstack[UIO_FASTIOV];
|
|
|
|
struct iovec *iov = iovstack;
|
|
|
|
struct iov_iter iter;
|
2023-07-16 14:47:14 +03:00
|
|
|
size_t tot_len;
|
|
|
|
ssize_t ret = 0;
|
2005-04-16 15:20:36 -07:00
|
|
|
|
2023-07-16 14:47:14 +03:00
|
|
|
if (!(file->f_mode & FMODE_READ))
|
|
|
|
return -EBADF;
|
|
|
|
if (!(file->f_mode & FMODE_CAN_READ))
|
|
|
|
return -EINVAL;
|
2005-04-16 15:20:36 -07:00
|
|
|
|
2023-07-16 14:47:14 +03:00
|
|
|
ret = import_iovec(ITER_DEST, vec, vlen, ARRAY_SIZE(iovstack), &iov,
|
|
|
|
&iter);
|
|
|
|
if (ret < 0)
|
|
|
|
return ret;
|
|
|
|
|
|
|
|
tot_len = iov_iter_count(&iter);
|
|
|
|
if (!tot_len)
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
ret = rw_verify_area(READ, file, pos, tot_len);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
if (file->f_op->read_iter)
|
|
|
|
ret = do_iter_readv_writev(file, &iter, pos, READ, flags);
|
|
|
|
else
|
|
|
|
ret = do_loop_readv_writev(file, &iter, pos, READ, flags);
|
|
|
|
out:
|
|
|
|
if (ret >= 0)
|
|
|
|
fsnotify_access(file);
|
|
|
|
kfree(iov);
|
2017-05-27 11:16:46 +03:00
|
|
|
return ret;
|
|
|
|
}
|
2005-04-16 15:20:36 -07:00
|
|
|
|
2017-09-01 17:39:25 +02:00
|
|
|
static ssize_t vfs_writev(struct file *file, const struct iovec __user *vec,
|
2023-07-16 14:47:14 +03:00
|
|
|
unsigned long vlen, loff_t *pos, rwf_t flags)
|
2005-04-16 15:20:36 -07:00
|
|
|
{
|
2017-05-27 11:16:46 +03:00
|
|
|
struct iovec iovstack[UIO_FASTIOV];
|
|
|
|
struct iovec *iov = iovstack;
|
|
|
|
struct iov_iter iter;
|
2023-07-16 14:47:14 +03:00
|
|
|
size_t tot_len;
|
|
|
|
ssize_t ret = 0;
|
2005-04-16 15:20:36 -07:00
|
|
|
|
2023-07-16 14:47:14 +03:00
|
|
|
if (!(file->f_mode & FMODE_WRITE))
|
|
|
|
return -EBADF;
|
|
|
|
if (!(file->f_mode & FMODE_CAN_WRITE))
|
|
|
|
return -EINVAL;
|
|
|
|
|
|
|
|
ret = import_iovec(ITER_SOURCE, vec, vlen, ARRAY_SIZE(iovstack), &iov,
|
|
|
|
&iter);
|
|
|
|
if (ret < 0)
|
|
|
|
return ret;
|
|
|
|
|
|
|
|
tot_len = iov_iter_count(&iter);
|
|
|
|
if (!tot_len)
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
ret = rw_verify_area(WRITE, file, pos, tot_len);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
file_start_write(file);
|
|
|
|
if (file->f_op->write_iter)
|
|
|
|
ret = do_iter_readv_writev(file, &iter, pos, WRITE, flags);
|
|
|
|
else
|
|
|
|
ret = do_loop_readv_writev(file, &iter, pos, WRITE, flags);
|
|
|
|
if (ret > 0)
|
|
|
|
fsnotify_modify(file);
|
|
|
|
file_end_write(file);
|
|
|
|
out:
|
|
|
|
kfree(iov);
|
2017-05-27 11:16:46 +03:00
|
|
|
return ret;
|
2005-04-16 15:20:36 -07:00
|
|
|
}
|
|
|
|
|
2016-03-03 16:03:59 +01:00
|
|
|
static ssize_t do_readv(unsigned long fd, const struct iovec __user *vec,
|
2017-07-06 18:58:37 +02:00
|
|
|
unsigned long vlen, rwf_t flags)
|
2005-04-16 15:20:36 -07:00
|
|
|
{
|
2024-05-31 22:10:12 -04:00
|
|
|
CLASS(fd_pos, f)(fd);
|
2005-04-16 15:20:36 -07:00
|
|
|
ssize_t ret = -EBADF;
|
|
|
|
|
2024-05-31 22:10:12 -04:00
|
|
|
if (!fd_empty(f)) {
|
2024-05-31 14:12:01 -04:00
|
|
|
loff_t pos, *ppos = file_ppos(fd_file(f));
|
2019-04-12 12:31:57 +03:00
|
|
|
if (ppos) {
|
|
|
|
pos = *ppos;
|
|
|
|
ppos = &pos;
|
|
|
|
}
|
2024-05-31 14:12:01 -04:00
|
|
|
ret = vfs_readv(fd_file(f), vec, vlen, ppos, flags);
|
2019-04-12 12:31:57 +03:00
|
|
|
if (ret >= 0 && ppos)
|
2024-05-31 14:12:01 -04:00
|
|
|
fd_file(f)->f_pos = pos;
|
2005-04-16 15:20:36 -07:00
|
|
|
}
|
|
|
|
|
|
|
|
if (ret > 0)
|
[PATCH] ifdef ->rchar, ->wchar, ->syscr, ->syscw from task_struct
They are fat: 4x8 bytes in task_struct.
They are unconditionally updated in every fork, read, write and sendfile.
They are used only if you have some "extended acct fields feature".
And please, please, please, read(2) knows about bytes, not characters,
why it is called "rchar"?
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Jay Lan <jlan@engr.sgi.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-10 01:46:45 -08:00
|
|
|
add_rchar(current, ret);
|
|
|
|
inc_syscr(current);
|
2005-04-16 15:20:36 -07:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2016-03-03 16:03:59 +01:00
|
|
|
static ssize_t do_writev(unsigned long fd, const struct iovec __user *vec,
|
2017-07-06 18:58:37 +02:00
|
|
|
unsigned long vlen, rwf_t flags)
|
2005-04-16 15:20:36 -07:00
|
|
|
{
|
2024-05-31 22:10:12 -04:00
|
|
|
CLASS(fd_pos, f)(fd);
|
2005-04-16 15:20:36 -07:00
|
|
|
ssize_t ret = -EBADF;
|
|
|
|
|
2024-05-31 22:10:12 -04:00
|
|
|
if (!fd_empty(f)) {
|
2024-05-31 14:12:01 -04:00
|
|
|
loff_t pos, *ppos = file_ppos(fd_file(f));
|
2019-04-12 12:31:57 +03:00
|
|
|
if (ppos) {
|
|
|
|
pos = *ppos;
|
|
|
|
ppos = &pos;
|
|
|
|
}
|
2024-05-31 14:12:01 -04:00
|
|
|
ret = vfs_writev(fd_file(f), vec, vlen, ppos, flags);
|
2019-04-12 12:31:57 +03:00
|
|
|
if (ret >= 0 && ppos)
|
2024-05-31 14:12:01 -04:00
|
|
|
fd_file(f)->f_pos = pos;
|
2005-04-16 15:20:36 -07:00
|
|
|
}
|
|
|
|
|
|
|
|
if (ret > 0)
|
2007-02-10 01:46:45 -08:00
|
|
|
add_wchar(current, ret);
|
|
|
|
inc_syscw(current);
|
2005-04-16 15:20:36 -07:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
Make non-compat preadv/pwritev use native register size
Instead of always splitting the file offset into 32-bit 'high' and 'low'
parts, just split them into the largest natural word-size - which in C
terms is 'unsigned long'.
This allows 64-bit architectures to avoid the unnecessary 32-bit
shifting and masking for native format (while the compat interfaces will
obviously always have to do it).
This also changes the order of 'high' and 'low' to be "low first". Why?
Because when we have it like this, the 64-bit system calls now don't use
the "pos_high" argument at all, and it makes more sense for the native
system call to simply match the user-mode prototype.
This results in a much more natural calling convention, and allows the
compiler to generate much more straightforward code. On x86-64, we now
generate
    testq %rcx, %rcx # pos_l
    js .L122 #,
    movq %rcx, -48(%rbp) # pos_l, pos
from the C source
    loff_t pos = pos_from_hilo(pos_h, pos_l);
    ...
    if (pos < 0)
        return -EINVAL;
and the 'pos_h' register isn't even touched. It used to generate code
like
    mov %r8d, %r8d # pos_low, pos_low
    salq $32, %rcx #, tmp71
    movq %r8, %rax # pos_low, pos.386
    orq %rcx, %rax # tmp71, pos.386
    js .L122 #,
    movq %rax, -48(%rbp) # pos.386, pos
which isn't _that_ horrible, but it does show how the natural word size
is just a more sensible interface (same arguments will hold in the user
level glibc wrapper function, of course, so the kernel side is just half
of the equation!)
Note: in all cases the user code wrapper can again be the same. You can
just do
    #define HALF_BITS (sizeof(unsigned long)*4)
    __syscall(PWRITEV, fd, iov, count, offset, (offset >> HALF_BITS) >> HALF_BITS);
or something like that. That way the user mode wrapper will also be
nicely passing in a zero (it won't actually have to do the shifts, the
compiler will understand what is going on) for the last argument.
And that is a good idea, even if nobody will necessarily ever care: if
we ever do move to a 128-bit lloff_t, this particular system call might
be left alone. Of course, that will be the least of our worries if we
really ever need to care, so this may not be worth really caring about.
[ Fixed for lost 'loff_t' cast noticed by Andrew Morton ]
Acked-by: Gerd Hoffmann <kraxel@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-api@vger.kernel.org
Cc: linux-arch@vger.kernel.org
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-03 08:03:22 -07:00
|
|
|
static inline loff_t pos_from_hilo(unsigned long high, unsigned long low)
|
|
|
|
{
|
|
|
|
#define HALF_LONG_BITS (BITS_PER_LONG / 2)
|
|
|
|
return (((loff_t)high << HALF_LONG_BITS) << HALF_LONG_BITS) | low;
|
|
|
|
}
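The half-and-half shift in pos_from_hilo() above can be exercised in user space. The following is a minimal sketch under stated assumptions, not kernel code: `join_hilo` and `HALF_ULONG_BITS` are hypothetical names, `int64_t` stands in for `loff_t`, `CHAR_BIT * sizeof(unsigned long)` stands in for `BITS_PER_LONG`, and the shift is done on an unsigned value to sidestep signed-shift overflow in portable C.

```c
#include <limits.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's HALF_LONG_BITS. */
#define HALF_ULONG_BITS (CHAR_BIT * sizeof(unsigned long) / 2)

/*
 * Join two register-sized halves into one 64-bit offset. Shifting twice by
 * half the word size avoids an undefined full-width shift: with a 64-bit
 * long the high word is shifted out entirely, with a 32-bit long it lands
 * in bits 32..63.
 */
static int64_t join_hilo(unsigned long high, unsigned long low)
{
	uint64_t h = (uint64_t)high;

	h = (h << HALF_ULONG_BITS) << HALF_ULONG_BITS;
	return (int64_t)(h | low);
}
```

With a 64-bit `unsigned long`, `join_hilo(1, 5)` yields 5 because the high word is discarded; with a 32-bit `unsigned long` it yields `(1 << 32) | 5`. In both cases two all-ones halves produce -1, the value that preadv2() and pwritev2() test for.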
|
|
|
|
|
2016-03-03 16:03:59 +01:00
|
|
|
static ssize_t do_preadv(unsigned long fd, const struct iovec __user *vec,
|
2017-07-06 18:58:37 +02:00
|
|
|
unsigned long vlen, loff_t pos, rwf_t flags)
|
2009-04-02 16:59:23 -07:00
|
|
|
{
|
|
|
|
ssize_t ret = -EBADF;
|
|
|
|
|
|
|
|
if (pos < 0)
|
|
|
|
return -EINVAL;
|
|
|
|
|
2024-07-22 21:51:33 -04:00
|
|
|
CLASS(fd, f)(fd);
|
|
|
|
if (!fd_empty(f)) {
|
2009-04-02 16:59:23 -07:00
|
|
|
ret = -ESPIPE;
|
2024-05-31 14:12:01 -04:00
|
|
|
if (fd_file(f)->f_mode & FMODE_PREAD)
|
|
|
|
ret = vfs_readv(fd_file(f), vec, vlen, &pos, flags);
|
2009-04-02 16:59:23 -07:00
|
|
|
}
|
|
|
|
|
|
|
|
if (ret > 0)
|
|
|
|
add_rchar(current, ret);
|
|
|
|
inc_syscr(current);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2016-03-03 16:03:59 +01:00
|
|
|
static ssize_t do_pwritev(unsigned long fd, const struct iovec __user *vec,
|
2017-07-06 18:58:37 +02:00
|
|
|
unsigned long vlen, loff_t pos, rwf_t flags)
|
2009-04-02 16:59:23 -07:00
|
|
|
{
|
|
|
|
ssize_t ret = -EBADF;
|
|
|
|
|
|
|
|
if (pos < 0)
|
|
|
|
return -EINVAL;
|
|
|
|
|
2024-07-22 21:51:33 -04:00
|
|
|
CLASS(fd, f)(fd);
|
|
|
|
if (!fd_empty(f)) {
|
2009-04-02 16:59:23 -07:00
|
|
|
ret = -ESPIPE;
|
2024-05-31 14:12:01 -04:00
|
|
|
if (fd_file(f)->f_mode & FMODE_PWRITE)
|
|
|
|
ret = vfs_writev(fd_file(f), vec, vlen, &pos, flags);
|
2009-04-02 16:59:23 -07:00
|
|
|
}
|
|
|
|
|
|
|
|
if (ret > 0)
|
|
|
|
add_wchar(current, ret);
|
|
|
|
inc_syscw(current);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2016-03-03 16:03:59 +01:00
|
|
|
SYSCALL_DEFINE3(readv, unsigned long, fd, const struct iovec __user *, vec,
|
|
|
|
unsigned long, vlen)
|
|
|
|
{
|
|
|
|
return do_readv(fd, vec, vlen, 0);
|
|
|
|
}
|
|
|
|
|
|
|
|
SYSCALL_DEFINE3(writev, unsigned long, fd, const struct iovec __user *, vec,
|
|
|
|
unsigned long, vlen)
|
|
|
|
{
|
|
|
|
return do_writev(fd, vec, vlen, 0);
|
|
|
|
}
|
|
|
|
|
|
|
|
SYSCALL_DEFINE5(preadv, unsigned long, fd, const struct iovec __user *, vec,
|
|
|
|
unsigned long, vlen, unsigned long, pos_l, unsigned long, pos_h)
|
|
|
|
{
|
|
|
|
loff_t pos = pos_from_hilo(pos_h, pos_l);
|
|
|
|
|
|
|
|
return do_preadv(fd, vec, vlen, pos, 0);
|
|
|
|
}
|
|
|
|
|
|
|
|
SYSCALL_DEFINE6(preadv2, unsigned long, fd, const struct iovec __user *, vec,
|
|
|
|
unsigned long, vlen, unsigned long, pos_l, unsigned long, pos_h,
|
2017-07-06 18:58:37 +02:00
|
|
|
rwf_t, flags)
|
2016-03-03 16:03:59 +01:00
|
|
|
{
|
|
|
|
loff_t pos = pos_from_hilo(pos_h, pos_l);
|
|
|
|
|
|
|
|
if (pos == -1)
|
|
|
|
return do_readv(fd, vec, vlen, flags);
|
|
|
|
|
|
|
|
return do_preadv(fd, vec, vlen, pos, flags);
|
|
|
|
}
|
|
|
|
|
|
|
|
SYSCALL_DEFINE5(pwritev, unsigned long, fd, const struct iovec __user *, vec,
|
|
|
|
unsigned long, vlen, unsigned long, pos_l, unsigned long, pos_h)
|
|
|
|
{
|
|
|
|
loff_t pos = pos_from_hilo(pos_h, pos_l);
|
|
|
|
|
|
|
|
return do_pwritev(fd, vec, vlen, pos, 0);
|
|
|
|
}
|
|
|
|
|
|
|
|
SYSCALL_DEFINE6(pwritev2, unsigned long, fd, const struct iovec __user *, vec,
|
|
|
|
unsigned long, vlen, unsigned long, pos_l, unsigned long, pos_h,
|
2017-07-06 18:58:37 +02:00
|
|
|
rwf_t, flags)
|
2016-03-03 16:03:59 +01:00
|
|
|
{
|
|
|
|
loff_t pos = pos_from_hilo(pos_h, pos_l);
|
|
|
|
|
|
|
|
if (pos == -1)
|
|
|
|
return do_writev(fd, vec, vlen, flags);
|
|
|
|
|
|
|
|
return do_pwritev(fd, vec, vlen, pos, flags);
|
|
|
|
}
|
|
|
|
|
2020-09-25 06:51:42 +02:00
|
|
|
/*
|
|
|
|
* Various compat syscalls. Note that they all pretend to take a native
|
|
|
|
* iovec - import_iovec will properly treat those as compat_iovecs based on
|
|
|
|
* in_compat_syscall().
|
|
|
|
*/
|
2013-03-20 10:42:10 -04:00
|
|
|
#ifdef CONFIG_COMPAT
|
2014-03-05 10:43:51 +01:00
|
|
|
#ifdef __ARCH_WANT_COMPAT_SYS_PREADV64
|
|
|
|
COMPAT_SYSCALL_DEFINE4(preadv64, unsigned long, fd,
|
2020-09-25 06:51:42 +02:00
|
|
|
const struct iovec __user *, vec,
|
2014-03-05 10:43:51 +01:00
|
|
|
unsigned long, vlen, loff_t, pos)
|
|
|
|
{
|
2020-09-25 06:51:42 +02:00
|
|
|
return do_preadv(fd, vec, vlen, pos, 0);
|
2014-03-05 10:43:51 +01:00
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
2014-01-29 14:05:44 -08:00
|
|
|
COMPAT_SYSCALL_DEFINE5(preadv, compat_ulong_t, fd,
|
2020-09-25 06:51:42 +02:00
|
|
|
const struct iovec __user *, vec,
|
2014-01-29 14:05:44 -08:00
|
|
|
compat_ulong_t, vlen, u32, pos_low, u32, pos_high)
|
2013-03-20 10:42:10 -04:00
|
|
|
{
|
|
|
|
loff_t pos = ((loff_t)pos_high << 32) | pos_low;
|
2014-03-05 10:43:51 +01:00
|
|
|
|
2020-09-25 06:51:42 +02:00
|
|
|
return do_preadv(fd, vec, vlen, pos, 0);
|
2016-03-03 16:03:59 +01:00
|
|
|
}
|
|
|
|
|
2016-07-14 12:31:53 -07:00
|
|
|
#ifdef __ARCH_WANT_COMPAT_SYS_PREADV64V2
|
|
|
|
COMPAT_SYSCALL_DEFINE5(preadv64v2, unsigned long, fd,
|
2020-09-25 06:51:42 +02:00
|
|
|
const struct iovec __user *, vec,
|
2017-07-06 18:58:37 +02:00
|
|
|
unsigned long, vlen, loff_t, pos, rwf_t, flags)
|
2016-07-14 12:31:53 -07:00
|
|
|
{
|
2018-12-06 20:05:34 +01:00
|
|
|
if (pos == -1)
|
2020-09-25 06:51:42 +02:00
|
|
|
return do_readv(fd, vec, vlen, flags);
|
|
|
|
return do_preadv(fd, vec, vlen, pos, flags);
|
2016-07-14 12:31:53 -07:00
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
2016-03-03 16:03:59 +01:00
|
|
|
COMPAT_SYSCALL_DEFINE6(preadv2, compat_ulong_t, fd,
|
2020-09-25 06:51:42 +02:00
|
|
|
const struct iovec __user *, vec,
|
2016-03-03 16:03:59 +01:00
|
|
|
compat_ulong_t, vlen, u32, pos_low, u32, pos_high,
|
2017-07-06 18:58:37 +02:00
|
|
|
rwf_t, flags)
|
2016-03-03 16:03:59 +01:00
|
|
|
{
|
|
|
|
loff_t pos = ((loff_t)pos_high << 32) | pos_low;
|
|
|
|
|
|
|
|
if (pos == -1)
|
2020-09-25 06:51:42 +02:00
|
|
|
return do_readv(fd, vec, vlen, flags);
|
|
|
|
return do_preadv(fd, vec, vlen, pos, flags);
|
2013-03-20 10:42:10 -04:00
|
|
|
}
|
|
|
|
|
2014-03-05 10:43:51 +01:00
|
|
|
#ifdef __ARCH_WANT_COMPAT_SYS_PWRITEV64
|
|
|
|
COMPAT_SYSCALL_DEFINE4(pwritev64, unsigned long, fd,
|
2020-09-25 06:51:42 +02:00
|
|
|
const struct iovec __user *, vec,
|
2014-03-05 10:43:51 +01:00
|
|
|
unsigned long, vlen, loff_t, pos)
|
|
|
|
{
|
2020-09-25 06:51:42 +02:00
|
|
|
return do_pwritev(fd, vec, vlen, pos, 0);
|
2014-03-05 10:43:51 +01:00
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
2014-01-29 14:05:44 -08:00
|
|
|
COMPAT_SYSCALL_DEFINE5(pwritev, compat_ulong_t, fd,
|
2020-09-25 06:51:42 +02:00
|
|
|
const struct iovec __user *, vec,
|
2014-01-29 14:05:44 -08:00
|
|
|
compat_ulong_t, vlen, u32, pos_low, u32, pos_high)
|
2013-03-20 10:42:10 -04:00
|
|
|
{
|
|
|
|
loff_t pos = ((loff_t)pos_high << 32) | pos_low;
|
2014-03-05 10:43:51 +01:00
|
|
|
|
2020-09-25 06:51:42 +02:00
|
|
|
return do_pwritev(fd, vec, vlen, pos, 0);
|
2013-03-20 10:42:10 -04:00
|
|
|
}
|
2016-03-03 16:03:59 +01:00
|
|
|
|
2016-07-14 12:31:53 -07:00
|
|
|
#ifdef __ARCH_WANT_COMPAT_SYS_PWRITEV64V2
|
|
|
|
COMPAT_SYSCALL_DEFINE5(pwritev64v2, unsigned long, fd,
|
2020-09-25 06:51:42 +02:00
|
|
|
const struct iovec __user *, vec,
|
2017-07-06 18:58:37 +02:00
|
|
|
unsigned long, vlen, loff_t, pos, rwf_t, flags)
|
2016-07-14 12:31:53 -07:00
|
|
|
{
|
2018-12-06 20:05:34 +01:00
|
|
|
if (pos == -1)
|
2020-09-25 06:51:42 +02:00
|
|
|
return do_writev(fd, vec, vlen, flags);
|
|
|
|
return do_pwritev(fd, vec, vlen, pos, flags);
|
2016-07-14 12:31:53 -07:00
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
2016-03-03 16:03:59 +01:00
|
|
|
COMPAT_SYSCALL_DEFINE6(pwritev2, compat_ulong_t, fd,
|
2020-09-25 06:51:42 +02:00
|
|
|
const struct iovec __user *, vec,
|
2017-07-06 18:58:37 +02:00
|
|
|
compat_ulong_t, vlen, u32, pos_low, u32, pos_high, rwf_t, flags)
|
2016-03-03 16:03:59 +01:00
|
|
|
{
|
|
|
|
loff_t pos = ((loff_t)pos_high << 32) | pos_low;
|
|
|
|
|
|
|
|
if (pos == -1)
|
2020-09-25 06:51:42 +02:00
|
|
|
return do_writev(fd, vec, vlen, flags);
|
|
|
|
return do_pwritev(fd, vec, vlen, pos, flags);
|
2013-03-20 10:42:10 -04:00
|
|
|
}
|
2020-09-25 06:51:42 +02:00
|
|
|
#endif /* CONFIG_COMPAT */
|
2013-03-20 10:42:10 -04:00
|
|
|
|
2013-02-24 02:17:03 -05:00
|
|
|
static ssize_t do_sendfile(int out_fd, int in_fd, loff_t *ppos,
|
2023-12-12 11:44:36 +02:00
|
|
|
size_t count, loff_t max)
|
2005-04-16 15:20:36 -07:00
|
|
|
{
|
2012-08-28 12:52:22 -04:00
|
|
|
struct inode *in_inode, *out_inode;
|
2021-01-25 22:24:28 -05:00
|
|
|
struct pipe_inode_info *opipe;
|
2005-04-16 15:20:36 -07:00
|
|
|
loff_t pos;
|
2013-06-20 18:58:36 +04:00
|
|
|
loff_t out_pos;
|
2005-04-16 15:20:36 -07:00
|
|
|
ssize_t retval;
|
2012-08-28 12:52:22 -04:00
|
|
|
int fl;
|
2005-04-16 15:20:36 -07:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Get input file, and verify that it is ok.
|
|
|
|
*/
|
2024-07-19 21:19:02 -04:00
|
|
|
CLASS(fd, in)(in_fd);
|
|
|
|
if (fd_empty(in))
|
|
|
|
return -EBADF;
|
2024-05-31 14:12:01 -04:00
|
|
|
if (!(fd_file(in)->f_mode & FMODE_READ))
|
2024-07-19 21:19:02 -04:00
|
|
|
return -EBADF;
|
2013-06-20 18:58:36 +04:00
|
|
|
if (!ppos) {
|
2024-05-31 14:12:01 -04:00
|
|
|
pos = fd_file(in)->f_pos;
|
2013-06-20 18:58:36 +04:00
|
|
|
} else {
|
|
|
|
pos = *ppos;
|
2024-05-31 14:12:01 -04:00
|
|
|
if (!(fd_file(in)->f_mode & FMODE_PREAD))
|
2024-07-19 21:19:02 -04:00
|
|
|
return -ESPIPE;
|
2013-06-20 18:58:36 +04:00
|
|
|
}
|
2024-05-31 14:12:01 -04:00
|
|
|
retval = rw_verify_area(READ, fd_file(in), &pos, count);
|
2006-01-04 16:20:40 -08:00
|
|
|
if (retval < 0)
|
2024-07-19 21:19:02 -04:00
|
|
|
return retval;
|
2016-03-31 21:48:20 -04:00
|
|
|
if (count > MAX_RW_COUNT)
|
|
|
|
count = MAX_RW_COUNT;
|
2005-04-16 15:20:36 -07:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Get output file, and verify that it is ok.
|
|
|
|
*/
|
2024-07-19 21:19:02 -04:00
|
|
|
CLASS(fd, out)(out_fd);
|
|
|
|
if (fd_empty(out))
|
|
|
|
return -EBADF;
|
2024-05-31 14:12:01 -04:00
|
|
|
if (!(fd_file(out)->f_mode & FMODE_WRITE))
|
2024-07-19 21:19:02 -04:00
|
|
|
return -EBADF;
|
2024-05-31 14:12:01 -04:00
|
|
|
in_inode = file_inode(fd_file(in));
|
|
|
|
out_inode = file_inode(fd_file(out));
|
|
|
|
out_pos = fd_file(out)->f_pos;
|
2005-04-16 15:20:36 -07:00
|
|
|
|
|
|
|
if (!max)
|
|
|
|
max = min(in_inode->i_sb->s_maxbytes, out_inode->i_sb->s_maxbytes);
|
|
|
|
|
|
|
|
if (unlikely(pos + count > max)) {
|
|
|
|
if (pos >= max)
|
2024-07-19 21:19:02 -04:00
|
|
|
return -EOVERFLOW;
|
2005-04-16 15:20:36 -07:00
|
|
|
count = max - pos;
|
|
|
|
}
|
|
|
|
|
2007-06-11 12:18:52 +02:00
|
|
|
fl = 0;
|
2007-06-01 14:52:37 +02:00
|
|
|
#if 0
|
2007-06-11 12:18:52 +02:00
|
|
|
/*
|
|
|
|
* We need to debate whether we can enable this or not. The
|
|
|
|
* man page documents EAGAIN return for the output at least,
|
|
|
|
* and the application is arguably buggy if it doesn't expect
|
|
|
|
* EAGAIN on a non-blocking file descriptor.
|
|
|
|
*/
|
2024-05-31 14:12:01 -04:00
|
|
|
if (fd_file(in)->f_flags & O_NONBLOCK)
|
2007-06-11 12:18:52 +02:00
|
|
|
fl = SPLICE_F_NONBLOCK;
|
2007-06-01 14:52:37 +02:00
|
|
|
#endif
|
2024-05-31 14:12:01 -04:00
|
|
|
opipe = get_pipe_info(fd_file(out), true);
|
2021-01-25 22:24:28 -05:00
|
|
|
if (!opipe) {
|
2024-05-31 14:12:01 -04:00
|
|
|
retval = rw_verify_area(WRITE, fd_file(out), &out_pos, count);
|
2021-01-25 22:24:28 -05:00
|
|
|
if (retval < 0)
|
2024-07-19 21:19:02 -04:00
|
|
|
return retval;
|
2024-05-31 14:12:01 -04:00
|
|
|
retval = do_splice_direct(fd_file(in), &pos, fd_file(out), &out_pos,
|
2021-01-25 22:24:28 -05:00
|
|
|
count, fl);
|
|
|
|
} else {
|
2024-05-31 14:12:01 -04:00
|
|
|
if (fd_file(out)->f_flags & O_NONBLOCK)
|
fs: sendfile handles O_NONBLOCK of out_fd
sendfile has to return EAGAIN if out_fd is nonblocking and the write into
it would block.
Here is a small reproducer for the problem:
    #define _GNU_SOURCE /* See feature_test_macros(7) */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <errno.h>
    #include <sys/stat.h>
    #include <sys/types.h>
    #include <sys/sendfile.h>
    #define FILE_SIZE (1UL << 30)
    int main(int argc, char **argv) {
        int p[2], fd;
        if (pipe2(p, O_NONBLOCK))
            return 1;
        fd = open(argv[1], O_RDWR | O_TMPFILE, 0666);
        if (fd < 0)
            return 1;
        ftruncate(fd, FILE_SIZE);
        if (sendfile(p[1], fd, 0, FILE_SIZE) == -1) {
            fprintf(stderr, "FAIL\n");
        }
        if (sendfile(p[1], fd, 0, FILE_SIZE) != -1 || errno != EAGAIN) {
            fprintf(stderr, "FAIL\n");
        }
        return 0;
    }
It worked before b964bf53e540, it is stuck after b964bf53e540, and it
works again with this fix.
This regression occurred because do_splice_direct() calls pipe_write
that handles O_NONBLOCK. Here is a trace log from the reproducer:
1)              | __x64_sys_sendfile64() {
1)              |   do_sendfile() {
1)              |     __fdget()
1)              |     rw_verify_area()
1)              |     __fdget()
1)              |     rw_verify_area()
1)              |     do_splice_direct() {
1)              |       rw_verify_area()
1)              |       splice_direct_to_actor() {
1)              |         do_splice_to() {
1)              |           rw_verify_area()
1)              |           generic_file_splice_read()
1)  + 74.153 us |         }
1)              |         direct_splice_actor() {
1)              |           iter_file_splice_write() {
1)              |             __kmalloc()
1)     0.148 us |             pipe_lock();
1)     0.153 us |             splice_from_pipe_next.part.0();
1)     0.162 us |             page_cache_pipe_buf_confirm();
                              ... 16 times
1)     0.159 us |             page_cache_pipe_buf_confirm();
1)              |             vfs_iter_write() {
1)              |               do_iter_write() {
1)              |                 rw_verify_area()
1)              |                 do_iter_readv_writev() {
1)              |                   pipe_write() {
1)              |                     mutex_lock()
1)     0.153 us |                     mutex_unlock();
1)     1.368 us |                   }
1)     1.686 us |                 }
1)     5.798 us |               }
1)     6.084 us |             }
1)     0.174 us |             kfree();
1)     0.152 us |             pipe_unlock();
1)  + 14.461 us |           }
1)  + 14.783 us |         }
1)     0.164 us |         page_cache_pipe_buf_release();
                          ... 16 times
1)     0.161 us |         page_cache_pipe_buf_release();
1)              |         touch_atime()
1)  + 95.854 us |       }
1)  + 99.784 us |     }
1) ! 107.393 us |   }
1) ! 107.699 us | }
Link: https://lkml.kernel.org/r/20220415005015.525191-1-avagin@gmail.com
Fixes: b964bf53e540 ("teach sendfile(2) to handle send-to-pipe directly")
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-16 21:37:10 -07:00
|
|
|
fl |= SPLICE_F_NONBLOCK;
|
|
|
|
|
2024-05-31 14:12:01 -04:00
|
|
|
retval = splice_file_to_pipe(fd_file(in), opipe, &pos, count, fl);
|
2021-01-25 22:24:28 -05:00
|
|
|
}
|
2005-04-16 15:20:36 -07:00
|
|
|
|
|
|
|
if (retval > 0) {
|
2007-02-10 01:46:45 -08:00
|
|
|
add_rchar(current, retval);
|
|
|
|
add_wchar(current, retval);
|
2024-05-31 14:12:01 -04:00
|
|
|
fsnotify_access(fd_file(in));
|
|
|
|
fsnotify_modify(fd_file(out));
|
|
|
|
fd_file(out)->f_pos = out_pos;
|
2013-06-20 18:58:36 +04:00
|
|
|
if (ppos)
|
|
|
|
*ppos = pos;
|
|
|
|
else
|
2024-05-31 14:12:01 -04:00
|
|
|
fd_file(in)->f_pos = pos;
|
2005-04-16 15:20:36 -07:00
|
|
|
}
|
|
|
|
|
2007-02-10 01:46:45 -08:00
|
|
|
inc_syscr(current);
|
|
|
|
inc_syscw(current);
|
2013-06-20 18:58:36 +04:00
|
|
|
if (pos > max)
|
2005-04-16 15:20:36 -07:00
|
|
|
retval = -EOVERFLOW;
|
|
|
|
return retval;
|
|
|
|
}
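The size clamp in do_sendfile() above (the `pos + count > max` test) can be modelled in isolation. This is a hedged user-space sketch, not the kernel's code: `clamp_to_max` is a hypothetical name, `int64_t` stands in for both `loff_t` and the byte count, and the caller is assumed to pass non-negative values whose sum does not overflow.

```c
#include <errno.h>
#include <stdint.h>

/*
 * Shorten a transfer of *count bytes starting at pos so that it ends at
 * max. Returns 0 on success (possibly with *count reduced) or -EOVERFLOW
 * when a transfer that would cross max already starts at or beyond it.
 * Assumes pos >= 0, *count >= 0 and that pos + *count fits in int64_t.
 */
static int clamp_to_max(int64_t pos, int64_t *count, int64_t max)
{
	if (pos + *count > max) {
		if (pos >= max)
			return -EOVERFLOW;
		*count = max - pos;	/* shorten instead of failing */
	}
	return 0;
}
```

A request for 20 bytes at offset 90 with a limit of 100 is shortened to 10 bytes, a 1-byte request at offset 100 fails with -EOVERFLOW, and a zero-length request at exactly the limit is accepted unchanged.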
|
|
|
|
|
2009-01-14 14:14:18 +01:00
|
|
|
SYSCALL_DEFINE4(sendfile, int, out_fd, int, in_fd, off_t __user *, offset, size_t, count)
|
2005-04-16 15:20:36 -07:00
|
|
|
{
|
|
|
|
loff_t pos;
|
|
|
|
off_t off;
|
|
|
|
ssize_t ret;
|
|
|
|
|
|
|
|
if (offset) {
|
|
|
|
if (unlikely(get_user(off, offset)))
|
|
|
|
return -EFAULT;
|
|
|
|
pos = off;
|
|
|
|
ret = do_sendfile(out_fd, in_fd, &pos, count, MAX_NON_LFS);
|
|
|
|
if (unlikely(put_user(pos, offset)))
|
|
|
|
return -EFAULT;
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
return do_sendfile(out_fd, in_fd, NULL, count, 0);
|
|
|
|
}
|
|
|
|
|
2009-01-14 14:14:18 +01:00
|
|
|
SYSCALL_DEFINE4(sendfile64, int, out_fd, int, in_fd, loff_t __user *, offset, size_t, count)
|
2005-04-16 15:20:36 -07:00
|
|
|
{
|
|
|
|
loff_t pos;
|
|
|
|
ssize_t ret;
|
|
|
|
|
|
|
|
if (offset) {
|
|
|
|
if (unlikely(copy_from_user(&pos, offset, sizeof(loff_t))))
|
|
|
|
return -EFAULT;
|
|
|
|
ret = do_sendfile(out_fd, in_fd, &pos, count, 0);
|
|
|
|
if (unlikely(put_user(pos, offset)))
|
|
|
|
return -EFAULT;
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
return do_sendfile(out_fd, in_fd, NULL, count, 0);
|
|
|
|
}
|
2013-02-24 02:17:03 -05:00
|
|
|
|
|
|
|
#ifdef CONFIG_COMPAT
|
|
|
|
COMPAT_SYSCALL_DEFINE4(sendfile, int, out_fd, int, in_fd,
|
|
|
|
compat_off_t __user *, offset, compat_size_t, count)
|
|
|
|
{
|
|
|
|
loff_t pos;
|
|
|
|
off_t off;
|
|
|
|
ssize_t ret;
|
|
|
|
|
|
|
|
if (offset) {
|
|
|
|
if (unlikely(get_user(off, offset)))
|
|
|
|
return -EFAULT;
|
|
|
|
pos = off;
|
|
|
|
ret = do_sendfile(out_fd, in_fd, &pos, count, MAX_NON_LFS);
|
|
|
|
if (unlikely(put_user(pos, offset)))
|
|
|
|
return -EFAULT;
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
return do_sendfile(out_fd, in_fd, NULL, count, 0);
|
|
|
|
}
|
|
|
|
|
|
|
|
COMPAT_SYSCALL_DEFINE4(sendfile64, int, out_fd, int, in_fd,
|
|
|
|
compat_loff_t __user *, offset, compat_size_t, count)
|
|
|
|
{
|
|
|
|
loff_t pos;
|
|
|
|
ssize_t ret;
|
|
|
|
|
|
|
|
if (offset) {
|
|
|
|
if (unlikely(copy_from_user(&pos, offset, sizeof(loff_t))))
|
|
|
|
return -EFAULT;
|
|
|
|
ret = do_sendfile(out_fd, in_fd, &pos, count, 0);
|
|
|
|
if (unlikely(put_user(pos, offset)))
|
|
|
|
return -EFAULT;
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
return do_sendfile(out_fd, in_fd, NULL, count, 0);
|
|
|
|
}
|
|
|
|
#endif
|
2015-11-10 16:53:30 -05:00
|
|
|
|
2020-10-15 09:21:17 -07:00
|
|
|
/*
|
|
|
|
* Performs necessary checks before doing a file copy
|
|
|
|
*
|
|
|
|
* Can adjust amount of bytes to copy via @req_count argument.
|
|
|
|
* Returns appropriate error code that caller should return or
|
|
|
|
* zero in case the copy should be allowed.
|
|
|
|
*/
|
|
|
|
static int generic_copy_file_checks(struct file *file_in, loff_t pos_in,
|
|
|
|
struct file *file_out, loff_t pos_out,
|
|
|
|
size_t *req_count, unsigned int flags)
|
|
|
|
{
|
|
|
|
struct inode *inode_in = file_inode(file_in);
|
|
|
|
struct inode *inode_out = file_inode(file_out);
|
|
|
|
uint64_t count = *req_count;
|
|
|
|
loff_t size_in;
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
ret = generic_file_rw_checks(file_in, file_out);
|
|
|
|
if (ret)
|
|
|
|
return ret;
|
|
|
|
|
2022-06-30 22:58:49 +03:00
|
|
|
/*
|
|
|
|
* We allow some filesystems to handle cross sb copy, but passing
|
|
|
|
* a file of the wrong filesystem type to filesystem driver can result
|
|
|
|
* in an attempt to dereference the wrong type of ->private_data, so
|
|
|
|
* avoid doing that until we really have a good reason.
|
|
|
|
*
|
|
|
|
* nfs and cifs define several different file_system_type structures
|
|
|
|
* and several different sets of file_operations, but they all end up
|
|
|
|
* using the same ->copy_file_range() function pointer.
|
|
|
|
*/
|
2022-11-17 22:52:49 +02:00
|
|
|
if (flags & COPY_FILE_SPLICE) {
|
|
|
|
/* cross sb splice is allowed */
|
|
|
|
} else if (file_out->f_op->copy_file_range) {
|
2022-06-30 22:58:49 +03:00
|
|
|
if (file_in->f_op->copy_file_range !=
|
|
|
|
file_out->f_op->copy_file_range)
|
|
|
|
return -EXDEV;
|
|
|
|
} else if (file_inode(file_in)->i_sb != file_inode(file_out)->i_sb) {
|
|
|
|
return -EXDEV;
|
|
|
|
}
|
|
|
|
|
2020-10-15 09:21:17 -07:00
|
|
|
/* Don't touch certain kinds of inodes */
|
|
|
|
if (IS_IMMUTABLE(inode_out))
|
|
|
|
return -EPERM;
|
|
|
|
|
|
|
|
if (IS_SWAPFILE(inode_in) || IS_SWAPFILE(inode_out))
|
|
|
|
return -ETXTBSY;
|
|
|
|
|
|
|
|
/* Ensure offsets don't wrap. */
|
|
|
|
if (pos_in + count < pos_in || pos_out + count < pos_out)
|
|
|
|
return -EOVERFLOW;
|
|
|
|
|
|
|
|
/* Shorten the copy to EOF */
|
|
|
|
size_in = i_size_read(inode_in);
|
|
|
|
if (pos_in >= size_in)
|
|
|
|
count = 0;
|
|
|
|
else
|
|
|
|
count = min(count, size_in - (uint64_t)pos_in);
|
|
|
|
|
|
|
|
ret = generic_write_check_limits(file_out, pos_out, &count);
|
|
|
|
if (ret)
|
|
|
|
return ret;
|
|
|
|
|
|
|
|
/* Don't allow overlapped copying within the same file. */
|
|
|
|
if (inode_in == inode_out &&
|
|
|
|
pos_out + count > pos_in &&
|
|
|
|
pos_out < pos_in + count)
|
|
|
|
return -EINVAL;
|
|
|
|
|
|
|
|
*req_count = count;
|
|
|
|
return 0;
|
|
|
|
}
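Two of the checks in generic_copy_file_checks() above, the offset-wrap test and the same-file overlap test, are plain range arithmetic. The sketch below restates them in user space as an illustration only; `range_wraps` and `ranges_overlap` are hypothetical names, and `uint64_t` is assumed for every operand so that wrap-around is well defined.

```c
#include <stdbool.h>
#include <stdint.h>

/* True when pos + count wraps around the top of the 64-bit range. */
static bool range_wraps(uint64_t pos, uint64_t count)
{
	return pos + count < pos;
}

/*
 * True when the half-open ranges [in, in + count) and [out, out + count)
 * share at least one byte. Adjacent ranges and empty ranges do not overlap.
 */
static bool ranges_overlap(uint64_t in, uint64_t out, uint64_t count)
{
	return out + count > in && out < in + count;
}
```

For example, copying 10 bytes from offset 0 to offset 10 of the same file is allowed because the ranges only touch, while copying to offset 9 overlaps by one byte and would be rejected.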
|
|
|
|
|
2015-11-10 16:53:30 -05:00
|
|
|
/*
|
|
|
|
* copy_file_range() differs from regular file read and write in that it
|
|
|
|
* specifically allows returning partial success. When it does so is up to
|
|
|
|
* the copy_file_range method.
|
|
|
|
*/
|
|
|
|
ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
|
|
|
|
struct file *file_out, loff_t pos_out,
|
|
|
|
size_t len, unsigned int flags)
|
|
|
|
{
|
|
|
|
ssize_t ret;
|
2022-11-17 22:52:49 +02:00
|
|
|
bool splice = flags & COPY_FILE_SPLICE;
|
2023-11-30 16:16:24 +02:00
|
|
|
bool samesb = file_inode(file_in)->i_sb == file_inode(file_out)->i_sb;
|
2015-11-10 16:53:30 -05:00
|
|
|
|
2022-11-17 22:52:49 +02:00
|
|
|
if (flags & ~COPY_FILE_SPLICE)
|
2015-11-10 16:53:30 -05:00
|
|
|
return -EINVAL;
|
|
|
|
|
2019-06-05 08:04:49 -07:00
|
|
|
ret = generic_copy_file_checks(file_in, pos_in, file_out, pos_out, &len,
|
|
|
|
flags);
|
2019-06-05 08:04:48 -07:00
|
|
|
if (unlikely(ret))
|
|
|
|
return ret;
|
2017-01-31 10:34:56 +02:00
|
|
|
|
2015-11-10 16:53:30 -05:00
|
|
|
ret = rw_verify_area(READ, file_in, &pos_in, len);
|
2016-03-31 21:48:20 -04:00
|
|
|
if (unlikely(ret))
|
|
|
|
return ret;
|
|
|
|
|
|
|
|
ret = rw_verify_area(WRITE, file_out, &pos_out, len);
|
|
|
|
if (unlikely(ret))
|
2015-11-10 16:53:30 -05:00
|
|
|
return ret;
|
|
|
|
|
|
|
|
if (len == 0)
|
|
|
|
return 0;
|
|
|
|
|
2017-01-31 10:34:57 +02:00
|
|
|
file_start_write(file_out);
|
2015-11-10 16:53:30 -05:00
|
|
|
|
2016-12-09 16:17:19 -08:00
|
|
|
/*
|
2022-06-30 22:58:49 +03:00
|
|
|
* Cloning is supported by more file systems, so we implement copy on
|
|
|
|
* same sb using clone, but for filesystems where both clone and copy
|
|
|
|
* are supported (e.g. nfs,cifs), we only call the copy method.
|
2016-12-09 16:17:19 -08:00
|
|
|
*/
|
2022-11-17 22:52:49 +02:00
|
|
|
if (!splice && file_out->f_op->copy_file_range) {
|
2022-06-30 22:58:49 +03:00
|
|
|
ret = file_out->f_op->copy_file_range(file_in, pos_in,
|
|
|
|
file_out, pos_out,
|
|
|
|
len, flags);
|
2023-11-30 16:16:24 +02:00
|
|
|
} else if (!splice && file_in->f_op->remap_file_range && samesb) {
|
2022-06-30 22:58:49 +03:00
|
|
|
ret = file_in->f_op->remap_file_range(file_in, pos_in,
|
2018-10-30 10:41:49 +11:00
|
|
|
file_out, pos_out,
|
2018-10-30 10:42:10 +11:00
|
|
|
min_t(loff_t, MAX_RW_COUNT, len),
|
|
|
|
REMAP_FILE_CAN_SHORTEN);
|
2023-11-30 16:16:24 +02:00
|
|
|
/* fallback to splice */
|
|
|
|
if (ret <= 0)
|
|
|
|
splice = true;
|
|
|
|
} else if (samesb) {
|
|
|
|
/* Fallback to splice for same sb copy for backward compat */
|
|
|
|
splice = true;
|
2016-12-09 16:17:19 -08:00
|
|
|
}
|
|
|
|
|
2023-11-30 16:16:24 +02:00
|
|
|
file_end_write(file_out);
|
|
|
|
|
|
|
|
if (!splice)
|
|
|
|
goto done;
|
|
|
|
|
2022-06-30 22:58:49 +03:00
|
|
|
/*
|
|
|
|
* We can get here for same sb copy of filesystems that do not implement
|
|
|
|
* ->copy_file_range() in case filesystem does not support clone or in
|
|
|
|
* case filesystem supports clone but rejected the clone request (e.g.
|
|
|
|
* because it was not block aligned).
|
|
|
|
*
|
|
|
|
* In both cases, fall back to kernel copy so we are able to maintain a
|
|
|
|
* consistent story about which filesystems support copy_file_range()
|
|
|
|
* and which filesystems do not, that will allow userspace tools to
|
|
|
|
* make consistent decisions w.r.t. using copy_file_range().
|
2022-11-17 22:52:49 +02:00
|
|
|
*
|
2023-11-30 16:16:24 +02:00
|
|
|
* We also get here if caller (e.g. nfsd) requested COPY_FILE_SPLICE
|
|
|
|
* for server-side-copy between any two sb.
|
|
|
|
*
|
|
|
|
* In any case, we call do_splice_direct() and not splice_file_range(),
|
|
|
|
* without file_start_write() held, to avoid possible deadlocks related
|
|
|
|
* to splicing from input file, while file_start_write() is held on
|
|
|
|
* the output file on a different sb.
|
2022-06-30 22:58:49 +03:00
|
|
|
*/
|
2023-11-30 16:16:24 +02:00
|
|
|
ret = do_splice_direct(file_in, &pos_in, file_out, &pos_out,
|
|
|
|
min_t(size_t, len, MAX_RW_COUNT), 0);
|
2016-12-09 16:17:19 -08:00
|
|
|
done:
|
2015-11-10 16:53:30 -05:00
|
|
|
if (ret > 0) {
|
|
|
|
fsnotify_access(file_in);
|
|
|
|
add_rchar(current, ret);
|
|
|
|
fsnotify_modify(file_out);
|
|
|
|
add_wchar(current, ret);
|
|
|
|
}
|
2016-12-09 16:17:19 -08:00
|
|
|
|
2015-11-10 16:53:30 -05:00
|
|
|
inc_syscr(current);
|
|
|
|
inc_syscw(current);
|
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(vfs_copy_file_range);
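Because copy_file_range() may report partial success, a caller that needs the whole range typically loops until the request is satisfied or the source runs dry. A minimal user-space sketch of such a loop follows; `fake_src`, `fake_copy_step` and `copy_fully` are hypothetical names, and the step function only simulates a short copy rather than invoking the real syscall.

```c
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical source that hands out at most `cap` bytes per call. */
struct fake_src {
	size_t remaining;	/* bytes still available */
	size_t cap;		/* largest chunk a single call may copy */
};

/* Stand-in for one copy call: may copy fewer bytes than requested. */
static ssize_t fake_copy_step(struct fake_src *src, size_t len)
{
	size_t n = len;

	if (n > src->cap)
		n = src->cap;
	if (n > src->remaining)
		n = src->remaining;
	src->remaining -= n;
	return (ssize_t)n;
}

/*
 * Keep calling the step until len bytes are copied, the source is
 * exhausted (step returns 0), or an error (negative return) occurs.
 * Bytes already copied are reported in preference to a later error.
 */
static ssize_t copy_fully(struct fake_src *src, size_t len)
{
	size_t done = 0;

	while (done < len) {
		ssize_t n = fake_copy_step(src, len - done);

		if (n < 0)
			return done ? (ssize_t)done : n;
		if (n == 0)
			break;
		done += (size_t)n;
	}
	return (ssize_t)done;
}
```

With a 10000-byte source capped at 4096 bytes per call, a request for 8000 bytes completes in two steps, and a following request for 5000 bytes returns the 2000 bytes that remain.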
|
|
|
|
|
|
|
|
SYSCALL_DEFINE6(copy_file_range, int, fd_in, loff_t __user *, off_in,
|
|
|
|
int, fd_out, loff_t __user *, off_out,
|
|
|
|
size_t, len, unsigned int, flags)
|
|
|
|
{
|
|
|
|
loff_t pos_in;
|
|
|
|
loff_t pos_out;
|
|
|
|
ssize_t ret = -EBADF;
|
|
|
|
|
2024-07-19 20:17:58 -04:00
|
|
|
CLASS(fd, f_in)(fd_in);
|
|
|
|
if (fd_empty(f_in))
|
|
|
|
return -EBADF;
|
2015-11-10 16:53:30 -05:00
|
|
|
|
2024-07-19 20:17:58 -04:00
|
|
|
CLASS(fd, f_out)(fd_out);
|
|
|
|
if (fd_empty(f_out))
|
|
|
|
return -EBADF;
|
2015-11-10 16:53:30 -05:00
|
|
|
|
|
|
|
if (off_in) {
|
|
|
|
if (copy_from_user(&pos_in, off_in, sizeof(loff_t)))
|
2024-07-19 20:17:58 -04:00
|
|
|
return -EFAULT;
|
2015-11-10 16:53:30 -05:00
|
|
|
} else {
|
2024-05-31 14:12:01 -04:00
|
|
|
pos_in = fd_file(f_in)->f_pos;
|
2015-11-10 16:53:30 -05:00
|
|
|
}
|
|
|
|
|
|
|
|
if (off_out) {
|
|
|
|
if (copy_from_user(&pos_out, off_out, sizeof(loff_t)))
|
2024-07-19 20:17:58 -04:00
|
|
|
return -EFAULT;
|
2015-11-10 16:53:30 -05:00
|
|
|
} else {
|
2024-05-31 14:12:01 -04:00
|
|
|
pos_out = fd_file(f_out)->f_pos;
|
2015-11-10 16:53:30 -05:00
|
|
|
}
|
|
|
|
|
2022-11-17 22:52:49 +02:00
|
|
|
if (flags != 0)
|
2024-07-19 20:17:58 -04:00
|
|
|
return -EINVAL;
|
2022-11-17 22:52:49 +02:00
|
|
|
|
2024-05-31 14:12:01 -04:00
|
|
|
ret = vfs_copy_file_range(fd_file(f_in), pos_in, fd_file(f_out), pos_out, len,
|
2015-11-10 16:53:30 -05:00
|
|
|
flags);
|
|
|
|
if (ret > 0) {
|
|
|
|
pos_in += ret;
|
|
|
|
pos_out += ret;
|
|
|
|
|
|
|
|
if (off_in) {
|
|
|
|
if (copy_to_user(off_in, &pos_in, sizeof(loff_t)))
|
|
|
|
ret = -EFAULT;
|
|
|
|
} else {
|
2024-05-31 14:12:01 -04:00
|
|
|
fd_file(f_in)->f_pos = pos_in;
|
2015-11-10 16:53:30 -05:00
|
|
|
}
|
|
|
|
|
|
|
|
if (off_out) {
|
|
|
|
if (copy_to_user(off_out, &pos_out, sizeof(loff_t)))
|
|
|
|
ret = -EFAULT;
|
|
|
|
} else {
|
2024-05-31 14:12:01 -04:00
|
|
|
fd_file(f_out)->f_pos = pos_out;
|
2015-11-10 16:53:30 -05:00
|
|
|
}
|
|
|
|
}
|
|
|
|
return ret;
|
|
|
|
}
|
2015-12-03 12:59:50 +01:00
|
|
|
|
2019-08-11 15:52:25 -07:00
|
|
|
/*
|
2020-10-15 09:21:17 -07:00
|
|
|
* Don't operate on ranges the page cache doesn't support, and don't exceed the
|
|
|
|
* LFS limits. If pos is under the limit it becomes a short access. If it
|
|
|
|
* is at or beyond the limit we return -EFBIG.
|
2019-08-11 15:52:25 -07:00
|
|
|
*/
|
2020-10-15 09:21:17 -07:00
|
|
|
int generic_write_check_limits(struct file *file, loff_t pos, loff_t *count)
|
2019-08-11 15:52:25 -07:00
|
|
|
{
|
2020-10-15 09:21:17 -07:00
|
|
|
struct inode *inode = file->f_mapping->host;
|
|
|
|
loff_t max_size = inode->i_sb->s_maxbytes;
|
|
|
|
loff_t limit = rlimit(RLIMIT_FSIZE);
|
2019-08-11 15:52:25 -07:00
|
|
|
|
2020-10-15 09:21:17 -07:00
|
|
|
if (limit != RLIM_INFINITY) {
|
|
|
|
if (pos >= limit) {
|
|
|
|
send_sig(SIGXFSZ, current, 0);
|
|
|
|
return -EFBIG;
|
2019-08-11 15:52:25 -07:00
|
|
|
}
|
2020-10-15 09:21:17 -07:00
|
|
|
*count = min(*count, limit - pos);
|
|
|
|
}
|
2019-08-11 15:52:25 -07:00
|
|
|
|
2020-10-15 09:21:17 -07:00
|
|
|
if (!(file->f_flags & O_LARGEFILE))
|
|
|
|
max_size = MAX_NON_LFS;
|
2018-10-30 10:42:17 +11:00
|
|
|
|
2020-10-15 09:21:17 -07:00
|
|
|
if (unlikely(pos >= max_size))
|
|
|
|
return -EFBIG;
|
2018-10-30 10:42:17 +11:00
|
|
|
|
2020-10-15 09:21:17 -07:00
|
|
|
*count = min(*count, max_size - pos);
|
2018-10-30 10:42:17 +11:00
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
2024-04-15 14:54:13 -07:00
|
|
|
EXPORT_SYMBOL_GPL(generic_write_check_limits);
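The clamping in generic_write_check_limits() can be tried out in isolation. The sketch below is a userspace model under stated assumptions: `model_write_check_limits()` and `MODEL_NO_LIMIT` are hypothetical names, the RLIMIT_FSIZE value and the effective maximum size (s_maxbytes, or MAX_NON_LFS without O_LARGEFILE) are passed in as plain parameters, and the SIGXFSZ delivery is left out.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for RLIM_INFINITY in this model. */
#define MODEL_NO_LIMIT INT64_MAX

static int64_t min_i64(int64_t a, int64_t b)
{
	return a < b ? a : b;
}

/*
 * Returns 0 and possibly shortens *count, or -EFBIG when pos is already at
 * or beyond one of the limits. A write that starts below a limit but would
 * cross it becomes a short write instead of an error.
 */
int model_write_check_limits(int64_t pos, int64_t *count,
			     int64_t rlimit_fsize, int64_t max_size)
{
	assert(count != NULL);

	if (rlimit_fsize != MODEL_NO_LIMIT) {
		if (pos >= rlimit_fsize)
			return -EFBIG;	/* the kernel also sends SIGXFSZ here */
		*count = min_i64(*count, rlimit_fsize - pos);
	}

	if (pos >= max_size)
		return -EFBIG;

	*count = min_i64(*count, max_size - pos);
	return 0;
}
```

For example, a 100-byte write at position 50 with a resource limit of 120 is shortened to 70 bytes, while the same write at position 120 fails with -EFBIG.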
|
2015-12-03 12:59:50 +01:00
|
|
|
|
2021-08-12 15:34:57 -07:00
|
|
|
/* Like generic_write_checks(), but takes size of write instead of iter. */
|
|
|
|
int generic_write_checks_count(struct kiocb *iocb, loff_t *count)
|
2016-12-09 16:18:30 -08:00
|
|
|
{
|
2020-10-15 09:21:17 -07:00
|
|
|
struct file *file = iocb->ki_filp;
|
|
|
|
struct inode *inode = file->f_mapping->host;
|
2016-12-09 16:18:30 -08:00
|
|
|
|
2020-10-15 09:21:17 -07:00
|
|
|
if (IS_SWAPFILE(inode))
|
2016-12-09 16:18:30 -08:00
|
|
|
return -ETXTBSY;
|
|
|
|
|
2021-08-12 15:34:57 -07:00
|
|
|
if (!*count)
|
2020-10-15 09:21:17 -07:00
|
|
|
return 0;
|
2018-09-10 16:21:17 -07:00
|
|
|
|
2020-10-15 09:21:17 -07:00
|
|
|
if (iocb->ki_flags & IOCB_APPEND)
|
|
|
|
iocb->ki_pos = i_size_read(inode);
|
2018-07-06 23:57:03 +02:00
|
|
|
|
2022-06-23 10:51:50 -07:00
|
|
|
if ((iocb->ki_flags & IOCB_NOWAIT) &&
|
|
|
|
!((iocb->ki_flags & IOCB_DIRECT) ||
|
2024-03-28 13:27:24 +01:00
|
|
|
(file->f_op->fop_flags & FOP_BUFFER_WASYNC)))
|
2020-10-15 09:21:17 -07:00
|
|
|
return -EINVAL;
|
2018-07-06 23:57:03 +02:00
|
|
|
|
2021-08-12 15:34:57 -07:00
|
|
|
return generic_write_check_limits(iocb->ki_filp, iocb->ki_pos, count);
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(generic_write_checks_count);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Performs necessary checks before doing a write
|
|
|
|
*
|
|
|
|
* Can adjust writing position or amount of bytes to write.
|
|
|
|
* Returns appropriate error code that caller should return or
|
|
|
|
* zero in case that write should be allowed.
|
|
|
|
*/
|
|
|
|
ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from)
|
|
|
|
{
|
|
|
|
loff_t count = iov_iter_count(from);
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
ret = generic_write_checks_count(iocb, &count);
|
2018-07-06 23:57:03 +02:00
|
|
|
if (ret)
|
|
|
|
return ret;
|
|
|
|
|
2020-10-15 09:21:17 -07:00
|
|
|
iov_iter_truncate(from, count);
|
|
|
|
return iov_iter_count(from);
|
2018-07-06 23:57:03 +02:00
|
|
|
}
|
2020-10-15 09:21:17 -07:00
|
|
|
EXPORT_SYMBOL(generic_write_checks);
|
2018-07-06 23:57:03 +02:00
|
|
|
|
2020-10-15 09:21:17 -07:00
|
|
|
/*
|
|
|
|
* Performs common checks before doing a file copy/clone
|
|
|
|
* from @file_in to @file_out.
|
|
|
|
*/
|
|
|
|
int generic_file_rw_checks(struct file *file_in, struct file *file_out)
|
2015-12-19 00:55:59 -08:00
|
|
|
{
|
2020-10-15 09:21:17 -07:00
|
|
|
struct inode *inode_in = file_inode(file_in);
|
|
|
|
struct inode *inode_out = file_inode(file_out);
|
2015-12-19 00:55:59 -08:00
|
|
|
|
2020-10-15 09:21:17 -07:00
|
|
|
/* Don't copy dirs, pipes, sockets... */
|
|
|
|
if (S_ISDIR(inode_in->i_mode) || S_ISDIR(inode_out->i_mode))
|
2018-11-19 13:31:12 -08:00
|
|
|
return -EISDIR;
|
2020-10-15 09:21:17 -07:00
|
|
|
if (!S_ISREG(inode_in->i_mode) || !S_ISREG(inode_out->i_mode))
|
2016-12-19 15:13:26 -08:00
|
|
|
return -EINVAL;
|
|
|
|
|
2020-10-15 09:21:17 -07:00
|
|
|
if (!(file_in->f_mode & FMODE_READ) ||
|
|
|
|
!(file_out->f_mode & FMODE_WRITE) ||
|
|
|
|
(file_out->f_flags & O_APPEND))
|
|
|
|
return -EBADF;
|
2018-07-06 23:57:03 +02:00
|
|
|
|
2020-10-15 09:21:17 -07:00
|
|
|
return 0;
|
2015-12-19 00:55:59 -08:00
|
|
|
}
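The order of the checks in generic_file_rw_checks() decides which error a caller sees: directories are reported first, then any other non-regular file, then access-mode problems. A rough userspace approximation is sketched below. It is an assumption-laden model: `model_rw_checks()` is a hypothetical name, and open(2) flags are used in place of the kernel's FMODE_READ and FMODE_WRITE bits.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>

/*
 * Model of the pre-copy checks, using st_mode values and open(2) flags.
 * Readability and writability are derived from O_ACCMODE, which is an
 * approximation of the f_mode tests in the kernel.
 */
int model_rw_checks(mode_t mode_in, mode_t mode_out,
		    int flags_in, int flags_out)
{
	int readable = (flags_in & O_ACCMODE) != O_WRONLY;
	int writable = (flags_out & O_ACCMODE) != O_RDONLY;

	/* Directories get their own error code. */
	if (S_ISDIR(mode_in) || S_ISDIR(mode_out))
		return -EISDIR;
	/* Pipes, sockets, devices and so on are rejected. */
	if (!S_ISREG(mode_in) || !S_ISREG(mode_out))
		return -EINVAL;
	/* Source must be readable; destination writable and not O_APPEND. */
	if (!readable || !writable || (flags_out & O_APPEND))
		return -EBADF;

	assert(S_ISREG(mode_in) && S_ISREG(mode_out));
	return 0;
}
```

A caller could feed this the `st_mode` from fstat() and the flags from `fcntl(fd, F_GETFL)` to predict which error a copy or clone request is likely to hit.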
|
fs: Initial atomic write support
An atomic write is a write issued with torn-write protection, meaning
that for a power failure or any other hardware failure, all or none of the
data from the write will be stored, but never a mix of old and new data.
Userspace may add flag RWF_ATOMIC to pwritev2() to indicate that the
write is to be issued with torn-write prevention, according to special
alignment and length rules.
For any syscall interface utilizing struct iocb, add IOCB_ATOMIC for
iocb->ki_flags field to indicate the same.
A call to statx will give the relevant atomic write info for a file:
- atomic_write_unit_min
- atomic_write_unit_max
- atomic_write_segments_max
Both min and max values must be a power-of-2.
Applications can avail of the atomic write feature by ensuring that the total
length of a write is a power-of-2 in size and also sized between
atomic_write_unit_min and atomic_write_unit_max, inclusive. Applications
must ensure that the write is at a naturally-aligned offset in the file
wrt the total write length. The value in atomic_write_segments_max
indicates the upper limit for IOV_ITER iovcnt.
Add file mode flag FMODE_CAN_ATOMIC_WRITE, so files which do not have the
flag set will have RWF_ATOMIC rejected and not just ignored.
Add a type argument to kiocb_set_rw_flags() to allow reads which have
RWF_ATOMIC set to be rejected.
Helper function generic_atomic_write_valid() can be used by FSes to verify
compliant writes. There we check that the iov_iter type is ubuf, which
implies iovcnt==1 for pwritev2(), which is an initial restriction for
atomic_write_segments_max. Initially the only user will be the bdev file
operations write handler. We will rely on the block BIO submission path to
ensure write sizes are compliant for the bdev, so we don't need to check
atomic write sizes yet.
Signed-off-by: Prasad Singamsetty <prasad.singamsetty@oracle.com>
jpg: merge into single patch and much rewrite
Acked-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Link: https://lore.kernel.org/r/20240620125359.2684798-4-john.g.garry@oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-20 12:53:52 +00:00
|
|
|
|
2024-10-19 12:51:07 +00:00
|
|
|
int generic_atomic_write_valid(struct kiocb *iocb, struct iov_iter *iter)
|
2024-06-20 12:53:52 +00:00
|
|
|
{
|
|
|
|
size_t len = iov_iter_count(iter);
|
|
|
|
|
|
|
|
if (!iter_is_ubuf(iter))
|
2024-10-19 12:51:07 +00:00
|
|
|
return -EINVAL;
|
2024-06-20 12:53:52 +00:00
|
|
|
|
|
|
|
if (!is_power_of_2(len))
|
2024-10-19 12:51:07 +00:00
|
|
|
return -EINVAL;
|
2024-06-20 12:53:52 +00:00
|
|
|
|
2024-10-19 12:51:06 +00:00
|
|
|
if (!IS_ALIGNED(iocb->ki_pos, len))
|
2024-10-19 12:51:07 +00:00
|
|
|
return -EINVAL;
|
2024-06-20 12:53:52 +00:00
|
|
|
|
2024-10-19 12:51:07 +00:00
|
|
|
if (!(iocb->ki_flags & IOCB_DIRECT))
|
|
|
|
return -EOPNOTSUPP;
|
|
|
|
|
|
|
|
return 0;
|
2024-06-20 12:53:52 +00:00
|
|
|
}
|
2024-11-04 16:14:02 -08:00
|
|
|
EXPORT_SYMBOL_GPL(generic_atomic_write_valid);
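An application can pre-check a request against the same rules before issuing an RWF_ATOMIC write. The sketch below is a userspace model, not the kernel helper: `model_atomic_write_valid()` is a hypothetical name, the unit_min and unit_max parameters are assumed to come from statx, and the range check against them is an addition based on the commit message (the helper above leaves size limits to the block layer).

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

static bool is_pow2(uint64_t n)
{
	return n != 0 && (n & (n - 1)) == 0;
}

/*
 * Returns 0 when a write of 'len' bytes at offset 'pos' satisfies the
 * atomic write rules, or a negative errno otherwise.
 */
int model_atomic_write_valid(uint64_t pos, uint64_t len,
			     uint64_t unit_min, uint64_t unit_max,
			     bool direct_io)
{
	/* Both reported limits must be powers of two. */
	assert(is_pow2(unit_min) && is_pow2(unit_max));

	/* Total length must be a power of two. */
	if (!is_pow2(len))
		return -EINVAL;

	/* Offset must be naturally aligned to the length. */
	if (pos & (len - 1))
		return -EINVAL;

	/* Assumption: also enforce the statx-reported range, inclusive. */
	if (len < unit_min || len > unit_max)
		return -EINVAL;

	/* Mirrors the IOCB_DIRECT check: buffered atomic writes are refused. */
	if (!direct_io)
		return -EOPNOTSUPP;

	return 0;
}
```

For instance, with limits of 512 and 65536 bytes, a 4096-byte direct write at offset 8192 passes, while the same write at offset 4608 is rejected because the offset is not a multiple of the length.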
|