License cleanup: add SPDX GPL-2.0 license identifier to files with no license

Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.

By default all files without license information are under the default
license of the kernel, which is GPL version 2.

Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boilerplate text.

This patch is based on work done by Thomas Gleixner, Kate Stewart, and
Philippe Ombredanne.

How this work was done:

Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
 - file had no licensing information in it,
 - file was a */uapi/* one with no licensing information in it,
 - file was a */uapi/* one with existing licensing information.

Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.

The analysis to determine which SPDX license identifier should be applied
to a file was done in a spreadsheet of side-by-side results from the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few thousand files.

The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file-by-file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
should be applied to the file. She confirmed any determination that was
not immediately clear with lawyers working with the Linux Foundation.

Criteria used to select files for SPDX license identifier tagging were:
 - Files considered eligible had to be source code files.
 - Make and config files were included as candidates if they contained >5
   lines of source.
 - File already had some variant of a license header in it (even if <5
   lines).

All documentation files were explicitly excluded.

The following heuristics were used to determine which SPDX license
identifiers to apply.

 - when both scanners couldn't find any license traces, file was
   considered to have no license information in it, and the top level
   COPYING file license applied.

   For non */uapi/* files that summary was:

   SPDX license identifier                              # files
   ----------------------------------------------------|-------
   GPL-2.0                                                11139

   and resulted in the first patch in this series.

   If that file was a */uapi/* path one, it was "GPL-2.0 WITH
   Linux-syscall-note", otherwise it was "GPL-2.0". The results were:

   SPDX license identifier                              # files
   ----------------------------------------------------|-------
   GPL-2.0 WITH Linux-syscall-note                          930

   and resulted in the second patch in this series.

 - if a file had some form of licensing information in it, and was one
   of the */uapi/* ones, it was denoted with the Linux-syscall-note if
   any GPL family license was found in the file or had no licensing in
   it (per prior point). Results summary:

   SPDX license identifier                              # files
   ----------------------------------------------------|-------
   GPL-2.0 WITH Linux-syscall-note                          270
   GPL-2.0+ WITH Linux-syscall-note                         169
   ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)       21
   ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)       17
   LGPL-2.1+ WITH Linux-syscall-note                         15
   GPL-1.0+ WITH Linux-syscall-note                          14
   ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)       5
   LGPL-2.0+ WITH Linux-syscall-note                          4
   LGPL-2.1 WITH Linux-syscall-note                           3
   ((GPL-2.0 WITH Linux-syscall-note) OR MIT)                 3
   ((GPL-2.0 WITH Linux-syscall-note) AND MIT)                1

   and that resulted in the third patch in this series.

 - when the two scanners agreed on the detected license(s), that became
   the concluded license(s).

 - when there was disagreement between the two scanners (one detected a
   license but the other didn't, or they both detected different
   licenses) a manual inspection of the file occurred.

 - In most cases a manual inspection of the information in the file
   resulted in a clear resolution of the license that should apply (and
   which scanner probably needed to revisit its heuristics).

 - When it was not immediately clear, the license identifier was
   confirmed with lawyers working with the Linux Foundation.

 - If there was any question as to the appropriate license identifier,
   the file was flagged for further research and to be revisited later
   in time.

In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.

Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there were any new insights.
The Windriver scanner is based on an older version of FOSSology in part,
so they are related.

Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with the SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.

In the initial set of patches against 4.14-rc6, 3 files were found to
have copy/paste license identifier errors, and have been fixed to reflect
the correct identifier.

Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
 - a full scancode scan run, collecting the matched texts, detected
   license ids and scores
 - reviewing anything where there was a license detected (about 500+
   files) to ensure that the applied SPDX license was correct
 - reviewing anything where there was no detection but the patch license
   was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
   SPDX license was correct

This produced a worksheet with 20 files needing minor correction. This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.

These .csv files were then reviewed by Greg. Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected. This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types). Finally Greg ran the script using the .csv files to
generate the patches.

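The "different comment types" step can be illustrated with a small sketch. This is not the script Thomas and Greg wrote, which is not shown here. `spdx_line` is a hypothetical helper, and the only rule it encodes is the kernel convention that C source files carry the tag in a `//` comment while C headers use `/* */`. Other file types are simply rejected in this sketch.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not the script described above): format the SPDX
 * tag line for `path` into `out`, picking the comment style from the file
 * suffix. Returns 0 on success, -1 for an unhandled suffix or a buffer
 * that is too small.
 */
int spdx_line(const char *path, const char *id, char *out, size_t outlen)
{
	const char *dot = strrchr(path, '.');
	int n;

	if (!dot)
		return -1;

	if (strcmp(dot, ".c") == 0)
		n = snprintf(out, outlen, "// SPDX-License-Identifier: %s", id);
	else if (strcmp(dot, ".h") == 0)
		n = snprintf(out, outlen, "/* SPDX-License-Identifier: %s */", id);
	else
		return -1;

	/* snprintf reports the length it wanted; treat truncation as failure. */
	return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}
```

Called with "fs/ext4/file.c" and "GPL-2.0", the helper produces the same first line that this file carries.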
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-01 15:07:57 +01:00
// SPDX-License-Identifier: GPL-2.0
2006-10-11 01:20:50 -07:00
/*
2006-10-11 01:20:53 -07:00
 * linux/fs/ext4/file.c
2006-10-11 01:20:50 -07:00
 *
 * Copyright (C) 1992, 1993, 1994, 1995
 * Remy Card (card@masi.ibp.fr)
 * Laboratoire MASI - Institut Blaise Pascal
 * Universite Pierre et Marie Curie (Paris VI)
 *
 * from
 *
 * linux/fs/minix/file.c
 *
 * Copyright (C) 1991, 1992 Linus Torvalds
 *
2006-10-11 01:20:53 -07:00
 * ext4 fs regular file handling primitives
2006-10-11 01:20:50 -07:00
 *
 * 64-bit file support on 64-bit platforms by Jakub Jelinek
 * (jj@sunsite.ms.mff.cuni.cz)
 */

#include <linux/time.h>
#include <linux/fs.h>
2017-10-01 17:58:54 -04:00
#include <linux/iomap.h>
2009-06-13 10:09:48 -04:00
#include <linux/mount.h>
#include <linux/path.h>
2015-09-08 14:58:40 -07:00
#include <linux/dax.h>
2010-03-03 09:05:07 -05:00
#include <linux/quotaops.h>
2012-11-08 21:57:40 -05:00
#include <linux/pagevec.h>
2015-02-22 08:58:50 -08:00
#include <linux/uio.h>
2017-11-01 16:36:45 +01:00
#include <linux/mman.h>
2019-11-05 23:02:39 +11:00
#include <linux/backing-dev.h>
2008-04-29 18:13:32 -04:00
#include "ext4.h"
#include "ext4_jbd2.h"
2006-10-11 01:20:50 -07:00
#include "xattr.h"
#include "acl.h"
2019-11-05 23:01:51 +11:00
#include "truncate.h"
2006-10-11 01:20:50 -07:00

2019-11-05 23:01:37 +11:00
static bool ext4_dio_supported(struct inode *inode)
{
	if (IS_ENABLED(CONFIG_FS_ENCRYPTION) && IS_ENCRYPTED(inode))
		return false;
	if (fsverity_active(inode))
		return false;
	if (ext4_should_journal_data(inode))
		return false;
	if (ext4_has_inline_data(inode))
		return false;
	return true;
}

static ssize_t ext4_dio_read_iter(struct kiocb *iocb, struct iov_iter *to)
{
	ssize_t ret;
	struct inode *inode = file_inode(iocb->ki_filp);

	if (iocb->ki_flags & IOCB_NOWAIT) {
		if (!inode_trylock_shared(inode))
			return -EAGAIN;
	} else {
		inode_lock_shared(inode);
	}

	if (!ext4_dio_supported(inode)) {
		inode_unlock_shared(inode);
		/*
		 * Fallback to buffered I/O if the operation being performed on
		 * the inode is not supported by direct I/O. The IOCB_DIRECT
		 * flag needs to be cleared here in order to ensure that the
		 * direct I/O path within generic_file_read_iter() is not
		 * taken.
		 */
		iocb->ki_flags &= ~IOCB_DIRECT;
		return generic_file_read_iter(iocb, to);
	}

	ret = iomap_dio_rw(iocb, to, &ext4_iomap_ops, NULL,
			   is_sync_kiocb(iocb));
	inode_unlock_shared(inode);

	file_accessed(iocb->ki_filp);
	return ret;
}

2016-11-20 17:36:06 -05:00
#ifdef CONFIG_FS_DAX
static ssize_t ext4_dax_read_iter(struct kiocb *iocb, struct iov_iter *to)
{
	struct inode *inode = file_inode(iocb->ki_filp);
	ssize_t ret;

2017-06-20 07:05:47 -05:00
	if (!inode_trylock_shared(inode)) {
		if (iocb->ki_flags & IOCB_NOWAIT)
			return -EAGAIN;
		inode_lock_shared(inode);
	}
2016-11-20 17:36:06 -05:00
	/*
	 * Recheck under inode lock - at this point we are sure it cannot
	 * change anymore
	 */
	if (!IS_DAX(inode)) {
		inode_unlock_shared(inode);
		/* Fallback to buffered IO in case we cannot support DAX */
		return generic_file_read_iter(iocb, to);
	}
	ret = dax_iomap_rw(iocb, to, &ext4_iomap_ops);
	inode_unlock_shared(inode);

	file_accessed(iocb->ki_filp);
	return ret;
}
#endif

static ssize_t ext4_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
{
2019-11-05 23:01:37 +11:00
	struct inode *inode = file_inode(iocb->ki_filp);

	if (unlikely(ext4_forced_shutdown(EXT4_SB(inode->i_sb))))
2017-02-05 01:28:48 -05:00
		return -EIO;

2016-11-20 17:36:06 -05:00
	if (!iov_iter_count(to))
		return 0; /* skip atime */

#ifdef CONFIG_FS_DAX
2019-11-05 23:01:37 +11:00
	if (IS_DAX(inode))
2016-11-20 17:36:06 -05:00
		return ext4_dax_read_iter(iocb, to);
#endif
2019-11-05 23:01:37 +11:00
	if (iocb->ki_flags & IOCB_DIRECT)
		return ext4_dio_read_iter(iocb, to);

2016-11-20 17:36:06 -05:00
	return generic_file_read_iter(iocb, to);
}

2006-10-11 01:20:50 -07:00
/*
 * Called when an inode is released. Note that this is different
2006-10-11 01:20:53 -07:00
 * from ext4_file_open: open gets called at every open, but release
2006-10-11 01:20:50 -07:00
 * gets called only when /all/ the files are closed.
 */
2008-09-08 22:25:24 -04:00
static int ext4_release_file(struct inode *inode, struct file *filp)
2006-10-11 01:20:50 -07:00
{
2010-01-24 14:34:07 -05:00
	if (ext4_test_inode_state(inode, EXT4_STATE_DA_ALLOC_CLOSE)) {
2009-02-24 08:21:14 -05:00
		ext4_alloc_da_blocks(inode);
2010-01-24 14:34:07 -05:00
		ext4_clear_inode_state(inode, EXT4_STATE_DA_ALLOC_CLOSE);
2009-02-24 08:21:14 -05:00
	}
2006-10-11 01:20:50 -07:00
	/* if we are the last writer on the inode, drop the block reservation */
	if ((filp->f_mode & FMODE_WRITE) &&
2009-03-27 22:36:43 -04:00
			(atomic_read(&inode->i_writecount) == 1) &&
			!EXT4_I(inode)->i_reserved_data_blocks)
2006-10-11 01:20:50 -07:00
	{
2008-01-28 23:58:26 -05:00
		down_write(&EXT4_I(inode)->i_data_sem);
2008-10-10 09:40:52 -04:00
		ext4_discard_preallocations(inode);
2008-01-28 23:58:26 -05:00
		up_write(&EXT4_I(inode)->i_data_sem);
2006-10-11 01:20:50 -07:00
	}
	if (is_dx(inode) && filp->private_data)
2006-10-11 01:20:53 -07:00
		ext4_htree_free_dir_info(filp->private_data);
2006-10-11 01:20:50 -07:00

	return 0;
}

ext4: serialize unaligned asynchronous DIO

ext4 has a data corruption case when doing non-block-aligned
asynchronous direct IO into a sparse file, as demonstrated
by xfstest 240.

The root cause is that while ext4 preallocates space in the
hole, mappings of that space still look "new" and
dio_zero_block() will zero out the unwritten portions. When
more than one AIO thread is going, they both find this "new"
block and race to zero out their portion; this is uncoordinated
and causes data corruption.

Dave Chinner fixed this for xfs by simply serializing all
unaligned asynchronous direct IO. I've done the same here.
The difference is that we only wait on conversions, not all IO.
This is a very big hammer, and I'm not very pleased with
stuffing this into ext4_file_write(). But since ext4 is
DIO_LOCKING, we need to serialize it at this high level.

I tried to move this into ext4_ext_direct_IO, but by then
we have the i_mutex already, and we will wait on the
work queue to do conversions - which must also take the
i_mutex. So that won't work.

This was originally exposed by qemu-kvm installing to
a raw disk image with a normal sector-63 alignment. I've
tested a backport of this patch with qemu, and it does
avoid the corruption. It is also quite a lot slower
(14 min for package installs, vs. 8 min for well-aligned)
but I'll take slow correctness over fast corruption any day.

Mingming suggested that we can track outstanding
conversions, and wait on those so that non-sparse
files won't be affected, and I've implemented that here;
unaligned AIO to nonsparse files won't take a perf hit.

[tytso@mit.edu: Keep the mutex as a hashed array instead
 of bloating the ext4 inode]

[tytso@mit.edu: Fix up namespace issues so that global
 variables are protected with an "ext4_" prefix.]

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-02-12 08:17:34 -05:00
/*
 * This tests whether the IO in question is block-aligned or not.
 * Ext4 utilizes unwritten extents when hole-filling during direct IO, and they
 * are converted to written only after the IO is complete. Until they are
 * mapped, these blocks appear as holes, so dio_zero_block() will assume that
 * it needs to zero out portions of the start and/or end block. If 2 AIO
 * threads are at work on the same unwritten block, they must be synchronized
 * or one thread will zero the other's data, causing corruption.
 */
static int
2014-04-17 16:09:22 -04:00
ext4_unaligned_aio(struct inode *inode, struct iov_iter *from, loff_t pos)
2011-02-12 08:17:34 -05:00
{
	struct super_block *sb = inode->i_sb;
	int blockmask = sb->s_blocksize - 1;

ext4: fix data corruption caused by unaligned direct AIO

Ext4 needs to serialize unaligned direct AIO because the zeroing of
partial blocks of two competing unaligned AIOs can result in data
corruption.

However it decides not to serialize if the potentially unaligned aio is
past i_size with the rationale that no pending writes are possible past
i_size. Unfortunately if the i_size is not block aligned and the second
unaligned write lands past i_size, but still into the same block, it has
the potential of corrupting the previous unaligned write to the same
block.

This is a (very simplified) reproducer from Frank:

    // 41472 = (10 * 4096) + 512
    // 37376 = 41472 - 4096

    ftruncate(fd, 41472);
    io_prep_pwrite(iocbs[0], fd, buf[0], 4096, 37376);
    io_prep_pwrite(iocbs[1], fd, buf[1], 4096, 41472);

    io_submit(io_ctx, 1, &iocbs[0]);
    io_submit(io_ctx, 1, &iocbs[1]);

    io_getevents(io_ctx, 2, 2, events, NULL);

Without this patch the 512B range from 40960 up to the start of the
second unaligned write (41472) is going to be zeroed, overwriting the data
written by the first write. This is a data corruption.

    00000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    *
    00009200 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30
    *
    0000a000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    *
    0000a200 31 31 31 31 31 31 31 31 31 31 31 31 31 31 31 31

With this patch the data corruption is avoided because we will recognize
the unaligned_aio and wait for the unwritten extent conversion.

    00000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    *
    00009200 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30
    *
    0000a200 31 31 31 31 31 31 31 31 31 31 31 31 31 31 31 31
    *
    0000b200

Reported-by: Frank Sorenson <fsorenso@redhat.com>
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Fixes: e9e3bcecf44c ("ext4: serialize unaligned asynchronous DIO")
Cc: stable@vger.kernel.org
2019-03-14 23:20:25 -04:00
	if (pos >= ALIGN(i_size_read(inode), sb->s_blocksize))
2011-02-12 08:17:34 -05:00
		return 0;

2014-04-17 16:09:22 -04:00
	if ((pos | iov_iter_alignment(from)) & blockmask)
2011-02-12 08:17:34 -05:00
		return 1;

	return 0;
}
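The alignment test in ext4_unaligned_aio() can be modeled in user space. The following is a minimal sketch, not kernel code: `needs_serialization` and `round_up_pow2` are hypothetical names, and `align_bits` stands in for the value the kernel obtains from iov_iter_alignment(), assumed here to be the OR of the buffer addresses and lengths. The block size is assumed to be a power of two.

```c
#include <stdbool.h>
#include <stdint.h>

/* Round x up to the next multiple of a; a must be a power of two. */
uint64_t round_up_pow2(uint64_t x, uint64_t a)
{
	return (x + a - 1) & ~(a - 1);
}

/*
 * Hypothetical user-space model of the check. Returns true when a write
 * starting at byte `pos` would have to be serialized against other
 * unaligned asynchronous writes.
 */
bool needs_serialization(uint64_t pos, uint64_t align_bits,
			 uint64_t i_size, uint64_t blocksize)
{
	uint64_t blockmask = blocksize - 1;

	/*
	 * Only a write that starts at or beyond the end of the block
	 * containing EOF is exempt; comparing against the raw i_size is
	 * what the fix replaced.
	 */
	if (pos >= round_up_pow2(i_size, blocksize))
		return false;

	/* Any low bit set in the offset or the buffers means unaligned. */
	return ((pos | align_bits) & blockmask) != 0;
}
```

With the numbers from the reproducer in the commit message (4096-byte blocks, i_size of 41472), a write at offset 41472 is still flagged, because it starts before the rounded-up boundary 45056 and 41472 has low bits set. A write starting at 45056 is not flagged.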
2016-11-20 17:29:51 -05:00
/* Is IO overwriting allocated and initialized blocks? */
static bool ext4_overwrite_io(struct inode *inode, loff_t pos, loff_t len)
{
	struct ext4_map_blocks map;
	unsigned int blkbits = inode->i_blkbits;
	int err, blklen;

	if (pos + len > i_size_read(inode))
		return false;

	map.m_lblk = pos >> blkbits;
	map.m_len = EXT4_MAX_BLOCKS(len, pos, blkbits);
	blklen = map.m_len;

	err = ext4_map_blocks(NULL, inode, &map, 0);
	/*
	 * 'err==len' means that all of the blocks have been preallocated,
	 * regardless of whether they have been initialized or not. To exclude
	 * unwritten extents, we need to check m_flags.
	 */
	return err == blklen && (map.m_flags & EXT4_MAP_MAPPED);
}
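The block arithmetic that feeds map.m_lblk and map.m_len can be shown on its own. This is a sketch under the assumption that EXT4_MAX_BLOCKS(len, pos, blkbits) yields the number of blocks touched by the byte range, that is, the end of the range rounded up to a block boundary minus the first block index. `blocks_spanned` is a hypothetical name, not an ext4 function.

```c
#include <stdint.h>

/*
 * Hypothetical helper: number of filesystem blocks touched by the byte
 * range [pos, pos + len), for a block size of (1 << blkbits) bytes.
 */
uint64_t blocks_spanned(uint64_t pos, uint64_t len, unsigned int blkbits)
{
	uint64_t blocksize = 1ULL << blkbits;
	uint64_t first = pos >> blkbits;			/* first block index */
	uint64_t end = (pos + len + blocksize - 1) >> blkbits;	/* one past the last */

	return end - first;
}
```

A 4096-byte write at offset 37376 with 4 KiB blocks starts in block 9 and ends in block 10, so it spans two blocks even though its length is exactly one block. ext4_overwrite_io() only reports an overwrite when the mapping call covers that whole count and the extent is flagged as mapped.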
static ssize_t ext4_write_checks(struct kiocb *iocb, struct iov_iter *from)
{
	struct inode *inode = file_inode(iocb->ki_filp);
	ssize_t ret;

2019-11-05 23:02:39 +11:00
	if (unlikely(IS_IMMUTABLE(inode)))
		return -EPERM;

2016-11-20 17:29:51 -05:00
	ret = generic_write_checks(iocb, from);
	if (ret <= 0)
		return ret;
2019-06-09 22:04:33 -04:00

2016-11-20 17:29:51 -05:00
	/*
	 * If we have encountered a bitmap-format file, the size limit
	 * is smaller than s_maxbytes, which is for extent-mapped files.
	 */
	if (!(ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))) {
		struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);

		if (iocb->ki_pos >= sbi->s_bitmap_maxbytes)
			return -EFBIG;
		iov_iter_truncate(from, sbi->s_bitmap_maxbytes - iocb->ki_pos);
	}
2019-11-05 23:02:39 +11:00

	ret = file_modified(iocb->ki_filp);
	if (ret)
		return ret;

2016-11-20 17:29:51 -05:00
	return iov_iter_count(from);
}

2019-11-05 23:02:39 +11:00
static ssize_t ext4_buffered_write_iter(struct kiocb *iocb,
					struct iov_iter *from)
{
	ssize_t ret;
	struct inode *inode = file_inode(iocb->ki_filp);

	if (iocb->ki_flags & IOCB_NOWAIT)
		return -EOPNOTSUPP;

	inode_lock(inode);
	ret = ext4_write_checks(iocb, from);
	if (ret <= 0)
		goto out;

	current->backing_dev_info = inode_to_bdi(inode);
	ret = generic_perform_write(iocb->ki_filp, from, iocb->ki_pos);
	current->backing_dev_info = NULL;

out:
	inode_unlock(inode);
	if (likely(ret > 0)) {
		iocb->ki_pos += ret;
		ret = generic_write_sync(iocb, ret);
	}

	return ret;
}

2019-11-05 23:01:51 +11:00
static ssize_t ext4_handle_inode_extension(struct inode *inode, loff_t offset,
					   ssize_t written, size_t count)
{
	handle_t *handle;
	bool truncate = false;
	u8 blkbits = inode->i_blkbits;
	ext4_lblk_t written_blk, end_blk;

	/*
	 * Note that EXT4_I(inode)->i_disksize can get extended up to
	 * inode->i_size while the I/O was running due to writeback of delalloc
	 * blocks. But, the code in ext4_iomap_alloc() is careful to use
	 * zeroed/unwritten extents if this is possible; thus we won't leave
	 * uninitialized blocks in a file even if we didn't succeed in writing
	 * as much as we intended.
	 */
	WARN_ON_ONCE(i_size_read(inode) < EXT4_I(inode)->i_disksize);
	if (offset + count <= EXT4_I(inode)->i_disksize) {
		/*
		 * We need to ensure that the inode is removed from the orphan
		 * list if it has been added prematurely, due to writeback of
		 * delalloc blocks.
		 */
		if (!list_empty(&EXT4_I(inode)->i_orphan) && inode->i_nlink) {
			handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);

			if (IS_ERR(handle)) {
				ext4_orphan_del(NULL, inode);
				return PTR_ERR(handle);
			}

			ext4_orphan_del(handle, inode);
			ext4_journal_stop(handle);
		}

		return written;
	}

	if (written < 0)
		goto truncate;

	handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
	if (IS_ERR(handle)) {
		written = PTR_ERR(handle);
		goto truncate;
	}

	if (ext4_update_inode_size(inode, offset + written))
		ext4_mark_inode_dirty(handle, inode);

	/*
	 * We may need to truncate allocated but not written blocks beyond EOF.
	 */
	written_blk = ALIGN(offset + written, 1 << blkbits);
	end_blk = ALIGN(offset + count, 1 << blkbits);
	if (written_blk < end_blk && ext4_can_truncate(inode))
		truncate = true;

	/*
	 * Remove the inode from the orphan list if it has been extended and
	 * everything went OK.
	 */
	if (!truncate && inode->i_nlink)
		ext4_orphan_del(handle, inode);
	ext4_journal_stop(handle);

	if (truncate) {
truncate:
		ext4_truncate_failed_write(inode);
		/*
		 * If the truncate operation failed early, then the inode may
		 * still be on the orphan list. In that case, we need to try
		 * remove the inode from the in-memory linked list.
		 */
		if (inode->i_nlink)
			ext4_orphan_del(NULL, inode);
	}

	return written;
}
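The short-write test in ext4_handle_inode_extension() compares two rounded-up byte positions. Below is a minimal user-space sketch of just that comparison. `round_up_block` and `short_write_left_blocks` are hypothetical names, and the sketch leaves out the additional ext4_can_truncate() condition that the kernel code applies.

```c
#include <stdbool.h>
#include <stdint.h>

/* Round a byte position up to the next block boundary. */
uint64_t round_up_block(uint64_t pos, unsigned int blkbits)
{
	uint64_t blocksize = 1ULL << blkbits;

	return (pos + blocksize - 1) & ~(blocksize - 1);
}

/*
 * Hypothetical helper: true when a write that asked for `count` bytes at
 * `offset` but completed only `written` bytes stopped in an earlier block
 * than the one the request would have ended in, so whole blocks beyond
 * the written data may have been allocated for nothing.
 */
bool short_write_left_blocks(uint64_t offset, uint64_t written,
			     uint64_t count, unsigned int blkbits)
{
	return round_up_block(offset + written, blkbits) <
	       round_up_block(offset + count, blkbits);
}
```

With 4 KiB blocks, a request for 8192 bytes at offset 0 that completes only 4096 bytes leaves one full block past the written data, so the helper returns true. If 5000 bytes complete, both positions round up to 8192 and nothing needs trimming.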
2019-11-05 23:02:39 +11:00
static int ext4_dio_write_end_io(struct kiocb *iocb, ssize_t size,
				 int error, unsigned int flags)
{
	loff_t offset = iocb->ki_pos;
	struct inode *inode = file_inode(iocb->ki_filp);

	if (error)
		return error;

	if (size && flags & IOMAP_DIO_UNWRITTEN)
		return ext4_convert_unwritten_extents(NULL, inode,
						      offset, size);

	return 0;
}

static const struct iomap_dio_ops ext4_dio_write_ops = {
	.end_io = ext4_dio_write_end_io,
};

static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
|
|
|
|
{
|
|
|
|
ssize_t ret;
|
|
|
|
size_t count;
|
|
|
|
loff_t offset;
|
|
|
|
handle_t *handle;
|
|
|
|
struct inode *inode = file_inode(iocb->ki_filp);
|
|
|
|
bool extend = false, overwrite = false, unaligned_aio = false;
|
|
|
|
|
|
|
|
if (iocb->ki_flags & IOCB_NOWAIT) {
|
|
|
|
if (!inode_trylock(inode))
|
|
|
|
return -EAGAIN;
|
|
|
|
} else {
|
|
|
|
inode_lock(inode);
|
|
|
|
}
|
|
|
|
|
|
|
|
if (!ext4_dio_supported(inode)) {
|
|
|
|
inode_unlock(inode);
|
|
|
|
/*
|
|
|
|
* Fallback to buffered I/O if the inode does not support
|
|
|
|
* direct I/O.
|
|
|
|
*/
|
|
|
|
return ext4_buffered_write_iter(iocb, from);
|
|
|
|
}
|
|
|
|
|
|
|
|
ret = ext4_write_checks(iocb, from);
|
|
|
|
if (ret <= 0) {
|
|
|
|
inode_unlock(inode);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Unaligned asynchronous direct I/O must be serialized among each
|
|
|
|
* other as the zeroing of partial blocks of two competing unaligned
|
|
|
|
* asynchronous direct I/O writes can result in data corruption.
|
|
|
|
*/
|
|
|
|
offset = iocb->ki_pos;
|
|
|
|
count = iov_iter_count(from);
|
|
|
|
if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS) &&
|
|
|
|
!is_sync_kiocb(iocb) && ext4_unaligned_aio(inode, from, offset)) {
|
|
|
|
unaligned_aio = true;
|
|
|
|
inode_dio_wait(inode);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Determine whether the I/O will overwrite allocated and initialized
|
|
|
|
* blocks. If so, check to see whether it is possible to take the
|
|
|
|
* dioread_nolock path.
|
|
|
|
*/
|
|
|
|
if (!unaligned_aio && ext4_overwrite_io(inode, offset, count) &&
|
|
|
|
ext4_should_dioread_nolock(inode)) {
|
|
|
|
overwrite = true;
|
|
|
|
downgrade_write(&inode->i_rwsem);
|
|
|
|
}
|
|
|
|
|
|
|
|
if (offset + count > EXT4_I(inode)->i_disksize) {
|
|
|
|
handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
|
|
|
|
if (IS_ERR(handle)) {
|
|
|
|
ret = PTR_ERR(handle);
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
|
|
|
ret = ext4_orphan_add(handle, inode);
|
|
|
|
if (ret) {
|
|
|
|
ext4_journal_stop(handle);
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
|
|
|
extend = true;
|
|
|
|
ext4_journal_stop(handle);
|
|
|
|
}
|
|
|
|
|
|
|
|
ret = iomap_dio_rw(iocb, from, &ext4_iomap_ops, &ext4_dio_write_ops,
|
|
|
|
is_sync_kiocb(iocb) || unaligned_aio || extend);
|
|
|
|
|
|
|
|
if (extend)
|
|
|
|
ret = ext4_handle_inode_extension(inode, offset, ret, count);
|
|
|
|
|
|
|
|
out:
|
|
|
|
if (overwrite)
|
|
|
|
inode_unlock_shared(inode);
|
|
|
|
else
|
|
|
|
inode_unlock(inode);
|
|
|
|
|
|
|
|
if (ret >= 0 && iov_iter_count(from)) {
|
|
|
|
ssize_t err;
|
|
|
|
loff_t endbyte;
|
|
|
|
|
|
|
|
offset = iocb->ki_pos;
|
|
|
|
err = ext4_buffered_write_iter(iocb, from);
|
|
|
|
if (err < 0)
|
|
|
|
return err;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* We need to ensure that the pages within the page cache for
|
|
|
|
* the range covered by this I/O are written to disk and
|
|
|
|
* invalidated. This is in attempt to preserve the expected
|
|
|
|
* direct I/O semantics in the case we fallback to buffered I/O
|
|
|
|
* to complete off the I/O request.
|
|
|
|
*/
|
|
|
|
ret += err;
|
|
|
|
endbyte = offset + err - 1;
|
|
|
|
err = filemap_write_and_wait_range(iocb->ki_filp->f_mapping,
|
|
|
|
offset, endbyte);
|
|
|
|
if (!err)
|
|
|
|
invalidate_mapping_pages(iocb->ki_filp->f_mapping,
|
|
|
|
offset >> PAGE_SHIFT,
|
|
|
|
endbyte >> PAGE_SHIFT);
|
|
|
|
}
|
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
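
The buffered fallback at the end of ext4_dio_write_iter() converts the byte range it just wrote into an inclusive range of page indices before flushing and invalidating the page cache. The arithmetic can be sketched in userspace roughly as follows. This is an illustration only, not kernel code: `SKETCH_PAGE_SHIFT`, `struct page_range` and `byte_range_to_pages()` are hypothetical names, and a 4 KiB page size is assumed.

```c
#include <stdint.h>

/* Assumption: 4 KiB pages. The kernel uses the architecture's PAGE_SHIFT. */
#define SKETCH_PAGE_SHIFT 12

/* Hypothetical helper type: an inclusive range of page indices. */
struct page_range {
	uint64_t first;	/* index of the first page touched */
	uint64_t last;	/* index of the last page touched (inclusive) */
};

/*
 * Map the bytes [offset, offset + written) to the inclusive range of page
 * indices they touch. The caller must pass written > 0.
 */
static struct page_range byte_range_to_pages(uint64_t offset, uint64_t written)
{
	uint64_t endbyte = offset + written - 1;	/* last byte written */
	struct page_range r = {
		.first = offset >> SKETCH_PAGE_SHIFT,
		.last = endbyte >> SKETCH_PAGE_SHIFT,
	};

	return r;
}
```

Subtracting one before shifting is what keeps a write that ends exactly on a page boundary from pulling the following page into the range.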

#ifdef CONFIG_FS_DAX
static ssize_t
ext4_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
{
	ssize_t ret;
	size_t count;
	loff_t offset;
	handle_t *handle;
	bool extend = false;
	struct inode *inode = file_inode(iocb->ki_filp);

	if (!inode_trylock(inode)) {
		if (iocb->ki_flags & IOCB_NOWAIT)
			return -EAGAIN;
		inode_lock(inode);
	}

	ret = ext4_write_checks(iocb, from);
	if (ret <= 0)
		goto out;

	offset = iocb->ki_pos;
	count = iov_iter_count(from);

	if (offset + count > EXT4_I(inode)->i_disksize) {
		handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
		if (IS_ERR(handle)) {
			ret = PTR_ERR(handle);
			goto out;
		}

		ret = ext4_orphan_add(handle, inode);
		if (ret) {
			ext4_journal_stop(handle);
			goto out;
		}

		extend = true;
		ext4_journal_stop(handle);
	}

	ret = dax_iomap_rw(iocb, from, &ext4_iomap_ops);

	if (extend)
		ret = ext4_handle_inode_extension(inode, offset, ret, count);
out:
	inode_unlock(inode);
	if (ret > 0)
		ret = generic_write_sync(iocb, ret);
	return ret;
}
#endif
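
Both write paths above honour IOCB_NOWAIT in the same way: try the inode lock once, and return -EAGAIN rather than waiting if it is contended and the caller asked not to block. A userspace analogy of that pattern is sketched below with a C11 `atomic_flag`. All `sketch_*` names are hypothetical, and the busy-wait stands in for the kernel's sleeping inode_lock(); it is only meant to show the control flow.

```c
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the inode lock: a simple test-and-set flag. */
struct sketch_lock {
	atomic_flag held;
};

/* Try to take the lock once; returns true on success. */
static bool sketch_trylock(struct sketch_lock *l)
{
	return !atomic_flag_test_and_set_explicit(&l->held,
						  memory_order_acquire);
}

/* Release a lock previously taken with sketch_trylock(). */
static void sketch_unlock(struct sketch_lock *l)
{
	atomic_flag_clear_explicit(&l->held, memory_order_release);
}

/*
 * Acquire the lock for a write. With nowait set, fail with -EAGAIN instead
 * of waiting; otherwise spin until the lock is free (the kernel would sleep).
 */
static int sketch_lock_for_write(struct sketch_lock *l, bool nowait)
{
	if (!sketch_trylock(l)) {
		if (nowait)
			return -EAGAIN;
		while (!sketch_trylock(l))
			;	/* busy-wait; illustration only */
	}
	return 0;
}
```

Trying the lock first and consulting the flag only on contention, as ext4_dax_write_iter() does, gives the same result as checking the flag up front, as ext4_dio_write_iter() does.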

static ssize_t
ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
{
	struct inode *inode = file_inode(iocb->ki_filp);

	if (unlikely(ext4_forced_shutdown(EXT4_SB(inode->i_sb))))
		return -EIO;

#ifdef CONFIG_FS_DAX
	if (IS_DAX(inode))
		return ext4_dax_write_iter(iocb, from);
#endif
	if (iocb->ki_flags & IOCB_DIRECT)
		return ext4_dio_write_iter(iocb, from);

	return ext4_buffered_write_iter(iocb, from);
}

#ifdef CONFIG_FS_DAX
static vm_fault_t ext4_dax_huge_fault(struct vm_fault *vmf,
		enum page_entry_size pe_size)
{
	int error = 0;
	vm_fault_t result;
	int retries = 0;
	handle_t *handle = NULL;
	struct inode *inode = file_inode(vmf->vma->vm_file);
	struct super_block *sb = inode->i_sb;

	/*
	 * We have to distinguish real writes from writes which will result in a
	 * COW page; COW writes should *not* poke the journal (the file will not
	 * be changed). Doing so would cause unintended failures when mounted
	 * read-only.
	 *
	 * We check for VM_SHARED rather than vmf->cow_page since the latter is
	 * unset for pe_size != PE_SIZE_PTE (i.e. only in do_cow_fault); for
	 * other sizes, dax_iomap_fault will handle splitting / fallback so that
	 * we eventually come back with a COW page.
	 */
	bool write = (vmf->flags & FAULT_FLAG_WRITE) &&
		(vmf->vma->vm_flags & VM_SHARED);
	pfn_t pfn;

	if (write) {
		sb_start_pagefault(sb);
		file_update_time(vmf->vma->vm_file);
		down_read(&EXT4_I(inode)->i_mmap_sem);
retry:
		handle = ext4_journal_start_sb(sb, EXT4_HT_WRITE_PAGE,
					       EXT4_DATA_TRANS_BLOCKS(sb));
		if (IS_ERR(handle)) {
			up_read(&EXT4_I(inode)->i_mmap_sem);
			sb_end_pagefault(sb);
			return VM_FAULT_SIGBUS;
		}
	} else {
		down_read(&EXT4_I(inode)->i_mmap_sem);
	}
	result = dax_iomap_fault(vmf, pe_size, &pfn, &error, &ext4_iomap_ops);
	if (write) {
		ext4_journal_stop(handle);

		if ((result & VM_FAULT_ERROR) && error == -ENOSPC &&
		    ext4_should_retry_alloc(sb, &retries))
			goto retry;
		/* Handling synchronous page fault? */
		if (result & VM_FAULT_NEEDDSYNC)
			result = dax_finish_sync_fault(vmf, pe_size, pfn);
		up_read(&EXT4_I(inode)->i_mmap_sem);
		sb_end_pagefault(sb);
	} else {
		up_read(&EXT4_I(inode)->i_mmap_sem);
	}

	return result;
}

static vm_fault_t ext4_dax_fault(struct vm_fault *vmf)
{
	return ext4_dax_huge_fault(vmf, PE_SIZE_PTE);
}

static const struct vm_operations_struct ext4_dax_vm_ops = {
	.fault		= ext4_dax_fault,
	.huge_fault	= ext4_dax_huge_fault,
	.page_mkwrite	= ext4_dax_fault,
	.pfn_mkwrite	= ext4_dax_fault,
};
#else
#define ext4_dax_vm_ops	ext4_file_vm_ops
#endif

static const struct vm_operations_struct ext4_file_vm_ops = {
	.fault		= ext4_filemap_fault,
	.map_pages	= filemap_map_pages,
	.page_mkwrite   = ext4_page_mkwrite,
};

static int ext4_file_mmap(struct file *file, struct vm_area_struct *vma)
{
	struct inode *inode = file->f_mapping->host;
	struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
	struct dax_device *dax_dev = sbi->s_daxdev;

	if (unlikely(ext4_forced_shutdown(sbi)))
		return -EIO;

	/*
	 * We don't support synchronous mappings for non-DAX files and
	 * for DAX files if underneath dax_device is not synchronous.
	 */
	if (!daxdev_mapping_supported(vma, dax_dev))
		return -EOPNOTSUPP;

	file_accessed(file);
	if (IS_DAX(file_inode(file))) {
		vma->vm_ops = &ext4_dax_vm_ops;
		vma->vm_flags |= VM_HUGEPAGE;
	} else {
		vma->vm_ops = &ext4_file_vm_ops;
	}
	return 0;
}

static int ext4_sample_last_mounted(struct super_block *sb,
				    struct vfsmount *mnt)
{
	struct ext4_sb_info *sbi = EXT4_SB(sb);
	struct path path;
	char buf[64], *cp;
	handle_t *handle;
	int err;

	if (likely(sbi->s_mount_flags & EXT4_MF_MNTDIR_SAMPLED))
		return 0;

	if (sb_rdonly(sb) || !sb_start_intwrite_trylock(sb))
		return 0;

	sbi->s_mount_flags |= EXT4_MF_MNTDIR_SAMPLED;
	/*
	 * Sample where the filesystem has been mounted and
	 * store it in the superblock for sysadmin convenience
	 * when trying to sort through large numbers of block
	 * devices or filesystem images.
	 */
	memset(buf, 0, sizeof(buf));
	path.mnt = mnt;
	path.dentry = mnt->mnt_root;
	cp = d_path(&path, buf, sizeof(buf));
	err = 0;
	if (IS_ERR(cp))
		goto out;

	handle = ext4_journal_start_sb(sb, EXT4_HT_MISC, 1);
	err = PTR_ERR(handle);
	if (IS_ERR(handle))
		goto out;
	BUFFER_TRACE(sbi->s_sbh, "get_write_access");
	err = ext4_journal_get_write_access(handle, sbi->s_sbh);
	if (err)
		goto out_journal;
	strlcpy(sbi->s_es->s_last_mounted, cp,
		sizeof(sbi->s_es->s_last_mounted));
	ext4_handle_dirty_super(handle, sb);
out_journal:
	ext4_journal_stop(handle);
out:
	sb_end_intwrite(sb);
	return err;
}

static int ext4_file_open(struct inode * inode, struct file * filp)
{
	int ret;

	if (unlikely(ext4_forced_shutdown(EXT4_SB(inode->i_sb))))
		return -EIO;

	ret = ext4_sample_last_mounted(inode->i_sb, filp->f_path.mnt);
	if (ret)
		return ret;

	ret = fscrypt_file_open(inode, filp);
	if (ret)
		return ret;

	ret = fsverity_file_open(inode, filp);
	if (ret)
		return ret;

	/*
	 * Set up the jbd2_inode if we are opening the inode for
	 * writing and the journal is present
	 */
	if (filp->f_mode & FMODE_WRITE) {
		ret = ext4_inode_attach_jinode(inode);
		if (ret < 0)
			return ret;
	}

	filp->f_mode |= FMODE_NOWAIT;
	return dquot_file_open(inode, filp);
}

/*
 * ext4_llseek() handles both block-mapped and extent-mapped maxbytes values
 * by calling generic_file_llseek_size() with the appropriate maxbytes
 * value for each.
 */
loff_t ext4_llseek(struct file *file, loff_t offset, int whence)
{
	struct inode *inode = file->f_mapping->host;
	loff_t maxbytes;

	if (!(ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)))
		maxbytes = EXT4_SB(inode->i_sb)->s_bitmap_maxbytes;
	else
		maxbytes = inode->i_sb->s_maxbytes;

	switch (whence) {
	default:
		return generic_file_llseek_size(file, offset, whence,
						maxbytes, i_size_read(inode));
	case SEEK_HOLE:
		inode_lock_shared(inode);
		offset = iomap_seek_hole(inode, offset,
					 &ext4_iomap_report_ops);
		inode_unlock_shared(inode);
		break;
	case SEEK_DATA:
		inode_lock_shared(inode);
		offset = iomap_seek_data(inode, offset,
					 &ext4_iomap_report_ops);
		inode_unlock_shared(inode);
		break;
	}

	if (offset < 0)
		return offset;
	return vfs_setpos(file, offset, maxbytes);
}

const struct file_operations ext4_file_operations = {
	.llseek		= ext4_llseek,
	.read_iter	= ext4_file_read_iter,
	.write_iter	= ext4_file_write_iter,
	.unlocked_ioctl = ext4_ioctl,
#ifdef CONFIG_COMPAT
	.compat_ioctl	= ext4_compat_ioctl,
#endif
	.mmap		= ext4_file_mmap,
	.mmap_supported_flags = MAP_SYNC,
	.open		= ext4_file_open,
	.release	= ext4_release_file,
	.fsync		= ext4_sync_file,
	.get_unmapped_area = thp_get_unmapped_area,
	.splice_read	= generic_file_splice_read,
	.splice_write	= iter_file_splice_write,
	.fallocate	= ext4_fallocate,
};

const struct inode_operations ext4_file_inode_operations = {
	.setattr	= ext4_setattr,
	.getattr	= ext4_file_getattr,
	.listxattr	= ext4_listxattr,
	.get_acl	= ext4_get_acl,
	.set_acl	= ext4_set_acl,
	.fiemap		= ext4_fiemap,
};