3010 Commits

Author SHA1 Message Date
Jan Kara
16c5468859 ext4: Allow parallel DIO reads
We can easily support parallel direct IO reads. We only have to make
sure we cannot expose uninitialized data by reading allocated block to
which data was not written yet, or which was already truncated. That is
easily achieved by holding inode_lock in shared mode - that excludes all
writes, truncates, hole punches. We also have to guard against page
writeback allocating blocks for delay-allocated pages - that race is
handled by the fact that we writeback all the pages in the affected
range and the lock protects us from new pages being created there.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-09-30 01:03:17 -04:00
Miklos Szeredi
2773bf00ae fs: rename "rename2" i_op to "rename"
Generated patch:

sed -i "s/\.rename2\t/\.rename\t\t/" `git grep -wl rename2`
sed -i "s/\brename2\b/rename/g" `git grep -wl rename2`

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2016-09-27 11:03:58 +02:00
Ross Zwisler
cca32b7eeb ext4: allow DAX writeback for hole punch
Currently when doing a DAX hole punch with ext4 we fail to do a writeback.
This is because the logic around filemap_write_and_wait_range() in
ext4_punch_hole() only looks for dirty page cache pages in the radix tree,
not for dirty DAX exceptional entries.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-09-22 11:49:38 -04:00
Jan Kara
31051c85b5 fs: Give dentry to inode_change_ok() instead of inode
inode_change_ok() will be resposible for clearing capabilities and IMA
extended attributes and as such will need dentry. Give it as an argument
to inode_change_ok() instead of an inode. Also rename inode_change_ok()
to setattr_prepare() to better relect that it does also some
modifications in addition to checks.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
2016-09-22 10:56:19 +02:00
Jan Kara
073931017b posix_acl: Clear SGID bit when setting file permissions
When file permissions are modified via chmod(2) and the user is not in
the owning group or capable of CAP_FSETID, the setgid bit is cleared in
inode_change_ok().  Setting a POSIX ACL via setxattr(2) sets the file
permissions as well as the new ACL, but doesn't clear the setgid bit in
a similar way; this allows to bypass the check in chmod(2).  Fix that.

References: CVE-2016-7097
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2016-09-22 10:55:32 +02:00
Eric Biggers
ef1eb3aa50 fscrypto: make filename crypto functions return 0 on success
Several filename crypto functions: fname_decrypt(),
fscrypt_fname_disk_to_usr(), and fscrypt_fname_usr_to_disk(), returned
the output length on success or -errno on failure.  However, the output
length was redundant with the value written to 'oname->len'.  It is also
potentially error-prone to make callers have to check for '< 0' instead
of '!= 0'.

Therefore, make these functions return 0 instead of a length, and make
the callers who cared about the return value being a length use
'oname->len' instead.  For consistency also make other callers check for
a nonzero result rather than a negative result.

This change also fixes the inconsistency of fname_encrypt() actually
already returning 0 on success, not a length like the other filename
crypto functions and as documented in its function comment.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Acked-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-09-15 17:25:55 -04:00
Eric Biggers
dcce7a46c6 ext4: fix memory leak when symlink decryption fails
This bug was introduced in v4.8-rc1.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2016-09-15 13:13:13 -04:00
Fabian Frederick
be32197cd6 ext4: remove unused definition for MAX_32_NUM
MAX_32_NUM isn't used in ext4

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-09-15 11:58:47 -04:00
Fabian Frederick
518eaa6387 ext4: create EXT4_MAX_BLOCKS() macro
Create a macro to calculate length + offset -> maximum blocks
This adds more readability.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-09-15 11:55:01 -04:00
Fabian Frederick
c3fe493ccd ext4: remove unneeded test in ext4_alloc_file_blocks()
ext4_alloc_file_blocks() is called from ext4_zero_range() and
ext4_fallocate() both already testing EXT4_INODE_EXTENTS
We can call ext_depth(inode) unconditionnally.

[ Added BUG_ON check to make sure ext4_alloc_file_blocks() won't get
  called for a indirect-mapped inode in the future.  -- tytso ]

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-09-15 11:52:07 -04:00
Fabian Frederick
edf15aa180 ext4: fix memory leak in ext4_insert_range()
Running xfstests generic/013 with kmemleak gives the following:

unreferenced object 0xffff8801d3d27de0 (size 96):
  comm "fsstress", pid 4941, jiffies 4294860168 (age 53.485s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 01 00 00 00 00 00  ................
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<ffffffff818eaaf3>] kmemleak_alloc+0x23/0x40
    [<ffffffff81179805>] __kmalloc+0xf5/0x1d0
    [<ffffffff8122ef5c>] ext4_find_extent+0x1ec/0x2f0
    [<ffffffff8123530c>] ext4_insert_range+0x34c/0x4a0
    [<ffffffff81235942>] ext4_fallocate+0x4e2/0x8b0
    [<ffffffff81181334>] vfs_fallocate+0x134/0x210
    [<ffffffff8118203f>] SyS_fallocate+0x3f/0x60
    [<ffffffff818efa9b>] entry_SYSCALL_64_fastpath+0x13/0x8f
    [<ffffffffffffffff>] 0xffffffffffffffff

Problem seems mitigated by dropping refs and freeing path
when there's no path[depth].p_ext

Cc: stable@vger.kernel.org
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-09-15 11:39:52 -04:00
wangguang
4e800c0359 ext4: bugfix for mmaped pages in mpage_release_unused_pages()
Pages clear buffers after ext4 delayed block allocation failed,
However, it does not clean its pte_dirty flag.
if the pages unmap ,in cording to the pte_dirty ,
unmap_page_range may try to call __set_page_dirty,

which may lead to the bugon at 
mpage_prepare_extent_to_map:head = page_buffers(page);.

This patch just call clear_page_dirty_for_io to clean pte_dirty 
at mpage_release_unused_pages for pages mmaped. 

Steps to reproduce the bug:

(1) mmap a file in ext4
	addr = (char *)mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_SHARED,
	       	            fd, 0);
	memset(addr, 'i', 4096);

(2) return EIO at 

	ext4_writepages->mpage_map_and_submit_extent->mpage_map_one_extent 

which causes this log message to be print:

                ext4_msg(sb, KERN_CRIT,
                        "Delayed block allocation failed for "
                        "inode %lu at logical offset %llu with"
                        " max blocks %u with error %d",
                        inode->i_ino,
                        (unsigned long long)map->m_lblk,
                        (unsigned)map->m_len, -err);

(3)Unmap the addr cause warning at

	__set_page_dirty:WARN_ON_ONCE(warn && !PageUptodate(page));

(4) wait for a minute,then bugon happen.

Cc: stable@vger.kernel.org
Signed-off-by: wangguang <wangguang03@zte.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-09-15 11:32:46 -04:00
Eric Biggers
ba63f23d69 fscrypto: require write access to mount to set encryption policy
Since setting an encryption policy requires writing metadata to the
filesystem, it should be guarded by mnt_want_write/mnt_drop_write.
Otherwise, a user could cause a write to a frozen or readonly
filesystem.  This was handled correctly by f2fs but not by ext4.  Make
fscrypt_process_policy() handle it rather than relying on the filesystem
to get it right.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Cc: stable@vger.kernel.org # 4.1+; check fs/{ext4,f2fs}
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Acked-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-09-10 01:18:57 -04:00
Dmitry Monakhov
e22834f024 ext4: improve ext4lazyinit scalability
ext4lazyinit is a global thread. This thread performs itable
initalization under li_list_mtx mutex.

It basically does the following:
ext4_lazyinit_thread
  ->mutex_lock(&eli->li_list_mtx);
  ->ext4_run_li_request(elr)
    ->ext4_init_inode_table-> Do a lot of IO if the list is large

And when new mount/umount arrive they have to block on ->li_list_mtx
because  lazy_thread holds it during full walk procedure.
ext4_fill_super
 ->ext4_register_li_request
   ->mutex_lock(&ext4_li_info->li_list_mtx);
   ->list_add(&elr->lr_request, &ext4_li_info >li_request_list);
In my case mount takes 40minutes on server with 36 * 4Tb HDD.
Common user may face this in case of very slow dev ( /dev/mmcblkXXX)
Even more. If one of filesystems was frozen lazyinit_thread will simply
block on sb_start_write() so other mount/umount will be stuck forever.

This patch changes logic like follows:
- grab ->s_umount read sem before processing new li_request.
  After that it is safe to drop li_list_mtx because all callers of
  li_remove_request are holding ->s_umount for write.
- li_thread skips frozen SB's

Locking order:
Mh KOrder is asserted by umount path like follows: s_umount ->li_list_mtx so
the only way to to grab ->s_mount inside li_thread is via down_read_trylock

xfstests:ext4/023
#PSBM-49658

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-09-05 23:38:36 -04:00
Jan Kara
6ae4c5a698 ext4: cleanup ext4_sync_parent()
A condition !hlist_empty(&inode->i_dentry) is always true for open file.
Just remove it. Also ext4_sync_parent() could use some explanation why
races with rmdir() are not an issue - add a comment explaining that.

Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-09-05 23:21:43 -04:00
Kaho Ng
0b7b77791c ext4: remove old feature helpers
Use the ext4_{has,set,clear}_feature_* helpers to replace the old
feature helpers.

Signed-off-by: Kaho Ng <ngkaho1234@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
2016-09-05 23:11:58 -04:00
Jan Kara
49da939272 ext4: enable quota enforcement based on mount options
When quota information is stored in quota files, we enable only quota
accounting on mount and enforcement is enabled only in response to
Q_QUOTAON quotactl. To make ext4 behavior consistent with XFS, we add a
possibility to enable quota enforcement on mount by specifying
corresponding quota mount option (usrquota, grpquota, prjquota).

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-09-05 23:08:16 -04:00
Daeho Jeong
93e3b4e663 ext4: reinforce check of i_dtime when clearing high fields of uid and gid
Now, ext4_do_update_inode() clears high 16-bit fields of uid/gid
of deleted and evicted inode to fix up interoperability with old
kernels. However, it checks only i_dtime of an inode to determine
whether the inode was deleted and evicted, and this is very risky,
because i_dtime can be used for the pointer maintaining orphan inode
list, too. We need to further check whether the i_dtime is being
used for the orphan inode list even if the i_dtime is not NULL.

We found that high 16-bit fields of uid/gid of inode are unintentionally
and permanently cleared when the inode truncation is just triggered,
but not finished, and the inode metadata, whose high uid/gid bits are
cleared, is written on disk, and the sudden power-off follows that
in order.

Cc: stable@vger.kernel.org
Signed-off-by: Daeho Jeong <daeho.jeong@samsung.com>
Signed-off-by: Hobin Woo <hobin.woo@samsung.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-09-05 22:56:10 -04:00
Eric Whitney
14fbd4aa61 ext4: enforce online defrag restriction for encrypted files
Online defragging of encrypted files is not currently implemented.
However, the move extent ioctl can still return successfully when
called.  For example, this occurs when xfstest ext4/020 is run on an
encrypted file system, resulting in a corrupted test file and a
corresponding test failure.

Until the proper functionality is implemented, fail the move extent
ioctl if either the original or donor file is encrypted.

Cc: stable@vger.kernel.org
Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-08-29 15:45:11 -04:00
Jan Kara
dfa2064b22 ext4: factor out loop for freeing inode xattr space
Move loop to make enough space in the inode from
ext4_expand_extra_isize_ea() into a separate function to make that
function smaller and better readable and also to avoid delaration of
variables inside a loop block.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-08-29 15:44:11 -04:00
Jan Kara
6e0cd088c0 ext4: remove (almost) unused variables from ext4_expand_extra_isize_ea()
'start' variable is completely unused in ext4_expand_extra_isize_ea().
Variable 'first' is used only once in one place. So just remove them.
Variables 'entry' and 'last' are only really used later in the function
inside a loop. Move their declarations there.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-08-29 15:43:11 -04:00
Jan Kara
3f2571c1f9 ext4: factor out xattr moving
Factor out function for moving xattrs from inode into external xattr
block from ext4_expand_extra_isize_ea(). That function is already quite
long and factoring out this rather standalone functionality helps
readability.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-08-29 15:42:11 -04:00
Jan Kara
9440571388 ext4: replace bogus assertion in ext4_xattr_shift_entries()
We were checking whether computed offsets do not exceed end of block in
ext4_xattr_shift_entries(). However this does not make sense since we
always only decrease offsets. So replace that assertion with a check
whether we really decrease xattrs value offsets.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-08-29 15:41:11 -04:00
Jan Kara
1cba423707 ext4: remove checks for e_value_block
Currently we don't support xattrs with e_value_block set. We don't allow
them to pass initial xattr check so there's no point for checking for
this later. Since these tests were untested, bugs were creeping in and
not all places which should have checked were checking e_value_block
anyway.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-08-29 15:40:11 -04:00
Jan Kara
2de58f1102 ext4: Check that external xattr value block is zero
Currently we don't support xattrs with values stored out of line. Check
for that in ext4_xattr_check_names() to make sure we never work with
such xattrs since not all the code counts with that resulting is possible
weird corruption issues.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-08-29 15:39:11 -04:00
Jan Kara
e3014d14a8 ext4: fixup free space calculations when expanding inodes
Conditions checking whether there is enough free space in an xattr block
and when xattr is large enough to make enough space in the inode forgot
to account for the fact that inode need not be completely filled up with
xattrs. Thus we could move unnecessarily many xattrs out of inode or
even falsely claim there is not enough space to expand the inode. We
also forgot to update the amount of free space in xattr block when moving
more xattrs and thus could decide to move too big xattr resulting in
unexpected failure.

Fix these problems by properly updating free space in the inode and
xattr block as we move xattrs. To simplify the math, avoid shifting
xattrs after removing each one xattr and instead just shift xattrs only
once there is enough free space in the inode.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-08-29 15:38:11 -04:00
Linus Torvalds
b8927721ae Fix bugs that could cause kernel deadlocks or file system corruption
while moving xattrs to expand the extended inode.  Also add some
 sanity checks to the block group descriptors to make sure we don't end
 up overwriting the superblock.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQEcBAABCAAGBQJXw7i2AAoJEPL5WVaVDYGj96gH/A8rNgx7BoqPx3kanVEamblT
 tM0X9JcEGmKHN4enRts2b78EWbR0/U0SOP92+fg9SSq2MDJ0/kdaKLWmbUwx8jUi
 B7HMEqCprlCdigK7wwt3xF+6edyZRhtzlWy3bhxJ40f0KT5CuriSQbxogr931uKl
 hUKW2h5JtUqHtINzTt4oWjVm8xwrScxuYHYAcpw0G42ZzfO6xQOzQdowcx4m3cE9
 PrtTbU5MwW8/wgsdLiClScQq30MK/GCbHh5heyRt1BcNo9+MDsZDOgdavh9StfnW
 Bl1N6zwRtRBJNcpKWfTfwU4NTIvStCTyA8BJgKgE95YIHDsstJVl4MO7ot25qbM=
 =pXe+
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 fixes from Ted Ts'o:
 "Fix bugs that could cause kernel deadlocks or file system corruption
  while moving xattrs to expand the extended inode.

  Also add some sanity checks to the block group descriptors to make
  sure we don't end up overwriting the superblock"

* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: avoid deadlock when expanding inode size
  ext4: properly align shifted xattrs when expanding inodes
  ext4: fix xattr shifting when expanding inodes part 2
  ext4: fix xattr shifting when expanding inodes
  ext4: validate that metadata blocks do not overlap superblock
  ext4: reserve xattr index for the Hurd
2016-08-29 12:37:11 -07:00
Jan Kara
2e81a4eeed ext4: avoid deadlock when expanding inode size
When we need to move xattrs into external xattr block, we call
ext4_xattr_block_set() from ext4_expand_extra_isize_ea(). That may end
up calling ext4_mark_inode_dirty() again which will recurse back into
the inode expansion code leading to deadlocks.

Protect from recursion using EXT4_STATE_NO_EXPAND inode flag and move
its management into ext4_expand_extra_isize_ea() since its manipulation
is safe there (due to xattr_sem) from possible races with
ext4_xattr_set_handle() which plays with it as well.

CC: stable@vger.kernel.org   # 4.4.x
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-08-11 12:38:55 -04:00
Jan Kara
443a8c41cd ext4: properly align shifted xattrs when expanding inodes
We did not count with the padding of xattr value when computing desired
shift of xattrs in the inode when expanding i_extra_isize. As a result
we could create unaligned start of inline xattrs. Account for alignment
properly.

CC: stable@vger.kernel.org  # 4.4.x-
Signed-off-by: Jan Kara <jack@suse.cz>
2016-08-11 12:00:01 -04:00
Jan Kara
418c12d08d ext4: fix xattr shifting when expanding inodes part 2
When multiple xattrs need to be moved out of inode, we did not properly
recompute total size of xattr headers in the inode and the new header
position. Thus when moving the second and further xattr we asked
ext4_xattr_shift_entries() to move too much and from the wrong place,
resulting in possible xattr value corruption or general memory
corruption.

CC: stable@vger.kernel.org  # 4.4.x
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-08-11 11:58:32 -04:00
Jan Kara
d0141191a2 ext4: fix xattr shifting when expanding inodes
The code in ext4_expand_extra_isize_ea() treated new_extra_isize
argument sometimes as the desired target i_extra_isize and sometimes as
the amount by which we need to grow current i_extra_isize. These happen
to coincide when i_extra_isize is 0 which used to be the common case and
so nobody noticed this until recently when we added i_projid to the
inode and so i_extra_isize now needs to grow from 28 to 32 bytes.

The result of these bugs was that we sometimes unnecessarily decided to
move xattrs out of inode even if there was enough space and we often
ended up corrupting in-inode xattrs because arguments to
ext4_xattr_shift_entries() were just wrong. This could demonstrate
itself as BUG_ON in ext4_xattr_shift_entries() triggering.

Fix the problem by introducing new isize_diff variable and use it where
appropriate.

CC: stable@vger.kernel.org   # 4.4.x
Reported-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-08-11 11:50:30 -04:00
Theodore Ts'o
829fa70ddd ext4: validate that metadata blocks do not overlap superblock
A number of fuzzing failures seem to be caused by allocation bitmaps
or other metadata blocks being pointed at the superblock.

This can cause kernel BUG or WARNings once the superblock is
overwritten, so validate the group descriptor blocks to make sure this
doesn't happen.

Cc: stable@vger.kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-08-01 00:51:02 -04:00
Theodore Ts'o
3980bd3b40 ext4: reserve xattr index for the Hurd
The Hurd is using inode fields which restricts it from using more
advanced ext4 file system features, due to design choices made over a
decade ago.  By giving the Hurd an extended attribute index field we
allow it to move the translator and author fields out of the core
inode fields, and hopefully we can get rid of ugly hacks such as
EXT4_OS_HURD and EXT4_MOUNT2_HURD_COMPAT somday.

For more information please see:
      https://summerofcode.withgoogle.com/projects/#5869799859027968

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-07-31 23:38:36 -04:00
Linus Torvalds
6784725ab0 Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs updates from Al Viro:
 "Assorted cleanups and fixes.

  Probably the most interesting part long-term is ->d_init() - that will
  have a bunch of followups in (at least) ceph and lustre, but we'll
  need to sort the barrier-related rules before it can get used for
  really non-trivial stuff.

  Another fun thing is the merge of ->d_iput() callers (dentry_iput()
  and dentry_unlink_inode()) and a bunch of ->d_compare() ones (all
  except the one in __d_lookup_lru())"

* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (26 commits)
  fs/dcache.c: avoid soft-lockup in dput()
  vfs: new d_init method
  vfs: Update lookup_dcache() comment
  bdev: get rid of ->bd_inodes
  Remove last traces of ->sync_page
  new helper: d_same_name()
  dentry_cmp(): use lockless_dereference() instead of smp_read_barrier_depends()
  vfs: clean up documentation
  vfs: document ->d_real()
  vfs: merge .d_select_inode() into .d_real()
  unify dentry_iput() and dentry_unlink_inode()
  binfmt_misc: ->s_root is not going anywhere
  drop redundant ->owner initializations
  ufs: get rid of redundant checks
  orangefs: constify inode_operations
  missed comment updates from ->direct_IO() prototype change
  file_inode(f)->i_mapping is f->f_mapping
  trim fsnotify hooks a bit
  9p: new helper - v9fs_parent_fid()
  debugfs: ->d_parent is never NULL or negative
  ...
2016-07-28 12:59:05 -07:00
Linus Torvalds
0e06f5c0de Merge branch 'akpm' (patches from Andrew)
Merge updates from Andrew Morton:

 - a few misc bits

 - ocfs2

 - most(?) of MM

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (125 commits)
  thp: fix comments of __pmd_trans_huge_lock()
  cgroup: remove unnecessary 0 check from css_from_id()
  cgroup: fix idr leak for the first cgroup root
  mm: memcontrol: fix documentation for compound parameter
  mm: memcontrol: remove BUG_ON in uncharge_list
  mm: fix build warnings in <linux/compaction.h>
  mm, thp: convert from optimistic swapin collapsing to conservative
  mm, thp: fix comment inconsistency for swapin readahead functions
  thp: update Documentation/{vm/transhuge,filesystems/proc}.txt
  shmem: split huge pages beyond i_size under memory pressure
  thp: introduce CONFIG_TRANSPARENT_HUGE_PAGECACHE
  khugepaged: add support of collapse for tmpfs/shmem pages
  shmem: make shmem_inode_info::lock irq-safe
  khugepaged: move up_read(mmap_sem) out of khugepaged_alloc_page()
  thp: extract khugepaged from mm/huge_memory.c
  shmem, thp: respect MADV_{NO,}HUGEPAGE for file mappings
  shmem: add huge pages support
  shmem: get_unmapped_area align huge page
  shmem: prepare huge= mount option and sysfs knob
  mm, rmap: account shmem thp pages
  ...
2016-07-26 19:55:54 -07:00
Linus Torvalds
396d10993f The major change this cycle is deleting ext4's copy of the file system
encryption code and switching things over to using the copies in
 fs/crypto.  I've updated the MAINTAINERS file to add an entry for
 fs/crypto listing Jaeguk Kim and myself as the maintainers.
 
 There are also a number of bug fixes, most notably for some problems
 found by American Fuzzy Lop (AFL) courtesy of Vegard Nossum.  Also
 fixed is a writeback deadlock detected by generic/130, and some
 potential races in the metadata checksum code.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQEcBAABCAAGBQJXlbP9AAoJEPL5WVaVDYGjGxgIAJ9YIqme//yix63oHYLhDNea
 lY/TLqZrb9/TdDRvGyZa3jYaKaIejL53eEQS9nhEB/JI0sEiDpHmOrDOxdj8Hlsw
 fm7nJyh1u4vFKPyklCbIvLAje1vl8X/6OvqQiwh45gIxbbsFftaBWtccW+UtEkIP
 Fx65Vk7RehJ/sNrM0cRrwB79YAmDS8P6BPyzdMRk+vO/uFqyq7Auc+pkd+bTlw/m
 TDAEIunlk0Ovjx75ru1zaemL1JJx5ffehrJmGCcSUPHVbMObOEKIrlV50gAAKVhO
 qbZAri3mhDvyspSLuS/73L9skeCiWFLhvojCBGu4t2aa3JJolmItO7IpKi4HdRU=
 =bxGK
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 updates from Ted Ts'o:
 "The major change this cycle is deleting ext4's copy of the file system
  encryption code and switching things over to using the copies in
  fs/crypto.  I've updated the MAINTAINERS file to add an entry for
  fs/crypto listing Jaeguk Kim and myself as the maintainers.

  There are also a number of bug fixes, most notably for some problems
  found by American Fuzzy Lop (AFL) courtesy of Vegard Nossum.  Also
  fixed is a writeback deadlock detected by generic/130, and some
  potential races in the metadata checksum code"

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (21 commits)
  ext4: verify extent header depth
  ext4: short-cut orphan cleanup on error
  ext4: fix reference counting bug on block allocation error
  MAINTAINRES: fs-crypto maintainers update
  ext4 crypto: migrate into vfs's crypto engine
  ext2: fix filesystem deadlock while reading corrupted xattr block
  ext4: fix project quota accounting without quota limits enabled
  ext4: validate s_reserved_gdt_blocks on mount
  ext4: remove unused page_idx
  ext4: don't call ext4_should_journal_data() on the journal inode
  ext4: Fix WARN_ON_ONCE in ext4_commit_super()
  ext4: fix deadlock during page writeback
  ext4: correct error value of function verifying dx checksum
  ext4: avoid modifying checksum fields directly during checksum verification
  ext4: check for extents that wrap around
  jbd2: make journal y2038 safe
  jbd2: track more dependencies on transaction commit
  jbd2: move lockdep tracking to journal_s
  jbd2: move lockdep instrumentation for jbd2 handles
  ext4: respect the nobarrier mount option in nojournal mode
  ...
2016-07-26 18:35:55 -07:00
Michal Hocko
8a5c743e30 mm, memcg: use consistent gfp flags during readahead
Vladimir has noticed that we might declare memcg oom even during
readahead because read_pages only uses GFP_KERNEL (with mapping_gfp
restriction) while __do_page_cache_readahead uses
page_cache_alloc_readahead which adds __GFP_NORETRY to prevent from
OOMs.  This gfp mask discrepancy is really unfortunate and easily
fixable.  Drop page_cache_alloc_readahead() which only has one user and
outsource the gfp_mask logic into readahead_gfp_mask and propagate this
mask from __do_page_cache_readahead down to read_pages.

This alone would have only very limited impact as most filesystems are
implementing ->readpages and the common implementation mpage_readpages
does GFP_KERNEL (with mapping_gfp restriction) again.  We can tell it to
use readahead_gfp_mask instead as this function is called only during
readahead as well.  The same applies to read_cache_pages.

ext4 has its own ext4_mpage_readpages but the path which has pages !=
NULL can use the same gfp mask.  Btrfs, cifs, f2fs and orangefs are
doing a very similar pattern to mpage_readpages so the same can be
applied to them as well.

[akpm@linux-foundation.org: coding-style fixes]
[mhocko@suse.com: restrict gfp mask in mpage_alloc]
  Link: http://lkml.kernel.org/r/20160610074223.GC32285@dhcp22.suse.cz
Link: http://lkml.kernel.org/r/1465301556-26431-1-git-send-email-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Cc: Chris Mason <clm@fb.com>
Cc: Steve French <sfrench@samba.org>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Jan Kara <jack@suse.cz>
Cc: Mike Marshall <hubcap@omnibond.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Changman Lee <cm224.lee@samsung.com>
Cc: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Ross Zwisler
6b524995a7 dax: remote unused fault wrappers
Remove the unused wrappers dax_fault() and dax_pmd_fault().  After this
removal, rename __dax_fault() and __dax_pmd_fault() to dax_fault() and
dax_pmd_fault() respectively, and update all callers.

The dax_fault() and dax_pmd_fault() wrappers were initially intended to
capture some filesystem independent functionality around page faults
(calling sb_start_pagefault() & sb_end_pagefault(), updating file mtime
and ctime).

However, the following commits:

   5726b27b09cc ("ext2: Add locking for DAX faults")
   ea3d7209ca01 ("ext4: fix races between page faults and hole punching")

added locking to the ext2 and ext4 filesystems after these common
operations but before __dax_fault() and __dax_pmd_fault() were called.
This means that these wrappers are no longer used, and are unlikely to
be used in the future.

XFS has had locking analogous to what was recently added to ext2 and
ext4 since DAX support was initially introduced by:

   6b698edeeef0 ("xfs: add DAX file operations support")

Link: http://lkml.kernel.org/r/20160714214049.20075-2-ross.zwisler@linux.intel.com
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Vegard Nossum
7bc9491645 ext4: verify extent header depth
Although the extent tree depth of 5 should enough be for the worst
case of 2*32 extents of length 1, the extent tree code does not
currently to merge nodes which are less than half-full with a sibling
node, or to shrink the tree depth if possible.  So it's possible, at
least in theory, for the tree depth to be greater than 5.  However,
even in the worst case, a tree depth of 32 is highly unlikely, and if
the file system is maliciously corrupted, an insanely large eh_depth
can cause memory allocation failures that will trigger kernel warnings
(here, eh_depth = 65280):

    JBD2: ext4.exe wants too many credits credits:195849 rsv_credits:0 max:256
    ------------[ cut here ]------------
    WARNING: CPU: 0 PID: 50 at fs/jbd2/transaction.c:293 start_this_handle+0x569/0x580
    CPU: 0 PID: 50 Comm: ext4.exe Not tainted 4.7.0-rc5+ #508
    Stack:
     604a8947 625badd8 0002fd09 00000000
     60078643 00000000 62623910 601bf9bc
     62623970 6002fc84 626239b0 900000125
    Call Trace:
     [<6001c2dc>] show_stack+0xdc/0x1a0
     [<601bf9bc>] dump_stack+0x2a/0x2e
     [<6002fc84>] __warn+0x114/0x140
     [<6002fdff>] warn_slowpath_null+0x1f/0x30
     [<60165829>] start_this_handle+0x569/0x580
     [<60165d4e>] jbd2__journal_start+0x11e/0x220
     [<60146690>] __ext4_journal_start_sb+0x60/0xa0
     [<60120a81>] ext4_truncate+0x131/0x3a0
     [<60123677>] ext4_setattr+0x757/0x840
     [<600d5d0f>] notify_change+0x16f/0x2a0
     [<600b2b16>] do_truncate+0x76/0xc0
     [<600c3e56>] path_openat+0x806/0x1300
     [<600c55c9>] do_filp_open+0x89/0xf0
     [<600b4074>] do_sys_open+0x134/0x1e0
     [<600b4140>] SyS_open+0x20/0x30
     [<6001ea68>] handle_syscall+0x88/0x90
     [<600295fd>] userspace+0x3fd/0x500
     [<6001ac55>] fork_handler+0x85/0x90

    ---[ end trace 08b0b88b6387a244 ]---

[ Commit message modified and the extent tree depath check changed
from 5 to 32 -- tytso ]

Cc: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-07-15 00:22:07 -04:00
Vegard Nossum
c65d5c6c81 ext4: short-cut orphan cleanup on error
If we encounter a filesystem error during orphan cleanup, we should stop.
Otherwise, we may end up in an infinite loop where the same inode is
processed again and again.

    EXT4-fs (loop0): warning: checktime reached, running e2fsck is recommended
    EXT4-fs error (device loop0): ext4_mb_generate_buddy:758: group 2, block bitmap and bg descriptor inconsistent: 6117 vs 0 free clusters
    Aborting journal on device loop0-8.
    EXT4-fs (loop0): Remounting filesystem read-only
    EXT4-fs error (device loop0) in ext4_free_blocks:4895: Journal has aborted
    EXT4-fs error (device loop0) in ext4_do_update_inode:4893: Journal has aborted
    EXT4-fs error (device loop0) in ext4_do_update_inode:4893: Journal has aborted
    EXT4-fs error (device loop0) in ext4_ext_remove_space:3068: IO failure
    EXT4-fs error (device loop0) in ext4_ext_truncate:4667: Journal has aborted
    EXT4-fs error (device loop0) in ext4_orphan_del:2927: Journal has aborted
    EXT4-fs error (device loop0) in ext4_do_update_inode:4893: Journal has aborted
    EXT4-fs (loop0): Inode 16 (00000000618192a0): orphan list check failed!
    [...]
    EXT4-fs (loop0): Inode 16 (0000000061819748): orphan list check failed!
    [...]
    EXT4-fs (loop0): Inode 16 (0000000061819bf0): orphan list check failed!
    [...]

See-also: c9eb13a9105 ("ext4: fix hang when processing corrupted orphaned inode list")
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2016-07-14 23:21:35 -04:00
Vegard Nossum
554a5ccc4e ext4: fix reference counting bug on block allocation error
If we hit this error when mounted with errors=continue or
errors=remount-ro:

    EXT4-fs error (device loop0): ext4_mb_mark_diskspace_used:2940: comm ext4.exe: Allocating blocks 5090-6081 which overlap fs metadata

then ext4_mb_new_blocks() will call ext4_mb_release_context() and try to
continue. However, ext4_mb_release_context() is the wrong thing to call
here since we are still actually using the allocation context.

Instead, just error out. We could retry the allocation, but there is a
possibility of getting stuck in an infinite loop instead, so this seems
safer.

[ Fixed up so we don't return EAGAIN to userspace. --tytso ]

Fixes: 8556e8f3b6 ("ext4: Don't allow new groups to be added during block allocation")
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: stable@vger.kernel.org
2016-07-14 23:02:47 -04:00
Jaegeuk Kim
a7550b30ab ext4 crypto: migrate into vfs's crypto engine
This patch removes the most parts of internal crypto codes.
And then, it modifies and adds some ext4-specific crypt codes to use the generic
facility.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-07-10 14:01:03 -04:00
Wang Shilong
079788d01e ext4: fix project quota accounting without quota limits enabled
We should always transfer quota accounting, regardless of whether
quota limits are enabled.

Steps to reproduce:
  # mkfs.ext4 /dev/sda4 -O quota,project
  # mount /dev/sda4 /mnt/test
  # cp /bin/bash /mnt/test
  # chattr -p 123 /mnt/test/bash
  # quota -v -P 123

Signed-off-by: Wang Shilong <wshilong@ddn.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-07-05 21:33:52 -04:00
Theodore Ts'o
5b9554dc5b ext4: validate s_reserved_gdt_blocks on mount
If s_reserved_gdt_blocks is extremely large, it's possible for
ext4_init_block_bitmap(), which is called when ext4 sets up an
uninitialized block bitmap, to corrupt random kernel memory.  Add the
same checks which e2fsck has --- it must never be larger than
blocksize / sizeof(__u32) --- and then add a backup check in
ext4_init_block_bitmap() in case the superblock gets modified after
the file system is mounted.

Reported-by: Vegard Nossum <vegard.nossum@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2016-07-05 20:01:52 -04:00
yalin wang
de9e9181bc ext4: remove unused page_idx
Signed-off-by: yalin wang <yalin.wang2010@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.com>
2016-07-05 16:32:32 -04:00
Vegard Nossum
6a7fd522a7 ext4: don't call ext4_should_journal_data() on the journal inode
If ext4_fill_super() fails early, it's possible for ext4_evict_inode()
to call ext4_should_journal_data() before superblock options and flags
are fully set up.  In that case, the iput() on the journal inode can
end up causing a BUG().

Work around this problem by reordering the tests so we only call
ext4_should_journal_data() after we know it's not the journal inode.

Fixes: 2d859db3e4 ("ext4: fix data corruption in inodes with journalled data")
Fixes: 2b405bfa84 ("ext4: fix data=journal fast mount/umount hang")
Cc: Jan Kara <jack@suse.cz>
Cc: stable@vger.kernel.org
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2016-07-04 11:03:00 -04:00
Pranay Kr. Srivastava
4743f83990 ext4: Fix WARN_ON_ONCE in ext4_commit_super()
If there are racing calls to ext4_commit_super() it's possible for
another writeback of the superblock to result in the buffer being
marked with an error after we check if the buffer is marked as having
a write error and the buffer up-to-date flag is set again.  If that
happens mark_buffer_dirty() can end up throwing a WARN_ON_ONCE.

Fix this by moving this check to write before we call
write_buffer_dirty(), and keeping the buffer locked during this whole
sequence.

Signed-off-by: Pranay Kr. Srivastava <pranjas@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-07-04 10:24:52 -04:00
Jan Kara
646caa9c8e ext4: fix deadlock during page writeback
Commit 06bd3c36a733 (ext4: fix data exposure after a crash) uncovered a
deadlock in ext4_writepages() which was previously much harder to hit.
After this commit xfstest generic/130 reproduces the deadlock on small
filesystems.

The problem happens when ext4_do_update_inode() sets LARGE_FILE feature
and marks current inode handle as synchronous. That subsequently results
in ext4_journal_stop() called from ext4_writepages() to block waiting for
transaction commit while still holding page locks, reference to io_end,
and some prepared bio in mpd structure each of which can possibly block
transaction commit from completing and thus results in deadlock.

Fix the problem by releasing page locks, io_end reference, and
submitting prepared bio before calling ext4_journal_stop().

[ Changed to defer the call to ext4_journal_stop() only if the handle
  is synchronous.  --tytso ]

Reported-and-tested-by: Eryu Guan <eguan@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
2016-07-04 10:14:01 -04:00
Daeho Jeong
fa96454069 ext4: correct error value of function verifying dx checksum
ext4_dx_csum_verify() returns the success return value in two checksum
verification failure cases. We need to set the return values to zero
as failure like ext4_dirent_csum_verify() returning zero when failing
to find a checksum dirent at the tail.

Signed-off-by: Daeho Jeong <daeho.jeong@samsung.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
2016-07-03 21:11:08 -04:00
Daeho Jeong
b47820edd1 ext4: avoid modifying checksum fields directly during checksum verification
We temporally change checksum fields in buffers of some types of
metadata into '0' for verifying the checksum values. By doing this
without locking the buffer, some metadata's checksums, which are
being committed or written back to the storage, could be damaged.
In our test, several metadata blocks were found with damaged metadata
checksum value during recovery process. When we only verify the
checksum value, we have to avoid modifying checksum fields directly.

Signed-off-by: Daeho Jeong <daeho.jeong@samsung.com>
Signed-off-by: Youngjin Gil <youngjin.gil@samsung.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
2016-07-03 17:51:39 -04:00