43135 Commits

Author SHA1 Message Date
Linus Torvalds
d5ffdf8b4a xfs: Update 2 for 4.5-rc1
This update contains:
 
 o promotion of XFS_IOC_FS[GS]ETXATTR ioctl to the vfs level so that
   it can be shared with other filesystems. The ext4 project quota
   functionality is the first target for this. The commits in this
   series have not been updated with review or final SOB tags because
   the branch they were originally published in was needed by ext4.
   Those tags are:
 
   Reviewed-by: Theodore Ts'o <tytso@mit.edu>
   Signed-off-by: Dave Chinner <david@fromrobit.com>
 
 o Revert a change that is causing suspend failures.
 o Fix a use-after-free that can occur on log mount failures. Been
   around forever, but now exposed by other changes to log recovery
   made in the first 4.5 merge.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIcBAABAgAGBQJWoV0hAAoJEK3oKUf0dfodSCQP/RXlZp6TQhv2DQ2MW4AeZRzs
 kzp3zvWUN1udB0fgAARMUDbHHeqEp5gUB6Fj8GOjgh69VGac1pjR2GOvEA9UbnhL
 uLQwaRggVB/BJV6+hDUw283kENXE1H8JcDiEIFratwdiZ6KrhniMptzrbUnG22LO
 cBLzHOCFI0x4ib2fdTvrVV8bNaAaLYViYUxuVwzblzhoODN4Nmv5HZ5BlMHDFJsd
 E47Yw/0tdYFVRDuujN22ylYsKsySXBxPaWyUvDDlW/ryeKSfwn3V8Y7BSDZU4vUZ
 CFstsqlzEySGrNNCfor5bFn9EO3i882M+DU60UhZAKRgvAzANAsxjJ97B8Of5KA+
 /0OQarl0ZNJ93g6mZJ2bhuVpRCIGWJ3rBl9+GK8JdtsjF0mPOvrusKTQKoz1frK7
 B8h52P+jxfqrrqeqpNigMWfDKYkXCfUUMAJm57+QILAoTNRupAzgFyXZnSgAermE
 jaDfvnkaSZxfaLtTOlkkpGukhbFubhAWTk3TksVxICPXztZelQLmmbqjZnTYFCT/
 dKieKbwop58DBTycFuzCrWiSjXjodAq/+IfpAQcvJ5xZPLtgfjHxQaHD6zsOVKzQ
 lWosgYOnIaN/PYPOpAzo0sRDf80d5KFjwcdSjrWZVZ5lGfAsx8iYErh3v0Xv3rkE
 YuKQw2AjVVtD64SfHvIn
 =wEy8
 -----END PGP SIGNATURE-----

Merge tag 'xfs-for-linus-4.5-2' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs

Pull more xfs updates from Dave Chinner:
 "This is the second update for XFS that I mentioned in the original
  pull request last week.

  It contains a revert for a suspend regression in 4.4 and a fix for a
  long standing log recovery issue that has been further exposed by all
  the log recovery changes made in the original 4.5 merge.

  There is one more thing in this pull request - one that I forgot to
  merge into the origin.  That is, pulling the XFS_IOC_FS[GS]ETXATTR
  ioctl up to the VFS level so that other filesystems can also use it
  for modifying project quota IDs

  Summary:

   - promotion of XFS_IOC_FS[GS]ETXATTR ioctl to the vfs level so that
     it can be shared with other filesystems.  The ext4 project quota
     functionality is the first target for this.  The commits in this
     series have not been updated with review or final SOB tags because
     the branch they were originally published in was needed by ext4.
     Those tags are:

        Reviewed-by: Theodore Ts'o <tytso@mit.edu>
        Signed-off-by: Dave Chinner <david@fromrobit.com>

   - Revert a change that is causing suspend failures.

   - Fix a use-after-free that can occur on log mount failures.  Been
     around forever, but now exposed by other changes to log recovery
     made in the first 4.5 merge"

* tag 'xfs-for-linus-4.5-2' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs:
  xfs: log mount failures don't wait for buffers to be released
  Revert "xfs: clear PF_NOFREEZE for xfsaild kthread"
  xfs: introduce per-inode DAX enablement
  xfs: use FS_XFLAG definitions directly
  fs: XFS_IOC_FS[SG]SETXATTR to FS_IOC_FS[SG]ETXATTR promotion
2016-01-22 10:54:13 -08:00
Linus Torvalds
eadee0ce6f Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull more vfs updates from Al Viro:
 "Embarrassing braino fix + pipe page accounting + fixing an eyesore in
  find_filesystem() (checking that s1 is equal to prefix of s2 of given
  length can be done in many ways, but "compare strlen(s1) with length
  and then do strncmp()" is not a good one...)"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  [regression] fix braino in fs/dlm/user.c
  pipe: limit the per-user amount of pages allocated in pipes
  find_filesystem(): simplify comparison
2016-01-22 10:24:03 -08:00
Tariq Saeed
b1b1e15ef6 ocfs2: NFS hangs in __ocfs2_cluster_lock due to race with ocfs2_unblock_lock
NFS on a 2 node ocfs2 cluster each node exporting dir.  The lock causing
the hang is the global bit map inode lock.  Node 1 is master, has the
lock granted in PR mode; Node 2 is in the converting list (PR -> EX).
There are no holders of the lock on the master node so it should
downconvert to NL and grant EX to node 2 but that does not happen.
BLOCKED + QUEUED in lock res are set and it is on osb blocked list.
Threads are waiting in __ocfs2_cluster_lock on BLOCKED.  One thread
wants EX, rest want PR.  So it is as though the downconvert thread needs
to be kicked to complete the conv.

The hang is caused by an EX req coming into __ocfs2_cluster_lock on the
heels of a PR req after it sets BUSY (drops l_lock, releasing EX
thread), forcing the incoming EX to wait on BUSY without doing anything.
PR has called ocfs2_dlm_lock, which sets the node 1 lock from NL -> PR,
queues ast.

At this time, upconvert (PR ->EX) arrives from node 2, finds conflict
with node 1 lock in PR, so the lock res is put on dlm thread's dirty
listt.

After ret from ocf2_dlm_lock, PR thread now waits behind EX on BUSY till
awoken by ast.

Now it is dlm_thread that serially runs dlm_shuffle_lists, ast, bast, in
that order.  dlm_shuffle_lists ques a bast on behalf of node 2 (which
will be run by dlm_thread right after the ast).  ast does its part, sets
UPCONVERT_FINISHING, clears BUSY and wakes its waiters.  Next,
dlm_thread runs bast.  It sets BLOCKED and kicks dc thread.  dc thread
runs ocfs2_unblock_lock, but since UPCONVERT_FINISHING set, skips doing
anything and reques.

Inside of __ocfs2_cluster_lock, since EX has been waiting on BUSY ahead
of PR, it wakes up first, finds BLOCKED set and skips doing anything but
clearing UPCONVERT_FINISHING (which was actually "meant" for the PR
thread), and this time waits on BLOCKED.  Next, the PR thread comes out
of wait but since UPCONVERT_FINISHING is not set, it skips updating the
l_ro_holders and goes straight to wait on BLOCKED.  So there, we have a
hang! Threads in __ocfs2_cluster_lock wait on BLOCKED, lock res in osb
blocked list.  Only when dc thread is awoken, it will run
ocfs2_unblock_lock and things will unhang.

One way to fix this is to wake the dc thread on the flag after clearing
UPCONVERT_FINISHING

Orabug: 20933419
Signed-off-by: Tariq Saeed <tariq.x.saeed@oracle.com>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Reviewed-by: Wengang Wang <wen.gang.wang@oracle.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Eric Ren <zren@suse.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-21 17:20:51 -08:00
Sudip Mukherjee
ff7d080e52 reiserfs: fix dereference of ERR_PTR
reiserfs_iget() returns either NULL or error code in ERR_PTR.  And we
were only checking for NULL, so in case of some other error we will try
to dereference the ERR_PTR(-errno) thinking it to be a valid pointer.

Signed-off-by: Sudip Mukherjee <sudip@vectorindia.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-21 17:20:51 -08:00
Kirill A. Shutemov
b6ec57f4b9 thp: change pmd_trans_huge_lock() interface to return ptl
After THP refcounting rework we have only two possible return values
from pmd_trans_huge_lock(): success and failure.  Return-by-pointer for
ptl doesn't make much sense in this case.

Let's convert pmd_trans_huge_lock() to return ptl on success and NULL on
failure.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Minchan Kim <minchan@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-21 17:20:51 -08:00
Al Viro
117aa41e80 [regression] fix braino in fs/dlm/user.c
it's "bugger off if we got ERR_PTR", not the other way round...

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-01-21 17:45:15 -05:00
Linus Torvalds
eae21770b4 Merge branch 'akpm' (patches from Andrew)
Merge third patch-bomb from Andrew Morton:
 "I'm pretty much done for -rc1 now:

   - the rest of MM, basically

   - lib/ updates

   - checkpatch, epoll, hfs, fatfs, ptrace, coredump, exit

   - cpu_mask simplifications

   - kexec, rapidio, MAINTAINERS etc, etc.

   - more dma-mapping cleanups/simplifications from hch"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (109 commits)
  MAINTAINERS: add/fix git URLs for various subsystems
  mm: memcontrol: add "sock" to cgroup2 memory.stat
  mm: memcontrol: basic memory statistics in cgroup2 memory controller
  mm: memcontrol: do not uncharge old page in page cache replacement
  Documentation: cgroup: add memory.swap.{current,max} description
  mm: free swap cache aggressively if memcg swap is full
  mm: vmscan: do not scan anon pages if memcg swap limit is hit
  swap.h: move memcg related stuff to the end of the file
  mm: memcontrol: replace mem_cgroup_lruvec_online with mem_cgroup_online
  mm: vmscan: pass memcg to get_scan_count()
  mm: memcontrol: charge swap to cgroup2
  mm: memcontrol: clean up alloc, online, offline, free functions
  mm: memcontrol: flatten struct cg_proto
  mm: memcontrol: rein in the CONFIG space madness
  net: drop tcp_memcontrol.c
  mm: memcontrol: introduce CONFIG_MEMCG_LEGACY_KMEM
  mm: memcontrol: allow to disable kmem accounting for cgroup2
  mm: memcontrol: account "kmem" consumers in cgroup2 memory controller
  mm: memcontrol: move kmem accounting code to CONFIG_MEMCG
  mm: memcontrol: separate kmem code from legacy tcp accounting code
  ...
2016-01-21 12:32:08 -08:00
Linus Torvalds
e9f57ebcba Merge branch 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs
Pull overlayfs updates from Miklos Szeredi:
 "This contains several bug fixes and a new mount option
  'default_permissions' that allows read-only exported NFS
  filesystems to be used as lower layer"

* 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
  ovl: check dentry positiveness in ovl_cleanup_whiteouts()
  ovl: setattr: check permissions before copy-up
  ovl: root: copy attr
  ovl: move super block magic number to magic.h
  ovl: use a minimal buffer in ovl_copy_xattr
  ovl: allow zero size xattr
  ovl: default permissions
2016-01-21 12:20:46 -08:00
Linus Torvalds
5c89e9ea7e Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
Pull fuse updates from Miklos Szeredi:
 "This adds SEEK_HOLE and SEEK_DATA support in lseek"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
  fuse: add support for SEEK_HOLE and SEEK_DATA in lseek
2016-01-21 12:14:24 -08:00
Linus Torvalds
71e4634e00 Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending
Pull SCSI target updates from Nicholas Bellinger:
 "The highlights this round include:

   - Introduce configfs support for unlocked configfs_depend_item()
     (krzysztof + andrezej)
   - Conversion of usb-gadget target driver to new function registration
     interface (andrzej + sebastian)
   - Enable qla2xxx FC target mode support for Extended Logins (himansu +
     giridhar)
   - Enable qla2xxx FC target mode support for Exchange Offload (himansu +
     giridhar)
   - Add qla2xxx FC target mode irq affinity notification + selective
     command queuing.  (quinn + himanshu)
   - Fix iscsi-target deadlock in se_node_acl configfs deletion (sagi +
     nab)
   - Convert se_node_acl configfs deletion + se_node_acl->queue_depth to
     proper se_session->sess_kref + target_get_session() usage.  (hch +
     sagi + nab)
   - Fix long-standing race between se_node_acl->acl_kref get and
     get_initiator_node_acl() lookup.  (hch + nab)
   - Fix target/user block-size handling, and make sure netlink reaches
     all network namespaces (sheng + andy)

  Note there is an outstanding bug-fix series for remote I_T nexus port
  TMR LUN_RESET has been posted and still being tested, and will likely
  become post -rc1 material at this point"

* 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending: (56 commits)
  scsi: qla2xxxx: avoid type mismatch in comparison
  target/user: Make sure netlink would reach all network namespaces
  target: Obtain se_node_acl->acl_kref during get_initiator_node_acl
  target: Convert ACL change queue_depth se_session reference usage
  iscsi-target: Fix potential dead-lock during node acl delete
  ib_srpt: Convert acl lookup to modern get_initiator_node_acl usage
  tcm_fc: Convert acl lookup to modern get_initiator_node_acl usage
  tcm_fc: Wait for command completion before freeing a session
  target: Fix a memory leak in target_dev_lba_map_store()
  target: Support aborting tasks with a 64-bit tag
  usb/gadget: Remove set-but-not-used variables
  target: Remove an unused variable
  target: Fix indentation in target_core_configfs.c
  target/user: Allow user to set block size before enabling device
  iser-target: Fix non negative ERR_PTR isert_device_get usage
  target/fcoe: Add tag support to tcm_fc
  qla2xxx: Check for online flag instead of active reset when transmitting responses
  qla2xxx: Set all queues to 4k
  qla2xxx: Disable ZIO at start time.
  qla2xxx: Move atioq to a different lock to reduce lock contention
  ...
2016-01-20 17:20:53 -08:00
Andrew Morton
90d6cd51af fs/adfs/adfs.h: tidy up comments
Lots of needless 80-col overflows.

Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-20 17:09:18 -08:00
Andrew Morton
e458bcd16f fs/overlayfs/super.c needs pagemap.h
i386 allmodconfig:

  In file included from fs/overlayfs/super.c:10:0:
  fs/overlayfs/super.c: In function 'ovl_fill_super':
  include/linux/fs.h:898:36: error: 'PAGE_CACHE_SIZE' undeclared (first use in this function)
   #define MAX_LFS_FILESIZE (((loff_t)PAGE_CACHE_SIZE << (BITS_PER_LONG-1))-1)
                                      ^
  fs/overlayfs/super.c:939:19: note: in expansion of macro 'MAX_LFS_FILESIZE'
    sb->s_maxbytes = MAX_LFS_FILESIZE;
                     ^
  include/linux/fs.h:898:36: note: each undeclared identifier is reported only once for each function it appears in
   #define MAX_LFS_FILESIZE (((loff_t)PAGE_CACHE_SIZE << (BITS_PER_LONG-1))-1)
                                      ^
  fs/overlayfs/super.c:939:19: note: in expansion of macro 'MAX_LFS_FILESIZE'
    sb->s_maxbytes = MAX_LFS_FILESIZE;
                     ^

Cc: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-20 17:09:18 -08:00
Mateusz Guzik
a3b609ef9f proc read mm's {arg,env}_{start,end} with mmap semaphore taken.
Only functions doing more than one read are modified.  Consumeres
happened to deal with possibly changing data, but it does not seem like
a good thing to rely on.

Signed-off-by: Mateusz Guzik <mguzik@redhat.com>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Jarod Wilson <jarod@redhat.com>
Cc: Jan Stancek <jstancek@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Anshuman Khandual <anshuman.linux@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-20 17:09:18 -08:00
Jann Horn
ac94b6e3ba fs/coredump: prevent "" / "." / ".." core path components
Let %h and %e print empty values as "!", "." as "!" and
".." as "!.".

This prevents hostnames and comm values that are empty or consist of one
or two dots from changing the directory level at which the corefile will
be stored.

Consider the case where someone decides to sort coredumps by hostname
with a core pattern like "/cores/%h/core.%e.%p.%t" or so.  In this
case, hostnames "" and "." would cause the coredump to land directly in
/cores, which is not what the intent behind the core pattern is, and
".." would cause the coredump to land in /.

Yeah, there probably aren't many people who do that, but I still don't
want this edgecase to be kind of broken.

It seems very unlikely that this caused security issues anywhere, so I'm
not requesting a stable backport.

[akpm@linux-foundation.org: tweak code comment]
Signed-off-by: Jann Horn <jann@thejh.net>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-20 17:09:18 -08:00
Jann Horn
caaee6234d ptrace: use fsuid, fsgid, effective creds for fs access checks
By checking the effective credentials instead of the real UID / permitted
capabilities, ensure that the calling process actually intended to use its
credentials.

To ensure that all ptrace checks use the correct caller credentials (e.g.
in case out-of-tree code or newly added code omits the PTRACE_MODE_*CREDS
flag), use two new flags and require one of them to be set.

The problem was that when a privileged task had temporarily dropped its
privileges, e.g.  by calling setreuid(0, user_uid), with the intent to
perform following syscalls with the credentials of a user, it still passed
ptrace access checks that the user would not be able to pass.

While an attacker should not be able to convince the privileged task to
perform a ptrace() syscall, this is a problem because the ptrace access
check is reused for things in procfs.

In particular, the following somewhat interesting procfs entries only rely
on ptrace access checks:

 /proc/$pid/stat - uses the check for determining whether pointers
     should be visible, useful for bypassing ASLR
 /proc/$pid/maps - also useful for bypassing ASLR
 /proc/$pid/cwd - useful for gaining access to restricted
     directories that contain files with lax permissions, e.g. in
     this scenario:
     lrwxrwxrwx root root /proc/13020/cwd -> /root/foobar
     drwx------ root root /root
     drwxr-xr-x root root /root/foobar
     -rw-r--r-- root root /root/foobar/secret

Therefore, on a system where a root-owned mode 6755 binary changes its
effective credentials as described and then dumps a user-specified file,
this could be used by an attacker to reveal the memory layout of root's
processes or reveal the contents of files he is not allowed to access
(through /proc/$pid/cwd).

[akpm@linux-foundation.org: fix warning]
Signed-off-by: Jann Horn <jann@thejh.net>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Casey Schaufler <casey@schaufler-ca.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Morris <james.l.morris@oracle.com>
Cc: "Serge E. Hallyn" <serge.hallyn@ubuntu.com>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Willy Tarreau <w@1wt.eu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-20 17:09:18 -08:00
Julia Lawall
8992de4cec fat: constify fatent_operations structures
The fatent_operations structures are never modified, so declare them as
const.

Done with the help of Coccinelle.

Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-20 17:09:18 -08:00
Namjae Jeon
16fab20150 fat: permit to return phy block number by fibmap in fallocated region
Make the fibmap call return the proper physical block number for any
offset request in the fallocated range.

Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Amit Sahrawat <a.sahrawat@samsung.com>
Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-20 17:09:18 -08:00
Namjae Jeon
7e0f236b5b fat: skip cluster allocation on fallocated region
Skip new cluster allocation after checking i_blocks limit in _fat_get_block,
because the blocks are already allocated in fallocated region.

Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Amit Sahrawat <a.sahrawat@samsung.com>
Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-20 17:09:18 -08:00
Namjae Jeon
b13bb33eac fat: add fat_fallocate operation
Implement preallocation via the fallocate syscall on VFAT partitions.
This patch is based on an earlier patch of the same name which had some
issues detailed below and did not get accepted.  Refer
https://lkml.org/lkml/2007/12/22/130.

a) The preallocated space was not persistent when the
   FALLOC_FL_KEEP_SIZE flag was set.  It will deallocate cluster at evict
   time.

b) There was no need to zero out the clusters when the flag was set
   Instead of doing an expanding truncate, just allocate clusters and add
   them to the fat chain.  This reduces preallocation time.

Compatibility with windows:

There are no issues when FALLOC_FL_KEEP_SIZE is not set because it just
does an expanding truncate.  Thus reading from the preallocated area on
windows returns null until data is written to it.

When a file with preallocated area using the FALLOC_FL_KEEP_SIZE was
written to on windows, the windows driver freed-up the preallocated
clusters and allocated new clusters for the new data.  The freed up
clusters gets reflected in the free space available for the partition
which can be seen from the Volume properties.

The windows chkdsk tool also does not report any errors on a disk
containing files with preallocated space.

And there is also no issue using linux fat fsck.  because discard
preallocated clusters at repair time.

Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Amit Sahrawat <a.sahrawat@samsung.com>
Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-20 17:09:18 -08:00
OGAWA Hirofumi
a3082d526f fat: add simple validation for directory inode
This detects simple corruption cases of directory, and tries to avoid
further damage to user data.

And performance impact of this validation should be very low, or not
measurable.

Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Reported-by: Vegard Nossum <vegard.nossum@oracle.com>
Tested-by: Vegard Nossum <vegard.nossum@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-20 17:09:18 -08:00
Jan Kara
a513d86983 fat: allow time_offset to be up to 24 hours
Currently we limit values of time_offset mount option to be between -12
and 12 hours.  However e.g.  zone GMT+12 can have a DST correction on top
which makes the total time difference 13 hours.  Update the checks in
mount option parsing to allow offset of upto 24 hours to allow for unusual
cases.

Signed-off-by: Jan Kara <jack@suse.cz>
Reported-by: Volker Kuhlmann <list0570@paradise.net.nz>
Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-20 17:09:18 -08:00
Geliang Tang
2c35dea279 fs/hfs/catalog.c: use list_for_each_entry in hfs_cat_delete
Use list_for_each_entry() instead of list_for_each() to simplify the code.

Signed-off-by: Geliang Tang <geliangtang@163.com>
Reviewed-by: Vyacheslav Dubeyko <slava@dubeyko.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-20 17:09:18 -08:00
Jason Baron
df0108c5da epoll: add EPOLLEXCLUSIVE flag
Currently, epoll file descriptors or epfds (the fd returned from
epoll_create[1]()) that are added to a shared wakeup source are always
added in a non-exclusive manner.  This means that when we have multiple
epfds attached to a shared fd source they are all woken up.  This creates
thundering herd type behavior.

Introduce a new 'EPOLLEXCLUSIVE' flag that can be passed as part of the
'event' argument during an epoll_ctl() EPOLL_CTL_ADD operation.  This new
flag allows for exclusive wakeups when there are multiple epfds attached
to a shared fd event source.

The implementation walks the list of exclusive waiters, and queues an
event to each epfd, until it finds the first waiter that has threads
blocked on it via epoll_wait().  The idea is to search for threads which
are idle and ready to process the wakeup events.  Thus, we queue an event
to at least 1 epfd, but may still potentially queue an event to all epfds
that are attached to the shared fd source.

Performance testing was done by Madars Vitolins using a modified version
of Enduro/X.  The use of the 'EPOLLEXCLUSIVE' flag reduce the length of
this particular workload from 860s down to 24s.

Sample epoll_clt text:

EPOLLEXCLUSIVE

  Sets an exclusive wakeup mode for the epfd file descriptor that is
  being attached to the target file descriptor, fd.  Thus, when an event
  occurs and multiple epfd file descriptors are attached to the same
  target file using EPOLLEXCLUSIVE, one or more epfds will receive an
  event with epoll_wait(2).  The default in this scenario (when
  EPOLLEXCLUSIVE is not set) is for all epfds to receive an event.
  EPOLLEXCLUSIVE may only be specified with the op EPOLL_CTL_ADD.

Signed-off-by: Jason Baron <jbaron@akamai.com>
Tested-by: Madars Vitolins <m@silodev.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Al Viro <viro@ftp.linux.org.uk>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Eric Wong <normalperson@yhbt.net>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Hagen Paul Pfeifer <hagen@jauu.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-20 17:09:18 -08:00
Kirill A. Shutemov
f4be6153cc fs/proc/task_mmu.c: add workaround for old compilers
For THP=n, HPAGE_PMD_NR in smaps_account() expands to BUILD_BUG().
That's fine since this codepath is eliminated by modern compilers.

But older compilers have not that efficient dead code elimination.  It
causes problem at least with gcc 4.1.2 on m68k:

   fs/built-in.o: In function `smaps_account':
   task_mmu.c:(.text+0x4f8fa): undefined reference to `__compiletime_assert_471'

Let's replace HPAGE_PMD_NR with 1 << compound_order(page).

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-20 17:09:18 -08:00
Linus Torvalds
d9569f003c Merge branch 'kbuild' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild
Pull kbuild updates from Michal Marek:
 - Make <modname>-m in makefiles work like <modname>-y and fix the
   fallout
 - Minor genksyms fix
 - Fix race with make -j install modules_install
 - Move -Wsign-compare from make W=1 to W=2
 - Other minor fixes

* 'kbuild' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild:
  kbuild: Demote 'sign-compare' warning to W=2
  Makefile: revert "Makefile: Document ability to make file.lst and file.S" partially
  kbuild: Do not run modules_install and install in paralel
  genksyms: Handle string literals with spaces in reference files
  fixdep: constify strrcmp arguments
  ath10k: Fix build with CONFIG_THERMAL=m
  Revert "drm: Hack around CONFIG_AGP=m build failures"
  kbuild: Allow to specify composite modules with modname-m
  staging/ad7606: Actually build the interface modules
2016-01-20 09:45:43 -08:00
Willy Tarreau
759c01142a pipe: limit the per-user amount of pages allocated in pipes
On no-so-small systems, it is possible for a single process to cause an
OOM condition by filling large pipes with data that are never read. A
typical process filling 4000 pipes with 1 MB of data will use 4 GB of
memory. On small systems it may be tricky to set the pipe max size to
prevent this from happening.

This patch makes it possible to enforce a per-user soft limit above
which new pipes will be limited to a single page, effectively limiting
them to 4 kB each, as well as a hard limit above which no new pipes may
be created for this user. This has the effect of protecting the system
against memory abuse without hurting other users, and still allowing
pipes to work correctly though with less data at once.

The limit are controlled by two new sysctls : pipe-user-pages-soft, and
pipe-user-pages-hard. Both may be disabled by setting them to zero. The
default soft limit allows the default number of FDs per process (1024)
to create pipes of the default size (64kB), thus reaching a limit of 64MB
before starting to create only smaller pipes. With 256 processes limited
to 1024 FDs each, this results in 1024*64kB + (256*1024 - 1024) * 4kB =
1084 MB of memory allocated for a user. The hard limit is disabled by
default to avoid breaking existing applications that make intensive use
of pipes (eg: for splicing).

Reported-by: socketpair@gmail.com
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Mitigates: CVE-2013-4312 (Linux 2.0+)
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Willy Tarreau <w@1wt.eu>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-01-19 19:25:21 -05:00
Linus Torvalds
7c24d9f3b2 Merge branch 'for-4.5/core' of git://git.kernel.dk/linux-block
Pull core block updates from Jens Axboe:
 "We don't have a lot of core changes this time around, it's mostly in
  drivers, which will come in a subsequent pull.

  The cores changes include:

   - blk-mq
        - Prep patch from Christoph, changing blk_mq_alloc_request() to
          take flags instead of just using gfp_t for sleep/nosleep.
        - Doc patch from me, clarifying the difference between legacy
          and blk-mq for timer usage.
        - Fixes from Raghavendra for memory-less numa nodes, and a reuse
          of CPU masks.

   - Cleanup from Geliang Tang, using offset_in_page() instead of open
     coding it.

   - From Ilya, rename request_queue slab to it reflects what it holds,
     and a fix for proper use of bdgrab/put.

   - A real fix for the split across stripe boundaries from Keith.  We
     yanked a broken version of this from 4.4-rc final, this one works.

   - From Mike Krinkin, emit a trace message when we split.

   - From Wei Tang, two small cleanups, not explicitly clearing memory
     that is already cleared"

* 'for-4.5/core' of git://git.kernel.dk/linux-block:
  block: use bd{grab,put}() instead of open-coding
  block: split bios to max possible length
  block: add call to split trace point
  blk-mq: Avoid memoryless numa node encoded in hctx numa_node
  blk-mq: Reuse hardware context cpumask for tags
  blk-mq: add a flags parameter to blk_mq_alloc_request
  Revert "blk-flush: Queue through IO scheduler when flush not required"
  block: clarify blk_add_timer() use case for blk-mq
  bio: use offset_in_page macro
  block: do not initialise statics to 0 or NULL
  block: do not initialise globals to 0 or NULL
  block: rename request_queue slab cache
2016-01-19 15:03:34 -08:00
Al Viro
558041d8d2 find_filesystem(): simplify comparison
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-01-19 12:02:23 -05:00
Dave Chinner
ee3804d9f9 Merge branch 'xfs-misc-fixes-for-4.5-3' into for-next 2016-01-19 08:28:36 +11:00
Dave Chinner
85bec5460a xfs: log mount failures don't wait for buffers to be released
Recently I've been seeing xfs/051 fail on 1k block size filesystems.
Trying to trace the events during the test lead to the problem going
away, indicating that it was a race condition that lead to this
ASSERT failure:

XFS: Assertion failed: atomic_read(&pag->pag_ref) == 0, file: fs/xfs/xfs_mount.c, line: 156
.....
[<ffffffff814e1257>] xfs_free_perag+0x87/0xb0
[<ffffffff814e21b9>] xfs_mountfs+0x4d9/0x900
[<ffffffff814e5dff>] xfs_fs_fill_super+0x3bf/0x4d0
[<ffffffff811d8800>] mount_bdev+0x180/0x1b0
[<ffffffff814e3ff5>] xfs_fs_mount+0x15/0x20
[<ffffffff811d90a8>] mount_fs+0x38/0x170
[<ffffffff811f4347>] vfs_kern_mount+0x67/0x120
[<ffffffff811f7018>] do_mount+0x218/0xd60
[<ffffffff811f7e5b>] SyS_mount+0x8b/0xd0

When I finally caught it with tracing enabled, I saw that AG 2 had
an elevated reference count and a buffer was responsible for it. I
tracked down the specific buffer, and found that it was missing the
final reference count release that would put it back on the LRU and
hence be found by xfs_wait_buftarg() calls in the log mount failure
handling.

The last four traces for the buffer before the assert were (trimmed
for relevance)

kworker/0:1-5259   xfs_buf_iodone:        hold 2  lock 0 flags ASYNC
kworker/0:1-5259   xfs_buf_ioerror:       hold 2  lock 0 error -5
mount-7163	   xfs_buf_lock_done:     hold 2  lock 0 flags ASYNC
mount-7163	   xfs_buf_unlock:        hold 2  lock 1 flags ASYNC

This is an async write that is completing, so there's nobody waiting
for it directly.  Hence we call xfs_buf_relse() once all the
processing is complete. That does:

static inline void xfs_buf_relse(xfs_buf_t *bp)
{
	xfs_buf_unlock(bp);
	xfs_buf_rele(bp);
}

Now, it's clear that mount is waiting on the buffer lock, and that
it has been released by xfs_buf_relse() and gained by mount. This is
expected, because at this point the mount process is in
xfs_buf_delwri_submit() waiting for all the IO it submitted to
complete.

The mount process, however, is waiting on the lock for the buffer
because it is in xfs_buf_delwri_submit(). This waits for IO
completion, but it doesn't wait for the buffer reference owned by
the IO to go away. The mount process collects all the completions,
fails the log recovery, and the higher level code then calls
xfs_wait_buftarg() to free all the remaining buffers in the
filesystem.

The issue is that on unlocking the buffer, the scheduler has decided
that the mount process has higher priority than the the kworker
thread that is running the IO completion, and so immediately
switched contexts to the mount process from the semaphore unlock
code, hence preventing the kworker thread from finishing the IO
completion and releasing the IO reference to the buffer.

Hence by the time that xfs_wait_buftarg() is run, the buffer still
has an active reference and so isn't on the LRU list that the
function walks to free the remaining buffers. Hence we miss that
buffer and continue onwards to tear down the mount structures,
at which time we get find a stray reference count on the perag
structure. On a non-debug kernel, this will be ignored and the
structure torn down and freed. Hence when the kworker thread is then
rescheduled and the buffer released and freed, it will access a
freed perag structure.

The problem here is that when the log mount fails, we still need to
quiesce the log to ensure that the IO workqueues have returned to
idle before we run xfs_wait_buftarg(). By synchronising the
workqueues, we ensure that all IO completions are fully processed,
not just to the point where buffers have been unlocked. This ensures
we don't end up in the situation above.

cc: <stable@vger.kernel.org> # 3.18
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-01-19 08:28:10 +11:00
Dave Chinner
3e85286e75 Revert "xfs: clear PF_NOFREEZE for xfsaild kthread"
This reverts commit 24ba16bb3d499c49974669cd8429c3e4138ab102 as it
prevents machines from suspending. This regression occurs when the
xfsaild is idle on entry to suspend, and so there s no activity to
wake it from it's idle sleep and hence see that it is supposed to
freeze. Hence the freezer times out waiting for it and suspend is
cancelled.

There is no obvious fix for this short of freezing the filesystem
properly, so revert this change for now.

cc: <stable@vger.kernel.org> # 4.4
Signed-off-by: Dave Chinner <david@fromorbit.com>
Acked-by: Jiri Kosina <jkosina@suse.cz>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-01-19 08:21:46 +11:00
Dave Chinner
4c931f770d Merge branch 'xfs-setxattr-promotion' into for-next 2016-01-19 08:16:08 +11:00
Linus Torvalds
c1a198d923 Merge branch 'for-linus-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs updates from Chris Mason:
 "This has our usual assortment of fixes and cleanups, but the biggest
  change included is Omar Sandoval's free space tree.  It's not the
  default yet, mounting -o space_cache=v2 enables it and sets a readonly
  compat bit.  The tree can actually be deleted and regenerated if there
  are any problems, but it has held up really well in testing so far.

  For very large filesystems (30T+) our existing free space caching code
  can end up taking a huge amount of time during commits.  The new tree
  based code is faster and less work overall to update as the commit
  progresses.

  Omar worked on this during the summer and we'll hammer on it in
  production here at FB over the next few months"

* 'for-linus-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (73 commits)
  Btrfs: fix fitrim discarding device area reserved for boot loader's use
  Btrfs: Check metadata redundancy on balance
  btrfs: statfs: report zero available if metadata are exhausted
  btrfs: preallocate path for snapshot creation at ioctl time
  btrfs: allocate root item at snapshot ioctl time
  btrfs: do an allocation earlier during snapshot creation
  btrfs: use smaller type for btrfs_path locks
  btrfs: use smaller type for btrfs_path lowest_level
  btrfs: use smaller type for btrfs_path reada
  btrfs: cleanup, use enum values for btrfs_path reada
  btrfs: constify static arrays
  btrfs: constify remaining structs with function pointers
  btrfs tests: replace whole ops structure for free space tests
  btrfs: use list_for_each_entry* in backref.c
  btrfs: use list_for_each_entry_safe in free-space-cache.c
  btrfs: use list_for_each_entry* in check-integrity.c
  Btrfs: use linux/sizes.h to represent constants
  btrfs: cleanup, remove stray return statements
  btrfs: zero out delayed node upon allocation
  btrfs: pass proper enum type to start_transaction()
  ...
2016-01-18 12:44:40 -08:00
Linus Torvalds
5807fcaa9b Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security
Pull security subsystem updates from James Morris:

 - EVM gains support for loading an x509 cert from the kernel
   (EVM_LOAD_X509), into the EVM trusted kernel keyring.

 - Smack implements 'file receive' process-based permission checking for
   sockets, rather than just depending on inode checks.

 - Misc enhancments for TPM & TPM2.

 - Cleanups and bugfixes for SELinux, Keys, and IMA.

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (41 commits)
  selinux: Inode label revalidation performance fix
  KEYS: refcount bug fix
  ima: ima_write_policy() limit locking
  IMA: policy can be updated zero times
  selinux: rate-limit netlink message warnings in selinux_nlmsg_perm()
  selinux: export validatetrans decisions
  gfs2: Invalid security labels of inodes when they go invalid
  selinux: Revalidate invalid inode security labels
  security: Add hook to invalidate inode security labels
  selinux: Add accessor functions for inode->i_security
  security: Make inode argument of inode_getsecid non-const
  security: Make inode argument of inode_getsecurity non-const
  selinux: Remove unused variable in selinux_inode_init_security
  keys, trusted: seal with a TPM2 authorization policy
  keys, trusted: select hash algorithm for TPM2 chips
  keys, trusted: fix: *do not* allow duplicate key options
  tpm_ibmvtpm: properly handle interrupted packet receptions
  tpm_tis: Tighten IRQ auto-probing
  tpm_tis: Refactor the interrupt setup
  tpm_tis: Get rid of the duplicate IRQ probing code
  ...
2016-01-17 19:13:15 -08:00
Yaowei Bai
4c416f42ee fs/stat.c: drop the last new_valid_dev check
New_valid_dev() always returns true, so that's unnecessary to perform
new_valid_dev() checks in some filesystems.  Most checks of
new_valid_dev() have been removed so let's drop this last one and then
we can remove new_valid_dev() from the source code.

No functional change.

Signed-off-by: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-16 11:17:23 -08:00
Mike Kravetz
4aae8d1c05 mm/hugetlbfs: unmap pages if page fault raced with hole punch
Page faults can race with fallocate hole punch.  If a page fault happens
between the unmap and remove operations, the page is not removed and
remains within the hole.  This is not the desired behavior.  The race is
difficult to detect in user level code as even in the non-race case, a
page within the hole could be faulted back in before fallocate returns.
If userfaultfd is expanded to support hugetlbfs in the future, this race
will be easier to observe.

If this race is detected and a page is mapped, the remove operation
(remove_inode_hugepages) will unmap the page before removing.  The unmap
within remove_inode_hugepages occurs with the hugetlb_fault_mutex held
so that no other faults will be processed until the page is removed.

The (unmodified) routine hugetlb_vmdelete_list was moved ahead of
remove_inode_hugepages to satisfy the new reference.

[akpm@linux-foundation.org: move hugetlb_vmdelete_list()]
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-15 17:56:32 -08:00
Mike Kravetz
9aacdd354d fs/hugetlbfs/inode.c: fix bugs in hugetlb_vmtruncate_list()
Hillf Danton noticed bugs in the hugetlb_vmtruncate_list routine.  The
argument end is of type pgoff_t.  It was being converted to a vaddr
offset and passed to unmap_hugepage_range.  However, end was also being
used as an argument to the vma_interval_tree_foreach controlling loop.
In addition, the conversion of end to vaddr offset was incorrect.

hugetlb_vmtruncate_list is called as part of a file truncate or
fallocate hole punch operation.

When truncating a hugetlbfs file, this bug could prevent some pages from
being unmapped.  This is possible if there are multiple vmas mapping the
file, and there is a sufficiently sized hole between the mappings.  The
size of the hole between two vmas (A,B) must be such that the starting
virtual address of B is greater than (ending virtual address of A <<
PAGE_SHIFT).  In this case, the pages in B would not be unmapped.  If
pages are not properly unmapped during truncate, the following BUG is
hit:

	kernel BUG at fs/hugetlbfs/inode.c:428!

In the fallocate hole punch case, this bug could prevent pages from
being unmapped as in the truncate case.  However, for hole punch the
result is that unmapped pages will not be removed during the operation.
For hole punch, it is also possible that more pages than desired will be
unmapped.  This unnecessary unmapping will cause page faults to
reestablish the mappings on subsequent page access.

Fixes: 1bfad99ab (" hugetlbfs: hugetlb_vmtruncate_list() needs to take a range")Reported-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: <stable@vger.kernel.org>	[4.3]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-15 17:56:32 -08:00
Kirill A. Shutemov
88f306b68c mm: fix locking order in mm_take_all_locks()
Dmitry Vyukov has reported[1] possible deadlock (triggered by his
syzkaller fuzzer):

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(&hugetlbfs_i_mmap_rwsem_key);
                               lock(&mapping->i_mmap_rwsem);
                               lock(&hugetlbfs_i_mmap_rwsem_key);
  lock(&mapping->i_mmap_rwsem);

Both traces points to mm_take_all_locks() as a source of the problem.
It doesn't take care about ordering or hugetlbfs_i_mmap_rwsem_key (aka
mapping->i_mmap_rwsem for hugetlb mapping) vs.  i_mmap_rwsem.

huge_pmd_share() does memory allocation under hugetlbfs_i_mmap_rwsem_key
and allocator can take i_mmap_rwsem if it hit reclaim.  So we need to
take i_mmap_rwsem from all hugetlb VMAs before taking i_mmap_rwsem from
rest of VMAs.

The patch also documents locking order for hugetlbfs_i_mmap_rwsem_key.

[1] http://lkml.kernel.org/r/CACT4Y+Zu95tBs-0EvdiAKzUOsb4tczRRfCRTpLr4bg_OP9HuVg@mail.gmail.com

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Reviewed-by: Michal Hocko <mhocko@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-15 17:56:32 -08:00
Tejun Heo
654a0dd095 cgroup, memcg, writeback: drop spurious rcu locking around mem_cgroup_css_from_page()
In earlier versions, mem_cgroup_css_from_page() could return non-root
css on a legacy hierarchy which can go away and required rcu locking;
however, the eventual version simply returns the root cgroup if memcg is
on a legacy hierarchy and thus doesn't need rcu locking around or in it.
Remove spurious rcu lockings.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-15 17:56:32 -08:00
Dan Williams
c046c321cb dax: re-enable dax pmd mappings
Now that the get_user_pages() path knows how to handle dax-pmd mappings,
remove the protections that disabled dax-pmd support.

Tests available from github.com/pmem/ndctl:

    make TESTS="lib/test-dax.sh lib/test-mmap.sh" check

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-15 17:56:32 -08:00
Dan Williams
cbb38e41a9 dax: provide diagnostics for pmd mapping failures
There is a wide gamut of conditions that can trigger the dax pmd path to
fallback to pte mappings.  Ideally we'd have a syscall interface to
determine mapping characteristics after the fact.  In the meantime
provide debug messages.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Suggested-by: Matthew Wilcox <willy@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-15 17:56:32 -08:00
Dan Williams
f25748e3c3 mm, dax: convert vmf_insert_pfn_pmd() to pfn_t
Similar to the conversion of vm_insert_mixed() use pfn_t in the
vmf_insert_pfn_pmd() to tag the resulting pte with _PAGE_DEVICE when the
pfn is backed by a devm_memremap_pages() mapping.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-15 17:56:32 -08:00
Dan Williams
01c8f1c44b mm, dax, gpu: convert vm_insert_mixed to pfn_t
Convert the raw unsigned long 'pfn' argument to pfn_t for the purpose of
evaluating the PFN_MAP and PFN_DEV flags.  When both are set it triggers
_PAGE_DEVMAP to be set in the resulting pte.

There are no functional changes to the gpu drivers as a result of this
conversion.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: David Airlie <airlied@linux.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-15 17:56:32 -08:00
Dan Williams
34c0fd540e mm, dax, pmem: introduce pfn_t
For the purpose of communicating the optional presence of a 'struct
page' for the pfn returned from ->direct_access(), introduce a type that
encapsulates a page-frame-number plus flags.  These flags contain the
historical "page_link" encoding for a scatterlist entry, but can also
denote "device memory".  Where "device memory" is a set of pfns that are
not part of the kernel's linear mapping by default, but are accessed via
the same memory controller as ram.

The motivation for this new type is large capacity persistent memory
that needs struct page entries in the 'memmap' to support 3rd party DMA
(i.e.  O_DIRECT I/O with a persistent memory source/target).  However,
we also need it in support of maintaining a list of mapped inodes which
need to be unmapped at driver teardown or freeze_bdev() time.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dave Hansen <dave@sr71.net>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-15 17:56:32 -08:00
Toshi Kani
59bf4fb9d3 dax: Split pmd map when fallback on COW
An infinite loop of PMD faults was observed when attempted to mlock() a
private read-only PMD mmap'd range of a DAX file.

__dax_pmd_fault() simply returns with VM_FAULT_FALLBACK when falling
back to PTE on COW.  However, __handle_mm_fault() returns without
falling back to handle_pte_fault() because a PMD map is present in this
case.

Change __dax_pmd_fault() to split the PMD map, if present, before
returning with VM_FAULT_FALLBACK.

Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-15 17:56:32 -08:00
Dan Williams
b2e0d1625e dax: fix lifetime of in-kernel dax mappings with dax_map_atomic()
The DAX implementation needs to protect new calls to ->direct_access()
and usage of its return value against the driver for the underlying
block device being disabled.  Use blk_queue_enter()/blk_queue_exit() to
hold off blk_cleanup_queue() from proceeding, or otherwise fail new
mapping requests if the request_queue is being torn down.

This also introduces blk_dax_ctl to simplify the interface from fs/dax.c
through dax_map_atomic() to bdev_direct_access().

[willy@linux.intel.com: fix read() of a hole]
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Cc: Jan Kara <jack@suse.com>
Cc: Jens Axboe <axboe@fb.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Matthew Wilcox <willy@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-15 17:56:32 -08:00
Dan Williams
fe683adabf dax: guarantee page aligned results from bdev_direct_access()
If a ->direct_access() implementation ever returns a map count less than
PAGE_SIZE, catch the error in bdev_direct_access().  This simplifies
error checking in upper layers.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reported-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-15 17:56:32 -08:00
Dan Williams
0e749e5424 dax: increase granularity of dax_clear_blocks() operations
dax_clear_blocks is currently performing a cond_resched() after every
PAGE_SIZE memset.  We need not check so frequently, for example md-raid
only calls cond_resched() at stripe granularity.  Also, in preparation
for introducing a dax_map_atomic() operation that temporarily pins a dax
mapping move the call to cond_resched() to the outer loop.

The worst case latency between calls to cond_resched() after this change
is 500us the average latency is 133us.  This is up from a 10us max and
4us average.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Jan Kara <jack@suse.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-15 17:56:32 -08:00
Dan Williams
52db400fcd pmem, dax: clean up clear_pmem()
To date, we have implemented two I/O usage models for persistent memory,
PMEM (a persistent "ram disk") and DAX (mmap persistent memory into
userspace).  This series adds a third, DAX-GUP, that allows DAX mappings
to be the target of direct-i/o.  It allows userspace to coordinate
DMA/RDMA from/to persistent memory.

The implementation leverages the ZONE_DEVICE mm-zone that went into
4.3-rc1 (also discussed at kernel summit) to flag pages that are owned
and dynamically mapped by a device driver.  The pmem driver, after
mapping a persistent memory range into the system memmap via
devm_memremap_pages(), arranges for DAX to distinguish pfn-only versus
page-backed pmem-pfns via flags in the new pfn_t type.

The DAX code, upon seeing a PFN_DEV+PFN_MAP flagged pfn, flags the
resulting pte(s) inserted into the process page tables with a new
_PAGE_DEVMAP flag.  Later, when get_user_pages() is walking ptes it keys
off _PAGE_DEVMAP to pin the device hosting the page range active.
Finally, get_page() and put_page() are modified to take references
against the device driver established page mapping.

Finally, this need for "struct page" for persistent memory requires
memory capacity to store the memmap array.  Given the memmap array for a
large pool of persistent may exhaust available DRAM introduce a
mechanism to allocate the memmap from persistent memory.  The new
"struct vmem_altmap *" parameter to devm_memremap_pages() enables
arch_add_memory() to use reserved pmem capacity rather than the page
allocator.

This patch (of 25):

Both __dax_pmd_fault, and clear_pmem() were taking special steps to
clear memory a page at a time to take advantage of non-temporal
clear_page() implementations.  However, x86_64 does not use non-temporal
instructions for clear_page(), and arch_clear_pmem() was always
incurring the cost of __arch_wb_cache_pmem().

Clean up the assumption that doing clear_pmem() a page at a time is more
performant.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reported-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Christoffer Dall <christoffer.dall@linaro.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dave Chinner <david@fromorbit.com>
Cc: David Airlie <airlied@linux.ie>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jan Kara <jack@suse.com>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Jens Axboe <axboe@fb.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Richard Weinberger <richard@nod.at>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Toshi Kani <toshi.kani@hpe.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-15 17:56:32 -08:00
Kirill A. Shutemov
e1534ae950 mm: differentiate page_mapped() from page_mapcount() for compound pages
Let's define page_mapped() to be true for compound pages if any
sub-pages of the compound page is mapped (with PMD or PTE).

On other hand page_mapcount() return mapcount for this particular small
page.

This will make cases like page_get_anon_vma() behave correctly once we
allow huge pages to be mapped with PTE.

Most users outside core-mm should use page_mapcount() instead of
page_mapped().

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Jerome Marchand <jmarchan@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Steve Capper <steve.capper@linaro.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-15 17:56:32 -08:00