Currently if some fatal errors occur, like all IO get -EIO, resources
would be cleaned up when
a) transaction is being committed or
b) BTRFS_FS_STATE_ERROR is set
However, in some rare cases, resources may be left alone after transaction
gets aborted and umount may run into some ASSERT(), e.g.
ASSERT(list_empty(&block_group->dirty_list));
For case a), in btrfs_commit_transaciton(), there're several places at the
beginning where we just call btrfs_end_transaction() without cleaning up
resources. For case b), it is possible that the trans handle doesn't have
any dirty stuff, then only trans hanlde is marked as aborted while
BTRFS_FS_STATE_ERROR is not set, so resources remain in memory.
This makes btrfs also check BTRFS_FS_STATE_TRANS_ABORTED to make sure that
all resources won't stay in memory after umount.
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Signed-off-by: David Sterba <dsterba@suse.com>
With mount option "xino=on", mounter declares that there are enough
free high bits in underlying fs to hold the layer fsid.
If overlayfs does encounter underlying inodes using the high xino
bits reserved for layer fsid, a warning will be emitted and the original
inode number will be used.
The mount option name "xino" goes after a similar meaning mount option
of aufs, but in overlayfs case, the mapping is stateless.
An example for a use case of "xino=on" is when upper/lower is on an xfs
filesystem. xfs uses 64bit inode numbers, but it currently never uses the
upper 8bit for inode numbers exposed via stat(2) and that is not likely to
change in the future without user opting-in for a new xfs feature. The
actual number of unused upper bit is much larger and determined by the xfs
filesystem geometry (64 - agno_log - agblklog - inopblog). That means
that for all practical purpose, there are enough unused bits in xfs
inode numbers for more than OVL_MAX_STACK unique fsid's.
Another use case of "xino=on" is when upper/lower is on tmpfs. tmpfs inode
numbers are allocated sequentially since boot, so they will practially
never use the high inode number bits.
For compatibility with applications that expect 32bit inodes, the feature
can be disabled with "xino=off". The option "xino=auto" automatically
detects underlying filesystem that use 32bit inodes and enables the
feature. The Kconfig option OVERLAY_FS_XINO_AUTO and module parameter of
the same name, determine if the default mode for overlayfs mount is
"xino=auto" or "xino=off".
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
When overlay layers are not all on the same fs, but all inode numbers
of underlying fs do not use the high 'xino' bits, overlay st_ino values
are constant and persistent.
In that case, relax non-samefs constraint for consistent d_ino and always
iterate non-merge dir using ovl_fill_real() actor so we can remap lower
inode numbers to unique lower fs range.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
When overlay layers are not all on the same fs, but all inode numbers
of underlying fs do not use the high 'xino' bits, overlay st_ino values
are constant and persistent.
In that case, set i_ino value to the same value as st_ino for nfsd
readdirplus validator.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
On 64bit systems, when overlay layers are not all on the same fs, but
all inode numbers of underlying fs are not using the high bits, use the
high bits to partition the overlay st_ino address space. The high bits
hold the fsid (upper fsid is 0). This way overlay inode numbers are unique
and all inodes use overlay st_dev. Inode numbers are also persistent
for a given layer configuration.
Currently, our only indication for available high ino bits is from a
filesystem that supports file handles and uses the default encode_fh()
operation, which encodes a 32bit inode number.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Instead of allocating an anonymous bdev per lower layer, allocate
one anonymous bdev per every unique lower fs that is different than
upper fs.
Every unique lower fs is assigned an fsid > 0 and the number of
unique lower fs are stored in ofs->numlowerfs.
The assigned fsid is stored in the lower layer struct and will be
used also for inode number multiplexing.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
A helper for ovl_getattr() to map the values of st_dev and st_ino
according to constant st_ino rules.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Certain properties in ovl_lookup_data should be set only for the last
element of the path. IOW, if we are calling ovl_lookup_single() for an
absolute redirect, then d->is_dir and d->opaque do not make much sense
for intermediate path elements. Instead set them only if dentry being
lookup is last path element.
As of now we do not seem to be making use of d->opaque if it is set for
a path/dentry in lower. But just define the semantics so that future code
can make use of this assumption.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
If we are looking in last layer, then there should not be any need to
process redirect. redirect information is used only for lookup in next
lower layer and there is no more lower layer to look into. So no need
to process redirects.
IOW, ignore redirects on lowest layer.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
When decoding a lower file handle, we need to check if lower file was
copied up and indexed and if it has a whiteout index, we need to check
if this is an unlinked but open non-dir before returning -ESTALE.
To find out if this is an unlinked but open non-dir we need to lookup
an overlay inode in inode cache by lower inode and that requires decoding
the lower file handle before looking in inode cache.
Before this change, if the lower inode turned out to be a directory, we
may have paid an expensive cost to reconnect that lower directory for
nothing.
After this change, we start by decoding a disconnected lower dentry and
using the lower inode for looking up an overlay inode in inode cache.
If we find overlay inode and dentry in cache, we avoid the index lookup
overhead. If we don't find an overlay inode and dentry in cache, then we
only need to decode a connected lower dentry in case the lower dentry is
a non-indexed directory.
The xfstests group overlay/exportfs tests decoding overlayfs file
handles after drop_caches with different states of the file at encode
and decode time. Overall the tests in the group call ovl_lower_fh_to_d()
89 times to decode a lower file handle.
Before this change, the tests called ovl_get_index_fh() 75 times and
reconnect_one() 61 times.
After this change, the tests call ovl_get_index_fh() 70 times and
reconnect_one() 59 times. The 2 cases where reconnect_one() was avoided
are cases where a non-upper directory file handle was encoded, then the
directory removed and then file handle was decoded.
To demonstrate the affect on decoding file handles with hot inode/dentry
cache, the drop_caches call in the tests was disabled. Without
drop_caches, there are no reconnect_one() calls at all before or after
the change. Before the change, there are 75 calls to ovl_get_index_fh(),
exactly as the case with drop_caches. After the change, there are only
10 calls to ovl_get_index_fh().
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
On lookup of non directory, we try to decode the origin file handle
stored in upper inode. The origin file handle is supposed to be decoded
to a disconnected non-dir dentry, which is fine, because we only need
the lower inode of a copy up origin.
However, if the origin file handle somehow turns out to be a directory
we pay the expensive cost of reconnecting the directory dentry, only to
get a mismatch file type and drop the dentry.
Optimize this case by explicitly opting out of reconnecting the dentry.
Opting-out of reconnect is done by passing a NULL acceptable callback
to exportfs_decode_fh().
While the case described above is a strange corner case that does not
really need to be optimized, the API added for this optimization will
be used by a following patch to optimize a more common case of decoding
an overlayfs file handle.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Rename ovl_encode_fh() to ovl_encode_real_fh() to differentiate from the
exportfs function ovl_encode_inode_fh() and change the latter to
ovl_encode_fh() to match the exportfs method name.
Rename ovl_decode_fh() to ovl_decode_real_fh() for consistency.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
For broken hardlinks, we do not return lower st_ino, so we should
also not return lower pseudo st_dev.
Fixes: a0c5ad307ac0 ("ovl: relax same fs constraint for constant st_ino")
Cc: <stable@vger.kernel.org> #v4.15
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
As of now if we encounter an opaque dir while looking for a dentry, we set
d->last=true. This means that there is no need to look further in any of
the lower layers. This works fine as long as there are no redirets or
relative redircts. But what if there is an absolute redirect on the
children dentry of opaque directory. We still need to continue to look into
next lower layer. This patch fixes it.
Here is an example to demonstrate the issue. Say you have following setup.
upper: /redirect (redirect=/a/b/c)
lower1: /a/[b]/c ([b] is opaque) (c has absolute redirect=/a/b/d/)
lower0: /a/b/d/foo
Now "redirect" dir should merge with lower1:/a/b/c/ and lower0:/a/b/d.
Note, despite the fact lower1:/a/[b] is opaque, we need to continue to look
into lower0 because children c has an absolute redirect.
Following is a reproducer.
Watch me make foo disappear:
$ mkdir lower middle upper work work2 merged
$ mkdir lower/origin
$ touch lower/origin/foo
$ mount -t overlay none merged/ \
-olowerdir=lower,upperdir=middle,workdir=work2
$ mkdir merged/pure
$ mv merged/origin merged/pure/redirect
$ umount merged
$ mount -t overlay none merged/ \
-olowerdir=middle:lower,upperdir=upper,workdir=work
$ mv merged/pure/redirect merged/redirect
Now you see foo inside a twice redirected merged dir:
$ ls merged/redirect
foo
$ umount merged
$ mount -t overlay none merged/ \
-olowerdir=middle:lower,upperdir=upper,workdir=work
After mount cycle you don't see foo inside the same dir:
$ ls merged/redirect
During middle layer lookup, the opaqueness of middle/pure is left in
the lookup state and then middle/pure/redirect is wrongly treated as
opaque.
Fixes: 02b69b284cd7 ("ovl: lookup redirects")
Cc: <stable@vger.kernel.org> #v4.10
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
d->last signifies that this is the last layer we are looking into and there
is no more. And that means this allows for some optimzation opportunities
during lookup. For example, in ovl_lookup_single() we don't have to check
for opaque xattr of a directory is this is the last layer we are looking
into (d->last = true).
But knowing for sure whether we are looking into last layer can be very
tricky. If redirects are not enabled, then we can look at poe->numlower and
figure out if the lookup we are about to is last layer or not. But if
redircts are enabled then it is possible poe->numlower suggests that we are
looking in last layer, but there is an absolute redirect present in found
element and that redirects us to a layer in root and that means lookup will
continue in lower layers further.
For example, consider following.
/upperdir/pure (opaque=y)
/upperdir/pure/foo (opaque=y,redirect=/bar)
/lowerdir/bar
In this case pure is "pure upper". When we look for "foo", that time
poe->numlower=0. But that alone does not mean that we will not search for a
merge candidate in /lowerdir. Absolute redirect changes that.
IOW, d->last should not be set just based on poe->numlower if redirects are
enabled. That can lead to setting d->last while it should not have and that
means we will not check for opaque xattr while we should have.
So do this.
- If redirects are not enabled, then continue to rely on poe->numlower
information to determine if it is last layer or not.
- If redirects are enabled, then set d->last = true only if this is the
last layer in root ovl_entry (roe).
Suggested-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Fixes: 02b69b284cd7 ("ovl: lookup redirects")
Cc: <stable@vger.kernel.org> #v4.10
Eddie Horng reported that readdir of an overlayfs directory that
was exported via NFSv3 returns entries with d_type set to DT_UNKNOWN.
The reason is that while preparing the response for readdirplus, nfsd
checks inside encode_entryplus_baggage() that a child dentry's inode
number matches the value of d_ino returns by overlayfs readdir iterator.
Because the overlayfs inodes use arbitrary inode numbers that are not
correlated with the values of st_ino/d_ino, NFSv3 falls back to not
encoding d_type. Although this is an allowed behavior, we can fix it for
the case of all overlayfs layers on the same underlying filesystem.
When NFS export is enabled and d_ino is consistent with st_ino
(samefs), set the same value also to i_ino in ovl_fill_inode() for all
overlayfs inodes, nfsd readdirplus sanity checks will pass.
ovl_fill_inode() may be called from ovl_new_inode(), before real inode
was created with ino arg 0. In that case, i_ino will be updated to real
upper inode i_ino on ovl_inode_init() or ovl_inode_update().
Reported-by: Eddie Horng <eddiehorng.tw@gmail.com>
Tested-by: Eddie Horng <eddiehorng.tw@gmail.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Fixes: 8383f1748829 ("ovl: wire up NFS export operations")
Cc: <stable@vger.kernel.org> #v4.16
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
* Since cifs_vfs_error was just using pr_debug_ratelimited like the rest
of cifs_dbg, move it there too
* Add a ONCE type flag to call the pr_xxx_once() debug function instead
of the ratelimited ones.
To convert existing printk_once() calls to this we can run:
perl -i -pE \
's/printk_once\s*\(([^" \n]+)(.*)/cifs_dbg(VFS|ONCE,$2/g' \
fs/cifs/*.c
Signed-off-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
On 32-bit (e.g. with m68k-linux-gnu-gcc-4.1):
fs/cifs/inode.c: In function ‘simple_hashstr’:
fs/cifs/inode.c:713: warning: integer constant is too large for ‘long’ type
Fixes: 7ea884c77e5c97f1 ("smb3: Fix root directory when server returns inode number of zero")
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Adding an extra debug message to show if a tree connect failure during
reconnect (and made it a log once so it doesn't spam the logs).
Saw a case recently where tree connect repeatedly returned
access denied on reconnect and it wasn't as easy to spot as it
should have been.
Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
tcon->ses is being dereferenced before it is null checked, hence
there is a potential null pointer dereference.
Fix this by moving the pointer dereference after tcon->ses has
been properly null checked.
Addresses-Coverity-ID: 1467426 ("Dereference before null check")
Fixes: 93012bf98416 ("cifs: add server->vals->header_preamble_size")
Signed-off-by: Gustavo A. R. Silva <garsilva@embeddedor.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Merge more updates from Andrew Morton:
- almost all of the rest of MM
- kasan updates
- lots of procfs work
- misc things
- lib/ updates
- checkpatch
- rapidio
- ipc/shm updates
- the start of willy's XArray conversion
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (140 commits)
page cache: use xa_lock
xarray: add the xa_lock to the radix_tree_root
fscache: use appropriate radix tree accessors
export __set_page_dirty
unicore32: turn flush_dcache_mmap_lock into a no-op
arm64: turn flush_dcache_mmap_lock into a no-op
mac80211_hwsim: use DEFINE_IDA
radix tree: use GFP_ZONEMASK bits of gfp_t for flags
linux/const.h: refactor _BITUL and _BITULL a bit
linux/const.h: move UL() macro to include/linux/const.h
linux/const.h: prefix include guard of uapi/linux/const.h with _UAPI
xen, mm: allow deferred page initialization for xen pv domains
elf: enforce MAP_FIXED on overlaying elf segments
fs, elf: drop MAP_FIXED usage from elf_map
mm: introduce MAP_FIXED_NOREPLACE
MAINTAINERS: update bouncing aacraid@adaptec.com addresses
fs/dcache.c: add cond_resched() in shrink_dentry_list()
include/linux/kfifo.h: fix comment
ipc/shm.c: shm_split(): remove unneeded test for NULL shm_file_data.vm_ops
kernel/sysctl.c: add kdoc comments to do_proc_do{u}intvec_minmax_conv_param
...
Remove the address_space ->tree_lock and use the xa_lock newly added to
the radix_tree_root. Rename the address_space ->page_tree to ->i_pages,
since we don't really care that it's a tree.
[willy@infradead.org: fix nds32, fs/dax.c]
Link: http://lkml.kernel.org/r/20180406145415.GB20605@bombadil.infradead.orgLink: http://lkml.kernel.org/r/20180313132639.17387-9-willy@infradead.org
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This results in no change in structure size on 64-bit machines as it
fits in the padding between the gfp_t and the void *. 32-bit machines
will grow the structure from 8 to 12 bytes. Almost all radix trees are
protected with (at least) a spinlock, so as they are converted from
radix trees to xarrays, the data structures will shrink again.
Initialising the spinlock requires a name for the benefit of lockdep, so
RADIX_TREE_INIT() now needs to know the name of the radix tree it's
initialising, and so do IDR_INIT() and IDA_INIT().
Also add the xa_lock() and xa_unlock() family of wrappers to make it
easier to use the lock. If we could rely on -fplan9-extensions in the
compiler, we could avoid all of this syntactic sugar, but that wasn't
added until gcc 4.6.
Link: http://lkml.kernel.org/r/20180313132639.17387-8-willy@infradead.org
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Don't open-code accesses to data structure internals.
Link: http://lkml.kernel.org/r/20180313132639.17387-7-willy@infradead.org
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
XFS currently contains a copy-and-paste of __set_page_dirty(). Export
it from buffer.c instead.
Link: http://lkml.kernel.org/r/20180313132639.17387-6-willy@infradead.org
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Acked-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Anshuman has reported that with "fs, elf: drop MAP_FIXED usage from
elf_map" applied, some ELF binaries in his environment fail to start
with
[ 23.423642] 9148 (sed): Uhuuh, elf segment at 0000000010030000 requested but the memory is mapped already
[ 23.423706] requested [10030000, 10040000] mapped [10030000, 10040000] 100073 anon
The reason is that the above binary has overlapping elf segments:
LOAD 0x0000000000000000 0x0000000010000000 0x0000000010000000
0x0000000000013a8c 0x0000000000013a8c R E 10000
LOAD 0x000000000001fd40 0x000000001002fd40 0x000000001002fd40
0x00000000000002c0 0x00000000000005e8 RW 10000
LOAD 0x0000000000020328 0x0000000010030328 0x0000000010030328
0x0000000000000384 0x00000000000094a0 RW 10000
That binary has two RW LOAD segments, the first crosses a page border
into the second
0x1002fd40 (LOAD2-vaddr) + 0x5e8 (LOAD2-memlen) == 0x10030328 (LOAD3-vaddr)
Handle this situation by enforcing MAP_FIXED when we establish a
temporary brk VMA to handle overlapping segments. All other mappings
will still use MAP_FIXED_NOREPLACE.
Link: http://lkml.kernel.org/r/20180213100440.GM3443@dhcp22.suse.cz
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com>
Cc: Andrei Vagin <avagin@openvz.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Kees Cook <keescook@chromium.org>
Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
Cc: Joel Stanley <joel@jms.id.au>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Mark Brown <broonie@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Both load_elf_interp and load_elf_binary rely on elf_map to map segments
on a controlled address and they use MAP_FIXED to enforce that. This is
however dangerous thing prone to silent data corruption which can be
even exploitable.
Let's take CVE-2017-1000253 as an example. At the time (before commit
eab09532d400: "binfmt_elf: use ELF_ET_DYN_BASE only for PIE")
ELF_ET_DYN_BASE was at TASK_SIZE / 3 * 2 which is not that far away from
the stack top on 32b (legacy) memory layout (only 1GB away). Therefore
we could end up mapping over the existing stack with some luck.
The issue has been fixed since then (a87938b2e246: "fs/binfmt_elf.c: fix
bug in loading of PIE binaries"), ELF_ET_DYN_BASE moved moved much
further from the stack (eab09532d400 and later by c715b72c1ba4: "mm:
revert x86_64 and arm64 ELF_ET_DYN_BASE base changes") and excessive
stack consumption early during execve fully stopped by da029c11e6b1
("exec: Limit arg stack to at most 75% of _STK_LIM"). So we should be
safe and any attack should be impractical. On the other hand this is
just too subtle assumption so it can break quite easily and hard to
spot.
I believe that the MAP_FIXED usage in load_elf_binary (et. al) is still
fundamentally dangerous. Moreover it shouldn't be even needed. We are
at the early process stage and so there shouldn't be unrelated mappings
(except for stack and loader) existing so mmap for a given address should
succeed even without MAP_FIXED. Something is terribly wrong if this is
not the case and we should rather fail than silently corrupt the
underlying mapping.
Address this issue by changing MAP_FIXED to the newly added
MAP_FIXED_NOREPLACE. This will mean that mmap will fail if there is an
existing mapping clashing with the requested one without clobbering it.
[mhocko@suse.com: fix build]
[akpm@linux-foundation.org: coding-style fixes]
[avagin@openvz.org: don't use the same value for MAP_FIXED_NOREPLACE and MAP_SYNC]
Link: http://lkml.kernel.org/r/20171218184916.24445-1-avagin@openvz.org
Link: http://lkml.kernel.org/r/20171213092550.2774-3-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrei Vagin <avagin@openvz.org>
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
Cc: Joel Stanley <joel@jms.id.au>
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
As previously reported (https://patchwork.kernel.org/patch/8642031/)
it's possible to call shrink_dentry_list with a large number of dentries
(> 10000). This, in turn, could trigger the softlockup detector and
possibly trigger a panic. In addition to the unmount path being
vulnerable to this scenario, at SuSE we've observed similar situation
happening during process exit on processes that touch a lot of dentries.
Here is an excerpt from a crash dump. The number after the colon are
the number of dentries on the list passed to shrink_dentry_list:
PID 99760: 10722
PID 107530: 215
PID 108809: 24134
PID 108877: 21331
PID 141708: 16487
So we want to kill between 15k-25k dentries without yielding.
And one possible call stack looks like:
4 [ffff8839ece41db0] _raw_spin_lock at ffffffff8152a5f8
5 [ffff8839ece41db0] evict at ffffffff811c3026
6 [ffff8839ece41dd0] __dentry_kill at ffffffff811bf258
7 [ffff8839ece41df0] shrink_dentry_list at ffffffff811bf593
8 [ffff8839ece41e18] shrink_dcache_parent at ffffffff811bf830
9 [ffff8839ece41e50] proc_flush_task at ffffffff8120dd61
10 [ffff8839ece41ec0] release_task at ffffffff81059ebd
11 [ffff8839ece41f08] do_exit at ffffffff8105b8ce
12 [ffff8839ece41f78] sys_exit at ffffffff8105bd53
13 [ffff8839ece41f80] system_call_fastpath at ffffffff81532909
While some of the callers of shrink_dentry_list do use cond_resched,
this is not sufficient to prevent softlockups. So just move
cond_resched into shrink_dentry_list from its callers.
David said: I've found hundreds of occurrences of warnings that we emit
when need_resched stays set for a prolonged period of time with the
stack trace that is included in the change log.
Link: http://lkml.kernel.org/r/1521718946-31521-1-git-send-email-nborisov@suse.com
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Goldwyn Rodrigues <rgoldwyn@suse.de>
Cc: Jeff Mahoney <jeffm@suse.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "ipc: Clamp *mni to the real IPCMNI limit", v3.
The sysctl parameters msgmni, shmmni and semmni have an inherent limit
of IPC_MNI (32k). However, users may not be aware of that because they
can write a value much higher than that without getting any error or
notification. Reading the parameters back will show the newly written
values which are not real.
Enforcing the limit by failing sysctl parameter write, however, can
break existing user applications. To address this delemma, a new flags
field is introduced into the ctl_table. The value CTL_FLAGS_CLAMP_RANGE
can be added to any ctl_table entries to enable a looser range clamping
without returning any error. For example,
.flags = CTL_FLAGS_CLAMP_RANGE,
This flags value are now used for the range checking of shmmni, msgmni
and semmni without breaking existing applications. If any out of range
value is written to those sysctl parameters, the following warning will
be printed instead.
Kernel parameter "shmmni" was set out of range [0, 32768], clamped to 32768.
Reading the values back will show 32768 instead of some fake values.
This patch (of 6):
Fix a typo.
Link: http://lkml.kernel.org/r/1519926220-7453-2-git-send-email-longman@redhat.com
Signed-off-by: Waiman Long <longman@redhat.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Luis R. Rodriguez <mcgrof@kernel.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since the stack rlimit is used in multiple places during exec and it can
be changed via other threads (via setrlimit()) or processes (via
prlimit()), the assumption that the value doesn't change cannot be made.
This leads to races with mm layout selection and argument size
calculations. This changes the exec path to use the rlimit stored in
bprm instead of in current. Before starting the thread, the bprm stack
rlimit is stored back to current.
Link: http://lkml.kernel.org/r/1518638796-20819-4-git-send-email-keescook@chromium.org
Fixes: 64701dee4178e ("exec: Use sane stack rlimit under secureexec")
Signed-off-by: Kees Cook <keescook@chromium.org>
Reported-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
Reported-by: Andy Lutomirski <luto@kernel.org>
Reported-by: Brad Spengler <spender@grsecurity.net>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Ben Hutchings <ben@decadent.org.uk>
Cc: Greg KH <greg@kroah.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: "Jason A. Donenfeld" <Jason@zx2c4.com>
Cc: Laura Abbott <labbott@redhat.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Willy Tarreau <w@1wt.eu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Provide a final callback into fs/exec.c before start_thread() takes
over, to handle any last-minute changes, like the coming restoration of
the stack limit.
Link: http://lkml.kernel.org/r/1518638796-20819-3-git-send-email-keescook@chromium.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ben Hutchings <ben@decadent.org.uk>
Cc: Ben Hutchings <ben.hutchings@codethink.co.uk>
Cc: Brad Spengler <spender@grsecurity.net>
Cc: Greg KH <greg@kroah.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: "Jason A. Donenfeld" <Jason@zx2c4.com>
Cc: Laura Abbott <labbott@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Willy Tarreau <w@1wt.eu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "exec: Pin stack limit during exec".
Attempts to solve problems with the stack limit changing during exec
continue to be frustrated[1][2]. In addition to the specific issues
around the Stack Clash family of flaws, Andy Lutomirski pointed out[3]
other places during exec where the stack limit is used and is assumed to
be unchanging. Given the many places it gets used and the fact that it
can be manipulated/raced via setrlimit() and prlimit(), I think the only
way to handle this is to move away from the "current" view of the stack
limit and instead attach it to the bprm, and plumb this down into the
functions that need to know the stack limits. This series implements
the approach.
[1] 04e35f4495dd ("exec: avoid RLIMIT_STACK races with prlimit()")
[2] 779f4e1c6c7c ("Revert "exec: avoid RLIMIT_STACK races with prlimit()"")
[3] to security@kernel.org, "Subject: existing rlimit races?"
This patch (of 3):
Since it is possible that the stack rlimit can change externally during
exec (either via another thread calling setrlimit() or another process
calling prlimit()), provide a way to pass the rlimit down into the
per-architecture mm layout functions so that the rlimit can stay in the
bprm structure instead of sitting in the signal structure until exec is
finalized.
Link: http://lkml.kernel.org/r/1518638796-20819-2-git-send-email-keescook@chromium.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Ben Hutchings <ben@decadent.org.uk>
Cc: Willy Tarreau <w@1wt.eu>
Cc: Hugh Dickins <hughd@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: "Jason A. Donenfeld" <Jason@zx2c4.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Laura Abbott <labbott@redhat.com>
Cc: Greg KH <greg@kroah.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ben Hutchings <ben.hutchings@codethink.co.uk>
Cc: Brad Spengler <spender@grsecurity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
All it takes to open a file and read 1 byte from it.
seq_file will be allocated along with any private allocations, and more
importantly seq file buffer which is 1 page by default.
Link: http://lkml.kernel.org/r/20180310085252.GB17121@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Glauber Costa <glommer@gmail.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
One use of the reiserfs_warning() macro in journal_init_dev() is missing
a parameter, causing the following warning:
REISERFS warning (device loop0): journal_init_dev: Cannot open '%s': %i journal_init_dev:
This also causes a WARN_ONCE() warning in the vsprintf code, and then a
panic if panic_on_warn is set.
Please remove unsupported %/ in format string
WARNING: CPU: 1 PID: 4480 at lib/vsprintf.c:2138 format_decode+0x77f/0x830 lib/vsprintf.c:2138
Kernel panic - not syncing: panic_on_warn set ...
Just add another string argument to the macro invocation.
Addresses https://syzkaller.appspot.com/bug?id=0627d4551fdc39bf1ef5d82cd9eef587047f7718
Link: http://lkml.kernel.org/r/d678ebe1-6f54-8090-df4c-b9affad62293@infradead.org
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Reported-by: <syzbot+6bd77b88c1977c03f584@syzkaller.appspotmail.com>
Tested-by: Randy Dunlap <rdunlap@infradead.org>
Acked-by: Jeff Mahoney <jeffm@suse.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Jan Kara <jack@suse.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This playing with signals to allow only fatal signals appears to predate
the introduction of wait_event_killable(), and I'm fairly sure that
wait_event_killable is what was meant to happen here.
[avagin@openvz.org: use wake_up() instead of wake_up_interruptible]
Link: http://lkml.kernel.org/r/20180331022839.21277-1-avagin@openvz.org
Link: http://lkml.kernel.org/r/20180319191609.23880-1-willy@infradead.org
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Acked-by: Ian Kent <raven@themaw.net>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In a typical for /proc "open+read+close" usecase, dentry is looked up
successfully on open only to be killed in dput() on close. In fact
dentries which aren't /proc/*/... and /proc/sys/* were almost NEVER
CACHED. Simple printk in proc_lookup_de() shows that.
Now that ->delete hook intelligently picks which dentries should live in
dcache and which should not, rbtree caching is not necessary as dcache
does it job, at last!
As a side effect, struct proc_dir_entry shrinks by one pointer which can
go into inline name.
Link: http://lkml.kernel.org/r/20180314231032.GA15854@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Various subsystems can create files and directories in /proc with names
directly controlled by userspace.
Which means "/", "." and ".." are no-no.
"/" split is already taken care of, do the other 2 prohibited names.
Link: http://lkml.kernel.org/r/20180310001223.GB12443@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Florian Westphal <fw@strlen.de>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm_struct is not needed while printing as all the data was already
extracted.
Link: http://lkml.kernel.org/r/20180309223120.GC3843@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use seq_puts() and skip format string processing.
Link: http://lkml.kernel.org/r/20180309222948.GB3843@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
As soon as register_filesystem() exits, filesystem can be mounted. It
is better to present fully operational /proc.
Of course it doesn't matter because /proc is not modular but do it
anyway.
Drop error check, it should be handled by panicking.
Link: http://lkml.kernel.org/r/20180309222709.GA3843@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
I totally forgot that _parse_integer() accepts arbitrary amount of
leading zeroes leading to the following lookups:
OK
# readlink /proc/1/map_files/56427ecba000-56427eddc000
/lib/systemd/systemd
bogus
# readlink /proc/1/map_files/00000000000056427ecba000-56427eddc000
/lib/systemd/systemd
# readlink /proc/1/map_files/56427ecba000-00000000000056427eddc000
/lib/systemd/systemd
Link: http://lkml.kernel.org/r/20180303215130.GA23480@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Cyrill Gorcunov <gorcunov@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Pavel Emelyanov <xemul@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
"struct proc_dir_entry" is variable sized because of 0-length trailing
array for name, however, because of SLAB padding allocations it is
possible to make "struct proc_dir_entry" fixed sized and allocate same
amount of memory.
It buys fine-grained debugging with poisoning and usercopy protection
which is not possible with kmalloc-* caches.
Currently, on 32-bit 91+ byte allocations go into kmalloc-128 and on
64-bit 147+ byte allocations go to kmalloc-192 anyway.
Additional memory is allocated only for 38/46+ byte long names which are
rare or may not even exist in the wild.
Link: http://lkml.kernel.org/r/20180223205504.GA17139@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
proc_sys_link_fill_cache() does not need to check whether we're called
for a link - it's already done by scan().
Link: http://lkml.kernel.org/r/20180228013506.4915-2-danilokrummrich@dk-develop.de
Signed-off-by: Danilo Krummrich <danilokrummrich@dk-develop.de>
Acked-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: "Luis R . Rodriguez" <mcgrof@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
proc_sys_link_fill_cache() does not take currently unregistering sysctl
tables into account, which might result into a page fault in
sysctl_follow_link() - add a check to fix it.
This bug has been present since v3.4.
Link: http://lkml.kernel.org/r/20180228013506.4915-1-danilokrummrich@dk-develop.de
Fixes: 0e47c99d7fe25 ("sysctl: Replace root_list with links between sysctl_table_sets")
Signed-off-by: Danilo Krummrich <danilokrummrich@dk-develop.de>
Acked-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: "Luis R . Rodriguez" <mcgrof@kernel.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>