Certain properties in ovl_lookup_data should be set only for the last
element of the path. IOW, if we are calling ovl_lookup_single() for an
absolute redirect, then d->is_dir and d->opaque do not make much sense
for intermediate path elements. Instead set them only if dentry being
lookup is last path element.
As of now we do not seem to be making use of d->opaque if it is set for
a path/dentry in lower. But just define the semantics so that future code
can make use of this assumption.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
If we are looking in last layer, then there should not be any need to
process redirect. redirect information is used only for lookup in next
lower layer and there is no more lower layer to look into. So no need
to process redirects.
IOW, ignore redirects on lowest layer.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
When decoding a lower file handle, we need to check if lower file was
copied up and indexed and if it has a whiteout index, we need to check
if this is an unlinked but open non-dir before returning -ESTALE.
To find out if this is an unlinked but open non-dir we need to lookup
an overlay inode in inode cache by lower inode and that requires decoding
the lower file handle before looking in inode cache.
Before this change, if the lower inode turned out to be a directory, we
may have paid an expensive cost to reconnect that lower directory for
nothing.
After this change, we start by decoding a disconnected lower dentry and
using the lower inode for looking up an overlay inode in inode cache.
If we find overlay inode and dentry in cache, we avoid the index lookup
overhead. If we don't find an overlay inode and dentry in cache, then we
only need to decode a connected lower dentry in case the lower dentry is
a non-indexed directory.
The xfstests group overlay/exportfs tests decoding overlayfs file
handles after drop_caches with different states of the file at encode
and decode time. Overall the tests in the group call ovl_lower_fh_to_d()
89 times to decode a lower file handle.
Before this change, the tests called ovl_get_index_fh() 75 times and
reconnect_one() 61 times.
After this change, the tests call ovl_get_index_fh() 70 times and
reconnect_one() 59 times. The 2 cases where reconnect_one() was avoided
are cases where a non-upper directory file handle was encoded, then the
directory removed and then file handle was decoded.
To demonstrate the affect on decoding file handles with hot inode/dentry
cache, the drop_caches call in the tests was disabled. Without
drop_caches, there are no reconnect_one() calls at all before or after
the change. Before the change, there are 75 calls to ovl_get_index_fh(),
exactly as the case with drop_caches. After the change, there are only
10 calls to ovl_get_index_fh().
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
On lookup of non directory, we try to decode the origin file handle
stored in upper inode. The origin file handle is supposed to be decoded
to a disconnected non-dir dentry, which is fine, because we only need
the lower inode of a copy up origin.
However, if the origin file handle somehow turns out to be a directory
we pay the expensive cost of reconnecting the directory dentry, only to
get a mismatch file type and drop the dentry.
Optimize this case by explicitly opting out of reconnecting the dentry.
Opting-out of reconnect is done by passing a NULL acceptable callback
to exportfs_decode_fh().
While the case described above is a strange corner case that does not
really need to be optimized, the API added for this optimization will
be used by a following patch to optimize a more common case of decoding
an overlayfs file handle.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Rename ovl_encode_fh() to ovl_encode_real_fh() to differentiate from the
exportfs function ovl_encode_inode_fh() and change the latter to
ovl_encode_fh() to match the exportfs method name.
Rename ovl_decode_fh() to ovl_decode_real_fh() for consistency.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
For broken hardlinks, we do not return lower st_ino, so we should
also not return lower pseudo st_dev.
Fixes: a0c5ad307ac0 ("ovl: relax same fs constraint for constant st_ino")
Cc: <stable@vger.kernel.org> #v4.15
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
As of now if we encounter an opaque dir while looking for a dentry, we set
d->last=true. This means that there is no need to look further in any of
the lower layers. This works fine as long as there are no redirets or
relative redircts. But what if there is an absolute redirect on the
children dentry of opaque directory. We still need to continue to look into
next lower layer. This patch fixes it.
Here is an example to demonstrate the issue. Say you have following setup.
upper: /redirect (redirect=/a/b/c)
lower1: /a/[b]/c ([b] is opaque) (c has absolute redirect=/a/b/d/)
lower0: /a/b/d/foo
Now "redirect" dir should merge with lower1:/a/b/c/ and lower0:/a/b/d.
Note, despite the fact lower1:/a/[b] is opaque, we need to continue to look
into lower0 because children c has an absolute redirect.
Following is a reproducer.
Watch me make foo disappear:
$ mkdir lower middle upper work work2 merged
$ mkdir lower/origin
$ touch lower/origin/foo
$ mount -t overlay none merged/ \
-olowerdir=lower,upperdir=middle,workdir=work2
$ mkdir merged/pure
$ mv merged/origin merged/pure/redirect
$ umount merged
$ mount -t overlay none merged/ \
-olowerdir=middle:lower,upperdir=upper,workdir=work
$ mv merged/pure/redirect merged/redirect
Now you see foo inside a twice redirected merged dir:
$ ls merged/redirect
foo
$ umount merged
$ mount -t overlay none merged/ \
-olowerdir=middle:lower,upperdir=upper,workdir=work
After mount cycle you don't see foo inside the same dir:
$ ls merged/redirect
During middle layer lookup, the opaqueness of middle/pure is left in
the lookup state and then middle/pure/redirect is wrongly treated as
opaque.
Fixes: 02b69b284cd7 ("ovl: lookup redirects")
Cc: <stable@vger.kernel.org> #v4.10
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
d->last signifies that this is the last layer we are looking into and there
is no more. And that means this allows for some optimzation opportunities
during lookup. For example, in ovl_lookup_single() we don't have to check
for opaque xattr of a directory is this is the last layer we are looking
into (d->last = true).
But knowing for sure whether we are looking into last layer can be very
tricky. If redirects are not enabled, then we can look at poe->numlower and
figure out if the lookup we are about to is last layer or not. But if
redircts are enabled then it is possible poe->numlower suggests that we are
looking in last layer, but there is an absolute redirect present in found
element and that redirects us to a layer in root and that means lookup will
continue in lower layers further.
For example, consider following.
/upperdir/pure (opaque=y)
/upperdir/pure/foo (opaque=y,redirect=/bar)
/lowerdir/bar
In this case pure is "pure upper". When we look for "foo", that time
poe->numlower=0. But that alone does not mean that we will not search for a
merge candidate in /lowerdir. Absolute redirect changes that.
IOW, d->last should not be set just based on poe->numlower if redirects are
enabled. That can lead to setting d->last while it should not have and that
means we will not check for opaque xattr while we should have.
So do this.
- If redirects are not enabled, then continue to rely on poe->numlower
information to determine if it is last layer or not.
- If redirects are enabled, then set d->last = true only if this is the
last layer in root ovl_entry (roe).
Suggested-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Fixes: 02b69b284cd7 ("ovl: lookup redirects")
Cc: <stable@vger.kernel.org> #v4.10
Eddie Horng reported that readdir of an overlayfs directory that
was exported via NFSv3 returns entries with d_type set to DT_UNKNOWN.
The reason is that while preparing the response for readdirplus, nfsd
checks inside encode_entryplus_baggage() that a child dentry's inode
number matches the value of d_ino returns by overlayfs readdir iterator.
Because the overlayfs inodes use arbitrary inode numbers that are not
correlated with the values of st_ino/d_ino, NFSv3 falls back to not
encoding d_type. Although this is an allowed behavior, we can fix it for
the case of all overlayfs layers on the same underlying filesystem.
When NFS export is enabled and d_ino is consistent with st_ino
(samefs), set the same value also to i_ino in ovl_fill_inode() for all
overlayfs inodes, nfsd readdirplus sanity checks will pass.
ovl_fill_inode() may be called from ovl_new_inode(), before real inode
was created with ino arg 0. In that case, i_ino will be updated to real
upper inode i_ino on ovl_inode_init() or ovl_inode_update().
Reported-by: Eddie Horng <eddiehorng.tw@gmail.com>
Tested-by: Eddie Horng <eddiehorng.tw@gmail.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Fixes: 8383f1748829 ("ovl: wire up NFS export operations")
Cc: <stable@vger.kernel.org> #v4.16
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
* Since cifs_vfs_error was just using pr_debug_ratelimited like the rest
of cifs_dbg, move it there too
* Add a ONCE type flag to call the pr_xxx_once() debug function instead
of the ratelimited ones.
To convert existing printk_once() calls to this we can run:
perl -i -pE \
's/printk_once\s*\(([^" \n]+)(.*)/cifs_dbg(VFS|ONCE,$2/g' \
fs/cifs/*.c
Signed-off-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
On 32-bit (e.g. with m68k-linux-gnu-gcc-4.1):
fs/cifs/inode.c: In function ‘simple_hashstr’:
fs/cifs/inode.c:713: warning: integer constant is too large for ‘long’ type
Fixes: 7ea884c77e5c97f1 ("smb3: Fix root directory when server returns inode number of zero")
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Adding an extra debug message to show if a tree connect failure during
reconnect (and made it a log once so it doesn't spam the logs).
Saw a case recently where tree connect repeatedly returned
access denied on reconnect and it wasn't as easy to spot as it
should have been.
Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
tcon->ses is being dereferenced before it is null checked, hence
there is a potential null pointer dereference.
Fix this by moving the pointer dereference after tcon->ses has
been properly null checked.
Addresses-Coverity-ID: 1467426 ("Dereference before null check")
Fixes: 93012bf98416 ("cifs: add server->vals->header_preamble_size")
Signed-off-by: Gustavo A. R. Silva <garsilva@embeddedor.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Merge more updates from Andrew Morton:
- almost all of the rest of MM
- kasan updates
- lots of procfs work
- misc things
- lib/ updates
- checkpatch
- rapidio
- ipc/shm updates
- the start of willy's XArray conversion
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (140 commits)
page cache: use xa_lock
xarray: add the xa_lock to the radix_tree_root
fscache: use appropriate radix tree accessors
export __set_page_dirty
unicore32: turn flush_dcache_mmap_lock into a no-op
arm64: turn flush_dcache_mmap_lock into a no-op
mac80211_hwsim: use DEFINE_IDA
radix tree: use GFP_ZONEMASK bits of gfp_t for flags
linux/const.h: refactor _BITUL and _BITULL a bit
linux/const.h: move UL() macro to include/linux/const.h
linux/const.h: prefix include guard of uapi/linux/const.h with _UAPI
xen, mm: allow deferred page initialization for xen pv domains
elf: enforce MAP_FIXED on overlaying elf segments
fs, elf: drop MAP_FIXED usage from elf_map
mm: introduce MAP_FIXED_NOREPLACE
MAINTAINERS: update bouncing aacraid@adaptec.com addresses
fs/dcache.c: add cond_resched() in shrink_dentry_list()
include/linux/kfifo.h: fix comment
ipc/shm.c: shm_split(): remove unneeded test for NULL shm_file_data.vm_ops
kernel/sysctl.c: add kdoc comments to do_proc_do{u}intvec_minmax_conv_param
...
Remove the address_space ->tree_lock and use the xa_lock newly added to
the radix_tree_root. Rename the address_space ->page_tree to ->i_pages,
since we don't really care that it's a tree.
[willy@infradead.org: fix nds32, fs/dax.c]
Link: http://lkml.kernel.org/r/20180406145415.GB20605@bombadil.infradead.orgLink: http://lkml.kernel.org/r/20180313132639.17387-9-willy@infradead.org
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This results in no change in structure size on 64-bit machines as it
fits in the padding between the gfp_t and the void *. 32-bit machines
will grow the structure from 8 to 12 bytes. Almost all radix trees are
protected with (at least) a spinlock, so as they are converted from
radix trees to xarrays, the data structures will shrink again.
Initialising the spinlock requires a name for the benefit of lockdep, so
RADIX_TREE_INIT() now needs to know the name of the radix tree it's
initialising, and so do IDR_INIT() and IDA_INIT().
Also add the xa_lock() and xa_unlock() family of wrappers to make it
easier to use the lock. If we could rely on -fplan9-extensions in the
compiler, we could avoid all of this syntactic sugar, but that wasn't
added until gcc 4.6.
Link: http://lkml.kernel.org/r/20180313132639.17387-8-willy@infradead.org
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Don't open-code accesses to data structure internals.
Link: http://lkml.kernel.org/r/20180313132639.17387-7-willy@infradead.org
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
XFS currently contains a copy-and-paste of __set_page_dirty(). Export
it from buffer.c instead.
Link: http://lkml.kernel.org/r/20180313132639.17387-6-willy@infradead.org
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Acked-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Anshuman has reported that with "fs, elf: drop MAP_FIXED usage from
elf_map" applied, some ELF binaries in his environment fail to start
with
[ 23.423642] 9148 (sed): Uhuuh, elf segment at 0000000010030000 requested but the memory is mapped already
[ 23.423706] requested [10030000, 10040000] mapped [10030000, 10040000] 100073 anon
The reason is that the above binary has overlapping elf segments:
LOAD 0x0000000000000000 0x0000000010000000 0x0000000010000000
0x0000000000013a8c 0x0000000000013a8c R E 10000
LOAD 0x000000000001fd40 0x000000001002fd40 0x000000001002fd40
0x00000000000002c0 0x00000000000005e8 RW 10000
LOAD 0x0000000000020328 0x0000000010030328 0x0000000010030328
0x0000000000000384 0x00000000000094a0 RW 10000
That binary has two RW LOAD segments, the first crosses a page border
into the second
0x1002fd40 (LOAD2-vaddr) + 0x5e8 (LOAD2-memlen) == 0x10030328 (LOAD3-vaddr)
Handle this situation by enforcing MAP_FIXED when we establish a
temporary brk VMA to handle overlapping segments. All other mappings
will still use MAP_FIXED_NOREPLACE.
Link: http://lkml.kernel.org/r/20180213100440.GM3443@dhcp22.suse.cz
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com>
Cc: Andrei Vagin <avagin@openvz.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Kees Cook <keescook@chromium.org>
Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
Cc: Joel Stanley <joel@jms.id.au>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Mark Brown <broonie@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Both load_elf_interp and load_elf_binary rely on elf_map to map segments
on a controlled address and they use MAP_FIXED to enforce that. This is
however dangerous thing prone to silent data corruption which can be
even exploitable.
Let's take CVE-2017-1000253 as an example. At the time (before commit
eab09532d400: "binfmt_elf: use ELF_ET_DYN_BASE only for PIE")
ELF_ET_DYN_BASE was at TASK_SIZE / 3 * 2 which is not that far away from
the stack top on 32b (legacy) memory layout (only 1GB away). Therefore
we could end up mapping over the existing stack with some luck.
The issue has been fixed since then (a87938b2e246: "fs/binfmt_elf.c: fix
bug in loading of PIE binaries"), ELF_ET_DYN_BASE moved moved much
further from the stack (eab09532d400 and later by c715b72c1ba4: "mm:
revert x86_64 and arm64 ELF_ET_DYN_BASE base changes") and excessive
stack consumption early during execve fully stopped by da029c11e6b1
("exec: Limit arg stack to at most 75% of _STK_LIM"). So we should be
safe and any attack should be impractical. On the other hand this is
just too subtle assumption so it can break quite easily and hard to
spot.
I believe that the MAP_FIXED usage in load_elf_binary (et. al) is still
fundamentally dangerous. Moreover it shouldn't be even needed. We are
at the early process stage and so there shouldn't be unrelated mappings
(except for stack and loader) existing so mmap for a given address should
succeed even without MAP_FIXED. Something is terribly wrong if this is
not the case and we should rather fail than silently corrupt the
underlying mapping.
Address this issue by changing MAP_FIXED to the newly added
MAP_FIXED_NOREPLACE. This will mean that mmap will fail if there is an
existing mapping clashing with the requested one without clobbering it.
[mhocko@suse.com: fix build]
[akpm@linux-foundation.org: coding-style fixes]
[avagin@openvz.org: don't use the same value for MAP_FIXED_NOREPLACE and MAP_SYNC]
Link: http://lkml.kernel.org/r/20171218184916.24445-1-avagin@openvz.org
Link: http://lkml.kernel.org/r/20171213092550.2774-3-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrei Vagin <avagin@openvz.org>
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
Cc: Joel Stanley <joel@jms.id.au>
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
As previously reported (https://patchwork.kernel.org/patch/8642031/)
it's possible to call shrink_dentry_list with a large number of dentries
(> 10000). This, in turn, could trigger the softlockup detector and
possibly trigger a panic. In addition to the unmount path being
vulnerable to this scenario, at SuSE we've observed similar situation
happening during process exit on processes that touch a lot of dentries.
Here is an excerpt from a crash dump. The number after the colon are
the number of dentries on the list passed to shrink_dentry_list:
PID 99760: 10722
PID 107530: 215
PID 108809: 24134
PID 108877: 21331
PID 141708: 16487
So we want to kill between 15k-25k dentries without yielding.
And one possible call stack looks like:
4 [ffff8839ece41db0] _raw_spin_lock at ffffffff8152a5f8
5 [ffff8839ece41db0] evict at ffffffff811c3026
6 [ffff8839ece41dd0] __dentry_kill at ffffffff811bf258
7 [ffff8839ece41df0] shrink_dentry_list at ffffffff811bf593
8 [ffff8839ece41e18] shrink_dcache_parent at ffffffff811bf830
9 [ffff8839ece41e50] proc_flush_task at ffffffff8120dd61
10 [ffff8839ece41ec0] release_task at ffffffff81059ebd
11 [ffff8839ece41f08] do_exit at ffffffff8105b8ce
12 [ffff8839ece41f78] sys_exit at ffffffff8105bd53
13 [ffff8839ece41f80] system_call_fastpath at ffffffff81532909
While some of the callers of shrink_dentry_list do use cond_resched,
this is not sufficient to prevent softlockups. So just move
cond_resched into shrink_dentry_list from its callers.
David said: I've found hundreds of occurrences of warnings that we emit
when need_resched stays set for a prolonged period of time with the
stack trace that is included in the change log.
Link: http://lkml.kernel.org/r/1521718946-31521-1-git-send-email-nborisov@suse.com
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Goldwyn Rodrigues <rgoldwyn@suse.de>
Cc: Jeff Mahoney <jeffm@suse.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "ipc: Clamp *mni to the real IPCMNI limit", v3.
The sysctl parameters msgmni, shmmni and semmni have an inherent limit
of IPC_MNI (32k). However, users may not be aware of that because they
can write a value much higher than that without getting any error or
notification. Reading the parameters back will show the newly written
values which are not real.
Enforcing the limit by failing sysctl parameter write, however, can
break existing user applications. To address this delemma, a new flags
field is introduced into the ctl_table. The value CTL_FLAGS_CLAMP_RANGE
can be added to any ctl_table entries to enable a looser range clamping
without returning any error. For example,
.flags = CTL_FLAGS_CLAMP_RANGE,
This flags value are now used for the range checking of shmmni, msgmni
and semmni without breaking existing applications. If any out of range
value is written to those sysctl parameters, the following warning will
be printed instead.
Kernel parameter "shmmni" was set out of range [0, 32768], clamped to 32768.
Reading the values back will show 32768 instead of some fake values.
This patch (of 6):
Fix a typo.
Link: http://lkml.kernel.org/r/1519926220-7453-2-git-send-email-longman@redhat.com
Signed-off-by: Waiman Long <longman@redhat.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Luis R. Rodriguez <mcgrof@kernel.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since the stack rlimit is used in multiple places during exec and it can
be changed via other threads (via setrlimit()) or processes (via
prlimit()), the assumption that the value doesn't change cannot be made.
This leads to races with mm layout selection and argument size
calculations. This changes the exec path to use the rlimit stored in
bprm instead of in current. Before starting the thread, the bprm stack
rlimit is stored back to current.
Link: http://lkml.kernel.org/r/1518638796-20819-4-git-send-email-keescook@chromium.org
Fixes: 64701dee4178e ("exec: Use sane stack rlimit under secureexec")
Signed-off-by: Kees Cook <keescook@chromium.org>
Reported-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
Reported-by: Andy Lutomirski <luto@kernel.org>
Reported-by: Brad Spengler <spender@grsecurity.net>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Ben Hutchings <ben@decadent.org.uk>
Cc: Greg KH <greg@kroah.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: "Jason A. Donenfeld" <Jason@zx2c4.com>
Cc: Laura Abbott <labbott@redhat.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Willy Tarreau <w@1wt.eu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Provide a final callback into fs/exec.c before start_thread() takes
over, to handle any last-minute changes, like the coming restoration of
the stack limit.
Link: http://lkml.kernel.org/r/1518638796-20819-3-git-send-email-keescook@chromium.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ben Hutchings <ben@decadent.org.uk>
Cc: Ben Hutchings <ben.hutchings@codethink.co.uk>
Cc: Brad Spengler <spender@grsecurity.net>
Cc: Greg KH <greg@kroah.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: "Jason A. Donenfeld" <Jason@zx2c4.com>
Cc: Laura Abbott <labbott@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Willy Tarreau <w@1wt.eu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "exec: Pin stack limit during exec".
Attempts to solve problems with the stack limit changing during exec
continue to be frustrated[1][2]. In addition to the specific issues
around the Stack Clash family of flaws, Andy Lutomirski pointed out[3]
other places during exec where the stack limit is used and is assumed to
be unchanging. Given the many places it gets used and the fact that it
can be manipulated/raced via setrlimit() and prlimit(), I think the only
way to handle this is to move away from the "current" view of the stack
limit and instead attach it to the bprm, and plumb this down into the
functions that need to know the stack limits. This series implements
the approach.
[1] 04e35f4495dd ("exec: avoid RLIMIT_STACK races with prlimit()")
[2] 779f4e1c6c7c ("Revert "exec: avoid RLIMIT_STACK races with prlimit()"")
[3] to security@kernel.org, "Subject: existing rlimit races?"
This patch (of 3):
Since it is possible that the stack rlimit can change externally during
exec (either via another thread calling setrlimit() or another process
calling prlimit()), provide a way to pass the rlimit down into the
per-architecture mm layout functions so that the rlimit can stay in the
bprm structure instead of sitting in the signal structure until exec is
finalized.
Link: http://lkml.kernel.org/r/1518638796-20819-2-git-send-email-keescook@chromium.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Ben Hutchings <ben@decadent.org.uk>
Cc: Willy Tarreau <w@1wt.eu>
Cc: Hugh Dickins <hughd@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: "Jason A. Donenfeld" <Jason@zx2c4.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Laura Abbott <labbott@redhat.com>
Cc: Greg KH <greg@kroah.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ben Hutchings <ben.hutchings@codethink.co.uk>
Cc: Brad Spengler <spender@grsecurity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
All it takes to open a file and read 1 byte from it.
seq_file will be allocated along with any private allocations, and more
importantly seq file buffer which is 1 page by default.
Link: http://lkml.kernel.org/r/20180310085252.GB17121@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Glauber Costa <glommer@gmail.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
One use of the reiserfs_warning() macro in journal_init_dev() is missing
a parameter, causing the following warning:
REISERFS warning (device loop0): journal_init_dev: Cannot open '%s': %i journal_init_dev:
This also causes a WARN_ONCE() warning in the vsprintf code, and then a
panic if panic_on_warn is set.
Please remove unsupported %/ in format string
WARNING: CPU: 1 PID: 4480 at lib/vsprintf.c:2138 format_decode+0x77f/0x830 lib/vsprintf.c:2138
Kernel panic - not syncing: panic_on_warn set ...
Just add another string argument to the macro invocation.
Addresses https://syzkaller.appspot.com/bug?id=0627d4551fdc39bf1ef5d82cd9eef587047f7718
Link: http://lkml.kernel.org/r/d678ebe1-6f54-8090-df4c-b9affad62293@infradead.org
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Reported-by: <syzbot+6bd77b88c1977c03f584@syzkaller.appspotmail.com>
Tested-by: Randy Dunlap <rdunlap@infradead.org>
Acked-by: Jeff Mahoney <jeffm@suse.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Jan Kara <jack@suse.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This playing with signals to allow only fatal signals appears to predate
the introduction of wait_event_killable(), and I'm fairly sure that
wait_event_killable is what was meant to happen here.
[avagin@openvz.org: use wake_up() instead of wake_up_interruptible]
Link: http://lkml.kernel.org/r/20180331022839.21277-1-avagin@openvz.org
Link: http://lkml.kernel.org/r/20180319191609.23880-1-willy@infradead.org
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Acked-by: Ian Kent <raven@themaw.net>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In a typical for /proc "open+read+close" usecase, dentry is looked up
successfully on open only to be killed in dput() on close. In fact
dentries which aren't /proc/*/... and /proc/sys/* were almost NEVER
CACHED. Simple printk in proc_lookup_de() shows that.
Now that ->delete hook intelligently picks which dentries should live in
dcache and which should not, rbtree caching is not necessary as dcache
does it job, at last!
As a side effect, struct proc_dir_entry shrinks by one pointer which can
go into inline name.
Link: http://lkml.kernel.org/r/20180314231032.GA15854@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Various subsystems can create files and directories in /proc with names
directly controlled by userspace.
Which means "/", "." and ".." are no-no.
"/" split is already taken care of, do the other 2 prohibited names.
Link: http://lkml.kernel.org/r/20180310001223.GB12443@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Florian Westphal <fw@strlen.de>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm_struct is not needed while printing as all the data was already
extracted.
Link: http://lkml.kernel.org/r/20180309223120.GC3843@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use seq_puts() and skip format string processing.
Link: http://lkml.kernel.org/r/20180309222948.GB3843@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
As soon as register_filesystem() exits, filesystem can be mounted. It
is better to present fully operational /proc.
Of course it doesn't matter because /proc is not modular but do it
anyway.
Drop error check, it should be handled by panicking.
Link: http://lkml.kernel.org/r/20180309222709.GA3843@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
I totally forgot that _parse_integer() accepts arbitrary amount of
leading zeroes leading to the following lookups:
OK
# readlink /proc/1/map_files/56427ecba000-56427eddc000
/lib/systemd/systemd
bogus
# readlink /proc/1/map_files/00000000000056427ecba000-56427eddc000
/lib/systemd/systemd
# readlink /proc/1/map_files/56427ecba000-00000000000056427eddc000
/lib/systemd/systemd
Link: http://lkml.kernel.org/r/20180303215130.GA23480@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Cyrill Gorcunov <gorcunov@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Pavel Emelyanov <xemul@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
"struct proc_dir_entry" is variable sized because of 0-length trailing
array for name, however, because of SLAB padding allocations it is
possible to make "struct proc_dir_entry" fixed sized and allocate same
amount of memory.
It buys fine-grained debugging with poisoning and usercopy protection
which is not possible with kmalloc-* caches.
Currently, on 32-bit 91+ byte allocations go into kmalloc-128 and on
64-bit 147+ byte allocations go to kmalloc-192 anyway.
Additional memory is allocated only for 38/46+ byte long names which are
rare or may not even exist in the wild.
Link: http://lkml.kernel.org/r/20180223205504.GA17139@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
proc_sys_link_fill_cache() does not need to check whether we're called
for a link - it's already done by scan().
Link: http://lkml.kernel.org/r/20180228013506.4915-2-danilokrummrich@dk-develop.de
Signed-off-by: Danilo Krummrich <danilokrummrich@dk-develop.de>
Acked-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: "Luis R . Rodriguez" <mcgrof@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
proc_sys_link_fill_cache() does not take currently unregistering sysctl
tables into account, which might result into a page fault in
sysctl_follow_link() - add a check to fix it.
This bug has been present since v3.4.
Link: http://lkml.kernel.org/r/20180228013506.4915-1-danilokrummrich@dk-develop.de
Fixes: 0e47c99d7fe25 ("sysctl: Replace root_list with links between sysctl_table_sets")
Signed-off-by: Danilo Krummrich <danilokrummrich@dk-develop.de>
Acked-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: "Luis R . Rodriguez" <mcgrof@kernel.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
get_wchan() accesses stack page before permissions are checked, let's
not play this game.
Link: http://lkml.kernel.org/r/20180217071923.GA16074@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Shevchenko <andy.shevchenko@gmail.com>
Cc: Rasmus Villemoes <rasmus.villemoes@prevas.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
seq_printf() works slower than seq_puts, seq_puts, etc.
== test_proc.c
int main(int argc, char **argv)
{
int n, i, fd;
char buf[16384];
n = atoi(argv[1]);
for (i = 0; i < n; i++) {
fd = open(argv[2], O_RDONLY);
if (fd < 0)
return 1;
if (read(fd, buf, sizeof(buf)) <= 0)
return 1;
close(fd);
}
return 0;
}
==
$ time ./test_proc 1000000 /proc/1/status
== Before path ==
real 0m5.171s
user 0m0.328s
sys 0m4.783s
== After patch ==
real 0m4.761s
user 0m0.334s
sys 0m4.366s
Link: http://lkml.kernel.org/r/20180212074931.7227-4-avagin@openvz.org
Signed-off-by: Andrei Vagin <avagin@openvz.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A delimiter is a string which is printed before a number. A
syngle-symbol delimiters can be printed by set_putc() and this works
faster than printing by set_puts().
== test_proc.c
int main(int argc, char **argv)
{
int n, i, fd;
char buf[16384];
n = atoi(argv[1]);
for (i = 0; i < n; i++) {
fd = open(argv[2], O_RDONLY);
if (fd < 0)
return 1;
if (read(fd, buf, sizeof(buf)) <= 0)
return 1;
close(fd);
}
return 0;
}
==
$ time ./test_proc 1000000 /proc/1/stat
== Before patch ==
real 0m3.820s
user 0m0.337s
sys 0m3.394s
== After patch ==
real 0m3.110s
user 0m0.324s
sys 0m2.700s
Link: http://lkml.kernel.org/r/20180212074931.7227-3-avagin@openvz.org
Signed-off-by: Andrei Vagin <avagin@openvz.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
seq_put_decimal_ull_w(m, str, val, width) prints a decimal number with a
specified minimal field width.
It is equivalent of seq_printf(m, "%s%*d", str, width, val), but it
works much faster.
== test_smaps.py
num = 0
with open("/proc/1/smaps") as f:
for x in xrange(10000):
data = f.read()
f.seek(0, 0)
==
== Before patch ==
$ time python test_smaps.py
real 0m4.593s
user 0m0.398s
sys 0m4.158s
== After patch ==
$ time python test_smaps.py
real 0m3.828s
user 0m0.413s
sys 0m3.408s
$ perf -g record python test_smaps.py
== Before patch ==
- 79.01% 3.36% python [kernel.kallsyms] [k] show_smap.isra.33
- 75.65% show_smap.isra.33
+ 48.85% seq_printf
+ 15.75% __walk_page_range
+ 9.70% show_map_vma.isra.23
0.61% seq_puts
== After patch ==
- 75.51% 4.62% python [kernel.kallsyms] [k] show_smap.isra.33
- 70.88% show_smap.isra.33
+ 24.82% seq_put_decimal_ull_w
+ 19.78% __walk_page_range
+ 12.74% seq_printf
+ 11.08% show_map_vma.isra.23
+ 1.68% seq_puts
[akpm@linux-foundation.org: fix drivers/of/unittest.c build]
Link: http://lkml.kernel.org/r/20180212074931.7227-1-avagin@openvz.org
Signed-off-by: Andrei Vagin <avagin@openvz.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The allocation is persistent in fact as any fool can open a file in
/proc and sit on it.
Link: http://lkml.kernel.org/r/20180214082409.GC17157@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
"struct pde_opener" is fixed size and we can have more granular approach
to debugging.
For those who don't know, per cache SLUB poisoning and red zoning don't
work if there is at least one object allocated which is hopeless in case
of kmalloc-64 but not in case of standalone cache. Although systemd
opens 2 files from the get go, so it is hopeless after all.
Link: http://lkml.kernel.org/r/20180214082306.GB17157@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>