The application currently has no way of knowing if a given opcode is
supported or not without having to try and issue one and see if we get
-EINVAL or not. And even this approach is fraught with peril, as maybe
we're getting -EINVAL due to some fields being missing, or maybe it's
just not that easy to issue that particular command without doing some
other leg work in terms of setup first.
This adds IORING_REGISTER_PROBE, which fills in a structure with info
on what it supported or not. This will work even with sparse opcode
fields, which may happen in the future or even today if someone
backports specific features to older kernels.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
For some test apps at least, user_data is just zeroes. So it's not a
good way to tell what the command actually is. Add the opcode to the
issue trace point.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add support for the new openat2(2) system call. It's trivial to do, as
we can have openat(2) just be wrapped around it.
Suggested-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If an application is using eventfd notifications with poll to know when
new SQEs can be issued, it's expecting the following read/writes to
complete inline. And with that, it knows that there are events available,
and don't want spurious wakeups on the eventfd for those requests.
This adds IORING_REGISTER_EVENTFD_ASYNC, which works just like
IORING_REGISTER_EVENTFD, except it only triggers notifications for events
that happen from async completions (IRQ, or io-wq worker completions).
Any completions inline from the submission itself will not trigger
notifications.
Suggested-by: Mark Papadakis <markuspapadakis@icloud.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Some applications like to start small in terms of ring size, and then
ramp up as needed. This is a bit tricky to do currently, since we don't
advertise the max ring size.
This adds IORING_SETUP_CLAMP. If set, and the values for SQ or CQ ring
size exceed what we support, then clamp them at the max values instead
of returning -EINVAL. Since we return the chosen ring sizes after setup,
no further changes are needed on the application side. io_uring already
changes the ring sizes if the application doesn't ask for power-of-two
sizes, for example.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add percpu_ref_tryget_many(), which works the same way as
percpu_ref_tryget(), but grabs specified number of refs.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Dennis Zhou <dennis@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This adds support for doing madvise(2) through io_uring. We assume that
any operation can block, and hence punt everything async. This could be
improved, but hard to make bullet proof. The async punt ensures it's
safe.
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This is in preparation for enabling this functionality through io_uring.
Add a helper that is just exporting what sys_madvise() does, and have the
system call use it.
No functional changes in this patch.
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This adds support for doing fadvise through io_uring. We assume that
WILLNEED doesn't block, but that DONTNEED may block.
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This behaves like preadv2/pwritev2 with offset == -1, it'll use (and
update) the current file position. This obviously comes with the caveat
that if the application has multiple read/writes in flight, then the
end result will not be as expected. This is similar to threads sharing
a file descriptor and doing IO using the current file position.
Since this feature isn't easily detectable by doing a read or write,
add a feature flags, IORING_FEAT_RW_CUR_POS, to allow applications to
detect presence of this feature.
Reported-by: 李通洲 <carter.li@eoitek.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
For uses cases that don't already naturally have an iovec, it's easier
(or more convenient) to just use a buffer address + length. This is
particular true if the use case is from languages that want to create
a memory safe abstraction on top of io_uring, and where introducing
the need for the iovec may impose an ownership issue. For those cases,
they currently need an indirection buffer, which means allocating data
just for this purpose.
Add basic read/write that don't require the iovec.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
io_uring defaults to always doing inline submissions, if at all
possible. But for larger copies, even if the data is fully cached, that
can take a long time. Add an IOSQE_ASYNC flag that the application can
set on the SQE - if set, it'll ensure that we always go async for those
kinds of requests. Use the io-wq IO_WQ_WORK_CONCURRENT flag to ensure we
get the concurrency we desire for this case.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We currently fully quiesce the ring before an unregister or update of
the fixed fileset. This is very expensive, and we can be a bit smarter
about this.
Add a percpu refcount for the file tables as a whole. Grab a percpu ref
when we use a registered file, and put it on completion. This is cheap
to do. Upon removal of a file from a set, switch the ref count to atomic
mode. When we hit zero ref on the completion side, then we know we can
drop the previously registered files. When the old files have been
dropped, switch the ref back to percpu mode for normal operation.
Since there's a period between doing the update and the kernel being
done with it, add a IORING_OP_FILES_UPDATE opcode that can perform the
same action. The application knows the update has completed when it gets
the CQE for it. Between doing the update and receiving this completion,
the application must continue to use the unregistered fd if submitting
IO on this particular file.
This takes the runtime of test/file-register from liburing from 14s to
about 0.7s.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This works just like close(2), unsurprisingly. We remove the file
descriptor and post the completion inline, then offload the actual
(potential) last file put to async context.
Mark the async part of this work as uncancellable, as we really must
guarantee that the latter part of the close is run.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This works just like openat(2), except it can be performed async. For
the normal case of a non-blocking path lookup this will complete
inline. If we have to do IO to perform the open, it'll be done from
async context.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
fds field of struct io_uring_files_update is problematic with regards
to compat user space, as pointer size is different in 32-bit, 32-on-64-bit,
and 64-bit user space. In order to avoid custom handling of compat in
the syscall implementation, make fds __u64 and use u64_to_user_ptr in
order to retrieve it. Also, align the field naturally and check that
no garbage is passed there.
Fixes: c3a31e6056 ("io_uring: add support for IORING_REGISTER_FILES_UPDATE")
Signed-off-by: Eugene Syromiatnikov <esyr@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pull networking fixes from David Miller:
1) Fix non-blocking connect() in x25, from Martin Schiller.
2) Fix spurious decryption errors in kTLS, from Jakub Kicinski.
3) Netfilter use-after-free in mtype_destroy(), from Cong Wang.
4) Limit size of TSO packets properly in lan78xx driver, from Eric
Dumazet.
5) r8152 probe needs an endpoint sanity check, from Johan Hovold.
6) Prevent looping in tcp_bpf_unhash() during sockmap/tls free, from
John Fastabend.
7) hns3 needs short frames padded on transmit, from Yunsheng Lin.
8) Fix netfilter ICMP header corruption, from Eyal Birger.
9) Fix soft lockup when low on memory in hns3, from Yonglong Liu.
10) Fix NTUPLE firmware command failures in bnxt_en, from Michael Chan.
11) Fix memory leak in act_ctinfo, from Eric Dumazet.
* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (91 commits)
cxgb4: reject overlapped queues in TC-MQPRIO offload
cxgb4: fix Tx multi channel port rate limit
net: sched: act_ctinfo: fix memory leak
bnxt_en: Do not treat DSN (Digital Serial Number) read failure as fatal.
bnxt_en: Fix ipv6 RFS filter matching logic.
bnxt_en: Fix NTUPLE firmware command failures.
net: systemport: Fixed queue mapping in internal ring map
net: dsa: bcm_sf2: Configure IMP port for 2Gb/sec
net: dsa: sja1105: Don't error out on disabled ports with no phy-mode
net: phy: dp83867: Set FORCE_LINK_GOOD to default after reset
net: hns: fix soft lockup when there is not enough memory
net: avoid updating qdisc_xmit_lock_key in netdev_update_lockdep_key()
net/sched: act_ife: initalize ife->metalist earlier
netfilter: nat: fix ICMP header corruption on ICMP errors
net: wan: lapbether.c: Use built-in RCU list checking
netfilter: nf_tables: fix flowtable list del corruption
netfilter: nf_tables: fix memory leak in nf_tables_parse_netdev_hooks()
netfilter: nf_tables: remove WARN and add NLA_STRING upper limits
netfilter: nft_tunnel: ERSPAN_VERSION must not be null
netfilter: nft_tunnel: fix null-attribute check
...
core mst:
- serialize down messages and clear timeslots are on unplug
amdgpu:
- Update golden settings for renoir
- eDP fix
i915:
- uAPI fix: Remove dash and colon from PMU names to comply with tools/perf
- Fix for include file that was indirectly included
- Two fixes to make sure VMA are marked active for error capture
virtio:
- maintain obj reservation lock when submitting cmds
rockchip:
- increase link rate var size to accommodate rates
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJeI3ZMAAoJEAx081l5xIa+iuUQAIIMTL86+QPr3CSu8fU7albJ
MQWOzycWn0xP0KNYIvKz2X6dePWAEDi2kMM+6M/htzi32XmyG0oiAat2wAGkK3ir
AtfOnbKdD6T0i5wwkAn8SvNYHOTHdg7HMKTzPw2rxlTlZ8IEdVxo1CZqHZMFg0PV
UiK2VWVBPmBwdRgLu5K8XJy5LKbpx9BMWW5KdS96L2lKNVV3qeH3Q7NvAh1PtJp4
W+83OsYxYNwS0//4v0i0MjCGVXq5qVw7tzplAPW2OpUCMZ4TtVUnW2/RcrqvU9/b
v18M7thf2uccMXs7E1gyHgvrZp2lw6KsyJEpPnIp4k+2OXM0BVe9l4Pl0uUQhrDH
EUvCkpmP1/Zu44CJhtmVAZlVyy3ShNY5NeF6OsCZDSL7z60YMZHFVKF1IgrBkOJF
mXXq4eb0rMh3At6ABNliRo/bzEYfGgVy1xxNzOoTKCp6pNK/OGy4xQ/DEAkZbBBx
ul6E+NjaHhul1daWqpuv37oxsjE0RI7asFBomC5HTNWVZHY8MN8+9p/TTZQJ8n0l
yxMlPy2OOWVfwclwRSLant3Ny1hXNysMpWXMtrEtcB3Q5GI7H4ZCNh9b+VUPLjXT
JPZU2ZSIjrYNBeKYjVGV75HKjyMWJv1aB3kiXq7lX1yFq7wQLd5nE5mjTSD9wr1I
ZnhFSbHLcR7VMWUldPPi
=Vwa3
-----END PGP SIGNATURE-----
Merge tag 'drm-fixes-2020-01-19' of git://anongit.freedesktop.org/drm/drm
Pull drm fixes from Dave Airlie:
"Back from LCA2020, fixes wasn't too busy last week, seems to have
quieten down appropriately, some amdgpu, i915, then a core mst fix and
one fix for virtio-gpu and one for rockchip:
core mst:
- serialize down messages and clear timeslots are on unplug
amdgpu:
- Update golden settings for renoir
- eDP fix
i915:
- uAPI fix: Remove dash and colon from PMU names to comply with
tools/perf
- Fix for include file that was indirectly included
- Two fixes to make sure VMA are marked active for error capture
virtio:
- maintain obj reservation lock when submitting cmds
rockchip:
- increase link rate var size to accommodate rates"
* tag 'drm-fixes-2020-01-19' of git://anongit.freedesktop.org/drm/drm:
drm/amd/display: Reorder detect_edp_sink_caps before link settings read.
drm/amdgpu: update goldensetting for renoir
drm/dp_mst: Have DP_Tx send one msg at a time
drm/dp_mst: clear time slots for ports invalid
drm/i915/pmu: Do not use colons or dashes in PMU names
drm/rockchip: fix integer type used for storing dp data rate
drm/i915/gt: Mark ring->vma as active while pinned
drm/i915/gt: Mark context->state vma as active while pinned
drm/i915/gt: Skip trying to unbind in restore_ggtt_mappings
drm/i915: Add missing include file <linux/math64.h>
drm/virtio: add missing virtio_gpu_array_lock_resv call
Pull rseq fixes from Ingo Molnar:
"Two rseq bugfixes:
- CLONE_VM !CLONE_THREAD didn't work properly, the kernel would end
up corrupting the TLS of the parent. Technically a change in the
ABI but the previous behavior couldn't resonably have been relied
on by applications so this looks like a valid exception to the ABI
rule.
- Make the RSEQ_FLAG_UNREGISTER ABI behavior consistent with the
handling of other flags. This is not thought to impact any
applications either"
* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
rseq: Unregister rseq for clone CLONE_VM
rseq: Reject unknown flags on rseq unregister
/* Background. */
For a very long time, extending openat(2) with new features has been
incredibly frustrating. This stems from the fact that openat(2) is
possibly the most famous counter-example to the mantra "don't silently
accept garbage from userspace" -- it doesn't check whether unknown flags
are present[1].
This means that (generally) the addition of new flags to openat(2) has
been fraught with backwards-compatibility issues (O_TMPFILE has to be
defined as __O_TMPFILE|O_DIRECTORY|[O_RDWR or O_WRONLY] to ensure old
kernels gave errors, since it's insecure to silently ignore the
flag[2]). All new security-related flags therefore have a tough road to
being added to openat(2).
Userspace also has a hard time figuring out whether a particular flag is
supported on a particular kernel. While it is now possible with
contemporary kernels (thanks to [3]), older kernels will expose unknown
flag bits through fcntl(F_GETFL). Giving a clear -EINVAL during
openat(2) time matches modern syscall designs and is far more
fool-proof.
In addition, the newly-added path resolution restriction LOOKUP flags
(which we would like to expose to user-space) don't feel related to the
pre-existing O_* flag set -- they affect all components of path lookup.
We'd therefore like to add a new flag argument.
Adding a new syscall allows us to finally fix the flag-ignoring problem,
and we can make it extensible enough so that we will hopefully never
need an openat3(2).
/* Syscall Prototype. */
/*
* open_how is an extensible structure (similar in interface to
* clone3(2) or sched_setattr(2)). The size parameter must be set to
* sizeof(struct open_how), to allow for future extensions. All future
* extensions will be appended to open_how, with their zero value
* acting as a no-op default.
*/
struct open_how { /* ... */ };
int openat2(int dfd, const char *pathname,
struct open_how *how, size_t size);
/* Description. */
The initial version of 'struct open_how' contains the following fields:
flags
Used to specify openat(2)-style flags. However, any unknown flag
bits or otherwise incorrect flag combinations (like O_PATH|O_RDWR)
will result in -EINVAL. In addition, this field is 64-bits wide to
allow for more O_ flags than currently permitted with openat(2).
mode
The file mode for O_CREAT or O_TMPFILE.
Must be set to zero if flags does not contain O_CREAT or O_TMPFILE.
resolve
Restrict path resolution (in contrast to O_* flags they affect all
path components). The current set of flags are as follows (at the
moment, all of the RESOLVE_ flags are implemented as just passing
the corresponding LOOKUP_ flag).
RESOLVE_NO_XDEV => LOOKUP_NO_XDEV
RESOLVE_NO_SYMLINKS => LOOKUP_NO_SYMLINKS
RESOLVE_NO_MAGICLINKS => LOOKUP_NO_MAGICLINKS
RESOLVE_BENEATH => LOOKUP_BENEATH
RESOLVE_IN_ROOT => LOOKUP_IN_ROOT
open_how does not contain an embedded size field, because it is of
little benefit (userspace can figure out the kernel open_how size at
runtime fairly easily without it). It also only contains u64s (even
though ->mode arguably should be a u16) to avoid having padding fields
which are never used in the future.
Note that as a result of the new how->flags handling, O_PATH|O_TMPFILE
is no longer permitted for openat(2). As far as I can tell, this has
always been a bug and appears to not be used by userspace (and I've not
seen any problems on my machines by disallowing it). If it turns out
this breaks something, we can special-case it and only permit it for
openat(2) but not openat2(2).
After input from Florian Weimer, the new open_how and flag definitions
are inside a separate header from uapi/linux/fcntl.h, to avoid problems
that glibc has with importing that header.
/* Testing. */
In a follow-up patch there are over 200 selftests which ensure that this
syscall has the correct semantics and will correctly handle several
attack scenarios.
In addition, I've written a userspace library[4] which provides
convenient wrappers around openat2(RESOLVE_IN_ROOT) (this is necessary
because no other syscalls support RESOLVE_IN_ROOT, and thus lots of care
must be taken when using RESOLVE_IN_ROOT'd file descriptors with other
syscalls). During the development of this patch, I've run numerous
verification tests using libpathrs (showing that the API is reasonably
usable by userspace).
/* Future Work. */
Additional RESOLVE_ flags have been suggested during the review period.
These can be easily implemented separately (such as blocking auto-mount
during resolution).
Furthermore, there are some other proposed changes to the openat(2)
interface (the most obvious example is magic-link hardening[5]) which
would be a good opportunity to add a way for userspace to restrict how
O_PATH file descriptors can be re-opened.
Another possible avenue of future work would be some kind of
CHECK_FIELDS[6] flag which causes the kernel to indicate to userspace
which openat2(2) flags and fields are supported by the current kernel
(to avoid userspace having to go through several guesses to figure it
out).
[1]: https://lwn.net/Articles/588444/
[2]: https://lore.kernel.org/lkml/CA+55aFyyxJL1LyXZeBsf2ypriraj5ut1XkNDsunRBqgVjZU_6Q@mail.gmail.com
[3]: commit 629e014bb8 ("fs: completely ignore unknown open flags")
[4]: https://sourceware.org/bugzilla/show_bug.cgi?id=17523
[5]: https://lore.kernel.org/lkml/20190930183316.10190-2-cyphar@cyphar.com/
[6]: https://youtu.be/ggD-eb3yPVs
Suggested-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
rockchip: increase link rate var size to accommodate rates (Tobias)
mst: serialize down messages and clear timeslots are on unplug (Wayne)
Cc: Gerd Hoffmann <kraxel@redhat.com>
Cc: Tobias Schramm <t.schramm@manjaro.org>
Cc: Wayne Lin <Wayne.Lin@amd.com>
-----BEGIN PGP SIGNATURE-----
iQEzBAABCgAdFiEEHF6rntfJ3enn8gh8cywAJXLcr3kFAl4gjvAACgkQcywAJXLc
r3knlgf8C0szUNBmHkmwuAivHLJz6XHKCtGSfPrZ3M67KDwU563BRXlzHEA3/q7p
L3+RUW9CrohyMWA51fbmNJOR54mpy0zAu2ejLFll0Ntcw5CBbiYQ2ToBMmhW56h9
lM2EDkD0stx/JYmumpU6Vc8o0CG3asiJsU7IfQZ4n2kXqboyHho+E74qri52aPtD
ZZfXLmI9Yju8j0x0AjkZWxC6gfNWXH/HEnsC9X86JF+QXcnw5cVeHvi43doPjvs/
sZQCeaRAhseJXpAPaiBhVemF8+jkwX2CZ0sggxZZP9BB26PEwonqWHj9ixQlYLyD
bVC4Wson59D/0W1QK4R/AADpIO80IQ==
=iVHE
-----END PGP SIGNATURE-----
Merge tag 'drm-misc-fixes-2020-01-16' of git://anongit.freedesktop.org/drm/drm-misc into drm-fixes
virtio: maintain obj reservation lock when submitting cmds (Gerd)
rockchip: increase link rate var size to accommodate rates (Tobias)
mst: serialize down messages and clear timeslots are on unplug (Wayne)
Cc: Gerd Hoffmann <kraxel@redhat.com>
Cc: Tobias Schramm <t.schramm@manjaro.org>
Cc: Wayne Lin <Wayne.Lin@amd.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Sean Paul <sean@poorly.run>
Link: https://patchwork.freedesktop.org/patch/msgid/20200116162856.GA11524@art_vandelay
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl4hOL0QHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpsJ9D/9hGO5/8LqHYipVQtBeVwaf33l3tF9eYgJ7
1Jdh7rQuopCiiLAO4DCboiMz4f9vSvLGgPcNNfPMbD6Bx7C0axXEaO6gzy19U7vk
ZTPPwBBzrCgiProE6Gb1QS/iNXLldJswoS1AlKCqwNQOkqzQ5BZ4QDiMpPmJ/MKj
ea8aK1UrHRz7eXuf1xSVOMkc9krGEWz581jvAZgoc8+Q64nGvWLIF4s+GK+kpY+C
Tsmda/pxCuyhG/RossCfX06j6UsiYbyGiXgrXszjt5QuvzGxtmPf9jZKVsed4K0m
9SOENctY5fjEVViwYfSKxikB5bQ98OalGo/Ad+FdetKeIEIg0uWXboOOYBgSgUMF
AYuxE91NjrvIbL2+Jt9NNFuIGUgXdTM/JXN5D8u5mb64psX/3QdZsBwEbivvmpGA
6nkdgX/x8Y9t95BDs4PQ/CgWneTxQXTnDUTdlbo6Av8WQXroEeXWHzrjUE/O0Po1
dm4n493Arlsn2+LRgG3hgiPbifQvJcYxBJwOe7uXMUuaWO2iOaFEsFOP5Dneit8R
vDL7cwkv0/xzzrUPF61RZjfQDBrQG8nr5PTuHYupkloVA2eP2hRNnOuQIgWcX6vB
9+db9VN/Wdf6MV0fqKjCZf9CVYUkC5SzF3YKQaEy8SA/1xKzrLdBEtQjIOu4Vh9Q
NiZBp47Xgw==
=f1JW
-----END PGP SIGNATURE-----
Merge tag 'block-5.5-2020-01-16' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
"Three fixes that should go into this release:
- The 32-bit segment size fix that I mentioned last week (Ming)
- Use uint for the block size (Mikulas)
- A null_blk zone write handling fix (Damien)"
* tag 'block-5.5-2020-01-16' of git://git.kernel.dk/linux-block:
block: fix an integer overflow in logical block size
null_blk: Fix zone write handling
block: fix get_max_segment_size() overflow on 32bit arch
I've been sitting on these longer than I meant, so the patch count is
a bit higher than ideal for this part of the release. There's also some
reverts of double-applied patches that brings the diffstat up a bit.
With that said, the biggest changes are:
- Revert of duplicate i2c device addition on two Aspeed (BMC) Devicetrees.
- Move of two device nodes that got applied to the wrong part of the
tree on ASpeed G6.
- Regulator fix for Beaglebone X15 (adding 12/5V supplies)
- Use interrupts for keys on Amlogic SM1 to avoid missed polls
In addition to that, there is a collection of smaller DT fixes:
- Power supply assignment fixes for i.MX6
- Fix of interrupt line for magnetometer on i.MX8 Librem5 devkit
- Build fixlets (selects) for davinci/omap2+
- More interrupt number fixes for Stratix10, Amlogic SM1, etc.
- ... and more similar fixes across different platforms
And some non-DT stuff:
- optee fix to register multiple shared pages properly
- Clock calculation fixes for MMP3
- Clock fixes for OMAP as well
-----BEGIN PGP SIGNATURE-----
iQJDBAABCAAtFiEElf+HevZ4QCAJmMQ+jBrnPN6EHHcFAl4hIooPHG9sb2ZAbGl4
b20ubmV0AAoJEIwa5zzehBx3vckP/jT/yrodXuK3OLtBnDQI4Em5b14uJQxEAsh+
fTaz1H3n82PaWJVaEXpRTYMa4WZnmMPazoAoDhuqWnz/VbzfXmufFIIXsQ0rJqbf
Ht1LWvx7hd5q49aq2x1o9Nuo5OKMbW8igQqsx7PqjSOQRaAZTkxZhOI1C9pKnnnD
oJU8nw19N8yCQILxXMmpBX2vczWyJ3tgH6v8rhB89riBXouqwcKbTRyI0ciFdO91
mPlfF9qwqZ99bb+7WqalrtOr+/0VgvhB3oCNzoWYPptipiaLGdH4ZXVEhyCUDmrY
WN1kZsBtK+jtDLcMdRqg+EmbijxcxA0DSLDCow1QwuMPNHxVN5du1JN7b4uTvCPX
sHbrDO/YdiSWx20VZID/x/sWqcQyBrDqZkA3NWhoClm75JGQUHP16pZUURCN/awy
IGApkQ5164Ac+2DFHgh3S7qKXWk7O+hY6iksyRPPZkj31d4mCimdVaHDV/c3aeI/
EnUI6nj6H3ghYTX2gl3yhT8d4yCM+2uSawdIFWGNvB85vs1koAUEuczc6Me8JdZV
4HWexVs8W0Jo1w3Ndq3Hxw0RTKccC34x1f4dnzSSSEF7t4GMveTdecd/D77aiT2x
eVNox3PIAfjR96et2vQ1C+hVRyEqn/hDapvR5OI/78F2ampee8m8tWQDYIlH/RbZ
pdBTN5CS
=MMJu
-----END PGP SIGNATURE-----
Merge tag 'armsoc-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
Pull ARM SoC fixes from Olof Johansson:
"I've been sitting on these longer than I meant, so the patch count is
a bit higher than ideal for this part of the release. There's also
some reverts of double-applied patches that brings the diffstat up a
bit.
With that said, the biggest changes are:
- Revert of duplicate i2c device addition on two Aspeed (BMC)
Devicetrees.
- Move of two device nodes that got applied to the wrong part of the
tree on ASpeed G6.
- Regulator fix for Beaglebone X15 (adding 12/5V supplies)
- Use interrupts for keys on Amlogic SM1 to avoid missed polls
In addition to that, there is a collection of smaller DT fixes:
- Power supply assignment fixes for i.MX6
- Fix of interrupt line for magnetometer on i.MX8 Librem5 devkit
- Build fixlets (selects) for davinci/omap2+
- More interrupt number fixes for Stratix10, Amlogic SM1, etc.
- ... and more similar fixes across different platforms
And some non-DT stuff:
- optee fix to register multiple shared pages properly
- Clock calculation fixes for MMP3
- Clock fixes for OMAP as well"
* tag 'armsoc-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc: (42 commits)
MAINTAINERS: Add myself as the co-maintainer for Actions Semi platforms
ARM: dts: imx7: Fix Toradex Colibri iMX7S 256MB NAND flash support
ARM: dts: imx6sll-evk: Remove incorrect power supply assignment
ARM: dts: imx6sl-evk: Remove incorrect power supply assignment
ARM: dts: imx6sx-sdb: Remove incorrect power supply assignment
ARM: dts: imx6qdl-sabresd: Remove incorrect power supply assignment
ARM: dts: imx6q-icore-mipi: Use 1.5 version of i.Core MX6DL
ARM: omap2plus: select RESET_CONTROLLER
ARM: davinci: select CONFIG_RESET_CONTROLLER
ARM: dts: aspeed: rainier: Fix fan fault and presence
ARM: dts: aspeed: rainier: Remove duplicate i2c busses
ARM: dts: aspeed: tacoma: Remove duplicate flash nodes
ARM: dts: aspeed: tacoma: Remove duplicate i2c busses
ARM: dts: aspeed: tacoma: Fix fsi master node
ARM: dts: aspeed-g6: Fix FSI master location
ARM: dts: mmp3: Fix the TWSI ranges
clk: mmp2: Fix the order of timer mux parents
ARM: mmp: do not divide the clock rate
arm64: dts: rockchip: Fix IR on Beelink A1
optee: Fix multi page dynamic shm pool alloc
...
Daniel Borkmann says:
====================
pull-request: bpf 2020-01-15
The following pull-request contains BPF updates for your *net* tree.
We've added 12 non-merge commits during the last 9 day(s) which contain
a total of 13 files changed, 95 insertions(+), 43 deletions(-).
The main changes are:
1) Fix refcount leak for TCP time wait and request sockets for socket lookup
related BPF helpers, from Lorenz Bauer.
2) Fix wrong verification of ARSH instruction under ALU32, from Daniel Borkmann.
3) Batch of several sockmap and related TLS fixes found while operating
more complex BPF programs with Cilium and OpenSSL, from John Fastabend.
4) Fix sockmap to read psock's ingress_msg queue before regular sk_receive_queue()
to avoid purging data upon teardown, from Lingpeng Chen.
5) Fix printing incorrect pointer in bpftool's btf_dump_ptr() in order to properly
dump a BPF map's value with BTF, from Martin KaFai Lau.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Logical block size has type unsigned short. That means that it can be at
most 32768. However, there are architectures that can run with 64k pages
(for example arm64) and on these architectures, it may be possible to
create block devices with 64k block size.
For exmaple (run this on an architecture with 64k pages):
Mount will fail with this error because it tries to read the superblock using 2-sector
access:
device-mapper: writecache: I/O is not aligned, sector 2, size 1024, block size 65536
EXT4-fs (dm-0): unable to read superblock
This patch changes the logical block size from unsigned short to unsigned
int to avoid the overflow.
Cc: stable@vger.kernel.org
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When sockmap sock with TLS enabled is removed we cleanup bpf/psock state
and call tcp_update_ulp() to push updates to TLS ULP on top. However, we
don't push the write_space callback up and instead simply overwrite the
op with the psock stored previous op. This may or may not be correct so
to ensure we don't overwrite the TLS write space hook pass this field to
the ULP and have it fixup the ctx.
This completes a previous fix that pushed the ops through to the ULP
but at the time missed doing this for write_space, presumably because
write_space TLS hook was added around the same time.
Fixes: 95fa145479 ("bpf: sockmap/tls, close can race with map free")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/bpf/20200111061206.8028-4-john.fastabend@gmail.com
When a sockmap is free'd and a socket in the map is enabled with tls
we tear down the bpf context on the socket, the psock struct and state,
and then call tcp_update_ulp(). The tcp_update_ulp() call is to inform
the tls stack it needs to update its saved sock ops so that when the tls
socket is later destroyed it doesn't try to call the now destroyed psock
hooks.
This is about keeping stacked ULPs in good shape so they always have
the right set of stacked ops.
However, recently unhash() hook was removed from TLS side. But, the
sockmap/bpf side is not doing any extra work to update the unhash op
when is torn down instead expecting TLS side to manage it. So both
TLS and sockmap believe the other side is managing the op and instead
no one updates the hook so it continues to point at tcp_bpf_unhash().
When unhash hook is called we call tcp_bpf_unhash() which detects the
psock has already been destroyed and calls sk->sk_prot_unhash() which
calls tcp_bpf_unhash() yet again and so on looping and hanging the core.
To fix have sockmap tear down logic fixup the stale pointer.
Fixes: 5d92e631b8 ("net/tls: partially revert fix transition through disconnect with close")
Reported-by: syzbot+83979935eb6304f8cd46@syzkaller.appspotmail.com
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Acked-by: Song Liu <songliubraving@fb.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/bpf/20200111061206.8028-2-john.fastabend@gmail.com
[Why]
Noticed this while testing MST with the 4 ports MST hub from
StarTech.com. Sometimes can't light up monitors normally and get the
error message as 'sideband msg build failed'.
Look into aux transactions, found out that source sometimes will send
out another down request before receiving the down reply of the
previous down request. On the other hand, in drm_dp_get_one_sb_msg(),
current code doesn't handle the interleaved replies case. Hence, source
can't build up message completely and can't light up monitors.
[How]
For good compatibility, enforce source to send out one down request at a
time. Add a flag, is_waiting_for_dwn_reply, to determine if the source
can send out a down request immediately or not.
- Check the flag before calling process_single_down_tx_qlock to send out
a msg
- Set the flag when successfully send out a down request
- Clear the flag when successfully build up a down reply
- Clear the flag when find erros during sending out a down request
- Clear the flag when find errors during building up a down reply
- Clear the flag when timeout occurs during waiting for a down reply
- Use drm_dp_mst_kick_tx() to try to send another down request in queue
at the end of drm_dp_mst_wait_tx_reply() (attempt to send out messages
in queue when errors occur)
Cc: Lyude Paul <lyude@redhat.com>
Signed-off-by: Wayne Lin <Wayne.Lin@amd.com>
Signed-off-by: Lyude Paul <lyude@redhat.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20200113093649.11755-1-Wayne.Lin@amd.com
Anatoly has been fuzzing with kBdysch harness and reported a hang in one
of the outcomes:
0: R1=ctx(id=0,off=0,imm=0) R10=fp0
0: (85) call bpf_get_socket_cookie#46
1: R0_w=invP(id=0) R10=fp0
1: (57) r0 &= 808464432
2: R0_w=invP(id=0,umax_value=808464432,var_off=(0x0; 0x30303030)) R10=fp0
2: (14) w0 -= 810299440
3: R0_w=invP(id=0,umax_value=4294967295,var_off=(0xcf800000; 0x3077fff0)) R10=fp0
3: (c4) w0 s>>= 1
4: R0_w=invP(id=0,umin_value=1740636160,umax_value=2147221496,var_off=(0x67c00000; 0x183bfff8)) R10=fp0
4: (76) if w0 s>= 0x30303030 goto pc+216
221: R0_w=invP(id=0,umin_value=1740636160,umax_value=2147221496,var_off=(0x67c00000; 0x183bfff8)) R10=fp0
221: (95) exit
processed 6 insns (limit 1000000) [...]
Taking a closer look, the program was xlated as follows:
# ./bpftool p d x i 12
0: (85) call bpf_get_socket_cookie#7800896
1: (bf) r6 = r0
2: (57) r6 &= 808464432
3: (14) w6 -= 810299440
4: (c4) w6 s>>= 1
5: (76) if w6 s>= 0x30303030 goto pc+216
6: (05) goto pc-1
7: (05) goto pc-1
8: (05) goto pc-1
[...]
220: (05) goto pc-1
221: (05) goto pc-1
222: (95) exit
Meaning, the visible effect is very similar to f54c7898ed ("bpf: Fix
precision tracking for unbounded scalars"), that is, the fall-through
branch in the instruction 5 is considered to be never taken given the
conclusion from the min/max bounds tracking in w6, and therefore the
dead-code sanitation rewrites it as goto pc-1. However, real-life input
disagrees with verification analysis since a soft-lockup was observed.
The bug sits in the analysis of the ARSH. The definition is that we shift
the target register value right by K bits through shifting in copies of
its sign bit. In adjust_scalar_min_max_vals(), we do first coerce the
register into 32 bit mode, same happens after simulating the operation.
However, for the case of simulating the actual ARSH, we don't take the
mode into account and act as if it's always 64 bit, but location of sign
bit is different:
dst_reg->smin_value >>= umin_val;
dst_reg->smax_value >>= umin_val;
dst_reg->var_off = tnum_arshift(dst_reg->var_off, umin_val);
Consider an unknown R0 where bpf_get_socket_cookie() (or others) would
for example return 0xffff. With the above ARSH simulation, we'd see the
following results:
[...]
1: R1=ctx(id=0,off=0,imm=0) R2_w=invP65535 R10=fp0
1: (85) call bpf_get_socket_cookie#46
2: R0_w=invP(id=0) R10=fp0
2: (57) r0 &= 808464432
-> R0_runtime = 0x3030
3: R0_w=invP(id=0,umax_value=808464432,var_off=(0x0; 0x30303030)) R10=fp0
3: (14) w0 -= 810299440
-> R0_runtime = 0xcfb40000
4: R0_w=invP(id=0,umax_value=4294967295,var_off=(0xcf800000; 0x3077fff0)) R10=fp0
(0xffffffff)
4: (c4) w0 s>>= 1
-> R0_runtime = 0xe7da0000
5: R0_w=invP(id=0,umin_value=1740636160,umax_value=2147221496,var_off=(0x67c00000; 0x183bfff8)) R10=fp0
(0x67c00000) (0x7ffbfff8)
[...]
In insn 3, we have a runtime value of 0xcfb40000, which is '1100 1111 1011
0100 0000 0000 0000 0000', the result after the shift has 0xe7da0000 that
is '1110 0111 1101 1010 0000 0000 0000 0000', where the sign bit is correctly
retained in 32 bit mode. In insn4, the umax was 0xffffffff, and changed into
0x7ffbfff8 after the shift, that is, '0111 1111 1111 1011 1111 1111 1111 1000'
and means here that the simulation didn't retain the sign bit. With above
logic, the updates happen on the 64 bit min/max bounds and given we coerced
the register, the sign bits of the bounds are cleared as well, meaning, we
need to force the simulation into s32 space for 32 bit alu mode.
Verification after the fix below. We're first analyzing the fall-through branch
on 32 bit signed >= test eventually leading to rejection of the program in this
specific case:
0: R1=ctx(id=0,off=0,imm=0) R10=fp0
0: (b7) r2 = 808464432
1: R1=ctx(id=0,off=0,imm=0) R2_w=invP808464432 R10=fp0
1: (85) call bpf_get_socket_cookie#46
2: R0_w=invP(id=0) R10=fp0
2: (bf) r6 = r0
3: R0_w=invP(id=0) R6_w=invP(id=0) R10=fp0
3: (57) r6 &= 808464432
4: R0_w=invP(id=0) R6_w=invP(id=0,umax_value=808464432,var_off=(0x0; 0x30303030)) R10=fp0
4: (14) w6 -= 810299440
5: R0_w=invP(id=0) R6_w=invP(id=0,umax_value=4294967295,var_off=(0xcf800000; 0x3077fff0)) R10=fp0
5: (c4) w6 s>>= 1
6: R0_w=invP(id=0) R6_w=invP(id=0,umin_value=3888119808,umax_value=4294705144,var_off=(0xe7c00000; 0x183bfff8)) R10=fp0
(0x67c00000) (0xfffbfff8)
6: (76) if w6 s>= 0x30303030 goto pc+216
7: R0_w=invP(id=0) R6_w=invP(id=0,umin_value=3888119808,umax_value=4294705144,var_off=(0xe7c00000; 0x183bfff8)) R10=fp0
7: (30) r0 = *(u8 *)skb[808464432]
BPF_LD_[ABS|IND] uses reserved fields
processed 8 insns (limit 1000000) [...]
Fixes: 9cbe1f5a32 ("bpf/verifier: improve register value range tracking with ARSH")
Reported-by: Anatoly Trosinenko <anatoly.trosinenko@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200115204733.16648-1-daniel@iogearbox.net
Pull vfs fixes from Al Viro:
"Fixes for mountpoint_last() bugs (by converting to use of
lookup_last()) and an autofs regression fix from this cycle (caused by
follow_managed() breakage introduced in barrier fixes series)"
* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
fix autofs regression caused by follow_managed() changes
reimplement path_mountpoint() with less magic
* -O3 enablement fallout, thanks to Arnd who ran this
* fixes for a few leaks, thanks to Felix
* channel 12 regulatory fix for custom regdomains
* check for a crash reported by syzbot
(NULL function is called on drivers that don't have it)
* fix TKIP replay protection after setup with some APs
(from Jouni)
* restrict obtaining some mesh data to avoid WARN_ONs
* fix deadlocks with auto-disconnect (socket owner)
* fix radar detection events with multiple devices
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEH1e1rEeCd0AIMq6MB8qZga/fl8QFAl4e4voACgkQB8qZga/f
l8RVFg/+Ngjzje8xU6Ymh7VSwvBBz6CmSVwH/nRSsIAszDrQLoDCo7+eWeUHnAKG
2ViFQ4Lf5CP2J9kSgi7pbD0yGQRDGxH1CVSxMB64uZZRErlOCZYLRTIeBn7HGb8w
JhqAVXBjskzP9iB5jUZIkOgJ6L2ESM2KYKQVGtgn8p9fFdHKitec8UEKiPU3tcX1
zx94EbKldynuvDNb7VkTm4MdoY8LK5KI8ovVdVKwJ05qCtLvc6xvEBYhm33Fl6Pe
66cWklCrU093jQXUDTk0yBkwlcXGxJc143y1nXuurI8ZLH6q2wdDUAMHUEKdL6mz
ktHxABK4JA4Z/ylphR2Tp1TNh8xnz/ZZ5IdAe4P9K0/PerrjH8Yk7zf/48lrHTSv
LKb86Oaoq4MY4k5UNyjDUdgPB/q8CgxN2xDQaeQX69u740wrlxYtrwgHmqHrKI69
ppCzL7a3wmJzAz/97TMFhSseCH1WEQfWY+tlzktrvRofSgOGsIUtcPP10NmvGZHj
zdF79E9TiUPX4KVWssRb9A2MjBbvKA31hCQbH/OVKSWt/fItoBUbMQKtPxfnHcam
/W/gCWrOpw5Qqhc/BdFbYYsgY9J06KzifsV5e2Z6he+6Ex+rtdZNoLXc6/4QH3+9
HzsVYBP+ZUaqrMJi64KDN59C3ni4jDVs9zfYnVkC9LhzHASI5EE=
=i20C
-----END PGP SIGNATURE-----
Merge tag 'mac80211-for-net-2020-01-15' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211
Johannes Berg says:
====================
A few fixes:
* -O3 enablement fallout, thanks to Arnd who ran this
* fixes for a few leaks, thanks to Felix
* channel 12 regulatory fix for custom regdomains
* check for a crash reported by syzbot
(NULL function is called on drivers that don't have it)
* fix TKIP replay protection after setup with some APs
(from Jouni)
* restrict obtaining some mesh data to avoid WARN_ONs
* fix deadlocks with auto-disconnect (socket owner)
* fix radar detection events with multiple devices
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
In case a radar event of CAC_FINISHED or RADAR_DETECTED
happens during another phy is during CAC we might need
to cancel that CAC.
If we got a radar in a channel that another phy is now
doing CAC on then the CAC should be canceled there.
If, for example, 2 phys doing CAC on the same channels,
or on comptable channels, once on of them will finish his
CAC the other might need to cancel his CAC, since it is no
longer relevant.
To fix that the commit adds an callback and implement it in
mac80211 to end CAC.
This commit also adds a call to said callback if after a radar
event we see the CAC is no longer relevant
Signed-off-by: Orr Mazor <Orr.Mazor@tandemg.com>
Reviewed-by: Sergey Matyukevich <sergey.matyukevich.os@quantenna.com>
Link: https://lore.kernel.org/r/20191222145449.15792-1-Orr.Mazor@tandemg.com
[slightly reformat/reword commit message]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
... and get rid of a bunch of bugs in it. Background:
the reason for path_mountpoint() is that umount() really doesn't
want attempts to revalidate the root of what it's trying to umount.
The thing we want to avoid actually happen from complete_walk();
solution was to do something parallel to normal path_lookupat()
and it both went overboard and got the boilerplate subtly
(and not so subtly) wrong.
A better solution is to do pretty much what the normal path_lookupat()
does, but instead of complete_walk() do unlazy_walk(). All it takes
to avoid that ->d_weak_revalidate() call... mountpoint_last() goes
away, along with everything it got wrong, and so does the magic around
LOOKUP_NO_REVAL.
Another source of bugs is that when we traverse mounts at the final
location (and we need to do that - umount . expects to get whatever's
overmounting ., if any, out of the lookup) we really ought to take
care of ->d_manage() - as it is, manual umount of autofs automount
in progress can lead to unpleasant surprises for the daemon. Easily
solved by using handle_lookup_down() instead of follow_mount().
Tested-by: Ian Kent <raven@themaw.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Here are two bugfixes from Mike Rapoport, both fixing
compile-time errors for the nds32 architecture that
were recently introduced.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCAAGBQJeHehEAAoJEGCrR//JCVIncBIQAKmOG0xrS7YnUiyi0MqddSaj
wszJg5Y1edyMhrrYXMkMz3G0AJ3esLAXpwxk/d3thPjKwgY5TaWGNgcpYKiNG0Mm
ehSaeJHdXlXiAZIfjxM2POuJZ1na98CQJwBN+8jpWyXmlvzTAm3YJZyE65gZBAmP
VYZtaLCD1lAs10Pp8101r7Z0gQ+Fhv06Unf144eJ0ZF33sG5/xxJW57LCv/Yfy00
XeiSbN7xJ/2gPuz8JcsaJUF8VwJzf1UhNganFUUSoaQ9n03saytbuS2UhhWDoc5z
S9felivdKqo6l8ASc21ymW8WeSMwCO+mAOPuWg3T1K51Q/hasZ/CmGPBceHi0Pf9
zXtAigBc/GGANhu9w7kKl1IdKe5nTugk7EL+wm/4HAR8Z8JmdZT5SkE54SjB3maw
E1tvIMF8TS/KjXHJCLhyksB/kmV3+TKqwQl0y+ZI4hmdVECE301PaTO3xNhb4oS4
dOnEMPsIPQutqmKOyoN+S6T0IYaBL0XJ+5UXeFVRC2LGUWfGx096jScMYqCwadDy
MA49oK8xtKhXxdPa2E7yEzQIKx91EHCpavIy5Pg2WF1nvejGfH/amFgpvYaaJ1N1
uKYgavDJb6TeIylYfnaVawkB/WwM5uwWBANByPen86V9XqGhED3x10ceyvwNT6WF
jD27rkrHRT9VN/x38VOb
=ZMNN
-----END PGP SIGNATURE-----
Merge tag 'asm-generic-5.5' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/playground
Pull asm-generic fixes from Arnd Bergmann:
"Here are two bugfixes from Mike Rapoport, both fixing compile-time
errors for the nds32 architecture that were recently introduced"
* tag 'asm-generic-5.5' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/playground:
nds32: fix build failure caused by page table folding updates
asm-generic/nds32: don't redefine cacheflush primitives
Merge misc fixes from David Howells.
Two afs fixes and a key refcounting fix.
* dhowells:
afs: Fix afs_lookup() to not clobber the version on a new dentry
afs: Fix use-after-loss-of-ref
keys: Fix request_key() cache
afs_lookup() has a tracepoint to indicate the outcome of
d_splice_alias(), passing it the inode to retrieve the fid from.
However, the function gave up its ref on that inode when it called
d_splice_alias(), which may have failed and dropped the inode.
Fix this by caching the fid.
Fixes: 80548b0399 ("afs: Add more tracepoints")
Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 99cb0dbd47 ("mm,thp: add read-only THP support for (non-shmem)
FS") introduced a new khugepaged scan result: SCAN_PAGE_HAS_PRIVATE, but
the corresponding description for trace events were not added.
Link: http://lkml.kernel.org/r/1574793844-2914-1-git-send-email-yang.shi@linux.alibaba.com
Fixes: 99cb0dbd47 ("mm,thp: add read-only THP support for (non-shmem) FS")
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 96a2b03f28 ("mm, debug_pagelloc: use static keys to enable
debugging") has introduced a static key to reduce overhead when
debug_pagealloc is compiled in but not enabled. It relied on the
assumption that jump_label_init() is called before parse_early_param()
as in start_kernel(), so when the "debug_pagealloc=on" option is parsed,
it is safe to enable the static key.
However, it turns out multiple architectures call parse_early_param()
earlier from their setup_arch(). x86 also calls jump_label_init() even
earlier, so no issue was found while testing the commit, but same is not
true for e.g. ppc64 and s390 where the kernel would not boot with
debug_pagealloc=on as found by our QA.
To fix this without tricky changes to init code of multiple
architectures, this patch partially reverts the static key conversion
from 96a2b03f28. Init-time and non-fastpath calls (such as in arch
code) of debug_pagealloc_enabled() will again test a simple bool
variable. Fastpath mm code is converted to a new
debug_pagealloc_enabled_static() variant that relies on the static key,
which is enabled in a well-defined point in mm_init() where it's
guaranteed that jump_label_init() has been called, regardless of
architecture.
[sfr@canb.auug.org.au: export _debug_pagealloc_enabled_early]
Link: http://lkml.kernel.org/r/20200106164944.063ac07b@canb.auug.org.au
Link: http://lkml.kernel.org/r/20191219130612.23171-1-vbabka@suse.cz
Fixes: 96a2b03f28 ("mm, debug_pagelloc: use static keys to enable debugging")
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Qian Cai <cai@lca.pw>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently slab percpu vmstats are flushed twice: during the memcg
offlining and just before freeing the memcg structure. Each time percpu
counters are summed, added to the atomic counterparts and propagated up
by the cgroup tree.
The second flushing is required due to how recursive vmstats are
implemented: counters are batched in percpu variables on a local level,
and once a percpu value is crossing some predefined threshold, it spills
over to atomic values on the local and each ascendant levels. It means
that without flushing some numbers cached in percpu variables will be
dropped on floor each time a cgroup is destroyed. And with uptime the
error on upper levels might become noticeable.
The first flushing aims to make counters on ancestor levels more
precise. Dying cgroups may resume in the dying state for a long time.
After kmem_cache reparenting which is performed during the offlining
slab counters of the dying cgroup don't have any chances to be updated,
because any slab operations will be performed on the parent level. It
means that the inaccuracy caused by percpu batching will not decrease up
to the final destruction of the cgroup. By the original idea flushing
slab counters during the offlining should minimize the visible
inaccuracy of slab counters on the parent level.
The problem is that percpu counters are not zeroed after the first
flushing. So every cached percpu value is summed twice. It creates a
small error (up to 32 pages per cpu, but usually less) which accumulates
on parent cgroup level. After creating and destroying of thousands of
child cgroups, slab counter on parent level can be way off the real
value.
For now, let's just stop flushing slab counters on memcg offlining. It
can't be done correctly without scheduling a work on each cpu: reading
and zeroing it during css offlining can race with an asynchronous
update, which doesn't expect values to be changed underneath.
With this change, slab counters on parent level will become eventually
consistent. Once all dying children are gone, values are correct. And
if not, the error is capped by 32 * NR_CPUS pages per dying cgroup.
It's not perfect, as slab are reparented, so any updates after the
reparenting will happen on the parent level. It means that if a slab
page was allocated, a counter on child level was bumped, then the page
was reparented and freed, the annihilation of positive and negative
counter values will not happen until the child cgroup is released. It
makes slab counters different from others, and it might want us to
implement flushing in a correct form again. But it's also a question of
performance: scheduling a work on each cpu isn't free, and it's an open
question if the benefit of having more accurate counters is worth it.
We might also consider flushing all counters on offlining, not only slab
counters.
So let's fix the main problem now: make the slab counters eventually
consistent, so at least the error won't grow with uptime (or more
precisely the number of created and destroyed cgroups). And think about
the accuracy of counters separately.
Link: http://lkml.kernel.org/r/20191220042728.1045881-1-guro@fb.com
Fixes: bee07b33db ("mm: memcontrol: flush percpu slab vmstats on kmem offlining")
Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Two fixes for RISC-V:
- Clear FP registers during boot when FP support is present, rather than
when they aren't present
- Move the header files associated with the SiFive L2 cache controller
to drivers/soc (where the code was recently moved)
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEElRDoIDdEz9/svf2Kx4+xDQu9KksFAl4bY9MACgkQx4+xDQu9
Kkt8WQ//eaVeeVVBkNB4Wnq+zpdrj3Jhlab8woLrxP9q1S7z/DR098K565AxZ3wE
QVZN4ydK3PrgijIKXQusIj+/y27BFelDafBsNpyaph+SwHdqfPF7PIGdtE6RluCw
sw0Nhj1JGXme3vC7HTMceQM8iljxBOlG7KuaUHTFWSFe+im49VeulM3jCzdr/xWB
MoTMb5u3RL+N2Lv4bO6/PLWFBfzrcjD/z1pYXJ/PBHV559PQOeHkiHgFRy7TSn4w
nkZpof/QbFrAz4lCYGmI2d0C6dAet/e0b2thD+J77cYECSo8xc6OPvJXWNCdv6hY
I0FK+3RHZwAgeh/fdPQtkW6E+FDDi5SKOklmFTqbTMV+Rw1CJBTExAdI01fH2owG
sxXmD4NbDYVdWMBuuaR7kImGxQ5XDrcAFzUFDj/VAr6lYE7fklCcQclEiTe+9Tbt
TU8yI+ZjvXxvoZUk7TIxU2V/bSAM7jOuX5NGMpTrfsa+zpPjjnKsfMjpn/ddVW7q
VtNkiKGDMsXQOHdsAzpA7nTcUqWOv+o76r7Q0ZDO7IhqR9s9embpwTCZ4UIhfLMr
5S9pTex7iQCejMxWMcWc57fWHkqF3sHUqVu5rgSiZBWA1nNuC2PUqzwhq5ulrLjb
RDRS9i/XNPqhvjGdzJ+yFaXkalEse1v0EIWeQ+aXJifUV0zfxCc=
=qvPe
-----END PGP SIGNATURE-----
Merge tag 'riscv/for-v5.5-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux
Pull RISC-V fixes from Paul Walmsley:
"Two fixes for RISC-V:
- Clear FP registers during boot when FP support is present, rather
than when they aren't present
- Move the header files associated with the SiFive L2 cache
controller to drivers/soc (where the code was recently moved)"
* tag 'riscv/for-v5.5-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
riscv: Fixup obvious bug for fp-regs reset
riscv: move sifive_l2_cache.h to include/soc
The commit 9209fb5189 ("riscv: move sifive_l2_cache.c to drivers/soc")
moves the sifive L2 cache driver to driver/soc. It did not move the
header file along with the driver. Therefore this patch moves the header
file to driver/soc
Signed-off-by: Yash Shah <yash.shah@sifive.com>
Reviewed-by: Anup Patel <anup@brainfault.org>
[paul.walmsley@sifive.com: updated to fix the include guard]
Fixes: 9209fb5189 ("riscv: move sifive_l2_cache.c to drivers/soc")
Signed-off-by: Paul Walmsley <paul.walmsley@sifive.com>
The function to obtain a unique snapshot id was mistakenly typo'd as
devlink_region_shapshot_id_get. Fix this typo by renaming the function
and all of its users.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl4YvdoQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpoOuD/0eZkvtim/ZyEzj081PZUkslWwjNvEZV9+o
iYWMp0PBDyYgR79ca86EVTevcMiVxPnKpQl5DT+p9L1JzZ7dFc8U7fTpygjwbzx0
FdHlPQt+oN4TsGl3hpTGGnw2ArbCHnqqj31ahgo87zo0a01xv33C3QeGFyXEIYoR
F0QQ5E7EAyT2umKKflX9PWnrbOQZ91p2P3m+AE0TtOXMUgTi2KJKHUFu+G5OOwZB
dM41GvyZDY9WA7bUlFeOp0mRZsiGkfEsI59VP6AR8ZkxwsOeHLrVB5iBEGPiTDL2
dUwLwbGrLYFtwLEh4yd0aKt9++H2RZjJwi4ssyaDkkWCMQHECQXwd34DBmrV/qia
hgh/4DV0E1X3MZFYOk44zp8kwjgpmU9MCH3dFU0bWnzm9WrvtS9uBDjgEDkn6zty
xONSQeyHWVFQFwIjG260YEbuTplOTFP5rNWEf2CHWMHuk9kp8kfATWt9wlazYhtz
OUELfWmkrGk8nqMN4Ee+ty582I8gxk48IGwiJHOYh1gMHHgFnJbgr97Pe2NCLLee
9elkJnUQSdXuF314uznrAf7XLiEC0hfHGnPCTD8pAMf2DYOGKNeivh2wtdhd98cu
AvHew9qnI2C/oahY+DaE4wunP4VlNZ/ZeNAl0h8KG7uNBCA4uFtyBMNimnKVNblw
7KDsQ3vDIg==
=1bsG
-----END PGP SIGNATURE-----
Merge tag 'block-5.5-2020-01-10' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
"A few fixes that should go into this round.
This pull request contains two NVMe fixes via Keith, removal of a dead
function, and a fix for the bio op for read truncates (Ming)"
* tag 'block-5.5-2020-01-10' of git://git.kernel.dk/linux-block:
nvmet: fix per feat data len for get_feature
nvme: Translate more status codes to blk_status_t
fs: move guard_bio_eod() after bio_set_op_attrs
block: remove unused mp_bvec_last_segment
* sm_ftl: Fix NULL pointer warning.
Raw NAND:
* Cadence: fix compile testing.
* STM32: Avoid locking.
Onenand:
* Fix several sparse/build warnings.
SPI-NOR:
* Add a flag to fix interaction with Micron parts.
-----BEGIN PGP SIGNATURE-----
iQEzBAABCgAdFiEE9HuaYnbmDhq/XIDIJWrqGEe9VoQFAl4YipIACgkQJWrqGEe9
VoTr/Af9GDYEaD5ZnYeOwTSs43Cd62K7wzPar9tE20xu5VVmPXmyIgSmxyoPXpjh
O89xxahrG4sD0vokSWstjgZVTzBEu2DHkeOsjD6j7buXlv5LN8o4dAqw7k6+Hle5
T5qHZogcO2HK+4OijIP6xJ6hQMGz4YxZvhw34zEqdPYivxTK8X3EDEuQDns9bMUr
nPOjCYhOoR//iIRUA+l78VEnA2unnGGhaQhBxGm43xwqYLrOMsmz859pjbt1DGob
B3w2MlJ33ADBhA3/7PswAb8Otz6yPT8eq/8a8Pl+t9SZIwgZGBnjv/kzkBaZQ0r7
UCe2aVD97l5OkIBoIfPA+thbavrHwQ==
=TapA
-----END PGP SIGNATURE-----
Merge tag 'mtd/fixes-for-5.5-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/mtd/linux
Pull MTD fixes from Miquel Raynal:
"MTD:
- sm_ftl: Fix NULL pointer warning.
Raw NAND:
- Cadence: fix compile testing.
- STM32: Avoid locking.
Onenand:
- Fix several sparse/build warnings.
SPI-NOR:
- Add a flag to fix interaction with Micron parts"
* tag 'mtd/fixes-for-5.5-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/mtd/linux:
mtd: spi-nor: Fix the writing of the Status Register on micron flashes
mtd: sm_ftl: fix NULL pointer warning
mtd: onenand: omap2: Pass correct flags for prep_dma_memcpy
mtd: onenand: samsung: Fix iomem access with regular memcpy
mtd: onenand: omap2: Fix errors in style
mtd: cadence: Fix cast to pointer from integer of different size warning
mtd: rawnand: stm32_fmc2: avoid to lock the CPU bus