38783 Commits

Author SHA1 Message Date
Masami Hiramatsu
ab51e15d53 fprobe: Introduce FPROBE_FL_KPROBE_SHARED flag for fprobe
Introduce FPROBE_FL_KPROBE_SHARED flag for sharing fprobe callback with
kprobes safely from the viewpoint of recursion.

Since the recursion safety of the fprobe (and ftrace) is a bit different
from the kprobes, this may cause an issue if user wants to run the same
code from the fprobe and the kprobes.

The kprobes has per-cpu 'current_kprobe' variable which protects the
kprobe handler from recursion in any case. On the other hand, the fprobe
uses only ftrace_test_recursion_trylock(), which will allow interrupt
context calls another (or same) fprobe during the fprobe user handler is
running.

This is not a matter in cases if the common callback shared among the
kprobes and the fprobe has its own recursion detection, or it can handle
the recursion in the different contexts (normal/interrupt/NMI.)
But if it relies on the 'current_kprobe' recursion lock, it has to check
kprobe_running() and use kprobe_busy_*() APIs.

Fprobe has FPROBE_FL_KPROBE_SHARED flag to do this. If your common callback
code will be shared with kprobes, please set FPROBE_FL_KPROBE_SHARED
*before* registering the fprobe, like;

 fprobe.flags = FPROBE_FL_KPROBE_SHARED;

 register_fprobe(&fprobe, "func*", NULL);

This will protect your common callback from the nested call.

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Tested-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/164735293127.1084943.15687374237275817599.stgit@devnote2
2022-03-17 20:17:02 -07:00
Masami Hiramatsu
5b0ab78998 fprobe: Add exit_handler support
Add exit_handler to fprobe. fprobe + rethook allows us to hook the kernel
function return. The rethook will be enabled only if the
fprobe::exit_handler is set.

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Tested-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/164735290790.1084943.10601965782208052202.stgit@devnote2
2022-03-17 20:16:52 -07:00
Masami Hiramatsu
54ecbe6f1e rethook: Add a generic return hook
Add a return hook framework which hooks the function return. Most of the
logic came from the kretprobe, but this is independent from kretprobe.

Note that this is expected to be used with other function entry hooking
feature, like ftrace, fprobe, adn kprobes. Eventually this will replace
the kretprobe (e.g. kprobe + rethook = kretprobe), but at this moment,
this is just an additional hook.

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Tested-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/164735285066.1084943.9259661137330166643.stgit@devnote2
2022-03-17 20:16:29 -07:00
Masami Hiramatsu
cad9931f64 fprobe: Add ftrace based probe APIs
The fprobe is a wrapper API for ftrace function tracer.
Unlike kprobes, this probes only supports the function entry, but this
can probe multiple functions by one fprobe. The usage is similar, user
will set their callback to fprobe::entry_handler and call
register_fprobe*() with probed functions.
There are 3 registration interfaces,

 - register_fprobe() takes filtering patterns of the functin names.
 - register_fprobe_ips() takes an array of ftrace-location addresses.
 - register_fprobe_syms() takes an array of function names.

The registered fprobes can be unregistered with unregister_fprobe().
e.g.

struct fprobe fp = { .entry_handler = user_handler };
const char *targets[] = { "func1", "func2", "func3"};
...

ret = register_fprobe_syms(&fp, targets, ARRAY_SIZE(targets));

...

unregister_fprobe(&fp);

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Tested-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/164735283857.1084943.1154436951479395551.stgit@devnote2
2022-03-17 20:16:15 -07:00
Jiri Olsa
4f554e9556 ftrace: Add ftrace_set_filter_ips function
Adding ftrace_set_filter_ips function to be able to set filter on
multiple ip addresses at once.

With the kprobe multi attach interface we have cases where we need to
initialize ftrace_ops object with thousands of functions, so having
single function diving into ftrace_hash_move_and_update_ops with
ftrace_lock is faster.

The functions ips are passed as unsigned long array with count.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Tested-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/164735282673.1084943.18310504594134769804.stgit@devnote2
2022-03-17 20:15:17 -07:00
Jakub Kicinski
e243f39685 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
No conflicts.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-03-17 13:56:58 -07:00
Chengming Zhou
2957308343 livepatch: Don't block removal of patches that are safe to unload
module_put() is not called for a patch with "forced" flag. It should
block the removal of the livepatch module when the code might still
be in use after forced transition.

klp_force_transition() currently sets "forced" flag for all patches on
the list.

In fact, any patch can be safely unloaded when it passed through
the consistency model in KLP_UNPATCHED transition.

In other words, the "forced" flag must be set only for livepatches
that are being removed. In particular, set the "forced" flag:

  + only for klp_transition_patch when the transition to KLP_UNPATCHED
    state was forced.

  + all replaced patches when the transition to KLP_PATCHED state was
    forced and the patch was replacing the existing patches.

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Acked-by: Joe Lawrence <joe.lawrence@redhat.com>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Tested-by: Petr Mladek <pmladek@suse.com>
[mbenes@suse.cz: wording improvements]
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20220312152220.88127-1-zhouchengming@bytedance.com
2022-03-17 19:02:30 +01:00
Qian Cai
8208257d2d configs/debug: restore DEBUG_INFO=y for overriding
Previously, I failed to realize that Kees' patch [1] has not been merged
into the mainline yet, and dropped DEBUG_INFO=y too eagerly from the
mainline.  As the results, "make debug.config" won't be able to flip
DEBUG_INFO=n from the existing .config.  This should close the gaps of a
few weeks before Kees' patch is there, and work regardless of their
merging status anyway.

Link: https://lore.kernel.org/all/20220125075126.891825-1-keescook@chromium.org/ [1]
Link: https://lkml.kernel.org/r/20220308153524.8618-1-quic_qiancai@quicinc.com
Signed-off-by: Qian Cai <quic_qiancai@quicinc.com>
Reported-by: Daniel Thompson <daniel.thompson@linaro.org>
Reviewed-by: Daniel Thompson <daniel.thompson@linaro.org>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-03-17 11:02:13 -07:00
Hou Tao
d2a3b7c5be bpf: Fix net.core.bpf_jit_harden race
It is the bpf_jit_harden counterpart to commit 60b58afc96c9 ("bpf: fix
net.core.bpf_jit_enable race"). bpf_jit_harden will be tested twice
for each subprog if there are subprogs in bpf program and constant
blinding may increase the length of program, so when running
"./test_progs -t subprogs" and toggling bpf_jit_harden between 0 and 2,
jit_subprogs may fail because constant blinding increases the length
of subprog instructions during extra passs.

So cache the value of bpf_jit_blinding_enabled() during program
allocation, and use the cached value during constant blinding, subprog
JITing and args tracking of tail call.

Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220309123321.2400262-4-houtao1@huawei.com
2022-03-16 15:13:36 -07:00
Beau Belgrave
089331d473 user_events: Add trace event call as root for low permission cases
Tracefs by default is locked down heavily. System operators can open up
some files, such as user_events to a broader set of users. These users
do not have access within tracefs beyond just the user_event files. Due
to this restriction the trace_add_event_call/remove calls will silently
fail since the caller does not have permissions to create directories.

To fix this trace_add_event_call/remove calls will be issued with
override creds of the global root UID. Creds are reverted immediately
afterward.

Link: https://lkml.kernel.org/r/20220308222807.2040-1-beaub@linux.microsoft.com

Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-03-15 14:33:20 -04:00
Ingo Molnar
8c490b42fe Merge branch 'x86/pasid' into x86/core, to resolve conflicts
Conflicts:
	tools/objtool/arch/x86/decode.c

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2022-03-15 12:50:59 +01:00
Ingo Molnar
a7b2553b5e sched/headers: Only include <linux/entry-common.h> when CONFIG_GENERIC_ENTRY=y
This header is not (yet) standalone.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2022-03-15 10:33:53 +01:00
Peter Zijlstra
105cd68596 x86: Mark __invalid_creds() __noreturn
vmlinux.o: warning: objtool: ksys_unshare()+0x36c: unreachable instruction

0000 0000000000067040 <ksys_unshare>:
...
0364    673a4:	4c 89 ef             	mov    %r13,%rdi
0367    673a7:	e8 00 00 00 00       	call   673ac <ksys_unshare+0x36c>	673a8: R_X86_64_PLT32	__invalid_creds-0x4
036c    673ac:	e9 28 ff ff ff       	jmp    672d9 <ksys_unshare+0x299>
0371    673b1:	41 bc f4 ff ff ff    	mov    $0xfffffff4,%r12d
0377    673b7:	e9 80 fd ff ff       	jmp    6713c <ksys_unshare+0xfc>

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/Yi9gOW9f1GGwwUD6@hirez.programming.kicks-ass.net
2022-03-15 10:32:44 +01:00
Peter Zijlstra
eae654f1c2 exit: Mark do_group_exit() __noreturn
vmlinux.o: warning: objtool: get_signal()+0x108: unreachable instruction

0000 000000000007f930 <get_signal>:
...
0103    7fa33:  e8 00 00 00 00          call   7fa38 <get_signal+0x108> 7fa34: R_X86_64_PLT32   do_group_exit-0x4
0108    7fa38:  41 8b 45 74             mov    0x74(%r13),%eax

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lore.kernel.org/r/20220308154319.351270711@infradead.org
2022-03-15 10:32:43 +01:00
Peter Zijlstra
cc66bb9145 x86/ibt,kprobes: Cure sym+0 equals fentry woes
In order to allow kprobes to skip the ENDBR instructions at sym+0 for
X86_KERNEL_IBT builds, change _kprobe_addr() to take an architecture
callback to inspect the function at hand and modify the offset if
needed.

This streamlines the existing interface to cover more cases and
require less hooks. Once PowerPC gets fully converted there will only
be the one arch hook.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lore.kernel.org/r/20220308154318.405947704@infradead.org
2022-03-15 10:32:38 +01:00
Peter Zijlstra
d15cb3dab1 x86/livepatch: Validate __fentry__ location
Currently livepatch assumes __fentry__ lives at func+0, which is most
likely untrue with IBT on. Instead make it use ftrace_location() by
default which both validates and finds the actual ip if there is any
in the same symbol.

Suggested-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lore.kernel.org/r/20220308154318.285971256@infradead.org
2022-03-15 10:32:37 +01:00
Peter Zijlstra
aebfd12521 x86/ibt,ftrace: Search for __fentry__ location
Currently a lot of ftrace code assumes __fentry__ is at sym+0. However
with Intel IBT enabled the first instruction of a function will most
likely be ENDBR.

Change ftrace_location() to not only return the __fentry__ location
when called for the __fentry__ location, but also when called for the
sym+0 location.

Then audit/update all callsites of this function to consistently use
these new semantics.

Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lore.kernel.org/r/20220308154318.227581603@infradead.org
2022-03-15 10:32:37 +01:00
Ingo Molnar
ccdbf33c23 Linux 5.17-rc8
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmIuUskeHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGCFkH/2n3mpGXuITp0ZXE
 TNrpbdZOof5SgLw+w7THswXuo6m5yRGNKQs9fvIvDD8Vf7/OdQQfPOmF1cIE5+nk
 wcz6aHKbdrok8Jql2qjJqWXZ5xbGj6qywg3zZrwOUsCKFP5p+AjBJcmZOsvQHjSp
 ASODy1moOlK+nO52TrMaJw74a8xQPmQiNa+T2P+FedEYjlcRH/c7hLJ7GEnL6+cC
 /R4bATZq3tiInbTBlkC0hR0iVNgRXwXNyv9PEXrYYYHnekh8G1mgSNf06iejLcsG
 aAYsW9NyPxu8zPhhHNx79K9o8BMtxGD4YQpsfdfIEnf9Q3euqAKe2evRWqHHlDms
 RuSCtsc=
 =M9Nc
 -----END PGP SIGNATURE-----

Merge tag 'v5.17-rc8' into sched/core, to pick up fixes

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2022-03-15 10:28:12 +01:00
Ingo Molnar
ccacfe56d7 Merge branch 'sched/fast-headers' into sched/core
Merge the scheduler build speedup of the fast-headers tree.

Cumulative scheduler (kernel/sched/) build time speedup on a
Linux distribution's config, which enables all scheduler features,
compared to the vanilla kernel:

      _____________________________________________________________________________
     |
     |  Vanilla kernel (v5.13-rc7):
     |_____________________________________________________________________________
     |
     |  Performance counter stats for 'make -j96 kernel/sched/' (3 runs):
     |
     |   126,975,564,374      instructions              #    1.45  insn per cycle           ( +-  0.00% )
     |    87,637,847,671      cycles                    #    3.959 GHz                      ( +-  0.30% )
     |         22,136.96 msec cpu-clock                 #    7.499 CPUs utilized            ( +-  0.29% )
     |
     |            2.9520 +- 0.0169 seconds time elapsed  ( +-  0.57% )
     |_____________________________________________________________________________
     |
     |  Patched kernel:
     |_____________________________________________________________________________
     |
     | Performance counter stats for 'make -j96 kernel/sched/' (3 runs):
     |
     |    50,420,496,914      instructions              #    1.47  insn per cycle           ( +-  0.00% )
     |    34,234,322,038      cycles                    #    3.946 GHz                      ( +-  0.31% )
     |          8,675.81 msec cpu-clock                 #    3.053 CPUs utilized            ( +-  0.45% )
     |
     |            2.8420 +- 0.0181 seconds time elapsed  ( +-  0.64% )
     |_____________________________________________________________________________

    Summary:

      - CPU time used to build the scheduler dropped by -60.9%, a reduction
        from 22.1 clock-seconds to 8.7 clock-seconds.

      - Wall-clock time to build the scheduler dropped by -3.9%, a reduction
        from 2.95 seconds to 2.84 seconds.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2022-03-15 09:05:05 +01:00
Ingo Molnar
411472ae5b Linux 5.17-rc8
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmIuUskeHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGCFkH/2n3mpGXuITp0ZXE
 TNrpbdZOof5SgLw+w7THswXuo6m5yRGNKQs9fvIvDD8Vf7/OdQQfPOmF1cIE5+nk
 wcz6aHKbdrok8Jql2qjJqWXZ5xbGj6qywg3zZrwOUsCKFP5p+AjBJcmZOsvQHjSp
 ASODy1moOlK+nO52TrMaJw74a8xQPmQiNa+T2P+FedEYjlcRH/c7hLJ7GEnL6+cC
 /R4bATZq3tiInbTBlkC0hR0iVNgRXwXNyv9PEXrYYYHnekh8G1mgSNf06iejLcsG
 aAYsW9NyPxu8zPhhHNx79K9o8BMtxGD4YQpsfdfIEnf9Q3euqAKe2evRWqHHlDms
 RuSCtsc=
 =M9Nc
 -----END PGP SIGNATURE-----

Merge tag 'v5.17-rc8' into irq/core, to fix conflicts

Conflicts:
	drivers/pinctrl/pinctrl-starfive.c

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2022-03-14 18:53:00 +01:00
Thomas Gleixner
f0fae8a0ed irqchip updates for 5.18
- Add support for the STM32MP13 variant
 
 - Move parent device away from struct irq_chip
 
 - Remove all instances of non-const strings assigned to
   struct irq_chip::name, enabling a nice cleanup for VIC and GIC)
 
 - Simplify the Qualcomm PDC driver
 
 - A bunch of SiFive PLIC cleanups
 
 - Add support for a new variant of the Meson GPIO block
 
 - Add support for the irqchip side of the Apple M1 PMU
 
 - Add support for the Apple M1 Pro/Max AICv2 irqchip
 
 - Add support for the Qualcomm MPM wakeup gadget
 
 - Move the Xilinx driver over to the generic irqdomain handling
 
 - Tiny speedup for IPIs on GICv3 systems
 
 - The usual odd cleanups
 -----BEGIN PGP SIGNATURE-----
 
 iQJDBAABCgAtFiEEn9UcU+C1Yxj9lZw9I9DQutE9ekMFAmItxJMPHG1hekBrZXJu
 ZWwub3JnAAoJECPQ0LrRPXpD7ccQAIkkNoC6yQ+9lhbdRrlo6KUtUT2apDheIF+5
 Yfo7dTeKMUb4NpQs+b4v01A0B3KSLPuwTulWfGXhsLRXVcfEEnkBCQzy/IQnkYTQ
 DDvxENRz40SS0WJF1G74a7KsqHt+epyHZkB6KJQV4BYrZKxt2h0tWNSiNf1IDN/e
 9mZq2kLgEk0kfRCR9u6NYGMugbrgbdtiLgwBARKdRtAAkjBlGEtC2slp0a3WTsyg
 QfnMWMOK22wa34eZzFG8VrJMVwGyeqMP/ZW30EoClBzPyLUM5aZWRr+LSvLYQC4n
 ho6ua1+a2726TBT6vtWNi0KDNcXwhL6JheO4m2bCoWPvu4YengfKQ5QllAFvSR3W
 e4oT/xwkBcf+n5ehXEfxqTRRxG398oWYI60kX586dIcr9qN9WBsw1S5aPkDeZ+nT
 6THbQ5uZrIqkeWOoJmvg+iwKkE/NQY/xUENW0zeG2f4/YLIGeKK7e1/XCl1jqzlk
 vIvf/bYr64TgOvvHhIeh1G5iXQnk1TWoCzW0DQ8BIXhjlbVRG39QuvwjXKok4AhK
 QgKMi6N1ge4nKO1gcYbR174gDz+MylZP41ddDACVXT/5hzsfyxLF36ixdyMLKwtr
 Lybb4PGB5Pf0Zgxu6cVWeVsEZEwtlMCmIi1XUW4YRv2saypTPD5V78Ug6jbyPMXE
 G7J5dxwS
 =cf1B
 -----END PGP SIGNATURE-----

Merge tag 'irqchip-5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms into irq/core

Pull irqchip updates from Marc Zyngier:

  - Add support for the STM32MP13 variant

  - Move parent device away from struct irq_chip

  - Remove all instances of non-const strings assigned to
    struct irq_chip::name, enabling a nice cleanup for VIC and GIC)

  - Simplify the Qualcomm PDC driver

  - A bunch of SiFive PLIC cleanups

  - Add support for a new variant of the Meson GPIO block

  - Add support for the irqchip side of the Apple M1 PMU

  - Add support for the Apple M1 Pro/Max AICv2 irqchip

  - Add support for the Qualcomm MPM wakeup gadget

  - Move the Xilinx driver over to the generic irqdomain handling

  - Tiny speedup for IPIs on GICv3 systems

  - The usual odd cleanups

Link: https://lore.kernel.org/all/20220313105142.704579-1-maz@kernel.org
2022-03-14 10:23:22 +01:00
Tom Rix
f9da322e86 cgroup: cleanup comments
for spdx, add a space before //

replacements
judgement to judgment
transofrmed to transformed
partitition to partition
histrical to historical
migratecd to migrated

Signed-off-by: Tom Rix <trix@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2022-03-13 19:19:27 -10:00
Steven Rostedt (Google)
bc47ee4844 tracing/user_events: Use alloc_pages instead of kzalloc() for register pages
kzalloc virtual addresses do not work with SetPageReserved, use the actual
page virtual addresses instead via alloc_pages.

The issue is reported when booting with user_events and
DEBUG_VM_PGFLAGS=y.

Also make the number of events based on the ORDER.

Link: https://lore.kernel.org/all/CADYN=9+xY5Vku3Ws5E9S60SM5dCFfeGeRBkmDFbcxX0ZMoFing@mail.gmail.com/
Link: https://lore.kernel.org/all/20220311223028.1865-1-beaub@linux.microsoft.com/

Cc: Beau Belgrave <beaub@linux.microsoft.com>
Reported-by: Anders Roxell <anders.roxell@linaro.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-03-11 18:59:57 -05:00
Linus Torvalds
93ce93587d Merge branch 'davidh' (fixes from David Howells)
Merge misc fixes from David Howells:
 "A set of patches for watch_queue filter issues noted by Jann. I've
  added in a cleanup patch from Christophe Jaillet to convert to using
  formal bitmap specifiers for the note allocation bitmap.

  Also two filesystem fixes (afs and cachefiles)"

* emailed patches from David Howells <dhowells@redhat.com>:
  cachefiles: Fix volume coherency attribute
  afs: Fix potential thrashing in afs writeback
  watch_queue: Make comment about setting ->defunct more accurate
  watch_queue: Fix lack of barrier/sync/lock between post and read
  watch_queue: Free the alloc bitmap when the watch_queue is torn down
  watch_queue: Fix the alloc bitmap size to reflect notes allocated
  watch_queue: Use the bitmap API when applicable
  watch_queue: Fix to always request a pow-of-2 pipe ring size
  watch_queue: Fix to release page in ->release()
  watch_queue, pipe: Free watchqueue state after clearing pipe ring
  watch_queue: Fix filter limit check
2022-03-11 10:28:32 -08:00
David Howells
4edc076041 watch_queue: Make comment about setting ->defunct more accurate
watch_queue_clear() has a comment stating that setting ->defunct to true
preventing new additions as well as preventing notifications.  Whilst
the latter is true, the first bit is superfluous since at the time this
function is called, the pipe cannot be accessed to add new event
sources.

Remove the "new additions" bit from the comment.

Fixes: c73be61cede5 ("pipe: Add general notification queue support")
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-03-11 10:17:13 -08:00
David Howells
2ed147f015 watch_queue: Fix lack of barrier/sync/lock between post and read
There's nothing to synchronise post_one_notification() versus
pipe_read().  Whilst posting is done under pipe->rd_wait.lock, the
reader only takes pipe->mutex which cannot bar notification posting as
that may need to be made from contexts that cannot sleep.

Fix this by setting pipe->head with a barrier in post_one_notification()
and reading pipe->head with a barrier in pipe_read().

If that's not sufficient, the rd_wait.lock will need to be taken,
possibly in a ->confirm() op so that it only applies to notifications.
The lock would, however, have to be dropped before copy_page_to_iter()
is invoked.

Fixes: c73be61cede5 ("pipe: Add general notification queue support")
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-03-11 10:17:13 -08:00
David Howells
7ea1a0124b watch_queue: Free the alloc bitmap when the watch_queue is torn down
Free the watch_queue note allocation bitmap when the watch_queue is
destroyed.

Fixes: c73be61cede5 ("pipe: Add general notification queue support")
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-03-11 10:17:13 -08:00
David Howells
3b4c037192 watch_queue: Fix the alloc bitmap size to reflect notes allocated
Currently, watch_queue_set_size() sets the number of notes available in
wqueue->nr_notes according to the number of notes allocated, but sets
the size of the bitmap to the unrounded number of notes originally asked
for.

Fix this by setting the bitmap size to the number of notes we're
actually going to make available (ie. the number allocated).

Fixes: c73be61cede5 ("pipe: Add general notification queue support")
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-03-11 10:17:12 -08:00
Christophe JAILLET
a66bd7575b watch_queue: Use the bitmap API when applicable
Use bitmap_alloc() to simplify code, improve the semantic and reduce
some open-coded arithmetic in allocator arguments.

Also change a memset(0xff) into an equivalent bitmap_fill() to keep
consistency.

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-03-11 10:17:12 -08:00
David Howells
96a4d8912b watch_queue: Fix to always request a pow-of-2 pipe ring size
The pipe ring size must always be a power of 2 as the head and tail
pointers are masked off by AND'ing with the size of the ring - 1.
watch_queue_set_size(), however, lets you specify any number of notes
between 1 and 511.  This number is passed through to pipe_resize_ring()
without checking/forcing its alignment.

Fix this by rounding the number of slots required up to the nearest
power of two.  The request is meant to guarantee that at least that many
notifications can be generated before the queue is full, so rounding
down isn't an option, but, alternatively, it may be better to give an
error if we aren't allowed to allocate that much ring space.

Fixes: c73be61cede5 ("pipe: Add general notification queue support")
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-03-11 10:17:12 -08:00
David Howells
c1853fbadc watch_queue: Fix to release page in ->release()
When a pipe ring descriptor points to a notification message, the
refcount on the backing page is incremented by the generic get function,
but the release function, which marks the bitmap, doesn't drop the page
ref.

Fix this by calling generic_pipe_buf_release() at the end of
watch_queue_pipe_buf_release().

Fixes: c73be61cede5 ("pipe: Add general notification queue support")
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-03-11 10:17:12 -08:00
David Howells
c993ee0f9f watch_queue: Fix filter limit check
In watch_queue_set_filter(), there are a couple of places where we check
that the filter type value does not exceed what the type_filter bitmap
can hold.  One place calculates the number of bits by:

   if (tf[i].type >= sizeof(wfilter->type_filter) * 8)

which is fine, but the second does:

   if (tf[i].type >= sizeof(wfilter->type_filter) * BITS_PER_LONG)

which is not.  This can lead to a couple of out-of-bounds writes due to
a too-large type:

 (1) __set_bit() on wfilter->type_filter
 (2) Writing more elements in wfilter->filters[] than we allocated.

Fix this by just using the proper WATCH_TYPE__NR instead, which is the
number of types we actually know about.

The bug may cause an oops looking something like:

  BUG: KASAN: slab-out-of-bounds in watch_queue_set_filter+0x659/0x740
  Write of size 4 at addr ffff88800d2c66bc by task watch_queue_oob/611
  ...
  Call Trace:
   <TASK>
   dump_stack_lvl+0x45/0x59
   print_address_description.constprop.0+0x1f/0x150
   ...
   kasan_report.cold+0x7f/0x11b
   ...
   watch_queue_set_filter+0x659/0x740
   ...
   __x64_sys_ioctl+0x127/0x190
   do_syscall_64+0x43/0x90
   entry_SYSCALL_64_after_hwframe+0x44/0xae

  Allocated by task 611:
   kasan_save_stack+0x1e/0x40
   __kasan_kmalloc+0x81/0xa0
   watch_queue_set_filter+0x23a/0x740
   __x64_sys_ioctl+0x127/0x190
   do_syscall_64+0x43/0x90
   entry_SYSCALL_64_after_hwframe+0x44/0xae

  The buggy address belongs to the object at ffff88800d2c66a0
   which belongs to the cache kmalloc-32 of size 32
  The buggy address is located 28 bytes inside of
   32-byte region [ffff88800d2c66a0, ffff88800d2c66c0)

Fixes: c73be61cede5 ("pipe: Add general notification queue support")
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-03-11 10:17:12 -08:00
Steven Rostedt (Google)
380af29b8d tracing: Add snapshot at end of kernel boot up
Add ftrace_boot_snapshot kernel parameter that will take a snapshot at the
end of boot up just before switching over to user space (it happens during
the kernel freeing of init memory).

This is useful when there's interesting data that can be collected from
kernel start up, but gets overridden by user space start up code. With
this option, the ring buffer content from the boot up traces gets saved in
the snapshot at the end of boot up. This trace can be read from:

 /sys/kernel/tracing/snapshot

Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-03-11 11:49:24 -05:00
Steven Rostedt (Google)
b3bc8547d3 tracing: Have TRACE_DEFINE_ENUM affect trace event types as well
The macro TRACE_DEFINE_ENUM is used to convert enums in the kernel to
their actual value when they are exported to user space via the trace
event format file.

Currently only the enums in the "print fmt" (TP_printk in the TRACE_EVENT
macro) have the enums converted. But the enums can be used to denote array
size:

        field:unsigned int fc_ineligible_rc[EXT4_FC_REASON_MAX]; offset:12;      size:36;        signed:0;

The EXT4_FC_REASON_MAX has no meaning to userspace but it needs to know
that information to know how to parse the array.

Have the array indexes also be parsed as well.

Link: https://lore.kernel.org/all/cover.1646922487.git.riteshh@linux.ibm.com/

Reported-by: Ritesh Harjani <riteshh@linux.ibm.com>
Tested-by: Ritesh Harjani <riteshh@linux.ibm.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-03-11 11:49:24 -05:00
Tom Zanussi
9f438d4d7f tracing: Fix strncpy warning in trace_events_synth.c
0-day reported the strncpy error below:

../kernel/trace/trace_events_synth.c: In function 'last_cmd_set':
../kernel/trace/trace_events_synth.c:65:9: warning: 'strncpy' specified bound depends on the length o\
f the source argument [-Wstringop-truncation]
   65 |         strncpy(last_cmd, str, strlen(str) + 1);
      |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
../kernel/trace/trace_events_synth.c:65:32: note: length computed here
   65 |         strncpy(last_cmd, str, strlen(str) + 1);
      |                                ^~~~~~~~~~~

There's no reason to use strncpy here, in fact there's no reason to do
anything but a simple kstrdup() (note we don't even need to check for
failure since last_cmod is expected to be either the last cmd string
or NULL, and the containing function is a void return).

Link: https://lkml.kernel.org/r/77deca8cbfd226981b3f1eab203967381e9b5bd9.camel@kernel.org

Fixes: 27c888da9867 ("tracing: Remove size restriction on synthetic event cmd error logging")

Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-03-11 11:49:24 -05:00
Beau Belgrave
7e348b325b user_events: Prevent dyn_event delete racing with ioctl add/delete
Find user_events always while under the event_mutex and before leaving
the lock, add a ref count to the user_event. This ensures that all paths
under the event_mutex that check the ref counts will be synchronized.

The ioctl add/delete paths are protected by the reg_mutex. However,
dyn_event is only protected by the event_mutex. The dyn_event delete
path cannot acquire reg_mutex, since that could cause a deadlock between
the ioctl delete case acquiring event_mutex after acquiring the reg_mutex.

Link: https://lkml.kernel.org/r/20220310001141.1660-1-beaub@linux.microsoft.com

Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-03-11 11:49:24 -05:00
Roberto Sassu
df6b3039fa bpf-lsm: Make bpf_lsm_kernel_read_file() as sleepable
Make bpf_lsm_kernel_read_file() as sleepable, so that bpf_ima_inode_hash()
or bpf_ima_file_hash() can be called inside the implementation of this
hook.

Signed-off-by: Roberto Sassu <roberto.sassu@huawei.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220302111404.193900-8-roberto.sassu@huawei.com
2022-03-10 18:57:55 -08:00
Roberto Sassu
174b16946e bpf-lsm: Introduce new helper bpf_ima_file_hash()
ima_file_hash() has been modified to calculate the measurement of a file on
demand, if it has not been already performed by IMA or the measurement is
not fresh. For compatibility reasons, ima_inode_hash() remains unchanged.

Keep the same approach in eBPF and introduce the new helper
bpf_ima_file_hash() to take advantage of the modified behavior of
ima_file_hash().

Signed-off-by: Roberto Sassu <roberto.sassu@huawei.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220302111404.193900-4-roberto.sassu@huawei.com
2022-03-10 18:57:54 -08:00
Linus Torvalds
dda64ead7e Minor tracing fixes:
- Fix unregistering the same event twice a the user could disable
    the event osnoise will disable on unregistering.
 
  - Inform RCU of a quiescent state in the osnoise testing thread.
 
  - Fix some kerneldoc comments.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCYioe0BQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qohMAQCcnN4Iy9LcKX02A5HcuUxIdH3ktOqt
 lbc24+nb6AO+/wD+Ilnlx/Ir09HgaL7NreSAc0tvLMwRgaM+grLA82IRTwc=
 =quYh
 -----END PGP SIGNATURE-----

Merge tag 'trace-v5.17-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing fixes from Steven Rostedt:
 "Minor tracing fixes:

   - Fix unregistering the same event twice. A user could disable the
     same event that osnoise will disable on unregistering.

   - Inform RCU of a quiescent state in the osnoise testing thread.

   - Fix some kerneldoc comments"

* tag 'trace-v5.17-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  ftrace: Fix some W=1 warnings in kernel doc comments
  tracing/osnoise: Force quiescent states while tracing
  tracing/osnoise: Do not unregister events twice
2022-03-10 17:23:08 -08:00
Jakub Kicinski
1e8a3f0d2a Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
net/dsa/dsa2.c
  commit afb3cc1a397d ("net: dsa: unlock the rtnl_mutex when dsa_master_setup() fails")
  commit e83d56537859 ("net: dsa: replay master state events in dsa_tree_{setup,teardown}_master")
https://lore.kernel.org/all/20220307101436.7ae87da0@canb.auug.org.au/

drivers/net/ethernet/intel/ice/ice.h
  commit 97b0129146b1 ("ice: Fix error with handling of bonding MTU")
  commit 43113ff73453 ("ice: add TTY for GNSS module for E810T device")
https://lore.kernel.org/all/20220310112843.3233bcf1@canb.auug.org.au/

drivers/staging/gdm724x/gdm_lte.c
  commit fc7f750dc9d1 ("staging: gdm724x: fix use after free in gdm_lte_rx()")
  commit 4bcc4249b4cf ("staging: Use netif_rx().")
https://lore.kernel.org/all/20220308111043.1018a59d@canb.auug.org.au/

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-03-10 17:16:56 -08:00
Eric W. Biederman
355f841a3f tracehook: Remove tracehook.h
Now that all of the definitions have moved out of tracehook.h into
ptrace.h, sched/signal.h, resume_user_mode.h there is nothing left in
tracehook.h so remove it.

Update the few files that were depending upon tracehook.h to bring in
definitions to use the headers they need directly.

Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lkml.kernel.org/r/20220309162454.123006-13-ebiederm@xmission.com
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2022-03-10 16:51:51 -06:00
Eric W. Biederman
03248addad resume_user_mode: Move to resume_user_mode.h
Move set_notify_resume and tracehook_notify_resume into resume_user_mode.h.
While doing that rename tracehook_notify_resume to resume_user_mode_work.

Update all of the places that included tracehook.h for these functions to
include resume_user_mode.h instead.

Update all of the callers of tracehook_notify_resume to call
resume_user_mode_work.

Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lkml.kernel.org/r/20220309162454.123006-12-ebiederm@xmission.com
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2022-03-10 16:51:50 -06:00
Eric W. Biederman
7c5d8fa6fb task_work: Decouple TIF_NOTIFY_SIGNAL and task_work
There are a small handful of reasons besides pending signals that the
kernel might want to break out of interruptible sleeps.  The flag
TIF_NOTIFY_SIGNAL and the helpers that set and clear TIF_NOTIFY_SIGNAL
provide that the infrastructure for breaking out of interruptible
sleeps and entering the return to user space slow path for those
cases.

Expand tracehook_notify_signal inline in it's callers and remove it,
which makes clear that TIF_NOTIFY_SIGNAL and task_work are separate
concepts.

Update the comment on set_notify_signal to more accurately describe
it's purpose.

Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lkml.kernel.org/r/20220309162454.123006-9-ebiederm@xmission.com
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2022-03-10 16:51:50 -06:00
Eric W. Biederman
8ba62d3794 task_work: Call tracehook_notify_signal from get_signal on all architectures
Always handle TIF_NOTIFY_SIGNAL in get_signal.  With commit 35d0b389f3b2
("task_work: unconditionally run task_work from get_signal()") always
calling task_work_run all of the work of tracehook_notify_signal is
already happening except clearing TIF_NOTIFY_SIGNAL.

Factor clear_notify_signal out of tracehook_notify_signal and use it in
get_signal so that get_signal only needs one call of task_work_run.

To keep the semantics in sync update xfer_to_guest_mode_work (which
does not call get_signal) to call tracehook_notify_signal if either
_TIF_SIGPENDING or _TIF_NOTIFY_SIGNAL.

Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lkml.kernel.org/r/20220309162454.123006-8-ebiederm@xmission.com
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2022-03-10 16:51:36 -06:00
Steven Rostedt (Google)
8bcd06632b tracing: Allow custom events to be added to the tracefs directory
Allow custom events to be added to the events directory in the tracefs
file system. For example, a module could be installed that attaches to an
event and wants to be enabled and disabled via the tracefs file system. It
would use trace_add_event_call() to add the event to the tracefs
directory, and trace_remove_event_call() to remove it.

Make both those functions EXPORT_SYMBOL_GPL().

Link: https://lkml.kernel.org/r/20220303220625.186988045@goodmis.org

Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-03-10 17:41:18 -05:00
Steven Rostedt (Google)
5677a3d713 tracing: Fix last_cmd_set() string management in histogram code
Using strnlen(dest, str, n) is confusing, as the size of dest must be
strlen(dest) + n + 1. Even more confusing, using sizeof(string constant)
gives you strlen(string constant) + 1 and not just strlen().

These two together made using strncat() with a constant string a bit off
in the calculations as we have:

	len = sizeof(HIST_PREFIX) + strlen(str) + 1;
	kfree(last_cmd);
	last_cmd = kzalloc(len, GFP_KERNEL);
	strcpy(last_cmd, HIST_PREFIX);
	len -= sizeof(HIST_PREFIX) + 1;
	strncat(last_cmd, str, len);

The above works if we s/sizeof/strlen/ with HIST_PREFIX (which is defined
as "hist:", but because sizeof(HIST_PREFIX) is equal to
strlen(HIST_PREFIX) + 1, we can drop the +1 in the code. But at least
comment that we are doing so.

Link: https://lore.kernel.org/all/202203082112.Iu7tvFl4-lkp@intel.com/

Fixes: 9f8e5aee93ed2 ("tracing: Fix allocation of last_cmd in last_cmd_set()")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-03-10 17:41:18 -05:00
Beau Belgrave
173c2049d1 user_events: Fix potential uninitialized pointer while parsing field
Ensure name is initialized by default to NULL to prevent possible edge
cases that could lead to it being left uninitialized. Add an explicit
check for NULL name to ensure edge boundaries.

Link: https://lore.kernel.org/bpf/20220224105334.GA2248@kili/
Link: https://lore.kernel.org/linux-trace-devel/20220224181637.2129-1-beaub@linux.microsoft.com

Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-03-10 17:41:18 -05:00
Yuntao Wang
1b773d0003 bpf: Use offsetofend() to simplify macro definition
Use offsetofend() instead of offsetof() + sizeof() to simplify
MIN_BPF_LINEINFO_SIZE macro definition.

Signed-off-by: Yuntao Wang <ytcoode@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Acked-by: Joanne Koong <joannelkoong@gmail.com>
Link: https://lore.kernel.org/bpf/20220310161518.534544-1-ytcoode@gmail.com
2022-03-10 23:04:47 +01:00
Eric W. Biederman
7f62d40d9c task_work: Introduce task_work_pending
Wrap the test of task->task_works in a helper function to make
it clear what is being tested.

All of the other readers of task->task_work use READ_ONCE and this is
even necessary on current as other processes can update
task->task_work.  So for consistency I have added READ_ONCE into
task_work_pending.

Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lkml.kernel.org/r/20220309162454.123006-7-ebiederm@xmission.com
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2022-03-10 13:39:04 -06:00
Eric W. Biederman
8ca07e17c9 task_work: Remove unnecessary include from posix_timers.h
Break a header file circular dependency by removing the unnecessary
include of task_work.h from posix_timers.h.

sched.h -> posix-timers.h
posix-timers.h -> task_work.h
task_work.h -> sched.h

Add missing includes of task_work.h to:
arch/x86/mm/tlb.c
kernel/time/posix-cpu-timers.c

Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lkml.kernel.org/r/20220309162454.123006-6-ebiederm@xmission.com
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2022-03-10 13:38:01 -06:00