46607 Commits

Author SHA1 Message Date
Steven Rostedt (Google)
7132d8572b Merge ftrace/for-next 2025-01-13 09:47:27 -05:00
Masami Hiramatsu (Google)
66fc6f521a tracing/hist: Support POLLPRI event for poll on histogram
Since POLLIN will not be flushed until the hist file is read, the user
needs to repeatedly read() and poll() on the hist file for monitoring the
event continuously. But the read() is somewhat redundant when the user is
only monitoring for event updates.

Add POLLPRI poll event on the hist file so the event returns when a
histogram is updated after open(), poll() or read(). Thus it is possible
to wait for the next event without having to issue a read().

Cc: Shuah Khan <shuah@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/173527248770.464571.2536902137325258133.stgit@devnote2
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Reviewed-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-01-07 11:46:32 -05:00
Masami Hiramatsu (Google)
1bd13edbbe tracing/hist: Add poll(POLLIN) support on hist file
Add poll syscall support on the `hist` file. The Waiter will be waken
up when the histogram is updated with POLLIN.

Currently, there is no way to wait for a specific event in userspace.
So user needs to peek the `trace` periodicaly, or wait on `trace_pipe`.
But it is not a good idea to peek at the `trace` for an event that
randomly happens. And `trace_pipe` is not coming back until a page is
filled with events.

This allows a user to wait for a specific event on the `hist` file. User
can set a histogram trigger on the event which they want to monitor
and poll() on its `hist` file. Since this poll() returns POLLIN, the next
poll() will return soon unless a read() happens on that hist file.

NOTE: To read the hist file again, you must set the file offset to 0,
but just for monitoring the event, you may not need to read the
histogram.

Cc: Shuah Khan <shuah@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/173527247756.464571.14236296701625509931.stgit@devnote2
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Reviewed-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-01-07 11:44:49 -05:00
Steven Rostedt
22bec11a56 tracing: Fix using ret variable in tracing_set_tracer()
When the function tracing_set_tracer() switched over to using the guard()
infrastructure, it did not need to save the 'ret' variable and would just
return the value when an error arised, instead of setting ret and jumping
to an out label.

When CONFIG_TRACER_SNAPSHOT is enabled, it had code that expected the
"ret" variable to be initialized to zero and had set 'ret' while holding
an arch_spin_lock() (not used by guard), and then upon releasing the lock
it would check 'ret' and exit if set. But because ret was only set when an
error occurred while holding the locks, 'ret' would be used uninitialized
if there was no error. The code in the CONFIG_TRACER_SNAPSHOT block should
be self contain. Make sure 'ret' is also set when no error occurred.

Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20250106111143.2f90ff65@gandalf.local.home
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/r/202412271654.nJVBuwmF-lkp@intel.com/
Fixes: d33b10c0c73ad ("tracing: Switch trace.c code over to use guard()")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2025-01-07 11:39:46 -05:00
Steven Rostedt
9e49ca756d tracing/string: Create and use __free(argv_free) in trace_dynevent.c
The function dyn_event_release() uses argv_split() which must be freed via
argv_free(). It contains several error paths that do a goto out to call
argv_free() for cleanup. This makes the code complex and error prone.

Create a new __free() directive __free(argv_free) that will call
argv_free() for data allocated with argv_split(), and use it in the
dyn_event_release() function.

Cc: Kees Cook <kees@kernel.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andy Shevchenko <andy@kernel.org>
Cc: linux-hardening@vger.kernel.org
Link: https://lore.kernel.org/20241220103313.4a74ec8e@gandalf.local.home
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-12-26 10:38:37 -05:00
Steven Rostedt
08b7673171 tracing: Switch trace_stat.c code over to use guard()
There are a couple functions in trace_stat.c that have "goto out" or
equivalent on error in order to release locks that were taken. This can be
error prone or just simply make the code more complex.

Switch every location that ends with unlocking a mutex on error over to
using the guard(mutex)() infrastructure to let the compiler worry about
releasing locks. This makes the code easier to read and understand.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/20241219201346.870318466@goodmis.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-12-26 10:38:37 -05:00
Steven Rostedt
6c05353e4f tracing: Switch trace_stack.c code over to use guard()
The function stack_trace_sysctl() uses a goto on the error path to jump to
the mutex_unlock() code. Replace the logic to use guard() and let the
compiler worry about it.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/20241225222931.684913592@goodmis.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-12-26 10:38:37 -05:00
Steven Rostedt
930d2b32c0 tracing: Switch trace_osnoise.c code over to use guard() and __free()
The osnoise_hotplug_workfn() grabs two mutexes and cpu_read_lock(). It has
various gotos to handle unlocking them. Switch them over to guard() and
let the compiler worry about it.

The osnoise_cpus_read() has a temporary mask_str allocated and there's
some gotos to make sure it gets freed on error paths. Switch that over to
__free() to let the compiler worry about it.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/20241225222931.517329690@goodmis.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-12-26 10:38:37 -05:00
Steven Rostedt
a2e27e1bb1 tracing: Switch trace_events_synth.c code over to use guard()
There are a couple functions in trace_events_synth.c that have "goto out"
or equivalent on error in order to release locks that were taken. This can
be error prone or just simply make the code more complex.

Switch every location that ends with unlocking a mutex on error over to
using the guard(mutex)() infrastructure to let the compiler worry about
releasing locks. This makes the code easier to read and understand.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/20241219201346.371082515@goodmis.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-12-26 10:38:37 -05:00
Steven Rostedt
076796f74e tracing: Switch trace_events_filter.c code over to use guard()
There are a couple functions in trace_events_filter.c that have "goto out"
or equivalent on error in order to release locks that were taken. This can
be error prone or just simply make the code more complex.

Switch every location that ends with unlocking a mutex on error over to
using the guard(mutex)() infrastructure to let the compiler worry about
releasing locks. This makes the code easier to read and understand.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/20241219201346.200737679@goodmis.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-12-26 10:38:37 -05:00
Steven Rostedt
63c7264168 tracing: Switch trace_events_trigger.c code over to use guard()
There are a few functions in trace_events_trigger.c that have "goto out" or
equivalent on error in order to release locks that were taken. This can be
error prone or just simply make the code more complex.

Switch every location that ends with unlocking a mutex on error over to
using the guard(mutex)() infrastructure to let the compiler worry about
releasing locks. This makes the code easier to read and understand.

Also use __free() for free a temporary buffer in event_trigger_regex_write().

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/20241220110621.639d3bc8@gandalf.local.home
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-12-26 10:38:37 -05:00
Steven Rostedt
2b36a97aee tracing: Switch trace_events_hist.c code over to use guard()
There are a couple functions in trace_events_hist.c that have "goto out" or
equivalent on error in order to release locks that were taken. This can be
error prone or just simply make the code more complex.

Switch every location that ends with unlocking a mutex on error over to
using the guard(mutex)() infrastructure to let the compiler worry about
releasing locks. This makes the code easier to read and understand.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/20241219201345.694601480@goodmis.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-12-26 10:38:37 -05:00
Steven Rostedt
59980d9b0b tracing: Switch trace_events.c code over to use guard()
There are several functions in trace_events.c that have "goto out;" or
equivalent on error in order to release locks that were taken. This can be
error prone or just simply make the code more complex.

Switch every location that ends with unlocking a mutex on error over to
using the guard(mutex)() infrastructure to let the compiler worry about
releasing locks. This makes the code easier to read and understand.

Some locations did some simple arithmetic after releasing the lock. As
this causes no real overhead for holding a mutex while processing the file
position (*ppos += cnt;) let the lock be held over this logic too.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/20241219201345.522546095@goodmis.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-12-26 10:38:36 -05:00
Steven Rostedt
4b8d63e5b6 tracing: Simplify event_enable_func() goto_reg logic
Currently there's an "out_reg:" label that gets jumped to if there's no
parameters to process. Instead, make it a proper "if (param) { }" block as
there's not much to do for the parameter processing, and remove the
"out_reg:" label.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/20241219201345.354746196@goodmis.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-12-26 10:38:36 -05:00
Steven Rostedt
c949dfb974 tracing: Simplify event_enable_func() goto out_free logic
The event_enable_func() function allocates the data descriptor early in
the function just to assign its data->count value via:

  kstrtoul(number, 0, &data->count);

This makes the code more complex as there are several error paths before
the data descriptor is actually used. This means there needs to be a
goto out_free; to clean it up.

Use a local variable "count" to do the update and move the data allocation
just before it is used. This removes the "out_free" label as the data can
be freed on the failure path of where it is used.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/20241219201345.190820140@goodmis.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-12-26 10:38:36 -05:00
Steven Rostedt
cad1d5bd2c tracing: Have event_enable_write() just return error on error
The event_enable_write() function is inconsistent in how it returns
errors. Sometimes it updates the ppos parameter and sometimes it doesn't.
Simplify the code to just return an error or the count if there isn't an
error.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/20241219201345.025284170@goodmis.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-12-26 10:38:36 -05:00
Steven Rostedt
d1e27ee9c6 tracing: Return -EINVAL if a boot tracer tries to enable the mmiotracer at boot
The mmiotracer is not set to be enabled at boot up from the kernel command
line. If the boot command line tries to enable that tracer, it will fail
to be enabled. The return code is currently zero when that happens so the
caller just thinks it was enabled. Return -EINVAL in this case.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/20241219201344.854254394@goodmis.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-12-26 10:38:36 -05:00
Steven Rostedt
d33b10c0c7 tracing: Switch trace.c code over to use guard()
There are several functions in trace.c that have "goto out;" or
equivalent on error in order to release locks or free values that were
allocated. This can be error prone or just simply make the code more
complex.

Switch every location that ends with unlocking a mutex or freeing on error
over to using the guard(mutex)() and __free() infrastructure to let the
compiler worry about releasing locks. This makes the code easier to read
and understand.

There's one place that should probably return an error but instead return
0. This does not change the return as the only changes are to do the
conversion without changing the logic. Fixing that location will have to
come later.

Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Link: https://lore.kernel.org/20241224221413.7b8c68c3@batman.local.home
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-12-26 10:38:17 -05:00
Masami Hiramatsu (Google)
d576aec24d fgraph: Get ftrace recursion lock in function_graph_enter
Get the ftrace recursion lock in the generic function_graph_enter()
instead of each architecture code.
This changes all function_graph tracer callbacks running in
non-preemptive state. On x86 and powerpc, this is by default, but
on the other architecutres, this will be new.

Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: Florent Revest <revest@chromium.org>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: bpf <bpf@vger.kernel.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Alan Maguire <alan.maguire@oracle.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Naveen N Rao <naveen@kernel.org>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: x86@kernel.org
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/173379653720.973433.18438622234884980494.stgit@devnote2
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-12-23 21:02:48 -05:00
Steven Rostedt
1d95fd9d6b ftrace: Switch ftrace.c code over to use guard()
There are a few functions in ftrace.c that have "goto out" or equivalent
on error in order to release locks that were taken. This can be error
prone or just simply make the code more complex.

Switch every location that ends with unlocking a mutex on error over to
using the guard(mutex)() infrastructure to let the compiler worry about
releasing locks. This makes the code easier to read and understand.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://lore.kernel.org/20241223184941.718001540@goodmis.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-12-23 21:01:48 -05:00
Steven Rostedt
77e53cb2fc ftrace: Remove unneeded goto jumps
There are some goto jumps to exit a program to just return a value. The
code after the label doesn't free anything nor does it do any unlocks. It
simply returns the variable that was set before the jump.

Remove these unneeded goto jumps.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://lore.kernel.org/20241223184941.544855549@goodmis.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-12-23 21:01:48 -05:00
Steven Rostedt
ac8c3b02fc ftrace: Do not disable interrupts in profiler
The function profiler disables interrupts before processing. This was
there since the profiler was introduced back in 2009 when there were
recursion issues to deal with. The function tracer is much more robust
today and has its own internal recursion protection. There's no reason to
disable interrupts in the function profiler.

Instead, just disable preemption and use the guard() infrastructure while
at it.

Before this change:

~# echo 1 > /sys/kernel/tracing/function_profile_enabled
~# perf stat -r 10 ./hackbench 10
Time: 3.099
Time: 2.556
Time: 2.500
Time: 2.705
Time: 2.985
Time: 2.959
Time: 2.859
Time: 2.621
Time: 2.742
Time: 2.631

 Performance counter stats for '/work/c/hackbench 10' (10 runs):

         23,156.77 msec task-clock                       #    6.951 CPUs utilized               ( +-  2.36% )
            18,306      context-switches                 #  790.525 /sec                        ( +-  5.95% )
               495      cpu-migrations                   #   21.376 /sec                        ( +-  8.61% )
            11,522      page-faults                      #  497.565 /sec                        ( +-  1.80% )
    47,967,124,606      cycles                           #    2.071 GHz                         ( +-  0.41% )
    80,009,078,371      instructions                     #    1.67  insn per cycle              ( +-  0.34% )
    16,389,249,798      branches                         #  707.752 M/sec                       ( +-  0.36% )
       139,943,109      branch-misses                    #    0.85% of all branches             ( +-  0.61% )

             3.332 +- 0.101 seconds time elapsed  ( +-  3.04% )

After this change:

~# echo 1 > /sys/kernel/tracing/function_profile_enabled
~# perf stat -r 10 ./hackbench 10
Time: 1.869
Time: 1.428
Time: 1.575
Time: 1.569
Time: 1.685
Time: 1.511
Time: 1.611
Time: 1.672
Time: 1.724
Time: 1.715

 Performance counter stats for '/work/c/hackbench 10' (10 runs):

         13,578.21 msec task-clock                       #    6.931 CPUs utilized               ( +-  2.23% )
            12,736      context-switches                 #  937.973 /sec                        ( +-  3.86% )
               341      cpu-migrations                   #   25.114 /sec                        ( +-  5.27% )
            11,378      page-faults                      #  837.960 /sec                        ( +-  1.74% )
    27,638,039,036      cycles                           #    2.035 GHz                         ( +-  0.27% )
    45,107,762,498      instructions                     #    1.63  insn per cycle              ( +-  0.23% )
     8,623,868,018      branches                         #  635.125 M/sec                       ( +-  0.27% )
       125,738,443      branch-misses                    #    1.46% of all branches             ( +-  0.32% )

            1.9590 +- 0.0484 seconds time elapsed  ( +-  2.47% )

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://lore.kernel.org/20241223184941.373853944@goodmis.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-12-23 20:44:29 -05:00
Steven Rostedt
7d137e604a fgraph: Remove unnecessary disabling of interrupts and recursion
The function graph tracer disables interrupts as well as prevents
recursion via NMIs when recording the graph tracer code. There's no reason
to do this today. That disabling goes back to 2008 when the function graph
tracer was first introduced and recursion protection wasn't part of the
code.

Today, there's no reason to disable interrupts or prevent the code from
recursing as the infrastructure can easily handle it.

Before this change:

~# echo function_graph > /sys/kernel/tracing/current_tracer
~# perf stat -r 10 ./hackbench 10
Time: 4.240
Time: 4.236
Time: 4.106
Time: 4.014
Time: 4.314
Time: 3.830
Time: 4.063
Time: 4.323
Time: 3.763
Time: 3.727

 Performance counter stats for '/work/c/hackbench 10' (10 runs):

         33,937.20 msec task-clock                       #    7.008 CPUs utilized               ( +-  1.85% )
            18,220      context-switches                 #  536.874 /sec                        ( +-  6.41% )
               624      cpu-migrations                   #   18.387 /sec                        ( +-  9.07% )
            11,319      page-faults                      #  333.528 /sec                        ( +-  1.97% )
    76,657,643,617      cycles                           #    2.259 GHz                         ( +-  0.40% )
   141,403,302,768      instructions                     #    1.84  insn per cycle              ( +-  0.37% )
    25,518,463,888      branches                         #  751.932 M/sec                       ( +-  0.35% )
       156,151,050      branch-misses                    #    0.61% of all branches             ( +-  0.63% )

            4.8423 +- 0.0892 seconds time elapsed  ( +-  1.84% )

After this change:

~# echo function_graph > /sys/kernel/tracing/current_tracer
~# perf stat -r 10 ./hackbench 10
Time: 3.340
Time: 3.192
Time: 3.129
Time: 2.579
Time: 2.589
Time: 2.798
Time: 2.791
Time: 2.955
Time: 3.044
Time: 3.065

 Performance counter stats for './hackbench 10' (10 runs):

         24,416.30 msec task-clock                       #    6.996 CPUs utilized               ( +-  2.74% )
            16,764      context-switches                 #  686.590 /sec                        ( +-  5.85% )
               469      cpu-migrations                   #   19.208 /sec                        ( +-  6.14% )
            11,519      page-faults                      #  471.775 /sec                        ( +-  1.92% )
    53,895,628,450      cycles                           #    2.207 GHz                         ( +-  0.52% )
   105,552,664,638      instructions                     #    1.96  insn per cycle              ( +-  0.47% )
    17,808,672,667      branches                         #  729.376 M/sec                       ( +-  0.48% )
       133,075,435      branch-misses                    #    0.75% of all branches             ( +-  0.59% )

             3.490 +- 0.112 seconds time elapsed  ( +-  3.22% )

Also removed unneeded "unlikely()" around the retaddr code.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://lore.kernel.org/20241223184941.204074053@goodmis.org
Fixes: 9cd2992f2d6c8 ("fgraph: Have set_graph_notrace only affect function_graph tracer") # Performance only
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-12-23 20:43:55 -05:00
Linus Torvalds
4aa748dd1a 25 hotfixes. 16 are cc:stable. 19 are MM and 6 are non-MM.
The usual bunch of singletons and doubletons - please see the relevant
 changelogs for details.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZ2cghQAKCRDdBJ7gKXxA
 jgrsAQCvlSmHYYLXBE1A6cram4qWgEP/2vD94d6sVv9UipO/FAEA8y1K7dbT2AGX
 A5ESuRndu5Iy76mb6Tiarqa/yt56QgU=
 =ZYVx
 -----END PGP SIGNATURE-----

Merge tag 'mm-hotfixes-stable-2024-12-21-12-09' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull misc fixes from Andrew Morton:
 "25 hotfixes.  16 are cc:stable.  19 are MM and 6 are non-MM.

  The usual bunch of singletons and doubletons - please see the relevant
  changelogs for details"

* tag 'mm-hotfixes-stable-2024-12-21-12-09' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (25 commits)
  mm: huge_memory: handle strsep not finding delimiter
  alloc_tag: fix set_codetag_empty() when !CONFIG_MEM_ALLOC_PROFILING_DEBUG
  alloc_tag: fix module allocation tags populated area calculation
  mm/codetag: clear tags before swap
  mm/vmstat: fix a W=1 clang compiler warning
  mm: convert partially_mapped set/clear operations to be atomic
  nilfs2: fix buffer head leaks in calls to truncate_inode_pages()
  vmalloc: fix accounting with i915
  mm/page_alloc: don't call pfn_to_page() on possibly non-existent PFN in split_large_buddy()
  fork: avoid inappropriate uprobe access to invalid mm
  nilfs2: prevent use of deleted inode
  zram: fix uninitialized ZRAM not releasing backing device
  zram: refuse to use zero sized block device as backing device
  mm: use clear_user_(high)page() for arch with special user folio handling
  mm: introduce cpu_icache_is_aliasing() across all architectures
  mm: add RCU annotation to pte_offset_map(_lock)
  mm: correctly reference merged VMA
  mm: use aligned address in copy_user_gigantic_page()
  mm: use aligned address in clear_gigantic_page()
  mm: shmem: fix ShmemHugePages at swapout
  ...
2024-12-21 15:31:56 -08:00
Linus Torvalds
9c707ba99f BPF fixes:
- Fix inlining of bpf_get_smp_processor_id helper for !CONFIG_SMP
   systems (Andrea Righi)
 
 - Fix BPF USDT selftests helper code to use asm constraint "m"
   for LoongArch (Tiezhu Yang)
 
 - Fix BPF selftest compilation error in get_uprobe_offset when
   PROCMAP_QUERY is not defined (Jerome Marchand)
 
 - Fix BPF bpf_skb_change_tail helper when used in context of
   BPF sockmap to handle negative skb header offsets (Cong Wang)
 
 - Several fixes to BPF sockmap code, among others, in the area
   of socket buffer accounting (Levi Zim, Zijian Zhang, Cong Wang)
 
 Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
 -----BEGIN PGP SIGNATURE-----
 
 iIsEABYKADMWIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZ2YJABUcZGFuaWVsQGlv
 Z2VhcmJveC5uZXQACgkQ2yufC7HISINDEgD+N4uVg+rp8Z8pg9jcai4WUERmRG20
 NcQTfBXczLHkwIcBALvn7NVvbTAINJzBTnukbjX3XbWFz2cJ/xHxDYXycP4I
 =SwXG
 -----END PGP SIGNATURE-----

Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf

Pull BPF fixes from Daniel Borkmann:

 - Fix inlining of bpf_get_smp_processor_id helper for !CONFIG_SMP
   systems (Andrea Righi)

 - Fix BPF USDT selftests helper code to use asm constraint "m" for
   LoongArch (Tiezhu Yang)

 - Fix BPF selftest compilation error in get_uprobe_offset when
   PROCMAP_QUERY is not defined (Jerome Marchand)

 - Fix BPF bpf_skb_change_tail helper when used in context of BPF
   sockmap to handle negative skb header offsets (Cong Wang)

 - Several fixes to BPF sockmap code, among others, in the area of
   socket buffer accounting (Levi Zim, Zijian Zhang, Cong Wang)

* tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  selftests/bpf: Test bpf_skb_change_tail() in TC ingress
  selftests/bpf: Introduce socket_helpers.h for TC tests
  selftests/bpf: Add a BPF selftest for bpf_skb_change_tail()
  bpf: Check negative offsets in __bpf_skb_min_len()
  tcp_bpf: Fix copied value in tcp_bpf_sendmsg
  skmsg: Return copied bytes in sk_msg_memcopy_from_iter
  tcp_bpf: Add sk_rmem_alloc related logic for tcp_bpf ingress redirection
  tcp_bpf: Charge receive socket buffer in bpf_tcp_ingress()
  selftests/bpf: Fix compilation error in get_uprobe_offset()
  selftests/bpf: Use asm constraint "m" for LoongArch
  bpf: Fix bpf_get_smp_processor_id() on !CONFIG_SMP
2024-12-21 11:07:19 -08:00
Linus Torvalds
5b83bcdea5 ring-buffer fixes for v6.13:
- Fix possible overflow of mmapped ring buffer with bad offset
 
   If the mmap() to the ring buffer passes in a start address that
   is passed the end of the mmapped file, it is not caught and
   a slab-out-of-bounds is triggered.
 
   Add a check to make sure the start address is within the bounds
 
 - Do not use TP_printk() to boot mapped ring buffers
 
   As a boot mapped ring buffer's data may have pointers that map to
   the previous boot's memory map, it is unsafe to allow the TP_printk()
   to be used to read the boot mapped buffer's events. If a TP_printk()
   points to a static string from within the kernel it will not match
   the current kernel mapping if KASLR is active, and it can fault.
 
   Have it simply print out the raw fields.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZ2QuXRQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qncvAQDf2s2WWsy4pYp2mpRtBXvAPf6tpBdi
 J9eceJQbwJVJHAEApQjEFfbUxLh2WgPU1Cn++PwDA+NLiru70+S0vtDLWwE=
 =OI+v
 -----END PGP SIGNATURE-----

Merge tag 'trace-ringbuffer-v6.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull ring-buffer fixes from Steven Rostedt:

 - Fix possible overflow of mmapped ring buffer with bad offset

   If the mmap() to the ring buffer passes in a start address that is
   passed the end of the mmapped file, it is not caught and a
   slab-out-of-bounds is triggered.

   Add a check to make sure the start address is within the bounds

 - Do not use TP_printk() to boot mapped ring buffers

   As a boot mapped ring buffer's data may have pointers that map to the
   previous boot's memory map, it is unsafe to allow the TP_printk() to
   be used to read the boot mapped buffer's events. If a TP_printk()
   points to a static string from within the kernel it will not match
   the current kernel mapping if KASLR is active, and it can fault.

   Have it simply print out the raw fields.

* tag 'trace-ringbuffer-v6.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  trace/ring-buffer: Do not use TP_printk() formatting for boot mapped buffers
  ring-buffer: Fix overflow in __rb_map_vma
2024-12-20 10:13:26 -08:00
Lorenzo Stoakes
8ac662f5da fork: avoid inappropriate uprobe access to invalid mm
If dup_mmap() encounters an issue, currently uprobe is able to access the
relevant mm via the reverse mapping (in build_map_info()), and if we are
very unlucky with a race window, observe invalid XA_ZERO_ENTRY state which
we establish as part of the fork error path.

This occurs because uprobe_write_opcode() invokes anon_vma_prepare() which
in turn invokes find_mergeable_anon_vma() that uses a VMA iterator,
invoking vma_iter_load() which uses the advanced maple tree API and thus
is able to observe XA_ZERO_ENTRY entries added to dup_mmap() in commit
d24062914837 ("fork: use __mt_dup() to duplicate maple tree in
dup_mmap()").

This change was made on the assumption that only process tear-down code
would actually observe (and make use of) these values.  However this very
unlikely but still possible edge case with uprobes exists and
unfortunately does make these observable.

The uprobe operation prevents races against the dup_mmap() operation via
the dup_mmap_sem semaphore, which is acquired via uprobe_start_dup_mmap()
and dropped via uprobe_end_dup_mmap(), and held across
register_for_each_vma() prior to invoking build_map_info() which does the
reverse mapping lookup.

Currently these are acquired and dropped within dup_mmap(), which exposes
the race window prior to error handling in the invoking dup_mm() which
tears down the mm.

We can avoid all this by just moving the invocation of
uprobe_start_dup_mmap() and uprobe_end_dup_mmap() up a level to dup_mm()
and only release this lock once the dup_mmap() operation succeeds or clean
up is done.

This means that the uprobe code can never observe an incompletely
constructed mm and resolves the issue in this case.

Link: https://lkml.kernel.org/r/20241210172412.52995-1-lorenzo.stoakes@oracle.com
Fixes: d24062914837 ("fork: use __mt_dup() to duplicate maple tree in dup_mmap()")
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reported-by: syzbot+2d788f4f7cb660dac4b7@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/6756d273.050a0220.2477f.003d.GAE@google.com/
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Ian Rogers <irogers@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Liam R. Howlett <Liam.Howlett@Oracle.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peng Zhang <zhangpeng.00@bytedance.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-12-18 19:04:44 -08:00
Steven Rostedt
8cd63406d0 trace/ring-buffer: Do not use TP_printk() formatting for boot mapped buffers
The TP_printk() of a TRACE_EVENT() is a generic printf format that any
developer can create for their event. It may include pointers to strings
and such. A boot mapped buffer may contain data from a previous kernel
where the strings addresses are different.

One solution is to copy the event content and update the pointers by the
recorded delta, but a simpler solution (for now) is to just use the
print_fields() function to print these events. The print_fields() function
just iterates the fields and prints them according to what type they are,
and ignores the TP_printk() format from the event itself.

To understand the difference, when printing via TP_printk() the output
looks like this:

  4582.696626: kmem_cache_alloc: call_site=getname_flags+0x47/0x1f0 ptr=00000000e70e10e0 bytes_req=4096 bytes_alloc=4096 gfp_flags=GFP_KERNEL node=-1 accounted=false
  4582.696629: kmem_cache_alloc: call_site=alloc_empty_file+0x6b/0x110 ptr=0000000095808002 bytes_req=360 bytes_alloc=384 gfp_flags=GFP_KERNEL node=-1 accounted=false
  4582.696630: kmem_cache_alloc: call_site=security_file_alloc+0x24/0x100 ptr=00000000576339c3 bytes_req=16 bytes_alloc=16 gfp_flags=GFP_KERNEL|__GFP_ZERO node=-1 accounted=false
  4582.696653: kmem_cache_free: call_site=do_sys_openat2+0xa7/0xd0 ptr=00000000e70e10e0 name=names_cache

But when printing via print_fields() (echo 1 > /sys/kernel/tracing/options/fields)
the same event output looks like this:

  4582.696626: kmem_cache_alloc: call_site=0xffffffff92d10d97 (-1831793257) ptr=0xffff9e0e8571e000 (-107689771147264) bytes_req=0x1000 (4096) bytes_alloc=0x1000 (4096) gfp_flags=0xcc0 (3264) node=0xffffffff (-1) accounted=(0)
  4582.696629: kmem_cache_alloc: call_site=0xffffffff92d0250b (-1831852789) ptr=0xffff9e0e8577f800 (-107689770747904) bytes_req=0x168 (360) bytes_alloc=0x180 (384) gfp_flags=0xcc0 (3264) node=0xffffffff (-1) accounted=(0)
  4582.696630: kmem_cache_alloc: call_site=0xffffffff92efca74 (-1829778828) ptr=0xffff9e0e8d35d3b0 (-107689640864848) bytes_req=0x10 (16) bytes_alloc=0x10 (16) gfp_flags=0xdc0 (3520) node=0xffffffff (-1) accounted=(0)
  4582.696653: kmem_cache_free: call_site=0xffffffff92cfbea7 (-1831879001) ptr=0xffff9e0e8571e000 (-107689771147264) name=names_cache

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/20241218141507.28389a1d@gandalf.local.home
Fixes: 07714b4bb3f98 ("tracing: Handle old buffer mappings for event strings and functions")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-12-18 14:20:38 -05:00
Edward Adam Davis
c58a812c8e ring-buffer: Fix overflow in __rb_map_vma
An overflow occurred when performing the following calculation:

   nr_pages = ((nr_subbufs + 1) << subbuf_order) - pgoff;

Add a check before the calculation to avoid this problem.

syzbot reported this as a slab-out-of-bounds in __rb_map_vma:

BUG: KASAN: slab-out-of-bounds in __rb_map_vma+0x9ab/0xae0 kernel/trace/ring_buffer.c:7058
Read of size 8 at addr ffff8880767dd2b8 by task syz-executor187/5836

CPU: 0 UID: 0 PID: 5836 Comm: syz-executor187 Not tainted 6.13.0-rc2-syzkaller-00159-gf932fb9b4074 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 11/25/2024
Call Trace:
 <TASK>
 __dump_stack lib/dump_stack.c:94 [inline]
 dump_stack_lvl+0x116/0x1f0 lib/dump_stack.c:120
 print_address_description mm/kasan/report.c:378 [inline]
 print_report+0xc3/0x620 mm/kasan/report.c:489
 kasan_report+0xd9/0x110 mm/kasan/report.c:602
 __rb_map_vma+0x9ab/0xae0 kernel/trace/ring_buffer.c:7058
 ring_buffer_map+0x56e/0x9b0 kernel/trace/ring_buffer.c:7138
 tracing_buffers_mmap+0xa6/0x120 kernel/trace/trace.c:8482
 call_mmap include/linux/fs.h:2183 [inline]
 mmap_file mm/internal.h:124 [inline]
 __mmap_new_file_vma mm/vma.c:2291 [inline]
 __mmap_new_vma mm/vma.c:2355 [inline]
 __mmap_region+0x1786/0x2670 mm/vma.c:2456
 mmap_region+0x127/0x320 mm/mmap.c:1348
 do_mmap+0xc00/0xfc0 mm/mmap.c:496
 vm_mmap_pgoff+0x1ba/0x360 mm/util.c:580
 ksys_mmap_pgoff+0x32c/0x5c0 mm/mmap.c:542
 __do_sys_mmap arch/x86/kernel/sys_x86_64.c:89 [inline]
 __se_sys_mmap arch/x86/kernel/sys_x86_64.c:82 [inline]
 __x64_sys_mmap+0x125/0x190 arch/x86/kernel/sys_x86_64.c:82
 do_syscall_x64 arch/x86/entry/common.c:52 [inline]
 do_syscall_64+0xcd/0x250 arch/x86/entry/common.c:83
 entry_SYSCALL_64_after_hwframe+0x77/0x7f

The reproducer for this bug is:

------------------------8<-------------------------
 #include <fcntl.h>
 #include <stdlib.h>
 #include <unistd.h>
 #include <asm/types.h>
 #include <sys/mman.h>

 int main(int argc, char **argv)
 {
	int page_size = getpagesize();
	int fd;
	void *meta;

	system("echo 1 > /sys/kernel/tracing/buffer_size_kb");
	fd = open("/sys/kernel/tracing/per_cpu/cpu0/trace_pipe_raw", O_RDONLY);

	meta = mmap(NULL, page_size, PROT_READ, MAP_SHARED, fd, page_size * 5);
 }
------------------------>8-------------------------

Cc: stable@vger.kernel.org
Fixes: 117c39200d9d7 ("ring-buffer: Introducing ring-buffer mapping functions")
Link: https://lore.kernel.org/tencent_06924B6674ED771167C23CC336C097223609@qq.com
Reported-by: syzbot+345e4443a21200874b18@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=345e4443a21200874b18
Signed-off-by: Edward Adam Davis <eadavis@qq.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-12-18 14:15:10 -05:00
Linus Torvalds
c061cf420d tracing fixes for v6.13:
- Replace trace_check_vprintf() with test_event_printk() and ignore_event()
 
   The function test_event_printk() checks on boot up if the trace event
   printf() formats dereference any pointers, and if they do, it then looks
   at the arguments to make sure that the pointers they dereference will
   exist in the event on the ring buffer. If they do not, it issues a
   WARN_ON() as it is a likely bug.
 
   But this isn't the case for the strings that can be dereferenced with
   "%s", as some trace events (notably RCU and some IPI events) save
   a pointer to a static string in the ring buffer. As the string it
   points to lives as long as the kernel is running, it is not a bug
   to reference it, as it is guaranteed to be there when the event is read.
   But it is also possible (and a common bug) to point to some allocated
   string that could be freed before the trace event is read and the
   dereference is to bad memory. This case requires a run time check.
 
   The previous way to handle this was with trace_check_vprintf() that would
   process the printf format piece by piece and send what it didn't care
   about to vsnprintf() to handle arguments that were not strings. This
   kept it from having to reimplement vsnprintf(). But it relied on va_list
   implementation and for architectures that copied the va_list and did
   not pass it by reference, it wasn't even possible to do this check and
   it would be skipped. As 64bit x86 passed va_list by reference, most
   events were tested and this kept out bugs where strings would have been
   dereferenced after being freed.
 
   Instead of relying on the implementation of va_list, extend the boot up
   test_event_printk() function to validate all the "%s" strings that
   can be validated at boot, and for the few events that point to strings
   outside the ring buffer, flag both the event and the field that is
   dereferenced as "needs_test". Then before the event is printed, a call
   to ignore_event() is made, and if the event has the flag set, it iterates
   all its fields and for every field that is to be tested, it will read
   the pointer directly from the event in the ring buffer and make sure
   that it is valid. If the pointer is not valid, it will print a WARN_ON(),
   print out to the trace that the event has unsafe memory and ignore
   the print format.
 
   With this new update, the trace_check_vprintf() can be safely removed
   and now all events can be verified regardless of architecture.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZ2IqiRQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qlfgAP9hJFl6zhA5GGRo905G9JWFHkbNNjgp
 WfQ0oMU2Eo1q+AEAmb5d3wWfWJAa+AxiiDNeZ28En/+ZbmjhSe6fPpR4egU=
 =LRKi
 -----END PGP SIGNATURE-----

Merge tag 'trace-v6.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull tracing fixes from Steven Rostedt:
 "Replace trace_check_vprintf() with test_event_printk() and
  ignore_event()

  The function test_event_printk() checks on boot up if the trace event
  printf() formats dereference any pointers, and if they do, it then
  looks at the arguments to make sure that the pointers they dereference
  will exist in the event on the ring buffer. If they do not, it issues
  a WARN_ON() as it is a likely bug.

  But this isn't the case for the strings that can be dereferenced with
  "%s", as some trace events (notably RCU and some IPI events) save a
  pointer to a static string in the ring buffer. As the string it points
  to lives as long as the kernel is running, it is not a bug to
  reference it, as it is guaranteed to be there when the event is read.
  But it is also possible (and a common bug) to point to some allocated
  string that could be freed before the trace event is read and the
  dereference is to bad memory. This case requires a run time check.

  The previous way to handle this was with trace_check_vprintf() that
  would process the printf format piece by piece and send what it didn't
  care about to vsnprintf() to handle arguments that were not strings.
  This kept it from having to reimplement vsnprintf(). But it relied on
  va_list implementation and for architectures that copied the va_list
  and did not pass it by reference, it wasn't even possible to do this
  check and it would be skipped. As 64bit x86 passed va_list by
  reference, most events were tested and this kept out bugs where
  strings would have been dereferenced after being freed.

  Instead of relying on the implementation of va_list, extend the boot
  up test_event_printk() function to validate all the "%s" strings that
  can be validated at boot, and for the few events that point to strings
  outside the ring buffer, flag both the event and the field that is
  dereferenced as "needs_test". Then before the event is printed, a call
  to ignore_event() is made, and if the event has the flag set, it
  iterates all its fields and for every field that is to be tested, it
  will read the pointer directly from the event in the ring buffer and
  make sure that it is valid. If the pointer is not valid, it will print
  a WARN_ON(), print out to the trace that the event has unsafe memory
  and ignore the print format.

  With this new update, the trace_check_vprintf() can be safely removed
  and now all events can be verified regardless of architecture"

* tag 'trace-v6.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing: Check "%s" dereference via the field and not the TP_printk format
  tracing: Add "%s" check in test_event_printk()
  tracing: Add missing helper functions in event pointer dereference check
  tracing: Fix test_event_printk() to process entire print argument
2024-12-18 10:03:33 -08:00
Andrea Righi
23579010cf bpf: Fix bpf_get_smp_processor_id() on !CONFIG_SMP
On x86-64 calling bpf_get_smp_processor_id() in a kernel with CONFIG_SMP
disabled can trigger the following bug, as pcpu_hot is unavailable:

 [    8.471774] BUG: unable to handle page fault for address: 00000000936a290c
 [    8.471849] #PF: supervisor read access in kernel mode
 [    8.471881] #PF: error_code(0x0000) - not-present page

Fix by inlining a return 0 in the !CONFIG_SMP case.

Fixes: 1ae6921009e5 ("bpf: inline bpf_get_smp_processor_id() helper")
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20241217195813.622568-1-arighi@nvidia.com
2024-12-17 16:09:24 -08:00
Linus Torvalds
5529876063 Ftrace fixes for 6.13:
- Always try to initialize the idle functions when graph tracer starts
 
   A bug was found that when a CPU is offline when graph tracing starts
   and then comes online, that CPU is not traced. The fix to that was
   to move the initialization of the idle shadow stack over to the
   hot plug online logic, which also handle onlined CPUs. The issue was
   that it removed the initialization of the shadow stack when graph tracing
   starts, but the callbacks to the hot plug logic do nothing if graph
   tracing isn't currently running. Although that fix fixed the onlining
   of a CPU during tracing, it broke the CPUs that were already online.
 
 - Have microblaze not try to get the "true parent" in function tracing
 
   If function tracing and graph tracing are both enabled at the same time
   the parent of the functions traced by the function tracer may sometimes
   be the graph tracing trampoline. The graph tracing hijacks the return
   pointer of the function to trace it, but that can interfere with the
   function tracing parent output. This was fixed by using the
   ftrace_graph_ret_addr() function passing in the kernel stack pointer
   using the ftrace_regs_get_stack_pointer() function. But Al Viro reported
   that Microblaze does not implement the kernel_stack_pointer(regs)
   helper function that ftrace_regs_get_stack_pointer() uses and fails
   to compile when function graph tracing is enabled.
 
   It was first thought that this was a microblaze issue, but the real
   cause is that this only works when an architecture implements
   HAVE_DYNAMIC_FTRACE_WITH_ARGS, as a requirement for that config
   is to have ftrace always pass a valid ftrace_regs to the callbacks.
   That also means that the architecture supports ftrace_regs_get_stack_pointer()
   Microblaze does not set HAVE_DYNAMIC_FTRACE_WITH_ARGS nor does it
   implement ftrace_regs_get_stack_pointer() which caused it to fail to
   build. Only implement the "true parent" logic if an architecture has
   that config set.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZ2GoLxQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qrooAQCY2e6mwLFIb3HttmC5KikrEE48YLOj
 QEz3UGb2zrxVTQD/ebYtXTiZSU/oS+CHdDsXhKSq7jKdLlRWjqUTx81PJQs=
 =mvcR
 -----END PGP SIGNATURE-----

Merge tag 'ftrace-v6.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull ftrace fixes from Steven Rostedt:

 - Always try to initialize the idle functions when graph tracer starts

   A bug was found that when a CPU is offline when graph tracing starts
   and then comes online, that CPU is not traced. The fix to that was to
   move the initialization of the idle shadow stack over to the hot plug
   online logic, which also handle onlined CPUs. The issue was that it
   removed the initialization of the shadow stack when graph tracing
   starts, but the callbacks to the hot plug logic do nothing if graph
   tracing isn't currently running. Although that fix fixed the onlining
   of a CPU during tracing, it broke the CPUs that were already online.

 - Have microblaze not try to get the "true parent" in function tracing

   If function tracing and graph tracing are both enabled at the same
   time the parent of the functions traced by the function tracer may
   sometimes be the graph tracing trampoline. The graph tracing hijacks
   the return pointer of the function to trace it, but that can
   interfere with the function tracing parent output.

   This was fixed by using the ftrace_graph_ret_addr() function passing
   in the kernel stack pointer using the ftrace_regs_get_stack_pointer()
   function. But Al Viro reported that Microblaze does not implement the
   kernel_stack_pointer(regs) helper function that
   ftrace_regs_get_stack_pointer() uses and fails to compile when
   function graph tracing is enabled.

   It was first thought that this was a microblaze issue, but the real
   cause is that this only works when an architecture implements
   HAVE_DYNAMIC_FTRACE_WITH_ARGS, as a requirement for that config is to
   have ftrace always pass a valid ftrace_regs to the callbacks. That
   also means that the architecture supports
   ftrace_regs_get_stack_pointer()

   Microblaze does not set HAVE_DYNAMIC_FTRACE_WITH_ARGS nor does it
   implement ftrace_regs_get_stack_pointer() which caused it to fail to
   build. Only implement the "true parent" logic if an architecture has
   that config set"

* tag 'ftrace-v6.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  ftrace: Do not find "true_parent" if HAVE_DYNAMIC_FTRACE_WITH_ARGS is not set
  fgraph: Still initialize idle shadow stacks when starting
2024-12-17 09:14:31 -08:00
Steven Rostedt
afd2627f72 tracing: Check "%s" dereference via the field and not the TP_printk format
The TP_printk() portion of a trace event is executed at the time a event
is read from the trace. This can happen seconds, minutes, hours, days,
months, years possibly later since the event was recorded. If the print
format contains a dereference to a string via "%s", and that string was
allocated, there's a chance that string could be freed before it is read
by the trace file.

To protect against such bugs, there are two functions that verify the
event. The first one is test_event_printk(), which is called when the
event is created. It reads the TP_printk() format as well as its arguments
to make sure nothing may be dereferencing a pointer that was not copied
into the ring buffer along with the event. If it is, it will trigger a
WARN_ON().

For strings that use "%s", it is not so easy. The string may not reside in
the ring buffer but may still be valid. Strings that are static and part
of the kernel proper which will not be freed for the life of the running
system, are safe to dereference. But to know if it is a pointer to a
static string or to something on the heap can not be determined until the
event is triggered.

This brings us to the second function that tests for the bad dereferencing
of strings, trace_check_vprintf(). It would walk through the printf format
looking for "%s", and when it finds it, it would validate that the pointer
is safe to read. If not, it would produces a WARN_ON() as well and write
into the ring buffer "[UNSAFE-MEMORY]".

The problem with this is how it used va_list to have vsnprintf() handle
all the cases that it didn't need to check. Instead of re-implementing
vsnprintf(), it would make a copy of the format up to the %s part, and
call vsnprintf() with the current va_list ap variable, where the ap would
then be ready to point at the string in question.

For architectures that passed va_list by reference this was possible. For
architectures that passed it by copy it was not. A test_can_verify()
function was used to differentiate between the two, and if it wasn't
possible, it would disable it.

Even for architectures where this was feasible, it was a stretch to rely
on such a method that is undocumented, and could cause issues later on
with new optimizations of the compiler.

Instead, the first function test_event_printk() was updated to look at
"%s" as well. If the "%s" argument is a pointer outside the event in the
ring buffer, it would find the field type of the event that is the problem
and mark the structure with a new flag called "needs_test". The event
itself will be marked by TRACE_EVENT_FL_TEST_STR to let it be known that
this event has a field that needs to be verified before the event can be
printed using the printf format.

When the event fields are created from the field type structure, the
fields would copy the field type's "needs_test" value.

Finally, before being printed, a new function ignore_event() is called
which will check if the event has the TEST_STR flag set (if not, it
returns false). If the flag is set, it then iterates through the events
fields looking for the ones that have the "needs_test" flag set.

Then it uses the offset field from the field structure to find the pointer
in the ring buffer event. It runs the tests to make sure that pointer is
safe to print and if not, it triggers the WARN_ON() and also adds to the
trace output that the event in question has an unsafe memory access.

The ignore_event() makes the trace_check_vprintf() obsolete so it is
removed.

Link: https://lore.kernel.org/all/CAHk-=wh3uOnqnZPpR0PeLZZtyWbZLboZ7cHLCKRWsocvs9Y7hQ@mail.gmail.com/

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/20241217024720.848621576@goodmis.org
Fixes: 5013f454a352c ("tracing: Add check of trace event print fmts for dereferencing pointers")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-12-17 11:40:11 -05:00
Steven Rostedt
65a25d9f7a tracing: Add "%s" check in test_event_printk()
The test_event_printk() code makes sure that when a trace event is
registered, any dereferenced pointers in from the event's TP_printk() are
pointing to content in the ring buffer. But currently it does not handle
"%s", as there's cases where the string pointer saved in the ring buffer
points to a static string in the kernel that will never be freed. As that
is a valid case, the pointer needs to be checked at runtime.

Currently the runtime check is done via trace_check_vprintf(), but to not
have to replicate everything in vsnprintf() it does some logic with the
va_list that may not be reliable across architectures. In order to get rid
of that logic, more work in the test_event_printk() needs to be done. Some
of the strings can be validated at this time when it is obvious the string
is valid because the string will be saved in the ring buffer content.

Do all the validation of strings in the ring buffer at boot in
test_event_printk(), and make sure that the field of the strings that
point into the kernel are accessible. This will allow adding checks at
runtime that will validate the fields themselves and not rely on paring
the TP_printk() format at runtime.

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/20241217024720.685917008@goodmis.org
Fixes: 5013f454a352c ("tracing: Add check of trace event print fmts for dereferencing pointers")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-12-17 11:40:11 -05:00
Steven Rostedt
917110481f tracing: Add missing helper functions in event pointer dereference check
The process_pointer() helper function looks to see if various trace event
macros are used. These macros are for storing data in the event. This
makes it safe to dereference as the dereference will then point into the
event on the ring buffer where the content of the data stays with the
event itself.

A few helper functions were missing. Those were:

  __get_rel_dynamic_array()
  __get_dynamic_array_len()
  __get_rel_dynamic_array_len()
  __get_rel_sockaddr()

Also add a helper function find_print_string() to not need to use a middle
man variable to test if the string exists.

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/20241217024720.521836792@goodmis.org
Fixes: 5013f454a352c ("tracing: Add check of trace event print fmts for dereferencing pointers")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-12-17 11:40:11 -05:00
Steven Rostedt
a6629626c5 tracing: Fix test_event_printk() to process entire print argument
The test_event_printk() analyzes print formats of trace events looking for
cases where it may dereference a pointer that is not in the ring buffer
which can possibly be a bug when the trace event is read from the ring
buffer and the content of that pointer no longer exists.

The function needs to accurately go from one print format argument to the
next. It handles quotes and parenthesis that may be included in an
argument. When it finds the start of the next argument, it uses a simple
"c = strstr(fmt + i, ',')" to find the end of that argument!

In order to include "%s" dereferencing, it needs to process the entire
content of the print format argument and not just the content of the first
',' it finds. As there may be content like:

 ({ const char *saved_ptr = trace_seq_buffer_ptr(p); static const char
   *access_str[] = { "---", "--x", "w--", "w-x", "-u-", "-ux", "wu-", "wux"
   }; union kvm_mmu_page_role role; role.word = REC->role;
   trace_seq_printf(p, "sp gen %u gfn %llx l%u %u-byte q%u%s %s%s" " %snxe
   %sad root %u %s%c", REC->mmu_valid_gen, REC->gfn, role.level,
   role.has_4_byte_gpte ? 4 : 8, role.quadrant, role.direct ? " direct" : "",
   access_str[role.access], role.invalid ? " invalid" : "", role.efer_nx ? ""
   : "!", role.ad_disabled ? "!" : "", REC->root_count, REC->unsync ?
   "unsync" : "sync", 0); saved_ptr; })

Which is an example of a full argument of an existing event. As the code
already handles finding the next print format argument, process the
argument at the end of it and not the start of it. This way it has both
the start of the argument as well as the end of it.

Add a helper function "process_pointer()" that will do the processing during
the loop as well as at the end. It also makes the code cleaner and easier
to read.

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/20241217024720.362271189@goodmis.org
Fixes: 5013f454a352c ("tracing: Add check of trace event print fmts for dereferencing pointers")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-12-17 11:40:11 -05:00
Linus Torvalds
59dbb9d81a XSA-465 and XSA-466 security patches for v6.13
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRTLbB6QfY48x44uB6AXGG7T9hjvgUCZ2EoeQAKCRCAXGG7T9hj
 vv0FAQDvP7/oSa3bx1rNrlBbmaTOCqAFX9HJRcb39OUsYyzqgQEAt7jGG6uau+xO
 VRAE1u/s+9PA0VGQK8/+HEm0kGYA7wA=
 =CiGc
 -----END PGP SIGNATURE-----

Merge tag 'xsa465+xsa466-6.13-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip

Pull xen fixes from Juergen Gross:
 "Fix xen netfront crash (XSA-465) and avoid using the hypercall page
  that doesn't do speculation mitigations (XSA-466)"

* tag 'xsa465+xsa466-6.13-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
  x86/xen: remove hypercall page
  x86/xen: use new hypercall functions instead of hypercall page
  x86/xen: add central hypercall functions
  x86/xen: don't do PV iret hypercall through hypercall page
  x86/static-call: provide a way to do very early static-call updates
  objtool/x86: allow syscall instruction
  x86: make get_cpu_vendor() accessible from Xen code
  xen/netfront: fix crash when removing device
2024-12-17 08:29:58 -08:00
Steven Rostedt
166438a432 ftrace: Do not find "true_parent" if HAVE_DYNAMIC_FTRACE_WITH_ARGS is not set
When function tracing and function graph tracing are both enabled (in
different instances) the "parent" of some of the function tracing events
is "return_to_handler" which is the trampoline used by function graph
tracing. To fix this, ftrace_get_true_parent_ip() was introduced that
returns the "true" parent ip instead of the trampoline.

To do this, the ftrace_regs_get_stack_pointer() is used, which uses
kernel_stack_pointer(). The problem is that microblaze does not implement
kerenl_stack_pointer() so when function graph tracing is enabled, the
build fails. But microblaze also does not enabled HAVE_DYNAMIC_FTRACE_WITH_ARGS.
That option has to be enabled by the architecture to reliably get the
values from the fregs parameter passed in. When that config is not set,
the architecture can also pass in NULL, which is not tested for in that
function and could cause the kernel to crash.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Jeff Xie <jeff.xie@linux.dev>
Link: https://lore.kernel.org/20241216164633.6df18e87@gandalf.local.home
Fixes: 60b1f578b578 ("ftrace: Get the true parent ip for function tracer")
Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-12-16 17:22:26 -05:00
Steven Rostedt
cc252bb592 fgraph: Still initialize idle shadow stacks when starting
A bug was discovered where the idle shadow stacks were not initialized
for offline CPUs when starting function graph tracer, and when they came
online they were not traced due to the missing shadow stack. To fix
this, the idle task shadow stack initialization was moved to using the
CPU hotplug callbacks. But it removed the initialization when the
function graph was enabled. The problem here is that the hotplug
callbacks are called when the CPUs come online, but the idle shadow
stack initialization only happens if function graph is currently
active. This caused the online CPUs to not get their shadow stack
initialized.

The idle shadow stack initialization still needs to be done when the
function graph is registered, as they will not be allocated if function
graph is not registered.

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20241211135335.094ba282@batman.local.home
Fixes: 2c02f7375e65 ("fgraph: Use CPU hotplug mechanism to initialize idle shadow stacks")
Reported-by: Linus Walleij <linus.walleij@linaro.org>
Tested-by: Linus Walleij <linus.walleij@linaro.org>
Closes: https://lore.kernel.org/all/CACRpkdaTBrHwRbbrphVy-=SeDz6MSsXhTKypOtLrTQ+DgGAOcQ@mail.gmail.com/
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-12-16 16:03:33 -05:00
Linus Torvalds
acd855a949 - Prevent incorrect dequeueing of the deadline dlserver helper task and fix
its time accounting
 
 - Properly track the CFS runqueue runnable stats
 
 - Check the total number of all queued tasks in a sched fair's runqueue
   hierarchy before deciding to stop the tick
 
 - Fix the scheduling of the task that got woken last (NEXT_BUDDY) by
   preventing those from being delayed
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmdexEsACgkQEsHwGGHe
 VUpFqA//SIIbNJEIQEwGkFrYpGwVpSISm94L4ENsrkWbJWQlALwQEBJF9Me/DOZH
 vHaX3o+cMxt26W7o0NKyPcvYtulnOr33HZA/uxK35MDaUinSA3Spt3jXHfR3n0mL
 ljNQQraWHGaJh7dzKMZoxP6DR78/Z0yotXjt33xeBFMSJuzGsklrbIiSJ6c4m/3u
 Y1lrQT8LncsxJMYIPAKtBAc9hvJfGFV6IOTaTfxP0oTuDo/2qTNVHm7to40wk3NW
 kb0lf2kjVtE6mwMfEm49rtjE3h0VnPJKGKoEkLi9IQoPbQq9Uf4i9VSmRe3zqPAz
 yBxV8BAu2koscMZzqw1CTnd9c/V+/A9qOOHfDo72I5MriJ1qVWCEsqB1y3u2yT6n
 XjwFDbPiVKI8H9YlsZpWERocCRypshevPNlYOF93PlK+YTXoMWaXMQhec5NDzLLw
 Se1K2sCi3U8BMdln0dH6nhk0unzNKQ8UKzrMFncSjnpWhpJ69uxyUZ/jL//6bvfi
 Z+7G4U54mUhGyOAaUSGH/20TnZRWJ7NJC542omFgg9v0VLxx+wnZyX4zJIV0jvRr
 6voYmYDCO8zn/hO67VBJuei97ayIzxDNP1tVl15LzcvRcIGWNUPOwp5jijv8vDJG
 lJhQrMF6w4fgPItC20FvptlDvpP9cItSzyyOeg074HjDS53QN2Y=
 =jOb3
 -----END PGP SIGNATURE-----

Merge tag 'sched_urgent_for_v6.13_rc3-p2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler fixes from Borislav Petkov:

 - Prevent incorrect dequeueing of the deadline dlserver helper task and
   fix its time accounting

 - Properly track the CFS runqueue runnable stats

 - Check the total number of all queued tasks in a sched fair's runqueue
   hierarchy before deciding to stop the tick

 - Fix the scheduling of the task that got woken last (NEXT_BUDDY) by
   preventing those from being delayed

* tag 'sched_urgent_for_v6.13_rc3-p2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/dlserver: Fix dlserver time accounting
  sched/dlserver: Fix dlserver double enqueue
  sched/eevdf: More PELT vs DELAYED_DEQUEUE
  sched/fair: Fix sched_can_stop_tick() for fair tasks
  sched/fair: Fix NEXT_BUDDY
2024-12-15 09:38:03 -08:00
Linus Torvalds
35f301dd45 BPF fixes:
- Fix a bug in the BPF verifier to track changes to packet data
   property for global functions (Eduard Zingerman)
 
 - Fix a theoretical BPF prog_array use-after-free in RCU handling
   of __uprobe_perf_func (Jann Horn)
 
 - Fix BPF tracing to have an explicit list of tracepoints and
   their arguments which need to be annotated as PTR_MAYBE_NULL
   (Kumar Kartikeya Dwivedi)
 
 - Fix a logic bug in the bpf_remove_insns code where a potential
   error would have been wrongly propagated (Anton Protopopov)
 
 - Avoid deadlock scenarios caused by nested kprobe and fentry
   BPF programs (Priya Bala Govindasamy)
 
 - Fix a bug in BPF verifier which was missing a size check for
   BTF-based context access (Kumar Kartikeya Dwivedi)
 
 - Fix a crash found by syzbot through an invalid BPF prog_array
   access in perf_event_detach_bpf_prog (Jiri Olsa)
 
 - Fix several BPF sockmap bugs including a race causing a
   refcount imbalance upon element replace (Michal Luczaj)
 
 - Fix a use-after-free from mismatching BPF program/attachment
   RCU flavors (Jann Horn)
 
 Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
 -----BEGIN PGP SIGNATURE-----
 
 iIsEABYKADMWIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZ13rdhUcZGFuaWVsQGlv
 Z2VhcmJveC5uZXQACgkQ2yufC7HISINfqAD7B2vX6EgTFrgy7QDepQnZsmu2qjdW
 fFUzPatFXXp2S3MA/16vOEoHJ4rRhBkcUK/vw3gyY5j5bYZNUTTaam5l4BcM
 =gkfb
 -----END PGP SIGNATURE-----

Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf

Pull bpf fixes from Daniel Borkmann:

 - Fix a bug in the BPF verifier to track changes to packet data
   property for global functions (Eduard Zingerman)

 - Fix a theoretical BPF prog_array use-after-free in RCU handling of
   __uprobe_perf_func (Jann Horn)

 - Fix BPF tracing to have an explicit list of tracepoints and their
   arguments which need to be annotated as PTR_MAYBE_NULL (Kumar
   Kartikeya Dwivedi)

 - Fix a logic bug in the bpf_remove_insns code where a potential error
   would have been wrongly propagated (Anton Protopopov)

 - Avoid deadlock scenarios caused by nested kprobe and fentry BPF
   programs (Priya Bala Govindasamy)

 - Fix a bug in BPF verifier which was missing a size check for
   BTF-based context access (Kumar Kartikeya Dwivedi)

 - Fix a crash found by syzbot through an invalid BPF prog_array access
   in perf_event_detach_bpf_prog (Jiri Olsa)

 - Fix several BPF sockmap bugs including a race causing a refcount
   imbalance upon element replace (Michal Luczaj)

 - Fix a use-after-free from mismatching BPF program/attachment RCU
   flavors (Jann Horn)

* tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: (23 commits)
  bpf: Avoid deadlock caused by nested kprobe and fentry bpf programs
  selftests/bpf: Add tests for raw_tp NULL args
  bpf: Augment raw_tp arguments with PTR_MAYBE_NULL
  bpf: Revert "bpf: Mark raw_tp arguments with PTR_MAYBE_NULL"
  selftests/bpf: Add test for narrow ctx load for pointer args
  bpf: Check size for BTF-based ctx access of pointer members
  selftests/bpf: extend changes_pkt_data with cases w/o subprograms
  bpf: fix null dereference when computing changes_pkt_data of prog w/o subprogs
  bpf: Fix theoretical prog_array UAF in __uprobe_perf_func()
  bpf: fix potential error return
  selftests/bpf: validate that tail call invalidates packet pointers
  bpf: consider that tail calls invalidate packet pointers
  selftests/bpf: freplace tests for tracking of changes_packet_data
  bpf: check changes_pkt_data property for extension programs
  selftests/bpf: test for changing packet data from global functions
  bpf: track changes_pkt_data property for global functions
  bpf: refactor bpf_helper_changes_pkt_data to use helper number
  bpf: add find_containing_subprog() utility function
  bpf,perf: Fix invalid prog_array access in perf_event_detach_bpf_prog
  bpf: Fix UAF via mismatching bpf_prog/attachment RCU flavors
  ...
2024-12-14 12:58:14 -08:00
Priya Bala Govindasamy
c83508da56 bpf: Avoid deadlock caused by nested kprobe and fentry bpf programs
BPF program types like kprobe and fentry can cause deadlocks in certain
situations. If a function takes a lock and one of these bpf programs is
hooked to some point in the function's critical section, and if the
bpf program tries to call the same function and take the same lock it will
lead to deadlock. These situations have been reported in the following
bug reports.

In percpu_freelist -
Link: https://lore.kernel.org/bpf/CAADnVQLAHwsa+2C6j9+UC6ScrDaN9Fjqv1WjB1pP9AzJLhKuLQ@mail.gmail.com/T/
Link: https://lore.kernel.org/bpf/CAPPBnEYm+9zduStsZaDnq93q1jPLqO-PiKX9jy0MuL8LCXmCrQ@mail.gmail.com/T/
In bpf_lru_list -
Link: https://lore.kernel.org/bpf/CAPPBnEajj+DMfiR_WRWU5=6A7KKULdB5Rob_NJopFLWF+i9gCA@mail.gmail.com/T/
Link: https://lore.kernel.org/bpf/CAPPBnEZQDVN6VqnQXvVqGoB+ukOtHGZ9b9U0OLJJYvRoSsMY_g@mail.gmail.com/T/
Link: https://lore.kernel.org/bpf/CAPPBnEaCB1rFAYU7Wf8UxqcqOWKmRPU1Nuzk3_oLk6qXR7LBOA@mail.gmail.com/T/

Similar bugs have been reported by syzbot.
In queue_stack_maps -
Link: https://lore.kernel.org/lkml/0000000000004c3fc90615f37756@google.com/
Link: https://lore.kernel.org/all/20240418230932.2689-1-hdanton@sina.com/T/
In lpm_trie -
Link: https://lore.kernel.org/linux-kernel/00000000000035168a061a47fa38@google.com/T/
In ringbuf -
Link: https://lore.kernel.org/bpf/20240313121345.2292-1-hdanton@sina.com/T/

Prevent kprobe and fentry bpf programs from attaching to these critical
sections by removing CC_FLAGS_FTRACE for percpu_freelist.o,
bpf_lru_list.o, queue_stack_maps.o, lpm_trie.o, ringbuf.o files.

The bugs reported by syzbot are due to tracepoint bpf programs being
called in the critical sections. This patch does not aim to fix deadlocks
caused by tracepoint programs. However, it does prevent deadlocks from
occurring in similar situations due to kprobe and fentry programs.

Signed-off-by: Priya Bala Govindasamy <pgovind2@uci.edu>
Link: https://lore.kernel.org/r/CAPPBnEZpjGnsuA26Mf9kYibSaGLm=oF6=12L21X1GEQdqjLnzQ@mail.gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-12-14 09:49:27 -08:00
Kumar Kartikeya Dwivedi
838a10bd2e bpf: Augment raw_tp arguments with PTR_MAYBE_NULL
Arguments to a raw tracepoint are tagged as trusted, which carries the
semantics that the pointer will be non-NULL.  However, in certain cases,
a raw tracepoint argument may end up being NULL. More context about this
issue is available in [0].

Thus, there is a discrepancy between the reality, that raw_tp arguments can
actually be NULL, and the verifier's knowledge, that they are never NULL,
causing explicit NULL check branch to be dead code eliminated.

A previous attempt [1], i.e. the second fixed commit, was made to
simulate symbolic execution as if in most accesses, the argument is a
non-NULL raw_tp, except for conditional jumps.  This tried to suppress
branch prediction while preserving compatibility, but surfaced issues
with production programs that were difficult to solve without increasing
verifier complexity. A more complete discussion of issues and fixes is
available at [2].

Fix this by maintaining an explicit list of tracepoints where the
arguments are known to be NULL, and mark the positional arguments as
PTR_MAYBE_NULL. Additionally, capture the tracepoints where arguments
are known to be ERR_PTR, and mark these arguments as scalar values to
prevent potential dereference.

Each hex digit is used to encode NULL-ness (0x1) or ERR_PTR-ness (0x2),
shifted by the zero-indexed argument number x 4. This can be represented
as follows:
1st arg: 0x1
2nd arg: 0x10
3rd arg: 0x100
... and so on (likewise for ERR_PTR case).

In the future, an automated pass will be used to produce such a list, or
insert __nullable annotations automatically for tracepoints. Each
compilation unit will be analyzed and results will be collated to find
whether a tracepoint pointer is definitely not null, maybe null, or an
unknown state where verifier conservatively marks it PTR_MAYBE_NULL.
A proof of concept of this tool from Eduard is available at [3].

Note that in case we don't find a specification in the raw_tp_null_args
array and the tracepoint belongs to a kernel module, we will
conservatively mark the arguments as PTR_MAYBE_NULL. This is because
unlike for in-tree modules, out-of-tree module tracepoints may pass NULL
freely to the tracepoint. We don't protect against such tracepoints
passing ERR_PTR (which is uncommon anyway), lest we mark all such
arguments as SCALAR_VALUE.

While we are it, let's adjust the test raw_tp_null to not perform
dereference of the skb->mark, as that won't be allowed anymore, and make
it more robust by using inline assembly to test the dead code
elimination behavior, which should still stay the same.

  [0]: https://lore.kernel.org/bpf/ZrCZS6nisraEqehw@jlelli-thinkpadt14gen4.remote.csb
  [1]: https://lore.kernel.org/all/20241104171959.2938862-1-memxor@gmail.com
  [2]: https://lore.kernel.org/bpf/20241206161053.809580-1-memxor@gmail.com
  [3]: https://github.com/eddyz87/llvm-project/tree/nullness-for-tracepoint-params

Reported-by: Juri Lelli <juri.lelli@redhat.com> # original bug
Reported-by: Manu Bretelle <chantra@meta.com> # bugs in masking fix
Fixes: 3f00c5239344 ("bpf: Allow trusted pointers to be passed to KF_TRUSTED_ARGS kfuncs")
Fixes: cb4158ce8ec8 ("bpf: Mark raw_tp arguments with PTR_MAYBE_NULL")
Reviewed-by: Eduard Zingerman <eddyz87@gmail.com>
Co-developed-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20241213221929.3495062-3-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-12-13 16:24:53 -08:00
Kumar Kartikeya Dwivedi
c00d738e16 bpf: Revert "bpf: Mark raw_tp arguments with PTR_MAYBE_NULL"
This patch reverts commit
cb4158ce8ec8 ("bpf: Mark raw_tp arguments with PTR_MAYBE_NULL"). The
patch was well-intended and meant to be as a stop-gap fixing branch
prediction when the pointer may actually be NULL at runtime. Eventually,
it was supposed to be replaced by an automated script or compiler pass
detecting possibly NULL arguments and marking them accordingly.

However, it caused two main issues observed for production programs and
failed to preserve backwards compatibility. First, programs relied on
the verifier not exploring == NULL branch when pointer is not NULL, thus
they started failing with a 'dereference of scalar' error.  Next,
allowing raw_tp arguments to be modified surfaced the warning in the
verifier that warns against reg->off when PTR_MAYBE_NULL is set.

More information, context, and discusson on both problems is available
in [0]. Overall, this approach had several shortcomings, and the fixes
would further complicate the verifier's logic, and the entire masking
scheme would have to be removed eventually anyway.

Hence, revert the patch in preparation of a better fix avoiding these
issues to replace this commit.

  [0]: https://lore.kernel.org/bpf/20241206161053.809580-1-memxor@gmail.com

Reported-by: Manu Bretelle <chantra@meta.com>
Fixes: cb4158ce8ec8 ("bpf: Mark raw_tp arguments with PTR_MAYBE_NULL")
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20241213221929.3495062-2-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-12-13 16:24:53 -08:00
Vineeth Pillai (Google)
c7f7e9c731 sched/dlserver: Fix dlserver time accounting
dlserver time is accounted when:
 - dlserver is active and the dlserver proxies the cfs task.
 - dlserver is active but deferred and cfs task runs after being picked
   through the normal fair class pick.

dl_server_update is called in two places to make sure that both the
above times are accounted for. But it doesn't check if dlserver is
active or not. Now that we have this dl_server_active flag, we can
consolidate dl_server_update into one place and all we need to check is
whether dlserver is active or not. When dlserver is active there is only
two possible conditions:
 - dlserver is deferred.
 - cfs task is running on behalf of dlserver.

Fixes: a110a81c52a9 ("sched/deadline: Deferrable dl server")
Signed-off-by: "Vineeth Pillai (Google)" <vineeth@bitbyteword.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Marcel Ziswiler <marcel.ziswiler@codethink.co.uk> # ROCK 5B
Link: https://lore.kernel.org/r/20241213032244.877029-2-vineeth@bitbyteword.org
2024-12-13 12:57:35 +01:00
Vineeth Pillai (Google)
b53127db1d sched/dlserver: Fix dlserver double enqueue
dlserver can get dequeued during a dlserver pick_task due to the delayed
deueue feature and this can lead to issues with dlserver logic as it
still thinks that dlserver is on the runqueue. The dlserver throttling
and replenish logic gets confused and can lead to double enqueue of
dlserver.

Double enqueue of dlserver could happend due to couple of reasons:

Case 1
------

Delayed dequeue feature[1] can cause dlserver being stopped during a
pick initiated by dlserver:
  __pick_next_task
   pick_task_dl -> server_pick_task
    pick_task_fair
     pick_next_entity (if (sched_delayed))
      dequeue_entities
       dl_server_stop

server_pick_task goes ahead with update_curr_dl_se without knowing that
dlserver is dequeued and this confuses the logic and may lead to
unintended enqueue while the server is stopped.

Case 2
------
A race condition between a task dequeue on one cpu and same task's enqueue
on this cpu by a remote cpu while the lock is released causing dlserver
double enqueue.

One cpu would be in the schedule() and releasing RQ-lock:

current->state = TASK_INTERRUPTIBLE();
        schedule();
          deactivate_task()
            dl_stop_server();
          pick_next_task()
            pick_next_task_fair()
              sched_balance_newidle()
                rq_unlock(this_rq)

at which point another CPU can take our RQ-lock and do:

        try_to_wake_up()
          ttwu_queue()
            rq_lock()
            ...
            activate_task()
              dl_server_start() --> first enqueue
            wakeup_preempt() := check_preempt_wakeup_fair()
              update_curr()
                update_curr_task()
                  if (current->dl_server)
                    dl_server_update()
                      enqueue_dl_entity() --> second enqueue

This bug was not apparent as the enqueue in dl_server_start doesn't
usually happen because of the defer logic. But as a side effect of the
first case(dequeue during dlserver pick), dl_throttled and dl_yield will
be set and this causes the time accounting of dlserver to messup and
then leading to a enqueue in dl_server_start.

Have an explicit flag representing the status of dlserver to avoid the
confusion. This is set in dl_server_start and reset in dlserver_stop.

Fixes: 63ba8422f876 ("sched/deadline: Introduce deadline servers")
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: "Vineeth Pillai (Google)" <vineeth@bitbyteword.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Marcel Ziswiler <marcel.ziswiler@codethink.co.uk> # ROCK 5B
Link: https://lkml.kernel.org/r/20241213032244.877029-1-vineeth@bitbyteword.org
2024-12-13 12:57:34 +01:00
Juergen Gross
0ef8047b73 x86/static-call: provide a way to do very early static-call updates
Add static_call_update_early() for updating static-call targets in
very early boot.

This will be needed for support of Xen guest type specific hypercall
functions.

This is part of XSA-466 / CVE-2024-53241.

Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Co-developed-by: Peter Zijlstra <peterz@infradead.org>
Co-developed-by: Josh Poimboeuf <jpoimboe@redhat.com>
2024-12-13 09:28:32 +01:00
Kumar Kartikeya Dwivedi
659b9ba7cb bpf: Check size for BTF-based ctx access of pointer members
Robert Morris reported the following program type which passes the
verifier in [0]:

SEC("struct_ops/bpf_cubic_init")
void BPF_PROG(bpf_cubic_init, struct sock *sk)
{
	asm volatile("r2 = *(u16*)(r1 + 0)");     // verifier should demand u64
	asm volatile("*(u32 *)(r2 +1504) = 0");   // 1280 in some configs
}

The second line may or may not work, but the first instruction shouldn't
pass, as it's a narrow load into the context structure of the struct ops
callback. The code falls back to btf_ctx_access to ensure correctness
and obtaining the types of pointers. Ensure that the size of the access
is correctly checked to be 8 bytes, otherwise the verifier thinks the
narrow load obtained a trusted BTF pointer and will permit loads/stores
as it sees fit.

Perform the check on size after we've verified that the load is for a
pointer field, as for scalar values narrow loads are fine. Access to
structs passed as arguments to a BPF program are also treated as
scalars, therefore no adjustment is needed in their case.

Existing verifier selftests are broken by this change, but because they
were incorrect. Verifier tests for d_path were performing narrow load
into context to obtain path pointer, had this program actually run it
would cause a crash. The same holds for verifier_btf_ctx_access tests.

  [0]: https://lore.kernel.org/bpf/51338.1732985814@localhost

Fixes: 9e15db66136a ("bpf: Implement accurate raw_tp context access via BTF")
Reported-by: Robert Morris <rtm@mit.edu>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20241212092050.3204165-2-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-12-12 11:40:18 -08:00
Eduard Zingerman
ac6542ad92 bpf: fix null dereference when computing changes_pkt_data of prog w/o subprogs
bpf_prog_aux->func field might be NULL if program does not have
subprograms except for main sub-program. The fixed commit does
bpf_prog_aux->func access unconditionally, which might lead to null
pointer dereference.

The bug could be triggered by replacing the following BPF program:

    SEC("tc")
    int main_changes(struct __sk_buff *sk)
    {
        bpf_skb_pull_data(sk, 0);
        return 0;
    }

With the following BPF program:

    SEC("freplace")
    long changes_pkt_data(struct __sk_buff *sk)
    {
        return bpf_skb_pull_data(sk, 0);
    }

bpf_prog_aux instance itself represents the main sub-program,
use this property to fix the bug.

Fixes: 81f6d0530ba0 ("bpf: check changes_pkt_data property for extension programs")
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/r/202412111822.qGw6tOyB-lkp@intel.com/
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20241212070711.427443-1-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-12-12 11:37:19 -08:00
Linus Torvalds
1594c49394 Probes fixes for v6.13-rc1:
- eprobes: Fix to release eprobe when failed to add dyn_event.
   This unregisters event call and release eprobe when it fails to add
   a dynamic event. Found in cleaning up.
 -----BEGIN PGP SIGNATURE-----
 
 iQFPBAABCgA5FiEEh7BulGwFlgAOi5DV2/sHvwUrPxsFAmdYT3sbHG1hc2FtaS5o
 aXJhbWF0c3VAZ21haWwuY29tAAoJENv7B78FKz8b5X8IALRigb6oDLzrq8yavSPy
 xn1QlnRtRFdLz+PQ3kFCzU3TOT9oxdFhBkYAXS32vDItPqzM7Upj0oZceqhmd5kz
 aXSdkL+PFmbHuLzyPuBksyX4gKga06rQBHJ2SIPxnRPZcXBBRStqyWRDpNjwIxrW
 K8p6k0Agrtd4tL7QtBdukda0uJqKSjN3gOzRAu40KMBjBJZ3kMTsoc+GWGIoIMHb
 PIDaXTZT0DlZ9ZxiEA/gPcjMugNjDVhkbq2ChPU+asvlRs0YUANT4CF0HcntJvDO
 W0xIWivfYIKWFLdAn6fhXicPkqU9DQ7FjppyRKC6y4bwuCYJlSeLsPmSWNI2IEBX
 bFA=
 =LLWX
 -----END PGP SIGNATURE-----

Merge tag 'probes-fixes-v6.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull eprobes fix from Masami Hiramatsu:

 - release eprobe when failing to add dyn_event.

   This unregisters event call and release eprobe when it fails to add a
   dynamic event. Found in cleaning up.

* tag 'probes-fixes-v6.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing/eprobe: Fix to release eprobe when failed to add dyn_event
2024-12-10 18:15:25 -08:00