mirror of
https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
synced 2025-01-15 02:05:33 +00:00
46607 Commits
Author | SHA1 | Message | Date | |
---|---|---|---|---|
Steven Rostedt (Google)
|
7132d8572b | Merge ftrace/for-next | ||
Masami Hiramatsu (Google)
|
66fc6f521a |
tracing/hist: Support POLLPRI event for poll on histogram
Since POLLIN will not be flushed until the hist file is read, the user needs to repeatedly read() and poll() on the hist file for monitoring the event continuously. But the read() is somewhat redundant when the user is only monitoring for event updates. Add POLLPRI poll event on the hist file so the event returns when a histogram is updated after open(), poll() or read(). Thus it is possible to wait for the next event without having to issue a read(). Cc: Shuah Khan <shuah@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/173527248770.464571.2536902137325258133.stgit@devnote2 Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Reviewed-by: Tom Zanussi <zanussi@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
Masami Hiramatsu (Google)
|
1bd13edbbe |
tracing/hist: Add poll(POLLIN) support on hist file
Add poll syscall support on the `hist` file. The Waiter will be waken up when the histogram is updated with POLLIN. Currently, there is no way to wait for a specific event in userspace. So user needs to peek the `trace` periodicaly, or wait on `trace_pipe`. But it is not a good idea to peek at the `trace` for an event that randomly happens. And `trace_pipe` is not coming back until a page is filled with events. This allows a user to wait for a specific event on the `hist` file. User can set a histogram trigger on the event which they want to monitor and poll() on its `hist` file. Since this poll() returns POLLIN, the next poll() will return soon unless a read() happens on that hist file. NOTE: To read the hist file again, you must set the file offset to 0, but just for monitoring the event, you may not need to read the histogram. Cc: Shuah Khan <shuah@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/173527247756.464571.14236296701625509931.stgit@devnote2 Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Reviewed-by: Tom Zanussi <zanussi@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
Steven Rostedt
|
22bec11a56 |
tracing: Fix using ret variable in tracing_set_tracer()
When the function tracing_set_tracer() switched over to using the guard() infrastructure, it did not need to save the 'ret' variable and would just return the value when an error arised, instead of setting ret and jumping to an out label. When CONFIG_TRACER_SNAPSHOT is enabled, it had code that expected the "ret" variable to be initialized to zero and had set 'ret' while holding an arch_spin_lock() (not used by guard), and then upon releasing the lock it would check 'ret' and exit if set. But because ret was only set when an error occurred while holding the locks, 'ret' would be used uninitialized if there was no error. The code in the CONFIG_TRACER_SNAPSHOT block should be self contain. Make sure 'ret' is also set when no error occurred. Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/20250106111143.2f90ff65@gandalf.local.home Reported-by: kernel test robot <lkp@intel.com> Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Closes: https://lore.kernel.org/r/202412271654.nJVBuwmF-lkp@intel.com/ Fixes: d33b10c0c73ad ("tracing: Switch trace.c code over to use guard()") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> |
||
Steven Rostedt
|
9e49ca756d |
tracing/string: Create and use __free(argv_free) in trace_dynevent.c
The function dyn_event_release() uses argv_split() which must be freed via argv_free(). It contains several error paths that do a goto out to call argv_free() for cleanup. This makes the code complex and error prone. Create a new __free() directive __free(argv_free) that will call argv_free() for data allocated with argv_split(), and use it in the dyn_event_release() function. Cc: Kees Cook <kees@kernel.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Andy Shevchenko <andy@kernel.org> Cc: linux-hardening@vger.kernel.org Link: https://lore.kernel.org/20241220103313.4a74ec8e@gandalf.local.home Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
Steven Rostedt
|
08b7673171 |
tracing: Switch trace_stat.c code over to use guard()
There are a couple functions in trace_stat.c that have "goto out" or equivalent on error in order to release locks that were taken. This can be error prone or just simply make the code more complex. Switch every location that ends with unlocking a mutex on error over to using the guard(mutex)() infrastructure to let the compiler worry about releasing locks. This makes the code easier to read and understand. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/20241219201346.870318466@goodmis.org Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
Steven Rostedt
|
6c05353e4f |
tracing: Switch trace_stack.c code over to use guard()
The function stack_trace_sysctl() uses a goto on the error path to jump to the mutex_unlock() code. Replace the logic to use guard() and let the compiler worry about it. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/20241225222931.684913592@goodmis.org Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
Steven Rostedt
|
930d2b32c0 |
tracing: Switch trace_osnoise.c code over to use guard() and __free()
The osnoise_hotplug_workfn() grabs two mutexes and cpu_read_lock(). It has various gotos to handle unlocking them. Switch them over to guard() and let the compiler worry about it. The osnoise_cpus_read() has a temporary mask_str allocated and there's some gotos to make sure it gets freed on error paths. Switch that over to __free() to let the compiler worry about it. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/20241225222931.517329690@goodmis.org Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
Steven Rostedt
|
a2e27e1bb1 |
tracing: Switch trace_events_synth.c code over to use guard()
There are a couple functions in trace_events_synth.c that have "goto out" or equivalent on error in order to release locks that were taken. This can be error prone or just simply make the code more complex. Switch every location that ends with unlocking a mutex on error over to using the guard(mutex)() infrastructure to let the compiler worry about releasing locks. This makes the code easier to read and understand. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/20241219201346.371082515@goodmis.org Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
Steven Rostedt
|
076796f74e |
tracing: Switch trace_events_filter.c code over to use guard()
There are a couple functions in trace_events_filter.c that have "goto out" or equivalent on error in order to release locks that were taken. This can be error prone or just simply make the code more complex. Switch every location that ends with unlocking a mutex on error over to using the guard(mutex)() infrastructure to let the compiler worry about releasing locks. This makes the code easier to read and understand. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/20241219201346.200737679@goodmis.org Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
Steven Rostedt
|
63c7264168 |
tracing: Switch trace_events_trigger.c code over to use guard()
There are a few functions in trace_events_trigger.c that have "goto out" or equivalent on error in order to release locks that were taken. This can be error prone or just simply make the code more complex. Switch every location that ends with unlocking a mutex on error over to using the guard(mutex)() infrastructure to let the compiler worry about releasing locks. This makes the code easier to read and understand. Also use __free() for free a temporary buffer in event_trigger_regex_write(). Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/20241220110621.639d3bc8@gandalf.local.home Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
Steven Rostedt
|
2b36a97aee |
tracing: Switch trace_events_hist.c code over to use guard()
There are a couple functions in trace_events_hist.c that have "goto out" or equivalent on error in order to release locks that were taken. This can be error prone or just simply make the code more complex. Switch every location that ends with unlocking a mutex on error over to using the guard(mutex)() infrastructure to let the compiler worry about releasing locks. This makes the code easier to read and understand. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/20241219201345.694601480@goodmis.org Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
Steven Rostedt
|
59980d9b0b |
tracing: Switch trace_events.c code over to use guard()
There are several functions in trace_events.c that have "goto out;" or equivalent on error in order to release locks that were taken. This can be error prone or just simply make the code more complex. Switch every location that ends with unlocking a mutex on error over to using the guard(mutex)() infrastructure to let the compiler worry about releasing locks. This makes the code easier to read and understand. Some locations did some simple arithmetic after releasing the lock. As this causes no real overhead for holding a mutex while processing the file position (*ppos += cnt;) let the lock be held over this logic too. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/20241219201345.522546095@goodmis.org Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
Steven Rostedt
|
4b8d63e5b6 |
tracing: Simplify event_enable_func() goto_reg logic
Currently there's an "out_reg:" label that gets jumped to if there's no parameters to process. Instead, make it a proper "if (param) { }" block as there's not much to do for the parameter processing, and remove the "out_reg:" label. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/20241219201345.354746196@goodmis.org Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
Steven Rostedt
|
c949dfb974 |
tracing: Simplify event_enable_func() goto out_free logic
The event_enable_func() function allocates the data descriptor early in the function just to assign its data->count value via: kstrtoul(number, 0, &data->count); This makes the code more complex as there are several error paths before the data descriptor is actually used. This means there needs to be a goto out_free; to clean it up. Use a local variable "count" to do the update and move the data allocation just before it is used. This removes the "out_free" label as the data can be freed on the failure path of where it is used. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/20241219201345.190820140@goodmis.org Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
Steven Rostedt
|
cad1d5bd2c |
tracing: Have event_enable_write() just return error on error
The event_enable_write() function is inconsistent in how it returns errors. Sometimes it updates the ppos parameter and sometimes it doesn't. Simplify the code to just return an error or the count if there isn't an error. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/20241219201345.025284170@goodmis.org Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
Steven Rostedt
|
d1e27ee9c6 |
tracing: Return -EINVAL if a boot tracer tries to enable the mmiotracer at boot
The mmiotracer is not set to be enabled at boot up from the kernel command line. If the boot command line tries to enable that tracer, it will fail to be enabled. The return code is currently zero when that happens so the caller just thinks it was enabled. Return -EINVAL in this case. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/20241219201344.854254394@goodmis.org Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
Steven Rostedt
|
d33b10c0c7 |
tracing: Switch trace.c code over to use guard()
There are several functions in trace.c that have "goto out;" or equivalent on error in order to release locks or free values that were allocated. This can be error prone or just simply make the code more complex. Switch every location that ends with unlocking a mutex or freeing on error over to using the guard(mutex)() and __free() infrastructure to let the compiler worry about releasing locks. This makes the code easier to read and understand. There's one place that should probably return an error but instead return 0. This does not change the return as the only changes are to do the conversion without changing the logic. Fixing that location will have to come later. Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Link: https://lore.kernel.org/20241224221413.7b8c68c3@batman.local.home Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
Masami Hiramatsu (Google)
|
d576aec24d |
fgraph: Get ftrace recursion lock in function_graph_enter
Get the ftrace recursion lock in the generic function_graph_enter() instead of each architecture code. This changes all function_graph tracer callbacks running in non-preemptive state. On x86 and powerpc, this is by default, but on the other architecutres, this will be new. Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com> Cc: Florent Revest <revest@chromium.org> Cc: Martin KaFai Lau <martin.lau@linux.dev> Cc: bpf <bpf@vger.kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Alan Maguire <alan.maguire@oracle.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Naveen N Rao <naveen@kernel.org> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: x86@kernel.org Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/173379653720.973433.18438622234884980494.stgit@devnote2 Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
Steven Rostedt
|
1d95fd9d6b |
ftrace: Switch ftrace.c code over to use guard()
There are a few functions in ftrace.c that have "goto out" or equivalent on error in order to release locks that were taken. This can be error prone or just simply make the code more complex. Switch every location that ends with unlocking a mutex on error over to using the guard(mutex)() infrastructure to let the compiler worry about releasing locks. This makes the code easier to read and understand. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Link: https://lore.kernel.org/20241223184941.718001540@goodmis.org Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
Steven Rostedt
|
77e53cb2fc |
ftrace: Remove unneeded goto jumps
There are some goto jumps to exit a program to just return a value. The code after the label doesn't free anything nor does it do any unlocks. It simply returns the variable that was set before the jump. Remove these unneeded goto jumps. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Link: https://lore.kernel.org/20241223184941.544855549@goodmis.org Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
Steven Rostedt
|
ac8c3b02fc |
ftrace: Do not disable interrupts in profiler
The function profiler disables interrupts before processing. This was there since the profiler was introduced back in 2009 when there were recursion issues to deal with. The function tracer is much more robust today and has its own internal recursion protection. There's no reason to disable interrupts in the function profiler. Instead, just disable preemption and use the guard() infrastructure while at it. Before this change: ~# echo 1 > /sys/kernel/tracing/function_profile_enabled ~# perf stat -r 10 ./hackbench 10 Time: 3.099 Time: 2.556 Time: 2.500 Time: 2.705 Time: 2.985 Time: 2.959 Time: 2.859 Time: 2.621 Time: 2.742 Time: 2.631 Performance counter stats for '/work/c/hackbench 10' (10 runs): 23,156.77 msec task-clock # 6.951 CPUs utilized ( +- 2.36% ) 18,306 context-switches # 790.525 /sec ( +- 5.95% ) 495 cpu-migrations # 21.376 /sec ( +- 8.61% ) 11,522 page-faults # 497.565 /sec ( +- 1.80% ) 47,967,124,606 cycles # 2.071 GHz ( +- 0.41% ) 80,009,078,371 instructions # 1.67 insn per cycle ( +- 0.34% ) 16,389,249,798 branches # 707.752 M/sec ( +- 0.36% ) 139,943,109 branch-misses # 0.85% of all branches ( +- 0.61% ) 3.332 +- 0.101 seconds time elapsed ( +- 3.04% ) After this change: ~# echo 1 > /sys/kernel/tracing/function_profile_enabled ~# perf stat -r 10 ./hackbench 10 Time: 1.869 Time: 1.428 Time: 1.575 Time: 1.569 Time: 1.685 Time: 1.511 Time: 1.611 Time: 1.672 Time: 1.724 Time: 1.715 Performance counter stats for '/work/c/hackbench 10' (10 runs): 13,578.21 msec task-clock # 6.931 CPUs utilized ( +- 2.23% ) 12,736 context-switches # 937.973 /sec ( +- 3.86% ) 341 cpu-migrations # 25.114 /sec ( +- 5.27% ) 11,378 page-faults # 837.960 /sec ( +- 1.74% ) 27,638,039,036 cycles # 2.035 GHz ( +- 0.27% ) 45,107,762,498 instructions # 1.63 insn per cycle ( +- 0.23% ) 8,623,868,018 branches # 635.125 M/sec ( +- 0.27% ) 125,738,443 branch-misses # 1.46% of all branches ( +- 0.32% ) 1.9590 +- 0.0484 seconds time elapsed ( +- 2.47% ) Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Link: https://lore.kernel.org/20241223184941.373853944@goodmis.org Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
Steven Rostedt
|
7d137e604a |
fgraph: Remove unnecessary disabling of interrupts and recursion
The function graph tracer disables interrupts as well as prevents recursion via NMIs when recording the graph tracer code. There's no reason to do this today. That disabling goes back to 2008 when the function graph tracer was first introduced and recursion protection wasn't part of the code. Today, there's no reason to disable interrupts or prevent the code from recursing as the infrastructure can easily handle it. Before this change: ~# echo function_graph > /sys/kernel/tracing/current_tracer ~# perf stat -r 10 ./hackbench 10 Time: 4.240 Time: 4.236 Time: 4.106 Time: 4.014 Time: 4.314 Time: 3.830 Time: 4.063 Time: 4.323 Time: 3.763 Time: 3.727 Performance counter stats for '/work/c/hackbench 10' (10 runs): 33,937.20 msec task-clock # 7.008 CPUs utilized ( +- 1.85% ) 18,220 context-switches # 536.874 /sec ( +- 6.41% ) 624 cpu-migrations # 18.387 /sec ( +- 9.07% ) 11,319 page-faults # 333.528 /sec ( +- 1.97% ) 76,657,643,617 cycles # 2.259 GHz ( +- 0.40% ) 141,403,302,768 instructions # 1.84 insn per cycle ( +- 0.37% ) 25,518,463,888 branches # 751.932 M/sec ( +- 0.35% ) 156,151,050 branch-misses # 0.61% of all branches ( +- 0.63% ) 4.8423 +- 0.0892 seconds time elapsed ( +- 1.84% ) After this change: ~# echo function_graph > /sys/kernel/tracing/current_tracer ~# perf stat -r 10 ./hackbench 10 Time: 3.340 Time: 3.192 Time: 3.129 Time: 2.579 Time: 2.589 Time: 2.798 Time: 2.791 Time: 2.955 Time: 3.044 Time: 3.065 Performance counter stats for './hackbench 10' (10 runs): 24,416.30 msec task-clock # 6.996 CPUs utilized ( +- 2.74% ) 16,764 context-switches # 686.590 /sec ( +- 5.85% ) 469 cpu-migrations # 19.208 /sec ( +- 6.14% ) 11,519 page-faults # 471.775 /sec ( +- 1.92% ) 53,895,628,450 cycles # 2.207 GHz ( +- 0.52% ) 105,552,664,638 instructions # 1.96 insn per cycle ( +- 0.47% ) 17,808,672,667 branches # 729.376 M/sec ( +- 0.48% ) 133,075,435 branch-misses # 0.75% of all branches ( +- 0.59% ) 3.490 +- 0.112 seconds time elapsed ( +- 3.22% ) Also removed unneeded "unlikely()" around the retaddr code. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Link: https://lore.kernel.org/20241223184941.204074053@goodmis.org Fixes: 9cd2992f2d6c8 ("fgraph: Have set_graph_notrace only affect function_graph tracer") # Performance only Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
Linus Torvalds
|
4aa748dd1a |
25 hotfixes. 16 are cc:stable. 19 are MM and 6 are non-MM.
The usual bunch of singletons and doubletons - please see the relevant changelogs for details. -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZ2cghQAKCRDdBJ7gKXxA jgrsAQCvlSmHYYLXBE1A6cram4qWgEP/2vD94d6sVv9UipO/FAEA8y1K7dbT2AGX A5ESuRndu5Iy76mb6Tiarqa/yt56QgU= =ZYVx -----END PGP SIGNATURE----- Merge tag 'mm-hotfixes-stable-2024-12-21-12-09' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "25 hotfixes. 16 are cc:stable. 19 are MM and 6 are non-MM. The usual bunch of singletons and doubletons - please see the relevant changelogs for details" * tag 'mm-hotfixes-stable-2024-12-21-12-09' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (25 commits) mm: huge_memory: handle strsep not finding delimiter alloc_tag: fix set_codetag_empty() when !CONFIG_MEM_ALLOC_PROFILING_DEBUG alloc_tag: fix module allocation tags populated area calculation mm/codetag: clear tags before swap mm/vmstat: fix a W=1 clang compiler warning mm: convert partially_mapped set/clear operations to be atomic nilfs2: fix buffer head leaks in calls to truncate_inode_pages() vmalloc: fix accounting with i915 mm/page_alloc: don't call pfn_to_page() on possibly non-existent PFN in split_large_buddy() fork: avoid inappropriate uprobe access to invalid mm nilfs2: prevent use of deleted inode zram: fix uninitialized ZRAM not releasing backing device zram: refuse to use zero sized block device as backing device mm: use clear_user_(high)page() for arch with special user folio handling mm: introduce cpu_icache_is_aliasing() across all architectures mm: add RCU annotation to pte_offset_map(_lock) mm: correctly reference merged VMA mm: use aligned address in copy_user_gigantic_page() mm: use aligned address in clear_gigantic_page() mm: shmem: fix ShmemHugePages at swapout ... |
||
Linus Torvalds
|
9c707ba99f |
BPF fixes:
- Fix inlining of bpf_get_smp_processor_id helper for !CONFIG_SMP systems (Andrea Righi) - Fix BPF USDT selftests helper code to use asm constraint "m" for LoongArch (Tiezhu Yang) - Fix BPF selftest compilation error in get_uprobe_offset when PROCMAP_QUERY is not defined (Jerome Marchand) - Fix BPF bpf_skb_change_tail helper when used in context of BPF sockmap to handle negative skb header offsets (Cong Wang) - Several fixes to BPF sockmap code, among others, in the area of socket buffer accounting (Levi Zim, Zijian Zhang, Cong Wang) Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> -----BEGIN PGP SIGNATURE----- iIsEABYKADMWIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZ2YJABUcZGFuaWVsQGlv Z2VhcmJveC5uZXQACgkQ2yufC7HISINDEgD+N4uVg+rp8Z8pg9jcai4WUERmRG20 NcQTfBXczLHkwIcBALvn7NVvbTAINJzBTnukbjX3XbWFz2cJ/xHxDYXycP4I =SwXG -----END PGP SIGNATURE----- Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf Pull BPF fixes from Daniel Borkmann: - Fix inlining of bpf_get_smp_processor_id helper for !CONFIG_SMP systems (Andrea Righi) - Fix BPF USDT selftests helper code to use asm constraint "m" for LoongArch (Tiezhu Yang) - Fix BPF selftest compilation error in get_uprobe_offset when PROCMAP_QUERY is not defined (Jerome Marchand) - Fix BPF bpf_skb_change_tail helper when used in context of BPF sockmap to handle negative skb header offsets (Cong Wang) - Several fixes to BPF sockmap code, among others, in the area of socket buffer accounting (Levi Zim, Zijian Zhang, Cong Wang) * tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: selftests/bpf: Test bpf_skb_change_tail() in TC ingress selftests/bpf: Introduce socket_helpers.h for TC tests selftests/bpf: Add a BPF selftest for bpf_skb_change_tail() bpf: Check negative offsets in __bpf_skb_min_len() tcp_bpf: Fix copied value in tcp_bpf_sendmsg skmsg: Return copied bytes in sk_msg_memcopy_from_iter tcp_bpf: Add sk_rmem_alloc related logic for tcp_bpf ingress redirection tcp_bpf: Charge receive socket buffer in bpf_tcp_ingress() selftests/bpf: Fix compilation error in get_uprobe_offset() selftests/bpf: Use asm constraint "m" for LoongArch bpf: Fix bpf_get_smp_processor_id() on !CONFIG_SMP |
||
Linus Torvalds
|
5b83bcdea5 |
ring-buffer fixes for v6.13:
- Fix possible overflow of mmapped ring buffer with bad offset If the mmap() to the ring buffer passes in a start address that is passed the end of the mmapped file, it is not caught and a slab-out-of-bounds is triggered. Add a check to make sure the start address is within the bounds - Do not use TP_printk() to boot mapped ring buffers As a boot mapped ring buffer's data may have pointers that map to the previous boot's memory map, it is unsafe to allow the TP_printk() to be used to read the boot mapped buffer's events. If a TP_printk() points to a static string from within the kernel it will not match the current kernel mapping if KASLR is active, and it can fault. Have it simply print out the raw fields. -----BEGIN PGP SIGNATURE----- iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZ2QuXRQccm9zdGVkdEBn b29kbWlzLm9yZwAKCRAp5XQQmuv6qncvAQDf2s2WWsy4pYp2mpRtBXvAPf6tpBdi J9eceJQbwJVJHAEApQjEFfbUxLh2WgPU1Cn++PwDA+NLiru70+S0vtDLWwE= =OI+v -----END PGP SIGNATURE----- Merge tag 'trace-ringbuffer-v6.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull ring-buffer fixes from Steven Rostedt: - Fix possible overflow of mmapped ring buffer with bad offset If the mmap() to the ring buffer passes in a start address that is passed the end of the mmapped file, it is not caught and a slab-out-of-bounds is triggered. Add a check to make sure the start address is within the bounds - Do not use TP_printk() to boot mapped ring buffers As a boot mapped ring buffer's data may have pointers that map to the previous boot's memory map, it is unsafe to allow the TP_printk() to be used to read the boot mapped buffer's events. If a TP_printk() points to a static string from within the kernel it will not match the current kernel mapping if KASLR is active, and it can fault. Have it simply print out the raw fields. * tag 'trace-ringbuffer-v6.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: trace/ring-buffer: Do not use TP_printk() formatting for boot mapped buffers ring-buffer: Fix overflow in __rb_map_vma |
||
Lorenzo Stoakes
|
8ac662f5da |
fork: avoid inappropriate uprobe access to invalid mm
If dup_mmap() encounters an issue, currently uprobe is able to access the relevant mm via the reverse mapping (in build_map_info()), and if we are very unlucky with a race window, observe invalid XA_ZERO_ENTRY state which we establish as part of the fork error path. This occurs because uprobe_write_opcode() invokes anon_vma_prepare() which in turn invokes find_mergeable_anon_vma() that uses a VMA iterator, invoking vma_iter_load() which uses the advanced maple tree API and thus is able to observe XA_ZERO_ENTRY entries added to dup_mmap() in commit d24062914837 ("fork: use __mt_dup() to duplicate maple tree in dup_mmap()"). This change was made on the assumption that only process tear-down code would actually observe (and make use of) these values. However this very unlikely but still possible edge case with uprobes exists and unfortunately does make these observable. The uprobe operation prevents races against the dup_mmap() operation via the dup_mmap_sem semaphore, which is acquired via uprobe_start_dup_mmap() and dropped via uprobe_end_dup_mmap(), and held across register_for_each_vma() prior to invoking build_map_info() which does the reverse mapping lookup. Currently these are acquired and dropped within dup_mmap(), which exposes the race window prior to error handling in the invoking dup_mm() which tears down the mm. We can avoid all this by just moving the invocation of uprobe_start_dup_mmap() and uprobe_end_dup_mmap() up a level to dup_mm() and only release this lock once the dup_mmap() operation succeeds or clean up is done. This means that the uprobe code can never observe an incompletely constructed mm and resolves the issue in this case. Link: https://lkml.kernel.org/r/20241210172412.52995-1-lorenzo.stoakes@oracle.com Fixes: d24062914837 ("fork: use __mt_dup() to duplicate maple tree in dup_mmap()") Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reported-by: syzbot+2d788f4f7cb660dac4b7@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/6756d273.050a0220.2477f.003d.GAE@google.com/ Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Ian Rogers <irogers@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Kan Liang <kan.liang@linux.intel.com> Cc: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peng Zhang <zhangpeng.00@bytedance.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Steven Rostedt
|
8cd63406d0 |
trace/ring-buffer: Do not use TP_printk() formatting for boot mapped buffers
The TP_printk() of a TRACE_EVENT() is a generic printf format that any developer can create for their event. It may include pointers to strings and such. A boot mapped buffer may contain data from a previous kernel where the strings addresses are different. One solution is to copy the event content and update the pointers by the recorded delta, but a simpler solution (for now) is to just use the print_fields() function to print these events. The print_fields() function just iterates the fields and prints them according to what type they are, and ignores the TP_printk() format from the event itself. To understand the difference, when printing via TP_printk() the output looks like this: 4582.696626: kmem_cache_alloc: call_site=getname_flags+0x47/0x1f0 ptr=00000000e70e10e0 bytes_req=4096 bytes_alloc=4096 gfp_flags=GFP_KERNEL node=-1 accounted=false 4582.696629: kmem_cache_alloc: call_site=alloc_empty_file+0x6b/0x110 ptr=0000000095808002 bytes_req=360 bytes_alloc=384 gfp_flags=GFP_KERNEL node=-1 accounted=false 4582.696630: kmem_cache_alloc: call_site=security_file_alloc+0x24/0x100 ptr=00000000576339c3 bytes_req=16 bytes_alloc=16 gfp_flags=GFP_KERNEL|__GFP_ZERO node=-1 accounted=false 4582.696653: kmem_cache_free: call_site=do_sys_openat2+0xa7/0xd0 ptr=00000000e70e10e0 name=names_cache But when printing via print_fields() (echo 1 > /sys/kernel/tracing/options/fields) the same event output looks like this: 4582.696626: kmem_cache_alloc: call_site=0xffffffff92d10d97 (-1831793257) ptr=0xffff9e0e8571e000 (-107689771147264) bytes_req=0x1000 (4096) bytes_alloc=0x1000 (4096) gfp_flags=0xcc0 (3264) node=0xffffffff (-1) accounted=(0) 4582.696629: kmem_cache_alloc: call_site=0xffffffff92d0250b (-1831852789) ptr=0xffff9e0e8577f800 (-107689770747904) bytes_req=0x168 (360) bytes_alloc=0x180 (384) gfp_flags=0xcc0 (3264) node=0xffffffff (-1) accounted=(0) 4582.696630: kmem_cache_alloc: call_site=0xffffffff92efca74 (-1829778828) ptr=0xffff9e0e8d35d3b0 (-107689640864848) bytes_req=0x10 (16) bytes_alloc=0x10 (16) gfp_flags=0xdc0 (3520) node=0xffffffff (-1) accounted=(0) 4582.696653: kmem_cache_free: call_site=0xffffffff92cfbea7 (-1831879001) ptr=0xffff9e0e8571e000 (-107689771147264) name=names_cache Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/20241218141507.28389a1d@gandalf.local.home Fixes: 07714b4bb3f98 ("tracing: Handle old buffer mappings for event strings and functions") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
Edward Adam Davis
|
c58a812c8e |
ring-buffer: Fix overflow in __rb_map_vma
An overflow occurred when performing the following calculation: nr_pages = ((nr_subbufs + 1) << subbuf_order) - pgoff; Add a check before the calculation to avoid this problem. syzbot reported this as a slab-out-of-bounds in __rb_map_vma: BUG: KASAN: slab-out-of-bounds in __rb_map_vma+0x9ab/0xae0 kernel/trace/ring_buffer.c:7058 Read of size 8 at addr ffff8880767dd2b8 by task syz-executor187/5836 CPU: 0 UID: 0 PID: 5836 Comm: syz-executor187 Not tainted 6.13.0-rc2-syzkaller-00159-gf932fb9b4074 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 11/25/2024 Call Trace: <TASK> __dump_stack lib/dump_stack.c:94 [inline] dump_stack_lvl+0x116/0x1f0 lib/dump_stack.c:120 print_address_description mm/kasan/report.c:378 [inline] print_report+0xc3/0x620 mm/kasan/report.c:489 kasan_report+0xd9/0x110 mm/kasan/report.c:602 __rb_map_vma+0x9ab/0xae0 kernel/trace/ring_buffer.c:7058 ring_buffer_map+0x56e/0x9b0 kernel/trace/ring_buffer.c:7138 tracing_buffers_mmap+0xa6/0x120 kernel/trace/trace.c:8482 call_mmap include/linux/fs.h:2183 [inline] mmap_file mm/internal.h:124 [inline] __mmap_new_file_vma mm/vma.c:2291 [inline] __mmap_new_vma mm/vma.c:2355 [inline] __mmap_region+0x1786/0x2670 mm/vma.c:2456 mmap_region+0x127/0x320 mm/mmap.c:1348 do_mmap+0xc00/0xfc0 mm/mmap.c:496 vm_mmap_pgoff+0x1ba/0x360 mm/util.c:580 ksys_mmap_pgoff+0x32c/0x5c0 mm/mmap.c:542 __do_sys_mmap arch/x86/kernel/sys_x86_64.c:89 [inline] __se_sys_mmap arch/x86/kernel/sys_x86_64.c:82 [inline] __x64_sys_mmap+0x125/0x190 arch/x86/kernel/sys_x86_64.c:82 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xcd/0x250 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f The reproducer for this bug is: ------------------------8<------------------------- #include <fcntl.h> #include <stdlib.h> #include <unistd.h> #include <asm/types.h> #include <sys/mman.h> int main(int argc, char **argv) { int page_size = getpagesize(); int fd; void *meta; system("echo 1 > /sys/kernel/tracing/buffer_size_kb"); fd = open("/sys/kernel/tracing/per_cpu/cpu0/trace_pipe_raw", O_RDONLY); meta = mmap(NULL, page_size, PROT_READ, MAP_SHARED, fd, page_size * 5); } ------------------------>8------------------------- Cc: stable@vger.kernel.org Fixes: 117c39200d9d7 ("ring-buffer: Introducing ring-buffer mapping functions") Link: https://lore.kernel.org/tencent_06924B6674ED771167C23CC336C097223609@qq.com Reported-by: syzbot+345e4443a21200874b18@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=345e4443a21200874b18 Signed-off-by: Edward Adam Davis <eadavis@qq.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
Linus Torvalds
|
c061cf420d |
tracing fixes for v6.13:
- Replace trace_check_vprintf() with test_event_printk() and ignore_event() The function test_event_printk() checks on boot up if the trace event printf() formats dereference any pointers, and if they do, it then looks at the arguments to make sure that the pointers they dereference will exist in the event on the ring buffer. If they do not, it issues a WARN_ON() as it is a likely bug. But this isn't the case for the strings that can be dereferenced with "%s", as some trace events (notably RCU and some IPI events) save a pointer to a static string in the ring buffer. As the string it points to lives as long as the kernel is running, it is not a bug to reference it, as it is guaranteed to be there when the event is read. But it is also possible (and a common bug) to point to some allocated string that could be freed before the trace event is read and the dereference is to bad memory. This case requires a run time check. The previous way to handle this was with trace_check_vprintf() that would process the printf format piece by piece and send what it didn't care about to vsnprintf() to handle arguments that were not strings. This kept it from having to reimplement vsnprintf(). But it relied on va_list implementation and for architectures that copied the va_list and did not pass it by reference, it wasn't even possible to do this check and it would be skipped. As 64bit x86 passed va_list by reference, most events were tested and this kept out bugs where strings would have been dereferenced after being freed. Instead of relying on the implementation of va_list, extend the boot up test_event_printk() function to validate all the "%s" strings that can be validated at boot, and for the few events that point to strings outside the ring buffer, flag both the event and the field that is dereferenced as "needs_test". Then before the event is printed, a call to ignore_event() is made, and if the event has the flag set, it iterates all its fields and for every field that is to be tested, it will read the pointer directly from the event in the ring buffer and make sure that it is valid. If the pointer is not valid, it will print a WARN_ON(), print out to the trace that the event has unsafe memory and ignore the print format. With this new update, the trace_check_vprintf() can be safely removed and now all events can be verified regardless of architecture. -----BEGIN PGP SIGNATURE----- iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZ2IqiRQccm9zdGVkdEBn b29kbWlzLm9yZwAKCRAp5XQQmuv6qlfgAP9hJFl6zhA5GGRo905G9JWFHkbNNjgp WfQ0oMU2Eo1q+AEAmb5d3wWfWJAa+AxiiDNeZ28En/+ZbmjhSe6fPpR4egU= =LRKi -----END PGP SIGNATURE----- Merge tag 'trace-v6.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing fixes from Steven Rostedt: "Replace trace_check_vprintf() with test_event_printk() and ignore_event() The function test_event_printk() checks on boot up if the trace event printf() formats dereference any pointers, and if they do, it then looks at the arguments to make sure that the pointers they dereference will exist in the event on the ring buffer. If they do not, it issues a WARN_ON() as it is a likely bug. But this isn't the case for the strings that can be dereferenced with "%s", as some trace events (notably RCU and some IPI events) save a pointer to a static string in the ring buffer. As the string it points to lives as long as the kernel is running, it is not a bug to reference it, as it is guaranteed to be there when the event is read. But it is also possible (and a common bug) to point to some allocated string that could be freed before the trace event is read and the dereference is to bad memory. This case requires a run time check. The previous way to handle this was with trace_check_vprintf() that would process the printf format piece by piece and send what it didn't care about to vsnprintf() to handle arguments that were not strings. This kept it from having to reimplement vsnprintf(). But it relied on va_list implementation and for architectures that copied the va_list and did not pass it by reference, it wasn't even possible to do this check and it would be skipped. As 64bit x86 passed va_list by reference, most events were tested and this kept out bugs where strings would have been dereferenced after being freed. Instead of relying on the implementation of va_list, extend the boot up test_event_printk() function to validate all the "%s" strings that can be validated at boot, and for the few events that point to strings outside the ring buffer, flag both the event and the field that is dereferenced as "needs_test". Then before the event is printed, a call to ignore_event() is made, and if the event has the flag set, it iterates all its fields and for every field that is to be tested, it will read the pointer directly from the event in the ring buffer and make sure that it is valid. If the pointer is not valid, it will print a WARN_ON(), print out to the trace that the event has unsafe memory and ignore the print format. With this new update, the trace_check_vprintf() can be safely removed and now all events can be verified regardless of architecture" * tag 'trace-v6.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: tracing: Check "%s" dereference via the field and not the TP_printk format tracing: Add "%s" check in test_event_printk() tracing: Add missing helper functions in event pointer dereference check tracing: Fix test_event_printk() to process entire print argument |
||
Andrea Righi
|
23579010cf |
bpf: Fix bpf_get_smp_processor_id() on !CONFIG_SMP
On x86-64 calling bpf_get_smp_processor_id() in a kernel with CONFIG_SMP disabled can trigger the following bug, as pcpu_hot is unavailable: [ 8.471774] BUG: unable to handle page fault for address: 00000000936a290c [ 8.471849] #PF: supervisor read access in kernel mode [ 8.471881] #PF: error_code(0x0000) - not-present page Fix by inlining a return 0 in the !CONFIG_SMP case. Fixes: 1ae6921009e5 ("bpf: inline bpf_get_smp_processor_id() helper") Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20241217195813.622568-1-arighi@nvidia.com |
||
Linus Torvalds
|
5529876063 |
Ftrace fixes for 6.13:
- Always try to initialize the idle functions when graph tracer starts A bug was found that when a CPU is offline when graph tracing starts and then comes online, that CPU is not traced. The fix to that was to move the initialization of the idle shadow stack over to the hot plug online logic, which also handle onlined CPUs. The issue was that it removed the initialization of the shadow stack when graph tracing starts, but the callbacks to the hot plug logic do nothing if graph tracing isn't currently running. Although that fix fixed the onlining of a CPU during tracing, it broke the CPUs that were already online. - Have microblaze not try to get the "true parent" in function tracing If function tracing and graph tracing are both enabled at the same time the parent of the functions traced by the function tracer may sometimes be the graph tracing trampoline. The graph tracing hijacks the return pointer of the function to trace it, but that can interfere with the function tracing parent output. This was fixed by using the ftrace_graph_ret_addr() function passing in the kernel stack pointer using the ftrace_regs_get_stack_pointer() function. But Al Viro reported that Microblaze does not implement the kernel_stack_pointer(regs) helper function that ftrace_regs_get_stack_pointer() uses and fails to compile when function graph tracing is enabled. It was first thought that this was a microblaze issue, but the real cause is that this only works when an architecture implements HAVE_DYNAMIC_FTRACE_WITH_ARGS, as a requirement for that config is to have ftrace always pass a valid ftrace_regs to the callbacks. That also means that the architecture supports ftrace_regs_get_stack_pointer() Microblaze does not set HAVE_DYNAMIC_FTRACE_WITH_ARGS nor does it implement ftrace_regs_get_stack_pointer() which caused it to fail to build. Only implement the "true parent" logic if an architecture has that config set. -----BEGIN PGP SIGNATURE----- iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZ2GoLxQccm9zdGVkdEBn b29kbWlzLm9yZwAKCRAp5XQQmuv6qrooAQCY2e6mwLFIb3HttmC5KikrEE48YLOj QEz3UGb2zrxVTQD/ebYtXTiZSU/oS+CHdDsXhKSq7jKdLlRWjqUTx81PJQs= =mvcR -----END PGP SIGNATURE----- Merge tag 'ftrace-v6.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull ftrace fixes from Steven Rostedt: - Always try to initialize the idle functions when graph tracer starts A bug was found that when a CPU is offline when graph tracing starts and then comes online, that CPU is not traced. The fix to that was to move the initialization of the idle shadow stack over to the hot plug online logic, which also handle onlined CPUs. The issue was that it removed the initialization of the shadow stack when graph tracing starts, but the callbacks to the hot plug logic do nothing if graph tracing isn't currently running. Although that fix fixed the onlining of a CPU during tracing, it broke the CPUs that were already online. - Have microblaze not try to get the "true parent" in function tracing If function tracing and graph tracing are both enabled at the same time the parent of the functions traced by the function tracer may sometimes be the graph tracing trampoline. The graph tracing hijacks the return pointer of the function to trace it, but that can interfere with the function tracing parent output. This was fixed by using the ftrace_graph_ret_addr() function passing in the kernel stack pointer using the ftrace_regs_get_stack_pointer() function. But Al Viro reported that Microblaze does not implement the kernel_stack_pointer(regs) helper function that ftrace_regs_get_stack_pointer() uses and fails to compile when function graph tracing is enabled. It was first thought that this was a microblaze issue, but the real cause is that this only works when an architecture implements HAVE_DYNAMIC_FTRACE_WITH_ARGS, as a requirement for that config is to have ftrace always pass a valid ftrace_regs to the callbacks. That also means that the architecture supports ftrace_regs_get_stack_pointer() Microblaze does not set HAVE_DYNAMIC_FTRACE_WITH_ARGS nor does it implement ftrace_regs_get_stack_pointer() which caused it to fail to build. Only implement the "true parent" logic if an architecture has that config set" * tag 'ftrace-v6.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: ftrace: Do not find "true_parent" if HAVE_DYNAMIC_FTRACE_WITH_ARGS is not set fgraph: Still initialize idle shadow stacks when starting |
||
Steven Rostedt
|
afd2627f72 |
tracing: Check "%s" dereference via the field and not the TP_printk format
The TP_printk() portion of a trace event is executed at the time a event is read from the trace. This can happen seconds, minutes, hours, days, months, years possibly later since the event was recorded. If the print format contains a dereference to a string via "%s", and that string was allocated, there's a chance that string could be freed before it is read by the trace file. To protect against such bugs, there are two functions that verify the event. The first one is test_event_printk(), which is called when the event is created. It reads the TP_printk() format as well as its arguments to make sure nothing may be dereferencing a pointer that was not copied into the ring buffer along with the event. If it is, it will trigger a WARN_ON(). For strings that use "%s", it is not so easy. The string may not reside in the ring buffer but may still be valid. Strings that are static and part of the kernel proper which will not be freed for the life of the running system, are safe to dereference. But to know if it is a pointer to a static string or to something on the heap can not be determined until the event is triggered. This brings us to the second function that tests for the bad dereferencing of strings, trace_check_vprintf(). It would walk through the printf format looking for "%s", and when it finds it, it would validate that the pointer is safe to read. If not, it would produces a WARN_ON() as well and write into the ring buffer "[UNSAFE-MEMORY]". The problem with this is how it used va_list to have vsnprintf() handle all the cases that it didn't need to check. Instead of re-implementing vsnprintf(), it would make a copy of the format up to the %s part, and call vsnprintf() with the current va_list ap variable, where the ap would then be ready to point at the string in question. For architectures that passed va_list by reference this was possible. For architectures that passed it by copy it was not. A test_can_verify() function was used to differentiate between the two, and if it wasn't possible, it would disable it. Even for architectures where this was feasible, it was a stretch to rely on such a method that is undocumented, and could cause issues later on with new optimizations of the compiler. Instead, the first function test_event_printk() was updated to look at "%s" as well. If the "%s" argument is a pointer outside the event in the ring buffer, it would find the field type of the event that is the problem and mark the structure with a new flag called "needs_test". The event itself will be marked by TRACE_EVENT_FL_TEST_STR to let it be known that this event has a field that needs to be verified before the event can be printed using the printf format. When the event fields are created from the field type structure, the fields would copy the field type's "needs_test" value. Finally, before being printed, a new function ignore_event() is called which will check if the event has the TEST_STR flag set (if not, it returns false). If the flag is set, it then iterates through the events fields looking for the ones that have the "needs_test" flag set. Then it uses the offset field from the field structure to find the pointer in the ring buffer event. It runs the tests to make sure that pointer is safe to print and if not, it triggers the WARN_ON() and also adds to the trace output that the event in question has an unsafe memory access. The ignore_event() makes the trace_check_vprintf() obsolete so it is removed. Link: https://lore.kernel.org/all/CAHk-=wh3uOnqnZPpR0PeLZZtyWbZLboZ7cHLCKRWsocvs9Y7hQ@mail.gmail.com/ Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/20241217024720.848621576@goodmis.org Fixes: 5013f454a352c ("tracing: Add check of trace event print fmts for dereferencing pointers") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
Steven Rostedt
|
65a25d9f7a |
tracing: Add "%s" check in test_event_printk()
The test_event_printk() code makes sure that when a trace event is registered, any dereferenced pointers in from the event's TP_printk() are pointing to content in the ring buffer. But currently it does not handle "%s", as there's cases where the string pointer saved in the ring buffer points to a static string in the kernel that will never be freed. As that is a valid case, the pointer needs to be checked at runtime. Currently the runtime check is done via trace_check_vprintf(), but to not have to replicate everything in vsnprintf() it does some logic with the va_list that may not be reliable across architectures. In order to get rid of that logic, more work in the test_event_printk() needs to be done. Some of the strings can be validated at this time when it is obvious the string is valid because the string will be saved in the ring buffer content. Do all the validation of strings in the ring buffer at boot in test_event_printk(), and make sure that the field of the strings that point into the kernel are accessible. This will allow adding checks at runtime that will validate the fields themselves and not rely on paring the TP_printk() format at runtime. Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/20241217024720.685917008@goodmis.org Fixes: 5013f454a352c ("tracing: Add check of trace event print fmts for dereferencing pointers") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
Steven Rostedt
|
917110481f |
tracing: Add missing helper functions in event pointer dereference check
The process_pointer() helper function looks to see if various trace event macros are used. These macros are for storing data in the event. This makes it safe to dereference as the dereference will then point into the event on the ring buffer where the content of the data stays with the event itself. A few helper functions were missing. Those were: __get_rel_dynamic_array() __get_dynamic_array_len() __get_rel_dynamic_array_len() __get_rel_sockaddr() Also add a helper function find_print_string() to not need to use a middle man variable to test if the string exists. Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/20241217024720.521836792@goodmis.org Fixes: 5013f454a352c ("tracing: Add check of trace event print fmts for dereferencing pointers") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
Steven Rostedt
|
a6629626c5 |
tracing: Fix test_event_printk() to process entire print argument
The test_event_printk() analyzes print formats of trace events looking for cases where it may dereference a pointer that is not in the ring buffer which can possibly be a bug when the trace event is read from the ring buffer and the content of that pointer no longer exists. The function needs to accurately go from one print format argument to the next. It handles quotes and parenthesis that may be included in an argument. When it finds the start of the next argument, it uses a simple "c = strstr(fmt + i, ',')" to find the end of that argument! In order to include "%s" dereferencing, it needs to process the entire content of the print format argument and not just the content of the first ',' it finds. As there may be content like: ({ const char *saved_ptr = trace_seq_buffer_ptr(p); static const char *access_str[] = { "---", "--x", "w--", "w-x", "-u-", "-ux", "wu-", "wux" }; union kvm_mmu_page_role role; role.word = REC->role; trace_seq_printf(p, "sp gen %u gfn %llx l%u %u-byte q%u%s %s%s" " %snxe %sad root %u %s%c", REC->mmu_valid_gen, REC->gfn, role.level, role.has_4_byte_gpte ? 4 : 8, role.quadrant, role.direct ? " direct" : "", access_str[role.access], role.invalid ? " invalid" : "", role.efer_nx ? "" : "!", role.ad_disabled ? "!" : "", REC->root_count, REC->unsync ? "unsync" : "sync", 0); saved_ptr; }) Which is an example of a full argument of an existing event. As the code already handles finding the next print format argument, process the argument at the end of it and not the start of it. This way it has both the start of the argument as well as the end of it. Add a helper function "process_pointer()" that will do the processing during the loop as well as at the end. It also makes the code cleaner and easier to read. Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/20241217024720.362271189@goodmis.org Fixes: 5013f454a352c ("tracing: Add check of trace event print fmts for dereferencing pointers") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
Linus Torvalds
|
59dbb9d81a |
XSA-465 and XSA-466 security patches for v6.13
-----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRTLbB6QfY48x44uB6AXGG7T9hjvgUCZ2EoeQAKCRCAXGG7T9hj vv0FAQDvP7/oSa3bx1rNrlBbmaTOCqAFX9HJRcb39OUsYyzqgQEAt7jGG6uau+xO VRAE1u/s+9PA0VGQK8/+HEm0kGYA7wA= =CiGc -----END PGP SIGNATURE----- Merge tag 'xsa465+xsa466-6.13-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip Pull xen fixes from Juergen Gross: "Fix xen netfront crash (XSA-465) and avoid using the hypercall page that doesn't do speculation mitigations (XSA-466)" * tag 'xsa465+xsa466-6.13-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip: x86/xen: remove hypercall page x86/xen: use new hypercall functions instead of hypercall page x86/xen: add central hypercall functions x86/xen: don't do PV iret hypercall through hypercall page x86/static-call: provide a way to do very early static-call updates objtool/x86: allow syscall instruction x86: make get_cpu_vendor() accessible from Xen code xen/netfront: fix crash when removing device |
||
Steven Rostedt
|
166438a432 |
ftrace: Do not find "true_parent" if HAVE_DYNAMIC_FTRACE_WITH_ARGS is not set
When function tracing and function graph tracing are both enabled (in different instances) the "parent" of some of the function tracing events is "return_to_handler" which is the trampoline used by function graph tracing. To fix this, ftrace_get_true_parent_ip() was introduced that returns the "true" parent ip instead of the trampoline. To do this, the ftrace_regs_get_stack_pointer() is used, which uses kernel_stack_pointer(). The problem is that microblaze does not implement kerenl_stack_pointer() so when function graph tracing is enabled, the build fails. But microblaze also does not enabled HAVE_DYNAMIC_FTRACE_WITH_ARGS. That option has to be enabled by the architecture to reliably get the values from the fregs parameter passed in. When that config is not set, the architecture can also pass in NULL, which is not tested for in that function and could cause the kernel to crash. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: Michal Simek <monstr@monstr.eu> Cc: Jeff Xie <jeff.xie@linux.dev> Link: https://lore.kernel.org/20241216164633.6df18e87@gandalf.local.home Fixes: 60b1f578b578 ("ftrace: Get the true parent ip for function tracer") Reported-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
Steven Rostedt
|
cc252bb592 |
fgraph: Still initialize idle shadow stacks when starting
A bug was discovered where the idle shadow stacks were not initialized for offline CPUs when starting function graph tracer, and when they came online they were not traced due to the missing shadow stack. To fix this, the idle task shadow stack initialization was moved to using the CPU hotplug callbacks. But it removed the initialization when the function graph was enabled. The problem here is that the hotplug callbacks are called when the CPUs come online, but the idle shadow stack initialization only happens if function graph is currently active. This caused the online CPUs to not get their shadow stack initialized. The idle shadow stack initialization still needs to be done when the function graph is registered, as they will not be allocated if function graph is not registered. Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/20241211135335.094ba282@batman.local.home Fixes: 2c02f7375e65 ("fgraph: Use CPU hotplug mechanism to initialize idle shadow stacks") Reported-by: Linus Walleij <linus.walleij@linaro.org> Tested-by: Linus Walleij <linus.walleij@linaro.org> Closes: https://lore.kernel.org/all/CACRpkdaTBrHwRbbrphVy-=SeDz6MSsXhTKypOtLrTQ+DgGAOcQ@mail.gmail.com/ Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
Linus Torvalds
|
acd855a949 |
- Prevent incorrect dequeueing of the deadline dlserver helper task and fix
its time accounting - Properly track the CFS runqueue runnable stats - Check the total number of all queued tasks in a sched fair's runqueue hierarchy before deciding to stop the tick - Fix the scheduling of the task that got woken last (NEXT_BUDDY) by preventing those from being delayed -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmdexEsACgkQEsHwGGHe VUpFqA//SIIbNJEIQEwGkFrYpGwVpSISm94L4ENsrkWbJWQlALwQEBJF9Me/DOZH vHaX3o+cMxt26W7o0NKyPcvYtulnOr33HZA/uxK35MDaUinSA3Spt3jXHfR3n0mL ljNQQraWHGaJh7dzKMZoxP6DR78/Z0yotXjt33xeBFMSJuzGsklrbIiSJ6c4m/3u Y1lrQT8LncsxJMYIPAKtBAc9hvJfGFV6IOTaTfxP0oTuDo/2qTNVHm7to40wk3NW kb0lf2kjVtE6mwMfEm49rtjE3h0VnPJKGKoEkLi9IQoPbQq9Uf4i9VSmRe3zqPAz yBxV8BAu2koscMZzqw1CTnd9c/V+/A9qOOHfDo72I5MriJ1qVWCEsqB1y3u2yT6n XjwFDbPiVKI8H9YlsZpWERocCRypshevPNlYOF93PlK+YTXoMWaXMQhec5NDzLLw Se1K2sCi3U8BMdln0dH6nhk0unzNKQ8UKzrMFncSjnpWhpJ69uxyUZ/jL//6bvfi Z+7G4U54mUhGyOAaUSGH/20TnZRWJ7NJC542omFgg9v0VLxx+wnZyX4zJIV0jvRr 6voYmYDCO8zn/hO67VBJuei97ayIzxDNP1tVl15LzcvRcIGWNUPOwp5jijv8vDJG lJhQrMF6w4fgPItC20FvptlDvpP9cItSzyyOeg074HjDS53QN2Y= =jOb3 -----END PGP SIGNATURE----- Merge tag 'sched_urgent_for_v6.13_rc3-p2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Borislav Petkov: - Prevent incorrect dequeueing of the deadline dlserver helper task and fix its time accounting - Properly track the CFS runqueue runnable stats - Check the total number of all queued tasks in a sched fair's runqueue hierarchy before deciding to stop the tick - Fix the scheduling of the task that got woken last (NEXT_BUDDY) by preventing those from being delayed * tag 'sched_urgent_for_v6.13_rc3-p2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/dlserver: Fix dlserver time accounting sched/dlserver: Fix dlserver double enqueue sched/eevdf: More PELT vs DELAYED_DEQUEUE sched/fair: Fix sched_can_stop_tick() for fair tasks sched/fair: Fix NEXT_BUDDY |
||
Linus Torvalds
|
35f301dd45 |
BPF fixes:
- Fix a bug in the BPF verifier to track changes to packet data property for global functions (Eduard Zingerman) - Fix a theoretical BPF prog_array use-after-free in RCU handling of __uprobe_perf_func (Jann Horn) - Fix BPF tracing to have an explicit list of tracepoints and their arguments which need to be annotated as PTR_MAYBE_NULL (Kumar Kartikeya Dwivedi) - Fix a logic bug in the bpf_remove_insns code where a potential error would have been wrongly propagated (Anton Protopopov) - Avoid deadlock scenarios caused by nested kprobe and fentry BPF programs (Priya Bala Govindasamy) - Fix a bug in BPF verifier which was missing a size check for BTF-based context access (Kumar Kartikeya Dwivedi) - Fix a crash found by syzbot through an invalid BPF prog_array access in perf_event_detach_bpf_prog (Jiri Olsa) - Fix several BPF sockmap bugs including a race causing a refcount imbalance upon element replace (Michal Luczaj) - Fix a use-after-free from mismatching BPF program/attachment RCU flavors (Jann Horn) Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> -----BEGIN PGP SIGNATURE----- iIsEABYKADMWIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZ13rdhUcZGFuaWVsQGlv Z2VhcmJveC5uZXQACgkQ2yufC7HISINfqAD7B2vX6EgTFrgy7QDepQnZsmu2qjdW fFUzPatFXXp2S3MA/16vOEoHJ4rRhBkcUK/vw3gyY5j5bYZNUTTaam5l4BcM =gkfb -----END PGP SIGNATURE----- Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf Pull bpf fixes from Daniel Borkmann: - Fix a bug in the BPF verifier to track changes to packet data property for global functions (Eduard Zingerman) - Fix a theoretical BPF prog_array use-after-free in RCU handling of __uprobe_perf_func (Jann Horn) - Fix BPF tracing to have an explicit list of tracepoints and their arguments which need to be annotated as PTR_MAYBE_NULL (Kumar Kartikeya Dwivedi) - Fix a logic bug in the bpf_remove_insns code where a potential error would have been wrongly propagated (Anton Protopopov) - Avoid deadlock scenarios caused by nested kprobe and fentry BPF programs (Priya Bala Govindasamy) - Fix a bug in BPF verifier which was missing a size check for BTF-based context access (Kumar Kartikeya Dwivedi) - Fix a crash found by syzbot through an invalid BPF prog_array access in perf_event_detach_bpf_prog (Jiri Olsa) - Fix several BPF sockmap bugs including a race causing a refcount imbalance upon element replace (Michal Luczaj) - Fix a use-after-free from mismatching BPF program/attachment RCU flavors (Jann Horn) * tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: (23 commits) bpf: Avoid deadlock caused by nested kprobe and fentry bpf programs selftests/bpf: Add tests for raw_tp NULL args bpf: Augment raw_tp arguments with PTR_MAYBE_NULL bpf: Revert "bpf: Mark raw_tp arguments with PTR_MAYBE_NULL" selftests/bpf: Add test for narrow ctx load for pointer args bpf: Check size for BTF-based ctx access of pointer members selftests/bpf: extend changes_pkt_data with cases w/o subprograms bpf: fix null dereference when computing changes_pkt_data of prog w/o subprogs bpf: Fix theoretical prog_array UAF in __uprobe_perf_func() bpf: fix potential error return selftests/bpf: validate that tail call invalidates packet pointers bpf: consider that tail calls invalidate packet pointers selftests/bpf: freplace tests for tracking of changes_packet_data bpf: check changes_pkt_data property for extension programs selftests/bpf: test for changing packet data from global functions bpf: track changes_pkt_data property for global functions bpf: refactor bpf_helper_changes_pkt_data to use helper number bpf: add find_containing_subprog() utility function bpf,perf: Fix invalid prog_array access in perf_event_detach_bpf_prog bpf: Fix UAF via mismatching bpf_prog/attachment RCU flavors ... |
||
Priya Bala Govindasamy
|
c83508da56 |
bpf: Avoid deadlock caused by nested kprobe and fentry bpf programs
BPF program types like kprobe and fentry can cause deadlocks in certain situations. If a function takes a lock and one of these bpf programs is hooked to some point in the function's critical section, and if the bpf program tries to call the same function and take the same lock it will lead to deadlock. These situations have been reported in the following bug reports. In percpu_freelist - Link: https://lore.kernel.org/bpf/CAADnVQLAHwsa+2C6j9+UC6ScrDaN9Fjqv1WjB1pP9AzJLhKuLQ@mail.gmail.com/T/ Link: https://lore.kernel.org/bpf/CAPPBnEYm+9zduStsZaDnq93q1jPLqO-PiKX9jy0MuL8LCXmCrQ@mail.gmail.com/T/ In bpf_lru_list - Link: https://lore.kernel.org/bpf/CAPPBnEajj+DMfiR_WRWU5=6A7KKULdB5Rob_NJopFLWF+i9gCA@mail.gmail.com/T/ Link: https://lore.kernel.org/bpf/CAPPBnEZQDVN6VqnQXvVqGoB+ukOtHGZ9b9U0OLJJYvRoSsMY_g@mail.gmail.com/T/ Link: https://lore.kernel.org/bpf/CAPPBnEaCB1rFAYU7Wf8UxqcqOWKmRPU1Nuzk3_oLk6qXR7LBOA@mail.gmail.com/T/ Similar bugs have been reported by syzbot. In queue_stack_maps - Link: https://lore.kernel.org/lkml/0000000000004c3fc90615f37756@google.com/ Link: https://lore.kernel.org/all/20240418230932.2689-1-hdanton@sina.com/T/ In lpm_trie - Link: https://lore.kernel.org/linux-kernel/00000000000035168a061a47fa38@google.com/T/ In ringbuf - Link: https://lore.kernel.org/bpf/20240313121345.2292-1-hdanton@sina.com/T/ Prevent kprobe and fentry bpf programs from attaching to these critical sections by removing CC_FLAGS_FTRACE for percpu_freelist.o, bpf_lru_list.o, queue_stack_maps.o, lpm_trie.o, ringbuf.o files. The bugs reported by syzbot are due to tracepoint bpf programs being called in the critical sections. This patch does not aim to fix deadlocks caused by tracepoint programs. However, it does prevent deadlocks from occurring in similar situations due to kprobe and fentry programs. Signed-off-by: Priya Bala Govindasamy <pgovind2@uci.edu> Link: https://lore.kernel.org/r/CAPPBnEZpjGnsuA26Mf9kYibSaGLm=oF6=12L21X1GEQdqjLnzQ@mail.gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
Kumar Kartikeya Dwivedi
|
838a10bd2e |
bpf: Augment raw_tp arguments with PTR_MAYBE_NULL
Arguments to a raw tracepoint are tagged as trusted, which carries the semantics that the pointer will be non-NULL. However, in certain cases, a raw tracepoint argument may end up being NULL. More context about this issue is available in [0]. Thus, there is a discrepancy between the reality, that raw_tp arguments can actually be NULL, and the verifier's knowledge, that they are never NULL, causing explicit NULL check branch to be dead code eliminated. A previous attempt [1], i.e. the second fixed commit, was made to simulate symbolic execution as if in most accesses, the argument is a non-NULL raw_tp, except for conditional jumps. This tried to suppress branch prediction while preserving compatibility, but surfaced issues with production programs that were difficult to solve without increasing verifier complexity. A more complete discussion of issues and fixes is available at [2]. Fix this by maintaining an explicit list of tracepoints where the arguments are known to be NULL, and mark the positional arguments as PTR_MAYBE_NULL. Additionally, capture the tracepoints where arguments are known to be ERR_PTR, and mark these arguments as scalar values to prevent potential dereference. Each hex digit is used to encode NULL-ness (0x1) or ERR_PTR-ness (0x2), shifted by the zero-indexed argument number x 4. This can be represented as follows: 1st arg: 0x1 2nd arg: 0x10 3rd arg: 0x100 ... and so on (likewise for ERR_PTR case). In the future, an automated pass will be used to produce such a list, or insert __nullable annotations automatically for tracepoints. Each compilation unit will be analyzed and results will be collated to find whether a tracepoint pointer is definitely not null, maybe null, or an unknown state where verifier conservatively marks it PTR_MAYBE_NULL. A proof of concept of this tool from Eduard is available at [3]. Note that in case we don't find a specification in the raw_tp_null_args array and the tracepoint belongs to a kernel module, we will conservatively mark the arguments as PTR_MAYBE_NULL. This is because unlike for in-tree modules, out-of-tree module tracepoints may pass NULL freely to the tracepoint. We don't protect against such tracepoints passing ERR_PTR (which is uncommon anyway), lest we mark all such arguments as SCALAR_VALUE. While we are it, let's adjust the test raw_tp_null to not perform dereference of the skb->mark, as that won't be allowed anymore, and make it more robust by using inline assembly to test the dead code elimination behavior, which should still stay the same. [0]: https://lore.kernel.org/bpf/ZrCZS6nisraEqehw@jlelli-thinkpadt14gen4.remote.csb [1]: https://lore.kernel.org/all/20241104171959.2938862-1-memxor@gmail.com [2]: https://lore.kernel.org/bpf/20241206161053.809580-1-memxor@gmail.com [3]: https://github.com/eddyz87/llvm-project/tree/nullness-for-tracepoint-params Reported-by: Juri Lelli <juri.lelli@redhat.com> # original bug Reported-by: Manu Bretelle <chantra@meta.com> # bugs in masking fix Fixes: 3f00c5239344 ("bpf: Allow trusted pointers to be passed to KF_TRUSTED_ARGS kfuncs") Fixes: cb4158ce8ec8 ("bpf: Mark raw_tp arguments with PTR_MAYBE_NULL") Reviewed-by: Eduard Zingerman <eddyz87@gmail.com> Co-developed-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20241213221929.3495062-3-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
Kumar Kartikeya Dwivedi
|
c00d738e16 |
bpf: Revert "bpf: Mark raw_tp arguments with PTR_MAYBE_NULL"
This patch reverts commit cb4158ce8ec8 ("bpf: Mark raw_tp arguments with PTR_MAYBE_NULL"). The patch was well-intended and meant to be as a stop-gap fixing branch prediction when the pointer may actually be NULL at runtime. Eventually, it was supposed to be replaced by an automated script or compiler pass detecting possibly NULL arguments and marking them accordingly. However, it caused two main issues observed for production programs and failed to preserve backwards compatibility. First, programs relied on the verifier not exploring == NULL branch when pointer is not NULL, thus they started failing with a 'dereference of scalar' error. Next, allowing raw_tp arguments to be modified surfaced the warning in the verifier that warns against reg->off when PTR_MAYBE_NULL is set. More information, context, and discusson on both problems is available in [0]. Overall, this approach had several shortcomings, and the fixes would further complicate the verifier's logic, and the entire masking scheme would have to be removed eventually anyway. Hence, revert the patch in preparation of a better fix avoiding these issues to replace this commit. [0]: https://lore.kernel.org/bpf/20241206161053.809580-1-memxor@gmail.com Reported-by: Manu Bretelle <chantra@meta.com> Fixes: cb4158ce8ec8 ("bpf: Mark raw_tp arguments with PTR_MAYBE_NULL") Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20241213221929.3495062-2-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
Vineeth Pillai (Google)
|
c7f7e9c731 |
sched/dlserver: Fix dlserver time accounting
dlserver time is accounted when: - dlserver is active and the dlserver proxies the cfs task. - dlserver is active but deferred and cfs task runs after being picked through the normal fair class pick. dl_server_update is called in two places to make sure that both the above times are accounted for. But it doesn't check if dlserver is active or not. Now that we have this dl_server_active flag, we can consolidate dl_server_update into one place and all we need to check is whether dlserver is active or not. When dlserver is active there is only two possible conditions: - dlserver is deferred. - cfs task is running on behalf of dlserver. Fixes: a110a81c52a9 ("sched/deadline: Deferrable dl server") Signed-off-by: "Vineeth Pillai (Google)" <vineeth@bitbyteword.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Marcel Ziswiler <marcel.ziswiler@codethink.co.uk> # ROCK 5B Link: https://lore.kernel.org/r/20241213032244.877029-2-vineeth@bitbyteword.org |
||
Vineeth Pillai (Google)
|
b53127db1d |
sched/dlserver: Fix dlserver double enqueue
dlserver can get dequeued during a dlserver pick_task due to the delayed deueue feature and this can lead to issues with dlserver logic as it still thinks that dlserver is on the runqueue. The dlserver throttling and replenish logic gets confused and can lead to double enqueue of dlserver. Double enqueue of dlserver could happend due to couple of reasons: Case 1 ------ Delayed dequeue feature[1] can cause dlserver being stopped during a pick initiated by dlserver: __pick_next_task pick_task_dl -> server_pick_task pick_task_fair pick_next_entity (if (sched_delayed)) dequeue_entities dl_server_stop server_pick_task goes ahead with update_curr_dl_se without knowing that dlserver is dequeued and this confuses the logic and may lead to unintended enqueue while the server is stopped. Case 2 ------ A race condition between a task dequeue on one cpu and same task's enqueue on this cpu by a remote cpu while the lock is released causing dlserver double enqueue. One cpu would be in the schedule() and releasing RQ-lock: current->state = TASK_INTERRUPTIBLE(); schedule(); deactivate_task() dl_stop_server(); pick_next_task() pick_next_task_fair() sched_balance_newidle() rq_unlock(this_rq) at which point another CPU can take our RQ-lock and do: try_to_wake_up() ttwu_queue() rq_lock() ... activate_task() dl_server_start() --> first enqueue wakeup_preempt() := check_preempt_wakeup_fair() update_curr() update_curr_task() if (current->dl_server) dl_server_update() enqueue_dl_entity() --> second enqueue This bug was not apparent as the enqueue in dl_server_start doesn't usually happen because of the defer logic. But as a side effect of the first case(dequeue during dlserver pick), dl_throttled and dl_yield will be set and this causes the time accounting of dlserver to messup and then leading to a enqueue in dl_server_start. Have an explicit flag representing the status of dlserver to avoid the confusion. This is set in dl_server_start and reset in dlserver_stop. Fixes: 63ba8422f876 ("sched/deadline: Introduce deadline servers") Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: "Vineeth Pillai (Google)" <vineeth@bitbyteword.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Marcel Ziswiler <marcel.ziswiler@codethink.co.uk> # ROCK 5B Link: https://lkml.kernel.org/r/20241213032244.877029-1-vineeth@bitbyteword.org |
||
Juergen Gross
|
0ef8047b73 |
x86/static-call: provide a way to do very early static-call updates
Add static_call_update_early() for updating static-call targets in very early boot. This will be needed for support of Xen guest type specific hypercall functions. This is part of XSA-466 / CVE-2024-53241. Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Juergen Gross <jgross@suse.com> Co-developed-by: Peter Zijlstra <peterz@infradead.org> Co-developed-by: Josh Poimboeuf <jpoimboe@redhat.com> |
||
Kumar Kartikeya Dwivedi
|
659b9ba7cb |
bpf: Check size for BTF-based ctx access of pointer members
Robert Morris reported the following program type which passes the verifier in [0]: SEC("struct_ops/bpf_cubic_init") void BPF_PROG(bpf_cubic_init, struct sock *sk) { asm volatile("r2 = *(u16*)(r1 + 0)"); // verifier should demand u64 asm volatile("*(u32 *)(r2 +1504) = 0"); // 1280 in some configs } The second line may or may not work, but the first instruction shouldn't pass, as it's a narrow load into the context structure of the struct ops callback. The code falls back to btf_ctx_access to ensure correctness and obtaining the types of pointers. Ensure that the size of the access is correctly checked to be 8 bytes, otherwise the verifier thinks the narrow load obtained a trusted BTF pointer and will permit loads/stores as it sees fit. Perform the check on size after we've verified that the load is for a pointer field, as for scalar values narrow loads are fine. Access to structs passed as arguments to a BPF program are also treated as scalars, therefore no adjustment is needed in their case. Existing verifier selftests are broken by this change, but because they were incorrect. Verifier tests for d_path were performing narrow load into context to obtain path pointer, had this program actually run it would cause a crash. The same holds for verifier_btf_ctx_access tests. [0]: https://lore.kernel.org/bpf/51338.1732985814@localhost Fixes: 9e15db66136a ("bpf: Implement accurate raw_tp context access via BTF") Reported-by: Robert Morris <rtm@mit.edu> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20241212092050.3204165-2-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
Eduard Zingerman
|
ac6542ad92 |
bpf: fix null dereference when computing changes_pkt_data of prog w/o subprogs
bpf_prog_aux->func field might be NULL if program does not have subprograms except for main sub-program. The fixed commit does bpf_prog_aux->func access unconditionally, which might lead to null pointer dereference. The bug could be triggered by replacing the following BPF program: SEC("tc") int main_changes(struct __sk_buff *sk) { bpf_skb_pull_data(sk, 0); return 0; } With the following BPF program: SEC("freplace") long changes_pkt_data(struct __sk_buff *sk) { return bpf_skb_pull_data(sk, 0); } bpf_prog_aux instance itself represents the main sub-program, use this property to fix the bug. Fixes: 81f6d0530ba0 ("bpf: check changes_pkt_data property for extension programs") Reported-by: kernel test robot <lkp@intel.com> Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Closes: https://lore.kernel.org/r/202412111822.qGw6tOyB-lkp@intel.com/ Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20241212070711.427443-1-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
Linus Torvalds
|
1594c49394 |
Probes fixes for v6.13-rc1:
- eprobes: Fix to release eprobe when failed to add dyn_event. This unregisters event call and release eprobe when it fails to add a dynamic event. Found in cleaning up. -----BEGIN PGP SIGNATURE----- iQFPBAABCgA5FiEEh7BulGwFlgAOi5DV2/sHvwUrPxsFAmdYT3sbHG1hc2FtaS5o aXJhbWF0c3VAZ21haWwuY29tAAoJENv7B78FKz8b5X8IALRigb6oDLzrq8yavSPy xn1QlnRtRFdLz+PQ3kFCzU3TOT9oxdFhBkYAXS32vDItPqzM7Upj0oZceqhmd5kz aXSdkL+PFmbHuLzyPuBksyX4gKga06rQBHJ2SIPxnRPZcXBBRStqyWRDpNjwIxrW K8p6k0Agrtd4tL7QtBdukda0uJqKSjN3gOzRAu40KMBjBJZ3kMTsoc+GWGIoIMHb PIDaXTZT0DlZ9ZxiEA/gPcjMugNjDVhkbq2ChPU+asvlRs0YUANT4CF0HcntJvDO W0xIWivfYIKWFLdAn6fhXicPkqU9DQ7FjppyRKC6y4bwuCYJlSeLsPmSWNI2IEBX bFA= =LLWX -----END PGP SIGNATURE----- Merge tag 'probes-fixes-v6.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull eprobes fix from Masami Hiramatsu: - release eprobe when failing to add dyn_event. This unregisters event call and release eprobe when it fails to add a dynamic event. Found in cleaning up. * tag 'probes-fixes-v6.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: tracing/eprobe: Fix to release eprobe when failed to add dyn_event |