mirror of
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git
synced 2025-01-01 02:36:02 +00:00
e88ed227f6
Going a step further, we propose a way to use any user-space workload as the task waiting for the timerlat timer. This is done via a per-CPU file named osnoise/per_cpu/cpu$id/timerlat_fd.

The timerlat_fd allows only one task to open it at a time. When a task reads the file, the timerlat timer is armed to fire osnoise/timerlat_period_us in the future. When the timer fires, it prints the IRQ latency and wakes up the user-space thread waiting in the timerlat_fd.

The thread then starts to run, executes the timerlat measurement, prints the thread scheduling latency and returns to user-space.

When the thread rereads the timerlat_fd, the tracer will print the user-ret(urn) latency, which is an additional metric. This additional metric is also traced by the tracer and can be used, for example, to measure the context switch overhead from kernel-to-user and user-to-kernel, or the response time for an arbitrary execution in user-space.

The tracer supports one thread per CPU, the thread must be pinned to the CPU, and it cannot migrate while holding the timerlat_fd. The reason is that the tracer is per CPU (nothing prohibits the tracer from allowing migrations in the future). The tracer monitors the migration of the thread and disables the tracer if a migration is detected.

The timerlat_fd is only available for opening/reading when the timerlat tracer is enabled and NO_OSNOISE_WORKLOAD is set.
The simplest way to activate this feature from user-space is:

 -------------------------------- %< -----------------------------------
 int main(void)
 {
	char buffer[1024];
	int timerlat_fd;
	int retval;
	long cpu = 0;	/* place in CPU 0 */
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(cpu, &set);

	if (sched_setaffinity(gettid(), sizeof(set), &set) == -1)
		return 1;

	snprintf(buffer, sizeof(buffer),
		"/sys/kernel/tracing/osnoise/per_cpu/cpu%ld/timerlat_fd",
		cpu);

	timerlat_fd = open(buffer, O_RDONLY);
	if (timerlat_fd < 0) {
		printf("error opening %s: %s\n", buffer, strerror(errno));
		exit(1);
	}

	for (;;) {
		retval = read(timerlat_fd, buffer, 1024);
		if (retval < 0)
			break;
	}

	close(timerlat_fd);
	exit(0);
 }
 -------------------------------- >% -----------------------------------

When disabling timerlat, if there is a workload holding the timerlat_fd, the SIGKILL will be sent to the thread.

Link: https://lkml.kernel.org/r/69fe66a863d2792ff4c3a149bf9e32e26468bb3a.1686063934.git.bristot@kernel.org

Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: William White <chwhite@redhat.com>
Cc: Daniel Bristot de Oliveira <bristot@kernel.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
261 lines
11 KiB
ReStructuredText
###############
Timerlat tracer
###############

The timerlat tracer aims to help preemptive kernel developers find
sources of wakeup latencies of real-time threads. Like cyclictest,
the tracer sets a periodic timer that wakes up a thread. The thread then
computes a *wakeup latency* value as the difference between the *current
time* and the *absolute time* that the timer was set to expire. The main
goal of timerlat is to trace in a way that helps kernel developers.

Usage
-----

Write the ASCII text "timerlat" into the current_tracer file of the
tracing system (generally mounted at /sys/kernel/tracing).

For example::

        [root@f32 ~]# cd /sys/kernel/tracing/
        [root@f32 tracing]# echo timerlat > current_tracer

It is possible to follow the trace by reading the trace file::

  [root@f32 tracing]# cat trace
  # tracer: timerlat
  #
  #                              _-----=> irqs-off
  #                             / _----=> need-resched
  #                            | / _---=> hardirq/softirq
  #                            || / _--=> preempt-depth
  #                            || /
  #                            ||||             ACTIVATION
  #         TASK-PID      CPU# ||||   TIMESTAMP    ID            CONTEXT                LATENCY
  #            | |         |   ||||      |         |                  |                       |
          <idle>-0       [000] d.h1    54.029328: #1     context    irq timer_latency       932 ns
           <...>-867     [000] ....    54.029339: #1     context thread timer_latency     11700 ns
          <idle>-0       [001] dNh1    54.029346: #1     context    irq timer_latency      2833 ns
           <...>-868     [001] ....    54.029353: #1     context thread timer_latency      9820 ns
          <idle>-0       [000] d.h1    54.030328: #2     context    irq timer_latency       769 ns
           <...>-867     [000] ....    54.030330: #2     context thread timer_latency      3070 ns
          <idle>-0       [001] d.h1    54.030344: #2     context    irq timer_latency       935 ns
           <...>-868     [001] ....    54.030347: #2     context thread timer_latency      4351 ns


The tracer creates a per-cpu kernel thread with real-time priority that
prints two lines at every activation. The first is the *timer latency*
observed at the *hardirq* context before the activation of the thread.
The second is the *timer latency* observed by the thread. The ACTIVATION
ID field serves to relate the *irq* execution to its respective *thread*
execution.

The *irq*/*thread* split is important to clarify which context an
unexpectedly high value is coming from. The *irq* context can be
delayed by hardware-related actions, such as SMIs, NMIs, and IRQs,
or by threads masking interrupts. Once the timer fires, the delay
can also be influenced by blocking caused by threads, for example, by
postponing the scheduler execution via preempt_disable(), by the
scheduler execution itself, or by masking interrupts. Threads can also
be delayed by interference from other threads and IRQs.

Tracer options
---------------------

The timerlat tracer is built on top of the osnoise tracer, so its
configuration is also done in the osnoise/ config directory. The
timerlat configs are:

 - cpus: CPUs at which a timerlat thread will execute.
 - timerlat_period_us: the period of the timerlat thread.
 - stop_tracing_us: stop the system tracing if a
   timer latency at the *irq* context higher than the configured
   value happens. Writing 0 disables this option.
 - stop_tracing_total_us: stop the system tracing if a
   timer latency at the *thread* context higher than the configured
   value happens. Writing 0 disables this option.
 - print_stack: save the stack of the IRQ occurrence. The stack is printed
   after the *thread context* event, or at the IRQ handler if *stop_tracing_us*
   is hit.

timerlat and osnoise
----------------------------

The timerlat tracer can also take advantage of the osnoise: trace events.
For example::

        [root@f32 ~]# cd /sys/kernel/tracing/
        [root@f32 tracing]# echo timerlat > current_tracer
        [root@f32 tracing]# echo 1 > events/osnoise/enable
        [root@f32 tracing]# echo 25 > osnoise/stop_tracing_total_us
        [root@f32 tracing]# tail -10 trace
             cc1-87882   [005] d..h...   548.771078: #402268 context    irq timer_latency     13585 ns
             cc1-87882   [005] dNLh1..   548.771082: irq_noise: local_timer:236 start 548.771077442 duration 7597 ns
             cc1-87882   [005] dNLh2..   548.771099: irq_noise: qxl:21 start 548.771085017 duration 7139 ns
             cc1-87882   [005] d...3..   548.771102: thread_noise:      cc1:87882 start 548.771078243 duration 9909 ns
      timerlat/5-1035    [005] .......   548.771104: #402268 context thread timer_latency     39960 ns

In this case, the timer latency does not point to a single root cause
but to multiple ones. Firstly, the timer IRQ was delayed
for 13 us, which may point to a long IRQ disabled section (see IRQ
stacktrace section). Then the timer interrupt that wakes up the timerlat
thread took 7597 ns, and the qxl:21 device IRQ took 7139 ns. Finally,
the cc1 thread noise took 9909 ns of time before the context switch.
Such pieces of evidence are useful for the developer to use other
tracing methods to figure out how to debug and optimize the system.

It is worth mentioning that the *duration* values reported
by the osnoise: events are *net* values. For example, the
thread_noise does not include the duration of the overhead caused
by the IRQ execution (which accounted for 7597 ns + 7139 ns =
14736 ns in the trace above). But
the values reported by the timerlat tracer (timer_latency)
are *gross* values.

The art below illustrates a CPU timeline and how the timerlat tracer
observes it at the top and the osnoise: events at the bottom. Each "-"
in the timelines means circa 1 us, and the time moves ==>::

      External     timer irq                   thread
       clock        latency                    latency
       event        13585 ns                   39960 ns
         |             ^                         ^
         v             |                         |
         |-------------|                         |
         |-------------+-------------------------|
                       ^                         ^
  ========================================================================
                    [tmr irq]  [dev irq]
  [another thread...^       v..^       v.......][timerlat/ thread]  <-- CPU timeline
  =========================================================================
                    |-------|  |-------|
                            |--^       v-------|
                            |          |       |
                            |          |       + thread_noise: 9909 ns
                            |          +-> irq_noise: 7139 ns
                            +-> irq_noise: 7597 ns

IRQ stacktrace
---------------------------

The osnoise/print_stack option is helpful for the cases in which thread
noise is the major contributor to the timer latency because preemption
or IRQs are disabled. For example::

        [root@f32 tracing]# echo 500 > osnoise/stop_tracing_total_us
        [root@f32 tracing]# echo 500 > osnoise/print_stack
        [root@f32 tracing]# echo timerlat > current_tracer
        [root@f32 tracing]# tail -21 per_cpu/cpu7/trace
          insmod-1026    [007] dN.h1..   200.201948: irq_noise: local_timer:236 start 200.201939376 duration 7872 ns
          insmod-1026    [007] d..h1..   200.202587: #29800 context    irq timer_latency      1616 ns
          insmod-1026    [007] dN.h2..   200.202598: irq_noise: local_timer:236 start 200.202586162 duration 11855 ns
          insmod-1026    [007] dN.h3..   200.202947: irq_noise: local_timer:236 start 200.202939174 duration 7318 ns
          insmod-1026    [007] d...3..   200.203444: thread_noise:   insmod:1026 start 200.202586933 duration 838681 ns
      timerlat/7-1001    [007] .......   200.203445: #29800 context thread timer_latency    859978 ns
      timerlat/7-1001    [007] ....1..   200.203446: <stack trace>
  => timerlat_irq
  => __hrtimer_run_queues
  => hrtimer_interrupt
  => __sysvec_apic_timer_interrupt
  => asm_call_irq_on_stack
  => sysvec_apic_timer_interrupt
  => asm_sysvec_apic_timer_interrupt
  => delay_tsc
  => dummy_load_1ms_pd_init
  => do_one_initcall
  => do_init_module
  => __do_sys_finit_module
  => do_syscall_64
  => entry_SYSCALL_64_after_hwframe

In this case, it is possible to see that the thread added the highest
contribution to the *timer latency* and the stack trace, saved during
the timerlat IRQ handler, points to a function named
dummy_load_1ms_pd_init, which had the following code (on purpose)::

        static int __init dummy_load_1ms_pd_init(void)
        {
                preempt_disable();
                mdelay(1);
                preempt_enable();
                return 0;
        }

User-space interface
---------------------------

Timerlat allows user-space threads to use the timerlat infrastructure to
measure scheduling latency. This interface is accessible via a per-CPU
file descriptor inside $tracing_dir/osnoise/per_cpu/cpu$ID/timerlat_fd.

This interface is accessible under the following conditions:

 - The timerlat tracer is enabled
 - The osnoise workload option is set to NO_OSNOISE_WORKLOAD
 - The user-space thread is affined to a single processor
 - The thread opens the file associated with its single processor
 - Only one thread can access the file at a time

The open() syscall will fail if any of these conditions are not met.
After opening the file descriptor, the user-space thread can read from it.

The read() system call will run the timerlat code that arms the
timer in the future and waits for it as the regular kernel thread does.

When the timer IRQ fires, the timerlat IRQ will execute, report the
IRQ latency and wake up the thread waiting in the read. The thread will be
scheduled and report the thread latency via the tracer, as the kernel
thread does.

The difference from the in-kernel timerlat is that, instead of re-arming
the timer, timerlat will return from the read() system call. At this
point, the user can run any code.

If the application rereads the timerlat file descriptor, the tracer
will report the return from user-space latency, which is the total
latency. If this is the end of the work, it can be interpreted as the
response time for the request.

After reporting the total latency, timerlat will restart the cycle, arm
a timer, and go to sleep for the following activation.

If at any time one of the conditions is broken, e.g., the thread migrates
while in user space, or the timerlat tracer is disabled, the SIGKILL
signal will be sent to the user-space thread.

Here is a basic example of user-space code for timerlat::

        #define _GNU_SOURCE     /* cpu_set_t, CPU_* macros, gettid() */
        #include <errno.h>
        #include <fcntl.h>
        #include <sched.h>
        #include <stdio.h>
        #include <stdlib.h>
        #include <string.h>
        #include <unistd.h>

        int main(void)
        {
                char buffer[1024];
                int timerlat_fd;
                int retval;
                long cpu = 0;   /* place in CPU 0 */
                cpu_set_t set;

                CPU_ZERO(&set);
                CPU_SET(cpu, &set);

                /* The thread must be pinned before opening timerlat_fd. */
                if (sched_setaffinity(gettid(), sizeof(set), &set) == -1)
                        return 1;

                snprintf(buffer, sizeof(buffer),
                        "/sys/kernel/tracing/osnoise/per_cpu/cpu%ld/timerlat_fd",
                        cpu);

                timerlat_fd = open(buffer, O_RDONLY);
                if (timerlat_fd < 0) {
                        printf("error opening %s: %s\n", buffer, strerror(errno));
                        exit(1);
                }

                /* Each read() arms the timer and waits for the activation. */
                for (;;) {
                        retval = read(timerlat_fd, buffer, 1024);
                        if (retval < 0)
                                break;
                }

                close(timerlat_fd);
                exit(0);
        }
