New Ways to Find Latency in Linux Using Tracing

1. Brought to you by New ways to find latency in Linux using tracing Steven Rostedt Open Source Engineer at VMware

3. What is ftrace? ● Official tracer of the Linux Kernel – Introduced in 2008 ● Came from the PREEMPT_RT patch set – Initially focused on finding causes of latency ● Has grown significantly since 2008 – Added new ways to find latency – Does much more than find latency

4. The ftrace interface (the tracefs file system)
# mount -t tracefs nodev /sys/kernel/tracing
# ls /sys/kernel/tracing
available_events            max_graph_depth         snapshot
available_filter_functions  options                 stack_max_size
available_tracers           osnoise                 stack_trace
buffer_percent              per_cpu                 stack_trace_filter
buffer_size_kb              printk_formats          synthetic_events
buffer_total_size_kb        README                  timestamp_mode
current_tracer              recursed_functions      trace
dynamic_events              saved_cmdlines          trace_clock
dyn_ftrace_total_info       saved_cmdlines_size     trace_marker
enabled_functions           saved_tgids             trace_marker_raw
error_log                   set_event               trace_options
eval_map                    set_event_notrace_pid   trace_pipe
events                      set_event_pid           trace_stat
free_buffer                 set_ftrace_filter       tracing_cpumask
function_profile_enabled    set_ftrace_notrace      tracing_max_latency
hwlat_detector              set_ftrace_notrace_pid  tracing_on
instances                   set_ftrace_pid          tracing_thresh
kprobe_events               set_graph_function      uprobe_events
kprobe_profile              set_graph_notrace       uprobe_profile
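Driving these files by hand comes down to reading and writing plain text. The following is a minimal sketch, not part of the talk: the function name set_tracer and the TRACEFS variable are made up for illustration, and only the available_tracers and current_tracer files from the listing above are used. Pointing TRACEFS at a scratch directory allows a dry run without root.

```shell
# TRACEFS defaults to the usual mount point; override it for a dry run.
# (TRACEFS and set_tracer are illustrative names, not part of ftrace.)
TRACEFS="${TRACEFS:-/sys/kernel/tracing}"

# Select a tracer by name, but only if the kernel lists it as available.
# $1: tracer name, for example wakeup_rt or preemptirqsoff
set_tracer() {
    if ! grep -qwF -- "$1" "$TRACEFS/available_tracers"; then
        echo "tracer '$1' is not listed in available_tracers" >&2
        return 1
    fi
    # Writing the name into current_tracer activates that tracer.
    echo "$1" > "$TRACEFS/current_tracer"
}
```

On a real system this would be run as root, for example `set_tracer wakeup_rt`, and the result read back from the trace file.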

5. Introducing trace-cmd Luckily today, we do not need to know all those files ● trace-cmd is a front end interface to ftrace ● It interfaces with the tracefs directory for you https://www.trace-cmd.org git clone git://git.kernel.org/pub/scm/utils/trace-cmd/trace-cmd.git

6. The Old Tracers ● Wake up tracers – All tasks – RT tasks – Deadline tasks ● Preemption off tracers – irqsoff tracer – preemptoff tracer – preemptirqsoff tracer

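7. Wake up tracer [Diagram: Task 1 is running when an interrupt triggers the wake up event for Task 2; the latency is the time from that wake up event to the sched switch event that puts Task 2 on the CPU]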

8. Running wakeup_rt tracer # trace-cmd start -p wakeup_rt # trace-cmd show # tracer: wakeup_rt # # wakeup_rt latency trace v1.1.5 on 5.10.52-test-rt47 # latency: 94 us, #211/211, CPU#4 | (M:preempt_rt VP:0, KP:0, SP:0 HP:0 #P:8) # ----------------- # | task: rcuc/4-42 (uid:0 nice:0 policy:1 rt_prio:1) # ----------------- # # _--------=> CPU# # / _-------=> irqs-off # | / _------=> need-resched # || / _-----=> need-resched-lazy # ||| / _----=> hardirq/softirq # |||| / _---=> preempt-depth # ||||| / _--=> preempt-lazy-depth # |||||| / _-=> migrate-disable # ||||||| / delay # cmd pid |||||||| time | caller # / |||||||| | / bash-1872 4dN.h5.. 1us+: 1872:120:R + [004] 42: 98:R rcuc/4 bash-1872 4dN.h5.. 30us : <stack trace> => __ftrace_trace_stack => probe_wakeup => ttwu_do_wakeup => try_to_wake_up

9. Running wakeup_rt tracer => invoke_rcu_core => rcu_sched_clock_irq => update_process_times => tick_sched_handle => tick_sched_timer => __hrtimer_run_queues => hrtimer_interrupt => __sysvec_apic_timer_interrupt => asm_call_irq_on_stack => sysvec_apic_timer_interrupt => asm_sysvec_apic_timer_interrupt => lock_acquire => _raw_spin_lock => shmem_get_inode => shmem_mknod => lookup_open.isra.0 => path_openat => do_filp_open => do_sys_openat2 => __x64_sys_openat => do_syscall_64 => entry_SYSCALL_64_after_hwframe bash-1872 4dN.h5.. 31us : 0 bash-1872 4dN.h4.. 32us : task_woken_rt <-ttwu_do_wakeup bash-1872 4dN..2.. 87us : put_prev_entity <-put_prev_task_fair bash-1872 4dN..2.. 87us : update_curr <-put_prev_entity bash-1872 4dN..2.. 87us : __update_load_avg_se <-update_load_avg bash-1872 4dN..2.. 87us : __update_load_avg_cfs_rq <-update_load_avg

10. Running wakeup_rt tracer bash-1872 4dN..2.. 87us : pick_next_task_stop <-__schedule bash-1872 4dN..2.. 87us : pick_next_task_dl <-__schedule bash-1872 4dN..2.. 87us : pick_next_task_rt <-__schedule bash-1872 4dN..2.. 88us : update_rt_rq_load_avg <-pick_next_task_rt bash-1872 4d...3.. 88us : __schedule bash-1872 4d...3.. 88us : 1872:120:R ==> [004] 42: 98:R rcuc/4 bash-1872 4d...3.. 94us : <stack trace> => __ftrace_trace_stack => probe_wakeup_sched_switch => __schedule => preempt_schedule_common => preempt_schedule_thunk => _raw_spin_unlock => shmem_get_inode => shmem_mknod => lookup_open.isra.0 => path_openat => do_filp_open => do_sys_openat2 => __x64_sys_openat => do_syscall_64 => entry_SYSCALL_64_after_hwframe
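The number to watch in a latency tracer report is the "latency:" value in the header (94 us in the wakeup_rt run above). A small sketch for pulling it out of the report follows. The function name max_latency_us is made up for illustration, and the pattern assumes the header layout shown on the slides.

```shell
# Print the latency, in microseconds, from a latency tracer report.
# Reads the report on stdin and assumes a header line of the form
#   "# latency: 94 us, #211/211, CPU#4 | ..."
# (max_latency_us is an illustrative name, not a trace-cmd feature.)
max_latency_us() {
    sed -n 's/^# latency: \([0-9][0-9]*\) us.*/\1/p' | head -n 1
}
```

Typical use would be `trace-cmd show | max_latency_us`, run as root after one of the latency tracers has been started.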

11. Interrupt off tracer [Diagram: a task is interrupted; the latency is the time between "irqs disabled" and "irqs enabled"]

12. Running preemptirqsoff tracer # trace-cmd start -p preemptirqsoff # trace-cmd show # tracer: preemptirqsoff # # preemptirqsoff latency trace v1.1.5 on 5.14.0-rc4-test+ # -------------------------------------------------------------------- # latency: 2325 us, #3005/3005, CPU#1 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:8) # ----------------- # | task: bash-48651 (uid:0 nice:0 policy:0 rt_prio:0) # ----------------- # => started at: _raw_spin_lock # => ended at: _raw_spin_unlock # # # _------=> CPU# # / _-----=> irqs-off # | / _----=> need-resched # || / _---=> hardirq/softirq # ||| / _--=> preempt-depth # |||| / delay # cmd pid ||||| time | caller # / ||||| | / trace-cm-48651 1...1 0us : _raw_spin_lock trace-cm-48651 1...1 1us : do_raw_spin_trylock <-_raw_spin_lock trace-cm-48651 1...1 1us : flush_tlb_batched_pending <-unmap_page_range trace-cm-48651 1...1 2us : vm_normal_page <-unmap_page_range trace-cm-48651 1...1 2us : page_remove_rmap <-unmap_page_range

13. Running preemptirqsoff tracer [...] trace-cm-48651 1...1 2313us : unlock_page_memcg <-unmap_page_range trace-cm-48651 1...1 2314us : __rcu_read_unlock <-unlock_page_memcg trace-cm-48651 1...1 2315us : __tlb_remove_page_size <-unmap_page_range trace-cm-48651 1...1 2316us : vm_normal_page <-unmap_page_range trace-cm-48651 1...1 2317us : page_remove_rmap <-unmap_page_range trace-cm-48651 1...1 2318us : lock_page_memcg <-page_remove_rmap trace-cm-48651 1...1 2318us : __rcu_read_lock <-lock_page_memcg trace-cm-48651 1...1 2319us : unlock_page_memcg <-unmap_page_range trace-cm-48651 1...1 2320us : __rcu_read_unlock <-unlock_page_memcg trace-cm-48651 1...1 2321us : __tlb_remove_page_size <-unmap_page_range trace-cm-48651 1...1 2322us : _raw_spin_unlock <-unmap_page_range trace-cm-48651 1...1 2323us : do_raw_spin_unlock <-_raw_spin_unlock trace-cm-48651 1...1 2324us : preempt_count_sub <-_raw_spin_unlock trace-cm-48651 1...1 2325us : _raw_spin_unlock trace-cm-48651 1...1 2327us+: tracer_preempt_on <-_raw_spin_unlock trace-cm-48651 1...1 2343us : <stack trace> => unmap_page_range => unmap_vmas => exit_mmap => mmput => begin_new_exec => load_elf_binary => bprm_execve => do_execveat_common => __x64_sys_execve => do_syscall_64 => entry_SYSCALL_64_after_hwframe

14. Running preemptirqsoff tracer # trace-cmd start -p preemptirqsoff -d -O sym-offset # trace-cmd show # tracer: preemptirqsoff [..] # latency: 248 us, #4/4, CPU#6 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:8) # ----------------- # | task: ksoftirqd/6-47 (uid:0 nice:0 policy:0 rt_prio:0) # ----------------- # => started at: run_ksoftirqd # => ended at: run_ksoftirqd [..] # cmd pid ||||| time | caller # / ||||| | / <idle>-0 4d..1 1us!: cpuidle_enter_state+0xcf/0x410 kworker/-48096 4...1 242us : schedule+0x4d/0xe0 <-schedule+0x4d/0xe0 kworker/-48096 4...1 243us : tracer_preempt_on+0xf4/0x110 <-schedule+0x4d/0xe0 kworker/-48096 4...1 248us : <stack trace> => worker_thread+0xd4/0x3c0 => kthread+0x155/0x180 => ret_from_fork+0x22/0x30
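In the latency format used above, a character directly after the timestamp (as in "1us!:" or "2327us+:") flags an entry that is followed by a noticeably long gap. A short sketch for listing only those entries follows. The function name slow_steps is made up for illustration, and the marker set + ! # * @ $ is the one described in the kernel's ftrace documentation.

```shell
# Keep only the trace lines whose timestamp carries a delay marker.
# Reads a latency-format trace on stdin.
# Unmarked lines look like "242us : ...", marked ones like "1us!: ...".
# (slow_steps is an illustrative name.)
slow_steps() {
    grep -E '[0-9]+us[+!#*@$]:'
}
```

Typical use would be `trace-cmd show | slow_steps`, which narrows a trace of several thousand entries down to the places where time was actually lost.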

15. Measuring Interrupt Latency [Diagram: a task is interrupted; the latency is the time spent in the interrupt]

16. Measuring latency from interrupts ● You can easily trace the latency from interrupts – For x86: # trace-cmd record -p function_graph -l handle_irq_event -l '*sysvec_*' -e irq_handler_entry -e 'irq_vectors:*entry*'

17. Tracing Latency from Interrupts # trace-cmd report -l --cpu 2 sleep-2513 [002] 2973.197184: funcgraph_entry: | __sysvec_apic_timer_interrupt() { sleep-2513 [002] 2973.197186: local_timer_entry: vector=236 sleep-2513 [002] 2973.197196: funcgraph_exit: + 12.233 us | } sleep-2513 [002] 2973.197206: funcgraph_entry: | __sysvec_irq_work() { sleep-2513 [002] 2973.197207: irq_work_entry: vector=246 sleep-2513 [002] 2973.197207: funcgraph_exit: 1.405 us | } <idle>-0 [002] 2973.198186: funcgraph_entry: | __sysvec_apic_timer_interrupt() { <idle>-0 [002] 2973.198187: local_timer_entry: vector=236 <idle>-0 [002] 2973.198193: funcgraph_exit: 7.992 us | } <idle>-0 [002] 2974.991721: funcgraph_entry: | handle_irq_event() { <idle>-0 [002] 2974.991723: irq_handler_entry: irq=24 name=ahci[0000:00:1f.2] <idle>-0 [002] 2974.991733: funcgraph_exit: + 13.158 us | } <idle>-0 [002] 2977.039843: funcgraph_entry: | handle_irq_event() { <idle>-0 [002] 2977.039845: irq_handler_entry: irq=24 name=ahci[0000:00:1f.2] <idle>-0 [002] 2977.039855: funcgraph_exit: + 12.869 us | }

18. Problems with the latency tracers ● No control over what tasks they trace – They trace the highest priority process – May not be the process you are interested in ● Not flexible – Has one use case – General for the entire system

19. Tracing interrupt latency with function graph tracer ● Does not have a “max latency” – You see all the latency in a trace – No way to record the max latency found
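Since the function graph tracer does not keep a maximum, one workaround is to post-process the report. The sketch below assumes the `trace-cmd report` layout shown on slide 17, where each funcgraph_exit line carries a duration followed by the unit "us". The function name max_irq_duration is made up for illustration.

```shell
# Print the largest duration (in microseconds) found on funcgraph_exit
# lines. Reads "trace-cmd report" output on stdin.
# (max_irq_duration is an illustrative name.)
max_irq_duration() {
    awk '/funcgraph_exit:/ {
        # The duration is the field right before the "us" unit; an
        # optional "+" or "!" marker may precede it as its own field.
        for (i = 2; i <= NF; i++)
            if ($i == "us" && $(i - 1) + 0 > max)
                max = $(i - 1) + 0
    }
    END { if (max != "") print max }'
}
```

Typical use would be `trace-cmd report -l --cpu 2 | max_irq_duration`. This only finds the maximum after the fact; it does not stop or snapshot the trace when a new maximum is hit.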

24. Introducing Synthetic Events ● Can map two events into a single event – sched_waking + sched_switch → wakeup_latency – irqs_disabled + irqs_enabled → irqs_off_latency – irq_enter_handler + irq_exit_handler → irq_latency ● Have all the functionality of a normal event – Can be filtered on – Can have triggers attached (like histograms)

26. Synthetic events
# echo 'wakeup_lat s32 pid; u64 delta;' > /sys/kernel/tracing/synthetic_events
# echo 'hist:keys=pid:__arg__1=pid,__arg__2=common_timestamp.usecs' > /sys/kernel/tracing/events/sched/sched_waking/trigger
# echo 'hist:keys=next_pid:pid=$__arg__1,delta=common_timestamp.usecs-$__arg__2:onmatch(sched.sched_waking).trace(wakeup_lat,$pid,$delta) if next_comm == "cyclictest"' > /sys/kernel/tracing/events/sched/sched_switch/trigger
Too Complex!
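The commands above have three parts: a definition of the synthetic event, a trigger on sched_waking that saves the pid and a timestamp, and a trigger on sched_switch that matches on the pid, computes the delta and fires the event. The sketch below only prints the two trigger strings so they can be inspected before being written to tracefs. The function name wakeup_lat_triggers and its parameter are made up for illustration; the strings themselves are the ones from the slide.

```shell
# Print the two hist triggers that feed the wakeup_lat synthetic event.
# $1: command name to filter on, for example cyclictest
# (wakeup_lat_triggers is an illustrative name; nothing is written to
# tracefs here.)
wakeup_lat_triggers() {
    comm="$1"
    # Line 1, for sched_waking: save the pid and the wake up timestamp.
    printf '%s\n' 'hist:keys=pid:__arg__1=pid,__arg__2=common_timestamp.usecs'
    # Line 2, for sched_switch: match on next_pid, compute the delta and
    # fire the synthetic event, but only for the requested command.
    printf '%s if next_comm == "%s"\n' \
        'hist:keys=next_pid:pid=$__arg__1,delta=common_timestamp.usecs-$__arg__2:onmatch(sched.sched_waking).trace(wakeup_lat,$pid,$delta)' \
        "$comm"
}
```

Running `wakeup_lat_triggers cyclictest` prints one line per trigger. Each line would then be written, as root, to the trigger file of the matching event under /sys/kernel/tracing/events/sched/.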

27. Introducing libtracefs A library to interact with the tracefs file system All functions have man pages The tracefs_sql man page has the sqlhist program in it https://git.kernel.org/pub/scm/libs/libtrace/libtracefs.git/ git clone git://git.kernel.org/pub/scm/libs/libtrace/libtracefs.git cd libtracefs make sqlhist # builds from the man page!

28. Synthetic events
# sqlhist -e -n wakeup_lat 'SELECT start.pid, (end.TIMESTAMP_USECS - start.TIMESTAMP_USECS) AS delta FROM sched_waking AS start JOIN sched_switch AS end ON start.pid = end.next_pid WHERE end.next_comm = "cyclictest"'

37. cyclictest A tool used by real-time developers to test wake up latency in the system https://git.kernel.org/pub/scm/utils/rt-tests/rt-tests.git/ git clone git://git.kernel.org/pub/scm/utils/rt-tests/rt-tests.git cd rt-tests make cyclictest

38. Customize latency tracing
# sqlhist -e -n wakeup_lat -T -m lat 'SELECT end.next_comm AS comm, (end.TIMESTAMP_USECS - start.TIMESTAMP_USECS) AS lat FROM sched_waking AS start JOIN sched_switch AS end ON start.pid = end.next_pid WHERE end.next_prio < 100 && end.next_comm == "cyclictest"'
# trace-cmd start -e all -e wakeup_lat -R stacktrace
# cyclictest -l 1000 -p80 -i250 -a -t -q -m -d 0 -b 1000 --tracemark
# trace-cmd show -s | tail -16
<idle>-0 [001] d..1 23454.902254: cpu_idle: state=0 cpu_id=1
<idle>-0 [002] d..2 23454.902254: sched_switch: prev_comm=swapper/2 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=cyclictest next_pid=12275 next_prio=19
<idle>-0 [002] d..4 23454.902256: wakeup_lat: next_comm=cyclictest lat=17
<idle>-0 [002] d..5 23454.902258: <stack trace>
=> trace_event_raw_event_synth
=> action_trace
=> event_hist_trigger
=> event_triggers_call
=> trace_event_buffer_commit
=> trace_event_raw_event_sched_switch
=> __traceiter_sched_switch
=> __schedule
=> schedule_idle
=> do_idle
=> cpu_startup_entry
=> secondary_startup_64_no_verify

50. Customize latency tracing
# trace-cmd extract -s
# trace-cmd report --cpu 2 | tail -30
<idle>-0 [002] 23454.902239: sched_wakeup: cyclictest:12275 [19] CPU:002
<idle>-0 [002] 23454.902241: hrtimer_expire_exit: hrtimer=0xffffbbd68286fe60
<idle>-0 [002] 23454.902241: hrtimer_cancel: hrtimer=0xffffbbd6826efe70
<idle>-0 [002] 23454.902242: hrtimer_expire_entry: hrtimer=0xffffbbd6826efe70
<idle>-0 [002] 23454.902243: sched_waking: comm=cyclictest pid=12272 prio=120 target_cpu=002
<idle>-0 [002] 23454.902244: prandom_u32: ret=1102749734
<idle>-0 [002] 23454.902246: sched_wakeup: cyclictest:12272 [120] CPU:002
<idle>-0 [002] 23454.902247: hrtimer_expire_exit: hrtimer=0xffffbbd6826efe70
<idle>-0 [002] 23454.902248: write_msr: 6e0, value 4866ce957272
<idle>-0 [002] 23454.902249: local_timer_exit: vector=236
<idle>-0 [002] 23454.902250: cpu_idle: state=4294967295 cpu_id=2
<idle>-0 [002] 23454.902251: rcu_utilization: Start context switch
<idle>-0 [002] 23454.902252: rcu_utilization: End context switch
<idle>-0 [002] 23454.902253: prandom_u32: ret=3692516021
<idle>-0 [002] 23454.902254: sched_switch: swapper/2:0 [120] R ==> cyclictest:12275 [19]
<idle>-0 [002] 23454.902256: wakeup_lat: next_comm=cyclictest lat=17
<idle>-0 [002] 23454.902258: kernel_stack: <stack trace >
=> trace_event_raw_event_synth (ffffffff8121a0db)
=> action_trace (ffffffff8121e9fb)
=> event_hist_trigger (ffffffff8121ca8d)
=> event_triggers_call (ffffffff81216c72)
[..]
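Once the synthetic event shows up in the trace, its lat field can be pulled out like any other event field. The sketch below assumes the output format shown above ("wakeup_lat: next_comm=cyclictest lat=17"). The function name max_wakeup_lat is made up for illustration.

```shell
# Print the largest lat value among the wakeup_lat events on stdin.
# Works on "trace-cmd show" or "trace-cmd report" output; lines that
# are not wakeup_lat events are ignored.
# (max_wakeup_lat is an illustrative name.)
max_wakeup_lat() {
    sed -n 's/.*wakeup_lat: .*lat=\([0-9][0-9]*\).*/\1/p' | sort -n | tail -n 1
}
```

Typical use would be `trace-cmd report | max_wakeup_lat`. With the -m lat option used above the trace already reacts to each new maximum, so this is mainly a quick cross-check against the number cyclictest reports.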

57. Brought to you by Steven Rostedt rostedt@goodmis.org @srostedt
