Performance Analysis Tools for Linux Kernel
- 3. 3
perf
• A collection of performance analysis tools
• Sampling and profiling the system
• Providing some tracing mechanisms
• Included in the linux kernel (tools/perf)
- 4. 4
Before using perf...
• Narrow down the scope with other tools
– dmesg/syslog
– top
– iostat/vmstat
– iotop
– pidstat
– strace
– latencytop
- 6. 6
Events
# perf list
List of pre-defined events (to be used in -e):
branch-instructions OR branches [Hardware event]
branch-misses [Hardware event]
bus-cycles [Hardware event]
...
alignment-faults [Software event]
bpf-output [Software event]
context-switches OR cs [Software event]
...
block:block_bio_backmerge [Tracepoint event]
block:block_bio_bounce [Tracepoint event]
block:block_bio_complete [Tracepoint event]
...
- 9. 9
Hardware Events
• Hardware events are CPU-specific.
Check the CPU manual for more details.
• The common events:
– branch-instructions
– bus-cycles
– cache-misses
– cache-references
– cpu-cycles
– instructions
- 10. 10
Multiplexing and Scaling Events
• The hardware resources are limited.
• When there are more enabled events than the available
counters, the events are managed in round-robin
fashion.
• The event count may be scaled.
final_count = raw_count * time_enabled / time_running
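As a rough illustration of the scaling formula, the sketch below applies it in shell. The helper name scale_count and the sample numbers are made up for this example; perf applies the same kind of scaling itself when it reports a multiplexed count.

```shell
#!/bin/sh
# scale_count RAW TIME_ENABLED TIME_RUNNING
# Estimates the full count of a multiplexed event:
#   final_count = raw_count * time_enabled / time_running
# (hypothetical helper for illustration, not part of perf)
scale_count() {
    awk -v raw="$1" -v enabled="$2" -v running="$3" 'BEGIN {
        if (running == 0) {          # the event never got a counter
            print "not counted"
            exit
        }
        printf "%.0f\n", raw * enabled / running
    }'
}

# An event that sat on a counter for half of the enabled time:
scale_count 1000000 10000 5000    # prints 2000000
```

The shorter time_running is relative to time_enabled, the more the reported value is an estimate rather than a measurement.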
- 12. 12
Software Events
• Special “software” counters provided by the kernel,
even if the hardware does not support performance
counters.
• The software events:
– cpu-clock
– task-clock
– context-switches
– page-faults, major-faults, and minor-faults
– alignment-faults
• The available events are defined in “perf_sw_ids” in
include/uapi/linux/perf_event.h.
- 13. 13
Page Fault
• Major Fault
The page is not in physical memory when the fault occurs and must be
read from disk, so additional disk latency is expected.
• Minor Fault
The page is already in physical memory but is not yet mapped in the MMU
when the fault occurs. No disk latency occurs.
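To see the two kinds of fault for a running process without perf, the kernel also exposes per-process counters in /proc/<pid>/stat. The sketch below is only an illustration: fault_counts is a hypothetical helper, and it assumes the field layout described in proc(5), where minflt is field 10 and majflt is field 12.

```shell
#!/bin/sh
# fault_counts: read one /proc/<pid>/stat line on stdin and print the
# minor and major fault counters (fields 10 and 12 in proc(5)).
# Hypothetical helper for illustration only.
fault_counts() {
    # The command name (field 2) may contain spaces and parentheses, so
    # strip everything up to the last ") " first. Field 3 of the
    # original line then becomes $1 in awk, field 10 becomes $8, and
    # field 12 becomes $10.
    sed 's/^.*) //' | awk '{ print "minor=" $8, "major=" $10 }'
}

# Show the counters of the current shell, if /proc is available.
if [ -r "/proc/$$/stat" ]; then
    fault_counts < "/proc/$$/stat"
fi
```

A process whose major count keeps growing is waiting on disk reads; a high minor count alone costs CPU time but no disk latency.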
- 15. 15
Tracepoint Events
• In-kernel static trace events
• Tracepoint events depend on the kernel and loaded
modules.
• The tracepoint event list:
# perf list tracepoint
# less /sys/kernel/debug/tracing/available_events
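The tracepoint list can be long, so it may help to summarize it by subsystem first. The following is a minimal sketch; count_by_subsystem is a hypothetical helper that assumes one "subsystem:event" name per input line, as in the tracing available_events file.

```shell
#!/bin/sh
# count_by_subsystem: read "subsystem:event" lines on stdin and print
# the number of tracepoints per subsystem, largest first.
# Hypothetical helper for illustration only.
count_by_subsystem() {
    awk -F: 'NF >= 2 { n[$1]++ }
             END { for (s in n) print n[s], s }' |
        sort -k1,1nr -k2,2
}

# Reading the tracing directory usually requires root.
events=/sys/kernel/debug/tracing/available_events
if [ -r "$events" ]; then
    count_by_subsystem < "$events" | head -n 5
fi
```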
- 16. 16
Tracepoint Events – Example
static long do_wait(struct wait_opts *wo)
{
struct task_struct *tsk;
int retval;
trace_sched_process_wait(wo->wo_pid);
init_waitqueue_func_entry(&wo->child_wait,
child_wait_callback);
wo->child_wait.private = current;
add_wait_queue(&current->signal->wait_chldexit,
&wo->child_wait);
repeat:
...
- 18. 18
perf probe
• Define new dynamic tracepoint events
• kprobe
The kernel symbols are listed in /proc/kallsyms.
• uprobe
The symbols can be read with “readelf”.
• Define a new kprobe event
# perf probe <kprobe event>
• Delete a kprobe event
# perf probe --del=<kprobe event>
• Use the kprobe event
# perf trace --event probe:<kprobe event>
# perf record -e probe:<kprobe event>
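Before defining a kprobe it can be useful to check that the function name appears among the kernel symbols. The sketch below assumes the usual /proc/kallsyms line format of "address type symbol"; has_kernel_symbol is a hypothetical helper, and a symbol being listed does not guarantee that perf probe will accept it.

```shell
#!/bin/sh
# has_kernel_symbol NAME [FILE]
# Succeeds if NAME appears as a symbol (third column) in FILE, which
# defaults to /proc/kallsyms. Hypothetical helper for illustration.
has_kernel_symbol() {
    sym=$1
    file=${2:-/proc/kallsyms}
    awk -v s="$sym" '$3 == s { found = 1; exit }
                     END { exit !found }' "$file"
}

# do_wait is the function shown in the tracepoint example.
if has_kernel_symbol do_wait 2>/dev/null; then
    echo "do_wait is listed; candidate for: perf probe do_wait"
fi
```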
- 19. 19
perf probe (cont’d)
• The following commands need kernel debuginfo and
debugsource.
• Show the arguments of the probed function
# perf probe -V <func>
• Show the source code of the probed function
# perf probe -L <func>
• Probe line 12 of a function (counted from the start of the function)
# perf probe <func>:12
• Probe a member of a struct in the arguments (ifindex in
struct net of icmp_out_count())
# perf probe 'icmp_out_count net->ifindex'
• Do a verbose dry run
# perf probe -nv <probe>
NOTE: You may encounter build-id mismatching. See bsc#964063.
- 24. 24
perf top – Tip
Press ‘z’ to zero all counts for a few seconds to
avoid recording activity from “perf top” itself.
- 28. 28
perf record – Examples
Sample on-CPU functions for the specified PID, until Ctrl-C:
# perf record -p PID
Sample CPU stack traces for the PID, using DWARF (debug info) to
unwind stacks, for 10 seconds:
# perf record -p PID --call-graph dwarf sleep 10
Sample CPU stack traces for the entire system, for 10 seconds:
# perf record -ag -- sleep 10
Trace all block completions larger than 100 Kbytes (more than 200
sectors of 512 bytes):
# perf record -e block:block_rq_complete \
    --filter 'nr_sector > 200'
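The value 200 in the filter comes from converting the size threshold into 512-byte sectors. A small sketch of that arithmetic follows, assuming nr_sector is counted in 512-byte units; kb_to_sectors is a hypothetical helper.

```shell
#!/bin/sh
# kb_to_sectors KBYTES: convert a size in Kbytes to 512-byte sectors.
# Hypothetical helper for illustration only.
kb_to_sectors() {
    echo $(( $1 * 1024 / 512 ))
}

threshold=$(kb_to_sectors 100)
# Build the filter expression used with block:block_rq_complete.
echo "--filter 'nr_sector > $threshold'"    # prints --filter 'nr_sector > 200'
```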
- 32. 32
Process PID CPU Timestamp CPU Cycles
audioPipe:src 31884 [000] 620396.493554: 141988 cycles:
3ed40 [unknown] (/usr/lib64/gstreamer-1.0/libgstcoreeleme
8f [unknown] ([unknown])
55b5f048f100 [unknown] ([unknown])
0 [unknown] ([unknown])
offlineimap 32210 [000] 620396.507109: 156215 cycles:
2529 [unknown] (/usr/lib64/python2.7/lib-dynload/time.so
c87200016b0004 [unknown] ([unknown])
alsa-sink-ALC32 3300 [002] 620396.512810: 419889 cycles:
39a17 [unknown] (/usr/lib64/pulseaudio/libpulsecommon-10.
55ba449bb190 [unknown] ([unknown])
threaded-ml 31876 [000] 620396.512898: 156215 cycles:
7fff8173d763 __sched_text_start ([kernel.kallsyms])
7fff8173e04d schedule ([kernel.kallsyms])
7fff81741f9b schedule_hrtimeout_range_clock ([kernel.kallsyms])
7fff8124bae1 poll_schedule_timeout ([kernel.kallsyms])
7fff8124cf2a do_sys_poll ([kernel.kallsyms])
e418d [unknown] (/lib64/libc-2.25.so)
6 [unknown] ([unknown])
(The indented lines below each sample are its backtrace.)
- 36. 36
perf stat – Output
# perf stat -a sleep 10
Performance counter stats for 'system wide':
20032.666323 task-clock (msec) # 2.000 CPUs utilized (100.00%)
41074 context-switches # 0.002 M/sec (100.00%)
2152 cpu-migrations # 0.107 K/sec (100.00%)
6252 page-faults # 0.312 K/sec
6838201211 cycles # 0.341 GHz (50.01%)
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
4157856056 instructions # 0.61 insns per cycle (74.99%)
871168471 branches # 43.487 M/sec (75.03%)
38497517 branch-misses # 4.42% of all branches (74.98%)
10.016014696 seconds time elapsed
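The first column is the event count. The percentage in parentheses is
the scaling: the share of time the event was actually being counted.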
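The derived values in the comment column can be reproduced from the raw counts in the output above. The sketch below recomputes two of them; ratio is a hypothetical helper, and the numbers are copied from that output.

```shell
#!/bin/sh
# ratio A B SCALE: print SCALE * A / B with two decimals.
# Hypothetical helper for illustration only.
ratio() {
    awk -v a="$1" -v b="$2" -v scale="$3" \
        'BEGIN { printf "%.2f\n", scale * a / b }'
}

# instructions / cycles = instructions per cycle
ratio 4157856056 6838201211 1      # prints 0.61
# branch-misses / branches, as a percentage
ratio 38497517 871168471 100       # prints 4.42
```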
- 37. 37
perf stat – Examples
CPU counter statistics for the entire system, for 10 seconds:
# perf stat -a sleep 10
Various CPU level 1 data cache statistics for the specified command:
# perf stat -e L1-dcache-loads,L1-dcache-load-misses,L1-dcache-stores command
Count scheduler events for the specified PID, until Ctrl-C:
# perf stat -e 'sched:*' -p PID
Measure “make” 10 times:
# perf stat -r 10 --sync --pre 'make clean' -- make
- 40. 40
Timestamp Duration Process PID Syscall
16412.718 ( 0.002 ms): InputThread/2933 epoll_wait(epfd: 35<anon_inode:[event
16412.723 ( 0.003 ms): InputThread/2933 read(fd: 40</dev/input/event1>, buf:
16412.727 ( 0.001 ms): InputThread/2933 read(fd: 40</dev/input/event1>, buf:
16412.733 ( 0.001 ms): InputThread/2933 read(fd: 40</dev/input/event1>, buf:
16412.735 ( 0.001 ms): InputThread/2933 read(fd: 40</dev/input/event1>, buf:
16412.745 ( 0.004 ms): InputThread/2933 write(fd: 31<pipe:[21362]>, buf: 0x7f
16394.158 (18.593 ms): X/2874 ... [continued]: epoll_wait()) = 1
16412.750 ( 0.004 ms): InputThread/2933 epoll_wait(epfd: 34<anon_inode:[event
16412.753 ( 0.002 ms): X/2874 read(fd: 30<pipe:[21362]>, buf: 0x7fffae2c9890,
16412.765 ( 0.005 ms): X/2874 writev(fd: 53, vec: 0x7fffae2c87d0, vlen: 1
16394.152 (18.621 ms): QXcbEventReade/3136 ... [continued]: poll()) = 1
16412.775 ( 0.003 ms): QXcbEventReade/3136 recvmsg(fd: 3<socket:[31188]>, msg
16412.783 ( 0.003 ms): QXcbEventReade/3136 write(fd: 5<anon_inode:[eventfd]>,
16396.068 (16.723 ms): kwin_x11/3131 ... [continued]: ppoll()) = 1
16412.787 ( 0.006 ms): QXcbEventReade/3136 poll(ufds: 0x7f90bc8e6bc8, nfds: 1
16412.793 ( 0.001 ms): kwin_x11/3131 read(fd: 5<anon_inode:[eventfd]>, buf: 0
16412.814 ( 0.004 ms): konsole/13778 poll(ufds: 0x55db78c35440, nfds: 16, tim
16412.820 ( 0.001 ms): konsole/13778 ioctl(fd: 28</dev/ptmx>, cmd: FIONREAD,
16412.822 ( 0.003 ms): konsole/13778 read(fd: 28</dev/ptmx>, buf: 0x55db78afd
- 41. 41
perf trace – Examples
Only display events that lasted longer than 0.2 ms for the specified
command (also allocate 64 MB for the perf buffer):
# perf trace -m 64M --duration 0.2 command
Trace all syscalls except write (quote the "!" so the shell does not
interpret it):
# perf trace -m 64M -e '!write'
Trace only block_rq_issue and block_rq_complete:
# perf trace --no-syscalls \
    --event block:block_rq_issue,block:block_rq_complete
- 42. 42
Choose Your perf command
• Real-time monitoring
perf top
• Offline profiling
perf record, perf report, perf script
• Event counting
perf stat
• System event tracing
perf trace
- 43. 43
And more...
• perf sched – scheduler properties
• perf kmem – kernel memory properties
• perf mem – profiling memory accesses
• perf lock – analyzing lock events
• perf kvm – tracing/measuring kvm guest os
• perf timechart – visualizing the system behavior
• perf bench – benchmark suites
...
- 48. 48
References
• Linux perf Examples
http://www.brendangregg.com/perf.html
• Linux kernel profiling with perf
https://perf.wiki.kernel.org/index.php/Tutorial
- 50. 50
Raw PMU
• Show the available PMUs
# showevtinfo | less
IDX : 23068702
PMU name : core (Intel Core)
Name : L2_LINES_IN
Desc : L2 cache misses
• Look up the raw event code from the event name
# evt2raw L2_LINES_IN
r537024
• Use the event
# perf stat -e r537024 -a sleep 10
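The raw code is not arbitrary. The sketch below splits it into fields, assuming the Intel x86 event-select register layout (event select in bits 0-7, unit mask in bits 8-15, USR in bit 16, OS in bit 17); other architectures encode raw events differently, and decode_raw is a hypothetical helper.

```shell
#!/bin/sh
# decode_raw rHEX: split a raw perf event code into its fields,
# assuming the Intel x86 event-select layout. Hypothetical helper
# for illustration only.
decode_raw() {
    code=$(( 0x${1#r} ))                   # drop the leading "r"
    printf 'event=0x%02x umask=0x%02x usr=%d os=%d\n' \
        $(( code & 0xff )) \
        $(( (code >> 8) & 0xff )) \
        $(( (code >> 16) & 1 )) \
        $(( (code >> 17) & 1 ))
}

decode_raw r537024    # prints event=0x24 umask=0x70 usr=1 os=1
```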
- 51. 51
Event Modifiers
u - user-space counting
k - kernel counting
h - hypervisor counting
I - non idle counting
G - guest counting (in KVM guests)
H - host counting (not in KVM guests)
p - precise level
P - use maximum detected precise level
S - read sample value (PERF_SAMPLE_READ)
D - pin the event to the PMU
- 52. 52
Event Modifiers – Examples
• Request 0 skid on cpu-cycles
# perf stat -e cpu-cycles:pp -a sleep 10
• Only count user cache misses
# perf record -e cache-misses:u -a sleep 10
- 53. 53
perf top – Options
-C, --cpu
Focuses on specific CPUs
-p, --pid, or -t, --tid
Focuses on specific process or thread IDs
-d, --delay
Number of seconds to delay between refreshes
-n
Show the number of samples
--dsos, --comms
Limits the information to particular DSO’s and commands
-s, --sort
Sorts the samples
-g
enable call-graph (stack chain/backtrace) recording
- 54. 54
perf record – Options
-D, --delay N
Wait N msecs before the profiling starts
-F N
Sample N times per second
-c N
Sample every N events
-s, --stat
Record per-thread event counts
-i, --no-inherit
Child tasks do not inherit counters
-a, --all-cpus
System-wide collection from all CPUs
- 55. 55
perf record – Options (cont’d)
-g
Enable call-graph recording
--call-graph <type>
Setup and enable call-graph recording.
“fp” (frame pointer) is the default for “-g”. Try “dwarf” if the
binary was built with “-fomit-frame-pointer”
-e, --event=<event>
Select the event
--filter=<filter>
Use the event filter
-m
Specify the buffer size
- 56. 56
perf stat – Options
--pre, --post
Pre and post measurement hooks
--sync
Call sync every iteration
-d, --detailed
Print more detailed statistics, can be specified up to 3 times
-r, --repeat=<n>
Repeat command and print average + stddev
-I msecs, --interval-print msecs
Print count deltas every N msecs
-e, --event=<event>
Select the event
-a, --all-cpus
System-wide collection from all CPUs
- 57. 57
perf trace – Options
-p, --pid, -t, --tid, or -u, --uid
Focuses on specific process, thread, or user IDs
-e, --expr
List of syscalls to show
--event=<event>
Trace other events
--no-syscalls
Don’t trace syscalls
-m
Specify the buffer size
--duration
Show only events that had a duration greater than N.M msecs
- 58. 58
perf trace – Options (cont’d)
-s, --summary
Show only a summary of syscalls
-S, --with-summary
Show all syscalls followed by a summary
-o, --output=
Output file name
-F=[all|min|maj]
Trace pagefaults
- 59. 59
perf trace vs strace
• perf trace is inspired by strace.
• perf trace uses tracepoints instead of ptrace.
perf trace is faster in general.
• perf trace outputs to a buffer first while strace dumps
the result directly.
• perf trace may lose events if the ring buffer fills, while
strace never loses events.
This can be mitigated by increasing the buffer size (-m).