Brought to you by
Continuous Go Profiling &
Observability
Felix Geisendörfer
Staff Engineer at Datadog
■ Go developers and operators of Go applications
■ Interested in reducing costs and latency, or debugging problems such as
memory leaks, infinite loops and performance regressions
■ Focus is on Go’s built-in tools, but we’ll also cover Linux perf and eBPF
Target Audience
Felix Geisendörfer
Staff Engineer at Datadog
■ Working on continuous Go profiling as a product
■ Previously 6.5 years at Apple (Factory Traceability)
■ Open Source Contributor (node.js, Go): github.com/felixge
https://dtdg.co/p99-go-profiling
Slides
What is profiling?
■ Anything that produces a weighted list of stack traces
■ Example: a CPU profiler that interrupts the process every 10ms of CPU time,
captures a stack trace and aggregates the counts
stack trace count
main;foo 5
main;foo;bar 4
main;foobar 4
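The "weighted list of stack traces" idea can be sketched in a few lines of Go. The trace strings below mirror the table above and are illustrative only:

```go
package main

import "fmt"

// aggregate folds raw stack trace samples (one string per sample,
// frames joined by ";") into the weighted list a profiler would emit.
func aggregate(samples []string) map[string]int {
	counts := make(map[string]int)
	for _, s := range samples {
		counts[s]++
	}
	return counts
}

func main() {
	samples := []string{
		"main;foo", "main;foo", "main;foo", "main;foo", "main;foo",
		"main;foo;bar", "main;foo;bar", "main;foo;bar", "main;foo;bar",
		"main;foobar", "main;foobar", "main;foobar", "main;foobar",
	}
	for trace, n := range aggregate(samples) {
		fmt.Println(trace, n)
	}
}
```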
What is Continuous Profiling?
■ Profiling in production
■ Continuously upload profiles to a backend for later analysis
Why profile in production?
■ Data distributions have a big impact on performance
■ Production profiles can help mitigate and root cause incidents
■ Profiling is usually low overhead (1-10%)
About Go
■ Compiled language like C/C++/Rust
■ Should work well with industry standard observability tools … right?
Does Go pass the Duck Test?
Goroutines
■ Green threads scheduled onto OS threads by the Go runtime
■ Tightly integrated with Go’s network stack (epoll on Linux)
■ Tiny 2 KiB stacks that grow dynamically
■ Fast context switching (~170ns), 10x faster than Linux threads
see https://dtdg.co/3n6kBoC
■ Data sharing via mutexes and channels (CSP)
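The CSP point above in miniature: a toy example where goroutines share data by communicating over a channel rather than through a shared variable:

```go
package main

import "fmt"

// sum fans work out to two goroutines and collects the partial
// results over a channel: share memory by communicating.
func sum(nums []int) int {
	results := make(chan int)
	mid := len(nums) / 2
	for _, part := range [][]int{nums[:mid], nums[mid:]} {
		go func(part []int) {
			total := 0
			for _, n := range part {
				total += n
			}
			results <- total
		}(part)
	}
	return <-results + <-results
}

func main() {
	fmt.Println(sum([]int{1, 2, 3, 4})) // prints 10
}
```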
The trouble with goroutines
uprobe:./example:main.Foo {
@start[tid] = nsecs;
}
uretprobe:./example:main.Foo {
@msecs = hist((nsecs - @start[tid]) / 1000000);
delete(@start[tid]);
}
END {
clear(@start);
}
uretprobes + dynamic stacks = 💣
$ sudo bpftrace -c ./example funclatency.bpf
Attaching 3 probes...
SIGILL: illegal instruction
PC=0x7fffffffe001 m=4 sigcode=128
instruction bytes: 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0
0x0
goroutine 1 [running]:
runtime: unknown pc 0x7fffffffe001
stack: frame={sp:0xc00006cf70, fp:0x0} stack=[0xc00006c000,0xc00006d000)
000000c00006ce70: 000000c000014010 0000000000000010
000000c00006ce80: 000000c000018000 000000000000004b
000000c00006ce90: 000000c00001a000 0000000000000013
see: runtime: ebpf uretprobe support #22008: https://dtdg.co/3s4vnfn
Thread IDs? Goroutine IDs!
uprobe:./example:main.Foo {
@start[tid] = nsecs;
}
uretprobe:./example:main.Foo {
@msecs = hist((nsecs - @start[tid]) / 1000000);
delete(@start[tid]);
}
END {
clear(@start);
}
Thread IDs? Goroutine IDs!
struct stack {
uintptr_t lo;
uintptr_t hi;
};
struct gobuf {
uintptr_t sp;
uintptr_t pc;
uintptr_t g;
uintptr_t ctxt;
uintptr_t ret;
uintptr_t lr;
uintptr_t bp;
};
struct g {
struct stack stack;
uintptr_t stackguard0;
uintptr_t stackguard1;
uintptr_t _panic;
uintptr_t _defer;
uintptr_t m;
struct gobuf sched;
uintptr_t syscallsp;
uintptr_t syscallpc;
uintptr_t stktopsp;
uintptr_t param;
uint32_t atomicstatus;
uint32_t stackLock;
uint64_t goid;
};
uprobe:./example:runtime.execute {
@gids[tid] = ((struct g *)sarg0)->goid;
}
■ Does not follow System V AMD64 ABI 🙈
■ Arguments are passed on the stack rather than using registers (slowish)
■ Go 1.17 switched to a register calling convention, but still idiosyncratic (to
support goroutine scalability, multiple return arguments, etc.)
■ ABI0 remains in use to support legacy assembly code
Go’s Calling Convention
See Proposal: Register-based Go calling convention: https://dtdg.co/2VIPOSV
■ Requires a separate stack for C call frames, which needs to be static
■ High complexity and some overhead (~60ns) to switch between stacks
see https://dtdg.co/2X1HvTq
Calling C Code
■ Go pushes frame pointers onto the stack; there is no -fomit-frame-pointer
■ Go also generates DWARF unwind/symbol tables by default
■ Leads to good interoperability with tools such as Linux perf
■ Go runtime uses idiosyncratic gopclntab unwinding and symbol tables
(DWARF is strippable and $@!%^# Turing complete, so this is good)
Less odd: Stack Traces
Duck Test: Go is an odd duck
Pay attention when using 3rd party tools in production
Ashley Willis (CC BY-NC-SA 4.0)
■ Quirky runtime, pedestrian language, limited type system, but ...
■ What Go lacks as language, it makes up for in tooling
■ Built-in documentation, testing, benchmarking, code formatting, tracing,
profiling and more!
So why bother with Go?
■ Five different profilers: CPU, Heap, Mutex, Block, Goroutine
go test -cpuprofile cpu.prof -memprofile mem.prof -bench .
■ pprof visualization and analysis tool
go tool pprof -http=:6060 cpu.prof
Built-in observability tools
Built-in observability tools
■ Runtime execution tracer (⚠ overhead can be > 10%)
go test -trace trace.out -bench .
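The execution tracer can also be driven from code via `runtime/trace`; a minimal sketch that captures a trace viewable with `go tool trace`:

```go
package main

import (
	"bytes"
	"fmt"
	"runtime/trace"
)

// captureTrace runs fn under the execution tracer and returns the
// raw trace bytes.
func captureTrace(fn func()) ([]byte, error) {
	var buf bytes.Buffer
	if err := trace.Start(&buf); err != nil {
		return nil, err
	}
	fn()
	trace.Stop()
	return buf.Bytes(), nil
}

func main() {
	data, err := captureTrace(func() {
		done := make(chan struct{})
		go func() { close(done) }() // some goroutine activity to record
		<-done
	})
	if err != nil {
		panic(err)
	}
	fmt.Printf("trace: %d bytes\n", len(data))
}
```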
Built-in Profilers
■ Three profilers that measure time:
● CPU
● Block
● Mutex
Profilers measuring time
CPU Profiler
■ Annotate goroutines with arbitrary key/value pairs
■ Understand CPU consumption of individual requests, users, endpoints, etc.
CPU Profiler: Labels
labels := pprof.Labels("user_id", "123")
pprof.Do(ctx, labels, func(ctx context.Context) {
// handle request
go update(ctx) // child goroutine inherits labels
})
■ Uses setitimer(2) to receive SIGPROF signal for every 10ms of CPU time
■ Signal handler takes stack traces and aggregates them into a profile
■ setitimer(2) has thread delivery bias and can’t keep up when utilizing more
than 2.5 cores 🙄
■ Rhys Hiltner (Twitch) and I are working on an upstream patch to use
timer_create(2)
See: runtime/pprof: Linux CPU profiles inaccurate beyond 250% CPU use #35057: https://dtdg.co/3CAeApm
CPU Profiler: Implementation Details
■ Samples mutex wait (both profilers) and channel wait (block profiler only) events
■ Why the overlap?
● Block captures Lock(), i.e. the blocked mutexes
● Mutex captures Unlock(), i.e. the mutexes doing the blocking
■ Block profile used to be biased. Fix contributed for Go 1.17.
see https://go-review.googlesource.com/c/go/+/299991
Mutex & Block Profiler
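Both of these profilers are off by default; a minimal sketch of enabling them, generating a bit of contention, and snapshotting the results:

```go
package main

import (
	"bytes"
	"fmt"
	"runtime"
	"runtime/pprof"
	"sync"
	"time"
)

// profileSize snapshots the named runtime profile (pprof proto
// format) and returns its size in bytes.
func profileSize(name string) int {
	var buf bytes.Buffer
	if err := pprof.Lookup(name).WriteTo(&buf, 0); err != nil {
		panic(err)
	}
	return buf.Len()
}

func main() {
	// Both profilers are off by default; enable sampling first.
	runtime.SetBlockProfileRate(1)     // sample every blocking event
	runtime.SetMutexProfileFraction(1) // sample every contention event

	// Create a little contention for the profilers to record.
	var mu sync.Mutex
	mu.Lock()
	go func() {
		time.Sleep(10 * time.Millisecond)
		mu.Unlock()
	}()
	mu.Lock() // blocks until the goroutine above unlocks
	mu.Unlock()

	fmt.Println("block:", profileSize("block"), "mutex:", profileSize("mutex"))
}
```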
Recap: Profilers measuring time
Allocation & Heap Profiler
func malloc(size):
object = ... // alloc magic
if poisson_sample(size):
s = stacktrace()
profile[s].allocs++
profile[s].alloc_bytes += sizeof(object)
track_profiled(object, s)
return object
func sweep(object):
// do gc stuff to free object
if is_profiled(object)
s = alloc_stacktrace(object)
profile[s].frees++
profile[s].free_bytes += sizeof(object)
return object
■ Allocations per stack trace
■ Memory remaining in use on the heap (allocs - frees)
■ Can identify the source of memory leaks, but not the refs retaining things
Allocation & Heap Profiler
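Capturing a heap profile on demand is a one-liner via `pprof.Lookup`; a small sketch that keeps some allocations live so they show up as in-use memory:

```go
package main

import (
	"bytes"
	"fmt"
	"runtime"
	"runtime/pprof"
)

var sink [][]byte // keeps allocations reachable, i.e. "in use"

// heapProfile snapshots the heap profile in pprof proto format.
func heapProfile() []byte {
	runtime.GC() // flush recently freed objects into the profile
	var buf bytes.Buffer
	if err := pprof.Lookup("heap").WriteTo(&buf, 0); err != nil {
		panic(err)
	}
	return buf.Bytes()
}

func main() {
	for i := 0; i < 1000; i++ {
		sink = append(sink, make([]byte, 1024))
	}
	fmt.Printf("heap profile: %d bytes\n", len(heapProfile()))
}
```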
■ Can sometimes guide CPU optimizations better than CPU profiler
Allocation & Heap Profiler
made using tweetpik.com
■ Second-Order Effects: Reducing allocs can make unrelated code faster (!)
■ 💡 Reduce allocations and number of pointers on the heap
Allocation & Heap Profiler
made using tweetpik.com
■ Briefly stops all goroutines and captures their stack traces (⚠ Latency)
■ Useful for debugging goroutine leaks
■ Text output format also includes waiting times for debugging “stuck
programs” (block/mutex don’t show this until the blocking event has finished)
■ fgprof captures goroutine profiles at 100 Hz -> Wallclock Profile
https://github.com/felixge/fgprof
Goroutine Profiler
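A sketch of taking a goroutine profile in the text format mentioned above; `debug=1` gives aggregated counts, while `debug=2` adds per-goroutine state and wait durations:

```go
package main

import (
	"bytes"
	"fmt"
	"runtime/pprof"
)

// goroutineProfile returns the goroutine profile in its text format
// (debug=1); debug=2 would add per-goroutine state and wait times.
func goroutineProfile() string {
	var buf bytes.Buffer
	if err := pprof.Lookup("goroutine").WriteTo(&buf, 1); err != nil {
		panic(err)
	}
	return buf.String()
}

func main() {
	// Park a few goroutines so the profile has something to show.
	block := make(chan struct{})
	for i := 0; i < 3; i++ {
		go func() { <-block }()
	}
	fmt.Print(goroutineProfile())
	close(block)
}
```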
Bonus: Linux perf & eBPF
■ Frame pointers & DWARF tables lead to good interoperability
■ perf offers better accuracy (but accuracy of the built-in profilers is decent enough)
■ Deals with dual Go and C stacks (no need for runtime.SetCgoTraceback())
■ Downsides: Linux only, Security, Permissions, Lack of Profiler Labels
■ Example: perf record -F 99 -g ./myapp && perf report
Linux perf
■ Example: bpftrace -e 'profile:hz:99 { @[ustack()] = count(); }' -c ./myapp
■ Should require less context switching, stacks aggregated in kernel
■ Otherwise similar caveats as Linux perf
eBPF (bpftrace)
Recap
■ Go is a bit odd for a compiled language, but ...
■ Wide variety of profiling and observability tools can be used
■ Most should be safe for production (⚠ goroutine profiler, execution tracer,
uretprobes)
■ Continuous Profiling makes sure you always have the data at your fingertips
Recap
Check out
github.com/DataDog/go-profiler-notes
for more in-depth Go profiling research
Brought to you by
Felix Geisendörfer
p99@felixge.de
@felixge
