Continuous Go Profiling & Observability
- 1. Continuous Go Profiling & Observability
Felix Geisendörfer
Staff Engineer at Datadog
- 2. Target Audience
■ Go developers and operators of Go applications
■ Interested in reducing costs and latency, or debugging problems such as memory leaks, infinite loops and performance regressions
■ Focus is on Go’s built-in tools, but we’ll also cover Linux perf and eBPF
- 3. Felix Geisendörfer
Staff Engineer at Datadog
■ Working on continuous Go profiling as a product
■ Previously spent 6.5 years at Apple (Factory Traceability)
■ Open Source Contributor (node.js, Go): github.com/felixge
- 5. What is profiling?
■ Anything that produces a weighted list of stack traces
■ Example: a CPU profiler that interrupts the process every 10ms of CPU time, captures a stack trace, and aggregates the counts

stack trace     count
main;foo        5
main;foo;bar    4
main;foobar     4
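As a minimal sketch of the built-in CPU profiler described above (the `busyWork` function and the `cpu.prof` file name are made up for illustration):

```go
package main

import (
	"fmt"
	"os"
	"runtime/pprof"
)

// busyWork is just something for the profiler to catch in its samples.
func busyWork() int {
	sum := 0
	for i := 0; i < 50_000_000; i++ {
		sum += i % 7
	}
	return sum
}

func main() {
	f, err := os.Create("cpu.prof")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	// While profiling is active, the process is interrupted every 10ms of
	// CPU time and the current stack trace is added to the profile.
	if err := pprof.StartCPUProfile(f); err != nil {
		panic(err)
	}
	defer pprof.StopCPUProfile()

	fmt.Println(busyWork())
}
```

The resulting file can then be explored with `go tool pprof -http=:6060 cpu.prof`.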
- 6. What is Continuous Profiling?
■ Profiling in production
■ Continuously upload profiles to a backend for later analysis
- 7. Why profile in production?
■ Data distributions have a big impact on performance
■ Production profiles can help mitigate and root cause incidents
■ Profiling is usually low overhead (1-10%)
- 8. About Go
■ Compiled language like C/C++/Rust
■ Should work well with industry standard observability tools … right?
- 10. Goroutines
■ Green threads scheduled onto OS threads by the Go runtime
■ Tightly integrated with Go’s network stack (epoll on Linux)
■ Tiny 2 KiB stacks that grow dynamically
■ Fast context switching (~170ns), 10x faster than Linux threads
see https://dtdg.co/3n6kBoC
■ Data sharing via mutexes and channels (CSP)
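A tiny example of the CSP-style data sharing mentioned above (`squareSum` is a hypothetical helper):

```go
package main

import "fmt"

// squareSum fans work out to one goroutine per input and collects the
// results over a channel (CSP-style data sharing).
func squareSum(nums []int) int {
	results := make(chan int)
	for _, n := range nums {
		go func(n int) {
			results <- n * n // each goroutine starts on a tiny ~2 KiB stack
		}(n)
	}
	sum := 0
	for range nums {
		sum += <-results
	}
	return sum
}

func main() {
	fmt.Println(squareSum([]int{1, 2, 3})) // 14
}
```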
- 11. The trouble with goroutines
uprobe:./example:main.Foo {
    @start[tid] = nsecs;
}

uretprobe:./example:main.Foo {
    @msecs = hist((nsecs - @start[tid]) / 1000000);
    delete(@start[tid]);
}

END {
    clear(@start);
}
- 12. uretprobes + dynamic stacks = 💣
$ sudo bpftrace -c ./example funclatency.bpf
Attaching 3 probes...
SIGILL: illegal instruction
PC=0x7fffffffe001 m=4 sigcode=128
instruction bytes: 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0
goroutine 1 [running]:
runtime: unknown pc 0x7fffffffe001
stack: frame={sp:0xc00006cf70, fp:0x0} stack=[0xc00006c000,0xc00006d000)
000000c00006ce70: 000000c000014010 0000000000000010
000000c00006ce80: 000000c000018000 000000000000004b
000000c00006ce90: 000000c00001a000 0000000000000013
see: runtime: ebpf uretprobe support #22008: https://dtdg.co/3s4vnfn
- 13. Thread IDs? Goroutine IDs!
uprobe:./example:main.Foo {
    @start[tid] = nsecs;
}

uretprobe:./example:main.Foo {
    @msecs = hist((nsecs - @start[tid]) / 1000000);
    delete(@start[tid]);
}

END {
    clear(@start);
}
- 14. Thread IDs? Goroutine IDs!
// Mirror of the Go runtime's g struct (field offsets vary between Go versions)
struct stack {
    uintptr_t lo;
    uintptr_t hi;
};

struct gobuf {
    uintptr_t sp;
    uintptr_t pc;
    uintptr_t g;
    uintptr_t ctxt;
    uintptr_t ret;
    uintptr_t lr;
    uintptr_t bp;
};

struct g {
    struct stack stack;
    uintptr_t stackguard0;
    uintptr_t stackguard1;
    uintptr_t _panic;
    uintptr_t _defer;
    uintptr_t m;
    struct gobuf sched;
    uintptr_t syscallsp;
    uintptr_t syscallpc;
    uintptr_t stktopsp;
    uintptr_t param;
    uint32_t atomicstatus;
    uint32_t stackLock;
    uint64_t goid;
};

uprobe:./example:runtime.execute {
    // sarg0 = first stack argument: the *g about to be scheduled
    @gids[tid] = ((struct g *)sarg0)->goid;
}
- 15. Go’s Calling Convention
■ Does not follow the System V AMD64 ABI 🙈
■ Arguments are passed on the stack rather than in registers (slowish)
■ Go 1.17 switched to a register-based calling convention, but it remains idiosyncratic (to support goroutine scalability, multiple return values, etc.)
■ ABI0 remains in use to support legacy assembly code
See Proposal: Register-based Go calling convention: https://dtdg.co/2VIPOSV
- 16. Calling C Code
■ Requires a separate stack for C call frames, which needs to be static
■ High complexity and some overhead (~60ns) to switch between stacks
see https://dtdg.co/2X1HvTq
- 17. Less odd: Stack Traces
■ Go pushes frame pointers onto the stack; there is no -fomit-frame-pointer
■ Go also generates DWARF unwind/symbol tables by default
■ This leads to good interoperability with tools such as Linux perf
■ The Go runtime itself uses idiosyncratic gopclntab unwinding and symbol tables (DWARF is strippable and $@!%^# Turing complete, so this is a good thing)
- 18. Duck Test: Go is an odd duck
Pay attention when using 3rd party tools in production
[Gopher illustration: Ashley Willis (CC BY-NC-SA 4.0)]
- 19. So why bother with Go?
■ Quirky runtime, pedestrian language, limited type system, but ...
■ What Go lacks as a language, it makes up for in tooling
■ Built-in documentation, testing, benchmarking, code formatting, tracing, profiling and more!
- 20. Built-in observability tools
■ Five different profilers: CPU, Heap, Mutex, Block, Goroutine
go test -cpuprofile cpu.prof -memprofile mem.prof -bench .
■ pprof visualization and analysis tool
go tool pprof -http=:6060 cpu.prof
- 23. Profilers measuring time
■ Three profilers that measure time:
● CPU
● Block
● Mutex
- 25. CPU Profiler: Labels
■ Annotate goroutines with arbitrary key/value pairs
■ Understand CPU consumption of individual requests, users, endpoints, etc.

labels := pprof.Labels("user_id", "123")
pprof.Do(ctx, labels, func(ctx context.Context) {
    // handle request
    go update(ctx) // child goroutine inherits labels
})
- 26. CPU Profiler: Implementation Details
■ Uses setitimer(2) to receive a SIGPROF signal for every 10ms of CPU time
■ The signal handler takes a stack trace and aggregates it into a profile
■ setitimer(2) has thread delivery bias and can’t keep up when utilizing more than 2.5 cores 🙄
■ Rhys Hiltner (Twitch) and I are working on an upstream patch to use timer_create(2)
See: runtime/pprof: Linux CPU profiles inaccurate beyond 250% CPU use #35057: https://dtdg.co/3CAeApm
- 27. Mutex & Block Profiler
■ Samples mutex wait (both profilers) and channel wait (block profiler) events
■ Why the overlap?
● The block profile captures Lock(), i.e. the goroutines being blocked
● The mutex profile captures Unlock(), i.e. the lock holders doing the blocking
■ The block profile used to be biased; a fix was contributed for Go 1.17.
see https://go-review.googlesource.com/c/go/+/299991
- 29. Allocation & Heap Profiler
func malloc(size):
    object = ... // alloc magic
    if poisson_sample(size):
        s = stacktrace()
        profile[s].allocs++
        profile[s].alloc_bytes += sizeof(object)
        track_profiled(object, s)
    return object

func sweep(object):
    // do gc stuff to free object
    if is_profiled(object):
        s = alloc_stacktrace(object)
        profile[s].frees++
        profile[s].free_bytes += sizeof(object)
    return object
- 30. Allocation & Heap Profiler
■ Allocations per stack trace
■ Memory remaining in use on the heap (allocs - frees)
■ Can identify the source of memory leaks, but not the references retaining the memory
- 31. Allocation & Heap Profiler
■ Can sometimes guide CPU optimizations better than the CPU profiler
- 32. Allocation & Heap Profiler
■ Second-Order Effects: Reducing allocations can make unrelated code faster (!)
■ 💡 Reduce allocations and the number of pointers on the heap
- 33. Goroutine Profiler
■ Briefly stops all goroutines and captures their stack traces (⚠ latency)
■ Useful for debugging goroutine leaks
■ The text output format also includes waiting times, useful for debugging “stuck programs” (block/mutex profiles don’t show an event until it has finished)
■ fgprof captures goroutine profiles at 100 Hz -> wallclock profile
https://github.com/felixge/fgprof
- 35. Linux perf
■ Frame pointers & DWARF tables lead to good interoperability
■ perf offers better accuracy (but the accuracy of the built-in profilers is decent enough)
■ Deals with dual Go and C stacks (no need for runtime.SetCgoTraceback())
■ Downsides: Linux only, security, permissions, lack of profiler labels
■ Example: perf record -F 99 -g ./myapp && perf report
- 36. eBPF (bpftrace)
■ Example: bpftrace -e 'profile:hz:99 { @[ustack()] = count(); }' -c ./myapp
■ Should require less context switching, since stacks are aggregated in the kernel
■ Otherwise similar caveats as Linux perf
- 38. ■ Go is a bit odd for a compiled language, but ...
■ Wide variety of profiling and observability tools can be used
■ Most should be safe for production (⚠ goroutine profiler, execution tracer,
uretprobes)
■ Continuous Profiling makes sure you always have the data at your fingertips
Recap