When Node.js goes wrong:
Debugging Node in production
    Surge 2012
    David Pacheco (@dapsays)
    Joyent
The Rise of Node.js

    • We see Node.js as the confluence of three ideas:
      • JavaScript’s friendliness and rich support for asynchrony (i.e., closures)
      • High-performance JavaScript VMs (e.g., V8)
      • Time-tested system abstractions (i.e., Unix, in the form of streams)
    • Event-oriented model delivers consistent performance in the
      presence of long latency events (i.e. no artificial latency bubbles)
    • Node.js is displacing C for a lot of highly reliable, high performance
      core infrastructure software (at Joyent alone: DNS, DHCP,
      SNMP, LDAP, key value stores, public-facing web services, ...).
    • This has been great for rapid development, but historically has
      come with a cost in debuggability.
    • Debugging is decidedly not a new problem...

2
The Genesis of debugging



    “As soon as we started programming,
    we found to our surprise that it wasn't
    as easy to get programs right as we
    had thought. Debugging had to be
    discovered. I can remember the exact
    instant when I realized that a large
    part of my life from then on was going
    to be spent in finding mistakes in my
    own programs.”
            —Sir Maurice Wilkes, 1913 - 2010




3
Debugging Node: a runaway service

    • February, 2011: Joyent is preparing to launch no.de, a Node PaaS.
    • During testing, Cloud Analytics becomes intermittently
      unresponsive.
    • Quickly traced the problem to a rogue data aggregator using 100%
      of 1 CPU core and not responding over HTTP or AMQP.
    • How do you debug this?




4
Debugging a runaway service

    • Check the logs?




5
Debugging a runaway service

    • Check syscall activity (truss/strace)?




6
Debugging a runaway service

    • Check thread stacks:

            v8::internal::Runtime::SetObjectProperty+0x36d()
            v8::internal::Runtime_SetProperty+0x73()
            0xfe7601f6()
            0xfbff31d8()
            0xfc468f59()
            0xfe8e51cf()
            0xfe760841()
            0xfe8e3dc8()
            0xfe8e24a4()
               ...
            ev_run+0x406()
            uv_run+0x1c()
            node::Start+0xa9()
            main+0x1b()
            _start+0x83()




7
Give up?

    • Could add more logging for next time, but we don’t know how to
      reproduce it. (Plus, what would we log?)
    • ... but it’s still exhibiting these symptoms! Can’t we figure out why?!





8
The more general problem

    • The software: a moderately complex concurrent service
      (that is, where concurrent requests can affect one another).
    • The deployment: in production, 10s to 1000s of instances.
    • The problem: ~once/day, one of the instances crashes, leaving
      behind a stacktrace where an assertion was blown. (Or worse: one
      of the instances simply misbehaves, leaving nothing to help figure
      out what’s going wrong.)
    • How do you debug this?




9
A first approach

     • Add instrumentation (console.log) and redeploy.
     • How easy is it to deploy a new version?
     • How risky is it to deploy a new version? What’s the impact?
       What if it’s a very common code path that you need to instrument?
     • What if this isn’t your code, but a customer’s, whose deployment
       you cannot control?
     • Are you sure you’ll only need to do this once?
       (You lose credibility with each lap with ops and customers.)
     • If you’re lucky or if the problem is relatively simple, this can work
       okay.



10
“The postmortem technique”


     “Experience with the EDSAC has
     shown that although a high proportion
     of mistakes can be removed by
     preliminary checking, there frequently
     remain mistakes which could only
     have been detected in the early
     stages by prolonged and laborious
     study. Some attention, therefore, has
     been given to the problem of dealing
     with mistakes after the programme
     has been tried and found to fail.”

          —Stanley Gill, 1926 - 1975
          “The diagnosis of mistakes in programmes on the EDSAC”, 1951

11
A better approach

     • For C programs, we have rich tools for postmortem analysis of a
       system based on a snapshot of its state.
     • This technique is so old, the term for this state snapshot dates
       from the dawn of computing: it’s a core dump.
     • Once a core dump has been generated, either automatically after
       a crash or on-demand using gcore(1), the program can be
       immediately restarted to restore service quickly so that
       engineers can debug the problem asynchronously.
     • Using the debugger on the core dump, you can inspect all internal
       program state: global variables, threads, and objects.
     • Can also use the same tools with a live process.
     • Can’t we do this with Node.js?
12
Debugging dynamic environments

     • Historically, native postmortem tools have been unable to
       meaningfully observe dynamic environments like Node.
     • Such tools would need to translate native abstractions from the
       dump (symbols, functions, structs) into their higher-level
       counterparts in the dynamic environment (variables, Functions,
       Objects).
     • Some abstractions don’t even exist explicitly in the language itself.
       (e.g., JavaScript’s event queue)
     • Node is not alone! The state of the art is no better in Python, Ruby,
       or PHP, and not nearly solved for Java or Erlang either.




13
Aside: MDB

     • illumos-based systems like SmartOS and OmniOS have MDB, the
       modular debugger built specifically for postmortem analysis
     • MDB was originally built for postmortem analysis of the operating
       system kernel and later extended to applications
     • Plug-ins (“dmods”) can easily build on one another to deliver
       powerful postmortem analysis tools, e.g.:
       • ::stacks coalesces threads based on stack trace, with optional filtering
         by module, caller, etc
       • ::findleaks performs postmortem garbage collection on a core dump to find
         memory leaks in native code

     • Could we build a dmod for Node?


14
mdb_v8: postmortem debugging for Node

     • With some excruciating pain and some ugly layer violations, we
       were able to build mdb_v8.
     • ::jsstack prints call stacks, including both native C++ and
       JavaScript functions and arguments.
     • ::jsprint, given a pointer, prints it both as a C++ object and as
       its JavaScript counterpart.
     • ::v8function, given a JSFunction pointer, shows the assembly for
       that function.
     • ::findjsobjects scans the heap to identify how many instances of
       each object type exist (incredible visibility into memory usage).
     • Demo
15
Remember that runaway Node program?

     • In February, 2011, we had essentially no way to see what this
       program was doing.
     • We saved a core dump in case we might one day have a way to
       read it. We also added instrumentation in case we saw it again.
       (We expected to see it again very soon after going to production.)
     • We didn’t see it again until October, while the mdb_v8 work was
       underway. So we applied what we had to the original core file...




16
And the winner is:

       > ::jsstack
       8046a9c <anonymous> (as exports.bucketize) at lib/heatmap.js position 7838
       8046af8 caAggrValueHeatmapImage at lib/ca/ca-agg.js position 48960
       ...
       > 8046a9c::jsframe -v
       8046a9c <anonymous> (as exports.bucketize)
             func: fc435fcd
             file: lib/heatmap.js
             posn: position 7838
             arg1: fc070719 (JSObject)
             arg2: fc070709 (JSArray)
        > fc070719::jsprint
        {
              base: 1320886447,
              height: 281,
              width: 624,
              max: 11538462,
              min: 11538462,
              ...
        }

        Invalid input resulted in an infinite loop in JavaScript.
        Time to root cause: 10 minutes.
17
mdb_v8: How the sausage is made

     • V8 (libv8.a) includes a small amount (a few KB) of metadata that
       describes the heap’s classes, type information, and class layouts.
       (Small enough to include in production builds.)
     • mdb_v8 knows how to identify stack frames, iterate function
       arguments, iterate object properties, and walk basic V8 structures
       (arrays, functions, strings).
     • mdb_v8 uses the debug metadata encoded in the binary to avoid
       hardcoding the way heap structures are laid out in memory. (Still
       has intimate knowledge of things like property iteration.)




18
What did you say was in this sausage?

     • Goal: debugger module shouldn’t hardcode structure offsets and
       other constants, but rather rely on metadata included in the “node”
       binary.
     • Generally speaking, these offsets are computed at compile-time
       and used in inline functions defined by macros. So they get
       compiled out and are not available at runtime.
     • The build process was modified to:
       • Generate a new C++ source file with references to the constants that we need,
         using extern "C" constants that our debugger module can look for and read.
       • Build this with the rest of libv8_base.a.

     • Result: this “debug metadata” is embedded in $PREFIX/bin/node,
       and the debugger can read it directly from the core file.
     • (Should) generally work for 32-bit/64-bit, different architectures,
       and no matter how complex the expressions for the constants are.
19
Problems with this approach

     • We strongly believe in the general approach of having the
       debugger grok program state from a snapshot, because it’s
       comprehensive and has zero runtime overhead, meaning it
       works in production. (This is a constraint.)
     • With the current implementation, the debugger module is built and
       delivered separately from the VM, which means that changes in
       the VM can (and do) break the debugger module.
     • Additionally, each debugger feature requires reverse engineering
       and reimplementing some piece of the VM.
     • Ideally, the VM would embed programmatic logic for decoding its
       in-memory state (e.g., iterating objects, iterating object properties,
       walking the stack, and so on) without requiring the VM itself to
       be running.
20
Debugging live programs

     • Postmortem tools can be applied to live processes, and
       core files can be generated for running processes.
     • Examining processes and core dumps is useful for many kinds of
       failure, but sometimes you want to trace runtime activity.




21
DTrace

     • Provides comprehensive tracing of kernel and application-level
       events in real-time (from “thread on-CPU” to “Node GC done”)
     • Scales arbitrarily with the number of traced events.
       (first class in situ data aggregation)
     • Suitable for production systems because it’s safe, has minimal
       overhead (usually no disabled probe effect), and can be enabled/
       disabled dynamically (no application restart required).
     • Open-sourced in 2005. Available on illumos-derived systems like
       SmartOS and OmniOS, Solaris-derived systems, BSD, and
       MacOS (Linux ports in progress).




22
DTrace in dynamic environments

     • DTrace instruments the system holistically (that is, from the
       kernel), which poses a challenge for interpreted environments
     • User-level statically defined tracing (USDT) providers describe
       semantically relevant points of instrumentation
     • Some interpreted environments (e.g., Ruby, Python, PHP, Erlang)
       have added USDT providers that instrument the interpreter itself
     • This approach is very fine-grained (e.g., every function call) and
       doesn’t work in JIT’d environments
     • We decided to take a different tack for Node.js...



23
DTrace for Node.js

     • Given the nature of the paths that we wanted to instrument, we
       introduced a function into JavaScript that Node can call to get into
       USDT-instrumented C++
     • Introduces disabled probe effect: calling from JavaScript into C++
       costs even when probes are not enabled
     • Use USDT is-enabled probes to minimize disabled probe effect
       once in C++
     • If (and only if) the probe is enabled, prepare a structure for the
       kernel that allows for translation into a structure that is familiar to
       node programmers




24
DTrace example: Node GC time, per GC
        # dtrace -n '
        node*:::gc-start { self->start = timestamp; }
        node*:::gc-done /self->start/ {
          @["microseconds"] = quantize((timestamp - self->start) / 1000);
          self->start = 0;
        }'

       microseconds
               value     ------------- Distribution ------------- count
                  32   |                                          0
                  64   |@@@@@                                     19
                 128   |@@                                        6
                 256   |@@                                        6
                 512   |@@@@                                      13
                1024   |@@@@@                                     17
                2048   |@@@@@@@                                   24
                4096   |@@@@@@@@                                  29
                8192   |@@@@@                                     16
               16384   |@                                         5
               32768   |@                                         3
               65536   |                                          1
              131072   |@                                         3
              262144   |                                          0


25
DTrace probes in Node.js modules

     • Our technique is adequate for DTrace probes in the Node.js core,
       but it’s very cumbersome for pure Node.js modules
     • Fortunately, Chris Andrews has generalized this technique in his
       node-dtrace-provider module:
         https://github.com/chrisa/node-dtrace-provider
     • This module allows one to declare and fire one’s own probes
       entirely in JavaScript
     • Used extensively by Joyent’s Mark Cavage in his node-restify and
       ldap.js modules, especially to allow for measurement of latency
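The module's documented usage looks roughly like this (a sketch: the provider and probe names are invented, and the guarded require lets it degrade to a no-op where DTrace or the module is unavailable):

```javascript
var probe;
try {
  // Chris Andrews' node-dtrace-provider: declare and fire probes in JS.
  var d = require('dtrace-provider');
  var provider = d.createDTraceProvider('myapp');        // name: invented
  probe = provider.addProbe('request-done', 'char *', 'int');
  provider.enable();
} catch (e) {
  // No DTrace (or module not installed): fall back to a no-op so the
  // application code below stays unchanged.
  probe = { fire: function () {} };
}

// Fire with a callback so the arguments are built only when a DTrace
// consumer is actually listening:
function requestDone(url, latencyMs) {
  probe.fire(function () { return [url, latencyMs]; });
}

requestDone('/index.html', 12);
```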




26
DTrace stack traces

     • ustack(): DTrace looks at (%ebp, %eip) and follows frame
       pointers to the top of the stack (standard approach).
       Asynchronously, looks for symbols in the process’s address space
       to map instruction offsets to function names:
          0x80ed9ab becomes malloc+0x16
     • Great for C, C++. Doesn’t work for JIT’d environments.
       • Functions are compiled at runtime => they have no corresponding symbols
              => the VM must be called upon at runtime to map frames to function names
       •   Garbage collection => functions themselves move around at arbitrary points
              => mapping of frames to function names must be done “synchronously”

     • jstack(): Like ustack(), but invokes VM-specific ustack helper,
       expressed in D and attached to the VM binary, to resolve names.



27
DTrace ustack helpers

     • For JIT’d code, DTrace supports a ustack helper mechanism, by
       which the VM itself includes logic to translate from
           (frame pointer, instruction pointer) -> human-readable function name

     • When jstack() action is processed in probe context (in the kernel),
       DTrace invokes the helper to translate frames:

        Before                    After
        0xfe772a8c                toJSON at native date.js position 39314
        0xfe84d962                BasicJSONSerialize at native json.js position 8444
        0xfea6b6ed                BasicSerializeObject at native json.js position 7622
        0xfe84db11                BasicJSONSerialize at native json.js position 8444
        0xfeaba5ee                stringify at native json.js position 10128



28
V8 ustack helper

     • The ustack helper has to do much of the same work that mdb_v8
       does to identify stack frames and pick apart heap objects.
     • The same debug metadata that’s used for mdb_v8 is used for the
       helper, but unlike mdb_v8, the helper is embedded directly into the
       node binary (good!).
     • The implementation is written in D, and subject to all the same
       constraints as other DTrace scripts (and then some): no functions,
       no iteration, no if/else.
     • Particularly nasty pieces include expanding ConsStrings and
       binary searching to compute line numbers.
     • The helper only depends on V8, not Node.js. (With MacOS support
       for ustack helpers from profile probes, we could use the same
       helper to profile webapps running under Chrome!)
29
Profiling Node with DTrace

     • “profile” provider: probe that fires N times per second per CPU
     • ustack()/jstack() actions: collect user-level stacktrace when
       a probe fires.
     • Low-overhead runtime profiling (via stack sampling) that can be
       turned on and off without restarting your program.
     • Demo.




30
Node.js Flame Graph

     • Visualizing profiling output: (flame graph image)
     • Full, interactive version: http://bit.ly/NMQT1B
31
More real-world examples

     • The infinite loop problem we saw earlier was debugged with
       mdb_v8, and could have also been debugged with DTrace.
     • @izs used mdb_v8’s heap scanning to zero in on a memory leak
       in Node 0.7 that was seriously impacting several users, including
       Voxer.
     • @mranney (Voxer) has used Node profiling + flame graphs to
       identify several performance issues (unoptimized OpenSSL
       implementation, poor memory allocation behavior).
     • Debugging RangeError (stack overflow, with no stack trace).




32
Final thoughts

     • Node is great for rapidly building complex or distributed system
       software. But to achieve the reliability we expect from such
       systems, we must be able to understand both fatal and non-fatal
       failure in production from the first occurrence.
     • One year ago: we had no way to solve the “infinite loop” problem
       without adding more logging and hoping to see it again.
     • Now we have tools to inspect both running and crashed Node
       programs (mdb_v8 and the DTrace ustack helper), and we’ve used
       them to debug problems in minutes that we either couldn’t solve at
       all before or which took days or weeks to solve.
     • But the postmortem tools are still primitive (like a flashlight in a
       dark room). Need better support from the VM.

33
• Thanks:
       •   @mraleph for help with V8 and landing patches
       •   @izs and the Node core team for help integrating DTrace and MDB support
       •   @brendangregg for flame graphs
       •   @chrisandrews for node-dtrace-provider
       •   @mcavage for putting it to such great use in node-restify and ldap.js
       •   @mranney and Voxer for pushing Node hard, running into lots of issues, and
           helping us refine the tools to debug them. (God bless the early adopters!)

     • For more info:
       •   http://dtrace.org/blogs/dap/2012/04/25/profiling-node-js/
       •   http://dtrace.org/blogs/dap/2012/01/13/playing-with-nodev8-postmortem-debugging/
       •   https://github.com/joyent/illumos-joyent/blob/master/usr/src/cmd/mdb/common/modules/v8/mdb_v8.c
       •   https://github.com/joyent/node/blob/master/src/v8ustack.d




34

More Related Content

Surge2012

  • 1. When Node.js goes wrong: Debugging Node in production Surge 2012 David Pacheco (@dapsays) Joyent
  • 2. The Rise of Node.js • We see Node.js as the confluence of three ideas: • JavaScript’s friendliness and rich support for asynchrony (i.e., closures) • High-performance JavaScript VMs (e.g., V8) • Time-tested system abstractions (i.e. Unix, in the form of streams) • Event-oriented model delivers consistent performance in the presence of long latency events (i.e. no artificial latency bubbles) • Node.js is displacing C for a lot of highly reliable, high performance core infrastructure software (at Joyent alone: DNS, DHCP, SNMP, LDAP, key value stores, public-facing web services, ...). • This has been great for rapid development, but historically has come with a cost in debuggability. • Debugging is decidedly not a new problem... 2
  • 3. The Genesis of debugging “As soon as we started programming, we found to our surprise that it wasn't as easy to get programs right as we had thought. Debugging had to be discovered. I can remember the exact instant when I realized that a large part of my life from then on was going to be spent in finding mistakes in my own programs.” —Sir Maurice Wilkes, 1913 - 2010 3
  • 4. Debugging Node: a run-away service • February, 2011: Joyent is preparing to launch no.de, a Node PaaS. • During testing, Cloud Analytics becomes intermittently unresponsive. • Quickly traced the problem to a rogue data aggregator using 100% of 1 CPU core and not responding over HTTP or AMQP. • How do you debug this? 4
  • 5. Debugging a run-away service • Check the logs? 5
  • 6. Debugging a run-away service • Check syscall activity (truss/strace)? 6
  • 7. Debugging a run-away service • Check thread stacks: v8::internal::Runtime::SetObjectProperty+0x36d() v8::internal::Runtime_SetProperty+0x73() 0xfe7601f6() 0xfbff31d8() 0xfc468f59() 0xfe8e51cf() ... 0xfe760841() ev_run+0x406() 0xfe8e3dc8() uv_run+0x1c() 0xfe8e24a4() node::Start+0xa9() ... main+0x1b() _start+0x83() 7
  • 8. Give up? • Could add more logging for next time, but we don’t know how to reproduce it. (Plus, what would we log?) • ... but it’s still exhibiting these symptoms! Can’t we figure out why?! Text Text 8
  • 9. The more general problem • The software: a moderately complex concurrent service (that is, where concurrent requests can affect one another). • The deployment: in production, 10s to 1000s of instances. • The problem: ~once/day, one of the instances crashes, leaving behind a stacktrace where an assertion was blown. (Or worse: one of the instances simply misbehaves, leaving nothing to help figure out what’s going wrong.) • How do you debug this? 9
  • 10. A first approach • Add instrumentation (console.log) and redeploy. • How easy is it to deploy a new version? • How risky is it to deploy a new version? What’s the impact? What if it’s a very common code path that you need to instrument? • What if this isn’t your code, but a customer’s, whose deployment you cannot control? • Are you sure you’ll only need to do this once? (You lose credibility with each lap with ops and customers.) • If you’re lucky or if the problem is relatively simple, this can work okay. 10
  • 11. “The postmortem technique” “Experience with the EDSAC has shown that although a high proportion of mistakes can be removed by preliminary checking, there frequently remain mistakes which could only have been detected in the early stages by prolonged and laborious study. Some attention, therefore, has been given to the problem of dealing with mistakes after the programme has been tried and found to fail.” —Stanley Gill, 1926 - 1975 “The diagnosis of mistakes in programmes on the EDSAC”, 1951 11
  • 12. A better approach • For C programs, we have rich tools for postmortem analysis of a system based on a snapshot of its state. • This technique is so old, the term for this state snapshot dates from the dawn of computing: it’s a core dump. • Once a core dump has been generated, either automatically after a crash or on-demand using gcore(1), the program can be immediately restarted to restore service quickly so that engineers can debug the problem asynchronously. • Using the debugger on the core dump, you can inspect all internal program state: global variables, threads, and objects. • Can also use the same tools with a live process. • Can’t we do this with Node.js? 12
  • 13. Debugging dynamic environments • Historically, native postmortem tools have been unable to meaningfully observe dynamic environments like Node. • Such tools would need to translate native abstractions from the dump (symbols, functions, structs) into their higher-level counterparts in the dynamic environment (variables, Functions, Objects). • Some abstractions don’t even exist explicitly in the language itself. (e.g., JavaScript’s event queue) • Node is not alone! The state of the art is no better in Python, Ruby, or PHP, and not nearly solved for Java or Erlang either. 13
  • 14. Aside: MDB • illumos-based systems like SmartOS and OmniOS have MDB, the modular debugger built specifically for postmortem analysis • MDB was originally built for postmortem analysis of the operating system kernel and later extended to applications • Plug-ins (“dmods”) can easily build on one another to deliver powerful postmortem analysis tools, e.g.: • ::stacks coalesces threads based on stack trace, with optional filtering by module, caller, etc • ::findleaks performs postmortem garbage collection on a core dump to find memory leaks in native code • Could we build a dmod for Node? 14
  • 15. mdb_v8: postmortem debugging for Node • With some excruciating pain and some ugly layer violations, we were able to build mdb_v8 • With ::jsstack, prints call stacks, including native C++ and JavaScript functions and arguments. • With ::jsprint, given a pointer, prints out as a C++ object and its JavaScript counterpart. • With ::v8function, given a JSFunction pointer, show the assembly for that function. • With ::findjsobjects, scans the heap to identify how many instances of each object type exist (incredible visibility into memory usage). • Demo 15
  • 16. Remember that run-away Node program? • In February, 2011, we had essentially no way to see what this program was doing. • We saved a core dump in case we might one day have a way to read it. We also added instrumentation in case we saw it again. (We expected to see it again very soon after going to production.) • We didn’t see it again until October, while the mdb_v8 work was underway. So we applied what we had to the original core file... 16
  • 17. And the winner is: > ::jsstack 8046a9c <anonymous> (as exports.bucketize) at lib/heatmap.js position 7838 8046af8 caAggrValueHeatmapImage at lib/ca/ca-agg.js position 48960 ... > 8046a9c::jsframe -v 8046a9c <anonymous> (as exports.bucketize) func: fc435fcd file: lib/heatmap.js posn: position 7838 arg1: fc070719 (JSObject) arg2: fc070709 (JSArray) > fc070719::jsprint { base: 1320886447, height: 281, Invalid input resulted in infinite loop in JavaScript width: 624, Time to root cause: 10 minutes max: 11538462, min: 11538462, ... } 17
  • 18. mdb_v8: How the sausage is made • V8 (libv8.a) includes a small amount (a few KB) of metadata that describes the heap’s classes, type information, and class layouts. (Small enough to include in production builds.) • mdb_v8 knows how to identify stack frames, iterate function arguments, iterate object properties, and walk basic V8 structures (arrays, functions, strings). • mdb_v8 uses the debug metadata encoded in the binary to avoid hardcoding the way heap structures are laid out in memory. (Still has intimate knowledge of things like property iteration.) 18
  • 19. What did you say was in this sausage? • Goal: debugger module shouldn’t hardcode structure offsets and other constants, but rather rely on metadata included in the “node” binary. • Generally speaking, these offsets are computed at compile-time and used in inline functions defined by macros. So they get compiled out and are not available at runtime. • The build process was modified to: • Generate a new C++ source file with references to the constants that we need, using extern “C” constants that our debugger module can look for and read. • Build this with the rest of libv8_base.a. • Result: this “debug metadata” is embedded in $PREFIX/bin/node, and the debugger can read it directly from the core file. • (Should) generally work for 32-bit/64-bit, different architectures, and no matter how complex the expressions for the constants are. 19
  • 20. Problems with this approach • We strongly believe in the general approach of having the debugger grok program state from a snapshot, because it’s comprehensive and has zero runtime overhead, meaning it works in production. (This is a constraint.) • With the current implementation, the debugger module is built and delivered separately from the VM, which means that changes in the VM can (and do) break the debugger module. • Additionally, each debugger feature requires reverse engineering and reimplementing some piece of the VM. • Ideally, the VM would embed programmatic logic for decoding the in-memory state (e.g., iterating objects, iterating object properties, walking the stack, and so on) -- without relying on the VM itself to be running. 20
  • 21. Debugging live programs • Postmortem tools can be applied to live processes, and core files can be generated for running processes. • Examining processes and core dumps is useful for many kinds of failure, but sometimes you want to trace runtime activity. 21
  • 22. DTrace • Provides comprehensive tracing of kernel and application-level events in real-time (from “thread on-CPU” to “Node GC done”) • Scales arbitrarily with the number of traced events. (first class in situ data aggregation) • Suitable for production systems because it’s safe, has minimal overhead (usually no disabled probe effect), and can be enabled/ disabled dynamically (no application restart required). • Open-sourced in 2005. Available on illumos-derived systems like SmartOS and OmniOS, Solaris-derived systems, BSD, and MacOS (Linux ports in progress). 22
  • 23. DTrace in dynamic environments • DTrace instruments the system holistically, which is to say, from the kernel, which poses a challenge for interpreted environments • User-level statically defined tracing (USDT) providers describe semantically relevant points of instrumentation • Some interpreted environments (e.g., Ruby, Python, PHP, Erlang) have added USDT providers that instrument the interpreter itself • This approach is very fine-grained (e.g., every function call) and doesn’t work in JIT’d environments • We decided to take a different tack for Node.js... 23
  • 24. DTrace for Node.js • Given the nature of the paths that we wanted to instrument, we introduced a function into JavaScript that Node can call to get into USDT-instrumented C++ • Introduces disabled probe effect: calling from JavaScript into C++ costs even when probes are not enabled • Use USDT is-enabled probes to minimize disabled probe effect once in C++ • If (and only if) the probe is enabled, prepare a structure for the kernel that allows for translation into a structure that is familiar to node programmers 24
  • 25. DTrace example: Node GC time, per GC #    dtrace –n ‘ node*:::gc-start { self->start = timestamp; } node*:::gc-done/self->start/{ @[“microseconds”] = quantize((timestamp – self->start) / 1000); self->start = 0; }’ microseconds value ------------- Distribution ------------- count 32 | 0 64 |@@@@@ 19 128 |@@ 6 256 |@@ 6 512 |@@@@ 13 1024 |@@@@@ 17 2048 |@@@@@@@ 24 4096 |@@@@@@@@ 29 8192 |@@@@@ 16 16384 |@ 5 32768 |@ 3 65536 | 1 131072 |@ 3 262144 | 0 25
  • 26. DTrace probes in Node.js modules • Our technique is adequate for DTrace probes in the Node.js core, but it’s very cumbersome for pure Node.js modules • Fortunately, Chris Andrews has generalized this technique in his node-dtrace-provider module: https://github.com/chrisa/node-dtrace-provider • This module allows one to declare and fire one’s own probes entirely in JavaScript • Used extensively by Joyent’s Mark Cavage in his node-restify and ldap.js modules, especially to allow for measurement of latency 26
DTrace stack traces

    • ustack(): DTrace looks at (%ebp, %eip) and follows frame pointers to the top of the stack (standard approach). Asynchronously, it looks for symbols in the process’s address space to map instruction offsets to function names: 0x80ed9ab becomes malloc+0x16
    • Great for C and C++, but doesn’t work for JIT’d environments:
      • Functions are compiled at runtime => they have no corresponding symbols => the VM must be called upon at runtime to map frames to function names
      • Garbage collection => functions themselves move around at arbitrary points => mapping of frames to function names must be done “synchronously”
    • jstack(): like ustack(), but invokes a VM-specific ustack helper, expressed in D and attached to the VM binary, to resolve names

27
DTrace ustack helpers

    • For JIT’d code, DTrace supports a ustack helper mechanism, by which the VM itself includes logic to translate from (frame pointer, instruction pointer) to a human-readable function name
    • When the jstack() action is processed in probe context (in the kernel), DTrace invokes the helper to translate frames:

        Before          After
        0xfe772a8c      toJSON at native date.js position 39314
        0xfe84d962      BasicJSONSerialize at native json.js position 8444
        0xfea6b6ed      BasicSerializeObject at native json.js position 7622
        0xfe84db11      BasicJSONSerialize at native json.js position 8444
        0xfeaba5ee      stringify at native json.js position 10128

28
V8 ustack helper

    • The ustack helper has to do much of the same work that mdb_v8 does to identify stack frames and pick apart heap objects.
    • The same debug metadata that’s used for mdb_v8 is used for the helper, but unlike mdb_v8, the helper is embedded directly into the node binary (good!).
    • The implementation is written in D, and subject to all the same constraints as other DTrace scripts (and then some): no functions, no iteration, no if/else.
    • Particularly nasty pieces include expanding ConsStrings and binary searching to compute line numbers.
    • The helper only depends on V8, not Node.js. (With MacOS support for ustack helpers from profile probes, we could use the same helper to profile webapps running under Chrome!)

29
Profiling Node with DTrace

    • “profile” provider: probe that fires N times per second per CPU
    • ustack()/jstack() actions: collect a user-level stack trace when a probe fires
    • Low-overhead runtime profiling (via stack sampling) that can be turned on and off without restarting your program
    • Demo.

30
Node.js Flame Graph

    • Visualizing profiling output: [flame graph image]
    • Full, interactive version: http://bit.ly/NMQT1B

31
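A flame graph is built from collapsed stack samples: identical stacks are merged and counted, yielding one line per unique stack in the "frame1;frame2;... count" format that Brendan Gregg's flamegraph.pl consumes. A minimal sketch of that collapsing step, with hypothetical sample data:

```javascript
// Collapse sampled stacks into flame-graph input: one line per unique
// stack, "frame1;frame2;... count", root frame first. Sample data below
// is hypothetical.
function collapseStacks(samples) {
  var counts = {};
  samples.forEach(function (stack) {
    var key = stack.join(';'); // root frame first, leaf last
    counts[key] = (counts[key] || 0) + 1;
  });
  return Object.keys(counts).sort().map(function (key) {
    return key + ' ' + counts[key];
  });
}

var samples = [
  ['main', 'http.Server.emit', 'handleRequest'],
  ['main', 'http.Server.emit', 'handleRequest'],
  ['main', 'gc']
];

console.log(collapseStacks(samples).join('\n'));
// main;gc 1
// main;http.Server.emit;handleRequest 2
```

In the DTrace workflow, the raw stacks come from jstack() samples taken by the profile provider; the flame graph then draws each merged stack as a box whose width is proportional to its count.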
More real-world examples

    • The infinite loop problem we saw earlier was debugged with mdb_v8, and could also have been debugged with DTrace.
    • @izs used mdb_v8’s heap scanning to zero in on a memory leak in Node 0.7 that was seriously impacting several users, including Voxer.
    • @mranney (Voxer) has used Node profiling + flame graphs to identify several performance issues (unoptimized OpenSSL implementation, poor memory allocation behavior).
    • Debugging RangeError (stack overflow, with no stack trace).

32
Final thoughts

    • Node is great for rapidly building complex or distributed system software. But in order to achieve the reliability we expect from such systems, we must be able to understand both fatal and non-fatal failure in production from the first occurrence.
    • One year ago: we had no way to solve the “infinite loop” problem without adding more logging and hoping to see it again.
    • Now we have tools to inspect both running and crashed Node programs (mdb_v8 and the DTrace ustack helper), and we’ve used them to debug problems in minutes that we either couldn’t solve at all before or which took days or weeks to solve.
    • But the postmortem tools are still primitive (like a flashlight in a dark room). We need better support from the VM.

33
    • Thanks:
      • @mraleph for help with V8 and landing patches
      • @izs and the Node core team for help integrating DTrace and MDB support
      • @brendangregg for flame graphs
      • @chrisandrews for node-dtrace-provider
      • @mcavage for putting it to such great use in node-restify and ldap.js
      • @mranney and Voxer for pushing Node hard, running into lots of issues, and helping us refine the tools to debug them. (God bless the early adopters!)
    • For more info:
      • http://dtrace.org/blogs/dap/2012/04/25/profiling-node-js/
      • http://dtrace.org/blogs/dap/2012/01/13/playing-with-nodev8-postmortem-debugging/
      • https://github.com/joyent/illumos-joyent/blob/master/usr/src/cmd/mdb/common/modules/v8/mdb_v8.c
      • https://github.com/joyent/node/blob/master/src/v8ustack.d

34