

Debugging Node.js applications using core dumps

Antti Risteli

May 3, 2016


I was recently asked to help debug a production issue affecting a Node.js application. The application got stuck in such a way that requests were no longer served. Monitoring graphs showed each stuck Node.js process consuming a full CPU core, while plenty of memory remained free on the box.

In a similar situation, JVM users would probably attach some sort of remote monitoring, or at least take a thread or heap dump. Luckily, similar tools exist in the Node.js ecosystem.

mdb and mdb_v8

mdb is a debugger that enables postmortem (and also live) debugging on Solaris-like systems. mdb_v8 is an extension created by Joyent that extends mdb to analyze core dumps of Node.js processes; see the mdb_v8 documentation for the supported debugging commands. mdb_v8 can, for example, print the call stack of the JavaScript thread (including both native and JS frames) and inspect the JS heap.
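For reference, a few of the mdb_v8 commands (dcmds) that come up most often; the exact set depends on the mdb_v8 version:

    ::jsstack          # print the JS thread's call stack (native + JS frames)
    addr::jsprint      # pretty-print the JS value at a given heap address
    ::findjsobjects    # scan the heap for JS objects, grouped by shape
    ::jsfunctions      # list the JS functions present in the core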

mdb_v8 can also be used on core dumps generated on Linux machines. However, the software itself still runs only on Solaris-like operating systems, so a SmartOS VM is required. Luckily, there's a tool called autopsy that makes mdb_v8 easy to use on Linux and Mac OS X.

Usage sample

For example, run the following JS in Node on a Linux machine, then take a core dump and grab the Node executable:
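The original script isn't reproduced here, but a minimal sketch of the kind of program involved could look like this; the file name and the loop body are my own placeholders, only the function name ikikiersio matches the stack trace below:

    // busyloop.js -- a named function spinning forever, invoked from an
    // unnamed inline function, so both kinds of frame appear in the stack
    function ikikiersio(obj) {
      for (;;) {
        obj.count++;             // burn CPU while holding a reference to obj
      }
    }

    setTimeout(function () {     // deliberately left anonymous
      ikikiersio({ count: 0 });
    }, 100);

Taking the core dump (gcore ships with the gdb package) and grabbing the matching binary:

    $ ulimit -c unlimited
    $ node busyloop.js &
    $ gcore $(pgrep -n node)     # writes core.<pid>
    $ cp "$(command -v node)" .  # analysis needs the exact node binary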

Then run autopsy nodebin corefile:
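With the placeholder file names from the sketch above, that looks something like this; autopsy boots a SmartOS VM for you and drops you into an mdb session with mdb_v8 available:

    $ autopsy ./node core.12345    # both file names are placeholders
    ...
    >                              # the mdb prompt; mdb_v8 dcmds work here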

For example, the stack information is printed with the ::jsstack dcmd (-v adds source positions and function arguments):
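The exact output layout depends on the mdb_v8 version, so the frames below are only schematic (frame pointers elided as <fp>); the function name and the argument address are the ones from the actual trace, everything else is illustrative:

    > ::jsstack -v
    ...
    <fp> <anonymous> (as <anonymous>)
    <fp> ikikiersio
            arg1: 6812a0864e1 (JSObject)
    <fp> <anonymous> (as <anonymous>)
    ...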

As you can see in the stack trace, inline functions without names show up only as <anonymous>, so naming even inline functions is helpful when debugging. We can see that the function ikikiersio was called with a parameter at address 6812a0864e1. Let's dump its value:
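With mdb_v8, printing the JS value behind a known heap address is done with the ::jsprint dcmd:

    > 6812a0864e1::jsprint    # pretty-prints the JS value (object, string, ...)
                              # living at that address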

mdb_v8 supports lots of other commands as well (e.g. for analyzing the heap), but in this case analyzing just the stack data was enough to point us in the right direction with our production problem.

Production postmortem

The initial CPU usage pattern hinted at something being stuck in an infinite loop. The stack traces from all the stuck Node.js processes pointed to garbage collection running at the time. GC running simultaneously in all the processes doesn't sound like a coincidence, so the processes were probably stuck in a situation where GC could free just enough memory to keep the process from being killed, but not enough for it to do useful work.

Further analysis revealed that the software expected a 4 GB heap for caching things in-VM, but the default Node.js heap size of about 1.6 GB was in use. We raised the heap size, and there have been no similar incidents since. GC tracing options could also have been used to analyze the problem, but core dumps provide a nice way of examining a running system when such debug options aren't enabled.
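For reference, the heap limit in question is V8's old-space size, raised with the --max-old-space-size flag (in megabytes), and --trace-gc is one of the GC tracing options mentioned; app.js is a placeholder:

    $ node --max-old-space-size=4096 app.js   # run with a ~4 GB old-space heap
    $ node --trace-gc app.js                  # log every GC run instead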

Pitfalls

I'd suggest testing that this technique works before a reason for using it arises. For example, I had trouble analyzing core dumps taken from Node.js running on Alpine Linux (possibly because Alpine uses musl instead of glibc, or because the Node.js binary was missing debug symbols).
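A quick smoke test along these lines (all names are placeholders) verifies the whole pipeline on a given box before an incident forces the issue:

    $ ulimit -c unlimited
    $ node -e 'setInterval(function tick() {}, 1000)' &
    $ gcore $!                                # writes core.<pid>
    $ autopsy "$(command -v node)" "core.$!"
    > ::jsstack                               # should print a readable stack
                                              # (remember to kill the node
                                              # process afterwards)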
