Node.js Memory, Garbage Collection & Production Failures
How Node.js memory really works: the V8 heap, garbage collection, memory leaks, OOM crashes, and diagnosing it all in production on EC2 and ECS.
Most Node.js apps run fine for months and then, one day, a service starts slowing down, its memory creeps up, and eventually it crashes with JavaScript heap out of memory. Nothing in the code "looks" wrong. This is the territory that separates someone who can write Node from someone who can run it in production: understanding how V8 stores your objects, how it cleans them up, what the memory limits actually are, and how the handful of classic failures show up and get diagnosed. This is the companion to the core Node.js guide, and it goes deep on exactly that. Plain-English first, then how it actually works under the hood, then the real-world failures and how to find them.
How to read this guide. Every idea is built from the ground up: what the thing is, how it works mechanically, then what goes wrong and how you would diagnose it on a real server. Analogies anchor the abstract parts. Where it matters for interviews, an "Interview answer" gives you phrasing you can use directly. The goal is not to memorize numbers but to understand the machine well enough that production surprises make sense. Current as of Node.js 24 LTS, 2026.
Where Your Data Lives: Stack and Heap
When your program runs, V8 (the engine inside Node) stores data in two very different places, and knowing which is which explains a lot of behavior.
The stack holds simple, fixed-size things: numbers, booleans, and the references (pointers) to bigger objects. It works like a stack of plates: every function call pushes a new frame on top, and when the function returns, its frame pops off and its local variables vanish automatically. It is tiny, extremely fast, and self-cleaning.
The heap holds everything with a variable or unknown size: objects, arrays, strings, functions, closures. These do not disappear when a function returns; they stay until the garbage collector decides nothing can reach them anymore. The heap is where almost all interesting memory behavior, and almost all memory problems, happen.
function example() {
let count = 10; // the number lives on the STACK
let user = { name: "Alice" }; // the object lives on the HEAP;
// `user` (a reference to it) lives on the stack
}
// When example() returns: `count` and the `user` reference pop off the stack.
// The { name: "Alice" } object stays on the heap until GC proves nothing points to it.
Analogy. The stack is your desk: small, right in front of you, instantly cleared when you finish a task. The heap is the warehouse out back: it holds everything large, things stay there after you walk away, and someone (the garbage collector) periodically walks the aisles throwing out boxes that nobody has a delivery slip for anymore.
The practical payoff: pass-by-value vs pass-by-reference
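This split is not academic. It is exactly why mutating an object inside a function changes the caller's object, but reassigning a number does not. Primitives (numbers, booleans, strings) behave as values: assigning or passing one hands over a copy, so the callee cannot change the caller's variable. (Strings are physically stored on the heap, but they are immutable, so sharing one is indistinguishable from copying it.) Objects and arrays behave as references: what gets copied is a pointer to the one shared object on the heap, and a copied pointer still points at the same box. Strictly speaking, JavaScript always passes by value; for objects, the value being passed is the reference.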
function tweak(num, obj) {
num = 99; // reassigns this function's OWN copy of the number
obj.name = "Bob"; // follows the pointer and mutates the SHARED heap object
}
let n = 1;
let user = { name: "Alice" };
tweak(n, user);
console.log(n); // 1 -> the primitive was copied, caller unaffected
console.log(user.name); // "Bob" -> the object is shared, caller sees the change
The same mechanism explains why two variables can secretly be the same object:
let a = { count: 0 };
let b = a; // b copies the REFERENCE, not the object
 stack            heap
┌──────┐        ┌──────────────┐
│ a  ──┼───────>│ { count: 0 } │
│ b  ──┼───────>│              │  both point at ONE object
└──────┘        └──────────────┘
b.count = 5;
console.log(a.count); // 5 -> a and b are the same heap object
Beginner trap: "I copied the array, so the original is safe." const copy = original copies the reference, not the data, so mutating copy mutates original. To actually duplicate, you need a shallow copy ([...original], { ...original }) or a deep copy (structuredClone(original)) for nested data. Confusing a reference copy with a value copy is behind a huge share of "why did my other variable change?" bugs.
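To make the three kinds of "copy" concrete, here is a minimal sketch. The object shape and variable names are illustrative assumptions; structuredClone is a global in current Node versions.

```javascript
// Reference copy vs shallow copy vs deep copy.
const original = { name: "Alice", tags: ["admin"] };

const sameObject = original;            // copies the reference only
const shallow = { ...original };        // new top-level object, nested values still shared
const deep = structuredClone(original); // fully independent copy

sameObject.name = "Bob";     // mutates original too: same heap object
shallow.tags.push("editor"); // the nested array is shared with original
deep.tags.push("guest");     // touches only the deep copy

console.log(original.name); // "Bob"
console.log(shallow.name);  // "Alice"
console.log(original.tags); // [ 'admin', 'editor' ]
console.log(deep.tags);     // [ 'admin', 'guest' ]
```

The shallow copy protects top-level fields only, which is why nested data usually calls for a deep copy.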
Why the stack is small and the heap is large
The stack has a strict, small size limit (a few hundred KB to about 1 MB by default, tunable with --stack-size), because each function call must reserve a frame and the runtime needs that allocation to be instant and predictable. The heap is far larger and grows on demand. This is why a deeply recursive function overflows the stack (too many frames) while a giant array fills the heap (too many objects): same word "too much memory," two entirely different regions.
Beginner trap: "stack overflow" vs "out of memory" are different failures. A RangeError: Maximum call stack size exceeded means you pushed too many function frames onto the small stack, almost always from infinite or very deep recursion. A JavaScript heap out of memory means the heap filled up, usually from holding onto too many objects. They sound similar but have completely different causes and fixes: the first is a control-flow problem (fix the recursion, or convert it to a loop or an explicit queue), the second is a memory-retention problem (find what you are holding onto).
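As a rough illustration of the "convert it to a loop" fix, the sketch below walks the same deeply nested structure recursively and iteratively. The helper names and the depth of one million are assumptions chosen so that the recursive version overflows a default-sized stack.

```javascript
// Build a linked chain `depth` levels deep (lives on the heap, no recursion needed).
function buildNested(depth) {
  let node = null;
  for (let i = 0; i < depth; i++) node = { next: node };
  return node;
}

// Recursive walk: one stack frame per level, so deep input overflows the stack.
function depthRecursive(node) {
  return node === null ? 0 : 1 + depthRecursive(node.next);
}

// Iterative walk: a single frame, no matter how deep the data is.
function depthIterative(node) {
  let depth = 0;
  for (let cur = node; cur !== null; cur = cur.next) depth++;
  return depth;
}

const chain = buildNested(1_000_000);

let recursiveError = null;
try {
  depthRecursive(chain);
} catch (err) {
  recursiveError = err;
}

console.log(recursiveError instanceof RangeError); // true: call stack exhausted
console.log(depthIterative(chain));                // 1000000
```

The data is identical in both cases and fits in the heap comfortably; only the control flow decides whether the stack survives.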
How Garbage Collection Actually Works
JavaScript does not make you free memory by hand the way C does. Instead V8 runs a garbage collector (GC): it periodically finds objects that can no longer be reached by your running program and reclaims their memory. The whole system rests on one observation about real programs, called the generational hypothesis: most objects die young. A request handler creates dozens of temporary objects that are garbage milliseconds later, while a few things (your config, your cache, your database pool) live for the entire process.
V8 leans into this by splitting the heap into two generations and collecting them differently.
New Space (the young generation)
New objects are born here. It is small, and it is collected very frequently with a fast algorithm called Scavenge. Scavenge divides New Space in half: objects are allocated into one half, and when it fills, the collector copies the survivors into the other half and wipes the first half wholesale. Because most young objects are already dead by collection time, there are usually few survivors to copy, so this is cheap and quick. An object that survives a couple of these rounds is considered "tenured" and gets promoted to Old Space.
Analogy. New Space is the kitchen counter during cooking. You churn through scraps constantly, and every few minutes you sweep the whole counter clean, keeping only the few things still in use. It stays fast precisely because almost everything on the counter is already trash by the time you sweep.
Old Space (the old generation)
Objects that survived long enough live here. This region is larger and collected far less often, using Mark-Sweep-Compact: the collector marks every object still reachable from your program, sweeps away everything unmarked, and occasionally compacts the survivors together to avoid fragmentation. This is more expensive than Scavenge, which is why V8 tries hard to keep short-lived objects from ever reaching Old Space.
Analogy. Old Space is the warehouse. You do not inventory it every few minutes; that would be far too slow. You do a big, thorough audit occasionally: walk every aisle, tag what is still claimed, haul out everything untagged, and slide the remaining boxes together so there are no awkward gaps.
What "reachable" means
The collector starts from a set of roots (global objects, the current call stack, and similar) and follows every reference outward. Anything it can reach is alive; anything it cannot reach is garbage. This is the crucial mental model for leaks: an object is kept alive as long as something still references it, even if your program will never actually use it again. A leak in Node is almost never "GC failed to run"; it is "you are still unintentionally referencing things you are done with."
Interview answer: "How does garbage collection work in Node.js?" V8 uses a generational, mark-and-sweep collector based on the idea that most objects die young. New objects go into a small New Space collected frequently with a fast copying algorithm called Scavenge, which keeps only the survivors. Objects that live long enough are promoted to a larger Old Space, collected less often with Mark-Sweep-Compact, which marks everything reachable from the roots, sweeps away the rest, and compacts to reduce fragmentation. An object is reclaimed only when nothing references it anymore, so memory leaks happen when code unintentionally keeps references to objects it no longer needs.
Why GC matters for performance: stop-the-world pauses
Here is the part that turns into a production issue. Some GC work, especially major collections in Old Space, requires pausing your JavaScript while it runs, because the collector cannot safely move objects around while your code is also touching them. These are "stop-the-world" pauses. Modern V8 (its collector is called Orinoco) does a lot of the work concurrently and incrementally on background threads to keep pauses small, but they are never zero. On a busy server, a long major GC pause shows up as a latency spike: most requests are fast, but the unlucky ones that land during a pause are slow, which is why GC trouble usually appears in your p99 latency, not your average. (p99, the 99th percentile, is the response time that 99 percent of requests come in under; it captures the slow unlucky 1 percent that an average hides.)
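To see why a pause hides in the average but shows in p99, here is a small sketch using a nearest-rank percentile. The latency figures are invented for illustration: 980 fast requests and 20 that happened to land on a hypothetical 250 ms pause.

```javascript
// Nearest-rank percentile: the smallest sample that at least p percent
// of all samples are less than or equal to.
function percentile(samples, p) {
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

// Illustrative data only: 98% of requests take 10 ms, 2% take 250 ms.
const latenciesMs = [...Array(980).fill(10), ...Array(20).fill(250)];
const mean = latenciesMs.reduce((sum, ms) => sum + ms, 0) / latenciesMs.length;

console.log(mean.toFixed(1));             // 14.8
console.log(percentile(latenciesMs, 50)); // 10
console.log(percentile(latenciesMs, 99)); // 250
```

The mean barely moves off the fast path, while p99 reports the full pause, which is why dashboards that only chart averages tend to miss GC trouble.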
Beginner trap. "Garbage collection is automatic, so I never have to think about memory." Automatic collection frees you from manually freeing memory, but it does not free you from managing references. If you hold references too long you leak; if you churn huge numbers of objects you make GC work harder and cause pauses. Automatic does not mean free.
The Memory Limit: max-old-space-size and the Heap Ceiling
V8 does not let the heap grow without bound. Old Space in particular has a ceiling, and when a major collection cannot free enough room to stay under it, the process dies with the famous FATAL ERROR: Reached heap limit / JavaScript heap out of memory.
The default ceiling depends on the Node version and the machine, and the exact number is not something to memorize, because you can always ask V8 directly:
const v8 = require("node:v8");
const limitGB = v8.getHeapStatistics().heap_size_limit / 1024 ** 3;
console.log(`V8 heap size limit: ${limitGB.toFixed(2)} GB`);
You raise the ceiling with the --max-old-space-size flag, in megabytes:
# Allow up to ~4 GB of old-space heap
node --max-old-space-size=4096 server.js
# Commonly set via env var so it applies to npm scripts and tooling
NODE_OPTIONS="--max-old-space-size=4096" npm run build
The single most important point in this guide. Raising --max-old-space-size does not fix a memory leak. If your code keeps accumulating references, a bigger heap only means the process takes longer to fill up before it crashes with the exact same error. Increasing the limit is the right move only when your workload genuinely needs more memory at once (large data processing, big builds). When memory climbs steadily under steady load, that is a leak, and the bigger heap just delays the inevitable while making each GC pause longer. Diagnose first; resize second, and only if the data says so.
The container trap
This one bites almost everyone who deploys to Kubernetes or Docker. Node has been container-aware since version 12, meaning it reads the container's memory limit (its cgroup limit) and sizes the heap accordingly: roughly 50% of the container's memory up to about 4 GiB, leveling off near a 2 GB heap beyond that, when you do not set the flag yourself. The trap appears when these two limits disagree.
If your container is capped at 512 MB but you launch Node with --max-old-space-size=2048, you have told V8 it may use 2 GB of heap inside a box that the orchestrator will kill at 512 MB. V8 happily grows the heap, the container blows past its cgroup limit, and the kernel's OOM killer terminates the process before V8's own limit is ever reached. The confusing symptom: your app dies with a generic OOMKilled (exit code 137) and no nice V8 heap-limit error, because Node never got the chance to report one.
Analogy. The cgroup limit is the weight rating of an elevator; --max-old-space-size is how much you personally decide to load onto the cart you push into it. If you load the cart heavier than the elevator's rating, it does not matter that your cart could hold more; the elevator's safety system stops everything. Always keep your heap setting comfortably under the container's memory limit, leaving headroom for the stack, buffers, and non-heap memory.
Interview answer: "Why does my Node container get OOMKilled even though the app seems fine?" Almost always because the V8 heap limit and the container memory limit are out of sync. Node sizes its heap from the cgroup limit by default, but if --max-old-space-size (or NODE_OPTIONS) sets a heap larger than the container allows, V8 will grow past the container's cap and the kernel's OOM killer ends the process with exit code 137, before V8 reports its own heap error. The fix is to set the heap limit below the container limit with headroom for non-heap memory, or to leave it unset and let Node's container awareness size it.
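One way to catch the mismatch at startup is to compare V8's heap limit against the memory cap the process actually runs under. This is a sketch under assumptions: the helper names are hypothetical, the 25% headroom is an arbitrary starting point, and process.constrainedMemory() is only present on newer Node releases, so the code falls back to total machine RAM when it is missing or reports no limit.

```javascript
const v8 = require("node:v8");
const os = require("node:os");

// The cap this process lives under: the cgroup limit if one is visible,
// otherwise the machine's total RAM.
function memoryCapBytes() {
  const constrained =
    typeof process.constrainedMemory === "function" ? process.constrainedMemory() : 0;
  const total = os.totalmem();
  return constrained > 0 && constrained < total ? constrained : total;
}

// Pure check: does the heap limit leave the requested headroom under the cap?
// headroomFraction is an assumption; tune it for Buffer-heavy workloads.
function heapFitsUnderCap(heapLimitBytes, capBytes, headroomFraction = 0.25) {
  return heapLimitBytes <= capBytes * (1 - headroomFraction);
}

const MB = 1024 * 1024;
console.log(heapFitsUnderCap(2048 * MB, 512 * MB)); // false: the trap described above
console.log(heapFitsUnderCap(384 * MB, 512 * MB));  // true: 384 MB is 75% of 512 MB

const heapLimit = v8.getHeapStatistics().heap_size_limit;
if (!heapFitsUnderCap(heapLimit, memoryCapBytes())) {
  console.warn("V8 heap limit leaves too little headroom under the memory cap");
}
```

Logging this once at boot turns a silent exit code 137 into a warning you can act on before traffic arrives.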
Reading process.memoryUsage()
Before you can diagnose anything, you need to read Node's own memory report. process.memoryUsage() returns an object whose fields each mean something specific, and confusing them sends people down wrong paths.
console.log(process.memoryUsage());
// {
// rss: 215_482_368, // total memory the OS gave the process
// heapTotal: 138_412_032, // heap V8 has reserved
// heapUsed: 119_530_104, // heap actually in use by your objects
// external: 8_220_310, // memory for C++ objects bound to JS (e.g. Buffers)
// arrayBuffers: 1_540_096 // subset of external: ArrayBuffer/Buffer memory
// }
What each one tells you:
- rss (Resident Set Size) is the total physical memory the operating system has handed your process, including the heap, the stack, and Node's own C++ machinery. This is the number your container limit is actually measured against, so it is what gets you OOMKilled.
- heapTotal is how much heap V8 has reserved from the OS so far. It grows as needed.
- heapUsed is how much of that heap your live JavaScript objects actually occupy. This is the number to watch over time for leaks: if it climbs steadily and never comes back down under steady load, you are leaking.
- external is memory used by C++ objects tied to your JavaScript, most commonly Buffers and other binary data. A leak here will not show in heapUsed but will still grow rss.
- arrayBuffers is the slice of external specifically for ArrayBuffer and Buffer allocations.
Beginner trap: watching rss to find a JavaScript leak. rss is noisy: it includes non-heap memory, it rarely shrinks even after objects are freed (the OS often lets a process keep memory it might reuse), and it is affected by buffers and native code. For a JavaScript object leak, watch heapUsed across time under stable load. Use rss to understand total footprint and container pressure, not to pinpoint a leak.
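A quick way to convince yourself that Buffer memory sits outside the V8 heap is to allocate one large Buffer and compare the fields before and after. This is only a sketch: the 64 MiB size is arbitrary, and the exact deltas vary from run to run, so no specific output is claimed.

```javascript
const before = process.memoryUsage();

// 64 MiB of zero-filled binary data, kept referenced so it cannot be collected.
const held = Buffer.alloc(64 * 1024 * 1024);

const after = process.memoryUsage();
const toMb = (bytes) => Math.round(bytes / 1024 / 1024);

// Expect the arrayBuffers delta to be roughly the Buffer's size,
// while the heapUsed delta stays small by comparison.
console.log(`heapUsed delta:     ${toMb(after.heapUsed - before.heapUsed)} MB`);
console.log(`arrayBuffers delta: ${toMb(after.arrayBuffers - before.arrayBuffers)} MB`);
console.log(`buffer length:      ${toMb(held.length)} MB`);
```

If a service's rss grows while heapUsed stays flat, this is the pattern to suspect.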
Node Memory on a Real Server: EC2, the OS, and Cluster Workers
The container trap is really one instance of a bigger truth: your Node process uses more memory than its V8 heap, and the host kills you based on the bigger number, not the heap. Seeing this clearly on a raw virtual machine like an EC2 instance makes the whole picture click.
Picture four nested boxes, each sitting inside the next:
┌─────────────────────────────────────────────────────────┐
│ EC2 instance RAM (e.g. t3.medium = 4 GB total) │
│ ┌──────────────────────────────────────────────────┐ │
│ │ OS, kernel, system agents, filesystem page cache │ │
│ └──────────────────────────────────────────────────┘ │
│ ┌──────────────────────────────────────────────────┐ │
│ │ Your Node PROCESS (this whole box = RSS) │ │
│ │ ┌────────────────────────────────────────────┐ │ │
│ │ │ V8 Heap: your JS objects (New + Old Space) │ │ │
│ │ │ capped by --max-old-space-size │ │ │
│ │ └────────────────────────────────────────────┘ │ │
│ │ + Buffers / ArrayBuffers (external, NOT in heap) │ │
│ │ + stack + compiled code + native addons + libuv │ │
│ └──────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────┘
The V8 heap is just one box inside the process. The number the operating system actually sees, and the number that gets you killed, is RSS (the whole process box). And the instance RAM is shared: the OS, the kernel, your logging and monitoring agents, and the filesystem page cache all take a slice, so not all of a 4 GB box is yours. Realistically maybe 3 to 3.5 GB is usable by your app.
Why heapUsed can look fine while the box dies
The fields from the previous section map directly onto that diagram. heapUsed is only the innermost box. external and arrayBuffers (Buffers, file and network data, streamed uploads) live outside the V8 heap in C++ memory, so they are not governed by --max-old-space-size, yet they still count toward rss and therefore toward instance RAM. This is how a service with a perfectly flat heapUsed still exhausts its instance: a Buffer-heavy workload (image processing, large uploads, streaming) piled up hundreds of megabytes of external memory that the heap limit never watched.
Analogy. The V8 heap is the trunk of your car, and --max-old-space-size is a rule about how full the trunk may get. But the car's total weight (RSS) also includes passengers, fuel, and roof cargo (Buffers, native memory, code). The bridge's weight limit (instance RAM) is checked against the whole car, not just the trunk. You can obey the trunk rule perfectly and still be too heavy for the bridge.
How a raw EC2 instance differs from a container
On a raw instance with Node running directly (via systemd, pm2, or just node server.js), Node's container-awareness reads the whole instance's RAM and sizes its default heap from that (roughly 50% up to the cap, as covered above). The difference from a container is what happens at the limit. A container has a cgroup cap scoped to it, and exceeding it gets that container OOM-killed with exit code 137. On a raw instance there is no per-process cap; instead, when the whole instance runs low on RAM, the Linux kernel's OOM killer wakes up and terminates whatever process it judges worst, often your Node process, sometimes something else entirely, possibly after the box has already started swapping and slowing down.
The cluster-worker multiplier
This is the EC2 mistake that surprises people most. To use all the cores on an instance, teams run multiple Node processes with cluster or pm2 in cluster mode. But each worker is a full process with its own heap, so running workers multiplies memory usage. Four workers on a 4-core, 4 GB instance, each defaulting to a roughly 2 GB old-space heap, has theoretically authorized about 8 GB of heap on a 4 GB box. They will not all fill at once, but under load they can collectively blow past the instance RAM and trigger the kernel OOM killer. You must divide the memory budget across workers rather than give each the full-instance default.
# 4 GB instance, 4 cluster workers: budget ~700 MB heap EACH, not the 2 GB default
NODE_OPTIONS="--max-old-space-size=700" pm2 start server.js -i 4
A concrete walkthrough
Take a t3.small (2 GB RAM) running one Node API:
- The instance boots; the OS and agents consume about 400 MB, leaving roughly 1.6 GB usable.
- Node starts. Container-awareness sees 2 GB and defaults the old-space heap near 1 GB.
- The app's live objects settle at heapUsed around 300 MB, and rss sits around 450 MB (heap plus code plus stacks). Healthy.
- Traffic spikes with large file uploads, each buffering a 20 MB file. Thirty concurrent uploads is about 600 MB of external Buffer memory. heapUsed barely moves because that is not heap, but rss jumps toward 1.1 GB.
- That leaves only about 500 MB of the 1.6 GB usable (the 400 MB of OS overhead is already spoken for), and a slightly larger burst of uploads consumes it. The kernel OOM killer fires and kills Node, with no V8 heap out of memory error in the logs, because the heap was never the problem. The process RSS outgrew the instance RAM.
Observing it on the box
# Per-process resident memory (rss is in KB)
ps -o pid,rss,comm -p $(pgrep -f "node server.js")
# Whole-instance memory picture
free -m # total / used / free / available
top # watch the RES column for your node process
# Was a process OOM-killed by the kernel? Check the kernel log:
dmesg | grep -i "killed process"
And from inside Node, log the breakdown so you can see which region is growing:
setInterval(() => {
const m = process.memoryUsage();
const mb = (n) => (n / 1024 / 1024).toFixed(0);
console.log(
`rss=${mb(m.rss)}MB heapUsed=${mb(m.heapUsed)}MB ` +
`external=${mb(m.external)}MB arrayBuffers=${mb(m.arrayBuffers)}MB`
);
}, 10000);
If rss climbs while heapUsed stays flat, look at external and arrayBuffers (Buffers, streams). If heapUsed climbs too, it is a heap leak (the four patterns in the next section). If everything inside Node is flat but the box still runs out, something else on the instance is eating the RAM.
The practical rules for sizing Node on a VM
- Size the heap below usable instance RAM, not total RAM. Leave headroom for the OS, agents, buffers, and native memory. A rough single-process starting point is --max-old-space-size around 60 to 75 percent of (instance RAM minus OS overhead).
- Divide the budget across cluster workers. With N workers, each gets roughly one Nth of the app's memory budget, not the full-instance default.
- Watch RSS against instance RAM, because that comparison is what decides whether the kernel OOM killer fires, not the heap number.
- Give buffer-heavy work extra headroom. Streaming, uploads, and image processing grow external memory invisibly to the heap metrics.
- Right-size the instance or scale out. If RSS legitimately needs more than the box offers, a bigger instance or more instances behind a load balancer is the answer, not just raising the heap flag, which never fixes a real leak anyway.
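The sizing rules above reduce to simple arithmetic, sketched here. Every input is an assumption you supply (the function name, the 400 MB overhead figure, and the 0.75 fraction are illustrative); nothing in it inspects the real machine.

```javascript
// Per-worker --max-old-space-size in MB, following the rules above:
// (instance RAM - OS overhead) * heap fraction, split across workers.
function perWorkerHeapMb({ instanceMb, osOverheadMb, workers, heapFraction = 0.75 }) {
  const usableMb = instanceMb - osOverheadMb;
  // The remaining (1 - heapFraction) is headroom for Buffers, stacks, and native memory.
  const heapBudgetMb = usableMb * heapFraction;
  return Math.floor(heapBudgetMb / workers);
}

const single = perWorkerHeapMb({ instanceMb: 2048, osOverheadMb: 400, workers: 1 });
const clustered = perWorkerHeapMb({ instanceMb: 4096, osOverheadMb: 400, workers: 4 });

console.log(`--max-old-space-size=${single}`);    // --max-old-space-size=1236
console.log(`--max-old-space-size=${clustered}`); // --max-old-space-size=693
```

The clustered result lands close to the roughly 700 MB per worker used earlier for four workers on a 4 GB instance. For Buffer-heavy services, a lower heapFraction is a reasonable adjustment.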
Interview answer: "How does a Node app use the memory on an EC2 instance, and why might it get killed?" A Node process's total memory (its RSS) is the V8 heap plus external memory like Buffers, plus stack, compiled code, native addons, and thread-pool stacks. The instance's RAM is shared with the OS and agents, so only part is usable. The kernel kills based on RSS against available instance RAM, not against the V8 heap limit, so an app can have a flat heapUsed and still be OOM-killed because Buffer-heavy work grew external memory, or because several cluster workers each took a full default heap and collectively exceeded the box. The fixes are to size the heap below usable RAM, divide that budget across workers, watch RSS rather than just the heap, and scale the instance when the workload genuinely needs it.
The Classic Memory Leaks
A memory leak in Node is not the collector malfunctioning. It is your code holding references to things it is finished with, so those things are still "reachable" and can never be collected. Four patterns cause the overwhelming majority of real leaks.
1. Module-level collections that only grow
A Map, array, or object declared at module scope lives for the entire process. If you keep adding to it and never remove, it grows forever.
// ❌ Leak: every request adds an entry that is never removed.
const cache = new Map();
app.get("/user/:id", async (req, res) => {
const user = await db.getUser(req.params.id);
cache.set(req.params.id, user); // grows without bound, forever
res.json(user);
});
// ✅ Bound it: cap the size, or use a real cache with eviction.
// LRU (Least Recently Used) drops the entry untouched for the longest;
// TTL (Time To Live) drops entries after a fixed age.
const cache = new Map();
function remember(key, value) {
if (!cache.has(key) && cache.size >= 10_000) {
cache.delete(cache.keys().next().value); // evict the oldest-inserted entry (FIFO, not true LRU)
}
cache.set(key, value);
}
The naive-cache trap. A plain Map used as a cache with no eviction policy is probably the most common Node leak in the wild. A cache must have a bound: a maximum size, a time-to-live, or both. "Cache forever" is just "leak slowly."
2. Listeners and subscriptions you never remove
Every .on() adds a listener that holds a reference to its callback (and everything that callback closes over). Add them per request without removing them and they pile up.
// ❌ Leak: a new listener every request, never removed.
app.get("/stream", (req, res) => {
emitter.on("data", (chunk) => res.write(chunk)); // accumulates forever
});
// ✅ Remove it when the request ends.
app.get("/stream", (req, res) => {
const onData = (chunk) => res.write(chunk);
emitter.on("data", onData);
res.on("close", () => emitter.off("data", onData)); // clean up
});
Node's MaxListenersExceededWarning (it fires when an 11th listener is added for the same event on one emitter, because the default limit is 10) is usually a real leak warning, not noise to silence by raising the limit.
3. Timers that are never cleared
setInterval keeps its callback, and everything that callback references, alive for as long as the interval runs. Start intervals tied to a connection or object without clearing them and you leak.
// ❌ Leak: the interval (and everything `bigData` references) lives forever.
function startPolling(bigData) {
setInterval(() => check(bigData), 1000); // never cleared
}
// ✅ Keep the handle and clear it when done.
function startPolling(bigData) {
const id = setInterval(() => check(bigData), 1000);
return () => clearInterval(id); // caller stops it when finished
}
4. Closures that capture more than you think
A closure keeps alive every variable it references from its enclosing scope. A long-lived closure that captures a large object pins that object in memory even if it only uses one small field of it.
// ❌ The handler closes over the entire `hugePayload` just to read one field.
function register(hugePayload) {
emitter.on("tick", () => log(hugePayload.id)); // pins all of hugePayload
}
// ✅ Capture only what you need.
function register(hugePayload) {
const id = hugePayload.id; // extract the small piece
emitter.on("tick", () => log(id)); // huge payload can now be collected
}
Interview answer: "What are the common causes of memory leaks in Node.js?" The big four are unbounded module-level collections (a Map or array used as a cache with no eviction), event listeners and subscriptions added repeatedly and never removed, timers (setInterval) that are never cleared, and long-lived closures that capture large objects. They share one root cause: code keeps a reference to data it is finished with, so the garbage collector cannot reclaim it because the data is still reachable. The fix is always to drop the reference: bound the cache, remove the listener, clear the timer, or capture only the small piece you need.
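For the "bound the cache" fix, a minimal sketch of a cache with both a size cap (LRU) and a time-to-live might look like the following. BoundedCache is a hypothetical class written for illustration, not a library API; it relies on a Map iterating in insertion order, and it takes an injectable clock so the example is deterministic. In production, a maintained package is usually the safer choice.

```javascript
class BoundedCache {
  constructor({ maxEntries, ttlMs, now = Date.now }) {
    this.maxEntries = maxEntries;
    this.ttlMs = ttlMs;
    this.now = now;
    this.entries = new Map(); // key -> { value, expiresAt }
  }

  get(key) {
    const entry = this.entries.get(key);
    if (entry === undefined) return undefined;
    if (entry.expiresAt <= this.now()) {
      this.entries.delete(key); // expired: drop the reference so it can be collected
      return undefined;
    }
    // Re-insert so this key becomes the most recently used one.
    this.entries.delete(key);
    this.entries.set(key, entry);
    return entry.value;
  }

  set(key, value) {
    this.entries.delete(key); // an update moves the key to the newest position
    if (this.entries.size >= this.maxEntries) {
      // The first key in iteration order is the least recently used.
      this.entries.delete(this.entries.keys().next().value);
    }
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }

  get size() {
    return this.entries.size;
  }
}

let clock = 0; // fake clock, in milliseconds
const cache = new BoundedCache({ maxEntries: 2, ttlMs: 1000, now: () => clock });
cache.set("a", 1);
cache.set("b", 2);
cache.get("a");    // "a" is now the most recently used
cache.set("c", 3); // cache is full, so "b" (least recently used) is evicted

console.log(cache.get("b")); // undefined
console.log(cache.size);     // 2
clock = 1500;
console.log(cache.get("a")); // undefined (older than the 1000 ms TTL)
```

Note that this sketch removes expired entries only when they are read; the size cap is what guarantees the bound.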
The "let it be collected" tools: WeakMap, WeakRef, AbortController
These three exist specifically to avoid the leaks above by not holding things alive longer than needed.
A WeakMap (and WeakSet) holds its keys weakly: if the only thing referencing a key object is the WeakMap, the garbage collector is still free to reclaim it, and the entry disappears automatically. This makes a WeakMap perfect for attaching metadata to objects you do not own the lifecycle of, because you never have to remember to delete entries.
// Cache derived data keyed by an object, without pinning that object alive.
const parsedCache = new WeakMap();
function getParsed(reqObject) {
if (parsedCache.has(reqObject)) return parsedCache.get(reqObject);
const parsed = expensiveParse(reqObject);
parsedCache.set(reqObject, parsed);
return parsed; // when reqObject is GC'd, its cache entry vanishes on its own
}
A WeakRef lets you reference an object without keeping it alive, for advanced caching where you want "use it if it still exists, otherwise rebuild." It is a sharp tool used rarely; the honest interview answer is that you reach for WeakMap often and WeakRef almost never.
An AbortController is the modern way to cancel in-flight async work (a fetch, a stream, a timer) so it does not linger and leak. You pass its signal into the operation and call abort() to stop it, which is the clean fix for "the user navigated away but the request and its callbacks are still pending."
const controller = new AbortController();
// The fetch is cancellable; aborting frees it and rejects the promise.
fetch("https://api.example.com/slow", { signal: controller.signal })
.then(res => res.json())
.catch(err => {
if (err.name === "AbortError") return; // expected on cancel
throw err;
});
// Cancel it (e.g. on request close, timeout, or component teardown):
controller.abort();
Beginner trap: a plain Map cache keyed by objects leaks; a WeakMap does not. If you key a regular Map by request or user objects and never delete entries, those objects can never be collected, because the Map holds them strongly forever. Switching to a WeakMap lets them go the moment nothing else needs them. Use a WeakMap whenever the key's lifetime should decide the entry's lifetime.
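The same AbortController idea also works for the timer leak from earlier: let an AbortSignal own the interval's lifetime, so one abort() call tears it down. The startPolling helper below is a hypothetical sketch, not a Node API.

```javascript
// Start an interval that is cleared automatically when `signal` aborts.
function startPolling(task, intervalMs, signal) {
  if (signal.aborted) return { isActive: () => false }; // already cancelled: never start

  let active = true;
  const id = setInterval(task, intervalMs);

  signal.addEventListener(
    "abort",
    () => {
      clearInterval(id); // releases the callback and everything it closes over
      active = false;
    },
    { once: true }
  );

  return { isActive: () => active };
}

const controller = new AbortController();
const poller = startPolling(() => {}, 1000, controller.signal);

console.log(poller.isActive()); // true
controller.abort();             // e.g. on request close or service shutdown
console.log(poller.isActive()); // false
```

Passing one signal to several operations (a fetch, a poller, a stream) gives a single place to cancel all of them together.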
Finding a Leak: Heap Snapshots
When heapUsed climbs steadily and you need to know what is accumulating, you take heap snapshots: full pictures of every object on the heap, which you compare over time to see what is growing.
The workflow most people use:
# Start the app with the inspector open
node --inspect server.js
# Then open chrome://inspect in Chrome, click "inspect", go to the Memory tab,
# and take heap snapshots. The key technique is the COMPARISON:
# 1. Take snapshot A after warm-up.
# 2. Exercise the suspected path many times (e.g. hammer an endpoint).
# 3. Take snapshot B.
# 4. Compare B to A and sort by "Delta": what grew is your leak suspect.
You can also capture snapshots programmatically, which is handy on a server you cannot attach a debugger to, and Node can even dump one automatically right before it would crash from the heap limit:
const v8 = require("node:v8");
// Writes a .heapsnapshot file you can load into Chrome DevTools later.
v8.writeHeapSnapshot();

# Dump a snapshot automatically when the process approaches the heap limit,
# so you can inspect what filled it up right before the OOM crash.
node --heapsnapshot-near-heap-limit=2 server.js
Analogy. A single heap snapshot is a photograph of a messy room; you cannot tell what is accumulating from one photo. Two snapshots taken before and after some activity are a "spot the difference" pair: whatever is bigger in the second photo is what your code is piling up. The comparison is the whole technique; a lone snapshot rarely tells you much.
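On a server with no debugger attached, one common arrangement is to write a snapshot when the process receives a signal. Node ships a --heapsnapshot-signal flag for this; the sketch below is an in-process equivalent for when you want control over the file location. The helper names, the file naming scheme, and the choice of SIGUSR2 are all assumptions.

```javascript
const v8 = require("node:v8");
const os = require("node:os");
const path = require("node:path");

// Hypothetical naming scheme: pid plus timestamp, so repeated dumps do not collide.
function snapshotPath(dir, pid, date) {
  const stamp = date.toISOString().replace(/[:.]/g, "-");
  return path.join(dir, `heap-${pid}-${stamp}.heapsnapshot`);
}

// SIGUSR2 is an arbitrary choice; SIGUSR1 is reserved by Node for the inspector.
function installSnapshotSignal(signal = "SIGUSR2", dir = os.tmpdir()) {
  const handler = () => {
    // writeHeapSnapshot is synchronous: it blocks the event loop while it runs
    // and needs spare memory, so trigger it deliberately.
    const file = v8.writeHeapSnapshot(snapshotPath(dir, process.pid, new Date()));
    console.error(`heap snapshot written to ${file}`);
  };
  process.on(signal, handler);
  return () => process.off(signal, handler); // call to uninstall the handler
}

const uninstall = installSnapshotSignal();
// From a shell on the box: kill -USR2 <pid>
```

Send the signal once after warm-up and again after the suspected path has been exercised, then compare the two files in DevTools.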
Other tools worth naming. --trace-gc prints a line for every garbage collection so you can see how often and how long GC runs (helpful for spotting GC thrash). --prof produces a V8 profile for CPU hotspots. The clinic suite (clinic doctor, clinic heapprofiler, clinic flame) automates much of this and produces readable reports, and is a common answer to "what tools do you use to diagnose Node performance?"
A leak hunt, start to finish
Knowing the tools exist is different from knowing the loop. Here is the whole investigation as you would actually run it, so the pieces connect:
- Notice. Your dashboard (or the in-process logger from earlier) shows heapUsed climbing slowly across hours and never dropping back under steady traffic, then the service restarts itself every so often with heap out of memory. That steady upward slope, not a spike, is the signature of a leak rather than a load burst.
- Confirm it is the heap. Watch heapUsed specifically. If heapUsed is flat but rss climbs, it is external/Buffer memory, not a classic object leak, and you would look at streams and Buffers instead. Here, heapUsed itself climbs, so it is a retained-object leak.
- Capture a baseline. Let the app warm up and reach steady state, then take heap snapshot A (in Chrome DevTools via --inspect, or with v8.writeHeapSnapshot()).
- Reproduce the growth. Drive the suspected path hard: replay a few thousand requests to the endpoint you suspect, so whatever is accumulating accumulates a lot. The leak needs to be big in the next snapshot to stand out.
- Capture and compare. Take heap snapshot B, then load it in DevTools and switch the view to Comparison against A, sorted by the size delta. The object type that grew by thousands of instances is your suspect: maybe User objects, or Array, or closures from a specific function.
- Find what is holding it. Select an instance of the leaking object and read its Retainers (the "retaining path"), which traces the chain of references keeping it alive all the way back to a GC root. That path points straight at the culprit: a module-level Map it was added to, an emitter still listening, an interval still running.
- Fix the reference and verify. Drop the reference (bound the cache, remove the listener, clear the timer), redeploy, and watch heapUsed flatten across the same load. A flat line under sustained traffic is the proof the leak is gone.
The mental shortcut. A heap snapshot answers "what is piling up," and the retaining path answers "who is holding it." Almost every leak hunt is those two questions in sequence, and the answer to the second is almost always one of the four classic causes.
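When the retaining path ends at a module-level Map used as a cache, "bound the cache" can be as small as the sketch below. It assumes a simple size cap with oldest-first eviction; the class name BoundedCache is hypothetical, and a production service might prefer an established LRU library or a TTL instead.

```javascript
// A cache that never holds more than maxEntries items.
// Map iterates in insertion order, so the first key is always the oldest.
class BoundedCache {
  constructor(maxEntries) {
    this.maxEntries = maxEntries;
    this.map = new Map();
  }

  get(key) {
    return this.map.get(key);
  }

  set(key, value) {
    this.map.delete(key); // re-inserting moves the key to the newest position
    this.map.set(key, value);
    if (this.map.size > this.maxEntries) {
      const oldestKey = this.map.keys().next().value;
      this.map.delete(oldestKey); // evict so the oldest entry can be collected
    }
  }

  get size() {
    return this.map.size;
  }
}
```

With the cap in place, the number of retained entries stops growing with traffic, which is what makes heapUsed flatten in the verification step.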
Event Loop Lag: The Other Way Node "Hangs"
Not every production problem is memory. The other big one is blocking the event loop. Because your JavaScript runs on one thread, any synchronous work that takes a long time stops everything: no other requests are served, no timers fire, no I/O callbacks run. The app is not crashed; it is frozen, and health checks may start failing as if it were down.
The usual culprits are CPU-bound work with no chance to yield: a huge synchronous loop, JSON.parse or JSON.stringify on a very large payload, synchronous crypto like crypto.pbkdf2Sync, reading a large file with readFileSync in a handler, or a catastrophically backtracking regular expression on hostile input.
You measure it by checking how late timers fire compared to when they were scheduled; that lateness is your event-loop lag:
const { monitorEventLoopDelay } = require("node:perf_hooks");
// Samples the loop delay every 20 ms and records it in a histogram.
const h = monitorEventLoopDelay({ resolution: 20 });
h.enable();
setInterval(() => {
  // The histogram reports nanoseconds; convert to milliseconds. Rising numbers mean the loop is blocking.
  console.log(`loop delay mean=${(h.mean / 1e6).toFixed(1)}ms max=${(h.max / 1e6).toFixed(1)}ms`);
  h.reset();
}, 5000);
The fix is architectural, not a flag. When the loop is blocked by CPU work, the answer is to get that work off the main thread: move it to a worker thread (worker_threads), break it into chunks that yield with setImmediate between them, push it to a separate service, or replace the algorithm (for example, stream and parse large JSON instead of JSON.parse-ing it whole). You cannot "tune" your way out of blocking; you have to stop blocking.
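The chunking option can be sketched as follows: do a bounded slice of the work, then hand control back to the event loop with setImmediate before continuing. The helper name processInChunks and the default chunk size of 1000 are assumptions for illustration; the right size depends on how expensive each item is.

```javascript
// Processes `items` in slices, yielding to the event loop between slices
// so pending I/O callbacks and timers get a turn.
function processInChunks(items, handleItem, chunkSize = 1000) {
  return new Promise((resolve, reject) => {
    let index = 0;

    function runChunk() {
      try {
        const end = Math.min(index + chunkSize, items.length);
        for (; index < end; index++) {
          handleItem(items[index], index);
        }
      } catch (err) {
        reject(err);
        return;
      }
      if (index < items.length) {
        setImmediate(runChunk); // yield, then pick up where we left off
      } else {
        resolve();
      }
    }

    runChunk();
  });
}
```

Total CPU time is unchanged; what changes is that no single turn of the loop runs longer than one chunk, so other requests are served in between.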
Interview answer: "How do you detect and fix a blocked event loop?" Detect it by measuring event-loop delay, either with perf_hooks' monitorEventLoopDelay or an APM tool (Application Performance Monitoring: a service like Datadog or New Relic that automatically tracks an app's runtime health and timing); steadily rising delay under load means synchronous work is hogging the thread. Find the culprit with a CPU profile (--prof or clinic flame), which highlights the long synchronous function. Fix it by moving CPU-bound work off the main thread with worker_threads, chunking long loops so they yield to the loop, replacing blocking calls with async ones, or offloading to another service. The single thread must stay free to keep the app responsive.
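For the worker_threads part of that answer, a minimal sketch might look like the following. It runs a deliberately synchronous pbkdf2Sync inside a worker so the main thread stays free. The inline worker source (eval: true) and the name hashInWorker are assumptions made to keep the example in one file; a real service would usually point Worker at a separate file and reuse a pool of workers rather than spawning one per call.

```javascript
const { Worker } = require("node:worker_threads");

// Worker code kept inline so the sketch is self-contained.
const workerSource = `
  const { parentPort, workerData } = require("node:worker_threads");
  const { pbkdf2Sync } = require("node:crypto");
  const { password, salt, iterations } = workerData;
  // Synchronous and CPU-heavy, but it only blocks this worker's thread.
  const key = pbkdf2Sync(password, salt, iterations, 32, "sha256");
  parentPort.postMessage(key.toString("hex"));
`;

// Resolves with the derived key as a hex string.
function hashInWorker(password, salt, iterations = 100_000) {
  return new Promise((resolve, reject) => {
    const worker = new Worker(workerSource, {
      eval: true,
      workerData: { password, salt, iterations },
    });
    worker.once("message", resolve);
    worker.once("error", reject);
    worker.once("exit", (code) => {
      if (code !== 0) reject(new Error(`worker exited with code ${code}`));
    });
  });
}
```

For this particular function the asynchronous crypto.pbkdf2 already runs off the main thread, so the worker is only a stand-in here for CPU-bound code that has no async variant.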
Other Production Failures Worth Knowing
A few more failures round out what interviewers (and real on-call shifts) throw at you.
Unhandled promise rejections. A rejected promise with no .catch triggers unhandledRejection. In modern Node (version 15 and later) the default is to treat this as fatal and crash the process, which is correct: an unhandled rejection means an error path you never accounted for. The fix is to handle rejections at their source, not to silence the warning globally.
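Handling a rejection at its source usually means one of two shapes: a try/catch around an awaited call, or a .catch on a promise nobody awaits. The sketch below shows both; loadUser is a hypothetical stand-in for any call that can reject, and the other names are illustrative.

```javascript
// Hypothetical data-access call that can reject.
async function loadUser(id) {
  if (id < 0) throw new Error(`invalid user id: ${id}`);
  return { id, name: "example" };
}

// Awaited call: catch where the failure can happen and return a normal result.
async function getUserOrNull(id) {
  try {
    return await loadUser(id);
  } catch (err) {
    console.error("loadUser failed:", err.message);
    return null;
  }
}

// Fire-and-forget call: without the .catch, a failure here would surface
// as an unhandledRejection and take the process down.
function warmCache(id) {
  loadUser(id).catch((err) => console.error("cache warm-up failed:", err.message));
}
```

The fire-and-forget case is the one that most often slips through, because there is no await to remind you that the promise can fail.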
Uncaught exceptions. A thrown error that no try/catch caught fires uncaughtException and, by default, crashes. The right posture is to log, perform fast cleanup, and exit, letting your process manager restart a clean instance. Treating uncaughtException as a "swallow everything and continue" handler is dangerous because the process may be in a corrupted state.
// `logger` stands for your app's logger (for example pino); console.error works as a fallback.
process.on("uncaughtException", (err) => {
  logger.fatal(err); // record what happened
  // optionally flush logs / close critical resources quickly
  process.exit(1); // then exit; let the supervisor restart a fresh process
});
File descriptor and connection exhaustion. Every open socket, file handle, and database connection consumes a finite operating-system resource. Leak them (open connections in a loop without closing, never returning pooled connections, unbounded outbound requests) and you eventually hit EMFILE: too many open files or exhaust your database pool, at which point new work stalls or errors even though CPU and memory look fine. The fixes are bounded connection pools, always releasing resources in a finally, and limiting outbound concurrency.
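Two of those fixes are small enough to sketch directly: capping how many outbound calls are in flight, and releasing a pooled resource in a finally. Both helpers are illustrative; mapWithLimit and withConnection are hypothetical names, and pool is assumed to be any object exposing acquire() and release(conn).

```javascript
// Runs fn over items with at most `limit` calls in flight at once.
async function mapWithLimit(items, limit, fn) {
  const results = new Array(items.length);
  let next = 0;

  // Each worker pulls the next unclaimed index until none are left.
  async function worker() {
    while (next < items.length) {
      const i = next++;
      results[i] = await fn(items[i], i);
    }
  }

  const workerCount = Math.min(limit, items.length);
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}

// Always hands the connection back, even when `work` throws.
async function withConnection(pool, work) {
  const conn = await pool.acquire();
  try {
    return await work(conn);
  } finally {
    pool.release(conn);
  }
}
```

The limiter bounds the number of sockets open at any moment, and the finally guarantees that an error path cannot strand a connection outside the pool.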
Beginner trap: assuming a crash-restart loop is "handled." Letting a process manager restart after a fatal error is correct, but if the underlying cause is a leak or a poison input, the fresh process hits the same wall and you get a crash loop: the service flaps up and down, dropping requests each cycle. Restart is a safety net for the unexpected, not a substitute for fixing a reproducible failure. Watch your restart count as a signal.
Monitoring in Production: CloudWatch and Beyond
Everything so far has been local diagnosis: attach a debugger, read process.memoryUsage(), take a snapshot. In production you cannot babysit one process; you need metrics flowing off every instance so you can see trouble building and get paged before the crash. On AWS that means CloudWatch, and there is one gotcha that catches almost everyone first.
The "no memory metric on a bare EC2 instance" gotcha
For a plain EC2 instance, CloudWatch shows metrics like CPU utilization, network, and disk I/O by default, but not memory usage or disk space. This is not an oversight: those built-in metrics come from the hypervisor, the software layer underneath your instance that carves one physical server into many virtual machines. The hypervisor can see hardware-level activity going into your instance (CPU cycles, network bytes) but cannot see inside the operating system to know how much RAM is actually used versus held as disk cache. Memory and disk usage are OS-level facts, so the hypervisor simply does not have them. (This is the bare EC2 story specifically. On ECS, memory utilization is provided for you, covered in the next subsection.)
The fix is the unified CloudWatch agent, a process you install on the instance that reads the OS memory subsystem (on Linux, /proc/meminfo) and pushes those numbers to CloudWatch as custom metrics under the CWAgent namespace. The setup is three steps: give the instance an IAM role with the CloudWatchAgentServerPolicy, install the agent, and point it at a config file listing the metrics you want.
A minimal CloudWatch agent config that reports memory and root-disk usage every 60 seconds:
{
  "metrics": {
    "append_dimensions": { "InstanceId": "${aws:InstanceId}" },
    "metrics_collected": {
      "mem": { "measurement": ["mem_used_percent"], "metrics_collection_interval": 60 },
      "disk": { "measurement": ["used_percent"], "resources": ["/"], "metrics_collection_interval": 60 }
    }
  }
}
After the agent starts, mem_used_percent appears in CloudWatch under the CWAgent namespace, keyed by your InstanceId. The crucial point connecting this to everything above: mem_used_percent is instance-wide memory pressure, the memory used by every process on the box measured against the instance's RAM, which is the number that decides whether the kernel OOM killer fires. It is the right metric to alarm on for "the box is about to run out," but it will not tell you whether the cause is the V8 heap, Buffers, or another process. For that you need app-level metrics.
On ECS, memory utilization is built in (the EC2 gotcha does not apply)
The "you must install an agent" rule is specific to a bare EC2 instance. If your Node app runs on Amazon ECS (the container service), CloudWatch gives you CPU and memory utilization automatically, with no CloudWatch agent. The reason is that ECS already runs the ECS container agent inside the box, and once a minute it measures the CPU and memory each running task is using and reports it to CloudWatch in the AWS/ECS namespace. There is already an in-OS agent doing the measuring, so the visibility problem is solved for you.
One subtlety worth knowing: ECS reports memory utilization as a percentage of the limit you declared in the task definition, not as a percentage of the physical machine. That is actually the number you want for catching an OOM kill, because an ECS task is killed when it hits its task-definition memory limit, so a MemoryUtilization climbing toward 100% is a climb toward that kill. As with any memory alarm, watch the Maximum statistic rather than the average, since a task that averages 60% but peaks at 95% is one burst away from being killed.
The behavior differs slightly by launch type (where the container actually runs):
| Where your Node app runs | Memory utilization by default? | Why, and what to add |
|---|---|---|
| Bare EC2 instance | No | No in-OS agent; the hypervisor cannot see OS memory. Install the CloudWatch agent (mem_used_percent, CWAgent namespace). |
| ECS on Fargate (serverless containers) | Yes, automatic | The ECS agent reports CPU and memory against the task-definition limit (AWS/ECS namespace). Nothing to install. |
| ECS on EC2 (containers on your own instances) | Yes, at service/task level | The ECS container agent provides task memory. But the underlying EC2 instance's own memory still needs the CloudWatch agent if you want it. |
For deeper, per-container detail rather than service averages, you enable Container Insights, a paid CloudWatch feature that publishes task- and container-level metrics (like MemoryUtilized) to the ECS/ContainerInsights namespace, with ready-made dashboards. It is the ECS equivalent of "I need to see inside each task, not just the service average."
Interview-ready summary. "AWS gives you memory metrics" is true for ECS and false for bare EC2, and that catches people. On EC2 you install the CloudWatch agent because the hypervisor cannot see OS memory; on ECS (Fargate or EC2 launch type) the ECS container agent already reports memory against the task-definition limit, so it is there by default. Either way, you still publish app-level metrics (heap, event-loop lag) from inside Node for cause-level insight, because the platform metric only tells you the box or task is full, not why.
Pushing Node's own numbers as custom metrics
To see inside the process (heap usage, event-loop lag), publish your own metrics from the app with the CloudWatch PutMetricData API. This turns the process.memoryUsage() fields and the event-loop delay from local curiosities into dashboardable, alarmable time series.
const { CloudWatchClient, PutMetricDataCommand } = require("@aws-sdk/client-cloudwatch");
const { monitorEventLoopDelay } = require("node:perf_hooks");
const cw = new CloudWatchClient({}); // region and credentials come from the environment
const loop = monitorEventLoopDelay();
loop.enable();
const mb = (n) => n / 1024 / 1024;
setInterval(async () => {
  const m = process.memoryUsage();
  try {
    await cw.send(new PutMetricDataCommand({
      Namespace: "MyApp/Node",
      // Batch related metrics into ONE call to cut cost and avoid throttling.
      MetricData: [
        { MetricName: "HeapUsedMB", Value: mb(m.heapUsed), Unit: "Megabytes" },
        { MetricName: "RssMB", Value: mb(m.rss), Unit: "Megabytes" },
        { MetricName: "ExternalMB", Value: mb(m.external), Unit: "Megabytes" },
        { MetricName: "EventLoopLagMs", Value: loop.mean / 1e6, Unit: "Milliseconds" },
      ],
    }));
  } catch (err) {
    // A failed publish must not become an unhandled rejection and crash the app.
    console.error("PutMetricData failed:", err);
  }
  loop.reset();
}, 60_000);
Cost note. CloudWatch bills custom metrics by the number of distinct metrics and by PutMetricData calls, so batch related metrics into a single call (as above) and pick a sensible interval like 60 seconds rather than every second. An alternative that avoids per-call cost is the embedded metric format (EMF): you write specially structured JSON to your logs and CloudWatch extracts metrics from it.
Alarms and the OOM signature in logs
Metrics are only useful if something watches them. Set CloudWatch alarms so a human gets paged while there is still time to act, not after the crash:
- mem_used_percent high for several minutes (say above 85 percent) catches the instance approaching its RAM ceiling before the OOM killer fires.
- HeapUsedMB trending up across hours is your leak alarm; pair it with a sustained-slope condition rather than a single spike.
- EventLoopLagMs above a threshold catches the event loop blocking before health checks start failing.
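The first alarm in that list can be expressed as PutMetricAlarm parameters. This is a sketch, not a recommended policy: the five-minute window, the Maximum statistic, and treating missing data as breaching are assumptions, and instanceId and snsTopicArn are placeholders supplied by the caller. The dimensions must match exactly what the agent publishes for the alarm to find the metric.

```javascript
// Builds parameters for an alarm that fires when the per-minute maximum of
// mem_used_percent stays above 85 for 5 consecutive minutes.
function memoryAlarmParams(instanceId, snsTopicArn) {
  return {
    AlarmName: `high-memory-${instanceId}`,
    Namespace: "CWAgent",
    MetricName: "mem_used_percent",
    Dimensions: [{ Name: "InstanceId", Value: instanceId }],
    Statistic: "Maximum",
    Period: 60, // seconds per datapoint
    EvaluationPeriods: 5, // "for several minutes"
    Threshold: 85,
    ComparisonOperator: "GreaterThanThreshold",
    TreatMissingData: "breaching", // assumption: a silent agent is itself a problem
    AlarmActions: [snsTopicArn],
  };
}

// Usage (needs @aws-sdk/client-cloudwatch and permission to create alarms):
//   const { CloudWatchClient, PutMetricAlarmCommand } = require("@aws-sdk/client-cloudwatch");
//   await new CloudWatchClient({}).send(new PutMetricAlarmCommand(memoryAlarmParams(id, arn)));
```

Requiring several consecutive breaching periods is what turns a momentary spike into a non-event and a sustained climb into a page.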
When a process does get OOM-killed, the evidence lives in logs, not metrics. On EC2 the kernel records it (dmesg | grep -i "killed process"); on ECS the task-stopped reason shows OutOfMemoryError and the container exits with code 137. Ship the instance logs to CloudWatch Logs so this is searchable after the fact, and so a metric filter can turn "Killed process" lines into an alarmable metric.
Where APM tools fit. APM stands for Application Performance Monitoring: a category of tools that attach to your running app and continuously record how it behaves, like how long each request takes, how often errors happen, how busy the event loop is, and how memory trends over time, then show it all on dashboards with alerting. Think of CloudWatch as monitoring the machine (CPU, RAM, disk) and an APM as monitoring the application running on it (routes, queries, code-level timing). Raw CloudWatch gives you metrics, logs, and alarms, but reading a leak's retaining path or a flame graph (a chart showing which functions consumed the most CPU time) is painful in it. APM tools (Datadog, New Relic, and Grafana are popular ones; OpenTelemetry is an open, vendor-neutral standard for the same job) add automatic Node instrumentation: per-route latency, event-loop lag, garbage-collection pauses, heap trends, and distributed traces (following a single request as it hops across multiple services), usually with far nicer leak and CPU views. The common production setup is CloudWatch for infrastructure and alarms plus an APM for deep application insight. For interviews, knowing why you need the agent (the hypervisor cannot see OS memory) and which metric maps to which failure matters more than any specific vendor.
Interview answer: "How do you monitor a Node service's memory in production on AWS?" It depends on where it runs. On a bare EC2 instance, the default CloudWatch metrics include CPU and network but not memory, because those come from the hypervisor, which cannot see inside the OS, so you install the unified CloudWatch agent to push OS-level memory (mem_used_percent, CWAgent namespace) and alarm on it. On ECS (Fargate or the EC2 launch type) memory utilization is provided automatically in the AWS/ECS namespace, measured against the task-definition limit, because the ECS container agent already reports it, and Container Insights adds per-task detail. In all cases you also publish custom metrics from the app, like heapUsed, rss, and event-loop lag, via PutMetricData or the embedded metric format, and alarm on a sustained heap climb (a leak) or rising loop lag (blocking). For OOM events you rely on logs, since a killed process leaves a dmesg/exit-137 signature rather than a clean metric. Many teams add an APM tool on top for richer heap and trace views.
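The embedded metric format mentioned above is just a JSON log line with an _aws metadata block that names which top-level keys are metrics. A minimal sketch, assuming the same MyApp/Node namespace as earlier and a single hypothetical Service dimension; buildEmfRecord and emitMetrics are illustrative names, and the log line still has to reach CloudWatch Logs (for example through the container log driver or the CloudWatch agent) for extraction to happen.

```javascript
// Builds one EMF record. Every metric listed under _aws.CloudWatchMetrics
// must also appear as a top-level key holding its value.
function buildEmfRecord(service, memory, loopLagMs, now = Date.now()) {
  const mb = (bytes) => bytes / 1024 / 1024;
  return {
    _aws: {
      Timestamp: now, // milliseconds since the epoch
      CloudWatchMetrics: [
        {
          Namespace: "MyApp/Node",
          Dimensions: [["Service"]],
          Metrics: [
            { Name: "HeapUsedMB", Unit: "Megabytes" },
            { Name: "RssMB", Unit: "Megabytes" },
            { Name: "EventLoopLagMs", Unit: "Milliseconds" },
          ],
        },
      ],
    },
    Service: service,
    HeapUsedMB: mb(memory.heapUsed),
    RssMB: mb(memory.rss),
    EventLoopLagMs: loopLagMs,
  };
}

// Writes the record as a single JSON line on stdout.
function emitMetrics(service, loopLagMs) {
  console.log(JSON.stringify(buildEmfRecord(service, process.memoryUsage(), loopLagMs)));
}
```

Calling emitMetrics("api", loop.mean / 1e6) on a timer would replace the PutMetricData call with one log line per interval, trading API requests for log ingestion.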
Putting It Together: A Diagnostic Playbook
When a Node service misbehaves in production, the symptom usually points to the category:
- Memory grows steadily and never drops, then crashes with heap out of memory. That is a leak. Confirm by watching heapUsed over time, then compare two heap snapshots to find what is accumulating, and look first at caches, listeners, timers, and closures.
- The process dies with OOMKilled / exit code 137 and no V8 error. That is the container trap: the heap limit and the container limit are out of sync, or total rss (heap plus buffers plus native) exceeds the cgroup cap. Align --max-old-space-size below the container limit with headroom.
- Latency is fine on average but terrible at p99, with periodic spikes. Suspect GC pauses (check --trace-gc) or intermittent event-loop blocking (measure with monitorEventLoopDelay).
- The whole service freezes and health checks fail, but it has not crashed. The event loop is blocked by synchronous CPU work. Profile to find it, then move it off the main thread.
- Requests start failing with EMFILE or pool-timeout errors while CPU and memory look healthy. Resource exhaustion: leaked file descriptors or connections. Bound the pools and release in finally.
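For the container trap in particular, a startup check can catch a misaligned heap limit before the first OOM kill. The sketch below compares V8's heap limit with the memory limit the OS reports. It assumes process.constrainedMemory() is available (it exists in recent Node versions and may report 0 or a very large value when no limit applies), and the 75 percent ceiling is a hypothetical rule of thumb, not a figure from this guide.

```javascript
const v8 = require("node:v8");

// Pure check: does the heap limit stay within maxHeapShare of the container limit?
function checkHeapHeadroom(heapLimitBytes, containerLimitBytes, maxHeapShare = 0.75) {
  const mb = (bytes) => Math.round(bytes / 1024 / 1024);
  if (!containerLimitBytes) {
    return { ok: true, reason: "no container limit detected", heapLimitMb: mb(heapLimitBytes) };
  }
  const ok = heapLimitBytes / containerLimitBytes <= maxHeapShare;
  return {
    ok,
    reason: ok ? "heap limit leaves headroom" : "heap limit too close to container limit",
    heapLimitMb: mb(heapLimitBytes),
    containerLimitMb: mb(containerLimitBytes),
  };
}

// Reads the real values for the current process.
function checkAtStartup() {
  const heapLimit = v8.getHeapStatistics().heap_size_limit;
  const containerLimit =
    typeof process.constrainedMemory === "function" ? process.constrainedMemory() : 0;
  return checkHeapHeadroom(heapLimit, containerLimit);
}
```

Logging the result of checkAtStartup() at boot, and warning when ok is false, makes the mismatch visible in the deploy that introduces it rather than in the incident that follows.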
Interview answer: "How would you debug a Node service that is slowly using more memory until it crashes?" First confirm it is actually a heap leak by watching process.memoryUsage().heapUsed over time under steady load; a steady climb that never recedes confirms it. Then take a heap snapshot after warm-up, exercise the app, take a second snapshot, and compare them sorted by growth to identify which objects are accumulating. The cause is almost always a retained reference: an unbounded cache, listeners or timers never cleaned up, or a closure pinning a large object. Fix the reference rather than raising --max-old-space-size, because a bigger heap only delays the same crash. If it is a container, also verify the heap limit sits safely under the container memory limit to avoid an OOM kill.