Node.js Diagnostics — perf_hooks, diagnostics_channel, and Heap Snapshots

Node.js’s diagnostic infrastructure provides deep visibility into application behaviour at runtime — without attaching a debugger or modifying production code. The perf_hooks module measures execution time with sub-millisecond precision and hooks into V8 GC events; async_hooks tracks the propagation of async context across all async operations (the foundation for AsyncLocalStorage); diagnostics_channel provides a publish-subscribe bus for structured instrumentation messages; and the --inspect flag opens the Chrome DevTools Protocol for interactive debugging, profiling, and heap analysis. This lesson builds a complete production diagnostic toolkit for the task manager API.

Diagnostic API Comparison

API | Measures | Overhead | Use For
perf_hooks.performance | High-resolution timestamps, user timings, GC events | Minimal | Route timing, operation benchmarking
perf_hooks.PerformanceObserver | GC duration, HTTP request duration, DNS lookup time | Minimal | Automated GC monitoring, Node.js built-in metrics
async_hooks | Async context creation, propagation, destruction | Medium (2–5% CPU) | Distributed tracing, AsyncLocalStorage internals
diagnostics_channel | Custom named channels — publish structured events | Negligible when no subscribers | Library instrumentation, APM integration
V8 Profiler API | CPU samples, heap snapshots programmatically | High during sampling | On-demand profiling triggered by conditions
Note: diagnostics_channel (stable since Node.js 19) is the correct way for libraries and frameworks to publish instrumentation data without coupling to any specific APM tool. Express, Mongoose, and undici (Node.js’s built-in HTTP client) publish events to diagnostic channels. APM tools (Datadog, New Relic, OpenTelemetry) subscribe to these channels to collect traces and metrics. You can subscribe to them directly for custom monitoring without installing an APM agent.
Tip: Use performance.timerify(fn) to wrap a function so that every call creates a PerformanceEntry observable with a PerformanceObserver. This is cleaner than manually calling performance.mark() and performance.measure() around each call site. For Express middleware timing, wrap the route handler: const timedHandler = performance.timerify(originalHandler). The observer receives entries with the function name and duration automatically.
Warning: async_hooks has a non-trivial performance overhead (2–5% CPU in tight async loops) because it fires callbacks on every async resource creation, destruction, and context switch. Do not enable it unconditionally in production. AsyncLocalStorage uses async_hooks internally but is optimised to minimise overhead. For production tracing, use diagnostics_channel (subscribe to existing channels) rather than raw async_hooks — the overhead is negligible when no consumer is subscribed.

Complete Diagnostic Toolkit

const {
    performance, PerformanceObserver,
    createHistogram,
} = require('perf_hooks');
const diagnostics_channel = require('diagnostics_channel');
const { AsyncResource }   = require('async_hooks');

// ── 1. High-resolution timing with performance marks ──────────────────────
async function timedDatabaseQuery(filter) {
    performance.mark('query-start');
    const result = await Task.find(filter).lean();
    performance.mark('query-end');

    const measure = performance.measure('task-query', 'query-start', 'query-end');
    console.log(`Query took ${measure.duration.toFixed(2)}ms`);

    // Clean up marks to avoid memory leak
    performance.clearMarks('query-start');
    performance.clearMarks('query-end');
    performance.clearMeasures('task-query');

    return result;
}

// ── 2. PerformanceObserver — GC monitoring ────────────────────────────────
const gcObserver = new PerformanceObserver(list => {
    for (const entry of list.getEntries()) {
        if (entry.detail?.kind === 2) {  // 2 = NODE_PERFORMANCE_GC_MAJOR (mark-sweep)
            console.warn(`Major GC: ${entry.duration.toFixed(2)}ms — potential pause`);
        }
    }
});

gcObserver.observe({ entryTypes: ['gc'] });

// ── 3. Histogram for percentile tracking ─────────────────────────────────
const requestHistogram = createHistogram();

function trackRequestDuration(durationMs) {
    // Record in nanoseconds — record() requires an integer >= 1
    requestHistogram.record(Math.max(1, Math.round(durationMs * 1e6)));
}

function getLatencyStats() {
    return {
        p50:  requestHistogram.percentile(50)  / 1e6,   // back to ms
        p95:  requestHistogram.percentile(95)  / 1e6,
        p99:  requestHistogram.percentile(99)  / 1e6,
        mean: requestHistogram.mean            / 1e6,
        max:  requestHistogram.max             / 1e6,
        count: requestHistogram.count,
    };
}

// ── 4. diagnostics_channel — custom instrumentation ───────────────────────
// Publisher side (e.g. in task service)
const taskChannel = diagnostics_channel.channel('taskmanager:task');

async function createTask(dto, userId) {
    const task = await Task.create({ ...dto, user: userId });

    // Publish only when someone is subscribed — zero overhead otherwise
    if (taskChannel.hasSubscribers) {
        taskChannel.publish({
            type:      'created',
            taskId:    task._id.toString(),
            userId,
            priority:  task.priority,
            timestamp: Date.now(),
        });
    }

    return task;
}

// Subscriber side (e.g. in metrics or APM module)
diagnostics_channel.subscribe('taskmanager:task', (message) => {
    if (message.type === 'created') {
        tasksCreatedCounter.inc({ priority: message.priority });
    }
});

// Subscribe to built-in Node.js channels
diagnostics_channel.subscribe('undici:request:create', ({ request }) => {
    console.log(`Outgoing HTTP: ${request.method} ${request.origin}${request.path}`);
});

// ── 5. On-demand heap snapshot triggered by memory threshold ───────────────
const v8   = require('v8');
const path = require('path');

let lastSnapshotTime = 0;
const SNAPSHOT_COOLDOWN_MS = 5 * 60 * 1000;  // max one snapshot per 5 minutes

function checkMemoryAndSnapshot() {
    const heapUsed  = process.memoryUsage().heapUsed;
    const heapLimit = v8.getHeapStatistics().heap_size_limit;
    const pct       = heapUsed / heapLimit;

    if (pct > 0.85 && Date.now() - lastSnapshotTime > SNAPSHOT_COOLDOWN_MS) {
        lastSnapshotTime = Date.now();
        const filename   = path.join('/tmp', `heap-${Date.now()}.heapsnapshot`);
        const snapshot   = v8.writeHeapSnapshot(filename);
        console.warn(`Heap at ${(pct * 100).toFixed(1)}% — snapshot written: ${snapshot}`);
    }
}

// unref() so this diagnostic timer doesn't keep the process alive on its own
setInterval(checkMemoryAndSnapshot, 30000).unref();

// ── 6. AsyncResource for context propagation ─────────────────────────────
// Propagates async context into callbacks that were created outside
// the current async context (e.g. event emitter listeners)
class TaskEventHandler extends AsyncResource {
    constructor(requestContext) {
        super('TaskEventHandler');
        this.requestContext = requestContext;
    }

    handleEvent(event) {
        // Runs in the async context that created this handler
        // AsyncLocalStorage values from the originating request are available
        this.runInAsyncScope(() => {
            processTaskEvent(event, this.requestContext);
        });
    }
}

How It Works

Step 1 — performance.mark() Has Sub-Millisecond Resolution

performance.now() returns a high-resolution timestamp in milliseconds with sub-millisecond precision (typically microsecond or nanosecond on modern hardware). performance.mark(name) records a named timestamp; performance.measure(name, start, end) computes the duration. The resulting PerformanceMeasure entry has duration in milliseconds. This is more accurate than Date.now() (millisecond resolution) for measuring sub-millisecond operations.

Step 2 — PerformanceObserver Receives Entries Asynchronously

Rather than polling for metrics, PerformanceObserver receives entries as a callback when new entries of the observed types arrive. GC entries (entryTypes: ['gc']) are posted after each garbage collection cycle with the duration. This enables monitoring GC pauses without any polling overhead. A major GC over 100ms will cause a visible event loop pause — logging these in production helps correlate latency spikes with GC activity.

Step 3 — diagnostics_channel Is Zero-Cost Without Subscribers

channel.hasSubscribers is a fast property check. If no one has subscribed to a channel, the entire publish path is a single boolean check — no message serialisation, no function calls. This makes it safe to add diagnostic channel publications in hot code paths in production — they contribute near-zero overhead until an APM agent or custom subscriber attaches.

Step 4 — Heap Snapshots Are On-Demand in V8

v8.writeHeapSnapshot(path) writes a JSON snapshot of the entire V8 heap to a file. This is the same format opened by Chrome DevTools Memory tab. Taking snapshots programmatically (triggered by a memory threshold or an HTTP endpoint) enables capturing heap state at the exact moment of a memory issue — far more useful than a snapshot taken after the fact. The cooldown prevents cascading: the snapshot itself is expensive and should not run when memory is already critical.

Step 5 — AsyncResource Preserves Context for Out-of-Band Callbacks

AsyncLocalStorage context is automatically propagated along the async chain of a request. But callbacks registered on event emitters or custom event buses outside the request’s async chain lose the context. AsyncResource.runInAsyncScope(fn) re-enters the async context that was active when the resource was created, making AsyncLocalStorage values (like requestId) available inside the callback even if it fires in a different async context.

Quick Reference

Task | Code
Time an operation | performance.mark('start'); ...; performance.measure('op', 'start')
Monitor GC | new PerformanceObserver(cb).observe({ entryTypes: ['gc'] })
Percentile histogram | const h = createHistogram(); h.record(ns); h.percentile(95)
Publish diagnostic event | diagnostics_channel.channel('name').publish(data)
Subscribe to channel | diagnostics_channel.subscribe('name', msg => ...)
Zero-cost check | if (channel.hasSubscribers) channel.publish(data)
Heap snapshot | v8.writeHeapSnapshot('/tmp/heap.heapsnapshot')
Preserve async context | new AsyncResource('name').runInAsyncScope(fn)

🧠 Test Yourself

A library publishes events via diagnostics_channel.channel('mylib:op').publish(data) in a tight loop. In production with no subscribers, what is the performance impact?