Node.js in Production — Graceful Shutdown, PM2, and Process Hardening

Node.js applications — especially long-running Express servers — can degrade over time through uncaught errors, resource leaks, and unhandled edge cases. Graceful shutdown, process management, and production hardening are the practices that transform a development server into a production-worthy service. This lesson covers the complete lifecycle management pattern: process signals for graceful shutdown, connection draining, PM2 for zero-downtime deployments and clustering, environment configuration validation, and the startup/shutdown sequence that ensures no requests are lost during deployments.

Production Process Lifecycle

| Signal | Default Behaviour | Production Handler |
|--------|-------------------|--------------------|
| SIGTERM | Terminate process | Start graceful shutdown — drain connections |
| SIGINT | Terminate process (Ctrl+C) | Same as SIGTERM in production |
| SIGHUP | Hang up terminal | Reload configuration without restart |
| SIGUSR2 | User-defined | PM2 uses this to signal graceful restart |
| unhandledRejection | Warning then terminate (Node 15+) | Log + controlled shutdown |
| uncaughtException | Terminate | Log + shutdown (cannot safely continue) |
Note: Graceful shutdown means: stop accepting new connections, let existing in-flight requests complete, close database connections cleanly, flush logs, and only then exit. Without graceful shutdown, a SIGTERM from Docker or Kubernetes immediately kills the process — any in-flight HTTP requests get a TCP RST (connection reset), clients see errors, and database connections are left open until they time out. Docker sends SIGTERM, waits 10 seconds, then sends SIGKILL — your shutdown must complete in under 10 seconds.
Tip: Use the zod library to validate all environment variables at startup. Define a schema with z.object({ PORT: z.string().transform(Number), MONGO_URI: z.string().url(), JWT_SECRET: z.string().min(32) }) and call schema.parse(process.env). If any required variable is missing or malformed, the app throws a descriptive error at startup (before accepting any traffic) rather than failing silently mid-request when the missing value is first accessed.
Warning: Never use process.exit() directly to handle errors in production. It terminates the process without running cleanup code — database connections are not closed, in-flight requests are dropped, and file handles are not flushed. Instead, trigger your graceful shutdown handler, which calls process.exit() only after all cleanup is complete. Use a timeout (e.g. 10 seconds) to force-exit if cleanup hangs — preventing the process from staying alive indefinitely after a fatal error.
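One way to express the warning's "cleanup with a hard deadline" idea is to race the cleanup work against a timer. This is a sketch, not the lesson's shutdown handler: `shutdownWithDeadline` and its parameters are illustrative names, and it sets `process.exitCode` rather than calling `process.exit()` directly so the event loop can drain naturally.

```javascript
// Sketch: run cleanup, but never wait longer than the deadline
async function shutdownWithDeadline(cleanup, deadlineMs = 10000) {
    const timer = new Promise(resolve =>
        // unref() so the deadline timer alone never keeps the process alive
        setTimeout(() => resolve('timeout'), deadlineMs).unref());

    const result = await Promise.race([cleanup().then(() => 'done'), timer]);

    // Exit code 0 only if cleanup finished; 1 if it hung past the deadline.
    // Setting exitCode (instead of calling process.exit) lets pending I/O
    // such as log flushes complete before the loop empties and Node exits.
    process.exitCode = result === 'done' ? 0 : 1;
}
```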

Complete Production Hardening

// src/server.js — production-hardened startup and shutdown
const mongoose = require('mongoose');
const app      = require('./app');
const { logger } = require('./config/logger');

// ── Environment validation at startup ─────────────────────────────────────
const REQUIRED_ENV = ['MONGO_URI', 'JWT_SECRET', 'REFRESH_SECRET', 'PORT'];
const missing = REQUIRED_ENV.filter(key => !process.env[key]);
if (missing.length) {
    console.error(`Missing required environment variables: ${missing.join(', ')}`);
    process.exit(1);
}

const PORT = parseInt(process.env.PORT, 10) || 3000;

// ── Graceful shutdown ─────────────────────────────────────────────────────
let server;
let isShuttingDown = false;

async function gracefulShutdown(signal) {
    if (isShuttingDown) return;
    isShuttingDown = true;

    logger.info(`Received ${signal} — starting graceful shutdown`);

    // Step 1: Stop accepting new connections
    server.close(async () => {
        logger.info('HTTP server closed — no new connections accepted');

        try {
            // Step 2: Close database connections
            await mongoose.disconnect();
            logger.info('MongoDB disconnected');

            // Step 3: Close Redis connection (if used)
            const redis = require('./config/redis');
            await redis.quit();
            logger.info('Redis connection closed');

        } catch (err) {
            logger.error('Error during cleanup', { error: err.message });
        }

        logger.info('Graceful shutdown complete');
        process.exit(0);
    });

    // Force shutdown after 10 seconds (Docker's SIGKILL window)
    setTimeout(() => {
        logger.error('Graceful shutdown timed out — forcing exit');
        process.exit(1);
    }, 10000).unref(); // unref() so this timer alone never keeps the process alive
}

// ── Signal handlers ───────────────────────────────────────────────────────
process.on('SIGTERM', () => gracefulShutdown('SIGTERM'));
process.on('SIGINT',  () => gracefulShutdown('SIGINT'));

// PM2 graceful reload: sends SIGUSR2 to signal reload
process.on('SIGUSR2', () => gracefulShutdown('SIGUSR2'));

// ── Error handlers ────────────────────────────────────────────────────────
process.on('unhandledRejection', (reason, promise) => {
    logger.error('Unhandled Promise rejection', {
        reason: reason?.message ?? String(reason),
        stack:  reason?.stack,
    });
    // In production: trigger graceful shutdown after logging
    gracefulShutdown('unhandledRejection');
});

process.on('uncaughtException', err => {
    logger.error('Uncaught exception — cannot safely continue', {
        error: err.message,
        stack: err.stack,
    });
    gracefulShutdown('uncaughtException');
});

// ── Health check endpoint ────────────────────────────────────────────────
app.get('/api/v1/health', async (req, res) => {
    if (isShuttingDown) {
        return res.status(503).json({ status: 'shutting_down' });
    }

    const mongoState = mongoose.connection.readyState;
    const healthy    = mongoState === 1;

    res.status(healthy ? 200 : 503).json({
        status:  healthy ? 'ok' : 'degraded',
        uptime:  process.uptime(),
        mongodb: mongoState === 1 ? 'connected' : 'disconnected',
        version: process.env.npm_package_version,
        pid:     process.pid,
    });
});

// ── Startup ───────────────────────────────────────────────────────────────
async function start() {
    // Connect to database before accepting traffic
    await mongoose.connect(process.env.MONGO_URI, {
        maxPoolSize: 10,
        serverSelectionTimeoutMS: 5000,
    });
    logger.info('MongoDB connected');

    server = app.listen(PORT, () => {
        logger.info(`Server started on port ${PORT}`, {
            pid:  process.pid,
            env:  process.env.NODE_ENV,
            port: PORT,
        });
    });

    server.on('error', err => {
        if (err.code === 'EADDRINUSE') {
            logger.error(`Port ${PORT} already in use`);
            process.exit(1);
        }
        logger.error('Server error', { error: err.message });
    });
}

start().catch(err => {
    logger.error('Failed to start server', { error: err.message, stack: err.stack });
    process.exit(1);
});
// ecosystem.config.js — PM2 configuration (or ecosystem.config.cjs)
module.exports = {
    apps: [{
        name:         'taskmanager-api',
        script:       'src/server.js',
        instances:    'max',           // one per CPU core
        exec_mode:    'cluster',       // PM2 cluster mode (built on the Node.js cluster module)
        watch:        false,           // never watch in production
        max_memory_restart: '500M',    // restart if memory exceeds 500MB
        node_args:    '--max-old-space-size=400',  // V8 heap limit in MB
        env_production: {
            NODE_ENV: 'production',
        },
        error_file:   '/var/log/taskmanager/error.log',
        out_file:     '/var/log/taskmanager/out.log',
        log_date_format: 'YYYY-MM-DD HH:mm:ss Z',
        kill_timeout: 10000,  // wait 10s for graceful shutdown before SIGKILL
        listen_timeout: 8000, // wait 8s for app to bind to port on startup
        // Graceful reload: PM2 sends SIGUSR2 to the old process,
        // starts the new one, waits for 'ready', then kills the old
        wait_ready:   true,   // wait for process.send('ready')
    }],
};

// After the graceful shutdown code, signal PM2 that the app is ready:
// process.send?.('ready');   // in start(), inside the server.listen() callback

How It Works

Step 1 — SIGTERM Initiates a Controlled Shutdown Sequence

Orchestrators (Kubernetes, Docker Swarm, PM2) send SIGTERM when they want to replace or restart a process. The handler sets isShuttingDown = true (so the health check returns 503 — load balancers stop routing traffic), calls server.close() (stops accepting new TCP connections while keeping existing ones alive until they complete), then disconnects from the database and exits. The entire sequence must complete before the 10-second force-kill window.

Step 2 — server.close() Drains Existing Connections

server.close(callback) stops the server from accepting new connections. Existing connections (keep-alive HTTP connections) remain open until the clients close them. In high-traffic environments, keep-alive connections may take minutes to drain — add server.closeAllConnections() (Node.js 18.2+) or use the http-terminator package to forcefully close idle connections while allowing active requests to complete.
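A draining helper combining these calls might look like the following sketch. It assumes Node.js 18.2+ (for closeIdleConnections/closeAllConnections); the `drainServer` name and `graceMs` parameter are illustrative.

```javascript
// Sketch: drain an http.Server during shutdown (assumes Node.js 18.2+)
function drainServer(server, graceMs = 8000) {
    return new Promise((resolve, reject) => {
        // Stop accepting new connections; the callback fires only once
        // every remaining socket has closed
        server.close(err => (err ? reject(err) : resolve()));

        // Destroy idle keep-alive sockets immediately — sockets with an
        // active request keep running until the response finishes
        server.closeIdleConnections();

        // Last resort: after the grace period, destroy everything still open
        setTimeout(() => server.closeAllConnections(), graceMs).unref();
    });
}
```

Awaiting `drainServer(server)` inside the shutdown handler replaces the bare `server.close()` call for servers with long-lived keep-alive clients.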

Step 3 — Environment Validation Fails Fast at Startup

Checking for required environment variables before starting the server catches misconfiguration at deploy time rather than mid-request. A missing JWT_SECRET discovered when the first login request arrives causes a runtime error and potentially leaked stack traces. Validating at startup causes a clear exit code 1 with a descriptive error message — the deployment fails immediately and the misconfiguration is obvious from deployment logs.

Step 4 — unhandledRejection Is Fatal in Modern Node.js

Since Node.js 15, unhandled Promise rejections exit the process with code 1. Explicitly handling unhandledRejection lets you log the error with full context (including the correlation ID from AsyncLocalStorage) before triggering graceful shutdown, rather than the default process crash with a stack trace. This provides better operational visibility into what caused the rejection before the process terminates.

Step 5 — PM2 Cluster Mode + wait_ready Enable Zero-Downtime Deploys

With wait_ready: true, PM2 starts the new process version but waits until it sends process.send('ready') (after the HTTP server binds successfully) before removing the old process from load rotation. This ensures the new version is fully initialised before it starts receiving traffic. The kill_timeout: 10000 gives the old process 10 seconds for graceful shutdown before SIGKILL.

Quick Reference

| Task | Code |
|------|------|
| Handle SIGTERM | `process.on('SIGTERM', gracefulShutdown)` |
| Stop accepting connections | `server.close(callback)` |
| Force close idle connections | `server.closeAllConnections()` (Node 18.2+) |
| Disconnect Mongoose | `await mongoose.disconnect()` |
| Shutdown timeout | `setTimeout(() => process.exit(1), 10000)` |
| Validate env vars | Check `process.env[key]` array, exit 1 if missing |
| Health check during shutdown | Return 503 when `isShuttingDown === true` |
| PM2 cluster | `instances: 'max'`, `exec_mode: 'cluster'` |
| Zero-downtime reload | `wait_ready: true` + `process.send('ready')` |

🧠 Test Yourself

Kubernetes sends SIGTERM to a Node.js server. The server immediately calls process.exit(0). What problem does this cause for in-flight requests and how should the shutdown be handled instead?