Node.js applications — especially long-running Express servers — can degrade over time through uncaught errors, resource leaks, and unhandled edge cases. Graceful shutdown, process management, and production hardening are the practices that transform a development server into a production-worthy service. This lesson covers the complete lifecycle management pattern: process signals for graceful shutdown, connection draining, PM2 for zero-downtime deployments and clustering, environment configuration validation, and the startup/shutdown sequence that ensures no requests are lost during deployments.
Production Process Lifecycle
| Signal | Default Behaviour | Production Handler |
|---|---|---|
| SIGTERM | Terminate process | Start graceful shutdown — drain connections |
| SIGINT | Terminate process (Ctrl+C) | Same as SIGTERM in production |
| SIGHUP | Hang up terminal | Reload configuration without restart |
| SIGUSR2 | User-defined | nodemon uses this to trigger a restart in development |
| unhandledRejection | Warning then terminate (Node 15+) | Log + controlled shutdown |
| uncaughtException | Terminate | Log + shutdown (cannot safely continue) |
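Wiring the table's termination signals into code takes one listener per signal. A minimal sketch (the handler body here is a placeholder):

```javascript
// Minimal signal wiring: route every termination-style signal
// through one shutdown function (real cleanup elided).
function shutdown(signal) {
  console.log(`received ${signal}`);
}

// Node passes the signal name as the first argument to the listener,
// so one shared handler can log which signal triggered it.
['SIGTERM', 'SIGINT'].forEach(sig => process.on(sig, shutdown));
```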
Use the zod library to validate all environment variables at startup. Define a schema with z.object({ PORT: z.string().transform(Number), MONGO_URI: z.string().url(), JWT_SECRET: z.string().min(32) }) and call schema.parse(process.env). If any required variable is missing or malformed, the app throws a descriptive error at startup (before accepting any traffic) rather than failing silently mid-request when the missing value is first accessed.

Never call process.exit() directly to handle errors in production. It terminates the process without running cleanup code: database connections are not closed, in-flight requests are dropped, and file handles are not flushed. Instead, trigger your graceful shutdown handler, which calls process.exit() only after all cleanup is complete. Use a timeout (e.g. 10 seconds) to force-exit if cleanup hangs, preventing the process from staying alive indefinitely after a fatal error.

Complete Production Hardening
```js
// src/server.js — production-hardened startup and shutdown
const mongoose = require('mongoose');
const app = require('./app');
const { logger } = require('./config/logger');

// ── Environment validation at startup ─────────────────────────────────────
const REQUIRED_ENV = ['MONGO_URI', 'JWT_SECRET', 'REFRESH_SECRET', 'PORT'];
const missing = REQUIRED_ENV.filter(key => !process.env[key]);
if (missing.length) {
  console.error(`Missing required environment variables: ${missing.join(', ')}`);
  process.exit(1);
}

const PORT = parseInt(process.env.PORT, 10) || 3000;

// ── Graceful shutdown ─────────────────────────────────────────────────────
let server;
let isShuttingDown = false;

async function gracefulShutdown(signal) {
  if (isShuttingDown) return;
  isShuttingDown = true;
  logger.info(`Received ${signal} — starting graceful shutdown`);

  // Step 1: Stop accepting new connections
  server.close(async () => {
    logger.info('HTTP server closed — no new connections accepted');
    try {
      // Step 2: Close database connections
      await mongoose.disconnect();
      logger.info('MongoDB disconnected');

      // Step 3: Close Redis connection (if used)
      const redis = require('./config/redis');
      await redis.quit();
      logger.info('Redis connection closed');
    } catch (err) {
      logger.error('Error during cleanup', { error: err.message });
    }
    logger.info('Graceful shutdown complete');
    process.exit(0);
  });

  // Force shutdown after 10 seconds (Docker SIGKILL window)
  setTimeout(() => {
    logger.error('Graceful shutdown timed out — forcing exit');
    process.exit(1);
  }, 10000);
}

// ── Signal handlers ───────────────────────────────────────────────────────
process.on('SIGTERM', () => gracefulShutdown('SIGTERM'));
process.on('SIGINT', () => gracefulShutdown('SIGINT'));
// nodemon signals a restart with SIGUSR2 in development — treat it like SIGTERM
process.on('SIGUSR2', () => gracefulShutdown('SIGUSR2'));

// ── Error handlers ────────────────────────────────────────────────────────
process.on('unhandledRejection', (reason, promise) => {
  logger.error('Unhandled Promise rejection', {
    reason: reason?.message ?? String(reason),
    stack: reason?.stack,
  });
  // In production: trigger graceful shutdown after logging
  gracefulShutdown('unhandledRejection');
});

process.on('uncaughtException', err => {
  logger.error('Uncaught exception — cannot safely continue', {
    error: err.message,
    stack: err.stack,
  });
  gracefulShutdown('uncaughtException');
});

// ── Health check endpoint ─────────────────────────────────────────────────
app.get('/api/v1/health', async (req, res) => {
  if (isShuttingDown) {
    return res.status(503).json({ status: 'shutting_down' });
  }
  const mongoState = mongoose.connection.readyState;
  const healthy = mongoState === 1;
  res.status(healthy ? 200 : 503).json({
    status: healthy ? 'ok' : 'degraded',
    uptime: process.uptime(),
    mongodb: mongoState === 1 ? 'connected' : 'disconnected',
    version: process.env.npm_package_version,
    pid: process.pid,
  });
});

// ── Startup ───────────────────────────────────────────────────────────────
async function start() {
  // Connect to database before accepting traffic
  await mongoose.connect(process.env.MONGO_URI, {
    maxPoolSize: 10,
    serverSelectionTimeoutMS: 5000,
  });
  logger.info('MongoDB connected');

  server = app.listen(PORT, () => {
    logger.info(`Server started on port ${PORT}`, {
      pid: process.pid,
      env: process.env.NODE_ENV,
      port: PORT,
    });
  });

  server.on('error', err => {
    if (err.code === 'EADDRINUSE') {
      logger.error(`Port ${PORT} already in use`);
      process.exit(1);
    }
    logger.error('Server error', { error: err.message });
  });
}

start().catch(err => {
  logger.error('Failed to start server', { error: err.message, stack: err.stack });
  process.exit(1);
});
```
```js
// ecosystem.config.js — PM2 configuration (or ecosystem.config.cjs)
module.exports = {
  apps: [{
    name: 'taskmanager-api',
    script: 'src/server.js',
    instances: 'max',                      // one per CPU core
    exec_mode: 'cluster',                  // PM2 cluster mode (like Node.js Cluster)
    watch: false,                          // never watch in production
    max_memory_restart: '500M',            // restart if memory exceeds 500MB
    node_args: '--max-old-space-size=400', // V8 heap limit
    env_production: {
      NODE_ENV: 'production',
    },
    error_file: '/var/log/taskmanager/error.log',
    out_file: '/var/log/taskmanager/out.log',
    log_date_format: 'YYYY-MM-DD HH:mm:ss Z',
    kill_timeout: 10000,  // wait 10s for graceful shutdown before SIGKILL
    listen_timeout: 8000, // wait 8s for app to bind to port on startup
    // Graceful reload: PM2 starts the new process, waits for 'ready',
    // then signals the old one to shut down (SIGINT by default)
    wait_ready: true,     // wait for process.send('ready')
  }],
};

// With wait_ready enabled, signal PM2 that the app is ready:
// process.send?.('ready'); // in start(), inside the server.listen() callback
```
How It Works
Step 1 — SIGTERM Initiates a Controlled Shutdown Sequence
Orchestrators (Kubernetes, Docker Swarm, PM2) send SIGTERM when they want to replace or restart a process. The handler sets isShuttingDown = true (so the health check returns 503 — load balancers stop routing traffic), calls server.close() (stops accepting new TCP connections while keeping existing ones alive until they complete), then disconnects from the database and exits. The entire sequence must complete before the 10-second force-kill window.
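The isShuttingDown guard also makes the sequence idempotent, since orchestrators can deliver termination signals more than once. A stripped-down sketch of just that guard (real cleanup replaced by a counter):

```javascript
// Idempotent shutdown guard: a second SIGTERM during draining is a no-op.
let shuttingDown = false;
let shutdownRuns = 0;

function gracefulShutdown(signal) {
  if (shuttingDown) return;  // ignore repeated signals
  shuttingDown = true;
  shutdownRuns += 1;         // real code would close server/DB here
}

gracefulShutdown('SIGTERM');
gracefulShutdown('SIGTERM'); // ignored: cleanup runs exactly once
```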
Step 2 — server.close() Drains Existing Connections
server.close(callback) stops the server from accepting new connections. Existing connections (keep-alive HTTP connections) remain open until the clients close them. In high-traffic environments, keep-alive connections may take minutes to drain — add server.closeAllConnections() (Node.js 18.2+) or use the http-terminator package to forcefully close idle connections while allowing active requests to complete.
Step 3 — Environment Validation Fails Fast at Startup
Checking for required environment variables before starting the server catches misconfiguration at deploy time rather than mid-request. A missing JWT_SECRET discovered when the first login request arrives causes a runtime error and potentially leaked stack traces. Validating at startup causes a clear exit code 1 with a descriptive error message — the deployment fails immediately and the misconfiguration is obvious from deployment logs.
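The same fail-fast check can be factored into a pure helper so it is unit-testable in isolation (names below are illustrative, not from the lesson's code):

```javascript
// Fail-fast env validation as a pure helper: returns the list of missing
// or empty keys so the caller can log them all at once and exit(1).
function missingEnvKeys(env, required) {
  return required.filter(key => !env[key] || env[key].trim() === '');
}

const REQUIRED = ['MONGO_URI', 'JWT_SECRET', 'PORT'];
const problems = missingEnvKeys({ PORT: '3000' }, REQUIRED);
// problems → ['MONGO_URI', 'JWT_SECRET']
if (problems.length) {
  console.error(`Missing required environment variables: ${problems.join(', ')}`);
  // in server.js this would be followed by process.exit(1)
}
```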
Step 4 — unhandledRejection Is Fatal in Modern Node.js
Since Node.js 15, unhandled Promise rejections exit the process with code 1. Explicitly handling unhandledRejection lets you log the error with full context (including the correlation ID from AsyncLocalStorage) before triggering graceful shutdown, rather than the default process crash with a stack trace. This provides better operational visibility into what caused the rejection before the process terminates.
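Registering the handler is enough to replace the default crash. One detail worth handling is that the rejection reason is not guaranteed to be an Error object (a sketch, with the shutdown call stubbed out):

```javascript
// Last-resort rejection handler: normalise the reason, log full context,
// then hand off to shutdown (stubbed here as a comment).
process.on('unhandledRejection', reason => {
  const err = reason instanceof Error ? reason : new Error(String(reason));
  console.error('Unhandled Promise rejection', {
    reason: err.message,
    stack: err.stack,
  });
  // production code would call gracefulShutdown('unhandledRejection') here
});
```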
Step 5 — PM2 Cluster Mode + wait_ready Enable Zero-Downtime Deploys
With wait_ready: true, PM2 starts the new process version but waits until it calls process.send('ready') (after the HTTP server binds successfully) before removing the old process from load rotation, so the new version is fully initialised before it starts receiving traffic. PM2 then asks the old process to stop (SIGINT by default), and kill_timeout: 10000 gives it 10 seconds of graceful shutdown before SIGKILL.
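The ready signal can be wrapped in a small helper so the same code runs safely outside PM2, where process.send is undefined (signalReady is an illustrative name):

```javascript
// Signal PM2 that this worker is ready for traffic. Under plain `node`
// there is no IPC channel, so process.send is undefined and this is a no-op.
function signalReady() {
  if (typeof process.send === 'function') {
    process.send('ready');
    return true;  // running under PM2 (or another IPC parent)
  }
  return false;   // standalone process: nothing to signal
}
```

Call signalReady() inside the server.listen() callback, so PM2 only rotates traffic to the worker once the port is actually bound.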
Quick Reference
| Task | Code |
|---|---|
| Handle SIGTERM | process.on('SIGTERM', gracefulShutdown) |
| Stop accepting connections | server.close(callback) |
| Force close idle connections | server.closeAllConnections() (Node 18.2+) |
| Disconnect Mongoose | await mongoose.disconnect() |
| Shutdown timeout | setTimeout(() => process.exit(1), 10000) |
| Validate env vars | Check process.env[key] array, exit 1 if missing |
| Health check during shutdown | Return 503 when isShuttingDown === true |
| PM2 cluster | instances: 'max', exec_mode: 'cluster' |
| Zero-downtime reload | wait_ready: true + process.send('ready') |