Observability is the ability to answer questions about your system’s behaviour from its outputs — metrics, logs, and traces. Where logging tells you what happened, metrics tell you how much and how often. Prometheus collects and stores time-series metrics; Grafana visualises them as dashboards with alerts. For a MEAN Stack application, the most valuable metrics are HTTP request rate, error rate, response time percentiles, Node.js heap usage, MongoDB connection pool utilisation, and business metrics like tasks created per minute. This lesson builds the complete metrics pipeline from application code to visualised dashboards.
Golden Signals for MEAN Stack
| Signal | Metric | Alert Threshold |
|---|---|---|
| Latency | http_request_duration_seconds p(95) | > 500ms |
| Traffic | http_requests_total rate(5m) | Drop > 50% (something broke) |
| Errors | http_requests_total{status_code=~"5.."} / http_requests_total | > 1% error rate |
| Saturation | nodejs_heap_size_used_bytes / nodejs_heap_size_total_bytes | > 85% heap utilisation |
| DB connections | mongodb_pool_connections_active | > 90% of pool size |
| Event loop lag | nodejs_eventloop_lag_seconds | > 100ms |
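When Prometheus scrapes the application, it receives plain text in the exposition format, one sample per line. An illustrative excerpt for the metrics defined in this lesson (the values are made up):

```
# HELP taskmanager_http_requests_total Total HTTP requests
# TYPE taskmanager_http_requests_total counter
taskmanager_http_requests_total{method="GET",route="/tasks/:id",status_code="200"} 1027
# HELP taskmanager_http_request_duration_seconds HTTP request duration in seconds
# TYPE taskmanager_http_request_duration_seconds histogram
taskmanager_http_request_duration_seconds_bucket{le="0.1",method="GET",route="/tasks/:id"} 980
taskmanager_http_request_duration_seconds_sum{method="GET",route="/tasks/:id"} 42.7
taskmanager_http_request_duration_seconds_count{method="GET",route="/tasks/:id"} 1027
```

Note that a histogram is exposed as cumulative `_bucket` counters plus `_sum` and `_count`; percentile maths happens later, in PromQL.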
Prometheus works on a pull model: it scrapes each target's /metrics endpoint on a configurable interval (typically 15s). Your application exposes metrics in Prometheus text format (one metric per line: metric_name{labels} value timestamp). The prom-client npm package handles metric registration, collection, and formatting. It also provides default Node.js metrics (heap, event loop, GC) automatically when you call collectDefaultMetrics(). For percentiles, Prometheus's histogram_quantile(0.95, rate(...)) function computes the 95th percentile from a histogram.
Never expose the /metrics endpoint publicly — it reveals internal application details including request patterns, active user counts, and system resource usage. Protect it with IP allowlisting (only accessible from the Prometheus server's IP), basic authentication, or a network-level restriction. In Docker Compose, put Prometheus on the backend network and the /metrics route on an internal path that nginx does not proxy to the public internet.
Complete Metrics Implementation
// src/config/metrics.js — Prometheus metrics setup
const promClient = require('prom-client');
const { Registry, collectDefaultMetrics, Counter, Histogram, Gauge } = promClient;
const register = new Registry();
// Collect default Node.js metrics (heap, event loop, GC, active handles)
collectDefaultMetrics({
  register,
  prefix: 'taskmanager_',
  labels: { service: 'api', env: process.env.NODE_ENV },
});
// ── Custom application metrics ─────────────────────────────────────────────
// HTTP request duration histogram (the most important metric)
const httpDuration = new Histogram({
  name: 'taskmanager_http_request_duration_seconds',
  help: 'HTTP request duration in seconds',
  labelNames: ['method', 'route', 'status_code'],
  buckets: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5],
  registers: [register],
});
// HTTP request counter
const httpRequests = new Counter({
  name: 'taskmanager_http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'route', 'status_code'],
  registers: [register],
});
// Active HTTP connections gauge
const activeConnections = new Gauge({
  name: 'taskmanager_http_active_connections',
  help: 'Currently active HTTP connections',
  registers: [register],
});
// Business metrics
const tasksCreated = new Counter({
  name: 'taskmanager_tasks_created_total',
  help: 'Total tasks created',
  labelNames: ['priority', 'user_role'],
  registers: [register],
});
const tasksCompleted = new Counter({
  name: 'taskmanager_tasks_completed_total',
  help: 'Total tasks completed',
  registers: [register],
});
// MongoDB connection pool gauge
const mongoPoolActive = new Gauge({
  name: 'taskmanager_mongodb_pool_active',
  help: 'Active MongoDB connection pool connections',
  registers: [register],
});
// ── Metrics middleware ────────────────────────────────────────────────────
function metricsMiddleware(req, res, next) {
  const end = httpDuration.startTimer();
  activeConnections.inc();
  res.on('finish', () => {
    // Normalise route (replace IDs to avoid metric label explosion)
    const route = req.route?.path || req.path.replace(/[0-9a-f]{24}/g, ':id') || 'unknown';
    const labels = { method: req.method, route, status_code: res.statusCode };
    end(labels); // stops the timer and records the duration on httpDuration
    httpRequests.inc(labels);
    activeConnections.dec();
  });
  next();
}
// ── Metrics endpoint ──────────────────────────────────────────────────────
async function metricsHandler(req, res) {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
}
module.exports = {
  register, metricsMiddleware, metricsHandler,
  tasksCreated, tasksCompleted, mongoPoolActive,
};
// ── Business metric recording ─────────────────────────────────────────────
// In task controller:
const { tasksCreated, tasksCompleted } = require('../config/metrics');
exports.create = asyncHandler(async (req, res) => {
  const task = await Task.create({ ...req.body, user: req.user.sub });
  tasksCreated.inc({ priority: task.priority, user_role: req.user.role });
  res.status(201).json({ success: true, data: task });
});
// ── App integration ───────────────────────────────────────────────────────
// src/app.js
const { metricsMiddleware, metricsHandler } = require('./config/metrics');
app.use(metricsMiddleware);
// Protected metrics endpoint — only accessible from Prometheus server
app.get('/metrics', metricsHandler);
// In production: nginx allowlist to only permit requests from Prometheus IP
# prometheus.yml — Prometheus scrape configuration
global:
  scrape_interval: 15s
  evaluation_interval: 15s
scrape_configs:
  - job_name: taskmanager-api
    static_configs:
      - targets: ['api:3000'] # service name in Docker network
    metrics_path: /metrics
  - job_name: mongodb
    static_configs:
      - targets: ['mongodb-exporter:9216']
  - job_name: node-exporter
    static_configs:
      - targets: ['node-exporter:9100'] # host OS metrics
# ── Alerting rules ────────────────────────────────────────────────────────
# prometheus-rules.yml
groups:
  - name: taskmanager
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(taskmanager_http_requests_total{status_code=~"5.."}[5m]))
            / sum(rate(taskmanager_http_requests_total[5m])) > 0.01
        for: 2m
        labels: { severity: critical }
        annotations:
          summary: "High error rate: {{ $value | humanizePercentage }}"
          description: "Error rate above 1% for 2 minutes"
      - alert: HighLatency
        expr: |
          histogram_quantile(0.95,
            sum(rate(taskmanager_http_request_duration_seconds_bucket[5m])) by (le, route)
          ) > 0.5
        for: 5m
        labels: { severity: warning }
        annotations:
          summary: "High p95 latency on {{ $labels.route }}: {{ $value }}s"
      - alert: HighHeapUsage
        expr: |
          taskmanager_nodejs_heap_size_used_bytes / taskmanager_nodejs_heap_size_total_bytes > 0.85
        for: 5m
        labels: { severity: warning }
        annotations:
          summary: "Heap usage at {{ $value | humanizePercentage }}"
How It Works
Step 1 — prom-client Registers and Exposes Metrics
Each metric type (Counter, Histogram, Gauge) is registered with a Registry. collectDefaultMetrics() automatically tracks Node.js internals: heap size and usage, event loop lag, garbage collection duration, and active handles. Custom metrics are created with descriptive names following the namespace_subsystem_name_unit convention. The /metrics endpoint calls register.metrics() which serialises all registered metrics to Prometheus text format.
Step 2 — Route Normalisation Prevents Label Cardinality Explosion
If you label HTTP metrics with the literal URL path, every unique task ID creates a new label combination — 10,000 tasks means 10,000 metric series. This is “label cardinality explosion” and causes Prometheus memory issues. Normalising paths by replacing MongoDB ObjectIDs with :id (using a regex replace) collapses all /tasks/[id] requests into a single /tasks/:id label. Express’s req.route.path provides the parameterised route template directly.
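The normalisation step can be isolated as a plain function for testing; a minimal sketch (the function name is invented here, the regex matches the middleware above):

```javascript
// Collapse 24-hex-character MongoDB ObjectIDs into a single ':id'
// placeholder so every /tasks/<id> request maps to one metric series.
function normaliseRoute(path) {
  return path.replace(/[0-9a-f]{24}/g, ':id');
}

console.log(normaliseRoute('/tasks/64a1f0c2b5e8d93a7c1b2f4e'));       // → /tasks/:id
console.log(normaliseRoute('/users/64a1f0c2b5e8d93a7c1b2f4e/tasks')); // → /users/:id/tasks
console.log(normaliseRoute('/health'));                               // → /health (unchanged)
```

If your URLs also embed UUIDs or numeric IDs, extend the regex accordingly — any high-cardinality path segment must be collapsed before it becomes a label value.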
Step 3 — Histograms Enable Percentile Calculations
A histogram observes values into pre-defined buckets. Bucket 0.1 counts requests that took under 100ms. Bucket 0.5 counts requests under 500ms (including those under 100ms). Prometheus’s histogram_quantile(0.95, rate(duration_bucket[5m])) function interpolates the 95th percentile from these bucket counts. The buckets must be defined in advance — choose values that cover your expected latency range (the task manager’s buckets from 5ms to 5s cover all realistic scenarios).
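The interpolation histogram_quantile performs can be sketched in a few lines of plain JavaScript — a simplified model (Prometheus actually operates on bucket *rates*, and this helper name is invented for illustration):

```javascript
// Estimate a quantile from cumulative histogram buckets via linear
// interpolation inside the bucket where the target rank falls.
function quantileFromBuckets(q, buckets) {
  // buckets: [{ le, count }] with cumulative counts, sorted by ascending le
  const total = buckets[buckets.length - 1].count;
  const rank = q * total;
  let prevLe = 0, prevCount = 0;
  for (const { le, count } of buckets) {
    if (count >= rank) {
      // Assume observations are spread evenly within the bucket
      return prevLe + (le - prevLe) * ((rank - prevCount) / (count - prevCount));
    }
    prevLe = le;
    prevCount = count;
  }
  return buckets[buckets.length - 1].le;
}

const buckets = [
  { le: 0.1, count: 80 },  // 80 requests took ≤ 100ms
  { le: 0.5, count: 95 },  // 95 requests took ≤ 500ms (cumulative)
  { le: 1.0, count: 100 }, // all 100 requests took ≤ 1s
];
console.log(quantileFromBuckets(0.95, buckets)); // p95 lands at the 0.5s bucket edge
```

This also shows the accuracy trade-off: the estimate can never be more precise than the bucket boundaries, which is why choosing buckets that bracket your SLO threshold matters.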
Step 4 — Alerting Rules Fire on Metric Conditions
Prometheus alerting rules evaluate PromQL expressions on the configured interval. When an expression evaluates to true for longer than for: duration, Alertmanager sends a notification. The for: 2m clause prevents noisy alerts on transient spikes — an error rate spike that resolves in 30 seconds does not page anyone. Alert labels (severity: critical) determine routing in Alertmanager — critical alerts go to PagerDuty, warnings go to Slack.
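The severity-based routing described above lives in Alertmanager's configuration; a hedged sketch (receiver names, the PagerDuty routing key, and the Slack webhook URL are all placeholders):

```yaml
# alertmanager.yml — illustrative severity routing
route:
  receiver: slack-warnings           # default: everything goes to Slack
  routes:
    - matchers: [ 'severity="critical"' ]
      receiver: pagerduty-oncall     # critical alerts page the on-call
receivers:
  - name: pagerduty-oncall
    pagerduty_configs:
      - routing_key: PAGERDUTY_ROUTING_KEY
  - name: slack-warnings
    slack_configs:
      - api_url: SLACK_WEBHOOK_URL
        channel: '#alerts'
```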
Step 5 — Business Metrics Connect Technical Performance to Outcomes
Technical metrics (latency, error rate) tell you whether the system is healthy. Business metrics (tasks created per minute, completion rate) tell you whether users are succeeding. A deployment might have perfect technical metrics but a drop in task creation rate — indicating a UX bug that’s not causing server errors. Tracking both types enables correlating technical issues with user impact and measuring the outcome of performance improvements.
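Both kinds of metric belong on the same dashboard so drops in business throughput can be visually correlated with technical regressions. Illustrative PromQL panel queries, assuming the metric names from this lesson:

```
# Tasks created per minute (business health)
sum(rate(taskmanager_tasks_created_total[5m])) * 60

# p95 API latency (technical health), plotted on the same time axis
histogram_quantile(0.95,
  sum(rate(taskmanager_http_request_duration_seconds_bucket[5m])) by (le))
```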
Common Mistakes
Mistake 1 — Using raw URL paths as metric labels
❌ Wrong — creates one metric series per unique task ID:
httpDuration.observe({ route: req.path }, duration);
// /tasks/64a1... — millions of unique label values!
✅ Correct — normalise to route template:
const route = req.route?.path || req.path.replace(/[0-9a-f]{24}/g, ':id');
httpDuration.observe({ route }, duration);
// /tasks/:id — one series for all task IDs
Mistake 2 — Exposing /metrics publicly
❌ Wrong — metrics visible to the internet:
app.get('/metrics', metricsHandler); // no auth — public!
// Attackers learn: request rates, active users, technology stack, resource usage
✅ Correct — restrict to internal network:
# nginx: only allow Prometheus server IP
location /metrics {
  allow 172.20.0.5; # Prometheus container IP
  deny all;
  proxy_pass http://api:3000;
}
Mistake 3 — Not using histograms for latency
❌ Wrong — gauge/counter only shows average, hides outliers:
const avgLatency = new Gauge({ name: 'avg_request_ms' });
// Average of [10, 10, 10, 5000] = 1257.5ms — misleading!
✅ Correct — histogram shows distribution:
const duration = new Histogram({ name: 'request_duration_seconds', buckets: [...] });
// histogram_quantile(0.99, ...) = 5s — 1% of requests are very slow!
Quick Reference
| Metric Type | Use For | Code |
|---|---|---|
| Counter | Total count (requests, errors, events) | new Counter({ name, help, labelNames }) |
| Gauge | Current value (active connections, heap) | new Gauge({ ... }); gauge.set(value) |
| Histogram | Distribution (latency, request size) | new Histogram({ ..., buckets }); h.observe(value) |
| Start timer | Measure duration | const end = histogram.startTimer(); end(labels) |
| Default metrics | Node.js internals (heap, GC, event loop) | collectDefaultMetrics({ register }) |
| Expose metrics | Prometheus scrape endpoint | res.end(await register.metrics()) |
| p95 PromQL | 95th percentile query | histogram_quantile(0.95, rate(hist_bucket[5m])) |
| Error rate PromQL | 5xx rate | rate(requests_total{status=~"5.."}[5m]) / rate(requests_total[5m]) |