Prometheus and Grafana — Metrics Collection, Dashboards, and Alerting

Observability is the ability to answer questions about your system’s behaviour from its outputs — metrics, logs, and traces. Where logging tells you what happened, metrics tell you how much and how often. Prometheus collects and stores time-series metrics; Grafana visualises them as dashboards with alerts. For a MEAN Stack application, the most valuable metrics are HTTP request rate, error rate, response time percentiles, Node.js heap usage, MongoDB connection pool utilisation, and business metrics like tasks created per minute. This lesson builds the complete metrics pipeline from application code to visualised dashboards.

Golden Signals for MEAN Stack

Signal         | Metric                                                         | Alert Threshold
Latency        | http_request_duration_seconds                                  | p95 > 500ms
Traffic        | rate(http_requests_total[5m])                                  | Drop > 50% (something broke)
Errors         | http_requests_total{status_code=~"5.."} / http_requests_total | > 1% error rate
Saturation     | nodejs_heap_size_used_bytes / nodejs_heap_size_total_bytes    | > 85% heap utilisation
DB connections | mongodb_pool_connections_active                                | > 90% of pool size
Event loop lag | nodejs_eventloop_lag_seconds                                   | > 100ms
Note: Prometheus uses a pull model — it scrapes metrics from your application’s /metrics endpoint on a configurable interval (typically 15s). Your application exposes metrics in Prometheus text format (one metric per line: metric_name{labels} value timestamp). The prom-client npm package handles metric registration, collection, and formatting. It also provides default Node.js metrics (heap, event loop, GC) automatically when you call collectDefaultMetrics().
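For illustration, a scrape of /metrics returns plain text along these lines (sample values invented):

# HELP taskmanager_http_requests_total Total HTTP requests
# TYPE taskmanager_http_requests_total counter
taskmanager_http_requests_total{method="GET",route="/tasks/:id",status_code="200"} 1027
taskmanager_http_requests_total{method="POST",route="/tasks",status_code="201"} 212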
Tip: Use histograms (not counters or gauges) for request duration metrics. Histograms track the distribution of values in configurable buckets, enabling percentile calculations (p50, p95, p99). A plain sum-and-count pair only gives you the average — which hides the fact that 1% of requests take 10 seconds while 99% take 10ms. The p99 latency is what your worst-affected users experience. Prometheus’s histogram_quantile(0.95, rate(...)) function computes the 95th percentile from a histogram.
Warning: Never expose the /metrics endpoint publicly — it reveals internal application details including request patterns, active user counts, and system resource usage. Protect it with IP allowlisting (only accessible from the Prometheus server’s IP), basic authentication, or a network-level restriction. In Docker Compose, put Prometheus on the backend network and the /metrics route on an internal path that nginx does not proxy to the public internet.
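A minimal Docker Compose sketch of that network restriction — service and network names here are illustrative, not the task manager’s actual compose file:

# docker-compose.yml (sketch) — Prometheus reaches the API only over an internal network
services:
    api:
        build: .
        networks: [backend]
    prometheus:
        image: prom/prometheus
        volumes:
            - ./prometheus.yml:/etc/prometheus/prometheus.yml
        networks: [backend]          # no ports: mapping — not reachable from the host
networks:
    backend:
        internal: true               # members can talk to each other, but not to the outside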

Complete Metrics Implementation

// src/config/metrics.js — Prometheus metrics setup
const promClient = require('prom-client');
const { Registry, collectDefaultMetrics, Counter, Histogram, Gauge } = promClient;

const register = new Registry();

// Collect default Node.js metrics (heap, event loop, GC, active handles)
collectDefaultMetrics({
    register,
    prefix: 'taskmanager_',
    labels: { service: 'api', env: process.env.NODE_ENV || 'development' },
});

// ── Custom application metrics ─────────────────────────────────────────────

// HTTP request duration histogram (the most important metric)
const httpDuration = new Histogram({
    name:    'taskmanager_http_request_duration_seconds',
    help:    'HTTP request duration in seconds',
    labelNames: ['method', 'route', 'status_code'],
    buckets: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5],
    registers: [register],
});

// HTTP request counter
const httpRequests = new Counter({
    name:    'taskmanager_http_requests_total',
    help:    'Total HTTP requests',
    labelNames: ['method', 'route', 'status_code'],
    registers: [register],
});

// Active HTTP connections gauge
const activeConnections = new Gauge({
    name:    'taskmanager_http_active_connections',
    help:    'Currently active HTTP connections',
    registers: [register],
});

// Business metrics
const tasksCreated = new Counter({
    name:    'taskmanager_tasks_created_total',
    help:    'Total tasks created',
    labelNames: ['priority', 'user_role'],
    registers: [register],
});

const tasksCompleted = new Counter({
    name:    'taskmanager_tasks_completed_total',
    help:    'Total tasks completed',
    registers: [register],
});

// MongoDB connection pool gauge
const mongoPoolActive = new Gauge({
    name:    'taskmanager_mongodb_pool_active',
    help:    'Active MongoDB connection pool connections',
    registers: [register],
});

// ── Metrics middleware ────────────────────────────────────────────────────
function metricsMiddleware(req, res, next) {
    const endTimer = httpDuration.startTimer();   // returns a fn that observes when called
    activeConnections.inc();

    res.on('finish', () => {
        // Normalise route (replace IDs to avoid metric label explosion).
        // req.route.path is relative to the router mount, so prefix req.baseUrl.
        const route = req.route
            ? req.baseUrl + req.route.path
            : req.path.replace(/[0-9a-f]{24}/g, ':id');
        const labels = { method: req.method, route, status_code: res.statusCode };

        endTimer(labels);             // records elapsed seconds with these labels
        httpRequests.inc(labels);
        activeConnections.dec();
    });

    next();
}

// ── Metrics endpoint ──────────────────────────────────────────────────────
async function metricsHandler(req, res) {
    res.set('Content-Type', register.contentType);
    res.end(await register.metrics());
}

module.exports = {
    register, metricsMiddleware, metricsHandler,
    tasksCreated, tasksCompleted, mongoPoolActive,
};

// ── Business metric recording ─────────────────────────────────────────────
// In task controller (Task model and asyncHandler as in earlier lessons):
const { tasksCreated } = require('../config/metrics');

exports.create = asyncHandler(async (req, res) => {
    const task = await Task.create({ ...req.body, user: req.user.sub });
    tasksCreated.inc({ priority: task.priority, user_role: req.user.role });
    res.status(201).json({ success: true, data: task });
});
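
// ── MongoDB pool gauge wiring (sketch) ────────────────────────────────────
// mongoPoolActive is exported above but never updated. One way to feed it —
// an assumption, not the task manager's actual wiring — is the MongoDB
// driver's connection pool (CMAP) events, reachable through Mongoose 6+:
const mongoose = require('mongoose');
const { mongoPoolActive } = require('./config/metrics');

mongoose.connection.on('connected', () => {
    const client = mongoose.connection.getClient();
    client.on('connectionCheckedOut', () => mongoPoolActive.inc());
    client.on('connectionCheckedIn',  () => mongoPoolActive.dec());
});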

// ── App integration ───────────────────────────────────────────────────────
// src/app.js
const { metricsMiddleware, metricsHandler } = require('./config/metrics');

app.use(metricsMiddleware);

// Protected metrics endpoint — only accessible from Prometheus server
app.get('/metrics', metricsHandler);
// In production: nginx allowlist to only permit requests from Prometheus IP

# prometheus.yml — Prometheus scrape configuration
global:
    scrape_interval:     15s
    evaluation_interval: 15s

scrape_configs:
    - job_name: taskmanager-api
      static_configs:
          - targets: ['api:3000']   # service name in Docker network
      metrics_path: /metrics

    - job_name: mongodb
      static_configs:
          - targets: ['mongodb-exporter:9216']

    - job_name: node-exporter
      static_configs:
          - targets: ['node-exporter:9100']   # host OS metrics

# ── Alerting rules ────────────────────────────────────────────────────────
# prometheus-rules.yml
groups:
    - name: taskmanager
      rules:
          - alert: HighErrorRate
            expr: |
                sum(rate(taskmanager_http_requests_total{status_code=~"5.."}[5m]))
                / sum(rate(taskmanager_http_requests_total[5m])) > 0.01
            for: 2m
            labels:     { severity: critical }
            annotations:
                summary:     "High error rate: {{ $value | humanizePercentage }}"
                description: "Error rate above 1% for 2 minutes"

          - alert: HighLatency
            expr: |
                histogram_quantile(0.95,
                    sum(rate(taskmanager_http_request_duration_seconds_bucket[5m])) by (le, route)
                ) > 0.5
            for: 5m
            labels: { severity: warning }
            annotations:
                summary: "High p95 latency on {{ $labels.route }}: {{ $value }}s"

          - alert: HighHeapUsage
            expr: |
                taskmanager_nodejs_heap_size_used_bytes / taskmanager_nodejs_heap_size_total_bytes > 0.85
            for: 5m
            labels: { severity: warning }
            annotations:
                summary: "Heap usage at {{ $value | humanizePercentage }}"

How It Works

Step 1 — prom-client Registers and Exposes Metrics

Each metric type (Counter, Histogram, Gauge) is registered with a Registry. collectDefaultMetrics() automatically tracks Node.js internals: heap size and usage, event loop lag, garbage collection duration, and active handles. Custom metrics are created with descriptive names following the namespace_subsystem_name_unit convention. The /metrics endpoint calls register.metrics() which serialises all registered metrics to Prometheus text format.
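A quick way to see that serialisation (illustrative; the require path assumes the layout used above):

// print-metrics.js — dump the current registry contents to stdout
const { register } = require('./src/config/metrics');

register.metrics().then((output) => console.log(output));
// Each metric appears as # HELP and # TYPE lines followed by its samples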

Step 2 — Route Normalisation Prevents Label Cardinality Explosion

If you label HTTP metrics with the literal URL path, every unique task ID creates a new label combination — 10,000 tasks means 10,000 metric series. This is “label cardinality explosion” and causes Prometheus memory issues. Normalising paths by replacing MongoDB ObjectIDs with :id (using a regex replace) collapses all /tasks/[id] requests into a single /tasks/:id label. Express’s req.route.path provides the parameterised route template directly — relative to the router’s mount point, so prefix it with req.baseUrl to get the full path.
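The 24-hex regex only covers Mongo ObjectIDs. If an API also exposes numeric or UUID path segments — an assumption beyond the task manager’s own routes — the fallback extends in the same spirit:

function normaliseRoute(req) {
    if (req.route) return req.baseUrl + req.route.path;   // e.g. /api/tasks/:id
    return req.path
        .replace(/[0-9a-f]{24}/g, ':id')        // Mongo ObjectIDs
        .replace(/[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}/g, ':uuid')  // UUIDs
        .replace(/\/\d+(?=\/|$)/g, '/:num');    // bare numeric segments
}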

Step 3 — Histograms Enable Percentile Calculations

A histogram observes values into pre-defined buckets. Bucket 0.1 counts requests that took under 100ms. Bucket 0.5 counts requests under 500ms (including those under 100ms). Prometheus’s histogram_quantile(0.95, rate(duration_bucket[5m])) function interpolates the 95th percentile from these bucket counts. The buckets must be defined in advance — choose values that cover your expected latency range (the task manager’s buckets from 5ms to 5s cover all realistic scenarios).
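A worked example with invented numbers: four observations of 0.03s, 0.08s, 0.3s, and 4s against buckets [0.05, 0.1, 0.5, 5] produce cumulative counts:

le="0.05" → 1
le="0.1"  → 2
le="0.5"  → 3
le="5"    → 4
le="+Inf" → 4   (always equals the total observation count)

histogram_quantile(0.95, ...) targets rank 0.95 × 4 = 3.8, which falls between the le="0.5" and le="5" counts, so it interpolates linearly within that bucket: 0.5 + (5 − 0.5) × (3.8 − 3) / (4 − 3) = 4.1s.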

Step 4 — Alerting Rules Fire on Metric Conditions

Prometheus alerting rules evaluate PromQL expressions on the configured interval. When an expression stays true for longer than the for: duration, the alert fires and Prometheus forwards it to Alertmanager, which sends the notification. The for: 2m clause prevents noisy alerts on transient spikes — an error rate spike that resolves in 30 seconds does not page anyone. Alert labels (severity: critical) determine routing in Alertmanager — critical alerts go to PagerDuty, warnings go to Slack.
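A sketch of the corresponding Alertmanager routing — receiver names, the channel, and keys are placeholders, not real configuration:

# alertmanager.yml (sketch)
route:
    receiver: slack-warnings               # default: everything goes to Slack
    routes:
        - match: { severity: critical }
          receiver: pagerduty-oncall       # critical alerts page the on-call engineer
receivers:
    - name: slack-warnings
      slack_configs:
          - channel: '#taskmanager-alerts'
            api_url: 'https://hooks.slack.com/services/PLACEHOLDER'
    - name: pagerduty-oncall
      pagerduty_configs:
          - routing_key: 'PLACEHOLDER'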

Step 5 — Business Metrics Connect Technical Performance to Outcomes

Technical metrics (latency, error rate) tell you whether the system is healthy. Business metrics (tasks created per minute, completion rate) tell you whether users are succeeding. A deployment might have perfect technical metrics but a drop in task creation rate — indicating a UX bug that’s not causing server errors. Tracking both types enables correlating technical issues with user impact and measuring the outcome of performance improvements.
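A sketch of recording the completion counter — the route shape and handler name are assumptions, not the task manager’s actual controller:

// In task controller — record completions alongside the status update
const { tasksCompleted } = require('../config/metrics');

exports.complete = asyncHandler(async (req, res) => {
    const task = await Task.findOneAndUpdate(
        { _id: req.params.id, user: req.user.sub },
        { status: 'completed' },
        { new: true }
    );
    tasksCompleted.inc();
    res.json({ success: true, data: task });
});

// Grafana panel queries (PromQL):
//   rate(taskmanager_tasks_created_total[5m])     — creation throughput
//   rate(taskmanager_tasks_completed_total[5m])   — completion throughput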

Common Mistakes

Mistake 1 — Using raw URL paths as metric labels

❌ Wrong — creates one metric series per unique task ID:

httpDuration.observe({ route: req.path }, duration);
// /tasks/64a1... — millions of unique label values!

✅ Correct — normalise to route template:

const route = req.route ? req.baseUrl + req.route.path : req.path.replace(/[0-9a-f]{24}/g, ':id');
httpDuration.observe({ route }, duration);
// /tasks/:id — one series for all task IDs

Mistake 2 — Exposing /metrics publicly

❌ Wrong — metrics visible to the internet:

app.get('/metrics', metricsHandler);  // no auth — public!
// Attackers learn: request rates, active users, technology stack, resource usage

✅ Correct — restrict to internal network:

# nginx: only allow Prometheus server IP
location /metrics {
    allow 172.20.0.5;  # Prometheus container IP
    deny  all;
    proxy_pass http://api:3000;
}

Mistake 3 — Not using histograms for latency

❌ Wrong — gauge/counter only shows average, hides outliers:

const avgLatency = new Gauge({ name: 'avg_request_ms' });
// Average of [10, 10, 10, 5000] = 1257ms — misleading!

✅ Correct — histogram shows distribution:

const duration = new Histogram({ name: 'request_duration_seconds', buckets: [...] });
// histogram_quantile(0.99, ...) = 5s — 1% of requests are very slow!

Quick Reference

Item              | Use For                                   | Code
Counter           | Total count (requests, errors, events)    | new Counter({ name, help, labelNames })
Gauge             | Current value (active connections, heap)  | new Gauge({ ... }); gauge.set(value)
Histogram         | Distribution (latency, request size)      | new Histogram({ ..., buckets }); h.observe(value)
Start timer       | Measure duration                          | const end = histogram.startTimer(); end(labels)
Default metrics   | Node.js internals (heap, GC, event loop)  | collectDefaultMetrics({ register })
Expose metrics    | Prometheus scrape endpoint                | res.end(await register.metrics())
p95 PromQL        | 95th percentile query                     | histogram_quantile(0.95, rate(hist_bucket[5m]))
Error rate PromQL | 5xx rate                                  | rate(requests_total{status=~"5.."}[5m]) / rate(requests_total[5m])

🧠 Test Yourself

An HTTP duration Gauge metric shows an average response time of 50ms for the /tasks route. Users are complaining of slow loading. Why might the Gauge be misleading and what metric type would reveal the issue?