Production Deployment — Complete CI/CD Pipeline, Rolling Updates, and Monitoring

The capstone series concludes with the complete deployment pipeline — from a development branch to a live, monitored production environment. This final lesson takes the Docker and CI/CD knowledge from Chapters 20–21 and the monitoring from Chapter 22 and applies them end-to-end to the Task Manager application. The result is a production deployment that automatically tests every pull request, builds immutable Docker images on merge to main, deploys to staging automatically, requires human approval for production, monitors health continuously, and alerts the on-call team if anything goes wrong.

Complete Production Stack

| Layer | Technology | Responsibility |
| --- | --- | --- |
| DNS / CDN | Cloudflare | DNS, DDoS protection, edge caching for static assets |
| Reverse proxy | nginx (Docker) | SSL termination, static file serving, API proxy |
| API servers | 2× Express (Docker) | HTTP request handling, Bull workers |
| WebSocket | Socket.io with Redis adapter | Real-time events across both API instances |
| Database | MongoDB Atlas M30 | Primary data store — 3-node replica set |
| Cache / Queue | Redis (Docker) | Session caching, rate limiting, Bull job queue |
| File storage | Cloudinary | Task attachment storage and CDN delivery |
| Email | AWS SES | Transactional emails via Bull email queue |
| Monitoring | Prometheus + Grafana | Metrics, dashboards, alerting |
| Logs | Winston → Loki | Structured logs, queryable via Grafana |
| CI/CD | GitHub Actions | Automated testing, building, and deployment |
Note: The production deployment uses two Express API instances behind nginx for redundancy and the Redis adapter for Socket.io cross-instance broadcasting. This setup handles:
- one instance going down during deployment (the other continues serving),
- zero-downtime deployments (rolling update: Instance B updated while Instance A serves, then Instance A updated while Instance B serves), and
- horizontal scale if traffic grows (add a third instance; the Redis adapter ensures no missed Socket.io events).
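A minimal sketch of the production compose override this note implies — two API services sharing one Redis, fronted by nginx. Service names, the registry path, and ports are illustrative assumptions, not the book's exact file:

```yaml
# docker-compose.prod.yml (sketch) — two API instances behind one nginx,
# sharing one Redis. Names and ports are assumptions.
services:
    nginx:
        image: nginx:alpine
        ports: ["80:80", "443:443"]
        depends_on: [api-1, api-2]
    api-1:
        image: ghcr.io/your-org/taskmanager/api:${API_TAG:-latest}
        environment:
            - REDIS_URL=redis://redis:6379
    api-2:
        image: ghcr.io/your-org/taskmanager/api:${API_TAG:-latest}
        environment:
            - REDIS_URL=redis://redis:6379
    redis:
        image: redis:7-alpine
```

With both instances subscribed through the Socket.io Redis adapter, an event emitted on api-1 reaches clients whose WebSocket connection happens to terminate on api-2.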
Tip: Configure Grafana alerting for the four golden signals: latency (p95 > 500ms), traffic (rate drops > 50%), errors (error rate > 1%), and saturation (heap > 85%). Set up PagerDuty or Opsgenie integration for critical alerts (error rate spike, heap critical) and Slack for warning alerts (latency elevated, heap warning). A well-configured alert system means the on-call engineer is notified of production issues before users report them — and often before users even notice.
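The thresholds in this tip can be expressed as Prometheus alerting rules, roughly as follows. The HTTP metric names are assumptions — match them to whatever the app's /metrics endpoint actually exports (the heap metrics are prom-client defaults):

```yaml
# Sketch of alert rules for the four golden signals (thresholds from the tip above).
groups:
    - name: golden-signals
      rules:
          - alert: HighLatencyP95
            expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 0.5
            for: 5m
            labels: { severity: warning }
          - alert: HighErrorRate
            expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.01
            for: 5m
            labels: { severity: critical }
          - alert: TrafficDrop
            expr: sum(rate(http_requests_total[10m])) < 0.5 * sum(rate(http_requests_total[10m] offset 1h))
            for: 10m
            labels: { severity: warning }
          - alert: HeapNearLimit
            expr: nodejs_heap_size_used_bytes / nodejs_heap_size_total_bytes > 0.85
            for: 10m
            labels: { severity: critical }
```

Routing critical alerts to PagerDuty/Opsgenie and warnings to Slack is then a matter of matching on the severity label in the notification policy.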
Warning: SSL/TLS termination in production is non-negotiable. All traffic must use HTTPS — HTTP requests should be redirected with 301. The refresh token cookie relies on secure: true which only works over HTTPS. JWT tokens in transit are vulnerable to interception over HTTP. nginx handles SSL termination using Let’s Encrypt certificates (auto-renewed via certbot) or certificates from Cloudflare. The Express API never sees HTTP in production — nginx handles the TLS handshake and proxies plain HTTP to the API containers on the internal Docker network.
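The nginx side of this warning might look like the following sketch. The upstream names, API port, and certificate paths are assumptions:

```nginx
# Sketch: SSL termination + HTTP→HTTPS redirect + proxy to the API instances.
upstream api_backend {
    server api-1:3000;
    server api-2:3000;
}

server {
    listen 80;
    server_name api.taskmanager.io;
    return 301 https://$host$request_uri;   # all HTTP traffic redirected
}

server {
    listen 443 ssl;
    server_name api.taskmanager.io;
    ssl_certificate     /etc/letsencrypt/live/api.taskmanager.io/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/api.taskmanager.io/privkey.pem;

    location / {
        proxy_pass http://api_backend;           # plain HTTP on the internal Docker network
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;  # WebSocket upgrade for Socket.io
        proxy_set_header Connection "upgrade";
        proxy_set_header X-Forwarded-Proto $scheme;
    }
}
```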

Complete Production Deployment

# ── .github/workflows/deploy.yml — Full deployment pipeline ─────────────
name: Deploy to Production

on:
    push:
        branches: [main]
    workflow_dispatch:
        inputs:
            image-tag:
                description: 'Image tag to deploy (default: current SHA)'
                required:    false

env:
    REGISTRY:     ghcr.io
    API_IMAGE:    ghcr.io/${{ github.repository }}/api
    CLIENT_IMAGE: ghcr.io/${{ github.repository }}/client

jobs:
    # ── CI gate — must pass before any deployment ─────────────────────────
    ci:
        uses: ./.github/workflows/ci.yml  # reusable CI workflow

    # ── Build Docker images ───────────────────────────────────────────────
    build:
        needs: ci
        uses: ./.github/workflows/docker.yml  # reusable Docker build workflow

    # ── Resolve deployment tag ────────────────────────────────────────────
    resolve-tag:
        needs: build
        runs-on: ubuntu-latest
        outputs:
            tag: ${{ steps.resolve.outputs.tag }}
        steps:
            - id: resolve
              run: |
                  if [ -n "${{ inputs.image-tag }}" ]; then
                      echo "tag=${{ inputs.image-tag }}" >> $GITHUB_OUTPUT
                  else
                      echo "tag=sha-$(echo ${{ github.sha }} | cut -c1-7)" >> $GITHUB_OUTPUT
                  fi

    # ── Deploy to staging ─────────────────────────────────────────────────
    deploy-staging:
        needs: resolve-tag
        environment: staging
        runs-on: ubuntu-latest
        steps:
            - uses: actions/checkout@v4

            - name: Deploy to staging
              uses: appleboy/ssh-action@v1
              with:
                  host:     ${{ secrets.STAGING_HOST }}
                  username: deploy
                  key:      ${{ secrets.STAGING_SSH_KEY }}
                  script: |
                      cd /opt/taskmanager
                      export API_TAG=${{ needs.resolve-tag.outputs.tag }}
                      export CLIENT_TAG=${{ needs.resolve-tag.outputs.tag }}
                      docker compose -f docker-compose.yml -f docker-compose.prod.yml \
                          pull api angular
                      docker compose -f docker-compose.yml -f docker-compose.prod.yml \
                          up -d --no-build api angular
                      sleep 15
                      curl -f https://staging-api.taskmanager.io/api/v1/health/ready || exit 1
                      echo "Staging deployed: $API_TAG"

            - name: Run smoke tests
              run: |
                  npx k6 run --vus 10 --duration 30s \
                      -e BASE_URL=https://staging-api.taskmanager.io \
                      k6-scripts/smoke.js

    # ── Production (requires manual approval via GitHub Environment) ───────
    deploy-production:
        needs: [resolve-tag, deploy-staging]
        environment: production   # configured with required reviewers
        runs-on: ubuntu-latest
        concurrency:
            group: deploy-production
            cancel-in-progress: false
        steps:
            - uses: actions/checkout@v4

            - name: Rolling deploy to production
              uses: appleboy/ssh-action@v1
              with:
                  host:     ${{ secrets.PROD_HOST }}
                  username: deploy
                  key:      ${{ secrets.PROD_SSH_KEY }}
                  script: |
                      cd /opt/taskmanager
                      export API_TAG=${{ needs.resolve-tag.outputs.tag }}

                      # Rolling update: api-1 first, then api-2 once api-1 passes the
                      # health check (assumes the compose files define two API services,
                      # api-1 and api-2, both behind nginx)
                      docker pull ${{ env.API_IMAGE }}:$API_TAG
                      docker compose -f docker-compose.yml -f docker-compose.prod.yml \
                          up -d --no-build --no-deps api-1
                      sleep 20
                      curl -f https://api.taskmanager.io/api/v1/health/ready || exit 1

                      docker compose -f docker-compose.yml -f docker-compose.prod.yml \
                          up -d --no-build --no-deps api-2
                      sleep 20
                      curl -f https://api.taskmanager.io/api/v1/health/ready || exit 1

                      # Update static client
                      export CLIENT_TAG=$API_TAG
                      docker pull ${{ env.CLIENT_IMAGE }}:$CLIENT_TAG
                      docker compose -f docker-compose.yml -f docker-compose.prod.yml \
                          up -d --no-build --no-deps angular

                      echo "Production deployed: $API_TAG"

            - name: Verify production health
              run: |
                  for i in {1..5}; do
                      STATUS=$(curl -s -o /dev/null -w "%{http_code}" \
                          https://api.taskmanager.io/api/v1/health/ready)
                      if [ "$STATUS" = "200" ]; then echo "Healthy"; exit 0; fi
                      echo "Attempt $i failed ($STATUS), waiting..."
                      sleep 15
                  done
                  echo "Production health check failed after 5 attempts"
                  exit 1

            - name: Notify success
              uses: slackapi/slack-github-action@v1
              with:
                  channel-id: ${{ secrets.SLACK_DEPLOY_CHANNEL }}
                  slack-message: "✅ Production deployed: `${{ needs.resolve-tag.outputs.tag }}`"
              env:
                  SLACK_BOT_TOKEN: ${{ secrets.SLACK_BOT_TOKEN }}

How It Works

Step 1 — Reusable Workflows Keep the Pipeline DRY

The deployment workflow calls uses: ./.github/workflows/ci.yml and uses: ./.github/workflows/docker.yml as reusable sub-workflows rather than duplicating their steps. The CI workflow is the same one that runs on pull requests — there is no separate "deploy CI" that might diverge from the PR CI. A failed CI means a failed deployment. The Docker build workflow builds both images in parallel and pushes them to GHCR; staging and production then deploy the same immutable registry tags, so the images verified on staging are exactly the images that run in production.
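What makes ci.yml callable from deploy.yml is the workflow_call trigger. A minimal sketch (job names and steps are illustrative, not the book's exact file):

```yaml
# .github/workflows/ci.yml (sketch) — runs on PRs and as a reusable workflow.
name: CI
on:
    pull_request:
    workflow_call:   # allows deploy.yml to invoke this via `uses:`

jobs:
    test:
        runs-on: ubuntu-latest
        steps:
            - uses: actions/checkout@v4
            - uses: actions/setup-node@v4
              with: { node-version: 20 }
            - run: npm ci
            - run: npm test
```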

Step 2 — Staging Deployment Is Automatic, Production Requires Approval

On every merge to main: CI runs, Docker images are built, staging is deployed automatically, and a smoke test (10 virtual users, 30 seconds) validates the deployment. Production requires a human reviewer to click "Approve" in the GitHub UI. This approval gate ensures a human has confirmed the staging deployment looks correct before the same images are promoted to production. The approval also creates an audit trail: who approved which deployment, and when.
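The smoke test invoked in the staging job (k6-scripts/smoke.js) might look like the following sketch. It runs under the k6 runtime, not Node, and the checked endpoint is an assumption based on the health route used elsewhere in the pipeline:

```javascript
// k6-scripts/smoke.js (sketch) — 10 VUs / 30s are supplied on the CLI.
import http from 'k6/http';
import { check } from 'k6';

export default function () {
    const res = http.get(`${__ENV.BASE_URL}/api/v1/health/ready`);
    check(res, {
        'status is 200': (r) => r.status === 200,
        'responds under 500ms': (r) => r.timings.duration < 500,
    });
}
```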

Step 3 — Health Checks Gate Each Step

Every deployment step ends with a health check: curl -f https://host/api/v1/health/ready. The -f flag makes curl exit with a non-zero code (22) on HTTP 4xx/5xx responses, so the || exit 1 guard fails the shell script and, with it, the workflow step. If the API is not ready within the timeout window, the deployment workflow fails — the GitHub Actions interface shows a clear failure, Slack receives a failure notification, and the team investigates before any user traffic is affected.

Step 4 — The /health/ready Endpoint Checks Every Dependency

The /health/ready endpoint returns 200 only when MongoDB is connected (mongoose.connection.readyState === 1), Redis is responding (redis.ping() returns PONG), and the application is not in the middle of shutting down. A deployment that brings up the Express container but MongoDB Atlas hasn’t accepted the connection yet will fail the health check — the deployment pipeline detects this and can alert before routing production traffic to the unhealthy instance.
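The readiness decision described above can be sketched as a small pure function. The helper names are hypothetical — wire mongoReadyState to mongoose.connection.readyState, redisPong to the result of redis.ping(), and shuttingDown to the app's shutdown flag:

```javascript
// Readiness logic (sketch): 200 only when every dependency check passes.
function readiness({ mongoReadyState, redisPong, shuttingDown }) {
    const checks = {
        mongo: mongoReadyState === 1,   // mongoose: 1 = connected
        redis: redisPong === 'PONG',    // redis.ping() returns 'PONG' when healthy
        notShuttingDown: !shuttingDown, // fail fast during graceful shutdown
    };
    const ready = Object.values(checks).every(Boolean);
    return { status: ready ? 200 : 503, checks };
}

// Express route (sketch):
// app.get('/api/v1/health/ready', async (req, res) => {
//     const result = readiness({
//         mongoReadyState: mongoose.connection.readyState,
//         redisPong: await redis.ping().catch(() => null),
//         shuttingDown: isShuttingDown,
//     });
//     res.status(result.status).json(result.checks);
// });
```

Returning the per-dependency checks object in the 503 body makes the GitHub Actions failure log immediately show which dependency was not ready.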

Step 5 — The Complete Stack Demonstrates the Course

The production Task Manager is the sum of every technique covered: Mongoose schemas (Ch 5–9), Express API (Ch 3–8), JWT authentication (Ch 17), Angular signals and forms (Ch 10–16), real-time Socket.io (Ch 18), Redis caching and rate limiting (Ch 5), MongoDB aggregation and Atlas Search (Ch 13), file uploads (Ch 18), Bull job queues (Ch 5), testing (Ch 19), Docker containerisation (Ch 20), GitHub Actions CI/CD (Ch 21), Prometheus monitoring (Ch 22), and the capstone architecture (Ch 23–25). Every chapter contributed something that is running in production.

Quick Reference — Production Deployment

| Task | Command / Config |
| --- | --- |
| Deploy to staging | Automatic on push to main (after CI + Docker build) |
| Deploy to production | Manual approval in GitHub Actions UI |
| Hotfix deploy | workflow_dispatch with a specific image tag |
| Rollback | Dispatch the workflow with the previous SHA as image-tag |
| Check production health | curl https://api.taskmanager.io/api/v1/health |
| View production logs | docker compose logs -f api on the server |
| Redis flush (emergency) | docker compose exec redis redis-cli FLUSHDB — clears cache + rate limits |
| Alert thresholds | p95 > 500ms, error rate > 1%, heap > 85%, traffic drop > 50% |
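A rollback, as noted above, is just a re-deploy of a known-good tag. With the GitHub CLI that is one command (the tag shown is illustrative):

```shell
# Re-dispatch the deploy workflow with a previous image tag (illustrative tag).
gh workflow run deploy.yml -f image-tag=sha-1a2b3c4
```

Because rollback reuses the normal pipeline, it still passes through the staging deployment, smoke test, and production approval gate.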

🧠 Test Yourself

A production deployment fails the health check after 5 attempts. The deployment workflow marks the job as failed. What is the state of the production environment and what should the on-call engineer do first?