WebSocket Reliability — Reconnection, Heartbeats and Fallbacks

A WebSocket connection in a real application faces many adversaries: mobile networks that drop for a few seconds, laptops that sleep and wake, corporate proxies that time out idle connections, and browsers that throttle background tabs. Robust WebSocket handling requires automatic reconnection with backoff, heartbeats to detect phantom connections that the OS has not yet reported as closed, graceful fallback to polling in environments where WebSockets are blocked, and careful handling of tab visibility to conserve resources. These patterns are what separate production-grade real-time features from demo-quality ones.

Exponential Backoff Reconnection

// Already built into useWebSocket — this lesson shows the algorithm explicitly

function calculateBackoffDelay(attempt, options = {}) {
    const {
        baseDelay  = 1000,   // 1 second
        maxDelay   = 30_000, // 30 seconds cap
        jitter     = true,   // add randomness to prevent thundering herd
    } = options;

    const exponential = baseDelay * Math.pow(2, attempt);
    const capped      = Math.min(exponential, maxDelay);

    if (jitter) {
        // Apply ±20% random jitter (multiplier between 0.8 and 1.2)
        return capped * (0.8 + Math.random() * 0.4);
    }
    return capped;
}

// Attempt → delay:
// 0 → ~1s
// 1 → ~2s
// 2 → ~4s
// 3 → ~8s
// 4 → ~16s
// 5+ → ~30s (capped)

// The jitter prevents all clients from reconnecting simultaneously
// after a server restart ("thundering herd" problem)
Note: Jitter is randomness added to retry delays to prevent the “thundering herd” problem. If a server restarts and 1,000 clients all reconnect at exactly 5 seconds, they create a simultaneous spike of connection attempts that can overwhelm the server. By adding ±20% random variation to each client’s delay, the reconnection attempts spread out over time, giving the server a chance to handle them gracefully. Always include jitter in production reconnection logic.
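
To make the effect of jitter concrete, here is a small simulation sketch of the scenario in the Note. The function names, the 100 ms window and the client count are illustrative assumptions, not part of useWebSocket.

```javascript
// Illustrative simulation only: names and numbers are assumptions.
// Returns the time (ms) at which each client retries after a restart.
function reconnectTimes(clients, baseDelayMs, jitter, random = Math.random) {
    return Array.from({ length: clients }, () =>
        jitter ? baseDelayMs * (0.8 + random() * 0.4) : baseDelayMs
    );
}

// Largest number of clients arriving inside any single window of bucketMs.
function peakPerBucket(times, bucketMs = 100) {
    const counts = new Map();
    for (const t of times) {
        const bucket = Math.floor(t / bucketMs);
        counts.set(bucket, (counts.get(bucket) || 0) + 1);
    }
    return Math.max(...counts.values());
}

const withoutJitter = peakPerBucket(reconnectTimes(1000, 5000, false));
const withJitter    = peakPerBucket(reconnectTimes(1000, 5000, true));

// withoutJitter is 1000: every client lands in the same 100 ms window.
// withJitter varies per run; the retries are spread over roughly 4 s to 6 s.
console.log({ withoutJitter, withJitter });
```

With ±20% jitter on a 5-second delay the retries are spread across a 2-second range, so the peak load in any one window drops sharply.
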
Tip: Pause WebSocket reconnection when the browser tab is hidden (document.visibilityState === "hidden") and resume when it becomes visible again. Background tabs that constantly retry connections waste battery and network resources without providing any user value. Use the visibilitychange event: document.addEventListener("visibilitychange", () => { if (document.visibilityState === "visible") reconnect(); else pauseReconnect(); }). When the user returns to the tab, reconnect immediately rather than waiting for the next backoff timer.
Warning: WebSocket heartbeats detect stale connections — connections that appear open at the OS level but are actually dead (the remote end crashed without sending a close frame, or a NAT device silently dropped the connection). Without heartbeats, you can have a “zombie” connection that never receives messages and never fires onclose. Send a {"type": "ping"} JSON message every 30 seconds and expect a {"type": "pong"} response within 10 seconds — if it does not arrive, close and reconnect.

Application-Level Heartbeat

// Add to useWebSocket hook
const heartbeatRef    = useRef(null);
const heartbeatTimer  = useRef(null);

function startHeartbeat(ws) {
    stopHeartbeat();               // never leave an earlier interval running
    heartbeatRef.current = null;   // reset pong flag

    heartbeatTimer.current = setInterval(() => {
        if (ws.readyState !== WebSocket.OPEN) return;

        if (heartbeatRef.current === false) {
            // No pong since the previous ping (one full 30 s interval):
            // the connection is a zombie, so force a reconnect
            ws.close(4000, "Heartbeat timeout");
            return;
        }

        heartbeatRef.current = false;   // waiting for pong
        ws.send(JSON.stringify({ type: "ping" }));
    }, 30_000);
}

function stopHeartbeat() {
    clearInterval(heartbeatTimer.current);
    heartbeatTimer.current = null;   // mark the heartbeat as stopped
}

// In onopen:    startHeartbeat(ws);
// In onclose:   stopHeartbeat();
// In onmessage: if (data.type === "pong") heartbeatRef.current = true;
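
The interval-based version above only notices a missing pong on the next 30-second tick. A possible variant with a separate 10-second pong deadline, matching the numbers in the Warning, might look like the sketch below. createHeartbeat and its option names are hypothetical; the timer functions are injectable only so that the logic can be exercised without waiting.

```javascript
// Sketch only: createHeartbeat and its options are hypothetical names.
function createHeartbeat(ws, options = {}) {
    const {
        pingInterval  = 30_000,   // send a ping every 30 seconds
        pongTimeout   = 10_000,   // expect the pong within 10 seconds
        startInterval = setInterval,
        stopInterval  = clearInterval,
        startTimeout  = setTimeout,
        stopTimeout   = clearTimeout,
    } = options;

    let intervalId = null;
    let deadlineId = null;

    function stop() {
        stopInterval(intervalId);
        stopTimeout(deadlineId);
        intervalId = null;
        deadlineId = null;
    }

    return {
        // Call from onopen.
        start() {
            stop();
            intervalId = startInterval(() => {
                if (ws.readyState !== 1) return;   // 1 is WebSocket.OPEN
                ws.send(JSON.stringify({ type: "ping" }));
                stopTimeout(deadlineId);
                // If no pong cancels this timer, treat the socket as dead.
                deadlineId = startTimeout(() => {
                    ws.close(4000, "Heartbeat timeout");
                }, pongTimeout);
            }, pingInterval);
        },
        // Call from onmessage when data.type === "pong".
        pong() {
            stopTimeout(deadlineId);
            deadlineId = null;
        },
        // Call from onclose.
        stop,
    };
}
```

A caller would create one heartbeat per socket, call start() in onopen, pong() when a pong message arrives, and stop() in onclose.
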

Visibility-Based Reconnection

// Add to useWebSocket hook — pause reconnect when tab is hidden
useEffect(() => {
    function handleVisibilityChange() {
        if (document.visibilityState === "visible") {
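            // Tab became visible: reconnect immediately unless a socket is
            // already open or in the middle of connecting
            const state = wsRef.current?.readyState;
            if (state !== WebSocket.OPEN && state !== WebSocket.CONNECTING && enabled) {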
                clearTimeout(reconnectRef.current);
                attemptRef.current = 0;   // reset backoff on visibility change
                connect();
            }
        } else {
            // Tab hidden — cancel pending reconnect timer to save resources
            clearTimeout(reconnectRef.current);
        }
    }

    document.addEventListener("visibilitychange", handleVisibilityChange);
    return () => document.removeEventListener("visibilitychange", handleVisibilityChange);
}, [connect, enabled]);
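
The effect above relies on reconnectRef, attemptRef and connect from useWebSocket, which this lesson does not show. As a rough, framework-free sketch of that timer bookkeeping, assuming the same backoff formula as earlier, the pieces could fit together as follows. createReconnector and its method names are hypothetical.

```javascript
// Sketch only: createReconnector is a hypothetical helper, not the hook's real code.
function backoffDelay(attempt, baseDelay = 1000, maxDelay = 30_000, random = Math.random) {
    const capped = Math.min(baseDelay * 2 ** attempt, maxDelay);
    return capped * (0.8 + random() * 0.4);   // ±20% jitter
}

function createReconnector(connect, options = {}) {
    const {
        random       = Math.random,
        startTimeout = setTimeout,
        stopTimeout  = clearTimeout,
    } = options;

    let attempt = 0;    // consecutive failed attempts so far
    let timer   = null; // pending reconnect timer, if any

    function cancel() {
        stopTimeout(timer);
        timer = null;
    }

    return {
        // onclose: wait for the backoff delay, then try again.
        schedule() {
            cancel();
            const delay = backoffDelay(attempt, 1000, 30_000, random);
            attempt += 1;
            timer = startTimeout(connect, delay);
            return delay;
        },
        // onopen: a successful connection resets the backoff.
        reset() {
            cancel();
            attempt = 0;
        },
        // Tab hidden or component unmounted: stop retrying.
        cancel,
        // Tab visible again: skip the remaining wait and start over from attempt 0.
        retryNow() {
            cancel();
            attempt = 0;
            connect();
        },
    };
}
```

In this sketch the visibility handler would call retryNow() in the visible branch and cancel() in the hidden branch.
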

HTTP Polling Fallback

// src/hooks/useRealtimeComments.js: WebSocket with polling fallback
import { useState } from "react";
import { useWebSocket } from "@/hooks/useWebSocket";
import { useGetCommentsQuery } from "@/store/apiSlice";

export function useRealtimeComments(postId) {
    const [wsWorking, setWsWorking] = useState(false);

    const { status } = useWebSocket(`/ws/posts/${postId}/live`, {
        onMessage: (data) => {
            if (data.type === "room_joined") setWsWorking(true);
        },
    });

    // Fallback: poll every 15s if WebSocket not working
    const { data } = useGetCommentsQuery(postId, {
        pollingInterval: wsWorking ? 0 : 15_000,   // 0 = no polling when WS works
        skip: wsWorking,
    });

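    // NOTE: wsWorking is never set back to false here, so polling does not
    // resume if the socket drops later; reset it wherever the hook reports a close.
    return { comments: data, wsWorking, status };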
}
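
The hook above only ever switches wsWorking on. One possible policy for deciding when to poll, written as plain functions so it is easy to reason about, is sketched below. createFallbackPolicy, maxFailures and the callback names are all hypothetical, and the threshold of three failures is an arbitrary assumption.

```javascript
// Sketch only: a hypothetical policy object, not part of useWebSocket or RTK Query.
function createFallbackPolicy({ pollMs = 15_000, maxFailures = 3 } = {}) {
    let confirmed = false;   // true once the server has acknowledged the socket
    let failures  = 0;       // consecutive closes without a confirmation

    return {
        // Call when a message such as "room_joined" proves the socket works.
        onConfirmed() {
            confirmed = true;
            failures = 0;
        },
        // Call from onclose so that polling resumes while disconnected.
        onClosed() {
            confirmed = false;
            failures += 1;
        },
        // 0 disables polling, matching the pollingInterval convention used above.
        pollingInterval() {
            return confirmed ? 0 : pollMs;
        },
        // After repeated failures the socket is probably blocked.
        looksBlocked() {
            return failures >= maxFailures;
        },
    };
}
```

A caller could pass pollingInterval() to the query options and, once looksBlocked() returns true, stop retrying the socket and rely on polling alone.
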

Complete Reliability Checklist

Issue                                             | Solution
Network drop                                      | Exponential backoff reconnect
Thundering herd after server restart              | Jitter in backoff delay
Zombie (phantom) connection                       | Application-level heartbeat ping/pong
Background tab battery drain                      | Pause reconnect on visibilitychange hidden
Slow reconnect when the user returns to the tab   | Reconnect immediately on visibilitychange visible
Corporate proxy blocks WebSocket                  | HTTP polling fallback
Viewer count resets when the server restarts      | Keep last known count during reconnect

🧠 Test Yourself

After a server restart, 500 clients all have their WebSocket connections closed simultaneously. Without jitter, they all retry after exactly 2 seconds. What problem does this cause and how does jitter fix it?