Retry Mechanisms and Flake Management — Handling the Unavoidable

Even with perfect waits, stable locators, and independent test data, some test failures are genuinely beyond your control: a brief network blip, a transient server error, a CDN cache miss, or a garbage collection pause in the browser. These rare, unreproducible failures — true flakes — need a management strategy that does not mask real defects. The solution is intelligent retry with tracking: retry failed tests automatically, but track the retry rate and investigate any test that needs retries frequently.

Retry Mechanisms — Controlled Retries Without Masking Defects

The pytest-rerunfailures plugin provides configurable retry logic that re-runs failed tests a specified number of times before marking them as truly failed.

# ── pytest-rerunfailures ──
# Install: pip install pytest-rerunfailures

# Run with 2 retries for all tests:
# pytest --reruns 2

# Run with 2 retries and 5-second delay between retries:
# pytest --reruns 2 --reruns-delay 5

# Mark specific tests for retry (when only some are flaky):
# @pytest.mark.flaky(reruns=3, reruns_delay=2)
# def test_payment_gateway():
#     ...

# ── Flake tracking dashboard ──
# Track retries over time to identify and fix chronic flakes

import json
from datetime import datetime
from pathlib import Path


class FlakeTracker:
    FLAKE_LOG = Path("reports/flake_log.json")

    @classmethod
    def record_retry(cls, test_name, attempt, passed, error_msg=""):
        cls.FLAKE_LOG.parent.mkdir(parents=True, exist_ok=True)

        entry = {
            "timestamp": datetime.now().isoformat(),
            "test": test_name,
            "attempt": attempt,
            "passed": passed,
            "error": error_msg[:200] if error_msg else "",
        }

        # Append to log (read-modify-write is not atomic; unsafe under
        # parallel runs such as pytest-xdist)
        existing = []
        if cls.FLAKE_LOG.exists():
            existing = json.loads(cls.FLAKE_LOG.read_text())
        existing.append(entry)
        cls.FLAKE_LOG.write_text(json.dumps(existing, indent=2))

    @classmethod
    def get_top_flakes(cls, top_n=10):
        if not cls.FLAKE_LOG.exists():
            return []
        entries = json.loads(cls.FLAKE_LOG.read_text())
        # Count retries per test
        retry_counts = {}
        for e in entries:
            if e["attempt"] > 1:  # Only count retries, not first attempts
                retry_counts[e["test"]] = retry_counts.get(e["test"], 0) + 1
        # Sort by frequency
        sorted_flakes = sorted(retry_counts.items(), key=lambda x: -x[1])
        return sorted_flakes[:top_n]


# ── Flake management strategy ──
FLAKE_STRATEGY = [
    {
        "rule": "Retry threshold: maximum 2 retries per test",
        "why": "More than 2 retries masks real defects and slows the suite",
    },
    {
        "rule": "Track every retry in a flake log",
        "why": "Visibility into which tests flake and how often",
    },
    {
        "rule": "Review top flakers weekly",
        "why": "Tests that retry > 3 times/week need root cause investigation",
    },
    {
        "rule": "Set a team flake budget: < 2% of total test runs",
        "why": "If > 2% of runs involve retries, the suite has systemic issues",
    },
    {
        "rule": "Quarantine chronic flakes after investigation",
        "why": "Move persistently flaky tests to a separate run; fix or delete them",
    },
    {
        "rule": "Never add retries to a new test",
        "why": "New tests should pass reliably. If a new test is flaky, fix it immediately",
    },
]

print("Flake Management Strategy")
print("=" * 60)
for rule in FLAKE_STRATEGY:
    print(f"\n  Rule: {rule['rule']}")
    print(f"  Why:  {rule['why']}")

print("\n\npytest-rerunfailures Commands:")
print("  pytest --reruns 2                           # Retry all failures twice")
print("  pytest --reruns 2 --reruns-delay 5           # 5s delay between retries")
print("  @pytest.mark.flaky(reruns=3, reruns_delay=2) # Per-test retry config")
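The FlakeTracker above records entries, but nothing feeds it yet. A minimal conftest.py sketch of the wiring, assuming pytest-rerunfailures marks a retried attempt's report outcome as "rerun" (the `record_rerun` helper is a standalone stand-in for `FlakeTracker.record_retry`, using the same log schema):

```python
# conftest.py (sketch) — feed rerun events into the flake log.
# Assumption: pytest-rerunfailures sets report.outcome to "rerun" for
# attempts that will be retried; only the final failure stays "failed".
import json
from datetime import datetime
from pathlib import Path

FLAKE_LOG = Path("reports/flake_log.json")


def record_rerun(test_name, error_msg=""):
    # Standalone stand-in for FlakeTracker.record_retry (same schema).
    FLAKE_LOG.parent.mkdir(parents=True, exist_ok=True)
    entries = json.loads(FLAKE_LOG.read_text()) if FLAKE_LOG.exists() else []
    entries.append({
        "timestamp": datetime.now().isoformat(),
        "test": test_name,
        "attempt": 2,  # a "rerun" report means at least one retry happened
        "passed": False,
        "error": error_msg[:200],
    })
    FLAKE_LOG.write_text(json.dumps(entries, indent=2))


def pytest_runtest_logreport(report):
    # Called once per test phase; only the "call" phase matters here.
    if report.when == "call" and report.outcome == "rerun":
        record_rerun(report.nodeid, str(report.longrepr or ""))
```

In a real suite you would call `FlakeTracker.record_retry` instead of the local helper; the hook itself is the only extra piece needed.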

Note: The distinction between a flaky test and a real failure is: a flaky test passes on retry without any code changes. If a test fails, is retried, and passes — the failure was transient (network, timing, resource contention). If it fails on all retries, the failure is real and should be investigated. pytest-rerunfailures reports these differently: “rerun” for transient failures, “failed” for persistent ones. Use this distinction in your CI dashboard to separate noise from signal.

Tip: Implement a “flake budget” — a team-level metric that tracks the percentage of test runs that involve retries. Set a threshold (e.g., less than 2%). If the budget is exceeded, the team prioritises flake investigation over new test development. This creates accountability: flaky tests are not ignored — they consume a measurable budget that the team monitors.
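The budget check is simple arithmetic over CI run counts; a minimal sketch (the 2% threshold matches the strategy above, and the sample numbers are illustrative):

```python
def flake_budget_status(total_runs, runs_with_retries, budget_pct=2.0):
    """Return the retry rate (%) and whether it is within the flake budget."""
    rate = 100.0 * runs_with_retries / total_runs
    return rate, rate <= budget_pct


# Example: 1000 CI runs this week, 35 involved at least one retry.
rate, within_budget = flake_budget_status(1000, 35)
print(f"Retry rate: {rate:.1f}%; within budget: {within_budget}")
# → Retry rate: 3.5%; within budget: False
```

A rate above the threshold is the trigger for the team to shift effort from new tests to flake investigation.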

Warning: Retries are a management strategy, not a fix. A test that retries 3 times and passes on the fourth attempt is not a healthy test — it is a test with a timing bug, a shared-state issue, or an environment dependency that should be investigated and fixed. Use retries to keep the CI pipeline green while you investigate. Never use retries as a permanent substitute for proper synchronisation and test isolation.

Common Mistakes

Mistake 1 — Applying a high retry count to every test

❌ Wrong: pytest --reruns 5 globally — masks real defects, and a consistently failing test now runs six times (one original attempt plus five reruns) before it is reported as failed.

✅ Correct: pytest --reruns 2 globally (conservative), with @pytest.mark.flaky(reruns=3) only on specific tests known to be affected by external transient conditions (payment gateway, third-party API).

Mistake 2 — Not tracking or reviewing retry data

❌ Wrong: Retries are configured, CI is green, nobody looks at how many retries are occurring.

✅ Correct: A weekly flake review where the team examines the top 5 most-retried tests, investigates root causes, and either fixes the underlying issue or quarantines the test.
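The weekly review can start from the flake log itself; the counting below mirrors `FlakeTracker.get_top_flakes` on a hand-written sample (the test names are hypothetical):

```python
from collections import Counter

# Sample flake-log entries (same schema the tracker writes).
sample_log = [
    {"test": "test_payment_gateway", "attempt": 2, "passed": True},
    {"test": "test_payment_gateway", "attempt": 2, "passed": True},
    {"test": "test_payment_gateway", "attempt": 3, "passed": True},
    {"test": "test_search_results", "attempt": 2, "passed": True},
    {"test": "test_login", "attempt": 1, "passed": True},  # first attempt, not a retry
]

# Count only retries (attempt > 1), then list the most frequent offenders.
retry_counts = Counter(e["test"] for e in sample_log if e["attempt"] > 1)
for test, count in retry_counts.most_common(5):
    print(f"{count:3d} retries  {test}")
```

The top of that list is the agenda for the weekly review: fix the root cause, or quarantine the test per the strategy above.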

🧠 Test Yourself

A test has been retrying 4-5 times per week for the past month. Each time it passes on the second attempt. What should the team do?