Understanding and Managing Flaky Tests
Flaky tests are one of the biggest threats to a healthy CI/CD pipeline because they break trust: if a test sometimes fails for no good reason, people start ignoring all failures. QA engineers must learn how to detect, manage, and eliminate flakiness.
A flaky test is one that can pass and fail on the same code due to timing, environment issues, data clashes or hidden dependencies. In CI/CD, such tests cause intermittent red builds, reruns and frustration.
# Example: marking a flaky job for investigation (conceptual)
jobs:
  e2e_tests:
    runs-on: ubuntu-latest
    continue-on-error: true  # temporarily, while investigating
    steps:
      - name: Run E2E tests
        run: npm run test:e2e
      - name: Upload flaky test report
        if: failure()
        run: ./scripts/collect-flaky-tests.sh
Typical fixes include improving waits, stabilising test data, isolating environments and removing hidden dependencies on time or external systems.
Common Mistakes
Mistake 1: Accepting flaky tests as "just how UI tests are"
This destroys confidence in the whole suite.
❌ Wrong: Ignoring flaky failures or always clicking "rerun" without investigating.
✅ Correct: Log, triage, and resolve flaky tests as high-priority work.
Mistake 2: Overusing retries to "fix" flakiness
This hides real defects instead of surfacing them.
❌ Wrong: Setting high retry counts so tests eventually pass.
✅ Correct: Use limited retries mainly for diagnostics, then fix root causes.