Pytest & CI

Debugging Flaky Tests with pytest-rerunfailures

A test that passes locally but fails roughly one run in twenty under CI — then passes on the automatic retry — is the canonical flaky test, and reaching for pytest-rerunfailures to silence it usually buries the real defect. Used deliberately, the plugin is a diagnostic amplifier: it forces a non-deterministic execution path into a repeatable state where you can capture sys.modules, GC object counts, and thread counts at the exact moment of failure. This guide shows the rerun mechanics, a state-capture hook, and the retry policy that surfaces root causes instead of hiding them.

Prerequisites

  • pytest-rerunfailures >= 14.0, pytest >= 8.0, Python 3.9+.
  • Optional companions: pytest-randomly (order shuffling) and pytest-xdist >= 3.0 (parallel runs — note reruns are worker-local).
  • Markers registered so @pytest.mark.flaky does not warn:
TOML
# pyproject.toml
[tool.pytest.ini_options]
addopts = "--strict-markers"
markers = ["flaky: test with known transient instability under controlled rerun"]

The collection ordering that makes order-dependent flakiness reproducible is covered in Optimizing Test Discovery; timing-based flakiness in async suites overlaps with Debugging Async Code and Event Loops.

Solution

pytest-rerunfailures registers a pytest_runtest_makereport implementation. On a call-phase failure it re-invokes pytest_runtest_protocol for the same item: setup runs again (so function-scoped fixtures are recreated) but teardown is deferred until the final attempt. Anchor your diagnostics to makereport, not teardown, and dump state the instant the failure is observed:

Python
# conftest.py
import pytest, sys, gc, threading, json

@pytest.hookimpl(hookwrapper=True)
def pytest_runtest_makereport(item, call):
    outcome = yield                       # let the report be built first
    report = outcome.get_result()
    # Only snapshot on the failing call phase, before the plugin retries.
    if report.when == "call" and report.failed:
        snapshot = {
            "nodeid": item.nodeid,
            "sys_modules": len(sys.modules),       # detects import-time leakage
            "gc_objects": len(gc.get_objects()),   # rising count => retained objects
            "thread_count": len(threading.enumerate()),  # leaked workers / pools
            "rerun": getattr(report, "rerun", 0),  # which attempt this was
        }
        fname = f"flaky_state_{item.nodeid.replace('::', '_')}_{snapshot['rerun']}.json"
        with open(fname, "w") as fh:
            json.dump(snapshot, fh, indent=2)

Comparing gc_objects and thread_count across the per-attempt artifacts separates a memory leak (monotonic growth) from transient concurrency contention (flat counts, intermittent failure). Reproduce verbosely first:

Bash
# Stream real-time traces CI normally suppresses, and shuffle order.
pytest --reruns=3 --capture=no --log-cli-level=DEBUG -p randomly

Apply a delay only when timing is the suspect, and keep teardown explicit so state cannot leak across attempts:

Python
import pytest
from sqlalchemy import create_engine, text

@pytest.fixture
def db_session():
    conn = create_engine("sqlite:///:memory:").connect()
    yield conn
    conn.execute(text("DROP TABLE IF EXISTS test_data"))  # reset before final teardown
    conn.close()
    assert conn.closed, "Connection leak detected during rerun teardown"

@pytest.mark.flaky(reruns=3, reruns_delay=2)  # delay lets a transient partition heal
def test_writes_then_reads(db_session):
    ...

Why this works

The plugin replays only the setup and call phases per attempt and runs teardown exactly once after the last attempt, so a function-scoped fixture is genuinely fresh each retry while wider-scoped fixtures are not. Capturing state inside the makereport hookwrapper records the process exactly when the failure is observed — before any retry mutates it — turning a Heisenbug into a diff between artifacts. A passing-with-delay, failing-without-delay result is strong evidence of timing-dependent resource contention rather than an assertion error, which tells you where to look instead of just hiding the symptom.

Edge cases and failure modes

  • Wider-scoped fixtures compound corruption. Session/module/class fixtures never reset between reruns, so a parametrized test sharing one of them inherits mutated state from earlier failures. Convert to function scope or clear state in the yield teardown.
  • Hypothesis shrinking conflicts. Reruns break Hypothesis's deterministic shrinking, causing infinite reduction or false passes. Check the PYTEST_RERUN_COUNT environment variable and skip property tests during retries, or isolate Hypothesis suites from retry logic entirely — see Hypothesis Framework Fundamentals.
  • Worker-local reruns under pytest-xdist. A failure on worker A retries only on A; shared DBs, ports, or locks cause cross-worker cascades. Route flaky subsets to sequential execution.
  • Coverage / JUnit XML overwrite. Each attempt can overwrite reports. Use --cov-append plus coverage combine, and collect per-attempt JUnit artifacts to preserve failure history.
  • --reruns as a permanent fix. It masks genuine regressions and inflates pipeline time. Cap at 2-3, pair with failure-signature hashing, and quarantine tests that keep recurring.

Frequently Asked Questions

Does pytest-rerunfailures reset fixture state between retry attempts? Only function-scoped fixtures are recreated per rerun attempt. Module, class, and session-scoped fixtures persist across retries and carry mutated state forward. To force a full reset, convert fixtures to function scope or clear mutable state in a yield-based teardown.

How do I stop pytest-rerunfailures from masking genuine failures? Keep --reruns low (2-3), add --reruns-delay to space retries, and log failure signatures in a pytest_runtest_makereport hook. Alert on identical failure hashes recurring across pipeline runs so persistent logical bugs are escalated rather than silently retried.

Can I combine pytest-rerunfailures with pytest-xdist? Yes, but reruns are worker-local: a test that fails on worker A retries only on worker A. Shared databases, file locks, or ports can cause cascading failures across workers, so route flaky subsets to sequential execution or isolate per-worker resources.

← Back to Optimizing Test Discovery