Debugging Flaky Tests with pytest-rerunfailures
Flaky tests represent one of the most insidious forms of technical debt in modern Python codebases. Unlike deterministic failures, flakiness manifests intermittently, often masking underlying race conditions, state leakage, or environmental volatility. When diagnosing these failures, pytest-rerunfailures is frequently mischaracterized as a superficial CI workaround. In reality, when deployed with architectural precision, it functions as a diagnostic amplifier that forces non-deterministic execution paths into observable, repeatable states.
A rigorous flakiness taxonomy typically isolates four primary vectors: timing-dependent I/O, shared mutable state, asynchronous event loop interference, and external service volatility. pytest-rerunfailures (version >=12.0 paired with pytest>=7.4 and Python 3.9+) intercepts the test runner lifecycle to execute controlled retry cycles. However, its default behavior does not reset session-scoped resources or clear module-level globals, which means improper configuration can silently compound state corruption across attempts. To leverage this plugin effectively, engineers must first understand how it interacts with pytest’s internal hook system. The plugin’s architecture relies heavily on pytest’s extensible runner model, a concept thoroughly documented in Advanced Pytest Architecture & Configuration. By treating reruns as a controlled stress test rather than a blind retry mechanism, teams can isolate the exact execution boundary where determinism breaks down, establish a diagnostic baseline, and implement targeted remediation without compromising test suite integrity.
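As a baseline, the plugin can be enabled globally from the command line or applied selectively with its flaky marker; the test below is a placeholder, not a real endpoint check:

```python
import pytest

# Retry up to 3 times with a 1-second pause between attempts; the
# equivalent global form is: pytest --reruns 3 --reruns-delay 1
@pytest.mark.flaky(reruns=3, reruns_delay=1)
def test_external_api():
    # Placeholder body; a real test would exercise the volatile dependency.
    assert True
```

The marker form is preferable during diagnosis because it scopes the retry policy to the one test under investigation rather than the whole suite.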
Anatomy of Rerun Mechanics & Hook Execution
Understanding how pytest-rerunfailures manipulates the test lifecycle is critical to avoiding false positives and hidden state corruption. The plugin operates by registering a pytest_runtest_makereport hook implementation. When a test item executes, pytest generates a TestReport object at three distinct phases: setup, call, and teardown. The plugin monitors these reports and, upon detecting a failure during the call phase, intercepts the standard execution flow. Instead of allowing the runner to proceed to teardown and mark the item as failed, the plugin triggers a secondary invocation of pytest_runtest_protocol for the same test item.
This architectural choice has profound implications for fixture lifecycle management. During a rerun cycle, setup executes again, meaning function-scoped fixtures are freshly instantiated. However, teardown is deliberately deferred until the final rerun attempt concludes. This design prevents premature resource cleanup that could trigger cascading failures during intermediate retries, but it also means that any state mutation occurring in a function-scoped fixture persists across attempts unless explicitly cleared. Session-scoped, module-scoped, and class-scoped fixtures are never recreated between reruns. They maintain their original instantiation state, which is why tests relying on shared database connections, in-memory caches, or singleton objects frequently exhibit compounding flakiness under retry conditions.
The hook execution order during reruns follows a strict sequence:
1. pytest_runtest_protocol initiates for attempt N.
2. pytest_runtest_setup executes (re-running setup hooks).
3. pytest_runtest_call executes the actual test function.
4. pytest_runtest_makereport evaluates the outcome.
5. If the test failed and the reruns threshold is not met, the plugin increments the attempt counter and loops back to step 2.
6. If the test passed or the threshold is exhausted, pytest_runtest_teardown finally executes.
This lifecycle means that any diagnostic logging or state inspection must be anchored to pytest_runtest_makereport rather than pytest_runtest_teardown. Engineers frequently misdiagnose fixture teardown failures because they assume cleanup runs per attempt. In reality, if a test fails three times, teardown runs exactly once, carrying the accumulated state from all three attempts. Recognizing this boundary is essential when implementing retry-safe fixtures or debugging resource exhaustion patterns.
Rapid Diagnosis: Isolating State Leakage & Race Conditions
Isolating the root cause of flakiness requires a systematic approach that strips away environmental noise and forces the failure into a deterministic context. The first step is disabling output capture and enabling verbose logging during reruns. Executing pytest --reruns=3 --capture=no --log-cli-level=DEBUG forces pytest to stream real-time execution traces, revealing hidden race conditions or delayed I/O operations that standard CI runners suppress.
When standard logging proves insufficient, implementing a custom pytest_runtest_makereport hook provides immediate visibility into process state at the exact moment of failure. The following conftest.py implementation captures critical runtime metrics before the plugin triggers a retry:
import pytest
import sys
import gc
import threading
import json
@pytest.hookimpl(hookwrapper=True)
def pytest_runtest_makereport(item, call):
    outcome = yield
    report = outcome.get_result()
    if report.when == "call" and report.failed:
        state_snapshot = {
            "sys_modules": len(sys.modules),
            "gc_objects": len(gc.get_objects()),
            "thread_count": len(threading.enumerate()),
        }
        # Node IDs contain path separators; normalize them so the
        # artifact lands in the current directory with a valid name.
        safe_id = item.nodeid.replace("::", "_").replace("/", "_")
        with open(f"flaky_state_{safe_id}.json", "w") as f:
            json.dump(state_snapshot, f, indent=2)
This diagnostic pattern is particularly effective for identifying module-level import side effects, unclosed file descriptors, or thread pool exhaustion. By comparing the gc_objects and thread_count values across successive rerun artifacts, engineers can quickly distinguish between memory leaks and transient concurrency bottlenecks.
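Comparing rerun artifacts can be scripted with the stdlib alone; the helper below is a sketch that assumes snapshots shaped like the flaky_state_*.json files produced above:

```python
def classify_flakiness(snapshots):
    """Given per-attempt state snapshots (oldest first), return the growth
    in each tracked resource between the first and last attempt."""
    first, last = snapshots[0], snapshots[-1]
    return {key: last[key] - first[key] for key in first}

# Two illustrative attempts: object count grows while the thread count is
# flat, pointing at retained objects rather than a concurrency bottleneck.
deltas = classify_flakiness([
    {"sys_modules": 412, "gc_objects": 80000, "thread_count": 5},
    {"sys_modules": 412, "gc_objects": 95000, "thread_count": 5},
])
print(deltas)  # {'sys_modules': 0, 'gc_objects': 15000, 'thread_count': 0}
```

A flat delta across many attempts argues for transient contention; a monotonically growing one argues for a leak.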
Global state mutation remains the most common source of flakiness in large test suites. Python’s module import cache (sys.modules) and class-level attributes are frequently mutated by test functions without proper restoration. To detect this, run tests with --strict-markers and use pytest-randomly or pytest-random-order to randomize execution order. When collection order changes, hidden dependencies between tests surface immediately. Understanding how pytest resolves and orders test items during collection is critical for reproducing these order-dependent failures locally, a process extensively covered in Optimizing Test Discovery. By deliberately shuffling execution sequences, you force the test runner to violate implicit ordering assumptions, making state leakage reproducible on demand.
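A lightweight way to surface import-cache mutation is to diff sys.modules around a suspect call. In a real suite this diff would live in an autouse fixture; here the core logic is shown standalone, with colorsys standing in for any hidden import side effect:

```python
import sys

def detect_new_modules(fn):
    """Run fn and return the names of modules it imported as a side effect."""
    before = set(sys.modules)
    fn()
    return set(sys.modules) - before

def suspect():
    import colorsys  # stands in for a hidden import side effect

leaked = detect_new_modules(suspect)
print(sorted(leaked))
```

The same before/after pattern works for any global registry: class attributes, logging handlers, or monkeypatched builtins.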
For race conditions involving external services or database locks, cross-reference rerun timestamps with transaction logs and network trace captures. Implement explicit backoff using pytest.mark.flaky(reruns=3, reruns_delay=2) to allow transient network partitions or connection pool exhaustion to resolve naturally. If the test passes consistently with a delay but fails without it, the root cause is almost certainly timing-dependent resource contention rather than logical assertion errors.
Profiling Flaky Execution with cProfile & pytest-profiling
Performance degradation often masquerades as flakiness when timeouts, connection pool starvation, or blocking I/O introduce non-deterministic execution windows. Attaching profilers directly to rerun cycles provides quantitative evidence to separate true flakiness from infrastructure bottlenecks. The pytest-profiling plugin can be integrated with rerun markers to generate per-attempt execution traces, but for granular control, wrapping individual test functions with cProfile yields immediate diagnostic output.
import cProfile
import pstats
import io
import pytest

@pytest.mark.flaky(reruns=2)
@pytest.mark.asyncio  # async test functions require a runner such as pytest-asyncio
async def test_async_endpoint(client):  # `client` is an async HTTP client fixture
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        response = await client.get("/api/data")
        assert response.status_code == 200
    finally:
        profiler.disable()
        s = io.StringIO()
        ps = pstats.Stats(profiler, stream=s).sort_stats("cumulative")
        ps.print_stats(10)
        print(f"Rerun Profile:\n{s.getvalue()}")
This pattern isolates cumulative execution time per attempt. If the first attempt shows high latency in database connection initialization while subsequent attempts complete rapidly, the flakiness stems from cold-start overhead rather than assertion logic. Conversely, if cumulative time increases linearly across attempts, you are likely observing resource accumulation or connection pool exhaustion.
For memory-related flakiness, integrate tracemalloc to track object allocation across reruns. Start the tracer in a session-scoped fixture and dump snapshots on failure. Significant delta growth in tracemalloc statistics between attempts indicates objects are being retained across rerun boundaries, typically due to circular references in cached responses or uncollected generator states.
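A minimal tracemalloc diff, shown here outside pytest; in a real suite the baseline would be captured in a session-scoped fixture and compared inside the diagnostic hook on each failed attempt:

```python
import tracemalloc

tracemalloc.start()
baseline = tracemalloc.take_snapshot()

# Stands in for objects retained across a rerun boundary.
retained = [bytearray(1024) for _ in range(100)]

current = tracemalloc.take_snapshot()
# Positive size_diff entries are allocations retained since the baseline.
top = current.compare_to(baseline, "lineno")[:3]
for stat in top:
    print(stat)
tracemalloc.stop()
```

If the dominant size_diff lines keep growing across attempts, the retained objects survive rerun boundaries and the fixture graph, not the assertion, is the suspect.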
Thread and async event loop interference requires specialized monitoring. Use threading.enumerate() and asyncio.all_tasks() within diagnostic hooks to verify that background workers or scheduled coroutines complete before teardown. When pytest-xdist is enabled, parallel workers can amplify timing-dependent failures due to shared resource contention. Profile worker-local execution with pytest-profiling --profile-svg to visualize call graph divergence across workers. If SVG outputs show drastically different execution paths or blocking durations for identical test items, the flakiness is worker-scheduling dependent and requires resource isolation or sequential execution for that specific test subset.
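The pending-task check can be sketched as a coroutine that inspects the loop before teardown; background_job is a stand-in for any scheduled worker:

```python
import asyncio

async def background_job():
    await asyncio.sleep(0.05)  # stands in for a scheduled worker

async def main():
    task = asyncio.create_task(background_job())
    # Before teardown, list every unfinished task other than the current one.
    pending = [t for t in asyncio.all_tasks() if t is not asyncio.current_task()]
    assert pending, "the background task should still be pending here"
    await task  # drain it so teardown starts from a clean loop
    remaining = [t for t in asyncio.all_tasks() if t is not asyncio.current_task()]
    return len(remaining)

print(asyncio.run(main()))  # 0: no stray tasks left before teardown
```

A non-zero count at teardown time is a strong signal that a coroutine outlives its test and will interfere with the next attempt.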
Edge-Case Resolution: Fixtures, Conftest Hierarchies, and Parametrization
Fixture scope mismatches during reruns represent a frequent source of silent state corruption. When a test parametrized with @pytest.mark.parametrize executes, each parameter combination runs as a distinct test item. However, if the parametrization interacts with a session-scoped fixture, the fixture initializes once and persists across all parameter iterations. During a rerun, the fixture does not reset, meaning subsequent parameter combinations inherit mutated state from previous failures.
To enforce deterministic teardown, implement explicit cleanup assertions in yield-based fixtures:
import pytest
from sqlalchemy import create_engine, text

@pytest.fixture
def db_session():
    conn = create_engine("sqlite:///:memory:").connect()
    yield conn
    # Explicit cleanup to prevent cross-rerun state leakage
    conn.execute(text("DROP TABLE IF EXISTS test_data"))
    conn.close()
    assert conn.closed, "Connection leak detected during rerun teardown"
This pattern guarantees that teardown executes exactly once after the final rerun attempt and validates that resources are properly released. If the assertion fails, pytest will mark the teardown as failed, preventing silent resource exhaustion from propagating to subsequent test items.
Conftest hierarchy conflicts frequently emerge when multiple conftest.py files define overlapping autouse fixtures. During reruns, pytest resolves fixture scopes hierarchically, and a higher-level conftest may inadvertently override a lower-level fixture’s teardown logic. Always verify fixture resolution order using pytest --fixtures and ensure that autouse fixtures explicitly yield control rather than returning values, which can disrupt the teardown chain.
Hypothesis property-based testing introduces a unique conflict with pytest-rerunfailures. Hypothesis relies on deterministic execution for its shrinking algorithm, which systematically reduces failing inputs to minimal reproducible examples. Reruns introduce non-deterministic timing and state resets, breaking Hypothesis’s internal state machine and causing infinite shrinking loops or false passes. Mitigate this by skipping property-based tests during rerun cycles:
from hypothesis import settings, given, strategies as st
import pytest

@given(st.integers())
@settings(database=None, max_examples=50)
@pytest.mark.flaky(reruns=1, reruns_delay=0.5)
def test_idempotent_transform(request, n):
    # pytest-rerunfailures tracks attempts via the item's execution_count
    # attribute (1 on the first run, incremented on each rerun).
    if getattr(request.node, "execution_count", 1) > 1:
        pytest.skip("Skipping Hypothesis test during rerun to prevent shrink conflicts")
    assert transform(n) == expected(n)
By checking the execution_count attribute that pytest-rerunfailures sets on each test item, you can gracefully skip property tests during retry cycles, preserving Hypothesis’s deterministic guarantees while allowing the plugin to handle transient infrastructure failures.
CI/CD Integration & Deterministic Rerun Strategies
Deploying pytest-rerunfailures in CI/CD pipelines requires strict governance to prevent runaway retry loops and ensure failure signatures are accurately tracked. Blindly applying --reruns=5 across an entire test suite masks genuine regressions and inflates pipeline execution time. Instead, adopt a tiered retry strategy: use pytest.mark.flaky selectively for known transient failures, keep any global --reruns value low (2 or 3 at most), and use --only-rerun to restrict retries to recognized transient error patterns.
GitHub Actions matrix strategies can isolate flaky tests by running them in dedicated, sequential workers while parallelizing deterministic suites. Configure your workflow to split execution:
- name: Run Deterministic Tests
run: pytest -n auto --ignore=tests/flaky/
- name: Run Flaky Tests Sequentially
run: pytest --reruns=2 --reruns-delay=3 tests/flaky/
This separation prevents pytest-xdist from masking shared resource contention and ensures that flaky tests execute in a controlled, sequential environment.
JUnit XML merging requires careful handling when reruns are enabled. The report file is written once at session end and records the final outcome of each test, so intermediate attempts do not produce their own report files. Configure pytest with --junit-xml=report.xml and use a CI artifact collector to store per-attempt logs. Implement failure signature tracking by hashing assertion error messages and stack traces. If the same signature appears across multiple pipeline runs despite successful reruns, escalate the test to a permanent quarantine until the root cause is resolved.
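Failure-signature hashing needs nothing beyond the stdlib; the normalisation rules below are illustrative and would be tuned per suite:

```python
import hashlib
import re

def failure_signature(message, traceback_text):
    """Hash an assertion message plus stack trace into a stable signature.
    Volatile details (heap addresses, line numbers) are normalised first so
    the same logical failure hashes identically across runs."""
    blob = message + "\n" + traceback_text
    blob = re.sub(r"0x[0-9a-fA-F]+", "0xADDR", blob)
    blob = re.sub(r"line \d+", "line N", blob)
    return hashlib.sha256(blob.encode()).hexdigest()[:16]

# The same logical failure at different line numbers yields one signature.
sig_a = failure_signature("assert 1 == 2", 'File "test_x.py", line 42')
sig_b = failure_signature("assert 1 == 2", 'File "test_x.py", line 57')
print(sig_a == sig_b)  # True
```

Storing these signatures as CI artifacts lets the quarantine check become a simple set-membership test across pipeline runs.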
Avoid using --reruns as a permanent CI fix. Instead, pair rerun configurations with automated alerting. When a test exceeds its retry threshold, trigger a Slack or PagerDuty notification with the exact failure signature, environment variables, and rerun delay metrics. This transforms flaky tests from silent pipeline noise into actionable engineering tickets.
Minimal Reproduction Framework & Validation Checklist
Before merging any fix for a flaky test, establish a minimal reproduction framework that guarantees deterministic validation. The framework should isolate the test from external dependencies, pin environment variables, and enforce strict execution ordering.
- Environment Pinning: Lock the Python version, dependency versions, and OS-level libraries. Use pip freeze or poetry export to capture exact dependency trees.
- Deterministic Seeding: Inject random.seed() and numpy.random.seed() at the session level, and export PYTHONHASHSEED before the interpreter starts (setting it from within a running process has no effect). Verify that test execution order is fixed using pytest --collect-only.
- State Isolation: Replace shared fixtures with function-scoped equivalents. Use unittest.mock.patch or pytest-mock to externalize I/O boundaries.
- Validation Loop: Execute pytest --reruns=10 --reruns-delay=0 locally. If the test passes 10/10 times, the fix is likely deterministic. If it fails intermittently, the root cause remains unresolved.
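The seeding step can live in a session-scoped conftest fixture; it is shown here as a plain function so the determinism is verifiable outside pytest. The TEST_RANDOM_SEED breadcrumb is an assumption for CI logging, and numpy seeding is noted only in a comment since numpy may not be in the stack:

```python
import os
import random

def pin_determinism(seed=1337):
    """Seed the stdlib RNG and record the seed for reproduction.
    If numpy is in use, numpy.random.seed(seed) would be added here."""
    random.seed(seed)
    os.environ["TEST_RANDOM_SEED"] = str(seed)  # breadcrumb for CI logs

pin_determinism()
first_run = [random.randint(0, 100) for _ in range(5)]
pin_determinism()
second_run = [random.randint(0, 100) for _ in range(5)]
print(first_run == second_run)  # True: identical sequences under the same seed
```

Re-seeding before each validation loop iteration guarantees that any residual nondeterminism comes from the test itself, not from RNG drift.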
Automated regression testing for flakiness should be integrated into pre-commit hooks. Run a subset of historically flaky tests with elevated retry counts and strict timeout limits. If any test exceeds its threshold, block the merge and require a root cause analysis. This enforces a zero-tolerance policy for unexplained flakiness while allowing controlled retries for known transient infrastructure issues.
FAQs
Does pytest-rerunfailures reset fixture state between retry attempts?
No. Only function-scoped fixtures are recreated per rerun attempt. Module, class, and session-scoped fixtures persist across retries. To force a full reset, convert fixtures to function scope or explicitly clear mutable state within a yield-based teardown block before the final attempt concludes.
How do I prevent pytest-rerunfailures from masking genuine test failures?
Configure --reruns with a strict --reruns-delay and keep the global retry threshold low (2 or 3 attempts). Where supported, use --only-rerun to restrict retries to known transient error patterns so genuine assertion failures fail immediately. Pair this with a custom pytest_runtest_makereport hook to log failure signatures. Implement CI-level alerting that tracks identical failure hashes across multiple pipeline runs, ensuring that persistent logical errors are escalated rather than silently retried.
Why does pytest-rerunfailures conflict with Hypothesis shrinking?
Hypothesis assumes deterministic execution for its shrinking algorithm. Reruns introduce non-deterministic timing and state resets, breaking the shrinking state machine and causing infinite reduction loops or false passes. Mitigate this by checking the PYTEST_RERUN_COUNT environment variable and skipping property tests during retry cycles, or isolate Hypothesis suites from retry logic entirely.
Can I use pytest-rerunfailures with pytest-xdist for parallel execution?
Yes, but reruns are worker-local. If a test fails on worker A, it retries exclusively on worker A. Shared resources such as databases, file locks, or network ports can cause cascading failures across workers. Use pytest-xdist with isolated worker environments, or route flaky test subsets to sequential execution to prevent cross-worker resource contention.