
Reducing Hypothesis Test Execution Time in CI

Your Hypothesis suite has crept from seconds to minutes, and the CI dashboard shows it eating the pipeline budget — so someone reaches for max_examples=10 or deadline=None, which makes the clock look better while silently gutting failure detection. The disciplined fix is diagnostic-first: measure which of the three phases (generation, execution, shrinking) is actually slow, then apply the targeted remedy. Within the broader property-based and fuzz testing approach, speed work means aligning strategy complexity, phase execution, and database caching with your system's real computational boundaries — never disabling Hypothesis features wholesale.

Prerequisites

  • hypothesis>=6.100, pytest>=8.0, pytest-xdist, and pytest-profiling for flame graphs.
  • A test invoked with --hypothesis-show-statistics (add it to addopts in pytest.ini).

Solution

Start by reading the phase breakdown, then bound strategy complexity and cut the number of examples that are generated only to be rejected.

Bash
# Phase-level timing plus a cProfile report of where CPU time goes
# (pytest-profiling's --profile-svg additionally renders a call-graph image)
pytest --profile -k test_property --hypothesis-show-statistics

The statistics output reports, for each phase (reuse, generate, shrink), the elapsed time and how much of a typical example's runtime went to data generation. If data generation dominates, your strategies produce oversized or deeply nested objects, so bound them. If the shrink phase dominates, the test body is expensive and reruns on every minimization step.
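To look at the generation side in isolation, one option is to run a strategy against an empty test body, so that nearly all of the measured time is generation. The following is a minimal sketch, assuming a hypothetical helper named time_generation that is not part of Hypothesis; deadline=None and the suppressed health checks are there only so the measurement itself never fails, and are not settings to copy into real tests.

```python
import time

from hypothesis import HealthCheck, given, settings
import hypothesis.strategies as st


def time_generation(strategy: st.SearchStrategy, examples: int = 100) -> float:
    """Return seconds spent running `examples` draws against an empty body.

    Hypothetical helper for illustration; not part of Hypothesis.
    """

    @settings(
        max_examples=examples,
        database=None,  # no replay, so only fresh generation is measured
        deadline=None,  # measurement harness only; keep deadlines in real tests
        suppress_health_check=list(HealthCheck),
    )
    @given(strategy)
    def run(value) -> None:
        pass  # empty body: elapsed time is almost entirely generation

    start = time.perf_counter()
    run()
    return time.perf_counter() - start


if __name__ == "__main__":
    bounded = st.lists(st.text(max_size=20), max_size=10)
    unbounded = st.lists(st.text())
    print(f"bounded:   {time_generation(bounded):.3f}s")
    print(f"unbounded: {time_generation(unbounded):.3f}s")
```

Comparing the two figures for a bounded and an unbounded variant of the same strategy gives a rough sense of how much of a slow test is attributable to generation rather than to the body.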

Python
from hypothesis import given, settings, assume
import hypothesis.strategies as st

# Bound every collection and numeric range so generation stays cheap
@given(st.lists(st.text(max_size=20), max_size=10))
@settings(max_examples=200)
def test_bounded(data: list[str]) -> None:
    assert len("".join(data)) <= 200

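# filter() draws a value, checks the predicate, and retries the draw on failure;
# after a few failed retries the whole example is discarded as invalid
@given(st.integers().filter(lambda x: x % 2 == 0))
def test_filter(x: int) -> None:
    # Integer-only check: x / 2 is a float and loses precision for large ints
    assert (x // 2) * 2 == x

# assume() moves the check into the test body; a failed assumption discards
# the example and counts it as invalid in the statistics output
@given(st.integers())
def test_assume(x: int) -> None:
    assume(x % 2 == 0)
    assert (x // 2) * 2 == x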
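
Where the constraint can be built into the strategy itself, no example has to be rejected at all. A minimal sketch of that idea for the even-integer case, using .map() to construct valid values directly; the name even_integers is illustrative only.

```python
from hypothesis import given
import hypothesis.strategies as st

# Every drawn integer is doubled, so every generated value is even by construction
even_integers = st.integers().map(lambda n: n * 2)


@given(even_integers)
def test_constructed_even(x: int) -> None:
    assert x % 2 == 0
    assert (x // 2) * 2 == x
```

Because nothing is discarded, the invalid-example count reported by --hypothesis-show-statistics for this test should stay at zero.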

Recursive strategies are the highest combinatorial risk; always bound depth explicitly.

Python
from hypothesis import given, settings
import hypothesis.strategies as st

@st.composite
def bounded_tree(draw, max_depth: int = 3):
    depth = draw(st.integers(0, max_depth))
    if depth == 0:
        return None
    return {"value": draw(st.integers(-100, 100)),
            "left": draw(bounded_tree(max_depth=depth - 1)),
            "right": draw(bounded_tree(max_depth=depth - 1))}

@given(bounded_tree(max_depth=4))
@settings(max_examples=150)
def test_tree(tree: dict | None) -> None:
    pass  # traversal executes predictably because depth is capped

Finally, persist the database and parallelize. Caching .hypothesis/examples/ across runs replays known failures first instead of rediscovering them, and pytest -n auto spreads execution across workers.

YAML
# .github/workflows/test.yml
steps:
  - uses: actions/checkout@v4
  - uses: actions/setup-python@v5
    with: { python-version: "3.12" }
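  - uses: actions/cache@v4
    with:
      path: .hypothesis/examples
      # actions/cache skips the save step on an exact key hit, so use a
      # per-run key and let restore-keys load the most recent database
      key: hypothesis-db-${{ runner.os }}-${{ github.run_id }}
      restore-keys: hypothesis-db-${{ runner.os }}-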
  - run: pip install -r requirements.txt pytest pytest-xdist hypothesis
  - run: pytest -n auto --hypothesis-profile ci
    env: { HYPOTHESIS_PROFILE: ci }

[Figure: Phase timeline and deadline budget. Shows where a single example spends its time across Generate, Execute, and Shrink (only on failure); the deadline applies per execution.]
Read the Generate, Execute, and Shrink split from statistics: deadlines guard per-execution time, bounding strategies cuts Generate, and a cheap test body cuts Shrink.
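
The workflow selects a settings profile named ci, which has to be defined somewhere the test run can see it. A minimal conftest.py sketch, assuming illustrative values (200 examples, a 500 ms deadline) and a second hypothetical profile named dev for local runs; none of these numbers come from a real project.

```python
# conftest.py (sketch; profile names and values are illustrative assumptions)
import os

from hypothesis import settings
from hypothesis.database import DirectoryBasedExampleDatabase

settings.register_profile(
    "ci",
    max_examples=200,
    deadline=500,  # milliseconds allowed per test execution
    database=DirectoryBasedExampleDatabase(".hypothesis/examples"),
)
settings.register_profile("dev", max_examples=50)

# Fall back to the lighter profile when HYPOTHESIS_PROFILE is not set
settings.load_profile(os.getenv("HYPOTHESIS_PROFILE", "dev"))
```

Pinning the database path in the profile keeps it aligned with the directory that the cache step saves and restores.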

Why this works

Rejected examples are pure overhead: .filter() retries the draw a few times before giving up on the example, and assume() abandons the example from inside the test body, so either way the engine pays for data the test never checks. The invalid-example count in the statistics output shows how much is being thrown away, and building the constraint into the strategy removes that cost. Bounding collection and recursion sizes caps the work both generation and shrinking must do per example. The example database turns repeat failures into instant replays, and pytest-xdist cuts wall-clock time by spreading tests across workers. The default example database is a directory of plain files rather than a single SQLite file; if a custom shared backend misbehaves under parallel workers, fall back to database=None for those runs.

Edge cases and failure modes

  • Deadline breach masking O(n²) logic: a DeadlineExceeded error often means the test body itself is slow on large input; profile the body before raising the threshold.
  • pytest-xdist and a shared database: the default database is a directory of plain files (DirectoryBasedExampleDatabase), not SQLite; if a custom shared backend shows contention between workers, use database=None for those runs.
  • Disabling shrinking permanently: phases=[Phase.generate] is fine for smoke tests but turns production failures into opaque stack traces, and it also skips explicit @example cases and database replay; re-enable Phase.shrink for real pipelines.
  • Stale database skewing generation: months-old examples bias toward outdated edge cases; prune the directory on a retention schedule.
  • st.from_type() on complex Pydantic/SQLAlchemy models: recursive resolution generates oversized objects; override with explicit st.builds(Model, field=st.integers(min_value=0, max_value=10)).
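
The st.builds() override from the last item can be sketched as follows. A plain dataclass named Order stands in for a Pydantic or SQLAlchemy model, and its fields and bounds are hypothetical; the point is only that every field gets an explicit, bounded strategy instead of being inferred.

```python
from dataclasses import dataclass

from hypothesis import given
import hypothesis.strategies as st


@dataclass
class Order:  # hypothetical stand-in for a Pydantic or SQLAlchemy model
    sku: str
    quantity: int
    tags: list[str]


# Each field is bounded explicitly rather than resolved through st.from_type()
orders = st.builds(
    Order,
    sku=st.text(alphabet="ABCDEFGHJK0123456789", min_size=1, max_size=8),
    quantity=st.integers(min_value=0, max_value=10),
    tags=st.lists(st.sampled_from(["gift", "rush", "bulk"]), max_size=3),
)


@given(orders)
def test_order_is_small(order: Order) -> None:
    assert 0 <= order.quantity <= 10
    assert 1 <= len(order.sku) <= 8
    assert len(order.tags) <= 3
```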

Frequently Asked Questions

Why does my Hypothesis test run much slower on CI than locally? A fresh CI runner usually lacks the .hypothesis/examples cache, so every known failure has to be rediscovered and shrunk again on each run. Cache the .hypothesis directory in the pipeline and standardize its path with DirectoryBasedExampleDatabase so failing examples replay across runners.
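
A cached database directory only grows, so it helps to pair the cache with a retention step. A minimal sketch using only the standard library, assuming a hypothetical helper named prune_stale_examples and a 90-day window chosen for illustration; it also assumes file modification time is an acceptable proxy for how recently an example was saved, which may not reflect when it was last replayed.

```python
import time
from pathlib import Path


def prune_stale_examples(root: Path, max_age_days: float) -> int:
    """Delete files under `root` older than `max_age_days`; return how many.

    Hypothetical helper; age is judged by file modification time.
    """
    cutoff = time.time() - max_age_days * 86_400
    removed = 0
    if not root.is_dir():
        return removed  # nothing cached yet
    for path in root.rglob("*"):
        if path.is_file() and path.stat().st_mtime < cutoff:
            path.unlink()
            removed += 1
    return removed


if __name__ == "__main__":
    count = prune_stale_examples(Path(".hypothesis/examples"), max_age_days=90)
    print(f"removed {count} stale example file(s)")
```

Running this before the test step keeps the restored cache from carrying examples for code that no longer exists.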

Can I safely use pytest-xdist with Hypothesis? Yes. The default database is a directory of plain files (DirectoryBasedExampleDatabase) rather than a single SQLite file. If you configure a custom shared backend and see contention between workers, use database=None for stateless parallel runs.

How do I tell whether slowness comes from generation or my test logic? Run pytest --hypothesis-show-statistics and read the Generate versus Shrink ratio. If Generate dominates, your strategies produce oversized objects; if Shrink dominates, the test body has expensive side effects that run on every minimization step.

Is it safe to disable shrinking to speed up tests? Only for exploratory smoke tests. Disabling Phase.shrink sacrifices minimal reproducers and turns actionable failures into opaque stack traces. Use phases=[Phase.generate] temporarily and re-enable shrinking for production pipelines.
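
For a smoke run that skips shrinking but still replays saved failures, the phase list can name the phases to keep. A minimal sketch, assuming a hypothetical profile named smoke; the property under test is a placeholder.

```python
from hypothesis import Phase, given, settings
import hypothesis.strategies as st

# Keep explicit examples, database replay, and generation; leave out shrinking
SMOKE_PHASES = (Phase.explicit, Phase.reuse, Phase.generate)

settings.register_profile("smoke", phases=SMOKE_PHASES, max_examples=50)


@settings(settings.get_profile("smoke"))
@given(st.lists(st.integers(-100, 100), max_size=10))
def test_sorted_is_idempotent(xs: list[int]) -> None:
    assert sorted(sorted(xs)) == sorted(xs)
```

Compared with phases=[Phase.generate], this keeps the cached example database useful during smoke runs, at the cost of unminimized failure output.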

These tactics build on the execution model in the Hypothesis framework fundamentals; when slow custom generators are the cause, see generating custom strategies with hypothesis.strategies, and for parallel-suite tuning compare notes with the pytest-xdist vs pytest-parallel performance comparison.
