Advanced Property-Based Testing with Hypothesis

As test suites mature, example-based assertions struggle to cover edge cases in complex data transformations, distributed state machines, and boundary-heavy algorithms — the failure mode is a green suite that still ships off-by-one errors in temporal calculations, race conditions in state transitions, and silent corruption in serialization layers. Advanced property-based testing closes that gap by validating system invariants across mathematically generated input spaces. Building on the property-based and fuzz testing foundations, this guide moves from basic randomization to deterministic, production-grade generation: composite strategies, stateful machines, shrinking control, and hybrid fuzzing.

Prerequisites

  • Python 3.10+ — 3.9 reached end-of-life in October 2025; modern type hints (typing.Annotated, typing.Protocol) streamline strategy inference.
  • hypothesis>=6.100 — stable Phase.explain, report_multiple_bugs, and stateful APIs.
  • pytest>=8.0, pytest-xdist for parallel execution, pytest-cov for coverage tracking.
  • Comfort with @given, built-in strategies (st.integers(), st.text(), st.lists()), and pytest fixtures — see the Hypothesis framework fundamentals if any of those are unfamiliar.

Core concept

Two ideas separate advanced work from basic @given usage. First, valid-by-construction generation: instead of generating arbitrary data and discarding invalid examples with .filter(), you construct strategies that natively produce only valid states, keeping the rejection rate low and shrinking deterministic. Second, stateful modeling: many systems are sequences of operations over mutable state, so you model them as a RuleBasedStateMachine and let Hypothesis search the space of operation orderings for an invariant violation.

[Figure: RuleBasedStateMachine transition diagram. A transactional store moves between an idle state (no buffer) and an in-transaction state (buffered writes) through the begin(), put(k, v), commit(), and rollback() rules; an @invariant() runs after every step and asserts that committed data is never lost.]
Hypothesis generates sequences of rule calls that move the machine between states; after every step it re-checks each invariant, and on failure it shrinks the whole call sequence.

Step-by-step implementation

Step 1 — Compose valid-by-construction strategies

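Naive filtering with st.integers().filter(lambda x: x > 0) generates arbitrary integers, discards invalid ones, and retries; for a bound this simple, st.integers(min_value=1) expresses the same constraint with no rejection at all. As the rejection rate climbs (roughly 20% is a reasonable warning threshold), generation slows, shrinking has fewer valid candidates to work with, and Hypothesis can fail the filter_too_much health check. Instead, use @st.composite with conditional branching so every generated object is structurally valid. The full mechanics (st.builds, type registration, and find()) are in generating custom strategies with hypothesis.strategies.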

Python
import datetime
from dataclasses import dataclass
from typing import Literal
from hypothesis import given, settings, assume
from hypothesis import strategies as st

@dataclass
class Transaction:
    transaction_id: str
    amount: float
    currency: Literal["USD", "EUR", "GBP"]
    timestamp: datetime.datetime
    status: Literal["pending", "completed", "failed"]

@st.composite
def valid_transactions(draw: st.DrawFn) -> Transaction:
    currency = draw(st.sampled_from(["USD", "EUR", "GBP"]))
    status = draw(st.sampled_from(["pending", "completed", "failed"]))
    # Valid-by-construction: completed transactions must carry a positive amount
    low = 0.01 if status == "completed" else 0.0
    amount = draw(st.floats(min_value=low, max_value=1_000_000.0, allow_nan=False))
    # Fixed bounds keep the strategy deterministic — never call datetime.now() here
    ts = draw(st.datetimes(min_value=datetime.datetime(2020, 1, 1),
                           max_value=datetime.datetime(2030, 1, 1)))
    assume(amount >= low)  # cheap guard for the rare float rounding edge
    return Transaction(
        transaction_id=f"TXN-{draw(st.text(min_size=8, max_size=12, alphabet='0123456789ABCDEF'))}",
        amount=amount, currency=currency, timestamp=ts, status=status,
    )

@given(valid_transactions())
@settings(max_examples=200)
def test_transaction_invariants(txn: Transaction) -> None:
    assert txn.amount >= 0.0
    if txn.status == "completed":
        assert txn.amount > 0.0  # business invariant

Constructing objects from constrained primitives lets the shrinking algorithm reduce failing examples to minimal counterexamples in milliseconds rather than minutes, because every candidate it explores is already valid.
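The same approach covers constraints that relate two fields. The sketch below is illustrative only: BookingWindow and booking_windows are hypothetical names, not part of the transaction example. It draws the start first and then a non-negative duration, so start <= end holds by construction and no filter is needed.

```python
import datetime
from dataclasses import dataclass

from hypothesis import given, settings
from hypothesis import strategies as st


@dataclass(frozen=True)
class BookingWindow:  # hypothetical type, for illustration only
    start: datetime.datetime
    end: datetime.datetime


@st.composite
def booking_windows(draw: st.DrawFn) -> BookingWindow:
    start = draw(st.datetimes(min_value=datetime.datetime(2020, 1, 1),
                              max_value=datetime.datetime(2029, 12, 31)))
    # Draw a duration rather than a second datetime plus a filter:
    # the end can never precede the start.
    duration = draw(st.timedeltas(min_value=datetime.timedelta(0),
                                  max_value=datetime.timedelta(days=30)))
    return BookingWindow(start=start, end=start + duration)


@settings(max_examples=100, deadline=None, database=None)
@given(booking_windows())
def test_window_is_ordered(window: BookingWindow) -> None:
    assert window.start <= window.end
    assert window.end - window.start <= datetime.timedelta(days=30)
```

Because later draws depend on earlier ones, the shrinker can shorten the duration and move the start toward its lower bound independently, and every candidate it tries is still a valid window.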

Step 2 — Model stateful systems with RuleBasedStateMachine

Stateless tests validate pure functions; stateful tests validate sequences of operations over mutable state. Declare rules as transitions, gate them with @precondition where an operation is only legal in some states, and assert system-wide guarantees with @invariant(). Hypothesis generates operation sequences and re-checks invariants after every step.

Python
from typing import Any
from hypothesis.stateful import RuleBasedStateMachine, rule, invariant
import hypothesis.strategies as st

class TransactionalKVStore:
    def __init__(self) -> None:
        self._data: dict[str, Any] = {}
        self._buffer: dict[str, Any] = {}
        self._in_tx = False

    def begin(self) -> None:
        self._in_tx, self._buffer = True, {}

    def put(self, key: str, value: Any) -> None:
        (self._buffer if self._in_tx else self._data)[key] = value

    def commit(self) -> None:
        if self._in_tx:
            self._data.update(self._buffer); self._buffer.clear(); self._in_tx = False

    def rollback(self) -> None:
        if self._in_tx:
            self._buffer.clear(); self._in_tx = False

class KVStateMachine(RuleBasedStateMachine):
    def __init__(self) -> None:
        super().__init__()
        self.store = TransactionalKVStore()
        self.committed: dict[str, Any] = {}   # model of what MUST survive

    @rule()
    def begin_tx(self) -> None:
        self.store.begin()

    @rule(key=st.text(min_size=1, max_size=8), value=st.integers())
    def put_value(self, key: str, value: int) -> None:
        self.store.put(key, value)
        if not self.store._in_tx:
            self.committed[key] = value

    @rule()
    def commit_tx(self) -> None:
        if self.store._in_tx:
            self.committed.update(self.store._buffer)
        self.store.commit()

    @rule()
    def rollback_tx(self) -> None:
        self.store.rollback()

    @invariant()
    def committed_data_never_lost(self) -> None:
        for key, value in self.committed.items():
            assert self.store._data.get(key) == value

TestKVStore = KVStateMachine.TestCase

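The committed dict is a model (a simplified re-implementation of the contract), and the invariant checks the real store against it. To stay compact, the model reads the store's private _in_tx and _buffer attributes; a stricter model would track transaction state itself so that a bug in those fields cannot hide. Edge case: rules without a precondition may fire in illegal orders (a commit before any begin), which is exactly the kind of sequence you want explored, provided the implementation tolerates it.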
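When an operation is not tolerated in every state, a precondition keeps Hypothesis from generating it there. The sketch below uses a hypothetical StackMachine, unrelated to the KV store, in which a plain list stands in for the system under test: pop is only eligible once the model says the stack is non-empty.

```python
from hypothesis import settings
from hypothesis import strategies as st
from hypothesis.stateful import (RuleBasedStateMachine, invariant,
                                 precondition, rule)


class StackMachine(RuleBasedStateMachine):  # hypothetical example machine
    def __init__(self) -> None:
        super().__init__()
        self.stack: list[int] = []  # stand-in for the system under test
        self.size = 0               # model: how many items we expect

    @rule(value=st.integers())
    def push(self, value: int) -> None:
        self.stack.append(value)
        self.size += 1

    # Hypothesis only selects this rule while the predicate is true,
    # so pop() is never attempted on an empty stack.
    @precondition(lambda self: self.size > 0)
    @rule()
    def pop(self) -> None:
        self.stack.pop()
        self.size -= 1

    @invariant()
    def size_matches_model(self) -> None:
        assert len(self.stack) == self.size


TestStack = StackMachine.TestCase
TestStack.settings = settings(max_examples=25, stateful_step_count=20,
                              deadline=None, database=None)
```

The predicate reads the model (self.size) rather than the system under test, so a bug in the real object cannot silently disable the rule.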

Step 3 — Control shrinking and analyze counterexamples

When a stateful or composite test fails, the shrinking algorithm minimizes the failing input. Complex traces can produce opaque counterexamples, so capture every violation and replay deterministically.

Python
from hypothesis import settings, given, strategies as st

@settings(report_multiple_bugs=True)   # already the default; shown explicitly: report every distinct failure found
@given(st.lists(st.integers(), max_size=10))
def test_with_full_report(xs: list[int]) -> None:
    assert sum(xs) == sum(reversed(xs))

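On failure with print_blob enabled (the default on CI), Hypothesis prints a @reproduce_failure(...) decorator that encodes the failing example together with the Hypothesis version that produced it. Apply it to the same test to replay the failure without re-running the full generation cycle, and remove it once the fix is verified, because the encoded blob is not stable across Hypothesis upgrades. During initial isolation, restrict phases with @settings(phases=[Phase.generate]) to skip shrinking and avoid CI timeouts, then re-enable shrinking for production runs.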
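For a regression guard that outlives the example database, one option is to pin the shrunk counterexample with @example. This is a minimal sketch: running_mean is a hypothetical function under test, and the pinned inputs are illustrative rather than taken from a real failure.

```python
from hypothesis import example, given, settings
from hypothesis import strategies as st


def running_mean(xs: list[int]) -> float:  # hypothetical function under test
    return sum(xs) / len(xs) if xs else 0.0


@settings(max_examples=50, deadline=None, database=None)
@given(st.lists(st.integers(min_value=-1000, max_value=1000), max_size=10))
@example([])         # pinned edge case: always runs before generated inputs
@example([0, 0, 0])  # pinned edge case: all-equal values
def test_mean_is_bounded(xs: list[int]) -> None:
    mean = running_mean(xs)
    if xs:
        assert min(xs) <= mean <= max(xs)
    else:
        assert mean == 0.0
```

Explicit examples run in their own phase ahead of random generation, so the pinned cases are checked on every run regardless of seed or database state.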

Step 4 — Bridge into fuzzing across C-extension boundaries

Pure Python property tests are a weak fit for C-extensions and memory-managed libraries, where a defect tends to surface as a crash or silent memory corruption rather than a Python exception. Serialize Hypothesis-generated inputs into byte buffers and feed them to a coverage-guided fuzzer such as Atheris to catch segmentation faults and undefined behavior.

Python
import struct
from hypothesis import given, settings, Phase
import hypothesis.strategies as st

def native_parse_payload(data: bytes) -> None:
    if len(data) < 4:
        raise ValueError("header too short")
    length = struct.unpack_from(">I", data, 0)[0]
    if length > len(data) - 4:
        raise BufferError("payload length mismatch")

@given(st.binary(min_size=4, max_size=256))
@settings(max_examples=500, phases=[Phase.generate])
def test_native_boundary(data: bytes) -> None:
    try:
        native_parse_payload(data)
    except (ValueError, BufferError):
        pass  # documented validation errors
    except Exception as exc:  # anything else is a boundary violation
        raise AssertionError(f"native boundary violation: {exc}") from exc

In a real Atheris setup, atheris.Setup(sys.argv, native_parse_payload); atheris.Fuzz() registers the same target for coverage-guided mutation. Enforce execution limits on native calls with signal.alarm() or threading.Timer. This pattern pairs naturally with the isolation techniques in autospec strict mocking when the target has impure dependencies.
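Hypothesis also exposes a hook for the opposite direction, where an external fuzzer supplies the bytes and Hypothesis decodes them into a structured test case: the fuzz_one_input attribute of a @given-decorated test. The sketch below assumes a pure-Python stand-in named parse_payload (hypothetical) and drives the hook with a few hand-written buffers instead of a real fuzzer.

```python
import struct

from hypothesis import given
from hypothesis import strategies as st


def parse_payload(data: bytes) -> int:  # hypothetical stand-in for a native parser
    if len(data) < 4:
        raise ValueError("header too short")
    (length,) = struct.unpack_from(">I", data, 0)
    if length > len(data) - 4:
        raise BufferError("payload length mismatch")
    return length


@given(st.binary(max_size=64))
def check_parser(data: bytes) -> None:
    try:
        parse_payload(data)
    except (ValueError, BufferError):
        pass  # documented validation errors


# A fuzzer calls this with raw bytes; Hypothesis turns them into one test case.
fuzz_target = check_parser.hypothesis.fuzz_one_input

# With Atheris installed (not imported here), the wiring would be roughly:
#   atheris.Setup(sys.argv, fuzz_target); atheris.Fuzz()
for buffer in (b"", b"\x00" * 8, bytes(range(32))):
    fuzz_target(buffer)
```

In this mode the fuzzer's coverage feedback chooses the bytes, while the strategy still decides what those bytes mean, so the target receives well-formed inputs.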

Step 5 — Register CI profiles and parallelize

Scale max_examples by pipeline stage and isolate the database under pytest-xdist.

Python
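import os
from hypothesis import settings, Phase, Verbosity
from hypothesis.database import DirectoryBasedExampleDatabase

# Fast feedback on pull requests: few examples, no example database
settings.register_profile("ci_pr", max_examples=50, deadline=500,
                          phases=[Phase.generate, Phase.shrink, Phase.explain],
                          verbosity=Verbosity.quiet, database=None)
# Deeper search; the database setting takes an ExampleDatabase instance, not a path string
settings.register_profile("ci_nightly", max_examples=1000, deadline=2000,
                          database=DirectoryBasedExampleDatabase(".hypothesis/examples"))
# Assumes nightly jobs do not set CI; otherwise pick explicitly with
# pytest --hypothesis-profile=ci_nightly
settings.load_profile("ci_pr" if os.getenv("CI") else "ci_nightly")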

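Under pytest -n auto, each worker runs its own share of the tests. The default DirectoryBasedExampleDatabase stores every example as a separate file, so workers can share one .hypothesis directory; set database=None (as in the ci_pr profile) if you would rather have no shared state between workers at all. The full latency-reduction playbook is in reducing Hypothesis test execution time.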

Verification

Confirm the setup behaves before trusting it in CI:

  • Run pytest --hypothesis-show-statistics. Each property should report a low rejection rate (aim under 15%) and a healthy Generate/Shrink split. A high "events" count of filtered draws signals an over-constrained strategy.
  • Force a failure (negate an assertion) and confirm Hypothesis shrinks to a minimal counterexample and, with print_blob enabled, prints a @reproduce_failure decorator.
  • For the state machine, inspect the printed trace on failure — it should be the shortest call sequence that breaks the invariant, with arguments already minimized.
  • Run twice with the same seed (pytest --hypothesis-seed=0) and confirm identical generation, proving determinism.
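The determinism check in the last bullet can also be scripted. The sketch below is an assumption-laden illustration rather than a required step: it uses derandomize=True, which seeds generation from the test function itself, disables the database so nothing is replayed, and records the inputs of two consecutive runs for comparison. The names record_examples and seen are hypothetical.

```python
from hypothesis import given, settings
from hypothesis import strategies as st

seen: list[list[int]] = []  # inputs observed during the current run


@settings(derandomize=True, database=None, max_examples=30, deadline=None)
@given(st.lists(st.integers(), max_size=5))
def record_examples(xs: list[int]) -> None:
    seen.append(list(xs))


record_examples()
first_run = list(seen)

seen.clear()
record_examples()
second_run = list(seen)

# True when both runs generated the same inputs in the same order.
identical = first_run == second_run
```

If identical is ever False, look for hidden randomness or clock reads inside the strategy, such as a call to datetime.now() while drawing.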

Troubleshooting

Symptom | Root cause | Fix
Unsatisfiable / high filtered-example ratio | .filter() or assume() rejecting most candidates | Switch to valid-by-construction @st.composite with conditional branching
Stateful test shrinks slowly or times out | Expensive side effects run on every shrink step | Keep rule bodies cheap; mock I/O; restrict phases during isolation
Only one invariant failure is reported | report_multiple_bugs turned off in a settings profile (it defaults to True) | Restore @settings(report_multiple_bugs=True)
Flaky failures across machines | Lost .hypothesis database or non-deterministic seed | Cache and commit the database; pin @seed() while debugging
Database contention under pytest-xdist | Workers sharing a custom single-file database | Use the default directory-based database, or set database=None
RecursionError in recursive strategy | Missing depth/leaf bound | Use st.recursive(..., max_leaves=...) or an explicit depth counter
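As an illustration of the last row, here is a minimal sketch of a bounded recursive strategy. The name json_values and the round-trip property are illustrative choices, not part of the examples above; max_leaves caps how large any generated tree can grow.

```python
import json

from hypothesis import given, settings
from hypothesis import strategies as st

# Leaves are JSON scalars; branches are short lists and string-keyed dicts.
json_values = st.recursive(
    st.none() | st.booleans() | st.integers() | st.text(max_size=5),
    lambda children: st.lists(children, max_size=3)
    | st.dictionaries(st.text(max_size=3), children, max_size=3),
    max_leaves=10,  # upper bound on leaves per generated value
)


@settings(max_examples=50, deadline=None, database=None)
@given(json_values)
def test_json_roundtrip(value) -> None:
    # Serializing and parsing again should give back an equal value.
    assert json.loads(json.dumps(value)) == value
```

Floats are left out of the leaves on purpose, since NaN never compares equal to itself and would break the equality check for reasons unrelated to recursion.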

Frequently Asked Questions

How do I prevent property-based tests from slowing down my CI pipeline? Register environment-aware settings profiles that scale max_examples by stage (50 for PRs, 500+ for nightly builds), run with pytest-xdist, cache the .hypothesis database, and avoid filter-heavy strategies. Monitor with --hypothesis-show-statistics and tune deadline accordingly.

When should I use assume() instead of strategy filter()? Both discard examples, so both hurt when they reject often. Use .filter() for a cheap predicate on a single drawn value, where Hypothesis can retry just that draw. Use assume() inside the test when the condition relates several arguments or depends on computed state, accepting that a failed assumption throws away the whole example. Prefer valid-by-construction @composite strategies for complex constraints.

How do I reproduce a failing stateful test in production? With print_blob enabled (the default on CI), Hypothesis prints a @reproduce_failure decorator encoding the failing example. Apply it to the failing test to replay the failure. Cache and commit the .hypothesis database so the minimal example survives across machines and CI runners.

Can I combine Hypothesis with coverage-guided fuzzers like Atheris? Yes. Use Hypothesis to generate structured, type-safe inputs, then serialize them to byte buffers for consumption by Atheris. Hypothesis provides semantic correctness while Atheris drives low-level memory and boundary testing of C-extensions and parsers.

How does shrinking work for a RuleBasedStateMachine? When a stateful test fails, Hypothesis shrinks the entire sequence of rule calls, not just individual argument values. It removes steps, reorders where legal, and minimizes each remaining argument, producing the shortest operation trace that still violates an invariant.
