Hypothesis & Fuzzing

Hypothesis Framework Fundamentals for Python Testing

As systems grow in complexity, the combinatorial explosion of valid input states renders example-based testing insufficient for guaranteeing correctness. The symptom is familiar: a suite that passes for months, then a production incident traces back to an input nobody wrote a test for. Hypothesis attacks that failure mode by generating inputs, enforcing invariants, and applying delta-debugging to failures — turning a brittle verification script into a generative validation engine. This guide grounds the property-based and fuzz testing approach in concrete execution models: strategies, the @given decorator, the shrinking engine, the example database, and CI-ready settings.

Prerequisites

  • Python 3.10+ (3.9 reached end-of-life in October 2025); the examples use type hints.
  • hypothesis>=6.100 and pytest>=8.0 installed in the active virtual environment.
  • Familiarity with pytest fixture lifecycles and basic decorators.
  • Optional: sqlalchemy for the database-integration example below.

Core concept

Hypothesis replaces explicit input-output pairs with properties: invariants that must hold for every valid input. Instead of asserting f(5) == 10, you assert that property(f(x)) holds for all x in a domain. Properties are algebraic or structural: idempotency (f(f(x)) == f(x)), commutativity (f(a, b) == f(b, a)), or round-trip preservation (decode(encode(x)) == x). Hypothesis generates structured, type-aware inputs that probe boundaries human engineers rarely anticipate and, crucially, records each failure it finds so the same input can be replayed deterministically.
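
The three property shapes named above can be sketched as follows. This is a minimal illustration, not a recommended suite: collapse_whitespace is a hypothetical function invented for the example, and math.gcd and json stand in for whatever code you actually want to verify.

```python
import json
import math

from hypothesis import given
import hypothesis.strategies as st


def collapse_whitespace(text: str) -> str:
    """Hypothetical function under test: squeeze whitespace runs to one space."""
    return " ".join(text.split())


@given(st.text())
def test_idempotency(text: str) -> None:
    # Applying the function twice must equal applying it once.
    once = collapse_whitespace(text)
    assert collapse_whitespace(once) == once


@given(st.integers(), st.integers())
def test_commutativity(a: int, b: int) -> None:
    # Argument order must not matter.
    assert math.gcd(a, b) == math.gcd(b, a)


@given(st.lists(st.integers()))
def test_round_trip(values: list[int]) -> None:
    # Decoding the encoded value must return the original.
    assert json.loads(json.dumps(values)) == values
```

Each test states a rule rather than a single expected value, which is what lets Hypothesis choose the inputs.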

Figure: The Hypothesis execution lifecycle. Generate (strategy draws input) → Execute (run the property) → Shrink (minimize the failure) → Database (.hypothesis/examples).
Each example flows generate, execute, and on failure shrink; the minimized input is stored in the example database and replayed first on the next run so regressions resurface instantly.
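
To make the lifecycle concrete, here is a deliberately tiny model in plain Python. It is not how Hypothesis is implemented: the random list generator, the greedy shrink function, and the list used as a "database" are all hypothetical stand-ins invented for this sketch.

```python
import random
from typing import Callable

Property = Callable[[list[int]], bool]


def shrink(example: list[int], prop: Property) -> list[int]:
    """Greedily simplify a failing list while it still violates prop."""
    current = list(example)
    changed = True
    while changed:
        changed = False
        # Pass 1: try deleting each element.
        for i in range(len(current)):
            candidate = current[:i] + current[i + 1:]
            if not prop(candidate):
                current, changed = candidate, True
                break
        if changed:
            continue
        # Pass 2: try making each element smaller.
        for i, value in enumerate(current):
            for smaller in (value // 2, value - 1):
                if 0 <= smaller < value:
                    candidate = current[:i] + [smaller] + current[i + 1:]
                    if not prop(candidate):
                        current, changed = candidate, True
                        break
            if changed:
                break
    return current


def run_property(prop: Property, database: list[list[int]],
                 seed: int = 0, max_examples: int = 100):
    """Return (minimal_failure_or_None, number_of_examples_executed)."""
    rng = random.Random(seed)
    # Stored failures are replayed before any fresh input is generated.
    candidates = list(database) + [
        [rng.randint(0, 100) for _ in range(rng.randint(0, 10))]
        for _ in range(max_examples)
    ]
    executed = 0
    for example in candidates:
        executed += 1
        if not prop(example):
            minimal = shrink(example, prop)
            if minimal not in database:
                database.append(minimal)
            return minimal, executed
    return None, executed


def all_below_50(xs: list[int]) -> bool:
    """Deliberately false property: claims no element ever reaches 50."""
    return all(x < 50 for x in xs)


database: list[list[int]] = []
first = run_property(all_below_50, database)   # finds a failure, shrinks to [50]
second = run_property(all_below_50, database)  # replays [50] as its first example
```

On the second call the stored example fails immediately, so only one example is executed before the failure is reported. That is the behaviour the diagram describes, reduced to a toy.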

Step-by-step implementation

Step 1 — Write a property with @given

Strategies live in hypothesis.strategies (aliased st) and are lazy generators — they describe how to produce data rather than producing it eagerly. The @given decorator binds strategies to a test, generates an example, injects it, and repeats up to max_examples (default 100).

Python
from hypothesis import given, settings
import hypothesis.strategies as st

@given(st.text(min_size=1, max_size=50))
@settings(max_examples=200)
def test_utf8_round_trip(raw_text: str) -> None:
    """Encoding then decoding UTF-8 must preserve the original string."""
    decoded = raw_text.encode("utf-8").decode("utf-8")
    assert decoded == raw_text                 # round-trip invariant
    assert len(raw_text.encode("utf-8")) >= len(raw_text)  # bytes >= chars

On failure, Hypothesis prints the falsifying example and pytest's assertion rewriting shows the intermediate values, so no manual logging is needed.

Step 2 — Compose a custom strategy

@st.composite turns a function into a strategy that draws correlated fields and enforces cross-field rules before returning an object.

Python
from dataclasses import dataclass
from datetime import datetime
from hypothesis import given, strategies as st

@dataclass
class UserEvent:
    user_id: int
    timestamp: datetime
    action: str
    metadata: dict

@st.composite
def valid_user_events(draw: st.DrawFn) -> UserEvent:
    action = draw(st.sampled_from(["login", "purchase", "logout"]))
    # Cross-field constraint: login events must carry a session_id
    if action == "login":
        metadata = draw(st.fixed_dictionaries({"session_id": st.uuids()}))
    else:
        metadata = draw(st.dictionaries(st.text(), st.integers()))
    return UserEvent(
        user_id=draw(st.integers(min_value=1, max_value=100_000)),
        timestamp=draw(st.datetimes(min_value=datetime(2020, 1, 1))),
        action=action, metadata=metadata,
    )

@given(valid_user_events())
def test_event_has_required_fields(event: UserEvent) -> None:
    if event.action == "login":
        assert "session_id" in event.metadata

The deeper patterns — st.builds, type registration, recursive strategies — are covered in generating custom strategies with hypothesis.strategies.
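
When fields are independent and no cross-field rule is needed, st.builds is usually the shorter route. The sketch below assumes a hypothetical Money value object invented for this example; only st.builds, st.integers, and st.sampled_from come from Hypothesis.

```python
from dataclasses import dataclass

from hypothesis import given
import hypothesis.strategies as st


@dataclass(frozen=True)
class Money:
    """Hypothetical value object used only for this illustration."""
    amount_cents: int
    currency: str


# st.builds calls Money(...) with one drawn value per keyword argument.
money = st.builds(
    Money,
    amount_cents=st.integers(min_value=0, max_value=10_000_000),
    currency=st.sampled_from(["USD", "EUR", "GBP"]),
)


@given(money)
def test_money_fields_are_in_range(m: Money) -> None:
    assert 0 <= m.amount_cents <= 10_000_000
    assert m.currency in {"USD", "EUR", "GBP"}
```

Reach for @st.composite once one field's strategy depends on the value drawn for another.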

Step 3 — Integrate with pytest fixtures and assume()

Hypothesis runs the test body many times inside a single pytest test invocation, so a function-scoped fixture is set up once per test function, not once per generated example. Hypothesis flags this with the function_scoped_fixture health check; suppress it only when sharing the resource across examples is safe, and use assume() for rare preconditions.

Python
import pytest
from hypothesis import HealthCheck, assume, given, settings
import hypothesis.strategies as st
from sqlalchemy import create_engine, Column, Integer, String
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class Record(Base):
    __tablename__ = "records"
    id = Column(Integer, primary_key=True)
    payload = Column(String, nullable=False)

@pytest.fixture(scope="function")
def db_session(tmp_path):
    engine = create_engine(f"sqlite:///{tmp_path}/test.db")
    Base.metadata.create_all(engine)
    with Session(engine) as session:
        yield session   # one DB shared by every example of this test

@given(st.text(min_size=1, max_size=100))
@settings(
    max_examples=50,
    # The fixture is not reset between examples; rows accumulate,
    # which this property tolerates, so the health check is suppressed.
    suppress_health_check=[HealthCheck.function_scoped_fixture],
)
def test_insert_round_trip(db_session: Session, payload: str) -> None:
    # Embedded NUL characters are a trouble spot for SQLite text; rare, so assume() fits
    assume("\x00" not in payload)
    db_session.add(Record(payload=payload))
    db_session.commit()
    fetched = db_session.query(Record).filter_by(payload=payload).first()
    assert fetched is not None and fetched.payload == payload
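
When every example genuinely needs a clean resource, one option that does not depend on fixture scoping is to create the resource inside the test body. The sketch below uses the standard-library sqlite3 module with an in-memory database instead of SQLAlchemy to stay self-contained; fresh_db is a hypothetical helper invented for this example.

```python
import sqlite3
from contextlib import contextmanager

from hypothesis import given, settings
import hypothesis.strategies as st


@contextmanager
def fresh_db():
    """Hypothetical helper: yield a brand-new in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE records (id INTEGER PRIMARY KEY, payload TEXT NOT NULL)"
    )
    try:
        yield conn
    finally:
        conn.close()


# Filter out embedded NUL characters at the strategy level.
payloads = st.text(min_size=1, max_size=100).filter(lambda s: "\x00" not in s)


@given(payloads)
@settings(max_examples=50)
def test_insert_round_trip_isolated(payload: str) -> None:
    # The context manager runs once per generated example,
    # so no rows survive from earlier examples.
    with fresh_db() as conn:
        conn.execute("INSERT INTO records (payload) VALUES (?)", (payload,))
        rows = conn.execute("SELECT payload FROM records").fetchall()
    assert rows == [(payload,)]
```

Because the table is always empty at the start of an example, the assertion can check the full table contents rather than just the presence of one row.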

Prefer assume() for rare or cross-field constraints; prefer .filter() for common, easily satisfiable ones. Overusing either rejects too many examples: Hypothesis then fails the filter_too_much health check (FailedHealthCheck) or, if it cannot find any valid example at all, raises Unsatisfiable.
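
The three ways of applying a constraint can be compared side by side. This is a sketch with toy arithmetic properties chosen for the example; the third test shows the alternative of building valid data directly with .map(), which rejects nothing.

```python
from hypothesis import assume, given
import hypothesis.strategies as st


@given(st.integers(), st.integers())
def test_division_undoes_multiplication(a: int, b: int) -> None:
    assume(b != 0)  # rare precondition checked inside the test
    assert (a * b) // b == a


@given(st.integers().filter(lambda n: n % 2 == 0))
def test_even_via_filter(n: int) -> None:
    # Roughly half of all draws pass, so filtering stays cheap.
    assert n % 2 == 0


@given(st.integers().map(lambda n: n * 2))
def test_even_by_construction(n: int) -> None:
    # Every drawn value is valid; no example is ever rejected.
    assert n % 2 == 0
```

When a constraint would reject most draws, constructing valid values as in the last test is generally the safer design.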

Step 4 — Tune settings and the example database

Production suites need predictable execution. hypothesis.settings controls volume, deadlines, and database behavior.

Python
from hypothesis import settings, given, Verbosity
from hypothesis.database import DirectoryBasedExampleDatabase
import hypothesis.strategies as st

@settings(
    max_examples=500,
    deadline=500,  # ms per example; raises DeadlineExceeded if breached
    verbosity=Verbosity.normal,
    database=DirectoryBasedExampleDatabase(".hypothesis/ci_cache"),
)
@given(st.dictionaries(st.text(), st.integers()))
def test_dict_dedup(data: dict[str, int]) -> None:
    assert len(data) == len({k: v for k, v in data.items()})

Increase max_examples for pure functions, decrease for I/O-heavy tests. The default deadline is 200ms; override per-test for genuinely slow operations rather than globally. Detailed tactics live in reducing Hypothesis test execution time.
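
Rather than repeating @settings on every test, suite-wide defaults can be grouped into named profiles. The sketch below would typically live in conftest.py; the profile names, their values, and the HYPOTHESIS_PROFILE environment variable are assumptions chosen for this example, while register_profile and load_profile are part of hypothesis.settings.

```python
import os

from hypothesis import settings

# Fast feedback while developing locally (values are illustrative).
settings.register_profile("dev", max_examples=50)

# More thorough run for CI, with the deadline disabled because shared
# runners often have noisy timing.
settings.register_profile("ci", max_examples=500, deadline=None)

# Pick the profile from the environment, falling back to "dev".
settings.load_profile(os.getenv("HYPOTHESIS_PROFILE", "dev"))
```

A CI job would then export HYPOTHESIS_PROFILE=ci before invoking pytest, and per-test @settings decorators still override the profile where needed.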

Step 5 — Pin seeds for deterministic reproduction

@seed() fixes the generation sequence so a flaky failure replays identically across machines.

Python
from hypothesis import given, seed
import hypothesis.strategies as st

@seed(12345)  # identical generation everywhere — use while debugging, then remove
@given(st.lists(st.integers()))
def test_reverse_sort_mirrors_forward_sort(data: list[int]) -> None:
    assert sorted(data) == sorted(data, reverse=True)[::-1]
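
For a permanent rather than temporary form of determinism, Hypothesis also offers the derandomize setting, which derives the seed from the test function itself. The sketch below records the generated inputs of two consecutive runs so they can be compared; the seen list is scaffolding added only for this demonstration.

```python
from hypothesis import given, settings
import hypothesis.strategies as st

seen: list[list[int]] = []  # demonstration scaffolding, not needed in real tests


@settings(derandomize=True, database=None, max_examples=25)
@given(st.lists(st.integers()))
def test_sorting_is_idempotent(data: list[int]) -> None:
    seen.append(list(data))
    assert sorted(sorted(data)) == sorted(data)


# Calling a @given-decorated function directly runs the whole property.
test_sorting_is_idempotent()
first_run = list(seen)
seen.clear()
test_sorting_is_idempotent()
second_run = list(seen)
```

With derandomize=True both runs should see the same inputs, at the cost of never exploring new ones until the test, Hypothesis, or Python changes.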

Verification

  • Run pytest --hypothesis-show-statistics and confirm each property reports the expected number of passing examples, few invalid (rejected) examples, and a reasonable share of runtime spent generating data.
  • Negate an assertion to force a failure; confirm Hypothesis reports a minimal counterexample (a short string, a small list) rather than the raw random input — this proves shrinking is active.
  • Delete .hypothesis/ and re-run a failing test, then re-run again; the second run should replay the stored minimal example first, demonstrating the database works.
  • Re-run with --hypothesis-seed=0 twice and confirm identical generation, proving determinism for CI.
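
Shrinking can also be observed outside a test with hypothesis.find, which searches for a value satisfying a predicate and returns the simplest one it can reach. The predicate below is an arbitrary example, and the settings values are assumptions chosen to keep the run self-contained.

```python
from hypothesis import find, settings
import hypothesis.strategies as st

# Ask for any list containing a value of at least 100; find() shrinks
# whatever it discovers before returning it.
minimal = find(
    st.lists(st.integers()),
    lambda xs: any(x >= 100 for x in xs),
    settings=settings(database=None, max_examples=1000),
)
print(minimal)
```

The printed list is normally far smaller than the first random match, which is the same minimization you see in a failing test report.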

Troubleshooting

Symptom | Root cause | Fix
Unsatisfiable or FailedHealthCheck (filter_too_much) | Over-restrictive assume() or .filter() rejecting too many examples | Move the constraint into the strategy with bounds (min_size/max_value) or st.sampled_from()
DeadlineExceeded | Heavy I/O or complex strategy trees exceed the 200ms default | Set @settings(deadline=None) for genuinely slow tests and isolate them; bound recursion
Flaky failures across machines | Uncached database or non-deterministic seed | Cache .hypothesis/examples/; pin @seed() while debugging
Strategy explosion / memory bloat | Unbounded recursion or large st.sampled_from() collections | Add max_size bounds and st.recursive(..., max_leaves=...); use st.deferred() for self-referential strategies
Fixtures behaving unexpectedly | Function-scoped fixtures are set up once per test function, not once per generated example, so state leaks between examples | Create per-example resources inside the test body, or suppress HealthCheck.function_scoped_fixture only when sharing is safe
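
For the strategy-explosion row, bounding a recursive strategy might look like the sketch below. The JSON-like shape and the specific limits are assumptions chosen for the example; st.recursive and its max_leaves parameter are part of hypothesis.strategies.

```python
import json

from hypothesis import given
import hypothesis.strategies as st

# Leaves: the scalar values a JSON-like document can contain.
leaves = st.none() | st.booleans() | st.integers() | st.text(max_size=10)

# Each level wraps children in small lists or dicts; max_leaves caps total size.
json_values = st.recursive(
    leaves,
    lambda children: st.lists(children, max_size=3)
    | st.dictionaries(st.text(max_size=5), children, max_size=3),
    max_leaves=10,
)


@given(json_values)
def test_json_round_trip(value) -> None:
    assert json.loads(json.dumps(value)) == value
```

Small max_size and max_leaves values keep generation fast while still exercising nesting.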

Frequently Asked Questions

How does Hypothesis differ from pytest's @pytest.mark.parametrize? Parametrize runs a fixed, hand-written list of inputs. Hypothesis generates many boundary-pushing inputs per run and automatically shrinks any failure to a minimal reproducible case, eliminating manual edge-case enumeration.

What is the shrinking process and why is it critical? Shrinking is a delta-debugging pass that reduces a failing input to its simplest form while preserving the failure. It turns a 10,000-character random string into the few-character minimal reproducer, so debugging starts from the smallest possible case.

Can Hypothesis test async functions? Yes, via pytest-asyncio or anyio integration, though event-loop management requires explicit fixture scoping and deadline tuning to absorb asynchronous scheduling overhead.

How do I persist and share failing examples across CI environments? Hypothesis stores minimal failing examples in a local .hypothesis/examples/ database by default. Configure database=DirectoryBasedExampleDatabase(path) and cache or commit the directory so failures replay across runners.

When should I use assume() versus strategy filtering? Use assume() for rare preconditions or cross-field dependencies evaluated inside the test. Use .filter() for common, easily satisfiable constraints at the strategy level. Overusing assume() starves the generator and trips the filter_too_much health check.
