As systems grow in complexity, the combinatorial explosion of valid input states renders example-based testing insufficient for guaranteeing correctness. The symptom is familiar: a suite that passes for months, then a production incident traces back to an input nobody wrote a test for. Hypothesis attacks that failure mode by generating inputs, enforcing invariants, and applying delta-debugging to failures — turning a brittle verification script into a generative validation engine. This guide grounds the property-based and fuzz testing approach in concrete execution models: strategies, the @given decorator, the shrinking engine, the example database, and CI-ready settings.
Prerequisites
- Python 3.10+ (3.9 is end-of-life as of October 2025) with type hints enabled.
hypothesis>=6.100andpytest>=8.0installed in the active virtual environment.- Familiarity with
pytestfixture lifecycles and basic decorators. - Optional:
sqlalchemyfor the database-integration example below.
Core concept
Hypothesis replaces explicit input-output pairs with properties — invariants that must hold for every valid input. Instead of asserting f(5) == 10, you assert for all x in Domain, property(f(x)). Properties are algebraic or structural: idempotency (f(f(x)) == f(x)), commutativity (f(a, b) == f(b, a)), or round-trip preservation (decode(encode(x)) == x). Hypothesis generates structured, type-aware inputs that probe boundaries human engineers rarely anticipate, and — crucially — guarantees deterministic reproduction of any failure it finds.
Step-by-step implementation
Step 1 — Write a property with @given
Strategies live in hypothesis.strategies (aliased st) and are lazy generators — they describe how to produce data rather than producing it eagerly. The @given decorator binds strategies to a test, generates an example, injects it, and repeats up to max_examples (default 100).
from hypothesis import given, settings
import hypothesis.strategies as st
@given(st.text(min_size=1, max_size=50))
@settings(max_examples=200)
def test_utf8_round_trip(raw_text: str) -> None:
"""Encoding then decoding UTF-8 must preserve the original string."""
decoded = raw_text.encode("utf-8").decode("utf-8")
assert decoded == raw_text # round-trip invariant
assert len(raw_text.encode("utf-8")) >= len(raw_text) # bytes >= chars
pytest's assertion rewriting applies automatically, so failures include the exact generated input and intermediate state without manual logging.
Step 2 — Compose a custom strategy
@st.composite turns a function into a strategy that draws correlated fields and enforces cross-field rules before returning an object.
from dataclasses import dataclass
from datetime import datetime
from hypothesis import given, strategies as st
@dataclass
class UserEvent:
user_id: int
timestamp: datetime
action: str
metadata: dict
@st.composite
def valid_user_events(draw: st.DrawFn) -> UserEvent:
action = draw(st.sampled_from(["login", "purchase", "logout"]))
# Cross-field constraint: login events must carry a session_id
if action == "login":
metadata = draw(st.fixed_dictionaries({"session_id": st.uuids()}))
else:
metadata = draw(st.dictionaries(st.text(), st.integers()))
return UserEvent(
user_id=draw(st.integers(min_value=1, max_value=100_000)),
timestamp=draw(st.datetimes(min_value=datetime(2020, 1, 1))),
action=action, metadata=metadata,
)
@given(valid_user_events())
def test_event_has_required_fields(event: UserEvent) -> None:
if event.action == "login":
assert "session_id" in event.metadata
The deeper patterns — st.builds, type registration, recursive strategies — are covered in generating custom strategies with hypothesis.strategies.
Step 3 — Integrate with pytest fixtures and assume()
Unlike parametrized tests where fixtures run once per function, @given executes the body multiple times, so fixtures are injected per generated example. Scope expensive resources accordingly, and use assume() for rare preconditions.
import pytest
from hypothesis import given, settings, assume
import hypothesis.strategies as st
from sqlalchemy import create_engine, Column, Integer, String
from sqlalchemy.orm import Session, declarative_base
Base = declarative_base()
class Record(Base):
__tablename__ = "records"
id = Column(Integer, primary_key=True)
payload = Column(String, nullable=False)
@pytest.fixture(scope="function")
def db_session(tmp_path):
engine = create_engine(f"sqlite:///{tmp_path}/test.db")
Base.metadata.create_all(engine)
with Session(engine) as session:
yield session # fresh DB per Hypothesis example
@given(st.text(min_size=1, max_size=100))
@settings(max_examples=50)
def test_insert_round_trip(db_session: Session, payload: str) -> None:
assume("\x00" not in payload) # SQLite rejects null bytes — rare, so assume() fits
db_session.add(Record(payload=payload)); db_session.commit()
fetched = db_session.query(Record).filter_by(payload=payload).first()
assert fetched is not None and fetched.payload == payload
Prefer assume() for rare or cross-field constraints; prefer .filter() for common, easily satisfiable ones. Overusing assume() raises UnsatisfiedAssumptionError once too many examples are rejected.
Step 4 — Tune settings and the example database
Production suites need predictable execution. hypothesis.settings controls volume, deadlines, and database behavior.
from hypothesis import settings, given, Verbosity
from hypothesis.database import DirectoryBasedExampleDatabase
import hypothesis.strategies as st
@settings(
max_examples=500,
deadline=500, # ms per example; raises DeadlineExceeded if breached
verbosity=Verbosity.normal,
database=DirectoryBasedExampleDatabase(".hypothesis/ci_cache"),
)
@given(st.dictionaries(st.text(), st.integers()))
def test_dict_dedup(data: dict[str, int]) -> None:
assert len(data) == len({k: v for k, v in data.items()})
Increase max_examples for pure functions, decrease for I/O-heavy tests. The default deadline is 200ms; override per-test for genuinely slow operations rather than globally. Detailed tactics live in reducing Hypothesis test execution time.
Step 5 — Pin seeds for deterministic reproduction
@seed() fixes the generation sequence so a flaky failure replays identically across machines.
from hypothesis import given, seed
import hypothesis.strategies as st
@seed(12345) # identical generation everywhere — use while debugging, then remove
@given(st.lists(st.integers()))
def test_sort_is_stable_under_reverse(data: list[int]) -> None:
assert sorted(data) == sorted(data, reverse=True)[::-1]
Verification
- Run
pytest --hypothesis-show-statisticsand confirm each property reports the expected example count, a low rejection rate, and a saneGenerate/Shrinkratio. - Negate an assertion to force a failure; confirm Hypothesis reports a minimal counterexample (a short string, a small list) rather than the raw random input — this proves shrinking is active.
- Delete
.hypothesis/and re-run a failing test, then re-run again; the second run should replay the stored minimal example first, demonstrating the database works. - Re-run with
--hypothesis-seed=0twice and confirm identical generation, proving determinism for CI.
Troubleshooting
| Symptom | Root cause | Fix |
|---|---|---|
UnsatisfiedAssumptionError | Over-restrictive assume() rejecting too many examples | Move the constraint into the strategy with bounds (min_size/max_value) or st.sampled_from() |
DeadlineExceeded | Heavy I/O or complex strategy trees exceed the 200ms default | Set @settings(deadline=None) for genuinely slow tests and isolate them; bound recursion |
| Flaky failures across machines | Uncached database or non-deterministic seed | Cache .hypothesis/examples/; pin @seed() while debugging |
| Strategy explosion / memory bloat | Unbounded recursion or large st.sampled_from() collections | Add max_size/max_leaves, use st.deferred() for recursion |
| Fixtures behaving unexpectedly | Fixture runs per example, not per test | Use scope="function" and keep per-example resources cheap |
Frequently Asked Questions
How does Hypothesis differ from pytest's @pytest.mark.parametrize?
Parametrize runs a fixed, hand-written list of inputs. Hypothesis generates many boundary-pushing inputs per run and automatically shrinks any failure to a minimal reproducible case, eliminating manual edge-case enumeration.
What is the shrinking process and why is it critical? Shrinking is a delta-debugging pass that reduces a failing input to its simplest form while preserving the failure. It turns a 10,000-character random string into the few-character minimal reproducer, so debugging starts from the smallest possible case.
Can Hypothesis test async or await functions?
Yes, via pytest-asyncio or anyio integration, though event-loop management requires explicit fixture scoping and deadline tuning to absorb asynchronous scheduling overhead.
How do I persist and share failing examples across CI environments?
Hypothesis stores minimal failing examples in a local .hypothesis/examples/ database by default. Configure database=DirectoryBasedExampleDatabase(path) and cache or commit the directory so failures replay across runners.
When should I use assume() versus strategy filtering?
Use assume() for rare preconditions or cross-field dependencies evaluated inside the test. Use .filter() for common, easily satisfiable constraints at the strategy level. Overusing assume() starves the generator and trips the filter-ratio health check.
Related guides
- Once primitives feel natural, build domain generators with generating custom strategies with hypothesis.strategies.
- Model mutable systems and bridge into fuzzing with advanced property-based testing.
- Keep CI fast with reducing Hypothesis test execution time.
- For async properties, pair fixture scoping with how to scope pytest fixtures for async tests in the pytest track.
← Back to Property-Based & Fuzz Testing Strategies