As test suites mature, example-based assertions struggle to cover edge cases in complex data transformations, distributed state machines, and boundary-heavy algorithms — the failure mode is a green suite that still ships off-by-one errors in temporal calculations, race conditions in state transitions, and silent corruption in serialization layers. Advanced property-based testing closes that gap by validating system invariants across mathematically generated input spaces. Building on the property-based and fuzz testing foundations, this guide moves from basic randomization to deterministic, production-grade generation: composite strategies, stateful machines, shrinking control, and hybrid fuzzing.
Prerequisites
- Python 3.10+: 3.9 reached end-of-life in October 2025; modern type hints (typing.Annotated, typing.Protocol) streamline strategy inference.
- hypothesis>=6.100: stable Phase.explain, report_multiple_bugs, and stateful APIs.
- pytest>=8.0, pytest-xdist for parallel execution, pytest-cov for coverage tracking.
- Comfort with @given, built-in strategies (st.integers(), st.text(), st.lists()), and pytest fixtures. See the Hypothesis framework fundamentals if any of those are unfamiliar.
Core concept
Two ideas separate advanced work from basic @given usage. First, valid-by-construction generation: instead of generating arbitrary data and discarding invalid examples with .filter(), you construct strategies that natively produce only valid states, keeping the rejection rate low and shrinking deterministic. Second, stateful modeling: many systems are sequences of operations over mutable state, so you model them as a RuleBasedStateMachine and let Hypothesis search the space of operation orderings for an invariant violation.
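As a minimal sketch of the first idea, the snippet below contrasts a filtered strategy with a constructed one. The "even integers" constraint and all names are hypothetical, chosen only to keep the example small.

```python
from hypothesis import given, settings
from hypothesis import strategies as st

# Reject-and-retry: every odd draw is thrown away and drawn again.
evens_by_filter = st.integers().filter(lambda n: n % 2 == 0)

# Valid by construction: every draw is mapped onto an even number,
# so nothing is ever rejected.
evens_by_construction = st.integers().map(lambda n: n * 2)


@settings(max_examples=50, database=None)
@given(evens_by_construction)
def test_evens_are_even(n: int) -> None:
    assert n % 2 == 0
```

Both strategies describe the same set of values; the difference is how much work Hypothesis wastes getting there, which matters more as constraints become harder to satisfy by chance.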
Step-by-step implementation
Step 1 — Compose valid-by-construction strategies
Naive filtering generates arbitrary values, discards the invalid ones, and retries. Hypothesis can rewrite a few trivial predicates, such as st.integers().filter(lambda x: x > 0), into bounded strategies, but arbitrary predicates are opaque to it. As the rejection rate climbs, generation slows, the shrinker has fewer valid candidates to work with, and the filter_too_much health check eventually fails the test. Instead, use @st.composite with conditional branching so every generated object is structurally valid. The full mechanics (st.builds, type registration, and find()) are in generating custom strategies with hypothesis.strategies.
import datetime
from dataclasses import dataclass
from typing import Literal

from hypothesis import assume, given, settings
from hypothesis import strategies as st


@dataclass
class Transaction:
    transaction_id: str
    amount: float
    currency: Literal["USD", "EUR", "GBP"]
    timestamp: datetime.datetime
    status: Literal["pending", "completed", "failed"]


@st.composite
def valid_transactions(draw: st.DrawFn) -> Transaction:
    currency = draw(st.sampled_from(["USD", "EUR", "GBP"]))
    status = draw(st.sampled_from(["pending", "completed", "failed"]))
    # Valid-by-construction: completed transactions must carry a positive amount
    low = 0.01 if status == "completed" else 0.0
    amount = draw(st.floats(min_value=low, max_value=1_000_000.0, allow_nan=False))
    # Fixed bounds keep the strategy deterministic; never call datetime.now() here
    ts = draw(st.datetimes(min_value=datetime.datetime(2020, 1, 1),
                           max_value=datetime.datetime(2030, 1, 1)))
    # Defensive guard only: min_value already enforces this, so it never rejects
    assume(amount >= low)
    return Transaction(
        transaction_id=f"TXN-{draw(st.text(min_size=8, max_size=12, alphabet='0123456789ABCDEF'))}",
        amount=amount, currency=currency, timestamp=ts, status=status,
    )


@given(valid_transactions())
@settings(max_examples=200)
def test_transaction_invariants(txn: Transaction) -> None:
    assert txn.amount >= 0.0
    if txn.status == "completed":
        assert txn.amount > 0.0  # business invariant
Constructing objects from constrained primitives lets the shrinker reduce failing examples to minimal counterexamples quickly, because every candidate it explores is already valid and none of its attempts are wasted on rejected inputs.
Step 2 — Model stateful systems with RuleBasedStateMachine
Stateless tests validate pure functions; stateful tests validate sequences of operations over mutable state. Declare rules as transitions with @rule, gate them with @precondition, and assert system-wide guarantees with @invariant(). Hypothesis generates operation sequences and re-checks invariants after every step.
from typing import Any

from hypothesis.stateful import RuleBasedStateMachine, rule, invariant
import hypothesis.strategies as st


class TransactionalKVStore:
    def __init__(self) -> None:
        self._data: dict[str, Any] = {}
        self._buffer: dict[str, Any] = {}
        self._in_tx = False

    def begin(self) -> None:
        self._in_tx, self._buffer = True, {}

    def put(self, key: str, value: Any) -> None:
        (self._buffer if self._in_tx else self._data)[key] = value

    def commit(self) -> None:
        if self._in_tx:
            self._data.update(self._buffer)
            self._buffer.clear()
            self._in_tx = False

    def rollback(self) -> None:
        if self._in_tx:
            self._buffer.clear()
            self._in_tx = False


class KVStateMachine(RuleBasedStateMachine):
    def __init__(self) -> None:
        super().__init__()
        self.store = TransactionalKVStore()
        self.committed: dict[str, Any] = {}  # model of what MUST survive

    @rule()
    def begin_tx(self) -> None:
        self.store.begin()

    @rule(key=st.text(min_size=1, max_size=8), value=st.integers())
    def put_value(self, key: str, value: int) -> None:
        self.store.put(key, value)
        if not self.store._in_tx:
            self.committed[key] = value

    @rule()
    def commit_tx(self) -> None:
        if self.store._in_tx:
            self.committed.update(self.store._buffer)
        self.store.commit()

    @rule()
    def rollback_tx(self) -> None:
        self.store.rollback()

    @invariant()
    def committed_data_never_lost(self) -> None:
        for key, value in self.committed.items():
            assert self.store._data.get(key) == value


TestKVStore = KVStateMachine.TestCase
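The committed dict is a model, a simplified re-implementation of the contract, and the invariant checks the real store against it. For brevity this model reads the store's private _in_tx and _buffer attributes; a stricter model would track pending writes itself so that a bug in the store cannot leak into the expectation. Edge case: rules without a @precondition may fire in illegal orders (a commit before any begin), which is exactly the kind of sequence you want explored, provided the implementation tolerates it.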
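When out-of-order calls are not part of the contract, @precondition can gate each rule. The sketch below is one possible variant, not the store from the listing above: TinyStore and every model_* attribute are hypothetical, and the model keeps its own pending buffer so it never reads the store's internals.

```python
from hypothesis import settings
from hypothesis import strategies as st
from hypothesis.stateful import RuleBasedStateMachine, invariant, precondition, rule


class TinyStore:
    """Hypothetical stand-in for the system under test."""

    def __init__(self) -> None:
        self._data: dict[str, int] = {}
        self._buffer: dict[str, int] = {}
        self._in_tx = False

    def begin(self) -> None:
        self._in_tx, self._buffer = True, {}

    def put(self, key: str, value: int) -> None:
        (self._buffer if self._in_tx else self._data)[key] = value

    def commit(self) -> None:
        self._data.update(self._buffer)
        self._buffer, self._in_tx = {}, False

    def rollback(self) -> None:
        self._buffer, self._in_tx = {}, False

    def snapshot(self) -> dict[str, int]:
        """Public read of committed data, so the test needs no private access."""
        return dict(self._data)


class GatedKVMachine(RuleBasedStateMachine):
    def __init__(self) -> None:
        super().__init__()
        self.store = TinyStore()
        # Independent model: tracks its own transaction state and pending writes.
        self.model_committed: dict[str, int] = {}
        self.model_pending: dict[str, int] = {}
        self.model_in_tx = False

    @precondition(lambda self: not self.model_in_tx)
    @rule()
    def begin_tx(self) -> None:
        self.store.begin()
        self.model_in_tx, self.model_pending = True, {}

    @rule(key=st.text(min_size=1, max_size=4), value=st.integers())
    def put_value(self, key: str, value: int) -> None:
        self.store.put(key, value)
        target = self.model_pending if self.model_in_tx else self.model_committed
        target[key] = value

    @precondition(lambda self: self.model_in_tx)
    @rule()
    def commit_tx(self) -> None:
        self.store.commit()
        self.model_committed.update(self.model_pending)
        self.model_pending, self.model_in_tx = {}, False

    @precondition(lambda self: self.model_in_tx)
    @rule()
    def rollback_tx(self) -> None:
        self.store.rollback()
        self.model_pending, self.model_in_tx = {}, False

    @invariant()
    def committed_matches_model(self) -> None:
        assert self.store.snapshot() == self.model_committed


TestGatedKV = GatedKVMachine.TestCase
TestGatedKV.settings = settings(max_examples=30, stateful_step_count=20, deadline=None)
```

The trade-off is coverage: gated rules never exercise a commit without a begin, so keep an ungated machine as well if the store is supposed to tolerate those calls.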
Step 3 — Control shrinking and analyze counterexamples
When a stateful or composite test fails, the shrinking algorithm minimizes the failing input. Complex traces can produce opaque counterexamples, so capture every violation and replay deterministically.
from hypothesis import given, settings, strategies as st


# report_multiple_bugs defaults to True; setting it to False raises only the
# failure with the smallest minimal example.
@settings(report_multiple_bugs=True)
@given(st.lists(st.integers(), max_size=10))
def test_with_full_report(xs: list[int]) -> None:
    assert sum(xs) == sum(reversed(xs))
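On failure, Hypothesis prints the minimal falsifying example; with print_blob=True in the settings it also prints a @reproduce_failure(...) decorator that encodes the exact sequence of choices behind that example. Paste the decorator onto the failing test to replay it without re-running the full generation cycle. The blob is tied to the installed Hypothesis version and to the test's strategies, so treat it as a short-lived debugging aid rather than a permanent regression test. During initial isolation, restrict phases with @settings(phases=[Phase.generate]) to skip shrinking and avoid CI timeouts, then re-enable shrinking for production runs.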
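For a longer-lived regression check, one option is to pin the shrunk counterexample with @example so it is replayed on every run. The sketch below assumes a hypothetical dedupe function and a hypothetical counterexample [0, 0]; only the decorators and settings are Hypothesis APIs.

```python
from hypothesis import Phase, example, given, settings
from hypothesis import strategies as st


def dedupe(xs: list[int]) -> list[int]:
    """Hypothetical function under test: drop repeats, keep first-seen order."""
    return list(dict.fromkeys(xs))


@settings(
    print_blob=True,  # print a @reproduce_failure blob if this ever fails
    phases=[Phase.explicit, Phase.generate],  # pinned examples first, no shrinking
    database=None,
)
@example([0, 0])  # hypothetical shrunk counterexample, kept as a regression case
@given(st.lists(st.integers(), max_size=10))
def test_dedupe_preserves_first_seen_order(xs: list[int]) -> None:
    out = dedupe(xs)
    assert len(out) == len(set(xs))
    assert out == sorted(set(xs), key=xs.index)
```

Unlike a @reproduce_failure blob, an @example argument is plain data, so it keeps working across Hypothesis upgrades and strategy changes.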
Step 4 — Bridge into fuzzing across C-extension boundaries
Pure Python property tests cannot safely exercise C-extensions or memory-managed libraries. Serialize Hypothesis-generated inputs into byte buffers and feed them to a coverage-guided fuzzer such as Atheris to catch segmentation faults and undefined behavior.
import struct

from hypothesis import given, settings, Phase
import hypothesis.strategies as st


def native_parse_payload(data: bytes) -> None:
    if len(data) < 4:
        raise ValueError("header too short")
    length = struct.unpack_from(">I", data, 0)[0]
    if length > len(data) - 4:
        raise BufferError("payload length mismatch")


@given(st.binary(min_size=4, max_size=256))
@settings(max_examples=500, phases=[Phase.generate])
def test_native_boundary(data: bytes) -> None:
    try:
        native_parse_payload(data)
    except (ValueError, BufferError):
        pass  # documented validation errors
    except Exception as exc:  # anything else is a boundary violation
        raise AssertionError(f"native boundary violation: {exc}") from exc
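In a real Atheris setup, atheris.Setup(sys.argv, target) followed by atheris.Fuzz() registers a target for coverage-guided mutation. Atheris treats any uncaught exception as a finding, so the registered target should swallow the documented ValueError and BufferError the way test_native_boundary does, rather than being native_parse_payload itself. Enforce execution limits on native calls with signal.alarm() or threading.Timer. This pattern pairs naturally with the isolation techniques in autospec strict mocking when the target has impure dependencies.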
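Hypothesis also exposes a bytes-in entry point on every @given test, fuzz_one_input, which an external fuzzer can drive directly. The sketch below shows that hook with a hypothetical stand-in parser; the Atheris calls are left as comments because Atheris is assumed, not required, here.

```python
import struct

from hypothesis import given, settings
from hypothesis import strategies as st


def parse_payload(data: bytes) -> None:
    """Hypothetical stand-in for a native parser; mirrors the checks above."""
    if len(data) < 4:
        raise ValueError("header too short")
    (length,) = struct.unpack_from(">I", data, 0)
    if length > len(data) - 4:
        raise BufferError("payload length mismatch")


@settings(database=None)
@given(st.binary(max_size=256))
def test_parser_boundary(data: bytes) -> None:
    try:
        parse_payload(data)
    except (ValueError, BufferError):
        pass  # documented validation errors are not crashes


# The fuzzer supplies raw bytes; Hypothesis turns them into strategy draws.
fuzz_target = test_parser_boundary.hypothesis.fuzz_one_input

# With Atheris installed, the harness would look like this (not run here):
#   import sys, atheris
#   atheris.Setup(sys.argv, fuzz_target)
#   atheris.Fuzz()

# The hook can also be called by hand with any byte string.
fuzz_target(b"\x00\x00\x00\x02ab")
```

The design choice is who owns the search: the fuzzer's coverage feedback picks the bytes, while the strategy keeps every decoded input structurally meaningful.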
Step 5 — Register CI profiles and parallelize
Scale max_examples by pipeline stage and isolate the database under pytest-xdist.
import os

from hypothesis import Phase, Verbosity, settings
from hypothesis.database import DirectoryBasedExampleDatabase

settings.register_profile(
    "ci_pr", max_examples=50, deadline=500,
    phases=[Phase.generate, Phase.shrink, Phase.explain],
    verbosity=Verbosity.quiet, database=None,
)
settings.register_profile(
    "ci_nightly", max_examples=1000, deadline=2000,
    # The database setting takes an ExampleDatabase instance, not a path string
    database=DirectoryBasedExampleDatabase(".hypothesis/examples"),
)
settings.load_profile("ci_pr" if os.getenv("CI") else "ci_nightly")
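Hypothesis's default example database is directory-based, storing each example as a small file rather than in SQLite, so the workers started by pytest -n auto can point at the same directory. If shared state still causes trouble in your pipeline, set database=None for that profile or give each worker its own directory. The full latency-reduction playbook is in reducing Hypothesis test execution time.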
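If you want explicit per-worker isolation, one approach is to derive the database directory from the PYTEST_XDIST_WORKER environment variable that pytest-xdist sets in each worker. This is a sketch for a conftest.py; the helper name and the profile name are hypothetical.

```python
import os
from pathlib import Path

from hypothesis import settings
from hypothesis.database import DirectoryBasedExampleDatabase


def worker_database_dir(base: str = ".hypothesis/examples") -> Path:
    """Return a per-worker directory under pytest-xdist, else the base directory."""
    worker = os.environ.get("PYTEST_XDIST_WORKER")  # e.g. "gw0"; unset without xdist
    return Path(base) / worker if worker else Path(base)


settings.register_profile(
    "xdist_isolated",  # hypothetical profile name
    database=DirectoryBasedExampleDatabase(str(worker_database_dir())),
)
```

The cost of isolation is that a failure saved by one worker is not replayed by the others, so a shared directory is usually the better default.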
Verification
Confirm the setup behaves before trusting it in CI:
- Run pytest --hypothesis-show-statistics. Each property should report a low rejection rate (aim under 15%) and a healthy Generate/Shrink split. A high "events" count of filtered draws signals an over-constrained strategy.
- Force a failure (negate an assertion) and confirm Hypothesis shrinks to a minimal counterexample and, with print_blob=True, prints a @reproduce_failure decorator.
- For the state machine, inspect the printed trace on failure. It should be the shortest call sequence that breaks the invariant, with arguments already minimized.
- Run twice with the same seed (pytest --hypothesis-seed=0) and confirm identical generation, proving determinism.
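The determinism check can also be scripted. The sketch below records every input a property receives under a fixed @seed and compares two runs; the helper name is hypothetical, and database=None is set so no saved example is replayed.

```python
from hypothesis import given, seed, settings
from hypothesis import strategies as st


def generated_values(seed_value: int) -> list[int]:
    """Run one property under a fixed seed and record every input it receives."""
    seen: list[int] = []

    @seed(seed_value)
    @settings(max_examples=25, database=None)
    @given(st.integers())
    def record(n: int) -> None:
        seen.append(n)

    record()
    return seen


first, second = generated_values(0), generated_values(0)
assert first == second, "the same seed should replay the same inputs"
```

If this assertion fails for one of your own strategies, look for hidden inputs such as the clock, the environment, or module-level state read inside the strategy.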
Troubleshooting
| Symptom | Root cause | Fix |
|---|---|---|
| Unsatisfiable / high filtered-example ratio | .filter() or assume() rejecting most candidates | Switch to valid-by-construction @st.composite with conditional branching |
| Stateful test shrinks slowly or times out | Expensive side effects run on every shrink step | Keep rule bodies cheap; mock I/O; restrict phases during isolation |
| Only the first invariant failure is reported | report_multiple_bugs disabled in a profile or decorator | Restore the default with @settings(report_multiple_bugs=True) |
| Flaky failures across machines | Lost .hypothesis database or non-deterministic seed | Cache and commit the database; pin @seed() while debugging |
| Database contention under pytest-xdist | Workers writing to one shared database location | Give each worker its own database directory or set database=None |
| RecursionError in recursive strategy | Missing depth/leaf bound | Use st.recursive(..., max_leaves=...) or an explicit depth counter |
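For the last row, a bounded recursive strategy might look like the sketch below. The JSON-like shape is a hypothetical example; max_leaves caps how many leaf values a single generated tree may contain.

```python
import json

from hypothesis import given, settings
from hypothesis import strategies as st

# Leaves are scalars; containers are built on top of already-bounded children.
json_leaves = st.none() | st.booleans() | st.integers() | st.text(max_size=5)

json_values = st.recursive(
    json_leaves,
    lambda children: st.lists(children, max_size=3)
    | st.dictionaries(st.text(max_size=3), children, max_size=3),
    max_leaves=10,
)


@settings(max_examples=50, database=None)
@given(json_values)
def test_json_round_trip(value) -> None:
    assert json.loads(json.dumps(value)) == value
```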
Frequently Asked Questions
How do I prevent property-based tests from slowing down my CI pipeline?
Register environment-aware settings profiles that scale max_examples by stage (50 for PRs, 500+ for nightly builds), run with pytest-xdist, cache the .hypothesis database, and avoid filter-heavy strategies. Monitor with --hypothesis-show-statistics and tune deadline accordingly.
When should I use assume() instead of strategy filter()?
Both discard data, so reserve either one for rare rejections. .filter() rejects a single draw and retries it locally, which suits a cheap condition on one value. assume() abandons the whole test case, which suits a condition that relates several arguments or is only known partway through the test. When either rejects a large share of inputs, generation slows and health checks start failing, so prefer valid-by-construction @composite strategies for complex constraints.
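A small sketch of that split, with hypothetical bounds and test names: assume() handles a rare cross-argument rejection, while a composite strategy builds an ordered pair so nothing is rejected at all.

```python
from hypothesis import assume, given, settings
from hypothesis import strategies as st


# assume(): acceptable here because only the lo == hi case is discarded.
@settings(max_examples=50, database=None)
@given(lo=st.integers(0, 100), hi=st.integers(0, 100))
def test_distinct_bounds(lo: int, hi: int) -> None:
    assume(lo != hi)
    assert abs(hi - lo) >= 1


# Construction: hi is drawn from [lo, 100], so lo <= hi holds for every draw.
@st.composite
def ordered_pairs(draw: st.DrawFn) -> tuple[int, int]:
    lo = draw(st.integers(0, 100))
    hi = draw(st.integers(lo, 100))
    return lo, hi


@settings(max_examples=50, database=None)
@given(ordered_pairs())
def test_pair_is_ordered(pair: tuple[int, int]) -> None:
    lo, hi = pair
    assert lo <= hi
```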
How do I reproduce a failing stateful test in production?
With print_blob=True in the settings, Hypothesis prints a @reproduce_failure decorator encoding the exact sequence of choices behind the failing run. Apply it to the test to replay the failure; the blob is specific to the Hypothesis version that produced it. Cache and commit the .hypothesis database so the minimal example survives across machines and CI runners.
Can I combine Hypothesis with coverage-guided fuzzers like Atheris?
Yes. Use Hypothesis to generate structured, type-safe inputs, then serialize them to byte buffers for consumption by Atheris. Hypothesis provides semantic correctness while Atheris drives low-level memory and boundary testing of C-extensions and parsers.
How does shrinking work for a RuleBasedStateMachine?
When a stateful test fails, Hypothesis shrinks the entire sequence of rule calls, not just individual argument values. It removes steps, reorders where legal, and minimizes each remaining argument, producing the shortest operation trace that still violates an invariant.
Related guides
- The reusable generator patterns behind Step 1 are detailed in generating custom strategies with hypothesis.strategies.
- For the underlying @given, assume, and shrinking model, revisit the Hypothesis framework fundamentals.
- When these tests get slow, reducing Hypothesis test execution time isolates the bottleneck phase.
- Combine generative inputs with matrix-driven cases using advanced parametrization techniques in the pytest track.
- Isolate impure dependencies in fuzz targets with autospec strict mocking.