
Stateful and Model-Based Testing

Example-based tests verify one operation at a time, but real defects in stateful systems — caches, queues, ORMs, connection pools, finite state machines — emerge only from sequences of operations: insert then evict then read, acquire then acquire then release. Enumerating those sequences by hand is hopeless; the interesting bug is usually the eleven-step interleaving nobody wrote down. Hypothesis solves this by generating operation sequences for you, executing them against your real system and a simplified model, and shrinking any failing sequence to the shortest reproduction. The failure mode this guide targets is the intermittent "works in isolation, breaks under a specific order" bug that example tests structurally cannot find.

This guide builds on the generative model introduced in Hypothesis Framework Fundamentals and the strategy composition covered in Advanced Property-Based Testing. Here the unit of generation is no longer a single value but an entire program of method calls.

Prerequisites

  • Python 3.9+
  • hypothesis >= 6.0 (the RuleBasedStateMachine, Bundle, and run_state_machine_as_test API used below has been stable since 6.x)
  • pytest >= 7.0 for collection of the machine classes
  • Familiarity with hypothesis.strategies (aliased st) and the @given lifecycle
Bash
pip install "hypothesis>=6.0" "pytest>=7.0"

Core concept

A RuleBasedStateMachine is a description of a system as a set of transitions. Hypothesis treats the class as a generator of programs: it picks a legal rule, supplies generated arguments, executes it, checks invariants, and repeats up to stateful_step_count times. The "model-based" half of the name comes from running a deliberately simple reference implementation (often a plain dict or list) alongside the real system under test, and asserting the two agree. The model encodes what should happen; the rules drive the real code; the invariants catch divergence.

Five primitives compose the whole approach:

  • @rule: a candidate operation. Hypothesis may call it any number of times, in any order, with arguments drawn from the strategies you bind to its parameters.
  • @initialize: runs exactly once per generated sequence, before any rule, to establish the starting state.
  • Bundle: a named collection of values produced by rules, so later rules can operate on objects that earlier rules actually created (you never fabricate an ID that does not exist). Drawing from a Bundle leaves the value in place unless the rule wraps the Bundle in consumes().
  • @invariant: a property checked after every rule; it asserts truths that hold in every reachable state.
  • @precondition: a guard that disables a rule unless the machine is in a valid state for it.

Figure: RuleBasedStateMachine lifecycle. Initialize seeds the model once; create rules push handles into a Bundle and mutate rules draw from it; the invariant compares the real system against the model after every step.
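
To make the loop concrete, here is a minimal sketch of the same idea written with only the standard library: generate a random program, run it against a system and a simpler model, and check agreement after every step. TinyLRU and run_random_program are hypothetical names invented for this illustration; this is not how Hypothesis is implemented, only the shape of what it automates.

```python
import random
from collections import OrderedDict


class TinyLRU:
    """Hypothetical system under test: a capacity-bounded LRU cache."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()

    def put(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)
        self.data[key] = value
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict the oldest entry


def run_random_program(seed, steps=50, capacity=3):
    """Run one random sequence of puts against the system and a list model."""
    rng = random.Random(seed)
    system = TinyLRU(capacity)  # plays the role of @initialize
    model = []  # (key, value) pairs, oldest first
    for step in range(steps):
        # Plays the role of a @rule with generated arguments.
        key, value = rng.randrange(6), rng.randrange(100)
        system.put(key, value)
        # Mirror the operation in the model: re-append, then trim to capacity.
        model = [(k, v) for k, v in model if k != key] + [(key, value)]
        model = model[-capacity:]
        # Plays the role of an @invariant.
        assert list(system.data.items()) == model, f"diverged at step {step}"
    return len(model)
```

What this hand-rolled loop lacks is what Hypothesis adds on top: shrinking a failing sequence, guarding rules with preconditions, passing created values between rules, and replaying failures.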

Step-by-step implementation

We will test a bounded least-recently-used cache. The reference model is an OrderedDict; the system under test is the same here for illustration, but in practice it would be the real C-accelerated or Redis-backed implementation.

1. Subclass RuleBasedStateMachine and declare a Bundle

Python
# test_lru_state.py
from __future__ import annotations  # lets the `str | None` annotation run on Python 3.9

from collections import OrderedDict
from hypothesis import strategies as st
from hypothesis.stateful import (
    RuleBasedStateMachine, Bundle, rule, initialize, invariant, precondition,
)

CAPACITY = 4

class BoundedLRU:
    """System under test: a fixed-capacity LRU cache."""
    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self._data: "OrderedDict[int, str]" = OrderedDict()

    def put(self, key: int, value: str) -> None:
        if key in self._data:
            self._data.move_to_end(key)
        self._data[key] = value
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)  # evict least-recently-used

    def get(self, key: int) -> str | None:
        if key not in self._data:
            return None
        self._data.move_to_end(key)
        return self._data[key]

class LRUStateMachine(RuleBasedStateMachine):
    keys = Bundle("keys")  # collection of keys we have actually inserted

The Bundle("keys") is the spine of the test: it guarantees that rules which read or evict only ever reference keys some earlier rule inserted.

2. Seed state with @initialize

Python
    @initialize()
    def setup(self) -> None:
        # Runs once, before any rule. Construct system + model together.
        self.cache = BoundedLRU(CAPACITY)
        self.model: "OrderedDict[int, str]" = OrderedDict()

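Hypothesis creates a fresh machine instance for each example, so state set in either __init__ or @initialize is reset per sequence. @initialize is the better fit when setup should be part of the generated program: it is sequenced ahead of all rules, and it accepts strategies for its parameters in the same way @rule does.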
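
Because @initialize takes strategies like @rule does, the starting state itself can be generated. The sketch below is an assumption-laden illustration rather than part of the LRU test: VariableCapacityMachine is a hypothetical machine that draws a capacity per sequence and, for brevity, exercises only the model.

```python
from collections import OrderedDict

from hypothesis import settings, strategies as st
from hypothesis.stateful import (
    RuleBasedStateMachine, initialize, invariant, rule, run_state_machine_as_test,
)


class VariableCapacityMachine(RuleBasedStateMachine):
    @initialize(capacity=st.integers(min_value=1, max_value=8))
    def setup(self, capacity: int) -> None:
        # `capacity` is drawn once per generated sequence.
        self.capacity = capacity
        self.model = OrderedDict()

    @rule(key=st.integers(0, 20), value=st.text(max_size=4))
    def put(self, key: int, value: str) -> None:
        self.model.pop(key, None)  # re-inserting moves the key to the end
        self.model[key] = value
        if len(self.model) > self.capacity:
            self.model.popitem(last=False)

    @invariant()
    def within_capacity(self) -> None:
        assert len(self.model) <= self.capacity


def test_variable_capacity() -> None:
    run_state_machine_as_test(
        VariableCapacityMachine,
        settings=settings(max_examples=25, stateful_step_count=20, deadline=None),
    )
```

In a real test the setup method would also construct the system under test with the same drawn capacity, so that both sides start from identical configuration.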

3. Define operations as @rule methods, pushing into the Bundle

Python
    @rule(target=keys, key=st.integers(0, 1000), value=st.text(max_size=8))
    def put(self, key: int, value: str):
        self.cache.put(key, value)
        # Mirror the operation in the model, including LRU eviction.
        if key in self.model:
            self.model.move_to_end(key)
        self.model[key] = value
        if len(self.model) > CAPACITY:
            self.model.popitem(last=False)
        return key  # pushed into the `keys` Bundle for later rules

target=keys means the rule's return value is added to the Bundle. A later rule binds the Bundle to one of its parameters (key=keys) to draw one of those values back out.
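
Drawing from a Bundle leaves the value available for later rules. When a value must not be drawn again, for example after the object it names has been deleted, hypothesis.stateful.consumes removes it on draw. The following is a minimal sketch of that distinction; HandleMachine and its dict-backed store are hypothetical stand-ins, not part of the LRU example.

```python
from hypothesis import settings, strategies as st
from hypothesis.stateful import (
    Bundle, RuleBasedStateMachine, consumes, rule, run_state_machine_as_test,
)


class HandleMachine(RuleBasedStateMachine):
    handles = Bundle("handles")

    def __init__(self) -> None:
        super().__init__()
        self.store = {}   # stands in for the real system
        self.next_id = 0  # every create() returns a fresh handle

    @rule(target=handles, value=st.text(max_size=4))
    def create(self, value: str) -> int:
        handle = self.next_id
        self.next_id += 1
        self.store[handle] = value
        return handle  # added to the `handles` Bundle

    @rule(handle=handles)
    def read(self, handle: int) -> None:
        # Plain draw: the handle stays in the Bundle for later rules.
        assert handle in self.store

    @rule(handle=consumes(handles))
    def delete(self, handle: int) -> None:
        # consumes() removes the handle from the Bundle, so read() can
        # never be handed a handle that was already deleted.
        del self.store[handle]


def test_handles() -> None:
    run_state_machine_as_test(
        HandleMachine,
        settings=settings(max_examples=25, stateful_step_count=20, deadline=None),
    )
```

If delete used handle=handles instead, the same handle could be drawn again after deletion and read would fail its assertion.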

4. Guard rules with @precondition

Python
    @precondition(lambda self: len(self.model) > 0)
    @rule(key=keys)
    def get_existing(self, key: int):
        # `key` is drawn from the Bundle, so it was inserted earlier —
        # but it may since have been evicted, which is the interesting case.
        got = self.cache.get(key)
        expected = self.model.get(key)
        if expected is not None:
            self.model.move_to_end(key)
        assert got == expected

@precondition disables get_existing until at least one key exists, so Hypothesis never wastes a step on an operation that cannot be meaningful. By convention the decorator is stacked above @rule, as in the examples in the Hypothesis documentation.

5. Assert state with @invariant

Python
    @invariant()
    def never_exceeds_capacity(self) -> None:
        # Must hold in every reachable state, after every rule.
        assert len(self.cache._data) <= CAPACITY

    @invariant()
    def model_agreement(self) -> None:
        assert list(self.cache._data.keys()) == list(self.model.keys())

# pytest collects the TestCase automatically from the class:
TestLRU = LRUStateMachine.TestCase

The .TestCase attribute is the pytest entry point. Assigning it to a module-level name lets pytest discover and run the machine like an ordinary test class.
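
A quick way to gain confidence in a machine is to point it at a deliberately broken implementation and confirm that it fails. The sketch below assumes a hypothetical ForgetfulLRU whose get() forgets to refresh recency; the class names, the capacity of 2, and the helper machine_detects_bug are all invented for illustration.

```python
from collections import OrderedDict

from hypothesis import settings, strategies as st
from hypothesis.stateful import (
    RuleBasedStateMachine, initialize, invariant, rule, run_state_machine_as_test,
)


class ForgetfulLRU:
    """Hypothetical buggy cache: get() does not refresh recency."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()

    def put(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)
        self.data[key] = value
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)

    def get(self, key):
        return self.data.get(key)  # BUG: missing move_to_end(key)


class ForgetfulLRUMachine(RuleBasedStateMachine):
    @initialize()
    def setup(self):
        self.cache = ForgetfulLRU(2)
        self.model = OrderedDict()

    @rule(key=st.integers(0, 3), value=st.integers(0, 9))
    def put(self, key, value):
        self.cache.put(key, value)
        self.model.pop(key, None)
        self.model[key] = value
        if len(self.model) > 2:
            self.model.popitem(last=False)

    @rule(key=st.integers(0, 3))
    def get(self, key):
        if key in self.model:
            self.model.move_to_end(key)  # the model refreshes recency
        assert self.cache.get(key) == self.model.get(key)

    @invariant()
    def same_recency_order(self):
        assert list(self.cache.data) == list(self.model)


def machine_detects_bug() -> bool:
    """Return True if the state machine run fails with an AssertionError."""
    try:
        run_state_machine_as_test(
            ForgetfulLRUMachine,
            settings=settings(
                max_examples=200, stateful_step_count=20, deadline=None,
                database=None, derandomize=True, report_multiple_bugs=False,
            ),
        )
    except AssertionError:
        return True
    return False
```

The recency-order invariant is what exposes the defect here: two puts followed by a get of the older key already leave the two orderings different, before any eviction has gone wrong.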

6. Run and tune

Use settings to control sequence length and the number of distinct programs:

Python
from hypothesis import settings, HealthCheck
from hypothesis.stateful import run_state_machine_as_test

def test_lru_thoroughly():
    # run_state_machine_as_test runs the machine from inside a normal test,
    # which is a convenient place to pass settings or use fixtures.
    run_state_machine_as_test(
        LRUStateMachine,
        settings=settings(
            max_examples=200,          # distinct operation sequences
            stateful_step_count=60,    # max rules per sequence (default 50)
            suppress_health_check=[HealthCheck.too_slow],
        ),
    )

stateful_step_count caps the length of each generated program; max_examples controls how many programs are tried. Longer sequences reach deeper states, at a runtime cost that grows roughly linearly with the step count.
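
Settings can also be attached to the generated TestCase, which is useful when pytest collects the machine through its .TestCase attribute rather than through a wrapper function. The sketch below uses a hypothetical CounterMachine only so that it runs on its own; the numeric values are arbitrary.

```python
from hypothesis import settings, strategies as st
from hypothesis.stateful import RuleBasedStateMachine, invariant, rule


class CounterMachine(RuleBasedStateMachine):
    """Hypothetical stand-in machine for demonstrating settings."""

    def __init__(self) -> None:
        super().__init__()
        self.total = 0

    @rule(n=st.integers(0, 10))
    def add(self, n: int) -> None:
        self.total += n

    @invariant()
    def non_negative(self) -> None:
        assert self.total >= 0


# Attach settings to the TestCase that pytest will collect.
CounterMachine.TestCase.settings = settings(
    max_examples=50,         # distinct operation sequences
    stateful_step_count=30,  # max rules per sequence
    deadline=None,
)
TestCounter = CounterMachine.TestCase
```

Assign the settings before binding the module-level name so that a reader sees the configuration next to the collected test.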

Verification

Confirm the machine actually exercises rules rather than skipping them. Run with statistics and verbosity:

Bash
pytest test_lru_state.py --hypothesis-show-statistics -q

The statistics block reports how many examples ran and how often each rule fired. If a precondition is too strict you will see a rule reported as never selected. To watch the generated programs, set verbosity:

Python
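from hypothesis import settings, Verbosity
# A bare settings(...) call has no effect; attach it to the machine's TestCase.
LRUStateMachine.TestCase.settings = settings(verbosity=Verbosity.debug)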
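
If the statistics output in your Hypothesis version does not give the per-rule detail you want, one option is to record it yourself with hypothesis.event, whose labels are tallied in the statistics report. This is a minimal sketch; EventedMachine and the label strings are hypothetical.

```python
from hypothesis import event, settings, strategies as st
from hypothesis.stateful import (
    RuleBasedStateMachine, precondition, rule, run_state_machine_as_test,
)


class EventedMachine(RuleBasedStateMachine):
    def __init__(self) -> None:
        super().__init__()
        self.items = []

    @rule(x=st.integers())
    def push(self, x: int) -> None:
        event("rule: push")  # counted in --hypothesis-show-statistics
        self.items.append(x)

    @precondition(lambda self: len(self.items) > 0)
    @rule()
    def pop(self) -> None:
        event("rule: pop")
        self.items.pop()


def test_evented() -> None:
    run_state_machine_as_test(
        EventedMachine,
        settings=settings(max_examples=20, stateful_step_count=15, deadline=None),
    )
```

A rule whose label never appears in the report is a rule that never ran, which usually points at an over-strict precondition or an empty Bundle.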

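When a test fails, Hypothesis prints the minimal failing sequence as runnable code: state = LRUStateMachine(), then state.setup(), then one line per rule call such as keys_0 = state.put(key=0, value='') and state.get_existing(key=keys_0), ending with state.teardown(). The variable names used for Bundle values vary between Hypothesis versions. You can paste the sequence directly into a regression test. That shrunk reproduction is the payoff: a five-line program instead of a sixty-step random walk.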
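
A pasted reproduction becomes an ordinary test that calls the rule methods directly on a machine instance. The sketch below assumes a hypothetical, simplified MiniLRUMachine so that it is self-contained; with the real machine you would paste the printed lines unchanged.

```python
from collections import OrderedDict

from hypothesis import strategies as st
from hypothesis.stateful import Bundle, RuleBasedStateMachine, initialize, rule


class MiniLRUMachine(RuleBasedStateMachine):
    """Hypothetical compact machine so the replay below runs on its own."""

    keys = Bundle("keys")

    @initialize()
    def setup(self) -> None:
        self.cache = OrderedDict()

    @rule(target=keys, key=st.integers(0, 9), value=st.text(max_size=2))
    def put(self, key, value):
        self.cache[key] = value
        return key

    @rule(key=keys)
    def get_existing(self, key):
        assert key in self.cache


def test_regression_put_then_get() -> None:
    # Replay a shrunk sequence step by step, without any generation.
    state = MiniLRUMachine()
    state.setup()
    key = state.put(key=0, value="")
    state.get_existing(key=key)
    state.teardown()
```

Keeping such a test next to the state machine pins the specific ordering that once failed, independent of what the generator happens to explore in later runs.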

Troubleshooting

Symptom | Root cause | Fix
Rule ... was never selected in statistics | A @precondition is never satisfied, or no Bundle ever has values for the rule's key=bundle argument | Loosen the precondition, or ensure a producing rule (target=bundle) runs first
FailedHealthCheck: too_slow | Each rule does real I/O and 50 steps take too long | Add suppress_health_check=[HealthCheck.too_slow] and/or lower stateful_step_count
Invariant fails only at high step counts | A leak or off-by-one that needs accumulation to surface | Keep it failing; this is the intended catch. Read the shrunk sequence
Bundle value reused unexpectedly | Drew from the bundle directly (value stays available) where you needed consumes(bundle) (value is removed) | Use consumes() when a value must not be drawn again, e.g. after deletion
Model and system disagree immediately | @initialize did not reset both, or state held outside the instance (for example class-level attributes) leaked across examples | Move all per-sequence setup into @initialize

A precondition detail worth its own note: the Hypothesis documentation stacks @precondition above @rule, and this guide follows that convention. The lambda receives only self, so it can inspect machine state such as the model (lambda self: len(self.model) > 0) but not the arguments that will be drawn for the rule.
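
When a guard depends on the arguments as well as on state, one approach is to draw the argument inside the rule with st.data(), so its range can be computed from the current state. This is a sketch under that assumption; BudgetMachine and its balance are hypothetical. Calling assume() inside a rule is another option, but it rejects the sequence being generated rather than only skipping the rule.

```python
from hypothesis import settings, strategies as st
from hypothesis.stateful import (
    RuleBasedStateMachine, initialize, invariant, precondition, rule,
    run_state_machine_as_test,
)


class BudgetMachine(RuleBasedStateMachine):
    @initialize()
    def setup(self) -> None:
        self.balance = 100

    @rule(amount=st.integers(min_value=1, max_value=50))
    def deposit(self, amount: int) -> None:
        self.balance += amount

    @precondition(lambda self: self.balance > 0)  # state-only guard
    @rule(data=st.data())
    def withdraw(self, data) -> None:
        # Argument-dependent guard: the upper bound comes from current state.
        amount = data.draw(
            st.integers(min_value=1, max_value=self.balance), label="amount"
        )
        self.balance -= amount

    @invariant()
    def never_overdrawn(self) -> None:
        assert self.balance >= 0


def test_budget() -> None:
    run_state_machine_as_test(
        BudgetMachine,
        settings=settings(max_examples=25, stateful_step_count=20, deadline=None),
    )
```

The precondition keeps withdraw disabled while the balance is zero, which also guarantees that the integer range passed to data.draw is never empty.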

Frequently Asked Questions

What is the difference between @rule and @initialize in a Hypothesis state machine? An @initialize method runs exactly once per generated sequence, before any rule, and is used to seed the starting state. An @rule method is a candidate step that Hypothesis can call zero or more times in any order, subject to its preconditions, with arguments drawn from bound strategies.

How do Bundles pass created objects between rules? A Bundle is a named collection of values produced by earlier rules. A rule declares target=my_bundle to add its return value to the collection, and another rule binds my_bundle to one of its parameters to draw a previously created value, so Hypothesis only ever references objects that actually exist.

When does an @invariant method run during stateful testing? By default an @invariant runs after every rule execution, once the machine has been initialized. It asserts properties that must hold in every reachable state, independent of which rule sequence produced that state.

How do I control how many steps a stateful test executes? Pass settings(stateful_step_count=N) to run_state_machine_as_test, or assign it to MachineClass.TestCase.settings, alongside max_examples for the number of distinct sequences. stateful_step_count caps the length of each generated rule program; the default is 50.

Can I run a RuleBasedStateMachine without pytest collecting it as a class? Yes. Call run_state_machine_as_test(MachineClass) inside a normal test function. This is a supported way to pass a settings object explicitly or to run the machine from a parametrized or fixture-driven context.
