Stateful and Model-Based Testing

Example-based tests verify one operation at a time, but real defects in stateful systems — caches, queues, ORMs, connection pools, finite state machines — emerge only from sequences of operations: insert then evict then read, acquire then acquire then release. Enumerating those sequences by hand is hopeless; the interesting bug is usually the eleven-step interleaving nobody wrote down. Hypothesis solves this by generating operation sequences for you, executing them against your real system and a simplified model, and shrinking any failing sequence to the shortest reproduction. The failure mode this guide targets is the intermittent "works in isolation, breaks under a specific order" bug that example tests structurally cannot find.

This guide builds on the generative model introduced in Hypothesis Framework Fundamentals and the strategy composition covered in Advanced Property-Based Testing. Here the unit of generation is no longer a single value but an entire program of method calls.

Prerequisites

Python 3.9+
hypothesis >= 6.0 (the RuleBasedStateMachine, Bundle, and run_state_machine_as_test API used below has been stable since 6.x)
pytest >= 7.0 for collection of the machine classes
Familiarity with hypothesis.strategies (aliased st) and the @given lifecycle

pip install "hypothesis>=6.0" "pytest>=7.0"

Core concept

A RuleBasedStateMachine is a description of a system as a set of transitions. Hypothesis treats the class as a generator of programs: it picks a legal rule, supplies generated arguments, executes it, checks invariants, and repeats up to stateful_step_count times. The "model-based" half of the name comes from running a deliberately simple reference implementation (often a plain dict or list) alongside the real system under test, and asserting the two agree. The model encodes what should happen; the rules drive the real code; the invariants catch divergence.

Five primitives compose the whole approach:

@rule — a candidate operation. Hypothesis may call it any number of times, in any order, with arguments drawn from the strategies you bind to its parameters.
@initialize — runs at most once, before any rule, to establish deterministic starting state.
Bundle — a named queue of values produced by rules, so later rules can operate on objects that earlier rules actually created (you never fabricate an ID that does not exist).
@invariant — a property checked after every rule; it asserts truths that hold in every reachable state.
@precondition — a guard that disables a rule unless the machine is in a valid state for it.

Two Bundle operations sharpen the last point. Naming a Bundle as an argument type peeks: the drawn value stays in the queue and can be selected again by later rules — correct for read operations, where an object survives being read. Wrapping the same Bundle as consumes(bundle) removes the value as it is drawn, which is exactly what a destructive operation needs so no later rule references a handle you have already destroyed. A rule can also push more than one handle at once by returning multiple(a, b) instead of a single value.

The "model" in model-based testing is a second, deliberately naive implementation running in lock-step with the real one — a form of differential testing where the two implementations disagree only on a real defect. The model is written to be obviously correct rather than fast: an OrderedDict standing in for a hand-tuned eviction ring, a Python set standing in for a bloom filter, a list standing in for a B-tree. Because the model is trivial, any divergence the invariant reports is a bug in the system under test, not in the oracle. This is the same generative discipline behind property-based testing with Hypothesis, lifted from single values to whole sequences of method calls.

Initialize seeds the model once; create rules push handles into a Bundle and mutate rules draw from it; the invariant compares the real system against the model after every step.

Step-by-step implementation

We will test a bounded least-recently-used cache. The reference model is an OrderedDict; the system under test is the same here for illustration, but in practice it would be the real C-accelerated or Redis-backed implementation.

1. Subclass RuleBasedStateMachine and declare a Bundle

# test_lru_state.py
from collections import OrderedDict
from hypothesis import strategies as st
from hypothesis.stateful import (
    RuleBasedStateMachine, Bundle, rule, initialize, invariant, precondition,
    consumes,
)

CAPACITY = 4

class BoundedLRU:
    """System under test: a fixed-capacity LRU cache."""
    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self._data: "OrderedDict[int, str]" = OrderedDict()

    def put(self, key: int, value: str) -> None:
        if key in self._data:
            self._data.move_to_end(key)
        self._data[key] = value
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)  # evict least-recently-used

    def get(self, key: int) -> str | None:
        if key not in self._data:
            return None
        self._data.move_to_end(key)
        return self._data[key]

    def invalidate(self, key: int) -> None:
        self._data.pop(key, None)  # explicit removal; no-op if already evicted

class LRUStateMachine(RuleBasedStateMachine):
    keys = Bundle("keys")  # queue of keys we have actually inserted

The Bundle("keys") is the spine of the test: it guarantees that rules which read or evict only ever reference keys some earlier rule inserted.

2. Seed state with @initialize

    @initialize()
    def setup(self) -> None:
        # Runs once, before any rule. Construct system + model together.
        self.cache = BoundedLRU(CAPACITY)
        self.model: "OrderedDict[int, str]" = OrderedDict()

@initialize is preferred over __init__ for setup that should reset per sequence: Hypothesis instantiates the class once per example, and @initialize is sequenced into the generated program ahead of all rules.

3. Define operations as @rule methods, pushing into the Bundle

    @rule(target=keys, key=st.integers(0, 1000), value=st.text(max_size=8))
    def put(self, key: int, value: str):
        self.cache.put(key, value)
        # Mirror the operation in the model, including LRU eviction.
        if key in self.model:
            self.model.move_to_end(key)
        self.model[key] = value
        if len(self.model) > CAPACITY:
            self.model.popitem(last=False)
        return key  # pushed into the `keys` Bundle for later rules

target=keys means the rule's return value is appended to the Bundle. A later rule names keys as an argument type to draw one of those values back out.

4. Guard rules with @precondition

    @precondition(lambda self: len(self.model) > 0)
    @rule(key=keys)
    def get_existing(self, key: int):
        # `key` is drawn from the Bundle, so it was inserted earlier —
        # but it may since have been evicted, which is the interesting case.
        got = self.cache.get(key)
        expected = self.model.get(key)
        if expected is not None:
            self.model.move_to_end(key)
        assert got == expected

@precondition disables get_existing until at least one key exists, so Hypothesis never wastes a step on an operation that cannot be meaningful. Stack the decorator above @rule.

5. Model destructive operations with consumes()

Reads leave a key in the Bundle; an explicit invalidation should not. Draw the key with consumes(keys) so Hypothesis removes it from the queue as it is spent:

    @rule(key=consumes(keys))
    def invalidate(self, key: int):
        # `key` came off the Bundle and will not be drawn again by this path.
        # It may already have been evicted, so invalidate must tolerate absence.
        self.cache.invalidate(key)
        self.model.pop(key, None)

The subtlety this catches is real: a key can sit in the Bundle while the cache has silently evicted it under capacity pressure, so invalidate is routinely called on a key the system under test no longer holds. Because both the cache's invalidate and the model's pop(key, None) are no-ops on a missing key, the model_agreement invariant keeps holding — and if the real implementation instead raised KeyError on a stale handle, this rule is what would surface it. Use consumes() whenever a value must not be reused after removal; use the bare Bundle when the object survives the operation.

6. Assert state with @invariant

    @invariant()
    def never_exceeds_capacity(self) -> None:
        # Must hold in every reachable state, after every rule.
        assert len(self.cache._data) <= CAPACITY

    @invariant()
    def model_agreement(self) -> None:
        assert list(self.cache._data.keys()) == list(self.model.keys())

# pytest collects the TestCase automatically from the class:
TestLRU = LRUStateMachine.TestCase

The .TestCase attribute is the pytest entry point. Assigning it to a module-level name lets pytest discover and run the machine like an ordinary test class.

The two invariants make the differential structure explicit: every generated rule is applied to the real cache and the model in lock-step, and after each step the invariant asserts the two agree. The model is the oracle — because it is obviously correct, any divergence is a defect in the system under test.

Every generated rule runs against both the real cache and the naive model; the invariant compares their state after each step, so any divergence pinpoints a defect in the system under test — which Hypothesis then shrinks to the shortest failing sequence.

7. Run and tune

Use settings to control sequence length and the number of distinct programs:

from hypothesis import settings, HealthCheck
from hypothesis.stateful import run_state_machine_as_test

def test_lru_thoroughly():
    # run_state_machine_as_test runs the machine from inside a normal test,
    # which is the supported place to attach settings or fixtures.
    run_state_machine_as_test(
        LRUStateMachine,
        settings=settings(
            max_examples=200,          # distinct operation sequences
            stateful_step_count=60,    # max rules per sequence (default 50)
            suppress_health_check=[HealthCheck.too_slow],
        ),
    )

stateful_step_count caps the length of each generated program; max_examples controls how many programs are tried. Longer sequences find deeper bugs at linear cost.

Verification

Confirm the machine actually exercises rules rather than skipping them. Run with statistics and verbosity:

pytest test_lru_state.py --hypothesis-show-statistics -q

The statistics block reports how many examples ran and how often each rule fired. If a precondition is too strict you will see a rule reported as never selected. To watch the generated programs, set verbosity:

from hypothesis import settings, Verbosity
settings(verbosity=Verbosity.debug)

When a test fails, Hypothesis prints the minimal failing sequence as runnable code — state.put(key=0, value=''), state.get_existing(key=0) — which you can paste directly into a regression test. That shrunk reproduction is the payoff: a five-line program instead of a sixty-step random walk.

Troubleshooting

Symptom	Root cause	Fix
`Rule ... was never selected` in statistics	A `@precondition` is never satisfied, or no Bundle ever has values for the rule's `key=bundle` argument	Loosen the precondition, or ensure a producing rule (`target=bundle`) runs first
`FailedHealthCheck: too_slow`	Each rule does real I/O and 50 steps blow the deadline	Add `suppress_health_check=[HealthCheck.too_slow]` and/or lower `stateful_step_count`
Invariant fails only at high step counts	A leak or off-by-one that needs accumulation to surface	Keep it failing; this is the intended catch. Read the shrunk sequence
`consumes(bundle)` value reused unexpectedly	Used `bundle` (peek) where you needed `consumes(bundle)` (remove)	Use `consumes()` when a value must not be drawn again, e.g. after deletion
Model and system disagree immediately	`@initialize` did not reset both, or `__init__` state leaked across examples	Move all per-sequence setup into `@initialize`

A frequent precondition mistake worth its own note: stacking order matters. @precondition must sit above @rule in the decorator stack, and the lambda receives self, so it can inspect the model (lambda self: self.model) but not yet the drawn arguments.

Designing the model, not just the machine

A state machine test is only as good as the model it compares against, and the most common failure is a model that mirrors the implementation instead of describing it. When both contain the same mistake, every run passes.

A good model is a simplified, obviously correct version of the system, usually built from plain Python data structures. If the system under test is a cache with eviction, the model is a dictionary plus an insertion-order list. If it is a bank ledger, the model is a running integer total. The model should be small enough to review in one sitting and slow enough that nobody would ship it.

from hypothesis.stateful import RuleBasedStateMachine, rule, invariant, precondition
from hypothesis import strategies as st
from myapp.cache import LruCache

class CacheModel(RuleBasedStateMachine):
    def __init__(self):
        super().__init__()
        self.real = LruCache(capacity=3)
        self.model: dict[str, int] = {}        # the obviously correct version
        self.order: list[str] = []

    @rule(key=st.text(min_size=1, max_size=3), value=st.integers())
    def put(self, key, value):
        self.real.put(key, value)
        if key in self.model:
            self.order.remove(key)
        elif len(self.model) == 3:
            evicted = self.order.pop(0)        # model's own eviction rule
            del self.model[evicted]
        self.model[key] = value
        self.order.append(key)

    @rule(key=st.text(min_size=1, max_size=3))
    def get(self, key):
        assert self.real.get(key) == self.model.get(key)
        if key in self.order:                  # a hit refreshes recency
            self.order.remove(key)
            self.order.append(key)

    @invariant()
    def size_never_exceeds_capacity(self):
        assert len(self.real) == len(self.model) <= 3

TestCache = CacheModel.TestCase

Three design rules make this productive. Keep rules small and single-purpose so the shrunk counterexample names one operation rather than a compound. Use @precondition rather than early returns, so Hypothesis knows a rule is unavailable and does not waste draws on it. And put the cheap checks in @invariant — they run after every rule, which is what turns a state machine test into a continuous audit rather than an end-of-run assertion.

The payoff is the failure output. When a rule sequence breaks an invariant, Hypothesis prints a minimal reproducing program — a literal sequence of method calls — that you can paste into a normal test. That reproduction is the artefact worth keeping: convert it into an example-based regression test immediately, because the state machine will not necessarily rediscover it after the fix.

Invariants run after every rule, so the failure is reported at the step that broke the property rather than at the end of the run.

A last note on debugging a failing machine. Hypothesis prints the reproducing sequence as executable code, but the model's state at the point of failure is often more informative than the sequence itself. Add a __repr__ to the model, or print it from the failing invariant, so the report shows what the model believed alongside the sequence that got it there. Doing so turns a fifteen-call reproduction into an immediate diagnosis in most cases, because the divergence between the two states names the operation that lied.

Where the system under test is concurrent, keep the machine itself sequential. Hypothesis drives rules one at a time, so a machine that spawns threads inside a rule is testing the thread scheduler as much as the system, and the resulting failures rarely shrink to anything readable. Model concurrency explicitly instead — a rule that represents "the background worker runs one step" gives you interleaving you can reproduce, which is the whole point of the technique.

Frequently Asked Questions

What is the difference between @rule and @initialize in a Hypothesis state machine? An @initialize method runs at most once per generated sequence, before any rule, and is used to seed deterministic starting state. An @rule method is a candidate step that Hypothesis can call zero or more times in any order, subject to its preconditions, with arguments drawn from bound strategies.

How do Bundles pass created objects between rules? A Bundle is a named queue of values produced by earlier rules. A rule declares target=my_bundle to push its return value onto the queue, and another rule names my_bundle as an argument type to draw a previously created value, so Hypothesis only ever references objects that actually exist.

When does an @invariant method run during stateful testing? By default an @invariant runs after every rule execution, once the machine has been initialized. It asserts properties that must hold in every reachable state, independent of which rule sequence produced that state.

How do I control how many steps a stateful test executes? Apply settings(stateful_step_count=N) to the machine (alongside max_examples for the number of distinct sequences). stateful_step_count caps the length of each generated rule program; the default is 50.

Can I run a RuleBasedStateMachine without pytest collecting it as a class? Yes. Call run_state_machine_as_test(MachineClass) inside a normal test function. This is the supported way to attach a settings object or run the machine from a parametrized or fixture-driven context.

Ground the generation primitives first with Hypothesis Framework Fundamentals, then return here to sequence them into programs.
Build the data your rules draw from using custom strategies with hypothesis.strategies and the composition techniques in Advanced Property-Based Testing.
Apply this model concretely to web services in Modeling REST APIs as State Machines.
When long sequences blow the clock, trim them with reducing Hypothesis test execution time.
Debugging RuleBasedStateMachine failures — read the printed call sequence and turn it into an ordinary test.

← Back to Property-Based & Fuzz Testing Strategies