Example-based tests verify one operation at a time, but real defects in stateful systems — caches, queues, ORMs, connection pools, finite state machines — emerge only from sequences of operations: insert then evict then read, acquire then acquire then release. Enumerating those sequences by hand is hopeless; the interesting bug is usually the eleven-step interleaving nobody wrote down. Hypothesis solves this by generating operation sequences for you, executing them against your real system and a simplified model, and shrinking any failing sequence to the shortest reproduction. The failure mode this guide targets is the intermittent "works in isolation, breaks under a specific order" bug that example tests structurally cannot find.
This guide builds on the generative model introduced in Hypothesis Framework Fundamentals and the strategy composition covered in Advanced Property-Based Testing. Here the unit of generation is no longer a single value but an entire program of method calls.
Prerequisites
- Python 3.9+
- hypothesis >= 6.0 (the RuleBasedStateMachine, Bundle, and run_state_machine_as_test API used below has been stable since 6.x)
- pytest >= 7.0 for collection of the machine classes
- Familiarity with hypothesis.strategies (aliased st) and the @given lifecycle
pip install "hypothesis>=6.0" "pytest>=7.0"
Core concept
A RuleBasedStateMachine is a description of a system as a set of transitions. Hypothesis treats the class as a generator of programs: it picks a legal rule, supplies generated arguments, executes it, checks invariants, and repeats up to stateful_step_count times. The "model-based" half of the name comes from running a deliberately simple reference implementation (often a plain dict or list) alongside the real system under test, and asserting the two agree. The model encodes what should happen; the rules drive the real code; the invariants catch divergence.
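To make that loop concrete, here is a minimal hand-rolled sketch of the same idea using only the standard library. It is not how Hypothesis is implemented (there is no shrinking and no strategy machinery), and `TinyStore`, `BuggyStore`, and `random_walk` are hypothetical names invented for this illustration.

```python
import random


class TinyStore:
    """Hypothetical system under test: a thin wrapper around a dict."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def delete(self, key):
        self._data.pop(key, None)

    def snapshot(self):
        return dict(self._data)


class BuggyStore(TinyStore):
    """Same interface with a planted bug: key 0 is never deleted."""

    def delete(self, key):
        if key != 0:
            super().delete(key)


def random_walk(store_factory, steps=50, seed=0):
    """Run one random program against system and model.

    Returns the executed steps up to the first divergence, or None if the
    system and the model agreed after every step.
    """
    rng = random.Random(seed)
    store, model = store_factory(), {}
    program = []
    for _ in range(steps):
        key = rng.randint(0, 3)
        if rng.random() < 0.5:
            value = rng.randint(0, 9)
            program.append(("put", key, value))
            store.put(key, value)   # drive the real code
            model[key] = value      # mirror the step in the model
        else:
            program.append(("delete", key))
            store.delete(key)
            model.pop(key, None)
        if store.snapshot() != model:  # the "invariant": system matches model
            return program
    return None
```

Calling `random_walk(BuggyStore, seed=s)` over a handful of seeds returns a program whose last step is `("delete", 0)`, usually preceded by many irrelevant steps. Hypothesis adds what this sketch lacks: guided generation and shrinking of that program to a minimal reproduction.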
Five primitives compose the whole approach:
- @rule: a candidate operation. Hypothesis may call it any number of times, in any order, with arguments drawn from the strategies you bind to its parameters.
- @initialize: runs at most once, before any rule, to establish deterministic starting state.
- Bundle: a named collection of values produced by rules, so later rules can operate on objects that earlier rules actually created (you never fabricate an ID that does not exist).
- @invariant: a property checked after every rule; it asserts truths that hold in every reachable state.
- @precondition: a guard that disables a rule unless the machine is in a valid state for it.
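Before the full walkthrough, the five primitives can be seen together in one compact machine. This is a sketch under stated assumptions: `SetMachine` and its method names are hypothetical, the "real" system is just a built-in `set`, and the model is a deliberately naive list.

```python
from hypothesis import settings, strategies as st
from hypothesis.stateful import (
    Bundle, RuleBasedStateMachine, initialize, invariant, precondition, rule,
    run_state_machine_as_test,
)


class SetMachine(RuleBasedStateMachine):
    added = Bundle("added")  # values that some earlier `add` call produced

    @initialize()
    def start_empty(self):
        self.real = set()  # stand-in for the system under test
        self.model = []    # naive reference model

    @rule(target=added, x=st.integers(0, 20))
    def add(self, x):
        self.real.add(x)
        if x not in self.model:
            self.model.append(x)
        return x  # stored in the `added` bundle

    @precondition(lambda self: len(self.model) > 0)
    @rule(x=added)
    def discard(self, x):
        # x came from the bundle, but may already have been discarded.
        self.real.discard(x)
        if x in self.model:
            self.model.remove(x)

    @invariant()
    def same_members(self):
        assert self.real == set(self.model)


def test_set_machine():
    run_state_machine_as_test(
        SetMachine,
        settings=settings(max_examples=25, stateful_step_count=15, deadline=None),
    )
```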
Step-by-step implementation
We will test a bounded least-recently-used (LRU) cache. The reference model is an OrderedDict. For illustration the system under test is also built on an OrderedDict; in practice it would be the real C-accelerated or Redis-backed implementation.
1. Subclass RuleBasedStateMachine and declare a Bundle
# test_lru_state.py
from collections import OrderedDict

from hypothesis import strategies as st
from hypothesis.stateful import (
    RuleBasedStateMachine, Bundle, rule, initialize, invariant, precondition,
)

CAPACITY = 4


class BoundedLRU:
    """System under test: a fixed-capacity LRU cache."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self._data: "OrderedDict[int, str]" = OrderedDict()

    def put(self, key: int, value: str) -> None:
        if key in self._data:
            self._data.move_to_end(key)
        self._data[key] = value
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)  # evict least-recently-used

    # The return annotation is quoted so the module also imports on Python 3.9.
    def get(self, key: int) -> "str | None":
        if key not in self._data:
            return None
        self._data.move_to_end(key)
        return self._data[key]


class LRUStateMachine(RuleBasedStateMachine):
    keys = Bundle("keys")  # keys that some earlier rule has inserted
The Bundle("keys") is the spine of the test: it guarantees that rules which read or evict only ever reference keys some earlier rule inserted.
2. Seed state with @initialize
    @initialize()
    def setup(self) -> None:
        # Runs once, before any rule. Construct system + model together.
        self.cache = BoundedLRU(CAPACITY)
        self.model: "OrderedDict[int, str]" = OrderedDict()
@initialize is preferred over __init__ for setup that should reset per sequence: Hypothesis instantiates the class once per example, and @initialize is sequenced into the generated program ahead of all rules.
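Like @rule, @initialize accepts strategies as keyword arguments, which lets each generated sequence start from a different configuration. The sketch below varies the capacity per sequence; `BoundedBufferMachine` is a hypothetical example, and a `deque(maxlen=...)` stands in for a real system under test.

```python
from collections import deque

from hypothesis import settings, strategies as st
from hypothesis.stateful import (
    RuleBasedStateMachine, initialize, invariant, rule,
    run_state_machine_as_test,
)


class BoundedBufferMachine(RuleBasedStateMachine):
    @initialize(capacity=st.integers(min_value=1, max_value=8))
    def setup(self, capacity):
        # Drawn once per sequence, before any rule runs.
        self.capacity = capacity
        self.buffer = deque(maxlen=capacity)  # stand-in system under test
        self.model = []                       # reference model

    @rule(item=st.integers())
    def push(self, item):
        self.buffer.append(item)
        self.model.append(item)
        self.model = self.model[-self.capacity:]  # keep only the newest items

    @invariant()
    def agrees_with_model(self):
        assert list(self.buffer) == self.model
        assert len(self.buffer) <= self.capacity


def test_bounded_buffer():
    run_state_machine_as_test(
        BoundedBufferMachine,
        settings=settings(max_examples=25, stateful_step_count=20, deadline=None),
    )
```

Generating the capacity rather than hard-coding it means boundary sizes such as 1 get exercised without writing a separate test for them.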
3. Define operations as @rule methods, pushing into the Bundle
    @rule(target=keys, key=st.integers(0, 1000), value=st.text(max_size=8))
    def put(self, key: int, value: str):
        self.cache.put(key, value)
        # Mirror the operation in the model, including LRU eviction.
        if key in self.model:
            self.model.move_to_end(key)
        self.model[key] = value
        if len(self.model) > CAPACITY:
            self.model.popitem(last=False)
        return key  # pushed into the `keys` Bundle for later rules
target=keys means the rule's return value is added to the Bundle. A later rule passes keys as the strategy for one of its arguments to draw one of those values back out.
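When one operation creates several objects at once, a rule can return `multiple(...)` from `hypothesis.stateful` to add each of them to the target Bundle. The following is a small sketch of that pattern; `BulkInsertMachine` is a hypothetical name, and plain dicts stand in for both the system and the model.

```python
from hypothesis import settings, strategies as st
from hypothesis.stateful import (
    Bundle, RuleBasedStateMachine, initialize, multiple, rule,
    run_state_machine_as_test,
)


class BulkInsertMachine(RuleBasedStateMachine):
    keys = Bundle("keys")

    @initialize()
    def setup(self):
        self.store = {}  # stand-in system under test
        self.model = {}  # reference model

    @rule(target=keys, batch=st.lists(st.integers(0, 50), max_size=5))
    def put_many(self, batch):
        for key in batch:
            self.store[key] = "v"
            self.model[key] = "v"
        # Each element of the batch becomes its own Bundle entry;
        # an empty batch adds nothing.
        return multiple(*batch)

    @rule(key=keys)
    def read(self, key):
        # key is always one that put_many inserted, so indexing is safe.
        assert self.store[key] == self.model[key]


def test_bulk_insert():
    run_state_machine_as_test(
        BulkInsertMachine,
        settings=settings(max_examples=25, stateful_step_count=15, deadline=None),
    )
```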
4. Guard rules with @precondition
    @precondition(lambda self: len(self.model) > 0)
    @rule(key=keys)
    def get_existing(self, key: int):
        # `key` is drawn from the Bundle, so it was inserted earlier,
        # but it may since have been evicted, which is the interesting case.
        got = self.cache.get(key)
        expected = self.model.get(key)
        if expected is not None:
            self.model.move_to_end(key)
        assert got == expected
@precondition disables get_existing until at least one key exists, so Hypothesis never wastes a step on an operation that cannot be meaningful. Stack the decorator above @rule.
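A precondition is most useful when a rule takes no Bundle argument and would be outright invalid in some states. A minimal sketch, assuming a hypothetical `StackMachine` where a `deque` stands in for the system under test: without the guard, `pop` on an empty stack would raise IndexError and report a failure that is not a real bug.

```python
from collections import deque

from hypothesis import settings, strategies as st
from hypothesis.stateful import (
    RuleBasedStateMachine, initialize, precondition, rule,
    run_state_machine_as_test,
)


class StackMachine(RuleBasedStateMachine):
    @initialize()
    def setup(self):
        self.stack = deque()  # stand-in system under test
        self.model = []       # reference model

    @rule(x=st.integers())
    def push(self, x):
        self.stack.append(x)
        self.model.append(x)

    # The guard inspects model state only; it runs before arguments are drawn.
    @precondition(lambda self: len(self.model) > 0)
    @rule()
    def pop(self):
        assert self.stack.pop() == self.model.pop()


def test_stack_machine():
    run_state_machine_as_test(
        StackMachine,
        settings=settings(max_examples=25, stateful_step_count=20, deadline=None),
    )
```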
5. Assert state with @invariant
    @invariant()
    def never_exceeds_capacity(self) -> None:
        # Must hold in every reachable state, after every rule.
        assert len(self.cache._data) <= CAPACITY

    @invariant()
    def model_agreement(self) -> None:
        # Same keys, in the same recency order.
        assert list(self.cache._data.keys()) == list(self.model.keys())


# pytest collects the TestCase automatically from the class:
TestLRU = LRUStateMachine.TestCase
The .TestCase attribute is the pytest entry point. Assigning it to a module-level name lets pytest discover and run the machine like an ordinary test class.
6. Run and tune
Use settings to control sequence length and the number of distinct programs:
from hypothesis import settings, HealthCheck
from hypothesis.stateful import run_state_machine_as_test


def test_lru_thoroughly():
    # run_state_machine_as_test runs the machine from inside a normal test,
    # which is a convenient place to pass settings or use fixtures.
    run_state_machine_as_test(
        LRUStateMachine,
        settings=settings(
            max_examples=200,  # distinct operation sequences
            stateful_step_count=60,  # max rules per sequence (default 50)
            suppress_health_check=[HealthCheck.too_slow],
        ),
    )
stateful_step_count caps the length of each generated program; max_examples controls how many programs are tried. Longer sequences can reach deeper states, at a runtime cost that grows roughly linearly with the step count.
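If you prefer to keep the pytest-collected TestCase form, settings can instead be assigned to the machine's TestCase attribute. A minimal sketch of that alternative; `CounterMachine` is a hypothetical machine, and the numbers are arbitrary.

```python
from hypothesis import settings, strategies as st
from hypothesis.stateful import RuleBasedStateMachine, initialize, invariant, rule


class CounterMachine(RuleBasedStateMachine):
    @initialize()
    def setup(self):
        self.total = 0     # stand-in system: a running total
        self.history = []  # model: every increment seen so far

    @rule(n=st.integers(0, 100))
    def add(self, n):
        self.total += n
        self.history.append(n)

    @invariant()
    def total_matches_history(self):
        assert self.total == sum(self.history)


# Attach settings to the generated TestCase, then expose it for collection.
CounterMachine.TestCase.settings = settings(
    max_examples=50,
    stateful_step_count=20,
    deadline=None,
)
TestCounter = CounterMachine.TestCase
```

Both routes configure the same run; the function-based form is easier to combine with fixtures, while this one keeps the module to a single class.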
Verification
Confirm the machine actually exercises rules rather than skipping them. Run with statistics and verbosity:
pytest test_lru_state.py --hypothesis-show-statistics -q
The statistics block reports how many examples ran and how often each rule fired. If a precondition is too strict you will see a rule reported as never selected. To watch the generated programs, set verbosity:
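from hypothesis import settings, Verbosity

# A bare settings(...) call has no effect; attach it to the machine's TestCase.
LRUStateMachine.TestCase.settings = settings(verbosity=Verbosity.debug)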
When a test fails, Hypothesis prints the minimal failing sequence as runnable code, for example a state.put(key=0, value='') call followed by a state.get_existing(...) call on the same key, which you can paste directly into a regression test. That shrunk reproduction is the payoff: a five-line program instead of a sixty-step random walk.
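A pasted reproduction typically ends up as an ordinary test that drives the machine by hand. The sketch below shows the shape under stated assumptions: `KVMachine` is a hypothetical machine with dict-backed system and model, and the replayed steps are written by hand here rather than copied from real Hypothesis output.

```python
from hypothesis import strategies as st
from hypothesis.stateful import Bundle, RuleBasedStateMachine, initialize, rule


class KVMachine(RuleBasedStateMachine):
    keys = Bundle("keys")

    @initialize()
    def setup(self):
        self.store = {}  # stand-in system under test
        self.model = {}  # reference model

    @rule(target=keys, key=st.integers(0, 10), value=st.text(max_size=3))
    def put(self, key, value):
        self.store[key] = value
        self.model[key] = value
        return key

    @rule(key=keys)
    def get(self, key):
        assert self.store.get(key) == self.model.get(key)


def test_regression_put_then_get():
    # Replay one fixed sequence: rule methods are plain callables on the
    # instance, so no generation is involved.
    state = KVMachine()
    state.setup()
    inserted = state.put(key=0, value="")
    state.get(key=inserted)
    state.teardown()
```

Keeping the replay as a named test pins the exact order that once failed, independent of whatever Hypothesis happens to generate on later runs.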
Troubleshooting
| Symptom | Root cause | Fix |
|---|---|---|
| Rule ... was never selected in statistics | A @precondition is never satisfied, or no Bundle ever has values for the rule's key=bundle argument | Loosen the precondition, or ensure a producing rule (target=bundle) runs first |
| FailedHealthCheck: too_slow | Each rule does real I/O, so a 50-step sequence runs too slowly | Add suppress_health_check=[HealthCheck.too_slow] and/or lower stateful_step_count |
| Invariant fails only at high step counts | A leak or off-by-one that needs accumulation to surface | Keep it failing; this is the intended catch. Read the shrunk sequence |
| Bundle value reused unexpectedly | Used bundle (peek) where you needed consumes(bundle) (remove) | Use consumes() when a value must not be drawn again, e.g. after deletion |
| Model and system disagree immediately | @initialize did not reset both, or __init__ state leaked across examples | Move all per-sequence setup into @initialize |
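The Bundle-reuse row above is easiest to see in code. In this sketch, `HandleMachine` is a hypothetical machine that hands out integer handles: `use_handle` draws with the plain Bundle (the value stays available), while `close_handle` draws with `consumes(handles)` so a closed handle can never be drawn again.

```python
from hypothesis import settings
from hypothesis.stateful import (
    Bundle, RuleBasedStateMachine, consumes, initialize, rule,
    run_state_machine_as_test,
)


class HandleMachine(RuleBasedStateMachine):
    handles = Bundle("handles")

    @initialize()
    def setup(self):
        self.open_handles = set()
        self.next_id = 0

    @rule(target=handles)
    def open_handle(self):
        self.next_id += 1
        self.open_handles.add(self.next_id)
        return self.next_id

    @rule(h=handles)
    def use_handle(self, h):
        # Plain draw: h stays in the Bundle for later rules.
        assert h in self.open_handles

    @rule(h=consumes(handles))
    def close_handle(self, h):
        # consumes() removed h from the Bundle, so it is never closed twice
        # and never passed to use_handle afterwards.
        self.open_handles.remove(h)


def test_handle_machine():
    run_state_machine_as_test(
        HandleMachine,
        settings=settings(max_examples=25, stateful_step_count=20, deadline=None),
    )
```

Swapping `consumes(handles)` for plain `handles` in `close_handle` makes the machine fail with a KeyError on a double close, which is the symptom the table describes.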
A frequent precondition mistake worth its own note: the lambda receives only self, so it can inspect the model (lambda self: self.model) but not the arguments that will be drawn for the rule. Hypothesis's own examples place @precondition above @rule in the decorator stack; keeping that order makes the guard easy to spot.
Frequently Asked Questions
What is the difference between @rule and @initialize in a Hypothesis state machine?
An @initialize method runs at most once per generated sequence, before any rule, and is used to seed deterministic starting state. An @rule method is a candidate step that Hypothesis can call zero or more times in any order, subject to its preconditions, with arguments drawn from bound strategies.
How do Bundles pass created objects between rules?
A Bundle is a named collection of values produced by earlier rules. A rule declares target=my_bundle to add its return value to the collection, and another rule passes my_bundle as the strategy for one of its arguments to draw a previously created value, so Hypothesis only ever references objects that actually exist.
When does an @invariant method run during stateful testing?
By default an @invariant runs after every rule execution, once the machine has been initialized. It asserts properties that must hold in every reachable state, independent of which rule sequence produced that state.
How do I control how many steps a stateful test executes?
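Put stateful_step_count=N in a settings object (alongside max_examples for the number of distinct sequences) and attach it to the machine, either by assigning MachineClass.TestCase.settings or by passing settings= to run_state_machine_as_test. stateful_step_count caps the length of each generated rule program; the default is 50.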
Can I run a RuleBasedStateMachine without pytest collecting it as a class?
Yes. Call run_state_machine_as_test(MachineClass) inside a normal test function. This is a supported way to pass a settings object or to run the machine from a parametrized or fixture-driven context.
Related guides
- Ground the generation primitives first with Hypothesis Framework Fundamentals, then return here to sequence them into programs.
- Build the data your rules draw from using custom strategies with hypothesis.strategies and the composition techniques in Advanced Property-Based Testing.
- Apply this model concretely to web services in Modeling REST APIs as State Machines.
- When long sequences blow the clock, trim them with reducing Hypothesis test execution time.