Memory Profiling with tracemalloc

A long-running worker creeps from 200 MB to 2 GB over a day and gets OOM-killed; a pytest session that should be flat grows with every test until CI runners die. top tells you that memory is growing but not which line is responsible. tracemalloc, in the standard library since Python 3.4, records the Python call stack for every allocation, so you can attribute live bytes to exact source lines and call paths, diff two points in time, and assert ceilings in tests. This guide covers driving tracemalloc end to end: configuring frame depth, taking and grouping snapshots, filtering noise, and wiring memory assertions into pytest. When the growth turns out to be wall-clock time rather than retained bytes, the companion approach is CPU profiling with cProfile and py-spy; tracemalloc is the tool for memory specifically.

Prerequisites

Python 3.4+ for tracemalloc itself; 3.6+ for Snapshot.compare_to ordering by size_diff to behave as documented here; 3.9+ for tracemalloc.reset_peak(), which the pytest fixture below relies on.
A way to start tracing before the allocations you care about — either tracemalloc.start() early in the process, or the PYTHONTRACEMALLOC=N environment variable / -X tracemalloc=N flag to set frame depth at launch.
pytest 6+ if you intend to gate memory in CI with the fixture pattern below.
Awareness that tracing adds CPU and memory overhead that grows with nframe; measure that cost before leaving tracing on in production hot paths.

Core concept

tracemalloc hooks Python's memory allocators. Once tracemalloc.start(nframe) runs, every subsequent allocation is recorded with up to nframe stack frames. A snapshot (take_snapshot()) is an immutable copy of all currently tracked allocations at that instant. You then ask the snapshot to aggregate its traces into statistics, grouped either by 'lineno' (one entry per source line) or 'traceback' (one entry per distinct call path). filter_traces removes entries you do not care about — the standard library, importlib, tracemalloc's own frames — before you read the numbers.

The leak-hunting workflow is a pipeline: capture a baseline snapshot, exercise the code, capture a second snapshot, and compare_to the baseline to surface the lines whose retained bytes grew. The single number get_traced_memory() (current, peak) is the cheap gate for tests. The diagram traces that pipeline.

Two snapshots bracket the workload; compare_to diffs them and sorts by size_diff so the lines retaining the most new memory rise to the top.

Step-by-step implementation

1. Start tracing with the right frame depth

tracemalloc.start(nframe) begins recording. nframe is the number of stack frames stored per allocation. The default of 1 tells you the allocating line but not how it was reached; raise it when allocations funnel through a shared helper and you need the caller.

import tracemalloc

# Record up to 25 frames so we can group by full call path later.
tracemalloc.start(25)

To enable tracing from the very first allocation — before your own code runs — set the environment instead of calling start():

PYTHONTRACEMALLOC=25 python worker.py        # or: python -X tracemalloc=25 worker.py
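Whichever route starts tracing, it helps to confirm the depth that actually took effect. The following is a minimal sketch using tracemalloc.is_tracing() and tracemalloc.get_traceback_limit(); the depth of 25 simply mirrors the example above and is not a recommendation.

```python
import tracemalloc

# Start only if PYTHONTRACEMALLOC or -X tracemalloc has not already done so.
if not tracemalloc.is_tracing():
    tracemalloc.start(25)

depth = tracemalloc.get_traceback_limit()   # frames stored per allocation
print(f"tracing={tracemalloc.is_tracing()} depth={depth}")
```

Guarding start() this way keeps a depth chosen at launch from being silently ignored by a later call in application code.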

2. Take a snapshot

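take_snapshot() freezes the current set of tracked allocations into a Snapshot that later allocations do not change. Its size grows with the number of live traced blocks, and it can be written to disk with snapshot.dump(path) and reloaded with tracemalloc.Snapshot.load(path), so you can capture one, run work, capture another, and diff offline.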

import tracemalloc

tracemalloc.start(25)
data = [bytes(1024) for _ in range(10_000)]   # ~10 MB of work
snapshot = tracemalloc.take_snapshot()
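For offline analysis, a snapshot can be saved and reloaded later. This sketch uses Snapshot.dump() and Snapshot.load(); the file name worker.snapshot and the temporary directory are placeholders chosen for illustration.

```python
import os
import tempfile
import tracemalloc

tracemalloc.start(25)
data = [bytes(1024) for _ in range(1_000)]
snapshot = tracemalloc.take_snapshot()
tracemalloc.stop()

# Hypothetical location; any writable path works.
path = os.path.join(tempfile.mkdtemp(), "worker.snapshot")
snapshot.dump(path)

# Later, possibly in another process: reload and analyse without tracing.
restored = tracemalloc.Snapshot.load(path)
total = sum(stat.size for stat in restored.statistics("lineno"))
print(f"{len(restored.traces)} traces, {total / 1024:.0f} KiB tracked")
```

The dump format is a pickle, so only load snapshot files that come from a source you trust.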

3. Group statistics by lineno or traceback

snapshot.statistics(key_type) returns a list of Statistic objects sorted largest-first. Use 'lineno' to collapse everything allocated on the same line into one entry, or 'traceback' to keep distinct call paths separate.

for stat in snapshot.statistics("lineno")[:5]:
    # stat.size is bytes retained; stat.count is the number of blocks
    print(f"{stat.size / 1024:8.1f} KiB  {stat.count:>7} blocks  {stat.traceback[0]}")

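Illustrative output (a real run reports slightly more, because each bytes object carries a small header and the list itself is counted on the same line):

 10240.0 KiB    10000 blocks  worker.py:4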

Switching to 'traceback' and printing stat.traceback.format() shows the full path that reached the allocating line — essential when a generic list.append is the named site but the cause is one specific caller.

top = snapshot.statistics("traceback")[0]
print("\n".join(top.traceback.format()))   # full call stack for the biggest allocation

The distinction matters most when one allocating line is reached from several callers. 'lineno' sums every path into a single number, which hides who is responsible; 'traceback' splits that same line into one entry per call path.

The same allocating line, buffer.py:12, reached from three callers: grouping by lineno reports one merged total, while grouping by traceback attributes the bytes back to each call path.
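The difference is easy to reproduce. The sketch below uses a hypothetical helper, make_buffer, called from two hypothetical callers, and groups the same snapshot both ways. It assumes Python 3.7 or later, where a Traceback lists frames from oldest to most recent.

```python
import tracemalloc

def make_buffer(size):
    return bytearray(size)          # the one shared allocating line

def load_config():
    return make_buffer(100_000)

def load_image():
    return make_buffer(300_000)

tracemalloc.start(10)               # more than one frame, so call paths are kept
held = [load_config(), load_image()]
snapshot = tracemalloc.take_snapshot()
tracemalloc.stop()

source = make_buffer.__code__.co_filename
line = make_buffer.__code__.co_firstlineno + 1   # the `return bytearray(size)` line

def allocates_in_helper(frame):
    return frame.filename == source and frame.lineno == line

# 'lineno' stats carry a single frame; 'traceback' stats list frames oldest
# first, so the allocating frame is the last one.
merged = [s for s in snapshot.statistics("lineno") if allocates_in_helper(s.traceback[0])]
split = [s for s in snapshot.statistics("traceback") if allocates_in_helper(s.traceback[-1])]

print(len(merged), "entry by lineno;", len(split), "entries by traceback")
for stat in split:
    print(f"{stat.size / 1024:.0f} KiB reached via line {stat.traceback[-2].lineno}")
```

The merged entry and the split entries add up to the same byte total; only the attribution changes.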

4. Filter out noise

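Raw snapshots are dominated by the import machinery and tracemalloc's own bookkeeping. snapshot.filter_traces([...]) returns a new snapshot containing only the traces that pass the filters. Exclusive filters (inclusive=False) drop matching traces; inclusive filters (inclusive=True) keep only matches, such as your own package. By default a filter tests the most recent frame of each trace; pass all_frames=True to test every stored frame.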

import tracemalloc

filtered = snapshot.filter_traces((
    tracemalloc.Filter(False, "<frozen importlib._bootstrap>"),
    tracemalloc.Filter(False, tracemalloc.__file__),   # drop tracemalloc's own frames
    tracemalloc.Filter(False, "<unknown>"),
))
for stat in filtered.statistics("lineno")[:5]:
    print(stat)
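The opposite approach is an inclusive filter that keeps only your own code. In this sketch the pattern is the file that defines a hypothetical build_payload function; in a real project it would more likely be a glob such as "*/myapp/*", which is an assumed layout rather than anything tracemalloc requires.

```python
import tracemalloc

def build_payload():
    return [bytes(512) for _ in range(1_000)]

tracemalloc.start()
payload = build_payload()
snapshot = tracemalloc.take_snapshot()
tracemalloc.stop()

# Keep only traces whose most recent frame is in this file.
own_file = build_payload.__code__.co_filename
only_mine = snapshot.filter_traces([tracemalloc.Filter(True, own_file)])

stats = only_mine.statistics("lineno")
for stat in stats[:3]:
    print(stat)
```

An inclusive filter is usually shorter than a growing list of exclusions, at the price of hiding allocations that library code makes on your behalf.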

5. Read the single-number gate with get_traced_memory

For a fast pass/fail, skip snapshots and read tracemalloc.get_traced_memory(), which returns (current, peak) bytes since start(). reset_peak() (Python 3.9+) zeroes the peak so you can measure a specific region.

import tracemalloc

tracemalloc.start()
buf = [bytes(2048) for _ in range(5_000)]
current, peak = tracemalloc.get_traced_memory()
print(f"current={current/1e6:.1f} MB  peak={peak/1e6:.1f} MB")
tracemalloc.stop()
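To measure one region rather than everything since start(), reset the peak just before it. A small sketch, assuming Python 3.9+ for reset_peak(); the allocation sizes are arbitrary.

```python
import tracemalloc

tracemalloc.start()

setup = [bytes(1024) for _ in range(20_000)]   # ~20 MB of setup we do not want to count
del setup
_, peak_with_setup = tracemalloc.get_traced_memory()

tracemalloc.reset_peak()                       # peak now restarts from the current size
region = [bytes(1024) for _ in range(2_000)]   # ~2 MB: the region under test
_, region_peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"with setup: {peak_with_setup / 1e6:.1f} MB  region only: {region_peak / 1e6:.1f} MB")
```

Without the reset, the second reading would still report the roughly 20 MB high-water mark left behind by the setup.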

6. Assert a memory ceiling in pytest

Wrap tracing in a fixture so each test starts clean and the assertion reads peak. This is the CI counterpart to the session-fixture leaks called out in advanced pytest architecture and configuration — a leaking session-scoped fixture is exactly what blows the ceiling.

# conftest.py
import tracemalloc
import pytest

@pytest.fixture
def memory_ceiling():
    tracemalloc.start()
    tracemalloc.reset_peak()          # Python 3.9+: ignore allocations before the test body
    yield                             # the test body runs here and reads the peak itself
    tracemalloc.stop()                # stop in teardown so the next test starts clean

# test_memory.py
import tracemalloc

def build_report(rows):
    return [{"id": r, "blob": bytes(1024)} for r in range(rows)]

def test_report_stays_under_5mb(memory_ceiling):
    build_report(2_000)
    _, peak = tracemalloc.get_traced_memory()
    assert peak < 5 * 1024 * 1024, f"peak {peak/1e6:.1f} MB exceeded 5 MB ceiling"

When the ceiling fails because allocations leak across tests rather than within one, fixture scope is usually the culprit; pin it down with the techniques in mastering pytest fixtures.
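Outside pytest, the same gate can be expressed as a context manager. This is a sketch under stated assumptions: peak_under is a hypothetical helper name, it requires Python 3.9+ for reset_peak(), and it measures peak growth relative to whatever was already traced when the block began.

```python
import tracemalloc
from contextlib import contextmanager

@contextmanager
def peak_under(limit_bytes):
    """Fail if peak traced memory inside the block grows by more than limit_bytes."""
    was_tracing = tracemalloc.is_tracing()
    if not was_tracing:
        tracemalloc.start()
    tracemalloc.reset_peak()                      # Python 3.9+
    baseline, _ = tracemalloc.get_traced_memory()
    try:
        yield
        _, peak = tracemalloc.get_traced_memory()
    finally:
        if not was_tracing:
            tracemalloc.stop()                    # leave tracing as we found it
    growth = peak - baseline
    if growth > limit_bytes:
        raise AssertionError(
            f"peak grew {growth / 1e6:.1f} MB, ceiling is {limit_bytes / 1e6:.1f} MB"
        )

with peak_under(5 * 1024 * 1024):
    report = [{"id": r, "blob": bytes(1024)} for r in range(2_000)]
```

Subtracting the baseline keeps the ceiling meaningful when tracing was already running and earlier allocations are still alive.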

7. Locate growth by comparing snapshots

The core leak technique is diffing two snapshots. Call take_snapshot() before and after a repeated operation, then use compare_to to rank lines by size_diff. The focused walkthroughs live in finding memory leaks with tracemalloc snapshots and comparing tracemalloc snapshots to locate growth.

import tracemalloc

tracemalloc.start(25)
before = tracemalloc.take_snapshot()
cache = {}
for i in range(50_000):
    cache[i] = bytes(64)          # an unbounded cache: the leak
after = tracemalloc.take_snapshot()

for stat in after.compare_to(before, "lineno")[:3]:
    # size_diff is the byte growth between the two snapshots
    print(f"+{stat.size_diff/1024:8.1f} KiB  {stat.traceback[0]}")

Verification

Confirm tracing is live before measuring: tracemalloc.is_tracing() must return True.
Sanity-check get_traced_memory() peak against a known allocation — allocate a 10 MB list and verify the peak rises by roughly that much.
Run the pytest gate with -q and deliberately bump the workload above the ceiling once; the assertion message must print the real peak in MB.
Cross-check the suspected leaking line by printing stat.count (block count) alongside stat.size: a line whose count grows unboundedly across iterations is a leak, not a one-time buffer.

A short checklist covers the setup mistakes that make a measurement worthless before it starts: trace before the application imports, warm the code path before the baseline, collect garbage before every snapshot, filter out the measuring code, and keep the cycle count identical between runs. Every one of those is a line of code, and skipping any of them produces a diff that looks meaningful and is not.
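The checklist can be folded into one small harness. The sketch below is an illustration, not a canonical recipe: measure_growth, leaky and _registry are hypothetical names, and the warm-up and cycle counts are arbitrary defaults.

```python
import gc
import tracemalloc

def measure_growth(operation, cycles=100, warmup=5, top=5):
    """Return the top compare_to entries after `cycles` calls of `operation`."""
    for _ in range(warmup):                # warm the code path before the baseline
        operation()
    gc.collect()                           # collect garbage before every snapshot
    before = tracemalloc.take_snapshot()
    for _ in range(cycles):                # keep `cycles` identical between runs
        operation()
    gc.collect()
    after = tracemalloc.take_snapshot()
    # Filter out the measuring code: the snapshots themselves allocate memory.
    noise = [tracemalloc.Filter(False, tracemalloc.__file__)]
    diff = after.filter_traces(noise).compare_to(before.filter_traces(noise), "lineno")
    return diff[:top]

_registry = []

def leaky():
    _registry.append(bytes(1024))          # hypothetical leak: nothing ever removes entries

tracemalloc.start(10)                      # in a real service, start before application imports
report = measure_growth(leaky, cycles=200)
tracemalloc.stop()

for stat in report[:3]:
    print(f"+{stat.size_diff / 1024:.1f} KiB  +{stat.count_diff} blocks  {stat.traceback[0]}")
```

Because the cycle count is fixed, dividing size_diff by cycles gives a per-call growth figure that can be compared between runs.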

Troubleshooting

Symptom	Root cause	Fix
`RuntimeError: the tracemalloc module must be tracing memory`	Called `take_snapshot()` before `start()`	Call `tracemalloc.start()` first, or set `PYTHONTRACEMALLOC`
Top stats point only at `<frozen importlib._bootstrap>`	Import machinery dominates unfiltered snapshots	Apply `filter_traces` with `Filter(False, ...)` for importlib
`traceback` only shows one frame	`nframe` too low (default 1)	Restart with `tracemalloc.start(25)`
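tracemalloc number much lower than RSS	Memory that bypasses Python's allocators (for example raw malloc calls inside C extensions) and memory freed but not returned to the OS are invisible to it	Cross-check RSS with `psutil` and native allocations with `memray`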
Peak includes setup you do not care about	Peak accumulates since `start()`	Call `reset_peak()` (3.9+) right before the region
Snapshot diff shows growth that is actually a cache warm-up	First snapshot taken too early	Take the baseline after warm-up, then loop the operation

Running tracemalloc in a long-lived service

The snapshot workflow is designed for a script you can start and stop. A service that must keep serving needs a different shape: tracing enabled continuously at low resolution, snapshots taken on a schedule, and comparisons written somewhere durable.

The cost of that is real but bounded. tracemalloc.start(nframe) adds a per-allocation overhead that scales with the number of frames captured — one frame is cheap, twenty-five is not — and roughly doubles the memory used for allocation bookkeeping. In practice nframe=5 is the sweet spot: enough stack to identify the caller, cheap enough to leave on.

import gc
import threading
import tracemalloc

def start_memory_watch(interval: float = 300.0, top: int = 15) -> None:
    """Trace allocations continuously and log the growth every `interval` seconds."""
    tracemalloc.start(5)                       # 5 frames: caller context, low overhead
    baseline = tracemalloc.take_snapshot()

    def loop() -> None:
        nonlocal baseline
        while True:
            threading.Event().wait(interval)
            gc.collect()                       # exclude anything merely uncollected
            current = tracemalloc.take_snapshot()
            diff = current.compare_to(baseline, "lineno")
            for stat in diff[:top]:
                if stat.size_diff > 1_000_000:  # only report real growth
                    print(f"+{stat.size_diff / 1e6:.1f} MB  {stat}")
            baseline = current                 # rolling comparison, not cumulative

    threading.Thread(target=loop, daemon=True, name="memory-watch").start()

Two decisions in that snippet matter more than the rest. The gc.collect() before each snapshot removes garbage that is merely uncollected from the report, which is the difference between a signal and a wall of noise — without it, every cycle of a request handler looks like a leak. And re-baselining each interval reports rate of growth rather than total growth since start, which is what you want from a monitor: a steady service shows nothing, and a leaking one shows the same lines every interval.
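Pulling one comparison pass out of the thread makes both decisions easy to exercise in isolation. In this sketch watch_once is a hypothetical helper with the same threshold and rolling baseline as the loop above; the 2 MB allocation stands in for retained application data.

```python
import gc
import tracemalloc

def watch_once(baseline, min_growth=1_000_000, top=15):
    """Run one comparison pass; return the new baseline and the report lines."""
    gc.collect()                                   # exclude anything merely uncollected
    current = tracemalloc.take_snapshot()
    lines = [
        f"+{stat.size_diff / 1e6:.1f} MB  {stat.traceback[0]}"
        for stat in current.compare_to(baseline, "lineno")[:top]
        if stat.size_diff > min_growth
    ]
    return current, lines

tracemalloc.start(5)
baseline = tracemalloc.take_snapshot()
retained = [bytes(1024) for _ in range(2_000)]     # ~2 MB that stays alive
baseline, first = watch_once(baseline)             # reports the new 2 MB
baseline, second = watch_once(baseline)            # nothing new retained since the last pass
tracemalloc.stop()

print(first)
print(second)
```

The first pass reports the retained list; the second pass is silent because the rolling baseline already includes it.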

Deciding what counts as a leak

Not all growth is a leak, and treating it as one wastes time. Three patterns are benign and worth recognising before escalating.

Caches grow to their bound and stop. A functools.lru_cache with maxsize=1024 climbs during warm-up and then flattens; the giveaway is that the growth decelerates. Interpreter-level structures do the same: interned strings, module dictionaries and type caches all grow on first use of a code path and never shrink, which is why a service that touches a new endpoint for the first time an hour after boot shows a step change with no leak behind it.

Fragmentation looks like growth from outside the process but not from inside. tracemalloc reports what Python allocated; the RSS reported by the OS includes arenas that CPython has freed internally but not returned. A process whose RSS climbs while tracemalloc's total stays flat is fragmenting, not leaking, and the fix is allocation-pattern-shaped (fewer large transient objects) rather than reference-shaped.

A real leak has a distinctive signature: linear growth against a repeated operation, concentrated in a handful of allocation sites, that survives gc.collect(). The compare_to output names those sites directly, which is why the diff is the primary artefact rather than the raw snapshot.
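One way to tell a bounded cache from a leak is to measure growth per round of identical work. The sketch below is a simplified illustration: growth_per_round, bounded and unbounded are hypothetical names, and the round sizes are arbitrary.

```python
import gc
import tracemalloc
from functools import lru_cache

def growth_per_round(operation, rounds=6, calls_per_round=500):
    """Traced-bytes delta after each round of `calls_per_round` distinct calls."""
    deltas = []
    key = 0
    gc.collect()
    previous, _ = tracemalloc.get_traced_memory()
    for _ in range(rounds):
        for _ in range(calls_per_round):
            operation(key)
            key += 1
        gc.collect()                      # only count memory that survives collection
        current, _ = tracemalloc.get_traced_memory()
        deltas.append(current - previous)
        previous = current
    return deltas

@lru_cache(maxsize=256)
def bounded(key):
    return bytes(1024)                    # cache fills to 256 entries, then evicts

_leak = []

def unbounded(key):
    _leak.append(bytes(1024))             # grows by the same amount every round

tracemalloc.start()
bounded_deltas = growth_per_round(bounded)
leak_deltas = growth_per_round(unbounded)
tracemalloc.stop()

print("bounded cache:", bounded_deltas)
print("leak:         ", leak_deltas)
```

The bounded cache grows in the first round and then flattens, while the leak adds roughly the same number of bytes in every round.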

Keeping the overhead honest

Measure the cost before leaving tracing on permanently. Run a representative load test with tracemalloc.start(5) and without it, and compare both throughput and RSS. On allocation-heavy workloads the difference can reach 10–20%; on I/O-bound services it is usually under 5%. If the cost is unacceptable, sample instead: enable tracing for sixty seconds every hour, which catches a steady leak within a few cycles and costs almost nothing on average.
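A rough first estimate can come from timing an allocation-heavy function with and without tracing before committing to a full load test. This is a micro-benchmark sketch, not a substitute for one: workload and timed are hypothetical names, and the result will vary widely between machines and Python versions.

```python
import time
import tracemalloc

def workload():
    # Allocation-heavy on purpose: many small dicts and strings.
    return [{"id": i, "tag": str(i)} for i in range(50_000)]

def timed(fn, repeats=3):
    """Best wall-clock time of `repeats` runs, in seconds."""
    best = float("inf")
    for _ in range(repeats):
        started = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - started)
    return best

plain = timed(workload)

tracemalloc.start(5)                      # same depth the service would run with
traced = timed(workload)
tracemalloc.stop()

overhead_pct = (traced / plain - 1) * 100
print(f"untraced {plain * 1e3:.1f} ms  traced {traced * 1e3:.1f} ms  overhead {overhead_pct:.0f}%")
```

A workload like this one is close to a worst case, since nearly all of its time is spent allocating; an I/O-bound handler would show a much smaller relative cost.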

Re-baselining each interval turns the report into a growth rate, so a healthy service is silent and a leaking one repeats the same lines.

Choosing between tracemalloc and the alternatives

tracemalloc is the right default because it is in the standard library, needs no build step, and attributes allocations to the exact line that made them. Three other tools cover what it cannot.

objgraph answers the retention question directly by drawing reference chains from an object back to a root, which is faster than reading gc.get_referrers output by hand when the chain is more than two links long. It is a development-time tool: building the graph is slow and needs graphviz.

memray (Linux and macOS, Python 3.8+) traces both Python and native allocations, so it sees memory allocated inside C extensions that tracemalloc misses entirely, which is the usual reason a process's RSS exceeds the total tracemalloc reports. It also records a full allocation history that can be replayed as a flame graph, at a noticeably higher overhead than shallow tracemalloc tracing.

The operating system's own numbers — RSS from psutil.Process().memory_info() — are the ground truth for how much memory the process actually holds, and the only measurement that includes fragmentation and native allocations. Track RSS as the alert signal and use tracemalloc to explain it; alerting on tracemalloc totals alone misses whole categories of growth.

A practical sequence: alert on RSS, confirm with tracemalloc totals, locate with a compare_to diff, identify the retainer with gc or objgraph, and reach for memray only when the Python-level total does not account for the growth.

One last operational note: enable tracing before the code you want to measure imports anything. tracemalloc can only account for allocations made after start(), so a call placed after the application's imports will show a plausible-looking but incomplete picture, and the modules loaded during startup will be invisible in every subsequent snapshot. In a service, that means starting the trace in the entry-point module before the application package is imported; in a test, it means starting it inside the fixture rather than inside the test body.

It is also worth knowing what tracemalloc charges to whom. Memory is attributed to the frame that requested it, not to the object that keeps it alive, so a list comprehension that builds ten thousand dictionaries is reported at the comprehension's line even when the dictionaries are handed to a cache three call levels away. That attribution is the right default for finding what allocates, and the reason the retention question needs a second tool: the growing line in the report is frequently correct code doing exactly what it was asked to, on behalf of a caller that forgot to let go.
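The split between allocation site and retention site can be shown in a few lines. In this sketch build_rows, remember and CACHE are hypothetical names; tracemalloc names the allocating line, and gc.get_referrers is used as the second tool to find what keeps the rows alive.

```python
import gc
import tracemalloc

CACHE = {}

def build_rows(n):
    return [{"id": i} for i in range(n)]      # allocation site: tracemalloc reports this line

def remember(key, rows):
    CACHE[key] = rows                          # retention site: never appears in the report

tracemalloc.start()
remember("report", build_rows(10_000))
snapshot = tracemalloc.take_snapshot()
tracemalloc.stop()

top = snapshot.statistics("lineno")[0]
alloc_line = build_rows.__code__.co_firstlineno + 1
print("largest allocation site:", top.traceback[0], "(expected line", alloc_line, ")")

# The retention question needs gc: which objects still refer to the rows?
retained_by_cache = any(ref is CACHE for ref in gc.get_referrers(CACHE["report"]))
print("held by CACHE:", retained_by_cache)
```

The report points at build_rows, which is correct code; the fix belongs in whatever decides how long CACHE keeps its entries.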

Frequently Asked Questions

What does the nframe argument to tracemalloc.start() control? nframe sets how many stack frames tracemalloc records for each allocation. With nframe=1 (the default) you only get the line that allocated; a higher value lets you group by full traceback to see the call path that led to the allocation, at the cost of more memory and overhead.

Why does tracemalloc report less memory than the operating system? tracemalloc only tracks allocations made through Python's memory allocators after start() was called. Memory that C extensions allocate outside those allocators, allocations made before tracing started, and interpreter overhead are invisible to it, so RSS is normally larger.

How do I assert a memory ceiling in a pytest test? Call tracemalloc.start() in a fixture, run the code under test, then read tracemalloc.get_traced_memory() which returns current and peak bytes. Assert the peak against a threshold and stop tracing in teardown so the next test starts clean.

What is the difference between grouping statistics by lineno and by traceback? key_type='lineno' aggregates all allocations on the same source line into one entry, regardless of how that line was reached. key_type='traceback' keeps each distinct call path separate, which is essential when the same helper allocates on behalf of many callers.

For the full leak-hunting recipe, follow finding memory leaks with tracemalloc snapshots.
To rank which lines grew between two points in time, use comparing tracemalloc snapshots to locate growth.
When the bottleneck is wall-clock time rather than retained bytes, reach for CPU profiling with cProfile and py-spy instead.
To step through the exact call path that reaches an allocating line, drop into interactive debugging with pdb and ipdb.
Memory that grows only under concurrency often points at retained tasks; see debugging async code and event loops.
Finding reference cycles with gc and objgraph — the retention question tracemalloc cannot answer.
