Systematic Debugging & Performance Profiling

Some bugs surrender to a single well-placed print. The ones that consume a sprint do not: the integration test that fails only on the CI runner and never on your laptop, the worker whose resident set creeps upward over six hours until the orchestrator OOM-kills it, the endpoint that was fast last quarter and is now the p99 outlier, the async task that hangs without raising and without logging. These are not coding mistakes you can spot by re-reading the diff. They are systems problems, and the engineers who close them quickly do so because they reach for the right instrument first instead of guessing.

This guide is that instrument selection, made explicit. We assume you are comfortable with Python's data model, the C-level distinction between heap allocation and reference counting, the cooperative-scheduling model of asyncio, and the realities of running tests under CI/CD where you cannot attach a GUI debugger and every diagnostic must survive in a log. We cover interactive debugging with pdb and ipdb, post-mortem analysis of tracebacks, heap accounting with tracemalloc, deterministic and sampling CPU profiling with cProfile and py-spy, and the specific failure modes of asynchronous code. The throughline is method: classify the symptom, pick the tool that observes that class of symptom directly, and stop guessing.

Four symptom classes route to four instruments: a wrong result goes to an interactive debugger, climbing memory to tracemalloc snapshot diffing, a CPU hot spot to cProfile or py-spy, and a silent async hang to the event loop's debug mode.

1. Interactive Debugging with pdb, ipdb, and breakpoint()

The interactive debugger is the correct first tool for a logic bug — a function that returns the wrong value or takes a branch you did not expect. Unlike scattering print calls, a debugger lets you inspect arbitrary expressions, walk up and down the call stack, and re-evaluate state without re-running the program. Since Python 3.7 the canonical entry point is the built-in breakpoint(), which is an indirection over sys.breakpointhook. By default it invokes pdb.set_trace(), but it reads the PYTHONBREAKPOINT environment variable first, which is what makes it superior to a hardcoded pdb.set_trace(): you can redirect every breakpoint in the codebase from the outside.

# inventory.py
def reconcile(counts: dict[str, int], adjustments: dict[str, int]) -> dict[str, int]:
    merged = dict(counts)
    for sku, delta in adjustments.items():
        breakpoint()  # 3.7+: honors PYTHONBREAKPOINT; no import needed
        merged[sku] = merged.get(sku, 0) + delta
    return merged

Run it normally and you drop into pdb at the breakpoint. The commands that matter in practice are n (next line), s (step into a call), c (continue to the next breakpoint), w (print the stack), u/d (move up/down a frame to inspect a caller's locals), p/pp (print/pretty-print an expression), and ! to execute an arbitrary statement (!merged['widget'] = 0). The single most valuable habit is interrogating expressions rather than variables: p [s for s in adjustments if s not in counts] answers a question, where p adjustments only dumps data.

ipdb is pdb with IPython's machinery layered on — tab completion, syntax highlighting, and the full %-magic set. It is a drop-in replacement, and the clean way to opt in is the environment variable rather than editing imports:

pip install ipdb
# Every breakpoint() call in the process now lands in ipdb instead of pdb:
PYTHONBREAKPOINT=ipdb.set_trace python -m inventory
# Or disable all breakpoints globally — useful as a CI safety net:
PYTHONBREAKPOINT=0 python -m inventory

Why it matters in CI/CD. A breakpoint() left in a code path will block a CI runner, because there is no TTY to satisfy the prompt and the job hangs until it times out. Two defenses belong in every pipeline. First, set PYTHONBREAKPOINT=0 in the CI environment so a stray breakpoint becomes a no-op instead of a hang. Second, add a lint gate so the call never merges in the first place: ruff flags breakpoint() and debugger imports such as pdb under rule T100 (its port of flake8-debugger), and flake8-debugger does the same for flake8. The neighboring T20 rules cover print and pprint calls, not debuggers. Treat PYTHONBREAKPOINT=0 as defense in depth, not as permission to skip the lint. Conditional breakpoints keep an interactive session focused on the failing case; the mechanics of condition and one-shot breakpoints are covered in setting conditional breakpoints in pdb.
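
The indirection is easy to observe directly. The sketch below first routes breakpoint() to a hypothetical recording_hook through sys.breakpointhook, then restores the default hook (sys.__breakpointhook__) and shows that PYTHONBREAKPOINT=0 turns the call into a no-op. The hook name and the calls list are illustrative only and are not part of any library.

```python
import os
import sys

calls: list[tuple] = []

def recording_hook(*args, **kwargs) -> None:
    """Hypothetical stand-in for a debugger: record the call and return."""
    calls.append((args, kwargs))

# 1. breakpoint() dispatches to whatever sys.breakpointhook currently is.
sys.breakpointhook = recording_hook
breakpoint()
print("hook calls:", len(calls))

# 2. The default hook consults PYTHONBREAKPOINT on every call; "0" means "do nothing".
sys.breakpointhook = sys.__breakpointhook__
previous = os.environ.get("PYTHONBREAKPOINT")
os.environ["PYTHONBREAKPOINT"] = "0"
try:
    result = breakpoint()  # returns immediately, no debugger prompt
finally:
    # Put the environment back the way we found it.
    if previous is None:
        del os.environ["PYTHONBREAKPOINT"]
    else:
        os.environ["PYTHONBREAKPOINT"] = previous
print("disabled breakpoint returned:", result)
```

Because the variable is read at call time rather than at import time, exporting it in the CI job definition is enough; no application code has to know about it.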

2. Post-Mortem Debugging and Reading Tracebacks

Stepping forward from a breakpoint is the wrong model when the program has already crashed — you do not want to replay the run, you want to inspect the exact frame where it died, with every local still intact. That is post-mortem debugging. After an unhandled exception reaches the REPL, import pdb; pdb.pm() re-enters the debugger positioned at the frame that raised, letting you walk the stack and inspect the locals that produced the failure. For a script, python -m pdb -c continue script.py runs to completion and drops into post-mortem automatically on an uncaught exception.

Reading the traceback itself is a skill that precedes any tool. Python prints frames oldest-to-newest, so the last frame is where the exception was raised and the frames above it are the call chain that led there. In Python 3.11+ the traceback also carries fine-grained column markers (^^^^) pinpointing the sub-expression at fault, which collapses the ambiguity of a long chained expression. When the real cause is wrapped, read the raise ... from chain: "The above exception was the direct cause" marks an explicit from, while "During handling of the above exception, another exception occurred" marks an implicit chain — usually a bug in the except block, not the original error.

import pdb
import sys

def load_config(raw: dict) -> dict:
    return {"retries": int(raw["retries"])}  # KeyError or ValueError lands here

def main():
    try:
        load_config({"retrise": "3"})  # typo'd key -> KeyError
    except Exception:
        # Capture the *active* traceback and re-enter at the raising frame.
        pdb.post_mortem(sys.exc_info()[2])

if __name__ == "__main__":
    main()
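
The two chaining messages described above correspond to two attributes on the exception object, which can be checked without reading any printed traceback. This is a minimal sketch; explicit_chain, implicit_chain, and capture are hypothetical helpers written for the illustration.

```python
def explicit_chain() -> None:
    try:
        int("three")
    except ValueError as exc:
        # "raise ... from exc" sets __cause__: printed as "direct cause".
        raise RuntimeError("retries must be an integer") from exc

def implicit_chain() -> None:
    try:
        {}["retries"]
    except KeyError:
        # A bug inside the handler: str + int raises TypeError while the
        # KeyError is still being handled, so only __context__ is set.
        message = "missing key: " + 404
        print(message)

def capture(fn) -> BaseException:
    """Run fn and return the exception it raises."""
    try:
        fn()
    except Exception as exc:
        return exc
    raise AssertionError("fn did not raise")

explicit = capture(explicit_chain)
implicit = capture(implicit_chain)

print(type(explicit).__name__, "caused by", type(explicit.__cause__).__name__)
print(type(implicit).__name__, "raised while handling", type(implicit.__context__).__name__)
```

When __cause__ is None and __context__ is set, the newest exception is the one to fix first: it came from the handler, and it is hiding the original error.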

In a Jupyter or IPython session the equivalent is the %debug magic: run it in the cell immediately after a traceback and it opens ipdb at the failing frame, or enable %pdb on to have it trigger automatically on every uncaught exception. To preserve a post-mortem from a CI failure where there is no interactive session, register a sys.excepthook that writes traceback.format_exception(exc_type, exc, tb) plus the locals of each frame to a log artifact. Use the three arguments the hook receives rather than traceback.format_exc(), which reads the exception currently being handled and finds none once the hook is running. That frozen stack is often enough to diagnose a failure that will not reproduce locally. The deeper workflow (pdb.pm(), post_mortem, and harvesting locals from a dead frame) is the subject of post-mortem debugging with pdb.pm().
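
A minimal sketch of such a hook follows, assuming the report goes to stderr where the CI system already captures it. format_crash_report and log_uncaught are hypothetical names, and the 200-character cap on each repr is an arbitrary choice to keep the artifact readable.

```python
import sys
import traceback

def format_crash_report(exc_type, exc, tb) -> str:
    """Render the traceback followed by each frame's locals, oldest frame first."""
    lines = traceback.format_exception(exc_type, exc, tb)
    lines.append("Locals by frame (oldest first):\n")
    for frame, lineno in traceback.walk_tb(tb):
        code = frame.f_code
        lines.append(f"  {code.co_filename}:{lineno} in {code.co_name}\n")
        for name, value in frame.f_locals.items():
            try:
                shown = repr(value)[:200]  # cap very large objects
            except Exception:
                shown = "<unrepresentable>"  # a broken __repr__ must not mask the crash
            lines.append(f"    {name} = {shown}\n")
    return "".join(lines)

def log_uncaught(exc_type, exc, tb) -> None:
    """excepthook: receives the exception as arguments, not via sys.exc_info()."""
    sys.stderr.write(format_crash_report(exc_type, exc, tb))

sys.excepthook = log_uncaught
```

Frame locals can include credentials or customer data, so a real pipeline would likely redact or allow-list names before the report is uploaded as an artifact.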

A crash that only appears intermittently is a different animal: the traceback may be real but the trigger is timing or ordering. Before assuming a memory or concurrency fault, confirm the failure is deterministic by reproducing it under controlled reruns, a discipline covered in debugging flaky tests with pytest-rerunfailures.

3. Memory Profiling with tracemalloc

When resident memory climbs over the life of a long-running process or a large test session ends in an OOM kill, the question is which lines allocated the memory that was never freed. tracemalloc, part of the standard library since Python 3.4, answers exactly that. It hooks the CPython allocator and attributes every Python-level allocation to the source line and call stack that requested it. Crucially, it sees only allocations made through the Python memory manager after start() is called. Memory that a C extension obtains by calling malloc directly is invisible unless the extension reports it to tracemalloc itself (NumPy does this for array data; many extensions do not), and allocations that predate the call are never attributed. Start tracing as early as possible, and for whole-program coverage prefer the PYTHONTRACEMALLOC=1 environment variable so tracing is active before your first import runs.

import tracemalloc

tracemalloc.start(25)  # keep up to 25 frames per allocation for readable tracebacks

cache: list[bytes] = []

def handle_request(n: int) -> None:
    # A classic leak: an unbounded module-level cache that nothing evicts.
    cache.append(b"x" * 1024 * n)

snapshot_before = tracemalloc.take_snapshot()
for i in range(1, 200):
    handle_request(i)
snapshot_after = tracemalloc.take_snapshot()

# Diff the two snapshots and rank by the bytes that GREW between them.
top = snapshot_after.compare_to(snapshot_before, "lineno")
for stat in top[:5]:
    print(stat)  # e.g. inventory.py:12: size=19.4 MiB (+19.4 MiB), count=199 (+199)

The decisive technique is differential snapshotting: compare_to between a baseline and a later snapshot isolates growth and discards the steady-state allocations that a single take_snapshot() would bury in noise. Group by "lineno" to find the allocating line, or by "traceback" and then call stat.traceback.format() to see the full call path, which matters when the same helper is called from many call sites (this needs more than one stored frame, as with the tracemalloc.start(25) call above). In a test suite, a session-scoped fixture that accumulates state is the textbook source of the leak that the fixture-lifecycle guidance in advanced pytest architecture and configuration warns about; the same compare_to workflow finds it. The full leak-hunting procedure, including bisecting growth across many snapshots, lives in memory profiling with tracemalloc and its companion on comparing tracemalloc snapshots to locate growth.

Why it matters in CI/CD. Memory leaks rarely fail a fast unit run; they surface as nondeterministic OOM kills in long integration jobs or in production after hours of uptime, which makes them expensive to attribute after the fact. Wiring a tracemalloc baseline-and-diff into a soak test — take a snapshot, drive N iterations of the workload, diff, and assert the top growth statistic stays under a threshold — turns an unbounded-growth regression into a deterministic test failure on the commit that introduced it.
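
The soak-test idea can be sketched as a small helper plus a threshold check. Everything named here (heap_growth, leaky, steady, THRESHOLD) is hypothetical, and the 100 KiB budget is an arbitrary figure for the illustration; a real suite would pick a budget from observed baselines.

```python
import tracemalloc
from typing import Callable

def heap_growth(workload: Callable[[], None], iterations: int) -> int:
    """Net bytes of traced Python allocations retained across `iterations` calls."""
    # Exclude the memory tracemalloc itself uses to hold the baseline snapshot.
    ignore_self = [tracemalloc.Filter(False, tracemalloc.__file__)]
    started_here = not tracemalloc.is_tracing()
    if started_here:
        tracemalloc.start()
    try:
        workload()  # warm-up, so one-time caches are not counted as growth
        before = tracemalloc.take_snapshot().filter_traces(ignore_self)
        for _ in range(iterations):
            workload()
        after = tracemalloc.take_snapshot().filter_traces(ignore_self)
    finally:
        if started_here:
            tracemalloc.stop()
    return sum(stat.size_diff for stat in after.compare_to(before, "lineno"))

_retained: list[bytes] = []

def leaky() -> None:
    _retained.append(bytes(10_000))  # appended forever, never evicted

def steady() -> None:
    sum(len(bytes(10_000)) for _ in range(10))  # everything is garbage on return

THRESHOLD = 100 * 1024  # hypothetical budget: 100 KiB of growth per soak run

leak_bytes = heap_growth(leaky, 100)
steady_bytes = heap_growth(steady, 100)
print(f"leaky grew {leak_bytes} bytes, steady grew {steady_bytes} bytes")
```

In a test, the assertion would be heap_growth(workload, N) < THRESHOLD; on failure, rerunning compare_to grouped by "traceback" names the call path responsible.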

4. CPU Profiling with cProfile and py-spy

A slow endpoint or a test suite that has doubled in wall-clock time is a CPU-time problem, and the cardinal rule is to measure before optimizing — intuition about hot spots is wrong often enough to waste days. cProfile is the standard-library deterministic profiler: it instruments every function call and records exact call counts and timings. Run it without touching the code:

# Profile a module and write a binary stats file for later analysis:
python -m cProfile -o profile.out -m myapp.batch_job

import pstats
from pstats import SortKey

stats = pstats.Stats("profile.out")
stats.strip_dirs()
stats.sort_stats(SortKey.CUMULATIVE).print_stats(10)  # the expensive call TREE
stats.sort_stats(SortKey.TIME).print_stats(10)        # the hot LEAF function

The two sort keys answer different questions, and conflating them is the most common profiling mistake. tottime (TIME) is time spent inside a function excluding its subcalls — sort by it to find the genuine hot loop you should optimize. cumtime (CUMULATIVE) includes everything called downstream — sort by it to find the expensive subtree you should consider cutting entirely. The disciplined sequence is cumulative first to find what to remove, then total to find what to make faster. The distinction is explored in depth in interpreting cProfile cumulative vs total time.

Reading the same call stack two ways: cumtime brackets a frame plus everything below it, so parse_rows looks costly and flags a subtree to remove; tottime counts only a frame's own work, exposing decode as the leaf where the time is actually spent.
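
The same contrast can be reproduced in a few lines. This sketch reuses the parse_rows and decode names from the description above with made-up bodies, and reads the numbers through Stats.get_stats_profile(), which assumes Python 3.9 or newer.

```python
import cProfile
import pstats

def decode(n: int) -> int:
    # The leaf: all of the arithmetic happens here.
    total = 0
    for i in range(n):
        total += i * i
    return total

def parse_rows(rows: int) -> int:
    # The parent: does almost nothing itself, but owns the expensive subtree.
    total = 0
    for _ in range(rows):
        total += decode(5_000)
    return total

profiler = cProfile.Profile()
profiler.enable()
parse_rows(300)
profiler.disable()

# func_profiles maps function name -> FunctionProfile(tottime, cumtime, ...).
by_name = pstats.Stats(profiler).get_stats_profile().func_profiles
for name in ("parse_rows", "decode"):
    entry = by_name[name]
    print(f"{name:<12} tottime={entry.tottime:.3f}s  cumtime={entry.cumtime:.3f}s")
```

parse_rows should show a large cumtime with a near-zero tottime, while decode carries the tottime: the first number says the subtree is worth questioning, the second says where an optimization would actually land.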

cProfile has two limits: its per-call instrumentation adds measurable overhead that distorts very call-heavy code, and it must be wired in before the process starts. When you need to profile a process that is already running — a live production worker, or one that is hung — reach for py-spy, an external sampling profiler written in Rust that attaches to a running PID and reads its stacks from outside the interpreter, adding almost no overhead and requiring no code change or restart:

pip install py-spy
# Attach to a live process and render an interactive top-like view:
py-spy top --pid 48213
# Or sample for 30s and emit a flamegraph SVG of where wall-clock time went:
py-spy record --pid 48213 --duration 30 --output flame.svg
# dump prints the current stack of every thread — the fastest way to see a hang:
py-spy dump --pid 48213

Why it matters in CI/CD. A deterministic cProfile run on a representative workload, with the stats file kept as a build artifact (or pytest's --durations report for per-test timings), makes performance regressions reviewable in the same way coverage is. py-spy dump is the on-call tool: when a worker pegs a core or stops making progress in production, dumping its stacks from outside is faster and safer than attaching a debugger to a process you cannot afford to pause. Both ends of this, sampling a live process and reading the deterministic report, are detailed in CPU profiling with cProfile and py-spy and profiling a running process with py-spy.

5. Debugging Async Code and Event Loops

Asynchronous code fails in ways synchronous code cannot: a task can hang forever without raising, a coroutine can be created and silently dropped, and a callback can fire against a loop that has already closed. The first move is always to enable the event loop's debug mode, which asyncio has shipped since 3.4. It logs any callback or task step that blocks the loop for longer than loop.slow_callback_duration (100 ms by default), surfaces "coroutine was never awaited" with a source location, and warns about non-thread-safe calls. Enable it without editing code via PYTHONASYNCIODEBUG=1, or per-entrypoint with asyncio.run(main(), debug=True).

import asyncio

async def fetch(n: int) -> int:
    await asyncio.sleep(0.1)
    return n * 2

async def main() -> None:
    # BUG: calling a coroutine function without awaiting creates a coroutine
    # object that is never scheduled -> "coroutine 'fetch' was never awaited".
    fetch(10)                       # orphaned; debug mode reports where
    result = await fetch(21)        # correct: awaited
    print(result)

    # Inspect everything the loop is currently running — invaluable for a hang.
    for task in asyncio.all_tasks():
        print(task.get_name(), task.get_coro())

asyncio.run(main(), debug=True)     # 3.7+ entrypoint; debug=True turns on checks

The two failures that dominate support channels have specific causes. "Event loop is closed" almost always means a coroutine or callback ran after asyncio.run() returned and tore the loop down, frequently a background task that was never awaited or cancelled before shutdown, or a teardown that closed the loop while work was still pending. The fix is to track tasks (keep a strong reference; an unreferenced task can be garbage-collected mid-flight) and cancel-and-await them during shutdown. The "coroutine was never awaited" warning means a coroutine object was created but never driven by the loop: the missing await is the cause, and debug mode prints the line. Diagnosing a true hang means dumping asyncio.all_tasks() to see which coroutine is parked on an await that never resolves, or running py-spy dump against the process to see what each thread is executing, which separates a loop idling in its selector (a parked await) from a blocking call that has stalled the loop.
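
The track-then-cancel discipline can be sketched as follows. spawn, shutdown, worker, and the module-level background_tasks set are hypothetical names for the illustration; the asyncio calls themselves are standard library.

```python
import asyncio

# Strong references: the loop itself only keeps weak references to tasks.
background_tasks: set[asyncio.Task] = set()

def spawn(coro) -> asyncio.Task:
    """Schedule coro and hold a reference until it finishes."""
    task = asyncio.create_task(coro)
    background_tasks.add(task)
    task.add_done_callback(background_tasks.discard)  # drop the reference when done
    return task

async def shutdown() -> None:
    """Cancel everything still running, then wait for the cancellations to land."""
    pending = list(background_tasks)
    for task in pending:
        task.cancel()
    # return_exceptions=True collects each CancelledError instead of raising it.
    await asyncio.gather(*pending, return_exceptions=True)

async def worker(n: int) -> None:
    await asyncio.sleep(3600)  # stands in for long-lived background work

async def main() -> int:
    for i in range(3):
        spawn(worker(i))
    await asyncio.sleep(0)  # yield once so the workers start and park
    await shutdown()
    return len(background_tasks)

remaining = asyncio.run(main())
print("tasks still referenced after shutdown:", remaining)
```

Awaiting the cancelled tasks is the step most often skipped: cancel() only requests cancellation, and returning from main() before it is delivered is one way work ends up running against a closed loop.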

Why it matters in CI/CD. Async test suites are a primary source of "Event loop is closed" noise because fixtures and the test runner each manage loop lifecycle, and a mismatch between fixture scope and loop scope leaks a loop across tests. Getting that scoping right — matching the async fixture's lifetime to the loop it runs on — is covered in how to scope pytest fixtures for async tests. The two concrete errors get full treatments in debugging async code and event loops, with dedicated walkthroughs for the "Event loop is closed" RuntimeError and tracing "coroutine was never awaited" warnings.

Common Pitfalls & Antipatterns

Leaving breakpoint() in a merged code path. Root cause: no lint gate and no environment guard. Fix: enable ruff rule T100 (or flake8-debugger) to block the call at review time, and set PYTHONBREAKPOINT=0 in CI so any survivor degrades to a no-op rather than hanging the runner.
Trusting tracemalloc numbers when started too late. Root cause: allocations made before tracemalloc.start(), and memory a C extension takes straight from malloc without reporting it to tracemalloc, are untracked, so a real leak in native buffers shows as "no growth." Fix: start tracing via PYTHONTRACEMALLOC=1 before imports, and use OS-level RSS (psutil, /proc) as the ground truth that tracemalloc only explains.
Optimizing the wrong function by misreading the profile. Root cause: sorting by cumtime and then optimizing the top entry, which is usually main or the framework loop. Fix: sort by tottime to find the leaf where time is actually spent; use cumtime only to decide which subtrees to eliminate.
Profiling with cProfile on call-heavy code and trusting absolute timings. Root cause: deterministic instrumentation adds per-call overhead that inflates functions with millions of tiny calls. Fix: use relative comparisons, or switch to a sampling profiler like py-spy whose overhead is independent of call frequency.
Dropping async tasks on the floor. Root cause: asyncio.create_task(...) without keeping a reference, so the garbage collector can finalize the task mid-flight, or never awaiting it before the loop closes. Fix: store tasks in a set, add a done callback to discard them, and cancel-then-await all pending tasks during shutdown.
Debugging an intermittent crash as if it were deterministic. Root cause: re-running once, seeing it pass, and declaring it fixed. Fix: reproduce under repeated runs first to establish a failure rate, then bisect; a "fix" that does not move the failure rate fixed nothing.
Reaching for a memory or CPU profiler when the real problem is a hang. Root cause: a parked await consumes neither CPU nor growing memory, so both profilers show a quiet, healthy process. Fix: dump the task list (asyncio.all_tasks()) or the thread stacks (py-spy dump) to see what every coroutine and thread is blocked on.
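
For the hang case in the last pitfall, a minimal in-process sketch of the task dump looks like this. stuck, dump_tasks, and the task name "stuck-worker" are hypothetical; in a real service the dump would more likely be triggered by a signal handler or a debug endpoint than called inline.

```python
import asyncio
import io

async def stuck() -> None:
    # Waits on an event nobody will ever set: no CPU, no memory growth, no error.
    await asyncio.Event().wait()

def dump_tasks() -> str:
    """Return the suspended stack of every task except the one doing the dump."""
    out = io.StringIO()
    current = asyncio.current_task()
    for task in asyncio.all_tasks():
        if task is current:
            continue
        out.write(f"--- {task.get_name()} ---\n")
        task.print_stack(file=out)  # shows the await each task is parked on
    return out.getvalue()

async def main() -> str:
    task = asyncio.create_task(stuck(), name="stuck-worker")
    await asyncio.sleep(0.01)  # give the task time to reach its await
    report = dump_tasks()
    task.cancel()
    await asyncio.gather(task, return_exceptions=True)
    return report

report = asyncio.run(main())
print(report)
```

The report names the coroutine function and the line of the await it is suspended on, which is the information neither a CPU profile nor a heap diff can provide.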

The four tracks below answer four different questions, and picking the right one first is most of the speed advantage an experienced debugger has.

Classify before you instrument: each tool answers one question well and the others badly.

Frequently Asked Questions

What is the difference between pdb.set_trace() and breakpoint()? breakpoint() was added in Python 3.7 and is an indirection layer. By default it calls pdb.set_trace(), but it honors the PYTHONBREAKPOINT environment variable, so PYTHONBREAKPOINT=ipdb.set_trace switches every breakpoint to ipdb and PYTHONBREAKPOINT=0 disables them all without editing code.

Why does tracemalloc show almost no memory even though the process is large? tracemalloc only tracks allocations made by the Python memory manager after tracemalloc.start() is called. Memory that a C extension takes directly from malloc without reporting it to tracemalloc is invisible, and anything allocated before the call is never attributed. Start tracing as early as possible, ideally via the PYTHONTRACEMALLOC environment variable.

Should I read cumulative time or total time in a cProfile report? Total time (tottime) is the time spent in a function excluding subcalls, so it finds the genuine hot loop. Cumulative time (cumtime) includes everything called from that function, so it finds the expensive call tree. Sort by cumtime to find what to cut, then by tottime to find what to optimize.

When should I reach for py-spy instead of cProfile? Use py-spy when you cannot or do not want to modify or restart the process, for example a live production worker or a hung process. py-spy is a sampling profiler that attaches to a running PID from outside the interpreter, so it adds almost no overhead and needs no code changes, whereas cProfile is a deterministic in-process profiler that instruments every call.

What causes a "coroutine was never awaited" RuntimeWarning? It means a coroutine function was called, producing a coroutine object, but that object was never awaited or scheduled on the event loop. The usual cause is forgetting await or passing the coroutine somewhere that expects a value. Run with PYTHONASYNCIODEBUG=1 or asyncio.run(main(), debug=True) to get the source location of the orphaned coroutine.

Start an interactive session with interactive debugging with pdb and ipdb, the entry point for stepping through a logic bug and the home of conditional-breakpoint and post-mortem techniques.
When resident memory climbs, memory profiling with tracemalloc turns an OOM kill into an attributable line of code through differential snapshots.
For wall-clock regressions, CPU profiling with cProfile and py-spy covers both the deterministic report and sampling a process that is already running.
For tasks that hang or warn without raising, debugging async code and event loops maps the event loop's debug mode onto the errors it surfaces.
When the bug hides behind a test double, the isolation contracts in advanced mocking and test doubles in Python determine whether the failure you are debugging is in the code or in the mock.
