
CPU Profiling with cProfile and py-spy

A service that meets its latency target in local benchmarks but burns CPU under production load is the canonical reason to reach for a profiler, and the first decision is which kind. Python ships a deterministic profiler, cProfile, that records every call and return — exact, reproducible, and ideal when you can run the workload yourself. It cannot help with a wedged production worker you must not restart; for that you need a sampling profiler such as py-spy that attaches to a live PID from outside the interpreter. This guide treats both as one toolkit: capture with cProfile, read the numbers correctly with pstats, visualize the call graph, and sample running processes with py-spy when the deterministic path is closed to you.

Prerequisites

  • Python 3.8+ (cProfile, pstats, and cProfile.Profile are stdlib; runctx has existed since 2.x).
  • py-spy >= 0.3.14 (pip install py-spy), which provides the top, record, and dump subcommands and the --pid and --native options this guide relies on.
  • snakeviz >= 2.2 for interactive visualization (pip install snakeviz).
  • gprof2dot >= 2024.6.6 plus Graphviz (dot) for static call graphs.
  • On Linux, attaching py-spy to another process requires either CAP_SYS_PTRACE, running as root, or relaxing kernel.yama.ptrace_scope — covered under Troubleshooting.

Core concept

A deterministic profiler hooks the interpreter's call and return events, so it knows exactly how many times every function ran and how long each invocation took. That precision costs overhead, typically a 2–5x slowdown, and the overhead is heaviest on functions that make many small calls, which can distort the picture. A sampling profiler instead inspects the program at a fixed frequency (py-spy defaults to 100 Hz) and records the current call stack. It injects no instrumentation into the code path, adds negligible overhead, and reflects where wall-clock time is actually spent, at the cost of missing functions that start and finish between samples. The mental model: cProfile answers "how many times and how long per call"; py-spy answers "where is wall time going right now".
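
To make the sampling idea concrete, here is a toy in-process sampler. It is only a sketch of the principle, not how py-spy is built: py-spy reads the target's stacks from a separate process, whereas this version runs a background thread inside the same interpreter and relies on sys._current_frames(). The names busy, sample_thread, and profile_by_sampling, and the 5 ms interval, are illustrative assumptions.

```python
import collections
import sys
import threading
import time


def busy(duration_s: float) -> int:
    """Spin on the CPU for roughly duration_s seconds."""
    deadline = time.perf_counter() + duration_s
    count = 0
    while time.perf_counter() < deadline:
        count += 1
    return count


def sample_thread(thread_id: int, stop: threading.Event,
                  counts: collections.Counter, interval_s: float = 0.005) -> None:
    """Record the innermost function of the target thread every interval_s."""
    while not stop.is_set():
        frame = sys._current_frames().get(thread_id)
        if frame is not None:
            counts[frame.f_code.co_name] += 1
        time.sleep(interval_s)


def profile_by_sampling(func, *args):
    """Run func(*args) on the calling thread while a sampler thread watches it."""
    counts: collections.Counter = collections.Counter()
    stop = threading.Event()
    sampler = threading.Thread(
        target=sample_thread,
        args=(threading.get_ident(), stop, counts),
        daemon=True,
    )
    sampler.start()
    try:
        result = func(*args)
    finally:
        stop.set()          # always stop the sampler, even if func raises
        sampler.join()
    return result, counts


result, counts = profile_by_sampling(busy, 0.3)
print(counts.most_common(3))    # function names ranked by how often they were sampled
```

The sampler never learns how many times busy was called, only how often it was on the stack when a sample fired, which is the trade-off described above.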

pstats reports two times per function that engineers routinely confuse. Total time (tottime) is time spent in the function body itself, excluding subcalls. Cumulative time (cumtime) includes everything the function transitively called. A dispatcher near the top of the stack has huge cumtime but tiny tottime; a tight numeric loop is the opposite. The distinction is important enough to have its own deep dive on interpreting cProfile cumulative vs total time.
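
The difference is easiest to see on a two-function example. The sketch below profiles a hypothetical dispatcher that does nothing but call a hypothetical leaf, then reads the raw numbers from the Stats.stats dictionary; the function names and loop sizes are made up for illustration.

```python
import cProfile
import pstats


def leaf(n: int) -> int:
    """All the work happens in this body: high tottime."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def dispatcher() -> int:
    """Does almost nothing itself: low tottime, high cumtime."""
    return leaf(200_000) + leaf(200_000) + leaf(200_000)


profiler = cProfile.Profile()
profiler.runcall(dispatcher)

# Stats.stats maps (file, line, function) ->
# (primitive calls, total calls, tottime, cumtime, callers).
timings = {
    func_name: (ncalls, tottime, cumtime)
    for (_file, _line, func_name), (_pc, ncalls, tottime, cumtime, _callers)
    in pstats.Stats(profiler).stats.items()
}

for name in ("dispatcher", "leaf"):
    ncalls, tottime, cumtime = timings[name]
    print(f"{name:<10} ncalls={ncalls} tottime={tottime:.4f} cumtime={cumtime:.4f}")
```

The dispatcher row should show a cumtime close to the leaf's and a tottime near zero, while the leaf's tottime and cumtime are nearly equal because it calls nothing else.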

[Figure: Deterministic versus sampling profiling. Left panel, cProfile (deterministic): hooks every call and return; exact ncalls; tottime is the body only, cumtime is the body plus all subcalls; precise, but 2-5x overhead; you run the code yourself. Right panel, py-spy (sampling): reads stacks from outside; snapshots the stack at 100 Hz; attaches by --pid with no restart; wall-clock, near-zero overhead; safe on production workers; misses sub-sample functions. Takeaway: profile deterministically when you can; sample when not.]
cProfile instruments every call for exact counts at the cost of overhead; py-spy samples the stack from outside the interpreter, reflecting wall-clock time with negligible cost and no restart.

Step-by-step implementation

1. Capture a deterministic profile with cProfile.run / runctx

cProfile.run(statement, filename=None) executes a string of code under the profiler. Pass a filename to persist raw stats for later analysis instead of dumping a table to stdout:

Python
import cProfile

def fib(n: int) -> int:
    return n if n < 2 else fib(n - 1) + fib(n - 2)

def workload() -> int:
    return sum(fib(n) for n in range(30))

# Run the statement under the profiler and write raw stats to disk.
# run() exec's the string in the __main__ module's namespace, so workload must be
# defined at the top level of the script you run.
cProfile.run("workload()", filename="workload.prof")

run execs the statement in the __main__ module's namespace, so it can see that module's globals but not function locals or names that live in other modules. When the code you want to profile depends on local variables (common inside a test or a function), use runctx, which takes explicit globals and locals dicts:

Python
import cProfile

def profile_request(payload: dict) -> None:
    handler = build_handler(payload)          # local objects the statement needs
    # runctx exec's "handler.run()" with these exact namespaces, so the local
    # `handler` resolves correctly where plain run() would raise NameError.
    cProfile.runctx("handler.run()", globals(), locals(), filename="request.prof")
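
The difference between the two entry points can be checked directly. This is a small sketch under the assumption that the name local_numbers_demo exists nowhere in the __main__ module; the function and variable names are invented for the demonstration.

```python
import cProfile
import contextlib
import io


def profile_sorted_locals() -> str:
    local_numbers_demo = [5, 3, 8, 1]     # exists only inside this function

    # Silence the stats tables that run()/runctx() print when no filename is given.
    with contextlib.redirect_stdout(io.StringIO()):
        try:
            # run() evaluates the string against __main__'s module namespace,
            # which does not contain this function's locals.
            cProfile.run("sorted(local_numbers_demo)")
            outcome = "run() resolved the local"
        except NameError:
            # runctx() receives the namespaces explicitly, so the name resolves.
            cProfile.runctx("sorted(local_numbers_demo)", globals(), locals())
            outcome = "run() raised NameError; runctx() succeeded"
    return outcome


print(profile_sorted_locals())
```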

For surgical control — profiling a single hot region without wrapping it in a string — drive a Profile object directly:

Python
import cProfile

profiler = cProfile.Profile()
profiler.enable()
try:
    result = expensive_pipeline(records)      # only this region is measured
finally:
    profiler.disable()                        # stop measuring even if the pipeline raises
profiler.dump_stats("pipeline.prof")          # raw stats for pstats / SnakeViz
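
If the same enable/disable bracket shows up in several places, it can be wrapped once in a context manager. The helper below is a hypothetical convenience (the name profiled and the checksum workload are not part of cProfile), written as a sketch rather than a recommended API.

```python
import cProfile
import contextlib
import os
import tempfile


@contextlib.contextmanager
def profiled(stats_path: str):
    """Profile the body of a `with` block and dump raw stats to stats_path."""
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        yield profiler
    finally:
        # Runs even if the block raises, so the profiler never stays enabled.
        profiler.disable()
        profiler.dump_stats(stats_path)


def checksum(n: int) -> int:
    """Stand-in workload for the region you want to measure."""
    return sum(i * i for i in range(n))


stats_path = os.path.join(tempfile.mkdtemp(), "region.prof")
with profiled(stats_path):
    value = checksum(50_000)          # only this region is measured

print(stats_path, os.path.getsize(stats_path), "bytes")
```

The resulting file loads with pstats.Stats(stats_path) exactly like one written by cProfile.run.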

Avoid wrapping an entire pytest run in the profiler: collection and fixture setup then land in the stats, and the deterministic overhead skews them further. Profile the function under test directly, or use the pytest fixture scoping rules to isolate the call inside a function-scoped fixture so collection time is excluded.

2. Sort and read the stats with pstats

A .prof file is raw; pstats.Stats turns it into a sortable table. Strip directory prefixes so function names are legible, then sort:

Python
import pstats
from pstats import SortKey

stats = pstats.Stats("workload.prof")
stats.strip_dirs()                            # drop absolute paths from names
stats.sort_stats(SortKey.CUMULATIVE)          # rank by cumtime (body + subcalls)
stats.print_stats(10)                         # top 10 rows

# A second pass by total time surfaces leaf hotspots, not just hot callers.
stats.sort_stats(SortKey.TIME).print_stats(10)

Sort by SortKey.CUMULATIVE to find which entry points dominate the run, and by SortKey.TIME (total time) to find the leaf functions actually burning CPU. The ncalls, percall, tottime, and cumtime columns are dissected in the dedicated guide linked above.
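
Once a suspect shows up, pstats can narrow the report to it and show who calls it. The sketch below assumes two toy functions, load and parse, and sends the report to an in-memory buffer instead of stdout; the regex restriction and print_callers are standard pstats features, while everything else is illustrative.

```python
import cProfile
import io
import pstats
from pstats import SortKey


def parse(line: str) -> int:
    return int(line.strip())


def load(lines):
    return [parse(line) for line in lines]


profiler = cProfile.Profile()
profiler.runcall(load, [f" {n} " for n in range(20_000)])

buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)   # write reports to buffer, not stdout
stats.strip_dirs().sort_stats(SortKey.TIME)
stats.print_stats(r"parse", 5)     # keep rows matching the regex, then at most 5 of them
stats.print_callers(r"parse")      # which functions called parse, and how often

report = buffer.getvalue()
print(report)
```

Restrictions are applied left to right, so the regex filters first and the integer then caps the number of rows printed.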

3. Visualize the call graph

Tables hide structure. SnakeViz renders a .prof file as an interactive icicle chart in the browser — the width of each block is its cumulative time:

Bash
snakeviz workload.prof

For a static, shareable artifact (CI logs, code review), gprof2dot converts stats to a Graphviz call graph:

Bash
# gprof2dot reads the cProfile format and emits a DOT graph; dot renders it.
python -m gprof2dot -f pstats workload.prof | dot -Tsvg -o callgraph.svg

Each node is colored by cumulative time, making the hot path visually obvious without scanning rows.

4. Sample a running process with py-spy

When the workload is a long-running process you cannot restart, attach py-spy by PID. py-spy top gives a live, top-like view of the hottest functions:

Bash
py-spy top --pid 48291

py-spy record writes a flame graph SVG over a sampling window:

Bash
# Sample PID 48291 for 30 seconds and write an interactive flame graph.
py-spy record --pid 48291 --duration 30 --output flame.svg

py-spy dump prints a one-shot snapshot of every thread's stack — the fastest way to see what a hung process is doing right now. Driving these subcommands against a live PID, including --native for C extensions and ptrace_scope permissions, is the focus of profiling a running process with py-spy.
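
When the attach needs to be triggered from tooling rather than typed by hand, the record invocation above can be assembled in Python. This is a hedged sketch: the helper names are hypothetical, only the --pid, --duration, and --output flags already shown are used, and the command is executed only if a py-spy binary is found on PATH.

```python
import os
import shutil
import subprocess
from typing import List, Optional


def py_spy_record_command(pid: int, duration_s: int, output: str) -> List[str]:
    """Build the `py-spy record` invocation as an argv list."""
    return ["py-spy", "record", "--pid", str(pid),
            "--duration", str(duration_s), "--output", output]


def record_if_available(pid: int, duration_s: int, output: str) -> Optional[int]:
    """Run py-spy if it is installed; return its exit code, or None if missing."""
    if shutil.which("py-spy") is None:
        return None
    completed = subprocess.run(
        py_spy_record_command(pid, duration_s, output), check=False
    )
    return completed.returncode


# Example: the command that would sample this very process for 30 seconds.
command = py_spy_record_command(os.getpid(), 30, "flame.svg")
print(" ".join(command))
```

Passing the arguments as a list avoids shell quoting issues; a non-zero return code from record_if_available usually points at the permission problems covered under Troubleshooting.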

5. Confirm the fix

After changing the suspect function, re-profile the same workload and compare cumtime on that function. A real win shows that number shrinking along with total runtime; if cumtime dropped but total runtime barely moved, the cost has shifted to another function or the function was never the bottleneck.

Verification

Confirm each tool actually measured what you intended:

  • cProfile captured the region: load the .prof and check the call you care about appears with a plausible ncalls. A zero or missing row means enable()/disable() bracketed the wrong code.
  • pstats sort is correct: under CUMULATIVE the top rows should be your entry point, preceded by the profiler's own exec wrapper rows if you used run or runctx, because the entry point transitively contains everything; under TIME the top row should be a leaf. If the entry point tops the TIME list, it is doing real work in its own body, not just dispatching.
  • py-spy attached: py-spy dump --pid <PID> should print live stacks. An empty or error result means a permissions or interpreter-detection problem, not an idle process.
  • The fix held: diff cumulative time before and after. Use pstats.Stats(old).sort_stats(SortKey.CUMULATIVE) and the same on the new profile.
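
The before/after comparison in the last check can be scripted. The sketch below profiles two in-memory runs instead of loading two .prof files, and uses a recursive and an iterative Fibonacci as stand-ins for the old and new implementation; cumtime_of and profile_call are hypothetical helpers, and in a real comparison the function name would normally be the same in both profiles.

```python
import cProfile
import pstats


def cumtime_of(profile: cProfile.Profile, func_name: str) -> float:
    """Return cumulative seconds recorded for the first function named func_name."""
    stats = pstats.Stats(profile).stats
    for (_file, _line, name), (_pc, _nc, _tt, cumtime, _callers) in stats.items():
        if name == func_name:
            return cumtime
    raise KeyError(f"{func_name!r} not found in profile")


def fib_recursive(n: int) -> int:
    return n if n < 2 else fib_recursive(n - 1) + fib_recursive(n - 2)


def fib_iterative(n: int) -> int:
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


def profile_call(func, *args) -> cProfile.Profile:
    """Profile a single call and return the populated profiler."""
    profiler = cProfile.Profile()
    profiler.runcall(func, *args)
    return profiler


before = cumtime_of(profile_call(fib_recursive, 20), "fib_recursive")
after = cumtime_of(profile_call(fib_iterative, 20), "fib_iterative")
print(f"before={before:.6f}s after={after:.6f}s")
```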

Troubleshooting

Symptom | Root cause | Fix
cProfile.run raises NameError for a variable | run execs in the __main__ namespace, which does not contain function locals | Use cProfile.runctx(stmt, globals(), locals())
Profile shows huge time in <built-in method> rows | Deterministic overhead inflates many tiny calls | Cross-check with py-spy; trust sampling for wall-clock attribution
py-spy exits with Permission denied / Operation not permitted | ptrace_scope restricts attaching | Run with sudo, grant CAP_SYS_PTRACE, or set kernel.yama.ptrace_scope=0
py-spy reports Failed to find python interpreter | Process is in a container or static build | Run py-spy inside the same namespace / container, or use --pid of the in-namespace PID
Flame graph is nearly empty or dominated by wait frames | App is I/O-bound, not CPU-bound; py-spy skips idle threads by default | Add --idle to include idle threads and see where they block, then profile the blocking call, not CPU
SnakeViz shows one giant block, no detail | Stats were not strip_dirs()'d or the region is one call | Profile a representative workload with many iterations
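
Before reaching for sudo, it can help to confirm that ptrace_scope is actually the obstacle. The sketch below reads the Yama setting from /proc on Linux; the helper name is hypothetical, the value descriptions paraphrase the kernel's Yama documentation, and on systems without Yama the function simply returns None.

```python
from pathlib import Path
from typing import Optional

PTRACE_SCOPE_PATH = Path("/proc/sys/kernel/yama/ptrace_scope")

MEANINGS = {
    0: "classic ptrace: any process of the same user can attach",
    1: "restricted: only a parent can attach to its descendants",
    2: "admin-only: attaching requires CAP_SYS_PTRACE",
    3: "no attach: ptrace is disabled until reboot",
}


def read_ptrace_scope(path: Path = PTRACE_SCOPE_PATH) -> Optional[int]:
    """Return the Yama ptrace_scope value, or None when it is unavailable
    (non-Linux systems, or kernels built without Yama)."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


scope = read_ptrace_scope()
if scope is None:
    print("ptrace_scope not available on this system")
else:
    print(f"ptrace_scope={scope}: {MEANINGS.get(scope, 'unknown value')}")
```

A value of 1 or higher means an unprivileged py-spy cannot attach to a process it did not start, which matches the Permission denied row above.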

Frequently Asked Questions

When should I use cProfile instead of py-spy? Use cProfile when you can run the code yourself and want exact, deterministic per-call counts. Use py-spy when you need to profile a process you cannot restart, such as a production worker, because it samples from outside the interpreter and adds almost no overhead.

Why does cProfile show a different hot function than py-spy? cProfile is deterministic and counts every call, so its overhead inflates functions that make many tiny calls. py-spy samples the call stack at a fixed frequency, so it reflects wall-clock time spent and is less skewed by call frequency. Trust py-spy for where wall time goes and cProfile for exact call counts.

Does py-spy require changing my application code? No. py-spy attaches to a running Python process by PID using OS debugging facilities and reads its stacks externally. You do not import anything or add hooks; you point py-spy at the PID and it produces top output, a flame graph, or a one-shot dump.

How do I profile only one function instead of the whole program? Wrap the call with cProfile.runctx, passing the expression and explicit globals and locals dicts, or use a cProfile.Profile object with enable() and disable() around the region. Then load the stats with pstats.Stats and sort by cumulative time.
