CPU Profiling with cProfile and py-spy

A service that meets its latency target in local benchmarks but burns CPU under production load is the canonical reason to reach for a profiler, and the first decision is which kind. Python ships a deterministic profiler, cProfile, that records every call and return — exact, reproducible, and ideal when you can run the workload yourself. It cannot help with a wedged production worker you must not restart; for that you need a sampling profiler such as py-spy that attaches to a live PID from outside the interpreter. This guide treats both as one toolkit: capture with cProfile, read the numbers correctly with pstats, visualize the call graph, and sample running processes with py-spy when the deterministic path is closed to you.

Prerequisites

Python 3.8+ (cProfile, pstats, and cProfile.Profile are stdlib; runctx has existed since 2.x).
py-spy >= 0.3.14 (pip install py-spy); 0.3.x added dump, --native, and reliable --pid attach.
snakeviz >= 2.2 for interactive visualization (pip install snakeviz).
gprof2dot >= 2024.6.6 plus Graphviz (dot) for static call graphs.
On Linux, attaching py-spy to another process requires either CAP_SYS_PTRACE, running as root, or relaxing kernel.yama.ptrace_scope — covered under Troubleshooting.

Core concept

A deterministic profiler hooks the interpreter's call and return events, so it knows exactly how many times every function ran and how long each invocation took. That precision costs overhead — typically 2–5x slowdown — and the overhead is heaviest on functions that make many small calls, which can distort the picture. A sampling profiler instead interrupts the program at a fixed frequency (py-spy defaults to 100 Hz) and records the current call stack. It never touches the code path, adds negligible overhead, and reflects where wall-clock time is actually spent, at the cost of missing functions that run between samples. The mental model: cProfile answers "how many times and how long per call"; py-spy answers "where is wall time going right now".

pstats reports two times per function that engineers routinely confuse. Total time (tottime) is time spent in the function body itself, excluding subcalls. Cumulative time (cumtime) includes everything the function transitively called. A dispatcher near the top of the stack has huge cumtime but tiny tottime; a tight numeric loop is the opposite. The distinction is important enough to have its own deep dive on interpreting cProfile cumulative vs total time.

The same run seen two ways: cumtime (light span) counts a frame plus everything it called, so the top-level dispatcher looks huge; tottime (solid segment) counts only the frame's own body, so it collapses to almost nothing for a dispatcher and to the whole bar for the leaf loop that actually burns CPU.
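
A small, self-contained run makes the two columns concrete. This is a sketch rather than part of the workflow above: dispatcher and leaf_loop are hypothetical names, and the code reads the stats mapping on a pstats.Stats object, which is an implementation detail that has been stable in practice rather than a documented guarantee.

```python
import cProfile
import pstats


def leaf_loop(n: int) -> int:
    # Tight loop: nearly all of its time is spent in its own body (tottime).
    total = 0
    for i in range(n):
        total += i * i
    return total


def dispatcher() -> int:
    # Does almost nothing itself; its cumtime is inherited from leaf_loop.
    total = 0
    for _ in range(5):
        total += leaf_loop(200_000)
    return total


profiler = cProfile.Profile()
profiler.enable()
dispatcher()
profiler.disable()

stats = pstats.Stats(profiler)
# Keys are (file, line, name); values are
# (primitive calls, total calls, tottime, cumtime, callers).
by_name = {key[2]: value for key, value in stats.stats.items()}
_, leaf_calls, leaf_tottime, leaf_cumtime, _ = by_name["leaf_loop"]
_, _, disp_tottime, disp_cumtime, _ = by_name["dispatcher"]

print(f"leaf_loop   calls={leaf_calls} tottime={leaf_tottime:.4f} cumtime={leaf_cumtime:.4f}")
print(f"dispatcher  tottime={disp_tottime:.4f} cumtime={disp_cumtime:.4f}")
```

The dispatcher row should show a cumtime at least as large as leaf_loop's and a tottime close to zero, which is the pattern to look for when deciding whether a high-ranked function is a real hotspot or just a caller of one.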

Sampling has its own statistical caveat worth internalizing before you trust a flame graph. py-spy estimates a function's share of wall time from the fraction of samples in which that function was on the stack, so the confidence interval on any single frame narrows with the number of samples collected, not with the length of the code. At the default 100 Hz a 3-second capture yields roughly 300 samples — enough to spot the dominant hotspot but too noisy to rank two functions that each account for ~5% of runtime. Increase --rate or, more reliably, lengthen --duration when you need to distinguish close contenders. The corollary: a function that never appears in the profile is not proven cheap, only proven to run for less than roughly one sample interval per invocation.
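
The sample-count argument can be made numeric with a rough binomial estimate. The sketch below is an approximation under stated assumptions: it treats samples as independent (real stack samples are correlated, so the true interval is somewhat wider) and uses the normal approximation with z = 1.96 for a 95% interval. The helper name share_margin is hypothetical.

```python
import math


def share_margin(share: float, rate_hz: float, duration_s: float, z: float = 1.96) -> float:
    """Approximate half-width of the confidence interval on a frame's sample share."""
    samples = rate_hz * duration_s
    return z * math.sqrt(share * (1.0 - share) / samples)


short = share_margin(0.05, rate_hz=100, duration_s=3)     # about 300 samples
longer = share_margin(0.05, rate_hz=250, duration_s=30)   # about 7500 samples
print(f"3 s @ 100 Hz:  5% +/- {short * 100:.1f} points")
print(f"30 s @ 250 Hz: 5% +/- {longer * 100:.1f} points")
```

Under these assumptions a 5% frame in a 300-sample capture carries a margin of about 2.5 percentage points, so two such frames cannot be ranked; at 7500 samples the margin drops to about half a point.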

One scope limit trips up almost everyone. cProfile instruments the interpreter thread that called enable() and does not automatically follow execution into other threads or into multiprocessing/subprocess children. A profile of a thread-pool dispatcher will show the pool's join/wait as a giant cumtime sink, with the real work invisible because it ran on worker threads the profiler never hooked. py-spy behaves the opposite way: it samples every thread of the target by default, and py-spy record --subprocesses additionally follows child processes, making it the correct tool the moment concurrency enters the picture.

cProfile instruments every call for exact counts at the cost of overhead; py-spy samples the stack from outside the interpreter, reflecting wall-clock time with negligible cost and no restart.

Step-by-step implementation

1. Capture a deterministic profile with cProfile.run / runctx

cProfile.run(statement, filename=None) executes a string of code under the profiler. Pass a filename to persist raw stats for later analysis instead of dumping a table to stdout:

import cProfile

def fib(n: int) -> int:
    return n if n < 2 else fib(n - 1) + fib(n - 2)

def workload() -> int:
    return sum(fib(n) for n in range(30))

# Run the statement under the profiler and write raw stats to disk.
# The string is exec'd in the __main__ module's namespace, so it resolves
# names defined at the top level of the script being run.
cProfile.run("workload()", filename="workload.prof")

run execs the statement in the __main__ module's global namespace, so it cannot see a function's local variables. When the code you want to profile depends on local variables, which is common inside a test or a function, use runctx, which takes explicit globals and locals dicts:

import cProfile

def profile_request(payload: dict) -> None:
    handler = build_handler(payload)          # local objects the statement needs
    # runctx exec's "handler.run()" with these exact namespaces, so the local
    # `handler` resolves correctly where plain run() would raise NameError.
    cProfile.runctx("handler.run()", globals(), locals(), filename="request.prof")
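
For a version that runs as-is, the sketch below swaps the handler for a trivial workload. square_sum, profile_with_locals, and the temporary output path are hypothetical stand-ins chosen only to make the namespace behavior visible.

```python
import cProfile
import os
import pstats
import tempfile


def square_sum(values):
    return sum(v * v for v in values)


def profile_with_locals() -> str:
    data = list(range(50_000))                    # local the statement depends on
    out_path = os.path.join(tempfile.mkdtemp(), "locals.prof")
    # globals() supplies square_sum; locals() supplies data.
    cProfile.runctx("square_sum(data)", globals(), locals(), filename=out_path)
    return out_path


prof_path = profile_with_locals()
profiled_names = {key[2] for key in pstats.Stats(prof_path).stats}
print("square_sum" in profiled_names)
```

Note that locals() inside a function returns a snapshot dict, so any name the profiled statement assigns lands in that dict rather than in the function's real local variables.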

For surgical control — profiling a single hot region without wrapping it in a string — drive a Profile object directly:

import cProfile

profiler = cProfile.Profile()
profiler.enable()
try:
    result = expensive_pipeline(records)      # only this region is measured
finally:
    profiler.disable()                        # stop profiling even if the call raises
profiler.dump_stats("pipeline.prof")          # raw stats for pstats / SnakeViz
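
On Python 3.8 and later, cProfile.Profile also works as a context manager, which brackets the region and ensures the profiler is disabled even if the body raises. The sketch below assumes a hypothetical parse_records function as the hot region.

```python
import cProfile
import pstats


def parse_records(count: int) -> list:
    return [str(i).zfill(8) for i in range(count)]


# Setup outside the with-block is not measured.
record_count = 20_000

with cProfile.Profile() as profiler:      # enabled on entry, disabled on exit
    parsed = parse_records(record_count)

region_stats = pstats.Stats(profiler)
region_names = {key[2] for key in region_stats.stats}
print(sorted(name for name in region_names if name == "parse_records"))
```

The with-block form keeps the measured region visually obvious in the source, which helps when the bracketing itself is what you are trying to get right.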

Avoid profiling under pytest collection: the deterministic overhead skews fixture setup. Profile the function under test directly, or use pytest's fixture scoping rules to isolate the call inside a function-scoped fixture so collection time is excluded.

2. Sort and read the stats with pstats

A .prof file is raw; pstats.Stats turns it into a sortable table. Strip directory prefixes so function names are legible, then sort:

import pstats
from pstats import SortKey

stats = pstats.Stats("workload.prof")
stats.strip_dirs()                            # drop absolute paths from names
stats.sort_stats(SortKey.CUMULATIVE)          # rank by total + subcall time
stats.print_stats(10)                         # top 10 rows

# A second pass by total time surfaces leaf hotspots, not just hot callers.
stats.sort_stats(SortKey.TIME).print_stats(10)

Sort by SortKey.CUMULATIVE to find which entry points dominate the run, and by SortKey.TIME (total time) to find the leaf functions actually burning CPU. The ncalls, percall, tottime, and cumtime columns are dissected in the dedicated guide linked above.
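
Two further pstats features are often useful once the basic sort works: restricting the report to matching functions, and listing callers. The sketch below is illustrative only; tokenize and count_words are hypothetical functions, and the report is sent to an in-memory buffer through the stream argument so it can be logged or asserted on instead of printed.

```python
import cProfile
import io
import pstats
from pstats import SortKey


def tokenize(text: str) -> list:
    return text.split()


def count_words(lines: list) -> int:
    return sum(len(tokenize(line)) for line in lines)


profiler = cProfile.Profile()
profiler.enable()
count_words(["alpha beta gamma"] * 5_000)
profiler.disable()

buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)    # report goes to buffer, not stdout
stats.strip_dirs().sort_stats(SortKey.TIME)
stats.print_stats("tokenize|count_words")        # regex restriction on the function column
stats.print_callers("tokenize")                  # which frames called tokenize, and how often
report = buffer.getvalue()
print(report)
```

A string passed to print_stats is treated as a regular expression, while an integer limits the row count, so the two can be combined to trim a large report.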

3. Visualize the call graph

Tables hide structure. SnakeViz renders a .prof file as an interactive icicle chart in the browser — the width of each block is its cumulative time:

snakeviz workload.prof

For a static, shareable artifact (CI logs, code review), gprof2dot converts stats to a Graphviz call graph:

# gprof2dot reads the cProfile format and emits a DOT graph; dot renders it.
python -m gprof2dot -f pstats workload.prof | dot -Tsvg -o callgraph.svg

Each node is colored by cumulative time, making the hot path visually obvious without scanning rows.

4. Sample a running process with py-spy

When the workload is a long-running process you cannot restart, attach py-spy by PID. py-spy top gives a live, top-like view of the hottest functions:

py-spy top --pid 48291

py-spy record writes a flame graph SVG over a sampling window. In a flame graph the x-axis is not time — it is the proportion of samples — so a wide frame is a function the interpreter sat inside often, and its stacked children show where that time drained. Read it top-down from the widest tower to find the hot path; narrow spikes are noise at the default rate:

# Sample PID 48291 for 30 seconds, following forked workers and
# raising the rate to 250 Hz to separate near-equal frames.
py-spy record --pid 48291 --duration 30 --rate 250 --subprocesses --output flame.svg

py-spy dump prints a one-shot snapshot of every thread's stack — the fastest way to see what a hung process is doing right now, and unlike record it returns immediately rather than sampling over a window. Driving these subcommands against a live PID, including --native for C extensions and ptrace_scope permissions, is the focus of profiling a running process with py-spy.
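
When captures are triggered from incident tooling rather than typed by hand, it can help to build the command line in code. The sketch below is one possible wrapper, not part of py-spy: py_spy_record_cmd and run_capture are hypothetical helpers, and they use only the record flags already shown on this page plus --idle. Running the capture still requires py-spy on PATH and permission to attach.

```python
import shlex
import subprocess
from typing import List


def py_spy_record_cmd(pid: int, output: str, duration: int = 30, rate: int = 100,
                      subprocesses: bool = False, idle: bool = False) -> List[str]:
    """Build the argument list for a py-spy record capture."""
    cmd = ["py-spy", "record", "--pid", str(pid), "--duration", str(duration),
           "--rate", str(rate), "--output", output]
    if subprocesses:
        cmd.append("--subprocesses")    # follow child processes
    if idle:
        cmd.append("--idle")            # include threads that are waiting
    return cmd


def run_capture(pid: int, output: str, **options) -> int:
    """Run py-spy and return its exit code (not invoked in this sketch)."""
    return subprocess.run(py_spy_record_cmd(pid, output, **options)).returncode


cmd = py_spy_record_cmd(48291, "flame.svg", duration=30, rate=250, subprocesses=True)
print(shlex.join(cmd))
```

Passing the arguments as a list avoids shell quoting problems with output paths, and a nonzero exit code from run_capture is the cue to check the permission issues covered under Troubleshooting.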

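5. Profile threads and child processes

Because cProfile only hooks the thread that called enable(), profiling a thread pool, or an event loop that hands work to executor threads, from the main thread hides the real work. There is no one-line way to make worker threads inherit the profiler: threading.setprofile expects a trace function, and cProfile.Profile does not expose one (the dispatcher attribute belongs to the pure-Python profile.Profile, not to cProfile). The dependable deterministic option is to profile the worker function in isolation, running the same tasks serially on the profiled thread:

import cProfile

profiler = cProfile.Profile()
profiler.enable()
# handle_task is a placeholder for the callable the pool would dispatch.
# Running it serially on this thread means every call is hooked, instead of
# the work showing up as an opaque join().
for task in tasks:
    handle_task(task)
profiler.disable()
profiler.dump_stats("pool.prof")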

For multiprocessing or subprocess workers, cProfile cannot cross the process boundary at all — profile inside the child (wrap its entry point in cProfile.run, writing one .prof per PID) or switch to py-spy record --subprocesses, which samples the whole process tree from outside. This is the same wall-clock-versus-instrumentation trade-off from the core concept, now decided by process topology rather than by whether you can restart the target.
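
The one-file-per-PID approach might look like the sketch below. It is a minimal illustration under assumptions: crunch stands in for the real worker body, profiled_entry and merge_profiles are hypothetical helpers, and the entry point is called in-process here so the example runs anywhere. In a real deployment profiled_entry would be the target handed to the child process.

```python
import cProfile
import os
import pstats
import tempfile


def crunch(n: int) -> int:
    return sum(i % 7 for i in range(n))


def profiled_entry(out_dir: str, n: int) -> str:
    """Child entry point: profile crunch() and dump stats named after this PID."""
    profiler = cProfile.Profile()
    path = os.path.join(out_dir, f"worker-{os.getpid()}.prof")
    try:
        profiler.runcall(crunch, n)
    finally:
        profiler.dump_stats(path)      # written even if the worker raises
    return path


def merge_profiles(paths: list) -> pstats.Stats:
    # pstats.Stats accepts several files and sums matching rows.
    return pstats.Stats(*paths)


out_dir = tempfile.mkdtemp()
paths = [profiled_entry(out_dir, 100_000)]     # in-process call, for illustration
merged = merge_profiles(paths)
merged_names = {key[2] for key in merged.stats}
print(sorted(os.path.basename(p) for p in paths))
```

Merging sums call counts and times across workers, which answers "where does the fleet spend CPU" but hides per-worker skew; keep the individual files if one worker is suspected of being the outlier.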

6. Confirm the fix

After changing the suspect function, re-profile the same workload and compare cumtime on that function. A real win shows the number shrinking; if total runtime barely moved, the hotspot relocated and you optimized the wrong thing.
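
The before-and-after comparison can be scripted so it is repeatable. The sketch below is illustrative: dedupe is a hypothetical suspect function with a quadratic "before" path and a linear "after" path, and cumtime_of is a hypothetical helper that reads cumtime from the stats mapping on pstats.Stats, an implementation detail that has been stable in practice.

```python
import cProfile
import os
import pstats
import tempfile


def cumtime_of(prof_path: str, func_name: str) -> float:
    """Sum cumtime over all rows whose function name matches func_name."""
    stats = pstats.Stats(prof_path)
    return sum(row[3] for key, row in stats.stats.items() if key[2] == func_name)


def dedupe(items, use_dict: bool):
    if use_dict:
        return list(dict.fromkeys(items))      # after: linear-time dedupe
    seen = []
    for item in items:                         # before: quadratic membership test
        if item not in seen:
            seen.append(item)
    return seen


def profile_variant(use_dict: bool, path: str) -> None:
    items = list(range(1_500)) * 2             # same workload for both runs
    profiler = cProfile.Profile()
    profiler.runcall(dedupe, items, use_dict)
    profiler.dump_stats(path)


work_dir = tempfile.mkdtemp()
before_path = os.path.join(work_dir, "before.prof")
after_path = os.path.join(work_dir, "after.prof")
profile_variant(False, before_path)
profile_variant(True, after_path)

before = cumtime_of(before_path, "dedupe")
after = cumtime_of(after_path, "dedupe")
print(f"dedupe cumtime: {before:.4f}s -> {after:.4f}s")
```

Holding the workload constant between the two runs is the important design choice: a cumtime drop only means something if both profiles executed the same inputs.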

Verification

Confirm each tool actually measured what you intended:

cProfile captured the region: load the .prof and check the call you care about appears with a plausible ncalls. A zero or missing row means enable()/disable() bracketed the wrong code.
pstats sort is correct: the first row under CUMULATIVE should be your entry point (it transitively contains everything); under TIME it should be a leaf. If the entry point tops the TIME list, it is doing real work in its own body, not just dispatching.
py-spy attached: py-spy dump --pid <PID> should print live stacks. An empty or error result means a permissions or interpreter-detection problem, not an idle process.
The fix held: diff cumulative time before and after. Use pstats.Stats(old).sort_stats(SortKey.CUMULATIVE) and the same on the new profile.

Troubleshooting

Symptom	Root cause	Fix
`cProfile.run` raises `NameError` for a variable	`run` execs in the `__main__` module's namespace, so function locals are invisible	Use `cProfile.runctx(stmt, globals(), locals())`
Profile shows huge time in `<built-in method>` rows	Deterministic overhead inflates many tiny calls	Cross-check with py-spy; trust sampling for wall-clock attribution
`py-spy` exits with `Permission denied` / `Operation not permitted`	`ptrace_scope` restricts attaching	Run with sudo, grant `CAP_SYS_PTRACE`, or set `kernel.yama.ptrace_scope=0`
`py-spy` reports `Failed to find python interpreter`	Process is in a container or static build	Run py-spy inside the same namespace / container, or use `--pid` of the in-namespace PID
Flame graph is nearly empty or shows only `wait` frames	App is I/O-bound, not CPU-bound; py-spy skips idle threads by default	Add `--idle` to see where threads block, or profile the blocking call, not CPU
cProfile shows work as one big `join`/`wait` row	Real work runs on threads/child processes the profiler never hooked	Profile the worker function directly for threads; profile inside the child or use `py-spy record --subprocesses`
py-spy ranks two hot functions inconsistently between runs	Too few samples — sampling error swamps the gap	Raise `--rate` or lengthen `--duration` to tighten the estimate
SnakeViz shows one giant block, no detail	The profiled region is a single call with few Python-level subcalls	Profile a representative workload with many iterations

Choosing between a deterministic and a sampling profiler

The two profilers on this page answer different questions, and using the wrong one wastes an afternoon.

cProfile is deterministic: it instruments every call, so its counts are exact and its overhead is proportional to call volume. That makes it the right tool when the question is how many times something runs, or when the hot path is spread across many small functions whose individual cost is invisible to sampling. The cost is distortion: a workload dominated by millions of cheap calls can run two to five times slower under cProfile, and that slowdown is not uniform, so the profile over-represents call-heavy code.

py-spy samples the interpreter's stack from a separate process at a fixed rate. Overhead on the target is near zero, it needs no code change and no restart, and it can attach to a process that is already misbehaving in production. In exchange the numbers are statistical: a function that never appears in a sample is not necessarily fast, it may simply be rare, and call counts do not exist at all.

The practical rule is to start with the one that matches your access to the process. If you can reproduce the slowness locally in a script, run cProfile and get exact numbers. If the problem only appears in production, or the process is already stuck, attach py-spy and take the statistical answer — it is available in seconds and usually points at the same function.

Two failure modes are worth naming. Profiling under cProfile inside a test suite measures the suite's overhead as much as the code, so profile a script that calls the hot path directly. And sampling a process that spends its time waiting — on a lock, a socket, or a subprocess — produces a flat profile that looks like nothing is happening; that is the signal to switch to py-spy dump for a one-shot stack, or to add --idle so waiting threads are included rather than skipped.

Exactness and non-intrusiveness are the trade: choose by whether you can reproduce the workload or must observe it where it lives.

Frequently Asked Questions

When should I use cProfile instead of py-spy? Use cProfile when you can run the code yourself and want exact, deterministic per-call counts. Use py-spy when you need to profile a process you cannot restart, such as a production worker, because it samples from outside the interpreter and adds almost no overhead.

Why does cProfile show a different hot function than py-spy? cProfile is deterministic and counts every call, so its overhead inflates functions that make many tiny calls. py-spy samples the call stack at a fixed frequency, so it reflects wall-clock time spent and is less skewed by call frequency. Trust py-spy for where wall time goes and cProfile for exact call counts.

Does py-spy require changing my application code? No. py-spy attaches to a running Python process by PID using OS debugging facilities and reads its stacks externally. You do not import anything or add hooks; you point py-spy at the PID and it produces top output, a flame graph, or a one-shot dump.

How do I profile only one function instead of the whole program? Wrap the call with cProfile.runctx, passing the expression and explicit globals and locals dicts, or use a cProfile.Profile object with enable() and disable() around the region. Then load the stats with pstats.Stats and sort by cumulative time.

Pin down the column meanings in interpreting cProfile cumulative vs total time before you trust a sort order.
When the target is a live worker, profiling a running process with py-spy covers attach, flame graphs, and ptrace permissions.
Pair CPU work with memory profiling using tracemalloc when high CPU is actually GC pressure from allocation churn.
For async services, a CPU hotspot is often a blocked event loop — see debugging async code and event loops.
When the slow path lives in a test suite, interactive debugging with pdb and ipdb helps you reach the call site before profiling it.
Reading flame graphs from py-spy output — width is samples, not calls, and four shapes cover every reading.
