A service that meets its latency target in local benchmarks but burns CPU under production load is the canonical reason to reach for a profiler, and the first decision is which kind. Python ships a deterministic profiler, cProfile, that records every call and return — exact, reproducible, and ideal when you can run the workload yourself. It cannot help with a wedged production worker you must not restart; for that you need a sampling profiler such as py-spy that attaches to a live PID from outside the interpreter. This guide treats both as one toolkit: capture with cProfile, read the numbers correctly with pstats, visualize the call graph, and sample running processes with py-spy when the deterministic path is closed to you.
Prerequisites
- Python 3.8+ (cProfile, pstats, and cProfile.Profile are stdlib; runctx has existed since 2.x).
- py-spy >= 0.3.14 (pip install py-spy); 0.3.x added dump, --native, and reliable --pid attach.
- snakeviz >= 2.2 for interactive visualization (pip install snakeviz).
- gprof2dot >= 2024.6.6 plus Graphviz (dot) for static call graphs.
- On Linux, attaching py-spy to another process requires either CAP_SYS_PTRACE, running as root, or relaxing kernel.yama.ptrace_scope (covered under Troubleshooting).
Core concept
A deterministic profiler hooks the interpreter's call and return events, so it knows exactly how many times every function ran and how long each invocation took. That precision costs overhead — typically 2–5x slowdown — and the overhead is heaviest on functions that make many small calls, which can distort the picture. A sampling profiler instead interrupts the program at a fixed frequency (py-spy defaults to 100 Hz) and records the current call stack. It never touches the code path, adds negligible overhead, and reflects where wall-clock time is actually spent, at the cost of missing functions that run between samples. The mental model: cProfile answers "how many times and how long per call"; py-spy answers "where is wall time going right now".
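The contrast can be shown in miniature. The sketch below is a toy illustration only, not how cProfile or py-spy are implemented: it assumes sys.setprofile as a stand-in for the deterministic hook and a background thread polling sys._current_frames() as a stand-in for an external sampler. The helper names count_calls, sample_main_thread, and spin are hypothetical.

```python
import collections
import sys
import threading
import time


def fib(n: int) -> int:
    return n if n < 2 else fib(n - 1) + fib(n - 2)


def spin(seconds: float) -> int:
    """Busy-loop for roughly `seconds` so the sampler has something to see."""
    deadline = time.perf_counter() + seconds
    iterations = 0
    while time.perf_counter() < deadline:
        iterations += 1
    return iterations


def count_calls(func, *args):
    """Deterministic style: a hook fires on every Python-level call."""
    counts = collections.Counter()

    def hook(frame, event, arg):
        if event == "call":  # ignore return and C-function events
            counts[frame.f_code.co_name] += 1

    sys.setprofile(hook)
    try:
        func(*args)
    finally:
        sys.setprofile(None)
    return counts


def sample_main_thread(func, *args, interval=0.005):
    """Sampling style: peek at the calling thread's top frame on a timer."""
    samples = collections.Counter()
    target_id = threading.get_ident()
    done = threading.Event()

    def sampler():
        while not done.is_set():
            frame = sys._current_frames().get(target_id)
            if frame is not None:
                samples[frame.f_code.co_name] += 1
            time.sleep(interval)

    thread = threading.Thread(target=sampler, daemon=True)
    thread.start()
    try:
        func(*args)
    finally:
        done.set()
        thread.join()
    return samples


calls = count_calls(fib, 10)
samples = sample_main_thread(spin, 0.2)
print(calls["fib"])            # 177: exact, identical on every run
print(samples.most_common(3))  # approximate: varies with timing
```

The hook yields the same count every time, while the sample counts shift from run to run. That is the trade the two profiler families make.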
pstats reports two times per function that engineers routinely confuse. Total time (tottime) is time spent in the function body itself, excluding subcalls. Cumulative time (cumtime) includes everything the function transitively called. A dispatcher near the top of the stack has huge cumtime but tiny tottime; a tight numeric loop is the opposite. The distinction is important enough to have its own deep dive on interpreting cProfile cumulative vs total time.
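A small experiment makes the two columns concrete. This is a sketch under assumptions: dispatcher and leaf are hypothetical functions, and the code reads the stats attribute of pstats.Stats, which is a widely used implementation detail rather than a documented interface.

```python
import cProfile
import pstats


def leaf(n: int) -> int:
    # All of the real work happens here, in the function's own body.
    total = 0
    for i in range(n):
        total += i * i
    return total


def dispatcher() -> int:
    # Does almost nothing itself; it only delegates to leaf().
    result = 0
    for _ in range(20):
        result += leaf(50_000)
    return result


profiler = cProfile.Profile()
profiler.enable()
dispatcher()
profiler.disable()

# Each entry maps (file, line, name) to
# (primitive calls, total calls, tottime, cumtime, callers).
rows = {
    name: (ncalls, tottime, cumtime)
    for (_, _, name), (_, ncalls, tottime, cumtime, _)
    in pstats.Stats(profiler).stats.items()
}

for name in ("dispatcher", "leaf"):
    ncalls, tottime, cumtime = rows[name]
    print(f"{name:<12} ncalls={ncalls:<3} tottime={tottime:.4f} cumtime={cumtime:.4f}")
```

On a typical run, dispatcher shows a cumtime close to the whole run but a tottime near zero, while leaf's two values are nearly equal.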
Step-by-step implementation
1. Capture a deterministic profile with cProfile.run / runctx
cProfile.run(command, filename=None, sort=-1) executes a string of code under the profiler. Pass a filename to persist raw stats for later analysis instead of printing a table to stdout:
import cProfile


def fib(n: int) -> int:
    return n if n < 2 else fib(n - 1) + fib(n - 2)


def workload() -> int:
    return sum(fib(n) for n in range(30))


# Run the statement under the profiler and write raw stats to disk.
# The string is exec'd in the __main__ module's namespace, so run this file
# as a script for the name `workload` to resolve.
cProfile.run("workload()", filename="workload.prof")
run execs the statement in the __main__ module's global namespace, so it cannot see a function's locals or names defined only in another module. When the code you want to profile depends on local variables (common inside a test or a function), use runctx, which takes explicit globals and locals dicts:
import cProfile


def profile_request(payload: dict) -> None:
    # build_handler is assumed to be defined elsewhere in the application.
    handler = build_handler(payload)  # local objects the statement needs
    # runctx exec's "handler.run()" with these exact namespaces, so the local
    # `handler` resolves correctly where plain run() would raise NameError.
    cProfile.runctx("handler.run()", globals(), locals(), filename="request.prof")
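If the profiled statement also needs to hand back a value, one option is to pass an explicit dict as the locals namespace and read the result out of it afterwards. The sketch below assumes a hypothetical Handler class in place of whatever build_handler returns; the payload and file name are illustrative.

```python
import cProfile
import os
import pstats
import tempfile


class Handler:
    """Hypothetical stand-in for the object build_handler() would return."""

    def __init__(self, payload: dict) -> None:
        self.payload = payload

    def run(self) -> int:
        return sum(len(str(value)) for value in self.payload.values())


def profile_request(payload: dict, prof_path: str) -> int:
    handler = Handler(payload)
    # Names the statement assigns (here `result`) land in this dict,
    # so the return value survives the exec.
    namespace = {"handler": handler}
    cProfile.runctx("result = handler.run()", globals(), namespace, filename=prof_path)
    return namespace["result"]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "request.prof")
    result = profile_request({"id": 7, "name": "alice"}, path)
    profiled = {name for (_, _, name) in pstats.Stats(path).stats}

print(result)             # 6
print("run" in profiled)  # True
```

An explicit dict also avoids depending on how locals() behaves inside a function, which differs between Python versions.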
For surgical control — profiling a single hot region without wrapping it in a string — drive a Profile object directly:
import cProfile

profiler = cProfile.Profile()
profiler.enable()
result = expensive_pipeline(records)  # only this region is measured
profiler.disable()
profiler.dump_stats("pipeline.prof")  # raw stats for pstats / SnakeViz
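On Python 3.8 and later, a Profile object can also be used as a context manager, which pairs enable() and disable() automatically. A minimal sketch, with a hypothetical expensive_pipeline standing in for the real workload:

```python
import cProfile
import pstats


def expensive_pipeline(records):
    """Hypothetical workload standing in for the real pipeline."""
    return sorted(r * r for r in records)


records = list(range(10_000, 0, -1))

# Entering the block enables the profiler; leaving it disables the
# profiler again, even if the body raises.
with cProfile.Profile() as profiler:
    result = expensive_pipeline(records)

stats = pstats.Stats(profiler)
profiled = {name for (_, _, name) in stats.stats}
print("expensive_pipeline" in profiled)  # True
```

Only the code inside the with block is measured, so building records is excluded from the profile.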
Avoid profiling under pytest collection: the deterministic overhead skews fixture setup. Profile the function under test directly, or use the pytest fixture scoping rules to isolate the call inside a function-scoped fixture so collection time is excluded.
2. Sort and read the stats with pstats
A .prof file is raw; pstats.Stats turns it into a sortable table. Strip directory prefixes so function names are legible, then sort:
import pstats
from pstats import SortKey
stats = pstats.Stats("workload.prof")
stats.strip_dirs() # drop absolute paths from names
stats.sort_stats(SortKey.CUMULATIVE) # rank by cumtime (own time plus subcalls)
stats.print_stats(10) # top 10 rows
# A second pass by total time surfaces leaf hotspots, not just hot callers.
stats.sort_stats(SortKey.TIME).print_stats(10)
Sort by SortKey.CUMULATIVE to find which entry points dominate the run, and by SortKey.TIME (total time) to find the leaf functions actually burning CPU. The ncalls, percall, tottime, and cumtime columns are dissected in the dedicated guide linked above.
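Two further pstats features are often handy when reading a profile: redirecting the report into a string, and restricting rows with a regular expression. The sketch below assumes a small fib workload purely for illustration; the stream argument, print_stats restrictions, and print_callers are part of the standard pstats module.

```python
import cProfile
import io
import pstats
from pstats import SortKey


def fib(n: int) -> int:
    return n if n < 2 else fib(n - 1) + fib(n - 2)


def workload() -> int:
    return sum(fib(n) for n in range(15))


with cProfile.Profile() as profiler:
    workload()

# Send the report to an in-memory buffer instead of stdout, so it can be
# logged, attached to a ticket, or searched.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.strip_dirs().sort_stats(SortKey.CUMULATIVE)

stats.print_stats(r"fib")    # a string restriction is a regex on the row label
stats.print_callers(r"fib")  # which functions called the matching rows
report = buffer.getvalue()
print(report)
```

A restriction can also be an integer (a row limit) or a float between 0 and 1 (a fraction of rows), and several restrictions can be chained in one call.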
3. Visualize the call graph
Tables hide structure. SnakeViz renders a .prof file as an interactive icicle chart in the browser — the width of each block is its cumulative time:
snakeviz workload.prof
For a static, shareable artifact (CI logs, code review), gprof2dot converts stats to a Graphviz call graph:
# gprof2dot reads the cProfile format and emits a DOT graph; dot renders it.
python -m gprof2dot -f pstats workload.prof | dot -Tsvg -o callgraph.svg
Each node is colored by cumulative time, making the hot path visually obvious without scanning rows.
4. Sample a running process with py-spy
When the workload is a long-running process you cannot restart, attach py-spy by PID. py-spy top gives a live, top-like view of the hottest functions:
py-spy top --pid 48291
py-spy record writes a flame graph SVG over a sampling window:
# Sample PID 48291 for 30 seconds and write an interactive flame graph.
py-spy record --pid 48291 --duration 30 --output flame.svg
py-spy dump prints a one-shot snapshot of every thread's stack — the fastest way to see what a hung process is doing right now. Driving these subcommands against a live PID, including --native for C extensions and ptrace_scope permissions, is the focus of profiling a running process with py-spy.
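When captures need to be scripted, for example from an on-call runbook, the record command can be assembled and launched from Python. This is a rough sketch: py_spy_record_cmd and record are hypothetical helper names, and only the flags already mentioned in this guide (--pid, --duration, --output, --native) are used.

```python
import shutil
import subprocess
from typing import List


def py_spy_record_cmd(pid: int, seconds: int = 30, output: str = "flame.svg",
                      native: bool = False) -> List[str]:
    """Build the argument list for `py-spy record` against a live PID."""
    cmd = ["py-spy", "record", "--pid", str(pid),
           "--duration", str(seconds), "--output", output]
    if native:
        cmd.append("--native")  # also sample frames from C extensions
    return cmd


def record(pid: int, **options) -> int:
    """Run py-spy if it is installed and return its exit code."""
    if shutil.which("py-spy") is None:
        raise RuntimeError("py-spy is not on PATH; install it with pip install py-spy")
    return subprocess.run(py_spy_record_cmd(pid, **options), check=False).returncode


print(" ".join(py_spy_record_cmd(48291)))
# py-spy record --pid 48291 --duration 30 --output flame.svg
```

Passing the arguments as a list avoids shell quoting problems, and a non-zero exit code from record() usually points at the permission issues covered under Troubleshooting.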
5. Confirm the fix
After changing the suspect function, re-profile the same workload and compare cumtime on that function. A real win shows that number shrinking along with total runtime; if the function's cumtime dropped but total runtime barely moved, the cost relocated elsewhere or the function was never the bottleneck.
Verification
Confirm each tool actually measured what you intended:
- cProfile captured the region: load the .prof and check the call you care about appears with a plausible ncalls. A zero or missing row means enable()/disable() bracketed the wrong code.
- pstats sort is correct: the first row under CUMULATIVE should be your entry point (it transitively contains everything); under TIME it should be a leaf. If the entry point tops the TIME list, it is doing real work in its own body, not just dispatching.
- py-spy attached: py-spy dump --pid <PID> should print live stacks. An empty or error result means a permissions or interpreter-detection problem, not an idle process.
- The fix held: diff cumulative time before and after. Use pstats.Stats(old).sort_stats(SortKey.CUMULATIVE) and the same on the new profile.
Troubleshooting
| Symptom | Root cause | Fix |
|---|---|---|
| cProfile.run raises NameError for a variable | run execs in the __main__ module's namespace, not the caller's scope | Use cProfile.runctx(stmt, globals(), locals()) |
| Profile shows huge time in <built-in method> rows | Deterministic overhead inflates many tiny calls | Cross-check with py-spy; trust sampling for wall-clock attribution |
| py-spy exits with Permission denied / Operation not permitted | ptrace_scope restricts attaching | Run with sudo, grant CAP_SYS_PTRACE, or set kernel.yama.ptrace_scope=0 |
| py-spy reports Failed to find python interpreter | Process is in a container or static build | Run py-spy inside the same namespace / container, or use --pid of the in-namespace PID |
| Flame graph is all idle / wait frames | App is I/O-bound, not CPU-bound | Add --idle to include idle threads, or profile the blocking call, not CPU |
| SnakeViz shows one giant block, no detail | Stats were not strip_dirs()'d or the region is one call | Profile a representative workload with many iterations |
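Before reaching for sudo, it can help to check what the kernel currently allows. The sketch below reads the Yama setting on Linux; read_ptrace_scope and the MEANINGS table are illustrative names, and the descriptions paraphrase the kernel's Yama documentation.

```python
from pathlib import Path
from typing import Optional

PTRACE_SCOPE = Path("/proc/sys/kernel/yama/ptrace_scope")

MEANINGS = {
    0: "classic: a process may attach to others owned by the same user",
    1: "restricted: only a parent may attach to its descendants",
    2: "admin-only: attaching requires CAP_SYS_PTRACE",
    3: "disabled: no process may attach",
}


def read_ptrace_scope(path: Path = PTRACE_SCOPE) -> Optional[int]:
    """Return the current ptrace_scope, or None if it cannot be read."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None  # not Linux, or Yama is not enabled


scope = read_ptrace_scope()
if scope is None:
    print("ptrace_scope not available on this system")
else:
    print(f"ptrace_scope={scope} ({MEANINGS.get(scope, 'unknown value')})")
```

A value of 1 or higher is the usual reason an unprivileged py-spy --pid attach fails with a permission error.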
Frequently Asked Questions
When should I use cProfile instead of py-spy?
Use cProfile when you can run the code yourself and want exact, deterministic per-call counts. Use py-spy when you need to profile a process you cannot restart, such as a production worker, because it samples from outside the interpreter and adds almost no overhead.
Why does cProfile show a different hot function than py-spy?
cProfile is deterministic and counts every call, so its overhead inflates functions that make many tiny calls. py-spy samples the call stack at a fixed frequency, so it reflects wall-clock time spent and is less skewed by call frequency. Trust py-spy for where wall time goes and cProfile for exact call counts.
Does py-spy require changing my application code?
No. py-spy attaches to a running Python process by PID using OS debugging facilities and reads its stacks externally. You do not import anything or add hooks; you point py-spy at the PID and it produces top output, a flame graph, or a one-shot dump.
How do I profile only one function instead of the whole program?
Wrap the call with cProfile.runctx, passing the expression and explicit globals and locals dicts, or use a cProfile.Profile object with enable() and disable() around the region. Then load the stats with pstats.Stats and sort by cumulative time.
Related guides
- Pin down the column meanings in interpreting cProfile cumulative vs total time before you trust a sort order.
- When the target is a live worker, profiling a running process with py-spy covers attach, flame graphs, and ptrace permissions.
- Pair CPU work with memory profiling using tracemalloc when high CPU is actually GC pressure from allocation churn.
- For async services, a CPU hotspot is often a blocked event loop — see debugging async code and event loops.
- When the slow path lives in a test suite, interactive debugging with pdb and ipdb helps you reach the call site before profiling it.