Profiling a Running Process with py-spy

You have a Python worker pegging a CPU in production, and restarting it to add instrumentation is not an option: you need to see what it is doing right now, from outside, without touching its code. Unlike cProfile, which has to be enabled from inside the process (by wrapping the code or launching it under python -m cProfile), py-spy attaches to an already-running interpreter by PID, reads the call stacks through OS debugging facilities, and renders them as a live view, a flame graph, or a one-shot dump. This guide covers the three subcommands and the Linux ptrace_scope permission wall that blocks the first attempt on most hosts.

Prerequisites

py-spy >= 0.3.14 (pip install py-spy); 0.3.x is required for reliable --pid attach, dump, and --native.
A running CPython process you can identify by PID.
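On Linux: root or CAP_SYS_PTRACE to attach to a process you do not own. With kernel.yama.ptrace_scope=0 you can attach to your own processes without elevation; at the common default of 1, even those need sudo. In Docker, run the container with --cap-add SYS_PTRACE.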
macOS requires running py-spy with sudo.

Solution

First locate the worker's PID — pick the actual interpreter, not a shell or supervisor parent:

pgrep -f 'gunicorn.*worker' || ps -eo pid,cmd | grep '[p]ython'
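
If the one-liner matches too much, the same lookup can be sketched in Python. The helper below parses `ps -eo pid,ppid,args` output and returns the direct children of a given master PID. The function names and the PIDs in the usage note are illustrative, not part of py-spy or gunicorn.

```python
import subprocess


def parse_ps(text):
    """Parse `ps -eo pid,ppid,args` output into (pid, ppid, args) tuples."""
    rows = []
    for line in text.splitlines()[1:]:  # skip the header row
        parts = line.split(None, 2)
        if len(parts) < 2 or not (parts[0].isdigit() and parts[1].isdigit()):
            continue
        args = parts[2].rstrip() if len(parts) == 3 else ""
        rows.append((int(parts[0]), int(parts[1]), args))
    return rows


def children_of(master_pid, rows):
    """Return (pid, args) for every process whose parent is master_pid."""
    return [(pid, args) for pid, ppid, args in rows if ppid == master_pid]


def list_workers(master_pid):
    """Run ps and list the direct children of master_pid."""
    out = subprocess.run(
        ["ps", "-eo", "pid,ppid,args"],
        capture_output=True, text=True, check=True,
    ).stdout
    return children_of(master_pid, parse_ps(out))
```

Calling `list_workers(4810)` with a hypothetical master PID of 4810 would return its worker PIDs, any of which is a sensible target for --pid.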

py-spy top gives a live, auto-refreshing view of the hottest functions in that process:

# Live top-style view; %Own is self time, %Total includes subcalls.
py-spy top --pid 48291

py-spy record samples for a window and writes an interactive flame graph SVG:

# Sample for 30s and write a flame graph. --native unwinds C/Rust frames
# (e.g. numpy internals) so they are not collapsed into one built-in row.
py-spy record --pid 48291 --duration 30 --native --output flame.svg

py-spy dump prints a single snapshot of every thread's current stack — the fastest answer to "what is this hung process stuck on":

# One-shot stack dump of all threads; no sampling window needed.
py-spy dump --pid 48291

If attach is refused on Linux, the cause is almost always ptrace_scope:

# Option A (needs root, lasts until reboot): let users attach to their own processes.
sudo sysctl -w kernel.yama.ptrace_scope=0
# Option B (no sysctl change): just run py-spy elevated.
sudo py-spy dump --pid 48291
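
Before changing anything, it can help to confirm what the host is actually set to. The following is a minimal sketch that reads the Yama setting and explains it; the four values are the ones documented for the Yama LSM, while the function names and wording of the messages are my own.

```python
from pathlib import Path

YAMA_PATH = Path("/proc/sys/kernel/yama/ptrace_scope")

MEANINGS = {
    0: "classic: you may attach to any process running under your own UID",
    1: "restricted: only a parent may attach to its descendants; use sudo",
    2: "admin-only: attaching requires CAP_SYS_PTRACE",
    3: "disabled: no attach is possible and the value cannot be changed back",
}


def describe_ptrace_scope(raw):
    """Map the raw file contents to a human-readable explanation."""
    try:
        value = int(raw.strip())
    except ValueError:
        return f"unrecognised value {raw!r}"
    return MEANINGS.get(value, f"unknown value {value}")


def current_ptrace_scope():
    """Read the live setting; returns None where Yama is not present."""
    try:
        return describe_ptrace_scope(YAMA_PATH.read_text())
    except OSError:
        return None
```

A result of None (for example on macOS, or on a kernel built without Yama) means the refusal has some other cause, such as a missing container capability.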

Sampling from outside the process is what makes the overhead negligible, and it also explains the tool's limits.

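The target is never instrumented: py-spy reads its memory and reconstructs the Python stack from the interpreter structures. By default it pauses the process briefly while taking each sample; pass --nonblocking to avoid even that, at some cost in accuracy.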

Why this works

py-spy is written in Rust and runs as its own process. It reads the target's memory through process_vm_readv (Linux), vm_read (macOS), or ReadProcessMemory (Windows), then follows each frame's back pointer to reconstruct the Python stack, pausing the interpreter only briefly per sample. Because it never injects code or imports a module into the target, it cannot corrupt application state and adds negligible overhead, which is what makes it safe to point at a live production worker. The ptrace_scope setting exists precisely because reading another process's memory is a privileged operation; relaxing it or granting CAP_SYS_PTRACE is what authorizes the attach.

Edge cases and failure modes

Wrong PID under a process manager. Gunicorn, uWSGI, and Celery fork worker processes; profiling the master shows almost no work. Attach to a worker PID, or pass --subprocesses to py-spy top or py-spy record with --pid <master> so the children are followed as well.
Container PID namespaces. A PID inside a container differs from the host PID. Either run py-spy inside the container (with SYS_PTRACE added) or translate the PID before attaching from the host.
C extensions hide time. Without --native, time inside numpy, pandas, or a compiled dependency collapses into an opaque built-in frame. Add --native, which needs debug symbols to be most useful.
Idle / I/O-bound threads. A worker blocked on a socket shows wait-style frames. Use --idle to include idle threads in record, and remember py-spy measures wall-clock stacks, not CPU — cross-check with cProfile's cumulative vs total time if you need exact CPU attribution.
An asyncio event loop as the bottleneck. A single-threaded async worker parks everything on one loop thread, so sampled stacks collapse into epoll_wait/select. py-spy tells you the loop is stuck but not which coroutine starved it — pair it with debugging async code and event loops to trace the offending task.
High RSS rather than high CPU. py-spy only samples call stacks; a process that is slow because it is thrashing memory shows nothing useful here. Reach for finding memory leaks with tracemalloc snapshots instead.
Static or stripped Python builds. py-spy may fail to locate the interpreter struct (Failed to find python interpreter). Match the py-spy version to the CPython version and avoid heavily stripped builds.
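
For the container PID case above, one way to translate a PID from the host side is the NSpid line in /proc/<pid>/status, which lists a process's PID in each nested namespace, outermost first. This is a sketch under that assumption (Linux 4.1 or later); the function names are hypothetical, and because several containers can each have a PID 1, it returns candidates rather than a single answer.

```python
import os


def parse_nspid(status_text):
    """Return the PIDs on the NSpid line of /proc/<pid>/status.

    The first entry is the PID in the reader's namespace and the last is the
    PID in the innermost namespace. An empty list means the line is missing.
    """
    for line in status_text.splitlines():
        if line.startswith("NSpid:"):
            return [int(field) for field in line.split()[1:]]
    return []


def host_candidates(container_pid, proc_root="/proc"):
    """Scan /proc on the host for processes whose innermost PID matches.

    Only processes inside a nested PID namespace are considered, and every
    match is returned so the caller can pick one by command line.
    """
    matches = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "status")) as handle:
                pids = parse_nspid(handle.read())
        except OSError:
            continue  # process exited or is not readable
        if len(pids) > 1 and pids[-1] == container_pid:
            matches.append(pids[0])
    return sorted(matches)
```

Run on the host, `host_candidates(1)` lists the host PIDs of every container's PID 1; compare each against /proc/<pid>/cmdline to choose the one to hand to py-spy.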

Running py-spy where the process actually lives

Attaching locally is easy; attaching to the container in production is where the practical problems are, and they are all permission problems with known answers.

Reading another process's memory requires CAP_SYS_PTRACE. In Docker that means running the profiler container with --cap-add SYS_PTRACE. The profiler must also be able to see the target's process ID at all, so a separate profiler container has to join the target's PID namespace (--pid=container:<name>) or the host's (--pid=host).

# Separate profiler container that joins the api container's PID namespace
docker run --rm -it --cap-add SYS_PTRACE --pid=container:api \
    ghcr.io/benfred/py-spy:latest py-spy top --pid 1

# Kubernetes: an ephemeral debug container in the same pod.
# kubectl debug takes a pod, not a Deployment, so name the pod directly.
kubectl debug -it <api-pod-name> --image=ghcr.io/benfred/py-spy:latest \
    --target=api --profile=general -- py-spy dump --pid 1

Three commands cover the cases you will actually hit. py-spy top gives a live, top-style view — the right first look, because it answers "what is it doing right now" in five seconds. py-spy record -o out.svg --duration 60 writes a flame graph, which is what you want when the answer is a distribution rather than a single hot frame. And py-spy dump prints one stack per thread and exits, which is the tool for a hung process where sampling would just show the same blocked frame repeatedly.

Sampling has two blind spots worth planning around. Time spent inside a C extension appears as the extension's frame with nothing below it — the profile stops where Python stops — and --native is what restores the missing frames, at the cost of needing debug symbols. And by default py-spy skips threads that are idle, so a process waiting on a lock looks empty; --idle includes them, and the resulting profile makes a deadlock obvious because every thread sits in the same acquire.
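
To see the idle-thread blind spot for yourself, a throwaway target with one busy and one blocked thread is enough. The script below is a practice sketch only; the thread names, the module name in the usage note, and the 300-second lifetime are arbitrary choices.

```python
import os
import threading
import time


def spin(stop):
    """Burn CPU until asked to stop; shows up in a default profile."""
    total = 0
    while not stop.is_set():
        total += sum(range(1000))
    return total


def wait_forever(stop):
    """Block without using CPU; an idle thread from py-spy's point of view."""
    stop.wait()


def start_threads():
    """Start one busy and one blocked thread; return the stop flag and threads."""
    stop = threading.Event()
    threads = [
        threading.Thread(target=spin, args=(stop,), name="busy", daemon=True),
        threading.Thread(target=wait_forever, args=(stop,), name="blocked", daemon=True),
    ]
    for thread in threads:
        thread.start()
    return stop, threads


def main(seconds=300):
    """Keep the process alive long enough to attach from another terminal."""
    print(f"attach with: py-spy top --pid {os.getpid()}")
    stop, _threads = start_threads()
    try:
        time.sleep(seconds)
    finally:
        stop.set()
```

Saved as a hypothetical target.py and started with `python -c "import target; target.main()"`, the process should show the busy thread under py-spy top, while the blocked thread only appears with --idle or in py-spy dump.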

For a persistent record, py-spy record --format speedscope writes a file that opens in the speedscope viewer, where a sixty-second capture can be scrubbed and filtered by thread. Keep those artefacts alongside the incident notes: a flame graph from the minute the latency spiked is worth far more than a reproduction attempt a week later.
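
Capturing under pressure is easier when the file name is decided for you. The wrapper below is a minimal sketch that builds a timestamped speedscope capture; it uses only the record flags already mentioned in this guide (--pid, --duration, --format, --output, --idle), and the naming scheme and function names are assumptions of mine.

```python
import subprocess
from datetime import datetime, timezone


def artefact_name(pid, when=None):
    """Build a sortable file name that records the PID and UTC capture time."""
    when = when or datetime.now(timezone.utc)
    return f"pyspy-{pid}-{when.strftime('%Y%m%dT%H%M%SZ')}.json"


def record_command(pid, seconds=60, idle=False, output=None):
    """Return the argv for a speedscope capture of one process."""
    argv = [
        "py-spy", "record",
        "--pid", str(pid),
        "--duration", str(seconds),
        "--format", "speedscope",
        "--output", output or artefact_name(pid),
    ]
    if idle:
        argv.append("--idle")  # include threads that are blocked or sleeping
    return argv


def capture(pid, **options):
    """Run the capture; raises CalledProcessError if py-spy exits non-zero."""
    argv = record_command(pid, **options)
    subprocess.run(argv, check=True)
    return argv[argv.index("--output") + 1]
```

`capture(48291, seconds=60)` returns the path of the written file, which can be attached to the incident notes as-is.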

One caution about interpretation. py-spy samples stacks; it does not measure CPU time. By default it drops threads it judges idle, so the percentages describe where the active threads were, but with --idle they are shares of wall-clock time: a process blocked on a socket for 90% of the window will show that socket read as 90% of the profile. That is accurate and often exactly what you needed to know, but it is not evidence of a CPU bottleneck. Cross-check with the GIL and Active percentages in the py-spy top header, or with the process's actual CPU usage, before concluding that the hot frame is burning cycles.
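
The process's actual CPU usage can be read independently of py-spy. On Linux, one way is to sample utime and stime from /proc/<pid>/stat twice and compare; the sketch below assumes that interface, and the function names are illustrative.

```python
import os
import time


def cpu_ticks(stat_text):
    """Return utime + stime (in clock ticks) from /proc/<pid>/stat contents.

    The command name sits in parentheses and may itself contain spaces or
    parentheses, so split only after the last closing parenthesis.
    """
    fields = stat_text.rsplit(")", 1)[1].split()
    # fields[0] is field 3 (state), so utime and stime (fields 14 and 15)
    # are at indexes 11 and 12.
    return int(fields[11]) + int(fields[12])


def cpu_percent(pid, interval=1.0):
    """Sample a process's CPU use over `interval` seconds (Linux only)."""
    path = f"/proc/{pid}/stat"
    ticks_per_second = os.sysconf("SC_CLK_TCK")
    with open(path) as handle:
        before = cpu_ticks(handle.read())
    time.sleep(interval)
    with open(path) as handle:
        after = cpu_ticks(handle.read())
    return 100.0 * (after - before) / ticks_per_second / interval
```

A value near zero alongside a profile dominated by one frame points to waiting rather than computing; a multi-threaded process running native code can legitimately exceed 100.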

Match the subcommand to the question: sampling a hung process only proves it is still hung.

Frequently Asked Questions

Do I need to restart my process to profile it with py-spy? No. py-spy attaches to an already-running interpreter by PID and reads its stacks from outside the process, so there is no restart and no code change. This is why it is safe for production workers.

Why does py-spy fail with Operation not permitted on Linux? Linux ptrace_scope restricts which processes may attach to another. Run py-spy with sudo, grant the binary CAP_SYS_PTRACE, or set kernel.yama.ptrace_scope to 0. In a container, add the SYS_PTRACE capability.

What does the --native flag do in py-spy? --native unwinds native C and Rust stack frames alongside Python frames, so time spent inside C extensions like numpy or a compiled library shows up instead of appearing as an opaque built-in call.

Can py-spy profile a process running as another user? Only with matching privileges, typically running the profiler as root or with CAP_SYS_PTRACE. In a hardened container the capability is usually dropped, which is why the ephemeral-debug-container pattern above is the practical route in Kubernetes: the debug container gets the capability while the application container keeps its restricted profile.

How long should a recording run? Long enough to cover several iterations of whatever the process does repeatedly — thirty to sixty seconds for a request-serving service under load, and a full cycle for a batch job. Shorter captures over-represent whatever happened to be running; longer ones average away the spike you were investigating. When the problem is periodic, record across at least two periods so the profile shows both phases.

CPU Profiling with cProfile and py-spy — the full picture of both profilers and when to reach for each.
Interpreting cProfile: Cumulative vs Total Time — turn the numbers py-spy hints at into exact CPU attribution.
Finding Memory Leaks with tracemalloc Snapshots — when the process is slow on memory, not CPU.
Debugging Async Code and Event Loops — diagnose a stalled asyncio worker that flat stacks cannot explain.
