Interpreting cProfile: Cumulative vs Total Time

You ran cProfile, you have a table, and the top row by one sort is your entry point while the top row by another sort is a tiny helper — and it is not obvious which one to optimize. The confusion is almost always between tottime and cumtime. This guide explains every column the profiler emits, why a dispatcher shows enormous cumulative time but trivial total time, and how to sort pstats to find the function actually worth changing.

Prerequisites

Python 3.8+ (cProfile, pstats, pstats.SortKey).
A .prof file produced by cProfile.run(..., filename="x.prof") or Profile.dump_stats("x.prof") — see CPU profiling with cProfile and py-spy for capture.

Solution

Load the stats, strip directory noise, and print both sorts:

import pstats
from pstats import SortKey

stats = pstats.Stats("workload.prof")
stats.strip_dirs()                       # turn /abs/path/mod.py:func into mod.py:func

# cumtime: which entry points dominate the whole run (body + everything called).
stats.sort_stats(SortKey.CUMULATIVE).print_stats(8)

# tottime: which leaf functions burn CPU in their own body.
stats.sort_stats(SortKey.TIME).print_stats(8)

A representative row looks like this:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      240/2    0.001    0.000    1.480    0.740 app.py:12(dispatch)
   1000000    1.310    0.000    1.420    0.000 app.py:40(parse_record)

Read it column by column:

ncalls: number of calls. A 240/2 split means the function recursed: 240 total calls, of which 2 were primitive (not made from inside another active call to the same function).
tottime: time spent in this function's own body, excluding anything it called. This is the self cost.
percall (first): tottime divided by the total call count (the first number in ncalls), the average self cost per call.
cumtime: time from entry to exit of this function, including every subcall. Recursive re-entries are not counted twice, so the figure stays accurate for recursive functions. The top-level entry point has the largest cumtime.
percall (second): cumtime divided by the primitive call count.
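As a check on the column definitions, both percall figures can be recomputed from the raw counters that pstats holds. This is a minimal sketch rather than part of the workflow: countdown is a hypothetical recursive function, and the stats attribute it reads is the dictionary pstats uses internally, whose value layout (primitive calls, total calls, tottime, cumtime, callers) is assumed from the CPython implementation.

```python
import cProfile
import pstats


def countdown(n):
    """Hypothetical recursive function: 1 primitive call plus n recursive re-entries."""
    return 0 if n == 0 else 1 + countdown(n - 1)


profiler = cProfile.Profile()
profiler.runcall(countdown, 50)          # profile exactly one top-level call

stats = pstats.Stats(profiler)

# Assumed layout: Stats.stats maps (filename, lineno, funcname) to
# (primitive_calls, total_calls, tottime, cumtime, callers).
for (filename, lineno, funcname), row in stats.stats.items():
    if funcname == "countdown":
        prim_calls, total_calls, tottime, cumtime, _callers = row
        break

ncalls_column = f"{total_calls}/{prim_calls}"   # how a recursive row is rendered
percall_self = tottime / total_calls            # first percall column
percall_cum = cumtime / prim_calls              # second percall column

print("ncalls:", ncalls_column)
print(f"tottime per call:           {percall_self:.9f}s")
print(f"cumtime per primitive call: {percall_cum:.9f}s")
```

The two divisors differ on purpose: self cost is averaged over every entry, while cumulative cost is averaged only over the calls that came from outside the recursion.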

In the example, dispatch has tiny tottime (0.001 s) but huge cumtime (1.480 s): it does almost nothing itself and spends nearly all of that time inside the functions it calls, chiefly parse_record (1.420 s). parse_record has tottime close to cumtime (1.310 s vs 1.420 s), so about 92% of its time is its own body and it is a genuine hotspot. Optimize parse_record, not dispatch.

The two columns measure the same run from different vantage points, which is why sorting by each answers a different question.

Sort by tottime to find the code to make faster; sort by cumtime to find the call you should not be making at all.

Why this works

cProfile records, for every function, the time charged to its own frame separately from the time charged to frames it called. tottime is the self-time; cumtime adds the descendants. That is why callers high in the stack accumulate large cumtime while doing little real work, and why a tight inner loop shows tottime ≈ cumtime. Sorting by cumtime answers "where does the run spend its time" (the hot path), and sorting by tottime answers "which function is doing the work" (the hot leaf). cProfile resolves time per function, not per line, so the function is the finest unit either sort can name. You almost always want both: cumulative to navigate to the path, total to find the function to change.
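The split between self-time and descendant time can be reproduced on a toy workload. The sketch below assumes hypothetical dispatch and parse_record functions (their bodies are invented for illustration) and reads the raw pstats dictionary, whose tuple layout is an implementation detail of CPython.

```python
import cProfile
import pstats


def parse_record(record):
    """Hypothetical leaf: pure-Python arithmetic, no calls to other functions."""
    checksum = 0
    for value in record:
        checksum = (checksum * 31 + value) % 65521
    return checksum


def dispatch(records):
    """Hypothetical caller: does almost nothing except delegate."""
    results = []
    for record in records:
        results.append(parse_record(record))
    return results


def times_for(stats, funcname):
    """Return (tottime, cumtime) for the first row whose function name matches."""
    for (_file, _line, name), (_cc, _nc, tt, ct, _callers) in stats.stats.items():
        if name == funcname:
            return tt, ct
    raise KeyError(funcname)


records = [list(range(500))] * 2000

profiler = cProfile.Profile()
profiler.runcall(dispatch, records)
stats = pstats.Stats(profiler)

dispatch_tt, dispatch_ct = times_for(stats, "dispatch")
parse_tt, parse_ct = times_for(stats, "parse_record")

print(f"dispatch      tottime={dispatch_tt:.4f}s  cumtime={dispatch_ct:.4f}s")
print(f"parse_record  tottime={parse_tt:.4f}s  cumtime={parse_ct:.4f}s")
```

On a run like this, dispatch should show a cumtime at least as large as parse_record's while its own tottime stays small, which is the pattern described above.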

Edge cases and failure modes

Recursion inflates ncalls. The total/primitive split is the tell. Optimizing a recursive function means cutting the call count (memoization) or the self-cost per call; the first percall column (tottime per call) set against the total call count tells you which one dominates.
Built-in rows dominate. Many <built-in method> entries with high tottime usually mean a hot loop calling cheap builtins millions of times; the fix is fewer calls, not a faster builtin. Cross-check wall-clock attribution with py-spy on a live process when deterministic overhead is suspect.
cumtime is not additive across rows. You cannot sum cumtime over functions to get total runtime; nested calls double-count. Read the root frame's cumtime for the whole-run figure.
callers / callees views. Use stats.print_callers("parse_record") to see who drives a hotspot — essential when a leaf is called from several paths.
Threaded code under-reports. cProfile only profiles the thread that started it; spawned threads are invisible. Profile each thread, or sample with py-spy.
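The non-additivity of cumtime noted in the list above is easy to confirm numerically. In this sketch, outer, middle and inner are hypothetical nested functions; it compares the sum of each column with the total that pstats reports, again relying on the internal stats dictionary and the total_tt attribute as found in CPython.

```python
import cProfile
import math
import pstats


def inner(n):
    """Hypothetical leaf doing the actual work."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def middle(n):
    return inner(n) + inner(n)


def outer(n):
    return middle(n) + middle(n)


profiler = cProfile.Profile()
profiler.runcall(outer, 50_000)
stats = pstats.Stats(profiler)

# Each value is (primitive_calls, total_calls, tottime, cumtime, callers).
sum_tottime = sum(row[2] for row in stats.stats.values())
sum_cumtime = sum(row[3] for row in stats.stats.values())

print(f"reported total : {stats.total_tt:.4f}s")
print(f"sum of tottime : {sum_tottime:.4f}s")   # matches the reported total
print(f"sum of cumtime : {sum_cumtime:.4f}s")   # larger: nested frames are counted again
```

Because outer, middle and inner each include the same inner-loop time in their cumtime, the cumtime column sums to a multiple of the real runtime, while the tottime column partitions it.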

Working a profile from the top down

A profile with three thousand rows is not read; it is queried. pstats gives you the query language, and four commands answer nearly every performance question.

import pstats

st = pstats.Stats("profile.out")
st.strip_dirs()                       # shorten paths so rows stay readable

st.sort_stats("cumulative").print_stats(15)     # 1. where does the time go, by path?
st.sort_stats("tottime").print_stats(15)        # 2. which function burns the CPU?
st.print_callers("parse_row")                   # 3. who is calling the hot function?
st.print_callees("handle_request")              # 4. what does the entry point spend on?

Read them in that order. The cumulative view names the expensive path — usually an entry point you already knew about. The total view names the function to change. print_callers then tells you whether that function is hot because it is slow or because something calls it a hundred thousand times, and those two diagnoses have completely different fixes: optimise the body, or remove the calls.

The number to check before any of that is ncalls. A function with a per-call cost of two microseconds and four million calls is not a slow function; it is a caller problem, and rewriting the body in C would still leave you with four million calls. The percall columns make this explicit — one is tottime/ncalls, the other cumtime/primitive calls — and a large gap between the two ncalls figures (shown as 120/12) means recursion, where cumulative time is counted once for the outermost frame only.

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
  4000000    3.912    0.000    3.912    0.000 rows.py:44(clean_cell)
        1    0.004    0.004    9.881    9.881 report.py:12(build_report)
    12000    0.271    0.000    8.204    0.001 rows.py:20(parse_row)

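That excerpt is the common shape. build_report owns almost all cumulative time but almost no total time: it is the entry point. parse_row is where the path narrows. clean_cell is the real cost (3.912 s of tottime), yet its per-call time is about one microsecond, so there is little to gain inside clean_cell itself. The fix is in parse_row's loop: if parse_row is the only caller, it invokes clean_cell roughly 330 times per row.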
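The printed percall of 0.000 hides the per-call cost, so it helps to redo the division at higher precision. The arithmetic below uses only the figures from the excerpt; the one assumption, flagged in the code, is that every clean_cell call comes from parse_row.

```python
# Figures copied from the excerpt above.
clean_cell_calls = 4_000_000
clean_cell_tottime = 3.912      # seconds
parse_row_calls = 12_000

# Self cost of one clean_cell call, in microseconds.
per_call_us = clean_cell_tottime / clean_cell_calls * 1e6

# Assumption: parse_row is the only caller of clean_cell.
cells_per_row = clean_cell_calls / parse_row_calls

print(f"clean_cell self cost per call: {per_call_us:.3f} µs")         # 0.978 µs
print(f"clean_cell calls per parse_row call: {cells_per_row:.0f}")    # 333
```

A sub-microsecond body called hundreds of times per row points at the call count, not at the body, as the thing to reduce.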

Two more habits pay off on real profiles. Profile a single representative operation rather than a whole run, so warm-up and teardown do not dominate; wrapping one call in cProfile.runctx gives a far cleaner picture than profiling main(). And save the raw stats file rather than the printed table — pstats can re-sort and re-query it later, and diffing two saved profiles before and after a change is the only honest way to claim an improvement.
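One way to diff two saved profiles is to load each file and compare tottime per function. The sketch below is an assumption about how such a comparison might look, not a pstats feature: render is a hypothetical workload with a slow and a fast path standing in for "before" and "after", and the helper names are invented.

```python
import cProfile
import os
import pstats
import tempfile


def render(rows, fast):
    """Hypothetical workload: fast=False is the 'before' code, fast=True the 'after'."""
    if fast:
        return ",".join(rows)
    out = ""
    for row in rows:
        out = out + row + ","
    return out


def save_profile(path, fast):
    """Profile one representative call and keep the raw stats file."""
    rows = [str(i) for i in range(20_000)]
    profiler = cProfile.Profile()
    profiler.runcall(render, rows, fast)
    profiler.dump_stats(path)


def tottime_by_function(path):
    """Map 'file:function' to tottime; line numbers are dropped because edits shift them."""
    stats = pstats.Stats(path)
    stats.strip_dirs()
    result = {}
    for (filename, _lineno, funcname), row in stats.stats.items():
        key = f"{filename}:{funcname}"
        result[key] = result.get(key, 0.0) + row[2]   # row[2] is tottime
    return result


with tempfile.TemporaryDirectory() as tmp:
    before_path = os.path.join(tmp, "before.prof")
    after_path = os.path.join(tmp, "after.prof")
    save_profile(before_path, fast=False)
    save_profile(after_path, fast=True)
    before = tottime_by_function(before_path)
    after = tottime_by_function(after_path)

# A negative delta means the function got cheaper after the change.
delta = {name: after.get(name, 0.0) - before.get(name, 0.0)
         for name in before.keys() | after.keys()}
for name, change in sorted(delta.items(), key=lambda item: item[1]):
    print(f"{change:+.6f}s  {name}")
```

Keying on file and function name rather than line number keeps rows comparable when the change moves code around; a single pair of runs is still noisy, so repeated measurements are worth taking before drawing conclusions.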

Finally, remember that cProfile measures function calls, so time spent inside a single C-level call is attributed to that call and nothing below it. A profile dominated by {method 'execute' of 'psycopg2...Cursor' objects} is telling you the time is in the database, and no amount of Python-level restructuring will move it — that is the point at which the CPU profiler has answered its question and the next tool is a query plan.
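The same attribution rule can be seen without a database by using any expensive C-level call. In this sketch, order_rows is a hypothetical wrapper around the builtin sorted, standing in for a Python function that calls into a driver; the "~" filename used to spot the builtin's row is how CPython's profiler labels C functions and is treated here as an assumption.

```python
import cProfile
import pstats
import random


def order_rows(rows):
    """Hypothetical Python wrapper whose only real work is one C-level call."""
    return sorted(rows)


random.seed(0)
rows = [random.random() for _ in range(200_000)]   # built outside the profiled call

profiler = cProfile.Profile()
profiler.runcall(order_rows, rows)
stats = pstats.Stats(profiler)

builtin_tottime = 0.0
wrapper_tottime = 0.0
for (filename, _lineno, funcname), row in stats.stats.items():
    if filename == "~" and "sorted" in funcname:   # row for the C function
        builtin_tottime = row[2]
    elif funcname == "order_rows":                 # row for the Python wrapper
        wrapper_tottime = row[2]

print(f"sorted (C call) tottime : {builtin_tottime:.4f}s")
print(f"order_rows tottime      : {wrapper_tottime:.6f}s")
```

All of the sorting time lands on the builtin's row and nothing is broken down beneath it, which is why restructuring order_rows would not change the picture.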

Frequently Asked Questions

What is the difference between tottime and cumtime in cProfile? tottime (total time) is the time spent in a function's own body, excluding calls it makes. cumtime (cumulative time) includes that plus all time spent in functions it called. A function that mostly delegates has low tottime and high cumtime.

Which column should I sort by to find a hotspot? Sort by tottime to find the leaf functions actually burning CPU, and by cumtime to find which high-level entry points dominate the run. Start with cumtime to locate the hot path, then tottime to find the function doing the work.

What does ncalls show when it reads like 240/2? The first number is total calls and the second is primitive (non-recursive) calls. A split like 240/2 means the function recursed: it was entered 240 times overall but only 2 of those were the original non-recursive entries.

Why does the total of all tottime values not match the wall-clock runtime? Within a profile, tottime values do add up: the total in the report header is the sum of tottime over every row, including the {built-in method ...} rows that carry time spent inside C calls. The mismatch comes from what sits outside that sum: rows cut off by a print_stats limit, work done in threads the profiler was not attached to, and time before or after the profiled region. The instrumentation also slows the run itself, so a profiled run takes longer than an unprofiled one. Use the cumtime of the outermost frame as the measure of the profiled region, and compare it against the wall clock of an unprofiled run to see how much the instrumentation cost.

Profiling a running process with py-spy — when a .prof file is not an option and you need wall-clock stacks from a live worker.
Finding memory leaks with tracemalloc snapshots — the same diff-two-samples discipline applied to retained bytes instead of CPU time.
Comparing tracemalloc snapshots to locate growth — attribute allocation growth to a line once a hot path has been ruled out as the cause.
Post-mortem debugging with pdb.pm — drop into the frame a hotspot points at to inspect the state driving the cost.
