Jared Frost

The roofline model for LLM inference

If you only internalize one mental model for LLM inference performance, make it the roofline. It explains, from two numbers on a datasheet, why decode is slow, why batching works, why quantization helps more than you’d expect, and why your tensor cores sit idle while you serve a single user.

Two numbers and one ratio

Every GPU has two peak rates that matter here: peak compute throughput (FLOP/s) and peak memory bandwidth (bytes/s).

A kernel is compute-bound if it’s limited by the first and memory-bound if it’s limited by the second. For a given GPU, which one you hit depends on a property of the workload: arithmetic intensity (AI), the number of FLOPs performed per byte moved from memory.

arithmetic intensity = FLOPs / bytes moved      [FLOP/byte]

The crossover — the ridge point — is where the two roofs meet:

ridge point = peak FLOP/s / peak bytes/s

For the H100:

PEAK_FLOPS = 989e12      # FLOP/s   (FP16 tensor core, dense)
PEAK_BW    = 3.35e12     # bytes/s  (HBM3)

ridge_point = PEAK_FLOPS / PEAK_BW
print(f"{ridge_point:.0f} FLOP/byte")   # -> 295

295 FLOP/byte. A kernel must do ~295 FLOPs for every byte it reads from HBM just to keep the tensor cores busy. Below that, you’re starved on bandwidth and the compute units idle. (For an A100 the ridge sits around 153 FLOP/byte — lower peak compute, lower bandwidth, same story.)
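To make the H100 versus A100 comparison concrete, here is a minimal sketch that computes the ridge point for both parts. The H100 figures are the ones quoted in this post; the A100 figures (312 TFLOPS FP16 dense, about 2.04 TB/s for the 80 GB SXM4 part) are my assumption from the vendor datasheet, and the helper name `compute_ridge` is made up for illustration.

```python
# Ridge point = peak FLOP/s divided by peak bytes/s.
# All values are vendor peak specs; the A100 row is an assumed datasheet value.
GPUS = {
    "H100 SXM5": {"peak_flops": 989e12, "peak_bw": 3.35e12},
    "A100 SXM4 80GB": {"peak_flops": 312e12, "peak_bw": 2.039e12},
}

def compute_ridge(peak_flops, peak_bw):
    """Arithmetic intensity (FLOP/byte) at which the two roofs meet."""
    return peak_flops / peak_bw

ridges = {name: compute_ridge(**spec) for name, spec in GPUS.items()}
for name, ridge in ridges.items():
    print(f"{name}: {ridge:.0f} FLOP/byte")
```

Both parts need a few hundred FLOPs per byte before the tensor cores become the limit, which is why the same reasoning carries across GPU generations.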

The attainable throughput is the roofline itself:

def attainable_tflops(ai):
    """min(compute roof, bandwidth roof) at a given arithmetic intensity."""
    return min(PEAK_FLOPS, PEAK_BW * ai) / 1e12

attainable_tflops(1)     # ->   3.35 TFLOPS   (~0.3% of peak)
attainable_tflops(295)   # -> 988.25 TFLOPS   (just below the ridge at ~295.2)
attainable_tflops(600)   # -> 989    TFLOPS   (compute-bound, flat)

[Figure: log-log roofline for an NVIDIA H100 SXM5 (FP16). x-axis: arithmetic intensity, FLOP/byte; y-axis: attainable TFLOP/s.]
The roofline for an H100 SXM5 (FP16). Below the ridge at ~295 FLOP/byte you're bandwidth-bound; single-stream decode lives at AI ≈ 1, using ~0.3% of peak compute. Prefill clears the ridge and turns compute-bound.

Where decode lands

Autoregressive decode generates one token at a time. The dominant op is a matrix–vector product: every weight matrix W of shape [d_out × d_in] multiplies a single activation vector x.

Count it in FP16 (2 bytes/element), at batch size 1:

weights read : 2 · d_out · d_in   bytes      (each weight loaded once)
FLOPs        : 2 · d_out · d_in   FLOP        (one multiply + one add per weight)

AI = (2 · d_out · d_in) / (2 · d_out · d_in) = 1 FLOP/byte

One, against a ridge point of 295. Single-stream decode can reach at most roughly 1/295 ≈ 0.3% of the H100’s peak tensor-core throughput. The expensive silicon NVIDIA built sits almost entirely idle, because you spend all your time hauling weights across the memory bus to do exactly two FLOPs with each one.
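The count above only tracks weight traffic. As a rough check, the sketch below also charges for reading the input vector and writing the output vector, assuming every operand crosses the memory bus exactly once. The function name and the 4096 hidden size are hypothetical choices for illustration.

```python
def matvec_intensity(d_out, d_in, bytes_per_elem=2):
    """FLOP/byte for y = W @ x, with W of shape [d_out, d_in], at batch size 1."""
    flops = 2 * d_out * d_in                      # one multiply + one add per weight
    weight_bytes = bytes_per_elem * d_out * d_in  # every weight read once
    act_bytes = bytes_per_elem * (d_in + d_out)   # read x, write y
    return flops / (weight_bytes + act_bytes)

ai = matvec_intensity(4096, 4096)  # 4096 is a hypothetical hidden size
print(f"{ai:.4f} FLOP/byte")
```

The activation traffic is negligible next to the weights, so the result stays just under 1 FLOP/byte and the simpler weights-only count is a fair approximation.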

That single fact has a direct, quantitative consequence. If decode is bandwidth-bound, its per-token latency is just bytes to move ÷ bandwidth. For a model with P parameters in FP16, you stream ~2P bytes per token:

params       = 8e9     # an 8B model
weight_bytes = 2 * params               # FP16 -> 16 GB / token

sol = weight_bytes / PEAK_BW            # speed-of-light latency
print(f"{sol*1e3:.2f} ms/token -> {1/sol:.0f} tok/s")
# 4.78 ms/token -> 209 tok/s   (upper bound, batch 1)

~209 tokens/s is the speed of light for an 8B FP16 model on one H100 at batch 1: you cannot decode faster without moving fewer bytes. Real numbers come in lower: attention reads the growing KV cache, not every op runs on the tensor cores, kernel launches cost time, and you realistically sustain ~70–80% of peak HBM bandwidth. But the ceiling is set by bandwidth, full stop.
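To see what the ~70–80% sustained-bandwidth figure does to the ceiling, here is a small sketch that scales the speed-of-light estimate by a bandwidth efficiency factor. It still counts weight traffic only (no KV cache, no launch overhead), and `decode_tok_per_s` and `bw_efficiency` are names I made up for illustration.

```python
PEAK_BW = 3.35e12  # bytes/s, H100 SXM5 HBM3 (vendor peak)

def decode_tok_per_s(params, bytes_per_param=2, bw=PEAK_BW, bw_efficiency=1.0):
    """Bandwidth-bound upper limit on batch-1 decode speed, weights only."""
    bytes_per_token = params * bytes_per_param
    return bw * bw_efficiency / bytes_per_token

for eff in (1.0, 0.8, 0.7):
    tok_s = decode_tok_per_s(8e9, bw_efficiency=eff)
    print(f"{eff:.0%} of peak bandwidth -> {tok_s:.0f} tok/s")
```

Under these assumptions the 8B model's practical ceiling lands somewhere around 145 to 170 tokens/s rather than 209, before any of the other overheads are counted.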

Why prefill is the opposite

Prefill, which processes the T-token prompt before the first output token, applies the same weight matrices to T activation vectors at once. Weights are still read once, but the FLOPs scale with T:

AI ≈ T   (sequence length)

A 512-token prompt gives AI ≈ 512 > 295 → compute-bound. So a single model runs in two completely different regimes:

Phase     Op type          Arithmetic intensity   Bound by
Prefill   matrix–matrix    ≈ sequence length      Compute
Decode    matrix–vector    ≈ batch size           Bandwidth
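The AI ≈ T estimate for prefill ignores activation traffic. The sketch below adds it back, assuming each operand is moved once and using a hypothetical 4096 × 4096 weight matrix, to show roughly where a single layer crosses the H100 ridge. `matmul_intensity` is an illustrative name, not a library call.

```python
def matmul_intensity(d_out, d_in, tokens, bytes_per_elem=2):
    """FLOP/byte for Y = W @ X, where X holds `tokens` activation columns."""
    flops = 2 * d_out * d_in * tokens
    weight_bytes = bytes_per_elem * d_out * d_in          # weights still read once
    act_bytes = bytes_per_elem * tokens * (d_in + d_out)  # read X, write Y
    return flops / (weight_bytes + act_bytes)

RIDGE = 989e12 / 3.35e12  # ~295 FLOP/byte on an H100
d = 4096                  # hypothetical hidden size

for T in (1, 64, 512, 2048):
    ai = matmul_intensity(d, d, T)
    bound = "compute" if ai > RIDGE else "bandwidth"
    print(f"T={T:5d}  AI={ai:7.1f}  {bound}-bound")
```

With these dimensions a 512-token prompt comes out near 410 FLOP/byte rather than 512, which is still past the ridge. The gap between AI and T widens as T approaches the matrix dimensions.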

[Figure: arithmetic-intensity spectrum on a log scale, running from batch-1 decode at AI ≈ 1 to prefill past the H100 ridge at 295 FLOP/byte.]

The same split as a number line: prefill’s intensity grows with sequence length T, clearing the ridge; batch-1 decode is pinned to the far left. Batching decode slides it rightward — AI ≈ batch size.

This is why prefill and decode are profiled and batched separately, and increasingly scheduled on separate hardware: they stress opposite roofs.
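The earlier point that batching slides decode rightward (AI ≈ batch size) can be sketched as a throughput estimate. This is a weights-only roofline under stated assumptions: it ignores KV-cache reads, which grow with batch size and make real gains smaller, and the function name is hypothetical.

```python
PEAK_FLOPS = 989e12  # FLOP/s, H100 FP16 dense tensor (vendor peak)
PEAK_BW = 3.35e12    # bytes/s, HBM3 (vendor peak)

def batched_decode_tok_per_s(params, batch, bytes_per_param=2):
    """Aggregate tokens/s across the batch, weights-only roofline estimate."""
    step_bw = params * bytes_per_param / PEAK_BW    # time to stream the weights once
    step_compute = 2 * params * batch / PEAK_FLOPS  # time for the whole batch's FLOPs
    return batch / max(step_bw, step_compute)       # the slower roof sets the step time

for b in (1, 8, 64, 295, 1024):
    print(f"batch {b:5d}: {batched_decode_tok_per_s(8e9, b):9.0f} tok/s")
```

Below the ridge, aggregate throughput grows linearly with batch size because the same weight read is shared by every sequence. Past a batch of roughly 295 the estimate flattens at the compute roof.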

Every decode optimization is a roofline move

Once you see decode pinned to the bandwidth roof at AI ≈ 1, the entire optimization playbook falls out of one question: how do I raise arithmetic intensity, or move fewer bytes?

None of these are tricks. They’re all the same move — shift the workload rightward toward the ridge, or shrink the bytes under the bandwidth roof.
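As one instance of "shrink the bytes", here is a sketch of how the batch-1 bandwidth ceiling moves when weights are stored in fewer bytes. It assumes weights-only traffic and ignores dequantization cost and the extra bytes for scales or zero-points, so treat the numbers as upper bounds. The format table and function name are illustrative.

```python
PEAK_BW = 3.35e12  # bytes/s, H100 SXM5 HBM3 (vendor peak)

# Bytes per parameter for each weight format (metadata such as scales ignored).
WEIGHT_BYTES = {"FP16": 2.0, "INT8": 1.0, "INT4": 0.5}

def sol_tok_per_s(params, bytes_per_param, bw=PEAK_BW):
    """Speed-of-light batch-1 decode rate when only weight bytes are streamed."""
    return bw / (params * bytes_per_param)

ceilings = {fmt: sol_tok_per_s(8e9, b) for fmt, b in WEIGHT_BYTES.items()}
for fmt, tok_s in ceilings.items():
    print(f"{fmt}: {tok_s:.0f} tok/s")
```

Halving the bytes per weight doubles the ceiling, which is the sense in which quantization helps a bandwidth-bound workload even when no FLOPs are saved.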

The takeaway

Before touching a profiler, you can predict the regime from a datasheet:

  1. Compute the ridge point: peak FLOP/s ÷ peak bandwidth (~295 for an H100).
  2. Estimate your kernel’s arithmetic intensity.
  3. If AI < ridge, you’re bandwidth-bound — chasing FLOPs is wasted effort; move fewer bytes or raise AI. If AI > ridge, optimize the compute.
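The three steps above can be sketched as a single helper. The defaults are the H100 peak specs quoted in this post, and `classify` is a hypothetical name for illustration.

```python
def classify(ai, peak_flops=989e12, peak_bw=3.35e12):
    """Return (regime, fraction of peak compute attainable) for an arithmetic intensity."""
    ridge = peak_flops / peak_bw                # step 1: ridge point
    attainable = min(peak_flops, peak_bw * ai)  # the roofline at this intensity
    regime = "bandwidth-bound" if ai < ridge else "compute-bound"  # step 3
    return regime, attainable / peak_flops

for name, ai in (("batch-1 decode", 1), ("512-token prefill", 512)):  # step 2: estimated AI
    regime, frac = classify(ai)
    print(f"{name}: {regime}, at most {frac:.1%} of peak compute")
```

The returned fraction is a ceiling, not a prediction: it says how much of peak compute the roofline allows at that intensity, not what a real kernel will achieve.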

Decode lives far to the left of the ridge, and almost everything we do to make inference fast is a structured campaign to drag it rightward. In the next posts I’ll measure a real decode kernel against this ceiling with Nsight Compute and see how much of the 0.3% → 100% gap is recoverable, and how much is physics.

Numbers cited are vendor peak specs for the H100 SXM5 (989 TFLOPS FP16 dense tensor; 3.35 TB/s HBM3). Peak ≠ achievable — I’ll always say which one I’m quoting.

