Jared Frost

Roofline, bandwidth, and the memory hierarchy

Which hardware spec lines move which metric, and which are noise?

AI Inference Engineer 2026 — Special Course · Part 1 — Fundamentals of AI Inference / MLSys

Overview

Lecture 02 established that decode is bandwidth-bound and prefill is compute-bound. This lecture turns those statements into a quantitative model — the roofline — that lets you predict, from a model + GPU pair alone, what the achievable performance ceiling is for any kernel in your workload.

A roofline plot answers one specific question:

Given a kernel's arithmetic intensity (FLOPs per byte read from HBM), what is the upper bound on its achievable throughput on this GPU?

The answer is: the minimum of (intensity × bandwidth) and (peak FLOPs). That ceiling is physical — no kernel can exceed it on that hardware. If your measured kernel is far below the ceiling, you have engineering room. If it is at the ceiling, the only way forward is changing the workload (precision drop, larger batches, prefix cache) or the hardware.

This lecture builds the roofline model from first principles:

  1. The memory hierarchy on Hopper and Blackwell — what each level costs and what holds the working set.
  2. The roofline equation and how to compute the ridge point for any GPU.
  3. Worked rooflines for H100, H200, B200, Jetson Orin Nano Super 8 GB.
  4. How to map a kernel onto a roofline — measure arithmetic intensity, locate it on the chart, identify the ceiling.
  5. What roofline cannot tell you — and the next two layers of analysis after it.

By the end you should be able to plot the roofline for any GPU from its datasheet, place any matmul-shaped kernel on it, and predict the performance ceiling before running it.


1. The memory hierarchy

A GPU is a bandwidth pyramid. Closer to the SM means faster but smaller. On Hopper / Blackwell the hierarchy looks like this:

                 ┌─────────────────────────────┐
                 │  Registers (per thread)     │  up to 255 × 32-bit = ~1 KB / thread
                 │   ~10s of TB/s effective    │  total per SM: ~256 KB
                 └─────────────────────────────┘

                 ┌─────────────────────────────┐
                 │  Shared memory / L1 (per SM)│  H100/B200: ~228 KB usable per SM
                 │  ~25 TB/s aggregate         │
                 └─────────────────────────────┘

                 ┌─────────────────────────────┐
                 │  L2 cache (per GPU)         │  H100: 50 MB · B200: 60+ MB
                 │  ~5 TB/s effective          │
                 └─────────────────────────────┘

                 ┌─────────────────────────────┐
                 │  HBM (per GPU)              │  H100 80 GB HBM3 @ 3.35 TB/s
                 │                             │  H200 141 GB HBM3e @ 4.8 TB/s
                 │                             │  B200 192 GB HBM3e @ 8.0 TB/s
                 └─────────────────────────────┘

                 ┌─────────────────────────────┐
                 │  NVLink (cross-GPU)         │  H100 NVLink 4: 900 GB/s per GPU
                 │                             │  B200 NVLink 5: 1.8 TB/s per GPU
                 └─────────────────────────────┘

                 ┌─────────────────────────────┐
                 │  PCIe / fabric (cross-node) │  PCIe Gen5: 64 GB/s per direction
                 │                             │  IB NDR: 400 Gb/s · XDR: 800 Gb/s
                 └─────────────────────────────┘

What lives at each level for LLM inference:

The roofline equation assumes HBM is the bandwidth bottleneck. That assumption is correct for >95% of LLM inference kernels at batch=1.
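
To get a feel for the gaps between levels, the sketch below computes a lower bound on the time to move a fixed payload once across several of the links in the diagram. The 16 GB payload (roughly an 8B-parameter model in FP16) is an illustrative assumption, and the bandwidths are the peak figures quoted above, so real transfers will be slower.

```python
# Peak bandwidths in bytes/s, taken from the hierarchy diagram above.
BANDWIDTH = {
    "H100 HBM3": 3.35e12,
    "H100 NVLink 4": 900e9,
    "PCIe Gen5 (one direction)": 64e9,
    "InfiniBand NDR": 400e9 / 8,  # 400 Gb/s converted to bytes/s
}


def transfer_time_ms(num_bytes: float, bytes_per_s: float) -> float:
    """Lower bound on the time to move num_bytes once, in milliseconds."""
    return num_bytes / bytes_per_s * 1e3


# Assumption for illustration: 8e9 parameters at 2 bytes each (FP16).
payload = 8e9 * 2

for name, bw in BANDWIDTH.items():
    print(f"{name:28s} {transfer_time_ms(payload, bw):8.1f} ms")
```

Under these assumptions the same payload takes about 5 ms from HBM but hundreds of milliseconds over PCIe or the network, which is why the working set has to stay on the GPU.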


2. The roofline equation

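A roofline plot has two axes:

  • X axis: arithmetic intensity, in FLOPs per byte read from HBM.
  • Y axis: achievable throughput, in FLOPs/s.

Two ceilings:

  • Bandwidth ceiling: the sloped line, intensity × bandwidth.
  • Compute ceiling: the flat line at peak FLOPs.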

The ridge point is where they meet:

ridge_point = peak_FLOPs / bandwidth

For any kernel with arithmetic intensity below the ridge point: bandwidth-bound (memory is the ceiling). For any kernel above the ridge point: compute-bound (FLOPs are the ceiling).

   throughput

peak ──┼───────────────────  ← compute ceiling (peak TFLOPs)
       │              ╱
       │           ╱       ← linear region (bandwidth × intensity)
       │        ╱
       │     ╱
       │  ╱
       │╱
       └────┴──────────────► arithmetic intensity (FLOPs/byte)

         ridge point

For LLM inference, the question is always: where is this kernel on this chart? If far below the ceiling, optimization is possible. If at the ceiling, the only progress is to change the X coordinate (raise arithmetic intensity by batching, fusion, or precision drop) or the chart (different GPU).
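
The model in this section fits in a few lines. The sketch below is one minimal way to write it, using the H100 FP16 figures quoted in this lecture (989 TFLOPs, 3.35 TB/s); the function names are illustrative, not from any library.

```python
def ridge_point(peak_flops: float, bandwidth: float) -> float:
    """Arithmetic intensity (FLOPs/byte) where the two ceilings meet."""
    return peak_flops / bandwidth


def roofline_ceiling(intensity: float, peak_flops: float, bandwidth: float) -> float:
    """Upper bound on throughput (FLOPs/s) at a given arithmetic intensity."""
    return min(intensity * bandwidth, peak_flops)


def regime(intensity: float, peak_flops: float, bandwidth: float) -> str:
    """Which ceiling binds: memory below the ridge, compute at or above it."""
    if intensity < ridge_point(peak_flops, bandwidth):
        return "bandwidth-bound"
    return "compute-bound"


# H100 SXM at FP16, numbers from this lecture.
PEAK, BW = 989e12, 3.35e12

for intensity in (1.0, 100.0, 1000.0):
    ceiling = roofline_ceiling(intensity, PEAK, BW)
    print(f"{intensity:7.0f} FLOPs/byte -> {ceiling / 1e12:7.2f} TFLOPs "
          f"({regime(intensity, PEAK, BW)})")
```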


3. Worked rooflines

3.1 H100 SXM (80 GB HBM3)

Ridge points: FP16 ≈ 295 FLOPs/byte (989 TFLOPs / 3.35 TB/s) and FP8 ≈ 591 FLOPs/byte.

A decode step at batch=1 has arithmetic intensity ≈ 2/bytes_per_param = ~1 FLOP/byte at FP16. That is ~300× below the ridge — deeply bandwidth-bound.
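
The same bound can be expressed in tokens per second. The sketch below assumes a hypothetical 8B-parameter dense model in FP16 and counts only weight traffic (KV cache and activations are ignored), so it is an optimistic upper bound rather than a prediction.

```python
H100_BANDWIDTH = 3.35e12  # bytes/s, HBM3 peak from this lecture


def decode_tokens_per_s_bound(num_params: float, bytes_per_param: float,
                              bandwidth: float) -> float:
    """Upper bound from streaming every weight once per generated token."""
    return bandwidth / (num_params * bytes_per_param)


def decode_tokens_per_s_via_roofline(num_params: float, bytes_per_param: float,
                                     bandwidth: float) -> float:
    """Same bound, derived from the bandwidth ceiling on FLOPs/s."""
    intensity = 2.0 / bytes_per_param      # FLOPs per byte at batch=1
    ceiling_flops = intensity * bandwidth  # bandwidth-bound ceiling
    flops_per_token = 2.0 * num_params     # one multiply-add per weight
    return ceiling_flops / flops_per_token


# Hypothetical model: 8e9 parameters, FP16 (2 bytes per parameter).
bound = decode_tokens_per_s_bound(8e9, 2.0, H100_BANDWIDTH)
via_roofline = decode_tokens_per_s_via_roofline(8e9, 2.0, H100_BANDWIDTH)
print(f"{bound:.0f} tokens/s upper bound")
```

Both routes give the same number (about 209 tokens/s here), which is the point: for batch=1 decode the roofline ceiling is simply bandwidth divided by the bytes of weights.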

A prefill of a 4K-token prompt with batch=1 has intensity in the hundreds of FLOPs/byte — at or above the ridge, compute-bound.

3.2 H200 SXM (141 GB HBM3e)

Ridge points: FP16 ≈ 206 FLOPs/byte (989 TFLOPs / 4.80 TB/s) and FP8 ≈ 412 FLOPs/byte.

H200 lowers the ridge point because bandwidth grew without FLOPs growing. Bandwidth-bound kernels can run up to 43% faster on H200 than on H100 at the same precision (4.80 / 3.35 ≈ 1.43). Compute-bound kernels are unchanged: H200's win is decode, not prefill. This is why H200 is the right pick for decode-heavy chat workloads, while H100 delivers the same ceiling for compute-bound embedding / batch workloads.
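
How much a kernel gains from the extra bandwidth depends on where it sits relative to both ridge points. The sketch below compares roofline ceilings only (not measured speedups), using the H100 and H200 FP16 figures from this lecture; the three sample intensities are arbitrary illustrations.

```python
H100 = {"peak": 989e12, "bw": 3.35e12}
H200 = {"peak": 989e12, "bw": 4.80e12}


def ceiling(intensity: float, gpu: dict) -> float:
    """Roofline ceiling in FLOPs/s for one GPU."""
    return min(intensity * gpu["bw"], gpu["peak"])


def ceiling_speedup(intensity: float, old: dict, new: dict) -> float:
    """Ratio of roofline ceilings at a fixed arithmetic intensity."""
    return ceiling(intensity, new) / ceiling(intensity, old)


for intensity in (1.0, 250.0, 1000.0):
    print(f"{intensity:6.0f} FLOPs/byte: "
          f"{ceiling_speedup(intensity, H100, H200):.2f}x ceiling on H200")
```

A kernel between the two ridge points (206 and 295 FLOPs/byte) lands in between: it is bandwidth-bound on H100 but already compute-bound on H200, so it gains less than the full 43%.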

3.3 B200 SXM (192 GB HBM3e)

Ridge points: FP16 ≈ 281 FLOPs/byte (2,250 TFLOPs / 8.0 TB/s), FP8 ≈ 562 FLOPs/byte, and FP4 ≈ 1,125 FLOPs/byte.

Blackwell's FP4 path moves the ridge point dramatically higher — meaning at FP4 the compute ceiling is far above where most kernels sit. For bandwidth-bound decode, the FP4 win is in the X coordinate: arithmetic intensity doubles vs FP8 because bytes per param halve.
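
The effect of precision on a batch=1 decode can be tabulated directly. The sketch below uses the B200 bandwidth and ridge points quoted in this lecture and the 2 / bytes_per_param intensity approximation from section 3.1; it counts weight traffic only, which is an assumption.

```python
# B200 figures from this lecture: HBM3e bandwidth and ridge point per precision.
B200_BANDWIDTH = 8.0e12  # bytes/s
B200_RIDGE = {"FP16": 281.0, "FP8": 562.0, "FP4": 1125.0}  # FLOPs/byte
BYTES_PER_PARAM = {"FP16": 2.0, "FP8": 1.0, "FP4": 0.5}


def decode_intensity(bytes_per_param: float) -> float:
    """Batch=1 decode: about 2 FLOPs per weight, each weight read once."""
    return 2.0 / bytes_per_param


for precision, bpp in BYTES_PER_PARAM.items():
    intensity = decode_intensity(bpp)
    ceiling_tflops = intensity * B200_BANDWIDTH / 1e12
    gap = B200_RIDGE[precision] / intensity
    print(f"{precision}: intensity {intensity:.0f} FLOPs/byte, "
          f"ceiling {ceiling_tflops:.0f} TFLOPs, {gap:.0f}x below the ridge")
```

Under these assumptions the kernel stays about 281× below the ridge at every precision, because the ridge and the intensity double together. What changes is the bandwidth ceiling itself, which doubles with each halving of bytes per parameter.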

3.4 Jetson Orin Nano Super 8 GB (edge cross-reference)

Ridge points: FP16 ≈ 167 FLOPs/byte (~17 TFLOPs / 102 GB/s) and INT8 ≈ 324 OPs/byte.

Notice that the FP16 ridge point on Orin Nano (~167) is actually below Hopper's ~295. That is no comfort, because the binding constraint is the ~102 GB/s LPDDR5 bandwidth, roughly 1/33 of an H100's. A 4B-class model decoding at batch=1 on Orin Nano hits the bandwidth wall extremely hard (this is the entire premise of the Qwen Inference Optimization special course). Batching rarely helps on edge because there is usually only one user, so there are no concurrent requests to batch.

3.5 Cross-GPU summary

GPU                         Peak FP16       Memory BW    FP16 ridge   FP8 ridge       FP4 ridge
Jetson Orin Nano Super 8G   ~17 TFLOPs      102 GB/s     ~167         (INT8: ~324)    -
H100 SXM                    989 TFLOPs      3.35 TB/s    295          591             -
H200 SXM                    989 TFLOPs      4.80 TB/s    206          412             -
B200 SXM                    2,250 TFLOPs    8.0 TB/s     281          562             1,125

Ridge points are in FLOPs/byte (OPs/byte for INT8).

The takeaway: the ridge point is a property of the (GPU, precision) pair. Choose precision to match the regime your kernel is in.
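
The FP16 column of the summary table can be reproduced from the peak and bandwidth columns. The sketch below does only that; for other precisions, substitute the corresponding datasheet peak.

```python
# (peak FP16 FLOPs/s, memory bandwidth in bytes/s) from the summary table.
GPUS = {
    "Jetson Orin Nano Super 8G": (17e12, 102e9),
    "H100 SXM": (989e12, 3.35e12),
    "H200 SXM": (989e12, 4.80e12),
    "B200 SXM": (2250e12, 8.0e12),
}


def ridge_point(peak_flops: float, bandwidth: float) -> float:
    """Arithmetic intensity (FLOPs/byte) at which compute becomes the ceiling."""
    return peak_flops / bandwidth


for name, (peak, bw) in GPUS.items():
    print(f"{name:28s} FP16 ridge ~ {ridge_point(peak, bw):.0f} FLOPs/byte")
```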


4. Mapping a kernel onto a roofline

For each kernel in your workload:

4.1 Compute its arithmetic intensity

intensity = FLOPs_per_call / bytes_read_from_HBM_per_call

For an (M, K) @ (K, N) matmul in FP16, which touches M × K + K × N + M × N matrix elements in total:

FLOPs       = 2 × M × N × K
bytes       = 2 × (M × K + K × N + M × N)   [if everything streams from HBM]
intensity   = M × N × K / (M × K + K × N + M × N)

For a "fat" matmul (M, N, K all large): intensity ≈ MNK / (MK + KN + MN) → grows with the matrix size. For a "skinny" matmul (decode, M=1): intensity ≈ K × N / (K + KN + N) ≈ 1 in FP16. Bandwidth-bound.
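
The fat and skinny cases are easy to check numerically. The sketch below implements the formula above under the same assumption (every operand and the output stream through HBM exactly once); the 4096-sized shapes are arbitrary examples.

```python
def matmul_intensity(m: int, n: int, k: int, bytes_per_elem: float = 2.0) -> float:
    """FLOPs per HBM byte for an (m, k) @ (k, n) matmul.

    Assumes both inputs are read once and the output is written once,
    with no reuse from L2 or shared memory beyond that.
    """
    flops = 2 * m * n * k
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved


fat = matmul_intensity(4096, 4096, 4096)   # prefill-like shape
skinny = matmul_intensity(1, 4096, 4096)   # batch=1 decode shape
print(f"fat:    {fat:8.1f} FLOPs/byte")
print(f"skinny: {skinny:8.3f} FLOPs/byte")
```

For a square matmul of size N the intensity is N / 3 in FP16, so the fat case lands around 1,365 FLOPs/byte, while the skinny case stays just under 1.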

4.2 Locate it on the chart

Draw a vertical line at the kernel's intensity. The achievable ceiling is where that line meets the lower of the two ceilings: intensity × bandwidth if the kernel is below the ridge point, peak FLOPs if it is above.

4.3 Compare to measured

Run the kernel, measure achieved TFLOPs/s. The ratio (achieved / ceiling) is the efficiency.
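
As a sketch of that comparison, the helper below takes a kernel's FLOP count, its measured HBM bytes, and its measured duration, and returns its position on the roofline. The name `place_on_roofline` and the input values are made up for illustration; they are not measurements.

```python
from typing import NamedTuple


class RooflinePoint(NamedTuple):
    intensity: float   # FLOPs/byte
    achieved: float    # FLOPs/s
    ceiling: float     # FLOPs/s
    efficiency: float  # achieved / ceiling


def place_on_roofline(flops: float, dram_bytes: float, duration_s: float,
                      peak_flops: float, bandwidth: float) -> RooflinePoint:
    """Locate one kernel on the roofline and compute its efficiency."""
    intensity = flops / dram_bytes
    achieved = flops / duration_s
    ceiling = min(intensity * bandwidth, peak_flops)
    return RooflinePoint(intensity, achieved, ceiling, achieved / ceiling)


# Illustrative inputs only: 16 GFLOPs, 16 GB read, 8 ms, on H100 FP16 peaks.
point = place_on_roofline(flops=16e9, dram_bytes=16e9, duration_s=8e-3,
                          peak_flops=989e12, bandwidth=3.35e12)
print(f"intensity {point.intensity:.1f} FLOPs/byte, "
      f"efficiency {point.efficiency:.0%}")
```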

4.4 The two failure modes the roofline does not catch

Roofline is necessary but not sufficient:

  1. Scheduler / Python overhead: between kernel launches there is CPU time. Roofline assumes 100% of wall-clock time is spent inside kernels. For very short decode steps (<1 ms), the Python loop can dominate.
  2. Warp execution efficiency: even at high occupancy, divergent branches or poor memory access patterns can mean each warp does less useful work. Nsight Compute reports the average number of active threads per executed instruction (out of 32) as smsp__thread_inst_executed_per_inst_executed.ratio.

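If your kernel reaches 50% of its roofline ceiling where you expected 80% or more, rule out both of these before trying to rewrite the kernel: use Nsight Compute for warp execution efficiency and a timeline profiler such as Nsight Systems for gaps between launches.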
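
The first failure mode can be sized with simple arithmetic before profiling. The sketch below is a toy model: it assumes a fixed CPU-side gap per launch that is not overlapped with GPU work, and all three input values are hypothetical.

```python
def kernel_time_fraction(kernel_time_s: float, num_launches: int,
                         gap_per_launch_s: float) -> float:
    """Fraction of a step's wall-clock time actually spent inside kernels.

    Toy model: every launch adds a fixed, non-overlapped CPU-side gap.
    """
    total = kernel_time_s + num_launches * gap_per_launch_s
    return kernel_time_s / total


# Hypothetical decode step: 0.8 ms of kernel time, 300 launches, 5 us gap each.
fraction = kernel_time_fraction(0.8e-3, 300, 5e-6)
print(f"{fraction:.0%} of the step is kernel time")
```

In this example only about a third of the step is kernel time, so even kernels sitting exactly on the roofline would deliver roughly a third of the predicted throughput.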


5. The roofline mental check

Five questions a roofline-literate engineer asks before changing anything:

  1. What is my kernel's arithmetic intensity? Compute it from the math, not measurement first.
  2. Where is the ridge point on this GPU at this precision?
  3. Is my kernel above or below the ridge? This tells me which ceiling I am up against.
  4. What is my measured efficiency vs the ceiling?
  5. Which of the four levers moves it: precision (changes intensity), batching (changes intensity), fusion (changes bytes), GPU (changes ceiling)?

The answer to question 5 is the entire content of Parts 2 and 3 of this course.


Lab — plot the roofline for your target GPU and three kernels

Goal: produce a single chart, committed to your benchmark repo, with three points on it.

  1. Pick a target GPU — H100, H200, or whatever you have.
  2. Draw the roofline at FP16, FP8, and INT4 for that GPU. Use the datasheet numbers.
  3. Pick three kernels from your benchmark:
    • A decode step at batch=1 on a 4B-class model.
    • A prefill step at 2K tokens on the same model.
    • The FFN gate+up matmul at batch=64 (decode regime, larger batch).
  4. Measure each kernel's duration and HBM bytes read (Nsight Compute → gpu__time_duration.sum + dram__bytes_read.sum), and take its FLOP count from the math in section 4.1. Compute arithmetic intensity and achieved TFLOPs.
  5. Plot the three points on the roofline. Annotate which ceiling each one is bound by.

Pass criterion: the chart goes in your benchmark report. Anyone reading it understands which kernels are bandwidth-bound and which are compute-bound on this GPU.
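
One possible starting point for the chart is sketched below. `roofline_curve` is pure Python; `plot_roofline` assumes matplotlib is installed and is only one way to draw it. The function names, the intensity range, and the output path are illustrative choices, and the kernel points must come from your own measurements.

```python
import math


def roofline_curve(peak_flops, bandwidth, lo=0.25, hi=4096.0, points_per_octave=4):
    """Sample the roofline on a log2-spaced grid of arithmetic intensities."""
    steps = round(math.log2(hi / lo) * points_per_octave)
    xs = [lo * 2 ** (i / points_per_octave) for i in range(steps + 1)]
    ys = [min(x * bandwidth, peak_flops) for x in xs]
    return xs, ys


def plot_roofline(peak_flops, bandwidth, kernels, path="roofline.png"):
    """Draw the roofline plus labelled kernel points and save it to `path`.

    `kernels` maps a label to an (intensity, achieved FLOPs/s) pair.
    """
    import matplotlib
    matplotlib.use("Agg")  # render to a file, no display needed
    import matplotlib.pyplot as plt

    xs, ys = roofline_curve(peak_flops, bandwidth)
    fig, ax = plt.subplots()
    ax.loglog(xs, ys, label="roofline")
    for label, (intensity, achieved) in kernels.items():
        ax.scatter([intensity], [achieved], label=label)
    ax.set_xlabel("arithmetic intensity (FLOPs/byte)")
    ax.set_ylabel("throughput (FLOPs/s)")
    ax.legend()
    fig.savefig(path)


# H100 SXM at FP16, numbers from this lecture.
xs, ys = roofline_curve(989e12, 3.35e12)
```

Call `plot_roofline(989e12, 3.35e12, kernels)` once per precision, passing the three measured points, and commit the saved image alongside the benchmark report.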


Self-check

  1. The H200 has the same peak FP16 TFLOPs as the H100 but 43% more HBM bandwidth. For a workload that is 80% decode (bandwidth-bound) and 20% prefill (compute-bound), what is the expected end-to-end speedup of H200 over H100? Sketch the math.
  2. You run a decode kernel and measure 480 GB/s of HBM read on an H200 (4.8 TB/s peak). What is your bandwidth efficiency? What are two likely reasons the kernel is at this level instead of 80%+?
  3. A teammate proposes doubling batch size from 64 to 128 for a 70B-class decode. The roofline says the kernel is at the compute ceiling at batch=64. Predict the throughput change. Defend in one sentence.
  4. Why does the FP4 ridge point on B200 (~1125 FLOPs/byte) feel "too high to ever hit"? What workload shape would actually reach it?
  5. On Jetson Orin Nano Super 8 GB at INT8 the ridge point is ~324 OPs/byte. A Qwen3-4B decoded at batch=1 has intensity ~2. What is the upper bound on tokens/sec from the roofline alone? What single change (without changing hardware) gets closest to it?


Current as of 2026-06

GPU peak numbers from NVIDIA's published datasheets for H100, H200, B200 SXM. Update if NVIDIA releases corrected peaks or if new precision modes ship (FP6, FP3).

