Long context at 128K on Hopper — KV scaling, YaRN, chunked prefill, prefix sharing
What breaks at 128K and what is the precision recipe at that context?
AI Inference Engineer 2026 — Special Course · Part 2 — Dense Decoder-Only Inference at Hopper
Overview
Both Llama 3.3 70B and Qwen 2.5 72B publish 128K context windows. At 128K the cost shape of inference changes — what dominated TPOT at 4K (weight bandwidth) is joined or overtaken by KV bandwidth and prefill compute. Long-context serving is a distinct engineering problem with its own precision recipes, scheduling choices, and parity bars.
This final Part 2 lecture covers:
- The cost math at 128K — KV memory, KV bandwidth per decode step, prefill FLOPs.
- YaRN context extension — how 128K is unlocked from a base 32K model, and what it costs at inference.
- Chunked prefill at 128K — what it solves and where it breaks.
- FP8 KV cache at scale — when it ships, what parity costs, per-head scaling discipline.
- Prefix sharing on long system prompts — the highest-leverage long-context optimization.
- Long-context evaluation — RULER, needle-in-haystack, what to actually measure.
- Production recipes at 128K on Hopper.
By the end you should be able to deploy either anchor model at 128K context on Hopper, defend the precision recipe with parity numbers from RULER, and ship a benchmark report that lets a teammate reproduce it.
1. The cost math at 128K
Revisiting the KV cache math from Part 1 Lecture 02:
kv_bytes_per_token = 2 × L × num_kv_heads × head_dim × bytes
= 2 × 80 × 8 × 128 × 2 (FP16)
= 320 KB/token
At 128K context, per request:
| Precision | KV cache size |
|---|---|
| FP16 KV | 320 KB × 131072 = 42 GB |
| FP8 KV | 21 GB |
| INT4 KV | 10.5 GB |
This is for one request. At batch=8 with FP16 KV: 336 GB. That already exceeds the 320 GB total HBM of 4× H100 80G, before any weights or activations are placed.
At 128K context, batch size is HBM-limited far before it is compute-limited.
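The sizing arithmetic above can be reproduced with a short sketch. The function and constant names are illustrative and not taken from any library; the model shape (80 layers, 8 KV heads, head_dim 128) is the one used in the formula above, and INT4 is assumed to cost 0.5 bytes per element with no allowance for scale metadata.

```python
# Sketch of the KV sizing math; all names here are illustrative assumptions.
LAYERS, KV_HEADS, HEAD_DIM = 80, 8, 128
CONTEXT = 131072  # 128K tokens

BYTES_PER_ELEM = {"fp16": 2.0, "fp8": 1.0, "int4": 0.5}


def kv_bytes_per_token(dtype: str) -> float:
    # The leading factor 2 covers both K and V.
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_PER_ELEM[dtype]


def kv_gb_per_request(dtype: str, context: int = CONTEXT) -> float:
    # Decimal gigabytes (1e9 bytes).
    return kv_bytes_per_token(dtype) * context / 1e9


def max_requests(hbm_gb: float, reserved_gb: float, dtype: str) -> int:
    # Upper bound only: ignores activations and allocator fragmentation.
    return int((hbm_gb - reserved_gb) // kv_gb_per_request(dtype))


for dtype in BYTES_PER_ELEM:
    print(f"{dtype}: {kv_gb_per_request(dtype):.1f} GB per 128K request")
```

In decimal gigabytes the FP16 figure comes out slightly under 43 GB (exactly 40 GiB); the table above uses the rounded-down 42 GB.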
1.1 KV bandwidth per decode step
At each decode step:
kv_read_bytes = 2 × L × num_kv_heads × head_dim × seq_len × bytes
= kv_bytes_per_token × seq_len
= 320 KB × 131072
= 42 GB
On H200 (4.8 TB/s HBM3e), the KV read alone takes:
kv_read_time = 42 GB / 4.8 TB/s ≈ 8.7 ms
Plus the weight read (~30 ms at FP16 for 140 GB / 4.8 TB/s, ~7.3 ms at INT4 for 35 GB). So decode at 128K context on a single H200, FP16 KV, INT4 weights:
TPOT ≈ kv_read_time + weight_read_time + compute + overhead
≈ 8.7 + 7.3 + small + small
≈ 16 ms per token
versus ~8 ms at 4K context (same weight read, negligible KV term). TPOT roughly doubles at 128K because the KV bandwidth becomes a co-dominant cost.
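A bandwidth-only estimate of this TPOT can be sketched as follows. The function name is hypothetical, and the model deliberately ignores compute and launch overhead, which the text above treats as small.

```python
def decode_tpot_ms(kv_gb: float, weight_gb: float, hbm_tb_per_s: float) -> float:
    """Bandwidth-only TPOT estimate; compute and overhead are ignored."""
    # GB divided by TB/s yields milliseconds directly.
    return (kv_gb + weight_gb) / hbm_tb_per_s


# Figures from the text: 42 GB FP16 KV at 128K, 35 GB INT4 weights, H200 at 4.8 TB/s.
tpot_128k = decode_tpot_ms(kv_gb=42.0, weight_gb=35.0, hbm_tb_per_s=4.8)

# At 4K context the KV term shrinks by 32x (4096 / 131072).
tpot_4k = decode_tpot_ms(kv_gb=42.0 * 4096 / 131072, weight_gb=35.0, hbm_tb_per_s=4.8)

print(f"128K: {tpot_128k:.1f} ms, 4K: {tpot_4k:.1f} ms")
```

Swapping in 21 GB for the KV term gives the FP8 KV variant of the same estimate.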
1.2 Prefill cost at 128K
prefill_flops ≈ 2 × P × prompt_tokens
= 2 × 70 × 10^9 × 131072
≈ 1.83 × 10^16 FLOPs
= 18.3 PFLOPs
On 4× H100 at FP16 (4 × 989 TFLOP/s dense per GPU = 3.96 PFLOP/s aggregate; at an assumed 80% efficiency, 3.17 PFLOP/s effective):
prefill_time ≈ 18.3 / 3.17 ≈ 5.8 seconds
Almost six seconds of prefill before the first token decodes. For an interactive product this is unshippable without chunked prefill or prefix cache. (This counts only the weight-matmul FLOPs — at 128K the prefill attention term is of comparable size, see §1.3, so budget ~10 s wall-clock.)
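The estimate is sensitive to the efficiency assumption, so it helps to keep it as a parameter. A minimal sketch, with a hypothetical function name and only the weight-matmul term modeled:

```python
def prefill_seconds(params: float, prompt_tokens: int, gpus: int,
                    peak_tflops_per_gpu: float, efficiency: float) -> float:
    """Weight-matmul prefill time; the attention term is not included."""
    flops = 2.0 * params * prompt_tokens
    effective_flops_per_s = gpus * peak_tflops_per_gpu * 1e12 * efficiency
    return flops / effective_flops_per_s


# Figures from the text: 70B parameters, 128K prompt, 4x H100 at 989 TFLOP/s, 80% efficiency.
baseline = prefill_seconds(70e9, 131072, gpus=4, peak_tflops_per_gpu=989, efficiency=0.8)
print(f"{baseline:.1f} s")
```

Re-running with a lower efficiency value shows how quickly the six-second figure grows on a less well-tuned stack.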
1.3 Attention cost at 128K
Attention is softmax(Q · K^T) · V. At 128K context with batch=1:
qk_flops = 2 × q_heads × q_seq × kv_seq × head_dim
= 2 × 64 × 1 × 131072 × 128 (for one decode step)
≈ 2.1 × 10^9 FLOPs
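A single decode-step QK^T is about 2 GFLOPs per layer. Even doubled for the P · V product and summed over 80 layers (~0.34 TFLOPs), it is compute-trivial. But the KV read for attention is 42 GB per decode step (per §1.1). So decode attention at long context is bandwidth-bound.
For prefill (batch=1, prompt=128K), attention is O(prompt² × head_dim × heads). The QK^T term alone is 2 × 64 × 131072 × 131072 × 128 ≈ 281 TFLOPs per layer before causal masking, comparable to the per-layer weight-matmul cost from §1.2 (18.3 PFLOPs / 80 layers ≈ 229 TFLOPs). FlashAttention's O(N) memory footprint (FlashAttention-3 is the Hopper-tuned version) is essential here: a materialized 128K × 128K FP16 score matrix for 64 heads would need roughly 2 TB per layer.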
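The two attention regimes differ only in the query length, which a few lines make explicit. This is a sketch with a hypothetical helper; it counts the QK^T product for one layer and does not apply causal masking.

```python
def qk_flops(q_heads: int, q_len: int, kv_len: int, head_dim: int) -> int:
    """FLOPs for Q @ K^T in one layer (multiply and add counted separately)."""
    return 2 * q_heads * q_len * kv_len * head_dim


decode = qk_flops(q_heads=64, q_len=1, kv_len=131072, head_dim=128)
prefill = qk_flops(q_heads=64, q_len=131072, kv_len=131072, head_dim=128)

print(f"decode:  {decode / 1e9:.1f} GFLOPs per layer")
print(f"prefill: {prefill / 1e12:.0f} TFLOPs per layer")
```

With these shapes the decode term is exactly 2^31 FLOPs and the prefill term is 2^48 FLOPs, a factor of 131072 apart.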
2. YaRN context extension
The two models reach 128K by different mechanisms. Llama 3.1/3.3 70B use Meta's own "rope_type": "llama3" RoPE frequency scaling plus long-context continued pretraining — the 128K window is baked in at training time. Qwen 2.5 72B is the YaRN case: trained at 32K native context and extended to 128K via YaRN (arXiv:2309.00071) applied at inference.
2.1 What YaRN does
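YaRN rescales RoPE frequencies so the model can attend over longer ranges without retraining from scratch. The result: a model trained at a short native window (32K for Qwen 2.5 72B) can serve at 128K with only a small amount of additional fine-tuning, or with none when the scaling is applied at inference as Qwen does.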
The runtime impact is minimal:
- The RoPE matrix is rebuilt with extended frequencies. No code change beyond rope_scaling in config.
- No additional kernel overhead.
- Qwen's published 128K context is YaRN-extended; for max_position_embeddings > base_context, ensure the runtime applies the right scaling.
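The frequency rescaling can be sketched in a few lines, following the "NTK-by-parts" formulation in the YaRN paper. Everything here is an assumption for illustration: the function names are hypothetical, alpha=1 and beta=32 are the paper's suggested ramp bounds, and the RoPE base of 1e6 and scaling factor of 4 are taken to match Qwen 2.5 and should be checked against the official config.json. Runtime implementations may differ in detail.

```python
import math


def yarn_inv_freq(head_dim, base, factor, original_ctx, alpha=1.0, beta=32.0):
    """Per-dimension RoPE frequencies after YaRN-style rescaling (sketch)."""
    inv_freq = []
    for i in range(0, head_dim, 2):
        theta = base ** (-i / head_dim)
        # Number of full rotations this dimension completes over the native window.
        rotations = original_ctx * theta / (2 * math.pi)
        # ramp = 1 keeps the original frequency, ramp = 0 interpolates fully.
        ramp = min(1.0, max(0.0, (rotations - alpha) / (beta - alpha)))
        inv_freq.append((1 - ramp) * theta / factor + ramp * theta)
    return inv_freq


def yarn_attention_scale(factor):
    """Scale applied to attention logits via q and k, per the paper's fit."""
    return 0.1 * math.log(factor) + 1.0


freqs = yarn_inv_freq(head_dim=128, base=1e6, factor=4.0, original_ctx=32768)
```

High-frequency dimensions (many rotations inside the native window) are left untouched, while low-frequency dimensions are divided by the scaling factor. That is why the change costs nothing at kernel level: only the precomputed frequency table differs.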
2.2 Where YaRN can go wrong at inference
- Wrong scaling factor: if the runtime applies a different YaRN scaling than the model was fine-tuned with, long-context recall regresses. Always use the official rope_scaling from config.json.
- Quantization interaction: rescaled RoPE values have a wider numeric range; FP8 KV per-tensor scaling may clip. Per-head scaling avoids this.
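A pre-deployment sanity check on the loaded config can catch the wrong-scaling case before any benchmark is run. The sketch below is a hypothetical helper, not part of any runtime. The key names (type or rope_type, factor, original_max_position_embeddings) are assumed to follow the form used in Hugging Face style configs and should be verified against the model's official config.json.

```python
def check_yarn_config(config: dict, target_ctx: int) -> list:
    """Return human-readable problems; an empty list means nothing was flagged."""
    scaling = config.get("rope_scaling")
    if not scaling:
        return ["rope_scaling missing: runtime will assume the native window"]
    problems = []
    # Newer configs use "rope_type"; older ones use "type".
    kind = scaling.get("rope_type", scaling.get("type"))
    if kind != "yarn":
        problems.append(f"expected yarn scaling, found {kind!r}")
    factor = scaling.get("factor")
    native = scaling.get("original_max_position_embeddings")
    if factor is None or native is None:
        problems.append("factor or original_max_position_embeddings missing")
    elif factor * native < target_ctx:
        problems.append(f"factor {factor} x native {native} does not reach {target_ctx}")
    return problems


# Example values assumed for a 32K-native model extended 4x.
example = {"rope_scaling": {"type": "yarn", "factor": 4.0,
                            "original_max_position_embeddings": 32768}}
print(check_yarn_config(example, target_ctx=131072))
```

The check only validates internal consistency. It cannot tell whether the factor matches what the model was tuned with, which is why the official config remains the source of truth.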
2.3 Long-context quality at 128K
Published benchmarks (RULER, NIAH):
| Model | 4K | 32K | 64K | 128K |
|---|---|---|---|---|
| Llama 3.3 70B | ~96 | ~92 | ~89 | ~80 (sharper drop) |
| Qwen 2.5 72B | ~96 | ~93 | ~91 | ~85 |
Qwen 2.5 72B's long-context behavior at 128K is generally stronger than Llama 3.3 70B in published RULER numbers, though both meaningfully degrade vs. their short-context performance. For products that rely on accurate retrieval at 100K+, this is a model-selection signal independent of inference engineering.
3. Chunked prefill at 128K
The prefill cost calculation (§1.2) showed 5.8 seconds for a 128K prefill on 4× H100. Chunked prefill makes this practical.
3.1 The idea
Split the 128K prefill into chunks (e.g., 4K each). Process one chunk per "step" of continuous batching. Decode requests share these steps:
step 1: prefill chunk 1 (4K tokens) + decode steps for other requests
step 2: prefill chunk 2 (4K tokens) + decode steps
...
step 32: prefill chunk 32 (4K tokens) → first token of new request ready
Total prefill compute is unchanged (~5.8 seconds of GPU work), and wall-clock grows only by per-step overhead (§3.3). But:
- Other requests' decode is not blocked during the long prefill.
- TTFT for the long-prompt request increases, but throughput overall stays high.
- GPU utilization stays smooth because each step has a mix of compute (chunked prefill) and bandwidth (decodes).
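The step structure can be sketched with a toy scheduler. This is a simplified model under stated assumptions, not vLLM's actual scheduler: each step has a fixed token budget, every in-flight decode row consumes one token of it, and the remainder goes to the next prefill chunk. The function name is hypothetical.

```python
def chunked_prefill_steps(prompt_tokens: int, token_budget: int, decode_rows: int):
    """Return the (start, end) prompt range processed in each scheduler step."""
    chunk = token_budget - decode_rows  # each decode row costs one token of budget
    if chunk <= 0:
        raise ValueError("token budget is consumed entirely by decode rows")
    return [(start, min(start + chunk, prompt_tokens))
            for start in range(0, prompt_tokens, chunk)]


# 128K prompt, 4K budget, no competing decodes: the 32 steps shown above.
idle = chunked_prefill_steps(131072, token_budget=4096, decode_rows=0)

# The same prompt sharing each step with 64 decode rows needs one extra step.
busy = chunked_prefill_steps(131072, token_budget=4096, decode_rows=64)

print(len(idle), len(busy))
```

The design point is that decode rows are cheap in budget terms, so even a busy replica adds few steps to the long prefill.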
3.2 vLLM configuration
from vllm import LLM

llm = LLM(
    ...,  # model and other engine arguments are elided in the source
    enable_chunked_prefill=True,
    max_num_batched_tokens=8192,  # total tokens per step (prefill + decode)
)
max_num_batched_tokens=8192 means each step processes 8K tokens of work, mixing prefill chunks with decode rows.
3.3 The tradeoff
- Without chunked prefill: long prefill blocks the whole replica for ~5.8 seconds. Other users see stalled decodes and a TTFT spike.
- With chunked prefill: long prefill is spread across steps; other users' TTFT stays normal, but the long-prompt user's TTFT extends to ~6.5 seconds (longer wall-clock due to per-step overhead).
For chat / agent products, chunked prefill is the default — better p99 across users matters more than fastest individual TTFT.
For batch products, disabling chunked prefill can be slightly faster because the per-step overhead is removed and you can spend all of each step on one task.
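The TTFT cost of chunking can be modeled as a fixed overhead per scheduler step. The sketch below back-solves that overhead from the two figures quoted in this section (5.8 s unchunked, ~6.5 s chunked over 32 steps). The resulting value of roughly 22 ms per step is an inference from those numbers, not a measurement, and the function names are hypothetical.

```python
def implied_step_overhead_ms(unchunked_s: float, chunked_s: float, steps: int) -> float:
    """Per-step overhead implied by an observed TTFT difference."""
    return (chunked_s - unchunked_s) * 1000.0 / steps


def chunked_ttft_s(unchunked_s: float, steps: int, overhead_ms: float) -> float:
    """TTFT when the same prefill work is spread over a number of steps."""
    return unchunked_s + steps * overhead_ms / 1000.0


overhead = implied_step_overhead_ms(5.8, 6.5, steps=32)

# Larger chunks mean fewer steps and less added TTFT for the long prompt.
ttft_16k_chunks = chunked_ttft_s(5.8, steps=8, overhead_ms=overhead)

print(f"overhead per step: {overhead:.1f} ms, TTFT at 16K chunks: {ttft_16k_chunks:.2f} s")
```

The model leaves out the other side of the trade: larger chunks also lengthen each step, which raises inter-token latency for the decodes sharing it.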
4. FP8 KV cache at scale
At 128K context, FP8 KV is the practical default: it cuts KV HBM from 42 GB to 21 GB per request.
4.1 Per-head scaling discipline
Not all attention heads have the same KV magnitude distribution:
- Some heads specialize in rare-token positions (e.g., BOS, code-delimiters).
- Their KV values have outliers that FP8 per-tensor scaling clips, producing measurable parity loss on long-context retrieval.
Per-head scaling fixes this. Each head gets its own FP8 scale factor.
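The effect of sharing one scale across heads can be shown with a small simulated E4M3 quantizer. This is a sketch under stated assumptions: the quantizer is a pure-Python approximation of FP8 E4M3 rounding (3 mantissa bits, saturation at 448), the head values are made up for illustration, and the shared scale is derived from the global maximum. With that choice the outlier head sets the scale and the quiet head falls into FP8's subnormal range. If the shared scale were instead calibrated to the quiet heads, the same mismatch would appear as clipping of the outliers, which is the failure described above.

```python
import math

E4M3_MAX = 448.0  # largest finite FP8 E4M3 magnitude


def fake_quant_e4m3(x: float) -> float:
    """Round x to the nearest E4M3 value (3 mantissa bits, saturating)."""
    if x == 0.0:
        return 0.0
    mag = min(abs(x), E4M3_MAX)
    # Exponents below -6 fall into the subnormal range with a fixed step.
    exponent = max(math.floor(math.log2(mag)), -6)
    step = 2.0 ** (exponent - 3)
    return math.copysign(min(round(mag / step) * step, E4M3_MAX), x)


def quant_dequant(values, scale):
    return [fake_quant_e4m3(v / scale) * scale for v in values]


def mean_rel_error(values, restored):
    return sum(abs(v - r) / abs(v) for v, r in zip(values, restored)) / len(values)


# Illustrative values only: one quiet head and one head with large outliers.
heads = {
    "quiet": [0.003, -0.005, 0.0011, 0.0042],
    "outlier": [120.0, -448.0, 75.0, 20.0],
}

# Per-tensor: one scale from the global maximum, dominated by the outlier head.
tensor_scale = max(abs(v) for h in heads.values() for v in h) / E4M3_MAX
per_tensor_err = mean_rel_error(heads["quiet"], quant_dequant(heads["quiet"], tensor_scale))

# Per-head: the quiet head gets a scale fitted to its own range.
head_scale = max(abs(v) for v in heads["quiet"]) / E4M3_MAX
per_head_err = mean_rel_error(heads["quiet"], quant_dequant(heads["quiet"], head_scale))

print(f"quiet head error: per-tensor {per_tensor_err:.3f}, per-head {per_head_err:.3f}")
```

The cost of per-head scaling is one extra scale factor per KV head per layer, which is negligible next to the cache itself.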
vLLM 0.22+: kv_cache_dtype="fp8_e4m3" with per-head scaling enabled by default (scale factors apply to the E4M3 format; fp8_e5m2 is a plain cast without scales).
SGLang 0.5+: per-head FP8 KV via --kv-cache-dtype fp8.
TRT-LLM: per-head FP8 KV is the default when kv_cache_dtype=fp8.
4.2 Parity at 128K
For Llama 3.3 70B with FP8 KV at 64K and 128K:
| Eval | FP16 KV | FP8 KV per-tensor | FP8 KV per-head |
|---|---|---|---|
| RULER 64K | 89.5 | 86.2 (-3.3) | 89.0 (-0.5) |
| RULER 128K | 80.1 | 75.3 (-4.8) | 79.4 (-0.7) |
Per-tensor FP8 KV loses 3-5 pp at long context. Per-head loses 0.5-0.7 pp. Per-head is the production recipe.
For Qwen 2.5 72B numbers are similar; both models behave well under per-head FP8 KV.
4.3 INT4 KV at 128K — usually too aggressive
INT4 KV at 128K typically loses 5-10 pp on RULER. Acceptable for products where the model doesn't need to recall details from the long context (e.g., short-output summarization). Unacceptable for retrieval-style tasks.
Decision rule: if the workload's value depends on accurate long-context retrieval, ship FP8 KV. INT4 KV is only for highly tolerant workloads or extreme HBM pressure.
5. Prefix sharing on long system prompts
The highest-leverage long-context optimization. Most long-context products have one of:
- Long system prompt (entire codebase, document, manual) shared across all user turns.
- Stored conversation with growing chat history.
- RAG context retrieved per query, partially overlapping across queries.
For all three, prefix caching cuts prefill cost by 70-95%.
5.1 The system-prompt case
A 100K-token system prompt + 1K user turn:
- Without prefix cache: prefill 101K tokens (~5 seconds on 4× H100).
- With prefix cache (after first request): prefill 1K tokens (~50 ms).
100× faster prefill for any subsequent request that shares the system prompt.
5.2 The conversation case
A growing chat session, 50K tokens of history + 500-token new user turn:
- Without prefix cache: prefill 50.5K each turn.
- With prefix cache: prefill 500 each turn (system prompt + previous turns are cached).
10-100× faster per turn after the first.
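The savings in both cases reduce to the ratio of total prompt tokens to uncached tokens. A minimal sketch with hypothetical helper names follows. It is a token-count ratio only and therefore an upper bound on real speedup, because the new tokens still attend over the full cached prefix.

```python
def uncached_tokens(prompt_tokens: int, cached_prefix_tokens: int) -> int:
    """Tokens that still need prefill once the shared prefix is served from cache."""
    return prompt_tokens - min(cached_prefix_tokens, prompt_tokens)


def prefill_reduction(prompt_tokens: int, cached_prefix_tokens: int) -> float:
    """Ratio of full prefill work to remaining prefill work, by token count."""
    return prompt_tokens / uncached_tokens(prompt_tokens, cached_prefix_tokens)


# System-prompt case: 100K shared prompt + 1K user turn.
system_case = prefill_reduction(101_000, 100_000)

# Conversation case: 50K history + 500-token new turn.
chat_case = prefill_reduction(50_500, 50_000)

print(f"system prompt: {system_case:.0f}x, conversation: {chat_case:.0f}x")
```

Both examples land near 100×. Workloads where the shared prefix is a smaller share of the prompt see proportionally less benefit.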
5.3 The RAG case
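RAG retrieves N docs per query. If the same docs lead the prompt across queries, the cache catches the overlap. SGLang's RadixAttention is especially strong here because the radix tree matches the longest shared token prefix across all cached requests, rather than requiring a whole prior prompt to match. Overlap that is not a prefix (the same docs in a different order, for example) is not reused, so keep the retrieved-document order stable where possible.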
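Prefix caches reuse KV only for a shared leading run of tokens, which makes document ordering matter in RAG prompts. The sketch below uses short integer lists as stand-ins for tokenized segments; the names and values are purely illustrative.

```python
def shared_prefix_len(a, b) -> int:
    """Number of leading tokens two prompts have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


# Stand-in token sequences for a system prompt and three retrieved documents.
system = [1, 2, 3]
doc_a, doc_b, doc_c = [10, 11], [20, 21], [30, 31]

cached = system + doc_a + doc_b
same_order = system + doc_a + doc_c      # shares the system prompt and doc_a
reordered = system + doc_b + doc_a       # same documents as cached, different order

print(shared_prefix_len(cached, same_order), shared_prefix_len(cached, reordered))
```

The reordered prompt contains exactly the same documents as the cached one, yet only the system prompt is reusable.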
5.4 Cache eviction
Prefix cache lives in HBM; eviction policy matters at long context. vLLM uses LRU; SGLang uses LRU with tree-structure awareness (keeps shared prefixes longer).
A common production tuning: set the KV cache block size slightly larger (e.g., 32 tokens instead of 16; the option is block_size in vLLM) to give the cache better hit rates on long prefixes. Trade-off: coarser match and eviction granularity.
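The interaction between block size, prefix matching, and LRU eviction can be sketched with a toy block cache. This is an illustrative model, not the implementation of vLLM or SGLang: the class name is hypothetical, each block key chains the parent key with the block's tokens so a block only matches after the same prefix, and only full blocks are cached.

```python
from collections import OrderedDict


class BlockPrefixCache:
    """Toy prefix cache: chained block keys with plain LRU eviction."""

    def __init__(self, capacity_blocks: int, block_size: int = 16):
        self.capacity = capacity_blocks
        self.block_size = block_size
        self.blocks = OrderedDict()  # key -> placeholder for a KV block

    def _keys(self, tokens):
        keys, parent = [], None
        full = len(tokens) - len(tokens) % self.block_size  # drop the partial tail
        for i in range(0, full, self.block_size):
            parent = hash((parent, tuple(tokens[i:i + self.block_size])))
            keys.append(parent)
        return keys

    def lookup(self, tokens) -> int:
        """Return how many leading tokens are covered by cached blocks."""
        hit = 0
        for key in self._keys(tokens):
            if key not in self.blocks:
                break
            self.blocks.move_to_end(key)  # mark as recently used
            hit += self.block_size
        return hit

    def insert(self, tokens) -> None:
        for key in self._keys(tokens):
            self.blocks[key] = True
            self.blocks.move_to_end(key)
            while len(self.blocks) > self.capacity:
                self.blocks.popitem(last=False)  # evict least recently used


cache = BlockPrefixCache(capacity_blocks=4, block_size=4)
cache.insert(list(range(16)))
print(cache.lookup(list(range(16))), cache.lookup(list(range(10))))
```

With plain LRU a tail block can outlive the head block it depends on, at which point it can never be matched again. That is the motivation for tree-aware policies that evict from the leaves first.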
6. Long-context evaluation — what to measure
The parity story for long-context inference:
6.1 RULER (arXiv:2404.06654)
A suite of synthetic tasks at controlled lengths (4K, 8K, 16K, 32K, 64K, 128K). Tasks:
- Single needle in haystack (find one fact among irrelevant text).
- Multi-needle (find N facts).
- Variable tracking (track values across long text).
- Frequent-word extraction.
RULER is the canonical long-context benchmark as of mid-2026.
6.2 Needle in a haystack (NIAH)
Simpler than RULER. Place a needle (e.g., "the secret password is purple-frog-42") at a known position in N tokens of irrelevant context. Query the model to retrieve it.
- Plot accuracy as a function of context length × needle position.
- Common to see "lost in the middle" — accuracy higher at start/end, dips in the middle.
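A NIAH sweep needs only a prompt builder and a grid of (context length, depth) cells. The sketch below is a minimal, model-agnostic version with hypothetical function names. It works on sentences rather than tokens, so hitting an exact token count would additionally require the model's tokenizer.

```python
def build_niah_prompt(filler_sentences, needle: str, depth: float, question: str) -> str:
    """Insert the needle at a relative depth: 0.0 is the start, 1.0 is the end."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0 and 1")
    position = round(depth * len(filler_sentences))
    body = filler_sentences[:position] + [needle] + filler_sentences[position:]
    return " ".join(body) + "\n\n" + question


def niah_grid(context_lengths, depths):
    """Every (context length, depth) cell to evaluate for the heatmap."""
    return [(n, d) for n in context_lengths for d in depths]


filler = [f"Filler sentence number {i}." for i in range(10)]
prompt = build_niah_prompt(
    filler,
    needle="The secret password is purple-frog-42.",
    depth=0.5,
    question="What is the secret password?",
)
cells = niah_grid([16_000, 64_000, 128_000], [0.0, 0.25, 0.5, 0.75, 1.0])
```

Scoring each cell as exact-match on the needle and plotting the grid gives the usual heatmap, where a "lost in the middle" dip shows up as a band at intermediate depths.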
6.3 What to actually measure for parity validation
For a long-context product:
- Pick RULER at 16K, 64K, 128K (or your product's longest expected context).
- Run the FP16-reference recipe twice for noise floor.
- Run each candidate recipe (FP8 KV, INT4 KV, chunked prefill on/off).
- Compute Δ per length.
The pattern: short-context evals like MMLU don't catch long-context regressions. Always validate long-context-specifically.
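The comparison against the noise floor can be sketched as a small report function. The function name and result layout are hypothetical, and the noise floor is taken as the absolute gap between the two reference runs, which is one simple choice among several.

```python
def parity_report(reference_runs, candidates):
    """Compare candidate scores to an FP16 baseline at each context length.

    reference_runs: two dicts {context_length: score} from repeated FP16 runs.
    candidates: {recipe_name: {context_length: score}}.
    """
    run_a, run_b = reference_runs
    report = {}
    for length in run_a:
        baseline = (run_a[length] + run_b[length]) / 2
        noise = abs(run_a[length] - run_b[length])
        for name, scores in candidates.items():
            delta = scores[length] - baseline
            report[(name, length)] = {
                "delta": round(delta, 2),
                "noise": round(noise, 2),
                # Only deltas larger than run-to-run noise count as regressions.
                "significant": abs(delta) > noise,
            }
    return report
```

Feeding it the two FP16 runs and one score dict per candidate recipe yields one row per (recipe, length) pair, which maps directly onto a parity table like the one in §4.2.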
7. Production recipes at 128K
Synthesized for Llama 3.3 70B / Qwen 2.5 72B at 128K context on Hopper:
7.1 Single H200 141G (low concurrency)
Weights: AWQ-INT4 group-128 (35 GB)
Activations: FP16
KV cache: FP8 per-head (21 GB per request at 128K)
HBM headroom: ~85 GB for KV + activations → batch ~3 at 128K, batch ~12 at 32K
Runtime: vLLM 0.22+ V1
Features: continuous batching, paged KV v2, prefix cache, chunked prefill (4K chunks)
Speculation: off (small batch makes it less effective)
7.2 4× H100 SXM (production chat)
Weights: FP8 (TRT-LLM) or BF16 (vLLM)
Activations: FP8 / FP16
KV cache: FP8 per-head
TP: 4
Sequence par.: enabled (long-context activation savings)
Features: continuous batching, paged KV v2, prefix cache, chunked prefill (8K chunks)
Speculation: EAGLE-2 / EAGLE-3 if validated for the workload
Expected: ~600-1000 tok/s/GPU at chat-typical concurrency 32 with 16K mean context
7.3 8× H200 (max throughput / long context)
Weights: FP16 or FP8
KV cache: FP8 per-head
TP: 8 (or 4 with DP=2 for throughput)
Sequence par.: enabled
Features: full stack
Expected: max concurrent users at 128K context; per-GPU throughput lower than TP=4 but absolute throughput highest
Lab — long-context bench for both models
Goal: produce a long-context benchmark report for Llama 3.3 70B and Qwen 2.5 72B.
- Hardware — 4× H100 or H200; ideally both for comparison.
- Runtime — vLLM 0.22+ V1.
- Configurations to bench — for each model:
- FP8 KV, no chunked prefill.
- FP8 KV, chunked prefill.
- INT4 KV (if you want to validate where it breaks).
- Workloads:
- RULER at 16K, 64K, 128K.
- TTFT measurement on 100K-token prefill, then 1K user turn, with and without prefix cache.
- Throughput — chat-shape continuous workload at 32K mean context, concurrency 16.
Pass criterion: you can hand the report to another engineer and they can pick a precision recipe + chunked-prefill config for a 128K-context product, defended by your parity numbers.
Self-check
- At 128K context, FP16 KV is 42 GB per request. On 4× H100 (320 GB HBM), how many concurrent requests can you serve at 128K with INT4 weights (35 GB) and FP16 KV? With FP8 KV?
- RULER 64K parity loses 3.3 pp with per-tensor FP8 KV and 0.5 pp with per-head FP8 KV. Why does the per-head version recover so much?
- Your TTFT at 128K is 5 seconds without prefix cache. After enabling, it drops to 80 ms. What property of the workload allowed the dramatic improvement?
- Chunked prefill at 4K chunks vs 16K chunks: which has better p99 TTFT for a workload of 90% short prompts and 10% 128K prompts? Why?
- For Llama 3.3 70B with FP8 KV per-head at 128K, predict the decode TPOT on H200 (4.8 TB/s). Show the math.
References
- YaRN — arXiv:2309.00071
- RULER — arXiv:2404.06654
- Needle in a haystack (Greg Kamradt's implementation) — github.com/gkamradt/LLMTest_NeedleInAHaystack
- StreamingLLM (long-context window attention) — arXiv:2309.17453
- H2O (KV cache eviction) — arXiv:2306.14048
- MInference (dynamic sparse attention for long-context prefill) — arXiv:2407.02490
- vLLM long-context tuning — docs.vllm.ai/en/latest/serving/openai_compatible_server.html
- SGLang RadixAttention paper — arXiv:2312.07104
Cross-references:
- Part 1 → Lecture 02 — KV cache math
- Phase 5 → GPU Infrastructure → Long-Context-MoE-Foundation-Training → 07 Long-Context Evaluation
Current as of 2026-06
Context extension as published by both model teams (llama3 RoPE scaling for Llama 3.3, YaRN for Qwen 2.5). FP8 KV per-head as the production recipe. Chunked prefill as the default in vLLM 0.22+. Refresh when a fundamentally new long-context attention pattern (e.g., recurrent state, hybrid SSM) ships in either model family.