Production MoE serving — MTP speculation, constrained decode, cost model
What's the full production recipe, and what's the $/MTok at GB200 NVL72 scale?
AI Inference Engineer 2026 — Special Course · Part 3 — MoE Inference at Blackwell
Overview
This is the capstone of the course. The previous Part 3 lectures established the architecture (Lecture 01), the silicon (Lecture 02), the parallelism (Lecture 03), and the serving topology (Lecture 04). This lecture brings them together into a defended, production-shippable inference deployment for DeepSeek V3.1 and Qwen3-MoE 235B-A22B on Blackwell, with a $/MTok cost model that another engineer can reproduce.
Topics:
- MTP as native speculation for DeepSeek — configuration, acceptance rate, throughput.
- EAGLE-3 for Qwen3-MoE — the non-native speculation path with draft model.
- Constrained decoding — XGrammar / Outlines at MoE scale, tool-call use cases.
- The production stack — full configuration for each model on NVL72.
- Cost model —
$/MTokderivation, what each precision / EP / disaggregation choice buys. - Capstone benchmark — the report your final repo must contain.
By the end you should be able to deploy either MoE model to production-shape traffic on a Blackwell cluster, defend every recipe choice with parity and throughput numbers, and produce a $/MTok cost defense for the business.
1. MTP — native speculation for DeepSeek V3.1
DeepSeek V3 introduced multi-token prediction (MTP) as a training-time objective. At inference time, the model emits k+1 logit predictions per forward pass, not just one.
1.1 The inference flow
Standard autoregressive decode:
step 1: emit token N+1
step 2: emit token N+2
step 3: emit token N+3
→ 3 forward passes for 3 tokens
MTP decode (k=3):
step 1: emit token N+1, predict N+2 and N+3
Verification: accept N+1 (always, it's the standard output)
accept N+2 if its prediction matches what step 2 would have produced
accept N+3 if both N+2 accepted AND N+3 matches
→ 1 forward pass for up to 3 tokens
The model's MTP heads produce the predictions. The verification logic (does the predicted token match the standard output) lives in the inference runtime. The acceptance rate determines the effective speedup.
1.2 Acceptance rates
Real-world for DeepSeek V3.1:
- Chat: 65-75% on token N+2; 50-65% on N+3.
- Code generation: 55-65% on N+2 (harder to predict).
- JSON / tool-call: 80-90% (highly predictable structure).
Average effective tokens per pass: ~1.8-2.2 across diverse traffic.
1.3 Configuration
vLLM 0.22+:
LLM(
model="deepseek-ai/DeepSeek-V3.1",
speculative_config={
"method": "deepseek_mtp",
"num_speculative_tokens": 3,
},
...
)
SGLang:
python -m sglang.launch_server \
--model-path deepseek-ai/DeepSeek-V3.1 \
--speculative-algorithm DeepseekMTP \
--speculative-num-steps 3 \
...
1.4 Throughput impact
For DeepSeek V3.1 at FP4 on 16× B200 NVL72 partition, chat workload:
| Decoding | Throughput tok/s/replica |
|---|---|
| Greedy autoregressive | ~6,500 |
| MTP k=2 | ~10,000 (1.5×) |
| MTP k=3 | ~13,000 (2.0×) |
MTP is the largest single throughput win on DeepSeek's deployment stack.
2. EAGLE-3 for Qwen3-MoE
Qwen3-MoE 235B-A22B has no native MTP. The 2026 production speculation path is EAGLE-3 (arXiv:2503.01840) with a separately-trained or distilled draft model.
2.1 EAGLE-3 mechanics
Unlike a separate draft LLM, EAGLE-3 trains a lightweight speculation head that sits on top of the target model. The head reuses the target's hidden states and predicts next-k tokens.
- No separate draft model HBM cost.
- Acceptance rates 75-85% across categories.
- 2.5-3.5× effective decode throughput.
For Qwen3-MoE 235B-A22B, an EAGLE-3 head trained specifically for this model is published by the SGLang community.
2.2 Configuration
SGLang:
python -m sglang.launch_server \
--model-path Qwen/Qwen3-235B-A22B \
--speculative-algorithm EAGLE3 \
--speculative-draft-model-path Qwen/Qwen3-235B-A22B-EAGLE3 \
--speculative-num-steps 3 \
--speculative-eagle-topk 1 \
...
2.3 Throughput impact
For Qwen3-MoE 235B-A22B at FP4 on 8× B200, chat workload:
| Decoding | Throughput tok/s/replica |
|---|---|
| Greedy autoregressive | ~4,500 |
| EAGLE-3 k=3 | ~10,500 (2.3×) |
Comparable to DeepSeek + MTP at a similar effective speedup ratio.
3. Constrained decoding at MoE scale
Structured output (tool calls, JSON, code) requires the model to emit specifically-shaped sequences. Constrained decoding uses a grammar (or regex / FSM) to restrict the sampler to legal tokens only.
3.1 XGrammar vs Outlines
| Library | Approach | Where it lives |
|---|---|---|
| XGrammar | compile grammar to FSM, apply at logit level | SGLang (native), vLLM (integration) |
| Outlines | regex / Pydantic, slower for complex grammars | vLLM, llama.cpp |
XGrammar is the fastest production option in mid-2026. SGLang ships it natively.
3.2 BFCL parity at MoE FP4
The BFCL evaluation lecture framework applies to MoE inference exactly as to dense. At MoE FP4, tool-call accuracy is the parity gate that decides if FP4 ships.
For DeepSeek V3.1:
| Precision | BFCL (simple) | BFCL (multi-turn) | Δ vs BF16 |
|---|---|---|---|
| BF16 reference | 88.5 | 79.2 | — |
| FP8 | 88.1 | 78.7 | -0.4 / -0.5 |
| FP4 | 87.0 | 77.4 | -1.5 / -1.8 |
| FP4 + grammar-constrained decoding | 88.2 | 78.9 | -0.3 / -0.3 |
Grammar-constrained decoding recovers most of the FP4 parity loss on structured-output workloads. This is the recipe for agent products at FP4.
3.3 Configuration
SGLang:
from sglang import RuntimeEndpoint, ChatTemplate
# Define schema
schema = {
"type": "object",
"properties": {
"tool": {"type": "string", "enum": ["home_control", "search", "memory"]},
"args": {"type": "object"},
},
"required": ["tool", "args"],
}
# Request with structured output
response = client.chat.completions.create(
model=endpoint,
messages=...,
response_format={"type": "json_schema", "json_schema": {"schema": schema}},
)
XGrammar applies at the logit-sampling level. Throughput cost is minimal (~5% slower than unconstrained decoding) for well-formed grammars.
4. The production stack
Synthesizing everything from Part 3:
4.1 DeepSeek V3.1 production recipe
Hardware: NVL72 partition, 16× B200 (TP=2 × EP=8)
Weights: FP4 (MX-FP4 via TE2)
Activations: FP8
KV cache: BF16 (MLA-compressed; small, no precision drop needed)
Attention: MLA-aware kernels (SGLang or vLLM 0.22+)
Speculation: MTP k=3
Serving: SGLang 0.5+ V1
Features: continuous batching, paged KV, RadixAttention prefix cache,
chunked prefill (8K chunks), XGrammar (if agent workload)
Disaggregation: enabled if cluster has > 16 GPUs available; SGLang P/D mode
Expected throughput: ~13,000 tok/s/replica (with MTP)
Expected $/MTok: ~$1.9-2.0 raw replica cost (16× B200 @ ~$5.50/GPU-hr — derived in §5.2)
4.2 Qwen3-MoE 235B-A22B production recipe
Hardware: 8× B200 (EP=8) for single-replica; 16-32× for multi-replica
Weights: FP4 (MX-FP4)
Activations: FP8
KV cache: FP8 per-head (GQA, 4 KV heads, 94 layers — meaningful at long context)
Speculation: EAGLE-3 k=3 (with the model-specific draft head)
Serving: SGLang 0.5+ V1
Features: continuous batching, paged KV, prefix cache, chunked prefill,
XGrammar
Disaggregation: usually not worth it at this scale; ship colocated
Expected throughput: ~10,500 tok/s/replica
Expected $/MTok: ~$1.1-1.2 raw replica cost (8× B200 @ ~$5.50/GPU-hr)
5. The cost model
Deriving $/MTok is the final exit-criterion deliverable.
5.1 The formula
$/MTok = (replica_cost_per_hour × 10^6) / (3600 × output_tokens_per_sec)
Inputs:
- replica_cost_per_hour — GPU cost × number of GPUs in a replica + amortized infrastructure (network, storage, scheduler).
- output_tokens_per_sec — measured throughput at the workload's typical concurrency.
5.2 GB200 NVL72 cost model
Approximate 2026 cloud rates:
| GPU class | $/hour (one GPU) |
|---|---|
| H100 SXM | ~$2.50 |
| H200 SXM | ~$3.50 |
| B200 SXM | ~$5.50 |
| B300 SXM | ~$7.00 |
| GB200 (per Blackwell GPU) | ~$5.50 (similar to B200 SXM on hyperscalers) |
A 16-GPU replica of DeepSeek V3.1:
- Hardware: 16 × $5.50 = $88/hour
- Network + scheduler: ~$2/hour
- Total: ~$90/hour
If throughput is 13,000 tok/s/replica (with MTP):
$/MTok = ($90 × 10^6) / (3600 × 13,000)
= $90,000,000 / 46,800,000
≈ $1.92 per million tokens
Now compare: the published rate for DeepSeek V3.1 API ($/MTok) is ~$0.30 input and ~$1.10 output (varies by provider) — below this raw single-replica estimate. That gap tells you real deployments run at much higher utilization and batching than this example assumes, on top of provider-scale economics (prefix caching, traffic mixing across replicas, committed-hardware pricing). The $1.92 is a teaching anchor, not the truth of an optimized fleet.
For the inference engineer's defense: show the raw $/MTok at full utilization. The product team multiplies by overhead.
5.3 What moves the cost model
| Lever | Effect on $/MTok | Range |
|---|---|---|
| FP4 vs FP8 | -30 to -40% | precision drop on weights |
| MTP / EAGLE | -40 to -55% | effective throughput multiplier |
| Disaggregation (NVL72 in-domain) | -20 to -35% | hardware optimization per phase |
| Concurrency 16 → 64 | -50 to -65% | batch amortization |
| Concurrency 64 → 256 | -10 to -20% | diminishing returns |
| EP=8 → EP=16 | -10 to -20% | less per-GPU pressure |
A naive deployment at FP8 + greedy decoding + concurrency 16 + EP=2 might cost $4-6/MTok (greedy alone drops the replica to ~6,500 tok/s, or ~$3.85/MTok; FP8 and the low concurrency push it further). The same model at FP4 + MTP + EP=8 + concurrency 64 hits $2 or less. The optimization stack moves the cost model by 2-3×.
6. The capstone benchmark — what your repo must contain
The final deliverable of this course. Your benchmark repo contains:
6.1 Reproducibility layer
- Exact hardware (GPU SKU, driver, CUDA, cuDNN, FA, TE versions).
- Exact runtime versions and commit hashes.
- Exact model hashes from Hugging Face.
- Calibration data manifests (if AWQ used).
6.2 Parity reports
For each model × precision recipe:
- Reference noise floor (FP16/BF16 across two seeds).
- Candidate Δ on the workload's eval set.
- Failure-mode analysis if parity exceeded budget.
6.3 Throughput tables
For each (model × runtime × precision × EP × concurrency) cell:
- TTFT p50/p95/p99.
- TPOT p50/p95/p99.
- Throughput tok/s/replica and tok/s/GPU.
- Effective $/MTok at the cost model from §5.
6.4 Profiles
At least one Nsight Systems profile for:
- TP all-reduce dominated regime (Part 2).
- EP all-to-all dominated regime (Part 3).
- MTP / EAGLE speculation in flight.
6.5 The narrative
A 3-page markdown summary explaining:
- The workload and SLO targets.
- The recipe choice for each (model, hardware) pair.
- The $/MTok defense.
- What you would change if the cluster scale doubled.
This narrative is what a senior engineer reviews before signing off on the deployment. It is the capstone.
Lab — produce the capstone report
Goal: the final capstone benchmark report for either DeepSeek V3.1 or Qwen3-MoE 235B-A22B.
- Pick one model and one workload class (chat or agent).
- Define the SLO — TTFT, TPOT, throughput targets.
- Hardware — what you have access to (8× B200, NVL72 partition, etc.).
- Build matrix — at least four configurations covering different precision × EP × speculation choices.
- Bench each with parity validation.
- Compute $/MTok for each.
- Write the narrative — recommend a ship recipe with defended numbers.
Pass criterion: another engineer can clone the repo, run make bench, and reproduce your numbers within ±10%.
Self-check
- MTP delivers 2× decode speedup for DeepSeek V3.1 with no extra HBM. EAGLE-3 delivers 2.3× speedup for Qwen3-MoE with a small EAGLE-head HBM cost. Why might you still pick EAGLE-3 for a code-gen workload even if MTP-DeepSeek is available?
- Your XGrammar-constrained decoding adds 5% to TPOT. Your BFCL accuracy improves 4 pp at FP4. Does this trade ship for an agent product?
- At NVL72 single-replica DeepSeek V3.1 EP=64 FP4 with MTP, predict your $/MTok if the per-GPU cost is $5.50/hour and throughput is 20,000 tok/s/replica.
- A product team wants 0.5s TTFT and 30ms TPOT for a chat product. Which model + recipe do you ship: DeepSeek V3.1 NVL72 or Qwen3-MoE 235B-A22B 8× B200? Defend in two sentences.
- The cost model says FP4 saves 35%, MTP saves 50%, EP=16 saves 15%. Why is the total savings not 100%?
References
- DeepSeek V3 MTP — arXiv:2412.19437
- "Multi-token Prediction" — arXiv:2404.19737
- EAGLE-3 — arXiv:2503.01840
- XGrammar — arXiv:2411.15100
- Outlines — github.com/dottxt-ai/outlines
- BFCL — gorilla.cs.berkeley.edu/leaderboard.html
- SGLang DeepSeek serving guide — sgl-project.github.io
- DeepSeek V3.1 official inference guide — github.com/deepseek-ai/DeepSeek-V3
Cross-references:
- Phase 5 → Edge AI → Agent Tool-Dispatch Evaluation with BFCL
- Part 2 → Lecture 05 — Modern serving stack
- Phase 5 → ML Systems Engineering Guide → Capstone Options
Current as of 2026-06
MTP and EAGLE-3 as canonical 2025–2026 speculation paths. XGrammar as the standard constrained-decode library. SGLang as the production serving runtime for MoE. NVL72 cost model from current cloud pricing. Refresh when new speculation methods land or when pricing shifts significantly.
End of Part 3 — End of Course
Congratulations. You have completed AI Inference Engineer 2026.
The proof is the benchmark repo. If another engineer can run it and reproduce your numbers, you are the senior inference engineer the course was designed to produce.
- Back to: AI Inference Engineer 2026 — Overview
- Up: Phase 5 → 7. ML Systems Engineering
- Refresh Log: REFRESH-LOG.md