AI Inference Engineer 2026 — Special Course
From transformer-execution fundamentals to dense-70B on Hopper to MoE-671B on Blackwell: the modern inference stack, end to end.
Part 1 — Fundamentals of AI Inference / MLSys
The mental-model layer of the course. Five lectures that build, in order, the five things every AI inference engineer needs before touching a specific model or hardware target:
- 1.01 The 2026 inference engineer's mental model
What does the role do, day to day, and what metrics decide whether the work was good?
- 1.02 Transformer execution — from tokens to bits
What actually runs on the GPU when a token is generated, and why is decode the bandwidth-bound problem?
- 1.03 Roofline, bandwidth, and the memory hierarchy
Which hardware spec lines move which metric, and which are noise?
- 1.04 The precision stack — FP16 → FP8 → FP4 → INT4
What does each step down in precision cost, what does it buy, and how do we verify accuracy parity with the higher-precision baseline?
- 1.05 The runtime landscape — vLLM, SGLang, TensorRT-LLM, llama.cpp, MLX
Given a workload + hardware + SLO, which runtime do we start with — and why?
Part 2 — Dense Decoder-Only Inference at Hopper
The end-to-end production inference stack for 70B-class dense models on Hopper-class hardware (H100 / H200). Seven lectures, anchored on a side-by-side comparison of two of the most-deployed dense models in 2025–2026:
- 2.01 Anatomy of a 70B-class dense model — Llama 3.3 70B vs Qwen 2.5 72B
What stays the same between these two and what changes? What does each difference cost or buy?
- 2.02 Hopper hardware story — H100, H200, Transformer Engine, FP8
What does Hopper actually provide that Ampere doesn't, and what does H200 add over H100?
- 2.03 Quantizing Llama 3.3 70B and Qwen 2.5 72B — AWQ, GPTQ, QuaRot, SpinQuant, FP8
What precision recipe ships for each model, defended by parity numbers?
- 2.04 Single-node multi-GPU serving — tensor parallelism on 8× H100/H200
How does TP scale, where do the collectives dominate, and what's the runtime-specific config?
- 2.05 Modern serving stack — continuous batching, paged KV, prefix cache, speculation
Which knobs move which metric, on this hardware, on these models?
- 2.06 Long context at 128K on Hopper — KV scaling, YaRN, chunked prefill, prefix sharing
What breaks at 128K and what is the precision recipe at that context?
- 2.07 Inside the communication layer — NCCL, custom all-reduce, the vLLM communicator stack
How does a runtime actually move bytes between GPUs, and which collective path wins at decode?
Part 3 — MoE Inference at Blackwell
The Blackwell-class production inference stack for modern Mixture-of-Experts models. Five lectures, anchored on a side-by-side comparison of the two dominant 2025 open-weights MoE families:
- 3.01 Anatomy of a modern MoE — DeepSeek V3.1 and Qwen3-MoE 235B-A22B
What's the same, what differs, and how does each difference change inference cost?
- 3.02 Blackwell hardware story — B200, B300, GB200 NVL72, TE2, FP4
What does Blackwell silicon provide that Hopper doesn't, and how large is the NVLink domain that NVL72 gives us?
- 3.03 Expert parallelism (EP) and the gating hot path
How is an MoE partitioned across many GPUs, and where does the all-to-all cost dominate?
- 3.04 Disaggregated prefill / decode — Mooncake, Splitwise, DistServe
When does separating prefill GPUs from decode GPUs pay for itself?
- 3.05 Production MoE serving — MTP speculation, constrained decode, cost model
What's the full production recipe, and what's the $/MTok at GB200 NVL72 scale?