Jared Frost

AI Inference Engineer 2026 — Special Course

From transformer-execution fundamentals to dense-70B on Hopper to MoE-672B on Blackwell — the modern inference stack, end to end.

Part 1 — Fundamentals of AI Inference / MLSys

The mental model layer of the course. Five lectures that build, in order, what every AI inference engineer needs before touching a specific model or hardware target:

  1. 1.01
    The 2026 inference engineer's mental model

    What does the role do, day to day, and what metrics decide whether the work was good?

  2. 1.02
    Transformer execution — from tokens to bits

    What actually runs on the GPU when a token is generated, and why is decode the bandwidth-bound problem?

  3. 1.03
    Roofline, bandwidth, and the memory hierarchy

    Which hardware spec lines move which metric, and which are noise?

  4. 1.04
    The precision stack — FP16 → FP8 → FP4 → INT4

    What does each precision floor cost, what does each one buy, and how do we verify parity?

  5. 1.05
    The runtime landscape — vLLM, SGLang, TensorRT-LLM, llama.cpp, MLX

    Given a workload + hardware + SLO, which runtime do we start with — and why?

Part 2 — Dense Decoder-Only Inference at Hopper

The end-to-end production inference stack for 70B-class dense models on Hopper-class hardware (H100 / H200). Seven lectures, anchored on a side-by-side comparison of two of the most-deployed dense models in 2025–2026:

  1. 2.01
    Anatomy of a 70B-class dense model — Llama 3.3 70B vs Qwen 2.5 72B

    What stays the same between these two and what changes? What does each difference cost or buy?

  2. 2.02
    Hopper hardware story — H100, H200, Transformer Engine, FP8

    What does Hopper actually provide that Ampere doesn't, and what does H200 add over H100?

  3. 2.03
    Quantizing Llama 3.3 70B and Qwen 2.5 72B — AWQ, GPTQ, QuaRot, SpinQuant, FP8

    What precision recipe ships for each model, defended by parity numbers?

  4. 2.04
    Single-node multi-GPU serving — tensor parallelism on 8× H100/H200

    How does TP scale, where do the collectives dominate, and what's the runtime-specific config?

  5. 2.05
    Modern serving stack — continuous batching, paged KV, prefix cache, speculation

    Which knobs move which metric, on this hardware, on these models?

  6. 2.06
    Long context at 128K on Hopper — KV scaling, YaRN, chunked prefill, prefix sharing

    What breaks at 128K and what is the precision recipe at that context?

  7. 2.07
    Inside the communication layer — NCCL, custom all-reduce, the vLLM communicator stack

    How does a runtime actually move bytes between GPUs, and which collective path wins at decode?

Part 3 — MoE Inference at Blackwell

The Blackwell-class production inference stack for modern Mixture-of-Experts models. Five lectures, anchored on a side-by-side comparison of the two dominant 2025 open-weights MoE families:

  1. 3.01
    Anatomy of a modern MoE — DeepSeek V3.1 and Qwen3-MoE 235B-A22B

    What's the same, what differs, and how does each difference change inference cost?

  2. 3.02
    Blackwell hardware story — B200, B300, GB200 NVL72, TE2, FP4

    What does Blackwell silicon provide that Hopper doesn't, and how big is NVL72?

  3. 3.03
    Expert parallelism (EP) and the gating hot path

    How is an MoE partitioned across many GPUs, and where does the all-to-all cost dominate?

  4. 3.04
    Disaggregated prefill / decode — Mooncake, Splitwise, DistServe

    When does separating prefill GPUs from decode GPUs pay for itself?

  5. 3.05
    Production MoE serving — MTP speculation, constrained decode, cost model

    What's the full production recipe, and what's the $/MTok at GB200 NVL72 scale?