Anatomy of a modern MoE — DeepSeek V3.1 and Qwen3-MoE 235B-A22B

What's the same, what differs, and how does each difference change inference cost?

AI Inference Engineer 2026 — Special Course · Part 3 — MoE Inference at Blackwell

Overview

Mixture-of-Experts changes the inference economics in one specific way: total parameters and active parameters become two different numbers, and the runtime's job is to use the second one well.

A dense 72B model reads and multiplies all 72B parameters for every token. A 235B-A22B MoE keeps 235B parameters resident in memory but computes with only about 22B of them per token, because the gating network routes each token to a small fraction of the experts. The HBM footprint looks like a 235B model; the FLOPs look like a 22B model. This shifts the bandwidth/compute ratio sharply: MoE decode is even more bandwidth-bound than dense decode at the same active-param scale, because a batch of tokens spreads across many experts and each expert's weights are read for only a few tokens.

This lecture walks the two anchor models side by side:

DeepSeek V3.1 — 671B total / 37B active, MLA attention, MTP head, 256+1 experts, top-8 routing.
Qwen3-MoE 235B-A22B — 235B total / 22B active, GQA attention, 128 experts, top-8 routing.

Topics:

The shared MoE skeleton — what every modern MoE ships.
MLA — DeepSeek's KV compression, the most distinctive 2025 architectural choice.
MTP — DeepSeek's native multi-token prediction, free speculation built into the model.
The expert layer — count, sizing, top-k routing, shared experts.
The two models in concrete numbers — config, total params, active params, KV per token, decode bandwidth ceiling.
What changes in the inference graph compared to dense (Part 2 Lecture 01).
Why MoE inference economics differ — and what the runtime has to do differently.

By the end you should be able to read either model's config and predict its inference cost shape on Blackwell — total HBM footprint, per-token decode bandwidth, per-token compute, and where the runtime has to choose differently from a dense deployment.

1. The shared MoE skeleton

Both DeepSeek V3.1 and Qwen3-MoE 235B-A22B implement the same canonical mixture-of-experts block. The transformer layer changes only in the FFN:

input hidden states (h)
       │
       ▼
   RMSNorm
       │
       ▼
   Attention block   (MLA for DeepSeek; GQA for Qwen3-MoE)
       │
       ▼
   residual add
       │
       ▼
   RMSNorm
       │
       ▼
   ┌─────────────── MoE FFN ─────────────────┐
   │  gating network: g(h) → expert scores   │
   │  topk(scores, k=8) → selected experts   │
   │  for each selected expert e:            │
   │      out_e = expert_e(h)                │
   │  output = sum_e (gate_score_e × out_e)  │
   │  (+ shared expert output, DeepSeek)     │
   └─────────────────────────────────────────┘
       │
       ▼
   residual add → next layer

The compute pattern:

The gating network is a small Linear layer (hidden → num_experts) that runs every step. It is cheap (FLOPs negligible) but introduces a token-routing decision every layer.
Each token activates k experts out of N: top-8 of 256 for DeepSeek, top-8 of 128 for Qwen3-MoE. A token therefore touches 8/256 ≈ 3.1% of the routed experts in DeepSeek and 8/128 = 6.25% in Qwen3-MoE.
The shared expert (DeepSeek only) is a small always-on FFN that runs for every token, bypassing the router. It absorbs the "common knowledge" that all tokens need.
The output is a weighted sum of the selected experts' outputs.

What "active" means in the param count: only the experts the token actually touched count toward the per-token compute. The other 248 (or 120) experts sit in HBM but contribute nothing this step.
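The routing steps in the diagram can be sketched for a single token in plain Python. This is a minimal illustration under stated assumptions, not either model's implementation: the sizes are toy values, the helper names (`matvec`, `swiglu_expert`, `moe_forward`) are hypothetical, and the softmax over the selected scores is only one common way to normalise gate weights.

```python
import math
import random

def matvec(w, x):
    """Multiply a matrix stored as a list of rows by a vector."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def swiglu_expert(x, w_gate, w_up, w_down):
    """One expert FFN: down(silu(gate(x)) * up(x))."""
    gate = [g / (1.0 + math.exp(-g)) for g in matvec(w_gate, x)]  # SiLU
    up = matvec(w_up, x)
    return matvec(w_down, [g * u for g, u in zip(gate, up)])

def moe_forward(x, w_router, experts, k):
    """Route one token through its top-k experts and mix their outputs."""
    scores = matvec(w_router, x)                      # one score per expert
    chosen = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)[:k]
    # Assumption: softmax over the k selected scores. Real models differ here.
    peak = max(scores[e] for e in chosen)
    exps = [math.exp(scores[e] - peak) for e in chosen]
    total = sum(exps)
    weights = [v / total for v in exps]
    out = [0.0] * len(x)
    for e, w in zip(chosen, weights):
        y = swiglu_expert(x, *experts[e])             # only k experts ever run
        out = [o + w * yi for o, yi in zip(out, y)]
    return out, chosen, weights

def rand_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
HIDDEN, INTER, NUM_EXPERTS, TOP_K = 8, 4, 16, 4       # toy sizes, not model sizes
w_router = rand_matrix(NUM_EXPERTS, HIDDEN, rng)
experts = [(rand_matrix(INTER, HIDDEN, rng),          # gate projection
            rand_matrix(INTER, HIDDEN, rng),          # up projection
            rand_matrix(HIDDEN, INTER, rng))          # down projection
           for _ in range(NUM_EXPERTS)]
x = [rng.gauss(0.0, 1.0) for _ in range(HIDDEN)]
out, chosen, weights = moe_forward(x, w_router, experts, TOP_K)
print(f"experts used: {sorted(chosen)} of {NUM_EXPERTS}; gate weights sum to {sum(weights):.3f}")
```

The loop body is the whole point: the other `NUM_EXPERTS - TOP_K` experts are never evaluated for this token, although their weights still occupy memory.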

2. MLA — Multi-head Latent Attention (DeepSeek)

The most distinctive 2024–2025 architectural choice. MLA compresses the KV cache by more than an order of magnitude compared to full MHA, and by ~3.6× compared to the 8-KV-head GQA config worked through in §2.2. This is the single biggest inference win at long context.

2.1 The idea

Standard GQA stores the explicit key/value vectors: per token, K, V ∈ R^(L × h_kv × head_dim), where L is the number of layers. MLA instead stores a small latent vector c_kv ∈ R^(L × d_c) per token, where d_c << h_kv × head_dim.

At attention time, K and V are reconstructed from the latent:

K_t = W_uK · c_t      (W_uK: d_c → h_kv × head_dim, low-rank up-projection)
V_t = W_uV · c_t

The latent c_t is the stored cache. K and V are recomputed on-the-fly each decode step. This is a memory/compute tradeoff: more compute (a small matmul per layer per step) for dramatically less KV memory.

One subtlety: the rotary embedding cannot ride inside the latent. RoPE applies a position-dependent rotation to each key, and that rotation does not commute with the low-rank up-projection W_uK — folding position into the compressed latent would mean re-rotating the cached value for every new query position. DeepSeek's fix is a decoupled rotary key: a small 64-dim key per token (qk_rope_head_dim = 64), rotated once and cached alongside the latent. The latent stays position-free; the rotary key carries the position.
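The cache layout described above can be sketched as a toy per-layer structure. Everything here is illustrative: `LatentKVCache` and its methods are hypothetical names, the dimensions are tiny stand-ins, and the sketch only rebuilds the position-free part of K and V (a real kernel would also combine the cached rotary key with the query's rotary part).

```python
import random

def matvec(w, x):
    """Multiply a matrix stored as a list of rows by a vector."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def rand_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

class LatentKVCache:
    """Toy per-layer cache: stores the latent c_t and the rotary key, not K and V."""

    def __init__(self, w_uk, w_uv):
        self.w_uk = w_uk        # up-projection latent -> keys   (W_uK)
        self.w_uv = w_uv        # up-projection latent -> values (W_uV)
        self.latents = []       # one d_c-dim vector per cached token
        self.rope_keys = []     # one already-rotated key per cached token

    def append(self, c_t, k_rope_t):
        self.latents.append(c_t)
        self.rope_keys.append(k_rope_t)

    def reconstruct(self):
        """Rebuild the position-free K and V for every cached token."""
        keys = [matvec(self.w_uk, c) for c in self.latents]
        values = [matvec(self.w_uv, c) for c in self.latents]
        return keys, values

    def floats_per_token(self):
        return len(self.latents[0]) + len(self.rope_keys[0])

rng = random.Random(0)
D_C, D_ROPE, KV_WIDTH = 4, 2, 12   # toy sizes; KV_WIDTH stands in for heads × head_dim
cache = LatentKVCache(rand_matrix(KV_WIDTH, D_C, rng), rand_matrix(KV_WIDTH, D_C, rng))
for _ in range(3):                  # three decode steps
    cache.append([rng.gauss(0.0, 1.0) for _ in range(D_C)],
                 [rng.gauss(0.0, 1.0) for _ in range(D_ROPE)])
keys, values = cache.reconstruct()
print(cache.floats_per_token(), "floats cached per token vs",
      2 * KV_WIDTH, "for explicit K and V")
```

The stored state grows by `D_C + D_ROPE` values per token, while the full-width K and V exist only transiently inside `reconstruct`.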

2.2 KV bytes per token — MLA vs GQA

For DeepSeek V3 / V3.1:

Spec	Value
`kv_lora_rank` (d_c)	512
`qk_rope_head_dim` (decoupled rotary key)	64
`num_hidden_layers`	61
`bytes_per_element`	2 (FP16/BF16)

mla_kv_bytes_per_token = num_layers × (d_c + qk_rope_head_dim) × bytes
                       = 61 × (512 + 64) × 2
                       = 70,272 bytes
                       ≈ 69 KB / token

For a hypothetical 64-head, 8-KV-head, head-dim-128 GQA model with 61 layers:

gqa_kv_bytes_per_token = 2 (K + V) × 61 × 8 × 128 × 2
                       = 249,856 bytes
                       ≈ 244 KB / token

MLA is ~3.6× smaller than this layer-matched GQA config (249,856 / 70,272 ≈ 3.56). DeepSeek V3 also has fewer layers (61) than the 80 used by 70B-class dense models such as Llama 3.3 70B, so the gap against those models is wider still.

At 128K context, DeepSeek V3.1 needs ~9 GB of KV cache per request, versus ~33 GB for the GQA equivalent. This is what makes long-context serving practical without aggressive KV quantization.
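The two per-token formulas and the 128K totals can be checked with a few lines of Python. The function names are illustrative; the inputs are the values from the tables in this section, and 128K is taken as 128 × 1024 tokens.

```python
def mla_kv_bytes_per_token(num_layers, kv_lora_rank, qk_rope_head_dim, bytes_per_elem=2):
    """Latent plus decoupled rotary key, cached once per layer."""
    return num_layers * (kv_lora_rank + qk_rope_head_dim) * bytes_per_elem

def gqa_kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Explicit K and V (hence the factor 2) for every KV head in every layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem

mla = mla_kv_bytes_per_token(num_layers=61, kv_lora_rank=512, qk_rope_head_dim=64)
gqa = gqa_kv_bytes_per_token(num_layers=61, num_kv_heads=8, head_dim=128)
context = 128 * 1024

print(f"MLA: {mla:,} bytes/token, GQA: {gqa:,} bytes/token, ratio {gqa / mla:.2f}x")
print(f"128K context: MLA {mla * context / 1e9:.2f} GB, GQA {gqa * context / 1e9:.2f} GB")
```

Swapping in another model's layer count, KV head count, and head dimension gives its per-token cost directly from `config.json`.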

2.3 What MLA costs

Extra compute per decode step: the up-projections W_uK · c_t and W_uV · c_t, two small matmuls per layer.
On Blackwell at FP4, these are cheap — the compute fits comfortably in the spare cycles between the FFN matmul and the attention.
The runtime needs MLA-aware kernels — FlashAttention 4 with the MLA shape, or specialized kernels in SGLang's DeepSeek path.

vLLM, SGLang, and TensorRT-LLM all support MLA as of mid-2026, with SGLang having the most mature DeepSeek-specific kernels.

3. MTP — Multi-Token Prediction (DeepSeek)

DeepSeek V3 introduced MTP as a training-time objective: predict the next k tokens, not just the next one. At inference time, the model has learned to also emit predictions for positions +1, +2, ..., +k.

3.1 How it changes inference

Standard decode step:           emits one token
With MTP active:                emits up to k tokens, with confidence scores

Verification:
  At step t, the MTP head has drafted tokens t+1, t+2, t+3 (k=3)
  Accept the longest prefix that the sampling logic does not reject, then continue from the last accepted token

This is native speculative decoding built into the model — no separate draft model needed.
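The accept-or-reject step can be sketched with the simplest possible rule, greedy matching. This is a generic speculative-decoding verifier written for illustration, not DeepSeek's or any runtime's actual sampling logic; `accept_drafts` is a hypothetical helper, and it assumes the verification pass returns the model's own choice at each drafted position plus one extra token.

```python
def accept_drafts(draft_tokens, verified_tokens):
    """Greedy verification: keep drafts until the first disagreement.

    verified_tokens must have len(draft_tokens) + 1 entries: the verifying
    pass's own token at each drafted position, plus one bonus token that
    follows the last draft.
    """
    accepted = []
    for draft, target in zip(draft_tokens, verified_tokens):
        if draft != target:
            accepted.append(target)   # replace the first wrong draft, then stop
            return accepted
        accepted.append(draft)
    # Every draft survived, so the bonus token comes for free.
    accepted.append(verified_tokens[len(draft_tokens)])
    return accepted

print(accept_drafts([5, 9, 2], [5, 9, 2, 7]))   # all three drafts accepted
print(accept_drafts([5, 9, 2], [5, 4, 8, 1]))   # second draft rejected
```

Each call emits at least one token, so a step with zero accepted drafts costs no more tokens than ordinary decoding; the gain comes from the steps where several drafts survive.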

For DeepSeek V3.1:

Acceptance rate (rough): 60-80% at k=3 (sample-dependent).
Effective throughput: 1.6-2.5× decode speedup with no additional draft cost.
Works in vLLM 0.22+ (speculative_config={"method": "deepseek_mtp"}) and SGLang.
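A rough way to relate acceptance rate to tokens per step is the geometric-series estimate below. It is a back-of-envelope sketch under a strong assumption, namely that each draft is accepted independently with the same probability and acceptance stops at the first rejection; `expected_tokens_per_step` is an illustrative helper, and the result is tokens per verification step, not wall-clock speedup.

```python
def expected_tokens_per_step(accept_rate, k):
    """Mean tokens emitted per step with k drafted tokens.

    Assumes independent, identical per-draft acceptance; the i-th draft is
    only counted if all earlier drafts were accepted. The leading 1 is the
    token every step emits regardless.
    """
    return sum(accept_rate ** i for i in range(k + 1))

for rate in (0.6, 0.7, 0.8):
    print(f"accept rate {rate:.0%}, k=3 -> {expected_tokens_per_step(rate, 3):.2f} tokens/step")
```

Measured speedups sit below this estimate because the verification pass is slightly more expensive than a plain decode step and acceptance is not independent in practice.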

3.2 Why this matters for inference engineering

MTP makes DeepSeek's decode throughput on Blackwell roughly equivalent to that of a dense model half the size its active param count would suggest. At constant cost, DeepSeek V3.1 with MTP serves roughly 2× the tokens that a 37B-active model decoding one token per step would (the 1.6-2.5× range from §3.1). That math is what makes DeepSeek's $/MTok economics competitive with closed flagship models in 2025-2026.

3.3 Qwen3-MoE has no native MTP

Qwen3-MoE 235B-A22B was not trained with the MTP objective. Speculative decoding for it relies on an add-on draft head, typically a lightweight EAGLE-3 head (or Medusa heads) that reuses the target model's hidden states. Acceptance rates are similar (70-80% at k=4), but the engineering layer is different: a separately trained head, a separate runtime path, and separate fine-tuning.

This is one of the cleanest "architectural choice → inference recipe" contrasts in the course.

4. The expert layer

4.1 DeepSeek V3.1 expert configuration

From config.json:

Field	Value
`n_routed_experts`	256
`n_shared_experts`	1
`num_experts_per_tok`	8
`moe_intermediate_size`	2048 (per expert FFN)
`routed_scaling_factor`	2.5
`first_k_dense_replace`	3 (first 3 layers are dense, no MoE)

So:

61 total layers; first 3 are standard dense FFN, the remaining 58 use MoE.
Each MoE layer has 256 + 1 = 257 experts.
Each token routes to 8 of 256 routed experts plus the 1 shared expert.
Each expert is a SwiGLU FFN with intermediate size 2048, 9× narrower than the 18432-wide dense FFN used in DeepSeek V3's first three layers.

Param count per MoE layer:

Per expert: 3 × hidden × moe_intermediate (gate + up + down)
          = 3 × 7168 × 2048
          = 44M params
Per layer (routed): 256 × 44M = 11.3B
Per layer (shared): 1 × 44M = 44M
Per layer (gating): hidden × num_experts = 7168 × 256 = 1.8M
Per layer total: ~11.3B

58 MoE layers + 3 dense layers + attention layers + embed/LM head
≈ 671B total params

Per-token active params:

Per token, per MoE layer:
  Attention: ~187M (MLA projections: Q down/up, KV down/up, output)
  8 routed experts × 44M + 1 shared × 44M + gating = ~352M + ~44M + ~2M ≈ ~398M
  Per-layer active: ~585M

58 MoE layers × 585M + 3 dense layers × (396M dense FFN + 187M attention) + LM head (~0.9B)
≈ 33.9B + 1.8B + 0.9B ≈ 37B active params per token

Which matches the published "37B active."
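The total and active tallies can be reproduced from config fields with a short script. Treat it as a sketch: the dictionary mirrors the fields quoted in this lecture, while `intermediate_size`, `qk_nope_head_dim`, and `v_head_dim` are values assumed from the public DeepSeek V3 config rather than from the tables here. Norm weights and the MTP module are ignored, and "active" counts the LM head but not the embedding lookup.

```python
cfg = {
    "hidden_size": 7168, "num_hidden_layers": 61, "first_k_dense_replace": 3,
    "n_routed_experts": 256, "n_shared_experts": 1, "num_experts_per_tok": 8,
    "moe_intermediate_size": 2048, "vocab_size": 129280,
    "num_attention_heads": 128, "q_lora_rank": 1536,
    "kv_lora_rank": 512, "qk_rope_head_dim": 64,
    # Assumed from the public config, not listed in this lecture's tables:
    "intermediate_size": 18432, "qk_nope_head_dim": 128, "v_head_dim": 128,
}

def mla_attention_params(c):
    """Weights in one MLA block: Q down/up, KV down/up, output projection."""
    h, heads = c["hidden_size"], c["num_attention_heads"]
    qk_dim = c["qk_nope_head_dim"] + c["qk_rope_head_dim"]
    q = h * c["q_lora_rank"] + c["q_lora_rank"] * heads * qk_dim
    kv_down = h * (c["kv_lora_rank"] + c["qk_rope_head_dim"])
    kv_up = c["kv_lora_rank"] * heads * (c["qk_nope_head_dim"] + c["v_head_dim"])
    out = heads * c["v_head_dim"] * h
    return q + kv_down + kv_up + out

def count_params(c):
    """Return (total, active-per-token) parameter counts."""
    h = c["hidden_size"]
    expert = 3 * h * c["moe_intermediate_size"]        # gate + up + down
    dense_ffn = 3 * h * c["intermediate_size"]
    router = h * c["n_routed_experts"]
    attn = mla_attention_params(c)
    embed = c["vocab_size"] * h
    n_dense = c["first_k_dense_replace"]
    n_moe = c["num_hidden_layers"] - n_dense
    moe_total = (c["n_routed_experts"] + c["n_shared_experts"]) * expert + router
    moe_active = (c["num_experts_per_tok"] + c["n_shared_experts"]) * expert + router
    dense_layers = n_dense * (dense_ffn + attn)
    total = n_moe * (moe_total + attn) + dense_layers + 2 * embed   # embed + LM head
    active = n_moe * (moe_active + attn) + dense_layers + embed     # LM head only
    return total, active

total, active = count_params(cfg)
print(f"attention per layer: {mla_attention_params(cfg) / 1e6:.0f}M")
print(f"total: {total / 1e9:.1f}B, active per token: {active / 1e9:.1f}B")
```

Under these assumptions the script lands at roughly 671B total and a little under 37B active, with the attention block contributing about 187M weights per layer.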

4.2 Qwen3-MoE 235B-A22B expert configuration

From config.json:

Field	Value
`num_experts`	128
`num_experts_per_tok`	8
`moe_intermediate_size`	1536
(no shared experts)	—

So:

All layers are MoE (no first-k-dense-replace).
128 experts per layer, top-8 routing.
Per-expert FFN intermediate 1536.
No shared expert.

The result: ~235B total params, ~22B active per token.
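The same tally for Qwen3-MoE is simpler because every layer is MoE and attention is plain GQA. The sketch below uses the expert fields above together with the layer count, hidden size, head counts, head dimension, and vocabulary quoted in §5.2; it assumes untied input and output embeddings and ignores norm weights. The function names are illustrative.

```python
cfg = {
    "hidden_size": 4096, "num_hidden_layers": 94,
    "num_experts": 128, "num_experts_per_tok": 8, "moe_intermediate_size": 1536,
    "num_attention_heads": 64, "num_key_value_heads": 4, "head_dim": 128,
    "vocab_size": 151936,
}

def gqa_attention_params(c):
    """Q and output projections at full width, K and V at KV-head width."""
    h, hd = c["hidden_size"], c["head_dim"]
    q_width = c["num_attention_heads"] * hd
    kv_width = c["num_key_value_heads"] * hd
    return h * q_width + 2 * h * kv_width + q_width * h

def count_params(c):
    """Return (total, active-per-token) parameter counts."""
    h = c["hidden_size"]
    expert = 3 * h * c["moe_intermediate_size"]       # gate + up + down
    router = h * c["num_experts"]
    attn = gqa_attention_params(c)
    embed = c["vocab_size"] * h
    layers = c["num_hidden_layers"]
    total = layers * (c["num_experts"] * expert + router + attn) + 2 * embed
    active = layers * (c["num_experts_per_tok"] * expert + router + attn) + embed
    return total, active

total, active = count_params(cfg)
print(f"total: {total / 1e9:.1f}B, active per token: {active / 1e9:.1f}B")
```

With these inputs the total comes out near 235B and the active count near 22B, matching the model's name.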

4.3 What this means for the runtime

The runtime has to:

Hold all experts in HBM — both models need their full expert weights resident across the cluster.
Route every token through the gating network — small but per-layer per-token.
Move tokens to their selected experts — this is the all-to-all communication problem when experts are partitioned across GPUs (Lecture 03).
Compute only the active experts' FFN — 8 of 256 (3.1%) for DeepSeek, 8 of 128 (6.25%) for Qwen3-MoE.

A dense model has a simple "1 GPU, all the weights" mental model. An MoE has "all experts everywhere or partition them, and route tokens at runtime." The complexity moves from compute density to communication and scheduling.

5. The two models in concrete numbers

5.1 DeepSeek V3.1

Spec	Value
Total params	671B
Active params (per token)	37B
Layers	61 (3 dense + 58 MoE)
Hidden size	7168
Attention	MLA, 128 heads (Q), `q_lora_rank=1536`, `kv_lora_rank=512`
Vocab	129,280
Context	128K (extended via YaRN)
KV bytes/token (FP16)	~69 KB
Native speculation	MTP
Total HBM (BF16)	~1.4 TB
Total HBM (FP8)	~700 GB
Total HBM (FP4)	~350 GB

5.2 Qwen3-MoE 235B-A22B

Spec	Value
Total params	235B
Active params (per token)	22B
Layers	94
Hidden size	4096
Attention	GQA, 64 Q heads, 4 KV heads (per Qwen3 release)
Vocab	151,936
Context	256K (Instruct-2507)
KV bytes/token (FP16)	2 × 94 × 4 × 128 × 2 = 192,512 bytes ≈ 188 KB
Native speculation	None (use EAGLE-3 / Medusa)
Total HBM (BF16)	~470 GB
Total HBM (FP8)	~235 GB
Total HBM (FP4)	~118 GB

5.3 Side by side

Property	DeepSeek V3.1	Qwen3-MoE 235B-A22B
Total / active params	671B / 37B	235B / 22B
Active param ratio	5.5%	9.4%
Attention	MLA (compressed KV)	GQA (4 KV heads)
KV bytes/token	~69 KB	~188 KB
Layers	61	94
Native speculation	MTP (yes)	none (EAGLE-3 external)
Experts	256 + 1 shared	128
HBM at FP4	~350 GB	~118 GB

DeepSeek has more total params but compressed KV; Qwen has fewer total but more KV per token. Choosing one over the other depends on:

Long-context priorities → DeepSeek MLA wins.
HBM-constrained deployment → Qwen3-MoE 235B-A22B fits more easily.
Need MTP speculation out of the box → DeepSeek.
Multilingual coverage (Chinese especially) → both competitive; Qwen3 slightly stronger on Chinese benchmarks.
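The HBM side of this trade-off can be explored with a small footprint helper. It is a deliberately crude sketch: weights are counted at a flat bits-per-parameter with no quantization scale factors, KV cache is kept at FP16, and activation workspace is ignored. The function names are illustrative; the per-token KV byte counts are the ones derived earlier in this lecture.

```python
def weights_gb(total_params, bits_per_weight):
    """Weight footprint in GB at a flat bits-per-parameter."""
    return total_params * bits_per_weight / 8 / 1e9

def kv_gb(kv_bytes_per_token, context_tokens, batch):
    """KV cache footprint in GB for `batch` requests at full context."""
    return kv_bytes_per_token * context_tokens * batch / 1e9

def footprint_gb(total_params, bits_per_weight, kv_bytes_per_token, context_tokens, batch):
    return (weights_gb(total_params, bits_per_weight)
            + kv_gb(kv_bytes_per_token, context_tokens, batch))

CONTEXT = 128 * 1024
MODELS = {
    "DeepSeek V3.1": (671e9, 70_272),          # (total params, KV bytes/token)
    "Qwen3-MoE 235B-A22B": (235e9, 192_512),
}
for name, (params, kv_bytes) in MODELS.items():
    for bits in (16, 8, 4):
        gb = footprint_gb(params, bits, kv_bytes, CONTEXT, batch=1)
        print(f"{name:22s} {bits:2d}-bit weights + 128K KV: {gb:7.1f} GB")
```

Raising `batch` shows where the two models trade places on KV growth: DeepSeek starts from a larger weight footprint but adds far less per request.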

6. Inference graph changes vs dense

Compared to Part 2's dense models, the inference graph adds:

6.1 New stages per layer

Dense block FFN:                       MoE block FFN:
  RMSNorm                                RMSNorm
  gate matmul + up matmul                gating linear → top-k
  SwiGLU                                 token → expert assignment
  down matmul                            (if EP) all-to-all dispatch
  residual                               per-expert FFN computation (only selected)
                                         (if EP) all-to-all combine
                                         weighted sum of expert outputs
                                         (+ shared expert path, DeepSeek)
                                         residual

The extra stages:

Gating computation — small but per-token per-layer.
Token routing — assigns each token to its experts, builds the dispatch table.
(EP-only) All-to-all communication to move tokens to their experts and back.
Per-expert FFN — same FFN math as dense, but only on a subset of tokens per expert.
Output combination — weighted sum of expert outputs.
(MTP) Multi-token head produces k+1 logit predictions per step.
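The token routing stage amounts to inverting the per-token top-k lists into per-expert token lists. A minimal sketch, with `build_dispatch` as a hypothetical helper and toy expert ids:

```python
from collections import defaultdict

def build_dispatch(topk_ids):
    """Invert token -> experts into expert -> tokens.

    topk_ids[t] lists the expert ids selected for token t. The result maps
    each expert that received work to the token indices it must process.
    """
    dispatch = defaultdict(list)
    for token, expert_ids in enumerate(topk_ids):
        for expert in expert_ids:
            dispatch[expert].append(token)
    return dict(dispatch)

# Three tokens, top-2 routing over four experts (toy values).
table = build_dispatch([[0, 2], [2, 3], [0, 2]])
print(table)
```

Under expert parallelism this table determines what each GPU sends in the all-to-all dispatch; experts absent from the table do no work in this step.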

6.2 The new bottleneck candidates

Gating compute — usually small but can become measurable on Hopper-class hardware at high batch sizes.
All-to-all communication — the dominant new bottleneck (Lecture 03).
Expert load imbalance — if some experts get many more tokens than others, the slowest expert sets the step time.
KV cache attention — for DeepSeek MLA, the up-projection compute adds; for Qwen3-MoE, the standard GQA attention cost.
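Expert load imbalance can be quantified as the busiest expert's token count divided by the mean. The sketch below uses that ratio on a toy example and on uniformly random routing; `imbalance` is an illustrative helper, and uniform routing is an assumption made only to show that some skew appears even without a biased router.

```python
import random
from collections import Counter

def imbalance(topk_ids, num_experts):
    """Max tokens on any expert divided by the mean tokens per expert."""
    counts = Counter(expert for ids in topk_ids for expert in ids)
    mean = sum(counts.values()) / num_experts
    return max(counts.values()) / mean

# Toy case: expert 0 is chosen by every token.
print(imbalance([[0, 1], [0, 2], [0, 3]], num_experts=4))

# 64 tokens, top-8 of 256 experts, routed uniformly at random (assumption).
rng = random.Random(0)
batch = [rng.sample(range(256), 8) for _ in range(64)]
print(f"uniform random routing: {imbalance(batch, 256):.1f}x the mean load")
```

A ratio of 1.0 means perfectly even load; since the slowest expert sets the step time, the ratio is a direct multiplier on the expert-FFN stage.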

7. Why MoE economics differ

A simplified cost model:

$/MTok ≈ replica_cost_per_hour × hours_per_MTok
       = replica_cost_per_hour / (output_tokens_per_sec × 3600 / 10^6)

For dense (Part 2): the model does about 2 × P FLOPs per token, where P is the parameter count, and at decode time all P weights are read from HBM on every step.

For MoE: the model has to hold P_total in HBM but only does about 2 × P_active FLOPs per token. At batch=1, decode bandwidth pays only for the active params' weights, but the HBM capacity cost is set by P_total.
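The cost formula at the top of this section translates into a one-line function. The price and throughput in the example are hypothetical placeholders chosen for easy arithmetic, not quotes from any provider or benchmark.

```python
def dollars_per_mtok(replica_cost_per_hour, output_tokens_per_sec):
    """Cost of one million output tokens for a replica at steady throughput."""
    mtok_per_hour = output_tokens_per_sec * 3600 / 1e6
    return replica_cost_per_hour / mtok_per_hour

# Hypothetical replica: $10/hour in total, sustaining 2000 output tokens/sec.
print(f"${dollars_per_mtok(10.0, 2000.0):.2f} per MTok")
```

Both inputs are per replica: a replica spanning several GPUs contributes the summed hourly price and the summed throughput.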

7.1 The bandwidth math

Decode at batch=1 for DeepSeek V3.1 at FP4 on one B200 (hypothetically, ignoring that the weights do not fit on a single GPU):

bytes read per decode step:
  Active weights: 37B × 0.5 bytes (FP4) = 18.5 GB
  KV cache (128K, FP16 MLA): 9 GB
  Total: ~27.5 GB

Decode time on B200 (8 TB/s HBM):
  ≈ 27.5 GB / 8 TB/s ≈ 3.4 ms / token
  ≈ 290 tokens/sec at batch=1 (theoretical ceiling)
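The same arithmetic as a reusable function, so other models, precisions, or bandwidths can be plugged in. `decode_ceiling` is an illustrative name; like the calculation above, it is a pure bandwidth bound that ignores compute, communication, and kernel overheads.

```python
def decode_ceiling(active_params, bytes_per_weight, kv_bytes, hbm_bytes_per_sec):
    """Return (ms per token, tokens per second) for a bandwidth-bound decode step."""
    bytes_per_step = active_params * bytes_per_weight + kv_bytes
    seconds = bytes_per_step / hbm_bytes_per_sec
    return seconds * 1e3, 1.0 / seconds

# DeepSeek V3.1 at FP4 with a 9 GB KV cache on 8 TB/s HBM, batch=1.
ms, tok_per_sec = decode_ceiling(37e9, 0.5, 9e9, 8e12)
print(f"{ms:.2f} ms/token, {tok_per_sec:.0f} tokens/sec ceiling")
```

Note how much of the step is KV traffic: at 128K context the cache accounts for about a third of the bytes read, even with MLA.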

In practice on GB200 NVL72 with proper batching:

MoE serving gains from batching because each expert's weights, once read, serve every token routed to that expert, and the all-to-all cost is amortized over more tokens.
Per-GPU HBM is tighter than for dense: the full expert set must stay resident, which leaves less room for KV cache and therefore for batch size.
Real measurements with SGLang on GB200 NVL72: ~600-1000 tok/s/GPU at concurrency 64 for DeepSeek V3.1 FP4.

7.2 The $/MTok comparison

The headline numbers (mid-2026 ballpark; replicate in your lab):

Model	Hardware	Active	Throughput (tok/s/GPU)	$/MTok
Llama 3.3 70B FP8	4× H100	70B	~580	~$1.20
Qwen 2.5 72B FP8	4× H100	72B	~550	~$1.26
Qwen3-MoE 235B-A22B FP4	8× B200	22B	~1200	~$1.27
DeepSeek V3.1 FP4 + MTP	16× B200 (NVL72)	37B	~1500 (effective with MTP)	~$1.02

In this table, DeepSeek V3.1 with MTP on Blackwell is the cost-economics winner, while Qwen3-MoE lands at parity with dense Hopper. Both MoE rows require cluster scale (8 to 16 B200s per replica); for a single-replica deployment, dense Hopper still wins.

Lab — derive both configs and produce a side-by-side cost table

Goal: extend the benchmark repo with MoE-aware cost modeling. Time cap: one day.

Download both config.json files from official Hugging Face repos.
Compute, programmatically, for each model:
- Total params, active params per token.
- Per-token KV cache bytes (MLA-aware for DeepSeek, GQA-aware for Qwen3-MoE).
- HBM footprint at BF16, FP8, FP4 (weights + KV at 128K, batch=1).
- Decode bandwidth ceiling on B200 single GPU (assuming weights fit, batch=1).
Render a comparison table with both models, three precisions.
Predict at what batch size the decode workload moves from bandwidth-bound to compute-bound (using Part 1 Lecture 03 ridge-point math, FP4 ceiling ~1125 FLOPs/byte on B200).
Pick an HBM-feasible deployment shape for each model (single B200, 2× B200, 4× B200, 8× B200, or NVL72) at FP4 + FP8 KV with concurrency targets of 16, 64, 256.

Pass criterion: the report can be reproduced by another engineer from public configs, and your predictions match measured numbers in Lectures 02-05.

Self-check

MLA reduces DeepSeek V3.1's per-token KV from ~244 KB (hypothetical GQA equivalent) to ~69 KB. At 128K context, how many concurrent requests fit in the KV cache budget of one B200 (192 GB) versus the GQA hypothetical?
MTP gives DeepSeek native speculation at ~2× effective decode throughput. Why is it specifically not transferable to Qwen3-MoE without retraining?
A teammate proposes serving DeepSeek V3.1 on 8× H200 instead of 8× B200 to save cost. Without running it, predict the throughput drop. What's the bottleneck?
Why does Qwen3-MoE 235B-A22B have 94 layers and DeepSeek V3.1 only 61, while DeepSeek's active params (37B) are higher than Qwen3's (22B)?
For a long-context (128K) chat product at batch=8, which model has the larger total HBM footprint at FP4? Show the math.

References

DeepSeek V3 technical report — arXiv:2412.19437
DeepSeek V3.1 release notes — github.com/deepseek-ai/DeepSeek-V3
DeepSeek V3.1 model card — huggingface.co/deepseek-ai/DeepSeek-V3.1
MLA — described in DeepSeek V2 paper arXiv:2405.04434
DeepSeek-MoE paper — arXiv:2401.06066
Multi-token prediction — arXiv:2404.19737
Qwen3 technical report — qwenlm.github.io/blog/qwen3/
Qwen3-MoE 235B-A22B model card — huggingface.co/Qwen/Qwen3-235B-A22B (model name as released)
"Mixture of Experts Explained" — huggingface.co/blog/moe — accessible primer
Switch Transformers — arXiv:2101.03961 — the foundational MoE paper

Cross-references:

Part 2 → Lecture 01 — Anatomy of a 70B-class dense model — for direct dense-vs-MoE comparison
Phase 5 → GPU Infrastructure → Long-Context-MoE-Foundation-Training → 04 MoE Fundamentals — training-side perspective on MoE

Current as of 2026-06

Configs pinned from the official model cards: DeepSeek V3.1 (2025-08) and Qwen3-235B-A22B-Instruct-2507 (2025-07). Refresh when DeepSeek V4 or Qwen4-MoE ships, or when a competing MoE family with different architectural choices becomes the production standard.
