Jared Frost

Expert parallelism (EP) and the gating hot path

How is an MoE partitioned across many GPUs, and where does the all-to-all cost dominate?

AI Inference Engineer 2026 — Special Course · Part 3 — MoE Inference at Blackwell

Overview

Tensor parallelism (Part 2 Lecture 04) splits a single matrix multiply across multiple GPUs and joins the result with an all-reduce. Expert parallelism splits the experts of an MoE layer across multiple GPUs and joins the result with an all-to-all. This one structural difference is the most consequential change in the inference graph going from dense to MoE.

This lecture covers:

  1. What EP partitions and how token routing works across GPUs.
  2. The all-to-all communication pattern — and why it's harder than TP's all-reduce.
  3. TP × EP × PP combinations for MoE — when each parallelism axis pays its rent.
  4. DeepEP and DeepSeek-specific optimizations.
  5. Expert load balancing — the runtime decision that decides whether EP scales.
  6. Gating computation cost — small per-token, large per-cluster.
  7. Token-level routing on NVL72 — what scales and what doesn't.
  8. The Llama-style dense TP fallback (when EP loses to TP for small MoEs).

By the end you should be able to pick the right TP × EP combination for DeepSeek V3.1 or Qwen3-MoE 235B-A22B on a given Blackwell cluster, predict the all-to-all overhead, and diagnose load-imbalance bottlenecks.


1. What EP partitions

In expert parallelism, each GPU holds a subset of the experts. For DeepSeek V3.1 with 256 routed experts at EP=8:

GPU 0: experts 0..31
GPU 1: experts 32..63
GPU 2: experts 64..95
GPU 3: experts 96..127
GPU 4: experts 128..159
GPU 5: experts 160..191
GPU 6: experts 192..223
GPU 7: experts 224..255

Shared experts: replicated on every GPU
Gating network: replicated on every GPU

At a token's MoE layer, the gating network produces top-8 expert IDs. These IDs may be on any GPU. The token's hidden state must be dispatched to the right GPUs, the FFN run there, and the results gathered back.

This is the all-to-all communication. Every GPU sends tokens to every other GPU according to which experts those tokens need.
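The routing step described above can be sketched in a few lines. This is a minimal illustration, not any runtime's actual router: the softmax over the selected logits is one common weighting convention and is an assumption here, the toy logits are made up, and `top_k_route` and `owner_rank` are hypothetical helper names. The contiguous block placement matches the EP=8 layout listed earlier.

```python
import math

N_EXPERTS = 256   # routed experts (from the text)
EP_SIZE = 8       # expert-parallel degree (from the text)
TOP_K = 8         # experts selected per token (from the text)

def top_k_route(logits, k=TOP_K):
    """Return (expert_ids, weights) for one token's router logits.

    Weights are a softmax over the selected logits only. That is an
    assumed convention for this sketch, not a specific model's rule.
    """
    ids = sorted(range(len(logits)), key=lambda e: logits[e], reverse=True)[:k]
    peak = max(logits[e] for e in ids)
    exps = [math.exp(logits[e] - peak) for e in ids]
    total = sum(exps)
    return ids, [x / total for x in exps]

def owner_rank(expert_id, n_experts=N_EXPERTS, ep_size=EP_SIZE):
    """Contiguous block placement: experts 0..31 on rank 0, 32..63 on rank 1, ..."""
    return expert_id // (n_experts // ep_size)

# Toy logits, deterministic and purely illustrative.
logits = [((e * 37) % 256) / 32.0 for e in range(N_EXPERTS)]
ids, weights = top_k_route(logits)
ranks_needed = sorted({owner_rank(e) for e in ids})
```

`ranks_needed` is the set of GPUs this one token's hidden state has to reach. Whenever it contains a rank other than the token's own, that token contributes to the all-to-all.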

1.1 The dispatch table

At each MoE layer, per step:

Input: hidden states for all tokens in the batch — shape (B, hidden)
Gating: top-8 expert IDs per token — shape (B, 8)

For each (token, expert_id) pair:
  expert_owner_gpu = expert_id // (n_experts // EP_size)

Build per-destination-rank send buffers:
  send[rank_r] = [(token i, hidden_state_i) for all tokens routed to an expert on rank r]

NCCL all-to-all dispatch
Each rank receives: tokens needing local experts
Compute FFN per expert locally on received tokens
NCCL all-to-all combine: send back to original ranks

Original ranks combine outputs weighted by gating scores
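The dispatch and combine steps above can be simulated in a single process to check the bookkeeping. This is a sketch under loud assumptions: the "all-to-all" is just a dictionary shuffle rather than NCCL, hidden states are scalars, `expert_ffn` is a stand-in function, and the sizes are toy values rather than the model's.

```python
from collections import defaultdict

N_EXPERTS, EP_SIZE = 8, 4            # toy sizes, not the model's
PER_RANK = N_EXPERTS // EP_SIZE

def owner(expert_id):
    return expert_id // PER_RANK

def expert_ffn(expert_id, x):
    # Stand-in for an expert FFN: any deterministic function works here.
    return (expert_id + 1) * x

def moe_layer_ep(tokens_by_rank, routes):
    """tokens_by_rank[r] = [(token_id, hidden)]; routes[token_id] = [(expert_id, gate_weight)]."""
    # 1. Each source rank builds one send buffer per destination rank.
    send = {src: defaultdict(list) for src in tokens_by_rank}
    for src, toks in tokens_by_rank.items():
        for tid, h in toks:
            for eid, w in routes[tid]:
                send[src][owner(eid)].append((tid, eid, w, h))
    # 2. Dispatch "all-to-all": each destination collects what every source sent it.
    recv = defaultdict(list)
    for src, bufs in send.items():
        for dst, items in bufs.items():
            recv[dst].extend((src, *item) for item in items)
    # 3. Local expert compute, then 4. combine: results return to the source
    #    rank (collapsed here into one dict) and are summed with gate weights.
    out = defaultdict(float)
    for dst, items in recv.items():
        for src, tid, eid, w, h in items:
            assert owner(eid) == dst      # every item landed on the owning rank
            out[tid] += w * expert_ffn(eid, h)
    return dict(out)

tokens_by_rank = {0: [(0, 1.0), (1, 2.0)], 1: [(2, 3.0)]}
routes = {0: [(0, 0.5), (7, 0.5)], 1: [(3, 1.0)], 2: [(2, 0.25), (5, 0.75)]}
result = moe_layer_ep(tokens_by_rank, routes)
```

The size of each `send[src][dst]` list is what varies from step to step in a real deployment, which is the source of the variable-size messages discussed next.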

1.2 Why this is harder than TP all-reduce

TP all-reduce: every rank contributes the same-sized buffer, NCCL reduces in a ring or tree. Predictable cost.

EP all-to-all: every rank sends different-sized buffers to different destinations (because tokens are routed unevenly to experts). NCCL alltoallv handles variable sizes but communication efficiency depends on:

At low batch sizes, EP all-to-all is latency-dominated. At high batch sizes, it becomes bandwidth-dominated. The two regimes call for different optimization knobs.
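A simple way to reason about the two regimes is a latency-plus-bandwidth model: time = fixed cost + bytes / link bandwidth. The sketch below is that model only, not a measurement. The 100 μs fixed cost is an assumed illustrative value, the 1.8 TB/s link figure and the 28 KB × 7 message shape are the ones this lecture uses in its worked example, and both function names are hypothetical.

```python
def all_to_all_estimate_us(bytes_per_peer, n_peers, latency_us, link_bytes_per_s):
    """Return (total_us, wire_us) under a latency + bytes/bandwidth model."""
    wire_us = bytes_per_peer * n_peers / link_bytes_per_s * 1e6
    return latency_us + wire_us, wire_us

def crossover_bytes(latency_us, link_bytes_per_s):
    """Total payload at which wire time equals the fixed latency term."""
    return latency_us * 1e-6 * link_bytes_per_s

LINK_BW = 1.8e12      # bytes/s, the NVLink 5 figure used in this lecture
LATENCY_US = 100.0    # assumed fixed cost per collective (illustrative)

total_us, wire_us = all_to_all_estimate_us(28 * 1024, 7, LATENCY_US, LINK_BW)
payload_at_crossover = crossover_bytes(LATENCY_US, LINK_BW)
```

Under these assumptions the wire time for small decode messages is a tiny fraction of the fixed cost, so the payload has to grow by several orders of magnitude before bandwidth becomes the limiting term.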


2. The all-to-all cost

For DeepSeek V3.1 at EP=8 on 8× B200, batch=64, decode (1 token per request):

Per layer per step:
  - 64 tokens × 8 experts each = 512 token-expert pairs
  - Average tokens per expert: 512 / 256 = 2
  - Per token data: hidden_size × bytes = 7168 × 1 (FP8) = 7 KB

Send per source rank: 64 tokens × ~50% routed off-rank = ~32 tokens sent elsewhere
                    ≈ 32 × 7 KB = 224 KB, or ~28 KB per rank pair averaged over 8 ranks

NCCL alltoallv: ~28 KB to 7 destinations, asymmetric
  Latency-bound regime — bandwidth not the constraint
  Approx 100-200 μs per all-to-all

Per layer: 2 all-to-alls (dispatch + combine) ≈ 200-400 μs
Per token (58 MoE layers): ~12-23 ms in all-to-all alone

Compare to the bandwidth-bound weight read on B200:

Active weights: 37B × 0.5 bytes (FP4) = 18.5 GB
Per GPU at EP=8: 18.5 / 8 ≈ 2.3 GB (only the experts hit per token, summed over all layers)
HBM read time: 2.3 GB / 8 TB/s ≈ 0.29 ms per token total
Per MoE layer (58 of the 61 layers): ~5 μs in weight reads

So in an EP=8 deployment, all-to-all dominates weight reads by an order of magnitude or more — ~12-23 ms against ~0.3 ms per token. This is why MoE inference is fundamentally a communication-bound problem — even more than dense TP — and why NVLink 5's bandwidth doubling pays off.
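The comparison above is plain arithmetic, and it is worth keeping in a form you can re-run with your own measurements. The sketch below only reproduces the figures quoted in this section; the constants are the lecture's estimates, not benchmark results.

```python
MOE_LAYERS = 58
A2A_US = (100, 200)          # per all-to-all, range quoted above
ACTIVE_PARAMS = 37e9
BYTES_PER_PARAM = 0.5        # FP4
EP = 8
HBM_BW = 8e12                # bytes/s, the B200 figure used above

# Dispatch + combine per layer, summed over all MoE layers, in milliseconds.
a2a_ms = tuple(2 * t * MOE_LAYERS / 1000 for t in A2A_US)

# Active weights read per GPU, in milliseconds.
weight_ms = ACTIVE_PARAMS * BYTES_PER_PARAM / EP / HBM_BW * 1e3

# How many times larger the all-to-all budget is than the weight-read budget.
ratio = tuple(t / weight_ms for t in a2a_ms)
```

Swapping in a measured per-collective latency for `A2A_US` is the quickest way to see whether a given deployment still sits in the communication-bound regime.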

2.1 At larger batch sizes

At batch=512:

Per layer: 512 tokens × 8 experts = 4096 token-expert pairs
Average tokens per expert: 4096 / 256 = 16
Per all-to-all message: ~30 tokens × 7 KB = ~200 KB per destination

Bandwidth term: 200 KB × 7 destinations ÷ 1.8 TB/s ≈ 1 μs (much less than the latency floor), so this batch size is still not bandwidth-bound
But NCCL setup + small-message handling ≈ 50-100 μs

Per layer: ~100-200 μs all-to-all
Per token (58 MoE layers): ~6-12 ms

Larger batch amortizes the per-step all-to-all cost. MoE strongly prefers higher batch sizes than dense. Continuous batching is essential.
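To make the amortization concrete, divide the per-step all-to-all time by the number of requests sharing that step. The sketch below assumes the per-token figures above are per decode step (one token generated per request per step) and reuses the ranges quoted in this section.

```python
# batch size -> (low, high) all-to-all time per decode step, in ms (ranges above)
step_ms = {64: (12.0, 23.0), 512: (6.0, 12.0)}

# All-to-all time attributable to each generated token, in microseconds.
per_generated_token_us = {
    batch: tuple(ms * 1000 / batch for ms in rng)
    for batch, rng in step_ms.items()
}
```

Even the high end at batch=512 comes out well below the low end at batch=64, which is the quantitative reason continuous batching matters so much for MoE serving.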


3. TP × EP × PP combinations

For an MoE model, there are three parallelism axes: tensor (TP), expert (EP), and pipeline (PP).

The total degree is TP × EP × PP GPUs per replica.
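Since the three degrees must multiply to the replica's GPU count, it can help to enumerate the layouts that are even structurally possible before benchmarking any of them. The sketch below is a hypothetical helper, not part of any runtime. Its constraints are assumptions chosen for illustration: EP must divide the expert count evenly, TP is restricted to powers of two up to 8, and PP cannot exceed the layer count.

```python
def valid_layouts(n_gpus, n_experts, n_layers, tp_choices=(1, 2, 4, 8)):
    """Enumerate (tp, ep, pp) with tp * ep * pp == n_gpus under simple constraints."""
    layouts = []
    for tp in tp_choices:
        for ep in range(1, n_gpus + 1):
            if n_experts % ep:           # experts must split evenly across EP ranks
                continue
            if n_gpus % (tp * ep):       # remaining factor must be an integer PP
                continue
            pp = n_gpus // (tp * ep)
            if pp > n_layers:            # cannot have more stages than layers
                continue
            layouts.append((tp, ep, pp))
    return layouts

# 64 GPUs, 256 routed experts, 61 layers (figures used in this lecture).
layouts = valid_layouts(n_gpus=64, n_experts=256, n_layers=61)
```

The list only says what is feasible, not what is fast. Choosing among the candidates still comes down to the communication costs discussed in this section.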

3.1 Pure EP (no TP)

Each GPU holds a subset of experts plus a full copy of the attention block.

Pure EP is the standard 2026 MoE recipe. EP=8 for 8× B200, EP=16 for 16× B200, EP=64 for the NVL72.

3.2 TP + EP (combined)

For very large active params or memory-tight deployments:

This is used for:

The cost: more all-reduces (TP) + more all-to-alls (EP). Diminishing returns; rarely worth it unless attention compute is the bottleneck.

3.3 Recommendation table

Model                  Cluster                   Recipe
Qwen3-MoE 235B-A22B    2× B200                   EP=2, FP4 weights, FP8 KV
Qwen3-MoE 235B-A22B    8× B200                   EP=8, FP4 weights, FP8 KV
DeepSeek V3.1          8× B200                   EP=8, FP4 weights, BF16 MLA-KV
DeepSeek V3.1          16× B200                  EP=16, FP4 weights, FP8 KV
DeepSeek V3.1          NVL72 (single replica)    TP=2 × EP=32 = 64 GPUs, FP4
DeepSeek V3.1          NVL72 (multi-replica)     4 replicas × 16 GPUs each

4. DeepEP and DeepSeek-specific optimizations

DeepSeek released DeepEP (github.com/deepseek-ai/DeepEP) — an EP communication library specifically tuned for the DeepSeek V3 family.

Key optimizations:

Benchmark gains over default NCCL alltoallv on NVL72:

SGLang and TRT-LLM integrate DeepEP for DeepSeek-family deployments. vLLM does so for the DeepSeek model path. For Qwen3-MoE 235B-A22B (which is not DeepSeek), the same techniques apply but the library integration varies.


5. Expert load balancing

The single most subtle MoE inference issue.

5.1 The problem

Not all experts are activated equally. Some experts get many tokens, some get few. The slowest expert sets the step time (a "straggler" pattern).

In training, the MoE loss includes a load-balancing auxiliary term that penalizes uneven routing. At inference time the router weights are fixed, so any imbalance left over from training persists as a natural bias in the routing distribution.

In practice, a deployment might see expert utilization vary by 3-10×: the most-loaded expert does 3-10× more work per step than the least-loaded.

5.2 Approaches

Token-level rebalancing (DeepEP, SGLang):

Expert replication (when budget allows):

Drop-token policies (older, less common):

5.3 Measurement

Production runtimes (SGLang, vLLM) expose expert utilization metrics:

expert 0..7   on GPU 0: tokens_per_step = [180, 95, 110, 75, 140, 88, 200, 95]
expert 0..7   on GPU 1: ...

Use Nsight Systems' MoE-aware view (newer Nsight versions) or runtime metrics endpoints. Watch for ratio max/min > 3 — this is the threshold where rebalancing becomes worthwhile.
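The max/min check is easy to automate once you have per-expert token counts. The sketch below assumes you have already pulled one rank's `tokens_per_step` list from whatever metrics source your runtime provides. The `imbalance` helper is hypothetical, and the max/mean "straggler factor" is an extra illustrative quantity, not a metric named by any runtime.

```python
def imbalance(tokens_per_expert):
    """Return (max/min ratio, max/mean straggler factor) for one rank's experts."""
    hi, lo = max(tokens_per_expert), min(tokens_per_expert)
    mean = sum(tokens_per_expert) / len(tokens_per_expert)
    ratio = hi / lo if lo else float("inf")   # an idle expert means unbounded imbalance
    return ratio, hi / mean

REBALANCE_THRESHOLD = 3.0   # max/min threshold quoted above

gpu0 = [180, 95, 110, 75, 140, 88, 200, 95]   # the GPU 0 sample shown above
ratio, straggler = imbalance(gpu0)
needs_rebalance = ratio > REBALANCE_THRESHOLD
```

The max/mean factor is a rough proxy for how much longer the busiest expert runs than a perfectly balanced one would, which is the quantity that actually stretches the step.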


6. Gating computation cost

The gating network is Linear(hidden, num_experts). For DeepSeek V3.1:

Per token: 7168 × 256 ≈ 1.83M multiply-accumulates ≈ 3.67M FLOPs
Per layer × per batch: 3.67M × batch × seq

At batch=64, seq=128 (typical chat-shape):

Per gating call: 3.67M × 64 × 128 ≈ 30 GFLOPs
On B200 FP16 (2,250 TFLOPs peak): ~13 μs

Per layer: gating + topk + scatter overhead ≈ 20-50 μs
Per token (58 MoE layers): ~1.2-2.9 ms in gating-related ops

Small. Usually not the bottleneck unless the runtime's gating implementation is unoptimized.
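The gating matmul cost can be recomputed directly from the layer shape. The sketch below counts each multiply-accumulate as two FLOPs (counting MACs alone halves the figures) and assumes the GPU runs at its quoted peak, which real kernels will not reach, so treat the result as a lower bound on the matmul time only.

```python
HIDDEN = 7168
N_EXPERTS = 256
BATCH, SEQ = 64, 128
PEAK_FLOPS = 2.25e15          # B200 FP16 peak quoted above

macs_per_token = HIDDEN * N_EXPERTS            # one Linear(hidden, num_experts)
flops_per_token = 2 * macs_per_token           # multiply + add
flops_per_call = flops_per_token * BATCH * SEQ
matmul_us = flops_per_call / PEAK_FLOPS * 1e6  # ideal time at peak throughput
```

Either way the matmul itself is a small part of the 20-50 μs per-layer figure, which points at top-k selection, scatter, and kernel launch overhead as the places where an unoptimized gating path loses time.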

The subtlety: gating is frequent and latency-sensitive. Even if individual cost is small, poor implementation can add 10-20% to total step time. vLLM 0.22+, SGLang 0.5+, and TRT-LLM all have optimized gating paths.


7. Token-level routing on NVL72

At full NVL72 scale (72 GPUs), some additional considerations:

For NVL72 single-replica DeepSeek V3.1:

7.1 When EP doesn't pay

For small batches (concurrency < 16), EP overhead can exceed the bandwidth savings. For these workloads, fewer-GPU TP can be faster than many-GPU EP:

For low-batch chat products, smaller-cluster deployments are often more cost-efficient.


Lab — bench EP scaling on Qwen3-MoE

Goal: measure EP scaling and identify the all-to-all overhead crossover.

  1. Hardware — 2× B200, 4× B200, 8× B200 (or NVL72 partition).
  2. Model — Qwen3-MoE 235B-A22B FP4 (or FP8 if FP4 path not ready).
  3. Runtime — SGLang 0.5+ (best DeepEP-style integration for non-DeepSeek MoE) or vLLM 0.22+ V1.
  4. Bench at three EP degrees — EP=2, EP=4, EP=8. Same batch=64, prompt=1024, output=256, iterations=100 with 20 warmup.
  5. Profile one EP=4 run with Nsight Systems. Identify all-to-all fraction of step time.
  6. Plot per-replica throughput, per-GPU throughput, and all-to-all overhead percentage.
  7. Identify the scaling-efficiency curve.

Pass criterion: you can defend EP=4 vs EP=8 for a chat product at concurrency 64 with measured numbers.
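For steps 6 and 7, a small helper can turn measured throughputs into per-GPU throughput and scaling efficiency relative to the smallest EP degree. This is a sketch: `scaling_report` is a hypothetical name, and the throughput numbers below are placeholders purely to exercise the function, not measurements. Replace them with your own bench results.

```python
def scaling_report(tokens_per_s_by_ep):
    """Map EP degree -> (per-GPU tokens/s, efficiency vs. the smallest EP degree)."""
    base_ep = min(tokens_per_s_by_ep)
    base_per_gpu = tokens_per_s_by_ep[base_ep] / base_ep
    report = {}
    for ep in sorted(tokens_per_s_by_ep):
        per_gpu = tokens_per_s_by_ep[ep] / ep
        report[ep] = (per_gpu, per_gpu / base_per_gpu)
    return report

# PLACEHOLDER per-replica throughputs (tokens/s); substitute measured values.
placeholder = {2: 1000.0, 4: 1800.0, 8: 3000.0}
report = scaling_report(placeholder)
```

An efficiency that falls steadily as EP grows, together with the all-to-all fraction from the Nsight profile, is the evidence the pass criterion asks for.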


Self-check

  1. For DeepSeek V3.1 at EP=8, batch=64, predict the per-layer all-to-all time on NVL72 NVLink 5 (assume 28 KB messages × 7 destinations).
  2. Why does NVLink 5's 2× bandwidth improvement specifically help MoE EP all-to-all more than dense TP all-reduce?
  3. A teammate measures expert utilization on GPU 0 as [180, 95, 110, 75, 140, 88, 200, 95]. What's the imbalance ratio? Should you enable token-level rebalancing?
  4. EP=8 vs TP=8 for DeepSeek V3.1: which has more cross-GPU communication overhead? Why?
  5. At batch=2 on NVL72 EP=64, the per-step gating + all-to-all overhead per layer is ~80 μs. Per layer compute is ~30 μs. What's wrong, and what's the fix?


Current as of 2026-06

NCCL 2.30+, DeepEP latest, SGLang 0.5+ MoE path, vLLM 0.22+ V1 MoE support, NVL72. Refresh when DeepEP 2.x or successor lands.

