Single-node multi-GPU serving — tensor parallelism on 8× H100/H200
How does TP scale, where do the collectives dominate, and what's the runtime-specific config?
AI Inference Engineer 2026 — Special Course · Part 2 — Dense Decoder-Only Inference at Hopper
Overview
A 70B-class dense model in FP16 is ~140 GB of weights. No single Hopper GPU serves it without quantization: H100 has 80 GB, and H200's 141 GB nominally holds the weights but leaves essentially nothing for KV cache and activations. To serve it at higher precision than INT4, or to serve at larger batch / longer context, the model is split across multiple GPUs in a single node, joined by NVLink.
The dominant 2026 split is tensor parallelism (TP), the approach pioneered by Megatron-LM. This lecture covers TP end-to-end for the anchor pair, Llama 3.3 70B and Qwen 2.5 72B:
- The TP partitioning of attention and FFN layers — which matrices split how.
- The NCCL all-reduce that dominates TP step time.
- NVLink + NVSwitch fabric reality on HGX H100/H200 8-GPU boxes.
- TP=2 vs TP=4 vs TP=8 — what each tradeoff actually costs in throughput and latency.
- Runtime-specific TP configuration — vLLM, SGLang, TRT-LLM.
- Diagnosis — when the all-reduce is the bottleneck and what to do.
- Sequence parallelism and tensor-parallel-FFN overlays.
By the end you should be able to set up TP on 8× H100/H200 for either Llama 3.3 70B or Qwen 2.5 72B, measure the all-reduce overhead, and tell whether a given workload should use TP=4 or TP=8.
1. Tensor parallelism — the partitioning
TP splits individual matrix multiplications across GPUs. For each Linear layer, the weight matrix is partitioned and each GPU computes a slice of the result. Column-split layers produce output slices that stay local; row-split layers produce partial sums that are combined by an all-reduce.
1.1 Attention partition
Q, K, V projections are partitioned by head:
On 4-GPU TP, each GPU holds 16 of 64 Q heads + 2 of 8 KV heads:
GPU 0: Q_heads_0..15, K_heads_0..1, V_heads_0..1
GPU 1: Q_heads_16..31, K_heads_2..3, V_heads_2..3
GPU 2: Q_heads_32..47, K_heads_4..5, V_heads_4..5
GPU 3: Q_heads_48..63, K_heads_6..7, V_heads_6..7
Attention computed locally per GPU on its head shard:
attn_local = softmax(Q_local · K_local^T / sqrt(d_head)) · V_local
Output projection partitioned by *rows*:
W_o split: rows 0..2047 on GPU 0, rows 2048..4095 on GPU 1, etc.
Per-GPU partial: attn_local · W_o_local → partial_d
All-reduce across GPUs: full_d = sum(partial_d) over 4 GPUs
The all-reduce is the cross-GPU step. Without it, each GPU has only its head shard's contribution.
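The head assignment above can be sketched in a few lines. This is a minimal illustration, not any runtime's actual sharding code: `shard_heads` is a hypothetical helper, the 64 / 8 head counts are the ones quoted in the text, and the grouped-query mapping (Q head `h` reads KV head `h // group_size`) is the usual GQA convention, assumed here.

```python
def shard_heads(num_q_heads, num_kv_heads, tp):
    """Return, per TP rank, the contiguous Q-head and KV-head ranges it owns.

    Hypothetical helper: assumes tp divides both head counts (no KV replication).
    """
    if num_q_heads % tp or num_kv_heads % tp:
        raise ValueError("tp must divide both the Q-head and KV-head counts")
    q_per, kv_per = num_q_heads // tp, num_kv_heads // tp
    return [
        {
            "rank": r,
            "q_heads": range(r * q_per, (r + 1) * q_per),
            "kv_heads": range(r * kv_per, (r + 1) * kv_per),
        }
        for r in range(tp)
    ]


# Shape quoted in the text: 64 Q heads, 8 KV heads, TP=4.
NUM_Q, NUM_KV, TP = 64, 8, 4
GROUP = NUM_Q // NUM_KV  # Q heads served by each KV head (GQA group size)

shards = shard_heads(NUM_Q, NUM_KV, TP)
for s in shards:
    q, kv = s["q_heads"], s["kv_heads"]
    print(f"GPU {s['rank']}: Q {q.start}..{q.stop - 1}, KV {kv.start}..{kv.stop - 1}")

# Attention stays local only if every Q head finds its KV head on the same rank.
all_local = all(h // GROUP in s["kv_heads"] for s in shards for h in s["q_heads"])
print("every Q head has its KV head on the same GPU:", all_local)
```

Contiguous assignment is what keeps the attention step communication-free: each rank's Q heads fall inside the GQA groups of the KV heads it already holds.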
1.2 FFN partition
The FFN matmuls split in two different ways:
W_gate, W_up: split by columns (output dim).
Per GPU computes (B, d) × (d, d_ff/N) → (B, d_ff/N)
No communication yet — local result.
SwiGLU: silu(gate) * up — element-wise, local.
W_down: split by rows (input dim).
Per GPU computes (B, d_ff/N) × (d_ff/N, d) → (B, d) partial
All-reduce across GPUs to get full (B, d)
Again, the FFN ends with an all-reduce.
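To see why a column split followed by a row split needs exactly one all-reduce, here is a toy pure-Python check, assuming tiny made-up dimensions (B=2, d=4, d_ff=8, TP=4) and random weights. The helper names are illustrative only, and the final `sum` over ranks stands in for the all-reduce.

```python
import math
import random


def matmul(a, b):
    """Plain-Python (rows x inner) @ (inner x cols)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def silu(x):
    return x / (1.0 + math.exp(-x))


def swiglu_ffn(x, w_gate, w_up, w_down):
    """Unsharded reference: (silu(x @ W_gate) * (x @ W_up)) @ W_down."""
    gate, up = matmul(x, w_gate), matmul(x, w_up)
    hidden = [[silu(g) * u for g, u in zip(gr, ur)] for gr, ur in zip(gate, up)]
    return matmul(hidden, w_down)


def tp_swiglu_ffn(x, w_gate, w_up, w_down, tp):
    """Same FFN with W_gate/W_up split by columns and W_down split by rows."""
    shard = len(w_down) // tp  # d_ff / N
    partials = []
    for r in range(tp):
        lo, hi = r * shard, (r + 1) * shard
        gate_cols = [row[lo:hi] for row in w_gate]  # (d, d_ff/N)
        up_cols = [row[lo:hi] for row in w_up]      # (d, d_ff/N)
        down_rows = w_down[lo:hi]                   # (d_ff/N, d)
        partials.append(swiglu_ffn(x, gate_cols, up_cols, down_rows))
    # Stand-in for the all-reduce: element-wise sum of the per-rank partials.
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]


def rand_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
B, d, d_ff, tp = 2, 4, 8, 4
x = rand_matrix(B, d)
w_gate, w_up, w_down = rand_matrix(d, d_ff), rand_matrix(d, d_ff), rand_matrix(d_ff, d)

full = swiglu_ffn(x, w_gate, w_up, w_down)
sharded = tp_swiglu_ffn(x, w_gate, w_up, w_down, tp)
max_err = max(abs(a - b) for fr, sr in zip(full, sharded) for a, b in zip(fr, sr))
print("max abs difference vs unsharded:", max_err)
```

The element-wise SwiGLU works on local columns only, so no communication is needed until the row-split `W_down` produces partial sums.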
1.3 Total all-reduces per layer
For each transformer layer, two all-reduces — one after attention output projection, one after FFN down projection. For an 80-layer model that's 160 all-reduces per token at decode.
This is the single largest performance consideration in TP serving.
2. The NCCL all-reduce — what it actually does
NCCL (NVIDIA Collective Communications Library) implements all-reduce as either ring or tree algorithm depending on size and topology.
2.1 Ring all-reduce
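For 4 GPUs with NVLink, each GPU's buffer is cut into N = 4 chunks:
Reduce-scatter, steps 1..N-1: each GPU sends one chunk to its ring neighbor (GPU 0 → 1, 1 → 2, etc.) and adds the chunk it receives to its own copy.
All-gather, steps N..2(N-1): each GPU forwards the fully reduced chunk it holds until every GPU has all N chunks.
2(N-1) steps total to fully reduce + share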
Bandwidth: each GPU sends 2 * (N-1) * (size/N) bytes. For large messages, near-optimal bandwidth use.
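The ring schedule can be simulated in plain Python to check both the result and the per-GPU send volume. This is a sketch of the textbook algorithm under simple assumptions (equal-sized buffers, length divisible by N, element counts instead of bytes); it is not NCCL's implementation.

```python
def ring_all_reduce(buffers):
    """Simulate a sum ring all-reduce over N equal-length per-rank buffers.

    Returns (reduced buffers, number of elements sent by each rank).
    """
    n, size = len(buffers), len(buffers[0])
    if size % n:
        raise ValueError("buffer length must be divisible by the number of ranks")
    chunk = size // n
    data = [list(b) for b in buffers]
    sent = [0] * n

    def span(c):
        return slice(c * chunk, (c + 1) * chunk)

    # Phase 1, reduce-scatter: N-1 steps. At step s, rank r sends chunk (r - s) mod N.
    for step in range(n - 1):
        outgoing = [((r - step) % n, data[r][span((r - step) % n)]) for r in range(n)]
        for r, (c, payload) in enumerate(outgoing):
            dst = (r + 1) % n
            data[dst][span(c)] = [a + b for a, b in zip(data[dst][span(c)], payload)]
            sent[r] += chunk
    # Now rank r holds the fully reduced chunk (r + 1) mod N.

    # Phase 2, all-gather: N-1 steps. At step s, rank r forwards chunk (r + 1 - s) mod N.
    for step in range(n - 1):
        outgoing = [((r + 1 - step) % n, data[r][span((r + 1 - step) % n)]) for r in range(n)]
        for r, (c, payload) in enumerate(outgoing):
            data[(r + 1) % n][span(c)] = payload
            sent[r] += chunk
    return data, sent


# 4 ranks, 8 elements each; rank r contributes the constant r + 1.
N, SIZE = 4, 8
reduced, sent = ring_all_reduce([[float(r + 1)] * SIZE for r in range(N)])
print("rank 0 result:", reduced[0])
print("elements sent per rank:", sent, "formula 2(N-1)*size/N =", 2 * (N - 1) * SIZE // N)
```

Every rank sends the same amount in every step, which is why the ring uses all links evenly and approaches optimal bandwidth for large messages.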
2.2 Tree all-reduce
For 8+ GPUs, hierarchical tree reduces are sometimes faster:
log2(N) levels of pairwise reduce → final value at root → log2(N) levels of broadcast back
Lower latency for small messages, higher latency for large messages (less bandwidth-efficient).
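A rough latency/bandwidth (alpha-beta) cost model shows where the crossover comes from. This is a simplified sketch: the tree here is unpipelined and moves the full message at each level, the 2 µs per-step latency is a hypothetical placeholder, and 450 GB/s is taken as the per-direction NVLink rate. Real NCCL trees are pipelined, so treat the numbers as illustrative only.

```python
import math


def ring_time(size_bytes, n, alpha_s, bw_bytes_per_s):
    """2(N-1) steps, each moving one size/N chunk."""
    return 2 * (n - 1) * (alpha_s + (size_bytes / n) / bw_bytes_per_s)


def tree_time(size_bytes, n, alpha_s, bw_bytes_per_s):
    """log2(N) reduce levels plus log2(N) broadcast levels, full message per level."""
    levels = math.ceil(math.log2(n))
    return 2 * levels * (alpha_s + size_bytes / bw_bytes_per_s)


ALPHA_S = 2e-6  # hypothetical per-step latency, not a measured value
BW = 450e9      # assumed per-direction NVLink bandwidth in bytes/s

small = 512 * 1024        # decode-sized message (0.5 MiB)
large = 32 * 1024 * 1024  # prefill-sized message (32 MiB)

for label, size in [("0.5 MiB", small), ("32 MiB", large)]:
    ring = ring_time(size, 8, ALPHA_S, BW)
    tree = tree_time(size, 8, ALPHA_S, BW)
    winner = "tree" if tree < ring else "ring"
    print(f"{label}: ring {ring * 1e6:.1f} us, tree {tree * 1e6:.1f} us -> {winner}")
```

With fewer steps, the tree pays the fixed latency fewer times; with full-size transfers per level, it loses once the bandwidth term dominates.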
2.3 What NCCL picks
NCCL auto-tunes per (message size, topology). On HGX H100/H200 with NVLink Switch (fully-connected 8-GPU domain), ring is typical for the all-reduces in TP because the messages are large during prefill (~16-32 MB per layer at FP16); decode messages are far smaller and latency-bound (§2.4).
2.4 The bandwidth cost
For an FFN down-projection all-reduce in 70B FP16:
message_size_per_gpu = batch × hidden_size × bytes
= 32 × 8192 × 2
= 0.5 MB at batch=32
Across 80 layers × 2 all-reduces per layer:
total bytes = 80 × 2 × 0.5 MB = 80 MB per token decode
On NVLink (900 GB/s aggregate) the all-reduce time is small for this message size,
but the *latency* (round-trips) adds up.
For prefill at batch=1, prompt=2048:
message_size = 2048 × 8192 × 2 = 32 MB per layer per all-reduce
total bytes per prompt = 80 × 2 × 32 MB = 5120 MB ≈ 5 GB (binary units) of all-reduce payload per prefill
This is significant. Prefill TP=8 spends ~25% of step time in NCCL all-reduces.
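The arithmetic above is easy to reproduce for other batch sizes and prompt lengths. A minimal sketch, assuming the hidden size (8192), layer count (80), and FP16 element size used in the text; the function names are illustrative, and sizes are reported in MiB (2^20 bytes), which is what the "MB" figures above correspond to.

```python
MIB = 2**20


def all_reduce_message_bytes(tokens, hidden_size=8192, bytes_per_elem=2):
    """Payload of one all-reduce: one hidden-size activation row per token in the step."""
    return tokens * hidden_size * bytes_per_elem


def payload_per_forward(tokens, layers=80, all_reduces_per_layer=2, **kwargs):
    """Total all-reduce payload for one forward pass through every layer."""
    return layers * all_reduces_per_layer * all_reduce_message_bytes(tokens, **kwargs)


# Decode: batch=32, one new token per sequence.
decode_msg = all_reduce_message_bytes(tokens=32)
decode_total = payload_per_forward(tokens=32)

# Prefill: batch=1, 2048-token prompt.
prefill_msg = all_reduce_message_bytes(tokens=2048)
prefill_total = payload_per_forward(tokens=2048)

print(f"decode : {decode_msg / MIB:.1f} MiB per all-reduce, {decode_total / MIB:.0f} MiB per step")
print(f"prefill: {prefill_msg / MIB:.0f} MiB per all-reduce, {prefill_total / MIB:.0f} MiB per prompt")
```

These are payload sizes; with a ring all-reduce each GPU actually sends 2(N-1)/N times the payload.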
3. NVLink + NVSwitch on HGX boxes
The physical fabric for 8× H100 / H200 in standard HGX configurations:
- NVLink 4 — 50 GB/s per link, 18 links per GPU → 900 GB/s aggregate per GPU.
- NVSwitch v3 — 4 switches in the HGX baseboard interconnect the 8 GPUs in an all-to-all topology.
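- Any GPU can reach any other at the full NVLink rate: up to 900 GB/s bidirectional, i.e. 450 GB/s in each direction.
- Latency: sub-microsecond at the link level for small messages; a full NCCL all-reduce adds kernel-launch and synchronization overhead on top.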
This fabric is the reason TP=8 is practical at all. Without NVLink, TP across 8 GPUs through PCIe (128 GB/s aggregate to host, 64 GB/s peer-to-peer if not blocked) would have roughly 7–14× less bandwidth per GPU, depending on which of those two figures you compare against 900 GB/s.
3.1 NVLink Switch chiplets and the H200 NVL configuration
Newer HGX configurations and the upcoming NVL chassis use NVLink switches as separate chips that interconnect across servers. For inference-on-single-node-server workloads (which is the entire scope of this lecture), the baseboard NVSwitch is what matters.
4. TP=2 vs TP=4 vs TP=8
For Llama 3.3 70B / Qwen 2.5 72B:
| TP | Per-GPU weight share (FP16) | Per-GPU KV share | Communication overhead | When to use |
|---|---|---|---|---|
| 1 | doesn't fit at FP16 | n/a | none | INT4 only on H200 |
| 2 | ~70 GB | 50% | ~10% of step time | H100 NVL or H200 single-node |
| 4 | ~35 GB | 25% | ~15% of step time | best chat-shape throughput |
| 8 | ~18 GB | 12.5% | ~25% of step time | max batch / long context |
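The weight and KV columns of the table follow from simple division. A minimal sketch, assuming 70B parameters at 2 bytes each and the publicly documented Llama 3.3 70B attention shape (80 layers, 8 KV heads, head dimension 128); these shape values are assumptions not stated in the table, and the helper names are illustrative.

```python
def per_gpu_weight_gb(params_billion, bytes_per_param, tp):
    """Weights per GPU in GB, assuming an even split across TP ranks."""
    return params_billion * bytes_per_param / tp


def kv_bytes_per_token_per_gpu(layers, kv_heads, head_dim, bytes_per_elem, tp):
    """K and V for one token on one rank; each rank stores only its own KV heads."""
    return 2 * layers * (kv_heads // tp) * head_dim * bytes_per_elem


# Assumed Llama 3.3 70B shape: 80 layers, 8 KV heads, head_dim 128, FP16.
LAYERS, KV_HEADS, HEAD_DIM, FP16_BYTES = 80, 8, 128, 2

for tp in (2, 4, 8):
    weights = per_gpu_weight_gb(70, FP16_BYTES, tp)
    kv = kv_bytes_per_token_per_gpu(LAYERS, KV_HEADS, HEAD_DIM, FP16_BYTES, tp)
    print(f"TP={tp}: {weights:.1f} GB weights per GPU, {kv // 1024} KiB KV per token per GPU")
```

Subtracting the weight share from the GPU's HBM gives the budget left for KV cache, which is what drives the "max batch / long context" column.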
4.1 The TP scaling efficiency
Ideal: 4× more GPUs → 4× more throughput. Real: 4× more GPUs → 2.5–3.5× more throughput at this hardware/model.
The gap is communication. The plot looks like:
per-GPU scaling efficiency
▲
│ • TP=2 (95% efficiency vs single-GPU)
│ • TP=4 (85% efficiency)
│ TP=8 • (65–75% efficiency)
│
└────────────────────► TP degree
Scaling efficiency depends on workload:
- Chat (decode-dominant, batch=32): TP=4 is the sweet spot. TP=8 sacrifices ~25% to gain headroom for KV cache / longer context.
- Batch / offline (prefill-dominant): TP=8 can win because prefill is compute-bound; FFN matmuls amortize the all-reduce cost.
- Long context (128K): TP=8 because the KV cache pressure forces it; the throughput hit is acceptable because long-context throughput is bottlenecked elsewhere.
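Scaling efficiency is just per-GPU throughput relative to a baseline TP degree. A minimal sketch with hypothetical measurements (the 700 / 1200 / 1600 tok/s figures are placeholders to be replaced with your own benchmark results, not numbers from a real run):

```python
def scaling_efficiency(throughput_by_tp):
    """Per-GPU throughput at each TP degree, relative to the smallest TP measured."""
    base_tp = min(throughput_by_tp)
    base_per_gpu = throughput_by_tp[base_tp] / base_tp
    return {
        tp: (tok_s / tp) / base_per_gpu
        for tp, tok_s in sorted(throughput_by_tp.items())
    }


# Hypothetical per-replica throughput in tok/s, keyed by TP degree.
measured = {2: 700.0, 4: 1200.0, 8: 1600.0}
efficiency = scaling_efficiency(measured)
for tp, eff in efficiency.items():
    print(f"TP={tp}: {eff:.0%} of TP={min(measured)} per-GPU throughput")
```

Because a 70B FP16 model has no single-GPU baseline, the smallest TP that fits is the practical reference point.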
4.2 The per-GPU throughput trap
A common metric mistake: comparing TP=4 vs TP=8 on per-replica throughput alone, without dividing by the number of GPUs each replica occupies.
- TP=4 at 1200 tok/s → 300 tok/s/GPU.
- TP=8 at 1600 tok/s → 200 tok/s/GPU.
TP=8 has higher absolute throughput (better TPOT, more users served per replica) but worse $/MTok if each H200 has a fixed cost.
For chat products, TP=4 is usually the cost-efficient pick. TP=8 is for cases where TP=4 cannot fit the workload (e.g., 128K context at large batch).
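The $/MTok comparison can be made explicit. A minimal sketch using the throughput figures above and a hypothetical flat price of $3.00 per GPU-hour; the price is a placeholder for illustration, not a quoted rate.

```python
def per_gpu_throughput(replica_tok_per_s, num_gpus):
    return replica_tok_per_s / num_gpus


def cost_per_mtok(replica_tok_per_s, num_gpus, gpu_hour_usd):
    """Dollars per million generated tokens for one replica running flat out."""
    mtok_per_hour = replica_tok_per_s * 3600 / 1e6
    return num_gpus * gpu_hour_usd / mtok_per_hour


GPU_HOUR_USD = 3.00  # hypothetical price per GPU-hour

for tp, tok_s in [(4, 1200), (8, 1600)]:
    print(
        f"TP={tp}: {per_gpu_throughput(tok_s, tp):.0f} tok/s/GPU, "
        f"${cost_per_mtok(tok_s, tp, GPU_HOUR_USD):.2f} per MTok"
    )

ratio = cost_per_mtok(1600, 8, GPU_HOUR_USD) / cost_per_mtok(1200, 4, GPU_HOUR_USD)
print(f"TP=8 costs {ratio:.2f}x as much per token as TP=4")
```

The ratio is independent of the assumed price: it is simply the inverse of the per-GPU throughput ratio.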
4.3 When the TP size is not yours to choose — FP8 block alignment
There is a hard constraint that overrides the cost/throughput reasoning above: block-scaled FP8 weights require every tensor-parallel shard's dimension to be a whole number of quantization blocks. If dim / TP is not a multiple of the FP8 block size, the engine fails to load.
For Qwen 2.5 72B the FFN intermediate size is 29568 = 128 × 231. Because 231 is odd, the per-GPU shard (29568 / TP) is not a multiple of 128 at TP=2, 4, or 8, so block-FP8 fails to load at every one of those TP degrees, not only at TP=8. Dropping to a smaller power-of-two TP does not re-align the shards; short of TP=1, you need a framework build that pads them. The mechanism and the fix-order are in Lecture 03 §5.4. The takeaway for this lecture: when you serve FP8, validate that the model loads at your chosen TP before you reason about throughput, because quantization can constrain the TP choice independently of the workload.
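The alignment rule is cheap to check before launching anything. A minimal sketch, assuming a 128-element quantization block and using the Qwen 2.5 72B intermediate size from the text; the Llama 3.3 70B value (28672) comes from its public model config and is an assumption here, and `fp8_block_aligned` is a hypothetical helper, not a framework API.

```python
def fp8_block_aligned(dim, tp, block=128):
    """True if dim splits evenly across tp ranks and each shard is a whole number of blocks."""
    return dim % tp == 0 and (dim // tp) % block == 0


FFN_INTERMEDIATE = {
    "Qwen 2.5 72B": 29568,   # from the text: 128 x 231
    "Llama 3.3 70B": 28672,  # assumed from the public config: 128 x 224
}

alignment = {
    name: {tp: fp8_block_aligned(dim, tp) for tp in (1, 2, 4, 8)}
    for name, dim in FFN_INTERMEDIATE.items()
}
for name, by_tp in alignment.items():
    ok = [tp for tp, aligned in by_tp.items() if aligned]
    print(f"{name}: block-aligned at TP in {ok}")
```

A check like this only covers the FFN dimension; a real loader may impose the same constraint on other sharded dimensions.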
5. Runtime-specific TP configuration
5.1 vLLM
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.3-70B-Instruct",
    tensor_parallel_size=4,  # the TP degree
    dtype="float16",
    gpu_memory_utilization=0.92,
)
vLLM auto-discovers GPUs and assigns ranks. Works on standard HGX systems out of the box.
5.2 SGLang
python -m sglang.launch_server \
    --model-path Qwen/Qwen2.5-72B-Instruct \
    --tp-size 4 \
    --dtype float16 \
    --port 30000
SGLang's TP integrates with its RadixAttention and EP support. Same fabric assumptions as vLLM.
5.3 TensorRT-LLM
trtllm-build \
    --checkpoint_dir ./fp8_checkpoint \
    --output_dir ./engine \
    --tp_size 4 \
    ...
TRT-LLM compiles the model graph for the chosen TP size. Changing TP requires a rebuild — this is a deployment-time decision, not a runtime config.
5.4 Cross-runtime parity
For Llama 3.3 70B at TP=4, batch=32 on 4× H100 (16-bit weights unless the row says FP8):
| Runtime | TPOT (mean) | Throughput tok/s/GPU |
|---|---|---|
| vLLM 0.22 | ~30 ms | ~310 |
| SGLang 0.5 | ~29 ms | ~315 |
| TRT-LLM 1.3 BF16 | ~26 ms | ~360 |
| TRT-LLM 1.3 FP8 | ~16 ms | ~580 |
Numbers approximate; replicate in your lab. TRT-LLM's lead is mostly kernel quality on Hopper (fused attention kernels built on WGMMA and TMA) and FP8 maturity. vLLM is closing the gap quarter by quarter.
6. Diagnosis — is the all-reduce the bottleneck?
A profile-driven diagnosis:
- Nsight Systems timeline: look for NCCL kernel time as a fraction of step time.
- Per-layer breakdown: `nsys` shows attention matmul + all-reduce + FFN matmul + all-reduce per layer.
- NCCL communication time: is it overlapping with compute? On Hopper, vLLM 0.22+ and TRT-LLM use `cudaStream`-based overlap; older runtimes serialize.
Quick diagnostic:
| If you see | Suspect |
|---|---|
| NCCL > 30% of step | All-reduce dominant — try fewer GPUs or smaller messages |
| NCCL serial with compute (no overlap) | Outdated runtime — upgrade |
| Per-GPU FLOPs achievement low | Kernel quality / batch size — try larger batch |
| Step time scales poorly with TP | All-reduce — TP=4 → TP=8 is rarely 2× |
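The first row of the table can be turned into a quick check over exported kernel timings. A rough sketch, assuming you have already extracted (kernel name, duration) pairs for one step from a profiler export; the input format, the sample kernel names, and the match on a "nccl" name prefix are all assumptions for illustration. It sums kernel durations, so it overstates the NCCL share of wall time when communication overlaps compute.

```python
def nccl_fraction(kernels):
    """Share of summed GPU kernel time spent in kernels whose name starts with 'nccl'.

    kernels: iterable of (kernel_name, duration_ns) pairs for one step.
    """
    total = nccl = 0
    for name, duration_ns in kernels:
        total += duration_ns
        if name.lower().startswith("nccl"):
            nccl += duration_ns
    return nccl / total if total else 0.0


def triage(fraction, threshold=0.30):
    """Apply the 30% rule of thumb from the table."""
    if fraction > threshold:
        return "all-reduce dominant: try a lower TP degree or smaller messages"
    return "all-reduce not dominant: check kernel quality, batch size, and overlap"


# Illustrative sample with made-up names and durations, not a real trace.
sample_step = [
    ("ncclDevKernel_AllReduce_Sum_f16_RING_LL", 350),
    ("gemm_kernel", 500),
    ("attention_kernel", 150),
]
fraction = nccl_fraction(sample_step)
print(f"NCCL share of kernel time: {fraction:.0%} -> {triage(fraction)}")
```

For a wall-time figure, compare step time with and without the NCCL kernels on the critical path in the timeline view instead of summing durations.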
6.1 The reduce-scatter + all-gather pattern
For very large all-reduces, NCCL implements the operation as reduce-scatter (after which each GPU holds one fully-reduced chunk of the vector) followed by all-gather (each GPU then collects the remaining chunks). This decomposition is the bandwidth-optimal ring all-reduce: each GPU sends 2(N-1)/N times the tensor's bytes, which approaches 2× as N grows. An all-reduce therefore inherently moves about twice the data volume of the tensor it reduces, which is why the communication bill is what it is.
7. Sequence parallelism — the FFN overlay
A subtle but important optimization: sequence parallelism (SP) reduces the activation memory cost of TP by partitioning the sequence dimension across GPUs during non-matmul operations (RMSNorm, residual add, etc.).
Without SP, TP-partitioned models duplicate full activations across GPUs during RMSNorm and residual. SP partitions those operations too — each GPU holds only seq/N worth of activations.
- Wins for long-context workloads — at 128K context, SP saves substantial HBM.
- vLLM 0.22+ integrates SP transparently when `--enable-sequence-parallelism` is set.
- TRT-LLM has SP behind a flag.
For TP=8 + SP, the number of collectives doubles (a reduce-scatter plus an all-gather replace each all-reduce), but each one moves about half the data, so the total communication volume is roughly unchanged. Net win for long-context workloads where the activation memory was the constraint.
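The memory saving and the communication cost can both be put in numbers. A minimal sketch, assuming one [sequence, hidden] FP16 activation tensor at batch 1 with hidden size 8192, and the ring cost model for collectives; the function names are illustrative.

```python
MIB = 2**20


def activation_bytes(seq_len, hidden_size=8192, bytes_per_elem=2, tp=1, sequence_parallel=False):
    """Per-GPU bytes for one [seq, hidden] activation in the norm/residual region."""
    full = seq_len * hidden_size * bytes_per_elem
    return full // tp if sequence_parallel else full


def ring_bytes_sent(size_bytes, n):
    """Bytes each GPU sends under the ring cost model for each collective."""
    return {
        "all_reduce": 2 * (n - 1) * size_bytes // n,
        "reduce_scatter": (n - 1) * size_bytes // n,
        "all_gather": (n - 1) * size_bytes // n,
    }


SEQ_128K, TP = 131072, 8
without_sp = activation_bytes(SEQ_128K, tp=TP)
with_sp = activation_bytes(SEQ_128K, tp=TP, sequence_parallel=True)
print(f"per-GPU activation at 128K: {without_sp // MIB} MiB without SP, {with_sp // MIB} MiB with SP")

sent = ring_bytes_sent(without_sp, TP)
print(
    f"all-reduce sends {sent['all_reduce'] // MIB} MiB per GPU; "
    f"reduce-scatter + all-gather send {(sent['reduce_scatter'] + sent['all_gather']) // MIB} MiB"
)
```

Under this model SP issues twice as many collectives for the same bytes per GPU, so its cost shows up as extra launch and synchronization latency rather than extra bandwidth.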
Lab — measure TP scaling on Llama 3.3 70B
Goal: produce TP=2 / 4 / 8 throughput numbers on the same hardware, with the all-reduce cost annotated.
- Hardware — 8× H100 SXM (HGX) or 8× H200 if you have access.
- Model — Llama 3.3 70B Instruct, FP16 (or FP8 if you have TRT-LLM).
- Runtime — vLLM 0.22+ V1.
- Bench at three TP degrees — TP=2, 4, 8. Same batch size (32), same prompt (1024), same output (256), same iterations (100, with 20 warmup).
- Profile one TP=4 and one TP=8 run with Nsight Systems. Identify NCCL fraction of step time.
- Compute per-GPU throughput, scaling efficiency, and per-replica throughput.
- Plot the scaling curve. Annotate the bottleneck at each TP.
Pass criterion: you can defend the choice of TP for a chat product at 32K context with measured numbers — and explain why a different TP would be picked for a long-context batch product.
Self-check
- For Llama 3.3 70B FP16 on 4× H100 SXM, predict the all-reduce time per token decode at batch=32. Use 900 GB/s NVLink bandwidth and 160 all-reduces per token.
- A teammate proposes TP=8 for a chat product to "double throughput." The TP=4 → TP=8 measurement shows 1.5× throughput at 65% per-GPU efficiency. Defend or reject in two sentences using $/MTok reasoning.
- Your Nsight trace shows NCCL kernels running serially with compute kernels (no overlap). What runtime upgrade would you try first?
- Sequence parallelism is "free" memory savings — why isn't it on by default? What is the cost?
- For Qwen 2.5 72B at TP=4 vs Llama 3.3 70B at TP=4 on the same hardware, which has worse scaling efficiency and why?
References
- Megatron-LM tensor parallelism paper — arXiv:1909.08053
- "Reducing Activation Recomputation in Large Transformer Models" — arXiv:2205.05198 — sequence parallelism
- NCCL documentation — docs.nvidia.com/deeplearning/nccl/
- NVIDIA HGX H100 platform brief — nvidia.com/en-us/data-center/hgx/
- vLLM TP documentation — docs.vllm.ai/en/latest/serving/distributed_serving.html
- TensorRT-LLM multi-GPU — nvidia.github.io/TensorRT-LLM/
- DistServe (referenced for disaggregation contrast) — arXiv:2401.09670
Cross-references:
- Phase 5 → ML Systems Engineering Guide → Stage 5 Distributed Training Systems — the training-side foundation
- Phase 5 → GPU Infrastructure → 8x-H200-Training-Inference → 02 Training Setup
Current as of 2026-06
NCCL 2.30+, vLLM 0.22+ V1, SGLang 0.5+, TRT-LLM 1.3+, NVLink 4, NVSwitch v3. Refresh when NCCL 3.0 / vLLM V1 stabilizes further / a new NVLink generation lands.
Next
- Next: Lecture 05 — Modern serving stack: continuous batching, paged KV, prefix cache, speculation
- Previous: Lecture 03 — Quantizing Llama 3.3 70B and Qwen 2.5 72B
- Up: Part 2 — Dense at Hopper