Jared Frost

The runtime landscape — vLLM, SGLang, TensorRT-LLM, llama.cpp, MLX

Given a workload + hardware + SLO, which runtime do we start with — and why?

AI Inference Engineer 2026 — Special Course · Part 1 — Fundamentals of AI Inference / MLSys

Overview

The runtime is the layer between the model file and the GPU. It owns:

Picking the right runtime is the single largest engineering decision after picking the model. A wrong choice costs 2–5× in throughput, blocks features (paged KV, prefix cache, FP8), or limits the hardware you can target. A right choice means most of the hard kernel and scheduler work is already done.
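To make the 2–5× figure concrete, the arithmetic below converts a throughput gap into replica count. This is a minimal sketch: the target load and per-replica throughput are hypothetical numbers chosen for illustration, not measurements, and `replicas_needed` is a helper name invented here.

```python
import math


def replicas_needed(target_tok_s: float, per_replica_tok_s: float) -> int:
    """Smallest replica count whose combined throughput meets the target."""
    if per_replica_tok_s <= 0:
        raise ValueError("per-replica throughput must be positive")
    return math.ceil(target_tok_s / per_replica_tok_s)


# Hypothetical numbers, for illustration only.
TARGET_TOK_S = 10_000        # aggregate output tokens/s the product needs
GOOD_RUNTIME_TOK_S = 1_000   # per-replica throughput on a well-matched runtime

for slowdown in (1, 2, 5):
    n = replicas_needed(TARGET_TOK_S, GOOD_RUNTIME_TOK_S / slowdown)
    print(f"throughput divided by {slowdown}: {n} replicas")
```

Under these assumptions the replica count, and so the hardware bill, scales directly with the slowdown factor, which is why the runtime choice is worth benchmarking before committing.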

This lecture covers the 2026 landscape:

  1. vLLM — the open-source workhorse, V1 engine in 2026.
  2. SGLang — RadixAttention + EP, the structured-output + MoE serving leader.
  3. TensorRT-LLM — NVIDIA's max-throughput path, FP8 and FP4 native.
  4. llama.cpp — the edge / desktop / CPU+GPU universal runtime.
  5. MLX — Apple Silicon native.
  6. LMDeploy and others — TurboMind kernels, secondary players worth knowing.

For each: what it does well, what it does badly, when to pick it.

By the end you should have a runtime decision matrix for any workload × hardware combination, and a defended starting point you can prototype on within an hour of receiving a new project.


1. vLLM — the open-source workhorse

Project: github.com/vllm-project/vllm
Pinned version (this lecture): 0.22.x with the V1 engine as the default path.

1.1 What it does

1.2 V0 → V1 engine

vLLM's V0 engine (2023 → early 2025) accumulated complexity. The V1 engine (2025-onwards) is a rewrite that:

Always run V1 unless you have a specific reason not to, such as a third-party plugin that hasn't migrated yet. Treat V0 references as archaeology.

1.3 What it does badly

1.4 When to pick vLLM

Default starting point for most production deployments.


2. SGLang — RadixAttention + EP

Project: github.com/sgl-project/sglang
Pinned version: 0.5.x.

2.1 What it does

2.2 What it does badly

2.3 When to pick SGLang

Default starting point for MoE in Part 3.


3. TensorRT-LLM — NVIDIA's max-throughput path

Project: github.com/NVIDIA/TensorRT-LLM
Pinned version: 1.3.x.

3.1 What it does

3.2 What it does badly

3.3 When to pick TensorRT-LLM

Default for cost-sensitive batch and high-throughput chat workloads on Hopper / Blackwell, once the model has stabilized.


4. llama.cpp — universal edge / desktop

Project: github.com/ggml-org/llama.cpp
Pinned version: post-2026-04 builds.

4.1 What it does

4.2 What it does badly

4.3 When to pick llama.cpp

Default for any per-user, per-machine, edge or desktop scenario.


5. MLX — Apple Silicon native

Project: github.com/ml-explore/mlx
Pinned version: 0.31.x.

5.1 What it does

5.2 What it does badly

5.3 When to pick MLX


6. The secondary tier

Worth knowing, rarely the first pick:


7. The decision matrix

| Workload | Hardware | First pick | Second pick | Notes |
|---|---|---|---|---|
| Chat, dense 70B, OpenAI-API serving | H100/H200, 8× | vLLM | TensorRT-LLM | vLLM if mixed workloads; TRT-LLM if throughput-only |
| Chat, MoE 200B+ | H200 / B200, 8×+ | SGLang | vLLM | SGLang's EP support is mature |
| Agent / tool-use with structured output | any | SGLang | vLLM | XGrammar integration is fast |
| Batch / offline embedding | H100 | TensorRT-LLM | vLLM | TRT-LLM throughput wins |
| Batch / offline LLM | H200 / B200 | TensorRT-LLM | vLLM | engine compile pays off |
| RAG with heavy prefix sharing | H100 / H200 | SGLang | vLLM | RadixAttention wins |
| Edge / single user | Jetson, Mac, RPi | llama.cpp | MLX (Mac) | llama.cpp is universal |
| Apple Silicon | M-series | MLX | llama.cpp | MLX is faster on Apple GPU |
| Multi-modal vision | H100 / H200 | vLLM | SGLang | vLLM's multi-modal support is most mature |
| MoE on Blackwell | B200 / GB200 | SGLang | TensorRT-LLM | SGLang EP + FP4 path |

Starting point — not eternal truth. Re-bench when your hardware or model changes.
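One way to keep the matrix close to the code that uses it is to store it as data. The sketch below encodes a few of the rows above; the workload keys (`"chat-dense-70b"` and so on) and the names `Pick`, `MATRIX`, and `starting_runtime` are labels invented for this example, and the lookup deliberately fails loudly for workloads the matrix does not cover.

```python
from typing import NamedTuple


class Pick(NamedTuple):
    hardware: str
    first: str
    second: str
    note: str


# A subset of the decision matrix, keyed by hypothetical workload labels.
MATRIX: dict[str, Pick] = {
    "chat-dense-70b": Pick("H100/H200, 8x", "vLLM", "TensorRT-LLM",
                           "vLLM if mixed workloads; TRT-LLM if throughput-only"),
    "chat-moe-200b": Pick("H200 / B200, 8x+", "SGLang", "vLLM",
                          "SGLang's EP support is mature"),
    "rag-prefix-sharing": Pick("H100 / H200", "SGLang", "vLLM",
                               "RadixAttention wins"),
    "batch-offline-llm": Pick("H200 / B200", "TensorRT-LLM", "vLLM",
                              "engine compile pays off"),
    "edge-single-user": Pick("Jetson, Mac, RPi", "llama.cpp", "MLX (Mac)",
                             "llama.cpp is universal"),
}


def starting_runtime(workload: str) -> Pick:
    """Return the matrix row for a workload, or fail loudly if none exists."""
    try:
        return MATRIX[workload]
    except KeyError:
        raise KeyError(
            f"no matrix row for {workload!r}; benchmark before choosing"
        ) from None


pick = starting_runtime("rag-prefix-sharing")
print(f"start with {pick.first}, fall back to {pick.second} ({pick.note})")
```

Keeping the table as data makes the re-bench step explicit: when hardware or the model changes, the row gets edited in one place alongside the benchmark that justified it.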


8. The shape of the runtime layer

All five runtimes have the same architectural skeleton:

  HTTP / gRPC API
       │
       ▼
   scheduler  ──►  pending request queue
       │              │
       ▼              ▼
   model graph   ──►  batched forward pass
       │              │
       ▼              ▼
    kernels      ──►  attention, FFN, sampling
       │              │
       ▼              ▼
  KV manager    ──►  paged KV / radix tree / per-session
       │
       ▼
    HBM / fabric

Where they differ:

A senior engineer can read any runtime's source and locate each box. Junior engineers learn one runtime; seniors learn the boxes and apply them to whichever runtime they're handed.
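To make the boxes concrete, here is a toy serving loop with one stand-in per box. It is a sketch under strong simplifying assumptions, not how any of the five runtimes is implemented: `fake_forward` replaces the model graph and kernels, and `KVManager` reserves blocks for a request's full length up front, whereas a paged KV manager would typically allocate on demand. All class and function names are invented for this example.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Request:
    rid: int
    prompt: list[int]
    max_new_tokens: int
    output: list[int] = field(default_factory=list)

    def done(self) -> bool:
        return len(self.output) >= self.max_new_tokens


class KVManager:
    """Toy KV box: counts fixed-size blocks instead of holding tensors."""

    def __init__(self, total_blocks: int, block_size: int = 16):
        self.free = total_blocks
        self.block_size = block_size
        self.held: dict[int, int] = {}

    def reserve(self, req: Request) -> bool:
        tokens = len(req.prompt) + req.max_new_tokens
        need = -(-tokens // self.block_size)  # ceiling division
        if need > self.free:
            return False
        self.free -= need
        self.held[req.rid] = need
        return True

    def release(self, req: Request) -> None:
        self.free += self.held.pop(req.rid)


def fake_forward(batch: list[Request]) -> list[int]:
    """Stand-in for model graph + kernels: one 'token' per running request."""
    return [len(r.prompt) + len(r.output) for r in batch]


def serve(requests: list[Request], kv: KVManager, max_batch: int = 4):
    pending = deque(requests)          # pending request queue
    running: list[Request] = []
    finished: list[Request] = []
    steps = 0
    while pending or running:
        # Scheduler: admit in FIFO order while the batch and KV budget allow.
        while pending and len(running) < max_batch and kv.reserve(pending[0]):
            running.append(pending.popleft())
        if not running:
            raise RuntimeError("head-of-queue request exceeds the KV budget")
        # Batched forward pass: every running request advances by one token.
        for req, tok in zip(running, fake_forward(running)):
            req.output.append(tok)
        steps += 1
        # Retire finished requests and return their KV blocks.
        for req in [r for r in running if r.done()]:
            kv.release(req)
            running.remove(req)
            finished.append(req)
    return finished, steps


reqs = [Request(0, [1, 2, 3, 4], 2), Request(1, [1, 2, 3, 4], 3),
        Request(2, [1, 2, 3, 4], 1)]
kv = KVManager(total_blocks=2)
finished, steps = serve(reqs, kv)
print([r.rid for r in finished], steps, kv.free)
```

With only two KV blocks available, the third request waits in the queue until the first one finishes and releases its block, which is the scheduler and KV manager interacting in miniature.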


Lab — bench three runtimes on one model

Goal: produce a same-machine, same-model comparison across three runtimes.

  1. Pick one model — Qwen3-4B Instruct (AWQ-INT4 or original BF16, both available).
  2. Pick three runtimes — vLLM (V1), llama.cpp (post-2026-04), and one of: SGLang or TensorRT-LLM.
  3. Same hardware, same warmup, same prompt set.
  4. Bench:
    • Chat shape: prompt length 512, output 128, concurrency 1, 8, 32.
    • Agent shape: prompt 256, output 16, concurrency 16 (low-latency).
  5. Report TTFT (time to first token), TPOT (time per output token), throughput, and peak HBM, with profiler traces for one run.
  6. Pick a winner per shape and defend in one paragraph.

Pass criterion: the report is in your benchmark repo, and anyone reading it can see why a given runtime won at a given shape.


Self-check

  1. You receive a project: serve Qwen3-MoE 235B-A22B for an agent product on 8× H200. What runtime do you start with, and what do you check before committing?
  2. A teammate insists on TensorRT-LLM for a chat product with a model that ships a new fine-tune every week. Defend or reject the choice in two sentences.
  3. You are shipping an LLM appliance on Jetson Orin Nano Super 8 GB. Which runtime, which model size, which quantization? Why?
  4. SGLang's RadixAttention wins on RAG. Why specifically does RAG benefit more than chat?
  5. Your benchmark shows vLLM at 920 tok/s and TRT-LLM at 1430 tok/s on Llama 3.3 70B at FP8 on H200, single replica. You ship vLLM anyway. Give one engineering reason why this could be the right call.


Current as of 2026-06

Versions pinned: vLLM 0.22.x V1, SGLang 0.5.x, TensorRT-LLM 1.3.x, llama.cpp post-2026-04, MLX 0.31.x. Update these pins when vLLM V1 ships as the unconditional default, when a runtime ships a breaking API change, or when TRT-LLM's Blackwell FP4 path lands in a stable release.

