About
I’m Jared Frost, a GPU/AI kernel engineer. I come from a hardware engineering background and work primarily in C++ and CUDA, down at the level where performance is won or lost: memory hierarchies, warp scheduling, occupancy, and the gap between a kernel’s theoretical and achieved throughput.
These days my focus is LLM inference — understanding exactly where the cycles and the bytes go, and closing the distance to the hardware’s roofline.
What this blog is
A public lab notebook. I write up the things I’m learning and building so that I’m forced to understand them precisely enough to explain them. Expect posts on:
- CUDA kernel optimization — profiling with Nsight, occupancy and register pressure, memory coalescing, shared-memory tiling, warp-level primitives.
- LLM inference performance — the roofline model for prefill vs. decode, KV-cache management, batching, quantization, attention kernels (FlashAttention and friends), and serving throughput.
- Hardware-aware analysis — reading the numbers off NVIDIA datasheets and turning them into predictions, then measuring how close real kernels get.
- Benchmarks & write-ups — reproducible measurements, not vibes. Code on GitHub where it makes sense.
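To give a flavor of the roofline reasoning in the list above, here is a minimal back-of-envelope sketch. It assumes a hypothetical GPU with round-number peaks (100 TFLOP/s, 1 TB/s), not figures from any real datasheet, and models a single FP16 weight matmul with each operand moved through memory exactly once. The function names and shapes are illustrative only.

```python
def attainable_flops(intensity, peak_flops, peak_bw):
    """Roofline: achievable FLOP/s is capped by the compute roof or the memory roof."""
    return min(peak_flops, intensity * peak_bw)


def gemm_intensity(tokens, d_in, d_out, bytes_per_elem=2):
    """Arithmetic intensity (FLOP/byte) of a (tokens x d_in) @ (d_in x d_out) matmul,
    assuming inputs, weights, and outputs each cross the memory bus once."""
    flops = 2 * tokens * d_in * d_out
    bytes_moved = bytes_per_elem * (tokens * d_in + d_in * d_out + tokens * d_out)
    return flops / bytes_moved


# Hypothetical hardware numbers, chosen for easy arithmetic (assumption, not a datasheet).
PEAK_FLOPS = 100e12           # 100 TFLOP/s
PEAK_BW = 1e12                # 1 TB/s
RIDGE = PEAK_FLOPS / PEAK_BW  # intensity at which the two roofs meet

# Decode processes one new token per step; prefill processes the whole prompt at once.
decode = gemm_intensity(tokens=1, d_in=4096, d_out=4096)
prefill = gemm_intensity(tokens=2048, d_in=4096, d_out=4096)

for name, ai in [("decode", decode), ("prefill", prefill)]:
    bound = "memory-bound" if ai < RIDGE else "compute-bound"
    roof = attainable_flops(ai, PEAK_FLOPS, PEAK_BW)
    print(f"{name}: {ai:.1f} FLOP/byte, {bound}, roof {roof / 1e12:.1f} TFLOP/s")
```

Under these assumptions, single-token decode sits far below the ridge point and is limited by bandwidth, while a 2048-token prefill sits above it and is limited by compute. Real kernels add KV-cache traffic, batching, and imperfect reuse, which is where the measuring comes in.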
The mission
I’m working toward becoming a top-tier AI kernel engineer for NVIDIA GPUs — the person who can take an inference workload and squeeze out the last fraction of a percent. This blog is part commitment device, part portfolio: a running record of the work, in public. If that’s interesting to you — or you’re hiring for exactly this — let’s talk.
For more on why I’m doing this in the open, see “Building in public: a commitment to mastering GPU kernels.”
Contact
- GitHub — github.com/ai-hpc
- LinkedIn — Jared Frost
- Email — jared@fastcrest.com