Jared Frost

Building in public: a commitment to mastering GPU kernels

This is the first post, so it’s the one that sets the terms.

I’m going to become a top-tier AI kernel engineer for NVIDIA GPUs. Not “familiar with CUDA.” Not “can read a profiler.” The person you hand a stubborn inference workload to when the easy 10× is gone and all that’s left is the last fraction of a percent. That’s the bar. This blog exists to hold me to it.

Why in public

Two reasons.

Accountability. A goal you announce is harder to quietly abandon than one you keep in your head. Writing a claim down — with my name on it, on a domain that’s mine — is a commitment device. If I say I’ll understand how the warp scheduler hides memory latency, I now have to actually understand it well enough to write it down without hand-waving. Teaching is the forcing function; the public part is the deadline.

Evidence. A résumé asserts. A blog demonstrates. When I say I can take a kernel from 30% to 80% of peak bandwidth, I’d rather point to a post with the Nsight traces, the before/after, and the reasoning than ask anyone to take my word for it. If you’re hiring for exactly this kind of work — that’s the point.

What I’ll write about

The intersection I care about is hardware-aware LLM inference.

The rules I’m setting for myself

  1. Numbers or it didn’t happen. Every performance claim gets a measurement and the conditions it was measured under.
  2. Explain the why, not just the what. Anyone can paste a faster kernel. The value is in why the hardware prefers it.
  3. Be honest about the gap. Peak is a spec. I’ll always separate theoretical peak from achievable, and say which one I’m quoting.
  4. Ship regularly. Cadence beats perfection. A rough, correct post beats a polished one that never gets written.

The next post starts where all of this has to start — the roofline model for LLM inference: why a single-stream decode on an H100 uses about 0.3% of the GPU’s tensor-core throughput, and what that fact dictates about everything downstream.
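One way to arrive at a figure like that 0.3% is a back-of-envelope roofline, sketched below. Everything in it is an assumption rather than a measurement: the peak numbers are approximate H100 SXM datasheet values (about 989 TFLOP/s dense BF16 and about 3.35 TB/s of HBM bandwidth), weights are 2 bytes each, and the model counts only weight reads, ignoring KV-cache and activation traffic. The function name `decode_utilization` is hypothetical, chosen for illustration.

```python
# Back-of-envelope roofline for single-stream (batch-1) decode.
# All hardware figures are assumptions (approximate H100 SXM datasheet
# values), not measurements.

PEAK_TFLOPS = 989.0    # assumed dense BF16 tensor-core peak, in TFLOP/s
PEAK_BW_TBPS = 3.35    # assumed HBM bandwidth, in TB/s
BYTES_PER_WEIGHT = 2   # assumed BF16/FP16 weights
FLOPS_PER_WEIGHT = 2   # one multiply and one add per weight per token


def decode_utilization(batch_size: int = 1) -> float:
    """Fraction of peak tensor throughput attainable when weight reads dominate."""
    # Each weight is read once per step and reused across the batch,
    # so arithmetic intensity (FLOP per byte) grows with batch size.
    intensity = FLOPS_PER_WEIGHT * batch_size / BYTES_PER_WEIGHT
    # Roofline: the lower of the compute ceiling and the bandwidth ceiling.
    attainable_tflops = min(PEAK_TFLOPS, intensity * PEAK_BW_TBPS)
    return attainable_tflops / PEAK_TFLOPS


if __name__ == "__main__":
    print(f"ridge point: {PEAK_TFLOPS / PEAK_BW_TBPS:.0f} FLOP/byte")  # 295
    print(f"batch-1 utilization: {decode_utilization(1):.2%}")         # 0.34%
```

Under these assumptions the kernel sits at 1 FLOP per byte against a ridge point near 295, which is where the roughly 0.3% comes from. Different precisions, SKUs, or a sparsity-inflated peak figure would shift the number, so the conditions matter as much as the result.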

Let’s go.

