Building in public: a commitment to mastering GPU kernels

This is the first post, so it’s the one that sets the terms.

I’m going to become a top-tier AI kernel engineer for NVIDIA GPUs. Not “familiar with CUDA.” Not “can read a profiler.” The person you hand a stubborn inference workload to when the easy 10× is gone and all that’s left is the last fraction of a percent. That’s the bar. This blog exists to hold me to it.

Why in public

Two reasons.

Accountability. A goal you announce is harder to quietly abandon than one you keep in your head. Writing a claim down — with my name on it, on a domain that’s mine — is a commitment device. If I say I’ll understand how the warp scheduler hides memory latency, I now have to actually understand it well enough to write it down without hand-waving. Teaching is the forcing function; the public part is the deadline.

Evidence. A résumé asserts. A blog demonstrates. When I say I can take a kernel from 30% to 80% of peak bandwidth, I’d rather point to a post with the Nsight traces, the before/after, and the reasoning than ask anyone to take my word for it. If you’re hiring for exactly this kind of work — that’s the point.

What I’ll write about

The intersection I care about is hardware-aware LLM inference:

CUDA kernel optimization — memory coalescing, shared-memory tiling, occupancy vs. register pressure, warp-level primitives, and what Nsight Compute actually tells you.
The roofline, for real workloads — predicting performance from the datasheet, then measuring how close real kernels get and explaining the gap.
LLM inference internals — prefill vs. decode, KV-cache management, batching, quantization, and attention kernels like FlashAttention.
Reproducible benchmarks — numbers with the setup attached, code on GitHub. No vibes.

The rules I’m setting for myself

Numbers or it didn’t happen. Every performance claim gets a measurement and the conditions it was measured under.
Explain the why, not just the what. Anyone can paste a faster kernel. The value is in why the hardware prefers it.
Be honest about the gap. Peak is a spec. I’ll always separate theoretical peak from achievable, and say which one I’m quoting.
Ship regularly. Cadence beats perfection. A rough, correct post beats a polished one that never gets written.

The next post starts where all of this has to start — the roofline model for LLM inference: why a single-stream decode on an H100 uses about 0.3% of the GPU’s tensor-core throughput, and what that fact dictates about everything downstream.

Let’s go.