This is the first post, so it’s the one that sets the terms.
I’m going to become a top-tier AI kernel engineer for NVIDIA GPUs. Not “familiar with CUDA.” Not “can read a profiler.” The person you hand a stubborn inference workload to when the easy 10× is gone and all that’s left is the last fraction of a percent. That’s the bar. This blog exists to hold me to it.
Why in public
Two reasons.
Accountability. A goal you announce is harder to quietly abandon than one you keep in your head. Writing a claim down — with my name on it, on a domain that’s mine — is a commitment device. If I say I’ll understand how the warp scheduler hides memory latency, I now have to actually understand it well enough to write it down without hand-waving. Teaching is the forcing function; the public part is the deadline.
Evidence. A résumé asserts. A blog demonstrates. When I say I can take a kernel from 30% to 80% of peak bandwidth, I’d rather point to a post with the Nsight traces, the before/after, and the reasoning than ask anyone to take my word for it. If you’re hiring for exactly this kind of work — that’s the point.
What I’ll write about
The intersection I care about is hardware-aware LLM inference:
- CUDA kernel optimization — memory coalescing, shared-memory tiling, occupancy vs. register pressure, warp-level primitives, and what Nsight Compute actually tells you.
- The roofline, for real workloads — predicting performance from the datasheet, then measuring how close real kernels get and explaining the gap.
- LLM inference internals — prefill vs. decode, KV-cache management, batching, quantization, and attention kernels like FlashAttention.
- Reproducible benchmarks — numbers with the setup attached, code on GitHub. No vibes.
The rules I’m setting for myself
- Numbers or it didn’t happen. Every performance claim gets a measurement and the conditions it was measured under.
- Explain the why, not just the what. Anyone can paste a faster kernel. The value is in why the hardware prefers it.
- Be honest about the gap. Peak is a spec. I’ll always separate theoretical peak from achievable, and say which one I’m quoting.
- Ship regularly. Cadence beats perfection. A rough, correct post beats a polished one that never gets written.
The next post starts where all of this has to start — the roofline model for LLM inference: why a single-stream decode on an H100 uses about 0.3% of the GPU’s tensor-core throughput, and what that fact dictates about everything downstream.
Let’s go.