Projects
GPU kernel and LLM-inference projects I've built or am actively working on. Code on GitHub.
-
FlashAttention from scratch
Active
A fused, IO-aware attention kernel in CUDA, tiling Q·Kᵀ through shared memory so the full score matrix never touches HBM.
- CUDA
- Attention
- Tensor Cores
-
Tiled GEMM kernel
Active
A from-scratch HGEMM/SGEMM kernel that approaches cuBLAS performance through shared-memory tiling, register blocking, and vectorized loads, with a roofline write-up.
- CUDA
- GEMM
- Performance
-
llm-infer-bench
Active
A reproducible benchmark harness for LLM inference that reports tokens/s, time to first token (TTFT), and roofline utilization across batch sizes, sequence lengths, and quantization settings.
- Inference
- Benchmarking
- Python
-
Nsight case studies
Past
Annotated Nsight Compute walkthroughs that turn profiler data (Speed of Light (SOL) throughput, stall reasons, the in-tool roofline) into measured kernel speedups.
- Profiling
- Nsight
- CUDA
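
Several of the projects above refer to the roofline model, so here is a minimal sketch of the arithmetic behind it. This is an illustration, not code from any of the repositories. The function names are hypothetical, the hardware numbers in the demo are placeholders rather than the specs of a real GPU, and the GEMM traffic estimate assumes the ideal case where A and B are each read once and C is written once.

```python
def gemm_arithmetic_intensity(m, n, k, bytes_per_elem):
    """FLOPs per byte for C = A @ B, assuming ideal (minimum) memory traffic."""
    flops = 2 * m * n * k                      # one multiply + one add per MAC
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved


def roofline_attainable(intensity, peak_flops, peak_bandwidth):
    """Attainable FLOP/s: the lower of the compute roof and the memory roof."""
    return min(peak_flops, intensity * peak_bandwidth)


def roofline_utilization(measured_flops, intensity, peak_flops, peak_bandwidth):
    """Fraction of the attainable roof that a measured kernel reaches."""
    return measured_flops / roofline_attainable(intensity, peak_flops, peak_bandwidth)


if __name__ == "__main__":
    # Placeholder hardware numbers, chosen only to make the example readable.
    PEAK_FLOPS = 100e12      # FLOP/s (assumed)
    PEAK_BANDWIDTH = 1e12    # bytes/s (assumed)

    # Large square FP16 GEMM versus a decode-style matrix-vector product (m = 1).
    for shape in [(4096, 4096, 4096), (1, 4096, 4096)]:
        ai = gemm_arithmetic_intensity(*shape, bytes_per_elem=2)
        roof = roofline_attainable(ai, PEAK_FLOPS, PEAK_BANDWIDTH)
        bound = "compute" if roof == PEAK_FLOPS else "memory"
        print(f"{shape}: intensity={ai:.2f} FLOP/B, roof={roof:.3e} FLOP/s ({bound}-bound)")
```

Under these assumptions the large square GEMM sits far to the right of the ridge point (peak FLOP/s divided by peak bandwidth), while the m = 1 case lands near 1 FLOP/byte and is limited by memory bandwidth, which is the usual picture for batch-1 LLM decoding.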
More on GitHub: @ai-hpc