
Kernel Kombat: How Multi‑Agent LLMs Squeeze 1.32× More From Your GPUs
TL;DR Astra is a multi‑agent LLM system that optimizes existing CUDA kernels instead of generating them from PyTorch. On three production‑relevant SGLang kernels, it delivered 1.32× average speedup (up to 1.46×) without fine‑tuning—just structured zero‑shot prompting. The win isn’t a single trick; it’s a division of labor: testing, profiling, planning, and coding each handled by a specialized agent that iterates toward faster, still‑correct kernels. Why this matters for business readers GPU efficiency is the new gross margin. If your serving stack pushes trillions of tokens per day, a 25–45% kernel‑level speedup compounds into: ...