Speculate Smarter, Not Harder: Hierarchical Decoding Without Regret
Opening — Why this matters now LLM inference has quietly become the dominant cost center of modern AI systems. Training grabs headlines; inference drains budgets. As models scale into the tens of billions of parameters, every additional forward pass hurts — financially and operationally. Speculative decoding promised relief by letting small models run ahead and big models merely verify. But verification, ironically, became the bottleneck. ...