Speculate Smarter, Not Harder: Hierarchical Decoding Without Regret

Speed is the polite word. Cost is the less polite one. Every production LLM system eventually meets the same boring villain: the target model must generate tokens one after another, and each forward pass is expensive. Speculative decoding was supposed to soften that problem. Let a cheaper draft model run ahead, ask the expensive model to verify the draft, and accept several tokens per target-model call when the draft is good enough. Simple. Elegant. Almost suspiciously useful. ...

January 12, 2026 · 16 min · Zelina