Model-Serving

FLARE Without Fireworks: Diffusion Speed Needs an Autoregressive Anchor

TL;DR for operators: FLARE is not a “diffusion models are faster, therefore rejoice” paper. That would be convenient. Also wrong. The paper shows a practical conversion recipe for taking strong hybrid-attention autoregressive LLM checkpoints and giving them a diffusion-style parallel generation path without throwing away the original causal behavior.[1] The important move is not one trick. It is a coupled mechanism: a clean autoregressive stream anchors the model’s inherited capability, a noisy diffusion stream learns block-level denoising, document-packed masking prevents examples from leaking into one another, recurrent-state scheduling makes hybrid attention behave under non-causal visibility, and a unified serving stack lets one checkpoint run in two decoding modes. ...

Speculate Smarter, Not Harder: Hierarchical Decoding Without Regret

Speed is the polite word. Cost is the less polite one. Every production LLM system eventually meets the same boring villain: the target model must generate tokens one after another, and each forward pass is expensive. Speculative decoding was supposed to soften that problem. Let a cheaper draft model run ahead, ask the expensive model to verify the draft, and accept several tokens per target-model call when the draft is good enough. Simple. Elegant. Almost suspiciously useful. ...

Rotate Less, Quantize Better: OptRot and the Geometry of LLM Compression

Packing is easy until one object is much larger than everything else. A warehouse can fit hundreds of ordinary boxes onto neatly spaced shelves. Add one grand piano, however, and the spacing plan becomes rather less elegant. Either the piano does not fit, or every shelf is redesigned around an object that appears once. ...