Hybrid-Attention

TL;DR for operators

FLARE is not a “diffusion models are faster, therefore rejoice” paper. That would be convenient. Also wrong. The paper shows a practical conversion recipe for taking strong hybrid-attention autoregressive LLM checkpoints and giving them a diffusion-style parallel generation path without throwing away the original causal behavior.[1] The important move is not one trick. It is a coupled mechanism: a clean autoregressive stream anchors the model’s inherited capability, a noisy diffusion stream learns block-level denoising, document-packed masking prevents examples from leaking into one another, recurrent-state scheduling makes hybrid attention behave under non-causal visibility, and a unified serving stack lets one checkpoint run in two decoding modes. ...
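To make the document-packed masking idea concrete, here is a minimal sketch of one way such a mask could be built. It is an illustration under stated assumptions, not the paper's implementation: the function name `packed_block_mask`, the `doc_lengths` and `block_size` parameters, and the specific visibility rule (full visibility inside a block, causal visibility across blocks, no visibility across documents) are all hypothetical choices made for this example.

```python
from typing import List


def packed_block_mask(doc_lengths: List[int], block_size: int) -> List[List[bool]]:
    """Build an attention mask for several documents packed into one sequence.

    mask[i][j] is True when query token i may attend to key token j.
    Assumed rule (illustrative only): a token sees tokens of the same
    document that sit in its own block or in an earlier block.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")

    doc_ids: List[int] = []    # which packed document each token belongs to
    block_ids: List[int] = []  # block index of each token, restarted per document
    for doc, length in enumerate(doc_lengths):
        for pos in range(length):
            doc_ids.append(doc)
            block_ids.append(pos // block_size)

    n = len(doc_ids)
    return [
        [
            # Same document (no cross-example leakage) and not a future block.
            doc_ids[i] == doc_ids[j] and block_ids[j] <= block_ids[i]
            for j in range(n)
        ]
        for i in range(n)
    ]


if __name__ == "__main__":
    # Two packed documents of lengths 3 and 2, denoised in blocks of 2 tokens.
    mask = packed_block_mask([3, 2], block_size=2)
    for row in mask:
        print("".join("x" if visible else "." for visible in row))
```

The block index restarts at each document boundary, so a packed example never sees its neighbor regardless of where the boundary falls relative to the block grid. A production mask would typically be a tensor consumed by the attention kernel rather than a nested list; the nested list keeps the sketch dependency-free.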