When One Token Rules Them All: Diffusion Models and the Quiet Collapse of Composition
Opening — Why this matters now Text-to-image diffusion models are often marketed as masters of compositional imagination: just add more words, and the model will obligingly combine them into a coherent visual scene. In practice, however, this promise quietly collapses the moment multiple concepts compete for attention. A landmark swallows an object. An artist style erases the product. One concept wins, the other simply vanishes. ...