When One Token Rules Them All: Diffusion Models and the Quiet Collapse of Composition

Product teams often discover image-generation failure in the most boring possible way: the image looks good.

The lighting is fine. The texture is convincing. The output is not deformed, not surreal in the bad way, and not obviously broken. Then someone notices the actual requested product is missing.

A prompt asks for a famous castle on a coaster. The model gives the castle. It may give a postcard, a painting, a dramatic tourist shot, perhaps a suspiciously elegant architectural fantasy. The coaster quietly leaves the room. No farewell email.

That is the failure mode studied in “Dominating vs. Dominated: Generative Collapse in Diffusion Models.”¹ The paper calls it the Dominant-vs-Dominated, or DvD, phenomenon: in a multi-concept prompt, one concept visually overwhelms the generation while another concept is suppressed or disappears. The point is not that diffusion models sometimes misunderstand complicated prompts. We already knew that. The sharper claim is that some concepts arrive with rigid visual priors learned from low-diversity training examples, and those priors can seize the generation process before the rest of the prompt gets a fair turn.

That matters because many commercial prompts are not “a cat on a sofa.” They are closer to “a branded product in the style of X,” “a character-themed mug,” “a famous landmark on a souvenir item,” or “a campaign asset combining an icon, object, and visual identity.” These are exactly the places where a visually strong concept can make the output look polished while silently violating the business requirement.

The paper’s contribution is useful because it turns a familiar complaint — “the model ignored part of my prompt” — into a mechanism:

low-diversity concepts form rigid visual priors;
those priors dominate early semantic attention;
dominated concepts lose influence during the first denoising steps;
the effect is distributed across attention heads, making simple head-level fixes weak;
an attention-based diagnostic can flag likely DvD cases, although it is not yet a production mitigation strategy.

The business lesson is equally plain: for image-generation workflows, aesthetic quality is the cheap test. Composition fidelity is the expensive one.

The failure is not random seed bad luck

A convenient misconception is to treat missing concepts as prompt-engineering noise. Add commas. Add “clearly visible.” Add “product photo.” Try another seed. Invoke the ancient ritual of “highly detailed, 8k.” Sometimes that helps. Often it only hides the fact that the model has already decided what the image is about.

The paper’s opening example is “Neuschwanstein Castle coaster.” Across five random seeds, Stable Diffusion 1.4 successfully generates both concepts in only one case, while Stable Diffusion 2.1 does so in two. The castle’s distinctive form dominates; the coaster is usually absent. This is not merely a failure to understand the word “coaster.” It is a conflict between two types of learned representation.

Some concepts are visually narrow. A specific landmark has repeated visual patterns: facade, towers, skyline, usual photographic angles. A famous character has costume, color scheme, silhouette, logo, and pose conventions. An artist’s name may activate a constrained visual style. Other concepts are visually broad. A coaster can be round, square, ceramic, cork, patterned, blank, photographed on a table, sold as a product, or barely visible under a cup.

When a narrow concept and a broad concept are combined, the narrow one may behave like a gravitational object. It is easier for the model to instantiate. Its training examples create a strong basin of attraction. The broader concept has more possible forms, and therefore less immediate authority during generation.

This is the core mechanism the paper proposes: visual diversity disparity.

The phrase sounds harmless. It is not. In operational terms, it means the model’s learned prior is uneven across concepts. Some words come with a highly specific image template. Others come with a loose cloud of possibilities. When both appear in the same prompt, the model does not necessarily negotiate fairly.

DominanceBench turns the failure into something measurable

The authors introduce DominanceBench, a benchmark of 300 prompts exhibiting strong DvD behavior. The prompts pair one low-diversity concept — artist, landmark, or character — with one high-diversity object such as a mug, pouch, coaster, tote bag, hoodie, or notebook.

This benchmark is not just a list of funny prompts where models fail. It is designed to isolate a recurring pattern. For each candidate prompt, the authors generate 10 images using Stable Diffusion 1.4. They compute a DvD Score using VQA-style questions answered by Qwen2.5-VL: five questions assess the presence of one concept, and five questions assess the presence of the other. A prompt enters DominanceBench if at least 7 of 10 generated images exceed the DvD threshold.
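The inclusion rule is simple enough to write down. The following is a minimal sketch, assuming per-image DvD Scores have already been computed. The helper name `enters_benchmark`, the example scores, and the threshold value of 50.0 are placeholders for illustration, not values taken from the paper.

```python
from typing import Sequence


def enters_benchmark(dvd_scores: Sequence[float],
                     threshold: float,
                     min_hits: int = 7) -> bool:
    """Return True if enough generated images exceed the DvD threshold.

    dvd_scores: one score per generated image (10 per prompt in the article).
    threshold:  DvD threshold supplied by the caller (its value is an assumption here).
    min_hits:   how many images must exceed the threshold (7 of 10 in the article).
    """
    hits = sum(1 for score in dvd_scores if score > threshold)
    return hits >= min_hits


# Hypothetical scores for one prompt across 10 seeds.
scores = [72, 65, 80, 12, 90, 68, 55, 30, 77, 61]
print(enters_benchmark(scores, threshold=50.0))
```

The rule counts images rather than averaging scores, so a prompt qualifies only when collapse is the usual outcome, not when one extreme seed drags the mean up.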

The paper also evaluates the same prompts with Stable Diffusion 2.1. The DvD Score decreases overall, but many prompts still remain above the threshold. That is an important boundary: the phenomenon is not identical across model versions, but it does not vanish merely because the model is newer.

The authors validate the benchmark against 300 balanced prompts where both concepts appear successfully. Those balanced prompts have much lower DvD Scores; the paper reports a median of 11.6 in one benchmark validation and an average of 20.64 in the balanced-prompt comparison used for attention analysis. The exact statistic differs because the balanced sets and scoring contexts serve different parts of the study, but the direction is consistent: DvD prompts occupy a different regime from ordinary two-concept prompts.

A compact way to read the benchmark design is this:

Component	Likely purpose in the paper	What it supports	What it does not prove
DominanceBench, 300 prompts	Main evidence infrastructure	DvD can be collected and studied systematically across artists, landmarks, characters, and objects	That every multi-concept failure is DvD
Balanced prompts	Control comparison	High DvD Scores are not simply a property of all two-concept prompts	That balanced prompts cover all commercial use cases
Stable Diffusion 1.4 and 2.1 comparison	Robustness across model versions	The pattern persists beyond a single SD release, though weaker in SD 2.1	That the same magnitude holds for all modern proprietary models
VQA-based scoring	Operational measurement	Concept presence can be evaluated at scale	That VQA judgments perfectly match human brand or product requirements

This distinction matters for business readers. The benchmark does not say “your image model will always fail with famous concepts.” It says: prompts mixing visually rigid and visually flexible concepts deserve separate composition testing, because aggregate image quality metrics will not catch the failure.

Low visual diversity creates rigid priors

The paper’s most important causal evidence comes from a controlled DreamBooth experiment.

Instead of only observing that famous landmarks dominate coasters, the authors create a new token, “dvddog,” and fine-tune Stable Diffusion 1.4’s UNet using 120 ImageNet dog images. They then vary the number of dog breeds represented in training. One variant uses all 120 images from a single breed, giving the model a low-diversity concept. Another uses 12 images each from 10 breeds, giving the model a higher-diversity concept. Intermediate variants sit between those extremes.
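The two extremes of that design share one constraint: the image budget stays at 120 while the number of breeds changes. The following is a rough sketch of how such splits could be assembled, assuming images are grouped by breed. The function name `build_variant`, the file names, and the equal-share rule for intermediate variants are illustrative assumptions, not the authors’ data pipeline.

```python
import random
from typing import Dict, List

TOTAL_IMAGES = 120  # fixed training budget described in the article


def build_variant(images_by_breed: Dict[str, List[str]],
                  n_breeds: int,
                  seed: int = 0) -> List[str]:
    """Pick n_breeds breeds and take an equal share of images from each."""
    if TOTAL_IMAGES % n_breeds != 0:
        raise ValueError("n_breeds must divide the image budget evenly")
    per_breed = TOTAL_IMAGES // n_breeds
    rng = random.Random(seed)
    breeds = rng.sample(sorted(images_by_breed), n_breeds)
    selected: List[str] = []
    for breed in breeds:
        pool = images_by_breed[breed]
        if len(pool) < per_breed:
            raise ValueError(f"not enough images for breed {breed!r}")
        selected.extend(rng.sample(pool, per_breed))
    return selected


# Hypothetical catalog: 10 breeds with 150 image paths each.
catalog = {f"breed_{b}": [f"breed_{b}/img_{i}.jpg" for i in range(150)]
           for b in range(10)}
low_diversity = build_variant(catalog, n_breeds=1)    # 120 images, 1 breed
high_diversity = build_variant(catalog, n_breeds=10)  # 12 images x 10 breeds
```

Holding the total constant is the point: any change in dominance can then be attributed to diversity rather than to the amount of training data.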

This setup matters because it manipulates visual diversity while holding the concept family constant. Without this control, one could argue that landmarks dominate because they are semantically special, culturally famous, or more common in captions. The dog experiment narrows the question: if the same kind of concept is trained with less visual variation, does it become more dominant?

The answer is yes.

The authors test 50 prompts pairing “dvddog” with other concepts across object co-occurrence, scene context, and style modifiers. For each prompt and model variant, they generate 10 images and compute the DvD Score. The representative examples show lower-diversity variants crossing the DvD threshold while higher-diversity variants and the baseline generate more balanced compositions. The supplementary aggregate results extend this pattern across all 50 prompts: DvD Scores decrease as training diversity increases.

This is the paper’s main causal bridge:

Low visual diversity during training makes a concept representation more rigid; when placed in a multi-concept prompt, that rigid concept can dominate flexible concepts.

For business use, the dangerous category is not simply “rare concept.” It is “concept with a narrow, repeated visual identity.” That includes famous landmarks, iconic characters, highly standardized product shapes, logos, mascot-like visuals, signature campaign styles, and perhaps custom fine-tuned brand assets trained on too little variation.

Fine-tuning is where this becomes uncomfortable. A company may fine-tune an image model on a product, mascot, packaging line, or influencer persona using a small set of consistent images. That is often the point: keep the identity stable. But this paper suggests a trade-off. If the concept becomes too rigid, it may compose poorly with other requirements. The model may preserve the brand asset by sacrificing the rest of the scene.

Stability is not always controllability. Sometimes it is just stubbornness wearing a premium font.

Early attention decides the composition before the image looks like anything

Once the paper establishes low visual diversity as the likely cause, it asks where the failure appears inside the diffusion process.

The answer is early.

The authors analyze cross-attention patterns in Stable Diffusion 1.4. Cross-attention is the mechanism through which the UNet’s intermediate image features attend to the prompt’s text tokens. In simplified terms, it helps decide which parts of the prompt influence the evolving image representation. If one token receives unusually concentrated attention at the wrong moment, the model may allocate semantic control unevenly.

The paper defines a Focus Score to measure attention concentration on a peak token relative to the rest of the prompt, with entropy normalization to allow comparisons across prompts. The authors compute Focus Scores across UNet layers during the first denoising step and compare DominanceBench prompts with balanced prompts.
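The exact Focus Score formula is not reproduced here, so the following is only an illustrative stand-in that follows the verbal description: measure how far the peak token sits above the rest, then discount by normalized entropy so prompts of different lengths are comparable. The function `focus_score` and its specific combination of terms are assumptions, not the paper’s definition.

```python
import math
from typing import Sequence


def focus_score(attn: Sequence[float]) -> float:
    """Illustrative concentration score for one layer's per-token attention.

    attn: non-negative attention mass per prompt token (normalized inside).
    Returns 0.0 for uniform attention and 1.0 when one token takes everything.
    This is a stand-in formula, not the one defined in the paper.
    """
    total = sum(attn)
    n = len(attn)
    if n < 2 or total <= 0:
        raise ValueError("need at least two tokens with positive attention")
    p = [a / total for a in attn]
    peak = max(p)
    rest_mean = (1.0 - peak) / (n - 1)           # mean mass of the non-peak tokens
    entropy = -sum(x * math.log(x) for x in p if x > 0)
    norm_entropy = entropy / math.log(n)         # scaled to [0, 1] by prompt length
    return (peak - rest_mean) * (1.0 - norm_entropy)


# Made-up attention over four tokens: concentrated versus mildly peaked.
print(focus_score([0.7, 0.1, 0.1, 0.1]))
print(focus_score([0.4, 0.2, 0.2, 0.2]))
```

Whatever the precise formula, the property that matters for the comparison is the ordering: concentrated attention should score higher than diffuse attention regardless of prompt length.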

The result is specific: DominanceBench prompts show significantly higher Focus Scores in layers 5–10, with the strongest differences in lower-resolution layers 8–10. Those layers are important because lower-resolution stages carry more semantic, global structure. If the dominant concept captures attention there, it is not merely decorating the image. It is helping decide what the image fundamentally becomes.

The authors then check whether the peak-attention token is actually the dominant concept. In 249 out of 300 DominanceBench prompts — 83% — the dominating concept’s token receives maximum attention in lower-resolution semantic layers.

This number is one of the paper’s most useful operational clues. DvD is not only visible after generation. Its signature appears at the first denoising step, before the output image is fully formed.

That opens the door to diagnostics: monitor early semantic attention, especially in lower-resolution layers, and flag prompts where a low-diversity concept monopolizes attention. It does not yet solve the problem. But it changes the workflow from “generate ten images and hope” to “detect likely collapse early.”

The dominated concept may receive attention and still lose

One of the paper’s better observations is that attention presence alone is not enough. A dominated token can have high attention at an early layer and still fail to appear.

The authors illustrate this with a prompt involving the Colosseum and a carry-all pouch. At the first denoising step, the token for “pouch” shows high attention in layer 7, a middle block associated with semantic content. Yet the generated output still contains the Colosseum without the pouch.

This looks paradoxical only if attention is treated as a static “token importance” scoreboard. The paper instead examines temporal dynamics: how attention changes over denoising steps.

For this analysis, the authors track the dominant token in layers 8–10 and the dominated token in layer 7, where it tends to peak. They use attention deviation rather than the entropy-normalized Focus Score because the goal is not to compare different prompts; it is to track the relative advantage of the same concepts over time within a prompt. The appendix explains this metric choice as an implementation detail with interpretive consequences: entropy can introduce noise from irrelevant tokens, while attention deviation isolates the competitive balance between the dominant and dominated concepts.

The temporal result is the mechanism in miniature. During the earliest timestep interval, 50–40, dominated concepts show strongly negative attention change. Dominating concepts start positive or near zero. In plain language: the dominated token may get a brief chance, but its influence decays early, while the dominant token holds or gains ground.
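A toy version of that bookkeeping makes the sign convention concrete. The sketch below assumes “attention deviation” means a token’s attention minus the mean attention on the other tokens, and it ignores the per-layer selection the authors use. Both function names and all numbers are made up for illustration.

```python
from typing import Dict, Sequence


def deviation(attn: Sequence[float], idx: int) -> float:
    """Attention on token idx minus the mean attention on all other tokens."""
    others = [a for i, a in enumerate(attn) if i != idx]
    return attn[idx] - sum(others) / len(others)


def deviation_change(attn_by_step: Dict[int, Sequence[float]],
                     idx: int, start: int, end: int) -> float:
    """Change in deviation for one token between two denoising timesteps."""
    return deviation(attn_by_step[end], idx) - deviation(attn_by_step[start], idx)


# Made-up attention over three tokens: [landmark, object, filler].
attn = {
    50: [0.40, 0.45, 0.15],  # the object briefly leads
    40: [0.55, 0.25, 0.20],  # the landmark takes over
}
dominant_delta = deviation_change(attn, idx=0, start=50, end=40)
dominated_delta = deviation_change(attn, idx=1, start=50, end=40)
print(dominant_delta, dominated_delta)
```

In this toy case the landmark’s change is positive and the object’s is negative, which is the pattern the article describes for the 50–40 interval.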

That timing matters because early denoising steps set global structure. Once the image has begun organizing itself around the castle, character, or artist style, later attention to the object may be too late. The model may still “know” the object is in the text, but the visual trajectory has already been claimed.

This is why prompt repair can feel inconsistent. Adding emphasis to the missing object may help if it changes early competition. It may fail if the dominant concept still seizes the semantic layout first.

Head ablation shows why simple pruning is not enough

The paper then asks whether DvD is localized in a small set of attention heads. This section is easy to misread, so the purpose should be kept clear: it is an ablation and comparison with memorization, not a second causal thesis.

The authors compare 300 DominanceBench prompts with 500 memorized prompts from prior work. Memorization here means prompts that reproduce near-identical images across random seeds, a related but different failure. Both memorization and DvD involve reduced visual variation, but they operate differently. Memorization is about reproducing specific training images. DvD is about one concept dominating another in composition.

The authors ablate individual attention heads at the first denoising timestep across layers 1–16 by scaling down their attention logits. They then classify outputs as mitigated, unchanged, or corrupted/other.
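To see what scaling logits does to a head, a toy example helps. The sketch below uses one query position per head and plain Python lists. A real intervention would hook into the UNet’s cross-attention modules, and the scale factor the authors use is not stated in this article, so `scale` and `ablate_head` are hypothetical.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over one list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]


def ablate_head(head_logits: List[List[float]],
                head: int,
                scale: float) -> List[List[float]]:
    """Return per-head attention weights with one head's logits multiplied by scale."""
    weights = []
    for h, logits in enumerate(head_logits):
        factor = scale if h == head else 1.0
        weights.append(softmax([x * factor for x in logits]))
    return weights


# Two toy heads attending over three prompt tokens.
logits = [[4.0, 1.0, 0.5],   # head 0 locks onto token 0
          [1.0, 1.2, 0.9]]   # head 1 is already diffuse
print(ablate_head(logits, head=0, scale=0.1)[0])
```

Shrinking a head’s logits flattens its attention toward uniform instead of switching the head off, so the head still contributes but no longer favors any single token.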

Single-head ablation mitigates 145 out of 300 DominanceBench prompts, or 48%. For memorization, it mitigates 392 out of 500 prompts, or 78%. That difference matters: memorization is more vulnerable to single-head intervention.

The layer pattern is also informative. For both phenomena, mitigation effects concentrate in layers 1–6, the downsampling blocks. But their peaks differ. Memorization reaches its highest mitigation rate at layer 6, while DvD peaks earlier at layer 3. The authors then run multi-head ablation to see whether mitigation remains strong when multiple heads are suppressed.

Here the contrast becomes clearer. For memorization, multi-head ablation maintains a high mitigated proportion, around 0.8, suggesting localized behavior in a few critical heads. DvD shows a lower mitigated proportion, around 0.6, and a higher unchanged proportion, around 0.2. The authors interpret this as evidence that DvD is distributed across multiple heads.

The supplementary ablation on non-mitigating heads reinforces the point. Pairwise ablation of non-mitigating layer-1 heads barely mitigates DvD (0.55%) while causing corrupted or incoherent outputs in 18.68% of cases. In other words, knocking out heads that were not driving the dominance does not solve it; it just makes the model worse. Very elegant. Very useless. A classic ablation lesson.

The business implication is not “prune bad heads.” It is almost the opposite. If dominance is distributed, production mitigation likely requires better training data diversity, training objectives, attention control, or model-level interventions. A small surgical fix may not exist.

The appendix is not decoration; it tells us what can be trusted

The supplementary material is unusually relevant because it clarifies which findings are main evidence, which are robustness checks, and which are exploratory extensions.

The training-data diversity analysis supports the paper’s category choices. The authors collect LAION training images associated with top frequent prompts for concept keywords, compute CLIP ViT-L/14 image embeddings, and measure intra-category cosine distances. Landmarks show the most compact visual cluster, with a median cosine distance of 0.3079. Characters follow at 0.4417, artists at 0.4660, and objects at 0.4672. Lower distance means less visual diversity. This supports the low-diversity versus high-diversity grouping used in DominanceBench.
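The diversity measurement can be approximated in a few lines once embeddings exist. The sketch below assumes the statistic is the median over all pairwise cosine distances within a category. The article does not say whether distances are pairwise or measured against a centroid, and the two-dimensional vectors stand in for real CLIP embeddings.

```python
import math
from itertools import combinations
from statistics import median
from typing import Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def median_pairwise_distance(embeddings: Sequence[Sequence[float]]) -> float:
    """Median cosine distance over all unordered pairs in one category."""
    return median(cosine_distance(a, b) for a, b in combinations(embeddings, 2))


# Toy 2-D stand-ins for image embeddings (real ones would come from CLIP).
tight_cluster = [[1.0, 0.0], [0.9, 0.1], [0.95, 0.05]]  # landmark-like
loose_cluster = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]    # object-like
print(median_pairwise_distance(tight_cluster))
print(median_pairwise_distance(loose_cluster))
```

The same function can be run on a team’s own fine-tuning set to check whether a custom concept sits closer to the landmark end or the object end of the range.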

The balanced-prompt appendix explains the control set. The authors use a non-memorized prompt benchmark from prior work and identify prompts with exactly two concepts, producing 300 balanced prompts. This comparison matters because DvD must be distinguished from ordinary two-concept difficulty.

The memorized-prompt appendix positions DvD between balanced prompts and memorization using a text-conditional noise prediction metric from prior memorization work. Memorized prompts show the highest L2 norm, balanced prompts the lowest, and DominanceBench sits in the middle. The interpretation is subtle: dominant concepts may reproduce visually memorized patterns, but DvD is not identical to full prompt-level memorization.

The metric-design appendix explains why the temporal analysis does not use entropy normalization. This is not a side note; it protects the interpretation of Figure 8. If irrelevant tokens change attention, entropy can create the illusion that the dominant concept’s advantage changed even when it did not. The chosen attention-deviation metric is meant to track concept competition directly.

Finally, the detection appendix is an exploratory extension. It uses Focus Score in lower-resolution layers at the first denoising step to flag likely DvD. The best reported configuration uses threshold 0.010 with layers 9 and 10, detecting 70.67% of DominanceBench prompts versus 33.67% of balanced prompts, a 37.00 percentage-point discrimination gap. This is useful, but not clean enough to be sold as a finished detector. A one-third false-positive rate on balanced prompts is too high for unsupervised production rejection without additional checks.
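The arithmetic behind that caution is worth spelling out. The counts below (212 and 101 flagged out of 300) are inferred from the reported percentages, not quoted from the paper, and the prevalence values are hypothetical deployment scenarios.

```python
def rate(flagged: int, total: int) -> float:
    """Percentage of prompts flagged by the detector."""
    return 100.0 * flagged / total


# Counts inferred from 70.67% and 33.67% on 300-prompt sets (an assumption).
dvd_flagged, balanced_flagged, n_prompts = 212, 101, 300
detection_rate = rate(dvd_flagged, n_prompts)
false_positive_rate = rate(balanced_flagged, n_prompts)
gap = detection_rate - false_positive_rate


def precision(detect_pct: float, false_pos_pct: float, prevalence: float) -> float:
    """Share of flagged prompts that are truly DvD, given the share of DvD prompts in traffic."""
    true_pos = detect_pct / 100.0 * prevalence
    false_pos = false_pos_pct / 100.0 * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


print(round(detection_rate, 2), round(false_positive_rate, 2), round(gap, 2))
print(round(precision(detection_rate, false_positive_rate, 0.5), 3))  # 50/50 mix, as in the study
print(round(precision(detection_rate, false_positive_rate, 0.1), 3))  # DvD-prone prompts are rare
```

With an even mix, roughly two thirds of flags are correct. If only one prompt in ten is genuinely DvD-prone, fewer than one flag in five is, which is why the flag works better as a trigger for extra checks than as a rejection rule.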

The authors validate detected dominant tokens by replacing them with generic category terms: “Van Gogh” becomes “artist,” “Colosseum” becomes “landmark,” and “Spider-Man” becomes “character.” DvD Scores fall sharply. In Stable Diffusion 1.4, the median DvD Score drops from 64 to 20. In Stable Diffusion 2.1, it drops from 40 to about 17.

This confirms the detector is often identifying the token responsible for dominance. It does not mean prompt replacement is a good product solution. If a user asks for a Van Gogh coaster, giving them an “artist coaster” is not a fix. It is a resignation letter written in prompt syntax.

What this changes for AI image workflows

The paper directly shows a mechanism in Stable Diffusion-style models. The business interpretation is an inference, but a useful one: image-generation QA should test whether all required concepts survive composition, especially when one concept has a strong visual identity.

For teams building or using generative image systems, the practical checklist changes.

Workflow area	What the paper directly shows	Cognaptus inference for business use	Boundary
Prompt testing	DvD appears when low-diversity concepts combine with high-diversity objects	Test prompts with rigid icons, characters, landmarks, styles, and brand assets against everyday product nouns	Results are based mainly on SD 1.4/2.1 and curated DvD prompts
Fine-tuning	Lower training diversity increases dominance in a controlled DreamBooth setup	Custom brand or product tokens should be trained with enough controlled variation to preserve composability	More diversity may weaken identity consistency; the trade-off needs measurement
QA metrics	VQA-based DvD Scores separate collapsed from balanced generations	Automated concept-presence checks should complement aesthetic scoring	VQA may miss brand-specific, legal, or design-quality requirements
Runtime diagnostics	Early lower-resolution attention can flag DvD risk	Monitor early attention concentration before spending compute on batches of bad candidates	Detection remains noisy and not yet a mitigation method
Model intervention	DvD is distributed across heads	Do not expect simple attention-head pruning to solve composition collapse	Better fixes may require training objectives, data curation, or architecture-level methods

The most immediate use is not model surgery. It is better evaluation.

A creative team may not care whether the failure came from layer 9 or layer 10. They care that the model produced a beautiful image that violates the brief. But the mechanism tells engineers where to build cheaper early warnings: concept-presence evaluation, prompt-risk classification, attention concentration monitoring, and targeted test suites around high-risk concept pairs.
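A concept-presence gate is the cheapest of those warnings to prototype. The sketch below assumes some VQA model is available behind a yes/no callable. The `ask_vqa` interface, the question template, and the stubbed answers are all hypothetical, and a production check would need several questions per concept, as the paper’s scoring does.

```python
from typing import Callable, Dict, Iterable


def concept_report(image_id: str,
                   required: Iterable[str],
                   ask_vqa: Callable[[str, str], bool]) -> Dict[str, bool]:
    """Ask one yes/no presence question per required concept."""
    return {concept: ask_vqa(image_id, f"Is there a {concept} in the image?")
            for concept in required}


def passes_brief(report: Dict[str, bool]) -> bool:
    """An image is usable only if every required concept is present."""
    return all(report.values())


def fake_vqa(image_id: str, question: str) -> bool:
    """Stub standing in for a real VQA model; answers come from a fixed table."""
    visible = {"img_001": {"castle"}, "img_002": {"castle", "coaster"}}
    return any(concept in question for concept in visible[image_id])


for image_id in ("img_001", "img_002"):
    report = concept_report(image_id, ["castle", "coaster"], fake_vqa)
    print(image_id, report, passes_brief(report))
```

The gate is deliberately separate from any aesthetic score: an image can rank first on visual quality and still fail here.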

For brand and product workflows, the danger is especially concrete. A rigid brand character can dominate the product. A famous visual style can erase the actual object. A landmark can turn a souvenir mockup into a travel poster. A logo-like icon can consume layout space meant for packaging detail. In these cases, the model is not failing dramatically enough to be rejected by a casual viewer. It is failing specifically enough to be expensive.

The boundary: this is diagnosis before cure

The paper is strongest when explaining DvD as a mechanism. It is less complete as a mitigation paper, and the authors are clear about that.

The analysis focuses on cross-attention. Feedforward layers, residual pathways, and richer inter-head relationships may also contribute. The benchmark is built around Stable Diffusion 1.4, with additional evaluation on Stable Diffusion 2.1. That is valuable, but it does not automatically generalize to every modern diffusion architecture or proprietary image model. The scoring relies on VQA judgments, which are scalable but not the same as expert human evaluation for commercial design.

The detection method is promising but not production-ready. Its best configuration catches many DvD prompts, but balanced prompts are still flagged too often. The prompt-replacement validation proves the detected token is meaningful, not that replacing specific user intent with a generic category is acceptable.

The paper also leaves open the trade-off between identity preservation and compositional flexibility. Low visual diversity can create dominance, but commercial customization often wants strong identity. The right answer is not “add random diversity.” It is to design training sets and evaluations that preserve the identity while varying pose, context, object interaction, scale, medium, lighting, and composition.

That is where the next practical research question sits: not whether the model remembers a concept, but whether the concept can behave politely in a sentence with others.

The quiet collapse is a governance problem, not just a generation problem

The most useful idea in this paper is that visual concepts have different negotiation power inside a model.

Some tokens are flexible. Some are rigid. Some are loud. A diffusion model does not merely “follow the prompt”; it resolves competition among learned priors. When one prior is too rigid, composition can collapse while the image remains visually attractive.

That shifts how we should evaluate generative systems. The question is not only “Can the model make a high-quality image?” It is also “Can the model preserve the required relationship among concepts when one concept has a much stronger visual prior than the others?”

For business users, this is where prompt engineering ends and system design begins. A serious image-generation pipeline should know which concept pairs are risky, test them systematically, score concept presence, and distinguish a beautiful miss from a usable asset. Otherwise, teams will keep approving images that look impressive right up to the moment they fail the brief.

One token ruling them all is cute in a demo. In production, it is a QA bill waiting to arrive.

Cognaptus: Automate the Present, Incubate the Future.

¹ Hayeon Jeong and Jong-Seok Lee, “Dominating vs. Dominated: Generative Collapse in Diffusion Models,” arXiv:2512.20666, 2025. https://arxiv.org/html/2512.20666
