Text-to-Image

Safe on Paper, Lost in the Prompt

TL;DR for operators: A safety-aligned image model can keep its FID and CLIPScore nearly unchanged while becoming materially worse at following ordinary instructions. It may still generate a plausible bird, vase, or product scene, but quietly miss the requested color, quantity, relationship, or attribute. The paper identifies a mechanism behind this failure. When safety tuning modifies the text encoder, benign prompt embeddings can become compressed and their semantic neighborhoods can be rearranged. Distinctions that the original model represented clearly begin to blur. The authors call this semantic collapse.[1] ...
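As a rough illustration of the two symptoms named above (compressed embeddings and rearranged neighborhoods), the sketch below compares two sets of prompt embeddings with NumPy. It is not the paper's metric or code: the function names `mean_pairwise_cosine` and `knn_overlap`, and the toy arrays `base` and `tuned`, are hypothetical stand-ins, and in practice the rows would come from the original and safety-tuned text encoders run on the same benign prompts.

```python
import numpy as np


def _unit_rows(emb):
    """Return the embeddings with every row scaled to unit length."""
    emb = np.asarray(emb, dtype=float)
    return emb / np.linalg.norm(emb, axis=1, keepdims=True)


def mean_pairwise_cosine(emb):
    """Average cosine similarity over all distinct prompt pairs.

    A higher value means the prompts sit closer together, which is one
    simple (assumed) proxy for "compressed" embeddings.
    """
    u = _unit_rows(emb)
    sims = u @ u.T
    off_diagonal = sims[~np.eye(len(u), dtype=bool)]
    return float(off_diagonal.mean())


def knn_overlap(emb_a, emb_b, k=1):
    """Mean Jaccard overlap of each prompt's k nearest neighbours.

    1.0 means both encoders agree on every neighbourhood; lower values
    mean the neighbourhoods were rearranged.
    """
    def nearest(emb):
        u = _unit_rows(emb)
        sims = u @ u.T
        np.fill_diagonal(sims, -np.inf)  # a prompt is not its own neighbour
        return np.argsort(-sims, axis=1)[:, :k]

    scores = []
    for a, b in zip(nearest(emb_a), nearest(emb_b)):
        sa, sb = set(a.tolist()), set(b.tolist())
        scores.append(len(sa & sb) / len(sa | sb))
    return float(np.mean(scores))


# Toy data (hypothetical): four well-separated prompt embeddings, and a
# "tuned" copy in which every prompt is pulled toward a shared direction.
base = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])
tuned = 0.2 * base + np.array([1.0, 1.0])

print("mean pairwise cosine, base :", mean_pairwise_cosine(base))
print("mean pairwise cosine, tuned:", mean_pairwise_cosine(tuned))
print("1-NN overlap base vs tuned :", knn_overlap(base, tuned, k=1))
```

Both numbers are computed from text embeddings alone, so a check along these lines could be run without generating any images. That matters in the scenario described above, where image-level scores such as FID and CLIPScore barely move.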

Pretty Text, Ugly Logic: When Image Models Learn to Write but Not to Reason

A slide looks finished. The headline is sharp, the equations are aligned, the answer box is confident, and the design has the mild corporate glow of something that has already been approved by three people who did not read it. That is exactly the problem. For years, text-to-image models failed in a wonderfully obvious way: they could not spell. A poster would say “Qaurterly Reveneu,” the mockup button would contain mystical glyphs, and everyone understood the output was decorative, not operational. Recent models have changed that. They can now place readable text inside images, produce document-like pages, and generate slide-like visual artifacts. The failure mode has become less funny and more expensive: the text may be readable, but the reasoning may be wrong. ...

When Images Learn to Think in Code: The Rise of Code-as-CoT for Structured Generation

Poster. That is where the problem becomes embarrassingly visible. Ask an image model to make “a beautiful poster for a finance seminar,” and it may produce something visually polished enough to survive a casual scroll. Ask it to place five labeled cards, keep the headline readable, align the icons, preserve the chart, and spell the sponsor name correctly, and the glamour fades. The model may understand the request. It may even describe the right plan. Then it still puts the label where no label should live, mangles the typography, and invents a layout that looks as if the design brief was translated through fog. ...

Mirror, Mirror on the Latent: How Reflective Flow Sampling Sharpens Text‑to‑Image Models

Image generation teams have a familiar problem: the model is good enough to impress people in a demo, then slightly disobedient enough to annoy them in production. The prompt asks for a red ceramic teapot on a wooden table. The output gives a beautiful teapot, possibly red, possibly ceramic, possibly levitating in a tasteful manner. Add text, spatial relations, or editing instructions, and the gap between “pretty” and “correct” becomes a recurring invoice. ...

When One Token Rules Them All: Diffusion Models and the Quiet Collapse of Composition

Product teams often discover image-generation failure in the most boring possible way: the image looks good. The lighting is fine. The texture is convincing. The output is not deformed, not surreal in the bad way, and not obviously broken. Then someone notices the actual requested product is missing. A prompt asks for a famous castle on a coaster. The model gives the castle. It may give a postcard, a painting, a dramatic tourist shot, perhaps a suspiciously elegant architectural fantasy. The coaster quietly leaves the room. No farewell email. ...

When Models Teach Themselves: Inside the Rise of SuperIntelliAgent

Image generators fail in very ordinary ways. A prompt asks for a green banana and a blue vase. The model gives you something banana-adjacent, vase-adjacent, and chromatically negotiable. A designer asks for a bowl containing a pizza. The model places the pizza beside the bowl, halfway inside the bowl, or in a bowl-like universe where geometry has apparently resigned. A product team then does the usual dance: collect bad outputs, ask users what they preferred, curate examples, fine-tune later, and call the whole thing “continuous improvement” because the spreadsheet had a date column. ...

Mirror, Mirror in the Model: How MLLMs Learn from Their Own Mistakes

TL;DR for operators: Image generators fail in a familiar way: the output looks polished, but the prompt was quietly ignored. A product photo misses the specified texture. A campaign image reverses a spatial relation. A science illustration draws the visually plausible version, not the physically correct one. Everyone then discovers, with appropriate corporate surprise, that “high quality” and “correct” are not synonyms. ...

FLUX.1 [dev]

A 12-billion-parameter rectified flow transformer capable of generating images from text descriptions.

Stable Diffusion 3 Medium

A text-to-image diffusion model from Stability AI featuring improved prompt alignment, style diversity, and compositional reasoning.

Stable Diffusion v1.4

A high-quality text-to-image latent diffusion model trained on LAION-2B, enabling fast and flexible image generation.