When Images Learn to Think in Code: The Rise of Code-as-CoT for Structured Generation

Poster.

That is where the problem becomes embarrassingly visible.

Ask an image model to make “a beautiful poster for a finance seminar,” and it may produce something visually polished enough to survive a casual scroll. Ask it to place five labeled cards, keep the headline readable, align the icons, preserve the chart, and spell the sponsor name correctly, and the glamour fades. The model may understand the request. It may even describe the right plan. Then it still puts the label where no label should live, mangles the typography, and invents a layout that looks as if the design brief was translated through fog.

This is not merely a “better prompt” problem. It is a representation problem.

The paper CoCo: Code as CoT for Text-to-Image Preview and Rare Concept Generation proposes a blunt but important answer: when the image has structure, make the model think in code before it paints.¹ Not code as a developer gimmick. Code as an executable intermediate representation: write the layout, run it in a sandbox, render a draft, and then refine the draft into a final image.

That changes the job. The model is no longer asked to jump from language directly into pixels. It is asked to produce a checkable scaffold first. For enterprise image generation, that distinction matters more than another glossy demo. Pretty is cheap. Placement is expensive.

Natural-language CoT is too soft for layout work

Chain-of-Thought has become one of those phrases that AI discourse uses until it becomes almost decorative. In language tasks, a written reasoning trace can help a model decompose a problem. In image generation, the temptation is to copy the same pattern: first describe the scene, then generate the image.

The CoCo paper argues that this is not enough for structured visuals.

A sentence can say, “place the title at the top, add a chart on the left, put a caption below, and include three labeled icons on the right.” That sounds specific. It is not specific in the way a renderer needs. It does not define coordinates, canvas size, font placement, object boundaries, line geometry, spatial hierarchy, or how much space a label should occupy before colliding with another element. Natural language gives intention. Structured visuals need constraints.

Code gives a different kind of plan. A Matplotlib script, for example, can specify canvas dimensions, rectangles, arrows, axes, labels, text strings, coordinates, colors, and relative placement. That does not make the result beautiful by itself. Programmatic drafts are often plain. But they are explicit. They can be executed. They can fail. They can be inspected.
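To make "explicit and inspectable" concrete, here is a minimal sketch of what a sentence-level plan cannot express: once every element has a rectangle, a collision becomes a computable fact. The `Box` type, the `find_collisions` helper, and the pixel values are illustrative assumptions, not anything taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle in canvas pixels (hypothetical helper type)."""
    name: str
    x: float       # left edge
    y: float       # top edge
    width: float
    height: float

    def overlaps(self, other: "Box") -> bool:
        # Two rectangles overlap only if they intersect on both axes.
        return (
            self.x < other.x + other.width
            and other.x < self.x + self.width
            and self.y < other.y + other.height
            and other.y < self.y + self.height
        )


def find_collisions(boxes: list[Box]) -> list[tuple[str, str]]:
    """Return every pair of element names whose rectangles overlap."""
    collisions = []
    for i, first in enumerate(boxes):
        for second in boxes[i + 1:]:
            if first.overlaps(second):
                collisions.append((first.name, second.name))
    return collisions


# Illustrative layout: the caption starts 20 px above the chart's bottom edge.
layout = [
    Box("title", x=40, y=20, width=720, height=80),
    Box("chart", x=40, y=120, width=400, height=300),
    Box("caption", x=40, y=400, width=400, height=40),
    Box("icons", x=480, y=120, width=280, height=300),
]
print(find_collisions(layout))  # [('chart', 'caption')]
```

The sentence "put a caption below the chart" would pass any reading. The coordinates do not, and that is the difference between intention and constraint.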

That is the central mechanism in CoCo:

1. Generate executable code from the prompt.
2. Run the code in a sandbox to create a deterministic draft image.
3. Feed the draft back into the unified multimodal model for final visual refinement.

The paper’s quiet insight is that visual reasoning should sometimes become renderable reasoning. If the intermediate thought cannot be rendered, it may not be useful enough for layout-sensitive generation.

CoCo turns image generation into a three-stage production pipeline

CoCo builds on Bagel, a unified multimodal model that supports both visual understanding and visual generation. The base architecture matters because CoCo needs the model to move across text, code, draft images, and final images within one workflow.

The pipeline is simple enough to sketch:

Prompt
  ↓
Executable layout code
  ↓
Sandbox-rendered draft image
  ↓
Draft-guided refinement
  ↓
Final image

This is not the usual “generate an image, then fix it if the user complains” workflow. The draft is not a post-hoc repair target. It is a planned intermediate object. The model first externalizes the structure into code, the runtime renders that structure into an image, and only then does the model handle the part that code is bad at: visual polish, realism, texture, and style.
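The control flow described above can be sketched as a small orchestration function. This is a sketch under stated assumptions: `run_pipeline` and the three stage callables are hypothetical names, and the stand-in stages only pass bytes around so the example runs without a model or a renderer. It is not CoCo's implementation.

```python
from typing import Callable


def run_pipeline(
    prompt: str,
    generate_code: Callable[[str], str],
    render_draft: Callable[[str], bytes],
    refine: Callable[[str, bytes], bytes],
) -> dict:
    """Chain the three stages and keep every intermediate artifact."""
    code = generate_code(prompt)       # stage 1: prompt -> layout code
    draft = render_draft(code)         # stage 2: code -> draft image bytes
    final = refine(prompt, draft)      # stage 3: prompt + draft -> final image
    return {"prompt": prompt, "code": code, "draft": draft, "final": final}


# Stand-in stages (assumptions): real ones would call a model and a sandbox.
def fake_generate(prompt: str) -> str:
    return f"# layout for: {prompt}"


def fake_render(code: str) -> bytes:
    return code.encode("utf-8")


def fake_refine(prompt: str, draft: bytes) -> bytes:
    return b"refined:" + draft


result = run_pipeline("finance seminar poster", fake_generate, fake_render, fake_refine)
print(sorted(result))  # ['code', 'draft', 'final', 'prompt']
```

Returning the code and the draft alongside the final image is the design choice that matters here: the intermediate artifacts stay available for inspection instead of disappearing inside the call.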

That division of labor is important.

Stage	What it is good at	What it avoids
Code generation	Layout, coordinates, text placement, charts, diagrams, object relations	Vague natural-language planning
Sandbox rendering	Deterministic preview, executable verification, visible structure	Hidden reasoning that cannot be inspected
Draft-guided refinement	Visual quality, style, realism, semantic smoothing	Rebuilding the scene from scratch and losing structure

The business translation is straightforward. Many companies do not actually need infinite artistic freedom. They need reliable structured generation: report charts, training diagrams, product infographics, educational visuals, presentation graphics, compliance-friendly layouts, and text-heavy marketing assets. In those cases, the model’s job is not to dream. It is to obey the layout. A shocking concept, apparently.

CoCo-10K teaches two skills: executable planning and selective refinement

The paper does more than propose a pipeline. It also builds the training data that pipeline needs.

The authors construct CoCo-10K, a curated dataset containing two main forms of supervision:

Training format	What it teaches
Text–Code pairs	How to turn a prompt into executable visual layout code
Text–Draft Image–Final Image triplets	How to refine a rendered draft while preserving its intended structure

This distinction is easy to miss, but it is the heart of the training story. A model must learn to produce code that runs. It must also learn not to ignore the draft once it sees it.
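One way to picture the two supervision formats is as two record types, each tied to the skill it trains. The class names, field names, and file paths below are illustrative assumptions. The actual CoCo-10K schema is not described in this article.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class TextCodePair:
    """Prompt plus the layout code that should be generated for it."""
    prompt: str
    code: str


@dataclass
class RefinementTriplet:
    """Prompt, rendered draft, and the refined image that keeps its structure."""
    prompt: str
    draft_image_path: str
    final_image_path: str


Example = Union[TextCodePair, RefinementTriplet]


def skill_taught(example: Example) -> str:
    """Name the behavior each record type supervises."""
    if isinstance(example, TextCodePair):
        return "executable planning"
    return "draft-preserving refinement"


samples = [
    TextCodePair("bar chart of quarterly revenue", "import matplotlib.pyplot as plt"),
    RefinementTriplet("bar chart of quarterly revenue", "draft_0001.png", "final_0001.png"),
]
print([skill_taught(s) for s in samples])
```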

In pilot experiments, the authors report two weaknesses in an off-the-shelf Bagel model. First, it can generate code, but the code often fails to execute. Second, even when given a draft, it may ignore the draft and create a new image. That second failure is very familiar in business workflows: the model appears to accept a reference, then quietly treats it as a mood board rather than an instruction.

CoCo-10K addresses this with a synthetic data pipeline. The dataset includes general editing, scientific diagrams, and complex text cases. The synthesis side uses prompts for charts, posters, infographics, annotated diagrams, and text-heavy visuals. A strong model generates layout code, the code is executed to create an A-image draft, and an image-editing model refines it into a B-image final output. The resulting paired data mirrors the intended inference process.

The practical lesson is not “synthetic data solves everything.” It is more precise: when a workflow depends on an intermediate representation, the model must be trained on that representation and on the act of preserving it. Otherwise the intermediate step becomes theater.

The main results show strength where structure is the task

The headline result is on StructT2IBench, a benchmark for structured image synthesis covering charts, graphs, mathematical figures, tables, puzzles, and science diagrams. CoCo reaches 73.52% overall accuracy, compared with 49.58% for GPT-Image, the strongest reported baseline, and 4.69% for Bagel.

That is not a small improvement. It is a change in failure mode.

Benchmark area	CoCo result	Important interpretation
StructT2IBench overall	73.52%	Strong overall gain on structured visual synthesis
Chart	79.44%	Code-like layout planning is especially useful
Graph	62.58%	Spatial and symbolic structure benefit from explicit representation
Math	69.12%	Formula-like and diagrammatic visuals fit the method well
Table	79.15%	Strong, but not the top score in the table; GPT-Image reports 83.31%
Puzzle	49.10%	Not universally dominant; Nano Banana reports higher performance
Science	58.81%	Strong but not top in the reported comparison

This matters because the paper’s own table is more nuanced than the abstract-level story. CoCo is not best at every subcategory. It is best overall, and it dominates several structure-heavy categories, but other systems remain stronger in some subareas. That is not a weakness of the paper. It is the useful part of the evidence.

For business readers, the implication is not “replace every image model with Code-as-CoT.” The implication is: when your visual task has a schema, use a schema-aware generation process. Charts, visual reports, annotated diagrams, and text-heavy posters are not the same product category as cinematic concept art.

Text rendering results are strong, but the LongText result needs careful reading

The paper also evaluates text rendering on OneIG-Bench and LongText-Bench.

On OneIG-Bench, CoCo reports 0.895 in English, 0.811 in Chinese, and 0.853 overall, outperforming the listed baselines. This is a strong result, and it supports the paper’s claim that code-based intermediate representations help with text placement and typographic precision.

On LongText-Bench, CoCo reports 0.755 in English, 0.753 in Chinese, and 0.754 overall. This is much stronger than Bagel’s 0.342 overall. However, GPT-4o reports 0.788 overall, driven by a very high English score of 0.956. CoCo is stronger on Chinese in that comparison, but not the overall winner.
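The overall figures quoted above are consistent with a simple unweighted mean of the English and Chinese scores. That aggregation rule is inferred here from the reported numbers rather than stated in this article, so the sketch below is a sanity check under that assumption. Under the same assumption, GPT-4o's overall and English scores imply a Chinese score of about 0.620, which is where CoCo's 0.753 pulls ahead.

```python
def overall(english: float, chinese: float) -> float:
    """Unweighted mean of the two language scores (assumed aggregation rule)."""
    return (english + chinese) / 2


def implied_other(overall_score: float, known: float) -> float:
    """Solve the mean for the missing language score, under the same assumption."""
    return 2 * overall_score - known


print(f"{overall(0.755, 0.753):.3f}")        # CoCo overall: 0.754
print(f"{implied_other(0.788, 0.956):.3f}")  # GPT-4o implied Chinese: 0.620
```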

That distinction matters. The useful reading is not that CoCo beats every strong closed model on every text-heavy task. It is that CoCo substantially improves on its open, unified Bagel base, performs very strongly on multilingual text rendering, and shows that executable planning can close a major reliability gap.

A serious enterprise reader should prefer that interpretation. It is less glamorous, and therefore more useful.

The ablations explain why code alone is not the product

The ablation section is one of the paper’s most important parts because it prevents a lazy reading.

A lazy reading says: “The model writes code, therefore code is the magic.” The paper’s ablations say something more operational: a small amount of code supervision is necessary, but the dominant signal should teach draft-to-final refinement.

In the training mixture experiment, the authors vary the proportion of Text–Code supervision, denoted as $r_c$. The best reported LongText-Bench results occur at $r_c = 0.05$, with 0.755 English and 0.753 Chinese. Higher code-supervision proportions perform worse:

Method	$r_c$	English	Chinese
Bagel	—	0.373	0.310
CoCo	0.20	0.724	0.667
CoCo	0.10	0.733	0.671
CoCo	0.05	0.755	0.753

The interpretation is subtle but important. Code supervision teaches the model to produce executable scaffolds. But if too much training emphasis goes into code generation, the model may underlearn the refinement behavior that turns crude programmatic renderings into useful images. In a production workflow, the draft is not the deliverable. It is the control surface.
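To make $r_c$ concrete, here is one way a training mixture with a fixed Text–Code proportion could be assembled. The sampling scheme, the `build_mixture` name, and the placeholder records are assumptions for illustration. This article does not describe how the authors actually draw their batches.

```python
import random


def build_mixture(code_pairs, triplets, r_c, total, seed=0):
    """Draw `total` examples, a fraction r_c of them from Text-Code pairs."""
    rng = random.Random(seed)
    n_code = round(r_c * total)      # e.g. 50 of 1000 when r_c = 0.05
    n_refine = total - n_code        # the rest are draft-to-final triplets
    mixture = rng.choices(code_pairs, k=n_code) + rng.choices(triplets, k=n_refine)
    rng.shuffle(mixture)
    return mixture


# Placeholder records: (kind, index) tuples standing in for real examples.
code_pairs = [("code", i) for i in range(100)]
triplets = [("refine", i) for i in range(100)]

mixture = build_mixture(code_pairs, triplets, r_c=0.05, total=1000)
n_code = sum(1 for kind, _ in mixture if kind == "code")
print(n_code, len(mixture) - n_code)  # 50 950
```

At $r_c = 0.05$, nineteen out of every twenty examples teach refinement, which matches the reading that code supervision is necessary but should not dominate.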

The code-executability diagnostic reinforces the point. On LongText-Bench, the off-the-shelf Bagel model compiles only 29 out of 320 generated programs, or 9.06%. CoCo reaches 320 out of 320, or 100%. This test is an ablation-style diagnostic rather than the main visual quality result. Its purpose is to show that Text–Code supervision is not optional. Without executable code, the entire preview mechanism collapses.
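A diagnostic of this kind is simple to reproduce in spirit. The sketch below counts how many generated programs pass Python's built-in `compile()`, which is a syntax-level check only. Whether the paper's figure means "parses" or "runs to completion in the sandbox" is not specified here, so treat this as a weaker stand-in. The sample programs are made up.

```python
def compiles(source: str) -> bool:
    """True if the source parses and compiles to bytecode (it is not executed)."""
    try:
        compile(source, "<generated>", "exec")
        return True
    except (SyntaxError, ValueError):
        return False


def executable_rate(programs):
    """Return (passing, total, percent) for a batch of generated programs."""
    passing = sum(compiles(p) for p in programs)
    return passing, len(programs), 100 * passing / len(programs)


# Made-up samples: one valid script, one with an unclosed parenthesis.
programs = [
    "import math\nprint(math.pi)\n",
    "fig = plt.figure(figsize=(8, 8)\n",
]
print(executable_rate(programs))  # (1, 2, 50.0)

# The rate reported for off-the-shelf Bagel, 29 of 320:
print(f"{100 * 29 / 320:.2f}%")
```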

So the paper’s real claim is not “code makes images better.” It is: code makes a structured draft possible, and draft-guided training teaches the model to preserve and improve that draft.

That is a much better product principle.

The adaptive canvas result hints at real layout reasoning

The discussion section reports a small but interesting behavior. CoCo is trained with data constructed at a fixed resolution of 1024, yet during inference the generated code can adapt canvas shapes to prompt semantics. Poster-like prompts may lead to wider layouts such as 16:9, while charts and diagrams tend to produce square or near-square canvases.

This is not the main evidence. It is closer to an exploratory generalization observation. Still, it is worth noting because it suggests that the model is not merely memorizing a fixed visual template. It appears to treat layout parameters as part of the generated program.

For enterprise use, this is where code-as-reasoning becomes especially attractive. Canvas size, aspect ratio, margins, alignment, and object grouping are not decoration. They are business constraints. A campaign banner, a dashboard card, a lecture slide, and a printed flyer do not share the same geometry. If a generation system can represent these choices as explicit parameters, the workflow becomes easier to audit, modify, and automate.
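Because the canvas choice lives in the generated program, it can be read back without running anything. A minimal sketch, assuming the generated code sets its canvas through a literal Matplotlib-style `figsize=(width, height)` keyword. The `find_figsize` helper and the sample script are hypothetical.

```python
import ast


def find_figsize(source: str):
    """Return a literal figsize=(w, h) found in the code, or None."""
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        for keyword in node.keywords:
            if keyword.arg != "figsize":
                continue
            try:
                # literal_eval only accepts plain literals, never arbitrary code.
                width, height = ast.literal_eval(keyword.value)
            except (ValueError, TypeError, SyntaxError):
                continue  # computed or malformed value: skip it
            return float(width), float(height)
    return None


# Hypothetical generated script; it is parsed, never imported or executed.
generated = (
    "import matplotlib.pyplot as plt\n"
    "fig = plt.figure(figsize=(16, 9))\n"
    "fig.savefig('draft.png')\n"
)
width, height = find_figsize(generated)
print(width, height, round(width / height, 2))  # 16.0 9.0 1.78
```

A check like this could gate the pipeline before rendering, for example by rejecting a banner request that came back with a square canvas.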

The word “audit” should be used carefully here. CoCo does not make the final image fully auditable in the way a deterministic charting pipeline is auditable. The refinement model can still alter details. But the draft gives teams a visible and executable checkpoint before the final image step. That is already better than asking a diffusion model to please behave nicely.

What this means for enterprise visual AI

The most useful business lesson from CoCo is architectural. Structured image generation should not always be treated as direct text-to-image generation. It may be better designed as a specification-to-preview-to-refinement pipeline.

A business implementation would not need to copy CoCo exactly. The intermediate representation could be Python plotting code, SVG, HTML/CSS, a design-system JSON schema, slide layout XML, or a domain-specific chart grammar. The key is that the intermediate layer should be executable or renderable.
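As an illustration of a non-Matplotlib intermediate layer, the sketch below renders a small JSON layout spec to SVG using only the standard library. The spec format (`width`, `height`, `elements`, and the per-element keys) is invented for this example and is not a standard or anything from the paper.

```python
import json
from xml.sax.saxutils import escape


def render_svg(spec: dict) -> str:
    """Turn a tiny layout spec into SVG markup (rect and text elements only)."""
    parts = [
        f'<svg xmlns="http://www.w3.org/2000/svg" '
        f'width="{spec["width"]}" height="{spec["height"]}">'
    ]
    for item in spec["elements"]:
        if item["type"] == "rect":
            parts.append(
                f'<rect x="{item["x"]}" y="{item["y"]}" '
                f'width="{item["w"]}" height="{item["h"]}" fill="{item["fill"]}"/>'
            )
        elif item["type"] == "text":
            # Escape the text so characters like & stay valid XML.
            parts.append(
                f'<text x="{item["x"]}" y="{item["y"]}" '
                f'font-size="{item["size"]}">{escape(item["text"])}</text>'
            )
        else:
            raise ValueError(f"unknown element type: {item['type']}")
    parts.append("</svg>")
    return "\n".join(parts)


# Hypothetical spec, written as JSON because that is what a model would emit.
spec = json.loads("""
{
  "width": 800, "height": 450,
  "elements": [
    {"type": "rect", "x": 0, "y": 0, "w": 800, "h": 90, "fill": "#103060"},
    {"type": "text", "x": 40, "y": 58, "size": 32, "text": "Q3 Revenue & Outlook"}
  ]
}
""")
print(render_svg(spec))
```

A declarative spec like this trades expressiveness for safety: nothing is executed, so the rendering step needs validation rather than a sandbox.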

Enterprise use case	Why Code-as-CoT helps	What remains uncertain
Automated report graphics	Charts, labels, and layout can be specified before styling	Needs integration with verified data pipelines
Marketing posters	Text placement and brand structure can be previewed	Final typography and brand compliance still need review
Scientific or educational diagrams	Geometry and labels benefit from explicit coordinates	Domain correctness requires subject-matter validation
Dashboard mockups	Layout scaffolds can map to frontend components	Production UI code may require a stricter schema than Matplotlib
Presentation visuals	Drafts can reduce revision loops	Slide-level consistency across a deck is not directly tested

The ROI pathway is therefore not “AI makes better images.” That sentence should be retired, preferably humanely.

The ROI pathway is narrower and stronger: fewer failed generations, fewer manual layout corrections, better preservation of text and structure, and more controllable intermediate artifacts. For teams producing recurring structured visuals, that can matter. For teams making purely expressive art assets, the case is weaker.

The boundary: CoCo is not a universal image-generation victory lap

The paper is strongest when read inside its intended domain: structured, text-intensive, layout-sensitive generation.

Several boundaries should shape practical interpretation.

First, the benchmarks emphasize structured visuals and text-heavy images. That is exactly where CoCo should shine. The paper does not prove that executable code is superior for open-ended creative image generation, cinematic imagery, fashion photography, or emotionally expressive visual art.

Second, the system depends on sandboxed code execution. That introduces operational issues: runtime reliability, package availability, security policy, latency, and failure handling. A production system cannot simply let arbitrary generated code run wherever it likes. That would be less “AI innovation” and more “incident report waiting patiently.”
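The failure-handling part of that list can be sketched with the standard library. The example below runs generated code in a separate interpreter, inside a throwaway directory, with a timeout. This is an assumption-laden sketch and not a security sandbox: a subprocess still has the host's file and network access, so a real deployment would add container or OS-level isolation. The `run_generated_code` name and the returned dictionary shape are hypothetical.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_generated_code(source: str, timeout_s: float = 10.0) -> dict:
    """Run a script in a temp directory and report success and files produced."""
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "draft.py"
        script.write_text(source, encoding="utf-8")
        try:
            proc = subprocess.run(
                [sys.executable, "-I", str(script)],  # -I: isolated mode
                cwd=workdir,
                capture_output=True,
                text=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return {"ok": False, "reason": "timeout", "files": []}
        # Only file names are returned; the directory is deleted on exit.
        files = sorted(p.name for p in Path(workdir).iterdir() if p.name != "draft.py")
        return {"ok": proc.returncode == 0, "reason": proc.stderr.strip(), "files": files}


good = run_generated_code("open('draft.png', 'wb').write(b'stub')")
bad = run_generated_code("raise SystemExit(3)")
print(good["ok"], good["files"], bad["ok"])
```

A production version would copy the draft image out before the directory is removed and would log the captured stderr, since a failed draft is the cheapest place in the pipeline to catch an error.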

Third, CoCo-10K relies on a synthetic pipeline involving strong external models for code and image refinement. This is a reasonable research strategy, but it means the dataset carries the assumptions and biases of the teacher systems and the prompt construction process. The paper shows impressive results, but it does not settle how broadly the approach transfers across industries, brand systems, languages, or highly specialized visual domains.

Fourth, the final image is still produced by a generative refinement model. The code-rendered draft improves controllability, but it does not guarantee exact preservation of every element. For regulated charts, financial disclosures, medical diagrams, or legal documents, the final artifact still needs verification.

These are not reasons to dismiss the paper. They are reasons to use it correctly.

The deeper shift: from prompting images to compiling intentions

CoCo is interesting because it reframes a practical weakness in image generation. The weakness is not only that models hallucinate text or misplace objects. The weakness is that direct text-to-image generation often lacks a controllable intermediate form.

In software, we rarely move from product requirement directly to finished application without specifications, prototypes, tests, and build artifacts. In analytics, we rarely move from business question directly to final dashboard without queries, transformations, and chart definitions. But in image generation, users have been asked to accept a suspiciously magical jump from prompt to pixels.

CoCo says: add a build step.

That may be the most business-relevant idea in the paper. Not because every company should generate Matplotlib scripts for posters, but because many enterprise AI workflows need intermediate artifacts that are explicit enough to inspect and structured enough to repair. Code-as-CoT is one version of that principle. The broader principle is that AI systems become more useful when their reasoning leaves behind something operational, not just verbal.

For structured generation, the next serious question is not whether the image looks impressive in a demo. The question is whether the system can preserve the parts the business actually cares about: the numbers, the labels, the hierarchy, the layout, the language, the compliance-sensitive details, and the revision path.

CoCo does not solve all of that. But it points in the right direction: away from decorative reasoning, toward executable planning.

And for once, “thinking step by step” may actually mean something visible on the page.

Cognaptus: Automate the Present, Incubate the Future.

¹ Haodong Li et al., “CoCo: Code as CoT for Text-to-Image Preview and Rare Concept Generation,” arXiv:2603.08652, 2026, https://arxiv.org/abs/2603.08652.
