A slide looks finished. The headline is sharp, the equations are aligned, the answer box is confident, and the design has the mild corporate glow of something that has already been approved by three people who did not read it.

That is exactly the problem.

For years, text-to-image models failed in a wonderfully obvious way: they could not spell. A poster would say “Qaurterly Reveneu,” the mockup button would contain mystical glyphs, and everyone understood the output was decorative, not operational. Recent models have changed that. They can now place readable text inside images, produce document-like pages, and generate slide-like visual artifacts. The failure mode has become less funny and more expensive: the text may be readable, but the reasoning may be wrong.

Hong and Zhou’s paper, “Evaluating Reasoning Fidelity in Visual Text Generation,” studies precisely this new failure class: whether text-to-image systems can preserve semantic and logical correctness when the reasoning itself must be expressed as rendered text inside an image.[1] The paper’s answer is not comforting, but it is useful. Current text-to-image systems are improving at visual text rendering; they are not yet reliable reasoning engines merely because their outputs now look like documents.

The practical translation is simple: a visual generator should not be treated as the place where reasoning happens. It should be treated as the place where already-checked reasoning is displayed. Yes, apparently we needed a benchmark to say that the pretty slide is not the brain. Welcome to progress.

The paper tests a different question from ordinary visual-text benchmarks

Most visual-text evaluation asks: did the model render the requested text correctly? That is a necessary question. It is not the whole question.

A system that misspells words is easy to distrust. A system that writes fluent but logically broken reasoning is harder to catch. Hong and Zhou shift the evaluation target from rendering fidelity to reasoning fidelity. Instead of asking only whether the output text is readable, they ask whether the rendered text contains correct intermediate reasoning and a correct final answer.

That changes the benchmark design.

The paper first uses a text rendering task as a filter. Models are asked to reproduce WikiText passages at fixed lengths: 64, 128, 256, and 512 words. This stage is not the main thesis. It is the entry ticket. If a model cannot render readable long text, its reasoning errors cannot be cleanly separated from its OCR and layout failures.

Only after that does the paper move into reasoning tasks:

| Task family | What the model must do | Why it matters |
| --- | --- | --- |
| Factual knowledge | Answer ARC-style grade-school science multiple-choice questions with reasoning for each option | Tests whether visual text generation preserves basic factual reasoning |
| Context understanding | Answer DROP-style questions from long passages, often requiring reference resolution or simple arithmetic | Tests whether reasoning over supplied context survives visual rendering |
| Math reasoning | Solve MATH problems with step-by-step reasoning and a final answer | Tests multi-step procedural reasoning under visual-text constraints |

The important phrase is “under visual-text constraints.” The authors compare text-to-image systems against text-only LLM baselines. That comparison is the spine of the paper. Same broad task family, different output channel. The question is not whether images are nice. The question is whether forcing reasoning through an image-generation pipeline weakens the reasoning trace.

Clear text is the entry ticket, not the exam

The rendering results are already revealing, but not because they are surprising in the old way.

The evaluated systems include closed-source text-to-image models such as GPT-Image-1.5 in low and medium quality settings (abbreviated GPT-L and GPT-M), GPT-Image-2, Gemini-2.5-Flash-Image (Gemini), and Flux.2-Pro (Flux.2), along with open-source models including Qwen-Image, SDXL, and TextDiffuser-2. The paper uses PaddleOCR to extract text and reports character error rate (CER), word error rate (WER), and OCR confidence.

The rendering stage shows four familiar failure modes: layout overflow, text blurring, instruction violation, and extra hallucinated text. SDXL and TextDiffuser-2 fail badly enough on short inputs that the authors exclude them from the later reasoning evaluation. That exclusion is important. The paper is not simply punishing weak renderers for ugly text. It tries to focus the reasoning analysis on models that can produce reasonably readable outputs.

Even among the remaining models, rendering quality varies sharply. GPT-Image-2 has the strongest text rendering among the T2I systems in the main table, with character error rate 0.049 and word error rate 0.283. GPT-M does reasonably well by comparison, with CER 0.091 and WER 0.347. Flux.2, by contrast, has CER 1.352 and WER 1.450, a sign that insertion and hallucination errors can overwhelm the original text.
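
An error rate above 1.0 can look like a typo. It is not. As a minimal sketch, assuming CER and WER are computed the usual way (edit distance divided by the length of the reference), extra hallucinated text pushes the ratio past 1 because insertions are counted against a reference that never contained them. The function names and sample strings below are illustrative and are not taken from the paper's code.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or word lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference item missing from output
                curr[j - 1] + 1,     # insertion: output item not in reference
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1]


def cer(reference, ocr_text):
    """Character error rate: character edits divided by reference length."""
    return edit_distance(reference, ocr_text) / len(reference)


def wer(reference, ocr_text):
    """Word error rate: word edits divided by reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, ocr_text.split()) / len(ref_words)


# Hypothetical strings: the requested text, and an output that adds extra words.
requested = "net revenue rose"
faithful = "net revenue rose"
padded = "net revenue rose sharply across all regions this quarter"

print(cer(requested, faithful), wer(requested, faithful))
print(cer(requested, padded), wer(requested, padded))
```

In this toy case the padded output reproduces every requested character and still scores above 1 on both metrics, because the inserted words outnumber the words that were asked for.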

There is a useful warning here: OCR confidence alone is not enough. Flux.2 has high OCR confidence, but very high error rates. The OCR system can be confident about text that is clearly readable and still not the text that was requested. In business terms, this is the difference between “the invoice number is legible” and “the invoice number is correct.” These are not the same sentence, no matter how badly procurement software wants them to be.

The main comparison is final answers versus reasoning steps

The core contribution of the paper is its split between answer score and process score.

The answer score asks whether the final answer is correct. The process score asks whether each intermediate reasoning step is correct. This distinction matters because a model can land on the right answer through a broken path, or produce a plausible-looking chain of explanation that does not actually support the answer.
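
The split can be sketched in a few lines. This is an assumption-laden illustration, not the paper's scoring code: it assumes each graded sample carries one correctness flag for the final answer and one flag per reasoning step, and that the process score is the average fraction of correct steps per sample. The `GradedSample` type and the sample data are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GradedSample:
    answer_correct: bool      # was the final answer right?
    step_correct: List[bool]  # one flag per intermediate reasoning step


def answer_score(samples):
    """Fraction of samples whose final answer is correct."""
    return sum(s.answer_correct for s in samples) / len(samples)


def process_score(samples):
    """Mean, over samples, of the fraction of correct steps (assumed aggregation)."""
    per_sample = [sum(s.step_correct) / len(s.step_correct) for s in samples]
    return sum(per_sample) / len(per_sample)


# Hypothetical graded outputs.
samples = [
    GradedSample(True, [True, True, True, True]),
    GradedSample(True, [True, False, False, True]),    # right answer, broken path
    GradedSample(False, [True, True, False, False]),
    GradedSample(True, [False, False, True, True]),    # right answer, broken path
]

print("answer score:", answer_score(samples))
print("process score:", process_score(samples))
```

Two of the three correct answers here rest on half-broken reasoning, which is exactly the case a final-answer metric cannot see.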

The paper’s main table makes this visible:

| Model | Math answer / process | Context answer / process | Factual answer / process | Interpretation |
| --- | --- | --- | --- | --- |
| GPT-5.2 text-only | 0.934 / 0.969 | 0.870 / 0.936 | 0.988 / 0.994 | Strong final answers and strong reasoning traces |
| Qwen3-8B text-only | 0.838 / 0.917 | 0.790 / 0.821 | 0.953 / 0.947 | Smaller LLM, still comparatively consistent |
| GPT-Image-2 | 0.728 / 0.845 | 0.826 / 0.901 | 0.965 / 0.931 | Best T2I model, but still below text-only reasoning |
| GPT-M | 0.520 / 0.615 | 0.803 / 0.822 | 0.945 / 0.919 | Better generation quality helps, but does not close the gap |
| Gemini | 0.761 / 0.419 | 0.802 / 0.438 | 0.981 / 0.319 | High answer scores can coexist with weak process scores |
| Qwen-Image | 0.678 / 0.507 | 0.760 / 0.630 | 0.940 / 0.710 | Reasoning trace quality remains unstable |
| Flux.2 | 0.608 / 0.376 | 0.858 / 0.796 | 0.975 / 0.861 | Some answers look strong despite weaker rendering and process issues |

The most business-relevant column is not always the final answer. It is the gap between final answer and process.

Gemini is a good example. Its factual knowledge answer score is 0.981, close to the text-only GPT-5.2 score of 0.988. But its factual process score is only 0.319. That means the final multiple-choice selection can look excellent while the explanation underneath is often unreliable. In a consumer poster, maybe that is tolerable. In a financial report, educational worksheet, medical instruction sheet, compliance summary, or board presentation, it is less charming.
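
The gap is easy to compute from the reported figures. The snippet below simply subtracts process score from answer score for each model and task, using the numbers in the table above; the dictionary layout and function name are illustrative choices, not part of the paper.

```python
# (answer, process) pairs as reported in the main table.
scores = {
    "GPT-5.2 text-only":  {"math": (0.934, 0.969), "context": (0.870, 0.936), "factual": (0.988, 0.994)},
    "Qwen3-8B text-only": {"math": (0.838, 0.917), "context": (0.790, 0.821), "factual": (0.953, 0.947)},
    "GPT-Image-2":        {"math": (0.728, 0.845), "context": (0.826, 0.901), "factual": (0.965, 0.931)},
    "GPT-M":              {"math": (0.520, 0.615), "context": (0.803, 0.822), "factual": (0.945, 0.919)},
    "Gemini":             {"math": (0.761, 0.419), "context": (0.802, 0.438), "factual": (0.981, 0.319)},
    "Qwen-Image":         {"math": (0.678, 0.507), "context": (0.760, 0.630), "factual": (0.940, 0.710)},
    "Flux.2":             {"math": (0.608, 0.376), "context": (0.858, 0.796), "factual": (0.975, 0.861)},
}


def answer_process_gap(model, task):
    """Answer score minus process score; positive means answers outrun reasoning."""
    answer, process = scores[model][task]
    return round(answer - process, 3)


# Rank models by how far the factual answer score outruns the process score.
ranked = sorted(scores, key=lambda m: answer_process_gap(m, "factual"), reverse=True)
for model in ranked:
    print(f"{model:20s} factual gap = {answer_process_gap(model, 'factual'):+.3f}")
```

A large positive gap flags a model whose final selections look better than the explanations underneath them. A gap near zero, or slightly negative as with the text-only baselines, means the trace is at least as sound as the answer.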

The authors also find that difficulty makes the gap more visible. In math reasoning, performance declines as problem difficulty increases. For most T2I systems, process scores degrade more sharply than answer scores. This is the paper’s most useful diagnosis: the output channel can preserve the appearance of a solution while degrading the procedure that makes the solution trustworthy.

That is not a small distinction. Business users often review generated outputs visually and selectively. They look at the title, the chart, the final number, and perhaps one explanation paragraph. A model that can produce correct-looking answers with bad intermediate reasoning is, accidentally but effectively, optimized to pass that style of shallow review.

Better image generation helps, but it is not reasoning insurance

The paper compares GPT-Image-1.5 under two generation quality settings: GPT-L and GPT-M. This is not the full thesis; it functions more like a quality sensitivity test. It asks whether improving the generation setting improves reasoning fidelity.

It does.

GPT-L performs extremely poorly on math reasoning, with answer score 0.011 and process score 0.126. GPT-M rises to 0.520 answer score and 0.615 process score. The difference is too large to ignore. Better visual generation can reduce rendering failures and make the reasoning trace more coherent.

But this improvement does not erase the gap. GPT-M still trails GPT-Image-2, and both trail text-only LLM baselines in reasoning consistency. The lesson is not that rendering quality is irrelevant. The lesson is that rendering quality is a necessary layer, not a sufficient one.

This distinction matters for product design. If a company deploys an AI slide generator, there are at least three capabilities involved:

  1. generating the substantive answer;
  2. formatting that answer into a visual artifact;
  3. preserving the answer faithfully through the visual rendering process.

Many products collapse all three into one magical prompt. The paper suggests that this collapse is operationally fragile. When the model both reasons and renders in one pass, it becomes harder to know whether the error came from bad reasoning, bad rendering, bad extraction, or a bad interaction among all three.

A pipeline that separates reasoning from rendering may sound less glamorous. It is also easier to audit. Glamour has never passed a compliance review, although it has attended many.

The paper cross-examines the obvious excuse: maybe OCR caused the problem

A skeptical reader might object: perhaps the models reasoned correctly, but OCR extraction or image readability corrupted the evaluation.

The paper takes that objection seriously. This is where the error-source analysis matters. It is not a second thesis; it is a robustness and attribution check.

| Test or evidence | Likely purpose | What it supports | What it does not prove |
| --- | --- | --- | --- |
| Table 1 main task results | Main evidence | T2I systems trail text-only LLMs, especially in process scores | Does not by itself isolate rendering errors from reasoning errors |
| Figure 2 difficulty curves | Main evidence and difficulty sensitivity | Reasoning quality declines as difficulty rises, especially for process scores | Does not identify every mechanism behind the decline |
| GPT-L versus GPT-M | Generation-quality sensitivity test | Better image generation improves reasoning-related metrics | Does not show that rendering quality alone solves reasoning fidelity |
| PaddleOCR versus DeepSeek-OCR | OCR backend robustness check | OCR choice has limited impact on rendering metrics | Does not remove all possible extraction errors, especially math-symbol edge cases |
| CCR and ACR readability checks | Rendering-error attribution check | Many outputs are visually readable even when reasoning fails | Does not prove internal model reasoning is absent |
| Human labels for CCR | Metric validation | VLM-based clarity estimates broadly track human readability judgments | Does not validate semantic correctness |
Human labels for CCR Metric validation VLM-based clarity estimates broadly track human readability judgments Does not validate semantic correctness

The OCR backend comparison is particularly useful. PaddleOCR and DeepSeek-OCR produce broadly consistent character and word error results. For example, GPT-M has CER/WER 0.091/0.347 under PaddleOCR and 0.096/0.351 under DeepSeek-OCR. Gemini’s results are also close: 0.506/0.732 under PaddleOCR and 0.502/0.716 under DeepSeek-OCR.
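
For readers who prefer the subtraction done for them, here is the same comparison as arithmetic. It uses only the four pairs quoted above; the dictionary names and helper function are illustrative.

```python
# (CER, WER) per model under each OCR backend, as quoted above.
paddle_ocr = {"GPT-M": (0.091, 0.347), "Gemini": (0.506, 0.732)}
deepseek_ocr = {"GPT-M": (0.096, 0.351), "Gemini": (0.502, 0.716)}


def backend_delta(model):
    """Absolute (CER, WER) difference between the two OCR backends."""
    return tuple(
        round(abs(a - b), 3)
        for a, b in zip(paddle_ocr[model], deepseek_ocr[model])
    )


for model in paddle_ocr:
    cer_delta, wer_delta = backend_delta(model)
    print(f"{model}: CER differs by {cer_delta}, WER differs by {wer_delta}")
```

The differences sit in the second or third decimal place, which is small next to the spread between models.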

That does not mean OCR is perfect. The authors explicitly acknowledge that mathematical symbols and equations remain difficult. But it does reduce the chance that the whole result is merely an OCR artifact.

The readability analysis pushes the point further. Using a VLM-based Character Clear Rate and All Clear Rate, the paper finds that most selected models generate a high proportion of clearly readable characters. GPT-M, for instance, has text-rendering CCR 0.998 and math-reasoning CCR 0.999. Gemini has math-reasoning CCR 0.996. Yet these models still show substantial reasoning failures.

The authors then validate the VLM-based readability metric with human labels on 400 samples: 200 from text rendering and 200 from math reasoning. The Pearson correlation between VLM CCR and human scores is 0.785, rising to 0.920 after removing 8 outliers. That is not a universal guarantee, but it is enough to support the paper’s narrower claim: many failures are not just unreadable-text failures.
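
The mechanics of that validation can be sketched as follows. Everything here is an assumption for illustration: the scores are invented, and the outlier rule (drop the samples where the two scores disagree most) is a stand-in, since the criterion the authors used is not described in this summary.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def drop_largest_disagreements(xs, ys, k):
    """Remove the k pairs with the largest absolute difference (assumed rule)."""
    pairs = sorted(zip(xs, ys), key=lambda p: abs(p[0] - p[1]))
    kept = pairs[:len(pairs) - k]
    return [p[0] for p in kept], [p[1] for p in kept]


# Hypothetical readability scores: VLM estimate versus human label.
vlm_ccr = [0.20, 0.40, 0.60, 0.80, 1.00, 0.95]
human = [0.25, 0.35, 0.65, 0.75, 0.95, 0.10]  # last pair is a sharp disagreement

r_all = pearson(vlm_ccr, human)
r_trimmed = pearson(*drop_largest_disagreements(vlm_ccr, human, k=1))
print(f"all samples: r = {r_all:.3f}; after trimming: r = {r_trimmed:.3f}")
```

A single sharp disagreement is enough to drag the coefficient down in a small sample, which is why reporting the correlation both before and after outlier removal is more informative than either number alone.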

The uncomfortable conclusion survives cross-examination. The text can be clear. The reasoning can still be wrong.

The failure mode is not ugly output; it is believable output

The paper’s failure cases are not decorative anecdotes. They clarify what kind of operational risk the benchmark is detecting.

The authors identify reasoning failures such as logically inconsistent deductions, hallucinated intermediate steps, and repetitive reasoning patterns. One example shows a model repeatedly reproducing previous steps rather than advancing the solution. Another shows clear rendered text with incorrect reasoning highlighted. These are not merely typography problems. They are procedural failures.

For business users, the dangerous version is not a visibly broken image. A visibly broken image is rejected. The dangerous version is a clean generated slide where:

  • the formula is readable;
  • the final number looks plausible;
  • the explanation uses the right vocabulary;
  • the intermediate steps contain a quiet contradiction;
  • nobody checks because the deck is due in twelve minutes.

This is why the paper’s comparison-based framing matters. The gap is not between “AI image good” and “AI image bad.” The gap is between several pairs that people often confuse:

| Common belief | Correction |
| --- | --- |
| Readable text means reliable content | Readability only proves that the text can be read |
| Correct final answer means correct reasoning | Correct answers can mask broken intermediate steps |
| Better rendering means solved reasoning | Better rendering helps but does not guarantee logic |
| OCR validation is enough | OCR checks extraction, not semantic correctness |
| A generated document can be reviewed like a normal document | Generated reasoning needs process-level checks, not just visual inspection |

The last point is the important one. A normal document is usually the end of a human or software workflow. An AI-generated visual document may be both the output and part of the computation. That makes it harder to audit. When the reasoning only exists inside pixels, the organization loses access to structured intermediate text unless it deliberately preserves it.

The operational lesson is to keep reasoning outside the image

The paper directly shows that current T2I models struggle to preserve reasoning fidelity when they must express solutions as visual text. The business inference is architectural: do not ask the visual generator to be the reasoner of record.

A safer workflow separates the reasoning core from the rendering layer:

| Pipeline layer | Operational role | Failure it reduces |
| --- | --- | --- |
| Text reasoning model | Produces answer, assumptions, intermediate steps, and uncertainty notes in structured text | Reduces dependence on image-space reasoning |
| Rule or LLM verifier | Checks final answer against reasoning steps and source context | Catches answer-process mismatch |
| Renderer | Converts approved text into slide, report, dashboard, or image format | Limits visual generation to presentation |
| OCR/VLM extraction check | Re-reads the rendered artifact | Detects rendering corruption, missing text, and layout errors |
| Semantic post-check | Compares extracted visual text with approved source text and reasoning | Detects visual-text drift and hallucinated insertions |
| Human review for high-stakes outputs | Focuses review on assumptions, numbers, and logic rather than only layout | Catches errors automated checks still miss |
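
The post-check layer is the least glamorous and the easiest to sketch. The example below is one possible approach, not a method from the paper: it assumes the approved text and the OCR-extracted text are both available as plain strings, and it uses Python's standard `difflib.SequenceMatcher` to list every word-level difference. The function name `find_drift` and the sample strings are hypothetical.

```python
import difflib


def find_drift(approved, extracted):
    """Return word-level differences between approved and re-extracted text.

    Each item is (operation, approved_words, extracted_words), where the
    operation is 'replace', 'delete', or 'insert'.
    """
    a, b = approved.split(), extracted.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    issues = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            issues.append((tag, " ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return issues


# Hypothetical example: the renderer altered a number and appended a word.
approved = "Step 2: margin = 120 / 400 = 0.30"
extracted = "Step 2: margin = 120 / 400 = 0.35 approx"

for operation, expected, found in find_drift(approved, extracted):
    print(f"{operation}: expected '{expected}', found '{found}'")
```

An empty result means the rendered text matches what was approved. Anything else is a reason to stop the artifact before it reaches a reviewer who has twelve minutes. A production check would also need to tolerate benign OCR noise, which this sketch does not attempt.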

This is not a call to avoid AI-generated visuals. It is a call to stop treating the visual artifact as the place where reasoning should be born.

For automated reporting, the reasoning should exist in a structured intermediate form: JSON, Markdown, database records, or another auditable representation. The slide, image, or PDF should be a projection of that reasoning, not its only incarnation. If the rendered artifact is the first time the reasoning exists, the audit trail is already lost.
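
What such an intermediate form might look like is sketched below. The schema, field names, and both functions are hypothetical, chosen only to show the shape of the idea: the reasoning lives in a structured record, a verifier approves it, and the renderer receives nothing that has not been approved.

```python
import json

# Hypothetical reasoning record produced by the text reasoning layer.
record = {
    "question": "What is the margin if profit is 120 on revenue of 400?",
    "assumptions": ["Profit and revenue are in the same currency"],
    "steps": [
        {"text": "margin = profit / revenue", "value": None},
        {"text": "margin = 120 / 400", "value": 0.30},
    ],
    "final_answer": 0.30,
    "approved": False,
}


def approve(rec):
    """Toy verifier: the final answer must match the last step's value."""
    last_value = rec["steps"][-1]["value"]
    if last_value != rec["final_answer"]:
        raise ValueError("final answer does not match the last reasoning step")
    return {**rec, "approved": True}


def render_payload(rec):
    """Serialize an approved record for the rendering layer; refuse otherwise."""
    if not rec["approved"]:
        raise PermissionError("only approved reasoning may be rendered")
    return json.dumps(rec, indent=2, sort_keys=True)


payload = render_payload(approve(record))
print(payload)
```

The serialized payload is what gets archived alongside the slide. If the image later turns out to say something else, there is a record of what it was supposed to say.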

The same principle applies to education technology. A generated worksheet solution may look clean, but a student needs correct steps, not just a tidy answer. It applies to financial dashboards, where a wrong intermediate assumption can change a recommendation. It applies to compliance summaries, where the visible conclusion may be less important than the documented basis for it. It applies to interface agents that generate instruction screens or forms, where a small textual drift can become a user action error.

The ROI argument is also more sober than “add another AI model.” The value is not in making the visual output fancier. The value is cheaper diagnosis. A separated pipeline lets a business identify whether the error came from reasoning, rendering, extraction, or semantic drift. That reduces debugging cost and makes governance less theatrical.

What the paper does not prove

The paper is careful about its own boundaries, and those boundaries matter.

First, the evaluation is intentionally end-to-end. A model must generate an image and externalize reasoning through rendered text. That means the benchmark measures the combined behavior of reasoning, visual generation, layout, and extraction. The authors mitigate extraction confounds, but they do not eliminate every possible OCR or symbol-recognition issue.

Second, the paper does not exhaustively sweep rendering variables such as font size, stroke width, or layout density. A production system with strict templates, larger canvases, or better document rendering constraints might perform better than a general-purpose prompt-to-image setup.

Third, many text-to-image systems are optimized for natural images or short embedded text, not dense structured documents with multi-step reasoning. Specialized document-focused visual-text models may reduce the gap. They may also change where the failure appears. Progress has a habit of moving the mess to a more expensive layer.

Fourth, the benchmark focuses on reasoning expressed through explicit textual steps. A model might perform some internal reasoning but fail to externalize it faithfully in rendered text. That distinction is scientifically important. Operationally, however, it does not rescue the generated artifact. If the reasoning cannot be faithfully externalized, the user still cannot audit it.

So the fair conclusion is not “image models cannot reason.” The fair conclusion is narrower and more useful: current text-to-image systems should not be trusted as standalone generators of reasoning-heavy visual text, especially where the intermediate steps matter.

The pretty document still needs a brain behind it

This paper lands at a useful moment because the interface of AI work is becoming more visual. People want models to generate slides, dashboards, whitepapers, forms, diagrams, worksheets, landing pages, and product mockups. That demand is reasonable. Nobody wants to manually align boxes in a slide deck until retirement.

But visual fluency creates a new trap. When text inside images becomes legible, users may upgrade their trust faster than the model upgraded its reasoning. The old warning sign was broken spelling. The new warning sign is absence of an audit trail.

Hong and Zhou’s benchmark gives businesses a cleaner vocabulary for this risk. The issue is not only hallucination. It is reasoning fidelity under visual-text generation. The answer is not to ban visual AI, but to discipline its role: reason in text, verify in structure, render in image, then check the rendering.

A generated slide can be beautiful. It can be useful. It can even be mostly correct.

It should not be the only witness to its own logic.

Cognaptus: Automate the Present, Incubate the Future.


  1. Jiajun Hong and Jiawei Zhou, “Evaluating Reasoning Fidelity in Visual Text Generation,” arXiv:2606.04479v1, 3 June 2026.