Picture This: When AI Reasoning Leaves the Text Box

Reasoning usually arrives as text. A model explains itself in sentences, equations, bullet points, and the occasional theatrical “therefore.” We have learned to call this chain-of-thought, or CoT, because “the model wrote a long scratchpad and we hope it helped” sounded insufficiently scientific.

The paper “Optical Reasoning: Rethinking Images as an Expressive Reasoning Medium Beyond Text” asks a sharper question: what if the intermediate reasoning medium does not have to be text at all?¹

Not “can images help a model reason?” That question is already familiar in multimodal AI. Not “can we add a diagram beside the explanation?” That is useful, but not radical. The paper’s proposal is stronger: images can serve as the reasoning medium itself. The rationale can be rendered as a compact typographic canvas or reorganized into a graphical multi-panel image, then fed to a multimodal model as visual reasoning tokens.

That sounds like a compression trick. It is partly that. But treating it only as OCR-flavored prompt compression misses the more interesting mechanism. The paper is not merely stuffing words into a JPEG suitcase. It is testing whether visual layout, spatial grouping, equations, diagrams, and typography can become the carrier of intermediate reasoning.

For business users, the question is not whether executives should start asking their AI systems to think in comic strips. Please do not. The question is whether future multimodal agents should represent intermediate work as visual canvases when text becomes too expensive, too linear, or too awkward for spatial tasks.

The answer from this paper is cautiously interesting: often yes, under controlled conditions; not always; and definitely not without renderer governance. A familiar theme. The future is visual, but still needs someone to check the margins.

The paper changes the reasoning medium, not just the prompt format

Standard text reasoning gives the model a question and a textual rationale. The model reads the rationale as text tokens and outputs an answer. Optical reasoning changes the middle layer. The rationale is transformed into an image, then the multimodal model receives the question plus visual reasoning tokens extracted from that image.

The paper formalizes this as a shift from:

$$ a \sim \pi_\theta(\cdot \mid q, r_{txt}) $$

to:

$$ a \sim \pi_\theta(\cdot \mid q, z_{vis}) $$

where $q$ is the question, $r_{txt}$ is the text rationale, and $z_{vis}$ is the visual-token representation of a rendered rationale image.

This is the core mechanism. The model is not simply seeing an illustration. It is using an image as the intermediate representation of reasoning.

The authors instantiate this idea in two ways:

Variant	What it does	What it tests
Typographic optical reasoning, or T-OR	Renders the rationale into a compact typographic image using layout search over width, font size, line spacing, and padding	Whether images can preserve reasoning content while reducing reasoning tokens
Graphical optical reasoning, or G-OR	Converts the rationale into a step-aligned graphical canvas with panels, annotations, equations, and diagrams	Whether images can do more than compress text by reorganizing reasoning spatially

That distinction matters. T-OR is mostly about compactness and readability. G-OR is about expressiveness. One asks, “Can the same reasoning be packed into fewer tokens?” The other asks, “Can a visual structure make the reasoning easier to use?”

The paper’s strongest business relevance comes from this difference. Many enterprise AI workflows do not fail because the model lacks another paragraph of explanation. They fail because intermediate work is hard to inspect, hard to structure, or too costly to pass between agents. A visual rationale canvas could become a cheaper and more natural handoff object for some workflows: engineering diagrams, compliance evidence maps, medical decision trees, logistics layouts, financial scenario boards, and any task where spatial relations carry information that text handles with the elegance of a tax form.

T-OR is a renderer with a budget, not a magic screenshot

The typographic variant is easy to misunderstand. It is not “take a screenshot of the rationale and hope the model can read it.” T-OR searches for a compact but readable layout under a target visual-token budget.

Its renderer preserves the original rationale order. Text spans remain text. Equations remain equations. Tables are typeset. Visual rationale units can be inserted as image blocks. The system then varies layout parameters such as text width and font size, estimates the resulting visual token count, and chooses a feasible layout that balances canvas utilization against readability penalties.

In plain terms: T-OR tries to remove wasted visual space without making the rationale illegible.
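
A budgeted layout search of this kind can be sketched in a few lines. This is a minimal illustration, not the paper's renderer: the `Layout` fields, the glyph-width guess of half the font size, the 32-pixel token tile, the candidate grid, and the scoring rule are all assumptions made for the example.

```python
import math
from dataclasses import dataclass
from itertools import product
from typing import Optional


@dataclass(frozen=True)
class Layout:
    width_px: int
    font_px: int
    line_spacing: float
    padding_px: int


def canvas_height(text: str, layout: Layout) -> int:
    """Estimate rendered height, assuming each glyph is about 0.5 * font_px wide."""
    usable = layout.width_px - 2 * layout.padding_px
    chars_per_line = max(1, int(usable / (0.5 * layout.font_px)))
    lines = sum(max(1, math.ceil(len(p) / chars_per_line)) for p in text.split("\n"))
    return math.ceil(lines * layout.font_px * layout.line_spacing) + 2 * layout.padding_px


def estimate_tokens(width_px: int, height_px: int, px_per_token: int = 32) -> int:
    """Patch-style estimate: one token per px_per_token x px_per_token tile (assumed)."""
    return math.ceil(width_px / px_per_token) * math.ceil(height_px / px_per_token)


def score(text: str, layout: Layout, min_font_px: int = 12) -> float:
    """Canvas utilization minus a readability penalty for fonts below min_font_px."""
    height = canvas_height(text, layout)
    glyph_area = len(text) * 0.5 * layout.font_px * layout.font_px
    utilization = min(1.0, glyph_area / (layout.width_px * height))
    penalty = max(0, min_font_px - layout.font_px) / min_font_px
    return utilization - penalty


def search_layout(text: str, token_budget: int) -> Optional[tuple[Layout, int]]:
    """Return the best-scoring (layout, estimated_tokens) within budget, or None."""
    best: Optional[tuple[float, Layout, int]] = None
    grid = product((256, 384, 512, 768), (8, 10, 12, 14, 16), (1.0, 1.2), (4, 8))
    for width, font, spacing, pad in grid:
        layout = Layout(width, font, spacing, pad)
        tokens = estimate_tokens(width, canvas_height(text, layout))
        if tokens > token_budget:
            continue  # infeasible under the visual-token budget
        candidate = (score(text, layout), layout, tokens)
        if best is None or candidate[0] > best[0]:
            best = candidate
    return None if best is None else (best[1], best[2])
```

Calling `search_layout(rationale_text, token_budget=64)` returns a layout and its estimated token count, or `None` when nothing fits. A real renderer would measure text with an actual font engine rather than a glyph-width guess, which is exactly where the sensitivity discussed later comes from.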

That is more operationally important than it sounds. If images become reasoning media, the renderer becomes part of the reasoning system. Font choice, color, width, density, and backend are no longer cosmetic settings. They are model-interface parameters. The “design team” has accidentally walked into the inference pipeline. Naturally, this is where everything becomes both interesting and annoying.

The main evidence asks whether visual rationales can compete with text rationales

The main experiment evaluates optical reasoning across five benchmarks and five multimodal large language models. The benchmarks cover mathematical reasoning, scientific reasoning, and interleaved text-image reasoning: AquaRat, GSM8K, GPQA Diamond, ScienceQA, and Zebra-CoT. The models include GPT-5.1, Gemini 2.5 Flash, Claude Sonnet 4.5, Kimi K2.5, and Qwen3-VL-235B, as reported in the paper.

The setup is controlled in several important ways.

First, the authors primarily use rationales from open-source CoT datasets. This means the experiment is mostly about representation of reasoning, not about whether the model can invent the right reasoning from scratch.

Second, explicit final answers are removed from the rationales. The model cannot simply read the answer and repeat it. It must infer the solution from the provided reasoning content.

Third, model reasoning mode is disabled, so the model relies on the supplied rationale rather than generating a long hidden or visible scratchpad of its own.

Fourth, the paper uses accuracy as the primary task metric and introduces Marginal Accuracy Gain, or MAG, to measure the accuracy improvement over the no-reasoning baseline per reasoning token spent.

These design choices make the main result easier to interpret. The paper is asking: if the same or similar rationale content is represented visually rather than textually, how much answer accuracy survives, and how many reasoning tokens are saved?
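
As described above, MAG is accuracy gain over no reasoning, divided by reasoning tokens. A minimal sketch of that reading follows. The function name, the exact normalization, and every number in the usage note are illustrative assumptions, not the paper's definition or results.

```python
def marginal_accuracy_gain(
    accuracy: float, accuracy_no_reasoning: float, reasoning_tokens: float
) -> float:
    """Accuracy gained over the no-reasoning baseline, per reasoning token.

    Assumed form: (accuracy - baseline) / tokens. The paper may normalize differently.
    """
    if reasoning_tokens <= 0:
        raise ValueError("reasoning_tokens must be positive")
    return (accuracy - accuracy_no_reasoning) / reasoning_tokens


def efficiency_ratio(mag_a: float, mag_b: float) -> float:
    """How many times more accuracy gain per token setting A buys than setting B."""
    if mag_b == 0:
        raise ValueError("mag_b must be non-zero")
    return mag_a / mag_b
```

With made-up inputs, a text rationale that lifts accuracy from 0.60 to 0.70 using 200 tokens has a MAG of about 0.0005, while a visual rationale that reaches 0.69 with 100 tokens has about 0.0009. The visual version is slightly less accurate and roughly 1.8 times more efficient per token, which is the shape of trade-off the metric is meant to expose.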

The headline result is token efficiency, not universal accuracy dominance

Across the main T-OR table, the result is not “images always beat text.” The result is subtler: T-OR often matches or exceeds text reasoning in specific model-benchmark pairs while using fewer reasoning tokens, and it often has much higher token efficiency.

The paper reports that, for language tasks, T-OR matches or exceeds text reasoning in seven model-benchmark pairs while reducing reasoning tokens by an average of 28.57%. Where T-OR underperforms text reasoning, the best setting trails by an average accuracy gap of 0.027 while still reducing reasoning tokens by 20%.

For multimodal tasks, T-OR matches or exceeds text reasoning in five model-benchmark pairs while reducing reasoning tokens by an average of 16%. Where it trails, the average gap is 0.014 with a 32% token reduction.

The paper also reports that T-OR achieves 1.96× the token efficiency of text reasoning under its MAG metric.

That last number is the business-facing result, but it needs careful translation. It does not mean every enterprise can halve inference cost tomorrow by turning rationales into images. It means that, in this experimental setup, visual reasoning tokens produced more marginal accuracy gain per token than text reasoning tokens on average.

The practical lesson is about representation economics. If reasoning content can be represented in a denser medium without losing too much usable information, the cost-performance frontier changes. For agentic systems, this could matter because intermediate reasoning is often passed repeatedly: planner to executor, executor to verifier, verifier to reporter. A cheaper intermediate representation compounds quickly. So does a bad one, which is why the renderer deserves suspicion.
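
The compounding is simple arithmetic. The sketch below assumes every handoff re-reads the full rationale and that the per-rationale reduction stays constant across hops; neither assumption comes from the paper, and the function name is hypothetical.

```python
def pipeline_reasoning_tokens(
    rationale_tokens: float, hops: int, reduction: float = 0.0
) -> float:
    """Total reasoning tokens consumed when a rationale is re-read at every hop.

    reduction is the fractional token saving of the denser representation (0 <= r < 1).
    """
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    if hops < 0:
        raise ValueError("hops must be non-negative")
    return rationale_tokens * (1.0 - reduction) * hops
```

For a hypothetical 1,000-token rationale passed across three handoffs (planner to executor, executor to verifier, verifier to reporter), `pipeline_reasoning_tokens(1000, 3)` gives 3,000 tokens. Plugging in the paper's 28.57% average reduction for language tasks as `reduction=0.2857` gives about 2,143. Whether that reduction actually holds at every hop is precisely what a deployment would have to measure.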

The compression curves show model-specific visual appetite

Figure 2 of the paper is not just a decorative plot of “more tokens good.” It is a sensitivity test. It shows how accuracy changes relative to text reasoning as T-OR uses different compression ratios.

The pattern is uneven. Gemini 2.5 Flash remains competitive under aggressive compression. Kimi K2.5 and Claude Sonnet 4.5 improve more consistently as visual tokens increase. Some models handle dense visual rationales better than others.

That matters because it turns optical reasoning from a generic method into a model-interface problem. A company cannot safely say, “We use MLLMs, therefore optical rationales will work.” The better statement is: “This model, with this renderer, at this resolution, for this task family, preserves enough reasoning signal.”

Less glamorous, more useful.

The paper’s extreme-compression analysis makes the point even sharper. On AquaRat with Gemini 2.5 Flash, the authors reduce the average visual-token budget dramatically. At only 1.2 estimated visual reasoning tokens per example, T-OR still beats the no-reasoning baseline. At 7.2 estimated tokens, it reaches 0.7992 accuracy, above the text reasoning baseline of 0.7323 and above the full-budget T-OR setting of 0.7362.

This is surprising enough that it should not be over-interpreted. The token count is estimated using a uniform patch-based mapping, not the undisclosed internal tokenization rules of every closed-source model. Also, “7.2 tokens” should not be read as seven literal words of reasoning. It is an estimated visual budget under the paper’s accounting method.
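
To see why an estimated visual token is not a word, it helps to look at what a patch-based mapping does. The sketch below is an assumption-laden illustration: the 16-pixel patch and the 2×2 patch merge are placeholder defaults in the style of common vision encoders, not the paper's published constants.

```python
import math


def estimate_visual_tokens(
    width_px: int, height_px: int, patch_px: int = 16, merge: int = 2
) -> int:
    """Estimate tokens as the number of (patch_px * merge)-sized tiles covering the image.

    patch_px and merge are assumed defaults; real models may resize or pad first.
    """
    side = patch_px * merge
    return math.ceil(width_px / side) * math.ceil(height_px / side)


def max_square_side_for_budget(
    token_budget: int, patch_px: int = 16, merge: int = 2
) -> int:
    """Largest square image side in pixels whose estimate fits within token_budget."""
    side = patch_px * merge
    return math.isqrt(token_budget) * side
```

Under these assumed constants, a 96×64 pixel canvas counts as 6 tokens, and a budget of 7 tokens allows a square of at most 64 pixels per side. A budget in that range describes a very small image, not a very short sentence, and a provider with a different tile size would count the same image differently.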

Still, the result is useful. It suggests that some models may not need pixel-perfect access to every rationale detail. They may benefit from coarse visual structure, layout cues, or compressed symbolic patterns. In business terms, this is not a license to make rationales tiny. It is a hint that there may be a tunable compression regime where intermediate reasoning remains useful but cheaper.

G-OR is the expressive version, and also the dangerous one

The graphical variant is the paper’s more ambitious idea. T-OR preserves rationale order; G-OR reorganizes rationale content into visual panels. It uses a structured prompt to generate a compact educational-style multi-panel image, preserving key reasoning text and equations while adding graphical elements and spatial layouts.

The paper evaluates G-OR on AquaRat and reports 0.8150 accuracy, ahead of no reasoning at 0.6890, text reasoning at 0.7323, and the T-OR settings shown in the same comparison, the best of which reaches 0.7835.

This is evidence for the expressive value of visual reasoning. The paper’s interpretation is that images are not merely compact containers for text. They can represent relations spatially. That is the part business readers should notice.

Text is linear. Many business problems are not. A procurement dependency map, an insurance claim workflow, an engineering fault tree, a warehouse layout, a process-mining trace, or a risk-control matrix may be more naturally represented as a diagrammatic object. G-OR points toward agents that reason over structured visual canvases rather than scrolling through another majestic wall of generated prose.

But G-OR also introduces a new failure mode: graphical hallucination. In the appendix case study, the authors note that generated schematics are not always accurate. One geometric case uses a red segment intended to show a key diagonal relation, but its placement deviates from the exact geometric constraint.

That is not a small implementation detail. It is the central governance problem for expressive visual reasoning. A wrong sentence is bad. A wrong diagram can be worse because humans and models both tend to trust geometry when it looks clean. A beautifully drawn wrong relationship is still wrong. It just has better typography.

The ablations show that design choices become model behavior

The ablation section studies renderer variables on GPQA Diamond with GPT-5.1. Its likely purpose is not to prove the main thesis again; it tests sensitivity. That is valuable because optical reasoning depends on a rendering pipeline.

The paper varies color, font family, font size, and text width. Red achieves the highest accuracy among tested colors; green performs worst. The “Heros” font performs best among the tested font families. Very small fonts reduce accuracy, while moderate font sizes work better. Narrower text width performs better than wider layouts in the reported setting.

This does not mean every system should render reasoning in red Heros at two inches wide. Please spare your interface designers. The broader point is that visual presentation changes model performance.

Experimental component	Likely purpose	What it supports	What it does not prove
Main T-OR benchmark table	Main evidence	Visual rationales can often preserve task performance while reducing token budgets	T-OR always beats text reasoning
G-OR on AquaRat	Exploratory extension / expressive-medium evidence	Spatially organized visual rationales can outperform text and typographic variants in the reported setup	G-OR is robust across all tasks and models
Layout style and density ablations	Sensitivity test	Rendering choices affect visual-rationale decoding	One universal layout exists
Extreme compression test	Robustness / stress test	Some useful reasoning signal survives under very small estimated visual budgets	Visual tokens can be treated as exact cost equivalents across providers
Renderer backend comparison	Implementation sensitivity test	Pillow, Matplotlib, and XeLaTeX interact differently with different MLLMs	Renderer choice is merely cosmetic
LLMLingua-2 comparison	Comparison with prior compression method	Optical mapping can preserve useful cues better than text truncation in the tested AquaRat setup	Optical reasoning dominates all text-compression methods
Model-generated rationale test	Practicality check	T-OR can remain competitive when rationales are generated by the model itself	The full end-to-end optical reasoning loop is solved

The renderer backend comparison is especially operational. On GPQA Diamond, the paper tests Pillow, Matplotlib, and XeLaTeX renderers. Qwen3-VL-235B and Claude Sonnet 4.5 do best with XeLaTeX; Gemini 2.5 Flash does best with Matplotlib; Kimi K2.5 is close but favors XeLaTeX in the reported table.

This is what production AI usually looks like once the demo ends. The concept is elegant. The system is picky. The renderer is now part of your model compatibility matrix.
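
In practice, a compatibility matrix is just a table of evaluation results keyed by model and renderer, plus a rule for reading it. A minimal sketch, with a hypothetical function name and placeholder values in the usage note:

```python
from collections import defaultdict


def best_renderer_per_model(results: dict[tuple[str, str], float]) -> dict[str, str]:
    """Map each model to its highest-accuracy renderer.

    results maps (model, renderer) to measured accuracy on one task family.
    Ties go to whichever renderer was recorded first.
    """
    by_model: dict[str, dict[str, float]] = defaultdict(dict)
    for (model, renderer), accuracy in results.items():
        by_model[model][renderer] = accuracy
    return {model: max(scores, key=scores.get) for model, scores in by_model.items()}
```

Given placeholder measurements such as `{("model_a", "pillow"): 0.61, ("model_a", "xelatex"): 0.66, ("model_b", "matplotlib"): 0.70, ("model_b", "xelatex"): 0.68}`, the function returns `{"model_a": "xelatex", "model_b": "matplotlib"}`. Those numbers are invented for illustration. The point is that the table has to be re-measured whenever the model, the renderer version, or the task family changes.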

Optical reasoning beats text truncation in the tested comparison

The paper compares T-OR with LLMLingua-2, a representative text-compression method for efficient reasoning, on AquaRat using Gemini 2.5 Flash. Under comparable reasoning-token budgets, LLMLingua-2 remains near the no-reasoning baseline, while T-OR performs substantially better.

The authors’ interpretation is that text compression of this kind removes content by deleting tokens, while optical 2D mapping may preserve more of the rationale’s useful structure. That explanation is plausible, with a boundary: this is a focused comparison, not a universal ranking of all compression strategies.

The business implication is still clean. Compression methods are not interchangeable. A text compressor decides what to delete. A visual renderer decides how to arrange. Those are different failure modes.

Deletion risk means the model never sees an important premise. Layout risk means the model sees it but may misread, under-attend, or visually confuse it. Depending on the workflow, one risk may be preferable to the other.

For regulated or high-stakes workflows, neither should be accepted without evaluation. But for internal reasoning handoffs, visual compression may become attractive precisely because it can preserve complete rationale content while reducing token burden.

The model-generated rationale test is practical, but not yet fully end-to-end

One possible criticism of the main setup is that the rationales are externally provided. That makes the experiment cleaner, but also less like deployment. In real systems, the model often generates its own intermediate reasoning.

The paper addresses this with a model-generated rationale analysis on GPQA Diamond using GPT-5.1. The “free reasoning” baseline reaches 0.6869 accuracy. T-OR reaches 0.6162 at 80% token reduction, 0.6616 at 60%, 0.6768 at 40%, 0.6869 at 20%, and 0.6919 at full budget.

This is a useful practicality check. It suggests optical reasoning can work with model-generated rationales, not only dataset-provided ones.
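
The reported numbers can be read as a small decision table: how much token reduction is affordable for a given accuracy tolerance? The sketch below uses only the figures quoted above. The assumption that “full budget” means 0% reduction, and the helper name, are added for the example.

```python
from typing import Optional

# GPQA Diamond, GPT-5.1, model-generated rationales, as quoted in this article.
# Keys are fractional token reductions; 0.0 stands for "full budget" (assumed).
REPORTED_T_OR = {0.8: 0.6162, 0.6: 0.6616, 0.4: 0.6768, 0.2: 0.6869, 0.0: 0.6919}
FREE_REASONING_BASELINE = 0.6869


def max_reduction_within(
    tolerance: float,
    results: dict[float, float] = REPORTED_T_OR,
    baseline: float = FREE_REASONING_BASELINE,
) -> Optional[float]:
    """Largest token reduction whose accuracy is within tolerance of the baseline."""
    feasible = [cut for cut, acc in results.items() if acc >= baseline - tolerance]
    return max(feasible) if feasible else None
```

With zero tolerance the answer is a 20% reduction, since that setting ties the free-reasoning baseline. Accepting a two-point accuracy drop allows 40%, and a three-point drop allows 60%. These are single-benchmark, single-model readings, so they describe a shape to look for rather than a setting to copy.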

But it is not the final deployment story. The system still needs a pipeline that generates a rationale, masks or controls final-answer leakage where appropriate, renders it correctly, feeds it back, and evaluates the answer. For business workflows, the more relevant architecture may be multi-agent: one agent reasons, another renders or compresses the rationale, another verifies, and another acts. At that point, optical reasoning becomes an intermediate artifact strategy.
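
The masking step is the easiest piece of that pipeline to sketch. The version below is a line-level filter with illustrative patterns. It is not the paper's masking procedure, which is not described here, and the pattern list is an assumption that would need extending per dataset.

```python
import re

# Illustrative patterns for lines that state a final answer outright.
ANSWER_LINE_PATTERNS = [
    # "Answer: 42", "The answer is B", "Final answer: 3.5"
    re.compile(r"^\s*(?:the\s+)?(?:final\s+)?answer\s*(?:is|:)", re.IGNORECASE),
    # GSM8K-style final-answer marker, e.g. "#### 30"
    re.compile(r"^\s*####"),
]


def mask_final_answer(rationale: str) -> str:
    """Drop whole lines that begin by stating the final answer; keep everything else."""
    kept = [
        line
        for line in rationale.splitlines()
        if not any(pattern.match(line) for pattern in ANSWER_LINE_PATTERNS)
    ]
    return "\n".join(kept)
```

For example, `mask_final_answer("Speed = 60 / 2 = 30\n#### 30")` returns only the first line. The filter matches at the start of a line, so an answer stated mid-sentence (“Therefore, the answer is 30.”) would slip through. A production pipeline would need a stricter check before rendering.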

That is more promising than it sounds. Many enterprise AI systems already create intermediate artifacts: JSON plans, retrieval traces, extracted tables, audit logs, workflow diagrams, and approval packets. Optical reasoning says one of those artifacts could be a visual canvas designed for machine consumption, not just human inspection.

What this means for business AI design

The paper directly shows that visual rationale representations can preserve or improve accuracy under controlled benchmark conditions while reducing reasoning-token budgets in many settings. It also shows that graphical visual rationales can outperform text and typographic variants in one AquaRat setup, and that layout/rendering decisions materially affect performance.

Cognaptus’ business inference is narrower but useful: multimodal agents may eventually benefit from visual intermediate representations when three conditions hold.

First, the task has structure that text handles inefficiently. Spatial relations, process dependencies, formulas, evidence maps, and diagrams are natural candidates.

Second, intermediate reasoning is reused. If a rationale is consumed only once, rendering overhead may not be worth it. If it is passed among agents, used for verification, or stored as an audit artifact, compactness and readability matter more.

Third, the organization can evaluate renderer-model-task combinations. Optical reasoning is not a plug-in optimization. It is a representation layer. Like any representation layer, it needs tests.

Business use case	Why optical reasoning might help	Main boundary
Multi-agent workflow handoffs	Compresses intermediate reasoning into a compact artifact	Requires model-specific renderer validation
Technical analysis and engineering review	Equations, diagrams, and spatial relations can live in one canvas	Graphical errors may mislead both model and user
Compliance and audit preparation	Visual rationale artifacts may be easier to inspect than long text traces	Visual reasoning must not replace source evidence
Operations and logistics planning	Layouts, flows, and dependencies are naturally visual	Token savings matter only if visual parsing remains reliable
Education and internal training tools	Step-aligned panels may improve explanation and checking	Pedagogical clarity is not the same as reasoning correctness

The ROI pathway is therefore not “images are cheaper than text.” That is too blunt. The better pathway is:

represent intermediate reasoning in a denser visual medium;
preserve enough useful reasoning signal for the downstream model;
reduce repeated token costs in agent pipelines;
improve spatial organization for tasks where linear text is awkward;
validate renderer reliability before using the artifact for decisions.

That last step is where the spreadsheet people return, as they always do.

The boundaries are not footnotes; they define the product design

The paper’s limitations are not generic academic modesty. They directly define what an enterprise implementation would need to control.

The first boundary is model-dependent perception. Different MLLMs respond differently to visual-token budgets, layout density, colors, fonts, and rendering backends. A model upgrade could change the optimal renderer. A vendor tokenizer change could alter the cost calculation. A task domain with dense tables may behave differently from one with geometric diagrams.

The second boundary is token accounting. The paper uses a uniform Qwen3-VL-style patch mapping to estimate visual token counts across models because closed-source systems do not disclose exact visual tokenization rules. This is reasonable for controlled comparison, but businesses should not treat the paper’s estimated visual token counts as exact invoice predictors.

The third boundary is rationale origin. Much of the main evaluation uses provided rationales. That isolates the representation question, but deployment usually requires generation, rendering, verification, and action. Each step can fail.

The fourth boundary is graphical reliability. G-OR’s expressiveness is exactly why it is risky. If the renderer invents or misplaces a relation, the resulting image may look more coherent than the underlying reasoning deserves. In a production setting, graphical rationale generation needs constraints, validation, and probably comparison against source text or symbolic structure.
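
One cheap form of comparison against source text is to check that every quantity in the original rationale still appears somewhere in the canvas. The sketch below assumes the canvas text is available, for instance from the panel annotations used to generate the image. It checks numbers only and says nothing about whether the geometry is drawn correctly.

```python
import re

NUMBER = re.compile(r"\d+(?:\.\d+)?")


def missing_quantities(source_rationale: str, canvas_text: str) -> list[str]:
    """Return numbers present in the source rationale but absent from the canvas text."""
    source_numbers = set(NUMBER.findall(source_rationale))
    canvas_numbers = set(NUMBER.findall(canvas_text))
    return sorted(source_numbers - canvas_numbers)
```

For instance, `missing_quantities("area = 48, side = 6", "side = 6")` returns `["48"]`, flagging a dropped value. A passing check is weak evidence at best: the misplaced red segment in the paper's case study would pass it, because the error is in placement, not in the numbers.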

In other words, optical reasoning is not just a model trick. It is a workflow design problem.

The deeper lesson: reasoning media are now design choices

For years, AI reasoning has been treated as a text phenomenon because language models output text. Multimodal systems weaken that assumption. If a model can consume images as structured token sequences, then an intermediate rationale does not have to be a paragraph. It can be a canvas.

That shift matters. Text is excellent for sequential explanation. It is less excellent for dense structure, spatial relations, and parallel evidence. Images can compress, group, align, and highlight. They can also distort, hallucinate, and seduce the reader with false clarity. A visual rationale is not automatically better. It is simply another medium with different economics and different failure modes.

The paper’s contribution is to make that design space visible. T-OR shows that visual rendering can preserve reasoning content under token pressure. G-OR shows that visual structure may add expressive power beyond compression. The ablations show that the interface layer matters. The limitations show why this is not yet a deployment recipe.

For business AI, that is the right kind of research result: not a miracle, not a product, but a mechanism worth testing.

The next generation of agent systems may not pass around endless text traces. They may pass compact visual workspaces: diagrams, equations, evidence panels, spatial plans, and compressed rationale canvases. The clever part will not be making those canvases pretty. It will be making them faithful, cheap, model-readable, and auditable.

Thinking in pictures is not new for humans. For AI systems, it may become a serious engineering option. Finally, a reason for the model to draw a diagram that is not just pretending to be helpful.

Cognaptus: Automate the Present, Incubate the Future.

¹ Yutong Bian, Dongjie Cheng, Heming Xia, Yongqi Li, and Wenjie Li, “Optical Reasoning: Rethinking Images as an Expressive Reasoning Medium Beyond Text,” arXiv:2606.09585, 2026, https://arxiv.org/abs/2606.09585.
