Screenshot.
That is where many business workflows quietly change the problem.
A support agent receives a screenshot of a customer bill instead of the billing table as text. A contract review tool receives a scanned clause instead of the clause extracted from the PDF. A procurement assistant receives a rendered purchase order, not the original form fields. Everyone involved assumes the content is the same. The model can read it. The OCR looks correct. The answer should be the same.
That assumption is convenient. It is also the exact assumption this paper makes uncomfortable.
In "Same Content, Different Answers: Cross-Modal Inconsistency in MLLMs," Angela van Sprang, Laurens Samson, Ana Lucic, Erman Acar, Sennay Ghebreab, and Yuki M. Asano introduce REST and REST+, two benchmarks designed to test whether multimodal large language models produce consistent answers when semantically identical information is presented as text, image, or a mixed prompt.[1] The point is not merely that image inputs are sometimes harder. Everyone already suspected that. The sharper point is that even after controlling for OCR, models may still answer differently because the same content enters the model through different representational routes.
In other words: reading is not reasoning. A model can transcribe the sign on the door and still walk into the wrong room. Very advanced, very expensive wrong room. Progress.
The failure begins after the model has already read the text
The common explanation for text-image failures is OCR. If a model answers incorrectly from an image, perhaps it simply failed to read the words. That explanation is often true in practical systems. Bad scans, rotated pages, tiny fonts, compression artifacts, and handwritten notes can destroy reliability before reasoning even starts.
REST is useful because it deliberately tries not to stop there. The benchmark first asks the model to transcribe the rendered text. Then, for consistency evaluation, it focuses on cases where OCR is correct. This matters because it separates recognition failure from reasoning inconsistency.
The paper’s core question is therefore more precise:
When the model has correctly read the rendered text, does it reason over that content the same way it reasons over native text?
REST tests this across four task sources: MMLU, AI2-ARC, GSM-Symbolic, and a newly generated system-of-equations benchmark called SoEBench. Each sample is presented in three formats:
| Format | What the model receives | Why it matters |
|---|---|---|
| Text | The question as ordinary text | Baseline language pathway |
| Image | The same question rendered as an image | Visual pathway with equivalent semantic content |
| Mixed | Part image, part text | Cross-modal integration pathway |
The mixed condition is important but easy to misread. It is not a clean midpoint between text and image. In multiple-choice tasks, for example, the question context may be rendered as an image while the answer options remain text. In open-ended tasks, the context may be an image and the final question plain text. The appendix flips some of these designs and finds small differences without a universally easier configuration. So the mixed setting is best read as a test of integration, not as a tidy average of two modalities.
REST measures invariance, not just accuracy
A normal benchmark asks: did the model get the answer right?
REST asks a more operationally annoying question: did the model give the same answer when the same content arrived through different channels?
The paper uses three main metrics:
| Metric | What it asks | Business interpretation |
|---|---|---|
| Render-Equivalence Rate (RER) | What fraction of questions receive the same answer across modalities? | Format invariance |
| Cross-Modality Failure Rate (CFR) | Among questions solved in at least one modality, how often does the model fail in at least one other modality? | Modality-dependent lost capability |
| Max Modal Coverage (MMC) | What fraction of questions can be solved in at least one modality? | Hidden potential if routing were perfect |
RER is the headline consistency measure. A model with high RER behaves more like a system that understands content independent of format. A model with low RER behaves more like several related systems wearing one product name.
CFR is more painful for operations. It excludes questions the model simply cannot solve in any modality. It focuses on questions where the model has demonstrated the capability somewhere, then asks whether that capability survives format changes. That is exactly the business problem. A system that can solve a case from text but fails from an OCR-correct screenshot is not merely weak. It is unstable.
MMC adds a diagnostic twist. If a question can be solved in at least one modality, then a better router, preprocessing pipeline, or validation strategy may recover performance. This turns inconsistency from a philosophical defect into an engineering problem: the answer may exist inside the system, but only behind the right input door.
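The three metrics can be sketched in a few lines. This is a minimal illustration, not the paper's evaluation code: the `Sample` record, the exact-match comparison of answers, and the choice to count RER over identical answers regardless of correctness are all assumptions made for this sketch.

```python
from dataclasses import dataclass

MODALITIES = ("text", "image", "mixed")


@dataclass
class Sample:
    gold: str      # reference answer
    answers: dict  # modality -> model answer (hypothetical record layout)


def rest_metrics(samples):
    """Return RER, CFR, and MMC as fractions in [0, 1]."""
    n = len(samples)
    same_answer = 0     # identical answer in every modality, right or wrong
    solved_any = 0      # correct in at least one modality
    solved_not_all = 0  # correct in at least one modality, but not in all

    for s in samples:
        given = [s.answers[m] for m in MODALITIES]
        correct = [a == s.gold for a in given]
        if len(set(given)) == 1:
            same_answer += 1
        if any(correct):
            solved_any += 1
            if not all(correct):
                solved_not_all += 1

    return {
        "RER": same_answer / n if n else 0.0,
        "MMC": solved_any / n if n else 0.0,
        # CFR is conditioned on questions the model can solve somewhere.
        "CFR": solved_not_all / solved_any if solved_any else 0.0,
    }
```

Note the different denominators: RER and MMC are taken over all questions, while CFR only looks at questions solved in at least one modality, which is why a model can combine a respectable RER with an uncomfortable CFR.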
The main result: stronger models are better, but not invariant
Across 15 multimodal models, REST finds substantial variation in cross-modal consistency. On the OCR-correct subset, RER ranges from 6.6% for DeepSeek-Tiny to 90.7% for GPT-5-mini. Claude Haiku 4.5 is close at 90.3%. Among the open-source models evaluated, Qwen-2.5 32B leads with an RER of 84.7%.
That sounds reassuring for frontier models until one reads the companion number. Even GPT-5-mini has a CFR of 8.7% on OCR-correct REST questions. That means that among questions it can answer correctly in at least one modality, it still fails to answer them consistently across all modalities in nearly one in eleven cases. For weaker models, the failure rate is not a rounding error. Phi-4’s CFR reaches 82.3% on the OCR-correct REST average.
The paper also reports a general preference for text modality across MMLU, AI2-ARC, and GSM-Symbolic. For example, GPT-4o-mini scores 84.4% on MMLU text but 77.0% on the image version. Gemini-2.5 Flash Lite shows a larger GSM-Symbolic gap: 79.4% text versus 50.5% image. These differences are not merely cosmetic. They imply that formatting can shift a model from competent to fragile without changing the underlying question.
But the paper is careful enough not to reduce the whole story to “text always wins.” SoEBench complicates the picture. It is newly generated, designed to avoid prior exposure, and uses a restricted symbol set to keep OCR simple. Most models achieve near-perfect OCR there, yet consistency problems remain. Some models still do best in text, but not all: Phi-4 and Gemini-2.5 Flash Lite perform better in the image modality on SoEBench. That is the more interesting conclusion. The issue is not a universal text advantage. The issue is modality dependence.
For enterprise use, that difference matters. A universal text advantage would suggest a simple policy: always extract text. Modality dependence suggests something less elegant: input format interacts with task type, model family, rendering choices, and probably the model’s internal alignment between visual and textual representations. We are back in engineering, where elegance often goes to die.
OCR-first helps sometimes, which is not the same as solving the problem
A natural mitigation is to ask the model to transcribe the image first, then solve the transcribed problem. This is a reasonable workflow. In many business systems, it should still be part of the pipeline.
The paper tests this too. The result is mixed. OCR-first improves some model-task combinations but hurts others. GPT-5-mini sees small gains across the reported tasks. Phi-4 improves on several established benchmarks but drops on SoEBench. Qwen-2.5 32B gains strongly on SoEBench but loses accuracy on MMLU and AI2-ARC. DeepSeek-Small drops sharply on MMLU and AI2-ARC.
This is not a contradiction. OCR-first changes the computation. It may simplify visual processing, but it also creates a new intermediate representation, adds another instruction-following step, and may remove visual layout cues. For rendered plain text, this can help. For mixed or structured inputs, it may also introduce a second chance to be subtly wrong.
The practical lesson is simple: OCR-first is a tool, not a theorem. It should be benchmarked as a routing strategy, not assumed as a universal repair.
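Treating OCR-first as a routing strategy could look roughly like the sketch below. It assumes you have already collected graded results for both strategies on your own tasks; the tuple layout and the strategy labels `"direct"` and `"ocr_first"` are hypothetical names chosen for this example.

```python
from collections import defaultdict


def compare_strategies(records):
    """Compare direct image solving with OCR-first solving, task by task.

    records: iterable of (task, strategy, is_correct) tuples, where strategy
    is "direct" or "ocr_first" (hypothetical labels for this sketch).
    """
    totals = defaultdict(lambda: [0, 0])  # (task, strategy) -> [correct, seen]
    for task, strategy, is_correct in records:
        cell = totals[(task, strategy)]
        cell[0] += int(is_correct)
        cell[1] += 1

    report = {}
    for task in sorted({task for task, _ in totals}):
        direct = totals.get((task, "direct"))
        ocr_first = totals.get((task, "ocr_first"))
        if direct is None or ocr_first is None:
            continue  # cannot compare without results for both strategies
        acc_direct = direct[0] / direct[1]
        acc_ocr = ocr_first[0] / ocr_first[1]
        report[task] = {
            "direct": acc_direct,
            "ocr_first": acc_ocr,
            # Ties go to the direct route, which has one fewer step.
            "preferred": "ocr_first" if acc_ocr > acc_direct else "direct",
        }
    return report
```

The per-task breakdown is the point: a single pooled accuracy number would hide exactly the pattern the paper reports, where the same model gains on one task family and loses on another.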
REST+ shows that rendering details can become reasoning details
REST establishes that equivalent content can produce different answers across text, image, and mixed formats. REST+ asks a narrower but operationally useful question: if the same text is rendered as an image in different ways, do visual characteristics change performance?
REST+ uses MMLU questions and creates 10 visual permutations per question: three fonts, three resolutions, and one color variant at 200 DPI. For computational feasibility, it evaluates only text and image conditions, using 1,085 questions sampled across subject classes.
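One way to read the count of 10 is as a three-by-three grid of fonts and resolutions plus one colored rendering. The sketch below enumerates that grid under this reading. The font names and DPI values are those mentioned elsewhere in this article; the specific color and the font used for the colored variant are placeholders, not details confirmed here.

```python
from itertools import product

FONTS = ("DejaVu Sans", "Courier New", "Cursive")
DPIS = (50, 100, 200)

# Assumption: which color and which font the colored variant uses is a
# placeholder for illustration only.
COLOR_VARIANT = {"font": "DejaVu Sans", "dpi": 200, "color": "red"}


def render_variants():
    """Enumerate rendering settings: a 3 x 3 font/DPI grid plus one color variant."""
    variants = [
        {"font": font, "dpi": dpi, "color": "black"}
        for font, dpi in product(FONTS, DPIS)
    ]
    variants.append(dict(COLOR_VARIANT))
    return variants


if __name__ == "__main__":
    for variant in render_variants():
        print(variant)
```

The same enumeration pattern works for an internal test set: swap in the fonts, scan resolutions, and color schemes your own documents actually arrive in.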
The result is another small insult to the idea that “content is content.” Consistency scores on REST+ are lower than on REST. On the OCR-correct subset, RER ranges from 5.8% to 72.1%. InternVL3 14B achieves the highest REST+ consistency, while GPT-5-mini drops to 67.6%. On the complete dataset, where OCR errors are included, scores are lower still.
The paper’s REST+ experiments serve different evidentiary roles:
| Test | Likely purpose | What it supports | What it does not prove |
|---|---|---|---|
| Resolution / DPI | Robustness and sensitivity test | Lower resolution can reduce OCR and consistency for some models; some models are more stable than others | That higher DPI always improves reasoning once OCR is correct |
| Font family | Sensitivity test | Font choice has no clear consistent effect after OCR control | That typography never matters in messy real documents |
| Text color | Exploratory sensitivity test | Several models perform better with red or yellow text than black in the tested setting | That colored enterprise documents should be preferred |
| Token usage | Efficiency analysis | Vision tokens are often less efficient than text tokens for equivalent text content | That vision-token compression is impossible |
| Chess images | Exploratory extension beyond rendered text | Inconsistency also appears in a natural-image same-content setting | That all real-world visual tasks behave like rendered text |
The color result is especially amusing because it is both real in the benchmark and dangerous to overinterpret. Most models perform better with colored text than black text, and some show more than 5% relative improvement with red or yellow. This does not mean compliance teams should start painting invoices like children’s cereal boxes. It means visual preprocessing can alter performance even when the semantic content is unchanged.
Resolution has a more obvious operational interpretation. At 50 DPI, some models lose OCR and consistency, while others remain relatively stable. Claude Haiku 4.5, for example, shows much lower OCR and RER at 50 DPI in the complete dataset than at 100 or 200 DPI. Gemini is more stable across DPI. InternVL3 14B also maintains relatively consistent behavior. The lesson is not “use 200 DPI and relax.” The lesson is that rendering quality belongs in the evaluation matrix.
Font family, by contrast, has no clear systematic effect in the OCR-correct subset. Most models stay close across DejaVu Sans, Courier New, and Cursive. That is a useful negative result. Not every visual variation matters equally.
Token efficiency is the cost side of the same problem
One reason the paper matters now is that rendering text as images has been explored as a way to reduce token cost. If visual tokens could represent text compactly without losing reasoning quality, long-context workflows might become cheaper.
REST+ makes that promise less automatic.
The paper compares text-token and vision-token efficiency across models where token counts are available. Most models require more vision tokens to achieve comparable performance, or perform better with text despite fewer text tokens. InternVL3 14B, the most consistent model on REST+, uses roughly 1,600 visual tokens for images while needing about 160 text tokens on average—a 10:1 ratio. It may perform well, but not cheaply.
Qwen-2.5 32B is the interesting exception. At 50 DPI, it uses about 104 image tokens on average versus 142 text tokens and obtains 1.8% higher image accuracy than text. That exception matters because it shows the direction is not impossible. But the fact that it is an exception also matters. A cost-saving strategy based on rendering text as images cannot be adopted from a model card or a vendor demo. It has to be tested on the actual model, task, and document style.
For business users, token efficiency is not a separate technical curiosity. It determines whether a workflow can scale economically. A modality that saves input tokens but increases inconsistency may simply move cost from compute to review, escalation, and rework. The invoice is still paid. It just arrives later and looks like operational friction.
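The trade-off can be made concrete with a little arithmetic. The token counts below are the averages quoted above; the price per token, rework rates, and rework cost are invented purely for illustration and should be replaced with your own measurements.

```python
def token_ratio(vision_tokens, text_tokens):
    """Vision tokens spent per text token for the same content."""
    return vision_tokens / text_tokens


def expected_cost(tokens, price_per_token, rework_rate, rework_cost):
    """Expected cost per document: input tokens plus the expected cost of rework."""
    return tokens * price_per_token + rework_rate * rework_cost


if __name__ == "__main__":
    # Averages quoted in the article.
    print(token_ratio(1600, 160))  # InternVL3 14B
    print(token_ratio(104, 142))   # Qwen-2.5 32B at 50 DPI

    # Hypothetical numbers: cheaper image input, but more cases sent to review.
    image_route = expected_cost(104, 1e-5, rework_rate=0.10, rework_cost=2.0)
    text_route = expected_cost(142, 1e-5, rework_rate=0.02, rework_cost=2.0)
    print(image_route, text_route)
```

With these made-up inputs the image route saves tokens and still costs more per document, because the rework term dominates. Whether that holds in a real deployment depends entirely on the measured inconsistency rate.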
The representation analysis gives the mechanism its shape
The strongest part of the paper is not only the benchmark result. It is the attempt to connect behavioral inconsistency to internal representation.
The authors analyze open-source models because hidden activations are needed. They compare representations across modalities using cosine similarity and retrieval-style alignment. The setup includes natural object images, written-down labels, and text words. The question is whether matching samples across modalities occupy similar regions in the model’s internal space.
The finding: models with higher cross-modal similarity tend to have higher REST consistency. Written-down label images are more similar to matching word representations than natural images are, but the broader pattern supports the mechanism-first interpretation. When text and image representations of the same concept are closer inside the model, the model is more likely to behave consistently across modalities.
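A toy version of this kind of analysis is sketched below. It assumes you already have one embedding vector per concept for each modality, paired by index; how the paper extracts and pools hidden activations is not reproduced here, and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def alignment_scores(text_vecs, image_vecs):
    """Measure how closely paired text and image embeddings line up.

    text_vecs[i] and image_vecs[i] are assumed to represent the same concept.
    """
    n = len(text_vecs)
    matched = [cosine(text_vecs[i], image_vecs[i]) for i in range(n)]

    # Retrieval-style check: is the nearest image embedding the matching one?
    hits = 0
    for i, text_vec in enumerate(text_vecs):
        sims = [cosine(text_vec, image_vec) for image_vec in image_vecs]
        if max(range(n), key=sims.__getitem__) == i:
            hits += 1

    return {
        "mean_matched_cosine": sum(matched) / n,
        "top1_retrieval": hits / n,
    }
```

The two numbers answer slightly different questions. Mean cosine says how close matching pairs are in absolute terms; top-1 retrieval says whether a matching pair is closer than every mismatched pair, which is the more demanding notion of alignment.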
This does not prove causality. The authors say as much. A future intervention would need to make representations more similar and then test whether RER rises. Still, the correlation is valuable because it gives the benchmark a plausible mechanism. Cross-modal inconsistency is not just a surface prompting bug. It may reflect a deeper modality gap: the model stores “same content” in different representational neighborhoods depending on how it arrives.
That is exactly why OCR correctness is insufficient. OCR can verify that the model saw the right symbols. It cannot guarantee those symbols entered the same reasoning geometry.
What this means for document automation and multimodal agents
The direct business conclusion is not “avoid multimodal models.” That would be melodramatic, and melodrama is what people use when evidence is unavailable.
The better conclusion is that modality must become an explicit reliability dimension.
For document-heavy automation, the paper suggests four practical design rules.
First, prefer native text extraction when the source allows it. If a PDF contains extractable text, sending screenshots to a multimodal model should not be the default. Screenshots may be convenient for demos, but native text often travels through the model’s stronger pathway.
Second, evaluate OCR and reasoning separately. A pipeline that reports “OCR correct” has not finished validation. It has only cleared the first gate. The second gate is whether the model gives equivalent decisions across equivalent input formats.
Third, test rendering sensitivity for high-volume workflows. DPI, color, font, cropping, and page segmentation may all become hidden performance variables. REST+ shows that not every rendering choice matters equally, but some do. A robust deployment should include a small “render-equivalence” test set built from real business documents.
Fourth, route high-stakes cases through cross-format validation. If the same contract clause, claim form, or financial table produces different answers when supplied as extracted text and rendered image, that disagreement is a signal. It should trigger review, not be averaged into confidence. A model that disagrees with itself is doing free risk detection. One should be polite and use the gift.
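The fourth rule is simple to prototype. The sketch below assumes the same document has already been sent to the model in each format and the answers collected; the normalization step and the status labels are hypothetical choices, and real answers will usually need task-specific comparison rather than string matching.

```python
def normalize(answer):
    """Lowercase and collapse whitespace so trivial differences do not count."""
    return " ".join(answer.strip().lower().split())


def route_decision(answers):
    """Decide whether to accept an answer or escalate it.

    answers: dict mapping a format label (e.g. "text", "image") to the model's
    answer for the same underlying content.
    """
    normalized = {fmt: normalize(ans) for fmt, ans in answers.items()}
    distinct = set(normalized.values())
    if len(distinct) == 1:
        return {"status": "accept", "answer": next(iter(distinct)), "formats": []}
    # Disagreement is treated as a review signal, never averaged away.
    return {"status": "human_review", "answer": None, "formats": sorted(normalized)}


if __name__ == "__main__":
    print(route_decision({"text": "Net 30", "image": "net  30"}))
    print(route_decision({"text": "Net 30", "image": "Net 60"}))
```

Agreement across formats does not prove the answer is right, since the model can be consistently wrong. The check only catches the cases where the model contradicts itself, which is why it belongs alongside accuracy testing rather than in place of it.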
A compact operational framework would look like this:
| Workflow decision | Recommended test | Failure signal |
|---|---|---|
| Use extracted text or screenshot? | Compare answers on paired native-text and rendered-image inputs | Different answers despite equivalent content |
| Add OCR-first step? | Benchmark direct image solving versus OCR-first solving | OCR-first improves recognition but lowers task accuracy |
| Accept a document rendering standard? | Test DPI, color, and font variants on representative samples | Large performance changes from cosmetic rendering |
| Deploy multimodal agent in production? | Measure RER-like consistency on internal tasks | High accuracy in one modality but low invariance |
| Escalate to human review? | Run cross-format disagreement checks | Same content, different decision |
This is not exotic AI governance. It is ordinary software testing applied to a system that happens to have eyes and opinions.
The boundaries are real, but they do not rescue the lazy interpretation
The main REST setting uses rendered text. That is appropriate for isolating modality effects, but it is not the same as messy enterprise visual input. Real documents contain tables, logos, stamps, signatures, handwriting, skew, compression, diagrams, and layout semantics. REST tells us that inconsistency exists even in controlled conditions. It does not measure every failure mode of production document AI.
The paper’s natural chess extension helps broaden the claim. It constructs same-content chess positions as natural image, generated image, and text, then asks yes-or-no questions. The results show cross-modal inconsistency beyond typographic rendering, with RER ranging from 58.5% to 77.5% and accuracy differences across modalities reaching up to 35% for GPT-5-mini. This is useful evidence, but it is still an exploratory extension, not a complete map of all natural-image reasoning.
The model list is also time-specific. Rankings will change. They always do. The more durable lesson is not that one named model is safe and another is unsafe. The durable lesson is that model selection should include equivalence testing. A model that leads on a general multimodal benchmark may still behave inconsistently under REST+ style perturbations. The appendix even notes that high MMMU scores correlate with REST and REST+ in general, but high-scoring models do not necessarily remain highly consistent under REST+.
Finally, the representation analysis is correlational. It points toward the modality gap as a plausible mechanism, but does not prove that closing the representational gap will automatically solve the behavior. Future work needs intervention, not just measurement.
These boundaries matter. They keep the article from pretending the paper solved enterprise multimodal reliability. It did not. It gave us a sharper diagnostic lens. That is already useful.
Same content is not the same workflow
The business fantasy of multimodal AI is that format becomes irrelevant. Text, image, chart, table, screenshot, scanned PDF—just throw it into the model and let the shared representation do its elegant little dance.
REST and REST+ show a less convenient picture. The model may read the same content, but the route still matters. Text and image can remain different worlds inside the system. Sometimes the model crosses cleanly. Sometimes it does not. Sometimes a red rendering behaves better than a black one, because apparently the future of intelligence still has opinions about font color. We endure.
For builders, the lesson is practical: do not validate multimodal workflows only by task accuracy. Validate invariance. Ask whether semantically equivalent inputs produce equivalent outputs. Track where OCR ends and reasoning begins. Treat disagreement across modalities as a first-class failure mode.
The quiet danger is not that multimodal models cannot read. The quiet danger is that they can read, and still disagree with themselves.
Cognaptus: Automate the Present, Incubate the Future.
[1] Angela van Sprang, Laurens Samson, Ana Lucic, Erman Acar, Sennay Ghebreab, and Yuki M. Asano, “Same Content, Different Answers: Cross-Modal Inconsistency in MLLMs,” arXiv:2512.08923v2, 2026.