Opening — Why this matters now

Multimodal LLMs promised a unified cognitive layer — one model that could see, read, and reason without switching mental gears. In reality, the industry has quietly tolerated a lingering flaw: the same question, when shown as text or rendered as an image, often yields different answers. As enterprises push MLLMs into document-heavy workflows, compliance systems, and vision-driven automation, this inconsistency becomes more than a research curiosity — it becomes operational risk.

A new paper, “Same Content, Different Answers: Cross-Modal Inconsistency in MLLMs,” cuts straight into this fracture line. The authors introduce REST and REST+, a pair of benchmarks designed to stress-test MLLMs by presenting identical inputs across text, image, and mixed modalities. The results? Even frontier models wander — sometimes wildly.

Background — Context and prior art

Vision–language models have grown up on the assumption that aligning text and image embeddings into a shared latent space ensures coherent multimodal reasoning. Prior work already hinted at the truth: this space is less “shared” and more “adjacent suburbs with poor public transport.”

Two key questions remained blurry:

  1. How much inconsistency is due to bad OCR?
  2. How much is rooted in deeper representational misalignment — the infamous modality gap?

Existing benchmarks confounded the two. The authors address this directly: keep the content identical, keep OCR trivial, and see whether the cracks remain.

Analysis — What the paper does

The authors introduce two benchmarks:

REST (Render‑Equivalence Stress Test)

REST evaluates consistency across four tasks: OCR (a check that the model can actually read the rendered content) plus the Text, Image, and Mixed question formats. It draws questions from:

  • MMLU (broad knowledge)
  • ARC (reasoning)
  • GSM8K-Symbolic (math)
  • SOEBENCH — a new system-of-equations benchmark crafted to avoid memorization and limit OCR complexity

Each question is rendered into three formats (text, image, mixed). The model must give the same answer across formats to be considered consistent.
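
To make that criterion concrete, here is a minimal sketch of a per-question consistency check. The helper and its strip/lowercase normalization are illustrative assumptions, not REST's actual answer-matching logic.

```python
from typing import Dict

def is_consistent(answers: Dict[str, str]) -> bool:
    """True if the model gave the same normalized answer in every format.

    `answers` maps a format name ("text", "image", "mixed") to the model's
    answer string. The strip/lowercase normalization is a placeholder, not
    REST's actual matching rule.
    """
    normalized = {fmt: ans.strip().lower() for fmt, ans in answers.items()}
    return len(set(normalized.values())) == 1

# Same question, three renderings, one divergent answer -> inconsistent
print(is_consistent({"text": "42", "image": "42", "mixed": "36"}))  # False
```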

REST+

REST+ pushes harder: every question is rendered in 10 visual variations that vary font, resolution (50/100/200 DPI), and colour (six variants). The idea is simple: if the content is identical, does superficial visual styling still sway the model?

It does.
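
For intuition, here is a rough sketch of how such styled renderings could be produced with Pillow. The DPI-as-scaling shortcut, canvas size, and default-font fallback are simplifications of mine, not the paper's rendering pipeline.

```python
from typing import Optional

from PIL import Image, ImageDraw, ImageFont

def render_question(text: str, colour: str = "black", dpi: int = 100,
                    font_path: Optional[str] = None) -> Image.Image:
    """Render a question string as an image in a given colour and resolution.

    DPI is simulated by scaling canvas and font size against a 100-DPI
    baseline; pass a TTF path via `font_path` to vary the typeface.
    """
    scale = dpi / 100
    size = max(8, int(16 * scale))
    font = (ImageFont.truetype(font_path, size) if font_path
            else ImageFont.load_default())  # fixed-size bitmap fallback
    img = Image.new("RGB", (int(800 * scale), int(200 * scale)), "white")
    ImageDraw.Draw(img).multiline_text((10, 10), text, fill=colour, font=font)
    return img

# Two styling extremes along REST+'s axes: low-DPI red vs. high-DPI black text
low_dpi_red = render_question("Solve: 3x + 2y = 12, x - y = 1", colour="red", dpi=50)
high_dpi_black = render_question("Solve: 3x + 2y = 12, x - y = 1", dpi=200)
```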

Findings — Results with visualization

Across 15 state-of-the-art models, several trends stand out.

1. No model is consistently consistent

Every model exhibits at least ~10% inconsistency, even when OCR is perfect.

2. Text still reigns supreme

Models systematically perform better on native text than on rendered text in images. Even when images are crystal-clear, reasoning quality degrades.

Modality   Performance Trend
Text       Best, stable
Mixed      Middle
Image      Worst (largest drop)

This raises a blunt point: MLLMs remain, fundamentally, text-first systems.

3. Visual “style” affects reasoning more than expected

Contrary to intuition:

  • Font choice barely matters
  • Colour does — red and yellow text often produce 5%+ accuracy gains
  • Lower DPI hurts some models disproportionately

This suggests models rely not only on semantic content but also on visual priors baked into training distributions.
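
To probe this kind of style sensitivity in your own evaluations, a simple group-by-attribute accuracy breakdown is enough to get started; the record schema and sample data below are illustrative, not REST+'s format.

```python
from collections import defaultdict

# One record per styled evaluation run (illustrative schema, toy data).
results = [
    {"colour": "black",  "dpi": 100, "correct": True},
    {"colour": "red",    "dpi": 100, "correct": True},
    {"colour": "red",    "dpi": 50,  "correct": False},
    {"colour": "yellow", "dpi": 200, "correct": True},
]

def accuracy_by(results, key):
    """Accuracy grouped by a single styling attribute (colour, dpi, font, ...)."""
    hits, totals = defaultdict(int), defaultdict(int)
    for r in results:
        totals[r[key]] += 1
        hits[r[key]] += int(r["correct"])
    return {value: hits[value] / totals[value] for value in totals}

print(accuracy_by(results, "colour"))  # {'black': 1.0, 'red': 0.5, 'yellow': 1.0}
print(accuracy_by(results, "dpi"))     # {100: 1.0, 50: 0.0, 200: 1.0}
```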

4. Embedding similarity correlates with consistency

The authors compute cross-modal cosine similarity between internal representations. When text and image embeddings for the same input are closer together, the model is more likely to answer consistently.
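
In spirit, the measurement looks like the sketch below. How the internal representations are pooled (which layer, which tokens) is model-specific, and the similarity and consistency data here is synthetic, included only to show the correlation step.

```python
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between pooled text and image embeddings."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

rng = np.random.default_rng(0)
text_emb, image_emb = rng.normal(size=768), rng.normal(size=768)
print(f"example cross-modal similarity: {cosine(text_emb, image_emb):.3f}")

# Synthetic per-example similarities and consistency flags, just to show the
# correlation step (Pearson r with a binary outcome is the point-biserial r).
sims = rng.uniform(0.2, 0.9, size=200)
consistent = (sims + rng.normal(0.0, 0.15, size=200)) > 0.55
r = np.corrcoef(sims, consistent.astype(float))[0, 1]
print(f"similarity-consistency correlation: {r:.2f}")
```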

A distilled takeaway: cross-modal inconsistency is a representational problem, not an OCR problem.

Implications — Next steps and significance

For practitioners deploying multimodal models, three lessons matter:

1. Multimodal isn’t “unified” reasoning yet

If your AI pipeline involves document scans, screenshots, camera feeds, or mixed-context inputs, expect answer drift — even when humans see no difference.

2. Token efficiency tricks come with hidden risks

Techniques that compress text into visual tokens (e.g., DeepSeek-OCR) may reduce cost, but they risk amplifying the modality gap: the model reads the content correctly yet reasons over it less reliably.

3. Evaluation frameworks must include consistency metrics

Accuracy alone is deceptive. High performance on text-only benchmarks cannot guarantee reliability in real-world multimodal settings.

Enterprise deployment teams should begin incorporating consistency scores, modality-drift audits, and representation alignment checks into their AI governance playbooks.
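
As a starting point for such an audit, tracking pairwise answer agreement between modalities is straightforward; the record format below is hypothetical.

```python
from itertools import combinations

# Answers one model gave to the same questions across modalities (toy data).
runs = [
    {"text": "B", "image": "B", "mixed": "B"},
    {"text": "A", "image": "C", "mixed": "A"},
    {"text": "D", "image": "D", "mixed": "B"},
]

def agreement_matrix(runs):
    """Fraction of questions on which each pair of modalities agrees."""
    return {
        (a, b): sum(r[a] == r[b] for r in runs) / len(runs)
        for a, b in combinations(runs[0].keys(), 2)
    }

print(agreement_matrix(runs))
# text/image and text/mixed agree on ~0.67 of questions; image/mixed on ~0.33
```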

For model developers, the findings hint at a research frontier: aligning embeddings across modalities could yield the next step-function improvement in robustness.

Conclusion — Wrap-up

REST and REST+ expose a subtle but consequential truth: multimodal LLMs remain partitioned minds. They can read, they can see, they can reason — but not always in the same way at the same time. Until the modality gap is closed, organizations must treat multimodal reasoning as a probabilistic system, not a uniform capability.

Cognaptus: Automate the Present, Incubate the Future.