Opening — Why this matters now

Large language models have become surprisingly good at producing correct answers. Unfortunately, that is not the same thing as thinking correctly.

For years, most benchmarks for multimodal AI — systems that combine vision and language — have evaluated models based solely on their final answers. If the answer is correct, the model passes. If not, it fails. Simple.

Simple… and deeply misleading.

A new research paper introduces CRYSTAL, a benchmark designed to expose the reasoning process behind multimodal AI decisions. The premise is straightforward but powerful: if an AI model cannot justify its answer through logically consistent intermediate steps, then its “correct” answer might be nothing more than an intelligent guess.

And as it turns out, that happens far more often than many benchmarks would suggest.


Background — The illusion of accuracy

Modern multimodal large language models (MLLMs) combine visual encoders with powerful language models to reason about images and text simultaneously. Benchmarks such as MathVista and RealWorldQA have helped measure progress in these systems.

However, nearly all of these benchmarks evaluate only the final answer. This creates what researchers call the “lucky guess” problem.

Imagine an image containing three objects, where the question asks: Which object is the smallest?

A model might output the correct option. But if its reasoning states that the chosen object is “larger than the others,” the model has clearly contradicted itself.

Traditional evaluation frameworks still award full credit.

The result is a measurement system that rewards outcomes rather than reasoning — effectively encouraging shortcuts.

| Evaluation Style | What It Measures | Hidden Risk |
|---|---|---|
| Answer-only benchmarks | Correct final output | Lucky guesses appear as intelligence |
| Reasoning-aware benchmarks | Logical steps + final answer | Reveals contradictions and hallucinations |

CRYSTAL attempts to correct this imbalance.


Analysis — What the CRYSTAL benchmark actually does

CRYSTAL (Clear Reasoning via Yielded Steps, Traceability and Logic) is designed to evaluate not only what models answer, but how they arrive at that answer.

The benchmark contains 6,372 multimodal reasoning instances, each including:

  1. An image and question
  2. A reference reasoning trajectory
  3. A final answer

Instead of evaluating only the final answer, CRYSTAL compares the model’s reasoning chain against reference reasoning steps.

Two metrics power the evaluation.

Match F1

Match F1 measures whether the reasoning steps produced by the model align with the reference reasoning steps.

It evaluates both:

  • Precision: How many predicted steps are correct
  • Recall: How many necessary reasoning steps the model captured

| Metric | Meaning |
|---|---|
| Precision | Correct reasoning steps produced |
| Recall | Required reasoning steps captured |
| Match F1 | Balance between the two |

A model that produces correct answers but incorrect reasoning will have low Match F1.
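The mechanics of this metric can be sketched in a few lines. This is an illustrative implementation, not the benchmark's official one: it assumes each reasoning step is a normalized string matched by exact equality, whereas the actual evaluation likely uses a semantic matcher. The step strings below are invented for the example.

```python
def match_f1(predicted, reference):
    """Set-based Match F1 over reasoning steps (illustrative sketch).

    Assumes steps are normalized strings compared by exact equality;
    the real benchmark likely uses semantic matching instead.
    """
    pred, ref = set(predicted), set(reference)
    if not pred or not ref:
        return 0.0
    matched = pred & ref               # steps shared with the reference
    precision = len(matched) / len(pred)
    recall = len(matched) / len(ref)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Model produced 2 of 3 required steps, plus 1 spurious step:
ref = ["locate objects", "compare sizes", "select smallest"]
pred = ["locate objects", "select smallest", "describe colors"]
print(round(match_f1(pred, ref), 3))  # 0.667
```

Even a model that answers correctly scores only 0.667 here, because one required step is missing and one produced step is irrelevant.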

Ordered Match F1

Reasoning is not just about what steps appear — but also when they appear.

Ordered Match F1 adds an additional constraint: reasoning steps must appear in the correct logical sequence.

| Scenario | Result |
|---|---|
| Correct steps in correct order | High score |
| Correct steps in wrong order | Penalized |
| Missing steps | Penalized |

This exposes models that assemble fragments of reasoning without coherent logical flow.
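One plausible way to make the metric order-sensitive is to count only steps that appear in the same relative order as the reference, via a longest common subsequence. The paper's exact formulation may differ; this sketch reuses the invented step strings from above.

```python
def ordered_match_f1(predicted, reference):
    """Order-sensitive Match F1 sketch: only steps in the same relative
    order as the reference count as matched (longest common subsequence).
    Illustrative; the benchmark's exact definition may differ."""
    m, n = len(predicted), len(reference)
    # Standard LCS dynamic program over the two step sequences.
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if predicted[i] == reference[j]:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    lcs = dp[m][n]
    if m == 0 or n == 0 or lcs == 0:
        return 0.0
    precision, recall = lcs / m, lcs / n
    return 2 * precision * recall / (precision + recall)

ref = ["locate objects", "compare sizes", "select smallest"]
in_order = ["locate objects", "compare sizes", "select smallest"]
shuffled = ["select smallest", "locate objects", "compare sizes"]
print(ordered_match_f1(in_order, ref))        # 1.0
print(round(ordered_match_f1(shuffled, ref), 3))  # 0.667
```

Note that the shuffled chain contains exactly the same three steps, so a set-based Match F1 would award it a perfect score; only the order-sensitive variant penalizes it.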


Findings — The uncomfortable results

The authors evaluated 20 multimodal models, including several frontier commercial systems.

The results reveal three consistent problems.

1. Cherry-picking reasoning

Models tend to generate a few correct reasoning steps while omitting many others.

| Observation | Interpretation |
|---|---|
| High precision | Some steps correct |
| Low recall | Many steps missing |

In other words, models often display fragments of reasoning rather than full logical chains.

2. Disordered reasoning

Even when models produce the correct steps, they frequently appear in the wrong order.

No competitive model preserved more than 60% of reasoning steps in the correct sequence.

This suggests that many models reconstruct explanations after generating an answer rather than deriving the answer from reasoning.

3. Non-monotonic scaling

Increasing model size does not reliably improve reasoning quality.

| Model Size | Accuracy | Reasoning Quality |
|---|---|---|
| Small | Lower | Often similar |
| Medium | Higher | Slight improvement |
| Frontier | High | Still flawed |

Bigger models do not automatically mean better reasoning.


Training implications — A new reward design

Beyond benchmarking, the paper proposes a training method called Causal Process Reward (CPR).

Most reinforcement learning approaches use additive rewards:

$$ Reward = Accuracy + ReasoningQuality $$

This allows models to maximize reward by prioritizing accuracy alone.

CRYSTAL instead proposes a multiplicative reward structure.

$$ Reward = Accuracy \times ReasoningAlignment $$

If reasoning is poor, the entire reward collapses — even when the answer is correct.
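The difference between the two reward structures is easiest to see numerically. The weighting in the additive variant and the sample values below are assumptions for illustration, not figures from the paper.

```python
def additive_reward(accuracy, alignment, w=0.5):
    """Additive reward: accuracy alone can carry the score.
    The 50/50 weighting is an assumed example, not the paper's."""
    return w * accuracy + (1 - w) * alignment

def multiplicative_reward(accuracy, alignment):
    """CPR-style multiplicative reward: weak reasoning alignment
    collapses the reward even when the answer is correct."""
    return accuracy * alignment

# Correct answer, near-zero reasoning alignment:
acc, align = 1.0, 0.1
print(round(additive_reward(acc, align), 2))   # 0.55 — still well rewarded
print(multiplicative_reward(acc, align))       # 0.1  — reward collapses
```

Under the additive scheme, a lucky guesser keeps more than half the maximum reward; under the multiplicative scheme, it keeps almost none, so the training signal pushes the model toward coherent reasoning.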

The researchers further introduce CPR-Curriculum, which gradually increases reasoning difficulty during training.

This approach reportedly improves Match F1 by 32% in experiments using GRPO optimization.

| Training Strategy | Result |
|---|---|
| Standard RL | Minimal reasoning improvement |
| CPR | Better reasoning alignment |
| CPR + Curriculum | Largest improvement |

Implications — Why businesses should care

For enterprises deploying AI systems, the implications are significant.

Many real-world use cases depend on traceable reasoning, not just correct answers.

Examples include:

  • Financial analysis
  • Medical decision support
  • Legal document interpretation
  • Autonomous agents interacting with complex environments

If models can reach the right answer through flawed logic, their reliability in critical settings becomes questionable.

Benchmarks like CRYSTAL push the industry toward process-level accountability, where AI must demonstrate how conclusions are reached.

This is particularly relevant as regulators increasingly demand transparency in automated decision systems.


Conclusion — Intelligence should leave footprints

For decades, AI evaluation has focused on results.

CRYSTAL reminds us that reasoning is a process, not a destination.

Correct answers without coherent reasoning are not intelligence — they are coincidences dressed up in probability.

As multimodal systems become embedded in economic infrastructure, evaluation frameworks that expose reasoning quality will likely become as important as model accuracy itself.

In other words: the future of AI benchmarking may look less like a final exam and more like showing your work on the blackboard.

Cognaptus: Automate the Present, Incubate the Future.