Answers are cheap.

In a business setting, this is slightly annoying. A model reads a chart, extracts a number, answers a compliance question, classifies a product defect, or explains a visual inspection result. The answer lands in the dashboard. It looks clean. It may even be correct.

Then someone asks the only question that matters: how did it get there?

That is where the confidence often starts to leak. A model can choose the right option while describing the wrong object. It can read a graph with the right method and the wrong number. It can produce a neat answer after skipping half the reasoning steps that would make the answer auditable. The output looks like intelligence because the scoreboard only checks the last line.

The CRYSTAL paper is useful because it does not treat this as a philosophical complaint. It turns the complaint into an evaluation design: compare not only the final answer, but also the intermediate reasoning steps that should connect the image, the question, and the answer.[1]

The paper’s main point is not “models should explain themselves” in the usual conference-panel sense. Everyone says that. Usually while showing a slide with a lock icon and three arrows.

The sharper point is this: answer-only evaluation creates a shortcut economy. If the benchmark rewards the final answer and ignores the path, models can win by producing selective, incomplete, or disordered reasoning. CRYSTAL makes that shortcut visible.

The real failure is not wrong answers; it is uninspected causality

Most benchmark tables make AI progress look simple. Model A gets 57%. Model B gets 55%. Model C gets 49%. We rank them, draw a bar chart, and pretend the measurement problem is solved.

For multimodal models, this is especially dangerous. These systems must connect several kinds of evidence: visual perception, text recognition, spatial relations, counting, symbolic reasoning, chart interpretation, and sometimes commonsense physics. A final answer compresses all of that into one cell.

The compression is convenient. It is also where diagnosis dies.

A model that fails on a geometry question may have misunderstood the diagram. Or it may have read the diagram correctly but applied the wrong theorem. Or it may have followed the right theorem but made a numerical error. Or it may have guessed the correct multiple-choice option after producing nonsense reasoning. Accuracy treats these cases as either correct or incorrect. Operationally, they are different maintenance problems.

CRYSTAL’s mechanism is built around that distinction. It asks the model to output a structured list of atomic, checkable reasoning steps plus a final answer. These steps are not supposed to be private chain-of-thought theater. They are external evidence checks: visible objects, spatial relations, quantities, chart readings, and short inferences that can be compared against reference steps.

That design matters. A business does not need a model to narrate its soul. It needs a model to expose enough of its decision path that the user can tell whether the output came from grounded evidence or from a statistical shrug in a suit.
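
To make the output format concrete, here is a minimal sketch of what "a structured list of atomic, checkable reasoning steps plus a final answer" could look like as a data structure. The class name `ReasoningTrace`, its field names, and the chart example are illustrative assumptions, not the paper's schema.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class ReasoningTrace:
    """Hypothetical container: atomic, checkable steps plus one final answer."""
    steps: list[str] = field(default_factory=list)
    final_answer: str = ""


# Illustrative example only: each step is a single visible fact or short
# inference that a reviewer could verify against the image.
trace = ReasoningTrace(
    steps=[
        "The chart has two bars labelled 2022 and 2023.",
        "The 2022 bar reaches 40 on the y-axis.",
        "The 2023 bar reaches 55 on the y-axis.",
        "55 is greater than 40, so the value increased.",
    ],
    final_answer="Increased",
)

# Serialise so the trace can be logged and compared against reference steps.
payload = json.dumps(asdict(trace), indent=2)
print(payload)
```

The point of keeping each step atomic is that it can be matched, or fail to match, on its own. A single paragraph of prose cannot be scored that way.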

CRYSTAL converts “show your work” into measurable checkpoints

CRYSTAL contains 6,372 multimodal reasoning instances drawn from five existing benchmarks: MathVision, ScienceQA-IMG, RealWorldQA, MMVP, and PLOTQA. The average example has 11.6 reference reasoning steps, with a range from 3 to 42. The dataset is stratified into easy, medium, and hard tiers based on features such as step count, question length, and linguistic complexity.

The benchmark construction is important because the reference steps are the measuring stick. The authors use a Delphi-inspired multi-agent pipeline: four independent multimodal models generate candidate reasoning trajectories, semantically similar steps are clustered, a fifth model validates logical soundness and visual grounding, and a human quality gate checks whether the steps are visible, coherent, and answer-consistent. Fewer than 5% of examples require re-iteration after the human gate.

This is not perfect ground truth. No reasoning reference system is. But it is more useful than a final-answer label when the task is diagnostic.

CRYSTAL then evaluates models using three connected measurements:

| Measurement | What it checks | What it catches |
| --- | --- | --- |
| Accuracy | Whether the final answer is correct | Basic task success |
| Match F1 | Whether predicted steps semantically match reference steps | Missing, hallucinated, or selective reasoning |
| Ordered Match F1 | Whether matched steps appear in a coherent sequence | Fragmented reasoning assembled in the wrong order |

Match F1 works like the familiar precision-recall tradeoff, but at the level of reasoning steps. Precision asks: among the steps the model produced, how many align with reference steps? Recall asks: among the reference steps needed for the task, how many did the model cover?

That distinction is the center of the paper.

A model with high precision and low recall is not hallucinating wildly. It is doing something subtler: producing a few safe, correct fragments while omitting much of the necessary reasoning. In ordinary product demos, this looks polished. In an audit, it is a gap with nice formatting.
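
The step-level precision and recall described above can be sketched in a few lines. This is a simplified illustration, not the paper's implementation: the paper matches steps with sentence-encoder embeddings, while the `similarity` function here is a stand-in based on word overlap, and the greedy one-to-one matching and the 0.5 threshold are assumptions chosen to keep the example self-contained.

```python
import re


def similarity(a: str, b: str) -> float:
    """Stand-in similarity (assumption): Jaccard overlap of lowercase word sets.
    A real setup would compare sentence-encoder embeddings instead."""
    ta = set(re.findall(r"[a-z0-9]+", a.lower()))
    tb = set(re.findall(r"[a-z0-9]+", b.lower()))
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def match_steps(predicted, reference, threshold=0.5):
    """Greedy one-to-one matching. Returns sorted (pred_idx, ref_idx) pairs."""
    candidates = sorted(
        ((similarity(p, r), i, j)
         for i, p in enumerate(predicted)
         for j, r in enumerate(reference)),
        reverse=True,
    )
    used_pred, used_ref, pairs = set(), set(), []
    for score, i, j in candidates:
        if score < threshold:
            break  # remaining candidates are all below the threshold
        if i in used_pred or j in used_ref:
            continue  # each step may be matched at most once
        used_pred.add(i)
        used_ref.add(j)
        pairs.append((i, j))
    return sorted(pairs)


def match_prf(predicted, reference, threshold=0.5):
    """Step-level precision, recall, and F1 for one example."""
    matched = len(match_steps(predicted, reference, threshold))
    precision = matched / len(predicted) if predicted else 0.0
    recall = matched / len(reference) if reference else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


reference = [
    "The 2022 bar reaches 40 on the y-axis.",
    "The 2023 bar reaches 55 on the y-axis.",
    "The difference between 55 and 40 is 15.",
    "The value increased by 15 from 2022 to 2023.",
]
# A cherry-picking model: two safe, correct fragments and nothing else.
predicted = reference[:2]

p, r, f1 = match_prf(predicted, reference)
print(round(p, 3), round(r, 3), round(f1, 3))  # 1.0 0.5 0.667
```

Everything the model said is right, so precision is perfect. Half the required reasoning is missing, so recall is 0.5. Note that the word-overlap stand-in would also treat the 2022 and 2023 bar readings as similar to each other, which is one reason a real matcher needs a stronger encoder.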

Ordered Match F1 adds another layer. A model may mention the right ingredients but in the wrong logical order. For simple tasks, order may not matter much. For multi-step reasoning, it matters a great deal. You do not want a model to calculate the final compliance conclusion before it has identified the relevant contract clause. That is not reasoning; that is reverse engineering with a blazer.

The headline result: models cherry-pick reasoning even when they answer well

The authors evaluate 20 multimodal large language models: 16 open-source systems and 4 commercial systems. The commercial systems were not used in the CRYSTAL reference-generation pipeline, which helps reduce the concern that the benchmark merely rewards the styles of its own generators.

The main pattern is precision-recall imbalance. Nineteen of the 20 models show precision substantially exceeding recall. In plain language: models tend to produce steps that are often correct, but they omit many required steps.

That is not a small technical detail. It changes what “good explanation” means.

GPT-5, for example, has the highest answer accuracy in the table at 57.99%, but its Match F1 is 0.612 and its recall is 0.479. So even the strongest answer performer recovers less than half of the reference reasoning steps on average. GPT-5-mini has slightly lower accuracy at 55.59%, but the highest Match F1 at 0.773, with precision 0.978 and recall 0.669. Gemini 2.5 Flash is the exception to the cherry-picking pattern: it produces many more steps, 17.10 per question on average, and has recall 0.765 exceeding precision 0.701.

The useful reading is not “which model wins?” The useful reading is that answer accuracy and reasoning transparency are different axes.

| Model (examples from the paper’s results) | Accuracy | Match F1 | Precision | Recall | Interpretation |
| --- | --- | --- | --- | --- | --- |
| GPT-5 | 57.99% | 0.612 | 0.925 | 0.479 | Strong answer accuracy, but many reasoning steps missing |
| GPT-5-mini | 55.59% | 0.773 | 0.978 | 0.669 | Best step-level F1, still incomplete coverage |
| Gemini 2.5 Flash | 53.95% | 0.673 | 0.701 | 0.765 | More exhaustive, but less precise and more verbose |
| Qwen3-VL-32B | 49.22% | 0.718 | 0.819 | 0.704 | Strong step coverage with lower answer accuracy than smaller Qwen3-VL-8B |

This is where the paper is more interesting than a leaderboard. Qwen3-VL-32B has lower accuracy than Qwen3-VL-8B, but higher Match F1. Gemma3-4B beats InternVL3.5-38B on Match F1 despite being much smaller. Scaling does not monotonically improve reasoning transparency.

For procurement teams, that matters. Buying the bigger model may improve some output metrics while making the reasoning trace no more complete, or even less operationally useful. “Frontier model” is not a substitute for a task-specific audit protocol. Shocking, I know.

The order problem shows that explanation fragments are not yet reasoning chains

A second failure mode appears when the paper measures ordering.

Match F1 asks whether the right steps are present. Ordered Match F1 asks whether they appear in a reasonable sequence. The authors use a longest-increasing-subsequence ratio to estimate how much of the matched reasoning can be read in the correct relative order.

Here the numbers become awkward. GPT-5-mini achieves the highest Ordered Match F1 at 0.670, but its LIS ratio is 0.560. That means a large share of its matched steps are not in the expected sequence. Qwen3-VL-32B and Gemini 2.5 Flash show similar ordering degradation despite strong Match F1.

There is one small trap in interpreting the LIS ratio. Some weaker models show high LIS ratios because they generate very few matched steps. If you only produce two or three relevant steps, it is not hard for them to appear ordered. This is why Ordered Match F1 weights order by content coverage.
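
A longest-increasing-subsequence ratio is easy to sketch. The version below assumes the input is the list of reference positions of the matched steps, in the order the model produced them, and divides the LIS length by the number of matched steps. That denominator, and the choice to return 0.0 for an empty list, are my assumptions. The paper's exact definition may differ, and this sketch does not reproduce how Ordered Match F1 combines order with coverage.

```python
from bisect import bisect_left


def lis_length(seq):
    """Length of the longest strictly increasing subsequence, O(n log n)."""
    tails = []  # tails[k] = smallest tail of an increasing run of length k + 1
    for x in seq:
        k = bisect_left(tails, x)
        if k == len(tails):
            tails.append(x)
        else:
            tails[k] = x
    return len(tails)


def lis_ratio(matched_ref_indices):
    """Share of matched steps that can be read in reference order.
    Input: reference positions of matched steps, in model output order."""
    if not matched_ref_indices:
        return 0.0  # assumption: no matched steps means no ordering credit
    return lis_length(matched_ref_indices) / len(matched_ref_indices)


print(lis_ratio([0, 1, 2, 3, 4]))  # 1.0, fully ordered chain
print(lis_ratio([2, 0, 1, 4, 3]))  # 0.6, five matched steps, three in order
print(lis_ratio([1, 3]))           # 1.0, the trap: two steps are trivially ordered
```

The last call shows why the ratio cannot be read alone. A model that matches two steps out of a dozen gets a perfect ordering score for doing almost nothing.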

The business lesson is straightforward: retrieving the right evidence is not the same as constructing the right argument.

In document review, this distinction is familiar. A junior analyst may highlight the right clauses but still misunderstand how the clauses interact. A model can do the same thing at machine speed. CRYSTAL’s order metric is useful because it separates “mentioned the evidence” from “used the evidence coherently.”

The ablations are reliability checks, not a second thesis

The paper’s ablation studies are best read as measurement validation. They are not the main business story, but they tell us whether the benchmark is likely to be measuring something real.

The authors test four sentence encoders across five similarity thresholds, averaged over five baseline models. Encoder choice matters more than threshold choice. DistilRoBERTa-v1 gives the strongest Match F1, while threshold variation produces smaller swings. More importantly, model rankings remain stable across encoder-threshold combinations.

That stability is the key point. If changing the sentence encoder completely rearranged the leaderboard, the benchmark would be measuring encoder quirks. The authors argue that stable rankings suggest the matching captures consistent semantic relationships.

They also run a human agreement study on 100 adversarially sampled step pairs. The encoder reaches 84% overall agreement with a human annotator. Below the operating threshold, agreement is 100%, meaning the encoder does not falsely match unrelated steps in that sample. Most disagreements occur in the borderline similarity zone, where step granularity and paraphrase ambiguity naturally become annoying. Measurement, like office coffee, is rarely perfect; the question is whether the imperfection changes the decision.

For business use, this matters because automated step matching must be conservative. A benchmark that over-credits weak reasoning is worse than no benchmark; it gives management a dashboard-shaped sleeping pill. CRYSTAL’s validation does not remove all uncertainty, but it supports using Match F1 as a diagnostic signal rather than decorative numerology.

The reward design changes the incentive, not just the score

The paper then moves from evaluation to training. This is where the mechanism-first framing pays off.

If answer-only evaluation rewards shortcuts, then a natural fix is to reward reasoning quality. But a simple additive reward has a loophole:

$$ \text{Reward} = \text{Accuracy} + \text{ReasoningQuality} $$

A model can still chase the accuracy term while ignoring the reasoning term, especially when final-answer improvement is easier than step coverage. The paper’s Composite reward does exactly this in practice: it improves accuracy but fails to improve reasoning. In the reported experiment, Composite reaches 44.92% accuracy but only 0.426 Match F1, below the 0.480 baseline. Answer-Only reaches similar accuracy at 44.30% and 0.429 Match F1.

CRYSTAL proposes Causal Process Reward, or CPR, which multiplicatively couples answer correctness with step-level alignment. The intuition is simple: a correct answer without aligned reasoning should not receive the same reward as a correct answer supported by evidence.
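
The difference between additive and multiplicative coupling shows up in a toy comparison. The functions below are a sketch of the incentive structure only: treating CPR as a plain product of answer correctness and a step-alignment score is my simplifying assumption, and the paper's actual reward formulation may include terms this ignores.

```python
def additive_reward(correct: bool, step_score: float) -> float:
    """Additive form from the text: Reward = Accuracy + ReasoningQuality."""
    return float(correct) + step_score


def multiplicative_reward(correct: bool, step_score: float) -> float:
    """Assumed simplification of multiplicative coupling:
    step alignment only pays off when the answer is also correct,
    and a correct answer only pays off when the steps align."""
    return float(correct) * step_score


# Two rollouts with the same correct answer (illustrative step scores).
shortcut = (True, 0.0)   # right answer, no aligned reasoning
grounded = (True, 0.8)   # right answer, well-aligned reasoning

print(additive_reward(*shortcut), additive_reward(*grounded))              # 1.0 1.8
print(multiplicative_reward(*shortcut), multiplicative_reward(*grounded))  # 0.0 0.8
```

Under the additive form, the shortcut rollout still collects more than half of the reward available to the grounded one. Under the product, it collects nothing, so the only way to earn reward is to get both the answer and the path right.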

The paper then adds CPR-Curriculum: first train for format and answer stability, then introduce the step-level reward while progressively increasing reasoning difficulty. This staged design is not cosmetic. If the model cannot produce valid structured output, asking it to optimize long reasoning chains too early is a fine way to create expensive noise.

The main GRPO experiment uses Qwen2.5-VL-3B-Instruct. CPR-Curriculum improves accuracy from 39.85% to 47.52%, Match F1 from 0.480 to 0.633, recall from 0.347 to 0.493, and Ordered Match F1 from 0.434 to 0.560. The authors report this as a 32% Match F1 improvement.
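
As a quick check on the quoted percentage, the relative gains follow directly from the before and after values reported above (the rounding is mine):

$$ \frac{0.633 - 0.480}{0.480} = \frac{0.153}{0.480} \approx 0.319 $$

$$ \frac{0.493 - 0.347}{0.347} = \frac{0.146}{0.347} \approx 0.421 $$

So Match F1 rises by roughly 32% in relative terms and recall by roughly 42%, while accuracy gains 7.67 percentage points (47.52 minus 39.85).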

| Reward strategy | Accuracy | Match F1 | Recall | Ordered Match F1 | Likely purpose of test | Interpretation |
| --- | --- | --- | --- | --- | --- | --- |
| Baseline | 39.85% | 0.480 | 0.347 | 0.434 | Starting point | Terse, selective reasoning |
| Composite | 44.92% | 0.426 | 0.284 | 0.392 | Additive reward ablation | Accuracy improves, reasoning does not |
| Answer-Only | 44.30% | 0.429 | 0.308 | 0.380 | Control | Confirms answer reward alone is insufficient |
| CPR | 41.40% | 0.633 | 0.489 | 0.560 | Multiplicative reward test | Reasoning improves under causal coupling |
| CPR-Curriculum | 47.52% | 0.633 | 0.493 | 0.560 | Training stability extension | Accuracy and reasoning improve together |

The reward comparison is the paper’s cleanest mechanism test. The additive approach is the obvious baseline. It fails where the mechanism predicts it should fail: the model can collect reward without truly expanding reasoning coverage. CPR performs better because it makes reasoning quality valuable only when tied to answer correctness.

The appendix adds an important cross-model generalization test on InternVL3.5-4B. With the same two-phase protocol and no architecture-specific reward tuning, CPR-Curriculum raises accuracy from 37.61% to 45.76% and Match F1 from 0.432 to 0.833. Recall rises from 0.325 to 0.811, while average steps increase from 3.75 to 9.49. That is a large change in behavior, not just a marginal score adjustment.

But there is a catch, and it is worth taking seriously.

Longer reasoning chains introduce more ordering noise. Both Qwen and InternVL show lower LIS ratios after CPR-Curriculum training, even while Ordered Match F1 improves because content coverage improves more than order degrades. The appendix also shows regressions on mathematical and combinatorial examples where CPR-trained models articulate the right general procedure but make worse numerical or counting errors.

So CPR is not magic. It pushes models toward broader reasoning coverage. It does not guarantee computational precision. Anyone selling that as a solved problem should be asked to show the invoice for the unicorn.

What this means for enterprise AI evaluation

The practical value of CRYSTAL is not that every company should copy the benchmark directly. Most companies will not need 6,372 multimodal academic examples. The value is the evaluation pattern.

For enterprise AI systems, especially multimodal systems, the unit of evaluation should often be smaller than the final answer. A useful deployment test should ask:

| Enterprise question | CRYSTAL-style evaluation response |
| --- | --- |
| Did the model answer correctly? | Measure final-answer accuracy |
| Did it use the right evidence? | Compare output steps against expected evidence checks |
| Did it omit critical checks? | Measure recall over required steps |
| Did it hallucinate or add unsupported claims? | Measure precision over predicted steps |
| Did it reason in the right order? | Penalize disordered chains where sequence matters |
| Can training improve the behavior? | Use process-level rewards, not only answer rewards |

This is especially relevant in document intelligence, chart analysis, insurance claims, manufacturing inspection, financial report review, medical-image support, and agentic workflows where visual or textual evidence must support a decision.

Take invoice review. A model may correctly flag a payment as suspicious. But the operational question is whether it checked the vendor identity, invoice amount, payment terms, duplicate invoice risk, approval chain, and contract match. If it flags correctly while skipping the contract match, the result is less reliable than the accuracy score suggests.
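
The invoice case can be turned into a small coverage check. This sketch assumes each model step has already been mapped to a check label. That mapping is the hard part and is not shown. The label names and the `coverage_report` helper are hypothetical.

```python
# Required checks taken from the invoice example above; the label strings
# themselves are hypothetical.
REQUIRED_CHECKS = [
    "vendor_identity",
    "invoice_amount",
    "payment_terms",
    "duplicate_invoice_risk",
    "approval_chain",
    "contract_match",
]


def coverage_report(performed_checks):
    """Compare the checks a model exposed against the required list."""
    performed = set(performed_checks)
    missing = [c for c in REQUIRED_CHECKS if c not in performed]
    unexpected = sorted(performed - set(REQUIRED_CHECKS))
    recall = (len(REQUIRED_CHECKS) - len(missing)) / len(REQUIRED_CHECKS)
    return {"recall": recall, "missing": missing, "unexpected": unexpected}


# The model flags the payment correctly but never checks the contract.
report = coverage_report([
    "vendor_identity",
    "invoice_amount",
    "payment_terms",
    "duplicate_invoice_risk",
    "approval_chain",
])
print(report["missing"])           # ['contract_match']
print(round(report["recall"], 3))  # 0.833
```

An accuracy metric scores this run as a success. The coverage report says which control was skipped, which is the thing an auditor will actually ask about.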

Or take chart interpretation. A model may answer a trend question correctly while misreading one of the chart values. The final answer could survive by luck. In another month, the same failure becomes a wrong executive decision. Step-level evaluation catches whether the model can read the chart, not merely whether it landed on the right multiple-choice option.

This also changes vendor selection. A benchmark table that only reports answer accuracy encourages buyers to choose the highest score. A CRYSTAL-style table would reveal whether the vendor’s model is precise but incomplete, verbose but noisy, accurate but untraceable, or less accurate but diagnostically honest. Those are different product risks.

The business value is cheaper diagnosis, not prettier explanations

There is a tempting but shallow reading of this paper: “AI explanations are good for trust.” Fine. True enough. Also so generic that it could be printed on a tote bag.

The stronger business interpretation is that process-level evaluation reduces diagnostic cost.

When a deployed model fails, teams need to know what to fix. Should they improve OCR? Add visual grounding? Refine prompts? Use tools for calculation? Fine-tune on domain examples? Change the reward function? Add human review at a specific checkpoint?

Answer accuracy rarely tells you. Step-level evaluation can.

The appendix’s qualitative examples make this concrete. The paper distinguishes several diagnostic regions:

| Diagnostic region | What happens | Operational response |
| --- | --- | --- |
| Perfect alignment | Correct answer and complete reasoning | Keep monitoring; not much to fix |
| Sound reasoning, wrong answer | Right procedure, wrong visual/numerical extraction | Improve perception, OCR, chart reading, or calculation tools |
| Correct answer, cherry-picked reasoning | Answer right, but many steps omitted | Improve coverage requirements and recall incentives |
| Catastrophic failure | Wrong answer and fabricated reasoning | Treat as capability gap, not a prompt-polishing problem |
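
One way to operationalize these regions is a simple triage rule over answer correctness and step-level scores. The mapping below is a rough sketch of my own, not the paper's procedure: it uses recall to split the correct-answer cases, precision to split the wrong-answer cases, and an arbitrary 0.7 threshold for both.

```python
def diagnostic_region(correct: bool, precision: float, recall: float,
                      min_precision: float = 0.7,
                      min_recall: float = 0.7) -> str:
    """Hypothetical triage rule mapping one example to a diagnostic region.
    Thresholds are illustrative assumptions, not values from the paper."""
    if correct:
        # Right answer: the question is whether the reasoning was complete.
        if recall >= min_recall:
            return "perfect alignment"
        return "correct answer, cherry-picked reasoning"
    # Wrong answer: the question is whether the stated steps were grounded.
    if precision >= min_precision:
        return "sound reasoning, wrong answer"
    return "catastrophic failure"


print(diagnostic_region(True, 0.95, 0.45))   # correct answer, cherry-picked reasoning
print(diagnostic_region(False, 0.90, 0.80))  # sound reasoning, wrong answer
print(diagnostic_region(False, 0.20, 0.10))  # catastrophic failure
```

A rule this crude will misfile borderline cases, so it is better suited to routing examples into review queues than to reporting headline numbers.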

This is the kind of distinction managers actually need. Without it, teams argue in circles. The product team says the model is accurate enough. Compliance says the reasoning is unverifiable. Engineering says the prompt can be improved. Everyone is partly right, which is the most efficient way to waste a meeting.

A CRYSTAL-style diagnostic layer gives the discussion a better object: which steps were required, which were produced, which were unsupported, and which were out of sequence.

Where the paper should not be overread

CRYSTAL is a strong diagnostic contribution, but it should not be treated as a universal certificate of reasoning quality.

First, reference steps may not capture every valid path. Some tasks allow multiple legitimate reasoning routes. If the reference is too narrow, a model can be penalized for taking a different but valid path.

Second, embedding-based matching has borderline cases. The paper’s human agreement study supports the metric, especially against false matches below the threshold, but semantic similarity is still an approximation. Domain-specific language, abbreviations, numerical equivalence, and visual grounding can complicate matching.

Third, Ordered Match F1 measures relative step order, not full causal dependency. A sequence can be mostly ordered and still contain a flawed inference. Conversely, some reasoning tasks permit partially parallel checks where strict order is less meaningful.

Fourth, CPR improves step coverage but does not guarantee arithmetic or symbolic correctness. The appendix shows cases where CPR-trained models name the right procedure but make worse computations. In business settings, that implies a natural next layer: pair step-level evaluation with tool verification for calculations, table extraction, and document references.

Finally, visible reasoning steps are not the same as private model cognition. CRYSTAL evaluates structured external evidence traces. That is exactly what enterprises should care about, but it should not be confused with reading the model’s mind. The product requirement is not metaphysical transparency. It is controlled, checkable, task-relevant traceability.
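
The tool-verification layer suggested under the fourth caveat can start very small. The sketch below scans a reasoning step for simple claims of the form `a op b = c` and recomputes them. The regular expression, the tolerance, and the `check_arithmetic` helper are illustrative assumptions. Real traces would need a far more robust parser, plus separate verifiers for table extraction and document references.

```python
import operator
import re

OPS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": operator.truediv,
}

NUMBER = r"-?\d+(?:\.\d+)?"
# Matches simple binary claims such as "40 + 15 = 55".
CLAIM = re.compile(rf"({NUMBER})\s*([+\-*/])\s*({NUMBER})\s*=\s*({NUMBER})")


def check_arithmetic(step: str, tol: float = 1e-6):
    """Return one True/False per 'a op b = c' claim found in the step text."""
    results = []
    for a, op, b, c in CLAIM.findall(step):
        if op == "/" and float(b) == 0:
            results.append(False)  # division by zero can never be a valid claim
            continue
        value = OPS[op](float(a), float(b))
        results.append(abs(value - float(c)) <= tol)
    return results


print(check_arithmetic("The total is 40 + 15 = 55."))  # [True]
print(check_arithmetic("So 12 * 3 = 38 units."))       # [False]
print(check_arithmetic("The 2023 bar is taller."))     # []
```

A step that names the right procedure but fails this check lands in the "sound reasoning, wrong answer" territory, which points toward a calculator tool rather than more reasoning reward.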

The new standard: do not just ask whether the model is right

CRYSTAL’s contribution is not that it discovers models sometimes make mistakes. That news arrived several product cycles ago and has refused to leave.

The contribution is a cleaner measurement frame. It shows that answer accuracy can hide cherry-picked reasoning, incomplete recall, disordered evidence, and non-monotonic scaling effects. It also shows that training incentives matter: additive rewards can leave reasoning flat, while multiplicative process rewards can improve step coverage when paired with curriculum training.

For Cognaptus-style enterprise automation, the lesson is practical. Do not evaluate multimodal AI only by whether it gives the right answer on a test set. Ask whether it can expose the checks that make the answer defensible.

The future of AI evaluation will not be only “Did the model answer correctly?”

It will be:

  1. What evidence did it use?
  2. What evidence did it skip?
  3. Did the sequence make sense?
  4. Which failure mode does this reveal?
  5. Can the training or workflow incentive be changed to fix it?

That is less glamorous than a single accuracy number. It is also much closer to how reliable systems are actually built.

Correct answers are useful. Correct answers with inspectable reasoning are deployable.

Cognaptus: Automate the Present, Incubate the Future.


  1. Wayner Barrios and SouYoung Jin, “Beyond Final Answers: CRYSTAL Benchmark for Transparent Multimodal Reasoning Evaluation,” arXiv:2603.13099v2, 16 March 2026.