Right Answer, Wrong Audit: When Reasoning Models Grade the Destination, Not the Route

A reviewer sees the final number. It is correct.

Then the quiet failure begins.

The reviewer stops asking whether the argument actually works. The missing step becomes “implicit.” The shuffled logic becomes “not ideal, but acceptable.” The circular explanation becomes “verbose but essentially correct.” The answer has done something worse than persuade. It has anesthetized the audit.

That is the useful sting in “An Enigma of Artificial Reason: Investigating the Production-Evaluation Gap in Large Reasoning Models,” by Mingzhong Sun, Teresa Yeo, Armando Solar-Lezama, and Tan Zhi-Xuan [1]. The paper is not just another entry in the growing cabinet of “LLMs fail at reasoning” specimens. We have enough of those pinned to the wall already. Its sharper claim is that large reasoning models can be good at producing correct solutions and still poor at evaluating whether someone else’s reasoning is valid.

That distinction matters because many business AI workflows quietly assume the opposite. If a model can solve the problem, surely it can review the solution. If it can write code, surely it can audit code. If it can generate a compliance memo, surely it can inspect one. If it can reason, surely it can evaluate reasoning.

Lovely symmetry. Unfortunately, the paper brings data.

The authors introduce VAIR, short for Valid-Answer-Invalid-Reasoning: math problems where the final answer is correct but the reasoning path is deliberately broken. This design strips away the easiest shortcut available to a model-as-judge system. A model cannot simply solve the problem, compare the answer, and declare victory. It must inspect the chain.

Many do not.

The failure is not wrong arithmetic; it is answer-confirmation as an evaluation strategy

The paper’s core mechanism is simple enough to be dangerous: when the final answer is valid, the model treats that answer as evidence that the reasoning must be acceptable. The correct answer becomes a magnet. Intermediate steps get pulled toward innocence.

The authors call this answer confirmation bias. That phrase is doing real work. The model is not merely failing randomly, nor merely lacking mathematical skill. It often has enough skill to solve the same problem. The failure appears when the task switches from production to evaluation.

That is why the production-evaluation gap is the right lens. A production task asks: “Can you solve this?” An evaluation task asks: “Can you judge whether this solution deserves credit?” Humans are often better at evaluating arguments than generating them from scratch. We may not all discover elegant proofs, but we are surprisingly good at spotting when someone smuggles in a premise, skips a required step, or argues in a circle. We evolved in social environments where bad arguments have consequences. Apparently, models have evolved in training environments where correct final answers get the applause.

The result is an uncomfortable split:

| Capability | What the model can do | What the paper tests |
| --- | --- | --- |
| Reasoning production | Generate a correct solution from a problem statement | Solve the original math problem |
| Answer verification | Check whether a final answer matches the correct result | Works well when answer validity and reasoning validity align |
| Reasoning evaluation | Judge whether each step logically follows | Breaks down when the answer is right but the reasoning is wrong |

The business translation is equally blunt: outcome matching is not process validation. A model that confirms the right answer may still be a poor auditor of the route used to reach it.

A spreadsheet that lands on the right net income through a broken formula is not clean. A legal memo that cites the right conclusion through a hallucinated premise is not reliable. A code patch that passes a shallow test while relying on brittle logic is not reviewed. In each case, the final output can be correct while the process is contaminated.

VAIR is designed to make that contamination visible.

VAIR removes the shortcut models most want to take

The dataset starts from valid math problems and gold solutions, mostly from GSM8K and MATH, supplemented with newer Process-Bench examples to reduce contamination risk. The authors then perturb the questions or solutions to create invalid reasoning while preserving the correct final answer.

The four perturbation types are important because they target different audit muscles:

| VAIR perturbation | What is broken | Why it matters operationally |
| --- | --- | --- |
| Missing Premises | The question omits information required to solve it, but the solution uses that missing information anyway | Tests whether the evaluator notices hallucinated assumptions |
| Missing Reasoning | A necessary inferential step is removed | Tests whether the evaluator demands traceability rather than accepting a leap |
| Shuffled Reasoning | Steps are presented out of logical dependency order | Tests whether the evaluator checks causal sequence, not just content fragments |
| Circular Reasoning | The solution uses tautological or assertive language to reach the right answer | Tests whether the evaluator can distinguish explanation from decorative confidence |
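To make two of these flaw types concrete, here is a minimal sketch of what a Missing Reasoning and a Shuffled Reasoning perturbation could look like on a list of solution steps. The pencil problem and the helper names `drop_step` and `shuffle_steps` are invented for illustration; this is not the authors' generation pipeline, only a toy that keeps the final answer intact while breaking the path.

```python
import random

def drop_step(steps, index):
    """Missing Reasoning: remove one necessary inferential step."""
    return steps[:index] + steps[index + 1:]

def shuffle_steps(steps, seed=0):
    """Shuffled Reasoning: reorder steps so they no longer follow dependency order."""
    rng = random.Random(seed)
    shuffled = list(steps)
    if len(set(steps)) < 2:
        return shuffled  # nothing to reorder
    while shuffled == list(steps):  # keep shuffling until the order really changes
        rng.shuffle(shuffled)
    return shuffled

# Hypothetical gold solution: a linear chain, so any reordering breaks a dependency.
gold = [
    "Each box holds 12 pencils.",
    "There are 4 boxes, so 4 * 12 = 48 pencils.",
    "Giving away 8 leaves 48 - 8 = 40 pencils.",
    "The answer is 40.",
]

missing = drop_step(gold, 1)   # the 48 now appears with no derivation
shuffled = shuffle_steps(gold)  # same sentences, wrong order

print(missing)
print(shuffled)
```

In both variants the answer, 40, is still right. Only the route is damaged, which is the whole point of the benchmark.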

The final VAIR dataset contains 1,001 perturbed problem-solution pairs, roughly balanced across those four flaw types: 258 Missing Premises, 228 Missing Reasoning, 259 Shuffled Reasoning, and 256 Circular Reasoning. The authors also construct two control datasets: VAVR, where both answer and reasoning are valid, and IAIR, where both answer and reasoning are invalid.

This is the key experimental move. In VAVR and IAIR, answer correctness and reasoning validity point in the same direction. In VAIR, they are deliberately separated. If a model does well on VAVR and IAIR but fails on VAIR, the evidence suggests the model is relying too heavily on the final answer as a proxy for the reasoning chain.

That is exactly what happens.
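The logic of the three-dataset design can be shown with a deliberately lazy grader. The sketch below assumes a hypothetical `answer_only_judge` that accepts the reasoning whenever the final answer is correct. The `Case` records are illustrative stand-ins, one per dataset, not samples from the paper.

```python
from dataclasses import dataclass

@dataclass
class Case:
    dataset: str           # "VAVR", "IAIR", or "VAIR"
    answer_correct: bool
    reasoning_valid: bool  # ground truth the grader is supposed to recover

def answer_only_judge(case):
    """Hypothetical shortcut grader: reasoning is 'valid' iff the answer is right."""
    return case.answer_correct

def accuracy_by_dataset(judge, cases):
    totals, hits = {}, {}
    for c in cases:
        totals[c.dataset] = totals.get(c.dataset, 0) + 1
        hits[c.dataset] = hits.get(c.dataset, 0) + (judge(c) == c.reasoning_valid)
    return {name: hits[name] / totals[name] for name in totals}

cases = [
    Case("VAVR", answer_correct=True, reasoning_valid=True),
    Case("IAIR", answer_correct=False, reasoning_valid=False),
    Case("VAIR", answer_correct=True, reasoning_valid=False),
]

result = accuracy_by_dataset(answer_only_judge, cases)
print(result)  # {'VAVR': 1.0, 'IAIR': 1.0, 'VAIR': 0.0}
```

A grader that never reads the reasoning looks perfect on both controls and fails completely on VAIR. Real models sit somewhere between this caricature and a true step-by-step auditor, and VAIR is the only one of the three datasets that can tell where.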

Frontier models solve well, grade badly, and fail hardest when the flaw is subtle

The paper evaluates six frontier large reasoning models: Claude Sonnet 4.6, Claude Opus 4.7, DeepSeek R1, GPT 5, GPT 5.4, and Gemini 3.1 Pro. On the production task, all models solve the original problems at high accuracy: at least 94.7%. On the controls, they also do well: at least 91.9% on VAVR and at least 95.8% on IAIR.

Then VAIR removes the answer shortcut. Performance drops.

GPT 5.4 falls to 47.9% VAIR evaluation accuracy. GPT 5 reaches 52.5%. Gemini 3.1 Pro is the strongest among the evaluated models, but still drops to 78.6%. In other words, the strongest performer is still substantially worse at evaluating flawed reasoning with a correct answer than at solving the original problems.

The pattern by error type is also revealing. Models are relatively better at detecting Circular Reasoning. That makes sense: vacuous “therefore it is so because it is so” prose looks suspicious. It has the smell of nonsense, even when wearing a lab coat.

The harder cases are Shuffled Reasoning and Missing Reasoning. These are exactly the cases where a production-oriented model can “repair” the solution in its own head. If the numbers are all present, just in the wrong order, the model can reassemble them. If a step is missing but the conclusion is reachable, the model can fill it in. Convenient, unless the task is grading what the student actually wrote.

This is the first practical warning. A strong model may be too helpful to be a strict evaluator. It may silently complete missing logic, reorder broken sequences, and infer unstated steps. In production, that behavior looks like intelligence. In audit, it looks like negligence with excellent manners.

Humans are biased too, but the gap is much smaller

The human comparison is not decorative. It calibrates whether VAIR is simply a weird task that punishes everyone.

The authors recruit 195 U.S. participants through Prolific, using a 240-item GSM8K-derived subset. Participants solve three problems and grade nine solutions, with the same grading rubric and calibration examples used for the models. The study includes quality controls, counterbalancing, and incentives.

Humans score 80.8% on solving, 83.1% on VAVR, 80.3% on IAIR, and 74.5% on VAIR. So yes, humans also find VAIR harder. Correct answers can fool people too; we should not pretend carbon-based reasoning is a temple of pure logic. On Missing Reasoning specifically, human accuracy is close to chance at 54.5%.

But the production-evaluation gap is far smaller: about 6.3 percentage points in aggregate (80.8% on solving versus 74.5% on VAIR). For several models, the gap is not a small wobble; it is a collapse.
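The arithmetic behind that comparison is short enough to check by hand. The sketch below uses only figures quoted in this article. Because the article reports a floor on solving accuracy (at least 94.7%) rather than each model's own score, the model gaps computed here are lower bounds, not exact values.

```python
# Figures quoted in the article (percent accuracy).
human = {"solve": 80.8, "VAIR": 74.5}
model_vair = {"GPT 5.4": 47.9, "GPT 5": 52.5, "Gemini 3.1 Pro": 78.6}
MIN_MODEL_SOLVE = 94.7  # lowest solving accuracy across the six models

# Production-evaluation gap = solving accuracy minus VAIR grading accuracy.
human_gap = round(human["solve"] - human["VAIR"], 1)

# Each model solves at >= 94.7%, so these are lower bounds on its gap.
model_gap_floor = {
    name: round(MIN_MODEL_SOLVE - vair, 1) for name, vair in model_vair.items()
}

print(human_gap)        # 6.3
print(model_gap_floor)  # {'GPT 5.4': 46.8, 'GPT 5': 42.2, 'Gemini 3.1 Pro': 16.1}
```

Even the best evaluator in the lineup has a gap at least two and a half times the human one.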

There is another telling comparison: effort. Humans spend less time grading than solving. LRMs spend more tokens on VAIR evaluation than on solving. That suggests the models are not casually skipping the task because it is too easy. They are working harder and still often reaching the wrong audit verdict. Very corporate, in the worst possible sense.

The paper’s evidence chain should be read in layers, not as one blob

The paper does something useful by not stopping at behavioral accuracy. It asks why the gap appears. The evidence comes in layers, and each layer has a different purpose.

| Test or analysis | Likely purpose | What it supports | What it does not prove alone |
| --- | --- | --- | --- |
| VAIR benchmark performance | Main evidence | LRMs can solve problems yet fail to evaluate flawed reasoning when the answer is valid | The internal mechanism of the failure |
| Human study | Main comparison | The gap is much smaller in human reasoners under the tested setting | That humans are immune to answer bias |
| Perturbation-type breakdown | Diagnostic analysis | Failures concentrate in subtler dependency flaws such as missing or shuffled reasoning | A full taxonomy of reasoning errors |
| CoT behavior annotation | Behavioral mechanism evidence | Models often solve independently, then endorse or rationalize flawed reasoning | That verbalized CoT is causally faithful |
| Static and dynamic probes | Representation evidence | Some validity representations exist, but are corrupted when a valid answer appears | A complete mechanistic map of all model internals |
| Causal patching | Causal mechanism evidence | Answer-token representations can flip verdicts and shift evaluation behavior | That the same mechanism fully explains every closed frontier model |
| PRM appendix | Robustness/exploratory extension | Step-level reward models can show similar failures | That all process supervision is doomed |

This distinction matters because the strongest takeaway is not “models got low scores.” Benchmarks age. Model names change. Someone will soon publish a new score with a larger model, a stricter prompt, a different judge, or a ritual involving twelve agents and a YAML file.

The more durable point is mechanistic: a valid answer can reshape both the model’s expressed reasoning and its internal representation of whether a solution is valid.

That is much harder to shrug off.

CoT analysis shows the model auditioning as a lenient teacher

The first mechanism analysis examines chain-of-thought behavior. The authors use Gemini 3.1 Flash-Lite to classify evaluator CoTs along two dimensions: workflow and justification.

Workflow has two categories. In Independent Solving, the evaluator solves the problem itself and checks whether the final answer matches. In Step Tracing, the evaluator follows the student’s solution line by line.

Justification behavior has three categories. Blind Endorsement means the flaw is missed. Forced Rationalization means the evaluator notices something odd but invents a reason to excuse it. Strict Rejection means the evaluator correctly identifies and penalizes the flaw.

The paper gives a clean example. A student’s solution to a marathon-watching problem is shuffled: the final answer appears before the premises that justify it. GPT 5.4 re-solves the problem, confirms that the answer is correct, notices the order is “somewhat scrambled,” and then awards a perfect grade because there is “no real logical flaw.”

This is not just a wrong grade. It is the wrong evaluation method. The model uses its own reconstructed solution as a substitute for auditing the submitted reasoning. It grades the answer it would have written, not the reasoning it was given.

The authors are careful here. CoT analysis alone is not conclusive because verbalized reasoning may not faithfully describe the computation that produced the verdict. That caution is not academic decoration. It is exactly why the paper then moves from behavior to representations.

The probe results suggest the model can know validity—until the answer bends the signal

The representation-level analysis asks whether model activations encode reasoning validity. The authors focus on open-weight models because they need access to internal states: GPT-oss-20B, Qwen3-4B, and Qwen3-0.6B. Before using them for interpretability, they verify that these models show the same broad production-evaluation gap pattern as the frontier models.

The static probe setup uses three groups:

| Group | Example type | Model verdict | Ground truth |
| --- | --- | --- | --- |
| A | VAVR graded valid | Valid | Valid reasoning |
| B | VAIR graded valid | Valid | Invalid reasoning |
| C | VAIR graded invalid | Invalid | Invalid reasoning |

The probe is first trained on concordant cases: Group A and Group C. These are cases where the model’s verdict aligns with ground truth. On held-out A/C examples, the probe reaches about 89% accuracy at layer 18 for GPT-oss-20B. This suggests the model does have some linearly decodable representation of reasoning validity.

Then the authors apply the probe to Group B: cases where the reasoning is invalid, but the model was fooled into grading it as valid. Accuracy drops below chance. Even an “oracle” probe trained on all three groups struggles to classify Group B reliably, hovering near chance.

That result should make auditors sit up slightly straighter. The failure is not merely that the model refuses to use a validity signal it clearly has. In the fooled cases, the representation itself looks corrupted or entangled with valid-answer evidence. The model’s internal state starts to resemble “valid reasoning” when it should represent “invalid reasoning with correct answer.”
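The shape of this experiment can be sketched with synthetic data. The example below is a toy, not a reproduction: the "activations" are random vectors, the probe is a simple difference-of-means linear probe rather than whatever classifier the authors trained, and Group B is placed closer to Group A by assumption, to mirror the article's reading that fooled cases drift toward the valid region.

```python
import random

rng = random.Random(0)
DIM = 8

def sample(center, n):
    """Draw n synthetic 'activation' vectors around a per-dimension center."""
    return [[rng.gauss(center, 1.0) for _ in range(DIM)] for _ in range(n)]

# Synthetic stand-ins for hidden states; NOT real model activations.
group_a = sample(+1.0, 200)  # VAVR graded valid   (truth: valid)
group_c = sample(-1.0, 200)  # VAIR graded invalid (truth: invalid)
group_b = sample(+0.5, 200)  # VAIR graded valid   (truth: invalid), assumed pulled toward A

def mean(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def fit_probe(valid, invalid):
    """Difference-of-means linear probe: direction w and midpoint threshold b."""
    mu_v, mu_i = mean(valid), mean(invalid)
    w = [v - i for v, i in zip(mu_v, mu_i)]
    midpoint = [(v + i) / 2 for v, i in zip(mu_v, mu_i)]
    b = sum(wi * mi for wi, mi in zip(w, midpoint))
    return w, b

def predict_valid(probe, x):
    w, b = probe
    return sum(wi * xi for wi, xi in zip(w, x)) > b

def accuracy(probe, vectors, truth_valid):
    return sum(predict_valid(probe, x) == truth_valid for x in vectors) / len(vectors)

# Train on concordant cases only (A and C), then test on held-out A/C and on B.
probe = fit_probe(group_a[:100], group_c[:100])
held_out = (accuracy(probe, group_a[100:], True) + accuracy(probe, group_c[100:], False)) / 2
fooled = accuracy(probe, group_b, False)  # ground truth for Group B is "invalid"

print(f"held-out A/C accuracy: {held_out:.2f}")
print(f"Group B accuracy:      {fooled:.2f}")
```

The numbers this prints are artifacts of the synthetic setup and say nothing about real models. What the sketch does show is the logic: a probe that separates A from C cleanly will score below chance on B if B's representations sit on the "valid" side of the boundary, which is the pattern the paper reports.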

The dynamic probe analysis makes this more vivid. The authors track representations across ten checkpoints during the model’s generated reasoning. For valid reasoning, the probe’s estimated probability of validity stays near 1.0. For rejected invalid reasoning, it stays near 0.0. For fooled VAIR cases, it begins near chance and then climbs toward valid as the model approaches its final verdict.

So the model does not simply start fooled and remain fooled. The answer appears to pull the internal trajectory toward acceptance over the course of evaluation. The audit becomes more lenient as it thinks.

A charming feature in a dinner guest. Less charming in a compliance system.

Causal patching turns the answer from a suspect into a lever

The causal-patching experiment is the paper’s strongest mechanism test. The authors take Group B examples—invalid reasoning with valid answers that the model wrongly accepts—and create counterfactual versions by minimally perturbing the integer answer, $N \rightarrow N + 1$. The reasoning chain is otherwise identical, but the final answer is now invalid.

They then cache hidden states associated with the invalid answer and patch those answer-token activations into the original valid-answer input. This tests whether answer-associated representations causally affect the evaluation verdict.

The flip rates are large:

| Model | Flip rate when patching all layers | Flip rate at peak probe layer |
| --- | --- | --- |
| Qwen3-0.6B | 80.5% | 47.2% |
| Qwen3-4B | 52.2% | 27.6% |
| GPT-oss-20B | 55.6% | 14.2% |

When all layers are patched, verdicts flip in more than half of cases across all three models. Even patching a single peak-probe layer produces meaningful flip rates.
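For readers who have not met activation patching before, the procedure can be sketched on a toy model. Everything below is hypothetical: `forward` is a three-layer stand-in in which the verdict state simply accumulates answer-token activations, `LAYER_GAINS` and the reasoning signals are made-up numbers, and the resulting flip rates are illustrative only. The recipe is the part that carries over: run the counterfactual input, cache the answer-token activations, substitute them into the original run, and count how often the verdict changes.

```python
LAYER_GAINS = [1.0, 0.6, 1.0]  # hypothetical per-layer weight of the answer token

def forward(reasoning_signal, answer_signal, patch=None):
    """Toy model: the verdict state accumulates answer-token activations per layer.

    patch maps layer index -> activation to substitute at the answer token.
    Returns (verdict, cached answer-token activations).
    """
    patch = patch or {}
    state = reasoning_signal
    cache = []
    for layer, gain in enumerate(LAYER_GAINS):
        h_answer = patch.get(layer, gain * answer_signal)
        cache.append(h_answer)
        state += h_answer
    return ("valid" if state > 0 else "invalid"), cache

def flip_rate(reasoning_signals, layers):
    """Share of wrongly accepted cases whose verdict flips to 'invalid' after patching."""
    flips = 0
    for r in reasoning_signals:
        clean_verdict, _ = forward(r, answer_signal=+1.0)  # valid answer, flawed reasoning
        _, corrupt_cache = forward(r, answer_signal=-1.0)  # counterfactual: answer N + 1
        patch = {layer: corrupt_cache[layer] for layer in layers}
        patched_verdict, _ = forward(r, answer_signal=+1.0, patch=patch)
        flips += clean_verdict == "valid" and patched_verdict == "invalid"
    return flips / len(reasoning_signals)

fooled_cases = [-0.3, -0.7, -1.1, -1.5]  # negative = invalid reasoning (synthetic)

all_layers = flip_rate(fooled_cases, layers=[0, 1, 2])
single_layer = flip_rate(fooled_cases, layers=[1])
print(all_layers)    # 1.0
print(single_layer)  # 0.25
```

The toy reproduces only the qualitative ordering in the table: patching every layer flips more verdicts than patching one. The specific rates are consequences of the invented gains, not estimates of anything.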

This is not merely “the model said in its CoT that the answer mattered.” The intervention changes the model’s behavior. It also changes the dynamic probe trajectory: after patching, the previously fooled Group B trajectory collapses toward the invalid-reasoning group. The CoT behavior shifts too: less Independent Solving, more Step Tracing, less Blind Endorsement, more Strict Rejection.

That is the paper’s cleanest causal story. The answer is not just correlated with lenient evaluation. It can act as a lever that changes internal validity representation, verbalized evaluation strategy, and final verdict.

PRMs show that “step-level” is not automatically “process-validating”

The appendix on process reward models is worth more attention than appendices usually receive. No, this is not a second thesis. It is a robustness and exploratory extension aimed at a natural objection: what about models explicitly trained to score reasoning steps?

The authors evaluate Qwen2.5-Math-PRM-7B. It performs well on IAIR, with 93.8% accuracy, and reasonably on VAVR, with 79.3%. But on VAIR, accuracy drops to 67.8%. The category breakdown is telling again: Circular Reasoning is detected well at 91.0%, but Missing Reasoning falls to 49.1%, near chance.

This matters because process reward models are supposed to be the antidote to outcome-only evaluation. The paper does not prove that process supervision is useless. That would be both too broad and too convenient. It suggests a more specific failure mode: step-level scoring may still underweight prior context, and PRM training may inherit answer-confirmation bias when labels are generated through rollout methods that reward eventual correct answers.

In operational terms, “we use a process model” is not enough. A process evaluator must actually enforce dependency, premise use, and step continuity. A step can be locally plausible and globally invalid. The audit must see the chain, not merely its links.

The business mistake is treating model review as a second opinion when it is often the same opinion in a nicer suit

Many companies now use AI to generate, review, and approve the same class of work. Draft the memo. Check the memo. Summarize the risk. Grade the vendor response. Review the code. Score the analyst’s reasoning. This looks like a tidy workflow. It may also be a closed loop of answer-confirming agents nodding at each other.

The paper demonstrates the effect directly only in a math setting with clear truth values. Cognaptus’ business inference is broader and should be stated carefully: any workflow where a model evaluates a reasoning chain after seeing a plausible final output is at risk of answer-confirmation behavior. The risk is highest when the evaluator can independently reconstruct the likely answer and then backfill missing logic.

Here is the practical mapping:

| AI workflow | Dangerous shortcut | Better control |
| --- | --- | --- |
| Model-as-grader | “The final answer is right, so the reasoning is acceptable” | Grade step validity before revealing or checking final answer |
| Code review agent | “The patch passes the visible tests, so the logic is fine” | Require line-level invariant checks and adversarial tests |
| Compliance memo reviewer | “The conclusion matches policy, so the citation path is probably fine” | Verify each cited rule, premise, and exception separately |
| Financial analysis checker | “The number matches the expected result, so the calculation is sound” | Audit formula lineage, assumptions, and intermediate values |
| RAG answer evaluator | “The answer is plausible, so the retrieval support is adequate” | Check source-to-claim traceability before judging answer quality |
| Multi-agent debate | “The stronger argument ends at the right conclusion” | Assign one agent to premise attack, not conclusion scoring |

The common design principle is separation. Do not ask the same model, in the same context, to see the answer and then neutrally judge the chain. That is like asking a junior analyst to audit their own spreadsheet after telling them the CFO’s preferred number. The result may be correct. The audit is still compromised.

A useful audit stack separates answer, path, and evidence

The paper does not prescribe an enterprise architecture, but its results point toward one. A safer reasoning-evaluation workflow should separate three objects:

  1. Answer validity: Is the final answer correct?
  2. Path validity: Does each step follow from prior steps and premises?
  3. Evidence validity: Are the premises themselves grounded in reliable sources?

Most model-evaluation workflows collapse these into one question: “Is this response good?” That is convenient. It is also how flawed reasoning hides behind correct outputs.

A better audit stack would look like this:

  1. Input / source evidence
  2. Premise extraction
  3. Step dependency check
  4. Intermediate calculation or claim verification
  5. Final answer check
  6. Verdict with separated scores:
     - source support
     - step validity
     - answer correctness
     - unresolved assumptions

The order matters. If the final answer is checked too early, it can contaminate the rest of the review. In VAIR terms, the model sees the right destination and stops policing the road.
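A minimal sketch of the path-then-answer ordering, under stated assumptions: the `Step`, `Verdict`, and `audit` names are invented here, steps declare their dependencies by hand, and source support is left out entirely. In deterministic code the ordering cannot bias anything, so treat this as a blueprint for how an LLM-based reviewer's stages might be sequenced and reported, not as a working auditor.

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    name: str
    uses: list  # names of premises or earlier steps this step depends on

@dataclass
class Verdict:
    step_validity: bool
    answer_correct: bool
    unresolved: list = field(default_factory=list)

def audit(premises, steps, final_answer, expected_answer):
    """Check the path first; compare the final answer only at the very end."""
    known = set(premises)
    unresolved = []
    for step in steps:
        # A step may only rely on stated premises or on steps that came before it.
        for dep in step.uses:
            if dep not in known:
                unresolved.append(f"{step.name} uses {dep} before it is established")
        known.add(step.name)
    return Verdict(
        step_validity=not unresolved,
        answer_correct=(final_answer == expected_answer),
        unresolved=unresolved,
    )

# Hypothetical shuffled solution: "remaining" is computed before "total" exists.
premises = {"boxes=4", "pencils_per_box=12", "given_away=8"}
shuffled_solution = [
    Step("remaining", uses=["total", "given_away=8"]),
    Step("total", uses=["boxes=4", "pencils_per_box=12"]),
]

verdict = audit(premises, shuffled_solution, final_answer=40, expected_answer=40)
print(verdict)
```

The verdict reports a correct answer and an invalid path as two separate facts. A single "is this response good?" score would have to pick one, and the paper suggests which one tends to win.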

For low-stakes automation, a single model judge may be acceptable. Nobody needs a mechanistic interpretability pipeline to grade whether a lunch survey summary is coherent. For high-stakes settings—finance, legal, medical, scientific, engineering, education, compliance—the review design should make answer-confirmation shortcuts harder.

That means using prompts and systems that force step tracing, but also not trusting prompts alone. The paper’s models were given detailed grading rubrics and examples explaining that shuffled reasoning and missing premises should be penalized. They still failed. Prompting helps, but it is not a safety rail made of steel. It is, at best, a polite sign near a cliff.

What the paper directly shows, and what business users should infer

The distinction between evidence and inference matters here.

The paper directly shows that, in mathematical reasoning tasks where final answers are valid but reasoning chains are invalid, evaluated LRMs often perform much worse at grading reasoning than at producing correct answers. It also shows that humans in the tested setting have a much smaller production-evaluation gap. It provides behavioral CoT evidence, representation-level probe evidence, and causal patching evidence that valid answers can bias model evaluations.

Cognaptus infers that AI review workflows should avoid treating model-generated judgments as independent validation when the evaluator sees the final answer and has strong production ability in the same domain. The more the review task resembles “confirm whether this plausible result is acceptable,” the more vulnerable it may be.

What remains uncertain is the exact transfer to domains where validity is less formal than math. Legal, medical, strategic, and policy reasoning often involve ambiguity, competing standards, incomplete evidence, and judgment calls. VAIR’s clarity is a strength for measurement, but it is not the full world. The paper also does not directly prove that outcome-focused training causes the failure; it argues this is a plausible explanation based on observed inference-time behavior. And the mechanistic work is restricted to open-weight models of roughly 20B parameters or fewer (GPT-oss-20B, Qwen3-4B, and Qwen3-0.6B), even though their behavior mirrors the broader frontier-model gap.

So the correct business reaction is not “models cannot audit reasoning.” It is more precise: do not assume that reasoning production competence implies reasoning evaluation competence. Test it. Separate the answer from the path. Build adversarial examples where the output is right and the logic is wrong. If the evaluator cannot catch those, it is not an auditor. It is a very fluent congratulator.

The real lesson is traceability, not pessimism

There is a tempting lazy conclusion: AI reasoning is fake, therefore do not use it. That is emotionally satisfying and operationally useless.

The better conclusion is narrower and more actionable. The paper shows that models trained and rewarded for correct outputs may learn evaluation habits that over-respect correct outputs. They can become excellent producers and unreliable inspectors. This is not mysterious. Organizations produce the same failure mode every quarter: teams optimize the metric, then forget why the metric existed.

For AI systems, the remedy is not just “use a better model.” It is better task design:

  • hide final answers during step-level review where possible;
  • require explicit premise-to-step dependency checks;
  • score answer correctness and reasoning validity separately;
  • test evaluators on valid-answer-invalid-reasoning cases;
  • use tools, deterministic checks, or human review for high-stakes trace validation;
  • treat process reward models as components to validate, not magic process police.

The paper’s most useful sentence is not written as a business sentence, but it should be: reasoning is not monolithic. Producing a good answer, evaluating a chain, detecting a missing premise, and resisting a plausible conclusion are different capabilities.

A mature AI workflow should stop pretending otherwise.

Correct answers are valuable. They are not audit trails.

And when a model sees the right answer and decides the broken reasoning must be fine, it is not thinking like a rigorous reviewer. It is thinking like a dashboard worshipper.

A familiar species. Just newly automated.

Cognaptus: Automate the Present, Incubate the Future.


  1. Mingzhong Sun, Teresa Yeo, Armando Solar-Lezama, and Tan Zhi-Xuan, “An Enigma of Artificial Reason: Investigating the Production-Evaluation Gap in Large Reasoning Models,” arXiv:2606.01462, 2026. https://arxiv.org/html/2606.01462