Score and Disorder: Why LLM Reasoning Needs More Than Accuracy

A model review often begins with a spreadsheet.

One column says accuracy. Another says cost. A third says latency. Someone asks whether the model is “good enough.” Someone else points at the benchmark score. A decision is made. Procurement smiles. Compliance does not, but compliance rarely smiles anyway.

The problem is not that accuracy is useless. The problem is that accuracy is too small a container for the thing businesses actually want from reasoning systems. A final answer can be correct while the route to that answer is unstable, unnecessarily expensive, locally contradictory, or impossible to reproduce under a harmless rewording of the question. That is not a philosophical inconvenience. It is an operational failure mode waiting politely inside a dashboard.

The arXiv paper Measuring Reasoning Quality in LLMs: A Multi-Dimensional Behavioral Framework by Ali Şenol, Garima Agrawal, and Huan Liu tries to make that failure mode measurable.¹ Its central move is simple but important: stop treating LLM reasoning quality as a one-number property, and instead evaluate six behavioral dimensions—correctness, consistency, robustness, logical coherence, efficiency, and stability.

This is not just another “beyond accuracy” slogan. The paper’s contribution is more practical: it shows how a model’s apparent ranking can change when the deployment context changes. The interesting result is not merely that one model wins a table. The interesting result is that the table itself changes shape once we ask what kind of reasoning failure a business can actually tolerate.

Accuracy is a final-answer metric, not a reasoning-control metric

Accuracy answers one question: did the model’s final output match the expected answer?

For many benchmark settings, that is a sensible starting point. If the task is a multiple-choice question and the correct answer is C, a model that says C has done something useful. Unfortunately, business workflows rarely stop at “C.” They require repeatability, auditability, robustness to wording changes, acceptable compute cost, and reasoning traces that do not quietly contradict themselves before landing on a lucky conclusion.

That difference matters because the failure modes are not interchangeable. A model that is wrong because it lacks domain knowledge is one engineering problem. A model that knows the answer but changes it under equivalent phrasing is another. A model that gives a correct answer through a locally inconsistent explanation creates a third problem: it may pass a benchmark while failing an audit. A leaderboard will happily compress all three into one decorative number. The paper’s framework refuses to do that.

The authors define reasoning quality as a function of six dimensions:

$$ Q = f(CQ, CS, RS, LS, ES, SS) $$

The abbreviations look like yet another taxonomy until you read them as controls in a deployment pipeline:

| Dimension | What it measures | Business interpretation | Main boundary |
| --- | --- | --- | --- |
| Correctness (CQ) | Whether the final answer matches the expected answer | Basic task success | Does not reveal how the answer was reached |
| Consistency (CS) | Whether repeated runs produce the same answer | Reproducibility under stochastic generation | Tested here with $K=3$ runs at temperature 0.7 |
| Robustness (RS) | Whether originally correct answers survive semantic-preserving perturbations | Resistance to wording variation | Computed only over originally correct items, so it is downstream of CQ |
| Logical Coherence (LS) | Whether consecutive reasoning steps avoid local contradiction | Auditability of the visible reasoning trace | Uses NLI-based local contradiction detection; not causal faithfulness |
| Efficiency (ES) | Whether correctness is achieved with reasonable token cost | Cost and latency discipline | Embeds CQ in the formula, so it structurally correlates with correctness |
| Stability (SS) | Whether reasoning traces remain semantically similar across runs | Process-level behavioral reliability | Semantic similarity is not the same as correctness |

The mechanism-first reading is important here. The paper is not saying “accuracy is bad.” It is saying accuracy is only one valve in a larger control system. If a business chooses a model based only on that valve, it has no instrumentation for the other five.
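One way to see why RS sits downstream of CQ is to compute both from the same item-level records. The sketch below is illustrative only: the `ItemResult` record, its field names, and the toy data are assumptions made for this example, not the paper's code, and the paper's exact RS formula may differ in detail.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ItemResult:
    """Hypothetical per-item record; the field names are illustrative."""
    correct_original: bool         # final answer correct on the original prompt
    correct_perturbed: List[bool]  # correctness on each meaning-preserving rewrite


def correctness(results: List[ItemResult]) -> float:
    """CQ: share of items answered correctly on the original prompt."""
    return sum(r.correct_original for r in results) / len(results)


def robustness(results: List[ItemResult]) -> Optional[float]:
    """RS: share of perturbed variants still answered correctly,
    counted only over items that were correct to begin with."""
    kept = [r for r in results if r.correct_original]
    outcomes = [ok for r in kept for ok in r.correct_perturbed]
    if not outcomes:
        return None  # undefined when nothing was originally correct
    return sum(outcomes) / len(outcomes)


# Invented toy data: two items correct originally, two not.
toy = [
    ItemResult(True, [True, True]),
    ItemResult(True, [True, False]),
    ItemResult(False, [False, True]),   # excluded from RS
    ItemResult(False, [False, False]),  # excluded from RS
]
cq = correctness(toy)
rs = robustness(toy)
```

Because the RS denominator only contains items that were already correct, a model with very low CQ has its robustness estimated from a small and unrepresentative slice. That is the structural dependence the table flags.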

The framework measures behavior, not hidden neural intent

The paper’s framework is deliberately behavioral. It does not require access to weights, activations, logits, or internal attention patterns. That makes it less glamorous than mechanistic interpretability and much easier to deploy in ordinary model-selection work.

The experimental design evaluates seven models across 975 items from four benchmarks: GSM8K for arithmetic word problems, MMLU for multiple-choice reasoning subjects, StrategyQA for implicit multi-step commonsense reasoning, and a synthetic dataset constructed to stress robustness and consistency. The model set includes closed API models, a 70B open model accessed remotely, and two smaller local models run in float16 on a GTX 1650 with 4GB VRAM. All API models are evaluated at temperature 0.7 with a maximum of 256 new tokens.

Each component of the experimental design serves a likely purpose:

| Test component | Likely purpose | What it supports | What it does not prove |
| --- | --- | --- | --- |
| Multi-benchmark evaluation | Main evidence | Shows that dimensional profiles vary across task types | Does not establish every domain will show the same ranking |
| Repeated runs for CS and SS | Main evidence | Separates final-answer repeatability from trace-level semantic stability | Does not isolate behavior at lower temperatures |
| Semantic-preserving perturbations for RS | Robustness/sensitivity test | Tests whether correct answers survive rephrasing, synonym substitution, and syntactic changes | Does not cover all real enterprise prompt variation |
| NLI-based contradiction detection for LS | Implementation detail supporting main evidence | Gives an automated local-coherence signal | Does not prove global validity or causal use of the reasoning trace |
| Deployment-weighted aggregation | Exploratory operational extension | Shows how rankings change under different business priorities | Does not prove the weights are optimal for a real organization |
| Discriminant validity correlations | Validation test | Checks whether the six dimensions are mostly non-redundant | Uses only 28 model-dataset observations, so evidence is indicative rather than final |

This distinction keeps the paper from being overread. The framework is not a truth machine. It is an evaluation instrument. That is already useful. Most organizations do not need a metaphysical theory of model reasoning before they need a better pre-deployment checklist.

The headline result is not the winner; it is the profile shift

Claude-Haiku-4.5 ranks first across all aggregation strategies in the paper, with a balanced aggregate score of 0.778. It also leads on correctness at 0.872 and robustness at 0.963. That is a clean result, and clean results are always tempting to over-market.

But the more useful observation comes from models that look similar in one view and different in another. DeepSeek-V3 and Gemini-2.5-Flash have close balanced scores: 0.716 and 0.727. A casual leaderboard reader might treat them as near substitutes. The dimensional profile says otherwise. DeepSeek-V3 has strong correctness at 0.830, but lower logical coherence at 0.773 and efficiency at 0.437. Gemini-2.5-Flash has lower correctness at 0.808 but higher logical coherence at 0.819 and efficiency at 0.493.

That is the exact kind of distinction businesses usually discover too late. In a low-risk summarization workflow, perhaps the difference does not matter. In a legal assistant where reasoning traces may be reviewed, it matters rather a lot. “Near substitute” is not a property of the model alone. It is a property of the model inside a workflow.

The paper also reports a striking separation between consistency and stability. Consistency scores are uniformly low, ranging from 0.37 to 0.45. Stability scores are uniformly high, ranging from 0.828 to 0.928. In plainer terms: models may vary in their final answers across runs, while their reasoning traces remain semantically similar.

That is not a contradiction. It is a diagnostic. A system can keep telling roughly the same story while choosing different final answers. Anyone who has sat through a committee meeting has seen the human version.
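A small sketch makes the divergence concrete. The definitions here are assumptions chosen for illustration: consistency is taken as the share of items whose $K$ final answers all agree, and stability as the mean pairwise similarity between reasoning traces. Token-level Jaccard overlap is a crude stand-in for the semantic similarity the paper measures, and the toy runs are invented.

```python
from itertools import combinations


def consistency(answers_per_item):
    """Share of items whose K final answers are all identical (assumed definition)."""
    agree = sum(len(set(answers)) == 1 for answers in answers_per_item)
    return agree / len(answers_per_item)


def jaccard(a, b):
    """Token-set overlap; a rough proxy for semantic similarity."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 1.0


def stability(traces_per_item):
    """Mean pairwise trace similarity per item, averaged over items.
    Each item needs at least two runs."""
    per_item = []
    for traces in traces_per_item:
        pairs = list(combinations(traces, 2))
        per_item.append(sum(jaccard(a, b) for a, b in pairs) / len(pairs))
    return sum(per_item) / len(per_item)


# One invented item, K = 3 runs: same story, different final number.
answers = [["18", "20", "18"]]
traces = [[
    "add the two prices then apply the discount so the total is 18",
    "add the two prices then apply the discount so the total is 20",
    "add the two prices then apply the discount so the total is 18",
]]
cs = consistency(answers)
ss = stability(traces)
```

On this toy item the answers disagree, so consistency is zero, while the traces overlap almost entirely, so stability stays high. The same pattern at scale would produce the low-CS, high-SS profile described above.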

Logical coherence and correctness are not the same success

The paper’s most important interpretive result is the orthogonality between correctness and logical coherence. The reported correlation between CQ and LS is $r=-0.172$ and not significant. In the authors’ reading, this supports the broader chain-of-thought faithfulness literature: a correct final answer does not guarantee a coherent reasoning trace.

This is where business readers should slow down.

A correct output with an incoherent trace is not merely “good enough with a messy explanation.” In audit-sensitive contexts, the trace may become part of the product. A legal workflow may need to show why a clause applies. A medical triage assistant may need to preserve a reasoned path for review. A finance workflow may need to justify a risk classification. In those settings, the explanation is not decorative garnish. It is part of the control surface.

The reverse is also true. A model can produce locally coherent traces while still being wrong. Phi-2, for example, has much lower overall correctness than the leading API models, but achieves high logical coherence and stability in several settings. On StrategyQA, it reaches LS = 0.971, the highest LS of any model-dataset combination in the reported table, while its correctness is only 0.488.

That result should not be read as “Phi-2 reasons better than larger models.” The paper is more careful than that. LS is based on local step-to-step contradiction detection using a DeBERTa-v3-small cross-encoder fine-tuned on MNLI. It detects whether consecutive reasoning steps contradict each other. It does not determine whether the full argument is complete, causally faithful, or globally valid.

So the correct interpretation is narrower and more useful: local coherence can be high even when task correctness is low. That matters for diagnosis. A low-CQ, high-LS model may need better knowledge, retrieval, or task competence. A low-CQ, low-LS model may need reasoning-trace training or stricter process control. Accuracy alone cannot tell these apart.
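The "local, not global" boundary is easier to see in code. The sketch below assumes LS is one minus the share of consecutive step pairs flagged as contradictory, which may not match the paper's exact formula. The `contradicts` argument is a placeholder for an NLI classifier, and `toy_contradicts` is a deliberately naive string rule, not a real NLI model.

```python
from typing import Callable, List


def local_coherence(steps: List[str],
                    contradicts: Callable[[str, str], bool]) -> float:
    """1 minus the share of consecutive step pairs flagged as contradictory."""
    pairs = list(zip(steps, steps[1:]))
    if not pairs:
        return 1.0  # a single-step trace has nothing to contradict
    flagged = sum(contradicts(prev, nxt) for prev, nxt in pairs)
    return 1.0 - flagged / len(pairs)


def toy_contradicts(premise: str, hypothesis: str) -> bool:
    """Toy stand-in for an NLI model: flags 'X is A' followed by 'X is not A'."""
    return " not" in hypothesis and hypothesis.replace(" not", "") == premise


# Adjacent contradiction: caught.
adjacent = [
    "the clause is binding",
    "the clause is not binding",
    "so the answer is no",
]
# Same contradiction separated by one step: missed by a purely local check.
separated = [
    "the clause is binding",
    "the contract was signed in 2020",
    "the clause is not binding",
]
ls_adjacent = local_coherence(adjacent, toy_contradicts)
ls_separated = local_coherence(separated, toy_contradicts)
```

The second trace scores as perfectly coherent even though its first and last steps disagree. A consecutive-pair check cannot see that, which is why a high LS should be read as "no adjacent contradictions found" and nothing stronger.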

Dataset format changes what “reasoning quality” appears to mean

The per-dataset results are not just extra rows. They show that benchmark format changes which dimension becomes visible.

On MMLU, four API models achieve CQ above 0.99, and the constrained {A, B, C, D} answer format produces high robustness scores, ranging from 0.826 to 0.999. That does not make MMLU useless. It means that the format narrows the behavioral surface. When the answer space is constrained, robustness can partly reflect the stability of selecting among four labels rather than the model’s full ability to preserve reasoning under open-form variation.

On StrategyQA, efficiency is notably low across all models, ranging from 0.176 to 0.487. The interpretation is straightforward: commonsense multi-step questions tend to elicit longer traces without proportional correctness gains. For deployment, that is not an academic detail. If a workflow uses long-form reasoning on thousands of cases per day, low efficiency becomes cost, latency, and user patience converted into tokens.

The synthetic dataset plays a different role. It includes arithmetic variation, adversarial logical contradictions, and paraphrase robustness probes. Here, consistency scores are the highest among the datasets, ranging from 0.389 to 0.549, while API models maintain RS above 0.85 and local models show greater vulnerability. This is best read as a stress test, not a second thesis. The synthetic set helps expose where perturbation and structured tasks change behavior; it should not be treated as a universal proxy for every enterprise use case.

The broader lesson is familiar but often ignored: benchmark scores inherit the shape of the benchmark. A multiple-choice exam, a commonsense yes/no task, and a contradiction-heavy synthetic probe are not three neutral windows into the same abstract ability. They are three different instruments. The paper’s framework becomes useful because it lets us inspect what each instrument is actually pressing on.

Deployment weights turn model selection into a policy choice

The paper aggregates the six dimensions using several weighting schemes: balanced, safety-priority, accuracy-priority, efficiency-priority, medical triage, legal/compliance, and edge device/IoT. The exact weights are not sacred tablets. They are explicit policy assumptions.

That explicitness is the point.

For legal/compliance, the framework assigns 25% weight to consistency and 35% to logical coherence. Together they represent 60% of the composite score. For edge device/IoT, efficiency receives 50% weight. For medical triage, correctness and robustness dominate, with meaningful but smaller weight on logical coherence. A business can disagree with these weights. Good. Disagreement over visible assumptions is better than silent reliance on a leaderboard that pretends all risks are equal.

The key ranking inversion involves DeepSeek-V3 and GPT-4o-mini. Under accuracy-only evaluation, DeepSeek-V3 ranks second with CQ = 0.830, while GPT-4o-mini ranks fifth with CQ = 0.744. Under legal/compliance weighting, their ordering reverses. The paper attributes this to DeepSeek-V3 having the lowest logical coherence score, 0.773, and the second-lowest consistency score, 0.381, among the evaluated models.

This is the paper’s business argument in miniature. If the workflow cares about reproducible and inferentially sound reasoning, a model that looks better by accuracy can become worse by deployment fit. The “best model” is not a model name. It is a risk-weighted decision.
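A ranking inversion of this kind can be reproduced with a few lines of arithmetic. The sketch assumes the composite is a plain weighted sum of the six dimensions. Only the 25% consistency and 35% logical coherence weights come from the text above; the split of the remaining 40% is an assumption, and the two model profiles are invented for illustration, not taken from the paper's tables.

```python
DIMENSIONS = ("CQ", "CS", "RS", "LS", "ES", "SS")


def aggregate(profile, weights):
    """Weighted sum of dimension scores; weights must sum to 1."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(weights.get(d, 0.0) * profile[d] for d in DIMENSIONS)


def rank(profiles, weights):
    """Model names ordered from best to worst under the given weights."""
    return sorted(profiles,
                  key=lambda name: aggregate(profiles[name], weights),
                  reverse=True)


# Invented profiles: model_a is more accurate, model_b more coherent and consistent.
profiles = {
    "model_a": {"CQ": 0.85, "CS": 0.35, "RS": 0.90, "LS": 0.70, "ES": 0.45, "SS": 0.85},
    "model_b": {"CQ": 0.75, "CS": 0.50, "RS": 0.85, "LS": 0.90, "ES": 0.50, "SS": 0.90},
}

accuracy_only = {"CQ": 1.0}
# CS and LS weights follow the article; the other four are assumed.
legal_compliance = {"CQ": 0.20, "CS": 0.25, "RS": 0.10, "LS": 0.35, "ES": 0.05, "SS": 0.05}

by_accuracy = rank(profiles, accuracy_only)
by_legal = rank(profiles, legal_compliance)
```

Under accuracy-only weights `model_a` leads; under the legal/compliance weights `model_b` does. Nothing about either model changed. Only the stated cost of each failure did.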

| Business setting | Failure that matters most | Metric emphasis suggested by the paper | Procurement implication |
| --- | --- | --- | --- |
| Legal/compliance assistant | Inconsistent or locally contradictory reasoning trace | CS and LS | Do not select by final-answer accuracy alone |
| Medical triage support | Wrong answer under rephrased clinical descriptions | CQ and RS, with LS as audit support | Test robustness on domain-specific paraphrases |
| Edge device or constrained deployment | Expensive reasoning relative to correctness | ES | A smaller or cheaper model may be preferable if quality loss is acceptable |
| Internal research assistant | Broad usefulness across tasks | Balanced profile | Avoid models with one severe dimensional weakness |
| Customer-facing workflow automation | Volatile answers under similar requests | CS, RS, and SS | Evaluate repeated-run behavior before launch |

Cognaptus’ inference is that this framework is best used before model deployment, not after users complain. It turns model selection from “which model topped a benchmark?” into “which failure profile can this workflow survive?” That is less glamorous than a leaderboard. It is also how real systems stay out of trouble.

The discriminant validity test says the six dials are mostly not duplicates

A multi-dimensional framework only earns its keep if the dimensions are not merely renaming the same thing six times. The paper addresses this through a discriminant validity analysis over 28 model-dataset observations, using Pearson correlations with bootstrap confidence intervals.

The authors report that 11 of 15 dimension pairs show acceptable separation, using $|r|<0.50$ as the threshold. Five pairs are classified as independent and six as weak-but-acceptable. Two strong correlations are structural by design: CQ–RS and CQ–ES. Robustness is computed over originally correct instances, and efficiency embeds correctness in its formula. In other words, those correlations are not embarrassing. They are mechanical.

The more informative cases are the low or weak relationships. CQ–LS is $r=-0.172$ and not significant. CS–RS is $r=-0.091$ and not significant. LS–ES is $r=+0.040$ and not significant. These values support the paper’s claim that correctness, consistency across repeated runs, robustness to rephrasing, logical coherence, and efficiency are separable enough to serve as different diagnostic signals.

There is a boundary here. The analysis uses 28 observations formed from seven models across four datasets. Those observations are not fully independent in a strict psychometric sense because the same model families and benchmark domains recur. So the result should be read as initial construct-validity evidence, not a finished measurement science. Still, for a pre-deployment evaluation framework, “initial but useful” is a respectable place to begin. Evaluation practice has survived on much thinner comforts.
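For teams that want to rerun this kind of check on their own evaluation data, a minimal sketch of a Pearson correlation with a percentile bootstrap interval follows. It is one common way to do it and not necessarily the paper's exact procedure. The 28 paired scores are random placeholders, and the function names are hypothetical. The $|r|<0.50$ threshold is the one quoted above.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation; returns NaN if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / math.sqrt(sxx * syy)


def bootstrap_ci(x, y, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for r, resampling (x, y) pairs together."""
    rng = random.Random(seed)
    n = len(x)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        r = pearson([x[i] for i in idx], [y[i] for i in idx])
        if not math.isnan(r):
            stats.append(r)
    stats.sort()
    lo = stats[int((alpha / 2) * len(stats))]
    hi = stats[int((1 - alpha / 2) * len(stats)) - 1]
    return lo, hi


# Placeholder data: 28 unrelated score pairs standing in for model-dataset observations.
rng = random.Random(1)
dim_a = [rng.random() for _ in range(28)]
dim_b = [rng.random() for _ in range(28)]

r = pearson(dim_a, dim_b)
ci_low, ci_high = bootstrap_ci(dim_a, dim_b)
acceptably_separate = abs(r) < 0.50  # threshold quoted in the article
```

With only 28 points the interval is typically wide, which is the practical meaning of "indicative rather than final". Note also that resampling pairs as if they were independent ignores the repeated models and datasets mentioned above, so the interval is, if anything, optimistic.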

What a business should do with this paper

The practical move is not to copy the paper’s tables into a vendor comparison deck and declare victory. Please do not do that. Vendor comparison decks already suffer enough.

A better use is to convert the framework into a domain-specific evaluation loop:

1. Define the workflow risk profile. A customer-service chatbot, a legal memo assistant, and an edge-device diagnostic tool should not share the same metric weights.
2. Build a representative evaluation set. Include normal cases, ambiguous cases, domain paraphrases, and adversarial examples that preserve the correct answer.
3. Run repeated generations. One answer per prompt is not enough to measure consistency or stability.
4. Measure the six dimensions separately. Keep the profile visible before computing any aggregate.
5. Choose weights only after naming the cost of failure. The weight vector should reflect business risk, not aesthetic preference.
6. Treat the aggregate as a decision aid, not a moral ranking. A model with a lower aggregate may still be better for a constrained workflow.
7. Monitor after deployment. Production prompts drift. Users rephrase. Policies change. The six dials should become part of ongoing evaluation, not a one-time ceremony.
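Two of these steps, keeping the profile visible before aggregating and monitoring after deployment, reduce to simple checks. The sketch below is one possible shape for them. The function names, the floor values, the drift tolerance, and both profiles are hypothetical choices for illustration and do not come from the paper.

```python
def failing_dimensions(profile, floors):
    """Dimensions whose score falls below the workflow's stated minimum."""
    return sorted(d for d, floor in floors.items() if profile[d] < floor)


def drifted_dimensions(baseline, current, tolerance=0.05):
    """Dimensions that dropped by more than `tolerance` since the baseline run."""
    return sorted(d for d in baseline if baseline[d] - current[d] > tolerance)


# Invented pre-deployment profile and a later re-measurement on production prompts.
baseline = {"CQ": 0.80, "CS": 0.45, "RS": 0.90, "LS": 0.85, "ES": 0.50, "SS": 0.90}
current = {"CQ": 0.80, "CS": 0.45, "RS": 0.78, "LS": 0.85, "ES": 0.50, "SS": 0.90}

# Hypothetical floors for an audit-sensitive workflow.
legal_floors = {"CS": 0.50, "LS": 0.80}

blocked_on = failing_dimensions(baseline, legal_floors)
regressed = drifted_dimensions(baseline, current)
```

Running the floor check before any weighted score is computed keeps a single severe weakness from being averaged away. Running the drift check on a schedule turns the one-time evaluation into the ongoing one the last step asks for.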

The ROI relevance is modest but real. The paper does not claim to reduce training cost or improve model reasoning directly. Its value is cheaper diagnosis. If a team can identify that its main problem is robustness rather than correctness, or efficiency rather than coherence, it can avoid expensive and poorly targeted interventions. The right response may then be prompt redesign, retrieval augmentation, model switching, lower-temperature settings, fine-tuning, or simply refusing to deploy the model in a workflow where its failure profile is unacceptable.

This is where the framework becomes commercially interesting. Not because it produces a universal model ranking, but because it helps prevent the wrong kind of optimization.

Boundaries: useful instrument, not final judge

The paper is careful about several limitations, and they matter for business interpretation.

First, the model rankings are point estimates. The paper does not provide significance tests for between-model differences under every deployment scenario. Small score gaps should not be turned into dramatic procurement conclusions.

Second, logical coherence is local. The LS metric detects contradictions between consecutive reasoning steps. It does not verify whether the overall argument is complete, whether the explanation caused the answer, or whether the model is faithfully reporting its internal decision process. A high LS score means “locally non-contradictory trace,” not “deeply sound reasoning.” That distinction is not pedantic. It is the difference between checking grammar in a contract and checking whether the contract is enforceable.

Third, the consistency results depend on the sampling regime. All evaluations are conducted at temperature 0.7. The uniformly low CS scores may partly reflect that choice. A production system running at a lower temperature could behave differently. The framework still applies, but the numbers should be regenerated under the actual deployment configuration.

Fourth, domain transfer remains open. The benchmarks cover arithmetic, multiple-choice reasoning, commonsense questions, and a synthetic stress test. That is useful coverage, not universal coverage. A bank, hospital, law firm, insurer, or manufacturing company should validate the framework on domain-specific tasks before treating the profile as procurement evidence.

These boundaries do not weaken the main argument. They locate it. The paper does not give businesses a final answer. It gives them a better measuring instrument.

The real shift is from leaderboard selection to reasoning process control

The accepted misconception is that a high final-answer accuracy score means the model reasons reliably, coherently, and safely enough for serious deployment. The paper’s answer is: no, and the “no” is measurable.

Correctness, consistency, robustness, logical coherence, efficiency, and stability are not decorative submetrics. They represent different ways a reasoning system can fail. Some failures cost tokens. Some cost trust. Some cost auditability. Some cost the entire deployment.

For Cognaptus readers, the useful takeaway is not “use Claude-Haiku-4.5,” even though it performs strongly in this study. The useful takeaway is that model selection should become a diagnostic process. Decide which failure modes matter. Measure them separately. Aggregate only after the weights are tied to the workflow’s risk profile. Then keep monitoring because production reality has a charming habit of ruining benchmark assumptions.

Accuracy still matters. It just should not be asked to do six jobs at once. That is how one metric becomes a manager: overpromoted, underqualified, and strangely confident.

Cognaptus: Automate the Present, Incubate the Future.

¹ Ali Şenol, Garima Agrawal, and Huan Liu, “Measuring Reasoning Quality in LLMs: A Multi-Dimensional Behavioral Framework,” arXiv:2605.24661, 2026, https://arxiv.org/abs/2605.24661.
