The expensive model that thinks less at the wrong moment

Tokens are not wisdom. They are rented time.

Anyone who has paid for reasoning-model inference already understands the business version of this problem. A model spends hundreds or thousands of tokens circling a simple question, then compresses a genuinely compound task into a suspiciously neat answer. It looks thoughtful. It may even sound disciplined. But the bill arrives in one column and the error arrives in another.

The paper “When Reasoning Meets Its Laws” gives this annoyance a more precise shape.[1] Its central point is not that large reasoning models (LRMs) think too much. That would be too easy, and therefore probably wrong. The sharper claim is that current LRMs often fail to allocate reasoning effort according to the structure of the problem.

That distinction matters. A model that always thinks longer is wasteful. A model that always thinks shorter is brittle. A useful reasoning model should do something more boring and more valuable: spend compute in proportion to the work actually required.

The authors call this framework LoRe, short for Laws of Reasoning. The name sounds grand, almost as if reasoning has finally received traffic regulations. But the practical question is modest: when a task becomes harder, does the model spend more reasoning compute, and when two independent tasks are combined, does the model spend roughly the sum of the compute needed for each part?

That second question is where the paper becomes interesting.

The failure is not “too much thinking”; it is broken compute accounting

The common misconception is simple: if chain-of-thought is useful, then longer chain-of-thought should be better. In product language, one might say: turn on reasoning mode, increase the token budget, and watch reliability improve.

The paper pushes against that belief. The problem is not only length. It is allocation.

Suppose a model receives two independent sub-questions. If solving the first one normally takes about 400 reasoning tokens and solving the second one takes about 600, then a composed prompt asking both questions should require something close to 1,000 tokens, plus some connector overhead. Not exactly, of course. Natural language is messy, prompts introduce friction, and models do not operate like accounting software. But if the composite task receives less effort than one of its parts, something is wrong.
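The 400-plus-600 arithmetic can be turned into a quick check. This is a minimal sketch, assuming average reasoning-token counts have already been measured for each prompt; the function name `additivity_report` and the 10% `overhead` allowance are illustrative assumptions, not values from the paper.

```python
def additivity_report(tokens_q1, tokens_q2, tokens_composite, overhead=0.10):
    """Compare composite reasoning tokens with the sum of the parts.

    `overhead` is a hypothetical tolerance for connector text, not a paper value.
    """
    expected = tokens_q1 + tokens_q2
    ratio = tokens_composite / expected
    return {
        "expected": expected,
        "ratio": ratio,
        # Red flag from the text: the bundle got less effort than one of its parts.
        "below_larger_part": tokens_composite < max(tokens_q1, tokens_q2),
        "roughly_additive": abs(ratio - 1.0) <= overhead,
    }


# Hypothetical measurement: the composite prompt used only 520 reasoning tokens.
report = additivity_report(400, 600, 520)
print(report)
```

A composite that spends 520 tokens when the harder part alone needs about 600 trips the `below_larger_part` flag, which is the "something is wrong" case described above.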

This is the mechanism-first reading of the paper:

| Mechanism | What should happen | What can go wrong |
| --- | --- | --- |
| A harder variant of the same task appears | The model should usually spend more reasoning compute | It may fail to increase effort, especially in weaker models or certain domains |
| Two independent sub-tasks are combined | The model’s compute should be approximately additive | It may underthink the composite task or misallocate effort across parts |
| Accuracy follows complexity | Accuracy should usually fall as problems require more steps | Accuracy may degrade for reasons that are not visible from final-answer scores alone |
| Training examples are selected for compositional compute behavior | The model may learn better reasoning allocation | Gains may reflect the benchmark setting and teacher-generated traces, not universal business reliability |

This is not just a theoretical elegance problem. In business systems, many important prompts are naturally composite. “Review this contract and flag tax, liability, and renewal risks.” “Compare these four suppliers and recommend a procurement strategy.” “Read this financial statement, reconcile the variance, and draft an executive memo.” “Check this code, explain the bug, and propose a patch.”

These are not one-step questions. They are bundles. A model that handles each sub-task reasonably well in isolation can still fail when the bundle is presented as a single workflow. Very corporate, very relatable.

LoRe turns reasoning quality into two measurable behaviors

The paper begins by formalizing two desired relationships between problem complexity and model behavior.

The first is the compute law. In plain English: an efficient reasoning model should spend reasoning compute roughly in proportion to question complexity. The authors define reasoning compute as the expected number of reasoning tokens generated by the model. They then hypothesize that, for an optimal model, compute should scale approximately linearly with complexity:

$$ \text{reasoning compute} \approx \alpha \cdot \text{complexity} $$

with small overhead for introductions, transitions, and other scaffolding tokens.

The second is the accuracy law. If a task requires more primitive steps, and each step carries some probability of failure, then accuracy should decay with complexity. The paper frames this as an exponential relationship: as the number of required steps increases, the probability of getting everything right tends to fall.
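One way to see where the exponential form comes from is a sketch under a simplifying assumption: each of the $\kappa$ required primitive steps succeeds independently with the same probability $1-\epsilon$. The symbols $\kappa$, $\epsilon$, and $\lambda$ are introduced here for illustration and may not match the paper’s notation.

$$ \text{accuracy} \approx (1-\epsilon)^{\kappa} = e^{-\lambda \kappa}, \qquad \lambda = -\ln(1-\epsilon) > 0 $$

Taking logs gives $\log(\text{accuracy}) \approx -\lambda \kappa$, which is linear in complexity. That is why log accuracy, rather than raw accuracy, is the natural quantity to track against step count.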

The authors do not pretend that true complexity is easy to measure. Their formal definition uses the minimal number of primitive steps required by a fixed deterministic Turing-machine-style process. That is mathematically neat and practically inconvenient. Nobody in a company deployment pipeline is going to compute the true minimal primitive-step complexity of every customer-support prompt. If someone claims they are, check whether they have also invented infinite interns.

So the paper uses two tractable proxies.

First, monotonicity: if one task variant is more complex than another, the model’s reasoning compute should not decrease, and its accuracy should generally not improve simply because the task became more complex.

Second, compositionality: if two questions are independent, then the compute for answering both together should be approximately the sum of answering each separately. For accuracy, the corresponding relationship is multiplicative: the probability of answering both correctly should behave like the product of the probabilities of answering each one correctly.
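Both compositional forms follow from the two laws if the complexity of independent questions adds. A short derivation, ignoring overhead tokens and using notation assumed here for illustration: $\kappa(\cdot)$ for complexity, $C$ for reasoning compute, $A$ for accuracy, $\lambda$ for the decay rate of accuracy, and $q_1 \circ q_2$ for the composed question.

$$ \kappa(q_1 \circ q_2) = \kappa(q_1) + \kappa(q_2) \;\Rightarrow\; C(q_1 \circ q_2) \approx \alpha\,\kappa(q_1) + \alpha\,\kappa(q_2) = C(q_1) + C(q_2) $$

$$ A(q_1 \circ q_2) \approx e^{-\lambda\,[\kappa(q_1) + \kappa(q_2)]} = A(q_1)\,A(q_2) \;\Rightarrow\; \log A(q_1 \circ q_2) \approx \log A(q_1) + \log A(q_2) $$

So the product rule for accuracy is the same additive rule, applied to log accuracy.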

This is the paper’s most useful move. It converts “reasoning quality” from a vague behavioral label into something inspectable.

LoRe-Mono asks whether harder versions receive more effort

The first benchmark component, LoRe-Mono, tests monotonicity.

The authors construct seed questions across four domains: math, science, language, and code. Each seed question has 30 variants of increasing complexity. The construction is program-based and manually checked to avoid obvious shortcuts, such as periodic answer patterns that would let a model guess without doing the intended work.

The benchmark then checks how the variant index, which is designed to reflect increasing complexity, correlates with two quantities:

  1. reasoning compute;
  2. log accuracy.

This is one of the paper’s main evidence tests. Its purpose is not to prove that the authors have solved complexity theory for all prompts. It asks a narrower operational question: when the benchmark designer knows that later variants require more steps, do models usually spend more reasoning tokens and become less accurate?
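A monotonicity check of this kind can be sketched in a few lines. The code below is an illustration, not the paper’s evaluation script: it computes a Spearman correlation as a Pearson correlation on average ranks, and the measurements are invented for the example.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical measurements for six variants of one seed question.
variant_index = [1, 2, 3, 4, 5, 6]
mean_tokens = [310, 420, 405, 610, 790, 1020]
log_accuracy = [-0.02, -0.05, -0.04, -0.21, -0.45, -0.80]

compute_rho = spearman(variant_index, mean_tokens)
accuracy_rho = spearman(variant_index, log_accuracy)
print(round(compute_rho, 3), round(accuracy_rho, 3))
```

A monotonic model should show a compute correlation near +1 and a log-accuracy correlation near -1. The function divides by zero if either series is constant, so a production version would need a guard for that case.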

The answer is mostly yes.

Across the evaluated models, overall Spearman correlations between reasoning compute and complexity are generally high. For example, the reported overall compute correlations include 0.988 for Phi-4-mini, 0.991 for DeepSeek-R1-Distill-Qwen-7B, 0.990 for DeepSeek-R1-Distill-Qwen-14B, and 0.992 for Qwen3-Next-80B-A3B-Thinking. The smaller DeepSeek-R1-Distill-Qwen-1.5B model is the notable exception, with weaker monotonicity in some domains and an overall compute correlation of 0.875.

Accuracy mostly behaves as expected too. As variants become more complex, log accuracy generally falls. This supports the paper’s claim that current large reasoning models often understand “harder means more work” at the level of monotonic trends.

But this result should not be overread. Monotonicity is the easy part. It says the model often turns the compute dial in the right direction when a single task becomes more demanding. That is useful, but it does not guarantee that the model can manage compound work.

A junior analyst may know that a 30-page report takes longer than a 3-page memo. That does not mean they can combine market analysis, legal review, and financial modeling into one coherent project plan.

LoRe-Compo exposes the compound-task problem

The second benchmark component, LoRe-Compo, is where the paper earns its title.

LoRe-Compo is built from MATH500. The authors sample pairs of questions from distinct subject categories, using category separation as an operational proxy for independence. They then create triplets: question one, question two, and the composite prompt asking the model to answer both in sequence.

The test asks whether a model’s behavior follows the compositional rule:

$$ f(q_1 \circ q_2) \approx f(q_1) + f(q_2) $$

for reasoning compute, where $q_1 \circ q_2$ is the composed question. For log accuracy, the analogous compositional relationship is also measured. The paper reports deviation using normalized mean absolute deviation, where lower values mean better adherence to compositionality.
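To make the metric concrete, here is a hedged sketch of a normalized mean absolute deviation. The paper’s exact formula is not reproduced in this article, so the normalization below (mean absolute gap divided by the mean additive target) is an assumption, and the triplet values are invented.

```python
def normalized_mad(triplets):
    """Mean |f(q1∘q2) - (f(q1) + f(q2))| divided by mean |f(q1) + f(q2)|.

    Each triplet is (f_q1, f_q2, f_composite), for example mean reasoning tokens.
    The normalization is an assumed form, not necessarily the paper's definition.
    """
    gaps = [abs(fc - (f1 + f2)) for f1, f2, fc in triplets]
    targets = [abs(f1 + f2) for f1, f2, _ in triplets]
    # Dividing the sums is the same as dividing the means over equal counts.
    return sum(gaps) / sum(targets)


# Hypothetical token counts: (question one, question two, composite prompt).
triplets = [(400, 600, 1000), (300, 500, 560), (800, 200, 700)]
score = normalized_mad(triplets)
print(round(score, 3))
```

A score of zero means every composite matched the sum of its parts exactly. Because the targets are taken in absolute value, the same function can be applied to log accuracy, where all values are negative.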

This is not a robustness check. It is main evidence. It directly tests the mechanism the paper cares about: whether reasoning effort composes when tasks compose.

The result is uncomfortable. Current reasoning models largely fail this compositionality test.

The reported normalized deviations for reasoning compute are substantial across models: 0.528 for DeepSeek-R1-Distill-Qwen-1.5B, 0.337 for DeepSeek-R1-Distill-Qwen-7B, 0.423 for DeepSeek-R1-Distill-Llama-8B, 0.368 for DeepSeek-R1-Distill-Qwen-14B, 0.392 for Sky-T1-32B, and 0.411 for Qwen3-Next-80B-A3B-Thinking. The length-control models do better but still deviate: Thinkless-1.5B reports 0.339 and AdaptThink-7B reports 0.327.

That last detail matters. The paper is not simply saying “models need shorter answers” or “models need longer answers.” It shows that even models designed to control reasoning length do not automatically become compositional reasoners.

Length control is not the same as compute law compliance. A diet is not a metabolism.

The paper’s evidence stack is stronger when read by test purpose

The paper’s argument becomes clearer if we separate its tests by their likely purpose rather than treating every figure and appendix as interchangeable evidence.

| Evidence component | Likely purpose | What it supports | What it does not prove |
| --- | --- | --- | --- |
| LoRe theoretical framework | Conceptual and formal foundation | Reasoning compute and accuracy can be studied through monotonicity and compositionality | That real-world prompt complexity can be perfectly measured |
| LoRe-Mono | Main evidence for monotonicity | Many LRMs increase compute and lose accuracy as constructed task complexity rises | That models allocate compute correctly in compound workflows |
| LoRe-Compo | Main evidence for compositionality failure | Current LRMs often fail additive compute behavior on composed independent questions | That all enterprise tasks will fail in the same way |
| Length-control model comparison | Comparison with prior work | Generic reasoning length control does not guarantee compositionality | That length-control methods are useless |
| SFT-Compo | Intervention test | Selecting correct traces that obey compute compositionality can improve compositional behavior and benchmark accuracy | That the same recipe will transfer unchanged to every domain |
| Synergy analysis | Exploratory extension | Improving compute compositionality may also improve monotonicity and accuracy compositionality | That all reasoning laws causally improve each other in general |
| Appendix limitations and examples | Boundary and diagnostic support | Benchmark design, proxy choices, and qualitative failure cases are visible | That the benchmark covers closed-source models or broad business workflows |

This matters because the paper’s strongest practical lesson does not come from any single number. It comes from the pattern: models can look mostly sensible under monotonic scaling and still fail under composition.

That is exactly the kind of failure ordinary benchmark dashboards can miss.

SFT-Compo trains the model to respect additive reasoning effort

After diagnosing the compositionality problem, the authors propose SFT-Compo, a supervised fine-tuning method designed to enforce compute compositionality.

The method is simple in spirit. The authors build triplets of two sub-questions and their composite question. For each triplet, they sample outputs from a stronger teacher model. They keep only combinations where all three reasoning paths lead to correct answers. Among those correct combinations, they select the set whose reasoning lengths best satisfy the compositionality condition.

In other words, the training data is not merely “correct reasoning.” It is correct reasoning with better compute accounting.

That selection criterion is the important part. A normal supervised fine-tuning dataset might choose a correct trace at random. SFT-Compo chooses correct traces that better match the desired relationship among the sub-problems and the composite problem.
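The selection step can be sketched as follows. This is an illustration of the idea as described here, not the authors’ code: the dictionary keys, the absolute-gap criterion, and the exhaustive search over all combinations are assumptions made for the example.

```python
from itertools import product


def select_compositional_triplet(samples_q1, samples_q2, samples_composite):
    """Pick one trace per prompt so all are correct and lengths are most additive.

    Each sample is a dict with hypothetical keys: "text", "tokens", "correct".
    Returns None when no all-correct combination exists.
    """
    correct = [
        [s for s in group if s["correct"]]
        for group in (samples_q1, samples_q2, samples_composite)
    ]
    best, best_gap = None, None
    for t1, t2, tc in product(*correct):
        # Assumed criterion: absolute gap between composite length and the sum.
        gap = abs(tc["tokens"] - (t1["tokens"] + t2["tokens"]))
        if best_gap is None or gap < best_gap:
            best, best_gap = (t1, t2, tc), gap
    return best


# Hypothetical teacher samples for one triplet.
q1 = [{"text": "a1", "tokens": 380, "correct": True},
      {"text": "a2", "tokens": 900, "correct": True}]
q2 = [{"text": "b1", "tokens": 610, "correct": True},
      {"text": "b2", "tokens": 150, "correct": False}]
qc = [{"text": "c1", "tokens": 520, "correct": True},
      {"text": "c2", "tokens": 1010, "correct": True},
      {"text": "c3", "tokens": 990, "correct": False}]

chosen = select_compositional_triplet(q1, q2, qc)
print([trace["text"] for trace in chosen])
```

Note that the incorrect composite trace `c3` would match the additive target exactly, but it is filtered out first. Correctness is the gate; compute accounting only decides among the traces that pass it.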

The training setup is modest by frontier-model standards. The paper evaluates four LRMs: DeepSeek-R1-Distill-Qwen-1.5B, DeepSeek-R1-Distill-Qwen-7B, DeepSeek-R1-Distill-Llama-8B, and Phi-4-mini-reasoning. The authors construct a compositionality-enforced dataset of 3.9K question-output pairs using DeepScaler-derived triplets and a DeepSeek-R1-Distill-Qwen-14B teacher, then fine-tune for five epochs.

The experiment asks two questions:

  1. Can SFT-Compo reduce compositionality deviation?
  2. Does enforcing compositionality improve general reasoning performance, not just the benchmark property?

The paper reports yes to both.

On LoRe-Compo, SFT-Compo consistently lowers reasoning-compute deviation for the tested models. The paper’s figure shows the post-training models align more closely with the ideal additive line, especially in the 1.5B visualization.

Then comes the more practically interesting result: general reasoning accuracy also improves. Across six benchmarks—AIME 2024, AIME 2025, AMC 2023, MATH500, GSM8K, and OlympiadBench—SFT-Compo improves average Pass@1 for all four evaluated models.

The reported average Pass@1 changes are:

| Base model | Base average | Standard SFT average | SFT-Compo average |
| --- | --- | --- | --- |
| DeepSeek-R1-Distill-Qwen-1.5B | 47.6 | 49.3 | 52.4 |
| DeepSeek-R1-Distill-Qwen-7B | 61.5 | 63.5 | 64.7 |
| DeepSeek-R1-Distill-Llama-8B | 54.5 | 57.5 | 59.5 |
| Phi-4-mini-reasoning | 59.5 | 61.3 | 63.9 |

The control comparison is important. SFT-Compo outperforms standard SFT in all four cases. That suggests the gains are not only from distilling a stronger teacher. The compositional selection rule itself appears to matter.

For the 8B model, the average Pass@1 rises from 54.5 to 59.5. That is a 5.0-point absolute gain. In a production system, one would still need domain validation, but the directional lesson is valuable: training models to allocate reasoning effort structurally may improve final accuracy, not just make token usage prettier.
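The same subtraction can be run for every row, including the comparison against standard SFT that the control is meant to isolate. The scores below are copied from the reported averages above; the helper name `gains` is just for this example.

```python
# Average Pass@1 values as reported above: (base, standard SFT, SFT-Compo).
results = {
    "DeepSeek-R1-Distill-Qwen-1.5B": (47.6, 49.3, 52.4),
    "DeepSeek-R1-Distill-Qwen-7B": (61.5, 63.5, 64.7),
    "DeepSeek-R1-Distill-Llama-8B": (54.5, 57.5, 59.5),
    "Phi-4-mini-reasoning": (59.5, 61.3, 63.9),
}


def gains(base, standard_sft, sft_compo):
    """Absolute point gains of SFT-Compo, rounded to one decimal place."""
    return {
        "vs_base": round(sft_compo - base, 1),
        "vs_standard_sft": round(sft_compo - standard_sft, 1),
    }


summary = {model: gains(*scores) for model, scores in results.items()}
for model, row in summary.items():
    print(model, row)
```

Under this arithmetic, the margin over standard SFT ranges from 1.2 to 3.1 points, which is smaller than the gain over the base model in every row. That smaller number is the one that speaks to the selection rule itself.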

The synergy results are interesting, but they should not be oversold

The paper also reports a synergy analysis. After enforcing compositionality in reasoning compute, the authors observe improvements in other properties.

For the 1.5B model, SFT-Compo improves weak monotonicity behavior, especially in the code domain. The paper also reports that improving compute compositionality improves compositionality in log accuracy for the 1.5B and 7B models.

This is an exploratory extension, not the core thesis.

The tempting headline would be: “Teach compute compositionality and all reasoning laws improve.” That would be convenient. It would also be too clean. The evidence suggests interaction among reasoning properties, but it does not establish a universal causal chain across models, domains, and deployment settings.

A safer interpretation is better: compute allocation may be a useful training signal because it touches the process, not only the answer. If a model learns to spend effort in a more structurally appropriate way, accuracy-related properties may improve as a side effect. That is a promising hypothesis, not a procurement guarantee.

Business teams should evaluate reasoning process shape, not only final answer accuracy

The business implication is not “use SFT-Compo tomorrow.” The paper is more useful than that.

The immediate takeaway is that reasoning-model evaluation should include compute-behavior audits. Most enterprise evaluation still over-focuses on final-answer accuracy. That is understandable. Final answers are easy to score, easy to summarize, and easy to turn into executive slides. Unfortunately, they can hide the failure mode this paper identifies.

A model may answer enough isolated benchmark items correctly while still failing when work is bundled. It may also consume too many tokens on trivial subproblems and too few on the step that actually matters. The dashboard says “accuracy acceptable.” The invoice and the audit log disagree.

For Cognaptus-style automation work, this suggests a more disciplined evaluation layer:

| Evaluation question | Practical diagnostic | Business relevance |
| --- | --- | --- |
| Does the model spend more effort as task variants become harder? | Monotonicity-style test suites with controlled step counts | Detects underthinking and overthinking trends before deployment |
| Does compute add up when independent tasks are composed? | Composite prompts built from validated sub-tasks | Reveals whether workflow prompts are silently compressing important reasoning |
| Does length control fix the issue? | Compare base, short-reasoning, and adaptive-reasoning modes | Prevents confusing cheaper inference with better reasoning discipline |
| Does process-aware fine-tuning help? | Compare standard SFT with compositional trace selection | Tests whether the organization can improve reliability without changing model class |
| Does the result transfer to business tasks? | Domain-specific task bundles with expert labels and cost logs | Separates academic promise from operational ROI |

This is where the paper becomes relevant to analytics, coding, finance, legal review, procurement, and AI agents.

Agents are especially exposed. Agent workflows often chain subtasks: search, extract, compare, plan, execute, verify. If the model’s internal reasoning budget does not compose, then adding more tools may not solve the problem. The model may simply become a more elaborate way to underthink the most important combined step.

This is the part that should worry teams building “AI analyst” or “AI operations manager” products. The product demo may show competence on isolated actions. The real workflow asks the model to coordinate those actions. LoRe-Compo is a reminder that coordination has its own failure mode.

The ROI is not just fewer tokens; it is fewer silent compound failures

It is tempting to read this paper as a token-cost optimization story. That is partly true. If a model overthinks easy prompts, companies pay for wasted inference. If it underthinks hard prompts, companies pay later through review time, corrections, and user distrust.

But the deeper ROI is diagnostic.

A LoRe-style evaluation can help distinguish three different problems that often look identical from the outside:

| Symptom | Possible cause | Different operational response |
| --- | --- | --- |
| High token cost, acceptable accuracy | Overthinking on simple tasks | Add routing, early stopping, or shorter reasoning modes |
| Low token cost, poor compound-task accuracy | Underthinking on composed workflows | Split prompts, enforce intermediate checks, or fine-tune for compositional reasoning |
| Good isolated-task accuracy, poor workflow accuracy | Broken compositionality | Test bundles directly instead of extrapolating from sub-task benchmarks |
| Good benchmark accuracy, unstable cost | Poor compute discipline | Track token allocation by task type and complexity tier |
| Fine-tuning improves answers but not process | Distillation without structural alignment | Add trace-selection criteria tied to reasoning behavior |

That is the business value: not simply making the model cheaper, but knowing why it is expensive, brittle, or both.

A company does not need to adopt the paper’s exact benchmark to learn from it. It can build internal analogues. For a finance assistant, compose independent tasks such as ratio analysis, variance explanation, and covenant extraction. For a coding assistant, compose bug localization, patch generation, and regression-test reasoning. For a procurement assistant, compose supplier comparison, contract-risk review, and logistics feasibility.

Then measure not only correctness, but reasoning effort distribution.

A model that spends 70% of its reasoning on the easiest sub-task and rushes the hardest one is not “thinking deeply.” It is just narrating inefficiently.
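Measuring that distribution can be as simple as the sketch below. It assumes the reasoning trace has already been segmented by sub-task, which is the hard part in practice and is not addressed here; the sub-task names, token counts, and difficulty ranks are hypothetical.

```python
def effort_shares(tokens_by_subtask):
    """Return each sub-task's fraction of the total reasoning tokens."""
    total = sum(tokens_by_subtask.values())
    return {name: tokens / total for name, tokens in tokens_by_subtask.items()}


def flag_misallocation(tokens_by_subtask, difficulty_rank):
    """Flag the run if the easiest sub-task received the largest token share.

    `difficulty_rank` maps sub-task name to an expert-assigned rank (1 = easiest).
    """
    shares = effort_shares(tokens_by_subtask)
    easiest = min(difficulty_rank, key=difficulty_rank.get)
    biggest = max(shares, key=shares.get)
    return {"shares": shares, "easiest_dominates": easiest == biggest}


# Hypothetical finance-assistant bundle with expert difficulty ranks.
tokens = {"ratio_analysis": 1400, "variance_explanation": 400, "covenant_extraction": 200}
ranks = {"ratio_analysis": 1, "covenant_extraction": 2, "variance_explanation": 3}

audit = flag_misallocation(tokens, ranks)
print(audit)
```

Here the easiest sub-task absorbs 70% of the reasoning tokens, so the run is flagged. The flag is a deliberately crude rule; a fuller audit would compare shares against expected effort per sub-task rather than only checking which one is largest.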

Where the paper’s claims should stop

The paper is careful about several limitations, and those limits matter for business interpretation.

First, LoRe-Mono is synthetic and limited in seed coverage. The authors use carefully designed tasks across math, science, language, and code, but this is not the same as messy enterprise work. Synthetic control is useful because it makes monotonicity measurable. It is also a boundary.

Second, independence is operationalized through disjoint mathematical concepts in the compositional benchmark. That is practical, but not philosophically airtight. Two tasks from different categories may still share reasoning patterns; two tasks from the same category may be operationally independent in a business workflow. Companies applying the idea should define independence in domain terms, not simply copy subject labels.

Third, the evaluated models are open-source LRMs. The paper explicitly notes that closed-source models were not evaluated because of cost constraints. A team using proprietary models cannot assume identical behavior. It should test.

Fourth, SFT-Compo uses teacher-generated reasoning traces. The method’s success depends on the availability of correct traces and a selection process that can identify compositional length behavior. In domains where correctness is hard to verify, this becomes harder. Legal, medical, and strategic advisory tasks do not always provide clean answer keys.

Fifth, reasoning tokens are a proxy for compute, not a complete map of cognition. Models may perform hidden computation differently depending on architecture and inference settings. Token count is still operationally relevant because it affects cost, latency, and observable reasoning behavior, but it is not metaphysics. Thankfully, business teams rarely need metaphysics. They need fewer broken workflows.

The practical lesson: benchmark the bundle, not just the pieces

The strongest idea in “When Reasoning Meets Its Laws” is not the word “law.” It is the insistence that reasoning models should be evaluated by the shape of their effort.

Monotonicity asks whether harder tasks receive more work. Compositionality asks whether combined independent tasks receive combined effort. Current LRMs mostly pass the first test and struggle with the second. That is a valuable asymmetry.

For business users, the lesson is direct: do not infer workflow reliability from isolated-task competence. A model that answers each component well may still mishandle the composite prompt. A model that controls length may still fail to allocate compute structurally. A model that sounds reflective may still be doing bad internal budgeting.

The next generation of reasoning evaluation should therefore include three layers:

  1. final-answer accuracy;
  2. reasoning-cost behavior;
  3. compositional workflow stability.

The third layer is where many enterprise failures live. It is also where many demos politely avoid looking.

LoRe gives that failure mode a testable language. That is already useful. If future methods can train models not merely to think longer, but to think in proportion to the work before them, then reasoning models may become less theatrical and more operational.

A small disappointment for fans of AI drama, perhaps. A large improvement for everyone paying the invoice.

Cognaptus: Automate the Present, Incubate the Future.


  1. Junyu Zhang, Yifan Sun, Tianang Leng, Jingyan Shen, Liu Ziyin, Paul Pu Liang, and Huan Zhang, “When Reasoning Meets Its Laws,” arXiv:2512.17901, 2025.