Opening — Why this matters now
Reasoning models are supposed to think. That’s the selling point. More tokens, deeper chains, longer deliberation—surely that means better answers. Except it doesn’t. As Large Reasoning Models (LRMs) scale, something uncomfortable is emerging: they often think more when they should think less, and think less when problems are actually harder.
This paper introduces a rare thing in modern AI research: rules. Not heuristics. Not vibes. Actual laws. The Laws of Reasoning (LORE) attempt to formalize how reasoning should scale with problem complexity—and expose just how badly today’s models violate those expectations.
Background — From Chain-of-Thought to Chain-of-Confusion
The dominant paradigm in reasoning models is “thinking-then-answering,” popularized by Chain-of-Thought prompting. The implicit assumption has been simple:
- Harder problems → more reasoning
- Easier problems → less reasoning
Humans do this naturally. Models, apparently, do not.
The authors show striking examples where models spend more reasoning tokens on a simple sub-problem than on the harder question composed from it, and then perform worse on the composed whole. The issue isn’t lack of intelligence—it’s misallocation of reasoning compute.
The root cause? Training data. Chain-of-Thought traces are noisy, inconsistent, and largely unconstrained. Models are never explicitly taught how much to think—only how to imitate past thinking.
Analysis — The Laws of Reasoning (LORE)
The paper proposes two core laws:
1. Compute Law
Reasoning compute should scale linearly with problem complexity.
Formally:
$$ C(x) = \alpha \cdot \kappa(x) $$
Where:
- $C(x)$ = expected reasoning tokens
- $\kappa(x)$ = problem complexity (defined as minimal solution steps)
- $\alpha$ = a proportionality constant (roughly, tokens spent per solution step)
Since true complexity is unobservable, the paper introduces two testable proxies:
| Property | What it checks |
|---|---|
| Monotonicity | Harder problems require more compute |
| Compositionality | Independent problems add their compute when combined |
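Both proxies follow directly from the linear form. Writing $x_1 \oplus x_2$ for the composition of two independent problems (the notation used in the fine-tuning recipe below), and assuming the minimal solution steps of independent problems add when composed:

$$ \kappa(x_1) \le \kappa(x_2) \Rightarrow C(x_1) \le C(x_2), \qquad C(x_1 \oplus x_2) = \alpha \cdot \big( \kappa(x_1) + \kappa(x_2) \big) = C(x_1) + C(x_2) $$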
2. Accuracy Law
If each reasoning step can fail independently, overall accuracy should decay exponentially with complexity:
$$ A(x) = e^{-\lambda \kappa(x)} $$
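A quick sanity check of where the exponential form comes from (a sketch under the stated independence assumption, not the paper's exact derivation): if each step succeeds independently with probability $p$, and an answer is correct only when all $\kappa(x)$ steps succeed, then

$$ A(x) = p^{\kappa(x)} = e^{-\lambda \kappa(x)}, \qquad \lambda = -\ln p $$

For independent problems whose step counts add, this also gives $A(x_1 \oplus x_2) = A(x_1) \cdot A(x_2)$, which is exactly the compositionality property listed next.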
Again, two properties approximate this law:
- Accuracy monotonicity (harder → lower accuracy)
- Accuracy compositionality (independent tasks multiply accuracy)
Together, these form the LORE framework.
Findings — What models actually do
Monotonicity: Mostly fine
Using LORE-MONO, a synthetic benchmark with controlled complexity scaling, most modern LRMs show strong monotonicity:
- More steps → more reasoning tokens
- More steps → lower accuracy
Even small models mostly pass this test.
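To see what a monotonicity check can look like in practice, here is a rough sketch with made-up numbers (my own illustration, not the paper's evaluation code): rank-correlate complexity level with mean reasoning length and with accuracy.

```python
from scipy.stats import spearmanr

# Hypothetical per-complexity-level measurements: solution steps,
# mean reasoning tokens, and accuracy (illustrative numbers only).
levels = [
    {"steps": 2, "tokens": 310, "accuracy": 0.96},
    {"steps": 4, "tokens": 620, "accuracy": 0.91},
    {"steps": 6, "tokens": 980, "accuracy": 0.83},
    {"steps": 8, "tokens": 1450, "accuracy": 0.72},
]
steps = [lvl["steps"] for lvl in levels]

# Compute monotonicity: reasoning tokens should rise with steps (rho near +1).
rho_tokens, _ = spearmanr(steps, [lvl["tokens"] for lvl in levels])

# Accuracy monotonicity: accuracy should fall with steps (rho near -1).
rho_acc, _ = spearmanr(steps, [lvl["accuracy"] for lvl in levels])

print(f"compute monotonicity: {rho_tokens:+.2f}, accuracy monotonicity: {rho_acc:+.2f}")
```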
Compositionality: Completely broken
On LORE-COMPO, a benchmark built by composing independent math problems, the results are far worse.
Key observation:
For composed questions, models often generate less reasoning than for either sub-question alone.
Normalized Mean Absolute Deviation (nMAD) scores are large across models, including those explicitly designed for “efficient thinking.” In short: models do not add their thinking when problems are combined.
This explains real-world failures where models ace parts of a problem but collapse on the whole.
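To make the compositionality gap concrete, here is a minimal sketch of an nMAD-style check, assuming the metric measures how far the composed reasoning length deviates from the sum of the parts (the paper's exact normalization may differ):

```python
from statistics import mean

def reasoning_nmad(len_a, len_b, len_ab):
    """Mean absolute deviation of the composed reasoning length from the sum of the
    sub-questions' lengths, normalized by that sum (one plausible reading of nMAD,
    not necessarily the paper's exact formula)."""
    return mean(
        abs(ab - (a + b)) / (a + b)
        for a, b, ab in zip(len_a, len_b, len_ab)
    )

# Toy illustration: composed questions get barely more tokens than either part alone.
print(reasoning_nmad(len_a=[420, 510], len_b=[380, 300], len_ab=[450, 520]))  # ≈ 0.40
```

Under this reading, a model that perfectly adds its thinking when problems are combined would score near zero.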
Implementation — Enforcing the law (surprisingly easy)
The authors introduce SFT-Compo, a simple supervised fine-tuning strategy:
- Sample two independent questions $x_1, x_2$
- Create a composite question $x_1 \oplus x_2$
- Sample multiple reasoning paths $r_1$, $r_2$, $r_{12}$ for $x_1$, $x_2$, and $x_1 \oplus x_2$
- Keep only the paths that reach the correct answer
- Select the trio where:
$$ |\ell(r_1) + \ell(r_2) - \ell(r_{12})| $$
is minimized, with $\ell(\cdot)$ denoting reasoning length in tokens
This explicitly teaches the model that composed problems deserve composed reasoning.
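A minimal sketch of this selection step, assuming a hypothetical `generate(question, n)` sampler that returns `(reasoning, answer)` pairs from the current model; the function names and the composition template are illustrative, not the paper's:

```python
def sft_compo_triple(q1, ans1, q2, ans2, generate, n_samples=8):
    """Build one SFT-Compo training example: a composed question paired with the
    reasoning trace whose length is closest to the sum of the parts' lengths."""
    q12 = f"{q1}\n\nAlso answer: {q2}"   # x1 ⊕ x2 (the exact template is an assumption)
    ans12 = (ans1, ans2)

    # Sample multiple reasoning paths and keep only the correct ones.
    r1s = [r for r, a in generate(q1, n_samples) if a == ans1]
    r2s = [r for r, a in generate(q2, n_samples) if a == ans2]
    r12s = [r for r, a in generate(q12, n_samples) if a == ans12]
    if not (r1s and r2s and r12s):
        return None  # no fully correct trio for this pair; skip it

    length = lambda r: len(r.split())    # crude stand-in for token count ℓ(·)

    # Select the trio minimizing |ℓ(r1) + ℓ(r2) - ℓ(r12)|.
    _, _, best_r12 = min(
        ((r1, r2, r12) for r1 in r1s for r2 in r2s for r12 in r12s),
        key=lambda t: abs(length(t[0]) + length(t[1]) - length(t[2])),
    )
    return {"prompt": q12, "reasoning": best_r12, "answer": ans12}
```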
Results — Laws pay rent
After SFT-Compo:
- Reasoning compositionality improves by up to 40%
- Accuracy compositionality improves—even though it was never directly trained for
- Performance increases across GSM8K, MATH500, AIME, AMC, and OlympiadBench
| Model size | Avg. Pass@1 gain |
|---|---|
| 1.5B | +4.8 |
| 7B | +3.2 |
| 8B | +5.0 |
Notably, gains persist even when controlling for teacher distillation effects. The improvement comes from law compliance, not just better answers.
Implications — Why this matters for real systems
This paper quietly dismantles a core assumption in AI deployment:
More reasoning tokens automatically mean better reasoning.
They don’t. What matters is structured allocation of compute.
For businesses deploying agentic systems:
- Overthinking wastes latency and money
- Underthinking breaks compositional workflows
- Reasoning control must be law-driven, not heuristic-driven
LORE provides a principled foundation for:
- Reasoning budgets
- Agent orchestration
- Multi-step task decomposition
- Evaluation benchmarks beyond raw accuracy
Conclusion — Intelligence needs discipline
Reasoning models aren’t failing because they can’t think. They’re failing because no one taught them when and how much to think.
The Laws of Reasoning offer a missing layer between scaling laws and prompt hacks—a framework where intelligence is not just powerful, but well-behaved.
Expect future reasoning systems to be judged not by how long they think, but by whether their thinking obeys the law.
Cognaptus: Automate the Present, Incubate the Future.