Opening — Why this matters now

Reasoning models are supposed to think. That’s the selling point. More tokens, deeper chains, longer deliberation—surely that means better answers. Except it doesn’t. As Large Reasoning Models (LRMs) scale, something uncomfortable is emerging: they often think more when they should think less, and think less when problems are actually harder.

This paper introduces a rare thing in modern AI research: rules. Not heuristics. Not vibes. Actual laws. The Laws of Reasoning (LORE) attempt to formalize how reasoning should scale with problem complexity—and expose just how badly today’s models violate those expectations.

Background — From Chain-of-Thought to Chain-of-Confusion

The dominant paradigm in reasoning models is “thinking-then-answering,” popularized by Chain-of-Thought prompting. The implicit assumption has been simple:

  • Harder problems → more reasoning
  • Easier problems → less reasoning

Humans do this naturally. Models, apparently, do not.

The authors show striking examples where models spend more reasoning tokens on each simple sub-problem than on the harder problem composed from them, and then perform worse on the composed whole. The issue isn't a lack of intelligence; it's misallocation of reasoning compute.

The root cause? Training data. Chain-of-Thought traces are noisy, inconsistent, and largely unconstrained. Models are never explicitly taught how much to think—only how to imitate past thinking.

Analysis — The Laws of Reasoning (LORE)

The paper proposes two core laws:

1. Compute Law

Reasoning compute should scale linearly with problem complexity.

Formally:

$$ C(x) = \alpha \cdot \kappa(x) $$

Where:

  • $C(x)$ = expected reasoning tokens
  • $\kappa(x)$ = problem complexity (defined as minimal solution steps)
  • $\alpha$ = a proportionality constant (tokens spent per unit of complexity)

Since true complexity is unobservable, the paper introduces two testable proxies:

| Property | What it checks |
| --- | --- |
| Monotonicity | Harder problems require more compute |
| Compositionality | Independent problems add their compute when combined |
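
To make the two proxies concrete, here is a minimal sketch of how one might test them on measured reasoning-token counts. The function names, the example numbers, and the tolerance-free checks are illustrative assumptions, not the paper's implementation.

```python
from typing import Sequence

def is_monotonic(token_counts: Sequence[int]) -> bool:
    """Compute monotonicity: reasoning-token counts should not decrease as
    problem complexity (list ordering, easiest to hardest) increases."""
    return all(a <= b for a, b in zip(token_counts, token_counts[1:]))

def additivity_gap(len_a: int, len_b: int, len_ab: int) -> float:
    """Compute compositionality: for two independent problems, tokens spent on
    the composed problem should roughly equal the sum spent on the parts.
    Returns the deviation from perfect additivity, relative to that sum."""
    return abs(len_a + len_b - len_ab) / (len_a + len_b)

# Illustrative numbers only.
print(is_monotonic([120, 180, 260, 410]))  # True  -> compute monotonicity holds
print(additivity_gap(180, 260, 150))       # ~0.66 -> badly sub-additive
```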

2. Accuracy Law

If each reasoning step can fail independently, overall accuracy should decay exponentially with complexity:

$$ A(x) = e^{-\lambda \kappa(x)} $$

Again, two properties approximate this law:

  • Accuracy monotonicity (harder → lower accuracy)
  • Accuracy compositionality (independent tasks multiply accuracy)
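
Both properties follow from the law under a standard independence argument (sketched here from the formula above, not quoted from the paper). If each of the $\kappa(x)$ required steps succeeds independently with probability $p$, then

$$ A(x) = p^{\kappa(x)} = e^{-\lambda \kappa(x)}, \quad \lambda = -\ln p $$

so accuracy decays as complexity grows (monotonicity), and for independent problems whose complexities add,

$$ A(x_1 \oplus x_2) = e^{-\lambda (\kappa(x_1) + \kappa(x_2))} = A(x_1) \cdot A(x_2) $$

which is exactly the multiplication rule in the second bullet.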

Together, these form the LORE framework.

Findings — What models actually do

Monotonicity: Mostly fine

Using LORE-MONO, a synthetic benchmark with controlled complexity scaling, most modern LRMs show strong monotonicity:

  • More steps → more reasoning tokens
  • More steps → lower accuracy

Even small models mostly pass this test.

Compositionality: Completely broken

Using LORE-COMPO, built from independent math problems, the results are far worse.

Key observation:

For composed questions, models often generate less reasoning than for either sub-question alone.

Normalized Mean Absolute Deviation (nMAD) scores are large across models, including those explicitly designed for “efficient thinking.” In short: models do not add their thinking when problems are combined.
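
As a rough illustration of what such a score might look like (the paper's exact nMAD definition may differ; this normalization is an assumption), one plausible form averages the normalized additivity gap over a benchmark of composed questions:

```python
def nmad_style_score(triples: list[tuple[int, int, int]]) -> float:
    """Hypothetical nMAD-style score: average normalized deviation between
    the reasoning length on a composed question and the summed lengths on
    its two sub-questions. 0 = perfectly additive; larger = less compositional."""
    gaps = [abs(l1 + l2 - l12) / (l1 + l2) for l1, l2, l12 in triples]
    return sum(gaps) / len(gaps)

# (tokens on sub-question 1, tokens on sub-question 2, tokens on composed question)
print(nmad_style_score([(180, 260, 150), (90, 140, 130), (300, 220, 310)]))  # ≈ 0.50
```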

This explains real-world failures where models ace parts of a problem but collapse on the whole.

Implementation — Enforcing the law (surprisingly easy)

The authors introduce SFT-Compo, a simple supervised fine-tuning strategy:

  1. Sample two independent questions $x_1, x_2$
  2. Create a composite question $x_1 \oplus x_2$
  3. Sample multiple reasoning paths
  4. Keep only correct paths
  5. Select the trio $(r_1, r_2, r_{12})$ that minimizes the length gap

$$ |\ell(r_1) + \ell(r_2) - \ell(r_{12})| $$

where $\ell(\cdot)$ is the length of a reasoning path, measured in reasoning tokens.

This explicitly teaches the model that composed problems deserve composed reasoning.
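
A minimal sketch of this construction step follows. `sample_reasoning_paths` and `is_correct` are hypothetical helpers, `len` stands in for a token count, and plain concatenation stands in for the composition operator $\oplus$; this illustrates the selection rule, not the authors' code.

```python
import itertools

def build_compo_example(x1, x2, sample_reasoning_paths, is_correct, n=8):
    """Build one SFT-Compo training example: among correct reasoning paths,
    keep the trio (r1, r2, r12) whose lengths best satisfy
    len(r1) + len(r2) ≈ len(r12)."""
    x12 = f"{x1}\n\n{x2}"  # naive composition x1 ⊕ x2 (an assumption here)

    # Sample n candidate reasoning paths per question; keep only correct ones.
    candidates = {}
    for key, question in (("r1", x1), ("r2", x2), ("r12", x12)):
        paths = [r for r in sample_reasoning_paths(question, n)
                 if is_correct(question, r)]
        if not paths:
            return None  # no correct path sampled; skip this pair
        candidates[key] = paths

    # Select the trio that minimizes the additivity gap in reasoning length.
    r1, r2, r12 = min(
        itertools.product(candidates["r1"], candidates["r2"], candidates["r12"]),
        key=lambda trio: abs(len(trio[0]) + len(trio[1]) - len(trio[2])),
    )
    return {"question": x12, "reasoning": r12, "sub_reasoning": (r1, r2)}
```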

Results — Laws pay rent

After SFT-Compo:

  • Reasoning compositionality improves by up to 40%
  • Accuracy compositionality improves—even though it wasn’t directly trained
  • Performance increases across GSM8K, MATH500, AIME, AMC, and OlympiadBench

| Model | Avg. Pass@1 gain |
| --- | --- |
| 1.5B | +4.8 |
| 7B | +3.2 |
| 8B | +5.0 |

Notably, gains persist even when controlling for teacher distillation effects. The improvement comes from law compliance, not just better answers.

Implications — Why this matters for real systems

This paper quietly dismantles a core assumption in AI deployment:

More reasoning tokens automatically mean better reasoning.

They don’t. What matters is structured allocation of compute.

For businesses deploying agentic systems:

  • Overthinking wastes latency and money
  • Underthinking breaks compositional workflows
  • Reasoning control must be law-driven, not heuristic-driven

LORE provides a principled foundation for:

  • Reasoning budgets
  • Agent orchestration
  • Multi-step task decomposition
  • Evaluation benchmarks beyond raw accuracy
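
As one illustration of the first item, a reasoning budget (an application sketch, not something the paper specifies): estimate a task's complexity from its decomposition, then allocate thinking tokens proportionally, as the Compute Law suggests. The constant `alpha` and the floor/cap values are hypothetical.

```python
def reasoning_budget(step_estimates: list[int], alpha: float = 150.0,
                     floor: int = 256, cap: int = 8192) -> int:
    """Allocate a thinking-token budget for a multi-step task.

    step_estimates: estimated minimal solution steps per sub-task
    alpha: tokens per unit of complexity (hypothetical, tuned per model)
    """
    kappa = sum(step_estimates)  # complexities add for independent sub-tasks
    return max(floor, min(cap, int(alpha * kappa)))

# A three-part agent workflow estimated at 2, 5, and 3 steps.
print(reasoning_budget([2, 5, 3]))  # 1500 tokens under these assumptions
```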

Conclusion — Intelligence needs discipline

Reasoning models aren’t failing because they can’t think. They’re failing because no one taught them when and how much to think.

The Laws of Reasoning offer a missing layer between scaling laws and prompt hacks—a framework where intelligence is not just powerful, but well-behaved.

Expect future reasoning systems to be judged not by how long they think, but by whether their thinking obeys the law.

Cognaptus: Automate the Present, Incubate the Future.