Opening — Why this matters now

Large Language Models keep telling us they can “reason”—yet they break spectacularly the moment a question requires combining a simple fact stored in their weights with another supplied in the prompt. The industry’s response has been predictable: train bigger models, gather more data, sprinkle some RL on top, and pray.

This new paper—From Atomic to Composite: Reinforcement Learning Enables Generalization in Complementary Reasoning—politely shatters that illusion. It suggests something delightfully inconvenient: models don’t generalize because they’re big; they generalize because their training curriculum actually makes sense. And most current curricula do not.

Background — Context and prior art

Historically, post-training follows a comfortable formula: Supervised Fine-Tuning (SFT) to establish “behavior,” then Reinforcement Learning (RL) to sharpen it. The folk wisdom says:

  • SFT = teach the model facts + how to talk
  • RL = make the model smarter

The reality is less flattering. SFT often encourages memorization, not reasoning. RL tends to amplify whatever capabilities already exist, not invent logic from scratch. Prior evaluations blurred these lines because real data is messy—models may already “know” the answers through pretraining contamination.

This paper avoids the contamination trap entirely by constructing a fully synthetic world of human biographies. Every fact is fake. Every relation is invented. The boundary between parametric knowledge (in the model) and contextual knowledge (in the prompt) is controlled with laboratory precision.

That clean separation allows the authors to ask a deceptively simple question:

Can RL actually synthesize new reasoning skills? Or does it only amplify old ones?

Analysis — What the paper actually reveals

The authors decompose a seemingly complex reasoning capability—linking internal and external knowledge—into two atomic skills:

  • Parametric Reasoning (MEM): recalling facts stored inside the model parameters.
  • Contextual Reasoning (CTX): using new information supplied in the context.

They then define a third, composite skill:

  • Complementary Reasoning (COMP): combining MEM + CTX across multi-hop chains.

This framework enables precise testing across three levels (a toy sketch of the setup follows the list):

  1. I.I.D. (seen combinations)
  2. Compositional (new combinations of known relations)
  3. Zero-shot (completely unseen relations)
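To make the decomposition concrete, here is a minimal, hypothetical sketch of what such a synthetic setup could look like. The entity names, relations, and helper functions are illustrative assumptions, not the paper’s actual data generator; the point is only where each question’s facts live.

```python
# A toy synthetic "biography world": every fact below is invented, so a model
# cannot answer from pretraining. Names and relations are illustrative only.

# MEM facts: would be written into the model's parameters via SFT on QA pairs.
PARAMETRIC = {
    ("Alia Verent", "birthplace"): "Port Maren",
    ("Doran Filt", "employer"): "the Meridian Guild",
}

# CTX facts: never trained on; they appear only inside the prompt.
CONTEXTUAL = {
    ("Port Maren", "region"): "the Kessel Archipelago",
    ("the Meridian Guild", "headquarters"): "Solenne",
}

def mem_example(person, rel):
    """Atomic parametric reasoning (MEM): recall an internal fact."""
    return f"What is the {rel} of {person}?", PARAMETRIC[(person, rel)]

def ctx_example(entity, rel):
    """Atomic contextual reasoning (CTX): read a fact supplied in the prompt."""
    value = CONTEXTUAL[(entity, rel)]
    prompt = f"Context: The {rel} of {entity} is {value}.\nWhat is the {rel} of {entity}?"
    return prompt, value

def comp_example(person, rel1, rel2):
    """Complementary reasoning (COMP): hop 1 from memory, hop 2 from context."""
    bridge = PARAMETRIC[(person, rel1)]   # must be recalled, not given
    answer = CONTEXTUAL[(bridge, rel2)]   # given only in the prompt
    prompt = (f"Context: The {rel2} of {bridge} is {answer}.\n"
              f"What is the {rel2} of the {rel1} of {person}?")
    return prompt, answer

# The three levels then fall out of which combinations appear in training:
# I.I.D. reuses seen (entity, rel1, rel2) triples, Compositional recombines
# known relations into unseen pairs, Zero-shot holds out relations entirely.
print(comp_example("Alia Verent", "birthplace", "region"))
```

The key property is that a COMP answer cannot be produced from one source alone: the bridge entity lives only in the weights, while the second hop lives only in the prompt.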

The results show a stark pattern:

1. SFT memorizes; it does not generalize.

A model trained directly on COMP data achieves 90% I.I.D. accuracy but collapses to 18% on the Zero-shot split. It has essentially just memorized shortcut paths.

2. The SFT Generalization Paradox

Training on composite reasoning examples helps the model on the easy, in-distribution cases but cripples it on the harder, unseen ones. Training only on the atomic skills yields lower I.I.D. accuracy, yet it provides the foundation that generalizes best once RL is applied.

3. RL is not an amplifier; it is a synthesizer… under one condition.

RL successfully creates composite reasoning only when the base model has strong MEM + CTX foundations. Otherwise RL has nothing to build with.

The authors’ most important claim:

RL composes new reasoning strategies—but only if the model has mastered the atomic skills first.

This is the conceptual pivot the field has been missing.

Findings — Evidence with Visualization

To make the argument digestible, here is a synthesis of the paper’s key empirical results.

1. Performance of SFT vs. RL across generalization levels

Training Method          I.I.D.    Composition    Zero-shot
SFT (COMP only)          90%       76%            18%
SFT (MEM + CTX)          ~35%      ~28%           ~24%
SFT (MEM + CTX) → RL     ~73%      ~61%           51%

RL more than doubles Zero-shot performance, but only when MEM and CTX have been taught separately first.

2. RL learning curves show synthesis, not amplification

When plotting pass@k (the probability that at least one of k sampled answers is correct; a small estimator sketch follows this list):

  • Models trained only on COMP show the SFT and RL curves converging as k grows → RL is just amplifying what SFT memorized.
  • Models trained on MEM+CTX show a persistent gap even up to k=512 → RL has genuinely synthesized new reasoning paths.
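For readers who want to reproduce this kind of plot, the standard unbiased pass@k estimator can be computed as below; the sampling numbers are an illustrative assumption, not the paper’s exact protocol.

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples is correct,
    given n total samples of which c are correct."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a correct sample
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# Illustrative numbers: 512 samples per question, 20 of them correct.
print(pass_at_k(512, 20, 1))    # ~0.039 (single-sample accuracy)
print(pass_at_k(512, 20, 64))   # much higher once k grows
```

A persistent gap between two models at large k means the weaker model is not merely sampling less luckily; some reasoning paths are simply absent from it.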

3. PCA analysis reveals representational disentanglement

In the embedding space:

  • SFT(MEM+CTX) separates parametric and contextual representations—clean geometry.
  • SFT(COMP) entangles everything—no structural separation.
  • RL on top of disentangled features shifts COMP representations into stable alignment.

The geometry effectively mirrors the logic: you cannot compose what you have not separated.
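As a rough illustration of how such a probe can be run (the checkpoint, prompts, and tooling below are assumptions, not the paper’s exact analysis pipeline), one can project last-layer hidden states of MEM-style and CTX-style prompts with PCA:

```python
import numpy as np
import torch
from sklearn.decomposition import PCA
from transformers import AutoModel, AutoTokenizer

# Hypothetical probe: compare final-layer representations of prompts answerable
# from parameters (MEM-style) vs. from the supplied context (CTX-style).
MODEL_NAME = "gpt2"  # stand-in checkpoint; the paper analyzes its own fine-tuned models

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME).eval()

def last_token_state(prompt: str) -> np.ndarray:
    """Final-layer hidden state of the prompt's last token."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # shape (1, seq_len, dim)
    return hidden[0, -1].numpy()

mem_prompts = [  # parametric-style: no context given
    "What is the birthplace of Alia Verent?",
    "Who employs Doran Filt?",
]
ctx_prompts = [  # contextual-style: the needed fact sits in the prompt
    "Context: Port Maren lies in the Kessel Archipelago.\nWhere does Port Maren lie?",
    "Context: The Meridian Guild is headquartered in Solenne.\nWhere is it headquartered?",
]

X = np.stack([last_token_state(p) for p in mem_prompts + ctx_prompts])
coords = PCA(n_components=2).fit_transform(X)  # 2-D projection for inspection
print(coords)  # well-separated MEM vs. CTX clusters suggest disentangled skills
```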

Implications — Why this matters for the AI ecosystem

1. The RL arms race has been pointed in the wrong direction.

Teams have been dumping composite reasoning traces into massive RL pipelines, hoping for magic. This paper shows: you’re feeding the model dessert before vegetables.

If the atomic building blocks aren’t there, RL cannot rescue you.

2. Data strategy must prioritize atomic skill coverage.

For enterprise AI and agentic systems, this implies:

  • Train models first on clean, isolated, atomic capabilities.
  • Use RL only afterward to compose them.
  • Stop wasting tokens on expensive composite tasks early in training.

3. Generalization requires curriculum, not scale.

This finding echoes cognitive science more than machine learning.

Models don’t generalize because they’re big—they generalize because the training schedule makes the logic unavoidable.

4. A new recipe for enterprise-grade reasoning agents

For automation, compliance, and agentic workflows, the practical recipe becomes clear (a skeleton sketch follows the list):

  1. Identify the atomic skill types your agent must perform.
  2. Train them separately with SFT.
  3. Introduce RL only at the composition stage.
  4. Test generalization explicitly via Zero-shot splits.
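A minimal skeleton of that curriculum, in which every stage name, file name, and training function is a hypothetical placeholder to be wired into your own SFT and RL stack, not a real library API:

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Stage:
    name: str
    data: str                     # dataset tag or path for this stage
    train: Callable[[str], None]  # training routine to apply

def sft(data: str) -> None:
    print(f"[SFT] supervised fine-tuning on {data}")  # placeholder for your SFT trainer

def rl(data: str) -> None:
    print(f"[RL] reinforcement learning on {data}")   # placeholder for your RL trainer

CURRICULUM: List[Stage] = [
    Stage("atomic parametric (MEM)", "mem_facts.jsonl", sft),
    Stage("atomic contextual (CTX)", "ctx_qa.jsonl", sft),
    Stage("composition (COMP)", "comp_tasks.jsonl", rl),  # RL only at the composition stage
]

def run(curriculum: List[Stage]) -> None:
    for stage in curriculum:
        stage.train(stage.data)
    # Evaluate generalization explicitly on a zero-shot split: relations that
    # never appeared in any training stage.
    print("[EVAL] zero-shot split: held-out relations")

if __name__ == "__main__":
    run(CURRICULUM)
```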

This mirrors how organizations train analysts and operators: fundamentals first; scenario training second.

Conclusion — The missing piece of the LLM puzzle

This paper provides an unusually clear answer to a question the field has been hand-waving for years:

When does RL create new reasoning ability?

Answer:

When the model already knows the atomic pieces—and only then.

Reinforcement Learning is not a magic wand. It’s glue. But without the individual blocks—parametric recall and contextual interpretation—the glue has nothing to bind.

For developers of intelligent systems, this insight should reshape data pipelines, post-training strategy, and expectations of model behavior.

Cognaptus: Automate the Present, Incubate the Future.