From Building Blocks to Breakthroughs: Why RL Finally Teaches Models to Think

Training an AI model is often sold like a kitchen renovation: add more data, add reinforcement learning, install the shiny reasoning countertop, and suddenly the whole thing looks expensive enough to be intelligent.

This paper is useful because it ruins that brochure.

The authors of “Atomic Skills are the Prerequisite: When Reinforcement Learning Synthesizes Compositional Reasoning, and When It Only Amplifies” ask a deceptively simple question: does reinforcement learning create new reasoning ability, or does it only increase the probability of behaviors the model could already produce?¹ Their answer is not the clean slogan either camp wants. RL can synthesize new compositional reasoning, but only when the model has already learned the right underlying atomic skills. Without that foundation, RL mostly polishes whatever behavior already exists. Sometimes that is reasoning. Sometimes it is just a better-trained shortcut wearing a lab coat.

The paper studies this through a controlled task called Complementary Reasoning. The task requires a model to combine two kinds of knowledge: facts stored in the model’s parameters and facts supplied in the input context. That sounds familiar because it is exactly the pressure point inside many enterprise AI systems. RAG systems ask models to combine old internal knowledge with retrieved external documents. Continual-learning systems ask models to absorb new facts without losing old structure. Agent systems ask models to use instructions, memory, tools, and context in the same reasoning chain. In all three cases, the expensive part is not merely “knowing facts.” It is combining the right facts from the right source at the right step.

The paper’s central lesson is therefore not “RL is good.” That would be a very convenient misunderstanding, and convenience is where budgets go to die. The better lesson is:

RL becomes a reasoning synthesizer only after supervised training has separately built the components that need to be synthesized.

That distinction matters. It changes how we should think about training data, evaluation, RAG reliability, agent design, and the economics of post-training.

The mechanism: RL needs parts before it can assemble anything

The authors decompose the target skill into three pieces.

Skill	What it requires	Business analogue
Parametric Reasoning	Retrieve and use facts encoded in model weights	The model uses learned domain knowledge, internal policy logic, or embedded facts
Contextual Reasoning	Read and use new information supplied in the prompt or retrieved context	The model uses documents, search results, CRM notes, legal clauses, or tool outputs
Complementary Reasoning	Combine parametric and contextual knowledge in one multi-hop chain	The model answers a question that requires both its trained knowledge and live external evidence

The important word is combine. A model may be able to read a retrieved document. It may also be able to recall something stored during training. But a business workflow often needs both in sequence.

For example, a compliance assistant might need to use a newly uploaded internal policy memo, connect it with pre-trained regulatory concepts, then answer whether a transaction creates an approval obligation. A support assistant might need to read a customer’s latest ticket, connect it with known product architecture, and infer which team should handle the escalation. A research assistant might need to use a fresh paper and previously learned methodology to decide whether a result is meaningful or just statistically decorated smoke.

These are not pure retrieval tasks. They are not pure memory tasks. They are mixed-source reasoning tasks. That is why the paper’s “Complementary Reasoning” setup is more business-relevant than another leaderboard where models recite facts about public trivia with great confidence and occasional public humiliation.

The authors formalize the task using synthetic biographies built from a relational knowledge graph. The facts are fake, so pre-training contamination is controlled. The relations are semantically meaningful, such as spouse, mentor, occupation, sibling, university, or boss, so the task is not reduced to empty symbol manipulation. The model must answer multi-hop questions that traverse chains of relations. Some relations are stored through parametric training. Others appear only in the prompt context. Complementary questions require the model to bridge both.
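To make the setup concrete, here is a minimal sketch of a mixed-source multi-hop question. It is not the authors' data generator: the `Fact` class, the helper functions, and every name in the toy graph are invented for illustration. The only idea taken from the paper is that each fact lives either in the weights ("parametric") or in the prompt ("contextual"), and that a complementary question crosses both.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    head: str
    relation: str
    tail: str
    source: str  # "parametric" (trained into weights) or "contextual" (given in the prompt)


# Hypothetical toy graph; all entities and facts are made up.
FACTS = [
    Fact("Alia Renn", "mentor", "Doru Veyl", "parametric"),
    Fact("Doru Veyl", "spouse", "Mira Oske", "contextual"),
    Fact("Mira Oske", "university", "Halden Institute", "parametric"),
]


def follow_path(start, relations, facts):
    """Walk a chain of relations from `start`; return the final entity and the source of each hop."""
    index = {(f.head, f.relation): f for f in facts}
    entity, sources = start, []
    for relation in relations:
        fact = index[(entity, relation)]
        entity = fact.tail
        sources.append(fact.source)
    return entity, sources


def question_type(sources):
    """Label a question by which knowledge sources its hops draw on."""
    kinds = set(sources)
    if kinds == {"parametric"}:
        return "parametric"
    if kinds == {"contextual"}:
        return "contextual"
    return "complementary"


# "Which university did the spouse of Alia Renn's mentor attend?"
answer, sources = follow_path("Alia Renn", ["mentor", "spouse", "university"], FACTS)
print(answer, sources, question_type(sources))
```

The middle hop is the interesting one: the model has to leave its memory, read the prompt, and then return to memory with the entity it just found.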

This design is not glamorous. That is a feature. The point is to isolate the mechanism.

The first trap: direct SFT learns the path, not the skill

The authors first test whether supervised fine-tuning on the composite task is enough.

It performs well where memorization is useful. When the test relation path has been seen during training, direct SFT on Complementary Reasoning reaches about 90.30% IID accuracy. On composition tests, where individual relations have been seen but the exact path is new, it still reaches 76.25%. Then the trap snaps shut: on structural zero-shot tests, where at least one relation in the path was never seen in QA training, accuracy falls to 18.41%.
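The three test levels can be stated as a small rule. The sketch below follows the definitions as described above (seen path, seen relations in a new path, at least one unseen relation); the function name and the example paths are hypothetical, and the paper's actual split code may differ.

```python
def generalization_level(path, train_paths):
    """Classify a test relation path against the paths seen in QA training."""
    path = tuple(path)
    seen_paths = {tuple(p) for p in train_paths}
    seen_relations = {relation for p in seen_paths for relation in p}

    if path in seen_paths:
        return "iid"  # the exact relation path appeared in training
    if all(relation in seen_relations for relation in path):
        return "composition"  # every relation was seen, but not in this order or combination
    return "zero-shot"  # at least one relation never appeared in QA training


# Hypothetical training paths.
train = [("mentor", "spouse"), ("spouse", "university")]

print(generalization_level(("mentor", "spouse"), train))
print(generalization_level(("mentor", "university"), train))
print(generalization_level(("mentor", "boss"), train))
```

Splitting an internal benchmark this way is cheap, and it is the difference between measuring the demo path and measuring the workflow.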

That pattern is the paper’s first major signal. Direct SFT can teach a model to imitate seen composite reasoning paths. It does not reliably teach the model the more general algorithm for combining memory and context when the relation structure changes.

The comparison with atomic training is also revealing. A model trained only on Parametric and Contextual Reasoning, without direct composite training, performs poorly on the Complementary Reasoning test: 35.18% IID, 28.20% composition, and 24.07% zero-shot. This tells us two things at once.

First, atomic skills do not automatically compose. Just because a model can retrieve from memory and read from context separately does not mean it can weave them together in a chain.

Second, direct composite SFT does not solve the real generalization problem. It mostly learns the composite patterns it has seen. This is the “SFT generalization paradox” in the paper: the model that looks strong on familiar composite tasks is brittle where the task actually tests structural generalization.

For business users, this should sound unpleasantly familiar. Many enterprise demos work beautifully on prepared cases and then become oddly fragile when a real user asks the same type of question with one extra wrinkle. The system has learned the demo path, not the underlying workflow.

The paper’s core result: atomic-first RL beats composite-first training

The authors then test the training recipe that matters:

Use SFT to teach the model the two atomic skills: Parametric Reasoning and Contextual Reasoning.
Use RL on Complementary Reasoning data to incentivize the model to combine those skills.

The primary experiments use Qwen-2.5-1.5B with a standard SFT-then-RL pipeline, where RL is implemented with GRPO and binary outcome rewards. The comparison is against a more obvious approach: train directly on composite data first, then apply RL on the remaining composite data.
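For readers who have not met GRPO (Group Relative Policy Optimization), the part that matters here is how a binary outcome reward becomes a learning signal: each sampled answer is scored relative to the other samples for the same prompt. The sketch below shows only that advantage step, not the clipped policy-gradient objective or the KL term. The exact-match normalization and the use of a population standard deviation are assumptions, not details taken from the paper.

```python
import statistics


def binary_reward(prediction: str, gold: str) -> float:
    """1.0 for an exact match after trimming and lowercasing (assumed normalization), else 0.0."""
    return 1.0 if prediction.strip().lower() == gold.strip().lower() else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of samples drawn for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(reward - mean) / (std + eps) for reward in rewards]


# One prompt, four sampled answers, one of them correct.
print(group_relative_advantages([1.0, 0.0, 0.0, 0.0]))

# One prompt, four sampled answers, none correct: every advantage is zero.
print(group_relative_advantages([0.0, 0.0, 0.0, 0.0]))
```

The second call is a useful intuition pump. If the starting model never samples a correct answer for a prompt, the group carries no gradient signal at all, which is consistent with the paper's point that reward cannot stand in for a missing capability.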

Across data scales, the atomic-first recipe performs better, especially out of distribution. The paper reports that direct composite SFT can look competitive or even better on easier IID and composition settings when enough composite data is used. That is the seductive part. But the zero-shot setting exposes the difference: the atomic-first recipe keeps improving where composite-first training remains much more brittle.

The mechanism-first interpretation is simple:

Composite SFT teaches the model specific paths.
Atomic SFT teaches the model separable capabilities.
RL on composite tasks then pressures the model to assemble those capabilities under reward.

In other words, RL needs something to assemble. Without separable parts, it cannot magically invent clean internal structure from a pile of memorized composite examples. Shocking, I know: even the miracle machine prefers raw materials.

The decisive distinction: synthesizer versus amplifier

The most important section of the paper is not just the headline result. It is the analysis of whether RL is acting as a synthesizer or merely an amplifier.

The authors use pass@k analysis to compare model behavior before and after RL. The logic is useful. If an SFT model already contains the correct reasoning behavior somewhere in its output distribution, then sampling more attempts should eventually reveal it. In that case, RL is mostly amplifying a latent behavior: making the correct path more likely. But if the RL model continues to outperform the SFT model even when the SFT model gets many chances, that suggests RL has induced a more fundamental behavioral shift.

The result depends on the starting point.

When RL starts from the atomic-skill model, the post-RL model remains significantly ahead even as the number of attempts increases. The authors interpret this as evidence that RL synthesized a new bridging mechanism between parametric and contextual reasoning.

When RL starts from the composite-SFT model, the pass@k curves converge. Given enough attempts, the SFT model can match the RL model. In that setting, RL behaves more like an amplifier: it increases the probability of patterns already present in the SFT distribution.
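The amplifier-versus-synthesizer test is easy to reproduce on your own model. A minimal sketch, assuming the standard unbiased pass@k estimator from the code-generation literature (the paper's exact estimator is not stated here), with hypothetical per-question counts of correct samples:

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples is correct, given c correct out of n drawn."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts, n, k):
    """Average pass@k over questions, one correct-sample count per question."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Hypothetical counts of correct samples out of N = 8 attempts on four questions.
N = 8
sft_latent = [1, 1, 2, 1]   # the right answer is rare but present
sft_missing = [0, 0, 1, 0]  # the right answer is mostly absent

for k in (1, 4, 8):
    print(k, mean_pass_at_k(sft_latent, N, k), mean_pass_at_k(sft_missing, N, k))
```

In the first case the curve climbs toward the ceiling as k grows, so an RL model that wins at k = 1 is mostly amplifying. In the second case extra samples do not help, so a persistent post-RL gap points to behavior the starting model could not produce.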

This resolves the lazy debate. “RL synthesizes reasoning” and “RL only amplifies existing behavior” are both incomplete. The paper argues that RL does both under different prerequisites.

Starting foundation	What RL mostly does	Practical reading
Atomic skills are separately learned	Synthesizes a new composite strategy	RL can convert capabilities into workflows
Composite paths are memorized through SFT	Amplifies existing behavior	RL improves reliability on familiar patterns but does not necessarily create robust generalization
One atomic skill is missing	Generalization collapses	Reward cannot compensate for missing capability substrate

That last row is where the business lesson becomes sharp.

The ablations show the prerequisite is real, not cosmetic

The paper’s ablation-style tests ask whether the atomic prerequisite is truly necessary. The authors compare models trained with only Parametric Reasoning, only Contextual Reasoning, direct composite data, or the full atomic pair before RL. They then apply the same RL procedure using the same amount of Complementary Reasoning data.

The answer is not subtle. Removing either atomic skill collapses generalization. Models with only memory or only context do not substantially generalize after RL. More interestingly, some direct-composite baselines have similar initial performance before RL, but still gain little after RL. The authors argue that initial task accuracy is not the key predictor. The key predictor is whether the model has the right underlying atomic foundation.

This matters because organizations often evaluate models by surface task performance before investing in improvement. “This model already scores 70% on our workflow benchmark, so RL should take it to 90%.” Maybe. Or maybe that 70% is mostly shortcut familiarity. If the model has not learned the underlying components separately, RL may just make its shortcuts more consistent.

The paper also compares RL with further SFT and LoRA on the same composite data after the atomic foundation exists. Further SFT does well on IID performance, which is exactly where memorized patterns help. RL performs much better on unseen combinations. The likely purpose of that test is not to prove LoRA is bad or SFT is obsolete. It is an ablation of the training condition: once the atomic skills exist, RL is the part that most strongly incentivizes out-of-distribution composition.

The paper’s evidence can be read as a sequence:

Test	Likely purpose	What it supports	What it does not prove
Direct SFT on Complementary Reasoning	Main baseline	SFT can memorize seen composite paths but fails in structural zero-shot	That all SFT is useless
Atomic-first SFT then RL	Main evidence	RL can synthesize composite reasoning from separable atomic skills	That RL works without careful data design
Missing-atomic-skill baselines	Ablation	Both memory and context skills are necessary prerequisites	That these are the only atomic skills in real-world workflows
RL versus further SFT/LoRA	Training-condition comparison	RL is better for OOD composition once atomic skills exist	That SFT/LoRA have no role in production
Pass@k analysis	Mechanistic evidence	RL’s role changes from synthesizer to amplifier depending on the base model	That the internal neural mechanism is fully localized
Scaling, Llama, longer-hop, seed tests	Robustness/sensitivity tests	The pattern is not obviously a one-run artifact	That the result transfers unchanged to frontier models or messy enterprise data

That distinction between “supports” and “does not prove” is not academic politeness. It prevents very expensive misreadings.

Error analysis: successful RL moves the failure later

One of the more useful behavioral analyses looks at where models fail inside the reasoning chain. The authors classify errors as contextual or parametric and record the progress point at which the first mistake appears.

Models without the right atomic-grounded RL tend to fail early, often around contextual retrieval or the first bridge between memory and context. In the reported error analysis, several weaker settings show contextual errors dominating: 90%, 86%, and 86% of errors in those rows are contextual, with failures appearing relatively early in the chain.

The atomic-first RL model looks different. Its remaining errors shift toward parametric recall: 70% of errors are parametric, and the first failure occurs much later in the chain, with progress at 71.8%. This is a small but important diagnostic. The model is not merely becoming “better” in aggregate. It is failing in a different place.
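This kind of diagnosis is simple to approximate when the evaluation set records the gold entity and the knowledge source for each hop. The sketch below is an assumption-laden illustration, not the authors' analysis code: "progress" is defined here as the fraction of hops completed correctly before the first mistake, and the example chains are invented.

```python
from collections import Counter


def first_failure(predicted_hops, gold_hops, hop_sources):
    """Return (source, progress) for the first wrong or missing hop, or None if every hop is right."""
    for i, gold in enumerate(gold_hops):
        if i >= len(predicted_hops) or predicted_hops[i] != gold:
            return hop_sources[i], i / len(gold_hops)
    return None


def summarize(failures):
    """Share of first errors by source, plus the mean progress at which they occur."""
    failures = [f for f in failures if f is not None]
    if not failures:
        return {}
    counts = Counter(source for source, _ in failures)
    total = len(failures)
    return {
        "share_by_source": {source: count / total for source, count in counts.items()},
        "mean_progress": sum(progress for _, progress in failures) / total,
    }


# Hypothetical three-hop chain: memory, then context, then memory.
sources = ["parametric", "contextual", "parametric"]
gold = ["Doru Veyl", "Mira Oske", "Halden Institute"]

early = first_failure(["Doru Veyl", "Someone Else", "Halden Institute"], gold, sources)
late = first_failure(["Doru Veyl", "Mira Oske", "Wrong School"], gold, sources)
print(summarize([early, late]))
```

Two systems with the same final accuracy can produce very different summaries here, and that difference is what tells you which part of the stack to repair.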

For a business system, that difference matters because it changes the repair strategy. If failures happen early in contextual reading, you look at retrieval quality, document formatting, chunking, context compression, and instruction discipline. If failures happen late in parametric recall, you look at domain pretraining, internal knowledge coverage, calibration, and whether the model should rely less on memory and more on tools. Aggregate accuracy alone would hide this.

This is where the paper quietly suggests a better evaluation practice: do not only ask whether the model got the final answer right. Ask which source of knowledge failed first.

Why this matters for RAG: retrieval is not enough

The paper is especially relevant to RAG, but not because it says “RAG needs RL.” That would be too easy, and therefore suspicious.

The more precise implication is that RAG reliability depends on three separable abilities:

The model must understand and use retrieved context.
The model must retain or access the necessary background knowledge.
The model must compose retrieved and internal knowledge across multi-step paths.

Most RAG evaluation measures the first and third abilities together, as one blended outcome. Did the answer use the retrieved document? Was the final answer right? But this paper suggests that those blended metrics can conceal different failure modes.

A model may read retrieved context well but fail to connect it with domain knowledge. Another model may know the domain but ignore the retrieved update. A third may do both separately and still fail when the chain alternates between them. These are operationally different systems, even if they produce the same score on a shallow test set.

Cognaptus inference: for enterprise RAG, the training and evaluation pipeline should not begin with complex end-to-end tasks only. It should first audit atomic skills. Can the model reliably extract facts from fresh context? Can it retrieve or apply stable background concepts? Can it execute one-hop and two-hop reasoning separately within each source? Only after those components are measurable does it make sense to use RL or preference optimization on full composite workflows.

The business value is not “more advanced training.” It is cheaper diagnosis. If a model fails a composite workflow, you want to know whether to improve retrieval, fine-tune domain concepts, redesign prompts, add tool calls, or collect reward data. Without atomic tests, every failure becomes a vague “the AI is not good enough” problem. That is not an evaluation. That is a group therapy session with invoices.

The data lesson: expensive traces are not always the first thing to collect

One practical implication is about data collection.

Complex reasoning traces are expensive. They require domain experts, careful annotation, multi-step correctness checks, and often a depressing number of meetings about edge cases. The paper suggests a more staged strategy: first build atomic skills with cleaner, cheaper, more modular data; then use a smaller amount of composite task data with RL to induce the combination behavior.

The sample-efficiency results support this direction. The authors report that atomic learning requires less SFT data to prime RL generalization than direct composite learning. In their few-shot adaptation analysis, once the atomic foundation exists, even small amounts of composite data can trigger improvement, and using 10% of the composite data can roughly match an upper-bound baseline trained on the full composite dataset in average accuracy across the generalization levels.

This does not mean an enterprise can skip complex workflow examples. It means complex examples should not be used as a substitute for knowing what the model has already learned. If the atomic foundation is missing, composite traces may teach brittle imitation. If the atomic foundation exists, the same composite reward data can go much further.

A practical training roadmap would look like this:

Stage	Question to answer	Data type	Decision
Atomic audit	Can the model separately handle memory-like knowledge and fresh context?	Focused QA, extraction, domain concept tests, document-use tests	Identify missing substrate before tuning
Atomic SFT or adaptation	Can weak atomic skills be strengthened cleanly?	Modular examples for each capability	Build separable components
Composite RL or preference training	Can the model combine components under reward?	Multi-step workflow cases with outcome rewards	Incentivize assembly and generalization
Failure localization	Where does the first error occur?	Step-level traces or structured evaluation	Repair retrieval, memory, or composition specifically
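The roadmap above can be read as a gate: do not pay for composite reward data until the atomic audit passes. A minimal sketch of that gate follows. The function name, the inputs, and the 0.8 threshold are all hypothetical; a real threshold would depend on the workflow and its risk tolerance.

```python
def next_stage(parametric_acc, contextual_acc, composite_zero_shot_acc, threshold=0.8):
    """Suggest the next training stage from audit accuracies in [0, 1]."""
    weak = [
        name
        for name, acc in (("parametric", parametric_acc), ("contextual", contextual_acc))
        if acc < threshold
    ]
    if weak:
        # A missing atomic skill comes first: reward is unlikely to compensate for it.
        return "atomic SFT or adaptation: " + ", ".join(weak)
    if composite_zero_shot_acc < threshold:
        # Parts exist but are not assembled on unseen structures.
        return "composite RL or preference training"
    return "failure localization on remaining errors"


print(next_stage(0.92, 0.61, 0.30))
print(next_stage(0.92, 0.90, 0.30))
print(next_stage(0.92, 0.90, 0.85))
```

The rule is deliberately crude. Its value is in forcing the three numbers to exist before anyone approves the RL budget.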

This is not as exciting as “RL will make the model think.” It is more useful, which is often less exciting.

The robustness tests strengthen the pattern, but not beyond its scope

The appendix does important work, and it should not be treated as decorative furniture.

The model-scaling analysis across Qwen-2.5 models from 0.5B to 3B is a robustness test. It supports the claim that the atomic-first pattern is not only a 1.5B accident. The authors report that the advantage remains significant at 3B, including an approximate 13% zero-shot accuracy gap.

The Llama-3.2-1B replication is a model-diversity check. The table shows the same broad pattern: composite SFT can favor easier IID and composition settings, while atomic-first RL improves zero-shot generalization. In the reported Llama run, the atomic-first RL setup reaches 36.93% zero-shot, compared with 17.10% for the composite-SFT-plus-RL counterpart.

The longer-hop analysis is a sensitivity test for reasoning length. Since the training distribution is skewed toward shorter paths, 4-hop and 5-hop cases are relatively rare. The atomic-first recipe degrades more gracefully on longer chains; in the zero-shot 5-hop setting, it reaches 30.38%, compared with 15.19% for the composite-SFT-plus-RL setup.

The bootstrapping and multi-seed analyses are statistical robustness checks. The reported 95% confidence intervals for zero-shot accuracy do not overlap: [22.70, 40.34] for the baseline versus [44.83, 52.77] for the atomic-first recipe. The composition intervals overlap, which is worth noting. The strongest statistical separation is in IID and zero-shot, especially zero-shot.
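The same check is worth running on any internal benchmark before declaring one recipe the winner. Below is a minimal sketch of a percentile bootstrap over per-question outcomes, assuming 0/1 correctness scores. Whether the paper used this exact bootstrap variant is not stated here, so treat the method as an assumption; the two intervals in the overlap check are the ones reported above.

```python
import random


def bootstrap_ci(outcomes, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of per-question 0/1 outcomes."""
    rng = random.Random(seed)
    n = len(outcomes)
    means = sorted(sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples))
    lower = means[round((alpha / 2) * n_resamples)]
    upper = means[round((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


def intervals_overlap(a, b):
    """True if closed intervals a = (lo, hi) and b = (lo, hi) share any point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Hypothetical evaluation: 70 correct answers out of 100 questions.
print(bootstrap_ci([1] * 70 + [0] * 30))

# Zero-shot intervals reported in the paper, in accuracy percent.
print(intervals_overlap((22.70, 40.34), (44.83, 52.77)))
```

Non-overlapping intervals are a conservative signal of a real difference. Overlapping ones, as in the composition setting, mean the comparison needs more data or a paired test before anyone builds a roadmap on it.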

The mechanistic appendix is more exploratory. Training dynamics suggest that complementary ability begins to emerge as both atomic skills improve. Checkpoint analysis suggests that the base model needs enough SFT “incubation” before RL generalization appears, but lower SFT loss does not improve things indefinitely. PCA analysis suggests that atomic-first training creates more separated representations for parametric and contextual reasoning, while composite-only training leaves them more entangled. Entropy analysis argues that RL success is not merely about starting from a more uncertain model; the atomic foundation seems to structure the search space in a way RL can exploit.

These analyses support the story. They do not fully localize the internal neural mechanism. The authors are careful enough to leave that as future work. A rare and welcome event: a paper making a strong claim without trying to own the entire universe.

Where the result applies, and where it does not yet

The paper’s controlled design is its strength and its boundary.

The synthetic biography setup avoids pre-training contamination and cleanly separates parametric from contextual knowledge. That makes the causal interpretation much easier. But synthetic biographies are not enterprise document corpora. Exact-match multi-hop QA is not the same as legal analysis, financial advisory workflows, software debugging, or customer support escalation. The primary experiments use small open models, with robustness checks on nearby model scales and architectures. The result may transfer, but transfer is not automatic.

The paper directly shows that, in a controlled semantic-synthetic QA environment, RL can synthesize complementary reasoning when both atomic skills have been learned, and mostly amplifies when the base model lacks that structure.

Cognaptus infers that similar logic should guide enterprise RAG and agent training: measure and strengthen atomic capabilities before spending heavily on composite reward data.

What remains uncertain is the mapping from this clean setup to messy production systems. Real enterprise workflows involve noisy documents, conflicting evidence, permissions, missing context, tool errors, vague user intent, and business rules that change faster than the documentation team admits. The “atomic skills” in those settings may be more numerous than memory and context. They may include tool selection, schema understanding, refusal behavior, temporal reasoning, source ranking, and policy compliance.

So the practical takeaway is not to copy the paper’s dataset. It is to copy its discipline: decompose the capability before optimizing the composite behavior.

The business interpretation: audit before reinforcement

The most tempting reading of this paper is that RL is the missing ingredient for enterprise reasoning. That reading is half true, which makes it dangerous.

RL is useful when the model has the parts but has not learned how to assemble them. It is much less useful when the parts are missing or entangled. In business terms, RL is not a substitute for capability accounting.

Before launching a costly post-training project, teams should ask:

Which atomic capabilities does the workflow require?
Can the model perform each capability independently?
Does the evaluation distinguish seen workflow patterns from unseen compositions?
Are failures caused by context use, internal knowledge, or the bridge between them?
Is the composite data teaching generalizable assembly, or merely rehearsing common paths?
Will RL optimize a real outcome, or just make a brittle shortcut more confident?

This turns the paper into a useful procurement filter. If a vendor says, “We will use RL to improve your agent,” the next question is not “How advanced is your RL?” It is: “What atomic skills have you verified, and how do you know RL will synthesize rather than amplify?”

That is a less glamorous question. It is also the one that saves money.

Conclusion: RL does not create reasoning from nothing

The paper’s contribution is not that it crowns RL as the winner in the post-training debate. It gives us a conditional theory.

SFT can build atomic skills. SFT on composite tasks can also memorize seen reasoning paths. RL can synthesize new compositional behavior, but only when the base model already has the necessary atomic substrate. Without that substrate, RL tends to amplify what is already there.

For AI builders, this changes the order of operations. Do not start by collecting the most complex workflow traces and hoping the model will infer the parts. First identify the parts. Train and test them separately. Then use RL to make the model assemble them.

That is less magical than the usual “reasoning model” story. Good. Magic is hard to debug.

Cognaptus: Automate the Present, Incubate the Future.

Sitao Cheng, Xunjian Yin, Ruiwen Zhou, Yuxuan Li, Xinyi Wang, Liangming Pan, William Yang Wang, and Victor Zhong, “Atomic Skills are the Prerequisite: When Reinforcement Learning Synthesizes Compositional Reasoning, and When It Only Amplifies,” arXiv:2512.01970v3, 27 May 2026, https://arxiv.org/abs/2512.01970. ↩︎
