Thinking in New Directions: When LLMs Learn to Evolve Their Own Concepts

A familiar business scene: a team has already tried the standard AI improvement kit. Better prompts. More examples. Chain-of-thought. Self-consistency. A small agent wrapper. Maybe even a heroic tree-of-thought workflow that burns compute like a startup burns runway.

The model improves, but not in the way the team hoped. It can explain more. It can sample more. It can retry more. Yet when the task requires a new abstraction — a hidden rule in a grid, a nested logical constraint, a multi-step scientific relation, a variable-binding trick in math — the model still behaves like someone confidently rearranging old furniture in a room that needs a new door.

That is the useful starting point for Sarim Chaudhry’s paper, “Recursive Concept Evolution for Compositional Reasoning in Large Language Models.”¹ The paper’s central claim is not that language models need longer answers. It is that they may need a way to modify the internal representational geometry in which answers are formed.

That sounds grand. It also sounds like the sort of phrase that can make a practical manager quietly close the browser tab.

So let us translate it carefully.

The paper proposes Recursive Concept Evolution, or RCE: a framework that attaches to a frozen language model and lets it create, select, merge, and reuse low-rank “concept subspaces” during reasoning. The base model’s weights stay frozen. The new machinery sits inside the hidden-state flow, at a chosen decoder layer, and injects selected concept projections into the residual stream.

The difference from normal reasoning tricks is simple enough:

Usual reasoning upgrade	What it changes	What it leaves fixed
Chain-of-thought	More intermediate text	The model’s latent representation space
Self-consistency	More sampled answers	The model’s latent representation space
Tree-of-thought	More search branches	The model’s latent representation space
RL-style reasoning tuning	Which trajectories are rewarded	The underlying representational substrate
RCE	The hidden representation through new concept subspaces	The frozen base weights

The paper is therefore best read mechanism-first. The benchmark tables matter, but they are not the intellectual center. The center is the proposed shift from searching harder through the same latent map to adding new directions to the map.

Which, admittedly, is a more interesting proposal than asking the model to “think step by step” for the 40,000th time.

The real bottleneck is not always missing knowledge

A common explanation for LLM reasoning failure is that the model did not think long enough. That explanation is sometimes true. Some problems benefit from more intermediate steps, multiple samples, or external verification.

But the paper targets a different failure mode: the model may not have the right internal direction to represent the required abstraction.

The paper frames transformer hidden states as vectors inside a fixed latent geometry learned during pretraining. When a task requires a structure that is poorly aligned with that geometry, downstream layers cannot reliably extract it. The model may still produce fluent reasoning traces, but those traces are built from the wrong internal ingredients.

This is especially relevant for compositional benchmarks such as ARC-AGI-2, MATH, BBH, GPQA, and HLE, where success often depends on discovering a rule, preserving an invariant, or combining several constraints. The question is not simply whether the model has seen enough text. The question is whether it can represent the task’s latent structure in a usable form.

The paper’s sharpest point, conceptually, is not a benchmark number. It is the claim that token-level methods give the model “more opportunities to traverse its existing representation space,” but do not alter the space itself.

That distinction matters. If the task-relevant abstraction is already inside the model, more search can help retrieve or stabilize it. If the abstraction is not properly encoded, more search becomes a more expensive way to be wrong.

RCE adds concept subspaces without retraining the base model

RCE introduces a library of concepts. Each concept is a low-rank basis matrix representing a subspace inside the model’s hidden-state space. A gate decides which concepts should activate for a given input. The active concepts project the hidden state into selected directions and inject the resulting update back into the residual stream.

In plainer terms: RCE gives the model a small set of reusable representational tools that can be turned on when the current problem seems to need them.

The paper’s implementation uses a single injection layer. The concept module intercepts hidden states after that layer, applies a gated low-rank update, and then lets the remaining frozen decoder layers process the enriched representation. The base language model is not fine-tuned. Only the concept bases, gating network, and concept generator are trained.
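To make the injection step concrete, here is a minimal NumPy sketch of a gated low-rank update. It is one reading of the description above, not the paper's code: the update rule (hidden state plus gate-weighted projections onto each active subspace), the function name `inject_concepts`, and the toy sizes are all assumptions.

```python
import numpy as np

def inject_concepts(h, bases, gates, top_k=2):
    """Return h plus a gated low-rank update from the top_k most active concepts.

    h:     hidden state at the injection layer, shape (d,)
    bases: list of concept bases with orthonormal columns, each of shape (d, r)
    gates: one activation weight per concept
    Assumed update rule: h + sum_k g_k * B_k @ B_k.T @ h over the active concepts.
    """
    gates = np.asarray(gates, dtype=float)
    active = np.argsort(gates)[-top_k:]        # indices of the strongest gates
    update = np.zeros_like(h)
    for k in active:
        B = bases[k]
        update += gates[k] * (B @ (B.T @ h))   # project h onto the concept subspace
    return h + update                          # add the update to the residual stream

# Toy sizes for illustration only (d=64, rank 16, four concepts).
rng = np.random.default_rng(0)
d, r = 64, 16
bases = [np.linalg.qr(rng.standard_normal((d, r)))[0] for _ in range(4)]
h = rng.standard_normal(d)
h_new = inject_concepts(h, bases, gates=[0.1, 0.7, 0.0, 0.2])
```

The base weights never appear in this function, which is the point: everything trainable lives in `bases` and in whatever produces `gates`.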

This is important for business interpretation. RCE is not presented as another full-model training recipe. It is closer to an adaptive reasoning layer attached to a frozen model. If the mechanism works robustly, it points toward AI systems that improve reasoning capacity through compact reusable modules rather than repeated full-model tuning.

That “if” is doing work, as it should. The paper is still a research proposal with specific validation boundaries. But the architecture is interesting because it separates three things that are often blurred together:

Layer of the system	Traditional approach	RCE-style interpretation
Base model	Store general language and world knowledge	Keep frozen
Reasoning behavior	Generate better trajectories	Improve hidden representations
Adaptation	Fine-tune or prompt	Grow and reuse concept subspaces

For an enterprise AI stack, this distinction is not academic ornament. It changes where one might invest: prompt library, agent workflow, fine-tuning, retrieval, or reusable internal reasoning modules.

The mechanism has four moving parts: spawn, compete, merge, crystallize

The paper describes RCE as a recursive evolutionary process. Fortunately, the mechanics are more concrete than the branding.

1. Spawn when the model looks representationally stuck

RCE begins with a failure signal. The system monitors predictive entropy and the margin between the top token probabilities. High entropy plus low margin indicates that the model is uncertain in a way that may reflect representational inadequacy.
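A small sketch of such a failure signal follows. The source names the two ingredients (entropy and margin) but not how they are combined, so the weighted sum, the top-2 definition of margin, the weights, and the 0.7 threshold below are all hypothetical choices.

```python
import math

def failure_score(probs, w_entropy=0.5, w_margin=0.5):
    """Combine normalized entropy and top-2 margin into a score in [0, 1].

    The weighting scheme is an assumption made for illustration.
    """
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    norm_entropy = entropy / math.log(len(probs))    # 1.0 means uniform
    top1, top2 = sorted(probs, reverse=True)[:2]
    margin = top1 - top2                              # small margin = no clear winner
    return w_entropy * norm_entropy + w_margin * (1.0 - margin)

def should_spawn(probs, threshold=0.7):
    """Hypothetical spawn trigger: fire when the failure score crosses the threshold."""
    return failure_score(probs) > threshold

confident = [0.90, 0.05, 0.03, 0.02]   # one clear winner
uncertain = [0.28, 0.26, 0.24, 0.22]   # nearly flat distribution
print(should_spawn(confident), should_spawn(uncertain))
```

A nearly flat next-token distribution trips the trigger, while a peaked one does not, which is the behavior the paragraph describes.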

When the failure score crosses a threshold, a concept generator creates candidate subspaces from the pooled hidden state at the injection layer. These candidates are low-rank bases, perturbed for diversity and orthogonalized so they are not merely collapsed copies of each other.
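The spawn step can be sketched as follows, with loud caveats. The generator here is a stand-in linear map (`generator_w`), not the paper's generator; the Gaussian perturbation and the use of QR to orthonormalize each candidate's columns are assumptions, and orthogonality across candidates is not enforced in this version.

```python
import numpy as np

def spawn_candidates(pooled_h, generator_w, n_candidates=4, rank=16, noise=0.1, seed=0):
    """Propose diverse rank-`rank` bases with orthonormal columns.

    pooled_h:    pooled hidden state at the injection layer, shape (d,)
    generator_w: hypothetical linear generator, shape (d * rank, d)
    """
    rng = np.random.default_rng(seed)
    d = pooled_h.shape[0]
    raw = (generator_w @ pooled_h).reshape(d, rank)            # one (d, rank) proposal
    candidates = []
    for _ in range(n_candidates):
        perturbed = raw + noise * rng.standard_normal((d, rank))  # perturb for diversity
        q = np.linalg.qr(perturbed)[0]                         # orthonormalize columns
        candidates.append(q)
    return candidates

# Toy sizes for illustration only.
rng = np.random.default_rng(1)
d, rank = 32, 4
generator_w = rng.standard_normal((d * rank, d)) / np.sqrt(d)
cands = spawn_candidates(rng.standard_normal(d), generator_w, rank=rank)
```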

This is the first key difference from static adapters. The concept library does not start as a complete inventory. It grows when the model encounters situations where the existing representational basis appears insufficient.

A business analogy: the system does not hire a full department in advance. It creates a new specialist when the current team repeatedly fails on a kind of problem.

Of course, if it hired a new specialist every time someone frowned, the organization would become a circus. That is why the next step matters.

2. Compete under Minimum Description Length

Not every spawned concept is accepted. RCE uses a Minimum Description Length, or MDL, criterion to decide whether a candidate concept earns a place in the library.

The intuition is elegant: a concept should be kept only if it reduces representational error more than it increases complexity. In the paper’s formulation, candidate concepts are penalized for structural complexity and non-sparse activation. A candidate enters the library only if its benefit clears this cost.
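In code, the acceptance rule reduces to a single inequality. The sketch below assumes a linear cost in the number of basis parameters and in the activation rate; the function name, the two weights, and the example numbers are hypothetical and are not taken from the paper.

```python
def mdl_accepts(error_before, error_after, rank, hidden_dim,
                activation_rate, lam_complexity=1e-6, lam_sparsity=0.5):
    """Accept a candidate only if its error reduction exceeds its description cost.

    Both cost terms are assumed forms: complexity grows with the number of
    parameters in the basis, sparsity cost grows with how often the concept fires.
    """
    benefit = error_before - error_after
    complexity_cost = lam_complexity * rank * hidden_dim
    sparsity_cost = lam_sparsity * activation_rate
    return benefit > complexity_cost + sparsity_cost

# Illustrative numbers only: a rank-16 basis in a 4096-dimensional hidden space.
useful_sparse = mdl_accepts(1.00, 0.70, rank=16, hidden_dim=4096, activation_rate=0.1)
marginal      = mdl_accepts(1.00, 0.95, rank=16, hidden_dim=4096, activation_rate=0.1)
useful_dense  = mdl_accepts(1.00, 0.70, rank=16, hidden_dim=4096, activation_rate=0.9)
print(useful_sparse, marginal, useful_dense)
```

The third case is the instructive one: a concept with a real benefit is still rejected if it fires on almost everything, because it is then behaving like a global bias rather than a selective tool.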

This is the anti-bloat mechanism. Without it, the concept library could fill with tiny niche tricks that help a few training examples and harm generalization. Every company has seen this pattern in another form: a workflow patched with exception after exception until the exception system becomes the real product, and a very bad one.

The ablation results make MDL central rather than decorative. Removing the MDL criterion produces the largest reported performance drop: ARC-AGI-2 falls from 28.0 to 14.6, and MATH falls from 47.4 to 31.2 in the Mistral-7B ablation setting. The paper interprets this as evidence that unconstrained concept growth leads to overfitting and conflicting representational biases.

That is not a minor implementation detail. It is the pressure that makes the “evolution” part of RCE more than a metaphor.

3. Merge concepts when their joint use becomes genuinely useful

RCE also allows concepts to merge. If two concepts consistently co-activate and their combined contribution exceeds their individual contribution, they can be compressed into a higher-order abstraction using truncated SVD, again subject to an MDL-style acceptance test.
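A minimal sketch of the compression step, assuming the merge works by stacking the two bases side by side and keeping the leading left singular vectors. The co-activation check and the MDL-style acceptance test are omitted, and `merge_concepts` and the "retained energy" diagnostic are illustrative names rather than the paper's.

```python
import numpy as np

def merge_concepts(b1, b2, rank):
    """Compress two concept bases into one rank-`rank` basis via truncated SVD."""
    stacked = np.concatenate([b1, b2], axis=1)               # shape (d, r1 + r2)
    u, s, _ = np.linalg.svd(stacked, full_matrices=False)
    merged = u[:, :rank]                                     # leading singular directions
    retained = float((s[:rank] ** 2).sum() / (s ** 2).sum()) # share of energy kept
    return merged, retained

# Two toy concepts that share two of their four directions.
rng = np.random.default_rng(0)
q = np.linalg.qr(rng.standard_normal((32, 6)))[0]
b1, b2 = q[:, :4], q[:, 2:6]
merged, retained = merge_concepts(b1, b2, rank=4)
```

Directions that both parents share carry the largest singular values, so truncation keeps the common structure first and discards what only one parent used.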

This matters because compositional reasoning is not just the presence of primitives. It is the ability to build reusable combinations.

The paper gives examples of primitive concepts such as spatial symmetry, color equivalence, numerical magnitude, and logical implication. It then describes intermediate merged concepts such as reflect-and-recolor, substitute-and-simplify, and constraint propagation. At the highest observed level, some concepts activate across multiple benchmark domains.

The reported concept library after 10,000 training steps stabilizes at 47 active concepts: 12 primitive concepts, 23 intermediate merged concepts, and 12 high-level abstractions. Primitive concepts average a reuse rate of 4.3 task types, while merged concepts average 8.7. That pattern supports the paper’s claim that merging produces more general abstractions.

This part is especially relevant for business automation. A useful AI system should not merely memorize one invoice format, one legal clause pattern, or one customer-service exception. It should learn reusable operational concepts: “exception requiring escalation,” “constraint conflict,” “missing evidence,” “policy-rule mismatch,” “ambiguous entity resolution.” The interesting possibility is not a model that remembers more cases. It is a model that forms better internal handles for recurring structures.

4. Crystallize the library so learning becomes cumulative

The paper uses “crystallization” to describe making useful concepts persistent. In the current implementation, this is mainly checkpoint-based: the concept library, gate network, and generator are saved and reused. The paper also discusses a deeper integration path through LoRA-style distillation under Fisher-information constraints, but that is presented as a possible consolidation route rather than the core validated result.
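Checkpoint-based persistence is the least exotic part of the design. A sketch under simple assumptions: concepts and gate weights are NumPy arrays, the file layout and key names are invented here, and the generator is left out for brevity.

```python
import os
import tempfile
import numpy as np

def save_library(path, bases, gate_w):
    """Persist concept bases and gate weights to a single .npz checkpoint."""
    arrays = {f"concept_{i}": b for i, b in enumerate(bases)}
    np.savez(path, gate_w=gate_w, **arrays)

def load_library(path):
    """Reload the checkpoint; concepts come back in their original order."""
    with np.load(path) as data:
        n = sum(1 for key in data.files if key.startswith("concept_"))
        bases = [data[f"concept_{i}"] for i in range(n)]
        return bases, data["gate_w"]

# Round trip with toy arrays in a temporary directory.
rng = np.random.default_rng(0)
bases = [rng.standard_normal((8, 2)) for _ in range(3)]
gate_w = rng.standard_normal((3, 8))
path = os.path.join(tempfile.mkdtemp(), "concept_library.npz")
save_library(path, bases, gate_w)
restored_bases, restored_gate = load_library(path)
```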

This distinction matters. The validated story is not yet “the model permanently rewrites itself into a new foundation model.” The validated story is more modest and more useful: a frozen model can be paired with a cumulative concept library that persists across sessions.

For deployment, that is still meaningful. Many business AI systems fail because every interaction starts fresh except for retrieved documents and chat history. A concept library would be different. It would store not facts, but learned representational tools.

That is a more durable kind of learning — if it can be governed, audited, and kept from becoming a haunted attic of latent hacks.

The main evidence says RCE beats token-search baselines on the tested reasoning tasks

The paper reports results across ARC-AGI-2, MATH, BBH, GPQA, and HLE. On Mistral-7B, RCE improves over the strongest listed baseline, DisCO, by 8.3 percentage points on ARC-AGI-2, 6.1 on MATH, 5.7 on BBH, 7.2 on GPQA, and 4.9 on HLE.

The headline table is:

Method	Model	ARC-AGI-2	MATH	BBH	GPQA	HLE
Base	Mistral-7B	12.4	28.6	51.3	24.1	8.2
CoT	Mistral-7B	15.1	34.2	57.8	28.5	10.1
Self-consistency, n=16	Mistral-7B	16.8	37.1	60.2	30.3	11.4
Tree-of-thought	Mistral-7B	17.3	36.8	59.5	31.0	11.9
GRPO	Mistral-7B	18.2	38.9	62.1	32.4	12.6
DisCO	Mistral-7B	19.7	41.3	64.8	34.2	13.8
RCE	Mistral-7B	28.0	47.4	70.5	41.4	18.7

The largest improvements occur on ARC-AGI-2 and GPQA, which fits the paper’s mechanism. ARC-style tasks reward invariant discovery and structural transformation. GPQA rewards cross-domain abstraction and multi-layer scientific reasoning. If RCE’s concept subspaces genuinely help represent structural patterns, those are exactly the places where one would expect stronger gains.
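The quoted gaps are easy to recompute from the DisCO and RCE rows of the table. The snippet below uses only those reported numbers; the variable names are incidental.

```python
# Mistral-7B scores copied from the headline table.
disco = {"ARC-AGI-2": 19.7, "MATH": 41.3, "BBH": 64.8, "GPQA": 34.2, "HLE": 13.8}
rce   = {"ARC-AGI-2": 28.0, "MATH": 47.4, "BBH": 70.5, "GPQA": 41.4, "HLE": 18.7}

# Percentage-point gain of RCE over DisCO, rounded to one decimal.
gains = {name: round(rce[name] - disco[name], 1) for name in rce}
ranked = sorted(gains, key=gains.get, reverse=True)

for name in ranked:
    print(f"{name}: +{gains[name]} points over DisCO")
```

Sorting by gain puts ARC-AGI-2 and GPQA first, which matches the reading above.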

But the table also needs disciplined reading.

The Mistral-7B results are the primary validated evidence. The paper’s full table also includes Llama-3-8B and Qwen-2.5-14B rows (not reproduced above), but the paper notes that those RCE results are projected from validated component-level scaling based on the Mistral implementation. This is not the same evidentiary weight as a fully executed multi-model evaluation.

That does not invalidate the paper. It does change how one should report it. The responsible reading is:

Claim	Evidence type	Business interpretation	Boundary
RCE improves Mistral-7B on compositional benchmarks	Main benchmark evidence	The mechanism may improve reasoning where latent structure matters	Validated primarily on Mistral-7B
RCE retains performance under ARC distribution shifts	Robustness/sensitivity evidence	Concepts may encode invariants rather than surface cues	Tested on controlled ARC shifts, not arbitrary enterprise drift
MDL, invariance augmentation, KL, merging, and orthogonality matter	Ablation evidence	The system needs governance pressure, not just more adaptive capacity	Ablations use selected benchmarks and one model setting
RCE is compute-efficient versus token-search methods	Efficiency comparison	Better reasoning may not require multiplying token generation	Depends on implementation and online evolution cost
Benefits should transfer to larger models	Projected scaling claim	Plausible direction for future systems	Not yet the same as full empirical validation

This is the difference between analysis and brochure writing. Brochures convert every table into destiny. Analysis asks which column is carrying the weight.

The robustness tests are about invariants, not general intelligence

The paper tests ARC-AGI-2 performance retention under three distribution shifts: color permutation, spatial rotation, and distractor injection. RCE reportedly retains 94.3%, 91.7%, and 95.8% of its standard accuracy under these shifts, respectively. The CoT and DisCO baselines retain less, roughly 68% to 80%.

This evidence has a specific purpose. It is not a general claim that RCE is robust to everything. It tests whether the learned concepts depend on surface features or structural invariants.

That is a good test for the paper’s core thesis. If a concept is supposed to capture symmetry, transformation, or rule structure, it should survive changes in color palette or irrelevant distractors. If it collapses when the colors change, it was not a concept. It was a glorified shortcut wearing a lab coat.

The paper ties this robustness to MDL selection, orthogonality, and invariance augmentation. Concepts that rely on surface cues are less likely to generalize and should be pruned before crystallization.

For business readers, the translation is straightforward: robustness should be tested against the right kind of variation. If an AI system processes contracts, rotating a grid is irrelevant. But changing party names, jurisdiction phrases, clause order, or formatting may be analogous. If it processes invoices, vendor layout shifts matter. If it supports technical troubleshooting, irrelevant log noise matters.

The value of the RCE robustness section is therefore not the exact ARC percentages. It is the testing principle: evaluate whether the model has learned the structure or merely the surface appearance of the training distribution.
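That testing principle fits in a few lines. The toy below is not from the paper: the task (is a row left-right symmetric?), the two predictors, and the color permutation are invented to show how a retention score separates a structural rule from a surface shortcut.

```python
def accuracy(predict, examples):
    """Fraction of (input, label) pairs the predictor gets right."""
    return sum(predict(x) == y for x, y in examples) / len(examples)

def retention(predict, examples, perturb):
    """Accuracy under a label-preserving shift, as a fraction of standard accuracy.

    Assumes standard accuracy is nonzero.
    """
    shifted = [(perturb(x), y) for x, y in examples]
    return accuracy(predict, shifted) / accuracy(predict, examples)

# Toy task: label is True when the row of colors is left-right symmetric.
examples = [((1, 2, 1), True), ((1, 3, 1), True), ((2, 3, 4), False), ((3, 2, 4), False)]

def structural(row):
    return row == row[::-1]   # checks the actual invariant

def surface(row):
    return row[0] == 1        # shortcut that happens to fit the unshifted data

perm = {1: 2, 2: 3, 3: 4, 4: 1}

def permute(row):
    return tuple(perm[c] for c in row)   # color permutation keeps symmetry intact

print(retention(structural, examples, permute), retention(surface, examples, permute))
```

Both predictors are perfect on the original examples. Only the one that encodes the invariant stays perfect after the colors are permuted.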

The ablations say the architecture works only when constrained

The ablation table is one of the most useful parts of the paper because it tells us which components are load-bearing.

Removed component	ARC-AGI-2	MATH	Likely purpose of the test	Interpretation
None: Full RCE	28.0	47.4	Main system reference	Baseline for component removal
MDL criterion	14.6	31.2	Ablation of selection pressure	Concept growth without compression discipline overfits
Invariance augmentation	18.3	39.8	Robustness-oriented ablation	Concepts become too surface-dependent
KL constraint	21.5	35.6	Stability ablation	Concept injection may improve structure but damage decoding fluency
Merge mechanism	23.1	42.7	Compositionality ablation	Primitive concepts help, but hierarchy adds value
Orthogonality penalty	20.4	38.1	Redundancy-control ablation	Concepts overlap and waste library capacity
Gate entropy penalty	25.2	44.3	Routing sparsity ablation	Diffuse activation dilutes useful concept effects

The MDL ablation is the most severe. That is the practical lesson. Adaptivity without compression pressure becomes overfitting. A system that can create new internal tools must also have a reason not to create too many.

The KL constraint ablation is also worth attention. The paper reports that removing the KL constraint harms fluency and factual stability. This is intuitive: if the concept module is allowed to push the hidden state too far from the base model’s distribution, it may create useful reasoning structure that the frozen decoder cannot express coherently. Better internal geometry is not useful if it arrives in a dialect the remaining layers cannot read.
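One way to picture the constraint is as a penalty on how far the injected model's next-token distribution drifts from the frozen model's. The direction of the KL term, the budget, and the hinge form below are assumptions; the source only says a KL constraint exists.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

def kl_penalty(base_probs, injected_probs, budget=0.05, weight=1.0):
    """Penalize only the divergence that exceeds the budget (hypothetical hinge form)."""
    excess = kl_divergence(base_probs, injected_probs) - budget
    return weight * max(0.0, excess)

base = [0.70, 0.20, 0.10]   # frozen model's next-token distribution (toy values)
mild = [0.65, 0.25, 0.10]   # small nudge from concept injection
wild = [0.10, 0.20, 0.70]   # injection that overturns the base model's ranking
print(kl_penalty(base, mild), kl_penalty(base, wild))
```

A small nudge stays inside the budget and costs nothing, while an update that flips the base model's preferences is penalized heavily.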

The merge ablation is more moderate but conceptually important. It suggests that primitive concepts contribute, but multi-step compositional tasks benefit from hierarchical abstractions. This supports the recursive part of RCE: the system should not merely collect isolated tools; it should combine useful tools into broader ones.

The appendix sensitivity tests reinforce the same lesson. Performance is most sensitive to the spawn threshold and MDL weight. Too-low thresholds produce concept explosion and weaker generalization. Too-high thresholds suppress useful concept formation. Rank 16 performs best among the tested ranks, while top-2 routing works better than activating too few or too many concepts.
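Top-2 routing can be sketched as a softmax gate that keeps the two largest weights and zeroes the rest. Renormalizing the survivors is an assumption made here, and `top_k_gate` is an illustrative name.

```python
import math

def top_k_gate(scores, k=2):
    """Softmax over gate scores, keep the k largest, renormalize; others become 0."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]   # subtract max for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]
    keep = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:k]
    kept_mass = sum(probs[i] for i in keep)
    return [probs[i] / kept_mass if i in keep else 0.0 for i in range(len(probs))]

weights = top_k_gate([2.0, 0.5, 1.0, -1.0], k=2)
print(weights)
```

Raising `k` spreads the same update over more concepts, which is the dilution effect the gate-entropy ablation points to.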

There is a management version of this result: innovation needs budget constraints. Without constraints, every exception becomes a feature, every feature becomes a platform, and soon nobody knows why the platform exists.

Apparently, neural concept libraries are not immune to corporate sociology.

Compute efficiency is the business-relevant surprise

The paper reports a striking compute comparison on MATH. Base greedy decoding costs 1.0 relative FLOPs and scores 28.6. CoT costs 3.2 and scores 34.2. Self-consistency with 16 samples costs 16.0 and scores 37.1. Tree-of-thought costs 24.5 and scores 36.8. RCE costs 1.04 and scores 47.4.

The mechanism explains the result. Token-search methods multiply forward passes or generated trajectories. RCE adds a gate MLP and two rank-16 projections per token, which the paper estimates as roughly 4% overhead relative to a base forward pass.
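A quick way to read those numbers is MATH points gained over greedy decoding per unit of added relative compute. The figures are the reported ones; the ratio itself is a convenience metric introduced here, not one the paper reports, and it counts inference cost only.

```python
# (relative FLOPs, MATH score) as reported for each method.
methods = {
    "CoT": (3.2, 34.2),
    "Self-consistency (n=16)": (16.0, 37.1),
    "Tree-of-thought": (24.5, 36.8),
    "RCE": (1.04, 47.4),
}
BASE_COST, BASE_SCORE = 1.0, 28.6   # greedy decoding baseline

def points_per_extra_flop(cost, score):
    """MATH points gained over greedy decoding per unit of added relative compute."""
    return (score - BASE_SCORE) / (cost - BASE_COST)

efficiency = {name: points_per_extra_flop(c, s) for name, (c, s) in methods.items()}
for name, value in efficiency.items():
    print(f"{name}: {value:.2f} points per extra relative FLOP")
```

The token-search methods land in the low single digits or below, while RCE's ratio is two orders of magnitude higher, because its denominator is a 0.04 increment rather than a multiple of the base pass.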

For business systems, this is the most important operational angle. Many AI reliability improvements are purchased with inference cost: more calls, more retries, more agents, more samples, more verification. That can be justified for high-value decisions, but it quickly becomes expensive in high-volume workflows.

RCE points to a different cost curve: improve the representation once, then reuse the concept library cheaply.

That does not mean RCE is automatically cheaper end to end. Online spawning, candidate evaluation, merge checking, and library governance all have costs. The paper itself notes that merge dynamics are quadratic in the number of active concepts, and scaling to thousands of concepts would require approximate candidate selection.
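The quadratic term is easy to underestimate. Assuming a full merge check considers every unordered pair of concepts, the count grows as follows; 47 is the reported library size, and the larger sizes are hypothetical.

```python
from math import comb

def merge_pairs(n_concepts):
    """Number of unordered concept pairs a full merge check must consider."""
    return comb(n_concepts, 2)

for n in (47, 1_000, 5_000):
    print(f"{n} concepts -> {merge_pairs(n):,} candidate pairs")
```

At 47 concepts the check covers about a thousand pairs. At a few thousand concepts it covers millions, which is why approximate candidate selection becomes necessary.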

Still, the inference story is notable. If the useful concepts are already crystallized, the marginal cost of activating them may be far lower than sampling 16 reasoning traces.

This is where the paper becomes business-relevant without becoming marketing copy.

The business use case is not “smarter chatbot”; it is reusable reasoning infrastructure

The obvious but shallow interpretation is that RCE could make chatbots answer harder questions. Fine. It probably could, if the claims hold.

The more useful interpretation is that RCE suggests a new layer in enterprise AI architecture: adaptive concept infrastructure.

Today, many business AI systems are organized around:

a general model;
retrieval from documents;
prompt templates;
agent workflows;
sometimes fine-tuning;
monitoring and evaluation.

RCE suggests another layer:

reusable concept modules that capture recurring reasoning structures.

This layer would not store company facts. Retrieval already does that. It would not simply encode process instructions. Prompts and tools already do that. It would store internal representational tools for recurring abstractions.

Examples in business settings might include:

Business domain	Recurring abstraction	Why a concept-library approach could matter
Compliance review	Policy-rule mismatch	The same latent structure appears across different documents and departments
Finance operations	Exception requiring escalation	Surface details vary, but the decision structure repeats
Legal operations	Clause-role recognition under rewording	The wording changes while the contractual function remains stable
Customer support	Root-cause pattern across noisy symptoms	The useful structure is not always explicit in the text
Scientific or technical QA	Cross-domain causal relation	The task requires combining concepts rather than retrieving one passage
Workflow automation	State transition with missing evidence	The system must track constraints and incomplete information

This is an inference from the paper, not a direct experimental result. The paper tests compositional reasoning benchmarks, not enterprise workflows. But the bridge is plausible because the business tasks above share the same abstract property: they require structural generalization under surface variation.

The ROI argument would also be different from normal fine-tuning. A concept library could be valuable if it improves reasoning reliability while avoiding heavy full-model retraining and reducing dependence on expensive multi-sample inference. The value would come from three sources:

Reuse: concepts learned in one workflow may help related workflows.
Efficiency: concept activation may be cheaper than repeated token-level search.
Adaptation: new concepts can be added when the system encounters representational gaps.

The uncertainty is equally clear. We do not yet know how stable such concept libraries would be across real enterprise distributions, whether they can be audited well enough for regulated contexts, or how to prevent adversarial activation. These are not footnotes. They are deployment questions.

The limitations are not generic; they define where the mechanism may fail

The paper identifies three main failure modes.

First, RCE struggles with extremely long formal proofs requiring chains of 15 or more deductive steps. Since the implementation injects concepts at a single layer, the enriched representation is processed only by the remaining layers. If the reasoning structure requires deeper sequential restructuring, single-layer injection may not be enough. Multi-layer injection is a natural extension.

Second, RCE does not solve explicit external memory. A concept subspace can amplify a structural direction, but it is not a storage system. Tasks requiring many independent object states across long time horizons still need memory mechanisms.

Third, adversarial symbolic traps can misactivate concepts. Because concepts amplify hidden directions, an input deliberately aligned with a concept basis may trigger the wrong inference. The robustness tests suggest this is rare under natural ARC shifts, but adversarial deployment settings are a different game.

The paper also names broader scaling limitations. Merge evaluation is quadratic in the number of active concepts. Large concept libraries would need approximate merge candidate selection. Concept identifiability remains unresolved: different training runs may produce different concept decompositions with similar aggregate performance. Scaling to 70B-plus models would raise memory and compute issues because spawn evaluation and merge checks still involve the full model.

These are not reasons to ignore the paper. They are reasons to avoid turning it into a silver-bullet architecture. The useful boundary is:

RCE is promising for reusable structural abstractions.
It is not a replacement for external memory.
It is not yet proven at large model scale.
It needs governance over concept growth, activation, merging, and adversarial misuse.
Its strongest empirical support in the paper is the Mistral-7B implementation.

That last point matters. The larger-model rows are interesting, but the paper itself labels them as projected from component-level scaling rather than full validation. A careful reader should treat them as a roadmap, not a settled result.

What Cognaptus would watch next

If this line of work develops, the most important future evidence will not simply be “higher benchmark score.” We already have too many benchmark leaderboards behaving like beauty contests with spreadsheets.

The next useful tests would be:

Question	Why it matters
Do concept libraries remain stable across random seeds and data order?	Stability affects reproducibility and auditability
Can concepts be interpreted well enough for governance?	Enterprise use needs inspection, not just accuracy
Do concepts transfer across real workflow domains?	Business value depends on reuse, not benchmark memorization
What is the lifecycle cost of spawning, merging, pruning, and checkpointing?	Inference overhead alone understates total cost
Can adversarial concept activation be detected?	Misapplied abstractions can be worse than uncertainty
Does multi-layer injection improve long reasoning without instability?	Single-layer injection is a known boundary

The paper gives enough evidence to make RCE worth watching. It does not give enough evidence to declare representation evolution solved. That is fine. Serious research rarely arrives with a finished procurement checklist, despite what some vendor decks imply.

The deeper lesson: reasoning may need new internal tools, not louder instructions

The most useful contribution of the paper is not merely the RCE architecture. It is the reframing of a familiar problem.

For the last few years, much of LLM reasoning improvement has been treated as a problem of elicitation and search: ask better, sample more, rank better, verify harder. Those methods are useful. But they assume the model’s internal representational space already contains what the task needs.

RCE challenges that assumption. It asks whether a frozen model can build new low-rank representational tools when the current ones are inadequate, then reuse and compose those tools later.

For business AI, this is the right question. The hardest automation problems are not the ones where the answer is sitting in a document waiting to be retrieved. They are the problems where the system must recognize a structure across messy cases: the hidden exception, the invariant policy rule, the implied contradiction, the state transition, the reusable pattern.

Search helps when the map is good. RCE is interesting because it asks whether the map itself can grow.

That is the difference between making the model walk longer and giving it a new direction to walk in.

Cognaptus: Automate the Present, Incubate the Future.

¹ Sarim Chaudhry, “Recursive Concept Evolution for Compositional Reasoning in Large Language Models,” arXiv:2602.15725, 2026, https://arxiv.org/abs/2602.15725.
