Compute is the bill that arrives after every AI strategy meeting.
Everyone wants stronger reasoning. Fewer hallucinations. Better mathematical reliability. More robust planning. The usual menu is familiar: train a bigger model, sample more answers, generate longer chain-of-thought, bolt on a verifier, or pray to the GPU procurement gods. Elegant, in the way an invoice can be elegant.
A recent paper, “Teaching Pretrained Language Models to Think Deeper with Retrofitted Recurrence,” studies a less theatrical option: take an already pretrained fixed-depth language model, remove some layers, loop a selected block of remaining layers, and post-train the modified model so that it can spend more compute internally at inference time.[1] The idea is not merely to make the model larger. It is to make depth reusable.
That distinction matters. A normal transformer runs through its stack once. More reasoning usually means more visible tokens, more samples, or a larger network. A depth-recurrent transformer instead reuses part of itself multiple times. It can perform more internal computation without increasing the number of unique parameters or expanding the context with a long verbal “thinking” trace. No, this does not mean the model has suddenly discovered introspection. It means the architecture has been given a loop and trained not to fall over when using it. Small difference. Rather important.
The paper’s real contribution is not “recurrence works.” That claim is too broad and slightly lazy. The contribution is more specific: pretrained 1B-class models such as TinyLlama, OLMo, and Llama can be converted into useful depth-recurrent models if the surgery, optimizer, recurrence schedule, and data curriculum are handled carefully. The care is the point. Recurrence is not a magic button. It is a controlled injury followed by rehabilitation.
The retrofit is model surgery, not a loop button
The architecture starts with a simple decomposition. The authors split a pretrained model into three functional parts:
| Component | What it does | Operational interpretation |
|---|---|---|
| Prelude | Processes the input once | A fixed upfront encoding cost |
| Recurrent block | Runs repeatedly | Variable internal compute |
| Coda | Produces the final output distribution | Final decoding layer stack |
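The model can be written schematically as:
$$e = P(x), \qquad s_i = R(e, s_{i-1}) \quad \text{for } i = 1, \dots, r, \qquad \hat{y} = C(s_r)$$
Here, $P$ is the prelude, $R$ is the recurrent block, $C$ is the coda, and $r$ is the number of recurrences; $x$ is the input, $e$ the prelude output, $s_i$ the recurrent state after step $i$, and $\hat{y}$ the output distribution. The prelude output is injected into each recurrent step through a linear adapter (folded into $R$ in this notation) that combines the prelude representation with the previous recurrent state. In plain English: the model does not simply throw the recurrent block at its own output and hope. It keeps feeding the original encoded input back into the loop so the repeated computation remains anchored.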
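To make the anchoring concrete, here is a minimal toy sketch of the prelude, loop, and coda flow in plain Python. It is an illustration under stated assumptions, not the authors' implementation: layers are stand-in functions on lists of floats, the adapter is a bare linear map over the concatenated state and prelude output, and the names `run_layers`, `linear_adapter`, `recurrent_forward`, and the zero `initial_state` are hypothetical choices made for this example.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Layer = Callable[[Vector], Vector]


def run_layers(layers: Sequence[Layer], hidden: Vector) -> Vector:
    """Apply a stack of layers in order."""
    for layer in layers:
        hidden = layer(hidden)
    return hidden


def linear_adapter(weights: List[List[float]], state: Vector, anchor: Vector) -> Vector:
    """Project the concatenation [state; anchor] (width 2d) back to width d."""
    joined = state + anchor
    return [sum(w * v for w, v in zip(row, joined)) for row in weights]


def recurrent_forward(
    x: Vector,
    prelude: Sequence[Layer],
    recurrent: Sequence[Layer],
    coda: Sequence[Layer],
    adapter_weights: List[List[float]],
    initial_state: Vector,  # assumption: how the state is initialised is not covered here
    num_recurrences: int,
) -> Vector:
    anchor = run_layers(prelude, x)  # prelude runs once
    state = list(initial_state)
    for _ in range(num_recurrences):  # variable internal compute
        mixed = linear_adapter(adapter_weights, state, anchor)
        state = run_layers(recurrent, mixed)
    return run_layers(coda, state)  # coda runs once


def identity(hidden: Vector) -> Vector:
    """Stand-in for a transformer layer (toy only)."""
    return list(hidden)


# Toy adapter that averages the previous state with the prelude output.
averaging_adapter = [[0.5, 0.0, 0.5, 0.0], [0.0, 0.5, 0.0, 0.5]]
outputs = {
    r: recurrent_forward(
        [1.0, 1.0], [identity], [identity], [identity],
        averaging_adapter, [0.0, 0.0], r,
    )
    for r in (1, 2, 3)
}
```

In this toy the state moves toward the prelude output with every pass, which is the anchoring effect in miniature: the loop refines a state while the original encoding keeps being re-injected.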
The layer selection is also less obvious than “take the middle and repeat it.” The paper finds that it works best to keep early layers for the prelude and later layers for the recurrent block and coda, removing layers in between. For example, in a TinyLlama configuration with 22 layers, the authors use a $(4,8,4)$ arrangement: the first four layers become the prelude, eight later layers become the recurrent block, the final four become the coda, and six intervening layers are dropped.
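The index bookkeeping for that arrangement can be sketched in a few lines. This assumes the retained later layers are the contiguous tail of the stack, which is one reading of “keep early layers, keep later layers, remove layers in between”; the function name `select_layers` and its return shape are hypothetical.

```python
from typing import Dict, List


def select_layers(num_layers: int, prelude: int, recurrent: int, coda: int) -> Dict[str, List[int]]:
    """Split layer indices into prelude / recurrent / coda / dropped groups.

    Assumption: the prelude is the first `prelude` layers and the recurrent
    block plus coda are the last `recurrent + coda` layers of the parent.
    """
    kept = prelude + recurrent + coda
    if kept > num_layers:
        raise ValueError("cannot keep more layers than the parent model has")
    tail_start = num_layers - (recurrent + coda)
    return {
        "prelude": list(range(prelude)),
        "dropped": list(range(prelude, tail_start)),
        "recurrent": list(range(tail_start, tail_start + recurrent)),
        "coda": list(range(tail_start + recurrent, num_layers)),
    }


tinyllama_split = select_layers(num_layers=22, prelude=4, recurrent=8, coda=4)
```

For the 22-layer case this keeps layers 0 to 3 as the prelude, drops layers 4 to 9, loops layers 10 to 17, and uses layers 18 to 21 as the coda.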
That creates a useful business analogy, although we should not abuse it. The model is being turned from a fixed assembly line into a process with a reusable workcell. Input preparation happens once. The workcell can iterate. The final packaging step happens at the end. The expensive bit becomes adjustable.
But the adjustment is not free. Each recurrence consumes inference FLOPs. The model has fewer unique parameters, but a deeper effective computation graph when it runs multiple recurrences. This is not parameter scaling. It is compute scheduling.
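A back-of-envelope way to see the trade is to count layer executions per forward pass. The sketch below treats one layer execution as one unit of compute, which is a crude proxy: it ignores embeddings, the adapter, and attention-cache effects, and the helper name `layer_executions` is hypothetical.

```python
def layer_executions(prelude: int, recurrent: int, coda: int, recurrences: int) -> int:
    """Layers executed in one forward pass: the loop body runs `recurrences` times."""
    return prelude + recurrent * recurrences + coda


PARENT_LAYERS = 22          # fixed-depth TinyLlama parent
UNIQUE_LAYERS = 4 + 8 + 4   # layers that actually hold parameters after surgery

shallow = layer_executions(4, 8, 4, recurrences=1)    # 16 executions
deep = layer_executions(4, 8, 4, recurrences=32)      # 264 executions
relative_to_parent = deep / PARENT_LAYERS             # 12.0
```

The unique layer count stays at 16 in both cases; only the number of times the loop body runs changes. That is the sense in which this is compute scheduling rather than parameter scaling.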
Pretrained initialization is the first result, not a convenience
The first major experimental question is whether a recurrent model should be trained from scratch or initialized from an existing pretrained model. The answer is not subtle.
The authors compare a recurrent model initialized from Llama-3.2-1B layers against one initialized randomly using a scalable initialization scheme. Both are trained for approximately 120 billion FineWeb-Edu tokens with a mean recurrence count of 32. The pretrained initialization consistently reaches lower loss and higher HellaSwag accuracy. By around 1,000 training steps, the Llama-initialized model is already using recurrence productively; the randomly initialized model is still near chance-level accuracy across recurrence settings.
The paper also extrapolates the loss curves and estimates that the randomly initialized model would need roughly 950 billion tokens, at minimum, before its loss curve intersects the pretrained-initialized one. The authors warn that this is likely an underestimate, because the curves are not perfectly log-linear near the end. This is exactly the kind of result procurement teams enjoy, provided someone else has already paid for the pretraining. Conveniently, that is how open model ecosystems work.
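For readers who want to see what such an extrapolation involves, here is a minimal sketch: fit loss as a linear function of log-tokens for each curve, then solve for where the two lines meet. The data points are synthetic and generated inside the code purely to exercise the arithmetic; they are not the paper's measurements, and the function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def fit_log_linear(tokens: Sequence[float], losses: Sequence[float]) -> Tuple[float, float]:
    """Least-squares fit of loss = intercept + slope * ln(tokens)."""
    xs = [math.log(t) for t in tokens]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(losses) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, losses))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def crossover_tokens(fit_a: Tuple[float, float], fit_b: Tuple[float, float]) -> float:
    """Token count at which two fitted log-linear curves intersect."""
    (a1, b1), (a2, b2) = fit_a, fit_b
    if b1 == b2:
        raise ValueError("parallel curves never intersect")
    return math.exp((a2 - a1) / (b1 - b2))


# Synthetic curves (tokens in billions); NOT data from the paper.
token_grid: List[float] = [10.0, 20.0, 40.0, 80.0, 120.0]
pretrained_losses = [3.0 - 0.05 * math.log(t) for t in token_grid]
random_losses = [3.6 - 0.15 * math.log(t) for t in token_grid]

crossover = crossover_tokens(
    fit_log_linear(token_grid, pretrained_losses),
    fit_log_linear(token_grid, random_losses),
)
```

The estimate is only as good as the log-linear assumption. If the real curves flatten late in training, as the authors caution, a straight-line fit will place the crossover too early.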
The interpretation is straightforward. Recurrence does not erase the value of pretraining. It amplifies the value of a good starting point. The pretrained transformer already contains useful representations distributed across its residual stream. The retrofit attempts to exploit the fact that transformer layers operate in a shared representation space, so a later block can be looped after post-training rather than trained as a fresh recurrent system from nothing.
That is the first commercial signal: retrofitted recurrence is not mainly a way to avoid pretraining. It is a way to reuse pretraining differently.
The recipe has four moving parts, and each one earns its keep
A naive reading of the paper might be: “Just loop some layers and increase recurrence at inference.” That would be a fine way to waste compute while feeling avant-garde.
The actual recipe has four technical dependencies.
First, the model is initialized from pretrained weights. This transfers knowledge and makes recurrent post-training far more efficient than starting from random initialization.
Second, the model trains over a distribution of recurrence depths rather than a single fixed loop count. The authors follow a Poisson-Lognormal setup, where “train recurrence” refers to the mean of the recurrence distribution.
Third, the average recurrence depth is scheduled upward during training. In the scheduling experiment, a linearly increasing recurrence curriculum improves validation loss per training FLOP. This matters because recurrent models spend a larger share of runtime in forward passes, especially under truncated backpropagation. Starting immediately at deep recurrence is computationally expensive; ramping up depth behaves like a curriculum for internal computation.
Fourth, the optimizer matters. In recurrent training experiments, Muon is more stable than AdamW. One AdamW run spikes and becomes NaN, while Muon achieves lower loss for recurrent models. For non-recurrent TinyLlama, the optimizer difference is much less pronounced. This is an implementation detail, but not a trivial one. When an architectural change shifts the stability profile, “just use our normal recipe” becomes famous last words.
These pieces are not decorative. They are what turns recurrence from an appealing diagram into a trainable system.
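The second and third ingredients can be sketched together: a mean recurrence that ramps linearly during training, and a per-step recurrence count drawn from a Poisson-Lognormal distribution around that mean. This is a hedged illustration only. The specific parameterisation (a normal draw with `sigma = 0.5`, a Poisson draw, and a `+ 1` shift so at least one pass always runs), the starting mean of 1, and the 25% ramp default are assumptions made for the example rather than details confirmed here; the shift also nudges the sample mean slightly above the nominal mean.

```python
import math
import random


def scheduled_mean_recurrence(
    step: int,
    total_steps: int,
    target_mean: float,
    ramp_fraction: float = 0.25,  # assumption: ramp over the first quarter of training
    start_mean: float = 1.0,      # assumption: begin with a single pass on average
) -> float:
    """Linearly ramp the mean recurrence, then hold it at the target."""
    ramp_steps = max(1, int(total_steps * ramp_fraction))
    progress = min(1.0, step / ramp_steps)
    return start_mean + (target_mean - start_mean) * progress


def sample_poisson(rate: float, rng: random.Random) -> int:
    """Knuth's multiplication method; adequate for the moderate rates used here."""
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def sample_recurrence(mean_recurrence: float, rng: random.Random, sigma: float = 0.5) -> int:
    """Draw a recurrence count from a Poisson-Lognormal around the given mean."""
    log_rate = rng.gauss(math.log(mean_recurrence) - 0.5 * sigma ** 2, sigma)
    return sample_poisson(math.exp(log_rate), rng) + 1


rng = random.Random(0)
mean_now = scheduled_mean_recurrence(step=125, total_steps=1000, target_mean=32)
depth_for_this_step = sample_recurrence(mean_now, rng)
```

Sampling a depth per step is what lets a single trained model tolerate many different loop counts at inference; the ramp keeps the early, cheap steps from paying for deep loops the model cannot yet use.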
| Paper element | Likely purpose | What it supports | What it does not prove |
|---|---|---|---|
| Figure 1 architecture | Mechanism / implementation detail | Shows how pretrained layers are selected, looped, and connected through an adapter | Does not itself show performance gains |
| Figure 2 initialization comparison | Main evidence for pretrained initialization | Pretrained weights make recurrent training far more efficient than random initialization | Does not prove the method scales to frontier models |
| Figure 3 recurrence scheduling | Efficiency ablation / sensitivity test | Increasing recurrence over training improves loss per FLOP | Does not mean all schedules are optimal |
| Figure 4 optimizer comparison | Implementation stability test | Muon stabilizes recurrent training better than AdamW in this setup | Does not imply Muon is universally superior |
| Figures 5–7 math results | Main evidence for task gains | Retrofitted recurrent models improve GSM8K and MATH under comparable training budgets | Does not prove broad reasoning improvement across domains |
| Figure 8 and Table 1 data curriculum | Robustness / general performance check | A healing phase helps preserve non-math benchmark performance after surgery | Does not make recurrence costless or automatically adaptive |
The appendix matters here, but mostly as support. Layer-choice ablations test whether the surgery is arbitrary; it is not. Scheduling ablations test whether the recurrence curriculum is doing useful work; it is. Tables for TinyLlama, OLMo, and Llama show the pattern across model families, with some variation. The appendix is not a second thesis. It is the scaffolding preventing the first thesis from collapsing politely in public.
The math gains are real, but the compute is doing work
The strongest headline results are on GSM8K and MATH. The authors post-train recurrent and non-recurrent models for approximately 50 billion tokens of Nemotron-CC-Math-v1 data, then compare performance over training FLOPs and over test recurrence counts.
For TinyLlama, a $(4,8,4)$ recurrent model has roughly 700 million remaining parameters, about 72.7% of the parent non-recurrent model’s parameters. For OLMo, a $(4,6,4)$ recurrent configuration leaves approximately 900 million parameters, about 87.5% of the pretrained model’s parameters. For Llama, the corresponding $(4,6,4)$ recurrent body has about 851 million non-embedding parameters.
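A quick arithmetic check suggests the quoted percentages track retained layer counts. The sketch below is an inference from the numbers, not a statement from the paper: the 22-layer TinyLlama depth is given above, while the 16-layer OLMo parent depth is an assumption implied by 14/16 = 87.5%.

```python
def retained_layer_fraction(parent_layers: int, prelude: int, recurrent: int, coda: int) -> float:
    """Share of the parent's transformer layers that survive the surgery."""
    return (prelude + recurrent + coda) / parent_layers


tinyllama_pct = round(100 * retained_layer_fraction(22, 4, 8, 4), 1)  # 72.7
olmo_pct = round(100 * retained_layer_fraction(16, 4, 6, 4), 1)       # 87.5, assuming a 16-layer parent
```

Both figures line up with the percentages quoted above, which suggests they describe the layer stack rather than embeddings or the output head.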
Despite having fewer unique parameters, the recurrent models can match or exceed the fixed-depth baseline on math tasks by using more recurrences at inference. The key phrase is “by using more recurrences.” This is not an efficiency miracle in which quality improves while compute vanishes. It is a different compute allocation model: fewer unique parameters, more reusable depth when needed.
The Llama appendix table illustrates the shape of the trade-off. The Llama non-recurrent baseline records 37.1 on GSM8K and 27.4 on MATH after post-training. A recurrent Llama $(4,6,4)$ model trained with recurrence 32 reaches 49.4 on GSM8K and 31.5 on MATH at test recurrence 32. Lower test recurrence values are weaker; deeper inference is what unlocks the gain.
That pattern is the whole economic proposition. You get a model whose inference can be dialled deeper on harder problems. But until automatic stopping is solved, the dial must be set by policy, benchmark, or engineering heuristics. The paper explicitly treats native adaptivity (automatically assigning the right amount of recurrence to each problem) as future work. So if a vendor announces “self-budgeting latent reasoning” tomorrow after reading only the abstract, please ask where the stopping criterion lives. Preferably before signing the annual contract.
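Until such a stopping criterion exists, setting the dial by policy might look something like the sketch below. Everything in it is hypothetical: the tier names, the recurrence values, and the latency model (a fixed base cost plus a per-recurrence cost) are illustrative placeholders, not measurements or recommendations from the paper.

```python
from typing import Dict, Optional

# Hypothetical tiers; real values would come from benchmarking a given deployment.
RECURRENCE_POLICY: Dict[str, int] = {
    "routine": 4,
    "standard": 16,
    "hard_reasoning": 32,
}


def choose_recurrence(
    tier: str,
    policy: Dict[str, int] = RECURRENCE_POLICY,
    default: int = 4,
    latency_budget_ms: Optional[float] = None,
    base_ms: float = 0.0,
    ms_per_recurrence: float = 0.0,
) -> int:
    """Pick a recurrence count for a request, capped by an optional latency budget."""
    requested = policy.get(tier, default)
    if latency_budget_ms is None or ms_per_recurrence <= 0:
        return requested
    affordable = int((latency_budget_ms - base_ms) // ms_per_recurrence)
    return max(1, min(requested, affordable))


uncapped = choose_recurrence("hard_reasoning")
capped = choose_recurrence(
    "hard_reasoning", latency_budget_ms=100, base_ms=20, ms_per_recurrence=5
)
```

The point of the sketch is where the decision lives: in a lookup table and a latency cap maintained by engineers, not inside the model.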
Healing after surgery is not optional
The paper’s most practically useful section may be the data curriculum experiment, because it addresses the uncomfortable fact that model surgery damages the patient.
Earlier math-only post-training improves math performance but slightly degrades non-reasoning benchmarks such as HellaSwag, ARC, and OpenBookQA. That is not surprising. If you remove layers, rewire the model, and then feed it heavily mathematical data, it may become better at math while forgetting how to behave like a generally competent language model. Such is specialisation. Useful in engineers; annoying in deployed systems.
To handle this, the authors test a two-phase procedure for TinyLlama. In single-phase training, they train for 26 billion tokens on an even mix of FineWeb-Edu, Nemotron-Pretraining-SFT-v1-General, and Nemotron-Pretraining-SFT-v1-Math. In two-phase training, they first run a 26-billion-token “healing” phase on FineWeb-Edu, then train on the same mixed data for another 26 billion tokens, for a total of 52 billion tokens. The recurrence curriculum runs over 25% of total steps.
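Written down as a plan, the two-phase schedule is small enough to fit in a config. The sketch below encodes the token budgets and dataset names given above; reading “even mix” as equal one-third sampling weights is an assumption, and the `Phase` structure and helper names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List

BILLION = 10 ** 9


@dataclass(frozen=True)
class Phase:
    name: str
    tokens: int
    mix: Dict[str, float]  # dataset name -> sampling weight (weights sum to 1)


def two_phase_plan() -> List[Phase]:
    even = 1.0 / 3.0  # assumption: "even mix" means equal weights
    return [
        Phase("healing", 26 * BILLION, {"FineWeb-Edu": 1.0}),
        Phase(
            "mixed",
            26 * BILLION,
            {
                "FineWeb-Edu": even,
                "Nemotron-Pretraining-SFT-v1-General": even,
                "Nemotron-Pretraining-SFT-v1-Math": even,
            },
        ),
    ]


def tokens_per_dataset(plan: List[Phase]) -> Dict[str, float]:
    """Total tokens drawn from each dataset across all phases."""
    totals: Dict[str, float] = {}
    for phase in plan:
        for dataset, weight in phase.mix.items():
            totals[dataset] = totals.get(dataset, 0.0) + weight * phase.tokens
    return totals


def curriculum_steps(total_steps: int, fraction: float = 0.25) -> int:
    """Number of steps over which the recurrence curriculum ramps."""
    return int(total_steps * fraction)


plan = two_phase_plan()
total_tokens = sum(phase.tokens for phase in plan)
per_dataset = tokens_per_dataset(plan)
```

Under these assumptions, general web text dominates the overall budget even though the second phase is evenly mixed, which fits the intent of a healing phase: restore general competence before specialising.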
The result is clear enough to be operationally useful. In Table 1, the two-phase recurrent model at test recurrence 32 reaches 65.2 on ARC-Easy, 37.7 on ARC-Challenge, 60.4 on HellaSwag, 44.8 on MMLU, 51.2 on GSM8K, and 14.2 on MATH. The static-depth TinyLlama two-phase baseline reaches 62.5, 36.5, 60.3, 44.4, 45.2, and 12.8 respectively on those same benchmarks.
The point is not that recurrence dominates every metric. It does not. PIQA, Winogrande, and OpenBookQA are close, and some static-depth results remain competitive. The point is narrower and more credible: with a healing phase, the recurrent model can preserve broad benchmark performance while gaining math performance through test-time recurrence.
This is where the paper becomes more useful than the usual “new architecture improves benchmark” story. It shows that the retrofit is a pipeline intervention, not a standalone architecture trick. The order matters: surgery, healing, then targeted capability training. Businesses implementing domain-specialised models already know this pattern under different names. You do not restructure an operating system and immediately benchmark the accounting module. Unless you enjoy false negatives.
The business value is variable reasoning depth, not cheaper everything
The most interesting business implication is not that recurrent models are “smaller.” In deployed AI systems, smaller parameter count is only one line item. Latency, GPU memory, throughput, batchability, engineering complexity, and quality control all have opinions.
The better interpretation is that recurrence offers a new compute control surface.
Today, many AI systems scale reasoning by spending more tokens: longer chain-of-thought, multiple sampled solutions, verifier passes, or agentic retries. That creates observable text, longer context, and often messy orchestration. Depth recurrence shifts some of that spend into latent computation. The model does more work internally by looping a block of layers, without necessarily producing a longer reasoning trace.
For business systems, Cognaptus would frame the pathway like this:
| Direct paper result | Business inference | Boundary |
|---|---|---|
| Pretrained models can be converted into recurrent models | Existing open-weight models may become platforms for architectural retrofits, not just fine-tuning substrates | Demonstrated mainly at 1B scale |
| More recurrences improve math benchmark performance | Inference compute could be allocated more heavily to hard reasoning tasks | Extra recurrence increases FLOPs and latency |
| Scheduling recurrence improves training efficiency | Post-training recipes may become as important as architecture choice | Recipe sensitivity remains high |
| Healing improves broad benchmark preservation | Domain specialisation should include recovery phases after structural edits | Evidence is strongest for TinyLlama in this section |
| Adaptive stopping is future work | Production systems still need external policies for recurrence budgets | No automatic “think until solved” mechanism yet |
This is especially relevant for workloads where correctness matters more than conversational speed: mathematical verification, code reasoning, tool planning, scientific QA, technical support triage, and enterprise decision workflows where difficult cases can tolerate higher latency.
It is less compelling for high-throughput casual chat, where every extra recurrence competes against response time and serving cost. A customer asking for a password reset does not need a model to gaze deeply into latent space. It needs to stop being dramatic and solve the ticket.
Where the evidence stops
The paper is careful about its boundaries, and the business interpretation should be equally careful.
First, the experiments are at roughly the 1B-parameter scale. That is useful because many enterprise deployments live in the small-to-medium open-model world, but it does not prove the method scales cleanly to much larger models. Larger models have different optimisation behaviour, serving constraints, and post-training economics.
Second, much of the strongest task improvement is mathematical. GSM8K and MATH are good stress tests for reasoning-like behaviour, but they are not a complete enterprise workload suite. The authors explicitly call for work in other reasoning-intensive domains.
Third, the method still spends extra inference compute. Recurrence decouples unique parameter count from test-time compute; it does not repeal arithmetic. A model at test recurrence 32 is doing substantially more work than one at recurrence 1.
Fourth, automatic recurrence allocation remains unsolved. The paper discusses the need for built-in stopping criteria so models can spend more compute on hard problems and less on easy ones. Until that exists, recurrence depth is a configuration decision, not a self-governing intelligence budget.
Finally, the model surgery itself is not arbitrary. Layer choice, adapter design, optimizer, recurrence schedule, and curriculum all affect success. This is not yet a commodity knob like temperature. It is closer to an advanced model-engineering procedure with enough sharp edges to deserve adult supervision.
The useful takeaway is not “models think deeper.” It is “depth can be budgeted.”
The paper’s title says pretrained language models can be taught to think deeper. That is directionally right, but the more operational message is sharper: depth can become a variable resource.
Instead of buying every increment of reasoning through more parameters or more generated tokens, a system can reuse internal layers for additional computation. That creates an appealing middle path between static small models and expensive reasoning-time orchestration. It also gives AI teams a more nuanced performance lever: not just “which model?” but “how much internal computation should this case receive?”
The old enterprise fantasy was a single model that is cheap, fast, accurate, general, adaptive, auditable, and somehow also delightful. That model remains conveniently unavailable. Retrofitted recurrence offers something less magical and more useful: a way to make existing models spend depth where depth helps.
The lesson is not that recurrence gives free intelligence. It gives a new place to put the bill.
And in AI infrastructure, finding a better place to put the bill is sometimes what progress looks like.
Cognaptus: Automate the Present, Incubate the Future.
[1] Sean McLeish, Ang Li, John Kirchenbauer, Dayal Singh Kalra, Brian R. Bartoldson, Bhavya Kailkhura, Avi Schwarzschild, Jonas Geiping, Tom Goldstein, and Micah Goldblum, “Teaching Pretrained Language Models to Think Deeper with Retrofitted Recurrence,” arXiv:2511.07384, 2025, https://arxiv.org/abs/2511.07384.