Reasoning is expensive mostly because we make the model say it.
That sounds almost too simple, which is usually where trouble begins. Chain-of-thought reasoning improved language-model performance by giving the model a written workspace: first solve, then answer. But the same trick also turns internal computation into external communication. Every intermediate step must be decoded, formatted, and passed forward one token at a time. The model is not just thinking; it is producing a small essay it may not need to show anyone.
The paper “Unlocking the Working Memory of Large Language Models for Latent Reasoning” by Lukas Aichberger and Sepp Hochreiter proposes a cleaner split.[1] Its method, Reasoning in Memory (RiM), asks whether a model can use fixed special-token blocks as an internal workspace instead of generating written reasoning steps or autoregressive continuous thoughts.
The important word is fixed.
Many latent-reasoning methods remove natural-language text from the reasoning trace but keep the sequential generation pattern. They replace words with hidden vectors, then generate those vectors step by step. This reduces linguistic overhead, but it does not fully remove the autoregressive bottleneck. RiM’s bet is different: give the model a fixed sequence of memory blocks, train those blocks to carry task-relevant computation, and process them in one forward pass.
In business terms, the paper is not merely about making reasoning more mysterious. We already have enough mystery, thank you. It is about a possible control layer for inference cost: stronger answer quality without paying the full latency price of visible chain-of-thought.
The core mechanism is not “hidden reasoning”; it is fixed working memory
A common misunderstanding of latent reasoning is that anything hidden is automatically efficient. Not quite. If a method still generates intermediate latent states one after another, it may be silent, but it remains sequential. Silent is not the same as cheap.
RiM changes the layout of the computation.
Instead of asking the model to generate a chain of reasoning tokens, RiM appends memory blocks to the input question. Each block contains special tokens: boundary markers and memory tokens. These tokens have no natural-language meaning. Their job is not to say “therefore” in a quieter voice. Their job is to become positions in the transformer sequence where task-specific intermediate representations can form.
A simplified picture looks like this:
Question
↓
[Memory block 1] → readout
[Memory block 2] → readout
[Memory block 3] → readout
...
[Memory block 8] → final answer readout
The memory blocks are ordinary input tokens from the model’s perspective, but they are fixed tokens rather than generated tokens. That matters operationally. Since the blocks are already present in the input, the model can process the augmented sequence in a single forward pass. The final answer is then generated after the latent workspace has been computed.
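To make the layout concrete, here is a minimal sketch of how a question could be augmented with fixed memory blocks. The token strings (`<mem_start>`, `<mem>`, `<mem_end>`) and the choice of four memory tokens per block are assumptions for illustration; the article does not give the paper's actual token names or block sizes. The eight blocks follow the simplified picture above.

```python
from typing import List

# Hypothetical special-token strings, chosen only for this illustration.
BLOCK_START = "<mem_start>"
BLOCK_END = "<mem_end>"
MEM_TOKEN = "<mem>"


def build_memory_block(tokens_per_block: int) -> List[str]:
    """One block: a boundary marker, some memory tokens, a boundary marker."""
    return [BLOCK_START] + [MEM_TOKEN] * tokens_per_block + [BLOCK_END]


def augment_question(
    question_tokens: List[str], num_blocks: int, tokens_per_block: int
) -> List[str]:
    """Append a fixed number of memory blocks to the question tokens."""
    sequence = list(question_tokens)
    for _ in range(num_blocks):
        sequence.extend(build_memory_block(tokens_per_block))
    return sequence


question = "Tom has 3 apples and buys 4 more .".split()
augmented = augment_question(question, num_blocks=8, tokens_per_block=4)
print(len(question), len(augmented))
```

The point of the sketch is that nothing here is generated. The augmented sequence is fully known before the model runs, which is why it can be processed in one forward pass.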
This is the paper’s first contribution: RiM decouples internal computation from external generation. Chain-of-thought makes the model expose its scratchpad. Coconut-style latent reasoning hides the scratchpad but still generates latent thoughts sequentially. RiM installs a fixed scratchpad inside the prompt sequence and trains the model to use it.
That fixed scratchpad does not work by magic. The paper is careful about this. Prior work on filler or pause tokens has shown that adding semantically empty tokens can fail, or even distract the model, unless training gives those tokens a useful computational role. The model must be taught that these positions are not decorative padding. Otherwise, it will behave like many employees seeing a new enterprise dashboard: ignore it until management forgets about it.
Stage 1 gives the memory blocks a job
The authors use a two-stage curriculum. This is the second major contribution of the paper, and it is more important than the name “working memory” might suggest.
In Stage 1, RiM trains memory blocks using explicit reasoning-step supervision. The training data contains math questions, reasoning steps, and final answers. For each reasoning step, the model receives a corresponding memory block. After each memory block, a readout is trained to predict the next written reasoning step.
The key design choice is the attention mask. The readout is allowed to attend to the question and the memory blocks available up to that point, but not to other ground-truth reasoning-step readouts. This prevents an easy shortcut. If the target step could simply look at previous written steps, the model could predict the next step from visible text and leave the memory blocks idle. RiM blocks that escape route.
So Stage 1 is not mainly about teaching the model to write better explanations. It is about forcing the latent positions to carry enough intermediate information that a written reasoning step can be recovered from them.
This is why the paper’s custom attention mask should be read as an implementation detail with strategic importance. It is not just a neat transformer trick. It protects the causal claim that useful computation is being pushed through the memory blocks rather than smuggled through supervised text.
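The masking rule can be sketched as a boolean matrix. This is an illustration under stated assumptions, not the authors' implementation: it assumes the Stage 1 sequence interleaves each memory block with its readout, and it additionally assumes that memory blocks themselves cannot see earlier readout text. The function name and the toy lengths are hypothetical.

```python
from typing import List, Tuple


def build_stage1_mask(
    question_len: int, block_len: int, readout_lens: List[int]
) -> Tuple[List[List[bool]], List[str]]:
    """Return mask[i][j] == True when position i may attend to position j."""
    # Label every position with its kind and the block index that owns it.
    kinds: List[str] = ["q"] * question_len
    owners: List[int] = [-1] * question_len
    for k, readout_len in enumerate(readout_lens):
        kinds += ["mem"] * block_len + ["read"] * readout_len
        owners += [k] * (block_len + readout_len)

    n = len(kinds)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):  # causal: never look ahead
            if kinds[j] == "read":
                # Readout text is visible only from inside the same readout.
                mask[i][j] = kinds[i] == "read" and owners[i] == owners[j]
            else:
                # Question and memory positions stay visible to later positions.
                mask[i][j] = True
    return mask, kinds


# Toy layout: 2 question tokens, then (2 memory tokens + 2 readout tokens) twice.
mask, kinds = build_stage1_mask(question_len=2, block_len=2, readout_lens=[2, 2])
for row, kind in zip(mask, kinds):
    print(f"{kind:>4} ", "".join("x" if allowed else "." for allowed in row))
```

In the printed grid, the rows for the second readout show gaps in the columns that belong to the first readout. That gap is the blocked escape route: the only way information from earlier steps can reach a later readout is through the memory positions.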
Stage 2 turns the workspace into an answer machine
Stage 1 teaches the blocks to support reasoning-step reconstruction. But deployment does not need reasoning steps. It needs final answers.
In Stage 2, the paper removes reasoning-step supervision and trains each memory-block readout to predict the final answer. The model now sees a fixed number of memory blocks for every sample. Each readout has access to progressively more memory blocks, and later readouts are weighted more heavily because they have access to more latent computation.
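A small sketch of what "later readouts are weighted more heavily" could look like in a loss. The linear weighting below is an assumption made for illustration; the article does not state the paper's exact weighting scheme. `answer_probs` is a hypothetical input holding the probability each readout assigns to the correct final answer.

```python
import math
from typing import List


def block_weights(num_blocks: int) -> List[float]:
    """Linearly increasing weights that sum to 1 (an assumed scheme)."""
    raw = [k + 1 for k in range(num_blocks)]
    total = sum(raw)
    return [r / total for r in raw]


def stage2_loss(answer_probs: List[float]) -> float:
    """Weighted negative log-likelihood of the final answer across readouts."""
    weights = block_weights(len(answer_probs))
    return sum(w * -math.log(p) for w, p in zip(weights, answer_probs))


# Same two probabilities, placed early versus late in the block sequence.
early_good = stage2_loss([0.9, 0.5])
late_good = stage2_loss([0.5, 0.9])
print(early_good, late_good)
```

With any increasing weighting, getting the answer right at the last block lowers the loss more than getting it right at the first block. That is the pressure that makes the final-block readout the one worth deploying.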
The curriculum therefore moves through a sequence:
| Training phase | What is supervised | Likely purpose | Deployment meaning |
|---|---|---|---|
| Stage 1 | Next reasoning step after each memory block | Ground the memory blocks as useful latent states | Teach the scratchpad how to hold intermediate computation |
| Stage 2 | Final answer after each memory block | Convert grounded latent states into answer refinement | Make a fixed final-block readout usable at inference |
This is also where RiM differs from a naive “add silent tokens and hope” approach. The paper does not claim that working memory appears just because special tokens are present. It argues that useful working memory emerges when the model is trained with dense supervision that gives those tokens a computational role, then retargeted toward final-answer prediction.
That distinction is small in wording and large in engineering.
The representation evidence checks whether the blocks are doing anything
Before the paper asks whether RiM improves accuracy, it asks a more basic question: are the memory blocks actually becoming input-dependent latent states?
This is not decorative analysis. It is the main evidence for the mechanism.
The authors train Llama-3.2-1B and collect memory-block representations across GSM8K test questions during training. They project these representations into a shared PCA basis and examine how the memory blocks move over training and across questions.
The result is qualitative but meaningful: before training, the memory-block representations are largely collapsed; after RiM training, they become block-specific and sample-dependent. The trajectories move smoothly during training, and different questions induce different latent workspace structures.
In plain language: the blocks stop looking like identical placeholders and start behaving like task-conditioned internal states.
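For readers who want to see what "a shared PCA basis" means operationally, here is a minimal NumPy sketch. The data is random and only stands in for real hidden states; the array shape (checkpoints, questions, blocks, hidden size) and all dimensions are assumptions, not values from the paper.

```python
import numpy as np


def fit_shared_pca(reps: np.ndarray, n_components: int = 2):
    """Fit one PCA basis over all representations pooled together."""
    flat = reps.reshape(-1, reps.shape[-1])
    mean = flat.mean(axis=0)
    # Rows of vt are principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(flat - mean, full_matrices=False)
    return mean, vt[:n_components]


def project(reps: np.ndarray, mean: np.ndarray, components: np.ndarray) -> np.ndarray:
    """Project representations onto the shared basis."""
    return (reps - mean) @ components.T


rng = np.random.default_rng(0)
# Synthetic stand-in: 3 checkpoints, 5 questions, 8 memory blocks, 16 hidden dims.
reps = rng.normal(size=(3, 5, 8, 16))
mean, components = fit_shared_pca(reps)
coords = project(reps, mean, components)
print(coords.shape)
```

Because the basis is fitted once and reused, a block's coordinates at different checkpoints and for different questions are directly comparable. Collapsed representations would show up as points piled on top of each other in `coords`; block-specific and sample-dependent ones would spread out.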
This evidence does not prove that the model is “reasoning like a human.” The paper borrows the working-memory analogy from cognitive psychology, but the experiment does not inspect mental arithmetic in a biological brain, mercifully. What it supports is narrower and more useful: the added memory positions become structured internal representations that vary with the problem and evolve across memory depth.
That is the right evidentiary level. Business readers should resist the temptation to translate every latent-space plot into a personality profile for the model. The plot says the workspace is being used. It does not say the workspace is transparent.
The main benchmark result is an accuracy-latency tradeoff, not just a leaderboard bump
The paper evaluates RiM on GSM8K as the in-distribution math benchmark and GSM-Hard as an out-of-distribution benchmark. Training uses GSM8K-Aug, with explicit arithmetic reasoning steps used for Stage 1 supervision. The models cover GPT-2, Llama-3.2-1B, and Llama-3.2-3B.
The baselines matter because each answers a different operational question.
| Baseline | What it tests | Why it matters |
|---|---|---|
| SFT without CoT | Direct answer generation | Can RiM beat the cheap answer-only baseline? |
| SFT with CoT | Explicit written reasoning | How close can RiM get to slower visible reasoning? |
| Coconut | Autoregressive latent reasoning | Can fixed memory blocks beat sequential continuous thoughts? |
| DART official numbers | Literature context | Is RiM competitive with more involved silent-reasoning methods? |
The deployable comparison is the final-block readout. This is the realistic setting: after a fixed memory budget, use the final answer. On GSM8K, RiM final-block accuracy beats both direct-answer SFT and the strongest Coconut variant across all three model backbones.
The numbers are not subtle:
| Model | SFT w/o CoT GSM8K greedy | Coconut GSM8K greedy | RiM final-block GSM8K greedy | RiM TTFT vs SFT w/o CoT |
|---|---|---|---|---|
| GPT-2 | 15.4% | 31.1% | 33.6% | same: 7.6 ms |
| Llama-3.2-1B | 23.9% | 36.9% | 42.1% | same: 16.1 ms |
| Llama-3.2-3B | 36.2% | 41.3% | 48.8% | same: 27.9 ms |
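The gaps are easier to compare as percentage-point differences. The snippet below only restates the table's GSM8K greedy numbers and subtracts them; the dictionary keys are labels chosen for this illustration.

```python
# GSM8K greedy accuracy (%) copied from the table above.
results = {
    "GPT-2": {"sft_no_cot": 15.4, "coconut": 31.1, "rim": 33.6},
    "Llama-3.2-1B": {"sft_no_cot": 23.9, "coconut": 36.9, "rim": 42.1},
    "Llama-3.2-3B": {"sft_no_cot": 36.2, "coconut": 41.3, "rim": 48.8},
}

gains = {}
for model, acc in results.items():
    gains[model] = {
        "vs_sft": acc["rim"] - acc["sft_no_cot"],
        "vs_coconut": acc["rim"] - acc["coconut"],
    }
    print(
        f"{model}: +{gains[model]['vs_sft']:.1f} pts vs SFT w/o CoT, "
        f"+{gains[model]['vs_coconut']:.1f} pts vs Coconut"
    )
```

RiM's margin over direct-answer SFT is 18.2, 18.2, and 12.6 points for the three backbones, and its margin over Coconut is 2.5, 5.2, and 7.5 points. In this table, the lead over Coconut widens as the backbone grows.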
On GSM-Hard, the pattern is similar, although the absolute numbers are much lower, as expected for a harder out-of-distribution benchmark. RiM final-block greedy accuracy reaches 7.8% for GPT-2, 10.5% for Llama-3.2-1B, and 12.0% for Llama-3.2-3B, each above the corresponding Coconut and direct-answer SFT results.
The business-relevant interpretation is not “RiM solves math.” It does not. The best GSM-Hard greedy score in the main final-block table is still 12.0%. The point is that RiM improves the answer-only path while preserving the latency profile of answer-only inference.
That is a very different claim from “latent reasoning is more accurate.” The full claim is:
With fixed memory blocks trained through a two-stage curriculum, RiM improves over direct-answer and autoregressive latent baselines on these math benchmarks while keeping time-to-first-token essentially equal to direct-answer SFT.
That longer sentence is less catchy. It is also the actual result.
The latency evidence is the operational center of the paper
For AI products, latency is not a footnote. It is a product feature, a cost driver, and occasionally the reason users close the tab.
The paper reports time to first token (TTFT) in the main results and an additional full-answer latency table for Llama-3.2-1B. The full-answer latency table is especially clear:
| Method | Avg. generated tokens | Wall-clock time per GSM8K question |
|---|---|---|
| SFT w/o CoT | 3.1 | 126.0 ms |
| SFT w/ CoT | 36.7 | 1108.7 ms |
| Coconut | 3.1 | 304.7 ms |
| RiM | 3.1 | 126.5 ms |
This table explains the paper’s practical relevance better than any philosophical discussion of working memory. RiM adds fixed input-side computation, so it behaves like the direct-answer model in generation length and measured latency. Coconut still pays for sequential latent-state generation. Chain-of-thought pays for visible reasoning text.
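The same table can be turned into relative slowdowns. The snippet uses only the reported Llama-3.2-1B numbers. The final line is a rough back-of-the-envelope figure that assumes latency grows linearly with generated tokens, which is a simplification and not a claim made by the paper.

```python
# Wall-clock time per GSM8K question (ms) and average generated tokens, from the table.
latency_ms = {"SFT w/o CoT": 126.0, "SFT w/ CoT": 1108.7, "Coconut": 304.7, "RiM": 126.5}
generated_tokens = {"SFT w/o CoT": 3.1, "SFT w/ CoT": 36.7, "Coconut": 3.1, "RiM": 3.1}

baseline = latency_ms["SFT w/o CoT"]
ratios = {method: t / baseline for method, t in latency_ms.items()}
for method, ratio in ratios.items():
    print(f"{method}: {ratio:.2f}x the direct-answer latency")

# Rough marginal cost of each extra written token, assuming linear scaling.
extra_ms = latency_ms["SFT w/ CoT"] - latency_ms["SFT w/o CoT"]
extra_tokens = generated_tokens["SFT w/ CoT"] - generated_tokens["SFT w/o CoT"]
ms_per_extra_token = extra_ms / extra_tokens
print(f"about {ms_per_extra_token:.1f} ms per extra generated token")
```

Written chain-of-thought costs roughly 8.8 times the direct-answer latency in this setup, Coconut roughly 2.4 times, and RiM about 1.004 times. Under the linear assumption, each extra written token costs on the order of 29 ms here.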
The inference-cost logic is therefore:
Visible CoT:
more reasoning tokens → more sequential decoding → much higher latency
Autoregressive latent reasoning:
fewer visible tokens, but latent steps still generated sequentially → moderate latency penalty
RiM:
fixed memory blocks processed in one forward pass → near direct-answer latency
This is why the paper is interesting for enterprise deployment. Many business workflows want reasoning quality but do not want verbose reasoning traces exposed to users, logged into records, or decoded at every interaction. Customer support triage, compliance pre-checks, spreadsheet-agent validation, report extraction, and internal analytics all face the same friction: deeper reasoning is valuable, but slow and noisy reasoning can make the product worse.
RiM suggests one possible route: move some computation into a trained internal workspace, then expose only the answer.
That is a research result, not a procurement recommendation. But it is a useful direction.
The appendix tests are not side quests; they tell us what part of RiM matters
The paper’s appendix contains several tests that should not be read as extra leaderboard decoration. They answer different questions about the mechanism.
| Test or result | Likely purpose | What it supports | What it does not prove |
|---|---|---|---|
| Stage-switch ablation | Ablation | Stage 1 grounding before Stage 2 matters | Exact optimal switch timing for other tasks |
| RiM vs Coconut-style curriculum | Ablation / mechanism isolation | Fixed memory blocks need dense supervision, not just gradual replacement | That Coconut is universally weak |
| Expanded representation plots | Mechanism evidence | Memory blocks become structured and sample-dependent | Human-like interpretability |
| Probe-based answer selection | Exploratory extension | Correctness information is partly accessible in memory-block representations | A finished deployable verifier |
| Official-number comparison with DART/Coconut | Literature comparison | RiM is competitive despite lower or simpler training cost | Strict apples-to-apples superiority |
| Empirical full-answer latency | Implementation evidence | RiM preserves answer-only latency in this setup | Latency under all serving stacks and sequence lengths |
Two appendix findings deserve special attention.
First, the stage-switch ablation supports the curriculum story. Training with Stage 2 alone improves final-answer accuracy quickly but plateaus below runs that first ground the memory blocks with Stage 1. Stage 1 alone can produce high any-block accuracy, but final-block accuracy remains weak because the model has not been trained to use a fixed final readout. The switch matters because the two stages solve different problems: create the workspace, then make it deployable.
Second, the comparison with a Coconut-style curriculum applied to RiM’s fixed blocks isolates the training signal. Keeping fixed memory blocks but replacing RiM’s dense supervision with a gradual Coconut-style curriculum performs substantially worse. That is a useful ablation because it reduces the chance that the result is merely “special tokens good.” The stronger interpretation is: special tokens become useful when the supervision forces computation through them.
The probe-based result is more speculative but strategically interesting. The authors train lightweight linear probes on memory-block representations to predict whether each block’s readout is correct. Conditioned on the subset where at least one memory block produces a correct answer, the selection procedure chooses a correct answer 90% of the time. This does not make any-block accuracy magically deployable. It does suggest that future systems might learn when to trust different latent depths, rather than blindly using the last block every time.
For business AI systems, that points toward a broader design: latent reasoning plus internal confidence routing. The paper does not build that product. It merely opens the door and leaves a tasteful note.
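A sketch of probe-based selection, under assumptions: one linear probe per memory block scores how likely that block's readout is to be correct, and the answer from the highest-scoring block is returned. The "pick the highest score" rule, the probe weights, and the two-dimensional toy representations are all hypothetical. The article does not describe the paper's exact selection procedure, and the probes here are not trained.

```python
import math
from typing import List, Sequence, Tuple

Probe = Tuple[Sequence[float], float]  # (weights, bias) of one linear probe


def probe_score(probe: Probe, rep: Sequence[float]) -> float:
    """Probability-like score that this block's readout is correct."""
    weights, bias = probe
    logit = sum(w * x for w, x in zip(weights, rep)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


def select_answer(
    block_reps: List[Sequence[float]], block_answers: List[str], probes: List[Probe]
) -> Tuple[str, int, List[float]]:
    """Return the answer of the block whose probe is most confident."""
    scores = [probe_score(p, rep) for p, rep in zip(probes, block_reps)]
    best = max(range(len(scores)), key=scores.__getitem__)
    return block_answers[best], best, scores


# Toy example: three blocks, hand-picked probe weights and representations.
probes = [([1.0, 0.0], 0.0)] * 3
block_reps = [[-1.0, 0.0], [2.0, 0.0], [0.5, 0.0]]
block_answers = ["12", "18", "17"]
answer, best_block, scores = select_answer(block_reps, block_answers, probes)
print(answer, best_block)
```

In this toy case the second block's probe is the most confident, so its answer is selected instead of the last block's. Whether such a selector helps in practice depends on how well the probes generalize, which is exactly the part the paper treats as exploratory.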
What Cognaptus infers for business use
The paper directly shows a method-level result on math reasoning benchmarks. The business interpretation requires one careful step of inference.
Here is the separation:
| Layer | Claim | Confidence |
|---|---|---|
| Paper directly shows | Fixed memory blocks trained with RiM improve over direct-answer SFT and Coconut on GSM8K/GSM-Hard while keeping latency close to direct-answer inference | Supported within the paper’s experimental setup |
| Reasonable business inference | Internal latent workspaces could help enterprise AI systems trade accuracy against latency more efficiently than visible CoT in some structured reasoning tasks | Plausible, not proven |
| Still uncertain | Whether RiM scales to frontier models, messy business documents, code agents, tool-use workflows, legal reasoning, or multi-modal pipelines | Open question |
The most attractive application category is not general chatbot conversation. It is high-volume bounded reasoning, where the answer format is short, latency matters, and visible reasoning is not always desirable.
Examples include:
| Workflow | Why RiM-like reasoning could matter | Boundary |
|---|---|---|
| Document extraction validation | Need hidden consistency checks before returning a field | Math benchmarks are not document extraction |
| Customer support routing | Need fast classification with some multi-step policy reasoning | Requires domain-specific supervision |
| Financial report QA | Need stronger internal computation without verbose traces | Numerical and regulatory accuracy need separate verification |
| Agent pre-action checks | Need quick “should I act?” reasoning before tool calls | Tool-use dynamics are not tested here |
| Internal analytics assistants | Need latency-efficient reasoning over structured prompts | Context length and retrieval interaction remain uncertain |
The deeper business lesson is about reasoning-budget design. We should stop thinking of “more reasoning” as a single knob. There are at least four different knobs:
- visible written reasoning;
- autoregressive latent reasoning;
- fixed internal memory computation;
- external verification or selection after candidate answers.
RiM explores the third knob. It may eventually combine well with the fourth. A practical system might use fixed latent memory for cheap first-pass reasoning, then route uncertain cases to explicit chain-of-thought, tool verification, or human review.
That architecture is more realistic than pretending every query deserves the same theatrical monologue.
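The routing idea can be sketched in a few lines. Everything here is hypothetical: `fast_path` stands in for a latent-memory model that returns an answer plus a confidence, `slow_path` stands in for explicit chain-of-thought, tool verification, or human review, and the 0.8 threshold is arbitrary. The paper does not build or evaluate such a system.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class RoutedAnswer:
    answer: str
    path: str  # "latent" or "escalated"


def route(
    question: str,
    fast_path: Callable[[str], Tuple[str, float]],
    slow_path: Callable[[str], str],
    threshold: float = 0.8,
) -> RoutedAnswer:
    """Use the cheap latent answer when confident, otherwise escalate."""
    answer, confidence = fast_path(question)
    if confidence >= threshold:
        return RoutedAnswer(answer, "latent")
    return RoutedAnswer(slow_path(question), "escalated")


# Stub models for illustration only.
def toy_fast(question: str) -> Tuple[str, float]:
    return ("42", 0.95) if "easy" in question else ("?", 0.30)


def toy_slow(question: str) -> str:
    return "checked answer"


easy = route("easy sum", toy_fast, toy_slow)
hard = route("hard contract clause", toy_fast, toy_slow)
print(easy, hard)
```

The design choice worth noticing is that the expensive path is only paid for on the uncertain fraction of traffic. How large that fraction is, and how trustworthy the confidence signal is, would have to be measured per workflow.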
The limitations are real, but they are specific
The paper’s limitations are not generic “AI may be risky” wallpaper. They affect how far we can carry the result.
First, the evaluation is concentrated on mathematical reasoning benchmarks: GSM8K and GSM-Hard. These are useful because they provide clear final answers and structured reasoning traces, but they are not a proxy for all business reasoning. A method that works on arithmetic word problems may not automatically work on contract review, market commentary, operations planning, or multi-step tool use.
Second, the training uses GSM8K-Aug reasoning steps for Stage 1. RiM’s curriculum depends on intermediate supervision. Many enterprise tasks do not have clean step-by-step labels. Creating them may require distillation from stronger models, human annotation, synthetic data generation, or task-specific process logs. That cost belongs in the ROI calculation.
Third, the models are GPT-2 and Llama-3.2 at 1B and 3B scale, trained with LoRA adapters. This is valuable for controlled research, but it does not settle behavior at frontier-model scale. Larger models may already internalize some reasoning differently; they may also benefit from RiM in other ways. We do not know yet.
Fourth, latent memory reduces visibility. A written chain-of-thought is not always faithful, but it is at least inspectable. RiM moves more computation into hidden representations. That may be good for latency and privacy of internal reasoning, but bad for auditability unless paired with probes, verifiers, or external evidence checks.
Finally, the latency results are strong within the tested setup, but serving environments differ. Input length, batching, hardware, attention implementation, and memory-block budget can all change real deployment economics. The paper’s key latency advantage comes from removing sequential intermediate generation, and that mechanism is credible. The exact milliseconds should not be copy-pasted into a vendor slide. Some restraint, please.
The article’s bottom line: reasoning needs a workspace, not always a transcript
RiM is interesting because it reframes a practical bottleneck. The usual story says stronger reasoning requires more generated reasoning. RiM asks whether the model can learn to reason in fixed internal memory blocks, then answer directly.
The results support three linked claims.
First, the memory blocks become structured, block-specific, and sample-dependent. They are not merely padding.
Second, the two-stage curriculum matters. Stage 1 grounds the latent workspace; Stage 2 converts it into final-answer refinement.
Third, RiM improves the accuracy-latency tradeoff on GSM8K and GSM-Hard relative to direct-answer SFT and Coconut, while preserving near answer-only inference speed.
For business readers, the most useful interpretation is not that chain-of-thought is obsolete. It is that reasoning systems may need multiple internal modes. Some tasks require visible rationale, audit trails, and external verification. Other tasks require fast, private, bounded computation before a short answer. RiM belongs to the second category.
The model does not always need to think out loud. Sometimes it needs a better desk.
Cognaptus: Automate the Present, Incubate the Future.
[1] Lukas Aichberger and Sepp Hochreiter, “Unlocking the Working Memory of Large Language Models for Latent Reasoning,” arXiv:2605.30343, 2026. https://arxiv.org/abs/2605.30343