Cache Me If You Can: Why Enterprise AI Needs Latent Working Memory

A codebase is not a paragraph.

Neither is a litigation folder, a clinical case file, a customer-support history, a policy archive, or the slow-motion disaster known as “all meeting notes since March.” Yet many enterprise AI systems still treat long context as a heroic prompt-engineering problem: push more text into the model, pray the key detail survives attention, and call the bill “innovation.”

The paper “End-to-End Context Compression at Scale” argues for a different design pattern: do not merely trim context after the model has already swallowed it; compress the context before the expensive decoder prefill, using learned latent tokens that the decoder can consume as memory.¹ The paper introduces Latent Context Language Models, or LCLMs, which map long input sequences into shorter sequences of continuous embeddings, then let a decoder reason over those embeddings instead of the original text.

That sounds dangerously close to “summarization with better branding.” It is not. The interesting part is precisely where the branding stops: LCLMs are not rewriting text into shorter text, and they are not pruning a KV cache after full prefill. They are training a smaller encoder and a decoder together so that compressed latent vectors can act as usable context.

For enterprise systems, the result is less “bigger context window, therefore happiness” and more “working memory needs architecture.” Tedious, yes. Also useful. The grown-ups are back in the room.

The core move is compression before decoder prefill

Most long-context inference cost comes from the decoder having to process a long input and materialize the KV cache. KV-cache compression methods attack the cache directly: build or partially build the cache, then evict, mask, compact, or approximate parts of it. That can reduce some memory pressure, but it often arrives late. The model has already paid a large part of the prefill bill.

LCLM changes the timing.

Instead of first pushing all raw tokens through the decoder, the system uses an encoder to compress blocks of input into latent tokens. If the raw input has length $T$ and the compression ratio is $N$, then the compressed latent sequence has roughly:

$$ M \approx \lceil T / N \rceil $$

tokens. The decoder then prefills over $M$ latent tokens, not $T$ raw tokens.
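To make the formula concrete, here is a minimal sketch of the token arithmetic. The function name `latent_length` and the 128,000-token input are illustrative assumptions, not values taken from the paper.

```python
def latent_length(num_tokens: int, ratio: int) -> int:
    """Latent tokens M = ceil(T / N) for T raw tokens at compression ratio N."""
    if num_tokens < 0 or ratio < 1:
        raise ValueError("num_tokens must be >= 0 and ratio must be >= 1")
    # Integer ceiling division avoids floating-point rounding on large inputs.
    return (num_tokens + ratio - 1) // ratio


# Hypothetical 128,000-token input at the three ratios the paper trains.
for ratio in (4, 8, 16):
    print(f"{ratio}x -> {latent_length(128_000, ratio)} latent tokens")
```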

That one shift explains much of the paper’s business relevance. If compression happens before decoder prefill, higher compression ratios can directly reduce decoder-side compute and memory. If compression happens after full prefill, the system may have saved the patient after making them run the marathon.
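A back-of-envelope sketch shows why the timing matters for memory. It uses the standard KV-cache size estimate (keys plus values, per layer, per KV head, per position). The layer count, head count, head width, and 16-bit precision below are illustrative assumptions, not the configuration of the paper's decoder, and the estimate ignores encoder cost and activation memory.

```python
def kv_cache_bytes(seq_len: int, num_layers: int, num_kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """Approximate KV-cache size: keys and values for every layer and position."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


# Hypothetical decoder shape, chosen only to make the arithmetic readable.
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128

raw_tokens = 131_072
ratio = 16
latent_tokens = (raw_tokens + ratio - 1) // ratio  # M = ceil(T / N)

full = kv_cache_bytes(raw_tokens, LAYERS, KV_HEADS, HEAD_DIM)
compressed = kv_cache_bytes(latent_tokens, LAYERS, KV_HEADS, HEAD_DIM)
print(f"prefill over raw tokens:    {full / 2**30:.1f} GiB of KV cache")
print(f"prefill over latent tokens: {compressed / 2**30:.1f} GiB of KV cache")
```

Under these assumptions the cache shrinks in direct proportion to the compression ratio, because the decoder never materializes entries for the raw positions. A method that prunes after full prefill still has to hold the larger cache at its peak.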

The paper trains LCLMs at 4×, 8×, and 16× compression using a 0.6B-parameter encoder and a 4B-parameter decoder: Qwen3-Embedding-0.6B as the encoder and Qwen3-4B-Instruct-2507 as the decoder in the main experiments. The largest training recipe totals about 350B tokens across all stages of the combined encoder-decoder training budget.

This is not a plug-in prompt trick. It is a trained architecture.

The model is not asked to “summarize”; it is trained to remember in latent form

Hard-token compression drops, rewrites, or summarizes text. It is useful, cheap, and often brutally lossy. Once a variable name, legal clause, table cell, or code indentation pattern is paraphrased away, no amount of confident reasoning can recover it. The model can still sound fluent. That is the charming part of the problem.

LCLM uses soft-token compression. The encoder produces continuous latent vectors. These vectors are not human-readable summaries; they are model-facing memory units.

The architecture has four moving parts:

Component	What it does	Why it matters operationally
Encoder	Reads raw input chunks and produces hidden states	Moves long-context processing away from the larger decoder
Pooling operator	Compresses groups of token representations into fewer latent tokens	Controls how much information is squeezed into each memory unit
Adapter	Projects encoder representations into the decoder’s embedding space	Lets a smaller encoder feed a larger decoder without pretending their hidden spaces naturally align
Decoder	Consumes latent tokens as context and generates the answer	Preserves standard language-model decoding behavior
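The data flow through the first three components can be sketched with toy stand-ins. Everything here is an assumption made for illustration: `toy_encoder` replaces a real encoder with deterministic pseudo-random vectors, and `linear_adapter` is a single linear projection standing in for the paper's MLP adapter. Only the shapes are meant to be instructive.

```python
import random

Vector = list[float]


def toy_encoder(token_ids: list[int], dim: int) -> list[Vector]:
    """Stand-in for the encoder: one deterministic pseudo-random state per token."""
    states = []
    for token_id in token_ids:
        rng = random.Random(token_id)
        states.append([rng.uniform(-1.0, 1.0) for _ in range(dim)])
    return states


def mean_pool(states: list[Vector], ratio: int) -> list[Vector]:
    """Average each run of `ratio` consecutive states into one latent vector."""
    pooled = []
    for start in range(0, len(states), ratio):
        group = states[start:start + ratio]
        pooled.append([sum(column) / len(group) for column in zip(*group)])
    return pooled


def linear_adapter(latents: list[Vector], weight: list[Vector]) -> list[Vector]:
    """Project each latent to the decoder width; `weight` has one row per output dim."""
    return [[sum(w * x for w, x in zip(row, latent)) for row in weight]
            for latent in latents]


# Hypothetical sizes: a narrow encoder feeding a wider decoder at 8x compression.
ENCODER_DIM, DECODER_DIM, RATIO = 4, 6, 8
weight_rng = random.Random(0)
adapter_weight = [[weight_rng.uniform(-1.0, 1.0) for _ in range(ENCODER_DIM)]
                  for _ in range(DECODER_DIM)]

token_ids = list(range(40))
hidden_states = toy_encoder(token_ids, ENCODER_DIM)
latents = mean_pool(hidden_states, RATIO)
decoder_inputs = linear_adapter(latents, adapter_weight)
print(len(token_ids), "raw tokens ->", len(decoder_inputs),
      "latent tokens of width", len(decoder_inputs[0]))
```

The decoder would then consume `decoder_inputs` in place of ordinary token embeddings. That step is omitted because it depends on the decoder implementation.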

The mechanism matters because it gives enterprises a possible middle layer between raw document retrieval and full-context brute force. A system can hold a broad compressed view of many documents, logs, files, or prior turns, then expand exact text only when necessary.

That is closer to how real work happens. People skim, locate, inspect, and verify. They do not read a 900-page procurement archive aloud every time someone asks whether a vendor clause changed in May. Usually.

The training recipe is the paper’s quiet contribution

The paper’s headline is the new Pareto frontier. The quieter contribution is the recipe that makes the frontier plausible.

Prior soft-token compression methods often struggled because they either specialized too narrowly or damaged the base model’s general behavior. LCLM tries to preserve broad instruction-following and long-context capability by training the compressor and decoder through staged alignment.

The stages are deliberately conservative:

1. Adapter warmup: freeze encoder and decoder, train only the adapter.
2. Encoder training: unfreeze the encoder, keep the decoder frozen.
3. End-to-end continual pre-training: unfreeze the decoder with a small learning rate.
4. Supervised fine-tuning: train on instruction, reasoning, long-context, and conversation data.
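The stages above can be written down as a freeze schedule. This is a sketch under stated assumptions: the text does not say whether the adapter keeps training after warmup or which modules are trainable during supervised fine-tuning, so the sets below assume the adapter stays trainable throughout and that fine-tuning updates everything. The stage names are illustrative labels.

```python
from dataclasses import dataclass

MODULES = frozenset({"encoder", "adapter", "decoder"})


@dataclass(frozen=True)
class Stage:
    name: str
    trainable: frozenset

    @property
    def frozen(self) -> frozenset:
        """Modules whose parameters receive no gradient updates in this stage."""
        return MODULES - self.trainable


STAGES = [
    Stage("adapter_warmup", frozenset({"adapter"})),
    # Assumption: the adapter keeps training once the encoder is unfrozen.
    Stage("encoder_training", frozenset({"adapter", "encoder"})),
    Stage("end_to_end_continual_pretraining", MODULES),
    # Assumption: supervised fine-tuning updates all three modules.
    Stage("supervised_fine_tuning", MODULES),
]

for stage in STAGES:
    print(f"{stage.name}: train {sorted(stage.trainable)}, freeze {sorted(stage.frozen)}")
```

In a real training loop each stage would toggle gradient tracking on the listed modules and set per-module learning rates, including the small decoder learning rate mentioned for the third stage.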

The paper reports that directly training the full system end-to-end from the beginning underperformed. The reason is intuitive: at the start, the decoder is receiving strange projected embeddings from an encoder it has not learned to trust. If every module moves at once, gradients can become unstable and the decoder’s prior capability degrades. In plain English: do not rewire the brain while asking it to explain tax law.

The data design is also not incidental. The continual pre-training format interleaves compressed and uncompressed spans throughout sequences, rather than compressing only the beginning and training on the rest. This teaches the model to condition on latent context at multiple positions. The authors also add an auxiliary reconstruction task: compress a document, then ask the model to reproduce the original content. That task is not merely decorative. It pressures the latent representation to preserve fine-grained information, which matters for exact retrieval, code, clinical facts, and format-sensitive work.

For business readers, this is the important translation: the compressor is not useful because it is small; it is useful because it is trained to behave like a memory interface.

Architecture search answers the unglamorous questions that decide whether this works

The paper runs a controlled architecture search before scaling the main models. This is not the main business evidence, but it explains why the final design is not arbitrary.

The authors train many from-scratch variants for 38B tokens at 16× compression, then compare pre-training loss. The likely purpose of this section is architectural selection, not a final deployment benchmark. It narrows the design space.

Key findings:

Design choice tested	Likely purpose	Paper’s finding	What it does not prove
Pooling operator: token-based, mean, concatenation	Architecture search	Mean pooling improves over token-based pooling; mean and concatenation are close, with compression-ratio-dependent differences at scale	That one pooling method dominates all tasks and model sizes
Encoder window size	Sensitivity / architecture test	Increasing window size from $N$ to 256 helps; 1024 gives a smaller additional gain and becomes the default	That full-context encoding would not help if cost were irrelevant
Attention mask	Architecture comparison	Causal masking consistently performs better than bidirectional masking in their setup	That causal masking is universally superior for every encoder family
Adapter design	Architecture comparison	A lightweight MLP adapter beats an attention-based adapter while using less compute	That more adapter complexity is never useful
Boundary overlap	Implementation sensitivity	Overlapping neighboring windows does not improve pre-training loss and increases compute	That boundary effects never matter in downstream tasks

The scaled configuration settles on an encoder window size of $W = 1024$, causal masking, and an MLP adapter. Pooling depends somewhat on compression ratio: mean pooling has a small edge at 16×, while concatenation slightly outperforms mean pooling at 4×. Since the main emphasis is high compression, the authors use mean pooling as the default.
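The difference between the two leading pooling operators is easiest to see in the shapes they produce. This sketch assumes that concatenation pooling stacks the $N$ hidden states of a group into one longer vector that the adapter must then project down. That reading is an assumption, not a detail confirmed here. Token-based pooling is omitted.

```python
Vector = list[float]


def pool_mean(group: list[Vector]) -> Vector:
    """Average the group element-wise: output width equals the encoder width."""
    return [sum(column) / len(group) for column in zip(*group)]


def pool_concat(group: list[Vector]) -> Vector:
    """Stack the group end to end: output width is ratio * encoder width."""
    return [value for state in group for value in state]


def pool_sequence(states: list[Vector], ratio: int, pool_fn) -> list[Vector]:
    """Apply a pooling function to consecutive groups of `ratio` states."""
    if len(states) % ratio != 0:
        raise ValueError("this sketch expects len(states) to be a multiple of ratio")
    return [pool_fn(states[i:i + ratio]) for i in range(0, len(states), ratio)]


# Eight toy hidden states of width 2, pooled at a ratio of 4.
states = [[float(t), float(10 * t)] for t in range(8)]
mean_latents = pool_sequence(states, 4, pool_mean)
concat_latents = pool_sequence(states, 4, pool_concat)

print("mean:  ", len(mean_latents), "latents of width", len(mean_latents[0]))
print("concat:", len(concat_latents), "latents of width", len(concat_latents[0]))
```

Both operators emit the same number of latent tokens. Under this reading, concatenation keeps every input value and pushes the squeezing into the adapter, while mean pooling discards within-group order before the adapter sees anything.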

This is a nice example of useful boring science. Enterprise AI fails surprisingly often because teams copy a fashionable architecture without knowing which small knobs mattered. Here, the knobs are at least measured.

The main evidence is a frontier, not a universal win column

The paper’s primary claim is not that LCLM has the highest raw accuracy in every benchmark cell. It does not. The claim is that LCLM improves the tradeoff among accuracy, time-to-first-token, and peak memory.

That distinction matters.

On long-context benchmarks, the paper evaluates RULER, LongBench, and LongHealth. It keeps task instructions uncompressed while compressing the task-specific long context. This is a practical choice: in real systems, instructions and queries often deserve exact preservation, while background context is the larger memory burden.

From the summary table:

Setting	Full Qwen 4B baseline	LCLM at 4×	LCLM at 8×	LCLM at 16×	Interpretation
RULER 4k	94.41	91.76 mean / 92.30 concat	85.42 mean / 87.20 concat	75.06 mean / 74.50 concat	LCLM remains close at 4×, degrades progressively at higher compression
RULER 8k	93.58	91.03 mean / 91.50 concat	84.48 mean / 86.80 concat	70.23 mean / 70.30 concat	The frontier is attractive, but 16× is not free
RULER 16k	93.74	89.96 mean / 90.50 concat	82.47 mean / 84.40 concat	65.91 mean / 64.90 concat	Compression ratio is a real product decision, not a magic slider
GSM8K	93.25	91.05 mean / 89.90 concat	87.26 mean / 87.40 concat	81.05 mean / 78.90 concat	Strong dense-context compression relative to many cache baselines

The GSM8K result is especially informative. The paper uses it as a fine-grained compression test: the whole prompt is compressed, and the input is short but dense. At 16× compression, LCLM mean retains 81.05 exact-match accuracy versus 93.25 for the full baseline. That is not “no loss,” but it is a useful result because many KV-compression baselines collapse on this setting. Compression ratios of 16× and 8× remove 93.75% and 87.5% of input tokens respectively. Asking a model to solve math after that much compression is not gentle.
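The arithmetic behind that paragraph is short enough to check directly. The scores are the GSM8K mean-pooling numbers from the summary table above. The helper names and the "share of baseline accuracy retained" framing are additions made for illustration.

```python
FULL_BASELINE = 93.25
LCLM_MEAN_GSM8K = {4: 91.05, 8: 87.26, 16: 81.05}


def fraction_removed(ratio: int) -> float:
    """Share of input positions that disappear at a given compression ratio."""
    return 1 - 1 / ratio


def retained(score: float, baseline: float = FULL_BASELINE) -> float:
    """Score as a fraction of the uncompressed baseline score."""
    return score / baseline


for ratio, score in LCLM_MEAN_GSM8K.items():
    print(f"{ratio}x: removes {fraction_removed(ratio):.2%} of positions, "
          f"keeps {retained(score):.1%} of baseline accuracy")
```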

Still, the table also prevents overclaiming. KVzip performs extremely well on RULER at 4×, reaching 94.43 on RULER 4k and 93.93 on RULER 16k, close to or slightly above the full baseline in those cells. Query-aware SnapKV is also strong on some LongHealth and LongBench columns. The paper’s argument survives this because the system comparison is not only accuracy. Some KV methods require full prefill first, depend on the query, do not translate cleanly into standard inference-engine memory savings, or impose additional overhead.

So the fair business reading is:

What the paper directly shows	Business meaning	Boundary
LCLMs often sit on a better accuracy-latency-memory frontier	Learned prefill-side compression can make long-context systems more operationally viable	Not always highest accuracy at low compression
Compression time and peak memory scale better at long context lengths	Large document or codebase workflows may become cheaper to run interactively	Measurements are on a single H200 and Hugging Face Transformers, not every production stack
At 512K and 1M contexts, several baselines run out of memory; LCLM continues in tested settings	Pre-decoder compression has a systems advantage for ultra-long contexts	Deployment depends on implementation, batching, and serving infrastructure
LCLM performs well on GSM8K under aggressive compression	Latent memory is not only for sparse retrieval-style tasks	Accuracy still drops versus full context

The phrase “new Pareto frontier” is doing real work here. It does not mean free lunch. It means the menu improved.

The agent scaffold is the most enterprise-shaped part of the paper

The agent section is an exploratory extension, not the main proof of the model family. It is also the section most likely to matter for business software.

The authors build a simple agentic mechanism around compressed context. The input is split into 512-token chunks. Each chunk is compressed and assigned an identifier. The model receives the compressed sequence and an EXPAND(i) tool. When it wants exact detail, it calls the tool and receives the original text for that segment.

This creates a two-layer working-memory pattern:

Compressed global view
        ↓
Model identifies likely relevant segment
        ↓
EXPAND(i)
        ↓
Exact raw text enters working context
        ↓
Model answers with local precision
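The bookkeeping behind that flow is simple, and a minimal sketch makes the contract explicit. The class name `ChunkStore` and the use of whitespace-separated words as stand-ins for tokens are assumptions. The compression step and the model's decision about which identifier to expand are left out.

```python
from dataclasses import dataclass

CHUNK_TOKENS = 512  # chunk size described in the text


@dataclass
class ChunkStore:
    """Keeps the raw text for every chunk so it can be recovered exactly on demand."""
    chunks: dict

    @classmethod
    def from_tokens(cls, tokens: list, chunk_tokens: int = CHUNK_TOKENS) -> "ChunkStore":
        starts = range(0, len(tokens), chunk_tokens)
        return cls({i: tokens[s:s + chunk_tokens] for i, s in enumerate(starts)})

    def expand(self, chunk_id: int) -> str:
        """The EXPAND(i) tool: return the original text of segment i."""
        if chunk_id not in self.chunks:
            raise KeyError(f"unknown chunk id: {chunk_id}")
        return " ".join(self.chunks[chunk_id])


# Toy document of 1,300 "tokens"; a real system would use the model's tokenizer.
tokens = [f"w{i}" for i in range(1300)]
store = ChunkStore.from_tokens(tokens)

print("chunk ids:", sorted(store.chunks))
print("start of chunk 2:", store.expand(2)[:17])
```

The important property is that expansion returns the stored source text rather than anything reconstructed from the latent vectors, so whatever the model quotes after an `EXPAND` call can be traced to a chunk identifier.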

The paper tests this on challenging needle-in-a-haystack tasks from RULER. The agent improves exact-string-match accuracy over raw 16× compressed LCLM and, in some settings, matches uncompressed-context performance.

This matters because enterprise retrieval often fails when the query vocabulary is not obvious. A user says “dashboard login bug,” but the relevant file is an entitlement module. A compliance officer asks about “vendor termination,” but the clause sits under “service discontinuation.” A clinical analyst looks for “adverse event,” but the note says “unexpected response after administration.” Keyword search, embedding retrieval, and rigid metadata filters can all miss the path.

A compressed global view gives the agent a chance to inspect broad context before choosing what to read exactly. That is not a replacement for retrieval. It is a candidate working-memory layer above retrieval and below full-context brute force.

The practical pattern is:

Enterprise workflow	What latent context can provide	Why expansion still matters
Codebase debugging	Broad visibility across files before choosing candidate modules	Exact function bodies and line-level edits require raw text
Legal or contract review	Compressed awareness of many clauses and cross-references	Quotation and compliance decisions require exact wording
Clinical-document QA	Global view across long patient-case context	High-stakes claims require source text verification
Customer-support memory	Compressed continuity across long interaction history	Refunds, commitments, and dates need exact retrieval
Internal research agents	Large archive skimming before evidence selection	Citations, numbers, and claims must be checked

This is the right level of optimism: latent context can help agents decide where to look. It should not be treated as a license to answer from compressed vibes. We have enough of those already.

The real product question is how much detail you can afford to forget

Compression ratio is not a technical afterthought. It is a governance decision.

At 4× compression, LCLM often stays close to the full-context baseline. At 8×, the tradeoff becomes more visible but still strong in several settings. At 16×, the model can still perform impressively, especially compared with many baselines, but the loss is material on some long-context tasks.

A business system should not expose “compression ratio” as a hidden engineer-only default. It should be tied to task class.

Task class	Recommended memory behavior	Reason
Exploratory search, clustering, broad diagnosis	Higher compression may be acceptable	The goal is to find where to inspect, not to finalize exact claims
Drafting from known materials	Moderate compression plus retrieval	The model needs global coherence but should verify key details
Code modification	Compressed repository view plus exact expansion	File-level navigation can be compressed; code edits cannot be guessed
Legal, finance, healthcare decisions	Low compression or mandatory raw evidence expansion	Exact wording, numbers, and provenance matter
Long conversation continuity	Adaptive compression by information density	Most turns are low-value memory; commitments and facts are not
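One way to keep that mapping from becoming a hidden default is to make it explicit configuration. The sketch below is hypothetical throughout: the task-class keys, the `must_expand` flag, and the choice to fall back to the strictest policy for unknown task classes are illustrative design assumptions. The compression level for code modification is a guess, since the table only says "compressed repository view".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryPolicy:
    compression: str   # "low", "moderate", "high", or "adaptive"
    must_expand: bool  # require raw-text expansion before the answer is finalized


POLICIES = {
    "exploratory": MemoryPolicy(compression="high", must_expand=False),
    "drafting": MemoryPolicy(compression="moderate", must_expand=True),
    # Assumption: the table does not name a compression level for code work.
    "code_modification": MemoryPolicy(compression="moderate", must_expand=True),
    "regulated_decision": MemoryPolicy(compression="low", must_expand=True),
    "conversation": MemoryPolicy(compression="adaptive", must_expand=False),
}

STRICTEST = MemoryPolicy(compression="low", must_expand=True)


def policy_for(task_class: str) -> MemoryPolicy:
    """Unknown task classes get the strictest policy instead of a permissive default."""
    return POLICIES.get(task_class, STRICTEST)


for name in ("exploratory", "regulated_decision", "something_new"):
    print(name, "->", policy_for(name))
```

Defaulting to the strict policy costs some compute on unclassified requests. The alternative is discovering in an audit that a contract question was answered at the highest compression setting.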

The paper itself points toward future adaptive compression: allocating more capacity to dense or difficult regions while compressing easier regions more aggressively. That is likely where enterprise systems will need to go. Uniform compression is simple. Real documents are not uniform.

A risk disclosure paragraph in a contract deserves more memory than a greeting email. Shocking, I know.

Where this result applies, and where it does not

The paper directly supports a specific claim: learned encoder-decoder compression, trained end-to-end at scale, can improve the accuracy-latency-memory frontier for long-context inference and can support an agent scaffold that expands raw context on demand.

It does not prove that every enterprise should replace RAG with LCLM. It does not prove that compressed latent tokens preserve every legally or medically relevant detail. It does not prove that a 0.6B encoder and 4B decoder setup will map cleanly to every larger proprietary model, every serving framework, or every hardware profile.

The experiments are also structured around benchmark settings where specific parts of the input are compressed and others remain hard tokens. In practice, deciding what to compress may itself become a system-design challenge. Compress the wrong part, and the model may optimize costs by forgetting the only sentence that mattered. Very efficient. Very modern. Very bad.

The agent expansion result is promising but early. It uses one simple EXPAND tool over fixed-size chunks. Real enterprise agents need policies for when to expand, how many segments to inspect, how to cite raw evidence, how to recover from wrong expansion choices, and how to audit compressed-memory decisions after the fact.

That said, the direction is practical. Enterprises do not merely need larger context windows. They need memory systems with tiers: compressed overview, targeted expansion, exact evidence, and logging.

What Cognaptus would take from this paper

The paper’s business implication is not “use LCLM tomorrow.” The more durable lesson is architectural.

Long-horizon AI systems should separate memory into roles:

Global latent memory for broad situational awareness.
Retrieval and expansion tools for exact evidence.
Verification layers for claims that affect decisions.
Cost-aware routing so the system does not treat every task like a million-token emergency.

That design is especially relevant for enterprise agents that operate across code, documents, tickets, emails, reports, and transaction logs. These systems often fail not because the model is unintelligent, but because the memory layer is primitive: either tiny snippets retrieved by semantic search, or oversized prompts that hope attention behaves like discipline.

LCLM suggests a third option. Compress enough context to give the model a map. Expand enough raw text to keep it honest. Then let verification decide whether the answer deserves to leave the sandbox.

The future of enterprise long-context AI is probably not one giant prompt. It is a memory hierarchy. The model skims, zooms, checks, and acts. In other words, it works less like a chatbot and more like a competent analyst who knows when to stop pretending and open the source file.

A low bar, perhaps. But in enterprise AI, low bars are where many expensive systems currently trip.

Conclusion: cheaper memory is useful only when it remains inspectable

LCLMs are interesting because they move compression to the right place in the inference pipeline: before decoder prefill. That improves the operational tradeoff among speed, memory, and accuracy. The paper also shows that this is not achieved by a single clever pooling trick. It requires staged training, architecture search, reconstruction pressure, and careful evaluation against cache-compression baselines.

For business readers, the strategic point is simple: context compression is not just cost reduction. It is memory design.

If compressed context becomes a black box that confidently loses details, it will create cheaper errors. If it becomes a governed working-memory layer that supports expansion and verification, it can make long-horizon agents more practical.

The difference between those two futures is not model size. It is system architecture. Annoying, but true.

Cognaptus: Automate the Present, Incubate the Future.

¹ Ang Li, Sean McLeish, Haozhe Chen, Nimit Kalra, Zaiqian Chen, Artem Gazizov, Venkata Anoop Suhas Kumar Morisetty, Bhavya Kailkhura, Harshitha Menon, Zhuang Liu, Brian R. Bartoldson, Tom Goldstein, Sanae Lotfi, Micah Goldblum, and Pavel Izmailov, “End-to-End Context Compression at Scale,” arXiv:2606.09659, 2026.
