Mask, Don’t Muse: When Simple Memory Beats Fancy Summaries

TL;DR for operators

A coding agent’s memory problem is not philosophical. It is a bill.

The paper behind this article compares three ways to manage context in software-engineering agents: keep the full trajectory, summarize old turns with an LLM, or simply mask older environment observations while preserving the agent’s reasoning and actions.¹ Across five SWE-agent configurations on SWE-bench Verified, both context-management strategies usually cut cost sharply versus the Raw Agent. The awkward part is that the simple strategy, Observation Masking, is often just as good as LLM-Summary on solve rate and usually cheaper.

For operators, the lesson is not “never summarize.” That would be a lovely overreaction, but still an overreaction. The lesson is: before paying for a summarizer, measure whether your agent is mostly drowning in old tool outputs. If it is, deleting or masking stale observations may beat generating expensive little memoirs about them.

The paper’s strongest business result is the cost-performance comparison. With Qwen3-Coder 480B, Observation Masking raised solve rate from 53.4% to 54.8% while reducing average instance cost from $1.29 to $0.61. LLM-Summary reached 53.8% at $0.64. The $0.03 per instance difference is small only if one has never multiplied anything.

The deeper mechanism is more interesting. Summaries can add direct API cost, reduce cache reuse, and make agents run longer. In the paper’s trajectory analysis, LLM-Summary lengthened mean trajectories for Qwen3-Coder 480B and Gemini 2.5 Flash. In other words, summarization can make the agent feel informed enough to keep going, which is not always the same as making it useful enough to finish.

The practical operating policy is simple: instrument token sources, benchmark masking first, tune the masking window per scaffold, monitor turn counts as well as token counts, and reserve LLM summarization for unusually long trajectories or domains where older details remain semantically critical.

The memory problem starts with tool output, not with wisdom

Picture the usual coding agent run. It reads a file. It runs tests. It inspects a traceback. It opens another file. It lists a directory. It runs tests again, because apparently debugging without ritual repetition would offend the machine spirits.

Every step produces an observation. That observation gets appended to the agent’s history. The next model call then carries more context than the previous one. At first, this looks reasonable: more evidence should mean better decisions. In a software-engineering agent, however, “more evidence” often means “thousands of stale tokens from logs, file dumps, and command outputs that were useful six turns ago and now mostly exist to invoice someone.”

The paper’s motivating observation is that software-engineering trajectories are especially skewed. In the authors’ preliminary SWE-agent experiments on SWE-bench Lite-50, environment observations made up around 84% of an average turn. That matters because the obvious memory problem is not necessarily the agent’s reasoning trace. It is the bulky residue of tool use.

This distinction is the paper’s first useful act of hygiene. It separates the agent’s decision trail from the environment’s output trail. Those are not the same asset. The reasoning and actions explain what the agent tried. The old observations may contain critical facts, but they also contain logs, duplicated context, outdated file contents, and noise. Treating both as sacred is how an agent becomes expensive while still managing to be confused. Quite the premium experience.
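A rough way to check whether your own traces show the same skew is to measure how much of each run comes from reasoning, actions, and observations. The sketch below is illustrative only: the turn schema (a dict with `reasoning`, `action`, and `observation` fields) is an assumption, and whitespace splitting is a crude stand-in for a real tokenizer.

```python
from collections import Counter

# Assumed trace schema: each turn is a dict with these text fields.
FIELDS = ("reasoning", "action", "observation")


def rough_token_count(text: str) -> int:
    """Whitespace split as a crude stand-in for a real tokenizer."""
    return len(text.split())


def token_share(turns: list[dict[str, str]]) -> dict[str, float]:
    """Fraction of total (approximate) tokens contributed by each field."""
    totals = Counter()
    for turn in turns:
        for field in FIELDS:
            totals[field] += rough_token_count(turn.get(field, ""))
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {field: 0.0 for field in FIELDS}
    return {field: totals[field] / grand_total for field in FIELDS}


# Toy trace, invented for illustration.
example_trace = [
    {"reasoning": "check failing test", "action": "pytest -x", "observation": "E " * 40},
    {"reasoning": "open the module", "action": "cat app.py", "observation": "line " * 50},
]
print(token_share(example_trace))
```

In this toy trace, observations account for 90% of the counted tokens. If a real trace looks similar, masking is worth testing before anything more elaborate.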

Four memory policies, four different operating assumptions

The paper is best read as a comparison among memory policies, not as a generic paper about “context management.” Each policy makes a different bet about what old context is worth.

Strategy	What it does	Operating assumption	Main operational risk
Raw Agent	Keeps the full trajectory in context.	More history is safer.	Cost grows quickly; old material may bury useful signals.
LLM-Summary	Uses a summarizer model to condense older turns into a running summary while preserving recent turns.	Old context is useful, but should be compressed semantically.	Summary calls cost money, reduce cache reuse, and may alter agent behaviour.
Observation Masking	Keeps reasoning and actions, but replaces observations older than a rolling window with placeholders.	Recent observations plus the action/reasoning trace are often enough.	The masked observation may contain a detail the agent later needs.
Hybrid	Masks early and delays summarization until trajectories become long.	Cheap omission should handle ordinary runs; summaries are a fallback for exceptional length.	Poorly chosen thresholds can compound overhead.

The paper’s main SWE-agent experiments used Observation Masking with a rolling observation window of $M = 10$. Its LLM-Summary configuration summarized $N = 21$ turns at a time while retaining the last $M = 10$ turns unaltered. Those details matter because “summarization” is not magic dust. It is a scheduled operation with thresholds, tail length, cache behaviour, and failure modes.
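To show what "a scheduled operation" means in practice, here is a minimal sketch of a summary schedule using the $N = 21$ and $M = 10$ values quoted above. It is not the paper's implementation. The trigger condition (summarize once $N + M$ turns have accumulated), the function names, and the stand-in summarizer are all assumptions made for illustration.

```python
from typing import Callable

N_SUMMARIZE = 21  # turns folded into the summary per summarizer call
M_TAIL = 10       # most recent turns always kept verbatim

Summarizer = Callable[[str, list[str]], str]


def maybe_summarize(
    summary: str,
    turns: list[str],
    summarizer: Summarizer,
    n: int = N_SUMMARIZE,
    m: int = M_TAIL,
) -> tuple[str, list[str]]:
    """Fold the oldest n turns into the running summary once n + m turns pile up.

    The trigger condition (len(turns) >= n + m) is an assumption of this sketch.
    """
    while len(turns) >= n + m:
        summary = summarizer(summary, turns[:n])
        turns = turns[n:]
    return summary, turns


def fake_summarizer(previous: str, chunk: list[str]) -> str:
    """Stand-in for an LLM call: just records how many turns were condensed."""
    return f"{previous}[{len(chunk)} turns condensed]"


summary, kept = maybe_summarize("", [f"turn {i}" for i in range(31)], fake_summarizer)
print(summary, len(kept))
```

With 31 turns, the first 21 are condensed and the last 10 stay verbatim. In a real system each pass through the loop is a paid model call over text the cache has not seen before.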

This is where the article’s comparison frame matters. Raw Agent is uncontrolled spending. LLM-Summary is sophisticated compression. Observation Masking is the unfashionable baseline. Hybrid is the operational compromise. The research question is not which one sounds smarter. The research question is which one buys the most solved issues per dollar.

Cruel, but fair.

Raw Agent is the cleanest baseline and the worst budget habit

The Raw Agent keeps the full trajectory. In principle, that is the least lossy option. In practice, the paper finds it is consistently the most expensive strategy in the main experiments.

This matters because Raw Agent is not a straw man. Many agent prototypes drift into Raw Agent behaviour by default: concatenate history, hope the context window can take it, and call that “memory.” Modern long-context models make this temptation worse. A larger window can hide a bad memory policy for a while, in the same way a larger warehouse can hide bad inventory management. The mess is still there; now it has a loading dock.

The results show why context management becomes an economic necessity rather than a polish item. In the Qwen3-Coder 480B setup, Raw Agent cost $1.29 per instance with a 53.4% solve rate. Observation Masking cost $0.61 and solved 54.8%. LLM-Summary cost $0.64 and solved 53.8%. The agent did not become worse by remembering less. It became cheaper, and in this case slightly better.

That is the uncomfortable business lesson: the full transcript can be a liability. It costs money to process, it can bury relevant material, and it may encourage the model to attend to old noise. In agent operations, memory should be treated as a working set, not as an archive.
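The "solved issues per dollar" framing can be made concrete with the Qwen3-Coder 480B figures quoted above. The sketch below divides solve rate by mean instance cost; the function name and data layout are illustrative, not taken from the paper.

```python
# Solve rate (fraction) and mean cost per instance (USD) for Qwen3-Coder 480B,
# as quoted in this article.
POLICIES = {
    "Raw Agent": (0.534, 1.29),
    "Observation Masking": (0.548, 0.61),
    "LLM-Summary": (0.538, 0.64),
}


def solved_per_dollar(solve_rate: float, cost_per_instance: float) -> float:
    """Expected number of solved instances bought by one dollar."""
    return solve_rate / cost_per_instance


ranking = sorted(
    ((name, solved_per_dollar(rate, cost)) for name, (rate, cost) in POLICIES.items()),
    key=lambda item: item[1],
    reverse=True,
)

for name, value in ranking:
    print(f"{name}: {value:.2f} solved instances per dollar")
```

On these numbers, masking buys roughly 0.90 solved instances per dollar, LLM-Summary about 0.84, and the Raw Agent about 0.41.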

LLM summaries are not free just because they are shorter

LLM-Summary has intuitive appeal. It feels safer than deletion. It promises to preserve the “meaning” of old turns while shrinking the token load. This is the kind of solution that looks excellent in architecture diagrams and slightly less excellent in invoices.

The paper identifies two cost channels, plus one behavioural side effect, that make summarization less obviously attractive.

First, summaries require additional model calls. In the paper’s cost breakdown, direct summary-generation cost accounted for up to 7.2% of total instance cost, depending on the model configuration. That is not catastrophic by itself. But it is also not nothing, especially when the summarizer does not consistently improve solve rate.

Second, summarization has poor cache characteristics. Each summary call processes a unique sequence of turns, so cache reuse is limited. In API environments where cache hits are much cheaper than cache misses, this becomes a real design constraint. A shorter context is not automatically a cheaper context if producing it requires fresh, uncached model work.
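A small worked example shows why a shorter call is not automatically a cheaper one. The prices and token counts below are hypothetical, chosen only to illustrate the effect of a large gap between cached and uncached input pricing. They are not any provider's actual rates.

```python
UNCACHED_USD_PER_MTOK = 1.00  # hypothetical price per million uncached input tokens
CACHED_USD_PER_MTOK = 0.10    # hypothetical price per million cached input tokens


def input_cost(cached_tokens: int, uncached_tokens: int) -> float:
    """Input-side cost in USD for one model call."""
    return (
        cached_tokens * CACHED_USD_PER_MTOK + uncached_tokens * UNCACHED_USD_PER_MTOK
    ) / 1_000_000


# A long agent call whose prefix is mostly a cache hit.
agent_call = input_cost(cached_tokens=18_000, uncached_tokens=2_000)
# A shorter summarizer call over a sequence the cache has never seen.
summary_call = input_cost(cached_tokens=0, uncached_tokens=12_000)

print(f"agent call: ${agent_call:.4f}, summary call: ${summary_call:.4f}")
```

Under these made-up prices, the 12,000-token summarizer call costs roughly three times as much as the 20,000-token agent call, because none of it is served from cache.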

The third effect is behavioural. The paper’s trajectory-length analysis suggests that summaries can make agents continue longer. For Gemini 2.5 Flash, LLM-Summary produced a mean trajectory length of 52 turns, compared with 44 turns for Observation Masking and 50 for Raw Agent. For Qwen3-Coder 480B, LLM-Summary increased mean trajectory length by 15% versus Raw Agent and 13% versus Observation Masking.

This is the kind of result that should make agent builders pause. The summary is not merely a storage format. It becomes part of the agent’s state. It may reinforce the sense that the agent is making progress, soften failure signals, or provide enough rephrased context to keep the loop alive. The system then saves tokens per turn but spends more turns. Congratulations, the agent found a different way to charge you.
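The trade-off is simple arithmetic. The sketch below uses hypothetical numbers and a deliberately simplified model with a flat context size per call, whereas real contexts grow over a trajectory. It shows how a 10% cut in tokens per call can be wiped out by 15% more turns.

```python
def task_tokens(turns: int, tokens_per_call: int) -> int:
    """Total input tokens for a task, assuming a flat per-call context size."""
    return turns * tokens_per_call


# Hypothetical numbers chosen only to illustrate the trade-off.
baseline = task_tokens(turns=40, tokens_per_call=20_000)
compressed = task_tokens(turns=46, tokens_per_call=18_000)  # 10% fewer tokens, 15% more turns

print(baseline, compressed, compressed > baseline)  # 800000 828000 True
```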

Observation Masking wins by targeting the fattest part of the trace

Observation Masking is almost rudely simple. It preserves the agent’s reasoning and actions but replaces older observations with a placeholder once they fall outside the recent window. The agent can still see what it did. It just cannot reread every old command output forever.

This works because software-engineering agents often operate on verbose observations. File contents, logs, tracebacks, test output, and directory listings are large. They are also frequently stale. Once the agent has moved on, many old observations no longer need to remain verbatim in the working context.
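The mechanism fits in a few lines. This is a minimal sketch, not the paper's code: the `Turn` structure, the placeholder text, and the assumption of exactly one observation per turn are all illustrative choices.

```python
from dataclasses import dataclass, replace

PLACEHOLDER = "[observation omitted]"  # hypothetical placeholder text


@dataclass(frozen=True)
class Turn:
    reasoning: str
    action: str
    observation: str


def mask_observations(turns: list[Turn], window: int = 10) -> list[Turn]:
    """Keep the last `window` observations verbatim; replace older ones.

    Reasoning and actions are preserved for every turn.
    """
    cutoff = max(len(turns) - window, 0)
    return [
        replace(turn, observation=PLACEHOLDER) if index < cutoff else turn
        for index, turn in enumerate(turns)
    ]


history = [Turn(f"think {i}", f"cmd {i}", f"output {i}") for i in range(12)]
masked = mask_observations(history, window=10)
print(masked[0].observation, "|", masked[2].observation)
```

With 12 turns and a window of 10, only the two oldest observations are replaced. Every action and reasoning step stays visible, so the agent can still see what it tried.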

The main table tells the story.

Model configuration	Raw Agent (solve rate / cost per instance)	Observation Masking (solve rate / cost per instance)	LLM-Summary (solve rate / cost per instance)	Operational read
Qwen3-32B	17.0% / $1.12	15.0% / $0.55	16.0% / $0.50	Both cut cost sharply; solve rate slightly lower.
Qwen3-32B thinking	23.0% / $0.51	24.6% / $0.46	24.8% / $0.51	Shorter trajectories limit the value of context management.
Qwen3-Coder 480B	53.4% / $1.29	54.8% / $0.61	53.8% / $0.64	Masking sits on the best frontier: cheaper and slightly stronger.
Gemini 2.5 Flash	32.8% / $0.41	35.6% / $0.18	36.0% / $0.24	Summary has a tiny solve-rate edge; masking is much cheaper.
Gemini 2.5 Flash thinking	40.4% / $0.56	36.4% / $0.24	31.4% / $0.25	Both save cost; masking loses less performance than summary.

The table should not be read as “masking always improves accuracy.” It does not. Gemini 2.5 Flash thinking is a clear counterexample: both context-management strategies reduce cost but also reduce solve rate, with LLM-Summary taking the larger hit. That is precisely why context policy has to be benchmarked, not asserted.

The more defensible claim is narrower and more useful: across these software-engineering agent settings, LLM-Summary does not consistently or significantly outperform the simple masking baseline. In four of five main setups, Observation Masking has the lowest instance cost; Qwen3-32B, where LLM-Summary is $0.05 cheaper, is the exception. On solve rate, masking beats LLM-Summary in two setups and trails by one percentage point or less in the other three. The fancy option does not earn its complexity by default.

For business teams, that is the relevant bar. A more complex memory system should have to beat a cheap baseline, not merely sound more intelligent in a roadmap meeting.
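The cost counts in this section can be checked directly against the main table. The sketch below copies the table's figures into a dictionary (the key names are illustrative) and reports the cheapest policy per configuration, along with the solve-rate gap between masking and summary.

```python
# (solve rate %, cost USD per instance) per policy, copied from the table above.
RESULTS = {
    "Qwen3-32B": {"raw": (17.0, 1.12), "masking": (15.0, 0.55), "summary": (16.0, 0.50)},
    "Qwen3-32B thinking": {"raw": (23.0, 0.51), "masking": (24.6, 0.46), "summary": (24.8, 0.51)},
    "Qwen3-Coder 480B": {"raw": (53.4, 1.29), "masking": (54.8, 0.61), "summary": (53.8, 0.64)},
    "Gemini 2.5 Flash": {"raw": (32.8, 0.41), "masking": (35.6, 0.18), "summary": (36.0, 0.24)},
    "Gemini 2.5 Flash thinking": {"raw": (40.4, 0.56), "masking": (36.4, 0.24), "summary": (31.4, 0.25)},
}


def cheapest_policy(row: dict[str, tuple[float, float]]) -> str:
    """Name of the policy with the lowest cost per instance in one table row."""
    return min(row, key=lambda policy: row[policy][1])


cheapest = {model: cheapest_policy(row) for model, row in RESULTS.items()}
masking_wins = sum(policy == "masking" for policy in cheapest.values())

# Solve-rate gap in percentage points: masking minus summary.
gaps = {
    model: round(row["masking"][0] - row["summary"][0], 1)
    for model, row in RESULTS.items()
}

print(cheapest)
print("masking cheapest in", masking_wins, "of", len(RESULTS))
print(gaps)
```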

The appendix tests configuration, not a second thesis

The paper’s additional studies are important, but they should be interpreted correctly. They are mostly configuration and robustness checks, not a separate grand theory of agent memory.

The authors tune the Observation Masking window and find $M = 10$ best for their main SWE-agent setup. They also test LLM-Summary configurations and find that keeping some recent tail turns matters, and that summarizing more turns at once performs better than an even split between summary window and tail. This is implementation evidence: the strategies are sensitive to parameters, and defaults from one scaffold should not be blindly transplanted into another.

The critic-enhanced summary experiment is especially useful. A natural response to the paper’s main result is: “Fine, but what if the summary is smarter?” The authors test a summarizer that also provides execution-free feedback, effectively adding reflective critique to the compressed context. It does not improve solve rate over standard LLM-Summary and worsens trajectory elongation. The critic gives the agent more avenues to explore, which means more turns, more cost, and no corresponding performance gain.

That is not a proof that reflective memory is doomed. It is a warning against a familiar engineering reflex: when a simple system underperforms, add more model-generated interpretation. Sometimes that helps. Sometimes it just gives the agent more sophisticated ways to wander around.

The OpenHands probe says “promising,” not “universal”

The paper also probes generality using OpenHands on a 50-instance slice of SWE-bench Verified with Gemini 2.5 Flash. This is not the main evidence base. It is a scaffold-transfer check.

The result is encouraging but conditional. If the authors reuse the optimal SWE-agent masking window in OpenHands, performance degrades. After tuning the window, Observation Masking again matches LLM-Summary on cost and solve rate. The paper suggests a plausible reason: scaffolds differ in what they retain. SWE-agent elides certain retry turns caused by syntax errors; OpenHands retains such turns. That means OpenHands may need a larger observation window to preserve enough useful recent state.

This is a practical detail with large consequences. “Use Observation Masking” is not a one-line implementation instruction. The right masking window depends on the scaffold’s conversation structure, tool-output formatting, retry handling, and how much state the agent can reconstruct from actions alone.

The business translation is straightforward: do not copy a memory policy from a benchmark and assume it will survive your production scaffold. Start with the paper’s result as a hypothesis, then tune it against your own traces.
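Tuning against your own traces can start with a simple rule: run a few window sizes, then pick the cheapest one whose solve rate stays close to the best. The sketch below assumes you already have per-window measurements. The selection rule, the tolerance, and the example numbers are all hypothetical and not drawn from the paper.

```python
from typing import NamedTuple


class WindowResult(NamedTuple):
    window: int
    solve_rate: float         # fraction of instances solved
    cost_per_instance: float  # USD


def pick_window(
    results: list[WindowResult], max_solve_rate_drop: float = 0.01
) -> WindowResult:
    """Cheapest window whose solve rate is within tolerance of the best one."""
    best = max(result.solve_rate for result in results)
    eligible = [r for r in results if r.solve_rate >= best - max_solve_rate_drop]
    return min(eligible, key=lambda r: r.cost_per_instance)


# Made-up measurements purely to exercise the function; not from the paper.
measured = [
    WindowResult(5, 0.30, 0.15),
    WindowResult(10, 0.34, 0.18),
    WindowResult(20, 0.36, 0.21),
    WindowResult(40, 0.36, 0.30),
]
print(pick_window(measured).window)
```

The tolerance is a business decision, not a technical one: it encodes how much solve rate you are willing to give up for a cheaper run.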

The hybrid result is the operating model hiding in the paper

The paper’s hybrid strategy is the most operationally mature idea: mask first, summarize later.

The motivation is simple. Observation Masking delivers immediate savings because it starts reducing old observation load quickly. LLM-Summary can bound context growth more completely on very long trajectories, but it has warm-up cost, direct summary cost, and trajectory-length risk. The hybrid uses masking during the accumulation phase and delays summarization until trajectories are long enough to justify it.

In the paper’s hybrid experiment, the authors test Qwen3-Coder 480B on SWE-bench Verified-50. They set $N = 43$ and $M = W = 10$, delaying summarization while using Observation Masking in the meantime. They also pass the unmasked context when summarizing, avoiding a clash between the two mechanisms.
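A minimal sketch of the "mask first, summarize later" idea follows. It is not the paper's implementation. The roles assigned to the parameters are assumptions (`n` as turns per summary, `m` as the verbatim tail, `w` as the masking window), as are the trigger condition, the turn representation, and the placeholder text. The one detail taken from the description above is that the summarizer receives unmasked turns.

```python
from typing import Callable

PLACEHOLDER = "[observation omitted]"  # hypothetical placeholder text
Turn = tuple[str, str]  # (reasoning_and_action, observation)


def build_context(
    summary: str,
    turns: list[Turn],
    summarizer: Callable[[str, list[Turn]], str],
    n: int = 43,
    m: int = 10,
    w: int = 10,
) -> tuple[str, list[Turn], list[Turn]]:
    """Return (summary, stored_turns, visible_turns) for the next model call.

    Assumed roles: n = turns per summary, m = verbatim tail, w = masking window.
    """
    # Summarize late: only once n + m unmasked turns have accumulated.
    while len(turns) >= n + m:
        summary = summarizer(summary, turns[:n])  # summarizer sees unmasked turns
        turns = turns[n:]
    # Mask early: hide observations older than the last w turns.
    cutoff = max(len(turns) - w, 0)
    visible = [
        (step, PLACEHOLDER) if index < cutoff else (step, observation)
        for index, (step, observation) in enumerate(turns)
    ]
    return summary, turns, visible


turns = [(f"step {i}", f"output {i}") for i in range(30)]
summary, stored, visible = build_context("", turns, lambda prev, chunk: prev + " ...")
print(sum(obs == PLACEHOLDER for _, obs in visible))  # 20 masked, no summary yet
```

With 30 turns, the summarizer is never called and masking alone trims the context. The summary cost only appears on runs long enough to reach the threshold.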

The result: compared with Observation Masking alone, the hybrid reduces cost by 7%; compared with LLM-Summary alone, it reduces cost by 11%. It also improves downstream task performance by 2.6 percentage points versus Raw Agent in that setup. The authors estimate savings of $20 versus Observation Masking and $35 versus LLM-Summary on the full SWE-bench Verified benchmark.

This is promising, but it is an extension, not the central claim. It is tested on a 50-instance subset with one strong model configuration because full experiments would be costly. The important lesson is architectural: summarization is more defensible as a last-resort bounding mechanism than as the default memory policy for every ordinary trajectory.
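The full-benchmark savings estimates can be roughly sanity-checked. SWE-bench Verified has 500 instances. The sketch below applies the 7% and 11% reductions to the main-table Qwen3-Coder 480B costs, which is an approximation: the hybrid percentages were measured on a 50-instance subset, so the authors' own baseline costs may differ.

```python
INSTANCES = 500  # size of SWE-bench Verified

# Main-table mean costs for Qwen3-Coder 480B (USD per instance). Using them as
# the baseline for the hybrid percentages is an assumption of this rough check.
MASKING_COST = 0.61
SUMMARY_COST = 0.64

saving_vs_masking = 0.07 * MASKING_COST * INSTANCES
saving_vs_summary = 0.11 * SUMMARY_COST * INSTANCES

print(f"vs masking: ${saving_vs_masking:.2f}, vs summary: ${saving_vs_summary:.2f}")
```

This lands at about $21 and $35, in line with the reported $20 and $35.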

The paper also tests a naive hybrid that simply reuses the individual strategy hyperparameters. That version degrades cost efficiency because overheads compound. This is a useful little slap on the wrist: combining two reasonable techniques does not automatically create one reasonable system.

What the paper shows, what operators can infer, and what remains uncertain

The business value of this paper is not that it gives every team a plug-and-play memory policy. It gives teams a better evaluation discipline.

Layer	Claim	Status
Direct paper result	In SWE-agent on SWE-bench Verified, context management usually reduces cost sharply versus Raw Agent.	Main evidence.
Direct paper result	Observation Masking is usually cheaper than LLM-Summary and often matches or beats its solve rate.	Main evidence across five model configurations.
Direct paper result	LLM-Summary adds direct summary-generation cost and can elongate trajectories.	Mechanism and diagnostic evidence.
Direct paper result	A tuned hybrid can reduce cost further on Qwen3-Coder 480B over SWE-bench Verified-50.	Exploratory extension.
Cognaptus inference	Teams should benchmark masking before adopting summary-heavy context systems.	Strong practical inference for coding agents.
Cognaptus inference	Memory policy should be measured by solved tasks per dollar, not by compression elegance.	Strong operational inference.
Uncertain	Whether masking dominates in domains with compact, high-value, or legally critical observations.	Not established by this paper.
Uncertain	Whether adaptive relevance-based memory beats both fixed masking and fixed summarization.	Open design space.

This separation matters. It prevents the article from becoming an anti-summary manifesto, which would be dramatic and therefore probably wrong. LLM summaries can still be valuable where old observations contain compact but crucial state, where compliance requires durable evidence, or where tasks involve long-horizon dependencies that are not recoverable from recent context.

But in verbose tool-use domains, the burden of proof shifts. Summaries are no longer the obvious premium choice. They are an intervention that must justify its own cost and behavioural side effects.

A Monday checklist for agent teams

For teams deploying coding agents, the paper suggests a practical evaluation sequence.

First, instrument the trace. Measure how much of each turn is system prompt, user instruction, reasoning, action, observation, summary, and retry overhead. If observations dominate, the agent is a candidate for masking.

Second, run Raw Agent only as a baseline. Keeping full history may help diagnose failures, but it should not be the default production policy unless the economics are acceptable. Usually, they will not be. Funny how that works.

Third, test Observation Masking before LLM-Summary. Preserve the reasoning and action trail. Mask observations outside a rolling window. Tune the window on your scaffold rather than copying the paper’s value blindly.

Fourth, compare policies on solved tasks per dollar. A small solve-rate increase may not justify a large cost increase. A small per-instance saving may matter at scale. This is not an aesthetic choice; it is unit economics wearing a lab coat.

Fifth, monitor trajectory length. If a memory policy reduces tokens per call but increases calls per task, your cost model is incomplete. Track turns, termination reasons, retries, and whether the agent is looping under a more polished summary of its own confusion.

Sixth, reserve summarization for long trajectories or high-dependency workflows. The hybrid result points toward a sensible policy: mask early, summarize late, and make the summarizer earn its keep.

Finally, treat memory policy as part of agent behaviour. It is not a passive storage layer. It changes what the model sees, what it believes is still relevant, and when it decides it has enough information to stop.

Where the result should not be overused

The paper is deliberately scoped to software-engineering agents. That scope matters.

Software-engineering tool outputs are unusually verbose. A test log, source file, or directory listing can overwhelm the context with material that is redundant, stale, or recoverable. In domains where observations are short and dense, masking may remove too much. Think legal review, medical triage, financial audit trails, or regulated operational workflows. In those settings, old observations may be evidence, not clutter.

The strategies tested also rely on fixed heuristic triggers. Observation Masking uses a fixed rolling window. LLM-Summary uses a fixed turn schedule. Neither strategy adapts to semantic relevance, file modification state, subgoal boundaries, or legal retention requirements. A more adaptive memory system could outperform both, though the paper makes a strong case that it should be compared against masking rather than allowed to declare victory against Raw Agent alone.

The OpenHands result is initial evidence, not broad scaffold generalization. The hybrid result is promising, but narrower still. It is a 50-instance test with one model configuration. It should inform product experiments, not become procurement scripture.

Cost interpretation also depends on pricing. The paper reports Gemini costs from the API and computes Qwen costs post hoc using official Alibaba API pricing while self-hosting the models. Different cache pricing, inference infrastructure, and batching regimes can change the dollar amounts. They do not erase the mechanism: summary calls cost something, cache behaviour matters, and trajectory length is part of the bill.

The real lesson: memory is a control surface

The lazy takeaway is that summaries are overrated. The better takeaway is that agent memory is a control surface.

Raw context maximizes retention but burns money. LLM-Summary compresses old context but introduces its own cost and behavioural incentives. Observation Masking works because it asks a sharper question: which part of the trajectory is actually bloating the context? In software-engineering agents, the answer is often old environment observations. Mask those first.

The paper is valuable because it punctures a very modern assumption: if an LLM can produce a semantic summary, that summary must be the more intelligent memory. Sometimes intelligence is not the extra model call. Sometimes intelligence is noticing that the agent does not need to reread yesterday’s test log for the fifteenth time.

For operators, the instruction is blunt: do not buy complexity before benchmarking the cheap baseline. Mask first. Summarize when the trajectory proves it needs a summary. Watch the turn count. Measure solved work per dollar.

The agent does not need a memoir. It needs a working memory.

Cognaptus: Automate the Present, Incubate the Future.

¹ Tobias Lindenbauer, Igor Slinko, Ludwig Felder, Egor Bogomolov, and Yaroslav Zharov, “The Complexity Trap: Simple Observation Masking Is as Efficient as LLM Summarization for Agent Context Management,” arXiv:2508.21433, 2025, https://arxiv.org/abs/2508.21433.
