The short of it

A new study on SWE-agent working over SWE-bench Verified finds that masking old observations (keeping recent turns verbatim, replacing older tool outputs with a placeholder) often matches or slightly beats prompt-based LLM summarization—and at roughly half the cost. The paper also surfaces a subtle failure mode: summaries can elongate trajectories, encouraging agents to “keep going” when they should stop, diluting efficiency and, at times, performance.

Why this matters for builders

Most production SE agents (debuggers, PR autoresponders, test fixers) rack up spend on two things: tokens and time. Tool logs dominate both. In practice, observation tokens comprise the bulk of an agent’s turn, so trimming them intelligently is the highest‑leverage knob. The results show you might not need fancy, model‑authored summaries; a rolling “mask” window can land on the efficiency frontier (equal or better solve rate at far lower cost) across Qwen3‑Coder 480B, Qwen3‑32B (thinking/non‑thinking), and Gemini 2.5 Flash (thinking/non‑thinking).

What the study actually did (and found)

  • Setup: SWE-agent scaffold on SWE-bench Verified, turn limit 250; compare three strategies: Raw (no memory control), LLM-Summary (OpenHands‑style running summary with tail), and Observation Masking (keep last M turns’ observations verbatim, replace older ones with a short placeholder). Defaults that worked best here: M = 10 for masking; N = 21, M = 10 for summarization.

  • Headline result: Observation masking more than halves cost vs raw and remains competitive or better than LLM summarization on solve rate—e.g., Qwen3‑Coder 480B: 54.8% solve rate at $0.61/instance vs raw 53.4% at $1.29, and summarization 53.8% at $0.64.

  • Efficiency frontier: Scatter plots of solve rate vs instance cost show masking consistently on the Pareto frontier.

  • Trajectory elongation: With summarization, mean turns per instance increase (e.g., Gemini 2.5 Flash: 52 turns w/ summary vs 44 w/ masking; Qwen3‑Coder 480B similarly +15%). Summaries seem to “smooth” failures, prompting more exploration rather than an early exit. That extra wandering eats your budget.

  • Cost anatomy: Summary calls themselves add ~5–7% to per‑instance cost on stronger models; subtract that and masking still holds up thanks to fewer tokens/turns.
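
    Back-of-envelope, taking the ~6% mid-point of that range: $0.64 × (1 − 0.06) ≈ $0.60 per instance for summarization net of its own summary calls, roughly level with masking’s $0.61. Masking stays competitive on the strength of fewer tokens and turns alone, not just by skipping summary calls.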

Practical takeaways (and a simple recipe)

Start simple. If your agent’s context is dominated by tool output (logs, traces, diffs), mask older observations and keep the last ~10 turns verbatim.

A minimal policy that works now (a code sketch follows the list):

  1. Keep system/user prompts, all reason+action text.
  2. For observations older than the last M turns, replace with a short placeholder (e.g., “(omitted previous log)”).
  3. Tune M ≈ 8–12; in this study M=10 was a sweet spot.
  4. Add hard stop rules (max turns, unchanged workspace check, repeating‑error detector) to avoid drift.
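
Here’s a minimal sketch of steps 1–3, assuming an OpenAI-style chat transcript where tool observations arrive as role "tool" messages; the function and placeholder names are mine, not the paper’s:

```python
# Minimal observation-masking sketch (illustrative names throughout).
MASK_PLACEHOLDER = "(omitted previous log)"

def mask_old_observations(messages: list[dict], keep_last_m: int = 10) -> list[dict]:
    """Return a copy of `messages` in which tool observations older than the
    last `keep_last_m` are replaced by a short placeholder. System, user, and
    assistant (reasoning/action) messages pass through untouched."""
    obs_indices = [i for i, m in enumerate(messages) if m.get("role") == "tool"]
    stale = set(obs_indices[:-keep_last_m]) if keep_last_m else set(obs_indices)
    return [
        {**m, "content": MASK_PLACEHOLDER} if i in stale else m
        for i, m in enumerate(messages)
    ]
```

Applied statelessly right before each model call, this leaves the full transcript intact on disk; only the prompt the model sees is masked, so you can retune M mid-run or unmask retroactively.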

When to (maybe) use summarization:

  • You need bounded context for very long runs, and you’ve implemented loop detectors so summaries don’t invite aimless exploration (a minimal detector is sketched after this list).
  • Your scaffold retains very rich, noisy logs where a semantic digest improves reasoning. (A preliminary OpenHands slice hints at scaffold‑specific effects.)
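
The loop detector that the first bullet presupposes can be cheap. Below is a sketch under the assumption that observations arrive as plain strings: hash each one and count repeats in a rolling window (names are illustrative):

```python
import hashlib
from collections import deque

class RepeatDetector:
    """Flags likely loops: the same observation (e.g., an identical stack
    trace) recurring several times within a window of recent turns."""

    def __init__(self, window: int = 8, max_repeats: int = 3):
        self.recent = deque(maxlen=window)  # rolling hashes of recent observations
        self.max_repeats = max_repeats

    def observe(self, observation: str) -> bool:
        digest = hashlib.sha256(observation.encode("utf-8")).hexdigest()
        self.recent.append(digest)
        return self.recent.count(digest) >= self.max_repeats
```

Call observe(tool_output) after every turn; when it returns True, fire a stop rule or, if you must, escalate to summarization.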

A mental model: Compress what’s noisy, keep what’s fresh

Think of an SE agent’s transcript as three layers:

  • Plan & Actions (keep!) – the chain of thought and tool calls anchor continuity.
  • Recent Observations (keep!) – what just happened matters most for next steps.
  • Stale Observations (compress!) – old logs rarely help; they bloat context and hide the signal.

Masking operationalizes this model cheaply. Summaries promise the same—but can accidentally encourage longer trajectories and add API cost. The paper’s data suggests masking hits a robust 80/20 in many SE settings today.

Limits you should know

  • The evidence is within SWE-agent on SWE-bench—both tuned to software tasks with verbose tool output. Generalization to other scaffolds and domains (e.g., web agents with short observations) is an open question.
  • Triggers are heuristics (fixed windows/turns). Smarter triggers, whether learned or signal‑based (detecting loops, regressions, goal completion), could outperform both.

What I’d try next (operator’s playbook)

  • Hybrid policy: Default to masking; escalate to summarization only when a loop detector or context overflow fires (see the sketch after this list).
  • Cheaper summarizer: If you must summarize, use a tiny local model or a distilled module that’s cache‑friendly.
  • Termination critics, not reflection essays: Put your budget into “should I stop?” critics rather than reflective summaries—this addresses elongation directly.
  • Domain‑aware masks: Mask entire file runs or repeating error chunks instead of raw turn count.
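
Composing the two sketches above (mask_old_observations and the loop detector) gives a hedged version of that hybrid policy; summarize_history is a runnable stand-in for whatever summarizer you already have:

```python
def summarize_history(messages: list[dict], keep_tail: int = 10) -> list[dict]:
    """Stand-in summarizer: in production this would be one LLM call digesting
    everything before the tail. Here it collapses the head into a single note
    so the control flow below runs end to end."""
    if len(messages) <= keep_tail + 1:
        return messages
    head, tail = messages[1:-keep_tail], messages[-keep_tail:]
    note = {"role": "user", "content": f"(summary of {len(head)} earlier turns)"}
    return [messages[0], note, *tail]  # keep system prompt, digest the middle

def build_prompt(messages: list[dict], looping: bool, token_count: int,
                 context_budget: int = 100_000, keep_last_m: int = 10) -> list[dict]:
    """Default to cheap masking; escalate to summarization only when the loop
    detector has fired or the prompt would overflow the context budget."""
    if looping or token_count > context_budget:
        return summarize_history(messages, keep_tail=keep_last_m)
    return mask_old_observations(messages, keep_last_m=keep_last_m)
```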

Bottom line: For many SE agents today, masking old logs is the fastest path to higher ROI: fewer tokens, fewer turns, same (or better) solve rate. Start there; add complexity only after your stop rules and loop detectors are rock‑solid.

Cognaptus: Automate the Present, Incubate the Future