SWE-Bench

Memory Has to Earn Its Keep

TL;DR for operators: Memory is not valuable because an agent writes something down. That is called logging. Sometimes it is called “reflection,” if the logging has better branding. The paper Enhancing Software Engineering Through Closed-Loop Memory Optimization introduces MemOp, a framework for software-engineering agents that defines memory utility by downstream impact: a memory is useful only if it improves the agent’s later performance on software tasks.[1] The important move is not the existence of Memory.md, nor the idea that past trajectories can be summarized. The important move is the loop: generate memory from an agent trajectory, validate whether that memory improves task performance, reject harmful or redundant memories, and train a memory model using the resulting accepted and rejected examples. ...

Think Longer, Act Smarter: Why Coding Agents Need Behavior-Preserving Reasoning

Software agents fail in a familiar way. They do not always fail because they are stupid. Sometimes they fail because they are busy. They search too widely, inspect too much, edit too early, revise the wrong file, run out of context, and then collapse under the weight of their own half-formed investigation. In enterprise language: they generate activity before they stabilize a diagnosis. We have seen humans do this too, usually in Slack threads with too many tabs open. The machines are catching up nicely. ...

Think Longer, Act Smarter: Why Coding Agents Need Behavior-Preserving Reasoning

A coding agent can fail in two very different ways. One failure is obvious: it does not think enough. It sees an error report, guesses the wrong file, edits too early, and then spends the rest of the trajectory debugging its own mistake. Anyone who has watched an autonomous coding agent wander through a repository has seen this little tragedy. The machine is busy, but not necessarily useful. ...

Think Longer, Act Worse? What M2A Teaches About Reasoning Agents

A coding agent does not fail only because it cannot think. Sometimes it fails because it keeps thinking after it should inspect the repository. Sometimes it writes a plausible explanation before checking the relevant file. Sometimes it burns the context window by wandering through hypotheses, each one almost reasonable, none of them decisive. The result is not stupidity in the familiar sense. It is a coordination failure: the model does not know when to reason, when to call a tool, when to absorb feedback, and when to edit. ...

Many Arms, Fewer Bugs: Why Coding Agents Need to Stop Working Alone

Teams are supposed to divide work. Bad teams divide accountability. Anyone who has managed a complicated project has seen the pattern. One specialist produces an impressive-looking analysis. Another quietly repairs its mistakes. The project succeeds, everyone receives credit, and the least useful participant is invited back for the next assignment. Multi-agent AI systems have inherited this problem with admirable efficiency. ...

When Agents Learn to Test Themselves: TDFlow and the Future of Software Engineering

A bug report is not a specification. A bug report says something is wrong. A test says exactly what must fail. That difference is the centre of TDFlow, a test-driven agentic workflow for repository-scale software repair.[1] The paper’s central move is not to make the coding agent more charismatic, more autonomous, or more burdened with inspirational tool access. Mercifully. It does almost the opposite: it narrows the agent’s world until the task becomes executable. ...

Mask, Don’t Muse: When Simple Memory Beats Fancy Summaries

TL;DR for operators: A coding agent’s memory problem is not philosophical. It is a bill. The paper behind this article compares three ways to manage context in software-engineering agents: keep the full trajectory, summarize old turns with an LLM, or simply mask older environment observations while preserving the agent’s reasoning and actions.[1] Across five SWE-agent configurations on SWE-bench Verified, both context-management strategies usually cut cost sharply versus the Raw Agent. The awkward part is that the simple strategy, Observation Masking, is often just as good as LLM-Summary on solve rate and usually cheaper. ...