Software Engineering

Memory Has to Earn Its Keep

TL;DR for operators: Memory is not valuable because an agent writes something down. That is called logging. Sometimes it is called “reflection,” if the logging has better branding. The paper Enhancing Software Engineering Through Closed-Loop Memory Optimization introduces MemOp, a framework for software-engineering agents that defines memory utility by downstream impact: a memory is useful only if it improves the agent’s later performance on software tasks.[1] The important move is not the existence of Memory.md, nor the idea that past trajectories can be summarized. The important move is the loop: generate memory from an agent trajectory, validate whether that memory improves task performance, reject harmful or redundant memories, and train a memory model using the resulting accepted and rejected examples. ...
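The validate-and-reject step of that loop can be sketched roughly as follows. This is a minimal illustration, not MemOp's actual procedure: `MemoryStore`, `validate_memory`, the `score` callback, and the `min_gain` threshold are all hypothetical names, and exact-match deduplication stands in for whatever redundancy check the paper uses.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class MemoryStore:
    """Hypothetical container for validated memories and rejected candidates."""
    accepted: List[str] = field(default_factory=list)
    rejected: List[str] = field(default_factory=list)


def validate_memory(
    candidate: str,
    store: MemoryStore,
    score: Callable[[List[str]], float],
    min_gain: float = 0.0,
) -> bool:
    """Accept a candidate memory only if it improves the downstream score.

    `score` is an assumed stand-in for "run the agent on software tasks
    with these memories and measure performance".
    """
    # Redundant: an identical memory is already stored (simplified check).
    if candidate in store.accepted:
        store.rejected.append(candidate)
        return False

    baseline = score(store.accepted)
    with_candidate = score(store.accepted + [candidate])

    # Harmful or useless: the score does not rise above the threshold.
    if with_candidate - baseline > min_gain:
        store.accepted.append(candidate)
        return True
    store.rejected.append(candidate)
    return False
```

Under these assumptions, the `accepted` and `rejected` lists are the kind of paired examples the excerpt says are later used to train the memory model.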

The Code Agent Wasn’t Self-Correcting. The Test Harness Was.

TL;DR for operators: Code agents do not become reliable because they are asked politely to “fix the bug.” They become more useful when they are placed inside a loop that can run their output, return structured failure evidence, and decide how many further attempts are worth buying. That is the practical point of Zhang and Kothari’s paper, Unlocking LLM Code Correction with Iterative Feedback Loops.[1] The authors evaluate four LLMs across Python and Java using LeetCode problems, then move from ordinary one-shot performance to an automated correction loop: generate code, execute it, feed back compiler/runtime/testcase information, and repeat up to ten iterations. ...
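A rough sketch of such a correction loop is shown below. It is an illustration under stated assumptions, not the authors' harness: `evaluate`, `correction_loop`, the `generate` callback (standing in for an LLM call), and the convention that candidates define a function named `solve` are all hypothetical. Only the cap of ten iterations comes from the excerpt.

```python
from typing import Callable, List, Optional, Tuple

# Each test case is (positional arguments, expected return value).
TestCase = Tuple[tuple, object]


def evaluate(source: str, tests: List[TestCase]) -> Tuple[bool, str]:
    """Run candidate code in-process and return (passed, feedback).

    A real harness would sandbox execution; this sketch does not.
    """
    namespace: dict = {}
    try:
        exec(compile(source, "<candidate>", "exec"), namespace)
    except SyntaxError as err:
        return False, f"compile error: {err.msg} (line {err.lineno})"
    except Exception as err:
        return False, f"runtime error: {type(err).__name__}: {err}"

    for args, expected in tests:
        try:
            got = namespace["solve"](*args)
        except Exception as err:
            return False, f"runtime error: {type(err).__name__}: {err}"
        if got != expected:
            return False, (
                f"testcase failed: solve{args!r} returned {got!r}, "
                f"expected {expected!r}"
            )
    return True, "all tests passed"


def correction_loop(
    generate: Callable[[Optional[str]], str],
    tests: List[TestCase],
    max_iterations: int = 10,
) -> Tuple[Optional[str], int]:
    """Generate, execute, feed back, and repeat up to max_iterations times.

    Returns (passing source or None, number of attempts used).
    """
    feedback: Optional[str] = None
    for attempt in range(1, max_iterations + 1):
        source = generate(feedback)  # the model sees the previous failure
        passed, feedback = evaluate(source, tests)
        if passed:
            return source, attempt
    return None, max_iterations
```

The point of the design is that the loop, not the model, owns the stopping rule: the attempt budget is a parameter the operator sets, and the feedback string is the only thing carried between attempts.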

Compile Once, Train Later: Offline RL Moves Code-Model Verification Upstream

Code assistants have a small accounting problem. Not the glamorous kind involving model capability, agentic workflows, or yet another dashboard with a glowing neural blob. The ordinary kind: every time a model proposes code during reinforcement learning, someone, or something, has to run it, test it, score it, and feed that score back into training. ...

Memory Lane Meets Mainframe: Why Coding Agents Need Better Memories, Not Bigger Egos

Memory is a familiar word. That is exactly why it can mislead us. When people hear that coding agents need “memory,” the first image is often a giant scrapbook: past prompts, previous patches, command logs, successful code snippets, failed attempts, and whatever else the agent has dragged behind it like a very confident intern with a messy backpack. More memory sounds safer. More traces sound more useful. More remembered work sounds like less repeated work. ...

Thinking Fast, Remembering Slow: Why SWE-AGILE Fixes the Memory Crisis of AI Agents

Memory sounds like a storage problem. Give the agent a longer context window, let it keep the full conversation, and the work should become easier. This is the kind of solution that looks obvious until it meets a real software repository, a failing test suite, a long terminal log, and a model that now has to find one important clue buried somewhere in the middle of its own autobiography. ...

Proofs at Scale: When 30,000 Agents Replace the Referee

Mathematics has a management problem. That sounds less romantic than saying it has a reasoning problem, but romance is not usually where bottlenecks hide. A proof can be brilliant, a referee can be diligent, and still the verification system can fail for the boring reason that nobody has enough time to check everything line by line. The paper Automatic Textbook Formalization takes that bottleneck seriously and then does something unusually concrete: it reports a multi-agent system that formalized a 500-plus-page graduate algebraic combinatorics textbook into Lean, with all 340 target definitions and theorems proved, in about one week.[1] ...

Double Helix, Double Checks: Why Agentic AI Needs Governance Before It Writes Your Code

Code is where AI confidence goes to become expensive. A chatbot can produce a plausible function in ten seconds. An agent can now plan a refactor, split files, update interfaces, generate documentation, and politely leave behind a system that fails because one event payload forgot a required field. Very efficient. Very modern. Very annoying. ...

Agents That Hire Themselves: Why OpenSage Signals the End of Hand-Crafted AI Workflows

Workflow diagrams age badly. A process that looked clean in January usually becomes a small archaeological site by March: one more exception, one more conditional branch, one more “temporary” manual approval that survives longer than the intern who added it. This is how many AI-agent projects quietly become ordinary software projects with a chatbot sitting on top, smiling politely while humans keep repairing the plumbing. ...

Breaking Things on Purpose: How CLI-Gym Teaches AI to Fix the Real World

Broken environments are where coding agents stop looking magical. A model can write a neat Python function, patch a repository, and explain the bug with courtroom confidence. Then it enters a terminal, meets a missing shared library, a corrupted dependency, a bad environment variable, or a filesystem permission issue, and suddenly the “autonomous engineer” starts behaving like an intern trapped inside conda. Not a bad intern, perhaps. Just one who keeps running the same command and hoping Linux will become more emotionally cooperative. ...

When Agents Believe Their Own Hype: The Hidden Cost of Agentic Overconfidence

Code review has a comforting ritual. A developer submits a patch. A reviewer inspects it. The reviewer says it looks good. Everyone feels slightly better, because at least someone checked. In AI-agent workflows, this ritual becomes even more tempting: let one agent write the patch, let another agent review it, then ask the reviewer how confident it is. ...