Opening — Why this matters now
For all the breathless demos, AI coding agents still collapse embarrassingly often when faced with real software engineering: large repositories, ambiguous issues, long horizons, and no hand-holding. Benchmarks like SWE-bench-Live have made this painfully explicit. Models that look heroic on curated tasks suddenly forget how to navigate a codebase without spiraling into context soup.
The uncomfortable truth: most agents are still trying to do everything at once.
This paper proposes a fix that is less glamorous than a bigger model—but far more effective: stop forcing a single agent to carry the entire cognitive load. Instead, discover who should do what automatically.
Background — Monoliths don’t scale, neither do prompts
Current SWE agents largely follow a monolithic design. One agent:
- Reads the issue
- Locates files
- Edits code
- Runs tests
- Submits patches
All inside one reasoning chain.
That design is elegant in slides and disastrous in practice. Long chains accumulate irrelevant context, reinforce spurious correlations, and overfit to training distributions. The result is brittle behavior on newer, out-of-distribution issues—exactly what SWE-bench-Live was designed to expose.
Humans don’t work this way. Neither does hierarchical reinforcement learning. Both rely on temporal abstraction: delegate subtasks, reduce decision horizons, and recombine reusable skills.
Prior multi-agent attempts tried to hard-code this decomposition. Unsurprisingly, human intuition about “roles” often misaligned with how LLMs actually behave.
So the real question becomes: can we let the system discover its own hierarchy?
Analysis — BOAD in plain terms
The paper introduces BOAD (Bandit Optimization for Agent Design), reframing multi-agent design as a search and credit assignment problem rather than a workflow engineering exercise.
At its core:
- Each sub-agent (e.g., one for issue analysis, code navigation, or bug reproduction) is treated as an arm in a multi-armed bandit.
- An orchestrator samples a small team of sub-agents.
- The system solves real SWE tasks.
- Each sub-agent receives individual credit based on whether it actually helped—judged retrospectively.
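To make the shape of this loop concrete, here is a minimal Python sketch. The names (`SubAgent`, `sample_team`, `solve`, `judge_helpfulness`) are illustrative assumptions, not the paper's actual API; the point is the sample-solve-credit cycle, not the implementation details.

```python
import random


class SubAgent:
    """One candidate sub-agent, tracked as a single bandit arm (illustrative)."""

    def __init__(self, name):
        self.name = name
        self.pulls = 0         # times this arm was included in a sampled team
        self.reward_sum = 0.0  # accumulated hindsight-helpfulness scores


def sample_team(archive, team_size=2):
    # Placeholder: uniform sampling. BOAD uses a UCB-style rule (sketched further below).
    return random.sample(archive, team_size)


def run_episode(archive, task, solve, judge_helpfulness):
    """One sample-solve-credit cycle: pick a team, attempt the task, assign per-agent credit."""
    team = sample_team(archive)
    trace = solve(task, team)                    # orchestrator delegates the SWE task to the team
    for agent in team:
        score = judge_helpfulness(agent, trace)  # retrospective credit in [0, 1]
        agent.pulls += 1
        agent.reward_sum += score
```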
This avoids two classic traps:
- Combinatorial explosion — searching over agent sets is exponential; searching over agents is linear.
- Free-riding — success/failure alone cannot tell you which agent mattered.
Instead of binary rewards, BOAD uses hindsight helpfulness: an LLM judge evaluates whether a sub-agent’s contribution moved the solution forward, even if the final patch failed.
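A hedged sketch of what that retrospective judgment could look like. The prompt wording, the `call_llm` helper, and the `trace.outputs` / `trace.outcome` fields are assumptions for illustration, not BOAD's actual judge.

```python
def judge_helpfulness(agent, trace):
    """Ask an LLM judge, after the fact, whether this sub-agent's output advanced the fix."""
    prompt = (
        f"Sub-agent role: {agent.name}\n"
        f"Its output during this episode:\n{trace.outputs[agent.name]}\n"
        f"Final patch and test result:\n{trace.outcome}\n"
        "Did this output move the solution forward, even if the final patch failed?\n"
        "Reply with a single score between 0 and 1."
    )
    # call_llm is a stand-in for whatever chat-completion client is available
    return float(call_llm(prompt).strip())
```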
The bandit logic (UCB) then balances exploration of new sub-agents with exploitation of proven ones. A controlled archive expansion mechanism ensures the system doesn’t stagnate.
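For readers who want the selection rule spelled out, here is a sketch using the standard UCB1 formula over the per-agent statistics from the earlier sketch; the exploration constant `c` and the exact bookkeeping are assumptions, not necessarily the paper's variant.

```python
import math


def ucb_score(agent, total_pulls, c=1.4):
    """Standard UCB1: mean observed helpfulness plus an exploration bonus for rarely tried agents."""
    if agent.pulls == 0:
        return float("inf")  # always try an untested sub-agent at least once
    mean_reward = agent.reward_sum / agent.pulls
    exploration = c * math.sqrt(math.log(total_pulls) / agent.pulls)
    return mean_reward + exploration


def select_team(archive, team_size=2):
    """Pick the highest-scoring arms; replaces the uniform sampling in the earlier sketch."""
    total_pulls = sum(a.pulls for a in archive) or 1
    ranked = sorted(archive, key=lambda a: ucb_score(a, total_pulls), reverse=True)
    return ranked[:team_size]
```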
In short: BOAD doesn’t guess good architectures. It discovers them under budget constraints.
Findings — What actually worked
The results are quietly devastating for monolithic agents.
Performance gains
Using a 36B-parameter open model:
| Setting | SWE-bench Verified | SWE-bench Live |
|---|---|---|
| Single-agent SWE-agent | ~50% | ~12% |
| Manual multi-agent | Worse than single-agent | Slightly better than single-agent |
| BOAD-discovered hierarchy | 53% | 20% |
On SWE-bench-Live, this ranked second overall, beating larger systems built on GPT-4 and Claude.
Token efficiency (the underrated win)
Despite added coordination, BOAD:
- Reduced max input context by up to 25%
- Kept total token usage flat—or lower
Hierarchies didn’t bloat reasoning. They compressed it.
The surprising lesson: fewer agents are better
Ablations showed peak performance with exactly two sub-agents.
Too few: no specialization. Too many: communication overhead and error propagation.
This is not swarm intelligence. It’s selective delegation.
Implications — What this changes for AI builders
BOAD’s real contribution isn’t just better scores. It reframes how we should think about agentic systems:
- **Architecture matters more than scale.** Bigger models didn't win. Better decomposition did.
- **Manual workflows are a dead end.** The most effective sub-agents were not the ones humans designed.
- **Credit assignment is the hidden bottleneck.** Without hindsight evaluation, multi-agent systems optimize noise.
- **Generalization comes from abstraction, not memorization.** Sub-agents that localize issues or analyze structure transfer across tasks and models.
The failure modes are equally instructive: when sub-agent outputs are wrong, orchestrators currently trust them too much. Lightweight verification layers are an obvious next step.
Conclusion — Coordination beats cognition
BOAD demonstrates something the field has quietly resisted admitting: the problem with coding agents isn’t that they can’t reason—it’s that they reason about too much at once.
Hierarchical, automatically discovered agent teams don’t just solve more bugs. They do so with less context, less overfitting, and better generalization. That’s not a prompt trick. That’s an architectural shift.
Expect this pattern to show up far beyond software engineering.
Cognaptus: Automate the Present, Incubate the Future.