Many Arms, Fewer Bugs: Why Coding Agents Need to Stop Working Alone

Teams are supposed to divide work. Bad teams divide accountability.

Anyone who has managed a complicated project has seen the pattern. One specialist produces an impressive-looking analysis. Another quietly repairs its mistakes. The project succeeds, everyone receives credit, and the least useful participant is invited back for the next assignment.

Multi-agent AI systems have inherited this problem with admirable efficiency.

It is tempting to assume that a coding agent becomes more capable whenever another specialist is added: one agent interprets the issue, one searches the repository, one edits the code, one runs tests, and perhaps one supervises the supervisors. The diagram looks reassuringly organized. Unfortunately, software does not care about the diagram.

The paper BOAD: Discovering Hierarchical Software Engineering Agents via Bandit Optimization asks a more useful question: instead of manually deciding which agents should collaborate, can a system discover which sub-agents actually help?¹

Its answer is not “build a bigger swarm.” The strongest configuration contains exactly two discovered sub-agents: one analyzes the issue, and the other navigates the repository. A customized orchestrator then handles the remaining work.

That small hierarchy lifts a 36-billion-parameter model from 12.3% to 20.0% of issues resolved on SWE-bench Live while using fewer tokens on that benchmark. Adding a third, fourth, or fifth sub-agent makes the system worse.

The lesson is less theatrical than autonomous software teams negotiating around a virtual conference table. It is also more useful: agent architecture should be treated as a portfolio-selection and credit-assignment problem, not as an organizational-chart exercise.

Single-Agent Coding Compresses the Entire Project Into One Memory

A conventional software-engineering agent receives an issue description and works through the repository using tools. It searches files, reads code, edits implementations, runs commands, interprets failures, and eventually submits a patch.

All these activities occur within one growing reasoning history.

The convenience is obvious. There are no handoffs, no coordination protocol, and no uncertainty about who is responsible. The same agent sees everything.

It also remembers everything.

By the time the agent begins editing code, its context may contain the original issue, several abandoned hypotheses, long file listings, search results, command outputs, test logs, and earlier reasoning about files that turned out to be irrelevant. The information needed to locate a bug is not necessarily the information needed to repair it, yet the single-agent design carries both forward.

The paper’s argument is that this long, mixed context creates two related problems.

First, irrelevant information consumes context capacity and complicates decision-making. Second, the agent can learn accidental relationships between earlier observations and later actions. A reasoning path that worked for one repository may become a misleading template for a new issue.

Hierarchical delegation offers a cleaner alternative. A sub-agent receives only the context needed for a particular task, completes that task, and returns a compressed result. The orchestrator reasons over the result rather than retaining every low-level action used to produce it.
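The delegation pattern described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `Orchestrator`, `toy_navigator`, the issue text, and the file path are all hypothetical names invented for the example. The point it shows is that the sub-agent's exploratory history stays local, and only a compressed summary reaches the orchestrator.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Orchestrator:
    """Keeps only compressed findings, never a sub-agent's raw working history."""
    issue: str
    findings: list[str] = field(default_factory=list)

    def delegate(self, task: str, subagent: Callable[[str], str]) -> str:
        # The sub-agent receives a task-specific input, not the orchestrator's history.
        summary = subagent(task)
        self.findings.append(summary)
        return summary


def toy_navigator(task: str) -> str:
    """Hypothetical sub-agent: explores privately, returns a one-line summary."""
    scratch = [f"searched for '{task}'", "opened 14 files", "discarded 13 of them"]
    # The scratch history is local and is dropped when the function returns.
    return f"relevant file: src/parser.py ({len(scratch)} exploration steps hidden)"


orch = Orchestrator(issue="Parser crashes on empty input")
orch.delegate("locate empty-input handling", toy_navigator)
print(orch.findings)
```

In a real system the sub-agent would be a separate model call with its own prompt and tool access. The structural property is the same: what the orchestrator carries forward is the return value, not the trajectory that produced it.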

The useful comparison is therefore not simply one agent versus several agents. It is:

Single-agent design	Hierarchical design
One reasoning history accumulates the full workflow	Different tasks receive separate, limited contexts
Low-level exploration remains visible during later decisions	Sub-agents return compressed findings
No handoff errors	Handoffs can introduce errors
Minimal coordination overhead	Coordination must justify its cost
Easy to build	Difficult to design well

Hierarchy removes some cognitive burden, but it creates a new design problem: deciding what should be delegated, to whom, and under what instructions.

Manually inventing roles is the obvious solution. The paper shows why obvious solutions remain such a reliable source of disappointment.

Manual Specialists Look Sensible but Behave Unpredictably

A human software team might separate work into localization, reproduction, editing, and testing. Several earlier multi-agent coding systems followed similar role divisions.

The roles are intuitively reasonable. That does not mean they align with how language models produce useful work.

In BOAD’s experiments, adding manually designed sub-agents to the Seed-OSS-36B-Instruct baseline produced mixed results. On SWE-bench Live, the resolved rate increased from 12.3% to 14.0%. On SWE-bench Verified, it fell from 49.8% to 47.4%.

Manual specialization was not entirely useless. It was simply unreliable.

The problem is that a role description does more than allocate responsibility. It changes what information the sub-agent receives, what tools it uses, how much work it attempts, what it returns, and how the orchestrator interprets that return. A role that appears neatly bounded to a human designer may be too vague for the model, duplicate work already performed by the orchestrator, or produce outputs that are difficult to integrate.

BOAD replaces role intuition with measured contribution.

Rather than searching directly for an entire team configuration, it treats each candidate sub-agent as an arm in a multi-armed bandit. Candidate agents are repeatedly tested in different teams. Agents that appear promising receive more evaluation opportunities, while uncertain candidates continue to receive enough trials to determine whether they are useful.

This changes the unit of search.

A direct search over teams must evaluate combinations: agent A with B and C, agent A with D and E, agent B with C and F, and so forth. The number of possible bundles quickly becomes impractical because every evaluation requires running costly software-engineering tasks.
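To see how quickly the bundle count grows, the short calculation below counts unordered teams of three drawn from an archive of candidates. The archive sizes are illustrative assumptions, not figures from the paper.

```python
from math import comb


def team_count(candidates: int, team_size: int) -> int:
    """Number of distinct unordered teams of a given size."""
    return comb(candidates, team_size)


# Hypothetical archive sizes, chosen only to show the growth rate.
for n in (10, 20, 40):
    print(n, "candidates ->", team_count(n, 3), "possible three-agent teams")
```

Quadrupling the archive from 10 to 40 candidates raises the number of three-agent teams from 120 to 9,880, and each of those would need its own costly evaluation runs.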

BOAD instead maintains estimates for individual sub-agents. Because the same sub-agent appears in multiple teams, each trial provides information that can be reused across possible team configurations.

The bandit mechanism uses an Upper Confidence Bound strategy. In plain terms, each candidate receives a score combining:

how helpful it has appeared so far; and
how uncertain that estimate remains.

High-performing candidates are exploited. Under-tested candidates receive an exploration bonus. The archive is also expanded gradually so that early candidates do not permanently define the available design space.
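The scoring idea can be sketched with the textbook UCB1 rule: an observed mean plus a bonus that shrinks as a candidate accumulates trials. This is an assumption made for illustration; the paper's exact formula, exploration constant, and archive-expansion schedule may differ. `Arm`, `select`, `update`, the constant `c=1.4`, and the trial counts are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Arm:
    name: str
    pulls: int = 0    # how many times this sub-agent has been evaluated
    helpful: int = 0  # how many of those evaluations were judged helpful

    def ucb(self, total_pulls: int, c: float = 1.4) -> float:
        if self.pulls == 0:
            return math.inf  # untested candidates are tried first
        mean = self.helpful / self.pulls
        bonus = c * math.sqrt(math.log(total_pulls) / self.pulls)
        return mean + bonus


def select(arms: list[Arm], k: int) -> list[Arm]:
    """Pick the k candidates with the highest upper confidence bound."""
    total = max(1, sum(a.pulls for a in arms))
    return sorted(arms, key=lambda a: a.ucb(total), reverse=True)[:k]


def update(arm: Arm, was_helpful: bool) -> None:
    """Record one evaluation of a sub-agent."""
    arm.pulls += 1
    arm.helpful += int(was_helpful)


# Hypothetical archive state; the third entry models a newly added candidate.
archive = [Arm("issue_analyzer", 20, 19), Arm("code_fixer", 20, 7), Arm("new_candidate")]
team = select(archive, k=2)
print([a.name for a in team])  # ['new_candidate', 'issue_analyzer']
```

The untested candidate is selected ahead of a well-measured weak one, which is the exploration bonus at work. Once it has a few trials, its score falls back toward its observed helpfulness.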

The method is not searching for the prettiest workflow. It is allocating a limited experimentation budget among competing specialist designs.

Team Success Is the Wrong Measure of Individual Value

Searching efficiently is only half the problem. BOAD must also determine whether a sub-agent deserves credit.

Suppose a team resolves an issue. One sub-agent correctly identifies the relevant files. Another returns irrelevant analysis. The orchestrator ignores the bad analysis, implements the fix, and passes the tests.

If every participating agent receives the team’s success score, both sub-agents appear equally valuable.

The unhelpful agent is now a successful free-rider.

Conversely, imagine that a sub-agent correctly locates the faulty code, but the orchestrator later introduces a syntax error. The final patch fails. A team-level success metric would assign no credit to the useful localization work.

BOAD addresses this with hindsight helpfulness. After a trajectory is completed, an LLM judge reviews each participating sub-agent’s contribution and assigns a binary label indicating whether that contribution helped the orchestrator make progress.

The distinction is consequential:

Credit method	What receives credit	Main failure
Final success rate	Every agent present in a successful run	Rewards free-riders and ignores useful work in failed runs
Hindsight helpfulness	Agents judged to have contributed useful intermediate work	Depends on the reliability of the LLM judge
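The two credit methods can be compared on a toy log. The runs, agent names, and labels below are invented for illustration; in BOAD the per-agent labels come from an LLM judge rather than being hand-written.

```python
from collections import defaultdict

# Each hypothetical run: (team resolved the issue?, {sub-agent: judged helpful?})
runs = [
    (True,  {"navigator": True, "rambler": False}),
    (True,  {"navigator": True, "rambler": False}),
    (False, {"navigator": True, "rambler": False}),  # orchestrator broke the patch
]


def team_success_credit(runs):
    """Every participant inherits the team's outcome."""
    wins, seen = defaultdict(int), defaultdict(int)
    for resolved, labels in runs:
        for agent in labels:
            seen[agent] += 1
            wins[agent] += int(resolved)
    return {a: wins[a] / seen[a] for a in seen}


def hindsight_credit(runs):
    """Each participant is scored on its own judged contribution."""
    helped, seen = defaultdict(int), defaultdict(int)
    for _, labels in runs:
        for agent, was_helpful in labels.items():
            seen[agent] += 1
            helped[agent] += int(was_helpful)
    return {a: helped[a] / seen[a] for a in seen}


print(team_success_credit(runs))
print(hindsight_credit(runs))
```

Under team success, both agents score two out of three and are indistinguishable. Under hindsight helpfulness, the navigator scores 1.0 and the free-rider scores 0.0, including credit for the failed run where the localization was still correct.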

The ablation makes the difference visible. On SWE-bench Live, selecting the top two sub-agents by team success rate produced a 15.3% resolution rate. Selecting them by hindsight helpfulness produced 20.0%.

With three sub-agents, success-rate selection achieved only 11.3%, compared with 16.3% using helpfulness.

This is not a decorative scoring change. It determines which capabilities the optimization process preserves.

For businesses building agent systems, the implication extends beyond coding. If an agent participates in a successful workflow, that does not establish that the agent added value. Evaluation must examine marginal contribution, not attendance.

Otherwise, the system will optimize the AI equivalent of meeting participation.

Evolution Searches for Teams; BOAD Reuses Evidence About Specialists

The paper also compares BOAD with an evolutionary-search baseline.

The evolutionary approach treats an orchestrator and three sub-agents as one bundled design. At each iteration, it generates a new bundle, evaluates it, and uses previous results to inform later bundles.

This is a reasonable method when complete designs are relatively cheap to generate and test. Software-engineering agents are not cheap to test. Each evaluation involves long tool-using trajectories against real repositories.

Under the same number of generated SWE-bench patches, the evolutionary design reached 17.0% on SWE-bench Live. BOAD reached 20.0%.

The cost difference is equally instructive. According to the paper’s appendix, each evolutionary iteration required approximately $2.33 in Claude API calls, compared with $0.96 for a BOAD iteration. Evolution repeatedly generates whole teams; BOAD can reuse previously discovered sub-agents.

Search approach	Object being optimized	SWE-bench Live	Reported Claude API cost per iteration
Evolutionary baseline	Complete orchestrator-and-agent bundle	17.0%	$2.33
BOAD	Reusable individual sub-agents, later composed	20.0%	$0.96

The broader operational principle is straightforward. When evaluations are expensive, reusable evidence matters.

A company testing customer-service agents, document-processing agents, or research assistants may not be able to evaluate hundreds of complete workflows. It can still maintain a library of candidate capabilities, measure their contribution across representative tasks, and allocate further testing toward the most promising ones.

That does not make every workflow a textbook bandit problem. It does suggest that repeatedly rebuilding and retesting entire agent teams is an expensive way to forget what has already been learned.

The Main Result Is Better Generalization, Not Merely a Better Prompt

BOAD is evaluated on two versions of SWE-bench.

SWE-bench Verified contains 500 curated GitHub issues. SWE-bench Live contains 300 issues collected more recently from active repositories and is intended to provide a harder, less contamination-prone test of generalization.

Using Seed-OSS-36B-Instruct within the SWE-agent scaffold, the results are:

System	SWE-bench Verified resolved	SWE-bench Live resolved
Default single-agent SWE-agent	49.8%	12.3%
SWE-agent with manual sub-agents	47.4%	14.0%
SWE-agent with evolutionary-search agents	46.0%	17.0%
SWE-agent with BOAD	53.2%	20.0%

The Live improvement is the more important result. Moving from 12.3% to 20.0% represents a 7.7-percentage-point increase, or roughly a 63% relative improvement.
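The distinction between percentage points and relative improvement is easy to blur, so the arithmetic is worth making explicit. The helper name `gains` is hypothetical; the inputs are the baseline and BOAD rows from the results table.

```python
def gains(before: float, after: float) -> tuple[float, float]:
    """Return (percentage-point gain, relative gain in percent)."""
    points = after - before
    return round(points, 1), round(100 * points / before, 1)


print("Live:    ", gains(12.3, 20.0))  # (7.7, 62.6)
print("Verified:", gains(49.8, 53.2))  # (3.4, 6.8)
```

The same calculation shows why the Live result carries more weight: the Verified gain is 3.4 points on a much higher baseline, which is under 7% in relative terms.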

On Verified, the increase from 49.8% to 53.2% is smaller but still positive. The result remains 53.1% after excluding the 12 Verified issues used during agent design.

At the time reported by the paper, the BOAD system ranked second on the SWE-bench Live leaderboard, despite using a 36B execution model. It outperformed several larger-model and popular-scaffold combinations listed in the paper, although the leaderboard comparisons mix different models and scaffolds and should not be interpreted as a clean test of model size alone.

The authors also test whether the improvement could be explained by prompt optimization rather than hierarchy. A version that optimizes the main SWE-agent prompt without adding sub-agents reaches 16.3% on Live. BOAD reaches 20.0%.

That ablation serves a specific purpose: it rejects the simpler explanation that Claude, which generates the candidate designs, merely wrote a better prompt for the existing agent. It does not prove that prompt quality is unimportant. BOAD itself refines prompts and generates a customized orchestrator plan. The evidence shows that prompt improvement alone does not account for the full gain.

Two Specialists Beat One Agent, Three Agents, Four Agents, and Five Agents

The paper’s most memorable result is also the one most likely to be misread.

The best team uses exactly two sub-agents.

Number of selected sub-agents	SWE-bench Live resolved
1	16.3%
2	20.0%
3	16.3%
4	16.7%
5	13.7%

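One sub-agent provides too little specialization. Two provide the best result. Every larger team scores lower, though not monotonically: four sub-agents (16.7%) edge out three (16.3%) before five drops to 13.7%. The likely explanation is that communication and coordination costs accumulate faster than the added specialization pays off.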

The winning pair is not responsible for every stage of software development.

The first, issue_analyzer, structures the issue description. It extracts requirements, technical details, reproduction criteria, success conditions, and an investigation plan.

The second, code_navigator, explores the repository. It identifies relevant files, functions, classes, dependencies, relationships, and code patterns.

After receiving those outputs, the orchestrator performs the remaining code-editing and validation work.

This division reveals what delegation is actually buying. The sub-agents do not replace the orchestrator’s strongest capabilities. They reduce ambiguity before the orchestrator acts.

The appendix reinforces this interpretation. After 100 discovery iterations, issue_analyzer has a hindsight helpfulness rate of 0.982, while code_navigator reaches 0.933. An issue_reproducer is also relatively helpful at 0.768.

By contrast, several downstream or narrowly specialized agents perform poorly:

Discovered sub-agent	Hindsight helpfulness
`issue_analyzer`	0.982
`code_navigator`	0.933
`issue_reproducer`	0.768
`precision_editor`	0.642
`code_fixer`	0.361
`test_analyzer`	0.083
`multi_file_coordinator`	0.042
`config_manager`	0.000

The pattern is not that analysis is intellectually superior to editing. It is that early-stage analytical work has broad usefulness across many issues. Clarifying the problem and finding the relevant code helps regardless of what happens later.

A test-analysis agent, by contrast, becomes useful only after several prerequisites have succeeded. A configuration specialist matters only when the issue involves configuration. A code-fixing sub-agent may duplicate something the orchestrator can already do adequately once the correct files and requirements are known.

Useful agent roles are therefore not simply specialized. They are specialized at high-leverage bottlenecks.
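Selecting a team from measured helpfulness is a ranking step. The sketch below uses the helpfulness rates reported in the table above; the function name `top_k` is hypothetical, and the paper's actual selection procedure may involve more than a sort.

```python
# Hindsight helpfulness rates as reported in the article's table.
helpfulness = {
    "issue_analyzer": 0.982,
    "code_navigator": 0.933,
    "issue_reproducer": 0.768,
    "precision_editor": 0.642,
    "code_fixer": 0.361,
    "test_analyzer": 0.083,
    "multi_file_coordinator": 0.042,
    "config_manager": 0.000,
}


def top_k(scores: dict[str, float], k: int) -> list[str]:
    """Return the k sub-agents with the highest helpfulness."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


print(top_k(helpfulness, 2))  # ['issue_analyzer', 'code_navigator']
```

Ranking alone does not say how many agents to keep. That is what the team-size experiment answers, and it is why a third agent with a respectable 0.768 still does not earn a place.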

The Orchestrator Must Learn the Team, Not Merely Receive More Tools

Discovering good sub-agents is insufficient if the orchestrator does not know how to use them.

BOAD represents sub-agents as tools. Their documentation explains what they do, what input they require, what they return, and whether they modify the repository. During a warm-up stage, these descriptions and prompts are refined so the orchestrator can invoke the agents correctly.
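One way to make that documentation explicit is a small record with one field per question the orchestrator needs answered. This is a sketch under assumptions: `SubAgentTool`, its field names, and the description text are invented for illustration and are not BOAD's actual tool schema. Marking `code_navigator` as read-only is also an assumption based on its exploratory role.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubAgentTool:
    name: str
    description: str      # what it does
    required_input: str   # what the orchestrator must pass
    returns: str          # what comes back
    modifies_repo: bool   # whether calling it has side effects

    def render(self) -> str:
        """Format the spec as documentation shown to the orchestrator."""
        side_effects = "modifies the repository" if self.modifies_repo else "read-only"
        return (
            f"{self.name}: {self.description}\n"
            f"  input: {self.required_input}\n"
            f"  returns: {self.returns}\n"
            f"  side effects: {side_effects}"
        )


navigator = SubAgentTool(
    name="code_navigator",
    description="Explores the repository to find code relevant to the issue.",
    required_input="A summary of the issue and any suspected modules.",
    returns="Relevant files, functions, classes, and their relationships.",
    modifies_repo=False,
)
print(navigator.render())
```

Forcing every field to be filled in is the useful part. A sub-agent whose required input cannot be stated in one sentence is a likely source of incomplete handoffs.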

The system also generates a customized orchestrator plan that explicitly incorporates the selected sub-agents.

This customization raises SWE-bench Live performance from 16.7% to 20.0%, using the same top-two sub-agent set.

That result is an ablation of coordination, not an independent claim that planning prompts are universally worth 3.3 percentage points. Its purpose is to show that useful specialists do not automatically produce a useful team. The orchestrator must understand when to delegate, what context to pass, and how to continue after receiving the result.

This distinction matters in enterprise agent systems, where adding a tool is often treated as equivalent to adding a capability.

A tool that exists but is poorly described may never be called. A tool with vague input requirements may receive incomplete context. A useful output may still be wasted if the orchestrator does not know how to incorporate it.

Agent integration therefore requires at least three aligned components:

Capability: the sub-agent performs a useful task.
Interface: its inputs, outputs, and side effects are explicit.
Coordination: the orchestrator has a concrete plan for invoking and using it.

Most agent demos concentrate on the first component because it is the most visible. Production failures tend to enjoy the other two.

Hierarchy Reduces Context Without Increasing Total Token Use

Multi-agent systems are often assumed to be more expensive because every handoff requires additional inference.

BOAD does introduce communication overhead. Yet its discovered hierarchy reduces the amount of irrelevant context carried through the workflow.

On SWE-bench Live, BOAD reduces average total token usage by 23.8% and average maximum input length by 25.0% relative to the default SWE-agent. On Verified, total token usage rises slightly by 0.7%, while maximum input length falls by 11.6%.

Token measure	Verified change with BOAD	Live change with BOAD
Total tokens	+0.7%	-23.8%
Maximum input length	-11.6%	-25.0%

This is an important counterweight to the usual assumption that delegation necessarily multiplies costs.

The agents do perform additional calls. But each sub-agent works with a narrower context, and the orchestrator receives summarized findings instead of retaining the complete exploratory history.

The practical value is not merely lower token expenditure. Shorter contexts can reduce latency, lower the probability of distraction by irrelevant observations, and make reasoning behavior easier to inspect.

The result does not establish that hierarchical agents are always cheaper. BOAD’s Verified total tokens are essentially flat rather than lower, and different tasks may impose heavier communication costs. It does show that additional agents and additional context are not the same thing.

A well-designed hierarchy can add inference steps while reducing informational clutter.

The Ablations Explain the Architecture; They Are Not Separate Theses

BOAD includes several experiments that answer narrower design questions. Their purpose is to identify which components contribute to the main result.

Test	Likely purpose	Result on SWE-bench Live	What it supports
Prompt optimization without sub-agents	Ablation	16.3% versus BOAD’s 20.0%	Hierarchy adds value beyond improving the main prompt
One to five selected sub-agents	Sensitivity analysis	Two agents perform best	Team size creates a specialization–coordination trade-off
Generic versus customized orchestrator	Ablation	16.7% versus 20.0%	Coordination must adapt to the selected specialists
Fixed versus expanding archive	Ablation	17.0% versus 20.0%	Continuing to introduce candidate roles improves discovery
Success-rate versus helpfulness selection	Ablation	Helpfulness wins consistently	Individual contribution is a better selection signal than team outcome
Transfer to Claude 3.7 Sonnet	Robustness/extension	13.7% to 16.3%	Discovered roles transfer partially across models
Design sets of 6, 12, and 24 issues	Sensitivity test	21%, 20%, and 19%	Results are not highly sensitive to the tested design-set sizes

The design-set-size experiment deserves careful interpretation. Using 6, 12, or 24 design issues produces similar Live results. This suggests that BOAD can identify useful sub-agent patterns without an enormous design set.

It does not establish that six examples are universally sufficient. The tasks still come from SWE-bench, the design sets are constructed from Verified repositories, and all three sizes remain small. The experiment supports robustness within the tested range, not a general law of agent architecture discovery.

Similarly, applying the discovered sub-agents to Claude 3.7 Sonnet improves its Live score from 13.7% to 16.3%. This indicates partial transfer. The gain of 2.6 points is smaller than the 7.7 points achieved with the model used during optimization, which suggests that agent roles and prompts are not entirely model-independent.

The architecture can travel. It does not travel without luggage.

Multi-Agent Success Comes With a New Failure Mode: Trusted Mistakes

The paper’s qualitative analysis compares cases where the single-agent and multi-agent systems produce different outcomes.

The hierarchical system shows two recurring advantages.

First, it avoids over-editing. Single agents sometimes produce large patches, modify unrelated regions, or create unnecessary tests. The multi-agent system tends to make shorter, more localized changes.

Second, it handles multi-site fixes more reliably. A dedicated navigation sub-agent can identify several related call sites or modules before the orchestrator begins editing, reducing both omissions and irrelevant modifications.

But hierarchy creates its own characteristic failure.

When a sub-agent returns an incomplete or incorrect analysis, the orchestrator may accept it as ground truth. Subsequent reasoning then proceeds confidently in the wrong direction. Because the system lacks intermediate validation, it has limited ability to recover.

This failure mode changes the next engineering priority. Once agent teams can delegate effectively, they need mechanisms for distrusting one another productively.

Possible controls include:

cross-checking file or span localization before editing;
requiring evidence for important sub-agent claims;
running invariant or regression tests after handoffs;
invoking a second reader when uncertainty is high;
allowing the orchestrator to reject or repeat a sub-agent call.

The paper does not experimentally establish which verification mechanism works best. It identifies the failure pattern and proposes lightweight verification as a promising direction.
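As one concrete instance of lightweight verification, the first control in the list (cross-checking file localization) can be sketched as a filesystem check. This is an illustrative assumption, not a mechanism evaluated in the paper: `validate_localization` is a hypothetical helper, and the temporary repository and file names exist only for the demonstration.

```python
import tempfile
from pathlib import Path


def validate_localization(repo_root: str, claimed_files: list[str]) -> tuple[list[str], list[str]]:
    """Split a sub-agent's claimed file list into verified and rejected paths."""
    root = Path(repo_root).resolve()
    verified, rejected = [], []
    for rel in claimed_files:
        candidate = (root / rel).resolve()
        # Accept only real files that stay inside the repository.
        inside = root in candidate.parents
        (verified if inside and candidate.is_file() else rejected).append(rel)
    return verified, rejected


# Demonstration on a throwaway repository containing a single file.
with tempfile.TemporaryDirectory() as repo:
    (Path(repo) / "parser.py").write_text("def parse(text): ...\n")
    ok, bad = validate_localization(repo, ["parser.py", "lexer.py", "../outside.py"])

print(ok)   # ['parser.py']
print(bad)  # ['lexer.py', '../outside.py']
```

A check like this catches only hallucinated or out-of-tree paths. It says nothing about whether an existing file is the right one, which is why the heavier controls in the list, such as evidence requirements or a second reader, may still be needed.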

The business implication is more immediate: delegation without validation transfers errors faster.

Agent Architecture Should Be Managed Like a Capability Portfolio

The paper directly demonstrates that BOAD can discover an effective small hierarchy for SWE-bench issue resolution. Translating that result into a production practice requires a few additional steps.

A practical architecture-discovery process could look like this:

Operational stage	Business action	Evaluation question
Build candidate archive	Define or generate specialist prompts for recurring bottlenecks	Does this capability address a common and separable problem?
Warm up interfaces	Refine tool descriptions, required context, outputs, and side effects	Can the orchestrator invoke it correctly?
Evaluate in mixed teams	Run candidates across representative tasks and collaborators	Does it remain useful under different conditions?
Assign individual credit	Judge whether each specialist materially advanced the task	Did it help, or merely participate?
Select a small team	Retain the few capabilities with high marginal value	Does the added specialization exceed coordination cost?
Customize orchestration	Teach the coordinator when and how to use selected specialists	Are handoffs integrated into the workflow?
Validate handoffs	Cross-check high-impact intermediate outputs	Can one mistaken specialist derail the process?
Re-evaluate periodically	Add candidates and retire weak roles	Has the workload or model changed?

This framework differs from the common practice of designing a full multi-agent workflow in advance and then measuring whether the entire system succeeds.

It treats roles as hypotheses.

An issue-analysis agent may be valuable because ambiguous requests are a recurring bottleneck. A legal-review agent may be valuable only in a minority of cases. A formatting specialist may appear productive while contributing almost nothing to the final outcome. The relevant measure is not whether the role sounds legitimate, but whether it improves results often enough to justify its cost and coordination burden.

For return-on-investment analysis, that means measuring at least four dimensions:

improvement in task success;
reduction in context, latency, or inference cost;
frequency with which the specialist is genuinely useful;
additional failures caused by incorrect handoffs.

A specialist that improves rare cases but increases complexity on every case may not belong in the default team. It may be better invoked conditionally.
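The four dimensions can be combined into a rough placement rule. The model below is a deliberately simple assumption made for illustration: it treats all quantities as being in the same arbitrary value units, assumes ideal routing for conditional invocation, and uses invented numbers. Neither `net_value` nor `placement` comes from the paper.

```python
def net_value(useful_rate, lift, overhead, failure_rate, failure_cost):
    """Expected value per task if the specialist runs on every task."""
    return useful_rate * lift - overhead - failure_rate * failure_cost


def placement(useful_rate, lift, overhead, failure_rate, failure_cost):
    """Classify a specialist as default, conditional, or not worth keeping."""
    if net_value(useful_rate, lift, overhead, failure_rate, failure_cost) > 0:
        return "default team"
    # With ideal routing the specialist runs only where it helps, so overhead
    # and handoff risk are paid on that share of tasks alone.
    if lift - overhead - failure_rate * failure_cost > 0:
        return "invoke conditionally"
    return "retire"


# Hypothetical specialists: (useful_rate, lift, overhead, failure_rate, failure_cost)
print(placement(0.9, 1.0, 0.3, 0.05, 1.0))  # default team
print(placement(0.1, 1.0, 0.3, 0.05, 1.0))  # invoke conditionally
print(placement(0.5, 0.2, 0.3, 0.05, 1.0))  # retire
```

The middle case is the one described above: a specialist that is valuable when needed but rarely needed. Running it on every task loses value, while running it on demand does not, provided the routing decision itself is cheap and reasonably accurate.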

BOAD selects a fixed top-two team for evaluation. A production system may eventually need adaptive team sizing: simple tasks remain single-agent, ambiguous tasks call an analyzer, and repository-wide changes trigger additional navigation or verification capabilities.

That extension is plausible. It is not yet demonstrated by the paper.

What the Paper Shows, What Businesses Can Infer, and What Remains Open

The paper’s findings are strong enough to challenge several common assumptions, but narrow enough that implementation still requires local evidence.

What the paper directly shows

Automatically discovered sub-agents outperform the tested single-agent, manual multi-agent, and evolutionary-search configurations on SWE-bench.
Hindsight helpfulness selects better sub-agents than team success rate.
A customized orchestrator improves the value of the selected specialists.
Two sub-agents outperform teams containing one, three, four, or five sub-agents.
The discovered hierarchy shortens maximum input context and reduces total tokens on SWE-bench Live.
The most useful discovered roles focus on issue analysis and repository navigation.
The discovered roles transfer partially, but not fully, to another execution model.

What Cognaptus infers for business use

Agent teams should be designed around measured bottlenecks rather than familiar job titles.
Organizations should evaluate the marginal contribution of each agent component.
Small teams are likely to be easier to coordinate, validate, and economically justify than broad swarms.
Tool documentation and orchestration plans are part of the architecture, not implementation trivia.
Early-stage diagnostic specialists may create more reusable value than downstream agents that duplicate the main model’s capabilities.
Agent handoffs should be treated as control points requiring validation.

What remains uncertain

The primary execution experiments use Seed-OSS-36B-Instruct within the SWE-agent scaffold on benchmark issue-resolution tasks. Claude-4 generates candidate designs and judges helpfulness. Different models, codebases, languages, tool permissions, or business workflows may produce different optimal roles.

The helpfulness judge is itself an LLM. Its labels may contain bias or error, and the paper does not compare them extensively with human marginal-contribution judgments.

The strongest system uses a fixed pair of sub-agents. The paper does not test an adaptive policy that changes team size by task.

The qualitative analysis identifies error propagation from unvalidated handoffs, but verification mechanisms remain future work.

Finally, benchmark patch resolution is not the same as safe production deployment. Coding agents operating on real systems require sandboxing, permission controls, review policies, security testing, and audit trails. A system that produces more successful patches can also produce more consequential mistakes.

Fewer Agents, Better Chosen

The important comparison is no longer between a lone coding agent and an elaborate agent swarm.

It is between unmanaged participation and measured contribution.

BOAD’s strongest result comes from a modest hierarchy: one specialist clarifies the issue, another maps the repository, and an orchestrator uses their findings to implement the fix. The system succeeds not because every phase receives its own agent, but because delegation is concentrated where it reduces uncertainty most.

The architecture also exposes the real challenge of multi-agent systems. Adding agents is easy. Determining whether they helped, teaching the orchestrator to use them, and preventing their errors from spreading is the actual engineering work.

For companies building agentic systems, the sensible starting point is therefore not a crowded workflow diagram. It is a small archive of candidate capabilities, a representative evaluation set, and an inconvenient question for every proposed specialist:

Did this agent improve the outcome, or was it simply in the room when the outcome improved?

Cognaptus: Automate the Present, Incubate the Future.

¹ Iris Xu et al., “BOAD: Discovering Hierarchical Software Engineering Agents via Bandit Optimization,” arXiv:2512.23631, https://arxiv.org/abs/2512.23631.
