Many AI teams discover multi-agent systems the same way some companies discover meetings: one agent seems useful, so surely sixteen must be strategic.

The logic is seductive. Add more agents. Let them vote. Let them debate. Let them critique each other. Give the workflow a name with a little theatrical flair. Somewhere in the process, intelligence is expected to emerge from volume.

The paper “Understanding Agent Scaling in LLM-Based Multi-Agent Systems via Diversity” offers a useful correction: the question is not how many agents are in the room. The question is how many independent, task-relevant reasoning channels the system actually has.[1] Two agents repeating the same reasoning path are not two sources of evidence. They are one source wearing two badges.

That distinction matters because enterprise AI systems are now moving from single-model chatbots toward agentic workflows: document reviewers, coding assistants, research agents, compliance checkers, customer-support triage systems, financial-analysis copilots, and internal automation pipelines. It is tempting to improve reliability by adding more agents to the workflow. The paper’s message is colder and more useful: agent count is a compute metric, not an intelligence metric.

The business lesson is not “use diversity” in the decorative HR-poster sense. It is narrower and sharper: diversify only when it creates complementary routes to the correct answer. Otherwise, you are just paying extra for synchronized hallucination. Very modern, very expensive.

The real bottleneck is redundant evidence, not insufficient headcount

The paper begins from a practical puzzle. Multi-agent systems often improve when moving from one agent to a few agents, but the gains weaken quickly. In homogeneous setups, where agents share the same model, prompt style, and configuration, adding more agents often produces diminishing returns. Sometimes it can even degrade performance.

The authors explain this through an information-theoretic lens. A multi-agent system receives an input, produces a transcript of agent outputs, and then aggregates those outputs into a final answer. The useful question is how much the transcript reduces uncertainty about the correct answer beyond what was already available in the input.

In plain language: did the new agent add new usable evidence, or did it restate the same idea with slightly different punctuation?

The paper calls this useful information “usable evidence.” Each additional agent call contributes only the information that was not already contained in previous outputs. If the new output overlaps heavily with earlier ones, its incremental contribution is small. This is the mechanism behind saturation.

The authors formalize a finite information budget: no system can extract more information about the answer than the task itself contains. That sounds obvious, but it gives the scaling story teeth. Once a workflow has extracted the easy evidence, more calls help only if they uncover still-missing evidence. If those calls are correlated, they do not expand the evidence base much.

That is why raw agent count is a weak scaling variable. A system with 16 similar agents may have far fewer than 16 effective reasoning channels. It has 16 invoices, perhaps. It does not necessarily have 16 perspectives.
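The saturation argument in this section can be written compactly with the chain rule for mutual information. The notation is an assumption made here for illustration and is not necessarily the paper's: $X$ is the input, $Y$ the correct answer, and $Z_1, \dots, Z_n$ the agent outputs in the transcript.

$$ I(Y; Z_1, \dots, Z_n \mid X) = \sum_{i=1}^{n} I(Y; Z_i \mid X, Z_1, \dots, Z_{i-1}) \le H(Y \mid X) $$

Each term in the sum is the incremental contribution of agent $i$ given everything said before it. If $Z_i$ is nearly determined by the earlier outputs, that term is close to zero however many calls were paid for, and the whole sum is capped by $H(Y \mid X)$, the finite information budget.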

Effective channels are the missing unit of agent scaling

The paper’s central concept is the effective channel: an independent, non-redundant source of task-relevant information in the multi-agent transcript.

A simple mental model:

| System design | What it looks like | What it really contributes |
|---|---|---|
| Many identical agents | Same model, same prompt, same reasoning style | Many calls, few effective channels |
| Persona variation only | Same model, different role prompts | Some extra diversity, but not always task-relevant |
| Model diversity | Different model families or checkpoints | More chances of genuinely different reasoning |
| Full diversity | Different models plus different task-relevant prompts/personas | Highest chance of complementary evidence |

The paper also introduces a complementarity rate: the probability that a new effective channel uncovers previously missing evidence. This matters because diversity is not automatically useful. A system can produce diverse wrong answers. That is not insight; it is noise with better costume design.

The authors connect effective channels and complementarity to a fast-then-slow scaling curve. Early additions can help because the system is still discovering missing evidence. Later additions help less because the remaining uncertainty becomes harder to reduce, and because redundancy increases.
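The fast-then-slow shape can be illustrated with a toy model. This is a sketch under a strong assumption that is not taken from the paper: each new effective channel independently recovers any given piece of missing evidence with a fixed probability, here called `complementarity_rate`. The function names are hypothetical.

```python
def expected_coverage(num_channels: int, complementarity_rate: float) -> float:
    """Expected fraction of missing evidence recovered after `num_channels` channels.

    Toy assumption (not the paper's formula): every effective channel
    independently recovers a given piece of missing evidence with
    probability `complementarity_rate`.
    """
    if not 0.0 <= complementarity_rate <= 1.0:
        raise ValueError("complementarity_rate must be in [0, 1]")
    return 1.0 - (1.0 - complementarity_rate) ** num_channels


def marginal_gains(max_channels: int, complementarity_rate: float) -> list[float]:
    """Coverage gained by adding the k-th channel, for k = 1..max_channels."""
    return [
        expected_coverage(k, complementarity_rate)
        - expected_coverage(k - 1, complementarity_rate)
        for k in range(1, max_channels + 1)
    ]


if __name__ == "__main__":
    for k, gain in enumerate(marginal_gains(8, 0.4), start=1):
        print(f"channel {k}: marginal gain {gain:.3f}")
```

Under this assumption the marginal gain shrinks geometrically with each added channel. Redundant agents make the picture worse, because they add calls without adding channels at all.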

The practical point is simple:

$$ \text{Performance gain} \neq \text{more agents} $$

A better approximation is:

$$ \text{Performance gain} \approx \text{more non-redundant correct evidence} $$

This is the conceptual move that makes the paper valuable for business readers. It does not merely say heterogeneous agents perform better. It explains why they perform better when they do: they expand the system’s effective evidence channels.

The label-free metric: estimating diversity before knowing the answer

A useful agent-design metric cannot depend entirely on ground-truth labels. In real deployment, the system usually does not know the correct answer in advance. That is why the paper introduces a label-free proxy called $K^\ast$.

The construction is technical but intuitive. The authors embed the agents’ full outputs, compute a cosine-similarity Gram matrix, normalize it, and then use entropy effective rank to estimate how many distinct output directions the system has. When outputs are nearly identical, $K^\ast$ approaches 1. When outputs are more independent, $K^\ast$ rises.
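A minimal sketch of that construction, assuming the embeddings are already computed and arrive as a NumPy array with one row per agent output. The function name `effective_channels` is hypothetical, and reading "normalize" as scaling the eigenvalue spectrum to sum to one is an assumption; the paper's exact normalization may differ.

```python
import numpy as np


def effective_channels(embeddings: np.ndarray) -> float:
    """Entropy effective rank of the cosine-similarity Gram matrix.

    `embeddings` has shape (num_agents, dim), one row per agent output.
    """
    # Unit-normalize rows so the Gram matrix holds cosine similarities.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)
    gram = unit @ unit.T

    # Eigenvalues of a symmetric PSD matrix; clip tiny negatives from round-off.
    eigvals = np.clip(np.linalg.eigvalsh(gram), 0.0, None)

    # Assumed normalization: turn the spectrum into a probability distribution.
    probs = eigvals / eigvals.sum()
    probs = probs[probs > 0]

    # exp(Shannon entropy) counts the effective number of output directions.
    return float(np.exp(-(probs * np.log(probs)).sum()))
```

With identical rows the function returns a value near 1, and with mutually orthogonal rows it returns the number of agents, which matches the behavior described above.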

This is not a magic truth detector. It is a redundancy detector.

That distinction is important. $K^\ast$ can tell you whether agents are saying different things. It cannot, by itself, tell you whether those different things are correct. The paper handles this by later decomposing effective-channel diversity into correct-path diversity and wrong-path diversity. That decomposition is one of the most useful parts of the study because it prevents a very common managerial misreading: “diverse outputs must be better.”

No. Diverse correct reasoning is better. Diverse wrong reasoning is just a more colorful failure mode.

The main experiments test scale, diversity, and mechanism separately

The experimental setup is deliberately structured. The authors test seven benchmarks: GSM8K, ARC, Formal Logic, TruthfulQA, HellaSwag, WinoGrande, and Pro Medicine. These cover arithmetic reasoning, formal deduction, commonsense reasoning, truthfulness, and domain knowledge.

They use three open-source models: Qwen-2.5-7B, Llama-3.1-8B, and Mistral-7B. They evaluate two common multi-agent workflows:

  • Vote, where agents independently generate answers and the system aggregates by majority decision.
  • Debate, where agents interact sequentially over four rounds before producing a final answer.
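The Vote aggregation step is simple enough to sketch. The function name and the tie-breaking rule are assumptions for illustration; the source does not say how the paper resolves ties. Debate is not sketched because its round-by-round protocol is not described in enough detail here.

```python
from collections import Counter


def majority_vote(answers: list[str]) -> str:
    """Return the most common answer; ties go to the earliest-seen answer."""
    if not answers:
        raise ValueError("need at least one answer")
    # Counter.most_common orders equal counts by first insertion.
    return Counter(answers).most_common(1)[0][0]
```

Note what the function cannot see: sixteen copies of the same wrong answer win the vote just as easily as sixteen independent confirmations of the right one.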

The diversity configurations are layered:

| Layer | Configuration | What it tests |
|---|---|---|
| L1 | No diversity: same base model, same default prompt | Homogeneous scaling baseline |
| L2 | Persona diversity only | Whether prompt-role variation helps |
| L3 | Model diversity only | Whether different model backbones add complementary evidence |
| L4 | Full diversity: model plus persona diversity | Whether multiple diversity sources combine |

This structure is useful because it avoids the lazy conclusion that “diversity works” without asking what kind of diversity works. Persona diversity and model diversity are not the same intervention. One changes the instruction frame; the other changes the underlying model distribution. Their combination is stronger in the paper’s experiments, but the mechanism is still evidence complementarity, not aesthetic variety.

The headline result: two diverse agents can beat sixteen homogeneous agents

The most business-relevant result is Table 2 in the paper. It asks how many agents are needed to match or exceed the no-diversity baseline with 16 agents.

| Workflow | Configuration | Agents needed to match L1 with 16 agents | Accuracy at that agent count (%) | Peak accuracy (%) |
|---|---|---|---|---|
| Vote | L1, no diversity | 16 | 65.34 | 65.49 |
| Vote | L2, persona only | 8 | 65.44 | 66.01 |
| Vote | L3, model only | 4 | 67.29 | 71.54 |
| Vote | L4, full diversity | 2 | 67.71 | 76.86 |
| Debate | L1, no diversity | 16 | 65.48 | 65.48 |
| Debate | L2, persona only | 12 | 66.08 | 66.08 |
| Debate | L3, model only | 4 | 66.26 | 71.33 |
| Debate | L4, full diversity | 2 | 67.90 | 77.43 |

The obvious interpretation is cost efficiency: full diversity with two agents can match or exceed homogeneous scaling with sixteen agents. But the deeper interpretation is architectural. The diverse system is not winning because two is a special number. It is winning because those two agents provide more effective channels than a larger set of redundant agents.

This is why “more agents” is such a poor procurement question. The better question is:

What independent evidence sources does this additional agent add?

If the answer is “it uses the same model, the same prompt, the same retrieval source, and the same reasoning style,” then the system may be scaling compute faster than intelligence. Congratulations: the GPU bill is growing nicely.

Output similarity explains why homogeneous systems saturate

The paper then checks whether redundancy is visible in the outputs themselves. The authors embed each agent’s full reasoning trace using NV-Embed-v2 and compute mean pairwise cosine similarity. Higher similarity means the agents are producing more overlapping outputs.

This test is not the main theory; it is a mechanism check. It asks whether homogeneous systems really are more redundant in observable output space.

They are.

Homogeneous persona settings show higher similarity and weaker performance. More diverse configurations preserve lower similarity and achieve stronger performance. The paper also finds that redundancy tends to increase with agent count across diversity layers. This is exactly what the mechanism predicts: as you add more agents, especially similar ones, many of them land on already-covered reasoning paths.
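The similarity check itself is a small computation. The sketch below assumes the embeddings have already been produced by some embedding model and are passed in as plain lists of floats; the function names are hypothetical.

```python
import math
from itertools import combinations


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)


def mean_pairwise_similarity(embeddings: list[list[float]]) -> float:
    """Average cosine similarity over all unordered pairs of agent outputs."""
    pairs = list(combinations(embeddings, 2))
    if not pairs:
        raise ValueError("need at least two embeddings")
    return sum(cosine_similarity(u, v) for u, v in pairs) / len(pairs)
```

A value near 1 means the agents are largely restating each other.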

But cosine similarity is crude. It can show overlap, but it does not fully estimate how many independent channels exist. That is why the paper moves from pairwise similarity to $K^\ast$.

On ARC, for example, the L4 configuration has both higher $K^\ast$ and higher accuracy than L1 under both Vote and Debate. In the Vote setting, L1 reaches 81.3% accuracy with $K^\ast = 1.201$, while L4 reaches 87.5% with $K^\ast = 1.521$. In Debate, L1 reaches 81.6% with $K^\ast = 1.197$, while L4 reaches 85.9% with $K^\ast = 1.517$.

The absolute values should not be overinterpreted as universal constants. They depend on embedding model, task, and setup. The pattern is the point: configurations with more effective output channels tend to perform better.

Correct-path diversity matters more than generic diversity

The paper’s most important nuance comes from decomposing $K^\ast$ into two parts:

  • $K_c^\ast$: effective channels among agents that reach the correct answer.
  • $K_w^\ast$: effective channels among agents that reach incorrect answers.

This distinction changes the practical interpretation. Generic diversity is not the goal. Correct-path diversity is the goal.
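One plausible way to compute the split, assuming each agent's output has an embedding and a correctness flag from a labeled evaluation set. This is a sketch and not the paper's exact definition: the helper names, the spectrum normalization, and the convention that an empty group scores 0 are all assumptions made here.

```python
import numpy as np


def entropy_effective_rank(embeddings: np.ndarray) -> float:
    """exp(entropy) of the normalized eigenvalues of the cosine Gram matrix."""
    if len(embeddings) == 0:
        return 0.0  # assumed convention: an empty group has no channels
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)
    eigvals = np.clip(np.linalg.eigvalsh(unit @ unit.T), 0.0, None)
    total = eigvals.sum()
    if total == 0.0:
        return 0.0
    probs = eigvals[eigvals > 0] / total
    return float(np.exp(-(probs * np.log(probs)).sum()))


def split_channels(embeddings: np.ndarray, is_correct) -> tuple[float, float]:
    """Return (K_c, K_w): effective channels among correct and incorrect agents."""
    mask = np.asarray(is_correct, dtype=bool)
    return (
        entropy_effective_rank(embeddings[mask]),
        entropy_effective_rank(embeddings[~mask]),
    )
```

Unlike $K^\ast$ itself, this split needs ground-truth labels, so it belongs in offline evaluation and not in live monitoring.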

High-performing configurations tend to appear where correct reasoning diversity dominates. When multiple agents reach the correct answer through different reasoning paths, the final answer receives support from independent evidence. That makes aggregation more robust. By contrast, if incorrect answers are also highly diverse, errors scatter across alternatives. That can dilute the wrong signal, but it can also indicate the system is generating uncontrolled noise.

The supplementary regression analysis reinforces this. A baseline model using only agent count and configuration labels explains little variance in performance. Adding $K^\ast$ improves explanatory power. But adding the correctness-conditioned component $K_c^\ast$ improves it much more. Further adding the wrong-answer component contributes little beyond that.

That is the cleanest business translation of the paper:

Do not optimize for disagreement. Optimize for independent routes to the right answer.

This is where many enterprise agent designs go wrong. They create “roles” that sound different: analyst, critic, reviewer, planner, domain expert, skeptic. But if all roles use the same model, the same context, and the same retrieval evidence, the system may produce role-play diversity rather than evidence diversity.

A “skeptic” agent that merely rephrases the same hidden assumptions is not a control function. It is theater.

What the supplementary tests actually support

The appendix is useful, but it should not be read as a second thesis. Its tests mostly support robustness, sensitivity, and scope.

| Test | Likely purpose | What it supports | What it does not prove |
|---|---|---|---|
| Closed-source model experiments on Formal Logic | Robustness extension | Heterogeneity advantage is not limited to the three open-source models | Full generalization to all proprietary systems |
| Embedding model comparison | Sensitivity test | Relative $K^\ast$ rankings are not purely an artifact of one embedding model | Absolute $K^\ast$ values are universally comparable |
| Regression with $K^\ast$ and correctness-conditioned components | Mechanism support | Output diversity explains performance beyond agent count and configuration labels | Causal proof in every real deployment |
| Permutation sanity checks | Statistical sanity check | The $K^\ast$–performance relation is unlikely to be accidental in their data | A universal deployment guarantee |
| Formal Logic model/workflow ablations | Exploratory robustness | Heterogeneity often helps across models and workflows | That every workflow benefits from every diversity source |

The closed-source extension is interesting but bounded. The authors test gpt-4.1-mini and gpt-5-mini on Formal Logic. Heterogeneity improves over homogeneous baselines in at least one interaction mechanism for all tested models, but scaling behavior differs. For gpt-4.1-mini, adding agents in debate can hurt even under heterogeneity. For gpt-5-mini, heterogeneity produces large gains from weak homogeneous baselines.

This is a useful warning. Diversity is not a slogan; it is an intervention. Its effect depends on the base model, task, workflow, and aggregation protocol.

The embedding robustness test is also important. The authors recompute $K^\ast$ using a different embedding model, gte-Qwen2-1.5B-instruct, and compare it with NV-Embed-v2. They report strong agreement in relative ordering and positive correlation with accuracy for both embeddings. That supports $K^\ast$ as a useful diagnostic direction, but it does not make $K^\ast$ a universal meter with portable units. A $K^\ast$ value from one system should not be compared casually with a value from a different task, embedding model, or output format.

Business interpretation: design the evidence portfolio, not the agent roster

For companies building agentic systems, the paper suggests a shift in design language.

Stop asking:

  • How many agents should we use?
  • Should we add a critic?
  • Should we make the agents debate?
  • Can we improve accuracy by adding more samples?

Start asking:

  • What evidence source does each agent access?
  • What reasoning path does each agent specialize in?
  • Which agents are redundant?
  • Does diversity increase correct-path coverage or only output noise?
  • At what point does the marginal agent stop adding effective information?

This turns multi-agent design into an evidence-portfolio problem. Each agent should justify its existence by contributing a non-redundant path to useful evidence.

A practical enterprise workflow might use diversity across several dimensions:

| Diversity dimension | Useful version | Weak version |
|---|---|---|
| Model diversity | Different model families with different strengths | Same model copied many times |
| Prompt diversity | Task-relevant reasoning strategies | Decorative personas |
| Retrieval diversity | Different databases, document slices, or search strategies | Same context pasted into every agent |
| Tool diversity | Agents using different validators, calculators, code execution, or rule checks | All agents relying on free-form text |
| Aggregation diversity | Weighted synthesis based on confidence, evidence, and failure modes | Blind majority vote |

The paper directly studies model and persona diversity in voting and debate workflows. The broader business inference is that real systems should diversify the sources of evidence and reasoning, not merely the labels assigned to agents.

For example, in a contract-review system, useful diversity might mean:

  • one agent checks clause extraction against the document text;
  • one compares terms against a policy library;
  • one identifies missing obligations;
  • one validates numerical or date consistency;
  • one summarizes legal risk in business language.

That is more likely to create effective channels than five agents all reading the same contract with the prompt “be careful.”

Careful is not an architecture.

A simple operating framework: duplicate, differentiate, diagnose, prune

The paper can be translated into a four-step operating framework for agent teams.

1. Duplicate only to establish the baseline

A homogeneous ensemble is not useless. It provides a baseline and can improve early performance. But it should be treated as a starting point, not the final architecture.

Run the simple version first: one model, one prompt style, several samples or agents. Measure where performance saturates. This identifies the point where adding similar agents stops paying.
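Finding the saturation point can be as simple as scanning the baseline curve. The sketch below is one possible rule, not something the paper prescribes: the function name, the `min_gain` threshold, and the numbers in the example are all made up for illustration.

```python
def saturation_point(accuracy_by_agents: dict[int, float], min_gain: float = 0.5) -> int:
    """Agent count after which the next measured step gains less than `min_gain`.

    `accuracy_by_agents` maps agent count -> accuracy in percentage points.
    Returns the largest measured count if gains never drop below the threshold.
    """
    counts = sorted(accuracy_by_agents)
    if not counts:
        raise ValueError("need at least one measurement")
    for current, nxt in zip(counts, counts[1:]):
        if accuracy_by_agents[nxt] - accuracy_by_agents[current] < min_gain:
            return current
    return counts[-1]


if __name__ == "__main__":
    # Hypothetical baseline curve, not taken from the paper.
    curve = {1: 60.0, 2: 63.0, 4: 64.5, 8: 64.8, 16: 64.9}
    print(saturation_point(curve))  # 4: the step from 4 to 8 adds only 0.3 points
```

The threshold is a business decision. It should reflect what one extra agent call costs against what one accuracy point is worth.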

2. Differentiate by evidence channel, not by personality label

Add diversity where it creates a plausible independent route to the answer. For reasoning tasks, this may mean different solution strategies. For technical review, it may mean separate validation tools. For research, it may mean different retrieval corpora. For financial analysis, it may mean separating macro, market microstructure, valuation, and risk-control perspectives.

The key test is whether the new agent changes the evidence set, not whether the prompt sounds different.

3. Diagnose redundancy with output similarity and task metrics

A label-free metric like $K^\ast$ is useful as a monitoring signal. If agent outputs remain highly similar, the system is likely redundant. If outputs become more diverse but accuracy does not improve, diversity may be adding noise rather than useful evidence.

In production, a practical dashboard would combine:

  • output similarity;
  • answer agreement;
  • task success rate on labeled evaluation sets;
  • disagreement cases reviewed by humans;
  • marginal value per agent call;
  • failure-type clustering.

The last one matters. If a new agent mainly creates new categories of wrong answers, the system is not becoming wiser. It is becoming more imaginative in all the wrong places.
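Three of these dashboard signals reduce to a few lines each. The helpers below are a sketch with hypothetical names, and "failure type" is approximated by the wrong answer string itself, which is a simplification; a real system would likely cluster failures by cause.

```python
from collections import Counter


def answer_agreement(answers: list[str]) -> float:
    """Fraction of agents whose answer matches the most common answer."""
    if not answers:
        raise ValueError("need at least one answer")
    top_count = Counter(answers).most_common(1)[0][1]
    return top_count / len(answers)


def marginal_value_per_call(acc_before: float, acc_after: float, extra_calls: int) -> float:
    """Accuracy change on a labeled evaluation set per additional agent call."""
    if extra_calls <= 0:
        raise ValueError("extra_calls must be positive")
    return (acc_after - acc_before) / extra_calls


def new_failure_types(existing_wrong: list[str], candidate_wrong: list[str]) -> set[str]:
    """Wrong answers produced only by the candidate agent, not by existing agents."""
    return set(candidate_wrong) - set(existing_wrong)
```

A candidate agent with a flat marginal value and a growing set of new failure types is a reasonable first target for removal.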

4. Prune agents that do not add marginal evidence

Once the system has a stable evaluation set, remove agents whose outputs are redundant or whose disagreement does not improve final accuracy. Agent teams should be maintained like portfolios, not family photos. Sentimentality is expensive.

The paper’s Table 2 makes this point sharply. A small, diverse system can outperform a much larger homogeneous one. That means pruning is not just cost control; it can improve signal quality.
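A leave-one-out check is one simple way to find pruning candidates. The sketch below assumes majority-vote aggregation with ties going to the earliest agent in the dictionary, and a labeled evaluation set; the function names, agent names, and toy data are hypothetical.

```python
from collections import Counter


def vote_accuracy(agent_answers: dict[str, list[str]], labels: list[str]) -> float:
    """Majority-vote accuracy over a labeled evaluation set."""
    if not labels:
        raise ValueError("need at least one labeled example")
    correct = 0
    for i, label in enumerate(labels):
        votes = [answers[i] for answers in agent_answers.values()]
        # Ties resolve to the answer seen first, i.e. the earliest agent.
        if Counter(votes).most_common(1)[0][0] == label:
            correct += 1
    return correct / len(labels)


def leave_one_out_deltas(agent_answers: dict[str, list[str]], labels: list[str]) -> dict[str, float]:
    """Accuracy without each agent minus accuracy with all agents."""
    baseline = vote_accuracy(agent_answers, labels)
    deltas = {}
    for name in agent_answers:
        rest = {k: v for k, v in agent_answers.items() if k != name}
        if rest:  # an empty ensemble cannot be scored
            deltas[name] = vote_accuracy(rest, labels) - baseline
    return deltas


if __name__ == "__main__":
    # Toy data: model_a_copy duplicates model_a, including its mistake.
    agents = {
        "model_a": ["A", "B", "X"],
        "model_a_copy": ["A", "B", "X"],
        "model_b": ["A", "Y", "C"],
        "model_c": ["Z", "B", "C"],
    }
    print(leave_one_out_deltas(agents, ["A", "B", "C"]))
```

A delta of zero or above marks an agent the ensemble can do without. In the toy data, dropping either copy of `model_a` raises accuracy, because the duplicated mistake no longer blocks the two agents that got the third question right.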

Where the result applies, and where it should not be stretched

The paper’s evidence is strongest for reasoning and benchmark-style tasks using voting and debate workflows. It is also strongest for the tested model scale: mostly 7B–8B open-weight models, with a supplementary Formal Logic test on closed-source models.

Several boundaries matter.

First, $K^\ast$ measures semantic diversity in embedding space. It is a useful proxy for effective channels, not a direct measurement of task-relevant truth. The paper itself shows why correctness-conditioned diversity is more predictive than generic diversity.

Second, the benefits vary by task type. The authors note that $K^\ast$ predicts accuracy more strongly on reasoning tasks than on knowledge-heavy tasks. That matters for enterprise deployments. If the task is factual retrieval from a controlled knowledge base, better retrieval and verification may matter more than agent diversity.

Third, the paper studies Vote and Debate. Many real agent systems involve long-horizon planning, external tools, memory, APIs, workflow state, human review, and changing business rules. In those settings, effective channels may involve tool traces, database queries, or procedural checkpoints, not just textual reasoning outputs.

Fourth, diversity can hurt. A weakly designed debate can amplify confusion. A critic agent can overrule correct answers. A high-temperature sampling strategy can increase wrong-path diversity. The paper’s message is not “make everything diverse.” It is “increase non-redundant correct evidence.”

That is less catchy, but reality often is.

The managerial lesson: stop buying agent count as if it were capability

The paper is valuable because it replaces a vague engineering habit with a clearer unit of analysis. Multi-agent performance does not scale reliably with the raw number of agents. It scales with effective channels, complementarity, and especially correct-path diversity.

For business teams, this changes how agentic AI systems should be scoped.

A proposal that says “we will use eight agents” is incomplete. Eight doing what? Reading which evidence? Using which tools? Producing what independent checks? Reducing which failure modes? Improving which metric relative to a smaller design?

Without those answers, the agent count is just numerology with an API key.

The better design target is an evidence-efficient system: enough agents to cover genuinely different reasoning paths, not so many that the workflow becomes a redundant committee. A two-agent system with strong complementary evidence can beat a sixteen-agent chorus. The paper’s experiments make that point empirically; the information-theoretic framing explains why.

The quiet lesson is that agentic AI architecture is becoming less about adding more synthetic workers and more about designing better epistemic division of labor. The system should not merely have more voices. It should have more independent ways to be right.

Cognaptus: Automate the Present, Incubate the Future.


  1. Yingxuan Yang, Chengrui Qu, Muning Wen, Laixi Shi, Ying Wen, Weinan Zhang, Adam Wierman, and Shangding Gu, “Understanding Agent Scaling in LLM-Based Multi-Agent Systems via Diversity,” arXiv:2602.03794, 2026, https://arxiv.org/abs/2602.03794