Opening — Why this matters now
Multi-agent LLM systems have quietly become the industry’s favorite way to brute-force intelligence. When one model struggles, the instinct is simple: add more agents. Vote harder. Debate longer. Spend more tokens.
And yet, performance curves keep telling the same unflattering story: early gains, fast saturation, wasted compute.
This paper asks the uncomfortable question most agent frameworks politely ignore: why does scaling stall so quickly—and what actually moves the needle once it does? The answer, it turns out, has less to do with how many agents you run, and more to do with how different they truly are.
Background — From ensemble intuition to information ceilings
The intuition behind multi-agent systems is borrowed from ensemble learning: independent views reduce error. In practice, LLM agents are rarely independent. Same backbone, same pretraining data, similar prompts—different seeds, identical thinking.
Empirically, this shows up as diminishing returns across common benchmarks (GSM8K, ARC, HellaSwag, TruthfulQA). Accuracy improves at small agent counts, then flattens—or even degrades—as redundancy overwhelms novelty.
What’s been missing is a unifying explanation that applies across workflows: voting, debate, sequential orchestration. This paper supplies it by reframing multi-agent system (MAS) scaling as an information-budget problem rather than a compute-budget problem.
Analysis — What the paper actually does
1. A hard ceiling: intrinsic task uncertainty
The core result is deceptively simple. For any task with input $X$ and answer $Y$, a multi-agent system whose agents emit outputs $Z_{1:n}$ can extract at most:
$$ I(Z_{1:n}; Y | X) \le H(Y | X) $$
No number of agents can exceed the task’s intrinsic uncertainty. Once you approach this ceiling, additional calls add correlation, not information.
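A back-of-the-envelope illustration: if the input already pins a binary answer down to 90% confidence, the entire transcript, whatever its size, can contribute at most

$$ H(Y | X) = -0.9\log_2 0.9 - 0.1\log_2 0.1 \approx 0.47 \text{ bits} $$

Roughly half a bit, whether two agents produce it or two hundred.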
2. Why homogeneous scaling saturates
The paper decomposes total information gain into incremental contributions:
$$ I(Z_{1:n}; Y | X) = \sum_i I(Z_i; Y | X, Z_{<i}) $$
In homogeneous systems, these increments decay rapidly because outputs are strongly correlated. Later agents mostly paraphrase earlier ones. The system looks busy while learning nothing new.
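A toy sanity check of that decay, sketched under assumed parameters: agents share one error source $S$ (the common backbone) and differ only by a sliver of private noise, so each later agent is nearly a copy of the first.

```python
import numpy as np
from itertools import product

def entropy(dist):
    """Shannon entropy (bits) of a dict mapping outcomes to probabilities."""
    p = np.array([v for v in dist.values() if v > 0])
    return float(-(p * np.log2(p)).sum())

def marginal(joint, keep):
    """Marginalize a joint dict onto the key positions listed in `keep`."""
    out = {}
    for key, p in joint.items():
        k = tuple(key[i] for i in keep)
        out[k] = out.get(k, 0.0) + p
    return out

def toy_joint(n, eps_shared=0.2, eps_priv=0.05):
    """P(Y, Z_1..Z_n) with Z_i = Y XOR S XOR N_i: S is an error source shared
    by all agents (the common backbone), N_i is each agent's private noise."""
    table = {}
    for bits in product([0, 1], repeat=2 + n):
        y, s, noise = bits[0], bits[1], bits[2:]
        p = 0.5 * (eps_shared if s else 1 - eps_shared)
        for b in noise:
            p *= eps_priv if b else 1 - eps_priv
        key = (y,) + tuple(y ^ s ^ b for b in noise)
        table[key] = table.get(key, 0.0) + p
    return table

J = toy_joint(4)  # key layout: (Y, Z_1, Z_2, Z_3, Z_4)
for i in range(1, 5):
    prev, cur = list(range(1, i)), list(range(1, i + 1))
    # Chain-rule increment: I(Z_i; Y | Z_<i)
    #   = H(Y, Z_<i) + H(Z_<=i) - H(Z_<i) - H(Y, Z_<=i)
    inc = (entropy(marginal(J, [0] + prev)) + entropy(marginal(J, cur))
           - entropy(marginal(J, prev)) - entropy(marginal(J, [0] + cur)))
    print(f"I(Z_{i}; Y | Z_<{i}) = {inc:.4f} bits")
```

With these parameters the first agent delivers about 0.22 bits, the second about 0.03, and the increments keep shrinking; set `eps_priv = 0` and every agent after the first contributes exactly nothing.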
3. Effective channels, not raw agents
To replace agent count $n$, the authors introduce two more meaningful state variables:
- Effective channels ($K$): the number of non-redundant information sources in the transcript
- Complementarity rate ($\alpha$): how likely a new channel reveals previously missing evidence
Together, they govern recoverable information:
$$ \mathbb{E}[I] \ge H(Y|X)\big(1 - e^{-\alpha K}\big) $$
This explains the familiar “fast-then-slow” curve: early gains when $K$ grows, saturation once it doesn’t.
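Plugging assumed numbers into the bound makes that shape visible; the values of $H(Y|X)$ and $\alpha$ below are illustrative, not from the paper:

```python
import numpy as np

H_yx = 1.0                  # intrinsic uncertainty H(Y|X) in bits (illustrative)
alpha = 0.9                 # complementarity rate (assumed)
K = np.arange(1, 17)        # effective channels
gain = H_yx * (1 - np.exp(-alpha * K))
marginal_gain = np.diff(gain, prepend=0.0)
# Channel 1 recovers ~0.59 bits; channel 8 adds ~0.001. Fast, then slow.
```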
4. Measuring diversity without labels: $K^*$
Since true $K$ depends on unknown ground truth, the paper proposes $K^*$, a label-free proxy computed from agent output embeddings. Technically, it’s the entropy-based effective rank of the cosine-similarity Gram matrix.
Intuition:
- Identical reasoning → $K^* \approx 1$
- Genuinely different reasoning paths → $K^*$ increases
Empirically, $K^*$ tracks accuracy far better than agent count or configuration labels.
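A minimal sketch of the proxy, assuming the standard entropy-based effective rank (the exponential of the entropy of the normalized eigenvalue spectrum); the paper's exact preprocessing may differ:

```python
import numpy as np

def k_star(E):
    """K* proxy: entropy-based effective rank of the cosine-similarity
    Gram matrix of agent output embeddings E (one row per agent)."""
    E = E / np.linalg.norm(E, axis=1, keepdims=True)  # unit rows: Gram = cosine sims
    lam = np.clip(np.linalg.eigvalsh(E @ E.T), 0.0, None)
    p = lam / lam.sum()                               # spectrum as a distribution
    p = p[p > 0]
    return float(np.exp(-(p * np.log(p)).sum()))      # exp(spectral entropy)
```

Sanity checks: $n$ identical outputs give a rank-one Gram matrix and $K^* = 1$; $n$ mutually orthogonal embeddings give $K^* = n$, which is exactly the "effective channels" reading.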
Findings — Results that actually matter
Key empirical results
| Observation | What it means in practice |
|---|---|
| 16 homogeneous agents ≈ 2 diverse agents | Scale is an expensive substitute for diversity |
| Model + persona diversity outperforms either alone | Heterogeneity compounds |
| $K^*$ correlates strongly with accuracy | Diversity is measurable, not hand-wavy |
| Correct-path diversity matters more than total diversity | Not all disagreement is useful |
A crucial refinement is the decomposition:
- $K^*_c$: diversity among correct reasoning paths
- $K^*_w$: diversity among incorrect ones
High-performing systems cluster where $K^*_c > K^*_w$. In other words: many ways to be right, few coherent ways to be wrong.
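Operationally, the split is two calls to the same proxy; a sketch reusing `k_star` from above, with `E` the output embeddings and `correct` a hypothetical boolean mask from evaluation labels:

```python
k_c = k_star(E[correct])    # K*_c: diversity among correct reasoning paths
k_w = k_star(E[~correct])   # K*_w: diversity among incorrect ones
healthy = k_c > k_w         # many ways to be right, few coherent ways to be wrong
```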
Implications — Design rules hiding in plain sight
1. Stop counting agents
Agent count is a poor control knob. Two systems with the same $n$ can sit at wildly different points on the information curve depending on redundancy.
2. Engineer diversity deliberately
Effective diversity comes from:
- Different base models (not just different seeds)
- Distinct reasoning personas
- Orthogonal tools or representations
Random temperature noise is a blunt—and often counterproductive—instrument.
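Concretely, that means defining the pool over orthogonal axes rather than sampling knobs. A hypothetical sketch; every model name, persona, and tool below is a placeholder:

```python
# Hypothetical heterogeneous pool: vary backbone, persona, and tooling,
# not just the sampling seed. Every name here is a placeholder.
AGENT_POOL = [
    {"model": "backbone-a", "persona": "skeptical auditor",   "tools": ["calculator"]},
    {"model": "backbone-b", "persona": "step-by-step prover", "tools": ["python"]},
    {"model": "backbone-c", "persona": "analogical reasoner", "tools": []},
    {"model": "backbone-a", "persona": "adversarial critic",  "tools": ["search"]},
]
```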
3. Right-size your MAS
Across tasks, homogeneous systems plateau around $n \approx 4$. Heterogeneous ones remain efficient up to $n \approx 8$. Beyond that, returns collapse.
4. Measure before you scale
$K^*$ offers a practical diagnostic: if it’s not increasing, your system isn’t learning—no matter how many tokens you burn.
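One way to turn that diagnostic into a stopping rule, reusing `k_star` and numpy from above (the tolerance is an assumption, not a value from the paper):

```python
def should_stop(embeddings, new_embedding, tol=0.1):
    """Stop growing the pool once a new output opens no new effective channel."""
    gain = k_star(np.vstack([embeddings, new_embedding])) - k_star(embeddings)
    return gain < tol
```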
Conclusion — Scaling is an illusion without diversity
This paper quietly dismantles a popular myth: that agentic intelligence scales like infrastructure. It doesn’t.
Multi-agent systems are constrained by information, not enthusiasm. Once redundancy sets in, more agents simply argue louder.
The real lever is diversity—measured, structured, and aligned with correctness. Systems that understand this will be smaller, cheaper, and more reliable. The rest will keep paying for conversations that go nowhere.
Cognaptus: Automate the Present, Incubate the Future.