Opening — Why this matters now
Multi-agent LLM systems have quietly become the industry’s favorite way to brute-force intelligence. When one model struggles, the instinct is simple: add more agents. Vote harder. Debate longer. Spend more tokens.
And yet, performance curves keep telling the same unflattering story: early gains, fast saturation, wasted compute.
This paper asks the uncomfortable question most agent frameworks politely ignore: why does scaling stall so quickly—and what actually moves the needle once it does? The answer, it turns out, has less to do with how many agents you run, and more to do with how different they truly are.
Background — From ensemble intuition to information ceilings
The intuition behind multi-agent systems is borrowed from ensemble learning: independent views reduce error. In practice, LLM agents are rarely independent. Same backbone, same pretraining data, similar prompts—different seeds, identical thinking.
Empirically, this shows up as diminishing returns across common benchmarks (GSM8K, ARC, HellaSwag, TruthfulQA). Accuracy improves at small agent counts, then flattens—or even degrades—as redundancy overwhelms novelty.
What’s been missing is a unifying explanation that applies across workflows: voting, debate, sequential orchestration. This paper supplies it by reframing multi-agent system (MAS) scaling as an information-budget problem rather than a compute-budget problem.
Analysis — What the paper actually does
1. A hard ceiling: intrinsic task uncertainty
The core result is deceptively simple. For any task with input $X$ and answer $Y$, a multi-agent system whose agents emit outputs $Z_{1:n}$ can extract at most:
$$ I(Z_{1:n}; Y | X) \le H(Y | X) $$
No number of agents can exceed the task’s intrinsic uncertainty. Once you approach this ceiling, additional calls add correlation, not information.
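A back-of-the-envelope illustration: if the input already pins a binary answer down to 90% confidence, the entire transcript, whatever its size, can contribute at most

$$ H(Y | X) = -0.9\log_2 0.9 - 0.1\log_2 0.1 \approx 0.47 \text{ bits} $$

Roughly half a bit, whether two agents produce it or two hundred.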
2. Why homogeneous scaling saturates
The paper decomposes total information gain into incremental contributions:
$$ I(Z_{1:n}; Y | X) = \sum_i I(Z_i; Y | X, Z_{<i}) $$
In homogeneous systems, these increments decay rapidly because outputs are strongly correlated. Later agents mostly paraphrase earlier ones. The system looks busy while learning nothing new.
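A toy sanity check of that decay, sketched under assumed parameters: agents share one error source $S$ (the common backbone) and differ only by a sliver of private noise, so each later agent is nearly a copy of the first.

```python
import numpy as np
from itertools import product

def entropy(dist):
    """Shannon entropy (bits) of a dict mapping outcomes to probabilities."""
    p = np.array([v for v in dist.values() if v > 0])
    return float(-(p * np.log2(p)).sum())

def marginal(joint, keep):
    """Marginalize a joint dict onto the key positions listed in `keep`."""
    out = {}
    for key, p in joint.items():
        k = tuple(key[i] for i in keep)
        out[k] = out.get(k, 0.0) + p
    return out

def toy_joint(n, eps_shared=0.2, eps_priv=0.05):
    """P(Y, Z_1..Z_n) with Z_i = Y XOR S XOR N_i: S is an error source shared
    by all agents (the common backbone), N_i is each agent's private noise."""
    table = {}
    for bits in product([0, 1], repeat=2 + n):
        y, s, noise = bits[0], bits[1], bits[2:]
        p = 0.5 * (eps_shared if s else 1 - eps_shared)
        for b in noise:
            p *= eps_priv if b else 1 - eps_priv
        key = (y,) + tuple(y ^ s ^ b for b in noise)
        table[key] = table.get(key, 0.0) + p
    return table

J = toy_joint(4)  # key layout: (Y, Z_1, Z_2, Z_3, Z_4)
for i in range(1, 5):
    prev, cur = list(range(1, i)), list(range(1, i + 1))
    # Chain-rule increment: I(Z_i; Y | Z_<i)
    #   = H(Y, Z_<i) + H(Z_<=i) - H(Z_<i) - H(Y, Z_<=i)
    inc = (entropy(marginal(J, [0] + prev)) + entropy(marginal(J, cur))
           - entropy(marginal(J, prev)) - entropy(marginal(J, [0] + cur)))
    print(f"I(Z_{i}; Y | Z_<{i}) = {inc:.4f} bits")
```

With these parameters the first agent delivers about 0.22 bits, the second about 0.03, and the increments keep shrinking; set `eps_priv = 0` and every agent after the first contributes exactly nothing.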
3. Effective channels, not raw agents
To replace agent count $n$, the authors introduce two more meaningful state variables:
- Effective channels ($K$): the number of non-redundant information sources in the transcript
- Complementarity rate ($\alpha$): how likely a new channel reveals previously missing evidence
Together, they govern recoverable information:
$$ \mathbb{E}[I] \ge H(Y|X)\big(1 - e^{-\alpha K}\big) $$
This explains the familiar “fast-then-slow” curve: early gains when $K$ grows, saturation once it doesn’t.
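Plugging assumed numbers into the bound makes that shape visible; the values of $H(Y|X)$ and $\alpha$ below are illustrative, not from the paper:

```python
import numpy as np

H_yx = 1.0                  # intrinsic uncertainty H(Y|X) in bits (illustrative)
alpha = 0.9                 # complementarity rate (assumed)
K = np.arange(1, 17)        # effective channels
gain = H_yx * (1 - np.exp(-alpha * K))
marginal_gain = np.diff(gain, prepend=0.0)
# Channel 1 recovers ~0.59 bits; channel 8 adds ~0.001. Fast, then slow.
```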
4. Measuring diversity without labels: $K^*$
Since true $K$ depends on unknown ground truth, the paper proposes $K^*$, a label-free proxy computed from agent output embeddings. Technically, it’s the entropy-based effective rank of the cosine-similarity Gram matrix.
Intuition:
- Identical reasoning → $K^* \approx 1$
- Genuinely different reasoning paths → $K^*$ increases
Empirically, $K^*$ tracks accuracy far better than agent count or configuration labels.
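A minimal sketch of the proxy, assuming the standard entropy-based effective rank (the exponential of the entropy of the normalized eigenvalue spectrum); the paper's exact preprocessing may differ:

```python
import numpy as np

def k_star(E):
    """K* proxy: entropy-based effective rank of the cosine-similarity
    Gram matrix of agent output embeddings E (one row per agent)."""
    E = E / np.linalg.norm(E, axis=1, keepdims=True)  # unit rows: Gram = cosine sims
    lam = np.clip(np.linalg.eigvalsh(E @ E.T), 0.0, None)
    p = lam / lam.sum()                               # spectrum as a distribution
    p = p[p > 0]
    return float(np.exp(-(p * np.log(p)).sum()))      # exp(spectral entropy)
```

Sanity checks: $n$ identical outputs give a rank-one Gram matrix and $K^* = 1$; $n$ mutually orthogonal embeddings give $K^* = n$, which is exactly the "effective channels" reading.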
Findings — Results that actually matter
Key empirical results
| Observation | What it means in practice |
|---|---|
| 16 homogeneous agents ≈ 2 diverse agents | Scale is an expensive substitute for diversity |
| Model + persona diversity outperforms either alone | Heterogeneity compounds |
| $K^*$ correlates strongly with accuracy | Diversity is measurable, not hand-wavy |
| Correct-path diversity matters more than total diversity | Not all disagreement is useful |
A crucial refinement is the decomposition:
- $K^*_c$: diversity among correct reasoning paths
- $K^*_w$: diversity among incorrect ones
High-performing systems cluster where $K^*_c > K^*_w$. In other words: many ways to be right, few coherent ways to be wrong.
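Operationally, the split is two calls to the same proxy; a sketch reusing `k_star` from above, with `E` the output embeddings and `correct` a hypothetical boolean mask from evaluation labels:

```python
k_c = k_star(E[correct])    # K*_c: diversity among correct reasoning paths
k_w = k_star(E[~correct])   # K*_w: diversity among incorrect ones
healthy = k_c > k_w         # many ways to be right, few coherent ways to be wrong
```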
Implications — Design rules hiding in plain sight
1. Stop counting agents
Agent count is a poor control knob. Two systems with the same $n$ can sit at wildly different points on the information curve depending on redundancy.
2. Engineer diversity deliberately
Effective diversity comes from:
- Different base models (not just different seeds)
- Distinct reasoning personas
- Orthogonal tools or representations
Random temperature noise is a blunt—and often counterproductive—instrument.
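Concretely, that means defining the pool over orthogonal axes rather than sampling knobs. A hypothetical sketch; every model name, persona, and tool below is a placeholder:

```python
# Hypothetical heterogeneous pool: vary backbone, persona, and tooling,
# not just the sampling seed. Every name here is a placeholder.
AGENT_POOL = [
    {"model": "backbone-a", "persona": "skeptical auditor",   "tools": ["calculator"]},
    {"model": "backbone-b", "persona": "step-by-step prover", "tools": ["python"]},
    {"model": "backbone-c", "persona": "analogical reasoner", "tools": []},
    {"model": "backbone-a", "persona": "adversarial critic",  "tools": ["search"]},
]
```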
3. Right-size your MAS
Across tasks, homogeneous systems plateau around $n \approx 4$. Heterogeneous ones remain efficient up to $n \approx 8$. Beyond that, returns collapse.
4. Measure before you scale
$K^*$ offers a practical diagnostic: if it’s not increasing, your system isn’t learning—no matter how many tokens you burn.
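One way to turn that diagnostic into a stopping rule, reusing `k_star` and numpy from above (the tolerance is an assumption, not a value from the paper):

```python
def should_stop(embeddings, new_embedding, tol=0.1):
    """Stop growing the pool once a new output opens no new effective channel."""
    gain = k_star(np.vstack([embeddings, new_embedding])) - k_star(embeddings)
    return gain < tol
```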
Conclusion — Scaling is an illusion without diversity
This paper quietly dismantles a popular myth: that agentic intelligence scales like infrastructure. It doesn’t.
Multi-agent systems are constrained by information, not enthusiasm. Once redundancy sets in, more agents simply argue louder.
The real lever is diversity—measured, structured, and aligned with correctness. Systems that understand this will be smaller, cheaper, and more reliable. The rest will keep paying for conversations that go nowhere.
Cognaptus: Automate the Present, Incubate the Future.