Consensus is comforting. That is the problem.
In a meeting, consensus often means people have compared evidence, challenged assumptions, and settled on a workable answer. In a multi-agent AI system, consensus can look similar from the outside: several agents interact, exchange outputs, and converge on one shared response. The dashboard shows agreement. The workflow moves on. Everyone enjoys the small luxury of not asking what just happened.
The paper behind this article asks the impolite question: when an LLM population agrees, did it reason collectively, inherit a bias, or simply amplify a lucky early sample? Hidenori Tanaka’s “When Is Collective Intelligence a Lottery? Multi-Agent Scaling Laws for Memetic Drift in LLMs” gives a precise answer for a controlled setting: under neutral conditions, consensus can arise from sampling noise alone.[1]
That matters because enterprise AI is drifting—quietly, efficiently, with impressive slide decks—toward multi-agent architectures. We use one agent to retrieve, one to critique, one to summarize, one to negotiate, one to approve, and one to write the cheerful final answer. The implied promise is that a group of models should be more reliable than a single model. Sometimes it may be. But this paper shows why agreement itself is not enough. A population can converge because it has discovered a better answer. It can also converge because the first random token became everybody else’s evidence.
The useful phrase in the paper is memetic drift. It is a deliberately biological analogy: just as neutral genetic drift can fix a trait without that trait being fitter, neutral memetic drift can fix a convention without that convention being better. In plainer language: sometimes the winning answer is not correct, wise, or even preferred. It merely got sampled early and repeated often.
The business misconception: agreement is not independent validation
The reader’s natural intuition is simple: if multiple AI agents independently arrive at the same answer, the answer is more trustworthy. This intuition borrows from human committees, prediction markets, ensemble models, and basic common sense. Unfortunately, multi-agent LLM systems often violate the independence assumption that makes that intuition work.
In many agent workflows, agents are not independent judges. They read each other’s outputs. They update from previous interaction history. They reuse labels, phrases, framings, and intermediate conclusions. The system becomes its own data source.
Tanaka calls this mutual in-context learning. Standard in-context learning places a model in front of a prompt and lets it infer from examples supplied by an external context. Mutual in-context learning is stranger. The examples are generated by other agents that are themselves changing because of the same interaction loop. The population is no longer learning from a stable outside signal. It is learning from its own sampled emissions.
That feedback loop is the mechanism. It is also the reason a neat consensus screen can become a very polished coin toss.
A small early fluctuation works like this:
| Step | What happens inside the population | Why it matters |
|---|---|---|
| 1. Neutral start | No label or answer has an intrinsic advantage. | There is no ground-truth signal in the task. |
| 2. Sampled output | One agent emits a discrete label from its internal probability distribution. | A random draw becomes visible evidence. |
| 3. Listener adaptation | Another agent updates toward that sampled message. | The random draw now changes another agent’s future outputs. |
| 4. Repeated reuse | Later agents see outputs shaped by earlier outputs. | Noise is recycled as social proof. |
| 5. Convention | One option dominates the population. | Consensus appears, even though no option was better. |
The paper’s argument is not that all multi-agent consensus is fake. That would be a wonderfully dramatic claim, and therefore almost certainly wrong. The argument is sharper: in neutral or weak-signal environments, agreement can be produced by the interaction dynamics themselves. If we do not model those dynamics, we may mistake coordination for intelligence.
QSG isolates the machinery behind the lottery
The paper introduces Quantized Simplex Gossip, or QSG, as a minimal model of multi-agent naming. The setup is intentionally stripped down. A population of agents must choose labels for a fixed referent. Each agent holds an internal probability distribution over possible labels. At each interaction, one agent speaks, another listens, and the listener updates.
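The listener update is simple, a convex combination of the listener’s current distribution and the incoming message:

$$x_L \leftarrow (1 - \alpha)\,x_L + \alpha\,y$$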
Here, $x_L$ is the listener’s internal distribution, $y$ is the message received from the speaker, and $\alpha$ is the adaptation rate. A small $\alpha$ means the listener moves slowly. A large $\alpha$ means the listener strongly absorbs the current interaction. At $\alpha = 1$, the listener is essentially overwritten by the received message.
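A minimal sketch of this update in Python, assuming the convex-combination form $x_L \leftarrow (1-\alpha)\,x_L + \alpha\,y$, which is the form consistent with the description that $\alpha = 1$ overwrites the listener. The function name `listener_update` and the three-label example values are illustrative choices, not taken from the paper.

```python
import numpy as np

def listener_update(x_listener, message, alpha):
    """Move the listener's distribution a fraction alpha of the way toward the message."""
    x_listener = np.asarray(x_listener, dtype=float)
    message = np.asarray(message, dtype=float)
    return (1.0 - alpha) * x_listener + alpha * message

# Illustrative values: a listener's belief over three labels and a one-hot message.
x_L = np.array([0.5, 0.3, 0.2])
y = np.array([0.0, 1.0, 0.0])   # the speaker said "label 1"

slow = listener_update(x_L, y, alpha=0.1)   # small step toward the message
fast = listener_update(x_L, y, alpha=1.0)   # listener is replaced by the message

print("alpha=0.1:", slow)
print("alpha=1.0:", fast)
```

Because both inputs are probability vectors and the weights sum to one, the result stays on the probability simplex for any $\alpha$ between 0 and 1.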
The crucial design choice is the word quantized. Internally, an agent may hold a smooth probability distribution over labels. Externally, it usually communicates a discrete output: one label, a short list, or a compressed summary. That mismatch between continuous belief and discrete communication creates sampling variance.
QSG compares three communication regimes:
| Regime | Message type | Purpose in the paper |
|---|---|---|
| Soft | Speaker transmits the full distribution. | Baseline: removes quantization noise and shows what happens without discrete sampling. |
| Hard | Speaker samples and transmits one label. | Main drift mechanism: a single sampled output can destabilize neutrality. |
| Top-$m$ | Speaker sends an empirical distribution from $m$ sampled labels. | Bandwidth test: more samples reduce sampling noise. |
This is the paper’s mechanism-first move. Soft exchange preserves the population mean in expectation and contracts disagreement. In the perfectly symmetric neutral case, it does not spontaneously create a winning convention. Hard exchange has the same conditional mean, but adds a positive variance term because the listener updates toward a sample, not the speaker’s full distribution. That extra variance is enough to make the symmetric state unstable.
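The mean-versus-variance contrast can be written out explicitly. The lines below are a sketch under two assumptions: the listener update is the convex combination $x_L' = (1-\alpha)\,x_L + \alpha\,y$, and a hard message is a one-hot vector $e_k$ drawn from the speaker’s distribution $x_S$. The covariance is the standard one for a categorical draw, not a formula quoted from the paper.

```latex
\begin{aligned}
\text{Soft:}\quad & y = x_S
  &&\Rightarrow\quad x_L' = (1-\alpha)\,x_L + \alpha\,x_S \quad\text{(deterministic)} \\
\text{Hard:}\quad & y = e_k,\ k \sim \mathrm{Categorical}(x_S)
  &&\Rightarrow\quad \mathbb{E}[x_L' \mid x_L, x_S] = (1-\alpha)\,x_L + \alpha\,x_S \\
& &&\phantom{\Rightarrow}\quad \operatorname{Cov}[x_L' \mid x_L, x_S]
   = \alpha^2\left(\operatorname{diag}(x_S) - x_S x_S^{\top}\right) \\
& &&\phantom{\Rightarrow}\quad \operatorname{tr}\operatorname{Cov}[x_L' \mid x_L, x_S]
   = \alpha^2\Bigl(1 - \textstyle\sum_k x_{S,k}^2\Bigr) \ge 0
\end{aligned}
```

The trace vanishes only when $x_S$ is already one-hot, so any speaker with residual uncertainty injects noise into the listener even though the expected update is identical to the soft case.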
The result is almost rude in its simplicity. The population does not need a hidden preference. It does not need a reward function. It does not need a malicious prompt. It only needs repeated interaction through a finite-bandwidth message channel.
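The lottery can be reproduced in a few lines. The sketch below is a toy simulation, not the paper’s code: it assumes well-mixed random speaker-listener pairs, a perfectly uniform start, hard one-hot messages, the convex-combination update, and a consensus threshold of 0.99 on the population mean. The function name `run_neutral_qsg` and all default values are illustrative assumptions.

```python
import numpy as np

def run_neutral_qsg(n_agents=8, n_labels=3, alpha=0.5, seed=0,
                    threshold=0.99, max_steps=200_000):
    """Hard-message gossip from a symmetric start; returns (winning label, steps)."""
    rng = np.random.default_rng(seed)
    # Neutral start: every agent is uniform over the labels.
    x = np.full((n_agents, n_labels), 1.0 / n_labels)
    for step in range(1, max_steps + 1):
        speaker, listener = rng.choice(n_agents, size=2, replace=False)
        # Quantized message: one label sampled from the speaker's distribution.
        label = rng.choice(n_labels, p=x[speaker])
        message = np.zeros(n_labels)
        message[label] = 1.0
        x[listener] = (1.0 - alpha) * x[listener] + alpha * message
        mean = x.mean(axis=0)
        if mean.max() >= threshold:
            return int(mean.argmax()), step
    return None, max_steps   # no consensus within the step budget

# Same population, same rules, different seeds.
results = [run_neutral_qsg(seed=s) for s in range(20)]
winners = [winner for winner, _ in results]
finished = [w for w in winners if w is not None]
print("wins per label:", np.bincount(finished, minlength=3))
```

No label is favoured anywhere in this code, so whichever label wins in a given run does so purely because of the random draws made under that seed.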
The scaling laws say when agreement becomes less lottery-like
The paper’s business value is not merely the warning that “AI agents can copy each other.” Anyone who has watched a committee rewrite one person’s bad idea in six fonts already knows that social systems can copy. The contribution is that QSG turns the problem into scaling laws.
The main variables are:
| Variable | Meaning | Effect on drift |
|---|---|---|
| $N$ | Population size | Larger populations dilute the impact of any one interaction. |
| $m$ | Communication bandwidth in Top-$m$ messaging | Higher bandwidth reduces sampling variance roughly as $1/m$. |
| $\alpha$ | Adaptation rate | Stronger adaptation speeds convergence but also strengthens drift relative to a fixed weak bias. |
| Internal uncertainty | How spread out the agent’s distribution is | Higher uncertainty increases sampling noise; peaked distributions drift less. |
Near symmetry, the paper predicts early drift that weakens with larger population size and higher message bandwidth. In total interaction steps, consensus time grows close to quadratically with $N$; measured in population rounds, it grows roughly linearly with $N$. The Top-$m$ result is especially operational: if a speaker communicates an empirical distribution from $m$ samples rather than one hard label, the symmetry-breaking drift term scales down as $1/m$.
This gives a design interpretation. Larger agent populations and richer messages do not magically make a system intelligent, but they reduce the influence of one lucky early sample. Conversely, a small team of highly adaptive agents communicating one discrete answer at a time is a lovely little drift amplifier. Efficient, confident, and possibly useless. A modern business classic.
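The bandwidth and uncertainty effects can be checked numerically at the level of a single message. This sketch assumes a Top-$m$ message is the empirical distribution of $m$ independent draws with replacement from the speaker’s distribution, in which case the expected squared distance between message and distribution is $(1 - \sum_k x_k^2)/m$. It measures message noise only, not the full population dynamics, and the function names are illustrative.

```python
import numpy as np

def message_variance(x_speaker, m, n_draws=200_000, seed=0):
    """Empirical mean squared distance between a Top-m message and the speaker's distribution."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x_speaker, dtype=float)
    # Each row is one message: label counts from m draws, normalised to a distribution.
    messages = rng.multinomial(m, x, size=n_draws) / m
    return float(((messages - x) ** 2).sum(axis=1).mean())

def predicted_variance(x_speaker, m):
    """Multinomial prediction: (1 - sum of squared probabilities) / m."""
    x = np.asarray(x_speaker, dtype=float)
    return float((1.0 - np.sum(x ** 2)) / m)

uncertain = np.full(4, 0.25)                    # maximally spread speaker
peaked = np.array([0.97, 0.01, 0.01, 0.01])     # confident speaker

for m in (1, 2, 4, 8):
    print(m,
          round(message_variance(uncertain, m), 4),
          round(predicted_variance(uncertain, m), 4))
print("peaked, m=1:", round(message_variance(peaked, 1), 4))
```

Doubling $m$ halves the noise, and a peaked speaker injects far less noise than an uncertain one at the same bandwidth, which matches the direction of the effects listed in the table of variables.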
The adaptation rate deserves special care. Stronger adaptation makes consensus faster. That sounds good if the metric is “time to answer.” But the paper’s drift-selection analysis shows a subtler trade-off: for a fixed weak asymmetry, stronger adaptation can strengthen drift relative to that bias, making the same bias less decisive. In practice, fast convergence is not automatically a virtue. It may mean the system has become very good at committing to the first plausible convention.
Drift and selection are different regimes, not moral labels
The paper also studies what happens when neutrality is slightly broken. In real systems, perfect neutrality is rare. Prompt order, label wording, model priors, memory format, temperature, retrieval artifacts, and UI defaults all create small asymmetries. The question is whether those asymmetries dominate the outcome or get washed out by stochastic drift.
QSG frames this as a drift-selection crossover. In the drift-dominated regime, the winner is close to a lottery. In the selection-dominated regime, a weak bias is reliably amplified and shapes the final convention.
This distinction is easy to misunderstand. “Selection” does not mean truth. It means systematic advantage. A biased label can be selected. A misleading framing can be selected. A retrieval artifact can be selected. Selection is only better than drift if the asymmetry corresponds to useful evidence.
That is why the paper is more useful than a simple anti-consensus essay. It does not say, “Consensus is bad.” It says, “Consensus has regimes.”
| Regime | What drives the winner | What a business user should ask |
|---|---|---|
| Drift-dominated | Random early samples recycled through interaction | Would a different seed, ordering, or first speaker change the answer? |
| Selection-dominated | A systematic asymmetry amplified by the population | Is the selected asymmetry evidence, bias, or implementation residue? |
| Mixed regime | Both stochastic drift and weak bias matter | How stable is the outcome across controlled perturbations? |
This is the practical pivot. The question is not only whether the system agrees. The question is what kind of process produced the agreement.
The experiments test the mechanism, not a universal failure rate
The paper’s empirical work uses a Neutral Naming Drift protocol. Agents repeatedly name a fixed referent using synthetic labels. There is no external reward and no ground truth. Speaker-listener interactions are ordered. Probes are used only for measurement and are not incorporated into memory, which matters because otherwise measurement itself would become another intervention.
The paper reports LLM experiments with GPT-4o and Claude Haiku 4.5, alongside QSG simulations. The core empirical result is scaling-level agreement: the observed LLM population dynamics follow the qualitative and quantitative trends predicted by QSG. Polarization trajectories are captured by a shared effective-adaptation mean-field fit across population sizes. Early drift weakens as the population grows, close to the predicted law. Time to consensus grows close to the expected scaling in total interactions. A separate Top-$m$ sweep with GPT-4o follows the predicted bandwidth trend.
That is the main evidence. The appendix and variant tests help clarify what kind of evidence it is.
| Component | Likely purpose | What it supports | What it does not prove |
|---|---|---|---|
| QSG theory | Mechanistic model | Quantized communication injects sampling variance and creates drift. | That every enterprise agent workflow follows QSG exactly. |
| Soft vs Hard simulation | Main mechanism check | Full-distribution exchange behaves differently from sampled-message exchange. | That real LLMs literally transmit hidden probability vectors in any practical mode. |
| One-step drift identity test | Direct validation of the variance term | Measured excess drift matches the predicted variance-injection term. | That long-run behavior in all network structures is solved. |
| $N$-sweep in LLMs | Scaling test | Population-size trends align with QSG predictions in two LLM families. | A precise operational risk number for any deployed system. |
| Top-$m$ sweep | Bandwidth sensitivity test | More communicated samples reduce early drift as predicted. | That longer messages always improve real task quality. |
| Temperature / weak-asymmetry analysis | Drift-selection crossover | Bias and noise compete in a predictable finite-size regime. | That all business bias comes from one scalar temperature-like parameter. |
This table matters because the wrong reading is tempting. The paper is not claiming that every multi-agent system in finance, law, healthcare, research, or policy is secretly a naming game. It is not estimating the probability that your procurement agent will hallucinate a supplier contract because three subagents liked the same sentence. The paper gives a null model and a diagnostic language. That is already enough.
Good null models are not realistic in every detail. They are useful because they tell us what can happen even before the messy details arrive.
What this changes for multi-agent system design
The most immediate implication is that agreement should be treated as an object of diagnosis, not a quality metric by itself.
A business system that uses several agents to produce one answer should report more than the final consensus. It should report how sensitive that consensus is to stochastic and structural perturbations. The useful diagnostics are not exotic. They are boring in the best possible way.
| Design question | Practical diagnostic | Reason from the paper |
|---|---|---|
| Is the answer seed-sensitive? | Run the same workflow across random seeds and initial agent orderings. | Drift-dominated outcomes should show run-to-run variability. |
| Is one early output anchoring the group? | Track which messages enter memory and how often they reappear. | Mutual in-context learning can recycle early samples as evidence. |
| Is bandwidth too low? | Compare one-label exchange with richer ranked lists or evidence packets. | Top-$m$ communication reduces sampling variance. |
| Is adaptation too aggressive? | Test slower memory updates, delayed commitment, or independent re-evaluation. | High $\alpha$ accelerates convergence and can strengthen drift. |
| Is consensus actually independent? | Include probe-only evaluations that do not enter agent memory. | Measurement should not become another social signal. |
| Is a weak bias driving the outcome? | Randomize label order, prompt order, agent roles, and first-speaker assignment. | Selection can amplify small implementation asymmetries. |
This shifts governance away from “Did the agents agree?” toward “What would have made them disagree?”
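The first diagnostic in the table, seed sensitivity, needs very little machinery. The sketch below assumes the workflow can be wrapped as a function that takes a seed and returns one final answer. `consensus_stability` and both toy workflows are hypothetical stand-ins, not part of any real agent framework.

```python
import random
from collections import Counter

def consensus_stability(run_workflow, seeds):
    """Run a workflow once per seed and summarise how often the final answer repeats."""
    answers = [run_workflow(seed) for seed in seeds]
    counts = Counter(answers)
    modal_answer, modal_count = counts.most_common(1)[0]
    return {
        "modal_answer": modal_answer,
        "modal_share": modal_count / len(answers),
        "distinct_answers": len(counts),
    }

# Hypothetical stand-ins for a real multi-agent pipeline.
def anchored_workflow(seed):
    """The first speaker's random pick is copied by everyone else."""
    rng = random.Random(seed)
    return rng.choice(["low", "medium", "high"])

def evidence_workflow(seed):
    """Agent order is shuffled, but every agent reads the same decisive record."""
    rng = random.Random(seed)
    order = ["retriever", "critic", "summarizer"]
    rng.shuffle(order)          # ordering changes, the evidence does not
    return "medium"

drifty = consensus_stability(anchored_workflow, range(30))
stable = consensus_stability(evidence_workflow, range(30))
print("anchored:", drifty)
print("evidence:", stable)
```

Each individual run of either workflow ends in full agreement. Only the repeat-run summary separates them: a modal share well below 1.0 is the signature of a drift-heavy outcome.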
For Cognaptus-style automation, the relevant lesson is especially direct. Multi-agent systems are attractive because they let us modularize business processes: retrieval, review, validation, forecasting, drafting, escalation. But if these modules repeatedly consume each other’s outputs, then process design becomes statistical design. The architecture needs controlled independence, probe-only measurement, memory hygiene, and repeat-run variance checks.
A multi-agent workflow should therefore separate at least three channels:
- Evidence channel: source documents, structured data, retrieved facts, market data, or policy records.
- Interaction channel: agent messages, critiques, proposals, and revisions.
- Measurement channel: probes used to evaluate population state without feeding back into memory.
Mix all three together and you may still get consensus. You may even get a beautiful final report. But you will not know whether the report is evidence aggregation or a well-formatted echo chamber. Naturally, the echo chamber will have bullet points.
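One way to keep the three channels apart is to give each its own container and make probes read-only with respect to agent memory. The sketch below is an illustrative data structure under that assumption. The class names, the `probe` method, and the example file name are hypothetical and do not come from the paper or any specific framework.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

@dataclass(frozen=True)
class EvidenceItem:
    source: str
    content: str

@dataclass
class ChannelledContext:
    # Evidence channel: fixed external material, stored as an immutable tuple.
    evidence: Tuple[EvidenceItem, ...]
    # Interaction channel: messages that agents exchange and can later re-read.
    interaction: List[str] = field(default_factory=list)
    # Measurement channel: probe results, never shown back to agents.
    measurements: List[str] = field(default_factory=list)

    def post_message(self, agent_id: str, text: str) -> None:
        self.interaction.append(f"{agent_id}: {text}")

    def agent_view(self) -> dict:
        """What an agent may condition on: evidence and interaction only."""
        return {"evidence": self.evidence, "interaction": tuple(self.interaction)}

    def probe(self, agent_fn: Callable[[dict], str]) -> str:
        """Record an agent's current answer without writing to the interaction log."""
        answer = agent_fn(self.agent_view())
        self.measurements.append(answer)
        return answer

# Illustrative usage with a placeholder document and a trivial agent.
ctx = ChannelledContext(evidence=(EvidenceItem("policy.pdf", "Clause 4 applies."),))
ctx.post_message("critic", "I read this as category B.")
log_before = list(ctx.interaction)
probe_answer = ctx.probe(lambda view: "category B")
print(ctx.interaction == log_before, ctx.measurements)
```

The design choice that matters is that `probe` writes only to `measurements`, so evaluating the population cannot become one more message that later agents treat as evidence.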
The risk is not only random noise; it is false social proof
The paper’s most uncomfortable business implication is that multi-agent systems can manufacture social proof internally. An answer appears stronger because more agents repeat it, but those repetitions are not independent observations. They may be descendants of the same ancestor sample.
This is familiar in human organizations. A tentative suggestion becomes “the team’s view” because it appeared in the first memo, then in the meeting deck, then in the risk summary, then in the executive brief. By the time someone asks for evidence, the idea has accumulated institutional gravity. Nobody remembers that it began as a guess. LLM populations can compress that failure mode into minutes.
This is why the paper’s naming-game setup is more than a toy. The label itself does not matter. What matters is convention formation under repeated interaction. Many business tasks have naming-game-like subproblems: assigning categories, choosing risk labels, ranking policy options, framing customer segments, classifying incidents, selecting narratives for market movement, or deciding which explanation should become the official one.
In those tasks, the danger is not merely that agents are wrong. The danger is that the system converts an initially weak reason into a shared convention and then presents the convention as corroboration.
Boundaries: what the paper does not license us to claim
The study is deliberately controlled. Its neutral naming games use synthetic labels, well-mixed interactions, delayed-reveal protocols, fixed memory settings, and no external reward. These are strengths for mechanism identification but boundaries for deployment inference.
Three boundaries are especially important.
First, the paper does not show that all multi-agent LLM consensus is drift. Real systems often include external evidence, retrieval, tool calls, human feedback, scoring rules, and task-specific priors. Those can introduce selection pressures that are useful, harmful, or both.
Second, QSG assumes a simplified interaction structure. Enterprise workflows are rarely well-mixed populations. They are directed graphs with manager agents, evaluator agents, memory stores, role hierarchies, tool permissions, and sometimes one very confident agent named “Chief Strategy Officer,” because apparently software needed office politics too.
Third, the experiments validate scaling trends, not exact point forecasts for arbitrary products. The paper is strongest as a mechanistic baseline: if your system displays consensus, QSG tells you which knobs to test before you call that consensus intelligence.
That is not a weakness. It is the correct level of ambition. A minimal model should not pretend to be a deployment audit. It should make the audit smarter.
The practical test: can your consensus survive perturbation?
The paper gives us a simple replacement for the lazy consensus metric.
Do not ask only:
Did the agents agree?
Ask:
Would they still agree if we changed the seed, first speaker, label order, message bandwidth, memory length, adaptation strength, and probe protocol?
If the answer changes wildly, the system is probably in a drift-heavy regime. If the answer persists across perturbations, the next question is whether the stable force is genuine evidence or merely a systematic bias. Either way, the consensus itself is only the beginning of the analysis.
This is the real takeaway for business AI. Multi-agent design should not be sold as an automatic upgrade from single-model reasoning. It is a different statistical object. It can aggregate evidence. It can amplify bias. It can also run a lottery and call the winner “alignment.”
The paper’s value is that it makes that lottery visible. It gives us a language for the mechanism, a model for the scaling, and a checklist of system parameters that can be stress-tested. That is much more useful than another vague warning about “AI hallucinations.” Hallucination is an output problem. Memetic drift is a population-dynamics problem.
And population dynamics, unlike vibes, can be tested.
Cognaptus: Automate the Present, Incubate the Future.
[1] Hidenori Tanaka, “When Is Collective Intelligence a Lottery? Multi-Agent Scaling Laws for Memetic Drift in LLMs,” arXiv:2603.24676, 2026, https://arxiv.org/abs/2603.24676.