It Takes a Village (of Models): Why Multi-Agent Intelligence Won't Emerge by Accident

Agents are easy to multiply.

That is the attractive part. Give one model a browser. Give another a code editor. Add a planner, a critic, a memory layer, a few tools, a dashboard, and suddenly the product demo looks like a small digital office. Everyone has a job title. Everyone talks. Nobody asks whether the “team” actually knows how to be a team.

That is the uncomfortable question behind “Single-Agent Scaling Fails Multi-Agent Intelligence: Towards Foundation Models with Native Multi-Agent Intelligence,” a paper from Shanghai AI Lab that tests whether stronger foundation models naturally become better at multi-agent reasoning.¹ The paper’s answer is not subtle: single-agent capability improves sharply across model generations, but multi-agent understanding improves more slowly, and multi-agent planning remains stubbornly weak or noisy.

For businesses building agentic workflows, this is the part worth reading twice. The paper is not saying that multi-agent systems are useless. It is saying that buying a stronger model and placing it inside a multi-agent framework is not the same as building native multi-agent intelligence. Apparently, putting several clever interns in a Slack channel does not automatically create a competent organization. Shocking, I know.

The evidence starts with a mismatch, not a manifesto

The authors begin from a plausible industry assumption: if foundation models keep improving on math, coding, general knowledge, and reasoning benchmarks, then multi-agent competence may eventually arrive as a byproduct. Bigger models solve harder individual tasks. Perhaps coordination, negotiation, belief tracking, and collaborative planning are just downstream benefits waiting for enough scale.

The paper tests that assumption empirically.

It evaluates 41 instruction-tuned open-weight models from two model families, Qwen and LLaMA, spanning parameter scales from 0.5B to 235B and releases from 2023 to 2025. The benchmark set is divided into three groups:

Test category	Benchmarks used	What the category is meant to reveal
Single-agent tasks	MATH-500, MMLU-Pro, HumanEval, GPQA	Whether the model is improving as an individual problem-solver
Multi-agent understanding	ToMBench, EmoBench	Whether the model can reason about beliefs, intentions, emotions, and social states
Multi-agent planning	CoordinationQA	Whether the model can handle coordination and joint planning problems

This design matters because it does not compare “agents” in a vague product sense. It asks a cleaner question: when the same model families improve across generations, do multi-agent abilities rise at a comparable rate?

The answer is no.

For roughly 8B Qwen models, the paper reports that average single-agent task accuracy rises from 0.23 in Qwen-1 to 0.64 in Qwen-3. Multi-agent understanding rises only from 0.44 to 0.55. Multi-agent planning does not show the same kind of progress. For roughly 72B Qwen models, single-agent accuracy rises from 0.42 in Qwen-1 to 0.71 in Qwen-2.5, while multi-agent understanding rises from 0.57 to 0.67. Again, the planning gains are much less convincing.
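The size of that gap is easy to check by hand. The following is a minimal Python sketch that uses only the four score pairs quoted above; the `gains` helper and the dictionary layout are illustrative choices of this article, not anything the paper provides.

```python
# Scores quoted in the text above for roughly 8B and roughly 72B Qwen models.
REPORTED = {
    "Qwen ~8B (Qwen-1 -> Qwen-3)": {
        "single-agent": (0.23, 0.64),
        "multi-agent understanding": (0.44, 0.55),
    },
    "Qwen ~72B (Qwen-1 -> Qwen-2.5)": {
        "single-agent": (0.42, 0.71),
        "multi-agent understanding": (0.57, 0.67),
    },
}


def gains(before: float, after: float) -> tuple[float, float]:
    """Return (absolute gain, relative gain), each rounded to two decimals."""
    absolute = after - before
    return round(absolute, 2), round(absolute / before, 2)


for group, categories in REPORTED.items():
    print(group)
    for category, (before, after) in categories.items():
        absolute, relative = gains(before, after)
        print(f"  {category}: +{absolute:.2f} absolute, +{relative:.0%} relative")
```

At the 8B scale the absolute gain is 0.41 for single-agent tasks against 0.11 for multi-agent understanding; at the 72B scale it is 0.29 against 0.10. The comparison is deliberately simple: it ignores benchmark difficulty and ceiling effects, so treat it as a sanity check on the headline claim rather than a statistical test.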

The LLaMA family shows the same qualitative pattern. Newer models become much stronger at single-agent work. Multi-agent understanding improves somewhat. Multi-agent planning remains flat, irregular, or only modestly better.

That is the paper’s central empirical contribution. Not a philosophical complaint. Not another “agents are hard” sermon. A measured gap.

Figure 2 is the main result: scaling helps, but not evenly

Figure 2 is the paper’s main evidence, not a decorative plot. Its purpose is to compare progress across model generations at similar parameter scales. That makes it stronger than a simple “large models beat small models” story.

The key observation is the slope mismatch. Single-agent task performance climbs steeply across generations. Multi-agent understanding climbs more gently. Multi-agent planning barely moves in comparison.

This is important because the obvious industry shortcut is to treat multi-agent problems as advanced single-agent problems. Under that view, the recipe is simple: wait for the next model, increase context length, add better tool calling, and let the framework handle the rest.

The paper pushes against that shortcut. Multi-agent planning is not just “planning, but with more tokens.” It requires the model to reason about other actors whose actions can change the environment. Those actors may have different goals, incomplete information, private beliefs, incentives to cooperate, incentives to defect, or simply inconsistent behavior. In a single-agent benchmark, the world usually waits politely while the model thinks. In a multi-agent setting, the world answers back.

A useful way to read Figure 2 is not “models fail at multi-agent tasks.” That would be too crude. The better reading is:

Paper result	Interpretation	Business meaning
Single-agent scores improve sharply across generations	General reasoning, coding, math, and knowledge capabilities benefit strongly from model progress	Upgrading the base model can improve individual agent competence
Multi-agent understanding improves more slowly	Some social and belief-state reasoning transfers from general capability, but not at the same rate	Stronger models may read situations better, but this does not guarantee coordination
Multi-agent planning remains weak or noisy	Coordination appears to require capabilities not reliably learned through ordinary single-agent scaling	Multi-agent workflows need explicit design, tests, protocols, and recovery logic

That last row is the expensive one. A single customer-service bot that answers poorly is a quality-control issue. A group of agents that miscoordinate across CRM, billing, compliance, and customer messaging is an operational risk wearing a productivity costume.

Figure 3 turns the story from lag into diminishing returns

Figure 3 has a different role. It is not merely repeating Figure 2. Its likely purpose is to test whether single-agent ability predicts multi-agent planning across individual models.

There is a positive relationship. The paper reports statistically significant logarithmic fits, with $R^2 = 0.608$ for Qwen and $R^2 = 0.576$ for LLaMA. Put differently, the fitted curve accounts for roughly 60 percent of the variance in planning accuracy within each family. So the correct conclusion is not “single-agent ability is irrelevant.” It helps.

But it does not help enough.

The logarithmic form matters. It suggests diminishing returns: as single-agent accuracy rises, additional gains translate into smaller improvements in multi-agent planning. The scatter also matters. Models with similar single-agent performance can differ meaningfully in multi-agent planning accuracy.
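To make the diminishing-returns point concrete, here is a small Python sketch of a logarithmic fit of the form y = a + b ln(x), done with ordinary least squares on ln(x). The data points are hypothetical, invented purely to show the shape of the argument; they are not the paper's data, and the paper's exact fitting procedure is not described here, so this should be read as one plausible way to reproduce that kind of analysis.

```python
import math


def fit_log_curve(xs, ys):
    """Least-squares fit of y = a + b * ln(x); returns (a, b, r_squared)."""
    ln_xs = [math.log(x) for x in xs]
    n = len(xs)
    mean_lx = sum(ln_xs) / n
    mean_y = sum(ys) / n
    sxx = sum((lx - mean_lx) ** 2 for lx in ln_xs)
    sxy = sum((lx - mean_lx) * (y - mean_y) for lx, y in zip(ln_xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_lx
    ss_res = sum((y - (a + b * lx)) ** 2 for lx, y in zip(ln_xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return a, b, 1.0 - ss_res / ss_tot


def marginal_gain(b, x):
    """Slope of a + b * ln(x) at x: the planning gain per unit of extra x."""
    return b / x


# HYPOTHETICAL points, invented for illustration only (not the paper's data).
single_agent = [0.2, 0.3, 0.4, 0.5, 0.6, 0.7]
planning = [0.18, 0.27, 0.30, 0.36, 0.35, 0.41]

a, b, r2 = fit_log_curve(single_agent, planning)
print(f"fit: planning = {a:.3f} + {b:.3f} * ln(single_agent), R^2 = {r2:.3f}")
for x in (0.3, 0.6):
    print(f"marginal gain at single-agent accuracy {x}: {marginal_gain(b, x):.3f}")
```

Because the slope of a + b ln(x) is b / x, the marginal gain at a single-agent accuracy of 0.6 is exactly half the gain at 0.3, whatever the fitted coefficients turn out to be. That is the arithmetic behind "it helps, but not enough."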

This is where the paper becomes more useful than a simple anti-scaling slogan. Scaling is not useless. It is incomplete.

For enterprise teams, this distinction is practical. A stronger model may reduce local errors: better parsing, better tool use, better code generation, better summarization. But multi-agent reliability depends on system-level behavior: whether one agent understands another’s state, whether messages are compressed without losing the wrong details, whether conflicts are detected, whether a plan is revised when another agent acts unexpectedly, and whether the group converges rather than spirals into a polite committee meeting from hell.

The model can be smarter and the organization can still be badly designed. Anyone who has worked in a real organization may find this result offensively familiar.

The appendix tables are an audit trail, not a second thesis

The appendix tables list the model-level scores behind the figures. Their likely purpose is implementation transparency and result verification, not a separate experimental claim. They let readers inspect whether the aggregated story hides important counterexamples.

The useful detail is that the tables show variation inside the broad pattern. Some large models do better on specific CoordinationQA subcomponents. Some newer smaller models beat older larger ones on certain single-agent tasks. Some emotional understanding scores rise more clearly than planning scores. These details do not overturn the thesis; they clarify it.

The paper is not claiming that every multi-agent benchmark is frozen forever. Nor is it claiming that no model improves at coordination. It is claiming that the improvement is not comparable to the dramatic progress observed on single-agent benchmarks.

That boundary matters because exaggerated readings would be easy here. A careless summary could say: “Scaling does not improve multi-agent intelligence.” The paper’s more precise claim is: scaling single-agent capability alone does not reliably produce robust multi-agent intelligence. The difference is small in wording and large in engineering consequence.

The four-capability blueprint explains why the gap exists

After presenting the empirical gap, the paper proposes a blueprint for native multi-agent intelligence in foundation models. The four capabilities are understanding, planning, efficient communication, and adaptation.

This blueprint is not the main evidence; it is the explanatory frame. It helps interpret why single-agent scaling may be structurally insufficient.

Capability	What it means in a multi-agent setting	Why single-agent training may not produce it reliably
Multi-agent understanding	Reasoning about others’ beliefs, intentions, emotions, shared knowledge, norms, and protocols	Single-agent tasks often do not require persistent modeling of other actors
Multi-agent planning	Planning while accounting for other agents’ actions, goals, uncertainty, negotiation, and coordination	Ordinary planning benchmarks usually assume a passive environment
Efficient communication	Passing information precisely, compactly, and with low ambiguity	General language fluency can be verbose, redundant, and poorly optimized for coordination bandwidth
Multi-agent adaptation	Updating beliefs and strategies when other agents behave unexpectedly	Static QA or single-turn tasks rarely train real-time adjustment across interacting agents

The third capability, efficient communication, deserves more attention than it usually receives. Modern reasoning models often produce long explanations. That is useful when a human wants transparency, but disastrous if agents must exchange high-frequency state updates. Natural language is flexible, but flexibility is not the same as efficiency. Multi-agent systems may need structured communication protocols, symbolic representations, learned message formats, or code-like compressed signals.

That does not mean businesses should immediately invent mysterious agent languages. Please do not let the agents develop a private dialect in your production compliance workflow and call it innovation. The operational point is narrower: communication between agents should be designed, logged, constrained, and evaluated. The default “send a long message to another model” pattern is a convenience, not a coordination architecture.
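What "designed, logged, constrained, and evaluated" might look like in practice can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the field names, the three allowed intents, the 280-character budget, and the rule that low-confidence messages must be escalated are hypothetical conventions, not something the paper prescribes.

```python
from dataclasses import dataclass, field
from enum import Enum


class Confidence(Enum):
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"


# Hypothetical protocol constants; tune or replace them for a real workflow.
ALLOWED_INTENTS = {"request", "report", "escalate"}
MAX_SUMMARY_CHARS = 280


@dataclass(frozen=True)
class AgentMessage:
    sender: str
    recipient: str
    intent: str
    summary: str
    confidence: Confidence  # explicit uncertainty marker
    facts: dict = field(default_factory=dict)  # structured payload, not prose


def validate(message: AgentMessage) -> list[str]:
    """Return protocol violations; an empty list means the message passes."""
    problems = []
    if message.intent not in ALLOWED_INTENTS:
        problems.append(f"unknown intent: {message.intent!r}")
    if not message.summary.strip():
        problems.append("summary is empty")
    if len(message.summary) > MAX_SUMMARY_CHARS:
        problems.append("summary exceeds the length budget")
    if message.confidence is Confidence.LOW and message.intent != "escalate":
        problems.append("low-confidence messages must be escalated")
    return problems


# Hypothetical messages between a planner and a billing agent.
ok = AgentMessage("planner", "billing", "request",
                  "Check invoice status for the open refund.", Confidence.HIGH)
shaky = AgentMessage("billing", "planner", "report",
                     "Invoice probably paid.", Confidence.LOW)
print(validate(ok))
print(validate(shaky))
```

The design choice worth noticing is that the validator returns every violation instead of raising on the first one, so rejected messages can be logged and counted. A protocol nobody measures is just a style guide.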

Dataset construction is the first bottleneck

The paper’s next move is from diagnosis to research agenda. The first bottleneck is data.

Current foundation-model datasets are not well balanced across the four capabilities. Theory-of-mind and emotional inference receive more attention than planning, communication, and adaptation. Interactive environments such as games and simulations can generate multi-agent data, but raw trajectories are often poorly suited for foundation-model training. They are long, redundant, noisy, and filled with sparse or delayed rewards.

That means the next useful datasets will probably not be simple dumps of game logs. They need structure. The authors suggest extracting key decision points, generating targeted interaction episodes, using controlled simulators, and collecting human-human or human-agent interactions.

For businesses, this maps directly to workflow data design. If a company wants agent teams to improve, it should not only store final outputs. It should capture the interaction trace:

Data to capture	Why it matters
Agent role and task state	Distinguishes planner failures from executor failures
Messages exchanged	Reveals whether agents communicate too much, too little, or ambiguously
Decision points	Identifies where coordination actually succeeds or breaks
Human interventions	Shows which failures require escalation
Outcome and recovery path	Separates harmless inefficiency from serious operational failure

This is where many enterprise “agent platforms” remain shallow. They log prompts and outputs because that is easy. But multi-agent learning requires traces of coordination, not just artifacts of completion.
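A trace of coordination can be as plain as one structured record per event. The sketch below is a hypothetical schema in Python, loosely mirroring the rows of the table above; the `TraceEvent` fields, the event type names, and the sample refund workflow are all invented for illustration.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class TraceEvent:
    step: int
    agent_role: str
    task_state: str
    event_type: str  # "message", "decision", "human_intervention", "outcome"
    detail: str
    repairs_step: Optional[int] = None  # earlier step this event recovers from


def to_jsonl(events):
    """Serialize events as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(asdict(event), sort_keys=True) for event in events)


def events_of_type(events, event_type):
    """Filter a trace down to one kind of event, e.g. decision points."""
    return [event for event in events if event.event_type == event_type]


# HYPOTHETICAL trace of a refund workflow, invented for illustration.
trace = [
    TraceEvent(1, "planner", "refund_requested", "decision",
               "assign refund check to billing"),
    TraceEvent(2, "billing", "refund_requested", "message",
               "invoice not found, need order id"),
    TraceEvent(3, "human", "refund_blocked", "human_intervention",
               "supplied order id", repairs_step=2),
    TraceEvent(4, "billing", "refund_approved", "outcome", "refund issued"),
]
print(to_jsonl(trace))
```

The `repairs_step` link is the part that prompt-and-output logs usually lack: it records which failure an intervention fixed, which is what separates a recovery path from a pile of timestamps.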

Evaluation must measure interaction, not just answers

The evaluation section has a practical complaint: the AI field likes QA-style benchmarks because they are cheap, fast, and easy to score. Multi-agent behavior is harder. It often requires multi-step execution, repeated trials, token-heavy interactions, and success metrics that depend on trajectories rather than one-shot answers.

This is not merely an academic inconvenience. It affects procurement and product evaluation.

If a vendor demonstrates a multi-agent system with a few polished examples, the buyer still does not know whether the system can coordinate under variation. Does it recover when one tool fails? Does it avoid duplicate work? Does it detect conflicting assumptions between agents? Does it escalate uncertainty? Does it reduce communication when the task is simple and increase it when ambiguity matters?

Those questions do not fit neatly into a single accuracy score. But ignoring them does not make the risk vanish. It only makes the demo cheaper.

A better business evaluation should combine two layers:

Evaluation layer	Best use	Limitation
QA-style diagnostic tests	Cheap screening for theory-of-mind, coordination reasoning, and role comprehension	Cannot fully capture long-horizon interaction
Interactive workflow simulations	Tests communication, adaptation, recovery, and group-level behavior	More expensive and harder to standardize

The paper does not solve this evaluation problem. It correctly identifies why the usual benchmark habit is insufficient.
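For teams that want a starting point anyway, trajectory-level questions such as "does it avoid duplicate work?" and "does it recover when a tool fails?" can be turned into numbers over repeated trials. The Python sketch below assumes a hypothetical `Trial` record and four hypothetical metrics; none of this comes from the paper, and the sample trials are invented.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    succeeded: bool
    subtasks_done: list  # subtask ids in completion order; repeats = duplicate work
    messages_sent: int
    failures_injected: int  # e.g. tool outages deliberately triggered in the test
    failures_recovered: int


def coordination_report(trials):
    """Aggregate group-level metrics over repeated runs of the same workflow."""
    if not trials:
        raise ValueError("need at least one trial")
    n = len(trials)
    total_work = sum(len(t.subtasks_done) for t in trials)
    duplicates = sum(len(t.subtasks_done) - len(set(t.subtasks_done)) for t in trials)
    injected = sum(t.failures_injected for t in trials)
    recovered = sum(t.failures_recovered for t in trials)
    return {
        "success_rate": sum(t.succeeded for t in trials) / n,
        "duplicate_work_rate": duplicates / total_work if total_work else 0.0,
        "recovery_rate": recovered / injected if injected else None,
        "avg_messages": sum(t.messages_sent for t in trials) / n,
    }


# HYPOTHETICAL trials, invented for illustration.
trials = [
    Trial(True, ["a", "b", "c"], 6, 1, 1),
    Trial(False, ["a", "a", "b"], 14, 1, 0),
    Trial(True, ["a", "b", "c", "c"], 9, 0, 0),
    Trial(True, ["a", "b", "c"], 7, 2, 1),
]
print(coordination_report(trials))
```

A single accuracy score would report the 75 percent success rate here and stop. The other three numbers are where the coordination story lives: the duplicated subtasks, the failures that were never recovered, and the one trial that burned twice as many messages as the rest.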

Training one model is not the same as training a society

The paper then distinguishes single-model training from population-based training.

For a single foundation model, the challenge is that “good” behavior in multi-agent settings is partner-dependent. A strategy that works with one collaborator may fail with another. A cooperative behavior may be exploited by an adversarial actor. A concise message may be efficient for a familiar partner and dangerously ambiguous for a new one.

This clashes with the standard supervised fine-tuning or RLHF framing, where training often assumes a stable preferred answer or trajectory. Multi-agent competence needs feedback that changes with partner behavior, role, preference, belief state, and history.

Population-based training addresses a different issue: diversity. Real deployments often involve heterogeneous roles—planners, executors, verifiers, critics, user-facing agents, compliance agents. A population can learn role specialization and interface conventions. It can also prevent overfitting to one partner type. In multi-agent settings, generalization is not just about new tasks; it is about new counterparts.

The business analogy is simple. You do not evaluate a negotiator by letting them negotiate against one person forever. You vary the counterpart. You change incentives. You test under incomplete information. You observe whether the strategy generalizes.

Agent systems deserve the same treatment.
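A toy version of "vary the counterpart" fits in a few lines of Python. The game below is invented for this article (two agents score a point whenever they pick the same option, "A" or "B") and has nothing to do with the paper's benchmarks; the strategies and partner names are likewise hypothetical. The point is only to show what cross-play against a partner population looks like.

```python
def always(choice):
    """Build a policy that ignores history and always plays `choice`."""
    def policy(own_history, other_history):
        return choice
    return policy


def alternate(own_history, other_history):
    """Play A, B, A, B, ... regardless of the other agent."""
    return "A" if len(own_history) % 2 == 0 else "B"


def copy_partner(own_history, other_history):
    """Open with A, then repeat whatever the other agent played last."""
    return other_history[-1] if other_history else "A"


def play(candidate, partner, rounds=10):
    """Fraction of rounds in which both agents picked the same option."""
    cand_hist, part_hist, matches = [], [], 0
    for _ in range(rounds):
        c = candidate(cand_hist, part_hist)
        p = partner(part_hist, cand_hist)
        matches += c == p
        cand_hist.append(c)
        part_hist.append(p)
    return matches / rounds


# Hypothetical partner population.
PARTNERS = {"always_A": always("A"), "always_B": always("B"), "alternating": alternate}


def cross_play(candidate, rounds=10):
    """Score one candidate strategy against every partner in the population."""
    return {name: play(candidate, partner, rounds) for name, partner in PARTNERS.items()}


for name, strategy in {"always_A": always("A"), "copy_partner": copy_partner}.items():
    scores = cross_play(strategy)
    print(name, scores, "worst case:", min(scores.values()))
```

Against a partner that always plays A, both strategies look perfect. Widen the population and the adaptive `copy_partner` strategy handles the stubborn partners well (1.0 and 0.9) but collapses to 0.1 against the alternating one, because it is always one step behind. An evaluation with a single partner would never have shown that.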

What the paper directly shows, and what Cognaptus infers

The paper directly shows that, across the tested Qwen and LLaMA instruction-tuned models and the selected seven benchmarks, single-agent scaling produces much stronger gains on standard individual capability tests than on multi-agent understanding or planning. It also shows that single-agent accuracy is positively related to multi-agent planning accuracy, but with substantial variance and diminishing returns.

Cognaptus infers a practical design rule from this: enterprise agent teams should be treated as coordination systems, not as collections of individually capable models.

That implies five operational requirements:

Requirement	Practical meaning
Role design	Define what each agent is responsible for, what it may decide, and when it must defer
Communication protocol	Specify message formats, required fields, compression rules, and uncertainty markers
Coordination tests	Evaluate not only task success, but duplication, conflict resolution, latency, and recovery
Interaction logging	Store decision points and message traces, not only final outputs
Safety governance	Test for collusion-like behavior, misinformation amplification, hidden dependency loops, and uncontrolled escalation

This is not glamorous. It is mostly architecture, logging, testing, and governance. Unfortunately, that is where real automation tends to live after the keynote ends.
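The first requirement, role design, is also the easiest to write down. The Python sketch below is a hypothetical illustration of "what it may decide, and when it must defer": the role names, decision labels, and deferral chain are invented, and a real system would load them from reviewed configuration rather than hard-code them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Role:
    name: str
    may_decide: frozenset  # decisions this role is allowed to make alone
    defers_to: str  # who receives anything outside that set


# HYPOTHETICAL role table, invented for illustration.
ROLES = {
    "planner": Role("planner", frozenset({"assign_task", "reorder_plan"}),
                    "human_supervisor"),
    "billing_agent": Role("billing_agent", frozenset({"issue_invoice"}), "planner"),
}


def route_decision(role_name: str, decision: str) -> str:
    """Return who should make the decision: the role itself or its escalation target."""
    role = ROLES[role_name]
    return role_name if decision in role.may_decide else role.defers_to


print(route_decision("billing_agent", "issue_invoice"))
print(route_decision("billing_agent", "refund_customer"))
print(route_decision("planner", "refund_customer"))
```

The check is a default-deny rule: anything not explicitly granted goes up the chain. That is dull by design, since an agent that improvises its own authority is exactly the failure mode the table above is trying to prevent.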

The boundary: useful evidence, not universal proof

The paper has clear boundaries.

First, the experiments focus on open-weight Qwen and LLaMA models. The results should not be treated as a direct measurement of every closed commercial model. A closed model trained with richer interaction data or specialized agentic post-training could behave differently.

Second, the multi-agent benchmark set is necessarily limited. ToMBench and EmoBench test aspects of understanding. CoordinationQA tests coordination through QA-style tasks derived from coordination games. These are useful diagnostics, but they do not cover the full mess of real multi-agent deployment: tool failures, mixed incentives, partial observability, changing user requirements, asynchronous communication, and organizational constraints.

Third, the paper’s proposed research directions are agenda-setting rather than experimentally validated solutions. Dataset construction, evaluation redesign, single-model training, population-based training, and safety governance are plausible paths. The paper does not prove which recipe works best.

These limitations do not weaken the business lesson. They define its proper use. The paper should not be read as “all current agent systems are doomed.” It should be read as “do not assume stronger base models will automatically solve coordination.”

That is enough to change how serious teams build.

The real upgrade path is organizational, not just computational

The strongest part of this paper is its refusal to let “agentic AI” hide behind model scaling. The evidence shows a gap: individual problem-solving improves faster than multi-agent competence. The blueprint explains why: multi-agent intelligence requires understanding other actors, planning with them or against them, communicating efficiently, and adapting when interaction changes the situation.

For businesses, the implication is sharp. The next advantage will not come only from using bigger models. It will come from designing better collectives of models.

That means fewer vague “agent swarms” and more disciplined coordination systems. Fewer demos where agents chat charmingly with each other, and more tests of what happens when they disagree, duplicate work, misunderstand roles, or face incentives that are not perfectly aligned. Less faith in emergence. More engineering.

A village of models can be powerful. But a village still needs roads, rules, signals, memory, and a way to stop the town council from arguing forever.

Cognaptus: Automate the Present, Incubate the Future.

¹ Shuyue Hu, Haoyang Yan, Yiqun Zhang, Yang Chen, Dongzhan Zhou, and Lei Bai, “Single-Agent Scaling Fails Multi-Agent Intelligence: Towards Foundation Models with Native Multi-Agent Intelligence,” arXiv:2512.08743, 2025. https://arxiv.org/pdf/2512.08743
