Chargers are boring until everyone arrives at the same time.

That is the useful way to enter this paper. Not through grand claims about artificial general intelligence, swarm intelligence, or the coming society of agents. Start with something embarrassingly practical: seven autonomous electric vehicles, two charging slots, and no reliable cloud coordinator telling everyone what to do.

Each vehicle has an AI agent. Each agent makes a local decision: try to access the shared resource, or hold back. The resource has a capacity limit. The agents do not know exactly how many others will act at the same moment. If too many rush in, the system overloads.

Now add “better” AI.

Different model families. Local adaptation. Reinforcement-style adjustment. Social sensing. Group formation. A little synthetic Lord of the Flies, because apparently even edge AI needed literature class.

Neil F. Johnson’s paper, “Increasing intelligence in AI agents can worsen collective outcomes,” studies exactly this kind of setting: small populations of LLM-based edge agents competing for finite shared resources such as charging slots, relay bandwidth, wireless channels, or traffic priority [1]. Its central result is uncomfortable but clean: under scarcity, more sophisticated AI-agent populations can produce worse collective overload. Under abundance, the same sophistication can help. The sign flips around a simple ratio:

$$ \frac{C}{N} $$

where $C$ is resource capacity and $N$ is the number of competing agents.

So the managerial version is not “make every agent smarter.” It is: count the bottleneck first.

The easy mistake: smarter agents are not automatically smarter systems

The tempting belief is obvious. If each AI agent becomes more adaptive, more diverse, more predictive, and more socially aware, then the collective should coordinate better.

That belief is useful in slide decks. It is less reliable in shared-resource systems.

A single agent can become better at exploiting local information while also becoming more correlated with other agents. Correlation is the part that hurts. In a scarce system, the problem is not just whether an agent makes a good prediction. The problem is whether many agents make compatible decisions at the same time.

A hospital ward does not care that each smart monitor has an elegant local policy if too many monitors transmit critical data into the same overloaded wireless channel. A traffic intersection does not applaud each autonomous vehicle for being individually rational if their timing creates a demand spike. A drone swarm does not become safer just because each drone has an upgraded local model if the shared relay channel becomes the real battlefield.

The paper’s contribution is not merely “AI agents may fail.” That would be a very expensive way to rediscover operations management.

The sharper contribution is that the paper toggles four ingredients separately:

| Ingredient | Paper’s framing | Operational translation |
| --- | --- | --- |
| Nature | Innate LLM diversity | Different model families or firmware behaviors |
| Nurture | Reinforcement-style adaptation | Agents adjusting behavior from experience |
| Culture | Emergent tribe formation | Social sensing, clustering, faction-like coordination |
| Resources | Scarcity or abundance | The capacity available per active agent |

In human or biological systems, these variables are tangled together. You cannot easily rerun a human crowd with “culture disabled” and “learning held constant,” although some conference organizers seem determined to try. AI-agent populations allow a cleaner experiment: switch one ingredient on, keep others fixed, and observe the collective dynamics.

That is why the paper deserves a mechanism-first reading. The result is not just a curve crossing on a figure. It is a story about how sophistication changes demand correlation, how correlation interacts with capacity, and how a system can get worse while some individual agents do very well.

The experiment is small because the deployment setting is small

The study uses $N=7$ AI agents built from small LLMs: GPT-2, GPT-2 Medium, Pythia-160M, Pythia-410M, OPT-125M, and OPT-350M, with GPT-2 appearing twice. The models are not updated by gradient training. They are used as local decision engines.

That scale may look modest if the reader is thinking about frontier cloud models. But the paper is not trying to model a data-center debate club. It is modeling edge AI: roughly a handful to a dozen devices operating locally, often with limited connectivity, limited battery, and no trusted omniscient coordinator.

At each timestep, each agent sees a short history of recent aggregate demand: how many agents tried to access the resource in previous rounds. The LLM performs next-token prediction over this sequence. The environment extracts the probabilities assigned to possible demand counts and uses the capacity threshold $C$ to produce the model’s estimated probability that demand will be at or below capacity.
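
The thresholding step can be sketched in a few lines. This is a minimal illustration, not the paper's code: it assumes the environment has already read one raw probability per candidate demand count ($0$ to $N$) off the model's next-token distribution, and it assumes those raw values are renormalized over the candidates before summing, a detail the description above does not pin down. The name `prob_within_capacity` and the sample numbers are hypothetical.

```python
from typing import Sequence


def prob_within_capacity(count_probs: Sequence[float], capacity: int) -> float:
    """Estimate P(demand <= capacity) from per-count probabilities.

    count_probs[k] is the raw probability mass the model put on the token
    for "k agents accessed", for k = 0..N. Renormalizing over these
    candidates is an assumption made for this sketch.
    """
    if not 0 <= capacity < len(count_probs):
        raise ValueError("capacity must index into count_probs")
    total = sum(count_probs)
    if total <= 0:
        raise ValueError("count_probs must contain positive mass")
    return sum(count_probs[: capacity + 1]) / total


# Hypothetical raw masses for N = 7 (counts 0..7); they need not sum to 1.
raw = [0.02, 0.05, 0.08, 0.10, 0.10, 0.08, 0.05, 0.02]
p_llm = prob_within_capacity(raw, capacity=2)
```

The returned value plays the role of $p_{\text{LLM}}$: the model's belief that demand will fit within $C$.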

That model-derived probability is then filtered through an agent disposition parameter $p_i$:

$$ p_{\text{eff}} = p_i p_{\text{LLM}} + (1-p_i)(1-p_{\text{LLM}}) $$

When $p_i$ is close to 1, the agent tends to follow the model’s prediction. When $p_i$ is close to 0, it tends to anti-follow. This scalar is the bridge between local prediction and action. It is also where adaptation and tribal dynamics later enter.

The action is binary: access the resource or hold back. If total demand $A$ is at or below capacity $C$, agents that accessed win and those that held back lose. If demand exceeds capacity, accessors lose and hold-back agents win. Both actions can be right or wrong depending entirely on what others do.

That symmetry matters. The game is not teaching agents that “waiting is safe” or “accessing is greedy.” It is forcing them to adapt to a moving collective environment.
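
One round of the game can be written down directly from the rules above. The sketch below is illustrative only. The blend in `effective_prob` is the formula for $p_{\text{eff}}$; the step from $p_{\text{eff}}$ to an action (access with probability $p_{\text{eff}}$) is an assumption of this sketch, and `play_round` is a hypothetical helper name.

```python
import random
from typing import List, Tuple


def effective_prob(p_i: float, p_llm: float) -> float:
    """Blend the model's prediction with the agent's disposition p_i."""
    return p_i * p_llm + (1.0 - p_i) * (1.0 - p_llm)


def play_round(
    dispositions: List[float],
    p_llm_values: List[float],
    capacity: int,
    rng: random.Random,
) -> Tuple[List[bool], List[bool], bool]:
    """Run one round; return (actions, wins, overloaded).

    Assumption for this sketch: an agent accesses with probability p_eff,
    i.e. it acts when it (after filtering) expects demand to fit capacity.
    """
    actions = [
        rng.random() < effective_prob(p, q)
        for p, q in zip(dispositions, p_llm_values)
    ]
    demand = sum(actions)
    overloaded = demand > capacity
    # Accessors win when demand fits; hold-back agents win on overload.
    wins = [acted != overloaded for acted in actions]
    return actions, wins, overloaded
```

The single comparison `acted != overloaded` carries the symmetry: the same action flips from winning to losing depending only on what everyone else did.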

The technology ladder adds correlation one rung at a time

The paper compares five levels of increasing population sophistication.

| Level | Configuration | What changes | Why it matters |
| --- | --- | --- | --- |
| L1 | Identical agents, no learning | Independent coin-flip baseline | Demand is relatively uncorrelated |
| L2 | Identical agents + reinforcement learning | Agents adapt, but without model diversity | Herding can become severe |
| L3 | Diverse LLMs, no learning | Different model families enter | Diversity changes demand patterns |
| L4 | Diverse LLMs + reinforcement learning | Agents adapt individually | Sophisticated but non-social dynamics |
| L5 | Diverse LLMs + learning + sensing | Agents can form tribes | Correlated blocs emerge |

The ladder is important because it avoids a common analytical mush: calling a system “agentic” and then treating all agentic features as one blob. Here, diversity, learning, and social sensing are separated.
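
One way to keep the ingredients from blurring together in your own experiments is to encode each rung as explicit flags. The sketch below is only a bookkeeping device: the class name, field names, and helper are hypothetical, and the flag values simply restate the five levels described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LevelConfig:
    """Which ingredients are switched on at one rung of the ladder."""
    name: str
    diverse_models: bool   # "nature": different LLM families
    learning: bool         # "nurture": reinforcement-style adaptation
    social_sensing: bool   # "culture": tribe formation


LADDER = [
    LevelConfig("L1", diverse_models=False, learning=False, social_sensing=False),
    LevelConfig("L2", diverse_models=False, learning=True, social_sensing=False),
    LevelConfig("L3", diverse_models=True, learning=False, social_sensing=False),
    LevelConfig("L4", diverse_models=True, learning=True, social_sensing=False),
    LevelConfig("L5", diverse_models=True, learning=True, social_sensing=True),
]


def added_ingredients(lower: LevelConfig, upper: LevelConfig) -> list:
    """List the flags that are on in `upper` but off in `lower`."""
    flags = ("diverse_models", "learning", "social_sensing")
    return [f for f in flags if getattr(upper, f) and not getattr(lower, f)]
```

For example, `added_ingredients(LADDER[3], LADDER[4])` isolates the one thing that separates L5 from L4: social sensing.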

That separation is where the paper becomes useful for business readers. In deployment terms, L2 is what happens when an organization adds adaptation to a simple homogeneous fleet. L3 is what happens when devices from different model families or vendors enter the same environment. L4 is what happens when diverse agents also learn locally. L5 is what happens when agents have some mechanism for sensing similarity, alignment, or group membership.

The question is not which level is “most advanced.” The question is which level overloads the shared resource.

That answer changes with $C/N$.

Scarcity turns learning into synchronized demand

Figure 1 is the paper’s main evidence. It measures system overload across the five technology levels as the capacity-to-population ratio changes.

The pattern is not subtle.

When resources are scarce, especially below roughly $C/N \approx 0.5$, the simplest population performs best. The more sophisticated configurations, especially the adaptive ones, tend to overload more often. L2, the identical-agent reinforcement learning case, can be particularly bad because the agents lack model diversity and can herd into large correlated blocs.

This is the key mechanism: scarcity makes demand variance dangerous.

If agents were independent, their individual decisions would partially cancel out. Some try to access; some hold back. But reinforcement and shared signals can make many agents move together. Once a large correlated group acts in the same direction, the system no longer sees many small independent decisions. It sees a spike.

In resource systems, spikes are what kill you.

A shared channel can survive noisy individual behavior. It struggles with synchronized bursts. A charging station can absorb uneven arrivals if they are dispersed. It cannot absorb a local fleet that learned, in parallel, that “now” is the moment to act.

This is why “better local intelligence” can worsen collective behavior. The agents are not stupid. They are responding to signals. The problem is that their responses become correlated in a capacity-constrained environment.

The system does not fail because the agents know too little. It fails because they know enough to rush together.
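
A standard identity shows how much correlation alone can widen the demand distribution. As a simplifying assumption that is not taken from the paper, suppose each of the $N$ agents accesses with the same probability $q$ and every pair of decisions has the same correlation $\rho$. Then total demand $A$ has mean $Nq$ and variance

$$ \mathrm{Var}(A) = N q (1-q) \left[ 1 + (N-1)\rho \right] $$

With $\rho = 0$ this is the independent baseline $N q(1-q)$. With $\rho = 1$ it becomes $N^2 q(1-q)$. For $N=7$, the same average demand can therefore carry up to seven times the variance, and it is the variance that produces the spike.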

Tribes help under scarcity by capping blocs

The most interesting part of the paper is L5, the “Lord of the Flies” configuration. Agents can form tribes through a loyalty-defection mechanism based on similar dispositions and shared outcomes. The word “tribe” is used in the paper in the Golding sense: self-organized factions among AI agents, not a reference to any human social group.

L5 adds culture-like structure to diverse adaptive agents. One might expect this to worsen overload by creating factions. Under scarcity, it actually helps relative to L4.

The reason is not moral. It is arithmetic.

Without tribe formation, correlated demand can in principle form very large blocs. If all $N$ agents herd together, variance can scale brutally. With tribes, the population tends to partition into smaller groups. In the paper’s representative dynamics, L5 commonly produces a $3+3+1=7$ structure: two opposing blocs and a singleton. Sometimes it produces a $3+4=7$ split.

This matters because, when each bloc moves as a unit, the variance of total demand scales with the sum of squared bloc sizes. A $3+3+1$ partition gives:

$$ 3^2 + 3^2 + 1^2 = 19 $$

A $3+4$ split gives:

$$ 3^2 + 4^2 = 25 $$

Both are smaller than a full seven-agent herd:

$$ 7^2 = 49 $$

Under scarcity, reducing the maximum size of correlated bursts can reduce overload. That is why L5 can outperform L4 when capacity is very low. In the paper’s reported comparison, adding tribal sensing reduces overload under scarce conditions; in the larger $N=15$ sensitivity check, the L5-minus-L4 overload difference is likewise negative under scarcity and positive under abundance.

This is a useful point because it blocks the lazy interpretation that “tribes are always bad.” They are not. Under scarcity, factional structure can act like a crude variance limiter. It prevents the entire population from becoming one synchronized mob.
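
The bloc arithmetic is easy to check for any partition. The sketch below assumes a deliberately crude model that is not the paper's simulation: every bloc accesses with the same probability `q`, all members of a bloc act together, and blocs are independent of each other. Under those assumptions the variance of total demand is `q * (1 - q)` times the sum of squared bloc sizes. Function names are hypothetical.

```python
from typing import Sequence


def sum_of_squares(bloc_sizes: Sequence[int]) -> int:
    """Sum of squared bloc sizes, the factor that scales demand variance."""
    return sum(s * s for s in bloc_sizes)


def demand_variance(bloc_sizes: Sequence[int], q: float) -> float:
    """Variance of total demand when each bloc moves as one unit.

    Assumption for this sketch: every bloc accesses with probability q,
    all members act together, and blocs are independent of each other.
    """
    return q * (1.0 - q) * sum_of_squares(bloc_sizes)


partitions = {
    "independent": [1] * 7,
    "3+3+1": [3, 3, 1],
    "3+4": [3, 4],
    "full herd": [7],
}
for label, sizes in partitions.items():
    print(label, sum_of_squares(sizes), demand_variance(sizes, q=0.5))
```

The four partitions give sums of squares of 7, 19, 25, and 49, so a tribal split sits between fully independent agents and a single herd.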

That is not exactly a corporate governance ideal. But it is a mechanism.

The same tribes waste capacity when resources are abundant

The sign flips when capacity increases.

When $C/N$ moves above the crossover region, L4 and L5 become much better than the simpler levels. Sophistication now helps because the system has enough capacity to absorb coordinated action. Agents can exploit available resources without constantly pushing the system into overload.

But L5 becomes slightly worse than L4 under abundance.

The same tribal partitions that capped overload under scarcity now prevent full capacity utilization. If tribes stabilize around groups of three or four, the mean number of agents accessing the resource may stay around that range even when capacity could handle more. L4 agents, without the same tribal structure, distribute demand more smoothly across the capacity range. That makes L4 better when the system has room to breathe.

This is the paper’s most business-relevant lesson: a coordination mechanism is not good or bad in isolation. Its value depends on the capacity regime.

The practical architecture decision is therefore not:

Should we make the agents smarter?

It is:

Given this specific $C/N$, which agent features reduce overload rather than merely improve local decision sophistication?

That question is less glamorous. It is also the one that might keep the system running.

The crossover is not philosophical; it is capacity per agent

The paper reports a crossover around:

$$ \frac{C^*}{N} \approx 0.5 $$

Below that region, scarcity dominates. Sophistication often hurts. Above it, sophistication begins to pay.

This is not a universal law of all multi-agent AI systems. It is a result from this game, this action space, this feedback structure, and this agent setup. But the design principle travels well: the first diagnostic for a fleet of autonomous agents should be the ratio between available shared capacity and active agent population.

That ratio is knowable before deployment. No mystical emergent-intelligence dashboard is required. Count agents. Count usable resource slots. Estimate concurrency. Compute the bottleneck.

The paper’s own example is direct: seven EVs sharing two charging stations gives $C/N = 0.29$, a scarcity regime where simple identical firmware can outperform more sophisticated adaptive agents in terms of overload. The same seven EVs sharing five stations gives $C/N = 0.71$, where diverse adaptive models become more attractive.

Translated into business language:

| Deployment condition | Likely design implication |
| --- | --- |
| Low $C/N$ | Prioritize simplicity, throttling, quotas, and decorrelation |
| Near crossover | Test carefully; small changes in capacity or concurrency may flip the preferred architecture |
| High $C/N$ | More adaptive and diverse agents may improve utilization |
| Social sensing available | Treat it as a capacity-regime-dependent feature, not a default upgrade |
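
The bottleneck calculation itself fits in a dozen lines. In the sketch below, the 0.5 crossover is the value reported for the paper's setup, while the width of the "near crossover" band (`band=0.1`) is an arbitrary choice made here for illustration; both function names are hypothetical.

```python
def capacity_ratio(capacity: int, active_agents: int) -> float:
    """Capacity per active agent, C/N."""
    if active_agents <= 0:
        raise ValueError("active_agents must be positive")
    return capacity / active_agents


def regime(ratio: float, crossover: float = 0.5, band: float = 0.1) -> str:
    """Label a ratio relative to the reported crossover.

    The 0.5 crossover comes from the paper's setup; the width of the
    "near crossover" band is an arbitrary choice for this sketch.
    """
    if ratio < crossover - band:
        return "scarce"
    if ratio > crossover + band:
        return "abundant"
    return "near crossover"


for slots in (2, 5):
    r = capacity_ratio(slots, 7)
    print(f"C={slots}, N=7 -> C/N={r:.2f} ({regime(r)})")
```

Feed it peak concurrency rather than the size of the asset register: `active_agents` should be the number of agents that can plausibly act in the same window.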

The expensive mistake is to buy sophistication before measuring scarcity.

That mistake is plausible because most AI procurement still evaluates models as isolated capabilities: accuracy, latency, cost, tool-use success, reasoning benchmarks. Shared-resource systems require another evaluation layer: collective demand behavior under concurrency.

An agent that benchmarks well alone may still be a terrible citizen in a crowded system. A familiar story, really.

Individual winners can hide collective failure

Figure 2 adds a second result that should worry system operators. Some agents profit significantly even when the collective system fails.

In the scarcest regime, L5 followers can achieve very high win rates while system overload remains extremely high. The paper reports that at $C=1$, tribal L5 followers achieve $84.2 \pm 2.1\%$ win rates while the system overloads $91.5 \pm 1.5\%$ of the time.

That is not a contradiction. It is the point.

Collective failure and individual success can coexist when correlated group dynamics concentrate rewards on particular dispositions. Some agents learn or align into positions that benefit them even as the shared resource collapses around everyone else.

For business systems, this matters because local agent KPIs can lie.

If each vendor, department, device, or model is evaluated by its own win rate, throughput, task success, or local utility, the system may reward behaviors that worsen global reliability. The agent that “wins” most often may be the one exploiting the collective coordination failure most effectively.

That creates a monitoring problem.

Do not only measure agent-level success. Measure overload, congestion, queue instability, delayed alerts, dropped messages, and synchronized demand spikes. Otherwise the dashboard will congratulate the arsonist for excellent heat generation.
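
A minimal version of that dual measurement is sketched below. It assumes the operator can log, per round, which agents tried to access the resource, and it scores wins with the game's own rule (accessors win when demand fits, hold-back agents win on overload). The function name and the toy log are invented for illustration; they are not data from the paper.

```python
from typing import List, Sequence, Tuple


def fleet_metrics(
    rounds: Sequence[Sequence[bool]], capacity: int
) -> Tuple[float, List[float]]:
    """Return (overload_rate, per_agent_win_rates) from logged actions.

    rounds[t][i] is True if agent i tried to access in round t. Wins follow
    the game's rule: accessors win when demand fits, holders win on overload.
    """
    if not rounds:
        raise ValueError("need at least one round")
    n_agents = len(rounds[0])
    wins = [0] * n_agents
    overloads = 0
    for actions in rounds:
        overloaded = sum(actions) > capacity
        overloads += overloaded
        for i, acted in enumerate(actions):
            wins[i] += acted != overloaded
    n_rounds = len(rounds)
    return overloads / n_rounds, [w / n_rounds for w in wins]


# Toy log, invented for illustration: three agents, one slot, four rounds.
log = [
    [True, True, False],
    [True, True, False],
    [True, False, False],
    [True, True, False],
]
overload_rate, win_rates = fleet_metrics(log, capacity=1)
```

In this toy log the third agent wins three rounds out of four precisely because the system is overloaded three rounds out of four. A dashboard that shows only `win_rates` would report a star performer.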

How to read the evidence: main result, ablation, robustness, implementation detail

The paper includes several experiments and supporting checks. They should not all be treated as the same kind of evidence.

| Paper element | Likely purpose | What it supports | What it does not prove |
| --- | --- | --- | --- |
| Figure 1 technology ladder | Main evidence | Sophistication changes overload differently under scarcity and abundance | Universal thresholds for all agent systems |
| L1 and L2 analytical baselines | Baseline and mathematical grounding | Simple and null adaptive cases can be calculated, not only simulated | Realistic behavior of heterogeneous deployed fleets by themselves |
| L3-L5 empirical runs | Main empirical evidence | Diverse LLMs, adaptation, and sensing produce distinct collective patterns | Behavior under all possible model sizes and action spaces |
| L4 vs L5 with shared seeds | Controlled ablation | Tribal sensing changes overload relative to non-social adaptive agents | That real devices would spontaneously form the same tribes |
| Figure 2 individual win rates | Individual-level consequence | Some agents can benefit while system overload is high | That every practical system will reward the same dispositions |
| Figure 3 tribal dynamics | Mechanism illustration | L5 tends toward factional partitions such as $3+3+1$ | That tribe labels are stable identities with human-like meaning |
| Random-$p$ initialization checks | Robustness test | Results are not only caused by one fixed disposition initialization | Full robustness to all population compositions |
| $N=11$ and $N=15$ sweeps | Sensitivity test | The crossover pattern persists beyond $N=7$ | Large-scale fleet behavior with hundreds or thousands of agents |
| Conch transition analysis | Implementation detail / negative control | The gradual tribal influence ramp is not driving the final result | That institutions are irrelevant in real human organizations |
| Leader conviction and memory variants | Exploratory extension | Design variations can attenuate the tribal channel without removing the qualitative pattern | A complete taxonomy of social mechanisms |

This distinction matters because the paper’s strongest claim is not that every agent fleet should use identical firmware below $C/N=0.5$. The stronger and safer claim is that capacity regime changes the sign of sophistication’s effect, and that this sign change can arise through demand correlation and tribe-size arithmetic.

That mechanism is what businesses should extract.

What this means for business deployment

The obvious business lesson is “check $C/N$.” The less obvious lesson is that AI governance for agent fleets needs to move from model evaluation to system evaluation.

A company deploying autonomous agents into a shared-resource environment should ask four questions before upgrading the agents.

First, what is the actual bottleneck? In a hospital, it may be wireless bandwidth or alert-processing capacity. In logistics, it may be loading bays. In EV infrastructure, it may be charging slots or feeder capacity. In autonomous traffic, it may be intersection priority. In a cloud workflow, it may be API rate limits, database locks, or GPU queues.

Second, how many agents can become active at the same time? Average population is not enough. Overload is a concurrency problem. The dangerous number is not how many devices exist in the asset register. It is how many may act together under the same signal.

Third, does the agent upgrade increase correlation? Reinforcement learning, shared prompts, shared histories, common models, common vendor defaults, and social sensing can all make agents behave more similarly. Similar behavior is not always bad. Under abundance, it may improve utilization. Under scarcity, it can become a synchronized denial-of-service attack performed with excellent intentions.

Fourth, are local rewards aligned with system health? If agents are rewarded for individual access success, task completion, or local utility, they may discover policies that improve their own outcomes while worsening collective overload. That is not a bug in intelligence. That is a bug in incentive design.

A practical pre-deployment checklist would look like this:

| Question | Why it matters |
| --- | --- |
| What is $C/N$ under peak concurrency? | Determines whether scarcity or abundance dominates |
| Do agents receive common signals? | Common signals can synchronize behavior |
| Can agents learn from identical feedback loops? | Identical learning can create herding |
| Does model diversity decorrelate or destabilize demand? | Diversity is not automatically beneficial |
| Does social sensing create blocs? | Blocs can reduce or increase overload depending on capacity |
| Are system-level overload metrics monitored? | Individual agent KPIs can hide collective failure |
| Can the operator throttle, randomize, or stagger actions? | Decorrelation may be more valuable than smarter prediction |
Can the operator throttle, randomize, or stagger actions? Decorrelation may be more valuable than smarter prediction

The ROI implication is also less glamorous than “deploy more capable agents.” In scarce systems, the best investment may be capacity expansion, admission control, scheduling, randomized staggering, or simple firmware rules. In abundant systems, investment in adaptive diversity may pay off because the system can absorb coordinated demand.
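
Admission control, one of the unglamorous options listed above, can be as small as a first-in-first-out gate. The sketch below is an assumption-heavy illustration rather than anything from the paper: it presumes a trusted local gate exists at the resource (which the edge setting does not guarantee), that agents are identified by hashable IDs, and that each admitted agent occupies a slot for exactly one round. The class name and the `ev1` to `ev7` identifiers are hypothetical.

```python
from collections import deque
from typing import Deque, Hashable, Iterable, List


class AdmissionQueue:
    """Admit at most `capacity` requests per round; queue the rest FIFO."""

    def __init__(self, capacity: int) -> None:
        if capacity < 0:
            raise ValueError("capacity must be non-negative")
        self.capacity = capacity
        self._waiting: Deque[Hashable] = deque()

    def step(self, new_requests: Iterable[Hashable]) -> List[Hashable]:
        """Enqueue this round's requests and return the admitted agents."""
        for agent in new_requests:
            if agent not in self._waiting:  # ignore duplicate requests
                self._waiting.append(agent)
        admitted = []
        while self._waiting and len(admitted) < self.capacity:
            admitted.append(self._waiting.popleft())
        return admitted

    def backlog(self) -> int:
        """Number of agents still waiting."""
        return len(self._waiting)


gate = AdmissionQueue(capacity=2)
first = gate.step(["ev1", "ev2", "ev3", "ev4", "ev5", "ev6", "ev7"])
second = gate.step([])
```

With seven simultaneous requests and two slots, the gate never lets demand exceed capacity. It converts a synchronized spike into waiting time, which is a cost the operator can at least see and budget for.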

In other words, the cheapest intelligence may be an honest bottleneck calculation.

What the paper directly shows, and what Cognaptus infers

It is useful to separate the paper’s direct evidence from the business inference.

| Layer | Statement |
| --- | --- |
| Directly shown by the paper | In a small repeated resource-competition game, increasing agent sophistication can worsen overload under scarcity and improve it under abundance. |
| Directly shown by the paper | The crossover is governed by the capacity-to-population ratio, with a threshold around $C/N \approx 0.5$ in this setup. |
| Directly shown by the paper | Tribal sensing can reduce overload under scarcity by partitioning correlated blocs, but can worsen overload under abundance by underusing capacity. |
| Directly shown by the paper | Some agents can achieve high individual win rates while the system is collectively overloaded. |
| Cognaptus inference | Businesses should evaluate agent fleets using system-level congestion tests, not only single-agent benchmarks. |
| Cognaptus inference | Before upgrading model sophistication, operators should estimate peak-concurrency $C/N$ and test whether new agent features increase demand correlation. |
| Still uncertain | How the exact crossover changes under richer action spaces, larger populations, partial observability, different reward functions, and real physical testbeds. |

That separation is important. The paper gives a strong warning against naive sophistication. It does not give a universal deployment manual.

The best use of the paper is as a diagnostic frame: when agents compete for finite shared resources, ask whether intelligence is reducing overload or merely making demand spikes more coordinated.

Boundaries: where not to overread the result

The paper’s design is intentionally stylized. That is a strength for mechanism discovery and a boundary for direct deployment.

The action space is binary: access or hold back. Many real systems have richer choices, such as queueing, negotiating, pricing, rerouting, priority classes, partial transmission, or time-window scheduling. Those mechanisms could soften or alter the crossover.

The agents observe aggregate demand history, not a fully realistic physical environment. Partial observability, noisy sensors, adversarial agents, or communication protocols would change the dynamics.

The main empirical setup uses seven agents and small LLMs. The author argues this scale is appropriate for edge deployment, and the paper reports sensitivity checks at somewhat larger populations ($N=11$ and $N=15$). Still, the result should not be casually extrapolated to city-scale systems with thousands of agents without simulation and field testing.

The L5 tribal mechanism is implemented as an external sensing and loyalty-defection layer. It demonstrates the effect of group-like correlation structures. It does not prove that deployed AI devices will spontaneously form equivalent factions in the wild. They may. They may not. Engineers should test rather than anthropomorphize. A refreshing habit, when available.

The sampling temperature is fixed at $T=1.0$. Different stochasticity levels could change how strongly agents synchronize or diversify.

Finally, the paper optimizes around overload. Some deployments may care about multiple objectives: fairness, latency, safety-critical priority, energy cost, user satisfaction, and resilience under attack. A system can reduce overload while still being unacceptable on other dimensions.

These boundaries do not weaken the core lesson. They locate it.

The design lesson: before smarter agents, measure the shared bottleneck

The paper’s best sentence, translated into operations language, is this:

Whether agent sophistication helps or hurts depends on capacity per active agent.

That is a useful corrective to the current agent-building reflex. Many teams still treat “more agentic” as a one-way upgrade path: add memory, add learning, add tool use, add model diversity, add communication, add social awareness, add a dashboard, then call the resulting creature “autonomous.”

But in shared-resource environments, each added capability can change correlation. Correlation changes overload. Overload changes safety and ROI.

A smarter agent fleet may be exactly what the system needs when capacity is abundant. The same fleet may be exactly what breaks the system when capacity is scarce.

The uncomfortable part is that this is knowable early. The operator does not need to wait for mysterious emergent behavior to discover whether the system is resource-constrained. Capacity, concurrency, and bottleneck structure can be estimated before deployment.

The article’s title says “when AI agents get smarter, systems get worse.” The precise version is narrower and more useful:

When agents become more adaptive in a scarce shared-resource environment, their intelligence can synchronize demand faster than the system can absorb it.

That is not an argument against AI agents. It is an argument against treating intelligence as a substitute for capacity design.

Before giving every device a smarter local brain, ask the boring question:

How many agents can the shared resource actually handle at once?

If the answer is “not many,” the best upgrade may not be a better model.

It may be a queue.

Cognaptus: Automate the Present, Incubate the Future.


  1. Neil F. Johnson, “Increasing intelligence in AI agents can worsen collective outcomes,” arXiv:2603.12129, 2026, https://arxiv.org/abs/2603.12129