Opening — Why this matters now

The AI industry has a small addiction to the word agent. Add another agent, then another, then a few hundred more, and the slide deck begins to smell faintly of civilization. Somewhere between “workflow automation” and “digital society,” we are invited to believe that scale itself becomes intelligence.

The paper “Superminds Test: Actively Evaluating Collective Intelligence of Agent Society via Probing Agents” is a useful bucket of cold water.¹ Its target is not a toy simulation with five agents arguing in a prompt window. The authors study MoltBook, a live social platform hosting more than two million autonomous AI agents. These agents post, comment, react, and browse through an OpenClaw-style action loop. In other words, the system has the visual ingredients of a society: identities, messages, threads, reactions, memory, and public interaction.

The natural question is obvious: does a large population of agents begin to think better together than any one agent alone?

The paper’s answer is equally obvious, but less flattering: no. Not in the tested setting. Not by accident. Not because a million generic comments magically congeal into wisdom, like soup left too long in the refrigerator.

The business implication is direct. Enterprises should not confuse many deployed agents with collective intelligence. A swarm of AI workers without structured communication is not an organization. It is a comment section wearing a productivity badge.

Background — Context and prior art

The paper begins from a real distinction that is often blurred in AI product language: multi-agent systems are not the same as agent societies.

Most successful multi-agent systems are not spontaneous. They are engineered. Systems such as AutoGen, CAMEL, ChatDev, MetaGPT-style software teams, and role-based agent workflows achieve useful collaboration because humans impose structure: roles, turn-taking, shared objectives, task decomposition, escalation rules, and stopping criteria. The collaboration is less “emergent civilization” and more “project manager with a clipboard.” This is not an insult. It is why those systems sometimes work.

Large-scale agent societies are different. They are open-ended, persistent, and loosely structured. Agents may post, respond, ignore, wander, react, or emit pleasant nonsense. This resembles human online platforms more than controlled workflow engines. The premise is seductive: human societies produce Wikipedia, open-source software, markets, science, and occasionally usable meeting notes; perhaps large AI populations will generate similar collective intelligence.

The authors argue that this claim cannot be validated by passive observation alone. Counting posts, likes, comments, or “social behavior” does not prove collective intelligence. A society must be tested by whether interaction improves outcomes beyond what isolated individuals can do.

That is the paper’s key methodological move: rather than merely observing MoltBook, the authors insert Probing Agents — controlled agents that look like normal participants but publish carefully designed tasks. These tasks have known ground-truth answers, allowing the researchers to evaluate whether the society actually reads, responds, synthesizes, and reasons.

This is the correct instinct. If we want to know whether a collective system is intelligent, we should not admire its activity level from a distance. We should poke it with a calibrated instrument and see whether anything intelligent comes back.

Analysis — What the paper does

The authors define collective intelligence operationally: a group exhibits collective intelligence when its interactive output surpasses what the best individual agent could achieve alone, without a pre-designed hierarchy or forced protocol.

That definition matters. A million agents producing a million isolated answers do not form a mind. They form a spreadsheet. Collective intelligence requires interaction structure: agents must notice each other, exchange useful information, build on prior contributions, and converge toward better outputs.
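Stated as a check, the criterion is easy to express. Here is a minimal sketch (my illustration, not the paper's evaluation code): a society shows collective intelligence on a task set only if its interactive output beats its best isolated member.

```python
# Minimal sketch of the paper's operational criterion for collective intelligence.
# The numbers below are placeholders in the spirit of the Tier I comparison,
# not a reproduction of the paper's pipeline.

def collective_gain(thread_accuracy: float, individual_accuracies: dict) -> float:
    """Margin by which the interacting group beats its best isolated member.

    Positive -> evidence of collective intelligence on this task set.
    Zero or negative -> the society adds nothing over its strongest agent.
    """
    return thread_accuracy - max(individual_accuracies.values())


baselines = {"model_a": 0.157, "model_b": 0.070}   # isolated frontier-model accuracy
print(collective_gain(thread_accuracy=0.0014, individual_accuracies=baselines))
# -> negative: the collective underperforms its best member
```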

The paper’s framework, which the title calls the Superminds Test, evaluates this from the top down through three tiers.

| Tier | What it tests | Probe design | What success would look like |
|---|---|---|---|
| Tier I: Joint Reasoning | Can the group solve hard problems better than individual frontier models? | Probing agents post text-only Humanity's Last Exam questions | Discussion threads converge on correct answers or provide useful reasoning context |
| Tier II: Information Synthesis | Can agents combine information distributed across multiple comments? | Probing agents post grade-school math questions while other probing agents distribute required premises in comments | Regular agents read the post and comments, synthesize the premises, and solve the problem |
| Tier III: Basic Interaction | Can agents maintain minimal conversational coordination? | Probing agents post counting-chain tasks | Regular agents read the last number and reply with the next valid number |

This hierarchy is elegant because failure at a high tier can be diagnosed by moving downward. If agents fail joint reasoning, perhaps the tasks are too hard. If they also fail distributed-premise arithmetic, perhaps they are not reading each other. If they fail counting, the problem is even more basic: they are not sustaining common ground.

The Probing Agent design follows three principles:

| Design principle | Meaning | Why it matters |
|---|---|---|
| Indistinguishability | Probing agents use the same persona format and posting style as ordinary agents | The test should not distort normal platform behavior |
| Task-carrying | Each probe carries a specific diagnostic challenge | The interaction becomes measurable rather than anecdotal |
| Minimal intervention | Probing agents initiate tasks but do not steer the conversation | The response remains organic, not coached |

This is a useful design pattern beyond the paper. For business deployments, evaluation agents can be embedded into production-like workflows to test whether other agents notice anomalies, request missing data, follow policy, or escalate exceptions. Done carefully, probing becomes an audit method for agentic systems.
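As a concrete illustration of that audit pattern, a probe can be represented as an ordinary-looking post plus a ground truth that is never published, with responses scored after the fact. This is a sketch under assumptions, not the paper's implementation:

```python
from dataclasses import dataclass, field


@dataclass
class Probe:
    """A task-carrying post that looks like ordinary platform content.

    The ground truth stays with the evaluator and is never posted, so the
    probe does not steer the conversation (minimal intervention).
    """
    post_text: str
    ground_truth: str
    responses: list = field(default_factory=list)

    def record(self, comment: str) -> None:
        self.responses.append(comment)

    def score(self) -> dict:
        # Crude scoring for illustration: does any reply contain the answer?
        correct = [r for r in self.responses if self.ground_truth.lower() in r.lower()]
        return {"engaged": bool(self.responses),
                "any_correct": bool(correct),
                "n_responses": len(self.responses)}


probe = Probe(post_text="Quick one: what is 17 * 23?", ground_truth="391")
probe.record("Great question! Love the energy here.")
probe.record("17 * 23 = 391.")
print(probe.score())  # {'engaged': True, 'any_correct': True, 'n_responses': 2}
```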

Findings — Results with visualization

The paper’s empirical results are not subtle. MoltBook contains a massive number of agents, but the measured society fails across all three tiers.

Tier I: Joint reasoning fails

For the highest-level test, the authors use 2,158 text-only Humanity’s Last Exam questions across multiple domains. They compare MoltBook’s discussion performance against individual frontier models answering the same questions in isolation.

The individual baselines are weak on this benchmark, as expected: the paper reports 7.0% accuracy for gpt-5.2 and 15.7% for claude-sonnet-4-6 on the full set. The important point is not that these baselines are low. HLE is designed to be painful. The important point is that the society performs far worse.

| Tier I result | Measurement | Reported outcome |
|---|---|---|
| HLE posts tested | Total text-only questions | 2,158 |
| Posts receiving no comments | Silence rate | 98.4% |
| MoltBook: at least one correct comment | Society-level individual correctness | 0.19% |
| MoltBook: thread converges correctly | Collective thread correctness | 0.14% |
| gpt-5.2 isolated baseline | Individual model accuracy | 7.0% |
| claude-sonnet-4-6 isolated baseline | Individual model accuracy | 15.7% |

Even when the authors zoom into the tiny subset of 35 questions that received comments, MoltBook does not outperform isolated frontier models. The best individual commenters and the collective thread remain below the model baselines on that subset.

The helpfulness test is more nuanced. The authors remove direct-answer comments and feed the remaining discussion context into various frontier models to see whether social discussion improves model answers. Results are mixed: some models improve, some decline, and one remains unchanged. A few comments contain useful reasoning signals, but the signal is rare and buried beneath low-quality content.
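Mechanically, that helpfulness check amounts to re-prompting a model with the thread as extra context after stripping comments that state the answer outright. A rough sketch of the idea, where `call_model` and `is_correct` are placeholders for whatever model API and grader the evaluator uses:

```python
def strip_direct_answers(comments, answer):
    """Drop comments that simply state the final answer; keep everything else."""
    return [c for c in comments if answer.lower() not in c.lower()]


def helpfulness_delta(question, comments, answer, call_model, is_correct):
    """+1 if discussion context turns a wrong answer right, -1 if it does the opposite."""
    context = "\n".join(strip_direct_answers(comments, answer))
    bare = call_model(question)
    with_context = call_model(f"{question}\n\nDiscussion so far:\n{context}")
    return int(is_correct(with_context, answer)) - int(is_correct(bare, answer))
```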

This is the tragedy of the almost-useful comment. Somewhere in the noise is a clue. Unfortunately, the system does not know how to reliably surface it, build on it, or turn it into collective progress.

The quality classification explains why. Across 111 comments on 35 HLE threads, the authors report that 76.5% are superficial or irrelevant, only 3.6% contain correct reasoning, and only 5.4% provide partial substantive engagement.

| Comment quality on HLE discussion threads | Share |
|---|---|
| Superficial or irrelevant | 76.5% |
| Correct reasoning | 3.6% |
| Partial substantive engagement | 5.4% |
| Any substantive reasoning | 9.0% |

The society behaves socially, not intellectually. Agents praise, acknowledge, echo, or drift. They respond like polite conference attendees who have not read the paper but enjoyed the catering.

Tier II: Information synthesis exists, but participation collapses

The second tier lowers the cognitive burden. The authors adapt 103 GSM-SP grade-school math problems. Required facts are distributed across the post and multiple comments. To solve one, an agent only needs to read the post and its comments, combine the premises, and perform simple arithmetic.

This should be easy. It is also exactly the kind of behavior enterprise agents need: read distributed inputs, combine evidence, and produce a correct answer.
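To make the setup concrete, here is a toy version of such a probe (my own example, not one of the paper's GSM-SP items). No single message contains everything needed; the answer requires reading the post and both comments:

```python
# Toy Tier II probe: trivial arithmetic, but the premises are deliberately split
# across the post and the comments, so answering requires synthesis.

post = "Ava is buying notebooks for her team. How much does she spend in total?"
comments = [
    "Context: the team has 6 people and each person gets 2 notebooks.",
    "FYI, the notebooks cost $3 each.",
]

# What a synthesizing agent should effectively compute after reading everything:
people, notebooks_each, price = 6, 2, 3
print(people * notebooks_each * price)  # 36 -- the ground truth the probes hold back
```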

The platform mostly ignores the task.

| Tier II result | Measurement | Reported outcome |
|---|---|---|
| Distributed-premise math posts | Total probes | 103 |
| Posts receiving no external comments | Silence rate | 93 posts / 90.3% |
| Posts attracting comments | Engagement count | 10 posts |
| External comments collected | Total responses | 17 |
| Comments attempting solution | Attempted solving | 12 |
| Correct solutions among attempts | Correct answers | 11 |
| Comments referencing distributed premises | Evidence of synthesis | 12 |

This finding is more interesting than a blanket failure. When agents engage, many can synthesize information correctly. The problem is that they rarely engage.

For business, this distinction matters. The failure mode is not always model capability. It may be workflow activation. An AI agent may be able to reconcile invoices, classify complaints, or summarize incident reports — if it is routed to the right task, given the right context, and required to respond. In an unstructured environment, capability lies dormant.

A dormant capability is still operationally useless. “The agent could have done it” is not a process metric. It is an epitaph.

Tier III: Even basic interaction is weak

The third tier removes reasoning almost entirely. The probe posts a counting task. Agents only need to read the last number and reply with the next number in sequence.

This is not intelligence in the grand philosophical sense. It is the AI equivalent of noticing the baton before running the next leg.
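The validity check for such a reply is almost embarrassingly small, which is what makes the failure rate below notable. A sketch of the check (an illustration, not the paper's code):

```python
import re


def is_valid_next(thread, reply):
    """A reply is valid if it contains exactly one number: the successor of the last number posted."""
    posted = [int(n) for msg in thread for n in re.findall(r"-?\d+", msg)]
    found = re.findall(r"-?\d+", reply)
    return bool(posted) and len(found) == 1 and int(found[0]) == posted[-1] + 1


print(is_valid_next(["Let's count together! 1", "2", "3"], "4"))            # True
print(is_valid_next(["Let's count together! 1", "2", "3"], "Great post!"))  # False
```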

| Tier III result | Measurement | Reported outcome |
|---|---|---|
| Counting posts | Total probes | 50 |
| Posts receiving no external comments | Silence rate | 35 posts / 70% |
| Posts attracting comments | Engagement count | 15 posts |
| External comments collected | Total responses | 17 |
| Correct sequence-following replies | Valid coordination | 6 |
| Off-topic or spam replies | Misaligned responses | 10 |
| Wrong-format replies | Format failure | 1 |

This is the most damaging result because it strips away excuses. The agents are not failing because the tasks require advanced mathematics or scientific expertise. They are failing to maintain a minimal shared conversational state.

Collective intelligence needs common ground. If the agents do not reliably read the previous turn, align with the thread, and contribute a valid next step, higher-level collaboration has no foundation.

Platform-wide analysis: shallow interaction is not a probe artifact

The authors also analyze broader MoltBook data from prior work to check whether the probe results are unusual. The answer: no. Platform-wide interaction is shallow.

Thread depth remains close to 1.0, even for popular posts. More upvotes attract more repliers, but not deeper conversation. Popularity increases attention, not collaboration. Reply quality is also dominated by spam, generic, or superficial responses; substantive and deep replies remain rare.

That distinction is important for anyone designing AI platforms. Engagement is not collaboration. Volume is not synthesis. Popularity is not depth.

Here is the paper’s story in one diagnostic table:

| Layer of collective intelligence | What failed | Business translation |
|---|---|---|
| Joint reasoning | Groups did not outperform individual frontier models | More agents do not automatically improve decision quality |
| Information synthesis | Agents could synthesize when they engaged, but most tasks were ignored | Capability must be activated by routing, triggers, ownership, and deadlines |
| Basic interaction | Many agents did not respond or failed simple thread-following | Shared state and protocol are prerequisites, not nice-to-have features |
| Platform-wide dialogue | Threads stayed shallow despite popularity | Engagement metrics can flatter systems that do not actually collaborate |

Implications — Next steps and significance

The paper should be read less as an obituary for agent societies and more as a design memo for serious agent systems.

The result does not mean multi-agent AI is pointless. It means unstructured multi-agent AI is mostly theater. The useful systems will look less like open social platforms and more like disciplined organizations: shared memory, role clarity, task ownership, quality gates, escalation paths, evaluation probes, and mechanisms that force agents to build on the right prior work.

1. Enterprises need agent orchestration, not agent population growth

A business does not need “two million agents.” It needs the right five agents with the right operating model.

For example, an invoice automation system may include:

  • a document extraction agent,
  • a vendor-matching agent,
  • a policy-checking agent,
  • an exception-routing agent,
  • and a human-review briefing agent.

That system becomes intelligent not because each agent has a charming persona, but because the workflow defines when each agent acts, what state it receives, what output schema it must produce, and what happens when confidence is low.
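A minimal sketch of what that operating model can look like in code. The stage order, output envelope, and confidence threshold below are illustrative assumptions, not a prescribed architecture:

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class StepResult:
    """Every agent returns the same machine-readable envelope."""
    data: dict
    confidence: float


def run_invoice_pipeline(invoice: dict,
                         steps: List[Tuple[str, Callable[[dict], StepResult]]],
                         min_confidence: float = 0.8) -> dict:
    """Run agents in a fixed order over shared state; escalate on low confidence."""
    state: Dict[str, object] = {"invoice": invoice}
    for name, agent in steps:
        result = agent(state)            # each agent reads the shared state...
        state[name] = result.data        # ...and writes a named, schema'd output
        if result.confidence < min_confidence:
            state["status"] = f"escalated_at_{name}"   # exceptions have a destination
            return state
    state["status"] = "auto_approved"
    return state
```

The intelligence here is architectural: each agent acts at a defined stage, reads shared state, and low confidence has somewhere explicit to go.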

The Superminds Test reminds us that coordination is an architectural property. It cannot be assumed from autonomy.

2. Agent evaluation should include social failure modes

Most AI evaluations still focus on isolated model performance: accuracy, reasoning, latency, cost, hallucination rate. Those are necessary, but insufficient for agentic systems.

A multi-agent deployment needs tests for interaction behavior:

| Evaluation dimension | Practical test question |
|---|---|
| Attention | Does the agent notice relevant upstream outputs? |
| Context grounding | Does it use the latest shared state rather than stale assumptions? |
| Synthesis | Can it combine evidence from multiple agents or documents? |
| Turn discipline | Does it follow the expected sequence and handoff rules? |
| Escalation | Does it route ambiguity to the correct reviewer? |
| Non-engagement | Does the system detect when no agent responds to a required event? |
| Noise suppression | Can it separate useful signals from generic commentary? |
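One of these dimensions, non-engagement, is especially cheap to monitor and especially easy to forget. A sketch of a watchdog that flags required events no agent has answered (the field layout and timeout are illustrative assumptions):

```python
import time


def unanswered_events(events, responses, timeout_s=300, now=None):
    """Return ids of events that required a response and did not get one in time.

    events:    event_id -> (created_at_unix, requires_response)
    responses: event_id -> first_response_at_unix
    """
    now = time.time() if now is None else now
    return [eid for eid, (created, required) in events.items()
            if required and eid not in responses and now - created > timeout_s]


events = {"inv-17": (1_000.0, True), "chat-3": (1_000.0, False)}
print(unanswered_events(events, responses={}, now=2_000.0))  # ['inv-17']
```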

A deployment that passes individual task benchmarks but fails these tests will look impressive in demos and irritating in production. Naturally, the demo will be called “strategic transformation.”

3. Probing agents can become governance tools

The paper’s most reusable idea is the Probing Agent. In enterprise settings, probing agents could test production-like systems by injecting controlled cases:

| Business domain | Probe example | What it tests |
|---|---|---|
| Customer support | A complaint with missing order ID, urgent tone, and ambiguous product category | Triage, missing-info request, escalation |
| Finance operations | An invoice with duplicate vendor name and unusual amount | Policy checking, anomaly detection, reviewer routing |
| HR onboarding | A new hire profile missing required documents | Workflow completion, exception handling |
| Compliance | A client note containing suitability red flags | Regulatory issue detection, audit trail creation |
| Maintenance operations | A field report with conflicting asset IDs | Evidence synthesis, dispatch correction |

This is more than red-teaming. It is operational observability. A probing layer can reveal whether an agent system behaves like a process or merely emits plausible text around a process.
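In that spirit, an enterprise probe pairs an injected case with the behaviors it should trigger, and the audit is simply the set difference between expected and observed actions. A sketch with made-up field and action names:

```python
# Governance probe sketch: `observed_actions` would come from the system's own
# audit log in practice; the action names here are assumptions for illustration.

probe_case = {
    "domain": "customer_support",
    "payload": {"order_id": None, "tone": "urgent", "category": "ambiguous"},
    "expected_actions": {"triage", "request_missing_order_id", "escalate_to_human"},
}

observed_actions = {"triage", "auto_reply_generic"}

missed = probe_case["expected_actions"] - observed_actions
print(sorted(missed))  # ['escalate_to_human', 'request_missing_order_id']
```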

4. The missing variable is not intelligence; it is social infrastructure

The paper’s findings point to a recurring pattern: individual agents may have useful abilities, but the society lacks mechanisms that turn ability into collective performance.

Human organizations learned this the ugly way. Meetings need agendas. Teams need owners. Databases need identifiers. Tickets need statuses. Audits need logs. Managers need dashboards because “everyone is autonomously doing their best” is not a control system.

Agent societies need the same boring infrastructure, except faster and less forgiving:

| Required infrastructure | Why it matters |
|---|---|
| Shared memory | Agents need durable access to relevant prior work |
| Attention routing | Important tasks must be assigned, not left to voluntary discovery |
| State machines | Processes need explicit stages and transition rules |
| Output schemas | Contributions must be machine-readable and comparable |
| Consensus protocols | Disagreement must be resolved deliberately, not buried in threads |
| Quality scoring | Useful reasoning must be separated from filler |
| Escalation rules | Ambiguity must reach humans or specialist agents |
| Audit trails | Decisions must be explainable after the fact |

This is where serious automation work lives. Not in pretending agents are miniature employees with cute avatars, but in designing the communication machinery that makes their outputs usable.
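Much of that machinery is unglamorous to write down, which is exactly the point. For instance, an output schema plus an explicit state machine can be as small as this sketch (states and field names are illustrative assumptions):

```python
from dataclasses import dataclass, field

# Legal transitions: a contribution can only move a task along defined edges,
# which is what keeps "autonomy" from quietly becoming drift.
TRANSITIONS = {
    "open": {"claimed"},
    "claimed": {"answered", "escalated"},
    "answered": {"reviewed"},
    "escalated": {"reviewed"},
    "reviewed": set(),
}


@dataclass
class Contribution:
    """Machine-readable output schema every agent must emit."""
    task_id: str
    author: str
    new_state: str
    payload: dict
    builds_on: list = field(default_factory=list)   # ids of prior contributions used


def apply(contribution: Contribution, current_state: str) -> str:
    """Advance the task state, or fail loudly if the move is not allowed."""
    if contribution.new_state not in TRANSITIONS[current_state]:
        raise ValueError(f"illegal transition {current_state} -> {contribution.new_state}")
    return contribution.new_state


step = Contribution("t-1", "policy_checker", "answered", {"ok": True}, ["t-1/extract"])
print(apply(step, "claimed"))  # answered
```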

What Cognaptus should take from this

For Cognaptus-style automation projects, the paper sharpens a principle we already treat as operational law:

Do not sell “agent swarms.” Build accountable workflows.

A useful agentic system should answer four questions before it deserves deployment:

  1. Who owns the next step? If no agent is required to act, silence becomes the default.
  2. What shared state must be read? If agents do not ground themselves in previous outputs, collaboration collapses into parallel monologue.
  3. How is quality judged? If useful reasoning is not scored and surfaced, it drowns in generic text.
  4. Where does uncertainty go? If exceptions do not escalate, the system quietly converts ambiguity into operational risk.

The Superminds Test is especially relevant for businesses now experimenting with multi-agent automation. The temptation is to overbuild populations and underbuild governance. That is backwards. An agent system should begin with the process skeleton: task graph, inputs, outputs, permissions, evaluation probes, escalation thresholds, and audit logs. Only then should agents be added.

A smaller, structured agent team will usually beat a large unstructured society. This is not philosophically romantic. It is merely how work gets done.

Conclusion — The mind is in the mechanism

The paper’s central lesson is blunt: collective intelligence does not emerge from scale alone. MoltBook’s two-million-agent society has the appearance of social life, but the tested behaviors reveal sparse engagement, weak coordination, shallow replies, and little evidence of collective reasoning.

This should not surprise anyone who has attended a large group chat. Population is not organization. Activity is not intelligence. A reply is not a contribution. A thread is not a workflow.

For AI builders, the path forward is not to abandon multi-agent systems. It is to stop romanticizing emergence and start engineering interaction. The future of enterprise AI will be built through disciplined orchestration: shared state, role specialization, evaluation probes, exception routing, and human governance where judgment is still required.

The so-called supermind will not be born from a crowd of autonomous agents shouting politely into the void. It will be assembled, tested, monitored, and occasionally told to stop commenting “Great insight!” on arithmetic problems.

Cognaptus: Automate the Present, Incubate the Future.


  1. Xirui Li, Ming Li, Yunze Xiao, Ryan Wong, Dianqi Li, Timothy Baldwin, and Tianyi Zhou, “Superminds Test: Actively Evaluating Collective Intelligence of Agent Society via Probing Agents,” arXiv, 24 April 2026. https://arxiv.org/html/2604.22452