When Agents Treat Agents as Tools: What Tool-RoCo Tells Us About LLM Autonomy

Dispatch is where autonomy usually goes to die.

A warehouse manager may have ten workers, three forklifts, two packing stations, and one increasingly dramatic dashboard. The hard part is not merely deciding what each person should do. The hard part is knowing when to call someone in, when to release them, and when extra “help” is just a polite name for congestion.

That is also the uncomfortable part of multi-agent AI. A system may contain several LLM agents, each with a role, a tool list, and a lovely little prompt describing its personality. The demo looks cooperative because many agents are talking. The logs look active because many calls are being made. But activity is not organization. An agent team is not automatically autonomous because everyone is invited to the meeting and nobody knows when to leave.

The paper behind Tool-RoCo attacks exactly this soft spot: it turns other agents into callable tools and then watches whether LLM agents actually use, activate, and deactivate collaborators in long-horizon multi-robot tasks.¹ This is a small conceptual move with a large diagnostic consequence. Instead of asking only whether the task was completed, Tool-RoCo asks whether the agents managed the team.

That is the article’s central point. Tool-RoCo is not just another “multi-agent benchmark.” It is a benchmark for the difference between doing more actions and organizing work. A distinction so obvious that many agent systems, naturally, still manage to avoid measuring it.

The mechanism: teammates become tools

Tool-RoCo builds on RoCo-style multi-robot cooperation tasks, but changes the evaluation lens. In ordinary tool-use benchmarks, an LLM chooses from tools such as PICK, OPEN, MOVE, PLACE, or WAIT. The system can then measure whether the model selected a valid tool, filled parameters correctly, and produced an executable action.

Tool-RoCo keeps that structure, but adds a second class of tools: cooperative tools. These include actions such as activating another agent or disconnecting one that is no longer needed. In other words, another robot-agent can be called like a tool.

That sounds almost too simple. It is not.

The mechanism forces cooperation to appear as an explicit decision. If an agent needs help, it must choose to call another agent. If that agent is redundant, it must choose to release it. The benchmark can therefore observe a behavior that final task success often hides: whether the model understands collaboration as a lifecycle, not a permanent state.
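To make the idea concrete, here is a minimal sketch of an action space where teammates sit next to ordinary tools. It is an illustration, not the paper's implementation: the `Team` class, the `ACTIVATE` and `DISCONNECT` names, the robot names, and the rule that an agent cannot disconnect itself are all assumptions made for the example.

```python
from dataclasses import dataclass, field

# Hypothetical tool names: the task tools echo the examples in this article,
# while ACTIVATE and DISCONNECT stand in for the paper's cooperative tools.
TASK_TOOLS = {"PICK", "OPEN", "MOVE", "PLACE", "WAIT"}
COOPERATIVE_TOOLS = {"ACTIVATE", "DISCONNECT"}


@dataclass
class Team:
    """Tracks which agents are active and logs every accepted tool call."""
    active: set = field(default_factory=set)
    log: list = field(default_factory=list)

    def call(self, caller: str, tool: str, target: str = "") -> bool:
        """Apply one tool call; return False if the call is rejected."""
        if caller not in self.active:
            return False  # inactive agents cannot act
        if tool == "ACTIVATE":
            if not target or target in self.active:
                return False  # nobody new to activate
            self.active.add(target)
        elif tool == "DISCONNECT":
            # Assumption for this sketch: an agent cannot disconnect itself.
            if target not in self.active or target == caller:
                return False
            self.active.discard(target)
        elif tool not in TASK_TOOLS:
            return False  # unknown tool
        self.log.append((caller, tool, target))
        return True


team = Team(active={"robot_a"})
team.call("robot_a", "ACTIVATE", "robot_b")    # recruit a collaborator
team.call("robot_b", "PICK", "cup")            # the collaborator does task work
team.call("robot_a", "DISCONNECT", "robot_b")  # release it when no longer needed
print(sorted(team.active), len(team.log))
```

Because activation and release pass through the same `call` path as `PICK` or `MOVE`, they land in the same log, which is what makes the lifecycle observable.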

The paper’s three tasks make this concrete:

Task	Operational structure	Why it stresses coordination
CABINET	Three robots must place a cup and mug on correct coasters; one object is inside a cabinet	Agents need sequencing and division of access, not just isolated manipulation
PACK	Two robots pack grocery items into a bin from different sides of a table	Parallel work is possible, but coordination still matters
SORT	Three robots move cubes across seven panels with limited reach	The task requires relay-like cooperation through intermediate positions

The important design choice is not that these are simulated robot tasks. It is that they create situations where local action and team configuration interact. A robot may know how to pick. It may even know where to place. But if the right collaborator is inactive, or every collaborator remains active forever, the system is not really managing the work.

Four autonomy settings reveal where orchestration ends

Tool-RoCo defines four cooperation paradigms. They are not merely experimental variants; they are a ladder of autonomy.

Paradigm	Who decides actions?	Are agents treated as tools?	What it tests
Centralized cooperation	One central LLM assigns tools to all agents	No	Basic tool selection, parameter filling, and execution under global visibility
Centralized self-organization	One central LLM assigns tools and manages active agents	Yes	Whether a central controller can activate or deactivate collaborators
Decentralized cooperation	Each agent has its own LLM and local observation	No	Whether agents can coordinate without a global controller
Self-organization cooperation	Agents start from one active agent and dynamically activate others	Yes	Whether coordination can emerge through distributed agent-as-tool decisions

This ladder matters because “multi-agent” is often used as if it were one thing. It is not. A centralized planner assigning actions to many agents is operationally different from several agents deciding locally when to recruit one another. The first resembles a dispatcher. The second resembles an organization.
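One way to keep the four settings straight is to encode them along the two axes the table uses: who decides, and whether agents are callable as tools. The sketch below is only a bookkeeping aid; the `Control` and `Paradigm` types are hypothetical names, not anything defined by the paper.

```python
from dataclasses import dataclass
from enum import Enum


class Control(Enum):
    CENTRALIZED = "one LLM decides for all agents"
    DECENTRALIZED = "each agent has its own LLM"


@dataclass(frozen=True)
class Paradigm:
    name: str
    control: Control
    agents_as_tools: bool


# The four rows of the table above, expressed on two axes.
PARADIGMS = [
    Paradigm("centralized cooperation", Control.CENTRALIZED, False),
    Paradigm("centralized self-organization", Control.CENTRALIZED, True),
    Paradigm("decentralized cooperation", Control.DECENTRALIZED, False),
    Paradigm("self-organization cooperation", Control.DECENTRALIZED, True),
]

# Cooperative-tool metrics only make sense where cooperative tools exist.
with_cooperative_tools = [p.name for p in PARADIGMS if p.agents_as_tools]
print(with_cooperative_tools)
```

Seen this way, the ladder is really a two-by-two grid, and only half of it can say anything about team management.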

Tool-RoCo uses the ladder to separate three questions that are often mixed together:

Can the model call tools correctly?
Can the system coordinate multiple agents?
Can the agents manage their own team structure?

Most agent demos answer the first question loudly and the third question with suspicious silence.

CT and SO measure cooperation as behavior, not vibes

The paper’s most useful contribution is its measurement design. Tool-RoCo introduces two metrics: Cooperative Tool Ratio and Self-Organization ratio.

The Cooperative Tool Ratio, or CT, measures how often cooperative tools are used among all tool calls:

$$ CT = \frac{N_{cooperative}}{N_{tools}} $$

Here, $N_{cooperative}$ is the number of cooperative tool calls, such as activating or deactivating another agent. $N_{tools}$ is the total number of tool calls across episodes.

The Self-Organization ratio, or SO, measures the share of cooperative calls that are activation calls:

$$ SO = \frac{N_{activate}}{N_{cooperative} + \epsilon} $$

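Read these carefully. CT asks: does the system use collaboration as part of its action space? SO asks: when it uses collaboration, is it mostly adding agents rather than releasing them? The $\epsilon$ in the denominator is presumably a small constant that keeps SO defined when no cooperative calls occur.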
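Both ratios are easy to compute from a flat log of tool calls. The sketch below follows the two formulas above; the tool names, the sample log, and the value chosen for `EPSILON` are assumptions for illustration, since the article gives only the symbol.

```python
from collections import Counter

EPSILON = 1e-8  # assumed value; only the symbol appears in the formula


def cooperation_metrics(calls, epsilon=EPSILON):
    """Return (CT, SO) for a list of tool names, following the formulas above."""
    counts = Counter(calls)
    n_tools = len(calls)
    n_activate = counts["ACTIVATE"]
    n_cooperative = n_activate + counts["DISCONNECT"]
    ct = n_cooperative / n_tools if n_tools else 0.0
    so = n_activate / (n_cooperative + epsilon)
    return ct, so


# Hypothetical episode: ten tool calls, three of them cooperative.
calls = ["PICK", "ACTIVATE", "MOVE", "PLACE", "ACTIVATE",
         "MOVE", "PLACE", "DISCONNECT", "WAIT", "PICK"]
ct, so = cooperation_metrics(calls)
print(f"CT = {ct:.2%}, SO = {so:.2%}")  # CT = 30.00%, SO = 66.67%
```

In this made-up episode one of the two recruited helpers is released, so SO sits at two thirds. Delete the `DISCONNECT` call and SO climbs to 100% while the team quietly grows.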

This distinction is the paper’s quiet little trap. High SO may sound good if one reads “self-organization” as a positive label. But in Tool-RoCo, very high SO can mean something less flattering: agents keep activating others and rarely disconnect them. That is not mature cooperation. That is how a small task becomes a committee.

The common misconception is therefore worth correcting explicitly: a multi-agent LLM system is not cooperative simply because several agents are active. It is not cooperative simply because agents frequently call teammates. Cooperation requires judgment about when to call help, whom to call, and when to stop consuming the team’s attention.

The main evidence: tool competence improves faster than team judgment

Tool-RoCo evaluates GPT-4o-mini, GPT-5-mini, GPT-4.1, and GPT-5 across the three tasks and four paradigms. The paper uses ordinary tool-use measures as the first layer of evidence: tool calling, parameter validation, execution validity, reflection rate, modification rate, and win/loss completion.

This is the main evidence for basic operational competence. Larger models generally perform better at structured tool use. In the Cabinet task under centralized cooperation, for example, GPT-5 reaches 93.04% tool-calling success and 75.90% execution validity, while GPT-4o-mini is at 19.40% and 9.40% respectively. That tells us something unsurprising but necessary: if the model cannot even call the right tool with valid parameters, sophisticated autonomy is not the next problem. The first problem is literacy in the action space.

But the more interesting result comes when the benchmark becomes less forgiving.

Under decentralized cooperation, even strong tool callers struggle. In Table 3, none of the evaluated models wins the tasks under the decentralized cooperation paradigm, despite GPT-5 maintaining high tool-calling percentages in several cases. That gap is important. A model can produce valid local actions and still fail at distributed coordination.

Under self-organized cooperation, GPT-5 does better than the smaller models, winning Cabinet and Pack but not Sort. This suggests that scale and capability help, but they do not dissolve the coordination problem. Sort is structurally harder because limited reach forces relay-like cooperation. You cannot solve it by having every agent enthusiastically exist.

The paper’s second evidence layer is Table 4, which reports CT and SO only for the paradigms where agents are treated as tools. This is the sharper test.

In centralized self-organization, smaller models have very low CT values: GPT-4o-mini ranges from 0% to 1.73%, and GPT-5-mini from 1.69% to 2.38% across the tasks. Larger models improve, but only modestly: GPT-4.1 reaches 8.19% CT on Cabinet and 5.69% on Pack; GPT-5 reaches 9.28% on Cabinet but only 2.08% on Pack and 2.06% on Sort.

In self-organized cooperation, GPT-5 reaches higher CT values: 26.31% on Cabinet, 5.73% on Pack, and 10.81% on Sort. GPT-4.1 reaches 17.16%, 1.19%, and 2.78% respectively. Smaller models remain weak, often at zero or near-zero CT.

Across the paper, the authors report an aggregate cooperative tool usage ratio of only 7.09%. That is a number business readers should neither over-dramatize nor ignore. It does not mean “LLM agents cannot cooperate.” It means that in this benchmark, cooperative tool use remains sparse relative to ordinary action tools.

The activation pattern is even more revealing. The paper reports that activation tools account for 96.42% of cooperative tool usage. In the self-organized cooperation setting, SO is 100% for many model-task combinations. Translation: once agents enter the business of calling other agents, they mostly add collaborators and rarely remove them.

That is the diagnostic punchline. Current LLM agents show some ability to ask for help. They show much weaker evidence of knowing when help has become overhead.
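A back-of-envelope calculation puts the two aggregate figures on one scale. It assumes, purely for illustration, that the 7.09% and 96.42% figures describe the same pool of tool calls, and it ignores the small $\epsilon$ term:

$$ \frac{N_{activate}}{N_{tools}} \approx CT \times SO = 0.0709 \times 0.9642 \approx 0.0684 $$

$$ \frac{N_{cooperative} - N_{activate}}{N_{tools}} \approx 0.0709 \times (1 - 0.9642) = 0.0709 \times 0.0358 \approx 0.0025 $$

On those assumptions, out of every 1,000 tool calls roughly 68 bring a collaborator in and fewer than 3 send one home.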

Evidence map: what each result supports, and what it does not

Evidence item	Likely purpose in the paper	What it supports	What it does not prove
Four cooperation paradigms	Benchmark design and comparison scaffold	Autonomy can be evaluated progressively from centralized execution to distributed self-organization	That these four paradigms cover all practical agent architectures
Table 3 tool-use metrics	Main evidence for baseline tool competence	Larger models generally handle structured tool calls and execution better than smaller models	That high tool validity alone produces good team coordination
Decentralized and self-organized results	Main evidence for autonomy difficulty	Local decision-making and dynamic team formation are harder than centralized orchestration	That decentralized agents are always worse in real deployments
Table 4 CT and SO	Main evidence for cooperation and self-organization behavior	Cooperative tool use is limited, and activation dominates deactivation	That CT and SO are sufficient to measure every form of collaboration
Figure 3 token comparison	Implementation and cost observation	Centralized paradigms require more prompt context because global state and tool sets must be encoded together	That decentralized systems are always cheaper end-to-end
Future-work discussion	Boundary and research direction	Current cooperative tools are too simple and should be extended	That Tool-RoCo is already a complete training environment

This table matters because the paper is easy to misread. A casual summary would say: “Tool-RoCo benchmarks multi-agent cooperation, and larger models perform better.” True, but thin.

The better reading is: Tool-RoCo separates action correctness from organizational judgment. Larger models improve action correctness and partially improve cooperative tool use. But the lifecycle problem remains: agents are better at entering cooperation than exiting it.

The business lesson: evaluate coordination lifecycle, not just task completion

Cognaptus’ business interpretation is straightforward, with one useful sting: agentic workflow evaluation should not stop at final success.

For business automation, the analogy is not limited to robots moving mugs and cubes. The same structure appears in customer support workflows, logistics exception handling, procurement approval chains, software incident response, compliance review, and research automation. In each case, an agent may need to call another specialist agent, API, human reviewer, or process module. The question is not only whether it can call the specialist. The question is whether it can manage the specialist’s involvement.

A business deployment should therefore measure at least four lifecycle behaviors:

Lifecycle question	Business version	Failure mode if ignored
When should help be requested?	Escalate to legal, finance, engineer, planner, or human reviewer	The agent struggles alone and produces low-quality work
Which helper should be activated?	Choose the right department, model, tool, or policy module	The system burns cost on irrelevant specialists
When should help be released?	Stop involving agents once their contribution is no longer needed	Token cost, latency, and workflow noise accumulate
Did collaboration improve the outcome?	Compare final result, cost, time, and error recovery	Activity is mistaken for productivity

Tool-RoCo directly shows the research version of this problem. It does not directly show that a corporate document-review agent will behave the same way. That would be too convenient, and therefore suspicious. But the mechanism generalizes cleanly: once agents can call other agents, collaboration becomes an action that needs governance.

This is where the paper is most useful for AI product teams. The benchmark suggests that agent orchestration platforms should log and score not only tool-call success, but also collaboration decisions. In a production workflow, a useful evaluation dashboard would not merely show “number of agents used.” It would show unnecessary activations, missing activations, delayed releases, repeated failed calls, and whether added agents improved the outcome.
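As a rough sketch of what such scoring could look like, the snippet below counts unnecessary activations, helpers that were never released, and idle turns after a helper's last useful contribution. Everything here is hypothetical: the `HelperSpan` record, the helper names, and the premise that someone has already labeled which turns were "useful" are assumptions, and that labeling is the hard part in practice.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HelperSpan:
    """One helper's involvement in a workflow, measured in turns."""
    helper: str
    activated_at: int
    last_useful_at: Optional[int]  # None if the helper never contributed
    released_at: Optional[int]     # None if the helper was never released


def score_spans(spans, final_turn):
    """Summarize lifecycle problems across all helper spans."""
    report = {"unnecessary_activations": 0, "never_released": 0, "idle_turns": 0}
    for span in spans:
        end = span.released_at if span.released_at is not None else final_turn
        if span.last_useful_at is None:
            report["unnecessary_activations"] += 1
            report["idle_turns"] += end - span.activated_at
        else:
            report["idle_turns"] += end - span.last_useful_at
        if span.released_at is None:
            report["never_released"] += 1
    return report


spans = [
    HelperSpan("legal_review", activated_at=2, last_useful_at=4, released_at=5),
    HelperSpan("finance_check", activated_at=3, last_useful_at=None, released_at=None),
    HelperSpan("planner", activated_at=1, last_useful_at=6, released_at=None),
]
report = score_spans(spans, final_turn=10)
print(report)
# {'unnecessary_activations': 1, 'never_released': 2, 'idle_turns': 12}
```

None of these three numbers shows up in a success rate, which is exactly why they are worth a column on the dashboard.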

“More agents joined the workflow” is not a KPI. It is an invoice with better branding.

Centralized control is not primitive; it is a baseline

There is a tempting but wrong hierarchy in agent discourse: centralized equals old-fashioned, decentralized equals advanced. Tool-RoCo is more careful.

Centralized cooperation gives the system global visibility and stable control. It is less autonomous in the organizational sense, but it can be more reliable. The paper treats centralized cooperation partly as an upper-bound baseline for task efficiency because a single LLM sees the global state and assigns tools across agents.

For businesses, that matters. If the workflow is high-risk, repetitive, and well-scoped, centralized orchestration may be the right architecture. A central planner that routes claims, checks constraints, and assigns modules may outperform a swarm of agents negotiating among themselves like interns after an espresso tasting.

The decentralized and self-organized settings are more interesting when conditions are local, dynamic, or too complex for one controller to represent cleanly. Logistics, robotics, distributed monitoring, and multi-site operations may need that kind of autonomy. But Tool-RoCo’s results suggest that “need” and “current capability” should not be confused. Decentralized autonomy is attractive exactly because it removes a bottleneck; it is risky exactly because it removes a stabilizer.

The practical conclusion is not “avoid decentralized agents.” It is: deploy them only after evaluating the behaviors that central control used to hide.

What Tool-RoCo directly shows

The paper directly shows four things.

First, agent-as-tool is a workable benchmark design. Treating agents as callable tools makes cooperation observable in tool logs rather than inferred from final success.

Second, the four paradigms create a useful autonomy gradient. They let researchers compare centralized assignment, centralized activation management, decentralized local tool use, and fully self-organized activation.

Third, ordinary tool competence does not equal cooperative competence. GPT-5 can perform well on tool-calling metrics and still face failures under decentralized or self-organized cooperation, especially in harder tasks.

Fourth, current models show an imbalance between activation and deactivation. The reported 7.09% cooperative tool ratio and 96.42% activation share indicate that models rarely use cooperative tools overall, and when they do, they mostly add agents rather than release them.

That is enough to justify the paper’s value. It is not enough to justify declaring that agent swarms are doomed, or that a particular commercial model is “bad at teamwork.” The benchmark is narrower than that. Good. Narrow evidence is still evidence. It just refuses to become a TED Talk.

What Cognaptus infers for business use

Cognaptus’ inference is that agentic automation needs coordination observability before it needs more decorative autonomy.

A business agent platform should record whether an agent requested help, why it requested help, which helper was selected, what happened after activation, when the helper was released, and whether the workflow improved. These logs should become evaluation data, not debugging leftovers.
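A minimal sketch of such a record, assuming a JSON-lines log, might look like the following. The `CollaborationRecord` type and every field name are illustrative choices that mirror the questions in the paragraph above; they are not a standard schema or any vendor's format.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class CollaborationRecord:
    """One help request, from activation through release (hypothetical schema)."""
    workflow_id: str
    requester: str       # who asked for help
    helper: str          # which helper was selected
    reason: str          # why help was requested
    activated_turn: int
    outcome: str         # what happened after activation
    released_turn: Optional[int] = None     # None while the helper is still active
    improved_result: Optional[bool] = None  # None until the workflow is evaluated

    def to_json_line(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)


record = CollaborationRecord(
    workflow_id="claim-001",
    requester="intake_agent",
    helper="policy_checker",
    reason="coverage clause is ambiguous",
    activated_turn=3,
    outcome="clause resolved",
)
line = record.to_json_line()
restored = CollaborationRecord(**json.loads(line))
print(line)
```

Leaving `released_turn` and `improved_result` empty until they are known is deliberate: a log full of nulls in those two fields is itself a finding.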

This also changes procurement questions. Instead of asking a vendor, “Do you support multi-agent workflows?” ask:

Procurement question	Why it matters
Can the system distinguish task tools from collaborator tools?	Without this, cooperation is hard to audit
Does it log activation and deactivation decisions?	Without lifecycle logs, overhead is invisible
Can it measure whether helper agents improved outcomes?	Without outcome attribution, collaboration becomes theater
Can it penalize unnecessary agent involvement?	Without cost-sensitive metrics, the system may optimize for activity
Can it run centralized and decentralized modes on the same task?	Without architectural comparison, “autonomy” is just a design preference

Tool-RoCo does not provide a ready-made enterprise benchmark for every workflow. But it provides a useful pattern: build tasks where collaboration is optional, measurable, and sometimes costly. Then see whether agents use help like professionals or like people adding everyone to CC.

Boundaries: this is a benchmark, not a production verdict

The limitations are not decorative; they affect how the results should be used.

First, Tool-RoCo uses simulated multi-robot tasks derived from RoCo. Simulation is valuable because it gives controlled feedback and repeatable task structure, but it is not the same as a warehouse, factory, hospital, or trading desk.

Second, the cooperative tool set is intentionally simple. The paper’s future-work section notes that current cooperation tools are limited, essentially covering activation and deactivation, and do not capture all organizational structures. Real teams may delegate, negotiate, monitor, vote, escalate, hand off, or merge plans. Connect and disconnect are only the beginning.

Third, the experiments use five episodes per task, with ten turns and up to five replanning opportunities. That is enough to expose meaningful behavior, but not enough to settle reliability claims across domains.

Fourth, CT and SO are useful but incomplete. A low CT may mean poor cooperation, but it may also mean the task did not require much help in a particular state. A high SO may indicate active collaboration, but it may also indicate failure to disengage. These metrics need to be interpreted with task structure, not worshipped as dashboard idols.

Fifth, the paper evaluates specific model settings under its own prompts and environment. The result should guide evaluation design more than model ranking. The more durable lesson is not “Model X is better.” The durable lesson is “measure team formation and release.”

The uncomfortable upgrade path for agent systems

Tool-RoCo points toward a more mature evaluation stack for autonomous agents.

The first layer is tool correctness: did the model call the right function with valid parameters?

The second layer is execution validity: did the action respect environment constraints?

The third layer is adaptation: did the agent revise its plan after feedback?

The fourth layer is cooperation: did it call collaborators when needed?

The fifth layer is organizational discipline: did it stop using collaborators when they were no longer useful?

Most business demos live between layers one and three. Tool-RoCo is valuable because it pushes attention toward layers four and five. That is where agent systems start to resemble organizations rather than autocomplete with job titles.
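The five layers can be read as a cumulative scorecard: a system only gets credit for a layer once the layers beneath it hold. The sketch below takes that reading, which is an interpretation rather than anything the paper prescribes; the layer keys, the sample scores, and the 0.8 threshold are all made up for illustration.

```python
# Layer names paraphrase the five layers listed above.
LAYERS = [
    "tool_correctness",
    "execution_validity",
    "adaptation",
    "cooperation",
    "organizational_discipline",
]


def highest_layer_passed(scores, threshold=0.8):
    """Count consecutive layers, from the bottom, that meet the threshold."""
    passed = 0
    for layer in LAYERS:
        if scores.get(layer, 0.0) < threshold:
            break
        passed += 1
    return passed


# Hypothetical scores for a system that demos well but manages teams poorly.
demo_scores = {
    "tool_correctness": 0.95,
    "execution_validity": 0.85,
    "adaptation": 0.82,
    "cooperation": 0.40,
    "organizational_discipline": 0.05,
}
print(highest_layer_passed(demo_scores))  # 3
```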

This is also where cost enters. Persistent activation creates token overhead. In centralized systems, global state and all candidate tool sets inflate prompt context. In decentralized systems, local reasoning may reduce per-agent prompt size but increases coordination difficulty. Neither architecture is free. The choice is not “centralized bad, decentralized good.” The choice is what kind of complexity you want to pay for: context complexity, coordination complexity, or both, because apparently software architecture enjoys comedy.

Conclusion: autonomy begins when agents know when to leave

Tool-RoCo’s best idea is not that agents can be tools. The best idea is that once agents are tools, cooperation becomes measurable.

The paper shows that current LLM agents can often call ordinary tools, sometimes recover from feedback, and in stronger models, partially engage collaborators. But it also shows the gap that matters for real autonomy: agents struggle to manage the lifecycle of cooperation. They may activate helpers, but they rarely demonstrate refined judgment about deactivation, redundancy, and adaptive team size.

For business readers, this is the right level of seriousness. Multi-agent systems should not be dismissed because early self-organization is messy. Nor should they be trusted because a demo shows five agents talking in circles with impressive confidence. The next step is evaluation: measure who gets called, why they get called, when they leave, and whether their presence actually improves the job.

Autonomy is not the number of agents in motion. It is the quality of the decisions that decide motion.

Cognaptus: Automate the Present, Incubate the Future.

¹ Ke Zhang, Xiaoning Zhao, Ce Zheng, Jiahong Ning, Dandan Zhu, Wenqi Zhang, Chen Sun, and Toshiharu Sugawara, “Tool-RoCo: An Agent-as-Tool Self-organization Large Language Model Benchmark in Multi-robot Cooperation,” arXiv:2511.21510, submitted November 26, 2025, revised November 29, 2025.
