Communication sounds harmless until the wrong person gets the microphone.
That is true in meetings. It is also true in multi-agent AI systems. The polite version says agents “collaborate,” “debate,” and “refine each other’s reasoning.” The less decorative version is that one agent’s output becomes another agent’s input. If the first agent is wrong, confused, strategically misleading, or simply having one of those tiny synthetic breakdowns that LLMs have with impressive confidence, the system has just created a distribution channel for bad judgment.
The TodyComm paper makes this problem concrete: in multi-round LLM-based multi-agent systems, the hard part is not only what each agent thinks, but who is allowed to influence whom, at which round, and under what communication budget.[1]
That sounds like plumbing. It is not. It is governance.
Most multi-agent systems still treat the communication graph as something decided before inference: a complete graph, a random graph, a role-designed graph, or a graph learned during training and then mostly frozen in use. This is convenient. It is also slightly optimistic, in the way leaving every meeting attendee permanently unmuted is optimistic.
TodyComm’s central move is to make the communication topology itself dynamic. It learns round-by-round graph structures from agent behavior, optimizes them with task-level reward, and uses those structures both during agent interaction and final decision aggregation. The paper’s real contribution is not the familiar claim that “adaptive is better than static.” The interesting part is the mechanism: behavior memory, node screening, constrained edge selection, and task-driven reinforcement learning work together to make communication selective.
Fixed communication is a fragile assumption disguised as architecture
In a simple multi-agent workflow, fixed communication is easy to defend. Agents have assigned roles. The planner talks to the critic. The critic talks to the solver. The summarizer reads everything. Everyone performs their little office ritual. The diagram looks responsible.
But multi-round interaction changes the problem. Agents do not merely express independent opinions; they update after seeing other agents’ messages. A single bad message can change the downstream reasoning state of the recipient. Over several rounds, the question becomes less “which agent is correct?” and more “which path of influence should survive?”
TodyComm studies this under a dynamically adversarial setting. Six agents collaborate over four rounds. Some agents may become adversarial at unknown later rounds, while still keeping their original role and producing plausible-looking analysis. The paper evaluates three attack-rate regimes: fewer than half the agents adversarial, exactly half, and more than half.
This setup is synthetic, but the failure pattern is not. In business workflows, an “adversarial” agent does not have to be malicious. It can be a retrieval agent using stale documents, a finance agent hallucinating an assumption, a compliance agent overfitting to the wrong policy, or a sales agent smuggling optimism into forecasts because apparently even software can learn corporate culture.
The important point is that unreliability can be dynamic. An agent can be useful in one round and harmful in another. A static graph cannot react to that change. It can only keep routing.
TodyComm turns communication into the action
The paper formulates multi-round collaboration as a Markov Decision Process. The state includes the query, agent outputs, and relevant history. The action is not a generated answer. The action is the communication graph for each round, plus the final decision graph.
That framing matters. In many agent systems, communication is treated as background infrastructure. TodyComm treats it as a control variable.
At each round, the system must decide:
- which agents remain eligible to communicate;
- which directed edges should connect agents;
- how dense the communication should be under degree budgets;
- which agents should influence the final answer.
TodyComm does not search the space of graphs by brute force. The action space would become ugly very quickly. Instead, it learns node representations and uses them to construct constrained graphs.
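To put a number on "ugly very quickly," here is a back-of-the-envelope count. It is not taken from the paper; it simply counts every directed graph without self-loops on a set of labeled agents. The function name is a hypothetical one chosen for illustration.

```python
def count_directed_graphs(num_agents: int) -> int:
    """Number of directed graphs without self-loops on `num_agents` labeled nodes."""
    # Each ordered pair (sender, receiver) with sender != receiver is either
    # present or absent, so there are 2 ** (n * (n - 1)) possibilities.
    possible_edges = num_agents * (num_agents - 1)
    return 2 ** possible_edges


for n in (4, 6, 10):
    print(n, "agents:", count_directed_graphs(n), "candidate graphs per round")
```

For six agents that is 2^30, a little over a billion candidate graphs for a single round. Restricting to acyclic graphs shrinks the count, but it still grows far too fast to enumerate, which is why constructing graphs from learned node scores is the more plausible route.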
The mechanism has four core parts.
| Mechanism | What it does | Why it matters |
|---|---|---|
| Behavioral node features | Summarize each agent’s own answer, analysis, persistence, neighbor context, and disagreement with neighbors | The system observes behavior, not just assigned role |
| GRN-based memory | Uses a gated recurrent network to update per-agent embeddings across rounds | Agent reliability can change over time |
| Node screening and potentials | Estimates node potentials and screens out low-potential agents before graph construction | Bad agents can be quietly excluded before they spread influence |
| Constrained edge construction | Builds directed acyclic graphs under masks and in/out degree budgets | Communication becomes selective, not merely adaptive in name |
The final decision is also treated as a graph problem. After the last communication round, TodyComm performs one more update and constructs a decision graph with learned weights. That is a small but important design choice. A system can communicate well and still fail if the final aggregation gives equal weight to agents that should have been ignored.
In other words: the paper does not only ask who should talk. It asks who should be listened to at the end.
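A minimal sketch of why the final weighting matters, assuming the decision step reduces to a weighted vote over discrete answers. The paper's decision graph uses learned weights; the weights, agent names, and the `weighted_decision` helper below are hypothetical and only illustrate the contrast with equal voting.

```python
from collections import defaultdict


def weighted_decision(answers, weights):
    """Return the answer with the largest total weight (ties broken alphabetically)."""
    totals = defaultdict(float)
    for agent, answer in answers.items():
        # Agents without an explicit weight contribute nothing.
        totals[answer] += weights.get(agent, 0.0)
    return max(sorted(totals), key=lambda ans: totals[ans])


# Hypothetical final-round answers: three agents have drifted to "B".
answers = {"a1": "B", "a2": "B", "a3": "A", "a4": "B"}

equal_weights = {agent: 1.0 for agent in answers}
# Illustrative weights only: the aggregator trusts a3 far more than the rest.
learned_weights = {"a1": 0.05, "a2": 0.05, "a3": 0.85, "a4": 0.05}

print("equal vote:", weighted_decision(answers, equal_weights))
print("weighted vote:", weighted_decision(answers, learned_weights))
```

With equal weights the majority answer wins regardless of who produced it. With weights that discount the drifting agents, the single trusted agent decides the outcome.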
The mechanism is a trust policy, not a personality score
It is tempting to describe TodyComm as learning which agents are “good” or “bad.” That is too crude.
The paper’s node potentials are not permanent reputation labels. They are round-specific estimates derived from behavior. The node feature design includes self information, neighborhood information, and difference information. Self information captures the agent’s previous solution and analysis, plus persistence from its initial answer. Neighborhood information summarizes neighboring agents’ behavior and disagreement. Difference information measures how closely the agent aligns with or deviates from its neighborhood.
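The three feature groups can be sketched with plain answer labels. This is a deliberate simplification: the paper works with learned representations of answers and analysis, while the version below only compares discrete answers. The function name, the dictionary keys, and the exact formulas are assumptions made for illustration.

```python
def node_features(agent, answers, initial_answers, in_neighbors):
    """Toy self / neighborhood / difference features from discrete answers."""
    own = answers[agent]
    # Self information: did the agent keep its initial answer?
    persistence = 1.0 if own == initial_answers[agent] else 0.0

    neighbor_answers = [answers[n] for n in in_neighbors.get(agent, [])]
    if neighbor_answers:
        # Neighborhood information: how split are the neighbors among themselves?
        top_share = max(neighbor_answers.count(a) for a in set(neighbor_answers))
        neighbor_disagreement = 1.0 - top_share / len(neighbor_answers)
        # Difference information: share of neighbors that agree with this agent.
        alignment = sum(a == own for a in neighbor_answers) / len(neighbor_answers)
    else:
        neighbor_disagreement, alignment = 0.0, 0.0

    return {
        "persistence": persistence,
        "neighbor_disagreement": neighbor_disagreement,
        "alignment": alignment,
    }


answers = {"a1": "B", "a2": "B", "a3": "A"}
initial_answers = {"a1": "B", "a2": "A", "a3": "A"}
in_neighbors = {"a1": ["a2", "a3"]}  # a1 received messages from a2 and a3

print(node_features("a1", answers, initial_answers, in_neighbors))
print(node_features("a2", answers, initial_answers, in_neighbors))
```

Even in this toy form, none of the three numbers is a verdict on the agent. They describe what the agent did this round, relative to its own history and to the agents it heard from.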
This is closer to a dynamic trust policy than a trust score.
That distinction is useful for business readers. In enterprise agent workflows, the right question is rarely “which agent is always reliable?” A retrieval agent may be excellent on current policy documents and useless on legacy contracts. A coding agent may be reliable on syntax and reckless on system design. A domain expert agent may be strong until the task shifts outside its assumed jurisdiction.
TodyComm’s architecture says: do not hard-code trust into the org chart. Infer it from behavior in context.
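Inferring trust from behavior requires some memory across rounds. The sketch below is a toy scalar gate, not the paper's recurrent module: it keeps one number per agent and lets new evidence overwrite the old value more strongly when the two disagree. The function name, the gate parameters, and the per-round signals are all hypothetical.

```python
import math


def gated_update(memory, observation, gate_weight=4.0, gate_bias=-1.0):
    """Blend old memory with a new observation through a sigmoid gate."""
    # The gate opens wider when the observation departs further from memory.
    gate_input = gate_weight * abs(observation - memory) + gate_bias
    z = 1.0 / (1.0 + math.exp(-gate_input))
    return (1.0 - z) * memory + z * observation


# Hypothetical per-round reliability signals for one agent that turns
# unreliable at round 3.
signals = [0.9, 0.9, 0.1, 0.1]

memory = 0.9
history = []
for signal in signals:
    memory = gated_update(memory, signal)
    history.append(memory)

print([round(value, 3) for value in history])
```

The estimate stays put while behavior is consistent and drops quickly once it changes, which is the property a static role assignment cannot provide.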
The graph construction step then converts these behavioral estimates into communication structure. It prioritizes edges between high-potential nodes, applies screening to remove suspicious candidates, and respects constraints such as acyclicity and node-wise degree budgets. The theoretical section frames regret in final utility as bounded partly by screening regret, product-score estimation loss, and edge-additivity loss. For practical readers, the message is simpler: if screening is poor, graph quality suffers; if graph quality suffers, final task utility suffers. The formalism is not decorative math pasted on top. It explains where the control layer can fail.
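The construction step can be sketched as a greedy procedure. This is an illustration under stated assumptions, not the paper's algorithm: the threshold-based screening, the product-of-potentials edge score, and every name and default value here are hypothetical choices.

```python
def build_round_graph(potentials, keep_threshold=0.5, in_budget=1, out_budget=1):
    """Screen agents, then greedily add edges under acyclicity and degree budgets."""
    # Node screening: drop agents whose potential falls below the threshold.
    kept = [a for a, p in potentials.items() if p >= keep_threshold]

    # Fix an order by descending potential (ties broken by name). Edges only run
    # from earlier to later agents in this order, so no cycle can form.
    order = sorted(kept, key=lambda a: (-potentials[a], a))
    rank = {a: i for i, a in enumerate(order)}

    # Assumed edge score: product of the two endpoint potentials.
    candidates = [
        (potentials[u] * potentials[v], u, v)
        for u in order
        for v in order
        if rank[u] < rank[v]
    ]
    candidates.sort(key=lambda item: (-item[0], item[1], item[2]))

    in_degree = {a: 0 for a in kept}
    out_degree = {a: 0 for a in kept}
    edges = []
    for _, sender, receiver in candidates:
        # Respect the node-wise degree budgets before accepting an edge.
        if out_degree[sender] < out_budget and in_degree[receiver] < in_budget:
            edges.append((sender, receiver))
            out_degree[sender] += 1
            in_degree[receiver] += 1
    return kept, edges


# Hypothetical round-specific potentials.
potentials = {"planner": 0.9, "critic": 0.8, "solver": 0.7, "retriever": 0.2}
kept, edges = build_round_graph(potentials)
print("kept:", kept)
print("edges:", edges)
```

In this example the low-potential retriever never enters the graph, and the budget of one forces a chain rather than a fan-out. If the screening step had kept the wrong agent, every later choice would inherit that mistake, which is the intuition behind the screening term in the regret decomposition.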
The main evidence: dynamic routing helps most when bad agents are numerous
The main experiments evaluate TodyComm across five benchmarks: MMLU, ARC-Challenge, GSM8K, OpenBookQA, and MedQA. The baselines include Random Graph, Complete Graph, G-Designer, and AgentPrune.
The strongest pattern appears when adversarial agents reach or exceed half of the agent pool. This is where fixed or training-pruned communication structures become brittle.
Selected results under the “more than 50% adversarial” regime show the size of the gap:
| Benchmark | Best baseline accuracy under >50% attack | TodyComm accuracy under >50% attack | Interpretation |
|---|---|---|---|
| MMLU | 53.38% | 64.71% | Dynamic communication protects performance when majority pressure is hostile |
| ARC-Challenge | 69.79% | 81.05% | The gain is not limited to one dataset style |
| GSM8K | 73.00% | 83.19% | Mathematical reasoning also benefits from selective routing |
| OpenBookQA | 41.00% | 80.50% | The largest gap appears where adversarial messages are especially damaging |
| MedQA | 45.67% | 57.50% | Gains remain visible in the medical QA setting |
The OpenBookQA result is the most dramatic. The paper first shows that a single adversarial sender can cause a 61.11% accuracy drop for a reliable receiving agent on OpenBookQA. In that environment, indiscriminate communication is not collaboration. It is contamination with a group-chat interface.
The average results across all five benchmarks reinforce the same story. Under more than 50% attack, TodyComm reaches 73.39% average accuracy, compared with 55.78% for AgentPrune, 54.07% for G-Designer, 50.70% for Random Graph, and 50.23% for Complete Graph. Its token usage is also lower than most baselines and comparable to AgentPrune.
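As a quick consistency check, the per-benchmark figures in the table above reproduce the reported average. The snippet uses only numbers quoted in this article. Note that the "best baseline" column may be a different baseline on each benchmark, so the per-benchmark gaps are not gaps against any single competing method.

```python
# (best baseline accuracy, TodyComm accuracy) under the >50% attack regime,
# copied from the table above.
results = {
    "MMLU": (53.38, 64.71),
    "ARC-Challenge": (69.79, 81.05),
    "GSM8K": (73.00, 83.19),
    "OpenBookQA": (41.00, 80.50),
    "MedQA": (45.67, 57.50),
}

# Gap in percentage points between TodyComm and the best baseline.
gaps = {name: round(tody - base, 2) for name, (base, tody) in results.items()}
tody_mean = sum(tody for _, tody in results.values()) / len(results)

for name, gap in gaps.items():
    print(f"{name}: +{gap:.2f} points")
print(f"TodyComm mean accuracy: {tody_mean:.2f}%")
```

Four of the five gaps sit between roughly 10 and 12 points. OpenBookQA, at 39.50 points, is the outlier that pulls the average gap up.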
This is the part many summaries will flatten into “TodyComm performs better.” That is accurate but incomplete. The mechanism tells us why: the system learns to reduce the influence pathways of agents whose behavior has become harmful. The performance gain is not simply from having better individual agents. It comes from changing the social geometry of the system.
More communication is not always more robustness
The paper’s degree-budget experiments are especially useful because they puncture a common assumption: if the system is uncertain, let agents communicate more.
Sometimes that helps. Sometimes it just gives bad messages more roads.
The paper tests node-wise in-degree and out-degree budgets of 1 and 2 across several benchmarks. The results are not a clean monotonic curve where “more budget equals better performance.” Instead, moderate restriction often preserves or improves accuracy while reducing token usage.
A few examples:
| Setting | No-budget TodyComm | Budgeted TodyComm | What changed |
|---|---|---|---|
| MMLU, >50% attack | 64.71% accuracy, 414 tokens | Budget 1: 70.59% accuracy, 397 tokens | Higher accuracy with fewer tokens |
| ARC-C, >50% attack | 81.05% accuracy, 366 tokens | Budget 2: 84.95% accuracy, 372 tokens | Small token increase, stronger accuracy |
| OpenBookQA, >50% attack | 80.50% accuracy, 459 tokens | Budget 1: 83.50% accuracy, 337 tokens | Better accuracy and much lower token use |
| MedQA, >50% attack | 57.50% accuracy, 566 tokens | Budget 2: 58.00% accuracy, 564 tokens | Slight accuracy gain with similar token use |
The OpenBookQA budget result is particularly revealing. Because the benchmark is highly sensitive to adversarial communication, limiting edges can become a safety feature. Less communication is not merely cheaper; it is cleaner.
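The trade-off is easier to see as differences. The snippet below only rearranges the numbers from the table above; the variable names are arbitrary.

```python
# (benchmark, no-budget accuracy, no-budget tokens, budget, budgeted accuracy,
#  budgeted tokens), copied from the table above for the >50% attack regime.
rows = [
    ("MMLU", 64.71, 414, 1, 70.59, 397),
    ("ARC-C", 81.05, 366, 2, 84.95, 372),
    ("OpenBookQA", 80.50, 459, 1, 83.50, 337),
    ("MedQA", 57.50, 566, 2, 58.00, 564),
]

deltas = {}
for name, base_acc, base_tok, budget, acc, tok in rows:
    acc_gain = round(acc - base_acc, 2)              # percentage points
    token_change = tok - base_tok                    # absolute token difference
    token_change_pct = 100 * token_change / base_tok # relative token difference
    deltas[name] = (acc_gain, token_change)
    print(f"{name} (budget {budget}): {acc_gain:+.2f} points, "
          f"{token_change:+d} tokens ({token_change_pct:+.1f}%)")
```

OpenBookQA shows the pattern most clearly: 122 fewer tokens, roughly a quarter of the original usage, alongside a 3-point accuracy gain.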
For businesses, this matters because token cost is usually treated as an efficiency concern. TodyComm suggests it can also be a reliability concern. A lower communication budget can reduce latency and cost, yes, but it may also reduce the number of opportunities for an unreliable agent to poison the workflow.
That is a useful design principle: token governance and trust governance should not be separate dashboards.
The appendix tests are not a second thesis
The paper includes several additional tests: scalability, generalization, graph-construction ablations, node-feature ablations, robustness across LLMs, and embedding-model variants. These are easy to misread as a pile of extra claims. They are better understood as checks on the mechanism.
| Test group | Likely purpose | What it supports | What it does not prove |
|---|---|---|---|
| Main benchmark comparison | Main evidence | Dynamic task-driven routing improves performance under dynamic adversarial settings | Universal superiority in all real-world agent workflows |
| Degree-budget tests | Robustness and operational constraint test | Selective communication can preserve or improve accuracy while reducing tokens | That the same budget works for every task |
| Scalability tests | Stress test | TodyComm remains usable as agent count grows under budget constraints | Large-scale production orchestration with hundreds of tools |
| Generalization tests | Robustness across conditions | Learned routing transfers across agent counts, tasks, outbreak modes, and attack mechanisms | Full out-of-domain reliability |
| Graph-construction ablation | Mechanism validation | Learning potentials and edge ordering matter | That every implementation detail is optimal |
| Node-feature ablation | Mechanism validation | Self-only behavior is insufficient; neighbor and difference signals matter | That this feature set is final or complete |
The scalability test is modest but informative. On MMLU, TodyComm is tested with 6, 10, 15, and 20 agents under varying budgets. With 20 agents, an attack rate of 0.6, 8 reliable agents, and budget 5, accuracy reaches 68.63% with 469 tokens. Under a harsher 20-agent setting with attack rate 0.8, only 4 reliable agents, and budget 3, accuracy drops to 54.90%, but the system remains functional rather than collapsing into the comedy genre.
The generalization tests matter because a dynamic routing policy would be less useful if it only worked under the exact training configuration. The paper evaluates transfer across number of agents, task domain, adversary outbreak mode, and attack mechanism. For example, training on 4 agents and evaluating on 6 agents produces only slight degradation on MMLU and even marginal improvement on ARC-Challenge. Training on one dataset and evaluating on another also retains relatively strong performance. The hardest combined setting appears when the model is trained under lower attack intensity and evaluated under higher, less predictable attack patterns; there, performance can fall sharply. That is not a footnote. It is the boundary of the method showing its face.
The ablations are more directly diagnostic. Removing learning and using oracle reliability labels for node potentials still performs poorly under heavy attack, reaching only 49.70% on MMLU when attack rate exceeds 50%. That sounds odd at first: why would oracle reliability not solve the problem? The paper explains that when reliable agents disagree, random final selection can drag performance toward chance. The lesson is sharp: knowing who is reliable is not enough. The order and structure of communication still matter.
The random-ordering ablation also degrades performance across attack rates. This supports the claim that constrained graph construction is doing real work, not merely decorating a learned score with graph vocabulary.
The node-feature ablation gives the cleanest design lesson. Using self information only falls to 41.18% under more than 50% attack on MMLU. Removing persistence improves over self-only but still trails the full model. Variants that keep self plus difference or self plus neighbor information perform much better. In plainer language: an agent’s own output is not enough. The system needs to know how that output behaves relative to neighbors and across time.
That is almost too human. Reliability is partly individual, partly relational, and partly temporal. Apparently the machines have also discovered office politics.
The business value is not “more agents”; it is communication governance
For enterprise AI, the practical implication is not that every company should immediately implement TodyComm. The direct evidence is benchmark-based, adversarially framed, and evaluated in controlled multi-agent QA settings. Production workflows are messier: tools have side effects, tasks are multi-objective, reward signals are delayed, and business correctness is often not a single labeled answer.
The useful inference is architectural.
TodyComm points toward a governance layer between agents: a learned routing policy that decides which agents exchange information, how much they exchange, and whose final answer receives weight. This layer can be optimized for task success, token cost, latency, and reliability signals.
A business implementation would likely look less like a pure research replica and more like an agent-communication controller:
| Business workflow component | TodyComm-inspired design principle |
|---|---|
| Role-based agents | Keep roles, but do not confuse role with reliability |
| Tool and document agents | Track behavioral consistency, disagreement, and downstream correction signals |
| Multi-step workflows | Recompute communication routes at each step, not only at initialization |
| Cost control | Treat communication budget as both cost policy and risk policy |
| Final approval | Weight final aggregation by observed reliability, not by equal voting or seniority cosplay |
| Monitoring | Log which agents influenced which decisions, so failures can be traced through the graph |
The most valuable product version may not be a fully autonomous router trained by reinforcement learning from day one. Many firms do not have clean reward functions, enough labeled task outcomes, or the patience to discover that their internal data pipeline is held together by optimism and three spreadsheets named “final_v7.”
A realistic path is staged:
- Start with fixed role graphs but log inter-agent influence paths.
- Add rule-based communication budgets for high-risk workflows.
- Measure which agent messages improve or degrade downstream outcomes.
- Train or tune a routing policy once enough task-level feedback exists.
- Add dynamic final aggregation only after communication routing is stable.
This staged version preserves the business lesson without pretending that the research system can be pasted into production with a confident diagram and a procurement invoice.
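The first stage, logging influence paths, needs very little machinery. Here is a minimal sketch, assuming a message sent in round r reflects only what the sender had seen in earlier rounds. The `InfluenceLog` class, its methods, and the agent names are hypothetical and do not come from the paper.

```python
from collections import defaultdict, deque


class InfluenceLog:
    """Hypothetical audit log of who sent messages to whom in each round."""

    def __init__(self):
        # (round, receiver) -> set of senders heard in that round
        self._senders = defaultdict(set)

    def record(self, round_idx, sender, receiver):
        self._senders[(round_idx, receiver)].add(sender)

    def upstream(self, agent, last_round):
        """Agents whose messages could have reached `agent` by `last_round`."""
        influencers = set()
        visited = set()
        frontier = deque([(agent, last_round)])
        while frontier:
            node, rnd = frontier.popleft()
            if rnd < 1 or (node, rnd) in visited:
                continue
            visited.add((node, rnd))
            for sender in self._senders.get((rnd, node), ()):
                influencers.add(sender)
                # The sender's message was shaped by what it saw before this round.
                frontier.append((sender, rnd - 1))
            # The node's own earlier state also carries forward.
            frontier.append((node, rnd - 1))
        return influencers


log = InfluenceLog()
log.record(1, "retriever", "planner")
log.record(2, "planner", "solver")
log.record(2, "critic", "solver")
log.record(2, "auditor", "planner")  # arrives too late to shape planner's round-2 message

print(sorted(log.upstream("solver", last_round=2)))
```

When a workflow produces a bad answer, a trace like this narrows the search from "some agent was wrong" to a specific set of upstream senders, and it works on fixed role graphs before any learned routing exists.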
Where the paper is strongest, and where it remains bounded
The paper is strongest when it shows that dynamic, behavior-driven communication matters under changing adversarial pressure. It also does a good job separating several supporting claims: main benchmark performance, token efficiency, budget sensitivity, scalability, generalization, and ablation-level mechanism validation.
The limits are equally important.
First, the setting is adversarial by design. That is useful because it stresses the system, but many enterprise failures are not adversarial in the formal sense. They are ambiguous, stale, incomplete, or incentive-distorted. TodyComm likely has relevance there, but the paper does not directly prove it.
Second, task utility is relatively clean in benchmark settings. Real business workflows often have delayed or contested reward signals. Was the customer-support answer correct? Did the compliance agent reduce risk or just slow the process? Did the investment-research agent produce insight or a well-formatted hallucination with Bloomberg cosplay? Reinforcement learning needs feedback. Enterprises often have vibes, tickets, and partial labels.
Third, the agent pool is still small. The scalability tests extend to 20 agents, which is meaningful for research but not the final word for large operational systems with dozens of tools, retrieval sources, and human-in-the-loop checkpoints.
Fourth, routing policy itself becomes a governance object. If the system learns whom to ignore, someone must audit why. A wrong exclusion can be as damaging as a wrong inclusion, especially in compliance, medicine, finance, and legal workflows. Dynamic silence is still a decision.
These boundaries do not weaken the paper’s core contribution. They define where the idea must mature before becoming infrastructure.
The deeper shift: agent systems need organizational design
The usual sales pitch for multi-agent AI is that more agents create more perspectives. TodyComm offers a less cheerful but more useful correction: more perspectives help only when influence is governed.
A complete graph is not a democracy. A random graph is not diversity. A fixed role graph is not governance. These are communication defaults, not trust policies.
The paper’s mechanism-first lesson is that multi-agent systems need something closer to organizational design: routing, screening, budgeting, memory, and final authority. Once agents can influence each other over multiple rounds, the system designer is no longer just building a reasoning pipeline. They are designing an institution, albeit one that runs on tokens and occasionally fails at arithmetic.
TodyComm does not make agents inherently wiser. It makes the system more selective about influence. That is a smaller claim, and therefore a more valuable one.
The future of agentic AI will not be decided only by larger base models or longer context windows. It will also be decided by whether systems learn when to stop listening.
Cognaptus: Automate the Present, Incubate the Future.
[1] Wenzhe Fan, Tommaso Tognoli, Henry Peng Zou, Chunyu Miao, Yibo Wang, and Xinhua Zhang, “TodyComm: Task-Oriented Dynamic Communication for Multi-Round LLM-based Multi-Agent System,” arXiv:2602.03688, 2026.