Meetings are expensive, even when the employees are synthetic
Every organization has seen the meeting that should have been an email. Everyone attends, everyone hears everything, and somehow the person who needed one precise fact receives it after forty minutes of theatrical alignment.
Multi-agent AI systems often reproduce the same disease, only faster. A coding agent, a testing agent, a research agent, a planning agent, and a manager agent are assembled into a “team.” Then the system lets them talk through a fixed pipeline, a broadcast channel, or a reusable graph. It feels collaborative. It is also a polite way to dump irrelevant context into everyone’s prompt and call the mess intelligence.
The paper behind DyTopo asks a sharper question: what if the main bottleneck in multi-agent reasoning is not the number of agents, but the communication topology? [1]
DyTopo, short for Dynamic Topology Routing, treats agent communication as an adaptive routing problem. In each round, a manager sets the current goal. Each worker agent says what it needs and what it can provide. A semantic matching layer embeds these “query” and “key” descriptors, compares them, and builds a sparse directed graph. Messages travel only along the edges that survive the relevance threshold.
That is the core move. DyTopo does not merely add agents. It teaches the system to stop inviting the whole company to every subproblem.
The real problem is not collaboration; it is indiscriminate collaboration
Most multi-agent LLM systems begin with a plausible intuition: different agents can specialize. One agent writes code, another tests, another researches algorithms, another checks logic. In mathematical reasoning, one agent can parse the problem, one can solve it, and another can verify the derivation.
The intuition is fine. The wiring is often lazy.
The paper contrasts DyTopo with common communication patterns: single-agent prompting, one-turn multi-agent generation, random sparse topologies, and fixed frameworks such as AgentScope-style coordination. These systems differ in surface design, but many share the same weakness: their communication structure is not sufficiently conditioned on the stage of reasoning.
Early in a task, broad exchange may help. Agents need to establish the problem frame, identify constraints, and propose candidate strategies. Later, broad exchange becomes less attractive. The system may need a tester to inspect the developer’s implementation, or a verifier to check the solver’s derivation. Adding more voices at that stage can create noise, not wisdom.
DyTopo’s premise is therefore simple but operationally important:
| Common assumption | DyTopo’s correction | Practical meaning |
|---|---|---|
| More agents improve reasoning. | Only useful agent interactions improve reasoning. | Team size matters less than routing quality. |
| More rounds create better answers. | Rounds have task-dependent returns and can become harmful. | Stopping policy is part of the system design. |
| Broadcast increases shared context. | Broadcast also increases irrelevant context. | Context is a budget, not a storage closet. |
| Communication graphs are architecture choices. | Communication graphs should change during inference. | Workflow structure can be adaptive without retraining the model. |
This matters because many business AI workflows will not fail dramatically. They will fail quietly through context dilution. The legal-review agent reads irrelevant sales notes. The invoice-checking agent receives strategy commentary. The operations agent sees compliance concerns before the facts are stable. Everyone “collaborates,” and the final output becomes slightly more confused.
The expensive part is not the token bill. The expensive part is that nobody knows which message contaminated the decision.
DyTopo turns agent coordination into query-key routing
DyTopo’s mechanism has five moving parts.
First, a Manager agent defines the round-level goal. This is not decorative supervision. The round goal conditions what each worker should focus on, and therefore affects the descriptors that later determine the communication graph.
Second, each worker agent performs a single forward pass per round. It produces a public message, a private message, a query descriptor, and a key descriptor. The query describes what the agent currently needs. The key describes what it can provide.
In a code-generation setting, a Developer might need “test cases and edge-case validation” while offering “a complete Python implementation.” A Tester might need “the developer’s implementation to verify” while offering “a test report and failing cases.” In a math setting, a Solver might need a structured plan; a Verifier might need a full derivation to inspect.
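A worker's per-round output can be pictured as a small record. The sketch below is illustrative only: the class name `WorkerOutput` and its field names are assumptions rather than identifiers from the paper, and the message strings are placeholders. The descriptor strings are the examples quoted above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkerOutput:
    """One worker's output for a single round (hypothetical field names)."""
    agent: str
    public_message: str
    private_message: str
    query: str  # what the agent currently needs
    key: str    # what the agent can provide


developer = WorkerOutput(
    agent="Developer",
    public_message="<draft implementation goes here>",
    private_message="<working notes go here>",
    query="test cases and edge-case validation",
    key="a complete Python implementation",
)
tester = WorkerOutput(
    agent="Tester",
    public_message="<test report goes here>",
    private_message="<working notes go here>",
    query="the developer's implementation to verify",
    key="a test report and failing cases",
)

# Only the short descriptors feed the matching step; the messages are
# what gets routed once edges exist.
descriptors = {o.agent: (o.query, o.key) for o in (developer, tester)}
```

Keeping descriptors separate from message bodies is what lets the routing layer stay small: it compares two short strings per agent, not whole transcripts.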
Third, DyTopo embeds the query and key descriptors with a fixed sentence embedding model and computes cosine similarity between every potential provider and consumer pair. If agent $i$’s key matches agent $j$’s query strongly enough, DyTopo activates a directed edge from $i$ to $j$.
Fourth, a threshold controls sparsity. Too low, and the graph becomes noisy. Too high, and useful links disappear. The topology is not just a pretty diagram; it is a communication budget.
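The matching and thresholding steps can be sketched in a few lines. This is a minimal illustration under loud assumptions: `toy_embed` is a bag-of-words stand-in for the fixed sentence embedding model (it is not a real encoder), the function names are hypothetical, the Researcher descriptors are invented, and excluding self-loops is an assumption rather than something the article states.

```python
import math
import re
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Stand-in embedding: bag-of-words counts (NOT a real sentence encoder)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def build_edges(
    keys: dict[str, str], queries: dict[str, str], threshold: float
) -> list[tuple[str, str, float]]:
    """Directed edges (provider, consumer, score): provider's key vs consumer's query."""
    edges = []
    for provider, key_text in keys.items():
        for consumer, query_text in queries.items():
            if provider == consumer:
                continue  # assumption: an agent does not route to itself
            score = cosine(toy_embed(key_text), toy_embed(query_text))
            if score >= threshold:
                edges.append((provider, consumer, round(score, 3)))
    return edges


keys = {
    "Developer": "a complete Python implementation",
    "Tester": "a test report and failing cases",
    "Researcher": "candidate algorithms and complexity notes",  # invented
}
queries = {
    "Developer": "test cases and edge-case validation",
    "Tester": "the developer's implementation to verify",
    "Researcher": "the problem statement and constraints",  # invented
}

for threshold in (0.1, 0.2, 0.6):
    pairs = [(p, c) for p, c, _ in build_edges(keys, queries, threshold)]
    print(threshold, pairs)
```

With these toy inputs the middle threshold keeps only the Developer and Tester exchange, the lowest one admits extra edges that exist only because descriptors share filler words, and the highest one removes every edge. A real sentence encoder would score differently, but the trade-off has the same shape.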
Fifth, messages are routed only along the activated edges and incorporated into each recipient’s next-round memory. A synchronization barrier ensures that the topology is induced from the current round before memories are updated for the next round. DyTopo also defines deterministic message ordering, using topological sorting when possible and a cycle-breaking heuristic when the graph contains cycles. This sounds technical because it is; the business translation is simpler: if a workflow routes evidence into prompts, the order and eligibility of that evidence must be reproducible.
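Deterministic ordering is easy to get subtly wrong, so a small sketch may help. It uses a Kahn-style topological sort with alphabetical tie-breaking. The cycle-breaking rule here (pick the remaining node with the fewest unresolved incoming edges, then by name) is a placeholder assumption; the article does not specify the paper's heuristic, and `deterministic_order` is a hypothetical name.

```python
def deterministic_order(nodes: list[str], edges: list[tuple[str, str]]) -> list[str]:
    """Order agents so providers come before consumers where possible.

    Every edge endpoint must appear in `nodes`.
    """
    indegree = {n: 0 for n in nodes}
    successors = {n: set() for n in nodes}
    for src, dst in edges:
        if dst not in successors[src]:  # ignore duplicate edges
            successors[src].add(dst)
            indegree[dst] += 1

    remaining = set(nodes)
    order = []
    while remaining:
        ready = sorted(n for n in remaining if indegree[n] == 0)
        if ready:
            pick = ready[0]  # alphabetical tie-break keeps runs reproducible
        else:
            # Cycle detected. Placeholder heuristic (an assumption, not the paper's).
            pick = min(remaining, key=lambda n: (indegree[n], n))
        order.append(pick)
        remaining.remove(pick)
        for dst in successors[pick]:
            if dst in remaining:
                indegree[dst] -= 1
    return order


agents = ["Tester", "Developer", "Researcher"]
acyclic = [("Researcher", "Developer"), ("Developer", "Tester")]
cyclic = acyclic + [("Tester", "Developer")]

print(deterministic_order(agents, acyclic))
print(deterministic_order(agents, cyclic))
```

Whatever rule is used for cycles, the property worth testing is that the same graph always yields the same order, regardless of the order in which agents or edges were supplied.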
The important design principle is the decoupling. Agents do not directly choose who deserves their message. They express need and offer. Routing is handled externally through an inspectable semantic matching layer.
That is a useful separation of concerns. Agent prompts remain role-focused. The system owns coordination.
The benchmark result is not “agents win”; it is “routing wins”
The paper evaluates DyTopo on code generation and mathematical reasoning benchmarks: HumanEval, APPS-Competition, MATH-500, and Omni-MATH. It tests multiple LLM backbones, including MiMo-V2-Flash, GPT-oss-120B, Llama3-8B-Instruct, and Qwen3-8B.
DyTopo is the best method across all 16 backbone-dataset settings reported in the main table (four backbones on four benchmarks). The improvement over the strongest non-DyTopo baseline ranges from 0.90 to 17.14 percentage points, with a mean improvement of about 6.09 points.
That average is useful, but it is not the most interesting part.
The pattern of improvement tells us where routing matters most. On harder math tasks, the gains are especially visible. For MATH-500, DyTopo reaches 47.14% with Llama3-8B-Instruct, compared with 30.00% for the strongest baseline in that row. On Omni-MATH, Qwen3-8B improves to 51.43%, compared with 35.71% for the strongest non-DyTopo baseline.
The paper’s coding results are also positive, but the interpretation is slightly different. On HumanEval, several models are already strong, so the available room for improvement is smaller. On APPS-Competition, where algorithmic design and edge cases matter more, routing becomes more valuable. For GPT-oss-120B on APPS-Competition, DyTopo reports 69.66%, compared with 60.55% for the strongest non-DyTopo baseline.
A lazy summary would say DyTopo “improves multi-agent reasoning.” True, but incomplete. The better reading is that the method improves reasoning when collaboration needs to become stage-specific: exploration first, verification later, consolidation at the end.
| Evidence type | Likely purpose in the paper | What it supports | What it does not prove |
|---|---|---|---|
| Main benchmark table | Main evidence | Dynamic semantic routing beats fixed, random, and baseline agent communication across tested tasks and models. | It does not prove the same gains in enterprise workflows. |
| Communication-round analysis | Sensitivity / behavior test | More rounds are not monotonically better; optimal depth depends on task type. | It does not identify a universal stopping rule. |
| Topology evolution trace | Interpretability case study | DyTopo’s graph changes in a way that aligns with task stage. | It does not statistically prove every edge is semantically optimal. |
| Similarity-threshold ablation | Hyperparameter sensitivity test | Sparsity has a sweet spot; both dense and overly sparse graphs degrade performance. | It does not remove the need for tuning in new domains. |
| Token and latency analysis | Efficiency comparison | Sparse routing and early stopping can reduce cost relative to fixed-horizon multi-agent baselines. | It is measured on a specific benchmark/backbone setting, not all deployments. |
This distinction is not academic hair-splitting. If a company reads the paper as “multi-agent systems work,” it may build a larger agent room. If it reads the paper correctly, it builds a router.
More rounds help until they start redecorating the answer
The communication-round analysis is one of the more practically useful parts of the paper. DyTopo’s performance is non-monotonic as rounds increase.
For HumanEval, performance peaks at round 5 with 92.07%. Additional rounds slightly degrade performance. This is plausible. Coding tasks can reach a correct implementation relatively early. After that, extra rounds may introduce edits, alternative designs, or unnecessary second-guessing. The system starts polishing the doorknob while the house is already open.
For MATH-500, performance peaks later, at round 9 with 87.14%. That also makes sense. Mathematical reasoning often benefits from extended derivation, checking, correction, and rechecking. The useful communication horizon is longer.
The lesson is not “use five rounds for code and nine for math.” That would be cargo-cult engineering, always fashionable, rarely useful.
The lesson is that interaction depth is task-dependent. A manager-controlled halting mechanism is therefore not an optional convenience. It is part of the reasoning architecture. A system that cannot stop is not deliberative; it is merely verbose.
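A halting mechanism of this kind amounts to a bounded loop with an exit condition owned by the manager. The sketch below shows only the control flow. `run_with_halting`, `fake_round`, and `halt_when_tests_pass` are hypothetical names, and the "tests pass in round 3" behavior is invented for the example.

```python
from typing import Callable


def run_with_halting(
    run_round: Callable[[int], dict],
    manager_should_halt: Callable[[dict], bool],
    max_rounds: int,
) -> tuple[int, dict]:
    """Run rounds until the manager halts or the round budget is exhausted."""
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")
    state: dict = {}
    for round_idx in range(1, max_rounds + 1):
        state = run_round(round_idx)
        if manager_should_halt(state):
            return round_idx, state
    return max_rounds, state


# Toy stand-ins for a real round and a real manager decision.
def fake_round(round_idx: int) -> dict:
    return {"round": round_idx, "tests_pass": round_idx >= 3}


def halt_when_tests_pass(state: dict) -> bool:
    return state["tests_pass"]


rounds_used, final_state = run_with_halting(
    fake_round, halt_when_tests_pass, max_rounds=5
)
print(rounds_used, final_state)
```

The hard budget matters as much as the halting check: a manager that never decides to stop should still hit a ceiling.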
For business workflows, this maps directly onto process design. Some tasks need one pass and a confidence check. Others need staged review. A customer-support summarizer should not behave like a contract-risk analyst. An invoice classifier should not deliberate like a board-strategy simulator. The number of agent rounds should reflect uncertainty, risk, and evidence requirements.
The topology trace is a debugging interface, not just an illustration
One of DyTopo’s more useful claims is interpretability through topology evolution. Because edges are induced from explicit query-key descriptors, each communication graph becomes a coordination trace.
The paper’s qualitative case study uses a HumanEval problem involving `is_palindrome` and `make_palindrome`. In the first round, the system focuses on initial exploration and algorithm selection. The Researcher proposes algorithmic approaches, and the Developer drafts an implementation. A strong edge forms from Researcher to Developer because the Developer needs efficient algorithmic guidance and the Researcher offers candidate algorithms and complexity considerations.
In the second round, the focus shifts toward implementation and verification. A high-confidence edge appears from Developer to Tester because the Tester needs the implementation, and the Developer offers complete code. In the final round, after tests pass, the graph becomes sparse and oriented toward final formatting and convergence.
This is not merely a cute visualization. It gives developers an operational object to inspect.
If the system fails, one can ask: did the right agent request the right evidence? Did the right provider describe its capability accurately? Was the threshold too strict? Did the graph route a weak message into a critical verifier? Did the manager set a vague round goal that produced vague descriptors?
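Some of these questions can be answered mechanically if every candidate edge is logged, including the ones that did not survive the threshold. The sketch below is one possible trace record, not the paper's format: `CandidateEdge`, `near_misses`, the `margin` parameter, and all scores are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CandidateEdge:
    """One provider->consumer comparison from one round (hypothetical schema)."""
    round_idx: int
    provider: str
    consumer: str
    key: str
    query: str
    score: float
    threshold: float

    @property
    def active(self) -> bool:
        return self.score >= self.threshold


def near_misses(trace: list[CandidateEdge], margin: float = 0.05) -> list[CandidateEdge]:
    """Edges that were dropped but scored within `margin` of the threshold."""
    return [e for e in trace if not e.active and e.threshold - e.score <= margin]


# Invented scores, for illustration only.
trace = [
    CandidateEdge(2, "Developer", "Tester", "complete code",
                  "implementation to verify", 0.62, 0.30),
    CandidateEdge(2, "Researcher", "Tester", "complexity notes",
                  "implementation to verify", 0.27, 0.30),
    CandidateEdge(2, "Tester", "Researcher", "test report",
                  "problem constraints", 0.05, 0.30),
]

for edge in near_misses(trace):
    print(f"round {edge.round_idx}: {edge.provider} -> {edge.consumer} "
          f"dropped at {edge.score:.2f} (threshold {edge.threshold:.2f})")
```

A list of near misses speaks to "was the threshold too strict?"; reading the `key` and `query` strings on active edges speaks to whether the descriptors themselves were accurate.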
In fixed broadcast systems, failure analysis is harder. Everyone saw everything, which means nobody is responsible for the information path. Broadcast is not transparency. It is often opacity with more tokens.
DyTopo’s trace also has governance implications. In regulated or high-stakes workflows, a company may need to explain not only the final AI output but the internal evidence flow that shaped it. A routed graph with descriptors is not a full audit system, but it is closer to one than a pile of concatenated agent messages.
The threshold ablation shows the cost of being too social
The similarity threshold controls which query-key matches become edges. The ablation tests thresholds from 0.1 to 0.9 on APPS-Competition and Omni-MATH.
The results show a clear middle region. For APPS-Competition, the best reported threshold is 0.3, with an accuracy of 49.81%. For Omni-MATH, the best threshold is 0.4, with an accuracy of 52.86%.
At low thresholds, the graph becomes too dense. Irrelevant messages enter agents’ context windows. The system has more communication but less useful signal. At high thresholds, the graph becomes too sparse. Useful information fails to move. The system becomes quiet, but not intelligently quiet.
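One way to see the dense and sparse extremes before running any benchmark is to sweep the threshold over a set of similarity scores and watch how many candidate edges survive. The scores below are invented and `edge_density` is a hypothetical helper; the sketch shows graph density only and says nothing about accuracy.

```python
def edge_density(scores: dict[tuple[str, str], float], threshold: float) -> float:
    """Fraction of candidate provider->consumer pairs kept at this threshold."""
    if not scores:
        return 0.0
    kept = sum(1 for s in scores.values() if s >= threshold)
    return kept / len(scores)


# Invented similarity scores for six candidate pairs.
scores = {
    ("Researcher", "Developer"): 0.62,
    ("Developer", "Tester"): 0.55,
    ("Tester", "Developer"): 0.41,
    ("Researcher", "Tester"): 0.18,
    ("Developer", "Researcher"): 0.12,
    ("Tester", "Researcher"): 0.08,
}

sweep = {t: edge_density(scores, t) for t in (0.1, 0.3, 0.5, 0.7, 0.9)}
for threshold, density in sweep.items():
    print(f"threshold {threshold:.1f}: {density:.0%} of candidate edges kept")
```

In this toy example nearly every pair survives at 0.1 and none survive at 0.7 or above. Density is a cheap proxy to monitor, but the productive region still has to be found by measuring task outcomes.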
This is the useful part for practitioners: sparsity is not automatically good. Selectivity is good. These are not the same thing.
In business AI systems, teams often overcorrect. After seeing context bloat, they aggressively filter messages. After seeing missed information, they broadcast everything. DyTopo’s threshold ablation says the productive region sits between these instincts. Communication should be sparse enough to reduce noise and dense enough to preserve dependencies.
That sounds obvious after being said. Most design principles do. The trick is turning it into a tunable mechanism.
Efficiency comes from stopping and routing, not magic compression
The appendix reports token and latency analysis on HumanEval using the MiMo-V2-Flash backbone. DyTopo reaches 92.07% accuracy while consuming 9,453 tokens on average and taking 22.3 seconds. AgentScope reaches 90.24% while consuming 19,520 tokens and taking 39.8 seconds. Random Topology consumes 15,783 tokens and takes 34.2 seconds while reaching 88.17%.
The paper attributes DyTopo’s efficiency to two factors: manager-controlled early stopping and sparse routing. DyTopo averages 2.6 rounds in that analysis, while fixed-horizon baselines run for five rounds.
This is important because the efficiency claim is not “semantic matching is free.” The matching layer still computes embeddings and similarities. But relative to LLM generation, the paper argues this overhead is small. The real savings come from fewer unnecessary rounds and shorter routed contexts.
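The size of the saving is easier to judge as a ratio. The short calculation below uses only the figures quoted above from the paper's appendix; `relative_reduction` is a helper name introduced for this example.

```python
def relative_reduction(baseline: float, candidate: float) -> float:
    """Fractional saving of `candidate` relative to `baseline`."""
    return 1.0 - candidate / baseline


# Figures as quoted above (HumanEval, MiMo-V2-Flash backbone).
reported = {
    "DyTopo": {"tokens": 9453, "seconds": 22.3},
    "AgentScope": {"tokens": 19520, "seconds": 39.8},
    "Random Topology": {"tokens": 15783, "seconds": 34.2},
}

dytopo = reported["DyTopo"]
for name in ("AgentScope", "Random Topology"):
    base = reported[name]
    tokens_saved = relative_reduction(base["tokens"], dytopo["tokens"])
    time_saved = relative_reduction(base["seconds"], dytopo["seconds"])
    print(f"vs {name}: {tokens_saved:.1%} fewer tokens, {time_saved:.1%} less time")

# Average DyTopo rounds divided by the fixed five-round horizon.
round_ratio = 2.6 / 5
print(f"round ratio: {round_ratio:.2f}")
```

Against AgentScope this works out to roughly 52% fewer tokens and 44% less time; against Random Topology, roughly 40% and 35%. The token ratio against AgentScope (about 0.48) sits close to the round ratio of 0.52, which is consistent with early stopping doing much of the work, although these aggregate numbers alone cannot separate the stopping effect from the routing effect.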
For business use, this suggests a practical cost model:
| Design choice | Cost effect | Quality effect | Governance effect |
|---|---|---|---|
| Fixed broadcast | High token use | Can help early exploration but risks noise | Harder to attribute influence |
| Random sparse routing | Lower than broadcast, but unstable | May miss critical dependencies | Weak explanation for edges |
| Static pipeline | Predictable cost | Brittle when task stage changes | Easy to document, hard to adapt |
| DyTopo-style semantic routing | Potentially lower cost through sparsity and stopping | Better when task needs shift by stage | Descriptors and edges create an audit trail |
The ROI implication is not just cheaper inference. Cheaper inference is pleasant. It is not a strategy.
The strategic value is cheaper diagnosis. When an agentic workflow fails, a routed topology can show whether the failure came from poor generation, poor role design, poor manager goals, weak descriptors, or bad routing thresholds. That is the difference between improving a system and performing ritual prompt edits at midnight.
What Cognaptus would infer for business workflows
The paper directly shows improved performance on code and math benchmarks. It does not directly show invoice automation, procurement review, loan underwriting, financial reporting, customer service, or legal triage.
The business inference is therefore architectural, not empirical: if enterprise AI workflows use multiple agents, they should treat information routing as a first-class layer.
A DyTopo-inspired business system would not simply create agents such as “Analyst,” “Reviewer,” “Compliance,” and “Summarizer” and let them all read the same transcript. It would require each agent to state what it needs and what it can provide at each stage. Then a routing layer would decide which messages enter which memories.
A procurement workflow might work like this:
- The extraction agent provides invoice fields, purchase order references, and confidence levels.
- The matching agent requests vendor, PO, receipt, and line-item evidence.
- The exception agent requests mismatches, missing approvals, or unusual payment terms.
- The compliance agent receives only high-risk or policy-relevant evidence, not every extracted token.
- The final summarizer receives the resolved facts and unresolved exceptions, not the entire internal debate.
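The workflow above can be sketched as a need-and-offer routing table. To keep the example short it matches on exact evidence tags instead of embedding similarity, which is a deliberate simplification and not the paper's method. Every tag name, the `PROVIDES` and `NEEDS` tables, and the `route` function are hypothetical.

```python
# Hypothetical evidence tags each agent can provide.
PROVIDES = {
    "extraction": {"invoice_fields", "po_reference", "confidence"},
    "matching": {"mismatch", "resolved_fact"},
    "exception": {"unresolved_exception", "high_risk_flag"},
    "compliance": {"policy_finding"},
}

# Hypothetical evidence tags each agent requests.
NEEDS = {
    "matching": {"invoice_fields", "po_reference"},
    "exception": {"mismatch"},
    "compliance": {"high_risk_flag"},
    "summarizer": {"resolved_fact", "unresolved_exception", "policy_finding"},
}


def route(
    provides: dict[str, set[str]], needs: dict[str, set[str]]
) -> dict[tuple[str, str], list[str]]:
    """Map (provider, consumer) to the evidence tags that should flow between them."""
    edges = {}
    for provider, offered in provides.items():
        for consumer, wanted in needs.items():
            shared = offered & wanted
            if provider != consumer and shared:
                edges[(provider, consumer)] = sorted(shared)
    return edges


edges = route(PROVIDES, NEEDS)
for (provider, consumer), tags in sorted(edges.items()):
    print(f"{provider} -> {consumer}: {', '.join(tags)}")
```

In this table the summarizer never receives raw extraction output and compliance sees only high-risk flags, which is the point of the list above. Swapping the tag intersection for a semantic match would bring the sketch closer to DyTopo, at the cost of needing a tuned threshold.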
The same pattern applies to financial analysis. A macro-data agent does not need every paragraph from a news-summarization agent. A risk agent may need only uncertainty flags, assumptions, and exposure-sensitive claims. A charting agent needs structured series and labels, not legal caveats. A publishing agent needs the final interpretation and source list, not every failed intermediate hypothesis.
The principle is boring in the best possible way: route information by operational need.
The boundaries: descriptors can lie, thresholds can misfire, traces can leak
DyTopo’s limitations matter because they point directly to implementation risk.
First, the routing quality depends on descriptor quality. If an agent poorly describes what it needs or offers, semantic matching can route the wrong messages. This is not a minor issue. The descriptor is the system’s self-declared interface. Bad interfaces produce bad integrations.
Second, semantic similarity is not the same as operational relevance. Two descriptors can be semantically close but procedurally unhelpful. A compliance agent asking for “risk indicators” may match a strategy agent offering “market risk narrative,” even when the needed object is a policy violation checklist. In enterprise workflows, descriptors should be constrained, typed, or supported by schema fields when possible.
Third, thresholds need tuning. The paper’s ablation shows different optimal thresholds for APPS-Competition and Omni-MATH. A production workflow should not assume a universal value. It should monitor edge density, missed dependencies, downstream error rates, and human override patterns.
Fourth, topology traces create privacy and security questions. A trace that records what each agent needed, offered, and received can be valuable for debugging. It can also expose sensitive business logic, client information, or internal deliberation. Logging must be designed as a governance asset, not an accidental transcript archive.
Fifth, the paper’s evidence is still benchmark evidence. Code and math tasks are useful because they have measurable correctness. Business workflows often have ambiguous success criteria, delayed feedback, and political constraints. Very inconvenient. Also known as reality.
The safe interpretation is this: DyTopo provides a credible mechanism and promising benchmark evidence for adaptive agent communication. It does not eliminate the need for domain-specific evaluation.
The better mental model is a switchboard, not a conference room
DyTopo is valuable because it shifts the mental model of multi-agent AI.
The common model is a conference room: create specialized agents, let them talk, hope synthesis emerges. The DyTopo model is a switchboard: each agent declares need and offer; the system routes information through a sparse graph; the manager changes the goal and stops the process when enough evidence has accumulated.
That model is more disciplined. It is also closer to how serious organizations actually work when they are not performing alignment theater. Experts do not need every message. They need the right message at the right time, with enough context to act and not enough noise to hallucinate responsibility.
For builders of agentic systems, the takeaway is blunt: collaboration is not a virtue by itself. It is a cost center until routed correctly.
DyTopo does not make agents magically smarter. It makes their communication less stupid. In multi-agent AI, that may be the more scalable achievement.
Cognaptus: Automate the Present, Incubate the Future.
[1] Yuxing Lu, Yucheng Hu, Xukai Zhao, and Jiuxin Cao, “DyTopo: Dynamic Topology Routing for Multi-Agent Reasoning via Semantic Matching,” arXiv:2602.06039, 2026, https://arxiv.org/abs/2602.06039.