Agents Assemble: When Multi‑Agent LLMs Stop Hallucinating and Start Doing Science

A scientist does not usually fail because they cannot ask the right question. More often, they fail because the useful answer is buried behind five separate systems: a biomedical knowledge graph, a disease-module algorithm, a drug-prioritization method, a literature database, and a visualization tool that looks innocent until someone has to configure it.

That is the quiet business problem behind ChatDRex, a conversational multi-agent system for network-based drug repurposing built on the NeDRex platform.¹ The paper is not saying, “Look, a chatbot discovered a cure.” Thankfully. We have already had enough of that genre.

The more interesting claim is narrower and more useful: a multi-agent LLM system can make specialized biomedical workflows accessible to users who understand the disease but do not want to manually assemble a bioinformatics pipeline. ChatDRex lets the LLM do what LLMs are relatively good at: interpreting user intent, routing tasks, translating between natural language and tool calls, summarizing outputs, preserving conversational context, and checking coherence. The actual scientific operations are delegated to structured databases and established computational tools.

That division of labor is the paper’s real contribution. Not “AI replaces experts.” Not “LLMs do drug discovery.” More like: the chatbot becomes a switchboard operator for serious machinery. Less glamorous, much more useful.

The mechanism matters more than the chatbot wrapper

The obvious summary of ChatDRex would be simple: it is a no-code assistant for drug repurposing. Users ask biomedical questions in natural language; the system queries NeDRex, runs network algorithms, retrieves literature, performs functional analysis, and returns a readable answer.

Accurate, but too shallow.

The paper’s mechanism is more important than its interface. ChatDRex is built as a multi-agent system where a central planning agent decomposes user requests and routes subtasks to specialized agents. A Summary agent maintains compressed conversational memory and checks for prompt injection. A Planning agent decides which expert agent should act next. A NeDRex agent executes network-medicine tools such as DIAMOnD, TrustRank, and Closeness Centrality. A Knowledge Graph agent translates natural-language questions into schema-constrained Cypher queries over the NeDRex KG. A DIGEST agent evaluates functional coherence of gene sets or disease modules. A Research agent uses Semantic Scholar when KG coverage is insufficient. A Network agent controls Drugst.One visualizations. A Finalize agent synthesizes the outputs into a response and applies an additional hallucination-verification step.

That may sound like architecture paperwork. It is not. It is the difference between a language model improvising biomedical claims and a language model coordinating a constrained workflow.
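The routing idea can be sketched in a few lines. This is a minimal illustration, not the ChatDRex implementation: the agent names, the `State` class, and the keyword rules in `plan` are all hypothetical stand-ins, and in the real system the planner is an LLM rather than a set of string checks.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class State:
    """Conversation state shared between agents (hypothetical structure)."""
    question: str
    results: Dict[str, str] = field(default_factory=dict)


# Stub agents: each stands in for a real tool-backed agent and returns a label.
def nedrex_agent(state: State) -> str:
    return "network tool output (stub)"


def kg_agent(state: State) -> str:
    return "knowledge-graph rows (stub)"


def research_agent(state: State) -> str:
    return "literature hits (stub)"


AGENTS: Dict[str, Callable[[State], str]] = {
    "nedrex": nedrex_agent,
    "kg": kg_agent,
    "research": research_agent,
}


def plan(question: str) -> List[str]:
    """Keyword rules standing in for the LLM planner (an assumption)."""
    q = question.lower()
    steps: List[str] = []
    if "module" in q or "rank" in q:
        steps.append("nedrex")
    if "associated" in q or "target" in q:
        steps.append("kg")
    if "literature" in q or "paper" in q:
        steps.append("research")
    return steps or ["kg"]  # default to a KG lookup when nothing matches


def run(question: str) -> State:
    """Route the question through the planned agents, then finalize."""
    state = State(question)
    for name in plan(question):
        state.results[name] = AGENTS[name](state)
    # The finalize step only combines tool outputs; it adds no new facts.
    summary = "; ".join(f"{k}: {v}" for k, v in state.results.items())
    state.results["finalize"] = summary
    return state
```

The design point is that the final summary is assembled from recorded agent outputs, so every claim in it can be traced back to a specific agent call.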

The paper’s architecture can be read as a useful enterprise AI pattern:

System layer	What ChatDRex lets the LLM do	What the LLM is not trusted to do alone
User interface	Parse natural-language biomedical questions	Decide scientific truth from text fluency
Planning	Break the workflow into subtasks and route agents	Collapse the entire workflow into one unverified answer
Knowledge graph access	Translate questions into structured KG operations	Invent biomedical relationships not present in the graph
Network analysis	Call DIAMOnD, TrustRank, and Closeness tools	Replace validated algorithms with verbal reasoning
Functional interpretation	Summarize DIGEST outputs	Treat narrative coherence as statistical validation
Final response	Combine tool results, citations, and explanation	Turn candidates into clinical recommendations

The practical lesson is blunt: reliable vertical agents are often not “smarter chatbots.” They are controlled interpreters wrapped around domain-specific infrastructure.

Drug repurposing is a workflow problem before it is an AI problem

Drug repurposing sounds like a natural use case for AI because existing drugs already have safety and tolerability information, and the question becomes whether they might work for new indications. But the actual workflow is not a single prediction. It is a chain.

A typical network-medicine path begins with seed genes associated with a disease. These seeds are used to identify a disease module: a group of genes or proteins whose network relationships may reflect disease biology. Algorithms such as DIAMOnD expand the module by finding proteins significantly connected to the seed genes. Drug-prioritization methods such as TrustRank or Closeness Centrality then rank candidate drugs or targets based on network proximity. Functional coherence tools such as DIGEST ask whether the resulting gene set has meaningful biological consistency rather than being a random-looking list with fancy gene names. Literature retrieval then helps contextualize the candidates.
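To make "network proximity" concrete, here is a toy sketch of the ranking step. It is a deliberately simplified proximity score (the mean shortest-path distance from each seed to a drug's nearest target), not the Closeness Centrality or TrustRank implementation used by NeDRex. The graph, gene names, and drug names are made up for illustration.

```python
from collections import deque
from typing import Dict, List, Set

# Toy protein-protein interaction network (made-up edges, not NeDRex data).
PPI: Dict[str, Set[str]] = {
    "G1": {"G2", "G3"},
    "G2": {"G1", "G4"},
    "G3": {"G1"},
    "G4": {"G2", "G5"},
    "G5": {"G4"},
}

# Hypothetical drug -> target mapping.
DRUG_TARGETS: Dict[str, Set[str]] = {
    "drugA": {"G3"},
    "drugB": {"G5"},
    "drugC": {"G2", "G4"},
}


def bfs_distances(graph: Dict[str, Set[str]], source: str) -> Dict[str, int]:
    """Shortest-path distance (in edges) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def rank_drugs(
    graph: Dict[str, Set[str]],
    seeds: List[str],
    drug_targets: Dict[str, Set[str]],
) -> List[str]:
    """Rank drugs by mean distance from each seed to the drug's nearest target.

    Lower scores mean the drug's targets sit closer to the seed genes.
    """
    dists = {seed: bfs_distances(graph, seed) for seed in seeds}
    scores: Dict[str, float] = {}
    for drug, targets in drug_targets.items():
        per_seed = [
            min(dists[seed].get(t, float("inf")) for t in targets)
            for seed in seeds
        ]
        scores[drug] = sum(per_seed) / len(per_seed)
    # Ties are broken alphabetically so the ordering is deterministic.
    return sorted(scores, key=lambda d: (scores[d], d))
```

On this toy data, `rank_drugs(PPI, ["G1", "G2"], DRUG_TARGETS)` returns `["drugC", "drugA", "drugB"]`: the drug whose targets sit inside the seed neighborhood ranks first. A real pipeline would add module expansion before this step and a functional-coherence check after it.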

The pain point is that each step usually demands a different kind of expertise. A clinician may know the disease. A bioinformatician may know the pipeline. A computational biologist may understand the network assumptions. A pharmacologist may interpret the candidate list. And the poor human in the middle has to move identifiers, parameters, graphs, and citations across tools.

ChatDRex tries to compress this operational friction. It does not remove the need for expertise. It changes where the expertise is spent. Instead of spending time on query syntax, API parameters, and manual cross-tool stitching, the user can focus on whether the resulting module, pathways, and candidate drugs make biological sense.

This is also why the no-code framing should be handled carefully. “No-code” does not mean “no judgment.” It means the system hides pipeline implementation from users who should not need to become tool plumbers before asking a scientific question.

The architecture’s best trick is refusing to make the LLM the scientist

The Knowledge Graph agent is a good example of the paper’s philosophy.

A naïve biomedical chatbot might take a user question, retrieve a few passages, and answer in prose. ChatDRex instead maps the question to the fixed schema of the NeDRex KG. It decomposes the question into typed subquestions, resolves relevant nodes using embedding-based similarity search, and then generates deterministic Cypher queries under schema and edge constraints. If query generation fails after retries, the system falls back to a GraphRAG-style strategy.

That design does two things at once. First, it makes the LLM useful as a translator between messy human language and structured biomedical data. Second, it prevents the LLM from becoming the database. The model is not asked to “remember” which genes, drugs, proteins, and diseases are connected. It is asked to help formulate a query against a system built for that purpose.
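A minimal sketch of the schema-constraint idea, under stated assumptions: the node labels, relationship names, and the `build_query` and `query_or_fallback` helpers below are hypothetical and do not reproduce the actual NeDRex KG schema or the ChatDRex code. The point is only that a proposed query pattern is checked against an allow-list before any Cypher is produced, and that user-supplied values travel as parameters rather than being pasted into the query string.

```python
from typing import Dict, List, Optional, Set, Tuple

Triple = Tuple[str, str, str]

# Hypothetical allow-list of (source label, relationship, target label).
SCHEMA: Set[Triple] = {
    ("Drug", "HAS_TARGET", "Protein"),
    ("Gene", "ASSOCIATED_WITH", "Disorder"),
}


def build_query(src: str, rel: str, dst: str, name: str) -> Tuple[str, Dict[str, str]]:
    """Return a parameterized Cypher query, or raise if the pattern is off-schema."""
    if (src, rel, dst) not in SCHEMA:
        raise ValueError(f"pattern not in schema: ({src})-[{rel}]->({dst})")
    # Labels come from the validated allow-list; the value is passed as $name.
    cypher = (
        f"MATCH (a:{src})-[:{rel}]->(b:{dst}) "
        "WHERE b.name = $name RETURN a.name LIMIT 25"
    )
    return cypher, {"name": name}


def query_or_fallback(
    proposals: List[Triple], name: str, max_retries: int = 3
) -> Tuple[str, Optional[str], Optional[Dict[str, str]]]:
    """Try up to max_retries proposed patterns, then signal a fallback.

    `proposals` stands in for successive LLM attempts (an assumption).
    """
    for src, rel, dst in proposals[:max_retries]:
        try:
            cypher, params = build_query(src, rel, dst, name)
            return "cypher", cypher, params
        except ValueError:
            continue  # off-schema proposal: try the next one
    return "fallback", None, None
```

An off-schema proposal such as `("Drug", "CURES", "Disorder")` never reaches the database; it either gets replaced by a valid attempt or triggers the fallback path. The model proposes; the schema decides.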

This is the same principle behind the NeDRex tool agents. DIAMOnD performs disease-module expansion. TrustRank and Closeness Centrality rank candidates using network topology. DIGEST evaluates functional coherence against randomized reference sets. Drugst.One handles graph visualization. Semantic Scholar supports literature retrieval.

The LLM’s role is therefore orchestration, not scientific omniscience. That sounds less magical. It is also how one avoids producing extremely confident biomedical fan fiction, which is generally considered suboptimal.

The evaluation separates tool use from interpretation, which is exactly the right split

The strongest part of the paper is not the use-case demonstration. It is the evaluation design.

The authors recognize that evaluating an LLM agent is harder than evaluating a deterministic program. Some subtasks have an explicit answer: Did the agent select the right tool? Did it call the right API? Did the returned result contain expected KG entries? Other subtasks are inherently interpretive: Did the final answer explain structured enrichment output correctly? Did it preserve scientific nuance? Did it overstate evidence?

So the paper separates three evaluation dimensions and adds an F1-score for knowledge-graph queries:

Evaluation dimension	What it checks	Why it matters
Tool-Accuracy	Whether the agent selects the correct tool for the user request	Tests orchestration and intent recognition
Call-Accuracy	Whether the correct API/tool invocation is made	Tests whether the workflow is operationally reliable
Answer-Accuracy	Whether the returned result is reflected correctly in natural language	Tests interpretation, summarization, and explanation
NeDRex KG F1-score	Whether KG query results match a silver-standard annotation dataset	Tests structured query performance rather than prose quality

This split is more than methodological tidiness. It tells businesses where the risk actually lives.

For deterministic network tools, ChatDRex performs very well. DIAMOnD, TrustRank, and Closeness Centrality reach near-perfect Tool-Accuracy, with DIAMOnD and TrustRank at 1.00 and Closeness Centrality at 0.99. Their Call-Accuracy is also high: 0.96 for DIAMOnD, 0.98 for TrustRank, and 0.95 for Closeness Centrality.

The important interpretation is not “the LLM understands medicine.” The stronger and safer interpretation is: when the task is well-scoped and backed by deterministic APIs, the agent can reliably recognize the user’s intent and invoke the proper analytical component.

The Knowledge Graph agent is harder. Its schema-constrained NeDRex KG query evaluation reports a Call-Accuracy F1-score of 0.74 and Answer-Accuracy of 0.83. That is not a disaster; it is a reminder that translating natural language into graph queries is still a meaningful technical bottleneck. Complex biomedical questions do not always become perfect Neo4j queries just because a chatbot smiles politely while trying.
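For readers who want to see what these numbers measure, here is a small sketch of how such scores could be computed. It assumes exact-match comparison for tool selection and set overlap for KG results; the paper's precise scoring rules may differ, and the function names are illustrative.

```python
from typing import List, Set


def accuracy(expected: List[str], actual: List[str]) -> float:
    """Share of test cases where the agent's choice equals the expected one."""
    if len(expected) != len(actual):
        raise ValueError("expected and actual must have the same length")
    if not expected:
        raise ValueError("need at least one test case")
    return sum(e == a for e, a in zip(expected, actual)) / len(expected)


def set_f1(gold: Set[str], predicted: Set[str]) -> float:
    """F1 between gold KG entries and the entries a generated query returned."""
    if not gold and not predicted:
        return 1.0  # nothing expected, nothing returned
    true_positives = len(gold & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)
```

The same `accuracy` function can score tool selection (expected vs. chosen tool name) and call correctness (expected vs. issued call signature), which is why the two can diverge: an agent can pick the right tool and still pass it the wrong arguments.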

DIGEST is where the paper becomes most useful

The DIGEST result is the most revealing part of the study because it shows the boundary between tool reliability and narrative reliability.

For DIGEST-Set, Tool-Accuracy is 0.87 and Call-Accuracy is 0.79. Answer-Accuracy is reported as 0.68 under manual expert evaluation, while the automated LLM-as-a-judge score is only 0.29. The paper explains that the underlying analytical information was factually correct, but the automatic judge strongly penalized the generated answers.

That mismatch is not a footnote-level inconvenience. It is the paper’s best enterprise warning.

DIGEST produces structured functional-coherence and enrichment outputs. The system must then turn those statistical results into a natural-language explanation. This introduces a new abstraction layer. When a tool returns a ranked list or a deterministic graph output, the final answer can mostly reflect the tool result. When a tool returns statistical enrichment information, the system has to interpret what the signal means, how strong it is, and what not to overclaim.

That is exactly where LLM evaluation gets fragile. A judge model may dislike a valid explanation because it differs from an expected phrasing. Or it may reward a fluent answer that is scientifically too smooth. Either way, automated judgment becomes less trustworthy when the task requires domain interpretation rather than simple output matching.
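One way to monitor this fragility is to measure how often an automated judge agrees with expert reviewers on the same answers, corrected for chance. The sketch below computes Cohen's kappa for two binary raters. The labels in the usage note are made up for illustration and are not the paper's data.

```python
from typing import Sequence


def cohen_kappa(rater_a: Sequence[bool], rater_b: Sequence[bool]) -> float:
    """Chance-corrected agreement between two binary raters."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    rate_a = sum(rater_a) / n  # share of items rater A marked correct
    rate_b = sum(rater_b) / n
    # Agreement expected if both raters labeled independently at these rates.
    expected = rate_a * rate_b + (1 - rate_a) * (1 - rate_b)
    if expected == 1.0:
        return 1.0  # both raters gave one identical constant label
    return (observed - expected) / (1 - expected)
```

With hypothetical labels `expert = [True, True, True, False, True, False]` and `judge = [True, False, False, False, False, False]`, raw agreement is 0.5 but kappa is only about 0.18. A low kappa on interpretive answers is a signal to keep expert review in the loop for that layer, even when simple output matching elsewhere looks healthy.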

For businesses, this is a very useful distinction:

Test or evidence	Likely purpose	What it supports	What it does not prove
DIAMOnD, TrustRank, Closeness metrics	Main orchestration evidence	The agent can select and call deterministic network tools reliably	The system discovers clinically validated therapies
NeDRex KG F1-score	Structured query evaluation	Natural-language-to-KG querying is feasible but imperfect	Complex graph QA is solved
DIGEST manual vs LLM-judge gap	Interpretation and evaluation stress test	Narrative biological interpretation remains harder to score automatically	Automated judges can replace expert review
Colorectal cancer and Huntington’s workflows	Use-case demonstration	ChatDRex can stitch multi-step workflows into a conversational process	The candidate outputs are experimentally validated
Literature-search limitation example	Exploratory limitation evidence	External retrieval and summarization can still produce inaccuracies	The whole architecture is unreliable

This table is where the article’s mechanism-first structure pays off. The point is not that one metric is “good” and another is “bad.” The point is that reliability changes by layer. Tool routing is strong. Structured graph querying is workable but imperfect. Statistical-to-verbal interpretation remains the bottleneck. Automated evaluation of that interpretation is even shakier.

In other words: the system is strongest where the LLM is closest to being a dispatcher and weakest where the LLM must become a scientific narrator.

The use cases show workflow integration, not clinical validation

The paper presents several use cases: functional enrichment for Alzheimer’s-related genes, literature search, network analysis, colorectal cancer module analysis, and Huntington’s disease workflows. These examples are best read as workflow demonstrations.

For colorectal cancer, ChatDRex retrieves disease-associated seed genes from NeDRex, including genes such as POLE and POLD1, expands the gene set using DIAMOnD into a 20-gene candidate disease module, applies DIGEST to evaluate functional coherence, and queries drug–gene interactions for POLE. The paper reports suggestive DIGEST signals for immune-related processes and DNA replication/cell-cycle functions, with empirical $p = 0.005$ for both GO Biological Process and KEGG pathway annotation layers in the main text. It also notes that the small and heterogeneous module size makes the results indicative rather than confirmatory.

That is the correct tone. The workflow generates hypotheses. It does not validate a therapeutic strategy.
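For readers less familiar with empirical p-values of the kind quoted above, the general idea can be sketched as a permutation-style test: score the gene set, score many random gene sets of the same size, and ask how often chance does at least as well. This is a generic illustration, not DIGEST's actual scoring; the coherence measure, the annotation data, and the function names are all assumptions.

```python
import random
from typing import Dict, List, Sequence, Set


def empirical_p_value(observed: float, null_scores: Sequence[float]) -> float:
    """Share of random scores >= observed, with the usual +1 correction."""
    hits = sum(1 for s in null_scores if s >= observed)
    return (hits + 1) / (len(null_scores) + 1)


def shared_fraction(genes: List[str], annotations: Dict[str, Set[str]]) -> float:
    """Toy coherence score: fraction of gene pairs sharing an annotation term."""
    pairs = [(a, b) for i, a in enumerate(genes) for b in genes[i + 1:]]
    if not pairs:
        return 0.0
    shared = sum(1 for a, b in pairs if annotations[a] & annotations[b])
    return shared / len(pairs)


def permutation_test(
    genes: List[str],
    annotations: Dict[str, Set[str]],
    n_random: int = 999,
    seed: int = 0,
) -> float:
    """Compare the gene set against random same-size sets from the background."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    background = sorted(annotations)
    observed = shared_fraction(genes, annotations)
    null_scores = [
        shared_fraction(rng.sample(background, len(genes)), annotations)
        for _ in range(n_random)
    ]
    return empirical_p_value(observed, null_scores)


# Made-up annotations for eight hypothetical genes.
TOY_ANNOTATIONS: Dict[str, Set[str]] = {
    "G1": {"repair"}, "G2": {"repair"}, "G3": {"repair", "cycle"},
    "G4": {"immune"}, "G5": {"immune"}, "G6": {"transport"},
    "G7": {"signal"}, "G8": {"cycle"},
}
```

Two properties follow directly from the formula. The smallest reportable value is bounded by the number of randomizations (with 999 random sets it cannot go below 0.001), and a small module leaves few gene pairs to score, so the estimate is noisy. Both are reasons to treat such results as indicative rather than confirmatory.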

The supplementary examples are equally important because they show both capability and rough edges. The literature-search workflow for Alzheimer’s disease is explicitly described as an example with limitations: the initial response contains a large amount of Semantic Scholar data, and the final summary exhibits some inaccuracies. This is not a minor embarrassment hidden in the appendix. It tells us that retrieval plus summarization is still risky when the system must compress messy literature into a confident answer.

The Huntington’s disease workflow illustrates the agentic chain more clearly: identify associated genes, run DIAMOnD, perform functional analysis, rank drugs with TrustRank, then retrieve research on top-ranked drugs. This is close to how users would expect an analytical assistant to behave. The interaction is not one query, one answer. It is a sequence of scientific moves.

The strongest reading of these examples is therefore operational: ChatDRex can coordinate multi-step biomedical analysis through conversation. The weakest reading would be clinical: ChatDRex proves that particular repurposing candidates should be used. The paper does not support that, and neither should the reader.

The business value is workflow compression, not automated discovery

For pharma, biotech, hospital research teams, and expert-service firms, the commercial relevance is not that ChatDRex magically discovers drugs. The value is more practical: it lowers the coordination cost of using complex analytical infrastructure.

In many expert workflows, the bottleneck is not the absence of tools. It is that the tools are fragmented. The database has one interface. The algorithm has another. The visualization layer has another. The literature search has another. The expert has a hypothesis but not necessarily the time or technical comfort to wire everything together.

ChatDRex suggests a reusable business pattern:

1. Put trusted domain assets behind agent-accessible APIs.
2. Let a planning agent decompose user intent into tool-specific subtasks.
3. Keep deterministic computation inside validated tools.
4. Use the LLM to translate, route, summarize, and explain.
5. Evaluate each layer separately instead of pretending the final answer is the only object that matters.
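The last point of the pattern, evaluating each layer separately, depends on recording what happened at each layer. The sketch below shows one way that could look: a small tool registry that logs every deterministic call and every generated explanation as separate trace entries. The class and method names are hypothetical and are not taken from ChatDRex.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass
class TraceEntry:
    layer: str            # "tool" for deterministic calls, "explanation" for LLM text
    name: str             # tool name, or "llm" for generated text
    args: Dict[str, Any]  # arguments the tool was called with
    output: Any           # what the tool or model returned


class TracedTools:
    """Registry that records every tool call and every generated explanation."""

    def __init__(self) -> None:
        self._tools: Dict[str, Callable[..., Any]] = {}
        self.trace: List[TraceEntry] = []

    def register(self, name: str, fn: Callable[..., Any]) -> None:
        self._tools[name] = fn

    def call(self, name: str, **args: Any) -> Any:
        if name not in self._tools:
            raise KeyError(f"unknown tool: {name}")
        output = self._tools[name](**args)
        self.trace.append(TraceEntry("tool", name, args, output))
        return output

    def explain(self, text: str) -> str:
        # Generated prose is logged separately so it can be reviewed on its own.
        self.trace.append(TraceEntry("explanation", "llm", {}, text))
        return text

    def by_layer(self, layer: str) -> List[TraceEntry]:
        return [entry for entry in self.trace if entry.layer == layer]
```

With a trace like this, tool-selection accuracy can be scored from the `"tool"` entries alone, while the `"explanation"` entries go to expert review. Neither score has to be inferred from the final answer.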

This pattern is not limited to drug repurposing. It applies to legal research, financial risk analysis, engineering simulation, clinical operations, compliance review, procurement intelligence, and any other field where experts use specialized systems but lose time to interface friction.

The ROI story is also not simply “replace analysts.” A more realistic path is:

Business objective	How the ChatDRex pattern helps	Boundary
Faster hypothesis triage	Users can run multi-step workflows without manually configuring every tool	Candidate outputs still require expert review
Better tool adoption	Natural-language access lowers training and onboarding friction	Poorly designed prompts or ambiguous inputs can still misroute tasks
Reusable governance	Deterministic tools remain the source of analytical outputs	Governance must cover both APIs and generated explanations
Better auditability	Tool calls, intermediate results, and final summaries can be separated	The paper does not prove enterprise-grade audit coverage across all use cases
Reduced hallucination risk	LLMs query validated databases instead of relying on memory	Literature summarization and biological interpretation remain vulnerable

This is where many enterprise AI pitches get the story backwards. They sell intelligence first and integration second. ChatDRex shows the opposite. The system becomes valuable because the intelligence is constrained by integration.

The architecture boundary is the product lesson

The most generalizable lesson from ChatDRex is architectural humility.

The paper does not try to fine-tune a biomedical model into a universal expert. It builds a system where the LLM operates inside a network of tools, memories, routers, guards, and evaluators. It uses schema constraints for KG querying. It uses established network algorithms for disease-module and drug-prioritization steps. It uses DIGEST for functional coherence. It uses visualization tooling rather than asking the model to describe graphs from imagination. It uses a Finalize agent plus hallucination verification for response synthesis.

That is a mature direction for vertical AI agents. It accepts that LLMs are powerful but not self-sufficient. They are excellent at flexible language mediation. They are not, by default, reliable biomedical databases, statistical engines, graph algorithms, clinical validators, or regulatory-grade evidence reviewers.

The business implication is simple: if a company wants a serious domain agent, the first question should not be, “Which model should we use?” It should be, “Which parts of the workflow must remain deterministic, inspectable, and owned by domain tools?”

Model choice matters. But architecture decides where errors can enter, where they can be caught, and whether the final answer is a tool-grounded result or a fluent hallucination wearing a lab coat.

The limits are concentrated in interpretation, coverage, and evaluation

The limitations are not generic “AI may be wrong” caveats. They are specific and operational.

First, ChatDRex depends on the coverage and quality of NeDRex and connected tools. If a disease, gene, drug, or relationship is missing or poorly represented in the KG, the system can only work around that gap through retrieval or fallback strategies. A cleaner interface does not create missing biomedical evidence.

Second, natural-language ambiguity can propagate through multi-step workflows. Even when tool selection is reliable, the user’s phrasing may lead to different decomposition choices, different KG filters, or different downstream analyses. The paper’s NeDRex KG F1-score of 0.74 is a useful reminder that structured query generation remains non-trivial.

Third, narrative interpretation is weaker than deterministic orchestration. DIGEST is the clearest example. The underlying functional analysis can be correct, while the generated explanation and its automated evaluation remain difficult. This matters because business users often consume the final explanation, not the raw statistical output.

Fourth, automated LLM-as-a-judge evaluation is not enough for domain outputs that require expert interpretation. The manual DIGEST Answer-Accuracy of 0.68 versus automated judge accuracy of 0.29 is not merely a scoring discrepancy. It tells us that evaluation infrastructure can become its own failure point.

Fifth, the paper demonstrates hypothesis generation and workflow execution, not clinical efficacy. Drug repurposing candidates produced through network analysis require biological interpretation, experimental validation, and ultimately clinical evidence before they become therapies. The system can accelerate the queue. It cannot certify the destination.

What Cognaptus would take from this paper

The useful Cognaptus reading is not “build a biomedical chatbot.” It is “build agents around the real workflow boundary.”

For enterprise AI, ChatDRex points to a practical design rule:

The LLM should own conversation, decomposition, routing, and explanation. The domain tools should own evidence-producing computation.

That rule sounds conservative. It is also scalable. A company can replace a model, upgrade a tool, add a guardrail, or improve a KG schema without rewriting the whole system. The architecture becomes modular because responsibility is modular.

This also changes how such systems should be sold and evaluated. Do not promise autonomous expertise. Show which tools are being invoked. Show how inputs become structured calls. Show intermediate outputs. Report tool-selection accuracy separately from final-answer quality. Identify where expert review is required. Treat narrative interpretation as a controlled risk, not a decorative final paragraph.

The business value is not that humans disappear from scientific work. The value is that humans spend less time babysitting pipeline mechanics and more time judging scientific meaning. That is not as cinematic as “AI discovers the next blockbuster drug.” It is probably more valuable.

Conclusion: the agent that knows when not to be the expert

ChatDRex is a strong paper because it understands a basic truth that many AI products still avoid: in serious domains, the agent should not be the source of truth. It should be the coordinator of sources, tools, and reasoning steps.

The system’s best results come when the LLM routes users to deterministic network-medicine tools and reflects their outputs. Its weakest and most instructive results appear when structured biological statistics must be turned into natural language and then judged automatically. That is not a failure of the whole idea. It is the boundary condition that makes the idea usable.

For businesses building vertical agents, this is the lesson worth keeping. The future is not one giant model pretending to be a scientist, lawyer, engineer, analyst, and compliance officer before lunch. The future is more likely a disciplined set of agents that know which tool to call, when to stop talking, and where expert judgment must re-enter the room.

Less oracle. More operating system.

Cognaptus: Automate the Present, Incubate the Future.

¹ Simon Süwer, Kester Bagemihl, Sylvie Baier, Lucia Dicunta, Markus List, Jan Baumbach, Andreas Maier, and Fernando M. Delgado-Chaves, “Conversational No-code, Multi-agentic Disease Module Identification and Drug Repurposing Prediction with ChatDRex,” arXiv:2511.21438, PDF version accessed via https://arxiv.org/pdf/2511.21438.
