When Tools Think Before Tokens: What TxAgent Teaches Us About Safe Agentic AI

Tools are supposed to make AI safer.

That is the sales pitch, anyway. Give the model access to curated biomedical databases, let it call APIs instead of hallucinating from memory, and clinical reasoning suddenly becomes more grounded. Less improvisation, more evidence. Less theatrical confidence, more traceable work.

The paper “MedAI: Evaluating TxAgent’s Therapeutic Agentic Reasoning in the NeurIPS CURE-Bench Competition” is useful because it makes that story less comfortable.¹ The authors do not merely ask whether an LLM can answer therapeutic questions. They examine a more operational failure point: whether an agentic system can select the right biomedical tool before the answer is generated.

That distinction matters. In ordinary chatbot evaluation, we often judge the final response. In agentic medical AI, the final response is only the last visible symptom. The real disease may have started earlier, when the model rewrote the query badly, retrieved the wrong function, formatted a parameter incorrectly, or called a narrow database when the task required a full clinical label narrative.

In other words: before the tokens “think,” the tools have already shaped the answer.

The safety problem starts before the model answers

The tempting misconception is that medical AI safety is mainly a model-size problem. If a small model fails, use a larger model. If a general model misses drug-label detail, use a biomedical model. If a biomedical model is not enough, fine-tune harder. Beautiful. Very expensive. Also incomplete.

TxAgent’s workflow shows why.

TxAgent is built around a fine-tuned Llama-3.1-8B model and a fine-tuned Qwen2-1.5B component used for tool retrieval. A therapeutic question enters the system. The LLM reformulates the question to clarify intent. That rewritten query is compared with ToolUniverse function descriptions. The retriever returns the top candidate tools. The LLM then chooses which tools to call, supplies parameters in JSON, receives the returned information, and decides whether another ToolRAG cycle is needed.

That gives us a simple failure chain:

clinical question
→ query rewrite
→ tool retrieval
→ tool selection
→ parameter construction
→ tool execution
→ retrieved context
→ final answer
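One way to make this chain inspectable is to log a pass/fail result per stage and report the earliest failure instead of only grading the final answer. The sketch below is a minimal illustration, not TxAgent's implementation: the stage names follow the chain above, while `StageResult`, `first_failure`, and the example trace are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# Stage names mirror the failure chain; the trace format itself is an assumption.
STAGES = [
    "query_rewrite",
    "tool_retrieval",
    "tool_selection",
    "parameter_construction",
    "tool_execution",
    "final_answer",
]


@dataclass
class StageResult:
    stage: str
    ok: bool
    detail: str = ""


def first_failure(trace: list[StageResult]) -> Optional[StageResult]:
    """Return the earliest failing stage in pipeline order, or None if all passed."""
    by_stage = {result.stage: result for result in trace}
    for stage in STAGES:
        result = by_stage.get(stage)
        if result is not None and not result.ok:
            return result
    return None


# Hypothetical trace: the answer is wrong, but the root cause is upstream.
trace = [
    StageResult("query_rewrite", True),
    StageResult("tool_retrieval", True),
    StageResult("tool_selection", False, "narrow metadata tool chosen for a label question"),
    StageResult("final_answer", False, "fluent but unsupported answer"),
]

failure = first_failure(trace)
print(failure.stage, "-", failure.detail)
```

Reporting the first failing stage keeps the wrong final answer from absorbing all the blame when the routing went wrong several steps earlier.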

The paper’s central value is not “TxAgent is good” or “DailyMed helps.” The more useful lesson is this: in an agentic medical system, answer quality depends on upstream routing quality. Once the wrong source enters the context window, the model may reason fluently over the wrong evidence. Medicine has enough real complications; we do not need decorative ones produced by a tool router.

The authors report several recurring issues in TxAgent evaluation runs: repeated calls caused by incorrectly formatted input parameter names, wrong functions selected even when better candidates were retrieved, and function calls that failed to return the expected information. These are not glamorous benchmark failures. They are production failures wearing lab coats.
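The parameter-name failure in particular can be caught before a call is ever executed. The following is a minimal sketch of that kind of pre-flight check, assuming a simplified schema format; the tool name `get_label_warnings`, its parameters, and the `check_call` helper are invented for illustration and are not part of ToolUniverse.

```python
def check_call(call: dict, schema: dict) -> list[str]:
    """Return a list of problems with a proposed tool call; empty means it passes."""
    if call.get("name") != schema["name"]:
        return [f"unknown tool: {call.get('name')!r}"]

    allowed = set(schema["parameters"])
    required = set(schema.get("required", []))
    supplied = set(call.get("arguments", {}))

    problems = []
    for name in sorted(supplied - allowed):
        problems.append(f"unexpected parameter: {name!r}")
    for name in sorted(required - supplied):
        problems.append(f"missing required parameter: {name!r}")
    return problems


# Hypothetical schema and model-generated call, for illustration only.
schema = {
    "name": "get_label_warnings",
    "parameters": ["drug_name", "limit"],
    "required": ["drug_name"],
}
call = {"name": "get_label_warnings", "arguments": {"drugname": "warfarin"}}

for problem in check_call(call, schema):
    print(problem)
```

Returning the specific problems, rather than a bare pass/fail, gives the agent something concrete to repair instead of retrying the same malformed call.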

OpenFDA gives precision; DailyMed gives narrative context

The paper’s main intervention is the integration of DailyMed into ToolUniverse.

This is not merely another database plugged into an agent. It changes the shape of what the agent can retrieve. The authors describe openFDA tools as useful for granular queries and metadata retrieval. That granularity is valuable when the question is narrow and the needed fact is specific. But broad therapeutic questions often require clinical narrative: contraindications, warnings, adverse reactions, pregnancy-related information, dosage context, and label language that was written to be read as a coherent document rather than assembled from tiny fragments.

DailyMed contributes access to Structured Product Labeling, giving TxAgent a way to retrieve complete, version-controlled clinical label narratives. That matters because many medical questions do not fail from lack of “facts.” They fail because the facts are scattered across the wrong level of abstraction.

Source style	Operational strength	Operational weakness	Best-fit use
Granular openFDA-style calls	Efficient targeted retrieval; lower context burden	May require multiple calls for broad therapeutic reasoning	Specific metadata or narrow fact lookup
DailyMed label retrieval	Complete, human-readable clinical narratives	Larger context payload; less fine-grained selection	Broader drug-label reasoning and safety-oriented interpretation
No retrieval	Simpler pipeline	Relies on parametric memory and stale or incomplete knowledge	Mostly unsuitable for high-stakes therapeutic reasoning

The business lesson is not that DailyMed is universally “better.” That would be the sort of procurement logic that produces expensive dashboards and sad meetings. The lesson is that tool catalogs need role separation. A safe healthcare AI stack should not simply ask, “Do we have drug data?” It should ask: “Which source should answer which class of question, at which level of narrative depth, with what retrieval cost?”
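Role separation can be made explicit in code instead of living in a prompt. The sketch below assumes a hypothetical set of question classes and a hand-written routing table; neither comes from the paper, and a real system would need a much richer taxonomy and a way to classify incoming questions.

```python
from enum import Enum


class Source(Enum):
    GRANULAR = "granular openFDA-style call"
    LABEL = "DailyMed label retrieval"


# Hypothetical mapping from question class to source role.
ROUTES = {
    "metadata_lookup": Source.GRANULAR,
    "narrow_fact": Source.GRANULAR,
    "label_narrative": Source.LABEL,
    "safety_interpretation": Source.LABEL,
}


def route(question_class: str) -> Source:
    """Pick a source role; unknown classes raise instead of silently defaulting."""
    try:
        return ROUTES[question_class]
    except KeyError:
        raise ValueError(f"no source role defined for {question_class!r}") from None


print(route("narrow_fact").value)
print(route("safety_interpretation").value)
```

Failing loudly on an unmapped question class is a deliberate choice here: a silent default is exactly how the wrong source ends up in the context window.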

The retriever experiment is the main evidence about routing quality

The paper’s retriever comparison is the cleanest evidence for the first contribution: tool/function-call retrieval quality is a bottleneck.

The authors rewrote TxAgent’s ToolRAG functionality to support several sparse and dense retrieval approaches while keeping the decision structure consistent. Each retriever compared the TxAgent-rewritten question against ToolUniverse function descriptions and returned the top $k = 10$ function names. That detail matters because the experiment is not a loose comparison of unrelated systems. It isolates the retrieval method within the same tool-selection pipeline.
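A team reproducing this kind of comparison could also score the retriever directly, before any answer is generated. The sketch below computes recall at $k$ over retrieved function names. Note the assumptions: the paper compares retrievers through downstream QA performance, so this metric, the `recall_at_k` helper, and the tool names are illustrative additions rather than the authors' protocol.

```python
def recall_at_k(ranked: list[str], relevant: set[str], k: int = 10) -> float:
    """Fraction of the relevant tools that appear among the top-k retrieved names."""
    if not relevant:
        raise ValueError("relevant set must not be empty")
    top_k = set(ranked[:k])
    return len(top_k & relevant) / len(relevant)


# Hypothetical retriever output and gold tool set for one question.
ranked = ["tool_a", "tool_b", "tool_c", "tool_d"]
relevant = {"tool_b", "tool_z"}

print(recall_at_k(ranked, relevant, k=10))
print(recall_at_k(ranked, relevant, k=1))
```

A metric like this needs gold tool annotations per question, which a benchmark may not provide; without them, downstream accuracy remains the only available signal.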

The result pattern is straightforward. No retrieval performs poorly. BM25 struggles, likely because exact lexical matching is a bad fit when function descriptions are short and semantic intent matters. Dense retrievers perform more similarly to one another, but TxAgent’s fine-tuned Qwen2-1.5B retriever does better than generic retrieval alternatives. The strongest setting comes from the TxAgent retriever with DailyMed integrated.

The paper’s Figure 1 reports performance relative to the best DailyMed-enhanced setting. The exact visual values are less important than the structure of the result: generic retrieval is not enough, domain-tuned tool retrieval helps, and source coverage can matter even when the retrieval model is already strong.

That gives us a sharper interpretation:

Experiment	Likely purpose	What it supports	What it does not prove
Retriever comparison across sparse and dense methods	Main evidence	Tool retrieval quality materially affects therapeutic QA performance	That one retriever is universally best across all medical-agent tasks
DailyMed integration into ToolUniverse	Main evidence plus implementation contribution	Better source coverage can improve agent access to drug-label information	That broader retrieval always improves safety or efficiency
No-retrieval baseline	Comparison baseline	External context matters for therapeutic reasoning	That all retrieval methods are useful
BM25 comparison	Retrieval-method comparison	Exact-match retrieval is weak for short tool descriptions and semantic medical intent	That sparse retrieval is always inferior in all biomedical search settings

The critical point is not that BM25 had a bad day. The point is that tool descriptions are small, semantic, and operational. A function name plus a one- or two-sentence description is not the same as a document corpus. Retrieval over tools is closer to dispatching work to the right specialist than searching a library shelf. If the dispatcher is wrong, the specialist never gets called.

Fixed retrieval shows that context quality and model behavior interact

The second experimental block uses a fixed-retrieval setup. Here the authors take identical retrieved information from a sample TxAgent run, clean the tool-call context, and feed it to various LLMs using a simplified tool-query prompt. They evaluate both multiple-choice and open-ended multiple-choice formats, and they also test answer-option permutation using the transformation $[A, B, C, D] \rightarrow [B, D, A, C]$.
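The permutation step is simple to get subtly wrong, because the gold label has to move with its option text. The sketch below shows one reading of the transformation, in which the new first slot shows the option formerly labeled B, the second the former D, and so on. That reading, the `permute` helper, and the placeholder options are assumptions, not details confirmed by the paper.

```python
LETTERS = ["A", "B", "C", "D"]
# Assumed reading: new slot LETTERS[i] shows the option formerly labeled ORDER[i].
ORDER = ["B", "D", "A", "C"]


def permute(options: dict[str, str], gold: str) -> tuple[dict[str, str], str]:
    """Reorder the answer options and remap the gold label to its new letter."""
    new_options = {new: options[old] for new, old in zip(LETTERS, ORDER)}
    new_gold = LETTERS[ORDER.index(gold)]
    return new_options, new_gold


# Placeholder options, for illustration only.
options = {"A": "drug one", "B": "drug two", "C": "drug three", "D": "drug four"}
new_options, new_gold = permute(options, gold="C")

print(new_options)
print(new_gold, "->", new_options[new_gold])
```

A quick sanity check for any permutation harness is that the text under the remapped gold label is identical before and after the shuffle.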

This experiment should be read carefully. It is not a second thesis about replacing TxAgent with generic LLMs. It is closer to a sensitivity test: if the retrieved context is held fixed, how well do different models use it?

The result pattern is again practical. Removing retrieved information generally lowers performance. Permuting answer options generally lowers accuracy too, although not uniformly. Multiple-choice questions are easier than open-ended multiple-choice questions because the options help focus the model’s reasoning and retrieval direction. GPT-OSS models are the only ones reported as exceeding the TxAgent baseline in some settings, suggesting that stronger parametric knowledge and context use can still matter. At the same time, the fine-tuned TxAgent Llama-3.1-8B performs better than non-fine-tuned Llama-3.1-8B when both receive identical context.

That last comparison is easy to underread. It suggests that retrieval alone does not solve the problem. The model still has to know how to use the retrieved context. A bad reader with a good library is still a bad reader, just better documented.

The smaller-model result is also business-relevant. The authors note that models such as Gemma3-4B and Qwen3-4B can achieve high accuracy when useful retrieved information is present. That does not mean hospitals should hand clinical decision-making to cheap small models next Monday. It does mean that the economics of medical AI may depend less on always buying the largest model and more on engineering the evidence pipeline well enough that smaller models can operate within bounded tasks.

The dataset makes this a benchmark result, not clinical deployment evidence

The competition setting matters.

The paper uses the CURE-Bench NeurIPS 2025 Challenge datasets. The validation set contains 459 questions with ground-truth answers. The two test sets contain 2,079 and 2,491 questions, respectively, but do not contain ground-truth labels for participants. The validation set includes three question styles: open-ended, multiple-choice, and open-ended multiple-choice.

Dataset	Total questions	MC	OE-MC	OE
Validation	459	183	230	46
Test 1	2,079	663	1,274	142
Test 2	2,491	779	1,474	238

This setup is appropriate for competition evaluation, but it also defines the boundary of interpretation. The paper shows performance behavior in therapeutic QA benchmark conditions. It does not show prospective clinical safety, workflow integration, physician trust, malpractice risk reduction, or improved patient outcomes.

That is not a criticism. It is a category label. A benchmark paper should not be forced to pretend it is a hospital deployment study. The useful business move is to translate the evidence at the right level: system design, evaluation protocol, and procurement criteria.

For builders, the product is the tool-routing layer

For healthcare AI vendors, the paper points toward a less glamorous but more defensible product thesis: the value is not only in the chat interface or the foundation model. The value is in the governed tool layer.

A serious therapeutic agent needs at least four operational assets.

First, it needs a tool catalog with clear source roles. OpenFDA-style granular tools and DailyMed-style label retrieval should not compete as vague “drug data.” They should be mapped to task types: adverse events, pregnancy warnings, contraindications, drug interactions, dosage context, mechanism-level explanations, and regulatory label narratives.

Second, it needs retrieval evaluation at the function-call level. Many teams evaluate final answers and call it safety testing. That is late-stage inspection. Tool-routing evaluation asks earlier questions: Did the system retrieve the right candidate functions? Did it select the right one? Did it format parameters correctly? Did the tool return the expected information? Did the model know when to stop calling tools?

Third, it needs prompt and context policies. The fixed-retrieval experiment shows that identical context can be used differently by different models. That means context injection is not a dump-and-pray operation. The ordering, framing, and task structure matter.

Fourth, it needs cost-aware context management. DailyMed improves access to richer label narratives, but the paper explicitly notes the tradeoff: a single broad retrieval may pull in information that would otherwise require several fine-grained ToolUniverse calls, potentially leading to larger context windows and higher computational overhead.
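The function-call-level evaluation in the second point can be reported as conditional rates, so that each stage is judged only on the cases that reached it. The sketch below is a minimal version under stated assumptions: the record fields `retrieved`, `selected`, and `params_valid` and the `routing_rates` helper are hypothetical, and the example records are invented rather than taken from the paper.

```python
def routing_rates(records: list[dict]) -> dict[str, float]:
    """Compute conditional rates so each stage is scored on the cases that reached it."""

    def rate(numerator: int, denominator: int) -> float:
        return numerator / denominator if denominator else float("nan")

    retrieved = [r for r in records if r["retrieved"]]
    selected = [r for r in retrieved if r["selected"]]
    valid = [r for r in selected if r["params_valid"]]

    return {
        "retrieval_hit_rate": rate(len(retrieved), len(records)),
        "selection_given_retrieved": rate(len(selected), len(retrieved)),
        "valid_params_given_selected": rate(len(valid), len(selected)),
    }


# Invented evaluation records, one per question.
records = [
    {"retrieved": True, "selected": True, "params_valid": True},
    {"retrieved": True, "selected": False, "params_valid": False},
    {"retrieved": True, "selected": True, "params_valid": False},
    {"retrieved": False, "selected": False, "params_valid": False},
]

for name, value in routing_rates(records).items():
    print(f"{name}: {value:.2f}")
```

Conditioning matters because a low selection rate means something different when the right tool was never retrieved than when it was sitting in the candidate list and ignored.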

In business terms, the stack looks like this:

Layer	What the paper directly shows	Cognaptus inference for business use	What remains uncertain
Tool retrieval	Retrieval quality affects therapeutic QA performance	Evaluate tool routers as first-class safety components	Generalization across hospital workflows
Source integration	DailyMed improves access to complete label narratives	Govern sources by task role, not brand name	Optimal balance between granular and broad retrieval
Model choice	Fine-tuning and model family affect context use	Smaller models may be viable when context is strong	Safety under real clinical ambiguity
Answer-option permutation	Accuracy can shift when options move	Test positional robustness in evaluation suites	Whether this translates to free-text clinical decisions
Competition benchmark	CURE-Bench supports systematic comparison	Use benchmark evidence for design screening	Do not treat it as deployment validation

This is where many AI product roadmaps quietly go wrong. They budget for the model. They demo the chatbot. They mention “RAG” as if it were a compliance certificate. Then the actual system fails because the wrong tool was called with the wrong parameter and the final answer looked perfectly fluent.

Very modern. Very preventable.

For buyers, ask about tool failures before model scores

Healthcare buyers should not read this paper as a reason to demand TxAgent specifically. They should read it as a reason to ask better due-diligence questions.

A vendor claiming medical-agent capability should be able to answer:

What are the available tools, and what clinical question types are they intended to support?
How is tool retrieval evaluated separately from final answer accuracy?
How often does the system retrieve the right tool but fail to call it correctly?
How are malformed parameters detected and repaired?
What happens when a tool returns incomplete or unexpected information?
How does the system decide between granular database calls and broader label retrieval?
What is the context-cost impact of richer sources such as full label narratives?
Are answer formats tested for positional bias and prompt sensitivity?

These questions are not academic decoration. They map directly to operational risk. If an AI assistant recommends a treatment, the buyer needs to know whether the recommendation came from current label information, stale model memory, a failed function call, or a lucky multiple-choice guess. “The model is state-of-the-art” is not an answer. It is a bumper sticker.

The boundary: richer tools are not automatically safer tools

The paper’s own limitation section is important and should not be buried.

TxAgent’s original retrieval system uses precise tool selection to minimize context expansion, even if that requires multiple iterative function calls. The DailyMed extension is broader. It can retrieve information that would otherwise require several ToolUniverse calls, but that breadth can enlarge the context and increase computational overhead.

This creates a practical tradeoff:

fine-grained tools
→ lower context burden
→ more calls and more routing opportunities for failure

broader label retrieval
→ richer narrative context
→ larger context windows and higher processing cost

Neither side wins universally. The right design depends on task type, risk level, latency budget, audit requirement, and how well the model can compress and use long retrieved narratives.

There is another boundary. The experiments are based on validation and competition-style therapeutic QA settings, not live clinical deployment. They reveal mechanisms and design sensitivities. They do not establish that an agent is safe for autonomous clinical decision-making. The paper is strongest when read as infrastructure evidence: it tells us where to inspect the machinery.

The uncomfortable lesson: agentic safety is infrastructure work

The most useful insight from this paper is also the least flashy: safe agentic AI is not created at the final answer layer.

It is created in the boring middle: tool descriptions, function retrieval, source selection, schema design, parameter validation, context assembly, prompt structure, and evaluation protocols that inspect intermediate behavior. That is where the system either earns the right to reason or quietly contaminates the reasoning process before anyone sees the answer.

TxAgent is interesting because it makes that middle layer visible. DailyMed is interesting because it shows that source design is not interchangeable. The fixed-retrieval experiments are interesting because they show that context and model behavior interact. The competition setting is interesting because it forces tool use and reasoning quality to be evaluated together instead of hidden behind a polished response.

For business leaders, the message is simple: do not buy “medical AI agents” as if they were smarter chatbots. Buy them, build them, and evaluate them as controlled tool-using systems.

The model may write the answer. The tools decide what world the answer is allowed to see.

Cognaptus: Automate the Present, Incubate the Future.

¹ Tim Cofala, Christian Kalfar, Jingge Xiao, Johanna Schrader, Michelle Tang, and Wolfgang Nejdl, “MedAI: Evaluating TxAgent’s Therapeutic Agentic Reasoning in the NeurIPS CURE-Bench Competition,” arXiv:2512.11682, 2025.
