Opening — Why this matters now

Agentic AI is having a moment. From autonomous coding agents to self-directed research assistants, the industry has largely agreed on one thing: reasoning is no longer just about tokens; it is about action. And once models are allowed to act, especially in high‑stakes domains like medicine, the question stops being "can the model answer correctly?" and becomes "can it act correctly, step by step, without improvising itself into danger?"

The NeurIPS 2025 CURE‑Bench challenge lands squarely in this tension. Instead of rewarding flashy end answers, it evaluates something far less glamorous but far more consequential: whether an AI system can reason therapeutically through tools, safely, verifiably, and repeatedly. The TxAgent system, together with its post‑mortem analysis, offers a rare, concrete look at what actually breaks when agentic AI is deployed under real constraints.

Background — From RAG to regulated agency

Traditional retrieval‑augmented generation (RAG) was designed to patch a single weakness: hallucination. Fetch documents, stuff them into context, and hope the model behaves. That works tolerably well for Wikipedia trivia. It collapses in medicine.

Clinical reasoning is not a single lookup problem. It’s a multi‑step decision process spanning patient context, disease mechanisms, contraindications, drug interactions, and evolving regulatory knowledge. Errors compound. Worse, they can’t be hand‑waved away as “creative variance.”

This is where agentic RAG emerges. Instead of retrieving text blobs, systems like TxAgent decompose questions into tool calls: structured, API‑level interactions with curated biomedical sources (FDA, OpenTargets, Monarch). In theory, this moves AI closer to clinical workflows. In practice, it introduces a new failure mode: choosing the wrong tool, repeatedly, with confidence.
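To make "tool call" concrete, here is a minimal Python sketch of what a structured, validated call might look like. The tool name, parameter schema, and registry are illustrative assumptions, not the actual ToolUniverse API.

```python
from dataclasses import dataclass, field

@dataclass
class ToolCall:
    """A structured, auditable request to a curated biomedical source."""
    tool_name: str                       # e.g. a hypothetical "fda_drug_label_lookup"
    parameters: dict = field(default_factory=dict)

def validate(call: ToolCall, registry: dict) -> bool:
    """Reject calls whose tool or parameters fall outside the allowed schema."""
    schema = registry.get(call.tool_name)
    return schema is not None and set(call.parameters) <= set(schema["allowed_params"])

# Illustrative only: neither the tool name nor its schema comes from the paper.
registry = {"fda_drug_label_lookup": {"allowed_params": ["drug", "section"]}}
call = ToolCall("fda_drug_label_lookup", {"drug": "warfarin", "section": "contraindications"})
assert validate(call, registry)
```

The point of the schema is not elegance; it is that an invalid call can be rejected before it ever shapes the model's reasoning.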

Analysis — What TxAgent actually does

TxAgent is not a monolithic model. It is a pipeline:

  • Llama‑3.1‑8B (fine‑tuned): rewrites the clinical question to expose intent.
  • Qwen2‑1.5B (fine‑tuned): retrieves candidate tool functions by semantic similarity.
  • ToolUniverse: executes structured calls against biomedical databases.
  • Iterative control loop: decides whether more tools are needed—or whether to stop.

This design choice is critical. TxAgent treats tool usage as a first‑class reasoning primitive, not a side effect. The model doesn’t just answer—it plans, calls, evaluates, retries. That is precisely why CURE‑Bench evaluates not only final accuracy, but also:

  • correctness of tool selection
  • validity of parameters
  • coherence of reasoning traces

In regulated domains, these are not diagnostics. They are safety rails.
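To ground the iterative control loop described above, here is a minimal sketch of how a plan → call → evaluate → stop cycle might be wired together. The component interfaces, stopping criterion, and step budget are assumptions for illustration, not TxAgent's actual code.

```python
from dataclasses import dataclass

@dataclass
class Step:
    tool_name: str      # which tool the agent chose
    parameters: dict    # the arguments it passed
    result: str         # what the tool returned

def run_agent(question: str, rewriter, retriever, executor, judge, max_steps: int = 5):
    """Iterative tool-calling loop: rewrite -> retrieve tool -> call -> decide to stop.

    `rewriter`, `retriever`, `executor`, and `judge` stand in for the pipeline
    components (question rewriting, tool retrieval, ToolUniverse execution, and
    the stop/continue decision). Their interfaces are assumed, not taken from the paper.
    """
    intent = rewriter(question)          # expose what the question is really asking
    trace: list[Step] = []               # the auditable reasoning trace

    for _ in range(max_steps):           # hard cap: no unbounded improvisation
        tool_name, parameters = retriever(intent, trace)   # pick the next tool
        result = executor(tool_name, parameters)           # structured API call
        trace.append(Step(tool_name, parameters, result))

        answer = judge(intent, trace)    # enough evidence to answer safely?
        if answer is not None:
            return answer, trace         # return the answer *and* the trace

    return None, trace                   # refuse rather than guess past the budget
```

Returning the trace alongside the answer is what makes the CURE‑Bench criteria, correct tool selection, valid parameters, and coherent reasoning, checkable at all.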

Findings — Retrieval beats raw intelligence

The most counterintuitive result from the paper is also the most important for business leaders betting on bigger models.

1. Bigger models didn’t win by default

When retrieval was frozen, or removed entirely, even strong general models degraded sharply. Parametric knowledge alone was insufficient for therapeutic tasks. In some settings, TxAgent without retrieval outperformed larger proprietary models relying on internal knowledge alone.

2. Tool retrieval quality dominated outcomes

A simplified summary of retriever performance:

Retriever Type            | Outcome
--------------------------|--------------------------------------
BM25 (sparse)             | Consistently weak (lexical mismatch)
Dense retrievers          | Similar accuracy, higher variance
Fine‑tuned Qwen2‑1.5B     | Strong baseline
Qwen2‑1.5B + DailyMed     | Best overall

The takeaway is brutal: wrong tool → wrong reasoning, no matter how good the LLM is.
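Why lexical matching loses to dense retrieval is easy to see in miniature. Below is a rough sketch of dense tool selection; the generic sentence-embedding model stands in for the fine‑tuned Qwen2‑1.5B retriever, and the tool descriptions are invented examples.

```python
# Sketch of dense tool retrieval: embed tool descriptions once, then pick the
# tool whose description is most similar to the (rewritten) question.
# The embedding model is a generic stand-in, not the paper's fine-tuned retriever.
from sentence_transformers import SentenceTransformer, util

TOOLS = {
    "drug_interaction_check": "Check interactions between two or more drugs.",
    "drug_label_lookup": "Retrieve contraindications and dosing from a drug label.",
    "disease_target_lookup": "Find genes and targets associated with a disease.",
}

model = SentenceTransformer("all-MiniLM-L6-v2")
tool_vecs = model.encode(list(TOOLS.values()), convert_to_tensor=True)

def select_tool(question: str) -> str:
    """Return the name of the tool whose description best matches the question."""
    q_vec = model.encode(question, convert_to_tensor=True)
    scores = util.cos_sim(q_vec, tool_vecs)[0]        # cosine similarity per tool
    return list(TOOLS)[int(scores.argmax())]

print(select_tool("Can warfarin be taken together with ibuprofen?"))
# A BM25-style lexical match would struggle here: the question never says
# "interaction", but a dense retriever can still bridge the vocabulary gap.
```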

3. DailyMed changed the game

Integrating DailyMed didn’t add “more data.” It added better‑structured authority. Unlike granular FDA endpoints, DailyMed provides cohesive, human‑readable drug narratives in a single call. That reduced tool churn, improved reasoning stability, and delivered measurable gains across evaluation metrics.
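One way to picture the effect: instead of the agent stitching together several narrow endpoint calls, a single consolidated lookup returns one cohesive label. The function names and endpoint granularity below are hypothetical, chosen only to illustrate the contrast.

```python
# Hypothetical contrast, for illustration only: neither the function names nor
# the endpoint granularity comes from the paper or from ToolUniverse.

def answer_with_granular_tools(drug: str, call) -> dict:
    """Several narrow calls the agent must choose, order, and reconcile itself."""
    return {
        "indications": call("fda_indications", {"drug": drug}),
        "contraindications": call("fda_contraindications", {"drug": drug}),
        "interactions": call("fda_interactions", {"drug": drug}),
    }

def answer_with_consolidated_label(drug: str, call) -> str:
    """One consolidated call returning a cohesive, human-readable label narrative."""
    return call("dailymed_label", {"drug": drug})
```

Fewer, richer calls leave fewer opportunities to pick the wrong tool, which is consistent with the reduced tool churn and improved stability reported above.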

Implications — Lessons beyond medicine

If you strip away the clinical specifics, TxAgent’s lessons generalize uncomfortably well.

For enterprise AI:

  • Tool schemas matter more than prompt elegance.
  • Fine‑tuning retrieval often outperforms scaling generation.
  • Observability at the action level is non‑negotiable.

For AI governance:

  • Auditing must include tool‑call traces, not just outputs.
  • “Explainability” without execution correctness is cosmetic.
  • Safety emerges from constrained action spaces, not moral prompts.

For ROI‑driven automation:

  • Smaller, cheaper models can outperform larger ones if retrieval is right.
  • Agentic workflows fail silently when tools are misaligned.
  • Investing in domain‑specific tool layers yields compounding returns.

In short: intelligence without instrumentation is just liability.
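What does “observability at the action level” look like in practice? A minimal sketch, assuming every tool call is logged as a structured event before its result influences the answer; the field names and JSON-lines sink are illustrative, not a prescribed standard.

```python
# Minimal sketch of action-level observability: one append-only audit record
# per tool call. Field names and the JSON-lines file are illustrative assumptions.
import json
import time
import uuid

def audit_tool_call(tool_name: str, parameters: dict, result_summary: str,
                    log_path: str = "tool_audit.jsonl") -> dict:
    """Append one auditable record per tool call; return it for inline inspection."""
    event = {
        "event_id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "tool_name": tool_name,
        "parameters": parameters,
        "result_summary": result_summary,   # truncated result, not the full payload
    }
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(event) + "\n")
    return event

audit_tool_call("drug_label_lookup", {"drug": "warfarin"}, "contraindications: ...")
```

Trivial as it looks, this is the difference between “the model said X” and “the model called tool Y with parameters Z, then said X.” Only the latter can be audited.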

Conclusion — The quiet discipline of safe agency

TxAgent didn’t win CURE‑Bench by being clever. It won by being disciplined—about tools, about iteration, and about stopping conditions. That’s the unsexy truth about deploying AI in the real world.

As agentic systems move from demos to decision‑makers, the industry’s obsession with model size will look increasingly misplaced. The future belongs to systems that know when and how to act, not just how to speak.

Cognaptus: Automate the Present, Incubate the Future.