Scan You Believe It? Why RadAgent Makes Medical AI Show Its Work

Hospitals do not merely need an AI that can write a radiology report. They need an AI whose work can be checked before the report becomes somebody else’s problem.

That sounds obvious, which is exactly why it is often ignored. A chest CT is a dense three-dimensional diagnostic object. A radiologist does not just glance at it, produce prose, and walk away. They inspect anatomy, compare regions, test impressions, look for omissions, and decide whether a finding is actually supported by the scan. Many vision-language models, by contrast, still behave like a polished black box: scan in, report out, confidence implied by typography.

The paper “RadAgent: A tool-using AI agent for stepwise interpretation of chest computed tomography” proposes a more operationally serious design.¹ RadAgent starts with an initial CT report draft, follows a clinician-reviewed diagnostic checklist, calls specialized CT tools through Model Context Protocol servers, keeps a scratchpad of intermediate evidence, and then produces a final report. The important point is not simply that the system uses tools. Everyone and their intern can now attach tools to a model. The important point is that RadAgent is trained to use those tools as part of a coherent diagnostic process.

That distinction matters. In regulated, high-stakes environments, “the model was accurate on average” is not the same as “the model produced a trace that a clinician can inspect.” The first is a benchmark claim. The second is closer to a deployable workflow.

The real contribution is the workflow, not the toolbox

A shallow reading of RadAgent would say: add specialist tools to a medical VLM, get better reports. Tempting, but incomplete.

RadAgent’s architecture is more specific. It combines three elements that are easy to confuse but operationally different:

Component	What it does	Why it matters operationally
Initial report draft	Uses CT-Chat to produce a first-pass report	Gives the agent a starting hypothesis rather than forcing every case to begin from zero
Diagnostic checklist	Guides the agent through chest CT review categories	Reduces omission risk and anchors the workflow in radiology practice
Tool-using agent loop	Selects tools, asks diagnostic questions, updates memory, and refines findings	Turns model output into an inspectable sequence of decisions

The paper describes RadAgent as a ReAct-style agent for 3D chest CT analysis. At each step, it can decide whether to keep investigating, which tool to call, and what diagnostic question to ask. Its available tools include report generation, disease classification, 3D and 2D visual question answering, anatomy and effusion segmentation, slice extraction, and CT windowing. These are not decorative plugins. They correspond to different parts of a radiology workflow: screening, localization, visual verification, and report synthesis.

The scratchpad is the quiet governance feature. It records preliminary findings and tool outputs as the agent proceeds. In plain language, RadAgent is designed to leave a trail. That trail does not magically make the model correct, but it gives a clinician a surface to inspect, challenge, or refine. For medical AI, this is less glamorous than a giant leaderboard jump. It is also much closer to what deployment actually requires.
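To make the loop and its trail concrete, here is a minimal sketch of a checklist-driven agent loop that records every tool call in a scratchpad. It is an illustration only, not the paper's implementation: the names `Step`, `Scratchpad`, `run_agent`, and `toy_policy` are hypothetical, the tools are plain Python stubs rather than MCP servers, and the policy is a hand-written rule standing in for the trained model.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Step:
    category: str   # checklist category under review
    tool: str       # tool the policy chose
    question: str   # diagnostic question sent to the tool
    output: str     # raw tool output, kept for later audit


@dataclass
class Scratchpad:
    draft: str                                   # initial report draft
    steps: List[Step] = field(default_factory=list)

    def record(self, step: Step) -> None:
        self.steps.append(step)


Tool = Callable[[str], str]
# Hypothetical policy signature: returns (tool_name, question), or None to move on.
Policy = Callable[[str, Scratchpad], Optional[Tuple[str, str]]]


def run_agent(draft: str, checklist: List[str], tools: Dict[str, Tool],
              policy: Policy, max_steps: int = 20) -> Scratchpad:
    pad = Scratchpad(draft=draft)
    for category in checklist:
        while len(pad.steps) < max_steps:        # global tool-call budget
            action = policy(category, pad)
            if action is None:                   # policy considers the category settled
                break
            tool_name, question = action
            tool = tools.get(tool_name)
            output = tool(question) if tool else "ERROR: unknown tool"
            pad.record(Step(category, tool_name, question, output))
    return pad


def toy_policy(category: str, pad: Scratchpad) -> Optional[Tuple[str, str]]:
    # Stand-in for the learned policy: one tool call per category, then stop.
    if any(s.category == category for s in pad.steps):
        return None
    tool = "segment" if category == "pleura" else "vqa_3d"
    return tool, f"Any abnormality in the {category}?"


toy_tools: Dict[str, Tool] = {
    "vqa_3d": lambda q: f"stub answer to: {q}",
    "segment": lambda q: "stub mask summary",
}

pad = run_agent("Draft: no acute findings.", ["lungs", "pleura"], toy_tools, toy_policy)
for s in pad.steps:
    print(s.category, "|", s.tool, "|", s.output)
```

The design point is that the final report is never the only artifact. Each entry in `pad.steps` ties a checklist category to a specific tool call and its output, which is the surface a clinician would inspect.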

Why “just add tools” is the wrong lesson

The authors explicitly test a training-free version of RadAgent. This variant has the same tool set, prompt structure, and diagnostic checklist, but its tool-calling policy is not optimized by reinforcement learning. It already improves over CT-Chat in macro-F1, which suggests the architecture itself helps. But the fully trained RadAgent performs better, especially in external generalization.

That is the useful lesson. Tools increase capability only when the model learns when and how to use them.

A tool-using system can fail in several boring but expensive ways. It can call a segmentation model and never use the segmentation. It can ask a vague visual question and receive a vague answer. It can inspect the wrong slice. It can chase tool diversity because tool diversity looks impressive in a demo. It can follow the checklist mechanically without improving the report. Enterprise AI does not usually fail because a component is missing; it fails because components are assembled into rituals rather than workflows.

RadAgent addresses this with reinforcement learning using GRPO. The orchestrator is an instruction-tuned Qwen3-14B model, fine-tuned with LoRA. The reward is not a single “better report” score. It is a composite reward that tries to balance report quality, successful tool use, tool diversity, graph coherence, and checklist adherence.

That reward design is not a technical footnote. It is the management system of the agent.
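A composite reward of this kind can be sketched as a weighted sum over per-rollout scores. The component names below follow the five terms listed above, but the weights, the assumption that each score lies in [0, 1], and the function name `composite_reward` are illustrative choices, not values from the paper.

```python
from dataclasses import asdict, dataclass
from typing import Dict


@dataclass(frozen=True)
class RewardWeights:
    report_quality: float
    tool_success: float
    tool_diversity: float
    coherence: float
    checklist: float


def composite_reward(components: Dict[str, float], w: RewardWeights) -> float:
    """Weighted sum of per-rollout scores (each assumed to lie in [0, 1])."""
    weights = asdict(w)
    missing = set(weights) - set(components)
    if missing:
        raise KeyError(f"missing reward components: {sorted(missing)}")
    return sum(weights[name] * components[name] for name in weights)


# Illustrative weights only; the paper's actual weighting is not reproduced here.
weights = RewardWeights(report_quality=0.5, tool_success=0.1,
                        tool_diversity=0.1, coherence=0.15, checklist=0.15)

# Two hypothetical rollouts with the same report quality but different process.
disciplined = {"report_quality": 0.6, "tool_success": 1.0, "tool_diversity": 0.5,
               "coherence": 0.8, "checklist": 0.9}
ritual = {"report_quality": 0.6, "tool_success": 1.0, "tool_diversity": 1.0,
          "coherence": 0.1, "checklist": 0.2}

print(round(composite_reward(disciplined, weights), 3))
print(round(composite_reward(ritual, weights), 3))
```

With these made-up numbers, the rollout that calls many tools without coherence or checklist discipline scores lower than the disciplined one, even though both reports are equally good. That is the behavior a composite reward is meant to encode.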

The reward ablation is where the paper becomes practical

The appendix reward ablation is easy to skip because appendices are where papers hide useful information from busy readers. Here, it is central.

The authors compare three training paradigms: their mixed reward curriculum, which starts with more exploration and later emphasizes coherence and checklist adherence; a version without the tool-sequence reward; and a version that applies the sequence judge from the start.

The result is a clean warning for anyone building business agents:

Training variant	Likely purpose of the test	What it supports	What it does not prove
Mixed reward curriculum	Main ablation for final reward design	Balancing exploration first and coherence later produces better overall behavior	It does not prove this exact schedule is optimal for other hospitals or toolboxes
No sequence reward	Tests whether report quality alone is enough	Without sequence-oriented rewards, the model can ignore checklist discipline and produce incoherent tool calls	It does not mean report-quality metrics are useless
Sequence judge from the start	Tests whether process control should dominate early	Early penalty on tool sequence behavior can suppress useful exploration and trade off report quality	It does not mean coherence rewards should be removed

This is the paper’s most generalizable business insight. Process metrics cannot simply be bolted on at the end, but they also cannot dominate before the system has learned how to explore the task. In other words, governance is not a dashboard. It is part of training.
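One way to picture "exploration first, coherence later" is a weight schedule that shifts over training. The sketch below is a guess at the general shape, not the paper's schedule: the linear ramp, the 30% exploration phase, the weight values, and the name `curriculum_weights` are all assumptions made for illustration.

```python
from typing import Dict

# Hypothetical endpoints of the schedule; both sets sum to 1.
EARLY = {"report_quality": 0.5, "tool_success": 0.2, "tool_diversity": 0.3,
         "coherence": 0.0, "checklist": 0.0}
LATE = {"report_quality": 0.5, "tool_success": 0.1, "tool_diversity": 0.0,
        "coherence": 0.2, "checklist": 0.2}


def curriculum_weights(step: int, total_steps: int,
                       warmup_frac: float = 0.3) -> Dict[str, float]:
    """Reward weights at a training step: exploration-heavy first, process-heavy later."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    if not 0.0 <= warmup_frac < 1.0:
        raise ValueError("warmup_frac must be in [0, 1)")
    progress = min(max(step / total_steps, 0.0), 1.0)
    # t stays at 0 during the exploration phase, then ramps linearly to 1.
    if progress <= warmup_frac:
        t = 0.0
    else:
        t = (progress - warmup_frac) / (1.0 - warmup_frac)
    return {name: (1.0 - t) * EARLY[name] + t * LATE[name] for name in EARLY}


for step in (0, 30, 65, 100):
    w = curriculum_weights(step, total_steps=100)
    print(step, {name: round(value, 2) for name, value in w.items()})
```

In this toy framing, the two ablated variants loosely correspond to the extremes: keeping the early weights for the whole run (no sequence-oriented reward) or using the late weights from step 0 (process control from the start). The mixed curriculum sits between them.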

The uncomfortable part: this makes deployment harder. A hospital or vendor cannot merely buy a base model, connect ten APIs, write “be careful” in the system prompt, and call it clinical AI. The agent needs a policy that has been optimized for its actual tool environment. If the tools change, the learned policy may become suboptimal. The paper says as much. That is inconvenient, which usually means it is important.

The benchmark gains are meaningful because of what they test

RadAgent is evaluated on CT-RATE and RadChestCT. CT-RATE is used for training, validation, and in-distribution testing. The publicly released subset of RadChestCT serves as the external evaluation set. The authors evaluate report generation with macro- and micro-F1 over 18 common pathologies, extracted from generated reports with the CT-RATE text classifier.

This metric choice matters. The authors discuss why generic natural-language metrics are inadequate for CT reports. BLEU and ROUGE do not know clinical negation from literary style. Even more domain-aware metrics can behave oddly when reports differ in how many normal findings they explicitly mention. A templated report that lists normality at length can look good while missing the abnormality that matters. Medicine, as usual, refuses to be impressed by elegant nonsense.

Against CT-Chat, RadAgent improves CT-RATE test macro-F1 by 6.0 percentage points and micro-F1 by 5.4 percentage points. The paper reports these as 36.4% and 19.6% relative improvements. The gains are especially relevant because CT-Chat is not merely an unrelated baseline; it is also the model RadAgent uses for the initial report draft and some visual question answering. So the comparison is not “our system versus a straw man.” It asks whether an agentic verification loop can improve the output of the underlying 3D VLM.

The answer is yes, within the tested setup.
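The absolute and relative gains quoted above are linked by simple arithmetic: relative gain equals absolute gain divided by the baseline score. The short sketch below backs out the baseline scores those figures imply. These are back-calculations from rounded numbers, not values quoted from the paper, so treat them as approximate; `implied_baseline` is a hypothetical helper name.

```python
def implied_baseline(abs_gain_pp: float, rel_gain_pct: float) -> float:
    """Baseline score (in %) implied by: relative gain = absolute gain / baseline."""
    if rel_gain_pct <= 0:
        raise ValueError("relative gain must be positive")
    return abs_gain_pp / (rel_gain_pct / 100.0)


# (metric, absolute gain in percentage points, relative gain in %), as quoted in the text.
reported = [("macro-F1", 6.0, 36.4), ("micro-F1", 5.4, 19.6)]

for name, gain, rel in reported:
    base = implied_baseline(gain, rel)
    # Approximate, because both inputs are rounded to one decimal place.
    print(f"{name}: implied baseline ~{base:.1f}%, implied agent score ~{base + gain:.1f}%")
```

The point of the exercise is that a similar absolute gain reads as a much larger relative gain on macro-F1 because the macro baseline is lower. Neither framing is wrong, but they answer different questions.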

But the interpretation should stay precise. Macro-F1 gives more balanced weight across pathologies, so a macro-F1 gain suggests improvement is not only coming from already-common, easy labels. Micro-F1 is more affected by frequent labels, so the micro-F1 gain indicates broader aggregate improvement. External RadChestCT results support generalization beyond the internal CT-RATE setting, but they are still benchmark results, not a prospective clinical trial. The correct business conclusion is not “replace radiologists.” It is “agentic report review may reduce omissions and improve evidence-grounded drafting under clinician supervision.” Less dramatic. More useful.
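The difference between the two averages is easiest to see on a tiny example. The sketch below computes macro- and micro-F1 for multi-label predictions in plain Python. The two-label toy data and the convention of scoring a label with no true or predicted positives as 0.0 are assumptions for illustration; the paper's evaluation uses 18 pathologies and its own classifier pipeline.

```python
from typing import List, Tuple


def f1(tp: int, fp: int, fn: int) -> float:
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0   # assumed convention for empty labels


def label_counts(y_true: List[List[int]], y_pred: List[List[int]]) -> List[Tuple[int, int, int]]:
    """Per-label (tp, fp, fn) counts for binary multi-label data."""
    counts = []
    for j in range(len(y_true[0])):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
        counts.append((tp, fp, fn))
    return counts


def macro_micro_f1(y_true: List[List[int]], y_pred: List[List[int]]) -> Tuple[float, float]:
    counts = label_counts(y_true, y_pred)
    macro = sum(f1(*c) for c in counts) / len(counts)   # every label weighs the same
    tp, fp, fn = (sum(col) for col in zip(*counts))     # pool counts across labels
    micro = f1(tp, fp, fn)                              # frequent labels dominate
    return macro, micro


# Label 0 is common and always right; label 1 is rare and missed once.
y_true = [[1, 0], [1, 0], [1, 0], [1, 1]]
y_pred = [[1, 0], [1, 0], [1, 0], [1, 0]]

macro, micro = macro_micro_f1(y_true, y_pred)
print(f"macro-F1 = {macro:.3f}, micro-F1 = {micro:.3f}")
```

A single missed rare finding halves macro-F1 here while barely denting micro-F1. That asymmetry is why a macro-F1 gain is the more informative signal about less common pathologies.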

Robustness tests whether the model resists bad suggestions

The paper’s hint-injection experiment is one of the more valuable parts because it tests behavior under pressure, not just average report quality.

The authors sample 1,000 CT-RATE test cases. For each, they choose a pathology and inject either a correct or incorrect hint into the prompt. A false hint might suggest that a scan shows a finding that the ground-truth report does not support. Robustness is defined as the system preserving an originally correct prediction even after exposure to an incorrect hint.

RadAgent reaches 83.7% robustness under false hints, compared with 58.9% for CT-Chat. That is an improvement of roughly 25 percentage points.
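The robustness definition above can be written down in a few lines. This is one reading of it, assuming the denominator is the set of cases the system got right without any hint; the paper's exact formula may differ, and the function name and toy cases are hypothetical.

```python
from typing import Iterable, Tuple

# Each case: (ground truth, prediction without hint, prediction with false hint)
# for the single pathology targeted by the hint.
Case = Tuple[bool, bool, bool]


def robustness(cases: Iterable[Case]) -> float:
    """Share of originally correct predictions that stay correct under a false hint."""
    eligible = [(truth, hinted) for truth, base, hinted in cases if base == truth]
    if not eligible:
        raise ValueError("no originally correct predictions to evaluate")
    kept = sum(1 for truth, hinted in eligible if hinted == truth)
    return kept / len(eligible)


toy_cases = [
    (False, False, False),  # correct before, still correct: robust
    (False, False, True),   # correct before, flipped by the hint: not robust
    (True, True, True),     # correct before, still correct: robust
    (False, True, True),    # wrong before the hint: excluded from the metric
    (False, False, False),  # robust
]

print(robustness(toy_cases))
```

Note what the metric does not measure. A system that was wrong before the hint is simply excluded, so robustness says nothing about baseline accuracy. It has to be read alongside the F1 results, not instead of them.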

This is not a side quest. In real workflows, models will receive context: prior notes, clinician questions, patient history, noisy labels, copied text, and occasionally wrong assumptions. A system that politely follows misleading context is dangerous in the most corporate way possible: it is helpful until it is not.

RadAgent’s advantage likely comes from the tool-grounded loop. A false hint is less persuasive when the agent has to check findings against intermediate evidence. Not immune, but less gullible. That is exactly the level of ambition we should want in medical AI: not mystical truth, just measurable resistance to bad inputs.

Faithfulness is new here, but 37% is not a victory lap

The faithfulness result is both exciting and sobering.

In the paper’s setup, faithfulness asks whether the report generation process explicitly acknowledges when an injected hint influenced the system’s final judgment. CT-Chat scores 0.0%. RadAgent scores 37.0%.

That number should be read in two ways.

First, it is a real capability gap. CT-Chat may be influenced by hints, but it does not reveal that influence in its generated report. This is the classic black-box problem with nicer prose: the system produces an answer that looks evidence-based even when part of the answer was steered by the prompt. In clinical settings, that is not just an interpretability issue. It is an auditability issue.

Second, 37.0% is still low. The authors themselves note that faithfulness remains far from solved. They also conservatively treat the estimated faithfulness scores as upper bounds because hint-acknowledgement labeling relies on an LLM judge, albeit one they test for reliability against another model on a subset.

For business readers, this distinction is important. RadAgent does not solve explainability. It creates a measurable interface for improving it. That is still valuable. Most black-box systems cannot even fail transparently. RadAgent at least gives failure a shape.
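A faithfulness score of this kind can be sketched as a conditional rate: among cases where the hint changed the outcome, how often does the output admit it? The sketch below assumes that "influence" means the prediction flipped once the hint was added, and that an `acknowledged` flag is already available. In the paper that label comes from an LLM judge, which is why the authors treat their estimate as an upper bound. All names and cases here are hypothetical.

```python
from typing import Iterable, Tuple

# Each case: (prediction without hint, prediction with hint, report acknowledges the hint)
Case = Tuple[bool, bool, bool]


def faithfulness(cases: Iterable[Case]) -> float:
    """Share of hint-influenced cases in which the influence is acknowledged."""
    influenced = [ack for base, hinted, ack in cases if base != hinted]
    if not influenced:
        raise ValueError("no hint-influenced cases to evaluate")
    return sum(influenced) / len(influenced)


toy_cases = [
    (False, True, True),    # flipped by the hint and says so: faithful
    (False, True, False),   # flipped by the hint, silent about it: unfaithful
    (True, True, False),    # prediction unchanged: not counted
    (False, True, False),   # unfaithful
]

print(round(faithfulness(toy_cases), 3))
```

A system that never mentions the hint scores 0.0 on this metric no matter how good its reports look, which is the situation the article describes for CT-Chat.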

The evidence stack is stronger when each test is read for its job

The paper contains several result types, and mixing them together creates confusion. A cleaner reading is to separate what each test is trying to establish.

Evidence	Likely purpose	Business interpretation	Boundary
CT-RATE validation and test F1	Main report-generation evidence	The agentic workflow improves pathology-label agreement over the baseline VLM	F1 over extracted labels is not full clinical report quality
RadChestCT evaluation	External dataset comparison	The gain is not limited to one internal split	Still retrospective benchmark evaluation
Training-free RadAgent	Architecture-versus-training comparison	Tools and checklist help, but learned policy adds value	Does not isolate every individual tool contribution
Reward ablation	Process-control sensitivity test	Reward design shapes whether the agent uses tools coherently	Exact reward weights may not transfer to other settings
Hint-injection robustness	Robustness test under misleading context	Tool-grounded traces reduce susceptibility to false hints	Artificial hints are not the same as all real clinical noise
Faithfulness metric	Transparency behavior test	The agent can sometimes expose prompt influence where black-box VLMs do not	37.0% remains incomplete and judge-dependent

This table is also a useful template for evaluating other agent papers. Ask not only “what improved?” but “what kind of evidence is this?” Main evidence, ablation, robustness test, comparison with prior work, implementation detail, and exploratory extension should not be blended into one triumphant soup.

What hospital AI teams should take from RadAgent

The near-term business value of RadAgent is not autonomous diagnosis. That line should be printed, laminated, and placed near every healthcare AI sales deck.

The more realistic value lies in four workflows.

First, report drafting with evidence review. RadAgent can produce a draft, but the better workflow is that the clinician can inspect how the draft was assembled. This shifts the role of AI from “report author” to “structured assistant with receipts.”

Second, omission checking. The checklist-driven loop is useful because missed findings are often more operationally damaging than stylistic imperfections. A system that systematically revisits categories can act as a second-pass reviewer.

Third, targeted visual verification. Segmentation, slice extraction, windowing, and VQA tools can help surface the evidence behind a claim. The clinician should not have to trust a sentence when the system can show which intermediate tool output supported it.

Fourth, workflow analytics. The learned tool policy itself can reveal which tools are frequently used, which are redundant, and where GPU resources should be allocated. The authors note that learned policies might later be distilled into more fixed inference workflows, which could be attractive for efficiency and regulatory stability.
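The analytics idea in the fourth workflow needs very little machinery once traces are logged. The sketch below counts how often each tool is called and in what share of cases it appears. The trace format (a list of tool names per case), the tool names, and the function `tool_usage` are assumptions for illustration, not the paper's logging schema.

```python
from collections import Counter
from typing import Dict, List


def tool_usage(traces: List[List[str]]) -> Dict[str, Dict[str, float]]:
    """Per-tool call count and share of cases in which the tool was used at least once."""
    calls = Counter(tool for trace in traces for tool in trace)
    cases = Counter(tool for trace in traces for tool in set(trace))
    n = len(traces)
    return {tool: {"calls": calls[tool], "case_share": cases[tool] / n} for tool in calls}


# Hypothetical traces: one list of tool names per processed scan.
traces = [
    ["report", "vqa_3d", "segment"],
    ["report", "vqa_3d", "vqa_3d"],
    ["report", "classify"],
]

usage = tool_usage(traces)
for tool, stats in sorted(usage.items(), key=lambda kv: -kv[1]["calls"]):
    print(f"{tool:10s} calls={stats['calls']}  case_share={stats['case_share']:.2f}")
```

A tool with a low case share and a heavy GPU footprint is a candidate for disabling or for on-demand loading. A tool that appears in nearly every case is a candidate for a fixed step in a distilled workflow.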

The broader enterprise pattern extends beyond radiology. In finance, an agent might coordinate risk engines, transaction monitors, and document checks. In insurance, it might route claims through fraud scoring, policy extraction, and evidence review. In legal operations, it might combine contract parsers, clause libraries, and citation checks. But the transfer is architectural, not clinical. Nobody should pretend a chest CT agent directly validates an insurance agent. That would be the kind of analogy that looks clever until procurement asks for evidence.

The deployment boundary is real infrastructure, not vibes

RadAgent’s design is ambitious and expensive. The paper reports deployment across eight GPUs on two nodes: one node for the trained agent and another for auxiliary tools distributed across four GPUs. Some tools are computationally heavy. Some reward-computation components can be removed after training, and rarely used tools could be disabled, but this is not a lightweight clinic-room chatbot.

The system is also optimized for a specific toolbox. If the available tools change materially, the learned policy may no longer be the best policy. The authors frame rerunning the RL pipeline as a benefit of learned agents over hand-crafted workflows. That is reasonable. It is also a maintenance cost.

There are evaluation boundaries too. The metrics focus on extracted pathology labels, not every nuance of a radiology report. The external evaluation is useful, but it does not replace prospective clinical validation. The faithfulness metric is innovative, but its score remains limited and partly judge-dependent. The checklist was reviewed by a radiologist, but local guidelines and reporting norms can differ.

None of this weakens the paper. It clarifies what kind of product RadAgent points toward. The deployable object is not “AI radiologist.” It is an auditable diagnostic-assistance workflow that must be validated, localized, monitored, and kept in sync with its tool environment. Less science fiction. More engineering. Usually a good sign.

The serious AI product is the one that can show its work

RadAgent’s most important message is not that agents are magically better than VLMs. It is that medical AI needs an operating model where intermediate evidence, tool choices, and final outputs are connected.

The paper’s gains in macro-F1, micro-F1, robustness, and faithfulness are meaningful because they come from a mechanism: draft, checklist, tool use, scratchpad, reward-shaped policy, final synthesis. That mechanism is the article’s center of gravity. Without it, the numbers become just another benchmark paragraph. With it, they become a design argument.

For healthcare AI vendors, the strategic implication is uncomfortable but clear. The winning product may not be the model that writes the prettiest report in one pass. It may be the system that coordinates specialist tools, records its own evidence trail, resists misleading prompts, and gives clinicians something inspectable before they sign off.

Black boxes are convenient until someone asks where the conclusion came from. RadAgent’s answer is not perfect, but at least it has an answer.

Cognaptus: Automate the Present, Incubate the Future.

¹ Mélanie Roschewitz et al., “RadAgent: A tool-using AI agent for stepwise interpretation of chest computed tomography,” arXiv:2604.15231, version 1, submitted April 16, 2026. https://arxiv.org/abs/2604.15231
