Scan, Plan, Report: When Agentic AI Starts Thinking Like a Radiologist
Report writing is the visible part of radiology. It is also the part easiest for AI vendors to misunderstand.
A radiology report looks like text, so the naive automation pitch is obvious: give the CT scan to a vision-language model, ask for a report, and let the model type faster than a human. Congratulations, we have reinvented autocomplete with more liability.
The paper behind today’s article, Radiologist Copilot: An Agentic Framework Orchestrating Specialized Tools for Reliable Radiology Reporting, takes a more interesting path [1]. It argues that radiology reporting is not merely a text-generation task. It is a sequence of professional actions: localize the relevant anatomy, examine the region, decide what clinical features matter, select a suitable report template, compose findings and impression, then check whether the result is internally consistent and clinically formatted.
That sounds almost disappointingly practical. Which is exactly why it matters.
The paper’s contribution is not “a bigger medical VLM writes better liver reports.” The contribution is that the authors rebuild the task around a workflow. Radiologist Copilot uses an LLM as a reasoning backbone, but the backbone is not asked to hallucinate its way through a 3D CT volume in one heroic pass. Instead, it orchestrates specialized tools: a segmentator, an analyzer, a report composer, and a quality controller. The result is an agentic reporting pipeline that resembles the structure of the expert process more closely than ordinary report-generation models do.
For business readers, that is the transferable point. In high-stakes AI, the product is often not the model. The product is the workflow boundary around the model.
The mistake is treating a report as the task
Most automated radiology-reporting systems are evaluated as if the job begins with an image and ends with a paragraph. That abstraction is convenient for benchmarking, but it compresses the work until the hard parts disappear.
The paper pushes back on that compression. It identifies several steps that ordinary single-pass systems tend to blur together:
| Hidden task in radiology reporting | Why it matters operationally | What goes wrong when it is collapsed into “generate report” |
|---|---|---|
| Region localization | The model needs to focus on the relevant organ or lesion area inside a 3D volume | The system may describe generic anatomy or miss localized abnormalities |
| Region analysis planning | Different organs and lesions require different observations | The model may mention features that are irrelevant or omit clinically important ones |
| Template selection | Reports need standardized structure and terminology | Output may be fluent but inconsistent with reporting conventions |
| Report composition | Findings and impression serve different functions | The report may mix observation with diagnosis |
| Quality control | Internal consistency, spelling, terminology, and format matter | A plausible report may contain contradictions or subtle errors |
This is why a mechanism-first reading is more useful than a leaderboard-first reading. The numbers in the paper are strong, but they are not the main intellectual story. The main story is that Radiologist Copilot treats radiology reporting as a managed process rather than a single generative act.
The authors’ framing is direct: existing methods for volumetric medical images, including 3D medical VLMs and earlier agentic methods, mostly focus on isolated report generation. Radiologist Copilot instead attempts the complete workflow: image localization, detailed image analysis, template selection, report composition, and quality assurance.
That is a small wording difference with large product consequences. “Generate a report” sounds like a model capability. “Complete a reporting workflow” sounds like a system design problem.
The architecture is a toolchain, not a chatbot with a lab coat
Radiologist Copilot is built as an agentic assistant with three central execution pieces: an action planner, an action executor, and memory. Given a user query and a 3D CT image, the system decides which tool to use next, executes the action, observes the outcome, stores the trajectory, and continues until it returns a qualified report.
In the paper’s formulation, the agent samples an action based on its history, updates the environment state after executing that action, and appends the action-observation pair to memory. The math is simple, but the design intent is important: the agent is not merely producing text; it is managing a sequence of steps.
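The plan, execute, observe, store loop described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the planner here is a fixed rule standing in for the LLM, the four tools are stubs that return strings, and every name (`Memory`, `plan_next_action`, `run_agent`, `TOOLS`) is hypothetical. Only the step cap of 10 comes from the paper's reported setup.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

MAX_STEPS = 10  # step cap reported for the paper's experiments


@dataclass
class Memory:
    """Trajectory of (action, observation) pairs seen so far."""
    trajectory: List[Tuple[str, str]] = field(default_factory=list)

    def append(self, action: str, observation: str) -> None:
        self.trajectory.append((action, observation))


def plan_next_action(memory: Memory) -> str:
    """Hypothetical planner: a fixed rule standing in for the LLM's choice."""
    if not memory.trajectory:
        return "segment"
    last_action, last_obs = memory.trajectory[-1]
    if last_action == "segment":
        return "analyze"
    if last_action == "analyze":
        return "compose"
    if last_action == "compose":
        return "quality_check"
    # Last action was a quality check: stop if qualified, else recompose.
    return "finish" if last_obs == "qualified" else "compose"


# Stub tools (assumptions): each reads memory and returns an observation.
def segment(memory: Memory) -> str:
    return "organ and lesion masks"


def analyze(memory: Memory) -> str:
    return "region-specific analysis"


def compose(memory: Memory) -> str:
    # The first draft is deliberately flawed so the refinement path runs.
    had_feedback = any(act == "quality_check" for act, _ in memory.trajectory)
    if had_feedback:
        return "draft v2: findings and impression agree"
    return "draft v1: inconsistent laterality"


def quality_check(memory: Memory) -> str:
    drafts = [obs for act, obs in memory.trajectory if act == "compose"]
    if "inconsistent" in drafts[-1]:
        return "not qualified: findings and impression disagree"
    return "qualified"


TOOLS: Dict[str, Callable[[Memory], str]] = {
    "segment": segment,
    "analyze": analyze,
    "compose": compose,
    "quality_check": quality_check,
}


def run_agent(tools: Dict[str, Callable[[Memory], str]],
              max_steps: int = MAX_STEPS) -> Memory:
    """Pick an action from history, execute it, store the pair, repeat."""
    memory = Memory()
    for _ in range(max_steps):
        action = plan_next_action(memory)
        if action == "finish":
            break
        observation = tools[action](memory)
        memory.append(action, observation)
    return memory
```

Calling `run_agent(TOOLS)` walks through segmentation, analysis, a first draft, a failed quality check, a second draft, and a passing check. The useful property is that the whole trajectory sits in `memory`, so each step can be inspected afterwards.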
The tool library contains four main roles:
| Tool | Input | Output | Practical role |
|---|---|---|---|
| Segmentator Tool | CT image | Organ and lesion masks | Locate the region of interest |
| Analyzer Tool | Region image and mask | Region-specific analysis | Convert image evidence into structured observations |
| Report Composer Tool | Image, mask, analysis, optional feedback | Draft report | Select a template and write findings, impression, and key slice references |
| Quality Controller Tool | Generated report | Assessment and feedback | Check format, content, language, and consistency |
This division matters because each component narrows the ambiguity faced by the next component. Segmentation narrows where to look. Region analysis planning narrows what to examine. Template selection narrows how to write. Quality control narrows what must be corrected before the report is accepted.
A single-pass model must internalize all of those decisions implicitly. Radiologist Copilot externalizes them.
That is not just cleaner engineering. It changes what can be inspected. When the system fails, the failure can be traced to a stage: poor segmentation, weak lesion analysis, wrong template selection, report inconsistency, or inadequate quality control. In regulated or semi-regulated environments, inspectability is not a decorative feature. It is how managers sleep, occasionally.
Region analysis planning turns “look at the scan” into a checklist
The first major mechanism is region analysis planning.
After localizing the relevant organ and lesions with a pretrained segmentation model, the system extracts the region image. It then identifies region-specific analysis items: anatomical structures and clinically significant characteristics such as size, shape, density, and lesion-related features when lesions are present.
For the liver task used in the experiments, the paper says template reports are summarized into analysis items including liver surface, liver parenchyma, bile ducts, and liver lesions. These items become prompts for the 3D medical VLM.
This is one of the quiet but important moves in the system. The VLM is not simply asked, “What does this CT show?” It is directed to inspect relevant aspects of the organ.
That changes the cognitive load of the model. Instead of asking the model to infer the reporting protocol and visual task at the same time, the workflow gives it a structured observation plan. The Analyzer Tool still depends on a 3D medical VLM, but the VLM is embedded inside an anatomical and reporting logic.
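One way to picture the structured observation plan is as a small function that turns analysis items into prompts. The sketch below is illustrative only. The four liver items are the ones named above, but the prompt wording, the `build_analysis_prompts` name, and the rule that the lesion item is skipped when no lesion was segmented are all assumptions, not details taken from the paper.

```python
from typing import Dict, List

# Analysis items named for the liver task in the article.
LIVER_ITEMS: List[str] = [
    "liver surface",
    "liver parenchyma",
    "bile ducts",
    "liver lesions",
]


def build_analysis_prompts(organ: str, items: List[str],
                           has_lesion: bool) -> Dict[str, str]:
    """Map each analysis item to a prompt for a 3D medical VLM.

    Assumption: lesion-specific items are dropped when no lesion
    mask was found, so the model is not asked about absent findings.
    """
    prompts: Dict[str, str] = {}
    for item in items:
        if "lesion" in item and not has_lesion:
            continue
        # Hypothetical wording; the paper's actual prompts are not shown here.
        prompts[item] = (
            f"Describe the {item} in this {organ} CT region, "
            "noting size, shape, and density where relevant."
        )
    return prompts
```

With `has_lesion=True` this yields four prompts, and with `has_lesion=False` it yields three. The design point is that the reporting protocol lives in data (`LIVER_ITEMS`) rather than being left for the model to infer.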
For business use, this is the difference between buying a “smart model” and building a “smart operating procedure.” In complex professional work, good performance often comes less from raw reasoning and more from forcing the reasoning to happen in the right slots.
Template selection is not cosmetic standardization
The second mechanism is strategic template selection.
Templates can sound boring. That is because they are boring. Boring is underrated in medicine, finance, law, aviation, and every other domain where a creative surprise is not necessarily a gift.
Radiologist Copilot obtains candidate template reports by clustering historical radiology reports. It uses BioBERT embeddings and K-means clustering on the training reports, then derives template reports from those clusters. During reporting, the LLM selects the most relevant template based on the current analysis result. The report composer then adapts the selected template to the specific case.
This matters for two reasons.
First, templates stabilize structure. A radiology report is not a blog post. Findings and impression have different jobs. Findings describe imaging manifestations; impression summarizes diagnostic conclusions. If a generated report mixes them casually, it may still look fluent to a general language metric while becoming clinically awkward.
Second, template selection connects the generated report to historical reporting patterns. The model is not writing from a blank page. It is writing with reference to a clustered reporting precedent.
The ablation results support this point, although with one interesting wrinkle. Removing strategic template selection lowers BLEU-1 from 0.4025 to 0.2983, ROUGE-L from 0.3222 to 0.2698, METEOR from 0.4560 to 0.3915, BERTScore from 0.7024 to 0.6553, and F1-RadGraph from 0.2585 to 0.2429. However, GREEN rises slightly from 0.4379 to 0.4582.
That mixed result should not be hand-waved. It suggests that strategic template selection improves most text similarity and clinical graph metrics in this setup, especially structural and semantic alignment, but it is not uniformly superior across every metric. The authors interpret the component as providing standardized templates and structural consistency. That interpretation is plausible, but the GREEN exception is a reminder that “better report” is not one-dimensional.
For deployment, this is exactly the kind of result that product teams should care about. A template system may improve consistency while occasionally constraining or altering certain types of clinically oriented evaluation. That is not a reason to discard it. It is a reason to tune it against the operational definition of report quality.
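The clustering step described earlier in this section can be sketched with the standard library alone. This is a toy under stated assumptions. The 2D vectors stand in for BioBERT embeddings, the K-means initialization is simply the first `k` points so the result is reproducible, and choosing the report nearest each centroid as the template is an illustrative guess, since the article only says that templates are derived from the clusters. All function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def _mean(points: List[Vector]) -> List[float]:
    """Coordinate-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]


def kmeans(points: List[Vector], k: int,
           iters: int = 20) -> Tuple[List[int], List[List[float]]]:
    """Plain Lloyd-style K-means with a deterministic init (first k points)."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = _mean(members)
    return labels, centroids


def pick_templates(reports: List[str], embeddings: List[Vector],
                   k: int) -> List[str]:
    """Assumption: the report closest to each centroid serves as the template."""
    labels, centroids = kmeans(embeddings, k)
    templates = []
    for c in range(k):
        idx = [i for i, lab in enumerate(labels) if lab == c]
        if idx:
            best = min(idx, key=lambda i: math.dist(embeddings[i], centroids[c]))
            templates.append(reports[best])
    return templates


# Toy data: two well-separated groups standing in for report embeddings.
REPORTS = ["normal A", "lesion A", "normal B", "lesion B", "normal C", "lesion C"]
EMBEDDINGS = [(0.0, 0.0), (10.0, 10.0), (1.0, 0.0),
              (11.0, 10.0), (0.4, 0.0), (10.4, 10.0)]
```

Running `pick_templates(REPORTS, EMBEDDINGS, k=2)` returns one representative report per cluster. In a real pipeline the embeddings would come from a text encoder and the initialization would normally be randomized or use a k-means++ style scheme.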
Quality control is a gate, not the main engine
The third mechanism is report quality control.
The Quality Controller Tool evaluates generated reports across format, content, language, and expression. It checks whether findings are objective, whether impression summarizes key conclusions, whether anatomy and lesion characterization are consistent, and whether language uses standardized terminology with correct spelling and concise expression. If the report is not qualified, feedback is used for adaptive refinement.
In the component ablation table, removing report quality control has only a small effect on the reported metrics: BLEU-1 moves from 0.4025 to 0.3998, ROUGE-L from 0.3222 to 0.3149, METEOR from 0.4560 to 0.4462, BERTScore from 0.7024 to 0.6944, F1-RadGraph from 0.2585 to 0.2545, and GREEN from 0.4379 to 0.4360.
This does not mean quality control is useless. It means that, in this experimental pipeline, most generated reports already passed the relevant quality expectations, so quality control did not move aggregate metrics much.
The paper’s Figure 6(b) clarifies the intended role. The authors use a simulated faulty report where the findings describe a lesion in the right lobe, while the impression says left lobe. The simulated report also misspells “parenchyma.” The Quality Controller Tool marks it as not qualified and identifies both the content inconsistency and the spelling error.
That example is not the main evidence of benchmark superiority. It is closer to a functional validation of the quality-control mechanism. It shows that the tool can catch at least certain obvious report-level defects. It does not prove that it will reliably catch subtle diagnostic errors, rare disease presentations, or clinically dangerous omissions.
This distinction is important. Quality control in the paper is a safety-relevant design layer, but not yet a regulatory safety case. It is a gate in the prototype workflow, not a certificate from the gods of hospital compliance.
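To make the gate idea concrete, here is a rule-based stand-in for the two defects in the simulated report: a laterality mismatch between findings and impression, and a misspelled term. This is a sketch under assumptions, not the paper's Quality Controller Tool, whose implementation is not described in this article. The vocabulary, the sample sentences, the misspelling "parenchima", and the 0.85 similarity cutoff are all invented for illustration.

```python
import difflib
import re
from typing import List, Set

# Tiny illustrative vocabulary; a real checker would need a medical lexicon.
VOCAB = ("hepatic", "hypodense", "parenchyma")


def laterality(text: str) -> Set[str]:
    """Return the set of side words ('left', 'right') mentioned in the text."""
    return set(re.findall(r"\b(left|right)\b", text.lower()))


def check_report(findings: str, impression: str) -> List[str]:
    """Return a list of issues; an empty list means the report passes."""
    issues: List[str] = []

    # Content consistency: the impression should not name a side
    # that the findings never mention.
    f_sides, i_sides = laterality(findings), laterality(impression)
    if i_sides - f_sides:
        issues.append(
            f"laterality mismatch: findings {sorted(f_sides)}, "
            f"impression {sorted(i_sides)}"
        )

    # Spelling: flag words that are close to, but not equal to, a known term.
    for word in re.findall(r"[a-z]+", f"{findings} {impression}".lower()):
        if word in VOCAB:
            continue
        match = difflib.get_close_matches(word, VOCAB, n=1, cutoff=0.85)
        if match:
            issues.append(f"possible misspelling: '{word}' -> '{match[0]}'")
    return issues


# Hypothetical sample text modeled on the simulated faulty report.
FAULTY_FINDINGS = ("A hypodense lesion is seen in the right lobe. "
                   "Liver parenchima is otherwise homogeneous.")
FAULTY_IMPRESSION = "Hypodense lesion in the left lobe."
```

`check_report(FAULTY_FINDINGS, FAULTY_IMPRESSION)` reports both the side mismatch and the misspelling, while the corrected text passes. Checks like these only cover predictable, surface-level defects, which is exactly the limitation noted above.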
The benchmark gains are large, but read them as workflow evidence
The task-level evaluation focuses on liver radiology reports extracted from AMOS-MM. The authors use 1,149 CT scans with liver reports as the training split, mainly for report clustering, and 367 validation CT scans with liver reports for evaluation. The system itself is training-free: it uses pretrained components, including Qwen3-32B as the LLM backbone, TotalSegmentator for segmentation, and Hulu-Med as the main 3D medical VLM analyzer. The maximum number of agent steps is set to 10.
The paper evaluates reports with natural-language generation metrics and clinical-efficacy metrics:
| Metric category | Metrics used | What they roughly capture |
|---|---|---|
| Natural language generation | BLEU-1, ROUGE-L, METEOR, BERTScore | Lexical and semantic similarity to reference reports |
| Clinical efficacy | F1-RadGraph, GREEN | Clinically oriented report quality and entity-relation alignment |
Against six 3D medical VLM baselines, Radiologist Copilot performs best across all listed task-level metrics.
| Method | BLEU-1 | ROUGE-L | METEOR | BERTScore | F1-RadGraph | GREEN |
|---|---|---|---|---|---|---|
| RadFM | 0.1492 | 0.1415 | 0.2340 | 0.5541 | 0.0686 | 0.0353 |
| M3D | 0.1775 | 0.1302 | 0.1220 | 0.5359 | 0.0475 | 0.0209 |
| Merlin | 0.0015 | 0.0908 | 0.0569 | 0.5119 | 0.1617 | 0.1024 |
| CT-CHAT | 0.2440 | 0.2012 | 0.2599 | 0.6127 | 0.1196 | 0.0390 |
| Med3DVLM | 0.1967 | 0.1422 | 0.1847 | 0.5608 | 0.0660 | 0.0539 |
| Hulu-Med | 0.1867 | 0.1723 | 0.2380 | 0.5947 | 0.1209 | 0.2163 |
| Radiologist Copilot | 0.4025 | 0.3222 | 0.4560 | 0.7024 | 0.2585 | 0.4379 |
The magnitude is not subtle. Compared with Hulu-Med, the VLM used inside the Analyzer Tool, Radiologist Copilot raises BLEU-1 from 0.1867 to 0.4025, METEOR from 0.2380 to 0.4560, F1-RadGraph from 0.1209 to 0.2585, and GREEN from 0.2163 to 0.4379.
That comparison is especially important because it weakens the lazy explanation that the system wins merely because it uses a stronger visual model. In the main setup, Hulu-Med is not replaced; it is orchestrated. The same model becomes more useful when placed inside segmentation, planning, template selection, and quality-control machinery.
The authors further test this by equipping Radiologist Copilot with different VLM analyzers. The agentic wrapper improves RadFM and CT-CHAT as well:
| Base VLM setting | BLEU-1 before | BLEU-1 with Copilot | F1-RadGraph before | F1-RadGraph with Copilot | GREEN before | GREEN with Copilot |
|---|---|---|---|---|---|---|
| RadFM | 0.1492 | 0.3215 | 0.0686 | 0.1994 | 0.0353 | 0.3719 |
| CT-CHAT | 0.2440 | 0.3671 | 0.1196 | 0.2312 | 0.0390 | 0.1485 |
| Hulu-Med | 0.1867 | 0.4025 | 0.1209 | 0.2585 | 0.2163 | 0.4379 |
This table functions as a robustness-style test of the framework, not as a separate thesis about which VLM is best. The key message is that the workflow wrapper improves performance across different underlying 3D VLMs, although the size and profile of improvement vary. For example, the GREEN score with CT-CHAT inside the Copilot reaches 0.1485, better than CT-CHAT alone but still far below the Hulu-Med Copilot setup.
So the correct business reading is not “workflow replaces model quality.” It is “workflow can amplify model capability, but the underlying tool still matters.”
Very annoying for people who wanted a one-line strategy. Reality persists.
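The size of the wrapper effect is easier to compare across base VLMs when the before and after columns are turned into absolute gains and ratios. The snippet below does only that arithmetic on the numbers from the table above. The names `RESULTS` and `gain` are illustrative, and no value is introduced that is not already in the table.

```python
from typing import Dict, Tuple

# (before, with Copilot) pairs copied from the base-VLM table above.
RESULTS: Dict[str, Dict[str, Tuple[float, float]]] = {
    "RadFM": {
        "BLEU-1": (0.1492, 0.3215),
        "F1-RadGraph": (0.0686, 0.1994),
        "GREEN": (0.0353, 0.3719),
    },
    "CT-CHAT": {
        "BLEU-1": (0.2440, 0.3671),
        "F1-RadGraph": (0.1196, 0.2312),
        "GREEN": (0.0390, 0.1485),
    },
    "Hulu-Med": {
        "BLEU-1": (0.1867, 0.4025),
        "F1-RadGraph": (0.1209, 0.2585),
        "GREEN": (0.2163, 0.4379),
    },
}


def gain(before: float, after: float) -> Tuple[float, float]:
    """Return (absolute gain, after/before ratio), rounded for display."""
    return round(after - before, 4), round(after / before, 2)


def summarize() -> Dict[str, Dict[str, Tuple[float, float]]]:
    """Apply gain() to every base VLM and metric in RESULTS."""
    return {
        vlm: {metric: gain(*pair) for metric, pair in metrics.items()}
        for vlm, metrics in RESULTS.items()
    }


if __name__ == "__main__":
    for vlm, metrics in summarize().items():
        for metric, (delta, ratio) in metrics.items():
            print(f"{vlm:9s} {metric:12s} +{delta:.4f}  x{ratio:.2f}")
```

Ratios exaggerate gains from very low baselines, so the absolute column is the safer one to read when comparing setups such as RadFM and Hulu-Med on GREEN.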
The ablation table says planning and templates carry most of the measurable lift
The ablation studies are the paper’s clearest evidence for which mechanisms matter.
| Variant | Likely purpose of test | What it supports | What it does not prove |
|---|---|---|---|
| Full Radiologist Copilot | Main evidence | End-to-end workflow beats listed VLM baselines on liver-report metrics | Clinical deployment readiness |
| Without region analysis planning | Ablation | Structured region-specific analysis contributes materially to performance | That the selected analysis items are optimal |
| Without strategic template selection | Ablation | Templates help most similarity and clinical graph metrics | That template-based reporting always improves every clinical metric |
| Without report quality control | Ablation | RQC has a smaller aggregate metric effect in this setup | That RQC is unnecessary in real clinical operation |
| Different VLM analyzers | Robustness / component sensitivity | The workflow improves several base VLMs | That any weak VLM can be made clinically reliable by orchestration |
| LLM-as-judge workflow evaluation | Agent-level evaluation | The workflow appears coherent across planning and execution dimensions | Independent clinical validation by radiologists |
The strongest measurable degradation comes from removing region analysis planning and strategic template selection.
Without region analysis planning, F1-RadGraph drops from 0.2585 to 0.1675, and GREEN drops from 0.4379 to 0.2269. That is a meaningful fall in clinically oriented metrics. The interpretation is straightforward: if the system does not plan what to examine within the region, the downstream report loses clinical alignment.
Without strategic template selection, text and semantic metrics fall sharply, although GREEN rises slightly. This suggests the template mechanism contributes to structure and reference-like phrasing, but its relationship with clinical scoring is more nuanced.
Without report quality control, the aggregate metrics barely move. This should be read carefully. Quality control may be less important for average metric improvement when upstream steps already produce decent reports. But in actual operation, rare but serious mistakes matter more than average text similarity. A gate that catches contradictions may have low impact on benchmark means and high operational value. Those are different evaluation regimes.
This is a broader lesson for AI system design: the component with the largest benchmark lift is not always the component with the largest risk-management value.
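The ablation figures quoted in this article can be put on a common scale by expressing each as a percentage change from the full system. The sketch below uses only numbers stated in the text. For region analysis planning that means F1-RadGraph and GREEN, since those are the only two values quoted here. The names `FULL`, `ABLATIONS`, and `relative_change` are illustrative.

```python
from typing import Dict

# Full-system scores as reported in the article.
FULL: Dict[str, float] = {
    "BLEU-1": 0.4025, "ROUGE-L": 0.3222, "METEOR": 0.4560,
    "BERTScore": 0.7024, "F1-RadGraph": 0.2585, "GREEN": 0.4379,
}

# Ablated scores quoted in the article (region planning: two metrics only).
ABLATIONS: Dict[str, Dict[str, float]] = {
    "without region analysis planning": {
        "F1-RadGraph": 0.1675, "GREEN": 0.2269,
    },
    "without strategic template selection": {
        "BLEU-1": 0.2983, "ROUGE-L": 0.2698, "METEOR": 0.3915,
        "BERTScore": 0.6553, "F1-RadGraph": 0.2429, "GREEN": 0.4582,
    },
    "without report quality control": {
        "BLEU-1": 0.3998, "ROUGE-L": 0.3149, "METEOR": 0.4462,
        "BERTScore": 0.6944, "F1-RadGraph": 0.2545, "GREEN": 0.4360,
    },
}


def relative_change(full: float, ablated: float) -> float:
    """Percentage change of the ablated score relative to the full system."""
    return (ablated - full) / full * 100.0


def ablation_report() -> Dict[str, Dict[str, float]]:
    """Percentage change per variant and metric, rounded to two decimals."""
    return {
        variant: {
            metric: round(relative_change(FULL[metric], score), 2)
            for metric, score in scores.items()
        }
        for variant, scores in ABLATIONS.items()
    }


if __name__ == "__main__":
    for variant, changes in ablation_report().items():
        print(variant)
        for metric, pct in changes.items():
            print(f"  {metric:12s} {pct:+.2f}%")
```

On this scale, removing region analysis planning cuts GREEN by roughly 48 percent, removing template selection cuts BLEU-1 by roughly 26 percent while GREEN rises by under 5 percent, and removing quality control moves every metric by less than 3 percent.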
The agent-level evaluation is useful, but not the strongest evidence
The paper also evaluates the agentic workflow using LLM-as-a-Judge. The judge scores four dimensions from 1 to 5: Problem Analysis, Action Planning, Action Execution, and Overall Workflow. Figure 4 shows that most scores are 4 or 5, with only a small portion at 3.
This supports the authors’ claim that the workflow is coherent. It also helps demonstrate that the agent is not randomly calling tools in a confused sequence.
But this evaluation should be placed in the right category. It is agent-level process evaluation, not clinical outcome validation. The judge assesses whether the automated workflow appears reasonable, whether actions are planned and executed properly, and whether the overall process is coherent. That is valuable for debugging agentic systems. It is not a substitute for expert radiologist review, prospective clinical testing, or workflow integration studies in hospitals.
The case examples in Figure 5 are also best read as qualitative demonstrations. They show normal and abnormal workflows: segmentation, region analysis, report composition, quality control, and return of a qualified report. These examples make the mechanism concrete. They do not carry the statistical burden of the paper.
That burden rests mainly on the task-level comparison and component ablations.
The business lesson is workflow automation under professional constraints
Radiologist Copilot is a medical AI paper, but the business pattern is broader.
Many enterprise AI failures come from automating the visible artifact instead of the professional process that produces it. A contract is not just text. A loan decision is not just a score. A compliance memo is not just a paragraph. A market research report is not just an executive summary with bullet points pretending to be wisdom.
In each case, expert work has hidden stages: information retrieval, scope control, evidence selection, standards alignment, risk checking, and revision. A single model response collapses those stages. An agentic workflow can externalize them.
For business builders, Radiologist Copilot suggests a practical design framework:
| Product-design question | Radiologist Copilot example | Transferable business interpretation |
|---|---|---|
| What must be localized before reasoning begins? | Segment liver and lesions | Identify the relevant data slice before asking the model to analyze |
| What should the model inspect? | Liver surface, parenchyma, bile ducts, lesions | Convert expert heuristics into structured analysis prompts |
| What format should output follow? | Select historical report template | Use templates to impose domain conventions |
| What must be checked before delivery? | Format, content, consistency, terminology | Add quality gates for predictable failure modes |
| What should remain traceable? | Key CT slice references | Attach output to evidence, not just generated prose |
This is where the ROI pathway becomes realistic. The value is not “replace the radiologist.” The paper does not show that, and frankly the world has enough replacement fantasies wearing cheap lab coats.
The more plausible value is workflow support: reducing drafting burden, making report structure more consistent, surfacing image references, adding first-pass quality checks, and helping radiologists review a more organized candidate report. In business terms, the system targets throughput and standardization before it targets autonomy.
That distinction matters. A copilot that drafts and checks within a radiologist-controlled workflow faces a different product, regulatory, and liability path from a fully autonomous diagnostic system.
What the paper directly shows, what Cognaptus infers, and what remains uncertain
The paper directly shows that, on a liver radiology reporting task derived from AMOS-MM, Radiologist Copilot outperforms several 3D medical VLM baselines across reported NLG and clinical-efficacy metrics. It also shows through ablations that region analysis planning and strategic template selection contribute materially to performance, while report quality control has a smaller average metric effect but can catch simulated report inconsistencies.
Cognaptus infers that the larger design lesson is process-level AI orchestration. The model is valuable, but the workflow around the model is what makes the system more aligned with expert practice. This is especially relevant for business domains where outputs must follow professional conventions and where errors are not merely embarrassing but operationally costly.
What remains uncertain is equally important.
First, the evaluation is limited to liver reports extracted from AMOS-MM. The authors note that chest or abdomen reports may be feasible because the agent integrates VLMs capable of analyzing different anatomical regions, but feasibility is not the same as demonstrated performance across organs, diseases, scanners, institutions, and report styles.
Second, the metrics are useful but incomplete. BLEU, ROUGE, METEOR, and BERTScore capture textual similarity. F1-RadGraph and GREEN are more clinically oriented, but none of these replaces expert clinical review in real deployment.
Third, the agent-level evaluation uses LLM-as-a-Judge. That can evaluate process coherence, but it should not be confused with independent radiologist validation.
Fourth, quality control is shown through a simulated inconsistency example and aggregate ablation. That is promising, but the hard safety question is whether the system catches subtle and clinically meaningful failures under distribution shift.
Fifth, the system is training-free but not infrastructure-free. It depends on pretrained segmentation, VLM, and LLM components, along with report clustering, template handling, and agent execution. In a hospital environment, integration costs may sit not in model training but in data pipelines, PACS/RIS integration, audit logging, privacy controls, clinician review UI, and governance. The glamorous parts get the figure. The procurement committee gets the migraine.
The real signal: agentic AI becomes useful when it inherits the shape of work
Radiologist Copilot is best read as a workflow paper disguised as a medical AI benchmark.
Its central message is not that agents are magically better than VLMs. It is that expert work has a shape, and AI systems improve when they respect that shape. In this case, the shape is scan, localize, analyze, template, report, check. The agentic system matters because it makes those steps explicit and coordinates tools around them.
That is the direction enterprise AI should care about. The next phase of useful AI will not be won only by models that sound smarter in a blank chat box. It will be won by systems that know where expert judgment begins, where structured tools should intervene, where templates reduce variance, and where quality gates must stop bad output from becoming official output.
Radiologist Copilot does not show that radiologists are replaceable. It shows something more operationally useful: even in a domain as specialized as 3D CT reporting, performance can improve when AI stops pretending that professional work is just text generation.
The report was never the whole task. It was the final artifact of a workflow.
The machine is beginning to notice.
Cognaptus: Automate the Present, Incubate the Future.
[1] Yongrui Yu, Zhongzhen Huang, Linjie Mu, Shaoting Zhang, and Xiaofan Zhang, “Radiologist Copilot: An Agentic Framework Orchestrating Specialized Tools for Reliable Radiology Reporting,” arXiv:2512.02814v2, 2026.