The useful meeting, unfortunately, exists

Meetings are usually where productivity goes to file a complaint. But there is one kind of meeting that high-stakes work still needs: the review session where a first draft is challenged, evidence is checked, and a senior decision-maker signs off.

Radiology has long understood this. A resident may draft the report. A fellow may question the interpretation. An attending radiologist resolves the remaining uncertainty. The point is not ceremony. The point is controlled disagreement.

The arXiv paper introducing MARCH (Multi-Agent Radiology Clinical Hierarchy for CT Report Generation) takes that professional workflow seriously.[1] Instead of asking one vision-language model to look at a 3D chest CT scan and produce a report in a single pass, MARCH builds a hierarchy of agents: a Resident Agent drafts, Fellow Agents revise with retrieved evidence, and an Attending Agent coordinates consensus.

That sounds like another multi-agent paper until the mechanism is examined carefully. The interesting claim is not “more agents are better.” That would be too easy, and therefore suspicious. The paper’s more useful claim is that reliability improves when agent roles are separated into localized perception, evidence-grounded revision, and adjudicated consensus.

The distinction matters. A pile of agents is just a group chat with invoices. A hierarchy with evidence, disagreement rules, and final accountability starts to resemble an operating model.

The paper is not mainly about bigger models

The tempting reading is straightforward: MARCH outperforms prior CT report-generation systems, therefore better LLMs plus more agents solve the problem. The paper does not support that lazy version.

The authors are dealing with a particularly unforgiving task. Chest CT report generation is not ordinary image captioning with medical vocabulary sprinkled on top. The input is volumetric. Abnormalities may be sparse. Some findings appear in small regions. A fluent report can still be clinically wrong. In this context, a model that sounds authoritative is not a product feature. It is a liability wearing a lab coat.

MARCH responds by turning report generation into a staged workflow:

| Stage | Clinical analogy | AI role | Main risk addressed |
| --- | --- | --- | --- |
| Initial report drafting | Resident | Generate a first report from global and regional CT features | Missing localized findings in 3D data |
| Retrieval-augmented revision | Fellows | Compare the draft with similar prior cases and revise | Hallucination, omission, weak grounding |
| Consensus-driven finalization | Attending | Resolve disagreements through stance-based rounds | Unchecked single-agent judgment |

This is why a mechanism-first reading is more useful than a benchmark-first summary. The headline number is important, but the headline number is produced by a workflow. Strip away the workflow and the lesson becomes the usual vague advice: use better models, add retrieval, maybe add agents, pray politely. MARCH is more specific than that.

Step 1: The Resident Agent turns a CT scan into a structured first draft

The Resident Agent’s job is not to be final. That is already a useful design decision.

In MARCH, the initial draft is generated from chest CT scans using both global CT information and regional information. The paper describes a multi-region segmentation module based on SAT, partitioning the scan into ten anatomical subregions such as bone, breast, heart, lung, mediastinum, pleura, thyroid, and trachea/bronchi. The implementation uses a frozen dual-stream ViT3D backbone pre-trained on RadFM for spatial feature extraction, with LLaMA-2-Chat-7B adapted through LoRA for text generation.

The practical meaning is simple: before asking a language model to write, the system first forces the visual representation to respect anatomy. That matters because many CT findings are not globally obvious. A model looking at the whole volume as one blob may miss the small but clinically relevant detail. Regional decomposition gives the draft a better chance of noticing where the evidence lives.
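The dilution effect is easy to see in a toy example. The sketch below is not the paper’s SAT or ViT3D pipeline; it uses an invented list of labeled voxels and plain averaging (the names `voxels`, `global_pool`, and `regional_pool` are hypothetical) to show how a small finding nearly vanishes in a whole-volume summary but stays visible inside its own region.

```python
from collections import defaultdict
from statistics import mean

# Toy "scan": each voxel is (region_label, intensity). All values are invented
# for illustration; a real pipeline would use 3D arrays and learned features.
voxels = (
    [("lung", 0.0)] * 8 + [("lung", 1.0)] * 2   # small bright finding in the lung
    + [("heart", 0.0)] * 45
    + [("bone", 0.0)] * 45
)

def global_pool(voxels):
    """Average over the whole volume, ignoring anatomy."""
    return mean(value for _, value in voxels)

def regional_pool(voxels):
    """Average separately inside each anatomical region."""
    by_region = defaultdict(list)
    for region, value in voxels:
        by_region[region].append(value)
    return {region: mean(values) for region, values in by_region.items()}

global_signal = global_pool(voxels)        # finding diluted across 100 voxels
regional_signal = regional_pool(voxels)    # finding concentrated in 10 lung voxels

print(f"global: {global_signal:.2f}")
for region, signal in regional_signal.items():
    print(f"{region}: {signal:.2f}")
```

In this toy, the lung signal is ten times the global one, purely because the denominator respects anatomy. Learned features are far richer than a mean, but the motivation for regional decomposition is the same.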

But the Resident Agent is still only the first reader. The paper’s case study shows why this matters. In the first stage, the resident-style draft can describe broad normal findings and some observations. Later stages add or sharpen details such as small nonspecific pulmonary nodules, pleuroparenchymal sequelae, bronchial ectasia, and peribronchial thickening. The draft is useful because it is structured. It is not trusted because it is first.

That is a quiet but important design principle for enterprise AI: first-pass automation should often be treated as case preparation, not decision completion. The first model organizes the problem. It does not get a crown.

Step 2: Retrieval gives the fellows something better than vibes

The second stage is where MARCH becomes more than a multi-agent role-play exercise.

The retrieval stage searches the training database for clinically relevant context using three retrieval paradigms:

| Retrieval route | What it compares | Why it helps |
| --- | --- | --- |
| Image-to-image | Similar CT volumes | Finds visually similar cases |
| Image-to-text | CT image features against reports | Connects visual patterns to prior descriptions |
| Logit-based retrieval | Predicted abnormality profiles across 18 clinical abnormalities | Finds cases with similar diagnostic signatures |

Each Retrieval Agent returns its top three cases and passes them as structured evidence to a Fellow Agent. The Fellow Agent then compares the Resident Agent’s initial report with the retrieved reports, identifies discrepancies, and modifies the draft where needed.
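A minimal sketch of the three-route, top-three idea follows. It assumes each route has its own vector space and ranks cases by cosine similarity; the similarity measure, the tiny vectors, and the names `database`, `query`, and `top_k` are assumptions for illustration, not details confirmed by the paper.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, candidates, k=3):
    """Return the ids of the k candidates most similar to the query."""
    ranked = sorted(
        candidates,
        key=lambda case_id: cosine(query_vec, candidates[case_id]),
        reverse=True,
    )
    return ranked[:k]

# Hypothetical training database: one vector per case, per retrieval route.
database = {
    "image_to_image": {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0], "d": [0.6, 0.8]},
    "image_to_text": {"a": [0.2, 0.9], "b": [1.0, 0.1], "c": [0.8, 0.3], "d": [0.0, 1.0]},
    "logit_based": {"a": [0.9, 0.1, 0.0], "b": [0.1, 0.9, 0.0], "c": [0.8, 0.2, 0.1], "d": [0.7, 0.0, 0.3]},
}

# Hypothetical query scan, embedded once per route.
query = {
    "image_to_image": [1.0, 0.0],
    "image_to_text": [1.0, 0.0],
    "logit_based": [1.0, 0.0, 0.0],
}

# Each route hands its own top-three cases to a reviewer.
evidence = {route: top_k(query[route], database[route]) for route in database}
for route, cases in evidence.items():
    print(route, cases)
```

Note that the three routes do not return the same neighbors. That disagreement is the point: each reviewer starts from a different slice of precedent.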

This is the part business readers should slow down on. Retrieval is not merely an add-on library search. In MARCH, retrieval creates structured friction. It gives reviewer agents a basis for saying, in effect: “This draft missed something,” or “This phrasing looks unsupported,” or “The retrieved cases suggest a different emphasis.”

Without that evidence channel, multi-agent review can degrade into several language models confidently rephrasing the same uncertainty. A committee of hallucinations is still a hallucination. It just has minutes.

The paper’s retrieval design also helps explain why MARCH improves clinical efficacy more than surface language metrics. From Resident-only to full MARCH, BLEU-4 moves from 0.246 to 0.257 and METEOR from 0.435 to 0.456, while CE-F1 moves from 0.219 to 0.399. That pattern is plausible: retrieval and review may not radically change the style of the report, but they can change whether clinically relevant abnormalities are captured.

Step 3: The Attending Agent makes disagreement operational

The final stage is the paper’s most transferable idea.

MARCH does not simply average fellow outputs. The Attending Agent first synthesizes the revised reports and identifies possible conflicts. Then, in subsequent rounds, Fellow Agents review the current consensus and provide a stance: agreement or disagreement, confidence, reasoning, and supporting evidence. The Attending Agent decides whether further discussion is needed and updates the report accordingly.

The appendix prompt templates are revealing. Fellows are asked not only whether they agree, but also how confident they are, what they changed, what they previously missed, and which retrieved evidence supports their view. The Attending Agent is asked to continue discussion when opposition is strong or when most doctors disagree. When agreement is sufficient, the process terminates.
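The continuation rule can be written down as a small data structure plus a predicate. The sketch below is one possible reading, not the paper’s implementation: MARCH expresses the rule in a prompt, whereas the `Stance` fields, the `needs_another_round` function, and the 0.8 confidence cutoff are assumptions made here for illustration.

```python
from dataclasses import dataclass

@dataclass
class Stance:
    fellow: str
    agrees: bool
    confidence: float   # 0.0 to 1.0, self-reported by the fellow
    reasoning: str
    evidence: str       # pointer to the retrieved case backing the view

def needs_another_round(stances, strong_confidence=0.8):
    """Continue when opposition is strong or when most fellows disagree.

    The numeric threshold is an assumption; the source describes this rule
    only qualitatively.
    """
    dissent = [s for s in stances if not s.agrees]
    strong_opposition = any(s.confidence >= strong_confidence for s in dissent)
    majority_disagrees = len(dissent) > len(stances) / 2
    return strong_opposition or majority_disagrees

round_1 = [
    Stance("fellow_image", True, 0.7, "matches retrieved cases", "case_a"),
    Stance("fellow_text", False, 0.9, "nodule missing from consensus", "case_c"),
    Stance("fellow_logit", True, 0.6, "abnormality profile agrees", "case_d"),
]
# Same fellows after the consensus report has been updated.
round_2 = [Stance(s.fellow, True, s.confidence, s.reasoning, s.evidence) for s in round_1]

print(needs_another_round(round_1))  # True: one confident dissenter
print(needs_another_round(round_2))  # False: everyone agrees
```

A single confident dissenter is enough to keep the discussion open, while a lone low-confidence objection is not. Where to set that line is a policy choice, which is exactly why it belongs in explicit code or prompts rather than in someone’s head.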

This is procedural governance embedded into inference.

It also corrects a common misunderstanding of multi-agent AI. The point is not to simulate a crowd. The point is to create a decision protocol:

Draft → Evidence-grounded revisions → Consensus report → Stance checks → Adjudicated final report
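As a rough sketch, that protocol is a short loop around pluggable agents. Everything named here is hypothetical: `run_hierarchy`, its callable parameters, the `max_rounds` cap, and the stub agents are illustrations, and the stopping rule is simplified to unanimous agreement rather than the looser "sufficient agreement" the paper describes.

```python
from typing import Callable, List, Tuple

Stance = Tuple[bool, str]  # (agrees, note): a deliberately simplified stance

def run_hierarchy(
    draft: str,
    revise_fns: List[Callable[[str], str]],
    synthesize: Callable[[List[str]], str],
    stance_fns: List[Callable[[str], Stance]],
    update: Callable[[str, List[Stance]], str],
    max_rounds: int = 3,
) -> Tuple[str, int]:
    """Draft -> revisions -> consensus -> stance checks -> final report."""
    revisions = [revise(draft) for revise in revise_fns]
    consensus = synthesize(revisions)
    for round_no in range(1, max_rounds + 1):
        stances = [check(consensus) for check in stance_fns]
        if all(agrees for agrees, _ in stances):
            return consensus, round_no
        consensus = update(consensus, stances)
    return consensus, max_rounds  # cap reached without full agreement

# Stub agents standing in for model calls.
final_report, rounds_used = run_hierarchy(
    draft="lungs clear",
    revise_fns=[lambda d: d + "; small nodule", lambda d: d],
    synthesize=lambda revisions: max(revisions, key=len),
    stance_fns=[
        lambda r: ("thickening" in r, "peribronchial thickening"),
        lambda r: (True, ""),
    ],
    update=lambda report, stances: "; ".join(
        [report] + [note for agrees, note in stances if not agrees]
    ),
)
print(rounds_used, final_report)
```

With these stubs, the first stance check surfaces a missing finding, the update step folds it in, and the second round ends in agreement. Replacing each lambda with a model call changes the quality of the output but not the shape of the control flow.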

The system is therefore not only generating text. It is managing disagreement under rules. That is exactly where many enterprise AI systems are still primitive. They produce outputs, perhaps attach a confidence score, and then leave the human user to perform governance manually. MARCH suggests a more mature pattern: make review and escalation part of the architecture, not a separate afterthought in the compliance deck.

The evidence: strong gains, but read the tests by purpose

The paper evaluates MARCH on RadGenome-ChestCT, a dataset of 25,692 chest CT scans from 21,304 patients, using the official split of 24,128 training scans and 1,564 test scans. The dataset includes reports across ten anatomical regions and 18 predefined clinical abnormalities.

The headline comparison is favorable. Against listed prior systems, MARCH reports the best performance across language metrics and clinical efficacy metrics.

| Method | BLEU-4 | METEOR | ROUGE-L | CE-Precision | CE-Recall | CE-F1 |
| --- | --- | --- | --- | --- | --- | --- |
| Reg2RG | 0.249 | 0.441 | 0.367 | 0.423 | 0.181 | 0.253 |
| MARCH | 0.257 | 0.456 | 0.383 | 0.495 | 0.335 | 0.399 |

The clinical efficacy result deserves attention. CE-F1 rises from 0.253 for the strongest listed baseline, Reg2RG, to 0.399 for MARCH. That is roughly a 58% relative improvement. More importantly, the gain comes from recall rising from 0.181 to 0.335 while precision also improves from 0.423 to 0.495.

That pattern matters. In clinical report generation, higher recall means the system is capturing more of the relevant abnormalities. Higher precision means it is not merely throwing in extra findings to look diligent. Both moving upward is the result one wants. It does not make the system clinically deployable by itself, but it is a better sign than a recall-only jump bought with noisy over-reporting.
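The arithmetic behind those statements is short enough to check directly. The snippet below uses only the Reg2RG and MARCH figures from the table above; the helper names `relative_gain` and `f1` are introduced here for illustration. The recomputed F1 is a sanity check, not a reproduction: the paper’s CE-F1 may be averaged differently, so only approximate agreement should be expected.

```python
def relative_gain(old, new):
    """Relative change from old to new, as a fraction of old."""
    return (new - old) / old

def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

reg2rg = {"precision": 0.423, "recall": 0.181, "f1": 0.253}
march = {"precision": 0.495, "recall": 0.335, "f1": 0.399}

for metric in ("precision", "recall", "f1"):
    print(f"{metric}: {relative_gain(reg2rg[metric], march[metric]):+.1%}")
# precision: +17.0%
# recall: +85.1%
# f1: +57.7%

# Sanity check: F1 recomputed from the reported precision and recall
# lands within 0.001 of the reported CE-F1 for both systems.
for name, row in (("Reg2RG", reg2rg), ("MARCH", march)):
    print(name, round(f1(row["precision"], row["recall"]), 4), "reported", row["f1"])
```

Recall does most of the work (about 85% relative), precision contributes a smaller but real gain (about 17%), and the two together produce the roughly 58% lift in CE-F1.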

The ablation table is even more useful because it shows the contribution of workflow stages.

| Configuration | BLEU-4 | METEOR | CE-F1 | Likely purpose of test | What it supports |
| --- | --- | --- | --- | --- | --- |
| Resident-only | 0.246 | 0.435 | 0.219 | Baseline for first-pass generation | Drafting alone is not enough |
| Single-round single-agent review | 0.250 | 0.447 | 0.332 | Add a single reviewer pass | Evidence-guided revision helps substantially |
| Single-round multi-agent review | 0.251 | 0.454 | 0.352 | Add multiple reviewers without multi-round refinement | Diversity helps, but only partly |
| Multi-round multi-agent review | 0.255 | 0.456 | 0.362 | Add iterative discussion | Iteration adds value beyond one-pass review |
| Full MARCH | 0.257 | 0.456 | 0.399 | Full hierarchical workflow | The complete protocol gives the strongest clinical score |

The best interpretation is not “agents magically reason like doctors.” The better interpretation is that each layer attacks a different failure mode. The Resident Agent creates a localized first draft. Retrieval gives fellows external comparison points. Multi-agent revision introduces diagnostic diversity. Iteration lets disagreement surface. The Attending Agent converts that disagreement into a final report.

The ablation is not a perfect full factorial study, so it should not be overread as a precise causal decomposition of every module. But it does support the paper’s main architecture claim: the gains are not coming only from the base vision-language model.
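One way to read the ablation is as a staircase of CE-F1 increments. The snippet below simply subtracts adjacent rows of the ablation table; the variable names are introduced here for illustration, and the differences describe successive configurations as reported, not a causal attribution to individual modules.

```python
# CE-F1 by configuration, copied from the ablation table.
ablation = [
    ("Resident-only", 0.219),
    ("Single-round single-agent review", 0.332),
    ("Single-round multi-agent review", 0.352),
    ("Multi-round multi-agent review", 0.362),
    ("Full MARCH", 0.399),
]

# Increment contributed by each step relative to the row before it.
steps = [
    (name, round(score - previous, 3))
    for (_, previous), (name, score) in zip(ablation, ablation[1:])
]
total_gain = round(ablation[-1][1] - ablation[0][1], 3)

for name, delta in steps:
    print(f"{name}: +{delta:.3f} ({delta / total_gain:.0%} of total)")
print(f"total: +{total_gain:.3f}")
```

The first reviewer pass accounts for 0.113 of the 0.180 total, the move to the full protocol adds 0.037, and the two intermediate steps add 0.020 and 0.010. The largest single jump comes from adding any evidence-guided review at all.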

The LLM sensitivity result quietly embarrasses the bigger-is-always-better story

One of the most interesting tables is not the headline comparison. It is the sensitivity test across LLM backbones.

| LLM setting | BLEU-4 | METEOR | CE-F1 |
| --- | --- | --- | --- |
| Resident-only | 0.246 | 0.435 | 0.219 |
| GPT-4.1-mini | 0.255 | 0.454 | 0.393 |
| GPT-4.1 | 0.257 | 0.456 | 0.399 |
| GPT-4o | 0.255 | 0.454 | 0.392 |
| GPT-5 | 0.255 | 0.454 | 0.391 |

The paper presents this as sensitivity across LLMs. For business readers, the result is worth translating carefully: once the workflow is in place, stronger or newer LLMs do not automatically produce large gains in this experiment.

That does not mean model quality is irrelevant. It means orchestration can dominate marginal model upgrades when the task requires review, grounding, and adjudication. In plain language: a well-designed operating procedure can beat the habit of throwing a larger model at the problem and calling it strategy.
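To put a number on "dominate," compare the spread across LLM backbones with the gain from adding the workflow at all. The snippet below uses only the CE-F1 values in the sensitivity table; the variable names and the ratio are an illustration added here, not a statistic reported by the paper.

```python
resident_only = 0.219
backbones = {
    "GPT-4.1-mini": 0.393,
    "GPT-4.1": 0.399,
    "GPT-4o": 0.392,
    "GPT-5": 0.391,
}

# Difference between the best and worst backbone inside the workflow.
spread = round(max(backbones.values()) - min(backbones.values()), 3)

# Gain from the workflow even with the weakest-scoring backbone.
smallest_workflow_gain = round(min(backbones.values()) - resident_only, 3)

ratio = smallest_workflow_gain / spread
print(f"backbone spread: {spread:.3f}")
print(f"smallest workflow gain: {smallest_workflow_gain:.3f}")
print(f"ratio: {ratio:.1f}x")
```

On these figures the backbone spread is 0.008 CE-F1, while the smallest gain from adding the workflow is 0.172, roughly twenty times larger. That holds for this experiment and these four models; it is not a general law about model upgrades.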

This is a recurring pattern in enterprise AI. Firms often treat the model as the product. MARCH suggests a different framing: the model is one employee inside a process. The process determines whether the employee’s work is checked, challenged, and corrected before it reaches production.

More agents help until they become noise with a budget line

The appendix includes an agent-count sensitivity test on a 100-sample subset of the RadGenome-ChestCT test set. The authors vary the number of Fellow Agents from 1 to 20.

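| Number of Fellow Agents | BLEU-1 | BLEU-4 | METEOR | CE-F1 |
| --- | --- | --- | --- | --- |
| 1 | 0.473 | 0.253 | 0.451 | 0.323 |
| 3 | 0.470 | 0.255 | 0.456 | 0.330 |
| 5 | 0.476 | 0.257 | 0.455 | 0.335 |
| 10 | 0.473 | 0.254 | 0.455 | 0.337 |
| 20 | 0.475 | 0.255 | 0.454 | 0.327 |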

This is a robustness or sensitivity test, not the main evidence. It is also constrained by budget and run on a subset, so it should not be treated as the final law of agent scaling.

Still, it carries a practical lesson. More fellows initially help, but 20 fellows reduce CE-F1 relative to 10 and are worse than 5 on several language metrics. The authors suggest that excessive agent density can introduce redundant information or discursive noise. That is a polite academic way of saying: past a certain point, adding more reviewers makes the meeting longer, not smarter.

For the full-dataset experiments, the authors use three Fellow Agents as the default configuration to balance report fidelity and inference cost. This is exactly the kind of tradeoff product teams need to make. The optimal architecture is not the most elaborate one. It is the cheapest workflow that catches the errors that matter.
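One simple way to formalize "cheapest workflow that catches the errors that matter" is to pick the smallest agent count whose score sits within a tolerance of the best observed score. The sketch below applies that idea to the subset CE-F1 values above. The function `cheapest_within` and the tolerance values are assumptions introduced here; the paper states only that three fellows balance fidelity and cost, not how that choice was derived.

```python
# CE-F1 by number of Fellow Agents, from the 100-sample subset test.
subset_ce_f1 = {1: 0.323, 3: 0.330, 5: 0.335, 10: 0.337, 20: 0.327}

def cheapest_within(scores, tolerance):
    """Smallest agent count scoring within `tolerance` of the best score."""
    best = max(scores.values())
    eligible = [count for count, score in scores.items() if score >= best - tolerance]
    return min(eligible)

# The tolerance is a product decision, not a property of the data.
for tolerance in (0.0, 0.005, 0.008):
    print(f"tolerance {tolerance}: {cheapest_within(subset_ce_f1, tolerance)} fellows")
# tolerance 0.0: 10 fellows
# tolerance 0.005: 5 fellows
# tolerance 0.008: 3 fellows
```

Demanding the absolute best score buys ten reviewers; accepting a gap of under one CE-F1 point gets the job done with three. On a 100-sample subset these gaps are small enough that the tolerance should be chosen with the sampling noise in mind.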

What the clinical efficacy analysis adds

The paper also analyzes performance across 18 clinical abnormalities, including arterial wall calcification, coronary artery wall calcification, lung nodule, emphysema, cardiomegaly, lung opacity, consolidation, atelectasis, pleural effusion, pericardial effusion, hiatal hernia, bronchiectasis, and others.

The appendix reports that MARCH outperforms the Resident Agent baseline across these abnormalities in precision, recall, and F1. It highlights high recall for hiatal hernia, coronary artery wall calcification, and arterial wall calcification, with recall scores exceeding 0.8. It also reports notable F1 gains for complex findings such as arterial wall calcification, coronary artery wall calcification, and cardiomegaly.

This abnormality-level analysis is useful because aggregate clinical F1 can hide uneven performance. A model that performs well on common findings and poorly on subtle or rarer ones may still look acceptable in a single summary number. MARCH’s abnormality-level results suggest broader improvement, though the paper does not give a deployment-grade safety analysis by subgroup, institution, scanner type, or real-world workflow setting.

So the business interpretation should be disciplined: the result supports the value of hierarchical review on this benchmark. It does not prove clinical readiness across hospitals.

What business teams should steal from MARCH

MARCH is a radiology paper, but its deeper design pattern belongs to high-stakes enterprise AI.

Many business processes already contain the same structure: junior analyst drafts, specialist reviewers challenge, senior owner approves. Credit underwriting, legal review, cyber incident response, compliance screening, procurement risk assessment, insurance claims, and financial reporting all use some version of this division of labor.

The useful transfer is not “use medical AI for everything.” Please do not let the procurement department diagnose lungs. The transferable pattern is hierarchical assurance.

| MARCH mechanism | Business analogue | Operational value | Boundary |
| --- | --- | --- | --- |
| Resident draft | First-pass analyst agent | Fast case structuring | Should not be final in high-risk decisions |
| Retrieval agents | Evidence and precedent search | Grounds review in comparable cases | Retrieval quality controls output quality |
| Fellow agents | Specialist reviewers | Adds diversity of checks | Too many reviewers add cost and noise |
| Attending agent | Senior adjudicator or policy owner | Resolves conflicts and produces final decision | Needs clear authority and escalation rules |
| Stance-based rounds | Explicit agreement/disagreement protocol | Makes dissent visible | Requires logging and audit design |

The immediate ROI is not necessarily headcount replacement. That is the boring spreadsheet fantasy. The better ROI may come from fewer missed issues, faster review cycles, better traceability, and more consistent escalation.

For example, in credit underwriting, one agent might draft the borrower risk memo, another retrieves comparable default cases, another checks policy compliance, another examines fraud signals, and a final adjudicator agent decides whether the memo is ready for human approval. In legal contract review, the same pattern could separate clause extraction, precedent retrieval, jurisdictional review, and senior synthesis. In cybersecurity, it could separate alert triage, threat-intelligence retrieval, infrastructure context, and incident commander review.

The common lesson is this: when the cost of being wrong is high, AI should not behave like a lonely genius. It should behave like a review process.

What the paper directly shows, and what Cognaptus infers

It is worth separating evidence from interpretation.

| Layer | Claim | Status |
| --- | --- | --- |
| Direct paper result | MARCH outperforms listed baselines on RadGenome-ChestCT across language and clinical efficacy metrics | Supported by reported experiments |
| Direct paper result | Full MARCH improves CE-F1 over Resident-only and partial review variants | Supported by ablation table |
| Direct paper result | LLM backbone changes produce only marginal differences among tested MARCH variants | Supported by LLM sensitivity table |
| Direct paper result | Too many Fellow Agents may introduce redundancy or noise | Supported by subset sensitivity test, with budget/sample caveat |
| Cognaptus inference | Hierarchical review is a general enterprise AI design pattern | Reasonable but not directly proven by the medical benchmark |
| Cognaptus inference | Auditability and disagreement logging may become product value | Plausible for regulated workflows, requires implementation evidence |
| Still uncertain | Human-in-the-loop deployment effectiveness in real clinical settings | Not shown in the paper |
| Still uncertain | Generalization to other datasets, hospitals, open-source medical LLMs, and longitudinal patient histories | A limitation acknowledged by the authors |

This separation is not academic fussiness. It prevents the article from turning into the usual AI sermon: one benchmark goes up, therefore every industry must reorganize by Thursday. The paper is strong enough without that nonsense.

The boundaries are practical, not decorative

The paper’s limitations matter because they affect how the architecture should be used.

First, the evaluation mainly uses GPT-series LLMs for multi-agent reasoning. The sensitivity table is encouraging, but the paper does not establish that the same hierarchy works equally well with smaller open-source medical models, local hospital deployments, or heavily constrained inference environments.

Second, MARCH lacks long-term memory. That is a major boundary for clinical use. A CT report is often not just a single image interpretation problem; patient history, prior scans, treatment timeline, and longitudinal change all matter. A system that cannot incorporate longitudinal context remains closer to single-case reporting than full clinical reasoning.

Third, the framework emulates a clinical hierarchy but operates autonomously in the paper. There is no deployed human-in-the-loop interface where radiologists review preliminary consensus reports, correct the system, and feed lessons back into future use. For medical AI, that missing layer is not a small UX detail. It is the difference between a benchmark system and a clinical product.

Fourth, the retrieval database is a strength and a dependency. If the retrieved cases are biased, incomplete, or poorly matched, the review process may become confidently grounded in the wrong neighborhood of evidence. Retrieval does not eliminate governance needs. It moves some of them upstream into data curation, indexing, and monitoring.

Finally, multi-agent systems cost more. MARCH trains the Resident and Retrieval components on a single NVIDIA H100 GPU for about 40 hours, and inference requires multiple agent calls. The architecture is therefore best suited for workflows where the value of error reduction justifies the cost of structured review. For low-risk content generation, this would be overengineering with a stethoscope.

The real shift is from model intelligence to organizational intelligence

MARCH is useful because it changes the unit of analysis.

The usual question is: how smart is the model?

The better question for high-stakes AI is: how well is the system organized around the model’s weaknesses?

MARCH’s answer is clear. Use a resident-style agent to prepare a localized first draft. Use retrieval to ground revision in comparable evidence. Use multiple fellows to create diagnostic diversity. Use an attending-style agent to expose disagreement, evaluate confidence, and produce a final report. Then test whether each layer actually improves the outcome.

That is a more mature story than “add agents.” It is also more uncomfortable, because it means AI product quality will depend less on demo fluency and more on workflow design. The boring boxes in the architecture diagram may matter more than the glamorous model name in the sales deck.

For radiology, MARCH is a benchmarked proposal for generating CT reports with stronger clinical fidelity. For enterprise AI, it is a reminder that trustworthy automation often looks less like a chatbot and more like an institution: division of labor, evidence review, dissent, escalation, and final accountability.

Meetings, regrettably, had a point. The trick is to make the useful parts computational, and leave the pastries out of it.

Cognaptus: Automate the Present, Incubate the Future.


1. Yi Lin, Yihao Ding, Yonghui Wu, and Yifan Peng, “MARCH: Multi-Agent Radiology Clinical Hierarchy for CT Report Generation,” arXiv:2604.16175, 2026. https://arxiv.org/html/2604.16175