When Medical AI Stops Guessing and Starts Asking

Slides are easy to admire and hard to interrogate.

That is the unpleasant little problem behind medical AI. A pathology image can look like a rich source of clinical intelligence, and a large multimodal model can produce fluent comments about what it sees. But fluent comments are not the same thing as medical insight. A model can describe tissue architecture, mention invasion risk, add a treatment-sounding phrase, and still fail at the actual analytical task: asking the right question, finding the relevant evidence, connecting it to a clinically meaningful conclusion, and knowing when it has not seen enough.

The paper behind this article, MedInsightBench: Evaluating Medical Analytics Agents Through Multi-Step Insight Discovery in Multimodal Medical Data, targets exactly that gap.¹ Its contribution is not simply another medical benchmark where models answer questions about images. Thankfully. We already have enough leaderboards where models compete to sound more certain than the evidence deserves.

The useful idea is more mechanical: medical insight discovery should be evaluated as a workflow, not as a one-shot answer. The paper builds a pathology-image benchmark, defines metrics that punish both missing insights and producing unsupported ones, and proposes an agent system that decomposes the work into visual question generation, targeted image analysis, domain retrieval, and follow-up questioning.

That framing matters because the central business lesson is not “buy a bigger model.” It is closer to: if your medical AI product jumps directly from image to conclusion, it is probably skipping the most important part of analysis.

The real task is not diagnosis; it is guided discovery

Most medical AI discussions still orbit diagnosis. Show the model an image. Ask what disease is present. Compare the answer to a label. Score the model. Repeat until someone writes a press release.

MedInsightBench is aimed at a different task. It asks whether a model or agent can discover useful medical insights from multimodal pathology data. In the paper’s setup, each case contains a cancer pathology image, an analytical goal, and a set of question-linked insights. The model is not merely expected to identify a disease label. It is expected to generate insights relevant to staging, margins, invasion, prognosis, treatment planning, data reliability, or further exploration.

That distinction changes the difficulty. A diagnosis-style task can sometimes be solved by recognizing visual patterns. Insight discovery requires a sequence:

1. understand the analytical goal;
2. decide which questions matter;
3. inspect the image or linked evidence for those questions;
4. distinguish observed facts from inference;
5. produce a conclusion that is useful, grounded, and not redundant;
6. continue asking follow-up questions when the first answer is not enough.

This is the difference between a model that says, “This may indicate aggressive tumor behavior,” and a system that asks, “Is lymphovascular invasion present, is perineural invasion extensive, are margins involved, and how do these findings affect recurrence risk or adjuvant therapy discussion?”

The first sounds clinical. The second behaves more like analysis.

MedInsightBench turns pathology cases into question-insight chains

The benchmark itself contains 332 curated medical cases from The Cancer Genome Atlas, with 3,933 annotated insights across six insight categories. Each sample includes a pathology image, a specific analytical goal, and multiple insights. Each insight is linked to a question and supporting evidence from the original pathology report.

That construction choice is important. A plain image-caption benchmark would mostly test whether a model can describe visible features. A question-insight benchmark tests whether the model can organize analysis around clinically relevant inquiry.

The paper’s data pipeline has three major steps.

First, whole-slide images are processed into usable pathology images. The original TCGA slides are stored as SVS whole-slide image files. The authors downsample them, select appropriate image levels, apply color normalization, export them as PNG images, and filter unusable images through automated checks plus manual review.

Second, pathology reports are processed and converted into structured insight material. Reports are retrieved, OCR is applied, and the text is checked for quality. Then LLM-assisted steps decompose report evidence, generate insights, create analytical questions, and filter lower-confidence outputs. Human verification is used to check logical consistency, factual accuracy, and rationality.

Third, the paper generates an overarching analytical goal from the question set. This matters because the benchmark is not asking models to free-associate over a slide. It gives them a target: correlate histopathologic features with staging parameters, treatment implications, margin status, nodal metastasis, HPV status, or similar clinical concerns.

The six insight types are also worth noticing:

Insight type	What it tries to capture	Why it matters operationally
Descriptive	What the case states or shows	Basic extraction and reporting
Diagnostic	What disease or pathology is indicated	Clinical classification
Predictive	What future risk or outcome may follow	Prognosis and planning
Prescriptive	What action may be recommended	Decision support, with caution
Evaluative	Whether data, intervention, or analysis is reliable	Quality control
Exploratory	What unexpected pattern deserves investigation	Research and discovery
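
To make the case structure concrete, here is a minimal sketch of how one benchmark record could be represented. The class and field names are hypothetical, chosen for illustration rather than taken from the paper's released data format, and the example record is invented.

```python
from dataclasses import dataclass, field
from enum import Enum


class InsightType(Enum):
    """The six insight categories listed in the table above."""
    DESCRIPTIVE = "descriptive"
    DIAGNOSTIC = "diagnostic"
    PREDICTIVE = "predictive"
    PRESCRIPTIVE = "prescriptive"
    EVALUATIVE = "evaluative"
    EXPLORATORY = "exploratory"


@dataclass
class Insight:
    question: str      # the analytical question this insight answers
    text: str          # the insight itself
    evidence: str      # supporting snippet from the pathology report
    kind: InsightType  # which of the six categories it belongs to


@dataclass
class Case:
    image_path: str    # hypothetical path to the exported PNG
    goal: str          # overarching analytical goal for the case
    insights: list[Insight] = field(default_factory=list)

    def by_type(self, kind: InsightType) -> list[Insight]:
        """Return only the insights of one category."""
        return [i for i in self.insights if i.kind is kind]


# Invented example record, for illustration only.
case = Case(
    image_path="case_0001.png",
    goal="Correlate histopathologic features with margin status.",
    insights=[
        Insight(
            question="Are surgical margins involved?",
            text="Margin involvement raises the question of residual disease.",
            evidence="(report snippet would go here)",
            kind=InsightType.PREDICTIVE,
        )
    ],
)
```

Keeping the insight type as an explicit field is the useful part: a product team can then filter or gate output by category, for example routing anything labeled prescriptive to stricter review.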

This taxonomy is not perfect, and the boundaries between predictive, prescriptive, and exploratory insights can be blurry. But the business value is that it forces a product team to ask a better question: what kind of “insight” are we claiming the system can generate?

A medical AI dashboard that produces descriptive summaries has a very different risk profile from one that proposes treatment implications. Pretending those are the same thing is a convenient way to make a demo look impressive and a deployment look reckless.

The benchmark’s metrics punish both silence and overconfidence

The paper’s evaluation framework is one of its more useful pieces. It does not rely only on whether a generated output overlaps with annotated ground truth. Instead, it evaluates four dimensions: recall, precision, F1, and novelty.

Recall measures whether the system recovers the ground-truth insights. Precision measures whether the generated insights are actually supported rather than irrelevant or hallucinated. F1 balances the two. Novelty, which the results report as an Innovation score, attempts to measure whether the system finds potentially correct insights not already captured in the annotated set.

This is a better fit for insight discovery than ordinary answer accuracy. In a medical workflow, two failure modes matter at the same time.

The first failure mode is omission. The system misses a clinically important finding, such as lymphovascular invasion or margin involvement. The second is invention. The system produces a plausible-sounding insight that is not supported by the image or linked evidence. In ordinary AI demos, invention often looks like initiative. In medicine, it is usually called a problem.

MedInsightBench tries to score both sides. It uses ROUGE-1 and G-Eval for recall and precision. For G-Eval, the paper averages scores from GPT-3.5-Turbo and Gemini 2.5 Pro, then normalizes scores for comparison with ROUGE-1. The authors also sample 100 data points for scoring by ten human experts. In the appendix, they report annotation reliability checks, including an average ICC of about 0.82, Krippendorff’s alpha around 0.84, and Pearson correlation around 0.76.
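
A rough sketch of the ROUGE-1 side of that scoring may help. It uses plain unigram overlap and a best-match aggregation: each ground-truth insight is credited with its best-matching generated insight for recall, and each generated insight with its best-matching ground-truth insight for precision. The aggregation rule, the crude tokenizer, and all function names are assumptions made for illustration; the paper's exact procedure may differ.

```python
import re
from collections import Counter


def tokens(text: str) -> Counter:
    """Lowercased word counts; a deliberately crude tokenizer."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def overlap(a: str, b: str) -> int:
    """Number of unigrams shared by two strings (clipped counts)."""
    return sum((tokens(a) & tokens(b)).values())


def rouge1_recall(reference: str, candidate: str) -> float:
    total = sum(tokens(reference).values())
    return overlap(reference, candidate) / total if total else 0.0


def rouge1_precision(reference: str, candidate: str) -> float:
    total = sum(tokens(candidate).values())
    return overlap(reference, candidate) / total if total else 0.0


def score(ground_truth: list[str], generated: list[str]) -> dict[str, float]:
    """Best-match recall, precision, and F1 (aggregation is an assumption)."""
    recall = (
        sum(max((rouge1_recall(gt, g) for g in generated), default=0.0)
            for gt in ground_truth) / len(ground_truth)
        if ground_truth else 0.0
    )
    precision = (
        sum(max((rouge1_precision(gt, g) for gt in ground_truth), default=0.0)
            for g in generated) / len(generated)
        if generated else 0.0
    )
    total = recall + precision
    f1 = 2 * recall * precision / total if total else 0.0
    return {"recall": recall, "precision": precision, "f1": f1}


# Invented example: one finding recovered, one unsupported extra.
result = score(
    ground_truth=["Perineural invasion is present."],
    generated=["Perineural invasion is present.", "Margins are clear."],
)
```

In this toy example recall is 1.0 because the annotated finding is recovered, while precision is 0.5 because the second generated insight matches nothing. That is the two-sided penalty in miniature: omission hurts recall, invention hurts precision.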

The novelty metric is more delicate. The paper classifies generated insights that do not match ground truth and asks multiple LMM evaluators whether they may still be correct. An insight is treated as potentially novel when at least two models judge it correct. This is sensible as an evaluation design, but it should not be confused with clinical discovery. “Potentially novel according to model judges” is not the same thing as “clinically validated new finding.” A tedious distinction, yes. Also the entire point.
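
The voting rule is simple enough to sketch. The code below assumes each unmatched insight carries a list of boolean judge verdicts and applies the "at least two judges" threshold described above. The function names and data shape are hypothetical, and the sketch deliberately stops at flagging candidates; it does not attempt to reproduce how the paper turns flags into an Innovation score.

```python
def is_potentially_novel(
    matches_ground_truth: bool,
    judge_votes: list[bool],
    min_votes: int = 2,
) -> bool:
    """Flag an insight as potentially novel.

    Only insights that do NOT match ground truth are eligible, and at
    least `min_votes` model judges must consider the insight correct.
    """
    if matches_ground_truth:
        return False
    return sum(judge_votes) >= min_votes


def novelty_candidates(insights: list[dict]) -> list[str]:
    """Return the texts of insights that should go to human review."""
    return [
        item["text"]
        for item in insights
        if is_potentially_novel(item["matched"], item["votes"])
    ]


# Invented examples, for illustration only.
examples = [
    {"text": "A", "matched": True, "votes": [True, True, True]},
    {"text": "B", "matched": False, "votes": [True, True, False]},
    {"text": "C", "matched": False, "votes": [True, False, False]},
]
candidates = novelty_candidates(examples)
```

Naming the output "candidates" rather than "discoveries" is the design choice that matters here. The threshold produces a review queue, not a finding.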

Direct multimodal models are still weak at deep medical insight discovery

The main experiment evaluates several LMM-only baselines and agent frameworks. The direct baselines include GPT-4o, GPT-5, Deepseek-VL2, Qwen2.5-VL-32B-Instruct, and InternVL3-38B. The agent baselines include ReAct with GPT-4o and the proposed MedInsightAgent with GPT-4o or Qwen2.5-VL.

The results are not a victory lap for current LMMs.

System	G-Eval recall	G-Eval precision	G-Eval F1	Innovation
GPT-4o	0.298	0.358	0.325	0.209
GPT-5	0.305	0.365	0.332	0.213
Deepseek-VL2	0.323	0.407	0.360	0.271
Qwen2.5-VL-32B-Instruct	0.398	0.485	0.437	0.417
InternVL3-38B	0.339	0.399	0.367	0.255
ReAct (GPT-4o)	0.302	0.371	0.332	0.224
MedInsightAgent (GPT-4o)	0.361	0.413	0.384	0.270
MedInsightAgent (Qwen2.5-VL)	0.451	0.546	0.494	0.478

The strongest result comes from MedInsightAgent using Qwen2.5-VL as its backbone, with a G-Eval F1 of 0.494 and an Innovation score of 0.478. That is clearly better than the direct Qwen2.5-VL baseline, which scores 0.437 on G-Eval F1 and 0.417 on Innovation.
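
As a sanity check on the table, the reported F1 values can be compared against the harmonic mean of the reported recall and precision. The sketch below uses only the numbers from the table above. Treating F1 as the harmonic mean of the aggregate recall and precision is an assumption; if the paper averages F1 per case, small gaps are expected.

```python
# (recall, precision, reported F1) copied from the G-Eval table above.
ROWS = {
    "GPT-4o": (0.298, 0.358, 0.325),
    "GPT-5": (0.305, 0.365, 0.332),
    "Deepseek-VL2": (0.323, 0.407, 0.360),
    "Qwen2.5-VL-32B-Instruct": (0.398, 0.485, 0.437),
    "InternVL3-38B": (0.339, 0.399, 0.367),
    "ReAct (GPT-4o)": (0.302, 0.371, 0.332),
    "MedInsightAgent (GPT-4o)": (0.361, 0.413, 0.384),
    "MedInsightAgent (Qwen2.5-VL)": (0.451, 0.546, 0.494),
}


def harmonic_f1(recall: float, precision: float) -> float:
    """Standard F1: harmonic mean of recall and precision."""
    total = recall + precision
    return 2 * recall * precision / total if total else 0.0


# Absolute gap between recomputed and reported F1 for each system.
gaps = {
    name: abs(harmonic_f1(r, p) - reported)
    for name, (r, p, reported) in ROWS.items()
}
largest_gap = max(gaps.values())

for name, gap in gaps.items():
    print(f"{name}: recomputed vs reported F1 differs by {gap:.4f}")
```

Every row agrees to within about 0.002, so the table is internally consistent. It is a small check, but the kind a buyer can run on any vendor scorecard in a few minutes.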

But the absolute level should sober the reader. Even the best system is not close to perfect. This is not “medical AI has solved pathology insight discovery.” It is more like “structured orchestration helps, and the task remains hard.” Less glamorous. More useful.

A second pattern is also important: precision tends to be higher than recall. The paper interprets this as models producing more conservative, well-supported insights while still missing breadth and depth. In practical terms, the systems are better at making statements the evidence supports than at covering every relevant finding. That is not a small flaw. In clinical analytics, missing the relevant question can be just as damaging as answering one badly.

The mechanism: ask first, inspect second, infer third

The proposed MedInsightAgent has three main modules.

The Visual Root Finder generates initial root questions. It first summarizes the image and extracts keywords, then uses web retrieval to gather relevant domain context. The goal is not merely to describe the slide. The goal is to create better questions for downstream analysis.

The Analytical Insight Agent answers those questions. It uses a pathology-focused image-analysis tool, PathGen-LLaVA, to extract question-specific image evidence. Then it generates answers and insights grounded in that evidence.

The Follow-Up Question Composer generates deeper or complementary questions after the initial round. It selects follow-up questions and sends them back into the analysis loop. In the paper’s experiments, MedInsightAgent runs four rounds of iterations with three new questions generated in each round.

This workflow is the heart of the paper. It says, in effect, that the model should not be trusted to jump from image to insight in one poetic leap. The analysis should be staged.

A simplified version looks like this:

Goal + pathology image
        ↓
Visual summary + retrieved domain context
        ↓
Root questions
        ↓
Question-specific image evidence
        ↓
Answers and insights
        ↓
Follow-up questions
        ↓
More targeted evidence and deeper insights

That structure is not merely an engineering preference. It is a risk-control mechanism.

A direct model has to decide what to look for and what to conclude at the same time. MedInsightAgent separates those functions. Question generation becomes inspectable. Evidence extraction becomes targeted. Follow-up questions become a way to expand coverage. The system can still be wrong, of course. It is AI, not a saint with a microscope. But the workflow creates more points where errors can be detected, constrained, or audited.
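
The staged loop can be sketched in a few lines. Everything here is a stand-in: the four callables are hypothetical placeholders for the Visual Root Finder, the image-analysis tool, the answer step, and the Follow-Up Question Composer, and the demo stubs return canned strings instead of calling any model. Counting the root-question pass as the first of four rounds is also an assumption about how the paper counts iterations.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Step:
    round_index: int
    question: str
    evidence: str
    insight: str


def run_insight_loop(
    goal: str,
    image: str,
    root_finder: Callable[[str, str], list[str]],
    analyze: Callable[[str, str], str],
    answer: Callable[[str, str, str], str],
    compose_followups: Callable[[str, list[Step]], list[str]],
    rounds: int = 4,
    questions_per_round: int = 3,
) -> list[Step]:
    """Ask first, inspect second, infer third; keep an auditable trace."""
    trace: list[Step] = []
    asked: set[str] = set()
    questions = root_finder(goal, image)
    for r in range(rounds):
        for q in questions:
            if q in asked:  # never re-ask the same question
                continue
            asked.add(q)
            evidence = analyze(image, q)        # question-specific evidence
            insight = answer(goal, q, evidence) # grounded in that evidence
            trace.append(Step(r, q, evidence, insight))
        if r < rounds - 1:
            # Cap the number of new questions carried into the next round.
            questions = compose_followups(goal, trace)[:questions_per_round]
    return trace


# Demo stubs (canned output, no model calls).
def demo_root_finder(goal: str, image: str) -> list[str]:
    return ["Is perineural invasion present?", "Are margins involved?"]


def demo_analyze(image: str, question: str) -> str:
    return f"evidence for: {question}"


def demo_answer(goal: str, question: str, evidence: str) -> str:
    return f"insight from {evidence}"


def demo_followups(goal: str, trace: list[Step]) -> list[str]:
    return [f"follow-up {len(trace)}.{i}" for i in range(5)]


trace = run_insight_loop(
    "Assess recurrence risk.", "case_0001.png",
    demo_root_finder, demo_analyze, demo_answer, demo_followups,
)
```

The return value is the point. Because every question, evidence string, and insight lands in the trace with its round number, a reviewer can inspect where a wrong conclusion entered the chain instead of staring at a final paragraph.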

For business teams, this is the most portable idea in the paper. The mechanism is not limited to pathology. Many enterprise AI failures have the same structure: the system produces conclusions before it has generated the right questions.

The ablation study shows orchestration is doing real work

The paper’s ablation study is not a second thesis. Its likely purpose is to test whether each module in MedInsightAgent contributes to performance, rather than merely decorating the architecture diagram.

Using GPT-4o as the backbone and G-Eval-based scores, the full MedInsightAgent reaches 0.361 recall, 0.413 precision, 0.384 F1, and 0.270 Innovation. Removing modules reduces performance.

Test	Likely purpose	What it supports	What it does not prove
Remove Image-Summarization Module	Ablation	Image-level summarization helps root-question generation	Summaries alone are sufficient
Remove Web-Retrieval Module	Ablation	External domain context improves novelty, as measured by the Innovation score	Retrieved material is always clinically reliable
Remove Image-Analysis Tool	Ablation	Question-specific visual evidence is central to recall, precision, and F1	The tool is clinically validated for deployment
Remove Follow-Up Question Composer	Ablation	Iterative questioning improves depth and novelty	More rounds always improve results

The most damaging removals are informative. Without the Image-Analysis Tool, G-Eval F1 falls from 0.384 to 0.353. Without the Follow-Up Question Composer, it falls to 0.338. Without the Web-Retrieval Module, Innovation drops from 0.270 to 0.239.
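
Putting those drops side by side makes the comparison easier. The sketch below uses only the figures quoted in the paragraph above and reports each drop in absolute and relative terms; the dictionary keys are informal labels, not names from the paper.

```python
# Full MedInsightAgent (GPT-4o backbone), G-Eval-based scores from the text.
FULL = {"f1": 0.384, "innovation": 0.270}

# Ablation label -> (metric affected, score after removal).
ABLATIONS = {
    "no image-analysis tool": ("f1", 0.353),
    "no follow-up question composer": ("f1", 0.338),
    "no web retrieval": ("innovation", 0.239),
}


def drop(metric: str, ablated: float) -> tuple[float, float]:
    """Return (absolute drop, relative drop) versus the full system."""
    base = FULL[metric]
    absolute = round(base - ablated, 3)
    relative = round((base - ablated) / base, 3)
    return absolute, relative


for label, (metric, value) in ABLATIONS.items():
    absolute, relative = drop(metric, value)
    print(f"{label}: {metric} -{absolute:.3f} ({relative:.1%} relative)")
```

Removing the Follow-Up Question Composer costs about 0.046 F1, roughly 12 percent relative, which is larger than the 0.031 lost by removing the Image-Analysis Tool. Removing web retrieval costs 0.031 on Innovation, around 11.5 percent relative.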

That tells us the performance gain is not just “use more tools.” The appendix adds a useful comparison: when ReAct is given the same tools, it improves from 0.332 to 0.371 G-Eval F1, but still trails MedInsightAgent’s 0.384. The gap is not huge, but it supports the authors’ claim that scheduling and orchestration matter.

This is a nice small correction to a common agentic-AI misconception. Tools are not architecture. A web search tool, an image model, and a prompt chain do not automatically become a useful analyst. The order of operations matters. The selection mechanism matters. The question loop matters. There is no business advantage in having an “agent” that simply owns more buttons to press.

The case study shows why contradiction is not a cosmetic defect

The paper’s case study compares ground-truth insights, GPT-4o outputs, and MedInsightAgent outputs across cases of varying difficulty. The examples are useful because they show a failure mode that aggregate metrics can make too abstract.

In one prostate cancer case, the ground truth highlights lymphovascular invasion and extensive perineural invasion as signals of increased metastatic potential. GPT-4o says there is an absence of perineural invasion in visible sections, while MedInsightAgent identifies perineural invasion as suggesting more aggressive tumor behavior and possible recurrence risk.

In another case, GPT-4o says no definitive lymphovascular invasion is observed, while the ground truth states that angiolymphatic invasion is present. MedInsightAgent better captures the treatment-relevant direction of the finding, though its output can still be somewhat generic.

The important lesson is not simply that MedInsightAgent is better. The important lesson is that medical insight systems can fail in ways that sound calm and professional.

A contradiction about invasion status is not a wording issue. It changes downstream interpretation. It affects staging, prognosis, treatment discussion, and review priority. When the paper says GPT-4o outputs can show internal contradictions, incorrect judgments, and omissions, that is not a minor benchmark artifact. It is exactly the kind of failure product teams need to surface before deployment.

What the paper directly shows, and what business readers may infer

For healthcare analytics vendors, pathology-AI teams, and hospital innovation groups, the paper’s practical value lies in product design and procurement evaluation. It does not prove that MedInsightAgent is ready for clinical use. It does show how to think more clearly about systems that claim to generate medical insights.

Paper result	Directly shown	Business inference	Boundary
MedInsightBench contains 332 cases and 3,933 insights	A benchmark can structure pathology-image analysis around goals, questions, evidence, and insight types	Procurement tests should ask vendors to show question-linked insight discovery, not only final answers	TCGA-based benchmark; not a deployment trial
LMM-only systems score modestly	Direct image-to-insight generation remains weak	One-shot multimodal chat should not be treated as a reliable medical analytics layer	Results depend on selected models and evaluation protocol
MedInsightAgent improves over baselines	Agent orchestration can improve recall, precision, F1, and novelty	Product architecture should decompose analysis into question generation, evidence extraction, retrieval, and follow-up	Improvement is meaningful but not sufficient for clinical validation
Ablations reduce performance	Individual modules contribute to performance	Teams should test modules separately instead of celebrating a black-box pipeline	Ablation does not prove optimal design
Novelty improves but may be overestimated	Automated novelty scoring needs human checking	New “insights” need verification workflows before being used	Model-judged novelty is not medical discovery

This separation is necessary. Otherwise, the paper will be misread as a familiar story: agent beats base model, therefore agentic medical AI is the future. That version is convenient, dramatic, and slightly lazy.

The better interpretation is narrower and more valuable. Medical analytics systems need to be evaluated on process quality. Did they ask the right questions? Did they gather the right evidence? Did they distinguish descriptive findings from predictive or prescriptive implications? Did they miss important findings? Did they invent unsupported ones? Did they produce anything novel, and was that novelty checked?

Those are not abstract research questions. They are product requirements.

The procurement lesson: stop buying answers, start testing inquiry

A hospital or healthcare analytics buyer should not evaluate a medical AI system only by reading its final paragraph. The paragraph is the easiest part to polish.

A more serious evaluation would request the intermediate artifacts:

Artifact to request	What it reveals
Generated root questions	Whether the system understands the analytical goal
Image-derived findings	Whether visual claims are grounded
Evidence links or report snippets	Whether conclusions are traceable
Follow-up questions	Whether the system can expand analysis beyond first-pass obvious findings
Insight type labels	Whether it separates description, prediction, prescription, and exploration
Precision and recall analysis	Whether it misses important findings or produces unsupported ones
Novelty review	Whether “new insights” are useful or just creative noise
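
A buyer could turn that table into a mechanical completeness check before any human reads the output. The sketch below assumes the vendor delivers a dictionary-like record per case; the key names are hypothetical and simply mirror the rows of the table above.

```python
# Hypothetical key names, one per row of the artifact table above.
REQUIRED_ARTIFACTS = (
    "root_questions",
    "image_findings",
    "evidence_links",
    "followup_questions",
    "insight_type_labels",
    "precision_recall_analysis",
    "novelty_review",
)


def missing_artifacts(vendor_output: dict) -> list[str]:
    """List required artifacts that are absent or empty in a vendor record."""
    return [name for name in REQUIRED_ARTIFACTS if not vendor_output.get(name)]


# Invented example: a polished final answer with almost no supporting trail.
submission = {
    "final_answer": "Findings suggest aggressive tumor behavior.",
    "root_questions": ["Is lymphovascular invasion present?"],
    "image_findings": [],
}
gaps_in_submission = missing_artifacts(submission)
```

An empty list counts as missing on purpose. A system that returns the field but nothing in it has not shown its work either.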

This is where the paper quietly becomes relevant beyond pathology. In financial analysis, supply-chain diagnostics, legal document review, and enterprise operations, the same pattern appears. The model’s final answer is less important than the question path it used to get there. A system that cannot expose that path is difficult to audit. A system that does not have that path is worse.

For vendors, the design implication is equally direct: build the workflow as a chain of inspectable analytical steps. The point is not to create theatrical multi-agent roleplay. Nobody needs a “Pathology Detective Agent” wearing a little digital trench coat. The point is to separate cognitive functions so that each one can be tested.

The boundary: this is not clinical validation

The limitations are not decorative here. They affect how the result should be used.

First, MedInsightBench is built from public TCGA cancer pathology resources. That is valuable for research, but it is not equivalent to prospective clinical deployment across hospitals, scanners, staining protocols, patient populations, and real operational constraints.

Second, the benchmark construction uses LLM assistance combined with human verification. The paper does include quality checks, manual review, and annotation reliability analysis. Still, the ground truth is partly mediated through reports and generated question-insight structures. That makes it a strong benchmark design, not a substitute for direct clinical validation.

Third, the novelty metric is useful but fragile. The appendix itself notes that automated novelty scores can show misjudgment or overestimation. Human-assessed novelty remains lower than the automatic Innovation score in the reported examples, though it still shows improvement. That means novelty should be treated as a screening signal, not a discovery certificate.

Fourth, the outputs remain imperfect. The case study says MedInsightAgent produces more accurate and grounded insights than GPT-4o in selected examples, but some outputs remain overly conceptual. In medical work, “overly conceptual” is not harmless. A vague insight may sound appropriate while failing to support a concrete decision.

Finally, agent orchestration improves performance but does not remove dependence on the base model and tools. The strongest results come from MedInsightAgent with Qwen2.5-VL, while GPT-4o-based MedInsightAgent is clearly lower. Architecture helps, but the underlying model still matters. So does the pathology-specific image-analysis tool. The boring infrastructure still gets a vote.

The deeper message: medical AI needs epistemic choreography

The most interesting part of MedInsightBench is not that it adds another benchmark to the medical AI shelf. The shelf is already crowded and probably needs better ventilation.

The interesting part is that it treats insight discovery as epistemic choreography. The system must move through question, evidence, answer, insight, and follow-up. If it skips steps, it may still sound smart. That is the danger.

For business readers, this paper offers a useful mental model. The next generation of medical AI products should not be judged by how confidently they speak. They should be judged by how well they ask, inspect, justify, and revise.

That is less magical than the usual AI story. It is also closer to how serious work gets done.

Medical AI stops guessing when it stops treating the image as a prompt and starts treating it as an object of investigation. MedInsightBench gives researchers and product teams a way to test that shift. MedInsightAgent shows that structured questioning can improve performance. The remaining gap reminds us that asking better questions is necessary, not sufficient.

Which is annoying. Also true.

Cognaptus: Automate the Present, Incubate the Future.

¹ Zhenghao Zhu, Chuxue Cao, Sirui Han, Yuanfeng Song, Xing Chen, Caleb Chen Cao, and Yike Guo, “MedInsightBench: Evaluating Medical Analytics Agents Through Multi-Step Insight Discovery in Multimodal Medical Data,” arXiv:2512.13297, submitted December 15, 2025, https://arxiv.org/abs/2512.13297.
