Hospital data does not politely arrive as a paragraph.

It arrives as an ECG trace, an ultrasound video, a CMR sequence, a physician report, a half-remembered prior diagnosis, and a clinician trying to decide what matters before the next patient enters the room. The popular fantasy of medical AI is that a general model will simply “look at everything” and reason like a specialist. Nice fantasy. Very convenient for demo videos. Less convenient for actual cardiology.

The paper behind MARCUS (Multimodal Autonomous Reasoning and Chat for Ultrasound and Signals) is interesting because it attacks the problem at the level where the difficulty actually lives: not generic conversation, but raw clinical perception plus cross-modal synthesis.[1] MARCUS is designed to interpret ECGs, echocardiograms, and cardiac magnetic resonance (CMR) imaging studies, both separately and together. It does this through modality-specific expert models coordinated by an agentic orchestrator.

That last phrase is doing real work. The central claim is not merely that a medical model beats frontier vision-language models (VLMs) on a benchmark. The more useful claim is architectural: when a task requires specialized perception, workflow decomposition, and evidence checking, a system of trained experts plus an orchestrator can outperform a single general model that is asked to swallow the whole problem in one bite.

For business readers, this is the part worth keeping. The lesson is not “AI will replace cardiologists,” which is the sort of conclusion that sounds dramatic because it skips every operational detail. The lesson is that clinical AI is moving from model-as-answer-engine toward model-as-coordinated-workflow. In domains where the data are private, technical, multimodal, and risky, the moat is not only parameter count. It is access to the right data, the ability to turn that data into perception modules, and the discipline to verify whether the system used the evidence it claims to have used.

The misconception: prompting a frontier VLM is not the same as clinical perception

One reader reaction to MARCUS is predictable: if frontier vision-language models are already strong, why not just give them ECG images, echo videos, and CMR scans with a good prompt?

Because cardiology is not a screenshot-identification task.

An ECG is not just a picture. It is a calibrated physiological signal converted into a visual representation where waveform shape, rhythm, voltage, timing, and lead relationships all matter. Echocardiography is worse, in the productive sense: multiple views, moving anatomy, operator-dependent acquisition, and measurement-heavy interpretation. CMR adds its own structure: sequences, planes, slices, cine motion, tissue characterization, late gadolinium enhancement, and metadata that determine which images should be examined for which question.

A general VLM may appear to reason over an image. But if it has not learned the visual grammar of the modality, fluency becomes a liability. The model can produce a plausible explanation before it has earned the right to explain anything. This is the classic medical AI danger: eloquence arriving before evidence.

MARCUS addresses the problem by refusing to treat “visual input” as one universal object. ECG, echocardiography, and CMR each receive their own expert pipeline. The system then uses an orchestrator to decompose clinical questions, route sub-questions to the right expert, and combine the answers into a patient-level response.

That is the mechanism. The benchmark results matter because they test whether the mechanism pays rent.

The architecture starts with expert perception, not a larger chat window

MARCUS has three modality-specific expert models: one for ECG, one for echocardiography, and one for CMR. Each is trained to map raw or near-raw clinical data into clinically meaningful language outputs.

The training corpus is large by medical standards: 249,785 ECGs, 1,266,144 echocardiogram images from 10,823 studies, and 12,191,751 CMR images from 9,473 studies, paired with physician reports and linked clinical diagnoses. From this material, the authors build 741,000 visual question-answer pairs for supervised fine-tuning and 879,000 multiple-choice diagnostic questions for Group Relative Policy Optimization (GRPO).
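To make the scale of those numbers concrete, the short Python sketch below derives average images per study and the total count of question-style training examples. The figures are the ones quoted above; the variable and function names are illustrative and do not come from the paper.

```python
# Corpus sizes as quoted in the article; everything derived here is plain arithmetic.
corpus = {
    "echo": {"images": 1_266_144, "studies": 10_823},
    "cmr": {"images": 12_191_751, "studies": 9_473},
}

vqa_pairs = 741_000   # visual question-answer pairs for supervised fine-tuning
mcq_items = 879_000   # multiple-choice questions for GRPO
total_training_questions = vqa_pairs + mcq_items


def images_per_study(modality: str) -> float:
    """Average number of images contributed by one study of this modality."""
    entry = corpus[modality]
    return entry["images"] / entry["studies"]


if __name__ == "__main__":
    for name in corpus:
        print(f"{name}: {images_per_study(name):.0f} images per study on average")
    print(f"question-style training examples: {total_training_questions:,}")
```

An echo study averages roughly a hundred images and a CMR study more than a thousand, which is one reason these modalities cannot be handled like single screenshots.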

That pipeline matters because it separates three layers of capability:

| Layer | What MARCUS learns | Why it matters operationally |
| --- | --- | --- |
| Visual encoder pretraining | The visual structure of ECGs, echo videos, and CMR studies | The model learns from hospital-grade raw modality data rather than internet-adjacent image fragments |
| Supervised fine-tuning | How clinical questions map to expert-derived answers | The model becomes interactive instead of merely classificatory |
| GRPO optimization | How to improve diagnostic answer selection | The model is tuned toward correct clinical conclusions, not just fluent output |

This is also where the paper quietly distances itself from the “one model to rule them all” mindset. ECG uses a SigLIP-style vision encoder, a projection module, and a Qwen2 language model over reconstructed full 10-second, 12-lead image grids. Echo and CMR require video-style handling: frames, temporal ordering, multi-view fusion, and modality-specific routing. CMR can lean on metadata for sequence and plane selection; echocardiography needs attention-based view selection because the metadata are less standardized.
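The metadata point can be illustrated with a small Python sketch of rule-based CMR series selection. This is not the paper's implementation: the `CmrSeries` fields, the sequence labels, and the topic-to-sequence mapping are all hypothetical, chosen only to show why standardized metadata makes routing simple.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class CmrSeries:
    series_id: str
    sequence: str  # hypothetical labels such as "cine" or "lge"
    plane: str     # hypothetical labels such as "short_axis" or "four_chamber"


# Hypothetical mapping from question topic to the sequences worth inspecting:
# cine imaging for function, late gadolinium enhancement (LGE) for fibrosis.
TOPIC_TO_SEQUENCES = {
    "function": {"cine"},
    "fibrosis": {"lge"},
}


def select_series(
    study: List[CmrSeries], topic: str, plane: Optional[str] = None
) -> List[CmrSeries]:
    """Pick series by metadata alone; unknown topics select nothing."""
    wanted = TOPIC_TO_SEQUENCES.get(topic, set())
    return [
        s for s in study
        if s.sequence in wanted and (plane is None or s.plane == plane)
    ]
```

No equivalent lookup table exists for echocardiography when view labels are missing or inconsistent, which is why the article describes attention-based view selection there instead.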

The uncomfortable point is simple: the data structure dictates the model structure. A business that tries to flatten every workflow into a generic chat interface is not simplifying the problem. It is often deleting the part of the problem that contains the value.

Natural language becomes the system bus

The orchestrator is the second major design choice. MARCUS does not force every modality into one shared embedding space and hope that a single model will preserve all relevant evidence. Instead, the modality experts communicate through natural language outputs, and the orchestrator uses those outputs to form an integrated assessment.

This looks less glamorous than a fully unified latent space. It may also be more practical.

The authors argue that natural-language interfaces make the architecture more modular. If the language model component improves later, the system can potentially swap or update it without retraining every visual projection layer. In a field where foundation model releases move faster than hospital procurement cycles — admittedly a low bar, but still — modularity is not a minor engineering preference. It is a survival strategy.

The orchestrator also mirrors the structure of clinical reasoning. A cardiologist does not answer every case by fusing all evidence at once in a single opaque pass. The reasoning process is decomposed:

Clinical question
  → ask ECG expert about rhythm, voltage, conduction, ischemia
  → ask echo expert about structure, valves, motion, function
  → ask CMR expert about tissue, anatomy, fibrosis, geometry
  → reconcile conflicts
  → produce diagnosis, uncertainty, and management-relevant interpretation

That decomposition is especially important when single modalities are non-specific. Constrictive pericarditis versus restrictive cardiomyopathy is a good example from the paper’s discussion: echo findings alone can overlap, while CMR and ECG can provide context that changes the conclusion. A general model may latch onto the most visually salient cue. MARCUS is designed to ask different specialists different questions before forming the answer.
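The decomposition flow above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not MARCUS's code: the `Expert` signature, the `Finding` record, and the fixed sub-question wording are hypothetical, and the `synthesize` step is a plain concatenation standing in for the language model that would reconcile conflicts in a real system.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical expert interface: a text prompt in, a text finding out.
# Plain text is the "system bus" between experts and orchestrator.
Expert = Callable[[str], str]


@dataclass
class Finding:
    modality: str
    question: str
    answer: str


# Sub-question topics follow the flow in the article; the wording is illustrative.
SUB_QUESTIONS = {
    "ecg": "Describe rhythm, voltage, conduction, and signs of ischemia.",
    "echo": "Describe structure, valves, wall motion, and function.",
    "cmr": "Describe tissue characteristics, anatomy, fibrosis, and geometry.",
}


def synthesize(clinical_question: str, findings: List[Finding]) -> str:
    """Placeholder synthesis: list the findings under the question."""
    lines = [f"Question: {clinical_question}"]
    lines += [f"[{f.modality}] {f.answer}" for f in findings]
    return "\n".join(lines)


def orchestrate(clinical_question: str, experts: Dict[str, Expert]) -> str:
    """Route one sub-question to each available expert, then combine the text."""
    findings = []
    for modality, sub_question in SUB_QUESTIONS.items():
        expert = experts.get(modality)
        if expert is None:
            continue  # this modality was not acquired for the patient
        prompt = f"{clinical_question}\n{sub_question}"
        findings.append(Finding(modality, sub_question, expert(prompt)))
    return synthesize(clinical_question, findings)
```

Because every expert speaks text, swapping one expert or the synthesis step does not require touching the others, which is the modularity argument made earlier.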

In business terms, this is the difference between a chatbot and an operating model. A chatbot waits for a prompt. An operating model knows which sub-processes must be invoked before the answer deserves to exist.

The main evidence: performance improves most when synthesis is required

The headline results are large. On internal and external test cohorts, MARCUS reports 87–91% accuracy for ECG, 67–86% for echocardiography, and 85–88% for CMR interpretation. The frontier comparators — GPT-5 Thinking and Gemini 2.5 Pro Deep Think, as named in the paper — are far lower across these modality-specific tasks.

The multimodal result is more revealing. MARCUS reaches about 70% accuracy on complex multimodal cases, while frontier models land around 22–28%. In the supplementary MCQ table, the paired filtered multimodal comparison reports 73.7% for MARCUS, 22.5% for GPT-5, and 29.4% for Gemini 2.5 Pro. The precise denominator differs slightly by evaluation subset, but the interpretation is stable: the gap grows when the task requires integration rather than isolated recognition.

That pattern supports the paper’s mechanism-first argument. If the improvement came only from domain data, we would expect a strong single-modality advantage and a multimodal advantage of roughly the same size. The single-modality advantage is there. But the gap widens on multimodal tasks, which points to the second ingredient: orchestration.

| Evaluation area | Likely purpose in the paper | What it supports | What it does not prove |
| --- | --- | --- | --- |
| Single-modality MCQ accuracy | Main evidence for expert perception | Modality-specific encoders outperform general VLMs on curated diagnostic questions | Real-world diagnostic workflow performance |
| Multimodal MCQ accuracy | Main evidence for orchestration | Decomposition and synthesis improve cross-modal reasoning | Autonomous clinical decision-making safety |
| VQA Likert scoring | Evidence for interactive clinical usefulness | MARCUS generates higher-quality free-text responses than comparators | That clinicians will trust, use, or act on those responses correctly |
| UCSF external validation | Robustness/generalization test | Performance transfers beyond the Stanford development site | Broad deployment across community hospitals and global populations |
| Mirage probing | Grounding and safety mechanism test | Counterfactual checks can suppress ungrounded visual reasoning in the tested protocol | Full clinical safety or absence of hallucination under all usage patterns |
| Failure case analysis | Error anatomy and development guidance | Echocardiography and free-text generation remain weaker areas | A definitive taxonomy of all future error modes |

The VQA results add a second layer. On the one-to-five Likert scale, MARCUS scores 3.65 on open-ended ECG questions versus 2.60 for GPT-5 and 2.55 for Gemini. In CMR, it scores 2.91 versus 2.19 and 1.95, respectively. In multimodal free-text reasoning, it scores 3.28 versus 2.69 and 1.46. These are not perfect scores. That is precisely why they are useful. The paper is not showing a finished digital cardiologist; it is showing that domain perception plus orchestration produces more clinically useful responses than general VLMs, while still leaving plenty of room for improvement.

Echocardiography is the warning label. MARCUS beats the comparator models, but echo remains weaker than ECG and CMR. The paper’s own failure analysis points in the same direction: echo has higher failure rates, especially in free-text tasks. This is not surprising. Ultrasound is operator-dependent, temporally rich, and full of measurement subtleties. In other words, it is exactly the kind of modality that punishes shallow “multimodal” claims.

The appendix is not decorative; it tells us where the system breaks

The supplementary results are not a second thesis. They are a map of where the headline should be narrowed.

The MCQ confidence intervals and McNemar tests support the primary comparison against frontier models. The Likert statistics use ordinal scoring and Mann–Whitney tests, which is appropriate given that a one-to-five clinical quality score is not a continuous physical measurement, no matter how much dashboards would like it to be. The per-category heatmaps are more diagnostic than promotional: they show strong performance in some areas, such as ECG arrhythmia and CMR pathology identification, while exposing weak spots in continuous quantification and specific multimodal categories.
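For readers who have not met McNemar's test, the Python sketch below shows why it suits this comparison: both systems answer the same questions, so only the questions where they disagree carry information. The article does not say which variant the authors used; this sketch assumes the exact binomial form, and the counts in the usage line are hypothetical.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from discordant pair counts.

    b: questions model A answered correctly and model B answered incorrectly
    c: questions model A answered incorrectly and model B answered correctly
    Questions both models got right, or both got wrong, do not enter the test.
    """
    n = b + c
    if n == 0:
        return 1.0  # no disagreements, so no evidence of a difference
    k = min(b, c)
    # Under the null hypothesis each discordant pair is a fair coin flip.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


if __name__ == "__main__":
    # Hypothetical counts: A wins 10 disagreements, B wins none.
    print(mcnemar_exact(10, 0))  # 0.001953125
```

The Likert comparisons need a different tool because the scores are ordinal, which is where a rank-based test such as Mann–Whitney comes in.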

One detail deserves more attention: quantitative measurements remain difficult. In echocardiography, MARCUS’s advantage over GPT-5 in quantitative measurement subcategories is much smaller than in valves or ventricular function. In multimodal VQA, wall motion pattern and mitral regurgitation quantification are among the weakest categories. This matters commercially because clinical workflow value often depends on boring numbers: ejection fraction, chamber size, gradients, regurgitation severity, wall motion, volumes. The glamorous answer is “the model reasons.” The invoiceable answer is “the model helps produce reliable measurements and recommendations.” Different problem.

The failure case log is also important. The authors identify 209 MARCUS failure cases across MCQ and VQA evaluations. In VQA tasks, 169 of 400 MARCUS responses receive a Likert score of 2, with echocardiography again the most fragile modality. This does not erase the performance gains. It tells product teams where not to overclaim.

A credible deployment strategy would start by separating tasks into at least three baskets:

| Task type | MARCUS evidence suggests | Practical product posture |
| --- | --- | --- |
| Binary or discrete recognition in strong modalities | More mature, especially ECG and CMR categories | Candidate for triage, second read, or structured assistant workflows |
| Cross-modal synthesis | Promising, especially against general VLM baselines | Useful for case summarization and specialist support, with review |
| Continuous quantification and difficult echo tasks | Still fragile | Keep human-in-the-loop measurement validation and conservative escalation |

This is not caution for its own sake. It is product segmentation. Good AI businesses do not deploy a model; they deploy the parts of a model that are ready for the jobs where failure is tolerable, visible, and correctable.

Mirage resistance is the system’s most business-relevant safety idea

The paper’s most interesting safety contribution is not a generic hallucination warning. It is a concrete inference-time verification mechanism.

The authors discuss “mirage reasoning”: the tendency of vision-language models to produce detailed image-based reasoning even when the image is absent or not actually grounding the answer. In clinical settings, this is not a philosophical nuisance. A plausible but ungrounded explanation can be worse than an obvious failure because it invites confidence.

MARCUS handles this through counterfactual probing. For each clinical sub-query, the orchestrator generates several semantically equivalent rephrasings and sends them to the relevant modality expert with the visual input. It then sends an image-absent version of the same query. The system compares two things: consistency across image-present rephrasings and divergence between image-present and image-absent answers.

The logic is elegant enough to be dangerous if oversold, so let’s state it precisely.

If a model gives similar answers across differently worded image-present questions, that suggests stable reasoning. If the answer changes meaningfully when the image is removed, that suggests the image mattered. But if the image-present and image-absent answers are too similar, the system flags potential mirage behavior. The orchestrator then adjusts confidence and weighting.
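The two comparisons can be written down as a short Python sketch. Several things here are assumptions rather than details from the paper: the `Expert` signature, the use of character-level `difflib` similarity as a stand-in for whatever semantic comparison the orchestrator performs, the single image-absent query, and both thresholds.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from itertools import combinations
from statistics import mean
from typing import Callable, List, Optional

# Hypothetical expert signature: (question, image or None) -> free-text answer.
Expert = Callable[[str, Optional[bytes]], str]


@dataclass
class ProbeResult:
    consistency: float      # agreement among image-present answers
    divergence: float       # distance between image-present and image-absent answers
    stable: bool
    mirage_suspected: bool


def similarity(a: str, b: str) -> float:
    """Crude text similarity in [0, 1]; a placeholder for a semantic comparison."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def probe(
    expert: Expert,
    rephrasings: List[str],
    image: bytes,
    consistency_min: float = 0.7,  # assumed threshold
    divergence_min: float = 0.3,   # assumed threshold
) -> ProbeResult:
    if not rephrasings:
        raise ValueError("need at least one phrasing of the sub-query")
    with_image = [expert(q, image) for q in rephrasings]
    without_image = expert(rephrasings[0], None)  # same query, image withheld

    pairs = list(combinations(with_image, 2))
    consistency = mean(similarity(a, b) for a, b in pairs) if pairs else 1.0
    divergence = 1.0 - mean(similarity(a, without_image) for a in with_image)
    return ProbeResult(
        consistency=consistency,
        divergence=divergence,
        stable=consistency >= consistency_min,
        mirage_suspected=divergence < divergence_min,
    )
```

An expert that returns the same confident finding with or without the image produces a divergence of zero and is flagged. An expert that declines to answer when the image is withheld is not.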

In isolated expert models, the paper reports non-zero mirage rates: about 33.0% for ECG, 38.5% for echocardiography, and 36.4% for CMR under image-absent queries. With the full MARCUS pipeline, the orchestrator identifies the mirages and the reported system-level mirage rate falls to 0% in the tested protocol, without suppressing correctly grounded responses.

That is a serious architectural idea. It shifts trust from “the model sounds right” to “the system tested whether the visual evidence changed the answer.”

For enterprise AI, this idea travels beyond medicine. In finance, a system can ask whether a recommendation changes when market data are withheld. In legal review, it can test whether a clause-level answer changes when the cited document section is removed. In compliance, it can compare evidence-present and evidence-absent responses before escalating a risk conclusion. The general principle is not medical: make grounding testable at inference time.

Of course, this is not a universal hallucination vaccine. It tests a specific failure mode under a specific protocol. A model can still misread an image, misunderstand a clinical question, or reason badly from real evidence. But it is a useful move from passive trust to active verification. In regulated work, that distinction is not cosmetic. It is often the difference between a demo and a process.

The business value is architecture, not autonomous diagnosis

What does the paper directly show?

It shows that, on curated retrospective benchmarks, a cardiac multimodal agentic system trained on substantial institutional data outperforms selected frontier VLMs across ECG, echo, CMR, multimodal multiple-choice diagnosis, and open-ended clinical reasoning tasks. It also shows that a counterfactual probing orchestrator can reduce the tested mirage behavior at the system level.

What can Cognaptus reasonably infer?

The practical value of clinical AI will likely come from system design more than generic model access. In specialized domains, vendors need four assets working together:

  1. Data rights and data quality. MARCUS’s advantage begins with raw clinical data paired with physician reports. Without that, the system is just fluent guessing wearing a lab coat.
  2. Modality-specific perception. ECG, ultrasound, and CMR do not deserve the same visual pipeline. The model architecture should respect the physics and workflow of the data.
  3. Agentic orchestration. The system must know which expert to ask, how to decompose the question, how to reconcile conflicts, and how to communicate uncertainty.
  4. Grounding checks. Verification should be part of inference, not a slide in the governance appendix.

The ROI logic is therefore not simply “replace expensive specialists.” That is the lazy spreadsheet version. The more credible ROI pathway is capacity expansion: faster preliminary interpretation, better triage, reduced diagnostic backlog, more consistent second reads, structured reporting support, and earlier escalation of cases requiring specialist review.

In hospital reality, those benefits still need prospective validation. A retrospective benchmark can show technical promise; it cannot prove shorter time to treatment, lower diagnostic error rates, better patient outcomes, reduced clinician burnout, or improved reimbursement economics. Those are deployment questions, not benchmark questions.

Data access becomes the moat, but also the bottleneck

MARCUS also exposes a strategic tension in medical AI.

Frontier models are trained on broad public or licensed corpora. Hospitals hold the data that matter most for specialist reasoning, but those data are protected, fragmented, institution-specific, and operationally messy. The paper’s full raw imaging data from Stanford and UCSF are not publicly available because of institutional data use agreements. The test questions are available through the repository; the full benchmark dataset and model weights are described as planned for release upon acceptance.

This creates a familiar but under-discussed business pattern: the best training data are exactly the data that are hardest to aggregate.

A vendor trying to replicate MARCUS does not merely need GPU budget. The training details are non-trivial — H100 systems, multi-stage training, modality-specific preprocessing, GRPO, external validation — but compute is not the only gate. The harder asset is a clinically meaningful, legally usable, multi-institutional dataset with reliable ground truth and enough diversity to survive deployment outside a flagship academic center.

That means clinical AI competition may not look like consumer AI competition. In consumer AI, distribution and model access can dominate. In clinical AI, the winning architecture may need deep data partnerships, local validation loops, and integration into clinical systems that are allergic to friction. The moat is not just intelligence. It is institutional trust converted into usable training and evaluation data.

Naturally, this also means deployment will be slower than the hype cycle prefers. The hype cycle will survive. It always does. Hospitals are less forgiving.

Where the evidence stops

The paper is strong enough that its boundaries should be stated cleanly, not sprinkled nervously into every paragraph.

First, the development data come from a single main center, Stanford. UCSF external validation is valuable, especially because no UCSF data were used in development, but two academic institutions do not establish global generalizability. Community hospitals, different imaging vendors, different acquisition protocols, different patient populations, and different reporting habits may change performance.

Second, the evaluation is retrospective and benchmark-based. Curated MCQ and VQA tasks are necessary for rigorous comparison, but clinical reality is messier. Real questions are often ambiguous, incomplete, and embedded in workflow constraints.

Third, echocardiography remains visibly weaker. The paper’s own numbers point to this: echo accuracy and free-text quality lag behind ECG and CMR in important settings, and failure rates are higher. Any deployment plan that treats all modalities as equally mature is ignoring the paper it claims to be inspired by.

Fourth, VQA scoring depends partly on evaluator design. A blinded cardiologist scores a subset, and an LLM-based evaluator scores the complete dataset. That is a practical solution at scale, but clinical usefulness ultimately has to be tested with clinicians in the loop, not only against report-derived answers.

Fifth, mirage resistance is not the same as clinical safety. The counterfactual protocol is a smart grounding test. It does not guarantee that all grounded reasoning is correct.

These limitations do not make MARCUS unimportant. They make the correct interpretation narrower and more useful: MARCUS is evidence that agentic, domain-trained, multimodal clinical systems can beat general VLMs on demanding cardiac interpretation tasks. It is not evidence that hospitals should hand over diagnosis to an autonomous agent tomorrow morning. Anyone selling the second claim from the first result is doing marketing, not interpretation.

The real shift: from AI that answers to AI that coordinates evidence

MARCUS is best read as a prototype of where serious vertical AI is heading.

The first generation of AI products answered questions. The next generation will coordinate evidence. That means knowing which tool to invoke, which data source to trust, which expert model to query, when to test for grounding, how to expose uncertainty, and when to leave the decision to a human.

Cardiology makes this shift obvious because the work is already multimodal. But the same pattern appears in other domains. Investment research combines prices, filings, macro data, transcripts, and risk constraints. Compliance review combines policies, contracts, transaction logs, and regulatory text. Industrial maintenance combines sensor streams, images, manuals, and technician notes. In each case, the high-value task is not “answer the user.” It is “assemble the right evidence pathway before answering.”

MARCUS gives that idea a concrete medical demonstration. Its performance gains come from the sequence: domain data creates expert perception; expert perception feeds the orchestrator; the orchestrator decomposes and synthesizes; counterfactual checks test grounding; final output becomes interactive and clinically interpretable.

That is why a mechanism-first reading is more useful than a leaderboard summary. The numbers are impressive, but the architecture explains why they appear and where they may fail.

The future clinical AI assistant will probably not look like one giant oracle. It will look more like a disciplined consultation room: three specialists, one coordinator, a habit of checking evidence, and a healthy suspicion of beautiful explanations that arrive too easily.

That is less magical than the usual AI story.

Good. Medicine has enough magic already.

Cognaptus: Automate the Present, Incubate the Future.


[1] Jack W. O’Sullivan, Mohammad Asadi, Lennart Elbe, Akshay Chaudhari, Tahoura Nedaee, Francois Haddad, Michael Salerno, Fei-Fei Li, Ehsan Adeli, Rima Arnaout, and Euan A. Ashley, “MARCUS: An Agentic, Multimodal Vision-Language Model for Cardiac Diagnosis and Management,” arXiv:2603.22179v1, March 23, 2026, https://arxiv.org/abs/2603.22179