Paging Dr. Model: When AI Runs the Workup

TL;DR for operators

DxDirector-7B is interesting because it does not behave like a normal medical chatbot. It does not wait for a doctor to gather a neat case history and then offer a polished answer. It starts with a vague chief complaint, decides what information is missing, asks for clinical operations when necessary, and stops when it believes enough evidence exists to make a diagnosis.¹

The operational claim is therefore workflow-level, not answer-level. The model is trained to own the diagnostic sequence: reason internally, choose the next diagnostic “strategy,” route tasks either to itself or to a physician, then produce a final diagnosis with step-level provenance. Physicians become executors of selected real-world operations—labs, imaging, observation, physical examination—rather than the sole directors of the workup. That is a fairly large reordering of authority, tucked inside a 7B model. Subtle, in the same way a hospital merger is “just paperwork.”

The paper reports three categories of evidence. First, DxDirector-7B outperforms larger medical and general LLMs on full-process diagnostic benchmarks: 36.23% on RareArena, 38.40% on NEJM Clinicopathologic Cases, and 63.46% on ClinicalBench. Second, it requests fewer physician operations than baselines, roughly 2.7 to 3.2 per case across the three main datasets, while reporting a useful-operation rate of 97% to 98%. Third, in a 160-case real-world inpatient evaluation across nine departments in a Grade 3A hospital in China, the model’s diagnoses were judged able to stand in for the specialists’ own in 60% to 75% of cases in several test-integration-heavy departments.

For operators, the best early use case is not “replace doctors.” That phrase is legally radioactive and operationally lazy. The better framing is diagnostic orchestration: a system that standardises workups, reduces unnecessary clinician requests, creates an auditable step log, and helps lower-seniority staff know what to collect next. The strongest candidate departments are those where diagnosis depends on combining multiple tests and histories—cardiology, gastroenterology, pulmonology, infectious disease, endocrinology, and pain management. The weaker fit is where diagnosis depends on touch, continuous observation, visual nuance, or real-time human interaction: dermatology, surgery, psychiatry, and similar high-contact settings.

The boundary is just as important. Much of the large-scale evaluation reconstructs full-process diagnosis from existing datasets using GPT-4o-generated chief complaints and simulated physician agents. The hospital study is closer to clinical reality, but still uses recorded patient and specialist behaviours replayed through GPT-4o-based agents, not direct live patient care. This is a serious prototype of a new operating model, not a production licence to hand the pager to a model and go for coffee.

The real change is who owns the next question

Most medical AI systems answer questions physicians already know how to ask. A clinician collects the history, orders the test, interprets the messy bits, frames the case, and then the model supplies a suggestion. That is useful, but it leaves the expensive part of diagnosis untouched: deciding what information is missing and what to do next.

DxDirector-7B targets that missing layer. The paper calls this a reversal of the physician-AI relationship. In the old arrangement, the physician directs and AI assists. In the proposed arrangement, the AI directs the diagnostic process and physicians assist when the model reaches something software cannot do.

That distinction matters because clinical work is not a single prediction. It is a sequence of decisions under incomplete information:

What is known?
What remains uncertain?
Which uncertainty matters most?
Can the model infer the answer from medical knowledge?
Does the next step require a physical-world operation?
Has the evidence become sufficient to stop?

A model that only answers final-diagnosis questions is not managing that loop. It is receiving the benefits of someone else’s loop. DxDirector-7B is designed to run the loop itself.

The mechanism is simple enough to describe, although not simple to make reliable:

Vague chief complaint
        ↓
Step-level "deep thinking"
        ↓
Next diagnostic question / strategy
        ↓
Route:
  - LLM answers medical reasoning questions
  - Physician executes physical-world clinical operations
        ↓
New information added
        ↓
Repeat or stop
        ↓
Final diagnosis + step-level references + responsibility trace
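
The loop above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: every name here (`Step`, `Workup`, `run_workup`, and the callbacks passed into it) is hypothetical, and the model, the physician, and the stopping rule are replaced by toy stand-ins so the control flow is visible.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Step:
    question: str
    actor: str   # "llm" or "physician": who produced the answer
    answer: str


@dataclass
class Workup:
    chief_complaint: str
    steps: List[Step] = field(default_factory=list)
    diagnosis: Optional[str] = None


def run_workup(complaint, next_question, needs_physician,
               ask_llm, ask_physician, diagnose, max_steps=10):
    """Run the direct-route-repeat loop until the director stops or max_steps is hit."""
    workup = Workup(chief_complaint=complaint)
    for _ in range(max_steps):
        question = next_question(workup)
        if question is None:  # the director judges the evidence sufficient
            break
        if needs_physician(question):
            step = Step(question, "physician", ask_physician(question))
        else:
            step = Step(question, "llm", ask_llm(question))
        workup.steps.append(step)  # new information is added to the case state
    workup.diagnosis = diagnose(workup)
    return workup


# Toy stand-ins (assumptions): a real system would call a model and a clinical system.
script = [
    "Order a peripheral blood smear",
    "What do schistocytes and elevated LDH suggest?",
]


def next_question(workup):
    done = len(workup.steps)
    return script[done] if done < len(script) else None


def needs_physician(question):
    # Crude routing rule for the demo only.
    return question.lower().startswith("order")


result = run_workup(
    "(vague chief complaint)",
    next_question,
    needs_physician,
    ask_llm=lambda q: "(model's own reasoning)",
    ask_physician=lambda q: "(result returned by a clinician)",
    diagnose=lambda w: "(final diagnosis placeholder)",
)
for s in result.steps:
    print(s.actor, "->", s.question)
```

The point of the sketch is the ownership of `next_question`: in a conventional assistant that function lives in the physician's head, while here it belongs to the model, and the physician appears only as a callback.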

This is why the paper should not be read as “another medical LLM benchmark.” The important move is not that a 7B model beats larger models on several reported metrics, although that is certainly the part one would put on a slide if one had lost all moral resistance to investor decks. The important move is that the authors optimise the model for diagnostic direction: choosing the next step, conserving physician involvement, and leaving an audit trail.

DxDirector is trained to be a stingy user of physician time

The model’s training pipeline has four relevant pieces.

First, the authors continue-pretrain Llama-2-7B on medical data: clinical guidelines, PubMed and PubMed Central abstracts and full papers, with some general-domain replay data to preserve broader capabilities. This gives the model medical knowledge. It does not by itself make the model a diagnostic director.

Second, they instruction-tune the model for full-process diagnosis. The training data begins from MedQA-style cases rich in clinical detail. GPT-4o is used to extract the detailed information, rewrite it into vague patient-style chief complaints, and convert multiple-choice questions into more open-ended clinical questions. Then GPT-4o simulates step-by-step diagnostic reasoning using the hidden full case information. Medical experts sample and review the synthetic data. The result is 10,178 instruction-response pairs.

Third, they inject step-level “deep thinking” using o1-preview. The point is not merely to make answers longer. The authors want each diagnostic step to contain an explicit rationale for what should be asked next. In the training format, the model distinguishes between questions it can answer itself and questions requiring a physician. A question about what schistocytes and elevated LDH suggest can be handled by the model. A question asking for a blood smear result, renal function test, CT scan, or physical examination needs a human or clinical system to return information.

Fourth, the authors apply step-level strategy preference optimisation. This is the business-relevant part. The model samples multiple possible strategies at a diagnostic step. The reward prefers correct final answers. Among strategies with the same correctness, it penalises those that request more physician help. In the paper’s reward formulation, correct answers receive reward scaled by the inverse of the number of physician-assistance requests, while incorrect answers receive zero.
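
A rough sketch of that reward, under stated assumptions: the function names are hypothetical, the floor of one request (to avoid dividing by zero when a strategy asks the physician for nothing) is my assumption rather than something the summary above specifies, and pairing the highest- and lowest-reward candidates is one plausible way to build a preference pair, not necessarily the paper's.

```python
def strategy_reward(correct: bool, physician_requests: int) -> float:
    """Zero if the final answer is wrong; otherwise the inverse of the request count."""
    if not correct:
        return 0.0
    # Assumption: floor the count at 1 so zero-request strategies stay well defined.
    return 1.0 / max(physician_requests, 1)


def preference_pair(candidates):
    """candidates: list of (name, correct, physician_requests) tuples.

    Returns (chosen, rejected) names, or None when all rewards tie
    and there is no preference signal to learn from.
    """
    scored = sorted(
        candidates,
        key=lambda c: strategy_reward(c[1], c[2]),
        reverse=True,
    )
    best, worst = scored[0], scored[-1]
    if strategy_reward(best[1], best[2]) == strategy_reward(worst[1], worst[2]):
        return None
    return best[0], worst[0]


# Made-up candidates for one diagnostic step.
candidates = [("A", True, 4), ("B", True, 2), ("C", False, 1)]
rewards = {name: strategy_reward(ok, n) for name, ok, n in candidates}
pair = preference_pair(candidates)
print(rewards)  # {'A': 0.25, 'B': 0.5, 'C': 0.0}
print(pair)     # ('B', 'C')
```

Strategies A and B are both correct, but B wins because it interrupts the physician half as often. Strategy C asks the least and still scores zero, which is the property that stops the model from learning that silence is efficient.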

That makes DxDirector less like a general medical encyclopaedia and more like a workflow controller trained with a cost signal. The “cost” is not money directly. It is human clinical workload. But for hospitals, workload is money, capacity, delay, burnout, and occasionally the reason the patient waits six hours to be told what could have been triaged in twenty minutes.

Technical contribution	Operational consequence	ROI relevance
Continued medical pretraining	Gives the 7B base model medical domain knowledge	Reduces dependence on very large general models for every inference
Full-process instruction tuning	Teaches the model to start from vague complaints and build a diagnostic pathway	Moves AI from answer support to workup orchestration
Decoupled reasoning/knowledge training	Separates strategy generation from answer recall	Helps the model decide what to ask before trying to answer
Step-level preference optimisation	Rewards correct paths that request fewer physician operations	Directly targets clinician workload, not just accuracy
Literature-backed final output	Links reasoning steps to supporting medical sources	Supports audit, review, and clinical governance

The key design choice is that the model is not only trained to be right. It is trained to be right with fewer human interruptions. That is the difference between a clever assistant and a system that can plausibly change staffing patterns.

The evidence is accuracy plus workload, not accuracy alone

The paper’s main evidence sits on two axes: diagnostic accuracy and physician workload. Either axis alone would be weaker.

Accuracy without workload reduction gives another clinical decision-support tool. Workload reduction without accuracy gives an automated nuisance with a stethoscope-shaped user interface. The authors need both.

On the three main diagnostic benchmarks, DxDirector-7B reports the highest accuracy:

Evaluation setting	DxDirector-7B	Strong comparator	Interpretation
RareArena rare disease cases	36.23%	o3-mini: 32.96%; DeepSeek-V3-671B: 27.03%; MedFound-176B: 23.98%	The model improves on rare disease diagnosis in the reconstructed full-process setting
NEJM Clinicopathologic Cases	38.40%	Human physicians: 32.50%; GPT-4o: 30.80%; DeepSeek-V3-671B: 29.20%	The strongest claim: complex cases where DxDirector beats reported physician accuracy and frontier LLMs
ClinicalBench real-world cases	63.46%	DeepSeek-V3-671B: 46.66%; GPT-4o: 44.33%; MedFound-176B: 35.76%	Largest absolute gain, suggesting workflow control matters when cases resemble routine clinical distribution
USMLE open-ended full-process setting	50.88%	GPT-4o: 47.56%; DeepSeek-V3-671B: 47.04%; o1-preview: 46.30%	Broader task coverage beyond final diagnosis

The numbers are not “solved medicine.” They are “better than baselines under this experimental framing.” That phrasing is less fun, but more useful.

The RareArena result is especially instructive. DxDirector-7B reaches 36.23%, ahead of o3-mini at 32.96% and DeepSeek-V3-671B at 27.03%. The authors interpret this as evidence that stepwise reasoning helps with sparse, long-tail symptom patterns. That is plausible, but the result should also be read against the relatively low absolute accuracy. Rare disease diagnosis remains hard. DxDirector is better in the study; it is not magic with a pager.

The NEJM Clinicopathologic Cases result is the paper’s showpiece: 38.40% for DxDirector-7B versus 32.50% for human physicians and 30.80% for GPT-4o. These are complex, educational cases, not everyday sniffles. The model’s advantage here supports the authors’ thesis that diagnostic workflow reasoning can matter more than raw model scale or domain pretraining alone.

ClinicalBench gives the most operationally relevant signal. DxDirector-7B scores 63.46%, compared with 46.66% for DeepSeek-V3-671B and 44.33% for GPT-4o. This is also where the model shows the largest absolute improvement. In business terms, this is the benchmark to watch because its cases resemble the real-world case mix more closely than rare-disease or NEJM-style challenge cases do. Still, the cases are reconstructed into a full-process setting, which means the experimental scaffolding matters.

Now the workload side.

Dataset	DxDirector-7B physician operations per case	Notable baselines (physician operations per case)	DxDirector-7B useful-operation rate
RareArena	2.72	DeepSeek-V3-671B: 4.63; o3-mini: 5.91; many med-LLMs near 9–10	98.07%
NEJM Clinicopathologic Cases	3.15	o1-preview: 4.15; o3-mini: 4.98; med-LLMs often above 10	98.02%
ClinicalBench	2.68	o3-mini: 4.54; o1-preview: 5.02; MedFound-176B: 12.54	97.31%

This is where the paper becomes operational rather than merely academic. DxDirector is not just more accurate in the reported setting. It asks for fewer things, and the things it asks for are more likely to be useful. That is exactly the kind of metric hospital administrators understand, even before the clinicians begin sharpening knives over liability.
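
To put the workload gap in relative terms, the short calculation below compares DxDirector-7B with the lowest baseline figure quoted for each dataset in the table above. The numbers come from that table; the variable names and the choice of comparator are mine, so treat it as back-of-envelope arithmetic rather than a result from the paper.

```python
# Physician operations per case, as quoted in the workload table.
dx_ops = {"RareArena": 2.72, "NEJM CPC": 3.15, "ClinicalBench": 2.68}

# Lowest baseline quoted for each dataset (name, operations per case).
best_baseline = {
    "RareArena": ("DeepSeek-V3-671B", 4.63),
    "NEJM CPC": ("o1-preview", 4.15),
    "ClinicalBench": ("o3-mini", 4.54),
}


def relative_reduction(model_ops: float, baseline_ops: float) -> float:
    """Fraction of baseline operations avoided by the model."""
    return (baseline_ops - model_ops) / baseline_ops


reductions = {
    name: relative_reduction(dx_ops[name], best_baseline[name][1])
    for name in dx_ops
}

for name, r in reductions.items():
    print(f"{name}: {r:.1%} fewer operations than {best_baseline[name][0]}")
```

On these figures the reduction is roughly 41% on RareArena and ClinicalBench and roughly 24% on the NEJM cases, measured against the most frugal baseline in each row rather than the worst one.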

The useful-operation metric is clever but imperfect. The authors judge whether a requested operation is useful by checking whether it appears in the case report provided by medical specialists. That is a reasonable proxy for benchmark evaluation. It is not the same as proving that the request was cost-effective, safe, timely, clinically necessary, or locally available. A test can appear in a case report and still be overkill for a resource-constrained clinic. A missing test can also be clinically prudent if the patient is unstable. Context, annoyingly, remains undefeated.
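
As a sketch of how such a proxy behaves, the function below counts a requested operation as useful when it also appears in the specialists' case report. The names are hypothetical, and the exact-string matching after normalisation is a simplifying assumption; the paper's actual matching procedure is not described here and is presumably more forgiving about wording.

```python
def normalise(text: str) -> str:
    """Lower-case and collapse whitespace so trivial formatting differences do not matter."""
    return " ".join(text.lower().split())


def useful_operation_rate(requested, documented):
    """Share of requested operations that also appear in the case report.

    Returns None when nothing was requested, since the rate is undefined.
    """
    if not requested:
        return None
    documented_set = {normalise(d) for d in documented}
    useful = sum(1 for r in requested if normalise(r) in documented_set)
    return useful / len(requested)


# Made-up example: three requests, two of which the case report contains.
requested = ["Blood smear", "Renal function test", "CT scan"]
documented = ["blood smear", "renal  function test"]
rate = useful_operation_rate(requested, documented)
print(rate)
```

The sketch also makes the limitation concrete: the metric can only say whether a request matches what specialists wrote down, not whether the request was worth its cost for this patient in this clinic.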

The paper’s tests do different jobs

Not every experiment in the paper supports the same claim. Treating them as one big performance pile would blur the argument.

Test or analysis	Likely purpose	What it supports	What it does not prove
RareArena, NEJM, ClinicalBench accuracy	Main evidence	DxDirector improves full-process diagnostic accuracy under reconstructed benchmark settings	Safe live deployment in hospitals
Physician operation count and useful-operation rate	Main evidence	The model reduces simulated physician workload while maintaining higher accuracy	True cost savings across local hospital workflows
Department-level heatmaps	Granular evidence / boundary finding	The model performs best in many test-integration-heavy departments and less well in high-contact specialties	Universal clinical coverage
160-case hospital evaluation	Real-world extension / comparison with prior work	Specialist-aligned diagnoses and replacement judgments in a controlled hospital replay setting	Direct autonomous patient interaction
USMLE open-ended tasks	Robustness / task breadth test	The approach generalises beyond final diagnosis to 12 clinical task categories	Mastery of all clinical skills
Misdiagnosis accountability perturbation study	Implementation detail plus governance evidence	Structured outputs improve attribution of error source	Full legal accountability in real adverse events

The department-level results are particularly useful because they prevent the model from being oversold as an all-purpose “AI doctor.” On ClinicalBench, DxDirector-7B performs best in 14 of 17 departments. On RareArena, it leads in 16 of 19 departments. The strongest improvements appear in areas such as neurosurgery, oncology, pulmonology, hematology, orthopaedics, infectious disease, and other specialties where diagnosis depends on collecting and integrating multiple tests.

The weak spots are not random. The model does not dominate in the same way in dermatology, plastic surgery, psychiatry, urology, or nephrology. The paper attributes some of this to departments where diagnosis depends heavily on real contact, observation, and interaction. For rare-disease cases in urology and nephrology, the authors suggest possible long-tail knowledge gaps.

That is the right kind of limitation: specific enough to change deployment design.

The hospital study is promising, but it is not a staffing memo

The real-world evaluation is the paper’s most commercially tempting section. It covers 160 inpatient cases across nine departments in a Grade 3A hospital in China: gastroenterology, nephrology, dermatology, cardiovascular medicine, infectious diseases, endocrinology, pulmonology, general surgery, and pain management.

The setup is cautious. The model does not interact with patients directly. Patient behaviours and specialist operations are first recorded from actual inpatient diagnoses, and two GPT-4o-based agents then replay them during the diagnostic process. Each LLM under test interacts with these agents, starting from a vague chief complaint, and specialists review the outputs. Scoring uses a double-blind adjudication design: specialists and LLMs independently diagnose the same cases, and a third-party evaluation agent built on GPT-4o and DeepSeek-V3 judges alignment.

This gives stronger realism than a pure benchmark, but still not live clinical autonomy. It is a controlled replay environment.

The reported replacement results are striking:

Department grouping	Reported replacement signal for DxDirector-7B	Practical reading
Cardiovascular medicine	75.0%	Strongest reported replacement rate; likely benefits from structured test integration
Infectious diseases, gastroenterology, pain management, pulmonology, endocrinology	60%–66.7%	Strong candidate areas for diagnostic orchestration pilots
Dermatology and general surgery	Below 50%	Human-led workflows remain more appropriate where physical interaction dominates
Baseline LLMs	All fail to outperform specialists across departments	The result is not simply “frontier LLMs are enough”

This should not be translated as “cut 60% of specialists.” That would be the sort of spreadsheet cruelty that later becomes a regulatory inquiry.

A better reading is that DxDirector may be good enough to handle a meaningful share of the workup-directing burden in departments where the diagnostic pathway is test-heavy and evidence-integrative. In practice, this could mean shadow-mode workup suggestions, junior clinician support, triage pathway generation, or automated case preparation before specialist review.

The phrase “replacement rate” in the paper is therefore best interpreted as “diagnostic content judged sufficiently aligned to specialist output in this controlled setting,” not “licensed substitution under real-world liability.”

Accountability is not a side feature when the model runs the process

If AI only suggests a differential diagnosis, accountability remains messy but familiar. The physician made the final call. The AI was an input.

If AI directs the workup, accountability changes. Did the model ask for the wrong test? Did the physician return an incorrect observation? Did both contribute? Did the model ignore a result it requested? Did the model stop too early? The audit trail becomes part of the product, not a compliance afterthought.

DxDirector’s final output is designed around this problem. It itemises the diagnostic steps, marks whether each step is handled by the LLM or by a physician, and attaches supporting medical literature to LLM-generated content. In other words, the output is not just a diagnosis. It is a responsibility map.
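
One way to picture a responsibility map is as a list of step records, each tagged with who produced it and what supports it. The sketch below is an assumption about shape, not the paper's output schema: `TraceStep` and `audit_problems` are hypothetical names, and the check simply enforces the two properties described above (every step has a known actor, and LLM-generated content carries a reference).

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class TraceStep:
    index: int
    actor: str                        # "llm" or "physician"
    content: str
    references: Tuple[str, ...] = ()  # supporting literature for LLM content


def audit_problems(trace):
    """Return human-readable problems that would block a clean review."""
    problems = []
    for step in trace:
        if step.actor not in ("llm", "physician"):
            problems.append(f"step {step.index}: unknown actor {step.actor!r}")
        if step.actor == "llm" and not step.references:
            problems.append(
                f"step {step.index}: LLM content has no supporting reference"
            )
    return problems


# Placeholder trace; the content strings are illustrative only.
trace = [
    TraceStep(1, "physician", "Blood smear result returned"),
    TraceStep(2, "llm", "Interpretation of the smear and LDH",
              references=("(citation placeholder)",)),
    TraceStep(3, "llm", "Final diagnosis"),
]
problems = audit_problems(trace)
print(problems)  # ['step 3: LLM content has no supporting reference']
```

A record like this is what lets a reviewer ask the questions in the previous paragraph one step at a time instead of reconstructing the workup from notes and memory.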

The authors test this using a simulated misdiagnosis experiment. They perturb selected steps in diagnostic processes using DeepSeek-V3 so that the final diagnosis becomes wrong. The perturbation can affect LLM-generated content, physician-provided content, or both. A GPT-4o-based agent then classifies responsibility into three classes: LLM, physician, or both.

DxDirector-7B achieves the highest precision and recall across these categories. In the reported figure, its precision and recall all fall between roughly 82% and 86%: 85.49% and 84.40% for LLM responsibility, 86.22% and 86.29% for physician responsibility, and 83.72% and 82.31% for shared responsibility. Baselines show a skew: they tend to over-attribute errors to physicians, so their recall for physician responsibility is higher, and their precision for it lower, than the corresponding figures for LLM responsibility.
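
For readers who want the metric spelled out, per-class precision and recall for a three-way attribution task can be computed as below. The function is a plain textbook implementation, and the labels in the example are invented to show what an over-attribution-to-physicians skew looks like; they are not data from the paper.

```python
LABELS = ("llm", "physician", "both")


def precision_recall(true_labels, predicted_labels):
    """Return {label: (precision, recall)} for each responsibility class."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    pairs = list(zip(true_labels, predicted_labels))
    scores = {}
    for label in LABELS:
        true_pos = sum(1 for t, p in pairs if t == label and p == label)
        predicted = sum(1 for _, p in pairs if p == label)
        actual = sum(1 for t, _ in pairs if t == label)
        precision = true_pos / predicted if predicted else 0.0
        recall = true_pos / actual if actual else 0.0
        scores[label] = (precision, recall)
    return scores


# Invented example: the classifier blames the physician too often.
truth = ["llm", "llm", "physician", "both"]
guess = ["physician", "llm", "physician", "physician"]
scores = precision_recall(truth, guess)
for label, (p, r) in scores.items():
    print(f"{label}: precision={p:.2f} recall={r:.2f}")
```

In the toy data, physician recall is perfect because every real physician error is caught, but physician precision is only one in three because two of the three accusations are wrong. That is the pattern to look for when a structured trace is missing.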

For business use, this is one of the more important findings. Hospitals do not merely need AI that gives answers. They need AI that can be reviewed after something goes wrong. A director model without a reviewable chain of responsibility is not automation. It is a liability generator wearing a productivity costume.

The business value is workflow compression, not model cleverness

The obvious commercial pitch is “a 7B model beats much larger models.” That matters for inference cost, on-prem deployment, and accessibility. But the more durable business value is workflow compression.

DxDirector compresses three layers of labour:

Labour layer	Traditional pattern	DxDirector-style pattern	Business implication
Case framing	Physician translates vague complaint into clinical problem	Model starts from vague complaint and builds the case frame	Faster intake and more standardised workups
Diagnostic sequencing	Physician decides which test or question comes next	Model proposes the next step and routes execution	Lower senior-clinician load per case
Review and accountability	Reasoning often spread across notes, orders, and memory	Step-level record distinguishes LLM and physician contributions	Better auditability and governance

The near-term product is not a robot doctor. The near-term product is a diagnostic operations layer.

A hospital, payer, or clinic network could use this kind of system in several practical ways:

Retrospective case replay. Compare actual workups against model-suggested pathways to identify unnecessary tests, missed early signals, or delayed escalation.
Shadow-mode diagnostic orchestration. Let the model suggest next steps without changing clinical decisions, then measure agreement, skipped work, and escalation quality.
Junior clinician support. Use the model to help less experienced staff know what information a specialist will likely need before review.
Department-specific pathway standardisation. Start in specialties where DxDirector’s evidence is strongest: pulmonology, gastroenterology, cardiovascular medicine, infectious disease, endocrinology, and pain management.
Clinical governance tooling. Use step-level logs to support morbidity and mortality review, quality audits, payer documentation, and liability analysis.

The return on investment would not come only from fewer tokens or a smaller GPU bill. It would come from fewer wasted clinical operations, faster time-to-diagnosis, better specialist utilisation, and cleaner documentation. The model cost matters. The redesign of human attention matters more.

Where the claim stops

The paper is ambitious. The boundaries need to be equally clear.

First, the public benchmark evaluations are reconstructed. GPT-4o extracts detailed clinical information and rewrites precise cases into vague chief complaints. A GPT-4o-powered agent then simulates physician interaction by returning relevant information from detailed case data. This is a clever evaluation design, and probably necessary at scale. It is also not the same as a chaotic outpatient encounter with incomplete records, anxious relatives, missing labs, and a printer that has chosen violence.

Second, the hospital study is controlled. The authors use recorded inpatient behaviours and replay them through GPT-4o-based agents. That reduces ethical risk and makes comparison feasible. It also means the model is not yet being tested as a live actor in a real clinical workflow.

Third, the study’s strongest business claim depends on local workflow fit. A hospital with robust electronic records, standardised lab access, and strong clinical governance may benefit more than a fragmented clinic where tests are slow, data entry is inconsistent, and escalation paths are unclear.

Fourth, “replacement” is not a regulatory category. The paper’s replacement rates are adjudication outcomes, not deployment permissions. In production, every jurisdiction will need its own answer to who signs the diagnosis, who owns the order, who validates the model, who updates it, and who gets blamed when the clever workflow misses the obvious.

Fifth, specialty variation is not a footnote. The model is better suited to departments where diagnosis is an information-integration problem. It is less suited to departments where diagnosis depends on direct observation, tactile examination, patient rapport, or fast physical intervention. That does not make the model weak. It makes medicine annoyingly real.

A sensible pilot would not start with autonomy

A serious deployment path would begin with proof of local reliability, not a press release about AI specialists.

A reasonable pilot would look like this:

Phase	Deployment mode	Main question	Metrics
1. Retrospective replay	Use past cases only	Would the model have asked for useful next steps?	Agreement with specialist workup, unnecessary-operation reduction, missed critical information
2. Shadow mode	Live cases, no clinician-facing recommendations	Does performance hold under local data quality?	Chief-complaint parsing quality, step relevance, escalation frequency
3. Assisted mode	Recommendations visible to clinicians	Does it reduce work without reducing safety?	Time-to-diagnosis, actions per case, clinician acceptance, override reasons
4. Governed workflow integration	Department-specific protocols	Can it standardise workups at scale?	Readmissions, adverse events, audit quality, cost per diagnosis

The escalation rules matter. The model should not be allowed to optimise away physician involvement when clinical risk is high, when requested operations are invasive or expensive, when evidence conflicts, when the case falls outside validated departments, or when the model’s own reasoning is unstable across repeated runs.
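
Those rules are simple enough to write down as a guard that runs before the model is allowed to skip a physician. The sketch below is hypothetical throughout: `CaseContext`, its fields, and the allow-list of validated departments are illustrative names and values that a hospital would define locally, not anything specified by the paper.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class CaseContext:
    department: str
    high_clinical_risk: bool = False
    invasive_or_expensive_request: bool = False
    conflicting_evidence: bool = False
    diagnoses_across_runs: Tuple[str, ...] = ()  # outputs from repeated runs


# Hypothetical local allow-list; each site would maintain its own.
VALIDATED_DEPARTMENTS = frozenset({"cardiovascular medicine", "pulmonology"})


def escalation_reasons(ctx: CaseContext,
                       validated=VALIDATED_DEPARTMENTS) -> List[str]:
    """Return every reason a physician must stay in the loop; empty means none fired."""
    reasons = []
    if ctx.high_clinical_risk:
        reasons.append("clinical risk is high")
    if ctx.invasive_or_expensive_request:
        reasons.append("requested operation is invasive or expensive")
    if ctx.conflicting_evidence:
        reasons.append("evidence conflicts")
    if ctx.department.lower() not in validated:
        reasons.append("department not locally validated")
    if len(set(ctx.diagnoses_across_runs)) > 1:
        reasons.append("diagnosis unstable across repeated runs")
    return reasons


routine = CaseContext("Pulmonology", diagnoses_across_runs=("dx-1", "dx-1"))
flagged = CaseContext("Dermatology", conflicting_evidence=True,
                      diagnoses_across_runs=("dx-1", "dx-2"))
print(escalation_reasons(routine))  # []
print(escalation_reasons(flagged))
```

Returning the full list of reasons rather than a single boolean is deliberate: override logs and audit reviews are more useful when they record why a case was escalated, not just that it was.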

The irony is that a good “AI director” still needs a very human governance structure. No model escapes bureaucracy; the best it can do is make the bureaucracy slightly less theatrical.

The strategic lesson: clinical AI is moving from answer engines to process engines

DxDirector-7B represents a broader shift in AI deployment. The first wave of medical LLM work asked whether models could answer medical questions. The more important question is whether models can manage clinical processes under uncertainty.

That shift changes the product requirements.

A process engine needs:

state tracking;
decision routing;
cost-aware planning;
escalation logic;
audit trails;
specialty-specific validation;
accountability mapping;
integration with real clinical systems.

DxDirector-7B is not the final version of that stack. It is an unusually clear research prototype of what that stack may look like.

The strongest part of the paper is not the claim that a small model beats larger ones. Benchmark rankings change. The stronger claim is architectural: clinical value may come less from scaling the medical chatbot and more from training models to direct the flow of work, spend human effort carefully, and leave behind evidence that a hospital can actually inspect.

That is the useful provocation. Not “AI replaces doctors.” Not “bigger models solve diagnosis.” The better line is this: once AI can decide the next diagnostic step, the scarce resource is no longer just medical knowledge. It is control over the workflow.

And in healthcare, whoever controls the workflow controls the economics.

Cognaptus: Automate the Present, Incubate the Future.

¹ Shicheng Xu, Xin Huang, Zihao Wei, Liang Pang, Huawei Shen, and Xueqi Cheng, “Reverse Physician-AI Relationship: Full-process Clinical Diagnosis Driven by a Large Language Model,” arXiv:2508.10492, 2025, https://arxiv.org/abs/2508.10492.
