Doctor, Interrupted: How Multi-Agent AI Revives the Lost Art of Pre‑Consultation

TL;DR for operators

This paper is best read as a workflow paper, not a miracle-doctor paper. It shows that pre-consultation AI becomes more useful when it stops behaving like a polite symptom box and starts behaving like an intake coordinator with a checklist, memory, and a sense of unfinished business.

The system decomposes pre-consultation into triage, history of present illness, past history, and chief complaint generation. A Controller agent decides what still needs to be asked. A Monitor agent checks whether subtasks are complete. A Prompter and Inquirer convert those gaps into the next clinical question. This is less theatrical than “AI doctor,” which is precisely why it matters.

The reported results are encouraging but bounded. The framework achieved 87.0% primary department triage accuracy, 80.5% secondary department accuracy, and a 98.2% task completion rate under agent-driven scheduling, compared with 93.1% under a default sequential order. Physicians rated the generated documentation highly: 4.56 for chief complaint, 4.48 for history of present illness, and 4.69 for past history on a 5-point scale.¹

For hospitals, insurers, telehealth platforms, and regulated-service operators, the business implication is not “replace the clinician.” Please do not send that memo. The implication is that high-quality intake is an orchestration problem: identify what information is missing, ask for it efficiently, preserve context, and hand a cleaner case summary to the professional who remains responsible.

The bottleneck is not the doctor’s intelligence. It is the missing first ten minutes.

A familiar clinic scene: the patient sits down, the doctor opens the record, the clock starts eating the room, and the first two minutes vanish into basic reconstruction. When did it start? Where exactly is the pain? Any fever? Any allergies? Previous surgery? Medication? The patient answers in fragments. The doctor assembles the story while also thinking about triage, risk, documentation, and the next patient already waiting outside. Medicine is full of sophisticated tools, but the opening act still often resembles archaeology performed under fluorescent lighting.

The pressure is measurable. A systematic review of primary care consultation length across 67 countries found enormous variation, with 18 countries representing around half the world’s population averaging five minutes or less per primary care visit.² That is not a clinical encounter; it is a calendar event with a stethoscope.

Pre-consultation is supposed to help. In theory, it collects the patient’s story before the formal visit, routes the case to the right department, and reduces the amount of basic fact-finding a clinician must perform live. In practice, many digital intake tools still behave like forms wearing a conversational costume. They ask what they were scripted to ask. They stop when the patient gives an incomplete answer. They rarely know that a missing allergy history is not equivalent to a negative allergy history. A blank field is not a reassuring clinical finding, although software has spent decades pretending otherwise.
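The blank-field problem can be made concrete with a small data structure. The sketch below is an illustration, not something taken from the paper: `Status`, `HistoryField`, and `summarize` are hypothetical names, and the four states are an assumption about what an intake record would need to distinguish.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    NOT_ASKED = "not_asked"    # the system never raised the topic
    UNANSWERED = "unanswered"  # asked, but no usable answer came back
    DENIED = "denied"          # patient explicitly reported none
    REPORTED = "reported"      # patient reported one or more items


@dataclass
class HistoryField:
    name: str
    status: Status = Status.NOT_ASKED
    details: str = ""

    def is_resolved(self) -> bool:
        # Only an explicit answer, positive or negative, closes the field.
        return self.status in (Status.DENIED, Status.REPORTED)


def summarize(item: HistoryField) -> str:
    """Render the field so that a gap never reads as a negative finding."""
    if item.status is Status.NOT_ASKED:
        return f"{item.name}: not yet asked"
    if item.status is Status.UNANSWERED:
        return f"{item.name}: asked, no usable answer"
    if item.status is Status.DENIED:
        return f"{item.name}: patient reports none"
    return f"{item.name}: {item.details}"
```

A default-constructed `HistoryField("allergy")` summarizes as "allergy: not yet asked", which is a different clinical statement from "patient reports none".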

The paper behind this article, From Passive to Proactive: A Hierarchical Multi-Agent Framework for Automated Medical Pre-Consultation, tries to solve that specific failure mode.¹ Its central move is simple: turn pre-consultation from passive collection into active inquiry.

Pre-consultation fails when the system waits too politely

The likely misconception is that this is another “medical chatbot” paper. It is not, or at least that is not the part worth paying for. The useful distinction is between conversation and control.

A conversational interface can ask questions. A controlled intake system knows which information remains clinically unresolved, which question should come next, and when to stop asking. That distinction is small in product demos and large in operations.

Patient-facing pre-consultation chatbots already have evidence of acceptability. A CHI 2024 study with 33 walk-in clinic patients found that patients could respond positively to both an LLM-powered pre-consultation agent and a Wizard-of-Oz version simulated by medical professionals, but perceived thoroughness and sincerity depended on follow-up questions, empathy, and expectation-setting.³ In other words, patients may tolerate the interface. The harder question is whether the system asks the clinically useful next question without becoming either lazy or exhausting.

That is where orchestration matters. The clinical intake problem is not merely “generate a plausible response.” It is closer to managing a structured investigation under uncertainty. A patient may mention abdominal pain but omit onset, duration, aggravating factors, fever, stool changes, pregnancy status, medication use, surgical history, or allergies. Some omissions matter more than others. Some become urgent only after another answer. The intake system therefore needs a live map of missing information.

A single LLM prompt can imitate this for a while. Then the dialogue lengthens, context becomes messy, and the model starts doing what models do best: sounding coherent while quietly losing the plot. Delightful, in the way a filing cabinet fire is warm.

The paper’s real contribution is the scheduler, not the chatbot

The proposed framework uses eight agents, but the number is less important than the division of labour. The architecture separates the work of updating patient information, assessing completion, deciding the next subtask, generating inquiry strategy, asking the question, performing triage, simulating the patient, and evaluating outputs.

The clinical workflow is decomposed into four primary tasks:

Task	Operational role	Subtasks
Triage	Identify primary and secondary department	2
History of Present Illness	Collect onset, symptom characteristics, progression, associated symptoms, treatment history, general condition	6
Past History	Collect disease, immunisation, surgery/trauma, transfusion, and allergy history	5
Chief Complaint	Generate a concise clinical summary	Integrative

The useful mechanism is the loop:

Patient answer
    ↓
Recipient updates the evolving record
    ↓
Monitor scores which subtasks remain incomplete
    ↓
Controller selects the next highest-priority gap
    ↓
Prompter and Inquirer turn that gap into a clinical question
    ↓
The cycle repeats until intake is complete enough

This is not just modularity for architectural neatness. The paper’s design separates three problems that are often collapsed into one prompt:

Information state — what has the patient actually said?
Completion state — what clinically required information is still missing?
Question policy — what should be asked next to reduce uncertainty efficiently?
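A minimal sketch of that separation, assuming plain Python functions in place of LLM-backed agents. The function names echo the paper's agent roles, but the subtask names, the stub scoring, and the priority order are hypothetical, and the Prompter is folded into the Inquirer for brevity.

```python
from typing import Callable, Dict, Optional

THRESHOLD = 0.85  # completion threshold reported in the paper

# Hypothetical subtask names, listed in priority order for this sketch.
SUBTASKS = ["onset", "symptom_characteristics", "allergy_history"]

Record = Dict[str, str]


def recipient(record: Record, subtask: str, answer: str) -> Record:
    """Information state: file the patient's answer in the evolving record.

    Simplification: the answer is stored under the subtask that was asked
    about. A real Recipient would parse free text with an LLM.
    """
    updated = dict(record)
    if answer.strip():
        updated[subtask] = answer.strip()
    return updated


def monitor(record: Record) -> Dict[str, float]:
    """Completion state: score every subtask from 0 to 1.

    Stub scoring: 1.0 if anything was recorded, otherwise 0.0.
    """
    return {name: (1.0 if record.get(name) else 0.0) for name in SUBTASKS}


def controller(scores: Dict[str, float]) -> Optional[str]:
    """Question policy, part 1: pick the highest-priority pending subtask."""
    for name in SUBTASKS:
        if scores[name] <= THRESHOLD:  # only scores above the threshold are done
            return name
    return None


def inquirer(subtask: str) -> str:
    """Question policy, part 2: turn the gap into a question."""
    return f"Could you tell me about your {subtask.replace('_', ' ')}?"


def run_intake(ask_patient: Callable[[str], str], max_rounds: int = 10) -> Record:
    record: Record = {}
    for _ in range(max_rounds):
        target = controller(monitor(record))
        if target is None:  # nothing pending: intake is complete enough
            break
        record = recipient(record, target, ask_patient(inquirer(target)))
    return record
```

Passing a scripted patient such as `lambda q: next(replies)` shows the point of the design: an empty first answer does not advance the loop, because the Monitor still scores that subtask as incomplete and the Controller asks again.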

That separation is the business-relevant part. Enterprises do not fail at automation because they cannot generate text. They fail because generated text is not anchored to task state. The same pattern appears in legal intake, insurance claims, technical support, financial onboarding, procurement review, and compliance screening. The interface chats; the workflow shrugs. This paper is interesting because it gives the shrug a manager.

The 0.85 threshold is a governance knob, not a magic number

The Monitor agent evaluates subtasks using clinical semantic validity and task completeness, scoring each from 0 to 1. Subtasks that exceed a threshold of 0.85 are considered complete; those below remain pending. The Controller then uses the pending-task state to decide what to ask next.

That threshold is easy to overlook. It should not be.

A lower threshold risks premature closure: the system stops asking because an answer sounded good enough. A higher threshold risks over-questioning: the patient is interrogated into submission because the system wants every field polished to academic shine. The paper frames this as a balance between information integrity and dialogue efficiency. Operationally, it is a governance parameter.
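The trade-off is easy to see in a toy example. The article does not say how the paper combines the two scoring dimensions into one number, so the sketch below assumes the weaker dimension decides (`min`); that rule, the `SubtaskScore` class, and the example scores are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class SubtaskScore:
    validity: float      # clinical semantic validity, 0 to 1
    completeness: float  # task completeness, 0 to 1

    def combined(self) -> float:
        # Assumption: the weaker dimension decides. The paper may combine
        # the two scores differently.
        return min(self.validity, self.completeness)


def pending(scores: Dict[str, SubtaskScore], threshold: float = 0.85) -> List[str]:
    """Return subtasks whose combined score does not exceed the threshold."""
    return [name for name, s in scores.items() if s.combined() <= threshold]


# Made-up scores for one partially completed intake.
example = {
    "onset": SubtaskScore(validity=0.95, completeness=0.90),
    "allergy_history": SubtaskScore(validity=0.90, completeness=0.80),
    "surgery_history": SubtaskScore(validity=0.70, completeness=0.60),
}
```

With these scores, a threshold of 0.75 leaves only `surgery_history` pending, 0.85 adds `allergy_history`, and 0.95 reopens all three. The answers are identical in each case; only the definition of "done" changed.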

In healthcare, that parameter should probably vary by specialty, setting, and risk level. Allergy history in routine dermatology intake is not the same as allergy history before a surgical referral. Psychiatric symptoms require different questioning density from ophthalmological symptoms. The paper reports large department-level variation in triage accuracy, with ophthalmology performing far better than psychiatry. That is not a footnote; it is a deployment warning wearing a lab coat.

The wider lesson is that agentic systems need adjustable completion criteria. “Done” is not a universal state. It is a risk decision.
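One way to treat "done" as a risk decision is to make the threshold a lookup rather than a constant. This is a sketch under stated assumptions: the paper reports a single 0.85 threshold, and the override table, its keys, and its values are invented here purely for illustration.

```python
from typing import Dict, Tuple

DEFAULT_THRESHOLD = 0.85  # the single value reported in the paper

# Hypothetical overrides keyed by (setting, subtask). The numbers are
# illustrative and would need to come from clinical governance.
OVERRIDES: Dict[Tuple[str, str], float] = {
    ("surgical_referral", "allergy_history"): 0.95,
    ("routine_dermatology", "allergy_history"): 0.80,
}


def threshold_for(setting: str, subtask: str) -> float:
    """Look up the completion threshold, falling back to the default."""
    return OVERRIDES.get((setting, subtask), DEFAULT_THRESHOLD)


def is_done(score: float, setting: str, subtask: str) -> bool:
    """A subtask is complete only when its score exceeds the threshold."""
    return score > threshold_for(setting, subtask)
```

Under this table, an allergy-history score of 0.9 closes the subtask in the routine setting but keeps it open before a surgical referral, which is the behaviour the paragraph above argues for.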

The evidence supports better intake completion, not autonomous clinical authority

The results are strongest where the framework is closest to its design goal: structured information collection and case documentation.

Result	What the paper directly shows	Business interpretation	Boundary
Triage improves across iterations	Primary department accuracy rises from 83.0% to 87.0%; secondary department accuracy rises from 75.4% to 80.5%	Iterative questioning can refine routing decisions	Accuracy remains uneven across specialties
Agent-driven scheduling improves completion	98.2% task completion versus 93.1% under default sequential ordering	Dynamic task selection is better than a fixed checklist when answers arrive unpredictably	Completion is not the same as clinical correctness
Physician ratings are high	Mean scores: 4.56 for chief complaint, 4.48 for HPI, 4.69 for past history	Generated notes may be useful for clinician handoff	Ratings were based on sampled cases, not live deployment
Cross-model use is feasible	The framework works across GPT-OSS 20B, Qwen3-8B, and Phi4-14B without task-specific fine-tuning	Architecture can reduce dependence on a single vendor model	Model choice still affects unfinished cases and dialogue efficiency
Dialogue length remains bounded	HPI completes in about 12.7 rounds and past history in about 16.9 rounds	Intake can be thorough without becoming endless	Patient tolerance in real settings still needs testing

The comparison between agent-driven scheduling and default ordering is especially important. A fixed intake script is attractive because it is auditable and easy to explain. It is also brittle. Real patients do not answer in database order. They jump, forget, over-explain, under-explain, and occasionally treat “duration” as a philosophical category. The agent-driven scheduler performs better because it reacts to the actual information state rather than marching through a prewritten list.
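A toy comparison shows why reacting to state helps. It does not reproduce the paper's experiment: `ScriptedPatient`, the field names, and both policies are hypothetical, and the patient is rigged to over-answer one question and dodge another.

```python
from typing import Dict, List, Tuple

FIELDS: List[str] = ["onset", "duration", "fever", "allergy"]


class ScriptedPatient:
    """Toy patient: over-answers onset, dodges duration the first time."""

    def __init__(self) -> None:
        self.times_asked: Dict[str, int] = {}

    def answer(self, field: str) -> Dict[str, str]:
        n = self.times_asked.get(field, 0) + 1
        self.times_asked[field] = n
        if field == "onset":
            return {"onset": "3 days ago", "fever": "low-grade since yesterday"}
        if field == "duration":
            return {} if n == 1 else {"duration": "constant"}
        if field == "fever":
            return {"fever": "low-grade since yesterday"}
        return {"allergy": "none known"}


def fixed_order(patient: ScriptedPatient) -> Tuple[Dict[str, str], int]:
    """Ask every field once, in script order, regardless of what is known."""
    record: Dict[str, str] = {}
    for field in FIELDS:
        record.update(patient.answer(field))
    return record, len(FIELDS)


def state_driven(patient: ScriptedPatient, budget: int) -> Tuple[Dict[str, str], int]:
    """Ask about the first field still missing, until done or out of budget."""
    record: Dict[str, str] = {}
    asked = 0
    while asked < budget:
        missing = [f for f in FIELDS if f not in record]
        if not missing:
            break
        record.update(patient.answer(missing[0]))
        asked += 1
    return record, asked
```

Both policies spend four questions on this patient. The fixed script wastes one on fever, which was already volunteered, and never returns to duration; the state-driven policy skips fever and asks about duration twice, finishing with a complete record.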

That does not make it a doctor. It makes it a better intake clerk. In healthcare technology, this is praise, not an insult.

Multi-agent medicine is moving from debate to workflow

The paper fits into a broader movement in medical AI from single-answer generation toward structured collaboration. Earlier systems such as MedAgents used role-playing LLM agents to simulate multidisciplinary consultation and improve zero-shot medical reasoning across benchmark tasks.⁴ MDAgents extended this idea by assigning solo or group collaboration structures depending on medical task complexity, reporting gains across multiple medical knowledge and diagnosis benchmarks.⁵

Those systems mainly ask: can multiple agents reason better than one model?

The pre-consultation paper asks a different question: can multiple agents collect better input before reasoning begins?

That shift matters. Medical AI has often been evaluated at the point of answer: diagnosis, recommendation, classification, report. But many clinical failures begin earlier, when the input story is incomplete. If the intake misses duration, medication, pregnancy status, allergy, or red-flag symptoms, the downstream reasoning engine is already working inside a badly furnished room.

Related work on medical follow-up question generation makes the same point from another angle. FollowupQ, a multi-agent framework for asynchronous patient-provider conversations, reported that structured follow-up question generation reduced requisite provider follow-up communications by 34% and improved performance over baselines on real and synthetic data.⁶ The common theme is not that agents are magical. It is that medical information-seeking is too multidimensional for one-shot prompting.

For operators, the value is cleaner handoff

What the paper directly shows is feasibility: a hierarchical multi-agent framework can complete structured pre-consultation tasks, improve completion over default ordering, and generate documentation that physicians rate highly in a retrospective evaluation.

What Cognaptus infers for business use is narrower but more actionable: the first defensible ROI for this type of system is not diagnosis automation. It is handoff quality.

A useful deployment would measure outcomes such as:

Operational metric	Why it matters
Percentage of visits with complete HPI before clinician entry	Reduces live reconstruction time
Missing critical-history fields per encounter	Tracks safety-relevant omissions
Clinician edits to generated summaries	Measures documentation usefulness
Re-routing rate after clinician review	Tests triage reliability
Patient abandonment during intake	Detects over-questioning
Time from intake start to clinician-ready summary	Captures workflow efficiency
Escalation rate for uncertain or high-risk cases	Shows whether the system knows when to stop pretending

This is the unglamorous procurement test: does the clinician receive a cleaner case summary, sooner, with fewer missing fields, without annoying the patient into closing the app? If yes, the system has value. If no, it is just a very expensive waiting-room questionnaire with better grammar.
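Several of the metrics in the table above reduce to simple aggregates over encounter logs. The sketch below assumes a hypothetical `Encounter` schema; the field names are not from the paper, and a real deployment would derive them from its own records.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Encounter:
    hpi_complete: bool            # HPI complete before clinician entry
    missing_critical_fields: int  # safety-relevant omissions
    rerouted_after_review: bool   # clinician changed the department
    abandoned: bool               # patient quit during intake


def handoff_metrics(encounters: List[Encounter]) -> Dict[str, float]:
    """Aggregate handoff-quality metrics over a batch of encounters."""
    n = len(encounters)
    if n == 0:
        raise ValueError("no encounters to summarize")
    finished = [e for e in encounters if not e.abandoned]
    return {
        "hpi_complete_rate": sum(e.hpi_complete for e in encounters) / n,
        "mean_missing_critical_fields": sum(
            e.missing_critical_fields for e in encounters
        ) / n,
        # Re-routing is only meaningful for cases a clinician actually saw.
        "rerouting_rate": (
            sum(e.rerouted_after_review for e in finished) / len(finished)
            if finished
            else 0.0
        ),
        "abandonment_rate": sum(e.abandoned for e in encounters) / n,
    }
```

Computing the re-routing rate over finished encounters only is a design choice in this sketch: an abandoned intake never reaches clinician review, so counting it would flatter the triage numbers.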

Model-agnostic does not mean model-irrelevant

The paper’s cross-model results are useful but should be read carefully. The architecture works across different foundation models without task-specific fine-tuning. That supports the claim that orchestration contributes meaningfully to performance. It does not mean any model will do.

The authors observe differences in unfinished cases and efficiency across GPT-OSS 20B, Qwen3-8B, and Phi4-14B. Qwen3-8B appears efficient in Chinese-language task execution, while GPT-OSS 20B is more stable in its unfinished-case counts. That is exactly what one would expect: architecture controls workflow, but the model still supplies language understanding, reasoning, and question generation.

For buyers, the practical implication is a two-layer evaluation. First, test the orchestration layer: does it preserve task state, detect missing information, and select sensible next questions? Second, test the model layer under local language, specialty, patient literacy, and data-governance constraints. Vendor demos tend to blend these together because blended things are harder to audit. Convenient, that.

Where the evidence stops

The main limitation is not that the system is imperfect. All systems are. The limitation is that the evaluation environment is not yet the clinical environment.

The dataset comes from 1,372 validated, de-identified electronic health records from a Chinese medical platform. Consultations are simulated using a Virtual Patient agent derived from those records. That design allows controlled evaluation, but it does not fully capture real patient behaviour: hesitation, misunderstanding, anxiety, vague recall, non-compliance, or the proud human tradition of answering the question adjacent to the one asked.

The physician evaluation is valuable but limited. Forty samples were randomly selected, and 18 physicians rated generated documentation. That supports clinical plausibility of the notes. It does not prove improved patient outcomes, reduced clinician workload in a live clinic, or superiority over human-led pre-consultation on identical cases.

Safety is also mostly prompt-based. The paper instructs agents not to fabricate, not to ask leading questions, and to acknowledge uncertainty. Those are necessary constraints, but prompt-level rules are not the same as enforced safety architecture. Before deployment, a serious system would need escalation policies, uncertainty flags, audit logs, red-flag detection, clinician review gates, and specialty-specific validation.
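The difference between a prompt-level rule and an enforced check can be sketched in a few lines. Nothing here comes from the paper: `safety_gate`, the confidence input, and the phrase list are hypothetical, and the phrases are placeholders rather than a clinical red-flag list.

```python
from dataclasses import dataclass
from typing import List

# Illustrative phrases only. A real list would come from clinical governance.
RED_FLAG_PHRASES = ("chest pain", "difficulty breathing", "fainted")


@dataclass
class GateDecision:
    action: str          # "continue" or "escalate"
    reasons: List[str]   # kept so the decision can be written to an audit log


def safety_gate(
    patient_text: str,
    confidence: float,
    min_confidence: float = 0.5,
) -> GateDecision:
    """Deterministic check that runs outside the LLM on every patient turn.

    `confidence` is a hypothetical 0-to-1 score from the orchestration layer.
    """
    text = patient_text.lower()
    reasons = [f"red flag: {p}" for p in RED_FLAG_PHRASES if p in text]
    if confidence < min_confidence:
        reasons.append("low confidence")
    return GateDecision("escalate" if reasons else "continue", reasons)
```

Keyword matching is crude and will miss paraphrases, so this is a floor, not a detector. The point of the sketch is structural: the gate runs whether or not the model remembers its instructions, and every escalation carries a reason that can be audited.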

Finally, the specialty variation matters. A system that performs well in symptom domains with clear routing signals may struggle where presentations overlap across departments. Psychiatry versus neurology is the obvious example from the paper. The business mistake would be to average across departments and call it ready. Averages are where edge cases go to hide.
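A short worked example shows how a healthy average can conceal a weak department. The counts below are invented for illustration and are not the paper's per-department results; `accuracy_report` and the 0.8 floor are likewise hypothetical.

```python
from typing import Dict, List, Tuple


def accuracy_report(
    results: Dict[str, Tuple[int, int]],
    floor: float,
) -> Tuple[float, Dict[str, float], List[str]]:
    """Summarize triage accuracy.

    `results` maps department -> (correct, total). Returns the pooled
    accuracy, the per-department accuracy, and departments below `floor`.
    """
    total_correct = sum(correct for correct, _ in results.values())
    total_cases = sum(total for _, total in results.values())
    overall = total_correct / total_cases
    per_dept = {d: correct / total for d, (correct, total) in results.items()}
    below = sorted(d for d, acc in per_dept.items() if acc < floor)
    return overall, per_dept, below


# Made-up counts: two high-volume departments and one small, hard one.
toy_results = {
    "ophthalmology": (96, 100),
    "gastroenterology": (90, 100),
    "psychiatry": (12, 20),
}
```

Pooled accuracy here is 198 of 220, or 90%, while psychiatry sits at 60%. A go-live rule that checks only the pooled figure would pass; one that applies a per-department floor of 0.8 would flag psychiatry.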

The build lesson: automate inquiry before automating judgement

The seductive version of medical AI is the diagnostic oracle. The practical version is the tireless intake coordinator that asks the missing question, updates the record, and hands the clinician a usable summary. This paper belongs to the second category. That makes it less cinematic and more deployable.

Its deeper contribution is architectural: intelligent pre-consultation depends on dynamic task orchestration. The system must know what it is trying to collect, what it already has, what remains unresolved, and how aggressively to pursue missing information. Once that control loop exists, the LLM becomes a component inside a workflow rather than the workflow itself.

That distinction will matter beyond clinics. Any business process that begins with messy human input—claims, onboarding, compliance, support, lending, procurement, case management—faces the same intake problem. The first productivity gain is rarely a fully autonomous expert. It is a system that stops asking generic questions and starts managing uncertainty.

Medicine, being medicine, simply makes the stakes harder to ignore.

Cognaptus: Automate the Present, Incubate the Future.

¹ ChengZhang Yu, YingRu He, Hongyan Cheng, Nuo Cheng, Zhixing Liu, Dongxu Mu, Zhangrui Shen, Yang Gao, and Zhanpeng Jin, “From Passive to Proactive: A Hierarchical Multi-Agent Framework for Automated Medical Pre-Consultation,” arXiv:2511.01445, 2025. https://arxiv.org/abs/2511.01445
² Greg Irving et al., “International variations in primary care physician consultation time: a systematic review of 67 countries,” BMJ Open, 2017. https://bmjopen.bmj.com/content/7/10/e017902
³ Brenna Li et al., “Beyond the Waiting Room: Patient’s Perspectives on the Conversational Nuances of Pre-Consultation Chatbots,” CHI 2024. https://www.microsoft.com/en-us/research/publication/beyond-the-waiting-room-patients-perspectives-on-the-conversational-nuances-of-pre-consultation-chatbots/
⁴ Xiangru Tang et al., “MedAgents: Large Language Models as Collaborators for Zero-shot Medical Reasoning,” arXiv:2311.10537, 2023. https://arxiv.org/abs/2311.10537
⁵ Yubin Kim et al., “MDAgents: An Adaptive Collaboration of LLMs for Medical Decision-Making,” arXiv:2404.15155, 2024. https://arxiv.org/abs/2404.15155
⁶ Joseph Gatto, Parker Seegmiller, Timothy Burdick, Inas S. Khayal, Sarah DeLozier, and Sarah M. Preum, “Follow-up Question Generation For Enhanced Patient-Provider Conversations,” arXiv:2503.17509, 2025. https://arxiv.org/abs/2503.17509
