Doctor GPT, But Make It Explainable

Triage begins with messy language.

A patient does not usually arrive as a clean feature vector. They arrive with “I feel tired,” “my stomach is strange,” “I have fever but not always,” or the classic: “I searched online and now I am either fine or dying.” Traditional diagnostic models are not built for this level of human poetry. They prefer structured fields, stable vocabularies, and the fantasy that symptoms behave like dropdown menus.

That is why the interesting part of “Towards Explainable Conversational AI for Early Diagnosis with Large Language Models” is not simply that it uses GPT-4o for diagnosis.¹ Everyone and their nearest hospital procurement committee has imagined “Doctor GPT” by now. The more important contribution is architectural: the paper wraps the language model inside a constrained diagnostic workflow with retrieval, symptom tracking, adaptive questioning, test-based confirmation, and disease-symptom attribution.

In other words, the paper is not asking whether an LLM can sound medically intelligent. That bar is both too low and too dangerous. It asks whether a conversational system can collect enough structured evidence, keep that evidence visible, and rank likely conditions without becoming a black-box oracle in a white coat.

That distinction matters. The headline metric—100% Top-3 accuracy—sounds like the sort of number that makes slide decks glow. But the safer reading is narrower and more useful: under a curated 14-disease setting, using synthetic dialogues and a structured knowledge base, the system was very good at keeping the right diagnosis inside a short ranked list. That is not autonomous medicine. It is the beginning of a more disciplined intake and triage layer.

The paper’s value is not “the AI became a doctor.” The value is that the chatbot was forced to behave less like a chatbot.

The useful system is the machinery around the model

The proposed system combines five elements that are often discussed separately but rarely treated as one operational chain.

First, it uses GPT-4o through Azure OpenAI as the language interface. This handles the part conventional models struggle with: interpreting free-text patient descriptions, extracting symptoms, normalizing loose phrases into medical terms, and maintaining a multi-turn conversation.

Second, it uses retrieval-augmented generation. A medical knowledge base is built from public medical sources and converted into a searchable vector store. The system retrieves relevant disease information before generating diagnostic responses, so the LLM is not left to reason from its pretraining alone. In clinical AI, this is less a nice-to-have than a seatbelt. Seatbelts do not make driving safe by magic, but driving without them is a personality defect.

Third, the system tracks confirmed and denied symptoms. This is important because diagnosis is not only about what the patient says yes to. A denied symptom should reduce the likelihood of diseases that depend on it. The paper’s symptom tracker records confirmed symptoms, rejected symptoms, and previously asked questions, then updates disease scores dynamically.

Fourth, the chatbot asks adaptive follow-up questions. It does not simply march through a fixed questionnaire. It ranks current disease candidates, identifies unasked symptoms that could distinguish among them, and asks questions that should reduce ambiguity. The workflow continues until enough evidence has been collected—especially around the minimum confirmed symptom requirement—or until the maximum question budget is reached.

Fifth, the system moves from symptom-based ranking into test-based confirmation. Once enough symptom information is gathered, the chatbot asks about relevant test results and compares them against stored thresholds. This creates a second diagnostic phase: not just “what symptoms fit?” but “what evidence confirms or rules out the candidate diseases?”
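The third and fourth elements are easy to sketch together. What follows is a toy illustration, not the paper’s code: the three-disease knowledge base, the `SymptomTracker` class, the plus-one/minus-one scoring rule, and the question-picking heuristic are all assumptions made for readability.

```python
from dataclasses import dataclass, field

# Hypothetical toy knowledge base: disease -> characteristic symptoms.
KNOWLEDGE_BASE = {
    "malaria": {"fever", "chills", "sweating", "headache"},
    "dengue fever": {"fever", "headache", "rash", "joint pain"},
    "migraine": {"headache", "nausea", "light sensitivity"},
}


@dataclass
class SymptomTracker:
    confirmed: set = field(default_factory=set)
    denied: set = field(default_factory=set)
    asked: set = field(default_factory=set)

    def record(self, symptom: str, present: bool) -> None:
        """Store the answer and remember that the question was asked."""
        self.asked.add(symptom)
        (self.confirmed if present else self.denied).add(symptom)

    def scores(self) -> dict:
        # Assumed rule: +1 per confirmed symptom, -1 per denied symptom.
        return {
            disease: len(symptoms & self.confirmed) - len(symptoms & self.denied)
            for disease, symptoms in KNOWLEDGE_BASE.items()
        }

    def ranked(self) -> list:
        current = self.scores()
        # Highest score first; ties broken alphabetically for determinism.
        return sorted(current, key=lambda d: (-current[d], d))

    def next_question(self, top_k: int = 2):
        """Pick an unasked symptom that separates the current top candidates."""
        top = self.ranked()[:top_k]
        unasked = set().union(*(KNOWLEDGE_BASE[d] for d in top)) - self.asked

        def separates(symptom: str) -> bool:
            # True if the symptom belongs to some, but not all, top candidates.
            owners = sum(symptom in KNOWLEDGE_BASE[d] for d in top)
            return 0 < owners < len(top)

        useful = [s for s in sorted(unasked) if separates(s)]
        return useful[0] if useful else None


tracker = SymptomTracker()
tracker.record("fever", True)
tracker.record("headache", True)
question = tracker.next_question()   # a symptom only one top candidate has
tracker.record(question, False)      # the denial lowers that candidate's score
```

The point of the sketch is the denied set: a "no" is stored and subtracted, so it changes the ranking instead of disappearing from the conversation.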

The result is a two-stage diagnostic pipeline:

Free-text patient complaint
        ↓
Symptom extraction and normalization
        ↓
Candidate disease ranking
        ↓
Adaptive follow-up questions
        ↓
Confirmed and denied symptom tracking
        ↓
Test-based confirmation or elimination
        ↓
Evidence-linked diagnosis and referral suggestion

That pipeline is the paper’s real object of study. GPT-4o is the conversational engine, but the surrounding structure decides what counts as evidence, when enough evidence has been collected, which symptoms matter, and how results should be explained.
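The second stage of that pipeline, test-based confirmation, can be sketched in a few lines. This is a hedged illustration only: the rule table, the `apply_tests` function, and the disease and marker names are hypothetical placeholders, and the numbers are not clinical reference values.

```python
import operator

# Hypothetical rules: disease -> (test name, comparison, threshold).
# Names and numbers are placeholders, not clinical reference values.
TEST_RULES = {
    "disease_a": ("marker_x", operator.lt, 12.0),
    "disease_b": ("marker_y", operator.ge, 1.0),
}


def apply_tests(candidates, results):
    """Split candidates into confirmed, eliminated, and still-pending."""
    outcome = {"confirmed": [], "eliminated": [], "pending": []}
    for disease in candidates:
        rule = TEST_RULES.get(disease)
        if rule is None or rule[0] not in results:
            # No stored rule, or the patient has no result for that test yet.
            outcome["pending"].append(disease)
            continue
        test_name, compare, threshold = rule
        verdict = "confirmed" if compare(results[test_name], threshold) else "eliminated"
        outcome[verdict].append(disease)
    return outcome


report = apply_tests(
    ["disease_a", "disease_b", "disease_c"],
    {"marker_x": 10.5, "marker_y": 0.2},
)
```

Keeping a separate "pending" bucket matters in this design: a missing test result is not evidence against a disease, so the candidate stays visible for the reviewing clinician.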

Explainability appears before the final answer, not after the mistake

Many AI explainability systems behave like apology generators. First the model gives an answer. Then a separate module tries to explain why the answer might have happened. This is not useless, but in high-stakes domains it can become decorative: a polished explanation attached to a decision nobody can audit.

This paper takes a more operational approach. Explainability is built into the diagnostic process itself.

The system explains why it asks a follow-up question. It updates the diagnostic state as new symptoms are confirmed or rejected. It uses test-based reasoning to accept or eliminate candidate diseases. It links final diagnoses to symptoms, test results, and risk factors. It also produces a disease-symptom attribution map showing how each reported symptom contributes to candidate diseases.

That is a better direction for clinical support tools because the user does not only need an answer. The user needs to know what the answer is resting on.

Explainability mechanism	Operational role	Business relevance
Question justification	Shows why the system asks a symptom question	Reduces the feeling of random chatbot interrogation
Real-time diagnostic state	Updates candidate diseases as evidence changes	Creates an auditable intake trail
Test-based confirmation	Uses test thresholds to confirm or reject candidates	Makes the workflow closer to clinical triage
Evidence-linked final diagnosis	Connects conclusion to symptoms, tests, and risk factors	Supports clinician review rather than blind acceptance
Disease-symptom attribution map	Shows how symptoms affect disease scores	Helps identify whether the model is leaning on sensible evidence

The important shift is from answer explainability to process explainability. A final “because you have fever and fatigue” is weak. A visible chain of symptom collection, rejection, ranking, and test confirmation is stronger. Not perfect, but stronger.
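A disease-symptom attribution map of the kind described above can be approximated as a table of signed contributions. The sketch below assumes fixed per-disease symptom weights and a simple rule (confirmed adds the weight, denied subtracts it); the weights, the `attribution_map` function, and the rule are illustrative assumptions, not values or logic taken from the paper.

```python
# Hypothetical fixed weights: disease -> {symptom: weight}.
WEIGHTS = {
    "malaria": {"fever": 0.4, "chills": 0.4, "fatigue": 0.2},
    "anemia": {"fatigue": 0.6, "pale skin": 0.4},
}


def attribution_map(confirmed, denied):
    """Return disease -> {symptom: signed contribution} for reported symptoms."""
    table = {}
    for disease, weights in WEIGHTS.items():
        row = {}
        for symptom, weight in weights.items():
            if symptom in confirmed:
                row[symptom] = weight       # supports this disease
            elif symptom in denied:
                row[symptom] = -weight      # counts against this disease
            # Symptoms never discussed contribute nothing and are left out.
        table[disease] = row
    return table


amap = attribution_map(confirmed={"fever", "fatigue"}, denied={"chills"})
for disease, row in amap.items():
    print(f"{disease}: total={sum(row.values()):+.2f} contributions={row}")
```

A reviewer reading such a table can see at a glance whether a candidate is ranked highly because of distinctive evidence or because of a generic symptom like fatigue.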

For business use, this changes the product category. The system is not best understood as a consumer chatbot that “diagnoses disease.” It is closer to an intelligent intake assistant: a tool that turns messy patient language into a structured, ranked, evidence-linked case file before a clinician, nurse, or telehealth provider reviews it.

That is less glamorous than “AI doctor.” It is also more deployable.

The paper’s main evidence is strong inside a narrow sandbox

The system was evaluated with 5-fold cross-validation. Across folds, it achieved 90.3% Top-1 accuracy, 100% Top-3 accuracy, 85.8% precision, 88.0% recall, and 86.1% F1-score. Top-1 accuracy ranged from 75.0% to 100.0% across folds, while Top-3 accuracy stayed at 100.0% in every fold.

Those numbers deserve attention, but they need careful interpretation.

Top-1 accuracy answers: did the system put the correct disease first?

Top-3 accuracy answers: did the correct disease appear somewhere in the top three?

In clinical workflow terms, Top-3 accuracy is not a final diagnosis metric. It is a shortlist metric. A perfect Top-3 score means the system did not omit the true disease from the top three options in this evaluation. That is useful for triage and differential diagnosis support. It does not mean the system can safely act alone.
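The difference between the two metrics is easiest to see in code. The rankings below are invented for illustration and are not data from the paper; `top_k_accuracy` is a hypothetical helper name.

```python
def top_k_accuracy(ranked_predictions, true_labels, k):
    """Fraction of cases whose true label appears in the first k ranked guesses."""
    hits = sum(
        truth in ranked[:k]
        for ranked, truth in zip(ranked_predictions, true_labels)
    )
    return hits / len(true_labels)


# Made-up ranked outputs for four cases.
ranked_predictions = [
    ["malaria", "viral fever", "dengue fever"],
    ["viral fever", "malaria", "hepatitis B"],
    ["malaria", "hepatitis B", "viral fever"],
    ["migraine", "anxiety disorders", "depression"],
]
true_labels = ["malaria", "malaria", "viral fever", "migraine"]

top1 = top_k_accuracy(ranked_predictions, true_labels, k=1)  # 0.5
top3 = top_k_accuracy(ranked_predictions, true_labels, k=3)  # 1.0
```

In this toy set the shortlist never loses the true disease, yet the first guess is wrong half the time. That is the same shape of gap the paper reports for its overlapping fever-type conditions.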

The disease-wise results make this distinction even clearer. Many diseases achieved 100% Top-1 accuracy, including acute diarrheal illness, dengue fever, COVID-19, IBS, migraine, anxiety disorders, tuberculosis, anemia, depression, asthma, and GERD. But three conditions were harder: malaria reached 83.33% Top-1 accuracy, hepatitis B reached 50%, and viral fever reached only 33.33%. All three still reached 100% Top-3 accuracy.

That pattern is not random. The paper explains it through symptom overlap. Viral fever, hepatitis B, and malaria share broad, non-specific symptoms with other infectious or systemic diseases. Fever, fatigue, body aches, and similar complaints do not always point cleanly to one disease. The system can keep the right candidate in the shortlist, but the final ranking becomes sensitive to symptom weights and semantic overlap.

This is the part readers should not skip. The system performs best where symptom clusters are distinctive. It becomes less decisive where symptoms are generic. That is not a failure unique to LLMs. It is the basic problem of diagnosis wearing a neural-network jacket.

The baseline comparison is useful, but not a victory parade

The paper compares the LLM system with traditional machine learning models, including Naive Bayes, Logistic Regression, SVM, Random Forest, and KNN, using TF-IDF and CountVectorizer feature extraction.

The comparison is more nuanced than a simple “LLM beats old ML” story.

The LLM system achieved an accuracy of 0.903. Naive Bayes with TF-IDF achieved 0.881 accuracy and an F1-score of 0.864. The LLM’s F1-score was 0.861, slightly below TF-IDF Naive Bayes, though the LLM also offers conversational interaction and ranked Top-3 output. CountVectorizer-based models performed much worse in several cases, with CountVectorizer Naive Bayes dropping to an F1-score of 0.616 and CountVectorizer Logistic Regression falling dramatically.

So the business interpretation should be precise. The LLM is not valuable simply because it crushes every old classifier on every metric. It does not. Its value is that it combines competitive classification performance with functions conventional classifiers do not naturally provide: free-text dialogue, adaptive questioning, symptom normalization, test interpretation, and ranked diagnostic reasoning.

That is a different kind of advantage.

A static classifier can label a symptom vector. A conversational diagnostic assistant can help create the vector in the first place.

The ablations reveal what the system is really buying

The ablation studies are the most useful part of the paper for anyone thinking about product design. They show which controls actually matter.

The first message is simple: symptom depth matters. In the refined ablation, reducing the minimum confirmed symptom requirement to two dropped Top-1 accuracy to 0.60, while raising it to six restored Top-1 accuracy to 0.88. The baseline with eight confirmed symptoms reached about 0.90. This suggests the system is not magically diagnosing from a few vague complaints. It needs enough structured evidence.

The second message is that question budget matters, but less dramatically. Limiting the maximum number of questions to 10 or 15 reduced performance modestly compared with the 20-question baseline. In practice, this creates a product trade-off: shorter conversations improve user experience, but less evidence weakens diagnosis. Healthcare UX has a cruel sense of humor. The patient wants fewer questions; the model wants more evidence; the clinician wants both.
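The two controls discussed so far, minimum confirmed symptoms and maximum questions, amount to a stopping rule. Here is a minimal sketch of one, using the article’s baseline values (8 confirmed symptoms, 20 questions) as defaults; the function names and the exact order of the checks are assumptions.

```python
def stop_reason(confirmed_count, questions_asked, min_confirmed=8, max_questions=20):
    """Return why the dialogue should stop, or None if it should continue."""
    if confirmed_count >= min_confirmed:
        return "enough_evidence"
    if questions_asked >= max_questions:
        return "budget_exhausted"
    return None


def run_dialogue(answers, min_confirmed=8, max_questions=20):
    """Consume yes/no answers until a stopping condition fires."""
    confirmed = asked = 0
    for present in answers:
        if stop_reason(confirmed, asked, min_confirmed, max_questions):
            break
        asked += 1
        confirmed += int(present)
    return confirmed, asked, stop_reason(confirmed, asked, min_confirmed, max_questions)


# A cooperative patient reaches the evidence requirement quickly.
print(run_dialogue([True] * 30))
# A tight budget ends the dialogue before enough symptoms are confirmed.
print(run_dialogue([True, False] * 30, max_questions=10))
```

The second call is the product trade-off in miniature: the conversation is shorter, and the system goes to ranking with five confirmed symptoms instead of eight.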

The third message is that confidence thresholds did not help much. Changing the confidence threshold from 60 to 80 did not materially affect performance in the reported setting. The paper therefore removes the confidence-threshold and minimum-question settings from the refined ablation. That is a useful implementation lesson: a threshold that does not change behavior is not a control; it is decoration with a number attached.

The fourth message is that similarity threshold tuning is non-linear. A permissive threshold can admit weak matches and inflate false positives. A restrictive threshold can miss valid symptom matches. In the reported ablations, a high similarity threshold of 0.85 produced strong Top-1 accuracy but with greater latency. That is not a universal rule that 0.85 is “best.” It is evidence that semantic matching thresholds are product parameters, not theoretical ornaments.
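The threshold trade-off can be shown with a toy matcher. The three-dimensional "embeddings" below are made up so the arithmetic is visible; a real system would use an embedding model, and `match_symptoms` is a hypothetical helper, not the paper’s function.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


# Made-up vectors for canonical symptom terms.
CANONICAL = {
    "fever": (1.0, 0.0, 0.0),
    "fatigue": (0.0, 1.0, 0.0),
    "nausea": (0.0, 0.0, 1.0),
}


def match_symptoms(phrase_vector, threshold):
    """Return canonical symptoms whose similarity meets the threshold."""
    return sorted(
        name
        for name, vector in CANONICAL.items()
        if cosine(phrase_vector, vector) >= threshold
    )


# Imagined vector for a phrase like "I feel hot and a bit drained".
phrase = (0.9, 0.4, 0.1)

permissive = match_symptoms(phrase, threshold=0.30)  # admits the weaker match too
strict = match_symptoms(phrase, threshold=0.85)      # keeps only the strong match
too_strict = match_symptoms(phrase, threshold=0.95)  # rejects everything
```

The same phrase yields two symptoms, one symptom, or none depending on a single number, which is why the threshold deserves the same testing discipline as any other product parameter.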

The fifth message is that local symptom matching matters more than global disease similarity. In the scoring weight ablation, global-only similarity performed poorly, with Top-1 accuracy around 0.74. Local-only scoring performed much better, around 0.89. Hybrid scoring performed best or near-best, with 70/30 and 50/50 global-local combinations reaching about 0.90 Top-1 accuracy.
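A hybrid score of this kind is usually a weighted blend. The sketch below assumes a linear combination, with "global" read as similarity between the whole complaint and a disease description and "local" as per-symptom matching; that reading, the `hybrid_score` function, and the candidate numbers are assumptions for illustration.

```python
def hybrid_score(global_sim, local_sim, global_weight):
    """Blend disease-level and symptom-level similarity; weights sum to 1."""
    return global_weight * global_sim + (1.0 - global_weight) * local_sim


# Made-up similarities for two candidates: (global, local).
CANDIDATES = {
    "viral fever": (0.80, 0.55),
    "malaria": (0.70, 0.90),
}


def rank(global_weight):
    """Order candidates by hybrid score, best first."""
    scores = {
        disease: hybrid_score(g, l, global_weight)
        for disease, (g, l) in CANDIDATES.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


global_only = rank(1.0)   # broad resemblance wins
local_only = rank(0.0)    # specific symptom matches win
blend_70_30 = rank(0.7)   # 70% global, 30% local
```

With these invented numbers, global-only scoring prefers the candidate with the vaguer overall resemblance, while local-only and the 70/30 blend both prefer the one whose specific symptoms match. The output is a ranking score, not a clinical probability.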

Here is the ablation story in product language:

Test	Likely purpose	What it supports	What it does not prove
Minimum confirmed symptoms	Sensitivity test	The system needs a rich symptom profile, with 6–8 symptoms appearing much stronger than 2–4	More questioning is always acceptable to users
Maximum question limit	UX-performance trade-off test	Shorter conversations can reduce accuracy	A 20-question dialogue is optimal in real clinics
Confidence threshold	Control validation	Nominal confidence settings may not affect behavior	The model’s confidence is clinically calibrated
Similarity threshold	Retrieval/matching sensitivity test	Matching strictness changes accuracy and latency	One threshold generalizes across diseases and languages
Embedding comparison	Implementation sensitivity test	Embedding choice affects latency and stability	One embedding model is universally superior
Global vs. local scoring	Scoring ablation	Fine-grained symptom matching is essential; hybrid scoring helps	Hybrid scores equal clinical probability

This is where the paper becomes more than a demo. It shows that performance comes from evidence collection and scoring discipline, not merely from asking GPT-4o to be medically clever.

The business value is structured triage, not autonomous diagnosis

The practical pathway is straightforward if we resist the urge to overclaim.

The paper directly shows that, within its curated disease set and synthetic dialogue evaluation, a conversational LLM system can extract symptoms, ask adaptive questions, rank disease candidates, use test-based confirmation, and provide evidence-linked explanations with strong Top-3 performance.

Cognaptus would interpret the business value as follows: the near-term use case is not replacing doctors. It is reducing the cost and friction of first-line intake.

In a telehealth platform, the system could collect symptoms before a clinician joins the call. In a rural clinic, it could help non-specialist staff structure cases before referral. In an insurance or employer health program, it could route users toward appropriate next steps while preserving an evidence trail. In a hospital outpatient department, it could make intake forms less useless, which is already a meaningful public service.

The highest-value workflow is not:

Patient → AI diagnosis → treatment

It is:

Patient → AI-guided intake → ranked differential → clinician review → routing / testing / referral

That middle layer is where automation is credible. It converts unstructured conversation into structured evidence. It reduces repetitive questioning. It highlights missing information. It keeps multiple plausible diagnoses visible instead of collapsing too early into one answer. It creates a record that a human can audit.

The ROI logic is therefore operational, not magical:

Operational pain	System capability	Business meaning
Patients describe symptoms inconsistently	Free-text extraction and normalization	Cleaner intake data
Staff ask repetitive screening questions	Adaptive follow-up questioning	Lower intake burden
Early triage misses alternatives	Top-3 ranked differential	Better routing support
Chatbot answers are hard to trust	Symptom attribution and evidence-linked output	More auditable decision support
Medical knowledge changes	RAG-based knowledge base	Easier updating than fully hard-coded systems

This is still hard to deploy. But at least it is the right kind of hard: workflow integration, validation, governance, and knowledge maintenance. Those are difficult problems, not fantasy problems.

The boundaries are not footnotes; they define the product

The paper’s limitations materially affect how the result should be used.

The evaluation used synthetic dialogues, and the authors state that ChatGPT was used to generate synthetic patient cases, with review and editing afterward. Synthetic data can be useful for early testing, but it cannot substitute for real patient interaction. Real patients omit details, contradict themselves, misunderstand questions, exaggerate, underreport, switch languages, and sometimes type like they are fighting the keyboard. A system that works on synthetic dialogues has passed a laboratory checkpoint, not a deployment exam.

The disease scope is also narrow: 14 diseases across infectious diseases, chronic conditions, mental health, and other common conditions. That scope is reasonable for a prototype, but it means the system’s behavior outside that disease list remains uncertain. A triage system must handle unknowns, rare diseases, co-morbidities, medication effects, pregnancy, age-specific risk, emergency red flags, and local epidemiological patterns. A 14-disease sandbox cannot prove that.

The symptom weights are fixed. That matters because symptom severity and relevance can vary by patient, disease stage, population, and context. Fatigue is not the same signal in a young adult after poor sleep, a patient with hepatitis risk, and an older patient with anemia. Fixed weights are useful for interpretability, but they can become brittle.

The knowledge base is curated from public sources. That helps grounding, but it introduces a maintenance problem. Medical guidelines change. Local protocols differ. Test availability differs. A deployed system would need versioned knowledge updates, clinician review, localization, and governance over what sources are allowed into the retrieval system.

Finally, the reported confidence scores should not be treated as clinically calibrated probabilities. The paper uses nonlinear scaling and caps confidence to avoid overconfidence. That is sensible. But avoiding extreme confidence is not the same as proving calibration. A system can be modest and still wrong. Humans have practiced this art for centuries.
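To make the distinction concrete, here is one possible shape for "nonlinear scaling with a cap". The article does not give the paper’s formula, so the saturating exponential, the `steepness` parameter, and the 95-point cap below are all hypothetical.

```python
import math


def display_confidence(raw_score, steepness=1.5, cap=95.0):
    """Map a non-negative raw score to a capped percentage (assumed form)."""
    scaled = 100.0 * (1.0 - math.exp(-steepness * max(raw_score, 0.0)))
    return min(scaled, cap)


for raw in (0.0, 0.5, 1.0, 3.0, 10.0):
    print(raw, round(display_confidence(raw), 1))
```

A function like this guarantees the display never reads 100, and nothing more. Whether cases shown at a given confidence are correct that often is an empirical question that only real outcome data can answer.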

These boundaries do not make the paper unimportant. They make the correct product interpretation narrower: constrained diagnostic support, not autonomous diagnosis; ranked triage, not final medical authority; evidence collection, not clinical replacement.

What a serious deployment would need next

A deployable version of this system would need validation beyond synthetic cases.

First, it would need real-world patient dialogues, including incomplete, noisy, multilingual, and culturally varied symptom descriptions. If the system is meant for low-resource settings, then local language and local care pathways are not optional accessories.

Second, it would need clinician-in-the-loop evaluation. The correct question is not only whether the system predicts the right label. It is whether it improves clinician speed, reduces missing information, improves referral appropriateness, or catches cases that ordinary intake would miss.

Third, it would need safety workflows for red flags. A patient describing chest pain, severe dehydration, suicidal ideation, shortness of breath, or neurological symptoms should not be trapped inside a long “let me ask one more question” loop. In medical triage, escalation logic is not a feature. It is the floor.
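In control-flow terms, "the floor" means the red-flag check runs before any question logic. The sketch below uses naive substring matching on a short phrase list drawn from the examples above; the phrase list, the `next_action` function, and the action names are assumptions, and a real system would need far more robust detection.

```python
# Hypothetical phrase list; deliberately incomplete.
RED_FLAG_PHRASES = (
    "chest pain",
    "severe dehydration",
    "suicidal",
    "shortness of breath",
)


def next_action(message, questions_asked, max_questions=20):
    """Check red flags first, so escalation never waits on the question loop."""
    text = message.lower()
    hits = [phrase for phrase in RED_FLAG_PHRASES if phrase in text]
    if hits:
        return ("escalate", hits)
    if questions_asked >= max_questions:
        return ("summarize", [])
    return ("ask_follow_up", [])


print(next_action("I have had Chest Pain since this morning", questions_asked=2))
print(next_action("mild headache after work", questions_asked=2))
```

The ordering is the design choice: escalation is evaluated on every patient message, regardless of how many questions remain in the budget.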

Fourth, it would need integration with electronic health records if used clinically. The paper lists EHR integration as future work, and that is exactly where this kind of system becomes more useful—and more dangerous. Patient history can improve reasoning, but it also raises privacy, access control, auditability, and liability issues.

Fifth, it would need monitoring after deployment. Diagnostic assistants can drift when medical guidelines change, when patient populations differ, or when the retrieval base becomes stale. A safe system needs logs, review queues, error analysis, and a way to update both knowledge and scoring behavior.

This is the unglamorous truth: medical AI becomes useful when it becomes boringly governed.

The real lesson: make the model collect evidence before it speaks

This paper is not the final answer to conversational diagnosis. It is an instructive prototype of how to make diagnostic chatbots less reckless.

The mechanism-first lesson is clear. Performance comes from forcing the system to collect enough symptoms, match them carefully, keep negative evidence, ask discriminating questions, validate against retrieved medical knowledge, use test-based confirmation, and expose the reasoning path. The LLM provides linguistic flexibility. The architecture provides discipline.

That is the pattern worth carrying into business practice.

The next generation of medical AI products should not compete on who can make the most confident chatbot. Confidence is cheap. Evidence is expensive. The systems that matter will be the ones that turn conversation into auditable clinical structure before making recommendations.

Doctor GPT, if it exists, should not be a genius in a text box. It should be a careful intake clerk, a differential diagnosis assistant, and a documentation engine that knows when to stop talking and send the case to a human.

Less dramatic, yes. Also less likely to kill anyone. A reasonable trade.

Cognaptus: Automate the Present, Incubate the Future.

¹ Maliha Tabassum and M. Shamim Kaiser, “Towards Explainable Conversational AI for Early Diagnosis with Large Language Models,” arXiv:2512.17559, 2025. https://arxiv.org/abs/2512.17559
