Therapy, Explained: How Multi‑Agent LLMs Turn DSM‑5 Screens into Auditable Logic

TL;DR for operators

DSM5AgentFlow is not a paper about an AI therapist replacing a clinician. That would be the loud interpretation, and therefore the least useful one. The paper introduces a three-agent workflow that turns DSM-5 Level-1 screening into a structured conversation, then converts the transcript into a provisional diagnosis with evidence-linked reasoning.¹

The operational lesson is that trustworthy clinical AI is becoming a workflow-design problem. One agent asks; one agent simulates a client profile; one agent diagnoses by retrieving DSM-5 material and mapping utterances to criteria. This separation matters because it creates surfaces for audit: question coverage, persona consistency, retrieved criteria, evidence tags, rationale structure, and final recommendation.

The evidence is promising but narrow. The authors generate 8,000 simulated therapist-client conversations across four LLM backbones and 10 disorder categories. Qwen-QWQ performs best on diagnostic metrics, reaching 70% accuracy, 72% recall, and 77% F1. Conversation-oriented models, especially Llama-4 and Mistral-Saba, score better on LLM-rubric dialogue quality. The expensive lesson is familiar: the model that sounds best is not necessarily the model that reasons best. Apparently bedside manner and classification performance still refuse to be the same KPI.

For business use, the near-term opportunity is not autonomous diagnosis. It is auditable intake: pre-visit screening, structured documentation, synthetic test-case generation, internal model benchmarking, and explanation review. The unresolved gaps are serious: the paper uses no real patient transcripts, no clinician-patient trials, and no clinician adjudication as the main evaluation layer, and it explicitly says the system is not a medical device and must not inform clinical decisions. Treat it as a blueprint for safer infrastructure, not a deployable clinician in a hoodie.

The intake form is not the product; the audit trail is

A mental health intake form looks simple from the outside. Questions go in, answers come out, a score or impression follows. In practice, the hard part is not asking the question. The hard part is explaining why a given answer mattered, which criterion it touched, what was missing, and whether the system quietly confused overlapping symptoms.

That is where DSM5AgentFlow is interesting. It does not merely ask an LLM to “diagnose this patient.” It decomposes the task into roles. The therapist agent administers the DSM-5 Level-1 questionnaire conversationally. The client agent responds according to a predefined synthetic profile without revealing the label. The diagnostician agent receives the transcript, retrieves relevant DSM-5 passages, predicts a disorder type, generates a rationale, links symptoms and quotes to diagnostic criteria, and adds treatment recommendations.

The design is not glamorous in the usual AI-demo sense. There is no cinematic claim that the machine has empathy. The valuable move is more bureaucratic, therefore more useful: make the reasoning process inspectable. Healthcare already runs on documentation, handoffs, justification, and liability. Any serious clinical AI system has to survive that environment. A fluent black box is still a black box, merely one that says “I understand how difficult this must be” before making a mess.

The mechanism: three agents, three failure surfaces

The workflow can be read as a small assembly line:

DSM-5 questionnaire
        ↓
Therapist Agent
        ↓
Synthetic Client Agent
        ↓
Conversation transcript
        ↓
Diagnostician Agent + DSM-5 retrieval
        ↓
Provisional diagnosis + rationale + evidence links + next steps

This is not just software neatness. Splitting the task changes what can be tested.
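A minimal Python sketch of that assembly line, assuming stub functions in place of real model calls. The names `Turn`, `AuditRecord`, and `run_pipeline` are illustrative and do not come from the paper's code. The turn-by-turn loop is also a simplification, since the paper reports generating conversations in a one-shot process. The only point is that each stage leaves behind something a reviewer can inspect.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Turn:
    speaker: str  # "therapist" or "client"
    text: str


@dataclass
class AuditRecord:
    """Artifacts kept from one run so each stage can be reviewed separately."""
    transcript: list[Turn] = field(default_factory=list)
    retrieved_passages: list[str] = field(default_factory=list)
    assessment: str = ""


def run_pipeline(
    items: list[str],
    ask: Callable[[str], str],
    answer: Callable[[str], str],
    retrieve: Callable[[list[Turn]], list[str]],
    diagnose: Callable[[list[Turn], list[str]], str],
) -> AuditRecord:
    record = AuditRecord()
    for item in items:
        question = ask(item)  # therapist stage
        record.transcript.append(Turn("therapist", question))
        record.transcript.append(Turn("client", answer(question)))  # client stage
    # Diagnostician stage: retrieval is logged before the assessment is written.
    record.retrieved_passages = retrieve(record.transcript)
    record.assessment = diagnose(record.transcript, record.retrieved_passages)
    return record


# Hypothetical stubs stand in for the three agents so the sketch runs offline.
record = run_pipeline(
    items=["sleep", "mood"],
    ask=lambda item: f"Over the past two weeks, how has your {item} been?",
    answer=lambda question: "Not great, honestly.",
    retrieve=lambda transcript: ["(placeholder criteria passage)"],
    diagnose=lambda transcript, passages: "provisional impression, pending clinician review",
)
print(len(record.transcript), "turns logged")
```

Passing the agents in as plain callables keeps each one replaceable and testable on its own, which is the property the next paragraphs rely on.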

The therapist agent has a coverage problem. It must make sure all DSM-5 Level-1 questionnaire items are addressed, while phrasing them naturally enough to resemble a clinical interview rather than a tax audit wearing a cardigan. The paper’s therapist prompt instructs the agent to administer the actual questionnaire, avoid premature diagnostic assumptions, keep a warm professional tone, and proceed through the question set.

The client agent has a fidelity problem. It must stay in character, answer in the first person, express symptoms consistent with a profile, and avoid saying the diagnostic label aloud. That last constraint matters. If the simulated client says “I have PTSD,” the task collapses into label extraction, which is cheaper and much less interesting.

The diagnostician agent has an accountability problem. It must transform a messy conversation into a structured assessment by retrieving relevant DSM-5 passages and linking the transcript to criteria. This is the business-critical stage because it creates the thing a reviewer can inspect. A diagnosis without traceable evidence is a prediction. A diagnosis with utterance-to-criterion mapping is at least a candidate audit object.

Agent	Paper role	Main failure mode	Operational control point
Therapist	Covers DSM-5 Level-1 items through natural dialogue	Missed domains, leading questions, premature diagnosis	Item tracking, prompt constraints, coverage logs
Client	Simulates a profile across symptoms, demographics, context, and comorbid modifiers	Label leakage, inconsistent persona, unrealistic responses	Profile templates, role constraints, synthetic case review
Diagnostician	Retrieves DSM-5 passages and maps transcript evidence to diagnostic reasoning	Plausible but unsupported rationale, symptom overlap, wrong label	Retrieval logs, evidence tags, criterion anchors, human review
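The client agent's label-leakage failure mode lends itself to a cheap automated check. The sketch below is an assumption about how such a check might look, not the paper's method: it scans client turns for diagnostic label terms using whole-word, case-insensitive matching. The function name and the term list are hypothetical, and keyword matching will miss paraphrased leaks.

```python
import re


def find_label_leaks(client_turns: list[str], label_terms: list[str]) -> list[tuple[int, str]]:
    """Return (turn_index, term) pairs where a client turn states a label term."""
    leaks = []
    for index, text in enumerate(client_turns):
        for term in label_terms:
            # Whole-word match so "OCD" does not fire inside an unrelated longer token.
            if re.search(rf"\b{re.escape(term)}\b", text, flags=re.IGNORECASE):
                leaks.append((index, term))
    return leaks


turns = [
    "I keep reliving it at night.",
    "My last doctor said I have PTSD.",
]
print(find_label_leaks(turns, ["PTSD", "post-traumatic stress disorder"]))
```

A flagged turn does not have to invalidate the whole conversation, but it should at least be excluded from any diagnostic accuracy count, since the label was handed to the diagnostician.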

This is why the paper deserves a mechanism-first reading. The architecture is the contribution. The metrics tell us whether the architecture behaves plausibly under one experimental setup. They do not replace the architecture as the point of the work.

The therapist agent turns screening into controlled conversation

The therapist agent’s job is not therapy in the full human sense. It is controlled elicitation. It takes the DSM-5 Level-1 questionnaire and asks each item conversationally, tracking whether it has been addressed. If an item is not sufficiently covered, the agent can clarify or rephrase.
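Item tracking of this kind can be sketched in a few lines. The class below is a hypothetical illustration, not the paper's implementation: the item IDs are placeholders, and the judgment of whether an answer counts as "sufficiently covered" is left to whatever calls `mark`.

```python
from dataclasses import dataclass, field


@dataclass
class CoverageTracker:
    """Tracks which questionnaire items have been addressed so far."""
    item_ids: list[str]
    addressed: set[str] = field(default_factory=set)

    def mark(self, item_id: str) -> None:
        if item_id not in self.item_ids:
            raise KeyError(f"unknown questionnaire item: {item_id}")
        self.addressed.add(item_id)

    def missing(self) -> list[str]:
        # Preserve questionnaire order so follow-ups are asked in sequence.
        return [item for item in self.item_ids if item not in self.addressed]

    def coverage(self) -> float:
        return len(self.addressed) / len(self.item_ids) if self.item_ids else 1.0


tracker = CoverageTracker(["Q1", "Q2", "Q3", "Q4"])  # placeholder item IDs
tracker.mark("Q1")
tracker.mark("Q3")
print(tracker.missing(), tracker.coverage())
```

The `missing()` list is what would drive a clarify-or-rephrase step, and the final coverage value is the kind of number a coverage log would record per conversation.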

That sounds modest until you compare it with the usual LLM failure pattern: ask a few reasonable questions, drift toward whatever seems salient, then confidently summarize a partial conversation. The therapist agent is designed to resist that drift. It follows the questionnaire rather than vibes. In a medical-adjacent workflow, “less vibes” is a product feature.

The paper reports that generated conversations covered DSM-5 domains and were evaluated using both automatic readability/coherence metrics and a custom LLM rubric. The rubric scores completeness of DSM-5 coverage, clinical relevance of questions, consistency and flow, diagnostic justification, and empathy/professionalism. This is the main evidence for whether the simulated conversations look usable as screening dialogues.

Still, the evaluation should be read carefully. Conversation quality here is not therapeutic effectiveness. It is not patient satisfaction. It is not safety in crisis cases. It is structured dialogue quality under synthetic conditions. Useful, yes. Clinically sufficient, no.

The synthetic client is useful precisely because it is fake

The client agent is the most easily misunderstood part of the paper. Synthetic clients do not prove that the system works on real clients. They solve a different problem: data availability, repeatability, and controlled testing.

Mental health data is sensitive, hard to obtain, and ethically constrained. Synthetic profiles allow researchers to generate cases across 10 primary disorder categories: Adjustment Disorder, Anxiety, Bipolar Disorder, Depression, OCD, Panic Disorder, PTSD, Schizophrenia, Social Anxiety Disorder, and Substance Abuse. The paper creates 8,000 simulated conversations, 2,000 for each of the four model backbones.

That scale enables benchmarking. It lets the authors ask: under similar profile conditions, which model produces better dialogue, which model classifies better, and where do disorders get confused?

But synthetic control has a price. The client profiles are structured by design. They do not fully reproduce real client ambiguity: hesitation, shame, evasiveness, cultural idioms, memory gaps, guarded responses, or the nonverbal information clinicians often use. The authors acknowledge this directly. No real client transcripts or clinician interactions are used, so ecological validity remains unproven.

That limitation is not a footnote-sized nuisance. It defines the article’s business boundary. This is not evidence that a deployed system can screen real people safely. It is evidence that a role-separated architecture can produce controlled, auditable synthetic screening workflows at scale.

The diagnostician agent is where “trustworthy” has to earn its keep

The diagnostician agent is the paper’s most commercially relevant component. It retrieves the top five relevant DSM-5 passages for a conversation, using chunk sizes of 512 or 1024 tokens and the nomic-embed-text embedding model. It then synthesizes the retrieved material with the transcript to predict the likely disorder and produce a rationale.
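The retrieval step can be illustrated with a toy top-k ranker. This is a sketch under loud assumptions: the `embed` function is a bag-of-words stand-in, not the nomic-embed-text model, the chunk texts are invented placeholders rather than DSM-5 passages, and chunking into 512 or 1024 tokens is not shown. Only the rank-by-cosine-similarity, keep-the-top-k shape is the point.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    # Toy stand-in for an embedding model: lowercase word counts.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query: str, chunks: list[str], k: int = 5) -> list[tuple[float, str]]:
    """Score every chunk against the query and keep the k most similar."""
    query_vec = embed(query)
    scored = [(cosine(query_vec, embed(chunk)), chunk) for chunk in chunks]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


# Invented placeholder chunks, not quotations from DSM-5.
chunks = [
    "panic attacks with sudden intense fear",
    "persistent low mood and loss of interest",
    "intrusive memories after a traumatic event",
]
for score, chunk in top_k("I get sudden fear and my heart races", chunks, k=2):
    print(f"{score:.2f}  {chunk}")
```

Logging the returned `(score, chunk)` pairs alongside the final assessment is what turns retrieval into a reviewable source trail.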

There are three important details here.

First, retrieval reduces reliance on model memory. The diagnostician is not supposed to free-associate from whatever the model absorbed during training. It is prompted to ground the assessment in retrieved DSM-5 material. This does not guarantee correctness, but it creates an inspectable source trail.

Second, the rationale is not merely a paragraph explaining the answer. The system is designed to cite client utterances as evidence for or against diagnostic criteria. That distinction matters. A generic explanation can sound responsible while hiding the actual inference. Evidence mapping forces the model to show its work in a way a human reviewer can challenge.

Third, the output includes next-step or treatment recommendations. This is where governance gets uncomfortable. Recommendations are operationally useful, but they also move the system closer to clinical action. In a real product, that boundary would need hard controls: clinician review, scope limits, crisis escalation, jurisdiction-specific compliance, logging, and clear separation between screening support and medical decision-making.

The paper itself draws the line firmly. The system is for research into explainable AI. It is not a medical device, and its outputs must not inform clinical decisions. That sentence should be printed on the product roadmap before anyone adds a “Book therapy with AI” button.

What the experiments actually test

The experiments are best read as three linked evaluations, not as one grand clinical validation.

Evaluation element	Likely purpose	What it supports	What it does not prove
Conversation quality metrics	Main evidence for whether agents can generate readable, coherent screening dialogue	Generated dialogues are moderately coherent and readable across models	Real patient engagement, therapeutic alliance, crisis safety
LLM rubric evaluation	Main evidence for coverage, relevance, flow, explainability, and professionalism	Some models produce stronger structured conversations than others	Human clinician agreement or patient-rated quality
Diagnostic accuracy metrics	Main evidence for disorder-label prediction against synthetic ground truth	Qwen-QWQ performs best overall under this benchmark	Clinical diagnostic validity on real cases
Per-disorder F1 and confusion matrices	Error analysis	Adjustment Disorder and Depression are hard; overlapping symptoms drive confusion	That the confusion pattern would match real-world caseloads
Explainability case study	Exploratory illustration	Rationale structure differs sharply across models	Generalisable explainability performance across all outputs
Appendix prompt templates and interface demo	Implementation detail	The workflow is reproducible and modular	Clinical readiness or regulatory adequacy

This table is the antidote to leaderboard intoxication. The paper does not say “LLMs diagnose mental illness at 77% F1, ship it.” It says a particular multi-agent workflow, under synthetic conditions, can generate structured screening conversations and evaluate how different models behave when asked to diagnose from those conversations.

That is a narrower claim. It is also a better one.

The model that talks best is not the model that diagnoses best

The results show a useful split between dialogue quality and diagnostic performance.

On conversation-quality metrics, GPT-4.1-Nano has the highest BERTScore at 54.87%, while Llama-4 has the easiest readability profile, with a Flesch Reading Ease score of 61.67 and lower grade-level indicators than the others. In the custom LLM rubric, Llama-4 and Mistral-Saba generally perform best, with mean rubric scores reported in the 4.26 to 4.41 range. Qwen-QWQ scores lower on dialogue quality, and GPT-4.1-Nano performs substantially worse on the rubric despite respectable automatic coherence.

On diagnostic performance, the story changes. Qwen-QWQ leads overall, reaching 70% accuracy, 72% recall, and 77% F1. GPT-4.1-Nano follows with 73% F1 and the highest precision at 83%. Llama-4 and Mistral-Saba trail on classification, with 52% and 57% accuracy and F1 scores of 65% and 63%.

Model	Dialogue signal	Diagnostic signal	Practical reading
Llama-4-Scout-17B	Strong rubric performance; most readable by FRE	52% accuracy; 65% F1	Better interviewer than diagnostician in this setup
Mistral-Saba-24B	Strong rubric performance	57% accuracy; 63% F1	Good dialogue quality, weaker classification
Qwen-QWQ-32B	Lower dialogue rubric than top conversational models	70% accuracy; 72% recall; 77% F1	Strongest diagnostic reasoner under this benchmark
GPT-4.1-Nano	Highest BERTScore; weak rubric scores	73% F1; 83% precision	Coherent text does not automatically mean better structured interview quality

The per-disorder results are more revealing than the aggregate score. Qwen-QWQ performs strongly on Panic Disorder, PTSD, Social Anxiety, OCD, Bipolar Disorder, and Schizophrenia. GPT-4.1-Nano is especially strong on PTSD, OCD, and Bipolar Disorder. Anxiety-related categories are generally easier across models.

Adjustment Disorder is the graveyard. Llama-4, Mistral-Saba, and GPT-4.1-Nano score below 3% F1 on Adjustment Disorder; Qwen-QWQ reaches only 40.25%. Depression is also difficult, with F1 scores ranging from 36.75% to 67.98%.

This is not random embarrassment. The paper’s confusion matrices show systematic overlap: Adjustment Disorder is often mislabeled as Depression; Bipolar Disorder and Depression are confused; Social Anxiety can collapse into Anxiety; Substance Abuse can be mislabeled as Depression. The DSM-5 Level-1 screen is broad by design. It can flag domains, but it may not contain enough temporal and contextual detail to separate closely related conditions.
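A small worked example shows how this kind of overlap drags per-class F1 down. The confusion counts below are invented for illustration and are not taken from the paper; the function name is hypothetical. Rows are true labels, columns are predicted labels.

```python
def per_class_f1(confusion: dict[str, dict[str, int]]) -> dict[str, float]:
    """Compute F1 per label from a {true_label: {predicted_label: count}} table."""
    labels = list(confusion)
    scores = {}
    for label in labels:
        tp = confusion[label].get(label, 0)
        fn = sum(confusion[label].values()) - tp
        fp = sum(confusion[other].get(label, 0) for other in labels if other != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        total = precision + recall
        scores[label] = 2 * precision * recall / total if total else 0.0
    return scores


# Illustrative counts only: 9 of 10 Adjustment Disorder cases are absorbed by Depression.
confusion = {
    "Adjustment Disorder": {"Adjustment Disorder": 1, "Depression": 9},
    "Depression": {"Adjustment Disorder": 0, "Depression": 10},
}
for label, f1 in per_class_f1(confusion).items():
    print(f"{label}: F1 = {f1:.3f}")
```

In this toy case the absorbed class ends with precision 1.0 but recall 0.1, so its F1 is 2/11, while the absorbing class keeps perfect recall and loses precision, giving 20/29. The damage is asymmetric, which is why an aggregate score can look healthy while one category is effectively unrecognised.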

The business implication is blunt: the workflow is better suited to structured triage and documentation than definitive differential diagnosis. It can help make the first pass more legible. It cannot make symptom overlap disappear.

Explanation quality is not the same as evidence decoration

The explainability case study is small but useful. The authors inspect one representative transcript per model and count transparency signals: symptom tags, direct quote tags, DSM clause references, and step-by-step logic.

Qwen-QWQ produces the cleanest example: 11 symptom tags, four direct quotes, DSM criterion anchors, and a five-point rationale. Mistral-Saba gives a correct diagnosis in the example but uses a less auditable paragraph-style explanation. Llama-4 produces an opaque output with minimal justification. GPT-4.1-Nano inserts many symptom tags but does not anchor them to DSM clauses or stepwise logic.

That last result is the one operators should remember. More tags do not automatically mean more explainability. A model can decorate an answer with markers while failing to connect evidence to criteria. In regulated workflows, decorative traceability is dangerous because it gives reviewers the feeling of audit without the discipline of audit.

A good explanation has at least three properties:

It identifies the relevant criterion.
It quotes or paraphrases the supporting conversation evidence.
It states the inference connecting the evidence to the criterion.

Miss any one of those, and the explanation becomes either a label, a transcript excerpt, or a confident essay. None of those is enough.
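Those three properties can be turned into a mechanical first-pass check. The sketch below is an assumption about how such a check might look, with hypothetical names and placeholder criterion IDs. It only verifies verbatim quotes, so paraphrased evidence and the actual soundness of the inference still need a human reviewer.

```python
from dataclasses import dataclass


@dataclass
class EvidenceLink:
    criterion: str  # identifier of the criterion the model claims to address
    quote: str      # text the model attributes to the client
    inference: str  # the stated step connecting quote to criterion


def audit_link(link: EvidenceLink, transcript: str, retrieved_criteria: set[str]) -> list[str]:
    """Return a list of problems; an empty list means the link passes the first pass."""
    problems = []
    if link.criterion not in retrieved_criteria:
        problems.append("criterion not in retrieved set")
    if not link.quote.strip() or link.quote.lower() not in transcript.lower():
        problems.append("quote not found in transcript")
    if not link.inference.strip():
        problems.append("missing inference")
    return problems


transcript = "I can't sleep since the accident. Loud noises make me jump."
grounded = EvidenceLink(
    criterion="criterion-1",  # placeholder ID, not a DSM-5 clause number
    quote="Loud noises make me jump",
    inference="Exaggerated startle response reported after the event.",
)
decorated = EvidenceLink(criterion="criterion-9", quote="I feel hopeless", inference="")

print(audit_link(grounded, transcript, {"criterion-1"}))
print(audit_link(decorated, transcript, {"criterion-1"}))
```

The second link is the decorative case: it looks like a tag, but it points at a criterion that was never retrieved, quotes something the client never said, and offers no inference.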

The business value is auditable intake, not automated therapy

Cognaptus’ reading is straightforward: this paper points toward infrastructure, not replacement. The near-term commercial value is not an AI clinician. It is a system that makes mental health intake more structured, testable, and reviewable.

Business use case	What the paper directly supports	Cognaptus inference	Boundary
Pre-visit screening	The workflow can simulate DSM-5 Level-1 conversations and produce structured outputs	Similar architectures could support clinician-reviewed intake before appointments	Must be validated on real patients and reviewed by licensed professionals
Documentation support	Diagnostician outputs include rationale, evidence links, and next steps	Evidence-linked summaries could reduce review burden and improve auditability	Explanation faithfulness must be tested, not assumed
Synthetic test data	The paper generates 8,000 simulated conversations across 10 categories	Vendors can use controlled personas to stress-test screening workflows before live pilots	Synthetic realism remains uncertain
Model selection	Four backbones show different dialogue and diagnostic strengths	Health AI teams should evaluate models by role, not by one generic “best model” score	Results may change with other models, prompts, or clinical corpora
Governance and QA	Role separation creates logs at each stage	Audit teams can inspect coverage, retrieval, evidence mapping, and rationale structure	Regulatory acceptance requires much more than readable logs

This is the more sober version of “agentic AI in healthcare.” The system does not become trustworthy because it is multi-agent. It becomes more governable because multi-agent decomposition creates checkpoints. Checkpoints are where quality assurance, legal review, and clinical oversight can attach themselves.

That is the part many AI demos skip, mostly because checkpoints are less photogenic than a chatbot pretending to be Freud.

The architecture is modular enough to matter outside DSM-5

The paper’s modularity section is easy to skim, but it is important. The framework supports local and cloud inference through an adapter layer. It can run with Ollama locally or with Groq and OpenAI APIs. Questionnaires are loaded from files, and persona definitions are stored in compact text profiles. The dataset and implementation are open-sourced.²
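An adapter layer of that kind is commonly written as a small interface plus a registry. The sketch below shows the general pattern only, under the assumption that every backend exposes a single text-completion method. `ChatBackend`, `EchoBackend`, `BACKENDS`, and `get_backend` are hypothetical names, not the repository's API, and no real provider client is called.

```python
from typing import Callable, Protocol


class ChatBackend(Protocol):
    """The one method every backend must provide in this sketch."""
    def complete(self, system_prompt: str, user_message: str) -> str: ...


class EchoBackend:
    """Offline stub used in place of a local or cloud model."""
    def complete(self, system_prompt: str, user_message: str) -> str:
        return f"echo: {user_message}"


# A local or cloud adapter would be registered here under its own key.
BACKENDS: dict[str, Callable[[], ChatBackend]] = {"echo": EchoBackend}


def get_backend(name: str) -> ChatBackend:
    try:
        return BACKENDS[name]()
    except KeyError:
        raise ValueError(f"unknown backend {name!r}; known: {sorted(BACKENDS)}") from None


backend = get_backend("echo")
print(backend.complete("You are the therapist agent.", "Ask the next item."))
```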

This makes DSM5AgentFlow less like a single-purpose demo and more like a workflow pattern:

Assessment instrument
        ↓
Structured elicitation agent
        ↓
Controlled respondent or real user
        ↓
Transcript
        ↓
Retriever + domain criteria
        ↓
Evidence-linked assessment

That pattern is not limited to mental health. It applies wherever organisations need structured intake, natural-language elicitation, and evidence-linked classification: insurance claims, compliance interviews, HR case intake, education diagnostics, benefits eligibility, customer risk triage. The domain changes; the architectural question remains the same.

Can the system ask all required questions, avoid premature conclusions, map answers to criteria, and show the reasoning in a form that a human can reject?

That is not “AI magic.” That is workflow engineering. Much less glamorous. Much more likely to survive procurement.

The limitations are not decorative; they define the deployment line

The paper’s limitations are not polite academic throat-clearing. They are the practical boundary of the work.

First, the data is synthetic only. No real client transcripts or clinician interactions are used. The authors cannot yet quantify how closely the synthetic conversations approximate real clinical discourse. For any mental health product, this is the difference between a lab benchmark and operational evidence.

Second, conversations were generated in a one-shot process to reduce cost. That matters because true counselling dialogue is adaptive. A real clinician changes direction based on hesitation, risk signals, contradictions, or emotional shifts. One-shot generation may make conversations look smoother than real turn-by-turn interaction.

Third, the evaluation relies partly on LLM-based rubric scoring. That creates the risk of shared blind spots. An LLM judge may reward fluent structure, familiar phrasing, or its own preferred style. Human clinician review is not optional if the claim moves toward clinical utility.

Fourth, the model pool is limited. The authors benchmark Llama-4-Scout-17B, Mistral-Saba-24B, Qwen-QWQ-32B, and GPT-4.1-Nano, largely due to availability and infrastructure constraints. Larger, clinically tuned, or differently prompted models could shift the results.

Finally, the system is explicitly not for clinical decision-making. This is the cleanest boundary in the paper. Any vendor that reads this work as permission to automate diagnosis has missed the point with admirable efficiency.

What operators should build after reading this paper

The best next product step is not a chatbot that diagnoses users. It is an internal evaluation and intake scaffold.

A serious implementation would start with synthetic cases, then add clinician-authored cases, then run blinded clinician review of system outputs, then evaluate real-world intake under human supervision. It would track not just diagnostic accuracy but missing-question rates, contradiction handling, crisis escalation, evidence faithfulness, retrieval quality, demographic robustness, and reviewer disagreement.

The model architecture should also be role-specific. A warmer model may be better for elicitation. A reasoning-heavy model may be better for criteria mapping. A smaller model may be enough for formatting, logging, and summarisation. One model to rule them all is convenient. Healthcare has a habit of punishing convenience.

Most importantly, explanation should be treated as a testable output. The question is not “does the rationale sound good?” The question is whether each cited symptom is present in the transcript, whether the criterion is relevant, whether contradictory evidence is represented, and whether a clinician would agree that the inference is justified.

That is the bridge from impressive demo to operational system: not better prose, but better reviewability.

The quiet advance: diagnosis as a traceable workflow

DSM5AgentFlow is valuable because it reframes LLM mental-health screening as a traceable process. The paper’s best contribution is not that Qwen-QWQ gets the highest F1. It is that the system makes it possible to ask better questions about every stage before a label appears.

Did the interviewer cover the required domains? Did the synthetic client stay consistent? Did the diagnostician retrieve relevant criteria? Did the rationale cite actual utterances? Did it separate supporting from contradictory evidence? Did it confuse Adjustment Disorder with Depression because the questionnaire lacked enough context? These are the questions a serious clinical AI workflow must answer.

The paper does not solve trustworthy AI psychotherapy. The title reaches further than the evidence permits. But it gives a useful architecture for making screening logic visible. In healthcare AI, visible logic is not the same as safe logic. It is simply the point at which safety work can begin.

That is progress. Not the kind that replaces clinicians. The kind that gives them a cleaner audit trail before anyone pretends the machine has earned a white coat.

Cognaptus: Automate the Present, Incubate the Future.

¹ Mithat Can Ozgun, Jiahuan Pei, Koen Hindriks, Lucia Donatelli, Qingzhi Liu, and Junxiao Wang, “Trustworthy AI Psychotherapy: Multi-Agent LLM Workflow for Counseling and Explainable Mental Disorder Diagnosis,” arXiv:2508.11398, 2025. https://arxiv.org/abs/2508.11398
² The paper states that its datasets and implementation are open-sourced at: https://github.com/mithatco/mental_health_multiagent
