From Static Scripts to Self-Evolving Minds: The Rise of Experience-Driven AI Counselors

Counseling is a bad place to hide a static AI system

Customer-support bots can get away with being forgetful. They apologize, ask for the order number again, and everyone quietly lowers their expectations.

Psychological counseling is less forgiving. A counselor who forgets the last session, repeats generic comfort, or treats every conversation as a fresh prompt is not merely inefficient. The whole relationship becomes unstable. Continuity is not a UX feature here; it is part of the intervention.

That is why “PsychAgent: An Experience-Driven Lifelong Learning Agent for Self-Evolving Psychological Counselor” is more interesting than the usual “LLM does therapy better” headline would suggest.¹ The paper is not mainly arguing that counseling models need more empathy words, longer context windows, or another round of domain fine-tuning. The more useful claim is sharper: a counseling agent should improve from accumulated cases by turning experience into reusable skills and then partially internalizing those skills into the model.

That distinction matters. A bigger model can sound more polished. A longer context window can remember more. A retrieval system can fetch old notes. None of those, by itself, means the system has learned how to become a better practitioner.

PsychAgent is an attempt to close that loop.

The real proposal is a learning loop, not a warmer chatbot

The easiest way to misread the paper is to place it in the familiar pile of mental-health chatbots: empathy tuning, CBT templates, safety filters, perhaps a memory module bolted on for continuity. That reading misses the architecture.

PsychAgent is built around three engines:

Engine	What it does	Why it matters
Memory-Augmented Planning Engine	Maintains evolving client profiles, session summaries, and session goals	Gives the conversation continuity and direction
Skill Evolution Engine	Extracts practice-grounded atomic skills from successful counseling trajectories and organizes them in a hierarchy	Converts experience into reusable procedural knowledge
Reinforced Internalization Engine	Generates alternative session trajectories, selects high-reward ones, and fine-tunes on the selected paths	Moves useful behavior from external scaffolding toward model-internal competence

The paper’s core mechanism can be summarized as:

remember the case → plan the next session → retrieve relevant skills → generate candidate trajectories → select the better trajectory → extract refined skills → internalize successful behavior → repeat.
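As a rough illustration, the loop can be written as one function over pluggable steps. Everything in this sketch is an assumption made for illustration: the class and function names, the string-based trajectories, and the toy scoring rule are hypothetical and are not taken from the paper's implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class CaseState:
    """Hypothetical container for what the loop carries between sessions."""
    memory: List[str] = field(default_factory=list)        # session summaries
    skills: List[str] = field(default_factory=list)        # reusable skill notes
    training_set: List[str] = field(default_factory=list)  # selected trajectories


def run_session(state: CaseState,
                generate: Callable[[str, List[str]], List[str]],
                score: Callable[[str], float],
                extract_skill: Callable[[str], str]) -> str:
    # 1. Remember the case and plan the next session (toy plan: a label).
    plan = f"session {len(state.memory) + 1}"
    # 2. Retrieve relevant skills (toy retrieval: the three most recent).
    retrieved = state.skills[-3:]
    # 3. Generate candidate trajectories and select the highest-scoring one.
    candidates = generate(plan, retrieved)
    best = max(candidates, key=score)
    # 4. Extract a refined skill and queue the trajectory for fine-tuning.
    state.skills.append(extract_skill(best))
    state.training_set.append(best)
    # 5. Update memory from the selected path only, then repeat next session.
    state.memory.append(f"{plan}: {best}")
    return best


# Toy usage: longer candidates score higher, skills are upper-cased copies.
state = CaseState()
toy_generate = lambda plan, skills: [f"{plan} short", f"{plan} with a longer reply"]
best = run_session(state, toy_generate, score=len, extract_skill=str.upper)
print(best)
```

The point of the sketch is the data flow: only the selected trajectory feeds memory, the skill list, and the training queue, so a weak candidate leaves no trace in later sessions.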

That loop is this article’s main subject. The mental-health setting is important, but the design pattern extends well beyond therapy. Many enterprise AI deployments suffer from the same disease in a less emotionally dramatic form: they interact with clients, employees, documents, markets, tickets, and workflows every day, but their operating experience mostly evaporates. At best, it becomes logs. Logs are not learning. They are archaeology with better timestamps.

PsychAgent asks what happens when those traces become skills.

Memory gives the agent continuity, but memory alone is not the upgrade

The Memory-Augmented Planning Engine is the most intuitive part of the system. It stores an evolving client profile and episodic session summaries. Before a session, it reasons over that memory to produce a therapeutic stage and specific session objectives. In other words, the agent is not supposed to answer only the latest turn. It is supposed to know where this client has been and what the current session should accomplish.
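A minimal sketch of that idea, assuming a very simple data model: a profile, a list of session summaries, and a planner that outputs a stage plus objectives. The field names, stage labels, and stage boundaries below are hypothetical; the paper's engine reasons over memory with a language model rather than fixed rules.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SessionSummary:
    session_no: int
    summary: str
    open_items: List[str] = field(default_factory=list)  # unresolved topics


@dataclass
class ClientMemory:
    profile: dict                                         # evolving client profile
    sessions: List[SessionSummary] = field(default_factory=list)


@dataclass
class SessionPlan:
    stage: str
    objectives: List[str]


def plan_next_session(memory: ClientMemory) -> SessionPlan:
    """Toy planner: stage depends on session count, objectives on open items."""
    n = len(memory.sessions)
    # Hypothetical stage boundaries, chosen only for this example.
    stage = "intake" if n == 0 else "working" if n < 6 else "consolidation"
    carried = memory.sessions[-1].open_items if memory.sessions else []
    objectives = [f"follow up: {item}" for item in carried] or ["build rapport"]
    return SessionPlan(stage=stage, objectives=objectives)


memory = ClientMemory(profile={"presenting_issue": "sleep"})
first_plan = plan_next_session(memory)
memory.sessions.append(SessionSummary(1, "Agreed on a sleep diary.", ["sleep diary"]))
second_plan = plan_next_session(memory)
print(first_plan, second_plan, sep="\n")
```

Even in this toy form, the plan is a function of the case history rather than of the latest message, which is the property the engine is meant to provide.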

This is useful, but the paper is careful to show that memory is not the whole story. One of its more business-relevant tests compares generic memory mechanisms added to a strong backbone model. Vanilla RAG, Graph RAG, Mem0, and MemoryBank do not produce meaningful gains in the reported benchmark; in several columns, they slightly underperform the base DeepSeek-V3.2 result.

The implication is not that memory is useless. That would be a silly conclusion, and we already have enough silly conclusions in AI strategy decks. The better interpretation is that counseling memory is not just document recall. The agent needs to track changing emotional state, intervention history, therapeutic progress, and the reason a previous strategy did or did not work.

For enterprise AI, this is a familiar problem wearing a lab coat. A CRM assistant that remembers a customer’s last complaint is not necessarily better at account management. A compliance assistant that retrieves a policy clause is not necessarily better at deciding which control pattern applies. A trading assistant that stores old signals is not necessarily better at market regime recognition.

Memory is the floor. Skill abstraction is the step up.

The skill engine turns experience into operational knowledge

The Skill Evolution Engine is the most important component in the paper because it tries to answer a practical question: what exactly does an agent learn from experience?

The answer is not “new therapy schools.” The appendix is useful precisely because it narrows the claim. Across 715 dialogue records, the extraction pipeline produces 9,701 atomic skills, of which 5,923 are judged to be practice-grounded skill units. The reported practice-grounded proportions vary by orientation: 73.29% in behavioral therapy, 69.77% in CBT, 56.0% in humanistic-existential therapy, 51.61% in psychodynamic therapy, and 63.49% in postmodernist therapy.
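For scale, the totals quoted above imply an overall practice-grounded share and an average number of atomic skills per dialogue record. The short calculation below uses only the figures in the paragraph; the variable names are illustrative.

```python
dialogue_records = 715
atomic_skills = 9_701
practice_grounded = 5_923

# Per-orientation shares as reported, in percent.
share_by_orientation = {
    "behavioral": 73.29,
    "CBT": 69.77,
    "humanistic-existential": 56.0,
    "psychodynamic": 51.61,
    "postmodernist": 63.49,
}

overall_share = practice_grounded / atomic_skills
skills_per_record = atomic_skills / dialogue_records

print(f"overall practice-grounded share: {overall_share:.2%}")
print(f"atomic skills per dialogue record: {skills_per_record:.2f}")
print(f"orientation range: {min(share_by_orientation.values())}% "
      f"to {max(share_by_orientation.values())}%")
```

The overall share works out to roughly 61%, which sits inside the per-orientation range, as it should for a pooled figure.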

Those numbers should not be read as proof that the model has invented new clinical science. The authors explicitly describe the stronger pattern as operational enrichment: thresholds, fallback rules, minimum viable actions, structured templates, monitoring loops, and client-specific execution formats.

That is less glamorous than “AI discovers therapy.” It is also more believable.

Consider what the case studies show. A broad behavioral-contracting idea becomes a micro-contract with a red-light threshold, fallback sequence, minimum practice duration, and flexible logging format. A general CBT review-and-transfer pattern becomes a one-page process card with a low activation threshold and a personalized starter cue. A broad goal-setting skill becomes a minimal viable target with a simplified record and permission to pause.
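To make "operational enrichment" concrete, here is one hypothetical way such a micro-contract could be encoded. The field names mirror the elements listed above (red-light threshold, fallback sequence, minimum practice duration, logging format), but the 0 to 10 distress scale, the example values, and the decision rule are assumptions for illustration, not the paper's schema.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MicroContract:
    """Hypothetical encoding of an operationally enriched skill."""
    name: str
    red_light_threshold: int      # distress rating (assumed 0-10) that triggers fallback
    fallback_sequence: List[str]  # ordered lower-effort alternatives
    min_practice_minutes: int     # minimum viable practice duration
    log_formats: List[str]        # acceptable ways to record practice

    def next_action(self, distress: int, fallbacks_used: int = 0) -> str:
        # Below the threshold, the standard minimum practice applies.
        if distress < self.red_light_threshold:
            return f"practice for {self.min_practice_minutes} min"
        # At or above it, step down through the fallback sequence in order.
        if fallbacks_used < len(self.fallback_sequence):
            return self.fallback_sequence[fallbacks_used]
        # Fallbacks exhausted: permission to pause.
        return "pause and review at next session"


contract = MicroContract(
    name="exposure micro-contract",
    red_light_threshold=7,
    fallback_sequence=["slow breathing for 2 min", "write one sentence about the urge"],
    min_practice_minutes=5,
    log_formats=["checkbox", "voice note", "one-line text"],
)
print(contract.next_action(distress=3))
print(contract.next_action(distress=8))
print(contract.next_action(distress=8, fallbacks_used=2))
```

Nothing here is a new technique. The value is that the threshold, the fallback order, and the pause rule are written down where they can be reused and reviewed.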

This is not theoretical novelty. It is procedural sharpening.

And procedural sharpening is exactly where many enterprise systems currently fail. They know the policy but not the workaround. They know the sales methodology but not the moment when a client’s hesitation requires a lower-friction next step. They know the process map but not the tiny operational rule that keeps a real user from abandoning it.

PsychAgent’s useful contribution is showing an architecture where those tiny rules are not left as tribal knowledge. They are extracted, named, organized, and reused.

Internalization is where the paper moves beyond RAG

The third engine, Reinforced Internalization, tries to move successful behavior from external scaffolding into the model itself.

The mechanism is session-level rejection fine-tuning. For a given session, the agent generates multiple candidate trajectories. A reward model evaluates them using counselor- and client-level dimensions. The system selects the best trajectory, updates memory based on that selected path, and fine-tunes the model to increase the likelihood of these selected “golden” trajectories.

This is not token-level reinforcement learning. It is closer to experiential curation: try several session strategies, keep the better one, train on the better path, discard the weaker candidates.
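The selection step can be sketched in a few lines. This is a simplified illustration under stated assumptions: the reward is an unweighted mean over named dimensions, there is a minimum-reward bar (`min_reward`, which is my addition and not something the paper is quoted as using), and the output is a list of prompt and completion pairs for ordinary supervised fine-tuning. All names are hypothetical.

```python
from statistics import mean
from typing import Callable, Dict, List, Optional, Sequence


def aggregate(scores: Dict[str, float]) -> float:
    """Assumed aggregation: unweighted mean over reward dimensions."""
    return mean(list(scores.values()))


def select_golden(candidates: Sequence[str],
                  reward_fn: Callable[[str], Dict[str, float]],
                  min_reward: float) -> Optional[str]:
    """Return the highest-reward candidate, or None if none clears the bar."""
    scored = [(aggregate(reward_fn(c)), c) for c in candidates]
    best_reward, best = max(scored, key=lambda pair: pair[0])
    return best if best_reward >= min_reward else None


def build_sft_examples(sessions: List[dict],
                       reward_fn: Callable[[str], Dict[str, float]],
                       min_reward: float = 6.0) -> List[dict]:
    """Keep one selected trajectory per session; rejected sessions add nothing."""
    examples = []
    for session in sessions:
        golden = select_golden(session["candidates"], reward_fn, min_reward)
        if golden is not None:
            examples.append({"prompt": session["context"], "completion": golden})
    return examples


# Toy reward table standing in for a reward model.
toy_rewards = {
    "generic reassurance": {"counselor": 5.0, "client": 5.0},
    "goal review plus one small task": {"counselor": 8.0, "client": 6.0},
    "advice dump": {"counselor": 3.0, "client": 4.0},
}
toy_sessions = [
    {"context": "session 2", "candidates": ["generic reassurance",
                                            "goal review plus one small task"]},
    {"context": "session 3", "candidates": ["advice dump"]},
]
sft_examples = build_sft_examples(toy_sessions, toy_rewards.__getitem__)
print(sft_examples)
```

Because training only ever sees selected trajectories, the reward function effectively decides what the model becomes. That is why the evaluator design deserves as much scrutiny as the model.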

The distinction is important. Many agent systems today are elaborate prompt-and-retrieval machines. They can appear adaptive because they retrieve different context each time. But when the retrieval disappears, the agent’s underlying behavior may not have changed much.

PsychAgent attempts to reduce that dependence. Skills remain explicit, but successful use of those skills is gradually distilled into the model weights. The paper describes this as turning external skills into endogenous intuition. That phrase sounds a bit grand, but the underlying idea is practical: if a pattern keeps working, the system should need less scaffolding to reproduce it.

For businesses, this is the difference between a playbook that must always be pasted into the prompt and a model that has absorbed enough of the playbook to act competently with lighter guidance. The former is easier to audit. The latter may scale better. Sensible deployment would want both: internalized competence with an external skill layer that remains inspectable. Nobody should be volunteering for fully invisible organizational learning. That is how governance teams develop migraines.

The main evidence supports the loop, not just model size

The experimental setup uses PsychEval, a multi-session and multi-therapy benchmark. The authors sample client profiles across five therapeutic schools and evaluate both counselor-side and client-side metrics. The reported baselines include strong general-purpose models and psychology-specific systems, including TheraMind, a longitudinal counseling agent.

The headline results are clear. PsychAgent reports the best scores across four aggregated dimensions:

Model	Counselor shared	Counselor specific	Client shared	Client specific
Qwen3-Max	5.88	7.74	5.41	7.81
TheraMind	6.25	6.94	5.48	7.83
PsychAgent	7.32	7.91	5.92	8.24

Against Qwen3-Max, the reported gains are +1.44, +0.17, +0.51, and +0.43 across the four dimensions. Against TheraMind, the gains are +1.07, +0.97, +0.44, and +0.41.
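The gains are plain differences over the table rows and can be reproduced directly. The dictionary below copies the three rows shown above; the function name is illustrative.

```python
# Rows: counselor shared, counselor specific, client shared, client specific.
scores = {
    "Qwen3-Max":  [5.88, 7.74, 5.41, 7.81],
    "TheraMind":  [6.25, 6.94, 5.48, 7.83],
    "PsychAgent": [7.32, 7.91, 5.92, 8.24],
}


def gains_over(baseline: str, system: str = "PsychAgent") -> list:
    """Per-dimension difference, rounded to the table's two decimals."""
    return [round(s - b, 2) for s, b in zip(scores[system], scores[baseline])]


for name in ("Qwen3-Max", "TheraMind"):
    print(name, gains_over(name))
```

The smallest margin is on counselor-specific scores against Qwen3-Max, so the advantage over the strongest general-purpose baseline is concentrated in the shared dimensions.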

The 8B version of PsychAgent also performs competitively, which matters because otherwise the result could be dismissed as another “larger model wins, researchers discover scale” moment. The paper’s interpretation is that the framework contributes beyond raw model size.

The more revealing evidence comes from ablation.

Variant	Counselor shared	Counselor specific	Client shared	Client specific	Purpose
Full PsychAgent	7.32	7.91	5.92	8.24	Main system result
Without Memory-Augmented Planning	7.07	7.66	5.86	8.03	Component ablation
Without Skill Evolution	7.02	7.52	5.61	7.90	Component ablation
Without Reinforced Internalization	7.05	7.67	5.67	7.89	Component ablation

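All three removals hurt performance. Removing the Skill Evolution Engine produces the largest drop in three of the four columns (0.30, 0.39, and 0.31 points); on client-specific scores, removing internalization costs marginally more (0.35 versus 0.34). Removing memory and planning also hurts, but the least on average, and barely at all on client-shared scores (0.06).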
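The ablation table can be turned into per-column drops to see which component matters most where. The numbers are copied from the table above; the ranking by mean drop is a simple summary I chose for illustration, not a statistic reported by the paper.

```python
COLUMNS = ["counselor shared", "counselor specific", "client shared", "client specific"]
FULL = [7.32, 7.91, 5.92, 8.24]
ABLATIONS = {
    "Memory-Augmented Planning":  [7.07, 7.66, 5.86, 8.03],
    "Skill Evolution":            [7.02, 7.52, 5.61, 7.90],
    "Reinforced Internalization": [7.05, 7.67, 5.67, 7.89],
}

# Drop = full-system score minus score with that component removed.
drops = {name: [full - ablated for full, ablated in zip(FULL, row)]
         for name, row in ABLATIONS.items()}

# Which removal hurts most in each column?
largest_per_column = [max(drops, key=lambda name: drops[name][i])
                      for i in range(len(COLUMNS))]

# Components ordered by average drop across the four columns.
by_mean_drop = sorted(drops, key=lambda name: sum(drops[name]) / len(COLUMNS),
                      reverse=True)

for column, name in zip(COLUMNS, largest_per_column):
    print(f"{column}: largest drop when removing {name}")
print("ranking by mean drop:", by_mean_drop)
```

Skill Evolution leads in three columns, while internalization edges it out on client-specific scores by a hundredth of a point, a margin too small to read much into.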

This supports the article’s central reading: the advantage is not merely remembering more client history. It is the combination of continuity, skill extraction, and internalization.

The human evaluation gives a separate check. Across 522 matched dialogues, two human raters and one Gemini-3 LLM rater rank PsychAgent first, Qwen3-Max second, and TheraMind third. PsychAgent’s average scores are 4.295 and 4.370 from the two human raters, compared with 3.943 and 4.024 for Qwen3-Max, and 3.743 and 3.649 for TheraMind.

The appendix makes this evidence more interpretable. The human-human Quadratic Weighted Kappa is 0.675 overall, while the LLM rater’s agreement with the two human raters is 0.770 and 0.877. The authors also note ceiling effects: most ratings are high, especially in perception-related dimensions. So the human evaluation is supportive, but not a magical gold standard. It is a useful triangulation, not a final clinical verdict.
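For readers unfamiliar with the statistic, Quadratic Weighted Kappa measures agreement between two raters on an ordinal scale and penalizes disagreements by the square of their distance. The sketch below is a plain-Python implementation of the standard formula. The 1 to 5 rating scale in the usage lines is an assumption for the example, and the ratings are made up rather than drawn from the paper.

```python
from collections import Counter
from typing import Sequence


def quadratic_weighted_kappa(a: Sequence[int], b: Sequence[int],
                             min_rating: int, max_rating: int) -> float:
    """QWK = 1 - sum(w * observed) / sum(w * expected), w = (i - j)^2 / (k - 1)^2."""
    if len(a) != len(b) or not a:
        raise ValueError("rating lists must be non-empty and equally long")
    k = max_rating - min_rating + 1
    if k < 2:
        raise ValueError("need at least two rating categories")
    n = len(a)
    observed = Counter(zip(a, b))          # joint counts of (rater A, rater B)
    hist_a, hist_b = Counter(a), Counter(b)
    num = den = 0.0
    for i in range(min_rating, max_rating + 1):
        for j in range(min_rating, max_rating + 1):
            weight = (i - j) ** 2 / (k - 1) ** 2
            num += weight * observed[(i, j)]
            den += weight * hist_a[i] * hist_b[j] / n   # expected under independence
    if den == 0:
        return 1.0   # degenerate case: both raters used one identical value
    return 1.0 - num / den


same = quadratic_weighted_kappa([1, 2, 3, 4, 5], [1, 2, 3, 4, 5], 1, 5)
reversed_order = quadratic_weighted_kappa([1, 2, 3], [3, 2, 1], 1, 3)
print(same, reversed_order)
```

A value of 1 means perfect agreement, 0 means agreement no better than chance, and negative values mean systematic disagreement. Ceiling effects matter here because when most ratings cluster near the top of the scale, the chance-agreement term rises and kappa becomes harder to interpret.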

The appendix prevents the strongest overclaim

The appendix is not decorative. It keeps the paper honest.

If we only read the main results, we might say: PsychAgent learns new counseling skills and generalizes across therapies. That sentence is technically tempting and editorially dangerous.

The appendix shows a more careful version:

Claim	Evidence	Better interpretation
The system extracts new skills	5,923 of 9,701 atomic skills are classified as practice-grounded units	The system often creates operationally enriched skills, not entirely new therapy techniques
Internalization changes behavior	Recurrent functional packages appear after internalization	The model reorganizes known micro-skills into stable intervention units
Some patterns transfer across therapies	Functional families such as Goal-to-Practice Bridge and Progress Consolidation Package recur across orientations	Transfer appears at the level of intervention logic, not full therapy-agnostic competence
Human evaluation supports the ranking	Human and LLM raters rank PsychAgent first	The ranking is stable, but score ceilings and rater-style differences limit interpretation

This is exactly the kind of distinction AI commentary often loses. The paper does not prove that an autonomous AI therapist is ready for real-world practice. It does suggest that domain agents can improve when they convert interaction history into structured, reusable, and partially internalized skills.

That is a meaningful result without needing to dress it up as science fiction.

The business lesson is experience-to-skill conversion

For Cognaptus readers, the paper’s most transferable lesson is not “deploy AI counselors.” Please do not read one benchmark paper and conclude that your next SaaS feature should be a therapist with a pastel avatar. The market already has enough soft-gradient liability machines.

The transferable pattern is experience-to-skill conversion.

Enterprise AI systems increasingly sit inside workflows where repeated cases generate local know-how. Customer support teams learn which escalation phrasing prevents churn. Sales teams learn which objection-handling sequence works for a given buyer profile. Compliance teams learn which review patterns catch problems earlier. Operations teams learn which exception-handling rule prevents a process from stalling.

Most AI systems do not capture that learning well. They may store chats, tickets, and documents, but storage does not automatically become capability.

PsychAgent suggests a different architecture:

PsychAgent concept	Enterprise translation	ROI relevance	Governance risk
Evolving client profile	Dynamic account, case, or workflow profile	Less repetition, better continuity	Privacy and data minimization
Session planning	Next-best-action planning	More coherent task progression	Over-automation of judgment
Atomic skill extraction	Playbook rule discovery from successful cases	Faster transfer of frontline know-how	Skill drift and weak validation
Skill hierarchy	Structured process knowledge graph	Reusable training and execution assets	Taxonomy maintenance burden
Rejection fine-tuning	Train on selected successful trajectories	Reduced prompt dependence over time	Harder auditability after internalization
Human/LLM evaluation	Scalable quality review	Cheaper monitoring and improvement	Evaluator bias and metric gaming

The most valuable part is not replacing human expertise. It is capturing the small execution patterns that human teams already discover but rarely formalize.

In many organizations, the difference between mediocre automation and useful automation is not the grand strategy. It is whether the system knows the tiny rule: when the user hesitates here, lower the activation threshold; when the client rejects the standard template, offer a minimum viable action; when the process becomes aversive, add a pause boundary before pushing harder.

That is not glamorous. It is where work actually happens.

The governance problem moves from prompts to learned behavior

Self-evolving systems create a governance shift.

A static prompt can be reviewed. A retrieved document can be traced. A skill library can be inspected. A fine-tuned behavior is harder to audit because part of the decision logic has moved into model weights.

PsychAgent’s architecture therefore raises a design question that applies far beyond counseling: how much should an enterprise system internalize, and how much should it keep explicit?

For low-risk domains, more internalization may improve speed and quality. For high-risk domains, especially health, finance, law, security, and compliance, the system should probably preserve an explicit skill layer even if the model has absorbed parts of it. The skill layer becomes a control surface: inspectable, versioned, rollbackable, and tied to evaluation data.

A practical self-evolving enterprise agent should therefore include at least four governance layers:

1. Experience selection: not every successful-looking trajectory should become training data.
2. Skill validation: extracted skills need human or rule-based review before entering the library.
3. Internalization boundaries: only certain skill families should be eligible for fine-tuning.
4. Post-update evaluation: every model update should be tested against regression suites, edge cases, and safety constraints.
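The four layers above can be sketched as two small gates: a filter on what may enter the fine-tuning set, and an all-or-nothing check before an updated model ships. The data model, field names, thresholds, and family labels are hypothetical and exist only to show where each layer would sit.

```python
from dataclasses import dataclass
from typing import Callable, List, Set


@dataclass
class Candidate:
    trajectory: str
    reward: float
    skill_family: str
    reviewed: bool   # True once human or rule-based review has passed


def gate_training_data(candidates: List[Candidate],
                       min_reward: float,
                       allowed_families: Set[str]) -> List[Candidate]:
    """Layers 1 to 3: experience selection, skill validation, internalization boundary."""
    return [c for c in candidates
            if c.reward >= min_reward                    # experience selection
            and c.reviewed                               # skill validation
            and c.skill_family in allowed_families]      # internalization boundary


def accept_update(regression_checks: List[Callable[[], bool]]) -> bool:
    """Layer 4: an update ships only if every post-update check passes."""
    return all(check() for check in regression_checks)


pool = [
    Candidate("t1", reward=8.1, skill_family="goal-setting", reviewed=True),
    Candidate("t2", reward=8.4, skill_family="crisis-response", reviewed=True),
    Candidate("t3", reward=7.9, skill_family="goal-setting", reviewed=False),
    Candidate("t4", reward=4.0, skill_family="goal-setting", reviewed=True),
]
eligible = gate_training_data(pool, min_reward=7.0, allowed_families={"goal-setting"})
ship = accept_update([lambda: True, lambda: len(eligible) > 0])
print([c.trajectory for c in eligible], ship)
```

Note that the highest-reward trajectory in the toy pool is excluded because its skill family is outside the allowed set. A high score alone does not authorize internalization.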

PsychAgent points toward self-improving agents. It also reminds us why “self-improving” should not mean “self-authorizing.” The former is engineering. The latter is a lawsuit with a progress bar.

Where the paper stops: benchmark success is not clinical proof

The paper’s evidence is meaningful, but the boundary is equally important.

First, the reported gains are benchmark gains. PsychEval is designed for multi-session and multi-therapy evaluation, which makes it relevant, but it is still not the same as real clinical deployment. Real clients bring risk, ambiguity, crisis situations, incomplete disclosure, cultural context, and long-term outcome uncertainty. Benchmarks can approximate some of this. They cannot license clinical trust.

Second, the reward model and evaluator design matter. If the system learns from selected high-reward trajectories, then the definition of reward becomes a central source of power and bias. The paper’s human evaluation helps, but the appendix also shows rater-style differences and ceiling effects. Those are not fatal problems; they are reminders that evaluator choice shapes the learning loop.

Third, internalization makes behavior harder to inspect. A skill extracted into a library is visible. A behavior absorbed through fine-tuning is less visible. For counseling, that matters. For enterprise AI, it also matters whenever decisions affect money, rights, safety, or compliance.

Fourth, the paper supports a conservative form of skill evolution. The model creates operationally richer skills and recombines known micro-skills into functional packages. It does not demonstrate the discovery of entirely new therapeutic paradigms. That is not a weakness. It is probably why the result is useful.

The durable idea is not AI therapy; it is organizational learning in model form

PsychAgent is valuable because it shifts the question.

The old question was: can we fine-tune a model to imitate expert counseling dialogues?

The better question is: can an AI system accumulate experience, abstract what worked, organize it into reusable skills, and internalize selected patterns without losing auditability?

The paper’s answer is early but concrete. Memory gives continuity. Skill evolution turns experience into operational knowledge. Internalization reduces dependence on external scaffolding. The ablation results suggest that the full loop performs better than the loop with any one component removed, and the appendix shows that the strongest defensible claim is practical skill refinement rather than grand conceptual invention.

That is a useful direction for AI counseling. It is also a useful direction for enterprise AI.

Static systems scale yesterday’s knowledge. Experience-driven systems can, in principle, scale yesterday’s learning.

The competitive edge will not belong only to whoever owns the largest model. It will belong to whoever builds the best machinery for turning lived cases into reusable competence—and knows where to stop before the machinery starts teaching itself nonsense with confidence.

A little evolution is powerful. Unsupervised evolution in production is just chaos wearing a lab badge.

Cognaptus: Automate the Present, Incubate the Future.

¹ Yutao Yang et al., “PsychAgent: An Experience-Driven Lifelong Learning Agent for Self-Evolving Psychological Counselor,” arXiv:2604.00931v3, 28 April 2026, https://arxiv.org/html/2604.00931.
