Empathy is easy to fake for one sentence.
A chatbot can say “that sounds exhausting” without knowing anything about you, your situation, your city, your time zone, or whether the advice it is about to give is physically possible. That is the awkward part of emotional support AI: the tone can be soft while the facts are made of air. A very caring assistant can still recommend a midnight walk at 3 p.m., suggest a closed café, or confidently invent local details because it wants to be helpful. The kindness is real enough in style. The grounding is not.
The paper behind this article introduces TEA-Bench, a benchmark for tool-enhanced emotional support dialogue agents [1]. Its core argument is simple but useful: emotional support is not only about affective language. Sometimes support also needs instrumental grounding: time, place, weather, nearby resources, comparable experiences, or other contextual facts that make advice practical rather than decorative.
That sounds obvious until one notices how many AI products still treat empathy as a writing style. Add a warmer system prompt. Add “I hear you.” Add a safety disclaimer. Sprinkle validation like parsley. Then hope the model does not hallucinate its way into helpfulness.
TEA-Bench asks a more operational question: what happens when emotional support agents are allowed to use tools, and how do we evaluate whether those tools actually make support better?
The important word is “allowed.” The benchmark does not simply hand the model a retrieved answer and ask it to paraphrase. The agent must decide whether to use a tool, which tool to use, how often to use it, and how to fold the result into a natural reply. That makes the paper more interesting than another “tools improve LLMs” story. It is really about discretion: when to look things up, when to stay emotionally present, and when not to turn a distressed user into a search task.
The problem is not emotion versus facts; it is unsupported specificity
The usual split between “emotional” and “practical” support is too neat. In real conversations, the two often depend on each other.
Affective support says: “That sounds draining.” Instrumental support says: “There is a quiet park three minutes away, and the weather is still mild enough for a short walk.” Bad instrumental support says the same thing, except the park does not exist, the weather is wrong, and the user is left wondering whether the assistant is comforting them or improvising geography.
TEA-Bench starts from this gap. Existing emotional support conversation systems and benchmarks have mainly focused on text-only empathy: whether the model sounds understanding, coherent, human-like, and helpful. Tool-use benchmarks, on the other hand, usually focus on task completion: whether an agent calls the right API and completes the instruction. Emotional support sits between those worlds. It is open-ended, relational, and sensitive to timing, but it can still require external facts.
The paper’s mechanism can be summarized like this:
Emotional scenario
↓
Latent time / place / user type
↓
Scenario-grounded tool environment
↓
Agent decides whether to call tools
↓
Agent replies naturally
↓
Hallucination detector checks factual grounding
↓
Simulated user reacts across turns
↓
Dialogue-level TEA score + factuality metrics
This pipeline matters because the benchmark is not evaluating a single reply in isolation. It evaluates a dialogue process. A good support agent should not only sound empathetic once. It should maintain support quality while handling uncertainty, user reactions, practical suggestions, and factual claims across turns. A small burden, apparently. Almost like conversation is not a multiple-choice exam.
How TEA-Bench builds a world for emotional support agents
TEA-Bench contains 81 grounded emotional support scenarios adapted from ExTES. The authors filter for richer scenarios, generate latent situational context, ground that context through map-based APIs, and manually validate the resulting cases. The latent context includes details such as local time, city-level location, and place type. These details are not simply dumped into the user’s opening message. Instead, they become retrievable context that tools can access.
That design choice is important. In a real deployment, an assistant may have access to device-level time, approximate location, calendar context, local weather, policy documents, or a company resource directory. The user may not spell out every relevant fact. A support agent must infer when a quick lookup would improve the reply.
The tool environment includes 31 tools across seven categories:
| Tool category | What it contributes to support | Practical role |
|---|---|---|
| Utils | Scenario time, location, webpage text | Basic context retrieval |
| Map | Nearby places, routes, reachable areas, location info | Concrete local suggestions |
| Weather | Current and forecast weather | Feasibility of outdoor or location-based advice |
| Reddit | Posts, subreddits, comments, communities | Shared experiences and social resonance |
| News | Events and themes | Situational awareness |
| Wikipedia | Summaries, sections, full content | Background knowledge |
| Music | Artists, releases, recordings | Affective recommendation |
The tools run in a scenario-aware way. Time-sensitive tools use the scenario timestamp rather than the system clock, which keeps evaluation reproducible while preserving temporal realism. This is not just a technical convenience. It prevents one model from being evaluated in a different “world” than another because the real clock changed between runs.
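A minimal sketch of the scenario-clock idea, assuming a simple in-memory scenario object. This is not the benchmark's actual code: the `Scenario` class, the `get_scenario_time` and `is_late_night` helpers, and the example values are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Scenario:
    """Hypothetical container for a scenario's latent context."""
    scenario_id: str
    local_time: datetime  # fixed when the scenario is built, never updated
    city: str


def get_scenario_time(scenario: Scenario) -> str:
    """Time tool: answers from the scenario, never from datetime.now()."""
    return scenario.local_time.isoformat()


def is_late_night(scenario: Scenario) -> bool:
    """Example of a time-sensitive check driven by the scenario clock."""
    hour = scenario.local_time.hour
    return hour >= 23 or hour < 5


# Illustrative values only; not taken from the benchmark.
demo = Scenario(
    scenario_id="demo-001",
    local_time=datetime(2025, 3, 14, 23, 30, tzinfo=timezone(timedelta(hours=8))),
    city="Example City",
)
```

Because nothing here reads the wall clock, two models evaluated weeks apart would see the same "now" for the same scenario.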
The agent does not receive hand-holding about which tool to use. Tool usage is optional and unsupervised. At each turn, the agent can respond directly or call one or more tools before replying. The user only sees the final natural-language response, not the internal tool calls. That mirrors real product design: the user does not care whether the assistant checked a map, a database, or a corporate policy file. They care whether the answer is grounded and useful.
The benchmark judges both warmth and provenance
TEA-Bench evaluates two broad things: emotional support quality and factual grounding.
For support quality, the paper uses five TEA dimensions. Four come from ESC-Eval: Diversity, Fluency, Humanoid, and Information. TEA-Bench adds Effectiveness, which measures whether the agent’s suggestions are accepted and meaningfully integrated by the user. That addition is sensible because emotional support is not a product brochure. The advice has to land.
For factual grounding, the benchmark tracks three dialogue-level metrics:
| Metric | Meaning | Why it matters |
|---|---|---|
| Factual content ratio | Share of agent responses containing factual claims | Measures how much concrete information the agent introduces |
| Hallucination ratio | Share of responses containing hallucinated factual content | Measures how often the agent says unsupported factual things |
| Hallucination rate | Hallucinated factual content relative to all factual content | Measures reliability among the factual claims the agent does make |
A factual claim is counted as hallucinated if it cannot be traced to either user-provided information or tool observations. The hallucination detector excludes generic emotional validation and general coping suggestions that do not rely on external world states. That distinction is crucial. “Take a deep breath” is not hallucinated merely because no tool confirmed the oxygen supply. But “there is a quiet café two blocks away” had better come from somewhere.
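The three grounding metrics can be sketched as simple ratios over labeled agent turns. This is an illustration under stated assumptions, not the paper's implementation: the `LabeledResponse` type and `grounding_metrics` function are hypothetical, the labels are assumed to come from some upstream detector, and hallucination rate is computed here at the response level, whereas the paper may count individual claims.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class LabeledResponse:
    """One agent turn, already labeled by a (hypothetical) detector."""
    has_factual_claim: bool
    has_hallucination: bool  # only meaningful when has_factual_claim is True


def grounding_metrics(responses: Iterable[LabeledResponse]) -> dict:
    """Compute the three dialogue-level ratios for one dialogue."""
    turns = list(responses)
    total = len(turns)
    factual = sum(r.has_factual_claim for r in turns)
    hallucinated = sum(r.has_factual_claim and r.has_hallucination for r in turns)
    return {
        # share of all responses that contain factual claims
        "factual_content_ratio": factual / total if total else 0.0,
        # share of all responses that contain hallucinated content
        "hallucination_ratio": hallucinated / total if total else 0.0,
        # hallucinated responses relative to factual responses (assumption)
        "hallucination_rate": hallucinated / factual if factual else 0.0,
    }


dialogue = [
    LabeledResponse(False, False),  # pure validation, e.g. "take a deep breath"
    LabeledResponse(True, False),   # factual claim backed by a tool observation
    LabeledResponse(True, True),    # factual claim with no traceable source
    LabeledResponse(True, False),
]
metrics = grounding_metrics(dialogue)
```

The first turn lowers the factual content ratio but does not touch the hallucination rate, which is the point of excluding generic validation from the detector.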
The authors also validate the automatic evaluation. For TEA-Scores, they sample 150 dialogue episodes and compare automatic scores with human judgments. The overall TEA score correlates with human ratings at Spearman $\rho = 0.7448$, Pearson $r = 0.7563$, and Kendall $\tau = 0.6174$. For hallucination detection, the module reaches hallucination precision of 0.8947, recall of 0.7612, F1 of 0.8226, MCC of 0.7056, and Cohen’s Kappa of 0.6990 against human verification.
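As a quick consistency check on the detector numbers quoted above, the standard F1 definition (the harmonic mean of precision $P$ and recall $R$) reproduces the reported value:

$$
F_1 = \frac{2PR}{P + R} = \frac{2 \times 0.8947 \times 0.7612}{0.8947 + 0.7612} \approx 0.8226
$$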
That does not make LLM-as-judge magically perfect. It does make the benchmark more credible than a leaderboard where the judge is simply asked to vibe-check the vibes.
The main result: tools usually help, but not evenly
The paper evaluates nine models under two settings: without tools and with tools. The broad finding is that tool access generally improves TEA-Scores and consistently reduces hallucination-related metrics. But the size and shape of the improvement vary sharply by model.
Here are selected main results from Table 1:
| Model | AVG TEA without tools | AVG TEA with tools | Hallucination rate without tools | Hallucination rate with tools | Interpretation |
|---|---|---|---|---|---|
| GPT-4o-mini | 76.67 | 81.11 | 24.95 | 18.44 | Clear support gain, lower hallucination |
| GPT-4.1-nano | 80.15 | 81.20 | 14.51 | 11.69 | Small quality gain, already strong baseline |
| Gemini-2.5-flash | 66.73 | 77.53 | 64.92 | 21.01 | Large improvement; tools sharply reduce hallucination |
| Qwen-plus | 72.69 | 78.09 | 68.81 | 34.89 | Strong grounding benefit, still nontrivial hallucination |
| Qwen3-235B-A22B | 71.88 | 79.32 | 71.21 | 31.44 | Large improvement with frequent tool use |
| Qwen3-14B | 76.82 | 76.36 | 21.76 | 12.45 | Quality slightly drops, grounding improves |
| Qwen3-8B | 77.44 | 77.78 | 11.41 | 7.16 | Minimal quality gain, lower hallucination |
The first lesson is comforting: tools reduce hallucination across all evaluated models. Even when the empathy score barely improves, factual grounding becomes safer. This is the part product teams will like.
The second lesson is less convenient: tool access is not automatically useful for support quality. For Qwen3-14B, average TEA slightly decreases with tools, even though hallucination rate improves. Smaller or weaker models can struggle to decide when and how to use tools, or they may fail to integrate tool output in a way that feels emotionally appropriate.
That distinction should be pinned to the wall of every “just add tools” roadmap. Tools can make factual claims safer while making the conversation clumsier. These are different objectives.
More tool calls are not the same as better support
The mechanism-first reading becomes clearer in the tool usage analysis.
Figure 5 reports average tool calls per dialogue. Stronger models often use tools sparingly. GPT-4o-mini averages 0.370 tool calls per dialogue; GPT-4.1-nano averages 0.241. Mid-capability models such as Qwen-plus and Qwen3-235B-A22B call tools much more often: 3.130 and 3.198 times per dialogue, respectively. Some smaller models barely use tools at all.
The paper interprets this as a capability-dependent pattern:
| Model behavior | What it means | Product implication |
|---|---|---|
| Strong model, few calls, good gains | The model knows when a lookup is useful | Tool access can be lightweight and selective |
| Mid-capability model, many calls, good grounding gains | The model compensates through frequent retrieval | Tool cost and latency may rise, but reliability improves |
| Weak model, few calls, small gains | The model does not exploit the tool environment | Tool integration alone is insufficient |
Figure 6 adds another layer: hallucination reduction generally rises with tool usage frequency, but efficiency differs. Some models reduce hallucination substantially with relatively few calls. Others need many more tool interactions to achieve similar grounding gains. In business terms, the question is not simply “does this agent have tools?” It is “how much tool traffic does it need per reliable support outcome?”
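One rough way to make the "tool traffic per reliable outcome" question concrete is to divide the drop in hallucination rate by the average number of tool calls. The ratio below is an illustrative back-of-the-envelope measure, not a metric defined by the paper; the `reduction_per_call` helper is hypothetical, and the inputs are the figures quoted in this article.

```python
# model -> (avg tool calls per dialogue,
#           hallucination rate without tools, hallucination rate with tools)
# Values as quoted in this article.
REPORTED = {
    "GPT-4o-mini": (0.370, 24.95, 18.44),
    "GPT-4.1-nano": (0.241, 14.51, 11.69),
    "Qwen-plus": (3.130, 68.81, 34.89),
    "Qwen3-235B-A22B": (3.198, 71.21, 31.44),
}


def reduction_per_call(calls: float, before: float, after: float) -> float:
    """Hallucination-rate points removed per average tool call (illustrative)."""
    if calls <= 0:
        raise ValueError("average tool calls must be positive")
    return (before - after) / calls


efficiency = {
    model: round(reduction_per_call(*values), 2)
    for model, values in REPORTED.items()
}
```

On these numbers the ratio comes out to roughly 17.6 for GPT-4o-mini, 11.7 for GPT-4.1-nano, 10.8 for Qwen-plus, and 12.4 for Qwen3-235B-A22B. The models start from very different baselines, so this is a crude comparison, but it shows why call frequency and grounding gain should be read together.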
That matters for deployment economics. Tool calls have latency, cost, privacy, logging, permission, and UX implications. A support agent that checks five systems before saying something emotionally tone-deaf is not “agentic.” It is an intern with too many browser tabs.
Action-oriented users benefit more clearly than emotion-oriented users
The appendix provides an important robustness-style analysis by user type. This is not a second thesis; it explains where the main effect is strongest.
TEA-Bench simulates two user types:
- Action-oriented users regulate emotions through action or environmental change and may accept concrete advice quickly.
- Emotion-oriented users need to feel heard and understood before accepting practical suggestions.
For action-oriented users, tool augmentation consistently improves overall scores across models. Several models gain more than 10 points. Improvements are especially strong in Information and Effectiveness, which makes sense: these users are more receptive to actionable support, and tools make action guidance more grounded.
For emotion-oriented users, the picture is mixed. Some models improve, but others degrade, especially weaker models. The appendix suggests a plausible reason: once tools are available, weaker models may overemphasize tool-driven recommendations instead of adapting to emotional cues. The user needed acknowledgment first; the model brought a map. Very efficient. Also wrong.
This is one of the paper’s most useful product lessons. Emotional support agents need not only retrieval ability but interaction timing. For an action-oriented user, “there is a quiet park nearby” may be helpful. For an emotion-oriented user, the same suggestion may feel abrupt if it arrives before the user feels understood.
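The timing point can be sketched as a gate in front of tool-backed suggestions. Everything here is a hypothetical illustration rather than anything the paper implements: in TEA-Bench the user type is latent, so a real system would have to infer both `user_type` and the `user_feels_heard` signal from the conversation.

```python
from dataclasses import dataclass
from enum import Enum


class UserType(Enum):
    ACTION_ORIENTED = "action"
    EMOTION_ORIENTED = "emotion"


@dataclass
class DialogueState:
    user_type: UserType             # assumed known here; latent in practice
    user_feels_heard: bool = False  # hypothetical signal, e.g. from a classifier


def may_offer_concrete_suggestion(state: DialogueState) -> bool:
    """Gate tool-backed suggestions on conversational readiness."""
    if state.user_type is UserType.ACTION_ORIENTED:
        # Action-oriented users tend to accept concrete advice quickly.
        return True
    # Emotion-oriented users get acknowledgment first, suggestions later.
    return state.user_feels_heard


early = DialogueState(UserType.EMOTION_ORIENTED)
later = DialogueState(UserType.EMOTION_ORIENTED, user_feels_heard=True)
```

With this gate, `early` would get validation only, while `later` could hear about the quiet park. The design choice is that the gate controls when a suggestion is voiced, not whether a lookup happened.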
The hallucination story remains more stable: hallucination rates decrease consistently across both user types. So factual grounding improves even when emotional quality does not. Again, two metrics, not one.
TEA-Dialog shows what good tool use looks like
The authors also release TEA-Dialog, a dataset of 365 high-quality tool-enhanced dialogues. These are selected from multiple models based on TEA-Scores above 80, absence of detected hallucinations, and additional human filtering.
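The selection criteria read naturally as a filter. A minimal sketch, assuming each episode carries a TEA score, a detector flag, and a human-review flag; the `Episode` type, field names, and sample values are hypothetical, and only the "above 80" threshold comes from the description above.

```python
from dataclasses import dataclass

TEA_THRESHOLD = 80.0  # "TEA-Scores above 80", as described above


@dataclass(frozen=True)
class Episode:
    """Hypothetical record for one generated dialogue."""
    episode_id: str
    tea_score: float
    hallucination_detected: bool
    passed_human_review: bool


def keep_for_dataset(ep: Episode) -> bool:
    """Apply all three selection criteria; every one must hold."""
    return (
        ep.tea_score > TEA_THRESHOLD
        and not ep.hallucination_detected
        and ep.passed_human_review
    )


# Illustrative episodes only; not real benchmark data.
candidates = [
    Episode("a", 85.0, False, True),   # kept
    Episode("b", 92.0, True, True),    # high score, but hallucinated
    Episode("c", 79.5, False, True),   # grounded, but below threshold
    Episode("d", 88.0, False, False),  # rejected by human filtering
]
selected = [ep.episode_id for ep in candidates if keep_for_dataset(ep)]
```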
The dataset analysis gives a useful sketch of good tool use over a conversation. Early stages often rely on utilities and contextual tools to establish situational grounding. Then map and weather tools appear when concrete environmental suggestions become useful. Later stages may involve more personalized tools such as music or news.
That sequence is operationally intuitive:
First: understand where and when the user is
Then: assess feasible options
Then: personalize the support
Finally: stop using tools when emotional presence matters more
TEA-Dialog is heavily action-oriented: 320 action-oriented dialogues versus 45 emotion-oriented dialogues. Emotion-oriented dialogues are longer on average, with 13.47 turns compared with 8.73 for action-oriented dialogues, and longer user and model utterances. This fits the underlying interaction pattern: some users want a path; others need the conversation itself to do more work.
For business builders, this implies that a single “support bot” metric may hide important segment differences. A student advising bot, employee wellbeing assistant, customer frustration handler, and local resource recommender may all require different balances of validation, tool use, and action guidance.
Fine-tuning helps in-domain but does not solve robustness
The paper’s final experiment tests whether supervised fine-tuning on TEA-Dialog improves tool-enhanced emotional support. The authors fine-tune Qwen3-8B and Qwen3-14B using LoRA on dialogues from the first 60 scenarios, then evaluate on all 81 scenarios, separating in-domain and out-of-domain cases.
The results are encouraging but not clean.
For Qwen3-8B, the base model has an average TEA score of 77.78. After fine-tuning on TEA-Dialog, the overall score rises to 78.52. On in-domain scenarios (TEA-ID), it reaches 79.92. But on out-of-domain scenarios (TEA-OOD), it falls to 74.52, and hallucination rate jumps to 24.39, compared with 7.16 for the base model and 4.61 for TEA-ID.
For Qwen3-14B, the pattern is less severe. The base model scores 76.36, fine-tuning raises the overall score to 77.99, and the out-of-domain score (TEA-OOD) holds at 77.02. But hallucination rate still rises out of domain, from 12.45 for the base model to 18.89 for TEA-OOD.
So the fine-tuning experiment supports two conclusions at once:
| Finding | What it supports | What it does not prove |
|---|---|---|
| SFT improves TEA-Scores, especially Information and Effectiveness | TEA-Dialog contains useful behavioral patterns | That small supervised datasets create robust support agents |
| In-domain performance is stronger than OOD performance | Models learn scenario-specific patterns | That learned tool behavior generalizes safely |
| OOD hallucination can rise | More factual content creates more chances to be wrong | That tool training is inherently unsafe |
The business translation is plain: fine-tuning can teach a model the style of grounded support, but it can also teach it to be more assertive with facts. If the model generalizes poorly, assertiveness becomes hallucination with better manners. Lovely packaging, still fragile.
What this means for business support agents
The direct claim of the paper is about a benchmark and experimental results. The broader business inference is about design discipline.
Tool-augmented emotional support has obvious applications: customer support escalation, employee wellbeing triage, education advising, travel disruption support, community management, and non-clinical wellness copilots. In all these settings, users often need both validation and situated guidance. “I’m sorry your flight was canceled” is affective. “There are two available rebooking options and a lounge nearby that closes at 10 p.m.” is instrumental. A good system needs both.
But TEA-Bench suggests that “both” is not achieved by blindly attaching tools to a chatbot. The operational design needs at least four layers:
| Design layer | Practical requirement | Why TEA-Bench makes it visible |
|---|---|---|
| Context access | Time, location, policy, resource, or environment data must be retrievable | Instrumental support depends on facts |
| Tool discretion | The model must decide when not to use tools | Emotion-oriented users can be harmed by premature advice |
| Grounding audit | Factual claims should be traceable to user input or tool output | Hallucination reduction is measurable |
| Outcome-sensitive evaluation | Judge full dialogue effects, not isolated replies | Support quality depends on user reaction across turns |
This is where the paper becomes useful beyond emotional support. Many enterprise AI agents face the same structure. A sales assistant, claims handler, HR advisor, or technical support copilot must combine tone, facts, and procedural judgment. The question is rarely “can the model call an API?” The question is whether it can use external information without turning the interaction into a brittle workflow or a confident fiction.
Cognaptus would infer three practical design rules from this paper.
First, instrumental support should be modular. The system should distinguish validation, context retrieval, recommendation, and factual audit. Otherwise every reply becomes a blended soup of empathy and unverified specificity.
Second, tool use should be evaluated by efficiency and timing, not just frequency. More calls may reduce hallucination, but they also create cost and UX risks. Fewer calls are better only if the model knows when to call. Silence is not discretion when the model simply failed to use the tools.
Third, fine-tuning should be treated as behavior shaping, not safety proof. TEA-Dialog improves in-domain support, but OOD hallucination risk shows why tool-trained agents need adversarial scenario testing, retrieval provenance checks, and deployment monitoring.
The boundary: TEA-Bench is not a clinical safety certificate
The paper’s limitations matter because emotional support is a high-trust domain.
TEA-Bench relies on simulated users. That enables controlled, scalable, reproducible evaluation, but simulated users cannot fully represent real human unpredictability, distress, cultural variation, or long-term relationship dynamics. The evaluation covers short to medium-length interactions, not sustained support across weeks or months. Long-term trust formation, memory, adaptation, and dependency are outside its scope.
The benchmark also relies heavily on automatic evaluation, even though the authors validate parts of it with human annotation. The correlations and hallucination verification are useful, but they do not eliminate evaluator bias or the difficulty of judging emotional support quality automatically.
Most importantly, TEA-Bench studies emotional support conversation, not clinical therapy. It should inform product evaluation and risk controls for support agents. It should not be read as evidence that a tool-augmented chatbot can replace professional mental-health care. That would be a very on-brand hallucination.
Empathy with infrastructure
The central lesson of TEA-Bench is not that emotional support agents need more tools. It is that empathy becomes more trustworthy when the system can separate three things: emotional presence, external facts, and the decision of when facts are actually useful.
The strongest models in the benchmark do not merely call tools. They call them selectively. Mid-capability models can improve by calling tools more often, but that raises questions of cost, latency, and conversational tact. Weaker models may gain factual grounding while losing emotional appropriateness. Fine-tuning improves familiar scenarios, but distribution shift can make the agent more factual-sounding and more wrong.
That is the real business lesson. The future of support agents is not a chatbot with a map pasted onto it. It is an interaction architecture where tools are used quietly, facts are audited, user reactions matter, and advice arrives only when the conversation is ready for it.
Empathy, in other words, needs infrastructure. Not a bigger sympathy template. Not a motivational quote generator with API access. A system that knows when to listen, when to check, and when to shut up. Progress, at last.
Cognaptus: Automate the Present, Incubate the Future.
[1] Xingyu Sui, Yanyan Zhao, Yulin Hu, Jiahe Guo, Weixiang Zhao, and Bing Qin, “TEA-Bench: A Systematic Benchmarking of Tool-enhanced Emotional Support Dialogue Agent,” arXiv:2601.18700v2, 2026. https://arxiv.org/abs/2601.18700