Opening — Why this matters now

Emotional support from AI has quietly moved from novelty to expectation. People vent to chatbots after work, during grief, and in moments of burnout—not to solve equations, but to feel understood. Yet something subtle keeps breaking trust. The responses sound caring, but they are often wrong in small, revealing ways: the time is off, the location is imagined, the suggestion doesn’t fit reality. Empathy without grounding turns into polite hallucination.

This is the gap TEA‑Bench sets out to measure.

Background — Empathy alone isn’t enough

Emotional Support Conversation (ESC) research traditionally focuses on affective support: empathy, validation, and tone. That made sense when systems were text‑only. But real emotional support also relies on instrumental support—concrete, situationally correct guidance. Suggesting a walk only works if it is actually daytime, there is somewhere walkable nearby, and the suggestion is feasible.

Prior benchmarks rarely test this. Tool‑use benchmarks focus on task completion. ESC benchmarks focus on warmth. The uncomfortable middle—grounded empathy across multiple turns—has been largely ignored.

Analysis — What TEA‑Bench actually tests

TEA‑Bench reframes emotional support as an interactive, tool‑augmented agent problem.

Instead of grading isolated replies, it evaluates full dialogues where an agent may (or may not) use external tools—maps, weather, news, time, location—to ground its support. The user never sees the tools. They only react to the outcome, just like real life.
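
To make that setup concrete, here is a minimal sketch of what one turn in such a harness could look like. Everything below is illustrative: the names (ToolCall, run_turn, plan_tool_calls, and so on) are placeholders for exposition, not TEA‑Bench's actual interfaces.

```python
# Illustrative sketch of one tool-augmented support turn; names are hypothetical,
# not TEA-Bench's actual interfaces.
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class ToolCall:
    name: str      # e.g. "weather", "local_time", "maps"
    args: dict
    result: str    # what the tool returned; this becomes grounded evidence


@dataclass
class AgentTurn:
    reply: str                                                 # the only thing the user sees
    tool_calls: List[ToolCall] = field(default_factory=list)   # hidden from the user, kept for evaluation


def run_turn(agent, user_message: str, tools: Dict[str, Callable[..., str]],
             history: List[str]) -> AgentTurn:
    """One turn of the dialogue: the agent may consult tools, but only its reply reaches the user."""
    requested: List[Tuple[str, dict]] = agent.plan_tool_calls(user_message, history, list(tools))
    calls = []
    for name, args in requested:                 # possibly empty; restraint is allowed
        calls.append(ToolCall(name, args, result=tools[name](**args)))
    reply = agent.respond(user_message, history, evidence=calls)
    return AgentTurn(reply=reply, tool_calls=calls)
```

The asymmetry is the point: tool calls and their results stay in the evaluation trace, while only the reply ever reaches the user.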

Three design choices matter:

  1. Scenario realism. Emotional situations are adapted from real ESC datasets, then enriched with latent but retrievable context: local time, city, environment type.

  2. Process‑level evaluation. Dialogues are scored holistically on five dimensions—Diversity, Fluency, Humanoid quality, Information value, and Effectiveness—capturing whether advice is actually accepted.

  3. Explicit hallucination accounting. A Hallucination Detection Module flags any factual claim that cannot be traced to user input or tool results (a rough sketch of the idea follows this list). Sounding confident no longer earns a free pass.
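
The post cannot reproduce the module itself, but the underlying check is simple to state: collect everything the agent was entitled to assert (the user's own words plus tool outputs), then flag any claim in the reply that none of that evidence supports. The sketch below illustrates that reading; the claim extraction and the support test are hypothetical stand-ins, not the benchmark's implementation.

```python
# Rough illustration of tracing claims back to evidence; not TEA-Bench's actual module.
from typing import Callable, List


def flag_hallucinations(reply_claims: List[str], user_statements: List[str],
                        tool_results: List[str],
                        is_supported: Callable[[str, str], bool]) -> List[str]:
    """Return the factual claims in a reply that no grounded source supports.

    reply_claims    : claims extracted from the agent's reply (extraction step not shown)
    user_statements : what the user actually said earlier in the dialogue
    tool_results    : strings returned by this dialogue's tool calls
    is_supported    : claim-vs-evidence test, e.g. an NLI or LLM entailment judge
    """
    evidence = list(user_statements) + list(tool_results)
    return [claim for claim in reply_claims
            if not any(is_supported(claim, ev) for ev in evidence)]


# Toy support test (word overlap) standing in for a real entailment model.
def toy_support(claim: str, ev: str) -> bool:
    overlap = set(claim.lower().split()) & set(ev.lower().split())
    return len(overlap) >= max(2, len(claim.split()) // 2)


claims = ["it is sunny outside right now", "there is a park two blocks away"]
user = ["I have been stuck indoors all day"]
tools = ["weather says it is sunny and daytime right now"]
print(flag_hallucinations(claims, user, tools, toy_support))
# -> ['there is a park two blocks away']  (no maps lookup was made, so the claim is ungrounded)
```

Swapping the toy word-overlap test for an NLI or LLM-based entailment judge is the obvious upgrade; the tracing logic stays the same.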

In short, TEA‑Bench tests whether an agent knows when to look things up—and whether that restraint improves trust.

Findings — Tools help, but not equally

The headline result is simple: tool access reduces hallucination and improves emotional support quality. But the details are less flattering.

1. Capability matters more than access

Stronger models use tools selectively. They make fewer tool calls but integrate the results cleanly, improving both empathy and factual grounding.

Mid‑tier models compensate with frequency—more tool calls, decent gains.

Weaker models mostly fail to benefit. Tool access alone does not teach judgment.

2. Fewer hallucinations, consistently

Across all models, hallucinated factual content drops when tools are enabled. Even when empathy scores barely improve, grounding improves. This is one of the clearest quantitative links between tool use and trustworthiness seen so far.

3. User type changes everything

Action‑oriented users—those open to practical steps—benefit the most. Tools shine when advice is welcome.

Emotion‑oriented users are trickier. Over‑eager tool use can actually hurt perceived empathy, especially for smaller models. Strong models know when not to fetch data.
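
To make "knowing when not to fetch data" concrete, that judgment can be caricatured as a gate like the one below. This is a thought experiment, not a rule the benchmark prescribes or one the strong models literally implement.

```python
# Hypothetical gating heuristic, only to make "knowing when not to fetch" concrete;
# real models learn this judgment rather than follow a rule like this.
def should_call_tool(user_orientation: str, asked_for_advice: bool,
                     reply_needs_checkable_fact: bool) -> bool:
    """Decide whether fetching external data is likely to help or hurt this turn.

    user_orientation           : "action" or "emotion" (how the user wants to be supported)
    asked_for_advice           : did the user explicitly invite practical suggestions?
    reply_needs_checkable_fact : will the planned reply assert something verifiable
                                 (time, weather, a nearby place)?
    """
    if reply_needs_checkable_fact:
        return True      # never state a checkable fact without grounding it
    if user_orientation == "action" or asked_for_advice:
        return True      # practical guidance is welcome; ground it before giving it
    return False         # emotion-oriented venting: listen first, fetch nothing
```

The asymmetry matters more than the exact rule: checkable claims always need grounding, while unsolicited data-fetching for a user who just wants to be heard is exactly where perceived empathy drops.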

4. Training helps… narrowly

Fine‑tuning on high‑quality, tool‑enhanced dialogues improves in‑distribution performance. But outside familiar scenarios, hallucinations rebound. Grounded empathy does not generalize easily.

Implications — What builders should take away

TEA‑Bench sends a quiet but firm message:

  • Empathy is no longer just linguistic—it is situational.
  • Tool use is not a skill add‑on; it is a judgment problem.
  • More data is not the same as better grounding.

For product teams building mental‑health companions, coaching agents, or support chatbots, this suggests a shift in priorities:

Old focus → new reality:

  • Warmer tone → correct context
  • Longer responses → better timing
  • More suggestions → fewer, grounded ones
  • Generic fine‑tuning → scenario‑aware evaluation

The uncomfortable truth: users forgive emotional awkwardness more easily than factual wrongness.

Conclusion — Empathy that checks the weather

TEA‑Bench doesn’t argue that emotional support should become transactional. It argues the opposite: that trustworthy empathy requires restraint, verification, and context awareness. Looking things up quietly—like a good friend would—turns out to matter.

As AI systems move deeper into human spaces, benchmarks like this will shape who earns trust and who merely sounds kind.

Cognaptus: Automate the Present, Incubate the Future.