TL;DR for operators

Most enterprise “emotion AI” still treats emotion as a label: anger, sadness, fear, joy. That is tidy, dashboard-friendly, and psychologically thin.

The CoRE paper asks a better question: when an LLM interprets an emotional situation, does it reason through the underlying cognitive appraisals that humans use — fairness, responsibility, control, effort, certainty, pleasantness, obstacles, and related dimensions? The answer is not “no”. It is more inconvenient: LLMs do show structure, but the structure is fragile.

Across 74,802 prompts, six LLMs, 15 emotion categories, and 17 appraisal dimensions, the models partly recover human-like patterns. They recognise that responsibility, control, and obstacles matter. They separate positive from negative emotions fairly well. They can often distinguish broad valence-arousal groups better than fine-grained emotions.

But the cracks are operationally important. Models tend to over-weight effort, under-weight perceived fairness, lean heavily on valence, and struggle to separate nearby emotions such as frustration versus anger or hope versus interest. They also show a gap between what they explicitly say matters and what their generated appraisal ratings actually use. Very familiar. Not admirable, but familiar.

For companies deploying LLMs in customer support, coaching, HR, education, companion agents, or mental-health-adjacent workflows, the lesson is simple: do not treat emotionally polished language as evidence of stable emotional reasoning. Test the appraisal layer. Monitor it across model upgrades. Use it for escalation and routing. And do not assume “cultural persona prompting” makes a system culturally sensitive just because the prompt now says Japan, Mexico, Nigeria, Denmark, or the United States.

Emotion labels are the receipt; appraisals are the transaction

A customer writes: “This is completely unfair. I followed your policy and still got charged.”

A sentiment model sees anger. A more polished LLM sees anger and replies with empathy. A better system notices the actual mechanism: the user is appraising the situation as unfair, externally caused, goal-blocking, and possibly remediable. That difference matters. Anger is not just negative affect with a louder font. It often contains a fairness claim.

This is where ordinary emotion recognition becomes too shallow for business use. Labelling a user as “angry” may help tone selection. It does not tell the system whether to explain a policy, reverse a fee, escalate to a supervisor, investigate a procedural failure, or stop pretending that “I understand your frustration” is a service recovery strategy.

The paper behind CoRE, “Large language models show fragile cognitive reasoning about human emotions”, starts from this gap [1]. It does not ask whether LLMs can name emotions. They can, often impressively. It asks whether they represent emotions through cognitive appraisal dimensions in a way that is coherent, human-plausible, and robust across models and contexts.

That is the right level of diagnosis. Emotion labels are outputs. Appraisals are the machinery.

CoRE tests the “why” behind the feeling

CoRE stands for Cognitive Reasoning for Emotions. The benchmark is built around 274 daily-life scenarios, validated through a two-stage process involving human raters, then expanded into prompts that ask models to rate a specific appraisal dimension for a specific emotional situation.

The main benchmark covers 15 emotion categories:

| Positive / mixed | Negative / self-evaluative / threat-related |
|---|---|
| Happiness, pride, hope, interest, surprise, challenge | Boredom, disgust, contempt, shame, guilt, anger, frustration, fear, sadness |

Each scenario is paired with questions across 17 appraisal dimensions, grouped into broader categories such as pleasantness, control, certainty, goal-path obstacle, legitimacy, responsibility, anticipated effort, engagement, and understanding.

The evaluated models are GPT-o4-mini, Gemini 2.5 Flash, DeepSeek R1, LLaMA 3 8B, Phi-4 14B, and Qwen QwQ 32B. The released benchmark contains 4,658 vanilla appraisal prompts, 274 prompts for explicit value probing, 23,290 prompts using cultural personas, and 46,580 prompts using personality-based personas — 74,802 prompts in total. The paper reports 448,812 model generations.
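The reported totals are internally consistent, and the arithmetic is quick to check. The sketch below assumes a decomposition that is inferred from the reported figures rather than quoted from the paper: vanilla prompts are every scenario crossed with every appraisal dimension, explicit value probing uses one prompt per scenario, the cultural set repeats the vanilla set for five nationalities, and the personality set repeats it for ten Big Five personas (high and low on each of five traits).

```python
# Sanity check of the prompt counts reported for CoRE.
# The decomposition is an inference from the reported numbers, not a quote.
SCENARIOS = 274
DIMENSIONS = 17
CULTURAL_PERSONAS = 5       # United States, Mexico, Nigeria, Denmark, Japan
PERSONALITY_PERSONAS = 10   # high and low on each Big Five trait
MODELS = 6

vanilla = SCENARIOS * DIMENSIONS
explicit = SCENARIOS
cultural = vanilla * CULTURAL_PERSONAS
personality = vanilla * PERSONALITY_PERSONAS

total_prompts = vanilla + explicit + cultural + personality
total_generations = total_prompts * MODELS

print(vanilla, explicit, cultural, personality)  # 4658 274 23290 46580
print(total_prompts, total_generations)          # 74802 448812
```

Under those assumptions, every reported figure falls out of the same five constants, which is a small but welcome sign that the benchmark is a full cross of scenarios, dimensions, personas, and models.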

The important design choice is self-appraisal. Prior work often asks models to infer what another person feels. That is useful, but it mixes emotional reasoning with social assumptions about the imagined person. CoRE instead asks the model to appraise a scenario from the inside of the situation. Less theatre. More diagnostic plumbing.

The first mechanism: models compress emotion into appraisal axes, but not the same ones humans use

The paper’s first major test uses principal component analysis with varimax rotation. This is the main evidence for implicit appraisal structure, not a decorative statistics exercise. The goal is to see which appraisal dimensions naturally organise the model-generated ratings before the model is asked to predict any emotion label.

The result is half reassuring and half awkward.

Proprietary models produce relatively compact appraisal structures: the top six principal components explain roughly 80% of total variance, close to the human comparison pattern used by the authors. Most open-source models, except Phi-4, are more diffuse, with the same six components explaining only about 50–60% of variance. QwQ-32B needs seven components to reach human-level explanatory power; LLaMA needs nearly nine. That does not mean those models are useless. It means their appraisal space is less neatly organised.
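To make "compact" versus "diffuse" concrete, here is a minimal sketch of the underlying measurement: how many principal components are needed before cumulative explained variance reaches a target. It is not the authors' pipeline. It assumes NumPy is available, uses synthetic ratings rather than CoRE data, and skips the varimax rotation, which redistributes variance among the retained components but does not change their combined total. The function name `components_needed` and the 0.80 target are illustrative choices.

```python
import numpy as np

def components_needed(ratings, target=0.80):
    """Smallest number of principal components whose cumulative
    explained-variance ratio reaches `target`.

    `ratings` is an (n_ratings, n_dimensions) array of appraisal scores.
    """
    # PCA on the correlation matrix: eigenvalues are component variances.
    corr = np.corrcoef(ratings, rowvar=False)
    eigenvalues = np.linalg.eigvalsh(corr)[::-1]  # descending order
    cumulative = np.cumsum(eigenvalues) / eigenvalues.sum()
    count = int(np.searchsorted(cumulative, target)) + 1
    return min(count, len(eigenvalues))

# Synthetic stand-ins (NOT benchmark data): 17 appraisal dimensions.
rng = np.random.default_rng(0)
n_ratings, n_dims = 500, 17

# "Compact" model: ratings driven by three latent factors plus small noise.
latent = rng.normal(size=(n_ratings, 3))
loadings = rng.normal(size=(3, n_dims))
compact = latent @ loadings + 0.1 * rng.normal(size=(n_ratings, n_dims))

# "Diffuse" model: every dimension varies independently.
diffuse = rng.normal(size=(n_ratings, n_dims))

print(components_needed(compact), components_needed(diffuse))
```

Run against real appraisal ratings from two candidate models, the same count gives a one-number summary of how tightly each model organises its appraisal space.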

Some dimensions align with human appraisal theory. Responsibility, control, and problem or obstacle dimensions matter across models. Models also show some coherent treatment of agency, including distinctions between self-control, other-control, and situational control.

Then comes the more interesting failure: effort.

For humans in the reference appraisal study, effort appears as a lower-ranked standalone factor. In the LLMs, effort often becomes highly influential and gets entangled with pleasantness and obstacles. In plain English, many models seem to read emotional situations as if “this will take work” is one of the dominant organising facts.

That is not absurd. Effort often matters. But it is not how human appraisal theory usually ranks the terrain. If a support model over-reads effort, it may treat a user’s fear, frustration, or challenge primarily as workload. That may lead to advice that sounds practical while missing the moral or relational content of the interaction.

The opposite problem appears with perceived fairness. In the human comparison, perceived fairness is a major factor. In most of the LLMs it explains little variance; QwQ and LLaMA are the exceptions. The other models tend to fold fairness into other constructs such as certainty, engagement, or pleasantness.

That is a serious product issue. A model can sound emotionally intelligent while under-representing one of the central mechanisms behind anger, grievance, and trust collapse. The customer said “unfair”; the model heard “negative and effortful”. Congratulations, your empathy layer has become a fog machine.

The second mechanism: models do not always know what they are using

The paper then asks models directly which appraisal dimension they consider most important for distinguishing a scenario. This is best read as an internal consistency check: do the models’ stated priorities match the dimensions that actually drive their generated appraisal ratings?

Not really.

When asked explicitly, models often select responsibility and control. That broadly matches part of the implicit structure. But they almost never select effort as the most important cue, even though effort appears as a major driver in their appraisal distributions. They also under-report problem and obstacle dimensions despite relying on them in practice.

This gap matters for any product that exposes model explanations. If the model says it focused on responsibility, but its rating pattern is being driven by effort and obstacles, the explanation may be plausible without being faithful. That is not a small flaw in emotionally sensitive domains. In HR coaching, mental-health-adjacent triage, education, or dispute resolution, an attractive explanation can be worse than no explanation if it gives the operator false confidence.

The practical rule is blunt: do not audit emotional reasoning by reading the model’s rationale alone. Compare rationales with structured appraisal outputs. If the rationale says fairness is central but the fairness score barely moves, the system is not “subtle”. It is inconsistent.
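That comparison can be automated in a few lines. The sketch below is one possible check, not a method from the paper: it asks whether the dimension named in the rationale is among the dimensions whose rating moved furthest from a baseline. The dimension names, the 1 to 7 scale, the choice of baseline (for example, the model's mean rating on neutral scenarios), and `top_k` are all assumptions made for illustration.

```python
def rationale_is_consistent(claimed, ratings, baseline, top_k=3):
    """True if the dimension named in the rationale is among the `top_k`
    dimensions whose rating moved furthest from the baseline."""
    shifts = {dim: abs(ratings[dim] - baseline[dim]) for dim in ratings}
    ranked = sorted(shifts, key=shifts.get, reverse=True)
    return claimed in ranked[:top_k]

# Hypothetical ratings on an assumed 1-7 scale.
baseline = {"fairness": 4.0, "effort": 4.0, "obstacle": 4.0,
            "pleasantness": 4.0, "control": 4.0}
ratings = {"fairness": 4.2, "effort": 6.5, "obstacle": 6.0,
           "pleasantness": 2.5, "control": 3.0}

# The rationale says fairness drove the answer, but fairness barely moved.
print(rationale_is_consistent("fairness", ratings, baseline))  # False
print(rationale_is_consistent("effort", ratings, baseline))    # True
```

A failed check does not say which side is wrong. It says the explanation and the ratings disagree, which is the signal an operator needs before trusting either.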

The emotion map is mostly valence, which is useful until it is not

The paper next uses Wasserstein distances to compare appraisal distributions between emotions within each model. This is the main evidence for the structure of each model’s emotion space.

Across all models, valence dominates. Positive emotions cluster with positive emotions; negative emotions cluster with negative emotions. That is useful. It is also the easy part.
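For readers who want the mechanics, here is a small standard-library sketch of a one-dimensional Wasserstein-1 distance between two sets of ratings, plus a toy emotion-to-emotion distance. Averaging the per-dimension distances is an assumption made for illustration; the paper's exact aggregation may differ. The ratings are invented, not taken from CoRE.

```python
from bisect import bisect_right

def wasserstein_1d(a, b):
    """Wasserstein-1 distance between two empirical 1-D distributions,
    computed as the area between their empirical CDFs."""
    a, b = sorted(a), sorted(b)
    points = sorted(set(a) | set(b))
    return sum(
        abs(bisect_right(a, left) / len(a) - bisect_right(b, left) / len(b))
        * (right - left)
        for left, right in zip(points, points[1:])
    )

def emotion_distance(x, y):
    """Mean per-dimension distance between two emotions' rating sets
    (an illustrative aggregation, not necessarily the paper's)."""
    return sum(wasserstein_1d(x[d], y[d]) for d in x) / len(x)

# Invented ratings on an assumed 1-7 scale.
toy = {
    "happiness": {"pleasantness": [6, 7, 6, 7], "fairness": [5, 6, 5, 6]},
    "sadness":   {"pleasantness": [2, 1, 2, 1], "fairness": [4, 5, 4, 5]},
    "anger":     {"pleasantness": [2, 1, 2, 2], "fairness": [1, 2, 1, 2]},
}

print(emotion_distance(toy["happiness"], toy["sadness"]))  # 3.0
print(emotion_distance(toy["sadness"], toy["anger"]))      # 1.625
```

In this toy case the happiness-to-sadness gap is carried almost entirely by pleasantness, while sadness and anger differ only on fairness. If a model flattens that second gap, the two negative emotions become hard to tell apart.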

The difficulty is what happens beyond valence. The models show much weaker fine-grained structure along other appraisal axes. Challenge and surprise become “border emotions”, not fitting neatly into positive or negative clusters. The authors argue that models may link them through uncertainty, even where human appraisal findings distinguish them differently. Challenge, for example, is treated by models as low-certainty, while the human comparison suggests the opposite pattern.

That should interest anyone building emotionally aware workflows. A model may distinguish “happy” from “sad” almost trivially because valence does most of the work. But operationally expensive cases are rarely that clean. Customer anger versus frustration, shame versus guilt, hope versus interest, or fear versus challenge can imply different interventions.

The cross-model comparison sharpens the point. Some appraisal dimensions show high agreement across models — enjoyment, pleasantness, self-responsibility, self-control, legitimacy-cheated, obstacle, and effort-related dimensions. More abstract dimensions such as attention and consideration show weaker agreement.

The paper then applies Maximum Mean Discrepancy tests to compare whether different models represent the same emotion similarly. Some emotions, including happiness, surprise, hope, pride, fear, and sadness, appear more similar across models. Others — boredom, shame, frustration, and anger — vary substantially. LLaMA 3 is the most divergent, showing significant differences across all emotions.

This is where model replacement becomes risky. If you swap the underlying LLM in a support or coaching product, you may not simply change writing style or latency. You may silently change the system’s emotional geometry.
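One way to catch that shift before release is to run a two-sample test on appraisal vectors from the old and new model for the same emotion. The sketch below implements a biased squared-MMD estimate with an RBF kernel and a permutation p-value, using only the standard library. The kernel bandwidth `gamma`, the number of permutations, and the synthetic vectors are assumptions; the paper's own test configuration is not reproduced here.

```python
import math
import random

def rbf(x, y, gamma):
    """RBF kernel between two appraisal vectors."""
    return math.exp(-gamma * sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def mmd2(xs, ys, gamma=0.5):
    """Biased estimate of squared Maximum Mean Discrepancy."""
    kxx = sum(rbf(a, b, gamma) for a in xs for b in xs) / len(xs) ** 2
    kyy = sum(rbf(a, b, gamma) for a in ys for b in ys) / len(ys) ** 2
    kxy = sum(rbf(a, b, gamma) for a in xs for b in ys) / (len(xs) * len(ys))
    return kxx + kyy - 2 * kxy

def mmd_permutation_test(xs, ys, n_permutations=200, seed=0, gamma=0.5):
    """Return (observed MMD^2, permutation p-value)."""
    rng = random.Random(seed)
    observed = mmd2(xs, ys, gamma)
    pooled = list(xs) + list(ys)
    exceed = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        if mmd2(pooled[:len(xs)], pooled[len(xs):], gamma) >= observed:
            exceed += 1
    return observed, (exceed + 1) / (n_permutations + 1)

# Synthetic appraisal vectors (fairness, obstacle, effort) for "anger".
data_rng = random.Random(1)

def sample(center, n=20, spread=0.5):
    return [[data_rng.gauss(c, spread) for c in center] for _ in range(n)]

old_model = sample([2.0, 6.0, 5.0])
new_model = sample([4.5, 6.0, 5.0])  # same prompts, fairness appraised higher

statistic, p_value = mmd_permutation_test(old_model, new_model)
print(round(statistic, 3), round(p_value, 3))
```

A small p-value here means the two models appraise the same emotion differently, even if both still attach the label "anger" to it. That is the kind of drift ordinary response-quality QA will not surface.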

Prediction tests show where appraisal becomes brittle

The regression analysis asks a different question: given the model’s 17-dimensional appraisal ratings, how well can those ratings predict the emotion category? This is the main evidence for whether the appraisal vectors are informative and separable.

Some patterns are psychologically plausible. Hope is linked with pleasantness and low effort. Surprise is linked with low certainty and high enjoyment. Fear is predicted by low certainty and high exertion. Anger is driven less by raw valence and more by perceived unfairness. Guilt and shame involve self-responsibility and low enjoyment.

Other patterns are less aligned. Happiness is predicted by high certainty and low effort, but not strongly by pleasantness or enjoyment. Pride is associated with high effort and low problem, suggesting that models frame pride through effortful achievement more than the human comparison would imply. Contempt becomes strongly tied to understanding and low situational control, diverging from human findings where anger and contempt are more closely related.

The one-vs-all prediction results are operationally useful because they show which emotions are easy or hard to recover from appraisal vectors. Challenge is predicted most accurately across models, followed by boredom, happiness, and fear. Frustration, disgust, sadness, and shame are consistently weak. Hope and interest also underperform for several models. LLaMA 3 performs poorly overall, including an F1 as low as 0.16 for disgust. DeepSeek R1 produces more consistently informative appraisals, while Gemini struggles with sadness and frustration but predicts guilt comparatively well.
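For teams reproducing this kind of readout on their own data, the scoring step is simple. The sketch below computes a one-vs-all F1 for each emotion from true and predicted labels. It covers only the scoring: the classifier that turns appraisal vectors into predictions (logistic regression in the paper) is out of scope, and the labels below are invented to mirror the qualitative pattern, not the reported numbers.

```python
def one_vs_all_f1(true_labels, predicted_labels):
    """Per-emotion F1, treating each emotion in turn as the positive class."""
    pairs = list(zip(true_labels, predicted_labels))
    scores = {}
    for emotion in sorted(set(true_labels)):
        tp = sum(t == emotion and p == emotion for t, p in pairs)
        fp = sum(t != emotion and p == emotion for t, p in pairs)
        fn = sum(t == emotion and p != emotion for t, p in pairs)
        denominator = 2 * tp + fp + fn
        scores[emotion] = 2 * tp / denominator if denominator else 0.0
    return scores

# Invented example: challenge is easy, anger and frustration bleed together.
true_labels = ["challenge"] * 4 + ["anger"] * 4 + ["frustration"] * 4
predicted_labels = (
    ["challenge"] * 4
    + ["anger", "anger", "anger", "frustration"]
    + ["anger", "anger", "frustration", "frustration"]
)

scores = one_vs_all_f1(true_labels, predicted_labels)
for emotion, f1 in scores.items():
    print(f"{emotion}: {f1:.3f}")
```

Reporting per-emotion F1 rather than a single accuracy figure is what exposes the weak classes. An aggregate score would let the easy emotions hide the brittle ones.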

The paper then uses SynthTree as an interpretable follow-up, not a second thesis. SynthTree helps expose hierarchical decision structures and feature usage beyond global logistic coefficients. In the 15-class task, the results remain uneven. When the authors coarsen emotions into four valence-arousal quadrants (VA4), performance becomes much stronger and more stable. That is not a surprise; broad buckets are easier than fine distinctions. It is still informative.

The pairwise follow-ups make the same point cleanly. Happiness versus sadness reaches near-ceiling performance because valence is enough. Interest versus hope is more diagnostic because both are positive but differ in anticipated outcome, engagement, and goal structure. DeepSeek handles that distinction better than LLaMA, while the models use different appraisal mechanisms despite shared valence.

A useful way to read the experiments is this:

| Test | Likely purpose | What it supports | What it does not prove |
|---|---|---|---|
| PCA with varimax rotation | Main evidence for implicit appraisal structure | Models have organised appraisal spaces, but the axes differ from human patterns | That models understand emotion in a human-equivalent way |
| Explicit dimension-choice task | Consistency check between stated and implicit priorities | Models often say they use agency dimensions while implicitly relying on effort and obstacles | That explanations are faithful |
| Wasserstein distances | Main evidence for within-model emotion geometry | Valence is the dominant organising axis; fine-grained structure is weaker | That similar labels imply similar interventions |
| MMD cross-model tests | Cross-model generalisation test | Same emotion can be appraised differently by different models | That model upgrades preserve emotional behaviour |
| Logistic regression | Predictive evidence for separability of appraisal vectors | Some emotions are well represented; others are brittle | That low prediction always means the emotion is inherently ambiguous |
| SynthTree and VA4 | Interpretability and sensitivity follow-up | Coarse valence-arousal grouping is more stable than 15-way emotion separation | That coarse emotion categories are sufficient for real workflows |
| Culture and personality personas | Contextual robustness test | Personality prompts shift appraisals; cultural prompts do not meaningfully differentiate appraisals | That prompt personas produce reliable localisation |

The business interpretation is not “LLMs fail at emotion”. That would be too easy and not quite true. The better reading is that LLMs have an appraisal layer that is measurable, partially useful, and too unstable to leave ungoverned.

Personality personas move the dial; cultural personas barely touch it

The contextual tests are best understood as robustness and sensitivity tests. The authors vary the prompt context using cultural personas and personality personas, then compare the resulting appraisal distributions against vanilla prompts.

The cultural personas use nationality as a proxy for culture: United States, Mexico, Nigeria, Denmark, and Japan. The expectation, based on human psychology, would be that cultural context affects appraisals. The models do not show meaningful differentiation. Across the tested LLMs, cognitive appraisals remain statistically indistinguishable across these cultural personas.

This does not prove models can never represent culture. It does show that simple cultural persona prompting is not enough to produce stable, differentiated emotional appraisal patterns. That matters for localisation. A prompt that says “respond as someone from Japan” or “adapt to Mexican culture” may change vocabulary, etiquette, or surface framing while leaving the underlying appraisal logic almost untouched.

Personality prompts behave differently. The paper creates Big Five-based personas: high and low agreeableness, conscientiousness, extraversion, neuroticism, and openness. Positive personas — high agreeableness, high conscientiousness, high extraversion, low neuroticism, and high openness — tend to increase self-control and understanding, along with certainty, attention, enjoyment, and pleasantness. They also reduce perceptions of external control, obstacles, and negative fairness.

Negative personas produce the complementary pattern: more external control, more obstacles, sometimes stronger feelings of being cheated or needing effort, and lower attention, certainty, understanding, self-control, enjoyment, fairness, and self-responsibility. High neuroticism has the strongest effect on effort and exertion.

So the models can absorb individual-level affective traits in psychologically plausible ways, at least under prompt conditioning. They do not show the same reliability for culture. That is a useful asymmetry. It tells builders where persona prompting may produce measurable behavioural shifts, and where it may merely decorate the prompt with geography.
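Builders can run a rough version of this comparison on their own prompts. The sketch below measures the mean per-dimension rating change between a persona prompt and the vanilla prompt, paired by scenario, and lists the dimensions that moved by more than a threshold. It is a descriptive check, not the statistical testing used in the paper. The ratings, the dimension names, and the 0.5 threshold are invented for illustration.

```python
from statistics import mean

def persona_shift(vanilla, persona):
    """Mean per-dimension rating change (persona minus vanilla),
    with ratings paired by scenario."""
    return {
        dim: mean(p - v for v, p in zip(vanilla[dim], persona[dim]))
        for dim in vanilla
    }

def dimensions_moved(shift, threshold=0.5):
    """Dimensions whose mean shift is at least `threshold` in size."""
    return sorted(d for d, delta in shift.items() if abs(delta) >= threshold)

# Invented ratings for four scenarios on an assumed 1-7 scale.
vanilla = {"self_control": [4, 3, 4, 3], "effort": [4, 5, 4, 5]}
high_neuroticism = {"self_control": [3, 2, 3, 2], "effort": [6, 6, 5, 6]}
japan_persona = {"self_control": [4, 3, 4, 3], "effort": [4, 5, 5, 5]}

print(dimensions_moved(persona_shift(vanilla, high_neuroticism)))
# ['effort', 'self_control']
print(dimensions_moved(persona_shift(vanilla, japan_persona)))
# []
```

If a localisation prompt produces an empty list on the dimensions that matter to the workflow, that is evidence it changed the wording and left the appraisal logic alone.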

What the paper directly shows, and what operators should infer

There are three layers to keep separate.

First, the paper directly shows that six evaluated LLMs generate structured appraisal ratings over emotional scenarios, but those structures are inconsistent across models, only partly aligned with human appraisal theory, and sensitive to some prompt contexts but not others.

Second, it shows that coarse emotional distinctions are easier than fine-grained ones. Valence does a lot of work. The system can often tell bright from dark. It is less reliable at telling resentment from frustration, shame from guilt, or hopeful curiosity from engaged interest.

Third, Cognaptus infers an operational design principle: emotional AI should be governed through appraisal-level instrumentation, not sentiment labels alone.

That leads to a more useful deployment pattern:

| Operational area | What to test | Why it matters |
|---|---|---|
| Customer support | Fairness, responsibility, obstacle, control, certainty | Anger and grievance often require policy repair, not empathy garnish |
| HR and coaching | Self-responsibility, situational control, effort, certainty | Advice changes depending on whether the user sees agency or constraint |
| Education | Challenge, boredom, effort, uncertainty, attention | “Struggling” can mean motivated challenge or disengaged boredom |
| Companion agents | Rationale-rating consistency | A warm explanation should not hide unstable appraisal logic |
| Mental-health-adjacent triage | Fear, sadness, shame, control, effort, uncertainty | Fragile appraisal can misroute high-sensitivity cases |
| Model operations | Appraisal drift after upgrades | Switching models can change emotional behaviour without changing labels |

This is not about making the model more sentimental. It is about making the emotional layer inspectable enough to govern.

A practical appraisal-governance checklist

An operator deploying emotionally aware LLMs should ask for more than a benchmark score on empathy or sentiment classification.

A minimal appraisal-governance checklist looks like this:

  1. Evaluate appraisals, not just labels. For each relevant workflow, track dimensions such as fairness, control, responsibility, certainty, effort, obstacles, and pleasantness.

  2. Map interventions to appraisal patterns. If anger is driven by perceived unfairness, the correct response may be restitution or policy explanation. If fear is driven by uncertainty and effort, the correct response may be clarification and step reduction.

  3. Check rationale-rating consistency. If the model explains an answer through fairness but the fairness score does not move, flag the case for review or calibration.

  4. Run model-swap regression tests. Before changing the underlying LLM, compare appraisal distributions, not only response quality. Emotional drift can be invisible in ordinary QA.

  5. Test nearby emotions pairwise. If the product depends on distinguishing frustration from anger, hope from interest, or shame from guilt, evaluate those pairs directly. Broad valence accuracy is not enough.

  6. Treat cultural persona prompting as unproven localisation. Cultural sensitivity needs evidence from local data, user testing, and policy design. A nationality phrase in a prompt is not a localisation strategy. It is a stationery header.

  7. Use escalation policies for low-certainty and high-risk appraisal patterns. Emotionally sensitive systems should know when not to continue autonomously.
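Items 2 and 7 of the checklist can be sketched as a small routing function that maps an appraisal pattern to an intervention and escalates when certainty is very low or the case is flagged high-risk. Everything specific here is hypothetical: the `Appraisal` fields, the 1 to 7 scale, the thresholds, and the route names would all need to come from an operator's own calibration and policy.

```python
from dataclasses import dataclass

@dataclass
class Appraisal:
    # Assumed 1-7 scale; higher means more fair, more certain, more effort.
    fairness: float
    certainty: float
    effort: float

def route(appraisal: Appraisal, high_risk: bool = False) -> str:
    """Map an appraisal pattern to an intervention (illustrative rules)."""
    # Item 7: stop acting autonomously on high-risk or very uncertain cases.
    if high_risk or appraisal.certainty <= 2:
        return "escalate_to_human"
    # Item 2: perceived unfairness calls for restitution or policy explanation.
    if appraisal.fairness <= 2:
        return "review_policy_or_restitution"
    # Item 2: uncertainty plus effort calls for clarification and fewer steps.
    if appraisal.certainty <= 3 and appraisal.effort >= 5:
        return "clarify_and_reduce_steps"
    return "continue_autonomously"

print(route(Appraisal(fairness=1.5, certainty=5, effort=3)))
# review_policy_or_restitution
print(route(Appraisal(fairness=5, certainty=3, effort=6)))
# clarify_and_reduce_steps
print(route(Appraisal(fairness=5, certainty=1.5, effort=6)))
# escalate_to_human
```

The rule order is a design choice: escalation is checked first so that no appraisal pattern, however confidently scored, can route around a high-risk flag.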

The unpleasant but useful conclusion is that appraisals are not merely research variables. They can become production observability metrics.

Boundary conditions: this is a benchmark, not a clinical trial

The study is valuable because it tests the mechanism beneath emotion labels. But its boundaries matter.

The scenarios are prompted and benchmarked, not observed in live enterprise deployments. The human validation process improves scenario quality, but the raters are a small expert group rather than a large demographically representative sample. The cultural test uses nationality as a proxy for culture, which is necessarily coarse. The personality test uses in-context personas, which may not match persistent user modelling. The evaluated models are six specific systems, and the field changes quickly enough to make any fixed model comparison perishable.

The benchmark also studies model-generated self-appraisals. That is useful for exposing internal structure, but it is not the same as proving that a model has human-like emotional cognition. The paper’s own evidence points in the opposite direction: structure exists, but it is noisy, model-specific, and often shallower than the surface fluency suggests.

For business use, the boundary is clear. CoRE does not tell you whether an LLM is safe to deploy in a live HR, healthcare, education, or customer remediation workflow. It tells you what to measure before pretending that deployment is safe.

The real product is not empathy; it is emotional observability

The most important contribution of CoRE is not that it catches LLMs being bad at emotions. The more interesting result is that their emotional reasoning is legible enough to inspect.

That changes the product question. Instead of asking, “Can the model sound empathetic?”, operators should ask:

  • What appraisal dimensions drove this response?
  • Are those dimensions stable across prompts, users, and models?
  • Do explanations match ratings?
  • Which fine-grained emotions collapse into the same broad bucket?
  • Does localisation change appraisal logic or merely change phrasing?
  • When should the system escalate rather than continue?

The future of emotional AI will not be won by chatbots that apologise more elegantly. We have enough elegant apologies. Many of them are attached to broken workflows.

The useful systems will be the ones that can tell the difference between sadness and shame, fear and challenge, anger and frustration, unfairness and inconvenience — and can show operators enough of that reasoning to be audited.

Emotionally fluent language is cheap now. Emotionally governed infrastructure is the harder part. CoRE is a reminder that the harder part is the one worth buying.

Cognaptus: Automate the Present, Incubate the Future.


  1. Sree Bhattacharyya, Evgenii Kuriabov, Lucas Craig, Tharun Dilliraj, Reginald B. Adams Jr., Jia Li, and James Z. Wang, “Large language models show fragile cognitive reasoning about human emotions,” arXiv:2508.05880.