TL;DR for operators

LLMs are not merely getting better at choosing the right emotion label. This paper shows that, inside their output distributions, larger models organise emotion words into increasingly rich hierarchies: broad emotions such as joy or sadness sit above more specific states such as optimism, disappointment, or grief.[1]

That matters because the hierarchy itself becomes an evaluation object. Instead of asking only whether a model correctly labels a customer message as “angry,” an operator can ask whether the model’s internal emotion map has enough depth, whether related emotions cluster sensibly, and whether that structure changes when the model is prompted to adopt different demographic personas.

The uncomfortable part is that better emotional structure does not mean safer emotional judgement. The paper finds that Llama 405B recognises emotions less accurately for several underrepresented persona conditions, including female, Black, low-income, and low-education personas, and that intersectional personas can compound these errors. Some groups are pushed toward specific misclassifications: Asian personas toward “shame,” Hindu personas toward “guilt,” and physically disabled personas toward “frustration.” Emotional sophistication, charmingly, can still carry the baggage.

For business use, the lesson is not “LLMs feel like humans.” They do not. The lesson is that affective behaviour can be audited structurally. Emotion-tree depth, path length, clustering, and persona-conditioned shifts could become part of model selection, red-teaming, and fairness review for customer service agents, tutoring systems, AI companions, negotiation bots, and therapy-adjacent products.

The boundary is important. The experiments rely on Llama-heavy models, GPT-4o-generated scenarios, a 135-word emotion taxonomy from psychology, next-token probability assumptions, and a small human study with forced-choice emotion labels. Treat the paper as a strong diagnostic proposal, not a licence to sell “emotionally intelligent AI” with a pastel landing page and questionable life choices.

The customer says one thing; the model hears another

A customer writes: “I’ve been waiting three weeks, and no one has bothered to explain what happened.”

A human support agent may hear frustration, disappointment, anger, distrust, fatigue, or some blend of them. A weak chatbot may hear only “negative sentiment.” A stronger model may output “anger.” A better system should probably notice that the sentence lives somewhere in a family of related emotions, not as a single isolated label.

That distinction is the point of the paper. The authors are not satisfied with the usual benchmark question: did the model choose the correct emotion word? They ask a more structural question: how does the model organise emotions before it commits to a label?

This matters because many AI products now sit inside emotionally loaded workflows. Customer support, education, sales, coaching, employee feedback, dispute resolution, healthcare triage, and companionship all require some inference about human emotional state. If the model flattens every negative experience into “anger,” it will respond badly. If it distinguishes irritation from humiliation, fear from surprise, loneliness from boredom, it may respond more appropriately.

But the paper’s sharper contribution is that the same machinery that creates nuance can also encode bias. A model can learn a richer map of emotion while still misreading some people more than others. That is not a contradiction. It is the job description of modern AI evaluation: find the improvement, then find the invoice attached to it.

The mechanism: emotion trees from next-token probabilities

The paper begins with a simple but powerful move. Take a sentence. Append a prompt such as: “The emotion in this sentence is”. Then inspect the model’s next-token probability distribution over a fixed vocabulary of emotion words.

The authors use 135 emotion words drawn from Shaver et al.’s hierarchical emotion framework, which groups emotions into broader families such as love, joy, surprise, anger, sadness, and fear.[2] For each scenario, the model assigns probability mass across those words. Instead of taking only the top label, the method keeps the distribution.

That distribution matters because the model may assign probability to several related emotions. A scenario may strongly activate “optimism,” but also “joy.” Another may activate “grief,” “sadness,” and “loneliness.” The researchers collect these probability patterns across many scenarios, producing a large scenario-by-emotion table.
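
To make the "keep the distribution" step concrete, here is a minimal sketch. It assumes the next-token logits are already available as a plain dictionary (`next_token_logits` is a hypothetical input, not an API from the paper), that each emotion word maps to a single token, and that probabilities are renormalised over the emotion vocabulary only. The paper may handle tokenisation and normalisation differently.

```python
import math

def emotion_distribution(next_token_logits, emotion_words):
    """Softmax restricted to the emotion vocabulary.

    next_token_logits: dict mapping candidate token -> raw logit
    (hypothetical input, read at the position right after the prompt).
    """
    present = [w for w in emotion_words if w in next_token_logits]
    if not present:
        raise ValueError("no emotion word found in the logits")
    peak = max(next_token_logits[w] for w in present)
    # Subtract the peak before exponentiating for numerical stability.
    weights = {w: math.exp(next_token_logits[w] - peak) for w in present}
    total = sum(weights.values())
    return {w: weights[w] / total for w in present}

# Made-up logits for "<scenario> The emotion in this sentence is"
logits = {"joy": 2.0, "optimism": 2.5, "grief": -1.0, "the": 4.0}
dist = emotion_distribution(logits, ["joy", "optimism", "grief"])
top = max(dist, key=dist.get)  # the single label a flat classifier would report
```

The point of returning `dist` rather than `top` is that the runner-up probabilities are exactly what the later hierarchy construction consumes.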

From there, they construct what the paper calls a matching matrix: an emotion-by-emotion representation of how often two emotion words appear in similar contexts. If two emotions tend to receive high probability on the same scenarios, they are close in the model’s behavioural space.
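
One plausible way to build such a matrix is sketched below. It assumes the matching score for two emotions is the sum, over scenarios, of the product of their probabilities. That formula is an illustrative assumption, not the paper's confirmed definition, and the three-scenario table is made up.

```python
def matching_matrix(table, emotions):
    """table: list of per-scenario dicts mapping emotion -> probability."""
    matrix = {a: {b: 0.0 for b in emotions} for a in emotions}
    for row in table:
        for a in emotions:
            for b in emotions:
                # Two emotions score highly together only when the same
                # scenario gives both of them substantial probability.
                matrix[a][b] += row.get(a, 0.0) * row.get(b, 0.0)
    return matrix

# Hypothetical scenario-by-emotion table (each row sums to 1).
table = [
    {"joy": 0.60, "optimism": 0.35, "grief": 0.05},
    {"joy": 0.80, "optimism": 0.15, "grief": 0.05},
    {"joy": 0.05, "optimism": 0.05, "grief": 0.90},
]
emotions = ["joy", "optimism", "grief"]
m = matching_matrix(table, emotions)
closer_to_joy = max(["optimism", "grief"], key=lambda e: m["joy"][e])
```

In this toy table, joy co-occurs with optimism far more than with grief, which is the kind of closeness the matrix is meant to capture.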

The hierarchical step comes from asymmetry. If “optimism” strongly implies “joy,” but “joy” does not always imply “optimism,” then “optimism” is treated as the more specific concept and “joy” as the broader parent. In business English: the model has learned that optimism is a kind of joy, not that joy is merely optimism with better lighting.
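
The asymmetry test can be sketched as follows. This is an illustration under stated assumptions, not the paper's exact procedure: the conditional strength is estimated as a probability-weighted average over scenarios, the `threshold` of 0.5 is arbitrary, and the scenario table is invented.

```python
def conditional(table, given, target):
    """Estimate P(target | given) as a probability-weighted average."""
    weight = sum(row.get(given, 0.0) for row in table)
    if weight == 0.0:
        return 0.0
    joint = sum(row.get(given, 0.0) * row.get(target, 0.0) for row in table)
    return joint / weight

def parent_edges(table, emotions, threshold=0.5):
    """Return {child: parent} for pairs with strong, one-sided implication."""
    parents = {}
    for child in emotions:
        best, best_score = None, threshold
        for candidate in emotions:
            if candidate == child:
                continue
            forward = conditional(table, child, candidate)   # child implies parent
            backward = conditional(table, candidate, child)  # parent implies child
            if forward >= best_score and forward > backward:
                best, best_score = candidate, forward
        if best is not None:
            parents[child] = best
    return parents

# Hypothetical scenario-by-emotion table (each row sums to 1).
table = [
    {"joy": 0.60, "optimism": 0.35, "grief": 0.05},
    {"joy": 0.80, "optimism": 0.15, "grief": 0.05},
    {"joy": 0.05, "optimism": 0.05, "grief": 0.90},
]
edges = parent_edges(table, ["joy", "optimism", "grief"])
```

Here optimism lands under joy because optimism-heavy scenarios are also joy-heavy, while joy-heavy scenarios are only sometimes optimistic. A production version would also need to check for cycles longer than two nodes, which this sketch does not do.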

This turns a flat classifier into a tree-like structure. The method does not need the model’s training data. It does not need human labels for every scenario. It uses the model’s own output probabilities as the raw material.

That is the clever part. It makes the model expose not just what it says, but how its emotion labels hang together.

The tree is an audit object, not a personality test

The paper’s main methodological claim rests on an assumption: next-token probabilities over emotion words can be interpreted as the model’s estimate that a scenario reflects those emotions. That assumption is not metaphysics. It is an operational bridge. It lets the researchers convert output distributions into conditional relationships between emotion concepts.

Once the tree exists, it can be measured. Two metrics become especially useful:

| Metric | What it captures | Operational interpretation |
| --- | --- | --- |
| Number of nodes | How many emotion words meaningfully appear in the hierarchy | Breadth of the model’s usable emotion vocabulary |
| Average depth / total path length | How far specific emotions sit beneath broader categories | Granularity of the model’s affective structure |
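
As a rough illustration of how these metrics could be computed, the sketch below takes a tree as a child-to-parent dictionary. It assumes that depth means hops to the root and that total path length is the sum of all depths. The paper's exact definitions may differ, and the example tree is hypothetical.

```python
def tree_metrics(parents):
    """parents: dict mapping child emotion -> parent emotion."""
    nodes = set(parents) | set(parents.values())

    def depth(node):
        hops, seen = 0, {node}
        while node in parents:
            node = parents[node]
            if node in seen:
                raise ValueError("cycle detected in hierarchy")
            seen.add(node)
            hops += 1
        return hops

    depths = {node: depth(node) for node in nodes}
    total = sum(depths.values())
    return {
        "nodes": len(nodes),
        "total_path_length": total,
        "average_depth": total / len(nodes) if nodes else 0.0,
        "max_depth": max(depths.values(), default=0),
    }

# Hypothetical tree: two families, one of them two levels deep.
parents = {"optimism": "joy", "relief": "joy",
           "grief": "sadness", "anguish": "grief"}
metrics = tree_metrics(parents)
```

A flat classifier would produce an empty `parents` map and therefore zero depth, which is one way to read the GPT-2 result discussed in the following paragraphs.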

A shallow tree is not necessarily useless. It may be enough for a refund bot that only needs to distinguish calm, confused, and angry. But for a tutoring agent, clinical intake assistant, coaching tool, negotiation bot, or AI companion, shallow affective structure becomes a product risk. It can produce responses that are technically polite and psychologically tone-deaf. A familiar enterprise achievement, really.

The paper shows that as model scale increases, the trees become deeper and more complex. GPT-2 does not yield a meaningful hierarchy. Llama 3.1 models at 8B, 70B, and 405B parameters show progressively richer structures. Larger models also align more closely with the emotion-wheel categories used in cognitive psychology.

The authors report quantitative alignment between model-derived hierarchies and the human-annotated emotion wheel. For cluster distances, the reported correlations are 0.38 for Llama 8B, 0.64 for Llama 70B, and 0.51 for Llama 405B. For node-hop distances, the reported correlations are 0.55, 0.60, and 0.55 respectively. The pattern is not perfectly monotonic, but the important point is that the extracted structures are not random piles of feeling-words. They carry recognisable psychological organisation.
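
For readers who want to see what a node-hop comparison involves, here is a small sketch. It assumes trees are acyclic child-to-parent dictionaries, joins separate families through one virtual root so every pair has a finite distance, and uses a Pearson correlation. The source does not say which correlation coefficient the paper uses, and both trees below are invented.

```python
import math
from itertools import combinations

def path_to_root(parents, node):
    """Return [node, parent, grandparent, ...]; input must be acyclic."""
    path = [node]
    while path[-1] in parents:
        path.append(parents[path[-1]])
    return path

def hop_distance(parents, a, b):
    """Number of edges between a and b, via a virtual root if needed."""
    path_a, path_b = path_to_root(parents, a), path_to_root(parents, b)
    position_in_b = {node: i for i, node in enumerate(path_b)}
    for i, node in enumerate(path_a):
        if node in position_in_b:
            return i + position_in_b[node]
    # No shared ancestor: climb to each root, then one hop each to the virtual root.
    return len(path_a) + len(path_b)

def pearson(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical model-derived tree and reference (psychology-style) tree.
model_tree = {"optimism": "joy", "grief": "sadness", "anguish": "grief"}
reference_tree = {"optimism": "joy", "grief": "sadness", "anguish": "sadness"}
words = ["joy", "optimism", "sadness", "grief", "anguish"]
pairs = list(combinations(words, 2))
r = pearson([hop_distance(model_tree, a, b) for a, b in pairs],
            [hop_distance(reference_tree, a, b) for a, b in pairs])
```

The two trees disagree only about where anguish sits, so the correlation stays high but below 1, which is the shape of result the reported figures describe.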

This is where the paper quietly shifts the evaluation frame. Emotion recognition is no longer just “accuracy on a label set.” It becomes structural fidelity: does the model’s affective map have the right families, distances, and levels of abstraction?

Scale creates granularity, but granularity is not empathy

The authors interpret the scale result through the lens of emotion differentiation. In human development, people typically move from broad affective categories to more differentiated emotional concepts. A child may say “bad.” An adult may distinguish embarrassment, disappointment, shame, guilt, loneliness, dread, and resentment, depending on temperament, vocabulary, and whether they have recently had to sit through a strategic alignment meeting.

The analogy is useful but should be handled carefully. The paper does not show that LLMs experience emotions, possess self-awareness, or understand users in the human sense. It shows that larger models produce output distributions whose structure resembles hierarchical emotion concepts from psychology.

That is still valuable. A model that recognises fine-grained distinctions may be better at selecting responses. For example:

| User state | Bad response pattern | Better response pattern |
| --- | --- | --- |
| Anger | Defend the policy | Acknowledge harm, reduce friction, offer concrete next step |
| Shame | Overexplain the user’s mistake | Preserve dignity, make recovery easy |
| Fear | Push urgency | Provide reassurance and control |
| Disappointment | Apologise generically | Restore expectation, explain what changed |
| Surprise | Treat as confusion | Explain the expectation gap |

The operational value is not that the model “feels.” The operational value is that its internal label geometry may predict whether it can respond with enough emotional specificity for the task.

And that is exactly what the paper later tests.

Persona-conditioned trees predict recognition accuracy

After establishing that larger models form richer emotion trees, the authors ask whether these structures relate to behaviour. The answer is yes, at least in their experimental setup.

They prompt Llama 405B to adopt different personas and then construct persona-specific emotion trees. They also evaluate the model’s emotion-recognition accuracy on persona-conditioned scenarios. Across 26 personas, richer tree geometry is associated with better recognition accuracy. Longer total path length and greater average depth correlate positively with accuracy.

This is an important move because it connects the tree to downstream behaviour. A hierarchy is not merely a pretty diagram. It becomes a diagnostic signal.

For operators, this suggests a practical evaluation pattern:

  1. Build emotion trees for the base model.
  2. Build emotion trees under persona, market, language, domain, or customer-segment conditions.
  3. Compare tree depth, path length, clustering, and edge changes.
  4. Test whether structural degradation predicts weaker task performance.
  5. Use the result to decide where fine-tuning, retrieval context, guardrails, or human escalation are needed.
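
Step 3 of the pattern above could look something like the sketch below. It assumes both trees are child-to-parent dictionaries and reports edge changes plus the shift in total path length. The function names, the Jaccard overlap measure, and both example trees are illustrative choices, not taken from the paper.

```python
def total_path_length(parents):
    """Sum of hops from every node up to its root."""
    total = 0
    for node in set(parents) | set(parents.values()):
        hops = 0
        # The hop cap guards against accidental cycles in the input.
        while node in parents and hops <= len(parents):
            node = parents[node]
            hops += 1
        total += hops
    return total

def compare_trees(base, persona):
    base_edges = set(base.items())        # (child, parent) pairs
    persona_edges = set(persona.items())
    union = base_edges | persona_edges
    return {
        "lost_edges": sorted(base_edges - persona_edges),
        "new_edges": sorted(persona_edges - base_edges),
        "edge_jaccard": len(base_edges & persona_edges) / len(union) if union else 1.0,
        "path_length_delta": total_path_length(persona) - total_path_length(base),
    }

# Hypothetical trees: under the persona prompt, two branches collapse into anger.
base = {"optimism": "joy", "grief": "sadness", "anguish": "grief", "dread": "fear"}
persona = {"optimism": "joy", "grief": "sadness", "anguish": "anger", "dread": "anger"}
report = compare_trees(base, persona)
```

A negative `path_length_delta` together with new edges pointing at a single emotion is the kind of structural degradation step 4 would then test against task performance.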

This is more useful than a leaderboard score. A single aggregate accuracy number can hide the precise segment where the model fails. Tree geometry may expose that the model has a shallow or distorted affective map for a particular persona or context.

A vendor saying “our model scores 87% on broad emotion categories” is pleasant. An evaluator asking “does your model’s sadness branch collapse into anger for low-income Black female personas?” is less pleasant. Also more useful.

The bias result is structural, not merely statistical

The paper’s most business-relevant section is not the scale result. It is the persona result.

For neutral prompts, Llama 405B achieves 15.2% accuracy across all 135 fine-grained emotion words and 87.1% accuracy when those words are grouped into six broad categories. That already tells us something: broad emotion recognition is much easier than fine-grained classification. Naming the emotion family is cheap. Knowing which member of the family is expensive.
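
The gap between the two numbers is easy to reproduce in miniature. The sketch below scores the same predictions twice, once on exact words and once after mapping each word to a broad family. The `family_of` mapping is a small illustrative subset and the predictions are made up, so the resulting figures are not the paper's.

```python
def accuracy(pairs):
    """pairs: list of (gold, predicted) labels."""
    return sum(1 for gold, predicted in pairs if gold == predicted) / len(pairs)

def broad_accuracy(pairs, family_of):
    """Score predictions after collapsing each word into its broad family."""
    return accuracy([(family_of[gold], family_of[predicted])
                     for gold, predicted in pairs])

# Illustrative subset of a word -> family mapping (not the full 135 words).
family_of = {
    "optimism": "joy", "relief": "joy",
    "grief": "sadness", "disappointment": "sadness",
    "irritation": "anger", "rage": "anger",
    "dread": "fear",
}

# Hypothetical (gold, predicted) pairs.
pairs = [
    ("grief", "disappointment"),   # wrong word, right family
    ("rage", "irritation"),        # wrong word, right family
    ("optimism", "optimism"),      # exact match
    ("dread", "rage"),             # wrong family
    ("relief", "optimism"),        # wrong word, right family
]
fine = accuracy(pairs)
broad = broad_accuracy(pairs, family_of)
```

Most of the "errors" at the fine-grained level are near misses inside the correct family, which is why the two scores can sit so far apart.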

When the model is prompted with demographic personas, performance gaps appear. The paper reports lower emotion-recognition accuracy for underrepresented persona conditions, including female, Black, low-income, and low-education personas. The gaps are amplified when multiple minority attributes are combined.

The more revealing finding is not just lower accuracy. It is directional misclassification.

Asian personas often see negative emotions collapsed into “shame.” Hindu personas show negative emotions pushed toward “guilt.” Physically disabled personas have 26.5% of all emotions misclassified as “frustration.” Black personas more often misclassify sadness or fear as anger. Low-income female personas tend to misclassify other emotions as fear. Low-income Black female personas combine multiple error patterns and show the lowest overall recognition accuracy among the examined intersectional cases.
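
A directional check of this kind can be sketched in a few lines. The version below assumes evaluation records are `(persona, gold, predicted)` tuples and reports, per persona, the share of all items that were wrongly assigned to each label. The persona names and records are placeholders, not data from the paper.

```python
from collections import Counter, defaultdict

def misclassification_sinks(records):
    """Per persona: label -> fraction of all items wrongly predicted as it."""
    totals = Counter()
    wrong = defaultdict(Counter)
    for persona, gold, predicted in records:
        totals[persona] += 1
        if predicted != gold:
            wrong[persona][predicted] += 1
    return {
        persona: {label: count / totals[persona]
                  for label, count in wrong[persona].most_common()}
        for persona in totals
    }

# Hypothetical evaluation records.
records = [
    ("persona_a", "sadness", "frustration"),
    ("persona_a", "fear", "frustration"),
    ("persona_a", "joy", "anger"),
    ("persona_a", "anger", "anger"),
    ("persona_b", "sadness", "sadness"),
    ("persona_b", "fear", "fear"),
]
sinks = misclassification_sinks(records)
```

An error profile dominated by one label, as for `persona_a` here, is the signature of directional misclassification. Errors spread evenly across labels would look more like ordinary noise.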

This is not a random failure mode. It looks like affective stereotyping inside the model’s output distribution.

For businesses, this is where “emotional AI” becomes a governance issue. A support bot that systematically reads one group’s fear as anger may escalate defensively. A tutoring system that reads confusion as laziness or frustration may reduce help quality. A coaching product that maps certain users toward shame or guilt may become quietly harmful while still sounding polished. The model does not need to be malicious. It only needs to be consistently wrong in socially patterned ways. Software has achieved profitability with less.

The human comparison is not a comfort blanket

The authors also conduct a human study with 60 online participants recruited through Prolific. Participants are shown one randomly selected scenario for each of the 135 emotion words and asked to choose among six broad emotion categories.

The purpose of this experiment is comparison with human misclassification patterns, not proof that the model has human emotional understanding. The distinction matters.

The paper finds parallels between Llama and human participants. Black participants and Black personas in Llama are more likely to interpret fear scenarios as anger. Female participants and female personas show the reverse confusion, tending to read anger scenarios as fear, though the paper notes that the gender accuracy pattern differs: human females outperform males, while Llama favours male personas. Race and education patterns show stronger alignment between model and human results.

This comparison should not be read as reassurance. “The model is biased like humans” is not a safety certificate. It is a description of the contamination route. If training data reflects human social perception, and the model captures that distribution well, the model may inherit both useful social regularities and ugly social shortcuts.

That is the central irony of the paper. Better modelling of humans can produce better service and better stereotyping at the same time.

Surprise is where prediction-error training leaves a fingerprint

The paper includes an additional experiment on “surprise.” The authors motivate it psychologically: surprise is often linked to a mismatch between expectation and reality. That makes it conceptually related to prediction error.

They compare a base Mistral-7B model with an RL-fine-tuned variant trained on social interaction tasks such as persuasion and negotiation. Recognition of surprise rises from 20.0% in the base model to 33.3% in the RL-fine-tuned model, while most other emotion categories remain similar. The authors report the improvement as statistically significant using a McNemar test.
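
For context, McNemar's test compares two models on the same items and looks only at the items where they disagree. The sketch below implements the exact (binomial) two-sided variant. The source does not say whether the paper uses the exact or the chi-square form, and the counts in the example are invented for illustration.

```python
from math import comb

def mcnemar_exact(base_only_correct, tuned_only_correct):
    """Two-sided exact McNemar p-value from the discordant pair counts.

    base_only_correct: items only the base model got right.
    tuned_only_correct: items only the fine-tuned model got right.
    """
    n = base_only_correct + tuned_only_correct
    if n == 0:
        return 1.0  # the models never disagree, so there is nothing to test
    k = min(base_only_correct, tuned_only_correct)
    # Under the null hypothesis each discordant item is a fair coin flip.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical counts, not the paper's data.
p_value = mcnemar_exact(base_only_correct=2, tuned_only_correct=12)
```

Items both models get right or both get wrong never enter the calculation, which is why the test suits before-and-after comparisons on a fixed scenario set.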

This is best treated as an exploratory extension, not the paper’s main thesis. It suggests that training regimes may change sensitivity to particular emotional concepts. For operators, that matters because “emotion capability” may not be a single scalar property. A model fine-tuned for persuasion may become better at detecting surprise because surprise is central to expectation management. Another model fine-tuned for customer retention may learn different affective distortions. Delightful.

The practical implication is that affective evaluation should be repeated after fine-tuning. It is not enough to test the base model once, laminate the score, and call it governance.

The appendix tests robustness, not a second thesis

The appendix is useful because it separates the method from a few obvious objections. The main experiments rely heavily on GPT-4o-generated scenarios and Llama models. The appendix asks whether the hierarchy-construction procedure survives changes in data, model family, prompt wording, and comparison method.

| Test or result | Likely purpose | What it supports | What it does not prove |
| --- | --- | --- | --- |
| GoEmotions dataset with Llama 8B and 70B | Robustness/sensitivity test | The hierarchy method can recover coherent structure on a human-annotated real-world corpus, not only synthetic GPT-4o scenarios | That all real-world domains will produce valid emotion trees |
| DeepSeek-R1 distilled reasoning models | Robustness across model family | The extracted hierarchy is not limited to Llama | That all reasoning models have superior affective structure |
| Alternative prompt wording, “sentiment” instead of “emotion” | Prompt sensitivity test | Similar clustering appears under a nearby prompt template | That the method is immune to prompt design |
| Threshold variation for path length and depth | Robustness test | The scale trend persists across threshold choices | That any threshold is operationally valid |
| Internal-representation clustering comparison | Comparison with prior method | Simple representation clustering diverges more from psychological groupings, supporting the logits-tree method’s value | That logits always dominate internal probes in every setting |
| Wine aroma hierarchy | Generalisation demo | The algorithm can extract class hierarchies beyond emotions | That emotion findings transfer automatically to other cognitive domains |
| Persona edge and prediction differences | Implementation/evidence detail | Demographic prompts change both labels and hierarchy edges | That persona prompts fully represent lived demographic experience |

This is the right use of appendices. They do not create a second paper by stealth. They protect the main claim from easy dismissal.

The GoEmotions test is particularly important. It reduces the concern that the hierarchy is merely an artefact of GPT-4o-generated emotional scenarios. The alternative prompt test reduces the concern that the exact phrase “The emotion in this sentence is” mechanically drives the result. The threshold test reduces the concern that the scale trend is a parameter trick.

None of these tests eliminate the need for deployment-specific validation. They simply make the proposed diagnostic more credible.

What operators can actually do with this

The paper points toward a new evaluation layer for affective AI systems. The layer sits between benchmark scoring and red-team transcripts.

Traditional evaluation asks: “Did the model choose the right label?”

A structural affect audit asks:

| Audit question | Why it matters |
| --- | --- |
| Does the model form a deep enough hierarchy for the target use case? | Shallow affect maps may be acceptable for routing, but risky for coaching, tutoring, therapy-adjacent, or negotiation systems |
| Do related emotions cluster sensibly? | Bad clustering creates inappropriate responses even when broad sentiment is correct |
| Does tree geometry shift across personas or customer segments? | Segment-specific distortion can create fairness and service-quality failures |
| Which emotions are over-attributed to which groups? | Directional misclassification reveals stereotypes, not just noise |
| Does fine-tuning improve one emotional capability while damaging another? | A model trained for persuasion may become sharper in some affective cues and more manipulative in others |
| Do hierarchy metrics predict live performance? | Diagnostics are useful only if they forecast behaviour that matters |

For customer service, this could become part of escalation policy. If the model’s affective map collapses fear into anger for certain user groups, route sensitive cases to a human earlier. For education, test whether the model distinguishes confusion, embarrassment, boredom, and anxiety across student profiles. For AI companions, audit whether loneliness, dependency, shame, and grief are being recognised with dangerous confidence. For sales and negotiation agents, check whether improved affect recognition is being used to support the user or merely to find softer psychological tissue.
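
The escalation idea for customer service might be wired up along these lines. This is a hypothetical policy sketch: the `audit` structure, the segment names, and the 0.2 cut-off are all assumptions for illustration, and any real rule keyed on user segments would need its own fairness and legal review.

```python
def route(predicted_emotion, segment, audit, max_confusion=0.2):
    """Decide who handles a message, given offline audit results.

    audit: dict mapping segment -> {(gold, predicted): confusion rate},
    measured on an evaluation set before deployment (hypothetical format).
    """
    segment_audit = audit.get(segment, {})
    unreliable = any(
        rate > max_confusion
        for (gold, predicted), rate in segment_audit.items()
        if predicted == predicted_emotion and gold != predicted
    )
    # If this label is a known sink for other emotions in this segment,
    # do not act on it automatically.
    return "human" if unreliable else "bot"

# Hypothetical audit: fear is often misread as anger for segment_a only.
audit = {
    "segment_a": {("fear", "anger"): 0.35},
    "segment_b": {("fear", "anger"): 0.05},
}
decision_a = route("anger", "segment_a", audit)
decision_b = route("anger", "segment_b", audit)
```

The rule distrusts a label only where the audit shows it absorbing other emotions, so well-calibrated segments and unaffected labels keep the automated path.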

That last point is not decorative ethics. The paper itself notes dual-use risk: better emotion recognition can support counselling-like applications, but it can also make manipulation more efficient. A model that reads vulnerability well can comfort. It can also exploit. Same sensor, different business model.

What the paper shows, what Cognaptus infers, and what remains uncertain

The paper directly shows that LLM output distributions can be converted into emotion hierarchies, that those hierarchies become richer with scale in the tested models, that they align with established psychological emotion groupings, and that persona-conditioned hierarchy geometry correlates with recognition accuracy.

Cognaptus infers that emotion-tree diagnostics could become useful in enterprise AI evaluation. The inference is plausible because the tree metrics connect model structure to task behaviour. It is especially relevant for systems where emotional misreading affects user trust, escalation, legal exposure, or harm.

What remains uncertain is deployment transfer. The scenarios are generated largely by GPT-4o. The primary hierarchy experiments are Llama-heavy. The emotion taxonomy is fixed and language-mediated. The human study is small and forced-choice. The persona prompts are abstractions, not real people embedded in real cultural contexts. And the method assumes that next-token probabilities over emotion words meaningfully represent the model’s emotion estimates.

None of these limitations break the paper. They define its operating envelope.

A good operator should not ask, “Can I ship this metric directly into production governance?” The better question is, “Can I adapt this diagnostic to my own task, user segments, language context, and harm model?” That is a more annoying question, and therefore usually the correct one.

The real lesson: affective AI needs map audits, not mood rings

The paper is easy to misread. A lazy version of the story says: “LLMs are learning to feel like humans.” That is not what the evidence proves.

The stronger version is more interesting. LLMs appear to organise emotion concepts in increasingly structured, human-aligned ways as scale grows. That structure can predict recognition behaviour. But it can also vary across demographic personas and reproduce socially patterned misclassification. In other words, the model’s emotional map can become richer and more biased at the same time.

For businesses building affective AI, this changes the evaluation brief. Do not buy emotional intelligence as a product claim. Audit the map. Inspect the branches. Test the personas. Measure what collapses into shame, guilt, anger, fear, or frustration. Then decide whether the model is capable enough for the workflow, and whether that capability is pointed in a direction that will not embarrass you in discovery.

The machine is not feeling. But it is learning the grammar of feelings. That may be useful. It may also be precisely where the trouble begins.

Cognaptus: Automate the Present, Incubate the Future.


  1. Maya Okawa, Bo Zhao, Eric J. Bigelow, Rose Yu, Tomer Ullman, Ekdeep Singh Lubana, and Hidenori Tanaka, “Emergence of Hierarchical Emotion Organization in Large Language Models,” arXiv:2507.10599v2, 11 June 2026, https://arxiv.org/abs/2507.10599

  2. Phillip Shaver, Judith Schwartz, Donald Kirson, and Cary O’Connor, “Emotion Knowledge: Further Exploration of a Prototype Approach,” Journal of Personality and Social Psychology 52, no. 6 (1987): 1061–1086.