Avatars are easy to make expressive once.
That is the boring version of the problem. Give a motion model enough examples of sad walking, angry gesturing, or excited dancing, and it can learn the broad association between text and motion. The harder problem starts later, after the product has already shipped. A game studio adds a new combat animation pack. A VR training company expands from office scenarios to emergency response. A digital-human platform moves from daily-life gestures into sports, performance, musical instruments, and acrobatics. Suddenly “sad” is no longer just a lowered head during walking. It must become a lowered head while jogging, a constrained body during performance, or a professional movement pattern inside a sport.
The new paper “Towards Closed-Loop Embodied Empathy Evolution: Probing LLM-Centric Lifelong Empathic Motion Generation in Unseen Scenarios” frames exactly this problem as LLM-Centric Lifelong Empathic Motion Generation, or L2-EMG [1]. Its proposed method, ES-MoE, is not merely another motion generator with a nicer emotional adjective attached. The paper’s useful idea is more specific: emotional expression and scenario-specific motion style should be treated as two different things that must be learned together without being confused.
That distinction matters. “Sad” may transfer across scenarios. “Jogging sadly,” “walking sadly,” and “performing sadly” share some emotional cues. But jogging, walking, and performing do not share the same motion grammar. If a model fuses emotion too tightly with the first scenario where it saw the emotion, it may learn sadness as “the way people looked sad in that old dataset,” which is not the same as sadness as a reusable control signal. Very human mistake. Slightly embarrassing for a machine, but only slightly.
The paper’s central question is therefore not “can we generate emotional motion?” It is: can a model keep learning new motion scenarios while preserving transferable emotional understanding and avoiding catastrophic forgetting of old scenario styles?
The hard part is sadness while jogging, not sadness in the abstract
A tempting reader misconception is that emotionally expressive motion generation mainly needs better emotion labels. Add “sad,” “angry,” or “happy” to the text prompt, collect more labeled examples, and the model should improve.
The paper argues that this is too flat. Emotion labels are not the whole control problem because motion is shaped by scenario. A sad daily-life walk may involve lowered head and smaller leg movement. A sad sports motion may still need to preserve the structure of jogging or athletic movement. A sad stage performance may exaggerate the body differently. The emotion is shared, but the expression is filtered through the activity.
This creates a two-sided failure mode:
| Problem | What the model must preserve | What can go wrong |
|---|---|---|
| Emotion transfer | Shared emotional cues across scenarios | The model learns “sadness” as a dataset-specific visual habit rather than a transferable representation |
| Scenario adaptation | Motion grammar unique to each activity | The model updates for new scenarios and forgets how old scenarios should move |
| Lifelong learning | Sequential learning without storing all prior data | The model performs well on the latest scenario but degrades on earlier ones |
This is why the paper’s mechanism-first framing is stronger than a leaderboard-first summary. The method only makes sense after the reader sees the split: emotion should travel across scenarios, while scenario-specific motion style should remain locally adapted.
L2-EMG turns emotional motion into a lifelong learning task
The paper defines L2-EMG as a sequential learning setup. The model sees motion scenarios over time. At each step, it has access to the current scenario’s data and must learn that scenario without forgetting previous ones. The generated output is not raw text; the system converts 3D human motion into discrete motion tokens and trains an LLM to generate those tokens from emotional text instructions.
The authors construct two benchmark settings:
| Dataset setting | What it tests | Scale reported in the paper |
|---|---|---|
| Unseen L2-EMG | Sequential learning across eight scenario-specific subsets such as Daily Life, Sports, Dance, Shows, Game, Animation, Instrument Play, and Acrobatics | 19,916 total samples |
| Mixed L2-EMG | Incremental data where scenario categories are mixed, intended to better mimic messy real-world data updates | 19,916 total samples |
For dataset construction, the authors use EmotionalT2M and selected subsets of Motion-X. Because EmotionalT2M is relatively small, they expand the data using Motion-X and merge text descriptions with emotion labels following prior work. They also report scenario categorization by two annotators with expert adjudication and a Kappa value of 0.85. That is useful because the benchmark depends heavily on whether the scenario boundaries are meaningful, not just convenient folder names.
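To make the agreement figure concrete, here is a minimal sketch of how a kappa score is computed for two annotators. It assumes the reported value is Cohen's kappa, which the text above does not state explicitly, and the labels below are hypothetical toy data, not the paper's annotations.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label rates.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_chance = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if p_chance == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_observed - p_chance) / (1.0 - p_chance)

# Hypothetical toy labels, not data from the paper.
annotator_1 = ["Sports", "Dance", "Sports", "Game", "Dance", "Sports"]
annotator_2 = ["Sports", "Dance", "Game", "Game", "Dance", "Sports"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.2f}")
```

The point of the correction term is that raw agreement overstates reliability when a few scenario labels dominate; kappa discounts the agreement two annotators would reach by chance.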
The model pipeline has two stages. First, a VQ-VAE motion tokenizer learns to encode and reconstruct 3D motion as discrete tokens. Second, a LLaMA2-7B backbone is fine-tuned sequentially to generate those motion tokens from instructions, for example a request to generate a motion sequence that matches an emotional text description.
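The tokenization step can be illustrated with the quantization lookup at the heart of any VQ-VAE. This is a sketch under stated assumptions: it skips the learned encoder and decoder entirely, and the codebook, its size, and the feature dimension are hypothetical values chosen for readability, not the paper's configuration.

```python
import numpy as np

def quantize(frames, codebook):
    """Map each latent motion vector to the index of its nearest codebook entry."""
    # Squared Euclidean distance between every frame and every code: shape (T, K).
    diffs = frames[:, None, :] - codebook[None, :, :]
    dists = np.sum(diffs ** 2, axis=-1)
    return np.argmin(dists, axis=1)

def dequantize(tokens, codebook):
    """Look tokens back up in the codebook to recover quantized latents."""
    return codebook[tokens]

# Hypothetical codebook: 8 codes in a 4-dim latent space, code k = [k, k, k, k].
codebook = np.arange(8, dtype=float)[:, None] * np.ones((1, 4))
# Latents that sit close to codes 3, 3, 5, 1 (a real system would get these from the encoder).
frames = codebook[[3, 3, 5, 1]] + 0.1

tokens = quantize(frames, codebook)     # discrete motion tokens the LLM would be trained to emit
recon = dequantize(tokens, codebook)    # what the decoder would receive at generation time
print(tokens.tolist())
```

Once motion is a sequence of integers like this, generating it becomes a next-token prediction problem, which is what lets an LLM backbone be reused.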
This design is not claiming that the LLM “understands empathy” in the human sense. It treats emotional motion as a controllable generation problem under sequential domain expansion. That is a narrower claim, and therefore a more useful one.
ES-MoE separates transferable emotion from scenario-specific motion
ES-MoE has two main modules, each aimed at one side of the failure mode.
| ES-MoE component | Targeted failure | Mechanism | Practical reading |
|---|---|---|---|
| Causal-Guided Emotion Decoupling Block, or CGED | Emotion gets entangled with shallow motion semantics | Uses a causal intervention graph and deconfounded attention with self-sampling and cross-sampling | Try to make emotion a reusable representation, not just a side effect of one scenario |
| Scenario-Adaptive Mixture of Experts, or SAMoE | New scenario learning overwrites old scenario behavior | Builds scenario-specific LoRA experts and gates among them | Add new motion knowledge without rewriting the whole model every time |
The causal-guided emotion block is the more conceptually delicate part. The paper models input features as $X$, a decoupled representation as $M$, the emotion category as $Y$, and confounding factors such as shallow motion semantics as $C$. The intended front-door path is $X \rightarrow M \rightarrow Y$, while the problematic back-door path is $X \leftarrow C \rightarrow Y$.
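For readers who want the formula behind the front-door idea, the textbook front-door adjustment for this graph is shown below. It is the standard result from causal inference; whether the paper's deconfounded attention approximates exactly this expression term by term is an assumption here, not something the summary above confirms.

$$P(Y \mid do(X = x)) = \sum_{m} P(m \mid x) \sum_{x'} P(Y \mid m, x')\, P(x')$$

The inner sum averages over other inputs $x'$, which is why a method built on this idea needs some way of sampling beyond the current example.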
In plainer language: the model may notice surface motion patterns that correlate with emotion in one dataset, then mistake those patterns for emotion itself. CGED tries to reduce that confusion. It uses attention-based self-sampling and cross-sampling, with a global dictionary compressed from the training set using K-means. The self-sampling path uses the current input. The cross-sampling path brings in representations from other samples, encouraging the model to learn what emotional commonality survives beyond one motion instance.
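The two sampling paths can be sketched with plain attention. This is an illustrative reading, not the paper's implementation: the feature sizes, the number of dictionary entries, the use of the K-means centroids as the cross-sampling keys and values, and the simple additive fusion at the end are all assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(queries, keys, values):
    """Scaled dot-product attention."""
    scores = queries @ keys.T / np.sqrt(queries.shape[-1])
    return softmax(scores, axis=-1) @ values

def kmeans(points, k, iters=10, seed=0):
    """Plain Lloyd's k-means, returning k centroids."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        assign = dists.argmin(axis=1)
        for j in range(k):
            members = points[assign == j]
            if len(members):  # keep the old centroid if a cluster empties out
                centroids[j] = members.mean(axis=0)
    return centroids

rng = np.random.default_rng(1)
train_features = rng.normal(size=(200, 16))      # hypothetical pooled training features
dictionary = kmeans(train_features, k=8)         # compressed "global dictionary"

x = rng.normal(size=(5, 16))                     # features of the current input (5 positions)
self_part = attend(x, x, x)                      # self-sampling: attend within the input
cross_part = attend(x, dictionary, dictionary)   # cross-sampling: attend to other samples' summary
decoupled = self_part + cross_part               # assumed fusion; the paper may combine differently
```

Compressing the training set into a small dictionary keeps the cross-sampling path cheap: the input attends to a handful of centroids instead of every stored sample.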
The method also adds an emotion classification loss, $L_{emo}$, so the decoupled representation is explicitly pressured to preserve emotion-relevant information. The overall second-stage training loss combines this term with the language-modeling loss $L_{llm}$.
Here, $L_{llm}$ is next-token prediction for motion tokens, while $L_{emo}$ supervises emotional category recognition. This is not philosophical empathy. It is much more mechanical: generate the right motion tokens and keep enough emotion signal alive in the representation. Less romantic, more deployable.
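A minimal sketch of such a two-term objective follows. The `emo_weight` parameter is a hypothetical knob added for illustration; the text above does not say whether or how the paper weights the two terms, so treat the default of 1.0 as an assumption.

```python
import numpy as np

def cross_entropy(logits, targets):
    """Mean cross-entropy of integer targets under a row-wise softmax of logits."""
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

def second_stage_loss(token_logits, token_targets, emo_logits, emo_targets, emo_weight=1.0):
    """Next-token loss on motion tokens plus an emotion classification loss."""
    l_llm = cross_entropy(token_logits, token_targets)   # L_llm
    l_emo = cross_entropy(emo_logits, emo_targets)       # L_emo
    return l_llm + emo_weight * l_emo, l_llm, l_emo

# Toy inputs: 3 motion-token positions over a 4-token vocabulary, 1 sample over 2 emotions.
token_logits = np.zeros((3, 4))
token_targets = np.array([0, 2, 1])
emo_logits = np.zeros((1, 2))
emo_targets = np.array([1])

total, l_llm, l_emo = second_stage_loss(token_logits, token_targets, emo_logits, emo_targets)
```

With uniform logits the two terms reduce to $\log 4$ and $\log 2$, which makes the toy case easy to check by hand.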
Scenario experts keep motion styles from collapsing into one average human
The second half of ES-MoE handles scenario adaptation. For each new scenario, the method constructs a LoRA expert. Instead of fully fine-tuning the whole LLM every time, each scenario gets a low-rank parameter update. The model then uses a gating network to decide how much each expert contributes.
The paper represents the updated model parameters after scenario $i$ as a weighted combination of the base parameters and the learned scenario experts:
$$\theta_i = \theta_0 + \sum_{j=1}^{i} W_j \, \Delta \theta_j$$
where $\theta_0$ denotes the base parameters, $\Delta \theta_j$ is the LoRA update for scenario $j$, and $W_j$ is the expert weight assigned by the gate.
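The weighted combination described above can be sketched for a single weight matrix. The shapes, the rank, and the fixed gate weights are hypothetical, and each LoRA update is written in the usual low-rank form as a product of two small matrices; in a real system the gate weights would come from the gating network rather than a hard-coded array.

```python
import numpy as np

def merge_experts(base_weight, experts, gate_weights):
    """Return base_weight + sum_j W_j * (B_j @ A_j) for low-rank experts (B_j, A_j)."""
    merged = base_weight.copy()
    for (b, a), w in zip(experts, gate_weights):
        merged += w * (b @ a)   # each expert contributes a rank-limited update
    return merged

d_out, d_in, rank = 6, 4, 2                      # hypothetical layer shape and LoRA rank
rng = np.random.default_rng(0)
base = rng.normal(size=(d_out, d_in))            # frozen base weight
experts = [
    (rng.normal(size=(d_out, rank)), rng.normal(size=(rank, d_in)))
    for _ in range(3)                            # one expert per scenario seen so far
]
gate = np.array([0.2, 0.5, 0.3])                 # assumed gate output for one input

merged = merge_experts(base, experts, gate)
```

Because the base weight is never overwritten, setting an expert's gate weight to zero removes that scenario's contribution without touching the others.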
This design matters operationally. In a product environment, new motion scenarios are not rare exceptions. They are the normal lifecycle of an animation system. New game mechanics, training modules, sports movements, industrial tasks, cultural gestures, and avatar personalities arrive over time. A method that requires full retraining on all historical data every time is expensive and often unrealistic, especially when historical motion data is large, proprietary, privacy-constrained, or simply poorly archived.
SAMoE’s design is a reasonable architectural response: isolate scenario-specific adaptation into experts, then learn when to reuse or combine them. The paper also masks previously trained experts during training to avoid over-reliance, then selects the top relevant experts for weight recalculation. This is not just model expansion for the sake of adding parameters. It is model expansion with routing.
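One generic way to realize "mask some experts, keep the top few, then recompute weights" is top-k gating with renormalization. The sketch below shows that general technique only; the function name, the value of `top_k`, and the exact masking rule are assumptions, and the paper's own selection procedure may differ in detail.

```python
import numpy as np

def route(gate_logits, top_k, masked=()):
    """Softmax over the top_k unmasked experts; every other expert gets weight 0."""
    logits = np.asarray(gate_logits, dtype=float).copy()
    if masked:
        logits[list(masked)] = -np.inf           # masked experts can never be selected
    keep = np.argsort(logits)[-top_k:]           # indices of the largest logits
    keep = keep[np.isfinite(logits[keep])]
    if keep.size == 0:
        raise ValueError("all experts are masked")
    weights = np.zeros_like(logits)
    shifted = np.exp(logits[keep] - logits[keep].max())
    weights[keep] = shifted / shifted.sum()      # recalculated weights sum to 1
    return weights

# Hypothetical gate scores for four experts; expert 3 is masked for this step.
weights = route([2.0, 1.0, 0.5, 3.0], top_k=2, masked=(3,))
print(weights.round(3).tolist())
```

Even though expert 3 has the highest raw score, masking forces the gate to spread weight over the remaining experts, which is the over-reliance safeguard in miniature.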
The evidence shows a stronger continual learner, not a magic all-data winner
The paper evaluates ES-MoE against several baselines, including sequential LoRA, LwF-LoRA, EPI, O-LoRA, Prog-Prompt, and SAPT. It also includes a non-continual multi-task learning baseline, MTL, which trains with access to all tasks simultaneously.
That distinction is important. ES-MoE is not better than MTL on every metric. MTL has a structural advantage because it is trained with all data at once. The real comparison is with continual-learning methods that must learn sequentially. On that comparison, ES-MoE posts the strongest results in the table on most reported metrics.
The main metrics are:
| Metric | Meaning | Better direction |
|---|---|---|
| AF | Average FID across scenarios after final training | Lower |
| AR | Average top-1 R-Precision for text-motion alignment | Higher |
| AD | Average diversity | Higher |
| AMM | Average multimodality | Higher |
| AWF | Average weighted F1 for emotion performance | Higher |
| FR | Forgetting rate | Lower |
A compressed view of the key continual-learning comparison looks like this:
| Setting | Method | AF ↓ | AR ↑ | AD ↑ | AMM ↑ | AWF ↑ | FR ↓ |
|---|---|---|---|---|---|---|---|
| Unseen | SAPT | 2.12 | 0.237 | 9.61 | 1.59 | 0.313 | -0.54 |
| Unseen | ES-MoE | 1.89 | 0.241 | 9.74 | 1.47 | 0.340 | -1.03 |
| Mixed | SAPT | 1.65 | 0.245 | 9.47 | 1.82 | 0.327 | -1.97 |
| Mixed | ES-MoE | 1.39 | 0.259 | 9.87 | 1.65 | 0.347 | -3.03 |
The cleanest reading is not “ES-MoE dominates everything.” It does not. For example, SAPT reports higher AMM than ES-MoE in both settings. But ES-MoE improves the more central trade-off: lower average FID, higher text-motion alignment, stronger emotion performance, and lower forgetting rate among the continual-learning methods.
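To put sizes on those gaps, the relative changes can be computed directly from the table. The numbers below are copied from the comparison table above; only the helper function and variable names are new.

```python
# AF: lower is better. AWF: higher is better. Values copied from the table above.
results = {
    "Unseen": {"SAPT": {"AF": 2.12, "AWF": 0.313}, "ES-MoE": {"AF": 1.89, "AWF": 0.340}},
    "Mixed": {"SAPT": {"AF": 1.65, "AWF": 0.327}, "ES-MoE": {"AF": 1.39, "AWF": 0.347}},
}

def relative_change(old, new):
    """Signed change of `new` relative to `old`."""
    return (new - old) / old

af_change = {
    setting: relative_change(rows["SAPT"]["AF"], rows["ES-MoE"]["AF"])
    for setting, rows in results.items()
}
awf_change = {
    setting: relative_change(rows["SAPT"]["AWF"], rows["ES-MoE"]["AWF"])
    for setting, rows in results.items()
}

for setting in results:
    print(f"{setting}: AF {af_change[setting]:+.1%}, AWF {awf_change[setting]:+.1%}")
```

Relative to SAPT, average FID is roughly 11% lower on Unseen and roughly 16% lower on Mixed, which is a meaningful margin but not a different league.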
The negative forgetting rates deserve a small pause. In the paper’s formulation, forgetting rate is based on the drop between previous best R-Precision and final R-Precision over earlier scenarios. A negative value means the final model can perform better than the earlier measured best on average, suggesting backward transfer rather than forgetting under that metric. This is a useful signal, though not a universal guarantee that every kind of old knowledge is preserved. Metrics are contracts. They say what they measure, not what we wish they measured.
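A small sketch shows how a forgetting rate of this kind can go negative. It follows the verbal description above (previous best minus final score, averaged over earlier scenarios); the score matrix is hypothetical, and any scaling the paper applies, such as reporting in percentage points, is not reproduced here.

```python
def forgetting_rate(scores):
    """scores[t][s]: R-Precision on scenario s after training stage t (defined for s <= t)."""
    last = len(scores) - 1
    if last < 1:
        raise ValueError("need at least two training stages")
    drops = []
    for s in range(last):  # only scenarios learned before the final stage
        previous_best = max(scores[t][s] for t in range(s, last))
        drops.append(previous_best - scores[last][s])
    return sum(drops) / len(drops)

# Hypothetical three-stage run. Row t holds scores after training on scenario t.
scores = [
    [0.20],
    [0.22, 0.25],
    [0.24, 0.25, 0.30],
]
fr = forgetting_rate(scores)
print(f"forgetting rate = {fr:+.3f}")
```

In this toy run scenario 0 keeps improving after later training (0.20, then 0.22, then 0.24) and scenario 1 holds steady, so the average drop is negative: backward transfer rather than forgetting, under this metric.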
The ablations say the two modules solve different problems
The ablation tests are not a side quest. They are the paper’s best evidence that ES-MoE’s mechanism is doing more than adding architectural decoration.
| Variant | Purpose of test | Main interpretation |
|---|---|---|
| w/o CGED | Ablation of causal-guided emotion decoupling | Tests whether explicitly separating emotional commonality helps emotion-aware generation |
| w/o SAMoE | Ablation of scenario-adapted experts | Tests whether expert routing is needed for scenario adaptation and retention |
| w/o $L_{emo}$ | Ablation of the emotion classification loss | Tests whether explicit emotional supervision improves the representation |
The results are revealing. Removing SAMoE hurts badly. On the Unseen setting, AF worsens from 1.89 to 2.48, AR drops from 0.241 to 0.206, AWF drops from 0.340 to 0.286, and FR rises from -1.03 to 3.26. On the Mixed setting, AF worsens from 1.39 to 1.96 and FR rises from -3.03 to 0.83. This supports the paper’s claim that scenario-adapted experts are doing real work in continual learning.
Removing CGED also weakens the model, especially on emotional performance. On Unseen, AWF drops from 0.340 to 0.312. On Mixed, it drops from 0.347 to 0.305. However, the ablation does not move every metric in the same direction: for example, w/o CGED has slightly higher AR than ES-MoE on the Unseen setting. That nuance matters. CGED appears especially useful for emotional expression and overall quality, not a magic switch that improves every number in every condition.
Removing $L_{emo}$ produces a smaller but consistent degradation in the emotion-related direction. On Mixed, AWF drops from 0.347 to 0.336 and AR from 0.259 to 0.237. This supports the intuition that emotion decoupling needs explicit pressure; hoping the LLM will keep the right emotional signal while learning motion tokens is optimistic in the way engineering teams later call “a lesson learned.”
The heatmaps and visual cases are diagnostic evidence, not a second thesis
The paper includes heatmaps of average FID across sequential training stages and scenarios. These are best understood as diagnostic visualizations. Each entry shows FID for a scenario after the model has trained up to a certain point. The authors compare SAPT and ES-MoE under both Unseen and Mixed settings.
The visual pattern supports the table: ES-MoE generally maintains lower FID values and shows less severe degradation across scenario transitions. This is evidence about forgetting behavior, not a separate claim that the model has solved all embodied adaptation.
The qualitative visualizations serve a different purpose. They compare generated motion examples from ES-MoE, SAPT, and O-LoRA. In one daily-life sad-walking example, O-LoRA produces uncoordinated arm movement, while ES-MoE better captures lowered-head and slightly swinging-arm cues. In a shows scenario involving angry gesturing, SAPT’s gesture amplitude is too small to express the intended anger, while O-LoRA shows coordination problems. ES-MoE better matches both emotion and scenario.
These examples are useful because emotion in motion is partly about amplitude, timing, posture, and coordination. A metric table can say AWF improved. A visual case can show what that might look like: not just whether the model generated an arm motion, but whether the arm motion had the emotional force implied by the text.
Still, qualitative examples should remain examples. They illustrate failure modes; they do not replace broader evaluation.
The business value is cheaper adaptation, not instant empathetic robots
For business readers, the practical value of this paper is not that it makes avatars emotionally intelligent overnight. Please, no. The more grounded value is that it points toward motion-generation systems that can keep updating without starting from zero.
That has implications for several markets:
| Business domain | Why continual empathic motion matters | What the paper supports | What it does not yet prove |
|---|---|---|---|
| Games and virtual worlds | New characters, actions, and animation packs arrive continuously | Scenario-specific adapters can help preserve old motion styles while learning new ones | Real-time animation quality inside a production engine |
| VR training | Training scenarios expand from simple gestures to complex role-play and emergency behavior | Emotion control may transfer across changing motion contexts | Human learning outcomes or training effectiveness |
| Digital humans | Brand avatars need consistent emotional style across domains | Emotional representation can be treated separately from scenario motion | Commercial-grade personality consistency |
| Robotics and humanoids | Robots may need motion that is socially legible across tasks | Emotional motion tokens could become a conditioning signal | Physical robot safety, control stability, or social acceptability |
| Simulation and synthetic data | Scenario libraries grow over time | Continual learning benchmarks help test update behavior | Coverage of real human cultural, social, and physical variation |
The ROI pathway is therefore adaptation cost. A platform that frequently adds new motion domains faces three recurring expenses: collecting and labeling data, retraining models, and validating that old behavior did not break. ES-MoE addresses the second and third: it uses scenario-specific LoRA experts instead of overwriting the whole model, and it evaluates forgetting explicitly.
For a production team, that suggests a more realistic operating model:
- Treat each new motion domain as a scenario update, not a complete model replacement.
- Maintain separate evaluation sets for old scenarios and new scenarios.
- Track emotion performance separately from motion realism and text alignment.
- Use adapter-style architecture when the product roadmap implies recurring scenario expansion.
- Avoid declaring success from the latest scenario demo alone, because the old scenario may have quietly deteriorated in the basement.
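The checklist above can be turned into a small regression gate that runs after every scenario update. This is a hypothetical sketch: the metric names, the scenario names, the numbers, and the 2% relative tolerance are all illustrative choices, not values from the paper.

```python
# Direction of each per-scenario metric: True means lower is better.
LOWER_IS_BETTER = {"fid": True, "r_precision": False, "emotion_f1": False}

def find_regressions(before, after, tolerance=0.02):
    """Return (scenario, metric) pairs that got worse by more than `tolerance` (relative).

    Assumes every baseline value in `before` is non-zero.
    """
    regressions = []
    for scenario, old_metrics in before.items():
        for metric, old in old_metrics.items():
            new = after[scenario][metric]
            change = (new - old) / abs(old)
            worse = change if LOWER_IS_BETTER[metric] else -change
            if worse > tolerance:
                regressions.append((scenario, metric))
    return regressions

# Hypothetical evaluation results on OLD scenarios, before and after adding a new one.
before = {
    "daily_life": {"fid": 1.50, "r_precision": 0.25, "emotion_f1": 0.34},
    "sports": {"fid": 2.00, "r_precision": 0.24, "emotion_f1": 0.30},
}
after = {
    "daily_life": {"fid": 1.80, "r_precision": 0.25, "emotion_f1": 0.35},
    "sports": {"fid": 1.90, "r_precision": 0.24, "emotion_f1": 0.30},
}

regressions = find_regressions(before, after)
print(regressions)
```

Keeping emotion, realism, and alignment as separate entries is deliberate: in this toy case daily-life emotion F1 improves while FID regresses, and a single blended score would hide that.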
This is where the paper is more useful than its “closed-loop embodied empathy evolution” language might suggest. The phrase is grand. The operational idea is concrete: build systems that can learn new emotional motion contexts while measuring whether they forgot the old ones.
The boundary: benchmark success is not deployment readiness
The paper’s boundary is clear.
First, the evaluation is offline. The method generates tokenized 3D motion and is evaluated with motion-generation metrics, emotion classification performance, forgetting rate, and visual cases. That is not the same as interactive deployment in a game engine, VR system, or robot controller.
Second, the backbone is LLaMA2-7B fine-tuned with LoRA. This tells us the method can be implemented in an LLM-centric pipeline, but it does not settle latency, cost, memory, or serving architecture for a commercial system.
Third, emotional expression is represented through dataset labels and motion cues. This is useful for controllable generation, but it is not a full model of human emotion. It does not account for culture-specific display rules, user perception studies, safety constraints, or whether viewers actually interpret the generated motion as intended.
Fourth, the benchmark is constructed from EmotionalT2M and Motion-X-derived data. The authors make a serious effort to create Unseen and Mixed L2-EMG settings, but constructed benchmarks always reflect their construction choices. Scenario labels, annotation quality, class balance, and the selected motion categories shape what the system learns.
Finally, the paper’s future-work discussion points toward human-scenario interaction and humanoid robot control. Those are plausible directions, but they remain future work. Motion that looks emotionally aligned in tokenized 3D generation is not automatically safe, stable, or useful when transferred to a physical robot. Reality has torque limits. Reality is rude like that.
What this paper changes in the conversation
The paper’s contribution is not that it invents emotion in motion generation. Prior work already studies emotion-controllable text-to-motion systems. Its contribution is to move the question from static performance to lifelong adaptation.
That shift matters because many AI systems fail not at launch, but during maintenance. The first demo works. The second domain works. The third update breaks something the team stopped testing. In motion generation, that breakage is especially visible because bodies are unforgiving. A slightly wrong sentence may pass. A slightly wrong arm swing looks like a haunted mannequin.
L2-EMG gives this maintenance problem a task definition. ES-MoE gives it a mechanism: decouple shared emotional representations, allocate scenario-specific experts, and route among them as new scenarios arrive. The experiments support the idea that this helps against continual-learning baselines, especially in average FID, text-motion alignment, emotion performance, and forgetting behavior.
The business lesson is equally narrow and useful: if an embodied-AI product must keep expanding across motion domains, the model architecture should be designed for updates from the beginning. Emotion should be evaluated as a transferable control signal. Scenario style should be evaluated as something that can be forgotten. And old scenarios should remain in the test suite long after the launch video has done its little dance.
Empathy, in this paper, is not a feeling. It is a representation that must survive updates.
That may sound less poetic. It is also closer to the engineering problem.
Cognaptus: Automate the Present, Incubate the Future.
[1] Jiawen Wang, Jingjing Wang, Tianyang Chen, Min Zhang, and Guodong Zhou, “Towards Closed-Loop Embodied Empathy Evolution: Probing LLM-Centric Lifelong Empathic Motion Generation in Unseen Scenarios,” arXiv:2512.19551, 2025. https://arxiv.org/abs/2512.19551