Customer support has a familiar failure mode: the first answer sounds polished, the second answer sounds patient, the third answer sounds as if the system has quietly forgotten what problem it is solving.

The user is still there. The emotional state has changed. The unresolved issue has shifted. The model, meanwhile, keeps producing individually acceptable replies, like a waiter bringing one beautifully plated dish at a time to the wrong table.

That is the long conversation problem. A good multi-turn AI agent cannot optimize each response as a small independent performance. It has to manage a trajectory. Earlier responses reshape later user states; later success may depend on whether the model did something subtle five turns ago; and a locally soothing reply can still be strategically poor if it closes the wrong door.

The paper discussed here calls its framework MICA, short for Multi-granularity Intertemporal Credit Assignment [1]. The existing article title uses MAPO, which is best understood as the optimization core: mixed advantage policy optimization. The label matters less than the mechanism. The real contribution is not that the model becomes more “empathetic” in the decorative chatbot sense. It is that the training signal starts asking a harder question:

Did this turn move the whole conversation closer to resolution?

That is a better question than “Was this reply nice?” Slightly less cozy, much more useful.

The mistake is treating empathy as a reply-level skill

The obvious interpretation of empathy benchmarks is simple: bigger models understand people better. Train a larger model, add better preference data, make the response warmer, and the problem shrinks.

MICA argues that this is not the core bottleneck.

In emotional support dialogue, the user’s state evolves. One response may validate a feeling; another may help the user reinterpret a memory; another may increase agency. The value of a reply often depends on what it makes possible later. If the model jumps too quickly into advice, it may look proactive while missing the user’s immediate need for affective validation. If it keeps validating forever, it may sound gentle while failing to help the user move forward. Empathy, in this setting, is not a tone. It is a sequence-control problem with psychological furniture.

The paper builds on EMPA, a multi-turn empathy environment that represents user needs across three dimensions:

| Empathy dimension | What the model must do | Common failure |
| --- | --- | --- |
| Cognitive empathy | Understand the user’s mental state and internal conflict | Offering generic comfort without understanding the situation |
| Affective empathy | Validate and regulate emotional distress | Explaining when the user needs to feel heard |
| Proactive empathy | Increase agency and action feasibility | Comforting endlessly without helping the user move |

This matters because a conversational agent must choose which kind of support is needed at a particular turn. The wrong kind of empathy can be worse than no sophistication at all. A crisp analysis delivered when the user needs reassurance is not “advanced reasoning”; it is a very expensive misunderstanding.

MICA turns a conversation into movement through state space

The central mechanism is simple enough to explain without pretending reinforcement learning is a lifestyle brand.

MICA models the user’s support state as a structured vector. The target is a fully supported state, represented as the origin. At every turn, the environment’s judge estimates how the assistant’s response changes the user’s cognitive, affective, and proactive empathy needs. The policy is rewarded not merely for being in a good state, but for moving the state in the right direction.

That distinction is the paper’s first important technical move.

A naive reward would look at the absolute distance from the current user state to the target. If the user is already close to resolution, even a mediocre response may receive a good score. If the user is in a difficult state, a genuinely helpful response may still look bad because the conversation remains far from resolved. Absolute position confuses the current response with everything that happened before it.

MICA instead uses Incremental Distance Reward:

$$ r_t^{IDR} = D_{t-1} - D_t $$

Here, $D_t$ is the residual distance from the current support state to the target after turn $t$. A positive reward means the assistant reduced the remaining distance. A negative reward means the assistant moved the conversation away from resolution.
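As a minimal sketch of this reward, assume the support state is a three-component vector (cognitive, affective, proactive) and that distance is the Euclidean norm to the origin. The vector layout, the choice of norm, the function names, and the example trajectory are all illustrative assumptions, not details confirmed by the paper.

```python
import math

def distance_to_target(state):
    """Euclidean distance from a support-state vector to the origin (assumed metric)."""
    return math.sqrt(sum(x * x for x in state))

def incremental_distance_rewards(states):
    """Return r_t = D_{t-1} - D_t for t = 1..T.

    `states` holds the initial state followed by the state after each turn.
    """
    distances = [distance_to_target(s) for s in states]
    return [prev - curr for prev, curr in zip(distances, distances[1:])]

# Hypothetical trajectory of (cognitive, affective, proactive) residual needs.
states = [(3.0, 4.0, 0.0), (3.0, 0.0, 0.0), (0.0, 4.0, 0.0), (0.0, 0.0, 0.0)]
rewards = incremental_distance_rewards(states)
print(rewards)  # [2.0, -1.0, 4.0]
```

The second turn gets a negative reward because it moved the state away from the target, even though the conversation eventually resolves. The rewards also telescope: their sum equals $D_0 - D_T$, so the total credit handed out over a dialogue is exactly the total distance closed.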

This is the right abstraction for business agents too. In customer support, the question is not “Is the ticket now fully solved?” after every message. Many tickets are not solved after one good turn. The better question is whether the latest response reduced uncertainty, increased trust, clarified the next action, or removed a blocker. Progress is not the same as completion. Anyone who has managed an enterprise support queue already knows this; MICA merely gives the idea a training signal.

The second mechanism: local progress is not enough

Incremental reward solves one problem but creates another. If the model optimizes only for immediate improvement, it may become myopic. A reply can create a short-term emotional lift while weakening the longer conversation. The model may choose quick reassurance over deeper repair, or premature closure over sustained interaction.

MICA therefore combines two credit signals.

The first signal is the Monte Carlo return, which captures the future reward following a response:

$$ G_t = \sum_{\tau=t}^{T} \gamma^{\tau-t} r_\tau $$

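Here, $r_\tau$ is the reward at turn $\tau$, $\gamma$ is the discount factor, and $T$ is the final turn. The return tells the model whether a turn helped the remaining dialogue unfold well. It is long-horizon and trajectory-aware.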
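The return formula can be computed in one backward pass, since $G_t = r_t + \gamma G_{t+1}$. The sketch below is a generic implementation of that recursion; the function name and the discount value of 0.5 are illustrative assumptions, not settings reported by the paper.

```python
def discounted_returns(rewards, gamma):
    """Compute G_t = sum_{tau=t}^{T} gamma^(tau-t) * r_tau for every turn t."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each return reuses the next one: G_t = r_t + gamma * G_{t+1}.
    for t in range(len(rewards) - 1, -1, -1):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Hypothetical per-turn rewards for a three-turn dialogue.
rewards = [2.0, -1.0, 4.0]
returns = discounted_returns(rewards, gamma=0.5)
print(returns)  # [2.5, 1.0, 4.0]
```

Note how the middle turn, which has a negative immediate reward, still receives a positive return because the turn after it went well. That gap between local and delayed credit is what the two signals are meant to capture separately.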

The second signal is the immediate IDR reward, which tells the model whether the current turn produced local progress.

The paper’s key optimization design is to normalize these signals at different granularities:

| Signal | What it captures | Normalization scope | Why this scope matters |
| --- | --- | --- | --- |
| Monte Carlo return | Long-term effect of a response | Across samples at the same turn index | Returns vary systematically by dialogue position |
| Immediate IDR reward | Local movement toward support | Across the rollout group | Immediate rewards are more stable across turns |
| Mixed advantage | Combined local and delayed credit | Weighted mixture | Balances progress now with trajectory quality later |

This is the core of MICA. It does not require a learned critic. It does not require matched-state rollout trees. It does not pretend that multi-turn dialogue can be replayed from identical states after every user reaction. The framework accepts the annoying fact that conversations branch endogenously because the model’s own earlier responses change the user.
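A rough sketch of the two normalization scopes may make the design concrete. It assumes z-score normalization and a single mixing weight `alpha`; neither the exact normalizer nor the weight value is taken from the paper, and the function names are hypothetical.

```python
import statistics

def zscore(values, eps=1e-8):
    """Standardize a list of numbers; eps guards against zero variance."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / (std + eps) for v in values]

def mixed_advantages(group_rewards, group_returns, alpha=0.5):
    """Blend two critic-free advantages for a group of rollouts.

    group_rewards[i][t]: immediate IDR reward of rollout i at turn t.
    group_returns[i][t]: Monte Carlo return of rollout i at turn t.
    alpha: hypothetical mixing weight on the return-based term.
    """
    # Immediate rewards: one normalization over every turn of every rollout.
    flat = [r for rollout in group_rewards for r in rollout]
    flat_norm = iter(zscore(flat))
    local = [[next(flat_norm) for _ in rollout] for rollout in group_rewards]

    # Returns: normalize separately per turn index, across rollouts that reach it.
    longest = max(len(rollout) for rollout in group_returns)
    delayed = [[0.0] * len(rollout) for rollout in group_returns]
    for t in range(longest):
        members = [i for i, rollout in enumerate(group_returns) if len(rollout) > t]
        normed = zscore([group_returns[i][t] for i in members])
        for i, value in zip(members, normed):
            delayed[i][t] = value

    return [
        [alpha * d + (1 - alpha) * l for d, l in zip(d_row, l_row)]
        for d_row, l_row in zip(delayed, local)
    ]

# Two hypothetical rollouts: the first is better at both granularities.
adv = mixed_advantages(
    group_rewards=[[1.0, 1.0], [-1.0, -1.0]],
    group_returns=[[2.0, 1.0], [-2.0, -1.0]],
)
print(adv)
```

No value network appears anywhere: every baseline is a plain mean over sampled rollouts. A turn index reached by only one rollout gets a zero return-based advantage in this sketch, which is one simple way to handle dialogues of unequal length.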

That is why the mechanism-first reading matters. A plain summary would say “MICA improves empathy benchmarks.” True, but not very useful. The interesting part is how it avoids three bad choices:

| Bad choice | Why it fails in dialogue | MICA’s alternative |
| --- | --- | --- |
| Trajectory-only reward | Tells whether the conversation ended well but not which turns mattered | Use per-turn IDR plus future returns |
| Pure turn-level reward | Can become myopic and miss delayed consequences | Include Monte Carlo return |
| Critic or rollout-tree methods | Add approximation error or exponential rollout cost | Use critic-free mixed advantage |

The paper’s tone is technical. The business translation is blunt: if an AI agent works over time, the reward must also work over time.

The main evidence: the gains are largest where credit assignment is hardest

The experiments evaluate MICA on EMPA, EmoBench, and EQ-Bench. EMPA is the most central test because it is multi-turn: the model has up to 45 turns to calm and support a simulated user, and failure can occur if the user’s emotional state regresses for five consecutive turns. EmoBench and EQ-Bench add broader emotional understanding and emotional intelligence tests, but they are less direct evidence for long-horizon support behavior.
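The two stopping rules described here, a 45-turn budget and failure after five consecutive regressions, can be sketched as a small episode classifier. This is an illustration only: it assumes a "regression" means the residual distance increased on that turn and that success means reaching a distance threshold, and the function name and return labels are hypothetical rather than EMPA's actual interface.

```python
def episode_outcome(distances, max_turns=45, max_regressions=5, success_threshold=0.0):
    """Classify a rollout from its residual distances D_0..D_T.

    Assumptions (not from the paper): a regression is any turn where the
    distance increases, and success means reaching `success_threshold`.
    Returns (label, number_of_turns_played).
    """
    streak = 0
    turns_played = 0
    for turn, (prev, curr) in enumerate(zip(distances, distances[1:]), start=1):
        if turn > max_turns:
            break
        turns_played = turn
        if curr <= success_threshold:
            return ("success", turn)
        # Count consecutive worsening turns; any non-worsening turn resets the streak.
        streak = streak + 1 if curr > prev else 0
        if streak >= max_regressions:
            return ("failed_regression", turn)
    return ("unresolved", turns_played)

print(episode_outcome([5.0, 4.0, 3.0, 0.0]))            # ('success', 3)
print(episode_outcome([1.0, 2.0, 3.0, 4.0, 5.0, 6.0]))  # ('failed_regression', 5)
```

The point of the sketch is that early termination makes dialogue length itself an outcome of the policy, which is one reason returns differ systematically by turn index.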

The authors compare MICA with base models, GRPO, REINFORCE++ with trajectory-level optimization, and REINFORCE++ using batch-normalized IDR. They test Qwen2.5-7B-Instruct and Qwen3 models at 8B, 14B, and 32B scale.

The headline result is that MICA consistently improves EMPA scores and outperforms the RL baselines.

| Base model | Base EMPA score | MICA EMPA score | Improvement | Base pass count | MICA pass count |
| --- | --- | --- | --- | --- | --- |
| Qwen2.5-7B-Instruct | 15.7 | 58.9 | +43.2 | 0 | 9.0 |
| Qwen3-8B | 13.3 | 41.5 | +28.2 | 0 | 8.3 |
| Qwen3-14B | 53.5 | 68.4 | +14.9 | 12 | 20.0 |
| Qwen3-32B | 68.9 | 84.2 | +15.3 | 19 | 26.3 |

The smaller models benefit most in absolute EMPA score. That should not be read as magic. It likely means that when a model is weak at managing a multi-turn trajectory, better credit assignment unlocks a large amount of previously wasted capacity. The larger models already begin from stronger baselines, so their improvements are smaller but still meaningful.
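The improvement column is simple subtraction, and recomputing it makes the ordering explicit. The snippet below uses only the EMPA scores quoted in the table; the variable and function names are illustrative.

```python
# EMPA scores copied from the results table: (base, after MICA training).
empa_scores = {
    "Qwen2.5-7B-Instruct": (15.7, 58.9),
    "Qwen3-8B": (13.3, 41.5),
    "Qwen3-14B": (53.5, 68.4),
    "Qwen3-32B": (68.9, 84.2),
}

def absolute_gain(base, trained):
    """Point improvement, rounded to one decimal like the table."""
    return round(trained - base, 1)

gains = {name: absolute_gain(*pair) for name, pair in empa_scores.items()}
ranked = sorted(gains, key=gains.get, reverse=True)

for name in ranked:
    print(f"{name}: +{gains[name]}")
```

The two smallest backbones lead the ranking, while the 14B and 32B models gain a similar amount from much higher starting points.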

The Qwen3-32B result is also useful for calibration. After MICA training, Qwen3-32B reaches 26.3 passed EMPA cases and an 84.2 EMPA score. In the paper’s comparison table, Claude-3.5-Sonnet records 25 passed cases and an 85.1 score, while Gemini-2.5-pro remains stronger at 27 passed cases and 90.7. So MICA narrows the gap with strong closed-source systems; it does not abolish the gap because, regrettably, tables still contain numbers.

On EmoBench and EQ-Bench, the gains are smaller but mostly positive. That is exactly what we should expect. MICA is designed for trajectory-level dialogue learning. If it also improves single-turn emotional reasoning, good. But the strongest evidence is the multi-turn EMPA movement, not a vague claim that the model has become emotionally enlightened.

The ablations explain why the mechanism works

The paper’s ablations are not decorative appendix confetti. They answer three important questions: whether mixed advantage is necessary, whether the reward design matters, and whether the method depends too heavily on one judge model.

Mixed advantage is an ablation of credit granularity

The authors compare Mixed Advantage against two single-granularity variants: group-level immediate reward and turn-level Monte Carlo return. The purpose is not to introduce a second thesis. It tests whether the hybrid signal is doing real work.

It is. On Qwen3-8B, Mixed Advantage reaches 8.3 EMPA pass cases and a 41.5 EMPA score, compared with 5.7/38.5 for group-level and 5.7/36.1 for turn-level. On Qwen2.5-7B-Instruct, Mixed Advantage reaches 9.0/58.9, ahead of 6.0/50.8 for group-level and 6.7/54.5 for turn-level.

The optimization curves add a second point: group-level advantage can produce stronger but less stable gradients, while turn-level advantage is more stable but weaker. Mixed Advantage tries to keep the signal without inheriting the tantrum. This is not merely “two things are better than one.” It is a specific compromise between local feedback and delayed consequences.

IDR is an ablation of reward meaning

The reward ablation compares Incremental Distance Reward with Absolute Distance Reward. This is where the paper’s mechanism becomes most intuitive.

On Qwen3-8B with group-level advantage, ADR reaches only 1.0 EMPA pass and an 18.2 score. IDR reaches 5.7 passes and a 38.5 score. With turn-level advantage, ADR improves to 3.3/31.5, while IDR reaches 5.3/34.0. Mixed IDR performs best at 8.3/41.5.

The interpretation is straightforward. Absolute distance is contaminated by history. Later turns may look better simply because previous turns already moved the user closer to resolution. Earlier turns may look worse because they happen before enough progress has accumulated. IDR asks a cleaner question: did this response reduce the remaining distance?
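A toy comparison shows the contamination directly. The sketch assumes ADR scores a turn by the negative residual distance after it, which is one plausible reading of "absolute distance" rather than the paper's exact formula; the distances are invented for illustration.

```python
def absolute_distance_reward(distances):
    """ADR sketch: score each turn by how close the state is afterwards (-D_t)."""
    return [-d for d in distances[1:]]

def incremental_distance_reward(distances):
    """IDR: score each turn by how much distance it removed (D_{t-1} - D_t)."""
    return [prev - curr for prev, curr in zip(distances, distances[1:])]

# Hypothetical residual distances D_0..D_4. The first three turns each remove
# two units; the fourth turn changes nothing.
distances = [8.0, 6.0, 4.0, 2.0, 2.0]

adr = absolute_distance_reward(distances)
idr = incremental_distance_reward(distances)
print(adr)  # [-6.0, -4.0, -2.0, -2.0]
print(idr)  # [2.0, 2.0, 2.0, 0.0]
```

Under ADR the idle fourth turn ties for the best score and the helpful first turn gets the worst one, purely because of where they sit in the dialogue. Under IDR the three helpful turns score identically and the idle turn scores zero.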

For enterprise AI, this is the difference between scoring an agent on final customer satisfaction alone and scoring whether each interaction reduced the actual unresolved state. Final satisfaction is useful, but it is a blunt instrument. Blunt instruments have their place. Usually not inside training loops.

Judger tests are robustness checks, not proof of human truth

The paper also replaces the training Judger with Qwen3-235B, MiniMax-M2.5, and GLM-4.7 while keeping the rest of the setup fixed. The downstream results are similar across these judges. For example, Qwen3-32B trained with the three judges ranges from 83.4 to 84.2 in EMPA score; Qwen3-14B ranges from 68.0 to 69.1; Qwen3-8B ranges from 41.1 to 43.5.

This supports a limited but important claim: MICA’s gains do not appear to depend on one exact open-source judge calibration in this environment.

It does not prove that the reward signal is equivalent to human judgment. The paper itself still depends on environment-provided dense feedback, simulated users, and judge models. The cross-judger test says the compass points in roughly similar directions across selected compasses. It does not prove the map is the territory. Yes, we do still have to say that. Reality insists.

The business lesson: optimize interaction trajectories, not charming fragments

The most useful business interpretation is not “use MICA for empathy.” That would be too narrow.

The broader lesson is that long-horizon AI systems need structured progress signals. Any AI agent that works across multiple steps faces the same credit assignment problem:

| Business setting | Evolving state | Bad reward design | Better MICA-style framing |
| --- | --- | --- | --- |
| Customer support | User trust, issue clarity, resolution path | Score only final satisfaction | Reward each turn for reducing unresolved issue distance |
| Education | Student confusion, confidence, misconception state | Score answer correctness only | Reward movement from misconception toward mastery |
| HR or coaching | Employee concern, motivation, action readiness | Score perceived warmth | Reward progress in understanding, validation, and agency |
| Sales or onboarding | Buyer uncertainty, objections, next-step commitment | Score meeting outcome only | Reward movement through a structured decision state |
| Tool-using agents | Task state, error state, dependency completion | Score final task success only | Reward reduction in remaining task distance plus future completion |

The operational pattern is clear:

  1. Define the state that matters.
  2. Define the target state.
  3. Reward incremental movement toward that target.
  4. Preserve long-horizon consequences through returns.
  5. Avoid relying on one final outcome score to explain twenty turns of behavior.
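The five steps above can be sketched for a customer-support ticket. Everything in this example is hypothetical: the `TicketState` fields, the unweighted sum used as distance, and the discount value are placeholders a real team would replace with its own state definition and calibration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TicketState:
    """Step 1: a hypothetical ticket state; each field is unmet need in [0, 1]."""
    issue_unclear: float
    trust_deficit: float
    next_step_unclear: float

    def distance(self) -> float:
        # Step 2: the target is the all-zero state, so distance is total unmet need.
        return self.issue_unclear + self.trust_deficit + self.next_step_unclear

def score_turns(states, gamma):
    """Steps 3 and 4: per-turn progress plus discounted future progress."""
    d = [s.distance() for s in states]
    progress = [prev - curr for prev, curr in zip(d, d[1:])]
    returns, running = [], 0.0
    for r in reversed(progress):
        running = r + gamma * running
        returns.append(running)
    returns.reverse()
    return progress, returns

# Initial state followed by the state after each of three agent turns.
states = [
    TicketState(1.0, 0.5, 1.0),
    TicketState(0.5, 0.5, 1.0),
    TicketState(0.5, 0.75, 0.5),
    TicketState(0.0, 0.25, 0.0),
]
progress, returns = score_turns(states, gamma=0.5)
print(progress)  # [0.5, 0.25, 1.5]
print(returns)   # [1.0, 1.0, 1.5]
```

Step 5 shows up in the output: a single final score would say only that the ticket ended at distance 0.25, while the per-turn values show that the second turn cost some trust but still set up the large gain in the third.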

This is especially relevant for enterprise agents because many business workflows are not one-shot. A compliance assistant, procurement assistant, claims assistant, investment research assistant, or customer success agent must manage context and consequences. The agent must know whether it is reducing uncertainty, increasing decision readiness, or merely producing fluent paragraphs.

The paper’s contribution is a reminder that “multi-turn” is not a UI feature. It is an optimization problem.

What the paper directly shows, and what Cognaptus infers

It is worth separating evidence from extrapolation.

| Layer | What can be said |
| --- | --- |
| Direct paper result | MICA improves performance over GRPO and REINFORCE++ variants on EMPA, EmoBench, and EQ-Bench across tested Qwen backbones. |
| Mechanistic evidence | IDR outperforms ADR; Mixed Advantage outperforms single-granularity variants; judge substitution produces similar downstream results in the tested environment. |
| Cognaptus inference | Business agents should be trained and evaluated on structured trajectory progress, not isolated response quality. |
| Still uncertain | Whether the same method transfers cleanly to live human conversations, enterprise workflows, safety-critical coaching, or domains without dense reliable feedback. |

This distinction matters because the tempting sales version writes itself: “New RL method teaches AI to care.” That sentence is catchy and mostly useless. MICA does not prove machine compassion. It shows that when a conversational task has an evolving user state, reward design must measure progress across that state.

That is more boring than “AI empathy.” It is also more actionable.

The boundary: dense feedback is the expensive part

MICA is critic-free, but not cost-free.

The training environment still needs a user simulator, a director-like mechanism for tracking psychological state, and a judge that can score turn-level changes. The paper’s implementation uses large models as Actor and Judger components. The authors also report A100-scale training and separate large-model judge deployments. This is not a laptop recipe, unless the laptop is hiding a data center under the keyboard.
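To see why the environment is the expensive part, it helps to write out the loop that produces a single training rollout. The sketch below is a schematic under loud assumptions: `policy`, `user_simulator`, and `judge` are hypothetical callables standing in for large models or state trackers, the L1 distance is an arbitrary choice, and none of the signatures come from the paper's code.

```python
from typing import Callable, List, Tuple

State = Tuple[float, ...]

def collect_rollout(
    policy: Callable[[List[str]], str],
    user_simulator: Callable[[List[str]], str],
    judge: Callable[[State, str], State],
    initial_state: State,
    opening_message: str,
    max_turns: int,
) -> Tuple[List[str], List[float]]:
    """Run one simulated dialogue and return (history, per-turn IDR rewards)."""

    def distance(s: State) -> float:
        return sum(abs(x) for x in s)  # assumed L1 distance to the origin

    history = [opening_message]
    state, rewards = initial_state, []
    for _ in range(max_turns):
        response = policy(history)
        history.append(response)
        new_state = judge(state, response)  # judge scores the turn-level change
        rewards.append(distance(state) - distance(new_state))
        state = new_state
        if distance(state) == 0:
            break
        history.append(user_simulator(history))  # user reacts to the response
    return history, rewards

# Toy stand-ins so the loop can run without any model.
history, rewards = collect_rollout(
    policy=lambda h: "I hear you.",
    user_simulator=lambda h: "Thanks.",
    judge=lambda s, r: tuple(max(x - 1.0, 0.0) for x in s),
    initial_state=(2.0, 1.0, 0.0),
    opening_message="I feel stuck.",
    max_turns=45,
)
print(rewards)  # [2.0, 1.0]
```

In a real setup each of the three callables is a model call per turn, and a training step needs a whole group of such rollouts, which is where the compute bill comes from.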

The practical bottleneck for businesses is therefore not only the policy update. It is the reward environment.

To apply this style of method outside emotional support, a company would need to build domain-specific state tracking. For customer support, the state might include issue diagnosis, user sentiment, policy constraint, refund eligibility, and next-step clarity. For tutoring, it might include misconception type, confidence, concept mastery, and transfer ability. For sales, it might include objections, urgency, budget clarity, stakeholder alignment, and trust.

That work is unglamorous. It is also where most of the value lives.

The paper’s limitation is precise: dense environment-provided feedback is hard to obtain in many real-world settings. Simulated dialogue benchmarks are useful testbeds, but they are not the same as messy customers, confused students, emotionally vulnerable users, or regulated enterprise processes. The correct business move is not immediate deployment into sensitive human support. It is to borrow the structure: state, target, incremental progress, delayed consequences, and validation against real outcomes.

The real shift is from answer quality to trajectory quality

The older generation of LLM evaluation rewarded beautiful fragments: correct answers, preferred responses, safe completions, helpful tone. Those still matter. But agents expose the weakness of fragment-level optimization. A conversation can contain ten acceptable replies and still fail as a conversation.

MICA points toward a different evaluation grammar. The unit of quality becomes the trajectory. The response is judged by what it changes, not just by how it reads.

That is the article’s main revision from the earlier MAPO framing. The original version correctly identified mixed advantage as important, but underplayed the more subtle reward design. The paper is not just blending trajectory-level and turn-level signals. It is first redefining what a turn-level signal should mean: movement toward a structured target state, not absolute goodness at a moment in time.

For business AI, this is the durable idea. The future of useful agents will not be built by making every sentence a little more charming. It will be built by teaching systems to manage state over time.

Empathy is simply the paper’s test case. The mechanism is larger.

Cognaptus: Automate the Present, Incubate the Future.


  1. Naifan Zhang, Ruihan Sun, Jinwei Su, Hengjie Yang, Zhengyuan Pan, Zhaohan Chen, and Xiaofan Zhang, “MICA: Multi-granularity Intertemporal Credit Assignment for Long-Horizon Emotional Support Dialogue,” arXiv:2603.06194, 2026, https://arxiv.org/html/2603.06194