The answer is not the conversation
Customer support is a useful place to begin, because the failure is easy to recognize. A customer asks a question. The AI gives a technically correct answer. Then the customer asks a follow-up that exposes confusion, irritation, a missing constraint, or a completely different intention. The system that looked excellent on the first turn suddenly looks like it has never met a human being. Which, to be fair, it has not.
Most LLM evaluation still stops at the assistant turn: prompt in, answer out, score the answer. The paper Beyond the Assistant Turn: User Turn Generation as a Probe of Interaction Awareness in Language Models asks a sharper question: after the model gives an answer, can the same model generate a plausible next user turn that reacts to that answer?1
This sounds like a small methodological trick. It is not. It turns the evaluation target from “Can the model solve the task?” into “Does the model encode what its own answer might cause next?” Those are different capabilities. A system can pass the first test and fail the second with impressive confidence, the traditional house style of AI deployment disappointment.
The paper’s useful contribution is not that it invents another leaderboard column. Its contribution is that it exposes a blind spot in assistant-only evaluation: answer quality does not reliably predict interaction awareness. Larger models do not automatically fix it. Sampling can reveal latent follow-up behavior that greedy decoding hides. And collaboration-oriented post-training can move the metric, though not cleanly enough to declare victory.
For businesses building customer-support agents, sales copilots, training simulators, user simulators, self-play evaluation loops, or multi-agent workflows, the message is direct: a model that answers well is not automatically a credible conversational partner. It may be a very competent answer machine that has no stable sense of what a user would say next.
The probe: make the assistant play the user
The paper formalizes a simple setup. Start with a user query $q$. Let a model generate an assistant answer $a$. Then append a user-role header and ask the same model to generate a user turn $u$ conditioned on the query and answer:
$$ u = M_\theta([q; a])$$ The question is whether $u$ is a grounded follow-up: a user turn that references, reacts to, challenges, clarifies, or builds on the assistant’s answer. A good follow-up might say, “That explanation skips the second constraint,” or “Can you make the shorter version less formal?” A bad one might simply repeat the original prompt, continue writing as the assistant, produce internal planning text, or start a completely new task. The authors call the measured behavior “interaction awareness.” The term needs careful handling. This is not a claim that the model literally understands the user in a human psychological sense. It is a behavioral probe: when forced into the user role, does the model produce a plausible conversational consequence of its own answer? That distinction matters. The paper is not mainly trying to build a user simulator. It is asking whether the assistant model’s own weights contain usable next-turn behavior. A dedicated user simulator can be trained separately. This probe asks whether the model we already use as the assistant can also act as a realistic counterparty when repurposed for self-play or evaluation. The answer is usually: not reliably, please stop pretending otherwise. The evaluation uses an LLM judge to classify generated user turns into labels such as previous-turn restatement, assistant-turn restatement, malformed artifact, meta-planning, degenerate short output, and plausible follow-up. The primary metric is the genuine-follow-up rate: the share of generated user turns judged as grounded reactions to the preceding conversation. The authors validate the binary judge decision against blinded human annotation; the headline validation is stronger for genuine-versus-nongenuine classification than for the full detailed label taxonomy.
Correct assistant, broken user: the first comparison
The paper’s central comparison is between task accuracy and follow-up quality. If interaction awareness were just another face of general capability, higher accuracy should roughly imply better follow-ups. It does not. Across 11 open-weight models from Qwen3.5, gpt-oss, and GLM-4.7, the authors evaluate both answer accuracy and genuine-follow-up rate across math, instruction-following, expert QA, and conversational datasets. The headline pattern is uncomfortable in the productive sense: models can be strong at answering while weak at generating a plausible next user turn. A few numbers make the point clearer.
| Comparison | What the paper reports | Why it matters |
|---|---|---|
| Qwen3.5-0.8B on GSM8K | 41.6% accuracy, 1.0% genuine follow-up | The small model is weak at task solving and weak at follow-up. No surprise yet. |
| Qwen3.5-397B-A17B on GSM8K | 96.8% accuracy, 0.8% genuine follow-up | Scaling task accuracy does not make the model interaction-aware. |
| gpt-oss-20B on GPQA Diamond | 61.1% accuracy, 20.7% genuine follow-up | Lower answer accuracy can coexist with higher follow-up behavior. |
| Qwen3.5-397B-A17B on GPQA Diamond | 86.1% accuracy, 0.5% genuine follow-up | The more accurate model can be the worse conversational proxy. |
| Qwen3.5-397B-A17B on IFBench | 51.6% accuracy, 9.7% genuine follow-up | Interaction awareness is context-dependent, not a single global model score. |
| The obvious business reading is not “accuracy is useless.” Accuracy still matters. A wrong answer with beautiful follow-up behavior remains a wrong answer, now with better manners. The better reading is that answer accuracy is incomplete. It does not measure whether the system can anticipate the next conversational state. | ||
| This matters most in systems where the model is used not only to answer users but also to simulate them. Self-play training, multi-agent testing, customer-journey simulation, and synthetic conversation generation often assume that a capable model can play both sides. The paper says: check that assumption before building a product roadmap on top of it. Engineering by wishful symmetry is still wishful thinking. |
Bigger is not the same as more interaction-aware
The within-family Qwen3.5 analysis is especially useful because it reduces the excuse that cross-family differences are just training-recipe noise. The Qwen sweep covers eight models from 0.8B to 397B-A17B. On GSM8K, answer accuracy rises from 41.6% to 96.8%. Follow-up rates under deterministic generation stay near zero, with five of eight models producing 0.0% genuine follow-up on GSM8K. This is the article’s first useful contrast: a model can become much better at solving tasks while remaining almost unchanged in how it behaves after the answer. The paper’s qualitative example makes the contrast concrete. On the same GPQA chemistry question, both Qwen3.5-9B and Qwen3.5-27B answer correctly. The 9B model generates a user turn that engages with a specific part of the assistant’s reasoning. The 27B model restates the original prompt. Same correct answer. Different awareness of what the answer invites next. For procurement and architecture decisions, this is a quiet warning. “Use the biggest model we can afford” may improve first-turn answer quality, latency permitting. It does not guarantee better behavior as a simulated user, evaluator, debate partner, or collaborator. The paper even shows non-monotonic behavior: on some sampled settings, mid-sized models match or outperform larger ones in follow-up quality. The deeper lesson is that interaction awareness is not a passive byproduct of scale. It is shaped by what the model was trained to predict and optimize. If the training process mostly rewards immediate assistant response quality, the model may learn to finish the answer beautifully and then, when asked to become the user, simply restart the task, continue as itself, or leak internal scaffolding. A very large model can still fail in a very large way. At least the GPU bill will be memorable.
The capability sometimes exists, just not where decoding looks first
The second comparison is between deterministic generation and sampled generation. Greedy decoding asks for the most likely next token sequence. Higher-temperature sampling explores more of the model’s distribution. The paper uses this contrast to ask whether interaction awareness is absent or merely not placed at the mode. For Qwen3.5 and GLM-4.7, higher temperature often surfaces follow-up behavior that greedy decoding hides. Qwen3.5-27B rises from 0% to 22% genuine follow-up on GSM8K, from 1.5% to 35.9% on GPQA Diamond, and from 1.0% to 30.7% on IFBench at $T=1.0$. GLM-4.7 reaches 15.2% on GSM8K and 35.4% on GPQA Diamond at the same temperature. That does not mean the problem is solved by turning up temperature. In production, higher sampling can also increase variance, irrelevance, and brand-new categories of nonsense. The value of the temperature sweep is diagnostic, not operational magic. It suggests that some models contain plausible follow-up behavior in the distribution, but default decoding does not reliably select it. The gpt-oss results add another useful contrast. On GPQA Diamond, gpt-oss-20B reaches 47% at $T=1.0$. But gpt-oss-120B remains near zero on GSM8K even at high temperature, and both gpt-oss models stay below 4% on IFBench across temperatures. That pattern matters because it separates “latent but suppressed” from “not available in this context.” For business systems, this means failures should not be treated as one generic “LLM limitation.” There are at least two different cases:
| Failure pattern | What it suggests | Practical response |
|---|---|---|
| Low follow-up at greedy decoding, higher follow-up under sampling | The behavior exists but is not the default continuation | Use targeted decoding, reranking, or post-training diagnostics; do not assume default behavior is representative. |
| Low follow-up even under sampling | The model may not have learned the relevant behavior for that context | Prompting alone is unlikely to be enough; consider model choice, training data, or external user simulation. |
| Higher follow-up only on some datasets | Interaction awareness is context-sensitive | Evaluate on your actual workflow, not a generic benchmark proxy. |
| This distinction is valuable because it changes the remedy. A latent-behavior problem invites reranking, sampling, or preference tuning. An absent-behavior problem points toward data, training, or model-family selection. Calling both “the model is bad at conversation” is accurate but operationally lazy. |
Perturbation tests show the metric is not just decorative
The paper’s controlled perturbations are important because they test whether the genuine-follow-up metric responds to meaningful changes in the assistant turn. This is the difference between a measurement and a vibes counter with a table attached. The authors use two positive controls. First, they truncate the assistant response. If a model is sensitive to the assistant content, it should be more likely to produce a user turn such as “complete your answer.” Second, they append a generic conversational question such as “What do you think?” If the model tracks the conversational cue, the generated user turn should change, and perhaps become more grounded. The perturbations do move the metric, but not uniformly.
| Test | Likely purpose | What it supports | What it does not prove |
|---|---|---|---|
| Accuracy vs. follow-up comparison | Main evidence | Task competence and interaction awareness diverge. | It does not prove follow-up rate predicts business ROI. |
| Qwen3.5 size sweep | Within-family comparison | Bigger models do not monotonically improve follow-up behavior. | It does not isolate every training-data difference inside the family. |
| Temperature sweep | Sensitivity / exploratory extension | Some interaction behavior is latent in the distribution. | It does not imply high temperature is safe for deployment. |
| Truncation perturbation | Positive control | The metric responds when assistant output is visibly incomplete. | It does not prove the generated follow-up matches real users. |
| Explicit-question perturbation | Positive control | Some models attend to conversational cues in the assistant turn. | It does not mean they have robust user modeling across workflows. |
| Collaboration-oriented post-training | Exploratory intervention | Follow-up behavior can be moved by training aimed at multi-turn collaboration. | It is preliminary and tested mainly on Qwen3.5-2B. |
| LLM-judge validation | Measurement validation | The binary genuine-follow-up metric has human-aligned support. | The full label taxonomy is less strongly validated than the binary split. |
| The truncation results are striking. On GSM8K, GLM-4.7 rises from 1.0% to 55.0% genuine follow-up when the answer is truncated. gpt-oss-120B rises from 0.0% to 24.2%. On GPQA Diamond, gpt-oss-120B rises from 20.7% to 65.7%, and GLM-4.7 from 2.0% to 39.4%. | |||
| Qwen models are much less sensitive. Qwen3.5-27B stays flat on GSM8K and GPQA Diamond under truncation. This supports the paper’s interpretation that different model families fail differently. Qwen often restates the original prompt, as if the assistant answer did not happen. gpt-oss and GLM more often attend to the assistant turn but fail to convert that attention into a user-appropriate response. | |||
| The explicit-question perturbation shows a related pattern. On IFBench, gpt-oss-120B moves from 1.3% to 25.7% genuine follow-up, with 99.0% of user turns changing text. GLM-4.7 also changes 99.0% of its user turns, but genuine follow-up barely changes, from 4.7% to 5.0%. That is a nice reminder that sensitivity is not the same as competence. A model can notice that something changed and still respond uselessly. Many organizations have meetings like this. |
Post-training helps, but the trade-off is real
The post-training experiment asks whether interaction awareness can be improved without directly training the model to generate user turns. The authors apply a collaboration-oriented training recipe to Qwen3.5-2B, using multi-turn MATH-derived examples and assistant-side objectives. They test both supervised fine-tuning and reinforcement learning. The result is encouraging but not clean enough to oversell.
| Qwen3.5-2B variant | GSM8K accuracy | IFBench follow-up | GPQA Diamond follow-up | HealthBench follow-up | Coval follow-up |
|---|---|---|---|---|---|
| Base | 62.9% | 1.0% | 2.0% | 36.7% | 19.4% |
| SFT | 40.3% | 48.0% | 46.0% | 54.4% | 45.2% |
| RL | 67.4% | 10.0% | 9.1% | 46.5% | 29.0% |
| Supervised fine-tuning produces large follow-up gains, but GSM8K accuracy falls from 62.9% to 40.3%. The authors interpret this as likely forgetting or overfitting to the multi-turn training examples. Reinforcement learning preserves or slightly improves GSM8K accuracy while giving more modest follow-up gains. | |||||
| The important point is not that one recipe wins forever. The important point is that follow-up behavior moves when the training objective creates pressure for multi-turn collaboration. That supports the paper’s broader claim: interaction awareness is not just a mysterious emergent property that appears when parameter count becomes spiritually sufficient. It is trainable, fragile, and shaped by incentives. | |||||
| For companies, that changes how the problem should be framed. The question is not simply “Which model has the best benchmark score?” The better question is “Which model and training setup produce the conversational consequences our workflow requires?” If the workflow depends on users clarifying, objecting, negotiating, correcting, or revising, then those reactions should appear somewhere in evaluation and training. Otherwise the system is optimized for a theatrical first turn. |
The business issue is not politeness; it is false simulation
It is tempting to read this paper as another argument for better chatbots. That is too narrow. The more serious business implication concerns evaluation loops that rely on synthetic users. Consider four common patterns.
| Business use case | Hidden assumption | Risk exposed by the paper | Better diagnostic |
|---|---|---|---|
| Customer-support testing | The model can simulate confused or dissatisfied users. | Synthetic users may be too repetitive, too cooperative, or simply nongenuine. | Test generated user turns against real ticket follow-ups and escalation patterns. |
| Sales copilot training | The model can play a prospect realistically. | It may generate generic objections rather than grounded reactions to the pitch. | Score next-turn specificity: price concern, product-fit challenge, timing objection, stakeholder issue. |
| Self-play for agent improvement | The assistant model can provide useful counterparty behavior. | The model may reward itself with unrealistic interaction trajectories. Convenient, yes. Scientific, less so. | Include genuine-follow-up rate before using self-play data for training. |
| Multi-agent workflow simulation | Agents can anticipate each other’s responses. | Role drift and prompt restatement can create fake coordination signals. | Audit role consistency and grounded reaction quality between turns. |
| The paper directly shows that models differ in genuine-follow-up behavior and that this behavior is not predicted by first-turn accuracy. Cognaptus’ inference is that businesses should treat next-turn quality as a separate evaluation dimension when conversation dynamics affect value. | |||
| This is not a call to replace all evaluation with user-turn generation. It is a call to add one missing measurement. For many products, the costliest failures happen after the first answer: the clarification that goes nowhere, the complaint that is mishandled, the sales objection that is answered with a brochure, the agent-to-agent handoff that collapses into duplicated planning text. The paper gives teams a relatively cheap probe for whether a model is likely to fail in exactly that zone. |
What to change in an AI evaluation pipeline
A practical evaluation pipeline should separate three layers: answer quality, next-turn realism, and downstream workflow outcome. Blending them into one model score is tidy, but tidiness is not the same as knowledge. A more useful evaluation structure would look like this:
| Evaluation layer | Question | Example metric | Why it belongs separately |
|---|---|---|---|
| Assistant answer quality | Did the model solve the immediate task? | Accuracy, rubric score, factuality, constraint compliance | Necessary but not sufficient. |
| User-turn realism | Does the model anticipate a grounded user reaction? | Genuine-follow-up rate, role-consistency rate, specificity of critique | Measures conversational consequence, not answer correctness. |
| Workflow outcome | Does the interaction reach the desired result? | Resolution rate, escalation reduction, conversion quality, human correction load | Captures product value and operational cost. |
| For a customer-support agent, the user-turn realism test should be built from real transcripts: remove the actual final user message, ask the model to generate it, and judge whether the generated turn is grounded in the prior exchange. For a sales copilot, compare generated objections with real objections by segment, deal stage, and product line. For internal enterprise copilots, test whether the generated follow-up correctly identifies missing permissions, ambiguous data, or workflow constraints. | |||
| The important design choice is domain grounding. A generic follow-up such as “Can you clarify?” is not enough. A useful generated user turn should react to something specific: a missing legal clause, a wrong SKU, a budget constraint, a compliance limitation, a contradictory chart, a tone mismatch. Interaction awareness is valuable only when it is attached to the actual workflow. |
Boundaries: what the paper does not prove
The paper is careful about its boundaries, and the business interpretation should be equally disciplined. First, the probe measures generated behavior, not hidden representation. A model might internally encode useful conversational structure but fail to express it under the user role because training and chat-template conventions push the modal output elsewhere. For deployment, behavior matters. For interpretability, this leaves room for deeper representation-level work. Second, the genuine-follow-up metric does not require matching the actual human next turn. It asks whether the generated user turn is grounded and plausible. That is useful for evaluation, but it is not the same as predicting a specific customer’s next message. Third, the evidence is mostly from open-weight models, English datasets, benchmark tasks, and two held-out conversational domains. Generalization to multilingual customer service, code-review workflows, legal drafting, medical triage, enterprise procurement, or long-horizon multi-agent planning remains unproven. Fourth, even the strongest reported follow-up rates are not majority behavior in many settings. A rise from near zero to 30% is scientifically meaningful and operationally incomplete. If your system needs reliable user simulation, one good follow-up in three is not a dependable staff member. It is a promising intern with a clipboard. Fifth, the post-training result is preliminary. SFT improved follow-up dramatically but damaged GSM8K accuracy; RL preserved accuracy but delivered smaller gains. That is exactly the kind of trade-off product teams should expect when moving from benchmark optimization to interaction optimization.
The useful question after the answer
The paper’s best idea is not complicated: after the model answers, ask what a user would say next. The simplicity is the point. It inserts a missing turn into evaluation. The old evaluation habit asks whether the assistant’s answer is correct. The better habit asks a paired question: correct for what conversational future? In a one-off benchmark, there is no future. In a business system, there is always a next turn, even if it appears as a frustrated user, an escalation ticket, a failed conversion, or a silent abandonment event in the analytics dashboard. The misconception to retire is that high assistant accuracy automatically implies a realistic user simulator or a useful multi-agent partner. This paper does not say current models are useless in interactive systems. It says that one major ingredient of interaction has been under-measured, and that the missing ingredient does not reliably arrive through scale. That is a useful correction. The next generation of AI systems will not be judged only by whether they can answer. They will be judged by whether they can survive the consequences of their own answers.
-
Sarath Shekkizhar, Romain Cosentino, and Adam Earle, “Beyond the Assistant Turn: User Turn Generation as a Probe of Interaction Awareness in Language Models,” arXiv:2604.02315v2, 2026. https://arxiv.org/abs/2604.02315 Cognaptus: Automate the Present, Incubate the Future. ↩︎