Opening — Why this matters now
We’ve spent the last two years obsessing over how well AI answers questions.
Accuracy benchmarks. Reasoning benchmarks. Coding benchmarks. Leaderboards everywhere.
And yet, in production environments—customer support bots, copilots, multi-agent systems—failure rarely comes from wrong answers. It comes from awkward, brittle, or downright bizarre interactions.
The uncomfortable truth: today’s best models can solve problems but still don’t understand conversations.
This paper introduces a subtle but consequential shift in perspective: instead of evaluating what the assistant says, evaluate what the user would say next.
That’s where things start to unravel.
Background — The blind spot in LLM evaluation
Modern LLM evaluation is fundamentally one-sided.
You give a prompt → the model produces an answer → you score it.
This paradigm assumes that solving the task is equivalent to participating in a conversation. It isn’t.
The paper highlights a critical omission: current benchmarks ignore whether a model has any awareness of the consequences of its own responses.
In real interactions, responses trigger reactions:
- Clarifications
- Corrections
- Follow-up requests
- Frustration (often silent, occasionally expensive)
Yet standard evaluation pipelines stop before this moment ever happens.
The result? Models that look brilliant in isolation but behave like socially oblivious interns in production.
Analysis — Measuring “interaction awareness”
The authors propose a deceptively simple probe:
After generating an assistant response, ask the same model to generate the next user turn.
If the model understands interaction dynamics, it should produce a grounded follow-up—a response that meaningfully reacts to the assistant’s output.
If not, you get failure modes like:
- Repeating the original prompt
- Continuing as the assistant (identity confusion)
- Producing meta-reasoning (“Here’s how I would think…”)
- Generic filler (“What do you think?”)
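The probe and its failure buckets can be sketched in a few lines. This is a toy illustration, not the paper's implementation: the string heuristics below are assumptions, and in practice the "next user turn" would come from a chat-completion call.

```python
def classify_user_turn(prompt: str, assistant_reply: str, user_turn: str) -> str:
    """Bucket a model-generated 'next user turn' into the failure modes above.
    The heuristics are illustrative stand-ins for real annotation."""
    turn = user_turn.strip().lower()
    if turn == prompt.strip().lower():
        return "repeats_prompt"            # echoes the original question
    if turn.startswith(("as an assistant", "sure, i can help")):
        return "identity_confusion"        # keeps speaking as the assistant
    if turn.startswith(("here's how i would think", "step 1:")):
        return "meta_reasoning"            # reasons about the task instead of reacting
    if turn in {"what do you think?", "thanks!", "ok"}:
        return "generic_filler"            # contentless continuation
    return "genuine_followup"              # actually reacts to the reply

print(classify_user_turn("What is 2+2?", "4.", "What is 2+2?"))         # repeats_prompt
print(classify_user_turn("What is 2+2?", "4.", "And what about 3+3?"))  # genuine_followup
```

The point is the shape of the probe: the classifier only sees the generated user turn, so a model with no interaction awareness has nowhere to hide.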
The Core Metric
They define a key measure:
| Metric | Meaning |
|---|---|
| Genuine Follow-up Rate | % of user turns that meaningfully respond to the assistant output |
This becomes a proxy for what they call interaction awareness—the model’s ability to anticipate how a user would react.
Crucially, this is not about simulating users. It’s about probing what the model already encodes.
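Given labeled user turns, the metric itself is a simple proportion. A minimal sketch (the label names are assumptions, not the paper's exact taxonomy):

```python
def genuine_followup_rate(labels: list[str]) -> float:
    """Fraction of generated user turns labeled as genuine follow-ups."""
    if not labels:
        return 0.0
    return labels.count("genuine_followup") / len(labels)

labels = ["genuine_followup", "repeats_prompt", "generic_filler", "genuine_followup"]
print(genuine_followup_rate(labels))  # 0.5
```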
Findings — When smart models act clueless
The results are… mildly embarrassing for the state of the art.
1. Accuracy ≠ Interaction Awareness
Across multiple model families and benchmarks:
| Model Scale | Task Accuracy (GSM8K) | Follow-up Rate (T=0) |
|---|---|---|
| Small (~0.8B) | ~41% | ~0% |
| Large (~397B) | ~96% | ~0% |
Despite massive gains in reasoning accuracy, follow-up quality remains near zero under deterministic generation.
In other words: models can solve the problem but have no idea what happens next.
2. Bigger models don’t fix it
Within the same model family, scaling does not improve interaction awareness.
Mid-sized models often match or outperform larger ones in generating meaningful follow-ups.
This breaks a deeply held assumption:
More parameters ≠ better conversational intelligence
3. The capability is latent—but suppressed
When sampling temperature increases, follow-up rates rise significantly:
| Temperature | Follow-up Rate (Example Range) |
|---|---|
| 0.0 (greedy) | ~0% |
| 0.7 | ~15–30% |
| 1.0 | up to ~40% |
This suggests something subtle:
Models can anticipate user reactions—but their training suppresses it at the most likely output.
A polite way of saying: we trained them to ignore the conversation.
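The mechanism is consistent with plain temperature scaling of the output distribution: dividing logits by T before the softmax flattens the distribution, so continuations that are not the single most likely one (here, a hypothetical "genuine follow-up" continuation) gain probability mass. A toy illustration with made-up logits:

```python
import math

def softmax(logits: list[float], temperature: float) -> list[float]:
    """Temperature-scaled softmax: p_i = exp(l_i / T) / sum_j exp(l_j / T)."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                          # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

# Made-up logits: index 0 = degenerate continuation (most likely under greedy
# decoding), index 1 = genuine follow-up (less likely, but present).
logits = [3.0, 1.0]

for t in (0.2, 0.7, 1.0):
    p_genuine = softmax(logits, t)[1]
    print(f"T={t}: P(genuine follow-up) = {p_genuine:.3f}")
```

Under greedy decoding (T approaching 0) index 0 is always chosen, so the genuine follow-up never surfaces even though the model assigns it nonzero probability; raising T lets it through.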
4. Training interventions can recover it
Using collaboration-oriented training (multi-turn rewards), the authors improve follow-up rates without explicitly training for user-turn generation.
That’s important.
It implies interaction awareness is not a separate capability, but a byproduct of how models are trained.
5. Failure modes are systematic
Different model families fail differently:
| Model Family | Typical Failure |
|---|---|
| Qwen | Restates prompt |
| GPT-OSS | Leaks internal reasoning |
| GLM | Generates planning text |
This isn’t random noise. It’s a fingerprint of training data and objectives.
Visualization — The hidden gap
The paper’s most important insight can be summarized simply:
| Dimension | What we measure today | What we miss |
|---|---|---|
| Task solving | Accuracy, benchmarks | ✔️ Well covered |
| Interaction dynamics | User reaction, follow-up | ❌ Largely ignored |
| Deployment readiness | Multi-turn robustness | ❌ Underestimated |
This gap explains why models that dominate leaderboards still fail in real workflows.
Implications — Why this changes how we build AI
1. Assistant-only evaluation is incomplete
If your system only measures answer quality, you’re optimizing for a world that doesn’t exist.
Real systems are iterative, not transactional.
2. Multi-agent and self-play systems are at risk
Many architectures assume models can simulate users or collaborate effectively.
This research suggests otherwise.
Without interaction awareness, self-play becomes:
- Unrealistic
- Overly cooperative
- Misleading in evaluation
3. Training objectives need to evolve
Single-turn optimization (RLHF, SFT) prioritizes immediate response quality but ignores downstream consequences.
Future training loops will likely incorporate:
- Multi-turn rewards
- User-state modeling
- Outcome-based optimization
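One way to read "multi-turn rewards" (a hedged sketch, not the paper's training objective; the weighting and both scores are illustrative): score a response by its immediate quality plus a weighted term for how well the conversation continues afterwards.

```python
def multi_turn_reward(answer_score: float, followup_score: float,
                      lam: float = 0.5) -> float:
    """Combine immediate answer quality with an interaction term.

    answer_score:   quality of the assistant reply in isolation (0-1)
    followup_score: how grounded the predicted next user turn is (0-1)
    lam:            weight on the interaction term (illustrative choice)
    """
    return answer_score + lam * followup_score

# A correct answer that leaves the user stranded scores lower than one
# that also anticipates a meaningful next turn.
print(multi_turn_reward(0.9, 0.0))  # 0.9
print(multi_turn_reward(0.9, 0.8))  # 1.3
```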
4. Product design should adapt immediately
You don’t need new models to act on this.
Practical steps:
- Inject structured follow-up prompts
- Add verification turns
- Simulate user reactions during testing
- Monitor “conversation breakdown” metrics
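As one concrete example of a "conversation breakdown" metric (a toy heuristic; the signal definition is an assumption, not an established standard): flag sessions where the user repeats the same request, a common symptom of the model not registering the user's reaction.

```python
def breakdown_rate(sessions: list[list[str]]) -> float:
    """Fraction of sessions where a user turn repeats verbatim,
    a crude proxy for 'the assistant did not register my reaction'."""
    def broke_down(user_turns: list[str]) -> bool:
        seen = set()
        for turn in user_turns:
            key = turn.strip().lower()
            if key in seen:
                return True
            seen.add(key)
        return False

    if not sessions:
        return 0.0
    return sum(broke_down(s) for s in sessions) / len(sessions)

sessions = [
    ["reset my password", "reset my password"],    # repeated request -> breakdown
    ["reset my password", "thanks, that worked"],  # healthy session
]
print(breakdown_rate(sessions))  # 0.5
```

In a real deployment this signal would be one of several (sentiment shifts, abandonment, escalation to a human), but even a crude counter surfaces the failure class this paper describes.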
Because the failure isn’t rare—it’s systemic.
Conclusion — The next frontier isn’t answers, it’s reactions
We’ve been asking the wrong question.
Not:
“Can the model solve the task?”
But:
“Does the model understand what happens after it solves the task?”
This paper makes it clear: current LLMs don’t—at least not reliably.
They are brilliant responders, but mediocre conversationalists.
And until that changes, every production system built on them will inherit the same flaw:
They’ll know what to say—but not what it leads to.
Cognaptus: Automate the Present, Incubate the Future.