A support agent keeps asking the same diagnostic question after the customer has already answered it. A research agent revisits the same failed source path with slightly different wording. A workflow agent tries the same invalid action again because, apparently, the best evidence for what to do next is what it just did badly.

This is usually described as a memory problem. Give the agent more context. Keep the full transcript. Add summarization. Build a bigger recall layer. The standard instinct is simple: if the agent fails over time, it must be losing information.

The paper behind today’s article argues almost the opposite. In “Mitigating Conversational Inertia in Multi-Turn Agents,” Yang Wan and coauthors identify a failure mode where multi-turn agents degrade not because they forget too much, but because they remember themselves too faithfully.[1] The model treats its own earlier actions as examples. Those examples may be mediocre, stale, or flatly wrong. The model does not care. It has seen a pattern. Large language models do love a pattern. Sometimes a little too much.

The authors call this failure mode conversational inertia: an agent’s tendency to imitate its own previous responses across turns, producing behavioral rigidity instead of adaptive exploration. The useful part of the paper is not merely the label. The useful part is that the authors connect this behavior to a measurable attention pattern, then test two ways of breaking it: Context Preference Learning, a training-time method, and Clip Context, an inference-time context management strategy.

The uncomfortable implication for business agent design is clear: long context is not automatically institutional memory. Sometimes it is a very expensive echo chamber.

Few-shot learning becomes dangerous when the examples are the agent’s own actions

LLMs are good few-shot learners because they infer patterns from examples in the prompt. In normal prompting, this is convenient. Show the model a few high-quality demonstrations, and it continues the pattern.

A multi-turn agent environment quietly changes the meaning of “examples.” The prompt no longer contains only curated instructions or demonstrations. It also contains the agent’s own prior observations, reasoning traces, and actions. After several rounds, the model has a new set of examples: itself.

That is harmless only if the prior actions are consistently useful. In real agent tasks, they are not. Agents search dead ends, try invalid actions, inspect irrelevant pages, make premature guesses, repeat navigation mistakes, and accumulate awkward reasoning debris. The longer the episode, the more the transcript contains not just environmental feedback but also behavioral residue.

The paper’s key mechanism is that the model does not merely read this residue as history. It partially reuses it as a behavioral template. The current response attends to structurally corresponding positions in previous assistant outputs. The authors describe this as a diagonal attention pattern: the current output token attends disproportionately to tokens around the same relative position in earlier assistant responses.
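To make the diagonal pattern concrete, here is a toy sketch of how one might score it. This is not the paper's metric: the function name `diagonal_mass`, the `band` parameter, and the hand-written matrices are all assumptions for illustration, and no real model is involved.

```python
def diagonal_mass(attn, band=1):
    """Share of attention mass within `band` positions of the same
    relative index in one earlier assistant response.

    attn[i][j] is the weight from token i of the current response to
    token j of a previous response (toy input, not real model output).
    """
    total = sum(sum(row) for row in attn)
    if total == 0:
        return 0.0
    near_diagonal = sum(
        weight
        for i, row in enumerate(attn)
        for j, weight in enumerate(row)
        if abs(i - j) <= band
    )
    return near_diagonal / total


# One matrix spreads attention evenly; the other concentrates it on the diagonal.
uniform = [[0.25] * 4 for _ in range(4)]
copying = [
    [0.85, 0.05, 0.05, 0.05],
    [0.05, 0.85, 0.05, 0.05],
    [0.05, 0.05, 0.85, 0.05],
    [0.05, 0.05, 0.05, 0.85],
]

print(diagonal_mass(uniform, band=0))
print(diagonal_mass(copying, band=0))
```

A high score on the second matrix is the toy analogue of "what did I say at this point last time?" dominating the response.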

That sounds technical, but the business interpretation is simple. The agent is not only asking, “What does the current state require?” It is also asking, “What did I tend to say at this point last time?” Very useful when the past was a gold-standard demonstration. Less charming when the past was a confused assistant repeatedly clicking the wrong button with the dignity of a spreadsheet macro.

The mechanism is not “long context is bad”; it is “unmanaged self-history becomes a template”

The paper is careful not to reduce the result to a lazy anti-context slogan. Context matters. Past observations can contain necessary state. A research agent may need to remember which sources were already checked. A web-navigation agent may need the sequence of previous page states. A customer-service agent may need the user’s earlier constraints.

The problem is the mixture. A full interaction history contains at least four different kinds of information:

| Context component | Usually useful for | Failure risk |
| --- | --- | --- |
| System instructions | Task discipline and role constraints | Gets diluted as transcript grows |
| User observations and environment feedback | Current state and evidence | Can be buried under assistant text |
| Previous assistant actions | Action continuity | Can become self-imitation fuel |
| Previous assistant reasoning | Search memory and rationale | Can preserve wrong assumptions with confidence |

The paper’s attention analysis matters because it separates having history from being controlled by history. The authors report that as context grows, attention to previous assistant outputs increases, while attention to user inputs remains relatively stable. More precisely, the harmful signal is not just generic attention to previous assistant text; it is the diagonal, token-to-token pattern associated with copying prior response structure.

This distinction is important. If the only issue were context length, the solution would be straightforward compression. Summarize everything. Retrieve only relevant chunks. Increase the context window. Buy a bigger model and call it a strategy. The paper instead suggests that some of the degradation comes from the model’s learned habit of pattern continuation. The agent is not drowning in text. It is over-learning from its own transcript.

The evidence map: what each experiment is actually doing

The paper includes main task evaluations, attention analysis, deep research experiments, ablations, efficiency analysis, and appendices on summarization, retrieval, and alternative explanations. These are not all the same kind of evidence. Treating them as one large “the method works” pile would flatten the paper’s argument.

A cleaner reading is this:

| Paper component | Likely purpose | What it supports | What it does not prove |
| --- | --- | --- | --- |
| Eight agent environments | Main evidence | Context strategy and CPL improve average task performance across diverse agent tasks | Universal gains in every production workflow |
| Diagonal attention analysis | Mechanistic evidence | Inertia is associated with selective attention to corresponding positions in prior assistant outputs | Definitive causal proof that diagonal attention alone causes failure |
| BrowseComp-style deep research test | Exploratory extension / external scenario | Clip Context can help in long-horizon information-seeking tasks where retention still matters | That all research agents should delete history aggressively |
| CPL preference-pair ablation | Ablation | Short-vs-long context preference construction matters | That DPO is the only valid training method |
| LoRA and general benchmark checks | Capability preservation / implementation detail | CPL can reduce inertia without obvious degradation on GPQA-Diamond and MMLU-Redux in the reported setup | That no downstream capability will ever be affected |
| Clip hyperparameter tests | Sensitivity analysis | Clip is less fragile than window context across tested settings | That tuning is unnecessary |
| Summarization appendix | Diagnostic comparison | Some summarization gains may come from truncation, while summaries can introduce overconfidence or omit important failed paths | That summarization is useless in all systems |
| StreamingLLM and retrieval baselines | Comparison with prior context methods | Agent-specific round-level clipping can beat token-level or retrieval-only baselines in the tested setup | That retrieval should not be combined with clipping |

This evidence structure is why a mechanism-first article is better than a standard paper summary. The headline is not “new context method improves benchmarks.” The more useful claim is: agent transcripts contain self-demonstrations, and those self-demonstrations can become behavioral gravity.

Context Preference Learning turns context length into a weak supervision signal

The training-time contribution is elegant because it avoids needing expert demonstrations or environment rewards.

For the same environment state, the authors generate two candidate actions:

  • one using a longer context, which tends to carry stronger inertia;
  • one using a shorter context, which tends to carry weaker inertia.

They then construct preference pairs that favor the short-context action over the long-context action. This becomes Context Preference Learning: the model is trained with Direct Preference Optimization so that it prefers lower-inertia behavior.
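The pair construction can be sketched in a few lines. Everything named here is hypothetical: `PreferencePair`, `build_pair`, and `toy_policy` are illustrative stand-ins, the choice to use the long history as the training prompt is an assumption, and skipping identical actions is a design choice of this sketch rather than something the article attributes to the paper.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class PreferencePair:
    prompt: List[str]  # assumed: the long history is the training prompt
    chosen: str        # action sampled with the short context
    rejected: str      # action sampled with the long context


def build_pair(history: List[str],
               generate: Callable[[List[str]], str],
               short_rounds: int = 1) -> Optional[PreferencePair]:
    """Sample one action per context length for the same environment state."""
    chosen = generate(history[-short_rounds:])
    rejected = generate(history)
    if chosen == rejected:
        return None  # identical actions carry no preference signal
    return PreferencePair(prompt=list(history), chosen=chosen, rejected=rejected)


def toy_policy(context: List[str]) -> str:
    # Stand-in for a model call: with enough self-history it copies its last action.
    past_actions = [line for line in context if line.startswith("action:")]
    if len(past_actions) >= 2:
        return past_actions[-1].split(":", 1)[1].strip()
    return "inspect error message"


history = [
    "obs: door locked",
    "action: push door",
    "obs: door locked",
    "action: push door",
    "obs: door still locked",
]
pair = build_pair(history, toy_policy)
print(pair.chosen, "|", pair.rejected)
```

In a real pipeline `generate` would wrap a model call, and the resulting pairs would be written out in whatever format the preference trainer expects.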

The important detail is not just “short context good, long context bad.” The authors use the contrast between short and long contexts as a relative signal. They do not need a human to label the action. They do not need a dense reward function from the environment. They exploit the empirical tendency that, for identical states, longer self-history often increases imitation bias.
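For readers who want to see where that relative signal enters training, here is the standard DPO loss for a single pair, written out as plain arithmetic. The log-probabilities and `beta=0.1` are made-up illustrative values, not settings reported in the article; in the CPL setup the "chosen" response would be the short-context action and the "rejected" one the long-context action.

```python
import math


def dpo_loss(policy_chosen: float, policy_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one pair, given summed sequence log-probabilities."""
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    # -log(sigmoid(beta * margin)) is the same as log(1 + exp(-beta * margin)).
    return math.log1p(math.exp(-beta * margin))


# Policy identical to the reference: the loss sits at log(2).
print(dpo_loss(-5.0, -5.0, -5.0, -5.0))

# Policy has shifted probability toward the chosen action: the loss drops.
print(dpo_loss(-4.0, -6.0, -5.0, -5.0))
```

The loss only cares about how much more the policy prefers the chosen action than the reference does, which is why no absolute reward or human label is needed.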

In the main configuration, the authors use LoRA fine-tuning with rank 16 and alpha 16, updating about 0.4% of parameters. The training data consists of 1,000 preference pairs per environment across eight environments, giving 8,000 pairs in total. That is small by the standards of modern model training, which is exactly why this part is interesting. The goal is not to teach the model how to solve every task from scratch. The goal is to adjust a behavioral preference: when the model is tempted to continue its own stale pattern, prefer a less inertial action.

The reported results support this interpretation. On Qwen3-8B, CPL combined with Clip Context reaches a 72.5 average score across the eight environments, compared with 68.9 for the base model using Clip Context and 64.9 for the base model using Window Context. Under Long Context, CPL improves the Qwen3-8B average from 54.4 to 59.0. That is still worse than clipped or windowed context, and it reveals something useful: training helps, but it does not make unlimited self-history harmless.

That is a practical lesson. Fine-tuning can reduce the agent’s tendency to copy itself. It does not remove the need for context management. Training is not a license to hoard transcripts.

Clip Context is scheduled amnesia, not random forgetting

The inference-time contribution is simpler and probably more immediately useful for builders: Clip Context.

The paper compares three main context strategies:

| Strategy | What it keeps | Main advantage | Main problem |
| --- | --- | --- | --- |
| Long Context | Full conversation history | Maximum information retention | Strong inertia and higher compute cost |
| Window Context | Most recent fixed-size window | Simple recency control | Constant boundary shifting prevents efficient KV-cache reuse |
| Clip Context | Periodically clears history down to a recent core | Breaks inertia while preserving cache-friendly cycles | Can drop information needed for long-range dependencies |

Clip Context works by periodically resetting the accessible history. For example, a configuration like Clip-12to1 lets context grow for a cycle, then clears it down to a small recent core. The point is not to pretend earlier turns never happened in every possible sense. The point is to prevent the model from continuously accumulating assistant-output patterns across the entire episode.
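A minimal sketch of that cycle, under stated assumptions: the class name `ClipContext` is hypothetical, and reading "Clip-12to1" as "grow to 12 rounds, then keep the most recent 1" is this sketch's interpretation rather than a confirmed detail of the paper's implementation.

```python
class ClipContext:
    """Round-level history that grows for a cycle, then clips to a recent core."""

    def __init__(self, max_rounds: int = 12, keep_rounds: int = 1):
        if not 0 <= keep_rounds < max_rounds:
            raise ValueError("keep_rounds must be in [0, max_rounds)")
        self.max_rounds = max_rounds
        self.keep_rounds = keep_rounds
        self.rounds = []

    def add(self, round_text: str) -> list:
        # When the cycle is full, clip to the recent core before appending.
        if len(self.rounds) >= self.max_rounds:
            # Guard keep_rounds == 0: a slice of [-0:] would keep everything.
            self.rounds = self.rounds[-self.keep_rounds:] if self.keep_rounds else []
        self.rounds.append(round_text)
        return list(self.rounds)


# Small cycle so the reset is visible: grow to 4 rounds, clip to 1.
ctx = ClipContext(max_rounds=4, keep_rounds=1)
sizes = [len(ctx.add(f"round {i}")) for i in range(6)]
print(sizes)
print(ctx.rounds)
```

Between resets the visible history only ever gains rounds at the end, which is the property the cache argument relies on.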

This design has two separate benefits that should not be confused.

First, it reduces conversational inertia. By breaking the continuity of self-history, the model has fewer stale assistant patterns to imitate. The attention analysis reports that Clip Context reduces diagonal attention by about 7% for both base and CPL models.

Second, it improves computational efficiency relative to sliding-window context. Window Context constantly shifts its boundary, making prefix cache reuse difficult. Clip Context creates stable within-cycle prefixes, allowing KV-cache reuse until the next reset. The authors report 2–7x prefill speedups over window methods across the tested environments.
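The cache argument can be checked with a toy count of how many rounds at the start of each prompt are unchanged from the previous step. This is a round-level simplification: real KV caches work on tokens, a shared system prompt is ignored here, and the helper names and the 6 / 12-to-6 settings are illustrative assumptions.

```python
def common_prefix_len(a, b):
    """Number of leading items two sequences share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def window_view(rounds, size):
    return rounds[-size:]


def clip_view(rounds, max_rounds, keep_rounds):
    # Replay the clipping rule over the full history to get the visible rounds.
    view = []
    for r in rounds:
        if len(view) >= max_rounds:
            view = view[-keep_rounds:] if keep_rounds else []
        view.append(r)
    return view


def reusable_rounds(view_fn, total_rounds):
    """Return (rounds reusable from the previous prompt, rounds in all prompts)."""
    reused, total, previous = 0, 0, []
    for t in range(1, total_rounds + 1):
        current = view_fn(list(range(t)))
        reused += common_prefix_len(previous, current)
        total += len(current)
        previous = current
    return reused, total


window_reuse = reusable_rounds(lambda r: window_view(r, 6), 24)
clip_reuse = reusable_rounds(lambda r: clip_view(r, 12, 6), 24)
print("window:", window_reuse)
print("clip:  ", clip_reuse)
```

Once the window starts sliding, its first round changes at every step and the reusable prefix falls to zero; the clipped history only pays that cost at each reset.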

The performance numbers are also revealing. For Qwen3-8B, Base + Window averages 64.9 across the eight environments, Base + Long drops to 54.4, Base + Summarization reaches 68.8, and Base + Clip reaches 68.9. With CPL, Qwen3-8B + Clip rises to 72.5. GPT-4o-mini also benefits strongly from Clip Context: Window averages 66.0, Long 64.6, Summarization 67.5, and Clip 71.1.
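The gaps are easier to compare as differences from the Window baseline. The snippet below only does subtraction on the averages quoted above; the dictionary layout and the helper name `gaps_vs` are illustrative.

```python
# Average scores as quoted in this article.
qwen3_8b = {
    "Window": 64.9,
    "Long": 54.4,
    "Summarization": 68.8,
    "Clip": 68.9,
    "CPL + Clip": 72.5,
}
gpt_4o_mini = {
    "Window": 66.0,
    "Long": 64.6,
    "Summarization": 67.5,
    "Clip": 71.1,
}


def gaps_vs(baseline, scores):
    """Point difference of every strategy relative to the named baseline."""
    return {
        name: round(value - scores[baseline], 1)
        for name, value in scores.items()
        if name != baseline
    }


print(gaps_vs("Window", qwen3_8b))
print(gaps_vs("Window", gpt_4o_mini))
```

On these figures, Clip and Summarization are within 0.1 points of each other on Qwen3-8B, while on GPT-4o-mini Clip leads Summarization by 3.6 points.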

So Clip Context is not merely a cheaper approximation of summarization. In these results, it is often the thing summarization was getting credit for.

Summarization is not a magic memory layer; sometimes it is a confident rumor

One of the more useful parts of the paper is its treatment of summarization. Many agent systems use summaries as the obvious compromise: keep the important information, compress the rest, avoid long-context cost. Reasonable. Also dangerous.

The authors’ BrowseComp-style deep research experiment shows why. In the reported table, Clip-12to6 achieves a score of 29.3, outperforming Window-6 at 25.0, Window-9 at 23.4, and Window-12 at 24.2. Sum-12to6 reaches 28.1, similar but slightly lower. The more interesting case is Sum-12to0: it gets 27.7, but its proactive answer rate jumps to 63.7%, while proactive score falls to 37.4. In plain English: the agent answers earlier much more often, but those early answers are less accurate.

That is a very business-relevant failure mode. A bad agent that hesitates is annoying. A bad agent that becomes decisive because a summary made uncertainty sound settled is worse. That is how an internal research assistant turns “we have not verified this supplier’s compliance status” into “supplier looks acceptable,” and then everyone enjoys a governance meeting with fluorescent lighting.

The appendix case study identifies summary errors such as definitive language for uncertain matters, overly absolute claims, missing critical failed-path information, and omission of previous key findings. These are exactly the errors that matter in operational agents. A summary is not only a compression layer. It is also an interpretation layer. Once interpretation enters the context, the actor model may inherit the summarizer’s confidence, omissions, and framing.

This does not mean summarization should be banned. It means summarization should be treated as a lossy decision-support artifact, not as neutral memory. In many business systems, the safer architecture may be:

  1. keep recent objective turns;
  2. clip periodically to break inertia;
  3. store structured state separately;
  4. retrieve specific verified facts when needed;
  5. use summaries only when their failure modes are tested.

The paper’s blunt contribution here is useful: before buying a more elaborate summarization stack, test whether simple clipping gives most of the benefit. Embarrassing, perhaps. But so is paying for complexity that mostly performs strategic forgetting in a nicer suit.

The result is strongest where repetition is expensive

The paper evaluates agents across eight environments, including navigation, web interaction, text-based games, crafting, and puzzle tasks. This matters because conversational inertia is not tied to a single interface. It appears in tasks where an agent must repeatedly observe, act, and adapt.

The paper also reports that models with stronger baseline inertia benefit more. In the appendix comparison, Qwen3-8B shows higher diagonal attention, higher assistant attention, and much higher repeat-last-action rates than Llama3.1-8B-Instruct in selected environments. This helps explain why Qwen3-8B sees larger gains from CPL and Clip than Llama3.1-8B-Instruct.

For business users, this suggests that the question is not “Should every agent use Clip Context?” The better question is: How much does this agent’s task punish repeated stale behavior?

A few examples:

| Business agent type | Inertia risk | Practical test |
| --- | --- | --- |
| Customer support triage | Repeats already-failed diagnostic scripts | Count repeated questions/actions after negative feedback |
| Web research agent | Revisits failed search paths or overcommits to early hypotheses | Track duplicated source paths and premature final answers |
| Compliance workflow agent | Carries forward earlier assumptions despite new evidence | Compare decision changes after contradictory observations |
| Data-cleaning agent | Applies the same transformation after validation failure | Count repeated invalid operations |
| Sales operations assistant | Reuses earlier messaging even after customer objections | Track template repetition after user correction |

The common pattern is not text repetition alone. The damaging pattern is state-insensitive repetition: the environment changes, but the agent’s action policy does not adapt enough.

That distinction matters for monitoring. A support bot repeating “I understand your concern” is stylistically boring, but not necessarily operationally dangerous. An agent repeatedly trying the same invalid refund workflow after receiving a system error is a different creature. The first needs better writing. The second needs inertia control.
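One way to turn that distinction into a number is to count how often a failed action is immediately repeated. This is a sketch with an assumed log shape: the `(action, succeeded)` tuples and the function name are hypothetical, and exact string equality is a crude stand-in for "same action".

```python
def repeat_after_failure_rate(steps):
    """Fraction of failed actions that are followed by the identical action.

    steps: list of (action, succeeded) tuples in episode order.
    A failure on the final step has no follow-up and is not counted.
    """
    failures = 0
    repeats = 0
    for (action, succeeded), (next_action, _) in zip(steps, steps[1:]):
        if not succeeded:
            failures += 1
            if next_action == action:
                repeats += 1
    return repeats / failures if failures else 0.0


episode = [
    ("submit_refund", False),       # system error
    ("submit_refund", False),       # same action, same error
    ("check_order_status", True),   # the agent finally changes course
    ("submit_refund", False),
]
print(repeat_after_failure_rate(episode))
```

Polite filler text never shows up in this metric; only actions repeated against negative feedback do.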

What this paper directly shows, and what Cognaptus would infer for business systems

The paper directly shows three things.

First, multi-turn agents can exhibit diagonal attention to their own previous assistant outputs, and this pattern is associated with imitation bias and performance degradation.

Second, preference learning based on short-vs-long context action pairs can reduce this inertia pattern and improve performance in the tested environments without relying on expert demonstrations or environment rewards.

Third, periodic context clearing can improve both performance and inference efficiency relative to full history and sliding-window baselines, while often matching or exceeding summarization-based methods.

The business inference is narrower but important: context length should be treated as a controllable risk variable, not a default good. For deployed agents, “retain everything” should require justification. So should “summarize everything.” The correct unit of design is not memory size; it is behavior under repeated feedback.

A practical evaluation protocol would look like this:

| Evaluation question | Metric to collect | Why it matters |
| --- | --- | --- |
| Does the agent repeat failed actions? | Repeat action rate after negative observation | Direct behavioral inertia signal |
| Does it revisit failed paths? | Duplicate tool-call or URL-path rate | Research and navigation failure indicator |
| Does clipping improve outcomes? | Full vs Window vs Clip success rate under same task set | Tests whether memory is helping or trapping |
| Does summarization create premature confidence? | Early-answer rate and early-answer accuracy | Separates useful compression from confident compression |
| Does retention length matter by task type? | Performance by clip parameters | Identifies long-range dependency sensitivity |
| Does model choice change inertia? | Repeat-last-action and diagonal-proxy behavior | Some models may need stronger context control |

Not every production system can inspect attention matrices. That is fine. Operational telemetry can still detect the external symptoms: repeated invalid actions, loops, duplicated search paths, unchanged plans after new observations, or escalating confidence without new evidence.
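Two of those symptoms are cheap to compute from ordinary logs. The sketch below assumes hypothetical record shapes (a list of tool-call strings, and episode dicts with `steps_used` and `correct`), and its "early answer" definition is an illustration, not the paper's proactive-answer metric.

```python
def duplicate_rate(calls):
    """Share of tool calls or URL paths that repeat an earlier one."""
    if not calls:
        return 0.0
    return 1 - len(set(calls)) / len(calls)


def early_answer_stats(episodes, step_budget):
    """Return (early-answer rate, accuracy among early answers).

    An episode counts as early if it answered before using its step budget.
    """
    if not episodes:
        return 0.0, 0.0
    early = [e for e in episodes if e["steps_used"] < step_budget]
    rate = len(early) / len(episodes)
    accuracy = sum(e["correct"] for e in early) / len(early) if early else 0.0
    return rate, accuracy


calls = ["search:acme audit", "open:/suppliers", "search:acme audit", "search:acme audit"]
episodes = [
    {"steps_used": 4, "correct": False},
    {"steps_used": 10, "correct": True},
    {"steps_used": 6, "correct": True},
    {"steps_used": 10, "correct": False},
]
print(duplicate_rate(calls))
print(early_answer_stats(episodes, step_budget=10))
```

Tracking the rate and the accuracy separately is the point: a context policy that raises the first while lowering the second is buying decisiveness, not correctness.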

The paper’s mechanistic analysis is valuable because it tells builders where to look. You do not need to prove diagonal attention in your own stack before acting. You can test whether your agent behaves like it is trapped by its own transcript.

The boundary: clipping is not a substitute for real state

The authors are clear about the main limitation: Clip Context involves an information-inertia tradeoff. If a task depends on early information that is not recoverable later, clipping can hurt. This is not a footnote for academic politeness. It is the central deployment boundary.

A procurement agent may need a constraint mentioned at the beginning of the conversation. A legal assistant may need a clause from an earlier document section. A long research workflow may need to remember which hypotheses were rejected and why. If clipping removes that information without storing it elsewhere, the system has merely traded self-imitation for amnesia. A very modern bargain, but still a bargain.

The practical answer is not to choose between full transcript and forgetting. The answer is to separate memory types:

| Memory type | Suggested handling |
| --- | --- |
| Current observation and last few turns | Keep in raw context |
| Stable task constraints | Store as structured state |
| Verified facts | Store in a retrievable evidence table |
| Failed paths | Store as compact negative memory |
| Assistant reasoning traces | Clip aggressively unless audited useful |
| User preferences or policy constraints | Persist outside transient conversation context |

This is where the paper’s result becomes more powerful than its implementation. Clip Context is a strong baseline, not a complete enterprise memory architecture. It shows that raw transcript accumulation is a bad default. It does not eliminate the need for structured state, retrieval, audit logs, or task-specific memory.

The authors also avoid claiming definitive causality between diagonal attention and performance degradation. Their appendix frames diagonal attention as a plausible mechanism supported by converging evidence, not as a fully isolated causal variable. That restraint is correct. Attention analysis can be informative without being a courtroom confession.

A better design rule: memory should earn its place in the prompt

The common agent-building instinct is to keep context until the model’s window protests. This paper suggests a more disciplined rule:

Every piece of history should earn its place in the active prompt.

That rule sounds obvious. Most good engineering rules do. Then everyone ignores them because a bigger context window is easier to explain in a product demo.

A better agent architecture would use three layers:

  1. Active context: recent turns and current observations, managed with clipping or another inertia-aware policy.
  2. Structured state: task constraints, verified facts, completed subtasks, failed actions, and unresolved questions.
  3. External memory or retrieval: older evidence that can be pulled back only when relevant.

The active context should stay behaviorally fresh. Structured state should preserve what must not be forgotten. Retrieval should support long-range dependencies without forcing the model to continuously stare at its own old reasoning.
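The three layers can be sketched as one small data structure. This is an illustration under assumptions, not an architecture from the paper: `AgentMemory`, its field names, the clip settings, and the keyword-match "retrieval" are all hypothetical stand-ins.

```python
from dataclasses import dataclass, field


@dataclass
class AgentMemory:
    max_rounds: int = 12
    keep_rounds: int = 2
    active: list = field(default_factory=list)          # layer 1: clipped raw rounds
    constraints: list = field(default_factory=list)     # layer 2: structured state
    failed_actions: list = field(default_factory=list)  # layer 2: negative memory
    evidence: dict = field(default_factory=dict)        # layer 3: key -> verified fact

    def record_round(self, observation, action, failed=False):
        # Failed actions persist in structured state, so clipping cannot erase them.
        if failed and action not in self.failed_actions:
            self.failed_actions.append(action)
        if len(self.active) >= self.max_rounds:
            self.active = self.active[-self.keep_rounds:] if self.keep_rounds else []
        self.active.append(f"obs: {observation}\nact: {action}")

    def build_prompt(self, query_terms=()):
        # Keyword match is a placeholder for a real retrieval step.
        retrieved = [fact for key, fact in self.evidence.items()
                     if any(term in key for term in query_terms)]
        sections = [
            "Constraints:\n" + "\n".join(self.constraints),
            "Do not retry:\n" + "\n".join(self.failed_actions),
            "Relevant evidence:\n" + "\n".join(retrieved),
            "Recent rounds:\n" + "\n".join(self.active),
        ]
        return "\n\n".join(sections)


memory = AgentMemory(max_rounds=2, keep_rounds=1, constraints=["budget under 5000"])
memory.evidence["supplier compliance"] = "Compliance status not yet verified."
memory.record_round("quote page 404", "open /quotes", failed=True)
memory.record_round("search results listed", "search supplier")
memory.record_round("supplier page loaded", "open supplier page")
prompt = memory.build_prompt(query_terms=("compliance",))
print(prompt)
```

After the third round the raw text of the failed first round is gone from the active context, but the constraint, the failed action, and the retrieved fact still reach the prompt.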

That is the real business takeaway. The goal is not smaller memory. The goal is cleaner memory boundaries.

Conclusion: agents need memory hygiene, not memory maximalism

Conversational inertia is a useful name for a familiar embarrassment: an agent that slowly becomes less responsive to the world and more obedient to its own previous behavior.

The paper’s contribution is to make that embarrassment measurable and actionable. It identifies a diagonal attention pattern associated with self-imitation. It shows that short-vs-long context differences can train a model away from inertial behavior without expert labels. It shows that periodic context clearing can outperform sliding windows and compete with summarization while also improving cache efficiency. And it reminds us that summarization can be a compression layer, but also a confidence laundering machine if left untested.

For business builders, the lesson is not “delete context.” The lesson is sharper: do not confuse transcript accumulation with intelligence. An agent’s memory should support adaptation, not preserve every previous stumble as a sacred demonstration.

Sometimes the best way to make an agent think better is to stop letting it copy the intern who lived five seconds ago.

Cognaptus: Automate the Present, Incubate the Future.


  1. Yang Wan, Zheng Cao, Zhenhao Zhang, Zhengwen Zeng, Shuheng Shen, Changhua Meng, and Linchao Zhu, “Mitigating Conversational Inertia in Multi-Turn Agents,” arXiv:2602.03664v3, 2026, https://arxiv.org/abs/2602.03664