Opening — Why this matters now
Everyone is chasing longer context windows. Million-token prompts. Endless chat logs. The assumption is simple: if the model can see everything, it will remember correctly.
This paper shows why that assumption fails.
In long-horizon, goal-driven interactions, errors rarely come from missing information. They come from retrieving the wrong information—facts that are semantically similar but contextually incompatible. Bigger windows amplify the problem. Noise scales faster than relevance.
Background — Context is not content
Most agent memory systems still treat history as text to be compressed, embedded, or summarized. Whether it’s vector search, hierarchical summaries, or graph-based recall, the retrieval signal is largely semantic similarity.
That works when:
- Topics are cleanly separated
- Queries immediately follow relevant context
- Entities appear once, with stable meaning
Real agent workflows violate all three.
A hotel price depends on the day. An argument depends on which side said it. A plan depends on which goal was active at the time. Semantic similarity alone cannot tell these apart.
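A toy illustration of that collision (hypothetical snippets, with simple token overlap standing in for embedding similarity): the disambiguating fact lives in the surrounding goal, not in the sentence itself, so the scorer ranks both candidates identically.

```python
# Hypothetical memory snippets: the day is only implicit in the goal that was
# active when each sentence was produced, so a similarity scorer sees two
# near-identical hits and has no basis for choosing between them.

def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity, standing in for an embedding score."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

memory = [
    "The hotel near the station is $180 per night.",  # said while planning Day 1
    "The hotel near the station is $240 per night.",  # said while planning Day 2
]
query = "What was the nightly price of the hotel in the Day 2 plan?"

for snippet in memory:
    print(f"{jaccard(query, snippet):.2f}  {snippet}")
# Both snippets score identically: the signal that distinguishes them
# (which day's goal was active) never appears in the text being compared.
```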
The authors frame this correctly: retrieval failure is a cue problem, not a storage problem.
Analysis — What STITCH actually changes
STITCH (Structured Intent Tracking in Contextual History) reframes memory indexing around contextual intent rather than text.
Each step in an agent’s trajectory is annotated online with three signals:
- Thematic scope — the latent goal episode (e.g., “Day 2 itinerary”, “Model optimization”). This persists across non-adjacent turns.
- Event type — the kind of action being performed (compare, decide, rebut, inquire).
- Key entity types — what kind of details matter here (price vs rating, metric vs hyperparameter).
Together, these form a structured retrieval cue that answers a simple question:
Under what intent was this information produced?
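A minimal sketch of what such an annotation could look like as a data structure. The field names and schema here are illustrative assumptions, not the paper's actual format:

```python
from dataclasses import dataclass, field

@dataclass
class IntentCue:
    """Structured retrieval cue attached to one step of the trajectory."""
    thematic_scope: str            # latent goal episode, e.g. "Day 2 itinerary"
    event_type: str                # kind of action, e.g. "compare", "decide"
    entity_types: set[str] = field(default_factory=set)  # e.g. {"price", "rating"}

@dataclass
class MemoryEntry:
    text: str                      # the raw snippet from the trajectory
    cue: IntentCue                 # under what intent it was produced

entry = MemoryEntry(
    text="The hotel near the station is $240 per night.",
    cue=IntentCue(thematic_scope="Day 2 itinerary",
                  event_type="compare",
                  entity_types={"price"}),
)
```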
At retrieval time, the query is mapped into the same intent space. Memory snippets are filtered and ranked by structural compatibility first, semantic similarity second.
This ordering matters. It prevents “correct but wrong-context” facts from even entering the candidate set.
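A sketch of that two-stage ordering: structural compatibility acts as a hard gate, and semantic similarity only ranks whatever survives it. The compatibility rule below (matching goal scope plus overlapping entity types) is an assumption for illustration, not the paper's exact scoring.

```python
def compatible(query_cue: dict, entry_cue: dict) -> bool:
    """Hard structural gate: same goal episode and overlapping entity types."""
    return (entry_cue["scope"] == query_cue["scope"]
            and bool(entry_cue["entities"] & query_cue["entities"]))

def token_overlap(a: str, b: str) -> float:
    """Token-overlap similarity, standing in for an embedding score."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

def retrieve(query: str, query_cue: dict, memory: list[dict], k: int = 3) -> list[dict]:
    # 1) Filter: only entries produced under a compatible intent enter the pool.
    candidates = [m for m in memory if compatible(query_cue, m["cue"])]
    # 2) Rank: semantic similarity orders candidates only within that pool.
    return sorted(candidates,
                  key=lambda m: token_overlap(query, m["text"]),
                  reverse=True)[:k]

memory = [
    {"text": "The hotel near the station is $180 per night.",
     "cue": {"scope": "Day 1 itinerary", "entities": {"price"}}},
    {"text": "The hotel near the station is $240 per night.",
     "cue": {"scope": "Day 2 itinerary", "entities": {"price"}}},
]
hits = retrieve("nightly price of the hotel",
                {"scope": "Day 2 itinerary", "entities": {"price"}},
                memory)
print(hits[0]["text"])  # only the Day 2 entry survives the structural gate
```

The Day 1 snippet never enters the candidate set, which is exactly the point: a "correct but wrong-context" fact cannot outrank the right one if it is excluded before ranking begins.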
Why this is different from prior work
| Approach | What it optimizes | Where it breaks |
|---|---|---|
| Long-context LLMs | Raw visibility | Lost-in-the-middle, cost, noise |
| Embedding RAG | Semantic proximity | Context collisions |
| Hierarchical summaries | Compression | Goal drift, detail loss |
| Knowledge graphs | Entity relations | Missing episodic intent |
| STITCH | Intent alignment | Ingestion cost, schema evolution |
STITCH doesn’t summarize harder or retrieve faster. It retrieves more cautiously.
Findings — The benchmark exposes the failure mode
The paper introduces CAME-Bench, a benchmark explicitly designed to punish context-blind retrieval.
Key properties:
- Interleaved goals
- Repeated entities under different constraints
- Deferred questions
- Non-turn-taking interaction structure
Across both CAME-Bench and LongMemEval, STITCH dominates as trajectories grow longer.
On the largest CAME-Bench subset:
- The best baseline collapses
- STITCH improves Macro-F1 by 35.6% absolute
Ablation results are revealing:
- Removing thematic scope causes the largest performance drop
- Removing coreference resolution breaks everything quietly
- Event types help precision but can hurt synthesis if too fine-grained
This is not a modeling trick. It’s an information architecture result.
Implications — What this means for agent builders
Three uncomfortable takeaways:
- **Long context is not memory.** Without intent-aware indexing, more tokens mean more interference.
- **Recall must be conditional.** Facts are not atomic. They are valid only under the goals and actions that produced them.
- **Memory systems need structure before scale.** Scaling ingestion without fixing retrieval cues just accelerates failure.
For production agents—research copilots, planning assistants, autonomous tools—this suggests a shift:
Stop asking “How much history can we store?” Start asking “Under what intent was this history created?”
Conclusion — Memory, finally grounded
STITCH is not flashy. It adds overhead. It introduces schemas that evolve over time. It refuses to treat memory as flat text.
That restraint is precisely why it works.
By grounding memory in contextual intent, the system retrieves less—and answers better. In long-horizon reasoning, that trade-off is not optional. It is the difference between coherence and collapse.
Cognaptus: Automate the Present, Incubate the Future.