Opening — Why this matters now
Everyone is chasing longer context windows. Million-token prompts. Endless chat logs. The assumption is simple: if the model can see everything, it will remember correctly.
This paper shows why that assumption fails.
In long-horizon, goal-driven interactions, errors rarely come from missing information. They come from retrieving the wrong information—facts that are semantically similar but contextually incompatible. Bigger windows amplify the problem. Noise scales faster than relevance.
Background — Context is not content
Most agent memory systems still treat history as text to be compressed, embedded, or summarized. Whether it’s vector search, hierarchical summaries, or graph-based recall, the retrieval signal is largely semantic similarity.
That works when:
- Topics are cleanly separated
- Queries immediately follow relevant context
- Entities appear once, with stable meaning
Real agent workflows violate all three.
A hotel price depends on the day. An argument depends on which side said it. A plan depends on which goal was active at the time. Semantic similarity alone cannot tell these apart.
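A toy illustration of that collision (hypothetical snippets, with simple token overlap standing in for embedding similarity): the disambiguating fact lives in the surrounding goal, not in the sentence itself, so the scorer ranks both candidates identically.

```python
# Hypothetical memory snippets: the day is only implicit in the goal that was
# active when each sentence was produced, so a similarity scorer sees two
# near-identical hits and has no basis for choosing between them.

def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity, standing in for an embedding score."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

memory = [
    "The hotel near the station is $180 per night.",  # said while planning Day 1
    "The hotel near the station is $240 per night.",  # said while planning Day 2
]
query = "What was the nightly price of the hotel in the Day 2 plan?"

for snippet in memory:
    print(f"{jaccard(query, snippet):.2f}  {snippet}")
# Both snippets score identically: the signal that distinguishes them
# (which day's goal was active) never appears in the text being compared.
```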
The authors frame this correctly: retrieval failure is a cue problem, not a storage problem.
Analysis — What STITCH actually changes
STITCH (Structured Intent Tracking in Contextual History) reframes memory indexing around contextual intent rather than text.
Each step in an agent’s trajectory is annotated online with three signals:
- Thematic scope — the latent goal episode (e.g., “Day 2 itinerary”, “Model optimization”). This persists across non-adjacent turns.
- Event type — the kind of action being performed (compare, decide, rebut, inquire).
- Key entity types — what kind of details matter here (price vs rating, metric vs hyperparameter).
Together, these form a structured retrieval cue that answers a simple question:
Under what intent was this information produced?
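A minimal sketch of what such an annotation could look like as a data structure. The field names and schema here are illustrative assumptions, not the paper's actual format:

```python
from dataclasses import dataclass, field

@dataclass
class IntentCue:
    """Structured retrieval cue attached to one step of the trajectory."""
    thematic_scope: str            # latent goal episode, e.g. "Day 2 itinerary"
    event_type: str                # kind of action, e.g. "compare", "decide"
    entity_types: set[str] = field(default_factory=set)  # e.g. {"price", "rating"}

@dataclass
class MemoryEntry:
    text: str                      # the raw snippet from the trajectory
    cue: IntentCue                 # under what intent it was produced

entry = MemoryEntry(
    text="The hotel near the station is $240 per night.",
    cue=IntentCue(thematic_scope="Day 2 itinerary",
                  event_type="compare",
                  entity_types={"price"}),
)
```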
At retrieval time, the query is mapped into the same intent space. Memory snippets are filtered and ranked by structural compatibility first, semantic similarity second.
This ordering matters. It prevents “correct but wrong-context” facts from even entering the candidate set.
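A sketch of that two-stage ordering: structural compatibility acts as a hard gate, and semantic similarity only ranks whatever survives it. The compatibility rule below (matching goal scope plus overlapping entity types) is an assumption for illustration, not the paper's exact scoring.

```python
def compatible(query_cue: dict, entry_cue: dict) -> bool:
    """Hard structural gate: same goal episode and overlapping entity types."""
    return (entry_cue["scope"] == query_cue["scope"]
            and bool(entry_cue["entities"] & query_cue["entities"]))

def token_overlap(a: str, b: str) -> float:
    """Token-overlap similarity, standing in for an embedding score."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

def retrieve(query: str, query_cue: dict, memory: list[dict], k: int = 3) -> list[dict]:
    # 1) Filter: only entries produced under a compatible intent enter the pool.
    candidates = [m for m in memory if compatible(query_cue, m["cue"])]
    # 2) Rank: semantic similarity orders candidates only within that pool.
    return sorted(candidates,
                  key=lambda m: token_overlap(query, m["text"]),
                  reverse=True)[:k]

memory = [
    {"text": "The hotel near the station is $180 per night.",
     "cue": {"scope": "Day 1 itinerary", "entities": {"price"}}},
    {"text": "The hotel near the station is $240 per night.",
     "cue": {"scope": "Day 2 itinerary", "entities": {"price"}}},
]
hits = retrieve("nightly price of the hotel",
                {"scope": "Day 2 itinerary", "entities": {"price"}},
                memory)
print(hits[0]["text"])  # only the Day 2 entry survives the structural gate
```

The Day 1 snippet never enters the candidate set, which is exactly the point: a "correct but wrong-context" fact cannot outrank the right one if it is excluded before ranking begins.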
Why this is different from prior work
| Approach | What it optimizes | Where it breaks |
|---|---|---|
| Long-context LLMs | Raw visibility | Lost-in-the-middle, cost, noise |
| Embedding RAG | Semantic proximity | Context collisions |
| Hierarchical summaries | Compression | Goal drift, detail loss |
| Knowledge graphs | Entity relations | Missing episodic intent |
| STITCH | Intent alignment | Ingestion cost, schema evolution |
STITCH doesn’t summarize harder or retrieve faster. It retrieves more cautiously.
Findings — The benchmark exposes the failure mode
The paper introduces CAME-Bench, a benchmark explicitly designed to punish context-blind retrieval.
Key properties:
- Interleaved goals
- Repeated entities under different constraints
- Deferred questions
- Non-turn-taking interaction structure
Across both CAME-Bench and LongMemEval, STITCH dominates as trajectories grow longer.
On the largest CAME-Bench subset:
- The best baseline collapses
- STITCH improves Macro-F1 by 35.6% absolute
Ablation results are revealing:
- Removing thematic scope causes the largest performance drop
- Removing coreference resolution breaks everything quietly
- Event types help precision but can hurt synthesis if too fine-grained
This is not a modeling trick. It’s an information architecture result.
Implications — What this means for agent builders
Three uncomfortable takeaways:
- **Long context is not memory.** Without intent-aware indexing, more tokens mean more interference.
- **Recall must be conditional.** Facts are not atomic. They are valid only under the goals and actions that produced them.
- **Memory systems need structure before scale.** Scaling ingestion without fixing retrieval cues just accelerates failure.
For production agents—research copilots, planning assistants, autonomous tools—this suggests a shift:
Stop asking “How much history can we store?” Start asking “Under what intent was this history created?”
Conclusion — Memory, finally grounded
STITCH is not flashy. It adds overhead. It introduces schemas that evolve over time. It refuses to treat memory as flat text.
That restraint is precisely why it works.
By grounding memory in contextual intent, the system retrieves less—and answers better. In long-horizon reasoning, that trade-off is not optional. It is the difference between coherence and collapse.
Cognaptus: Automate the Present, Incubate the Future.