Memory That Fights Back: How SEDM Turns Agent Logs into Verified Knowledge

Every agent platform eventually develops a storage problem and pretends it is a memory strategy. The logs are all there: user turns, tool calls, partial plans, failed attempts, corrected answers, retry traces, database lookups, compliance notes, and the occasional heroic workaround that actually solved something. The tempting move is obvious. Store everything. Embed everything. Retrieve whatever looks semantically close. Then call it “long-term memory,” because “expensive junk drawer with cosine similarity” sounds less fundable. ...

September 17, 2025 · 14 min · Zelina

Pieces, Not Puzzles: How ArcMemo Turns LLM Reasoning into Reusable Skills

Tickets repeat. Spreadsheets repeat. Compliance reviews repeat. Code reviews repeat. Not exactly, of course. That would be merciful. They repeat with just enough variation to make last month’s solution almost useful and therefore mildly dangerous. This is where many enterprise “AI memory” systems become filing cabinets with delusions of competence. They store prior chats, snippets, tickets, documents, and summaries, then hope the next prompt will rhyme closely enough with something in the archive. Sometimes it does. Often it does not. The agent remembers the old puzzle, not the transferable piece. ...

September 8, 2025 · 15 min · Zelina

Benchmarks with Benefits: What DeepScholar-Bench Really Measures

TL;DR for operators: DeepScholar-Bench is useful because it turns “deep research” from a demo category into a measurable workflow: retrieve the right sources, synthesize the right facts, and attach citations that actually support the claims.[1] The headline result is not flattering. No evaluated system exceeds a 31% geometric mean across all metrics. OpenAI DeepResearch leads overall with a 0.309 geometric mean, but its best-looking strengths hide serious gaps: 0.857 on organization, 0.392 on nugget coverage, 0.187 on reference coverage, and 0.124 on document importance. Translation: the report may read well while still missing the intellectual furniture. ...

August 30, 2025 · 14 min · Zelina

Breaking the Question Apart: How Compositional Retrieval Reshapes RAG Performance

TL;DR for operators: A standard RAG system often retrieves the most individually relevant chunks. That is useful until the question needs several different pieces of evidence that must work together. Then the system may return five near-duplicates of the most obvious fact and miss the less obvious fact that actually completes the answer. Excellent. We have reinvented the meeting where everyone brings the same slide. ...

August 11, 2025 · 4 min · Zelina

Layers of Thought: How Hierarchical Memory Supercharges LLM Agent Reasoning

TL;DR for operators: An enterprise agent does not fail only because it forgets. Often, it fails because it remembers like a hoarder with a search bar. The H-MEM paper proposes a hierarchical memory system for LLM agents: Domain, Category, Memory Trace, and Episode layers, connected by positional child indices so retrieval can move from broad meaning to specific memory instead of scanning a flat pile of stored vectors.[1] That sounds like software housekeeping. It is actually the main point. ...

August 1, 2025 · 16 min · Zelina

GraphRAG Without the Drag: Scaling Knowledge-Augmented LLMs to Web-Scale

TL;DR for operators: GraphRAG usually sounds like a clean enterprise promise: put your knowledge into a graph, attach it to a language model, and enjoy more grounded answers. The less glamorous truth is that someone has to build the graph. At web scale, that “someone” is usually an LLM being asked to extract triples from millions or billions of passages, which is a fine idea if the procurement team has recently discovered oil under the server room. ...

July 24, 2025 · 15 min · Zelina

The Retrieval-Reasoning Tango: Charting the Rise of Agentic RAG

TL;DR for operators: Static RAG is still useful. It is also no longer the whole game. The paper behind this article argues that retrieval and reasoning are converging into a more tightly coupled architecture: reasoning can improve retrieval, retrieval can improve reasoning, and agentic systems can interleave both over multiple steps.[1] That sounds like a neat academic symmetry until you put it inside an enterprise workflow, where every extra retrieval call means latency, cost, permissions, ranking risk, and one more place for the machine to confidently ingest rubbish. ...

July 15, 2025 · 18 min · Zelina

Chunks, Units, Entities: RAG Rewired by CUE-RAG

TL;DR for operators: Enterprise RAG teams often treat retrieval quality as a graph-construction problem: extract more entities, more relationships, more summaries, and hope the answer appears somewhere in the resulting machinery. CUE-RAG suggests a more useful diagnosis: the failure is often not that the graph is too small, but that the system has chosen the wrong semantic unit for the job.[1] ...

July 14, 2025 · 16 min · Zelina

How Ultra-Large Context Windows Challenge RAG

TL;DR for operators: Ultra-large context windows are not a ceremonial funeral for retrieval-augmented generation. They are a price renegotiation. If your task is to analyse a bounded, self-contained document set (a contract bundle, diligence folder, policy manual, code repository, or technical appendix), a long-context model may now be the cleaner first option. The main benefit is not that it “knows more”. It is that it can inspect more of the original evidence without depending on a retriever to guess which passages matter. ...

March 29, 2025 · 12 min · Zelina