Retrieval

Memory With a Pulse: Real-Time Feedback Loops for RAG Systems

Ask an enterprise chatbot the wrong question on the wrong day and the problem is rarely that the language model has forgotten how to write English. The problem is that it has been handed the wrong pile of evidence. That is the expensive little defect inside many retrieval-augmented generation systems. The model may be fluent. The corpus may be current. The vector database may be humming along like a well-funded filing cabinet. Yet the answer still disappoints because the system chose the wrong snippets, placed a useful document too low, missed a newly relevant runbook, or treated yesterday’s user intent as if it were carved into basalt. ...

Beyond Answers: Measuring How Deep Research Agents Really Think

A research report is not an answer with extra paragraphs. That sounds obvious until an enterprise team tries to evaluate a deep research agent by asking whether its final conclusion looks plausible, whether it included citations, and whether the prose sounded confident enough to survive a board deck. Congratulations: the machine has produced something that resembles diligence. Whether it actually performed diligence is the inconvenient question. ...

Backtrack to Breakthrough: Why Great AI Agents Revisit

Search is easy. Knowing when to go back is harder. That is the useful irritation inside GSM-Agent, a new benchmark for studying agentic reasoning under controlled conditions.[1] The paper takes grade-school maths problems from GSM8K, removes the premises from the prompt, hides those premises in a searchable document database, and asks an LLM agent to recover the facts before solving the problem. The arithmetic is not supposed to be impressive. That is the point. If a model fails here, we cannot calmly blame differential geometry, PhD-level law, or some mysteriously adversarial enterprise workflow. The agent simply did not find and use the facts. ...

Memory That Fights Back: How SEDM Turns Agent Logs into Verified Knowledge

Every agent platform eventually develops a storage problem and pretends it is a memory strategy. The logs are all there: user turns, tool calls, partial plans, failed attempts, corrected answers, retry traces, database lookups, compliance notes, and the occasional heroic workaround that actually solved something. The tempting move is obvious. Store everything. Embed everything. Retrieve whatever looks semantically close. Then call it “long-term memory,” because “expensive junk drawer with cosine similarity” sounds less fundable. ...

Pieces, Not Puzzles: How ArcMemo Turns LLM Reasoning into Reusable Skills

Tickets repeat. Spreadsheets repeat. Compliance reviews repeat. Code reviews repeat. Not exactly, of course. That would be merciful. They repeat with just enough variation to make last month’s solution almost useful and therefore mildly dangerous. This is where many enterprise “AI memory” systems become filing cabinets with delusions of competence. They store prior chats, snippets, tickets, documents, and summaries, then hope the next prompt will rhyme closely enough with something in the archive. Sometimes it does. Often it does not. The agent remembers the old puzzle, not the transferable piece. ...

Benchmarks with Benefits: What DeepScholar-Bench Really Measures

TL;DR for operators: DeepScholar-Bench is useful because it turns “deep research” from a demo category into a measurable workflow: retrieve the right sources, synthesize the right facts, and attach citations that actually support the claims.[1] The headline result is not flattering. No evaluated system exceeds a 31% geometric mean across all metrics. OpenAI DeepResearch leads overall with a 0.309 geometric mean, but its best-looking strengths hide serious gaps: 0.857 on organization, 0.392 on nugget coverage, 0.187 on reference coverage, and 0.124 on document importance. Translation: the report may read well while still missing the intellectual furniture. ...

Breaking the Question Apart: How Compositional Retrieval Reshapes RAG Performance

TL;DR for operators: A standard RAG system often retrieves the most individually relevant chunks. That is useful until the question needs several different pieces of evidence that must work together. Then the system may return five near-duplicates of the most obvious fact and miss the less obvious fact that actually completes the answer. Excellent. We have reinvented the meeting where everyone brings the same slide. ...

Layers of Thought: How Hierarchical Memory Supercharges LLM Agent Reasoning

TL;DR for operators: An enterprise agent does not fail only because it forgets. Often, it fails because it remembers like a hoarder with a search bar. The H-MEM paper proposes a hierarchical memory system for LLM agents: Domain, Category, Memory Trace, and Episode layers, connected by positional child indices so retrieval can move from broad meaning to specific memory instead of scanning a flat pile of stored vectors.[1] That sounds like software housekeeping. It is actually the main point. ...

GraphRAG Without the Drag: Scaling Knowledge-Augmented LLMs to Web-Scale

TL;DR for operators: GraphRAG usually sounds like a clean enterprise promise: put your knowledge into a graph, attach it to a language model, and enjoy more grounded answers. The less glamorous truth is that someone has to build the graph. At web scale, that “someone” is usually an LLM being asked to extract triples from millions or billions of passages, which is a fine idea if the procurement team has recently discovered oil under the server room. ...

The Retrieval-Reasoning Tango: Charting the Rise of Agentic RAG

TL;DR for operators: Static RAG is still useful. It is also no longer the whole game. The paper behind this article argues that retrieval and reasoning are converging into a more tightly coupled architecture: reasoning can improve retrieval, retrieval can improve reasoning, and agentic systems can interleave both over multiple steps.[1] That sounds like a neat academic symmetry until you put it inside an enterprise workflow, where every extra retrieval call means latency, cost, permissions, ranking risk, and one more place for the machine to confidently ingest rubbish. ...