
When AI Can Solve But Can't Search: The MathNet Equation

Opening — Why this matters now: The AI industry enjoys announcing that models now perform at medal level on Olympiad mathematics. Impressive headlines. Elegant demos. Much applause. Then MATHNET arrives with the social grace of an auditor. This new benchmark shows that while leading models can often solve difficult mathematics, they are far worse at finding related problems, recognizing structural equivalence, or reliably using retrieved examples to improve reasoning. In practical terms: your AI intern may ace the exam, then fail to locate the right binder. ...

April 23, 2026 · 4 min · Zelina

Process Reward Agents — When Reasoning Learns to Judge Itself (Before It’s Too Late)

Opening — Why this matters now: There is a quiet but consequential flaw in modern AI reasoning systems: they are excellent storytellers, but poor self-editors. In domains like healthcare, finance, and law, correctness is not a property of the final answer—it is a property of the entire reasoning trajectory. Yet most large language models (LLMs) only discover their mistakes at the very end, if at all. By then, the damage is already embedded in the chain of thought. ...

April 13, 2026 · 5 min · Zelina

Build a Small RAG Knowledge Tool

How to build a lightweight retrieval-augmented knowledge tool with grounded answers, source citations, narrow scope, and a realistic MVP.

March 16, 2026 · 5 min · Michelle

Ultra‑Sparse Embeddings Without Apology

Opening — Why this matters now: Embeddings have quietly become the metabolic system of modern AI. Every retrieval query, recommendation list, and ranking pipeline depends on them—yet we keep feeding these systems increasingly obese vectors. Thousands of dimensions, dense everywhere, expensive always. The paper behind CSRv2 arrives with an unfashionable claim: you can make embeddings extremely sparse and still win. ...

February 8, 2026 · 3 min · Zelina

Search-R2: When Retrieval Learns to Admit It Was Wrong

Opening — Why this matters now: Search-integrated LLMs were supposed to be the antidote to hallucination. Give the model tools, give it the web, let it reason step by step—problem solved. Except it wasn’t. What we actually built were agents that search confidently, reason eloquently, and fail quietly. One bad query early on, one misleading paragraph retrieved at the wrong moment, and the whole reasoning chain collapses—yet reinforcement learning still rewards it if the final answer happens to be right. ...

February 4, 2026 · 4 min · Zelina

When Retrieval Learns to Breathe: Teaching LLMs to Go Wide and Deep

Opening — Why this matters now: Large language models are no longer starved for text. They are starved for structure. As RAG systems mature, the bottleneck has shifted from whether we can retrieve information to how we decide where to look first, how far to go, and when to stop. Most retrieval stacks still force an early commitment: either search broadly and stay shallow, or traverse deeply and hope you picked the right starting point. ...

January 21, 2026 · 4 min · Zelina

Rationales Before Results: Teaching Multimodal LLMs to Actually Reason About Time Series

Opening — Why this matters now: Multimodal LLMs are increasingly being asked to reason about time series: markets, traffic, power grids, pollution. Charts are rendered. Prompts are polished. The answers sound confident. And yet—too often—they’re wrong for the most boring reason imaginable: the model never actually reasons. Instead, it pattern-matches. This paper dissects that failure mode with unusual clarity. The authors argue that the bottleneck is not model scale, data access, or even modality alignment. It’s the absence of explicit reasoning priors that connect observed temporal patterns to downstream outcomes. Without those priors, multimodal LLMs hallucinate explanations after the fact, mistaking surface similarity for causality. ...

January 7, 2026 · 4 min · Zelina

Backtrack to Breakthrough: Why Great AI Agents Revisit

TL;DR: Agentic performance isn’t just about doing more; it’s about going back. In GSM-Agent—a controllable, tool-using version of GSM8K—top models only reach ~65–68% accuracy, and the strongest predictor of success is a high revisit ratio: deliberately returning to a previously explored topic with a refined query. That’s actionable for enterprise AI: design agents that can (1) recognize incomplete evidence, (2) reopen earlier lines of inquiry, and (3) instrument and reward revisits. ...

October 3, 2025 · 4 min · Zelina

Memory That Fights Back: How SEDM Turns Agent Logs into Verified Knowledge

TL;DR: Most “agent memory” is a junk drawer: it grows fast, gets noisy, and slows everything down. SEDM (Self‑Evolving Distributed Memory) proposes an auditable, efficiency‑first overhaul. It verifies each candidate memory by replaying the exact run in a Self‑Contained Execution Context (SCEC), assigns an initial utility‑aligned weight, and then self‑schedules what to retrieve next. The result: higher task accuracy with fewer tokens versus strong memory baselines on FEVER and HotpotQA. ...

September 17, 2025 · 5 min · Zelina

Branching Out of the Middle: How a ‘Tree of Agents’ Fixes Long-Context Blind Spots

TL;DR — Tree of Agents (TOA) splits very long documents into chunks, lets multiple agents read in different orders, shares evidence, prunes dead-ends, caches partial states, and then votes. The result: fewer hallucinations, resilience to the “lost in the middle” effect, and accuracy comparable to premium large models—while using a compact backbone.

Why this matters for operators: If your business parses contracts, annual reports, medical SOPs, or call-center transcripts, you’ve likely felt the pain of long-context LLMs: critical details buried mid-document get ignored; retrieval misses cross-paragraph logic; and bigger context windows inflate cost without guaranteeing better reasoning. TOA is a pragmatic middle path: it re-imposes structure on attention—not by scaling a single monolith, but by coordinating multiple lightweight readers with disciplined information exchange. ...

September 12, 2025 · 4 min · Zelina