Cover image

Search, Critique, Repeat: Critic-R Turns RAG Complaints into Retriever Training

Search failure is boring until it becomes expensive. A research agent asks for evidence. The retriever returns documents. The reasoning model reads them, continues writing, and eventually produces a confident answer. Somewhere in the middle, the evidence was slightly wrong: not irrelevant enough to trigger an obvious failure, not useful enough to support the next reasoning step. The agent proceeds anyway, because that is what agents do when we dress up uncertainty as workflow automation. ...

June 8, 2026 · 17 min · Zelina
Cover image

Wide Thinking, Narrow Context: Why InfoSeeker Rewrites the Economics of AI Search

A spreadsheet is a cruel test of artificial intelligence. Not the toy spreadsheet used in demos, with six rows, three columns, and a suspiciously cooperative universe. I mean the kind of table a real analyst asks for: every qualifying supplier in a region, every product SKU released over a decade, every regulatory filing matching a narrow condition, every competitor with exact addresses, dates, sources, and no missing cells because apparently human suffering needs columns. ...

April 6, 2026 · 16 min · Zelina
Cover image

Judgment Day for RAG: How L‑MARS Cuts Legal Hallucinations by Design

TL;DR for operators Legal AI does not fail only because models “hallucinate”. That word has become the industry’s favourite fog machine. The more operational diagnosis is sharper: models fail when they answer current legal questions from stale internal memory and then dress the error in confident reasoning. The L-MARS paper is useful because it separates two tasks that vendors often blend together for convenience: retrieving current legal facts and reasoning over stable legal principles.1 On LegalSearchQA, a new 50-question benchmark built around recent U.S. legal facts verified in March 2026, L-MARS reaches 96.0% accuracy. Zero-shot GPT-4o-mini reaches 58.0%. Chain-of-thought falls to 30.0%, because step-by-step reasoning from outdated premises merely creates a more articulate mistake. ...

September 4, 2025 · 14 min · Zelina
Cover image

Think Twice, Then Speak: Deliberative Searcher and the Future of Reliable LLMs

TL;DR for operators Search-augmented LLMs are not safe merely because they can look things up. They can still retrieve relevant documents, stitch together a plausible answer, and then express high confidence in something wrong. That is the failure mode this paper targets: not hallucination in the abstract, but the operationally poisonous state of being both false and certain. ...

July 23, 2025 · 16 min · Zelina