Retrieval

When AI Can Solve But Can't Search: The MathNet Equation

Search. That is the unglamorous part of AI work. The demo asks a model to solve a clean problem. The enterprise system asks a model to find the right prior case, retrieve the relevant precedent, avoid the misleading near-match, and then adapt the answer without making a confident mess of it. MathNet is interesting because it puts that distinction under pressure. The paper introduces a large multilingual, multimodal Olympiad mathematics benchmark, but the more useful business lesson is not merely that frontier models can solve hard math. We already have enough leaderboards wearing medals. The sharper finding is that models and embedding systems can still fail at recognizing when two problems are mathematically the same, or when one problem is structurally useful for another.1 ...

Write-Back to the Future: When Your RAG Starts Learning

A RAG system usually fails in a very ordinary way. The retriever finds something relevant, but not quite enough. The generator receives five passages, three of which are useful, one of which is decorative furniture, and one of which looks relevant only because it shares the right vocabulary. The answer is then expected to emerge from this little committee of half-helpful paragraphs. Sometimes it does. Sometimes it does what committees do. ...

Memory Diet for AI Agents: Distilling Conversations Without Forgetting

Memory has become the awkward invoice attached to every serious AI agent demo. A short chatbot can survive on vibes. A long-running coding assistant cannot. After a few weeks of debugging sessions, architecture debates, config changes, rejected fixes, and “remember we tried this already?” moments, the agent’s past becomes valuable. It also becomes inconveniently large. The obvious solution is to stuff more transcript into the prompt. The obvious solution is usually how software gets expensive before it gets useful. ...

Paperwork Intelligence: Why AI Still Struggles With Real Enterprise Documents

Paperwork is where enterprise AI demos go to lose their charm. In a product demo, an AI agent usually receives a clean PDF, a friendly question, and a document that has the decency to behave like a document. It summarizes, retrieves, answers, maybe even produces a small spreadsheet. Everyone nods. Someone says “workflow automation.” Someone else says “agentic.” The meeting ends before anyone asks whether the same system can handle 89,000 pages of historical reports, nested tables, revised statistics, scanned pages, ambiguous row headers, and a calculation that must be correct to the last digit. ...

Ultra‑Sparse Embeddings Without Apology

Search gets expensive quietly. At small scale, an embedding is just a vector. At product scale, it becomes rent: storage rent, memory rent, GPU rent, latency rent, and the recurring emotional tax of explaining why a semantic search feature needs yet another infrastructure budget. Dense embeddings made this bargain feel natural. More dimensions, more semantic capacity. More semantic capacity, better retrieval. Better retrieval, more invoices. Elegant, if one enjoys expensive inevitability. ...

Search-R2: When Retrieval Learns to Admit It Was Wrong

Search is supposed to make language models safer. The model does not know something, so it searches. It finds evidence, reasons over that evidence, and gives a better answer. Very civilized. Very responsible. Then the first search query goes slightly wrong. The model retrieves a relevant-looking but misleading paragraph. It builds the next reasoning step around the wrong entity. The next query becomes narrower, but in the wrong direction. The final answer may still sound fluent, because fluency is the one department where language models rarely file sick leave. The actual reasoning chain, however, has already drifted. ...

Seeing Is Misleading: When Climate Images Need Receipts

A picture lies differently from a sentence. A sentence can be checked against a source. A picture can be old, cropped, staged, reused, mislabeled, emotionally loaded, or paired with a claim it never supported. This is why climate disinformation is annoying in the precise technical sense: it often does not need to fabricate a new fact. It can simply attach a real-looking image to a slippery claim and let the audience do the rest. Very efficient. Very human. Very platform-native. ...

Bubble Trouble: Why Top‑K Retrieval Keeps Letting LLMs Down

The problem is not finding documents. It is spending the prompt budget badly. Ask an enterprise RAG system for “scope of work,” and the system may look confident for exactly the wrong reason. The query sounds simple. Somewhere in the document set, there is probably a sheet, paragraph, or clause literally called “Scope of Works.” A flat top-k retriever will happily grab the highest-scoring chunks from that section, stack them into the model context, and call the job done. Very tidy. Very wrong. ...

Making Noise Make Sense: How FANoise Sharpens Multimodal Representations

Search systems fail in boring ways before they fail in spectacular ones. A customer uploads a product photo and receives visually similar items that miss the actual intent. A compliance analyst searches a scanned document and gets pages that look close but answer the wrong question. A visual QA system finds the right region but ranks the wrong evidence first. Nobody in the meeting says, “Ah yes, our embedding space has poor spectral noise allocation.” They say the search feels unreliable. Much more executive-friendly. Much less useful. ...

One Pass to Rule Them All: YOFO and the Rise of Compositional Judging

Search is where nuance goes to die. A customer asks for a long evening dress, preferably not pink. A retrieval model sees “dress,” “evening,” perhaps “pink,” and returns something short, bright, and entirely wrong with the confidence of a clerk who has technically read the sentence but not understood the assignment. The business consequence is familiar: fewer conversions, more irrelevant recommendations, and yet another dashboard where “semantic relevance” looks respectable while customers quietly leave. ...