AI Evaluation

Blink and You Miss It: The Two-Stage Reality Check for Multimodal AI

Multimodal AI has reached the point where it can describe videos, summarize documents with images, answer visual questions, and generate outputs that look satisfyingly complete. This is exactly why evaluation is becoming more dangerous. A system that looks competent is not necessarily reliable. It may miss the one-second event that determines the answer. Or it may notice enough evidence but then produce a fluent, attractive, visually decorated summary that quietly distorts the facts. The first failure is upstream: the model did not capture the decisive evidence. The second is downstream: the output did not preserve and present the evidence in a human-useful way. ...

Pixels to Purchase Orders: A Business Map for Choosing Vision-Language Models

Receipts are a good way to ruin an AI demo. A clean product photo is polite. A scanned receipt is not. It has shadows, folds, strange fonts, tiny numbers, merchant abbreviations, table-like structure, and one suspiciously important total amount hiding near the bottom. Ask a generic multimodal assistant what it sees, and it may produce an answer that sounds fluent enough to make everyone in the meeting relax. That is usually the dangerous part. ...

Search, Critique, Repeat: Critic-R Turns RAG Complaints into Retriever Training

Search failure is boring until it becomes expensive. A research agent asks for evidence. The retriever returns documents. The reasoning model reads them, continues writing, and eventually produces a confident answer. Somewhere in the middle, the evidence was slightly wrong: not irrelevant enough to trigger an obvious failure, not useful enough to support the next reasoning step. The agent proceeds anyway, because that is what agents do when we dress up uncertainty as workflow automation. ...

Pretty Text, Ugly Logic: When Image Models Learn to Write but Not to Reason

A slide looks finished. The headline is sharp, the equations are aligned, the answer box is confident, and the design has the mild corporate glow of something that has already been approved by three people who did not read it. That is exactly the problem. For years, text-to-image models failed in a wonderfully obvious way: they could not spell. A poster would say “Qaurterly Reveneu,” the mockup button would contain mystical glyphs, and everyone understood the output was decorative, not operational. Recent models have changed that. They can now place readable text inside images, produce document-like pages, and generate slide-like visual artifacts. The failure mode has become less funny and more expensive: the text may be readable, but the reasoning may be wrong. ...

Rank and File: AI Leaderboards Are Measurement Instruments, Not Scoreboards

Procurement meetings have a familiar ritual now. Someone opens a leaderboard, sorts by average score, points at a model near the top, and asks why the company is not using that one. It feels empirical. It is neatly ranked. It has decimals. Very scientific-looking decimals, the most seductive species of decimal. The problem is not that leaderboards are useless. The problem is that we often treat them as scoreboards when they are closer to measurement instruments. A scoreboard tells us who won under agreed rules. A measurement instrument first has to prove that it measures the thing it claims to measure. If the instrument mixes model size, benchmark difficulty, contributor practices, post-training choices, item redundancy, and residual artifacts into one number, then the number may still be useful. It is just not self-explanatory. ...

Clue by Clue: ProjectionBench and the Business of Testing AI Discovery

Lab meeting. A scientist has a topic, a research question, and not much else. No dataset yet. No final chart. No results section quietly waiting in the appendix. Just a question and the uncomfortable business of guessing what nature might do. ...

Vibe Check: AutoResearch Is a Workflow, Not a Robot Scientist

Demo day is not discovery day. Demo day has a familiar rhythm. An AI system reads papers, proposes an idea, edits code, runs an experiment, drafts a manuscript, and perhaps even produces something that looks suspiciously like a conference submission. The slide title then arrives with great ceremony: autonomous scientist. The paper “AutoResearch AI: Towards AI-Powered Research Automation for Scientific Discovery” is useful because it interrupts that ceremony before everyone starts clapping at the PDF generator. Its central move is not to deny progress. Current systems really can automate meaningful pieces of research work. They can search, summarize, plan, code, run tools, assemble figures, and draft reports. That is already operationally important. ...

Follow the Heads, Not the Hype: How LLMs Route Deductive Reasoning

Policy rules are boring until a chatbot applies the wrong one. A customer asks whether they qualify for a refund. The rule says refunds require purchase within 30 days, unused condition, and no prior replacement claim. The model answers confidently. It even writes a neat step-by-step explanation. Wonderful. The explanation looks like reasoning. It may even be correct. ...

Same Maps, Different Moves: Why LLMs Can Converge Without Understanding

Meetings are useful theatre. Everyone can nod at the same slide, repeat the same market keywords, and still leave the room with incompatible plans. The agreement was real. The shared understanding was not. Large language models may be doing something uncomfortably similar. The paper “Convergence Without Understanding: When Language Models Agree on Representations but Disagree on Reasoning” studies whether models that look similar internally are actually reasoning in similar ways. This matters because a tempting story has been building around representational convergence: as models scale, their internal representations become more alike, perhaps because they are converging toward a shared statistical model of reality. That story is elegant. It is also a little too convenient, which is usually where expensive mistakes begin. ...

Scaffold and Ladder: Why AI Agents Need Meta-Reasoning, Not Longer Monologues

Workflow is where AI agents usually stop looking magical. Ask one to summarize a short memo, and it behaves like a competent intern with suspiciously fast typing. Ask it to investigate a compliance question across policies, contract clauses, ticket histories, and messy attachments, and the illusion starts to wobble. The agent searches once, reads too much at once, jumps to a plausible answer, and then politely explains the wrong conclusion with the confidence of a junior consultant who has discovered formatting. ...