AI Evaluation

When Models Get Lost in Space: Why MLLMs Still Fail Geometry

Geometry looks clean. A cube has edges. A projection has rules. A missing view should follow from the views already shown. This is not the messy world of occluded street scenes, motion blur, shadows, or a warehouse camera pointed at the wrong shelf. It is the kind of visual reasoning many students learn before they are trusted with anything more dangerous than a compass, a ruler, and mild boredom. ...

Lost in Translation: When 14% WER Hides a 44% Failure Rate

Taxi dispatch is not a poetry recital. When a passenger calls and says, “I’m on Arguello,” the system does not need to appreciate the full expressive richness of the sentence. It needs to identify one street name, map it to the right place, and send a vehicle there. This is not a broad language-understanding task. It is a narrow operational task with coordinates attached. ...

Too Much Spice, Not Enough Soul: When LLMs Cook Without Culture

Recipe localization looks like an easy prompt. “Create a Jamaican version of Moroccan couscous.” The model smiles politely, throws in jerk seasoning, allspice, scotch bonnet, maybe coconut milk if it is feeling ambitious, and returns something that looks country-specific enough to survive a quick marketing review. The title says “Jamaican.” The ingredients sound Jamaican. The format is clean. No hallucinated oven temperature from another dimension. Excellent, ship it. ...

From Features to Actions: Why Agentic AI Needs a New Explainability Playbook

A customer-service agent rebooks a flight, checks a policy, calls an API, updates the passenger record, apologizes politely, and still gets the outcome wrong. The old explainability question would be: which input tokens influenced the final answer? That question is not useless. It is just late to the crime scene. When an AI system only predicts, explanation can focus on a single input-output decision. When an AI system acts, explanation has to follow the behavior across time: the state it maintained, the tool it selected, the observations it received, the recovery move it attempted, and the point where the run quietly became unrecoverable. A nice feature-importance chart does not tell you that. It tells you what mattered to a prediction, not how a workflow failed. ...

First Proofs, No Training Wheels

Proof is where AI systems stop performing confidence and start owing the reader money. A model can restate a theorem elegantly. It can cite the right neighborhood of literature. It can produce LaTeX with the visual manners of a publishable paper. None of that is a proof. It is proof-shaped material. Sometimes useful. Sometimes impressive. Sometimes a very expensive fog machine. ...

When Benchmarks Lie: Teaching Leaderboards to Care About Preferences

A leaderboard is a comforting object. It gives procurement teams, product managers, and slightly sleep-deprived founders the same small pleasure: a ranked list. Bigger number, better model. Lower rank, worse model. Decision made. Spreadsheet closed. Everyone can return to pretending vendor evaluation is objective. Unfortunately, benchmarks do not care what your business actually needs. ...

RAudit: When Models Think Too Much and Still Get It Wrong

The model is not always confused. Sometimes it has already done the work and reached the right answer, and then it politely walks away from that answer because the user sounded confident. That is the quietly irritating problem behind RAudit, a paper that studies how large language models behave when their reasoning is audited without giving the auditor the correct answer.¹ The paper is not just another “LLMs can be sycophantic” warning. We have enough of those. At this point, saying models flatter users is like saying spreadsheets contain hidden errors. True, useful, and somehow still not enough to change deployment practice. ...

When Language Learns to Doubt Itself: Self-Contradiction as an Upgrade Path for Multimodal AI

Image generation has become good enough to be useful and unreliable enough to remain annoying. That is the normal condition of enterprise AI: impressive demos, awkward edge cases, and someone in operations quietly asking whether the model actually understood the instruction or merely produced something that looked plausible from a distance. A user asks for “a red ceramic mug on a wooden desk, next to an open notebook, in morning light.” The model produces a beautiful desk, credible sunlight, maybe even the notebook. The mug is blue. Or metallic. Or missing. If a separate vision model can look at the image and say, “That is not a red ceramic mug,” the failure feels almost rude. The system can see the problem after creating it. Very efficient, in the same way that a committee can discover a typo after approving the brochure. ...

When Benchmarks Forget What They Learned

The leaderboard said “learning.” The model may have heard “storage.” Benchmarks are supposed to answer a simple business question: does this model actually perform the task? That sounds clean. A model receives a test. It gives answers. Someone turns the answers into a score. Procurement teams, product managers, investors, and mildly overconfident LinkedIn commentators then convert the score into a story about intelligence. The machinery is familiar enough to feel objective. ...

Pay to Think: Incentive Design Is the Hidden Variable in Human–AI Research

Payment sounds like the boring part of a user study. Recruit participants. Estimate task time. Set a base rate. Add a small bonus if the budget allows. Put the number in the methods section, preferably somewhere readers can skim past with dignity. Then move on to the interesting material: trust, reliance, explanations, fairness, error rates, cognitive load, and all the other variables that make human–AI decision-making sound like a serious field rather than a procurement spreadsheet. ...