Benchmarks

Seeing Is Not Reasoning: Why Mental Imagery Still Breaks Multimodal AI

A model can generate a pretty sequence of images. Good. So can a slide deck. The harder question is whether those images actually help it think. That is the uncomfortable point behind MentisOculi: Revealing the Limits of Reasoning with Mental Imagery, a new benchmark paper that tests whether frontier multimodal models can do something closer to human mental imagery: form a visual state, keep it stable, transform it step by step, and use the transformed state to decide what to do next.[1] Not merely “look at an image and answer a question.” Not “draw a plausible intermediate picture.” Actual visual reasoning, with consequences. ...

When Benchmarks Forget What They Learned

The leaderboard said “learning.” The model may have heard “storage.” Benchmarks are supposed to answer a simple business question: does this model actually perform the task? That sounds clean. A model receives a test. It gives answers. Someone turns the answers into a score. Procurement teams, product managers, investors, and mildly overconfident LinkedIn commentators then convert the score into a story about intelligence. The machinery is familiar enough to feel objective. ...

When Empathy Needs a Map: Benchmarking Tool‑Augmented Emotional Support

Empathy is easy to fake for one sentence. A chatbot can say “that sounds exhausting” without knowing anything about you, your situation, your city, your time zone, or whether the advice it is about to give is physically possible. That is the awkward part of emotional support AI: the tone can be soft while the facts are made of air. A very caring assistant can still recommend a midnight walk at 3 p.m., suggest a closed café, or confidently invent local details because it wants to be helpful. The kindness is real enough in style. The grounding is not. ...

Sequential Beats Parallel: When Deep Research Agents Learn to Reflect

A research request usually begins with a deceptively harmless sentence: “Can you give me the full picture?” Then comes the usual enterprise ritual. Someone breaks the topic into pieces. One person checks competitors. Another checks regulation. Another reads technical reports. Another searches recent news. Everyone works quickly. Everyone returns with fragments. Then one unlucky analyst stitches the fragments into a report and pretends the seams are a design choice. ...

SokoBench: When Reasoning Models Lose the Plot

A corridor is not supposed to be hard. There is one player. One box. One goal. No maze. No clever trap. No branching strategy tree with a thousand tempting wrong turns. The player stands at one end, the goal sits at the other, and the box is between them. Push the box along the corridor until it reaches the goal. That is the task. ...

Your Agent Remembers—But Can It Forget?

Memory is usually sold as a virtue. An AI agent with memory sounds safer, smarter, more personal, more autonomous. A warehouse robot remembers where boxes were placed. A navigation agent remembers which corridor led to the exit. A workflow agent remembers what the user asked yesterday and uses that context tomorrow. This is the comforting version of memory: the past as an asset. ...

When Benchmarks Break: Why Bigger Models Keep Winning (and What That Costs You)

Budget. That is where the benchmark story usually becomes less elegant. A vendor shows a model card with better reasoning scores, stronger multi-task accuracy, and a leaderboard position polished to a mirror finish. Then someone in operations asks the rude question: what does this improvement cost per customer case, per analyst hour, per compliance review, or per failed escalation? ...

When the Right Answer Is No Answer: Teaching AI to Refuse Messy Math

A scanned exam paper is not a polite input. It arrives bent, shadowed, annotated, folded, half-covered by a student’s handwriting, and occasionally photographed at an angle chosen by someone apparently in active conflict with geometry. For a human teacher, this is annoying. For a document AI system, it is more than annoying. It creates a dangerous fork in the road: extract what is visible, or admit that the question cannot be recovered. ...

Explaining the Explainers: Why Faithful XAI for LLMs Finally Needs a Benchmark

Hiring. A candidate writes a personal statement. A screening model gives a score. A manager asks the AI system why. The explanation says work experience mattered most, education came next, and demographic variables barely moved the decision. Everyone relaxes, because the explanation sounds reasonable. That is the dangerous part. A reasonable explanation is not necessarily a faithful explanation. A counterfactual edit that looks plausible is not necessarily a causal counterfactual. And a model that appears insensitive to demographic concepts may not be “fair”; it may simply have learned, or been aligned, to suppress visible sensitivity in the narrow setting being tested. ...

TowerMind: When Language Models Learn That Towers Have Consequences

Tower placement is a small decision until it is wrong. In a tower-defense game, a bad tower is not merely an inelegant plan. It is money spent, coverage lost, enemies leaked, and time wasted. The game does not care that the explanation sounded strategic. It only asks whether the tower actually touches the road. ...