AI Evaluation

LemmaBench: When AI Finally Meets Real Mathematics

Most AI math benchmarks still feel like exam rooms. The model receives a problem. It produces an answer. We score the answer. Everyone argues about whether the problem was hard enough, whether the model saw something similar during training, and whether the leaderboard means anything outside the leaderboard. Very productive. Almost as peaceful as a faculty meeting. ...

The Context Ceiling: When Long Context Stops Thinking

Documents are the easiest way to fool an AI system into looking serious. A procurement team uploads the full contract archive. A compliance team adds policy manuals, audit notes, and emails. A financial analyst stuffs transcripts, filings, and market commentary into one heroic prompt. The interface accepts it. The model answers fluently. Everyone relaxes. ...

Brains, Bias & Benchmarks: Why Multimodal AI Still Struggles with Tumor Truth

MRI is a useful reality check for multimodal AI. It looks like an image problem, behaves like a reasoning problem, and punishes lazy confidence with the quiet brutality of clinical ambiguity. That is why MM-NeuroOnco is more interesting than another “new benchmark” headline.[1] The paper introduces a multimodal instruction dataset and benchmark for MRI-based brain tumor diagnosis, but the dataset size is not the main story. Yes, the authors curate a 73,226-image pool, build 24,726 semantically attributed samples, generate more than 200,000 VQA pairs, and construct a 1,000-image benchmark with more than 3,000 questions. Fine. The spreadsheet is muscular. ...

When Seeing Isn’t Understanding: Closing the Multimodal Generation–Understanding Gap

Image generation has become very good at looking confident. That is convenient for demos, investor decks, and social media clips where a dragon, a dashboard, or a product mockup only needs to survive five seconds of human attention. Unfortunately, enterprise systems are less forgiving. A generated image may be beautiful, on-brand, and still wrong. The product is held in the wrong hand. The safety sign is placed behind the hazard. The chart looks plausible but reverses the relationship it was supposed to explain. Charming, as long as nobody uses it. ...

Flip the Script: When Causality Breaks the LLM Illusion

A fire alarm can cause people to evacuate. It can cause a building to enter alert mode. It can trigger emergency procedures, bring firefighters, and make everyone suddenly remember where the stairs are. But does a fire alarm cause a fire? Obviously not. At least, obviously not to a human who understands the causal structure. The alarm is usually an effect or signal of fire risk, not the origin of the fire itself. A model trained on enough sentences of the form “fire alarm causes…” may not be so careful. It may see the familiar phrase pattern, complete the familiar answer, and walk directly into the wrong conclusion with excellent grammar. ...

Ready Player None: Why AI Still Can’t Beat the Human Game Multiverse

Games are not supposed to be frightening. A commuter plays them between meetings. A child learns one in thirty seconds. A bored adult opens a mobile puzzle, fails once, notices the trick, and improves. No dissertation. No onboarding deck. No “agentic workflow architecture.” Just look, act, remember, adjust. That is precisely why the new AI GAMESTORE paper is awkward for the current AI narrative.[1] It does not ask whether frontier models can solve another static exam, write another function, or produce another polished paragraph about strategic transformation. They can do all of that, often impressively. The paper asks something more ordinary and therefore more damaging: can a model learn unfamiliar human-designed games under roughly human-like gameplay constraints? ...

Who Was Where When? AI Tries to Remember History

Archive work has a very simple-looking question at its center: who was where, and when? That question looks harmless until a machine has to answer it from a century-old newspaper, after OCR has mangled the spelling, the place names have shifted, the language is not always English, and the text only implies the answer through an event, job title, or institutional affiliation. At that point, “extracting information” becomes less like copying a fact from a sentence and more like making a legally cautious inference from a witness who speaks in fragments. ...

Do They Mean It? Testing Whether AI Actually ‘Reasons’ Behind the Wheel

A car follows a cyclist on a narrow road. The double solid yellow line says: do not cross. The empty oncoming lane says: perhaps you can. The cyclist may feel uncomfortable being followed. The passenger may be late. The vehicle behind may be getting impatient. The automated vehicle must choose. A normal benchmark would ask whether the final maneuver is safe, legal, smooth, or close to a human reference trajectory. Useful, yes. Complete, no. ...

When Agents Browse Back: Why Multimodal Search Still Fails the Real Web

Search looks easy until the answer is hiding in a caption, a cropped image region, a second web page, and one annoyingly necessary intermediate clue. That is the problem BrowseComp-V3 is trying to measure.[1] Not whether a multimodal model can recognize an object in an image. Not whether a chatbot can summarize the first search result. Not whether a web agent can click around long enough to look busy. The benchmark asks a more operationally relevant question: can an AI system browse the open web, combine text and visual evidence across multiple steps, and still arrive at the right answer? ...

Consistency Is Not a Coincidence: When LLM Agents Disagree With Themselves

A support ticket arrives. The agent reads the same customer history, sees the same policy document, and has access to the same tools. On Monday, it searches for the refund rule, retrieves the correct clause, and gives a clean answer. On Tuesday, with the same input, it searches for a different phrase, retrieves a less relevant document, wanders through two extra steps, and ends with a confident answer that is only approximately useful. ...