AI Evaluation

Proofs and Consequences: How Math Reveals What AI Still Doesn’t Know

TL;DR for operators: Mathematical proof is a nasty evaluation setting for AI systems because it leaves fewer hiding places. A model cannot merely land on a final number; it has to preserve the truth of each step. That is precisely why Guo et al.’s RFMDataset is useful: it tests whether advanced reasoning models can construct complete natural-language proofs, then classifies how they fail when they cannot. ...

Good Bot, Bad Reward: Fixing Feedback Loops in Vision-Language Reasoning

TL;DR for operators: The useful lesson is not that vision-language models need longer reasoning traces. They already produce plenty of words. Some of them are even adjacent to thought. The useful lesson is that multimodal systems need feedback that can tell where a reasoning path breaks, not merely whether the final answer looks acceptable. ...

Reflections in the Mirror Maze: Why LLM Reasoning Isn't Quite There Yet

TL;DR for operators: Adding “reasoning” to an LLM agent is not the same as making it reason better. Wong et al. test four open-source models across dynamic SmartPlay tasks using a baseline prompt, reflection, reflection plus an Oracle that mutates heuristics, and reflection plus a Planner that simulates short future trajectories. The clean result is not “planning wins” or “bigger models win.” The result is more annoying, therefore more useful: the same scaffold can be a booster, a distraction, or a failure amplifier. ...

Raising the Bar: Why AI Competitions Are the New Benchmark Battleground

TL;DR for operators: A model score is not a certificate. It is a timestamp. That is the operational message of D. Sculley and co-authors’ position paper on GenAI evaluation. Their argument is not that every static benchmark is useless, nor that competitions are magical truth machines with leaderboards attached. The argument is sharper: GenAI has broken the old bargain behind machine-learning evaluation. ...

Logos, Metron, and Kratos: Forging the Future of Conversational Agents

TL;DR for operators: Conversational agents are moving from polite text boxes into operational systems: booking, triaging, recommending, retrieving, judging, escalating, and occasionally making a confident mess with impressive formatting. The useful lesson from these two papers is simple: enterprise agents cannot be trusted just because they can reason, remember, or call tools. Those are necessary capabilities, not sufficient safeguards. A serious agent needs a fourth layer: a way to evaluate whether its own decisions and judgments deserve to be used. ...

Judge, Jury, and GPT: Bringing Courtroom Rigor to Business Automation

TL;DR for operators: A web agent that looks impressive in a demo may still fail when asked to complete ordinary live tasks across messy websites. That is the central finding of “An Illusion of Progress? Assessing the Current State of Web Agents,” which introduces Online-Mind2Web, a benchmark of 300 realistic tasks across 136 websites. ...