Benchmarking

Thinking Inside the Gameboard: Evaluating LLM Reasoning Step-by-Step

TL;DR for operators: Most AI evaluations still ask too narrow a question: did the model get the answer right? That is useful, but it is not enough when the model is expected to act as an agent, revise plans, obey constraints, and recover from failure without turning the workflow into a procedural bonfire. ...

Judge, Jury, and GPT: Bringing Courtroom Rigor to Business Automation

TL;DR for operators: A web agent that looks impressive in a demo may still fail when asked to complete ordinary live tasks across messy websites. That is the central finding of "An Illusion of Progress? Assessing the Current State of Web Agents," which introduces Online-Mind2Web, a benchmark of 300 realistic tasks across 136 websites. ...

Rules of Engagement: Why LLMs Need Logic to Plan

TL;DR for operators: Enterprise agents fail less like philosophers and more like junior coordinators with access to the wrong dropdown menu. They propose actions that are not currently possible. They miss actions that are possible. They forget that an action changes the world. They treat impossible future states as if determination will somehow make them available. They add redundant steps, skip mandatory subgoals, or pick a next move that feels plausible but does not reduce the distance to the goal. ...