Benchmarks

Hex Marks the Spot: Terra Nova and the New Frontier of Agent Intelligence

A strategy game is a cruelly efficient way to embarrass an intelligent system. Not because games are magic. Not because hexagonal maps secretly contain the meaning of cognition. They do not, despite what several overexcited benchmark papers might imply after a strong coffee. Games are useful because they compress decision pressure. They make planning visible. They force trade-offs. They punish agents that confuse local competence with strategic understanding. ...

Benchmarked Brilliance: How CreBench Rewrites the Rules of Machine Creativity

Design review is where creativity usually goes to become awkward. One person likes the concept because it feels original. Another dislikes it because it looks impractical. A third praises the visual polish while quietly ignoring whether the idea solves the actual problem. Then someone asks whether the AI can “evaluate creativity”, and everyone pretends the word creativity has a stable meaning. Excellent. Very efficient. ...

Breaking the Tempo: How TempoBench Reframes AI’s Struggle with Time and Causality

A failed deployment usually produces two questions. The first is easy enough to ask: what happened? The second is where the room goes quiet: what actually caused it? Most AI systems are now quite comfortable with the first question. Give them logs, traces, workflows, tool calls, or transition histories, and they can often produce a plausible reconstruction. They can narrate the incident in confident sequence. They can point to every condition that was present. They can provide a tidy post-mortem, ideally before the humans have finished opening the dashboard. ...

The Agent Olympics: How Toolathlon Tests the Limits of AI Workflows

Office work is not one task. It is a chain of small obligations pretending to be one task. “Check the homework submissions, download the attached Python files, run them, grade the students in Canvas, and use the latest submission if someone sent more than one.” That sounds like a normal administrative request. It is also a compact torture device for an AI agent. The agent must read email, handle attachments, inspect local files, run code, interpret results, map students to course records, update Canvas, and not confidently grade the wrong person. Easy, apparently, as long as nothing has to actually work. ...

Beyond Answers: Measuring How Deep Research Agents Really Think

A research report is not an answer with extra paragraphs. That sounds obvious until an enterprise team tries to evaluate a deep research agent by asking whether its final conclusion looks plausible, whether it included citations, and whether the prose sounded confident enough to survive a board deck. Congratulations: the machine has produced something that resembles diligence. Whether it actually performed diligence is the inconvenient question. ...

Paper Tigers or Compliance Cops? What AIReg‑Bench Really Says About LLMs and the EU AI Act

Audit queues have a special talent for turning urgency into fog. A product team wants to ship. Legal wants assurance. Governance wants evidence. The vendor has supplied a beautifully formatted technical document, full of dataset sizes, risk controls, model validation steps, and the usual confidence perfume. Somewhere inside that document may be a real compliance gap. Or it may simply be written by someone who knows how to sound compliant. Naturally, someone asks the modern executive question: can we let an LLM take the first pass? ...

The Mr. Magoo Problem: When AI Agents 'Just Do It'

Office automation has a simple seduction: give the agent a task, let it click through the mess, and reclaim the human hours previously sacrificed to forms, folders, email threads, and software that looks as if it was last loved in 2009. That is the promise. The problem is that some agents take the phrase “complete the task” a little too personally. ...

Benchmarks That Fight Back: Adaptive Testing for LMs

A benchmark is supposed to be a measuring instrument. In practice, many AI benchmarks behave more like a tired clipboard. Every model gets the same questions. Every question receives the same accounting treatment. The final score is usually a mean accuracy number, neat enough for a leaderboard and blunt enough to hide the messy truth underneath. Some items are too easy to tell strong models apart. Some are too hard to tell weak models apart. Some are mislabeled. Some have stopped mattering because everyone competent now solves them. Yet the ritual continues: run the suite, average the answers, update the chart, pretend the thermometer is not melting. ...

Automate All the Things? Mind the Blind Spots

A research report lands on your desk. It has a neat abstract, respectable tables, clean code attached, and just enough methodological language to sound like someone suffered through the usual academic rituals. Except this time, no one did. An AI scientist system generated the idea, wrote the code, ran the experiments, selected the result, and drafted the paper. ...

Benchmarks with Benefits: What DeepScholar-Bench Really Measures

TL;DR for operators: DeepScholar-Bench is useful because it turns “deep research” from a demo category into a measurable workflow: retrieve the right sources, synthesize the right facts, and attach citations that actually support the claims.[1] The headline result is not flattering. No evaluated system exceeds a 31% geometric mean across all metrics. OpenAI DeepResearch leads overall with a 0.309 geometric mean, but its best-looking strengths hide serious gaps: 0.857 on organization, 0.392 on nugget coverage, 0.187 on reference coverage, and 0.124 on document importance. Translation: the report may read well while still missing the intellectual furniture. ...
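Why a geometric mean rather than a plain average? A rough sketch, using only the four scores quoted above, shows how the geometric mean pulls a system down when a few metrics are weak. This is an illustration under stated assumptions, not the benchmark's own scoring code: the metric names are hypothetical labels, and because the benchmark aggregates over its full metric set, these four numbers will not reproduce the reported 0.309.

```python
import math

# Scores quoted in the teaser for OpenAI DeepResearch.
# Assumption: only these four metrics are used here. The benchmark's
# reported geometric mean covers all of its metrics, so the value
# computed below is illustrative and is not expected to equal 0.309.
quoted_scores = {
    "organization": 0.857,
    "nugget_coverage": 0.392,
    "reference_coverage": 0.187,
    "document_importance": 0.124,
}


def arithmetic_mean(values):
    """Plain average: a strong metric can offset a weak one."""
    values = list(values)
    return sum(values) / len(values)


def geometric_mean(values):
    """n-th root of the product, computed via logs for numerical stability."""
    values = list(values)
    if any(v <= 0 for v in values):
        raise ValueError("geometric mean requires strictly positive values")
    return math.exp(sum(math.log(v) for v in values) / len(values))


arith = arithmetic_mean(quoted_scores.values())
geo = geometric_mean(quoted_scores.values())

print(f"arithmetic mean: {arith:.3f}")
print(f"geometric mean:  {geo:.3f}")
```

On these four numbers the geometric mean lands well below the arithmetic mean, which is the point of the design choice: a polished structure cannot buy back thin reference coverage.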