Benchmarks

TowerMind: When Language Models Learn That Towers Have Consequences

Tower placement is a small decision until it is wrong. In a tower-defense game, a bad tower is not merely an inelegant plan. It is money spent, coverage lost, enemies leaked, and time wasted. The game does not care that the explanation sounded strategic. It only asks whether the tower actually touches the road. ...

Stuck on Repeat: When Reinforcement Learning Fails to Notice the Rules Changed

A dashboard still looks the same after the business changes. The buttons are in the same place. The form fields have the same labels. The workflow still asks for the same approval, the same handoff, the same final action. From the outside, nothing has moved. Then the rules underneath change. A supplier starts behaving differently after a policy shift. A trading market reacts differently after a liquidity regime changes. A robot arm keeps seeing the same objects, but the hardware has worn slightly. A customer-service automation still receives the same message types, but the escalation logic behind the organization has quietly changed. ...

NPCs With Short-Term Memory Loss: Benchmarking Agents That Actually Live in the World

Minecraft is not the point. That may sound rude to the blocks, but it is the cleanest way to read MineNPC-Task: Task Suite for Memory-Aware Minecraft Agents.1 The paper does use Minecraft. It does study an AI companion agent inside a live game world. It does report that a GPT-4o-powered setup failed on 71 of 216 attempted subtasks, or roughly one-third of them. ...

SpatialBench: When AI Meets Messy Biology

A dataset arrives. Not a clean demo dataset. Not a tidy CSV with three columns and a tutorial notebook waiting nearby like a hotel concierge. A real spatial biology dataset arrives: high-dimensional, platform-specific, noisy, partially processed, full of tacit assumptions, and attached to a scientific question that cannot be answered by knowing biology in the abstract. ...

Competency Gaps: When Benchmarks Lie by Omission

Scores are comforting. That is their main commercial advantage. A vendor can say its model reaches a certain accuracy on a benchmark, a leaderboard can rank systems neatly, and an internal AI team can report that the new model is “better” than the old one. Everyone gets a number. The procurement slide looks tidy. The risk committee, if mercifully sleepy, moves on. ...

Personas, Panels, and the Illusion of Free A/B Tests

A/B tests are expensive in the least glamorous way. Not because the math is hard. Not because a conversion metric is philosophically mysterious. The real cost is coordination: product approval, legal review, user-risk arguments, instrumentation, waiting for enough traffic, and then explaining to someone why the “obvious winner” was not statistically obvious at all. ...

When Benchmarks Rot: Why Static ‘Gold Labels’ Are a Clinical Liability

Clinical AI has a paperwork problem. Not the usual paperwork problem, where doctors drown in documentation and everyone promises that software will save them. The more interesting problem sits one layer below: the paperwork used to judge the software may itself be wrong. That is the uncomfortable center of Scalable Stewardship of an LLM-Assisted Clinical Benchmark with Physician Oversight, a paper that audits MedCalc-Bench, a benchmark for testing whether language models can compute medical risk scores from patient narratives.1 The paper’s target is not a toy dataset. MedCalc-Bench covers 55 medical calculators and includes 10,053 training instances plus 1,047 test instances. Its labels were produced through an LLM-assisted pipeline: GPT-3.5 matched patient contexts to calculator questions, GPT-4 extracted clinical features, and Python scripts aggregated those features into final scores. ...

LLMs, Gotta Think ’Em All: When Pokémon Battles Become a Serious AI Benchmark

Game AI usually has a familiar job: lose convincingly. Not too quickly, because that feels insulting. Not too brutally, because that feels like homework wearing a boss battle costume. Good game AI sits in the narrow emotional band between “I can beat this” and “I need to think.” The old solution was scripted behavior, heuristics, difficulty sliders, or reinforcement learning trained until the agent stopped embarrassing itself. The newer temptation is simpler: give the game state to an LLM and ask it to play. ...

CitySeeker: Lost in Translation, Found in the City

The city does not answer literal questions. A person says, “I’m thirsty.” A human does not usually reply, “Please specify whether you require a vending machine, café, convenience store, supermarket, juice shop, water fountain, or bubble tea store.” That would be technically attentive and socially catastrophic. A human looks around, remembers what cities usually contain, infers which places can satisfy the need, and starts walking toward a plausible target. ...

From Benchmarks to Beakers: Stress‑Testing LLMs as Scientific Co‑Scientists

Benchmarks are clean. Research is not. A benchmark asks a model to answer a question, then politely stops. A research workflow asks the model to form a hypothesis, test it, read the result, notice what went wrong, adjust the plan, and try again without wandering into scientific nonsense. One is a quiz. The other is a beaker with a budget, a deadline, and a surprisingly expensive simulation queue. ...