AI Evaluation

Thinking in Branches: Why LLM Reasoning Needs an Algorithmic Theory

A manager asks an AI system for a risk assessment. It gives a plausible answer. The manager asks again with a slightly different prompt. Another plausible answer appears, with different reasoning. Ask five more times and the system scatters clues across the attempts like a consultant who has read the documents but refuses to assemble the memo in one draft. ...

Stuck on Repeat: Why LLMs Reinforce Their Own Bad Ideas

Meetings have a familiar failure mode. Someone states an early opinion, then spends the next thirty minutes “thinking through the issue” in a way that somehow makes the original opinion look increasingly inevitable. Evidence enters the room. Counterarguments are acknowledged. The conclusion remains suspiciously loyal to the opening bid. Apparently, large language models have been attending the same meetings. ...

When Agents Treat Agents as Tools: What Tool-RoCo Tells Us About LLM Autonomy

Dispatch is where autonomy usually goes to die. A warehouse manager may have ten workers, three forklifts, two packing stations, and one increasingly dramatic dashboard. The hard part is not merely deciding what each person should do. The hard part is knowing when to call someone in, when to release them, and when extra “help” is just a polite name for congestion. ...

Error Hunting Season: Why Pessimism Makes LLMs Smarter at Math

Review is not a democracy. That sounds unpleasant, which is why it is useful. In many business settings, we like consensus because it feels stable. Three analysts agree, five reviewers approve, the dashboard turns green, and everyone can pretend the risk has been domesticated. Mathematics is less polite. One invalid theorem application, one hidden assumption, one algebraic step that does not follow, and the whole proof may collapse. The majority does not get to vote a contradiction out of existence. ...

Enviro-Mental Gymnastics: Why Cross-Environment Agents Still Trip Over Their Own Feet

Demo day is easy. Give an AI agent one workflow, one tool stack, one database schema, one approval rule, and one forgiving evaluator, and it may look surprisingly competent. It files the ticket. It updates the CRM. It writes the SQL query. Everyone nods. Someone says “agentic transformation,” because apparently every procurement meeting now needs a spell. ...

Mind the Gaps: Why LLMs Reason Like Brilliant Amnesiacs

A model can write a flawless explanation, check its own work, announce a correction, and then make the same mistake three paragraphs later. This is the familiar enterprise horror show: the AI appears to reason, but its reasoning has no working memory of its own commitments. It is articulate, capable, and sometimes genuinely useful. It is also, in the wrong setting, a brilliant amnesiac. ...

One Pass to Rule Them All: YOFO and the Rise of Compositional Judging

Search is where nuance goes to die. A customer asks for a long evening dress, preferably not pink. A retrieval model sees “dress,” “evening,” perhaps “pink,” and returns something short, bright, and entirely wrong with the confidence of a clerk who has technically read the sentence but not understood the assignment. The business consequence is familiar: fewer conversions, more irrelevant recommendations, and yet another dashboard where “semantic relevance” looks respectable while customers quietly leave. ...

Pop-Ups, Pitfalls, and Planning: Why GUI Agents Break in the Real World

Pop-up. That tiny word hides a surprisingly large operational problem. A human sees a battery warning, an update prompt, a permission dialog, or a frozen app and does something boringly competent: dismiss it, recover context, re-check the screen, and continue. A GUI agent, meanwhile, may confidently continue a plan that no longer matches reality. The machine has not “failed” in the theatrical sense. It has simply treated a live workflow like a polite screenshot sequence. Very enterprise. Very doomed. ...

Hex Marks the Spot: Terra Nova and the New Frontier of Agent Intelligence

A strategy game is a cruelly efficient way to embarrass an intelligent system. Not because games are magic. Not because hexagonal maps secretly contain the meaning of cognition. They do not, despite what several overexcited benchmark papers might imply after a strong coffee. Games are useful because they compress decision pressure. They make planning visible. They force trade-offs. They punish agents that confuse local competence with strategic understanding. ...

Prompted and Confused: When LLMs Forget the Assignment

A requirements document walks into a model. It says: assign resources, respect capacity, avoid conflicts, minimise waste. The model nods politely, emits a tidy block of MiniZinc, and everyone is briefly tempted to believe the future has arrived. Then someone changes the story from cars to knapsacks, or adds one stray sentence about maximising something, and the same system quietly forgets the assignment. ...