Benchmarks

The Solver Isn’t the Strategy: FrontierOR’s Reality Check for AI Optimisation Agents

Scheduling a factory, routing a fleet, pricing airline seats, allocating scarce capacity: these are not “write me a Python script” problems with nicer stationery. In real operations research, the useful answer is not merely a correct mathematical model. It is a method that stays feasible, keeps solution quality high, and finishes before the business context has expired. ...

Kitchen Confidential: FoodMonitor and the Compliance AI Reality Check

Cameras are easy. Audits are not. That is the useful irritation inside FoodMonitor: Benchmarking MLLMs for Explainable Compliance Analysis, a new benchmark for testing multimodal large language models on commercial-kitchen compliance monitoring.1 The paper is not asking whether a model can watch a kitchen video and say something vaguely sensible about hygiene. Many systems can now do that, at least with enough confidence to impress a demo audience and mildly alarm the legal department. ...

Relight at Your Own Risk: WildRelight and the Synthetic Vision Reality Check

Lighting is a cruel product demo. A relighting model can look impressive when the input is clean, the geometry is polite, the materials are obedient, and the benchmark has been assembled in the reassuringly sterile world of synthetic data. Then someone points it at a real outdoor scene: leaves moving in the wind, glass behaving like glass, the sun half-occluded by a branch, indirect light bouncing from surfaces nobody bothered to model, and the whole thing starts to look rather less like computational photography and rather more like a confident intern guessing where shadows should go. ...

Trust Issues, Benchmarked: Why Hallucination Detection Is a Portfolio Problem

Trust is a bad deployment strategy. That is not a moral statement. It is an operations statement. In most enterprise AI workflows, the uncomfortable question is not “Can the model answer?” The model will answer. Models are generous like that. The question is whether the organization has a reliable way to notice when the answer is unsupported, fabricated, overconfident, or merely polished nonsense wearing a tie. ...

Wrong on Purpose: FalsifyBench and the Agent Skill We Keep Forgetting

A good analyst should occasionally try to break their own idea. Not performatively. Not with a decorative “on the other hand” paragraph. Actually break it. Ask the kind of question that could make the current hypothesis collapse, then watch whether the evidence forces a better one. That simple discipline is the center of FalsifyBench: Evaluating Inductive Reasoning in LLMs with Rule Discovery Games, a new paper by Leonardo Bertolazzi, Katya Tentori, and Raffaella Bernardi.1 The paper is framed around scientific reasoning, but its practical message travels well beyond science. If an AI agent cannot test outside its own current belief, it may look careful while doing something much less impressive: confirming the first plausible story it invented. ...

Do the Math, Not the Mime: Why LLM Reasoning Needs a Verification Pipeline

Spreadsheet errors have a special talent: they look boring until they become expensive. That is the business version of the LLM math problem. A model can produce a calm, step-by-step explanation, put a confident number at the bottom, and still be wrong in the only place that matters. Worse, the reasoning may look plausible enough that a manager, analyst, tutor, or compliance reviewer nods and moves on. The answer has the rhythm of thinking. It has the costume of calculation. It may even have a chain-of-thought trace. Very civilized. Still not proof. ...

RAG’s Receipt Problem: When Correct Answers Don’t Prove Retrieval

Retrieval-augmented generation has become the respectable outfit enterprise AI wears when it wants to look grounded. Add a document store, retrieve a few passages, attach citations, and the answer suddenly appears more disciplined than a free-floating chatbot. That appearance is useful. It is not proof. ...

Search Me If You Can: Why AI Agent Discovery Needs Receipts

The AI agent market is beginning to look like an overconfident airport duty-free shop: everything claims to be premium, every label promises capability, and somehow the thing you need is still hard to find. That matters because the next phase of business automation will not be built from one general chatbot sitting politely in a browser tab. It will involve agent ecosystems: finance agents, customer-support agents, coding agents, compliance agents, research agents, scheduling agents, procurement agents, and a thousand microscopic “I can do that” assistants wrapped in glossy product pages. ...

Clawing Back the Benchmark: When AI Agents Start Testing Themselves

Tickets. That is where the future of AI agents becomes less theatrical and more irritatingly real. Not in a glossy demo where an agent books a holiday after three polite prompts, but in a helpdesk queue where it must read a ticket, check a knowledge base, update a CRM record, avoid leaking private data, recover from a failed API call, and still produce something a human manager can audit later. ...

Where to Go Deeper Beyond This Academy

A curated guide to textbooks, authors, websites, and papers for readers who want to study transformer internals, attention math, fine-tuning, GPU optimization, and benchmarking in more depth.