Benchmarks

Wheel Smarts > Wheel Reinvention: What GitTaskBench Really Measures

TL;DR for operators: GitTaskBench is useful because it evaluates code agents where enterprise automation usually breaks: not in a clean coding puzzle, but inside an existing repository with dependencies, pretrained weights, fragile instructions, file formats, runtime constraints, and a user asking for a finished output. The paper’s headline is not “agents can code”. We have enough confetti for that parade. The sharper finding is that agents are still inconsistent at the whole delivery chain. The best reported combination, OpenHands with Claude 3.7, reaches 72.22% execution completion but only 48.15% task pass rate. In other words, many runs produce something executable, but far fewer produce something good enough. ...
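
For readers who want the arithmetic behind that last sentence, here is a minimal sketch using only the two percentages quoted above. It assumes that every passing run is also counted as an executed run; the teaser does not say so explicitly, so treat the ratio as indicative rather than as a figure from the paper. The variable names are illustrative.

```python
# Figures quoted in the teaser for OpenHands with Claude 3.7.
execution_completion = 72.22  # % of runs that executed to completion
task_pass_rate = 48.15        # % of runs whose output passed the task check

# Percentage-point gap between "it ran" and "it passed".
gap_points = round(execution_completion - task_pass_rate, 2)

# ASSUMPTION: passing runs are a subset of executed runs, so this ratio
# can be read as "pass rate among runs that executed".
pass_given_executed = task_pass_rate / execution_completion

print(f"gap: {gap_points} percentage points")
print(f"pass rate among executed runs: {pass_given_executed:.1%}")
```

Under that assumption, roughly two in three executed runs also pass, which leaves about one in three that run to completion and still miss the bar.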

Crystal Ball, Meet Cron Job: What FutureX Reveals About ‘Live’ Forecasting Agents

TL;DR for operators: FutureX is less interesting as a leaderboard and more interesting as an operating model for evaluating AI agents that claim to forecast the future. The benchmark runs a live loop: collect future-facing questions from curated web sources, ask agents to predict before the answer exists, wait for resolution, crawl the answer, and score the prior prediction. That matters because most “forecasting” evaluations are either historical backtests with leakage risk or static datasets quietly ageing into trivia. ...
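
The live loop described above can be sketched in a few lines. This is an illustration of the idea, not FutureX's implementation: the `Forecast` record, the `is_leak_free` check, and exact-match accuracy as the score are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Forecast:
    """One agent prediction about a question that resolves later (hypothetical record)."""
    question: str
    predicted: str
    predicted_at: datetime
    resolves_at: datetime
    outcome: Optional[str] = None  # filled in once the answer has been crawled


def is_leak_free(f: Forecast) -> bool:
    # A prediction only counts if it was made before the answer existed.
    return f.predicted_at < f.resolves_at


def accuracy(forecasts: list[Forecast]) -> Optional[float]:
    # Score only resolved, leak-free forecasts.
    # ASSUMPTION: exact-match accuracy stands in for the real scoring rule.
    scored = [f for f in forecasts if f.outcome is not None and is_leak_free(f)]
    if not scored:
        return None  # nothing has resolved yet
    return sum(f.predicted == f.outcome for f in scored) / len(scored)
```

The design point is the timestamp check: a forecast made after resolution is dropped rather than scored, which is what separates a live evaluation from a backtest with leakage risk.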

Bias in the Warehouse: What AIM-Bench Reveals About Agentic LLMs

TL;DR for operators: AIM-Bench is not another “which model is smartest?” leaderboard. It is a warehouse stress test for agentic LLMs asked to make replenishment decisions under uncertainty. The useful lesson is uncomfortable: inventory agents can look mathematically fluent while still behaving like biased managers. Most evaluated models show mean anchoring in the newsvendor task. All evaluated models show bullwhip amplification in the Beer Game. Some models over-order to avoid stockouts; others keep leaner inventory but accept higher shortage risk. In other words, the operational personality of the model matters. ...

Reasoning with Both Eyes Open: Why Multimodal Chain-of-Thought Still Trips Up LLMs

TL;DR for operators: Multimodal chain-of-thought is not automatically “reasoning with images.” In many systems, it is still text reasoning with an image attached for moral support. That is a problem for any business process where the model must inspect a document, chart, screen, medical image, product photo, map, or operational scene and then make several dependent inferences. ...

Beyond Words: Teaching AI to See and Fix Charts with ChartM3

TL;DR for operators: ChartM3 is useful because it reframes chart editing as a four-step control problem: identify the visual target, connect that target to code, apply the edit, and avoid damaging everything else. That sounds obvious until one watches a multimodal model obediently edit the wrong pie slice with great confidence. A familiar little tragedy, now with bounding boxes. ...

The User Is Present: Why Smart Agents Still Don't Get You

TL;DR for operators: Most agent demos show the easy part: the model calls a tool, gets results, and returns something plausible. The harder part is less cinematic. The user starts with an incomplete request, reveals constraints in fragments, phrases preferences indirectly, changes emphasis mid-conversation, and expects the system to somehow keep up. This is where many supposedly “smart” agents begin to look less like assistants and more like interns with excellent API access. ...

The Two Minds of Finance: Testing LLMs for Divergence and Discipline

TL;DR for operators: Finance teams do not ask AI systems to do one kind of thinking. They ask them to imagine plausible futures, extract investable implications, choose between similar explanations, and avoid being seduced by the prettiest narrative. Those are not the same task. A model can be fluent, plausible, and still strategically dull. Finance has a long tradition of rewarding that, but we do not need to automate the habit. ...

Red Flag on the Track: Why LLMs Still Struggle with Real Algorithmic Reasoning

TL;DR for operators: FormulaOne is a useful red flag because it tests something many businesses quietly assume LLMs already possess: the ability to design deep algorithms, not merely write plausible code around familiar patterns. The benchmark contains 120 hard dynamic-programming problems on tree-like graphs, plus 100 easier FormulaOne-Warmup problems. The hard tasks are generated from Monadic Second-Order logic, come with verifiable evaluation, and sit near the kind of combinatorial reasoning used in routing, scheduling, network design and other optimisation-heavy domains. ...

Beyond Stack Overflow: CodeAssistBench Exposes the Real Gaps in LLM Coding Help

TL;DR for operators: Coding assistants look much better when the task is a clean question than when the task is a messy software support conversation. That is the inconvenient point of CodeAssistBench, or CAB, a benchmark that turns resolved GitHub issues into multi-turn, project-grounded conversations where a model must behave like a maintainer, not a code-snippet vending machine. ...

Memory Games: The Data Contamination Crisis in Reinforcement Learning

TL;DR for operators: A model that improves after training on random rewards has not necessarily discovered a secret route to reasoning. It may simply be remembering the exam. The paper behind this article investigates a strange result in reinforcement learning for large language models: Qwen2.5 models appeared to improve on public math benchmarks even when the reward signal was random, inverted, or based on wrong majority-voted answers. That sounds exciting, in the same way that a finance team “beating forecast” after seeing next quarter’s numbers is exciting. Technically impressive, commercially dangerous, and not something one should build governance around. ...