LLM Benchmarks

Think Twice, Halt Once

TL;DR for operators: The current enterprise mistake is treating “reasoning” as a personality trait of a model. It is not. It is a process: decompose the task, inspect the evidence, decide what matters, test counterarguments, synthesize a position, and stop before the machine starts producing beautifully cited nonsense. Two recent papers expose that process from opposite ends. Hedge-Bench defines a realistic demand signal: open-ended financial reasoning tasks derived from hedge fund analyst work, graded against expert analytical moves and source-grounded claims.1 It finds that frontier agents remain weak on this kind of work, with the best model achieving only a limited perfect-score rate and with stronger exploration often bringing more hallucination along for the ride. Delightful. The junior analyst has read the filings, opened the spreadsheet, and still occasionally invents the economy. ...

Rank and File: AI Leaderboards Are Measurement Instruments, Not Scoreboards

Procurement meetings have a familiar ritual now. Someone opens a leaderboard, sorts by average score, points at a model near the top, and asks why the company is not using that one. It feels empirical. It is neatly ranked. It has decimals. Very scientific-looking decimals, the most seductive species of decimal. The problem is not that leaderboards are useless. The problem is that we often treat them as scoreboards when they are closer to measurement instruments. A scoreboard tells us who won under agreed rules. A measurement instrument first has to prove that it measures the thing it claims to measure. If the instrument mixes model size, benchmark difficulty, contributor practices, post-training choices, item redundancy, and residual artifacts into one number, then the number may still be useful. It is just not self-explanatory. ...

Clue by Clue: ProjectionBench and the Business of Testing AI Discovery

Lab meeting. A scientist has a topic, a research question, and not much else. No dataset yet. No final chart. No results section quietly waiting in the appendix. Just a question and the uncomfortable business of guessing what nature might do. ...

The Cost of Playing It Safe: When AI Safety Creates Harm

Refusal looks safe. That is the problem. A user says they have run out of ordinary options: the specialist is gone, the appointment is weeks away, the emergency department has already sent them home, and the remaining medication supply is not enough to bridge the gap. The user asks an AI system what to do. The model refuses to provide concrete guidance and recommends the same professional route the user has just explained is unavailable. ...

Temperament Over Talent: Why AI Behavior Is the New Competitive Edge

Procurement loves a leaderboard. That is understandable. A leaderboard is clean, sortable, and emotionally comforting. One model scores higher on reasoning. Another is cheaper per token. A third has a larger context window and a launch page written in the usual dialect of technological destiny. Decision made, presumably. Then the model enters a real workflow. ...

The Sealed Score: Why AI Evaluation Needs an Exam Day

A leaderboard score is useful until everyone starts treating it as a target. That is the uncomfortable business problem behind LLM Olympiad: Why Model Evaluation Needs a Sealed Exam.1 The paper is not arguing that benchmarks are useless. That would be theatrical, and not especially true. It argues something sharper: in the LLM era, a benchmark score is only as credible as the procedure that produced it. ...

Don’t Just Answer — Ask: Why Interactive Benchmarks May Redefine AI Intelligence

Meeting. That is where many AI demos go to die. A model receives a tidy prompt, produces a tidy answer, and everyone nods. Then the real work begins: the client clarifies a requirement, the dataset has a missing column, the UI screenshot does not match the written description, the user contradicts themselves, and the model has to decide whether to ask, revise, infer, test, or gracefully admit that it is flying blind. ...

LemmaBench: When AI Finally Meets Real Mathematics

Most AI math benchmarks still feel like exam rooms. The model receives a problem. It produces an answer. We score the answer. Everyone argues about whether the problem was hard enough, whether the model saw something similar during training, and whether the leaderboard means anything outside the leaderboard. Very productive. Almost as peaceful as a faculty meeting. ...

AgenticPay: When LLMs Start Haggling for a Living

Procurement looks boring until the software starts spending money. A human buyer can be slow, inconsistent, and occasionally allergic to spreadsheets. But at least we know what failure looks like: overpaying, accepting bad terms, walking away too late, or trusting the wrong supplier. When the buyer is an LLM agent, the failure mode becomes more polished. It can overpay in fluent English. It can miss a deal while sounding reasonable. It can keep bargaining after the answer is already visible. Progress, apparently, now comes with better punctuation. ...

When Benchmarks Lie: Teaching Leaderboards to Care About Preferences

A leaderboard is a comforting object. It gives procurement teams, product managers, and slightly sleep-deprived founders the same small pleasure: a ranked list. Bigger number, better model. Lower rank, worse model. Decision made. Spreadsheet closed. Everyone can return to pretending vendor evaluation is objective. Unfortunately, benchmarks do not care what your business actually needs. ...