
# Benchmarks That Fight Back: Adaptive Testing for LMs
**TL;DR:** Static benchmarks treat every question as equally informative; reality doesn’t. FLUID BENCHMARKING runs language-model evals like adaptive exams: it estimates each item’s difficulty and discrimination, then routes the model to the most informative items and scores it in ability space instead of raw accuracy. The result: higher validity, lower variance, and better resistance to saturation, at a fraction of the items and cost.

## Why today’s LM scores keep lying to you

- **Noise:** Two adjacent training checkpoints can jiggle up or down purely from sampling variance.
- **Label problems & stale sets:** Old leaderboards accumulate mislabeled or gameable items.
- **Saturation:** Frontier models cluster near 100%, so differences become invisible.
- **Procurement risk:** If your ranking flips when you change the random seed or the subset size, you’re buying model lottery tickets, not capabilities.

We’ve argued in past Cognaptus pieces that “benchmarks are microscopes, not mirrors”; the microscope has to be focused. FLUID BENCHMARKING dials the focus automatically.

...
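To make the TL;DR concrete, here is a minimal sketch of the item-response-theory loop that this style of adaptive testing relies on: give each item a discrimination `a` and a difficulty `b` (a 2PL model), repeatedly administer the not-yet-asked item with the highest Fisher information at the current ability estimate, and report the final ability estimate as the score. The function names, synthetic item parameters, and grid-based MAP update below are illustrative assumptions, not the FLUID BENCHMARKING implementation.

```python
# Sketch of 2PL adaptive item selection; illustrative, not the paper's code.
import math
import random

def p_correct(theta, a, b):
    """2PL probability that a model with ability theta answers an item
    with discrimination a and difficulty b correctly."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def fisher_information(theta, a, b):
    """Fisher information of a 2PL item at ability theta."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)

def estimate_theta(responses):
    """MAP estimate of ability on a coarse grid with a standard-normal prior.
    `responses` is a list of (a, b, correct) triples."""
    grid = [i / 10.0 for i in range(-40, 41)]  # theta in [-4, 4]
    best_theta, best_ll = 0.0, -float("inf")
    for theta in grid:
        ll = -0.5 * theta * theta  # log prior (up to a constant)
        for a, b, correct in responses:
            p = p_correct(theta, a, b)
            ll += math.log(p if correct else 1.0 - p)
        if ll > best_ll:
            best_theta, best_ll = theta, ll
    return best_theta

def adaptive_eval(items, answer_fn, budget=50):
    """Administer up to `budget` items, always choosing the unasked item
    that is most informative at the current ability estimate.
    `items` maps item_id -> (a, b); `answer_fn(item_id)` returns True/False.
    Returns the final ability estimate (the score, in ability space)."""
    theta, responses, remaining = 0.0, [], set(items)
    for _ in range(min(budget, len(items))):
        item_id = max(remaining,
                      key=lambda i: fisher_information(theta, *items[i]))
        remaining.discard(item_id)
        a, b = items[item_id]
        responses.append((a, b, answer_fn(item_id)))
        theta = estimate_theta(responses)
    return theta

if __name__ == "__main__":
    # Toy demo: 500 synthetic items and a simulated "model" with true ability 1.2.
    random.seed(0)
    items = {i: (random.uniform(0.5, 2.0), random.gauss(0.0, 1.0))
             for i in range(500)}
    true_theta = 1.2
    answer = lambda i: random.random() < p_correct(true_theta, *items[i])
    print("estimated ability:", adaptive_eval(items, answer, budget=40))
```

The grid-based MAP update keeps the sketch dependency-free; a production setup would first fit the item parameters from historical model responses and use a proper optimizer for the ability estimate.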