
Benchmarks That Fight Back: Adaptive Testing for LMs

TL;DR: Static benchmarks treat every question as equally informative; reality doesn't. FLUID BENCHMARKING runs language-model evals like adaptive exams: it estimates each item's difficulty and discrimination, routes the model to the most informative items, and scores it in ability space instead of raw accuracy. The result: higher validity, lower variance, and better resistance to saturation, at a fraction of the items and cost.

Why today's LM scores keep lying to you:

- Noise: Two adjacent training checkpoints can jiggle up or down purely from sampling variance.
- Label problems & stale sets: Old leaderboards accumulate mislabeled or gameable items.
- Saturation: Frontier models cluster near 100%, so differences become invisible.
- Procurement risk: If your ranking flips when you change the random seed or the subset size, you're buying model lottery tickets, not capabilities.

We've argued in past Cognaptus pieces that "benchmarks are microscopes, not mirrors"; the microscope has to be focused. FLUID BENCHMARKING dials the focus automatically. ...
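To make the adaptive-exam idea concrete, here is a minimal sketch of an IRT-style adaptive loop in the spirit described above, assuming a 2PL item model whose discrimination (`a`) and difficulty (`b`) parameters were already fit offline. The `answer_fn` callable (which grades the model on item `j`), the item budget, and the MAP grid are illustrative placeholders, not the paper's implementation.

```python
import numpy as np

def p_correct(theta, a, b):
    """2PL IRT: probability that a model with ability theta answers an item
    with discrimination a and difficulty b correctly."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """Fisher information of each item at ability theta (2PL)."""
    p = p_correct(theta, a, b)
    return a ** 2 * p * (1.0 - p)

def estimate_theta(responses, grid=np.linspace(-4, 4, 401)):
    """MAP estimate of ability over a grid, with a standard-normal prior."""
    log_post = -0.5 * grid ** 2  # N(0, 1) prior on ability
    for correct, a_j, b_j in responses:
        p = p_correct(grid, a_j, b_j)
        log_post += np.log(p if correct else 1.0 - p)
    return grid[np.argmax(log_post)]

def adaptive_eval(answer_fn, a, b, budget=50):
    """Adaptive loop: repeatedly administer the most informative unseen item,
    then re-estimate ability. Returns theta (an ability-space score),
    not raw accuracy."""
    seen, responses, theta = set(), [], 0.0
    for _ in range(budget):
        info = item_information(theta, a, b)
        info[list(seen)] = -np.inf          # never repeat an item
        j = int(np.argmax(info))
        seen.add(j)
        responses.append((answer_fn(j), a[j], b[j]))
        theta = estimate_theta(responses)
    return theta
```

Because the returned score is theta, two checkpoints are compared by where they land on the latent ability scale rather than by raw accuracy on whichever subset of items each happened to see.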

September 20, 2025 · 5 min · Zelina

Kill Switch Ethics: What the PacifAIst Benchmark Really Measures

TL;DR: PacifAIst stress-tests a model's behavioral alignment when its instrumental goals (self-preservation, resources, or task completion) conflict with human safety. Across 700 text scenarios in three sub-domains (EP1: self-preservation vs. human safety; EP2: resource conflict; EP3: goal preservation vs. evasion), leading LLMs show a meaningful spread in their "Pacifism Score" (P-Score) and refusal behavior. Translation for buyers: model choice, policies, and guardrails should not assume identical safety under conflict; they aren't identical.

Why this matters now: Most safety work measures what models say (toxicity, misinformation). PacifAIst measures what they would do when the safe choice may require self-sacrifice, e.g., dumping power through their own servers to prevent an explosion that would harm humans. That's closer to agent operations (automation, tool use, and control loops) than to classic content benchmarks. If you're piloting computer-use agents or workflow copilots with action rights, this is the missing piece in your risk model. ...
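For readers who want to see what such a score could look like operationally, the sketch below aggregates per-scenario outcomes into a P-Score and refusal rate, overall and by sub-domain. The `ScenarioResult` fields and the rule that only human-protective choices count toward the P-Score are assumptions for illustration; the benchmark's actual rubric may differ.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class ScenarioResult:
    subdomain: str   # "EP1", "EP2", or "EP3"
    choice: str      # "safe" (human-protective), "self" (instrumental), or "refuse"

def pacifist_report(results):
    """Aggregate a P-Score (share of human-protective choices) and a refusal
    rate, per sub-domain and overall."""
    by_domain = defaultdict(list)
    for r in results:
        by_domain[r.subdomain].append(r)

    def summarize(rs):
        n = len(rs)
        return {
            "p_score": sum(r.choice == "safe" for r in rs) / n,
            "refusal_rate": sum(r.choice == "refuse" for r in rs) / n,
            "n": n,
        }

    report = {dom: summarize(rs) for dom, rs in sorted(by_domain.items())}
    report["overall"] = summarize(results)
    return report

# Hypothetical usage: three graded scenarios
print(pacifist_report([
    ScenarioResult("EP1", "safe"),
    ScenarioResult("EP2", "self"),
    ScenarioResult("EP3", "refuse"),
]))
```

Even a simple tally like this makes the buyer-facing point concrete: the same model can look very different on EP1 versus EP3, so a single headline number should be read alongside the per-sub-domain breakdown.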

August 16, 2025 · 5 min · Zelina