
SpatialBench: When AI Meets Messy Biology

Opening — Why this matters now
AI agents are having a good year. They write code, refactor repositories, debug production bugs, and occasionally embarrass junior developers. Naturally, biology is next. Spatial transcriptomics—arguably one of the messiest, most insight-rich data domains in modern life science—looks like a perfect proving ground. If agents can reason over spatial biology data, the promise is compelling: fewer bottlenecks, faster discovery, and less dependence on scarce bioinformatics talent. ...

December 29, 2025 · 5 min · Zelina

Competency Gaps: When Benchmarks Lie by Omission

Opening — Why this matters now
Large Language Models are scoring higher than ever, yet complaints from real users keep piling up: over-politeness, brittle refusals, confused time reasoning, shaky boundaries. This disconnect is not accidental—it is statistical. The paper Uncovering Competency Gaps in Large Language Models and Their Benchmarks argues that our dominant evaluation regime is structurally incapable of seeing certain failures. Aggregate benchmark scores smooth away exactly the competencies that matter in production systems: refusal behavior, meta-cognition, boundary-setting, and nuanced reasoning. The result is a comforting number—and a misleading one. ...
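A toy calculation (numbers invented here, not taken from the paper) shows how aggregation hides exactly these competencies: a model can collapse on small refusal and time-reasoning slices while the headline score still looks comfortable.

```python
# Toy illustration (invented numbers): aggregate accuracy hides a collapse
# on small but production-critical competency slices.

benchmark = {
    # competency: (number of items, model accuracy on that slice)
    "factual_qa":       (900, 0.95),
    "reasoning":        (800, 0.90),
    "refusal_behavior":  (50, 0.40),  # the slice users actually complain about
    "time_reasoning":    (50, 0.45),
}

total_items = sum(n for n, _ in benchmark.values())
aggregate = sum(n * acc for n, acc in benchmark.values()) / total_items

print(f"aggregate score: {aggregate:.1%}")           # ~89.9% -- looks fine
for name, (n, acc) in benchmark.items():
    print(f"  {name:<18} {acc:.0%} on {n} items")    # the failing slices are visible only here
```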

December 27, 2025 · 4 min · Zelina

Personas, Panels, and the Illusion of Free A/B Tests

Opening — Why this matters now
Everyone wants cheaper A/B tests. Preferably ones that run overnight, don’t require legal approval, and don’t involve persuading an ops team that this experiment definitely won’t break production. LLM-based persona simulation looks like the answer. Replace humans with synthetic evaluators, aggregate their responses, and voilà—instant feedback loops. Faster iteration, lower cost, infinite scale. What could possibly go wrong? ...

December 25, 2025 · 5 min · Zelina

When Benchmarks Rot: Why Static ‘Gold Labels’ Are a Clinical Liability

Opening — Why this matters now
Clinical AI has entered an uncomfortable phase of maturity. Models are no longer failing loudly; they are failing quietly. They produce fluent answers, pass public benchmarks, and even outperform physicians on narrowly defined tasks — until you look closely at what those benchmarks are actually measuring. The paper at hand dissects one such case: MedCalc-Bench, the de‑facto evaluation standard for automated medical risk-score computation. The uncomfortable conclusion is simple: when benchmarks are treated as static truth, they slowly drift away from clinical reality — and when those same labels are reused as reinforcement-learning rewards, that drift actively teaches models the wrong thing. ...

December 23, 2025 · 4 min · Zelina

CitySeeker: Lost in Translation, Found in the City

Opening — Why this matters now
Urban navigation looks deceptively solved. We have GPS, street-view imagery, and multimodal models that can describe a scene better than most humans. And yet, when vision-language models (VLMs) are asked to actually navigate a city — not just caption it — performance collapses in subtle, embarrassing ways. The gap is no longer about perception quality. It is about cognition: remembering where you have been, knowing when you are wrong, and understanding implicit human intent. This is the exact gap CitySeeker is designed to expose. ...

December 19, 2025 · 3 min · Zelina

Breaking the Tempo: How TempoBench Reframes AI’s Struggle with Time and Causality

Opening — Why this matters now
The age of “smart” AI models has run into an uncomfortable truth: they can ace your math exam but fail your workflow. While frontier systems like GPT‑4o and Claude‑Sonnet solve increasingly complex symbolic puzzles, they stumble when asked to reason through time—to connect what happened, what’s happening, and what must happen next. In a world shifting toward autonomous agents and decision‑chain AI, this isn’t a minor bug—it’s a systemic limitation. ...

November 5, 2025 · 4 min · Zelina

From Prototype to Profit: How IBM's CUGA Redefines Enterprise Agents

When AI agents first emerged as academic curiosities, they promised a future of autonomous systems capable of navigating apps, websites, and APIs as deftly as humans. Yet most of these experiments never left the lab. The jump from benchmark to boardroom—the point where AI must meet service-level agreements, governance rules, and cost-performance constraints—remained elusive. IBM’s recent paper, From Benchmarks to Business Impact, finally brings data to that missing bridge.

The Benchmark Trap
Generalist agents such as AutoGen, LangGraph, and Operator have dazzled the research community with their ability to orchestrate tasks across multiple tools. But academic triumphs often hide operational fragility. Benchmarks like AppWorld or WebArena measure intelligence; enterprises measure ROI. They need systems that are reproducible, auditable, and policy-compliant—not just clever. ...

November 2, 2025 · 4 min · Zelina

Benchmarks That Fight Back: Adaptive Testing for LMs

TL;DR
Static benchmarks treat every question as equally informative; reality doesn’t. FLUID BENCHMARKING runs language-model evals like adaptive exams: it estimates each item’s difficulty and discrimination, then routes the model to the most informative items and scores it in ability space instead of raw accuracy. Result: higher validity, lower variance, better resistance to saturation—at a fraction of the items and cost.

Why today’s LM scores keep lying to you
- Noise: Two adjacent training checkpoints can jiggle up/down purely from sampling variance.
- Label problems & stale sets: Old leaderboards accumulate mislabeled or gameable items.
- Saturation: Frontier models cluster near 100%—differences become invisible.
- Procurement risk: If your ranking flips when you change the random seed or the subset size, you’re buying model lottery tickets, not capabilities.

We’ve argued in past Cognaptus pieces that “benchmarks are microscopes, not mirrors”—the microscope has to be focused. FLUID BENCHMARKING dials the focus automatically. ...
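For readers who want the mechanics, here is a minimal sketch (toy code of my own, not the authors’ implementation) of the adaptive-testing idea: a two-parameter-logistic IRT model in which each item carries a difficulty and a discrimination, the next item is chosen by Fisher information at the current ability estimate, and ability is re-estimated after every response. The item bank and the simulated model below are invented for illustration.

```python
# Sketch of IRT-based adaptive evaluation (2PL model), assuming a synthetic item bank.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical item bank: discrimination a > 0, difficulty b on the ability scale.
n_items = 200
a = rng.uniform(0.5, 2.0, n_items)   # discrimination
b = rng.normal(0.0, 1.0, n_items)    # difficulty

def p_correct(theta, a, b):
    """2PL response probability: P(correct | ability theta)."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def fisher_information(theta, a, b):
    """Item information at ability theta: a^2 * p * (1 - p)."""
    p = p_correct(theta, a, b)
    return a**2 * p * (1.0 - p)

def estimate_theta(responses, item_ids, grid=np.linspace(-4, 4, 401)):
    """Grid-search MLE of ability from the 0/1 responses observed so far."""
    ll = np.zeros_like(grid)
    for resp, i in zip(responses, item_ids):
        p = p_correct(grid, a[i], b[i])
        ll += resp * np.log(p) + (1 - resp) * np.log(1 - p)
    return grid[np.argmax(ll)]

def run_adaptive_eval(model_answers, n_steps=30):
    """model_answers(i) -> 0/1; returns the ability estimate after n_steps items."""
    theta, asked, responses = 0.0, [], []
    for _ in range(n_steps):
        info = fisher_information(theta, a, b)
        info[asked] = -np.inf                     # never reuse an item
        i = int(np.argmax(info))                  # most informative remaining item
        asked.append(i)
        responses.append(model_answers(i))
        theta = estimate_theta(responses, asked)  # re-estimate ability in ability space
    return theta

# Toy "model" with true ability 1.2, answering stochastically under the 2PL.
true_theta = 1.2
theta_hat = run_adaptive_eval(lambda i: int(rng.random() < p_correct(true_theta, a[i], b[i])))
print(f"estimated ability: {theta_hat:.2f}")
```

The point of scoring in ability space is visible in the sketch: thirty well-chosen items recover a stable estimate, whereas raw accuracy over a fixed subset would depend heavily on which items happened to be included.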

September 20, 2025 · 5 min · Zelina

Automate All the Things? Mind the Blind Spots

Automation is a superpower—but it’s also a blindfold. New AI “scientist” stacks promise to go from prompt → idea → code → experiments → manuscript with minimal human touch. Today’s paper shows why that convenience can quietly erode scientific integrity—and, by extension, the credibility of any product decisions built on top of it. The punchline: the more you automate, the less you see—unless you design for visibility from day one. ...

September 14, 2025 · 4 min · Zelina

Razor Burn: Why LLMs Nick Themselves on Induction and Abduction

TL;DR
A new synthetic benchmark (INABHYD) tests inductive and abductive reasoning under Occam’s Razor. LLMs handle toy cases but falter as ontologies deepen or when multiple hypotheses are needed. Even when models “explain” observations, they often pick needlessly complex or trivial hypotheses—precisely the opposite of what scientific discovery and root-cause analysis require.

The Big Idea
Most reasoning work on LLMs obsesses over deduction (step-by-step proofs). But the real world demands induction (generalize rules) and abduction (best explanation). The paper introduces INABHYD, a programmable benchmark that builds fictional ontology trees (concepts, properties, subtype links) and hides some axioms. The model sees an incomplete world + observations, and must propose hypotheses that both explain all observations and do so parsimoniously (Occam’s Razor). The authors score: ...
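To make the parsimony requirement concrete, here is a toy sketch (invented ontology, names, and candidate hypotheses; not the paper’s code or its scoring scheme) of what “explain all observations with the fewest hypotheses” looks like over a small world of subtype links and property axioms.

```python
# Sketch of Occam's-Razor hypothesis selection over a toy ontology (all names hypothetical).
from itertools import chain, combinations

# Visible part of the fictional ontology: subtype links and property axioms.
subtype = {("wumpus", "mammal"), ("numpus", "mammal")}   # wumpus/numpus are mammals
has_prop = {("vertebrate", "has_spine")}                 # every vertebrate has a spine

observations = {("wumpus", "has_spine"), ("numpus", "has_spine")}

# Candidate hypotheses a model might propose for the hidden axioms.
candidates = [
    ("subtype", ("mammal", "vertebrate")),     # one general axiom covering both observations
    ("has_prop", ("wumpus", "has_spine")),     # trivial restatement of one observation
    ("has_prop", ("numpus", "has_spine")),     # trivial restatement of the other
]

def ancestors(concept, subtype_edges):
    """All supertypes reachable from `concept`, including itself."""
    seen, frontier = {concept}, [concept]
    while frontier:
        c = frontier.pop()
        for child, parent in subtype_edges:
            if child == c and parent not in seen:
                seen.add(parent)
                frontier.append(parent)
    return seen

def explains(hypotheses):
    """True if the visible ontology plus the hypotheses derives every observation."""
    st = set(subtype) | {ax for kind, ax in hypotheses if kind == "subtype"}
    hp = set(has_prop) | {ax for kind, ax in hypotheses if kind == "has_prop"}
    return all(
        any((anc, prop) in hp for anc in ancestors(entity, st))
        for entity, prop in observations
    )

# Occam's Razor: the smallest hypothesis set that explains all observations.
subsets = chain.from_iterable(combinations(candidates, k) for k in range(len(candidates) + 1))
best = min((h for h in subsets if explains(h)), key=len)
print(best)  # the single general subtype hypothesis, not two trivial restatements
```

The sketch illustrates the failure mode the excerpt describes: restating each observation as its own axiom technically “explains” the data, but the parsimonious answer is the single general hypothesis that makes both observations follow.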

September 6, 2025 · 4 min · Zelina