
SpatialBench: When AI Meets Messy Biology

Opening — Why this matters now

AI agents are having a good year. They write code, refactor repositories, debug production incidents, and occasionally embarrass junior developers. Naturally, biology is next. Spatial transcriptomics—arguably one of the messiest, most insight-rich data domains in modern life science—looks like a perfect proving ground. If agents can reason over spatial biology data, the promise is compelling: fewer bottlenecks, faster discovery, and less dependence on scarce bioinformatics talent. ...

December 29, 2025 · 5 min · Zelina

Competency Gaps: When Benchmarks Lie by Omission

Opening — Why this matters now

Large Language Models are scoring higher than ever, yet complaints from real users keep piling up: over-politeness, brittle refusals, confused time reasoning, shaky boundaries. This disconnect is not accidental—it is statistical. The paper Uncovering Competency Gaps in Large Language Models and Their Benchmarks argues that our dominant evaluation regime is structurally incapable of seeing certain failures. Aggregate benchmark scores smooth away exactly the competencies that matter in production systems: refusal behavior, meta-cognition, boundary-setting, and nuanced reasoning. The result is a comforting number—and a misleading one. ...
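To see how an aggregate hides a gap, here is a minimal numeric sketch; the competency names and counts are invented for illustration, not taken from the paper.

```python
# Hypothetical illustration: an aggregate benchmark score hides a
# collapsed competency because the broken category is a small slice
# of the item pool. All numbers are made up.
scores = {
    "factual_recall":   {"n": 800, "correct": 760},  # 95%
    "code_generation":  {"n": 150, "correct": 129},  # 86%
    "refusal_behavior": {"n": 50,  "correct": 11},   # 22% -- broken
}

total_n = sum(s["n"] for s in scores.values())
total_correct = sum(s["correct"] for s in scores.values())
print(f"aggregate: {total_correct / total_n:.1%}")  # 90.0% looks fine

for name, s in scores.items():
    print(f"{name:>16}: {s['correct'] / s['n']:.1%}")
```

Because refusal items make up only 5% of this pool, a 22% pass rate there barely dents the 90% headline number.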

December 27, 2025 · 4 min · Zelina

Benchmarks That Fight Back: Adaptive Testing for LMs

TL;DR

Static benchmarks treat every question as equally informative; reality doesn’t. FLUID BENCHMARKING runs language-model evals like adaptive exams: it estimates each item’s difficulty and discrimination, then routes the model to the most informative items and scores it in ability space instead of raw accuracy. Result: higher validity, lower variance, better resistance to saturation—at a fraction of the items and cost.

Why today’s LM scores keep lying to you

- Noise: Two adjacent training checkpoints can jiggle up/down purely from sampling variance.
- Label problems & stale sets: Old leaderboards accumulate mislabeled or gameable items.
- Saturation: Frontier models cluster near 100%—differences become invisible.
- Procurement risk: If your ranking flips when you change the random seed or the subset size, you’re buying model lottery tickets, not capabilities.

We’ve argued in past Cognaptus pieces that “benchmarks are microscopes, not mirrors”—the microscope has to be focused. FLUID BENCHMARKING dials the focus automatically. ...
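Under the hood this is classic item response theory. The sketch below shows the 2PL mechanics the approach builds on: the probability that a model of a given ability answers an item, the Fisher information each item carries, and greedy routing to the most informative unasked item. The item parameters, the stubbed response, and the one-step ability update are illustrative assumptions; FLUID BENCHMARKING fits parameters from real model responses and uses a proper ability estimator.

```python
import math

# Sketch of IRT-style adaptive item selection (2PL model).
# Item parameters below are invented for illustration.
items = [
    # (item_id, discrimination a, difficulty b)
    ("q1", 1.8, -1.0),
    ("q2", 0.6,  0.0),
    ("q3", 2.0,  0.5),
    ("q4", 1.2,  2.0),
]

def p_correct(theta, a, b):
    """2PL probability that a model with ability theta answers correctly."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def fisher_info(theta, a, b):
    """Fisher information one item contributes at ability theta."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)

def next_item(theta, asked):
    """Route to the most informative unasked item at the current estimate."""
    remaining = [it for it in items if it[0] not in asked]
    return max(remaining, key=lambda it: fisher_info(theta, it[1], it[2]))

theta_hat, asked = 0.0, set()
for step in range(3):
    qid, a, b = next_item(theta_hat, asked)
    asked.add(qid)
    response = 1  # stub: 1 if the model got the item right, else 0
    # one crude gradient step on the 2PL log-likelihood, not a full MLE fit
    theta_hat += 0.5 * a * (response - p_correct(theta_hat, a, b))
    print(step, qid, round(theta_hat, 3))
```

Scoring in ability space (theta) rather than raw accuracy is what resists saturation: two near-100% models still separate on the few high-difficulty, high-discrimination items the router keeps serving.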

September 20, 2025 · 5 min · Zelina

Confidence, Not Confidence Tricks: Statistical Guardrails for Generative AI

Generative AI still ships answers without warranties. Edgar Dobriban’s new review, “Statistical Methods in Generative AI,” argues that classical statistics is the fastest route to reliability—especially under black‑box access. It maps four leverage points: (1) changing model behavior with guarantees, (2) quantifying uncertainty, (3) evaluating models under small data and leakage risk, and (4) intervening and experimenting to probe mechanisms.

The executive takeaway

If you manage LLM products, your reliability roadmap isn’t just RLHF and prompt magic—it’s quantiles, confidence intervals, calibration curves, and causal interventions. Wrap these around any model (open or closed) to control refusal rates, surface uncertainty that matters, and measure performance credibly when eval budgets are tight. ...
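As a concrete instance of leverage point (3), measuring credibly when eval budgets are tight, here is a minimal sketch of a Wilson score interval around a pass rate; the 42-of-50 result is hypothetical.

```python
import math

def wilson_interval(correct, n, z=1.96):
    """95% Wilson score interval for a binomial pass rate (no dependencies)."""
    p = correct / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# Small eval budget: the model passed 42 of 50 tasks.
lo, hi = wilson_interval(42, 50)
print(f"point estimate 84.0%, 95% CI [{lo:.1%}, {hi:.1%}]")
```

At n = 50 the interval spans roughly 71% to 92%, far too wide to separate two models whose point estimates differ by a few percentage points.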

September 13, 2025 · 5 min · Zelina

Fair or Foul? How LLMs ‘Appraise’ Emotions

Most AI conversations equate “emotional intelligence” with sentiment labels. Humans don’t work that way. We appraise situations—Was it fair? Could I control it? How much effort will this take?—and then feel. This study puts that lens on large language models and asks a sharper question: Do LLMs reason about emotions through cognitive appraisals, and are those appraisals human‑plausible?

What CoRE Actually Measures (and Why It’s Different)

CoRE — Cognitive Reasoning for Emotions evaluates seven LLMs across: ...

August 11, 2025 · 4 min · Zelina

The Diligent but Brittle Student Inside Every LLM

If you put a large language model in a classroom for a year, what kind of student would it become? According to Simulating Human-Like Learning Dynamics with LLM-Empowered Agents, the answer isn’t flattering: most base LLMs act like “diligent but brittle surface learners”—hardworking, seemingly capable, but unable to generalize deeply.

From Psych Lab to AI Lab

Educational psychology has spent decades classifying learners into profiles like deep learners (intrinsically motivated, reflective, conceptual) and surface learners (extrinsically motivated, test-oriented, shortcut-prone). The authors built LearnerAgent, a multi-agent framework grounded in these theories, and dropped four AI ‘students’ into a simulated high school English class: ...

August 8, 2025 · 3 min · Zelina

Homo Silicus Goes to Wall Street

As AI systems step into the boardroom and brokerage app, a new question arises: How do they think about money? In a world increasingly shaped by large language models (LLMs) not just answering questions but making decisions, we need to ask not just whether AI is accurate—but what kind of financial reasoner it is. A recent study by Orhan Erdem and Ragavi Pobbathi Ashok tackles this question head-on by comparing the decision-making profiles of seven LLMs—including GPT-4, DeepSeek R1, and Gemini 2.0—with those of humans across 53 countries. The result? LLMs consistently exhibit a style of reasoning distinct from human respondents—and most similar to Tanzanian participants. Not American, not German. Tanzanian. That finding, while seemingly odd, opens a portal into deeper truths about how these models internalize financial logic. ...

July 16, 2025 · 4 min · Zelina

The First Hurdle: Why Coding Agents Struggle with Setup

In the race to build autonomous software engineers, large language model (LLM) agents like Devin and Copilot Chat are lauded for fixing bugs, writing code, and even completing tasks from GitHub issues. But what happens when the code doesn’t even run? That’s the uncomfortable gap SetupBench aims to measure—and the results are sobering. SetupBench introduces a 93-task benchmark evaluating a foundational but under-tested skill: bootstrapping a development environment from scratch. Unlike prior benchmarks that hand agents a fully pre-configured Docker container, SetupBench drops them into a barebones Linux sandbox and challenges them to install dependencies, initialize databases, configure background services, and resolve real-world version conflicts. It sounds simple. It isn’t. ...
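The scoring idea behind a setup benchmark is easy to picture: a task counts as solved only if a validation command runs cleanly in the sandbox after the agent finishes. Below is a hypothetical harness in that spirit; the task fields, the flask import check, and the timeout are invented for illustration, not SetupBench's actual format.

```python
import subprocess

# Hypothetical sketch of scoring a setup task: run a validation command
# that only succeeds if the environment was actually bootstrapped.
task = {
    "setup_goal": "install Python dependencies for the web service",
    "validation_cmd": ["python", "-c", "import flask"],  # illustrative check
    "timeout_s": 60,
}

def score_task(task):
    """Return True iff the validation command exits cleanly in time."""
    try:
        result = subprocess.run(
            task["validation_cmd"],
            capture_output=True,
            timeout=task["timeout_s"],
        )
        return result.returncode == 0
    except (subprocess.TimeoutExpired, FileNotFoundError):
        return False

print("setup succeeded:", score_task(task))
```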

July 15, 2025 · 4 min · Zelina

Echo Chamber in a Prompt: How Survey Bias Creeps into LLMs

Large Language Models (LLMs) are increasingly deployed as synthetic survey respondents in social science and policy research. But a new paper by Rupprecht, Ahnert, and Strohmaier raises a sobering question: are these AI “participants” reliable, or are we just recreating human bias in silicon form? By subjecting nine LLMs—including Gemini, Llama-3 variants, Phi-3.5, and Qwen—to over 167,000 simulated interviews from the World Values Survey, the authors expose a striking vulnerability: even state-of-the-art LLMs consistently fall for classic survey biases—especially recency bias. ...
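A recency-bias probe of this kind is simple to reproduce. The sketch below states its assumptions up front: ask_llm is a stand-in for a real model call (stubbed here with a deliberate bias so the script shows the effect), and the trust item paraphrases a classic World Values Survey question.

```python
import random

def ask_llm(question, options):
    # Stub with a built-in recency bias, standing in for a real model call.
    return options[-1] if random.random() < 0.7 else random.choice(options)

question = "Generally speaking, would you say most people can be trusted?"
options = ["Most people can be trusted", "You need to be very careful"]

trials, last_option_picks = 1000, 0
for _ in range(trials):
    order = random.sample(options, k=len(options))  # randomize option order
    if ask_llm(question, order) == order[-1]:
        last_option_picks += 1

# An unbiased respondent should land near 50% with two options.
print(f"picked the last-listed option {last_option_picks / trials:.1%} of the time")
```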

July 11, 2025 · 3 min · Zelina

Mind Games for Machines: How Decrypto Reveals the Hidden Gaps in AI Reasoning

As large language models (LLMs) evolve from mere tools into interactive agents, they are increasingly expected to operate in multi-agent environments—collaborating, competing, and communicating not just with humans but with each other. But can they understand the beliefs, intentions, and misunderstandings of others? Welcome to the world of Theory of Mind (ToM)—and the cleverest AI benchmark you haven’t heard of: Decrypto.

Cracking the Code: What is Decrypto?

Inspired by the award-winning board game of the same name, Decrypto is a three-player game of secret codes and subtle hints, reimagined as a benchmark to test LLMs’ ability to coordinate and deceive. Each game features: ...
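For readers who have not played the board game: each round an encoder holding four secret keywords receives a three-digit code, an ordered pick of three of the positions 1–4, and must hint at it so a teammate decodes it while an eavesdropping opponent cannot. Here is a minimal sketch of that round structure, based on the game's public rules rather than the paper's exact environment.

```python
import random

# One Decrypto round, following the board game's public rules.
# Keywords and hints are illustrative placeholders.
KEYWORDS = ["ocean", "violin", "desert", "rocket"]  # secret, positions 1-4

HINT_BANK = {  # stand-in for a human or LLM encoder's word choices
    "ocean": "waves", "violin": "strings",
    "desert": "sand", "rocket": "launch",
}

def draw_code():
    """A code is an ordered draw of 3 distinct positions from 1-4."""
    return random.sample(range(1, 5), k=3)

def give_hints(code):
    """Hint at each coded keyword in order: clear to a teammate who knows
    the keywords, opaque to an eavesdropper who only sees the hints."""
    return [HINT_BANK[KEYWORDS[pos - 1]] for pos in code]

code = draw_code()
print("secret code:", code)
print("public hints:", give_hints(code))
```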

June 26, 2025 · 4 min · Zelina