
The First Hurdle: Why Coding Agents Struggle with Setup

In the race to build autonomous software engineers, large language model (LLM) agents like Devin and Copilot Chat are lauded for fixing bugs, writing code, and even completing tasks from GitHub issues. But what happens when the code doesn’t even run? That’s the uncomfortable gap SetupBench aims to measure—and the results are sobering. SetupBench introduces a 93-task benchmark evaluating a foundational but under-tested skill: bootstrapping a development environment from scratch. Unlike prior benchmarks that hand agents a fully pre-configured Docker container, SetupBench drops them into a barebones Linux sandbox and challenges them to install dependencies, initialize databases, configure background services, and resolve real-world version conflicts. It sounds simple. It isn’t. ...

July 15, 2025 · 4 min · Zelina

Echo Chamber in a Prompt: How Survey Bias Creeps into LLMs

Large Language Models (LLMs) are increasingly deployed as synthetic survey respondents in social science and policy research. But a new paper by Rupprecht, Ahnert, and Strohmaier raises a sobering question: are these AI “participants” reliable, or are we just recreating human bias in silicon form? By subjecting nine LLMs—including Gemini, Llama-3 variants, Phi-3.5, and Qwen—to over 167,000 simulated interviews from the World Values Survey, the authors expose a striking vulnerability: even state-of-the-art LLMs consistently fall for classic survey biases—especially recency bias. ...

July 11, 2025 · 3 min · Zelina

Mind Games for Machines: How Decrypto Reveals the Hidden Gaps in AI Reasoning

As large language models (LLMs) evolve from mere tools into interactive agents, they are increasingly expected to operate in multi-agent environments: collaborating, competing, and communicating not just with humans but with each other. But can they understand the beliefs, intentions, and misunderstandings of others? Welcome to the world of Theory of Mind (ToM), and the cleverest AI benchmark you haven’t heard of: Decrypto.

Cracking the Code: What is Decrypto?

Inspired by the award-winning board game of the same name, Decrypto is a three-player game of secret codes and subtle hints, reimagined as a benchmark to test LLMs’ ability to coordinate and deceive. Each game features: ...

June 26, 2025 · 4 min · Zelina

Raising the Bar: Why AI Competitions Are the New Benchmark Battleground

In the rapidly evolving landscape of Generative AI (GenAI), we’ve long relied on static benchmarks—standardized datasets and evaluations—to gauge model performance. But what if the very foundation we’re building our trust upon is fundamentally shaky? Static benchmarks often rely on IID (independent and identically distributed) assumptions, where training and test data come from the same statistical distribution. In such a setting, a model achieving high accuracy might simply be interpolating seen patterns rather than truly generalizing. For example, in language modeling, a model might “memorize” dataset-specific templates without capturing transferable reasoning patterns. ...

May 3, 2025 · 3 min

Unchained Distortions: Why Step-by-Step Image Editing Breaks Down While Chain-of-Thought Shines

When large language models (LLMs) learned to think step-by-step, the world took notice. Chain-of-Thought (CoT) reasoning breathed new life into multi-step arithmetic, logic, and even moral decision-making. But as multimodal AI evolved, researchers tried to bring this paradigm into the visual world by editing images step-by-step instead of all at once. And it failed. In the recent benchmark study Complex-Edit: CoT-Like Instruction Generation for Complexity-Controllable Image Editing Benchmark, the authors show that CoT-style image editing, which they call sequential editing, not only fails to improve results but often worsens them. Compared to applying a single, complex instruction all at once, breaking it into sub-instructions causes notable drops in instruction-following, identity preservation, and perceptual quality. ...

April 21, 2025 · 5 min

Judge, Jury, and GPT: Bringing Courtroom Rigor to Business Automation

In the high-stakes world of business process automation (BPA), it’s not enough for AI agents to just complete tasks; they need to complete them correctly, consistently, and transparently. At Cognaptus, we believe in treating automation with the same scrutiny you’d expect from a court of law. That’s why we’re introducing CognaptusJudge, our novel framework for evaluating business automation, inspired by cutting-edge research in LLM-powered web agents.

⚖️ Inspired by Online-Mind2Web

Earlier this year, a research team from OSU and UC Berkeley published a benchmark titled An Illusion of Progress? Assessing the Current State of Web Agents (arXiv:2504.01382). Their findings? Many agents previously hailed as top performers were failing nearly 70% of tasks when evaluated under more realistic, human-aligned conditions. ...

April 4, 2025 · 3 min