
Fair or Foul? How LLMs ‘Appraise’ Emotions

Most AI conversations equate “emotional intelligence” with sentiment labels. Humans don’t work that way. We appraise situations—Was it fair? Could I control it? How much effort will this take?—and then feel. This study puts that lens on large language models and asks a sharper question: Do LLMs reason about emotions through cognitive appraisals, and are those appraisals human‑plausible?

What CoRE Actually Measures (and Why It’s Different)

CoRE (Cognitive Reasoning for Emotions) evaluates seven LLMs across: ...

August 11, 2025 · 4 min · Zelina

The Diligent but Brittle Student Inside Every LLM

If you put a large language model in a classroom for a year, what kind of student would it become? According to Simulating Human-Like Learning Dynamics with LLM-Empowered Agents, the answer isn’t flattering: most base LLMs act like “diligent but brittle surface learners”—hardworking, seemingly capable, but unable to generalize deeply.

From Psych Lab to AI Lab

Educational psychology has spent decades classifying learners into profiles like deep learners (intrinsically motivated, reflective, conceptual) and surface learners (extrinsically motivated, test-oriented, shortcut-prone). The authors built LearnerAgent, a multi-agent framework grounded in these theories, and dropped four AI ‘students’ into a simulated high school English class: ...

August 8, 2025 · 3 min · Zelina

Homo Silicus Goes to Wall Street

As AI systems step into the boardroom and the brokerage app, a new question arises: How do they think about money? In a world increasingly shaped by large language models (LLMs) that not only answer questions but make decisions, we need to ask not just whether AI is accurate, but what kind of financial reasoner it is. A recent study by Orhan Erdem and Ragavi Pobbathi Ashok tackles this question head-on, comparing the decision-making profiles of seven LLMs—including GPT-4, DeepSeek R1, and Gemini 2.0—with those of human respondents across 53 countries. The result? LLMs consistently exhibit a style of reasoning distinct from human respondents—and most similar to Tanzanian participants. Not American, not German. Tanzanian. That finding, while seemingly odd, opens a portal into deeper truths about how these models internalize financial logic. ...

July 16, 2025 · 4 min · Zelina

The First Hurdle: Why Coding Agents Struggle with Setup

In the race to build autonomous software engineers, large language model (LLM) agents like Devin and Copilot Chat are lauded for fixing bugs, writing code, and even completing tasks from GitHub issues. But what happens when the code doesn’t even run? That’s the uncomfortable gap SetupBench aims to measure—and the results are sobering. SetupBench is a 93-task benchmark evaluating a foundational but under-tested skill: bootstrapping a development environment from scratch. Unlike prior benchmarks that hand agents a fully pre-configured Docker container, SetupBench drops them into a barebones Linux sandbox and challenges them to install dependencies, initialize databases, configure background services, and resolve real-world version conflicts. It sounds simple. It isn’t. ...

July 15, 2025 · 4 min · Zelina
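To make the bootstrapping idea above concrete, here is a minimal sketch of how a SetupBench-style task could be checked: the agent works inside a sandbox directory, and a task-specific shell command succeeds only if the environment was actually brought up. The `SetupTask` record and `validate_setup` helper are illustrative assumptions, not the benchmark’s actual harness.

```python
import subprocess
from dataclasses import dataclass

@dataclass
class SetupTask:
    """Illustrative stand-in for a SetupBench-style task definition (hypothetical)."""
    name: str
    sandbox_dir: str     # barebones environment the agent starts from
    validation_cmd: str  # shell command that should succeed only if setup worked

def validate_setup(task: SetupTask, timeout_s: int = 300) -> bool:
    """Run the task's validation command inside the sandbox and report pass/fail."""
    result = subprocess.run(
        task.validation_cmd,
        shell=True,
        cwd=task.sandbox_dir,
        capture_output=True,
        text=True,
        timeout=timeout_s,
    )
    passed = result.returncode == 0
    print(f"[{task.name}] {'PASS' if passed else 'FAIL'}")
    return passed

# Example: the environment counts as bootstrapped only if pytest is installed and runs.
task = SetupTask(
    name="python-repo-bootstrap",
    sandbox_dir=".",
    validation_cmd="python -m pytest --version",
)
validate_setup(task)
```

An outcome-based check like this rewards a working environment rather than a plausible-looking transcript of install commands.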

Echo Chamber in a Prompt: How Survey Bias Creeps into LLMs

Large Language Models (LLMs) are increasingly deployed as synthetic survey respondents in social science and policy research. But a new paper by Rupprecht, Ahnert, and Strohmaier raises a sobering question: are these AI “participants” reliable, or are we just recreating human bias in silicon form? By subjecting nine LLMs—including Gemini, Llama-3 variants, Phi-3.5, and Qwen—to over 167,000 simulated interviews from the World Values Survey, the authors expose a striking vulnerability: even state-of-the-art LLMs consistently fall for classic survey biases—especially recency bias. ...

July 11, 2025 · 3 min · Zelina
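As a concrete illustration of the kind of probe involved, the sketch below flips the order of a survey item’s answer options and measures how often a model picks whichever option appears last. The `ask_llm` function is a hypothetical stand-in for a real chat-completion call, and the random choice only keeps the sketch self-contained; the item wording follows the World Values Survey style but is not taken from the paper’s protocol.

```python
import random

def ask_llm(question: str, options: list[str]) -> str:
    """Hypothetical stand-in for an LLM call; swap in a real chat-completion client."""
    prompt = question + "\n" + "\n".join(f"{i + 1}. {opt}" for i, opt in enumerate(options))
    # A real probe would send `prompt` to the model; here we answer at random.
    return random.choice(options)

def last_option_rate(question: str, options: list[str], n_trials: int = 200) -> float:
    """Fraction of trials in which the model picks whichever option is listed last,
    averaged over the original and reversed orderings. An unbiased responder lands
    near 1 / len(options); recency bias pushes this rate up."""
    hits = 0
    for order in (list(options), list(reversed(options))):
        for _ in range(n_trials):
            if ask_llm(question, order) == order[-1]:
                hits += 1
    return hits / (2 * n_trials)

rate = last_option_rate(
    "How important is family in your life?",
    ["Very important", "Rather important", "Not very important", "Not at all important"],
)
print(f"last-option pick rate: {rate:.2f}")  # ~0.25 if no recency bias on four options
```

Reversing the option order is what separates a genuine position effect from a model that simply prefers one answer’s wording.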

Mind Games for Machines: How Decrypto Reveals the Hidden Gaps in AI Reasoning

As large language models (LLMs) evolve from mere tools into interactive agents, they are increasingly expected to operate in multi-agent environments—collaborating, competing, and communicating not just with humans but with each other. But can they understand the beliefs, intentions, and misunderstandings of others? Welcome to the world of Theory of Mind (ToM)—and the cleverest AI benchmark you haven’t heard of: Decrypto.

Cracking the Code: What is Decrypto?

Inspired by the award-winning board game of the same name, Decrypto is a three-player game of secret codes and subtle hints, reimagined as a benchmark to test LLMs’ ability to coordinate and deceive. Each game features: ...

June 26, 2025 · 4 min · Zelina

Raising the Bar: Why AI Competitions Are the New Benchmark Battleground

In the rapidly evolving landscape of Generative AI (GenAI), we’ve long relied on static benchmarks—standardized datasets and evaluations—to gauge model performance. But what if the very foundation we’re building our trust upon is fundamentally shaky? Static benchmarks often rely on IID (independent and identically distributed) assumptions, where training and test data come from the same statistical distribution. In such a setting, a model achieving high accuracy might simply be interpolating seen patterns rather than truly generalizing. For example, in language modeling, a model might “memorize” dataset-specific templates without capturing transferable reasoning patterns. ...

May 3, 2025 · 3 min
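The memorization-versus-generalization point lends itself to a toy illustration. In the sketch below (entirely invented data, not from the article), a ‘model’ that merely memorizes its training questions scores perfectly on an IID test drawn from the same templates, then collapses on lightly paraphrased items that require the same skill.

```python
# Toy illustration of the IID trap: a lookup-table "model" that memorizes
# training items looks perfect on same-distribution test data and fails
# the moment the surface form shifts. All data here is invented.
train = {
    "What is 2 plus 3?": "5",
    "What is 4 plus 1?": "5",
    "What is 6 plus 2?": "8",
}

def memorizer(question: str) -> str:
    return train.get(question, "unknown")  # pure lookup, no arithmetic

iid_test = [("What is 6 plus 2?", "8"), ("What is 2 plus 3?", "5")]   # same templates
shifted_test = [("Compute 6 + 2.", "8"), ("Add 2 and 3.", "5")]       # same skill, new wording

def accuracy(model, items):
    return sum(model(q) == a for q, a in items) / len(items)

print("IID accuracy:    ", accuracy(memorizer, iid_test))      # 1.0
print("Shifted accuracy:", accuracy(memorizer, shifted_test))  # 0.0
```

The gap between the two scores is exactly what an IID-only evaluation never sees.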

Unchained Distortions: Why Step-by-Step Image Editing Breaks Down While Chain-of-Thought Shines

When large language models (LLMs) learned to think step-by-step, the world took notice. Chain-of-Thought (CoT) reasoning breathed new life into multi-step arithmetic, logic, and even moral decision-making. But as multimodal AI evolved, researchers tried to bring this paradigm into the visual world — by editing images step-by-step instead of all at once. And it failed. In the recent benchmark study Complex-Edit: CoT-Like Instruction Generation for Complexity-Controllable Image Editing Benchmark, the authors show that CoT-style image editing — what they call sequential editing — not only fails to improve results, but often worsens them. Compared to applying a single, complex instruction all at once, breaking it into sub-instructions causes notable drops in instruction-following, identity preservation, and perceptual quality. ...

April 21, 2025 · 5 min

Judge, Jury, and GPT: Bringing Courtroom Rigor to Business Automation

In the high-stakes world of business process automation (BPA), it’s not enough for AI agents to just complete tasks—they need to complete them correctly, consistently, and transparently. At Cognaptus, we believe in treating automation with the same scrutiny you’d expect from a court of law. That’s why we’re introducing CognaptusJudge, our novel framework for evaluating business automation, inspired by cutting-edge research in LLM-powered web agents.

⚖️ Inspired by Online-Mind2Web

Earlier this year, a research team from OSU and UC Berkeley published a benchmark titled An Illusion of Progress? Assessing the Current State of Web Agents (arXiv:2504.01382). Their findings? Many agents previously hailed as top performers were failing nearly 70% of tasks when evaluated under more realistic, human-aligned conditions. ...

April 4, 2025 · 3 min