
Mirage Agents: When LLMs Act on Illusions

As large language models evolve into autonomous agents, their failures no longer stay confined to text—they materialize as actions. Clicking the wrong button, leaking private data, or falsely reporting success aren’t just hypotheticals anymore. They’re happening now, and MIRAGE-Bench is the first benchmark to comprehensively measure and categorize these agentic hallucinations. Unlike hallucinations in chatbots, which may be amusing or embarrassing, hallucinations in LLM agents operating in dynamic environments can lead to real-world consequences. MIRAGE—short for Measuring Illusions in Risky AGEnt settings—provides a long-overdue framework to elicit, isolate, and evaluate these failures. And the results are sobering: even top models like GPT-4o and Claude hallucinate at least one-third of the time when placed under pressure. ...

July 29, 2025 · 4 min · Zelina

Tools of Thought: Why Reasoning Isn’t an Illusion After All

In early 2025, Apple’s now-infamous “thinking-illusion” benchmark delivered a sobering verdict: large reasoning models (LRMs)—those step-by-step thinkers like DeepSeek-R1 and Qwen 3 Thinking—failed to show meaningful advantages over simpler LLMs. Their verbose, reflective outputs didn’t help on easy problems, nor did they scale on hard ones. In some cases, they even underperformed. But what if we were judging thinking models under unfair conditions? A new study titled “Thinking Isn’t an Illusion” argues that the problem isn’t with reasoning itself—it’s with reasoning in a vacuum. When these models are augmented with tools like Python interpreters and structured scratchpads, their performance transforms dramatically. In fact, they begin to consistently outperform their non-reasoning counterparts across a diverse set of logic puzzles. ...
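
The paper's message is easy to picture as a loop: let the model think, but hand computation off to a real interpreter. Below is a minimal sketch of that tool-augmented pattern, assuming a hypothetical `call_llm` client and a `RUN_PYTHON:` tool-call convention; the study's actual harness and prompts will differ.

```python
# Minimal sketch of tool-augmented reasoning, not the paper's harness.
# `call_llm` is a hypothetical chat-completion client; the RUN_PYTHON:
# prefix is an assumed tool-call convention for illustration.
import io
import contextlib

def run_python(code: str) -> str:
    """Execute model-written code and return captured stdout as the tool result."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, {})  # sandboxing omitted for brevity
        return buffer.getvalue()
    except Exception as exc:
        return f"Tool error: {exc}"

def solve_with_tools(question: str, call_llm, max_turns: int = 5) -> str:
    scratchpad = f"Question: {question}\n"
    for _ in range(max_turns):
        step = call_llm(scratchpad)          # model reasons; may request a tool
        scratchpad += step + "\n"
        if step.startswith("RUN_PYTHON:"):   # assumed convention, see above
            result = run_python(step.removeprefix("RUN_PYTHON:"))
            scratchpad += f"[tool output]\n{result}\n"
        else:
            return step                      # final answer, no tool call
    return "No answer within the turn budget."
```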

July 24, 2025 · 4 min · Zelina

The Watchdog at the Gates: How HalMit Hunts Hallucinations in LLM Agents

In the ever-expanding ecosystem of intelligent agents powered by large language models (LLMs), hallucinations are the lurking flaw that threatens their deployment in critical domains. These agents can compose elegant, fluent answers that are entirely wrong — a risk too great in medicine, law, or finance. While many hallucination-detection approaches require model internals or external fact-checkers, a new paper proposes a bold black-box alternative: HalMit. It is built on a deceptively simple premise: hallucinations happen when LLMs step outside their semantic comfort zone — their “generalization bound.” If we could map this bound for each domain or agent, we could flag responses that veer too far. ...
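
HalMit's actual probing strategy is more sophisticated, but the intuition can be sketched in a few lines: estimate a domain's comfort zone from trusted answers, then flag responses that land far outside it. Everything below (the `embed` placeholder, the centroid-and-radius bound) is an illustrative assumption, not the paper's algorithm.

```python
# Toy illustration of the "generalization bound" intuition, NOT HalMit's
# algorithm. `embed` is a placeholder for any text-embedding function.
import numpy as np

def fit_bound(reference_texts, embed, coverage=95):
    """Estimate a domain centroid and radius from trusted in-domain answers."""
    vecs = np.stack([embed(t) for t in reference_texts])
    centroid = vecs.mean(axis=0)
    radius = np.percentile(np.linalg.norm(vecs - centroid, axis=1), coverage)
    return centroid, radius

def veers_too_far(response, centroid, radius, embed, slack=1.25):
    """Flag a response whose embedding lies well beyond the estimated bound."""
    return np.linalg.norm(embed(response) - centroid) > slack * radius
```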

July 23, 2025 · 3 min · Zelina

The Butterfly Defect: Diagnosing LLM Failures in Tool-Agent Chains

As LLM-powered agents become the backbone of many automation systems, their ability to reliably invoke external tools is now under the spotlight. Despite impressive multi-step reasoning, many such agents crumble in practice—not because they can’t plan, but because they can’t parse. One wrong parameter, one mismatched data type, and the whole chain collapses. A new paper titled “Butterfly Effects in Toolchains” offers the first systematic taxonomy of these failures, exposing how parameter-filling errors propagate through tool-invoking agents. The findings aren’t just technical quirks—they speak to deep flaws in how current LLM systems are evaluated, built, and safeguarded. ...
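
The class of bug is easy to reproduce. As a hedged sketch (the schema format below is ad hoc, not the paper's taxonomy), validating model-filled arguments against a tool's declared parameter types makes a single mistyped field fail loudly instead of silently corrupting everything downstream:

```python
# Illustrative guard against parameter-filling errors; the schema format
# is ad hoc for this sketch, not taken from the paper.
TOOL_SCHEMA = {
    "get_weather": {"city": str, "days": int},
}

def validate_call(tool: str, args: dict) -> None:
    """Raise before invocation if the model's arguments don't fit the schema."""
    spec = TOOL_SCHEMA.get(tool)
    if spec is None:
        raise ValueError(f"Unknown tool: {tool}")
    if missing := set(spec) - set(args):
        raise ValueError(f"{tool}: missing parameters {sorted(missing)}")
    for name, expected in spec.items():
        if not isinstance(args[name], expected):
            raise TypeError(f"{tool}.{name}: expected {expected.__name__}, "
                            f"got {type(args[name]).__name__}")

try:  # one mismatched type is caught before it propagates down the chain
    validate_call("get_weather", {"city": "Oslo", "days": "3"})
except TypeError as err:
    print(err)  # get_weather.days: expected int, got str
```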

July 22, 2025 · 3 min · Zelina

Agents of Disruption: How LLMs Became Adversarial Testers for Autonomous Driving

The promise of fully autonomous vehicles hinges on their ability to handle not just the average drive—but the unexpected. Yet, creating rare, safety-critical scenarios for testing autonomous driving (AD) systems has long been a bottleneck. Manual scene creation doesn’t scale. Generative models often drift away from real-world distributions. And collecting edge cases on the road? Too dangerous, too slow. Enter AGENTS-LLM, a deceptively simple yet powerful framework that uses Large Language Models (LLMs) not to solve traffic scenes, but to break them. The twist? These aren’t just static prompts or synthetic scripts. AGENTS-LLM organizes LLMs into a multi-agent, modular system that modifies real traffic scenarios with surgical precision—making them trickier, nastier, and far more useful for evaluating planning systems. ...
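
At its core the recipe is agentic rather than generative: one agent perturbs a recorded scene, another judges whether the result is still plausible. A rough sketch of that propose-and-critique loop, with `call_llm` and the JSON scene format as stand-ins rather than the paper's actual modules:

```python
# Rough sketch of a propose-and-critique loop; `call_llm` and the JSON
# scene format are placeholders, not AGENTS-LLM's actual modules.
import json

def harden_scenario(scene: dict, call_llm) -> dict:
    proposal = call_llm(
        "You modify traffic scenarios to stress-test a motion planner.\n"
        "Make this scene harder but physically plausible. Return JSON only.\n"
        + json.dumps(scene)
    )
    candidate = json.loads(proposal)
    verdict = call_llm(
        "Is this modified scene realistic and drivable? Answer yes or no.\n"
        + json.dumps(candidate)
    )
    # Keep the adversarial variant only if the critic accepts it.
    return candidate if verdict.strip().lower().startswith("yes") else scene
```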

July 21, 2025 · 3 min · Zelina

Personas with Purpose: How TinyTroupe Reimagines Multiagent Simulation

If you’ve ever tried to simulate user behavior using LLMs, you’ve probably noticed the same frustrating pattern: the agents are too polite, too helpful, and too similar. They lack the kind of quirks, inconsistencies, and contextually grounded views that make real people interesting—and unpredictable. Enter TinyTroupe, Microsoft’s new open-source toolkit that flips the script on LLM-agent design. Instead of building yet another task-oriented assistant or collaborative workflow bot, TinyTroupe takes the form of a behavioral simulation laboratory. It invites us to think of agents not as obedient coworkers, but as idiosyncratic personas—each with their own backstories, beliefs, and sometimes maddening biases. ...
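
In practice, a persona is defined declaratively and then set loose. The sketch below follows the shape of TinyTroupe's published examples; exact method names and persona fields may differ across versions of the library.

```python
# Sketch modeled on TinyTroupe's published examples; exact signatures
# may vary across library versions.
from tinytroupe.agent import TinyPerson

oscar = TinyPerson("Oscar")
oscar.define("age", 52)
oscar.define("occupation", "Procurement manager at a mid-size manufacturer")
oscar.define("personality", "Skeptical of vendor claims; asks for evidence "
                            "and gets impatient with polished sales pitches.")

# The persona responds in character, quirks and all, rather than as a
# generically helpful assistant.
oscar.listen_and_act("A salesperson pitches you an AI analytics suite.")
```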

July 15, 2025 · 4 min · Zelina

The First Hurdle: Why Coding Agents Struggle with Setup

In the race to build autonomous software engineers, large language model (LLM) agents like Devin and Copilot Chat are lauded for fixing bugs, writing code, and even completing tasks from GitHub issues. But what happens when the code doesn’t even run? That’s the uncomfortable gap SetupBench aims to measure—and the results are sobering. SetupBench introduces a 93-task benchmark evaluating a foundational but under-tested skill: bootstrapping a development environment from scratch. Unlike prior benchmarks that hand agents a fully pre-configured Docker container, SetupBench drops them into a barebones Linux sandbox and challenges them to install dependencies, initialize databases, configure background services, and resolve real-world version conflicts. It sounds simple. It isn’t. ...
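
Each task boils down to: drop the agent into a bare sandbox, let it work, then run a validation command to decide pass or fail. Below is a hedged sketch of that harness shape; the task fields and commands are illustrative, not SetupBench's real schema.

```python
# Hedged sketch of a SetupBench-style check; the task fields and the
# validation command are illustrative, not the benchmark's real schema.
import subprocess

task = {
    "instruction": "Install the project's Python dependencies and bring up "
                   "the PostgreSQL instance it expects.",
    "validation_cmd": "python -c 'import flask' && pg_isready",
}

def sandbox_passes(task: dict, timeout: int = 60) -> bool:
    """Return True iff the agent left the sandbox in a working state."""
    result = subprocess.run(task["validation_cmd"], shell=True,
                            capture_output=True, text=True, timeout=timeout)
    return result.returncode == 0
```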

July 15, 2025 · 4 min · Zelina

Threading the Needle: How GRAFT Reinvents Document Translation with DAGs and LLM Agents

Document-level machine translation (DocMT) has long been riddled with a paradox: while LLMs can translate fluent paragraphs and even simulate discourse, they often falter at stitching meaning across paragraphs. Pronouns go adrift, tenses waver, and terminology mutates like a broken telephone game. The new paper GRAFT: A Graph-based Flow-aware Agentic Framework for Document-level Machine Translation proposes an ambitious fix: treat a document not as a sequence, but as a graph — and deploy a team of LLM agents to navigate it. ...
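
The core move is representational: segments become nodes, discourse dependencies (a pronoun's antecedent, a carried-over term) become edges, and translation proceeds in topological order so each segment sees its ancestors' translations as context. A minimal sketch of that flow, with `translate` standing in for GRAFT's agent team and a toy three-segment graph:

```python
# Minimal sketch of DAG-ordered translation; `translate` stands in for an
# LLM translation agent, and the toy graph below is illustrative.
from graphlib import TopologicalSorter

segments = {
    "s1": "The committee met on Monday.",
    "s2": "It approved the budget.",          # "It" depends on s1
    "s3": "The budget takes effect in May.",  # term carried over from s2
}
depends_on = {"s1": set(), "s2": {"s1"}, "s3": {"s2"}}

def translate_document(segments, depends_on, translate):
    done = {}
    for node in TopologicalSorter(depends_on).static_order():
        context = {d: done[d] for d in depends_on[node]}  # translated ancestors
        done[node] = translate(segments[node], context)
    return done
```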

July 12, 2025 · 4 min · Zelina

Secret Handshakes at Scale: How LLM Agents Learn to Collude

As large language models (LLMs) evolve from passive tools into autonomous market participants, a critical question emerges: can they secretly coordinate in ways that harm fair competition? A recent paper titled Evaluating LLM Agent Collusion in Double Auctions explores this unsettling frontier, and its findings deserve attention from both AI developers and policy makers. The study simulates a continuous double auction (CDA), where multiple buyer and seller agents submit bids and asks in real-time. Each agent is an LLM-powered negotiator, operating on behalf of a hypothetical industrial firm. Sellers value each item at $80, buyers at $100, and trades execute when bids meet asks. The fair equilibrium price should hover around $90. ...
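
The arithmetic of the setup is worth making explicit, since it is what makes collusion measurable: any clearing price above $90 shifts surplus from buyers to sellers. A quick worked sketch using the paper's numbers; the midpoint clearing rule is one common convention, assumed here rather than taken from the paper.

```python
# Worked sketch with the paper's numbers: sellers value items at $80,
# buyers at $100, and a trade clears when a bid meets an ask. The
# midpoint clearing rule below is an assumed convention.
SELLER_COST, BUYER_VALUE = 80, 100

def try_clear(bid: float, ask: float):
    """Return (price, buyer_surplus, seller_surplus), or None if no trade."""
    if bid < ask:
        return None                  # bid and ask don't cross, no trade
    price = (bid + ask) / 2          # midpoint clearing rule
    return price, BUYER_VALUE - price, price - SELLER_COST

print(try_clear(bid=90, ask=90))  # (90.0, 10.0, 10.0): the fair, even split
print(try_clear(bid=96, ask=95))  # (95.5, 4.5, 15.5): sellers capture more,
                                  # the signature of coordinated high asks
```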

July 7, 2025 · 4 min · Zelina

From ETL to Orchestral Intelligence: The Rise of the Data Agent

Enterprise data workflows have long been a patchwork of scripts, schedulers, human-in-the-loop dashboards, and brittle integrations. Enter the “Data Agent”: an AI-native abstraction designed not just to automate, but to reason over, adapt to, and orchestrate complex Data+AI ecosystems. In their paper, “Data Agent: A Holistic Architecture for Orchestrating Data+AI Ecosystems”, Zhaoyan Sun et al. from Tsinghua University propose a new agentic blueprint for data orchestration—one that moves far beyond traditional ETL. ...

July 3, 2025 · 3 min · Zelina