Benchmark

Touch Intelligence: How DigiData Trains Agents to Think with Their Fingers

Opening — Why this matters now In 2025, AI agents are no longer confined to text boxes. They’re moving across screens—scrolling, tapping, and swiping their way through the digital world. Yet the dream of a truly general-purpose mobile control agent—an AI that can use your phone like you do—has remained out of reach. The problem isn’t just teaching machines to see buttons; it’s teaching them to understand intent. ...

Seeing Green: When AI Learns to Detect Corporate Illusions

Seeing Green: When AI Learns to Detect Corporate Illusions Oil and gas companies have long mastered the art of framing—selectively showing the parts of reality they want us to see. A commercial fades in: wind turbines turning under a soft sunrise, a child running across a field, the logo of an oil major shimmering on the horizon. No lies are spoken, but meaning is shaped. The message? We care. The reality? Often less so. ...

The Benchmark Awakens: AstaBench and the New Standard for Agentic Science

The Benchmark Awakens: AstaBench and the New Standard for Agentic Science The latest release from the Allen Institute for AI, AstaBench, represents a turning point for how the AI research community evaluates large language model (LLM) agents. For years, benchmarks like MMLU or ARC have tested narrow reasoning and recall. But AstaBench brings something new—it treats the agent not as a static model, but as a scientific collaborator with memory, cost, and strategy. ...

Bias in the Warehouse: What AIM-Bench Reveals About Agentic LLMs

Agentic LLMs are graduating from chat to control rooms—taking actions, maintaining memory, and optimizing business processes. Inventory is a natural proving ground: a clean cocktail of uncertainty, economics, and coordination. AIM-Bench arrives precisely here, testing LLM agents across newsvendor, multi-period replenishment, the Beer Game, two-level warehouses, and a small supply network—each with explicit uncertainty sources (stochastic demand, variable lead times, and partner behavior). ...

Longer Yet Dumber: Why LLMs Fail at Catching Their Own Coding Mistakes

When a junior developer misunderstands your instructions, they might still write code that compiles and runs—but does the wrong thing. This is exactly what large language models (LLMs) do when faced with faulty premises. The latest paper, Refining Critical Thinking in LLM Code Generation, unveils FPBench, a benchmark that probes an overlooked blind spot: whether AI models can detect flawed assumptions before they generate a single line of code. Spoiler: they usually can’t. ...

Reasoning with Both Eyes Open: Why Multimodal Chain-of-Thought Still Trips Up LLMs

If today’s AI models can ace bar exams, explain astrophysics, and generate functional code from a napkin sketch, why do they still fail at seemingly simple questions that require looking and thinking? A new benchmark called MCORE (Multimodal Chain-of-Reasoning Evaluation) answers that question with a resounding: because reasoning across modalities is hard—and we’re not as far along as we thought. Beyond Pattern Matching: What MCORE Tests The majority of multimodal evaluations today rely on either: ...

Seeing Is Deceiving: Diagnosing and Fixing Hallucinations in Multimodal AI

“I See What I Want to See” Modern multimodal large language models (MLLMs)—like GPT-4V, Gemini, and LLaVA—promise to “understand” images. But what happens when their eyes lie? In many real-world cases, MLLMs generate fluent, plausible-sounding responses that are visually inaccurate or outright hallucinated. That’s a problem not just for safety, but for trust. A new paper titled “Understanding, Localizing, and Mitigating Hallucinations in Multimodal Large Language Models” introduces a systematic approach to this growing issue. It moves beyond just counting hallucinations and instead offers tools to diagnose where they come from—and more importantly, how to fix them. ...

Beyond Words: Teaching AI to See and Fix Charts with ChartM3

When you tell an AI, “make the third bar blue,” what does it actually see? If it’s a typical large language model (LLM), it doesn’t really see anything. It parses your instruction, guesses what “third bar” means, and fumbles to write chart code—often missing the mark. ChartM$^3$ (Multimodal, Multi-level, Multi-perspective) changes the game. It challenges AIs to not only read and write code but also visually comprehend what a user points at. With 1,000 human-curated chart editing tasks and 24,000 training examples, this new benchmark sets a higher bar—one that demands both verbal and visual fluency. ...

Mirage Agents: When LLMs Act on Illusions

As large language models evolve into autonomous agents, their failures no longer stay confined to text—they materialize as actions. Clicking the wrong button, leaking private data, or falsely reporting success aren’t just hypotheticals anymore. They’re happening now, and MIRAGE-Bench is the first benchmark to comprehensively measure and categorize these agentic hallucinations. Unlike hallucinations in chatbots, which may be amusing or embarrassing, hallucinations in LLM agents operating in dynamic environments can lead to real-world consequences. MIRAGE—short for Measuring Illusions in Risky AGEnt settings—provides a long-overdue framework to elicit, isolate, and evaluate these failures. And the results are sobering: even top models like GPT-4o and Claude hallucinate at least one-third of the time when placed under pressure. ...

The Two Minds of Finance: Testing LLMs for Divergence and Discipline

How do we judge whether an AI is thinking like a human—or at least like a financial analyst? A new benchmark, ConDiFi, offers a compelling answer: test not just whether an LLM gets the right answer, but whether it can explore possible ones. That’s because true financial intelligence lies not only in converging on precise conclusions but in diverging into speculative futures. Most benchmarks test convergent thinking: answer selection, chain-of-thought, or multi-hop reasoning. But strategic fields like finance also demand divergent thinking—creative, open-ended scenario modeling that considers fat-tail risks and policy surprises. ConDiFi (short for Convergent-Divergent for Finance) is the first serious attempt to capture both dimensions in one domain-specific benchmark. ...