Bias in the Warehouse: What AIM-Bench Reveals About Agentic LLMs

Agentic LLMs are graduating from chat to control rooms—taking actions, maintaining memory, and optimizing business processes. Inventory is a natural proving ground: a clean cocktail of uncertainty, economics, and coordination. AIM-Bench arrives precisely here, testing LLM agents across newsvendor, multi-period replenishment, the Beer Game, two-level warehouses, and a small supply network—each with explicit uncertainty sources (stochastic demand, variable lead times, and partner behavior). ...

August 18, 2025 · 4 min · Zelina

Longer Yet Dumber: Why LLMs Fail at Catching Their Own Coding Mistakes

When a junior developer misunderstands your instructions, they might still write code that compiles and runs—but does the wrong thing. This is exactly what large language models (LLMs) do when faced with faulty premises. The latest paper, Refining Critical Thinking in LLM Code Generation, unveils FPBench, a benchmark that probes an overlooked blind spot: whether AI models can detect flawed assumptions before they generate a single line of code. Spoiler: they usually can’t. ...

August 6, 2025 · 3 min · Zelina

Reasoning with Both Eyes Open: Why Multimodal Chain-of-Thought Still Trips Up LLMs

If today’s AI models can ace bar exams, explain astrophysics, and generate functional code from a napkin sketch, why do they still fail at seemingly simple questions that require looking and thinking? A new benchmark called MCORE (Multimodal Chain-of-Reasoning Evaluation) answers that question with a resounding: because reasoning across modalities is hard, and we’re not as far along as we thought.

Beyond Pattern Matching: What MCORE Tests

The majority of multimodal evaluations today rely on either: ...

August 6, 2025 · 3 min · Zelina

Seeing Is Deceiving: Diagnosing and Fixing Hallucinations in Multimodal AI

“I See What I Want to See”

Modern multimodal large language models (MLLMs), such as GPT-4V, Gemini, and LLaVA, promise to “understand” images. But what happens when their eyes lie? In many real-world cases, MLLMs generate fluent, plausible-sounding responses that are visually inaccurate or outright hallucinated. That’s a problem not just for safety, but for trust. A new paper titled “Understanding, Localizing, and Mitigating Hallucinations in Multimodal Large Language Models” introduces a systematic approach to this growing issue. It moves beyond just counting hallucinations and instead offers tools to diagnose where they come from and, more importantly, how to fix them. ...

August 5, 2025 · 3 min · Zelina

Beyond Words: Teaching AI to See and Fix Charts with ChartM3

When you tell an AI, “make the third bar blue,” what does it actually see? If it’s a typical large language model (LLM), it doesn’t really see anything. It parses your instruction, guesses what “third bar” means, and fumbles to write chart code—often missing the mark. ChartM$^3$ (Multimodal, Multi-level, Multi-perspective) changes the game. It challenges AIs to not only read and write code but also visually comprehend what a user points at. With 1,000 human-curated chart editing tasks and 24,000 training examples, this new benchmark sets a higher bar—one that demands both verbal and visual fluency. ...

July 30, 2025 · 4 min · Zelina

Mirage Agents: When LLMs Act on Illusions

As large language models evolve into autonomous agents, their failures no longer stay confined to text—they materialize as actions. Clicking the wrong button, leaking private data, or falsely reporting success aren’t just hypotheticals anymore. They’re happening now, and MIRAGE-Bench is the first benchmark to comprehensively measure and categorize these agentic hallucinations. Unlike hallucinations in chatbots, which may be amusing or embarrassing, hallucinations in LLM agents operating in dynamic environments can lead to real-world consequences. MIRAGE—short for Measuring Illusions in Risky AGEnt settings—provides a long-overdue framework to elicit, isolate, and evaluate these failures. And the results are sobering: even top models like GPT-4o and Claude hallucinate at least one-third of the time when placed under pressure. ...

July 29, 2025 · 4 min · Zelina

The Two Minds of Finance: Testing LLMs for Divergence and Discipline

How do we judge whether an AI is thinking like a human—or at least like a financial analyst? A new benchmark, ConDiFi, offers a compelling answer: test not just whether an LLM gets the right answer, but whether it can explore possible ones. That’s because true financial intelligence lies not only in converging on precise conclusions but in diverging into speculative futures. Most benchmarks test convergent thinking: answer selection, chain-of-thought, or multi-hop reasoning. But strategic fields like finance also demand divergent thinking—creative, open-ended scenario modeling that considers fat-tail risks and policy surprises. ConDiFi (short for Convergent-Divergent for Finance) is the first serious attempt to capture both dimensions in one domain-specific benchmark. ...

July 25, 2025 · 4 min · Zelina

Sound and Fury Signifying Stock Picks

In an age where TikTok traders and YouTube gurus claim market mastery, a new benchmark dataset asks a deceptively simple question: Can AI tell when someone really believes in their own stock pick? The answer, it turns out, reveals not just a performance gap between finfluencers and index funds, but also a yawning chasm between today’s multimodal AI models and human judgment.

Conviction Is More Than a Call to Action

The paper “VideoConviction” introduces a unique multimodal benchmark composed of 288 YouTube videos from 22 financial influencers, or “finfluencers,” spanning over six years of market cycles. From these, researchers extracted 687 stock recommendation segments, annotating each with: ...

July 14, 2025 · 4 min · Zelina

Divide and Model: How Multi-Agent LLMs Are Rethinking Real-World Problem Solving

When it comes to real-world problem solving, today’s LLMs face a critical dilemma: they can solve textbook problems well, but stumble when confronted with messy, open-ended challenges, like optimizing traffic in a growing city or managing fisheries under uncertain climate shifts. Enter ModelingAgent, an ambitious new framework that turns this complexity into opportunity.

What Makes Real-World Modeling So Challenging?

Unlike standard math problems, real-world tasks involve ambiguity, multiple valid solutions, noisy data, and cross-domain reasoning. They often require: ...

May 23, 2025 · 3 min