RxnBench: Reading Chemistry Like a Human (Turns Out That’s Hard)

Opening — Why this matters now
Multimodal Large Language Models (MLLMs) have become impressively fluent readers of the world. They can caption images, parse charts, and answer questions about documents that would once have required a human analyst and a strong coffee. Naturally, chemistry was next. But chemistry does not speak in sentences. It speaks in arrows, wedges, dashed bonds, cryptic tables, and reaction schemes buried three pages away from their explanations. If we want autonomous “AI chemists,” the real test is not trivia or SMILES strings — it is whether models can read actual chemical papers. ...

December 31, 2025 · 4 min · Zelina

Benchmarks on Quicksand: Why Static Scores Fail Living Models

Opening — Why this matters now
If you feel that every new model release breaks yesterday’s leaderboard, congratulations: you’ve discovered the central contradiction of modern AI evaluation. Benchmarks were designed for stability. Models are not. The paper at hand dissects this mismatch with academic precision—and a slightly uncomfortable conclusion: static benchmarks are no longer fit for purpose. ...

December 15, 2025 · 3 min · Zelina

Seeing Isn’t Knowing: Why Vision-Language Models Still Miss the Details

Opening — Why this matters now
Vision-language models (VLMs) have become unreasonably confident. Ask them to explain a chart, reason over a meme, or narrate an image, and they respond with eloquence that borders on arrogance. Yet beneath this fluency lies an uncomfortable truth: many of these models still struggle to see the right thing. ...

December 14, 2025 · 4 min · Zelina

Beyond Answers: Measuring How Deep Research Agents Really Think

Artificial intelligence is moving past chatbots that answer questions. The next frontier is Deep Research Agents (DRAs) — AI systems that can decompose complex problems, gather information from multiple sources, reason across them, and synthesize their findings into structured reports. But until recently, there was no systematic way to measure how well these agents perform beyond surface-level reasoning. That is the gap RigorousBench aims to fill.

From Q&A to Reports: The Benchmark Shift
Traditional LLM benchmarks — like GAIA, WebWalker, or BrowseComp — test how accurately a model answers factual questions. This approach works for short-form reasoning but fails for real-world research tasks that demand long-form synthesis and multi-source validation. ...

October 9, 2025 · 3 min · Zelina

The User Is Present: Why Smart Agents Still Don't Get You

If today’s AI agents are so good with tools, why are they still so bad with people? That’s the uncomfortable question posed by UserBench, a new gym-style benchmark from Salesforce AI Research that evaluates LLM-based agents not just on what they do, but on how well they collaborate with a user who doesn’t say exactly what they want. At first glance, UserBench looks like yet another travel-planning simulator. But dig deeper, and you’ll see it flips the standard script of agent evaluation. Instead of testing models on fully specified tasks, it mimics real conversations: the user’s goals are vague, revealed incrementally, and often expressed indirectly. Think “I’m traveling for business, so I hope to have enough time to prepare” instead of “I want a direct flight.” The agent’s job is to ask, interpret, and decide—with no hand-holding. ...

July 30, 2025 · 3 min · Zelina

Tunnel Vision: Why Vision-Language Models Still Miss the Bigger Picture

It’s no secret that Vision-Language Models (VLMs) have dazzled us with their prowess—excelling at image captioning, chart understanding, and even medical diagnostics. But beneath the glitter of benchmark wins, a deeper flaw lurks: these models often suffer from what Berman and Deng (Princeton) have sharply diagnosed as “tunnel vision.” Their new paper, VLMs Have Tunnel Vision, introduces a battery of tasks that humans can breeze through but that leading VLMs—from Gemini 2.5 Pro to Claude Vision 3.7—fail to solve at rates even marginally above chance. These tasks aren’t edge cases or contrived puzzles. They simulate basic human visual competencies like comparing two objects, following a path, and making discrete visual inferences from spatially distributed evidence. The results? A sobering reminder that state-of-the-art perception doesn’t equate to understanding. ...

July 21, 2025 · 4 min · Zelina

Proofs and Consequences: How Math Reveals What AI Still Doesn’t Know

What happens when we ask the smartest AI models to do something truly difficult—like solve a real math problem and prove their answer is correct? That’s the question tackled by a group of researchers in their paper “Mathematical Proof as a Litmus Test.” Instead of testing AI with casual tasks like summarizing news or answering trivia, they asked it to write formal mathematical proofs—the kind that leave no room for error. And the results? Surprisingly poor. ...

June 23, 2025 · 4 min · Zelina