
LLMs, Gotta Think ’Em All: When Pokémon Battles Become a Serious AI Benchmark

Opening — Why this matters now
For years, game AI has been split between two extremes: brittle rule-based scripts and opaque reinforcement learning behemoths. Both work—until the rules change, the content shifts, or players behave in ways the designers didn’t anticipate. Pokémon battles, deceptively simple on the surface, sit exactly at this fault line. They demand structured reasoning, probabilistic judgment, and tactical foresight, but also creativity when the meta evolves. ...

December 22, 2025 · 4 min · Zelina

From Benchmarks to Beakers: Stress‑Testing LLMs as Scientific Co‑Scientists

Opening — Why this matters now
Large Language Models have already aced exams, written code, and argued philosophy with unsettling confidence. The obvious next step was inevitable: can they do science? Not assist, not summarize—but reason, explore, and discover. The paper behind this article asks that question without romance. It evaluates LLMs not as chatbots, but as proto‑scientists, and then measures how far the illusion actually holds. ...

December 18, 2025 · 3 min · Zelina

When LLMs Get Fatty Liver: Diagnosing AI-MASLD in Clinical AI

Opening — Why this matters now
AI keeps passing medical exams, acing board-style questions, and politely explaining pathophysiology on demand. Naturally, someone always asks the dangerous follow-up: So… can we let it talk to patients now? This paper answers that question with clinical bluntness: not without supervision, and certainly not without consequences. When large language models (LLMs) are exposed to raw, unstructured patient narratives—the kind doctors hear every day—their performance degrades in a very specific, pathological way. The authors call it AI-MASLD: AI–Metabolic Dysfunction–Associated Steatotic Liver Disease. ...

December 15, 2025 · 4 min · Zelina

Seeing Isn’t Knowing: Why Vision-Language Models Still Miss the Details

Opening — Why this matters now
Vision-language models (VLMs) have become unreasonably confident. Ask them to explain a chart, reason over a meme, or narrate an image, and they respond with eloquence that borders on arrogance. Yet beneath this fluency lies an uncomfortable truth: many of these models still struggle to see the right thing. ...

December 14, 2025 · 4 min · Zelina

Replace, Don’t Expand: When RAG Learns to Throw Things Away

Opening — Why this matters now
RAG systems are having an identity crisis. On paper, retrieval-augmented generation is supposed to ground large language models in facts. In practice, when queries require multi-hop reasoning, most systems panic and start hoarding context like it’s a survival skill. Add more passages. Expand the window. Hope the model figures it out. ...

December 12, 2025 · 4 min · Zelina

When Ambiguity Helps: Rethinking How AI Interprets Our Data Questions

Opening — Why this matters now
As businesses increasingly rely on natural language to query complex datasets — “Show me the average Q3 sales in Europe” — ambiguity has become both a practical headache and a philosophical blind spot. The instinct has been to “fix” vague queries, forcing AI systems to extract a single, supposedly correct intent. But new research from CWI and the University of Amsterdam suggests we’ve been asking the wrong question all along. Ambiguity isn’t the enemy — it’s part of how humans think and collaborate. ...

November 7, 2025 · 4 min · Zelina

The Benchmark Awakens: AstaBench and the New Standard for Agentic Science

The latest release from the Allen Institute for AI, AstaBench, represents a turning point for how the AI research community evaluates large language model (LLM) agents. For years, benchmarks like MMLU or ARC have tested narrow reasoning and recall. But AstaBench brings something new—it treats the agent not as a static model, but as a scientific collaborator with memory, cost, and strategy. ...

October 31, 2025 · 4 min · Zelina

Beyond Answers: Measuring How Deep Research Agents Really Think

Artificial intelligence is moving past chatbots that answer questions. The next frontier is Deep Research Agents (DRAs) — AI systems that can decompose complex problems, gather information from multiple sources, reason across them, and synthesize their findings into structured reports. But until recently, there was no systematic way to measure how well these agents perform beyond surface-level reasoning. That is the gap RigorousBench aims to fill.
From Q&A to Reports: The Benchmark Shift
Traditional LLM benchmarks — like GAIA, WebWalker, or BrowseComp — test how accurately a model answers factual questions. This approach works for short-form reasoning but fails for real-world research tasks that demand long-form synthesis and multi-source validation. ...

October 9, 2025 · 3 min · Zelina

Backtrack to Breakthrough: Why Great AI Agents Revisit

TL;DR
Agentic performance isn’t just about doing more; it’s about going back. In GSM-Agent—a controllable, tool-using version of GSM8K—top models only reach ~65–68% accuracy, and the strongest predictor of success is a high revisit ratio: deliberately returning to a previously explored topic with a refined query. That’s actionable for enterprise AI: design agents that can (1) recognize incomplete evidence, (2) reopen earlier lines of inquiry, and (3) instrument and reward revisits. ...

October 3, 2025 · 4 min · Zelina

Lost in the Long Game: What UltraHorizon Reveals About Agent Failure at Scale

TL;DR
UltraHorizon is a new benchmark that finally tests what real enterprise projects require: months‑long reasoning crammed into a single run—35k–200k tokens, 60–400+ tool calls, partially observable rules, and hard commitments at the end. Agents underperform badly versus humans. The pattern isn’t “not enough IQ”; it’s entropy collapse over time (the paper calls it in‑context locking) and foundational capability gaps (planning, memory, calibrated exploration). Simple scaling fails; a lightweight strategy—Context Refresh with Notes Recall (CRNR)—partially restores performance. Below we translate these findings into a deployer’s playbook. ...

October 3, 2025 · 5 min · Zelina