Evaluation

ReAct Without the Chaos: AgentScope 1.0 Turns Tools into Strategy

Thesis: AgentScope 1.0 is less a toolkit and more a discipline for agentic software. By pinning everything to ReAct loops, unifying “message–model–memory–tool,” and adding group-wise tool provisioning, it addresses the real failure mode of agents in production: tool sprawl without control. The evaluation/Studio/runtime trio then turns prototypes into shippable services. What’s actually new (and why it matters) 1) A crisp core: Message → Model → Memory → Tool Most frameworks blur these into ad‑hoc objects; AgentScope forces a clean, composable boundary: ...

Charting a Better Bedside: When Agentic RL Teaches RAG to Diagnose

Why this paper matters: Retrieval‑augmented generation (RAG) has been the default answer to “how do we make LLMs factual?” But clinical work is not a single hop to a single document; it’s a workflow—observe, hypothesize, retrieve, cross‑check, and only then decide. Deep‑DxSearch reframes RAG as a sequential policy, trained end‑to‑end with reinforcement learning (RL) so the model learns when to reason internally and when to consult guidelines, match similar patients, or search broader knowledge—before committing to a diagnosis. That design change is the story. ...

Atom by Atom, Better Research: How Fine-Grained Rewards Make Agentic Search Smarter

If you’ve ever watched a web agent swing from elegant reasoning to face‑plants on basic facts, you’ve met the limits of outcome‑only training. Atom‑Searcher proposes a simple but radical fix: stop treating the whole reasoning trace as one monolith. Instead, break it down into Atomic Thoughts—the minimal, functional units of reasoning—and supervise them directly with a Reasoning Reward Model (RRM). Then blend those process‑level rewards with the final answer score using a decaying curriculum. The result? More stable training, deeper search behavior, and better generalization across in‑ and out‑of‑domain QA. ...

Crystal Ball, Meet Cron Job: What FutureX Reveals About ‘Live’ Forecasting Agents

The one-sentence take A new live benchmark, FutureX, swaps lab-style trivia for rolling, real-world future events, forcing agentic LLMs to search, reason, and hedge under uncertainty that actually moves—and the results expose where today’s “agents” are still brittle. Why FutureX matters now Enterprise teams are deploying agents to answer questions whose truth changes by the hour—markets, elections, sports, product launches. Static leaderboards don’t measure that. FutureX runs as a cron job on reality: it collects new events every day, has agents make predictions, and grades them after events resolve. That turns evaluation from a screenshot into a time series and makes overfitting to benchmark quirks a lot harder. ...

Knows the Facts, Misses the Plot: LLMs’ Knowledge–Reasoning Split in Clinical NLI

The gist A new clinical natural language inference (NLI) benchmark isolates what models know from how they reason—and the results are stark. State‑of‑the‑art LLMs ace targeted fact checks (≈92% accuracy) but crater on the actual reasoning tasks (≈25% accuracy). The collapse is most extreme in compositional grounding (≈4% accuracy), where a claim depends on multiple interacting clinical constraints (e.g., drug × dose × diagnosis × schedule). Scaling yielded fluent prose, not reliable inference. ...

Meta-Game Theory: What a Pokémon League Taught Us About LLM Strategy

When language models battle, their strategies talk back. In a controlled Pokémon tournament, eight LLMs drafted teams, chose moves, and logged natural‑language rationales every turn. Beyond win–loss records, those explanations exposed how models reason about uncertainty, risk, and resource management—exactly the traits we want in enterprise decision agents. Why Pokémon is a serious benchmark (yes, really) Pokémon delivers the trifecta we rarely get in classic AI games: Structured complexity: 18 interacting types, clear multipliers, and crisp rules. Uncertainty that matters: imperfect information, status effects, and accuracy trade‑offs. Resource management: limited switches, finite HP, role specialization. Crucially, the action space is compact enough for language-first agents to reason step‑by‑step without search trees—so we can see the strategy, not just the score. ...

Reasoning with Both Eyes Open: Why Multimodal Chain-of-Thought Still Trips Up LLMs

If today’s AI models can ace bar exams, explain astrophysics, and generate functional code from a napkin sketch, why do they still fail at seemingly simple questions that require looking and thinking? A new benchmark called MCORE (Multimodal Chain-of-Reasoning Evaluation) answers that question with a resounding: because reasoning across modalities is hard—and we’re not as far along as we thought. Beyond Pattern Matching: What MCORE Tests The majority of multimodal evaluations today rely on either: ...

Beyond the Pareto Frontier: Pricing LLM Mistakes in the Real World

For all the hype about model accuracy, inference cost, and latency, most organizations are still squinting at scatter plots to decide which large language model (LLM) to use. But what if we could cut through the tradeoff fog with a single number that tells you exactly which model is worth deploying—for your use case, under your constraints? That’s the bold proposal in a recent paper by Zellinger and Thomson from Caltech: treat LLM selection as an economic decision. Rather than searching for models on the accuracy-cost “Pareto frontier,” they suggest an approach grounded in price-tagging errors, delays, and abstentions in dollar terms. Think of it as a model selection framework that answers: How much is a mistake worth to you? ...

Mind the Gap: Fixing the Flaws in Agentic Benchmarking

If you’ve looked at any leaderboard lately—from SWE-Bench to WebArena—you’ve probably seen impressive numbers. But how many of those reflect real capabilities of AI agents? This paper by Zhu et al. makes a bold claim: agentic benchmarks are often broken, and the way we evaluate AI agents is riddled with systemic flaws. Their response is refreshingly practical: a 33-point diagnostic called the Agentic Benchmark Checklist (ABC), designed not just to critique, but to fix the evaluation process. It’s a must-read not only for benchmark creators, but for any team serious about deploying or comparing AI agents in real-world tasks. ...