Evaluation

Backtrack to Breakthrough: Why Great AI Agents Revisit

TL;DR Agentic performance isn’t just about doing more; it’s about going back. In GSM-Agent—a controllable, tool-using version of GSM8K—top models only reach ~65–68% accuracy, and the strongest predictor of success is a high revisit ratio: deliberately returning to a previously explored topic with a refined query. That’s actionable for enterprise AI: design agents that can (1) recognize incomplete evidence, (2) reopen earlier lines of inquiry, and (3) instrument and reward revisits. ...

Lost in the Long Game: What UltraHorizon Reveals About Agent Failure at Scale

TL;DR UltraHorizon is a new benchmark that finally tests what real enterprise projects require: months‑long reasoning crammed into a single run—35k–200k tokens, 60–400+ tool calls, partially observable rules, and hard commitments at the end. Agents underperform badly versus humans. The pattern isn’t “not enough IQ”; it’s entropy collapse over time (the paper calls it in‑context locking) and foundational capability gaps (planning, memory, calibrated exploration). Simple scaling fails; a lightweight strategy—Context Refresh with Notes Recall (CRNR)—partially restores performance. Below we translate these findings into a deployer’s playbook. ...

Options = Power: Turning Empowerment into a KPI for AI Agents

If your agents can reach more valuable futures with fewer steps, they’re stronger—whether you measured that task or not. Today’s paper offers a clean way to turn that intuition into a number: empowerment—an information‑theoretic score of how much an agent’s current action shapes its future states. The authors introduce EELMA, a scalable estimator that works purely from multi‑turn text traces. No bespoke benchmark design. No reward hacking. Just trajectories. This is the kind of metric we’ve wanted at Cognaptus: goal‑agnostic, scalable, and diagnostic. Below, I translate EELMA into an operator’s playbook: what it is, why it matters for business automation, how to wire it into your stack, and where it can mislead you if unmanaged. ...

Paths > Outcomes: Measuring Agent Quality Beyond the Final State

When we measure a marathon by who crosses the line, we ignore how they ran it. For LLM agents that operate through tool calls—editing a CRM, moving a robot arm, or filing a compliance report—the “how” is the difference between deployable and dangerous. Today’s paper introduces CORE: Full‑Path Evaluation of LLM Agents Beyond Final State, a framework that scores agents on the entire execution path rather than only the end state. Here’s why this matters for your roadmap. ...

Agency Check, Please: What a New Benchmark Says About LLMs That Actually Empower Users

If you only measure what’s easy, you’ll ship assistants that feel brilliant yet quietly take the steering wheel. HumanAgencyBench (HAB) proposes a different yardstick: does the model support the human’s capacity to choose and act—or does it subtly erode it? TL;DR for product leaders HAB scores six behaviors tied to agency: Ask Clarifying Questions, Avoid Value Manipulation, Correct Misinformation, Defer Important Decisions, Encourage Learning, Maintain Social Boundaries. Across 20 frontier models, agency support is low-to-moderate overall. Patterns matter more than single scores: e.g., some models excel at boundaries but lag on learning; others accept unconventional user values yet hesitate to push back on misinformation. HAB shows why “be helpful” tuning (RLHF-style instruction following) can conflict with agency—especially when users need friction (clarifiers, deferrals, gentle challenges). Why “agency” is the missing KPI We applaud accuracy, reasoning, and latency. But an enterprise rollout lives or dies on trustworthy delegation. That means assistants that: ...

Agreeable to a Fault: Why LLM ‘People’ Can’t Hold Their Ground

If you’ve been tempted to A/B‑test a marketing idea on thousands of synthetic “customers,” read this first. A new study introduces a dead‑simple but devastating test for LLM‑based agents: ask them to first state their internal stance (preference) and their openness to persuasion, then drop them into a short dialogue and check whether their behavior matches what they just claimed. That’s it. If agents are believable stand‑ins for people, the conversation outcome should line up with those latent states. ...

Fusion Cuisine for RAG: Z‑Scores, Rankers, and the Two‑Source Diet

Retrieval‑augmented generation tends to pick a side: either lean on labeled exemplars (ICL/L‑RAG) that encode task semantics, or on unlabeled corpora (U‑RAG) that provide broad knowledge. HF‑RAG argues we shouldn’t choose. Instead, it proposes a hierarchical fusion: (1) fuse multiple rankers within each source, then (2) fuse across sources by putting scores on a common scale. The result is a simple, training‑free recipe that improves fact verification and, crucially, generalizes better out‑of‑domain. ...

From Prompts to Policies: The Agentic RL Playbook

How a new survey formalizes the shift from RLHF’d text bots to tool-using operators—and the practical playbook for product teams. TL;DR Agentic RL reframes LLMs from one-shot text generators to policies acting in dynamic environments with planning, tool use, memory, and reflection. The paper contrasts PBRFT (preference-based RL fine-tuning) with Agentic RL via an MDP→POMDP upgrade; action space now includes text + structured actions. It organizes the space by capabilities (planning, tools, memory, self-improvement, reasoning, perception) and tasks (search, code, math, GUI, vision, embodied, multi-agent). Open challenges: trust, scalable training, and scalable environments. For builders: start with short-horizon agents (verifiable rewards), invest early in evaluation, and plan a migration path from RAG pipelines to tool-integrated reasoning (TIR) with RL. What the paper actually changes Most “LLM RL” work you’ve seen is PBRFT—optimize responses to fit human/AI preferences (RLHF/DPO/etc.). This new survey argues that real autonomy needs Agentic RL: treat the model as a policy embedded in a sequential, partially observable world. That sounds academic, but the practical consequences are huge: ...

Numbers Need Narration: Making LLMs Do Reasoning‑Intensive Regression

Thesis: When the job is to read text, reason carefully, and return a precise number (not just a label), ordinary regression heads and vanilla prompting often fail in opposite ways. The paper introduces MENTAT, a lightweight recipe that marries batch‑reflective prompt evolution with a small MLP aggregator over multiple LLM rollouts. The result: tighter calibration and better ranking on tasks where each example demands real reasoning, not surface features. What counts as “Reasoning‑Intensive Regression” (RiR)? RiR tasks look like this: the model must (1) think through the input with step‑wise analysis, and then (2) score it on a real‑valued scale. The paper frames three such tasks: ...

Benchmarks with Benefits: What DeepScholar-Bench Really Measures

TL;DR DeepScholar-Bench introduces a live (continuously refreshable) benchmark and a holistic automated evaluation for generative research synthesis. Its reference pipeline, DeepScholar‑base, is simple yet competitive. The headline: today’s best systems organize text well but miss key facts, under-retrieve important sources, and fail verifiability at scale. That’s not a death knell—it’s a roadmap. Why this matters for business readers Enterprise “research copilots” promise to digest the live web, summarize options, and provide auditable citations. In practice, three gaps keep showing up: ...