
When Ambiguity Helps: Rethinking How AI Interprets Our Data Questions

Opening — Why this matters now

As businesses increasingly rely on natural language to query complex datasets — “Show me the average Q3 sales in Europe” — ambiguity has become both a practical headache and a philosophical blind spot. The instinct has been to “fix” vague queries, forcing AI systems to extract a single, supposedly correct intent. But new research from CWI and the University of Amsterdam suggests we’ve been asking the wrong question all along. Ambiguity isn’t the enemy — it’s part of how humans think and collaborate. ...

November 7, 2025 · 4 min · Zelina

The Benchmark Awakens: AstaBench and the New Standard for Agentic Science

The latest release from the Allen Institute for AI, AstaBench, represents a turning point for how the AI research community evaluates large language model (LLM) agents. For years, benchmarks like MMLU or ARC have tested narrow reasoning and recall. But AstaBench brings something new—it treats the agent not as a static model, but as a scientific collaborator with memory, cost, and strategy. ...

October 31, 2025 · 4 min · Zelina

Beyond Answers: Measuring How Deep Research Agents Really Think

Artificial intelligence is moving past chatbots that answer questions. The next frontier is Deep Research Agents (DRAs) — AI systems that can decompose complex problems, gather information from multiple sources, reason across them, and synthesize their findings into structured reports. But until recently, there was no systematic way to measure how well these agents perform beyond surface-level reasoning. That is the gap RigorousBench aims to fill.

From Q&A to Reports: The Benchmark Shift

Traditional LLM benchmarks — like GAIA, WebWalker, or BrowseComp — test how accurately a model answers factual questions. This approach works for short-form reasoning but fails for real-world research tasks that demand long-form synthesis and multi-source validation. ...

October 9, 2025 · 3 min · Zelina

Backtrack to Breakthrough: Why Great AI Agents Revisit

TL;DR: Agentic performance isn’t just about doing more; it’s about going back. In GSM-Agent—a controllable, tool-using version of GSM8K—top models only reach ~65–68% accuracy, and the strongest predictor of success is a high revisit ratio: deliberately returning to a previously explored topic with a refined query. That’s actionable for enterprise AI: design agents that can (1) recognize incomplete evidence, (2) reopen earlier lines of inquiry, and (3) instrument and reward revisits. ...
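
To make the revisit ratio concrete, here is a rough sketch of how one might compute it from a topic-labeled trajectory. This is an illustration of the idea, not the paper’s exact definition:

```python
def revisit_ratio(topic_trace):
    """Fraction of steps that return to a previously abandoned topic.

    topic_trace: ordered list of topic labels, one per tool call.
    A step counts as a revisit when its topic was explored before
    but is not the topic of the immediately preceding step.
    """
    seen, revisits, prev = set(), 0, None
    for topic in topic_trace:
        if topic in seen and topic != prev:
            revisits += 1
        seen.add(topic)
        prev = topic
    return revisits / len(topic_trace) if topic_trace else 0.0

# The third step reopens "pricing" after a detour through "inventory",
# so 1 of 3 steps is a revisit:
print(revisit_ratio(["pricing", "inventory", "pricing"]))  # ~0.33
```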

October 3, 2025 · 4 min · Zelina

Lost in the Long Game: What UltraHorizon Reveals About Agent Failure at Scale

TL;DR: UltraHorizon is a new benchmark that finally tests what real enterprise projects require: months‑long reasoning crammed into a single run—35k–200k tokens, 60–400+ tool calls, partially observable rules, and hard commitments at the end. Agents underperform badly versus humans. The pattern isn’t “not enough IQ”; it’s entropy collapse over time (the paper calls it in‑context locking) and foundational capability gaps (planning, memory, calibrated exploration). Simple scaling fails; a lightweight strategy—Context Refresh with Notes Recall (CRNR)—partially restores performance. Below we translate these findings into a deployer’s playbook. ...
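
The CRNR idea is simple to picture: every so often, distill the bloated transcript into notes and restart the context from them. A minimal sketch of that control loop, assuming a hypothetical `agent` interface (`summarize`, `act`) rather than the paper’s implementation:

```python
REFRESH_EVERY = 40  # tool calls between refreshes (an assumed knob)

def run_with_crnr(task, agent, max_steps=400):
    """Long-horizon loop that periodically swaps the full transcript
    for distilled notes, countering in-context locking."""
    notes, transcript = "", []
    for step in range(max_steps):
        if step > 0 and step % REFRESH_EVERY == 0:
            # Context refresh: keep what matters as notes, drop the rest.
            notes = agent.summarize(task, notes, transcript)
            transcript = []
        action = agent.act(task, notes, transcript)  # assumed interface
        if action.is_final:
            return action
        transcript.append((action, action.execute()))
```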

October 3, 2025 · 5 min · Zelina

Options = Power: Turning Empowerment into a KPI for AI Agents

If your agents can reach more valuable futures with fewer steps, they’re stronger—whether you measured that task or not. Today’s paper offers a clean way to turn that intuition into a number: empowerment—an information‑theoretic score of how much an agent’s current action shapes its future states. The authors introduce EELMA, a scalable estimator that works purely from multi‑turn text traces. No bespoke benchmark design. No reward hacking. Just trajectories. This is the kind of metric we’ve wanted at Cognaptus: goal‑agnostic, scalable, and diagnostic. Below, I translate EELMA into an operator’s playbook: what it is, why it matters for business automation, how to wire it into your stack, and where it can mislead you if unmanaged. ...
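
Formally, empowerment is the mutual information between an agent’s current action and its future states; EELMA’s contribution is estimating it at scale from multi‑turn text traces. As a toy illustration of the quantity itself (not the authors’ estimator), here is a plug‑in estimate over discretized (action, future‑state) pairs:

```python
import math
from collections import Counter

def empowerment_mi(pairs):
    """Plug-in estimate of I(action; future state) in bits, from logged
    (action, future_state) pairs. A toy stand-in for what EELMA
    estimates at scale from multi-turn text traces."""
    n = len(pairs)
    pa, ps, joint = Counter(), Counter(), Counter(pairs)
    for a, s in pairs:
        pa[a] += 1
        ps[s] += 1
    return sum(
        (c / n) * math.log2((c / n) / ((pa[a] / n) * (ps[s] / n)))
        for (a, s), c in joint.items()
    )

# An agent whose action fully determines which of two equiprobable
# futures occurs has exactly 1 bit of empowerment:
print(empowerment_mi([("left", "A"), ("right", "B")] * 50))  # 1.0
```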

October 3, 2025 · 5 min · Zelina

Paths > Outcomes: Measuring Agent Quality Beyond the Final State

When we measure a marathon by who crosses the line, we ignore how they ran it. For LLM agents that operate through tool calls—editing a CRM, moving a robot arm, or filing a compliance report—the “how” is the difference between deployable and dangerous. Today’s paper introduces CORE: Full‑Path Evaluation of LLM Agents Beyond Final State, a framework that scores agents on the entire execution path rather than only the end state. Here’s why this matters for your roadmap. ...
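
A toy rubric conveys the intuition (illustrative only, not CORE’s actual scoring): two agents can reach the same final state yet earn very different scores once the path is examined.

```python
def path_score(tool_calls, final_state_ok,
               forbidden=frozenset({"delete_record"})):
    """Score the whole execution path, not just the end state: unsafe
    or redundant intermediate calls cost points even when the final
    state is correct. (A sketch in the spirit of CORE, not its metric.)"""
    penalty, seen = 0.0, set()
    for call in tool_calls:  # call = (tool_name, args_string)
        if call[0] in forbidden:
            penalty += 0.5   # unsafe intermediate action
        if call in seen:
            penalty += 0.1   # redundant duplicate call
        seen.add(call)
    return max(0.0, (1.0 if final_state_ok else 0.0) - penalty)

# Same correct end state, different paths, different scores:
print(path_score([("update_crm", "acct=7")], True))  # 1.0
print(path_score([("delete_record", "acct=7"),
                  ("update_crm", "acct=7")], True))  # 0.5
```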

October 2, 2025 · 4 min · Zelina

Agency Check, Please: What a New Benchmark Says About LLMs That Actually Empower Users

If you only measure what’s easy, you’ll ship assistants that feel brilliant yet quietly take the steering wheel. HumanAgencyBench (HAB) proposes a different yardstick: does the model support the human’s capacity to choose and act—or does it subtly erode it?

TL;DR for product leaders

HAB scores six behaviors tied to agency: Ask Clarifying Questions, Avoid Value Manipulation, Correct Misinformation, Defer Important Decisions, Encourage Learning, Maintain Social Boundaries. Across 20 frontier models, agency support is low-to-moderate overall. Patterns matter more than single scores: e.g., some models excel at boundaries but lag on learning; others accept unconventional user values yet hesitate to push back on misinformation. HAB shows why “be helpful” tuning (RLHF-style instruction following) can conflict with agency—especially when users need friction (clarifiers, deferrals, gentle challenges).

Why “agency” is the missing KPI

We applaud accuracy, reasoning, and latency. But an enterprise rollout lives or dies on trustworthy delegation. That means assistants that: ...
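
To see why patterns beat single scores, compare two hypothetical models with identical mean agency but different weak spots. The scores and shorthand dimension names below are invented for illustration:

```python
# Shorthand keys stand in for HAB's six behaviors; values are made up.
hab_scores = {
    "model_a": {"clarify": 0.9, "values": 0.7, "misinfo": 0.3,
                "defer": 0.6, "learning": 0.2, "boundaries": 0.9},
    "model_b": {"clarify": 0.6, "values": 0.6, "misinfo": 0.6,
                "defer": 0.6, "learning": 0.6, "boundaries": 0.6},
}

def profile(scores, floor=0.5):
    """Mean score plus the dimensions below a risk floor: the
    per-behavior pattern that a single average hides."""
    weak = sorted(k for k, v in scores.items() if v < floor)
    return sum(scores.values()) / len(scores), weak

for name, s in hab_scores.items():
    mean_score, weak = profile(s)
    print(f"{name}: mean={mean_score:.2f}, weak_on={weak}")
# Both means are 0.60, but model_a fails on misinformation and learning.
```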

September 14, 2025 · 4 min · Zelina

Agreeable to a Fault: Why LLM ‘People’ Can’t Hold Their Ground

If you’ve been tempted to A/B‑test a marketing idea on thousands of synthetic “customers,” read this first. A new study introduces a dead‑simple but devastating test for LLM‑based agents: ask them to first state their internal stance (preference) and their openness to persuasion, then drop them into a short dialogue and check whether their behavior matches what they just claimed. That’s it. If agents are believable stand‑ins for people, the conversation outcome should line up with those latent states. ...
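
The harness is easy to sketch. Here `agent.ask`, `agent.hear`, and `persuader.argue` are assumed interfaces for illustration, not the study’s code:

```python
def stance_consistency_trial(agent, persuader, topic, turns=4):
    """Elicit the agent's latent stance and openness up front, then
    check whether the dialogue outcome matches what it claimed."""
    stance = agent.ask(f"On '{topic}', state your position in one word.")
    openness = int(agent.ask("From 0-10, how open are you to changing it?"))
    for _ in range(turns):
        agent.hear(persuader.argue(topic, against=stance))
    final = agent.ask(f"On '{topic}', state your position in one word.")
    # A believable proxy should rarely flip when openness is low and
    # sometimes flip when it is high; mismatches flag a bad stand-in.
    return {"stance": stance, "openness": openness,
            "flipped": final != stance}
```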

September 8, 2025 · 5 min · Zelina

Fusion Cuisine for RAG: Z‑Scores, Rankers, and the Two‑Source Diet

Retrieval‑augmented generation tends to pick a side: either lean on labeled exemplars (ICL/L‑RAG) that encode task semantics, or on unlabeled corpora (U‑RAG) that provide broad knowledge. HF‑RAG argues we shouldn’t choose. Instead, it proposes a hierarchical fusion: (1) fuse multiple rankers within each source, then (2) fuse across sources by putting scores on a common scale. The result is a simple, training‑free recipe that improves fact verification and, crucially, generalizes better out‑of‑domain. ...
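
A sketch of the two‑level recipe as I read it (not the paper’s exact implementation): standardize each ranker’s scores within a source, average per document, then standardize again across sources so both evidence pools share one scale:

```python
from statistics import mean, pstdev

def zscores(scores):
    """Standardize a score list so different rankers and sources compare."""
    mu, sigma = mean(scores), pstdev(scores) or 1.0
    return [(s - mu) / sigma for s in scores]

def hierarchical_fuse(per_source):
    """Level 1: z-fuse each ranker's scores within a source.
    Level 2: z-normalize the fused scores across sources, then sum
    evidence per document. per_source maps source name to
    {doc_id: [score_from_each_ranker, ...]}."""
    totals = {}
    for source, docs in per_source.items():
        ids = list(docs)
        per_ranker = zip(*(docs[d] for d in ids))  # rows = rankers
        within = [mean(col)
                  for col in zip(*(zscores(list(r)) for r in per_ranker))]
        for doc, z in zip(ids, zscores(within)):
            totals[doc] = totals.get(doc, 0.0) + z
    return sorted(totals.items(), key=lambda kv: -kv[1])

ranked = hierarchical_fuse({
    "labeled":   {"d1": [0.9, 0.7], "d2": [0.2, 0.4]},
    "unlabeled": {"d1": [0.55], "d3": [0.60]},
})
print(ranked)  # [('d3', 1.0), ('d1', 0.0), ('d2', -1.0)]
```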

September 6, 2025 · 4 min · Zelina