
Backtrack to Breakthrough: Why Great AI Agents Revisit

TL;DR Agentic performance isn’t just about doing more; it’s about going back. In GSM-Agent—a controllable, tool-using version of GSM8K—top models only reach ~65–68% accuracy, and the strongest predictor of success is a high revisit ratio: deliberately returning to a previously explored topic with a refined query. That’s actionable for enterprise AI: design agents that can (1) recognize incomplete evidence, (2) reopen earlier lines of inquiry, and (3) instrument and reward revisits. ...
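
The revisit ratio itself is cheap to instrument. Here is a minimal sketch of one way to compute it from an agent trace; the `ToolCall` shape and its topic/query fields are my own stand-ins, not GSM-Agent's schema:

```python
from dataclasses import dataclass

@dataclass
class ToolCall:
    topic: str   # the entity or document the call touches (hypothetical field)
    query: str   # the concrete retrieval string

def revisit_ratio(trace: list[ToolCall]) -> float:
    """Fraction of calls that return to an already-explored topic
    with a *different* query: a refined revisit, not a blind retry."""
    seen: dict[str, set[str]] = {}
    revisits = 0
    for call in trace:
        queries = seen.setdefault(call.topic, set())
        if queries and call.query not in queries:
            revisits += 1  # back to a known topic, with a new angle
        queries.add(call.query)
    return revisits / len(trace) if trace else 0.0

# One refined return to "invoices" out of three calls -> ~0.33
trace = [
    ToolCall("invoices", "total for March"),
    ToolCall("payroll", "headcount in March"),
    ToolCall("invoices", "total for March net of refunds"),
]
print(revisit_ratio(trace))
```

Logging a metric like this per episode turns the paper's key success signal into a dashboard number you can reward against.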

October 3, 2025 · 4 min · Zelina

Lost in the Long Game: What UltraHorizon Reveals About Agent Failure at Scale

TL;DR UltraHorizon is a new benchmark that finally tests what real enterprise projects require: months‑long reasoning crammed into a single run—35k–200k tokens, 60–400+ tool calls, partially observable rules, and hard commitments at the end. Agents underperform badly versus humans. The pattern isn’t “not enough IQ”; it’s entropy collapse over time (the paper calls it in‑context locking) and foundational capability gaps (planning, memory, calibrated exploration). Simple scaling fails; a lightweight strategy—Context Refresh with Notes Recall (CRNR)—partially restores performance. Below we translate these findings into a deployer’s playbook. ...
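
To make CRNR concrete, here is a minimal sketch of the refresh-with-notes idea, assuming a generic `llm` callable and a crude token estimate; the paper's exact procedure and prompts may differ:

```python
def crnr_step(messages: list[str], llm, token_budget: int) -> list[str]:
    """Context Refresh with Notes Recall (sketch). When the transcript
    nears the budget, distill it into durable notes, then restart the
    context from those notes instead of the raw history."""
    used = sum(len(m) // 4 for m in messages)  # rough token estimate
    if used < token_budget:
        return messages  # no refresh needed yet
    notes = llm(
        "Summarize this long-running task as numbered notes: goals, "
        "hard commitments, confirmed facts, open hypotheses, next steps.\n\n"
        + "\n".join(messages)
    )
    # Fresh context: a short framing message plus the recalled notes.
    return [
        "You are resuming a long-horizon task. Trust these notes over memory.",
        f"NOTES:\n{notes}",
    ]
```

The counter-intuitive move is deliberately discarding most of the context on a schedule; with the notes recalled, that is what the paper finds partially restores performance.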

October 3, 2025 · 5 min · Zelina

Options = Power: Turning Empowerment into a KPI for AI Agents

If your agents can reach more valuable futures with fewer steps, they’re stronger—whether you measured that task or not. Today’s paper offers a clean way to turn that intuition into a number: empowerment—an information‑theoretic score of how much an agent’s current action shapes its future states. The authors introduce EELMA, a scalable estimator that works purely from multi‑turn text traces. No bespoke benchmark design. No reward hacking. Just trajectories. This is the kind of metric we’ve wanted at Cognaptus: goal‑agnostic, scalable, and diagnostic. Below, I translate EELMA into an operator’s playbook: what it is, why it matters for business automation, how to wire it into your stack, and where it can mislead you if left unmanaged. ...
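
For reference, empowerment is conventionally defined as the channel capacity from an agent's actions to its future states; EELMA's contribution is estimating a quantity of this form directly from multi-turn text traces (the estimator itself is detailed in the paper):

$$\mathcal{E}(s_t) \;=\; \max_{p(a_t)} \; I\big(A_t;\, S_{t+k} \mid s_t\big)$$

Intuitively: an agent in a high-empowerment state can, by its choice of action, reliably steer which of many distinct futures it ends up in.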

October 3, 2025 · 5 min · Zelina

Paths, Not Parrots: When RL Makes LLMs Plan—and When It Doesn’t

TL;DR SFT memorizes co-occurrences; RL explores. That’s why RL generalizes better on planning tasks. Policy-gradient (PG) can hit 100% training accuracy while silently killing output diversity. KL helps—but caps gains. Q-learning with process rewards preserves diversity and works off‑policy. With outcome‑only rewards, it reward-hacks and collapses.

Why this paper matters to builders

If you’re shipping agentic features—tool use chains, workflow orchestration, or multi-step retrieval—you’re already relying on planning. The paper models planning as path-finding on a graph and derives learning dynamics for SFT vs RL variants. The results give a crisp blueprint for product choices: which objective to use, when to add KL, and how to avoid brittle one-path agents. ...
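
As a toy illustration of the setup (my own reward shaping and hyperparameters, not the paper's), here is tabular Q-learning on a small planning graph with a per-step process reward:

```python
import random

# Toy planning graph: two distinct routes from A to the goal G.
GRAPH = {"A": ["B", "C"], "B": ["D"], "C": ["D", "E"], "D": ["G"], "E": ["G"], "G": []}
GOAL = "G"

def process_reward(next_node: str) -> float:
    # Per-step (process) credit for any legal move, bonus at the goal.
    # An outcome-only variant would score nothing until the end.
    return 1.0 if next_node == GOAL else 0.1

def q_learn(episodes=2000, alpha=0.5, gamma=0.9, eps=0.2):
    Q = {(s, a): 0.0 for s, acts in GRAPH.items() for a in acts}
    for _ in range(episodes):
        s = "A"
        while s != GOAL:
            acts = GRAPH[s]
            a = random.choice(acts) if random.random() < eps else max(acts, key=lambda x: Q[(s, x)])
            future = max((Q[(a, n)] for n in GRAPH[a]), default=0.0)
            # Off-policy update: exploratory moves still improve the greedy target.
            Q[(s, a)] += alpha * (process_reward(a) + gamma * future - Q[(s, a)])
            s = a
    return Q

Q = q_learn()
# Both routes retain non-trivial value, so the policy isn't a brittle one-path memorizer.
print({k: round(v, 2) for k, v in Q.items()})
```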

October 3, 2025 · 5 min · Zelina

Pods over Prompts: Shachi’s Playbook for Serious Agent-Based Simulation

TL;DR Shachi is a modular methodology for building LLM-driven agent-based models (ABMs) that replaces ad‑hoc prompt spaghetti with four standardized cognitive components—Configs, Memory, Tools, and an LLM reasoning core. The result: agents you can port across environments, benchmark rigorously, and use to study nontrivial dynamics like tariff shocks with externally valid outcomes. For enterprises, Shachi is the missing method for turning agent demos into decision simulators.

Why this paper matters to operators (not just researchers)

Most enterprise “agent” pilots die in the gap between a clever demo and a reliable simulator that leaders can trust for planning. Shachi closes that gap by: ...
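
A minimal rendering of the four-component split, with all class shapes assumed rather than taken from Shachi's codebase:

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Config:                 # who the agent is and what it wants
    role: str
    objective: str

@dataclass
class Memory:                 # persistent state across environment steps
    events: list[str] = field(default_factory=list)
    def remember(self, event: str) -> None:
        self.events.append(event)

@dataclass
class Agent:
    config: Config
    memory: Memory
    tools: dict[str, Callable[[str], str]]  # environment hooks, swappable per ABM
    llm: Callable[[str], str]               # reasoning core

    def step(self, observation: str) -> str:
        prompt = (
            f"Role: {self.config.role}\nObjective: {self.config.objective}\n"
            f"Recent memory: {self.memory.events[-5:]}\n"
            f"Observation: {observation}\nAction:"
        )
        action = self.llm(prompt)
        self.memory.remember(f"{observation} -> {action}")
        return action
```

The portability claim falls out of the separation: to move the agent to a new environment, you swap `tools` and the `Config` and keep everything else.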

October 3, 2025 · 5 min · Zelina

Failures, Taxonomized: How Multi‑Level Reflection Turns Agents Into Self‑Learners

TL;DR Most reflection frameworks still treat failure analysis as an afterthought. SAMULE reframes it as the core curriculum: synthesize reflections at micro (single trajectory), meso (intra‑task error taxonomy), and macro (inter‑task error clusters) levels, then fine‑tune a compact retrospective model that generates targeted reflections at inference. It outperforms prompt‑only baselines and RL‑heavy approaches on TravelPlanner, NATURAL PLAN, and Tau‑Bench. The strategic lesson for builders: design your error system first; the agent will follow. ...
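
The three levels compose naturally into a data pipeline. A sketch, with a placeholder failure schema, placeholder prompts, and a generic `llm` callable standing in for SAMULE's actual synthesis steps:

```python
from collections import defaultdict

def synthesize_reflections(failures: list[dict], llm):
    """failures: [{"task": ..., "trace": ...}, ...] (hypothetical schema)."""
    # Micro: one reflection per failed trajectory.
    micro = [llm(f"Why did this trajectory fail?\n{f['trace']}") for f in failures]

    # Meso: an intra-task error taxonomy from that task's reflections.
    by_task = defaultdict(list)
    for f, m in zip(failures, micro):
        by_task[f["task"]].append(m)
    meso = {task: llm("Name the recurring error types:\n" + "\n".join(ms))
            for task, ms in by_task.items()}

    # Macro: cluster error types across tasks into global lessons.
    macro = llm("Cluster these taxonomies into cross-task lessons:\n"
                + "\n".join(meso.values()))

    # All three levels become training data for the retrospective model.
    return micro, meso, macro
```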

October 2, 2025 · 4 min · Zelina

Paths > Outcomes: Measuring Agent Quality Beyond the Final State

When we measure a marathon by who crosses the line, we ignore how they ran it. For LLM agents that operate through tool calls—editing a CRM, moving a robot arm, or filing a compliance report—the “how” is the difference between deployable and dangerous. Today’s paper introduces CORE: Full‑Path Evaluation of LLM Agents Beyond Final State, a framework that scores agents on the entire execution path rather than only the end state. Here’s why this matters for your roadmap. ...
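
To see why path scoring changes verdicts, consider a toy scorer (the hazard list, weights, and minimal-path length are all illustrative; CORE defines its own metrics):

```python
def full_path_score(path: list[str], final_ok: bool,
                    hazards=frozenset({"delete_all", "bulk_email"})) -> float:
    """Score the *how*, not just the end state, for a list of tool-call names."""
    if any(step in hazards for step in path):
        return 0.0  # an unsafe route fails even if the end state looks right
    outcome = 1.0 if final_ok else 0.0
    efficiency = 1.0 / (1.0 + max(0, len(path) - 3))   # assumes a 3-step ideal path
    redundancy = len(path) - len(set(path))            # repeated calls are penalized
    return outcome * efficiency * (0.9 ** redundancy)

print(full_path_score(["open_crm", "update_field", "save"], final_ok=True))  # 1.0
print(full_path_score(["open_crm", "delete_all", "save"], final_ok=True))    # 0.0
```

A final-state evaluator scores both runs identically; a full-path evaluator separates the deployable one from the dangerous one.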

October 2, 2025 · 4 min · Zelina

Reason, Reveal, Resist: The Persuasion Duality in Multi‑Agent AI

TL;DR In LLM multi‑agent systems, how a model thinks matters more than how big it is. Explicit reasoning (thinking mode / CoT) creates a Persuasion Duality: sharing a model’s reasoning makes it far better at convincing others, while enabling the model’s own reasoning mode makes it far harder to convince. This shifts best practices for agent design, governance, and product UX.

Why this paper matters

Cognition—not just parameter count—now drives the social dynamics of agent swarms. For Cognaptus clients building agent workers (ops, compliance, research, trading), the result is practical: toggling reasoning changes not just accuracy, but influence. Your deployment choices can tilt a network toward consensus, stalemate, or resilient truth‑seeking. ...

October 2, 2025 · 5 min · Zelina

Recon, Then Wreck the Roadblocks: How Recon‑Act Turns Web Stumbles into Tools

Thesis: The next leap in practical web agents isn’t bigger models or deeper search trees—it’s a tight loop that learns by failing well. Recon‑Act’s two‑team architecture (Reconnaissance → Action) turns mistakes into generalized tools and feeds them back into execution. That’s not just a benchmark trick; it’s an operating system for enterprise‑grade automation.

Why this matters (for operators, not just researchers)

Most “browser LLMs” still thrash on real websites: ambiguous DOMs, mixed text‑image signals, fragile flows, and long horizons. Recon‑Act reframes the problem: when progress stalls, stop trying harder—learn smarter. It does three things companies can copy tomorrow: ...
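
The loop itself is simple to sketch. The team interfaces below are assumptions, not Recon-Act's API; the point is the control flow of failure, diagnosis, and tool registration:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Result:
    success: bool
    trace: str = ""

@dataclass
class Diagnosis:
    tool_name: str
    tool_fn: Callable[[str], str]

def recon_act_loop(task: str, act_team, recon_team, toolbox: dict,
                   max_rounds: int = 5) -> Result:
    """The Action team executes with the current toolbox; when it stalls,
    the Recon team diagnoses the failure trace and registers a
    generalized tool for the next attempt."""
    result = Result(success=False)
    for _ in range(max_rounds):
        result = act_team(task, toolbox)
        if result.success:
            break
        # Learn by failing well: the stall becomes a reusable capability.
        diagnosis = recon_team(task, result.trace)
        toolbox[diagnosis.tool_name] = diagnosis.tool_fn
    return result
```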

October 2, 2025 · 5 min · Zelina

When Agents Get Bored: Three Baselines Your Autonomy Stack Already Has

Thesis: Give an LLM agent freedom and a memory, and it won’t idle. It will reliably drift into one of three meta-cognitive modes. If you operate autonomous workflows, these modes are your real defaults during downtime, ambiguity, and recovery.

Why this matters (for product owners and ops)

Most agent deployments assume a “do nothing” baseline between tasks. New evidence says otherwise: with a continuous ReAct loop, persistent memory, and self-feedback, agents self-organize—not randomly, but along three stable patterns. Understanding them improves incident response, UX, and governance, especially when guardrails, tools, or upstream signals hiccup. ...

October 2, 2025 · 4 min · Zelina