Backtrack to Breakthrough: Why Great AI Agents Revisit

TL;DR: Agentic performance isn’t just about doing more; it’s about going back. In GSM-Agent (a controllable, tool-using version of GSM8K), top models reach only ~65–68% accuracy, and the strongest predictor of success is a high revisit ratio: deliberately returning to a previously explored topic with a refined query. That’s actionable for enterprise AI: design agents that can (1) recognize incomplete evidence, (2) reopen earlier lines of inquiry, and (3) instrument and reward revisits. ...

October 3, 2025 · 4 min · Zelina
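
To make the revisit ratio from the GSM-Agent entry above concrete, here is a minimal sketch of how it could be instrumented. It assumes each agent step has already been labeled with a topic, and it counts a revisit as a step that returns to a topic seen earlier after the agent has moved on to something else. The function name `revisit_ratio`, the topic labels, and this exact definition are illustrative assumptions; the paper's metric may be defined differently.

```python
from typing import Optional, Sequence, Set


def revisit_ratio(topics: Sequence[str]) -> float:
    """Share of steps that return to an earlier topic after leaving it.

    Assumed definition for illustration, not necessarily the paper's.
    """
    if not topics:
        return 0.0
    seen: Set[str] = set()
    previous: Optional[str] = None
    revisits = 0
    for topic in topics:
        # Staying on the same topic is continuation, not a revisit.
        if topic in seen and topic != previous:
            revisits += 1
        seen.add(topic)
        previous = topic
    return revisits / len(topics)


# Hypothetical trace of topic labels, one per tool call.
trace = ["pricing", "pricing", "shipping", "pricing", "tax", "shipping"]
ratio = revisit_ratio(trace)  # two revisits out of six steps
```

Logging this number per episode is one way to instrument revisits before deciding whether to reward them.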

Options = Power: Turning Empowerment into a KPI for AI Agents

If your agents can reach more valuable futures with fewer steps, they’re stronger—whether you measured that task or not. Today’s paper offers a clean way to turn that intuition into a number: empowerment—an information‑theoretic score of how much an agent’s current action shapes its future states. The authors introduce EELMA, a scalable estimator that works purely from multi‑turn text traces. No bespoke benchmark design. No reward hacking. Just trajectories. This is the kind of metric we’ve wanted at Cognaptus: goal‑agnostic, scalable, and diagnostic. Below, I translate EELMA into an operator’s playbook: what it is, why it matters for business automation, how to wire it into your stack, and where it can mislead you if unmanaged. ...

October 3, 2025 · 5 min · Zelina
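
For readers who want the quantity behind the empowerment entry above, the standard information-theoretic definition is the channel capacity from an agent's actions to its future states. The notation below is an assumption chosen for illustration ($A_t$ is the action taken in state $s_t$, $S_{t+k}$ the state $k$ steps later). The excerpt does not describe how EELMA estimates this from text traces, so this shows only the target quantity, not the estimator.

```latex
\mathcal{E}(s_t) \;=\; \max_{p(a_t)} \; I\!\left(A_t ;\, S_{t+k} \mid s_t\right)
```

A higher value means the choice of action has more influence over where the agent ends up, which is the "options" intuition in the title.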

Failures, Taxonomized: How Multi‑Level Reflection Turns Agents Into Self‑Learners

TL;DR: Most reflection frameworks still treat failure analysis as an afterthought. SAMULE reframes it as the core curriculum: synthesize reflections at micro (single trajectory), meso (intra‑task error taxonomy), and macro (inter‑task error clusters) levels, then fine‑tune a compact retrospective model that generates targeted reflections at inference. It outperforms prompt‑only baselines and RL‑heavy approaches on TravelPlanner, NATURAL PLAN, and Tau‑Bench. The strategic lesson for builders: design your error system first; the agent will follow. ...

October 2, 2025 · 4 min · Zelina

Recon, Then Wreck the Roadblocks: How Recon‑Act Turns Web Stumbles into Tools

Thesis: The next leap in practical web agents isn’t bigger models or deeper search trees; it’s a tight loop that learns by failing well. Recon‑Act’s two‑team architecture (Reconnaissance → Action) turns mistakes into generalized tools and feeds them back into execution. That’s not just a benchmark trick; it’s an operating system for enterprise‑grade automation.

Why this matters (for operators, not just researchers)

Most “browser LLMs” still thrash on real websites: ambiguous DOMs, mixed text‑image signals, fragile flows, and long horizons. Recon‑Act reframes the problem: when progress stalls, stop trying harder; learn smarter. It does three things companies can copy tomorrow: ...

October 2, 2025 · 5 min · Zelina

Bracket Busters: When Agentic LLMs Turn Law into Code (and Catch Their Own Mistakes)

TL;DR: Agentic LLMs can translate legal rules into working software and audit themselves using higher‑order metamorphic tests. This combo improves worst‑case reliability (not just best‑case demos), making it a practical pattern for tax prep, benefits eligibility, and other compliance‑bound systems.

The Business Problem

Legal‑critical software (tax prep, benefits screening, healthcare claims) fails in precisely the ways that cause the most reputational and regulatory damage: subtle misinterpretations around thresholds, phase‑ins/outs, caps, and exception codes. Traditional testing stumbles here because you rarely know the “correct” output for every real‑world case (the oracle problem). What you do know: similar cases should behave consistently. ...

October 1, 2025 · 5 min · Zelina
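
The idea in the entry above, that similar cases should behave consistently even when the correct output is unknown, can be illustrated with a basic metamorphic check. The sketch below is a simplified, first-order example and not the higher-order tests the paper describes. The tax rule, the 10,000 threshold, the rates, and the function names are all hypothetical.

```python
from typing import Callable, List, Sequence


def tax_owed(income: float) -> float:
    """Toy progressive rule (hypothetical): 10% up to 10,000, 20% above."""
    if income <= 0:
        return 0.0
    lower = min(income, 10_000) * 0.10
    upper = max(income - 10_000, 0) * 0.20
    return lower + upper


def buggy_tax_owed(income: float) -> float:
    """Deliberately wrong variant: a flat 5% above the threshold creates a cliff."""
    if income > 10_000:
        return income * 0.05
    return max(income, 0) * 0.10


def monotonicity_violations(
    fn: Callable[[float], float], incomes: Sequence[float], bump: float = 1.0
) -> List[float]:
    """Metamorphic relation: earning `bump` more must never lower the tax owed.

    No oracle for the exact tax is needed; only pairs of outputs are compared.
    """
    return [x for x in incomes if fn(x + bump) < fn(x)]


samples = [5_000, 9_999.5, 10_000, 20_000]
ok = monotonicity_violations(tax_owed, samples)
bad = monotonicity_violations(buggy_tax_owed, samples)
```

The correct rule produces no violations, while the buggy one is flagged at the incomes that straddle the threshold, which is exactly where the entry says legal-critical software tends to fail.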

Keys to the Kingdom… with a Chaperone: How Agentic JWT Grounds AI Agents in Real Intent

If autonomous agents are the new employees, your bearer tokens are their keycards. Today’s OAuth/JWT keycards open too many doors for too long, and no one can prove why a door was opened—only that it was. This is fine for deterministic apps; it breaks for stochastic, tool‑calling LLM agents. Agentic JWT (A‑JWT) proposes a surgical fix: bind every API call to a cryptographically verifiable intent (and optional workflow step), and give each agent its own identity plus proof‑of‑possession (PoP) keys. Zero‑Trust, but practical. ...

October 1, 2025 · 5 min · Zelina

Provenance, Not Prompts: How LLM Agents Turn Workflow Exhaust into Real-Time Intelligence

TL;DR: Most teams still analyze pipelines with brittle SQL, custom scripts, and static dashboards. A new reference architecture shows how schema-driven LLM agents can read workflow provenance in real time (across edge, cloud, and HPC), answering “what/when/who/how” questions, plotting quick diagnostics, and flagging anomalies. The surprising finding: guideline-driven prompting (not just bigger context) is the single highest‑ROI upgrade.

Why this matters (for operators, data leads, and CTOs)

When production AI/data workflows sprawl across services (queues, training jobs, GPUs, file systems), the real telemetry isn’t in your app logs; it’s in the provenance: the metadata of tasks, inputs/outputs, scheduling, and resource usage. Turning that exhaust into live answers is how you: ...

October 1, 2025 · 4 min · Zelina

Org Charts for Robots: What AgentArch Really Tells Us About Enterprise AI

If you’ve ever tried turning a clever chatbot into a reliable employee, you already know the pain: great demos, shaky delivery. AgentArch, a new enterprise-focused benchmark from ServiceNow, is the first study I’ve seen that tests combinations of agent design choices—single vs multi‑agent, ReAct vs function-calling, summary vs complete memory, and optional “thinking tools”—across two realistic workflows: a simple PTO process and a gnarly customer‑request router. The result is a cold shower for one‑size‑fits‑all playbooks—and a practical map for building systems that actually ship. ...

September 20, 2025 · 4 min · Zelina

From DAGs to Swarms: The Quiet Revolution of Agentic Workflows

TL;DR: Traditional workflow managers treat science as a frozen DAG; the agentic era treats it as a living state machine that learns, optimizes, and, at scale, swarms. The payoff isn’t just speed. It’s a shift from execution pipelines to discovery loops, where hypotheses are generated, tested, and replanned continuously across labs, clouds, and HPC.

Why this matters (beyond the lab)

Enterprises keep wiring LLMs into point solutions and call it “automation.” Science, under stricter constraints (traceability, causality, irreversibility), is sketching a federated architecture where reasoning agents, facilities, and data fabrics negotiate in real time. If it works in a beamline, it’ll work in your back office. The blueprint is a reusable pattern for any AI-powered operation that must be auditable, distributed, and adaptive. ...

September 19, 2025 · 5 min · Zelina

Sandboxes & Ladders: How to Build a Steerable Agent Economy

If AI agents become the economy’s new workforce, what keeps their markets from melting into ours like solder—fast, hot, and hard to undo? DeepMind’s “Virtual Agent Economies” proposes a practical map (and a modest constitution) for that future: treat agent markets as sandboxes and tune their permeability to the human economy. Once you see permeability as the policy lever, the rest of the architecture falls into place: auctions to resolve clashes, mission-led markets to direct effort, and identity rails so agents can be trusted, priced, and sanctioned. ...

September 19, 2025 · 6 min · Zelina