
From Prototype to Profit: How IBM's CUGA Redefines Enterprise Agents

When AI agents first emerged as academic curiosities, they promised a future of autonomous systems capable of navigating apps, websites, and APIs as deftly as humans. Yet most of these experiments never left the lab. The jump from benchmark to boardroom—the point where AI must meet service-level agreements, governance rules, and cost-performance constraints—remained elusive. IBM’s recent paper, From Benchmarks to Business Impact, finally brings data to that missing bridge.

The Benchmark Trap

Generalist agents such as AutoGen, LangGraph, and Operator have dazzled the research community with their ability to orchestrate tasks across multiple tools. But academic triumphs often hide operational fragility. Benchmarks like AppWorld or WebArena measure intelligence; enterprises measure ROI. They need systems that are reproducible, auditable, and policy-compliant—not just clever. ...

November 2, 2025 · 4 min · Zelina

Org Charts for Robots: What AgentArch Really Tells Us About Enterprise AI

If you’ve ever tried turning a clever chatbot into a reliable employee, you already know the pain: great demos, shaky delivery. AgentArch, a new enterprise-focused benchmark from ServiceNow, is the first study I’ve seen that tests combinations of agent design choices—single vs multi‑agent, ReAct vs function-calling, summary vs complete memory, and optional “thinking tools”—across two realistic workflows: a simple PTO process and a gnarly customer‑request router. The result is a cold shower for one‑size‑fits‑all playbooks—and a practical map for building systems that actually ship. ...
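To give a sense of the scale of that design grid, here is a small sketch of the configuration space those choices imply. The class and field names are mine, not ServiceNow's API, and serve only to illustrate the combinations being compared.

```python
from dataclasses import dataclass
from itertools import product

# Hypothetical illustration of the design dimensions named above:
# topology, acting style, memory strategy, and optional thinking tools.
@dataclass(frozen=True)
class AgentConfig:
    topology: str         # "single" or "multi"
    acting_style: str     # "react" or "function_calling"
    memory: str           # "summary" or "complete"
    thinking_tools: bool  # optional reasoning/scratchpad tools

GRID = [
    AgentConfig(topology, acting, memory, think)
    for topology, acting, memory, think in product(
        ("single", "multi"),
        ("react", "function_calling"),
        ("summary", "complete"),
        (False, True),
    )
]

if __name__ == "__main__":
    # 2 x 2 x 2 x 2 = 16 configurations to evaluate per workflow
    # (the simple PTO process and the customer-request router).
    print(f"{len(GRID)} agent configurations to evaluate")
```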

September 20, 2025 · 4 min · Zelina

Prolog & Paycheck: When Tax AI Shows Its Work

TL;DR

Neuro‑symbolic architecture (LLMs + Prolog) turns tax calculation from vibes to verifiable logic. The paper we analyze shows that adding a symbolic solver, selective refusal, and exemplar‑guided parsing can lower the break‑even cost of an AI tax assistant to a fraction of average U.S. filing costs. Even more interesting: chat‑tuned models often beat reasoning‑tuned models at few‑shot translation into logic — a counterintuitive result with big product implications.

Why this matters for operators (not just researchers)

Most back‑office finance work is a chain of (1) rules lookup, (2) calculations, and (3) audit trails. Generic LLMs are great at (1), decent at (2), and historically bad at (3). This work shows a practical path to auditable automation: translate rules and facts into Prolog, compute with a trusted engine, and price the risk of being wrong directly into your product economics. ...
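To make that pipeline concrete, here is a minimal sketch of the "LLM translates, Prolog computes" loop, assuming SWI-Prolog with the pyswip Python bindings. The filer, predicate names, and two-bracket rule are invented for illustration; they are not the paper's rule base or toolchain.

```python
# Minimal sketch: facts and rules an LLM would emit, executed by a trusted
# Prolog engine so every number traces back to an inspectable clause.
# Requires SWI-Prolog and `pip install pyswip` (an assumed toolchain).
from pyswip import Prolog

prolog = Prolog()

# Facts the LLM would extract from the filer's documents (hypothetical names).
prolog.assertz("filing_status(alex, single)")
prolog.assertz("taxable_income(alex, 42000)")

# A deliberately simplified two-bracket rule translated into logic:
# 10% up to 11,000, then 12% on the remainder. A real bracket table
# would be asserted the same way, one clause per bracket.
prolog.assertz("tax_owed(P, T) :- taxable_income(P, I), I =< 11000, T is I * 0.10")
prolog.assertz("tax_owed(P, T) :- taxable_income(P, I), I > 11000, "
               "T is 11000 * 0.10 + (I - 11000) * 0.12")

# The symbolic engine does the arithmetic; the audit trail is the rule base itself.
for solution in prolog.query("tax_owed(alex, T)"):
    print(f"Tax owed: {solution['T']:.2f}")
```

Selective refusal then becomes a policy layer on top: if the query fails or the extracted facts are incomplete, the assistant declines rather than guessing.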

August 31, 2025 · 5 min · Zelina