Org Charts for Robots: What AgentArch Really Tells Us About Enterprise AI

If you’ve ever tried turning a clever chatbot into a reliable employee, you already know the pain: great demos, shaky delivery. AgentArch, a new enterprise-focused benchmark from ServiceNow, is the first study I’ve seen that tests combinations of agent design choices—single vs multi‑agent, ReAct vs function-calling, summary vs complete memory, and optional “thinking tools”—across two realistic workflows: a simple PTO process and a gnarly customer‑request router. The result is a cold shower for one‑size‑fits‑all playbooks—and a practical map for building systems that actually ship. ...

September 20, 2025 · 4 min · Zelina

Prolog & Paycheck: When Tax AI Shows Its Work

TL;DR Neuro‑symbolic architecture (LLMs + Prolog) turns tax calculation from vibes to verifiable logic. The paper we analyze shows that adding a symbolic solver, selective refusal, and exemplar‑guided parsing can lower the break‑even cost of an AI tax assistant to a fraction of average U.S. filing costs. Even more interesting: chat‑tuned models often beat reasoning‑tuned models at few‑shot translation into logic, a counterintuitive result with big product implications.

Why this matters for operators (not just researchers)

Most back‑office finance work is a chain of (1) rules lookup, (2) calculations, and (3) audit trails. Generic LLMs are great at (1), decent at (2), and historically bad at (3). This work shows a practical path to auditable automation: translate rules and facts into Prolog, compute with a trusted engine, and price the risk of being wrong directly into your product economics. ...

August 31, 2025 · 5 min · Zelina