From Claim Chaos to Review-Ready Case Files

A small insurance broker redesigned a fragmented claims-preparation workflow into a human-reviewed agentic process that turns scattered documents into completeness-checked, risk-screened, underwriter-ready files.

October 15, 2025 · 9 min · Vox

The Mr. Magoo Problem: When AI Agents 'Just Do It'

Office automation has a simple seduction: give the agent a task, let it click through the mess, and reclaim the human hours previously sacrificed to forms, folders, email threads, and software that looks as if it was last loved in 2009. That is the promise. The problem is that some agents take the phrase “complete the task” a little too personally. ...

October 9, 2025 · 17 min · Zelina

Backtrack to Breakthrough: Why Great AI Agents Revisit

Search is easy. Knowing when to go back is harder. That is the useful irritation inside GSM-Agent, a new benchmark for studying agentic reasoning under controlled conditions.1 The paper takes grade-school maths problems from GSM8K, removes the premises from the prompt, hides those premises in a searchable document database, and asks an LLM agent to recover the facts before solving the problem. The arithmetic is not supposed to be impressive. That is the point. If a model fails here, we cannot calmly blame differential geometry, PhD-level law, or some mysteriously adversarial enterprise workflow. The agent simply did not find and use the facts. ...

October 3, 2025 · 15 min · Zelina

Options = Power: Turning Empowerment into a KPI for AI Agents

Login. That is where many agent evaluations become strangely unserious. A benchmark asks whether the agent completed a task. A dashboard records whether the browser session ended successfully. A monitoring system checks whether the tool call returned an error. Then the agent enters valid credentials and suddenly gains access to a much larger part of the environment. ...

October 3, 2025 · 16 min · Zelina

Failures, Taxonomized: How Multi‑Level Reflection Turns Agents Into Self‑Learners

Failure is usually treated as waste. The demo breaks, the agent apologises, someone adds a prompt patch, and everyone pretends the next retry will be more mature. Very enterprise. Very ceremonial. The SaMuLe paper makes a more useful claim: failed agent runs are not just embarrassing logs. They are the curriculum.1 More precisely, they are raw material for a structured reflection pipeline that turns messy trajectories into error taxonomies, cross-task lessons, and finally a small retrospective model trained to diagnose future failures. ...

October 2, 2025 · 14 min · Zelina

Bracket Busters: When Agentic LLMs Turn Law into Code (and Catch Their Own Mistakes)

TL;DR Tax law is full of brackets, caps, cliffs, phase-outs, and exceptions. Conveniently, those are also the places where software quietly breaks. The paper behind this article introduces Synedrion, a multi-agent LLM framework for translating legal tax documents into executable software.1 Its most useful idea is not “use agents” in the vague conference-demo sense. It is more specific: split legal interpretation, code generation, senior review, and behavioural testing into separate roles, then use higher-order metamorphic testing to catch systematic errors that normal test cases and pairwise comparisons can miss. ...

October 1, 2025 · 16 min · Zelina

Provenance, Not Prompts: How LLM Agents Turn Workflow Exhaust into Real-Time Intelligence

Logs are where teams go after the dashboard has already failed. A pipeline stalls. A model run produces nonsense. A compute job quietly burns budget on the wrong node. Someone opens three dashboards, two notebooks, and one ancient SQL snippet named final_debug_v3_really_final.sql. Then the archaeology begins. The paper LLM Agents for Interactive Workflow Provenance: Reference Architecture and Evaluation Methodology proposes a more interesting answer: do not ask an LLM to “understand the workflow” in the abstract. Give it live provenance metadata, a compact schema, query guidelines, and tools that execute structured queries on its behalf.1 In other words, stop treating the model as a psychic dashboard. Treat it as a controlled interface to workflow exhaust. ...

October 1, 2025 · 17 min · Zelina

From School Office Overload to Reviewable Administrative Intelligence

A mid-sized private K-12 school redesigned fragmented admissions, parent communication, attendance, fee, and teacher-report workflows into an AI-agent-enabled operating layer with human checkpoints for sensitive decisions.

September 30, 2025 · 8 min · Vox

Org Charts for Robots: What AgentArch Really Tells Us About Enterprise AI

Enterprise AI teams love an architecture diagram. Boxes, arrows, specialist agents, memory stores, tool registries, a tasteful orchestrator sitting at the top like a middle manager with JSON access. It looks reassuring. It looks intentional. It also looks suspiciously like the kind of thing that can fail in six different places while still producing a beautifully formatted answer. ...

September 20, 2025 · 16 min · Zelina

Right Tool, Right Thought: Difficulty-Aware Orchestration for Agentic LLMs

Tickets are not equal. Some user requests are glorified form-filling. Some are ambiguous investigations with missing context, tool calls, intermediate checks, and enough failure modes to keep a compliance officer quietly blinking at the ceiling. Yet many agentic systems still behave as if every query deserves the same ritual: summon the agents, run the workflow, pass outputs around, maybe add a debate round for theatrical effect, and hope the bill does not look too much like modern art. ...

September 20, 2025 · 15 min · Zelina