Reliability

Prompt and Circumstance: Why One Accuracy Number Is Not a Reliability Audit

Opening: Why this matters now. The AI market has learned to worship benchmark tables with the solemnity once reserved for quarterly earnings. One model is up two points on MMLU, another is slightly better at reasoning, a third is cheaper, smaller, faster, and therefore apparently ready to run your compliance workflow by Tuesday. ...

The Reliability Gap: Why Smarter AI Agents Still Fail When It Matters

A customer service agent gets the refund policy right on Monday, wrong on Tuesday, and confidently wrong on Wednesday. A coding agent passes the benchmark, then casually rewrites the wrong file in production. A workflow agent behaves perfectly in a demo, then becomes confused when the API returns the same fields in a different order. ...

World Models Meet the Office From Hell

Office software has a special talent: it says “success” at the exact moment something has gone wrong somewhere else. A ticket is updated. A role is assigned. An asset is transferred. The API returns a cheerful confirmation. The agent, bless its silicon heart, declares victory. Then a background workflow fires. A user’s clearance changes. Another workflow reacts to that clearance change. A different record is silently updated. A constraint is now violated. The agent does not notice, because the agent saw the office equivalent of a green checkmark and mistook it for reality. ...

Fault, Interrupted: How RIFT Reinvents Reliability for the LLM Hardware Era

A chip does not need to fail everywhere to fail badly. A modern AI accelerator is not fragile in the poetic sense. It is not a porcelain teacup trembling on the edge of a desk. It is much more annoying than that. It can run billions of parameters at high throughput, survive ordinary engineering noise, and still contain a few small fault locations where one carefully placed disturbance can turn a capable model into expensive decorative silicon. The problem is not that every bit matters equally. The problem is that a few bits may matter absurdly more than the rest. ...

Small Gains, Long Games: Why Tiny Accuracy Bumps Explode into Big Execution Wins

A workflow does not fail because the first step is hard. It fails because the seventeenth step is boring, the twenty-third step depends on a slightly wrong state, and by the thirty-first step the agent is confidently building on its own rubbish. Very enterprise. Very scalable. Very expensive. The paper behind this article, The Illusion of Diminishing Returns: Measuring Long Horizon Execution in LLMs, makes a deceptively simple point: judging LLM progress by short-task accuracy can badly understate the value of reliability gains over long workflows.[1] A model that improves only slightly on a single step may become dramatically better at completing long sequences without failure. That is not motivational poster mathematics. It is compounding. ...
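The compounding is easy to check by hand. The sketch below assumes, purely for illustration, that every step succeeds independently with the same probability, which real agents will not do exactly. The function names and the example accuracies (99% and 99.5%) are made up for this sketch and are not figures from the paper.

```python
import math


def chain_success(step_accuracy: float, steps: int) -> float:
    """Chance that every one of `steps` steps succeeds.

    Assumption (ours): steps are independent and equally accurate.
    """
    return step_accuracy ** steps


def steps_until(step_accuracy: float, target: float = 0.5) -> float:
    """Chain length at which the overall success rate drops to `target`."""
    return math.log(target) / math.log(step_accuracy)


if __name__ == "__main__":
    for accuracy in (0.99, 0.995):
        print(
            f"step accuracy {accuracy:.3f}: "
            f"100-step success {chain_success(accuracy, 100):.3f}, "
            f"50% horizon {steps_until(accuracy):.0f} steps"
        )
```

Under these assumptions, half a point of single-step accuracy lifts 100-step success from roughly 37% to roughly 61% and about doubles the chain length the agent can finish at coin-toss odds.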

Graph and Circumstance: Maestro Conducts Reliable AI Agents

A broken AI agent often looks deceptively close to working. It answers most questions. It calls the right tool sometimes. It follows the instruction until the conversation gets long, the retrieval query gets vague, or the arithmetic becomes just difficult enough for the model to start doing spreadsheet theatre. The usual repair is prompt editing. Add a stern sentence. Add a role. Add an example. Add “think step by step,” because apparently the machine needed a motivational poster. ...

The Butterfly Defect: Diagnosing LLM Failures in Tool-Agent Chains

TL;DR for operators: Most LLM agent failures are still discussed as if the model had a grand philosophical lapse: bad reasoning, weak planning, insufficient context, not enough “agenticness” sprinkled on top. This paper points to a less glamorous culprit: parameter filling. A tool-agent chain can fail because the model supplies the wrong field name, omits a required value, invents a value not present in the user request, misreads a tool return, or follows a type description that was wrong in the first place.[1] ...
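Some of these failures can be caught before the tool is ever called. The following is a minimal sketch, not the paper's method: a hypothetical `Param` schema and `check_call` helper that flag wrong field names, missing required values, and wrong types. The refund tool and its fields are invented for the example. Invented values and misread tool returns need different checks and are not covered here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Param:
    """Hypothetical description of one tool parameter."""
    type: type
    required: bool = True


def check_call(schema: dict[str, Param], args: dict[str, object]) -> list[str]:
    """Return a list of parameter-filling problems found in `args`."""
    problems = []
    # Field names the tool does not define (e.g. a misspelled or renamed key).
    for name in args:
        if name not in schema:
            problems.append(f"unknown field: {name}")
    # Required fields that are absent, and present fields of the wrong type.
    for name, spec in schema.items():
        if name not in args:
            if spec.required:
                problems.append(f"missing required field: {name}")
        elif not isinstance(args[name], spec.type):
            problems.append(
                f"wrong type for {name}: expected {spec.type.__name__}"
            )
    return problems


# Invented schema for an imaginary refund tool.
REFUND_SCHEMA = {
    "order_id": Param(str),
    "amount": Param(float),
    "reason": Param(str, required=False),
}

if __name__ == "__main__":
    model_args = {"orderId": "A-17", "amount": "25"}
    for problem in check_call(REFUND_SCHEMA, model_args):
        print(problem)
```

A check like this only helps when the schema itself is right. If the type description was wrong in the first place, the validator will faithfully enforce the mistake.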

Half-Life Crisis: Why AI Agents Fade with Time (and What It Means for Automation)

TL;DR for operators: AI agents may not simply “get worse” on longer tasks. A better mental model is that every additional unit of human-equivalent task time adds another chance for the agent to fail. If that chance is roughly constant, success falls exponentially. That turns a cheerful benchmark number into a much less cheerful deployment number. Under Toby Ord’s constant-hazard interpretation of METR’s long-task data, an agent’s 50% success time horizon is its “half-life”: the point where half of attempts still succeed and half have already failed.[1] The awkward part is what happens when a business needs 80%, 90%, or 99% reliability rather than a coin toss with better branding. ...
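The arithmetic behind that awkward part follows directly from the constant-hazard assumption: if success halves every half-life, then success after t minutes is 0.5 raised to t divided by the half-life. The sketch below is a rough illustration only. The function names and the 60-minute half-life are assumptions chosen for the example, not measurements of any real agent.

```python
import math


def success_probability(task_minutes: float, half_life_minutes: float) -> float:
    """Success rate on a task of the given length under a constant hazard."""
    return 0.5 ** (task_minutes / half_life_minutes)


def horizon_for_reliability(reliability: float, half_life_minutes: float) -> float:
    """Longest task length at which success stays at or above `reliability`."""
    return half_life_minutes * math.log(reliability) / math.log(0.5)


if __name__ == "__main__":
    half_life = 60.0  # assumed 50% time horizon, in minutes
    for reliability in (0.5, 0.8, 0.9, 0.99):
        minutes = horizon_for_reliability(reliability, half_life)
        print(f"{reliability:.0%} reliability: tasks up to {minutes:.1f} minutes")
```

With a one-hour half-life, the 80% horizon is about 19 minutes, the 90% horizon about 9 minutes, and the 99% horizon under one minute. The ratio to the half-life does not depend on the agent, only on the reliability demanded.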