Survival Analysis

TL;DR for operators

AI agents may not simply “get worse” on longer tasks. A better mental model is that every additional unit of human-equivalent task time adds another chance for the agent to fail. If that chance is roughly constant, success falls exponentially. That turns a cheerful benchmark number into a much less cheerful deployment number. Under Toby Ord’s constant-hazard interpretation of METR’s long-task data, an agent’s 50% success time horizon is its “half-life”: the point where half of attempts still succeed and half have already failed.[1] The awkward part is what happens when a business needs 80%, 90%, or 99% reliability rather than a coin toss with better branding. ...
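To make the half-life framing concrete, here is a minimal Python sketch of the constant-hazard arithmetic. It assumes success decays as 0.5 ** (t / half_life), which is what a constant failure rate implies. The function names and the 60-minute half-life are hypothetical, chosen only for illustration; they are not METR's or Ord's code or figures.

```python
import math


def success_probability(task_time, half_life):
    """Probability of finishing a task of length task_time.

    Assumes a constant hazard rate, so survival halves every half_life.
    Both arguments must use the same time unit.
    """
    return 0.5 ** (task_time / half_life)


def horizon_at_reliability(reliability, half_life):
    """Longest task length at which success probability is still `reliability`.

    Solves 0.5 ** (t / half_life) == reliability for t.
    """
    if not 0.0 < reliability < 1.0:
        raise ValueError("reliability must be strictly between 0 and 1")
    return half_life * math.log(reliability) / math.log(0.5)


if __name__ == "__main__":
    half_life = 60.0  # hypothetical 50% time horizon, in minutes
    for reliability in (0.5, 0.8, 0.9, 0.99):
        horizon = horizon_at_reliability(reliability, half_life)
        print(f"{reliability:.0%} reliability: {horizon:.1f} min")
```

With the illustrative 60-minute half-life, the horizon shrinks to roughly 19 minutes at 80% reliability, about 9 minutes at 90%, and under 1 minute at 99%. The ratios do not depend on the half-life chosen: under this model the 80% horizon is always about a third of the 50% horizon, and the 99% horizon is about one seventieth of it.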