Half-Life Crisis: Why AI Agents Fade with Time (and What It Means for Automation)

TL;DR for operators

AI agents may not simply “get worse” on longer tasks. A better mental model is that every additional unit of human-equivalent task time adds another chance for the agent to fail. If that chance is roughly constant, success falls exponentially.

That turns a cheerful benchmark number into a much less cheerful deployment number. Under Toby Ord’s constant-hazard interpretation of METR’s long-task data, an agent’s 50% success time horizon is its “half-life”: the point where half of attempts still succeed and half have already failed.¹ The awkward part is what happens when a business needs 80%, 90%, or 99% reliability rather than a coin toss with better branding.

The paper’s rule of thumb is brutal but useful:

Required success rate	Approximate feasible task length under constant hazard
80%	$T_{80} \approx \frac{1}{3}T_{50}$
90%	$T_{90} \approx \frac{1}{7}T_{50}$
99%	$T_{99} \approx \frac{1}{70}T_{50}$
99.9%	$T_{99.9} \approx \frac{1}{700}T_{50}$

So if a model can complete a 70-minute task with 50% success, that does not mean it is ready for a 70-minute operational workflow. At 99% reliability, the constant-hazard model would push the safe autonomous span toward one minute. That is not a typo. It is what exponential decay does when politely invited into procurement.

The direct paper claim is modest: on METR’s research-engineering task suite, an extremely simple exponential survival model appears to explain much of the decline in AI-agent success as task duration increases. The Cognaptus inference is operational: businesses should stop asking only “Can the agent do the task?” and start asking “How long can it remain reliable at our required threshold, and where do we insert recovery?”

The practical design response is not despair. It is architecture. Break long workflows into verified stages. Add checkpoints. Use retries where retries are safe. Make failure local rather than global. Put humans at the points where error recovery matters most. The point is not that AI agents cannot be useful. The point is that unmanaged autonomy leaks reliability with time.

The dashboard number is not the deployment number

A familiar business scene: an AI vendor shows a benchmark slide. The agent completes difficult software, reasoning, or research tasks. The performance curve improves over time. The obvious conclusion is that longer autonomous work is arriving quickly.

That conclusion may be directionally right and operationally misleading.

The METR work that Ord builds on measured how long a task would take a human and then estimated the task length at which AI agents reach a given success probability.² The headline from that research was striking: frontier agents’ 50% success time horizon appeared to double roughly every seven months. A clean trend line, a neat forecasting handle, and enough optimism to power several conference panels without plugging them into the wall.

Ord’s paper asks a different question. If the 50% horizon is improving, what happens when we require a success rate that businesses actually care about?

A 50% success rate is valuable for forecasting model capability because it is comparatively easy to estimate. It is not, however, a normal operational target. Payroll systems, compliance workflows, customer refunds, infrastructure changes, security triage, procurement approvals, medical scheduling, and financial reporting do not usually aspire to “works half the time”. There are exceptions, mostly involving brainstorming and low-cost drafts. But for autonomous execution, 50% is not reliability. It is a coin flip with a user interface.

This is where the half-life framing matters. It gives a way to translate a benchmark horizon into higher reliability thresholds. The translation is not comforting, but it is far more useful than pretending the benchmark number speaks for itself.

The mechanism: one long task is many chances to fail

Ord’s core move is to treat AI-agent task completion as a survival problem.

In survival analysis, $S(t)$ is the probability that something has survived until time $t$. If the hazard rate is constant, then the chance of failure in the next moment is the same regardless of how far the process has already gone. That produces exponential decay:

$$ S(t) = e^{-\lambda t} $$

Here, $t$ is not wall-clock time. It is the amount of time a human would take to complete the task. That distinction matters. An agent may run for seconds, minutes, or hours in real time; the paper is about task duration measured in human-equivalent effort.

Under this model, the 50% success time horizon is literally the agent’s half-life:

$$ S(T_{50}) = 0.5 $$

The interpretation is simple. If an agent has a 50% chance of completing a one-day human-equivalent task, then it has about a 25% chance of completing two such chunks in sequence, assuming the same constant hazard. Not because the second day is magical. Because success requires surviving both days.
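
A minimal sketch of that arithmetic, assuming the constant-hazard model holds exactly. The function name `survival` and the 8-hour working day are illustrative choices, not taken from Ord’s paper.

```python
import math

def survival(t: float, t50: float) -> float:
    """Success probability for a task of human-equivalent length t,
    assuming a constant hazard rate and a 50% horizon of t50."""
    hazard = math.log(2) / t50  # lambda, fixed by S(t50) = 0.5
    return math.exp(-hazard * t)

one_day = 8.0  # hours; illustrative length of a "one-day" task
print(survival(one_day, t50=one_day))      # one half-life: about 0.5
print(survival(2 * one_day, t50=one_day))  # two half-lives: about 0.25
```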

That resolves a common confusion: “If an AI can do eight hours of work, why can’t it just do eight hours twice?” It can try. The issue is compound reliability. Two successful segments require success in segment one and segment two. If each has probability $p$ and the segments fail independently, the combined probability is $p^2$. The multiplication is where optimism goes to sober up.

The subtask mechanism is the business-relevant part. Long workflows are not single acts. They are chains: interpret instruction, retrieve context, choose tools, call APIs, inspect outputs, revise plans, notice inconsistencies, handle exceptions, and stop before doing something stupid. Each step may be individually likely to work. The whole chain can still become fragile.

That is the point. Long-horizon autonomy is not just “more intelligence”. It is error accumulation plus error recovery.

Why reliability thresholds collapse faster than intuition expects

The constant-hazard model lets us compare time horizons at different success probabilities. If $T_p$ is the task length at which success probability is $p$, then:

$$ T_p = \frac{\ln(p)}{\ln(0.5)}T_{50} $$

That formula is the small hinge on which the large operational door swings.
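
For readers who want the intermediate steps, the formula follows directly from the survival function under the same constant-hazard assumption: first solve for $\lambda$ using the 50% horizon, then solve for the task length at which survival equals $p$.

$$
\begin{aligned}
S(T_{50}) = e^{-\lambda T_{50}} = 0.5 \;&\Rightarrow\; \lambda = -\frac{\ln(0.5)}{T_{50}} \\
S(T_p) = e^{-\lambda T_p} = p \;&\Rightarrow\; T_p = -\frac{\ln(p)}{\lambda} = \frac{\ln(p)}{\ln(0.5)}T_{50}
\end{aligned}
$$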

For an 80% success rate:

$$ T_{80} = \frac{\ln(0.8)}{\ln(0.5)}T_{50} \approx 0.322T_{50} $$

So $T_{80}$ is roughly one-third of $T_{50}$. METR’s estimate for Claude 3.7 Sonnet had a 50% time horizon of 59 minutes and an 80% time horizon of 15 minutes. The observed ratio, 15/59 or about 0.25, sits a little below Ord’s theoretical constant-hazard estimate of 0.32 but is close to it given the noisiness of the measurement.

For higher thresholds, the compression becomes severe:

Threshold	Time-horizon ratio	Operational reading
$T_{80}$	$\approx 0.32T_{50}$	“Useful for supervised delegation if mistakes are cheap.”
$T_{90}$	$\approx 0.15T_{50}$	“Possible for bounded workflows with review.”
$T_{99}$	$\approx 0.014T_{50}$	“Autonomy must be very short or heavily checked.”
$T_{99.9}$	$\approx 0.0014T_{50}$	“Think system design, not raw model capability.”
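
The ratios in the table can be reproduced in a few lines. This is a sketch under the constant-hazard assumption; `horizon_ratio` is an illustrative name, and the 70-minute horizon is the example used earlier in the article, not a measured figure.

```python
import math

def horizon_ratio(p: float) -> float:
    """T_p / T_50 under constant hazard: ln(p) / ln(0.5)."""
    return math.log(p) / math.log(0.5)

t50_minutes = 70.0  # illustrative 50% horizon
for p in (0.8, 0.9, 0.99, 0.999):
    ratio = horizon_ratio(p)
    horizon = ratio * t50_minutes
    print(f"{p:.1%} target: ratio {ratio:.4f}, horizon {horizon:.1f} min")
```

Swapping in a different `t50_minutes` is the quickest way to translate a vendor’s 50% figure into the threshold a given workflow actually needs.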

This is the article’s central operational point. A model that looks impressive at 50% success can still be nowhere near deployable for tasks that require high reliability across many steps.

The paper also extends this into forecasting. If the underlying 50% time horizon doubles every seven months, then a higher reliability horizon reaches the same task length later. Ord’s illustrative estimates are roughly: $T_{80}$ reaches a given length about one year after $T_{50}$ does; $T_{90}$ about two years after; $T_{99}$ about four years after; and $T_{99.9}$ about six years after. These estimates rely on both the seven-month doubling pattern and the constant-hazard model. They are not a calendar prophecy. They are a reminder that reliability thresholds are not free.
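
The lag can be sketched as the number of doublings needed to close the gap between $T_p$ and $T_{50}$, multiplied by the doubling time. This assumes both the seven-month doubling and the constant-hazard ratio hold; `lag_months` is an illustrative name. The exact arithmetic lands slightly under the rounded figures quoted above (about 11, 19, 43, and 66 months).

```python
import math

def lag_months(p: float, doubling_months: float = 7.0) -> float:
    """Months until T_p reaches the length that T_50 has today,
    assuming constant hazard and a fixed doubling time."""
    ratio = math.log(p) / math.log(0.5)  # T_p / T_50
    doublings_needed = math.log2(1.0 / ratio)
    return doubling_months * doublings_needed

for p in (0.8, 0.9, 0.99, 0.999):
    months = lag_months(p)
    print(f"{p:.1%}: {months:.0f} months ({months / 12:.1f} years)")
```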

That matters for investment planning. A board may hear “agents can now do hour-long tasks” and infer near-term end-to-end automation. The better question is: “At what reliability, with what recovery structure, and at what cost of failure?”

What the evidence actually supports

Ord’s paper is not a new benchmark. It is a theoretical reinterpretation of METR’s benchmark results. That makes its evidence structure important.

Evidence or analysis	Likely purpose	What it supports	What it does not prove
METR’s long-task benchmark of software engineering, cybersecurity, reasoning, and ML tasks	Main empirical basis	AI-agent success declines as human-equivalent task duration increases; 50% time horizons can be estimated across agents	General reliability across all real-world business workflows
The 59-minute vs 15-minute Claude 3.7 Sonnet comparison	Main magnitude illustration	Higher reliability thresholds sharply reduce feasible task length	Exact ratios for every model or domain
Exponential curve fitted against METR’s log-logistic-style fit	Model comparison / theoretical simplification	A one-parameter exponential model appears roughly competitive and conceptually simpler	Formal statistical superiority over alternatives
Human performance curve	Exploratory comparison	Humans may decay more slowly over long task durations, possibly due to better correction or aggregation effects	A settled human-vs-agent mechanism
Discussion of mixed human ability and mixed task difficulty	Robustness caveat	Aggregated curves can have thicker tails than individual exponential curves	That the constant-hazard mechanism definitely holds at the individual task level

The most important distinction is between “fits roughly well” and “has been proven as the law of agent reliability”. Ord is careful on this point. He does not claim AI agents fail with a precisely constant rate per human-equivalent minute. He argues that something like this appears roughly or stochastically true on the available task suite, and that the data may not yet justify more complex assumptions.

That is exactly the right level of ambition. A simple model can be valuable before it becomes a universal law. In business, the first use of a model is often not perfect prediction. It is forcing better questions.

Here, the question becomes: is agent failure mainly a problem of capability, or a problem of compounding unrecovered local errors?

The two answers lead to different investments. If it is pure capability, wait for larger models. If it is compounding failure, redesign workflows now.

The curve is a diagnosis, not just a forecast

The half-life metaphor is memorable, but its diagnostic value matters more.

If an agent’s success curve follows exponential decay, that suggests the agent faces a roughly constant risk of failure per unit of human-equivalent task duration. That risk may come from many sources: misreading instructions, choosing the wrong tool, silently accepting a bad intermediate result, drifting from the goal, failing to notice contradictions, or being unable to repair an earlier mistake.

The paper’s interpretation is that tasks may behave like chains of subtasks where the whole task succeeds only if each component succeeds. This does not require the subtasks to be neatly equal in length or difficulty. The constant-hazard model can be read more generally: what matters is total exposure to opportunities for failure.

For operators, that changes where attention should go.

The naive procurement question is:

Can this agent complete a complex workflow?

The better operational question is:

How many unrecovered failure opportunities does this workflow expose the agent to, and which of them can we eliminate, detect, retry, or isolate?

This is why workflow design can improve reliability without waiting for the next model release. If the hazard rate is the enemy, then operational design has four levers:

Lever	What it changes	Example
Shorten exposure	Reduce autonomous task duration	Split a 90-minute workflow into reviewed 10-minute stages
Lower hazard	Make each step less failure-prone	Use structured inputs, constrained tools, schemas, and validation
Add recovery	Prevent local errors from becoming task failure	Check intermediate outputs, allow safe retries, compare against rules
Limit blast radius	Make failure local rather than global	Require approval before irreversible actions or external communication
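
A rough sketch of how the first and third levers interact, under deliberately optimistic assumptions: a checker catches every failed stage, a failed stage can be retried from a clean checkpoint, and attempts are independent. The 60-minute horizon and the function names are hypothetical.

```python
def stage_success(minutes: float, t50: float) -> float:
    """Single-attempt success for one stage under constant hazard."""
    return 0.5 ** (minutes / t50)

def workflow_success(total: float, stages: int, attempts: int, t50: float) -> float:
    """End-to-end success when the workflow is split into equal stages.

    Assumes a perfect checker after each stage, clean retries,
    and independent attempts (all optimistic simplifications).
    """
    s = stage_success(total / stages, t50)
    per_stage = 1.0 - (1.0 - s) ** attempts  # at least one attempt succeeds
    return per_stage ** stages

T50 = 60.0  # hypothetical agent: 50% horizon of 60 human-equivalent minutes
print(workflow_success(90, stages=1, attempts=1, t50=T50))  # one unchecked run
print(workflow_success(90, stages=9, attempts=1, t50=T50))  # staged, no retries
print(workflow_success(90, stages=9, attempts=3, t50=T50))  # staged, up to 3 attempts
```

In this sketch, staging without retries leaves the end-to-end number unchanged, because total exposure is the same. The gain comes from verified recovery at each checkpoint, which is why the levers work best in combination.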

None of this is glamorous. It is not the cinematic image of an agent independently running a department while humans sip espresso and pretend meetings are strategy. It is engineering. Dull, blessed engineering.

Humans may have thicker tails, and that is not a sentimental compliment

One of the more interesting parts of the paper is the comparison with human performance over longer tasks. Ord notes that the human survival curve seems to decay more slowly than a constant-hazard model would predict. In one example, if humans are around 50% at 1.5 hours, then constant hazard would halve that figure for every further 1.5 hours: roughly 25% at 3 hours, 6.25% at 6 hours, and under 0.5% at 12 hours. The plotted human performance remains above 20% at that longer horizon.

There are two possible readings.

The first is mechanistic: humans may be better at recovering from earlier failed subtasks. A person can notice “this path is wrong”, revisit assumptions, reinterpret ambiguous instructions, and repair a partially broken plan. Current agents often do some of this, but not reliably enough for long chains.

The second is statistical: the human curve may aggregate people with different ability levels. A mixture of exponential curves with different rates can decay more slowly than any single clean exponential curve. Ord also notes that a similar aggregation issue can arise across tasks: some tasks may simply be easier per unit time than others, producing a thicker-tailed aggregate curve.
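
The statistical reading is easy to check numerically. The sketch below mixes two exponential curves with hypothetical half-lives (0.5 and 6 hours, equal weights) and compares the mixture with a single exponential that shares its 50% horizon. The numbers are invented for illustration and are not fitted to METR’s human data.

```python
def single(t: float, half_life: float) -> float:
    """Exponential survival curve with the given half-life."""
    return 0.5 ** (t / half_life)

def mixture(t: float) -> float:
    """Equal mix of a weaker and a stronger group (hypothetical half-lives, hours)."""
    return 0.5 * single(t, 0.5) + 0.5 * single(t, 6.0)

def mixture_t50() -> float:
    """Find where the mixture crosses 50% by bisection (the curve is decreasing)."""
    lo, hi = 0.0, 6.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if mixture(mid) > 0.5:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

t50 = mixture_t50()
for k in (1, 2, 4, 8):
    t = k * t50
    print(f"{k} x T50: mixture {mixture(t):.3f}, single exponential {single(t, t50):.3f}")
```

At multiples of the shared 50% horizon the mixture stays well above the single curve, because the stronger group dominates the tail once the weaker group has mostly failed.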

For business readers, the cautious interpretation is enough. Human review is not valuable merely because humans are morally reassuring or good at signing PDF approvals. Human review is valuable when it performs error recovery that the autonomous system does not yet perform reliably.

That means human-in-the-loop design should not be decorative. A human checkpoint placed after all important damage has already been done is compliance theatre. A useful checkpoint sits where uncertainty is high, consequences are meaningful, and the reviewer can actually detect and repair errors.

What this means for automation buyers

The paper directly shows a plausible mathematical interpretation of AI-agent performance on METR’s task suite. Cognaptus infers a practical operating principle: reliability should be planned as a function of task length, not treated as a static model property.

A model is not “90% reliable” in the abstract. It may be 90% reliable for a narrow five-minute workflow and unusable for a two-hour workflow. It may be excellent when outputs are automatically scorable and weaker when success requires negotiation, ambiguous judgement, or cross-agent interaction. Reliability is task-shaped.

A procurement or deployment review should therefore ask:

Procurement question	Why it matters under the half-life model
What is the human-equivalent duration of the workflow?	Task length determines exposure to failure opportunities
What reliability threshold is required?	50%, 80%, 90%, and 99% imply radically different feasible durations
Which steps are automatically verifiable?	Verification can reduce unrecovered failure
Where can retries be safely used?	Retries help only when repeated attempts do not compound harm
Which actions are irreversible or externally visible?	These require gates, approvals, or sandboxing
Does failure degrade gracefully?	A workflow that fails locally is easier to automate than one that fails globally
Are we measuring end-to-end success or step-level success?	Step-level metrics can hide compound failure
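
The last row is worth a quick illustration. Assuming every step must succeed, steps fail independently, and nothing is recovered, a step-level score that sounds excellent compounds into a much weaker end-to-end number. The 98% figure is a made-up example.

```python
def end_to_end(step_success: float, steps: int) -> float:
    """End-to-end success when every step must succeed and
    steps fail independently with no recovery."""
    return step_success ** steps

for steps in (5, 20, 50):
    print(steps, round(end_to_end(0.98, steps), 3))
```

At 98% per step, a 20-step workflow completes about two times in three, and a 50-step workflow less than half the time.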

The uncomfortable implication is that many attractive agent demos are measuring the wrong thing for business adoption. They show that an agent can sometimes complete a long task. Operators need to know whether it can complete a bounded version of that task repeatedly at the required reliability.

That is a less glamorous evaluation. It is also the one that prevents a chatbot from becoming an incident report.

Benchmark strategy should move from pass rates to survival curves

A benchmark pass rate compresses too much information. A single number hides where performance collapses.

The half-life framing suggests a better benchmark object: a survival curve over task duration. Instead of asking whether an agent scores 63% on a mixed benchmark, ask how success probability changes as human-equivalent task length increases.

That enables comparisons with operational meaning:

Benchmark view	What it tells you	What it hides
Average pass rate	General capability across a task set	Whether failures concentrate in longer workflows
50% time horizon	Median autonomous task length	Deployability at high reliability
80% / 90% / 99% horizons	Practical reliability span	Domain-specific costs of failure
Shape of survival curve	Whether hazard changes with duration	Exact causal mechanism without further tests

The curve shape matters. If hazard increases with time, the agent may become more unstable as context grows, plans drift, or intermediate errors accumulate. If hazard decreases with time, perhaps early task understanding is the hard part and later execution is easier. If hazard is roughly constant, the half-life model becomes a useful baseline.

Ord’s paper does not settle these alternatives. It argues that the exponential model is simple, plausible, and apparently competitive with more complex fits on the available data. That is enough to justify using it as a baseline in evaluations.

A serious enterprise agent benchmark should report not just whether the agent succeeded, but how success changes with task duration, where failures occur, whether failures are recoverable, and what happens when the agent is forced to verify intermediate work.
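
As a starting point, an internal evaluation could fit the one-parameter exponential baseline to its own pass/fail records. The sketch below uses maximum likelihood with a simple ternary search. It is not the fitting procedure used by METR or Ord, and the records are invented purely to make the example runnable.

```python
import math

# Hypothetical records: (human-equivalent minutes, succeeded?). Not real data.
records = [(2, True), (5, True), (5, True), (10, True), (10, False),
           (20, True), (20, False), (40, True), (40, False), (80, False),
           (80, False), (160, False)]

def log_likelihood(hazard: float) -> float:
    """Bernoulli log-likelihood with P(success) = exp(-hazard * t)."""
    total = 0.0
    for minutes, ok in records:
        if ok:
            total += -hazard * minutes
        else:
            total += math.log(1.0 - math.exp(-hazard * minutes))
    return total

def fit_hazard(lo: float = 1e-6, hi: float = 1.0) -> float:
    """Ternary search for the maximum; the log-likelihood is concave in the hazard."""
    for _ in range(200):
        a = lo + (hi - lo) / 3
        b = hi - (hi - lo) / 3
        if log_likelihood(a) < log_likelihood(b):
            lo = a
        else:
            hi = b
    return (lo + hi) / 2

hazard = fit_hazard()
t50 = math.log(2) / hazard
t90 = t50 * math.log(0.9) / math.log(0.5)
print(f"fitted 50% horizon: {t50:.1f} min, implied 90% horizon: {t90:.1f} min")
```

Comparing this baseline against a more flexible curve on the same records is one way to test whether hazard really looks constant in a given domain. The search needs at least one success and one failure in the data to have an interior maximum.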

The boundary: useful model, not universal law

The obvious misconception is to treat “AI agents have a half-life” as a universal law of nature. It is not. It is a proposed model for interpreting a specific empirical pattern.

The source task suite is not the whole economy. METR’s benchmark focuses on tasks relevant to AI research assistance, including software engineering, cybersecurity, reasoning, and machine-learning tasks. The paper itself notes reasons this suite may not generalise cleanly: tasks are automatically scorable, interaction with other agents is absent, and resource constraints are relatively lax.

There are also known mismatches between human time and AI difficulty. Some tasks are quick for humans and hard for AI, such as certain spatial or intuitive physical reasoning tasks. Other tasks are slow for humans and easy for AI, such as rote computation. Human-equivalent time is a useful common currency, not a complete theory of intelligence.

The curve fitting also needs more formal comparison. Ord overlays an exponential fit against the log-logistic-style curves used in the METR analysis and argues that the exponential is roughly competitive while using fewer parameters. But the paper calls for formal statistical analysis to compare models properly. That is the correct next step.

So the boundary is clear:

Claim type	Status
AI-agent success declines with task duration on the METR suite	Empirical basis from METR
A constant-hazard exponential model appears to explain much of this decline	Ord’s theoretical interpretation
The 50% horizon can be treated as a half-life under this model	Mathematical consequence
High reliability thresholds sharply reduce feasible task length	Mathematical consequence if the model holds
This applies to all agents, all domains, and all workflows	Not established

For business use, this is still enough. You do not need a universal law to improve deployment discipline. You need a conservative planning model that prevents a 50% benchmark from being mistaken for a 99% workflow.

The real automation lesson is recovery

The half-life model can sound pessimistic. It should not. It points to where the engineering work belongs.

If current agents fail because long tasks expose them to many unrecovered subtask failures, then the next generation of useful automation will not be defined only by larger context windows, better tool access, or faster inference. Those help. But the operational frontier is recovery.

Recovery means the system can notice when an intermediate result is bad, correct course, and continue without turning one local error into global failure. In business workflows, that usually requires a combination of model behaviour and external structure:

explicit task decomposition;
state tracking;
tool-call validation;
output schemas;
intermediate tests;
exception handling;
human escalation;
audit logs;
rollback paths;
domain-specific constraints.

This is why “agent autonomy” should be treated less like a binary switch and more like an exposure budget. The question is not whether to automate. The question is how much uninterrupted hazard exposure the process can tolerate before a checkpoint is economically justified.

For low-cost tasks, the answer may be “quite a lot”. For high-impact tasks, the answer may be “almost none”. Both can be good automation decisions. The bad decision is using the same autonomy level for both because the demo looked smooth.

Conclusion: half-life is a planning instrument, not a death sentence

Ord’s paper gives AI automation a useful piece of vocabulary. A model’s long-task capability is not just a capability score. It may be a survival curve. Its 50% time horizon may function like a half-life. And every extra nine of reliability may shrink the feasible autonomous span by roughly a factor of ten.

That is not the message many vendors will put on the first slide. Understandably. “Our agent has a measurable decay curve” lacks the sparkle of “autonomous digital worker”. But for operators, the decay curve is the more valuable sentence.

The paper’s strongest contribution is not the metaphor. It is the mechanism behind the metaphor: long tasks fail because they contain many opportunities for unrecovered failure. That mechanism turns reliability from a vague concern into a design problem.

Businesses should not respond by waiting passively for agents with longer half-lives. They should build systems that reduce hazard, shorten exposure, isolate failure, and add recovery. In other words: automate the parts that are ready, supervise the parts that are brittle, and stop treating a 50% research benchmark as if it were an operating licence.

The robots may be improving quickly. The multiplication table is improving at exactly the same speed as before.

Cognaptus: Automate the Present, Incubate the Future.

¹ Toby Ord, “Is there a half-life for the success rates of AI agents?”, arXiv:2505.05115, 2025.
² Thomas Kwa et al., “Measuring AI Ability to Complete Long Tasks”, arXiv:2503.14499, 2025.
