The Agent Olympics: How Toolathlon Tests the Limits of AI Workflows

Office work is not one task. It is a chain of small obligations pretending to be one task.

“Check the homework submissions, download the attached Python files, run them, grade the students in Canvas, and use the latest submission if someone sent more than one.” That sounds like a normal administrative request. It is also a compact torture device for an AI agent. The agent must read email, handle attachments, inspect local files, run code, interpret results, map students to course records, update Canvas, and not confidently grade the wrong person. Easy, apparently, as long as nothing has to actually work.

That is the useful provocation behind TOOLATHLON, introduced in The Tool Decathlon: Benchmarking Language Agents for Diverse, Realistic, and Long-Horizon Task Execution.¹ The paper is not asking whether language models can call a function. We already know they can, in the same way a novice intern can open Excel. The question is whether an agent can survive a realistic workday: ambiguous instructions, messy initial state, multiple applications, long tool traces, distracting resources, and a final answer that is judged not by vibes but by whether the environment was changed correctly.

The answer, for now, is mostly “not yet”. Claude-4.5-Sonnet leads the benchmark with 38.6% Pass@1. GPT-5 reaches 30.6%. The best open-weight model in the paper, DeepSeek-V3.2-Exp, reaches 20.1%. These are not minor blemishes on a mature automation stack. They are warning lights on the dashboard, politely blinking while the demo deck says “agentic transformation”.

But the scoreboard is the least interesting part. TOOLATHLON matters because it shows how real workflows break agents.

The benchmark is built around the seams where work actually fails

Many tool-use benchmarks test whether a model can choose the right API or complete a short interaction. TOOLATHLON instead stresses the seams between systems. It spans 108 tasks, 32 software applications, and 604 tools, with applications ranging from Google Calendar, Notion, Google Sheets, GitHub, and Yahoo Finance to Snowflake, Kubernetes, WooCommerce, Canvas-LMS, and locally hosted email via Poste.io.¹

The point is not merely breadth. A shallow benchmark with many APIs can still be a catalogue, not a test. TOOLATHLON adds four design choices that make it closer to operational work.

First, tasks begin from realistic initial states. The paper notes that 72 out of 108 tasks, or 67%, require state initialisation. That means the agent is not operating in an empty inbox, blank database, or toy spreadsheet. It may face a Canvas course with many students, an e-commerce system with many products, or a workspace with useful and irrelevant files mixed together. This matters because real business process automation usually fails in the existing-state layer, not in the clean-room instruction.

Second, the prompts are intentionally concise and fuzzy. In one example, the user asks the agent to update candidate information in a Notion HR record according to resumes in the workspace, delete sample entries, and email candidates whose desired positions are not open. The instruction does not spell out every table column or every lookup path. The agent must infer structure from existing records, resumes, job openings, and templates. This is not unfair; it is Tuesday.

Third, the benchmark uses real or realistic software environments rather than purely synthetic stubs. Some services are remote, while others are locally containerised so that state can be reset and controlled. The authors use local deployments such as Poste.io, Canvas, Kubernetes, and WooCommerce where remote environments would be impractical to populate and reset at scale. This is a sensible compromise: enough realism to create messy observations, enough control to evaluate repeatedly.

Fourth, evaluation is deterministic and execution-based. Each task has a dedicated script that checks the final state, either against static ground truth or dynamically generated reference information. The agent does not get credit for sounding plausible. It gets credit when the right email was sent, the right database was updated, the right file was created, or the right state transition occurred. Cruel, perhaps. Also known as “work”.
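
A minimal sketch of what a state-based check of this kind might look like, using the homework-grading request from the opening. Everything here is an assumption for illustration: the `Grade` record, the `check_final_state` function, and the toy data are hypothetical and are not taken from the paper's evaluation scripts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Grade:
    student_id: str
    score: int


def check_final_state(expected: list[Grade], actual: list[Grade]) -> tuple[bool, list[str]]:
    """Compare the final environment state with ground truth, record by record."""
    expected_by_id = {g.student_id: g.score for g in expected}
    actual_by_id = {g.student_id: g.score for g in actual}
    problems = []
    for student_id, score in expected_by_id.items():
        if student_id not in actual_by_id:
            problems.append(f"missing grade for {student_id}")
        elif actual_by_id[student_id] != score:
            problems.append(
                f"wrong grade for {student_id}: expected {score}, got {actual_by_id[student_id]}"
            )
    # Grades written for students who should not have one also fail the check.
    for student_id in sorted(actual_by_id.keys() - expected_by_id.keys()):
        problems.append(f"unexpected grade for {student_id}")
    return (not problems, problems)


# Hypothetical ground truth and a hypothetical agent result that skipped one student.
expected_grades = [Grade("s1", 10), Grade("s2", 0)]
agent_result = [Grade("s1", 10)]
passed, problems = check_final_state(expected_grades, agent_result)
print(passed, problems)
```

The check never reads the agent's final message. It only looks at what ended up in the system, which is the property the benchmark's evaluation relies on.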

TOOLATHLON is not a harder quiz; it is a different failure surface

The paper positions TOOLATHLON against earlier agent benchmarks using criteria such as real environments, state initialisation, verifiable execution, cross-application tasks, and realistic fuzzy prompts. TOOLATHLON is the only benchmark in the comparison table that receives all five marks. It also has a higher average turn count than most listed benchmarks: 26.8 tool-calling turns for Claude-4-Sonnet in the comparison table.¹

That turn count is important, but it can mislead. The easy interpretation is: long tasks are hard because they are long. The better interpretation is: long tasks expose more opportunities for agents to misread state, choose the wrong tool, lose track of requirements, omit items, mishandle outputs, or declare victory too early.

A realistic agent workflow has at least six failure surfaces:

| Failure surface | What breaks | Business translation |
| --- | --- | --- |
| Initial state | The agent does not inspect enough existing data | Automation misses hidden dependencies already present in the system |
| Fuzzy instruction | The agent waits for explicit steps or guesses the wrong format | Users must over-specify work, destroying the productivity case |
| Tool selection | The agent calls the wrong tool or wrong parameter pattern | Integration quality becomes as important as model quality |
| Long context | The agent loses earlier requirements or gets trapped in large outputs | Process reliability decays as workflow length grows |
| Completeness | The agent handles a sample, not the full population | Partial automation creates silent operational debt |
| Termination | The agent claims completion before the state is correct | Human review shifts from approval to forensic audit |

This is why TOOLATHLON is best read mechanism-first. The headline number tells us agents are weak. The benchmark design explains why.

The main result is poor success, not total incapability

The main experiment evaluates leading proprietary and open-weight models, running each model three times and reporting Pass@1, Pass@3, a stricter consistency measure where all three attempts succeed, and average turns. Claude-4.5-Sonnet leads with 38.6 ± 2.7 Pass@1, 51.9 Pass@3, and 20.4 on the all-three-success measure. GPT-5 follows at 30.6 ± 1.5 Pass@1. Claude-4-Sonnet reaches 29.9, GPT-5-high 29.0, Grok-4 27.5, and Claude-4.5-Haiku 26.2. Most other proprietary models fall below that range, and open-weight models remain at or below roughly 20%, led by DeepSeek-V3.2-Exp at 20.1 ± 1.2.¹
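
The three success measures are easy to conflate, so here is a small sketch of how they relate. It assumes Pass@1 is the per-run success rate averaged over the three runs, Pass@3 counts tasks where at least one run succeeds, and the stricter measure counts tasks where all three succeed. The function name `summarize` and the toy results are invented for illustration and are not the paper's data.

```python
def summarize(results: dict[str, list[bool]]) -> dict[str, float]:
    """Aggregate per-task outcomes from three independent runs."""
    n_tasks = len(results)
    total_runs = sum(len(runs) for runs in results.values())
    return {
        # Share of individual runs that succeeded.
        "pass@1": sum(sum(runs) for runs in results.values()) / total_runs,
        # Share of tasks solved in at least one run.
        "pass@3": sum(any(runs) for runs in results.values()) / n_tasks,
        # Share of tasks solved in every run.
        "all_three": sum(all(runs) for runs in results.values()) / n_tasks,
    }


# Invented outcomes for four tasks, three runs each.
toy_results = {
    "task_a": [True, True, True],
    "task_b": [True, False, False],
    "task_c": [False, False, True],
    "task_d": [False, False, False],
}
metrics = summarize(toy_results)
print(metrics)
```

In this toy case Pass@1 is 5/12 (about 41.7%), Pass@3 is 75%, and the all-three measure is 25%. The distance between the last two numbers is the inconsistency: tasks the agent can solve but does not solve every time.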

The Pass@3 numbers soften the picture slightly. Claude-4.5-Sonnet’s Pass@3 is 51.9, meaning that for about half the tasks, at least one of three attempts succeeds. That suggests current agents sometimes possess the required capability but cannot deliver it consistently. This is an important distinction. A model that never succeeds is a research problem. A model that sometimes succeeds is an operations problem.

Unfortunately, operations problems are where businesses actually lose money.

For enterprise deployment, inconsistent success is not just lower accuracy. It means teams must decide when to retry, when to escalate, when to verify, and when an agent’s apparent completion is trustworthy. A 50% “one of three worked” capability can be useful in a sandbox, a recommendation layer, or a supervised assistant. It is much less attractive when the workflow sends emails, modifies customer records, updates inventory, or touches production infrastructure.

The benchmark therefore supports a restrained conclusion: frontier agents can complete some realistic multi-tool workflows, but they are not yet reliable autonomous operators for broad enterprise processes. Yes, this sentence is less exciting than “digital employees have arrived”. It is also less likely to get someone fired.

More thinking is not the same as better work

A common reader misconception is that these failures mainly show a shortage of reasoning tokens. Give the model more time to think, the theory goes, and the agent will eventually work through the task.

TOOLATHLON complicates that story. The paper reports that raising reasoning effort on a thinking-oriented model does not improve performance: GPT-5 scores 30.6 Pass@1, while GPT-5-high scores 29.0. The authors suggest that exploring new observations may matter more than extended internal reasoning in agentic tasks.¹

This is the right lesson. A tool-using agent is not solving a puzzle entirely inside its own head. It is interacting with an external state. The missing information is often not latent in the model; it sits in a spreadsheet tab, a PDF manual, an API response, a branch name, a job database, or a file dependency. Thinking longer about the wrong partial view does not reveal the missing row.

The cost analysis points in the same direction. Figure 8 plots performance against average cost per task and output tokens. The paper observes that most models cluster between 5K and 10K output tokens, while some reasoning-focused models generate more. The Claude series and Grok-4 achieve strong results with fewer tokens, which the authors interpret as relying more on environment observation than extensive internal reasoning.¹

That is a useful business correction. The agentic bottleneck is not only cognitive depth. It is state acquisition discipline: knowing what to inspect, how to inspect it, when to stop inspecting, and how to preserve the task requirements while acting.

In ordinary management language: the agent needs less monologue and better fieldwork.

Long workflows expose premature completion, not just memory limits

The paper groups tasks by average execution turns and uses that as a proxy for difficulty. This analysis is not the main leaderboard; it is diagnostic. Its likely purpose is to separate “the task is long” from “the task has hidden operational traps”.

The results are revealing. Groups with more turns generally have lower success rates, but the decline is not smooth. For Claude-4.5-Sonnet, the easy group scores 45%, the medium group 32%, and the hard group 37%. GPT-5 similarly scores 41%, 23%, and 26%. The hard group is not always worse than the medium group.¹

That pattern matters. If length alone explained failure, the hardest long-turn group should consistently collapse. Instead, the authors suggest that models may terminate prematurely without sufficiently exploring available observations. In other words, the agent sometimes fails not because it cannot continue, but because it decides it has done enough.

The appendix makes this concrete. In a music-analysis task, Grok-Code-Fast-1 completed analysis for 1940, then said the same steps could be applied to 1941–1949 and claimed completion after 66 turns. The task required the agent to create one sheet per year for the 1940s. It did one year and handed the rest back to the user, with the breezy confidence of a consultant leaving before implementation.

This failure mode is particularly dangerous in business workflows because it produces plausible partial progress. A human reviewer sees some output, maybe even high-quality output, and must then discover that the agent skipped a population, a year, a branch, a candidate, or a dependency. That is not automation. That is a scavenger hunt with branding.

Tool errors are sometimes recoverable; tool misunderstanding is not

The paper’s tool-error analysis distinguishes two types of failure: hallucinating non-existent tool names and errors raised during tool execution. This section functions as diagnostic analysis rather than a new benchmark thesis.

The result is more nuanced than “tool errors bad”. All models produce execution errors to varying degrees, often from incorrect parameters or attempts to access non-existent resources. But the paper finds no significant correlation between overall success and the frequency of such execution errors in the main analysis. Why? Because an error message can become information. If the tool says the parameter is wrong or the resource does not exist, a capable agent may recover in the next turn.

Incorrect tool names are more damaging. They indicate the model has lost its grip on the available action space. Appendix Figure 9 further reports that for most models, trajectories containing tool-call errors suffer lower success than error-free trajectories, with the negative impact especially pronounced for GPT-5.¹

For enterprises, this shifts attention from model selection to interface design. Tool descriptions, schemas, parameter validation, error messages, and recovery paths are not plumbing. They are part of the agent’s cognition. A badly designed tool layer turns a capable model into an expensive intern with a broken keyboard.

The paper’s own framework reflects this. It adds tool-error handling so errors become observations instead of terminating the loop. It truncates overlong outputs at 100K characters and provides paging over cached raw outputs. It also includes context-history management tools that let models inspect token counts, drop old turns, and search prior history. These are implementation details, but they carry an operational lesson: realistic agents need scaffolding that assumes failure and supports recovery.
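
As a rough illustration of "errors become observations", the sketch below wraps every tool call so that both failure types discussed above, an unknown tool name and an error raised during execution, come back as text the agent can read on its next turn. The `call_tool` wrapper, the `read_file` tool, and the message formats are hypothetical. This is not the paper's framework code.

```python
from typing import Callable

ToolFn = Callable[..., str]


def call_tool(tools: dict[str, ToolFn], name: str, **kwargs) -> str:
    """Run a tool call and always return an observation string instead of raising."""
    if name not in tools:
        # Hallucinated tool name: remind the agent what actually exists.
        return f"ERROR: unknown tool '{name}'. Available tools: {sorted(tools)}"
    try:
        return tools[name](**kwargs)
    except Exception as exc:
        # Execution error (bad parameter, missing resource) becomes feedback.
        return f"ERROR: {type(exc).__name__}: {exc}"


def read_file(path: str) -> str:
    """Toy tool backed by an in-memory file table."""
    files = {"notes.txt": "hello"}
    return files[path]  # raises KeyError for a missing file


tools = {"read_file": read_file}
print(call_tool(tools, "read_file", path="notes.txt"))
print(call_tool(tools, "read_file", path="missing.txt"))
print(call_tool(tools, "open_file", path="notes.txt"))
```

The design choice is that the loop never dies on a bad call. Whether the agent then recovers is a separate question, but it at least gets the chance.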

Overlong outputs are the boring enemy, which makes them dangerous

Enterprise systems produce ugly outputs. Long HTML pages. Exported tables. Database dumps. Logs. Search results. PDF text. Spreadsheet ranges large enough to make everyone regret their career choices.

TOOLATHLON explicitly studies overlong tool outputs. This analysis is best read as a robustness or sensitivity test: it asks whether models remain successful when tool observations become large and inconvenient. The paper reports that 15% to 35% of trajectories contain overlong tool outputs depending on the model, and most models see lower success when those outputs occur. The authors note that these tasks may be logically straightforward, such as price comparison or data extraction, yet models get trapped trying to process the lengthy outputs.¹

This is exactly the kind of issue that does not appear in polished demos. A demo shows the agent retrieving a neat answer from a friendly API. Production asks the agent to parse a 9,000-row export, ignore irrelevant blocks, find the two fields that matter, and not forget the original instruction while drowning in text.

The business implication is direct: agent deployments need output-shaping infrastructure. That includes summarisation with traceability, search over raw outputs, pagination, row-level filters, schema-aware extraction, and hard limits on what enters the model context. Without this, the agent’s context window becomes a landfill. Larger landfills are still landfills.
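
One way to picture a hard limit plus pagination is sketched below. The 100K-character default mirrors the truncation limit the paper describes for its framework. The `OutputCache` class, its method names, and the wording of the truncation notice are assumptions made for this example.

```python
class OutputCache:
    """Keep raw tool outputs outside the model context and serve them in pages."""

    def __init__(self, max_chars: int = 100_000) -> None:
        self.max_chars = max_chars
        self._raw: dict[str, str] = {}

    def shape(self, call_id: str, output: str) -> str:
        """Return the text that enters the context: the full output or a truncated head."""
        if len(output) <= self.max_chars:
            return output
        self._raw[call_id] = output
        pages = -(-len(output) // self.max_chars)  # ceiling division
        notice = (
            f"\n[truncated: {len(output)} chars total, {pages} pages; "
            f"call page('{call_id}', n) to read page n]"
        )
        return output[: self.max_chars] + notice

    def page(self, call_id: str, index: int) -> str:
        """Return one page of a cached raw output (pages are numbered from 0)."""
        start = index * self.max_chars
        return self._raw[call_id][start : start + self.max_chars]


# A tiny limit makes the behaviour visible without a 100K-character string.
cache = OutputCache(max_chars=10)
shaped = cache.shape("call-1", "abcdefghijklmnopqrstuvwxy")
print(shaped)
print(cache.page("call-1", 2))
```

The raw output is never lost, only kept out of the context until the agent asks for a specific page. That keeps the original instruction from being buried under text the agent did not need.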

The qualitative cases show three kinds of almost-success

The appendix examples are not main evidence in the statistical sense. They are qualitative diagnostics, useful because they make the failure modes legible.

The first case is a dataset-licence task. DeepSeek-V3.1 identifies the correct licence information, but fails to update the Hugging Face dataset pages. The prompt indicated that a token was available in a local file, but the model repeatedly tried less effective routes instead of using terminal commands or Python to access the needed dataset with the token. This is almost-success by analysis without execution.

The second case is a task-tracker workflow involving a project repository with more than ten developers. Claude-4-Sonnet attempts the process but fails to inspect all possible files and folders. The evaluation expected 116 task rows in Notion; the model produced 91. This is almost-success by sampling the world and mistaking the sample for the whole.

The third case is the music-analysis task described earlier. The model performs one year and stops. This is almost-success by premature delegation.

These cases matter because they map onto common enterprise failure modes:

| Almost-success type | What the model did | Why the business still fails |
| --- | --- | --- |
| Analysis without execution | Found the answer but did not change the target system | The workflow outcome was not delivered |
| Partial population coverage | Updated many records but missed a large subset | Silent omissions corrupt downstream operations |
| Premature delegation | Completed a representative slice and claimed done | The human inherits the unfinished process |

The awkward conclusion is that agents can be dangerously competent. They can produce enough correct intermediate behaviour to make failure less obvious.

What this means for MCP-style enterprise automation

TOOLATHLON is especially relevant because it uses many tools sourced from or inspired by Model Context Protocol servers. MCP-style architectures promise a standard way to connect agents to applications. That promise is valuable. It is also insufficient.

A standard connector layer solves access. It does not solve judgment, persistence, completeness, recovery, or state verification.

The paper directly shows that current agents struggle on realistic, long-horizon, multi-application workflows under deterministic evaluation. Cognaptus infers three practical rules for business adoption.

First, do not evaluate agents only on happy-path demos. Build workflow-specific tests with realistic seeded states. A procurement team should not ask, “Can the agent use Gmail and Notion?” It should ask, “Can the agent process 200 messy records, infer the correct template from examples, update the right table, send only the required emails, and pass a deterministic state check?”

Second, treat the tool layer as a product surface. Tool names, schemas, parameter constraints, error messages, pagination, and permissions should be designed for agent recovery. A brittle connector does not merely fail; it teaches the agent the wrong next move.

Third, separate autonomous execution from autonomous verification. A useful enterprise agent should not only perform actions. It should maintain a checklist of requirements, verify final state against that checklist, report unresolved items, and avoid claiming completion when only a slice is done. “Claim done” is not a button. It is a control risk.
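
A small sketch of that checklist idea, applied to the music-analysis case, where the task asked for one sheet per year of the 1940s. The function names and the use of year strings as sheet names are hypothetical. The point is only that "done" is computed from the requirements, not asserted by the agent.

```python
def unresolved(required: set[str], completed: set[str]) -> list[str]:
    """Return the requirements that the final state does not yet satisfy."""
    return sorted(required - completed)


def completion_report(required: set[str], completed: set[str]) -> str:
    """Produce the only completion message the agent is allowed to send."""
    missing = unresolved(required, completed)
    if not missing:
        return "done"
    return f"incomplete: {len(missing)} of {len(required)} unresolved ({', '.join(missing)})"


# One sheet per year, 1940 through 1949; the agent only produced the first.
required_sheets = {str(year) for year in range(1940, 1950)}
created_sheets = {"1940"}
report = completion_report(required_sheets, created_sheets)
print(report)
```

With one sheet out of ten, the report lists nine unresolved years instead of a confident summary. That turns premature delegation into an explicit hand-off the reviewer can see.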

A practical deployment pattern would look less like a single omnipotent agent and more like a supervised operations loop:

1. seed the workflow with realistic state;
2. let the agent execute within scoped permissions;
3. log tool calls, errors, retries, and dropped context;
4. run deterministic validators against final state;
5. route failures by type: missing item, wrong target, incomplete population, tool error, context overflow, premature stop;
6. only then expand the workflow boundary.
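
The routing step in the loop above can be sketched as a small decision function. The failure types are the ones listed in the pattern. The policy itself (retry infrastructure problems, send anything that may have left wrong or partial state to a person) is an assumption for illustration, and `next_step` and its return labels are hypothetical.

```python
from enum import Enum


class Failure(Enum):
    MISSING_ITEM = "missing item"
    WRONG_TARGET = "wrong target"
    INCOMPLETE_POPULATION = "incomplete population"
    TOOL_ERROR = "tool error"
    CONTEXT_OVERFLOW = "context overflow"
    PREMATURE_STOP = "premature stop"


# Assumed policy: only infrastructure-style failures are safe to retry blindly.
RETRYABLE = {Failure.TOOL_ERROR, Failure.CONTEXT_OVERFLOW}


def next_step(failures: list[Failure]) -> str:
    """Decide what happens after the deterministic validators have run."""
    if not failures:
        return "expand_scope"
    if all(failure in RETRYABLE for failure in failures):
        return "retry"
    return "human_review"


print(next_step([]))
print(next_step([Failure.TOOL_ERROR]))
print(next_step([Failure.TOOL_ERROR, Failure.PREMATURE_STOP]))
```

A real deployment would want a richer policy, such as retry budgets and per-workflow escalation paths. The useful part is that the decision is made from typed validator output, not from the agent's own account of how things went.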

This sounds less glamorous than “AI employee”. Good. Glamour is rarely an audit control.

The procurement lesson is not “buy the top model”

The model rankings are useful, but they should not be read as universal vendor guidance. TOOLATHLON uses a particular task set, a particular agent framework based on the OpenAI Agents SDK, selected MCP servers, selected local tools, and a specific evaluation configuration. The results are meaningful within that setup. They are not a permanent law of nature.

The cost-performance section reinforces this. Claude-4.5-Sonnet is the top performer and ranks third in cost in the paper’s measurement. Claude-4-Sonnet and Grok-4 are relatively expensive. Most other models remain under **$1 per task** with prompt caching enabled, and models such as Grok-4-Fast, Grok-Code-Fast-1, and DeepSeek-V3.2-Exp may be reasonable alternatives under budget constraints when maximum performance is not the sole objective.¹

The procurement implication is model portfolios, not model worship. A company may use a stronger model for high-impact workflows, a cheaper model for low-risk extraction, retries for stochastic coverage, and deterministic validators for anything that mutates important state. Reliability is an architecture property. The model is only one part of it, albeit the part that gets invited to conferences.

Where the evidence stops

The benchmark is strong because it is concrete, but that concreteness also defines its boundaries.

It contains **108 tasks**, which is substantial for manually implemented, stateful, execution-based evaluation but still small relative to the diversity of enterprise work. The tasks were sourced and implemented by researchers and senior undergraduate computer science students, then checked by experienced authors. That gives them technical quality, but not necessarily perfect coverage of every industry’s operating conditions.

The application set is broad but not exhaustive. Some environments are remote; others are local substitutes selected for resetability and scale. Poste.io stands in for email management where Gmail-like state reset would be cumbersome. WooCommerce stands in for e-commerce workflows. These are reasonable engineering choices, not proof that every SaaS stack behaves the same.

The evaluation framework also shapes behaviour. The agent loop, context tools, overlong-output handling, local tools, and system prompt all affect outcomes. A different production framework with stronger planning, external memory, validators, tool routers, or human checkpoints could perform differently.

So the responsible interpretation is not “agents fail at 61.4% of office work”. TOOLATHLON does not measure office work as a whole. It measures a demanding set of realistic, long-horizon, multi-tool tasks under controlled evaluation. The result is still sobering enough.

The real agent benchmark is the messy middle

TOOLATHLON’s contribution is not that it proves agents are useless. It proves that realistic agent evaluation must move into the messy middle between toy tool calls and fully open-ended employment simulation.

That middle is where most business automation lives. It has real tools, but not infinite freedom. Real state, but controlled test resets. Fuzzy prompts, but deterministic final checks. Multiple systems, but bounded permissions. Enough ambiguity to matter, enough structure to know when the answer is wrong.

This is exactly where enterprises should test agents before deployment.

The polite fiction of agentic AI is that intelligence will flow through APIs and produce work. TOOLATHLON shows the less convenient truth: work is not a chain of API calls. Work is stateful, incomplete, repetitive, noisy, underspecified, and judged after the fact by whether the world changed in the right way.

Current agents can sometimes manage that. Sometimes is not nothing. It is also not operations readiness.

The Agent Olympics have begun. The athletes are impressive. Most still trip over the hurdles, drop the baton, and occasionally announce they have finished the race after the first lap.

**Cognaptus: Automate the Present, Incubate the Future.**

¹ Junlong Li et al., “The Tool Decathlon: Benchmarking Language Agents for Diverse, Realistic, and Long-Horizon Task Execution,” arXiv:2510.25726, https://arxiv.org/pdf/2510.25726.
