A competent assistant can make a list. A useful assistant knows what must happen first.
That distinction sounds small until an AI agent is asked to do something ordinary and annoyingly realistic: check a calendar, search the web, compare options, use a map, assemble a recommendation, and perhaps create a document at the end. None of those steps is exotic. The difficulty is that some of them can run in parallel, some must wait for earlier results, and some become nonsense if executed too early. This is less “genius at work” than “junior operations manager with access to too many browser tabs.” Naturally, it is where things get interesting.
TPS-Bench, a benchmark from Hanwen Xu and colleagues at Shanghai Jiao Tong University, targets precisely this coordination problem [1]. The paper’s claim is not that LLM agents cannot use tools. Many can. The sharper claim is that compound work requires two different skills that are too often bundled together under the flattering word “reasoning”: choosing the right tools, and scheduling them in the right order.
That second skill is where the clock starts charging rent.
The real comparison is not smart versus dumb, but fast versus careful
Most agent benchmarks ask whether the model eventually solves the task. TPS-Bench asks a more operational question: how much time, money, and workflow friction did the agent burn on the way there?
The benchmark contains 200 compound tasks built over 15 Model Context Protocol (MCP) servers with 141 tools. The tasks are divided into two tiers. TPS-Bench-Easy combines simpler, weakly related subtasks, with at most five subtasks. TPS-Bench-Hard raises the coordination burden by combining subtasks with stricter dependencies and a much larger possible subtask count. The authors generate subtasks from tool descriptions, combine them into compound tasks, and then manually inspect the resulting benchmark.
The evaluation process is deliberately closer to business automation than to a classroom exam. The agent first selects relevant tools, capped at 10 tools to control context and token consumption. It then decomposes the user request into subtasks, identifies dependencies, decides which tools to invoke in the current turn, receives tool outputs, and either continues or produces the final answer.
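The turn structure described above can be sketched in a few lines. This is a minimal illustration, not the benchmark's harness: the `Subtask` type, the `run_agent` function, and the stub tools are all hypothetical, and the plan is written by hand where a real agent would have the model produce it. Only the cap of 10 tools comes from the article.

```python
from dataclasses import dataclass, field
from typing import Callable

MAX_TOOLS = 10  # selection cap described in the article


@dataclass
class Subtask:
    name: str
    tool: str
    depends_on: set[str] = field(default_factory=set)


def run_agent(subtasks: list[Subtask], tools: dict[str, Callable[[dict], str]]):
    """Run subtasks turn by turn; each turn invokes every subtask whose inputs exist."""
    if len(tools) > MAX_TOOLS:
        raise ValueError("select at most 10 tools before planning")
    results: dict[str, str] = {}
    turns: list[list[str]] = []
    pending = list(subtasks)
    while pending:
        ready = [s for s in pending if s.depends_on.issubset(results)]
        if not ready:
            raise RuntimeError("unsatisfiable or cyclic dependencies")
        # Tools invoked in the same turn only see outputs from earlier turns.
        outputs = {
            s.name: tools[s.tool]({d: results[d] for d in s.depends_on})
            for s in ready
        }
        results.update(outputs)
        turns.append([s.name for s in ready])
        pending = [s for s in pending if s.name not in results]
    return results, turns


# Hypothetical stub tools standing in for MCP tool calls.
tools = {
    "calendar": lambda inputs: "free Friday 18:00",
    "search": lambda inputs: "Trattoria Uno; Noodle Bar",
    "recommend": lambda inputs: "Book "
    + inputs["find_options"].split(";")[0]
    + " at "
    + inputs["check_calendar"].split()[-1],
}
plan = [
    Subtask("check_calendar", "calendar"),
    Subtask("find_options", "search"),
    Subtask("recommend", "recommend", {"check_calendar", "find_options"}),
]
results, turns = run_agent(plan, tools)
print(turns)
print(results["recommend"])
```

The two independent lookups share the first turn, and the recommendation waits for both. The turn count that TPS-Bench records corresponds to the length of `turns` in this sketch.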
That design matters because it separates two failure modes that look similar in demos but behave very differently in production.
A model may select the right tools but invoke them in the wrong order. Another may schedule cautiously but spend so many turns reasoning and waiting that the workflow becomes operationally unattractive. A third may parallelise aggressively, look impressively fast, and quietly break every dependency that made the problem hard in the first place. Excellent. We have reinvented bad project management, but with API calls.
TPS-Bench measures coordination, not just completion
The benchmark uses several metrics. Task completion rate is judged with Gemini-2.5-Flash as an LLM-as-judge, which decomposes the original task into subtasks and assesses which ones were completed. Tool selection score is also judged by an LLM, using the task and selected tool descriptions. The paper reports a Pearson correlation of 0.8375 between human and LLM scores for task completion, and 0.7590 for the number of decomposed subtasks. That does not make the judge infallible, but it gives the evaluation a sanity check beyond “the model said it was done, so everyone clapped.”
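For readers who want to see what the agreement check computes, here is a small Pearson correlation sketch. The `human` and `judge` score lists are invented for illustration and are not taken from the paper; the function name `pearson` is likewise just a label for this example.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of the same length, with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Hypothetical per-task completion scores (fraction of subtasks judged complete).
human = [1.0, 0.5, 0.75, 0.0, 1.0, 0.25]
judge = [1.0, 0.5, 0.5, 0.25, 0.75, 0.25]
print(round(pearson(human, judge), 3))
```

A value near 1 means the judge ranks tasks much as humans do. It does not mean the two agree on every individual score.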
The paper also records input tokens, output tokens, number of tool-call turns, execution time, and a pricing-derived cost-of-pass. That last metric is especially useful because raw cost alone is misleading. A cheap model that fails often is not cheap. It is merely inexpensive per disappointment.
The authors define cost-of-pass as the average cost of one attempt divided by the task completion rate:

$$\text{cost-of-pass} = \frac{\text{average cost per attempt}}{\text{task completion rate}}$$
For business readers, this is the benchmark’s most useful conceptual move. It shifts evaluation from “how expensive is one run?” to “how expensive is one successful run?” That is the difference between procurement theatre and operational accounting.
The sequential agent: reliable, expensive, and very fond of waiting
On TPS-Bench-Hard, GLM-4.5 achieves the highest task completion rate among the reported models: 64.72%. It does so by behaving conservatively. The model averages 35 tool-call turns, consumes 12.6k input tokens, and takes 217.8 seconds per hard task.
That is the careful-worker profile. It checks dependencies, moves step by step, and avoids some cascading errors. The trade-off is painfully visible: more turns, more context, more latency. In a back-office workflow, that may be acceptable if the task is high-value and asynchronous. In a customer-facing workflow, it may feel less like an agent and more like a polite loading screen with ambitions.
The paper’s scheduling ablation supports this interpretation. When models are pushed toward sequential scheduling, token use and time generally rise. DeepSeek-R1, for example, increases from 11.6k to 12.6k tokens and from 361.6 to 423.6 seconds under serial scheduling. The upside is that sequential execution can improve completion by letting the model inspect intermediate outputs before deciding the next step. GLM-4.5’s completion rate rises from 63.1% to 71.8% in the scheduling comparison.
This ablation is central to the argument, not a decorative appendix point. It clarifies the mechanism behind the headline result: sequential execution is not magically smarter; it creates more opportunities to validate dependencies before the workflow moves on. The cost is that every extra opportunity to think also consumes time and tokens. There is no free lunch, only a better itemised receipt.
The parallel agent: fast, cheap-looking, and structurally fragile
GPT-4o represents the other side of the trade-off. On TPS-Bench-Hard, it averages 2.5 tool-call turns and completes tasks in 76.84 seconds, much faster than the more sequential models. But its task completion rate is 45.08%, well below GLM-4.5 and DeepSeek-R1.
The issue is not that parallel execution is bad. Parallel execution is exactly what agents should use when subtasks are independent. The problem is dependency blindness. If an agent invokes multiple tools at once when one tool needs the output of another, speed becomes a way of manufacturing errors faster.
QwQ-32B illustrates the extreme version. The paper notes that it often fails to distinguish dependencies among subtasks and invokes multiple tools simultaneously even when some tools rely on earlier outputs. On TPS-Bench-Hard, it averages roughly one tool-call turn per task and achieves only 29.36% completion. That is not efficiency. That is skipping the meeting and calling it agile.
The practical lesson is blunt: enterprises should not reward low turn count by itself. A low-turn agent may be efficient, or it may simply be under-planning. The difference only appears when completion is measured alongside scheduling behaviour.
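To make the low-turn trap concrete, the sketch below compares two schedules for the same small dependency graph. The task names and the `deps` mapping are hypothetical, loosely modelled on the calendar, search, and map example from the introduction. `graphlib` is part of the Python standard library (3.9 and later); the violation count is an illustrative measure, not a metric from the paper.

```python
from graphlib import TopologicalSorter

# Hypothetical compound task: each key depends on the subtasks in its set.
deps = {
    "check_calendar": set(),
    "search_venues": set(),
    "map_route": {"search_venues"},
    "recommend": {"check_calendar", "search_venues", "map_route"},
    "create_document": {"recommend"},
}


def layered_schedule(graph):
    """Group subtasks into turns; a turn holds everything whose inputs are done."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()
    schedule = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        schedule.append(ready)
        sorter.done(*ready)
    return schedule


def count_violations(schedule, graph):
    """Count dependencies that were not finished in an earlier turn."""
    finished = set()
    violations = 0
    for turn in schedule:
        for task in turn:
            violations += len(graph[task] - finished)
        finished.update(turn)
    return violations


careful = layered_schedule(deps)
reckless = [list(deps)]  # everything fired in a single turn

print(len(careful), count_violations(careful, deps))    # 4 turns, 0 violations
print(len(reckless), count_violations(reckless, deps))  # 1 turn, 5 violations
```

The one-turn schedule looks four times leaner on turn count, yet three of its five subtasks run without the inputs they need. Turn count only becomes meaningful once it is read next to a number like the second one.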
| Scheduling behaviour | Model example | Hard completion rate | Avg. turns | Avg. time | Operational reading |
|---|---|---|---|---|---|
| Mostly sequential | GLM-4.5 | 64.72% | 35.0 | 217.8s | More reliable, but slow and token-heavy |
| More parallel | GPT-4o | 45.08% | 2.5 | 76.84s | Faster, but more exposed to dependency errors |
| Over-parallel / dependency weak | QwQ-32B | 29.36% | 1.05 | 171.0s | Low turn count without reliable completion |
| Compact open-source baseline | Qwen3-32B | 56.72% | 3.1 | 226.2s | Stronger balance, but not latency-light |
One boundary matters here: the paper reports Qwen3-1.7B timing from local vLLM serving on four A100 GPUs, while other models use official APIs in default settings. That means absolute time comparisons are not a clean infrastructure-neutral ranking. The directional comparison of scheduling behaviour is still informative; the raw latency table should not be treated as a universal leaderboard tattoo.
Tool selection is not the glamour problem, but it still saves the budget
A tempting reading of TPS-Bench is that tool selection is the central challenge. The paper’s evidence is more subtle.
In the tool-selection ablation, the authors compare three strategies: self-selection by the model, no selection where all tool schemas are provided, and a rule-based top-10 selection using word-based similarity. Across GLM-4.5, DeepSeek-R1, GPT-4o, and Qwen3-32B on TPS-Bench-Hard, different selection strategies do not substantially change completion rate. They do, however, strongly affect efficiency.
The no-selection approach pushes token counts above 50k because the model receives the full tool schema set. That is not sophistication; it is context stuffing with a lab coat. The cost becomes especially visible for smaller-context models. For Qwen3-1.7B, the share of cases exceeding context length falls from 32% without tool selection to 12% with tool selection. For GPT-4o, it falls from 12% to 0%.
This ablation is best read as an implementation lesson. Tool selection may not be the primary determinant of task success once the needed tools are available, but it is a major determinant of whether the workflow stays within context and cost limits. In enterprise deployments, that distinction matters. Accuracy teams care about whether the agent can answer. Platform teams care whether the answer quietly required a banquet of tokens.
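A rule-based selector of the kind described in the ablation can be very small. The sketch below ranks tools by word overlap with the task and keeps the top k. The article only says the rule uses word-based similarity, so the Jaccard score here is an assumption, and the tool names and descriptions are invented for illustration.

```python
import re


def tokenize(text):
    """Lowercase a string and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def select_tools(task, tool_descriptions, k=10):
    """Return the names of the k tools whose descriptions best overlap the task."""
    task_words = tokenize(task)

    def score(item):
        name, description = item
        words = tokenize(name + " " + description)
        union = task_words | words
        # Jaccard similarity: shared words divided by all distinct words.
        return len(task_words & words) / len(union) if union else 0.0

    # Sort by descending score, then by name so ties are deterministic.
    ranked = sorted(tool_descriptions.items(), key=lambda it: (-score(it), it[0]))
    return [name for name, _ in ranked[:k]]


# Hypothetical tool catalogue; a real MCP deployment would supply its own schemas.
catalogue = {
    "calendar_list_events": "List events on the user calendar for a date",
    "maps_route": "Compute a driving route between two places on a map",
    "web_search": "Search the web for pages matching a query",
    "image_resize": "Resize an image file to new dimensions",
}
task = "Check my calendar for Friday and search the web for restaurants, then map a route"
print(select_tools(task, catalogue, k=3))
```

This version counts common words such as "the" and "for" as matches, so a production selector would likely want stop-word filtering or embeddings. Even so, the budget logic is the same: only the schemas of the selected tools enter the context.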
The cost table makes “cheap model” a dangerous phrase
The paper’s pricing-derived results sharpen the same point. GPT-4o has the highest reported hard-task cost and cost-of-pass in the table: 62.2 for cost and 138.0 for cost-of-pass, using the paper’s pricing setup. Qwen3-1.7B is far lower, at 1.3 cost and 4.9 cost-of-pass on TPS-Bench-Hard. Qwen3-32B reports 5.5 cost and 9.7 cost-of-pass, making it a strong cost-performance baseline in the paper’s setup.
But the interesting comparison is not simply “small is cheap.” QwQ-32B has a low raw hard-task cost of 3.2, yet its cost-of-pass rises to 21.2 because completion is poor. Its failures make each successful run effectively more expensive. This is exactly the kind of accounting most agent demos avoid, for understandable reasons. Nobody wants to put “cost per successful non-embarrassing completion” on the launch slide.
For business use, cost-of-pass is closer to the metric that should appear in procurement and governance reviews. It combines pricing, token consumption, and reliability into a single operational question: what does success cost, not what does an attempt cost?
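The arithmetic is easy to check against the figures quoted above. The snippet below divides each model's reported hard-task cost by its reported hard-task completion rate. The function name `cost_of_pass` is just a label for this sketch, and the cost units are whatever the paper's pricing setup uses.

```python
def cost_of_pass(avg_cost_per_attempt, completion_rate):
    """Average cost of one attempt divided by the task completion rate."""
    if not 0 < completion_rate <= 1:
        raise ValueError("completion_rate must be in (0, 1]")
    return avg_cost_per_attempt / completion_rate


# (reported cost, reported completion rate) on TPS-Bench-Hard, as quoted in this article.
reported = {
    "GPT-4o": (62.2, 0.4508),
    "Qwen3-32B": (5.5, 0.5672),
    "Qwen3-1.7B": (1.3, 0.2675),
}

for model, (cost, rate) in reported.items():
    print(f"{model}: {cost_of_pass(cost, rate):.1f}")
# GPT-4o: 138.0
# Qwen3-32B: 9.7
# Qwen3-1.7B: 4.9
```

The same function works for internal procurement comparisons: plug in a measured cost per run and a measured success rate for the workflow in question, not a vendor's headline price.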
The RL result is promising because it changes behaviour, not because it solves agents
The paper’s reinforcement learning experiment is an exploratory extension, not the main benchmark evidence. That distinction matters.
The authors train Qwen3-1.7B using Group Relative Policy Optimization (GRPO) on 100 training samples over five epochs. The reward comes from Gemini-2.5-Flash and scores both task completion and degree of parallelism. On TPS-Bench-Hard, the trained model raises tool selection score from 65.26% to 81.18% and task completion rate from 26.75% to 33.13%, while cutting input tokens from 7.8k to 7.3k, output tokens from 2.2k to 1.0k, turns from 2.4 to 2.1, and time from 42.0 seconds to 36.1 seconds.
That is a useful signal. It suggests scheduling behaviour can be trained directly, rather than merely hoped for as a side effect of larger model scale. The model achieves a higher completion rate while also reducing output tokens, turns, and execution time. In other words, it learns to be a little less bureaucratic without becoming more reckless. A rare administrative virtue.
Still, the boundary is clear. This is one model, one 100-sample training setup, one benchmark, and one reward design. It does not prove that GRPO will broadly solve scheduling across enterprise agents, private tool ecosystems, regulated workflows, or adversarial tool outputs. It does show that scheduling can be made into a trainable target. That is the business-relevant part.
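To show what "scheduling as a trainable target" might look like mechanically, here is a rough sketch of a reward that mixes completion with parallelism, followed by the group-relative normalisation that gives GRPO its name. Everything specific is an assumption: the additive form of `reward`, the `weight` of 0.2, and the example rollouts are hypothetical, and the paper's actual reward design may differ.

```python
from statistics import mean, pstdev


def reward(completion, parallelism, weight=0.2):
    """Hypothetical scalar reward: completion plus a small bonus for parallelism."""
    return completion + weight * parallelism


def group_relative_advantages(rewards, eps=1e-8):
    """Normalise rewards within one group of rollouts for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical (completion, parallelism) scores for four rollouts of one task.
rollouts = [(1.0, 0.5), (0.5, 1.0), (0.0, 1.0), (1.0, 0.0)]
rewards = [reward(c, p) for c, p in rollouts]
advantages = group_relative_advantages(rewards)

for (c, p), a in zip(rollouts, advantages):
    print(f"completion={c:.2f} parallelism={p:.2f} advantage={a:+.2f}")
```

With a small weight, the fully parallel rollout that completes nothing still ends up with the lowest advantage, which is the behaviour one would want: parallelism is rewarded only when it does not cost completion.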
| Test or result | Likely purpose | What it supports | What it does not prove |
|---|---|---|---|
| Main model comparison on TPS-Bench-Easy and TPS-Bench-Hard | Main evidence | Models differ sharply in completion, turns, tokens, and time | A universal ranking across all infrastructure and domains |
| LLM-as-judge correlation with humans | Evaluation validity check | Judge scores are reasonably aligned with human judgement | Perfect correctness of every completion score |
| Tool-selection strategy ablation | Ablation | Tool selection is more important for efficiency and context control than raw completion | That keyword-based selection is always sufficient |
| Sequential versus parallel scheduling test | Ablation / mechanism test | Serial execution can reduce dependency errors but increases time and tokens | That serial scheduling is always better |
| GRPO on Qwen3-1.7B | Exploratory extension | Scheduling can be improved through reward-based training | Broad generalisation across models, tools, and organisations |
The business issue is workflow governance, not model bragging rights
The easiest mistake is to treat TPS-Bench as another model leaderboard. That misses the paper’s more useful contribution. The benchmark is really a diagnostic framework for agent operations.
A company deploying AI agents should not ask only whether the agent completed a representative task. It should ask at least five more questions.
First, did the agent select a minimal and sufficient tool set, or did it flood the context window with every possible schema? Second, did it identify dependencies correctly, or did it parallelise tasks that required intermediate outputs? Third, how many tool-call turns did success require? Fourth, what was the cost per successful completion, not merely the cost per call? Fifth, where is the workflow allowed to trade reliability for latency?
Those questions map directly onto business design choices.
For customer service, a fast but slightly brittle agent may be acceptable if the workflow has safe fallbacks and low consequence. For financial reporting, legal review, procurement analysis, or clinical-adjacent administration, dependency errors are more expensive than waiting. For internal research, the optimal design may be hybrid: parallelise independent retrieval, then sequence synthesis and verification. The agent does not need one scheduling personality. It needs a policy.
This is where TPS-Bench becomes useful beyond the benchmark itself. It encourages agent governance to move from “which model is best?” to “which scheduling strategy is appropriate for this process?” That is a more boring question. Naturally, it is also the question that determines whether the system works.
What the paper directly shows, and what we should infer carefully
The paper directly shows that, on TPS-Bench, mainstream LLM agents can often select reasonable tools but differ substantially in scheduling behaviour. Sequential-heavy models can achieve higher completion rates while consuming more turns, time, and tokens. Parallel-heavy models can reduce latency but risk missing dependencies. Tool selection helps control context and efficiency. A small GRPO experiment improves Qwen3-1.7B on hard tasks across completion and efficiency metrics.
Cognaptus would infer three business implications.
First, agent evaluation should include scheduling metrics by default. Completion rate alone rewards agents that eventually get there by wandering. Latency alone rewards agents that sprint into walls. Token cost alone rewards systems that are cheap until they fail.
Second, orchestration policy should be workflow-specific. A claims-processing agent, a research assistant, and a customer chatbot should not share the same dependency tolerance. The right question is not “parallel or sequential?” The right question is “which subtasks are safe to parallelise, and which require validated upstream outputs?”
Third, fine-tuning and reinforcement learning should target operational behaviour, not only final-answer quality. If the agent’s cost problem comes from bad scheduling, buying a larger model may be the executive version of using a hammer because the spreadsheet is slow.
The limits are practical, not fatal
TPS-Bench is useful, but it is not a complete mirror of enterprise reality.
The tasks are constructed, not harvested from one organisation’s messy internal systems. The tool ecosystem is MCP-style and broad, but still benchmark-defined. The completion scores depend on an LLM judge, even with encouraging human-correlation checks. Timing comparisons mix local serving and API-based models, so raw seconds should be interpreted cautiously. The cost table depends on model prices as of the authors’ selected pricing date. The RL result is early and narrow.
None of these limitations undermines the central lesson. They define where the lesson should be applied carefully. TPS-Bench is not saying, “deploy this model.” It is saying, “measure this failure mode before your agent becomes an expensive intern with excellent vocabulary.”
Better agents will need clocks, not just brains
The agent economy has spent plenty of energy celebrating tool use. TPS-Bench points to the next, less glamorous bottleneck: tool timing.
In real workflows, intelligence is not only the ability to know what should be done. It is the ability to know what can be done now, what must wait, what can run in parallel, and what becomes invalid if rushed. This is the difference between a model that performs tasks and an agent that manages work.
That distinction will matter more as AI systems move from demos into operations. The impressive agent will not be the one that calls the most tools or produces the longest reasoning trace. It will be the one that finishes the right subtasks, in the right order, at the right cost, without turning every request into a committee meeting.
The future of agentic AI may still depend on bigger models. But TPS-Bench suggests something more prosaic and more useful: the next serious improvement may come from teaching agents how to read the room, watch the clock, and stop parallelising dependencies just because parallel sounds modern.
Cognaptus: Automate the Present, Incubate the Future.
[1] Hanwen Xu, Xuyao Huang, Yuzhe Liu, Kai Yu, and Zhijie Deng, “TPS-Bench: Evaluating AI Agents’ Tool Planning & Scheduling Abilities in Compounding Tasks,” arXiv:2511.01527, 2025, https://arxiv.org/abs/2511.01527.